Kernel - mailweb.openeuler.org

[PATCH openEuler-1.0-LTS] ipv4: Fix device used for dst_alloc with local routes
by Ziyang Xuan 23 Dec '23

From: David Ahern <dsahern(a)kernel.org>

mainline inclusion
from mainline-v5.13-rc7
commit b87b04f5019e821c8c6c7761f258402e43500a1f
category: bugfix
bugzilla: https://gitee.com/openeuler/kernel/issues/I8KNM7
CVE: NA

Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?…

--------------------------------

Oliver reported a use case where deleting a VRF device can hang
waiting for the refcnt to drop to 0. The root cause is that the dst
is allocated against the VRF device but cached on the loopback
device.

The use case (added to the selftests) has an implicit VRF crossing
due to the ordering of the FIB rules (lookup local is before the
l3mdev rule, but the problem occurs even if the FIB rules are
re-ordered with local after l3mdev because the VRF table does not
have a default route to terminate the lookup). The end result is
that the FIB lookup returns the loopback device as the nexthop, but
the ingress device is in a VRF. The mismatch causes the dst to be
allocated against the VRF device but then cached on the loopback.

The fix is to bring over the trick used for IPv6 (see
ip6_rt_get_dev_rcu): pick the dst alloc device based on the fib
lookup result, but with checks that the result has a nexthop device
(e.g., not an unreachable or prohibit entry).

Fixes: f5a0aab84b74 ("net: ipv4: dst for local input routes should use l3mdev if relevant")
Reported-by: Oliver Herms <oliver.peter.herms(a)gmail.com>
Signed-off-by: David Ahern <dsahern(a)kernel.org>
Signed-off-by: David S. Miller <davem(a)davemloft.net>

Conflicts:
	net/ipv4/route.c

Signed-off-by: Ziyang Xuan <william.xuanziyang(a)huawei.com>
---
 net/ipv4/route.c                         | 15 +++++++++++++-
 tools/testing/selftests/net/fib_tests.sh | 25 ++++++++++++++++++++++++
 2 files changed, 39 insertions(+), 1 deletion(-)

diff --git a/net/ipv4/route.c b/net/ipv4/route.c
index 86096e2e43b0..e2408efb6ddf 100644
--- a/net/ipv4/route.c
+++ b/net/ipv4/route.c
@@ -1962,6 +1962,19 @@ static int ip_mkroute_input(struct sk_buff *skb,
 	return __mkroute_input(skb, res, in_dev, daddr, saddr, tos);
 }

+/* get device for dst_alloc with local routes */
+static struct net_device *ip_rt_get_dev(struct net *net,
+					const struct fib_result *res)
+{
+	struct fib_nh_common *nhc = res->fi ? res->nhc : NULL;
+	struct net_device *dev = NULL;
+
+	if (nhc)
+		dev = l3mdev_master_dev_rcu(nhc->nhc_dev);
+
+	return dev ? : net->loopback_dev;
+}
+
 /*
  *	NOTE. We drop all the packets that has local source
  *	addresses, because every properly looped back packet
@@ -2113,7 +2126,7 @@ out:	return err;
 		}
 	}

-	rth = rt_dst_alloc(l3mdev_master_dev_rcu(dev) ? : net->loopback_dev,
+	rth = rt_dst_alloc(ip_rt_get_dev(net, res),
 			   flags | RTCF_LOCAL, res->type,
 			   IN_DEV_CONF_GET(in_dev, NOPOLICY), false, do_cache);
 	if (!rth)
diff --git a/tools/testing/selftests/net/fib_tests.sh b/tools/testing/selftests/net/fib_tests.sh
index 7d1a7c0dc56a..c9800635eea6 100755
--- a/tools/testing/selftests/net/fib_tests.sh
+++ b/tools/testing/selftests/net/fib_tests.sh
@@ -1221,12 +1221,37 @@ ipv4_rt_replace()
 	ipv4_rt_replace_mpath
 }

+# checks that cached input route on VRF port is deleted
+# when VRF is deleted
+ipv4_local_rt_cache()
+{
+	run_cmd "ip addr add 10.0.0.1/32 dev lo"
+	run_cmd "ip netns add test-ns"
+	run_cmd "ip link add veth-outside type veth peer name veth-inside"
+	run_cmd "ip link add vrf-100 type vrf table 1100"
+	run_cmd "ip link set veth-outside master vrf-100"
+	run_cmd "ip link set veth-inside netns test-ns"
+	run_cmd "ip link set veth-outside up"
+	run_cmd "ip link set vrf-100 up"
+	run_cmd "ip route add 10.1.1.1/32 dev veth-outside table 1100"
+	run_cmd "ip netns exec test-ns ip link set veth-inside up"
+	run_cmd "ip netns exec test-ns ip addr add 10.1.1.1/32 dev veth-inside"
+	run_cmd "ip netns exec test-ns ip route add 10.0.0.1/32 dev veth-inside"
+	run_cmd "ip netns exec test-ns ip route add default via 10.0.0.1"
+	run_cmd "ip netns exec test-ns ping 10.0.0.1 -c 1 -i 1"
+	run_cmd "ip link delete vrf-100"
+
+	# if we do not hang test is a success
+	log_test $? 0 "Cached route removed from VRF port device"
+}
+
 ipv4_route_test()
 {
 	route_setup

 	ipv4_rt_add
 	ipv4_rt_replace
+	ipv4_local_rt_cache

 	route_cleanup
 }
--
2.25.1
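
The device selection described in this patch can be illustrated outside the kernel. The following is a minimal user-space sketch, not the kernel code: `struct dev_model`, `struct fib_result_model`, `old_pick_dev()` and `new_pick_dev()` are hypothetical names, and the `master` pointer only approximates what `l3mdev_master_dev_rcu()` returns for a device enslaved to a VRF.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for struct net_device: only the L3 master link. */
struct dev_model {
	const char *name;
	struct dev_model *master;	/* VRF this device is enslaved to, or NULL */
};

/* Hypothetical stand-in for the FIB lookup result. */
struct fib_result_model {
	struct dev_model *nh_dev;	/* NULL models unreachable/prohibit entries */
};

/* Behaviour before the patch: derive the device from the ingress device. */
struct dev_model *old_pick_dev(struct dev_model *ingress,
			       struct dev_model *loopback)
{
	return ingress->master ? ingress->master : loopback;
}

/* Behaviour after the patch: derive the device from the nexthop device. */
struct dev_model *new_pick_dev(const struct fib_result_model *res,
			       struct dev_model *loopback)
{
	struct dev_model *dev = NULL;

	if (res->nh_dev)
		dev = res->nh_dev->master;

	return dev ? dev : loopback;
}
```

With an ingress port inside a VRF and a lookup result whose nexthop is the loopback device, the old helper picks the VRF while the new one picks the loopback device, which is the device the commit message says the dst is cached on.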

[PATCH OLK-5.10 0/2] net: Backport bugfixes from mainline
by Ziyang Xuan 23 Dec '23

Backport bugfixes from mainline.

Muchun Song (1):
  tcp: use alloc_large_system_hash() to allocate table_perturb

Paolo Abeni (1):
  udp: skip L4 aggregation for UDP tunnel packets

 net/ipv4/inet_hashtables.c | 10 ++++++----
 net/ipv4/udp_offload.c     | 17 ++++++++++-------
 2 files changed, 16 insertions(+), 11 deletions(-)

--
2.25.1

[PATCH OLK-5.10 0/2] fs: mitigatin cacheline false sharing in struct file
by Xie XiuQi 23 Dec '23

Cacheline false sharing occurs in struct file in the syscall test
case of UnixBench. On a system with a 128B cacheline size, we force
64B alignment to get better performance. With the alignment, up to
~192 bytes can be wasted per file struct in the worst case.

If unsure, say N.

Xie XiuQi (2):
  fs: mitigatin cacheline false sharing in struct file
  fs: enable CONFIG_FILE_MITIGATION_FALSE_SHARING by default on arm64

 arch/arm64/configs/openeuler_defconfig |  1 +
 fs/Kconfig                             | 13 +++++++++
 fs/file_table.c                        | 37 +++++++++++++++++++++++---
 3 files changed, 48 insertions(+), 3 deletions(-)

--
2.20.1
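
The trade-off between alignment and wasted bytes can be shown with a small C11 sketch. This is only an illustration under assumptions: `struct file_model` and its members are hypothetical, and the cover letter does not say which members of struct file the series actually separates.

```c
#include <assert.h>
#include <stdalign.h>
#include <stddef.h>

#define MODEL_ALIGN 64	/* alignment named in the cover letter */

/* Hypothetical layout; the real struct file has different members. */
struct file_model {
	/* read-mostly members */
	const void *ops;
	unsigned int flags;

	/* frequently written member, pushed onto its own 64-byte block */
	alignas(MODEL_ALIGN) long refcount;
};

/* Bytes of padding that the alignment request adds to one object. */
size_t file_model_padding(void)
{
	return sizeof(struct file_model)
		- (sizeof(const void *) + sizeof(unsigned int) + sizeof(long));
}
```

The padding returned by `file_model_padding()` is the per-object cost of keeping the written member away from the read-mostly ones, which is the kind of waste the cover letter warns about.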

[PATCH OLK-6.6 0/4] Fix smmu pgtable prfetch and add support for ras features
by Zhang Zekun 23 Dec '23

Fix the pgtable prefetch problem, and add some RAS features that are
used in Ascend scenarios.

Zhang Zekun (4):
  iommu/arm-smmu-v3: Add a SYNC command to avoid broken page table prefetch
  mm: memory-failure: Directly return the task for specific use
  ACPI: APEI: Don't call notifier again in ts senario
  mm/hwpoison: Add to check is a page is hwpoisoned

 arch/arm64/Kconfig                          | 13 +++++++++
 arch/arm64/configs/openeuler_defconfig      |  1 +
 arch/arm64/kernel/cpu_errata.c              | 14 +++++++++
 arch/arm64/tools/cpucaps                    |  1 +
 drivers/acpi/apei/ghes.c                    |  3 ++
 drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c | 20 +++++++++++++
 include/linux/mm.h                          |  5 ++++
 mm/Kconfig                                  |  9 ++++++
 mm/memory-failure.c                         | 32 +++++++++++++++++++++
 9 files changed, 98 insertions(+)

--
2.17.1

[PATCH OLK-6.6 0/4] Add L3T and LPDDRC PMU
by Chen Jun 23 Dec '23

From: c00424029 <c00424029(a)huawei.com>

Chen Jun (4):
  perf: hisi: Make irq shared
  perf: hisi: Fix read sccl_id and ccl_id error in some platform
  perf: hisi: Add support for HiSilicon SoC L3T PMU
  perf: hisi: Add support for HiSilicon SoC LPDDRC PMU

 arch/arm64/configs/openeuler_defconfig       |   2 +
 drivers/perf/hisilicon/Kconfig               |  18 +
 drivers/perf/hisilicon/Makefile              |   2 +
 drivers/perf/hisilicon/hisi_uncore_l3t_pmu.c | 403 +++++++++++++++++
 .../perf/hisilicon/hisi_uncore_lpddrc_pmu.c  | 408 ++++++++++++++++++
 drivers/perf/hisilicon/hisi_uncore_pmu.c     |  21 +-
 6 files changed, 847 insertions(+), 7 deletions(-)
 create mode 100644 drivers/perf/hisilicon/hisi_uncore_l3t_pmu.c
 create mode 100644 drivers/perf/hisilicon/hisi_uncore_lpddrc_pmu.c

--
2.17.1

[PATCH OLK-6.6 0/3] files cgroup:merged patch
by chenridong 23 Dec '23

Binder Makin (1):
  cgroups: Resource controller for open files

Lu Jialin (1):
  enable CONFIG_CGROUP_FILES in openeuler_defconfig for x86 and arm64

Yang Yingliang (1):
  cgroup/files: support boot parameter to control if disable files cgroup

 .../admin-guide/kernel-parameters.txt  |   7 +-
 arch/arm64/configs/openeuler_defconfig |   1 +
 arch/x86/configs/openeuler_defconfig   |   1 +
 fs/Makefile                            |   1 +
 fs/file.c                              |  68 +++-
 fs/filescontrol.c                      | 312 ++++++++++++++++++
 include/linux/cgroup-defs.h            |   8 +-
 include/linux/cgroup_subsys.h          |   4 +
 include/linux/fdtable.h                |   1 +
 include/linux/filescontrol.h           |  40 +++
 init/Kconfig                           |  10 +
 11 files changed, 445 insertions(+), 8 deletions(-)
 create mode 100644 fs/filescontrol.c
 create mode 100644 include/linux/filescontrol.h

--
2.34.1

[PATCH OLK-6.6 0/4] Fix smmu pgtable prfetch and add support
by Zhang Zekun 23 Dec '23

Fix the pgtable prefetch problem, and add some RAS features that are
used in Ascend scenarios.

Zhang Zekun (4):
  iommu/arm-smmu-v3: Add a SYNC command to avoid broken page table prefetch
  mm: memory-failure: Directly return the task for specific use
  ACPI: APEI: Don't call notifier again in ts senario
  mm/hwpoison: Add to check is a page is hwpoisoned

 arch/arm64/Kconfig                          | 13 +++++++++
 arch/arm64/configs/openeuler_defconfig      |  1 +
 arch/arm64/kernel/cpu_errata.c              | 14 +++++++++
 arch/arm64/tools/cpucaps                    |  1 +
 drivers/acpi/apei/ghes.c                    |  3 ++
 drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c | 20 +++++++++++++
 include/linux/mm.h                          |  5 ++++
 mm/Kconfig                                  |  9 ++++++
 mm/memory-failure.c                         | 32 +++++++++++++++++++++
 9 files changed, 98 insertions(+)

--
2.17.1

[PATCH OLK-6.6 00/33] Support kernel livepatching
by Zheng Yejian 23 Dec '23

Zheng Yejian (33):
  livepatch/core: Allow implementation without ftrace
  livepatch/core: Reuse common codes in the solution without ftrace
  Revert "x86/insn: Make insn_complete() static"
  livepatch/x86: Support livepatch without ftrace
  livepatch/core: Disable support for replacing
  livepatch/core: Restrict livepatch patched/unpatched when plant kprobe
  livepatch/core: Support load and unload hooks
  livepatch: samples: Adapt livepatch-sample for solution without ftrace
  livepatch/core: Support jump_label
  livepatch: Fix crash when access the global variable in hook
  livepatch: Fix patching functions which have static_call
  livepatch/core: Avoid conflict with static {call,key}
  livepatch/arm64: Support livepatch without ftrace
  livepatch/core: Revert module_enable_ro and module_disable_ro
  livepatch: Enable livepatch configs in openeuler_defconfig
  arm/module: Use plt section indices for relocations
  livepatch/core: Add support for arm for klp relocation
  livepatch/arm: Support livepatch without ftrace
  livepatch/ppc32: Support livepatch without ftrace
  livepatch: Use breakpoint exception to optimize enabling livepatch
  livepatch/x86: Support breakpoint exception optimization
  livepatch: Add arch_klp_init
  livepatch/arm64: Support breakpoint exception optimization
  livepatch/arm: Support breakpoint exception optimization
  livepatch: Add klp_module_delete_safety_check
  livepatch/x86: Add arch_klp_module_check_calltrace
  livepatch/arm64: Add arch_klp_module_check_calltrace
  livepatch/arm: Add arch_klp_module_check_calltrace
  livepatch: Bypass dead thread when check calltrace
  livepatch/ppc64: Implement livepatch without ftrace for ppc64be
  livepatch/ppc64: Sample testcase fix ppc64
  livepatch/powerpc: Support breakpoint exception optimization
  livepatch/powerpc: Add arch_klp_module_check_calltrace

 Documentation/filesystems/proc.rst         |    2 +-
 arch/arm/Kconfig                           |    3 +
 arch/arm/include/asm/livepatch.h           |   61 +
 arch/arm/include/asm/module.h              |    4 +-
 arch/arm/kernel/Makefile                   |    1 +
 arch/arm/kernel/ftrace.c                   |    4 +-
 arch/arm/kernel/livepatch.c                |  322 ++++
 arch/arm/kernel/module-plts.c              |   22 +-
 arch/arm/kernel/module.c                   |    4 +-
 arch/arm64/Kconfig                         |    3 +
 arch/arm64/configs/openeuler_defconfig     |   12 +
 arch/arm64/include/asm/brk-imm.h           |    1 +
 arch/arm64/include/asm/debug-monitors.h    |    2 +
 arch/arm64/include/asm/livepatch.h         |   56 +
 arch/arm64/kernel/Makefile                 |    1 +
 arch/arm64/kernel/livepatch.c              |  290 ++++
 arch/powerpc/Kconfig                       |    5 +-
 arch/powerpc/include/asm/livepatch.h       |   80 +
 arch/powerpc/include/asm/module.h          |    3 +
 arch/powerpc/kernel/Makefile               |    2 +
 arch/powerpc/kernel/livepatch.c            |  357 +++++
 arch/powerpc/kernel/livepatch_32.c         |  124 ++
 arch/powerpc/kernel/livepatch_64.c         |  264 ++++
 arch/powerpc/kernel/livepatch_tramp.S      |  126 ++
 arch/powerpc/kernel/module_64.c            |  109 ++
 arch/powerpc/kernel/traps.c                |    8 +
 arch/s390/Kconfig                          |    2 +-
 arch/s390/configs/debug_defconfig          |    1 +
 arch/s390/configs/defconfig                |    1 +
 arch/x86/Kconfig                           |    3 +-
 arch/x86/configs/openeuler_defconfig       |   12 +-
 arch/x86/include/asm/insn.h                |    7 +
 arch/x86/include/asm/livepatch.h           |   43 +
 arch/x86/kernel/Makefile                   |    1 +
 arch/x86/kernel/livepatch.c                |  376 +++++
 arch/x86/kernel/module.c                   |    2 +-
 arch/x86/kernel/traps.c                    |   10 +
 arch/x86/lib/insn.c                        |    7 -
 include/linux/jump_label.h                 |   10 +
 include/linux/livepatch.h                  |  129 +-
 include/linux/livepatch_sched.h            |    6 +-
 include/linux/module.h                     |   33 +
 include/linux/moduleloader.h               |    4 +-
 include/linux/static_call.h                |    6 +
 kernel/jump_label.c                        |   22 +
 kernel/livepatch/Kconfig                   |   78 +-
 kernel/livepatch/Makefile                  |    3 +-
 kernel/livepatch/core.c                    | 1563 +++++++++++++++++++-
 kernel/livepatch/core.h                    |   16 +
 kernel/module/main.c                       |   16 +-
 kernel/module/strict_rwx.c                 |   17 +
 kernel/static_call_inline.c                |   20 +
 lib/Kconfig.debug                          |    2 +-
 samples/livepatch/Makefile                 |    2 +
 samples/livepatch/livepatch-sample.c       |   47 +
 tools/arch/x86/include/asm/insn.h          |    7 +
 tools/arch/x86/lib/insn.c                  |    7 -
 tools/testing/selftests/bpf/config.aarch64 |    1 +
 tools/testing/selftests/bpf/config.s390x   |    1 +
 tools/testing/selftests/livepatch/README   |    1 +
 tools/testing/selftests/livepatch/config   |    1 +
 61 files changed, 4266 insertions(+), 57 deletions(-)
 create mode 100644 arch/arm/include/asm/livepatch.h
 create mode 100644 arch/arm/kernel/livepatch.c
 create mode 100644 arch/arm64/include/asm/livepatch.h
 create mode 100644 arch/arm64/kernel/livepatch.c
 create mode 100644 arch/powerpc/kernel/livepatch.c
 create mode 100644 arch/powerpc/kernel/livepatch_32.c
 create mode 100644 arch/powerpc/kernel/livepatch_64.c
 create mode 100644 arch/powerpc/kernel/livepatch_tramp.S
 create mode 100644 arch/x86/include/asm/livepatch.h
 create mode 100644 arch/x86/kernel/livepatch.c

--
2.25.1

[PATCH OLK-6.6 00/10] files cgroup all patch
by chenridong 23 Dec '23

Binder Makin (1):
  cgroups: Resource controller for open files

Hou Tao (1):
  cgroup/files: use task_get_css() to get a valid css during dup_fd()

Lu Jialin (2):
  fs: fix files.usage bug when move tasks
  fs/filescontrol.c: fix warning:large integer implicitly truncated to unsigned type

Wenkai Lin (1):
  iommu/arm-smmu-v3: disable stall for quiet_cd

Yang Yingliang (1):
  cgroup/files: support boot parameter to control if disable files cgroup

Yu Kuai (1):
  fs/filescontrol: add a switch to enable / disable accounting of open fds

Zhang Xiaoxu (2):
  files_cgroup: fix error pointer when kvm_vm_worker_thread
  files_cgroup: Fix soft lockup when refcnt overflow.

zhangyi (F) (1):
  filescontrol: silence suspicious RCU warning

--
2.34.1

[PATCH openEuler-1.0-LTS] net: check vlan filter feature in vlan_vids_add_by_dev() and vlan_vids_del_by_dev()
by Liu Jian 22 Dec '23

mainline inclusion
from mainline-v01a564bab4876007ce35f312e16797dfe40e4823 remotes/mainline/master
commit 01a564bab4876007ce35f312e16797dfe40e4823
category: bugfix
bugzilla: https://gitee.com/openeuler/kernel/issues/I8KNM7

Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?…

---------------------------

I got the below warning trace:

WARNING: CPU: 4 PID: 4056 at net/core/dev.c:11066 unregister_netdevice_many_notify
CPU: 4 PID: 4056 Comm: ip Not tainted 6.7.0-rc4+ #15
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.15.0-1 04/01/2014
RIP: 0010:unregister_netdevice_many_notify+0x9a4/0x9b0
Call Trace:
 rtnl_dellink
 rtnetlink_rcv_msg
 netlink_rcv_skb
 netlink_unicast
 netlink_sendmsg
 __sock_sendmsg
 ____sys_sendmsg
 ___sys_sendmsg
 __sys_sendmsg
 do_syscall_64
 entry_SYSCALL_64_after_hwframe

It can be reproduced via:

    ip netns add ns1
    ip netns exec ns1 ip link add bond0 type bond mode 0
    ip netns exec ns1 ip link add bond_slave_1 type veth peer veth2
    ip netns exec ns1 ip link set bond_slave_1 master bond0
[1] ip netns exec ns1 ethtool -K bond0 rx-vlan-filter off
[2] ip netns exec ns1 ip link add link bond_slave_1 name bond_slave_1.0 type vlan id 0
[3] ip netns exec ns1 ip link add link bond0 name bond0.0 type vlan id 0
[4] ip netns exec ns1 ip link set bond_slave_1 nomaster
[5] ip netns exec ns1 ip link del veth2
    ip netns del ns1

This is all caused by command [1] turning off the rx-vlan-filter
function of bond0. The reason is the same as commit 01f4fd270870
("bonding: Fix incorrect deletion of ETH_P_8021AD protocol vid from
slaves"). Commands [2] [3] add the same vid to slave and master
respectively, causing command [4] to empty slave->vlan_info. The
following command [5] triggers this problem.

To fix this problem, we should add VLAN_FILTER feature checks in
vlan_vids_add_by_dev() and vlan_vids_del_by_dev() to prevent incorrect
addition or deletion of vlan_vid information.

Fixes: 348a1443cc43 ("vlan: introduce functions to do mass addition/deletion of vids by another device")
Signed-off-by: Liu Jian <liujian56(a)huawei.com>
Signed-off-by: Paolo Abeni <pabeni(a)redhat.com>
(cherry picked from commit 01a564bab4876007ce35f312e16797dfe40e4823)
Signed-off-by: Liu Jian <liujian56(a)huawei.com>
---
 net/8021q/vlan_core.c | 9 ++++++++-
 1 file changed, 8 insertions(+), 1 deletion(-)

diff --git a/net/8021q/vlan_core.c b/net/8021q/vlan_core.c
index 4f60e86f4b8d3..e92c914316cbd 100644
--- a/net/8021q/vlan_core.c
+++ b/net/8021q/vlan_core.c
@@ -380,6 +380,8 @@ int vlan_vids_add_by_dev(struct net_device *dev,
 		return 0;

 	list_for_each_entry(vid_info, &vlan_info->vid_list, list) {
+		if (!vlan_hw_filter_capable(by_dev, vid_info->proto))
+			continue;
 		err = vlan_vid_add(dev, vid_info->proto, vid_info->vid);
 		if (err)
 			goto unwind;
@@ -390,6 +392,8 @@ int vlan_vids_add_by_dev(struct net_device *dev,
 	list_for_each_entry_continue_reverse(vid_info,
 					     &vlan_info->vid_list,
 					     list) {
+		if (!vlan_hw_filter_capable(by_dev, vid_info->proto))
+			continue;
 		vlan_vid_del(dev, vid_info->proto, vid_info->vid);
 	}

@@ -409,8 +413,11 @@ void vlan_vids_del_by_dev(struct net_device *dev,
 	if (!vlan_info)
 		return;

-	list_for_each_entry(vid_info, &vlan_info->vid_list, list)
+	list_for_each_entry(vid_info, &vlan_info->vid_list, list) {
+		if (!vlan_hw_filter_capable(by_dev, vid_info->proto))
+			continue;
 		vlan_vid_del(dev, vid_info->proto, vid_info->vid);
+	}
 }
 EXPORT_SYMBOL(vlan_vids_del_by_dev);

--
2.34.1
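
The effect of the added check can be sketched in user space. This is a simplified model under assumptions, not the kernel implementation: `struct vlan_dev_model`, `struct vid_model`, `hw_filter_capable()` and `vids_to_propagate()` are hypothetical names, and the two booleans stand in for the device feature bits that `vlan_hw_filter_capable()` consults.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define PROTO_8021Q  0x8100
#define PROTO_8021AD 0x88A8

/* Hypothetical stand-in; the kernel keeps vids in a list under vlan_info. */
struct vid_model {
	unsigned short proto;
	unsigned short vid;
};

/* Hypothetical stand-in for the VLAN filter feature bits of a device. */
struct vlan_dev_model {
	bool ctag_filter;	/* models rx-vlan-filter */
	bool stag_filter;
};

/* Simplified model of vlan_hw_filter_capable(). */
bool hw_filter_capable(const struct vlan_dev_model *dev, unsigned short proto)
{
	if (proto == PROTO_8021Q)
		return dev->ctag_filter;
	if (proto == PROTO_8021AD)
		return dev->stag_filter;
	return false;
}

/* Count the vids of by_dev that would be added to (or deleted from) dev. */
size_t vids_to_propagate(const struct vlan_dev_model *by_dev,
			 const struct vid_model *vids, size_t n)
{
	size_t i, count = 0;

	for (i = 0; i < n; i++) {
		if (!hw_filter_capable(by_dev, vids[i].proto))
			continue;	/* the check added by the patch */
		count++;
	}
	return count;
}
```

In this model, a master with the filter switched off (command [1] above) propagates no vids in either direction, so the add and delete paths stay symmetric.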

[PATCH OLK-5.10] mm/memcontrol: mitigate cacheline false sharing in struct mem_cgroup
by Zeng Heng 22 Dec '23

hulk inclusion
category: performance
bugzilla: https://gitee.com/openeuler/kernel/issues/I8QP0D

--------------------------------

Signed-off-by: Zeng Heng <zengheng4(a)huawei.com>
---
 mm/memcontrol.c | 17 +++++++++++++----
 1 file changed, 13 insertions(+), 4 deletions(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index c9ffc793e42b..42ed5778062a 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -6261,6 +6261,11 @@ static void free_mem_cgroup_per_node_info(struct mem_cgroup *memcg, int node)
 	kfree(pn);
 }

+struct mem_cgroup_padding {
+	u64 padding[6];
+	struct mem_cgroup memcg;
+} __attribute__((__packed__));
+
 static void __mem_cgroup_free(struct mem_cgroup *memcg)
 {
 	int node;
@@ -6269,7 +6274,8 @@ static void __mem_cgroup_free(struct mem_cgroup *memcg)
 		free_mem_cgroup_per_node_info(memcg, node);
 	free_percpu(memcg->vmstats_percpu);
 	memcg_free_swap_device(memcg);
-	kfree(memcg);
+
+	kfree(container_of(memcg, struct mem_cgroup_padding, memcg));
 }

 static void mem_cgroup_free(struct mem_cgroup *memcg)
@@ -6280,19 +6286,22 @@ static void mem_cgroup_free(struct mem_cgroup *memcg)

 static struct mem_cgroup *mem_cgroup_alloc(void)
 {
+	struct mem_cgroup_padding *memcg_padding;
 	struct mem_cgroup *memcg;
 	unsigned int size;
 	int node;
 	int __maybe_unused i;
 	long error = -ENOMEM;

-	size = sizeof(struct mem_cgroup);
+	size = sizeof(struct mem_cgroup_padding);
 	size += nr_node_ids * sizeof(struct mem_cgroup_per_node *);

-	memcg = kzalloc(size, GFP_KERNEL);
-	if (!memcg)
+	memcg_padding = kzalloc(size, GFP_KERNEL);
+	if (!memcg_padding)
 		return ERR_PTR(error);

+	memcg = &memcg_padding->memcg;
+
 	if (memcg_alloc_swap_device(memcg))
 		goto fail;

--
2.25.1
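
The allocation pattern in this patch (allocate a padded wrapper, hand out a pointer to the embedded object, recover the wrapper with container_of() before freeing) can be sketched in user space. This is an illustration under assumptions: `struct payload`, `struct payload_padding`, `payload_alloc()` and `payload_free()` are hypothetical names, `calloc()`/`free()` replace `kzalloc()`/`kfree()`, and the wrapper is left unpacked here.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Same idea as the kernel's container_of(), without its type checking. */
#define container_of(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))

/* Hypothetical payload standing in for struct mem_cgroup. */
struct payload {
	long counter;
};

/* Wrapper that places 48 bytes of padding in front of the payload. */
struct payload_padding {
	uint64_t padding[6];
	struct payload obj;
};

/* Allocate the wrapper and hand out a pointer to the embedded payload. */
struct payload *payload_alloc(void)
{
	struct payload_padding *p = calloc(1, sizeof(*p));

	return p ? &p->obj : NULL;
}

/* Recover the wrapper address; passing obj itself to free() is invalid. */
void payload_free(struct payload *obj)
{
	if (obj)
		free(container_of(obj, struct payload_padding, obj));
}
```

The point of the design is that every caller keeps using the inner pointer unchanged, while only the allocation and free paths know about the 48-byte shift.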

[PATCH OLK-6.6 0/2] block: show info about opening a mounted device for write
by Li Lingfeng 22 Dec '23

Both writing to a mounted device and writing to part0 while one of its
partitions is mounted will now produce a message.

Li Lingfeng (2):
  block: Add config to show info about opening a mounted device for write
  block: detect confilt of write and mount between partitions and part0

 block/Kconfig |  7 +++++
 block/bdev.c  | 78 +++++++++++++++++++++++++++++++++++++++++++++++++--
 2 files changed, 83 insertions(+), 2 deletions(-)

--
2.31.1

[PATCH OLK-5.10] iommu/iova: Change the location of depot_size to avoid cache fake share
by Zhang Zekun 22 Dec '23

hulk inclusion
category: bugfix
bugzilla: https://gitee.com/openeuler/kernel/issues/I8QOK2
CVE: NA

------------------------------------

Cache false sharing can degrade performance. Change the location of
depot_size to avoid this problem.

Signed-off-by: Zhang Zekun <zhangzekun11(a)huawei.com>
---
 include/linux/iova.h | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/include/linux/iova.h b/include/linux/iova.h
index 25c447124638..8190fb2bf496 100644
--- a/include/linux/iova.h
+++ b/include/linux/iova.h
@@ -29,11 +29,11 @@ struct iova_cpu_rcache;

 struct iova_rcache {
 	spinlock_t lock;
-	unsigned int depot_size;
 	struct iova_magazine *depot;
 	struct iova_cpu_rcache __percpu *cpu_rcaches;
 	struct iova_domain *iovad;
 	struct delayed_work work;
+	unsigned int depot_size;
 };

 struct iova_domain;
--
2.17.1
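
How moving one member changes which cache line it shares can be checked with offsetof(). The sketch below is a model under assumptions: the member sizes are hypothetical (an `int` for the lock, an 88-byte array for `struct delayed_work`), a 64-byte line is assumed, and the patch does not state which members were contending, so the model only compares `depot_size` against `lock`.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MODEL_LINE 64	/* assumed cache line size */

/* Layout before the patch, with hypothetical member sizes. */
struct rcache_before {
	int lock;
	unsigned int depot_size;
	void *depot;
	void *cpu_rcaches;
	void *iovad;
	char work[88];	/* stand-in for struct delayed_work */
};

/* Layout after the patch: depot_size moved to the end. */
struct rcache_after {
	int lock;
	void *depot;
	void *cpu_rcaches;
	void *iovad;
	char work[88];
	unsigned int depot_size;
};

/* True when two member offsets fall into the same cache line,
 * assuming the object itself starts on a line boundary. */
bool same_line(size_t a, size_t b)
{
	return a / MODEL_LINE == b / MODEL_LINE;
}
```

In this model, `depot_size` shares a line with `lock` before the move and no longer does afterwards; on a real kernel build the actual offsets would have to be checked with a tool such as pahole.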

[PATCH OLK-6.6 0/4] olk5.10 bugs related to cgroup
by chenridong 22 Dec '23

Lu Jialin (1):
  cgroup: Return ERSCH when add Z process into task

Zefan Li (1):
  cgroup: check if cgroup root is alive in cgroupstats_show()

chenridong (2):
  cgroup: wait for cgroup destruction to complete when umount
  cgroup: disable kernel memory accounting for all memory cgroups by default

 .../admin-guide/cgroup-v1/memory.rst  |  6 ++---
 .../admin-guide/kernel-parameters.txt |  1 +
 include/linux/cgroup-defs.h           |  3 +++
 kernel/cgroup/cgroup-v1.c             | 11 +++++----
 kernel/cgroup/cgroup.c                | 23 +++++++++++++++++--
 mm/memcontrol.c                       |  4 +++-
 6 files changed, 38 insertions(+), 10 deletions(-)

--
2.34.1

[PATCH OLK-6.6 0/4] Fix smmu pgtable prfetch and add support
by Zhang Zekun 22 Dec '23

Fix the pgtable prefetch problem, and add some RAS features that are
used in Ascend scenarios.

Zhang Zekun (4):
  iommu/arm-smmu-v3: Add a SYNC command to avoid broken page table prefetch
  mm: memory-failure: Directly return the task for specific use
  ACPI: APEI: Don't call notifier again in ts senario
  mm/hwpoison: Add to check is a page is hwpoisoned

 arch/arm64/Kconfig                          | 13 +++++++++
 arch/arm64/configs/openeuler_defconfig      |  1 +
 arch/arm64/kernel/cpu_errata.c              | 14 +++++++++
 arch/arm64/tools/cpucaps                    |  1 +
 drivers/acpi/apei/ghes.c                    |  3 ++
 drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c | 20 +++++++++++++
 include/linux/mm.h                          |  5 ++++
 mm/Kconfig                                  |  9 ++++++
 mm/memory-failure.c                         | 32 +++++++++++++++++++++
 9 files changed, 98 insertions(+)

--
2.17.1

[OLK-5.10 1/3] RDMA/hns: Fix Use-After-Free of rsv_qp
by Chengchang Tang 22 Dec '23

From: wenglianfa <wenglianfa(a)huawei.com>

driver inclusion
category: bugfix
bugzilla: https://gitee.com/openeuler/kernel/issues/I8N7BG

----------------------------------------------------------------------

On HIP08, the reserved loopback QP is used to release MRs before the
MPT is destroyed. Between free_mr_exit() and
hns_roce_unregister_device(), rsv_qp has already been released and set
to NULL, but the ib_device is not yet unregistered. During this
period, user mode can still use the ib_device to execute dereg_mr().
As a result, rsv_qp is accessed again and a NULL pointer is reported.

To fix the use-after-free of rsv_qp, execute free_mr_exit() after
hns_roce_unregister_device().

Fixes: 6f5f556d3795 ("RDMA/hns: Use the reserved loopback QPs to free MR before destroying MPT")
Signed-off-by: wenglianfa <wenglianfa(a)huawei.com>
Signed-off-by: Juan Zhou <zhoujuan51(a)h-partners.com>
---
 drivers/infiniband/hw/hns/hns_roce_hw_v2.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/drivers/infiniband/hw/hns/hns_roce_hw_v2.c b/drivers/infiniband/hw/hns/hns_roce_hw_v2.c
index 2b474c22cafa..34d5d1476db2 100644
--- a/drivers/infiniband/hw/hns/hns_roce_hw_v2.c
+++ b/drivers/infiniband/hw/hns/hns_roce_hw_v2.c
@@ -3339,6 +3339,9 @@ static int hns_roce_v2_init(struct hns_roce_dev *hr_dev)

 static void hns_roce_v2_exit(struct hns_roce_dev *hr_dev)
 {
+	if (hr_dev->pci_dev->revision == PCI_REVISION_ID_HIP08)
+		free_mr_exit(hr_dev);
+
 	hns_roce_function_clear(hr_dev);

 	if (!hr_dev->is_vf)
@@ -7699,9 +7702,6 @@ static void __hns_roce_hw_v2_uninit_instance(struct hnae3_handle *handle,
 	hr_dev->state = HNS_ROCE_DEVICE_STATE_UNINIT;
 	hns_roce_handle_device_err(hr_dev);

-	if (hr_dev->pci_dev->revision == PCI_REVISION_ID_HIP08)
-		free_mr_exit(hr_dev);
-
 	hns_roce_exit(hr_dev, bond_cleanup);
 	kfree(hr_dev->priv);
 	ib_dealloc_device(&hr_dev->ib_dev);
--
2.30.0

[PATCH OLK-6.6 0/1] arm64: ipi_nmi: fix compile error when CONFIG_KGDB is disabled
by Liao Chen 22 Dec '23

arch/arm64/kernel/ipi_nmi.c: In function ‘ipi_nmi_handler’:
arch/arm64/kernel/ipi_nmi.c:54:7: error: implicit declaration of function ‘kgdb_nmicallback’ [-Werror=implicit-function-declaration]
  if (!kgdb_nmicallback(cpu, get_irq_regs()))
       ^~~~~~~~~~~~~~~~

Xiongfeng Wang (1):
  arm64: ipi_nmi: fix compile error when CONFIG_KGDB is disabled

 arch/arm64/kernel/ipi_nmi.c | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

--
2.34.1
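
One common way to avoid this class of build error is to provide a stub when the optional feature is compiled out. The sketch below shows that pattern only; it is not taken from the patch, whose diff is not part of this cover letter. `MODEL_KGDB`, `debugger_nmicallback()` and `nmi_needs_backtrace()` are hypothetical names, and the "0 means handled" convention is an assumption carried over from the call site in the error message.

```c
#include <assert.h>
#include <stdbool.h>

/* MODEL_KGDB stands in for CONFIG_KGDB in this sketch. */
#ifdef MODEL_KGDB
int debugger_nmicallback(int cpu);	/* provided by the debugger code */
#else
/* Stub: report "not handled" so callers fall through to their default path. */
static inline int debugger_nmicallback(int cpu)
{
	(void)cpu;
	return 1;
}
#endif

/* Returns true when the default backtrace path must handle the NMI. */
bool nmi_needs_backtrace(int cpu)
{
	if (!debugger_nmicallback(cpu))
		return false;	/* the debugger consumed the NMI */
	return true;
}
```

With the stub in place the caller compiles unchanged in both configurations, and with the feature disabled every NMI takes the default path.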

[PATCH OLK-6.6 0/1] arm64: Add non nmi ipi backtrace support
by Liao Chen 22 Dec '23

*** BLURB HERE ***

Li Zhengyu (1):
  arm64: Add non nmi ipi backtrace support

 arch/arm64/kernel/ipi_nmi.c | 19 +++++++++++++++----
 1 file changed, 15 insertions(+), 4 deletions(-)

--
2.34.1

[PATCH OLK-6.6] tcp_comp: implement tcp compression
by Lu Wei 22 Dec '23

hulk inclusion
category: feature
bugzilla: https://gitee.com/openeuler/kernel/issues/I4PNEK
CVE: NA

-------------------------------------------------

Signed-off-by: Lu Wei <luwei32(a)huawei.com>
---
 arch/arm64/configs/openeuler_defconfig |   1 +
 arch/x86/configs/openeuler_defconfig   |   1 +
 include/linux/tcp.h                    |   6 +-
 include/net/inet_sock.h                |   3 +-
 include/net/sock.h                     |   1 +
 include/net/tcp.h                      |  40 ++
 net/ipv4/Kconfig                       |  10 +
 net/ipv4/Makefile                      |   1 +
 net/ipv4/syncookies.c                  |   2 +
 net/ipv4/sysctl_net_ipv4.c             |  42 ++
 net/ipv4/tcp.c                         |   5 +
 net/ipv4/tcp_comp.c                    | 912 +++++++++++++++++++++++++
 net/ipv4/tcp_input.c                   |  26 +
 net/ipv4/tcp_ipv4.c                    |   2 +
 net/ipv4/tcp_minisocks.c               |   3 +
 net/ipv4/tcp_output.c                  |  52 ++
 net/ipv6/syncookies.c                  |   2 +
 17 files changed, 1107 insertions(+), 2 deletions(-)
 create mode 100644 net/ipv4/tcp_comp.c

diff --git a/arch/arm64/configs/openeuler_defconfig b/arch/arm64/configs/openeuler_defconfig
index 33ba39711884..abc33d2c29a6 100644
--- a/arch/arm64/configs/openeuler_defconfig
+++ b/arch/arm64/configs/openeuler_defconfig
@@ -1222,6 +1222,7 @@ CONFIG_DEFAULT_CUBIC=y
 # CONFIG_DEFAULT_RENO is not set
 CONFIG_DEFAULT_TCP_CONG="cubic"
 CONFIG_TCP_MD5SIG=y
+CONFIG_TCP_COMP=y
 CONFIG_IPV6=y
 CONFIG_IPV6_ROUTER_PREF=y
 CONFIG_IPV6_ROUTE_INFO=y
diff --git a/arch/x86/configs/openeuler_defconfig b/arch/x86/configs/openeuler_defconfig
index 44040b835333..177a6fdcce58 100644
--- a/arch/x86/configs/openeuler_defconfig
+++ b/arch/x86/configs/openeuler_defconfig
@@ -1244,6 +1244,7 @@ CONFIG_DEFAULT_CUBIC=y
 # CONFIG_DEFAULT_RENO is not set
 CONFIG_DEFAULT_TCP_CONG="cubic"
 CONFIG_TCP_MD5SIG=y
+CONFIG_TCP_COMP=y
 CONFIG_IPV6=y
 CONFIG_IPV6_ROUTER_PREF=y
 CONFIG_IPV6_ROUTE_INFO=y
diff --git a/include/linux/tcp.h b/include/linux/tcp.h
index 3c5efeeb024f..37ae82ab7795 100644
--- a/include/linux/tcp.h
+++ b/include/linux/tcp.h
@@ -123,7 +123,8 @@ struct tcp_options_received {
 		snd_wscale : 4,	/* Window scaling received from sender	*/
 		rcv_wscale : 4;	/* Window scaling to send to receiver	*/
 	u8	saw_unknown:1,	/* Received unknown option		*/
-		unused:7;
+		comp_ok:1,	/* COMP seen on SYN packet		*/
+		unused:6;
 	u8	num_sacks;	/* Number of SACK blocks		*/
 	u16	user_mss;	/* mss requested by user in ioctl	*/
 	u16	mss_clamp;	/* Maximal mss, negotiated at connection setup */
@@ -136,6 +137,9 @@ static inline void tcp_clear_options(struct tcp_options_received *rx_opt)
 #if IS_ENABLED(CONFIG_SMC)
 	rx_opt->smc_ok = 0;
 #endif
+#if IS_ENABLED(CONFIG_TCP_COMP)
+	rx_opt->comp_ok = 0;
+#endif
 }

 /* This is the max number of SACKS that we'll generate and process. It's safe
diff --git a/include/net/inet_sock.h b/include/net/inet_sock.h
index 2de0e4d4a027..bb09ac95cc9d 100644
--- a/include/net/inet_sock.h
+++ b/include/net/inet_sock.h
@@ -87,7 +87,8 @@ struct inet_request_sock {
 				ecn_ok	   : 1,
 				acked	   : 1,
 				no_srccheck: 1,
-				smc_ok	   : 1;
+				smc_ok	   : 1,
+				comp_ok	   : 1;
 	u32			ir_mark;
 	union {
 		struct ip_options_rcu __rcu	*ireq_opt;
diff --git a/include/net/sock.h b/include/net/sock.h
index 7753354d59c0..c86845136ec5 100644
--- a/include/net/sock.h
+++ b/include/net/sock.h
@@ -959,6 +959,7 @@ enum sock_flags {
 	SOCK_XDP, /* XDP is attached */
 	SOCK_TSTAMP_NEW, /* Indicates 64 bit timestamps always */
 	SOCK_RCVMARK, /* Receive SO_MARK ancillary data with packet */
+	SOCK_COMP,
 };

 #define SK_FLAGS_TIMESTAMP ((1UL << SOCK_TIMESTAMP) | (1UL << SOCK_TIMESTAMPING_RX_SOFTWARE))
diff --git a/include/net/tcp.h b/include/net/tcp.h
index e9d387fffe22..cb33a2c46b2f 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -195,6 +195,7 @@ void tcp_time_wait(struct sock *sk, int state, int timeo);
  */
 #define TCPOPT_FASTOPEN_MAGIC	0xF989
 #define TCPOPT_SMC_MAGIC	0xE2D4C3D9
+#define TCPOPT_COMP_MAGIC	0x7954

 /*
  * TCP option lengths
@@ -208,6 +209,7 @@ void tcp_time_wait(struct sock *sk, int state, int timeo);
 #define TCPOLEN_FASTOPEN_BASE	2
 #define TCPOLEN_EXP_FASTOPEN_BASE	4
 #define TCPOLEN_EXP_SMC_BASE	6
+#define TCPOLEN_EXP_COMP_BASE	4

 /* But this is what stacks really send out. */
 #define TCPOLEN_TSTAMP_ALIGNED	12
@@ -2531,4 +2533,42 @@ static inline u64 tcp_transmit_time(const struct sock *sk)
 	return 0;
 }

+#if IS_ENABLED(CONFIG_TCP_COMP)
+extern struct static_key_false tcp_have_comp;
+
+extern unsigned long *sysctl_tcp_compression_ports;
+extern int sysctl_tcp_compression_local;
+
+bool tcp_syn_comp_enabled(struct sock *sk);
+bool tcp_synack_comp_enabled(struct sock *sk,
+			     const struct inet_request_sock *ireq);
+void tcp_init_compression(struct sock *sk);
+void tcp_cleanup_compression(struct sock *sk);
+int tcp_comp_init(void);
+#else
+static inline bool tcp_syn_comp_enabled(struct tcp_sock *tp)
+{
+	return false;
+}
+
+static inline bool tcp_synack_comp_enabled(struct sock *sk,
+					   const struct inet_request_sock *ireq)
+{
+	return false;
+}
+
+static inline void tcp_init_compression(struct sock *sk)
+{
+}
+
+static inline void tcp_cleanup_compression(struct sock *sk)
+{
+}
+
+static inline int tcp_comp_init(void)
+{
+	return 0;
+}
+#endif
+
 #endif	/* _TCP_H */
diff --git a/net/ipv4/Kconfig b/net/ipv4/Kconfig
index 2dfb12230f08..a4405ea38338 100644
--- a/net/ipv4/Kconfig
+++ b/net/ipv4/Kconfig
@@ -751,3 +751,13 @@ config TCP_MD5SIG
 	  on the Internet.

 	  If unsure, say N.
+
+config TCP_COMP
+	bool "TCP: Transport Layer Compression support"
+	depends on CRYPTO_ZSTD=y
+	select STREAM_PARSER
+	help
+	  Enable kernel payload compression support for TCP protocol. This allows
+	  payload compression handling of the TCP protocol to be done in-kernel.
+
+	  If unsure, say Y.
diff --git a/net/ipv4/Makefile b/net/ipv4/Makefile
index b18ba8ef93ad..aa458d9f534a 100644
--- a/net/ipv4/Makefile
+++ b/net/ipv4/Makefile
@@ -65,6 +65,7 @@ obj-$(CONFIG_TCP_CONG_ILLINOIS) += tcp_illinois.o
 obj-$(CONFIG_NET_SOCK_MSG) += tcp_bpf.o
 obj-$(CONFIG_BPF_SYSCALL) += udp_bpf.o
 obj-$(CONFIG_NETLABEL) += cipso_ipv4.o
+obj-$(CONFIG_TCP_COMP) += tcp_comp.o

 obj-$(CONFIG_XFRM) += xfrm4_policy.o xfrm4_state.o xfrm4_input.o \
 		      xfrm4_output.o xfrm4_protocol.o
diff --git a/net/ipv4/syncookies.c b/net/ipv4/syncookies.c
index 3b4dafefb4b0..42c577bfc26b 100644
--- a/net/ipv4/syncookies.c
+++ b/net/ipv4/syncookies.c
@@ -390,6 +390,8 @@ struct sock *cookie_v4_check(struct sock *sk, struct sk_buff *skb)

 	if (IS_ENABLED(CONFIG_SMC))
 		ireq->smc_ok = 0;
+	if (IS_ENABLED(CONFIG_TCP_COMP))
+		ireq->comp_ok = 0;

 	ireq->ir_iif = inet_request_bound_dev_if(sk, skb);

diff --git a/net/ipv4/sysctl_net_ipv4.c b/net/ipv4/sysctl_net_ipv4.c
index b17eb28a9690..f212133b05e9 100644
--- a/net/ipv4/sysctl_net_ipv4.c
+++ b/net/ipv4/sysctl_net_ipv4.c
@@ -469,6 +469,30 @@ static int proc_fib_multipath_hash_fields(struct ctl_table *table, int write,
 }
 #endif

+#if IS_ENABLED(CONFIG_TCP_COMP)
+static int proc_tcp_compression_ports(struct ctl_table *table, int write,
+				      void __user *buffer, size_t *lenp,
+				      loff_t *ppos)
+{
+	unsigned long *bitmap = *(unsigned long **)table->data;
+	unsigned long bitmap_len = table->maxlen;
+	int ret;
+
+	ret = proc_do_large_bitmap(table, write, buffer, lenp, ppos);
+	if (write && ret == 0) {
+		if (bitmap_empty(bitmap, bitmap_len)) {
+			if (static_key_enabled(&tcp_have_comp))
+				static_branch_disable(&tcp_have_comp);
+		} else {
+			if (!static_key_enabled(&tcp_have_comp))
+				static_branch_enable(&tcp_have_comp);
+		}
+	}
+
+	return ret;
+}
+#endif
+
 static struct ctl_table ipv4_table[] = {
 	{
 		.procname	= "tcp_max_orphans",
@@ -587,6 +611,24 @@ static struct ctl_table ipv4_table[] = {
 		.mode		= 0644,
 		.proc_handler	= proc_dointvec,
 	},
+#if IS_ENABLED(CONFIG_TCP_COMP)
+	{
+		.procname	= "tcp_compression_ports",
+		.data		= &sysctl_tcp_compression_ports,
+		.maxlen		= 65536,
+		.mode		= 0644,
+		.proc_handler	= proc_tcp_compression_ports,
+	},
+	{
+		.procname	= "tcp_compression_local",
+		.data		= &sysctl_tcp_compression_local,
+		.maxlen		= sizeof(int),
+		.mode		= 0644,
+		.proc_handler	= proc_dointvec_minmax,
+		.extra1		= SYSCTL_ZERO,
+		.extra2		= SYSCTL_ONE,
+	},
+#endif
 	{ }
 };

diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index 3d3a24f79573..2703be8a7316 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -302,6 +302,10 @@ DEFINE_STATIC_KEY_FALSE(tcp_have_smc);
 EXPORT_SYMBOL(tcp_have_smc);
 #endif

+#if IS_ENABLED(CONFIG_TCP_COMP)
+DEFINE_STATIC_KEY_FALSE(tcp_have_comp);
+#endif
+
 /*
  * Current number of TCP sockets.
  */
@@ -4707,5 +4711,6 @@ void __init tcp_init(void)
 	tcp_metrics_init();
 	BUG_ON(tcp_register_congestion_control(&tcp_reno) != 0);
 	tcp_tasklet_init();
+	tcp_comp_init();
 	mptcp_init();
 }
diff --git a/net/ipv4/tcp_comp.c b/net/ipv4/tcp_comp.c
new file mode 100644
index 000000000000..53157a413b58
--- /dev/null
+++ b/net/ipv4/tcp_comp.c
@@ -0,0 +1,912 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/*
+ * TCP compression support
+ *
+ * Copyright(c) 2021 Huawei Technologies Co., Ltd
+ */
+
+#include <linux/skmsg.h>
+#include <linux/zstd.h>
+
+#define TCP_COMP_MAX_PADDING	64
+#define TCP_COMP_DATA_SIZE	65536
+#define TCP_COMP_SCRATCH_SIZE	(TCP_COMP_DATA_SIZE - 1)
+#define TCP_COMP_MAX_CSIZE	(TCP_COMP_SCRATCH_SIZE + TCP_COMP_MAX_PADDING)
+#define TCP_COMP_ALLOC_ORDER	get_order(TCP_COMP_DATA_SIZE)
+#define TCP_COMP_MAX_WINDOWLOG	17
+#define TCP_COMP_MAX_INPUT	(1 << TCP_COMP_MAX_WINDOWLOG)
+
+#define TCP_COMP_SEND_PENDING	1
+#define ZSTD_COMP_DEFAULT_LEVEL	1
+
+static unsigned long tcp_compression_ports[65536 / 8];
+
+unsigned long *sysctl_tcp_compression_ports = tcp_compression_ports;
+int sysctl_tcp_compression_local __read_mostly;
+
+static struct proto tcp_prot_override;
+
+struct tcp_comp_context_tx {
+	ZSTD_CStream *cstream;
+	void *cworkspace;
+	void *plaintext_data;
+	void *compressed_data;
+	struct sk_msg msg;
+	bool in_tcp_sendpages;
+};
+
+struct tcp_comp_context_rx {
+	ZSTD_DStream *dstream;
+	void *dworkspace;
+	void *plaintext_data;
+
+	struct strparser strp;
+	void (*saved_data_ready)(struct sock *sk);
+	struct sk_buff *pkt;
+	struct sk_buff *dpkt;
+};
+
+struct tcp_comp_context {
+	struct rcu_head rcu;
+
+	struct proto *sk_proto;
+	void (*sk_write_space)(struct sock *sk);
+
+	struct tcp_comp_context_tx tx;
+	struct tcp_comp_context_rx rx;
+
+	unsigned long flags;
+};
+
+static bool tcp_comp_is_write_pending(struct tcp_comp_context *ctx)
+{
+	return test_bit(TCP_COMP_SEND_PENDING, &ctx->flags);
+}
+
+static void tcp_comp_err_abort(struct sock *sk, int err)
+{
+	sk->sk_err = err;
+	sk->sk_error_report(sk);
+}
+
+static bool tcp_comp_enabled(__be32 saddr, __be32 daddr, int port)
+{
+	if (!sysctl_tcp_compression_local &&
+	    (saddr == daddr || ipv4_is_loopback(daddr)))
+		return false;
+
+	return test_bit(port, sysctl_tcp_compression_ports);
+}
+
+bool tcp_syn_comp_enabled(struct sock *sk)
+{
+	struct inet_sock *inet = inet_sk(sk);
+
+	return tcp_comp_enabled(inet->inet_saddr, inet->inet_daddr,
+				ntohs(inet->inet_dport));
+}
+
+bool tcp_synack_comp_enabled(struct sock *sk,
+			     const struct inet_request_sock *ireq)
+{
+	struct inet_sock *inet = inet_sk(sk);
+
+	if (!ireq->comp_ok)
+		return false;
+
+	return tcp_comp_enabled(ireq->ir_loc_addr, ireq->ir_rmt_addr,
+				ntohs(inet->inet_sport));
+}
+
+static struct tcp_comp_context *comp_get_ctx(const struct sock *sk)
+{
+	struct inet_connection_sock *icsk = inet_csk(sk);
+
+	return (__force void *)icsk->icsk_ulp_data;
+}
+
+static int tcp_comp_tx_context_init(struct tcp_comp_context *ctx)
+{
+	ZSTD_parameters params;
+	int csize;
+
+	params = ZSTD_getParams(ZSTD_COMP_DEFAULT_LEVEL, PAGE_SIZE, 0);
+	csize = zstd_cstream_workspace_bound(&params.cParams);
+	if (csize <= 0)
+		return -EINVAL;
+
+	ctx->tx.cworkspace = kmalloc(csize, GFP_KERNEL);
+	if (!ctx->tx.cworkspace)
+
return -ENOMEM; + + ctx->tx.cstream = zstd_init_cstream(&params, 0, ctx->tx.cworkspace, + csize); + if (!ctx->tx.cstream) + goto err_cstream; + + ctx->tx.plaintext_data = kvmalloc(TCP_COMP_SCRATCH_SIZE, GFP_KERNEL); + if (!ctx->tx.plaintext_data) + goto err_cstream; + + ctx->tx.compressed_data = kvmalloc(TCP_COMP_MAX_CSIZE, GFP_KERNEL); + if (!ctx->tx.compressed_data) + goto err_compressed; + + return 0; + +err_compressed: + kvfree(ctx->tx.plaintext_data); + ctx->tx.plaintext_data = NULL; +err_cstream: + kfree(ctx->tx.cworkspace); + ctx->tx.cworkspace = NULL; + + return -ENOMEM; +} + +static void *tcp_comp_get_tx_stream(struct sock *sk) +{ + struct tcp_comp_context *ctx = comp_get_ctx(sk); + + if (!ctx->tx.plaintext_data) + tcp_comp_tx_context_init(ctx); + + return ctx->tx.plaintext_data; +} + +static int alloc_compressed_msg(struct sock *sk, int len) +{ + struct tcp_comp_context *ctx = comp_get_ctx(sk); + struct sk_msg *msg = &ctx->tx.msg; + + sk_msg_init(msg); + + return sk_msg_alloc(sk, msg, len, 0); +} + +static int memcopy_from_iter(struct sock *sk, struct iov_iter *from, int copy) +{ + void *dest; + int rc; + + dest = tcp_comp_get_tx_stream(sk); + if (!dest) + return -ENOSPC; + + if (sk->sk_route_caps & NETIF_F_NOCACHE_COPY) + rc = copy_from_iter_nocache(dest, copy, from); + else + rc = copy_from_iter(dest, copy, from); + + if (rc != copy) + rc = -EFAULT; + + return rc; +} + +static int memcopy_to_msg(struct sock *sk, int bytes) +{ + struct tcp_comp_context *ctx = comp_get_ctx(sk); + struct sk_msg *msg = &ctx->tx.msg; + int i = msg->sg.curr; + struct scatterlist *sge; + u32 copy, buf_size; + void *from, *to; + + from = ctx->tx.compressed_data; + do { + sge = sk_msg_elem(msg, i); + /* This is possible if a trim operation shrunk the buffer */ + if (msg->sg.copybreak >= sge->length) { + msg->sg.copybreak = 0; + sk_msg_iter_var_next(i); + if (i == msg->sg.end) + break; + sge = sk_msg_elem(msg, i); + } + buf_size = sge->length - msg->sg.copybreak; + copy = 
(buf_size > bytes) ? bytes : buf_size; + to = sg_virt(sge) + msg->sg.copybreak; + msg->sg.copybreak += copy; + memcpy(to, from, copy); + bytes -= copy; + from += copy; + if (!bytes) + break; + msg->sg.copybreak = 0; + sk_msg_iter_var_next(i); + } while (i != msg->sg.end); + + msg->sg.curr = i; + return bytes; +} + +static int tcp_comp_compress_to_msg(struct sock *sk, int bytes) +{ + struct tcp_comp_context *ctx = comp_get_ctx(sk); + ZSTD_outBuffer outbuf; + ZSTD_inBuffer inbuf; + size_t ret; + + inbuf.src = ctx->tx.plaintext_data; + outbuf.dst = ctx->tx.compressed_data; + inbuf.size = bytes; + outbuf.size = TCP_COMP_MAX_CSIZE; + inbuf.pos = 0; + outbuf.pos = 0; + + ret = ZSTD_compressStream(ctx->tx.cstream, &outbuf, &inbuf); + if (ZSTD_isError(ret)) + return -EIO; + + ret = ZSTD_flushStream(ctx->tx.cstream, &outbuf); + if (ZSTD_isError(ret)) + return -EIO; + + if (inbuf.pos != inbuf.size) + return -EIO; + + if (memcopy_to_msg(sk, outbuf.pos)) + return -EIO; + + sk_msg_trim(sk, &ctx->tx.msg, outbuf.pos); + + return 0; +} + +static int tcp_comp_push_msg(struct sock *sk, struct sk_msg *msg, int flags) +{ + struct tcp_comp_context *ctx = comp_get_ctx(sk); + struct msghdr mh; + struct bio_vec bvec; + struct scatterlist *sg; + int ret, offset; + struct page *p; + size_t size; + + ctx->tx.in_tcp_sendpages = true; + while (1) { + sg = sk_msg_elem(msg, msg->sg.start); + offset = sg->offset; + size = sg->length; + p = sg_page(sg); +retry: + memset(&mh, 0, sizeof(struct msghdr)); + memset(&bvec, 0, sizeof(struct bio_vec)); + + mh.msg_flags = flags | MSG_SPLICE_PAGES; + bvec_set_page(&bvec, p, size, offset); + iov_iter_bvec(&mh.msg_iter, ITER_SOURCE, &bvec, 1, size); + + ret = tcp_sendmsg_locked(sk, &mh, size); + if (ret != size) { + if (ret > 0) { + sk_mem_uncharge(sk, ret); + sg->offset += ret; + sg->length -= ret; + size -= ret; + offset += ret; + goto retry; + } + ctx->tx.in_tcp_sendpages = false; + return ret; + } + + sk_mem_uncharge(sk, ret); + msg->sg.size -= size; + 
put_page(p); + sk_msg_iter_next(msg, start); + if (msg->sg.start == msg->sg.end) + break; + } + + clear_bit(TCP_COMP_SEND_PENDING, &ctx->flags); + ctx->tx.in_tcp_sendpages = false; + + return 0; +} + +static int tcp_comp_push(struct sock *sk, int bytes, int flags) +{ + struct tcp_comp_context *ctx = comp_get_ctx(sk); + int ret; + + ret = tcp_comp_compress_to_msg(sk, bytes); + if (ret < 0) { + pr_debug("%s: failed to compress sg\n", __func__); + return ret; + } + + set_bit(TCP_COMP_SEND_PENDING, &ctx->flags); + + ret = tcp_comp_push_msg(sk, &ctx->tx.msg, flags); + if (ret) { + pr_debug("%s: failed to tcp_comp_push_sg\n", __func__); + return ret; + } + + return 0; +} + +static int wait_on_pending_writer(struct sock *sk, long *timeo) +{ + DEFINE_WAIT_FUNC(wait, woken_wake_function); + int ret = 0; + + add_wait_queue(sk_sleep(sk), &wait); + while (1) { + if (!*timeo) { + ret = -EAGAIN; + break; + } + + if (signal_pending(current)) { + ret = sock_intr_errno(*timeo); + break; + } + + if (sk_wait_event(sk, timeo, !sk->sk_write_pending, &wait)) + break; + } + remove_wait_queue(sk_sleep(sk), &wait); + + return ret; +} + +static int tcp_comp_push_pending_msg(struct sock *sk, int flags) +{ + struct tcp_comp_context *ctx = comp_get_ctx(sk); + struct sk_msg *msg = &ctx->tx.msg; + + if (msg->sg.start == msg->sg.end) + return 0; + + return tcp_comp_push_msg(sk, msg, flags); +} + +static int tcp_comp_complete_pending_work(struct sock *sk, int flags, + long *timeo) +{ + struct tcp_comp_context *ctx = comp_get_ctx(sk); + int ret = 0; + + if (unlikely(sk->sk_write_pending)) + ret = wait_on_pending_writer(sk, timeo); + + if (!ret && tcp_comp_is_write_pending(ctx)) + ret = tcp_comp_push_pending_msg(sk, flags); + + return ret; +} + +static int tcp_comp_sendmsg(struct sock *sk, struct msghdr *msg, size_t size) +{ + struct tcp_comp_context *ctx = comp_get_ctx(sk); + int copied = 0, err = 0; + size_t try_to_copy; + int required_size; + long timeo; + + lock_sock(sk); + + timeo = 
sock_sndtimeo(sk, msg->msg_flags & MSG_DONTWAIT); + + err = tcp_comp_complete_pending_work(sk, msg->msg_flags, &timeo); + if (err) + goto out_err; + + while (msg_data_left(msg)) { + if (sk->sk_err) { + err = -sk->sk_err; + goto out_err; + } + + try_to_copy = msg_data_left(msg); + if (try_to_copy > TCP_COMP_SCRATCH_SIZE) + try_to_copy = TCP_COMP_SCRATCH_SIZE; + required_size = try_to_copy + TCP_COMP_MAX_PADDING; + + if (!sk_stream_memory_free(sk)) + goto wait_for_sndbuf; + +alloc_compressed: + err = alloc_compressed_msg(sk, required_size); + if (err) { + if (err != -ENOSPC) + goto wait_for_memory; + goto out_err; + } + + err = memcopy_from_iter(sk, &msg->msg_iter, try_to_copy); + if (err < 0) + goto out_err; + + copied += try_to_copy; + + err = tcp_comp_push(sk, try_to_copy, msg->msg_flags); + if (err < 0) { + if (err == -ENOMEM) + goto wait_for_memory; + if (err != -EAGAIN) + tcp_comp_err_abort(sk, EBADMSG); + goto out_err; + } + + continue; +wait_for_sndbuf: + set_bit(SOCK_NOSPACE, &sk->sk_socket->flags); +wait_for_memory: + err = sk_stream_wait_memory(sk, &timeo); + if (err) + goto out_err; + if (ctx->tx.msg.sg.size < required_size) + goto alloc_compressed; + } + +out_err: + err = sk_stream_error(sk, msg->msg_flags, err); + + release_sock(sk); + + return copied ? 
copied : err; +} + +static struct sk_buff *comp_wait_data(struct sock *sk, int flags, + long timeo, int *err) +{ + struct tcp_comp_context *ctx = comp_get_ctx(sk); + struct sk_buff *skb; + DEFINE_WAIT_FUNC(wait, woken_wake_function); + + while (!(skb = ctx->rx.pkt)) { + if (sk->sk_err) { + *err = sock_error(sk); + return NULL; + } + + if (!skb_queue_empty(&sk->sk_receive_queue)) { + __strp_unpause(&ctx->rx.strp); + if (ctx->rx.pkt) + return ctx->rx.pkt; + } + + if (sk->sk_shutdown & RCV_SHUTDOWN) + return NULL; + + if (sock_flag(sk, SOCK_DONE)) + return NULL; + + if ((flags & MSG_DONTWAIT) || !timeo) { + *err = -EAGAIN; + return NULL; + } + + add_wait_queue(sk_sleep(sk), &wait); + sk_set_bit(SOCKWQ_ASYNC_WAITDATA, sk); + sk_wait_event(sk, &timeo, ctx->rx.pkt != skb, &wait); + sk_clear_bit(SOCKWQ_ASYNC_WAITDATA, sk); + remove_wait_queue(sk_sleep(sk), &wait); + + /* Handle signals */ + if (signal_pending(current)) { + *err = sock_intr_errno(timeo); + return NULL; + } + } + + return skb; +} + +static bool comp_advance_skb(struct sock *sk, struct sk_buff *skb, + unsigned int len) +{ + struct tcp_comp_context *ctx = comp_get_ctx(sk); + struct strp_msg *rxm = strp_msg(skb); + + if (len < rxm->full_len) { + rxm->offset += len; + rxm->full_len -= len; + return false; + } + + /* Finished with message */ + ctx->rx.pkt = NULL; + kfree_skb(skb); + __strp_unpause(&ctx->rx.strp); + + return true; +} + +static bool comp_advance_dskb(struct sock *sk, struct sk_buff *skb, + unsigned int len) +{ + struct tcp_comp_context *ctx = comp_get_ctx(sk); + struct strp_msg *rxm = strp_msg(skb); + + if (len < rxm->full_len) { + rxm->offset += len; + rxm->full_len -= len; + return false; + } + + /* Finished with message */ + ctx->rx.dpkt = NULL; + kfree_skb(skb); + return true; +} + +static int tcp_comp_rx_context_init(struct tcp_comp_context *ctx) +{ + int dsize; + + dsize = zstd_dstream_workspace_bound(TCP_COMP_MAX_INPUT); + if (dsize <= 0) + return -EINVAL; + + ctx->rx.dworkspace = 
kmalloc(dsize, GFP_KERNEL); + if (!ctx->rx.dworkspace) + return -ENOMEM; + + ctx->rx.dstream = zstd_init_dstream(TCP_COMP_MAX_INPUT, + ctx->rx.dworkspace, dsize); + if (!ctx->rx.dstream) + goto err_dstream; + + ctx->rx.plaintext_data = kvmalloc(TCP_COMP_MAX_CSIZE * 32, GFP_KERNEL); + if (!ctx->rx.plaintext_data) + goto err_dstream; + + return 0; + +err_dstream: + kfree(ctx->rx.dworkspace); + ctx->rx.dworkspace = NULL; + + return -ENOMEM; +} + +static void *tcp_comp_get_rx_stream(struct sock *sk) +{ + struct tcp_comp_context *ctx = comp_get_ctx(sk); + + if (!ctx->rx.plaintext_data) + tcp_comp_rx_context_init(ctx); + + return ctx->rx.plaintext_data; +} + +static int tcp_comp_decompress(struct sock *sk, struct sk_buff *skb, int flags) +{ + struct tcp_comp_context *ctx = comp_get_ctx(sk); + struct strp_msg *rxm = strp_msg(skb); + size_t ret, compressed_len = 0; + int nr_frags_over = 0; + ZSTD_outBuffer outbuf; + ZSTD_inBuffer inbuf; + struct sk_buff *nskb; + int len, plen; + void *to; + + to = tcp_comp_get_rx_stream(sk); + if (!to) + return -ENOSPC; + + if (skb_linearize_cow(skb)) + return -ENOMEM; + + nskb = skb_copy(skb, GFP_KERNEL); + if (!nskb) + return -ENOMEM; + + while (compressed_len < (skb->len - rxm->offset)) { + if (skb_shinfo(nskb)->nr_frags >= MAX_SKB_FRAGS) + break; + + len = 0; + plen = skb->len - rxm->offset - compressed_len; + if (plen > TCP_COMP_MAX_CSIZE) + plen = TCP_COMP_MAX_CSIZE; + + inbuf.src = (char *)skb->data + rxm->offset + compressed_len; + inbuf.pos = 0; + inbuf.size = plen; + + outbuf.dst = ctx->rx.plaintext_data; + outbuf.pos = 0; + outbuf.size = MAX_SKB_FRAGS * TCP_COMP_DATA_SIZE; + outbuf.size -= skb_shinfo(nskb)->nr_frags * TCP_COMP_DATA_SIZE; + + ret = ZSTD_decompressStream(ctx->rx.dstream, &outbuf, &inbuf); + if (ZSTD_isError(ret)) { + kfree_skb(nskb); + return -EIO; + } + + if (!compressed_len) { + len = outbuf.pos - skb->len; + if (len > skb_tailroom(nskb)) + len = skb_tailroom(nskb); + + __skb_put(nskb, len); + + len += skb->len; 
+ skb_copy_to_linear_data(nskb, to, len); + } + + while ((to += len, outbuf.pos -= len) > 0) { + struct page *pages; + skb_frag_t *frag; + + if (skb_shinfo(nskb)->nr_frags >= MAX_SKB_FRAGS) { + nr_frags_over = 1; + break; + } + + frag = skb_shinfo(nskb)->frags + + skb_shinfo(nskb)->nr_frags; + pages = alloc_pages(__GFP_NOWARN | GFP_KERNEL | __GFP_COMP, + TCP_COMP_ALLOC_ORDER); + if (!pages) { + kfree_skb(nskb); + return -ENOMEM; + } + + frag->bv_page = pages; + len = PAGE_SIZE << TCP_COMP_ALLOC_ORDER; + if (outbuf.pos < len) + len = outbuf.pos; + + frag->bv_offset = 0; + skb_frag_size_set(frag, len); + memcpy(skb_frag_address(frag), to, len); + + nskb->truesize += len; + nskb->data_len += len; + nskb->len += len; + skb_shinfo(nskb)->nr_frags++; + } + + if (nr_frags_over) + break; + + compressed_len += inbuf.pos; + } + + ctx->rx.dpkt = nskb; + rxm = strp_msg(nskb); + rxm->full_len = nskb->len; + rxm->offset = 0; + comp_advance_skb(sk, skb, compressed_len); + + return 0; +} + +static int tcp_comp_recvmsg(struct sock *sk, struct msghdr *msg, size_t len, + int flags, int *addr_len) +{ + struct tcp_comp_context *ctx = comp_get_ctx(sk); + struct strp_msg *rxm; + struct sk_buff *skb; + ssize_t copied = 0; + int target, err = 0; + long timeo; + + if (unlikely(flags & MSG_ERRQUEUE)) + return sock_recv_errqueue(sk, msg, len, SOL_IP, IP_RECVERR); + + lock_sock(sk); + + target = sock_rcvlowat(sk, flags & MSG_WAITALL, len); + timeo = sock_rcvtimeo(sk, flags & MSG_WAITALL); + + do { + int chunk = 0; + + if (!ctx->rx.dpkt) { + skb = comp_wait_data(sk, flags, timeo, &err); + if (!skb) + goto recv_end; + + err = tcp_comp_decompress(sk, skb, flags); + if (err < 0) { + goto recv_end; + } + } + + skb = ctx->rx.dpkt; + rxm = strp_msg(skb); + chunk = min_t(unsigned int, rxm->full_len, len); + err = skb_copy_datagram_msg(skb, rxm->offset, msg, + chunk); + if (err < 0) + goto recv_end; + + copied += chunk; + len -= chunk; + if (likely(!(flags & MSG_PEEK))) + comp_advance_dskb(sk, skb, 
chunk); + else + break; + + if (copied >= target && !ctx->rx.dpkt) + break; + } while (len > 0); + +recv_end: + release_sock(sk); + return copied ? : err; +} + +bool comp_stream_read(struct sock *sk) +{ + struct tcp_comp_context *ctx = comp_get_ctx(sk); + + if (!ctx) + return false; + + if (ctx->rx.pkt || ctx->rx.dpkt) + return true; + + return false; +} + +static void comp_data_ready(struct sock *sk) +{ + struct tcp_comp_context *ctx = comp_get_ctx(sk); + + strp_data_ready(&ctx->rx.strp); +} + +static void comp_queue(struct strparser *strp, struct sk_buff *skb) +{ + struct tcp_comp_context *ctx = comp_get_ctx(strp->sk); + + ctx->rx.pkt = skb; + strp_pause(strp); + ctx->rx.saved_data_ready(strp->sk); +} + +static int comp_read_size(struct strparser *strp, struct sk_buff *skb) +{ + struct strp_msg *rxm = strp_msg(skb); + + if (rxm->offset > skb->len) + return 0; + + return skb->len - rxm->offset; +} + +void comp_setup_strp(struct sock *sk, struct tcp_comp_context *ctx) +{ + struct strp_callbacks cb; + + memset(&cb, 0, sizeof(cb)); + cb.rcv_msg = comp_queue; + cb.parse_msg = comp_read_size; + strp_init(&ctx->rx.strp, sk, &cb); + + write_lock_bh(&sk->sk_callback_lock); + ctx->rx.saved_data_ready = sk->sk_data_ready; + sk->sk_data_ready = comp_data_ready; + write_unlock_bh(&sk->sk_callback_lock); + + strp_check_rcv(&ctx->rx.strp); +} + +static void tcp_comp_write_space(struct sock *sk) +{ + struct tcp_comp_context *ctx = comp_get_ctx(sk); + + if (ctx->tx.in_tcp_sendpages) { + ctx->sk_write_space(sk); + return; + } + + if (!sk->sk_write_pending && tcp_comp_is_write_pending(ctx)) { + gfp_t sk_allocation = sk->sk_allocation; + int rc; + + sk->sk_allocation = GFP_ATOMIC; + rc = tcp_comp_push_pending_msg(sk, MSG_DONTWAIT | MSG_NOSIGNAL); + sk->sk_allocation = sk_allocation; + + if (rc < 0) + return; + } + + ctx->sk_write_space(sk); +} + +void tcp_init_compression(struct sock *sk) +{ + struct inet_connection_sock *icsk = inet_csk(sk); + struct tcp_comp_context *ctx = NULL; + 
struct sk_msg *msg = NULL; + struct tcp_sock *tp = tcp_sk(sk); + + if (!tp->rx_opt.comp_ok) + return; + + ctx = kzalloc(sizeof(*ctx), GFP_ATOMIC); + if (!ctx) + return; + + msg = &ctx->tx.msg; + sk_msg_init(msg); + + ctx->sk_write_space = sk->sk_write_space; + ctx->sk_proto = sk->sk_prot; + WRITE_ONCE(sk->sk_prot, &tcp_prot_override); + sk->sk_write_space = tcp_comp_write_space; + + rcu_assign_pointer(icsk->icsk_ulp_data, ctx); + + sock_set_flag(sk, SOCK_COMP); + comp_setup_strp(sk, ctx); +} + +static void tcp_comp_context_tx_free(struct tcp_comp_context *ctx) +{ + kfree(ctx->tx.cworkspace); + ctx->tx.cworkspace = NULL; + + kvfree(ctx->tx.plaintext_data); + ctx->tx.plaintext_data = NULL; + + kvfree(ctx->tx.compressed_data); + ctx->tx.compressed_data = NULL; +} + +static void tcp_comp_context_rx_free(struct tcp_comp_context *ctx) +{ + kfree(ctx->rx.dworkspace); + ctx->rx.dworkspace = NULL; + + kvfree(ctx->rx.plaintext_data); + ctx->rx.plaintext_data = NULL; +} + +static void tcp_comp_context_free(struct rcu_head *head) +{ + struct tcp_comp_context *ctx; + + ctx = container_of(head, struct tcp_comp_context, rcu); + + tcp_comp_context_tx_free(ctx); + tcp_comp_context_rx_free(ctx); + strp_done(&ctx->rx.strp); + kfree(ctx); +} + +void tcp_cleanup_compression(struct sock *sk) +{ + struct inet_connection_sock *icsk = inet_csk(sk); + struct tcp_comp_context *ctx = comp_get_ctx(sk); + + if (!ctx || !sock_flag(sk, SOCK_COMP)) + return; + + if (ctx->rx.pkt) { + kfree_skb(ctx->rx.pkt); + ctx->rx.pkt = NULL; + } + + if (ctx->rx.dpkt) { + kfree_skb(ctx->rx.dpkt); + ctx->rx.dpkt = NULL; + } + strp_stop(&ctx->rx.strp); + + rcu_assign_pointer(icsk->icsk_ulp_data, NULL); + call_rcu(&ctx->rcu, tcp_comp_context_free); +} + +int tcp_comp_init(void) +{ + tcp_prot_override = tcp_prot; + tcp_prot_override.sendmsg = tcp_comp_sendmsg; + tcp_prot_override.recvmsg = tcp_comp_recvmsg; + tcp_prot_override.sock_is_readable = comp_stream_read; + + return 0; +} diff --git a/net/ipv4/tcp_input.c 
b/net/ipv4/tcp_input.c index fd5c13c1fbc8..a5adf4822663 100644 --- a/net/ipv4/tcp_input.c +++ b/net/ipv4/tcp_input.c @@ -4012,6 +4012,24 @@ static bool smc_parse_options(const struct tcphdr *th, return false; } +static bool tcp_parse_comp_option(const struct tcphdr *th, + struct tcp_options_received *opt_rx, + const unsigned char *ptr, + int opsize) +{ +#if IS_ENABLED(CONFIG_TCP_COMP) + if (static_branch_unlikely(&tcp_have_comp)) { + if (th->syn && !(opsize & 1) && + opsize >= TCPOLEN_EXP_COMP_BASE && + get_unaligned_be16(ptr) == TCPOPT_COMP_MAGIC) { + opt_rx->comp_ok = 1; + return true; + } + } +#endif + return false; +} + /* Try to parse the MSS option from the TCP header. Return 0 on failure, clamped * value on success. */ @@ -4171,6 +4189,10 @@ void tcp_parse_options(const struct net *net, if (smc_parse_options(th, opt_rx, ptr, opsize)) break; + if (tcp_parse_comp_option(th, opt_rx, ptr, + opsize)) + break; + opt_rx->saw_unknown = 1; break; @@ -6097,6 +6119,7 @@ void tcp_init_transfer(struct sock *sk, int bpf_op, struct sk_buff *skb) /* Initialize congestion control unless BPF initialized it already: */ if (!icsk->icsk_ca_initialized) tcp_init_congestion_control(sk); + tcp_init_compression(sk); tcp_init_buffer_space(sk); } @@ -6835,6 +6858,9 @@ static void tcp_openreq_init(struct request_sock *req, ireq->smc_ok = rx_opt->smc_ok && !(tcp_sk(sk)->smc_hs_congested && tcp_sk(sk)->smc_hs_congested(sk)); #endif +#if IS_ENABLED(CONFIG_TCP_COMP) + ireq->comp_ok = rx_opt->comp_ok; +#endif } struct request_sock *inet_reqsk_alloc(const struct request_sock_ops *ops, diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c index 4167e8a48b60..9cbf23c1c039 100644 --- a/net/ipv4/tcp_ipv4.c +++ b/net/ipv4/tcp_ipv4.c @@ -2304,6 +2304,8 @@ void tcp_v4_destroy_sock(struct sock *sk) tcp_cleanup_congestion_control(sk); + tcp_cleanup_compression(sk); + tcp_cleanup_ulp(sk); /* Cleanup up the write buffer. 
*/ diff --git a/net/ipv4/tcp_minisocks.c b/net/ipv4/tcp_minisocks.c index b98d476f1594..d67a84c114ce 100644 --- a/net/ipv4/tcp_minisocks.c +++ b/net/ipv4/tcp_minisocks.c @@ -543,6 +543,9 @@ struct sock *tcp_create_openreq_child(const struct sock *sk, newtp->rcv_ssthresh = req->rsk_rcv_wnd; newtp->rcv_wnd = req->rsk_rcv_wnd; newtp->rx_opt.wscale_ok = ireq->wscale_ok; +#if IS_ENABLED(CONFIG_TCP_COMP) + newtp->rx_opt.comp_ok = ireq->comp_ok; +#endif if (newtp->rx_opt.wscale_ok) { newtp->rx_opt.snd_wscale = ireq->snd_wscale; newtp->rx_opt.rcv_wscale = ireq->rcv_wscale; diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c index 1917c62ad3bf..eee34f6c7643 100644 --- a/net/ipv4/tcp_output.c +++ b/net/ipv4/tcp_output.c @@ -422,6 +422,7 @@ static inline bool tcp_urg_mode(const struct tcp_sock *tp) #define OPTION_FAST_OPEN_COOKIE BIT(8) #define OPTION_SMC BIT(9) #define OPTION_MPTCP BIT(10) +#define OPTION_COMP BIT(11) static void smc_options_write(__be32 *ptr, u16 *options) { @@ -438,6 +439,19 @@ static void smc_options_write(__be32 *ptr, u16 *options) #endif } +static void comp_options_write(__be32 *ptr, u16 *options) +{ +#if IS_ENABLED(CONFIG_TCP_COMP) + if (static_branch_unlikely(&tcp_have_comp)) { + if (unlikely(OPTION_COMP & *options)) { + *ptr++ = htonl((TCPOPT_EXP << 24) | + (TCPOLEN_EXP_COMP_BASE << 16) | + (TCPOPT_COMP_MAGIC)); + } + } +#endif +} + struct tcp_out_options { u16 options; /* bit field of OPTION_* */ u16 mss; /* 0 to disable */ @@ -711,6 +725,8 @@ static void tcp_options_write(struct tcphdr *th, struct tcp_sock *tp, smc_options_write(ptr, &options); mptcp_options_write(th, ptr, tp, opts); + + comp_options_write(ptr, &options); } static void smc_set_option(const struct tcp_sock *tp, @@ -729,6 +745,39 @@ static void smc_set_option(const struct tcp_sock *tp, #endif } +static void comp_set_option(struct sock *sk, + struct tcp_out_options *opts, + unsigned int *remaining) +{ +#if IS_ENABLED(CONFIG_TCP_COMP) + if 
(static_branch_unlikely(&tcp_have_comp)) { + if (tcp_syn_comp_enabled(sk)) { + if (*remaining >= TCPOLEN_EXP_COMP_BASE) { + opts->options |= OPTION_COMP; + *remaining -= TCPOLEN_EXP_COMP_BASE; + } + } + } +#endif +} + +static void comp_set_option_cond(struct sock *sk, + const struct inet_request_sock *ireq, + struct tcp_out_options *opts, + unsigned int *remaining) +{ +#if IS_ENABLED(CONFIG_TCP_COMP) + if (static_branch_unlikely(&tcp_have_comp)) { + if (tcp_synack_comp_enabled(sk, ireq)) { + if (*remaining >= TCPOLEN_EXP_COMP_BASE) { + opts->options |= OPTION_COMP; + *remaining -= TCPOLEN_EXP_COMP_BASE; + } + } + } +#endif +} + static void smc_set_option_cond(const struct tcp_sock *tp, const struct inet_request_sock *ireq, struct tcp_out_options *opts, @@ -830,6 +879,7 @@ static unsigned int tcp_syn_options(struct sock *sk, struct sk_buff *skb, } smc_set_option(tp, opts, &remaining); + comp_set_option(sk, opts, &remaining); if (sk_is_mptcp(sk)) { unsigned int size; @@ -910,6 +960,8 @@ static unsigned int tcp_synack_options(const struct sock *sk, smc_set_option_cond(tcp_sk(sk), ireq, opts, &remaining); + comp_set_option_cond((struct sock *)sk, ireq, opts, &remaining); + bpf_skops_hdr_opt_len((struct sock *)sk, skb, req, syn_skb, synack_type, opts, &remaining); diff --git a/net/ipv6/syncookies.c b/net/ipv6/syncookies.c index 8698b49dfc8d..d17389f1afe0 100644 --- a/net/ipv6/syncookies.c +++ b/net/ipv6/syncookies.c @@ -217,6 +217,8 @@ struct sock *cookie_v6_check(struct sock *sk, struct sk_buff *skb) treq->txhash = net_tx_rndhash(); if (IS_ENABLED(CONFIG_SMC)) ireq->smc_ok = 0; + if (IS_ENABLED(CONFIG_TCP_COMP)) + ireq->comp_ok = 0; /* * We need to lookup the dst_entry to get the correct window size. -- 2.34.1
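
For readers following the option handling in the patch above, here is a minimal userspace sketch of what comp_options_write() and tcp_parse_comp_option() appear to do: pack an experimental TCP option as kind, length and a 16-bit magic, then accept it only on a SYN, with an even length, a length of at least the base size, and a matching magic. TCPOPT_EXP is 254 in mainline Linux. The values used here for TCPOLEN_EXP_COMP_BASE and TCPOPT_COMP_MAGIC are assumptions for illustration, since their definitions are not part of this excerpt. The helper names comp_option_put() and comp_option_match() are hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

/* Kind for experimental options (RFC 6994 covers kinds 253 and 254). */
#define TCPOPT_EXP            254
/* ASSUMPTION: kind (1) + length (1) + 16-bit magic (2); not shown in the excerpt. */
#define TCPOLEN_EXP_COMP_BASE 4
/* ASSUMPTION: placeholder magic; the real value is defined outside the excerpt. */
#define TCPOPT_COMP_MAGIC     0x7954

/* Host-order word, built the same way the patch does before calling htonl(). */
static uint32_t comp_option_word(void)
{
	return ((uint32_t)TCPOPT_EXP << 24) |
	       ((uint32_t)TCPOLEN_EXP_COMP_BASE << 16) |
	       (uint32_t)TCPOPT_COMP_MAGIC;
}

/* Serialize the option in network (big-endian) byte order. */
static void comp_option_put(uint8_t out[4])
{
	uint32_t w = comp_option_word();

	out[0] = (uint8_t)(w >> 24);	/* kind */
	out[1] = (uint8_t)(w >> 16);	/* total option length */
	out[2] = (uint8_t)(w >> 8);	/* magic, high byte */
	out[3] = (uint8_t)w;		/* magic, low byte */
}

/*
 * Same checks as the parser in the patch: ptr points at the payload that
 * follows the kind and length bytes, opsize is the total option length.
 */
static bool comp_option_match(bool syn, const uint8_t *ptr, int opsize)
{
	uint16_t magic;

	if (!syn || (opsize & 1) || opsize < TCPOLEN_EXP_COMP_BASE)
		return false;

	magic = (uint16_t)((ptr[0] << 8) | ptr[1]);
	return magic == TCPOPT_COMP_MAGIC;
}
```

A caller would run comp_option_put() into a 4-byte buffer and pass buffer + 2 together with buffer[1] to comp_option_match(). The sketch leaves out the tcp_have_comp static key and the per-port bitmap that gate both paths in the patch.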

[PATCH OLK-6.6 00/10] ACPI & PCI bugfix from 22.03 SP3
by Xiongfeng Wang 22 Dec '23

Xiongfeng Wang (10):
  pci: do not save 'PCI_BRIDGE_CTL_BUS_RESET'
  PCI: check BIR before mapping MSI-X Table
  PCI: Fail MSI-X mapping if MSI-X Table offset is out of range of BAR space
  sysrq: avoid concurrently info printing by 'sysrq-trigger'
  PCI: Add MCFG quirks for some Hisilicon Chip host controllers
  PCI: add a member in 'struct pci_bus' to record the original 'pci_ops'
  PCI/AER: increments pci bus reference count in aer-inject process
  ntp: Avoid undefined behaviour in second_overflow()
  hinic: ethtool: Allow userspace to set more aggregation params
  PCI/sysfs: Take reference on device to be removed

 drivers/acpi/pci_mcfg.c                           |  4 ++++
 drivers/net/ethernet/huawei/hinic/hinic_ethtool.c | 10 +++++-----
 drivers/pci/bus.c                                 |  2 ++
 drivers/pci/msi/msi.c                             | 11 +++++++++++
 drivers/pci/pci-sysfs.c                           |  9 +++++++--
 drivers/pci/pci.c                                 |  3 +++
 drivers/pci/pcie/aer_inject.c                     |  9 +++++++++
 drivers/pci/probe.c                               | 12 +++++++++---
 drivers/tty/sysrq.c                               |  6 ++++++
 include/linux/pci.h                               |  1 +
 kernel/time/ntp.c                                 |  2 ++
 11 files changed, 59 insertions(+), 10 deletions(-)

-- 
2.20.1

[PATCH OLK-6.6 0/1] cgroup: Return ERSCH when add Z process into task
by chenridong 22 Dec '23

Lu Jialin (1):
  cgroup: Return ERSCH when add Z process into task

 kernel/cgroup/cgroup.c | 8 ++++++--
 1 file changed, 6 insertions(+), 2 deletions(-)

-- 
2.34.1

[PATCH] memcg: support ksm merge any mode per cgroup
by Nanyong Sun 22 Dec '23

hulk inclusion
category: feature
bugzilla: https://gitee.com/openeuler/kernel/issues/I8OIQR

----------------------------------------------------------------------

Add a control file "memory.ksm" to enable KSM per cgroup. Writing 1 to
the file sets all tasks currently in the cgroup to KSM "merge any" mode,
which means KSM is enabled for all VMAs of each process. Writing 0
disables KSM for those tasks and unmerges the merged pages. Reading the
file shows the state above and the KSM-related profits of the cgroup.

Signed-off-by: Nanyong Sun <sunnanyong(a)huawei.com>
---
 .../admin-guide/cgroup-v1/memory.rst |   1 +
 mm/memcontrol.c                      | 110 +++++++++++++++++-
 2 files changed, 109 insertions(+), 2 deletions(-)

diff --git a/Documentation/admin-guide/cgroup-v1/memory.rst b/Documentation/admin-guide/cgroup-v1/memory.rst
index ff456871bf4b..3fdb48435e8e 100644
--- a/Documentation/admin-guide/cgroup-v1/memory.rst
+++ b/Documentation/admin-guide/cgroup-v1/memory.rst
@@ -109,6 +109,7 @@ Brief summary of control files.
  memory.kmem.tcp.failcnt             show the number of tcp buf memory usage
                                      hits limits
  memory.kmem.tcp.max_usage_in_bytes  show max tcp buf memory usage recorded
+ memory.ksm                          set/show ksm merge any mode
 ==================================== ==========================================
 
 1. History
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 2d9a873e5522..30cafc2f22c1 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -73,6 +73,7 @@
 #include <linux/uaccess.h>
 
 #include <trace/events/vmscan.h>
+#include <linux/ksm.h>
 
 struct cgroup_subsys memory_cgrp_subsys __read_mostly;
 EXPORT_SYMBOL(memory_cgrp_subsys);
@@ -242,10 +243,15 @@ enum res_type {
 	     iter != NULL;				\
 	     iter = mem_cgroup_iter(NULL, iter, NULL))
 
+static inline bool __task_is_dying(struct task_struct *task)
+{
+	return tsk_is_oom_victim(task) || fatal_signal_pending(task) ||
+		(task->flags & PF_EXITING);
+}
+
 static inline bool task_is_dying(void)
 {
-	return tsk_is_oom_victim(current) || fatal_signal_pending(current) ||
-		(current->flags & PF_EXITING);
+	return __task_is_dying(current);
 }
 
 /* Some nice accessors for the vmpressure. */
@@ -5100,6 +5106,98 @@ static int mem_cgroup_slab_show(struct seq_file *m, void *p)
 }
 #endif
 
+#ifdef CONFIG_KSM
+static int memcg_set_ksm_for_tasks(struct mem_cgroup *memcg, bool enable)
+{
+	struct task_struct *task;
+	struct mm_struct *mm;
+	struct css_task_iter it;
+	int ret = 0;
+
+	css_task_iter_start(&memcg->css, CSS_TASK_ITER_PROCS, &it);
+	while (!ret && (task = css_task_iter_next(&it))) {
+		if (__task_is_dying(task))
+			continue;
+
+		mm = get_task_mm(task);
+		if (!mm)
+			continue;
+
+		if (mmap_write_lock_killable(mm)) {
+			mmput(mm);
+			continue;
+		}
+
+		if (enable)
+			ret = ksm_enable_merge_any(mm);
+		else
+			ret = ksm_disable_merge_any(mm);
+
+		mmap_write_unlock(mm);
+		mmput(mm);
+	}
+	css_task_iter_end(&it);
+
+	return ret;
+}
+
+static int memory_ksm_show(struct seq_file *m, void *v)
+{
+	unsigned long ksm_merging_pages = 0;
+	unsigned long ksm_rmap_items = 0;
+	long ksm_process_profits = 0;
+	unsigned int tasks = 0;
+	struct task_struct *task;
+	struct mm_struct *mm;
+	struct css_task_iter it;
+	struct mem_cgroup *memcg = mem_cgroup_from_seq(m);
+
+	css_task_iter_start(&memcg->css, CSS_TASK_ITER_PROCS, &it);
+	while ((task = css_task_iter_next(&it))) {
+		mm = get_task_mm(task);
+		if (!mm)
+			continue;
+
+		if (test_bit(MMF_VM_MERGE_ANY, &mm->flags))
+			tasks++;
+
+		ksm_rmap_items += mm->ksm_rmap_items;
+		ksm_merging_pages += mm->ksm_merging_pages;
+		ksm_process_profits += ksm_process_profit(mm);
+		mmput(mm);
+	}
+	css_task_iter_end(&it);
+
+	seq_printf(m, "merge any tasks: %u\n", tasks);
+	seq_printf(m, "ksm_rmap_items %lu\n", ksm_rmap_items);
+	seq_printf(m, "ksm_merging_pages %lu\n", ksm_merging_pages);
+	seq_printf(m, "ksm_process_profits %ld\n", ksm_process_profits);
+	return 0;
+}
+
+static ssize_t memory_ksm_write(struct kernfs_open_file *of, char *buf,
+				size_t nbytes, loff_t off)
+{
+	bool enable;
+	int err;
+	struct mem_cgroup *memcg = mem_cgroup_from_css(of_css(of));
+
+	buf = strstrip(buf);
+	if (!buf)
+		return -EINVAL;
+
+	err = kstrtobool(buf, &enable);
+	if (err)
+		return err;
+
+	err = memcg_set_ksm_for_tasks(memcg, enable);
+	if (err)
+		return err;
+
+	return nbytes;
+}
+#endif /* CONFIG_KSM */
+
 static int memory_stat_show(struct seq_file *m, void *v);
 
 #ifdef CONFIG_MEMCG_V1_RECLAIM
@@ -5337,6 +5435,14 @@ static struct cftype mem_cgroup_legacy_files[] = {
 		.name = "reclaim",
 		.write = memory_reclaim,
 	},
+#endif
+#ifdef CONFIG_KSM
+	{
+		.name = "ksm",
+		.flags = CFTYPE_NOT_ON_ROOT,
+		.write = memory_ksm_write,
+		.seq_show = memory_ksm_show,
+	},
 #endif
 	{ },	/* terminate */
 };
-- 
2.25.1
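
As an illustration of how a userspace tool might consume the new file, the sketch below parses the four lines that memory_ksm_show() prints in the patch above. Only the line formats come from the patch. The struct memcg_ksm_stats type and the parse_memory_ksm() helper are hypothetical names introduced for this example, and the sketch assumes the whole file content has already been read into a string.

```c
#include <stdio.h>

/* Hypothetical container for the values printed by memory_ksm_show(). */
struct memcg_ksm_stats {
	unsigned int merge_any_tasks;	/* "merge any tasks: %u" */
	unsigned long rmap_items;	/* "ksm_rmap_items %lu" */
	unsigned long merging_pages;	/* "ksm_merging_pages %lu" */
	long process_profits;		/* "ksm_process_profits %ld" */
};

/*
 * Parse the text of memory.ksm. Returns 0 on success and -1 if the text
 * does not match the four-line layout. A space in the sscanf() format
 * matches any run of whitespace, including the newlines between fields.
 */
static int parse_memory_ksm(const char *text, struct memcg_ksm_stats *st)
{
	int n = sscanf(text,
		       "merge any tasks: %u "
		       "ksm_rmap_items %lu "
		       "ksm_merging_pages %lu "
		       "ksm_process_profits %ld",
		       &st->merge_any_tasks, &st->rmap_items,
		       &st->merging_pages, &st->process_profits);

	return n == 4 ? 0 : -1;
}
```

Note that process_profits is signed because the patch accumulates it in a long and prints it with %ld, so a negative total is possible.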

[PATCH OLK-6.6 v4 0/7] arm64: Add support to turn an IPI as NMI
by Liao Chen 22 Dec '23

With pseudo-NMI support available, it is possible to configure SGIs to be
triggered as pseudo NMIs running in NMI context. Kernel features that can
use this include:

- NMI backtrace can leverage an IPI turned into an NMI to get a backtrace of
  a CPU stuck in hard lockup using magic SYSRQ.
- kgdb relies on NMI support to round up CPUs which are stuck in hard lockup
  state with interrupts disabled.

This patch-set adds a framework to turn an IPI into an NMI, which can be
triggered as a pseudo NMI and in turn invokes registered NMI handlers.

After this patch-set we should be able to get a backtrace for a CPU stuck in
HARDLOCKUP.

Sumit Garg (7):
  arm64: Add framework to turn IPI as NMI
  irqchip/gic-v3: Enable support for SGIs to act as NMIs
  arm64: smp: Assign and setup an IPI as NMI
  nmi: backtrace: Allow runtime arch specific override
  arm64: ipi_nmi: Add support for NMI backtrace
  kgdb: Expose default CPUs roundup fallback mechanism
  arm64: kgdb: Roundup cpus using IPI as NMI

 arch/arm/include/asm/irq.h       |  2 +-
 arch/arm/kernel/smp.c            |  3 +-
 arch/arm64/include/asm/irq.h     |  6 +++
 arch/arm64/include/asm/nmi.h     | 17 +++++++
 arch/arm64/kernel/Makefile       |  2 +-
 arch/arm64/kernel/ipi_nmi.c      | 84 ++++++++++++++++++++++++++++++++
 arch/arm64/kernel/kgdb.c         | 18 +++++++
 arch/arm64/kernel/smp.c          |  8 +++
 arch/mips/include/asm/irq.h      |  2 +-
 arch/mips/kernel/process.c       |  3 +-
 arch/powerpc/include/asm/irq.h   |  2 +-
 arch/powerpc/kernel/stacktrace.c |  3 +-
 arch/sparc/include/asm/irq_64.h  |  2 +-
 arch/sparc/kernel/process_64.c   |  4 +-
 arch/x86/include/asm/irq.h       |  2 +-
 arch/x86/kernel/apic/hw_nmi.c    |  3 +-
 drivers/irqchip/irq-gic-v3.c     | 29 ++++++++---
 include/linux/kgdb.h             | 12 +++++
 include/linux/nmi.h              | 12 ++---
 kernel/debug/debug_core.c        |  8 ++-
 20 files changed, 194 insertions(+), 28 deletions(-)
 create mode 100644 arch/arm64/include/asm/nmi.h
 create mode 100644 arch/arm64/kernel/ipi_nmi.c

--
2.34.1

[PATCH OLK-6.6 v4 1/7] arm64: Add framework to turn IPI as NMI
by Liao Chen 22 Dec '23

From: Sumit Garg <sumit.garg(a)linaro.org>

maillist inclusion
category: feature
bugzilla: https://gitee.com/openeuler/kernel/issues/I7R4EN
CVE: NA

Reference: https://lore.kernel.org/all/1604317487-14543-1-git-send-email-sumit.garg@li…

-------------------------------------------------

Introduce framework to turn an IPI as NMI using pseudo NMIs. The main
motivation for this feature is to have an IPI that can be leveraged to
invoke NMI functions on other CPUs.

And current prospective users are NMI backtrace and KGDB CPUs round-up
whose support is added via future patches.

Signed-off-by: Sumit Garg <sumit.garg(a)linaro.org>
Signed-off-by: Wei Li <liwei391(a)huawei.com>
Reviewed-by: Xie XiuQi <xiexiuqi(a)huawei.com>
Signed-off-by: Zheng Zengkai <zhengzengkai(a)huawei.com>
Signed-off-by: Ruan Jinjie <ruanjinjie(a)huawei.com>
Signed-off-by: Liao Chen <liaochen4(a)huawei.com>
---
 arch/arm64/include/asm/nmi.h | 17 ++++++++++
 arch/arm64/kernel/ipi_nmi.c  | 65 ++++++++++++++++++++++++++++++++++++
 2 files changed, 82 insertions(+)
 create mode 100644 arch/arm64/include/asm/nmi.h
 create mode 100644 arch/arm64/kernel/ipi_nmi.c

diff --git a/arch/arm64/include/asm/nmi.h b/arch/arm64/include/asm/nmi.h
new file mode 100644
index 000000000000..4cd14b6af88b
--- /dev/null
+++ b/arch/arm64/include/asm/nmi.h
@@ -0,0 +1,17 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef __ASM_NMI_H
+#define __ASM_NMI_H
+
+#ifndef __ASSEMBLER__
+
+#include <linux/cpumask.h>
+
+extern bool arm64_supports_nmi(void);
+extern void arm64_send_nmi(cpumask_t *mask);
+
+void set_smp_dynamic_ipi(int ipi);
+void dynamic_ipi_setup(int cpu);
+void dynamic_ipi_teardown(int cpu);
+
+#endif /* !__ASSEMBLER__ */
+#endif
diff --git a/arch/arm64/kernel/ipi_nmi.c b/arch/arm64/kernel/ipi_nmi.c
new file mode 100644
index 000000000000..a945dcf8015f
--- /dev/null
+++ b/arch/arm64/kernel/ipi_nmi.c
@@ -0,0 +1,65 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/*
+ * NMI support for IPIs
+ *
+ * Copyright (C) 2020 Linaro Limited
+ * Author: Sumit Garg <sumit.garg(a)linaro.org>
+ */
+
+#include <linux/interrupt.h>
+#include <linux/irq.h>
+#include <linux/smp.h>
+
+#include <asm/nmi.h>
+
+static struct irq_desc *ipi_nmi_desc __read_mostly;
+static int ipi_nmi_id __read_mostly;
+
+bool arm64_supports_nmi(void)
+{
+	if (ipi_nmi_desc)
+		return true;
+
+	return false;
+}
+
+void arm64_send_nmi(cpumask_t *mask)
+{
+	if (WARN_ON_ONCE(!ipi_nmi_desc))
+		return;
+
+	__ipi_send_mask(ipi_nmi_desc, mask);
+}
+
+static irqreturn_t ipi_nmi_handler(int irq, void *data)
+{
+	/* nop, NMI handlers for special features can be added here. */
+
+	return IRQ_NONE;
+}
+
+void dynamic_ipi_setup(int cpu)
+{
+	if (!ipi_nmi_desc)
+		return;
+
+	if (!prepare_percpu_nmi(ipi_nmi_id))
+		enable_percpu_nmi(ipi_nmi_id, IRQ_TYPE_NONE);
+}
+
+void dynamic_ipi_teardown(int cpu)
+{
+	if (!ipi_nmi_desc)
+		return;
+
+	disable_percpu_nmi(ipi_nmi_id);
+	teardown_percpu_nmi(ipi_nmi_id);
+}
+
+void __init set_smp_dynamic_ipi(int ipi)
+{
+	if (!request_percpu_nmi(ipi, ipi_nmi_handler, "IPI", &cpu_number)) {
+		ipi_nmi_desc = irq_to_desc(ipi);
+		ipi_nmi_id = ipi;
+	}
+}
--
2.34.1
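
The registration pattern in this patch (a descriptor that stays NULL until the NMI request succeeds, and senders that refuse to act while it is NULL) can be illustrated with a small user-space model. This is only a sketch: every `model_*` name is hypothetical, `request_ok` stands in for `request_percpu_nmi()` returning 0, and no kernel API is called.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for struct irq_desc; only its address matters here. */
struct model_irq_desc {
	int irq;
};

static struct model_irq_desc model_desc_storage;
static struct model_irq_desc *model_nmi_desc;	/* NULL until registered */
static unsigned int model_sent;			/* NMIs "sent" so far */

/* Mirrors arm64_supports_nmi(): true only after a successful registration. */
bool model_supports_nmi(void)
{
	return model_nmi_desc != NULL;
}

/*
 * Mirrors set_smp_dynamic_ipi(): publish the descriptor only when the
 * simulated request succeeded (request_ok is an assumption of this model).
 */
void model_set_dynamic_ipi(int ipi, bool request_ok)
{
	if (!request_ok)
		return;

	model_desc_storage.irq = ipi;
	model_nmi_desc = &model_desc_storage;
}

/* Mirrors arm64_send_nmi(): do nothing while no descriptor is registered. */
bool model_send_nmi(unsigned long cpu_mask)
{
	(void)cpu_mask;	/* the model does not deliver anything */

	if (!model_nmi_desc)
		return false;

	model_sent++;
	return true;
}

/* Number of sends that passed the registration check. */
unsigned int model_sent_count(void)
{
	return model_sent;
}
```

In the model, a failed request leaves `model_supports_nmi()` false, so callers can fall back to another mechanism, which appears to be the intent behind exporting `arm64_supports_nmi()` for the later backtrace and kgdb patches.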

[PATCH OLK-6.6] memcg: support ksm merge any mode per cgroup
by Nanyong Sun 22 Dec '23

hulk inclusion
category: feature
bugzilla: https://gitee.com/openeuler/kernel/issues/I8OIQR

----------------------------------------------------------------------

Add control file "memory.ksm" to enable ksm per cgroup.
Writing 1 to the file puts all tasks currently in the cgroup into ksm
merge-any mode, which means ksm gets enabled for all VMAs of each
process. Writing 0 disables ksm for them and unmerges the merged pages.
Reading the file shows the above state and the ksm-related profits of
this cgroup.

Signed-off-by: Nanyong Sun <sunnanyong(a)huawei.com>
---
 .../admin-guide/cgroup-v1/memory.rst |   1 +
 mm/memcontrol.c                      | 110 +++++++++++++++++-
 2 files changed, 109 insertions(+), 2 deletions(-)

diff --git a/Documentation/admin-guide/cgroup-v1/memory.rst b/Documentation/admin-guide/cgroup-v1/memory.rst
index ff456871bf4b..3fdb48435e8e 100644
--- a/Documentation/admin-guide/cgroup-v1/memory.rst
+++ b/Documentation/admin-guide/cgroup-v1/memory.rst
@@ -109,6 +109,7 @@ Brief summary of control files.
  memory.kmem.tcp.failcnt             show the number of tcp buf memory usage
 				     hits limits
  memory.kmem.tcp.max_usage_in_bytes  show max tcp buf memory usage recorded
+ memory.ksm                          set/show ksm merge any mode
 ==================================== ==========================================
 
 1. History
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 2d9a873e5522..30cafc2f22c1 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -73,6 +73,7 @@
 #include <linux/uaccess.h>
 
 #include <trace/events/vmscan.h>
+#include <linux/ksm.h>
 
 struct cgroup_subsys memory_cgrp_subsys __read_mostly;
 EXPORT_SYMBOL(memory_cgrp_subsys);
@@ -242,10 +243,15 @@ enum res_type {
 	     iter != NULL;				\
 	     iter = mem_cgroup_iter(NULL, iter, NULL))
 
+static inline bool __task_is_dying(struct task_struct *task)
+{
+	return tsk_is_oom_victim(task) || fatal_signal_pending(task) ||
+		(task->flags & PF_EXITING);
+}
+
 static inline bool task_is_dying(void)
 {
-	return tsk_is_oom_victim(current) || fatal_signal_pending(current) ||
-		(current->flags & PF_EXITING);
+	return __task_is_dying(current);
 }
 
 /* Some nice accessors for the vmpressure. */
@@ -5100,6 +5106,98 @@ static int mem_cgroup_slab_show(struct seq_file *m, void *p)
 }
 #endif
 
+#ifdef CONFIG_KSM
+static int memcg_set_ksm_for_tasks(struct mem_cgroup *memcg, bool enable)
+{
+	struct task_struct *task;
+	struct mm_struct *mm;
+	struct css_task_iter it;
+	int ret = 0;
+
+	css_task_iter_start(&memcg->css, CSS_TASK_ITER_PROCS, &it);
+	while (!ret && (task = css_task_iter_next(&it))) {
+		if (__task_is_dying(task))
+			continue;
+
+		mm = get_task_mm(task);
+		if (!mm)
+			continue;
+
+		if (mmap_write_lock_killable(mm)) {
+			mmput(mm);
+			continue;
+		}
+
+		if (enable)
+			ret = ksm_enable_merge_any(mm);
+		else
+			ret = ksm_disable_merge_any(mm);
+
+		mmap_write_unlock(mm);
+		mmput(mm);
+	}
+	css_task_iter_end(&it);
+
+	return ret;
+}
+
+static int memory_ksm_show(struct seq_file *m, void *v)
+{
+	unsigned long ksm_merging_pages = 0;
+	unsigned long ksm_rmap_items = 0;
+	long ksm_process_profits = 0;
+	unsigned int tasks = 0;
+	struct task_struct *task;
+	struct mm_struct *mm;
+	struct css_task_iter it;
+	struct mem_cgroup *memcg = mem_cgroup_from_seq(m);
+
+	css_task_iter_start(&memcg->css, CSS_TASK_ITER_PROCS, &it);
+	while ((task = css_task_iter_next(&it))) {
+		mm = get_task_mm(task);
+		if (!mm)
+			continue;
+
+		if (test_bit(MMF_VM_MERGE_ANY, &mm->flags))
+			tasks++;
+
+		ksm_rmap_items += mm->ksm_rmap_items;
+		ksm_merging_pages += mm->ksm_merging_pages;
+		ksm_process_profits += ksm_process_profit(mm);
+		mmput(mm);
+	}
+	css_task_iter_end(&it);
+
+	seq_printf(m, "merge any tasks: %u\n", tasks);
+	seq_printf(m, "ksm_rmap_items %lu\n", ksm_rmap_items);
+	seq_printf(m, "ksm_merging_pages %lu\n", ksm_merging_pages);
+	seq_printf(m, "ksm_process_profits %ld\n", ksm_process_profits);
+	return 0;
+}
+
+static ssize_t memory_ksm_write(struct kernfs_open_file *of, char *buf,
+				size_t nbytes, loff_t off)
+{
+	bool enable;
+	int err;
+	struct mem_cgroup *memcg = mem_cgroup_from_css(of_css(of));
+
+	buf = strstrip(buf);
+	if (!buf)
+		return -EINVAL;
+
+	err = kstrtobool(buf, &enable);
+	if (err)
+		return err;
+
+	err = memcg_set_ksm_for_tasks(memcg, enable);
+	if (err)
+		return err;
+
+	return nbytes;
+}
+#endif /* CONFIG_KSM */
+
 static int memory_stat_show(struct seq_file *m, void *v);
 
 #ifdef CONFIG_MEMCG_V1_RECLAIM
@@ -5337,6 +5435,14 @@ static struct cftype mem_cgroup_legacy_files[] = {
 		.name = "reclaim",
 		.write = memory_reclaim,
 	},
+#endif
+#ifdef CONFIG_KSM
+	{
+		.name = "ksm",
+		.flags = CFTYPE_NOT_ON_ROOT,
+		.write = memory_ksm_write,
+		.seq_show = memory_ksm_show,
+	},
 #endif
 	{ },	/* terminate */
 };
--
2.25.1
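
From user space, the new control file would be driven by plain file I/O. The following is a minimal sketch of that usage, not part of the patch: the helper names are hypothetical, and the cgroup directory (for example a cgroup v1 memory controller mounted under /sys/fs/cgroup/memory) is an assumption supplied by the caller. The reader only parses the first line of the format printed by `memory_ksm_show()`.

```c
#include <stdio.h>

/* Build "<cgroup_dir>/memory.ksm"; returns 0 on success, -1 if too long. */
static int memcg_ksm_path(char *buf, size_t len, const char *cgroup_dir)
{
	int n = snprintf(buf, len, "%s/memory.ksm", cgroup_dir);

	return (n < 0 || (size_t)n >= len) ? -1 : 0;
}

/* Write "1" (enable) or "0" (disable) to the cgroup's memory.ksm file. */
int memcg_ksm_set(const char *cgroup_dir, int enable)
{
	char path[4096];
	FILE *f;

	if (memcg_ksm_path(path, sizeof(path), cgroup_dir))
		return -1;

	f = fopen(path, "w");
	if (!f)
		return -1;

	if (fputs(enable ? "1" : "0", f) == EOF) {
		fclose(f);
		return -1;
	}
	return fclose(f) == 0 ? 0 : -1;
}

/* Read the "merge any tasks: N" line that memory_ksm_show() prints first. */
int memcg_ksm_merge_any_tasks(const char *cgroup_dir, unsigned int *tasks)
{
	char path[4096];
	FILE *f;
	int matched;

	if (memcg_ksm_path(path, sizeof(path), cgroup_dir))
		return -1;

	f = fopen(path, "r");
	if (!f)
		return -1;

	matched = fscanf(f, "merge any tasks: %u", tasks);
	fclose(f);
	return matched == 1 ? 0 : -1;
}
```

Because `memory_ksm_write()` only walks the tasks that are in the cgroup at write time, a caller that adds tasks later would presumably need to write the file again.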

[PATCH OLK-6.6 0/3] Print rootfs and tmpfs files charged by memcg
by Jinjiang Tu 22 Dec '23

Support printing the rootfs and tmpfs files that have pages charged to a
given memory cgroup. The file information can be printed through the
interface "memory.memfs_files_info", or when OOM is triggered.

Jinjiang Tu (1):
  fs: move {lock, unlock}_mount_hash to fs/mount.h

Liu Shixin (2):
  mm/memcg_memfs_info: show files that having pages charged in mem_cgroup
  config: enable CONFIG_MEMCG_MEMFS_INFO by default

 Documentation/vm/memcg_memfs_info.rst  |  40 ++++
 arch/arm64/configs/openeuler_defconfig |   1 +
 arch/x86/configs/openeuler_defconfig   |   1 +
 fs/mount.h                             |  10 +
 fs/namespace.c                         |  10 -
 include/linux/memcg_memfs_info.h       |  23 ++
 init/Kconfig                           |  10 +
 mm/Makefile                            |   1 +
 mm/memcg_memfs_info.c                  | 318 +++++++++++++++++++++++++
 mm/memcontrol.c                        |  12 +
 10 files changed, 416 insertions(+), 10 deletions(-)
 create mode 100644 Documentation/vm/memcg_memfs_info.rst
 create mode 100644 include/linux/memcg_memfs_info.h
 create mode 100644 mm/memcg_memfs_info.c

--
2.25.1

[Cancel] openEuler Kernel SIG biweekly meeting
by openEuler conference 22 Dec '23

Sorry! The WeLink meeting scheduled by openEuler Kernel SIG for 2023-12-22 14:00 has been cancelled.

[Cancel] openEuler Kernel SIG biweekly meeting
by openEuler conference 22 Dec '23

Sorry! The Zoom meeting scheduled by openEuler Kernel SIG for 2022-11-04 14:00 has been cancelled.

openEuler Kernel SIG biweekly meeting
by openEuler conference 22 Dec '23

Hello!

openEuler Kernel SIG invites you to attend the WeLink conference (automatically recorded) to be held at 2023-12-22 14:00.

Subject: openEuler Kernel SIG biweekly meeting

Summary:
1. Progress update
2. Call for topics (new topics can be proposed by replying to this email, or added directly to the meeting board)

You can join the meeting at https://bmeeting.huaweicloud.com:36443/#/j/966610402.
Add topics at https://etherpad.openeuler.org/p/Kernel-meetings.
Note: You are advised to change the participant name after joining the conference, or use your ID at gitee.com.
More information: https://openeuler.org/en/

[PATCH OLK-6.6 v3 0/7] arm64: Add support to turn an IPI as NMI
by Liao Chen 22 Dec '23

With pseudo-NMI support available, it is possible to configure SGIs to be
triggered as pseudo NMIs running in NMI context. Kernel features that can
use this include:

- NMI backtrace can leverage an IPI turned into an NMI to get a backtrace of
  a CPU stuck in hard lockup using magic SYSRQ.
- kgdb relies on NMI support to round up CPUs which are stuck in hard lockup
  state with interrupts disabled.

This patch-set adds a framework to turn an IPI into an NMI, which can be
triggered as a pseudo NMI and in turn invokes registered NMI handlers.

After this patch-set we should be able to get a backtrace for a CPU stuck in
HARDLOCKUP.

Sumit Garg (7):
  arm64: Add framework to turn IPI as NMI
  irqchip/gic-v3: Enable support for SGIs to act as NMIs
  arm64: smp: Assign and setup an IPI as NMI
  nmi: backtrace: Allow runtime arch specific override
  arm64: ipi_nmi: Add support for NMI backtrace
  kgdb: Expose default CPUs roundup fallback mechanism
  arm64: kgdb: Roundup cpus using IPI as NMI

 arch/arm/include/asm/irq.h       |  2 +-
 arch/arm/kernel/smp.c            |  3 +-
 arch/arm64/include/asm/irq.h     |  6 +++
 arch/arm64/include/asm/nmi.h     | 17 +++++++
 arch/arm64/kernel/Makefile       |  2 +-
 arch/arm64/kernel/ipi_nmi.c      | 84 ++++++++++++++++++++++++++++++++
 arch/arm64/kernel/kgdb.c         | 18 +++++++
 arch/arm64/kernel/smp.c          |  8 +++
 arch/mips/include/asm/irq.h      |  2 +-
 arch/mips/kernel/process.c       |  3 +-
 arch/powerpc/include/asm/irq.h   |  2 +-
 arch/powerpc/kernel/stacktrace.c |  3 +-
 arch/sparc/include/asm/irq_64.h  |  2 +-
 arch/sparc/kernel/process_64.c   |  4 +-
 arch/x86/include/asm/irq.h       |  2 +-
 arch/x86/kernel/apic/hw_nmi.c    |  3 +-
 drivers/irqchip/irq-gic-v3.c     | 29 ++++++++---
 include/linux/kgdb.h             | 12 +++++
 include/linux/nmi.h              | 12 ++---
 kernel/debug/debug_core.c        |  8 ++-
 20 files changed, 194 insertions(+), 28 deletions(-)
 create mode 100644 arch/arm64/include/asm/nmi.h
 create mode 100644 arch/arm64/kernel/ipi_nmi.c

--
2.34.1

[PATCH v2 openEuler-1.0-LTS] umh: fix memory leak on execve failure
by Wenyu Huang 22 Dec '23

From: Vincent Minet <v.minet(a)criteo.com>

mainline inclusion
from mainline-v5.10-rc1
commit db803036ada7d61d096783726f9771b3fc540370
category: bugfix
bugzilla: https://gitee.com/openeuler/kernel/issues/I8J9MA
CVE: NA

Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?…

-------------------------------------------------

If a UMH process created by fork_usermode_blob() fails to execute,
a pair of struct file allocated by umh_pipe_setup() will leak.

Under normal conditions, the caller (like bpfilter) needs to manage the
lifetime of the UMH and its two pipes. But when fork_usermode_blob()
fails, the caller doesn't really have a way to know what needs to be
done. It seems better to do the cleanup ourselves in this case.

Fixes: 449325b52b7a ("umh: introduce fork_usermode_blob() helper")
Signed-off-by: Vincent Minet <v.minet(a)criteo.com>
Signed-off-by: Jakub Kicinski <kuba(a)kernel.org>
Signed-off-by: Wenyu Huang <huangwenyu5(a)huawei.com>
---
 kernel/umh.c | 6 ++++++
 1 file changed, 6 insertions(+)

diff --git a/kernel/umh.c b/kernel/umh.c
index 53611efb10cb..715b4368f9ed 100644
--- a/kernel/umh.c
+++ b/kernel/umh.c
@@ -495,6 +495,12 @@ static void umh_clean_and_save_pid(struct subprocess_info *info)
 {
 	struct umh_info *umh_info = info->data;
 
+	/* cleanup if umh_pipe_setup() was successful but exec failed */
+	if (info->pid && info->retval) {
+		fput(umh_info->pipe_to_umh);
+		fput(umh_info->pipe_from_umh);
+	}
+
 	argv_free(info->argv);
 	umh_info->pid = info->pid;
 }
--
2.34.1
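
The cleanup rule in this fix (release both pipe ends when setup succeeded but the exec step failed) can be mirrored in user space with ordinary file descriptors. This is a sketch for illustration only: `struct helper_pipes` and `helper_cleanup_on_failure()` are hypothetical names, and `close()` stands in for the kernel's `fput()`.

```c
#include <stdbool.h>
#include <unistd.h>

/* Hypothetical user-space stand-in for the two pipe ends kept in umh_info. */
struct helper_pipes {
	int to_helper;		/* write end of the pipe feeding the helper */
	int from_helper;	/* read end of the pipe coming from the helper */
};

/*
 * Mirrors the condition added to umh_clean_and_save_pid(): a non-zero pid
 * together with a non-zero retval is treated as "pipes were set up, but
 * exec failed", so the pipe ends are released here instead of being leaked.
 * Returns true when cleanup was performed.
 */
bool helper_cleanup_on_failure(struct helper_pipes *p, int pid, int retval)
{
	if (!(pid && retval))
		return false;

	close(p->to_helper);
	close(p->from_helper);
	p->to_helper = -1;
	p->from_helper = -1;
	return true;
}
```

On the success path (`retval == 0`) the descriptors are left untouched, matching the commit message's point that the caller owns the pipes under normal conditions.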

[PATCH OLK-6.6 0/3] files cgroups feature
by chenridong 22 Dec '23

Binder Makin (1):
  cgroups: Resource controller for open files

Hou Tao (1):
  cgroup/files: fix bugs as mentioned

Lu Jialin (1):
  enable CONFIG_CGROUP_FILES in openeuler_defconfig for x86 and arm64

 .../admin-guide/kernel-parameters.txt  |   7 +-
 arch/arm64/configs/openeuler_defconfig |   1 +
 arch/x86/configs/openeuler_defconfig   |   1 +
 fs/Makefile                            |   1 +
 fs/file.c                              |  68 +++-
 fs/filescontrol.c                      | 312 ++++++++++++++++++
 include/linux/cgroup-defs.h            |   8 +-
 include/linux/cgroup_subsys.h          |   4 +
 include/linux/fdtable.h                |   1 +
 include/linux/filescontrol.h           |  40 +++
 init/Kconfig                           |  10 +
 11 files changed, 445 insertions(+), 8 deletions(-)
 create mode 100644 fs/filescontrol.c
 create mode 100644 include/linux/filescontrol.h

--
2.34.1

[PATCH OLK-6.6 v2 0/7] arm64: Add support to turn an IPI as NMI
by Liao Chen 22 Dec '23

With pseudo-NMI support available, it is possible to configure SGIs to be
triggered as pseudo NMIs running in NMI context. Kernel features that can
use this include:

- NMI backtrace can leverage an IPI turned into an NMI to get a backtrace of
  a CPU stuck in hard lockup using magic SYSRQ.
- kgdb relies on NMI support to round up CPUs which are stuck in hard lockup
  state with interrupts disabled.

This patch-set adds a framework to turn an IPI into an NMI, which can be
triggered as a pseudo NMI and in turn invokes registered NMI handlers.

After this patch-set we should be able to get a backtrace for a CPU stuck in
HARDLOCKUP.

Sumit Garg (7):
  arm64: Add framework to turn IPI as NMI
  irqchip/gic-v3: Enable support for SGIs to act as NMIs
  arm64: smp: Assign and setup an IPI as NMI
  nmi: backtrace: Allow runtime arch specific override
  arm64: ipi_nmi: Add support for NMI backtrace
  kgdb: Expose default CPUs roundup fallback mechanism
  arm64: kgdb: Roundup cpus using IPI as NMI

 arch/arm/include/asm/irq.h       |  2 +-
 arch/arm/kernel/smp.c            |  3 +-
 arch/arm64/include/asm/irq.h     |  6 +++
 arch/arm64/include/asm/nmi.h     | 17 +++++++
 arch/arm64/kernel/Makefile       |  2 +-
 arch/arm64/kernel/ipi_nmi.c      | 84 ++++++++++++++++++++++++++++++++
 arch/arm64/kernel/kgdb.c         | 18 +++++++
 arch/arm64/kernel/smp.c          |  8 +++
 arch/mips/include/asm/irq.h      |  2 +-
 arch/mips/kernel/process.c       |  3 +-
 arch/powerpc/include/asm/irq.h   |  2 +-
 arch/powerpc/kernel/stacktrace.c |  3 +-
 arch/sparc/include/asm/irq_64.h  |  2 +-
 arch/sparc/kernel/process_64.c   |  4 +-
 arch/x86/include/asm/irq.h       |  2 +-
 arch/x86/kernel/apic/hw_nmi.c    |  3 +-
 drivers/irqchip/irq-gic-v3.c     | 29 ++++++++---
 include/linux/kgdb.h             | 12 +++++
 include/linux/nmi.h              | 12 ++---
 kernel/debug/debug_core.c        |  8 ++-
 20 files changed, 194 insertions(+), 28 deletions(-)
 create mode 100644 arch/arm64/include/asm/nmi.h
 create mode 100644 arch/arm64/kernel/ipi_nmi.c

--
2.34.1

[PATCH openEuler-22.03-LTS-SP2] tick/broadcast-hrtimer: Prevent the timer device on broadcast duty CPU from being disabled
by Yu Liao 22 Dec '23

hulk inclusion
category: bugfix
bugzilla: https://gitee.com/openeuler/kernel/issues/I8PL17
CVE: NA

----------------------------------------

It was found that running the LTP hotplug stress test on an aarch64
system could produce rcu_sched stall warnings.

The issue is the following:

CPU1 (owns the broadcast hrtimer)	CPU2

				tick_broadcast_enter()
				//shut down local timer device
				...
				tick_broadcast_exit()
				//exits with tick_broadcast_force_mask set,
				timer device remains disabled

				initiates offlining of CPU1
take_cpu_down()
//CPU1 shuts down and does not
send broadcast IPI anymore
				takedown_cpu()
				  hotplug_cpu__broadcast_tick_pull()
				  //move broadcast hrtimer to this CPU
				    clockevents_program_event()
				      bc_set_next()
					hrtimer_start()
					//does not call hrtimer_reprogram()
					to program timer device if expires
					equals dev->next_event, so the timer
					device remains disabled.

CPU2 takes over the broadcast duty but its local timer device is disabled,
causing many CPUs to become stuck.

Fix this by calling tick_program_event() to reprogram the local timer
device in this scenario.

Signed-off-by: Yu Liao <liaoyu15(a)huawei.com>
---
 kernel/time/tick-broadcast-hrtimer.c | 14 ++++++++++++++
 1 file changed, 14 insertions(+)

diff --git a/kernel/time/tick-broadcast-hrtimer.c b/kernel/time/tick-broadcast-hrtimer.c
index b5a65e212df2..b8e79f8012f3 100644
--- a/kernel/time/tick-broadcast-hrtimer.c
+++ b/kernel/time/tick-broadcast-hrtimer.c
@@ -42,6 +42,8 @@ static int bc_shutdown(struct clock_event_device *evt)
  */
 static int bc_set_next(ktime_t expires, struct clock_event_device *bc)
 {
+	struct clock_event_device *dev = __this_cpu_read(tick_cpu_device.evtdev);
+
 	/*
 	 * This is called either from enter/exit idle code or from the
 	 * broadcast handler. In all cases tick_broadcast_lock is held.
@@ -62,6 +64,18 @@ static int bc_set_next(ktime_t expires, struct clock_event_device *bc)
 	 * hrtimer_start() can call into tracing.
 	 */
 	RCU_NONIDLE( {
+
+		/*
+		 * This can be called from CPU offline operation to move broadcast
+		 * assignment. If tick_broadcast_force_mask is set, the CPU local
+		 * timer device may be disabled. And hrtimer_reprogram() will not
+		 * called if the timer is not the first expiring timer. Reprogram
+		 * the cpu local timer device to ensure we can take over the
+		 * broadcast duty.
+		 */
+		if (tick_check_broadcast_expired() && expires >= dev->next_event)
+			clockevents_program_event(dev, dev->next_event, 1);
+
 		hrtimer_start(&bctimer, expires, HRTIMER_MODE_ABS_PINNED_HARD);
 		/*
 		 * The core tick broadcast mode expects bc->bound_on to be set
--
2.33.0
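
The decision added to bc_set_next() can be pictured with a small user-space model. This is a deliberately simplified sketch, not kernel code: `struct model_clockevent` and `model_bc_set_next()` are hypothetical, `in_force_mask` stands in for `tick_check_broadcast_expired()`, and the second branch encodes the assumption (taken from the commit message) that starting the hrtimer only re-arms the device when the new expiry is earlier than the one already recorded.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical, simplified stand-in for the per-CPU clock event device. */
struct model_clockevent {
	int64_t next_event;	/* expiry the device was last asked for */
	bool programmed;	/* whether the hardware is actually armed */
};

/*
 * Models the patched bc_set_next(). Returns true when the explicit
 * reprogramming path added by the patch was taken.
 */
bool model_bc_set_next(struct model_clockevent *dev, bool in_force_mask,
		       int64_t expires)
{
	bool reprogrammed = false;

	/* New check: the local device may be disabled, so arm it explicitly. */
	if (in_force_mask && expires >= dev->next_event) {
		dev->programmed = true;
		reprogrammed = true;
	}

	/* Model assumption: the hrtimer path re-arms only for an earlier expiry. */
	if (expires < dev->next_event) {
		dev->next_event = expires;
		dev->programmed = true;
	}

	return reprogrammed;
}
```

With `in_force_mask` false and `expires` equal to `next_event`, the model leaves `programmed` false, which corresponds to the stuck state described in the commit message; with `in_force_mask` true the device is armed.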

[PATCH OLK-6.6 0/7] Support Features to Turn an IPI as NMI
by Liao Chen 22 Dec '23

With pseudo-NMI support available, it is possible to configure SGIs to be
triggered as pseudo NMIs running in NMI context. Kernel features that can
use this include:

- NMI backtrace can leverage an IPI turned into an NMI to get a backtrace of
  a CPU stuck in hard lockup using magic SYSRQ.
- kgdb relies on NMI support to round up CPUs which are stuck in hard lockup
  state with interrupts disabled.

This patch-set adds a framework to turn an IPI into an NMI, which can be
triggered as a pseudo NMI and in turn invokes registered NMI handlers.

After this patch-set we should be able to get a backtrace for a CPU stuck in
HARDLOCKUP.

Sumit Garg (7):
  arm64: Add framework to turn IPI as NMI
  irqchip/gic-v3: Enable support for SGIs to act as NMIs
  arm64: smp: Assign and setup an IPI as NMI
  nmi: backtrace: Allow runtime arch specific override
  arm64: ipi_nmi: Add support for NMI backtrace
  kgdb: Expose default CPUs roundup fallback mechanism
  arm64: kgdb: Roundup cpus using IPI as NMI

 arch/arm/include/asm/irq.h       |  2 +-
 arch/arm/kernel/smp.c            |  3 +-
 arch/arm64/include/asm/irq.h     |  6 +++
 arch/arm64/include/asm/nmi.h     | 17 +++++++
 arch/arm64/kernel/Makefile       |  4 +-
 arch/arm64/kernel/ipi_nmi.c      | 84 ++++++++++++++++++++++++++++++++
 arch/arm64/kernel/kgdb.c         | 18 +++++++
 arch/arm64/kernel/smp.c          |  8 +++
 arch/mips/include/asm/irq.h      |  2 +-
 arch/powerpc/kernel/stacktrace.c |  3 +-
 arch/sparc/include/asm/irq_64.h  |  2 +-
 arch/sparc/kernel/process_64.c   |  4 +-
 arch/x86/include/asm/irq.h       |  2 +-
 arch/x86/kernel/apic/hw_nmi.c    |  3 +-
 drivers/irqchip/irq-gic-v3.c     | 29 ++++++++---
 include/linux/kgdb.h             | 12 +++++
 include/linux/nmi.h              | 12 ++---
 kernel/debug/debug_core.c        |  8 ++-
 18 files changed, 192 insertions(+), 27 deletions(-)
 create mode 100644 arch/arm64/include/asm/nmi.h
 create mode 100644 arch/arm64/kernel/ipi_nmi.c

--
2.34.1

[PATCH OLK-6.6 v2 0/2] optimize inline
by Yuntao Liu 21 Dec '23

*** BLURB HERE ***

Guo Xuenan (2):
  Revert "compiler: remove CONFIG_OPTIMIZE_INLINING entirely"
  make OPTIMIZE_INLINING config editable

 arch/arm64/kvm/sys_regs.h                           |  5 +++++
 arch/x86/configs/i386_defconfig                     |  1 +
 arch/x86/configs/x86_64_defconfig                   |  1 +
 drivers/net/ethernet/mellanox/mlx5/core/mlx5_core.h |  5 +++++
 .../pci/hive_isp_css_include/print_support.h        |  4 ++++
 include/linux/compiler_types.h                      |  8 ++++++++
 include/trace/trace_events.h                        |  2 +-
 kernel/configs/tiny.config                          |  1 +
 lib/Kconfig.debug                                   | 13 +++++++++++++
 9 files changed, 39 insertions(+), 1 deletion(-)

--
2.34.1

[PATCH OLK-5.10] tick/broadcast-hrtimer: Prevent the timer device on broadcast duty CPU from being disabled
by Yu Liao 21 Dec '23

hulk inclusion
category: bugfix
bugzilla: https://gitee.com/openeuler/kernel/issues/I8PL17
CVE: NA

----------------------------------------

It was found that running the LTP hotplug stress test on an aarch64
system could produce rcu_sched stall warnings.

The issue is the following:

CPU1 (owns the broadcast hrtimer)	CPU2

				tick_broadcast_enter()
				//shut down local timer device
				...
				tick_broadcast_exit()
				//exits with tick_broadcast_force_mask set,
				timer device remains disabled

				initiates offlining of CPU1
take_cpu_down()
//CPU1 shuts down and does not
send broadcast IPI anymore
				takedown_cpu()
				  hotplug_cpu__broadcast_tick_pull()
				  //move broadcast hrtimer to this CPU
				    clockevents_program_event()
				      bc_set_next()
					hrtimer_start()
					//does not call hrtimer_reprogram()
					to program timer device if expires
					equals dev->next_event, so the timer
					device remains disabled.

CPU2 takes over the broadcast duty but its local timer device is disabled,
causing many CPUs to become stuck.

Fix this by calling tick_program_event() to reprogram the local timer
device in this scenario.

Signed-off-by: Yu Liao <liaoyu15(a)huawei.com>
---
 kernel/time/tick-broadcast-hrtimer.c | 14 ++++++++++++++
 1 file changed, 14 insertions(+)

diff --git a/kernel/time/tick-broadcast-hrtimer.c b/kernel/time/tick-broadcast-hrtimer.c
index b5a65e212df2..b8e79f8012f3 100644
--- a/kernel/time/tick-broadcast-hrtimer.c
+++ b/kernel/time/tick-broadcast-hrtimer.c
@@ -42,6 +42,8 @@ static int bc_shutdown(struct clock_event_device *evt)
  */
 static int bc_set_next(ktime_t expires, struct clock_event_device *bc)
 {
+	struct clock_event_device *dev = __this_cpu_read(tick_cpu_device.evtdev);
+
 	/*
 	 * This is called either from enter/exit idle code or from the
 	 * broadcast handler. In all cases tick_broadcast_lock is held.
@@ -62,6 +64,18 @@ static int bc_set_next(ktime_t expires, struct clock_event_device *bc)
 	 * hrtimer_start() can call into tracing.
 	 */
 	RCU_NONIDLE( {
+
+		/*
+		 * This can be called from CPU offline operation to move broadcast
+		 * assignment. If tick_broadcast_force_mask is set, the CPU local
+		 * timer device may be disabled. And hrtimer_reprogram() will not
+		 * called if the timer is not the first expiring timer. Reprogram
+		 * the cpu local timer device to ensure we can take over the
+		 * broadcast duty.
+		 */
+		if (tick_check_broadcast_expired() && expires >= dev->next_event)
+			clockevents_program_event(dev, dev->next_event, 1);
+
 		hrtimer_start(&bctimer, expires, HRTIMER_MODE_ABS_PINNED_HARD);
 		/*
 		 * The core tick broadcast mode expects bc->bound_on to be set
--
2.33.0

[PATCH OLK-6.6] timekeeping: Avoiding false sharing in field access of tk_core
by Yu Liao 21 Dec '23

From: Wang ShaoBo <bobo.shaobowang(a)huawei.com>

hulk inclusion
category: bugfix
bugzilla: https://gitee.com/openeuler/kernel/issues/I47W8L
CVE: NA

---------------------------

We detected a performance regression when running UnixBench and bisected
it to the patch 7e66740ad725 ("MPAM / ACPI: Refactoring MPAM init process
and set MPAM ACPI as entrance"). Comparing the two commits df5defd901ff
("KVM: X86: MMU: Use the correct inherited permissions to get shadow page")
and ac4dbb7554ef ("ACPI 6.x: Add definitions for MPAM table") gives the
following test results:

CMD: ./Run -c xx context1
RESULT:
+-------------UnixBench context1-----------+
+---------+--------------+-----------------+
+         + ac4dbb7554ef + df5defd901ff    +
+---------+--------------+-----------------+
+ Cores   + Score        + Score           +
+---------+--------------+-----------------+
+ 1       + 522.8        + 535.7           +
+---------+--------------+-----------------+
+ 24      + 11231.5      + 12111.2         +
+---------+--------------+-----------------+
+ 48      + 8535.1       + 8745.1          +
+---------+--------------+-----------------+
+ 72      + 10821.9      + 10343.8         +
+---------+--------------+-----------------+
+ 96      + 15238.5      + 42947.8         +
+---------+--------------+-----------------+

We found a clear difference in latency sampling when using the perf tool:

HEAD:ac4dbb7554ef                                  HEAD:df5defd901ff
45.18% [kernel] [k] ktime_get_coarse_real_ts64  -> 1.78% [kernel] [k] ktime_get_coarse_real_ts64
...
65.87 │ dmb ishld //smp_rmb()

Using ftrace we obtained the call trace and counted the calls to
ktime_get_coarse_real_ts64, which frequently accesses tk_core->seq and
tk_core->timekeeper->tkr_mono:

- 48.86% [kernel] [k] ktime_get_coarse_real_ts64
   - 5.76% ktime_get_coarse_real_ts64 #about 111437657 times per 10 seconds
      - 14.70% __audit_syscall_entry
           syscall_trace_enter
           el0_svc_common
           el0_svc_handler
         + el0_svc
      - 2.85% current_time

So the degradation may be caused by interference between accesses to
different fields. We compared the .bss and .data sections of the two
versions:

HEAD:ac4dbb7554ef
`-> ffff00000962e680 l O .bss  0000000000000110 tk_core
    ffff000009355680 l O .data 0000000000000078 tk_fast_mono
    ffff0000093557a0 l O .data 0000000000000090 dummy_clock
    ffff000009355700 l O .data 0000000000000078 tk_fast_raw
    ffff000009355778 l O .data 0000000000000028 timekeeping_syscore_ops
    ffff00000962e640 l O .bss  0000000000000008 cycles_at_suspend

HEAD:df5defd901ff
`-> ffff00000957dbc0 l O .bss  0000000000000110 tk_core
    ffff0000092b4e80 l O .data 0000000000000078 tk_fast_mono
    ffff0000092b4fa0 l O .data 0000000000000090 dummy_clock
    ffff0000092b4f00 l O .data 0000000000000078 tk_fast_raw
    ffff0000092b4f78 l O .data 0000000000000028 timekeeping_syscore_ops
    ffff00000957db80 l O .bss  0000000000000008 cycles_at_suspend

Comparing tk_core's address in the two versions, ffff00000962e680 is
128-byte aligned, whereas the address in df5defd901ff is only 64-byte
aligned, so the memory layout of tk_core has changed subtly:

HEAD:ac4dbb7554ef
`->                 |<--------former 64Bytes---------->|<------------latter 64Byte------------->|
0xffff00000957dbc0_>|<-seq 8Bytes->|<-tkr_mono 56Bytes->|<-tkr_raw 56Bytes->|<-xtime_sec 8Bytes->|
0xffff00000957dc00_>...

HEAD:df5defd901ff
`->                 |<------former 64Bytes---->|<------------latter 64Byte-------->|
0xffff00000962e680_>|<-Other variables 64Bytes->|<-seq 8Bytes->|<-tkr_mono 56Bytes->|
0xffff00000962e6c0_>..

We verified that the tkr_raw and xtime_sec fields interfere strongly with
the seq and tkr_mono fields because of frequent load/store operations;
this is the well-known false sharing problem.

We add a 64-byte padding field to tk_core, reserved for future use, and
keep tk_core 128-byte aligned, so that the layout of tk_core no longer
shifts. With this solution the layout of tk_core is always:

crash> struct -o tk_core_t
struct tk_core_t {
    [0] u64 padding[8];
   [64] seqcount_t seq;
   [72] struct timekeeper timekeeper;
}
SIZE: 336

crash> struct -o timekeeper
struct timekeeper {
    [0] struct tk_read_base tkr_mono;
   [56] struct tk_read_base tkr_raw;
  [112] u64 xtime_sec;
  [120] unsigned long ktime_sec;
  ...
}
SIZE: 264

After applying our solution:

+---------+--------------+
+         + Our solution +
+---------+--------------+
+ Cores   + Score        +
+---------+--------------+
+ 1       + 548.9        +
+---------+--------------+
+ 24      + 11018.3      +
+---------+--------------+
+ 48      + 8938.2       +
+---------+--------------+
+ 72      + 14610.7      +
+---------+--------------+
+ 96      + 40811.7      +
+---------+--------------+

Signed-off-by: Wang ShaoBo <bobo.shaobowang(a)huawei.com>
Signed-off-by: Yu Liao <liaoyu15(a)huawei.com>
---
 arch/arm64/Kconfig             | 9 +++++++++
 arch/arm64/include/asm/cache.h | 6 ++++++
 kernel/time/timekeeping.c      | 7 +++++++
 3 files changed, 22 insertions(+)

diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
index 2aca373a7038..f5559e38b243 100644
--- a/arch/arm64/Kconfig
+++ b/arch/arm64/Kconfig
@@ -1493,6 +1493,15 @@ config HW_PERF_EVENTS
 	def_bool y
 	depends on ARM_PMU
 
+config ARCH_LLC_128_LINE_SIZE
+	bool "Force 128 bytes alignment for fitting LLC cacheline"
+	depends on ARM64
+	default y
+	help
+	  As specific machine's LLC cacheline size may be up to
+	  128 bytes, gaining performance improvement from fitting
+	  128 Bytes LLC cache aligned.
+
 # Supported by clang >= 7.0 or GCC >= 12.0.0
 config CC_HAVE_SHADOW_CALL_STACK
 	def_bool $(cc-option, -fsanitize=shadow-call-stack -ffixed-x18)
diff --git a/arch/arm64/include/asm/cache.h b/arch/arm64/include/asm/cache.h
index 1613779be63a..8455df351ef8 100644
--- a/arch/arm64/include/asm/cache.h
+++ b/arch/arm64/include/asm/cache.h
@@ -8,6 +8,12 @@
 #define L1_CACHE_SHIFT		(6)
 #define L1_CACHE_BYTES		(1 << L1_CACHE_SHIFT)
 
+#ifdef CONFIG_ARCH_LLC_128_LINE_SIZE
+#ifndef ____cacheline_aligned_128
+#define ____cacheline_aligned_128 __attribute__((__aligned__(128)))
+#endif
+#endif
+
 #define CLIDR_LOUU_SHIFT	27
 #define CLIDR_LOC_SHIFT		24
 #define CLIDR_LOUIS_SHIFT	21
diff --git a/kernel/time/timekeeping.c b/kernel/time/timekeeping.c
index 266d02809dbb..35b366addf07 100644
--- a/kernel/time/timekeeping.c
+++ b/kernel/time/timekeeping.c
@@ -48,9 +48,16 @@ DEFINE_RAW_SPINLOCK(timekeeper_lock);
  * cache line.
  */
 static struct {
+#ifdef CONFIG_ARCH_LLC_128_LINE_SIZE
+	u64 padding[8];
+#endif
 	seqcount_raw_spinlock_t	seq;
 	struct timekeeper	timekeeper;
+#ifdef CONFIG_ARCH_LLC_128_LINE_SIZE
+} tk_core ____cacheline_aligned_128 = {
+#else
 } tk_core ____cacheline_aligned = {
+#endif
 	.seq = SEQCNT_RAW_SPINLOCK_ZERO(tk_core.seq, &timekeeper_lock),
 };
 
--
2.33.0
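The effect of the padding described in this patch can be checked outside the kernel with a small userspace sketch. Everything below is an illustration under stated assumptions, not the kernel code: `struct core_demo` and `struct read_base_demo` are hypothetical stand-ins that only mirror the field sizes quoted in the commit message (8-byte seq, 56-byte tk_read_base), and `LLC_LINE` assumes a 128-byte last-level cache line.

```c
#include <assert.h>
#include <stdalign.h>
#include <stddef.h>
#include <stdint.h>

#define LLC_LINE 128	/* assumed LLC cache line size in bytes */

/* Hypothetical stand-in for struct tk_read_base: 56 bytes. */
struct read_base_demo {
	uint64_t words[7];
};

/* Hypothetical stand-in for tk_core with the 64-byte padding in front. */
struct core_demo {
	alignas(LLC_LINE) uint64_t padding[8];	/* [0]   reserved */
	uint64_t seq;				/* [64]  read-mostly, hot */
	struct read_base_demo tkr_mono;		/* [72]  read-mostly, hot */
	struct read_base_demo tkr_raw;		/* [128] first field of the next line */
	uint64_t xtime_sec;			/* [184] */
};

/* Index of the 128-byte line an offset falls into, for a 128-byte aligned object. */
static inline size_t line_of(size_t offset)
{
	return offset / LLC_LINE;
}

/* True if addr is a multiple of align (align must be a power of two). */
static inline int is_aligned(uint64_t addr, uint64_t align)
{
	return (addr & (align - 1)) == 0;
}

/* Compile-time checks: the object is 128-byte aligned and seq sits at offset 64. */
_Static_assert(alignof(struct core_demo) == LLC_LINE, "object must be 128-byte aligned");
_Static_assert(offsetof(struct core_demo, seq) == 64, "seq follows the padding");
_Static_assert(offsetof(struct core_demo, tkr_raw) == 128, "tkr_raw starts a new line");
```

With this layout, seq and tkr_mono share the first 128-byte line only with the unused padding, while tkr_raw and xtime_sec fall into the second line. `is_aligned()` can also be applied to the two tk_core addresses from the symbol listings: ffff00000962e680 is a multiple of 128, while ffff00000957dbc0 is a multiple of 64 but not of 128.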

[PATCH OLK-6.6 0/2] support CLOCKSOURCE_VALIDATE_LAST_CYCLE on
by Yu Liao 21 Dec '23

Yu Liao (2):
  timekeeping: Make CLOCKSOURCE_VALIDATE_LAST_CYCLE configurable
  config: make CLOCKSOURCE_VALIDATE_LAST_CYCLE not set by default

 arch/arm64/configs/openeuler_defconfig |  1 +
 kernel/time/Kconfig                    | 13 ++++++++-----
 2 files changed, 9 insertions(+), 5 deletions(-)

--
2.33.0

[PATCH OLK-6.6 v4 0/1] stop_machine: mask pseudo nmi before running the callback
by Yuntao Liu 21 Dec '23

*** BLURB HERE ***

Wei Li (1):
  stop_machine: mask pseudo nmi before running the callback

 arch/arm64/include/asm/arch_gicv3.h | 12 ++++++++++++
 kernel/stop_machine.c               |  3 +++
 2 files changed, 15 insertions(+)

--
2.34.1

[PATCH OLK-6.6 0/3] Introduce CPU inspect feature
by Yu Liao 21 Dec '23

This patch series introduces the CPU-inspect feature. CPU-inspect is
designed to provide a framework for early detection of SDC by proactively
executing CPU inspection test cases.

Silent Data Corruption (SDC), sometimes referred to as Silent Data Error
(SDE), is an industry-wide issue impacting not only long-protected memory,
storage, and networking, but also computer CPUs. As with software issues,
hardware-induced SDC can contribute to data loss and corruption. An SDC
occurs when an impacted CPU inadvertently causes errors in the data it
processes. For example, an impacted CPU might miscalculate data (i.e.,
1+1=3). There may be no indication of these computational errors unless
the software systematically checks for errors [1].

SDC issues have been around for many years, but as chips have become more
advanced and compact in size, the transistors and lines have become so
tiny that small electrical fluctuations can cause errors. Most of these
errors are caused by defects during manufacturing and are screened out by
the vendors; others are caught by hardware error detection or correction.
However, some errors go undetected by hardware; therefore only detection
software can protect against such errors [1].

[1] https://support.google.com/cloud/answer/10759085

Yu Liao (3):
  cpuinspect: add CPU-inspect infrastructure
  cpuinspect: add ATF inspector
  openeuler_defconfig: enable CPU inspect for arm64 by default

 arch/arm64/configs/openeuler_defconfig |   7 +
 drivers/Kconfig                        |   2 +
 drivers/Makefile                       |   1 +
 drivers/cpuinspect/Kconfig             |  24 +++
 drivers/cpuinspect/Makefile            |   7 +
 drivers/cpuinspect/cpuinspect.c        | 170 ++++++++++++++++
 drivers/cpuinspect/cpuinspect.h        |  46 +++++
 drivers/cpuinspect/inspector-atf.c     |  81 ++++++++
 drivers/cpuinspect/inspector.c         | 124 ++++++++++++
 drivers/cpuinspect/sysfs.c             | 258 +++++++++++++++++++++++++
 include/linux/cpuinspect.h             |  40 ++++
 11 files changed, 760 insertions(+)
 create mode 100644 drivers/cpuinspect/Kconfig
 create mode 100644 drivers/cpuinspect/Makefile
 create mode 100644 drivers/cpuinspect/cpuinspect.c
 create mode 100644 drivers/cpuinspect/cpuinspect.h
 create mode 100644 drivers/cpuinspect/inspector-atf.c
 create mode 100644 drivers/cpuinspect/inspector.c
 create mode 100644 drivers/cpuinspect/sysfs.c
 create mode 100644 include/linux/cpuinspect.h

--
2.33.0
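The idea of software that "systematically checks for errors" can be sketched as a known-answer test: run a deterministic computation and compare the result with a value computed in advance. The code below is only a userspace illustration of that idea. It is not the cpuinspect interface from this series; `struct inspect_case`, `run_inspection()` and the two test cases are hypothetical names chosen for the example.

```c
#include <stddef.h>
#include <stdint.h>

/* One hypothetical test case: a pure computation plus its known-good answer. */
struct inspect_case {
	const char *name;
	uint64_t (*run)(void);
	uint64_t expected;
};

/* Sum 1..1000; volatile keeps the compiler from folding the loop away. */
static uint64_t case_sum(void)
{
	volatile uint64_t acc = 0;

	for (uint64_t i = 1; i <= 1000; i++)
		acc += i;
	return acc;
}

/* A single multiplication on values the compiler cannot constant-fold. */
static uint64_t case_mul(void)
{
	volatile uint64_t a = 123456789, b = 1000;

	return a * b;
}

static const struct inspect_case demo_cases[] = {
	{ "sum_1_to_1000", case_sum, 500500 },
	{ "mul_const",     case_mul, 123456789000ULL },
};

/* Run every case and return how many results differ from the expected value. */
static size_t run_inspection(const struct inspect_case *cases, size_t n)
{
	size_t failures = 0;

	for (size_t i = 0; i < n; i++)
		if (cases[i].run() != cases[i].expected)
			failures++;
	return failures;
}
```

On a healthy CPU `run_inspection(demo_cases, 2)` is expected to return 0. A non-zero count would mean that a computation silently produced a wrong value, which is the situation the cover letter describes. A real inspector would presumably pin the test to each CPU in turn and exercise far more of the instruction set than this sketch does.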

[PATCH OLK-6.6 0/2] Introduce clear freelist feature
by Yu Liao 21 Dec '23

Yu Liao (2):
  mm: Add sysctl to clear free list pages
  config: make CONFIG_CLEAR_FREELIST_PAGE not set by default

 .../admin-guide/kernel-parameters.txt   |   3 +
 Documentation/admin-guide/sysctl/vm.rst |  13 ++
 arch/arm64/configs/openeuler_defconfig  |   1 +
 mm/Kconfig                              |  12 ++
 mm/Makefile                             |   2 +
 mm/clear_freelist_page.c                | 178 ++++++++++++++++++
 6 files changed, 209 insertions(+)
 create mode 100644 mm/clear_freelist_page.c

--
2.33.0

[PATCH OLK-6.6 v3 0/1] stop_machine: mask pseudo nmi before running the callback
by Yuntao Liu 21 Dec '23

From: Jinjie Ruan <ruanjinjie(a)huawei.com>

*** BLURB HERE ***

Wei Li (1):
  stop_machine: mask pseudo nmi before running the callback

 arch/arm64/include/asm/arch_gicv3.h | 12 ++++++++++++
 kernel/stop_machine.c               |  3 +++
 2 files changed, 15 insertions(+)

--
2.34.1

[PATCH OLK-6.6 0/2] Introduce clear freelist feature
by Yu Liao 21 Dec '23

Yu Liao (2):
  mm: Add sysctl to clear free list pages
  config: make CONFIG_CLEAR_FREELIST_PAGE not set by default

 .../admin-guide/kernel-parameters.txt   |   3 +
 Documentation/admin-guide/sysctl/vm.rst |  13 ++
 arch/arm64/configs/openeuler_defconfig  |   1 +
 mm/Kconfig                              |  13 ++
 mm/Makefile                             |   2 +
 mm/clear_freelist_page.c                | 178 ++++++++++++++++++
 6 files changed, 210 insertions(+)
 create mode 100644 mm/clear_freelist_page.c

--
2.33.0

[PATCH OLK-6.6 v2 0/1] stop_machine: mask pseudo nmi before running the callback
by Yuntao Liu 21 Dec '23

*** BLURB HERE ***

Wei Li (1):
  stop_machine: mask pseudo nmi before running the callback

 arch/arm64/include/asm/arch_gicv3.h | 12 ++++++++++++
 kernel/stop_machine.c               |  3 +++
 2 files changed, 15 insertions(+)

--
2.34.1

[PATCH OLK-6.6 0/3] cgroup v1 writeback
by chenridong 21 Dec '23

Lu Jialin (2):
  cgroup: Factor out __cgroup_get_from_id() for cgroup v1
  openeuler_defconfig: enable CONFIG_CGROUP_V1_WRITEBACK in openeuler_defconfig for x86 and arm64

chenridong (1):
  cgroup: support cgroup writeback on cgroupv1

 arch/arm64/configs/openeuler_defconfig |   1 +
 arch/x86/configs/openeuler_defconfig   |   1 +
 block/blk-cgroup.c                     |   3 +
 block/blk-cgroup.h                     |   3 +
 include/linux/backing-dev.h            |  29 ++++++-
 include/linux/cgroup-defs.h            |   2 +
 include/linux/cgroup.h                 |   1 +
 include/linux/memcontrol.h             |   5 ++
 init/Kconfig                           |   5 ++
 kernel/cgroup/cgroup.c                 |  31 +++++--
 mm/backing-dev.c                       | 116 ++++++++++++++++++++++++-
 mm/memcontrol.c                        |  83 +++++++++++++++++-
 12 files changed, 267 insertions(+), 13 deletions(-)

--
2.34.1

[PATCH OLK-6.6 0/3] Revert "compiler: remove CONFIG_OPTIMIZE_INLINING entirely"
by Yuntao Liu 21 Dec '23

*** BLURB HERE ***

Guo Xuenan (3):
  Revert "compiler: remove CONFIG_OPTIMIZE_INLINING entirely"
  make OPTIMIZE_INLINING config editable
  disable OPTIMIZE_INLINING by default

 arch/arm64/kvm/sys_regs.h                           |  5 +++++
 arch/x86/configs/i386_defconfig                     |  1 +
 arch/x86/configs/x86_64_defconfig                   |  1 +
 drivers/net/ethernet/mellanox/mlx5/core/mlx5_core.h |  5 +++++
 .../pci/hive_isp_css_include/print_support.h        |  4 ++++
 include/linux/compiler_types.h                      |  8 ++++++++
 include/trace/trace_events.h                        |  2 +-
 kernel/configs/tiny.config                          |  1 +
 lib/Kconfig.debug                                   | 13 +++++++++++++
 9 files changed, 39 insertions(+), 1 deletion(-)

--
2.34.1
