From: Ye Bin <yebin10@huawei.com>
mainline inclusion
from mainline-v5.13-rc1
commit ac2f7ca51b0929461ea49918f27c11b680f28995
category: bugfix
bugzilla: 182973
CVE: NA
-------------------------------------------------
Before commit 014c9caa29d3 ("ext4: make ext4_abort() use __ext4_error()"), the following series of commands would trigger a panic:
1. mount /dev/sda -o ro,errors=panic test
2. mount /dev/sda -o remount,abort test
After commit 014c9caa29d3, remounting a file system using the test mount option "abort" will no longer trigger a panic. This commit restores the behaviour to what it was immediately before commit 014c9caa29d3. (However, note that the Linux kernel's behavior has not been consistent here; some previous kernel versions, including 5.4 and 4.19, similarly did not panic after using the mount option "abort".)
This also makes a change to long-standing behaviour; namely, the following series of commands will now cause a panic, when previously it did not:
1. mount /dev/sda -o ro,errors=panic test
2. echo test > /sys/fs/ext4/sda/trigger_fs_error
However, this makes ext4's behaviour much more consistent, so this is a good thing.
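For context, a simplified sketch of the ordering this patch establishes in ext4_handle_error() (condensed from the diff below, not the verbatim fs/ext4/super.c code): the errors=panic check now runs before the read-only and errors=continue bail-outs, so an abort on a read-only mount still reaches panic().

	/* sketch: post-patch ordering inside ext4_handle_error() */
	if (test_opt(sb, ERRORS_PANIC) && !system_going_down())
		panic("EXT4-fs (device %s): panic forced after error\n",
		      sb->s_id);

	if (sb_rdonly(sb))	/* already read-only: nothing left to do */
		return;

	if (continue_fs)	/* errors=continue: keep going without remount */
		goto out;

	ext4_msg(sb, KERN_CRIT, "Remounting filesystem read-only");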
Cc: stable@kernel.org
Fixes: 014c9caa29d3 ("ext4: make ext4_abort() use __ext4_error()")
Signed-off-by: Ye Bin <yebin10@huawei.com>
Link: https://lore.kernel.org/r/20210401081903.3421208-1-yebin10@huawei.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Signed-off-by: Zheng Liang <zhengliang6@huawei.com>
Reviewed-by: Zhang Yi <yi.zhang@huawei.com>
Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
---
 fs/ext4/super.c | 12 ++++++------
 1 file changed, 6 insertions(+), 6 deletions(-)
diff --git a/fs/ext4/super.c b/fs/ext4/super.c
index a051671f7cb89..5a58f72ac2090 100644
--- a/fs/ext4/super.c
+++ b/fs/ext4/super.c
@@ -652,12 +652,6 @@ static void ext4_handle_error(struct super_block *sb, bool force_ro, int error,
 		ext4_commit_super(sb);
 	}
 
-	if (sb_rdonly(sb))
-		return;
-
-	if (continue_fs)
-		goto out;
-
 	/*
 	 * We force ERRORS_RO behavior when system is rebooting. Otherwise we
 	 * could panic during 'reboot -f' as the underlying device got already
@@ -668,6 +662,12 @@ static void ext4_handle_error(struct super_block *sb, bool force_ro, int error,
 			sb->s_id);
 	}
 
+	if (sb_rdonly(sb))
+		return;
+
+	if (continue_fs)
+		goto out;
+
 	ext4_msg(sb, KERN_CRIT, "Remounting filesystem read-only");
 	/*
 	 * Make sure updated value of ->s_mount_flags will be visible before
From: zhangguijiang <zhangguijiang@huawei.com>
ascend inclusion
category: bugfix
bugzilla: https://gitee.com/openeuler/kernel/issues/I4JMQA
----------
Hi everyone,

This patch registers a setup function for the ascend gpio driver. With this patch, we can enable the ascend gpio function independently, so that we can turn this feature on or off through bootargs.
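With the __setup() hook added below, the feature is toggled from the boot loader: appending the bare token enable_ascend_gpio_dwapb to the kernel command line flips the flag before the driver probes. The command line here is only an assumed example; everything other than the final token is board-specific:

	console=ttyAMA0 root=/dev/sda2 enable_ascend_gpio_dwapb

The handler returns 1 so the option is treated as handled by the kernel instead of being passed on to init as an unrecognized argument.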
Signed-off-by: zhangguijiang <zhangguijiang@huawei.com>
Reviewed-by: Ding Tianhong <dingtianhong@huawei.com>
Reviewed-by: Weilong Chen <chenweilong@huawei.com>
Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
---
 drivers/gpio/gpio-dwapb.c | 8 ++++++++
 1 file changed, 8 insertions(+)
diff --git a/drivers/gpio/gpio-dwapb.c b/drivers/gpio/gpio-dwapb.c
index 2cdbb4a11075f..53eb5c506fea0 100644
--- a/drivers/gpio/gpio-dwapb.c
+++ b/drivers/gpio/gpio-dwapb.c
@@ -71,6 +71,14 @@ bool enable_ascend_mini_gpio_dwapb;
 bool enable_ascend_gpio_dwapb;
 struct dwapb_gpio;
 
+static int __init enable_ascend_gpio_dwapb_setup(char *str)
+{
+	enable_ascend_gpio_dwapb = true;
+
+	return 1;
+}
+__setup("enable_ascend_gpio_dwapb", enable_ascend_gpio_dwapb_setup);
+
 #ifdef CONFIG_PM_SLEEP
 /* Store GPIO context across system-wide suspend/resume transitions */
 struct dwapb_context {
From: Qian Cai <cai@lca.pw>
mainline inclusion
from mainline-v5.9-rc1
commit e0e3f42fd96cf830c61aa720f0abb4eef41c5ce3
category: bugfix
bugzilla: 85986
CVE: NA
-------------------------------------------------------------------
struct mem_cgroup_per_node mz.lru_zone_size[zone_idx][lru] could be accessed concurrently as noticed by KCSAN,
BUG: KCSAN: data-race in lruvec_lru_size / mem_cgroup_update_lru_size
write to 0xffff9c804ca285f8 of 8 bytes by task 50951 on cpu 12:
 mem_cgroup_update_lru_size+0x11c/0x1d0
 mem_cgroup_update_lru_size at mm/memcontrol.c:1266
 isolate_lru_pages+0x6a9/0xf30
 shrink_active_list+0x123/0xcc0
 shrink_lruvec+0x8fd/0x1380
 shrink_node+0x317/0xd80
 do_try_to_free_pages+0x1f7/0xa10
 try_to_free_pages+0x26c/0x5e0
 __alloc_pages_slowpath+0x458/0x1290
 __alloc_pages_nodemask+0x3bb/0x450
 alloc_pages_vma+0x8a/0x2c0
 do_anonymous_page+0x170/0x700
 __handle_mm_fault+0xc9f/0xd00
 handle_mm_fault+0xfc/0x2f0
 do_page_fault+0x263/0x6f9
 page_fault+0x34/0x40

read to 0xffff9c804ca285f8 of 8 bytes by task 50964 on cpu 95:
 lruvec_lru_size+0xbb/0x270
 mem_cgroup_get_zone_lru_size at include/linux/memcontrol.h:536
 (inlined by) lruvec_lru_size at mm/vmscan.c:326
 shrink_lruvec+0x1d0/0x1380
 shrink_node+0x317/0xd80
 do_try_to_free_pages+0x1f7/0xa10
 try_to_free_pages+0x26c/0x5e0
 __alloc_pages_slowpath+0x458/0x1290
 __alloc_pages_nodemask+0x3bb/0x450
 alloc_pages_current+0xa6/0x120
 alloc_slab_page+0x3b1/0x540
 allocate_slab+0x70/0x660
 new_slab+0x46/0x70
 ___slab_alloc+0x4ad/0x7d0
 __slab_alloc+0x43/0x70
 kmem_cache_alloc+0x2c3/0x420
 getname_flags+0x4c/0x230
 getname+0x22/0x30
 do_sys_openat2+0x205/0x3b0
 do_sys_open+0x9a/0xf0
 __x64_sys_openat+0x62/0x80
 do_syscall_64+0x91/0xb47
 entry_SYSCALL_64_after_hwframe+0x49/0xbe

Reported by Kernel Concurrency Sanitizer on:
CPU: 95 PID: 50964 Comm: cc1 Tainted: G W O L 5.5.0-next-20200204+ #6
Hardware name: HPE ProLiant DL385 Gen10/ProLiant DL385 Gen10, BIOS A40 07/10/2019
The write is done under lru_lock, but the read is lockless. The scan count is used to determine how aggressively the anon and file LRU lists should be scanned. Load tearing could generate an inefficient heuristic, so fix it by adding READ_ONCE() for the read.
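As a minimal sketch of the pattern being applied here (condensed, not the exact include/linux/memcontrol.h code in the diff below): the writer updates the counter while holding lru_lock, and the lockless reader marks its load with READ_ONCE() so the compiler cannot tear, fuse, or refetch it.

	/* writer side: runs under lru_lock (unchanged by this patch) */
	mz->lru_zone_size[zone_idx][lru] += nr_pages;

	/* reader side: lockless, so annotate the load */
	size = READ_ONCE(mz->lru_zone_size[zone_idx][lru]);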
Signed-off-by: Qian Cai <cai@lca.pw>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
Link: http://lkml.kernel.org/r/20200206034945.2481-1-cai@lca.pw
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Lu Jialin <lujialin4@huawei.com>
Reviewed-by: Xiu Jianfeng <xiujianfeng@huawei.com>
Reviewed-by: weiyang wang <wangweiyang2@huawei.com>
Reviewed-by: Kefeng Wang <wangkefeng.wang@huawei.com>
Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
---
 include/linux/memcontrol.h | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index b4a25979ee8c7..4517d132d1e26 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -544,7 +544,7 @@ unsigned long mem_cgroup_get_zone_lru_size(struct lruvec *lruvec,
 	struct mem_cgroup_per_node *mz;
 
 	mz = container_of(lruvec, struct mem_cgroup_per_node, lruvec);
-	return mz->lru_zone_size[zone_idx][lru];
+	return READ_ONCE(mz->lru_zone_size[zone_idx][lru]);
 }
 
void mem_cgroup_handle_over_high(void);
From: Yafang Shao <laoar.shao@gmail.com>
mainline inclusion
from mainline-v5.7-rc5
commit 11d6761218d19ca06ae5387f4e3692c4fa9e7493
category: bugfix
bugzilla: 91659
CVE: NA
------------------------------------------------------------------
When I ran my memcg testcase, which creates lots of memcgs, I found unexpected out-of-memory logs even though there was still plenty of free memory available. The error log is
mkdir: cannot create directory 'foo.65533': Cannot allocate memory
The reason is that when we try to create more than MEM_CGROUP_ID_MAX memcgs, mem_cgroup_css_alloc() sets the errno to -ENOMEM, but the right errno should be -ENOSPC ("No space left on device"), which is the appropriate errno for userspace's failed mkdir.
As the errno really misled me, we should make it right. After this patch, the error log will be
mkdir: cannot create directory 'foo.65533': No space left on device
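The fix follows the usual ERR_PTR() convention so that the real errno set deep in the allocation path survives up to the mkdir caller. A minimal, self-contained sketch of the pattern (foo, foo_idr and FOO_ID_MAX are illustrative names, not the actual mm/memcontrol.c symbols):

	#include <linux/err.h>
	#include <linux/idr.h>
	#include <linux/slab.h>

	#define FOO_ID_MAX	0xffff			/* illustrative limit */

	struct foo { int id; };				/* illustrative type */
	static DEFINE_IDR(foo_idr);

	static struct foo *foo_alloc(void)
	{
		struct foo *p = kzalloc(sizeof(*p), GFP_KERNEL);
		int id;

		if (!p)
			return ERR_PTR(-ENOMEM);

		id = idr_alloc(&foo_idr, p, 1, FOO_ID_MAX, GFP_KERNEL);
		if (id < 0) {
			kfree(p);
			return ERR_PTR(id);	/* -ENOSPC once the ID space is exhausted */
		}
		p->id = id;
		return p;
	}

The caller then forwards whatever errno was encoded instead of replacing it with a blanket -ENOMEM:

	struct foo *p = foo_alloc();

	if (IS_ERR(p))
		return ERR_CAST(p);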
[akpm@linux-foundation.org: s/EBUSY/ENOSPC/, per Michal]
Fixes: 73f576c04b94 ("mm: memcontrol: fix cgroup creation failure after many small jobs")
Suggested-by: Matthew Wilcox <willy@infradead.org>
Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Acked-by: Michal Hocko <mhocko@kernel.org>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
Link: http://lkml.kernel.org/r/20200407063621.GA18914@dhcp22.suse.cz
Link: http://lkml.kernel.org/r/1586192163-20099-1-git-send-email-laoar.shao@gmail....
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Conflict: mm/memcontrol.c
Signed-off-by: Lu Jialin <lujialin4@huawei.com>
Reviewed-by: Xiu Jianfeng <xiujianfeng@huawei.com>
Reviewed-by: weiyang wang <wangweiyang2@huawei.com>
Reviewed-by: Kefeng Wang <wangkefeng.wang@huawei.com>
Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
---
 mm/memcontrol.c | 15 +++++++++------
 1 file changed, 9 insertions(+), 6 deletions(-)
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 0bccf2b8cc599..e4f90577571dc 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -4923,20 +4923,23 @@ static struct mem_cgroup *mem_cgroup_alloc(void)
 	struct mem_cgroup_extension *memcg_ext;
 	size_t size;
 	int node;
+	long error = -ENOMEM;
 
 	size = sizeof(struct mem_cgroup_extension);
 	size += nr_node_ids * sizeof(struct mem_cgroup_per_node *);
 
 	memcg_ext = kzalloc(size, GFP_KERNEL);
 	if (!memcg_ext)
-		return NULL;
+		return ERR_PTR(error);
 
 	memcg = &memcg_ext->memcg;
 	memcg->id.id = idr_alloc(&mem_cgroup_idr, NULL,
 				 1, MEM_CGROUP_ID_MAX,
 				 GFP_KERNEL);
-	if (memcg->id.id < 0)
+	if (memcg->id.id < 0) {
+		error = memcg->id.id;
 		goto fail;
+	}
 
 	memcg_ext->vmstats_local = alloc_percpu(struct mem_cgroup_stat_cpu);
 	if (!memcg_ext->vmstats_local)
@@ -4978,7 +4981,7 @@ static struct mem_cgroup *mem_cgroup_alloc(void)
 fail:
 	mem_cgroup_id_remove(memcg);
 	__mem_cgroup_free(memcg);
-	return NULL;
+	return ERR_PTR(error);
 }
 
 static struct cgroup_subsys_state * __ref
@@ -4989,8 +4992,8 @@ mem_cgroup_css_alloc(struct cgroup_subsys_state *parent_css)
 	long error = -ENOMEM;
 
 	memcg = mem_cgroup_alloc();
-	if (!memcg)
-		return ERR_PTR(error);
+	if (IS_ERR(memcg))
+		return ERR_CAST(memcg);
 
 	memcg->high = PAGE_COUNTER_MAX;
 	memcg->soft_limit = PAGE_COUNTER_MAX;
@@ -5037,7 +5040,7 @@ mem_cgroup_css_alloc(struct cgroup_subsys_state *parent_css)
 fail:
 	mem_cgroup_id_remove(memcg);
 	mem_cgroup_free(memcg);
-	return ERR_PTR(-ENOMEM);
+	return ERR_PTR(error);
 }
 
static int mem_cgroup_css_online(struct cgroup_subsys_state *css)
From: Johannes Weiner <hannes@cmpxchg.org>
mainline inclusion
from mainline-v5.13-rc1
commit a3d4c05a447486b90298a8c964916c8f4fcb903f
category: bugfix
bugzilla: 185803
CVE: NA
--------------------------------------------------
Patch series "mm: memcontrol: switch to rstat", v3.
This series converts memcg stats tracking to the streamlined rstat infrastructure provided by the cgroup core code. rstat is already used by the CPU controller and the IO controller. This change is motivated by recent accuracy problems in memcg's custom stats code, as well as the benefits of sharing common infra with other controllers.
The current memcg implementation does batched tree aggregation on the write side: local stat changes are cached in per-cpu counters, which are then propagated upward in batches when a threshold (32 pages) is exceeded. This is cheap, but the error introduced by the lazy upward propagation adds up: 32 pages times CPUs times cgroups in the subtree. We've had complaints from service owners that the stats do not reliably track and react to allocation behavior as expected, sometimes swallowing the results of entire test applications.
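To put a number on that error bound: assuming 4 KiB pages, a 64-CPU machine, and 100 cgroups in the subtree (illustrative values, not taken from a report), the worst-case unflushed drift is

	32 pages * 4096 bytes * 64 CPUs * 100 cgroups = 838860800 bytes (~800 MiB)

which is easily enough to hide the entire footprint of a small test workload.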
The original memcg stat implementation used to do tree aggregation exclusively on the read side: local stats would only ever be tracked in per-cpu counters, and a memory.stat read would iterate the entire subtree and sum those counters up. This didn't keep up with the times:
- Cgroup trees are much bigger now. We switched to lazily-freed cgroups, where deleted groups would hang around until their remaining page cache has been reclaimed. This can result in large subtrees that are expensive to walk, while most of the groups are idle and their statistics don't change much anymore.
- Automated monitoring increased. With the proliferation of userspace oom killing, proactive reclaim, and higher-resolution logging of workload trends in general, top-level stat files are polled at least once a second in many deployments.
- The lifetime of cgroups got shorter. Where most cgroup setups in the past would have a few large policy-oriented cgroups for everything running on the system, newer cgroup deployments tend to create one group per application - which gets deleted again as the processes exit. An aggregation scheme that doesn't retain child data inside the parents loses event history of the subtree.
Rstat addresses all three of those concerns through intelligent, persistent read-side aggregation. As statistics change at the local level, rstat tracks - on a per-cpu basis - only those parts of a subtree that have changes pending and require aggregation. The actual aggregation occurs on the colder read side - which can now skip over (potentially large) numbers of recently idle cgroups.
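In terms of the cgroup core API (simplified; these are the existing mainline entry points rstat is built around): a controller calls cgroup_rstat_updated() whenever a local per-cpu counter changes, and a reader calls cgroup_rstat_flush() before reporting, which walks only the cgroups marked as having pending deltas.

	/* writer side: cheap per-cpu hint that this cgroup has pending deltas */
	cgroup_rstat_updated(cgrp, smp_processor_id());

	/* reader side (e.g. a memory.stat read): fold only the marked subtree */
	cgroup_rstat_flush(cgrp);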
===
The test_kmem cgroup selftest is currently failing due to excessive cumulative vmstat drift from 100 subgroups:
ok 1 test_kmem_basic
memory.current = 8810496
slab + anon + file + kernel_stack = 17074568
slab = 6101384
anon = 946176
file = 0
kernel_stack = 10027008
not ok 2 test_kmem_memcg_deletion
ok 3 test_kmem_proc_kpagecgroup
ok 4 test_kmem_kernel_stacks
ok 5 test_kmem_dead_cgroups
ok 6 test_percpu_basic
As you can see, memory.stat items far exceed memory.current. The kernel stack alone is bigger than all of charged memory. That's because the memory of the test has been uncharged from memory.current, but the negative vmstat deltas are still sitting in the percpu caches.
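Concretely, the drift in the failing run above is 17074568 - 8810496 = 8264072 bytes, i.e. roughly 7.9 MiB of already-uncharged memory that is still reflected in the stat items because the negative per-cpu deltas have not been folded back in.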
The test at this time isn't even counting percpu, pagetables etc. yet, which would further contribute to the error. The last patch in the series updates the test to include them - as well as reduces the vmstat tolerances in general to only expect page_counter batching.
With all patches applied, the (now more stringent) test succeeds:
ok 1 test_kmem_basic
ok 2 test_kmem_memcg_deletion
ok 3 test_kmem_proc_kpagecgroup
ok 4 test_kmem_kernel_stacks
ok 5 test_kmem_dead_cgroups
ok 6 test_percpu_basic
===
A kernel build test confirms that overhead is comparable. Two kernels are built simultaneously in a nested tree with several idle siblings:
root - kernelbuild - one - two - three - four - build-a (defconfig, make -j16)
                                              `- build-b (defconfig, make -j16)
                                  `- idle-1
                                  `- ...
                                  `- idle-9
During the builds, kernelbuild/memory.stat is read once a second.
A perf diff shows that the change in cycle distribution is minimal. Top 10 kernel symbols:
  0.09%  +0.08%  [kernel.kallsyms]  [k] __mod_memcg_lruvec_state
  0.00%  +0.06%  [kernel.kallsyms]  [k] cgroup_rstat_updated
  0.08%  -0.05%  [kernel.kallsyms]  [k] __mod_memcg_state.part.0
  0.16%  -0.04%  [kernel.kallsyms]  [k] release_pages
  0.00%  +0.03%  [kernel.kallsyms]  [k] __count_memcg_events
  0.01%  +0.03%  [kernel.kallsyms]  [k] mem_cgroup_charge_statistics.constprop.0
  0.10%  -0.02%  [kernel.kallsyms]  [k] get_mem_cgroup_from_mm
  0.05%  -0.02%  [kernel.kallsyms]  [k] mem_cgroup_update_lru_size
  0.57%  +0.01%  [kernel.kallsyms]  [k] asm_exc_page_fault
===
The on-demand aggregated stats are now fully accurate:
$ grep -e nr_inactive_file /proc/vmstat | awk '{print($1,$2*4096)}'; \
  grep -e inactive_file /sys/fs/cgroup/memory.stat
vanilla:                              patched:
nr_inactive_file 1574105088           nr_inactive_file 1027801088
inactive_file 1577410560              inactive_file 1027801088
===
This patch (of 8):
The memcg hotunplug callback erroneously flushes counts on the local CPU, not the counts of the CPU going away; those counts will be lost.
Flush the CPU that is actually going away.
Also simplify the code a bit by using mod_memcg_state() and count_memcg_events() instead of open-coding the upward flush - this is comparable to how vmstat.c handles hotunplug flushing.
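The essence of the fix is which CPU's per-cpu data gets addressed; condensed from the diff below (declarations omitted, not a standalone function):

	/* before: this_cpu_* acts on the CPU running the hotplug callback,
	 * so the dying CPU's counts were never flushed */
	x = this_cpu_xchg(memcg->stat_cpu->count[i], 0);

	/* after: per_cpu_ptr(..., cpu) addresses the CPU that is going away */
	statc = per_cpu_ptr(memcg->stat_cpu, cpu);
	x = statc->count[i];
	statc->count[i] = 0;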
Link: https://lkml.kernel.org/r/20210209163304.77088-1-hannes@cmpxchg.org
Link: https://lkml.kernel.org/r/20210209163304.77088-2-hannes@cmpxchg.org
Fixes: a983b5ebee572 ("mm: memcontrol: fix excessive complexity in memory.stat reporting")
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Reviewed-by: Shakeel Butt <shakeelb@google.com>
Reviewed-by: Roman Gushchin <guro@fb.com>
Reviewed-by: Michal Koutný <mkoutny@suse.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Roman Gushchin <guro@fb.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Conflict: mm/memcontrol.c
Signed-off-by: Lu Jialin <lujialin4@huawei.com>
Reviewed-by: Xiu Jianfeng <xiujianfeng@huawei.com>
Reviewed-by: weiyang wang <wangweiyang2@huawei.com>
Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
---
 mm/memcontrol.c | 35 +++++++++++++++++++++-------------
 1 file changed, 21 insertions(+), 14 deletions(-)
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index e4f90577571dc..974fc5dc6dc81 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -2234,45 +2234,52 @@ static void drain_all_stock(struct mem_cgroup *root_memcg)
 static int memcg_hotplug_cpu_dead(unsigned int cpu)
 {
 	struct memcg_stock_pcp *stock;
-	struct mem_cgroup *memcg, *mi;
+	struct mem_cgroup *memcg;
 
 	stock = &per_cpu(memcg_stock, cpu);
 	drain_stock(stock);
 
 	for_each_mem_cgroup(memcg) {
+		struct mem_cgroup_stat_cpu *statc;
 		int i;
 
+		statc = per_cpu_ptr(memcg->stat_cpu, cpu);
+
 		for (i = 0; i < MEMCG_NR_STAT; i++) {
 			int nid;
-			long x;
 
-			x = this_cpu_xchg(memcg->stat_cpu->count[i], 0);
-			if (x)
-				for (mi = memcg; mi; mi = parent_mem_cgroup(mi))
-					atomic_long_add(x, &memcg->stat[i]);
+			if (statc->count[i]) {
+				mod_memcg_state(memcg, i, statc->count[i]);
+				statc->count[i] = 0;
+			}
 
 			if (i >= NR_VM_NODE_STAT_ITEMS)
 				continue;
 
 			for_each_node(nid) {
+				struct lruvec_stat *lstatc;
 				struct mem_cgroup_per_node *pn;
+				long x;
 
 				pn = mem_cgroup_nodeinfo(memcg, nid);
-				x = this_cpu_xchg(pn->lruvec_stat_cpu->count[i], 0);
-				if (x)
+				lstatc = per_cpu_ptr(pn->lruvec_stat_cpu, cpu);
+
+				x = lstatc->count[i];
+				lstatc->count[i] = 0;
+
+				if (x) {
 					do {
 						atomic_long_add(x, &pn->lruvec_stat[i]);
 					} while ((pn = parent_nodeinfo(pn, nid)));
+				}
 			}
 		}
 
 		for (i = 0; i < NR_VM_EVENT_ITEMS; i++) {
-			long x;
-
-			x = this_cpu_xchg(memcg->stat_cpu->events[i], 0);
-			if (x)
-				for (mi = memcg; mi; mi = parent_mem_cgroup(mi))
-					atomic_long_add(x, &memcg->events[i]);
+			if (statc->events[i]) {
+				count_memcg_events(memcg, i, statc->events[i]);
+				statc->events[i] = 0;
+			}
 		}
 	}
From: Weilong Chen <chenweilong@huawei.com>
hulk inclusion
category: bugfix
bugzilla: https://gitee.com/openeuler/kernel/issues/I4LGV4
CVE: NA
---------------------------
Patch "cache: Workaround HiSilicon Taishan DC CVAU" breaks the verifiy of cpu capability when hot plug cpus. It set the system scope on but local cpu capability still off. This path fix it by two step: 1. Unset CTR_IDC_SHIFT bit from strict_mask to skip check. 2. Special treatment in read_cpuid_effective_cachetype
Signed-off-by: Weilong Chen <chenweilong@huawei.com>
Reviewed-by: Kefeng Wang <wangkefeng.wang@huawei.com>
Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
---
 arch/arm64/include/asm/cache.h | 9 +++++++++
 arch/arm64/kernel/cpu_errata.c | 1 +
 2 files changed, 10 insertions(+)
diff --git a/arch/arm64/include/asm/cache.h b/arch/arm64/include/asm/cache.h
index 1c1485f669ea6..d1c46d75885fa 100644
--- a/arch/arm64/include/asm/cache.h
+++ b/arch/arm64/include/asm/cache.h
@@ -108,6 +108,15 @@ int cache_line_size(void);
 static inline u32 __attribute_const__ read_cpuid_effective_cachetype(void)
 {
 	u32 ctr = read_cpuid_cachetype();
+#ifdef CONFIG_HISILICON_ERRATUM_1980005
+	static const struct midr_range idc_support_list[] = {
+		MIDR_ALL_VERSIONS(MIDR_HISI_TSV110),
+		MIDR_REV(MIDR_HISI_TSV200, 1, 0),
+		{ /* sentinel */ }
+	};
+	if (is_midr_in_range_list(read_cpuid_id(), idc_support_list))
+		ctr |= BIT(CTR_IDC_SHIFT);
+#endif
 
 	if (!(ctr & BIT(CTR_IDC_SHIFT))) {
 		u64 clidr = read_sysreg(clidr_el1);
 
diff --git a/arch/arm64/kernel/cpu_errata.c b/arch/arm64/kernel/cpu_errata.c
index 8dbe94b4ec81e..a6f2451b44690 100644
--- a/arch/arm64/kernel/cpu_errata.c
+++ b/arch/arm64/kernel/cpu_errata.c
@@ -90,6 +90,7 @@ hisilicon_1980005_enable(const struct arm64_cpu_capabilities *__unused)
 {
 	cpus_set_cap(ARM64_HAS_CACHE_IDC);
 	arm64_ftr_reg_ctrel0.sys_val |= BIT(CTR_IDC_SHIFT);
+	arm64_ftr_reg_ctrel0.strict_mask &= ~BIT(CTR_IDC_SHIFT);
 	sysreg_clear_set(sctlr_el1, SCTLR_EL1_UCT, 0);
 }
 #endif