From: Roman Gushchin <guro@fb.com>
mainline inclusion
from mainline-v5.3-rc7
commit b4c46484dc3fa3721d68fdfae85c1d7b1f6b5472
category: bugfix
bugzilla: 51815, https://gitee.com/openeuler/kernel/issues/I3IJ9I
CVE: NA
-------------------------------------------------
Commit 766a4c19d880 ("mm/memcontrol.c: keep local VM counters in sync with the hierarchical ones") effectively decreased the precision of per-memcg vmstats_local and per-memcg-per-node lruvec percpu counters.
That's good for displaying in memory.stat, but brings a serious regression into the reclaim process.
One issue I've discovered and debugged is the following: lruvec_lru_size() can return 0 instead of the actual number of pages in the lru list, preventing the kernel from reclaiming the last remaining pages. The result is yet another flood of dying memory cgroups. The opposite also happens: scanning an empty lru list is a waste of cpu time.
Also, inactive_list_is_low() can return incorrect values, preventing the active lru from being scanned and freed. It can fail both because the sizes of the active and inactive lists are inaccurate and because the number of workingset refaults isn't precise. In other words, the result is pretty random.
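To make the failure mode concrete, here is a minimal userspace model of the batched counter. It is an illustration only, not kernel code: NCPUS, the page count, and all names are invented, and BATCH stands in for MEMCG_CHARGE_BATCH (32).

/*
 * Per-cpu deltas are flushed to the shared counter only once they
 * exceed the batch threshold, so a reader can observe 0 for a
 * non-empty list.
 */
#include <stdio.h>
#include <stdlib.h>

#define NCPUS 4                 /* invented cpu count */
#define BATCH 32                /* stands in for MEMCG_CHARGE_BATCH */

static long shared;             /* the counter readers consult */
static long pending[NCPUS];     /* per-cpu batched deltas */

static void batched_add(int cpu, long val)
{
        long x = pending[cpu] + val;

        if (labs(x) > BATCH) {  /* flush only on batch overflow */
                shared += x;
                x = 0;
        }
        pending[cpu] = x;
}

int main(void)
{
        /* 20 pages are added to the "lru" on cpu 0 ... */
        for (int i = 0; i < 20; i++)
                batched_add(0, 1);

        /* ... but a reader of the shared counter sees an "empty" list */
        printf("real size: 20, observed size: %ld\n", shared);
        return 0;
}

This prints "real size: 20, observed size: 0"; in the worst case up to nr_cpus * MEMCG_CHARGE_BATCH pages can be invisible to readers, which is what makes lruvec_lru_size() and inactive_list_is_low() unreliable.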
I'm not sure whether using the approximate number of slab pages in count_shadow_nodes() is acceptable, but the issues described above are enough to partially revert the patch.
Let's keep the per-memcg vmstat_local batched (it is only used for displaying stats to userspace), but keep the lruvec stats precise. This change fixes the dying-memcg flooding on my setup.
Link: http://lkml.kernel.org/r/20190817004726.2530670-1-guro@fb.com
Fixes: 766a4c19d880 ("mm/memcontrol.c: keep local VM counters in sync with the hierarchical ones")
Signed-off-by: Roman Gushchin <guro@fb.com>
Acked-by: Yafang Shao <laoar.shao@gmail.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Chen Zhou <chenzhou10@huawei.com>
Signed-off-by: Liu Shixin <liushixin2@huawei.com>
Reviewed-by: Kefeng Wang <wangkefeng.wang@huawei.com>
Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
(cherry-pick commit from 81b6bde6ce186b25297ffc0138f9bc4523f3497c)
Signed-off-by: Lu Jialin <lujialin4@huawei.com>
Reviewed-by: Jing Xiangfeng <jingxiangfeng@huawei.com>
Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
---
 mm/memcontrol.c | 8 +++-----
 1 file changed, 3 insertions(+), 5 deletions(-)
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index cd9e34fbcfc52..d9f040773a473 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -752,15 +752,13 @@ void __mod_lruvec_state(struct lruvec *lruvec, enum node_stat_item idx,
 	/* Update memcg */
 	__mod_memcg_state(memcg, idx, val);
 
+	/* Update lruvec */
+	__this_cpu_add(pn->lruvec_stat_local->count[idx], val);
+
 	x = val + __this_cpu_read(pn->lruvec_stat_cpu->count[idx]);
 	if (unlikely(abs(x) > MEMCG_CHARGE_BATCH)) {
 		struct mem_cgroup_per_node *pi;
 
-		/*
-		 * Batch local counters to keep them in sync with
-		 * the hierarchical ones.
-		 */
-		__this_cpu_add(pn->lruvec_stat_local->count[idx], x);
 		for (pi = pn; pi; pi = parent_nodeinfo(pi, pgdat->node_id))
 			atomic_long_add(x, &pi->lruvec_stat[idx]);
 		x = 0;
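For readers unfamiliar with the split between the two counters, the following sketch extends the model above to mirror the update path after this patch. Again a simplified userspace illustration, not the kernel implementation: local[] plays the role of lruvec_stat_local (now taking every delta, so summing it is precise), while pending[] and hierarchical mirror lruvec_stat_cpu and the batched hierarchical atomics.

#include <stdio.h>
#include <stdlib.h>

#define NCPUS 4
#define BATCH 32                /* stands in for MEMCG_CHARGE_BATCH */

static long hierarchical;       /* batched: cheap, approximate */
static long local[NCPUS];       /* precise: summed on read */
static long pending[NCPUS];     /* per-cpu batch for the hierarchy */

static void mod_lruvec_state(int cpu, long val)
{
        long x;

        local[cpu] += val;      /* always applied: stays precise */

        x = pending[cpu] + val;
        if (labs(x) > BATCH) {  /* hierarchical update stays batched */
                hierarchical += x;
                x = 0;
        }
        pending[cpu] = x;
}

static long read_local(void)
{
        long sum = 0;

        for (int cpu = 0; cpu < NCPUS; cpu++)
                sum += local[cpu];
        return sum;
}

int main(void)
{
        for (int i = 0; i < 20; i++)
                mod_lruvec_state(0, 1);

        /* reclaim-side readers see the exact size again */
        printf("local: %ld, hierarchical: %ld\n",
               read_local(), hierarchical);
        return 0;
}

This prints "local: 20, hierarchical: 0": the hierarchical counters can still lag by up to the batch size per cpu, which the changelog above accepts for the remaining readers, while the counters that drive reclaim decisions are exact again.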