backport psi feature and avoid kabi change bugzilla: https://gitee.com/openeuler/kernel/issues/I47QS2
Baruch Siach (1): psi: fix reference to kernel commandline enable
Dan Schatzberg (1): kernel/sched/psi.c: expose pressure metrics on root cgroup
Johannes Weiner (12): mm: workingset: tell cache transitions from workingset thrashing sched: loadavg: consolidate LOAD_INT, LOAD_FRAC, CALC_LOAD sched: loadavg: make calc_load_n() public sched: sched.h: make rq locking and clock functions available in stats.h sched: introduce this_rq_lock_irq() psi: pressure stall information for CPU, memory, and IO psi: cgroup support psi: make disabling/enabling easier for vendor kernels psi: fix aggregation idle shut-off psi: avoid divide-by-zero crash inside virtual machines fs: kernfs: add poll file operation sched/psi: Fix sampling error and rare div0 crashes with cgroups and high uptime
Josef Bacik (1): blk-iolatency: use a percentile approache for ssd's
Liu Xinpeng (2): psi:enable psi in config psi:avoid kabi change
Miklos Szeredi (1): fuse: ignore PG_workingset after stealing
Olof Johansson (1): kernel/sched/psi.c: simplify cgroup_move_task()
Suren Baghdasaryan (6): psi: introduce state_mask to represent stalled psi states psi: make psi_enable static psi: rename psi fields in preparation for psi trigger addition psi: split update_stats into parts psi: track changed states include/: refactor headers to allow kthread.h inclusion in psi_types.h
Yafang Shao (1): mm, memcg: add workingset_restore in memory.stat
Documentation/accounting/psi.txt | 73 +++ Documentation/admin-guide/cgroup-v2.rst | 22 + Documentation/admin-guide/kernel-parameters.txt | 4 + arch/arm64/configs/openeuler_defconfig | 2 + arch/powerpc/platforms/cell/cpufreq_spudemand.c | 2 +- arch/powerpc/platforms/cell/spufs/sched.c | 9 +- arch/s390/appldata/appldata_os.c | 4 - arch/x86/configs/openeuler_defconfig | 2 + block/blk-iolatency.c | 183 +++++- drivers/cpuidle/governors/menu.c | 4 - drivers/spi/spi-rockchip.c | 1 + fs/fuse/dev.c | 1 + fs/kernfs/file.c | 31 +- fs/proc/loadavg.c | 3 - include/linux/cgroup-defs.h | 12 + include/linux/cgroup.h | 17 + include/linux/kernfs.h | 8 + include/linux/kthread.h | 4 + include/linux/mmzone.h | 3 + include/linux/page-flags.h | 5 + include/linux/psi.h | 55 ++ include/linux/psi_types.h | 95 +++ include/linux/sched.h | 13 + include/linux/sched/loadavg.h | 24 +- include/linux/swap.h | 1 + include/trace/events/mmflags.h | 1 + init/Kconfig | 28 + kernel/cgroup/cgroup.c | 55 +- kernel/debug/kdb/kdb_main.c | 7 +- kernel/fork.c | 4 + kernel/kthread.c | 3 + kernel/sched/Makefile | 1 + kernel/sched/core.c | 16 +- kernel/sched/loadavg.c | 139 ++-- kernel/sched/psi.c | 823 ++++++++++++++++++++++++ kernel/sched/sched.h | 178 ++--- kernel/sched/stats.h | 86 +++ kernel/workqueue.c | 23 + kernel/workqueue_internal.h | 6 +- mm/compaction.c | 5 + mm/filemap.c | 20 +- mm/huge_memory.c | 1 + mm/memcontrol.c | 2 + mm/migrate.c | 2 + mm/page_alloc.c | 9 + mm/swap_state.c | 1 + mm/vmscan.c | 10 + mm/vmstat.c | 1 + mm/workingset.c | 117 +++- 49 files changed, 1837 insertions(+), 279 deletions(-) create mode 100644 Documentation/accounting/psi.txt create mode 100644 include/linux/psi.h create mode 100644 include/linux/psi_types.h create mode 100644 kernel/sched/psi.c
From: Josef Bacik josef@toxicpanda.com
mainline inclusion from mainline-v5.4 commit 1fa2840e56f9032e14a75fcf67edfe0f21102e4b category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I47QS2 CVE: NA
--------------------------------
We use an average latency approach for determining if we're missing our latency target. This works well for rotational storage where we have generally consistent latencies, but for ssd's and other low latency devices you have more of a spikey behavior, which means we often won't throttle misbehaving groups because a lot of IO completes at drastically faster times than our latency target. Instead keep track of how many IO's miss our target and how many IO's are done in our time window. If the p(90) latency is above our target then we know we need to throttle. With this change in place we are seeing the same throttling behavior with our testcase on ssd's as we see with rotational drives.
Signed-off-by: Josef Bacik josef@toxicpanda.com Signed-off-by: Jens Axboe axboe@kernel.dk Signed-off-by: Liu Xinpeng liuxp11@chinatelecom.cn Signed-off-by: Ctyun Kernel ctyuncommiter01@chinatelecom.cn --- block/blk-iolatency.c | 179 ++++++++++++++++++++++++++++++++++++++++---------- 1 file changed, 145 insertions(+), 34 deletions(-)
diff --git a/block/blk-iolatency.c b/block/blk-iolatency.c index ae6ae7f..f84fdc7 100644 --- a/block/blk-iolatency.c +++ b/block/blk-iolatency.c @@ -117,9 +117,21 @@ struct child_latency_info { atomic_t scale_cookie; };
+struct percentile_stats { + u64 total; + u64 missed; +}; + +struct latency_stat { + union { + struct percentile_stats ps; + struct blk_rq_stat rqs; + }; +}; + struct iolatency_grp { struct blkg_policy_data pd; - struct blk_rq_stat __percpu *stats; + struct latency_stat __percpu *stats; struct blk_iolatency *blkiolat; struct rq_depth rq_depth; struct rq_wait rq_wait; @@ -134,6 +146,7 @@ struct iolatency_grp { /* Our current number of IO's for the last summation. */ u64 nr_samples;
+ bool ssd; struct child_latency_info child_lat; };
@@ -174,6 +187,80 @@ static inline struct blkcg_gq *lat_to_blkg(struct iolatency_grp *iolat) return pd_to_blkg(&iolat->pd); }
+static inline void latency_stat_init(struct iolatency_grp *iolat, + struct latency_stat *stat) +{ + if (iolat->ssd) { + stat->ps.total = 0; + stat->ps.missed = 0; + } else + blk_rq_stat_init(&stat->rqs); +} + +static inline void latency_stat_sum(struct iolatency_grp *iolat, + struct latency_stat *sum, + struct latency_stat *stat) +{ + if (iolat->ssd) { + sum->ps.total += stat->ps.total; + sum->ps.missed += stat->ps.missed; + } else + blk_rq_stat_sum(&sum->rqs, &stat->rqs); +} + +static inline void latency_stat_record_time(struct iolatency_grp *iolat, + u64 req_time) +{ + struct latency_stat *stat = get_cpu_ptr(iolat->stats); + if (iolat->ssd) { + if (req_time >= iolat->min_lat_nsec) + stat->ps.missed++; + stat->ps.total++; + } else + blk_rq_stat_add(&stat->rqs, req_time); + put_cpu_ptr(stat); +} + +static inline bool latency_sum_ok(struct iolatency_grp *iolat, + struct latency_stat *stat) +{ + if (iolat->ssd) { + u64 thresh = div64_u64(stat->ps.total, 10); + thresh = max(thresh, 1ULL); + return stat->ps.missed < thresh; + } + return stat->rqs.mean <= iolat->min_lat_nsec; +} + +static inline u64 latency_stat_samples(struct iolatency_grp *iolat, + struct latency_stat *stat) +{ + if (iolat->ssd) + return stat->ps.total; + return stat->rqs.nr_samples; +} + +static inline void iolat_update_total_lat_avg(struct iolatency_grp *iolat, + struct latency_stat *stat) +{ + int exp_idx; + + if (iolat->ssd) + return; + + /* + * CALC_LOAD takes in a number stored in fixed point representation. + * Because we are using this for IO time in ns, the values stored + * are significantly larger than the FIXED_1 denominator (2048). + * Therefore, rounding errors in the calculation are negligible and + * can be ignored. + */ + exp_idx = min_t(int, BLKIOLATENCY_NR_EXP_FACTORS - 1, + div64_u64(iolat->cur_win_nsec, + BLKIOLATENCY_EXP_BUCKET_SIZE)); + CALC_LOAD(iolat->lat_avg, iolatency_exp_factors[exp_idx], stat->rqs.mean); +} + static inline bool iolatency_may_queue(struct iolatency_grp *iolat, wait_queue_entry_t *wait, bool first_block) @@ -445,7 +532,6 @@ static void iolatency_record_time(struct iolatency_grp *iolat, struct bio_issue *issue, u64 now, bool issue_as_root) { - struct blk_rq_stat *rq_stat; u64 start = bio_issue_time(issue); u64 req_time;
@@ -471,9 +557,7 @@ static void iolatency_record_time(struct iolatency_grp *iolat, return; }
- rq_stat = get_cpu_ptr(iolat->stats); - blk_rq_stat_add(rq_stat, req_time); - put_cpu_ptr(rq_stat); + latency_stat_record_time(iolat, req_time); }
#define BLKIOLATENCY_MIN_ADJUST_TIME (500 * NSEC_PER_MSEC) @@ -484,17 +568,17 @@ static void iolatency_check_latencies(struct iolatency_grp *iolat, u64 now) struct blkcg_gq *blkg = lat_to_blkg(iolat); struct iolatency_grp *parent; struct child_latency_info *lat_info; - struct blk_rq_stat stat; + struct latency_stat stat; unsigned long flags; - int cpu, exp_idx; + int cpu;
- blk_rq_stat_init(&stat); + latency_stat_init(iolat, &stat); preempt_disable(); for_each_online_cpu(cpu) { - struct blk_rq_stat *s; + struct latency_stat *s; s = per_cpu_ptr(iolat->stats, cpu); - blk_rq_stat_sum(&stat, s); - blk_rq_stat_init(s); + latency_stat_sum(iolat, &stat, s); + latency_stat_init(iolat, s); } preempt_enable();
@@ -504,41 +588,33 @@ static void iolatency_check_latencies(struct iolatency_grp *iolat, u64 now)
lat_info = &parent->child_lat;
- /* - * CALC_LOAD takes in a number stored in fixed point representation. - * Because we are using this for IO time in ns, the values stored - * are significantly larger than the FIXED_1 denominator (2048). - * Therefore, rounding errors in the calculation are negligible and - * can be ignored. - */ - exp_idx = min_t(int, BLKIOLATENCY_NR_EXP_FACTORS - 1, - div64_u64(iolat->cur_win_nsec, - BLKIOLATENCY_EXP_BUCKET_SIZE)); - CALC_LOAD(iolat->lat_avg, iolatency_exp_factors[exp_idx], stat.mean); + iolat_update_total_lat_avg(iolat, &stat);
/* Everything is ok and we don't need to adjust the scale. */ - if (stat.mean <= iolat->min_lat_nsec && + if (latency_sum_ok(iolat, &stat) && atomic_read(&lat_info->scale_cookie) == DEFAULT_SCALE_COOKIE) return;
/* Somebody beat us to the punch, just bail. */ spin_lock_irqsave(&lat_info->lock, flags); lat_info->nr_samples -= iolat->nr_samples; - lat_info->nr_samples += stat.nr_samples; - iolat->nr_samples = stat.nr_samples; + lat_info->nr_samples += latency_stat_samples(iolat, &stat); + iolat->nr_samples = latency_stat_samples(iolat, &stat);
if ((lat_info->last_scale_event >= now || now - lat_info->last_scale_event < BLKIOLATENCY_MIN_ADJUST_TIME) && lat_info->scale_lat <= iolat->min_lat_nsec) goto out;
- if (stat.mean <= iolat->min_lat_nsec && - stat.nr_samples >= BLKIOLATENCY_MIN_GOOD_SAMPLES) { + if (latency_sum_ok(iolat, &stat)) { + if (latency_stat_samples(iolat, &stat) < + BLKIOLATENCY_MIN_GOOD_SAMPLES) + goto out; if (lat_info->scale_grp == iolat) { lat_info->last_scale_event = now; scale_cookie_change(iolat->blkiolat, lat_info, true); } - } else if (stat.mean > iolat->min_lat_nsec) { + } else { lat_info->last_scale_event = now; if (!lat_info->scale_grp || lat_info->scale_lat > iolat->min_lat_nsec) { @@ -853,13 +929,43 @@ static int iolatency_print_limit(struct seq_file *sf, void *v) return 0; }
+static size_t iolatency_ssd_stat(struct iolatency_grp *iolat, char *buf, + size_t size) +{ + struct latency_stat stat; + int cpu; + + latency_stat_init(iolat, &stat); + preempt_disable(); + for_each_online_cpu(cpu) { + struct latency_stat *s; + s = per_cpu_ptr(iolat->stats, cpu); + latency_stat_sum(iolat, &stat, s); + } + preempt_enable(); + + if (iolat->rq_depth.max_depth == UINT_MAX) + return scnprintf(buf, size, " missed=%llu total=%llu depth=max", + (unsigned long long)stat.ps.missed, + (unsigned long long)stat.ps.total); + return scnprintf(buf, size, " missed=%llu total=%llu depth=%u", + (unsigned long long)stat.ps.missed, + (unsigned long long)stat.ps.total, + iolat->rq_depth.max_depth); +} + static size_t iolatency_pd_stat(struct blkg_policy_data *pd, char *buf, size_t size) { struct iolatency_grp *iolat = pd_to_lat(pd); - unsigned long long avg_lat = div64_u64(iolat->lat_avg, NSEC_PER_USEC); - unsigned long long cur_win = div64_u64(iolat->cur_win_nsec, NSEC_PER_MSEC); + unsigned long long avg_lat; + unsigned long long cur_win; + + if (iolat->ssd) + return iolatency_ssd_stat(iolat, buf, size);
+ avg_lat = div64_u64(iolat->lat_avg, NSEC_PER_USEC); + cur_win = div64_u64(iolat->cur_win_nsec, NSEC_PER_MSEC); if (iolat->rq_depth.max_depth == UINT_MAX) return scnprintf(buf, size, " depth=max avg_lat=%llu win=%llu", avg_lat, cur_win); @@ -876,8 +982,8 @@ static struct blkg_policy_data *iolatency_pd_alloc(gfp_t gfp, int node) iolat = kzalloc_node(sizeof(*iolat), gfp, node); if (!iolat) return NULL; - iolat->stats = __alloc_percpu_gfp(sizeof(struct blk_rq_stat), - __alignof__(struct blk_rq_stat), gfp); + iolat->stats = __alloc_percpu_gfp(sizeof(struct latency_stat), + __alignof__(struct latency_stat), gfp); if (!iolat->stats) { kfree(iolat); return NULL; @@ -894,10 +1000,15 @@ static void iolatency_pd_init(struct blkg_policy_data *pd) u64 now = ktime_to_ns(ktime_get()); int cpu;
+ if (blk_queue_nonrot(blkg->q)) + iolat->ssd = true; + else + iolat->ssd = false; + for_each_possible_cpu(cpu) { - struct blk_rq_stat *stat; + struct latency_stat *stat; stat = per_cpu_ptr(iolat->stats, cpu); - blk_rq_stat_init(stat); + latency_stat_init(iolat, stat); }
rq_wait_init(&iolat->rq_wait);
From: Johannes Weiner hannes@cmpxchg.org
mainline inclusion from mainline-v5.4 commit 1899ad18c6072d689896badafb81267b0a1092a4 category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I47QS2 CVE: NA
--------------------------------
Refaults happen during transitions between workingsets as well as in-place thrashing. Knowing the difference between the two has a range of applications, including measuring the impact of memory shortage on the system performance, as well as the ability to smarter balance pressure between the filesystem cache and the swap-backed workingset.
During workingset transitions, inactive cache refaults and pushes out established active cache. When that active cache isn't stale, however, and also ends up refaulting, that's bonafide thrashing.
Introduce a new page flag that tells on eviction whether the page has been active or not in its lifetime. This bit is then stored in the shadow entry, to classify refaults as transitioning or thrashing.
How many page->flags does this leave us with on 32-bit?
20 bits are always page flags
21 if you have an MMU
23 with the zone bits for DMA, Normal, HighMem, Movable
29 with the sparsemem section bits
30 if PAE is enabled
31 with this patch.
So on 32-bit PAE, that leaves 1 bit for distinguishing two NUMA nodes. If that's not enough, the system can switch to discontigmem and re-gain the 6 or 7 sparsemem section bits.
Link: http://lkml.kernel.org/r/20180828172258.3185-3-hannes@cmpxchg.org Signed-off-by: Johannes Weiner hannes@cmpxchg.org Acked-by: Peter Zijlstra (Intel) peterz@infradead.org Tested-by: Daniel Drake drake@endlessm.com Tested-by: Suren Baghdasaryan surenb@google.com Cc: Christopher Lameter cl@linux.com Cc: Ingo Molnar mingo@redhat.com Cc: Johannes Weiner jweiner@fb.com Cc: Mike Galbraith efault@gmx.de Cc: Peter Enderborg peter.enderborg@sony.com Cc: Randy Dunlap rdunlap@infradead.org Cc: Shakeel Butt shakeelb@google.com Cc: Tejun Heo tj@kernel.org Cc: Vinayak Menon vinmenon@codeaurora.org Signed-off-by: Andrew Morton akpm@linux-foundation.org Signed-off-by: Linus Torvalds torvalds@linux-foundation.org Signed-off-by: Liu Xinpeng liuxp11@chinatelecom.cn Signed-off-by: Ctyun Kernel ctyuncommiter01@chinatelecom.cn --- include/linux/mmzone.h | 1 + include/linux/page-flags.h | 5 ++- include/linux/swap.h | 2 +- include/trace/events/mmflags.h | 1 + mm/filemap.c | 9 ++-- mm/huge_memory.c | 1 + mm/migrate.c | 2 + mm/swap_state.c | 1 + mm/vmscan.c | 1 + mm/vmstat.c | 1 + mm/workingset.c | 94 +++++++++++++++++++++++++++--------------- 11 files changed, 76 insertions(+), 42 deletions(-)
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h index e7d2bca..b55dd22 100644 --- a/include/linux/mmzone.h +++ b/include/linux/mmzone.h @@ -164,6 +164,7 @@ enum node_stat_item { NR_ISOLATED_FILE, /* Temporary isolated pages from file lru */ WORKINGSET_REFAULT, WORKINGSET_ACTIVATE, + WORKINGSET_RESTORE, WORKINGSET_NODERECLAIM, NR_ANON_MAPPED, /* Mapped anonymous pages */ NR_FILE_MAPPED, /* pagecache pages mapped into pagetables. diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h index 7eb776a..97c0918 100644 --- a/include/linux/page-flags.h +++ b/include/linux/page-flags.h @@ -69,13 +69,14 @@ */ enum pageflags { PG_locked, /* Page is locked. Don't touch. */ - PG_error, PG_referenced, PG_uptodate, PG_dirty, PG_lru, PG_active, + PG_workingset, PG_waiters, /* Page has waiters, check its waitqueue. Must be bit #7 and in the same byte as "PG_locked" */ + PG_error, PG_slab, PG_owner_priv_1, /* Owner use. If pagecache, fs may use*/ PG_arch_1, @@ -281,6 +282,8 @@ static inline int PagePoisoned(const struct page *page) PAGEFLAG(LRU, lru, PF_HEAD) __CLEARPAGEFLAG(LRU, lru, PF_HEAD) PAGEFLAG(Active, active, PF_HEAD) __CLEARPAGEFLAG(Active, active, PF_HEAD) TESTCLEARFLAG(Active, active, PF_HEAD) +PAGEFLAG(Workingset, workingset, PF_HEAD) + TESTCLEARFLAG(Workingset, workingset, PF_HEAD) __PAGEFLAG(Slab, slab, PF_NO_TAIL) __PAGEFLAG(SlobFree, slob_free, PF_NO_TAIL) PAGEFLAG(Checked, checked, PF_NO_COMPOUND) /* Used by some filesystems */ diff --git a/include/linux/swap.h b/include/linux/swap.h index b7cfad3..12a39b2 100644 --- a/include/linux/swap.h +++ b/include/linux/swap.h @@ -322,7 +322,7 @@ struct vma_swap_readahead {
/* linux/mm/workingset.c */ void *workingset_eviction(struct address_space *mapping, struct page *page); -bool workingset_refault(void *shadow); +void workingset_refault(struct page *page, void *shadow); void workingset_activation(struct page *page);
/* Do not use directly, use workingset_lookup_update */ diff --git a/include/trace/events/mmflags.h b/include/trace/events/mmflags.h index b817bf1..edc9adf 100644 --- a/include/trace/events/mmflags.h +++ b/include/trace/events/mmflags.h @@ -88,6 +88,7 @@ {1UL << PG_dirty, "dirty" }, \ {1UL << PG_lru, "lru" }, \ {1UL << PG_active, "active" }, \ + {1UL << PG_workingset, "workingset" }, \ {1UL << PG_slab, "slab" }, \ {1UL << PG_owner_priv_1, "owner_priv_1" }, \ {1UL << PG_arch_1, "arch_1" }, \ diff --git a/mm/filemap.c b/mm/filemap.c index f578d4e..fd906c1 100644 --- a/mm/filemap.c +++ b/mm/filemap.c @@ -1023,12 +1023,9 @@ int add_to_page_cache_lru(struct page *page, struct address_space *mapping, * data from the working set, only to cache data that will * get overwritten with something else, is a waste of memory. */ - if (!(gfp_mask & __GFP_WRITE) && - shadow && workingset_refault(shadow)) { - SetPageActive(page); - workingset_activation(page); - } else - ClearPageActive(page); + WARN_ON_ONCE(PageActive(page)); + if (!(gfp_mask & __GFP_WRITE) && shadow) + workingset_refault(page, shadow); if (!PagePercpuRef(page)) lru_cache_add(page); } diff --git a/mm/huge_memory.c b/mm/huge_memory.c index 31f1c58..f71b7a7 100644 --- a/mm/huge_memory.c +++ b/mm/huge_memory.c @@ -2531,6 +2531,7 @@ static void __split_huge_page_tail(struct page *head, int tail, (1L << PG_mlocked) | (1L << PG_uptodate) | (1L << PG_active) | + (1L << PG_workingset) | (1L << PG_locked) | (1L << PG_unevictable) | (1L << PG_dirty))); diff --git a/mm/migrate.c b/mm/migrate.c index 90aa493..a41ef7f 100644 --- a/mm/migrate.c +++ b/mm/migrate.c @@ -684,6 +684,8 @@ void migrate_page_states(struct page *newpage, struct page *page) SetPageActive(newpage); } else if (TestClearPageUnevictable(page)) SetPageUnevictable(newpage); + if (PageWorkingset(page)) + SetPageWorkingset(newpage); if (PageChecked(page)) SetPageChecked(newpage); if (PageMappedToDisk(page)) diff --git a/mm/swap_state.c b/mm/swap_state.c index 2137e2d..3b538ef 100644 --- a/mm/swap_state.c +++ b/mm/swap_state.c @@ -460,6 +460,7 @@ struct page *__read_swap_cache_async(swp_entry_t entry, gfp_t gfp_mask, /* * Initiate read into locked page and return. */ + SetPageWorkingset(new_page); lru_cache_add_anon(new_page); *new_page_allocated = true; return new_page; diff --git a/mm/vmscan.c b/mm/vmscan.c index e8befe7..90b47ca 100644 --- a/mm/vmscan.c +++ b/mm/vmscan.c @@ -2188,6 +2188,7 @@ static void shrink_active_list(unsigned long nr_to_scan, }
ClearPageActive(page); /* we are de-activating */ + SetPageWorkingset(page); list_add(&page->lru, &l_inactive); }
diff --git a/mm/vmstat.c b/mm/vmstat.c index 96028cc..e183cb2 100644 --- a/mm/vmstat.c +++ b/mm/vmstat.c @@ -1145,6 +1145,7 @@ int fragmentation_index(struct zone *zone, unsigned int order) "nr_isolated_file", "workingset_refault", "workingset_activate", + "workingset_restore", "workingset_nodereclaim", "nr_anon_pages", "nr_mapped", diff --git a/mm/workingset.c b/mm/workingset.c index a1f61b3..039eea7 100644 --- a/mm/workingset.c +++ b/mm/workingset.c @@ -121,7 +121,7 @@ * the only thing eating into inactive list space is active pages. * * - * Activating refaulting pages + * Refaulting inactive pages * * All that is known about the active list is that the pages have been * accessed more than once in the past. This means that at any given @@ -134,6 +134,10 @@ * used less frequently than the refaulting page - or even not used at * all anymore. * + * That means if inactive cache is refaulting with a suitable refault + * distance, we assume the cache workingset is transitioning and put + * pressure on the current active list. + * * If this is wrong and demotion kicks in, the pages which are truly * used more frequently will be reactivated while the less frequently * used once will be evicted from memory. @@ -141,6 +145,14 @@ * But if this is right, the stale pages will be pushed out of memory * and the used pages get to stay in cache. * + * Refaulting active pages + * + * If on the other hand the refaulting pages have recently been + * deactivated, it means that the active list is no longer protecting + * actively used cache from reclaim. The cache is NOT transitioning to + * a different workingset; the existing workingset is thrashing in the + * space allocated to the page cache. + * * * Implementation * @@ -156,8 +168,7 @@ */
#define EVICTION_SHIFT (RADIX_TREE_EXCEPTIONAL_ENTRY + \ - NODES_SHIFT + \ - MEM_CGROUP_ID_SHIFT) + 1 + NODES_SHIFT + MEM_CGROUP_ID_SHIFT) #define EVICTION_MASK (~0UL >> EVICTION_SHIFT)
/* @@ -170,23 +181,28 @@ */ static unsigned int bucket_order __read_mostly;
-static void *pack_shadow(int memcgid, pg_data_t *pgdat, unsigned long eviction) +static void *pack_shadow(int memcgid, pg_data_t *pgdat, unsigned long eviction, + bool workingset) { eviction >>= bucket_order; eviction = (eviction << MEM_CGROUP_ID_SHIFT) | memcgid; eviction = (eviction << NODES_SHIFT) | pgdat->node_id; + eviction = (eviction << 1) | workingset; eviction = (eviction << RADIX_TREE_EXCEPTIONAL_SHIFT);
return (void *)(eviction | RADIX_TREE_EXCEPTIONAL_ENTRY); }
static void unpack_shadow(void *shadow, int *memcgidp, pg_data_t **pgdat, - unsigned long *evictionp) + unsigned long *evictionp, bool *workingsetp) { unsigned long entry = (unsigned long)shadow; int memcgid, nid; + bool workingset;
entry >>= RADIX_TREE_EXCEPTIONAL_SHIFT; + workingset = entry & 1; + entry >>= 1; nid = entry & ((1UL << NODES_SHIFT) - 1); entry >>= NODES_SHIFT; memcgid = entry & ((1UL << MEM_CGROUP_ID_SHIFT) - 1); @@ -195,6 +211,7 @@ static void unpack_shadow(void *shadow, int *memcgidp, pg_data_t **pgdat, *memcgidp = memcgid; *pgdat = NODE_DATA(nid); *evictionp = entry << bucket_order; + *workingsetp = workingset; }
/** @@ -207,8 +224,8 @@ static void unpack_shadow(void *shadow, int *memcgidp, pg_data_t **pgdat, */ void *workingset_eviction(struct address_space *mapping, struct page *page) { - struct mem_cgroup *memcg = page_memcg(page); struct pglist_data *pgdat = page_pgdat(page); + struct mem_cgroup *memcg = page_memcg(page); int memcgid = mem_cgroup_id(memcg); unsigned long eviction; struct lruvec *lruvec; @@ -220,30 +237,30 @@ void *workingset_eviction(struct address_space *mapping, struct page *page)
lruvec = mem_cgroup_lruvec(pgdat, memcg); eviction = atomic_long_inc_return(&lruvec->inactive_age); - return pack_shadow(memcgid, pgdat, eviction); + return pack_shadow(memcgid, pgdat, eviction, PageWorkingset(page)); }
/** * workingset_refault - evaluate the refault of a previously evicted page + * @page: the freshly allocated replacement page * @shadow: shadow entry of the evicted page * * Calculates and evaluates the refault distance of the previously * evicted page in the context of the node it was allocated in. - * - * Returns %true if the page should be activated, %false otherwise. */ -bool workingset_refault(void *shadow) +void workingset_refault(struct page *page, void *shadow) { unsigned long refault_distance; + struct pglist_data *pgdat; unsigned long active_file; struct mem_cgroup *memcg; unsigned long eviction; struct lruvec *lruvec; unsigned long refault; - struct pglist_data *pgdat; + bool workingset; int memcgid;
- unpack_shadow(shadow, &memcgid, &pgdat, &eviction); + unpack_shadow(shadow, &memcgid, &pgdat, &eviction, &workingset);
rcu_read_lock(); /* @@ -263,41 +280,50 @@ bool workingset_refault(void *shadow) * configurations instead. */ memcg = mem_cgroup_from_id(memcgid); - if (!mem_cgroup_disabled() && !memcg) { - rcu_read_unlock(); - return false; - } + if (!mem_cgroup_disabled() && !memcg) + goto out; lruvec = mem_cgroup_lruvec(pgdat, memcg); refault = atomic_long_read(&lruvec->inactive_age); active_file = lruvec_lru_size(lruvec, LRU_ACTIVE_FILE, MAX_NR_ZONES);
/* + * Calculate the refault distance * The unsigned subtraction here gives an accurate distance - * across inactive_age overflows in most cases. - * - * There is a special case: usually, shadow entries have a - * short lifetime and are either refaulted or reclaimed along - * with the inode before they get too old. But it is not - * impossible for the inactive_age to lap a shadow entry in - * the field, which can then can result in a false small - * refault distance, leading to a false activation should this - * old entry actually refault again. However, earlier kernels - * used to deactivate unconditionally with *every* reclaim - * invocation for the longest time, so the occasional - * inappropriate activation leading to pressure on the active - * list is not a problem. + * across inactive_age overflows in most cases. There is a + * special case: usually, shadow entries have a short lifetime + * and are either refaulted or reclaimed along with the inode + * before they get too old. But it is not impossible for the + * inactive_age to lap a shadow entry in the field, which can + * then result in a false small refault distance, leading to a + * false activation should this old entry actually refault + * again. However, earlier kernels used to deactivate + * unconditionally with *every* reclaim invocation for the + * longest time, so the occasional inappropriate activation + * leading to pressure on the active list is not a problem. */ refault_distance = (refault - eviction) & EVICTION_MASK;
inc_lruvec_state(lruvec, WORKINGSET_REFAULT);
- if (refault_distance <= active_file) { - inc_lruvec_state(lruvec, WORKINGSET_ACTIVATE); - rcu_read_unlock(); - return true; + /* + * Compare the distance to the existing workingset size. We + * don't act on pages that couldn't stay resident even if all + * the memory was available to the page cache. + */ + if (refault_distance > active_file) + goto out; + + SetPageActive(page); + atomic_long_inc(&lruvec->inactive_age); + inc_lruvec_state(lruvec, WORKINGSET_ACTIVATE); + + /* Page was active prior to eviction */ + if (workingset) { + SetPageWorkingset(page); + inc_lruvec_state(lruvec, WORKINGSET_RESTORE); } +out: rcu_read_unlock(); - return false; }
/**
From: Johannes Weiner hannes@cmpxchg.org
mainline inclusion from mainline-v5.4 commit 8508cf3ffad4defa202b303e5b6379efc4cd9054 category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I47QS2 CVE: NA
--------------------------------
There are several definitions of those functions/macros in places that mess with fixed-point load averages. Provide an official version.
[akpm@linux-foundation.org: fix missed conversion in block/blk-iolatency.c] Link: http://lkml.kernel.org/r/20180828172258.3185-5-hannes@cmpxchg.org Signed-off-by: Johannes Weiner hannes@cmpxchg.org Acked-by: Peter Zijlstra (Intel) peterz@infradead.org Tested-by: Suren Baghdasaryan surenb@google.com Tested-by: Daniel Drake drake@endlessm.com Cc: Christopher Lameter cl@linux.com Cc: Ingo Molnar mingo@redhat.com Cc: Johannes Weiner jweiner@fb.com Cc: Mike Galbraith efault@gmx.de Cc: Peter Enderborg peter.enderborg@sony.com Cc: Randy Dunlap rdunlap@infradead.org Cc: Shakeel Butt shakeelb@google.com Cc: Tejun Heo tj@kernel.org Cc: Vinayak Menon vinmenon@codeaurora.org Signed-off-by: Andrew Morton akpm@linux-foundation.org Signed-off-by: Linus Torvalds torvalds@linux-foundation.org Signed-off-by: Liu Xinpeng liuxp11@chinatelecom.cn Signed-off-by: Ctyun Kernel ctyuncommiter01@chinatelecom.cn --- arch/powerpc/platforms/cell/cpufreq_spudemand.c | 2 +- arch/powerpc/platforms/cell/spufs/sched.c | 9 +++------ arch/s390/appldata/appldata_os.c | 4 ---- block/blk-iolatency.c | 8 +++++--- drivers/cpuidle/governors/menu.c | 4 ---- fs/proc/loadavg.c | 3 --- include/linux/sched/loadavg.h | 21 +++++++++++++++++---- kernel/debug/kdb/kdb_main.c | 7 +------ kernel/sched/loadavg.c | 15 --------------- 9 files changed, 27 insertions(+), 46 deletions(-)
diff --git a/arch/powerpc/platforms/cell/cpufreq_spudemand.c b/arch/powerpc/platforms/cell/cpufreq_spudemand.c index 882944c..5d8e8b6 100644 --- a/arch/powerpc/platforms/cell/cpufreq_spudemand.c +++ b/arch/powerpc/platforms/cell/cpufreq_spudemand.c @@ -49,7 +49,7 @@ static int calc_freq(struct spu_gov_info_struct *info) cpu = info->policy->cpu; busy_spus = atomic_read(&cbe_spu_info[cpu_to_node(cpu)].busy_spus);
- CALC_LOAD(info->busy_spus, EXP, busy_spus * FIXED_1); + info->busy_spus = calc_load(info->busy_spus, EXP, busy_spus * FIXED_1); pr_debug("cpu %d: busy_spus=%d, info->busy_spus=%ld\n", cpu, busy_spus, info->busy_spus);
diff --git a/arch/powerpc/platforms/cell/spufs/sched.c b/arch/powerpc/platforms/cell/spufs/sched.c index c9ef3c5..9fcccb4 100644 --- a/arch/powerpc/platforms/cell/spufs/sched.c +++ b/arch/powerpc/platforms/cell/spufs/sched.c @@ -987,9 +987,9 @@ static void spu_calc_load(void) unsigned long active_tasks; /* fixed-point */
active_tasks = count_active_contexts() * FIXED_1; - CALC_LOAD(spu_avenrun[0], EXP_1, active_tasks); - CALC_LOAD(spu_avenrun[1], EXP_5, active_tasks); - CALC_LOAD(spu_avenrun[2], EXP_15, active_tasks); + spu_avenrun[0] = calc_load(spu_avenrun[0], EXP_1, active_tasks); + spu_avenrun[1] = calc_load(spu_avenrun[1], EXP_5, active_tasks); + spu_avenrun[2] = calc_load(spu_avenrun[2], EXP_15, active_tasks); }
static void spusched_wake(struct timer_list *unused) @@ -1071,9 +1071,6 @@ void spuctx_switch_state(struct spu_context *ctx, } }
-#define LOAD_INT(x) ((x) >> FSHIFT) -#define LOAD_FRAC(x) LOAD_INT(((x) & (FIXED_1-1)) * 100) - static int show_spu_loadavg(struct seq_file *s, void *private) { int a, b, c; diff --git a/arch/s390/appldata/appldata_os.c b/arch/s390/appldata/appldata_os.c index 433a994..54f3756 100644 --- a/arch/s390/appldata/appldata_os.c +++ b/arch/s390/appldata/appldata_os.c @@ -25,10 +25,6 @@
#include "appldata.h"
- -#define LOAD_INT(x) ((x) >> FSHIFT) -#define LOAD_FRAC(x) LOAD_INT(((x) & (FIXED_1-1)) * 100) - /* * OS data * diff --git a/block/blk-iolatency.c b/block/blk-iolatency.c index f84fdc7..fa0421e 100644 --- a/block/blk-iolatency.c +++ b/block/blk-iolatency.c @@ -154,7 +154,7 @@ struct iolatency_grp { #define BLKIOLATENCY_MAX_WIN_SIZE NSEC_PER_SEC /* * These are the constants used to fake the fixed-point moving average - * calculation just like load average. The call to CALC_LOAD folds + * calculation just like load average. The call to calc_load() folds * (FIXED_1 (2048) - exp_factor) * new_sample into lat_avg. The sampling * window size is bucketed to try to approximately calculate average * latency such that 1/exp (decay rate) is [1 min, 2.5 min) when windows @@ -249,7 +249,7 @@ static inline void iolat_update_total_lat_avg(struct iolatency_grp *iolat, return;
/* - * CALC_LOAD takes in a number stored in fixed point representation. + * calc_load() takes in a number stored in fixed point representation. * Because we are using this for IO time in ns, the values stored * are significantly larger than the FIXED_1 denominator (2048). * Therefore, rounding errors in the calculation are negligible and @@ -258,7 +258,9 @@ static inline void iolat_update_total_lat_avg(struct iolatency_grp *iolat, exp_idx = min_t(int, BLKIOLATENCY_NR_EXP_FACTORS - 1, div64_u64(iolat->cur_win_nsec, BLKIOLATENCY_EXP_BUCKET_SIZE)); - CALC_LOAD(iolat->lat_avg, iolatency_exp_factors[exp_idx], stat->rqs.mean); + iolat->lat_avg = calc_load(iolat->lat_avg, + iolatency_exp_factors[exp_idx], + stat->rqs.mean); }
static inline bool iolatency_may_queue(struct iolatency_grp *iolat, diff --git a/drivers/cpuidle/governors/menu.c b/drivers/cpuidle/governors/menu.c index 955e776..89bfa97 100644 --- a/drivers/cpuidle/governors/menu.c +++ b/drivers/cpuidle/governors/menu.c @@ -131,10 +131,6 @@ struct menu_device { int interval_ptr; };
- -#define LOAD_INT(x) ((x) >> FSHIFT) -#define LOAD_FRAC(x) LOAD_INT(((x) & (FIXED_1-1)) * 100) - static inline int get_loadavg(unsigned long load) { return LOAD_INT(load) * 10 + LOAD_FRAC(load) / 10; diff --git a/fs/proc/loadavg.c b/fs/proc/loadavg.c index d066947..8468bae 100644 --- a/fs/proc/loadavg.c +++ b/fs/proc/loadavg.c @@ -10,9 +10,6 @@ #include <linux/seqlock.h> #include <linux/time.h>
-#define LOAD_INT(x) ((x) >> FSHIFT) -#define LOAD_FRAC(x) LOAD_INT(((x) & (FIXED_1-1)) * 100) - static int loadavg_proc_show(struct seq_file *m, void *v) { unsigned long avnrun[3]; diff --git a/include/linux/sched/loadavg.h b/include/linux/sched/loadavg.h index 80bc84b..cc9cc62 100644 --- a/include/linux/sched/loadavg.h +++ b/include/linux/sched/loadavg.h @@ -22,10 +22,23 @@ #define EXP_5 2014 /* 1/exp(5sec/5min) */ #define EXP_15 2037 /* 1/exp(5sec/15min) */
-#define CALC_LOAD(load,exp,n) \ - load *= exp; \ - load += n*(FIXED_1-exp); \ - load >>= FSHIFT; +/* + * a1 = a0 * e + a * (1 - e) + */ +static inline unsigned long +calc_load(unsigned long load, unsigned long exp, unsigned long active) +{ + unsigned long newload; + + newload = load * exp + active * (FIXED_1 - exp); + if (active >= load) + newload += FIXED_1-1; + + return newload / FIXED_1; +} + +#define LOAD_INT(x) ((x) >> FSHIFT) +#define LOAD_FRAC(x) LOAD_INT(((x) & (FIXED_1-1)) * 100)
extern void calc_global_load(unsigned long ticks);
diff --git a/kernel/debug/kdb/kdb_main.c b/kernel/debug/kdb/kdb_main.c index f338d23..9592420 100644 --- a/kernel/debug/kdb/kdb_main.c +++ b/kernel/debug/kdb/kdb_main.c @@ -2556,16 +2556,11 @@ static int kdb_summary(int argc, const char **argv) } kdb_printf("%02ld:%02ld\n", val.uptime/(60*60), (val.uptime/60)%60);
- /* lifted from fs/proc/proc_misc.c::loadavg_read_proc() */ - -#define LOAD_INT(x) ((x) >> FSHIFT) -#define LOAD_FRAC(x) LOAD_INT(((x) & (FIXED_1-1)) * 100) kdb_printf("load avg %ld.%02ld %ld.%02ld %ld.%02ld\n", LOAD_INT(val.loads[0]), LOAD_FRAC(val.loads[0]), LOAD_INT(val.loads[1]), LOAD_FRAC(val.loads[1]), LOAD_INT(val.loads[2]), LOAD_FRAC(val.loads[2])); -#undef LOAD_INT -#undef LOAD_FRAC + /* Display in kilobytes */ #define K(x) ((x) << (PAGE_SHIFT - 10)) kdb_printf("\nMemTotal: %8lu kB\nMemFree: %8lu kB\n" diff --git a/kernel/sched/loadavg.c b/kernel/sched/loadavg.c index a171c12..54fbdfb 100644 --- a/kernel/sched/loadavg.c +++ b/kernel/sched/loadavg.c @@ -91,21 +91,6 @@ long calc_load_fold_active(struct rq *this_rq, long adjust) return delta; }
-/* - * a1 = a0 * e + a * (1 - e) - */ -static unsigned long -calc_load(unsigned long load, unsigned long exp, unsigned long active) -{ - unsigned long newload; - - newload = load * exp + active * (FIXED_1 - exp); - if (active >= load) - newload += FIXED_1-1; - - return newload / FIXED_1; -} - #ifdef CONFIG_NO_HZ_COMMON /* * Handle NO_HZ for the global load-average.
From: Johannes Weiner hannes@cmpxchg.org
mainline inclusion from mainline-v5.4 commit 5c54f5b9edb1aa2eabbb1091c458f1b6776a1896 category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I47QS2 CVE: NA
--------------------------------
It's going to be used in a later patch. Keep the churn separate.
Link: http://lkml.kernel.org/r/20180828172258.3185-6-hannes@cmpxchg.org Signed-off-by: Johannes Weiner hannes@cmpxchg.org Acked-by: Peter Zijlstra (Intel) peterz@infradead.org Tested-by: Suren Baghdasaryan surenb@google.com Tested-by: Daniel Drake drake@endlessm.com Cc: Christopher Lameter cl@linux.com Cc: Ingo Molnar mingo@redhat.com Cc: Johannes Weiner jweiner@fb.com Cc: Mike Galbraith efault@gmx.de Cc: Peter Enderborg peter.enderborg@sony.com Cc: Randy Dunlap rdunlap@infradead.org Cc: Shakeel Butt shakeelb@google.com Cc: Tejun Heo tj@kernel.org Cc: Vinayak Menon vinmenon@codeaurora.org Signed-off-by: Andrew Morton akpm@linux-foundation.org Signed-off-by: Linus Torvalds torvalds@linux-foundation.org Signed-off-by: Liu Xinpeng liuxp11@chinatelecom.cn Signed-off-by: Ctyun Kernel ctyuncommiter01@chinatelecom.cn --- include/linux/sched/loadavg.h | 3 + kernel/sched/loadavg.c | 138 +++++++++++++++++++++--------------------- 2 files changed, 72 insertions(+), 69 deletions(-)
diff --git a/include/linux/sched/loadavg.h b/include/linux/sched/loadavg.h index cc9cc62..4859bea 100644 --- a/include/linux/sched/loadavg.h +++ b/include/linux/sched/loadavg.h @@ -37,6 +37,9 @@ return newload / FIXED_1; }
+extern unsigned long calc_load_n(unsigned long load, unsigned long exp, + unsigned long active, unsigned int n); + #define LOAD_INT(x) ((x) >> FSHIFT) #define LOAD_FRAC(x) LOAD_INT(((x) & (FIXED_1-1)) * 100)
diff --git a/kernel/sched/loadavg.c b/kernel/sched/loadavg.c index 54fbdfb..28a5165 100644 --- a/kernel/sched/loadavg.c +++ b/kernel/sched/loadavg.c @@ -91,6 +91,75 @@ long calc_load_fold_active(struct rq *this_rq, long adjust) return delta; }
+/** + * fixed_power_int - compute: x^n, in O(log n) time + * + * @x: base of the power + * @frac_bits: fractional bits of @x + * @n: power to raise @x to. + * + * By exploiting the relation between the definition of the natural power + * function: x^n := x*x*...*x (x multiplied by itself for n times), and + * the binary encoding of numbers used by computers: n := \Sum n_i * 2^i, + * (where: n_i \elem {0, 1}, the binary vector representing n), + * we find: x^n := x^(\Sum n_i * 2^i) := \Prod x^(n_i * 2^i), which is + * of course trivially computable in O(log_2 n), the length of our binary + * vector. + */ +static unsigned long +fixed_power_int(unsigned long x, unsigned int frac_bits, unsigned int n) +{ + unsigned long result = 1UL << frac_bits; + + if (n) { + for (;;) { + if (n & 1) { + result *= x; + result += 1UL << (frac_bits - 1); + result >>= frac_bits; + } + n >>= 1; + if (!n) + break; + x *= x; + x += 1UL << (frac_bits - 1); + x >>= frac_bits; + } + } + + return result; +} + +/* + * a1 = a0 * e + a * (1 - e) + * + * a2 = a1 * e + a * (1 - e) + * = (a0 * e + a * (1 - e)) * e + a * (1 - e) + * = a0 * e^2 + a * (1 - e) * (1 + e) + * + * a3 = a2 * e + a * (1 - e) + * = (a0 * e^2 + a * (1 - e) * (1 + e)) * e + a * (1 - e) + * = a0 * e^3 + a * (1 - e) * (1 + e + e^2) + * + * ... + * + * an = a0 * e^n + a * (1 - e) * (1 + e + ... + e^n-1) [1] + * = a0 * e^n + a * (1 - e) * (1 - e^n)/(1 - e) + * = a0 * e^n + a * (1 - e^n) + * + * [1] application of the geometric series: + * + * n 1 - x^(n+1) + * S_n := \Sum x^i = ------------- + * i=0 1 - x + */ +unsigned long +calc_load_n(unsigned long load, unsigned long exp, + unsigned long active, unsigned int n) +{ + return calc_load(load, fixed_power_int(exp, FSHIFT, n), active); +} + #ifdef CONFIG_NO_HZ_COMMON /* * Handle NO_HZ for the global load-average. @@ -210,75 +279,6 @@ static long calc_load_nohz_fold(void) return delta; }
-/** - * fixed_power_int - compute: x^n, in O(log n) time - * - * @x: base of the power - * @frac_bits: fractional bits of @x - * @n: power to raise @x to. - * - * By exploiting the relation between the definition of the natural power - * function: x^n := x*x*...*x (x multiplied by itself for n times), and - * the binary encoding of numbers used by computers: n := \Sum n_i * 2^i, - * (where: n_i \elem {0, 1}, the binary vector representing n), - * we find: x^n := x^(\Sum n_i * 2^i) := \Prod x^(n_i * 2^i), which is - * of course trivially computable in O(log_2 n), the length of our binary - * vector. - */ -static unsigned long -fixed_power_int(unsigned long x, unsigned int frac_bits, unsigned int n) -{ - unsigned long result = 1UL << frac_bits; - - if (n) { - for (;;) { - if (n & 1) { - result *= x; - result += 1UL << (frac_bits - 1); - result >>= frac_bits; - } - n >>= 1; - if (!n) - break; - x *= x; - x += 1UL << (frac_bits - 1); - x >>= frac_bits; - } - } - - return result; -} - -/* - * a1 = a0 * e + a * (1 - e) - * - * a2 = a1 * e + a * (1 - e) - * = (a0 * e + a * (1 - e)) * e + a * (1 - e) - * = a0 * e^2 + a * (1 - e) * (1 + e) - * - * a3 = a2 * e + a * (1 - e) - * = (a0 * e^2 + a * (1 - e) * (1 + e)) * e + a * (1 - e) - * = a0 * e^3 + a * (1 - e) * (1 + e + e^2) - * - * ... - * - * an = a0 * e^n + a * (1 - e) * (1 + e + ... + e^n-1) [1] - * = a0 * e^n + a * (1 - e) * (1 - e^n)/(1 - e) - * = a0 * e^n + a * (1 - e^n) - * - * [1] application of the geometric series: - * - * n 1 - x^(n+1) - * S_n := \Sum x^i = ------------- - * i=0 1 - x - */ -static unsigned long -calc_load_n(unsigned long load, unsigned long exp, - unsigned long active, unsigned int n) -{ - return calc_load(load, fixed_power_int(exp, FSHIFT, n), active); -} - /* * NO_HZ can leave us missing all per-CPU ticks calling * calc_load_fold_active(), but since a NO_HZ CPU folds its delta into
From: Johannes Weiner hannes@cmpxchg.org
mainline inclusion from mainline-v5.4 commit 1f351d7f7590857ea281579c26e6045b4c548ef4 category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I47QS2 CVE: NA
--------------------------------
kernel/sched/sched.h includes "stats.h" half-way through the file. The next patch introduces users of sched.h's rq locking functions and update_rq_clock() in kernel/sched/stats.h. Move those definitions up in the file so they are available in stats.h.
Link: http://lkml.kernel.org/r/20180828172258.3185-7-hannes@cmpxchg.org Signed-off-by: Johannes Weiner hannes@cmpxchg.org Acked-by: Peter Zijlstra (Intel) peterz@infradead.org Tested-by: Suren Baghdasaryan surenb@google.com Tested-by: Daniel Drake drake@endlessm.com Cc: Christopher Lameter cl@linux.com Cc: Ingo Molnar mingo@redhat.com Cc: Johannes Weiner jweiner@fb.com Cc: Mike Galbraith efault@gmx.de Cc: Peter Enderborg peter.enderborg@sony.com Cc: Randy Dunlap rdunlap@infradead.org Cc: Shakeel Butt shakeelb@google.com Cc: Tejun Heo tj@kernel.org Cc: Vinayak Menon vinmenon@codeaurora.org Signed-off-by: Andrew Morton akpm@linux-foundation.org Signed-off-by: Linus Torvalds torvalds@linux-foundation.org Signed-off-by: Liu Xinpeng liuxp11@chinatelecom.cn Signed-off-by: Ctyun Kernel ctyuncommiter01@chinatelecom.cn --- kernel/sched/sched.h | 164 +++++++++++++++++++++++++-------------------------- 1 file changed, 82 insertions(+), 82 deletions(-)
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h index ca2fd2d..9aa672f 100644 --- a/kernel/sched/sched.h +++ b/kernel/sched/sched.h @@ -1010,6 +1010,8 @@ static inline void update_idle_core(struct rq *rq) { } #define cpu_curr(cpu) (cpu_rq(cpu)->curr) #define raw_rq() raw_cpu_ptr(&runqueues)
+extern void update_rq_clock(struct rq *rq); + static inline u64 __rq_clock_broken(struct rq *rq) { return READ_ONCE(rq->clock); @@ -1128,6 +1130,86 @@ static inline void rq_repin_lock(struct rq *rq, struct rq_flags *rf) #endif }
+struct rq *__task_rq_lock(struct task_struct *p, struct rq_flags *rf) + __acquires(rq->lock); + +struct rq *task_rq_lock(struct task_struct *p, struct rq_flags *rf) + __acquires(p->pi_lock) + __acquires(rq->lock); + +static inline void __task_rq_unlock(struct rq *rq, struct rq_flags *rf) + __releases(rq->lock) +{ + rq_unpin_lock(rq, rf); + raw_spin_unlock(&rq->lock); +} + +static inline void +task_rq_unlock(struct rq *rq, struct task_struct *p, struct rq_flags *rf) + __releases(rq->lock) + __releases(p->pi_lock) +{ + rq_unpin_lock(rq, rf); + raw_spin_unlock(&rq->lock); + raw_spin_unlock_irqrestore(&p->pi_lock, rf->flags); +} + +static inline void +rq_lock_irqsave(struct rq *rq, struct rq_flags *rf) + __acquires(rq->lock) +{ + raw_spin_lock_irqsave(&rq->lock, rf->flags); + rq_pin_lock(rq, rf); +} + +static inline void +rq_lock_irq(struct rq *rq, struct rq_flags *rf) + __acquires(rq->lock) +{ + raw_spin_lock_irq(&rq->lock); + rq_pin_lock(rq, rf); +} + +static inline void +rq_lock(struct rq *rq, struct rq_flags *rf) + __acquires(rq->lock) +{ + raw_spin_lock(&rq->lock); + rq_pin_lock(rq, rf); +} + +static inline void +rq_relock(struct rq *rq, struct rq_flags *rf) + __acquires(rq->lock) +{ + raw_spin_lock(&rq->lock); + rq_repin_lock(rq, rf); +} + +static inline void +rq_unlock_irqrestore(struct rq *rq, struct rq_flags *rf) + __releases(rq->lock) +{ + rq_unpin_lock(rq, rf); + raw_spin_unlock_irqrestore(&rq->lock, rf->flags); +} + +static inline void +rq_unlock_irq(struct rq *rq, struct rq_flags *rf) + __releases(rq->lock) +{ + rq_unpin_lock(rq, rf); + raw_spin_unlock_irq(&rq->lock); +} + +static inline void +rq_unlock(struct rq *rq, struct rq_flags *rf) + __releases(rq->lock) +{ + rq_unpin_lock(rq, rf); + raw_spin_unlock(&rq->lock); +} + #ifdef CONFIG_NUMA enum numa_topology_type { NUMA_DIRECT, @@ -1777,8 +1859,6 @@ static inline void sub_nr_running(struct rq *rq, unsigned count) sched_update_tick_dependency(rq); }
-extern void update_rq_clock(struct rq *rq); - extern void activate_task(struct rq *rq, struct task_struct *p, int flags); extern void deactivate_task(struct rq *rq, struct task_struct *p, int flags);
@@ -1843,86 +1923,6 @@ unsigned long arch_scale_cpu_capacity(void __always_unused *sd, int cpu) #endif #endif
-struct rq *__task_rq_lock(struct task_struct *p, struct rq_flags *rf) - __acquires(rq->lock); - -struct rq *task_rq_lock(struct task_struct *p, struct rq_flags *rf) - __acquires(p->pi_lock) - __acquires(rq->lock); - -static inline void __task_rq_unlock(struct rq *rq, struct rq_flags *rf) - __releases(rq->lock) -{ - rq_unpin_lock(rq, rf); - raw_spin_unlock(&rq->lock); -} - -static inline void -task_rq_unlock(struct rq *rq, struct task_struct *p, struct rq_flags *rf) - __releases(rq->lock) - __releases(p->pi_lock) -{ - rq_unpin_lock(rq, rf); - raw_spin_unlock(&rq->lock); - raw_spin_unlock_irqrestore(&p->pi_lock, rf->flags); -} - -static inline void -rq_lock_irqsave(struct rq *rq, struct rq_flags *rf) - __acquires(rq->lock) -{ - raw_spin_lock_irqsave(&rq->lock, rf->flags); - rq_pin_lock(rq, rf); -} - -static inline void -rq_lock_irq(struct rq *rq, struct rq_flags *rf) - __acquires(rq->lock) -{ - raw_spin_lock_irq(&rq->lock); - rq_pin_lock(rq, rf); -} - -static inline void -rq_lock(struct rq *rq, struct rq_flags *rf) - __acquires(rq->lock) -{ - raw_spin_lock(&rq->lock); - rq_pin_lock(rq, rf); -} - -static inline void -rq_relock(struct rq *rq, struct rq_flags *rf) - __acquires(rq->lock) -{ - raw_spin_lock(&rq->lock); - rq_repin_lock(rq, rf); -} - -static inline void -rq_unlock_irqrestore(struct rq *rq, struct rq_flags *rf) - __releases(rq->lock) -{ - rq_unpin_lock(rq, rf); - raw_spin_unlock_irqrestore(&rq->lock, rf->flags); -} - -static inline void -rq_unlock_irq(struct rq *rq, struct rq_flags *rf) - __releases(rq->lock) -{ - rq_unpin_lock(rq, rf); - raw_spin_unlock_irq(&rq->lock); -} - -static inline void -rq_unlock(struct rq *rq, struct rq_flags *rf) - __releases(rq->lock) -{ - rq_unpin_lock(rq, rf); - raw_spin_unlock(&rq->lock); -} - #ifdef CONFIG_SMP #ifdef CONFIG_PREEMPT
From: Johannes Weiner hannes@cmpxchg.org
mainline inclusion from mainline-v5.4 commit 246b3b3342c9b0a2e24cda2178be87bc36e1c874 category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I47QS2 CVE: NA
--------------------------------
do_sched_yield() disables IRQs, looks up this_rq() and locks it. The next patch is adding another site with the same pattern, so provide a convenience function for it.
Link: http://lkml.kernel.org/r/20180828172258.3185-8-hannes@cmpxchg.org Signed-off-by: Johannes Weiner hannes@cmpxchg.org Acked-by: Peter Zijlstra (Intel) peterz@infradead.org Tested-by: Suren Baghdasaryan surenb@google.com Tested-by: Daniel Drake drake@endlessm.com Cc: Christopher Lameter cl@linux.com Cc: Ingo Molnar mingo@redhat.com Cc: Johannes Weiner jweiner@fb.com Cc: Mike Galbraith efault@gmx.de Cc: Peter Enderborg peter.enderborg@sony.com Cc: Randy Dunlap rdunlap@infradead.org Cc: Shakeel Butt shakeelb@google.com Cc: Tejun Heo tj@kernel.org Cc: Vinayak Menon vinmenon@codeaurora.org Signed-off-by: Andrew Morton akpm@linux-foundation.org Signed-off-by: Linus Torvalds torvalds@linux-foundation.org Signed-off-by: Liu Xinpeng liuxp11@chinatelecom.cn Signed-off-by: Ctyun Kernel ctyuncommiter01@chinatelecom.cn --- kernel/sched/core.c | 4 +--- kernel/sched/sched.h | 12 ++++++++++++ 2 files changed, 13 insertions(+), 3 deletions(-)
diff --git a/kernel/sched/core.c b/kernel/sched/core.c index 8866cd7..6bc1b19 100644 --- a/kernel/sched/core.c +++ b/kernel/sched/core.c @@ -5014,9 +5014,7 @@ static void do_sched_yield(void) struct rq_flags rf; struct rq *rq;
- local_irq_disable(); - rq = this_rq(); - rq_lock(rq, &rf); + rq = this_rq_lock_irq(&rf);
schedstat_inc(rq->yld_count); current->sched_class->yield_task(rq); diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h index 9aa672f..6e3171b 100644 --- a/kernel/sched/sched.h +++ b/kernel/sched/sched.h @@ -1210,6 +1210,18 @@ static inline void __task_rq_unlock(struct rq *rq, struct rq_flags *rf) raw_spin_unlock(&rq->lock); }
+static inline struct rq * +this_rq_lock_irq(struct rq_flags *rf) + __acquires(rq->lock) +{ + struct rq *rq; + + local_irq_disable(); + rq = this_rq(); + rq_lock(rq, rf); + return rq; +} + #ifdef CONFIG_NUMA enum numa_topology_type { NUMA_DIRECT,
From: Johannes Weiner hannes@cmpxchg.org
mainline inclusion from mainline-v5.4 commit eb414681d5a07d28d2ff90dc05f69ec6b232ebd2 category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I47QS2 CVE: NA
--------------------------------
When systems are overcommitted and resources become contended, it's hard to tell exactly the impact this has on workload productivity, or how close the system is to lockups and OOM kills. In particular, when machines work multiple jobs concurrently, the impact of overcommit in terms of latency and throughput on the individual job can be enormous.
In order to maximize hardware utilization without sacrificing individual job health or risk complete machine lockups, this patch implements a way to quantify resource pressure in the system.
A kernel built with CONFIG_PSI=y creates files in /proc/pressure/ that expose the percentage of time the system is stalled on CPU, memory, or IO, respectively. Stall states are aggregate versions of the per-task delay accounting delays:
cpu: some tasks are runnable but not executing on a CPU memory: tasks are reclaiming, or waiting for swapin or thrashing cache io: tasks are waiting for io completions
These percentages of walltime can be thought of as pressure percentages, and they give a general sense of system health and productivity loss incurred by resource overcommit. They can also indicate when the system is approaching lockup scenarios and OOMs.
To do this, psi keeps track of the task states associated with each CPU and samples the time they spend in stall states. Every 2 seconds, the samples are averaged across CPUs - weighted by the CPUs' non-idle time to eliminate artifacts from unused CPUs - and translated into percentages of walltime. A running average of those percentages is maintained over 10s, 1m, and 5m periods (similar to the loadaverage).
[hannes@cmpxchg.org: doc fixlet, per Randy] Link: http://lkml.kernel.org/r/20180828205625.GA14030@cmpxchg.org [hannes@cmpxchg.org: code optimization] Link: http://lkml.kernel.org/r/20180907175015.GA8479@cmpxchg.org [hannes@cmpxchg.org: rename psi_clock() to psi_update_work(), per Peter] Link: http://lkml.kernel.org/r/20180907145404.GB11088@cmpxchg.org [hannes@cmpxchg.org: fix build] Link: http://lkml.kernel.org/r/20180913014222.GA2370@cmpxchg.org Link: http://lkml.kernel.org/r/20180828172258.3185-9-hannes@cmpxchg.org Signed-off-by: Johannes Weiner hannes@cmpxchg.org Acked-by: Peter Zijlstra (Intel) peterz@infradead.org Tested-by: Daniel Drake drake@endlessm.com Tested-by: Suren Baghdasaryan surenb@google.com Cc: Christopher Lameter cl@linux.com Cc: Ingo Molnar mingo@redhat.com Cc: Johannes Weiner jweiner@fb.com Cc: Mike Galbraith efault@gmx.de Cc: Peter Enderborg peter.enderborg@sony.com Cc: Randy Dunlap rdunlap@infradead.org Cc: Shakeel Butt shakeelb@google.com Cc: Tejun Heo tj@kernel.org Cc: Vinayak Menon vinmenon@codeaurora.org Cc: Randy Dunlap rdunlap@infradead.org Signed-off-by: Andrew Morton akpm@linux-foundation.org Signed-off-by: Linus Torvalds torvalds@linux-foundation.org Signed-off-by: Liu Xinpeng liuxp11@chinatelecom.cn Signed-off-by: Ctyun Kernel ctyuncommiter01@chinatelecom.cn --- Documentation/accounting/psi.txt | 64 ++++ include/linux/psi.h | 28 ++ include/linux/psi_types.h | 92 ++++++ include/linux/sched.h | 10 + init/Kconfig | 15 + kernel/fork.c | 4 + kernel/sched/Makefile | 1 + kernel/sched/core.c | 12 +- kernel/sched/psi.c | 657 +++++++++++++++++++++++++++++++++++++++ kernel/sched/sched.h | 2 + kernel/sched/stats.h | 86 +++++ mm/compaction.c | 5 + mm/filemap.c | 11 + mm/page_alloc.c | 9 + mm/vmscan.c | 9 + 15 files changed, 1003 insertions(+), 2 deletions(-) create mode 100644 Documentation/accounting/psi.txt create mode 100644 include/linux/psi.h create mode 100644 include/linux/psi_types.h create mode 100644 kernel/sched/psi.c
diff --git a/Documentation/accounting/psi.txt b/Documentation/accounting/psi.txt new file mode 100644 index 00000000..3753a82 --- /dev/null +++ b/Documentation/accounting/psi.txt @@ -0,0 +1,64 @@ +================================ +PSI - Pressure Stall Information +================================ + +:Date: April, 2018 +:Author: Johannes Weiner hannes@cmpxchg.org + +When CPU, memory or IO devices are contended, workloads experience +latency spikes, throughput losses, and run the risk of OOM kills. + +Without an accurate measure of such contention, users are forced to +either play it safe and under-utilize their hardware resources, or +roll the dice and frequently suffer the disruptions resulting from +excessive overcommit. + +The psi feature identifies and quantifies the disruptions caused by +such resource crunches and the time impact it has on complex workloads +or even entire systems. + +Having an accurate measure of productivity losses caused by resource +scarcity aids users in sizing workloads to hardware--or provisioning +hardware according to workload demand. + +As psi aggregates this information in realtime, systems can be managed +dynamically using techniques such as load shedding, migrating jobs to +other systems or data centers, or strategically pausing or killing low +priority or restartable batch jobs. + +This allows maximizing hardware utilization without sacrificing +workload health or risking major disruptions such as OOM kills. + +Pressure interface +================== + +Pressure information for each resource is exported through the +respective file in /proc/pressure/ -- cpu, memory, and io. + +The format for CPU is as such: + +some avg10=0.00 avg60=0.00 avg300=0.00 total=0 + +and for memory and IO: + +some avg10=0.00 avg60=0.00 avg300=0.00 total=0 +full avg10=0.00 avg60=0.00 avg300=0.00 total=0 + +The "some" line indicates the share of time in which at least some +tasks are stalled on a given resource. + +The "full" line indicates the share of time in which all non-idle +tasks are stalled on a given resource simultaneously. In this state +actual CPU cycles are going to waste, and a workload that spends +extended time in this state is considered to be thrashing. This has +severe impact on performance, and it's useful to distinguish this +situation from a state where some tasks are stalled but the CPU is +still doing productive work. As such, time spent in this subset of the +stall state is tracked separately and exported in the "full" averages. + +The ratios are tracked as recent trends over ten, sixty, and three +hundred second windows, which gives insight into short term events as +well as medium and long term trends. The total absolute stall time is +tracked and exported as well, to allow detection of latency spikes +which wouldn't necessarily make a dent in the time averages, or to +average trends over custom time frames. diff --git a/include/linux/psi.h b/include/linux/psi.h new file mode 100644 index 00000000..b0daf05 --- /dev/null +++ b/include/linux/psi.h @@ -0,0 +1,28 @@ +#ifndef _LINUX_PSI_H +#define _LINUX_PSI_H + +#include <linux/psi_types.h> +#include <linux/sched.h> + +#ifdef CONFIG_PSI + +extern bool psi_disabled; + +void psi_init(void); + +void psi_task_change(struct task_struct *task, int clear, int set); + +void psi_memstall_tick(struct task_struct *task, int cpu); +void psi_memstall_enter(unsigned long *flags); +void psi_memstall_leave(unsigned long *flags); + +#else /* CONFIG_PSI */ + +static inline void psi_init(void) {} + +static inline void psi_memstall_enter(unsigned long *flags) {} +static inline void psi_memstall_leave(unsigned long *flags) {} + +#endif /* CONFIG_PSI */ + +#endif /* _LINUX_PSI_H */ diff --git a/include/linux/psi_types.h b/include/linux/psi_types.h new file mode 100644 index 00000000..2cf422d --- /dev/null +++ b/include/linux/psi_types.h @@ -0,0 +1,92 @@ +#ifndef _LINUX_PSI_TYPES_H +#define _LINUX_PSI_TYPES_H + +#include <linux/seqlock.h> +#include <linux/types.h> + +#ifdef CONFIG_PSI + +/* Tracked task states */ +enum psi_task_count { + NR_IOWAIT, + NR_MEMSTALL, + NR_RUNNING, + NR_PSI_TASK_COUNTS, +}; + +/* Task state bitmasks */ +#define TSK_IOWAIT (1 << NR_IOWAIT) +#define TSK_MEMSTALL (1 << NR_MEMSTALL) +#define TSK_RUNNING (1 << NR_RUNNING) + +/* Resources that workloads could be stalled on */ +enum psi_res { + PSI_IO, + PSI_MEM, + PSI_CPU, + NR_PSI_RESOURCES, +}; + +/* + * Pressure states for each resource: + * + * SOME: Stalled tasks & working tasks + * FULL: Stalled tasks & no working tasks + */ +enum psi_states { + PSI_IO_SOME, + PSI_IO_FULL, + PSI_MEM_SOME, + PSI_MEM_FULL, + PSI_CPU_SOME, + /* Only per-CPU, to weigh the CPU in the global average: */ + PSI_NONIDLE, + NR_PSI_STATES, +}; + +struct psi_group_cpu { + /* 1st cacheline updated by the scheduler */ + + /* Aggregator needs to know of concurrent changes */ + seqcount_t seq ____cacheline_aligned_in_smp; + + /* States of the tasks belonging to this group */ + unsigned int tasks[NR_PSI_TASK_COUNTS]; + + /* Period time sampling buckets for each state of interest (ns) */ + u32 times[NR_PSI_STATES]; + + /* Time of last task change in this group (rq_clock) */ + u64 state_start; + + /* 2nd cacheline updated by the aggregator */ + + /* Delta detection against the sampling buckets */ + u32 times_prev[NR_PSI_STATES] ____cacheline_aligned_in_smp; +}; + +struct psi_group { + /* Protects data updated during an aggregation */ + struct mutex stat_lock; + + /* Per-cpu task state & time tracking */ + struct psi_group_cpu __percpu *pcpu; + + /* Periodic aggregation state */ + u64 total_prev[NR_PSI_STATES - 1]; + u64 last_update; + u64 next_update; + struct delayed_work clock_work; + + /* Total stall times and sampled pressure averages */ + u64 total[NR_PSI_STATES - 1]; + unsigned long avg[NR_PSI_STATES - 1][3]; +}; + +#else /* CONFIG_PSI */ + +struct psi_group { }; + +#endif /* CONFIG_PSI */ + +#endif /* _LINUX_PSI_TYPES_H */ diff --git a/include/linux/sched.h b/include/linux/sched.h index ef03ea1..a812bc8 100644 --- a/include/linux/sched.h +++ b/include/linux/sched.h @@ -25,6 +25,7 @@ #include <linux/latencytop.h> #include <linux/sched/prio.h> #include <linux/signal_types.h> +#include <linux/psi_types.h> #include <linux/mm_types_task.h> #include <linux/task_io_accounting.h> #include <linux/rseq.h> @@ -723,6 +724,10 @@ struct task_struct { unsigned sched_contributes_to_load:1; unsigned sched_migrated:1; unsigned sched_remote_wakeup:1; +#ifdef CONFIG_PSI + unsigned sched_psi_wake_requeue:1; +#endif + /* Force alignment to the next boundary: */ unsigned :0;
@@ -976,6 +981,10 @@ struct task_struct { siginfo_t *last_siginfo;
struct task_io_accounting ioac; +#ifdef CONFIG_PSI + /* Pressure stall state */ + unsigned int psi_flags; +#endif #ifdef CONFIG_TASK_XACCT /* Accumulated RSS usage: */ u64 acct_rss_mem1; @@ -1441,6 +1450,7 @@ static inline int is_global_init(struct task_struct *tsk) #define PF_KTHREAD 0x00200000 /* I am a kernel thread */ #define PF_RANDOMIZE 0x00400000 /* Randomize virtual address space */ #define PF_SWAPWRITE 0x00800000 /* Allowed to write to swap */ +#define PF_MEMSTALL 0x01000000 /* Stalled due to lack of memory */ #define PF_NO_SETAFFINITY 0x04000000 /* Userland is not allowed to meddle with cpus_allowed */ #define PF_MCE_EARLY 0x08000000 /* Early kill for mce process policy */ #define PF_IO_WORKER 0x20000000 /* Task is an IO worker */ diff --git a/init/Kconfig b/init/Kconfig index 7334599..4ad842c 100644 --- a/init/Kconfig +++ b/init/Kconfig @@ -504,6 +504,21 @@ config TASK_IO_ACCOUNTING
Say N if unsure.
+config PSI + bool "Pressure stall information tracking" + help + Collect metrics that indicate how overcommitted the CPU, memory, + and IO capacity are in the system. + + If you say Y here, the kernel will create /proc/pressure/ with the + pressure statistics files cpu, memory, and io. These will indicate + the share of walltime in which some or all tasks in the system are + delayed due to contention of the respective resource. + + For more details see Documentation/accounting/psi.txt. + + Say N if unsure. + endmenu # "CPU/Task time and stats accounting"
config CPU_ISOLATION diff --git a/kernel/fork.c b/kernel/fork.c index e17aaa5..34df7bf 100644 --- a/kernel/fork.c +++ b/kernel/fork.c @@ -1889,6 +1889,10 @@ static __latent_entropy struct task_struct *copy_process(
p->default_timer_slack_ns = current->timer_slack_ns;
+#ifdef CONFIG_PSI + p->psi_flags = 0; +#endif + task_io_accounting_init(&p->ioac); acct_clear_integrals(p);
diff --git a/kernel/sched/Makefile b/kernel/sched/Makefile index 7fe1834..21fb5a5 100644 --- a/kernel/sched/Makefile +++ b/kernel/sched/Makefile @@ -29,3 +29,4 @@ obj-$(CONFIG_CPU_FREQ) += cpufreq.o obj-$(CONFIG_CPU_FREQ_GOV_SCHEDUTIL) += cpufreq_schedutil.o obj-$(CONFIG_MEMBARRIER) += membarrier.o obj-$(CONFIG_CPU_ISOLATION) += isolation.o +obj-$(CONFIG_PSI) += psi.o diff --git a/kernel/sched/core.c b/kernel/sched/core.c index 6bc1b19..f4ec7a3 100644 --- a/kernel/sched/core.c +++ b/kernel/sched/core.c @@ -748,8 +748,10 @@ static inline void enqueue_task(struct rq *rq, struct task_struct *p, int flags) if (!(flags & ENQUEUE_NOCLOCK)) update_rq_clock(rq);
- if (!(flags & ENQUEUE_RESTORE)) + if (!(flags & ENQUEUE_RESTORE)) { sched_info_queued(rq, p); + psi_enqueue(p, flags & ENQUEUE_WAKEUP); + }
p->sched_class->enqueue_task(rq, p, flags); } @@ -759,8 +761,10 @@ static inline void dequeue_task(struct rq *rq, struct task_struct *p, int flags) if (!(flags & DEQUEUE_NOCLOCK)) update_rq_clock(rq);
- if (!(flags & DEQUEUE_SAVE)) + if (!(flags & DEQUEUE_SAVE)) { sched_info_dequeued(rq, p); + psi_dequeue(p, flags & DEQUEUE_SLEEP); + }
p->sched_class->dequeue_task(rq, p, flags); } @@ -2059,6 +2063,7 @@ static void ttwu_queue(struct task_struct *p, int cpu, int wake_flags) cpu = select_task_rq(p, p->wake_cpu, SD_BALANCE_WAKE, wake_flags); if (task_cpu(p) != cpu) { wake_flags |= WF_MIGRATED; + psi_ttwu_dequeue(p); set_task_cpu(p, cpu); }
@@ -3071,6 +3076,7 @@ void scheduler_tick(void) curr->sched_class->task_tick(rq, curr, 0); cpu_load_update_active(rq); calc_global_load_tick(rq); + psi_task_tick(rq);
rq_unlock(rq, &rf);
@@ -6167,6 +6173,8 @@ void __init sched_init(void) if (use_sched_idle_time) rq_cputime_init();
+ psi_init(); + scheduler_running = 1; }
diff --git a/kernel/sched/psi.c b/kernel/sched/psi.c new file mode 100644 index 00000000..5954145 --- /dev/null +++ b/kernel/sched/psi.c @@ -0,0 +1,657 @@ +/* + * Pressure stall information for CPU, memory and IO + * + * Copyright (c) 2018 Facebook, Inc. + * Author: Johannes Weiner hannes@cmpxchg.org + * + * When CPU, memory and IO are contended, tasks experience delays that + * reduce throughput and introduce latencies into the workload. Memory + * and IO contention, in addition, can cause a full loss of forward + * progress in which the CPU goes idle. + * + * This code aggregates individual task delays into resource pressure + * metrics that indicate problems with both workload health and + * resource utilization. + * + * Model + * + * The time in which a task can execute on a CPU is our baseline for + * productivity. Pressure expresses the amount of time in which this + * potential cannot be realized due to resource contention. + * + * This concept of productivity has two components: the workload and + * the CPU. To measure the impact of pressure on both, we define two + * contention states for a resource: SOME and FULL. + * + * In the SOME state of a given resource, one or more tasks are + * delayed on that resource. This affects the workload's ability to + * perform work, but the CPU may still be executing other tasks. + * + * In the FULL state of a given resource, all non-idle tasks are + * delayed on that resource such that nobody is advancing and the CPU + * goes idle. This leaves both workload and CPU unproductive. + * + * (Naturally, the FULL state doesn't exist for the CPU resource.) + * + * SOME = nr_delayed_tasks != 0 + * FULL = nr_delayed_tasks != 0 && nr_running_tasks == 0 + * + * The percentage of wallclock time spent in those compound stall + * states gives pressure numbers between 0 and 100 for each resource, + * where the SOME percentage indicates workload slowdowns and the FULL + * percentage indicates reduced CPU utilization: + * + * %SOME = time(SOME) / period + * %FULL = time(FULL) / period + * + * Multiple CPUs + * + * The more tasks and available CPUs there are, the more work can be + * performed concurrently. This means that the potential that can go + * unrealized due to resource contention *also* scales with non-idle + * tasks and CPUs. + * + * Consider a scenario where 257 number crunching tasks are trying to + * run concurrently on 256 CPUs. If we simply aggregated the task + * states, we would have to conclude a CPU SOME pressure number of + * 100%, since *somebody* is waiting on a runqueue at all + * times. However, that is clearly not the amount of contention the + * workload is experiencing: only one out of 256 possible exceution + * threads will be contended at any given time, or about 0.4%. + * + * Conversely, consider a scenario of 4 tasks and 4 CPUs where at any + * given time *one* of the tasks is delayed due to a lack of memory. + * Again, looking purely at the task state would yield a memory FULL + * pressure number of 0%, since *somebody* is always making forward + * progress. But again this wouldn't capture the amount of execution + * potential lost, which is 1 out of 4 CPUs, or 25%. + * + * To calculate wasted potential (pressure) with multiple processors, + * we have to base our calculation on the number of non-idle tasks in + * conjunction with the number of available CPUs, which is the number + * of potential execution threads. SOME becomes then the proportion of + * delayed tasks to possibe threads, and FULL is the share of possible + * threads that are unproductive due to delays: + * + * threads = min(nr_nonidle_tasks, nr_cpus) + * SOME = min(nr_delayed_tasks / threads, 1) + * FULL = (threads - min(nr_running_tasks, threads)) / threads + * + * For the 257 number crunchers on 256 CPUs, this yields: + * + * threads = min(257, 256) + * SOME = min(1 / 256, 1) = 0.4% + * FULL = (256 - min(257, 256)) / 256 = 0% + * + * For the 1 out of 4 memory-delayed tasks, this yields: + * + * threads = min(4, 4) + * SOME = min(1 / 4, 1) = 25% + * FULL = (4 - min(3, 4)) / 4 = 25% + * + * [ Substitute nr_cpus with 1, and you can see that it's a natural + * extension of the single-CPU model. ] + * + * Implementation + * + * To assess the precise time spent in each such state, we would have + * to freeze the system on task changes and start/stop the state + * clocks accordingly. Obviously that doesn't scale in practice. + * + * Because the scheduler aims to distribute the compute load evenly + * among the available CPUs, we can track task state locally to each + * CPU and, at much lower frequency, extrapolate the global state for + * the cumulative stall times and the running averages. + * + * For each runqueue, we track: + * + * tSOME[cpu] = time(nr_delayed_tasks[cpu] != 0) + * tFULL[cpu] = time(nr_delayed_tasks[cpu] && !nr_running_tasks[cpu]) + * tNONIDLE[cpu] = time(nr_nonidle_tasks[cpu] != 0) + * + * and then periodically aggregate: + * + * tNONIDLE = sum(tNONIDLE[i]) + * + * tSOME = sum(tSOME[i] * tNONIDLE[i]) / tNONIDLE + * tFULL = sum(tFULL[i] * tNONIDLE[i]) / tNONIDLE + * + * %SOME = tSOME / period + * %FULL = tFULL / period + * + * This gives us an approximation of pressure that is practical + * cost-wise, yet way more sensitive and accurate than periodic + * sampling of the aggregate task states would be. + */ + +#include <linux/sched/loadavg.h> +#include <linux/seq_file.h> +#include <linux/proc_fs.h> +#include <linux/seqlock.h> +#include <linux/cgroup.h> +#include <linux/module.h> +#include <linux/sched.h> +#include <linux/psi.h> +#include "sched.h" + +static int psi_bug __read_mostly; + +bool psi_disabled __read_mostly; +core_param(psi_disabled, psi_disabled, bool, 0644); + +/* Running averages - we need to be higher-res than loadavg */ +#define PSI_FREQ (2*HZ+1) /* 2 sec intervals */ +#define EXP_10s 1677 /* 1/exp(2s/10s) as fixed-point */ +#define EXP_60s 1981 /* 1/exp(2s/60s) */ +#define EXP_300s 2034 /* 1/exp(2s/300s) */ + +/* Sampling frequency in nanoseconds */ +static u64 psi_period __read_mostly; + +/* System-level pressure and stall tracking */ +static DEFINE_PER_CPU(struct psi_group_cpu, system_group_pcpu); +static struct psi_group psi_system = { + .pcpu = &system_group_pcpu, +}; + +static void psi_update_work(struct work_struct *work); + +static void group_init(struct psi_group *group) +{ + int cpu; + + for_each_possible_cpu(cpu) + seqcount_init(&per_cpu_ptr(group->pcpu, cpu)->seq); + group->next_update = sched_clock() + psi_period; + INIT_DELAYED_WORK(&group->clock_work, psi_update_work); + mutex_init(&group->stat_lock); +} + +void __init psi_init(void) +{ + if (psi_disabled) + return; + + psi_period = jiffies_to_nsecs(PSI_FREQ); + group_init(&psi_system); +} + +static bool test_state(unsigned int *tasks, enum psi_states state) +{ + switch (state) { + case PSI_IO_SOME: + return tasks[NR_IOWAIT]; + case PSI_IO_FULL: + return tasks[NR_IOWAIT] && !tasks[NR_RUNNING]; + case PSI_MEM_SOME: + return tasks[NR_MEMSTALL]; + case PSI_MEM_FULL: + return tasks[NR_MEMSTALL] && !tasks[NR_RUNNING]; + case PSI_CPU_SOME: + return tasks[NR_RUNNING] > 1; + case PSI_NONIDLE: + return tasks[NR_IOWAIT] || tasks[NR_MEMSTALL] || + tasks[NR_RUNNING]; + default: + return false; + } +} + +static void get_recent_times(struct psi_group *group, int cpu, u32 *times) +{ + struct psi_group_cpu *groupc = per_cpu_ptr(group->pcpu, cpu); + unsigned int tasks[NR_PSI_TASK_COUNTS]; + u64 now, state_start; + unsigned int seq; + int s; + + /* Snapshot a coherent view of the CPU state */ + do { + seq = read_seqcount_begin(&groupc->seq); + now = cpu_clock(cpu); + memcpy(times, groupc->times, sizeof(groupc->times)); + memcpy(tasks, groupc->tasks, sizeof(groupc->tasks)); + state_start = groupc->state_start; + } while (read_seqcount_retry(&groupc->seq, seq)); + + /* Calculate state time deltas against the previous snapshot */ + for (s = 0; s < NR_PSI_STATES; s++) { + u32 delta; + /* + * In addition to already concluded states, we also + * incorporate currently active states on the CPU, + * since states may last for many sampling periods. + * + * This way we keep our delta sampling buckets small + * (u32) and our reported pressure close to what's + * actually happening. + */ + if (test_state(tasks, s)) + times[s] += now - state_start; + + delta = times[s] - groupc->times_prev[s]; + groupc->times_prev[s] = times[s]; + + times[s] = delta; + } +} + +static void calc_avgs(unsigned long avg[3], int missed_periods, + u64 time, u64 period) +{ + unsigned long pct; + + /* Fill in zeroes for periods of no activity */ + if (missed_periods) { + avg[0] = calc_load_n(avg[0], EXP_10s, 0, missed_periods); + avg[1] = calc_load_n(avg[1], EXP_60s, 0, missed_periods); + avg[2] = calc_load_n(avg[2], EXP_300s, 0, missed_periods); + } + + /* Sample the most recent active period */ + pct = div_u64(time * 100, period); + pct *= FIXED_1; + avg[0] = calc_load(avg[0], EXP_10s, pct); + avg[1] = calc_load(avg[1], EXP_60s, pct); + avg[2] = calc_load(avg[2], EXP_300s, pct); +} + +static bool update_stats(struct psi_group *group) +{ + u64 deltas[NR_PSI_STATES - 1] = { 0, }; + unsigned long missed_periods = 0; + unsigned long nonidle_total = 0; + u64 now, expires, period; + int cpu; + int s; + + mutex_lock(&group->stat_lock); + + /* + * Collect the per-cpu time buckets and average them into a + * single time sample that is normalized to wallclock time. + * + * For averaging, each CPU is weighted by its non-idle time in + * the sampling period. This eliminates artifacts from uneven + * loading, or even entirely idle CPUs. + */ + for_each_possible_cpu(cpu) { + u32 times[NR_PSI_STATES]; + u32 nonidle; + + get_recent_times(group, cpu, times); + + nonidle = nsecs_to_jiffies(times[PSI_NONIDLE]); + nonidle_total += nonidle; + + for (s = 0; s < PSI_NONIDLE; s++) + deltas[s] += (u64)times[s] * nonidle; + } + + /* + * Integrate the sample into the running statistics that are + * reported to userspace: the cumulative stall times and the + * decaying averages. + * + * Pressure percentages are sampled at PSI_FREQ. We might be + * called more often when the user polls more frequently than + * that; we might be called less often when there is no task + * activity, thus no data, and clock ticks are sporadic. The + * below handles both. + */ + + /* total= */ + for (s = 0; s < NR_PSI_STATES - 1; s++) + group->total[s] += div_u64(deltas[s], max(nonidle_total, 1UL)); + + /* avgX= */ + now = sched_clock(); + expires = group->next_update; + if (now < expires) + goto out; + if (now - expires > psi_period) + missed_periods = div_u64(now - expires, psi_period); + + /* + * The periodic clock tick can get delayed for various + * reasons, especially on loaded systems. To avoid clock + * drift, we schedule the clock in fixed psi_period intervals. + * But the deltas we sample out of the per-cpu buckets above + * are based on the actual time elapsing between clock ticks. + */ + group->next_update = expires + ((1 + missed_periods) * psi_period); + period = now - (group->last_update + (missed_periods * psi_period)); + group->last_update = now; + + for (s = 0; s < NR_PSI_STATES - 1; s++) { + u32 sample; + + sample = group->total[s] - group->total_prev[s]; + /* + * Due to the lockless sampling of the time buckets, + * recorded time deltas can slip into the next period, + * which under full pressure can result in samples in + * excess of the period length. + * + * We don't want to report non-sensical pressures in + * excess of 100%, nor do we want to drop such events + * on the floor. Instead we punt any overage into the + * future until pressure subsides. By doing this we + * don't underreport the occurring pressure curve, we + * just report it delayed by one period length. + * + * The error isn't cumulative. As soon as another + * delta slips from a period P to P+1, by definition + * it frees up its time T in P. + */ + if (sample > period) + sample = period; + group->total_prev[s] += sample; + calc_avgs(group->avg[s], missed_periods, sample, period); + } +out: + mutex_unlock(&group->stat_lock); + return nonidle_total; +} + +static void psi_update_work(struct work_struct *work) +{ + struct delayed_work *dwork; + struct psi_group *group; + bool nonidle; + + dwork = to_delayed_work(work); + group = container_of(dwork, struct psi_group, clock_work); + + /* + * If there is task activity, periodically fold the per-cpu + * times and feed samples into the running averages. If things + * are idle and there is no data to process, stop the clock. + * Once restarted, we'll catch up the running averages in one + * go - see calc_avgs() and missed_periods. + */ + + nonidle = update_stats(group); + + if (nonidle) { + unsigned long delay = 0; + u64 now; + + now = sched_clock(); + if (group->next_update > now) + delay = nsecs_to_jiffies(group->next_update - now) + 1; + schedule_delayed_work(dwork, delay); + } +} + +static void record_times(struct psi_group_cpu *groupc, int cpu, + bool memstall_tick) +{ + u32 delta; + u64 now; + + now = cpu_clock(cpu); + delta = now - groupc->state_start; + groupc->state_start = now; + + if (test_state(groupc->tasks, PSI_IO_SOME)) { + groupc->times[PSI_IO_SOME] += delta; + if (test_state(groupc->tasks, PSI_IO_FULL)) + groupc->times[PSI_IO_FULL] += delta; + } + + if (test_state(groupc->tasks, PSI_MEM_SOME)) { + groupc->times[PSI_MEM_SOME] += delta; + if (test_state(groupc->tasks, PSI_MEM_FULL)) + groupc->times[PSI_MEM_FULL] += delta; + else if (memstall_tick) { + u32 sample; + /* + * Since we care about lost potential, a + * memstall is FULL when there are no other + * working tasks, but also when the CPU is + * actively reclaiming and nothing productive + * could run even if it were runnable. + * + * When the timer tick sees a reclaiming CPU, + * regardless of runnable tasks, sample a FULL + * tick (or less if it hasn't been a full tick + * since the last state change). + */ + sample = min(delta, (u32)jiffies_to_nsecs(1)); + groupc->times[PSI_MEM_FULL] += sample; + } + } + + if (test_state(groupc->tasks, PSI_CPU_SOME)) + groupc->times[PSI_CPU_SOME] += delta; + + if (test_state(groupc->tasks, PSI_NONIDLE)) + groupc->times[PSI_NONIDLE] += delta; +} + +static void psi_group_change(struct psi_group *group, int cpu, + unsigned int clear, unsigned int set) +{ + struct psi_group_cpu *groupc; + unsigned int t, m; + + groupc = per_cpu_ptr(group->pcpu, cpu); + + /* + * First we assess the aggregate resource states this CPU's + * tasks have been in since the last change, and account any + * SOME and FULL time these may have resulted in. + * + * Then we update the task counts according to the state + * change requested through the @clear and @set bits. + */ + write_seqcount_begin(&groupc->seq); + + record_times(groupc, cpu, false); + + for (t = 0, m = clear; m; m &= ~(1 << t), t++) { + if (!(m & (1 << t))) + continue; + if (groupc->tasks[t] == 0 && !psi_bug) { + printk_deferred(KERN_ERR "psi: task underflow! cpu=%d t=%d tasks=[%u %u %u] clear=%x set=%x\n", + cpu, t, groupc->tasks[0], + groupc->tasks[1], groupc->tasks[2], + clear, set); + psi_bug = 1; + } + groupc->tasks[t]--; + } + + for (t = 0; set; set &= ~(1 << t), t++) + if (set & (1 << t)) + groupc->tasks[t]++; + + write_seqcount_end(&groupc->seq); + + if (!delayed_work_pending(&group->clock_work)) + schedule_delayed_work(&group->clock_work, PSI_FREQ); +} + +void psi_task_change(struct task_struct *task, int clear, int set) +{ + int cpu = task_cpu(task); + + if (!task->pid) + return; + + if (((task->psi_flags & set) || + (task->psi_flags & clear) != clear) && + !psi_bug) { + printk_deferred(KERN_ERR "psi: inconsistent task state! task=%d:%s cpu=%d psi_flags=%x clear=%x set=%x\n", + task->pid, task->comm, cpu, + task->psi_flags, clear, set); + psi_bug = 1; + } + + task->psi_flags &= ~clear; + task->psi_flags |= set; + + psi_group_change(&psi_system, cpu, clear, set); +} + +void psi_memstall_tick(struct task_struct *task, int cpu) +{ + struct psi_group_cpu *groupc; + + groupc = per_cpu_ptr(psi_system.pcpu, cpu); + write_seqcount_begin(&groupc->seq); + record_times(groupc, cpu, true); + write_seqcount_end(&groupc->seq); +} + +/** + * psi_memstall_enter - mark the beginning of a memory stall section + * @flags: flags to handle nested sections + * + * Marks the calling task as being stalled due to a lack of memory, + * such as waiting for a refault or performing reclaim. + */ +void psi_memstall_enter(unsigned long *flags) +{ + struct rq_flags rf; + struct rq *rq; + + if (psi_disabled) + return; + + *flags = current->flags & PF_MEMSTALL; + if (*flags) + return; + /* + * PF_MEMSTALL setting & accounting needs to be atomic wrt + * changes to the task's scheduling state, otherwise we can + * race with CPU migration. + */ + rq = this_rq_lock_irq(&rf); + + current->flags |= PF_MEMSTALL; + psi_task_change(current, 0, TSK_MEMSTALL); + + rq_unlock_irq(rq, &rf); +} + +/** + * psi_memstall_leave - mark the end of an memory stall section + * @flags: flags to handle nested memdelay sections + * + * Marks the calling task as no longer stalled due to lack of memory. + */ +void psi_memstall_leave(unsigned long *flags) +{ + struct rq_flags rf; + struct rq *rq; + + if (psi_disabled) + return; + + if (*flags) + return; + /* + * PF_MEMSTALL clearing & accounting needs to be atomic wrt + * changes to the task's scheduling state, otherwise we could + * race with CPU migration. + */ + rq = this_rq_lock_irq(&rf); + + current->flags &= ~PF_MEMSTALL; + psi_task_change(current, TSK_MEMSTALL, 0); + + rq_unlock_irq(rq, &rf); +} + +static int psi_show(struct seq_file *m, struct psi_group *group, + enum psi_res res) +{ + int full; + + if (psi_disabled) + return -EOPNOTSUPP; + + update_stats(group); + + for (full = 0; full < 2 - (res == PSI_CPU); full++) { + unsigned long avg[3]; + u64 total; + int w; + + for (w = 0; w < 3; w++) + avg[w] = group->avg[res * 2 + full][w]; + total = div_u64(group->total[res * 2 + full], NSEC_PER_USEC); + + seq_printf(m, "%s avg10=%lu.%02lu avg60=%lu.%02lu avg300=%lu.%02lu total=%llu\n", + full ? "full" : "some", + LOAD_INT(avg[0]), LOAD_FRAC(avg[0]), + LOAD_INT(avg[1]), LOAD_FRAC(avg[1]), + LOAD_INT(avg[2]), LOAD_FRAC(avg[2]), + total); + } + + return 0; +} + +static int psi_io_show(struct seq_file *m, void *v) +{ + return psi_show(m, &psi_system, PSI_IO); +} + +static int psi_memory_show(struct seq_file *m, void *v) +{ + return psi_show(m, &psi_system, PSI_MEM); +} + +static int psi_cpu_show(struct seq_file *m, void *v) +{ + return psi_show(m, &psi_system, PSI_CPU); +} + +static int psi_io_open(struct inode *inode, struct file *file) +{ + return single_open(file, psi_io_show, NULL); +} + +static int psi_memory_open(struct inode *inode, struct file *file) +{ + return single_open(file, psi_memory_show, NULL); +} + +static int psi_cpu_open(struct inode *inode, struct file *file) +{ + return single_open(file, psi_cpu_show, NULL); +} + +static const struct file_operations psi_io_fops = { + .open = psi_io_open, + .read = seq_read, + .llseek = seq_lseek, + .release = single_release, +}; + +static const struct file_operations psi_memory_fops = { + .open = psi_memory_open, + .read = seq_read, + .llseek = seq_lseek, + .release = single_release, +}; + +static const struct file_operations psi_cpu_fops = { + .open = psi_cpu_open, + .read = seq_read, + .llseek = seq_lseek, + .release = single_release, +}; + +static int __init psi_proc_init(void) +{ + proc_mkdir("pressure", NULL); + proc_create("pressure/io", 0, NULL, &psi_io_fops); + proc_create("pressure/memory", 0, NULL, &psi_memory_fops); + proc_create("pressure/cpu", 0, NULL, &psi_cpu_fops); + return 0; +} +module_init(psi_proc_init); diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h index 6e3171b..6365498 100644 --- a/kernel/sched/sched.h +++ b/kernel/sched/sched.h @@ -55,6 +55,7 @@ #include <linux/proc_fs.h> #include <linux/prefetch.h> #include <linux/profile.h> +#include <linux/psi.h> #include <linux/rcupdate_wait.h> #include <linux/security.h> #include <linux/stackprotector.h> @@ -330,6 +331,7 @@ bool __dl_overflow(struct dl_bw *dl_b, int cpus, u64 old_bw, u64 new_bw) #ifdef CONFIG_CGROUP_SCHED
#include <linux/cgroup.h> +#include <linux/psi.h>
struct cfs_rq; struct rt_rq; diff --git a/kernel/sched/stats.h b/kernel/sched/stats.h index f6a7d0b..7cd0104 100644 --- a/kernel/sched/stats.h +++ b/kernel/sched/stats.h @@ -76,6 +76,92 @@ static inline void rq_sched_info_depart (struct rq *rq, unsigned long long delt # define schedstat_end_time(rq, t) do { } while (0) #endif /* CONFIG_SCHEDSTATS */
+#ifdef CONFIG_PSI +/* + * PSI tracks state that persists across sleeps, such as iowaits and + * memory stalls. As a result, it has to distinguish between sleeps, + * where a task's runnable state changes, and requeues, where a task + * and its state are being moved between CPUs and runqueues. + */ +static inline void psi_enqueue(struct task_struct *p, bool wakeup) +{ + int clear = 0, set = TSK_RUNNING; + + if (psi_disabled) + return; + + if (!wakeup || p->sched_psi_wake_requeue) { + if (p->flags & PF_MEMSTALL) + set |= TSK_MEMSTALL; + if (p->sched_psi_wake_requeue) + p->sched_psi_wake_requeue = 0; + } else { + if (p->in_iowait) + clear |= TSK_IOWAIT; + } + + psi_task_change(p, clear, set); +} + +static inline void psi_dequeue(struct task_struct *p, bool sleep) +{ + int clear = TSK_RUNNING, set = 0; + + if (psi_disabled) + return; + + if (!sleep) { + if (p->flags & PF_MEMSTALL) + clear |= TSK_MEMSTALL; + } else { + if (p->in_iowait) + set |= TSK_IOWAIT; + } + + psi_task_change(p, clear, set); +} + +static inline void psi_ttwu_dequeue(struct task_struct *p) +{ + if (psi_disabled) + return; + /* + * Is the task being migrated during a wakeup? Make sure to + * deregister its sleep-persistent psi states from the old + * queue, and let psi_enqueue() know it has to requeue. + */ + if (unlikely(p->in_iowait || (p->flags & PF_MEMSTALL))) { + struct rq_flags rf; + struct rq *rq; + int clear = 0; + + if (p->in_iowait) + clear |= TSK_IOWAIT; + if (p->flags & PF_MEMSTALL) + clear |= TSK_MEMSTALL; + + rq = __task_rq_lock(p, &rf); + psi_task_change(p, clear, 0); + p->sched_psi_wake_requeue = 1; + __task_rq_unlock(rq, &rf); + } +} + +static inline void psi_task_tick(struct rq *rq) +{ + if (psi_disabled) + return; + + if (unlikely(rq->curr->flags & PF_MEMSTALL)) + psi_memstall_tick(rq->curr, cpu_of(rq)); +} +#else /* CONFIG_PSI */ +static inline void psi_enqueue(struct task_struct *p, bool wakeup) {} +static inline void psi_dequeue(struct task_struct *p, bool sleep) {} +static inline void psi_ttwu_dequeue(struct task_struct *p) {} +static inline void psi_task_tick(struct rq *rq) {} +#endif /* CONFIG_PSI */ + #ifdef CONFIG_SCHED_INFO static inline void sched_info_reset_dequeued(struct task_struct *t) { diff --git a/mm/compaction.c b/mm/compaction.c index 5079ddb..120e555 100644 --- a/mm/compaction.c +++ b/mm/compaction.c @@ -22,6 +22,7 @@ #include <linux/kthread.h> #include <linux/freezer.h> #include <linux/page_owner.h> +#include <linux/psi.h> #include "internal.h"
#ifdef CONFIG_COMPACTION @@ -2059,11 +2060,15 @@ static int kcompactd(void *p) pgdat->kcompactd_classzone_idx = pgdat->nr_zones - 1;
while (!kthread_should_stop()) { + unsigned long pflags; + trace_mm_compaction_kcompactd_sleep(pgdat->node_id); wait_event_freezable(pgdat->kcompactd_wait, kcompactd_work_requested(pgdat));
+ psi_memstall_enter(&pflags); kcompactd_do_work(pgdat); + psi_memstall_leave(&pflags); }
return 0; diff --git a/mm/filemap.c b/mm/filemap.c index fd906c1..66460c9 100644 --- a/mm/filemap.c +++ b/mm/filemap.c @@ -36,6 +36,7 @@ #include <linux/cleancache.h> #include <linux/shmem_fs.h> #include <linux/rmap.h> +#include <linux/psi.h> #include "internal.h"
#define CREATE_TRACE_POINTS @@ -1182,8 +1183,15 @@ static inline int wait_on_page_bit_common(wait_queue_head_t *q, { struct wait_page_queue wait_page; wait_queue_entry_t *wait = &wait_page.wait; + bool thrashing = false; + unsigned long pflags; int ret = 0;
+ if (bit_nr == PG_locked && + !PageUptodate(page) && PageWorkingset(page)) { + psi_memstall_enter(&pflags); + thrashing = true; + } init_wait(wait); wait->flags = lock ? WQ_FLAG_EXCLUSIVE : 0; wait->func = wake_page_function; @@ -1222,6 +1230,9 @@ static inline int wait_on_page_bit_common(wait_queue_head_t *q,
finish_wait(q, wait);
+ if (thrashing) { + psi_memstall_leave(&pflags); + } /* * A signal could leave PageWaiters set. Clearing it here if * !waitqueue_active would be possible (by open-coding finish_wait), diff --git a/mm/page_alloc.c b/mm/page_alloc.c index 4aad622..20104d2 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -66,6 +66,7 @@ #include <linux/ftrace.h> #include <linux/lockdep.h> #include <linux/nmi.h> +#include <linux/psi.h> #include <linux/khugepaged.h> #include <linux/ktask.h>
@@ -3669,15 +3670,20 @@ void warn_alloc(gfp_t gfp_mask, nodemask_t *nodemask, const char *fmt, ...) enum compact_priority prio, enum compact_result *compact_result) { struct page *page; + unsigned long pflags; unsigned int noreclaim_flag;
if (!order) return NULL;
+ psi_memstall_enter(&pflags); noreclaim_flag = memalloc_noreclaim_save(); + *compact_result = try_to_compact_pages(gfp_mask, order, alloc_flags, ac, prio); + memalloc_noreclaim_restore(noreclaim_flag); + psi_memstall_leave(&pflags);
if (*compact_result <= COMPACT_INACTIVE) return NULL; @@ -3884,11 +3890,13 @@ void fs_reclaim_release(gfp_t gfp_mask) struct reclaim_state reclaim_state; int progress; unsigned int noreclaim_flag; + unsigned long pflags;
cond_resched();
/* We now go into synchronous reclaim */ cpuset_memory_pressure_bump(); + psi_memstall_enter(&pflags); fs_reclaim_acquire(gfp_mask); noreclaim_flag = memalloc_noreclaim_save(); reclaim_state.reclaimed_slab = 0; @@ -3900,6 +3908,7 @@ void fs_reclaim_release(gfp_t gfp_mask) current->reclaim_state = NULL; memalloc_noreclaim_restore(noreclaim_flag); fs_reclaim_release(gfp_mask); + psi_memstall_leave(&pflags);
cond_resched();
diff --git a/mm/vmscan.c b/mm/vmscan.c index 90b47ca..64b7dd2 100644 --- a/mm/vmscan.c +++ b/mm/vmscan.c @@ -50,6 +50,7 @@ #include <linux/prefetch.h> #include <linux/printk.h> #include <linux/dax.h> +#include <linux/psi.h>
#include <asm/tlbflush.h> #include <asm/div64.h> @@ -3340,6 +3341,7 @@ unsigned long try_to_free_mem_cgroup_pages(struct mem_cgroup *memcg, { struct zonelist *zonelist; unsigned long nr_reclaimed; + unsigned long pflags; int nid; unsigned int noreclaim_flag; struct scan_control sc = { @@ -3368,9 +3370,13 @@ unsigned long try_to_free_mem_cgroup_pages(struct mem_cgroup *memcg, sc.gfp_mask, sc.reclaim_idx);
+ psi_memstall_enter(&pflags); noreclaim_flag = memalloc_noreclaim_save(); + nr_reclaimed = do_try_to_free_pages(zonelist, &sc); + memalloc_noreclaim_restore(noreclaim_flag); + psi_memstall_leave(&pflags);
trace_mm_vmscan_memcg_reclaim_end(nr_reclaimed);
@@ -3535,6 +3541,7 @@ static int balance_pgdat(pg_data_t *pgdat, int order, int classzone_idx) int i; unsigned long nr_soft_reclaimed; unsigned long nr_soft_scanned; + unsigned long pflags; struct zone *zone; struct scan_control sc = { .gfp_mask = GFP_KERNEL, @@ -3545,6 +3552,7 @@ static int balance_pgdat(pg_data_t *pgdat, int order, int classzone_idx) .may_swap = 1, };
+ psi_memstall_enter(&pflags); __fs_reclaim_acquire();
count_vm_event(PAGEOUTRUN); @@ -3651,6 +3659,7 @@ static int balance_pgdat(pg_data_t *pgdat, int order, int classzone_idx) out: snapshot_refaults(NULL, pgdat); __fs_reclaim_release(); + psi_memstall_leave(&pflags); /* * Return the order kswapd stopped reclaiming at as * prepare_kswapd_sleep() takes it into account. If another caller
From: Johannes Weiner hannes@cmpxchg.org
mainline inclusion from mainline-v5.4 commit 2ce7135adc9ad081aa3c49744144376ac74fea60 category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I47QS2 CVE: NA
--------------------------------
On a system that executes multiple cgrouped jobs and independent workloads, we don't just care about the health of the overall system, but also that of individual jobs, so that we can ensure individual job health, fairness between jobs, or prioritize some jobs over others.
This patch implements pressure stall tracking for cgroups. In kernels with CONFIG_PSI=y, cgroup2 groups will have cpu.pressure, memory.pressure, and io.pressure files that track aggregate pressure stall times for only the tasks inside the cgroup.
Link: http://lkml.kernel.org/r/20180828172258.3185-10-hannes@cmpxchg.org Signed-off-by: Johannes Weiner hannes@cmpxchg.org Acked-by: Peter Zijlstra (Intel) peterz@infradead.org Tested-by: Daniel Drake drake@endlessm.com Tested-by: Suren Baghdasaryan surenb@google.com Cc: Christopher Lameter cl@linux.com Cc: Ingo Molnar mingo@redhat.com Cc: Johannes Weiner jweiner@fb.com Cc: Mike Galbraith efault@gmx.de Cc: Peter Enderborg peter.enderborg@sony.com Cc: Randy Dunlap rdunlap@infradead.org Cc: Shakeel Butt shakeelb@google.com Cc: Vinayak Menon vinmenon@codeaurora.org Signed-off-by: Andrew Morton akpm@linux-foundation.org Signed-off-by: Linus Torvalds torvalds@linux-foundation.org Signed-off-by: Liu Xinpeng liuxp11@chinatelecom.cn Signed-off-by: Ctyun Kernel ctyuncommiter01@chinatelecom.cn --- Documentation/accounting/psi.txt | 9 +++ Documentation/admin-guide/cgroup-v2.rst | 18 +++++ include/linux/cgroup-defs.h | 4 ++ include/linux/cgroup.h | 15 ++++ include/linux/psi.h | 25 +++++++ init/Kconfig | 4 ++ kernel/cgroup/cgroup.c | 44 +++++++++++- kernel/sched/psi.c | 118 +++++++++++++++++++++++++++++--- 8 files changed, 227 insertions(+), 10 deletions(-)
diff --git a/Documentation/accounting/psi.txt b/Documentation/accounting/psi.txt index 3753a82..b8ca28b 100644 --- a/Documentation/accounting/psi.txt +++ b/Documentation/accounting/psi.txt @@ -62,3 +62,12 @@ well as medium and long term trends. The total absolute stall time is tracked and exported as well, to allow detection of latency spikes which wouldn't necessarily make a dent in the time averages, or to average trends over custom time frames. + +Cgroup2 interface +================= + +In a system with a CONFIG_CGROUP=y kernel and the cgroup2 filesystem +mounted, pressure stall information is also tracked for tasks grouped +into cgroups. Each subdirectory in the cgroupfs mountpoint contains +cpu.pressure, memory.pressure, and io.pressure files; the format is +the same as the /proc/pressure/ files. diff --git a/Documentation/admin-guide/cgroup-v2.rst b/Documentation/admin-guide/cgroup-v2.rst index 9c194bc..c57b4fc 100644 --- a/Documentation/admin-guide/cgroup-v2.rst +++ b/Documentation/admin-guide/cgroup-v2.rst @@ -966,6 +966,12 @@ All time durations are in microseconds. $PERIOD duration. "max" for $MAX indicates no limit. If only one number is written, $MAX is updated.
+ cpu.pressure + A read-only nested-key file which exists on non-root cgroups. + + Shows pressure stall information for CPU. See + Documentation/accounting/psi.txt for details. +
Memory ------ @@ -1287,6 +1293,12 @@ PAGE_SIZE multiple when read back. higher than the limit for an extended period of time. This reduces the impact on the workload and memory management.
+ memory.pressure + A read-only nested-key file which exists on non-root cgroups. + + Shows pressure stall information for memory. See + Documentation/accounting/psi.txt for details. +
Usage Guidelines ~~~~~~~~~~~~~~~~ @@ -1424,6 +1436,12 @@ IO Interface Files
8:16 rbps=2097152 wbps=max riops=max wiops=max
+ io.pressure + A read-only nested-key file which exists on non-root cgroups. + + Shows pressure stall information for IO. See + Documentation/accounting/psi.txt for details. +
Writeback ~~~~~~~~~ diff --git a/include/linux/cgroup-defs.h b/include/linux/cgroup-defs.h index cad1a82..c026665 100644 --- a/include/linux/cgroup-defs.h +++ b/include/linux/cgroup-defs.h @@ -20,6 +20,7 @@ #include <linux/u64_stats_sync.h> #include <linux/workqueue.h> #include <linux/bpf-cgroup.h> +#include <linux/psi_types.h>
#ifdef CONFIG_CGROUPS
@@ -442,6 +443,9 @@ struct cgroup { /* used to schedule release agent */ struct work_struct release_agent_work;
+ /* used to track pressure stalls */ + struct psi_group psi; + /* used to store eBPF programs */ struct cgroup_bpf bpf;
diff --git a/include/linux/cgroup.h b/include/linux/cgroup.h index 02da4e1..73acf8b 100644 --- a/include/linux/cgroup.h +++ b/include/linux/cgroup.h @@ -669,6 +669,11 @@ static inline void pr_cont_cgroup_path(struct cgroup *cgrp) pr_cont_kernfs_path(cgrp->kn); }
+static inline struct psi_group *cgroup_psi(struct cgroup *cgrp) +{ + return &cgrp->psi; +} + static inline void cgroup_init_kthreadd(void) { /* @@ -723,6 +728,16 @@ static inline union kernfs_node_id *cgroup_get_kernfs_id(struct cgroup *cgrp) return NULL; }
+static inline struct cgroup *cgroup_parent(struct cgroup *cgrp) +{ + return NULL; +} + +static inline struct psi_group *cgroup_psi(struct cgroup *cgrp) +{ + return NULL; +} + static inline bool task_under_cgroup_hierarchy(struct task_struct *task, struct cgroup *ancestor) { diff --git a/include/linux/psi.h b/include/linux/psi.h index b0daf05..8e0725a 100644 --- a/include/linux/psi.h +++ b/include/linux/psi.h @@ -4,6 +4,9 @@ #include <linux/psi_types.h> #include <linux/sched.h>
+struct seq_file; +struct css_set; + #ifdef CONFIG_PSI
extern bool psi_disabled; @@ -16,6 +19,14 @@ void psi_memstall_enter(unsigned long *flags); void psi_memstall_leave(unsigned long *flags);
+int psi_show(struct seq_file *s, struct psi_group *group, enum psi_res res); + +#ifdef CONFIG_CGROUPS +int psi_cgroup_alloc(struct cgroup *cgrp); +void psi_cgroup_free(struct cgroup *cgrp); +void cgroup_move_task(struct task_struct *p, struct css_set *to); +#endif + #else /* CONFIG_PSI */
static inline void psi_init(void) {} @@ -23,6 +34,20 @@ static inline void psi_init(void) {} static inline void psi_memstall_enter(unsigned long *flags) {} static inline void psi_memstall_leave(unsigned long *flags) {}
+#ifdef CONFIG_CGROUPS +static inline int psi_cgroup_alloc(struct cgroup *cgrp) +{ + return 0; +} +static inline void psi_cgroup_free(struct cgroup *cgrp) +{ +} +static inline void cgroup_move_task(struct task_struct *p, struct css_set *to) +{ + rcu_assign_pointer(p->cgroups, to); +} +#endif + #endif /* CONFIG_PSI */
#endif /* _LINUX_PSI_H */ diff --git a/init/Kconfig b/init/Kconfig index 4ad842c..e94174f 100644 --- a/init/Kconfig +++ b/init/Kconfig @@ -515,6 +515,10 @@ config PSI the share of walltime in which some or all tasks in the system are delayed due to contention of the respective resource.
+ In kernels with cgroup support, cgroups (cgroup2 only) will + have cpu.pressure, memory.pressure, and io.pressure files, + which aggregate pressure stalls for the grouped tasks only. + For more details see Documentation/accounting/psi.txt.
Say N if unsure. diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c index 682c5e2..99e51fa 100644 --- a/kernel/cgroup/cgroup.c +++ b/kernel/cgroup/cgroup.c @@ -55,6 +55,7 @@ #include <linux/nsproxy.h> #include <linux/file.h> #include <linux/sched/cputime.h> +#include <linux/psi.h> #include <net/sock.h>
#define CREATE_TRACE_POINTS @@ -842,7 +843,7 @@ static void css_set_move_task(struct task_struct *task, */ WARN_ON_ONCE(task->flags & PF_EXITING);
- rcu_assign_pointer(task->cgroups, to_cset); + cgroup_move_task(task, to_cset); list_add_tail(&task->cg_list, use_mg_tasks ? &to_cset->mg_tasks : &to_cset->tasks); } @@ -3429,6 +3430,21 @@ static int cpu_stat_show(struct seq_file *seq, void *v) return ret; }
+#ifdef CONFIG_PSI +static int cgroup_io_pressure_show(struct seq_file *seq, void *v) +{ + return psi_show(seq, &seq_css(seq)->cgroup->psi, PSI_IO); +} +static int cgroup_memory_pressure_show(struct seq_file *seq, void *v) +{ + return psi_show(seq, &seq_css(seq)->cgroup->psi, PSI_MEM); +} +static int cgroup_cpu_pressure_show(struct seq_file *seq, void *v) +{ + return psi_show(seq, &seq_css(seq)->cgroup->psi, PSI_CPU); +} +#endif + static int cgroup_file_open(struct kernfs_open_file *of) { struct cftype *cft = of->kn->priv; @@ -4610,6 +4626,23 @@ static ssize_t cgroup_threads_write(struct kernfs_open_file *of, .flags = CFTYPE_NOT_ON_ROOT, .seq_show = cpu_stat_show, }, +#ifdef CONFIG_PSI + { + .name = "io.pressure", + .flags = CFTYPE_NOT_ON_ROOT, + .seq_show = cgroup_io_pressure_show, + }, + { + .name = "memory.pressure", + .flags = CFTYPE_NOT_ON_ROOT, + .seq_show = cgroup_memory_pressure_show, + }, + { + .name = "cpu.pressure", + .flags = CFTYPE_NOT_ON_ROOT, + .seq_show = cgroup_cpu_pressure_show, + }, +#endif { } /* terminate */ };
@@ -4670,6 +4703,7 @@ static void css_free_rwork_fn(struct work_struct *work) */ cgroup_put(cgroup_parent(cgrp)); kernfs_put(cgrp->kn); + psi_cgroup_free(cgrp); if (cgroup_on_dfl(cgrp)) cgroup_rstat_exit(cgrp); kfree(cgrp); @@ -4928,10 +4962,14 @@ static struct cgroup *cgroup_create(struct cgroup *parent) cgrp->self.parent = &parent->self; cgrp->root = root; cgrp->level = level; - ret = cgroup_bpf_inherit(cgrp); + ret = psi_cgroup_alloc(cgrp); if (ret) goto out_idr_free;
+ ret = cgroup_bpf_inherit(cgrp); + if (ret) + goto out_psi_free; + spin_lock_irq(&css_set_lock); for (tcgrp = cgrp; tcgrp; tcgrp = cgroup_parent(tcgrp)) { cgrp->ancestor_ids[tcgrp->level] = tcgrp->id; @@ -4971,6 +5009,8 @@ static struct cgroup *cgroup_create(struct cgroup *parent)
return cgrp;
+out_psi_free: + psi_cgroup_free(cgrp); out_idr_free: cgroup_idr_remove(&root->cgroup_idr, cgrp->id); out_stat_exit: diff --git a/kernel/sched/psi.c b/kernel/sched/psi.c index 5954145..7cdecfc 100644 --- a/kernel/sched/psi.c +++ b/kernel/sched/psi.c @@ -473,9 +473,35 @@ static void psi_group_change(struct psi_group *group, int cpu, schedule_delayed_work(&group->clock_work, PSI_FREQ); }
+static struct psi_group *iterate_groups(struct task_struct *task, void **iter) +{ +#ifdef CONFIG_CGROUPS + struct cgroup *cgroup = NULL; + + if (!*iter) + cgroup = task->cgroups->dfl_cgrp; + else if (*iter == &psi_system) + return NULL; + else + cgroup = cgroup_parent(*iter); + + if (cgroup && cgroup_parent(cgroup)) { + *iter = cgroup; + return cgroup_psi(cgroup); + } +#else + if (*iter) + return NULL; +#endif + *iter = &psi_system; + return &psi_system; +} + void psi_task_change(struct task_struct *task, int clear, int set) { int cpu = task_cpu(task); + struct psi_group *group; + void *iter = NULL;
if (!task->pid) return; @@ -492,17 +518,23 @@ void psi_task_change(struct task_struct *task, int clear, int set) task->psi_flags &= ~clear; task->psi_flags |= set;
- psi_group_change(&psi_system, cpu, clear, set); + while ((group = iterate_groups(task, &iter))) + psi_group_change(group, cpu, clear, set); }
void psi_memstall_tick(struct task_struct *task, int cpu) { - struct psi_group_cpu *groupc; + struct psi_group *group; + void *iter = NULL;
- groupc = per_cpu_ptr(psi_system.pcpu, cpu); - write_seqcount_begin(&groupc->seq); - record_times(groupc, cpu, true); - write_seqcount_end(&groupc->seq); + while ((group = iterate_groups(task, &iter))) { + struct psi_group_cpu *groupc; + + groupc = per_cpu_ptr(group->pcpu, cpu); + write_seqcount_begin(&groupc->seq); + record_times(groupc, cpu, true); + write_seqcount_end(&groupc->seq); + } }
/** @@ -565,8 +597,78 @@ void psi_memstall_leave(unsigned long *flags) rq_unlock_irq(rq, &rf); }
-static int psi_show(struct seq_file *m, struct psi_group *group, - enum psi_res res) +#ifdef CONFIG_CGROUPS +int psi_cgroup_alloc(struct cgroup *cgroup) +{ + if (psi_disabled) + return 0; + + cgroup->psi.pcpu = alloc_percpu(struct psi_group_cpu); + if (!cgroup->psi.pcpu) + return -ENOMEM; + group_init(&cgroup->psi); + return 0; +} + +void psi_cgroup_free(struct cgroup *cgroup) +{ + if (psi_disabled) + return; + + cancel_delayed_work_sync(&cgroup->psi.clock_work); + free_percpu(cgroup->psi.pcpu); +} + +/** + * cgroup_move_task - move task to a different cgroup + * @task: the task + * @to: the target css_set + * + * Move task to a new cgroup and safely migrate its associated stall + * state between the different groups. + * + * This function acquires the task's rq lock to lock out concurrent + * changes to the task's scheduling state and - in case the task is + * running - concurrent changes to its stall state. + */ +void cgroup_move_task(struct task_struct *task, struct css_set *to) +{ + bool move_psi = !psi_disabled; + unsigned int task_flags = 0; + struct rq_flags rf; + struct rq *rq; + + if (move_psi) { + rq = task_rq_lock(task, &rf); + + if (task_on_rq_queued(task)) + task_flags = TSK_RUNNING; + else if (task->in_iowait) + task_flags = TSK_IOWAIT; + + if (task->flags & PF_MEMSTALL) + task_flags |= TSK_MEMSTALL; + + if (task_flags) + psi_task_change(task, task_flags, 0); + } + + /* + * Lame to do this here, but the scheduler cannot be locked + * from the outside, so we move cgroups from inside sched/. + */ + rcu_assign_pointer(task->cgroups, to); + + if (move_psi) { + if (task_flags) + psi_task_change(task, 0, task_flags); + + task_rq_unlock(rq, task, &rf); + } +} +#endif /* CONFIG_CGROUPS */ + +int psi_show(struct seq_file *m, struct psi_group *group, enum psi_res res) { int full;
From: Olof Johansson olof@lixom.net
mainline inclusion from mainline-v5.4 commit 8fcb2312d1e3300e81aa871aad00d4c038cfc184 category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I47QS2 CVE: NA
--------------------------------
The existing code triggered an invalid warning about 'rq' possibly being used uninitialized. Instead of doing the silly warning suppression by initializa it to NULL, refactor the code to bail out early instead.
Warning was:
kernel/sched/psi.c: In function `cgroup_move_task': kernel/sched/psi.c:639:13: warning: `rq' may be used uninitialized in this function [-Wmaybe-uninitialized]
Link: http://lkml.kernel.org/r/20181103183339.8669-1-olof@lixom.net Fixes: 2ce7135adc9ad ("psi: cgroup support") Signed-off-by: Olof Johansson olof@lixom.net Reviewed-by: Andrew Morton akpm@linux-foundation.org Acked-by: Johannes Weiner hannes@cmpxchg.org Cc: Ingo Molnar mingo@redhat.com Cc: Peter Zijlstra peterz@infradead.org Signed-off-by: Andrew Morton akpm@linux-foundation.org Signed-off-by: Linus Torvalds torvalds@linux-foundation.org Signed-off-by: Liu Xinpeng liuxp11@chinatelecom.cn Signed-off-by: Ctyun Kernel ctyuncommiter01@chinatelecom.cn --- kernel/sched/psi.c | 43 ++++++++++++++++++++++--------------------- 1 file changed, 22 insertions(+), 21 deletions(-)
diff --git a/kernel/sched/psi.c b/kernel/sched/psi.c index 7cdecfc..3d7355d 100644 --- a/kernel/sched/psi.c +++ b/kernel/sched/psi.c @@ -633,38 +633,39 @@ void psi_cgroup_free(struct cgroup *cgroup) */ void cgroup_move_task(struct task_struct *task, struct css_set *to) { - bool move_psi = !psi_disabled; unsigned int task_flags = 0; struct rq_flags rf; struct rq *rq;
- if (move_psi) { - rq = task_rq_lock(task, &rf); + if (psi_disabled) { + /* + * Lame to do this here, but the scheduler cannot be locked + * from the outside, so we move cgroups from inside sched/. + */ + rcu_assign_pointer(task->cgroups, to); + return; + }
- if (task_on_rq_queued(task)) - task_flags = TSK_RUNNING; - else if (task->in_iowait) - task_flags = TSK_IOWAIT; + rq = task_rq_lock(task, &rf);
- if (task->flags & PF_MEMSTALL) - task_flags |= TSK_MEMSTALL; + if (task_on_rq_queued(task)) + task_flags = TSK_RUNNING; + else if (task->in_iowait) + task_flags = TSK_IOWAIT;
- if (task_flags) - psi_task_change(task, task_flags, 0); - } + if (task->flags & PF_MEMSTALL) + task_flags |= TSK_MEMSTALL;
- /* - * Lame to do this here, but the scheduler cannot be locked - * from the outside, so we move cgroups from inside sched/. - */ + if (task_flags) + psi_task_change(task, task_flags, 0); + + /* See comment above */ rcu_assign_pointer(task->cgroups, to);
- if (move_psi) { - if (task_flags) - psi_task_change(task, 0, task_flags); + if (task_flags) + psi_task_change(task, 0, task_flags);
- task_rq_unlock(rq, task, &rf); - } + task_rq_unlock(rq, task, &rf); } #endif /* CONFIG_CGROUPS */
From: Johannes Weiner hannes@cmpxchg.org
mainline inclusion from mainline-v5.4 commit e0c274472d5d27f277af722e017525e0b33784cd category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I47QS2 CVE: NA
--------------------------------
Mel Gorman reports a hackbench regression with psi that would prohibit shipping the suse kernel with it default-enabled, but he'd still like users to be able to opt in at little to no cost to others.
With the current combination of CONFIG_PSI and the psi_disabled bool set from the commandline, this is a challenge. Do the following things to make it easier:
1. Add a config option CONFIG_PSI_DEFAULT_DISABLED that allows distros to enable CONFIG_PSI in their kernel but leave the feature disabled unless a user requests it at boot-time.
To avoid double negatives, rename psi_disabled= to psi=.
2. Make psi_disabled a static branch to eliminate any branch costs when the feature is disabled.
In terms of numbers before and after this patch, Mel says:
: The following is a comparision using CONFIG_PSI=n as a baseline against : your patch and a vanilla kernel : : 4.20.0-rc4 4.20.0-rc4 4.20.0-rc4 : kconfigdisable-v1r1 vanilla psidisable-v1r1 : Amean 1 1.3100 ( 0.00%) 1.3923 ( -6.28%) 1.3427 ( -2.49%) : Amean 3 3.8860 ( 0.00%) 4.1230 * -6.10%* 3.8860 ( -0.00%) : Amean 5 6.8847 ( 0.00%) 8.0390 * -16.77%* 6.7727 ( 1.63%) : Amean 7 9.9310 ( 0.00%) 10.8367 * -9.12%* 9.9910 ( -0.60%) : Amean 12 16.6577 ( 0.00%) 18.2363 * -9.48%* 17.1083 ( -2.71%) : Amean 18 26.5133 ( 0.00%) 27.8833 * -5.17%* 25.7663 ( 2.82%) : Amean 24 34.3003 ( 0.00%) 34.6830 ( -1.12%) 32.0450 ( 6.58%) : Amean 30 40.0063 ( 0.00%) 40.5800 ( -1.43%) 41.5087 ( -3.76%) : Amean 32 40.1407 ( 0.00%) 41.2273 ( -2.71%) 39.9417 ( 0.50%) : : It's showing that the vanilla kernel takes a hit (as the bisection : indicated it would) and that disabling PSI by default is reasonably : close in terms of performance for this particular workload on this : particular machine so;
Link: http://lkml.kernel.org/r/20181127165329.GA29728@cmpxchg.org Signed-off-by: Johannes Weiner hannes@cmpxchg.org Tested-by: Mel Gorman mgorman@techsingularity.net Reported-by: Mel Gorman mgorman@techsingularity.net Signed-off-by: Andrew Morton akpm@linux-foundation.org Signed-off-by: Linus Torvalds torvalds@linux-foundation.org Signed-off-by: Liu Xinpeng liuxp11@chinatelecom.cn Signed-off-by: Ctyun Kernel ctyuncommiter01@chinatelecom.cn --- Documentation/admin-guide/kernel-parameters.txt | 4 ++++ include/linux/psi.h | 3 ++- init/Kconfig | 9 ++++++++ kernel/sched/psi.c | 30 +++++++++++++++++-------- kernel/sched/stats.h | 8 +++---- 5 files changed, 40 insertions(+), 14 deletions(-)
diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt index 54127ef..a3621003 100644 --- a/Documentation/admin-guide/kernel-parameters.txt +++ b/Documentation/admin-guide/kernel-parameters.txt @@ -3638,6 +3638,10 @@ before loading. See Documentation/blockdev/ramdisk.txt.
+ psi= [KNL] Enable or disable pressure stall information + tracking. + Format: <bool> + psmouse.proto= [HW,MOUSE] Highest PS2 mouse protocol extension to probe for; one of (bare|imps|exps|lifebook|any). psmouse.rate= [HW,MOUSE] Set desired mouse report rate, in reports diff --git a/include/linux/psi.h b/include/linux/psi.h index 8e0725a..7006008 100644 --- a/include/linux/psi.h +++ b/include/linux/psi.h @@ -1,6 +1,7 @@ #ifndef _LINUX_PSI_H #define _LINUX_PSI_H
+#include <linux/jump_label.h> #include <linux/psi_types.h> #include <linux/sched.h>
@@ -9,7 +10,7 @@
#ifdef CONFIG_PSI
-extern bool psi_disabled; +extern struct static_key_false psi_disabled;
void psi_init(void);
diff --git a/init/Kconfig b/init/Kconfig index e94174f..7f6b705 100644 --- a/init/Kconfig +++ b/init/Kconfig @@ -523,6 +523,15 @@ config PSI
Say N if unsure.
+config PSI_DEFAULT_DISABLED + bool "Require boot parameter to enable pressure stall information tracking" + default n + depends on PSI + help + If set, pressure stall information tracking will be disabled + per default but can be enabled through passing psi_enable=1 + on the kernel commandline during boot. + endmenu # "CPU/Task time and stats accounting"
config CPU_ISOLATION diff --git a/kernel/sched/psi.c b/kernel/sched/psi.c index 3d7355d..fe24de3 100644 --- a/kernel/sched/psi.c +++ b/kernel/sched/psi.c @@ -136,8 +136,18 @@
static int psi_bug __read_mostly;
-bool psi_disabled __read_mostly; -core_param(psi_disabled, psi_disabled, bool, 0644); +DEFINE_STATIC_KEY_FALSE(psi_disabled); + +#ifdef CONFIG_PSI_DEFAULT_DISABLED +bool psi_enable; +#else +bool psi_enable = true; +#endif +static int __init setup_psi(char *str) +{ + return kstrtobool(str, &psi_enable) == 0; +} +__setup("psi=", setup_psi);
/* Running averages - we need to be higher-res than loadavg */ #define PSI_FREQ (2*HZ+1) /* 2 sec intervals */ @@ -169,8 +179,10 @@ static void group_init(struct psi_group *group)
void __init psi_init(void) { - if (psi_disabled) + if (!psi_enable) { + static_branch_enable(&psi_disabled); return; + }
psi_period = jiffies_to_nsecs(PSI_FREQ); group_init(&psi_system); @@ -549,7 +561,7 @@ void psi_memstall_enter(unsigned long *flags) struct rq_flags rf; struct rq *rq;
- if (psi_disabled) + if (static_branch_likely(&psi_disabled)) return;
*flags = current->flags & PF_MEMSTALL; @@ -579,7 +591,7 @@ void psi_memstall_leave(unsigned long *flags) struct rq_flags rf; struct rq *rq;
- if (psi_disabled) + if (static_branch_likely(&psi_disabled)) return;
if (*flags) @@ -600,7 +612,7 @@ void psi_memstall_leave(unsigned long *flags) #ifdef CONFIG_CGROUPS int psi_cgroup_alloc(struct cgroup *cgroup) { - if (psi_disabled) + if (static_branch_likely(&psi_disabled)) return 0;
cgroup->psi.pcpu = alloc_percpu(struct psi_group_cpu); @@ -612,7 +624,7 @@ int psi_cgroup_alloc(struct cgroup *cgroup)
void psi_cgroup_free(struct cgroup *cgroup) { - if (psi_disabled) + if (static_branch_likely(&psi_disabled)) return;
cancel_delayed_work_sync(&cgroup->psi.clock_work); @@ -637,7 +649,7 @@ void cgroup_move_task(struct task_struct *task, struct css_set *to) struct rq_flags rf; struct rq *rq;
- if (psi_disabled) { + if (static_branch_likely(&psi_disabled)) { /* * Lame to do this here, but the scheduler cannot be locked * from the outside, so we move cgroups from inside sched/. @@ -673,7 +685,7 @@ int psi_show(struct seq_file *m, struct psi_group *group, enum psi_res res) { int full;
- if (psi_disabled) + if (static_branch_likely(&psi_disabled)) return -EOPNOTSUPP;
update_stats(group); diff --git a/kernel/sched/stats.h b/kernel/sched/stats.h index 7cd0104..1b49905 100644 --- a/kernel/sched/stats.h +++ b/kernel/sched/stats.h @@ -87,7 +87,7 @@ static inline void psi_enqueue(struct task_struct *p, bool wakeup) { int clear = 0, set = TSK_RUNNING;
- if (psi_disabled) + if (static_branch_likely(&psi_disabled)) return;
if (!wakeup || p->sched_psi_wake_requeue) { @@ -107,7 +107,7 @@ static inline void psi_dequeue(struct task_struct *p, bool sleep) { int clear = TSK_RUNNING, set = 0;
- if (psi_disabled) + if (static_branch_likely(&psi_disabled)) return;
if (!sleep) { @@ -123,7 +123,7 @@ static inline void psi_dequeue(struct task_struct *p, bool sleep)
static inline void psi_ttwu_dequeue(struct task_struct *p) { - if (psi_disabled) + if (static_branch_likely(&psi_disabled)) return; /* * Is the task being migrated during a wakeup? Make sure to @@ -149,7 +149,7 @@ static inline void psi_ttwu_dequeue(struct task_struct *p)
static inline void psi_task_tick(struct rq *rq) { - if (psi_disabled) + if (static_branch_likely(&psi_disabled)) return;
if (unlikely(rq->curr->flags & PF_MEMSTALL))
From: Baruch Siach baruch@tkos.co.il
mainline inclusion from mainline-v5.4 commit 428a1cb4baeb9e5c7feda93af7372ba6d2491558 category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I47QS2 CVE: NA
--------------------------------
The kernel commandline parameter named in CONFIG_PSI_DEFAULT_DISABLED help text contradicts the documentation in kernel-parameters.txt, and the code. Fix that.
Link: http://lkml.kernel.org/r/20181203213416.GA12627@cmpxchg.org Fixes: e0c274472d ("psi: make disabling/enabling easier for vendor kernels") Signed-off-by: Baruch Siach baruch@tkos.co.il Acked-by: Johannes Weiner hannes@cmpxchg.org Signed-off-by: Andrew Morton akpm@linux-foundation.org Signed-off-by: Linus Torvalds torvalds@linux-foundation.org Signed-off-by: Liu Xinpeng liuxp11@chinatelecom.cn Signed-off-by: Ctyun Kernel ctyuncommiter01@chinatelecom.cn --- init/Kconfig | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-)
diff --git a/init/Kconfig b/init/Kconfig index 7f6b705..d3d6463 100644 --- a/init/Kconfig +++ b/init/Kconfig @@ -529,8 +529,8 @@ config PSI_DEFAULT_DISABLED depends on PSI help If set, pressure stall information tracking will be disabled - per default but can be enabled through passing psi_enable=1 - on the kernel commandline during boot. + per default but can be enabled through passing psi=1 on the + kernel commandline during boot.
endmenu # "CPU/Task time and stats accounting"
From: Johannes Weiner hannes@cmpxchg.org
mainline inclusion from mainline-v5.4 commit 1b69ac6b40ebd85eed73e4dbccde2a36961ab990 category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I47QS2 CVE: NA
--------------------------------
psi has provisions to shut off the periodic aggregation worker when there is a period of no task activity - and thus no data that needs aggregating. However, while developing psi monitoring, Suren noticed that the aggregation clock currently won't stay shut off for good.
Debugging this revealed a flaw in the idle design: an aggregation run will see no task activity and decide to go to sleep; shortly thereafter, the kworker thread that executed the aggregation will go idle and cause a scheduling change, during which the psi callback will kick the !pending worker again. This will ping-pong forever, and is equivalent to having no shut-off logic at all (but with more code!)
Fix this by exempting aggregation workers from psi's clock waking logic when the state change is them going to sleep. To do this, tag workers with the last work function they executed, and if in psi we see a worker going to sleep after aggregating psi data, we will not reschedule the aggregation work item.
What if the worker is also executing other items before or after?
Any psi state times that were incurred by work items preceding the aggregation work will have been collected from the per-cpu buckets during the aggregation itself. If there are work items following the aggregation work, the worker's last_func tag will be overwritten and the aggregator will be kept alive to process this genuine new activity.
If the aggregation work is the last thing the worker does, and we decide to go idle, the brief period of non-idle time incurred between the aggregation run and the kworker's dequeue will be stranded in the per-cpu buckets until the clock is woken by later activity. But that should not be a problem. The buckets can hold 4s worth of time, and future activity will wake the clock with a 2s delay, giving us 2s worth of data we can leave behind when disabling aggregation. If it takes a worker more than two seconds to go idle after it finishes its last work item, we likely have bigger problems in the system, and won't notice one sample that was averaged with a bogus per-CPU weight.
Link: http://lkml.kernel.org/r/20190116193501.1910-1-hannes@cmpxchg.org Fixes: eb414681d5a0 ("psi: pressure stall information for CPU, memory, and IO") Signed-off-by: Johannes Weiner hannes@cmpxchg.org Reported-by: Suren Baghdasaryan surenb@google.com Acked-by: Tejun Heo tj@kernel.org Cc: Peter Zijlstra peterz@infradead.org Cc: Lai Jiangshan jiangshanlai@gmail.com Signed-off-by: Andrew Morton akpm@linux-foundation.org Signed-off-by: Linus Torvalds torvalds@linux-foundation.org Signed-off-by: Liu Xinpeng liuxp11@chinatelecom.cn Signed-off-by: Ctyun Kernel ctyuncommiter01@chinatelecom.cn --- kernel/sched/psi.c | 21 +++++++++++++++++---- kernel/workqueue.c | 23 +++++++++++++++++++++++ kernel/workqueue_internal.h | 6 +++++- 3 files changed, 45 insertions(+), 5 deletions(-)
diff --git a/kernel/sched/psi.c b/kernel/sched/psi.c index fe24de3..c348478 100644 --- a/kernel/sched/psi.c +++ b/kernel/sched/psi.c @@ -124,6 +124,7 @@ * sampling of the aggregate task states would be. */
+#include "../workqueue_internal.h" #include <linux/sched/loadavg.h> #include <linux/seq_file.h> #include <linux/proc_fs.h> @@ -480,9 +481,6 @@ static void psi_group_change(struct psi_group *group, int cpu, groupc->tasks[t]++;
write_seqcount_end(&groupc->seq); - - if (!delayed_work_pending(&group->clock_work)) - schedule_delayed_work(&group->clock_work, PSI_FREQ); }
static struct psi_group *iterate_groups(struct task_struct *task, void **iter) @@ -513,6 +511,7 @@ void psi_task_change(struct task_struct *task, int clear, int set) { int cpu = task_cpu(task); struct psi_group *group; + bool wake_clock = true; void *iter = NULL;
if (!task->pid) @@ -530,8 +529,22 @@ void psi_task_change(struct task_struct *task, int clear, int set) task->psi_flags &= ~clear; task->psi_flags |= set;
- while ((group = iterate_groups(task, &iter))) + /* + * Periodic aggregation shuts off if there is a period of no + * task changes, so we wake it back up if necessary. However, + * don't do this if the task change is the aggregation worker + * itself going to sleep, or we'll ping-pong forever. + */ + if (unlikely((clear & TSK_RUNNING) && + (task->flags & PF_WQ_WORKER) && + wq_worker_last_func(task) == psi_update_work)) + wake_clock = false; + + while ((group = iterate_groups(task, &iter))) { psi_group_change(group, cpu, clear, set); + if (wake_clock && !delayed_work_pending(&group->clock_work)) + schedule_delayed_work(&group->clock_work, PSI_FREQ); + } }
void psi_memstall_tick(struct task_struct *task, int cpu) diff --git a/kernel/workqueue.c b/kernel/workqueue.c index a46532b..60f5730 100644 --- a/kernel/workqueue.c +++ b/kernel/workqueue.c @@ -906,6 +906,26 @@ void wq_worker_sleeping(struct task_struct *task) }
/** + * wq_worker_last_func - retrieve worker's last work function + * + * Determine the last function a worker executed. This is called from + * the scheduler to get a worker's last known identity. + * + * CONTEXT: + * spin_lock_irq(rq->lock) + * + * Return: + * The last work function %current executed as a worker, NULL if it + * hasn't executed any work yet. + */ +work_func_t wq_worker_last_func(struct task_struct *task) +{ + struct worker *worker = kthread_data(task); + + return worker->last_func; +} + +/** * worker_set_flags - set worker flags and adjust nr_running accordingly * @worker: self * @flags: flags to set @@ -2191,6 +2211,9 @@ static void process_one_work(struct worker *worker, struct work_struct *work) worker_clr_flags(worker, WORKER_NICED); }
+ /* tag the worker for identification in schedule() */ + worker->last_func = worker->current_func; + /* we're done with it, release */ hash_del(&worker->hentry); worker->current_work = NULL; diff --git a/kernel/workqueue_internal.h b/kernel/workqueue_internal.h index 30cfed2..498de0e 100644 --- a/kernel/workqueue_internal.h +++ b/kernel/workqueue_internal.h @@ -54,6 +54,9 @@ struct worker {
/* used only by rescuers to point to the target workqueue */ struct workqueue_struct *rescue_wq; /* I: the workqueue to rescue */ + + /* used by the scheduler to determine a worker's last known identity */ + work_func_t last_func; };
/** @@ -68,9 +71,10 @@ static inline struct worker *current_wq_worker(void)
/* * Scheduler hooks for concurrency managed workqueue. Only to be used from - * sched/core.c and workqueue.c. + * sched/ and workqueue.c. */ void wq_worker_running(struct task_struct *task); void wq_worker_sleeping(struct task_struct *task); +work_func_t wq_worker_last_func(struct task_struct *task);
#endif /* _KERNEL_WORKQUEUE_INTERNAL_H */
From: Johannes Weiner hannes@cmpxchg.org
mainline inclusion from mainline-v5.4 commit 4e37504d1c49eec6434d0cc97278d2b51c9e8763 category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I47QS2 CVE: NA
-------------------------------- MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit
We've been seeing hard-to-trigger psi crashes when running inside VM instances:
divide error: 0000 [#1] SMP PTI Modules linked in: [...] CPU: 0 PID: 212 Comm: kworker/0:2 Not tainted 4.16.18-119_fbk9_3817_gfe944c98d695 #119 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 0.0.0 02/06/2015 Workqueue: events psi_clock RIP: 0010:psi_update_stats+0x270/0x490 RSP: 0018:ffffc90001117e10 EFLAGS: 00010246 RAX: 0000000000000000 RBX: 0000000000000000 RCX: ffff8800a35a13f8 RDX: 0000000000000000 RSI: ffff8800a35a1340 RDI: 0000000000000000 RBP: 0000000000000658 R08: ffff8800a35a1470 R09: 0000000000000000 R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000 R13: 0000000000000000 R14: 0000000000000000 R15: 00000000000f8502 FS: 0000000000000000(0000) GS:ffff88023fc00000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00007fbe370fa000 CR3: 00000000b1e3a000 CR4: 00000000000006f0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 Call Trace: psi_clock+0x12/0x50 process_one_work+0x1e0/0x390 worker_thread+0x2b/0x3c0 ? rescuer_thread+0x330/0x330 kthread+0x113/0x130 ? kthread_create_worker_on_cpu+0x40/0x40 ? SyS_exit_group+0x10/0x10 ret_from_fork+0x35/0x40 Code: 48 0f 47 c7 48 01 c2 45 85 e4 48 89 16 0f 85 e6 00 00 00 4c 8b 49 10 4c 8b 51 08 49 69 d9 f2 07 00 00 48 6b c0 64 4c 8b 29 31 d2 <48> f7 f7 49 69 d5 8d 06 00 00 48 89 c5 4c 69 f0 00 98 0b 00 48
The Code-line points to `period` being 0 inside update_stats(), and we divide by that when calculating that period's pressure percentage.
The elapsed period should never be 0. The reason this can happen is due to an off-by-one in the idle time / missing period calculation combined with a coarse sched_clock() in the virtual machine.
The target time for aggregation is advanced into the future on a fixed grid to prevent clock drift. So when an aggregation runs after some idle period, we can not just set it to "now + psi_period", but have to calculate the downtime and advance the target time relative to itself.
However, if the aggregator was disabled exactly one psi_period (ns), we drop one idle period in the calculation due to a > when we should do >=. In that case, next_update will be advanced from 'now - psi_period' to 'now' when it should be moved to 'now + psi_period'. The run finishes with last_update == next_update == sched_clock().
With hardware clocks, this exact nanosecond match isn't likely in the first place; but if it does happen, the clock will still have moved on and the period non-zero by the time the worker runs. A pointlessly short period, but besides the extra work, no harm no foul. However, a slow sched_clock() like we have on VMs might not have advanced either by the time the worker runs again. And when we calculate the elapsed period, the result, our pressure divisor, will be 0. Ouch.
Fix this by correctly handling the situation when the elapsed time between aggregation runs is precisely two periods, and advance the expiration timestamp correctly to period into the future.
Link: http://lkml.kernel.org/r/20190214193157.15788-1-hannes@cmpxchg.org Signed-off-by: Johannes Weiner hannes@cmpxchg.org Reported-by: Łukasz Siudut <lsiudut@fb.com Reviewed-by: Andrew Morton akpm@linux-foundation.org Cc: Peter Zijlstra peterz@infradead.org Signed-off-by: Andrew Morton akpm@linux-foundation.org Signed-off-by: Linus Torvalds torvalds@linux-foundation.org Signed-off-by: Liu Xinpeng liuxp11@chinatelecom.cn Signed-off-by: Ctyun Kernel ctyuncommiter01@chinatelecom.cn --- kernel/sched/psi.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/kernel/sched/psi.c b/kernel/sched/psi.c index c348478..0e97ca9 100644 --- a/kernel/sched/psi.c +++ b/kernel/sched/psi.c @@ -322,7 +322,7 @@ static bool update_stats(struct psi_group *group) expires = group->next_update; if (now < expires) goto out; - if (now - expires > psi_period) + if (now - expires >= psi_period) missed_periods = div_u64(now - expires, psi_period);
/*
From: Johannes Weiner hannes@cmpxchg.org
mainline inclusion from mainline-v5.4 commit 147e1a97c4a0bdd43f55a582a9416bb9092563a9 category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I47QS2 CVE: NA
--------------------------------
Patch series "psi: pressure stall monitors", v3.
Android is adopting psi to detect and remedy memory pressure that results in stuttering and decreased responsiveness on mobile devices.
Psi gives us the stall information, but because we're dealing with latencies in the millisecond range, periodically reading the pressure files to detect stalls in a timely fashion is not feasible. Psi also doesn't aggregate its averages at a high enough frequency right now.
This patch series extends the psi interface such that users can configure sensitive latency thresholds and use poll() and friends to be notified when these are breached.
As high-frequency aggregation is costly, it implements an aggregation method that is optimized for fast, short-interval averaging, and makes the aggregation frequency adaptive, such that high-frequency updates only happen while monitored stall events are actively occurring.
With these patches applied, Android can monitor for, and ward off, mounting memory shortages before they cause problems for the user. For example, using memory stall monitors in userspace low memory killer daemon (lmkd) we can detect mounting pressure and kill less important processes before device becomes visibly sluggish.
In our memory stress testing psi memory monitors produce roughly 10x less false positives compared to vmpressure signals. Having ability to specify multiple triggers for the same psi metric allows other parts of Android framework to monitor memory state of the device and act accordingly.
The new interface is straightforward. The user opens one of the pressure files for writing and writes a trigger description into the file descriptor that defines the stall state - some or full, and the maximum stall time over a given window of time. E.g.:
/* Signal when stall time exceeds 100ms of a 1s window */ char trigger[] = "full 100000 1000000"; fd = open("/proc/pressure/memory"); write(fd, trigger, sizeof(trigger)); while (poll() >= 0) { ... } close(fd);
When the monitored stall state is entered, psi adapts its aggregation frequency according to what the configured time window requires in order to emit event signals in a timely fashion. Once the stalling subsides, aggregation reverts back to normal.
The trigger is associated with the open file descriptor. To stop monitoring, the user only needs to close the file descriptor and the trigger is discarded.
Patches 1-4 prepare the psi code for polling support. Patch 5 implements the adaptive polling logic, the pressure growth detection optimized for short intervals, and hooks up write() and poll() on the pressure files.
The patches were developed in collaboration with Johannes Weiner.
This patch (of 5):
Kernfs has a standardized poll/notification mechanism for waking all pollers on all fds when a filesystem node changes. To allow polling for custom events, add a .poll callback that can override the default.
This is in preparation for pollable cgroup pressure files which have per-fd trigger configurations.
Link: http://lkml.kernel.org/r/20190124211518.244221-2-surenb@google.com Signed-off-by: Johannes Weiner hannes@cmpxchg.org Signed-off-by: Suren Baghdasaryan surenb@google.com Cc: Dennis Zhou dennis@kernel.org Cc: Ingo Molnar mingo@redhat.com Cc: Jens Axboe axboe@kernel.dk Cc: Li Zefan lizefan@huawei.com Cc: Peter Zijlstra peterz@infradead.org Cc: Tejun Heo tj@kernel.org Signed-off-by: Andrew Morton akpm@linux-foundation.org Signed-off-by: Linus Torvalds torvalds@linux-foundation.org Signed-off-by: Liu Xinpeng liuxp11@chinatelecom.cn Signed-off-by: Ctyun Kernel ctyuncommiter01@chinatelecom.cn --- fs/kernfs/file.c | 31 ++++++++++++++++++++----------- include/linux/kernfs.h | 6 ++++++ 2 files changed, 26 insertions(+), 11 deletions(-)
diff --git a/fs/kernfs/file.c b/fs/kernfs/file.c index dbf5bc2..2d8b91f 100644 --- a/fs/kernfs/file.c +++ b/fs/kernfs/file.c @@ -832,26 +832,35 @@ void kernfs_drain_open_files(struct kernfs_node *kn) * to see if it supports poll (Neither 'poll' nor 'select' return * an appropriate error code). When in doubt, set a suitable timeout value. */ +__poll_t kernfs_generic_poll(struct kernfs_open_file *of, poll_table *wait) +{ + struct kernfs_node *kn = kernfs_dentry_node(of->file->f_path.dentry); + struct kernfs_open_node *on = kn->attr.open; + + poll_wait(of->file, &on->poll, wait); + + if (of->event != atomic_read(&on->event)) + return DEFAULT_POLLMASK|EPOLLERR|EPOLLPRI; + + return DEFAULT_POLLMASK; +} + static __poll_t kernfs_fop_poll(struct file *filp, poll_table *wait) { struct kernfs_open_file *of = kernfs_of(filp); struct kernfs_node *kn = kernfs_dentry_node(filp->f_path.dentry); - struct kernfs_open_node *on = kn->attr.open; + __poll_t ret;
if (!kernfs_get_active(kn)) - goto trigger; + return DEFAULT_POLLMASK|EPOLLERR|EPOLLPRI;
- poll_wait(filp, &on->poll, wait); + if (kn->attr.ops->poll) + ret = kn->attr.ops->poll(of, wait); + else + ret = kernfs_generic_poll(of, wait);
kernfs_put_active(kn); - - if (of->event != atomic_read(&on->event)) - goto trigger; - - return DEFAULT_POLLMASK; - - trigger: - return DEFAULT_POLLMASK|EPOLLERR|EPOLLPRI; + return ret; }
static void kernfs_notify_workfn(struct work_struct *work) diff --git a/include/linux/kernfs.h b/include/linux/kernfs.h index c41790d..ab81a22 100644 --- a/include/linux/kernfs.h +++ b/include/linux/kernfs.h @@ -26,6 +26,7 @@ struct vm_area_struct; struct super_block; struct file_system_type; +struct poll_table_struct;
struct kernfs_open_node; struct kernfs_iattrs; @@ -268,6 +269,9 @@ struct kernfs_ops { ssize_t (*write)(struct kernfs_open_file *of, char *buf, size_t bytes, loff_t off);
+ __poll_t (*poll)(struct kernfs_open_file *of, + struct poll_table_struct *pt); + int (*mmap)(struct kernfs_open_file *of, struct vm_area_struct *vma);
#ifdef CONFIG_DEBUG_LOCK_ALLOC @@ -360,6 +364,8 @@ int kernfs_remove_by_name_ns(struct kernfs_node *parent, const char *name, int kernfs_rename_ns(struct kernfs_node *kn, struct kernfs_node *new_parent, const char *new_name, const void *new_ns); int kernfs_setattr(struct kernfs_node *kn, const struct iattr *iattr); +__poll_t kernfs_generic_poll(struct kernfs_open_file *of, + struct poll_table_struct *pt); void kernfs_notify(struct kernfs_node *kn);
const void *kernfs_super_ns(struct super_block *sb);
From: Suren Baghdasaryan surenb@google.com
mainline inclusion from mainline-v5.4 commit 33b2d6302abc4ccea1d9b3f095e2e27b02ca264e category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I47QS2 CVE: NA
--------------------------------
Patch series "psi: pressure stall monitors", v6.
This is a respin of: https://lwn.net/ml/linux-kernel/20190308184311.144521-1-surenb%40google.com/
Android is adopting psi to detect and remedy memory pressure that results in stuttering and decreased responsiveness on mobile devices.
Psi gives us the stall information, but because we're dealing with latencies in the millisecond range, periodically reading the pressure files to detect stalls in a timely fashion is not feasible. Psi also doesn't aggregate its averages at a high-enough frequency right now.
This patch series extends the psi interface such that users can configure sensitive latency thresholds and use poll() and friends to be notified when these are breached.
As high-frequency aggregation is costly, it implements an aggregation method that is optimized for fast, short-interval averaging, and makes the aggregation frequency adaptive, such that high-frequency updates only happen while monitored stall events are actively occurring.
With these patches applied, Android can monitor for, and ward off, mounting memory shortages before they cause problems for the user. For example, using memory stall monitors in userspace low memory killer daemon (lmkd) we can detect mounting pressure and kill less important processes before device becomes visibly sluggish. In our memory stress testing psi memory monitors produce roughly 10x less false positives compared to vmpressure signals. Having ability to specify multiple triggers for the same psi metric allows other parts of Android framework to monitor memory state of the device and act accordingly.
The new interface is straight-forward. The user opens one of the pressure files for writing and writes a trigger description into the file descriptor that defines the stall state - some or full, and the maximum stall time over a given window of time. E.g.:
/* Signal when stall time exceeds 100ms of a 1s window */ char trigger[] = "full 100000 1000000" fd = open("/proc/pressure/memory") write(fd, trigger, sizeof(trigger)) while (poll() >= 0) { ... }; close(fd);
When the monitored stall state is entered, psi adapts its aggregation frequency according to what the configured time window requires in order to emit event signals in a timely fashion. Once the stalling subsides, aggregation reverts back to normal.
The trigger is associated with the open file descriptor. To stop monitoring, the user only needs to close the file descriptor and the trigger is discarded.
Patches 1-6 prepare the psi code for polling support. Patch 7 implements the adaptive polling logic, the pressure growth detection optimized for short intervals, and hooks up write() and poll() on the pressure files.
The patches were developed in collaboration with Johannes Weiner.
This patch (of 7):
The psi monitoring patches will need to determine the same states as record_times(). To avoid calculating them twice, maintain a state mask that can be consulted cheaply. Do this in a separate patch to keep the churn in the main feature patch at a minimum.
This adds 4-byte state_mask member into psi_group_cpu struct which results in its first cacheline-aligned part becoming 52 bytes long. Add explicit values to enumeration element counters that affect psi_group_cpu struct size.
Link: http://lkml.kernel.org/r/20190124211518.244221-4-surenb@google.com Signed-off-by: Suren Baghdasaryan surenb@google.com Acked-by: Johannes Weiner hannes@cmpxchg.org Cc: Dennis Zhou dennis@kernel.org Cc: Ingo Molnar mingo@redhat.com Cc: Jens Axboe axboe@kernel.dk Cc: Li Zefan lizefan@huawei.com Cc: Peter Zijlstra peterz@infradead.org Cc: Tejun Heo tj@kernel.org Signed-off-by: Andrew Morton akpm@linux-foundation.org Signed-off-by: Linus Torvalds torvalds@linux-foundation.org Signed-off-by: Liu Xinpeng liuxp11@chinatelecom.cn Signed-off-by: Ctyun Kernel ctyuncommiter01@chinatelecom.cn --- include/linux/psi_types.h | 9 ++++++--- kernel/sched/psi.c | 29 +++++++++++++++++++---------- 2 files changed, 25 insertions(+), 13 deletions(-)
diff --git a/include/linux/psi_types.h b/include/linux/psi_types.h index 2cf422d..762c6bb 100644 --- a/include/linux/psi_types.h +++ b/include/linux/psi_types.h @@ -11,7 +11,7 @@ enum psi_task_count { NR_IOWAIT, NR_MEMSTALL, NR_RUNNING, - NR_PSI_TASK_COUNTS, + NR_PSI_TASK_COUNTS = 3, };
/* Task state bitmasks */ @@ -24,7 +24,7 @@ enum psi_res { PSI_IO, PSI_MEM, PSI_CPU, - NR_PSI_RESOURCES, + NR_PSI_RESOURCES = 3, };
/* @@ -41,7 +41,7 @@ enum psi_states { PSI_CPU_SOME, /* Only per-CPU, to weigh the CPU in the global average: */ PSI_NONIDLE, - NR_PSI_STATES, + NR_PSI_STATES = 6, };
struct psi_group_cpu { @@ -53,6 +53,9 @@ struct psi_group_cpu { /* States of the tasks belonging to this group */ unsigned int tasks[NR_PSI_TASK_COUNTS];
+ /* Aggregate pressure state derived from the tasks */ + u32 state_mask; + /* Period time sampling buckets for each state of interest (ns) */ u32 times[NR_PSI_STATES];
diff --git a/kernel/sched/psi.c b/kernel/sched/psi.c index 0e97ca9..22c1505 100644 --- a/kernel/sched/psi.c +++ b/kernel/sched/psi.c @@ -213,17 +213,17 @@ static bool test_state(unsigned int *tasks, enum psi_states state) static void get_recent_times(struct psi_group *group, int cpu, u32 *times) { struct psi_group_cpu *groupc = per_cpu_ptr(group->pcpu, cpu); - unsigned int tasks[NR_PSI_TASK_COUNTS]; u64 now, state_start; + enum psi_states s; unsigned int seq; - int s; + u32 state_mask;
/* Snapshot a coherent view of the CPU state */ do { seq = read_seqcount_begin(&groupc->seq); now = cpu_clock(cpu); memcpy(times, groupc->times, sizeof(groupc->times)); - memcpy(tasks, groupc->tasks, sizeof(groupc->tasks)); + state_mask = groupc->state_mask; state_start = groupc->state_start; } while (read_seqcount_retry(&groupc->seq, seq));
@@ -239,7 +239,7 @@ static void get_recent_times(struct psi_group *group, int cpu, u32 *times) * (u32) and our reported pressure close to what's * actually happening. */ - if (test_state(tasks, s)) + if (state_mask & (1 << s)) times[s] += now - state_start;
delta = times[s] - groupc->times_prev[s]; @@ -407,15 +407,15 @@ static void record_times(struct psi_group_cpu *groupc, int cpu, delta = now - groupc->state_start; groupc->state_start = now;
- if (test_state(groupc->tasks, PSI_IO_SOME)) { + if (groupc->state_mask & (1 << PSI_IO_SOME)) { groupc->times[PSI_IO_SOME] += delta; - if (test_state(groupc->tasks, PSI_IO_FULL)) + if (groupc->state_mask & (1 << PSI_IO_FULL)) groupc->times[PSI_IO_FULL] += delta; }
- if (test_state(groupc->tasks, PSI_MEM_SOME)) { + if (groupc->state_mask & (1 << PSI_MEM_SOME)) { groupc->times[PSI_MEM_SOME] += delta; - if (test_state(groupc->tasks, PSI_MEM_FULL)) + if (groupc->state_mask & (1 << PSI_MEM_FULL)) groupc->times[PSI_MEM_FULL] += delta; else if (memstall_tick) { u32 sample; @@ -436,10 +436,10 @@ static void record_times(struct psi_group_cpu *groupc, int cpu, } }
- if (test_state(groupc->tasks, PSI_CPU_SOME)) + if (groupc->state_mask & (1 << PSI_CPU_SOME)) groupc->times[PSI_CPU_SOME] += delta;
- if (test_state(groupc->tasks, PSI_NONIDLE)) + if (groupc->state_mask & (1 << PSI_NONIDLE)) groupc->times[PSI_NONIDLE] += delta; }
@@ -448,6 +448,8 @@ static void psi_group_change(struct psi_group *group, int cpu, { struct psi_group_cpu *groupc; unsigned int t, m; + enum psi_states s; + u32 state_mask = 0;
groupc = per_cpu_ptr(group->pcpu, cpu);
@@ -480,6 +482,13 @@ static void psi_group_change(struct psi_group *group, int cpu, if (set & (1 << t)) groupc->tasks[t]++;
+ /* Calculate state mask representing active states */ + for (s = 0; s < NR_PSI_STATES; s++) { + if (test_state(groupc->tasks, s)) + state_mask |= (1 << s); + } + groupc->state_mask = state_mask; + write_seqcount_end(&groupc->seq); }
From: Suren Baghdasaryan surenb@google.com
mainline inclusion from mainline-v5.4 commit bcc78db64168eb6dede056fed2999f75f7ace309 category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I47QS2 CVE: NA
--------------------------------
Rename psi_group structure member fields used for calculating psi totals and averages for clear distinction between them and for trigger-related fields that will be added by "psi: introduce psi monitor".
[surenb@google.com: v6] Link: http://lkml.kernel.org/r/20190319235619.260832-4-surenb@google.com Link: http://lkml.kernel.org/r/20190124211518.244221-5-surenb@google.com Signed-off-by: Suren Baghdasaryan surenb@google.com Acked-by: Johannes Weiner hannes@cmpxchg.org Cc: Dennis Zhou dennis@kernel.org Cc: Ingo Molnar mingo@redhat.com Cc: Jens Axboe axboe@kernel.dk Cc: Li Zefan lizefan@huawei.com Cc: Peter Zijlstra peterz@infradead.org Cc: Tejun Heo tj@kernel.org Signed-off-by: Andrew Morton akpm@linux-foundation.org Signed-off-by: Linus Torvalds torvalds@linux-foundation.org Signed-off-by: Liu Xinpeng liuxp11@chinatelecom.cn Signed-off-by: Ctyun Kernel ctyuncommiter01@chinatelecom.cn --- include/linux/psi_types.h | 14 +++++++------- kernel/sched/psi.c | 41 +++++++++++++++++++++-------------------- 2 files changed, 28 insertions(+), 27 deletions(-)
diff --git a/include/linux/psi_types.h b/include/linux/psi_types.h index 762c6bb..4d1c1f6 100644 --- a/include/linux/psi_types.h +++ b/include/linux/psi_types.h @@ -69,17 +69,17 @@ struct psi_group_cpu { };
struct psi_group { - /* Protects data updated during an aggregation */ - struct mutex stat_lock; + /* Protects data used by the aggregator */ + struct mutex avgs_lock;
/* Per-cpu task state & time tracking */ struct psi_group_cpu __percpu *pcpu;
- /* Periodic aggregation state */ - u64 total_prev[NR_PSI_STATES - 1]; - u64 last_update; - u64 next_update; - struct delayed_work clock_work; + /* Running pressure averages */ + u64 avg_total[NR_PSI_STATES - 1]; + u64 avg_last_update; + u64 avg_next_update; + struct delayed_work avgs_work;
/* Total stall times and sampled pressure averages */ u64 total[NR_PSI_STATES - 1]; diff --git a/kernel/sched/psi.c b/kernel/sched/psi.c index 281702d..4fb4d991 100644 --- a/kernel/sched/psi.c +++ b/kernel/sched/psi.c @@ -165,7 +165,7 @@ static int __init setup_psi(char *str) .pcpu = &system_group_pcpu, };
-static void psi_update_work(struct work_struct *work); +static void psi_avgs_work(struct work_struct *work);
static void group_init(struct psi_group *group) { @@ -173,9 +173,9 @@ static void group_init(struct psi_group *group)
for_each_possible_cpu(cpu) seqcount_init(&per_cpu_ptr(group->pcpu, cpu)->seq); - group->next_update = sched_clock() + psi_period; - INIT_DELAYED_WORK(&group->clock_work, psi_update_work); - mutex_init(&group->stat_lock); + group->avg_next_update = sched_clock() + psi_period; + INIT_DELAYED_WORK(&group->avgs_work, psi_avgs_work); + mutex_init(&group->avgs_lock); }
void __init psi_init(void) @@ -278,7 +278,7 @@ static bool update_stats(struct psi_group *group) int cpu; int s;
- mutex_lock(&group->stat_lock); + mutex_lock(&group->avgs_lock);
/* * Collect the per-cpu time buckets and average them into a @@ -319,7 +319,7 @@ static bool update_stats(struct psi_group *group)
/* avgX= */ now = sched_clock(); - expires = group->next_update; + expires = group->avg_next_update; if (now < expires) goto out; if (now - expires >= psi_period) @@ -332,14 +332,14 @@ static bool update_stats(struct psi_group *group) * But the deltas we sample out of the per-cpu buckets above * are based on the actual time elapsing between clock ticks. */ - group->next_update = expires + ((1 + missed_periods) * psi_period); - period = now - (group->last_update + (missed_periods * psi_period)); - group->last_update = now; + group->avg_next_update = expires + ((1 + missed_periods) * psi_period); + period = now - (group->avg_last_update + (missed_periods * psi_period)); + group->avg_last_update = now;
for (s = 0; s < NR_PSI_STATES - 1; s++) { u32 sample;
- sample = group->total[s] - group->total_prev[s]; + sample = group->total[s] - group->avg_total[s]; /* * Due to the lockless sampling of the time buckets, * recorded time deltas can slip into the next period, @@ -359,22 +359,22 @@ static bool update_stats(struct psi_group *group) */ if (sample > period) sample = period; - group->total_prev[s] += sample; + group->avg_total[s] += sample; calc_avgs(group->avg[s], missed_periods, sample, period); } out: - mutex_unlock(&group->stat_lock); + mutex_unlock(&group->avgs_lock); return nonidle_total; }
-static void psi_update_work(struct work_struct *work) +static void psi_avgs_work(struct work_struct *work) { struct delayed_work *dwork; struct psi_group *group; bool nonidle;
dwork = to_delayed_work(work); - group = container_of(dwork, struct psi_group, clock_work); + group = container_of(dwork, struct psi_group, avgs_work);
/* * If there is task activity, periodically fold the per-cpu @@ -391,8 +391,9 @@ static void psi_update_work(struct work_struct *work) u64 now;
now = sched_clock(); - if (group->next_update > now) - delay = nsecs_to_jiffies(group->next_update - now) + 1; + if (group->avg_next_update > now) + delay = nsecs_to_jiffies( + group->avg_next_update - now) + 1; schedule_delayed_work(dwork, delay); } } @@ -546,13 +547,13 @@ void psi_task_change(struct task_struct *task, int clear, int set) */ if (unlikely((clear & TSK_RUNNING) && (task->flags & PF_WQ_WORKER) && - wq_worker_last_func(task) == psi_update_work)) + wq_worker_last_func(task) == psi_avgs_work)) wake_clock = false;
while ((group = iterate_groups(task, &iter))) { psi_group_change(group, cpu, clear, set); - if (wake_clock && !delayed_work_pending(&group->clock_work)) - schedule_delayed_work(&group->clock_work, PSI_FREQ); + if (wake_clock && !delayed_work_pending(&group->avgs_work)) + schedule_delayed_work(&group->avgs_work, PSI_FREQ); } }
@@ -649,7 +650,7 @@ void psi_cgroup_free(struct cgroup *cgroup) if (static_branch_likely(&psi_disabled)) return;
- cancel_delayed_work_sync(&cgroup->psi.clock_work); + cancel_delayed_work_sync(&cgroup->psi.avgs_work); free_percpu(cgroup->psi.pcpu); }
From: Suren Baghdasaryan surenb@google.com
mainline inclusion from mainline-v5.4 commit 7fc70a3999366560ad1d4f2389a78360300c2c6a category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I47QS2 CVE: NA
--------------------------------
Split update_stats into collect_percpu_times and update_averages for collect_percpu_times to be reused later inside psi monitor.
Link: http://lkml.kernel.org/r/20190319235619.260832-5-surenb@google.com Signed-off-by: Suren Baghdasaryan surenb@google.com Acked-by: Johannes Weiner hannes@cmpxchg.org Cc: Dennis Zhou dennis@kernel.org Cc: Ingo Molnar mingo@redhat.com Cc: Jens Axboe axboe@kernel.dk Cc: Li Zefan lizefan@huawei.com Cc: Peter Zijlstra peterz@infradead.org Cc: Tejun Heo tj@kernel.org Signed-off-by: Andrew Morton akpm@linux-foundation.org Signed-off-by: Linus Torvalds torvalds@linux-foundation.org Signed-off-by: Liu Xinpeng liuxp11@chinatelecom.cn Signed-off-by: Ctyun Kernel ctyuncommiter01@chinatelecom.cn --- kernel/sched/psi.c | 57 ++++++++++++++++++++++++++++++++---------------------- 1 file changed, 34 insertions(+), 23 deletions(-)
diff --git a/kernel/sched/psi.c b/kernel/sched/psi.c index 4fb4d991..ace5ed9 100644 --- a/kernel/sched/psi.c +++ b/kernel/sched/psi.c @@ -269,17 +269,13 @@ static void calc_avgs(unsigned long avg[3], int missed_periods, avg[2] = calc_load(avg[2], EXP_300s, pct); }
-static bool update_stats(struct psi_group *group) +static bool collect_percpu_times(struct psi_group *group) { u64 deltas[NR_PSI_STATES - 1] = { 0, }; - unsigned long missed_periods = 0; unsigned long nonidle_total = 0; - u64 now, expires, period; int cpu; int s;
- mutex_lock(&group->avgs_lock); - /* * Collect the per-cpu time buckets and average them into a * single time sample that is normalized to wallclock time. @@ -317,11 +313,18 @@ static bool update_stats(struct psi_group *group) for (s = 0; s < NR_PSI_STATES - 1; s++) group->total[s] += div_u64(deltas[s], max(nonidle_total, 1UL));
+ return nonidle_total; +} + +static u64 update_averages(struct psi_group *group, u64 now) +{ + unsigned long missed_periods = 0; + u64 expires, period; + u64 avg_next_update; + int s; + /* avgX= */ - now = sched_clock(); expires = group->avg_next_update; - if (now < expires) - goto out; if (now - expires >= psi_period) missed_periods = div_u64(now - expires, psi_period);
@@ -332,7 +335,7 @@ static bool update_stats(struct psi_group *group) * But the deltas we sample out of the per-cpu buckets above * are based on the actual time elapsing between clock ticks. */ - group->avg_next_update = expires + ((1 + missed_periods) * psi_period); + avg_next_update = expires + ((1 + missed_periods) * psi_period); period = now - (group->avg_last_update + (missed_periods * psi_period)); group->avg_last_update = now;
@@ -362,9 +365,8 @@ static bool update_stats(struct psi_group *group) group->avg_total[s] += sample; calc_avgs(group->avg[s], missed_periods, sample, period); } -out: - mutex_unlock(&group->avgs_lock); - return nonidle_total; + + return avg_next_update; }
static void psi_avgs_work(struct work_struct *work) @@ -372,10 +374,16 @@ static void psi_avgs_work(struct work_struct *work) struct delayed_work *dwork; struct psi_group *group; bool nonidle; + u64 now;
dwork = to_delayed_work(work); group = container_of(dwork, struct psi_group, avgs_work);
+ mutex_lock(&group->avgs_lock); + + now = sched_clock(); + + nonidle = collect_percpu_times(group); /* * If there is task activity, periodically fold the per-cpu * times and feed samples into the running averages. If things @@ -383,19 +391,15 @@ static void psi_avgs_work(struct work_struct *work) * Once restarted, we'll catch up the running averages in one * go - see calc_avgs() and missed_periods. */ - - nonidle = update_stats(group); + if (now >= group->avg_next_update) + group->avg_next_update = update_averages(group, now);
if (nonidle) { - unsigned long delay = 0; - u64 now; - - now = sched_clock(); - if (group->avg_next_update > now) - delay = nsecs_to_jiffies( - group->avg_next_update - now) + 1; - schedule_delayed_work(dwork, delay); + schedule_delayed_work(dwork, nsecs_to_jiffies( + group->avg_next_update - now) + 1); } + + mutex_unlock(&group->avgs_lock); }
static void record_times(struct psi_group_cpu *groupc, int cpu, @@ -707,11 +711,18 @@ void cgroup_move_task(struct task_struct *task, struct css_set *to) int psi_show(struct seq_file *m, struct psi_group *group, enum psi_res res) { int full; + u64 now;
if (static_branch_likely(&psi_disabled)) return -EOPNOTSUPP;
- update_stats(group); + /* Update averages before reporting them */ + mutex_lock(&group->avgs_lock); + now = sched_clock(); + collect_percpu_times(group); + if (now >= group->avg_next_update) + group->avg_next_update = update_averages(group, now); + mutex_unlock(&group->avgs_lock);
for (full = 0; full < 2 - (res == PSI_CPU); full++) { unsigned long avg[3];
From: Suren Baghdasaryan surenb@google.com
mainline inclusion from mainline-v5.4 commit 333f3017c5a893b000b2b4a3529814ce93fa83d7 category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I47QS2 CVE: NA
--------------------------------
Introduce changed_states parameter into collect_percpu_times to track the states changed since the last update.
This will be needed to detect whether polled states activated in the monitor patch.
Link: http://lkml.kernel.org/r/20190319235619.260832-6-surenb@google.com Signed-off-by: Suren Baghdasaryan surenb@google.com Acked-by: Johannes Weiner hannes@cmpxchg.org Cc: Dennis Zhou dennis@kernel.org Cc: Ingo Molnar mingo@redhat.com Cc: Jens Axboe axboe@kernel.dk Cc: Li Zefan lizefan@huawei.com Cc: Peter Zijlstra peterz@infradead.org Cc: Tejun Heo tj@kernel.org Signed-off-by: Andrew Morton akpm@linux-foundation.org Signed-off-by: Linus Torvalds torvalds@linux-foundation.org Signed-off-by: Liu Xinpeng liuxp11@chinatelecom.cn Signed-off-by: Ctyun Kernel ctyuncommiter01@chinatelecom.cn --- kernel/sched/psi.c | 24 ++++++++++++++++++------ 1 file changed, 18 insertions(+), 6 deletions(-)
diff --git a/kernel/sched/psi.c b/kernel/sched/psi.c index ace5ed9..1b99eef 100644 --- a/kernel/sched/psi.c +++ b/kernel/sched/psi.c @@ -210,7 +210,8 @@ static bool test_state(unsigned int *tasks, enum psi_states state) } }
-static void get_recent_times(struct psi_group *group, int cpu, u32 *times) +static void get_recent_times(struct psi_group *group, int cpu, u32 *times, + u32 *pchanged_states) { struct psi_group_cpu *groupc = per_cpu_ptr(group->pcpu, cpu); u64 now, state_start; @@ -218,6 +219,8 @@ static void get_recent_times(struct psi_group *group, int cpu, u32 *times) unsigned int seq; u32 state_mask;
+ *pchanged_states = 0; + /* Snapshot a coherent view of the CPU state */ do { seq = read_seqcount_begin(&groupc->seq); @@ -246,6 +249,8 @@ static void get_recent_times(struct psi_group *group, int cpu, u32 *times) groupc->times_prev[s] = times[s];
times[s] = delta; + if (delta) + *pchanged_states |= (1 << s); } }
@@ -269,10 +274,11 @@ static void calc_avgs(unsigned long avg[3], int missed_periods, avg[2] = calc_load(avg[2], EXP_300s, pct); }
-static bool collect_percpu_times(struct psi_group *group) +static void collect_percpu_times(struct psi_group *group, u32 *pchanged_states) { u64 deltas[NR_PSI_STATES - 1] = { 0, }; unsigned long nonidle_total = 0; + u32 changed_states = 0; int cpu; int s;
@@ -287,8 +293,11 @@ static bool collect_percpu_times(struct psi_group *group) for_each_possible_cpu(cpu) { u32 times[NR_PSI_STATES]; u32 nonidle; + u32 cpu_changed_states;
- get_recent_times(group, cpu, times); + get_recent_times(group, cpu, times, + &cpu_changed_states); + changed_states |= cpu_changed_states;
nonidle = nsecs_to_jiffies(times[PSI_NONIDLE]); nonidle_total += nonidle; @@ -313,7 +322,8 @@ static bool collect_percpu_times(struct psi_group *group) for (s = 0; s < NR_PSI_STATES - 1; s++) group->total[s] += div_u64(deltas[s], max(nonidle_total, 1UL));
- return nonidle_total; + if (pchanged_states) + *pchanged_states = changed_states; }
static u64 update_averages(struct psi_group *group, u64 now) @@ -373,6 +383,7 @@ static void psi_avgs_work(struct work_struct *work) { struct delayed_work *dwork; struct psi_group *group; + u32 changed_states; bool nonidle; u64 now;
@@ -383,7 +394,8 @@ static void psi_avgs_work(struct work_struct *work)
now = sched_clock();
- nonidle = collect_percpu_times(group); + collect_percpu_times(group, &changed_states); + nonidle = changed_states & (1 << PSI_NONIDLE); /* * If there is task activity, periodically fold the per-cpu * times and feed samples into the running averages. If things @@ -719,7 +731,7 @@ int psi_show(struct seq_file *m, struct psi_group *group, enum psi_res res) /* Update averages before reporting them */ mutex_lock(&group->avgs_lock); now = sched_clock(); - collect_percpu_times(group); + collect_percpu_times(group, NULL); if (now >= group->avg_next_update) group->avg_next_update = update_averages(group, now); mutex_unlock(&group->avgs_lock);
From: Suren Baghdasaryan surenb@google.com
mainline inclusion from mainline-v5.4 commit 8af0c18af1425fc70686c0fdcfc0072cd8431aa0 category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I47QS2 CVE: NA
--------------------------------
kthread.h can't be included in psi_types.h because it creates a circular inclusion with kthread.h eventually including psi_types.h and complaining on kthread structures not being defined because they are defined further in the kthread.h. Resolve this by removing psi_types.h inclusion from the headers included from kthread.h.
Link: http://lkml.kernel.org/r/20190319235619.260832-7-surenb@google.com Signed-off-by: Suren Baghdasaryan surenb@google.com Acked-by: Johannes Weiner hannes@cmpxchg.org Cc: Dennis Zhou dennis@kernel.org Cc: Ingo Molnar mingo@redhat.com Cc: Jens Axboe axboe@kernel.dk Cc: Li Zefan lizefan@huawei.com Cc: Peter Zijlstra peterz@infradead.org Cc: Tejun Heo tj@kernel.org Signed-off-by: Andrew Morton akpm@linux-foundation.org Signed-off-by: Linus Torvalds torvalds@linux-foundation.org Signed-off-by: Liu Xinpeng liuxp11@chinatelecom.cn Signed-off-by: Ctyun Kernel ctyuncommiter01@chinatelecom.cn --- drivers/spi/spi-rockchip.c | 1 + include/linux/kthread.h | 3 ++- include/linux/sched.h | 1 - kernel/kthread.c | 1 + 4 files changed, 4 insertions(+), 2 deletions(-)
diff --git a/drivers/spi/spi-rockchip.c b/drivers/spi/spi-rockchip.c index 185bbdc..e054260 100644 --- a/drivers/spi/spi-rockchip.c +++ b/drivers/spi/spi-rockchip.c @@ -15,6 +15,7 @@
#include <linux/clk.h> #include <linux/dmaengine.h> +#include <linux/interrupt.h> #include <linux/module.h> #include <linux/of.h> #include <linux/pinctrl/consumer.h> diff --git a/include/linux/kthread.h b/include/linux/kthread.h index 8613e49..206f0c3 100644 --- a/include/linux/kthread.h +++ b/include/linux/kthread.h @@ -4,7 +4,6 @@ /* Simple interface for creating and stopping kernel threads without mess. */ #include <linux/err.h> #include <linux/sched.h> -#include <linux/cgroup.h>
__printf(4, 5) struct task_struct *kthread_create_on_node(int (*threadfn)(void *data), @@ -202,6 +201,8 @@ bool kthread_mod_delayed_work(struct kthread_worker *worker,
void kthread_destroy_worker(struct kthread_worker *worker);
+struct cgroup_subsys_state; + #ifdef CONFIG_BLK_CGROUP void kthread_associate_blkcg(struct cgroup_subsys_state *css); struct cgroup_subsys_state *kthread_blkcg(void); diff --git a/include/linux/sched.h b/include/linux/sched.h index a812bc8..d137ff5 100644 --- a/include/linux/sched.h +++ b/include/linux/sched.h @@ -25,7 +25,6 @@ #include <linux/latencytop.h> #include <linux/sched/prio.h> #include <linux/signal_types.h> -#include <linux/psi_types.h> #include <linux/mm_types_task.h> #include <linux/task_io_accounting.h> #include <linux/rseq.h> diff --git a/kernel/kthread.c b/kernel/kthread.c index 4191fed..da07863 100644 --- a/kernel/kthread.c +++ b/kernel/kthread.c @@ -11,6 +11,7 @@ #include <linux/kthread.h> #include <linux/completion.h> #include <linux/err.h> +#include <linux/cgroup.h> #include <linux/cpuset.h> #include <linux/unistd.h> #include <linux/file.h>
From: Dan Schatzberg dschatzberg@fb.com
mainline inclusion from mainline-v5.4 commit df5ba5be7425e1df296d40c5f37a39d98ec666a2 category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I47QS2 CVE: NA
--------------------------------
Pressure metrics are already recorded and exposed in procfs for the entire system, but any tool which monitors cgroup pressure has to special case the root cgroup to read from procfs. This patch exposes the already recorded pressure metrics on the root cgroup.
Link: http://lkml.kernel.org/r/20190510174938.3361741-1-dschatzberg@fb.com Signed-off-by: Dan Schatzberg dschatzberg@fb.com Acked-by: Johannes Weiner hannes@cmpxchg.org Cc: Tejun Heo tj@kernel.org Cc: Li Zefan lizefan@huawei.com Cc: Ingo Molnar mingo@redhat.com Cc: Peter Zijlstra peterz@infradead.org Signed-off-by: Andrew Morton akpm@linux-foundation.org Signed-off-by: Linus Torvalds torvalds@linux-foundation.org Signed-off-by: Liu Xinpeng liuxp11@chinatelecom.cn Signed-off-by: Ctyun Kernel ctyuncommiter01@chinatelecom.cn --- include/linux/psi.h | 1 + kernel/cgroup/cgroup.c | 18 ++++++++++++------ kernel/sched/psi.c | 2 +- 3 files changed, 14 insertions(+), 7 deletions(-)
diff --git a/include/linux/psi.h b/include/linux/psi.h index 7006008..b981e5c 100644 --- a/include/linux/psi.h +++ b/include/linux/psi.h @@ -11,6 +11,7 @@ #ifdef CONFIG_PSI
extern struct static_key_false psi_disabled; +extern struct psi_group psi_system;
void psi_init(void);
diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c index 99e51fa..8066faa 100644 --- a/kernel/cgroup/cgroup.c +++ b/kernel/cgroup/cgroup.c @@ -3433,15 +3433,24 @@ static int cpu_stat_show(struct seq_file *seq, void *v) #ifdef CONFIG_PSI static int cgroup_io_pressure_show(struct seq_file *seq, void *v) { - return psi_show(seq, &seq_css(seq)->cgroup->psi, PSI_IO); + struct cgroup *cgroup = seq_css(seq)->cgroup; + struct psi_group *psi = cgroup->id == 1 ? &psi_system : &cgroup->psi; + + return psi_show(seq, psi, PSI_IO); } static int cgroup_memory_pressure_show(struct seq_file *seq, void *v) { - return psi_show(seq, &seq_css(seq)->cgroup->psi, PSI_MEM); + struct cgroup *cgroup = seq_css(seq)->cgroup; + struct psi_group *psi = cgroup->id == 1 ? &psi_system : &cgroup->psi; + + return psi_show(seq, psi, PSI_MEM); } static int cgroup_cpu_pressure_show(struct seq_file *seq, void *v) { - return psi_show(seq, &seq_css(seq)->cgroup->psi, PSI_CPU); + struct cgroup *cgroup = seq_css(seq)->cgroup; + struct psi_group *psi = cgroup->id == 1 ? &psi_system : &cgroup->psi; + + return psi_show(seq, psi, PSI_CPU); } #endif
@@ -4629,17 +4638,14 @@ static ssize_t cgroup_threads_write(struct kernfs_open_file *of, #ifdef CONFIG_PSI { .name = "io.pressure", - .flags = CFTYPE_NOT_ON_ROOT, .seq_show = cgroup_io_pressure_show, }, { .name = "memory.pressure", - .flags = CFTYPE_NOT_ON_ROOT, .seq_show = cgroup_memory_pressure_show, }, { .name = "cpu.pressure", - .flags = CFTYPE_NOT_ON_ROOT, .seq_show = cgroup_cpu_pressure_show, }, #endif diff --git a/kernel/sched/psi.c b/kernel/sched/psi.c index 1b99eef..b1eb45c 100644 --- a/kernel/sched/psi.c +++ b/kernel/sched/psi.c @@ -161,7 +161,7 @@ static int __init setup_psi(char *str)
/* System-level pressure and stall tracking */ static DEFINE_PER_CPU(struct psi_group_cpu, system_group_pcpu); -static struct psi_group psi_system = { +struct psi_group psi_system = { .pcpu = &system_group_pcpu, };
From: Johannes Weiner hannes@cmpxchg.org
mainline inclusion from mainline-v5.4 commit 3dfbe25c27eab7c90c8a7e97b4c354a9d24dd985 category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I47QS2 CVE: NA
--------------------------------
Jingfeng reports rare div0 crashes in psi on systems with some uptime:
[58914.066423] divide error: 0000 [#1] SMP [58914.070416] Modules linked in: ipmi_poweroff ipmi_watchdog toa overlay fuse tcp_diag inet_diag binfmt_misc aisqos(O) aisqos_hotfixes(O) [58914.083158] CPU: 94 PID: 140364 Comm: kworker/94:2 Tainted: G W OE K 4.9.151-015.ali3000.alios7.x86_64 #1 [58914.093722] Hardware name: Alibaba Alibaba Cloud ECS/Alibaba Cloud ECS, BIOS 3.23.34 02/14/2019 [58914.102728] Workqueue: events psi_update_work [58914.107258] task: ffff8879da83c280 task.stack: ffffc90059dcc000 [58914.113336] RIP: 0010:[] [] psi_update_stats+0x1c1/0x330 [58914.122183] RSP: 0018:ffffc90059dcfd60 EFLAGS: 00010246 [58914.127650] RAX: 0000000000000000 RBX: ffff8858fe98be50 RCX: 000000007744d640 [58914.134947] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 00003594f700648e [58914.142243] RBP: ffffc90059dcfdf8 R08: 0000359500000000 R09: 0000000000000000 [58914.149538] R10: 0000000000000000 R11: 0000000000000000 R12: 0000359500000000 [58914.156837] R13: 0000000000000000 R14: 0000000000000000 R15: ffff8858fe98bd78 [58914.164136] FS: 0000000000000000(0000) GS:ffff887f7f380000(0000) knlGS:0000000000000000 [58914.172529] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [58914.178467] CR2: 00007f2240452090 CR3: 0000005d5d258000 CR4: 00000000007606f0 [58914.185765] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 [58914.193061] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 [58914.200360] PKRU: 55555554 [58914.203221] Stack: [58914.205383] ffff8858fe98bd48 00000000000002f0 0000002e81036d09 ffffc90059dcfde8 [58914.213168] ffff8858fe98bec8 0000000000000000 0000000000000000 0000000000000000 [58914.220951] 0000000000000000 0000000000000000 0000000000000000 0000000000000000 [58914.228734] Call Trace: [58914.231337] [] psi_update_work+0x22/0x60 [58914.237067] [] process_one_work+0x189/0x420 [58914.243063] [] worker_thread+0x4e/0x4b0 [58914.248701] [] ? process_one_work+0x420/0x420 [58914.254869] [] kthread+0xe6/0x100 [58914.259994] [] ? kthread_park+0x60/0x60 [58914.265640] [] ret_from_fork+0x39/0x50 [58914.271193] Code: 41 29 c3 4d 39 dc 4d 0f 42 dc <49> f7 f1 48 8b 13 48 89 c7 48 c1 [58914.279691] RIP [] psi_update_stats+0x1c1/0x330
The crashing instruction is trying to divide the observed stall time by the sampling period. The period, stored in R8, is not 0, but we are dividing by the lower 32 bits only, which are all 0 in this instance.
We could switch to a 64-bit division, but the period shouldn't be that big in the first place. It's the time between the last update and the next scheduled one, and so should always be around 2s and comfortably fit into 32 bits.
The bug is in the initialization of new cgroups: we schedule the first sampling event in a cgroup as an offset of sched_clock(), but fail to initialize the last_update timestamp, and it defaults to 0. That results in a bogusly large sampling period the first time we run the sampling code, and consequently we underreport pressure for the first 2s of a cgroup's life. But worse, if sched_clock() is sufficiently advanced on the system, and the user gets unlucky, the period's lower 32 bits can all be 0 and the sampling division will crash.
Fix this by initializing the last update timestamp to the creation time of the cgroup, thus correctly marking the start of the first pressure sampling period in a new cgroup.
Reported-by: Jingfeng Xie xiejingfeng@linux.alibaba.com Signed-off-by: Johannes Weiner hannes@cmpxchg.org Signed-off-by: Peter Zijlstra (Intel) peterz@infradead.org Cc: Suren Baghdasaryan surenb@google.com Link: https://lkml.kernel.org/r/20191203183524.41378-2-hannes@cmpxchg.org Signed-off-by: Sasha Levin sashal@kernel.org Signed-off-by: Liu Xinpeng liuxp11@chinatelecom.cn Signed-off-by: Ctyun Kernel ctyuncommiter01@chinatelecom.cn --- kernel/sched/psi.c | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-)
diff --git a/kernel/sched/psi.c b/kernel/sched/psi.c index b1eb45c..35b8c79 100644 --- a/kernel/sched/psi.c +++ b/kernel/sched/psi.c @@ -173,7 +173,8 @@ static void group_init(struct psi_group *group)
for_each_possible_cpu(cpu) seqcount_init(&per_cpu_ptr(group->pcpu, cpu)->seq); - group->avg_next_update = sched_clock() + psi_period; + group->avg_last_update = sched_clock(); + group->avg_next_update = group->avg_last_update + psi_period; INIT_DELAYED_WORK(&group->avgs_work, psi_avgs_work); mutex_init(&group->avgs_lock); }
From: Yafang Shao laoar.shao@gmail.com
mainline inclusion commit a6f5576bb195c3b7508e3e1c98d2dcf6691f96e8 category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I47QS2 CVE: NA
--------------------------------
There's a new workingset counter introduced in commit 1899ad18c607 ("mm: workingset: tell cache transitions from workingset thrashing"). With the help of this counter we can know the workingset is transitioning or thrashing. To leverage the benifit of this counter to memcg, we should introduce it into memory.stat. Then we could know the workingset of the workload inside a memcg better.
Bellow is the verification of this new counter in memory.stat. Read a file into the memory and then read it again to make these pages be active. The size of this file is 1G. (memory.max is greater than file size) The counters in memory.stat will be
inactive_file 0 active_file 1073639424
workingset_refault 0 workingset_activate 0 workingset_restore 0 workingset_nodereclaim 0
Trigger the memcg reclaim by setting a lower value to memory.high, and then some pages will be demoted into inactive list, and then some pages in the inactive list will be evicted into the storage.
inactive_file 498094080 active_file 310063104
workingset_refault 0 workingset_activate 0 workingset_restore 0 workingset_nodereclaim 0
Then recover the memory.high and read the file into memory again. As a result of it, the transitioning will occur. Bellow is the result of this transitioning,
inactive_file 498094080 active_file 575397888
workingset_refault 64746 workingset_activate 64746 workingset_restore 64746 workingset_nodereclaim 0
Signed-off-by: Yafang Shao laoar.shao@gmail.com Signed-off-by: Andrew Morton akpm@linux-foundation.org Acked-by: Johannes Weiner hannes@cmpxchg.org Acked-by: Michal Hocko mhocko@suse.com Acked-by: Chris Down chris@chrisdown.name Cc: Peter Zijlstra (Intel) peterz@infradead.org Cc: Suren Baghdasaryan surenb@google.com Cc: Shakeel Butt shakeelb@google.com Link: http://lkml.kernel.org/r/20200504153522.11553-1-laoar.shao@gmail.com Signed-off-by: Linus Torvalds torvalds@linux-foundation.org Signed-off-by: Liu Xinpeng liuxp11@chinatelecom.cn Signed-off-by: Ctyun Kernel ctyuncommiter01@chinatelecom.cn --- Documentation/admin-guide/cgroup-v2.rst | 4 ++++ mm/memcontrol.c | 2 ++ 2 files changed, 6 insertions(+)
diff --git a/Documentation/admin-guide/cgroup-v2.rst b/Documentation/admin-guide/cgroup-v2.rst index c57b4fc..2f2a4e9 100644 --- a/Documentation/admin-guide/cgroup-v2.rst +++ b/Documentation/admin-guide/cgroup-v2.rst @@ -1214,6 +1214,10 @@ PAGE_SIZE multiple when read back.
Number of refaulted pages that were immediately activated
+ workingset_restore + Number of restored pages which have been detected as an active + workingset before they got reclaimed. + workingset_nodereclaim
Number of times a shadow node has been reclaimed diff --git a/mm/memcontrol.c b/mm/memcontrol.c index 0bccf2b..46596d2 100644 --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -6130,6 +6130,8 @@ static int memory_stat_show(struct seq_file *m, void *v) memcg_page_state(memcg, WORKINGSET_REFAULT)); seq_printf(m, "workingset_activate %lu\n", memcg_page_state(memcg, WORKINGSET_ACTIVATE)); + seq_printf(m, "workingset_restore %lu\n", + memcg_page_state(memcg, WORKINGSET_RESTORE)); seq_printf(m, "workingset_nodereclaim %lu\n", memcg_page_state(memcg, WORKINGSET_NODERECLAIM));
From: Miklos Szeredi mszeredi@redhat.com
mainline inclusion commit b89ecd60d38ec042d63bdb376c722a16f92bcb88 category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I47QS2 CVE: NA
--------------------------------
Fix the "fuse: trying to steal weird page" warning.
Description from Johannes Weiner:
"Think of it as similar to PG_active. It's just another usage/heat indicator of file and anon pages on the reclaim LRU that, unlike PG_active, persists across deactivation and even reclaim (we store it in the page cache / swapper cache tree until the page refaults).
So if fuse accepts pages that can legally have PG_active set, PG_workingset is fine too."
Reported-by: Thomas Lindroth thomas.lindroth@gmail.com Fixes: 1899ad18c607 ("mm: workingset: tell cache transitions from workingset thrashing") Cc: stable@vger.kernel.org # v4.20 Signed-off-by: Miklos Szeredi mszeredi@redhat.com Signed-off-by: Liu Xinpeng liuxp11@chinatelecom.cn Signed-off-by: Ctyun Kernel ctyuncommiter01@chinatelecom.cn --- fs/fuse/dev.c | 1 + 1 file changed, 1 insertion(+)
diff --git a/fs/fuse/dev.c b/fs/fuse/dev.c index 498a4fa..58bf254 100644 --- a/fs/fuse/dev.c +++ b/fs/fuse/dev.c @@ -837,6 +837,7 @@ static int fuse_check_page(struct page *page) 1 << PG_uptodate | 1 << PG_lru | 1 << PG_active | + 1 << PG_workingset | 1 << PG_reclaim | 1 << PG_waiters))) { printk(KERN_WARNING "fuse: trying to steal weird page\n");
Enable the psi feature in x86&arm64 openeuler config
openEuler inclusion category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I47QS2 CVE: NA
--------------------------------
Signed-off-by: Liu Xinpeng liuxp11@chinatelecom.cn Signed-off-by: Ctyun Kernel ctyuncommiter01@chinatelecom.cn --- arch/arm64/configs/openeuler_defconfig | 2 ++ arch/x86/configs/openeuler_defconfig | 2 ++ 2 files changed, 4 insertions(+)
diff --git a/arch/arm64/configs/openeuler_defconfig b/arch/arm64/configs/openeuler_defconfig index 1845c50..44825ba 100644 --- a/arch/arm64/configs/openeuler_defconfig +++ b/arch/arm64/configs/openeuler_defconfig @@ -89,6 +89,8 @@ CONFIG_TASKSTATS=y CONFIG_TASK_DELAY_ACCT=y CONFIG_TASK_XACCT=y CONFIG_TASK_IO_ACCOUNTING=y +CONFIG_PSI=y +CONFIG_PSI_DEFAULT_DISABLED=y CONFIG_CPU_ISOLATION=y
# diff --git a/arch/x86/configs/openeuler_defconfig b/arch/x86/configs/openeuler_defconfig index 5a56a81..888195f 100644 --- a/arch/x86/configs/openeuler_defconfig +++ b/arch/x86/configs/openeuler_defconfig @@ -94,6 +94,8 @@ CONFIG_TASKSTATS=y CONFIG_TASK_DELAY_ACCT=y CONFIG_TASK_XACCT=y CONFIG_TASK_IO_ACCOUNTING=y +CONFIG_PSI=y +CONFIG_PSI_DEFAULT_DISABLED=y CONFIG_CPU_ISOLATION=y
#
Kabi compatibility with old third modules
openEuler inclusion category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I47QS2 CVE: NA
--------------------------------
Signed-off-by: Liu Xinpeng liuxp11@chinatelecom.cn Signed-off-by: Ctyun Kernel ctyuncommiter01@chinatelecom.cn --- include/linux/cgroup-defs.h | 14 +++++++++++--- include/linux/cgroup.h | 4 +++- include/linux/kernfs.h | 8 +++++--- include/linux/kthread.h | 3 +++ include/linux/mmzone.h | 4 +++- include/linux/page-flags.h | 6 ++++-- include/linux/sched.h | 14 +++++++++----- include/linux/swap.h | 3 ++- kernel/cgroup/cgroup.c | 11 ++++++++--- kernel/kthread.c | 2 ++ kernel/sched/psi.c | 8 ++++++-- mm/filemap.c | 2 +- mm/vmstat.c | 2 +- mm/workingset.c | 39 +++++++++++++++++++++++++++++++++------ 14 files changed, 91 insertions(+), 29 deletions(-)
diff --git a/include/linux/cgroup-defs.h b/include/linux/cgroup-defs.h index c026665..c07126b 100644 --- a/include/linux/cgroup-defs.h +++ b/include/linux/cgroup-defs.h @@ -443,9 +443,6 @@ struct cgroup { /* used to schedule release agent */ struct work_struct release_agent_work;
- /* used to track pressure stalls */ - struct psi_group psi; - /* used to store eBPF programs */ struct cgroup_bpf bpf;
@@ -456,6 +453,17 @@ struct cgroup { int ancestor_ids[]; };
+struct cgroup_psi { + struct cgroup cgrp; + /* used to track pressure stalls */ + struct psi_group psi; +}; + +static inline struct cgroup_psi *get_cgroup_psi(struct cgroup *cgrp) +{ + return container_of(cgrp, struct cgroup_psi, cgrp); +} + /* * A cgroup_root represents the root of a cgroup hierarchy, and may be * associated with a kernfs_root to form an active hierarchy. This is diff --git a/include/linux/cgroup.h b/include/linux/cgroup.h index 73acf8b..1b68815 100644 --- a/include/linux/cgroup.h +++ b/include/linux/cgroup.h @@ -671,7 +671,9 @@ static inline void pr_cont_cgroup_path(struct cgroup *cgrp)
static inline struct psi_group *cgroup_psi(struct cgroup *cgrp) { - return &cgrp->psi; + struct cgroup_psi *cgroup = get_cgroup_psi(cgrp); + + return &cgroup->psi; }
static inline void cgroup_init_kthreadd(void) diff --git a/include/linux/kernfs.h b/include/linux/kernfs.h index ab81a22..4d3ddaa 100644 --- a/include/linux/kernfs.h +++ b/include/linux/kernfs.h @@ -269,16 +269,18 @@ struct kernfs_ops { ssize_t (*write)(struct kernfs_open_file *of, char *buf, size_t bytes, loff_t off);
- __poll_t (*poll)(struct kernfs_open_file *of, - struct poll_table_struct *pt); - int (*mmap)(struct kernfs_open_file *of, struct vm_area_struct *vma);
#ifdef CONFIG_DEBUG_LOCK_ALLOC struct lock_class_key lockdep_key; #endif
+#ifndef __GENKSYMS__ + __poll_t (*poll)(struct kernfs_open_file *of, + struct poll_table_struct *pt); +#else KABI_RESERVE(1) +#endif KABI_RESERVE(2) };
diff --git a/include/linux/kthread.h b/include/linux/kthread.h index 206f0c3..de1b645 100644 --- a/include/linux/kthread.h +++ b/include/linux/kthread.h @@ -4,6 +4,9 @@ /* Simple interface for creating and stopping kernel threads without mess. */ #include <linux/err.h> #include <linux/sched.h> +#ifdef __GENKSYMS__ +#include <linux/cgroup.h> +#endif
__printf(4, 5) struct task_struct *kthread_create_on_node(int (*threadfn)(void *data), diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h index b55dd22..b8ff084 100644 --- a/include/linux/mmzone.h +++ b/include/linux/mmzone.h @@ -164,7 +164,6 @@ enum node_stat_item { NR_ISOLATED_FILE, /* Temporary isolated pages from file lru */ WORKINGSET_REFAULT, WORKINGSET_ACTIVATE, - WORKINGSET_RESTORE, WORKINGSET_NODERECLAIM, NR_ANON_MAPPED, /* Mapped anonymous pages */ NR_FILE_MAPPED, /* pagecache pages mapped into pagetables. @@ -183,6 +182,9 @@ enum node_stat_item { NR_DIRTIED, /* page dirtyings since bootup */ NR_WRITTEN, /* page writings since bootup */ NR_INDIRECTLY_RECLAIMABLE_BYTES, /* measured in bytes */ +#ifndef __GENKSYMS__ + WORKINGSET_RESTORE, +#endif NR_VM_NODE_STAT_ITEMS };
diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h index 97c0918..dcafd8c 100644 --- a/include/linux/page-flags.h +++ b/include/linux/page-flags.h @@ -69,14 +69,13 @@ */ enum pageflags { PG_locked, /* Page is locked. Don't touch. */ + PG_error, PG_referenced, PG_uptodate, PG_dirty, PG_lru, PG_active, - PG_workingset, PG_waiters, /* Page has waiters, check its waitqueue. Must be bit #7 and in the same byte as "PG_locked" */ - PG_error, PG_slab, PG_owner_priv_1, /* Owner use. If pagecache, fs may use*/ PG_arch_1, @@ -103,6 +102,9 @@ enum pageflags { PG_idle, #endif PG_percpu_ref, +#ifndef __GENKSYMS__ + PG_workingset, +#endif __NR_PAGEFLAGS,
/* Filesystems */ diff --git a/include/linux/sched.h b/include/linux/sched.h index d137ff5..1c254d8 100644 --- a/include/linux/sched.h +++ b/include/linux/sched.h @@ -723,7 +723,7 @@ struct task_struct { unsigned sched_contributes_to_load:1; unsigned sched_migrated:1; unsigned sched_remote_wakeup:1; -#ifdef CONFIG_PSI +#if (defined(CONFIG_PSI) && !defined(__GENKSYMS__)) unsigned sched_psi_wake_requeue:1; #endif
@@ -980,10 +980,6 @@ struct task_struct { siginfo_t *last_siginfo;
struct task_io_accounting ioac; -#ifdef CONFIG_PSI - /* Pressure stall state */ - unsigned int psi_flags; -#endif #ifdef CONFIG_TASK_XACCT /* Accumulated RSS usage: */ u64 acct_rss_mem1; @@ -1255,7 +1251,15 @@ struct task_struct { #else KABI_RESERVE(5) #endif +#if (defined(CONFIG_PSI) && !defined(__GENKSYMS__)) + /* Pressure stall state */ + union { + unsigned int psi_flags; + unsigned long psi_flags_padding; + }; +#else KABI_RESERVE(6) +#endif KABI_RESERVE(7) KABI_RESERVE(8)
diff --git a/include/linux/swap.h b/include/linux/swap.h index 12a39b2..cab2987 100644 --- a/include/linux/swap.h +++ b/include/linux/swap.h @@ -322,7 +322,8 @@ struct vma_swap_readahead {
/* linux/mm/workingset.c */ void *workingset_eviction(struct address_space *mapping, struct page *page); -void workingset_refault(struct page *page, void *shadow); +bool workingset_refault(void *shadow); +void workingset_refault_fixkabi(struct page *page, void *shadow); void workingset_activation(struct page *page);
/* Do not use directly, use workingset_lookup_update */ diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c index 8066faa..934b33c 100644 --- a/kernel/cgroup/cgroup.c +++ b/kernel/cgroup/cgroup.c @@ -3434,21 +3434,21 @@ static int cpu_stat_show(struct seq_file *seq, void *v) static int cgroup_io_pressure_show(struct seq_file *seq, void *v) { struct cgroup *cgroup = seq_css(seq)->cgroup; - struct psi_group *psi = cgroup->id == 1 ? &psi_system : &cgroup->psi; + struct psi_group *psi = cgroup->id == 1 ? &psi_system : cgroup_psi(cgroup);
return psi_show(seq, psi, PSI_IO); } static int cgroup_memory_pressure_show(struct seq_file *seq, void *v) { struct cgroup *cgroup = seq_css(seq)->cgroup; - struct psi_group *psi = cgroup->id == 1 ? &psi_system : &cgroup->psi; + struct psi_group *psi = cgroup->id == 1 ? &psi_system : cgroup_psi(cgroup);
return psi_show(seq, psi, PSI_MEM); } static int cgroup_cpu_pressure_show(struct seq_file *seq, void *v) { struct cgroup *cgroup = seq_css(seq)->cgroup; - struct psi_group *psi = cgroup->id == 1 ? &psi_system : &cgroup->psi; + struct psi_group *psi = cgroup->id == 1 ? &psi_system : cgroup_psi(cgroup);
return psi_show(seq, psi, PSI_CPU); } @@ -4938,8 +4938,13 @@ static struct cgroup *cgroup_create(struct cgroup *parent) int ret;
/* allocate the cgroup and its ID, 0 is reserved for the root */ +#ifdef CONFIG_PSI + cgrp = kzalloc(struct_size(cgrp, ancestor_ids, (level + 1)) + sizeof + (struct cgroup_psi), GFP_KERNEL); +#else cgrp = kzalloc(struct_size(cgrp, ancestor_ids, (level + 1)), GFP_KERNEL); +#endif if (!cgrp) return ERR_PTR(-ENOMEM);
diff --git a/kernel/kthread.c b/kernel/kthread.c index da07863..79d068e 100644 --- a/kernel/kthread.c +++ b/kernel/kthread.c @@ -11,7 +11,9 @@ #include <linux/kthread.h> #include <linux/completion.h> #include <linux/err.h> +#ifndef __GENKSYMS__ #include <linux/cgroup.h> +#endif #include <linux/cpuset.h> #include <linux/unistd.h> #include <linux/file.h> diff --git a/kernel/sched/psi.c b/kernel/sched/psi.c index 35b8c79..ed9eeb9 100644 --- a/kernel/sched/psi.c +++ b/kernel/sched/psi.c @@ -650,11 +650,13 @@ void psi_memstall_leave(unsigned long *flags) }
#ifdef CONFIG_CGROUPS -int psi_cgroup_alloc(struct cgroup *cgroup) +int psi_cgroup_alloc(struct cgroup *cgrp) { + struct cgroup_psi *cgroup; if (static_branch_likely(&psi_disabled)) return 0;
+ cgroup = get_cgroup_psi(cgrp); cgroup->psi.pcpu = alloc_percpu(struct psi_group_cpu); if (!cgroup->psi.pcpu) return -ENOMEM; @@ -662,11 +664,13 @@ int psi_cgroup_alloc(struct cgroup *cgroup) return 0; }
-void psi_cgroup_free(struct cgroup *cgroup) +void psi_cgroup_free(struct cgroup *cgrp) { + struct cgroup_psi *cgroup; if (static_branch_likely(&psi_disabled)) return;
+ cgroup = get_cgroup_psi(cgrp); cancel_delayed_work_sync(&cgroup->psi.avgs_work); free_percpu(cgroup->psi.pcpu); } diff --git a/mm/filemap.c b/mm/filemap.c index 66460c9..3df50ba 100644 --- a/mm/filemap.c +++ b/mm/filemap.c @@ -1026,7 +1026,7 @@ int add_to_page_cache_lru(struct page *page, struct address_space *mapping, */ WARN_ON_ONCE(PageActive(page)); if (!(gfp_mask & __GFP_WRITE) && shadow) - workingset_refault(page, shadow); + workingset_refault_fixkabi(page, shadow); if (!PagePercpuRef(page)) lru_cache_add(page); } diff --git a/mm/vmstat.c b/mm/vmstat.c index e183cb2..b249a9f 100644 --- a/mm/vmstat.c +++ b/mm/vmstat.c @@ -1145,7 +1145,6 @@ int fragmentation_index(struct zone *zone, unsigned int order) "nr_isolated_file", "workingset_refault", "workingset_activate", - "workingset_restore", "workingset_nodereclaim", "nr_anon_pages", "nr_mapped", @@ -1163,6 +1162,7 @@ int fragmentation_index(struct zone *zone, unsigned int order) "nr_dirtied", "nr_written", "", /* nr_indirectly_reclaimable */ + "workingset_restore",
/* enum writeback_stat_item counters */ "nr_dirty_threshold", diff --git a/mm/workingset.c b/mm/workingset.c index 039eea7..a612c3a 100644 --- a/mm/workingset.c +++ b/mm/workingset.c @@ -242,13 +242,14 @@ void *workingset_eviction(struct address_space *mapping, struct page *page)
/** * workingset_refault - evaluate the refault of a previously evicted page - * @page: the freshly allocated replacement page * @shadow: shadow entry of the evicted page * * Calculates and evaluates the refault distance of the previously * evicted page in the context of the node it was allocated in. + * + * Returns %true if the page should be activated, %false otherwise. */ -void workingset_refault(struct page *page, void *shadow) +bool workingset_refault(void *shadow) { unsigned long refault_distance; struct pglist_data *pgdat; @@ -258,6 +259,7 @@ void workingset_refault(struct page *page, void *shadow) struct lruvec *lruvec; unsigned long refault; bool workingset; + bool ret = false; int memcgid;
unpack_shadow(shadow, &memcgid, &pgdat, &eviction, &workingset); @@ -313,17 +315,42 @@ void workingset_refault(struct page *page, void *shadow) if (refault_distance > active_file) goto out;
- SetPageActive(page); + ret = true; atomic_long_inc(&lruvec->inactive_age); inc_lruvec_state(lruvec, WORKINGSET_ACTIVATE);
/* Page was active prior to eviction */ - if (workingset) { - SetPageWorkingset(page); + if (workingset) inc_lruvec_state(lruvec, WORKINGSET_RESTORE); - } + out: rcu_read_unlock(); + + return ret; +} + +/** + * workingset_refault_fixkabi - evaluate the refault of a previously + * evicted page + * @page: the freshly allocated replacement page + * @shadow: shadow entry of the evicted page + * + * Calculates and evaluates the refault distance of the previously + * evicted page in the context of the node it was allocated in. + */ +void workingset_refault_fixkabi(struct page *page, void *shadow) +{ + unsigned long entry = (unsigned long)shadow; + bool workingset; + + entry >>= RADIX_TREE_EXCEPTIONAL_SHIFT; + workingset = entry & 1; + + if (workingset_refault(shadow)) { + SetPageActive(page); + if (workingset) + SetPageWorkingset(page); + } }
/**