From: Huaixin Chang changhuaixin@linux.alibaba.com
anolis inclusion
from anolis_master
commit 9d168f216486333f24aa1b33706eddf3b13d7228
category: performance
bugzilla: NA
CVE: NA

---------------------------
Kernel limitation on cpu.cfs_quota_us is insufficient. Some large numbers might cause overflow in to_ratio() calculation and produce unexpected results.
For example, if we create two cpu cgroups and then write a reasonable value into the child's cpu.cfs_quota_us and a large value into the parent's, the write produces an error:
	cd /sys/fs/cgroup/cpu
	mkdir parent; mkdir parent/child
	echo 8000 > parent/child/cpu.cfs_quota_us
	# 17592186044416 is (1UL << 44)
	echo 17592186044416 > parent/cpu.cfs_quota_us
In this case, quota will overflow and thus fail the __cfs_schedulable check. Similar overflow also affects rt bandwidth.
Burstable CFS bandwidth controller will also benefit from limiting quota.
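To make the failure mode concrete, here is a minimal userspace sketch of the bandwidth-shift arithmetic (an illustration only, not the kernel code; it assumes, as I read __cfs_schedulable(), that quota and period are divided down to microseconds before the shift, and it reuses the numbers from the example above):

```c
#include <stdint.h>
#include <stdio.h>

#define BW_SHIFT    20
#define MAX_BW_USEC ((1ULL << (64 - BW_SHIFT)) - 1)	/* 2^44 - 1 usec */

/* Same shape as the kernel's to_ratio(); period/runtime here are in usec. */
static uint64_t to_ratio(uint64_t period, uint64_t runtime)
{
	if (period == 0)
		return 0;
	return (runtime << BW_SHIFT) / period;
}

int main(void)
{
	uint64_t period_us = 100000;			/* default 100ms period */
	uint64_t parent_us = 17592186044416ULL;		/* 1ULL << 44, as above */
	uint64_t child_us  = 8000;

	/* 2^44 << 20 == 2^64 wraps to 0, so the parent's ratio collapses ... */
	printf("parent ratio: %llu\n",
	       (unsigned long long)to_ratio(period_us, parent_us));
	/* ... and looks smaller than the child's, failing the schedulability check. */
	printf("child  ratio: %llu\n",
	       (unsigned long long)to_ratio(period_us, child_us));

	printf("largest accepted quota: %llu usec (~203 days)\n",
	       (unsigned long long)MAX_BW_USEC);
	return 0;
}
```

With the new check, any cpu.cfs_quota_us value of 1UL << 44 microseconds or more (about 203 days) is rejected with -EINVAL up front, so the shifted value always fits in 64 bits.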
Change-Id: I0f89d1f26b168c5cfa041e886395c7f3068114ae
Reviewed-by: Shanpei Chen <shanpeic@linux.alibaba.com>
Signed-off-by: Huaixin Chang <changhuaixin@linux.alibaba.com>
Signed-off-by: Zhengyuan Liu <liuzhengyuan@kylinos.cn>
---
 kernel/sched/core.c  | 8 ++++++++
 kernel/sched/rt.c    | 9 +++++++++
 kernel/sched/sched.h | 2 ++
 3 files changed, 19 insertions(+)
diff --git a/kernel/sched/core.c b/kernel/sched/core.c index 36d7422da0ac..51fdd30f188a 100644 --- a/kernel/sched/core.c +++ b/kernel/sched/core.c @@ -6679,6 +6679,8 @@ static DEFINE_MUTEX(cfs_constraints_mutex);
const u64 max_cfs_quota_period = 1 * NSEC_PER_SEC; /* 1s */ const u64 min_cfs_quota_period = 1 * NSEC_PER_MSEC; /* 1ms */ +/* More than 203 days if BW_SHIFT equals 20. */ +const u64 max_cfs_runtime = MAX_BW_USEC * NSEC_PER_USEC;
static int __cfs_schedulable(struct task_group *tg, u64 period, u64 runtime);
@@ -6706,6 +6708,12 @@ static int tg_set_cfs_bandwidth(struct task_group *tg, u64 period, u64 quota) if (period > max_cfs_quota_period) return -EINVAL;
+ /* + * Bound quota to defend quota against overflow during bandwidth shift. + */ + if (quota != RUNTIME_INF && quota > max_cfs_runtime) + return -EINVAL; + /* * Prevent race between setting of cfs_rq->runtime_enabled and * unthrottle_offline_cfs_rqs(). diff --git a/kernel/sched/rt.c b/kernel/sched/rt.c index 301ba04d9130..f31e0aaf1f43 100644 --- a/kernel/sched/rt.c +++ b/kernel/sched/rt.c @@ -2518,6 +2518,9 @@ static int __rt_schedulable(struct task_group *tg, u64 period, u64 runtime) return ret; }
+/* More than 203 days if BW_SHIFT equals 20. */ +static const u64 max_rt_runtime = MAX_BW_USEC * NSEC_PER_USEC; + static int tg_set_rt_bandwidth(struct task_group *tg, u64 rt_period, u64 rt_runtime) { @@ -2534,6 +2537,12 @@ static int tg_set_rt_bandwidth(struct task_group *tg, if (rt_period == 0) return -EINVAL;
+ /* + * Bound quota to defend quota against overflow during bandwidth shift. + */ + if (rt_runtime != RUNTIME_INF && rt_runtime > max_rt_runtime) + return -EINVAL; + mutex_lock(&rt_constraints_mutex); read_lock(&tasklist_lock); err = __rt_schedulable(tg, rt_period, rt_runtime); diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h index ae3068153093..f3808a49ce48 100644 --- a/kernel/sched/sched.h +++ b/kernel/sched/sched.h @@ -1732,6 +1732,8 @@ extern void init_dl_rq_bw_ratio(struct dl_rq *dl_rq); #define BW_SHIFT 20 #define BW_UNIT (1 << BW_SHIFT) #define RATIO_SHIFT 8 +#define MAX_BW_BITS (64 - BW_SHIFT) +#define MAX_BW_USEC ((1UL << MAX_BW_BITS) - 1) unsigned long to_ratio(u64 period, u64 runtime);
extern void init_entity_runnable_average(struct sched_entity *se);
From: Huaixin Chang changhuaixin@linux.alibaba.com
mainline inclusion
from mainline-v5.13-rc6
commit f4183717b370ad28dd0c0d74760142b20e6e7931
category: feature
bugzilla: https://gitee.com/openeuler/kernel/issues/I5CPWE
CVE: NA
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i...
--------------------------------
The CFS bandwidth controller limits CPU requests of a task group to quota during each period. However, parallel workloads might be bursty so that they get throttled even when their average utilization is under quota. And they are latency sensitive at the same time so that throttling them is undesired.
We borrow time now against our future underrun, at the cost of increased interference against the other system users. All nicely bounded.
Traditional (UP-EDF) bandwidth control is something like:
(U = \Sum u_i) <= 1
This guarantees both that every deadline is met and that the system is stable. After all, if U were > 1, then for every second of walltime, we'd have to run more than a second of program time, and obviously miss our deadline, but the next deadline will be further out still, there is never time to catch up, unbounded fail.
This work observes that a workload doesn't always execute the full quota; this enables one to describe u_i as a statistical distribution.
For example, have u_i = {x,e}_i, where x is the p(95) and x+e is the p(100) (the traditional WCET). This effectively allows u to be smaller, increasing the efficiency (we can pack more tasks in the system), but at the cost of missing deadlines when all the odds line up. However, it does maintain stability, since every overrun must be paired with an underrun as long as our x is above the average.
That is, suppose we have 2 tasks, both specify a p(95) value, then we have a p(95)*p(95) = 90.25% chance both tasks are within their quota and everything is good. At the same time we have a p(5)p(5) = 0.25% chance both tasks will exceed their quota at the same time (guaranteed deadline fail). Somewhere in between there's a threshold where one exceeds and the other doesn't underrun enough to compensate; this depends on the specific CDFs.
At the same time, we can say that the worst case deadline miss will be \Sum e_i; that is, there is a bounded tardiness (under the assumption that x+e is indeed WCET).
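To put rough numbers on that (my own back-of-the-envelope extension of the two-task arithmetic above, not from the original changelog): with n groups each provisioned at their p(95), the chance that every group stays within x in a given period is 0.95^n — about 90.25% for n = 2 but only about 59.9% for n = 10 — while the chance that all of them overrun simultaneously drops to 0.05^n. Either way, the worst-case miss stays bounded by \Sum e_i.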
The benefit of burst is seen when testing with schbench. The default values of kernel.sched_cfs_bandwidth_slice_us (5ms) and CONFIG_HZ (1000) are used.
	mkdir /sys/fs/cgroup/cpu/test
	echo $$ > /sys/fs/cgroup/cpu/test/cgroup.procs
	echo 100000 > /sys/fs/cgroup/cpu/test/cpu.cfs_quota_us
	echo 100000 > /sys/fs/cgroup/cpu/test/cpu.cfs_burst_us
./schbench -m 1 -t 3 -r 20 -c 80000 -R 10
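(Reading the schbench parameters above — and this is my interpretation of the flags, not part of the original changelog — -c 80000 asks for 80 ms of CPU per request and -R 10 issues 10 requests per second, i.e. roughly 800 ms of CPU per second against a 100 ms quota per 100 ms period: about 80% average utilization, but with individual requests that each want most of a period's quota at once.)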
The average CPU usage is at 80%. I ran this 10 times; long tail latency showed up 6 times and the group got throttled 8 times.
Tail latencies are shown below, and this was not the worst case observed.
	Latency percentiles (usec)
		50.0000th: 19872
		75.0000th: 21344
		90.0000th: 22176
		95.0000th: 22496
		*99.0000th: 22752
		99.5000th: 22752
		99.9000th: 22752
		min=0, max=22727
	rps: 9.90 p95 (usec) 22496 p99 (usec) 22752 p95/cputime 28.12% p99/cputime 28.44%
The interference when using burst is evaluated by the possibility of missing the deadline and the average WCET. Test results showed that when there are many cgroups or the CPU is under-utilized, the interference is limited. More details are shown in: https://lore.kernel.org/lkml/5371BD36-55AE-4F71-B9D7-B86DC32E3D2B@linux.alib...
Co-developed-by: Shanpei Chen <shanpeic@linux.alibaba.com>
Signed-off-by: Shanpei Chen <shanpeic@linux.alibaba.com>
Co-developed-by: Tianchen Ding <dtcccc@linux.alibaba.com>
Signed-off-by: Tianchen Ding <dtcccc@linux.alibaba.com>
Signed-off-by: Huaixin Chang <changhuaixin@linux.alibaba.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Ben Segall <bsegall@google.com>
Acked-by: Tejun Heo <tj@kernel.org>
Link: https://lore.kernel.org/r/20210621092800.23714-2-changhuaixin@linux.alibaba....
Signed-off-by: Hui Tang <tanghui20@huawei.com>
Reviewed-by: Chen Hui <judy.chenhui@huawei.com>
Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>
Signed-off-by: Zhengyuan Liu <liuzhengyuan@kylinos.cn>
Change-Id: I83e00d983cdb130b685239469308a0706ab69f25
---
 kernel/sched/core.c  | 68 ++++++++++++++++++++++++++++++++++++++++----
 kernel/sched/fair.c  | 14 ++++++---
 kernel/sched/sched.h |  1 +
 3 files changed, 73 insertions(+), 10 deletions(-)
diff --git a/kernel/sched/core.c b/kernel/sched/core.c index 51fdd30f188a..145aaaffbc2f 100644 --- a/kernel/sched/core.c +++ b/kernel/sched/core.c @@ -6684,7 +6684,8 @@ const u64 max_cfs_runtime = MAX_BW_USEC * NSEC_PER_USEC;
static int __cfs_schedulable(struct task_group *tg, u64 period, u64 runtime);
-static int tg_set_cfs_bandwidth(struct task_group *tg, u64 period, u64 quota) +static int tg_set_cfs_bandwidth(struct task_group *tg, u64 period, u64 quota, + u64 burst) { int i, ret = 0, runtime_enabled, runtime_was_enabled; struct cfs_bandwidth *cfs_b = &tg->cfs_bandwidth; @@ -6708,6 +6709,10 @@ static int tg_set_cfs_bandwidth(struct task_group *tg, u64 period, u64 quota) if (period > max_cfs_quota_period) return -EINVAL;
+ if (quota != RUNTIME_INF && (burst > quota || + burst + quota > max_cfs_runtime)) + return -EINVAL; + /* * Bound quota to defend quota against overflow during bandwidth shift. */ @@ -6735,6 +6740,7 @@ static int tg_set_cfs_bandwidth(struct task_group *tg, u64 period, u64 quota) raw_spin_lock_irq(&cfs_b->lock); cfs_b->period = ns_to_ktime(period); cfs_b->quota = quota; + cfs_b->burst = burst;
__refill_cfs_bandwidth_runtime(cfs_b);
@@ -6768,9 +6774,10 @@ static int tg_set_cfs_bandwidth(struct task_group *tg, u64 period, u64 quota)
int tg_set_cfs_quota(struct task_group *tg, long cfs_quota_us) { - u64 quota, period; + u64 quota, period, burst;
period = ktime_to_ns(tg->cfs_bandwidth.period); + burst = tg->cfs_bandwidth.burst; if (cfs_quota_us < 0) quota = RUNTIME_INF; else if ((u64)cfs_quota_us <= U64_MAX / NSEC_PER_USEC) @@ -6778,7 +6785,7 @@ int tg_set_cfs_quota(struct task_group *tg, long cfs_quota_us) else return -EINVAL;
- return tg_set_cfs_bandwidth(tg, period, quota); + return tg_set_cfs_bandwidth(tg, period, quota, burst); }
long tg_get_cfs_quota(struct task_group *tg) @@ -6796,15 +6803,16 @@ long tg_get_cfs_quota(struct task_group *tg)
int tg_set_cfs_period(struct task_group *tg, long cfs_period_us) { - u64 quota, period; + u64 quota, period, burst;
if ((u64)cfs_period_us > U64_MAX / NSEC_PER_USEC) return -EINVAL;
period = (u64)cfs_period_us * NSEC_PER_USEC; quota = tg->cfs_bandwidth.quota; + burst = tg->cfs_bandwidth.burst;
- return tg_set_cfs_bandwidth(tg, period, quota); + return tg_set_cfs_bandwidth(tg, period, quota, burst); }
long tg_get_cfs_period(struct task_group *tg) @@ -6817,6 +6825,30 @@ long tg_get_cfs_period(struct task_group *tg) return cfs_period_us; }
+static int tg_set_cfs_burst(struct task_group *tg, long cfs_burst_us) +{ + u64 quota, period, burst; + + if ((u64)cfs_burst_us > U64_MAX / NSEC_PER_USEC) + return -EINVAL; + + burst = (u64)cfs_burst_us * NSEC_PER_USEC; + period = ktime_to_ns(tg->cfs_bandwidth.period); + quota = tg->cfs_bandwidth.quota; + + return tg_set_cfs_bandwidth(tg, period, quota, burst); +} + +static long tg_get_cfs_burst(struct task_group *tg) +{ + u64 burst_us; + + burst_us = tg->cfs_bandwidth.burst; + do_div(burst_us, NSEC_PER_USEC); + + return burst_us; +} + static s64 cpu_cfs_quota_read_s64(struct cgroup_subsys_state *css, struct cftype *cft) { @@ -6841,6 +6873,18 @@ static int cpu_cfs_period_write_u64(struct cgroup_subsys_state *css, return tg_set_cfs_period(css_tg(css), cfs_period_us); }
+static u64 cpu_cfs_burst_read_u64(struct cgroup_subsys_state *css, + struct cftype *cft) +{ + return tg_get_cfs_burst(css_tg(css)); +} + +static int cpu_cfs_burst_write_u64(struct cgroup_subsys_state *css, + struct cftype *cftype, u64 cfs_burst_us) +{ + return tg_set_cfs_burst(css_tg(css), cfs_burst_us); +} + struct cfs_schedulable_data { struct task_group *tg; u64 period, quota; @@ -7055,6 +7099,11 @@ static struct cftype cpu_legacy_files[] = { .read_u64 = cpu_cfs_period_read_u64, .write_u64 = cpu_cfs_period_write_u64, }, + { + .name = "cfs_burst_us", + .read_u64 = cpu_cfs_burst_read_u64, + .write_u64 = cpu_cfs_burst_write_u64, + }, { .name = "stat", .seq_show = cpu_cfs_stat_show, @@ -7213,12 +7262,13 @@ static ssize_t cpu_max_write(struct kernfs_open_file *of, { struct task_group *tg = css_tg(of_css(of)); u64 period = tg_get_cfs_period(tg); + u64 burst = tg_get_cfs_burst(tg); u64 quota; int ret;
ret = cpu_period_quota_parse(buf, &period, "a); if (!ret) - ret = tg_set_cfs_bandwidth(tg, period, quota); + ret = tg_set_cfs_bandwidth(tg, period, quota, burst); return ret ?: nbytes; } #endif @@ -7245,6 +7295,12 @@ static struct cftype cpu_files[] = { .seq_show = cpu_max_show, .write = cpu_max_write, }, + { + .name = "max.burst", + .flags = CFTYPE_NOT_ON_ROOT, + .read_u64 = cpu_cfs_burst_read_u64, + .write_u64 = cpu_cfs_burst_write_u64, + }, #endif { } /* terminate */ }; diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index 7d553a4c5120..2cd22d984388 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -4434,8 +4434,11 @@ static inline u64 sched_cfs_bandwidth_slice(void) */ void __refill_cfs_bandwidth_runtime(struct cfs_bandwidth *cfs_b) { - if (cfs_b->quota != RUNTIME_INF) - cfs_b->runtime = cfs_b->quota; + if (unlikely(cfs_b->quota == RUNTIME_INF)) + return; + + cfs_b->runtime += cfs_b->quota; + cfs_b->runtime = min(cfs_b->runtime, cfs_b->quota + cfs_b->burst); }
static inline struct cfs_bandwidth *tg_cfs_bandwidth(struct task_group *tg) @@ -4760,6 +4763,9 @@ static int do_sched_cfs_period_timer(struct cfs_bandwidth *cfs_b, int overrun, u throttled = !list_empty(&cfs_b->throttled_cfs_rq); cfs_b->nr_periods += overrun;
+ /* Refill extra burst quota even if cfs_b->idle */ + __refill_cfs_bandwidth_runtime(cfs_b); + /* * idle depends on !throttled (for the case of a large deficit), and if * we're going inactive then everything else can be deferred @@ -4767,8 +4773,6 @@ static int do_sched_cfs_period_timer(struct cfs_bandwidth *cfs_b, int overrun, u if (cfs_b->idle && !throttled) goto out_deactivate;
- __refill_cfs_bandwidth_runtime(cfs_b); - if (!throttled) { /* mark as potentially idle for the upcoming period */ cfs_b->idle = 1; @@ -5032,6 +5036,7 @@ static enum hrtimer_restart sched_cfs_period_timer(struct hrtimer *timer) if (new < max_cfs_quota_period) { cfs_b->period = ns_to_ktime(new); cfs_b->quota *= 2; + cfs_b->burst *= 2;
pr_warn_ratelimited( "cfs_period_timer[cpu%d]: period too short, scaling up (new cfs_period_us = %lld, cfs_quota_us = %lld)\n", @@ -5065,6 +5070,7 @@ void init_cfs_bandwidth(struct cfs_bandwidth *cfs_b) cfs_b->runtime = 0; cfs_b->quota = RUNTIME_INF; cfs_b->period = ns_to_ktime(default_cfs_period()); + cfs_b->burst = 0;
INIT_LIST_HEAD(&cfs_b->throttled_cfs_rq); hrtimer_init(&cfs_b->period_timer, CLOCK_MONOTONIC, HRTIMER_MODE_ABS_PINNED); diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h index f3808a49ce48..15f75d1cf5b3 100644 --- a/kernel/sched/sched.h +++ b/kernel/sched/sched.h @@ -356,6 +356,7 @@ struct cfs_bandwidth { int nr_periods; int nr_throttled; u64 throttled_time; + u64 burst;
bool distribute_running; #endif
From: Huaixin Chang changhuaixin@linux.alibaba.com
mainline inclusion
from mainline-v5.15-rc4
commit bcb1704a1ed2de580a46f28922e223a65f16e0f5
category: feature
bugzilla: https://gitee.com/openeuler/kernel/issues/I5CPWE
CVE: NA
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i...
--------------------------------
Two new statistics are introduced to show the internals of the burst feature and to explain why burst helps or not.
nr_bursts:  number of periods in which a bandwidth burst occurs
burst_time: cumulative wall-time (in nanoseconds) that any CPUs have used
            above quota in respective periods
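To illustrate with hypothetical numbers: with quota = 10ms and burst = 30ms, suppose the pool held 25 ms of runtime after the previous refill and the group then consumed 18 ms during the period. It ran 8 ms above its per-period quota, so nr_bursts is incremented and burst_time grows by 8 ms (the refill path derives this by comparing the pool against its snapshot from the previous period).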
Co-developed-by: Shanpei Chen <shanpeic@linux.alibaba.com>
Signed-off-by: Shanpei Chen <shanpeic@linux.alibaba.com>
Co-developed-by: Tianchen Ding <dtcccc@linux.alibaba.com>
Signed-off-by: Tianchen Ding <dtcccc@linux.alibaba.com>
Signed-off-by: Huaixin Chang <changhuaixin@linux.alibaba.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Daniel Jordan <daniel.m.jordan@oracle.com>
Acked-by: Tejun Heo <tj@kernel.org>
Link: https://lore.kernel.org/r/20210830032215.16302-2-changhuaixin@linux.alibaba....
Signed-off-by: Hui Tang <tanghui20@huawei.com>
Reviewed-by: Chen Hui <judy.chenhui@huawei.com>
Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>
Signed-off-by: Zhengyuan Liu <liuzhengyuan@kylinos.cn>
---
 kernel/sched/core.c  | 13 ++++++++++---
 kernel/sched/fair.c  |  9 +++++++++
 kernel/sched/sched.h |  3 +++
 3 files changed, 22 insertions(+), 3 deletions(-)
diff --git a/kernel/sched/core.c b/kernel/sched/core.c index 145aaaffbc2f..26279d129d06 100644 --- a/kernel/sched/core.c +++ b/kernel/sched/core.c @@ -6987,6 +6987,9 @@ static int cpu_cfs_stat_show(struct seq_file *sf, void *v) seq_printf(sf, "wait_sum %llu\n", ws); }
+ seq_printf(sf, "nr_bursts %d\n", cfs_b->nr_burst); + seq_printf(sf, "burst_time %llu\n", cfs_b->burst_time); + return 0; } #endif /* CONFIG_CFS_BANDWIDTH */ @@ -7138,16 +7141,20 @@ static int cpu_extra_stat_show(struct seq_file *sf, { struct task_group *tg = css_tg(css); struct cfs_bandwidth *cfs_b = &tg->cfs_bandwidth; - u64 throttled_usec; + u64 throttled_usec, burst_usec;
throttled_usec = cfs_b->throttled_time; do_div(throttled_usec, NSEC_PER_USEC); + burst_usec = cfs_b->burst_time; + do_div(burst_usec, NSEC_PER_USEC);
seq_printf(sf, "nr_periods %d\n" "nr_throttled %d\n" - "throttled_usec %llu\n", + "throttled_usec %llu\n" + "nr_bursts %d\n" + "burst_usec %llu\n", cfs_b->nr_periods, cfs_b->nr_throttled, - throttled_usec); + throttled_usec, cfs_b->nr_burst, burst_usec); } #endif return 0; diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index 2cd22d984388..1efd36d1c419 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -4434,11 +4434,20 @@ static inline u64 sched_cfs_bandwidth_slice(void) */ void __refill_cfs_bandwidth_runtime(struct cfs_bandwidth *cfs_b) { + s64 runtime; + if (unlikely(cfs_b->quota == RUNTIME_INF)) return;
cfs_b->runtime += cfs_b->quota; + runtime = cfs_b->runtime_snap - cfs_b->runtime; + if (runtime > 0) { + cfs_b->burst_time += runtime; + cfs_b->nr_burst++; + } + cfs_b->runtime = min(cfs_b->runtime, cfs_b->quota + cfs_b->burst); + cfs_b->runtime_snap = cfs_b->runtime; }
static inline struct cfs_bandwidth *tg_cfs_bandwidth(struct task_group *tg) diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h index 15f75d1cf5b3..8685da35bdf7 100644 --- a/kernel/sched/sched.h +++ b/kernel/sched/sched.h @@ -357,6 +357,9 @@ struct cfs_bandwidth { int nr_throttled; u64 throttled_time; u64 burst; + u64 runtime_snap; + int nr_burst; + u64 burst_time;
bool distribute_running; #endif
From: Huaixin Chang changhuaixin@linux.alibaba.com
anolis inclusion
from anolis_master
commit a0b0376bdbdccc48c1c279179cac8826a687ab3a
category: performance
bugzilla: NA
CVE: NA

---------------------------
Basic description of usage and effect for CFS Bandwidth Control Burst.
Change-Id: I14db5096945ab39e3a6a6a7b6144e7830e74e91a
Reviewed-by: Shanpei Chen <shanpeic@linux.alibaba.com>
Signed-off-by: Huaixin Chang <changhuaixin@linux.alibaba.com>
Signed-off-by: Zhengyuan Liu <liuzhengyuan@kylinos.cn>
---
 Documentation/scheduler/sched-bwc.txt | 70 +++++++++++++++++++++------
 1 file changed, 55 insertions(+), 15 deletions(-)
diff --git a/Documentation/scheduler/sched-bwc.txt b/Documentation/scheduler/sched-bwc.txt index de583fbbfe42..9878ffe28de8 100644 --- a/Documentation/scheduler/sched-bwc.txt +++ b/Documentation/scheduler/sched-bwc.txt @@ -7,12 +7,14 @@ CFS Bandwidth Control CFS bandwidth control is a CONFIG_FAIR_GROUP_SCHED extension which allows the specification of the maximum CPU bandwidth available to a group or hierarchy.
-The bandwidth allowed for a group is specified using a quota and period. Within
-each given "period" (microseconds), a group is allowed to consume only up to
-"quota" microseconds of CPU time. When the CPU bandwidth consumption of a
-group exceeds this limit (for that period), the tasks belonging to its
-hierarchy will be throttled and are not allowed to run again until the next
-period.
+The bandwidth allowed for a group is specified using a quota, period and burst.
+Within each given "period" (microseconds), a group is filled with "quota"
+microseconds of CPU time. If the group has consumed less than that in a period,
+unused "quota" will be accumulated and allowed to be used in the following
+periods. A cap "burst" should be set by the user via cpu.cfs_burst_us. The
+accumulated CPU time won't exceed this time. When the CPU bandwidth consumption
+of a group exceeds its limit, the tasks belonging to its hierarchy will be
+throttled and are not allowed to run again until the next period.
A group's unused runtime is globally tracked, being refreshed with quota units above at each period boundary. As threads consume this bandwidth it is @@ -21,26 +23,33 @@ within each of these updates is tunable and described as the "slice".
Management ---------- -Quota and period are managed within the cpu subsystem via cgroupfs. +Quota, period and burst are managed within the cpu subsystem via cgroupfs.
cpu.cfs_quota_us: the total available run-time within a period (in microseconds) cpu.cfs_period_us: the length of a period (in microseconds) +cpu.cfs_burst_us: the maximum accumulated run-time cpu.stat: exports throttling statistics [explained further below]
The default values are: cpu.cfs_period_us=100ms - cpu.cfs_quota=-1 + cpu.cfs_quota_us=-1 + cpu.cfs_burst_us=-1
A value of -1 for cpu.cfs_quota_us indicates that the group does not have any bandwidth restriction in place, such a group is described as an unconstrained -bandwidth group. This represents the traditional work-conserving behavior for +bandwidth group. This represents the traditional work-conserving behavior for CFS.
-Writing any (valid) positive value(s) will enact the specified bandwidth limit. -The minimum quota allowed for the quota or period is 1ms. There is also an -upper bound on the period length of 1s. Additional restrictions exist when -bandwidth limits are used in a hierarchical fashion, these are explained in -more detail below. +Writing any (valid) positive value(s) into cpu.cfs_quota_us will enact the +specified bandwidth limit. The minimum quota allowed for the quota or period +is 1ms. There is also an upper bound on the period length of 1s. Additional +restrictions exist when bandwidth limits are used in a hierarchical fashion, +these are explained in more detail below. + +A value of 0 for cpu.cfs_burst_us indicates that the group can not accumulate +any unused bandwidth. This represents the traditional bandwidth control +behavior for CFS. Writing any (valid) positive value(s) into cpu.cfs_burst_us +will enact the cap on unused bandwidth accumulation.
Writing any negative value to cpu.cfs_quota_us will remove the bandwidth limit and return the group to an unconstrained state once more. @@ -61,15 +70,35 @@ This is tunable via procfs: Larger slice values will reduce transfer overheads, while smaller values allow for more fine-grained consumption.
+
+There is also a global switch to turn off burst for all groups:
+	/proc/sys/kernel/sched_cfs_bw_burst_enabled (default=1)
+
+By default it is enabled. Writing 0 means that no accumulated CPU time can be
+used for any group, even if cpu.cfs_burst_us is configured.
+
+
+Sometimes users might want a group to burst without accumulation. This is
+tunable via:
+	/proc/sys/kernel/sched_cfs_bw_burst_onset_percent (default=0)
+
+Up to 100% runtime of cpu.cfs_burst_us might be given on setting bandwidth.
+
 Statistics
 ----------
-A group's bandwidth statistics are exported via 3 fields in cpu.stat.
+A group's bandwidth statistics are exported via 7 fields in cpu.stat.
 cpu.stat:
 - nr_periods: Number of enforcement intervals that have elapsed.
 - nr_throttled: Number of times the group has been throttled/limited.
 - throttled_time: The total time duration (in nanoseconds) for which entities
   of the group have been throttled.
+- wait_sum: The total time duration (in nanoseconds) for which entities
+  of the group have been waiting.
+- current_bw: Current runtime in global pool.
+- nr_burst: Number of periods burst occurs.
+- burst_time: Cumulative wall-time that any CPUs have used above quota in
+  respective periods.
This interface is read-only.
@@ -165,3 +194,14 @@ Examples By using a small period here we are ensuring a consistent latency response at the expense of burst capacity.
+4. Limit a group to 20% of 1 CPU, and allow accumulating up to 60% of 1 CPU
+   additionally, in case accumulation has been done.
+
+	With 50ms period, 10ms quota will be equivalent to 20% of 1 CPU.
+	And 30ms burst will be equivalent to 60% of 1 CPU.
+
+	# echo 10000 > cpu.cfs_quota_us /* quota = 10ms */
+	# echo 50000 > cpu.cfs_period_us /* period = 50ms */
+	# echo 30000 > cpu.cfs_burst_us /* burst = 30ms */
+
+	Larger buffer setting allows greater burst capacity.
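The accounting model behind these examples can be sketched in a few lines of userspace C (an illustration of the capped carry-over only, not the kernel implementation; it uses the quota and burst values from example 4, and the per-period demand numbers are made up):

```c
#include <stdio.h>

/* Values from example 4 above: 50ms period, 10ms quota, 30ms burst (in usec). */
#define QUOTA	10000ULL
#define BURST	30000ULL

int main(void)
{
	/* Hypothetical per-period CPU demand of the group, in microseconds. */
	unsigned long long demand[] = { 2000, 1000, 3000, 30000, 12000, 20000 };
	unsigned long long runtime = QUOTA;	/* runtime pool at period start */
	size_t i;

	for (i = 0; i < sizeof(demand) / sizeof(demand[0]); i++) {
		unsigned long long ran = demand[i] < runtime ? demand[i] : runtime;

		printf("period %zu: demand %llu us, ran %llu us, throttled: %s\n",
		       i, demand[i], ran, demand[i] > runtime ? "yes" : "no");
		runtime -= ran;

		/* Refill: add one quota, but cap the carry-over at quota + burst. */
		runtime += QUOTA;
		if (runtime > QUOTA + BURST)
			runtime = QUOTA + BURST;
	}
	return 0;
}
```

In the trace this prints, the fourth period runs 30 ms without throttling because the three underrun periods before it accumulated runtime up to the quota + burst cap (40 ms); once that carry-over is exhausted, the last period is throttled again.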
Hi,
Thanks for your patch.
On 2022/6/27 17:51, liuzhengyuan@kylinos.cn wrote:
From: Huaixin Chang changhuaixin@linux.alibaba.com
anolis inclusion from anolis_master commit 9d168f216486333f24aa1b33706eddf3b13d7228 category: performance bugzilla: NA
Please attach the url of issue.
Hi chenhui, Please help to review this series for openEuler-1.0-LTS.
Thanks.
On 2022/6/29 11:44 AM, Xie XiuQi wrote:
Hi,
Thanks for your patch.
On 2022/6/27 17:51, liuzhengyuan@kylinos.cn wrote:
From: Huaixin Chang changhuaixin@linux.alibaba.com
anolis inclusion from anolis_master commit 9d168f216486333f24aa1b33706eddf3b13d7228 category: performance bugzilla: NA
Please attach the url of issue.
XiuQi,
I have no idea about the URL; should we create an issue list on gitee.com before sending this series?
Please also help to review another series, as I haven't received any reply yet. https://mailweb.openeuler.org/hyperkitty/list/kernel@openeuler.org/thread/VA...