[OLK-6.6 3/3] sched/eevdf: Skip eligibility check for current entity during wakeup preemption

7 Apr 2024

From: K Prateek Nayak <kprateek.nayak@amd.com>

maillist inclusion
category: performance
bugzilla: https://gitee.com/openeuler/kernel/issues/I9EHKI
CVE: NA

Reference: https://lore.kernel.org/lkml/20240325060226.1540-2-kprateek.nayak@amd.com/

--------------------------------

With the curr entity's eligibility check, a wakeup preemption is very
likely when an entity with positive lag joins the runqueue pushing the
avg_vruntime of the runqueue backwards, making the vruntime of the
current entity ineligible. This leads to aggressive wakeup preemption
which was previously guarded by wakeup_granularity_ns in legacy CFS.
Below figure depicts one such aggressive preemption scenario with EEVDF
in DeathStarBench [1]:

				deadline for Nginx
		   |
	+-------+  |			|
    /-- | Nginx | -|------------------> |
    |	+-------+  |			|
    |		   |
    |	-----------|-------------------------------> vruntime timeline
    |		   \--> rq->avg_vruntime
    |
    | 	wakes service on the same runqueue since system is busy
    |
    |	+---------+|
    \-->| Service || (service has +ve lag pushes avg_vruntime backwards)
	+---------+|
	  |	   |
   wakeup |	+--|-----+			 |
 preempts \---->| N|ginx | --------------------> | {deadline for Nginx}
		+--|-----+  			 |
	   (Nginx ineligible)
	-----------|-------------------------------> vruntime timeline
		   \--> rq->avg_vruntime

When NGINX server is involuntarily switched out, it cannot accept any
incoming request, leading to longer turn around time for the clients and
thus loss in DeathStarBench throughput.

    ==================================================================
    Test          : DeathStarBench
    Units         : Normalized latency
    Interpretation: Lower is better
    Statistic     : Mean
    ==================================================================
    tip		1.00
    eevdf	1.14 (+14.61%)

For current running task, skip eligibility check in pick_eevdf() if it
has not exhausted the slice promised to it during selection despite the
situation having changed since. The behavior is guarded by
RUN_TO_PARITY_WAKEUP sched_feat to simplify testing. With
RUN_TO_PARITY_WAKEUP enabled, performance loss seen with DeathStarBench
since the merge of EEVDF disappears. Following are the results from
testing on a Dual Socket 3rd Generation EPYC server (2 x 64C/128T):

    ==================================================================
    Test          : DeathStarBench
    Units         : Normalized throughput
    Interpretation: Higher is better
    Statistic     : Mean
    ==================================================================
    Pinning      scaling     tip            run-to-parity-wakeup(pct imp)
     1CCD           1       1.00            1.16 (%diff: 16%)
     2CCD           2       1.00            1.03 (%diff: 3%)
     4CCD           4       1.00            1.12 (%diff: 12%)
     8CCD           8       1.00            1.05 (%diff: 6%)

With spec_rstack_overflow=off, the DeathStarBench performance with the
proposed solution is same as the performance on v6.5 release before
EEVDF was merged.

This may lead to newly waking task waiting longer for its turn on the
CPU, however, testing on the same system did not reveal any consistent
regressions with the standard benchmarks.

Link: https://github.com/delimitrou/DeathStarBench/ [1]
Signed-off-by: K Prateek Nayak <kprateek.nayak@amd.com>
---
 kernel/sched/fair.c     | 24 ++++++++++++++++++++----
 kernel/sched/features.h |  1 +
 2 files changed, 21 insertions(+), 4 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index ddb6462babda..7b0cb2f090da 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1032,7 +1032,7 @@ struct sched_entity *__pick_first_entity(struct cfs_rq *cfs_rq)
  *
  * Which allows tree pruning through eligibility.
  */
-static struct sched_entity *pick_eevdf(struct cfs_rq *cfs_rq)
+static struct sched_entity *pick_eevdf(struct cfs_rq *cfs_rq, bool wakeup_preempt)
 {
 	struct rb_node *node = cfs_rq->tasks_timeline.rb_root.rb_node;
 	struct sched_entity *se = __pick_first_entity(cfs_rq);
@@ -1046,7 +1046,23 @@ static struct sched_entity *pick_eevdf(struct cfs_rq *cfs_rq)
 	if (cfs_rq->nr_running == 1)
 		return curr && curr->on_rq ? curr : se;
 
-	if (curr && (!curr->on_rq || !entity_eligible(cfs_rq, curr)))
+	if (curr && !curr->on_rq)
+		curr = NULL;
+
+	/*
+	 * When an entity with positive lag wakes up, it pushes the
+	 * avg_vruntime of the runqueue backwards. This may causes the
+	 * current entity to be ineligible soon into its run leading to
+	 * wakeup preemption.
+	 *
+	 * To prevent such aggressive preemption of the current running
+	 * entity during task wakeups, skip the eligibility check if the
+	 * slice promised to the entity since its selection has not yet
+	 * elapsed.
+	 */
+	if (curr &&
+	    !(sched_feat(RUN_TO_PARITY_WAKEUP) && wakeup_preempt && curr->vlag == curr->deadline) &&
+	    !entity_eligible(cfs_rq, curr))
 		curr = NULL;
 
 	/*
@@ -5576,7 +5592,7 @@ pick_next_entity(struct cfs_rq *cfs_rq, struct sched_entity *curr)
 	    cfs_rq->next && entity_eligible(cfs_rq, cfs_rq->next))
 		return cfs_rq->next;
 
-	return pick_eevdf(cfs_rq);
+	return pick_eevdf(cfs_rq, false);
 }
 
 static bool check_cfs_rq_runtime(struct cfs_rq *cfs_rq);
@@ -9245,7 +9261,7 @@ static void check_preempt_wakeup(struct rq *rq, struct task_struct *p, int wake_
 	/*
 	 * XXX pick_eevdf(cfs_rq) != se ?
 	 */
-	if (pick_eevdf(cfs_rq) == pse)
+	if (pick_eevdf(cfs_rq, true) == pse)
 		goto preempt;
 
 	return;
diff --git a/kernel/sched/features.h b/kernel/sched/features.h
index e4789d09f58e..26b1a03bd3d2 100644
--- a/kernel/sched/features.h
+++ b/kernel/sched/features.h
@@ -7,6 +7,7 @@
 SCHED_FEAT(PLACE_LAG, true)
 SCHED_FEAT(PLACE_DEADLINE_INITIAL, true)
 SCHED_FEAT(RUN_TO_PARITY, true)
+SCHED_FEAT(RUN_TO_PARITY_WAKEUP, true)
 
 /*
  * Prefer to schedule the task we woke last (assuming it failed
-- 
2.18.0.huawei.25

    

[OLK-6.6 3/3] sched/eevdf: Skip eligibility check for current entity during wakeup preemption

Zhang Qiao