Backport 2fdb8a8f07c2 ("drm/scheduler: rework entity flush, kill and fini") to fix the buggy synchronous waiting. Other patches are preparations and bugfixes.
Andrey Grodzovsky (1): drm/sched: Avoid lockdep spalt on killing a processes
Boris Brezillon (2): drm/sched: Avoid infinite waits in the drm_sched_entity_destroy() path drm/sched: Declare entity idle only after HW submission
Christian König (2): drm/scheduler: fix fence ref counting drm/scheduler: rework entity flush, kill and fini
Daniel Vetter (1): drm/sched: Add dependency tracking
Dmitry Osipenko (2): drm/scheduler: Don't kill jobs in interrupt context drm/scheduler: Fix lockup in drm_sched_entity_kill()
ZhenGuo Yin (1): drm/scheduler: avoid infinite loop if entity's dependency is a scheduled error fence
drivers/gpu/drm/scheduler/sched_entity.c | 193 +++++++++++------------ drivers/gpu/drm/scheduler/sched_main.c | 111 ++++++++++++- include/drm/gpu_scheduler.h | 45 +++++- 3 files changed, 245 insertions(+), 104 deletions(-)
反馈: 您发送到kernel@openeuler.org的补丁/补丁集,已成功转换为PR! PR链接地址: https://gitee.com/openeuler/kernel/pulls/13270 邮件列表地址:https://mailweb.openeuler.org/hyperkitty/list/kernel@openeuler.org/message/S...
FeedBack: The patch(es) which you have sent to kernel@openeuler.org mailing list has been converted to a pull request successfully! Pull request link: https://gitee.com/openeuler/kernel/pulls/13270 Mailing list address: https://mailweb.openeuler.org/hyperkitty/list/kernel@openeuler.org/message/S...
From: Daniel Vetter daniel.vetter@ffwll.ch
mainline inclusion from mainline-v5.16-rc1 commit ebd5f74255b9f5f8a154ba5535f83387ae599d46 category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/IB3ULB
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i...
--------------------------------
Instead of just a callback we can just glue in the gem helpers that panfrost, v3d and lima currently use. There's really not that many ways to skin this cat.
v2/3: Rebased.
v4: Repaint this shed. The functions are now called _add_dependency() and _add_implicit_dependency()
Reviewed-by: Christian König christian.koenig@amd.com Reviewed-by: Boris Brezillon boris.brezillon@collabora.com (v3) Reviewed-by: Steven Price steven.price@arm.com (v1) Acked-by: Melissa Wen mwen@igalia.com Signed-off-by: Daniel Vetter daniel.vetter@intel.com Cc: David Airlie airlied@linux.ie Cc: Daniel Vetter daniel@ffwll.ch Cc: Sumit Semwal sumit.semwal@linaro.org Cc: "Christian König" christian.koenig@amd.com Cc: Andrey Grodzovsky andrey.grodzovsky@amd.com Cc: Lee Jones lee.jones@linaro.org Cc: Nirmoy Das nirmoy.aiemd@gmail.com Cc: Boris Brezillon boris.brezillon@collabora.com Cc: Luben Tuikov luben.tuikov@amd.com Cc: Alex Deucher alexander.deucher@amd.com Cc: linux-media@vger.kernel.org Cc: linaro-mm-sig@lists.linaro.org Link: https://patchwork.freedesktop.org/patch/msgid/20210805104705.862416-5-daniel...
Conflicts: context conflicts in below files: include/drm/gpu_scheduler.h drivers/gpu/drm/scheduler/sched_main.c rename dma_resv_get_excl_unlocked to dma_resv_get_excl_rcu rename dma_resv_get_fences to dma_resv_get_fences_rcu #include <linux/dma-resv.h> & <drm/drm_gem.h>
Signed-off-by: Zheng Zengkai zhengzengkai@huawei.com --- drivers/gpu/drm/scheduler/sched_entity.c | 18 +++- drivers/gpu/drm/scheduler/sched_main.c | 104 +++++++++++++++++++++++ include/drm/gpu_scheduler.h | 33 ++++++- 3 files changed, 149 insertions(+), 6 deletions(-)
diff --git a/drivers/gpu/drm/scheduler/sched_entity.c b/drivers/gpu/drm/scheduler/sched_entity.c index 3f7f761df4cd..1b1b770dc949 100644 --- a/drivers/gpu/drm/scheduler/sched_entity.c +++ b/drivers/gpu/drm/scheduler/sched_entity.c @@ -208,6 +208,19 @@ static void drm_sched_entity_kill_jobs_cb(struct dma_fence *f, job->sched->ops->free_job(job); }
+static struct dma_fence * +drm_sched_job_dependency(struct drm_sched_job *job, + struct drm_sched_entity *entity) +{ + if (!xa_empty(&job->dependencies)) + return xa_erase(&job->dependencies, job->last_dependency++); + + if (job->sched->ops->dependency) + return job->sched->ops->dependency(job, entity); + + return NULL; +} + /** * drm_sched_entity_kill_jobs - Make sure all remaining jobs are killed * @@ -226,7 +239,7 @@ static void drm_sched_entity_kill_jobs(struct drm_sched_entity *entity) struct drm_sched_fence *s_fence = job->s_fence;
/* Wait for all dependencies to avoid data corruptions */ - while ((f = job->sched->ops->dependency(job, entity))) + while ((f = drm_sched_job_dependency(job, entity))) dma_fence_wait(f, false);
drm_sched_fence_scheduled(s_fence); @@ -416,7 +429,6 @@ static bool drm_sched_entity_add_dependency_cb(struct drm_sched_entity *entity) */ struct drm_sched_job *drm_sched_entity_pop_job(struct drm_sched_entity *entity) { - struct drm_gpu_scheduler *sched = entity->rq->sched; struct drm_sched_job *sched_job;
sched_job = to_drm_sched_job(spsc_queue_peek(&entity->job_queue)); @@ -424,7 +436,7 @@ struct drm_sched_job *drm_sched_entity_pop_job(struct drm_sched_entity *entity) return NULL;
while ((entity->dependency = - sched->ops->dependency(sched_job, entity))) { + drm_sched_job_dependency(sched_job, entity))) { trace_drm_sched_job_wait_dep(sched_job, entity->dependency);
if (drm_sched_entity_add_dependency_cb(entity)) diff --git a/drivers/gpu/drm/scheduler/sched_main.c b/drivers/gpu/drm/scheduler/sched_main.c index b6c2757c3d83..3780433e93d9 100644 --- a/drivers/gpu/drm/scheduler/sched_main.c +++ b/drivers/gpu/drm/scheduler/sched_main.c @@ -48,9 +48,11 @@ #include <linux/wait.h> #include <linux/sched.h> #include <linux/completion.h> +#include <linux/dma-resv.h> #include <uapi/linux/sched/types.h>
#include <drm/drm_print.h> +#include <drm/drm_gem.h> #include <drm/gpu_scheduler.h> #include <drm/spsc_queue.h>
@@ -567,10 +569,104 @@ int drm_sched_job_init(struct drm_sched_job *job,
INIT_LIST_HEAD(&job->node);
+ xa_init_flags(&job->dependencies, XA_FLAGS_ALLOC); + return 0; } EXPORT_SYMBOL(drm_sched_job_init);
+/** + * drm_sched_job_add_dependency - adds the fence as a job dependency + * @job: scheduler job to add the dependencies to + * @fence: the dma_fence to add to the list of dependencies. + * + * Note that @fence is consumed in both the success and error cases. + * + * Returns: + * 0 on success, or an error on failing to expand the array. + */ +int drm_sched_job_add_dependency(struct drm_sched_job *job, + struct dma_fence *fence) +{ + struct dma_fence *entry; + unsigned long index; + u32 id = 0; + int ret; + + if (!fence) + return 0; + + /* Deduplicate if we already depend on a fence from the same context. + * This lets the size of the array of deps scale with the number of + * engines involved, rather than the number of BOs. + */ + xa_for_each(&job->dependencies, index, entry) { + if (entry->context != fence->context) + continue; + + if (dma_fence_is_later(fence, entry)) { + dma_fence_put(entry); + xa_store(&job->dependencies, index, fence, GFP_KERNEL); + } else { + dma_fence_put(fence); + } + return 0; + } + + ret = xa_alloc(&job->dependencies, &id, fence, xa_limit_32b, GFP_KERNEL); + if (ret != 0) + dma_fence_put(fence); + + return ret; +} +EXPORT_SYMBOL(drm_sched_job_add_dependency); + +/** + * drm_sched_job_add_implicit_dependencies - adds implicit dependencies as job + * dependencies + * @job: scheduler job to add the dependencies to + * @obj: the gem object to add new dependencies from. + * @write: whether the job might write the object (so we need to depend on + * shared fences in the reservation object). + * + * This should be called after drm_gem_lock_reservations() on your array of + * GEM objects used in the job but before updating the reservations with your + * own fences. + * + * Returns: + * 0 on success, or an error on failing to expand the array. + */ +int drm_sched_job_add_implicit_dependencies(struct drm_sched_job *job, + struct drm_gem_object *obj, + bool write) +{ + int ret; + struct dma_fence **fences; + unsigned int i, fence_count; + + if (!write) { + struct dma_fence *fence = dma_resv_get_excl_rcu(obj->resv); + + return drm_sched_job_add_dependency(job, fence); + } + + ret = dma_resv_get_fences_rcu(obj->resv, NULL, &fence_count, &fences); + if (ret || !fence_count) + return ret; + + for (i = 0; i < fence_count; i++) { + ret = drm_sched_job_add_dependency(job, fences[i]); + if (ret) + break; + } + + for (; i < fence_count; i++) + dma_fence_put(fences[i]); + kfree(fences); + return ret; +} +EXPORT_SYMBOL(drm_sched_job_add_implicit_dependencies); + /** * drm_sched_job_cleanup - clean up scheduler job resources * @@ -578,8 +674,16 @@ EXPORT_SYMBOL(drm_sched_job_init); */ void drm_sched_job_cleanup(struct drm_sched_job *job) { + struct dma_fence *fence; + unsigned long index; + dma_fence_put(&job->s_fence->finished); job->s_fence = NULL; + + xa_for_each(&job->dependencies, index, fence) { + dma_fence_put(fence); + } + xa_destroy(&job->dependencies); } EXPORT_SYMBOL(drm_sched_job_cleanup);
diff --git a/include/drm/gpu_scheduler.h b/include/drm/gpu_scheduler.h index 92436553fd6a..842df37f43be 100644 --- a/include/drm/gpu_scheduler.h +++ b/include/drm/gpu_scheduler.h @@ -27,9 +27,12 @@ #include <drm/spsc_queue.h> #include <linux/dma-fence.h> #include <linux/completion.h> +#include <linux/xarray.h>
#define MAX_WAIT_SCHED_ENTITY_Q_EMPTY msecs_to_jiffies(1000)
+struct drm_gem_object; + struct drm_gpu_scheduler; struct drm_sched_rq;
@@ -198,6 +201,17 @@ struct drm_sched_job { enum drm_sched_priority s_priority; struct drm_sched_entity *entity; struct dma_fence_cb cb; + /** + * @dependencies: + * + * Contains the dependencies as struct dma_fence for this job, see + * drm_sched_job_add_dependency() and + * drm_sched_job_add_implicit_dependencies(). + */ + struct xarray dependencies; + + /** @last_dependency: tracks @dependencies as they signal */ + unsigned long last_dependency; };
static inline bool drm_sched_invalidate_job(struct drm_sched_job *s_job, @@ -214,9 +228,15 @@ static inline bool drm_sched_invalidate_job(struct drm_sched_job *s_job, */ struct drm_sched_backend_ops { /** - * @dependency: Called when the scheduler is considering scheduling - * this job next, to get another struct dma_fence for this job to - * block on. Once it returns NULL, run_job() may be called. + * @dependency: + * + * Called when the scheduler is considering scheduling this job next, to + * get another struct dma_fence for this job to block on. Once it + * returns NULL, run_job() may be called. + * + * If a driver exclusively uses drm_sched_job_add_dependency() and + * drm_sched_job_add_implicit_dependencies() this can be ommitted and + * left as NULL. */ struct dma_fence *(*dependency)(struct drm_sched_job *sched_job, struct drm_sched_entity *s_entity); @@ -299,6 +319,13 @@ void drm_sched_fini(struct drm_gpu_scheduler *sched); int drm_sched_job_init(struct drm_sched_job *job, struct drm_sched_entity *entity, void *owner); +int drm_sched_job_add_dependency(struct drm_sched_job *job, + struct dma_fence *fence); +int drm_sched_job_add_implicit_dependencies(struct drm_sched_job *job, + struct drm_gem_object *obj, + bool write); + + void drm_sched_entity_modify_sched(struct drm_sched_entity *entity, struct drm_gpu_scheduler **sched_list, unsigned int num_sched_list);
From: Andrey Grodzovsky andrey.grodzovsky@amd.com
mainline inclusion from mainline-v5.17-rc1 commit 542cff7893a37445f98ece26aeb3c9c1055e9ea4 category: bugfix bugzilla: https://gitee.com/openeuler/kernel/issues/IB3ULB
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i...
--------------------------------
Probelm: Singlaning one sched fence from within another's sched fence singal callback generates lockdep splat because the both have same lockdep class of their fence->lock
Fix: Fix bellow stack by rescheduling to irq work of signaling and killing of jobs that left when entity is killed.
[11176.741181] dump_stack+0x10/0x12 [11176.741186] __lock_acquire.cold+0x208/0x2df [11176.741197] lock_acquire+0xc6/0x2d0 [11176.741204] ? dma_fence_signal+0x28/0x80 [11176.741212] _raw_spin_lock_irqsave+0x4d/0x70 [11176.741219] ? dma_fence_signal+0x28/0x80 [11176.741225] dma_fence_signal+0x28/0x80 [11176.741230] drm_sched_fence_finished+0x12/0x20 [gpu_sched] [11176.741240] drm_sched_entity_kill_jobs_cb+0x1c/0x50 [gpu_sched] [11176.741248] dma_fence_signal_timestamp_locked+0xac/0x1a0 [11176.741254] dma_fence_signal+0x3b/0x80 [11176.741260] drm_sched_fence_finished+0x12/0x20 [gpu_sched] [11176.741268] drm_sched_job_done.isra.0+0x7f/0x1a0 [gpu_sched] [11176.741277] drm_sched_job_done_cb+0x12/0x20 [gpu_sched] [11176.741284] dma_fence_signal_timestamp_locked+0xac/0x1a0 [11176.741290] dma_fence_signal+0x3b/0x80 [11176.741296] amdgpu_fence_process+0xd1/0x140 [amdgpu] [11176.741504] sdma_v4_0_process_trap_irq+0x8c/0xb0 [amdgpu] [11176.741731] amdgpu_irq_dispatch+0xce/0x250 [amdgpu] [11176.741954] amdgpu_ih_process+0x81/0x100 [amdgpu] [11176.742174] amdgpu_irq_handler+0x26/0xa0 [amdgpu] [11176.742393] __handle_irq_event_percpu+0x4f/0x2c0 [11176.742402] handle_irq_event_percpu+0x33/0x80 [11176.742408] handle_irq_event+0x39/0x60 [11176.742414] handle_edge_irq+0x93/0x1d0 [11176.742419] __common_interrupt+0x50/0xe0 [11176.742426] common_interrupt+0x80/0x90
Signed-off-by: Andrey Grodzovsky andrey.grodzovsky@amd.com Suggested-by: Daniel Vetter daniel.vetter@ffwll.ch Suggested-by: Christian König christian.koenig@amd.com Tested-by: Christian König christian.koenig@amd.com Reviewed-by: Christian König christian.koenig@amd.com Link: https://www.spinics.net/lists/dri-devel/msg321250.html
Conflicts: context conflicts in below files: drivers/gpu/drm/scheduler/sched_entity.c include/drm/gpu_scheduler.h
Signed-off-by: Zheng Zengkai zhengzengkai@huawei.com --- drivers/gpu/drm/scheduler/sched_entity.c | 15 ++++++++++++--- include/drm/gpu_scheduler.h | 12 +++++++++++- 2 files changed, 23 insertions(+), 4 deletions(-)
diff --git a/drivers/gpu/drm/scheduler/sched_entity.c b/drivers/gpu/drm/scheduler/sched_entity.c index 1b1b770dc949..fa2771b9bac8 100644 --- a/drivers/gpu/drm/scheduler/sched_entity.c +++ b/drivers/gpu/drm/scheduler/sched_entity.c @@ -189,6 +189,16 @@ long drm_sched_entity_flush(struct drm_sched_entity *entity, long timeout) } EXPORT_SYMBOL(drm_sched_entity_flush);
+static void drm_sched_entity_kill_jobs_irq_work(struct irq_work *wrk) +{ + struct drm_sched_job *job = container_of(wrk, typeof(*job), work); + + drm_sched_fence_finished(job->s_fence); + WARN_ON(job->s_fence->parent); + job->sched->ops->free_job(job); +} + + /** * drm_sched_entity_kill_jobs - helper for drm_sched_entity_kill_jobs * @@ -203,9 +213,8 @@ static void drm_sched_entity_kill_jobs_cb(struct dma_fence *f, struct drm_sched_job *job = container_of(cb, struct drm_sched_job, finish_cb);
- drm_sched_fence_finished(job->s_fence); - WARN_ON(job->s_fence->parent); - job->sched->ops->free_job(job); + init_irq_work(&job->work, drm_sched_entity_kill_jobs_irq_work); + irq_work_queue(&job->work); }
static struct dma_fence * diff --git a/include/drm/gpu_scheduler.h b/include/drm/gpu_scheduler.h index 842df37f43be..56562016cefd 100644 --- a/include/drm/gpu_scheduler.h +++ b/include/drm/gpu_scheduler.h @@ -28,6 +28,7 @@ #include <linux/dma-fence.h> #include <linux/completion.h> #include <linux/xarray.h> +#include <linux/irq_work.h>
#define MAX_WAIT_SCHED_ENTITY_Q_EMPTY msecs_to_jiffies(1000)
@@ -194,7 +195,16 @@ struct drm_sched_job { struct spsc_node queue_node; struct drm_gpu_scheduler *sched; struct drm_sched_fence *s_fence; - struct dma_fence_cb finish_cb; + + /* + * work is used only after finish_cb has been used and will not be + * accessed anymore. + */ + union { + struct dma_fence_cb finish_cb; + struct irq_work work; + }; + struct list_head node; uint64_t id; atomic_t karma;
From: Dmitry Osipenko dmitry.osipenko@collabora.com
mainline inclusion from mainline-v5.19-rc8 commit 9b04369b060fd4885f728b7a4ab4851ffb1abb64 category: bugfix bugzilla: https://gitee.com/openeuler/kernel/issues/IB3ULB
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i...
--------------------------------
Interrupt context can't sleep. Drivers like Panfrost and MSM are taking mutex when job is released, and thus, that code can sleep. This results into "BUG: scheduling while atomic" if locks are contented while job is freed. There is no good reason for releasing scheduler's jobs in IRQ context, hence use normal context to fix the trouble.
Cc: stable@vger.kernel.org Fixes: 542cff7893a3 ("drm/sched: Avoid lockdep spalt on killing a processes") Signed-off-by: Dmitry Osipenko dmitry.osipenko@collabora.com Signed-off-by: Andrey Grodzovsky andrey.grodzovsky@amd.com Link: https://patchwork.freedesktop.org/patch/msgid/20220411221536.283312-1-dmitry...
Conflicts: context conflicts in below files: include/drm/gpu_scheduler.h
Signed-off-by: Zheng Zengkai zhengzengkai@huawei.com --- drivers/gpu/drm/scheduler/sched_entity.c | 6 +++--- include/drm/gpu_scheduler.h | 4 ++-- 2 files changed, 5 insertions(+), 5 deletions(-)
diff --git a/drivers/gpu/drm/scheduler/sched_entity.c b/drivers/gpu/drm/scheduler/sched_entity.c index fa2771b9bac8..844fc38ad43d 100644 --- a/drivers/gpu/drm/scheduler/sched_entity.c +++ b/drivers/gpu/drm/scheduler/sched_entity.c @@ -189,7 +189,7 @@ long drm_sched_entity_flush(struct drm_sched_entity *entity, long timeout) } EXPORT_SYMBOL(drm_sched_entity_flush);
-static void drm_sched_entity_kill_jobs_irq_work(struct irq_work *wrk) +static void drm_sched_entity_kill_jobs_work(struct work_struct *wrk) { struct drm_sched_job *job = container_of(wrk, typeof(*job), work);
@@ -213,8 +213,8 @@ static void drm_sched_entity_kill_jobs_cb(struct dma_fence *f, struct drm_sched_job *job = container_of(cb, struct drm_sched_job, finish_cb);
- init_irq_work(&job->work, drm_sched_entity_kill_jobs_irq_work); - irq_work_queue(&job->work); + INIT_WORK(&job->work, drm_sched_entity_kill_jobs_work); + schedule_work(&job->work); }
static struct dma_fence * diff --git a/include/drm/gpu_scheduler.h b/include/drm/gpu_scheduler.h index 56562016cefd..ba955f5f957b 100644 --- a/include/drm/gpu_scheduler.h +++ b/include/drm/gpu_scheduler.h @@ -28,7 +28,7 @@ #include <linux/dma-fence.h> #include <linux/completion.h> #include <linux/xarray.h> -#include <linux/irq_work.h> +#include <linux/workqueue.h>
#define MAX_WAIT_SCHED_ENTITY_Q_EMPTY msecs_to_jiffies(1000)
@@ -202,7 +202,7 @@ struct drm_sched_job { */ union { struct dma_fence_cb finish_cb; - struct irq_work work; + struct work_struct work; };
struct list_head node;
From: Christian König christian.koenig@amd.com
mainline inclusion from mainline-v6.1-rc3 commit b3af84383e7abdc5e63435817bb73a268e7c3637 category: bugfix bugzilla: https://gitee.com/openeuler/kernel/issues/IB3ULB
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i...
--------------------------------
We leaked dependency fences when processes were beeing killed.
Additional to that grab a reference to the last scheduled fence.
Signed-off-by: Christian König christian.koenig@amd.com Reviewed-by: Andrey Grodzovsky andrey.grodzovsky@amd.com Link: https://patchwork.freedesktop.org/patch/msgid/20220929180151.139751-1-christ... Signed-off-by: Zheng Zengkai zhengzengkai@huawei.com --- drivers/gpu/drm/scheduler/sched_entity.c | 6 +++++- 1 file changed, 5 insertions(+), 1 deletion(-)
diff --git a/drivers/gpu/drm/scheduler/sched_entity.c b/drivers/gpu/drm/scheduler/sched_entity.c index 844fc38ad43d..eedb94272a09 100644 --- a/drivers/gpu/drm/scheduler/sched_entity.c +++ b/drivers/gpu/drm/scheduler/sched_entity.c @@ -213,6 +213,7 @@ static void drm_sched_entity_kill_jobs_cb(struct dma_fence *f, struct drm_sched_job *job = container_of(cb, struct drm_sched_job, finish_cb);
+ dma_fence_put(f); INIT_WORK(&job->work, drm_sched_entity_kill_jobs_work); schedule_work(&job->work); } @@ -248,8 +249,10 @@ static void drm_sched_entity_kill_jobs(struct drm_sched_entity *entity) struct drm_sched_fence *s_fence = job->s_fence;
/* Wait for all dependencies to avoid data corruptions */ - while ((f = drm_sched_job_dependency(job, entity))) + while ((f = drm_sched_job_dependency(job, entity))) { dma_fence_wait(f, false); + dma_fence_put(f); + }
drm_sched_fence_scheduled(s_fence); dma_fence_set_error(&s_fence->finished, -ESRCH); @@ -264,6 +267,7 @@ static void drm_sched_entity_kill_jobs(struct drm_sched_entity *entity) continue; }
+ dma_fence_get(entity->last_scheduled); r = dma_fence_add_callback(entity->last_scheduled, &job->finish_cb, drm_sched_entity_kill_jobs_cb);
From: Christian König christian.koenig@amd.com
mainline inclusion from mainline-v6.2-rc1 commit 2fdb8a8f07c2f1353770a324fd19b8114e4329ac category: bugfix bugzilla: https://gitee.com/openeuler/kernel/issues/IB3ULB
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i...
--------------------------------
This was buggy because when we had to wait for entities which were killed as well we would just deadlock.
Instead move all the dependency handling into the callbacks so that will all happen asynchronously.
Signed-off-by: Christian König christian.koenig@amd.com Reviewed-by: Luben Tuikov luben.tuikov@amd.com Link: https://patchwork.freedesktop.org/patch/msgid/20221014084641.128280-13-chris...
Conflicts: context conflicts in below file: drivers/gpu/drm/scheduler/sched_entity.c
Signed-off-by: Zheng Zengkai zhengzengkai@huawei.com --- drivers/gpu/drm/scheduler/sched_entity.c | 211 ++++++++++------------- 1 file changed, 91 insertions(+), 120 deletions(-)
diff --git a/drivers/gpu/drm/scheduler/sched_entity.c b/drivers/gpu/drm/scheduler/sched_entity.c index eedb94272a09..f7d449b1c162 100644 --- a/drivers/gpu/drm/scheduler/sched_entity.c +++ b/drivers/gpu/drm/scheduler/sched_entity.c @@ -138,6 +138,73 @@ bool drm_sched_entity_is_ready(struct drm_sched_entity *entity) return true; }
+static void drm_sched_entity_kill_jobs_work(struct work_struct *wrk) +{ + struct drm_sched_job *job = container_of(wrk, typeof(*job), work); + + drm_sched_fence_finished(job->s_fence); + WARN_ON(job->s_fence->parent); + job->sched->ops->free_job(job); +} + +/* Signal the scheduler finished fence when the entity in question is killed. */ +static void drm_sched_entity_kill_jobs_cb(struct dma_fence *f, + struct dma_fence_cb *cb) +{ + struct drm_sched_job *job = container_of(cb, struct drm_sched_job, + finish_cb); + int r; + + dma_fence_put(f); + + /* Wait for all dependencies to avoid data corruptions */ + while (!xa_empty(&job->dependencies)) { + f = xa_erase(&job->dependencies, job->last_dependency++); + r = dma_fence_add_callback(f, &job->finish_cb, + drm_sched_entity_kill_jobs_cb); + if (!r) + return; + + dma_fence_put(f); + } + + INIT_WORK(&job->work, drm_sched_entity_kill_jobs_work); + schedule_work(&job->work); +} + +/* Remove the entity from the scheduler and kill all pending jobs */ +static void drm_sched_entity_kill(struct drm_sched_entity *entity) +{ + struct drm_sched_job *job; + struct dma_fence *prev; + + if (!entity->rq) + return; + + spin_lock(&entity->rq_lock); + entity->stopped = true; + drm_sched_rq_remove_entity(entity->rq, entity); + spin_unlock(&entity->rq_lock); + + /* Make sure this entity is not used by the scheduler at the moment */ + wait_for_completion(&entity->entity_idle); + + prev = dma_fence_get(entity->last_scheduled); + while ((job = to_drm_sched_job(spsc_queue_pop(&entity->job_queue)))) { + struct drm_sched_fence *s_fence = job->s_fence; + + dma_fence_set_error(&s_fence->finished, -ESRCH); + + dma_fence_get(&s_fence->finished); + if (!prev || dma_fence_add_callback(prev, &job->finish_cb, + drm_sched_entity_kill_jobs_cb)) + drm_sched_entity_kill_jobs_cb(NULL, &job->finish_cb); + + prev = &s_fence->finished; + } + dma_fence_put(prev); +} + /** * drm_sched_entity_flush - Flush a context entity * @@ -178,106 +245,13 @@ long drm_sched_entity_flush(struct drm_sched_entity *entity, long timeout) /* For killed process disable any more IBs enqueue right now */ last_user = cmpxchg(&entity->last_user, current->group_leader, NULL); if ((!last_user || last_user == current->group_leader) && - (current->flags & PF_EXITING) && (current->exit_code == SIGKILL)) { - spin_lock(&entity->rq_lock); - entity->stopped = true; - drm_sched_rq_remove_entity(entity->rq, entity); - spin_unlock(&entity->rq_lock); - } + (current->flags & PF_EXITING) && (current->exit_code == SIGKILL)) + drm_sched_entity_kill(entity);
return ret; } EXPORT_SYMBOL(drm_sched_entity_flush);
-static void drm_sched_entity_kill_jobs_work(struct work_struct *wrk) -{ - struct drm_sched_job *job = container_of(wrk, typeof(*job), work); - - drm_sched_fence_finished(job->s_fence); - WARN_ON(job->s_fence->parent); - job->sched->ops->free_job(job); -} - - -/** - * drm_sched_entity_kill_jobs - helper for drm_sched_entity_kill_jobs - * - * @f: signaled fence - * @cb: our callback structure - * - * Signal the scheduler finished fence when the entity in question is killed. - */ -static void drm_sched_entity_kill_jobs_cb(struct dma_fence *f, - struct dma_fence_cb *cb) -{ - struct drm_sched_job *job = container_of(cb, struct drm_sched_job, - finish_cb); - - dma_fence_put(f); - INIT_WORK(&job->work, drm_sched_entity_kill_jobs_work); - schedule_work(&job->work); -} - -static struct dma_fence * -drm_sched_job_dependency(struct drm_sched_job *job, - struct drm_sched_entity *entity) -{ - if (!xa_empty(&job->dependencies)) - return xa_erase(&job->dependencies, job->last_dependency++); - - if (job->sched->ops->dependency) - return job->sched->ops->dependency(job, entity); - - return NULL; -} - -/** - * drm_sched_entity_kill_jobs - Make sure all remaining jobs are killed - * - * @entity: entity which is cleaned up - * - * Makes sure that all remaining jobs in an entity are killed before it is - * destroyed. - */ -static void drm_sched_entity_kill_jobs(struct drm_sched_entity *entity) -{ - struct drm_sched_job *job; - struct dma_fence *f; - int r; - - while ((job = to_drm_sched_job(spsc_queue_pop(&entity->job_queue)))) { - struct drm_sched_fence *s_fence = job->s_fence; - - /* Wait for all dependencies to avoid data corruptions */ - while ((f = drm_sched_job_dependency(job, entity))) { - dma_fence_wait(f, false); - dma_fence_put(f); - } - - drm_sched_fence_scheduled(s_fence); - dma_fence_set_error(&s_fence->finished, -ESRCH); - - /* - * When pipe is hanged by older entity, new entity might - * not even have chance to submit it's first job to HW - * and so entity->last_scheduled will remain NULL - */ - if (!entity->last_scheduled) { - drm_sched_entity_kill_jobs_cb(NULL, &job->finish_cb); - continue; - } - - dma_fence_get(entity->last_scheduled); - r = dma_fence_add_callback(entity->last_scheduled, - &job->finish_cb, - drm_sched_entity_kill_jobs_cb); - if (r == -ENOENT) - drm_sched_entity_kill_jobs_cb(NULL, &job->finish_cb); - else if (r) - DRM_ERROR("fence add callback failed (%d)\n", r); - } -} - /** * drm_sched_entity_cleanup - Destroy a context entity * @@ -289,33 +263,17 @@ static void drm_sched_entity_kill_jobs(struct drm_sched_entity *entity) */ void drm_sched_entity_fini(struct drm_sched_entity *entity) { - struct drm_gpu_scheduler *sched = NULL; - - if (entity->rq) { - sched = entity->rq->sched; - drm_sched_rq_remove_entity(entity->rq, entity); - } - - /* Consumption of existing IBs wasn't completed. Forcefully - * remove them here. + /* + * If consumption of existing IBs wasn't completed. Forcefully remove + * them here. Also makes sure that the scheduler won't touch this entity + * any more. */ - if (spsc_queue_count(&entity->job_queue)) { - if (sched) { - /* - * Wait for thread to idle to make sure it isn't processing - * this entity. - */ - wait_for_completion(&entity->entity_idle); + drm_sched_entity_kill(entity);
- } - if (entity->dependency) { - dma_fence_remove_callback(entity->dependency, - &entity->cb); - dma_fence_put(entity->dependency); - entity->dependency = NULL; - } - - drm_sched_entity_kill_jobs(entity); + if (entity->dependency) { + dma_fence_remove_callback(entity->dependency, &entity->cb); + dma_fence_put(entity->dependency); + entity->dependency = NULL; }
dma_fence_put(entity->last_scheduled); @@ -433,6 +391,19 @@ static bool drm_sched_entity_add_dependency_cb(struct drm_sched_entity *entity) return false; }
+static struct dma_fence * +drm_sched_job_dependency(struct drm_sched_job *job, + struct drm_sched_entity *entity) +{ + if (!xa_empty(&job->dependencies)) + return xa_erase(&job->dependencies, job->last_dependency++); + + if (job->sched->ops->dependency) + return job->sched->ops->dependency(job, entity); + + return NULL; +} + /** * drm_sched_entity_pop_job - get a ready to be scheduled job from the entity *
From: Boris Brezillon boris.brezillon@collabora.com
mainline inclusion from mainline-v5.11-rc1 commit 170fb58ee329bcfa8dc4c0c5eab0ea5af7fb7f3b category: bugfix bugzilla: https://gitee.com/openeuler/kernel/issues/IB3ULB
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i...
--------------------------------
If we don't initialize the entity to idle and the entity is never scheduled before being destroyed we end up with an infinite wait in the destroy path.
v2: - Add Steven's R-b
Signed-off-by: Boris Brezillon boris.brezillon@collabora.com Reviewed-by: Steven Price steven.price@arm.com Reviewed-by: Christian König christian.koenig@amd.com Signed-off-by: Christian König christian.koenig@amd.com Link: https://patchwork.freedesktop.org/patch/393486/ Signed-off-by: Zheng Zengkai zhengzengkai@huawei.com --- drivers/gpu/drm/scheduler/sched_entity.c | 3 +++ 1 file changed, 3 insertions(+)
diff --git a/drivers/gpu/drm/scheduler/sched_entity.c b/drivers/gpu/drm/scheduler/sched_entity.c index f7d449b1c162..6cedce084762 100644 --- a/drivers/gpu/drm/scheduler/sched_entity.c +++ b/drivers/gpu/drm/scheduler/sched_entity.c @@ -73,6 +73,9 @@ int drm_sched_entity_init(struct drm_sched_entity *entity,
init_completion(&entity->entity_idle);
+ /* We start in an idle state. */ + complete(&entity->entity_idle); + spin_lock_init(&entity->rq_lock); spsc_queue_init(&entity->job_queue);
From: Boris Brezillon boris.brezillon@collabora.com
mainline inclusion from mainline-v5.15-rc1 commit 3b5ac97ad468f6cfd31346821a3a2b9f13d23015 category: bugfix bugzilla: https://gitee.com/openeuler/kernel/issues/IB3ULB
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i...
--------------------------------
The panfrost driver tries to kill in-flight jobs on FD close after destroying the FD scheduler entities. For this to work properly, we need to make sure the jobs popped from the scheduler entities have been queued at the HW level before declaring the entity idle, otherwise we might iterate over a list that doesn't contain those jobs.
Suggested-by: Lucas Stach l.stach@pengutronix.de Signed-off-by: Boris Brezillon boris.brezillon@collabora.com Cc: Lucas Stach l.stach@pengutronix.de Reviewed-by: Steven Price steven.price@arm.com Reviewed-by: Lucas Stach l.stach@pengutronix.de Link: https://patchwork.freedesktop.org/patch/msgid/20210624140850.2229697-1-boris... Signed-off-by: Zheng Zengkai zhengzengkai@huawei.com --- drivers/gpu/drm/scheduler/sched_main.c | 7 ++++--- 1 file changed, 4 insertions(+), 3 deletions(-)
diff --git a/drivers/gpu/drm/scheduler/sched_main.c b/drivers/gpu/drm/scheduler/sched_main.c index 3780433e93d9..c93f75325b6c 100644 --- a/drivers/gpu/drm/scheduler/sched_main.c +++ b/drivers/gpu/drm/scheduler/sched_main.c @@ -895,10 +895,10 @@ static int drm_sched_main(void *param)
sched_job = drm_sched_entity_pop_job(entity);
- complete(&entity->entity_idle); - - if (!sched_job) + if (!sched_job) { + complete(&entity->entity_idle); continue; + }
s_fence = sched_job->s_fence;
@@ -907,6 +907,7 @@ static int drm_sched_main(void *param)
trace_drm_run_job(sched_job, entity); fence = sched->ops->run_job(sched_job); + complete(&entity->entity_idle); drm_sched_fence_scheduled(s_fence);
if (!IS_ERR_OR_NULL(fence)) {
From: Dmitry Osipenko dmitry.osipenko@collabora.com
mainline inclusion from mainline-v6.2-rc3 commit 69555549cfa42e10f2fdd2699ed4e34d9d4f392b category: bugfix bugzilla: https://gitee.com/openeuler/kernel/issues/IB3ULB
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i...
--------------------------------
The drm_sched_entity_kill() is invoked twice by drm_sched_entity_destroy() while userspace process is exiting or being killed. First time it's invoked when sched entity is flushed and second time when entity is released. This causes a lockup within wait_for_completion(entity_idle) due to how completion API works.
Calling wait_for_completion() more times than complete() was invoked is a error condition that causes lockup because completion internally uses counter for complete/wait calls. The complete_all() must be used instead in such cases.
This patch fixes lockup of Panfrost driver that is reproducible by killing any application in a middle of 3d drawing operation.
Fixes: 2fdb8a8f07c2 ("drm/scheduler: rework entity flush, kill and fini") Signed-off-by: Dmitry Osipenko dmitry.osipenko@collabora.com Reviewed-by: Christian König christian.koenig@amd.com Link: https://patchwork.freedesktop.org/patch/msgid/20221123001303.533968-1-dmitry... Signed-off-by: Zheng Zengkai zhengzengkai@huawei.com --- drivers/gpu/drm/scheduler/sched_entity.c | 2 +- drivers/gpu/drm/scheduler/sched_main.c | 4 ++-- 2 files changed, 3 insertions(+), 3 deletions(-)
diff --git a/drivers/gpu/drm/scheduler/sched_entity.c b/drivers/gpu/drm/scheduler/sched_entity.c index 6cedce084762..be0d5bfb5df1 100644 --- a/drivers/gpu/drm/scheduler/sched_entity.c +++ b/drivers/gpu/drm/scheduler/sched_entity.c @@ -74,7 +74,7 @@ int drm_sched_entity_init(struct drm_sched_entity *entity, init_completion(&entity->entity_idle);
/* We start in an idle state. */ - complete(&entity->entity_idle); + complete_all(&entity->entity_idle);
spin_lock_init(&entity->rq_lock); spsc_queue_init(&entity->job_queue); diff --git a/drivers/gpu/drm/scheduler/sched_main.c b/drivers/gpu/drm/scheduler/sched_main.c index c93f75325b6c..936e87ad3feb 100644 --- a/drivers/gpu/drm/scheduler/sched_main.c +++ b/drivers/gpu/drm/scheduler/sched_main.c @@ -896,7 +896,7 @@ static int drm_sched_main(void *param) sched_job = drm_sched_entity_pop_job(entity);
if (!sched_job) { - complete(&entity->entity_idle); + complete_all(&entity->entity_idle); continue; }
@@ -907,7 +907,7 @@ static int drm_sched_main(void *param)
trace_drm_run_job(sched_job, entity); fence = sched->ops->run_job(sched_job); - complete(&entity->entity_idle); + complete_all(&entity->entity_idle); drm_sched_fence_scheduled(s_fence);
if (!IS_ERR_OR_NULL(fence)) {
From: ZhenGuo Yin zhenguo.yin@amd.com
mainline inclusion from mainline-v6.5-rc1 commit 4f9b94d848696166011bead3109541ec2a523bb8 category: bugfix bugzilla: https://gitee.com/openeuler/kernel/issues/IB3ULB
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i...
--------------------------------
[Why] drm_sched_entity_add_dependency_cb ignores the scheduled fence and return false. If entity's dependency is a scheduler error fence and drm_sched_stop is called due to TDR, drm_sched_entity_pop_job will wait for the dependency infinitely.
[How] Do not wait or ignore the scheduled error fence, add drm_sched_entity_wakeup callback for the dependency with scheduled error fence.
Signed-off-by: ZhenGuo Yin zhenguo.yin@amd.com Acked-by: Alex Deucher alexander.deucher@amd.com Reviewed-by: Christian König christian.koenig@amd.com Signed-off-by: Alex Deucher alexander.deucher@amd.com
Conflicts: context conflicts in below file: drivers/gpu/drm/scheduler/sched_entity.c
Signed-off-by: Zheng Zengkai zhengzengkai@huawei.com --- drivers/gpu/drm/scheduler/sched_entity.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/drivers/gpu/drm/scheduler/sched_entity.c b/drivers/gpu/drm/scheduler/sched_entity.c index be0d5bfb5df1..3d5c94496aa3 100644 --- a/drivers/gpu/drm/scheduler/sched_entity.c +++ b/drivers/gpu/drm/scheduler/sched_entity.c @@ -368,7 +368,7 @@ static bool drm_sched_entity_add_dependency_cb(struct drm_sched_entity *entity) }
s_fence = to_drm_sched_fence(fence); - if (s_fence && s_fence->sched == sched) { + if (!fence->error && s_fence && s_fence->sched == sched) {
/* * Fence is from the same scheduler, only need to wait for