In the current iopf (IO page fault) processing flow of the iommu subsystem, the CPU can be preempted, which may cause iopf processing to time out. To address this timeout problem, a timeout timer function has been added, and the locking inside the iopf response path has been optimized.

This solution refers to Intel's patch set "iommu: handle page response timeout", but differs from Intel's implementation; please refer to the individual patch descriptions for the details of the scheme.
Changes v2 -> v3:
  - Modify the configuration method of the timeout parameter.

Changes v1 -> v2:
  - Add a CONFIG option for configuring the timeout value.
Longfang Liu (3):
  iommu/iopf: add a timeout parameter for IOPF
  iommu/iopf: update the lock of the response
  iommu/iopf: add timeout function to iommu

 .../admin-guide/kernel-parameters.txt |  8 ++
 drivers/iommu/io-pgfault.c            | 95 ++++++++++++++++++-
 drivers/iommu/iommu.c                 | 14 +--
 include/linux/iommu.h                 |  2 +-
 4 files changed, 109 insertions(+), 10 deletions(-)
In the current IOMMU subsystem, IO page fault processing can be delayed or become unresponsive. By adding a tunable setup parameter, users can decide whether to use a timeout timer to prevent such unresponsiveness.
This implementation refers to Intel's previous patch: https://lore.kernel.org/lkml/1565900005-62508-2-git-send-email-jacob.jun.pan@linux.intel.com/
Signed-off-by: Longfang Liu <liulongfang@huawei.com>
---
 .../admin-guide/kernel-parameters.txt |  8 +++++
 drivers/iommu/io-pgfault.c            | 32 +++++++++++++++++++
 2 files changed, 40 insertions(+)
diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
index 6b838869554b..425d446d92c6 100644
--- a/Documentation/admin-guide/kernel-parameters.txt
+++ b/Documentation/admin-guide/kernel-parameters.txt
@@ -2176,6 +2176,14 @@
 			1 - Bypass the IOMMU for DMA.
 			unset - Use value of CONFIG_IOMMU_DEFAULT_PASSTHROUGH.

+	iommu.iopf_timeout=
+			Timeout in milliseconds to wait for page response
+			of an IO page fault request.
+			Format: <integer>
+			Default: 600
+			0 - no timeout tracking
+			100 to 10000 - allowed range
+
 	io7=		[HW] IO7 for Marvel-based Alpha systems
 			See comment before marvel_specify_io7
 			in arch/alpha/kernel/core_marvel.c.

diff --git a/drivers/iommu/io-pgfault.c b/drivers/iommu/io-pgfault.c
index e5b8b9110c13..cd7ebbfe4a67 100644
--- a/drivers/iommu/io-pgfault.c
+++ b/drivers/iommu/io-pgfault.c
@@ -45,6 +45,17 @@ struct iopf_fault {
 	struct list_head list;
 };

+/**
+ * Timeout to wait for page response of a pending page request. This is
+ * intended as a basic safety net in case a pending page request is not
+ * responded to for an exceptionally long time. A device may also implement
+ * its own protection mechanism against this exception.
+ * Defaults to 600 milliseconds.
+ */
+#define IOPF_RESPONSE_DEF_TIMEOUT	600
+#define IOPF_RESPONSE_MAX_TIMEOUT	10000
+static unsigned long iopf_timeout_ms = IOPF_RESPONSE_DEF_TIMEOUT;
+
 struct iopf_group {
 	struct iopf_fault last_fault;
 	struct list_head faults;
@@ -52,6 +63,27 @@ struct iopf_group {
 	struct device *dev;
 };

+static int __init iommu_set_iopf_timeout(char *str)
+{
+	unsigned long timeout;
+	int ret;
+
+	if (!str)
+		return -EINVAL;
+
+	ret = kstrtoul(str, 10, &timeout);
+	if (ret)
+		return ret;
+
+	if (timeout > IOPF_RESPONSE_MAX_TIMEOUT)
+		return -EINVAL;
+
+	iopf_timeout_ms = timeout;
+
+	return 0;
+}
+early_param("iommu.iopf_timeout", iommu_set_iopf_timeout);
+
 static int iopf_complete_group(struct device *dev, struct iopf_fault *iopf,
 			       enum iommu_page_response_code status)
 {
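A usage note (illustrative values, not part of the patch): with this change applied, the timeout could be tuned from the kernel command line, e.g.:

    iommu.iopf_timeout=1000    # wait up to 1000 ms for a page response
    iommu.iopf_timeout=0       # disable timeout tracking

Any value above IOPF_RESPONSE_MAX_TIMEOUT (10000) is rejected by iommu_set_iopf_timeout().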
-----Original Message-----
From: liulongfang
Sent: 05 January 2023 07:48
To: Shameerali Kolothum Thodi <shameerali.kolothum.thodi@huawei.com>; Jonathan Cameron <jonathan.cameron@huawei.com>; Zengtao (B) <prime.zeng@hisilicon.com>
Cc: linuxarm@openeuler.org; yekai (A) <yekai13@huawei.com>; fanghao (A) <fanghao11@huawei.com>; Wangzhou (B) <wangzhou1@hisilicon.com>; shenyang (M) <shenyang39@huawei.com>; liulongfang <liulongfang@huawei.com>; qianweili <qianweili@huawei.com>; liudongdong (C) <liudongdong3@huawei.com>
Subject: [PATCH v3 1/3] iommu/iopf: add a timeout parameter for IOPF
In the current IOMMU subsystem, IO page fault processing can be delayed or become unresponsive. By adding a tunable setup parameter, users can decide whether to use a timeout timer to prevent such unresponsiveness.
This implementation refers to Intel's previous patch, URL: https://lore.kernel.org/lkml/1565900005-62508-2-git-
send-email-jacob.jun.pan@linux.intel.com/
Looks like the link has a line break.
Signed-off-by: Longfang Liu <liulongfang@huawei.com>
Since this is very similar to Jacob's patch, you should make him the author and add your Signed-off-by tag after his.
Thanks, Shameer
On 2023/1/6 1:15, Shameerali Kolothum Thodi wrote:
In the current IOMMU subsystem, IO page fault processing can be delayed or become unresponsive. By adding a tunable setup parameter, users can decide whether to use a timeout timer to prevent such unresponsiveness.
This implementation refers to Intel's previous patch, URL: https://lore.kernel.org/lkml/1565900005-62508-2-git-
send-email-jacob.jun.pan@linux.intel.com/
Looks like the link has a line break.
Yes, but keeping it on one line will cause checkpatch to report a warning. I need to find out if there is a way to have both.
Signed-off-by: Longfang Liu <liulongfang@huawei.com>
Since this is very similar to Jacob's patch, you should make him the author and add your Signed-off-by tag after his.
Yes, the author needs to be added.
Thanks, Shameer
Thanks, Longfang.
When the timeout timer function is added, the timer callback is effectively a soft-interrupt (softirq) handler. Since the IOMMU response path is now invoked from this soft-interrupt context, taking the original mutex there may sleep and cause a kernel calltrace, so the lock needs to be changed.
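As a minimal sketch of the constraint (illustrative, not code from this patch; the lock name is an assumption), a timer callback executes in softirq context, where a sleeping lock such as a mutex must not be taken:

    /* Timer callbacks run in softirq (atomic) context. */
    static DEFINE_SPINLOCK(example_lock);

    static void example_timer_fn(struct timer_list *t)
    {
            /*
             * mutex_lock() may sleep, which is illegal here and would
             * produce a "scheduling while atomic" calltrace; a spinlock
             * never sleeps, so it is safe in this context.
             */
            spin_lock(&example_lock);
            /* ... walk the pending fault list ... */
            spin_unlock(&example_lock);
    }

This is why the fault_param lock below is converted from a mutex to a spinlock.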
Signed-off-by: Longfang Liu <liulongfang@huawei.com>
---
 drivers/iommu/iommu.c | 14 +++++++-------
 include/linux/iommu.h |  2 +-
 2 files changed, 8 insertions(+), 8 deletions(-)
diff --git a/drivers/iommu/iommu.c b/drivers/iommu/iommu.c
index 7c99d8eb3182..b927a35aefa6 100644
--- a/drivers/iommu/iommu.c
+++ b/drivers/iommu/iommu.c
@@ -1221,7 +1221,7 @@ int iommu_register_device_fault_handler(struct device *dev,
 	}
 	param->fault_param->handler = handler;
 	param->fault_param->data = data;
-	mutex_init(&param->fault_param->lock);
+	spin_lock_init(&param->fault_param->lock);
 	INIT_LIST_HEAD(&param->fault_param->faults);

 done_unlock:
@@ -1306,16 +1306,16 @@ int iommu_report_device_fault(struct device *dev, struct iommu_fault_event *evt)
 			ret = -ENOMEM;
 			goto done_unlock;
 		}
-		mutex_lock(&fparam->lock);
+		spin_lock(&fparam->lock);
 		list_add_tail(&evt_pending->list, &fparam->faults);
-		mutex_unlock(&fparam->lock);
+		spin_unlock(&fparam->lock);
 	}

 	ret = fparam->handler(&evt->fault, fparam->data);
 	if (ret && evt_pending) {
-		mutex_lock(&fparam->lock);
+		spin_lock(&fparam->lock);
 		list_del(&evt_pending->list);
-		mutex_unlock(&fparam->lock);
+		spin_unlock(&fparam->lock);
 		kfree(evt_pending);
 	}
 done_unlock:
@@ -1346,7 +1346,7 @@ int iommu_page_response(struct device *dev,
 		return -EINVAL;

 	/* Only send response if there is a fault report pending */
-	mutex_lock(&param->fault_param->lock);
+	spin_lock(&param->fault_param->lock);
 	if (list_empty(&param->fault_param->faults)) {
 		dev_warn_ratelimited(dev, "no pending PRQ, drop response\n");
 		goto done_unlock;
@@ -1383,7 +1383,7 @@ int iommu_page_response(struct device *dev,
 	}

 done_unlock:
-	mutex_unlock(&param->fault_param->lock);
+	spin_unlock(&param->fault_param->lock);
 	return ret;
 }
 EXPORT_SYMBOL_GPL(iommu_page_response);
diff --git a/include/linux/iommu.h b/include/linux/iommu.h
index 68d7d304cdb7..0d808ef68704 100644
--- a/include/linux/iommu.h
+++ b/include/linux/iommu.h
@@ -384,7 +384,7 @@ struct iommu_fault_param {
 	iommu_dev_fault_handler_t handler;
 	void *data;
 	struct list_head faults;
-	struct mutex lock;
+	spinlock_t lock;
 };

 /**
In the current iopf (IO page fault) processing flow of the iommu subsystem, each iopf event creates a queue_work item that is put onto the iopf workqueue. When the system schedules the work item, page table creation is completed. The basic processing flow is as follows:
        +------------------+
        |  recv iopf event |
        +--------+---------+
                 |
        +--------v---------+
        |  init iopf event |
        +--------+---------+
                 |
        +--------v---------+
        | create queue_work|
        +--------+---------+
                 |
        +--------v---------+
   +--->|  wait for        |
   |    | queue_work to run|
   |    +--------+---------+
   |             |
   | No     +----v----+
   +--------+  run ?  |
            +----+----+
                 | Yes
        +--------v---------+
   +--->| handle queue_work|
   |    | create page table|
   |    +--------+---------+
   |             |
   | No +--------v---------+
   +----+    processing    |
        |   completed ?    |
        +--------+---------+
                 | Yes
        +--------v---------+
        | return response  |
        +------------------+
In this processing flow, there are two points that can remain unresponsive for a long time under high system load:
1. While the queued work is waiting to be scheduled.
2. After the work has been scheduled, during page table creation.

A hardware error and reset occur when these two points cannot complete processing within the limited wait time of the hardware device.
So I added a timeout function to this iopf processing flow. First, a timer is created together with the queue_work event and its timeout is set; at the same time, an initial state is set for the iopf event.

Second, the timer times the interval from the queuing of the iopf event to the completion of its processing.

Third, if processing completes within the timeout period, the timer is simply destroyed. If the iopf event has still not been processed when the timer expires, the timer callback is invoked: it directly returns a processing-failure result for the current iopf to the device and sets the iopf event's status to the timeout state.

Finally, when the iopf event is eventually scheduled by the system, it is judged according to its state: if it is in the timeout state, it is skipped directly without processing.
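Summarising the above as a state diagram (a sketch of the described design, using the states this patch introduces):

    IOPF_TIME_INIT ----> IOPF_IN_PROCESS ----> IOPF_COMPLETE
      (timer armed,        (work running)        (timer deleted, normal
       work queued)                               response sent)

    any state before IOPF_COMPLETE --(timer expires)--> IOPF_TIME_OUT
      (a failure response is sent from the timer callback; when the work
       item finally runs, it sees this state and skips or aborts processing)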
This differs from Intel's previous timeout timer method: https://lore.kernel.org/lkml/1565900005-62508-3-git-send-email-jacob.jun.pan@linux.intel.com/

The current timeout timer creates a timer for each IO page fault request. After the page fault request is processed normally, or an error is returned on timeout, the timer is automatically destroyed and is not reused.
Signed-off-by: Longfang Liu <liulongfang@huawei.com>
---
 drivers/iommu/io-pgfault.c | 63 ++++++++++++++++++++++++++++++++++++--
 1 file changed, 61 insertions(+), 2 deletions(-)
diff --git a/drivers/iommu/io-pgfault.c b/drivers/iommu/io-pgfault.c
index cd7ebbfe4a67..0fb3ec1390a8 100644
--- a/drivers/iommu/io-pgfault.c
+++ b/drivers/iommu/io-pgfault.c
@@ -6,9 +6,11 @@
  */

 #include <linux/iommu.h>
+#include <linux/jiffies.h>
 #include <linux/list.h>
 #include <linux/sched/mm.h>
 #include <linux/slab.h>
+#include <linux/timer.h>
 #include <linux/workqueue.h>

 #include "iommu-sva.h"
@@ -56,11 +58,20 @@ struct iopf_fault {
 #define IOPF_RESPONSE_MAX_TIMEOUT	10000
 static unsigned long iopf_timeout_ms = IOPF_RESPONSE_DEF_TIMEOUT;

+enum iopf_timer_state {
+	IOPF_TIME_INIT = 0,
+	IOPF_IN_PROCESS,
+	IOPF_COMPLETE,
+	IOPF_TIME_OUT,
+};
+
 struct iopf_group {
 	struct iopf_fault last_fault;
 	struct list_head faults;
 	struct work_struct work;
 	struct device *dev;
+	struct timer_list iopf_timer;
+	atomic_t iopf_state;
 };

 static int __init iommu_set_iopf_timeout(char *str)
@@ -101,6 +112,32 @@ static int iopf_complete_group(struct device *dev, struct iopf_fault *iopf,
 	return iommu_page_response(dev, &resp);
 }

+static void iopf_timer_func(struct timer_list *t)
+{
+	struct iopf_group *group = from_timer(group, t, iopf_timer);
+
+	/* Workqueue processing timed out */
+	if (atomic_read(&group->iopf_state) < IOPF_COMPLETE) {
+		atomic_set(&group->iopf_state, IOPF_TIME_OUT);
+		iopf_complete_group(group->dev, &group->last_fault,
+				    IOMMU_PAGE_RESP_FAILURE);
+		dev_err(group->dev, "[timer] iopf workqueue process time out.\n");
+	}
+}
+
+static void iopf_timer_setup(struct iopf_group *group)
+{
+	atomic_set(&group->iopf_state, IOPF_TIME_INIT);
+
+	/* An IOPF timeout of 0 means no timeout tracking */
+	if (iopf_timeout_ms) {
+		timer_setup(&group->iopf_timer, iopf_timer_func, 0);
+		group->iopf_timer.expires = jiffies +
+					    msecs_to_jiffies(iopf_timeout_ms);
+		mod_timer(&group->iopf_timer, group->iopf_timer.expires);
+	}
+}
+
 static void iopf_handler(struct work_struct *work)
 {
 	struct iopf_group *group;
@@ -109,6 +146,17 @@ static void iopf_handler(struct work_struct *work)
 	enum iommu_page_response_code status = IOMMU_PAGE_RESP_SUCCESS;

 	group = container_of(work, struct iopf_group, work);
+	if (atomic_read(&group->iopf_state) == IOPF_TIME_OUT) {
+		list_for_each_entry_safe(iopf, next, &group->faults, list) {
+			if (!(iopf->fault.prm.flags &
+			      IOMMU_FAULT_PAGE_REQUEST_LAST_PAGE))
+				kfree(iopf);
+		}
+		dev_err(group->dev, "iopf workqueue wait time out.\n");
+		goto iopf_err;
+	}
+	atomic_set(&group->iopf_state, IOPF_IN_PROCESS);
+
 	domain = iommu_get_domain_for_dev_pasid(group->dev,
 				group->last_fault.fault.prm.pasid, 0);
 	if (!domain || !domain->iopf_handler)
@@ -128,7 +176,16 @@ static void iopf_handler(struct work_struct *work)
 		kfree(iopf);
 	}

+	if (atomic_read(&group->iopf_state) == IOPF_TIME_OUT) {
+		dev_err(group->dev, "iopf handle fault time out.\n");
+		goto iopf_err;
+	}
+
+	atomic_set(&group->iopf_state, IOPF_COMPLETE);
 	iopf_complete_group(group->dev, &group->last_fault, status);
+
+iopf_err:
+	del_timer_sync(&group->iopf_timer);
 	kfree(group);
 }

@@ -183,7 +240,6 @@ int iommu_queue_iopf(struct iommu_fault *fault, void *cookie)
 	struct dev_iommu *param = dev->iommu;

 	lockdep_assert_held(&param->lock);
-
 	if (fault->type != IOMMU_FAULT_PAGE_REQ)
 		/* Not a recoverable page fault */
 		return -EOPNOTSUPP;
@@ -222,6 +278,8 @@ int iommu_queue_iopf(struct iommu_fault *fault, void *cookie)

 	group->dev = dev;
 	group->last_fault.fault = *fault;
+	iopf_timer_setup(group);
+
 	INIT_LIST_HEAD(&group->faults);
 	list_add(&group->last_fault.list, &group->faults);
 	INIT_WORK(&group->work, iopf_handler);
@@ -414,7 +472,8 @@ struct iopf_queue *iopf_queue_alloc(const char *name)
 	 * that's dealt with, the high-level function can handle groups out of
 	 * order.
 	 */
-	queue->wq = alloc_workqueue("iopf_queue/%s", WQ_UNBOUND, 0, name);
+	queue->wq = alloc_workqueue("iopf_queue/%s", WQ_HIGHPRI |
+				    WQ_MEM_RECLAIM | WQ_UNBOUND, 0, name);
 	if (!queue->wq) {
 		kfree(queue);
 		return NULL;
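A note on the final hunk (an editorial observation, not from the patch description): the extra workqueue flags address the first stall point, since WQ_HIGHPRI places work items on a high-priority worker pool and WQ_MEM_RECLAIM guarantees a rescuer thread under memory pressure. A standalone equivalent of the allocation would be:

    struct workqueue_struct *wq;

    wq = alloc_workqueue("iopf_queue/%s",
                         WQ_HIGHPRI | WQ_MEM_RECLAIM | WQ_UNBOUND, 0, name);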
-----Original Message-----
From: liulongfang
Sent: 05 January 2023 07:48
To: Shameerali Kolothum Thodi <shameerali.kolothum.thodi@huawei.com>; Jonathan Cameron <jonathan.cameron@huawei.com>; Zengtao (B) <prime.zeng@hisilicon.com>
Cc: linuxarm@openeuler.org; yekai (A) <yekai13@huawei.com>; fanghao (A) <fanghao11@huawei.com>; Wangzhou (B) <wangzhou1@hisilicon.com>; shenyang (M) <shenyang39@huawei.com>; liulongfang <liulongfang@huawei.com>; qianweili <qianweili@huawei.com>; liudongdong (C) <liudongdong3@huawei.com>
Subject: [PATCH v3 3/3] iommu/iopf: add timeout function to iommu
[...]
The current timeout timer creates a timer for each IO page fault request. After the page fault request is processed normally, or an error is returned on timeout, the timer is automatically destroyed and is not reused.
I think it is better to have a per device timer rather than a timer for each request. What was the problem with Intel's approach?
Thanks, Shameer
On 2023/1/6 1:11, Shameerali Kolothum Thodi wrote:
[...]
The current timeout timer creates a timer for each IO page fault request. After the page fault request is processed normally, or an error is returned on timeout, the timer is automatically destroyed and is not reused.
I think it is better to have a per device timer rather than a timer for each request. What was the problem with Intel's approach?
Intel's patch did not handle the timer scheme completely. In addition, it produces a kernel calltrace under high system load.
Thanks, Shameer
Thanks, Longfang.
-----Original Message-----
From: liulongfang
Sent: 06 January 2023 00:55
To: Shameerali Kolothum Thodi <shameerali.kolothum.thodi@huawei.com>; Jonathan Cameron <jonathan.cameron@huawei.com>; Zengtao (B) <prime.zeng@hisilicon.com>
Cc: linuxarm@openeuler.org; yekai (A) <yekai13@huawei.com>; fanghao (A) <fanghao11@huawei.com>; Wangzhou (B) <wangzhou1@hisilicon.com>; shenyang (M) <shenyang39@huawei.com>; qianweili <qianweili@huawei.com>; liudongdong (C) <liudongdong3@huawei.com>
Subject: Re: [PATCH v3 3/3] iommu/iopf: add timeout function to iommu
On 2023/1/6 1:11, Shameerali Kolothum Thodi wrote:
[...]
The current timeout timer creates a timer for each IO page fault request. After the page fault request is processed normally, or an error is returned on timeout, the timer is automatically destroyed and is not reused.
I think it is better to have a per device timer rather than a timer for each request. What was the problem with Intel's approach?
Intel's patch did not handle the timer scheme completely.
You meant not calling iopf_complete_group()/iommu_page_response() from their timer fn? But they have listed out the options there for handling the timeout. So probably we could just send a FAILURE response from that timer fn for now and leave the other options for discussion.
In addition, it produces a kernel calltrace under high system load.
Have you debugged this further? Going through their patch, one problem could be that they reset the timer in iommu_report_device_fault(). So if there are multiple page faults from the same device, the timer expiry will not happen in case of a fault handler failure. If you move that timer reset to the iommu_page_response() fn, I think it will work.
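For illustration, the suggested rearrangement might look roughly like this (a sketch only; the timer field and function names are assumptions, not code from either patch):

    /*
     * Hypothetical variant of Intel's per-device timer: rearm it when a
     * response is sent, not when a new fault is reported, so a stuck
     * fault handler still lets the timer expire.
     */
    int example_page_response(struct device *dev)
    {
            struct iommu_fault_param *fparam = dev->iommu->fault_param;

            /* ... send the response to the hardware ... */

            /* Rearm only while faults remain pending (timer field assumed). */
            if (!list_empty(&fparam->faults))
                    mod_timer(&fparam->timer,
                              jiffies + msecs_to_jiffies(iopf_timeout_ms));
            return 0;
    }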
Thanks, Shameer
On 2023/1/6 16:57, Shameerali Kolothum Thodi wrote:
[...]
The current timeout timer creates a timer for each IO page fault request. After the page fault request is processed normally, or an error is returned on timeout, the timer is automatically destroyed and is not reused.
I think it is better to have a per device timer rather than a timer for each request. What was the problem with Intel's approach?
Intel's patch did not handle the timer scheme completely.
You meant not calling iopf_complete_group()/iommu_page_response() from their timer fn? But they have listed out the options there for handling the timeout. So probably we could just send a FAILURE response from that timer fn for now and leave the other options for discussion.
I didn't analyze their plan carefully, because their plan queries the list every time the timer is activated, and their processing part is still a TODO that has not been implemented.
In addition, it produces a kernel calltrace under high system load.
Have you debugged this further? Going through their patch, one problem could be that they reset the timer in iommu_report_device_fault(). So if there are multiple page faults from the same device, the timer expiry will not happen in case of a fault handler failure. If you move that timer reset to the iommu_page_response() fn, I think it will work.
Since they have not completed their plan themselves, and there are too many TODO parts in it that do not match our current needs, we did not debug their code. Their patch was tested on our version and caused a kernel calltrace, so we reverted it directly and redesigned another solution ourselves.
Thanks, Longfang.
-----Original Message-----
From: liulongfang
Sent: 07 January 2023 09:24
To: Shameerali Kolothum Thodi <shameerali.kolothum.thodi@huawei.com>; Jonathan Cameron <jonathan.cameron@huawei.com>; Zengtao (B) <prime.zeng@hisilicon.com>
Cc: linuxarm@openeuler.org; yekai (A) <yekai13@huawei.com>; fanghao (A) <fanghao11@huawei.com>; Wangzhou (B) <wangzhou1@hisilicon.com>; shenyang (M) <shenyang39@huawei.com>; qianweili <qianweili@huawei.com>; liudongdong (C) <liudongdong3@huawei.com>
Subject: Re: [PATCH v3 3/3] iommu/iopf: add timeout function to iommu
On 2023/1/6 16:57, Shameerali Kolothum Thodi wrote:
[...]
The current timeout timer creates a timer for each IO page fault request. After the page fault request is processed normally, or an error is returned on timeout, the timer is automatically destroyed and is not reused.
I think it is better to have a per device timer rather than a timer for each request. What was the problem with Intel's approach?
Intel's patch did not handle the timer scheme completely.
You meant not calling iopf_complete_group()/iommu_page_response() from their timer fn? But they have listed out the options there for handling the timeout. So probably we could just send a FAILURE response from that timer fn for now and leave the other options for discussion.
I didn't analyze their plan carefully, because their plan queries the list every time the timer is activated,
Because it is per device timer, not per event timer.
and their processing part is still TODO,
which has not been implemented.
Also, that ToDo list is how they see this interface being used further, which is valid. Remember this is common IOMMU code and should cater to ARM/Intel/AMD etc., and not just devices that support STALL mode but also normal PCI PRQ as well.
In addition, a kernel calltrace appears under high system load.
Have you debugged this further? Going through their patch, one problem could be that they reset the timer in iommu_report_device_fault(). So if there are multiple page faults from the same device, the timer expiry will not happen in case of a fault handler failure. If you move that timer reset to the iommu_page_response() fn, I think it will work.
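For illustration, a rough sketch of that per-device variant might look as follows (the struct and function names are hypothetical, not from either patch; iopf_timeout_ms is the setup parameter added in patch 1; locking and lifetime handling are omitted):

    #include <linux/device.h>
    #include <linux/jiffies.h>
    #include <linux/timer.h>

    /* Hypothetical per-device timeout state, e.g. hung off struct dev_iommu. */
    struct iopf_dev_timeout {
    	struct device *dev;
    	struct timer_list timer;	/* one timer per device, not per fault */
    };

    static void iopf_dev_timeout_fn(struct timer_list *t)
    {
    	struct iopf_dev_timeout *to = from_timer(to, t, timer);

    	/*
    	 * No page response has gone out for iopf_timeout_ms. Which
    	 * pending fault(s) to fail here is the open question in this
    	 * thread; the simplest option is a FAILURE page response.
    	 */
    	dev_err(to->dev, "iopf: no page response within timeout\n");
    }

    static void iopf_dev_timeout_init(struct iopf_dev_timeout *to,
    				  struct device *dev)
    {
    	to->dev = dev;
    	timer_setup(&to->timer, iopf_dev_timeout_fn, 0);
    }

    /*
     * Arm when a fault is reported, and refresh from iommu_page_response()
     * rather than from iommu_report_device_fault(), per the suggestion
     * above, so the deadline is only pushed out while responses are
     * actually flowing.
     */
    static void iopf_dev_timeout_kick(struct iopf_dev_timeout *to)
    {
    	if (iopf_timeout_ms)
    		mod_timer(&to->timer,
    			  jiffies + msecs_to_jiffies(iopf_timeout_ms));
    }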
Since they have not completed their plan themselves, and there are too many TODO parts in it that do not match our current needs, we did not debug their code. Their patch was tested on our version and caused a kernel calltrace, so we reverted it directly and redesigned another solution ourselves.
I think a per-device timer is more optimal than setting up a timer for each event, unless you can prove/explain otherwise. Since a solution has already been proposed on the ML, it is always good to analyse it and come up with an explanation of why a different solution is required.
Thanks, Shameer
Signed-off-by: Longfang Liu <liulongfang@huawei.com>
---
 drivers/iommu/io-pgfault.c | 63 ++++++++++++++++++++++++++++++++++++--
 1 file changed, 61 insertions(+), 2 deletions(-)

diff --git a/drivers/iommu/io-pgfault.c b/drivers/iommu/io-pgfault.c
index cd7ebbfe4a67..0fb3ec1390a8 100644
--- a/drivers/iommu/io-pgfault.c
+++ b/drivers/iommu/io-pgfault.c
@@ -6,9 +6,11 @@
  */

 #include <linux/iommu.h>
+#include <linux/jiffies.h>
 #include <linux/list.h>
 #include <linux/sched/mm.h>
 #include <linux/slab.h>
+#include <linux/timer.h>
 #include <linux/workqueue.h>

 #include "iommu-sva.h"
@@ -56,11 +58,20 @@ struct iopf_fault {
 #define IOPF_RESPONSE_MAX_TIMEOUT	10000
 static unsigned long iopf_timeout_ms = IOPF_RESPONSE_DEF_TIMEOUT;

+enum iopf_timer_state {
+	IOPF_TIME_INIT = 0,
+	IOPF_IN_PROCESS,
+	IOPF_COMPLETE,
+	IOPF_TIME_OUT,
+};
+
 struct iopf_group {
 	struct iopf_fault last_fault;
 	struct list_head faults;
 	struct work_struct work;
 	struct device *dev;
+	struct timer_list iopf_timer;
+	atomic_t iopf_state;
 };

 static int __init iommu_set_iopf_timeout(char *str)
@@ -101,6 +112,32 @@ static int iopf_complete_group(struct device *dev, struct iopf_fault *iopf,
 	return iommu_page_response(dev, &resp);
 }

+static void iopf_timer_func(struct timer_list *t)
+{
+	struct iopf_group *group = from_timer(group, t, iopf_timer);
+
+	/* Workqueue process time out */
+	if (atomic_read(&group->iopf_state) < IOPF_COMPLETE) {
+		atomic_set(&group->iopf_state, IOPF_TIME_OUT);
+		iopf_complete_group(group->dev, &group->last_fault,
+				    IOMMU_PAGE_RESP_FAILURE);
+		dev_err(group->dev, "[timer] iopf workqueue process time out.\n");
+	}
+}
+
+static void iopf_timer_setup(struct iopf_group *group)
+{
+	atomic_set(&group->iopf_state, IOPF_TIME_INIT);
+
+	/* IOPF timeout is 0 means no timeout tracking */
+	if (iopf_timeout_ms) {
+		timer_setup(&group->iopf_timer, iopf_timer_func, 0);
+		group->iopf_timer.expires = jiffies +
+					    msecs_to_jiffies(iopf_timeout_ms);
+		mod_timer(&group->iopf_timer, group->iopf_timer.expires);
+	}
+}
+
 static void iopf_handler(struct work_struct *work)
 {
 	struct iopf_group *group;
@@ -109,6 +146,17 @@ static void iopf_handler(struct work_struct *work)
 	enum iommu_page_response_code status = IOMMU_PAGE_RESP_SUCCESS;

 	group = container_of(work, struct iopf_group, work);
+	if (atomic_read(&group->iopf_state) == IOPF_TIME_OUT) {
+		list_for_each_entry_safe(iopf, next, &group->faults, list) {
+			if (!(iopf->fault.prm.flags &
+			      IOMMU_FAULT_PAGE_REQUEST_LAST_PAGE))
+				kfree(iopf);
+		}
+		dev_err(group->dev, "iopf workqueue wait time out.\n");
+		goto iopf_err;
+	}
+	atomic_set(&group->iopf_state, IOPF_IN_PROCESS);
+
 	domain = iommu_get_domain_for_dev_pasid(group->dev,
 				group->last_fault.fault.prm.pasid, 0);
 	if (!domain || !domain->iopf_handler)
@@ -128,7 +176,16 @@ static void iopf_handler(struct work_struct *work)
 		kfree(iopf);
 	}

+	if (atomic_read(&group->iopf_state) == IOPF_TIME_OUT) {
+		dev_err(group->dev, "iopf handle fault time out.\n");
+		goto iopf_err;
+	}
+
+	atomic_set(&group->iopf_state, IOPF_COMPLETE);
 	iopf_complete_group(group->dev, &group->last_fault, status);
+
+iopf_err:
+	del_timer_sync(&group->iopf_timer);
 	kfree(group);
 }

@@ -183,7 +240,6 @@ int iommu_queue_iopf(struct iommu_fault *fault, void *cookie)
 	struct dev_iommu *param = dev->iommu;

 	lockdep_assert_held(&param->lock);
-
 	if (fault->type != IOMMU_FAULT_PAGE_REQ)
 		/* Not a recoverable page fault */
 		return -EOPNOTSUPP;
@@ -222,6 +278,8 @@ int iommu_queue_iopf(struct iommu_fault *fault, void *cookie)

 	group->dev = dev;
 	group->last_fault.fault = *fault;
+	iopf_timer_setup(group);
+
 	INIT_LIST_HEAD(&group->faults);
 	list_add(&group->last_fault.list, &group->faults);
 	INIT_WORK(&group->work, iopf_handler);
@@ -414,7 +472,8 @@ struct iopf_queue *iopf_queue_alloc(const char *name)
 	 * that's dealt with, the high-level function can handle groups out of
 	 * order.
 	 */
-	queue->wq = alloc_workqueue("iopf_queue/%s", WQ_UNBOUND, 0, name);
+	queue->wq = alloc_workqueue("iopf_queue/%s", WQ_HIGHPRI |
+				    WQ_MEM_RECLAIM | WQ_UNBOUND, 0, name);
 	if (!queue->wq) {
 		kfree(queue);
 		return NULL;
--
2.24.0
On 2023/1/8 17:12, Shameerali Kolothum Thodi Wrote:
[...]
Also, that ToDo list is how they see this interface being used further, which is valid. Remember this is common IOMMU code and should cater to ARM/Intel/AMD etc., and not just devices that support STALL mode but also normal PCI PRQ as well.
Their plan may have value, but it is sure to fail. As for the reason for the failure, it is that their timer is created per device.
Every time the device receives a new IOPF event, it refreshes this timer, but the timing starting point cannot be updated (the previous IOPF event still needs to keep being timed). This leads to the fact that as long as one IOPF event times out, all IOPF events received by the device after it will also time out, making the device unusable.
Therefore, this problem can only be solved by creating an independent timer for each IOPF event.
Thanks, Longfang.
-----Original Message----- From: liulongfang Sent: 09 January 2023 02:39 To: Shameerali Kolothum Thodi shameerali.kolothum.thodi@huawei.com; Jonathan Cameron jonathan.cameron@huawei.com; Zengtao (B) prime.zeng@hisilicon.com Cc: linuxarm@openeuler.org; yekai (A) yekai13@huawei.com; fanghao (A) fanghao11@huawei.com; Wangzhou (B) wangzhou1@hisilicon.com; shenyang (M) shenyang39@huawei.com; qianweili qianweili@huawei.com; liudongdong (C) liudongdong3@huawei.com Subject: Re: [PATCH v3 3/3] iommu/iopf: add timeout function to iommu
[...]
Their plan may have value, but it is sure to fail. As for the reason for the failure, it is that their timer is created per device.
Every time the device receives a new IOPF event, it refreshes this timer, but the timing starting point cannot be updated (the previous IOPF event still needs to keep being timed),
As I said in a previous comment, could you please move that timer refresh to the page response path? I think that will solve the problem.
which leads to the fact that as long as one IOPF event times out, all IOPF events received by the device after it will also time out, making the device unusable.
Ok. I am not sure what the device does here when you send out a FAILURE response code to any one IOPF event. Does it continue to work after that, or does it require a reset?
Thanks, Shameer
On 2023/1/9 14:56, Shameerali Kolothum Thodi Wrote:
[...]
As I said in a previous comment, could you please move that timer refresh to the page response path? I think that will solve the problem.
You said "page response path" specifically refers to which part of the code ? can you give me a specific file path or function name?
In addition, why is it better to put the refresh timer in the "page response path" than the current response workqueue of IOPF? Can you explain why?
Ok. I am not sure what the device does here when you send out a FAILURE response code to any one IOPF event. Does it continue to work after that, or does it require a reset?
Take our current accelerator device as an example. When the timer returns a FAILURE, the accelerator device ends the current task and returns a task error message to the software, then waits for the next task to be delivered and processed.
Thanks, Longfang.
-----Original Message----- From: liulongfang Sent: 14 January 2023 07:46 To: Shameerali Kolothum Thodi shameerali.kolothum.thodi@huawei.com; Jonathan Cameron jonathan.cameron@huawei.com; Zengtao (B) prime.zeng@hisilicon.com Cc: linuxarm@openeuler.org; yekai (A) yekai13@huawei.com; fanghao (A) fanghao11@huawei.com; Wangzhou (B) wangzhou1@hisilicon.com; shenyang (M) shenyang39@huawei.com; qianweili qianweili@huawei.com; liudongdong (C) liudongdong3@huawei.com Subject: Re: [PATCH v3 3/3] iommu/iopf: add timeout function to iommu
[...]
You said "page response path" specifically refers to which part of the code ? can you give me a specific file path or function name?
In addition, why is it better to put the refresh timer in the "page response path" than the current response workqueue of IOPF? Can you explain why?
I think I have explained my reason earlier! Anyway, again from my previous comment:
" ... Going through their patch, one problem could be that they reset the timer in the iommu_report_device_fault(). So if there are multiple page faults from the same device, the timer expiry will not happen in case of a fault handler failure. If you move that timer reset to iommu_page_response() fn, I think it will work."
The idea is that if there is no response for a specified time, the per-device timer will fire and send the FAILURE code. Do you see any problem in moving the timer update to iommu_page_response()?
[...]
Take our current accelerator device as an example. When the timer returns a FAILURE, the accelerator device ends the current task and returns a task error message to the software, then waits for the next task to be delivered and processed.
Ok. So our ACC driver will handle it. Most likely other device drivers/vfio/iommufd may require a callback to handle anything specific with respect to this error.
Thanks, Shameer
On 2023/1/16 16:12, Shameerali Kolothum Thodi Wrote:
-----Original Message----- From: liulongfang Sent: 14 January 2023 07:46 To: Shameerali Kolothum Thodi shameerali.kolothum.thodi@huawei.com; Jonathan Cameron jonathan.cameron@huawei.com; Zengtao (B) prime.zeng@hisilicon.com Cc: linuxarm@openeuler.org; yekai (A) yekai13@huawei.com; fanghao (A) fanghao11@huawei.com; Wangzhou (B) wangzhou1@hisilicon.com; shenyang (M) shenyang39@huawei.com; qianweili qianweili@huawei.com; liudongdong (C) liudongdong3@huawei.com Subject: Re: [PATCH v3 3/3] iommu/iopf: add timeout function to iommu
On 2023/1/9 14:56, Shameerali Kolothum Thodi Wrote:
-----Original Message----- From: liulongfang Sent: 09 January 2023 02:39 To: Shameerali Kolothum Thodi
shameerali.kolothum.thodi@huawei.com;
Jonathan Cameron jonathan.cameron@huawei.com; Zengtao (B) prime.zeng@hisilicon.com Cc: linuxarm@openeuler.org; yekai (A) yekai13@huawei.com; fanghao
(A)
fanghao11@huawei.com; Wangzhou (B) wangzhou1@hisilicon.com; shenyang (M) shenyang39@huawei.com; qianweili qianweili@huawei.com; liudongdong (C) liudongdong3@huawei.com Subject: Re: [PATCH v3 3/3] iommu/iopf: add timeout function to iommu
On 2023/1/8 17:12, Shameerali Kolothum Thodi Wrote:
-----Original Message----- From: liulongfang Sent: 07 January 2023 09:24 To: Shameerali Kolothum Thodi
shameerali.kolothum.thodi@huawei.com;
Jonathan Cameron jonathan.cameron@huawei.com; Zengtao (B) prime.zeng@hisilicon.com Cc: linuxarm@openeuler.org; yekai (A) yekai13@huawei.com;
fanghao
(A) fanghao11@huawei.com; Wangzhou (B)
shenyang (M) shenyang39@huawei.com; qianweili qianweili@huawei.com; liudongdong (C)
Subject: Re: [PATCH v3 3/3] iommu/iopf: add timeout function to
iommu
On 2023/1/6 16:57, Shameerali Kolothum Thodi wrote:
>> -----Original Message-----
>> From: liulongfang
>> Sent: 06 January 2023 00:55
>> Subject: Re: [PATCH v3 3/3] iommu/iopf: add timeout function to iommu
>>
>> On 2023/1/6 1:11, Shameerali Kolothum Thodi wrote:
>>>> -----Original Message-----
>>>> From: liulongfang
>>>> Sent: 05 January 2023 07:48
>>>> Subject: [PATCH v3 3/3] iommu/iopf: add timeout function to iommu
>>>>
>>>> [...]
>>>> send-email-jacob.jun.pan@linux.intel.com/
>>>>
>>>> The current timeout timer creates a timer for each IO page fault
>>>> request. After the page fault request is processed normally, or an
>>>> error is returned on timeout, the timer is automatically destroyed
>>>> and is not reused.
>>>
>>> I think it is better to have a per-device timer rather than a timer
>>> for each request. What was the problem with Intel's approach?
>>
>> Intel's patch did not handle the timer scheme completely.
>
> You meant not calling iopf_complete_group()/iommu_page_response()
> from their timer fn? But they have listed out the options there to
> handle the timeout. So probably we could just send a FAILURE response
> from that timer fn for now and leave the other options for discussion.
I didn't analyze their plan carefully, because their plan queries the list every time the timer is activated,

Because it is a per-device timer, not a per-event timer.

and their processing part is still TODO, which has not been implemented.
Also, that TODO list is how they see this interface being used further, which is valid.
Remember this is common IOMMU code and should cater to ARM/Intel/AMD etc., covering not just devices that support STALL mode but also normal PCI PRQ.
Their plan may have value, but as it stands it is bound to fail. The reason for the failure is that their timers are created per device. Every time the device receives a new IOPF event, it refreshes this timer, but the timer's starting point cannot be updated (the previous IOPF event still needs to continue timing),
As I said in a previous comment, could you please move that timer refresh to the page response path? I think that will solve the problem.
You said "page response path" specifically refers to which part of the code ? can you give me a specific file path or function name?
In addition, why is it better to put the refresh timer in the "page response path" than the current response workqueue of IOPF? Can you explain why?
I think I have explained my reason earlier! Anyway, again from my previous comment:
" ... Going through their patch, one problem could be that they reset the timer in the iommu_report_device_fault(). So if there are multiple page faults from the same device, the timer expiry will not happen in case of a fault handler failure. If you move that timer reset to iommu_page_response() fn, I think it will work."
Intel's solution is based on a per-device timer. Even if it is refreshed at any time, it still cannot time each IOPF event individually. In a high-concurrency scenario, the iommu adds many events to the workqueue, and these events need to be timed simultaneously. No matter how the device-based timer is modified, it cannot provide this.
The idea is that if there is no response for a specified time, the per-device timer fires and sends the FAILURE code. Do you see any problem in moving the timer update to iommu_page_response()?
In this way, if the timer times out, only one FAILURE code is returned. The other events that have not returned results will still cause a device RAS error. The alternative is to take all iopf events of the current device out of the workqueue and send a FAILURE code for each one, but that would cause most of the iopf events that could have been processed normally to be destroyed by the timer.
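As a concrete illustration of this objection, consider a hypothetical timeline, assuming a 600 ms timeout and a single per-device timer re-armed on each response:

t =    0 ms   fault A reported     -> device timer armed, expires at 600 ms
t =  100 ms   fault B reported     -> still only one timer for the device
t =  500 ms   response for A sent  -> timer re-armed, now expires at 1100 ms
t = 1100 ms   timer fires          -> B has waited 1000 ms instead of the
                                      configured 600 ms, and only a single
                                      FAILURE response can be generated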
which leads to the fact that as soon as one IOPF event times out, all IOPF events received by the device after it will also time out, making the device completely unusable.
Ok. I am not sure what the device does here when you send out a FAILURE response code to any one IOPF event. Does it continue to work after that, or does it require a reset?
Take the current accelerator device as an example. When the timer returns a FAILURE, the accelerator device ends the current task and returns a task error message to the software, then waits for the delivery and processing of the next task.
Ok. So our ACC driver will handle it. Most likely other device drivers/vfio/iommufd may require a callback to handle anything specific with respect to this error.
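For illustration, such a callback might look like the sketch below. No hook of this kind exists in the current patch; the struct and function names are hypothetical.

/* Hypothetical per-driver hook, invoked when the IOPF timeout fires. */
struct iopf_timeout_ops {
	void (*iopf_timed_out)(struct device *dev, u32 pasid, u32 grpid);
};

/*
 * Example implementation for an accelerator driver: abort the task
 * owning the faulting PASID and report the error to user space.
 */
static void acc_iopf_timed_out(struct device *dev, u32 pasid, u32 grpid)
{
	dev_warn(dev, "IOPF timed out: pasid %u grpid %u\n", pasid, grpid);
	/* driver-specific cleanup of the affected task would go here */
}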
Thanks, Shameer
Thanks, Longfang.
Thanks, Shameer
Therefore, this problem can only be solved by creating an independent timer for each IOPF event, as the patch quoted below implements.
Thanks, Longfang.
>> In addition, a kernel call trace appears under high system load.
>
> Have you debugged this further? Going through their patch, one
> problem could be that they reset the timer in
> iommu_report_device_fault(). So if there are multiple page faults
> from the same device, the timer expiry will not happen in case of a
> fault handler failure. If you move that timer reset to
> iommu_page_response() fn, I think it will work.

Since they have not completed their plan themselves, and there are too many TODO parts in it that do not match our current needs, we did not debug their code. Their patch was tested on our version and caused a kernel call trace, so we reverted it directly and redesigned another solution ourselves.
I think a per-device timer is more optimal than setting up a timer for each event, unless you can prove/explain otherwise. Since a solution has already been proposed on the ML, it is always good to analyse it and come up with an explanation of why a different solution is required.
Thanks, Shameer
Thanks, Longfang.
> Thanks,
> Shameer
>
>>> Thanks,
>>> Shameer
>>
>> Thanks,
>> Longfang.
>>
>>>> Signed-off-by: Longfang Liu <liulongfang@huawei.com>
>>>> ---
>>>>  drivers/iommu/io-pgfault.c | 63 ++++++++++++++++++++++++++++++++++++--
>>>>  1 file changed, 61 insertions(+), 2 deletions(-)
>>>>
>>>> diff --git a/drivers/iommu/io-pgfault.c b/drivers/iommu/io-pgfault.c
>>>> index cd7ebbfe4a67..0fb3ec1390a8 100644
>>>> --- a/drivers/iommu/io-pgfault.c
>>>> +++ b/drivers/iommu/io-pgfault.c
>>>> @@ -6,9 +6,11 @@
>>>>   */
>>>>
>>>>  #include <linux/iommu.h>
>>>> +#include <linux/jiffies.h>
>>>>  #include <linux/list.h>
>>>>  #include <linux/sched/mm.h>
>>>>  #include <linux/slab.h>
>>>> +#include <linux/timer.h>
>>>>  #include <linux/workqueue.h>
>>>>
>>>>  #include "iommu-sva.h"
>>>> @@ -56,11 +58,20 @@ struct iopf_fault {
>>>>  #define IOPF_RESPONSE_MAX_TIMEOUT	10000
>>>>  static unsigned long iopf_timeout_ms = IOPF_RESPONSE_DEF_TIMEOUT;
>>>>
>>>> +enum iopf_timer_state {
>>>> +	IOPF_TIME_INIT = 0,
>>>> +	IOPF_IN_PROCESS,
>>>> +	IOPF_COMPLETE,
>>>> +	IOPF_TIME_OUT,
>>>> +};
>>>> +
>>>>  struct iopf_group {
>>>>  	struct iopf_fault last_fault;
>>>>  	struct list_head faults;
>>>>  	struct work_struct work;
>>>>  	struct device *dev;
>>>> +	struct timer_list iopf_timer;
>>>> +	atomic_t iopf_state;
>>>>  };
>>>>
>>>>  static int __init iommu_set_iopf_timeout(char *str)
>>>> @@ -101,6 +112,32 @@ static int iopf_complete_group(struct device *dev, struct iopf_fault *iopf,
>>>>  	return iommu_page_response(dev, &resp);
>>>>  }
>>>>
>>>> +static void iopf_timer_func(struct timer_list *t)
>>>> +{
>>>> +	struct iopf_group *group = from_timer(group, t, iopf_timer);
>>>> +
>>>> +	/* Workqueue process time out */
>>>> +	if (atomic_read(&group->iopf_state) < IOPF_COMPLETE) {
>>>> +		atomic_set(&group->iopf_state, IOPF_TIME_OUT);
>>>> +		iopf_complete_group(group->dev, &group->last_fault,
>>>> +				    IOMMU_PAGE_RESP_FAILURE);
>>>> +		dev_err(group->dev, "[timer] iopf workqueue process time out.\n");
>>>> +	}
>>>> +}
>>>> +
>>>> +static void iopf_timer_setup(struct iopf_group *group)
>>>> +{
>>>> +	atomic_set(&group->iopf_state, IOPF_TIME_INIT);
>>>> +
>>>> +	/* IOPF timeout is 0 means no timeout tracking */
>>>> +	if (iopf_timeout_ms) {
>>>> +		timer_setup(&group->iopf_timer, iopf_timer_func, 0);
>>>> +		group->iopf_timer.expires = jiffies +
>>>> +					    msecs_to_jiffies(iopf_timeout_ms);
>>>> +		mod_timer(&group->iopf_timer, group->iopf_timer.expires);
>>>> +	}
>>>> +}
>>>> +
>>>>  static void iopf_handler(struct work_struct *work)
>>>>  {
>>>>  	struct iopf_group *group;
>>>> @@ -109,6 +146,17 @@ static void iopf_handler(struct work_struct *work)
>>>>  	enum iommu_page_response_code status = IOMMU_PAGE_RESP_SUCCESS;
>>>>
>>>>  	group = container_of(work, struct iopf_group, work);
>>>> +	if (atomic_read(&group->iopf_state) == IOPF_TIME_OUT) {
>>>> +		list_for_each_entry_safe(iopf, next, &group->faults, list) {
>>>> +			if (!(iopf->fault.prm.flags &
>>>> +			      IOMMU_FAULT_PAGE_REQUEST_LAST_PAGE))
>>>> +				kfree(iopf);
>>>> +		}
>>>> +		dev_err(group->dev, "iopf workqueue wait time out.\n");
>>>> +		goto iopf_err;
>>>> +	}
>>>> +	atomic_set(&group->iopf_state, IOPF_IN_PROCESS);
>>>> +
>>>>  	domain = iommu_get_domain_for_dev_pasid(group->dev,
>>>>  				group->last_fault.fault.prm.pasid, 0);
>>>>  	if (!domain || !domain->iopf_handler)
>>>> @@ -128,7 +176,16 @@ static void iopf_handler(struct work_struct *work)
>>>>  		kfree(iopf);
>>>>  	}
>>>>
>>>> +	if (atomic_read(&group->iopf_state) == IOPF_TIME_OUT) {
>>>> +		dev_err(group->dev, "iopf handle fault time out.\n");
>>>> +		goto iopf_err;
>>>> +	}
>>>> +
>>>> +	atomic_set(&group->iopf_state, IOPF_COMPLETE);
>>>>  	iopf_complete_group(group->dev, &group->last_fault, status);
>>>> +
>>>> +iopf_err:
>>>> +	del_timer_sync(&group->iopf_timer);
>>>>  	kfree(group);
>>>>  }
>>>>
>>>> @@ -183,7 +240,6 @@ int iommu_queue_iopf(struct iommu_fault *fault, void *cookie)
>>>>  	struct dev_iommu *param = dev->iommu;
>>>>
>>>>  	lockdep_assert_held(&param->lock);
>>>> -
>>>>  	if (fault->type != IOMMU_FAULT_PAGE_REQ)
>>>>  		/* Not a recoverable page fault */
>>>>  		return -EOPNOTSUPP;
>>>> @@ -222,6 +278,8 @@ int iommu_queue_iopf(struct iommu_fault *fault, void *cookie)
>>>>
>>>>  	group->dev = dev;
>>>>  	group->last_fault.fault = *fault;
>>>> +	iopf_timer_setup(group);
>>>> +
>>>>  	INIT_LIST_HEAD(&group->faults);
>>>>  	list_add(&group->last_fault.list, &group->faults);
>>>>  	INIT_WORK(&group->work, iopf_handler);
>>>> @@ -414,7 +472,8 @@ struct iopf_queue *iopf_queue_alloc(const char *name)
>>>>  	 * that's dealt with, the high-level function can handle groups out of
>>>>  	 * order.
>>>>  	 */
>>>> -	queue->wq = alloc_workqueue("iopf_queue/%s", WQ_UNBOUND, 0, name);
>>>> +	queue->wq = alloc_workqueue("iopf_queue/%s", WQ_HIGHPRI |
>>>> +				    WQ_MEM_RECLAIM | WQ_UNBOUND, 0, name);
>>>>  	if (!queue->wq) {
>>>>  		kfree(queue);
>>>>  		return NULL;
>>>> --
>>>> 2.24.0