[Linuxarm] Re: [PATCH v3 3/3] iommu/iopf: add timeout function to iommu

9 Jan 2023

On 2023/1/8 17:12, Shameerali Kolothum Thodi Wrote:
...
...
-----Original Message-----
From: liulongfang
Sent: 07 January 2023 09:24
To: Shameerali Kolothum Thodi <shameerali.kolothum.thodi@huawei.com>;
Jonathan Cameron <jonathan.cameron@huawei.com>; Zengtao (B)
<prime.zeng@hisilicon.com>
Cc: linuxarm@openeuler.org; yekai (A) <yekai13@huawei.com>; fanghao (A)
<fanghao11@huawei.com>; Wangzhou (B) <wangzhou1@hisilicon.com>;
shenyang (M) <shenyang39@huawei.com>; qianweili
<qianweili@huawei.com>; liudongdong (C) <liudongdong3@huawei.com>
Subject: Re: [PATCH v3 3/3] iommu/iopf: add timeout function to iommu
On 2023/1/6 16:57, Shameerali Kolothum Thodi wrote:
...
...
-----Original Message-----
From: liulongfang
Sent: 06 January 2023 00:55
To: Shameerali Kolothum Thodi
<shameerali.kolothum.thodi@huawei.com>;
...
...
Jonathan Cameron <jonathan.cameron@huawei.com>; Zengtao (B)
<prime.zeng@hisilicon.com>
Cc: linuxarm@openeuler.org; yekai (A) <yekai13@huawei.com>; fanghao
(A)
<fanghao11@huawei.com>; Wangzhou (B) <wangzhou1@hisilicon.com>;
shenyang (M) <shenyang39@huawei.com>; qianweili
<qianweili@huawei.com>; liudongdong (C) <liudongdong3@huawei.com>
Subject: Re: [PATCH v3 3/3] iommu/iopf: add timeout function to iommu
On 2023/1/6 1:11, Shameerali Kolothum Thodi wrote:
...
...
-----Original Message-----
From: liulongfang
Sent: 05 January 2023 07:48
To: Shameerali Kolothum Thodi
<shameerali.kolothum.thodi@huawei.com>;
...
...
Jonathan Cameron <jonathan.cameron@huawei.com>; Zengtao (B)
<prime.zeng@hisilicon.com>
Cc: linuxarm@openeuler.org; yekai (A) <yekai13@huawei.com>;
fanghao
(A) <fanghao11@huawei.com>; Wangzhou (B)
<wangzhou1@hisilicon.com>;
shenyang (M) <shenyang39@huawei.com>; liulongfang
<liulongfang@huawei.com>; qianweili <qianweili@huawei.com>;
liudongdong (C) <liudongdong3@huawei.com>
Subject: [PATCH v3 3/3] iommu/iopf: add timeout function to iommu
In the current iopf(IO page fault) processing flow of iommu, each
iopf event will create a queue_work event and put it into the iopf
workqueue. When the system schedules the work event, the page table
creation will be completed.
Its basic processing flow is as follows:
+------------------+
         | recv iopf event  |
         +--------+---------+
                  |
         +--------v---------+
         | init iopf event  |
         +--------+---------+
                  |
         +--------v---------+
         | create queue_work|
         +--------+---------+
                  |
       +----------v-----------+
 +---->|  wait for queue_work |
 |     |       to run         |
 |     +----------+-----------+
 |                |
 |No         +----v----+
 +-----------+  run ?  |
             +----+----+
                  |Yes
        +---------v---------+
 +----->|  hand queue_work  |
 |      | create page table |
 |      +---------+---------+
 |                |
 |No   +----------v------------+
 +-----+ processing completed ?|
       +----------+------------+
                  |Yes
          +-------v---------+
          | return response |
          +-----------------+
In this processing flow, there are two nodes that will be unable to
respond for a long time in some scenarios of high system load:
1. On the node where the work queue is waiting to be scheduled.
2. After the work queue is scheduled, in the process of creating the
page table.
A hardware error and reset occurs when the two nodes cannot
complete
processing within the limited wait time of the hardware device.
So I added a timeout function to this iopf processing flow.
First, it creates a timer when creating the queue_work event and sets
its timeout. At the same time, set an initial state for the iopf
event.
Second, the timer is used to time the process from the waiting of the
iopf event to the completion of the iopf processing.
Third, if the processing is completed within the timeout period, the
timer will be destroyed directly.
If the processing of the iopf event has not been completed after the
timer timeout, the timer callback interface will be activated, it
will directly return the result of the current iopf processing
failure to the device, and set the status of the iopf event to the
timeout state.
Finally, when the iopf event is scheduled by the system, it will be
judged according to the state of the iopf event. If it is in the
timeout state, the iopf event will be skipped directly without
processing.
Different from the previous Intel's timeout timer method,
URL: https://lore.kernel.org/lkml/1565900005-62508-3-git-
send-email-jacob.jun.pan@linux.intel.com/
The current timeout timer will create a timer for each IO page fault
request. After the page fault request is processed normally or an
error is returned after timeout, the timer will be automatically
destroyed and will not be reused.
I think it is better to have a per device timer rather than a timer
for each request. What was the problem with Intel's approach?
Intel's patch did not handle the timer scheme completely.
You meant not calling the iopf_complete_group()/iommu_page_response()
from their timer fn? But they have listed out the options there to handle
the timeout. So probably we could just send FAILURE response from
that timer fn for now and leave the other options for discussion.
I didn't analyze their plan carefully, because their plan will query the list
every time the timer is activated,
Because it is per device timer, not per event timer.
and their processing part is still TODO,
...
which has not been implemented.
Also that ToDo list is how they see this interface will used further which is valid.
Remember this is common IOMMU code and should cater to ARM/Intel/AMD
etc and not just devices that support STALL mode but also normal PCI PRQ as well .
Their plan may have value, but it is a surefire one. As for the reason for the failure,
it is because their timers are created based on the device.

Every time the device receives a new IOPF event, it will refresh this timer, but its
timing time starting point cannot be updated (the previous IOPF event still needs to
continue timing), which leads to the fact that as long as one IOPF event times out,
the all IOPF events received by the device after this will time out, that making the
device unusable at all.

Therefore, to solve this problem, it can only be solved by creating a timer independently
for each IOPF event.

Thansk,
Longfang.
...
...
...
...
In addition, it will appear kernel calltrace under high system load.
Have you debugged this further? Going through their patch, one problem
could be that they reset the timer in the iommu_report_device_fault().
So if there are multiple page faults from the save device, the timer expiry
will not happen in case of a fault handler failure. If you move that timer
reset to iommu_page_response() fn, I think it will work.
Since they have not completed their plans themselves, and there are too
many TODO parts
in their plans, which do not match our current needs, we will not debug their
code.
Their patch has been tested on our version, and it will cause kernel calltrace,
so we reverted it directly, and redesigned another solution by ourselves.
I think per device timer is more optimum than setting up a timer for each event
unless you can prove/explain otherwise. Since a solution is already proposed in the ML,
it is always good to anlyse it and come up with an explanation why another different
solution is required.
Thanks,
Shameer
...
Thanks,
Longfang.
...
Thanks,
Shameer
...
...
Thanks,
Shameer
Thanks,
Longfang.
...
...
Signed-off-by: Longfang Liu <liulongfang@huawei.com>
---
 drivers/iommu/io-pgfault.c | 63
++++++++++++++++++++++++++++++++++++--
 1 file changed, 61 insertions(+), 2 deletions(-)

diff --git a/drivers/iommu/io-pgfault.c b/drivers/iommu/io-pgfault.c
index cd7ebbfe4a67..0fb3ec1390a8 100644
--- a/drivers/iommu/io-pgfault.c
+++ b/drivers/iommu/io-pgfault.c
@@ -6,9 +6,11 @@
  */
#include <linux/iommu.h>
+#include <linux/jiffies.h>
 #include <linux/list.h>
 #include <linux/sched/mm.h>
 #include <linux/slab.h>
+#include <linux/timer.h>
 #include <linux/workqueue.h>
#include "iommu-sva.h"
@@ -56,11 +58,20 @@ struct iopf_fault {
 #define IOPF_RESPONSE_MAX_TIMEOUT		10000
 static unsigned long iopf_timeout_ms =
IOPF_RESPONSE_DEF_TIMEOUT;
+enum iopf_timer_state {
+	IOPF_TIME_INIT = 0,
+	IOPF_IN_PROCESS,
+	IOPF_COMPLETE,
+	IOPF_TIME_OUT,
+};
+
 struct iopf_group {
  struct iopf_fault		last_fault;
  struct list_head		faults;
  struct work_struct		work;
  struct device			*dev;
+	struct timer_list		iopf_timer;
+	atomic_t			iopf_state;
 };
static int __init iommu_set_iopf_timeout(char *str)
@@ -101,6 +112,32 @@ static int iopf_complete_group(struct device
*dev,
struct iopf_fault *iopf,
  return iommu_page_response(dev, &resp);
 }
+static void iopf_timer_func(struct timer_list *t)
+{
+	struct iopf_group *group = from_timer(group, t, iopf_timer);
+
+	/* Workqueue process time out */
+	if (atomic_read(&group->iopf_state) < IOPF_COMPLETE) {
+		atomic_set(&group->iopf_state, IOPF_TIME_OUT);
+		iopf_complete_group(group->dev, &group->last_fault,
+					 IOMMU_PAGE_RESP_FAILURE);
+		dev_err(group->dev, "[timer] iopf workqueue process time
out.\n");
+	}
+}
+
+static void iopf_timer_setup(struct iopf_group *group)
+{
+	atomic_set(&group->iopf_state, IOPF_TIME_INIT);
+
+	/* IOPF timeout is 0 means no timeout tracking */
+	if (iopf_timeout_ms) {
+		timer_setup(&group->iopf_timer, iopf_timer_func, 0);
+		group->iopf_timer.expires = jiffies +
+					msecs_to_jiffies(iopf_timeout_ms);
+		mod_timer(&group->iopf_timer, group->iopf_timer.expires);
+	}
+}
+
 static void iopf_handler(struct work_struct *work)
 {
  struct iopf_group *group;
@@ -109,6 +146,17 @@ static void iopf_handler(struct work_struct
*work)
  enum iommu_page_response_code status =
IOMMU_PAGE_RESP_SUCCESS;
group = container_of(work, struct iopf_group, work);
+	if (atomic_read(&group->iopf_state) == IOPF_TIME_OUT) {
+		list_for_each_entry_safe(iopf, next, &group->faults, list) {
+			if (!(iopf->fault.prm.flags &
+			      IOMMU_FAULT_PAGE_REQUEST_LAST_PAGE))
+				kfree(iopf);
+		}
+		dev_err(group->dev, "iopf workqueue wait time out.\n");
+		goto iopf_err;
+	}
+	atomic_set(&group->iopf_state, IOPF_IN_PROCESS);
+
  domain = iommu_get_domain_for_dev_pasid(group->dev,
      		group->last_fault.fault.prm.pasid, 0);
  if (!domain || !domain->iopf_handler)
@@ -128,7 +176,16 @@ static void iopf_handler(struct work_struct
*work)
      	kfree(iopf);
  }
+	if (atomic_read(&group->iopf_state) == IOPF_TIME_OUT) {
+		dev_err(group->dev, "iopf handle fault time out.\n");
+		goto iopf_err;
+	}
+
+	atomic_set(&group->iopf_state, IOPF_COMPLETE);
  iopf_complete_group(group->dev, &group->last_fault, status);
+
+iopf_err:
+	del_timer_sync(&group->iopf_timer);
  kfree(group);
 }
@@ -183,7 +240,6 @@ int iommu_queue_iopf(struct iommu_fault
*fault,
...
...
...
void *cookie)
  struct dev_iommu *param = dev->iommu;
lockdep_assert_held(¶m->lock);
-
  if (fault->type != IOMMU_FAULT_PAGE_REQ)
      /* Not a recoverable page fault */
      return -EOPNOTSUPP;
@@ -222,6 +278,8 @@ int iommu_queue_iopf(struct iommu_fault
*fault,
void *cookie)
group->dev = dev;
  group->last_fault.fault = *fault;
+	iopf_timer_setup(group);
+
  INIT_LIST_HEAD(&group->faults);
  list_add(&group->last_fault.list, &group->faults);
  INIT_WORK(&group->work, iopf_handler);
@@ -414,7 +472,8 @@ struct iopf_queue *iopf_queue_alloc(const
char
*name)
   * that's dealt with, the high-level function can handle groups out
of
   * order.
   */
-	queue->wq = alloc_workqueue("iopf_queue/%s", WQ_UNBOUND,
0,
name);
+	queue->wq = alloc_workqueue("iopf_queue/%s", WQ_HIGHPRI |
+				 WQ_MEM_RECLAIM | WQ_UNBOUND, 0, name);
  if (!queue->wq) {
      kfree(queue);
      return NULL;
--
2.24.0
.
.
.