[Acc] Re: [PATCH v2 6/6] uadk/docs - support a simple interface for initialization

27 Aug 2022


      On 2022/8/11 14:45, Yang Shen wrote:
...
On 2022/8/11 11:02, fanghao (A) wrote:
...
On 2022/8/10 16:05, Yang Shen wrote:
...
On 2022/8/10 15:29, fanghao (A) wrote:
...
On 2022/7/22 11:46, Yang Shen wrote:
...
On 2022/7/21 16:38, fanghao (A) wrote:
...
On 2022/7/11 17:12, Yang Shen wrote:
> Due to the complexity of wd_alg_init, add wd_alg_init2 interface for
> users. And add the design documents.
>
> Signed-off-by: Yang Shen <shenyang39@huawei.com>
> ---
>   docs/wd_alg_init2.md | 176 +++++++++++++++++++++++++++++++++++++++++++
>   1 file changed, 176 insertions(+)
>   create mode 100644 docs/wd_alg_init2.md
>
> diff --git a/docs/wd_alg_init2.md b/docs/wd_alg_init2.md
> new file mode 100644
> index 0000000..3fb570c
> --- /dev/null
> +++ b/docs/wd_alg_init2.md
> @@ -0,0 +1,176 @@
> +# wd_alg_init2
> +
> +## Preface
> +
> +The current uadk initialization process is:
> +1. Call wd_request_ctx() to request ctxs from devices.
> +2. Call wd_sched_rr_alloc() to create a sched (or some other scheduler
> +   alloc function if one exists).
> +3. Initialize the sched.
> +4. Call wd_alg_init() with ctx_config and sched.
> +
> +```flow
> +st=>start: Start
> +o1=>operation: request ctxs
> +o2=>operation: create uadk_sched and instance ctxs to sched region
> +o3=>operation: call wd_alg_init
> +e=>end
> +st->o1->o2->o3->e
> +```
> +
> +The logic is reasonable, but in practice the `wd_request_ctx()` and
> +`wd_sched_rr_alloc()` steps are tedious, which makes the interface
> +hard to use. One of the main reasons is that uadk exposes many
> +scheduler settings in order to give users better performance. As a
> +result, the current uadk requires the user to divide hardware
> +resources according to the device topology during initialization.
> +As a high-level interface, this scheme therefore offers customized
> +configuration for users with advanced needs.
> +
> +## wd_alg_init2
> +
> +### Design
> +
> +Is there any way to simplify these steps? Not currently. In the
> +architecture model designed by uadk, hardware resources are managed
> +through a scheduler: once users have specified the hardware
> +resources, they no longer see them, and all subsequent tasks are
> +handled by the scheduler. The original intention of this design is
> +to make the scenarios supported by uadk more flexible. Because
> +resource requirements and task models differ between business
> +scenarios, the best performance is obtained by letting the
> +scheduler match them.
> +
> +But we can try to provide a layer of encapsulation. The intention
> +of this layer is that users only need to specify the available
> +resources and their requirements, and the interface completes the
> +resource configuration internally. The complexity of the previous
> +interface lies mainly in the parameter configuration of the ctxs
> +and the scheduler, where users can easily misunderstand a
> +parameter, make a configuration error and introduce bugs.
> +
> +All algorithms have the same input parameters and initialization
> +logic.
> +
> +```c
> +struct wd_ctx_config {
> +    __u32 ctx_num;
> +    struct wd_ctx *ctxs;
> +    void *priv;
> +};
> +
> +struct wd_sched {
> +    const char *name;
> +    int sched_policy;
> +    handle_t (*sched_init)(handle_t h_sched_ctx, void *sched_param);
> +    __u32 (*pick_next_ctx)(handle_t h_sched_ctx, void *sched_key,
> +                   const int sched_mode);
> +    int (*poll_policy)(handle_t h_sched_ctx, __u32 expect,
> +                       __u32 *count);
> +    handle_t h_sched_ctx;
> +};
> +
> +int wd_alg_init(struct wd_ctx_config *config, struct wd_sched *sched);
> +```
> +
> +`wd_ctx_config` is the descriptor of the requested ctxs; the
> +attributes of each ctx are contained in its own structure. The
> +scheduler uses these attributes to pick a ctx according to the
> +request type. The main difficulty in this step is that users need
> +to request ctxs from the appropriate device nodes according to how
> +their own business is distributed. If the user does not consider
> +the device distribution, requests may cross chips or NUMA nodes,
> +which hurts performance.
> +
> +`wd_sched` is the scheduler descriptor. It creates the scheduling
> +domains based on the parameters passed by the user. The user needs
> +to assign each requested ctx to the scheduling domain that matches
> +its attributes, so that uadk can select the appropriate ctxs for
> +the issued business. The main difficulty in this step is that the
> +user needs to initialize the correct scheduling domain according
> +to the attributes of the ctxs requested earlier. However, ctxs have
> +many attributes, which have to be divided along multiple
> +dimensions. If the parameters are not well understood, it is easy
> +to make queue allocation errors, so that the wrong ctxs are
> +scheduled when a task is finally issued, causing unexpected errors.
> +
> +Therefore, the next step is to describe the user's requirements
> +for these two input parameters with a small set of easy-to-use
> +parameters, while ensuring that the new interface init2 provides
> +the same functions as init. For ease of description, v1 refers to
> +the existing interface, and v2 refers to the layer of
> +encapsulation.
> +
> +Let's clarify the following logic first: all uacce devices under a
> +NUMA node can be regarded as the same. So although we request ctxs
> +from devices, we manage ctxs by NUMA node. That means if users
> +want the same performance on all CPUs, the uadk configuration
> +should be the same for all NUMA nodes.
> +
> +At present, at least 4 parameters are required to meet the user
> +configuration requirements while the functions of the v1 interface
> +remain unchanged.
> +
> +@device_list: The available uacce device list. Users can get it by
> +`wd_get_accel_list()`.
> +
> +@numa_bitmask: The bitmask provided by libnuma. Users can use this
> +parameter to control which devices ctxs are requested from in the
> +NUMA-binding scenario. It is mainly a convenience for the
> +CPU-binding scenario, where it avoids resource waste or
> +initialization failure caused by insufficient resources. Libnuma
> +provides a complete set of bitmask operations, which can be found
> +in numa.h.
> +
> +@ctx_nums: The requested ctx number for each NUMA node. Because
> +users may need different numbers of different types of ctx, a
> +two-dimensional array is needed as input.
> +
> +@sched_type: Scheduling type the user wants to use.
> +
> +To sum up, wd_alg_init2 is as follows:
> +
> +```c
> +struct wd_ctx_nums {
> +    __u32 sync_ctx_num;
> +    __u32 async_ctx_num;
> +};
> +
> +struct wd_ctx_params {
> +    __u32 ctx_set_num;
> +    struct wd_ctx_nums *ctx_set_size;
> +};
> +
> +int wd_alg_init2(struct uacce_dev_list *list, struct bitmask *bmp,
> +                 struct wd_ctx_params *cparams, __u32 sched_type);
> +```
> +
> +Some may say that wd_alg_init2 is still complex, since three of
> +its input parameters are structures. So the interface supports
> +default values for some parameters. @bmp can be set to NULL, in
> +which case it is initialized according to the device list.
> +@cparams can be set to NULL, in which case a default value from
> +wd_alg.c is used. @list and @sched_type are required.
> +
> +What's more, uadk provides a new set of interfaces to get the node
> +bitmask of a device list.
> +
> +```c
> +struct bitmask *wd_create_device_nodemask(struct uacce_dev_list *list);
> +
> +void wd_free_device_nodemask(struct bitmask *bmp);
> +```
> +
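To make the idea of a device-list node mask concrete, here is a minimal sketch. It does not use the real uadk or libnuma types: `demo_dev`, `demo_dev_list` and `demo_device_nodemask()` are hypothetical stand-ins, and a plain 64-bit integer replaces libnuma's `struct bitmask`. It only illustrates walking a device list and setting one bit per NUMA node that has a device.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-ins for a uacce device and a device list node. */
struct demo_dev {
    int numa_id;                    /* NUMA node the device sits on */
};

struct demo_dev_list {
    struct demo_dev *dev;
    struct demo_dev_list *next;
};

/*
 * Build a 64-bit node mask with one bit per NUMA node that owns at
 * least one device. Entries without a device, or with a node id
 * outside 0..63, are skipped.
 */
static uint64_t demo_device_nodemask(const struct demo_dev_list *list)
{
    uint64_t mask = 0;

    for (; list; list = list->next) {
        int node = list->dev ? list->dev->numa_id : -1;

        if (node >= 0 && node < 64)
            mask |= UINT64_C(1) << node;
    }

    return mask;
}
```

With devices on nodes 0 and 2, the resulting mask has bits 0 and 2 set. A mask like this is what init2 would presumably fall back to when @bmp is NULL.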
> +## Demo
> +
> +The simplest user initialization process is:
> +
> +```c
> +{
> +    ……
> +    struct uacce_dev_list *list;
> +    int ret;
> +
> +    list = wd_get_accel_list(alg);
> +    ret = wd_<alg>_init2_(list, NULL, NULL, sched_type);
This could be merged and simplified further:
 wd_<alg>_init2_(alg, mode_type, node_mask);
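The logic here is that our wd_alg layer uses a global variable,
wd_alg_setting, to maintain the resources of the whole algorithm layer.
So using alg and using device_list are really two different approaches.
Take COMPRESS as an example: when zlib and gzip belong to different
devices, device_list can support that scenario, whereas alg cannot; the
user would have to uninit and then init again.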
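  It looks like device_list cannot support two different devices either,
  since only one device's driver is statically bound.
  So fundamentally the static driver binding needs to change; the next
  step is to support dynamic registration.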
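Two device_lists can be fully supported. We can get the list of supported
algorithms from the device's attrs and then do ctx_alloc for the
different algorithms. Of course, the current code just does not support
this yet. The device_list I am talking about and drivers are two separate
things; a single driver may well support multiple devices.
If you mean two identical devices, then it is certainly the same driver.
But identical devices would presumably not have one supporting zlib and
the other supporting gzip.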
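These are all low-level implementation details. What matters is the
interface abstraction.
I found an earlier write-up that walks through this; take a look:
https://zhuanlan.zhihu.com/p/157973336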
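This is only an example. The point is that passing the algorithm directly
implies a constraint: one device must support all comp algorithms.
In fact, we could build a table of algorithms and the drivers that
support them. In the user-space driver layer, create a simple driver for
each sub-algorithm; its implementation interface can be reused (the
current driver can be used directly). Each algorithm maps to one driver,
and each driver carries its own name, which must match the name of the
device created in kernel space (only the suffix may differ).
For each request, first look up the user-space driver by algorithm name;
once the driver is found, look up the device by the driver name, and then
request queues on that device.
That removes the constraint.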
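The lookup chain proposed above (algorithm name -> user-space driver -> device) can be sketched as follows. This is only an illustration under stated assumptions: `alg_driver`, `dev_entry`, the three functions and the "<driver name>-<suffix>" naming rule are hypothetical and are not existing uadk code.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical: one user-space driver entry per sub-algorithm. */
struct alg_driver {
    const char *alg_name;   /* algorithm handled by this driver */
    const char *drv_name;   /* must match the kernel device name prefix */
};

/* Hypothetical stand-in for one entry of the device list. */
struct dev_entry {
    const char *dev_name;   /* assumed form: "<drv_name>-<suffix>" */
};

/* Step 1: find the user-space driver by algorithm name. */
static const struct alg_driver *find_driver(const struct alg_driver *tbl,
                                            size_t n, const char *alg)
{
    for (size_t i = 0; i < n; i++)
        if (strcmp(tbl[i].alg_name, alg) == 0)
            return &tbl[i];
    return NULL;
}

/* Step 2: find a device whose name is the driver name plus a suffix. */
static const struct dev_entry *find_device(const struct dev_entry *devs,
                                           size_t n, const char *drv_name)
{
    size_t len = strlen(drv_name);

    for (size_t i = 0; i < n; i++)
        if (strncmp(devs[i].dev_name, drv_name, len) == 0 &&
            devs[i].dev_name[len] == '-')
            return &devs[i];
    return NULL;
}

/* Steps 1 and 2 combined; queues would then be requested on the result. */
static const struct dev_entry *device_for_alg(const struct alg_driver *tbl,
                                              size_t ntbl,
                                              const struct dev_entry *devs,
                                              size_t ndevs, const char *alg)
{
    const struct alg_driver *drv = find_driver(tbl, ntbl, alg);

    return drv ? find_device(devs, ndevs, drv->drv_name) : NULL;
}
```

With a table mapping zlib and gzip to two different drivers, each algorithm resolves to its own device, so no single device has to support every comp algorithm.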
...
...
...
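Also, as I understand it, this has little to do with static versus
dynamic; the key is still to abstract the external interface. Here we see
both devices and drivers, and the two are coupled, yet in practice all
sorts of situations can arise. Our current model is derived entirely from
the hisilicon hardware model.
One feasible approach is to follow the crypto subsystem and expose only
drivers. Everything about devices goes into the drivers to handle
themselves. We could add a uadk_device and have it initialized entirely
by the drivers, rather than performing a pile of operations in uadk as we
do now. In fact, the uadk_device may not even be necessary; simply
keeping ctx would do.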
...
...
...
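A few additional thoughts:
If other implementations such as instruction extensions are added later,
uacce_dev should be moved inside. Users should only see alg when making
requests.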
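Having users see only alg is a reasonable design; it makes things easier
to understand and use. But for now it would require fairly large changes
to UADK. Starting from this question, here is a brief comparison of
warpdriver, uadk (warpdriver 2.0) and crypto.
Both warpdriver and crypto take this approach, whereas it is hard for
uadk to do so.
From the implementation point of view, this is because in both warpdriver
and crypto the relationship ctx(tfm) -> req is a one-to-many mapping. So
the processing unit for their reqs (software / hardware / instructions)
is already determined. uadk is different: it is a many-to-many mapping in
which a req faces a ctxs pool, and which ctx gets picked is decided
jointly by the req's attributes and the scheduler. Working backwards to
the initialization interfaces of the three frameworks: warpdriver and
crypto can initialize from the alg the user specifies, because a user who
wants to switch algorithms simply requests a new ctx(tfm).
uadk cannot. What uadk creates at initialization is a pool. If an
algorithm is not in the pool at initialization time, the user will not be
able to find a ctx for that algorithm later.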
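Then, looking at it from the original design perspective (I may be
misunderstanding), the main purpose of the ctxs pool is to serve the uadk
scheduler. I remember that shortly before the uadk design work we hit a
problem in the big-data scenario of the computing product line: many
processes would request a ctx and then go to sleep, neither sending tasks
nor releasing the queue, and the number of processes on a server could be
very high, so the accelerator device could not reach full bandwidth and
eventually became unusable.
I am not sure whether this problem influenced the uadk design. But as it
stands, uadk's approach fits this scenario better than warpdriver's. The
price of this design is that, compared with warpdriver and crypto, it has
an extra initialization step that is done once per process. This step
cannot be completed through alg and needs device_list instead.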
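It was meant to reduce ctx consumption, but the current balanced
allocation for scheduling actually wastes more ctxs. So this needs to be
optimized.
With balanced allocation plus core binding, performance will not be very
good. So without core binding, dynamic scheduling that follows CPU
migration also needs to be supported.
If we do not want that much complexity and simply require core binding,
then having uadk request ctxs from the nearest numa node by default to
run tasks would be simpler, I think.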
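As a rough illustration of the "nearest numa by default" idea, the sketch below picks, for a given CPU node, the closest node that still has free ctxs. The distance table and the free-ctx counts are passed in as plain arrays; `pick_nearest_node()` and `MAX_NODES` are hypothetical names, and real code would presumably obtain distances from libnuma (for example `numa_distance()`) rather than from a fixed table.

```c
#include <stddef.h>

#define MAX_NODES 4     /* illustrative fixed node count */

/*
 * Return the NUMA node nearest to cpu_node that still has free ctxs,
 * or -1 if cpu_node is out of range or no node has a free ctx.
 * dist[a][b] is the distance from node a to node b (smaller is closer).
 * On equal distance the lower node id wins.
 */
static int pick_nearest_node(int cpu_node,
                             const int dist[MAX_NODES][MAX_NODES],
                             const unsigned int free_ctxs[MAX_NODES])
{
    int best = -1;

    if (cpu_node < 0 || cpu_node >= MAX_NODES)
        return -1;

    for (int n = 0; n < MAX_NODES; n++) {
        if (free_ctxs[n] == 0)
            continue;
        if (best < 0 || dist[cpu_node][n] < dist[cpu_node][best])
            best = n;
    }

    return best;
}
```

A thread bound to a core would call this once with its own node and then request its ctxs from the returned node, which keeps the common case local without any per-node balancing.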
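I don't quite understand what scenario "balanced allocation" refers to
here.
Every numa node requests sync and async, compression and decompression.
The configuration here is decided by the user; the user may or may not
configure it. My interface just gives the user a way to do so.
The number used is up to the user, so there is no waste here.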
...
...
...
...
...
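uadk user space currently has a three-layer structure. libwd is the
lowest-level base interface; the interfaces it provides are called by
udriver and the alg layer. The alg layer's interfaces are for users.
So wd_request_ctx() and the like are called by udriver, and
wd_get_accel_list() and the like are called by alg.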
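"wd_request_ctx() and the like are called by udriver": my understanding
is that it should be the other way round. udriver registers callbacks
with libwd, and wd_request_ctx invokes the callbacks.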
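  That reading is probably because this interface can automatically
  search the devices under sys/uacce, so it is taken to be a high-level
  interface. But it should be a very low-level interface, analogous to
  alloc_qp in kernel space.
  After all, libwd is a very thin wrapper over the devices that uacce
  exposes to user space.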
...
...
There is also the scheduler; this coupling is a bit awkward. For now we
still need to see how to simplify it and move it inside.
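The coupling here is unavoidable. The starting point of the init2
interface design is still init, and the current init logic is to build a
scheduler region over the ctxs pool. So the interface parameters cannot
be avoided.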
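In fact, the original intent of doing scheduling in uadk was to avoid
exposing ctx and dev to the app; otherwise, if users did the scheduling
themselves, uadk would have saved a lot of work.
Could the scheduling be moved from the algorithm layer into the driver?
That would be very complex, though.
With init2, only sched_type needs to be passed; the current
simplification is good enough.
As for future evolution, only resource initialization will be upgraded;
we may well end up with init1, init2, init3... side by side.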
/lgtm
...
...
> +    wd_free_list_accel(list);
> +    ……
> +}
> +```

liulongfang