From: Pavel Begunkov <asml.silence(a)gmail.com>
mainline inclusion
from mainline-v6.1-rc1
commit 0091bfc81741b8d3aeb3b7ab8636f911b2de6e80
category: bugfix
bugzilla: https://gitee.com/src-openeuler/kernel/issues/I5WFKI
CVE: CVE-2022-2602
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit?h…
--------------------------------
Instead of putting io_uring's registered files in unix_gc(), we want it
to be done by io_uring itself. The trick here is to consider io_uring
registered files for cycle detection but not actually put them down.
Because io_uring can't register other ring instances, this will remove
all refs to the ring file, triggering the ->release path and cleanup
with io_ring_ctx_free().
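For background, here is a minimal userspace sketch of the registration
path whose cleanup this patch moves out of unix_gc(). It is illustrative
only: it assumes liburing is available, and the opened files are arbitrary
examples, not something this patch prescribes.

#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <liburing.h>

int main(void)
{
	struct io_uring ring;
	int fds[2], ret;

	if (io_uring_queue_init(8, &ring, 0) < 0)
		return 1;

	fds[0] = open("/dev/null", O_RDONLY);
	fds[1] = open("/dev/zero", O_RDONLY);
	if (fds[0] < 0 || fds[1] < 0)
		return 1;

	/* These become "registered files"; the kernel pins them via an
	 * SCM_RIGHTS-style skb on the ring's internal unix socket, which
	 * is how unix_gc() comes to see them for cycle detection.
	 */
	ret = io_uring_register_files(&ring, fds, 2);
	if (ret < 0)
		fprintf(stderr, "register_files: %s\n", strerror(-ret));

	/* Dropping the last reference to the ring file goes through
	 * ->release and io_ring_ctx_free(), which is where the registered
	 * files are now put, instead of in unix_gc().
	 */
	io_uring_queue_exit(&ring);
	return 0;
}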
Cc: stable(a)vger.kernel.org
Fixes: 6b06314c47e1 ("io_uring: add file set registration")
Reported-and-tested-by: David Bouman <dbouman03(a)gmail.com>
Signed-off-by: Pavel Begunkov <asml.silence(a)gmail.com>
Signed-off-by: Thadeu Lima de Souza Cascardo <cascardo(a)canonical.com>
[axboe: add kerneldoc comment to skb, fold in skb leak fix]
Signed-off-by: Jens Axboe <axboe(a)kernel.dk>
Conflicts:
fs/io_uring.c
include/linux/skbuff.h
Signed-off-by: Zhihao Cheng <chengzhihao1(a)huawei.com>
Reviewed-by: Yue Haibing <yuehaibing(a)huawei.com>
Reviewed-by: Xiu Jianfeng <xiujianfeng(a)huawei.com>
Signed-off-by: Yongqiang Liu <liuyongqiang13(a)huawei.com>
---
fs/io_uring.c | 1 +
include/linux/skbuff.h | 3 +++
net/unix/garbage.c | 20 ++++++++++++++++++++
3 files changed, 24 insertions(+)
diff --git a/fs/io_uring.c b/fs/io_uring.c
index d4e430b51098..7d7af6a0ef96 100644
--- a/fs/io_uring.c
+++ b/fs/io_uring.c
@@ -6835,6 +6835,7 @@ static int __io_sqe_files_scm(struct io_ring_ctx *ctx, int nr, int offset)
}
skb->sk = sk;
+ skb->scm_io_uring = 1;
nr_files = 0;
fpl->user = get_uid(ctx->user);
diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index dbdb03ac557f..4524bef053b8 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -654,6 +654,7 @@ typedef unsigned char *sk_buff_data_t;
* @transport_header: Transport layer header
* @network_header: Network layer header
* @mac_header: Link layer header
+ * @scm_io_uring: SKB holds io_uring registered files
* @tail: Tail pointer
* @end: End pointer
* @head: Head of buffer
@@ -800,6 +801,8 @@ struct sk_buff {
__u8 decrypted:1;
#endif
+ __u8 scm_io_uring:1;
+
#ifdef CONFIG_NET_SCHED
__u16 tc_index; /* traffic control index */
#endif
diff --git a/net/unix/garbage.c b/net/unix/garbage.c
index 4d283e26d816..5c9ff8df9136 100644
--- a/net/unix/garbage.c
+++ b/net/unix/garbage.c
@@ -209,6 +209,7 @@ void wait_for_unix_gc(void)
/* The external entry point: unix_gc() */
void unix_gc(void)
{
+ struct sk_buff *next_skb, *skb;
struct unix_sock *u;
struct unix_sock *next;
struct sk_buff_head hitlist;
@@ -302,11 +303,30 @@ void unix_gc(void)
spin_unlock(&unix_gc_lock);
+ /* We need io_uring to clean its registered files, ignore all io_uring
+ * originated skbs. It's fine as io_uring doesn't keep references to
+ * other io_uring instances and so killing all other files in the cycle
+ * will put all io_uring references forcing it to go through normal
+ * release path eventually putting registered files.
+ */
+ skb_queue_walk_safe(&hitlist, skb, next_skb) {
+ if (skb->scm_io_uring) {
+ __skb_unlink(skb, &hitlist);
+ skb_queue_tail(&skb->sk->sk_receive_queue, skb);
+ }
+ }
+
/* Here we are. Hitlist is filled. Die. */
__skb_queue_purge(&hitlist);
spin_lock(&unix_gc_lock);
+ /* There could be io_uring registered files, just push them back to
+ * the inflight list
+ */
+ list_for_each_entry_safe(u, next, &gc_candidates, link)
+ list_move_tail(&u->link, &gc_inflight_list);
+
/* All candidates should have been detached by now. */
BUG_ON(!list_empty(&gc_candidates));
--
2.25.1
From: Luo Meng <luomeng12(a)huawei.com>
hulk inclusion
category: bugfix
bugzilla: https://gitee.com/openeuler/kernel/issues/I5WBID
CVE: NA
--------------------------------
When dm_resume() and dm_destroy() run concurrently, a use-after-free
(UAF) can occur.
One such racing sequence is shown below:
              use                       free
do_resume                             |
  __find_device_hash_cell             |
    dm_get                            |
      atomic_inc(&md->holders)        |
                                      | dm_destroy
                                      |   __dm_destroy
                                      |     if (!dm_suspended_md(md))
                                      |     atomic_read(&md->holders)
                                      |     msleep(1)
  dm_resume                           |
    __dm_resume                       |
      dm_table_resume_targets         |
        pool_resume                   |
          do_waker  # add delay work  |
                                      | dm_table_destroy
                                      |   pool_dtr
                                      |     __pool_dec
                                      |       __pool_destroy
                                      |         destroy_workqueue
                                      |         kfree(pool)  # free pool
time out                              |
__do_softirq                          |
  run_timer_softirq  # pool has already been freed
This can be easily reproduced using:
1. create thin-pool
2. dmsetup suspend pool
3. dmsetup resume pool
4. dmsetup remove_all # Concurrent with 3
The root cause of the UAF is that dm_resume() re-arms the pool's timer
(via do_waker()) after dm_destroy() has already skipped cancelling it
because the device was still in the suspended state. When the timer later
fires, run_timer_softirq() runs the callback against a pool that has
already been freed, triggering the use-after-free.
Therefore, wait for md->holders to drop to zero before the suspend
handling, so that the timer is cancelled only after any concurrent
dm_resume() has completed and can no longer re-arm it.
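As a side note, the snippet below is a self-contained sketch of the
delayed-work pattern this race abuses; the toy_* names are made up for
illustration and are not dm code. A work item that re-arms itself, as
do_waker() does, must be synchronously cancelled (e.g. with
cancel_delayed_work_sync()) before its backing object is freed:

#include <linux/jiffies.h>
#include <linux/module.h>
#include <linux/slab.h>
#include <linux/workqueue.h>

struct toy_pool {
	struct delayed_work waker;
};

static struct toy_pool *pool;

/* Periodic work that re-arms itself, like do_waker() in dm-thin. */
static void toy_waker(struct work_struct *ws)
{
	struct toy_pool *p = container_of(to_delayed_work(ws),
					  struct toy_pool, waker);

	schedule_delayed_work(&p->waker, HZ);
}

/* "resume" side: arm the periodic work. */
static int __init toy_init(void)
{
	pool = kzalloc(sizeof(*pool), GFP_KERNEL);
	if (!pool)
		return -ENOMEM;

	INIT_DELAYED_WORK(&pool->waker, toy_waker);
	schedule_delayed_work(&pool->waker, HZ);
	return 0;
}

/* "destroy" side: the work must be cancelled synchronously before
 * kfree(), and nothing may be able to re-arm it concurrently; otherwise
 * the timer fires later and touches freed memory, which is the UAF
 * described above.
 */
static void __exit toy_exit(void)
{
	cancel_delayed_work_sync(&pool->waker);
	kfree(pool);
}

module_init(toy_init);
module_exit(toy_exit);
MODULE_LICENSE("GPL");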
Signed-off-by: Luo Meng <luomeng12(a)huawei.com>
Reviewed-by: Zhang Xiaoxu <zhangxiaoxu5(a)huawei.com>
Reviewed-by: Zhang Yi <yi.zhang(a)huawei.com>
Signed-off-by: Yongqiang Liu <liuyongqiang13(a)huawei.com>
---
drivers/md/dm.c | 26 +++++++++++++-------------
1 file changed, 13 insertions(+), 13 deletions(-)
diff --git a/drivers/md/dm.c b/drivers/md/dm.c
index 4c46f030eed2..288dab0ab226 100644
--- a/drivers/md/dm.c
+++ b/drivers/md/dm.c
@@ -2411,6 +2411,19 @@ static void __dm_destroy(struct mapped_device *md, bool wait)
if (dm_request_based(md) && md->kworker_task)
kthread_flush_worker(&md->kworker);
+ /*
+ * Rare, but there may be I/O requests still going to complete,
+ * for example. Wait for all references to disappear.
+ * No one should increment the reference count of the mapped_device,
+ * after the mapped_device state becomes DMF_FREEING.
+ */
+ if (wait)
+ while (atomic_read(&md->holders))
+ msleep(1);
+ else if (atomic_read(&md->holders))
+ DMWARN("%s: Forcibly removing mapped_device still in use! (%d users)",
+ dm_device_name(md), atomic_read(&md->holders));
+
/*
* Take suspend_lock so that presuspend and postsuspend methods
* do not race with internal suspend.
@@ -2427,19 +2440,6 @@ static void __dm_destroy(struct mapped_device *md, bool wait)
dm_put_live_table(md, srcu_idx);
mutex_unlock(&md->suspend_lock);
- /*
- * Rare, but there may be I/O requests still going to complete,
- * for example. Wait for all references to disappear.
- * No one should increment the reference count of the mapped_device,
- * after the mapped_device state becomes DMF_FREEING.
- */
- if (wait)
- while (atomic_read(&md->holders))
- msleep(1);
- else if (atomic_read(&md->holders))
- DMWARN("%s: Forcibly removing mapped_device still in use! (%d users)",
- dm_device_name(md), atomic_read(&md->holders));
-
dm_sysfs_exit(md);
dm_table_destroy(__unbind(md));
free_dev(md);
--
2.25.1