From: Ye Bin <yebin10@huawei.com>

mainline inclusion
from mainline-v5.16
commit a846a8e6c9a5949582c5a6a8bbc83a7d27fd891e
category: bugfix
bugzilla: 185668 https://gitee.com/openeuler/kernel/issues/I4DDEL
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i...
-----------------------------------------------
We got UAF report on v5.10 as follows:

[ 1446.674930] ==================================================================
[ 1446.675970] BUG: KASAN: use-after-free in blk_mq_get_driver_tag+0x9a4/0xa90
[ 1446.676902] Read of size 8 at addr ffff8880185afd10 by task kworker/1:2/12348
[ 1446.677851]
[ 1446.678073] CPU: 1 PID: 12348 Comm: kworker/1:2 Not tainted 5.10.0-10177-gc9c81b1e346a #2
[ 1446.679168] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.14.0-0-g155821a1990b-prebuilt.qemu.org 04/01/2014
[ 1446.680692] Workqueue: kthrotld blk_throtl_dispatch_work_fn
[ 1446.681448] Call Trace:
[ 1446.681800]  dump_stack+0x9b/0xce
[ 1446.682916]  print_address_description.constprop.6+0x3e/0x60
[ 1446.685999]  kasan_report.cold.9+0x22/0x3a
[ 1446.687186]  blk_mq_get_driver_tag+0x9a4/0xa90
[ 1446.687785]  blk_mq_dispatch_rq_list+0x21a/0x1d40
[ 1446.692576]  __blk_mq_do_dispatch_sched+0x394/0x830
[ 1446.695758]  __blk_mq_sched_dispatch_requests+0x398/0x4f0
[ 1446.698279]  blk_mq_sched_dispatch_requests+0xdf/0x140
[ 1446.698967]  __blk_mq_run_hw_queue+0xc0/0x270
[ 1446.699561]  __blk_mq_delay_run_hw_queue+0x4cc/0x550
[ 1446.701407]  blk_mq_run_hw_queue+0x13b/0x2b0
[ 1446.702593]  blk_mq_sched_insert_requests+0x1de/0x390
[ 1446.703309]  blk_mq_flush_plug_list+0x4b4/0x760
[ 1446.705408]  blk_flush_plug_list+0x2c5/0x480
[ 1446.708471]  blk_finish_plug+0x55/0xa0
[ 1446.708980]  blk_throtl_dispatch_work_fn+0x23b/0x2e0
[ 1446.711236]  process_one_work+0x6d4/0xfe0
[ 1446.711778]  worker_thread+0x91/0xc80
[ 1446.713400]  kthread+0x32d/0x3f0
[ 1446.714362]  ret_from_fork+0x1f/0x30
[ 1446.714846]
[ 1446.715062] Allocated by task 1:
[ 1446.715509]  kasan_save_stack+0x19/0x40
[ 1446.716026]  __kasan_kmalloc.constprop.1+0xc1/0xd0
[ 1446.716673]  blk_mq_init_tags+0x6d/0x330
[ 1446.717207]  blk_mq_alloc_rq_map+0x50/0x1c0
[ 1446.717769]  __blk_mq_alloc_map_and_request+0xe5/0x320
[ 1446.718459]  blk_mq_alloc_tag_set+0x679/0xdc0
[ 1446.719050]  scsi_add_host_with_dma.cold.3+0xa0/0x5db
[ 1446.719736]  virtscsi_probe+0x7bf/0xbd0
[ 1446.720265]  virtio_dev_probe+0x402/0x6c0
[ 1446.720808]  really_probe+0x276/0xde0
[ 1446.721320]  driver_probe_device+0x267/0x3d0
[ 1446.721892]  device_driver_attach+0xfe/0x140
[ 1446.722491]  __driver_attach+0x13a/0x2c0
[ 1446.723037]  bus_for_each_dev+0x146/0x1c0
[ 1446.723603]  bus_add_driver+0x3fc/0x680
[ 1446.724145]  driver_register+0x1c0/0x400
[ 1446.724693]  init+0xa2/0xe8
[ 1446.725091]  do_one_initcall+0x9e/0x310
[ 1446.725626]  kernel_init_freeable+0xc56/0xcb9
[ 1446.726231]  kernel_init+0x11/0x198
[ 1446.726714]  ret_from_fork+0x1f/0x30
[ 1446.727212]
[ 1446.727433] Freed by task 26992:
[ 1446.727882]  kasan_save_stack+0x19/0x40
[ 1446.728420]  kasan_set_track+0x1c/0x30
[ 1446.728943]  kasan_set_free_info+0x1b/0x30
[ 1446.729517]  __kasan_slab_free+0x111/0x160
[ 1446.730084]  kfree+0xb8/0x520
[ 1446.730507]  blk_mq_free_map_and_requests+0x10b/0x1b0
[ 1446.731206]  blk_mq_realloc_hw_ctxs+0x8cb/0x15b0
[ 1446.731844]  blk_mq_init_allocated_queue+0x374/0x1380
[ 1446.732540]  blk_mq_init_queue_data+0x7f/0xd0
[ 1446.733155]  scsi_mq_alloc_queue+0x45/0x170
[ 1446.733730]  scsi_alloc_sdev+0x73c/0xb20
[ 1446.734281]  scsi_probe_and_add_lun+0x9a6/0x2d90
[ 1446.734916]  __scsi_scan_target+0x208/0xc50
[ 1446.735500]  scsi_scan_channel.part.3+0x113/0x170
[ 1446.736149]  scsi_scan_host_selected+0x25a/0x360
[ 1446.736783]  store_scan+0x290/0x2d0
[ 1446.737275]  dev_attr_store+0x55/0x80
[ 1446.737782]  sysfs_kf_write+0x132/0x190
[ 1446.738313]  kernfs_fop_write_iter+0x319/0x4b0
[ 1446.738921]  new_sync_write+0x40e/0x5c0
[ 1446.739429]  vfs_write+0x519/0x720
[ 1446.739877]  ksys_write+0xf8/0x1f0
[ 1446.740332]  do_syscall_64+0x2d/0x40
[ 1446.740802]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
[ 1446.741462]
[ 1446.741670] The buggy address belongs to the object at ffff8880185afd00
[ 1446.741670]  which belongs to the cache kmalloc-256 of size 256
[ 1446.743276] The buggy address is located 16 bytes inside of
[ 1446.743276]  256-byte region [ffff8880185afd00, ffff8880185afe00)
[ 1446.744765] The buggy address belongs to the page:
[ 1446.745416] page:ffffea0000616b00 refcount:1 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x185ac
[ 1446.746694] head:ffffea0000616b00 order:2 compound_mapcount:0 compound_pincount:0
[ 1446.747719] flags: 0x1fffff80010200(slab|head)
[ 1446.748337] raw: 001fffff80010200 ffffea00006a3208 ffffea000061bf08 ffff88801004f240
[ 1446.749404] raw: 0000000000000000 0000000000100010 00000001ffffffff 0000000000000000
[ 1446.750455] page dumped because: kasan: bad access detected
[ 1446.751227]
[ 1446.751445] Memory state around the buggy address:
[ 1446.752102]  ffff8880185afc00: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
[ 1446.753090]  ffff8880185afc80: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
[ 1446.754079] >ffff8880185afd00: fa fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
[ 1446.755065]                    ^
[ 1446.755589]  ffff8880185afd80: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
[ 1446.756574]  ffff8880185afe00: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
[ 1446.757566] ==================================================================
The flag 'BLK_MQ_F_TAG_QUEUE_SHARED' will be set if the second device on the same host initializes its queue successfully. However, if the second device fails to allocate memory in blk_mq_alloc_and_init_hctx() from blk_mq_realloc_hw_ctxs() from blk_mq_init_allocated_queue(), __blk_mq_free_map_and_rqs() will be called on the error path, and if 'BLK_MQ_F_TAG_HCTX_SHARED' is not set, 'tag_set->tags' will be freed while it is still used by the first device.

To fix this issue, move the release of the newly allocated hardware contexts from blk_mq_realloc_hw_ctxs() to __blk_mq_update_nr_hw_queues(), since there is no need to release hardware contexts in blk_mq_init_allocated_queue().
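For clarity, the failure sequence looks roughly like this (an illustrative two-device timeline reconstructed from the report and description above, not part of the patch itself):

/*
 * CPU0 (device A, probed earlier)        CPU1 (device B, probing)
 * --------------------------------       ----------------------------------
 * dispatches I/O via set->tags[i]        blk_mq_init_allocated_queue()
 *                                          blk_mq_realloc_hw_ctxs()
 *                                            blk_mq_alloc_and_init_hctx()
 *                                              -> fails with -ENOMEM
 *                                          error path frees set->tags[i]
 * blk_mq_get_driver_tag()
 *   reads the freed set->tags[i]  ->  KASAN: use-after-free
 */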
Fixes: 868f2f0b7206 ("blk-mq: dynamic h/w context count")
Signed-off-by: Ye Bin <yebin10@huawei.com>
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20211108074019.1058843-1-yebin10@huawei.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
conflicts:
  block/blk-mq.c
Signed-off-by: Ye Bin <yebin10@huawei.com>
Reviewed-by: Jason Yan <yanaijie@huawei.com>
Signed-off-by: Chen Jun <chenjun102@huawei.com>
---
 block/blk-mq.c | 7 +++++--
 1 file changed, 5 insertions(+), 2 deletions(-)
diff --git a/block/blk-mq.c b/block/blk-mq.c
index bc7a04cc2acf..fac25524e99e 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -3237,8 +3237,6 @@ static void blk_mq_realloc_hw_ctxs(struct blk_mq_tag_set *set,
 		struct blk_mq_hw_ctx *hctx = hctxs[j];
 
 		if (hctx) {
-			if (hctx->tags)
-				blk_mq_free_map_and_requests(set, j);
 			blk_mq_exit_hctx(q, set, hctx, j);
 			hctxs[j] = NULL;
 		}
@@ -3724,8 +3722,13 @@ static void __blk_mq_update_nr_hw_queues(struct blk_mq_tag_set *set,
 	list_for_each_entry(q, &set->tag_list, tag_set_list) {
 		blk_mq_realloc_hw_ctxs(set, q);
 		if (q->nr_hw_queues != set->nr_hw_queues) {
+			int i = prev_nr_hw_queues;
+
 			pr_warn("Increasing nr_hw_queues to %d fails, fallback to %d\n",
 					nr_hw_queues, prev_nr_hw_queues);
+			for (; i < set->nr_hw_queues; i++)
+				blk_mq_free_map_and_requests(set, i);
+
 			set->nr_hw_queues = prev_nr_hw_queues;
 			blk_mq_map_queues(&set->map[HCTX_TYPE_DEFAULT]);
 			goto fallback;
From: Ard Biesheuvel <ardb@kernel.org>

mainline inclusion
from mainline-v5.15-rc2
commit c0e50736e826b51ddc437e6cf0dc68f07e4ad16b
bugzilla: 185697 https://gitee.com/openeuler/kernel/issues/I4DDEL
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i...
-----------------------------
A write to CSSELR needs to complete before its results can be observed via CCSIDR, so add an ISB to ensure that this is the case.
Acked-by: Nicolas Pitre <nico@fluxnic.net>
Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
Signed-off-by: Russell King <rmk+kernel@armlinux.org.uk>
Signed-off-by: He Ying <heying24@huawei.com>
Reviewed-by: Liao Chang <liaochang1@huawei.com>
Reviewed-by: Li Wei <liwei391@huawei.com>
Signed-off-by: Chen Jun <chenjun102@huawei.com>
---
 arch/arm/mm/cache-v7.S | 7 ++++---
 1 file changed, 4 insertions(+), 3 deletions(-)
diff --git a/arch/arm/mm/cache-v7.S b/arch/arm/mm/cache-v7.S
index dc8f152f3556..307f381eee71 100644
--- a/arch/arm/mm/cache-v7.S
+++ b/arch/arm/mm/cache-v7.S
@@ -38,9 +38,10 @@ icache_size:
  * procedures.
  */
 ENTRY(v7_invalidate_l1)
-	mov	r0, #0
-	mcr	p15, 2, r0, c0, c0, 0
-	mrc	p15, 1, r0, c0, c0, 0
+	mov	r0, #0
+	mcr	p15, 2, r0, c0, c0, 0	@ select L1 data cache in CSSELR
+	isb
+	mrc	p15, 1, r0, c0, c0, 0	@ read cache geometry from CCSIDR
 
 	movw	r1, #0x7fff
 	and	r2, r1, r0, lsr #13
From: Ming Lei <ming.lei@redhat.com>

mainline inclusion
from mainline-v5.16
commit a277654bafb51fb8b4cf23550f15926bb02536f4
category: bugfix
bugzilla: 182378 https://gitee.com/openeuler/kernel/issues/I4DDEL
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i...
---------------------------
Add two APIs for stopping and starting the admin queue.
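A minimal sketch of how callers are expected to use the new pair (the function below is hypothetical; later patches in this series convert real callers in exactly this way):

static void example_admin_teardown(struct nvme_ctrl *ctrl)
{
	nvme_stop_admin_queue(ctrl);	/* was: blk_mq_quiesce_queue(ctrl->admin_q) */
	/* ... abort or cancel outstanding admin commands ... */
	nvme_start_admin_queue(ctrl);	/* was: blk_mq_unquiesce_queue(ctrl->admin_q) */
}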
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20211014081710.1871747-2-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Reviewed-by: Jason Yan <yanaijie@huawei.com>
Signed-off-by: Chen Jun <chenjun102@huawei.com>
---
 drivers/nvme/host/core.c | 12 ++++++++++++
 drivers/nvme/host/nvme.h |  2 ++
 2 files changed, 14 insertions(+)
diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
index 99b5152482fe..509a44ff0ce5 100644
--- a/drivers/nvme/host/core.c
+++ b/drivers/nvme/host/core.c
@@ -4634,6 +4634,18 @@ void nvme_start_queues(struct nvme_ctrl *ctrl)
 }
 EXPORT_SYMBOL_GPL(nvme_start_queues);
 
+void nvme_stop_admin_queue(struct nvme_ctrl *ctrl)
+{
+	blk_mq_quiesce_queue(ctrl->admin_q);
+}
+EXPORT_SYMBOL_GPL(nvme_stop_admin_queue);
+
+void nvme_start_admin_queue(struct nvme_ctrl *ctrl)
+{
+	blk_mq_unquiesce_queue(ctrl->admin_q);
+}
+EXPORT_SYMBOL_GPL(nvme_start_admin_queue);
+
 void nvme_sync_io_queues(struct nvme_ctrl *ctrl)
 {
 	struct nvme_ns *ns;
diff --git a/drivers/nvme/host/nvme.h b/drivers/nvme/host/nvme.h
index 5dd1dd8021ba..0cad04335e34 100644
--- a/drivers/nvme/host/nvme.h
+++ b/drivers/nvme/host/nvme.h
@@ -647,6 +647,8 @@ void nvme_complete_async_event(struct nvme_ctrl *ctrl, __le16 status,
 
 void nvme_stop_queues(struct nvme_ctrl *ctrl);
 void nvme_start_queues(struct nvme_ctrl *ctrl);
+void nvme_stop_admin_queue(struct nvme_ctrl *ctrl);
+void nvme_start_admin_queue(struct nvme_ctrl *ctrl);
 void nvme_kill_queues(struct nvme_ctrl *ctrl);
 void nvme_sync_queues(struct nvme_ctrl *ctrl);
 void nvme_sync_io_queues(struct nvme_ctrl *ctrl);
From: Ming Lei <ming.lei@redhat.com>

mainline inclusion
from mainline-v5.16
commit 6ca1d9027e0d9ce5604a3e28de89456a76138034
category: bugfix
bugzilla: 182378 https://gitee.com/openeuler/kernel/issues/I4DDEL
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i...
---------------------------
Apply the two newly added APIs to quiesce/unquiesce the admin queue.
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20211014081710.1871747-3-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>

Conflict: commit f21c4769d0de ("nvme: rename nvme_init_identify()") is not backported
- drivers/nvme/host/rdma.c
- drivers/nvme/host/tcp.c
- drivers/nvme/target/loop.c
- drivers/nvme/host/fc.c

Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Reviewed-by: Jason Yan <yanaijie@huawei.com>
Signed-off-by: Chen Jun <chenjun102@huawei.com>
---
 drivers/nvme/host/core.c   |  2 +-
 drivers/nvme/host/fc.c     |  8 ++++----
 drivers/nvme/host/pci.c    |  8 ++++----
 drivers/nvme/host/rdma.c   | 14 +++++++-------
 drivers/nvme/host/tcp.c    | 16 ++++++++--------
 drivers/nvme/target/loop.c |  4 ++--
 6 files changed, 26 insertions(+), 26 deletions(-)
diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
index 509a44ff0ce5..dd425ae68c07 100644
--- a/drivers/nvme/host/core.c
+++ b/drivers/nvme/host/core.c
@@ -4555,7 +4555,7 @@ void nvme_kill_queues(struct nvme_ctrl *ctrl)
 
 	/* Forcibly unquiesce queues to avoid blocking dispatch */
 	if (ctrl->admin_q && !blk_queue_dying(ctrl->admin_q))
-		blk_mq_unquiesce_queue(ctrl->admin_q);
+		nvme_start_admin_queue(ctrl);
 
 	list_for_each_entry(ns, &ctrl->namespaces, list)
 		nvme_set_queue_dying(ns);
diff --git a/drivers/nvme/host/fc.c b/drivers/nvme/host/fc.c
index 906cab35afe7..b534a85e2bf1 100644
--- a/drivers/nvme/host/fc.c
+++ b/drivers/nvme/host/fc.c
@@ -2381,7 +2381,7 @@ nvme_fc_ctrl_free(struct kref *ref)
 	list_del(&ctrl->ctrl_list);
 	spin_unlock_irqrestore(&ctrl->rport->lock, flags);
 
-	blk_mq_unquiesce_queue(ctrl->ctrl.admin_q);
+	nvme_start_admin_queue(&ctrl->ctrl);
 	blk_cleanup_queue(ctrl->ctrl.admin_q);
 	blk_cleanup_queue(ctrl->ctrl.fabrics_q);
 	blk_mq_free_tag_set(&ctrl->admin_tag_set);
@@ -2509,7 +2509,7 @@ __nvme_fc_abort_outstanding_ios(struct nvme_fc_ctrl *ctrl, bool start_queues)
 	/*
 	 * clean up the admin queue. Same thing as above.
 	 */
-	blk_mq_quiesce_queue(ctrl->ctrl.admin_q);
+	nvme_stop_admin_queue(&ctrl->ctrl);
 	blk_sync_queue(ctrl->ctrl.admin_q);
 	blk_mq_tagset_busy_iter(&ctrl->admin_tag_set,
 				nvme_fc_terminate_exchange, &ctrl->ctrl);
@@ -3098,7 +3098,7 @@ nvme_fc_create_association(struct nvme_fc_ctrl *ctrl)
 	ctrl->ctrl.max_hw_sectors = ctrl->ctrl.max_segments <<
 						(ilog2(SZ_4K) - 9);
 
-	blk_mq_unquiesce_queue(ctrl->ctrl.admin_q);
+	nvme_start_admin_queue(&ctrl->ctrl);
 
 	ret = nvme_init_identify(&ctrl->ctrl);
 	if (ret || test_bit(ASSOC_FAILED, &ctrl->flags))
@@ -3245,7 +3245,7 @@ nvme_fc_delete_association(struct nvme_fc_ctrl *ctrl)
 	nvme_fc_free_queue(&ctrl->queues[0]);
 
 	/* re-enable the admin_q so anything new can fast fail */
-	blk_mq_unquiesce_queue(ctrl->ctrl.admin_q);
+	nvme_start_admin_queue(&ctrl->ctrl);
 
 	/* resume the io queues so that things will fast fail */
 	nvme_start_queues(&ctrl->ctrl);
diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
index 1b85349f57af..224907d8d5dc 100644
--- a/drivers/nvme/host/pci.c
+++ b/drivers/nvme/host/pci.c
@@ -1408,7 +1408,7 @@ static int nvme_suspend_queue(struct nvme_queue *nvmeq)
 
 	nvmeq->dev->online_queues--;
 	if (!nvmeq->qid && nvmeq->dev->ctrl.admin_q)
-		blk_mq_quiesce_queue(nvmeq->dev->ctrl.admin_q);
+		nvme_stop_admin_queue(&nvmeq->dev->ctrl);
 	if (!test_and_clear_bit(NVMEQ_POLLED, &nvmeq->flags))
 		pci_free_irq(to_pci_dev(nvmeq->dev->dev), nvmeq->cq_vector, nvmeq);
 	return 0;
@@ -1640,7 +1640,7 @@ static void nvme_dev_remove_admin(struct nvme_dev *dev)
 		 * user requests may be waiting on a stopped queue. Start the
 		 * queue to flush these to completion.
 		 */
-		blk_mq_unquiesce_queue(dev->ctrl.admin_q);
+		nvme_start_admin_queue(&dev->ctrl);
 		blk_cleanup_queue(dev->ctrl.admin_q);
 		blk_mq_free_tag_set(&dev->admin_tagset);
 	}
@@ -1674,7 +1674,7 @@ static int nvme_alloc_admin_tags(struct nvme_dev *dev)
 			return -ENODEV;
 		}
 	} else
-		blk_mq_unquiesce_queue(dev->ctrl.admin_q);
+		nvme_start_admin_queue(&dev->ctrl);
 
 	return 0;
 }
@@ -2516,7 +2516,7 @@ static void nvme_dev_disable(struct nvme_dev *dev, bool shutdown)
 	if (shutdown) {
 		nvme_start_queues(&dev->ctrl);
 		if (dev->ctrl.admin_q && !blk_queue_dying(dev->ctrl.admin_q))
-			blk_mq_unquiesce_queue(dev->ctrl.admin_q);
+			nvme_start_admin_queue(&dev->ctrl);
 	}
 	mutex_unlock(&dev->shutdown_lock);
 }
diff --git a/drivers/nvme/host/rdma.c b/drivers/nvme/host/rdma.c
index 51f4647ea214..5e2c86adeb27 100644
--- a/drivers/nvme/host/rdma.c
+++ b/drivers/nvme/host/rdma.c
@@ -918,7 +918,7 @@ static int nvme_rdma_configure_admin_queue(struct nvme_rdma_ctrl *ctrl,
 	else
 		ctrl->ctrl.max_integrity_segments = 0;
 
-	blk_mq_unquiesce_queue(ctrl->ctrl.admin_q);
+	nvme_start_admin_queue(&ctrl->ctrl);
 
 	error = nvme_init_identify(&ctrl->ctrl);
 	if (error)
@@ -927,7 +927,7 @@ static int nvme_rdma_configure_admin_queue(struct nvme_rdma_ctrl *ctrl,
 	return 0;
 
 out_quiesce_queue:
-	blk_mq_quiesce_queue(ctrl->ctrl.admin_q);
+	nvme_stop_admin_queue(&ctrl->ctrl);
 	blk_sync_queue(ctrl->ctrl.admin_q);
 out_stop_queue:
 	nvme_rdma_stop_queue(&ctrl->queues[0]);
@@ -1025,7 +1025,7 @@ static int nvme_rdma_configure_io_queues(struct nvme_rdma_ctrl *ctrl, bool new)
 static void nvme_rdma_teardown_admin_queue(struct nvme_rdma_ctrl *ctrl,
 		bool remove)
 {
-	blk_mq_quiesce_queue(ctrl->ctrl.admin_q);
+	nvme_stop_admin_queue(&ctrl->ctrl);
 	blk_sync_queue(ctrl->ctrl.admin_q);
 	nvme_rdma_stop_queue(&ctrl->queues[0]);
 	if (ctrl->ctrl.admin_tagset) {
@@ -1034,7 +1034,7 @@ static void nvme_rdma_teardown_admin_queue(struct nvme_rdma_ctrl *ctrl,
 		blk_mq_tagset_wait_completed_request(ctrl->ctrl.admin_tagset);
 	}
 	if (remove)
-		blk_mq_unquiesce_queue(ctrl->ctrl.admin_q);
+		nvme_start_admin_queue(&ctrl->ctrl);
 	nvme_rdma_destroy_admin_queue(ctrl, remove);
 }
 
@@ -1161,7 +1161,7 @@ static int nvme_rdma_setup_ctrl(struct nvme_rdma_ctrl *ctrl, bool new)
 		nvme_rdma_destroy_io_queues(ctrl, new);
 	}
 destroy_admin:
-	blk_mq_quiesce_queue(ctrl->ctrl.admin_q);
+	nvme_stop_admin_queue(&ctrl->ctrl);
 	blk_sync_queue(ctrl->ctrl.admin_q);
 	nvme_rdma_stop_queue(&ctrl->queues[0]);
 	nvme_cancel_admin_tagset(&ctrl->ctrl);
@@ -1201,7 +1201,7 @@ static void nvme_rdma_error_recovery_work(struct work_struct *work)
 	nvme_rdma_teardown_io_queues(ctrl, false);
 	nvme_start_queues(&ctrl->ctrl);
 	nvme_rdma_teardown_admin_queue(ctrl, false);
-	blk_mq_unquiesce_queue(ctrl->ctrl.admin_q);
+	nvme_start_admin_queue(&ctrl->ctrl);
 
 	if (!nvme_change_ctrl_state(&ctrl->ctrl, NVME_CTRL_CONNECTING)) {
 		/* state change failure is ok if we started ctrl delete */
@@ -2237,7 +2237,7 @@ static void nvme_rdma_shutdown_ctrl(struct nvme_rdma_ctrl *ctrl, bool shutdown)
 	cancel_delayed_work_sync(&ctrl->reconnect_work);
 
 	nvme_rdma_teardown_io_queues(ctrl, shutdown);
-	blk_mq_quiesce_queue(ctrl->ctrl.admin_q);
+	nvme_stop_admin_queue(&ctrl->ctrl);
 	if (shutdown)
 		nvme_shutdown_ctrl(&ctrl->ctrl);
 	else
diff --git a/drivers/nvme/host/tcp.c b/drivers/nvme/host/tcp.c
index e99d43989418..c014c5adbac5 100644
--- a/drivers/nvme/host/tcp.c
+++ b/drivers/nvme/host/tcp.c
@@ -1889,7 +1889,7 @@ static int nvme_tcp_configure_admin_queue(struct nvme_ctrl *ctrl, bool new)
 	if (error)
 		goto out_stop_queue;
 
-	blk_mq_unquiesce_queue(ctrl->admin_q);
+	nvme_start_admin_queue(ctrl);
 
 	error = nvme_init_identify(ctrl);
 	if (error)
@@ -1898,7 +1898,7 @@ static int nvme_tcp_configure_admin_queue(struct nvme_ctrl *ctrl, bool new)
 	return 0;
 
 out_quiesce_queue:
-	blk_mq_quiesce_queue(ctrl->admin_q);
+	nvme_stop_admin_queue(ctrl);
 	blk_sync_queue(ctrl->admin_q);
 out_stop_queue:
 	nvme_tcp_stop_queue(ctrl, 0);
@@ -1920,7 +1920,7 @@ static int nvme_tcp_configure_admin_queue(struct nvme_ctrl *ctrl, bool new)
 static void nvme_tcp_teardown_admin_queue(struct nvme_ctrl *ctrl,
 		bool remove)
 {
-	blk_mq_quiesce_queue(ctrl->admin_q);
+	nvme_stop_admin_queue(ctrl);
 	blk_sync_queue(ctrl->admin_q);
 	nvme_tcp_stop_queue(ctrl, 0);
 	if (ctrl->admin_tagset) {
@@ -1929,7 +1929,7 @@ static void nvme_tcp_teardown_admin_queue(struct nvme_ctrl *ctrl,
 		blk_mq_tagset_wait_completed_request(ctrl->admin_tagset);
 	}
 	if (remove)
-		blk_mq_unquiesce_queue(ctrl->admin_q);
+		nvme_start_admin_queue(ctrl);
 	nvme_tcp_destroy_admin_queue(ctrl, remove);
 }
 
@@ -1938,7 +1938,7 @@ static void nvme_tcp_teardown_io_queues(struct nvme_ctrl *ctrl,
 {
 	if (ctrl->queue_count <= 1)
 		return;
-	blk_mq_quiesce_queue(ctrl->admin_q);
+	nvme_stop_admin_queue(ctrl);
 	nvme_start_freeze(ctrl);
 	nvme_stop_queues(ctrl);
 	nvme_sync_io_queues(ctrl);
@@ -2030,7 +2030,7 @@ static int nvme_tcp_setup_ctrl(struct nvme_ctrl *ctrl, bool new)
 		nvme_tcp_destroy_io_queues(ctrl, new);
 	}
 destroy_admin:
-	blk_mq_quiesce_queue(ctrl->admin_q);
+	nvme_stop_admin_queue(ctrl);
 	blk_sync_queue(ctrl->admin_q);
 	nvme_tcp_stop_queue(ctrl, 0);
 	nvme_cancel_admin_tagset(ctrl);
@@ -2073,7 +2073,7 @@ static void nvme_tcp_error_recovery_work(struct work_struct *work)
 	/* unquiesce to fail fast pending requests */
 	nvme_start_queues(ctrl);
 	nvme_tcp_teardown_admin_queue(ctrl, false);
-	blk_mq_unquiesce_queue(ctrl->admin_q);
+	nvme_start_admin_queue(ctrl);
 
 	if (!nvme_change_ctrl_state(ctrl, NVME_CTRL_CONNECTING)) {
 		/* state change failure is ok if we started ctrl delete */
@@ -2091,7 +2091,7 @@ static void nvme_tcp_teardown_ctrl(struct nvme_ctrl *ctrl, bool shutdown)
 	cancel_delayed_work_sync(&to_tcp_ctrl(ctrl)->connect_work);
 
 	nvme_tcp_teardown_io_queues(ctrl, shutdown);
-	blk_mq_quiesce_queue(ctrl->admin_q);
+	nvme_stop_admin_queue(ctrl);
 	if (shutdown)
 		nvme_shutdown_ctrl(ctrl);
 	else
diff --git a/drivers/nvme/target/loop.c b/drivers/nvme/target/loop.c
index e17710e405a3..5b8a38c9475e 100644
--- a/drivers/nvme/target/loop.c
+++ b/drivers/nvme/target/loop.c
@@ -396,7 +396,7 @@ static int nvme_loop_configure_admin_queue(struct nvme_loop_ctrl *ctrl)
 	ctrl->ctrl.max_hw_sectors =
 		(NVME_LOOP_MAX_SEGMENTS - 1) << (PAGE_SHIFT - 9);
 
-	blk_mq_unquiesce_queue(ctrl->ctrl.admin_q);
+	nvme_start_admin_queue(&ctrl->ctrl);
 
 	error = nvme_init_identify(&ctrl->ctrl);
 	if (error)
@@ -426,7 +426,7 @@ static void nvme_loop_shutdown_ctrl(struct nvme_loop_ctrl *ctrl)
 		nvme_loop_destroy_io_queues(ctrl);
 	}
 
-	blk_mq_quiesce_queue(ctrl->ctrl.admin_q);
+	nvme_stop_admin_queue(&ctrl->ctrl);
 	if (ctrl->ctrl.state == NVME_CTRL_LIVE)
 		nvme_shutdown_ctrl(&ctrl->ctrl);
From: Ming Lei <ming.lei@redhat.com>

mainline inclusion
from mainline-v5.16
commit ebc9b95260151d966728cf0063b3b4e465f934d9
category: bugfix
bugzilla: 182378 https://gitee.com/openeuler/kernel/issues/I4DDEL
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i...
---------------------------
Add two helpers to prepare for pairing quiescing and unquiescing, which will be done in the next patch.
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20211014081710.1871747-4-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>

conflict: commit d17e66aadbe5 ("nvme: use set_capacity_and_notify in nvme_set_queue_dying") is not backported.
- drivers/nvme/host/core.c

Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Reviewed-by: Jason Yan <yanaijie@huawei.com>
Signed-off-by: Chen Jun <chenjun102@huawei.com>
---
 drivers/nvme/host/core.c | 54 ++++++++++++++++++++++++----------------
 1 file changed, 32 insertions(+), 22 deletions(-)
diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
index dd425ae68c07..cdea98169131 100644
--- a/drivers/nvme/host/core.c
+++ b/drivers/nvme/host/core.c
@@ -102,26 +102,6 @@ static void nvme_update_bdev_size(struct gendisk *disk)
 	}
 }
 
-/*
- * Prepare a queue for teardown.
- *
- * This must forcibly unquiesce queues to avoid blocking dispatch, and only set
- * the capacity to 0 after that to avoid blocking dispatchers that may be
- * holding bd_butex. This will end buffered writers dirtying pages that can't
- * be synced.
- */
-static void nvme_set_queue_dying(struct nvme_ns *ns)
-{
-	if (test_and_set_bit(NVME_NS_DEAD, &ns->flags))
-		return;
-
-	blk_set_queue_dying(ns->queue);
-	blk_mq_unquiesce_queue(ns->queue);
-
-	set_capacity(ns->disk, 0);
-	nvme_update_bdev_size(ns->disk);
-}
-
 static void nvme_queue_scan(struct nvme_ctrl *ctrl)
 {
 	/*
@@ -4540,6 +4520,36 @@ int nvme_init_ctrl(struct nvme_ctrl *ctrl, struct device *dev,
 }
 EXPORT_SYMBOL_GPL(nvme_init_ctrl);
 
+static void nvme_start_ns_queue(struct nvme_ns *ns)
+{
+	blk_mq_unquiesce_queue(ns->queue);
+}
+
+static void nvme_stop_ns_queue(struct nvme_ns *ns)
+{
+	blk_mq_quiesce_queue(ns->queue);
+}
+
+/*
+ * Prepare a queue for teardown.
+ *
+ * This must forcibly unquiesce queues to avoid blocking dispatch, and only set
+ * the capacity to 0 after that to avoid blocking dispatchers that may be
+ * holding bd_butex. This will end buffered writers dirtying pages that can't
+ * be synced.
+ */
+static void nvme_set_queue_dying(struct nvme_ns *ns)
+{
+	if (test_and_set_bit(NVME_NS_DEAD, &ns->flags))
+		return;
+
+	blk_set_queue_dying(ns->queue);
+	nvme_start_ns_queue(ns);
+
+	set_capacity(ns->disk, 0);
+	nvme_update_bdev_size(ns->disk);
+}
+
 /**
  * nvme_kill_queues(): Ends all namespace queues
  * @ctrl: the dead controller that needs to end
@@ -4618,7 +4628,7 @@ void nvme_stop_queues(struct nvme_ctrl *ctrl)
 
 	down_read(&ctrl->namespaces_rwsem);
 	list_for_each_entry(ns, &ctrl->namespaces, list)
-		blk_mq_quiesce_queue(ns->queue);
+		nvme_stop_ns_queue(ns);
 	up_read(&ctrl->namespaces_rwsem);
 }
 EXPORT_SYMBOL_GPL(nvme_stop_queues);
@@ -4629,7 +4639,7 @@ void nvme_start_queues(struct nvme_ctrl *ctrl)
 
 	down_read(&ctrl->namespaces_rwsem);
 	list_for_each_entry(ns, &ctrl->namespaces, list)
-		blk_mq_unquiesce_queue(ns->queue);
+		nvme_start_ns_queue(ns);
 	up_read(&ctrl->namespaces_rwsem);
 }
 EXPORT_SYMBOL_GPL(nvme_start_queues);
From: Ming Lei <ming.lei@redhat.com>

mainline inclusion
from mainline-v5.16
commit 9e6a6b1212100148c109675e003369e3e219dbd9
category: bugfix
bugzilla: 182378 https://gitee.com/openeuler/kernel/issues/I4DDEL
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i...
---------------------------
The current blk_mq_quiesce_queue() and blk_mq_unquiesce_queue() always stop and start the queue unconditionally, and concurrent quiesce/unquiesce can come from different, unrelated code paths, so an unquiesce may arrive unexpectedly and start the queue too early.

Prepare for supporting concurrent quiesce/unquiesce from multiple contexts, so that we can address the above issue.

NVMe has a very complicated quiesce/unquiesce use pattern; add one atomic bit to make sure that blk-mq quiesce/unquiesce is always called in pairs.
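A sketch of why the bit matters (illustrative interleaving, assuming the NVME_CTRL_ADMIN_Q_STOPPED bit introduced below):

/*
 *   context A: nvme_stop_admin_queue()  -> bit 0->1, queue quiesced once
 *   context A: nvme_stop_admin_queue()  -> bit already 1, no second quiesce
 *   context B: nvme_start_admin_queue() -> bit 1->0, queue unquiesced once
 *   context B: nvme_start_admin_queue() -> bit already 0, no-op
 *
 * Each state transition reaches blk-mq exactly once, so blk-mq sees
 * strictly paired quiesce/unquiesce calls.
 */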
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20211014081710.1871747-5-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>

Conflict: commit 8c4dfea97f15 ("nvme-fabrics: reject I/O to offline device") is not backported, so struct nvme_ctrl does not have the member 'flags' yet. That commit is a feature and contains a lot of code changes, thus introduce the new member in this patch.
- drivers/nvme/host/nvme.h

Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Reviewed-by: Jason Yan <yanaijie@huawei.com>
Signed-off-by: Chen Jun <chenjun102@huawei.com>
---
 drivers/nvme/host/core.c | 12 ++++++++----
 drivers/nvme/host/nvme.h |  3 +++
 2 files changed, 11 insertions(+), 4 deletions(-)
diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
index cdea98169131..9ccf44592fe4 100644
--- a/drivers/nvme/host/core.c
+++ b/drivers/nvme/host/core.c
@@ -4522,12 +4522,14 @@ EXPORT_SYMBOL_GPL(nvme_init_ctrl);
 
 static void nvme_start_ns_queue(struct nvme_ns *ns)
 {
-	blk_mq_unquiesce_queue(ns->queue);
+	if (test_and_clear_bit(NVME_NS_STOPPED, &ns->flags))
+		blk_mq_unquiesce_queue(ns->queue);
 }
 
 static void nvme_stop_ns_queue(struct nvme_ns *ns)
 {
-	blk_mq_quiesce_queue(ns->queue);
+	if (!test_and_set_bit(NVME_NS_STOPPED, &ns->flags))
+		blk_mq_quiesce_queue(ns->queue);
 }
 
 /*
@@ -4646,13 +4648,15 @@ EXPORT_SYMBOL_GPL(nvme_start_queues);
 
 void nvme_stop_admin_queue(struct nvme_ctrl *ctrl)
 {
-	blk_mq_quiesce_queue(ctrl->admin_q);
+	if (!test_and_set_bit(NVME_CTRL_ADMIN_Q_STOPPED, &ctrl->flags))
+		blk_mq_quiesce_queue(ctrl->admin_q);
 }
 EXPORT_SYMBOL_GPL(nvme_stop_admin_queue);
 
 void nvme_start_admin_queue(struct nvme_ctrl *ctrl)
 {
-	blk_mq_unquiesce_queue(ctrl->admin_q);
+	if (test_and_clear_bit(NVME_CTRL_ADMIN_Q_STOPPED, &ctrl->flags))
+		blk_mq_unquiesce_queue(ctrl->admin_q);
 }
 EXPORT_SYMBOL_GPL(nvme_start_admin_queue);
 
diff --git a/drivers/nvme/host/nvme.h b/drivers/nvme/host/nvme.h
index 0cad04335e34..ad46208eac76 100644
--- a/drivers/nvme/host/nvme.h
+++ b/drivers/nvme/host/nvme.h
@@ -338,6 +338,8 @@ struct nvme_ctrl {
 	u16 icdoff;
 	u16 maxcmd;
 	int nr_reconnects;
+	unsigned long flags;
+#define NVME_CTRL_ADMIN_Q_STOPPED	0
 	struct nvmf_ctrl_options *opts;
 
 	struct page *discard_page;
@@ -449,6 +451,7 @@ struct nvme_ns {
 #define NVME_NS_REMOVING	0
 #define NVME_NS_DEAD		1
 #define NVME_NS_ANA_PENDING	2
+#define NVME_NS_STOPPED		3
 
 	struct nvme_fault_inject fault_inject;
From: Ming Lei <ming.lei@redhat.com>

mainline inclusion
from mainline-v5.16
commit 1d35d519d8bf224ccdb43f9a235b8bda2d6d453c
category: bugfix
bugzilla: 182378 https://gitee.com/openeuler/kernel/issues/I4DDEL
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i...
---------------------------
nvme-loop's admin queue may be freed and reallocated, so we have to reset the NVME_CTRL_ADMIN_Q_STOPPED flag so that it matches the quiesce state of the admin queue.

nvme-loop is the only driver that reallocates its request queue; no such usage is seen in other nvme drivers.
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20211014081710.1871747-6-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Reviewed-by: Jason Yan <yanaijie@huawei.com>
Signed-off-by: Chen Jun <chenjun102@huawei.com>
---
 drivers/nvme/target/loop.c | 2 ++
 1 file changed, 2 insertions(+)
diff --git a/drivers/nvme/target/loop.c b/drivers/nvme/target/loop.c
index 5b8a38c9475e..a05dc29a43f9 100644
--- a/drivers/nvme/target/loop.c
+++ b/drivers/nvme/target/loop.c
@@ -382,6 +382,8 @@ static int nvme_loop_configure_admin_queue(struct nvme_loop_ctrl *ctrl)
 		error = PTR_ERR(ctrl->ctrl.admin_q);
 		goto out_cleanup_fabrics_q;
 	}
+	/* reset stopped state for the fresh admin queue */
+	clear_bit(NVME_CTRL_ADMIN_Q_STOPPED, &ctrl->ctrl.flags);
 
 	error = nvmf_connect_admin_queue(&ctrl->ctrl);
 	if (error)
From: Ming Lei <ming.lei@redhat.com>

mainline inclusion
from mainline-v5.16
commit e70feb8b3e6886c525c88943b5f1508d02f5a683
category: bugfix
bugzilla: 182378 https://gitee.com/openeuler/kernel/issues/I4DDEL
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i...
---------------------------
blk_mq_quiesce_queue() is used quite widely now, but so far we don't support concurrent/nested quiesce. The biggest issue is that unquiesce can happen unexpectedly when quiesce/unquiesce are run concurrently from more than one context.

This patch introduces q->quiesce_depth to deal with concurrent quiesce, and we only unquiesce the queue when the current call is the last/outermost one among all contexts.

Several kernel panics have been reported [1][2][3] when running stress quiesce tests, and this patch has been verified against these reports.
[1] https://lore.kernel.org/linux-block/9b21c797-e505-3821-4f5b-df7bf9380328@hua... [2] https://lore.kernel.org/linux-block/9b21c797-e505-3821-4f5b-df7bf9380328@hua... [3] https://listman.redhat.com/archives/dm-devel/2021-September/msg00189.html
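A minimal userspace model of the counting scheme (illustration only; the real code below takes q->queue_lock and operates on the request queue):

#include <assert.h>
#include <stdbool.h>

static int quiesce_depth;
static bool quiesced;

static void quiesce(void)
{
	if (!quiesce_depth++)
		quiesced = true;	/* only the first quiesce flips the state */
}

static void unquiesce(void)
{
	assert(quiesce_depth > 0);	/* models the WARN_ON_ONCE() below */
	if (!--quiesce_depth)
		quiesced = false;	/* only the outermost unquiesce restarts dispatch */
}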
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20211014081710.1871747-7-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Reviewed-by: Jason Yan <yanaijie@huawei.com>
Signed-off-by: Chen Jun <chenjun102@huawei.com>
---
 block/blk-mq.c         | 22 +++++++++++++++++++---
 include/linux/blkdev.h |  2 ++
 2 files changed, 21 insertions(+), 3 deletions(-)
diff --git a/block/blk-mq.c b/block/blk-mq.c
index fac25524e99e..01ec97aa9ec8 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -206,7 +206,12 @@ EXPORT_SYMBOL_GPL(blk_mq_unfreeze_queue);
  */
 void blk_mq_quiesce_queue_nowait(struct request_queue *q)
 {
-	blk_queue_flag_set(QUEUE_FLAG_QUIESCED, q);
+	unsigned long flags;
+
+	spin_lock_irqsave(&q->queue_lock, flags);
+	if (!q->quiesce_depth++)
+		blk_queue_flag_set(QUEUE_FLAG_QUIESCED, q);
+	spin_unlock_irqrestore(&q->queue_lock, flags);
 }
 EXPORT_SYMBOL_GPL(blk_mq_quiesce_queue_nowait);
 
@@ -247,10 +252,21 @@ EXPORT_SYMBOL_GPL(blk_mq_quiesce_queue);
  */
 void blk_mq_unquiesce_queue(struct request_queue *q)
 {
-	blk_queue_flag_clear(QUEUE_FLAG_QUIESCED, q);
+	unsigned long flags;
+	bool run_queue = false;
+
+	spin_lock_irqsave(&q->queue_lock, flags);
+	if (WARN_ON_ONCE(q->quiesce_depth <= 0)) {
+		;
+	} else if (!--q->quiesce_depth) {
+		blk_queue_flag_clear(QUEUE_FLAG_QUIESCED, q);
+		run_queue = true;
+	}
+	spin_unlock_irqrestore(&q->queue_lock, flags);
 
 	/* dispatch requests which are inserted during quiescing */
-	blk_mq_run_hw_queues(q, true);
+	if (run_queue)
+		blk_mq_run_hw_queues(q, true);
 }
 EXPORT_SYMBOL_GPL(blk_mq_unquiesce_queue);
 
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index 7517381b5e15..d6ae309a7dc3 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -571,6 +571,8 @@ struct request_queue {
 	 */
 	struct mutex mq_freeze_lock;
 
+	int quiesce_depth;
+
 	struct blk_mq_tag_set *tag_set;
 	struct list_head tag_set_list;
 	struct bio_set bio_split;
From: Ming Lei <ming.lei@redhat.com>

mainline inclusion
from mainline-v5.16
commit a1c2f7e7f25c9d35d3bf046f99682c5373b20fa2
category: bugfix
bugzilla: 182378 https://gitee.com/openeuler/kernel/issues/I4DDEL
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i...
---------------------------
For fixing the queue quiesce race between driver and block layer (elevator switch, nr_requests update, ...), we need to support concurrent quiesce and unquiesce, which requires the two calls to be balanced.

__bind() is only called from dm_swap_table(), in which the dm device has already been suspended, so there is no need to stop the queue again. This way, request queue quiesce and unquiesce can be balanced.
Reported-by: Yi Zhang <yi.zhang@redhat.com>
Fixes: e70feb8b3e68 ("blk-mq: support concurrent queue quiesce/unquiesce")
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Reviewed-by: Jason Yan <yanaijie@huawei.com>
Signed-off-by: Chen Jun <chenjun102@huawei.com>
---
 drivers/md/dm.c | 10 ----------
 1 file changed, 10 deletions(-)
diff --git a/drivers/md/dm.c b/drivers/md/dm.c
index 19a70f434029..3403722b1688 100644
--- a/drivers/md/dm.c
+++ b/drivers/md/dm.c
@@ -2038,16 +2038,6 @@ static struct dm_table *__bind(struct mapped_device *md, struct dm_table *t,
 
 	dm_table_event_callback(t, event_callback, md);
 
-	/*
-	 * The queue hasn't been stopped yet, if the old table type wasn't
-	 * for request-based during suspension. So stop it to prevent
-	 * I/O mapping before resume.
-	 * This must be done before setting the queue restrictions,
-	 * because request-based dm may be run just after the setting.
-	 */
-	if (request_based)
-		dm_stop_queue(q);
-
 	if (request_based) {
 		/*
 		 * Leverage the fact that request-based DM targets are
From: Ming Lei <ming.lei@redhat.com>

hulk inclusion
category: bugfix
bugzilla: 182378 https://gitee.com/openeuler/kernel/issues/I4DDEL
---------------------------
For fixing the queue quiesce race between driver and block layer (elevator switch, nr_requests update, ...), we need to support concurrent quiesce and unquiesce, which requires the two to be balanced.

blk_mq_quiesce_queue() calls blk_mq_quiesce_queue_nowait() to update the quiesce depth and mark the flag, so scsi_internal_device_block() actually ends up calling blk_mq_quiesce_queue_nowait() twice.

Fix the double quiesce and keep quiesce and unquiesce balanced.
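The double quiesce described above, as a call-chain sketch (before this patch):

/*
 * scsi_internal_device_block()
 *   scsi_internal_device_block_nowait()
 *     blk_mq_quiesce_queue_nowait()     <- quiesce depth becomes 1
 *   blk_mq_quiesce_queue()
 *     blk_mq_quiesce_queue_nowait()     <- quiesce depth becomes 2
 *
 * Only one unquiesce follows later, so the queue stays quiesced. After
 * this patch, scsi_internal_device_block() changes the device state via
 * __scsi_internal_device_block_nowait() and quiesces exactly once.
 */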
Reported-by: Yi Zhang <yi.zhang@redhat.com>
Fixes: e70feb8b3e68 ("blk-mq: support concurrent queue quiesce/unquiesce")
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Reviewed-by: Jason Yan <yanaijie@huawei.com>
Signed-off-by: Chen Jun <chenjun102@huawei.com>
---
 drivers/scsi/scsi_lib.c | 29 ++++++++++++++---------------
 1 file changed, 14 insertions(+), 15 deletions(-)
diff --git a/drivers/scsi/scsi_lib.c b/drivers/scsi/scsi_lib.c
index b26c922bd65d..2561d48dfade 100644
--- a/drivers/scsi/scsi_lib.c
+++ b/drivers/scsi/scsi_lib.c
@@ -2618,6 +2618,14 @@ scsi_target_resume(struct scsi_target *starget)
 }
 EXPORT_SYMBOL(scsi_target_resume);
 
+static int __scsi_internal_device_block_nowait(struct scsi_device *sdev)
+{
+	if (scsi_device_set_state(sdev, SDEV_BLOCK))
+		return scsi_device_set_state(sdev, SDEV_CREATED_BLOCK);
+
+	return 0;
+}
+
 /**
  * scsi_internal_device_block_nowait - try to transition to the SDEV_BLOCK state
  * @sdev: device to block
@@ -2634,24 +2642,16 @@ EXPORT_SYMBOL(scsi_target_resume);
  */
 int scsi_internal_device_block_nowait(struct scsi_device *sdev)
 {
-	struct request_queue *q = sdev->request_queue;
-	int err = 0;
-
-	err = scsi_device_set_state(sdev, SDEV_BLOCK);
-	if (err) {
-		err = scsi_device_set_state(sdev, SDEV_CREATED_BLOCK);
-
-		if (err)
-			return err;
-	}
+	int ret = __scsi_internal_device_block_nowait(sdev);
 
 	/*
 	 * The device has transitioned to SDEV_BLOCK.  Stop the
 	 * block layer from calling the midlayer with this device's
 	 * request queue.
 	 */
-	blk_mq_quiesce_queue_nowait(q);
-	return 0;
+	if (!ret)
+		blk_mq_quiesce_queue_nowait(sdev->request_queue);
+	return ret;
 }
 EXPORT_SYMBOL_GPL(scsi_internal_device_block_nowait);
 
@@ -2672,13 +2672,12 @@ EXPORT_SYMBOL_GPL(scsi_internal_device_block_nowait);
  */
 static int scsi_internal_device_block(struct scsi_device *sdev)
 {
-	struct request_queue *q = sdev->request_queue;
 	int err;
 
 	mutex_lock(&sdev->state_mutex);
-	err = scsi_internal_device_block_nowait(sdev);
+	err = __scsi_internal_device_block_nowait(sdev);
 	if (err == 0)
-		blk_mq_quiesce_queue(q);
+		blk_mq_quiesce_queue(sdev->request_queue);
 	mutex_unlock(&sdev->state_mutex);
 
 	return err;
From: Ming Lei <ming.lei@redhat.com>

hulk inclusion
category: bugfix
bugzilla: 182378 https://gitee.com/openeuler/kernel/issues/I4DDEL
---------------------------
For fixing the queue quiesce race between driver and block layer (elevator switch, nr_requests update, ...), we need to support concurrent quiesce and unquiesce, which requires the two calls to be balanced.

It isn't easy to audit that in all scsi drivers, especially since the two may be called from different contexts, so do it in the scsi core with one per-device bit flag and a global spinlock; this is basically zero cost since request queue quiesce is seldom triggered.
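How the flag keeps stop/start balanced (illustrative interleaving based on the helpers added below):

/*
 *   scsi_stop_queue(sdev, false)  -> queue_stopped 0->1, quiesce once
 *   scsi_stop_queue(sdev, true)   -> queue_stopped already 1, no-op
 *   scsi_start_queue(sdev)        -> queue_stopped 1->0, unquiesce once
 *
 * The global sdev_queue_stop_lock makes testing and updating the flag
 * atomic across contexts; since quiesce is seldom triggered, a single
 * shared spinlock is essentially free.
 */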
Reported-by: Yi Zhang <yi.zhang@redhat.com>
Fixes: e70feb8b3e68 ("blk-mq: support concurrent queue quiesce/unquiesce")
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Reviewed-by: Jason Yan <yanaijie@huawei.com>
Signed-off-by: Chen Jun <chenjun102@huawei.com>
---
 drivers/scsi/scsi_lib.c    | 45 ++++++++++++++++++++++++++++++--------
 include/scsi/scsi_device.h |  1 +
 2 files changed, 37 insertions(+), 9 deletions(-)
diff --git a/drivers/scsi/scsi_lib.c b/drivers/scsi/scsi_lib.c
index 2561d48dfade..9582aabf2ff1 100644
--- a/drivers/scsi/scsi_lib.c
+++ b/drivers/scsi/scsi_lib.c
@@ -2626,6 +2626,40 @@ static int __scsi_internal_device_block_nowait(struct scsi_device *sdev)
 	return 0;
 }
 
+static DEFINE_SPINLOCK(sdev_queue_stop_lock);
+
+void scsi_start_queue(struct scsi_device *sdev)
+{
+	bool need_start;
+	unsigned long flags;
+
+	spin_lock_irqsave(&sdev_queue_stop_lock, flags);
+	need_start = sdev->queue_stopped;
+	sdev->queue_stopped = 0;
+	spin_unlock_irqrestore(&sdev_queue_stop_lock, flags);
+
+	if (need_start)
+		blk_mq_unquiesce_queue(sdev->request_queue);
+}
+
+static void scsi_stop_queue(struct scsi_device *sdev, bool nowait)
+{
+	bool need_stop;
+	unsigned long flags;
+
+	spin_lock_irqsave(&sdev_queue_stop_lock, flags);
+	need_stop = !sdev->queue_stopped;
+	sdev->queue_stopped = 1;
+	spin_unlock_irqrestore(&sdev_queue_stop_lock, flags);
+
+	if (need_stop) {
+		if (nowait)
+			blk_mq_quiesce_queue_nowait(sdev->request_queue);
+		else
+			blk_mq_quiesce_queue(sdev->request_queue);
+	}
+}
+
 /**
  * scsi_internal_device_block_nowait - try to transition to the SDEV_BLOCK state
  * @sdev: device to block
@@ -2650,7 +2684,7 @@ int scsi_internal_device_block_nowait(struct scsi_device *sdev)
 	 * request queue.
 	 */
 	if (!ret)
-		blk_mq_quiesce_queue_nowait(sdev->request_queue);
+		scsi_stop_queue(sdev, true);
 	return ret;
 }
 EXPORT_SYMBOL_GPL(scsi_internal_device_block_nowait);
@@ -2677,19 +2711,12 @@ static int scsi_internal_device_block(struct scsi_device *sdev)
 	mutex_lock(&sdev->state_mutex);
 	err = __scsi_internal_device_block_nowait(sdev);
 	if (err == 0)
-		blk_mq_quiesce_queue(sdev->request_queue);
+		scsi_stop_queue(sdev, false);
 	mutex_unlock(&sdev->state_mutex);
 
 	return err;
 }
 
-void scsi_start_queue(struct scsi_device *sdev)
-{
-	struct request_queue *q = sdev->request_queue;
-
-	blk_mq_unquiesce_queue(q);
-}
-
 /**
  * scsi_internal_device_unblock_nowait - resume a device after a block request
  * @sdev: device to resume
diff --git a/include/scsi/scsi_device.h b/include/scsi/scsi_device.h
index 1a5c9a3df6d6..2b8c3df03f13 100644
--- a/include/scsi/scsi_device.h
+++ b/include/scsi/scsi_device.h
@@ -204,6 +204,7 @@ struct scsi_device {
 	unsigned unmap_limit_for_ws:1;	/* Use the UNMAP limit for WRITE SAME */
 	unsigned rpm_autosuspend:1;	/* Enable runtime autosuspend at device
 					 * creation time */
+	unsigned queue_stopped:1;	/* request queue is quiesced */
 
 	bool offline_already;		/* Device offline message logged */
 
From: Daniel Borkmann <daniel@iogearbox.net>

mainline inclusion
from mainline-v5.15-rc2
commit 8520e224f547cd070c7c8f97b1fc6d58cff7ccaa
bugzilla: 182050 https://gitee.com/openeuler/kernel/issues/I4DDEL
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i...
-----------------------------
Fix cgroup v1 interference when non-root cgroup v2 BPF programs are used. Back in the days, commit bd1060a1d671 ("sock, cgroup: add sock->sk_cgroup") embedded per-socket cgroup information into sock->sk_cgrp_data and in order to save 8 bytes in struct sock made both mutually exclusive, that is, when cgroup v1 socket tagging (e.g. net_cls/net_prio) is used, then cgroup v2 falls back to the root cgroup in sock_cgroup_ptr() (&cgrp_dfl_root.cgrp).
The assumption made was "there is no reason to mix the two and this is in line with how legacy and v2 compatibility is handled" as stated in bd1060a1d671. However, with Kubernetes more widely supporting cgroups v2 as well nowadays, this assumption no longer holds, and the possibility of the v1/v2 mixed mode with the v2 root fallback being hit becomes a real security issue.
Many of the cgroup v2 BPF programs are also used for policy enforcement, just to pick _one_ example, that is, to programmatically deny socket related system calls like connect(2) or bind(2). A v2 root fallback would implicitly cause a policy bypass for the affected Pods.
In production environments, we have recently seen this case due to various circumstances: i) a different 3rd party agent and/or ii) a container runtime such as [0] in the user's environment configuring legacy cgroup v1 net_cls tags, which triggered implicitly mentioned root fallback. Another case is Kubernetes projects like kind [1] which create Kubernetes nodes in a container and also add cgroup namespaces to the mix, meaning programs which are attached to the cgroup v2 root of the cgroup namespace get attached to a non-root cgroup v2 path from init namespace point of view. And the latter's root is out of reach for agents on a kind Kubernetes node to configure. Meaning, any entity on the node setting cgroup v1 net_cls tag will trigger the bypass despite cgroup v2 BPF programs attached to the namespace root.
Generally, this mutual exclusiveness does not hold anymore in today's user environments and makes cgroup v2 usage from BPF side fragile and unreliable. This fix adds proper struct cgroup pointer for the cgroup v2 case to struct sock_cgroup_data in order to address these issues; this implicitly also fixes the tradeoffs being made back then with regards to races and refcount leaks as stated in bd1060a1d671, and removes the fallback, so that cgroup v2 BPF programs always operate as expected.
[0] https://github.com/nestybox/sysbox/ [1] https://kind.sigs.k8s.io/
Fixes: bd1060a1d671 ("sock, cgroup: add sock->sk_cgroup")
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Stanislav Fomichev <sdf@google.com>
Acked-by: Tejun Heo <tj@kernel.org>
Link: https://lore.kernel.org/bpf/20210913230759.2313-1-daniel@iogearbox.net

Conflicts:
  include/linux/cgroup-defs.h
  net/core/netclassid_cgroup.c
  net/core/netprio_cgroup.c

Signed-off-by: Lu Jialin <lujialin4@huawei.com>
Reviewed-by: Xiu Jianfeng <xiujianfeng@huawei.com>
Signed-off-by: Chen Jun <chenjun102@huawei.com>
---
 include/linux/cgroup-defs.h  | 107 +++++++++--------------------------
 include/linux/cgroup.h       |  22 +------
 kernel/cgroup/cgroup.c       |  50 ++++------------
 net/core/netclassid_cgroup.c |   7 +--
 net/core/netprio_cgroup.c    |  11 +---
 5 files changed, 42 insertions(+), 155 deletions(-)
diff --git a/include/linux/cgroup-defs.h b/include/linux/cgroup-defs.h
index cea7518fe9dd..b87e13609f0f 100644
--- a/include/linux/cgroup-defs.h
+++ b/include/linux/cgroup-defs.h
@@ -772,107 +772,54 @@ static inline void cgroup_threadgroup_change_end(struct task_struct *tsk) {}
  * sock_cgroup_data is embedded at sock->sk_cgrp_data and contains
  * per-socket cgroup information except for memcg association.
  *
- * On legacy hierarchies, net_prio and net_cls controllers directly set
- * attributes on each sock which can then be tested by the network layer.
- * On the default hierarchy, each sock is associated with the cgroup it was
- * created in and the networking layer can match the cgroup directly.
- *
- * To avoid carrying all three cgroup related fields separately in sock,
- * sock_cgroup_data overloads (prioidx, classid) and the cgroup pointer.
- * On boot, sock_cgroup_data records the cgroup that the sock was created
- * in so that cgroup2 matches can be made; however, once either net_prio or
- * net_cls starts being used, the area is overriden to carry prioidx and/or
- * classid.  The two modes are distinguished by whether the lowest bit is
- * set.  Clear bit indicates cgroup pointer while set bit prioidx and
- * classid.
- *
- * While userland may start using net_prio or net_cls at any time, once
- * either is used, cgroup2 matching no longer works.  There is no reason to
- * mix the two and this is in line with how legacy and v2 compatibility is
- * handled.  On mode switch, cgroup references which are already being
- * pointed to by socks may be leaked.  While this can be remedied by adding
- * synchronization around sock_cgroup_data, given that the number of leaked
- * cgroups is bound and highly unlikely to be high, this seems to be the
- * better trade-off.
+ * On legacy hierarchies, net_prio and net_cls controllers directly
+ * set attributes on each sock which can then be tested by the network
+ * layer. On the default hierarchy, each sock is associated with the
+ * cgroup it was created in and the networking layer can match the
+ * cgroup directly.
  */
 struct sock_cgroup_data {
-	union {
-#ifdef __LITTLE_ENDIAN
-		struct {
-			u8	is_data : 1;
-			u8	no_refcnt : 1;
-			u8	unused : 6;
-			u8	padding;
-			u16	prioidx;
-			u32	classid;
-		} __packed;
-#else
-		struct {
-			u32	classid;
-			u16	prioidx;
-			u8	padding;
-			u8	unused : 6;
-			u8	no_refcnt : 1;
-			u8	is_data : 1;
-		} __packed;
+	struct cgroup	*cgroup; /* v2 */
+#ifdef CONFIG_CGROUP_NET_CLASSID
+	u32		classid; /* v1 */
+#endif
+#ifdef CONFIG_CGROUP_NET_PRIO
+	u16		prioidx; /* v1 */
 #endif
-		u64	val;
-	};
 };
 
-/*
- * There's a theoretical window where the following accessors race with
- * updaters and return part of the previous pointer as the prioidx or
- * classid.  Such races are short-lived and the result isn't critical.
- */
 static inline u16 sock_cgroup_prioidx(const struct sock_cgroup_data *skcd)
 {
-	/* fallback to 1 which is always the ID of the root cgroup */
-	return (skcd->is_data & 1) ? skcd->prioidx : 1;
+#ifdef CONFIG_CGROUP_NET_PRIO
+	return READ_ONCE(skcd->prioidx);
+#else
+	return 1;
+#endif
 }
 
 static inline u32 sock_cgroup_classid(const struct sock_cgroup_data *skcd)
 {
-	/* fallback to 0 which is the unconfigured default classid */
-	return (skcd->is_data & 1) ? skcd->classid : 0;
+#ifdef CONFIG_CGROUP_NET_CLASSID
+	return READ_ONCE(skcd->classid);
+#else
+	return 0;
+#endif
 }
 
-/*
- * If invoked concurrently, the updaters may clobber each other.  The
- * caller is responsible for synchronization.
- */
 static inline void sock_cgroup_set_prioidx(struct sock_cgroup_data *skcd,
 					   u16 prioidx)
 {
-	struct sock_cgroup_data skcd_buf = {{ .val = READ_ONCE(skcd->val) }};
-
-	if (sock_cgroup_prioidx(&skcd_buf) == prioidx)
-		return;
-
-	if (!(skcd_buf.is_data & 1)) {
-		skcd_buf.val = 0;
-		skcd_buf.is_data = 1;
-	}
-
-	skcd_buf.prioidx = prioidx;
-	WRITE_ONCE(skcd->val, skcd_buf.val); /* see sock_cgroup_ptr() */
+#ifdef CONFIG_CGROUP_NET_PRIO
+	WRITE_ONCE(skcd->prioidx, prioidx);
+#endif
 }
 
 static inline void sock_cgroup_set_classid(struct sock_cgroup_data *skcd,
 					   u32 classid)
 {
-	struct sock_cgroup_data skcd_buf = {{ .val = READ_ONCE(skcd->val) }};
-
-	if (sock_cgroup_classid(&skcd_buf) == classid)
-		return;
-
-	if (!(skcd_buf.is_data & 1)) {
-		skcd_buf.val = 0;
-		skcd_buf.is_data = 1;
-	}
-
-	skcd_buf.classid = classid;
-	WRITE_ONCE(skcd->val, skcd_buf.val); /* see sock_cgroup_ptr() */
+#ifdef CONFIG_CGROUP_NET_CLASSID
+	WRITE_ONCE(skcd->classid, classid);
+#endif
 }
 
 #else	/* CONFIG_SOCK_CGROUP_DATA */
diff --git a/include/linux/cgroup.h b/include/linux/cgroup.h
index 4c4f1ddc1f5f..9226ec130ba3 100644
--- a/include/linux/cgroup.h
+++ b/include/linux/cgroup.h
@@ -816,33 +816,13 @@ static inline void cgroup_account_cputime_field(struct task_struct *task,
  */
 #ifdef CONFIG_SOCK_CGROUP_DATA
 
-#if defined(CONFIG_CGROUP_NET_PRIO) || defined(CONFIG_CGROUP_NET_CLASSID)
-extern spinlock_t cgroup_sk_update_lock;
-#endif
-
-void cgroup_sk_alloc_disable(void);
 void cgroup_sk_alloc(struct sock_cgroup_data *skcd);
 void cgroup_sk_clone(struct sock_cgroup_data *skcd);
 void cgroup_sk_free(struct sock_cgroup_data *skcd);
 
 static inline struct cgroup *sock_cgroup_ptr(struct sock_cgroup_data *skcd)
 {
-#if defined(CONFIG_CGROUP_NET_PRIO) || defined(CONFIG_CGROUP_NET_CLASSID)
-	unsigned long v;
-
-	/*
-	 * @skcd->val is 64bit but the following is safe on 32bit too as we
-	 * just need the lower ulong to be written and read atomically.
-	 */
-	v = READ_ONCE(skcd->val);
-
-	if (v & 3)
-		return &cgrp_dfl_root.cgrp;
-
-	return (struct cgroup *)(unsigned long)v ?: &cgrp_dfl_root.cgrp;
-#else
-	return (struct cgroup *)(unsigned long)skcd->val;
-#endif
+	return skcd->cgroup;
 }
 
 #else	/* CONFIG_CGROUP_DATA */
diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c
index 5e4a50091c18..15f4474e0bad 100644
--- a/kernel/cgroup/cgroup.c
+++ b/kernel/cgroup/cgroup.c
@@ -6489,74 +6489,44 @@ int cgroup_parse_float(const char *input, unsigned dec_shift, s64 *v)
  */
 #ifdef CONFIG_SOCK_CGROUP_DATA
 
-#if defined(CONFIG_CGROUP_NET_PRIO) || defined(CONFIG_CGROUP_NET_CLASSID)
-
-DEFINE_SPINLOCK(cgroup_sk_update_lock);
-static bool cgroup_sk_alloc_disabled __read_mostly;
-
-void cgroup_sk_alloc_disable(void)
-{
-	if (cgroup_sk_alloc_disabled)
-		return;
-	pr_info("cgroup: disabling cgroup2 socket matching due to net_prio or net_cls activation\n");
-	cgroup_sk_alloc_disabled = true;
-}
-
-#else
-
-#define cgroup_sk_alloc_disabled	false
-
-#endif
-
 void cgroup_sk_alloc(struct sock_cgroup_data *skcd)
 {
-	if (cgroup_sk_alloc_disabled) {
-		skcd->no_refcnt = 1;
-		return;
-	}
-
 	/* Don't associate the sock with unrelated interrupted task's cgroup. */
 	if (in_interrupt())
 		return;
 
 	rcu_read_lock();
-
 	while (true) {
 		struct css_set *cset;
 
 		cset = task_css_set(current);
 		if (likely(cgroup_tryget(cset->dfl_cgrp))) {
-			skcd->val = (unsigned long)cset->dfl_cgrp;
+			skcd->cgroup = cset->dfl_cgrp;
 			cgroup_bpf_get(cset->dfl_cgrp);
 			break;
 		}
 		cpu_relax();
 	}
-
 	rcu_read_unlock();
 }
 
 void cgroup_sk_clone(struct sock_cgroup_data *skcd)
 {
-	if (skcd->val) {
-		if (skcd->no_refcnt)
-			return;
-
-		/*
-		 * We might be cloning a socket which is left in an empty
-		 * cgroup and the cgroup might have already been rmdir'd.
-		 * Don't use cgroup_get_live().
-		 */
-		cgroup_get(sock_cgroup_ptr(skcd));
-		cgroup_bpf_get(sock_cgroup_ptr(skcd));
-	}
+	struct cgroup *cgrp = sock_cgroup_ptr(skcd);
+
+	/*
+	 * We might be cloning a socket which is left in an empty
+	 * cgroup and the cgroup might have already been rmdir'd.
+	 * Don't use cgroup_get_live().
+	 */
+	cgroup_get(cgrp);
+	cgroup_bpf_get(cgrp);
 }
 
 void cgroup_sk_free(struct sock_cgroup_data *skcd)
 {
 	struct cgroup *cgrp = sock_cgroup_ptr(skcd);
 
-	if (skcd->no_refcnt)
-		return;
-
 	cgroup_bpf_put(cgrp);
 	cgroup_put(cgrp);
 }
diff --git a/net/core/netclassid_cgroup.c b/net/core/netclassid_cgroup.c
index 41b24cd31562..b6de5ee22391 100644
--- a/net/core/netclassid_cgroup.c
+++ b/net/core/netclassid_cgroup.c
@@ -72,11 +72,8 @@ static int update_classid_sock(const void *v, struct file *file, unsigned n)
 	struct update_classid_context *ctx = (void *)v;
 	struct socket *sock = sock_from_file(file, &err);
 
-	if (sock) {
-		spin_lock(&cgroup_sk_update_lock);
+	if (sock)
 		sock_cgroup_set_classid(&sock->sk->sk_cgrp_data, ctx->classid);
-		spin_unlock(&cgroup_sk_update_lock);
-	}
 	if (--ctx->batch == 0) {
 		ctx->batch = UPDATE_CLASSID_BATCH;
 		return n + 1;
@@ -122,8 +119,6 @@ static int write_classid(struct cgroup_subsys_state *css, struct cftype *cft,
 	struct css_task_iter it;
 	struct task_struct *p;
 
-	cgroup_sk_alloc_disable();
-
 	cs->classid = (u32)value;
 
 	css_task_iter_start(css, 0, &it);
diff --git a/net/core/netprio_cgroup.c b/net/core/netprio_cgroup.c
index 9bd4cab7d510..34c3e49814e2 100644
--- a/net/core/netprio_cgroup.c
+++ b/net/core/netprio_cgroup.c
@@ -207,8 +207,6 @@ static ssize_t write_priomap(struct kernfs_open_file *of,
 	if (!dev)
 		return -ENODEV;
 
-	cgroup_sk_alloc_disable();
-
 	rtnl_lock();
 
 	ret = netprio_set_prio(of_css(of), dev, prio);
@@ -222,12 +220,11 @@ static int update_netprio(const void *v, struct file *file, unsigned n)
 {
 	int err;
 	struct socket *sock = sock_from_file(file, &err);
-	if (sock) {
-		spin_lock(&cgroup_sk_update_lock);
+
+	if (sock)
 		sock_cgroup_set_prioidx(&sock->sk->sk_cgrp_data,
 					(unsigned long)v);
-		spin_unlock(&cgroup_sk_update_lock);
-	}
+
 	return 0;
 }
 
@@ -236,8 +233,6 @@ static void net_prio_attach(struct cgroup_taskset *tset)
 	struct task_struct *p;
 	struct cgroup_subsys_state *css;
 
-	cgroup_sk_alloc_disable();
-
 	cgroup_taskset_for_each(p, css, tset) {
 		void *v = (void *)(unsigned long)css->id;
From: Daniel Borkmann <daniel@iogearbox.net>

mainline inclusion
from mainline-v5.15-rc4
commit 78cc316e9583067884eb8bd154301dc1e9ee945c
bugzilla: 185622 https://gitee.com/openeuler/kernel/issues/I4DDEL
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i...
-------------------------------
If cgroup_sk_alloc() is called from interrupt context, then just assign the root cgroup to skcd->cgroup. Prior to commit 8520e224f547 ("bpf, cgroups: Fix cgroup v2 fallback on v1/v2 mixed mode") we would just return, and later on in sock_cgroup_ptr(), we were NULL-testing the cgroup in fast-path, and iff indeed NULL returning the root cgroup (v ?: &cgrp_dfl_root.cgrp). Rather than re-adding the NULL-test to the fast-path we can just assign it once from cgroup_sk_alloc() given v1/v2 handling has been simplified. The migration from NULL test with returning &cgrp_dfl_root.cgrp to assigning &cgrp_dfl_root.cgrp directly does /not/ change behavior for callers of sock_cgroup_ptr().
syzkaller was able to trigger a splat in the legacy netrom code base, where the RX handler in nr_rx_frame() calls nr_make_new() which calls sk_alloc() and therefore cgroup_sk_alloc() with in_interrupt() condition. Thus the NULL skcd->cgroup, where it trips over on cgroup_sk_free() side given it expects a non-NULL object. There are a few other candidates aside from netrom which have similar pattern where in their accept-like implementation, they just call to sk_alloc() and thus cgroup_sk_alloc() instead of sk_clone_lock() with the corresponding cgroup_sk_clone() which then inherits the cgroup from the parent socket. None of them are related to core protocols where BPF cgroup programs are running from. However, in future, they should follow to implement a similar inheritance mechanism.
Additionally, with a !CONFIG_CGROUP_NET_PRIO and !CONFIG_CGROUP_NET_CLASSID configuration, the same issue was exposed also prior to 8520e224f547 due to commit e876ecc67db8 ("cgroup: memcg: net: do not associate sock with unrelated cgroup") which added the early in_interrupt() return back then.
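The netrom path that triggered the splat, sketched (reconstructed from the description above):

/*
 * softirq:  nr_rx_frame()
 *             nr_make_new()
 *               sk_alloc()
 *                 cgroup_sk_alloc()    <- in_interrupt(), returned early,
 *                                         so skcd->cgroup stayed NULL
 * later:    cgroup_sk_free()
 *             sock_cgroup_ptr()        -> NULL
 *             cgroup_bpf_put(NULL)     -> splat
 *
 * With this patch, the interrupt case assigns &cgrp_dfl_root.cgrp (with a
 * reference taken), so the free path always sees a valid cgroup.
 */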
Fixes: 8520e224f547 ("bpf, cgroups: Fix cgroup v2 fallback on v1/v2 mixed mode")
Fixes: e876ecc67db8 ("cgroup: memcg: net: do not associate sock with unrelated cgroup")
Reported-by: syzbot+df709157a4ecaf192b03@syzkaller.appspotmail.com
Reported-by: syzbot+533f389d4026d86a2a95@syzkaller.appspotmail.com
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Tested-by: syzbot+df709157a4ecaf192b03@syzkaller.appspotmail.com
Tested-by: syzbot+533f389d4026d86a2a95@syzkaller.appspotmail.com
Acked-by: Tejun Heo <tj@kernel.org>
Link: https://lore.kernel.org/bpf/20210927123921.21535-1-daniel@iogearbox.net
Signed-off-by: Lu Jialin <lujialin4@huawei.com>
Reviewed-by: Xiu Jianfeng <xiujianfeng@huawei.com>
Signed-off-by: Chen Jun <chenjun102@huawei.com>
---
 kernel/cgroup/cgroup.c | 17 ++++++++++++-----
 1 file changed, 12 insertions(+), 5 deletions(-)
diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c index 15f4474e0bad..fafd2332457b 100644 --- a/kernel/cgroup/cgroup.c +++ b/kernel/cgroup/cgroup.c @@ -6491,22 +6491,29 @@ int cgroup_parse_float(const char *input, unsigned dec_shift, s64 *v)
void cgroup_sk_alloc(struct sock_cgroup_data *skcd) { - /* Don't associate the sock with unrelated interrupted task's cgroup. */ - if (in_interrupt()) - return; + struct cgroup *cgroup;
rcu_read_lock(); + /* Don't associate the sock with unrelated interrupted task's cgroup. */ + if (in_interrupt()) { + cgroup = &cgrp_dfl_root.cgrp; + cgroup_get(cgroup); + goto out; + } + while (true) { struct css_set *cset;
cset = task_css_set(current); if (likely(cgroup_tryget(cset->dfl_cgrp))) { - skcd->cgroup = cset->dfl_cgrp; - cgroup_bpf_get(cset->dfl_cgrp); + cgroup = cset->dfl_cgrp; break; } cpu_relax(); } +out: + skcd->cgroup = cgroup; + cgroup_bpf_get(cgroup); rcu_read_unlock(); }
From: Guo Xuenan guoxuenan@huawei.com
hulk inclusion category: bugfix bugzilla: 182980 https://gitee.com/openeuler/kernel/issues/I4DDEL
-------------------------------------------------
When the ioctl interface is used to resize a ubi volume, ubi_resize_volume resizes the eba table first but does not update vol->reserved_pebs in the same atomic context, which allows concurrent access to the eba table.
For example, when a user shrinks ubi volume A via ubi_resize_volume while another thread is writing to volume B and triggering wear-leveling (which may call ubi_write_fastmap), KASAN may report: slab-out-of-bounds in ubi_eba_get_ldesc+0xfb/0x130.
The main work of this patch:
1. Fix the races between ubi_resize_volume and ubi_update_fastmap so the eba_tbl cannot be read out of bounds. First, update eba_tbl and reserved_pebs together under the protection of ubi->volumes_lock. Second, roll the volume back in case of resize failure. Note that for a volume-shrinking failure, since part of the volume has already been shrunk and unmapped, there is no need to recover {rsvd/avail}_pebs.
2. Fix memory leaks in the error path of ubi_resize_volume when destroying new_eba_tbl.
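To make the window concrete, a pseudo-timeline of the out-of-bounds read (illustration only, not real kernel code):

  /*
   * resize thread (shrink)               fastmap thread
   * ----------------------               --------------
   * replaces vol->eba_tbl with a
   * smaller table
   *                                      ubi_update_fastmap()
   *                                        loops while i < vol->reserved_pebs
   *                                        (still the old, larger value)
   *                                        ubi_eba_get_ldesc(vol, i)
   *                                          -> indexes past the new table
   * updates vol->reserved_pebs
   *
   * After this patch both updates happen in one ubi->volumes_lock
   * critical section, so fastmap always sees a consistent pair.
   */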
BUG: KASAN: slab-out-of-bounds in ubi_eba_get_ldesc+0xfb/0x130 [ubi] Read of size 4 at addr ffff888013ff7170 by task kworker/u16:3/160 CPU: 6 PID: 160 Comm: kworker/u16:3 Not tainted Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.14.0-1.fc33 04/01/2014 Workqueue: writeback wb_workfn (flush-ubifs_0_0) Call Trace: dump_stack+0x9c/0xcf print_address_description.constprop.0+0x1c/0x220 kasan_report.cold+0x1f/0x37 ubi_eba_get_ldesc+0xfb/0x130 [ubi] ubi_update_fastmap.cold+0x6be/0xc6b [ubi] ubi_wl_get_peb+0x2a2/0x580 [ubi] try_write_vid_and_data+0x9a/0x4d0 [ubi] ubi_eba_write_leb+0x780/0x1890 [ubi] ubi_leb_map+0x197/0x2c0 [ubi] ubifs_leb_map+0x139/0x240 [ubifs] ubifs_add_bud_to_log+0xb02/0xea0 [ubifs] make_reservation+0x860/0xb40 [ubifs] ubifs_jnl_write_data+0x461/0x9b0 [ubifs] do_writepage+0x375/0x550 [ubifs] ubifs_writepage+0x3a4/0x670 [ubifs] __writepage+0x61/0x160 write_cache_pages+0x433/0xb70 do_writepages+0x1ad/0x260 __writeback_single_inode+0xb3/0x810 writeback_sb_inodes+0x4d9/0xcc0 __writeback_inodes_wb+0xc1/0x260 wb_writeback+0x585/0x760 wb_workfn+0x751/0xdb0 process_one_work+0x6e5/0xf10 worker_thread+0x5e1/0x10c0 kthread+0x335/0x400 ret_from_fork+0x1f/0x30
Allocated by task 690: kasan_save_stack+0x1b/0x40 __kasan_kmalloc.constprop.0+0xc2/0xd0 ubi_eba_create_table+0x88/0x1a0 [ubi] ubi_resize_volume.cold+0x166/0xc75 [ubi] ubi_cdev_ioctl+0x57f/0x1aa0 [ubi] __x64_sys_ioctl+0x141/0x1a0 do_syscall_64+0x2d/0x40 entry_SYSCALL_64_after_hwframe+0x44/0xa9
The following steps can be used to reproduce:

Process 1: write and trigger ubi wear-leveling

  ubimkvol /dev/ubi0 -s 5000MiB -N v1
  ubimkvol /dev/ubi0 -s 2000MiB -N v2
  ubimkvol /dev/ubi0 -s 10MiB -N v3
  mount -t ubifs /dev/ubi0_0 /mnt/ubifs
  while true; do
	filename=/mnt/ubifs/$((RANDOM))
	dd if=/dev/random of=${filename} bs=1M count=$((RANDOM % 1000))
	rm -rf ${filename}
	sync /mnt/ubifs/
  done
Process 2: do random resize

  struct ubi_rsvol_req req;

  req.vol_id = 1;
  req.bytes = (rand() % 50) * 512KB;
  ioctl(fd, UBI_IOCRSVOL, &req);
Reported-by: Hulk Robot hulkci@huawei.com Signed-off-by: Guo Xuenan guoxuenan@huawei.com Reviewed-by: Zhang Xiaoxu zhangxiaoxu5@huawei.com Reviewed-by: Zhang Yi yi.zhang@huawei.com
Signed-off-by: Chen Jun chenjun102@huawei.com --- drivers/mtd/ubi/vmt.c | 27 +++++++++++++++++++-------- 1 file changed, 19 insertions(+), 8 deletions(-)
diff --git a/drivers/mtd/ubi/vmt.c b/drivers/mtd/ubi/vmt.c index 1bc7b3a05604..04d0e217ea1c 100644 --- a/drivers/mtd/ubi/vmt.c +++ b/drivers/mtd/ubi/vmt.c @@ -409,6 +409,7 @@ int ubi_resize_volume(struct ubi_volume_desc *desc, int reserved_pebs) struct ubi_device *ubi = vol->ubi; struct ubi_vtbl_record vtbl_rec; struct ubi_eba_table *new_eba_tbl = NULL; + struct ubi_eba_table *old_eba_tbl = NULL; int vol_id = vol->vol_id;
if (ubi->ro_mode) @@ -454,10 +455,13 @@ int ubi_resize_volume(struct ubi_volume_desc *desc, int reserved_pebs) err = -ENOSPC; goto out_free; } + ubi->avail_pebs -= pebs; ubi->rsvd_pebs += pebs; ubi_eba_copy_table(vol, new_eba_tbl, vol->reserved_pebs); - ubi_eba_replace_table(vol, new_eba_tbl); + old_eba_tbl = vol->eba_tbl; + vol->eba_tbl = new_eba_tbl; + vol->reserved_pebs = reserved_pebs; spin_unlock(&ubi->volumes_lock); }
@@ -465,14 +469,16 @@ int ubi_resize_volume(struct ubi_volume_desc *desc, int reserved_pebs) for (i = 0; i < -pebs; i++) { err = ubi_eba_unmap_leb(ubi, vol, reserved_pebs + i); if (err) - goto out_acc; + goto out_free; } spin_lock(&ubi->volumes_lock); ubi->rsvd_pebs += pebs; ubi->avail_pebs -= pebs; ubi_update_reserved(ubi); ubi_eba_copy_table(vol, new_eba_tbl, reserved_pebs); - ubi_eba_replace_table(vol, new_eba_tbl); + old_eba_tbl = vol->eba_tbl; + vol->eba_tbl = new_eba_tbl; + vol->reserved_pebs = reserved_pebs; spin_unlock(&ubi->volumes_lock); }
@@ -494,27 +500,32 @@ int ubi_resize_volume(struct ubi_volume_desc *desc, int reserved_pebs) if (err) goto out_acc;
- vol->reserved_pebs = reserved_pebs; if (vol->vol_type == UBI_DYNAMIC_VOLUME) { vol->used_ebs = reserved_pebs; vol->last_eb_bytes = vol->usable_leb_size; vol->used_bytes = (long long)vol->used_ebs * vol->usable_leb_size; } - + /* destroy old table */ + ubi_eba_destroy_table(old_eba_tbl); ubi_volume_notify(ubi, vol, UBI_VOLUME_RESIZED); self_check_volumes(ubi); return err;
out_acc: + spin_lock(&ubi->volumes_lock); + vol->reserved_pebs = reserved_pebs - pebs; if (pebs > 0) { - spin_lock(&ubi->volumes_lock); ubi->rsvd_pebs -= pebs; ubi->avail_pebs += pebs; - spin_unlock(&ubi->volumes_lock); + ubi_eba_copy_table(vol, old_eba_tbl, vol->reserved_pebs); + } else { + ubi_eba_copy_table(vol, old_eba_tbl, reserved_pebs); } + vol->eba_tbl = old_eba_tbl; + spin_unlock(&ubi->volumes_lock); out_free: - kfree(new_eba_tbl); + ubi_eba_destroy_table(new_eba_tbl); return err; }
From: Desmond Cheong Zhi Xi desmondcheongzx@gmail.com
mainline inclusion from mainline-v5.14-rc1 commit 27c24fda62b601d6f9ca5e992502578c4310876f category: bugfix bugzilla: 185743 https://gitee.com/openeuler/kernel/issues/I4DDEL CVE: CVE-2021-3640
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i...
-------------------------------------------------
Since sco_sock_timeout is now scheduled using delayed work, it is no longer run in SOFTIRQ context. Hence bh_lock_sock is no longer necessary in SCO to synchronise between user contexts and SOFTIRQ processing.
As such, calls to bh_lock_sock should be replaced with lock_sock to synchronize with other concurrent processes that use lock_sock.
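A brief sketch of the distinction, assuming the workqueue-based timeout handler described above (example_sk_from_work() is a hypothetical helper, not a real API):

  /* bh_lock_sock() only takes the socket spinlock: it synchronises
   * against SOFTIRQ users but does not wait for a process-context
   * owner holding the socket via lock_sock(). From a workqueue
   * (process context) the sleeping lock is the right primitive: */
  static void example_timeout_work(struct work_struct *work)
  {
	struct sock *sk = example_sk_from_work(work); /* hypothetical */

	lock_sock(sk);		/* may sleep; excludes lock_sock() users */
	sk->sk_err = ETIMEDOUT;
	sk->sk_state_change(sk);
	release_sock(sk);
	sock_put(sk);
  }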
Signed-off-by: Desmond Cheong Zhi Xi desmondcheongzx@gmail.com Signed-off-by: Luiz Augusto von Dentz luiz.von.dentz@intel.com Signed-off-by: Lijun Fang fanglijun3@huawei.com Reviewed-by: Xiu Jianfeng xiujianfeng@huawei.com Reviewed-by: Jason Yan yanaijie@huawei.com
Signed-off-by: Chen Jun chenjun102@huawei.com --- net/bluetooth/sco.c | 18 +++++++++--------- 1 file changed, 9 insertions(+), 9 deletions(-)
diff --git a/net/bluetooth/sco.c b/net/bluetooth/sco.c index 7c24a9acbc45..cb2b11683082 100644 --- a/net/bluetooth/sco.c +++ b/net/bluetooth/sco.c @@ -93,10 +93,10 @@ static void sco_sock_timeout(struct work_struct *work)
BT_DBG("sock %p state %d", sk, sk->sk_state);
- bh_lock_sock(sk); + lock_sock(sk); sk->sk_err = ETIMEDOUT; sk->sk_state_change(sk); - bh_unlock_sock(sk); + release_sock(sk);
sock_put(sk); } @@ -192,10 +192,10 @@ static void sco_conn_del(struct hci_conn *hcon, int err)
if (sk) { sock_hold(sk); - bh_lock_sock(sk); + lock_sock(sk); sco_sock_clear_timer(sk); sco_chan_del(sk, err); - bh_unlock_sock(sk); + release_sock(sk); sock_put(sk);
/* Ensure no more work items will run before freeing conn. */ @@ -1101,10 +1101,10 @@ static void sco_conn_ready(struct sco_conn *conn)
if (sk) { sco_sock_clear_timer(sk); - bh_lock_sock(sk); + lock_sock(sk); sk->sk_state = BT_CONNECTED; sk->sk_state_change(sk); - bh_unlock_sock(sk); + release_sock(sk); } else { sco_conn_lock(conn);
@@ -1119,12 +1119,12 @@ static void sco_conn_ready(struct sco_conn *conn) return; }
- bh_lock_sock(parent); + lock_sock(parent);
sk = sco_sock_alloc(sock_net(parent), NULL, BTPROTO_SCO, GFP_ATOMIC, 0); if (!sk) { - bh_unlock_sock(parent); + release_sock(parent); sco_conn_unlock(conn); return; } @@ -1145,7 +1145,7 @@ static void sco_conn_ready(struct sco_conn *conn) /* Wake up parent */ parent->sk_data_ready(parent);
- bh_unlock_sock(parent); + release_sock(parent);
sco_conn_unlock(conn); }
From: Takashi Iwai tiwai@suse.de
mainline inclusion from mainline-v5.14-rc7 commit 99c23da0eed4fd20cae8243f2b51e10e66aa0951 category: bugfix bugzilla: 185743 https://gitee.com/openeuler/kernel/issues/I4DDEL CVE: CVE-2021-3640
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i...
-------------------------------------------------
sco_send_frame() also takes lock_sock() around the memcpy_from_msg() call, which may be blocked endlessly by a task using the userfaultfd technique, and this results in a hung-task watchdog trigger.
Just like the similar fix for hci_sock_sendmsg() in commit 92c685dc5de0 ("Bluetooth: reorganize functions..."), this patch moves the memcpy_from_msg() out of lock_sock() to address the hang.
This should be the last piece for fixing CVE-2021-3640 after a few already queued fixes.
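In outline, the reordering applied below follows this pattern (a sketch of the shape of the fix, not the full sco_sock_sendmsg()):

  /* Copy from userspace *before* taking the sleeping socket lock, so a
   * copy stalled by userfaultfd can no longer pin lock_sock() forever: */
  buf = kmalloc(len, GFP_KERNEL);
  if (!buf)
	return -ENOMEM;
  if (memcpy_from_msg(buf, msg, len)) {	/* may stall; lock not held */
	kfree(buf);
	return -EFAULT;
  }
  lock_sock(sk);
  /* ... hand buf to sco_send_frame() while locked ... */
  release_sock(sk);
  kfree(buf);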
Signed-off-by: Takashi Iwai tiwai@suse.de Signed-off-by: Marcel Holtmann marcel@holtmann.org Signed-off-by: Lijun Fang fanglijun3@huawei.com Reviewed-by: Xiu Jianfeng xiujianfeng@huawei.com Reviewed-by: Jason Yan yanaijie@huawei.com
Signed-off-by: Chen Jun chenjun102@huawei.com --- net/bluetooth/sco.c | 24 ++++++++++++++++-------- 1 file changed, 16 insertions(+), 8 deletions(-)
diff --git a/net/bluetooth/sco.c b/net/bluetooth/sco.c index cb2b11683082..918df8d0e8b6 100644 --- a/net/bluetooth/sco.c +++ b/net/bluetooth/sco.c @@ -281,7 +281,8 @@ static int sco_connect(struct hci_dev *hdev, struct sock *sk) return err; }
-static int sco_send_frame(struct sock *sk, struct msghdr *msg, int len) +static int sco_send_frame(struct sock *sk, void *buf, int len, + unsigned int msg_flags) { struct sco_conn *conn = sco_pi(sk)->conn; struct sk_buff *skb; @@ -293,15 +294,11 @@ static int sco_send_frame(struct sock *sk, struct msghdr *msg, int len)
BT_DBG("sk %p len %d", sk, len);
- skb = bt_skb_send_alloc(sk, len, msg->msg_flags & MSG_DONTWAIT, &err); + skb = bt_skb_send_alloc(sk, len, msg_flags & MSG_DONTWAIT, &err); if (!skb) return err;
- if (memcpy_from_msg(skb_put(skb, len), msg, len)) { - kfree_skb(skb); - return -EFAULT; - } - + memcpy(skb_put(skb, len), buf, len); hci_send_sco(conn->hcon, skb);
return len; @@ -726,6 +723,7 @@ static int sco_sock_sendmsg(struct socket *sock, struct msghdr *msg, size_t len) { struct sock *sk = sock->sk; + void *buf; int err;
BT_DBG("sock %p, sk %p", sock, sk); @@ -737,14 +735,24 @@ static int sco_sock_sendmsg(struct socket *sock, struct msghdr *msg, if (msg->msg_flags & MSG_OOB) return -EOPNOTSUPP;
+ buf = kmalloc(len, GFP_KERNEL); + if (!buf) + return -ENOMEM; + + if (memcpy_from_msg(buf, msg, len)) { + kfree(buf); + return -EFAULT; + } + lock_sock(sk);
if (sk->sk_state == BT_CONNECTED) - err = sco_send_frame(sk, msg, len); + err = sco_send_frame(sk, buf, len, msg->msg_flags); else err = -ENOTCONN;
release_sock(sk); + kfree(buf); return err; }
From: Navid Emamdoost navid.emamdoost@gmail.com
maillist inclusion category: bugfix bugzilla: 185744 https://gitee.com/openeuler/kernel/issues/I4DDEL CVE: CVE-2019-16089
Reference: https://lore.kernel.org/lkml/20190911164013.27364-1-navid.emamdoost@gmail.co...
---------------------------
nla_nest_start may fail and return NULL. The missing check is inserted, and the error code is chosen based on other call sites in the same source file. Update: removed an extra new line. v3 Update: added releasing the reply, thanks to Michal Kubecek for pointing it out.
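For reference, the usual shape of the nesting pattern the fix follows (MY_ATTR_LIST is a placeholder attribute type, not from nbd):

  struct nlattr *nest;

  nest = nla_nest_start_noflag(reply, MY_ATTR_LIST);
  if (!nest) {			/* no room left in the message */
	nlmsg_free(reply);	/* release the partially built reply */
	return -EMSGSIZE;
  }
  /* ... nla_put() the list entries ... */
  nla_nest_end(reply, nest);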
Signed-off-by: Navid Emamdoost navid.emamdoost@gmail.com Reviewed-by: Michal Kubecek mkubecek@suse.cz Signed-off-by: Lijun Fang fanglijun3@huawei.com Reviewed-by: Jason Yan yanaijie@huawei.com Reviewed-by: Xiu Jianfeng xiujianfeng@huawei.com Signed-off-by: Chen Jun chenjun102@huawei.com --- drivers/block/nbd.c | 6 ++++++ 1 file changed, 6 insertions(+)
diff --git a/drivers/block/nbd.c b/drivers/block/nbd.c index cc9770936c67..07b06fc6f70e 100644 --- a/drivers/block/nbd.c +++ b/drivers/block/nbd.c @@ -2386,6 +2386,12 @@ static int nbd_genl_status(struct sk_buff *skb, struct genl_info *info) }
dev_list = nla_nest_start_noflag(reply, NBD_ATTR_DEVICE_LIST); + if (!dev_list) { + nlmsg_free(reply); + ret = -EMSGSIZE; + goto out; + } + if (index == -1) { ret = idr_for_each(&nbd_index_idr, &status_cb, reply); if (ret) {
From: Zheng Liang zhengliang6@huawei.com
hulk inclusion category: bugfix bugzilla: 185682 https://gitee.com/openeuler/kernel/issues/I4DDEL
-------------------------------------------------
Commit c1f6925e1091 ("mm: put readahead pages in cache earlier") causes squashfs read performance to deteriorate. Through testing, we found that performance comes back when readahead is disabled for squashfs. So, following the approach of ubifs, provide a backing_dev_info and disable read-ahead.
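The essence of the approach, as a sketch (mirrors the ubifs trick; error handling and the bdi_put() of the old bdi are omitted here):

  /* Give the superblock its own bdi and zero the readahead window, so
   * the VFS submits no speculative read-ahead I/O for this mount: */
  err = super_setup_bdi_name(sb, "squashfs_%u_%u",
			     MAJOR(sb->s_dev), MINOR(sb->s_dev));
  if (err)
	return err;
  sb->s_bdi->ra_pages = 0;	/* disable read-ahead */
  sb->s_bdi->io_pages = 0;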
Signed-off-by: Zheng Liang zhengliang6@huawei.com Reviewed-by: Zhang Yi yi.zhang@huawei.com
Signed-off-by: Chen Jun chenjun102@huawei.com --- fs/squashfs/super.c | 32 ++++++++++++++++++++++++++++++++ 1 file changed, 32 insertions(+)
diff --git a/fs/squashfs/super.c b/fs/squashfs/super.c index 88cc94be1076..f7128ae7b949 100644 --- a/fs/squashfs/super.c +++ b/fs/squashfs/super.c @@ -26,6 +26,7 @@ #include <linux/module.h> #include <linux/magic.h> #include <linux/xattr.h> +#include <linux/backing-dev.h>
#include "squashfs_fs.h" #include "squashfs_fs_sb.h" @@ -64,6 +65,24 @@ static const struct squashfs_decompressor *supported_squashfs_filesystem( return decompressor; }
+static int squashfs_bdi_init(struct super_block *sb) +{ + int err; + unsigned int major = MAJOR(sb->s_dev); + unsigned int minor = MINOR(sb->s_dev); + + bdi_put(sb->s_bdi); + sb->s_bdi = &noop_backing_dev_info; + + err = super_setup_bdi_name(sb, "squashfs_%u_%u", major, minor); + if (err) + return err; + + sb->s_bdi->ra_pages = 0; + sb->s_bdi->io_pages = 0; + + return 0; +}
static int squashfs_fill_super(struct super_block *sb, struct fs_context *fc) { @@ -78,6 +97,19 @@ static int squashfs_fill_super(struct super_block *sb, struct fs_context *fc)
TRACE("Entered squashfs_fill_superblock\n");
+ /* + * squashfs provides 'backing_dev_info' in order to disable read-ahead. For + * squashfs, I/O is not deferred, it is done immediately in readpage, + * which means the user would have to wait not just for their own I/O + * but the read-ahead I/O as well, i.e. completely pointless. squashfs_bdi_init + * will set sb->s_bdi->ra_pages and sb->s_bdi->io_pages to 0. + */ + err = squashfs_bdi_init(sb); + if (err) { + errorf(fc, "squashfs init bdi failed"); + return err; + } + sb->s_fs_info = kzalloc(sizeof(*msblk), GFP_KERNEL); if (sb->s_fs_info == NULL) { ERROR("Failed to allocate squashfs_sb_info\n");
From: Cui GaoSheng cuigaosheng1@huawei.com
hulk inclusion category: bugfix bugzilla: 185737 https://gitee.com/openeuler/kernel/issues/I4DDEL
-----------------------------------------------------------------
ARM supports position-independent code sequences that produce symbol references with a greater reach than the ordinary adr/ldr instructions. The ldr_l pseudo-instruction is used to solve this symbol-reference problem, so we should use ldr_l to replace the adr/ldr sequence when KASLR is enabled.
Signed-off-by: Cui GaoSheng cuigaosheng1@huawei.com Reviewed-by: Xiu Jianfeng xiujianfeng@huawei.com Reviewed-by: Jason Yan yanaijie@huawei.com Reviewed-by: Hanjun Guo guohanjun@huawei.com
Signed-off-by: Chen Jun chenjun102@huawei.com --- arch/arm/mm/proc-v7.S | 10 ++-------- 1 file changed, 2 insertions(+), 8 deletions(-)
diff --git a/arch/arm/mm/proc-v7.S b/arch/arm/mm/proc-v7.S index a59ddfd7a179..2fcffcc60cc6 100644 --- a/arch/arm/mm/proc-v7.S +++ b/arch/arm/mm/proc-v7.S @@ -112,10 +112,8 @@ ENTRY(cpu_v7_hvc_switch_mm) ENDPROC(cpu_v7_hvc_switch_mm) #endif
-.globl nospectre_v2 ENTRY(cpu_v7_iciallu_switch_mm) - adr r3, 3f - ldr r3, [r3] + ldr_l r3, nospectre_v2 cmp r3, #1 beq 1f mov r3, #0 @@ -124,8 +122,7 @@ ENTRY(cpu_v7_iciallu_switch_mm) b cpu_v7_switch_mm ENDPROC(cpu_v7_iciallu_switch_mm) ENTRY(cpu_v7_bpiall_switch_mm) - adr r3, 3f - ldr r3, [r3] + ldr_l r3, nospectre_v2 cmp r3, #1 beq 1f mov r3, #0 @@ -134,9 +131,6 @@ ENTRY(cpu_v7_bpiall_switch_mm) b cpu_v7_switch_mm ENDPROC(cpu_v7_bpiall_switch_mm)
- .align -3: .long nospectre_v2 - string cpu_v7_name, "ARMv7 Processor" .align
From: Heiner Kallweit hkallweit1@gmail.com
mainline inclusion from mainline commit: a4db9055fdb9cf607775c66d39796caf6439ec92 category: bugfix bugzilla: 185671 https://gitee.com/openeuler/kernel/issues/I4DDEL
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i...
-------------------------------------------------
As reported by Zhang, there's a small issue if, in forced mode, the duplex mode changes while the link stays up [0]. In this case the MAC isn't notified about the change.
The proposed patch relies on the phylib state machine and ignores the fact that there are drivers that use phylib but not the phylib state machine. So let's not change the behavior for such drivers, and fix the issue without re-adding state PHY_FORCING for the case where the phylib state machine is used.
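In other words, the fix distinguishes the two cases roughly like this (an annotated restatement of the hunk below; comments added for illustration):

  if (phy_is_started(phydev)) {
	/* State machine in use: drop back to PHY_UP and kick it, so a
	 * duplex change is re-evaluated and the MAC notified even
	 * though the link never went down. */
	phydev->state = PHY_UP;
	phy_trigger_machine(phydev);
  } else {
	/* Driver uses phylib without the state machine: keep the old
	 * behavior and just restart autoneg/forcing. */
	_phy_start_aneg(phydev);
  }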
[0] https://lore.kernel.org/netdev/a5c26ffd-4ee4-a5e6-4103-873208ce0dc5@huawei.c...
Fixes: 2bd229df5e2e ("net: phy: remove state PHY_FORCING") Reported-by: Zhang Changzhong zhangchangzhong@huawei.com Tested-by: Zhang Changzhong zhangchangzhong@huawei.com Signed-off-by: Heiner Kallweit hkallweit1@gmail.com Link: https://lore.kernel.org/r/7b8b9456-a93f-abbc-1dc5-a2c2542f932c@gmail.com Signed-off-by: Jakub Kicinski kuba@kernel.org Signed-off-by: Zhang Changzhong zhangchangzhong@huawei.com Reviewed-by: Wei Yongjun weiyongjun1@huawei.com Reviewed-by: Yue Haibing yuehaibing@huawei.com Signed-off-by: Chen Jun chenjun102@huawei.com --- drivers/net/phy/phy.c | 7 ++++++- 1 file changed, 6 insertions(+), 1 deletion(-)
diff --git a/drivers/net/phy/phy.c b/drivers/net/phy/phy.c index 5ee7cde0c2e9..db7866b6f752 100644 --- a/drivers/net/phy/phy.c +++ b/drivers/net/phy/phy.c @@ -831,7 +831,12 @@ int phy_ethtool_ksettings_set(struct phy_device *phydev, phydev->mdix_ctrl = cmd->base.eth_tp_mdix_ctrl;
/* Restart the PHY */ - _phy_start_aneg(phydev); + if (phy_is_started(phydev)) { + phydev->state = PHY_UP; + phy_trigger_machine(phydev); + } else { + _phy_start_aneg(phydev); + }
mutex_unlock(&phydev->lock); return 0;