[PATCH OLK-5.10 v4 00/11] io_uring patches backport

Changes in v2:
  Add 01e68ce08a30d ("io_uring/io-wq: stop setting PF_NO_SETAFFINITY on
  io-wq workers") and a5fc1441af77 ("io_uring/sqpoll: Do not set
  PF_NO_SETAFFINITY on sqpoll threads") to allow users to set the cpumask
  for sqpoll thread and worker thread.

Changes in v3:
  Remove PF_NO_SETAFFINITY of worker;
  Add 7215469659cb ("io_uring: check for iowq alloc_workqueue failure").

Changes in v4:
  Modify the commit message;
  Adjust the patch sequence.

Al Viro (1):
  io_uring: kiocb_done() should *not* trust ->ki_pos if
    ->{read,write}_iter() failed

Jeff Moyer (1):
  io-wq: fully initialize wqe before calling
    cpuhp_state_add_instance_nocalls()

Jens Axboe (4):
  io_uring/io-wq: stop setting PF_NO_SETAFFINITY on io-wq workers
  io_uring/fdinfo: remove need for sqpoll lock for thread/pid retrieval
  io_uring: use private workqueue for exit work
  io_uring/sqpoll: close race on waiting for sqring entries

Max Kellermann (1):
  io_uring/io-wq: do not use bogus hash value

Michal Koutný (1):
  io_uring/sqpoll: Do not set PF_NO_SETAFFINITY on sqpoll threads

Pavel Begunkov (3):
  io_uring: check for iowq alloc_workqueue failure
  io_uring: protect register tracing
  io_uring/sqpoll: fix sqpoll error handling races

 io_uring/io-wq.c    | 29 ++++++++++++++++-----------
 io_uring/io_uring.c | 48 ++++++++++++++++++++++++++++++++-------------
 2 files changed, 52 insertions(+), 25 deletions(-)

-- 
2.31.1

FeedBack:
The patch(es) which you have sent to the kernel@openeuler.org mailing list
have been converted to a pull request successfully!
Pull request link: https://gitee.com/openeuler/kernel/pulls/16442
Mailing list address: https://mailweb.openeuler.org/archives/list/kernel@openeuler.org/message/IMU...

From: Jens Axboe <axboe@kernel.dk>

mainline inclusion
from mainline-v6.3-rc2
commit 01e68ce08a30db3d842ce7a55f7f6e0474a55f9a
category: bugfix
bugzilla: https://gitee.com/openeuler/kernel/issues/IC6ES1

Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i...

--------------------------------

Every now and then reports come in that are puzzled on why changing
affinity on the io-wq workers fails with EINVAL. This happens because they
set PF_NO_SETAFFINITY as part of their creation, as io-wq organizes
workers into groups based on what CPU they are running on.

However, this is purely an optimization and not a functional requirement.
We can allow setting affinity, and just lazily update our worker to wqe
mappings. If a given io-wq thread times out, it normally exits if there's
no more work to do. The exception is if it's the last worker available.
For the timeout case, check the affinity of the worker against group mask
and exit even if it's the last worker. New workers should be created with
the right mask and in the right location.

Reported-by: Daniel Dao <dqminh@cloudflare.com>
Link: https://lore.kernel.org/io-uring/CA+wXwBQwgxB3_UphSny-yAP5b26meeOu1W4TwYVcD_...
Signed-off-by: Jens Axboe <axboe@kernel.dk>

Conflicts:
	io_uring/io-wq.c
[Commit 8a565304927f ("io_uring/io-wq: Use set_bit() and test_bit() at
worker->flags") modified the way worker->flags is set;
commit 42abc95f05bf ("io-wq: decouple work_list protection from the big
wqe->lock") moved io_acct_run_queue out of the protection of wqe->lock in
io_wqe_worker;
commit e13fb1fe1483 ("io-wq: reduce acct->lock crossing functions
lock/unlock") removed the use of acct->lock in io_wqe_worker.]
Signed-off-by: Li Lingfeng <lilingfeng3@huawei.com>
---
 io_uring/io-wq.c | 16 +++++++++++-----
 1 file changed, 11 insertions(+), 5 deletions(-)

diff --git a/io_uring/io-wq.c b/io_uring/io-wq.c
index 066f9ab708c6..e25ab32414f4 100644
--- a/io_uring/io-wq.c
+++ b/io_uring/io-wq.c
@@ -622,7 +622,7 @@ static int io_wqe_worker(void *data)
 	struct io_wqe_acct *acct = io_wqe_get_acct(worker);
 	struct io_wqe *wqe = worker->wqe;
 	struct io_wq *wq = wqe->wq;
-	bool last_timeout = false;
+	bool exit_mask = false, last_timeout = false;
 	char buf[TASK_COMM_LEN];
 
 	set_mask_bits(&worker->flags, 0,
@@ -641,8 +641,11 @@ static int io_wqe_worker(void *data)
 			io_worker_handle_work(worker);
 			goto loop;
 		}
-		/* timed out, exit unless we're the last worker */
-		if (last_timeout && acct->nr_workers > 1) {
+		/*
+		 * Last sleep timed out. Exit if we're not the last worker,
+		 * or if someone modified our affinity.
+		 */
+		if (last_timeout && (exit_mask || acct->nr_workers > 1)) {
 			acct->nr_workers--;
 			raw_spin_unlock(&wqe->lock);
 			__set_current_state(TASK_RUNNING);
@@ -661,7 +664,11 @@ static int io_wqe_worker(void *data)
 				continue;
 			break;
 		}
-		last_timeout = !ret;
+		if (!ret) {
+			last_timeout = true;
+			exit_mask = !cpumask_test_cpu(raw_smp_processor_id(),
+							wqe->cpu_mask);
+		}
 	}
 
 	if (test_bit(IO_WQ_BIT_EXIT, &wq->state)) {
@@ -718,7 +725,6 @@ static void io_init_new_worker(struct io_wqe *wqe, struct io_worker *worker,
 	tsk->pf_io_worker = worker;
 	worker->task = tsk;
 	set_cpus_allowed_ptr(tsk, wqe->cpu_mask);
-	tsk->flags |= PF_NO_SETAFFINITY;
 
 	raw_spin_lock(&wqe->lock);
 	hlist_nulls_add_head_rcu(&worker->nulls_node, &wqe->free_list);
-- 
2.31.1
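
Note for readers of the archive: the patched exit condition in
io_wqe_worker() can be modelled in plain userspace C. This is only a
sketch under stated assumptions. The toy_* names and the 64-bit mask type
are hypothetical stand-ins for the kernel's cpumask helpers and are not
part of the patch.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for a kernel cpumask: one bit per CPU, max 64. */
typedef uint64_t toy_cpumask_t;

static bool toy_cpu_in_mask(unsigned int cpu, toy_cpumask_t mask)
{
	return cpu < 64 && ((mask >> cpu) & 1u) != 0;
}

/*
 * Models the patched check:
 *   last_timeout && (exit_mask || acct->nr_workers > 1)
 * exit_mask is only evaluated after a timed-out sleep, as in the patch.
 */
static bool toy_worker_should_exit(bool timed_out, unsigned int cur_cpu,
				   toy_cpumask_t group_mask,
				   unsigned int nr_workers)
{
	bool exit_mask;

	if (!timed_out)
		return false;

	/* Someone moved us off the CPUs this worker group belongs to. */
	exit_mask = !toy_cpu_in_mask(cur_cpu, group_mask);

	return exit_mask || nr_workers > 1;
}
```

With this model, a timed-out last worker keeps running while it is still
inside the group mask, and exits once it finds itself on a CPU outside of
it, so that a replacement can be created with the right mask.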

From: Michal Koutný <mkoutny@suse.com>

mainline inclusion
from mainline-v6.3-rc3
commit a5fc1441af7719e93dc7a638a960befb694ade89
category: bugfix
bugzilla: https://gitee.com/openeuler/kernel/issues/IC6ES1

Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i...

--------------------------------

Users may specify a CPU where the sqpoll thread would run. This may
conflict with cpuset operations because of strict PF_NO_SETAFFINITY
requirement. That flag is unnecessary for polling "kernel" threads, see
the reasoning in commit 01e68ce08a30 ("io_uring/io-wq: stop setting
PF_NO_SETAFFINITY on io-wq workers"). Drop the flag on poll threads too.

Fixes: 01e68ce08a30 ("io_uring/io-wq: stop setting PF_NO_SETAFFINITY on io-wq workers")
Link: https://lore.kernel.org/all/20230314162559.pnyxdllzgw7jozgx@blackpad/
Signed-off-by: Michal Koutný <mkoutny@suse.com>
Link: https://lore.kernel.org/r/20230314183332.25834-1-mkoutny@suse.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>

Conflicts:
	io_uring/sqpoll.c
[Commit 17437f311490 ("io_uring: move SQPOLL related handling into its own
file") moved io_sq_thread from io_uring.c to sqpoll.c]
Signed-off-by: Li Lingfeng <lilingfeng3@huawei.com>
---
 io_uring/io_uring.c | 1 -
 1 file changed, 1 deletion(-)

diff --git a/io_uring/io_uring.c b/io_uring/io_uring.c
index 883e2b74f82e..ea821d5c54a7 100644
--- a/io_uring/io_uring.c
+++ b/io_uring/io_uring.c
@@ -7542,7 +7542,6 @@ static int io_sq_thread(void *data)
 		set_cpus_allowed_ptr(current, cpumask_of(sqd->sq_cpu));
 	else
 		set_cpus_allowed_ptr(current, cpu_online_mask);
-	current->flags |= PF_NO_SETAFFINITY;
 
 	mutex_lock(&sqd->lock);
 	while (1) {
-- 
2.31.1

From: Jeff Moyer <jmoyer@redhat.com>

mainline inclusion
from mainline-v6.6-rc5
commit 0f8baa3c9802fbfe313c901e1598397b61b91ada
category: bugfix
bugzilla: https://gitee.com/openeuler/kernel/issues/IC6ES1

Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i...

--------------------------------

I received a bug report with the following signature:

[ 1759.937637] BUG: unable to handle page fault for address: ffffffffffffffe8
[ 1759.944564] #PF: supervisor read access in kernel mode
[ 1759.949732] #PF: error_code(0x0000) - not-present page
[ 1759.954901] PGD 7ab615067 P4D 7ab615067 PUD 7ab617067 PMD 0
[ 1759.960596] Oops: 0000 [#1] PREEMPT SMP PTI
[ 1759.964804] CPU: 15 PID: 109 Comm: cpuhp/15 Kdump: loaded Tainted: G X ------- --- 5.14.0-362.3.1.el9_3.x86_64 #1
[ 1759.976609] Hardware name: HPE ProLiant DL380 Gen10/ProLiant DL380 Gen10, BIOS U30 06/20/2018
[ 1759.985181] RIP: 0010:io_wq_for_each_worker.isra.0+0x24/0xa0
[ 1759.990877] Code: 90 90 90 90 90 90 0f 1f 44 00 00 41 56 41 55 41 54 55 48 8d 6f 78 53 48 8b 47 78 48 39 c5 74 4f 49 89 f5 49 89 d4 48 8d 58 e8 <8b> 13 85 d2 74 32 8d 4a 01 89 d0 f0 0f b1 0b 75 5c 09 ca 78 3d 48
[ 1760.009758] RSP: 0000:ffffb6f403603e20 EFLAGS: 00010286
[ 1760.015013] RAX: 0000000000000000 RBX: ffffffffffffffe8 RCX: 0000000000000000
[ 1760.022188] RDX: ffffb6f403603e50 RSI: ffffffffb11e95b0 RDI: ffff9f73b09e9400
[ 1760.029362] RBP: ffff9f73b09e9478 R08: 000000000000000f R09: 0000000000000000
[ 1760.036536] R10: ffffffffffffff00 R11: ffffb6f403603d80 R12: ffffb6f403603e50
[ 1760.043712] R13: ffffffffb11e95b0 R14: ffffffffb28531e8 R15: ffff9f7a6fbdf548
[ 1760.050887] FS: 0000000000000000(0000) GS:ffff9f7a6fbc0000(0000) knlGS:0000000000000000
[ 1760.059025] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 1760.064801] CR2: ffffffffffffffe8 CR3: 00000007ab610002 CR4: 00000000007706e0
[ 1760.071976] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 1760.079150] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[ 1760.086325] PKRU: 55555554
[ 1760.089044] Call Trace:
[ 1760.091501] <TASK>
[ 1760.093612] ? show_trace_log_lvl+0x1c4/0x2df
[ 1760.097995] ? show_trace_log_lvl+0x1c4/0x2df
[ 1760.102377] ? __io_wq_cpu_online+0x54/0xb0
[ 1760.106584] ? __die_body.cold+0x8/0xd
[ 1760.110356] ? page_fault_oops+0x134/0x170
[ 1760.114479] ? kernelmode_fixup_or_oops+0x84/0x110
[ 1760.119298] ? exc_page_fault+0xa8/0x150
[ 1760.123247] ? asm_exc_page_fault+0x22/0x30
[ 1760.127458] ? __pfx_io_wq_worker_affinity+0x10/0x10
[ 1760.132453] ? __pfx_io_wq_worker_affinity+0x10/0x10
[ 1760.137446] ? io_wq_for_each_worker.isra.0+0x24/0xa0
[ 1760.142527] __io_wq_cpu_online+0x54/0xb0
[ 1760.146558] cpuhp_invoke_callback+0x109/0x460
[ 1760.151029] ? __pfx_io_wq_cpu_offline+0x10/0x10
[ 1760.155673] ? __pfx_smpboot_thread_fn+0x10/0x10
[ 1760.160320] cpuhp_thread_fun+0x8d/0x140
[ 1760.164266] smpboot_thread_fn+0xd3/0x1a0
[ 1760.168297] kthread+0xdd/0x100
[ 1760.171457] ? __pfx_kthread+0x10/0x10
[ 1760.175225] ret_from_fork+0x29/0x50
[ 1760.178826] </TASK>
[ 1760.181022] Modules linked in: rpcsec_gss_krb5 auth_rpcgss nfsv4 dns_resolver nfs lockd grace fscache netfs rfkill sunrpc vfat fat dm_multipath intel_rapl_msr intel_rapl_common isst_if_common ipmi_ssif nfit libnvdimm mgag200 i2c_algo_bit ioatdma drm_shmem_helper drm_kms_helper acpi_ipmi syscopyarea x86_pkg_temp_thermal sysfillrect ipmi_si intel_powerclamp sysimgblt ipmi_devintf coretemp acpi_power_meter ipmi_msghandler rapl pcspkr dca intel_pch_thermal intel_cstate ses lpc_ich intel_uncore enclosure hpilo mei_me mei acpi_tad fuse drm xfs sd_mod sg bnx2x nvme nvme_core crct10dif_pclmul crc32_pclmul nvme_common ghash_clmulni_intel smartpqi tg3 t10_pi mdio uas libcrc32c crc32c_intel scsi_transport_sas usb_storage hpwdt wmi dm_mirror dm_region_hash dm_log dm_mod
[ 1760.248623] CR2: ffffffffffffffe8

A cpu hotplug callback was issued before wq->all_list was initialized.
This results in a null pointer dereference. The fix is to fully setup
the io_wq before calling cpuhp_state_add_instance_nocalls().

Signed-off-by: Jeff Moyer <jmoyer@redhat.com>
Link: https://lore.kernel.org/r/x49y1ghnecs.fsf@segfault.boston.devel.redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>

Conflicts:
	io_uring/io-wq.c
[Commit da64d6db3bd3 ("io_uring: One wqe per wq") changed the allocation
mode of io_wq and the release mode of cpu_mask.]
Signed-off-by: Li Lingfeng <lilingfeng3@huawei.com>
---
 io_uring/io-wq.c | 9 ++++-----
 1 file changed, 4 insertions(+), 5 deletions(-)

diff --git a/io_uring/io-wq.c b/io_uring/io-wq.c
index e25ab32414f4..14d70520ed9e 100644
--- a/io_uring/io-wq.c
+++ b/io_uring/io-wq.c
@@ -1150,9 +1150,6 @@ struct io_wq *io_wq_create(unsigned bounded, struct io_wq_data *data)
 	wq = kzalloc(struct_size(wq, wqes, nr_node_ids), GFP_KERNEL);
 	if (!wq)
 		return ERR_PTR(-ENOMEM);
-	ret = cpuhp_state_add_instance_nocalls(io_wq_online, &wq->cpuhp_node);
-	if (ret)
-		goto err_wq;
 
 	refcount_inc(&data->hash->refs);
 	wq->hash = data->hash;
@@ -1195,17 +1192,19 @@ struct io_wq *io_wq_create(unsigned bounded, struct io_wq_data *data)
 	wq->task = get_task_struct(data->task);
 	atomic_set(&wq->worker_refs, 1);
 	init_completion(&wq->worker_done);
+	ret = cpuhp_state_add_instance_nocalls(io_wq_online, &wq->cpuhp_node);
+	if (ret)
+		goto err;
+
 	return wq;
 err:
 	io_wq_put_hash(data->hash);
-	cpuhp_state_remove_instance_nocalls(io_wq_online, &wq->cpuhp_node);
 	for_each_node(node) {
 		if (!wq->wqes[node])
 			continue;
 		free_cpumask_var(wq->wqes[node]->cpu_mask);
 		kfree(wq->wqes[node]);
 	}
-err_wq:
 	kfree(wq);
 	return ERR_PTR(ret);
 }
-- 
2.31.1
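
Note for readers of the archive: the ordering rule behind this fix
(finish initialising an object before registering it for callbacks) can be
sketched in userspace C. Everything named toy_* below is hypothetical and
only models the idea. A single global pointer stands in for the cpuhp
instance list, and calloc() stands in for kzalloc().

```c
#include <stddef.h>
#include <stdlib.h>

/* Minimal circular list head, in the spirit of struct list_head. */
struct toy_list {
	struct toy_list *next, *prev;
};

struct toy_wq {
	struct toy_list all_list;
};

/* Hypothetical one-slot registry standing in for the hotplug instances. */
static struct toy_wq *toy_registered;

/*
 * Stand-in for a hotplug callback: walks all_list. On a zeroed, not yet
 * initialised list head, next would be NULL and the walk would fault.
 */
static int toy_online_cb(struct toy_wq *wq)
{
	int n = 0;

	for (struct toy_list *p = wq->all_list.next; p != &wq->all_list;
	     p = p->next)
		n++;
	return n;
}

static int toy_register(struct toy_wq *wq)
{
	toy_registered = wq;	/* callbacks may fire from here on */
	return 0;
}

static struct toy_wq *toy_wq_create(void)
{
	struct toy_wq *wq = calloc(1, sizeof(*wq));

	if (!wq)
		return NULL;

	/* Fully set up the object first ... */
	wq->all_list.next = &wq->all_list;
	wq->all_list.prev = &wq->all_list;

	/* ... and only then make it reachable by callbacks. */
	if (toy_register(wq)) {
		free(wq);
		return NULL;
	}
	return wq;
}
```

Registering last also simplifies the error path, as in the patch: no
unregister call is needed when an earlier setup step fails.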

From: Al Viro <viro@zeniv.linux.org.uk>

mainline inclusion
from mainline-v6.7-rc1
commit 1939316bf988f3e49a07d9c4dd6f660bf4daa53d
category: bugfix
bugzilla: https://gitee.com/openeuler/kernel/issues/IC6ES1

Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i...

--------------------------------

->ki_pos value is unreliable in such cases. For an obvious example,
consider O_DSYNC write - we feed the data to page cache and start IO,
then we make sure it's completed. Update of ->ki_pos is dealt with by
the first part; failure in the second ends up with negative value
returned _and_ ->ki_pos left advanced as if sync had been successful.
In the same situation write(2) does not advance the file position at
all.

Reviewed-by: Christian Brauner <brauner@kernel.org>
Reviewed-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

Conflicts:
	io_uring/rw.c
[Commit f3b44f92e59a ("io_uring: move read/write related opcodes to its
own file") moved kiocb_done from io_uring.c to rw.c]
Signed-off-by: Li Lingfeng <lilingfeng3@huawei.com>
---
 io_uring/io_uring.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/io_uring/io_uring.c b/io_uring/io_uring.c
index ea821d5c54a7..bdbd8fe36773 100644
--- a/io_uring/io_uring.c
+++ b/io_uring/io_uring.c
@@ -3073,7 +3073,7 @@ static void kiocb_done(struct kiocb *kiocb, ssize_t ret,
 {
 	struct io_kiocb *req = container_of(kiocb, struct io_kiocb, rw.kiocb);
 
-	if (req->flags & REQ_F_CUR_POS)
+	if (ret >= 0 && req->flags & REQ_F_CUR_POS)
 		req->file->f_pos = kiocb->ki_pos;
 	if (ret >= 0 && (kiocb->ki_complete == io_complete_rw)) {
 		if (!__io_complete_rw_common(req, ret)) {
-- 
2.31.1
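
Note for readers of the archive: the one-line change can be illustrated
with a small userspace model. The toy_file struct and toy_kiocb_done()
are hypothetical; they only mirror the guard "ret >= 0 &&
REQ_F_CUR_POS" and are not the kernel types.

```c
#include <stdbool.h>

/* Hypothetical stand-in for struct file, reduced to its position. */
struct toy_file {
	long long f_pos;
};

/*
 * Models the patched kiocb_done(): copy the (possibly advanced) ki_pos
 * back to the file only when the I/O succeeded and the request used the
 * current file position.
 */
static void toy_kiocb_done(struct toy_file *file, long long ki_pos,
			   long ret, bool uses_cur_pos)
{
	if (ret >= 0 && uses_cur_pos)
		file->f_pos = ki_pos;
}
```

For the O_DSYNC case from the commit message, ki_pos may already have
moved forward when the sync step fails. With the guard, the negative
return value leaves f_pos where it was, which matches write(2).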

From: Jens Axboe <axboe@kernel.dk>

mainline inclusion
from mainline-v6.7-rc2
commit a0d45c3f596be53c1bd8822a1984532d14fdcea9
category: bugfix
bugzilla: https://gitee.com/openeuler/kernel/issues/IC6ES1

Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i...

--------------------------------

A previous commit added a trylock for getting the SQPOLL thread info via
fdinfo, but this introduced a regression where we often fail to get it if
the thread is busy. For that case, we end up not printing the current CPU
and PID info.

Rather than rely on this lock, just print the pid we already stored in
the io_sq_data struct, and ensure we update the current CPU every time
we've slept or potentially rescheduled. The latter won't potentially be
100% accurate, but that wasn't the case before either as the task can
get migrated at any time unless it has been pinned at creation time.

We retain keeping the io_sq_data dereference inside the ctx->uring_lock,
as it has always been, as destruction of the thread and data happen below
that. We could make this RCU safe, but there's little point in doing that.

With this, we always print the last valid information we had, rather than
have spurious outputs with missing information.

Fixes: 7644b1a1c9a7 ("io_uring/fdinfo: lock SQ thread while retrieving thread cpu/pid")
Signed-off-by: Jens Axboe <axboe@kernel.dk>

Conflicts:
	io_uring/fdinfo.c
	io_uring/sqpoll.c
[Commit a4ad4f748ea9 ("io_uring: move fdinfo helpers to its own file")
moved io_uring_show_fdinfo from io_uring.c to fdinfo.c;
commit 17437f311490 ("io_uring: move SQPOLL related handling into its own
file") moved io_sqd_handle_event/io_sq_thread from io_uring.c to sqpoll.c]
Signed-off-by: Li Lingfeng <lilingfeng3@huawei.com>
---
 io_uring/io_uring.c | 21 ++++++++++++---------
 1 file changed, 12 insertions(+), 9 deletions(-)

diff --git a/io_uring/io_uring.c b/io_uring/io_uring.c
index bdbd8fe36773..fa0370fdb1d6 100644
--- a/io_uring/io_uring.c
+++ b/io_uring/io_uring.c
@@ -7523,6 +7523,7 @@ static bool io_sqd_handle_event(struct io_sq_data *sqd)
 			did_sig = get_signal(&ksig);
 		cond_resched();
 		mutex_lock(&sqd->lock);
+		sqd->sq_cpu = raw_smp_processor_id();
 	}
 	return did_sig || test_bit(IO_SQ_THREAD_SHOULD_STOP, &sqd->state);
 }
@@ -7538,10 +7539,15 @@ static int io_sq_thread(void *data)
 	snprintf(buf, sizeof(buf), "iou-sqp-%d", sqd->task_pid);
 	set_task_comm(current, buf);
 
-	if (sqd->sq_cpu != -1)
+	/* reset to our pid after we've set task_comm, for fdinfo */
+	sqd->task_pid = current->pid;
+
+	if (sqd->sq_cpu != -1) {
 		set_cpus_allowed_ptr(current, cpumask_of(sqd->sq_cpu));
-	else
+	} else {
 		set_cpus_allowed_ptr(current, cpu_online_mask);
+		sqd->sq_cpu = raw_smp_processor_id();
+	}
 
 	mutex_lock(&sqd->lock);
 	while (1) {
@@ -7565,6 +7571,7 @@ static int io_sq_thread(void *data)
 
 		if (sqt_spin || !time_after(jiffies, timeout)) {
 			cond_resched();
+			sqd->sq_cpu = raw_smp_processor_id();
 			if (sqt_spin)
 				timeout = jiffies + sqd->sq_thread_idle;
 			continue;
@@ -7592,6 +7599,7 @@ static int io_sq_thread(void *data)
 				mutex_unlock(&sqd->lock);
 				schedule();
 				mutex_lock(&sqd->lock);
+				sqd->sq_cpu = raw_smp_processor_id();
 			}
 			list_for_each_entry(ctx, &sqd->ctx_list, sqd_list)
 				io_ring_clear_wakeup_flag(ctx);
@@ -10054,13 +10062,8 @@ static void __io_uring_show_fdinfo(struct io_ring_ctx *ctx, struct seq_file *m)
 	if (has_lock && (ctx->flags & IORING_SETUP_SQPOLL)) {
 		struct io_sq_data *sq = ctx->sq_data;
 
-		if (mutex_trylock(&sq->lock)) {
-			if (sq->thread) {
-				sq_pid = task_pid_nr(sq->thread);
-				sq_cpu = task_cpu(sq->thread);
-			}
-			mutex_unlock(&sq->lock);
-		}
+		sq_pid = sq->task_pid;
+		sq_cpu = sq->sq_cpu;
 	}
 
 	seq_printf(m, "SqThread:\t%d\n", sq_pid);
-- 
2.31.1
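
Note for readers of the archive: the difference between the old trylock
reader and the new "print what the thread last published" reader can be
modelled without any real locking. This is a sketch only; struct toy_sqd
and both helpers are hypothetical, and a plain flag stands in for
sqd->lock being held by the busy SQPOLL thread.

```c
#include <stdbool.h>

/* Hypothetical, reduced model of struct io_sq_data. */
struct toy_sqd {
	bool lock_held;		/* SQPOLL thread is busy and holds the lock */
	int live_pid;		/* what task_pid_nr()/task_cpu() would return */
	int live_cpu;
	int task_pid;		/* values the thread publishes itself */
	int sq_cpu;
};

/* Old behaviour: trylock, and report -1/-1 whenever the thread is busy. */
static void toy_show_trylock(const struct toy_sqd *sqd, int *pid, int *cpu)
{
	*pid = -1;
	*cpu = -1;
	if (!sqd->lock_held) {
		*pid = sqd->live_pid;
		*cpu = sqd->live_cpu;
	}
}

/* New behaviour: always report the last published values. */
static void toy_show_cached(const struct toy_sqd *sqd, int *pid, int *cpu)
{
	*pid = sqd->task_pid;
	*cpu = sqd->sq_cpu;
}
```

The cached CPU can lag behind the real one between updates, which the
commit message accepts, but the output never has missing fields.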

From: Jens Axboe <axboe@kernel.dk>

mainline inclusion
from mainline-v6.9-rc3
commit 73eaa2b583493b680c6f426531d6736c39643bfb
category: bugfix
bugzilla: https://gitee.com/openeuler/kernel/issues/IC6ES1

Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i...

--------------------------------

Rather than use the system unbound event workqueue, use an io_uring
specific one. This avoids dependencies with the tty, which also uses
the system_unbound_wq, and issues flushes of said workqueue from inside
its poll handling.

Cc: stable@vger.kernel.org
Reported-by: Rasmus Karlsson <rasmus.karlsson@pajlada.com>
Tested-by: Rasmus Karlsson <rasmus.karlsson@pajlada.com>
Tested-by: Iskren Chernev <me@iskren.info>
Link: https://github.com/axboe/liburing/issues/1113
Signed-off-by: Jens Axboe <axboe@kernel.dk>

Conflicts:
	io_uring/io_uring.c
[Commit b3a4dbc89d40 ("io_uring/kbuf: Use slab for struct io_buffer
objects") added initialization of io_buf_cachep.]
Signed-off-by: Li Lingfeng <lilingfeng3@huawei.com>
---
 io_uring/io_uring.c | 5 ++++-
 1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/io_uring/io_uring.c b/io_uring/io_uring.c
index fa0370fdb1d6..6677213e07b4 100644
--- a/io_uring/io_uring.c
+++ b/io_uring/io_uring.c
@@ -1109,6 +1109,7 @@ static int io_close_fixed(struct io_kiocb *req, unsigned int issue_flags);
 static enum hrtimer_restart io_link_timeout_fn(struct hrtimer *timer);
 
 static struct kmem_cache *req_cachep;
+static struct workqueue_struct *iou_wq __ro_after_init;
 
 static const struct file_operations io_uring_fops;
 
@@ -9494,7 +9495,7 @@ static void io_ring_ctx_wait_and_kill(struct io_ring_ctx *ctx)
 	 * noise and overhead, there's no discernable change in runtime
 	 * over using system_wq.
 	 */
-	queue_work(system_unbound_wq, &ctx->exit_work);
+	queue_work(iou_wq, &ctx->exit_work);
 }
 
 static int io_uring_release(struct inode *inode, struct file *file)
@@ -10999,6 +11000,8 @@ static int __init io_uring_init(void)
 
 	req_cachep = KMEM_CACHE(io_kiocb, SLAB_HWCACHE_ALIGN | SLAB_PANIC |
 				SLAB_ACCOUNT);
+
+	iou_wq = alloc_workqueue("iou_exit", WQ_UNBOUND, 64);
 	return 0;
 };
 __initcall(io_uring_init);
-- 
2.31.1

From: Pavel Begunkov <asml.silence@gmail.com>

mainline inclusion
from mainline-v6.15-rc1
commit 7215469659cb9751a9bf80e43b24a48749004d26
category: bugfix
bugzilla: https://gitee.com/openeuler/kernel/issues/IC6ES1

Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i...

--------------------------------

alloc_workqueue() can fail even during init in io_uring_init(), check
the result and panic if anything went wrong.

Fixes: 73eaa2b583493 ("io_uring: use private workqueue for exit work")
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/3a046063902f888f66151f89fa42f84063b9727b.173834308...
Signed-off-by: Jens Axboe <axboe@kernel.dk>

Conflicts:
	io_uring/io_uring.c
[Commit 76d3ccecfa18 ("io_uring: add a sysctl to disable io_uring
system-wide") added register_sysctl_init in io_uring_init.]
Signed-off-by: Li Lingfeng <lilingfeng3@huawei.com>
---
 io_uring/io_uring.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/io_uring/io_uring.c b/io_uring/io_uring.c
index 6677213e07b4..78d89002242a 100644
--- a/io_uring/io_uring.c
+++ b/io_uring/io_uring.c
@@ -11002,6 +11002,7 @@ static int __init io_uring_init(void)
 				SLAB_ACCOUNT);
 
 	iou_wq = alloc_workqueue("iou_exit", WQ_UNBOUND, 64);
+	BUG_ON(!iou_wq);
 	return 0;
 };
 __initcall(io_uring_init);
-- 
2.31.1

From: Jens Axboe <axboe@kernel.dk>

mainline inclusion
from mainline-v6.12-rc4
commit 28aabffae6be54284869a91cd8bccd3720041129
category: bugfix
bugzilla: https://gitee.com/openeuler/kernel/issues/IC6ES1

Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i...

--------------------------------

When an application uses SQPOLL, it must wait for the SQPOLL thread to
consume SQE entries, if it fails to get an sqe when calling
io_uring_get_sqe(). It can do so by calling io_uring_enter(2) with the
flag value of IORING_ENTER_SQ_WAIT. In liburing, this is generally done
with io_uring_sqring_wait(). There's a natural expectation that once
this call returns, a new SQE entry can be retrieved, filled out, and
submitted. However, the kernel uses the cached sq head to determine if
the SQRING is full or not. If the SQPOLL thread is currently in the
process of submitting SQE entries, it may have updated the cached sq
head, but not yet committed it to the SQ ring. Hence the kernel may find
that there are SQE entries ready to be consumed, and return successfully
to the application. If the SQPOLL thread hasn't yet committed the SQ
ring entries by the time the application returns to userspace and
attempts to get a new SQE, it will fail getting a new SQE.

Fix this by having io_sqring_full() always use the user visible SQ ring
head entry, rather than the internally cached one.

Cc: stable@vger.kernel.org # 5.10+
Link: https://github.com/axboe/liburing/discussions/1267
Reported-by: Benedek Thaler <thaler@thaler.hu>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

Conflicts:
	io_uring/io_uring.c
[Commit 17437f311490 ("io_uring: move SQPOLL related handling into its own
file") moved io_sqring_full from io_uring.c to io_uring.h]
Signed-off-by: Li Lingfeng <lilingfeng3@huawei.com>
---
 io_uring/io_uring.c | 9 ++++++++-
 1 file changed, 8 insertions(+), 1 deletion(-)

diff --git a/io_uring/io_uring.c b/io_uring/io_uring.c
index 78d89002242a..c4264ab42a2d 100644
--- a/io_uring/io_uring.c
+++ b/io_uring/io_uring.c
@@ -1587,7 +1587,14 @@ static inline bool io_sqring_full(struct io_ring_ctx *ctx)
 {
 	struct io_rings *r = ctx->rings;
 
-	return READ_ONCE(r->sq.tail) - ctx->cached_sq_head == ctx->sq_entries;
+	/*
+	 * SQPOLL must use the actual sqring head, as using the cached_sq_head
+	 * is race prone if the SQPOLL thread has grabbed entries but not yet
+	 * committed them to the ring. For !SQPOLL, this doesn't matter, but
+	 * since this helper is just used for SQPOLL sqring waits (or POLLOUT),
+	 * just read the actual sqring head unconditionally.
+	 */
+	return READ_ONCE(r->sq.tail) - READ_ONCE(r->sq.head) == ctx->sq_entries;
 }
 
 static inline unsigned int __io_cqring_events(struct io_ring_ctx *ctx)
-- 
2.31.1
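
Note for readers of the archive: the race is easy to see with concrete
ring indices. The sketch below is a userspace model under stated
assumptions. struct toy_sq and the two helpers are hypothetical, and plain
loads replace READ_ONCE(); only the index arithmetic matches the patch.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical, reduced model of the SQ ring indices. */
struct toy_sq {
	uint32_t tail;		/* advanced by the application */
	uint32_t head;		/* committed by the consumer, user visible */
	uint32_t cached_head;	/* consumer-private, may run ahead of head */
	uint32_t entries;	/* ring size */
};

/* Old check: uses the consumer's private, not yet committed head. */
static bool toy_full_cached(const struct toy_sq *sq)
{
	return sq->tail - sq->cached_head == sq->entries;
}

/* Patched check: uses the head the application can actually see. */
static bool toy_full_visible(const struct toy_sq *sq)
{
	return sq->tail - sq->head == sq->entries;
}
```

Take a ring of 8 entries that the application has filled (tail = 8),
where the SQPOLL thread has grabbed 3 entries (cached_head = 3) but not
committed them (head = 0). The old check says "not full" and lets the wait
return, while the application still sees 8 - 0 == 8 and cannot get an
SQE. The unsigned subtraction keeps both checks correct across index
wraparound.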

From: Pavel Begunkov <asml.silence@gmail.com>

mainline inclusion
from mainline-v6.13-rc1
commit e358e09a894dbcd51fdbbcf62bec1df249915834
category: bugfix
bugzilla: https://gitee.com/openeuler/kernel/issues/IC6ES1

Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i...

--------------------------------

Syz reports:

BUG: KCSAN: data-race in __se_sys_io_uring_register / io_sqe_files_register

read-write to 0xffff8881021940b8 of 4 bytes by task 5923 on cpu 1:
 io_sqe_files_register+0x2c4/0x3b0 io_uring/rsrc.c:713
 __io_uring_register io_uring/register.c:403 [inline]
 __do_sys_io_uring_register io_uring/register.c:611 [inline]
 __se_sys_io_uring_register+0x8d0/0x1280 io_uring/register.c:591
 __x64_sys_io_uring_register+0x55/0x70 io_uring/register.c:591
 x64_sys_call+0x202/0x2d60 arch/x86/include/generated/asm/syscalls_64.h:428
 do_syscall_x64 arch/x86/entry/common.c:52 [inline]
 do_syscall_64+0xc9/0x1c0 arch/x86/entry/common.c:83
 entry_SYSCALL_64_after_hwframe+0x77/0x7f

read to 0xffff8881021940b8 of 4 bytes by task 5924 on cpu 0:
 __do_sys_io_uring_register io_uring/register.c:613 [inline]
 __se_sys_io_uring_register+0xe4a/0x1280 io_uring/register.c:591
 __x64_sys_io_uring_register+0x55/0x70 io_uring/register.c:591
 x64_sys_call+0x202/0x2d60 arch/x86/include/generated/asm/syscalls_64.h:428
 do_syscall_x64 arch/x86/entry/common.c:52 [inline]
 do_syscall_64+0xc9/0x1c0 arch/x86/entry/common.c:83
 entry_SYSCALL_64_after_hwframe+0x77/0x7f

Which should be due to reading the table size after unlock. We don't
care much as it's just to print it in trace, but we might as well do it
under the lock.

Reported-by: syzbot+5a486fef3de40e0d8c76@syzkaller.appspotmail.com
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/8233af2886a37b57f79e444e3db88fcfda1817ac.173194220...
Signed-off-by: Jens Axboe <axboe@kernel.dk>

Conflicts:
	io_uring/register.c
[Commit c43203154d8a ("io_uring/register: move io_uring_register(2)
related code to register.c") moved io_uring_register from io_uring.c to
register.c]
Signed-off-by: Li Lingfeng <lilingfeng3@huawei.com>
---
 io_uring/io_uring.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/io_uring/io_uring.c b/io_uring/io_uring.c
index c4264ab42a2d..631f02ddacf5 100644
--- a/io_uring/io_uring.c
+++ b/io_uring/io_uring.c
@@ -10942,9 +10942,10 @@ SYSCALL_DEFINE4(io_uring_register, unsigned int, fd, unsigned int, opcode,
 
 	mutex_lock(&ctx->uring_lock);
 	ret = __io_uring_register(ctx, opcode, arg, nr_args);
-	mutex_unlock(&ctx->uring_lock);
+
 	trace_io_uring_register(ctx, opcode, ctx->nr_user_files, ctx->nr_user_bufs,
 							ctx->cq_ev_fd != NULL, ret);
+	mutex_unlock(&ctx->uring_lock);
 out_fput:
 	fdput(f);
 	return ret;
-- 
2.31.1

From: Pavel Begunkov <asml.silence@gmail.com>

mainline inclusion
from mainline-v6.13-rc5
commit e33ac68e5e21ec1292490dfe061e75c0dbdd3bd4
category: bugfix
bugzilla: https://gitee.com/openeuler/kernel/issues/IC6ES1

Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i...

--------------------------------

BUG: KASAN: slab-use-after-free in __lock_acquire+0x370b/0x4a10 kernel/locking/lockdep.c:5089
Call Trace:
<TASK>
...
_raw_spin_lock_irqsave+0x3d/0x60 kernel/locking/spinlock.c:162
class_raw_spinlock_irqsave_constructor include/linux/spinlock.h:551 [inline]
try_to_wake_up+0xb5/0x23c0 kernel/sched/core.c:4205
io_sq_thread_park+0xac/0xe0 io_uring/sqpoll.c:55
io_sq_thread_finish+0x6b/0x310 io_uring/sqpoll.c:96
io_sq_offload_create+0x162/0x11d0 io_uring/sqpoll.c:497
io_uring_create io_uring/io_uring.c:3724 [inline]
io_uring_setup+0x1728/0x3230 io_uring/io_uring.c:3806
...

Kun Hu reports that the SQPOLL creating error path has UAF, which
happens if io_uring_alloc_task_context() fails and then io_sq_thread()
manages to run and complete before the rest of error handling code,
which means io_sq_thread_finish() is looking at already killed task.

Note that this is mostly theoretical, requiring fault injection on
the allocation side to trigger in practice.

Cc: stable@vger.kernel.org
Reported-by: Kun Hu <huk23@m.fudan.edu.cn>
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/0f2f1aa5729332612bd01fe0f2f385fd1f06ce7c.173523171...
Signed-off-by: Jens Axboe <axboe@kernel.dk>

Conflicts:
	io_uring/sqpoll.c
[Commit 17437f311490 ("io_uring: move SQPOLL related handling into its own
file") moved io_sq_offload_create from io_uring.c to sqpoll.c]
Signed-off-by: Li Lingfeng <lilingfeng3@huawei.com>
---
 io_uring/io_uring.c | 6 ++++++
 1 file changed, 6 insertions(+)

diff --git a/io_uring/io_uring.c b/io_uring/io_uring.c
index 631f02ddacf5..fee6459547d9 100644
--- a/io_uring/io_uring.c
+++ b/io_uring/io_uring.c
@@ -8583,6 +8583,7 @@ void __io_uring_free(struct task_struct *tsk)
 static int io_sq_offload_create(struct io_ring_ctx *ctx,
 				struct io_uring_params *p)
 {
+	struct task_struct *task_to_put = NULL;
 	int ret;
 
 	/* Retain compatibility with failing for an invalid attach attempt */
@@ -8648,6 +8649,7 @@ static int io_sq_offload_create(struct io_ring_ctx *ctx,
 		}
 
 		sqd->thread = tsk;
+		task_to_put = get_task_struct(tsk);
 		ret = io_uring_alloc_task_context(tsk, ctx);
 		wake_up_new_task(tsk);
 		if (ret)
@@ -8658,11 +8660,15 @@ static int io_sq_offload_create(struct io_ring_ctx *ctx,
 		goto err;
 	}
 
+	if (task_to_put)
+		put_task_struct(task_to_put);
 	return 0;
 err_sqpoll:
 	complete(&ctx->sq_data->exited);
 err:
 	io_sq_thread_finish(ctx);
+	if (task_to_put)
+		put_task_struct(task_to_put);
 	return ret;
 }
 
-- 
2.31.1
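
Note for readers of the archive: the task_to_put pattern (pin the task
with an extra reference before it is allowed to run) can be sketched with
a toy reference counter. All toy_* names are hypothetical. The "freed"
flag only records that the last reference was dropped, so the model never
touches freed memory itself.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a refcounted task_struct. */
struct toy_task {
	int refs;
	bool freed;	/* set when the last reference is dropped */
};

static struct toy_task *toy_get(struct toy_task *t)
{
	t->refs++;
	return t;
}

static void toy_put(struct toy_task *t)
{
	if (--t->refs == 0)
		t->freed = true;
}

/* The new thread runs to completion and drops its own reference. */
static void toy_thread_runs_and_exits(struct toy_task *t)
{
	toy_put(t);
}

/*
 * Models the error path: returns whether the task is still alive at the
 * point where the cleanup code (io_sq_thread_finish() in the patch)
 * would look at it.
 */
static bool toy_create_error_path(struct toy_task *t, bool hold_extra_ref)
{
	struct toy_task *task_to_put = hold_extra_ref ? toy_get(t) : NULL;
	bool alive;

	toy_thread_runs_and_exits(t);	/* thread wins the race */
	alive = !t->freed;		/* what the cleanup code sees */

	if (task_to_put)
		toy_put(task_to_put);
	return alive;
}
```

Without the extra reference the task is already gone when the cleanup code
inspects it; with it, the final put happens only after cleanup is done.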

From: Max Kellermann <max.kellermann@ionos.com>

mainline inclusion
from mainline-v6.15-rc1
commit 486ba4d84d62e92716cd395c4b1612b8ce70a257
category: bugfix
bugzilla: https://gitee.com/openeuler/kernel/issues/IC6ES1

Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i...

--------------------------------

Previously, the `hash` variable was initialized with `-1` and only
updated by io_get_next_work() if the current work was hashed. Commit
60cf46ae6054 ("io-wq: hash dependent work") changed this to always
call io_get_work_hash() even if the work was not hashed. This caused
the `hash != -1U` check to always be true, adding some overhead for
the `hash->wait` code.

This patch fixes the regression by checking the `IO_WQ_WORK_HASHED`
flag.

Perf diff for a flood of `IORING_OP_NOP` with `IOSQE_ASYNC`:

    38.55%     -1.57%  [kernel.kallsyms]  [k] queued_spin_lock_slowpath
     6.86%     -0.72%  [kernel.kallsyms]  [k] io_worker_handle_work
     0.10%     +0.67%  [kernel.kallsyms]  [k] put_prev_entity
     1.96%     +0.59%  [kernel.kallsyms]  [k] io_nop_prep
     3.31%     -0.51%  [kernel.kallsyms]  [k] try_to_wake_up
     7.18%     -0.47%  [kernel.kallsyms]  [k] io_wq_free_work

Fixes: 60cf46ae6054 ("io-wq: hash dependent work")
Cc: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Max Kellermann <max.kellermann@ionos.com>
Link: https://lore.kernel.org/r/20250128133927.3989681-6-max.kellermann@ionos.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>

Conflicts:
	io_uring/io-wq.c
[Commit 6ee78354eaa6 ("io_uring/io-wq: cache work->flags in variable")
replaced io_get_work_hash with __io_get_work_hash.]
Signed-off-by: Li Lingfeng <lilingfeng3@huawei.com>
---
 io_uring/io-wq.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/io_uring/io-wq.c b/io_uring/io-wq.c
index 14d70520ed9e..b14af1b6f9dc 100644
--- a/io_uring/io-wq.c
+++ b/io_uring/io-wq.c
@@ -577,7 +577,9 @@ static void io_worker_handle_work(struct io_worker *worker)
 		/* handle a whole dependent link */
 		do {
 			struct io_wq_work *next_hashed, *linked;
-			unsigned int hash = io_get_work_hash(work);
+			unsigned int hash = io_wq_is_hashed(work)
+				? io_get_work_hash(work)
+				: -1U;
 
 			next_hashed = wq_next_work(work);
 
-- 
2.31.1
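
Note for readers of the archive: why the unconditional hash lookup is
"bogus" can be shown with a toy flags word. The layout below (a HASHED
flag bit plus a bucket number in the top byte) is an assumption made for
illustration, and all TOY_/toy_ names are hypothetical rather than the
kernel definitions.

```c
#include <stdbool.h>

/* Hypothetical flag layout: one "hashed" bit, bucket in the top 8 bits. */
#define TOY_WORK_HASHED	(1u << 2)
#define TOY_HASH_SHIFT	24

struct toy_work {
	unsigned int flags;
};

static bool toy_is_hashed(const struct toy_work *w)
{
	return (w->flags & TOY_WORK_HASHED) != 0;
}

/* Extracts the bucket bits without checking whether they are valid. */
static unsigned int toy_get_hash(const struct toy_work *w)
{
	return w->flags >> TOY_HASH_SHIFT;
}

/* Patched behaviour: unhashed work reports the -1U "no hash" sentinel. */
static unsigned int toy_hash_for(const struct toy_work *w)
{
	return toy_is_hashed(w) ? toy_get_hash(w) : -1U;
}
```

For unhashed work the raw extraction yields 0, which is indistinguishable
from bucket 0 and never equals -1U, so a `hash != -1U` test would always
take the hashed path. The guarded version restores the sentinel.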
Participants (2): Li Lingfeng, patchwork bot