[PATCH OLK-6.6] io_uring: validate user-controlled cq.head in io_cqe_cache_refill()

18 May 2026

mainline inclusion
from mainline-v7.1-rc4
commit f44d38a31f1802b7222adaea9ee69f9d280f698a
category: bugfix
bugzilla: https://atomgit.com/openeuler/kernel/issues/9154

Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i...

--------------------------------

A fuzzing run reproduced an unkillable io_uring task stuck at ~100% CPU:

[root@fedora io_uring_stress]# ps -ef | grep io_uring
root  1240  1  99 13:36 ?  00:01:35 [io_uring_stress] <defunct>

The task loops inside io_cqring_wait() and never returns to userspace,
and SIGKILL has no effect.

This is caused by the CQ ring exposing rings->cq.head to userspace as
writable, while the authoritative tail lives in kernel-private
ctx->cached_cq_tail. io_cqe_cache_refill() computes free space as an
unsigned subtraction:

    free = ctx->cq_entries - min(tail - head, ctx->cq_entries);

If userspace keeps head within [0, tail], the subtraction is well
defined and min() just acts as a defensive clamp. But if userspace
advances head past tail, (tail - head) wraps to a huge value, free
becomes 0, and io_cqe_cache_refill() fails. The CQE is pushed onto the
overflow list and IO_CHECK_CQ_OVERFLOW_BIT is set.

The wait loop in io_cqring_wait() relies on an invariant: refill() only
fails when the CQ is *physically* full, in which case rings->cq.tail has
been advanced to iowq->cq_tail and io_should_wake() returns true. The
tampered head breaks this: refill() fails while the ring is not full, no
OCQE is copied in, rings->cq.tail never catches up, io_should_wake()
stays false, and io_cqring_wait_schedule() keeps returning early because
IO_CHECK_CQ_OVERFLOW_BIT is still set. The result is a tight retry loop
that never returns to userspace.

Introduce io_cqring_queued() as the single point that converts the
(tail, head) pair into a trustworthy queued count. Since the real
head/tail distance is bounded by cq_entries (far below 2^31), a signed
comparison reliably detects userspace moving head past tail; in that
case treat the queue as empty so callers see the full cache as free and
forward progress is preserved.

Suggested-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Zizhi Wo <wozizhi@huawei.com>
Link: https://patch.msgid.link/20260514021847.4062782-1-wozizhi@huaweicloud.com
[axboe: fixup commit message, kill 'queued' var, and keep it all in
io_uring.c]
Signed-off-by: Jens Axboe <axboe@kernel.dk>

Conflicts:
	io_uring/io_uring.c
[Not has io_fill_nop_cqe(), not affect this patch.]

Signed-off-by: Zizhi Wo <wozizhi@huawei.com>
---
 io_uring/io_uring.c | 20 ++++++++++++++++----
 1 file changed, 16 insertions(+), 4 deletions(-)

diff --git a/io_uring/io_uring.c b/io_uring/io_uring.c
index 20a657c8d625..4b7927f92389 100644
--- a/io_uring/io_uring.c
+++ b/io_uring/io_uring.c
@@ -835,32 +835,44 @@ void io_req_cqe_overflow(struct io_kiocb *req)
 				req->cqe.res, req->cqe.flags,
 				req->big_cqe.extra1, req->big_cqe.extra2);
 	memset(&req->big_cqe, 0, sizeof(req->big_cqe));
 }
 
+/*
+ Compute queued CQEs for free-space calculation, clamped to cq_entries.
+ */
+static unsigned int io_cqring_queued(struct io_ring_ctx *ctx)
+{
+	struct io_rings *rings = io_get_rings(ctx);
+	int diff;
+
+	diff = (int)(ctx->cached_cq_tail - READ_ONCE(rings->cq.head));
+	if (diff >= 0)
+		return min((unsigned int)diff, ctx->cq_entries);
+	return 0;
+}
+
 /*
  * writes to the cq entry need to come after reading head; the
  * control dependency is enough as we're using WRITE_ONCE to
  * fill the cq entry
  */
 bool io_cqe_cache_refill(struct io_ring_ctx *ctx, bool overflow)
 {
 	struct io_rings *rings = ctx->rings;
 	unsigned int off = ctx->cached_cq_tail & (ctx->cq_entries - 1);
-	unsigned int free, queued, len;
+	unsigned int free, len;
 
 	/*
 	 * Posting into the CQ when there are pending overflowed CQEs may break
 	 * ordering guarantees, which will affect links, F_MORE users and more.
 	 * Force overflow the completion.
 	 */
 	if (!overflow && (ctx->check_cq & BIT(IO_CHECK_CQ_OVERFLOW_BIT)))
 		return false;
 
-	/* userspace may cheat modifying the tail, be safe and do min */
-	queued = min(__io_cqring_events(ctx), ctx->cq_entries);
-	free = ctx->cq_entries - queued;
+	free = ctx->cq_entries - io_cqring_queued(ctx);
 	/* we need a contiguous range, limit based on the current array offset */
 	len = min(free, ctx->cq_entries - off);
 	if (!len)
 		return false;
 
-- 
2.52.0