io_uring support
Aleix Roca Nonell (1): io_uring: fix manual setup of iov_iter for fixed buffers
Arnd Bergmann (2): io_uring: fix big-endian compat signal mask handling io_uring: use __kernel_timespec in timeout ABI
Bart Van Assche (1): percpu-refcount: Introduce percpu_ref_resurrect()
Bijan Mottahedeh (13): io_uring: clear req->result always before issuing a read/write request io_uring: process requests completed with -EAGAIN on poll list io_uring: use proper references for fallback_req locking io_uring: don't use kiocb.private to store buf_index io_uring: add io_statx structure statx: allow system call to be invoked from io_uring io_uring: call statx directly statx: hide interfaces no longer used by io_uring io_uring: validate the full range of provided buffers for access io_uring: add wrappers for memory accounting io_uring: rename ctx->account_mem field io_uring: report pinned memory usage io_uring: separate reporting of ring pages from registered pages
Bob Liu (2): io_uring: clean up io_uring_cancel_files() io_uring: introduce req_need_defer()
Brian Gianforcaro (1): io_uring: fix stale comment and a few typos
Christoph Hellwig (2): fs: add an iopoll method to struct file_operations io_uring: add fsync support
Chucheng Luo (1): io_uring: fix missing 'return' in comment
Colin Ian King (3): io_uring: fix shadowed variable ret return code being not checked io_uring: remove redundant variable pointer nxt and io_wq_assign_next call io_uring: Fix sizeof() mismatch
Damien Le Moal (5): aio: Comment use of IOCB_FLAG_IOPRIO aio flag block: Introduce get_current_ioprio() aio: Fix fallback I/O priority value block: prevent merging of requests with different priorities block: Initialize BIO I/O priority early
Dan Carpenter (3): io-wq: remove extra space characters io_uring: remove unnecessary NULL checks io_uring: fix a use after free in io_async_task_func()
Daniel Xu (1): io_uring: increase IORING_MAX_ENTRIES to 32K
Daniele Albano (1): io_uring: always allow drain/link/hardlink/async sqe flags
Deepa Dinamani (5): signal: Add set_user_sigmask() signal: Add restore_user_sigmask() ppoll: use __kernel_timespec pselect6: use __kernel_timespec io_pgetevents: use __kernel_timespec
Denis Efremov (1): io_uring: use kvfree() in io_sqe_buffer_register()
Dmitrii Dolgov (1): io_uring: add set of tracing events
Dmitry Vyukov (1): io_uring: fix sq array offset calculation
Eric Biggers (1): io_uring: fix memory leak of UNIX domain socket inode
Eric W. Biederman (2): signal/cifs: Fix cifs_put_tcp_session to call send_sig instead of force_sig signal: Allow cifs and drbd to receive their terminating signals
Eugene Syromiatnikov (1): io_uring: fix compat for IORING_REGISTER_FILES_UPDATE
Guoyu Huang (1): io_uring: Fix NULL pointer dereference in loop_rw_iter()
Hillf Danton (6): io-wq: remove unused busy list from io_sqe io-wq: add cond_resched() to worker thread io-uring: drop completion when removing file io-uring: drop 'free_pfile' in struct io_file_put io_uring: add missing finish_wait() in io_sq_thread() io-wq: fix use-after-free in io_wq_worker_running
Hristo Venev (1): io_uring: allocate the two rings together
Hrvoje Zeba (1): io_uring: remove superfluous check for sqe->off in io_accept()
Jackie Liu (17): io_uring: adjust smp_rmb inside io_cqring_events io_uring: use wait_event_interruptible for cq_wait conditional wait io_uring: fix io_sq_thread_stop running in front of io_sq_thread io_uring: fix KASAN use after free in io_sq_wq_submit_work io_uring: fix an issue when IOSQE_IO_LINK is inserted into defer list io_uring: fix wrong sequence setting logic io_uring: add support for link with drain io_uring: use kmemdup instead of kmalloc and memcpy io_uring: fix use-after-free of shadow_req io_uring: fix potential crash issue due to io_get_req failure io_uring: replace s->needs_lock with s->in_async io_uring: set -EINTR directly when a signal wakes up in io_cqring_wait io_uring: remove passed in 'ctx' function parameter ctx if possible io_uring: keep io_put_req only responsible for release and put req io_uring: separate the io_free_req and io_free_req_find_next interface io_uring: remove parameter ctx of io_submit_state_start io_uring: remove io_wq_current_is_worker
Jann Horn (2): io_uring: use kzalloc instead of kcalloc for single-element allocations io-wq: fix handling of NUMA node IDs
Jens Axboe (352): Add io_uring IO interface io_uring: support for IO polling fs: add fget_many() and fput_many() io_uring: use fget/fput_many() for file references io_uring: batch io_kiocb allocation io_uring: add support for pre-mapped user IO buffers net: split out functions related to registering inflight socket files io_uring: add file set registration io_uring: add submission polling io_uring: add io_kiocb ref count io_uring: add support for IORING_OP_POLL io_uring: allow workqueue item to handle multiple buffered requests io_uring: add a few test tools tools/io_uring: remove IOCQE_FLAG_CACHEHIT io_uring: use regular request ref counts io_uring: make io_read/write return an integer io_uring: add prepped flag io_uring: fix fget/fput handling io_uring: fix poll races io_uring: retry bulk slab allocs as single allocs io_uring: fix double free in case of fileset regitration failure io_uring: restrict IORING_SETUP_SQPOLL to root io_uring: park SQPOLL thread if it's percpu io_uring: only test SQPOLL cpu after we've verified it io_uring: drop io_file_put() 'file' argument io_uring: fix possible deadlock between io_uring_{enter,register} io_uring: fix CQ overflow condition io_uring: fail io_uring_register(2) on a dying io_uring instance io_uring: remove 'state' argument from io_{read,write} path io_uring: have submission side sqe errors post a cqe io_uring: drop req submit reference always in async punt fs: add sync_file_range() helper io_uring: add support for marking commands as draining io_uring: add support for IORING_OP_SYNC_FILE_RANGE io_uring: add support for eventfd notifications io_uring: fix failure to verify SQ_AFF cpu io_uring: remove 'ev_flags' argument tools/io_uring: fix Makefile for pthread library link tools/io_uring: sync with liburing io_uring: ensure req->file is cleared on allocation uio: make import_iovec()/compat_import_iovec() return bytes on success io_uring: punt short reads to async context io_uring: add support for sqe links io_uring: add support for sendmsg() io_uring: add support for recvmsg() io_uring: don't use iov_iter_advance() for fixed buffers io_uring: ensure ->list is initialized for poll commands io_uring: fix potential hang with polled IO io_uring: don't enter poll loop if we have CQEs pending io_uring: add need_resched() check in inner poll loop io_uring: expose single mmap capability io_uring: optimize submit_and_wait API io_uring: add io_queue_async_work() helper io_uring: limit parallelism of buffered writes io_uring: extend async work merging io_uring: make sqpoll wakeup possible with getevents io_uring: ensure poll commands clear ->sqe io_uring: use cond_resched() in sqthread io_uring: IORING_OP_TIMEOUT support io_uring: correctly handle non ->{read,write}_iter() file_operations io_uring: make CQ ring wakeups be more efficient io_uring: only flush workqueues on fileset removal io_uring: fix sequence logic for timeout requests io_uring: fix up O_NONBLOCK handling for sockets io_uring: revert "io_uring: optimize submit_and_wait API" io_uring: used cached copies of sq->dropped and cq->overflow io_uring: fix bad inflight accounting for SETUP_IOPOLL|SETUP_SQTHREAD io_uring: don't touch ctx in setup after ring fd install io_uring: run dependent links inline if possible io_uring: allow sparse fixed file sets io_uring: add support for IORING_REGISTER_FILES_UPDATE io_uring: allow application controlled CQ ring size io_uring: add support for absolute timeouts io_uring: add support for canceling timeout requests io-wq: small threadpool implementation for io_uring
io_uring: replace workqueue usage with io-wq io_uring: io_uring: add support for async work inheriting files net: add __sys_accept4_file() helper io_uring: add support for IORING_OP_ACCEPT io_uring: protect fixed file indexing with array_index_nospec() io_uring: support for larger fixed file sets io_uring: fix race with canceling timeouts io_uring: io_wq_create() returns an error pointer, not NULL io_uring: ensure we clear io_kiocb->result before each issue io_uring: support for generic async request cancel io_uring: add completion trace event io-wq: use proper nesting IRQ disabling spinlocks for cancel io_uring: enable optimized link handling for IORING_OP_POLL_ADD io_uring: fixup a few spots where link failure isn't flagged io_uring: kill dead REQ_F_LINK_DONE flag io_uring: abstract out io_async_cancel_one() helper io_uring: add support for linked SQE timeouts io_uring: make io_cqring_events() take 'ctx' as argument io_uring: pass in io_kiocb to fill/add CQ handlers io_uring: add support for backlogged CQ ring io-wq: io_wqe_run_queue() doesn't need to use list_empty_careful() io-wq: add support for bounded vs unbunded work io_uring: properly mark async work as bounded vs unbounded io_uring: reduce/pack size of io_ring_ctx io_uring: fix error clear of ->file_table in io_sqe_files_register() io_uring: convert accept4() -ERESTARTSYS into -EINTR io_uring: provide fallback request for OOM situations io_uring: make ASYNC_CANCEL work with poll and timeout io_uring: flag SQPOLL busy condition to userspace io_uring: don't do flush cancel under inflight_lock io_uring: fix -ENOENT issue with linked timer with short timeout io_uring: make timeout sequence == 0 mean no sequence io_uring: use correct "is IO worker" helper io_uring: fix potential deadlock in io_poll_wake() io_uring: check for validity of ->rings in teardown io_wq: add get/put_work handlers to io_wq_create() io-wq: ensure we have a stable view of ->cur_work for cancellations io_uring: ensure registered buffer import returns the IO length io-wq: ensure free/busy list browsing see all items io-wq: remove now redundant struct io_wq_nulls_list io_uring: make POLL_ADD/POLL_REMOVE scale better io_uring: io_async_cancel() should pass in 'nxt' request pointer io_uring: cleanup return values from the queueing functions io_uring: make io_double_put_req() use normal completion path io_uring: make req->timeout be dynamically allocated io_uring: fix sequencing issues with linked timeouts io_uring: remove dead REQ_F_SEQ_PREV flag io_uring: correct poll cancel and linked timeout expiration completion io_uring: request cancellations should break links io-wq: wait for io_wq_create() to setup necessary workers io_uring: io_fail_links() should only consider first linked timeout io_uring: io_allocate_scq_urings() should return a sane state io_uring: allow finding next link independent of req reference count io_uring: close lookup gap for dependent next work io_uring: improve trace_io_uring_defer() trace point io_uring: only return -EBUSY for submit on non-flushed backlog net: add __sys_connect_file() helper io_uring: add support for IORING_OP_CONNECT io-wq: have io_wq_create() take a 'data' argument io_uring: async workers should inherit the user creds io-wq: shrink io_wq_work a bit io_uring: make poll->wait dynamically allocated io_uring: fix missing kmap() declaration on powerpc io_uring: use current task creds instead of allocating a new one io_uring: transform send/recvmsg() -ERESTARTSYS to -EINTR io_uring: add general async offload context
io_uring: ensure async punted read/write requests copy iovec net: separate out the msghdr copy from ___sys_{send,recv}msg() net: disallow ancillary data for __sys_{send,recv}msg_file() io_uring: ensure async punted sendmsg/recvmsg requests copy data io_uring: ensure async punted connect requests copy data io_uring: mark us with IORING_FEAT_SUBMIT_STABLE io_uring: handle connect -EINPROGRESS like -EAGAIN io_uring: allow IO_SQE_* flags on IORING_OP_TIMEOUT io_uring: ensure deferred timeouts copy necessary data io-wq: clear node->next on list deletion io_uring: use hash table for poll command lookups io_uring: allow unbreakable links io-wq: remove worker->wait waitqueue io-wq: briefly spin for new work after finishing work io_uring: sqthread should grab ctx->uring_lock for submissions io_uring: deferred send/recvmsg should assign iov io_uring: don't dynamically allocate poll data io_uring: run next sqe inline if possible io_uring: only hash regular files for async work execution io_uring: add sockets to list of files that support non-blocking issue io_uring: ensure we return -EINVAL on unknown opcode io_uring: fix sporadic -EFAULT from IORING_OP_RECVMSG io-wq: re-add io_wq_current_is_worker() io_uring: fix pre-prepped issue with force_nonblock == true io_uring: remove 'sqe' parameter to the OP helpers that take it io_uring: any deferred command must have stable sqe data io_uring: make IORING_POLL_ADD and IORING_POLL_REMOVE deferrable io_uring: make IORING_OP_CANCEL_ASYNC deferrable io_uring: make IORING_OP_TIMEOUT_REMOVE deferrable io_uring: read opcode and user_data from SQE exactly once io_uring: warn about unhandled opcode io_uring: io_wq_submit_work() should not touch req->rw io_uring: use u64_to_user_ptr() consistently io_uring: add and use struct io_rw for read/writes io_uring: move all prep state for IORING_OP_CONNECT to prep handler io_uring: move all prep state for IORING_OP_{SEND,RECV}_MGS to prep handler io_uring: read 'count' for IORING_OP_TIMEOUT in prep handler io_uring: standardize the prep methods io_uring: pass in 'sqe' to the prep handlers io_uring: remove punt of short reads to async context io_uring: don't setup async context for read/write fixed io-wq: cancel work if we fail getting a mm reference io_uring: be consistent in assigning next work from handler io_uring: ensure workqueue offload grabs ring mutex for poll list io_uring: only allow submit from owning task Revert "io_uring: only allow submit from owning task" io_uring: don't cancel all work on process exit io_uring: add support for fallocate() fs: make build_open_flags() available internally io_uring: add support for IORING_OP_OPENAT io-wq: add support for uncancellable work io_uring: add support for IORING_OP_CLOSE io_uring: avoid ring quiesce for fixed file set unregister and update fs: make two stat prep helpers available io_uring: add support for IORING_OP_STATX io-wq: support concurrent non-blocking work io_uring: add IOSQE_ASYNC io_uring: remove two unnecessary function declarations io_uring: add lookup table for various opcode needs io_uring: split overflow state into SQ and CQ side io_uring: improve poll completion performance io_uring: add non-vectored read/write commands io_uring: allow use of offset == -1 to mean file position io_uring: add IORING_OP_FADVISE mm: make do_madvise() available internally io_uring: add IORING_OP_MADVISE io_uring: wrap multi-req freeing in struct req_batch io_uring: extend batch freeing to cover more cases io_uring: add support for IORING_SETUP_CLAMP io_uring: add support for send(2) and recv(2)
io_uring: file set registration should use interruptible waits io_uring: change io_ring_ctx bool fields into bit fields io_uring: enable option to only trigger eventfd for async completions io_uring: remove 'fname' from io_open structure io_uring: add opcode to issue trace event io_uring: account fixed file references correctly in batch io_uring: add support for probing opcodes io_uring: file switch work needs to get flushed on exit io_uring: don't attempt to copy iovec for READ/WRITE io-wq: make the io_wq ref counted io_uring/io-wq: don't use static creds/mm assignments io_uring: allow registering credentials io_uring: support using a registered personality for commands io_uring: fix linked command file table usage eventpoll: abstract out epoll_ctl() handler eventpoll: support non-blocking do_epoll_ctl() calls io_uring: add support for epoll_ctl(2) io_uring: add ->show_fdinfo() for the io_uring file descriptor io_uring: prevent potential eventfd recursion on poll io_uring: use the proper helpers for io_send/recv io_uring: don't map read/write iovec potentially twice io_uring: fix sporadic double CQE entry for close io_uring: punt even fadvise() WILLNEED to async context io_uring: spin for sq thread to idle on shutdown io_uring: cleanup fixed file data table references io_uring: statx/openat/openat2 don't support fixed files io_uring: retry raw bdev writes if we hit -EOPNOTSUPP io-wq: add support for inheriting ->fs io_uring: grab ->fs as part of async preparation io_uring: allow AT_FDCWD for non-file openat/openat2/statx io-wq: make io_wqe_cancel_work() take a match handler io-wq: add io_wq_cancel_pid() to cancel based on a specific pid io_uring: cancel pending async work if task exits io_uring: retain sockaddr_storage across send/recvmsg async punt io-wq: don't call kXalloc_node() with non-online node io_uring: prune request from overflow list on flush io_uring: handle multiple personalities in link chains io_uring: fix personality idr leak io-wq: remove spin-for-work optimization io-wq: ensure work->task_pid is cleared on init io_uring: pick up link work on submit reference drop io_uring: import_single_range() returns 0/-ERROR io_uring: drop file set ref put/get on switch io_uring: fix 32-bit compatability with sendmsg/recvmsg io_uring: free fixed_file_data after RCU grace period io_uring: ensure RCU callback ordering with rcu_barrier() io_uring: make sure openat/openat2 honor rlimit nofile io_uring: make sure accept honor rlimit nofile io_uring: consider any io_read/write -EAGAIN as final io_uring: io_accept() should hold on to submit reference on retry io_uring: store io_kiocb in wait->private io_uring: add per-task callback handler io_uring: mark requests that we can do poll async in io_op_defs io_uring: use poll driven retry for files that support it io_uring: buffer registration infrastructure io_uring: add IORING_OP_PROVIDE_BUFFERS io_uring: support buffer selection for OP_READ and OP_RECV io_uring: add IOSQE_BUFFER_SELECT support for IORING_OP_READV net: abstract out normal and compat msghdr import io_uring: add IOSQE_BUFFER_SELECT support for IORING_OP_RECVMSG io_uring: provide means of removing buffers io_uring: add end-of-bits marker and build time verify it io_uring: dual license io_uring.h uapi header io_uring: fix truncated async read/readv and write/writev retry io_uring: honor original task RLIMIT_FSIZE io_uring: retry poll if we got woken with non-matching mask io_uring: grab task reference for poll requests io_uring: use io-wq manager as backup task if task is exiting
io_uring: remove bogus RLIMIT_NOFILE check in file registration io_uring: ensure openat sets O_LARGEFILE if needed io_uring: punt final io_ring_ctx wait-and-free to workqueue io_uring: correct O_NONBLOCK check for splice punt io_uring: check for need to re-wait in polled async handling io_uring: io_async_task_func() should check and honor cancelation io_uring: only post events in io_poll_remove_all() if we completed some io_uring: statx must grab the file table for valid fd io_uring: enable poll retry for any file with ->read_iter / ->write_iter io_uring: only force async punt if poll based retry can't handle it io_uring: don't use 'fd' for openat/openat2/statx io_uring: polled fixed file must go through free iteration io_uring: initialize ctx->sqo_wait earlier io_uring: remove dead check in io_splice() io_uring: cancel work if task_work_add() fails io_uring: don't add non-IO requests to iopoll pending list io_uring: remove 'fd is io_uring' from close path io_uring: name sq thread and ref completions io_uring: batch reap of dead file registrations io_uring: allow POLL_ADD with double poll_wait() users io_uring: file registration list and lock optimization io_uring: cleanup io_poll_remove_one() logic io_uring: async task poll trigger cleanup io_uring: disallow close of ring itself io_uring: re-set iov base/len for buffer select retry io_uring: allow O_NONBLOCK async retry io_uring: acquire 'mm' for task_work for SQPOLL io_uring: reap poll completions while waiting for refs to drop on exit io_uring: use signal based task_work running io_uring: fix regression with always ignoring signals in io_cqring_wait() io_uring: account user memory freed when exit has been queued io_uring: ensure double poll additions work with both request types io_uring: use TWA_SIGNAL for task_work uncondtionally io_uring: hold 'ctx' reference around task_work queue + execute io_uring: clear req->result on IOPOLL re-issue io_uring: fix IOPOLL -EAGAIN retries io_uring: always delete double poll wait entry on match io_uring: fix potential ABBA deadlock in ->show_fdinfo() io_uring: use type appropriate io_kiocb handler for double poll io_uring: round-up cq size before comparing with rounded sq size io_uring: remove dead 'ctx' argument and move forward declaration io_uring: don't touch 'ctx' after installing file descriptor io_uring: account locked memory before potential error case io_uring: fix imbalanced sqo_mm accounting io_uring: stash ctx task reference for SQPOLL io_uring: ensure consistent view of original task ->mm from SQPOLL io_uring: allow non-fixed files with SQPOLL io_uring: fail poll arm on queue proc failure io_uring: sanitize double poll handling io_uring: ensure open/openat2 name is cleaned on cancelation io_uring: fix error path cleanup in io_sqe_files_register() io_uring: make ctx cancel on exit targeted to actual ctx io_uring: fix SQPOLL IORING_OP_CLOSE cancelation state io_uring: ignore double poll add on the same waitqueue head io_uring: clean up io_kill_linked_timeout() locking io_uring: add missing REQ_F_COMP_LOCKED for nested requests io_uring: provide generic io_req_complete() helper io_uring: add 'io_comp_state' to struct io_submit_state io_uring: pass down completion state on the issue side io_uring: pass in completion state to appropriate issue side handlers io_uring: enable READ/WRITE to use deferred completions io_uring: use task_work for links if possible io_uring: abstract out task work running io_uring: use new io_req_task_work_add() helper throughout
io_uring: only call kfree() for a non-zero pointer io_uring: get rid of __req_need_defer() io_uring: enable lookup of links holding inflight files io_uring: fix recursive completion locking on oveflow flush io_uring: always plug for any number of IOs io_uring: find and cancel head link async work on files exit io_uring: don't use poll handler if file can't be nonblocking read/written io_uring: don't recurse on tsk->sighand->siglock with signalfd io_uring: defer file table grabbing request cleanup for locked requests
Jiufei Xue (5): io_uring: check file O_NONBLOCK state for accept io_uring: change the poll type to be 32-bits io_uring: use EPOLLEXCLUSIVE flag to aoid thundering herd type behavior io_uring: fix removing the wrong file in __io_sqe_files_update() io_uring: set table->files[i] to NULL when io_sqe_file_register failed
Joseph Qi (1): io_uring: fix shift-out-of-bounds when round up cq size
LimingWu (1): io_uring: fix a typo in a comment
Lukas Bulwahn (1): io_uring: make spdxcheck.py happy
Marcelo Diop-Gonzalez (1): io_uring: flush timeouts that should already have expired
Mark Rutland (3): io_uring: fix SQPOLL cpu validation io_uring: free allocated io_memory once io_uring: avoid page allocation warnings
Nathan Chancellor (1): io_uring: Ensure mask is initialized in io_arm_poll_handler
Oleg Nesterov (6): signal: remove the wrong signal_pending() check in restore_user_sigmask() signal: simplify set_user_sigmask/restore_user_sigmask select: change do_poll() to return -ERESTARTNOHAND rather than -EINTR select: shift restore_saved_sigmask_unless() into poll_select_copy_remaining() task_work_run: don't take ->pi_lock unconditionally task_work: teach task_work_add() to do signal_wake_up()
Pavel Begunkov (249): io_uring: Fix __io_uring_register() false success io_uring: fix reversed nonblock flag for link submission io_uring: remove wait loop spurious wakeups io_uring: Fix corrupted user_data io_uring: Fix broken links with offloading io_uring: Fix race for sqes with userspace io_uring: Fix leaked shadow_req io_uring: remove index from sqe_submit io_uring: Fix mm_fault with READ/WRITE_FIXED io_uring: Merge io_submit_sqes and io_ring_submit io_uring: io_queue_link*() right after submit io_uring: allocate io_kiocb upfront io_uring: Use submit info inlined into req io_uring: use inlined struct sqe_submit io_uring: Fix getting file for timeout io_uring: Fix getting file for non-fd opcodes io_uring: break links for failed defer io_uring: remove redundant check io_uring: Fix leaking linked timeouts io_uring: Always REQ_F_FREE_SQE for allocated sqe io_uring: drain next sqe instead of shadowing io_uring: rename __io_submit_sqe() io_uring: add likely/unlikely in io_get_sqring() io_uring: remove io_free_req_find_next() io_uring: pass only !null to io_req_find_next() io_uring: simplify io_req_link_next() io_uring: only !null ptr to io_issue_sqe() io_uring: fix dead-hung for non-iter fixed rw io_uring: store timeout's sqe->off in proper place io_uring: inline struct sqe_submit io_uring: cleanup io_import_fixed() io_uring: fix error handling in io_queue_link_head io_uring: hook all linked requests via link_list io_uring: make HARDLINK imply LINK io_uring: don't wait when under-submitting io_uring: rename prev to head io_uring: move *queue_link_head() from common path pcpu_ref: add percpu_ref_tryget_many() io_uring: batch getting pcpu references io_uring: clamp to_submit in io_submit_sqes() io_uring: optimise head checks in io_get_sqring() io_uring: optimise commit_sqring() for common case io_uring: remove extra io_wq_current_is_worker() io_uring: optimise use of ctx->drain_next io_uring: remove extra check in __io_commit_cqring io_uring: hide uring_fd in ctx io_uring: remove REQ_F_IO_DRAINED io_uring: optimise sqe-to-req flags translation io_uring: use labeled array init in io_op_defs io_uring: prep req when do IOSQE_ASYNC io_uring: honor IOSQE_ASYNC for linked reqs io_uring: add comment for drain_next io_uring: fix refcounting with batched allocations at OOM io-wq: allow grabbing existing io-wq io_uring: add io-wq workqueue sharing io_uring: remove extra ->file check io_uring: iterate req cache backwards io_uring: put the flag changing code in the same spot io_uring: get rid of delayed mm check io_uring: fix deferred req iovec leak io_uring: remove unused struct io_async_open io_uring: fix iovec leaks io_uring: add cleanup for openat()/statx() io_uring: fix async close() with f_op->flush() io_uring: fix double prep iovec leak io_uring: fix openat/statx's filename leak io_uring: add missing io_req_cancelled() io_uring: fix use-after-free by io_cleanup_req() io-wq: fix IO_WQ_WORK_NO_CANCEL cancellation io-wq: remove io_wq_flush and IO_WQ_WORK_INTERNAL io_uring: fix lockup with timeouts io_uring: NULL-deref for IOSQE_{ASYNC,DRAIN} io_uring: don't call work.func from sync ctx io_uring: don't do full *prep_worker() from io-wq io_uring: remove req->in_async splice: make do_splice public io_uring: add interface for getting files io_uring: add splice(2) support io_uring: clean io_poll_complete io_uring: extract kmsg copy helper io-wq: remove unused IO_WQ_WORK_HAS_MM io_uring: remove IO_WQ_WORK_CB io-wq: use BIT for ulong hash io_uring: remove extra nxt check after punt io_uring: remove io_prep_next_work()
io_uring: clean up io_close io_uring: make submission ref putting consistent io_uring: remove @nxt from handlers io_uring: get next work with submission ref drop io-wq: shuffle io_worker_handle_work() code io-wq: optimise locking in io_worker_handle_work() io-wq: optimise out *next_work() double lock io_uring/io-wq: forward submission ref to async io-wq: remove duplicated cancel code io-wq: don't resched if there is no work io-wq: split hashing and enqueueing io-wq: hash dependent work io-wq: close cancel gap for hashed linked work io_uring: Fix ->data corruption on re-enqueue io-wq: handle hashed writes in chains io_uring: fix ctx refcounting in io_submit_sqes() io_uring: simplify io_get_sqring io_uring: alloc req only after getting sqe io_uring: remove req init from io_get_req() io_uring: don't read user-shared sqe flags twice io_uring: fix fs cleanup on cqe overflow io_uring: remove obsolete @mm_fault io_uring: track mm through current->mm io_uring: early submission req fail code io_uring: keep all sqe->flags in req->flags io_uring: move all request init code in one place io_uring: fix cached_sq_head in io_timeout() io_uring: kill already cached timeout.seq_offset io_uring: don't count rqs failed after current one io_uring: fix extra put in sync_file_range() io_uring: check non-sync defer_list carefully io_uring: punt splice async because of inode mutex splice: move f_mode checks to do_{splice,tee}() io_uring: fix zero len do_splice() io_uring: don't prepare DRAIN reqs twice io_uring: fix FORCE_ASYNC req preparation io_uring: remove req->needs_fixed_files io_uring: rename io_file_put() io_uring: don't repeat valid flag list splice: export do_tee() io_uring: add tee(2) support io_uring: fix flush req->refs underflow io_uring: simplify io_timeout locking io_uring: don't re-read sqe->off in timeout_prep() io_uring: separate DRAIN flushing into a cold path io_uring: get rid of manual punting in io_close io_uring: move timeouts flushing to a helper io_uring: off timeouts based only on completions io_uring: fix overflowed reqs cancellation io_uring: fix {SQ,IO}POLL with unsupported opcodes io_uring: move send/recv IOPOLL check into prep io_uring: don't derive close state from ->func io_uring: remove custom ->func handlers io_uring: don't arm a timeout through work.func io_wq: add per-wq work handler instead of per work io_uring: fix lazy work init io-wq: reorder cancellation pending -> running io-wq: add an option to cancel all matched reqs io_uring: cancel all task's requests on exit io_uring: batch cancel in io_uring_cancel_files() io_uring: lazy get task io_uring: cancel by ->task not pid io-wq: compact io-wq flags numbers io-wq: return next work from ->do_work() directly io_uring: fix hanging iopoll in case of -EAGAIN io_uring: fix current->mm NULL dereference on exit io_uring: fix missing msg_name assignment io_uring: fix not initialised work->flags io_uring: fix recvmsg memory leak with buffer selection io_uring: missed req_init_async() for IOSQE_ASYNC io_uring: fix ->work corruption with poll_add io_uring: fix lockup in io_fail_links() io_uring: rename sr->msg into umsg io_uring: use more specific type in rcv/snd msg cp io_uring: extract io_sendmsg_copy_hdr() io_uring: simplify io_req_map_rw() io_uring: add a helper for async rw iovec prep io_uring: fix potential use after free on fallback request free io_uring: fix stopping iopoll'ing too early io_uring: briefly loose locks while reaping events io_uring: partially inline io_iopoll_getevents() io_uring: fix racy overflow count reporting
io-wq: fix hang after cancelling pending hashed work io_uring: clean file_data access in files_register io_uring: refactor *files_register()'s error paths io_uring: keep a pointer ref_node in file_data io_uring: fix double poll mask init io_uring: fix recvmsg setup with compat buf-select io_uring: fix NULL-mm for linked reqs io_uring: fix missing ->mm on exit io_uring: return locked and pinned page accounting io_uring: don't burn CPU for iopoll on exit io_uring: don't miscount pinned memory io_uring: fix provide_buffers sign extension io_uring: fix stalled deferred requests io_uring: kill REQ_F_LINK_NEXT io_uring: deduplicate freeing linked timeouts io_uring: fix refs underflow in io_iopoll_queue() io_uring: remove inflight batching in free_many() io_uring: dismantle req early and remove need_iter io_uring: batch-free linked requests as well io_uring: cosmetic changes for batch free io_uring: clean up req->result setting by rw io_uring: do task_work_run() during iopoll io_uring: fix NULL mm in io_poll_task_func() io_uring: simplify io_async_task_func() io_uring: fix req->work corruption io_uring: fix punting req w/o grabbed env io_uring: fix feeding io-wq with uninit reqs io_uring: don't mark link's head for_async io_uring: fix missing io_grab_files() io_uring: replace find_next() out param with ret io_uring: kill REQ_F_TIMEOUT io_uring: kill REQ_F_TIMEOUT_NOSEQ io_uring: optimise io_req_find_next() fast check io_uring: remove setting REQ_F_MUST_PUNT in rw io_uring: remove REQ_F_MUST_PUNT io_uring: set @poll->file after @poll init io_uring: don't pass def into io_req_work_grab_env io_uring: do init work in grab_env() io_uring: factor out grab_env() from defer_prep() io_uring: do grab_env() just before punting io_uring: fix mis-refcounting linked timeouts io_uring: keep queue_sqe()'s fail path separately io_uring: fix lost cqe->flags io_uring: don't delay iopoll'ed req completion io_uring: remove nr_events arg from iopoll_check() io_uring: share completion list w/ per-op space io_uring: rename ctx->poll into ctx->iopoll io_uring: use inflight_entry list for iopoll'ing io_uring: use completion list for CQ overflow io_uring: add req->timeout.list io_uring: remove init for unused list io_uring: use non-intrusive list for defer io_uring: remove sequence from io_kiocb io_uring: place cflags into completion data io_uring: fix cancel of deferred reqs with ->files io_uring: fix linked deferred ->files cancellation io_uring: fix racy IOPOLL completions io_uring: inline io_req_work_grab_env() io_uring: alloc ->io in io_req_defer_prep() io_uring/io-wq: move RLIMIT_FSIZE to io-wq io_uring: mark ->work uninitialised after cleanup io_uring: follow **iovec idiom in io_import_iovec io_uring: de-unionise io_kiocb io_uring: consolidate *_check_overflow accounting io_uring: get rid of atomic FAA for cq_timeouts io-wq: update hash bits io_uring: indent left {send,recv}[msg]() io_uring: remove extra checks in send/recv io_uring: don't forget cflags in io_recv() io_uring: free selected-bufs if error'ed io_uring: move BUFFER_SELECT check into *recv[msg] io_uring: simplify file ref tracking in submission state io_uring: extract io_put_kbuf() helper io_uring: don't open-code recv kbuf managment io_uring: don't do opcode prep twice io_uring: deduplicate io_grab_files() calls io_uring: fix missing io_queue_linked_timeout() tasks: add put_task_struct_many() io_uring: batch put_task_struct() io_uring: fix racy req->flags modification io_uring: get an active ref_node from files_data
io_uring: order refnode recycling
Randy Dunlap (2): io_uring: fix 1-bit bitfields to be unsigned io_uring: fix function args for !CONFIG_NET
Roman Gushchin (1): percpu_ref: introduce PERCPU_REF_ALLOW_REINIT flag
Roman Penyaev (3): io_uring: offload write to async worker in case of -EAGAIN io_uring: fix infinite wait in khread_park() on io_finish_async() io_uring: add mapping support for NOMMU archs
Shenghui Wang (1): io_uring: use cpu_online() to check p->sq_thread_cpu instead of cpu_possible()
Stefan Bühler (13): io_uring: fix race condition reading SQ entries io_uring: fix race condition when sq threads goes sleeping io_uring: fix poll full SQ detection io_uring: fix handling SQEs requesting NOWAIT io_uring: fix notes on barriers io_uring: remove unnecessary barrier before wq_has_sleeper io_uring: remove unnecessary barrier before reading cq head io_uring: remove unnecessary barrier after updating SQ head io_uring: remove unnecessary barrier before reading SQ tail io_uring: remove unnecessary barrier after incrementing dropped counter io_uring: remove unnecessary barrier after unsetting IORING_SQ_NEED_WAKEUP req->error only used for iopoll io_uring: fix race condition reading SQE data
Stefan Metzmacher (1): io_uring: add BUILD_BUG_ON() to assert the layout of struct io_uring_sqe
Stefano Garzarella (4): io_uring: flush overflowed CQ events in the io_uring_poll() io_uring: prevent sq_thread from spinning when it should stop io_uring: add 'cq_flags' field for the CQ ring io_uring: add IORING_CQ_EVENTFD_DISABLED to the CQ ring flags
Steve French (1): cifs: fix rmmod regression in cifs.ko caused by force_sig changes
Thomas Gleixner (2): sched: Remove stale PF_MUTEX_TESTER bit sched/core, workqueues: Distangle worker accounting from rq lock
Tobias Klauser (1): io_uring: define and set show_fdinfo only if procfs is enabled
Xiaoguang Wang (24): io_uring: fix __io_iopoll_check deadlock in io_sq_thread io_uring: fix poll_list race for SETUP_IOPOLL|SETUP_SQPOLL io_uring: io_uring_enter(2) don't poll while SETUP_IOPOLL|SETUP_SQPOLL enabled io_uring: cleanup io_alloc_async_ctx() io_uring: refactor file register/unregister/update handling io_uring: initialize fixed_file_data lock io_uring: do not always copy iovec in io_req_map_rw() io_uring: restore req->work when canceling poll request io_uring: only restore req->work for req that needs do completion io_uring: use cond_resched() in io_ring_ctx_wait_and_kill() io_uring: fix mismatched finish_wait() calls in io_uring_cancel_files() io_uring: handle -EFAULT properly in io_uring_setup() io_uring: reset -EBUSY error when io sq thread is waken up io_uring: remove obsolete 'state' parameter io_uring: don't submit sqes when ctx->refs is dying io_uring: avoid whole io_wq_work copy for requests completed inline io_uring: avoid unnecessary io_wq_work copy for fast poll feature io_uring: fix io_kiocb.flags modification race in IOPOLL mode io_uring: don't fail links for EAGAIN error in IOPOLL mode io_uring: add memory barrier to synchronize io_kiocb's result and iopoll_completed io_uring: fix possible race condition against REQ_F_NEED_CLEANUP io_uring: export cq overflow status to userspace io_uring: clear IORING_SQ_NEED_WAKEUP after executing task works io_uring: always let io_iopoll_complete() complete polled io
Xiaoming Ni (1): io_uring: remove duplicate semicolon at the end of line
Xuan Zhuo (1): io_uring: fix io_sq_thread no schedule when busy
Yang Yingliang (2): io_uring: fix memleak in __io_sqe_files_update() io_uring: fix memleak in io_sqe_files_register()
YueHaibing (3): io-wq: use kfree_rcu() to simplify the code io_uring: Remove unnecessary null check io_uring: Fix unused function warnings
Zhengyuan Liu (4): io_uring: fix the sequence comparison in io_sequence_defer io_uring: fix counter inc/dec mismatch in async_list io_uring: add a memory barrier before atomic_read io_uring: track io length in async_list based on bytes
yangerkun (9): fs: fix kabi change since add iopoll io_uring: compare cached_cq_tail with cq.head in_io_uring_poll io_uring: consider the overflow of sequence for timeout req io_uring: fix logic error in io_timeout fs: introduce __close_fd_get_file to support IORING_OP_CLOSE for io_uring fs: make filename_lookup available externally x86: fix kabi with io_uring interface arm64: fix kabi with io_uring interface io_uring: add IORING_OP_OPENAT2 for compatablity
zhangyi (F) (2): io_uring : correct timeout req sequence when waiting timeout io_uring: correct timeout req sequence when inserting a new entry
 Documentation/filesystems/vfs.txt | 3 +
 arch/arm64/include/asm/syscall_wrapper.h | 5 +
 arch/arm64/kernel/syscall.c | 9 +-
 arch/x86/entry/common.c | 7 +
 arch/x86/include/asm/syscall_wrapper.h | 3 +
 block/blk-core.c | 12 +-
 block/blk-merge.c | 7 +-
 drivers/block/drbd/drbd_main.c | 2 +
 fs/Kconfig | 3 +
 fs/Makefile | 2 +
 fs/aio.c | 157 +-
 fs/cifs/connect.c | 3 +-
 fs/eventpoll.c | 143 +-
 fs/file.c | 53 +-
 fs/file_table.c | 9 +-
 fs/internal.h | 9 +
 fs/io-wq.c | 1158 +++
 fs/io-wq.h | 152 +
 fs/io_uring.c | 8822 ++++++++++++++++++++++
 fs/namei.c | 4 +-
 fs/open.c | 2 +-
 fs/select.c | 376 +-
 fs/splice.c | 62 +-
 fs/stat.c | 65 +-
 fs/sync.c | 141 +-
 include/linux/compat.h | 19 +
 include/linux/eventpoll.h | 9 +
 include/linux/fdtable.h | 1 +
 include/linux/file.h | 3 +
 include/linux/fs.h | 22 +-
 include/linux/ioprio.h | 13 +
 include/linux/mm.h | 1 +
 include/linux/percpu-refcount.h | 36 +-
 include/linux/sched.h | 2 +-
 include/linux/sched/jobctl.h | 4 +-
 include/linux/sched/signal.h | 12 +-
 include/linux/sched/task.h | 6 +
 include/linux/sched/user.h | 2 +-
 include/linux/signal.h | 15 +-
 include/linux/socket.h | 25 +
 include/linux/splice.h | 6 +
 include/linux/syscalls.h | 28 +-
 include/linux/task_work.h | 5 +-
 include/linux/uio.h | 4 +-
 include/net/af_unix.h | 1 +
 include/net/compat.h | 3 +
 include/trace/events/io_uring.h | 495 ++
 include/uapi/linux/aio_abi.h | 2 +
 include/uapi/linux/io_uring.h | 294 +
 init/Kconfig | 10 +
 kernel/sched/core.c | 96 +-
 kernel/signal.c | 63 +-
 kernel/sys_ni.c | 3 +
 kernel/task_work.c | 34 +-
 kernel/workqueue.c | 54 +-
 kernel/workqueue_internal.h | 5 +-
 lib/iov_iter.c | 15 +-
 lib/percpu-refcount.c | 28 +-
 mm/madvise.c | 7 +-
 net/Makefile | 2 +-
 net/compat.c | 31 +-
 net/socket.c | 298 +-
 net/unix/Kconfig | 5 +
 net/unix/Makefile | 2 +
 net/unix/af_unix.c | 63 +-
 net/unix/garbage.c | 68 +-
 net/unix/scm.c | 151 +
 net/unix/scm.h | 10 +
 tools/io_uring/Makefile | 18 +
 tools/io_uring/README | 29 +
 tools/io_uring/barrier.h | 16 +
 tools/io_uring/io_uring-bench.c | 592 ++
 tools/io_uring/io_uring-cp.c | 260 +
 tools/io_uring/liburing.h | 187 +
 tools/io_uring/queue.c | 156 +
 tools/io_uring/setup.c | 107 +
 tools/io_uring/syscall.c | 52 +
 77 files changed, 13750 insertions(+), 829 deletions(-)
 create mode 100644 fs/io-wq.c
 create mode 100644 fs/io-wq.h
 create mode 100644 fs/io_uring.c
 create mode 100644 include/trace/events/io_uring.h
 create mode 100644 include/uapi/linux/io_uring.h
 create mode 100644 net/unix/scm.c
 create mode 100644 net/unix/scm.h
 create mode 100644 tools/io_uring/Makefile
 create mode 100644 tools/io_uring/README
 create mode 100644 tools/io_uring/barrier.h
 create mode 100644 tools/io_uring/io_uring-bench.c
 create mode 100644 tools/io_uring/io_uring-cp.c
 create mode 100644 tools/io_uring/liburing.h
 create mode 100644 tools/io_uring/queue.c
 create mode 100644 tools/io_uring/setup.c
 create mode 100644 tools/io_uring/syscall.c
From: "Eric W. Biederman" ebiederm@xmission.com
stable inclusion
from linux-4.19.99
commit 6db0e28b893aa28af3f7c0197749a5d9cbfded5c
bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27
CVE: NA

---------------------------
[ Upstream commit 33da8e7c814f77310250bb54a9db36a44c5de784 ]
My recent change to only use force_sig for synchronous events wound up breaking signal reception in cifs and drbd. I had overlooked the fact that by default kthreads start out with all signals set to SIG_IGN. So a change I thought was safe turned out to have made it impossible for those kernel threads to catch their signals.

Reverting the work on force_sig is a bad idea, because what the code was doing was very much a misuse of force_sig: the way force_sig ultimately allowed the signal to happen was by changing the signal handler to SIG_DFL, which after the first signal would allow userspace to send signals to these kernel threads. At least for wake_ack_receiver in drbd that does not appear actively wrong.
So correct this problem by adding allow_kernel_signal, which lets through signals whose siginfo reports they were sent by the kernel but does not allow userspace-generated signals, and update cifs and drbd to call allow_kernel_signal in an appropriate place so that their threads can receive this signal.
Fixing things this way ensures that userspace won't be able to send signals and cause problems, that it is clear which signals the threads are expecting to receive, and it guarantees that nothing else in the system will be affected.
This change was partly inspired by similar cifs and drbd patches that added allow_signal.
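For illustration, a kernel thread would opt in to kernel-generated signals roughly as follows. This is a minimal sketch: the thread body and the choice of SIGINT are hypothetical, only allow_kernel_signal() itself comes from this patch.

#include <linux/kthread.h>
#include <linux/sched/signal.h>
#include <linux/signal.h>

static int example_thread(void *data)
{
	/*
	 * kthreads start with every signal set to SIG_IGN. After this
	 * call, a SIGINT sent by the kernel is delivered, while a
	 * SIGINT sent from userspace is still ignored.
	 */
	allow_kernel_signal(SIGINT);

	while (!kthread_should_stop()) {
		if (signal_pending(current)) {
			/* consume the kernel-sent signal and wind down */
			flush_signals(current);
			break;
		}
		schedule_timeout_interruptible(HZ);
	}
	return 0;
}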
Reported-by: ronnie sahlberg <ronniesahlberg@gmail.com>
Reported-by: Christoph Böhmwalder <christoph.boehmwalder@linbit.com>
Tested-by: Christoph Böhmwalder <christoph.boehmwalder@linbit.com>
Cc: Steve French <smfrench@gmail.com>
Cc: Philipp Reisner <philipp.reisner@linbit.com>
Cc: David Laight <David.Laight@ACULAB.COM>
Fixes: 247bc9470b1e ("cifs: fix rmmod regression in cifs.ko caused by force_sig changes")
Fixes: 72abe3bcf091 ("signal/cifs: Fix cifs_put_tcp_session to call send_sig instead of force_sig")
Fixes: fee109901f39 ("signal/drbd: Use send_sig not force_sig")
Fixes: 3cf5d076fb4d ("signal: Remove task parameter from force_sig")
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
[io_uring need allow_kernel_signal]
Signed-off-by: yangerkun <yangerkun@huawei.com>
Reviewed-by: zhangyi (F) <yi.zhang@huawei.com>
Signed-off-by: Cheng Jian <cj.chengjian@huawei.com>
---
 drivers/block/drbd/drbd_main.c |  2 ++
 fs/cifs/connect.c              |  2 +-
 include/linux/signal.h         | 15 ++++++++++++++-
 kernel/signal.c                |  5 +++++
 4 files changed, 22 insertions(+), 2 deletions(-)
diff --git a/drivers/block/drbd/drbd_main.c b/drivers/block/drbd/drbd_main.c
index a49a8d91a599..5e3885f5729b 100644
--- a/drivers/block/drbd/drbd_main.c
+++ b/drivers/block/drbd/drbd_main.c
@@ -334,6 +334,8 @@ static int drbd_thread_setup(void *arg)
 		 thi->name[0],
 		 resource->name);

+	allow_kernel_signal(DRBD_SIGKILL);
+	allow_kernel_signal(SIGXCPU);
 restart:
 	retval = thi->function(thi);

diff --git a/fs/cifs/connect.c b/fs/cifs/connect.c
index ef7e71b904df..907be252c5d4 100644
--- a/fs/cifs/connect.c
+++ b/fs/cifs/connect.c
@@ -974,7 +974,7 @@ cifs_demultiplex_thread(void *p)
 	mempool_resize(cifs_req_poolp, length + cifs_min_rcv);

 	set_freezable();
-	allow_signal(SIGKILL);
+	allow_kernel_signal(SIGKILL);
 	while (server->tcpStatus != CifsExiting) {
 		if (try_to_freeze())
 			continue;
diff --git a/include/linux/signal.h b/include/linux/signal.h
index e4d01469ed60..0be5ce2375cb 100644
--- a/include/linux/signal.h
+++ b/include/linux/signal.h
@@ -272,6 +272,9 @@ extern void signal_setup_done(int failed, struct ksignal *ksig, int stepping);
 extern void exit_signals(struct task_struct *tsk);
 extern void kernel_sigaction(int, __sighandler_t);

+#define SIG_KTHREAD ((__force __sighandler_t)2)
+#define SIG_KTHREAD_KERNEL ((__force __sighandler_t)3)
+
 static inline void allow_signal(int sig)
 {
 	/*
@@ -279,7 +282,17 @@ static inline void allow_signal(int sig)
 	 * know it'll be handled, so that they don't get converted to
 	 * SIGKILL or just silently dropped.
 	 */
-	kernel_sigaction(sig, (__force __sighandler_t)2);
+	kernel_sigaction(sig, SIG_KTHREAD);
+}
+
+static inline void allow_kernel_signal(int sig)
+{
+	/*
+	 * Kernel threads handle their own signals. Let the signal code
+	 * know signals sent by the kernel will be handled, so that they
+	 * don't get silently dropped.
+	 */
+	kernel_sigaction(sig, SIG_KTHREAD_KERNEL);
 }

 static inline void disallow_signal(int sig)
diff --git a/kernel/signal.c b/kernel/signal.c
index deba77ef0573..5ded8c6ac789 100644
--- a/kernel/signal.c
+++ b/kernel/signal.c
@@ -88,6 +88,11 @@ static bool sig_task_ignored(struct task_struct *t, int sig, bool force)
 	    handler == SIG_DFL && !(force && sig_kernel_only(sig)))
 		return true;

+	/* Only allow kernel generated signals to this kthread */
+	if (unlikely((t->flags & PF_KTHREAD) &&
+		     (handler == SIG_KTHREAD_KERNEL) && !force))
+		return true;
+
 	return sig_handler_ignored(handler, sig);
 }
From: Christoph Hellwig <hch@lst.de>
mainline inclusion
from mainline-5.1-rc1
commit fb7e160019f4abb4082740bfeb27a38f6389c745
category: feature
bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27
CVE: NA

---------------------------
This new method is used to explicitly poll for I/O completion for an iocb. It must be called for any iocb submitted asynchronously (that is, with a non-NULL ki_complete) which has the IOCB_HIPRI flag set.

The method is assisted by a new ki_cookie field in struct kiocb to store the polling cookie.
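For context, mainline wires this hook up for raw block devices roughly as below. This is a sketch of the upstream fs/block_dev.c change, not part of this backport, and it relies on the ki_cookie field which, per the conflict note further down, is not added here:

static int blkdev_iopoll(struct kiocb *kiocb, bool spin)
{
	struct block_device *bdev = I_BDEV(kiocb->ki_filp->f_mapping->host);
	struct request_queue *q = bdev_get_queue(bdev);

	/* Poll the queue the submission cookie maps to; busy-spin if asked. */
	return blk_poll(q, READ_ONCE(kiocb->ki_cookie), spin);
}

The function is then installed as the .iopoll member of the block device's file_operations, next to .read_iter and .write_iter.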
Reviewed-by: Hannes Reinecke <hare@suse.com>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Conflicts:
[ Adding the ki_cookie field to struct kiocb would change KABI and
  cannot be worked around, so block-layer poll support is dropped. ]
Signed-off-by: yangerkun <yangerkun@huawei.com>
Reviewed-by: zhangyi (F) <yi.zhang@huawei.com>
Signed-off-by: Cheng Jian <cj.chengjian@huawei.com>
---
 Documentation/filesystems/vfs.txt | 3 +++
 include/linux/fs.h                | 1 +
 2 files changed, 4 insertions(+)
diff --git a/Documentation/filesystems/vfs.txt b/Documentation/filesystems/vfs.txt
index a6c6a8af48a2..0fe9c0dd3269 100644
--- a/Documentation/filesystems/vfs.txt
+++ b/Documentation/filesystems/vfs.txt
@@ -857,6 +857,7 @@ struct file_operations {
 	ssize_t (*write) (struct file *, const char __user *, size_t, loff_t *);
 	ssize_t (*read_iter) (struct kiocb *, struct iov_iter *);
 	ssize_t (*write_iter) (struct kiocb *, struct iov_iter *);
+	int (*iopoll)(struct kiocb *kiocb, bool spin);
 	int (*iterate) (struct file *, struct dir_context *);
 	int (*iterate_shared) (struct file *, struct dir_context *);
 	__poll_t (*poll) (struct file *, struct poll_table_struct *);
@@ -901,6 +902,8 @@ otherwise noted.

   write_iter: possibly asynchronous write with iov_iter as source

+  iopoll: called when aio wants to poll for completions on HIPRI iocbs
+
   iterate: called when the VFS needs to read the directory contents

   iterate_shared: called when the VFS needs to read the directory contents
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 118021c316da..63748acb1444 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -1776,6 +1776,7 @@ struct file_operations {
 	ssize_t (*write) (struct file *, const char __user *, size_t, loff_t *);
 	ssize_t (*read_iter) (struct kiocb *, struct iov_iter *);
 	ssize_t (*write_iter) (struct kiocb *, struct iov_iter *);
+	int (*iopoll)(struct kiocb *kiocb, bool spin);
 	int (*iterate) (struct file *, struct dir_context *);
 	int (*iterate_shared) (struct file *, struct dir_context *);
 	__poll_t (*poll) (struct file *, struct poll_table_struct *);
From: Deepa Dinamani <deepa.kernel@gmail.com>
mainline inclusion
from mainline-5.0-rc1
commit 8bd27a3004e80d3d0962534c97e5a841262d5093
category: feature
bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27
CVE: NA

---------------------------
struct timespec is not y2038 safe. struct __kernel_timespec is the new y2038 safe structure for all syscalls that are using struct timespec. Update ppoll interfaces to use struct __kernel_timespec.
sigset_t also has different representations on 32 bit and 64 bit architectures. Hence, we need to support the following different syscalls:
New y2038 safe syscalls:
(Controlled by CONFIG_64BIT_TIME for 32 bit ABIs)

Native 64 bit (unchanged) and native 32 bit : sys_ppoll
Compat : compat_sys_ppoll_time64

Older y2038 unsafe syscalls:
(Controlled by CONFIG_COMPAT_32BIT_TIME for 32 bit ABIs)

Native 32 bit : ppoll_time32
Compat : compat_sys_ppoll
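As a userspace illustration, once ppoll_time64 is wired into the syscall tables (that happens in a later series), a 32-bit application or its libc could reach the y2038-safe entry point roughly like this. The struct name and wrapper are illustrative, and __NR_ppoll_time64 exists only on 32-bit ABIs:

#include <poll.h>
#include <signal.h>
#include <stdint.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Mirrors the kernel's __kernel_timespec: 64-bit seconds even on 32-bit. */
struct kernel_timespec {
	int64_t tv_sec;
	long long tv_nsec;
};

static int ppoll_time64(struct pollfd *fds, unsigned int nfds,
			const struct kernel_timespec *timeout,
			const sigset_t *sigmask)
{
	/* 8 == sizeof(kernel sigset_t); the kernel rejects other sizes */
	return syscall(__NR_ppoll_time64, fds, nfds, timeout, sigmask, 8);
}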
Signed-off-by: Deepa Dinamani <deepa.kernel@gmail.com>
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Conflicts:
	fs/select.c
	include/linux/compat.h
[ Patch 9afc5eee65c ("y2038: globally rename compat_time to old_time32")
  is not applied. ]
Signed-off-by: Zhihao Cheng <chengzhihao1@huawei.com>
Signed-off-by: yangerkun <yangerkun@huawei.com>
Reviewed-by: zhangyi (F) <yi.zhang@huawei.com>
Signed-off-by: Cheng Jian <cj.chengjian@huawei.com>
---
 fs/select.c              | 166 ++++++++++++++++++++++++++-------------
 include/linux/compat.h   |   5 ++
 include/linux/syscalls.h |   5 +-
 3 files changed, 120 insertions(+), 56 deletions(-)
diff --git a/fs/select.c b/fs/select.c
index 5989a43813b7..3dd2155f1d8b 100644
--- a/fs/select.c
+++ b/fs/select.c
@@ -287,12 +287,18 @@ int poll_select_set_timeout(struct timespec64 *to, time64_t sec, long nsec)
 	return 0;
 }

+enum poll_time_type {
+	PT_TIMEVAL = 0,
+	PT_OLD_TIMEVAL = 1,
+	PT_TIMESPEC = 2,
+	PT_OLD_TIMESPEC = 3,
+};
+
 static int poll_select_copy_remaining(struct timespec64 *end_time,
 				      void __user *p,
-				      int timeval, int ret)
+				      enum poll_time_type pt_type, int ret)
 {
 	struct timespec64 rts;
-	struct timeval rtv;

 	if (!p)
 		return ret;
@@ -310,18 +316,40 @@ static int poll_select_copy_remaining(struct timespec64 *end_time,
 	if (rts.tv_sec < 0)
 		rts.tv_sec = rts.tv_nsec = 0;

-	if (timeval) {
-		if (sizeof(rtv) > sizeof(rtv.tv_sec) + sizeof(rtv.tv_usec))
-			memset(&rtv, 0, sizeof(rtv));
-		rtv.tv_sec = rts.tv_sec;
-		rtv.tv_usec = rts.tv_nsec / NSEC_PER_USEC;
+	switch (pt_type) {
+	case PT_TIMEVAL:
+	{
+		struct timeval rtv;

-		if (!copy_to_user(p, &rtv, sizeof(rtv)))
+		if (sizeof(rtv) > sizeof(rtv.tv_sec) + sizeof(rtv.tv_usec))
+			memset(&rtv, 0, sizeof(rtv));
+		rtv.tv_sec = rts.tv_sec;
+		rtv.tv_usec = rts.tv_nsec / NSEC_PER_USEC;
+		if (!copy_to_user(p, &rtv, sizeof(rtv)))
+			return ret;
+	}
+	break;
+	case PT_OLD_TIMEVAL:
+	{
+		struct compat_timeval rtv;
+
+		rtv.tv_sec = rts.tv_sec;
+		rtv.tv_usec = rts.tv_nsec / NSEC_PER_USEC;
+		if (!copy_to_user(p, &rtv, sizeof(rtv)))
+			return ret;
+	}
+	break;
+	case PT_TIMESPEC:
+		if (!put_timespec64(&rts, p))
 			return ret;
-
-	} else if (!put_timespec64(&rts, p))
-		return ret;
-
+		break;
+	case PT_OLD_TIMESPEC:
+		if (!compat_put_timespec64(&rts, p))
+			return ret;
+		break;
+	default:
+		BUG();
+	}
 	/*
 	 * If an application puts its timeval in read-only memory, we
 	 * don't want the Linux-specific update to the timeval to
@@ -686,7 +714,7 @@ static int kern_select(int n, fd_set __user *inp, fd_set __user *outp,
 	}

 	ret = core_sys_select(n, inp, outp, exp, to);
-	ret = poll_select_copy_remaining(&end_time, tvp, 1, ret);
+	ret = poll_select_copy_remaining(&end_time, tvp, PT_TIMEVAL, ret);

 	return ret;
 }
@@ -719,7 +747,7 @@ static long do_pselect(int n, fd_set __user *inp, fd_set __user *outp,
 		return ret;

 	ret = core_sys_select(n, inp, outp, exp, to);
-	ret = poll_select_copy_remaining(&end_time, tsp, 0, ret);
+	ret = poll_select_copy_remaining(&end_time, tsp, PT_TIMESPEC, ret);

 	restore_user_sigmask(sigmask, &sigsaved);

@@ -1021,7 +1049,7 @@ SYSCALL_DEFINE3(poll, struct pollfd __user *, ufds, unsigned int, nfds,
 }

 SYSCALL_DEFINE5(ppoll, struct pollfd __user *, ufds, unsigned int, nfds,
-		struct timespec __user *, tsp, const sigset_t __user *, sigmask,
+		struct __kernel_timespec __user *, tsp, const sigset_t __user *, sigmask,
 		size_t, sigsetsize)
 {
 	sigset_t ksigmask, sigsaved;
@@ -1049,60 +1077,50 @@ SYSCALL_DEFINE5(ppoll, struct pollfd __user *, ufds, unsigned int, nfds,
 	if (ret == -EINTR)
 		ret = -ERESTARTNOHAND;

-	ret = poll_select_copy_remaining(&end_time, tsp, 0, ret);
+	ret = poll_select_copy_remaining(&end_time, tsp, PT_TIMESPEC, ret);

 	return ret;
 }

-#ifdef CONFIG_COMPAT
-#define __COMPAT_NFDBITS (8 * sizeof(compat_ulong_t))
+#if defined(CONFIG_COMPAT_32BIT_TIME) && !defined(CONFIG_64BIT)

-static
-int compat_poll_select_copy_remaining(struct timespec64 *end_time, void __user *p,
-				      int timeval, int ret)
+SYSCALL_DEFINE5(ppoll_time32, struct pollfd __user *, ufds, unsigned int, nfds,
+	struct compat_timespec __user *, tsp, const sigset_t __user *, sigmask,
+	size_t, sigsetsize)
 {
-	struct timespec64 ts;
+	sigset_t ksigmask, sigsaved;
+	struct timespec64 ts, end_time, *to = NULL;
+	int ret;

-	if (!p)
-		return ret;
+	if (tsp) {
+		if (compat_get_timespec64(&ts, tsp))
+			return -EFAULT;

-	if (current->personality & STICKY_TIMEOUTS)
-		goto sticky;
+		to = &end_time;
+		if (poll_select_set_timeout(to, ts.tv_sec, ts.tv_nsec))
+			return -EINVAL;
+	}

-	/* No update for zero timeout */
-	if (!end_time->tv_sec && !end_time->tv_nsec)
+	ret = set_user_sigmask(sigmask, &ksigmask, &sigsaved, sigsetsize);
+	if (ret)
 		return ret;

-	ktime_get_ts64(&ts);
-	ts = timespec64_sub(*end_time, ts);
-	if (ts.tv_sec < 0)
-		ts.tv_sec = ts.tv_nsec = 0;
+	ret = do_sys_poll(ufds, nfds, to);

-	if (timeval) {
-		struct compat_timeval rtv;
+	restore_user_sigmask(sigmask, &sigsaved);

-		rtv.tv_sec = ts.tv_sec;
-		rtv.tv_usec = ts.tv_nsec / NSEC_PER_USEC;
+	/* We can restart this syscall, usually */
+	if (ret == -EINTR)
+		ret = -ERESTARTNOHAND;

-		if (!copy_to_user(p, &rtv, sizeof(rtv)))
-			return ret;
-	} else {
-		if (!compat_put_timespec64(&ts, p))
-			return ret;
-	}
-	/*
-	 * If an application puts its timeval in read-only memory, we
-	 * don't want the Linux-specific update to the timeval to
-	 * cause a fault after the select has completed
-	 * successfully. However, because we're not updating the
-	 * timeval, we can't restart the system call.
-	 */
+	ret = poll_select_copy_remaining(&end_time, tsp, PT_OLD_TIMESPEC, ret);

-sticky:
-	if (ret == -ERESTARTNOHAND)
-		ret = -EINTR;
 	return ret;
 }
+#endif
+
+#ifdef CONFIG_COMPAT
+#define __COMPAT_NFDBITS (8 * sizeof(compat_ulong_t))

 /*
  * Ooo, nasty. We need here to frob 32-bit unsigned longs to
@@ -1234,7 +1252,7 @@ static int do_compat_select(int n, compat_ulong_t __user *inp,
 	}

 	ret = compat_core_sys_select(n, inp, outp, exp, to);
-	ret = compat_poll_select_copy_remaining(&end_time, tvp, 1, ret);
+	ret = poll_select_copy_remaining(&end_time, tvp, PT_OLD_TIMEVAL, ret);

 	return ret;
 }
@@ -1287,7 +1305,7 @@ static long do_compat_pselect(int n, compat_ulong_t __user *inp,
 		return ret;

 	ret = compat_core_sys_select(n, inp, outp, exp, to);
-	ret = compat_poll_select_copy_remaining(&end_time, tsp, 0, ret);
+	ret = poll_select_copy_remaining(&end_time, tsp, PT_OLD_TIMESPEC, ret);

 	restore_user_sigmask(sigmask, &sigsaved);

@@ -1313,6 +1331,7 @@ COMPAT_SYSCALL_DEFINE6(pselect6, int, n, compat_ulong_t __user *, inp,
 			   sigsetsize);
 }

+#if defined(CONFIG_COMPAT_32BIT_TIME)
 COMPAT_SYSCALL_DEFINE5(ppoll, struct pollfd __user *, ufds,
 	unsigned int, nfds, struct compat_timespec __user *, tsp,
 	const compat_sigset_t __user *, sigmask, compat_size_t, sigsetsize)
@@ -1342,8 +1361,45 @@ COMPAT_SYSCALL_DEFINE5(ppoll, struct pollfd __user *, ufds,
 	if (ret == -EINTR)
 		ret = -ERESTARTNOHAND;

-	ret = compat_poll_select_copy_remaining(&end_time, tsp, 0, ret);
+	ret = poll_select_copy_remaining(&end_time, tsp, PT_OLD_TIMESPEC, ret);

 	return ret;
 }
 #endif
+
+/* New compat syscall for 64 bit time_t*/
+COMPAT_SYSCALL_DEFINE5(ppoll_time64, struct pollfd __user *, ufds,
+	unsigned int, nfds, struct __kernel_timespec __user *, tsp,
+	const compat_sigset_t __user *, sigmask, compat_size_t, sigsetsize)
+{
+	sigset_t ksigmask, sigsaved;
+	struct timespec64 ts, end_time, *to = NULL;
+	int ret;
+
+	if (tsp) {
+		if (get_timespec64(&ts, tsp))
+			return -EFAULT;
+
+		to = &end_time;
+		if (poll_select_set_timeout(to, ts.tv_sec, ts.tv_nsec))
+			return -EINVAL;
+	}
+
+	ret = set_compat_user_sigmask(sigmask, &ksigmask, &sigsaved, sigsetsize);
+	if (ret)
+		return ret;
+
+	ret = do_sys_poll(ufds, nfds, to);
+
+	restore_user_sigmask(sigmask, &sigsaved);
+
+	/* We can restart this syscall, usually */
+	if (ret == -EINTR)
+		ret = -ERESTARTNOHAND;
+
+	ret = poll_select_copy_remaining(&end_time, tsp, PT_TIMESPEC, ret);
+
+	return ret;
+}
+
+#endif
diff --git a/include/linux/compat.h b/include/linux/compat.h
index c0476f7c4444..714856d98351 100644
--- a/include/linux/compat.h
+++ b/include/linux/compat.h
@@ -654,6 +654,11 @@ asmlinkage long compat_sys_ppoll(struct pollfd __user *ufds,
 				 struct compat_timespec __user *tsp,
 				 const compat_sigset_t __user *sigmask,
 				 compat_size_t sigsetsize);
+asmlinkage long compat_sys_ppoll_time64(struct pollfd __user *ufds,
+					unsigned int nfds,
+					struct __kernel_timespec __user *tsp,
+					const compat_sigset_t __user *sigmask,
+					compat_size_t sigsetsize);

 /* fs/signalfd.c */
 asmlinkage long compat_sys_signalfd4(int ufd,
diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h
index 2ff814c92f7f..0b7fb85b3a06 100644
--- a/include/linux/syscalls.h
+++ b/include/linux/syscalls.h
@@ -469,7 +469,10 @@ asmlinkage long sys_pselect6(int, fd_set __user *, fd_set __user *,
 			     fd_set __user *, struct timespec __user *,
 			     void __user *);
 asmlinkage long sys_ppoll(struct pollfd __user *, unsigned int,
-			  struct timespec __user *, const sigset_t __user *,
+			  struct __kernel_timespec __user *, const sigset_t __user *,
+			  size_t);
+asmlinkage long sys_ppoll_time32(struct pollfd __user *, unsigned int,
+			  struct compat_timespec __user *, const sigset_t __user *,
 			  size_t);

 /* fs/signalfd.c */
From: Jens Axboe axboe@kernel.dk
mainline inclusion from mainline-5.1-rc6 commit b19062a567266ee1f10f6709325f766bbcc07d1c category: feature bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27 CVE: NA ---------------------------
If we have multiple threads, one doing io_uring_enter() while the other is doing io_uring_register(), we can run into a deadlock between the two. io_uring_register() must wait for existing users of the io_uring instance to exit. But it does so while holding the io_uring mutex. Callers of io_uring_enter() may need this mutex to make progress (and eventually exit). If we wait for users to exit in io_uring_register() while still holding the io_uring mutex, we risk exactly that deadlock.
Drop the io_uring mutex while waiting for existing callers to exit. This is safe and guaranteed to make forward progress, since we already killed the percpu ref before doing so. Hence later callers of io_uring_enter() will be rejected.
Reported-by: syzbot+16dc03452dee970a0c3e@syzkaller.appspotmail.com Signed-off-by: Jens Axboe axboe@kernel.dk Signed-off-by: Zhihao Cheng chengzhihao1@huawei.com Signed-off-by: yangerkun yangerkun@huawei.com Reviewed-by: zhangyi (F) yi.zhang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- fs/io_uring.c | 12 ++++++++++++ 1 file changed, 12 insertions(+)
diff --git a/fs/io_uring.c b/fs/io_uring.c index 77aea48e1e61..742f541ccf09 100644 --- a/fs/io_uring.c +++ b/fs/io_uring.c @@ -2926,11 +2926,23 @@ SYSCALL_DEFINE2(io_uring_setup, u32, entries,
static int __io_uring_register(struct io_ring_ctx *ctx, unsigned opcode, void __user *arg, unsigned nr_args) + __releases(ctx->uring_lock) + __acquires(ctx->uring_lock) { int ret;
percpu_ref_kill(&ctx->refs); + + /* + * Drop uring mutex before waiting for references to exit. If another + * thread is currently inside io_uring_enter() it might need to grab + * the uring_lock to make progress. If we hold it here across the drain + * wait, then we can deadlock. It's safe to drop the mutex here, since + * no new references will come in after we've killed the percpu ref. + */ + mutex_unlock(&ctx->uring_lock); wait_for_completion(&ctx->ctx_done); + mutex_lock(&ctx->uring_lock);
switch (opcode) { case IORING_REGISTER_BUFFERS:
From: Jens Axboe axboe@kernel.dk
mainline inclusion from mainline-5.1-rc6 commit 74f464e97044da33b25aaed00213914b0edf1f2e category: feature bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27 CVE: NA ---------------------------
This is a leftover from when the rings initially were not free-flowing, and hence a test for tail + 1 == head would indicate full. Since we now let them wrap instead of masking them with the size, we need to check if they drift more than the ring size from each other.
This fixes a case where we'd overwrite CQ ring entries, if the user failed to reap completions. Both cases would ultimately result in lost completions as the application violated the depth it asked for. The only difference is that before this fix we'd return invalid entries for the overflowed completions, instead of properly flagging it in the cq_ring->overflow variable.
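To make the unsigned arithmetic concrete, here is a minimal standalone sketch; the index values are made up, only the wrap behavior matters:

    #include <stdio.h>

    int main(void)
    {
            unsigned int ring_entries = 8;           /* power of 2 */
            unsigned int head = 0xfffffffcU;         /* about to wrap */
            unsigned int tail = head + ring_entries; /* wraps to 4 */

            /* the old full test, written for masked indices, no longer fires */
            printf("tail + 1 == head    -> %d\n", tail + 1 == head);
            /* unsigned subtraction gives the pending count across the wrap */
            printf("tail - head == size -> %d\n", tail - head == ring_entries);
            return 0;
    }

With free-flowing indices, tail - head is always the number of unconsumed entries, so comparing it against ring_entries is the correct full test.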
Reported-by: Stefan Bühler source@stbuehler.de Signed-off-by: Jens Axboe axboe@kernel.dk Signed-off-by: Zhihao Cheng chengzhihao1@huawei.com Signed-off-by: yangerkun yangerkun@huawei.com Reviewed-by: zhangyi (F) yi.zhang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- fs/io_uring.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/fs/io_uring.c b/fs/io_uring.c index 742f541ccf09..0f0c052aea49 100644 --- a/fs/io_uring.c +++ b/fs/io_uring.c @@ -338,7 +338,7 @@ static struct io_uring_cqe *io_get_cqring(struct io_ring_ctx *ctx) tail = ctx->cached_cq_tail; /* See comment at the top of the file */ smp_rmb(); - if (tail + 1 == READ_ONCE(ring->r.head)) + if (tail - READ_ONCE(ring->r.head) == ring->ring_entries) return NULL;
ctx->cached_cq_tail++;
From: Jens Axboe axboe@kernel.dk
mainline inclusion from mainline-5.1-rc7 commit 35fa71a030caa50458a043560d4814ea9bcd639f category: feature bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27 CVE: NA ---------------------------
If we have multiple threads doing io_uring_register(2) on an io_uring fd, then we can potentially try and kill the percpu reference while someone else has already killed it.
Prevent this race by failing io_uring_register(2) if the ref is marked dying. This is safe since we're inside the io_uring mutex.
Fixes: b19062a56726 ("io_uring: fix possible deadlock between io_uring_{enter,register}") Reported-by: syzbot syzbot+10d25e23199614b7721f@syzkaller.appspotmail.com Signed-off-by: Jens Axboe axboe@kernel.dk Signed-off-by: Zhihao Cheng chengzhihao1@huawei.com Signed-off-by: yangerkun yangerkun@huawei.com Reviewed-by: zhangyi (F) yi.zhang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- fs/io_uring.c | 8 ++++++++ 1 file changed, 8 insertions(+)
diff --git a/fs/io_uring.c b/fs/io_uring.c index 0f0c052aea49..6b13efe414bc 100644 --- a/fs/io_uring.c +++ b/fs/io_uring.c @@ -2931,6 +2931,14 @@ static int __io_uring_register(struct io_ring_ctx *ctx, unsigned opcode, { int ret;
+ /* + * We're inside the ring mutex, if the ref is already dying, then + * someone else killed the ctx or is already going through + * io_uring_register(). + */ + if (percpu_ref_is_dying(&ctx->refs)) + return -ENXIO; + percpu_ref_kill(&ctx->refs);
/*
From: Stefan Bühler source@stbuehler.de
mainline inclusion from mainline-5.1-rc7 commit e523a29c4f2703bdb98f68ce1bb256e259fd8d5f category: feature bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27 CVE: NA ---------------------------
A read memory barrier is required between reading SQ tail and reading the actual data belonging to the SQ entry.
Userspace needs a matching write barrier between writing SQ entries and updating SQ tail (using smp_store_release to update tail will do).
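A minimal sketch of the application side of this pairing, assuming the ring pointers were set up from the mmap offsets described later in this series; the pointer names are illustrative, and the GCC/Clang __atomic builtins stand in for whatever barrier primitives the application actually uses:

    #include <linux/io_uring.h>

    /* illustrative pointers into the mmap'ed SQ ring, set up elsewhere */
    extern struct io_uring_sqe *sqes;   /* IORING_OFF_SQES mapping */
    extern unsigned int *sq_array;      /* ring->array */
    extern unsigned int *sq_ktail;      /* &ring->r.tail */
    extern unsigned int sq_mask;        /* ring->ring_mask */

    static void submit_one(const struct io_uring_sqe *src)
    {
            unsigned int tail = *sq_ktail;  /* we are the only tail writer */

            sqes[tail & sq_mask] = *src;               /* 1: write the SQE */
            sq_array[tail & sq_mask] = tail & sq_mask; /* 2: publish its index */
            /*
             * 3: the release store orders the SQE and array writes before
             * the new tail becomes visible; it pairs with the kernel's
             * smp_load_acquire() in io_get_sqring().
             */
            __atomic_store_n(sq_ktail, tail + 1, __ATOMIC_RELEASE);
    }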
Signed-off-by: Stefan Bühler source@stbuehler.de Signed-off-by: Jens Axboe axboe@kernel.dk Signed-off-by: Zhihao Cheng chengzhihao1@huawei.com Signed-off-by: yangerkun yangerkun@huawei.com Reviewed-by: zhangyi (F) yi.zhang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- fs/io_uring.c | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-)
diff --git a/fs/io_uring.c b/fs/io_uring.c index 6b13efe414bc..9507ef2b399e 100644 --- a/fs/io_uring.c +++ b/fs/io_uring.c @@ -1736,7 +1736,8 @@ static bool io_get_sqring(struct io_ring_ctx *ctx, struct sqe_submit *s) head = ctx->cached_sq_head; /* See comment at the top of this file */ smp_rmb(); - if (head == READ_ONCE(ring->r.tail)) + /* make sure SQ entry isn't read before tail */ + if (head == smp_load_acquire(&ring->r.tail)) return false;
head = READ_ONCE(ring->array[head & ctx->sq_mask]);
From: Stefan Bühler source@stbuehler.de
mainline inclusion from mainline-5.1-rc7 commit 0d7bae69c574c5f25802f8a71252e7d66933a3ab category: feature bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27 CVE: NA ---------------------------
Reading the SQ tail needs to come after setting IORING_SQ_NEED_WAKEUP in flags; there is no cheap barrier for ordering a store before a load, so a full memory barrier is required.
Userspace needs a full memory barrier between updating SQ tail and checking for the IORING_SQ_NEED_WAKEUP too.
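Roughly, the application side then looks like this (a sketch with illustrative names; __NR_io_uring_enter and the flag constants come from the uapi headers, and the fence stands in for a full smp_mb()):

    #include <stddef.h>
    #include <unistd.h>
    #include <sys/syscall.h>
    #include <linux/io_uring.h>

    extern unsigned int *sq_ktail;      /* &ring->r.tail */
    extern unsigned int *sq_kflags;     /* &ring->flags */
    extern int ring_fd;

    static void publish_and_maybe_wake(unsigned int new_tail,
                                       unsigned int to_submit)
    {
            __atomic_store_n(sq_ktail, new_tail, __ATOMIC_RELEASE);
            /*
             * Full barrier: the tail store must be ordered before the flags
             * load below, mirroring the smp_mb() the kernel now issues on
             * the other side.
             */
            __atomic_thread_fence(__ATOMIC_SEQ_CST);
            if (*sq_kflags & IORING_SQ_NEED_WAKEUP)
                    syscall(__NR_io_uring_enter, ring_fd, to_submit, 0,
                            IORING_ENTER_SQ_WAKEUP, NULL, 0);
    }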
Signed-off-by: Stefan Bühler source@stbuehler.de Signed-off-by: Jens Axboe axboe@kernel.dk Signed-off-by: Zhihao Cheng chengzhihao1@huawei.com Signed-off-by: yangerkun yangerkun@huawei.com Reviewed-by: zhangyi (F) yi.zhang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- fs/io_uring.c | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-)
diff --git a/fs/io_uring.c b/fs/io_uring.c index 9507ef2b399e..68f0ac9470c3 100644 --- a/fs/io_uring.c +++ b/fs/io_uring.c @@ -1862,7 +1862,8 @@ static int io_sq_thread(void *data)
/* Tell userspace we may need a wakeup call */ ctx->sq_ring->flags |= IORING_SQ_NEED_WAKEUP; - smp_wmb(); + /* make sure to read SQ tail after writing flags */ + smp_mb();
if (!io_get_sqring(ctx, &sqes[0])) { if (kthread_should_stop()) {
From: Stefan Bühler source@stbuehler.de
mainline inclusion from mainline-5.1-rc7 commit fb775faa9e46ff481e4ced11116c9bd45359cb43 category: feature bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27 CVE: NA ---------------------------
io_uring_poll shouldn't signal EPOLLOUT | EPOLLWRNORM if the queue is full; the old check would always signal EPOLLOUT | EPOLLWRNORM (unless there were U32_MAX - 1 entries in the SQ queue).
Signed-off-by: Stefan Bühler source@stbuehler.de Signed-off-by: Jens Axboe axboe@kernel.dk Signed-off-by: Zhihao Cheng chengzhihao1@huawei.com Signed-off-by: yangerkun yangerkun@huawei.com Reviewed-by: zhangyi (F) yi.zhang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- fs/io_uring.c | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-)
diff --git a/fs/io_uring.c b/fs/io_uring.c index 68f0ac9470c3..dcbb2beb2050 100644 --- a/fs/io_uring.c +++ b/fs/io_uring.c @@ -2573,7 +2573,8 @@ static __poll_t io_uring_poll(struct file *file, poll_table *wait) poll_wait(file, &ctx->cq_wait, wait); /* See comment at the top of this file */ smp_rmb(); - if (READ_ONCE(ctx->sq_ring->r.tail) + 1 != ctx->cached_sq_head) + if (READ_ONCE(ctx->sq_ring->r.tail) - ctx->cached_sq_head != + ctx->sq_ring->ring_entries) mask |= EPOLLOUT | EPOLLWRNORM; if (READ_ONCE(ctx->cq_ring->r.head) != ctx->cached_cq_tail) mask |= EPOLLIN | EPOLLRDNORM;
From: Jens Axboe axboe@kernel.dk
mainline inclusion from mainline-5.1-rc7 commit 8358e3a8264a228cf2dfb6f3a05c0328f4118f12 category: feature bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27 CVE: NA ---------------------------
Since commit 09bb839434b, we no longer use the state argument for any sort of on-stack caching in the io read and write path. Remove the stale and unused argument from them, and bubble it up to __io_submit_sqe() and down to io_prep_rw().
Signed-off-by: Jens Axboe axboe@kernel.dk Signed-off-by: Zhihao Cheng chengzhihao1@huawei.com Signed-off-by: yangerkun yangerkun@huawei.com Reviewed-by: zhangyi (F) yi.zhang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- fs/io_uring.c | 25 ++++++++++++------------- 1 file changed, 12 insertions(+), 13 deletions(-)
diff --git a/fs/io_uring.c b/fs/io_uring.c index dcbb2beb2050..d1efb389661c 100644 --- a/fs/io_uring.c +++ b/fs/io_uring.c @@ -740,7 +740,7 @@ static bool io_file_supports_async(struct file *file) }
static int io_prep_rw(struct io_kiocb *req, const struct sqe_submit *s, - bool force_nonblock, struct io_submit_state *state) + bool force_nonblock) { const struct io_uring_sqe *sqe = s->sqe; struct io_ring_ctx *ctx = req->ctx; @@ -935,7 +935,7 @@ static void io_async_list_note(int rw, struct io_kiocb *req, size_t len) }
static int io_read(struct io_kiocb *req, const struct sqe_submit *s, - bool force_nonblock, struct io_submit_state *state) + bool force_nonblock) { struct iovec inline_vecs[UIO_FASTIOV], *iovec = inline_vecs; struct kiocb *kiocb = &req->rw; @@ -944,7 +944,7 @@ static int io_read(struct io_kiocb *req, const struct sqe_submit *s, size_t iov_count; int ret;
- ret = io_prep_rw(req, s, force_nonblock, state); + ret = io_prep_rw(req, s, force_nonblock); if (ret) return ret; file = kiocb->ki_filp; @@ -982,7 +982,7 @@ static int io_read(struct io_kiocb *req, const struct sqe_submit *s, }
static int io_write(struct io_kiocb *req, const struct sqe_submit *s, - bool force_nonblock, struct io_submit_state *state) + bool force_nonblock) { struct iovec inline_vecs[UIO_FASTIOV], *iovec = inline_vecs; struct kiocb *kiocb = &req->rw; @@ -991,7 +991,7 @@ static int io_write(struct io_kiocb *req, const struct sqe_submit *s, size_t iov_count; int ret;
- ret = io_prep_rw(req, s, force_nonblock, state); + ret = io_prep_rw(req, s, force_nonblock); if (ret) return ret;
@@ -1333,8 +1333,7 @@ static int io_poll_add(struct io_kiocb *req, const struct io_uring_sqe *sqe) }
static int __io_submit_sqe(struct io_ring_ctx *ctx, struct io_kiocb *req, - const struct sqe_submit *s, bool force_nonblock, - struct io_submit_state *state) + const struct sqe_submit *s, bool force_nonblock) { int ret, opcode;
@@ -1350,18 +1349,18 @@ static int __io_submit_sqe(struct io_ring_ctx *ctx, struct io_kiocb *req, case IORING_OP_READV: if (unlikely(s->sqe->buf_index)) return -EINVAL; - ret = io_read(req, s, force_nonblock, state); + ret = io_read(req, s, force_nonblock); break; case IORING_OP_WRITEV: if (unlikely(s->sqe->buf_index)) return -EINVAL; - ret = io_write(req, s, force_nonblock, state); + ret = io_write(req, s, force_nonblock); break; case IORING_OP_READ_FIXED: - ret = io_read(req, s, force_nonblock, state); + ret = io_read(req, s, force_nonblock); break; case IORING_OP_WRITE_FIXED: - ret = io_write(req, s, force_nonblock, state); + ret = io_write(req, s, force_nonblock); break; case IORING_OP_FSYNC: ret = io_fsync(req, s->sqe, force_nonblock); @@ -1454,7 +1453,7 @@ static void io_sq_wq_submit_work(struct work_struct *work) s->has_user = cur_mm != NULL; s->needs_lock = true; do { - ret = __io_submit_sqe(ctx, req, s, false, NULL); + ret = __io_submit_sqe(ctx, req, s, false); /* * We can get EAGAIN for polled IO even though * we're forcing a sync submission from here, @@ -1620,7 +1619,7 @@ static int io_submit_sqe(struct io_ring_ctx *ctx, struct sqe_submit *s, if (unlikely(ret)) goto out;
- ret = __io_submit_sqe(ctx, req, s, true, state); + ret = __io_submit_sqe(ctx, req, s, true); if (ret == -EAGAIN) { struct io_uring_sqe *sqe_copy;
From: Stefan Bühler source@stbuehler.de
mainline inclusion from mainline-5.1 commit 8449eedaa1da6a51d67190c905b1b54243e095f6 category: feature bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27 CVE: NA ---------------------------
Not all request types set REQ_F_FORCE_NONBLOCK when they need async punting; reverse the logic instead and set REQ_F_NOWAIT if the request mustn't be punted.
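From userspace this is driven by sqe->rw_flags: a request submitted with RWF_NOWAIT now completes with -EAGAIN rather than being punted to the workqueue. A hypothetical prep helper:

    #include <linux/io_uring.h>
    #include <linux/fs.h>           /* RWF_NOWAIT */
    #include <sys/uio.h>
    #include <string.h>

    static void prep_nowait_readv(struct io_uring_sqe *sqe, int fd,
                                  const struct iovec *iov, unsigned int nr_vecs)
    {
            memset(sqe, 0, sizeof(*sqe));
            sqe->opcode   = IORING_OP_READV;
            sqe->fd       = fd;
            sqe->addr     = (unsigned long) iov;
            sqe->len      = nr_vecs;
            /*
             * io_prep_rw() turns this into IOCB_NOWAIT and sets
             * REQ_F_NOWAIT, so -EAGAIN completes the request instead of
             * punting it to the async workqueue.
             */
            sqe->rw_flags = RWF_NOWAIT;
    }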
Signed-off-by: Stefan Bühler source@stbuehler.de
Merged with my previous patch for this.
Signed-off-by: Jens Axboe axboe@kernel.dk Signed-off-by: Zhihao Cheng chengzhihao1@huawei.com Signed-off-by: yangerkun yangerkun@huawei.com Reviewed-by: zhangyi (F) yi.zhang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- fs/io_uring.c | 17 ++++++++++------- 1 file changed, 10 insertions(+), 7 deletions(-)
diff --git a/fs/io_uring.c b/fs/io_uring.c index d1efb389661c..ddba6d1ea340 100644 --- a/fs/io_uring.c +++ b/fs/io_uring.c @@ -221,7 +221,7 @@ struct io_kiocb { struct list_head list; unsigned int flags; refcount_t refs; -#define REQ_F_FORCE_NONBLOCK 1 /* inline submission attempt */ +#define REQ_F_NOWAIT 1 /* must not punt to workers */ #define REQ_F_IOPOLL_COMPLETED 2 /* polled IO has completed */ #define REQ_F_FIXED_FILE 4 /* ctx owns file */ #define REQ_F_SEQ_PREV 8 /* sequential with previous */ @@ -774,10 +774,14 @@ static int io_prep_rw(struct io_kiocb *req, const struct sqe_submit *s, ret = kiocb_set_rw_flags(kiocb, READ_ONCE(sqe->rw_flags)); if (unlikely(ret)) return ret; - if (force_nonblock) { + + /* don't allow async punt if RWF_NOWAIT was requested */ + if (kiocb->ki_flags & IOCB_NOWAIT) + req->flags |= REQ_F_NOWAIT; + + if (force_nonblock) kiocb->ki_flags |= IOCB_NOWAIT; - req->flags |= REQ_F_FORCE_NONBLOCK; - } + if (ctx->flags & IORING_SETUP_IOPOLL) { if (!(kiocb->ki_flags & IOCB_DIRECT) || !kiocb->ki_filp->f_op->iopoll) @@ -1433,8 +1437,7 @@ static void io_sq_wq_submit_work(struct work_struct *work) struct sqe_submit *s = &req->submit; const struct io_uring_sqe *sqe = s->sqe;
- /* Ensure we clear previously set forced non-block flag */ - req->flags &= ~REQ_F_FORCE_NONBLOCK; + /* Ensure we clear previously set non-block flag */ req->rw.ki_flags &= ~IOCB_NOWAIT;
ret = 0; @@ -1620,7 +1623,7 @@ static int io_submit_sqe(struct io_ring_ctx *ctx, struct sqe_submit *s, goto out;
ret = __io_submit_sqe(ctx, req, s, true); - if (ret == -EAGAIN) { + if (ret == -EAGAIN && !(req->flags & REQ_F_NOWAIT)) { struct io_uring_sqe *sqe_copy;
sqe_copy = kmalloc(sizeof(*sqe_copy), GFP_KERNEL);
From: Stefan Bühler source@stbuehler.de
mainline inclusion from mainline-5.1 commit 1e84b97b7377bd0198f87b49ad3e396e84bf0458 category: feature bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27 CVE: NA ---------------------------
The application reading the CQ ring needs a barrier to pair with the smp_store_release in io_commit_cqring, not the barrier after it.
Also a write barrier *after* writing something (but not *before* writing anything interesting) doesn't order anything, so an smp_wmb() after writing SQ tail is not needed.
Additionally consider reading SQ head and writing CQ tail in the notes.
Also add some clarifications on how the various other fields in the ring buffers are used.
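Taken together, a conforming CQ reap loop on the application side looks roughly like this (a sketch; the pointer names and handle_completion() are illustrative):

    #include <linux/io_uring.h>

    extern struct io_uring_cqe *cqes;          /* ring->cqes */
    extern unsigned int *cq_khead, *cq_ktail;  /* &ring->r.head, &ring->r.tail */
    extern unsigned int cq_mask;               /* ring->ring_mask */
    extern void handle_completion(unsigned long long user_data, int res);

    static void reap_cqes(void)
    {
            unsigned int head = *cq_khead;  /* we are the only head writer */
            /* acquire pairs with the kernel's store-release of the CQ tail */
            unsigned int tail = __atomic_load_n(cq_ktail, __ATOMIC_ACQUIRE);

            while (head != tail) {
                    struct io_uring_cqe *cqe = &cqes[head & cq_mask];

                    handle_completion(cqe->user_data, cqe->res);
                    head++;
            }
            /*
             * The release store orders the entry loads before the head
             * update, so the kernel will not recycle entries we have not
             * consumed yet.
             */
            __atomic_store_n(cq_khead, head, __ATOMIC_RELEASE);
    }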
Signed-off-by: Stefan Bühler source@stbuehler.de Signed-off-by: Jens Axboe axboe@kernel.dk Signed-off-by: Zhihao Cheng chengzhihao1@huawei.com Signed-off-by: yangerkun yangerkun@huawei.com Reviewed-by: zhangyi (F) yi.zhang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- fs/io_uring.c | 119 ++++++++++++++++++++++++++++++++++++++++++++++---- 1 file changed, 110 insertions(+), 9 deletions(-)
diff --git a/fs/io_uring.c b/fs/io_uring.c index ddba6d1ea340..fac3785de555 100644 --- a/fs/io_uring.c +++ b/fs/io_uring.c @@ -4,15 +4,28 @@ * supporting fast/efficient IO. * * A note on the read/write ordering memory barriers that are matched between - * the application and kernel side. When the application reads the CQ ring - * tail, it must use an appropriate smp_rmb() to order with the smp_wmb() - * the kernel uses after writing the tail. Failure to do so could cause a - * delay in when the application notices that completion events available. - * This isn't a fatal condition. Likewise, the application must use an - * appropriate smp_wmb() both before writing the SQ tail, and after writing - * the SQ tail. The first one orders the sqe writes with the tail write, and - * the latter is paired with the smp_rmb() the kernel will issue before - * reading the SQ tail on submission. + * the application and kernel side. + * + * After the application reads the CQ ring tail, it must use an + * appropriate smp_rmb() to pair with the smp_wmb() the kernel uses + * before writing the tail (using smp_load_acquire to read the tail will + * do). It also needs a smp_mb() before updating CQ head (ordering the + * entry load(s) with the head store), pairing with an implicit barrier + * through a control-dependency in io_get_cqring (smp_store_release to + * store head will do). Failure to do so could lead to reading invalid + * CQ entries. + * + * Likewise, the application must use an appropriate smp_wmb() before + * writing the SQ tail (ordering SQ entry stores with the tail store), + * which pairs with smp_load_acquire in io_get_sqring (smp_store_release + * to store the tail will do). And it needs a barrier ordering the SQ + * head load before writing new SQ entries (smp_load_acquire to read + * head will do). + * + * When using the SQ poll thread (IORING_SETUP_SQPOLL), the application + * needs to check the SQ flags for IORING_SQ_NEED_WAKEUP *after* + * updating the SQ tail; a full memory barrier smp_mb() is needed + * between. * * Also see the examples in the liburing library: * @@ -70,20 +83,108 @@ struct io_uring { u32 tail ____cacheline_aligned_in_smp; };
+/* + * This data is shared with the application through the mmap at offset + * IORING_OFF_SQ_RING. + * + * The offsets to the member fields are published through struct + * io_sqring_offsets when calling io_uring_setup. + */ struct io_sq_ring { + /* + * Head and tail offsets into the ring; the offsets need to be + * masked to get valid indices. + * + * The kernel controls head and the application controls tail. + */ struct io_uring r; + /* + * Bitmask to apply to head and tail offsets (constant, equals + * ring_entries - 1) + */ u32 ring_mask; + /* Ring size (constant, power of 2) */ u32 ring_entries; + /* + * Number of invalid entries dropped by the kernel due to + * invalid index stored in array + * + * Written by the kernel, shouldn't be modified by the + * application (i.e. get number of "new events" by comparing to + * cached value). + * + * After a new SQ head value was read by the application this + * counter includes all submissions that were dropped reaching + * the new SQ head (and possibly more). + */ u32 dropped; + /* + * Runtime flags + * + * Written by the kernel, shouldn't be modified by the + * application. + * + * The application needs a full memory barrier before checking + * for IORING_SQ_NEED_WAKEUP after updating the sq tail. + */ u32 flags; + /* + * Ring buffer of indices into array of io_uring_sqe, which is + * mmapped by the application using the IORING_OFF_SQES offset. + * + * This indirection could e.g. be used to assign fixed + * io_uring_sqe entries to operations and only submit them to + * the queue when needed. + * + * The kernel modifies neither the indices array nor the entries + * array. + */ u32 array[]; };
+/* + * This data is shared with the application through the mmap at offset + * IORING_OFF_CQ_RING. + * + * The offsets to the member fields are published through struct + * io_cqring_offsets when calling io_uring_setup. + */ struct io_cq_ring { + /* + * Head and tail offsets into the ring; the offsets need to be + * masked to get valid indices. + * + * The application controls head and the kernel tail. + */ struct io_uring r; + /* + * Bitmask to apply to head and tail offsets (constant, equals + * ring_entries - 1) + */ u32 ring_mask; + /* Ring size (constant, power of 2) */ u32 ring_entries; + /* + * Number of completion events lost because the queue was full; + * this should be avoided by the application by making sure + * there are not more requests pending thatn there is space in + * the completion queue. + * + * Written by the kernel, shouldn't be modified by the + * application (i.e. get number of "new events" by comparing to + * cached value). + * + * As completion events come in out of order this counter is not + * ordered with any other data. + */ u32 overflow; + /* + * Ring buffer of completion events. + * + * The kernel writes completion events fresh every time they are + * produced, so the application is allowed to modify pending + * entries. + */ struct io_uring_cqe cqes[]; };
From: Mark Rutland mark.rutland@arm.com
mainline inclusion from mainline-5.1 commit 52e04ef4c9d459cba3afd86ec335a411b40b7fd2 category: feature bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27 CVE: NA ---------------------------
If io_allocate_scq_urings() fails to allocate an sq_* region, it will call io_mem_free() for any previously allocated regions, but leave dangling pointers to these regions in the ctx. Any regions which have not yet been allocated are left NULL. Note that when returning -EOVERFLOW, the previously allocated sq_ring is not freed, which appears to be an unintentional leak.
When io_allocate_scq_urings() fails, io_uring_create() will call io_ring_ctx_wait_and_kill(), which calls io_mem_free() on all the sq_* regions, assuming the pointers are valid and not NULL.
This can result in pages being freed multiple times, which has been observed to corrupt the page state, leading to subsequent fun. This can also result in virt_to_page() on NULL, resulting in the use of bogus page addresses, and yet more subsequent fun. The latter can be detected with CONFIG_DEBUG_VIRTUAL on arm64.
Adding a cleanup path to io_allocate_scq_urings() complicates the logic, so let's leave it to io_ring_ctx_free() to consistently free these pointers, and simplify the io_allocate_scq_urings() error paths.
Full splats from before this patch below. Note that the pointer logged by the DEBUG_VIRTUAL "non-linear address" warning has been hashed, and is actually NULL.
[ 26.098129] page:ffff80000e949a00 count:0 mapcount:-128 mapping:0000000000000000 index:0x0 [ 26.102976] flags: 0x63fffc000000() [ 26.104373] raw: 000063fffc000000 ffff80000e86c188 ffff80000ea3df08 0000000000000000 [ 26.108917] raw: 0000000000000000 0000000000000001 00000000ffffff7f 0000000000000000 [ 26.137235] page dumped because: VM_BUG_ON_PAGE(page_ref_count(page) == 0) [ 26.143960] ------------[ cut here ]------------ [ 26.146020] kernel BUG at include/linux/mm.h:547! [ 26.147586] Internal error: Oops - BUG: 0 [#1] PREEMPT SMP [ 26.149163] Modules linked in: [ 26.150287] Process syz-executor.21 (pid: 20204, stack limit = 0x000000000e9cefeb) [ 26.153307] CPU: 2 PID: 20204 Comm: syz-executor.21 Not tainted 5.1.0-rc7-00004-g7d30b2ea43d6 #18 [ 26.156566] Hardware name: linux,dummy-virt (DT) [ 26.158089] pstate: 40400005 (nZcv daif +PAN -UAO) [ 26.159869] pc : io_mem_free+0x9c/0xa8 [ 26.161436] lr : io_mem_free+0x9c/0xa8 [ 26.162720] sp : ffff000013003d60 [ 26.164048] x29: ffff000013003d60 x28: ffff800025048040 [ 26.165804] x27: 0000000000000000 x26: ffff800025048040 [ 26.167352] x25: 00000000000000c0 x24: ffff0000112c2820 [ 26.169682] x23: 0000000000000000 x22: 0000000020000080 [ 26.171899] x21: ffff80002143b418 x20: ffff80002143b400 [ 26.174236] x19: ffff80002143b280 x18: 0000000000000000 [ 26.176607] x17: 0000000000000000 x16: 0000000000000000 [ 26.178997] x15: 0000000000000000 x14: 0000000000000000 [ 26.181508] x13: 00009178a5e077b2 x12: 0000000000000001 [ 26.183863] x11: 0000000000000000 x10: 0000000000000980 [ 26.186437] x9 : ffff000013003a80 x8 : ffff800025048a20 [ 26.189006] x7 : ffff8000250481c0 x6 : ffff80002ffe9118 [ 26.191359] x5 : ffff80002ffe9118 x4 : 0000000000000000 [ 26.193863] x3 : ffff80002ffefe98 x2 : 44c06ddd107d1f00 [ 26.196642] x1 : 0000000000000000 x0 : 000000000000003e [ 26.198892] Call trace: [ 26.199893] io_mem_free+0x9c/0xa8 [ 26.201155] io_ring_ctx_wait_and_kill+0xec/0x180 [ 26.202688] io_uring_setup+0x6c4/0x6f0 [ 26.204091] __arm64_sys_io_uring_setup+0x18/0x20 [ 26.205576] el0_svc_common.constprop.0+0x7c/0xe8 [ 26.207186] el0_svc_handler+0x28/0x78 [ 26.208389] el0_svc+0x8/0xc [ 26.209408] Code: aa0203e0 d0006861 9133a021 97fcdc3c (d4210000) [ 26.211995] ---[ end trace bdb81cd43a21e50d ]---
[ 81.770626] ------------[ cut here ]------------ [ 81.825015] virt_to_phys used for non-linear address: 000000000d42f2c7 ( (null)) [ 81.827860] WARNING: CPU: 1 PID: 30171 at arch/arm64/mm/physaddr.c:15 __virt_to_phys+0x48/0x68 [ 81.831202] Modules linked in: [ 81.832212] CPU: 1 PID: 30171 Comm: syz-executor.20 Not tainted 5.1.0-rc7-00004-g7d30b2ea43d6 #19 [ 81.835616] Hardware name: linux,dummy-virt (DT) [ 81.836863] pstate: 60400005 (nZCv daif +PAN -UAO) [ 81.838727] pc : __virt_to_phys+0x48/0x68 [ 81.840572] lr : __virt_to_phys+0x48/0x68 [ 81.842264] sp : ffff80002cf67c70 [ 81.843858] x29: ffff80002cf67c70 x28: ffff800014358e18 [ 81.846463] x27: 0000000000000000 x26: 0000000020000080 [ 81.849148] x25: 0000000000000000 x24: ffff80001bb01f40 [ 81.851986] x23: ffff200011db06c8 x22: ffff2000127e3c60 [ 81.854351] x21: ffff800014358cc0 x20: ffff800014358d98 [ 81.856711] x19: 0000000000000000 x18: 0000000000000000 [ 81.859132] x17: 0000000000000000 x16: 0000000000000000 [ 81.861586] x15: 0000000000000000 x14: 0000000000000000 [ 81.863905] x13: 0000000000000000 x12: ffff1000037603e9 [ 81.866226] x11: 1ffff000037603e8 x10: 0000000000000980 [ 81.868776] x9 : ffff80002cf67840 x8 : ffff80001bb02920 [ 81.873272] x7 : ffff1000037603e9 x6 : ffff80001bb01f47 [ 81.875266] x5 : ffff1000037603e9 x4 : dfff200000000000 [ 81.876875] x3 : ffff200010087528 x2 : ffff1000059ecf58 [ 81.878751] x1 : 44c06ddd107d1f00 x0 : 0000000000000000 [ 81.880453] Call trace: [ 81.881164] __virt_to_phys+0x48/0x68 [ 81.882919] io_mem_free+0x18/0x110 [ 81.886585] io_ring_ctx_wait_and_kill+0x13c/0x1f0 [ 81.891212] io_uring_setup+0xa60/0xad0 [ 81.892881] __arm64_sys_io_uring_setup+0x2c/0x38 [ 81.894398] el0_svc_common.constprop.0+0xac/0x150 [ 81.896306] el0_svc_handler+0x34/0x88 [ 81.897744] el0_svc+0x8/0xc [ 81.898715] ---[ end trace b4a703802243cbba ]---
Fixes: 2b188cc1bb857a9d ("Add io_uring IO interface") Signed-off-by: Mark Rutland mark.rutland@arm.com Cc: Jens Axboe axboe@kernel.dk Cc: Alexander Viro viro@zeniv.linux.org.uk Cc: linux-block@vger.kernel.org Cc: linux-fsdevel@vger.kernel.org Cc: linux-kernel@vger.kernel.org Signed-off-by: Jens Axboe axboe@kernel.dk Signed-off-by: Zhihao Cheng chengzhihao1@huawei.com Signed-off-by: yangerkun yangerkun@huawei.com Reviewed-by: zhangyi (F) yi.zhang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- fs/io_uring.c | 15 +++++++-------- 1 file changed, 7 insertions(+), 8 deletions(-)
diff --git a/fs/io_uring.c b/fs/io_uring.c index 919789957544..6dd523adacab 100644 --- a/fs/io_uring.c +++ b/fs/io_uring.c @@ -2393,8 +2393,12 @@ static int io_account_mem(struct user_struct *user, unsigned long nr_pages)
static void io_mem_free(void *ptr) { - struct page *page = virt_to_head_page(ptr); + struct page *page; + + if (!ptr) + return;
+ page = virt_to_head_page(ptr); if (put_page_testzero(page)) free_compound_page(page); } @@ -2813,17 +2817,12 @@ static int io_allocate_scq_urings(struct io_ring_ctx *ctx, return -EOVERFLOW;
ctx->sq_sqes = io_mem_alloc(size); - if (!ctx->sq_sqes) { - io_mem_free(ctx->sq_ring); + if (!ctx->sq_sqes) return -ENOMEM; - }
cq_ring = io_mem_alloc(struct_size(cq_ring, cqes, p->cq_entries)); - if (!cq_ring) { - io_mem_free(ctx->sq_ring); - io_mem_free(ctx->sq_sqes); + if (!cq_ring) return -ENOMEM; - }
ctx->cq_ring = cq_ring; cq_ring->ring_mask = p->cq_entries - 1;
From: Jens Axboe axboe@kernel.dk
mainline inclusion from mainline-5.1 commit 817869d2519f0cb7be5b3482129dadc806dfb747 category: feature bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27 CVE: NA ---------------------------
If we don't end up actually calling submit in io_sq_wq_submit_work(), we still need to drop the submit reference to the request. If we don't, then we can leak the request. This can happen if we race with ring shutdown while flushing the workqueue for requests that require use of the mm_struct.
Fixes: e65ef56db494 ("io_uring: use regular request ref counts") Signed-off-by: Jens Axboe axboe@kernel.dk Signed-off-by: Zhihao Cheng chengzhihao1@huawei.com Signed-off-by: yangerkun yangerkun@huawei.com Reviewed-by: zhangyi (F) yi.zhang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- fs/io_uring.c | 7 ++++--- 1 file changed, 4 insertions(+), 3 deletions(-)
diff --git a/fs/io_uring.c b/fs/io_uring.c index 6dd523adacab..a6cd6b3ac4f6 100644 --- a/fs/io_uring.c +++ b/fs/io_uring.c @@ -1565,10 +1565,11 @@ static void io_sq_wq_submit_work(struct work_struct *work) break; cond_resched(); } while (1); - - /* drop submission reference */ - io_put_req(req); } + + /* drop submission reference */ + io_put_req(req); + if (ret) { io_cqring_add_event(ctx, sqe->user_data, ret, 0); io_put_req(req);
From: Mark Rutland mark.rutland@arm.com
mainline inclusion from mainline-5.1 commit d4ef647510b1200fe1c996ff1cbf5ac47eb930cc category: feature bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27 CVE: NA ---------------------------
In io_sqe_buffer_register() we allocate a number of arrays based on the iov_len from the user-provided iov. While we limit iov_len to SZ_1G, we can still attempt to allocate arrays exceeding MAX_ORDER.
On a 64-bit system with 4KiB pages, for an iov where iov_base = 0x10 and iov_len = SZ_1G, we'll calculate that nr_pages = 262145. When we then try to allocate a corresponding array of (16-byte) bio_vecs, that allocation requires 4194320 bytes, which is greater than 4MiB. This results in SLUB warning that we're trying to allocate greater than MAX_ORDER, and failing the allocation.
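For reference, the arithmetic: nr_pages = (0x10 + SZ_1G + PAGE_SIZE - 1) / PAGE_SIZE = 262145, where the 0x10 offset drags in one extra page beyond the 262144 needed for 1GiB. Then 262145 * sizeof(struct bio_vec) = 262145 * 16 = 4194320 bytes, just past the 4194304-byte (4MiB) limit a single kmalloc() can serve on this configuration.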
Avoid this by using kvmalloc() for allocations dependent on the user-provided iov_len. At the same time, fix a leak of imu->bvec when registration fails.
Full splat from before this patch:
WARNING: CPU: 1 PID: 2314 at mm/page_alloc.c:4595 __alloc_pages_nodemask+0x7ac/0x2938 mm/page_alloc.c:4595 Kernel panic - not syncing: panic_on_warn set ... CPU: 1 PID: 2314 Comm: syz-executor326 Not tainted 5.1.0-rc7-dirty #4 Hardware name: linux,dummy-virt (DT) Call trace: dump_backtrace+0x0/0x2f0 include/linux/compiler.h:193 show_stack+0x20/0x30 arch/arm64/kernel/traps.c:158 __dump_stack lib/dump_stack.c:77 [inline] dump_stack+0x110/0x190 lib/dump_stack.c:113 panic+0x384/0x68c kernel/panic.c:214 __warn+0x2bc/0x2c0 kernel/panic.c:571 report_bug+0x228/0x2d8 lib/bug.c:186 bug_handler+0xa0/0x1a0 arch/arm64/kernel/traps.c:956 call_break_hook arch/arm64/kernel/debug-monitors.c:301 [inline] brk_handler+0x1d4/0x388 arch/arm64/kernel/debug-monitors.c:316 do_debug_exception+0x1a0/0x468 arch/arm64/mm/fault.c:831 el1_dbg+0x18/0x8c __alloc_pages_nodemask+0x7ac/0x2938 mm/page_alloc.c:4595 alloc_pages_current+0x164/0x278 mm/mempolicy.c:2132 alloc_pages include/linux/gfp.h:509 [inline] kmalloc_order+0x20/0x50 mm/slab_common.c:1231 kmalloc_order_trace+0x30/0x2b0 mm/slab_common.c:1243 kmalloc_large include/linux/slab.h:480 [inline] __kmalloc+0x3dc/0x4f0 mm/slub.c:3791 kmalloc_array include/linux/slab.h:670 [inline] io_sqe_buffer_register fs/io_uring.c:2472 [inline] __io_uring_register fs/io_uring.c:2962 [inline] __do_sys_io_uring_register fs/io_uring.c:3008 [inline] __se_sys_io_uring_register fs/io_uring.c:2990 [inline] __arm64_sys_io_uring_register+0x9e0/0x1bc8 fs/io_uring.c:2990 __invoke_syscall arch/arm64/kernel/syscall.c:35 [inline] invoke_syscall arch/arm64/kernel/syscall.c:47 [inline] el0_svc_common.constprop.0+0x148/0x2e0 arch/arm64/kernel/syscall.c:83 el0_svc_handler+0xdc/0x100 arch/arm64/kernel/syscall.c:129 el0_svc+0x8/0xc arch/arm64/kernel/entry.S:948 SMP: stopping secondary CPUs Dumping ftrace buffer: (ftrace buffer empty) Kernel Offset: disabled CPU features: 0x002,23000438 Memory Limit: none Rebooting in 1 seconds..
Fixes: edafccee56ff3167 ("io_uring: add support for pre-mapped user IO buffers") Signed-off-by: Mark Rutland mark.rutland@arm.com Cc: Alexander Viro viro@zeniv.linux.org.uk Cc: Jens Axboe axboe@kernel.dk Cc: linux-fsdevel@vger.kernel.org Cc: linux-block@vger.kernel.org Cc: linux-kernel@vger.kernel.org Signed-off-by: Jens Axboe axboe@kernel.dk Signed-off-by: Zhihao Cheng chengzhihao1@huawei.com Signed-off-by: yangerkun yangerkun@huawei.com Reviewed-by: zhangyi (F) yi.zhang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- fs/io_uring.c | 17 +++++++++-------- 1 file changed, 9 insertions(+), 8 deletions(-)
diff --git a/fs/io_uring.c b/fs/io_uring.c index a6cd6b3ac4f6..ae1d4793013b 100644 --- a/fs/io_uring.c +++ b/fs/io_uring.c @@ -2440,7 +2440,7 @@ static int io_sqe_buffer_unregister(struct io_ring_ctx *ctx)
if (ctx->account_mem) io_unaccount_mem(ctx->user, imu->nr_bvecs); - kfree(imu->bvec); + kvfree(imu->bvec); imu->nr_bvecs = 0; }
@@ -2532,9 +2532,9 @@ static int io_sqe_buffer_register(struct io_ring_ctx *ctx, void __user *arg, if (!pages || nr_pages > got_pages) { kfree(vmas); kfree(pages); - pages = kmalloc_array(nr_pages, sizeof(struct page *), + pages = kvmalloc_array(nr_pages, sizeof(struct page *), GFP_KERNEL); - vmas = kmalloc_array(nr_pages, + vmas = kvmalloc_array(nr_pages, sizeof(struct vm_area_struct *), GFP_KERNEL); if (!pages || !vmas) { @@ -2546,7 +2546,7 @@ static int io_sqe_buffer_register(struct io_ring_ctx *ctx, void __user *arg, got_pages = nr_pages; }
- imu->bvec = kmalloc_array(nr_pages, sizeof(struct bio_vec), + imu->bvec = kvmalloc_array(nr_pages, sizeof(struct bio_vec), GFP_KERNEL); ret = -ENOMEM; if (!imu->bvec) { @@ -2585,6 +2585,7 @@ static int io_sqe_buffer_register(struct io_ring_ctx *ctx, void __user *arg, } if (ctx->account_mem) io_unaccount_mem(ctx->user, nr_pages); + kvfree(imu->bvec); goto err; }
@@ -2607,12 +2608,12 @@ static int io_sqe_buffer_register(struct io_ring_ctx *ctx, void __user *arg,
ctx->nr_user_bufs++; } - kfree(pages); - kfree(vmas); + kvfree(pages); + kvfree(vmas); return 0; err: - kfree(pages); - kfree(vmas); + kvfree(pages); + kvfree(vmas); io_sqe_buffer_unregister(ctx); return ret; }
From: Jens Axboe axboe@kernel.dk
mainline inclusion from mainline-5.2-rc1 commit 22f96b3808c12a218e9a3bce6e1bfbd74efbe374 category: feature bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27 CVE: NA ---------------------------
This just pulls out the ksys_sync_file_range() code to work on a struct file instead of an fd, so we can use it elsewhere.
Signed-off-by: Jens Axboe axboe@kernel.dk
Conflicts: fs/sync.c include/linux/fs.h [ Patch c553ea4fdf("fs/sync.c: sync_file_range(2) may use WB_SYNC_ALL writeback") applied earlier. ]
Signed-off-by: Zhihao Cheng chengzhihao1@huawei.com Signed-off-by: yangerkun yangerkun@huawei.com Reviewed-by: zhangyi (F) yi.zhang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- fs/sync.c | 141 ++++++++++++++++++++++++--------------------- include/linux/fs.h | 3 + 2 files changed, 77 insertions(+), 67 deletions(-)
diff --git a/fs/sync.c b/fs/sync.c index 9e8cd90e890f..4d1ff010bc5a 100644 --- a/fs/sync.c +++ b/fs/sync.c @@ -234,61 +234,10 @@ SYSCALL_DEFINE1(fdatasync, unsigned int, fd) return do_fsync(fd, 1); }
-/* - * ksys_sync_file_range() permits finely controlled syncing over a segment of - * a file in the range offset .. (offset+nbytes-1) inclusive. If nbytes is - * zero then ksys_sync_file_range() will operate from offset out to EOF. - * - * The flag bits are: - * - * SYNC_FILE_RANGE_WAIT_BEFORE: wait upon writeout of all pages in the range - * before performing the write. - * - * SYNC_FILE_RANGE_WRITE: initiate writeout of all those dirty pages in the - * range which are not presently under writeback. Note that this may block for - * significant periods due to exhaustion of disk request structures. - * - * SYNC_FILE_RANGE_WAIT_AFTER: wait upon writeout of all pages in the range - * after performing the write. - * - * Useful combinations of the flag bits are: - * - * SYNC_FILE_RANGE_WAIT_BEFORE|SYNC_FILE_RANGE_WRITE: ensures that all pages - * in the range which were dirty on entry to ksys_sync_file_range() are placed - * under writeout. This is a start-write-for-data-integrity operation. - * - * SYNC_FILE_RANGE_WRITE: start writeout of all dirty pages in the range which - * are not presently under writeout. This is an asynchronous flush-to-disk - * operation. Not suitable for data integrity operations. - * - * SYNC_FILE_RANGE_WAIT_BEFORE (or SYNC_FILE_RANGE_WAIT_AFTER): wait for - * completion of writeout of all pages in the range. This will be used after an - * earlier SYNC_FILE_RANGE_WAIT_BEFORE|SYNC_FILE_RANGE_WRITE operation to wait - * for that operation to complete and to return the result. - * - * SYNC_FILE_RANGE_WAIT_BEFORE|SYNC_FILE_RANGE_WRITE|SYNC_FILE_RANGE_WAIT_AFTER - * (a.k.a. SYNC_FILE_RANGE_WRITE_AND_WAIT): - * a traditional sync() operation. This is a write-for-data-integrity operation - * which will ensure that all pages in the range which were dirty on entry to - * ksys_sync_file_range() are written to disk. It should be noted that disk - * caches are not flushed by this call, so there are no guarantees here that the - * data will be available on disk after a crash. - * - * - * SYNC_FILE_RANGE_WAIT_BEFORE and SYNC_FILE_RANGE_WAIT_AFTER will detect any - * I/O errors or ENOSPC conditions and will return those to the caller, after - * clearing the EIO and ENOSPC flags in the address_space. - * - * It should be noted that none of these operations write out the file's - * metadata. So unless the application is strictly performing overwrites of - * already-instantiated disk blocks, there are no guarantees here that the data - * will be available after a crash. - */ -int ksys_sync_file_range(int fd, loff_t offset, loff_t nbytes, - unsigned int flags) +int sync_file_range(struct file *file, loff_t offset, loff_t nbytes, + unsigned int flags) { int ret; - struct fd f; struct address_space *mapping; loff_t endbyte; /* inclusive */ umode_t i_mode; @@ -328,23 +277,18 @@ int ksys_sync_file_range(int fd, loff_t offset, loff_t nbytes, else endbyte--; /* inclusive */
- ret = -EBADF; - f = fdget(fd); - if (!f.file) - goto out; - - i_mode = file_inode(f.file)->i_mode; + i_mode = file_inode(file)->i_mode; ret = -ESPIPE; if (!S_ISREG(i_mode) && !S_ISBLK(i_mode) && !S_ISDIR(i_mode) && !S_ISLNK(i_mode)) - goto out_put; + goto out;
- mapping = f.file->f_mapping; + mapping = file->f_mapping; ret = 0; if (flags & SYNC_FILE_RANGE_WAIT_BEFORE) { - ret = file_fdatawait_range(f.file, offset, endbyte); + ret = file_fdatawait_range(file, offset, endbyte); if (ret < 0) - goto out_put; + goto out; }
if (flags & SYNC_FILE_RANGE_WRITE) { @@ -357,18 +301,81 @@ int ksys_sync_file_range(int fd, loff_t offset, loff_t nbytes, ret = __filemap_fdatawrite_range(mapping, offset, endbyte, sync_mode); if (ret < 0) - goto out_put; + goto out; }
if (flags & SYNC_FILE_RANGE_WAIT_AFTER) - ret = file_fdatawait_range(f.file, offset, endbyte); + ret = file_fdatawait_range(file, offset, endbyte);
-out_put: - fdput(f); out: return ret; }
+/* + * ksys_sync_file_range() permits finely controlled syncing over a segment of + * a file in the range offset .. (offset+nbytes-1) inclusive. If nbytes is + * zero then ksys_sync_file_range() will operate from offset out to EOF. + * + * The flag bits are: + * + * SYNC_FILE_RANGE_WAIT_BEFORE: wait upon writeout of all pages in the range + * before performing the write. + * + * SYNC_FILE_RANGE_WRITE: initiate writeout of all those dirty pages in the + * range which are not presently under writeback. Note that this may block for + * significant periods due to exhaustion of disk request structures. + * + * SYNC_FILE_RANGE_WAIT_AFTER: wait upon writeout of all pages in the range + * after performing the write. + * + * Useful combinations of the flag bits are: + * + * SYNC_FILE_RANGE_WAIT_BEFORE|SYNC_FILE_RANGE_WRITE: ensures that all pages + * in the range which were dirty on entry to ksys_sync_file_range() are placed + * under writeout. This is a start-write-for-data-integrity operation. + * + * SYNC_FILE_RANGE_WRITE: start writeout of all dirty pages in the range which + * are not presently under writeout. This is an asynchronous flush-to-disk + * operation. Not suitable for data integrity operations. + * + * SYNC_FILE_RANGE_WAIT_BEFORE (or SYNC_FILE_RANGE_WAIT_AFTER): wait for + * completion of writeout of all pages in the range. This will be used after an + * earlier SYNC_FILE_RANGE_WAIT_BEFORE|SYNC_FILE_RANGE_WRITE operation to wait + * for that operation to complete and to return the result. + * + * SYNC_FILE_RANGE_WAIT_BEFORE|SYNC_FILE_RANGE_WRITE|SYNC_FILE_RANGE_WAIT_AFTER + * (a.k.a. SYNC_FILE_RANGE_WRITE_AND_WAIT): + * a traditional sync() operation. This is a write-for-data-integrity operation + * which will ensure that all pages in the range which were dirty on entry to + * ksys_sync_file_range() are written to disk. It should be noted that disk + * caches are not flushed by this call, so there are no guarantees here that the + * data will be available on disk after a crash. + * + * + * SYNC_FILE_RANGE_WAIT_BEFORE and SYNC_FILE_RANGE_WAIT_AFTER will detect any + * I/O errors or ENOSPC conditions and will return those to the caller, after + * clearing the EIO and ENOSPC flags in the address_space. + * + * It should be noted that none of these operations write out the file's + * metadata. So unless the application is strictly performing overwrites of + * already-instantiated disk blocks, there are no guarantees here that the data + * will be available after a crash. + */ +int ksys_sync_file_range(int fd, loff_t offset, loff_t nbytes, + unsigned int flags) +{ + int ret; + struct fd f; + + ret = -EBADF; + f = fdget(fd); + if (f.file) + ret = sync_file_range(f.file, offset, nbytes, flags); + + fdput(f); + return ret; +} + SYSCALL_DEFINE4(sync_file_range, int, fd, loff_t, offset, loff_t, nbytes, unsigned int, flags) { diff --git a/include/linux/fs.h b/include/linux/fs.h index db7dd25ce645..36d828c741d5 100644 --- a/include/linux/fs.h +++ b/include/linux/fs.h @@ -2804,6 +2804,9 @@ extern int vfs_fsync_range(struct file *file, loff_t start, loff_t end, int datasync); extern int vfs_fsync(struct file *file, int datasync);
+extern int sync_file_range(struct file *file, loff_t offset, loff_t nbytes, + unsigned int flags); + /* * Sync the bytes written if this was a synchronous write. Expect ki_pos * to already be updated for the write, and will return either the amount
From: Jens Axboe axboe@kernel.dk
mainline inclusion from mainline-5.2-rc1 commit de0617e467171ba44c73efd1ba63f101b164a035 category: feature bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27 CVE: NA ---------------------------
There are no ordering constraints between the submission and completion side of io_uring. But sometimes that would be useful to have. One common example is doing an fsync, for instance, and have it ordered with previous writes. Without support for that, the application must do this tracking itself.
This adds a general SQE flag, IOSQE_IO_DRAIN. If a command is marked with this flag, then it will not be issued before previous commands have completed, and subsequent commands submitted after the drain will not be issued before the drain is started. If there are no pending commands, setting this flag does not change how the command is issued.
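A sketch of the intended use, ordering an fsync behind all previously submitted writes (raw-SQE style; get_sqe() is a hypothetical helper returning the next free submission slot):

    #include <linux/io_uring.h>
    #include <string.h>

    extern struct io_uring_sqe *get_sqe(void);  /* hypothetical slot helper */

    static void queue_ordered_fsync(int fd)
    {
            struct io_uring_sqe *sqe = get_sqe();

            memset(sqe, 0, sizeof(*sqe));
            sqe->opcode = IORING_OP_FSYNC;
            sqe->fd     = fd;
            /*
             * Not issued until every previously submitted SQE has
             * completed; SQEs queued after this one wait until the drain
             * has started.
             */
            sqe->flags  = IOSQE_IO_DRAIN;
    }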
Signed-off-by: Jens Axboe axboe@kernel.dk Signed-off-by: Zhihao Cheng chengzhihao1@huawei.com Signed-off-by: yangerkun yangerkun@huawei.com Reviewed-by: zhangyi (F) yi.zhang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- fs/io_uring.c | 91 +++++++++++++++++++++++++++++++++-- include/uapi/linux/io_uring.h | 1 + 2 files changed, 89 insertions(+), 3 deletions(-)
diff --git a/fs/io_uring.c b/fs/io_uring.c index ae1d4793013b..e10adb340c26 100644 --- a/fs/io_uring.c +++ b/fs/io_uring.c @@ -222,6 +222,8 @@ struct io_ring_ctx { unsigned sq_mask; unsigned sq_thread_idle; struct io_uring_sqe *sq_sqes; + + struct list_head defer_list; } ____cacheline_aligned_in_smp;
/* IO offload */ @@ -327,8 +329,11 @@ struct io_kiocb { #define REQ_F_FIXED_FILE 4 /* ctx owns file */ #define REQ_F_SEQ_PREV 8 /* sequential with previous */ #define REQ_F_PREPPED 16 /* prep already done */ +#define REQ_F_IO_DRAIN 32 /* drain existing IO first */ +#define REQ_F_IO_DRAINED 64 /* drain done */ u64 user_data; - u64 error; + u32 error; + u32 sequence;
struct work_struct work; }; @@ -356,6 +361,8 @@ struct io_submit_state { unsigned int ios_left; };
+static void io_sq_wq_submit_work(struct work_struct *work); + static struct kmem_cache *req_cachep;
static const struct file_operations io_uring_fops; @@ -407,10 +414,36 @@ static struct io_ring_ctx *io_ring_ctx_alloc(struct io_uring_params *p) spin_lock_init(&ctx->completion_lock); INIT_LIST_HEAD(&ctx->poll_list); INIT_LIST_HEAD(&ctx->cancel_list); + INIT_LIST_HEAD(&ctx->defer_list); return ctx; }
-static void io_commit_cqring(struct io_ring_ctx *ctx) +static inline bool io_sequence_defer(struct io_ring_ctx *ctx, + struct io_kiocb *req) +{ + if ((req->flags & (REQ_F_IO_DRAIN|REQ_F_IO_DRAINED)) != REQ_F_IO_DRAIN) + return false; + + return req->sequence > ctx->cached_cq_tail + ctx->sq_ring->dropped; +} + +static struct io_kiocb *io_get_deferred_req(struct io_ring_ctx *ctx) +{ + struct io_kiocb *req; + + if (list_empty(&ctx->defer_list)) + return NULL; + + req = list_first_entry(&ctx->defer_list, struct io_kiocb, list); + if (!io_sequence_defer(ctx, req)) { + list_del_init(&req->list); + return req; + } + + return NULL; +} + +static void __io_commit_cqring(struct io_ring_ctx *ctx) { struct io_cq_ring *ring = ctx->cq_ring;
@@ -425,6 +458,18 @@ static void io_commit_cqring(struct io_ring_ctx *ctx) } }
+static void io_commit_cqring(struct io_ring_ctx *ctx) +{ + struct io_kiocb *req; + + __io_commit_cqring(ctx); + + while ((req = io_get_deferred_req(ctx)) != NULL) { + req->flags |= REQ_F_IO_DRAINED; + queue_work(ctx->sqo_wq, &req->work); + } +} + static struct io_uring_cqe *io_get_cqring(struct io_ring_ctx *ctx) { struct io_cq_ring *ring = ctx->cq_ring; @@ -1434,6 +1479,34 @@ static int io_poll_add(struct io_kiocb *req, const struct io_uring_sqe *sqe) return ipt.error; }
+static int io_req_defer(struct io_ring_ctx *ctx, struct io_kiocb *req, + const struct io_uring_sqe *sqe) +{ + struct io_uring_sqe *sqe_copy; + + if (!io_sequence_defer(ctx, req) && list_empty(&ctx->defer_list)) + return 0; + + sqe_copy = kmalloc(sizeof(*sqe_copy), GFP_KERNEL); + if (!sqe_copy) + return -EAGAIN; + + spin_lock_irq(&ctx->completion_lock); + if (!io_sequence_defer(ctx, req) && list_empty(&ctx->defer_list)) { + spin_unlock_irq(&ctx->completion_lock); + kfree(sqe_copy); + return 0; + } + + memcpy(sqe_copy, sqe, sizeof(*sqe_copy)); + req->submit.sqe = sqe_copy; + + INIT_WORK(&req->work, io_sq_wq_submit_work); + list_add_tail(&req->list, &ctx->defer_list); + spin_unlock_irq(&ctx->completion_lock); + return -EIOCBQUEUED; +} + static int __io_submit_sqe(struct io_ring_ctx *ctx, struct io_kiocb *req, const struct sqe_submit *s, bool force_nonblock) { @@ -1681,6 +1754,11 @@ static int io_req_set_file(struct io_ring_ctx *ctx, const struct sqe_submit *s, flags = READ_ONCE(s->sqe->flags); fd = READ_ONCE(s->sqe->fd);
+ if (flags & IOSQE_IO_DRAIN) { + req->flags |= REQ_F_IO_DRAIN; + req->sequence = ctx->cached_sq_head - 1; + } + if (!io_op_needs_file(s->sqe)) { req->file = NULL; return 0; @@ -1710,7 +1788,7 @@ static int io_submit_sqe(struct io_ring_ctx *ctx, struct sqe_submit *s, int ret;
/* enforce forwards compatibility on users */ - if (unlikely(s->sqe->flags & ~IOSQE_FIXED_FILE)) + if (unlikely(s->sqe->flags & ~(IOSQE_FIXED_FILE | IOSQE_IO_DRAIN))) return -EINVAL;
req = io_get_req(ctx, state); @@ -1721,6 +1799,13 @@ static int io_submit_sqe(struct io_ring_ctx *ctx, struct sqe_submit *s, if (unlikely(ret)) goto out;
+ ret = io_req_defer(ctx, req, s->sqe); + if (ret) { + if (ret == -EIOCBQUEUED) + ret = 0; + return ret; + } + ret = __io_submit_sqe(ctx, req, s, true); if (ret == -EAGAIN && !(req->flags & REQ_F_NOWAIT)) { struct io_uring_sqe *sqe_copy; diff --git a/include/uapi/linux/io_uring.h b/include/uapi/linux/io_uring.h index e23408692118..a7a6384d0c70 100644 --- a/include/uapi/linux/io_uring.h +++ b/include/uapi/linux/io_uring.h @@ -38,6 +38,7 @@ struct io_uring_sqe { * sqe->flags */ #define IOSQE_FIXED_FILE (1U << 0) /* use fixed fileset */ +#define IOSQE_IO_DRAIN (1U << 1) /* issue after inflight IO */
/* * io_uring_setup() flags
From: Jens Axboe axboe@kernel.dk
mainline inclusion from mainline-5.2-rc1 commit 5d17b4a4b7fa172b205be8a05051ae705d1dc3bb category: feature bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27 CVE: NA ---------------------------
This behaves just like sync_file_range(2) does.
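The SQE fields map one-to-one onto sync_file_range(2)'s arguments; a hypothetical prep helper:

    #include <linux/io_uring.h>
    #include <linux/fs.h>           /* SYNC_FILE_RANGE_* flags */
    #include <string.h>

    static void prep_sync_file_range(struct io_uring_sqe *sqe, int fd,
                                     __u64 offset, __u32 nbytes, __u32 flags)
    {
            memset(sqe, 0, sizeof(*sqe));
            sqe->opcode           = IORING_OP_SYNC_FILE_RANGE;
            sqe->fd               = fd;
            sqe->off              = offset;  /* byte offset into the file */
            sqe->len              = nbytes;  /* 0 means "from offset to EOF" */
            sqe->sync_range_flags = flags;   /* e.g. SYNC_FILE_RANGE_WRITE */
    }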
Signed-off-by: Jens Axboe axboe@kernel.dk Signed-off-by: Zhihao Cheng chengzhihao1@huawei.com Signed-off-by: yangerkun yangerkun@huawei.com Reviewed-by: zhangyi (F) yi.zhang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- fs/io_uring.c | 51 +++++++++++++++++++++++++++++++++++ include/uapi/linux/io_uring.h | 2 ++ 2 files changed, 53 insertions(+)
diff --git a/fs/io_uring.c b/fs/io_uring.c index e10adb340c26..b61e9838d34a 100644 --- a/fs/io_uring.c +++ b/fs/io_uring.c @@ -1264,6 +1264,54 @@ static int io_fsync(struct io_kiocb *req, const struct io_uring_sqe *sqe, return 0; }
+static int io_prep_sfr(struct io_kiocb *req, const struct io_uring_sqe *sqe) +{ + struct io_ring_ctx *ctx = req->ctx; + int ret = 0; + + if (!req->file) + return -EBADF; + /* Prep already done (EAGAIN retry) */ + if (req->flags & REQ_F_PREPPED) + return 0; + + if (unlikely(ctx->flags & IORING_SETUP_IOPOLL)) + return -EINVAL; + if (unlikely(sqe->addr || sqe->ioprio || sqe->buf_index)) + return -EINVAL; + + req->flags |= REQ_F_PREPPED; + return ret; +} + +static int io_sync_file_range(struct io_kiocb *req, + const struct io_uring_sqe *sqe, + bool force_nonblock) +{ + loff_t sqe_off; + loff_t sqe_len; + unsigned flags; + int ret; + + ret = io_prep_sfr(req, sqe); + if (ret) + return ret; + + /* sync_file_range always requires a blocking context */ + if (force_nonblock) + return -EAGAIN; + + sqe_off = READ_ONCE(sqe->off); + sqe_len = READ_ONCE(sqe->len); + flags = READ_ONCE(sqe->sync_range_flags); + + ret = sync_file_range(req->rw.ki_filp, sqe_off, sqe_len, flags); + + io_cqring_add_event(req->ctx, sqe->user_data, ret, 0); + io_put_req(req); + return 0; +} + static void io_poll_remove_one(struct io_kiocb *req) { struct io_poll_iocb *poll = &req->poll; @@ -1546,6 +1594,9 @@ static int __io_submit_sqe(struct io_ring_ctx *ctx, struct io_kiocb *req, case IORING_OP_POLL_REMOVE: ret = io_poll_remove(req, s->sqe); break; + case IORING_OP_SYNC_FILE_RANGE: + ret = io_sync_file_range(req, s->sqe, force_nonblock); + break; default: ret = -EINVAL; break; diff --git a/include/uapi/linux/io_uring.h b/include/uapi/linux/io_uring.h index a7a6384d0c70..e707a17c6908 100644 --- a/include/uapi/linux/io_uring.h +++ b/include/uapi/linux/io_uring.h @@ -26,6 +26,7 @@ struct io_uring_sqe { __kernel_rwf_t rw_flags; __u32 fsync_flags; __u16 poll_events; + __u32 sync_range_flags; }; __u64 user_data; /* data to be passed back at completion time */ union { @@ -55,6 +56,7 @@ struct io_uring_sqe { #define IORING_OP_WRITE_FIXED 5 #define IORING_OP_POLL_ADD 6 #define IORING_OP_POLL_REMOVE 7 +#define IORING_OP_SYNC_FILE_RANGE 8
/* * sqe->fsync_flags
From: Jens Axboe axboe@kernel.dk
mainline inclusion from mainline-5.2-rc1 commit 9b402849e80c85eee10bbd341aab3f1a0f942d4f category: feature bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27 CVE: NA ---------------------------
Allow registration of an eventfd, which will trigger an event every time a completion event happens for this io_uring instance.
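Usage is a plain io_uring_register(2) call followed by blocking reads on the eventfd; a sketch with error handling omitted (__NR_io_uring_register comes from the uapi headers):

    #include <unistd.h>
    #include <stdint.h>
    #include <sys/syscall.h>
    #include <sys/eventfd.h>
    #include <linux/io_uring.h>

    static int watch_completions(int ring_fd)
    {
            int efd = eventfd(0, 0);
            uint64_t n;

            /* every posted CQE now signals the eventfd once */
            syscall(__NR_io_uring_register, ring_fd,
                    IORING_REGISTER_EVENTFD, &efd, 1);

            read(efd, &n, sizeof(n));  /* blocks until a CQE is posted */
            return efd;
    }

Note that the eventfd counter accumulates, so n is the number of completions signalled since the last read, not a per-CQE payload.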
Signed-off-by: Jens Axboe axboe@kernel.dk Signed-off-by: Zhihao Cheng chengzhihao1@huawei.com Signed-off-by: yangerkun yangerkun@huawei.com Reviewed-by: zhangyi (F) yi.zhang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- fs/io_uring.c | 48 +++++++++++++++++++++++++++++++++++ include/uapi/linux/io_uring.h | 2 ++ 2 files changed, 50 insertions(+)
diff --git a/fs/io_uring.c b/fs/io_uring.c index b61e9838d34a..723575f8a8c4 100644 --- a/fs/io_uring.c +++ b/fs/io_uring.c @@ -241,6 +241,7 @@ struct io_ring_ctx { unsigned cq_mask; struct wait_queue_head cq_wait; struct fasync_struct *cq_fasync; + struct eventfd_ctx *cq_ev_fd; } ____cacheline_aligned_in_smp;
/* @@ -516,6 +517,8 @@ static void io_cqring_ev_posted(struct io_ring_ctx *ctx) wake_up(&ctx->wait); if (waitqueue_active(&ctx->sqo_wait)) wake_up(&ctx->sqo_wait); + if (ctx->cq_ev_fd) + eventfd_signal(ctx->cq_ev_fd, 1); }
static void io_cqring_add_event(struct io_ring_ctx *ctx, u64 user_data, @@ -2754,6 +2757,38 @@ static int io_sqe_buffer_register(struct io_ring_ctx *ctx, void __user *arg, return ret; }
+static int io_eventfd_register(struct io_ring_ctx *ctx, void __user *arg) +{ + __s32 __user *fds = arg; + int fd; + + if (ctx->cq_ev_fd) + return -EBUSY; + + if (copy_from_user(&fd, fds, sizeof(*fds))) + return -EFAULT; + + ctx->cq_ev_fd = eventfd_ctx_fdget(fd); + if (IS_ERR(ctx->cq_ev_fd)) { + int ret = PTR_ERR(ctx->cq_ev_fd); + ctx->cq_ev_fd = NULL; + return ret; + } + + return 0; +} + +static int io_eventfd_unregister(struct io_ring_ctx *ctx) +{ + if (ctx->cq_ev_fd) { + eventfd_ctx_put(ctx->cq_ev_fd); + ctx->cq_ev_fd = NULL; + return 0; + } + + return -ENXIO; +} + static void io_ring_ctx_free(struct io_ring_ctx *ctx) { io_finish_async(ctx); @@ -2763,6 +2798,7 @@ static void io_ring_ctx_free(struct io_ring_ctx *ctx) io_iopoll_reap_events(ctx); io_sqe_buffer_unregister(ctx); io_sqe_files_unregister(ctx); + io_eventfd_unregister(ctx);
#if defined(CONFIG_UNIX) if (ctx->ring_sock) @@ -3176,6 +3212,18 @@ static int __io_uring_register(struct io_ring_ctx *ctx, unsigned opcode, break; ret = io_sqe_files_unregister(ctx); break; + case IORING_REGISTER_EVENTFD: + ret = -EINVAL; + if (nr_args != 1) + break; + ret = io_eventfd_register(ctx, arg); + break; + case IORING_UNREGISTER_EVENTFD: + ret = -EINVAL; + if (arg || nr_args) + break; + ret = io_eventfd_unregister(ctx); + break; default: ret = -EINVAL; break; diff --git a/include/uapi/linux/io_uring.h b/include/uapi/linux/io_uring.h index e707a17c6908..a0c460025036 100644 --- a/include/uapi/linux/io_uring.h +++ b/include/uapi/linux/io_uring.h @@ -136,5 +136,7 @@ struct io_uring_params { #define IORING_UNREGISTER_BUFFERS 1 #define IORING_REGISTER_FILES 2 #define IORING_UNREGISTER_FILES 3 +#define IORING_REGISTER_EVENTFD 4 +#define IORING_UNREGISTER_EVENTFD 5
#endif
From: Stefan Bühler source@stbuehler.de
mainline inclusion from mainline-5.2-rc1 commit 5dcf877fb13f3c6a8ba0777ef766c4af32df725d category: feature bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27 CVE: NA ---------------------------
No need to set req->error in io_poll_add(); io_poll_complete() doesn't use it to set the result in the CQE.
Signed-off-by: Stefan Bühler source@stbuehler.de Signed-off-by: Jens Axboe axboe@kernel.dk Signed-off-by: Zhihao Cheng chengzhihao1@huawei.com Signed-off-by: yangerkun yangerkun@huawei.com Reviewed-by: zhangyi (F) yi.zhang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- fs/io_uring.c | 3 +-- 1 file changed, 1 insertion(+), 2 deletions(-)
diff --git a/fs/io_uring.c b/fs/io_uring.c index 723575f8a8c4..39e89e13addd 100644 --- a/fs/io_uring.c +++ b/fs/io_uring.c @@ -333,7 +333,7 @@ struct io_kiocb { #define REQ_F_IO_DRAIN 32 /* drain existing IO first */ #define REQ_F_IO_DRAINED 64 /* drain done */ u64 user_data; - u32 error; + u32 error; /* iopoll result from callback */ u32 sequence;
struct work_struct work; @@ -1517,7 +1517,6 @@ static int io_poll_add(struct io_kiocb *req, const struct io_uring_sqe *sqe) spin_unlock(&poll->head->lock); } if (mask) { /* no async, we'd stolen it */ - req->error = mangle_poll(mask); ipt.error = 0; io_poll_complete(ctx, req, mask); }
From: Colin Ian King colin.king@canonical.com
mainline inclusion from mainline-5.2-rc1 commit efeb862bd5bc001636e690debf6f9fbba98e5bfd category: feature bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27 CVE: NA ---------------------------
Currently, a variable ret is declared in a while-loop code block and shadows another variable ret in the outer scope. When an error occurs in the while-loop, the error value is stored only in the inner ret and never reaches the outer one, so the error check after the loop tests the wrong ret variable, whose value is effectively constant at that point, and the error path is not taken correctly.
Fix this by removing the declaration of the inner while-loop variable ret so that shadowing does not occur.
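The bug pattern, reduced to a minimal standalone illustration (not the io_uring code itself; do_one_chunk() and undo_partial_work() are placeholders):

	int ret = 0;

	while (left) {
		int ret;		/* shadows the outer 'ret' */

		ret = do_one_chunk();	/* an error lands in the inner 'ret'... */
		if (ret)
			break;
		left--;
	}
	if (ret)			/* ...but this still sees the outer 'ret' */
		undo_partial_work();	/* so the error path misbehaves */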
Addresses-Coverity: ("'Constant' variable guards dead code") Fixes: 6b06314c47e1 ("io_uring: add file set registration") Signed-off-by: Colin Ian King colin.king@canonical.com Signed-off-by: Jens Axboe axboe@kernel.dk Signed-off-by: Zhihao Cheng chengzhihao1@huawei.com Signed-off-by: yangerkun yangerkun@huawei.com Reviewed-by: zhangyi (F) yi.zhang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- fs/io_uring.c | 1 - 1 file changed, 1 deletion(-)
diff --git a/fs/io_uring.c b/fs/io_uring.c index 39e89e13addd..27d0e4ed6f21 100644 --- a/fs/io_uring.c +++ b/fs/io_uring.c @@ -2360,7 +2360,6 @@ static int io_sqe_files_scm(struct io_ring_ctx *ctx) left = ctx->nr_user_files; while (left) { unsigned this_files = min_t(unsigned, left, SCM_MAX_FD); - int ret;
ret = __io_sqe_files_scm(ctx, this_files, total); if (ret)
From: Shenghui Wang shhuiw@foxmail.com
mainline inclusion from mainline-5.2-rc1 commit 7889f44dd9cee15aff1c3f7daf81ca4dfed48fc7 category: feature bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27 CVE: NA ---------------------------
This issue was found by running the liburing/test/io_uring_setup test.

When the test runs, the testcase "attempt to bind to invalid cpu" does not pass, with messages like:

io_uring_setup(1, 0xbfc2f7c8),
    flags: IORING_SETUP_SQPOLL|IORING_SETUP_SQ_AFF,
    resv: 0x00000000 0x00000000 0x00000000 0x00000000 0x00000000,
    sq_thread_cpu: 2
expected -1, got 3
FAIL

On my system, there is:

CPU(s) possible : 0-3
CPU(s) online   : 0-1
CPU(s) offline  : 2-3
CPU(s) present  : 0-1

sq_thread_cpu 2 is offline on my system, so the bind should fail, but cpu_possible() passes the check. We shouldn't be able to bind to an offline cpu; use cpu_online() to do the check instead.

After the change, the testcase runs as expected: -EINVAL is returned when the requested CPU is offline.
Reviewed-by: Jeff Moyer jmoyer@redhat.com Signed-off-by: Shenghui Wang shhuiw@foxmail.com Signed-off-by: Jens Axboe axboe@kernel.dk Signed-off-by: Zhihao Cheng chengzhihao1@huawei.com Signed-off-by: yangerkun yangerkun@huawei.com Reviewed-by: zhangyi (F) yi.zhang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- fs/io_uring.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/fs/io_uring.c b/fs/io_uring.c index 27d0e4ed6f21..1ade02bd1192 100644 --- a/fs/io_uring.c +++ b/fs/io_uring.c @@ -2468,7 +2468,7 @@ static int io_sq_offload_start(struct io_ring_ctx *ctx, nr_cpu_ids);
ret = -EINVAL; - if (!cpu_possible(cpu)) + if (!cpu_online(cpu)) goto err;
ctx->sqo_thread = kthread_create_on_cpu(io_sq_thread,
From: Stefan Bühler source@stbuehler.de
mainline inclusion from mainline-5.2-rc1 commit e2033e33cb3821c26d4f9e70677910827d3b7885 category: feature bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27 CVE: NA ---------------------------
When punting to workers the SQE gets copied after the initial try. There is a race condition between reading SQE data for the initial try and copying it for punting it to the workers.
For example, io_rw_done() calls kiocb->ki_complete even if the request was prepared for IORING_OP_FSYNC (where ki_complete would be NULL).

The easiest solution for now is to always prepare again in the worker.

req->file is safe to prepare, though, as long as it is checked before use.
Signed-off-by: Stefan Bühler source@stbuehler.de Signed-off-by: Jens Axboe axboe@kernel.dk Signed-off-by: Zhihao Cheng chengzhihao1@huawei.com Signed-off-by: yangerkun yangerkun@huawei.com Reviewed-by: zhangyi (F) yi.zhang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- fs/io_uring.c | 17 ++--------------- 1 file changed, 2 insertions(+), 15 deletions(-)
diff --git a/fs/io_uring.c b/fs/io_uring.c index 1ade02bd1192..52226adb7739 100644 --- a/fs/io_uring.c +++ b/fs/io_uring.c @@ -329,9 +329,8 @@ struct io_kiocb { #define REQ_F_IOPOLL_COMPLETED 2 /* polled IO has completed */ #define REQ_F_FIXED_FILE 4 /* ctx owns file */ #define REQ_F_SEQ_PREV 8 /* sequential with previous */ -#define REQ_F_PREPPED 16 /* prep already done */ -#define REQ_F_IO_DRAIN 32 /* drain existing IO first */ -#define REQ_F_IO_DRAINED 64 /* drain done */ +#define REQ_F_IO_DRAIN 16 /* drain existing IO first */ +#define REQ_F_IO_DRAINED 32 /* drain done */ u64 user_data; u32 error; /* iopoll result from callback */ u32 sequence; @@ -896,9 +895,6 @@ static int io_prep_rw(struct io_kiocb *req, const struct sqe_submit *s,
if (!req->file) return -EBADF; - /* For -EAGAIN retry, everything is already prepped */ - if (req->flags & REQ_F_PREPPED) - return 0;
if (force_nonblock && !io_file_supports_async(req->file)) force_nonblock = false; @@ -941,7 +937,6 @@ static int io_prep_rw(struct io_kiocb *req, const struct sqe_submit *s, return -EINVAL; kiocb->ki_complete = io_complete_rw; } - req->flags |= REQ_F_PREPPED; return 0; }
@@ -1224,16 +1219,12 @@ static int io_prep_fsync(struct io_kiocb *req, const struct io_uring_sqe *sqe)
if (!req->file) return -EBADF; - /* Prep already done (EAGAIN retry) */ - if (req->flags & REQ_F_PREPPED) - return 0;
if (unlikely(ctx->flags & IORING_SETUP_IOPOLL)) return -EINVAL; if (unlikely(sqe->addr || sqe->ioprio || sqe->buf_index)) return -EINVAL;
- req->flags |= REQ_F_PREPPED; return 0; }
@@ -1274,16 +1265,12 @@ static int io_prep_sfr(struct io_kiocb *req, const struct io_uring_sqe *sqe)
if (!req->file) return -EBADF; - /* Prep already done (EAGAIN retry) */ - if (req->flags & REQ_F_PREPPED) - return 0;
if (unlikely(ctx->flags & IORING_SETUP_IOPOLL)) return -EINVAL; if (unlikely(sqe->addr || sqe->ioprio || sqe->buf_index)) return -EINVAL;
- req->flags |= REQ_F_PREPPED; return ret; }
From: Jens Axboe axboe@kernel.dk
mainline inclusion from mainline-5.2-rc1 commit 44a9bd18a0f06bba19d155aeaa11e2edce898293 category: feature bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27 CVE: NA ---------------------------
The test case we have is rightfully failing with the current kernel:
io_uring_setup(1, 0x7ffe2cafebe0),
    flags: IORING_SETUP_SQPOLL|IORING_SETUP_SQ_AFF,
    resv: 0x00000000 0x00000000 0x00000000 0x00000000 0x00000000,
    sq_thread_cpu: 4
expected -1, got 3
This is in a vm, and CPU3 is the last valid one, hence asking for 4 should fail the setup with -EINVAL, not succeed. The problem is that we're using array_index_nospec() with nr_cpu_ids as the bound: an out-of-range value wraps to 0, so we end up using CPU0 instead of CPU4. This makes the setup succeed where it should be failing.
We don't need to use array_index_nospec() as we're not indexing any array with this. Instead just compare with nr_cpu_ids directly. This is fine as we're checking with cpu_online() afterwards.
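For reference, array_index_nospec(index, size) does not reject an out-of-range index; it masks it to 0 so that speculative execution cannot read out of bounds. A sketch of what the old code therefore computed:

	/* old ordering, sketched: nr_cpu_ids == 4, sq_thread_cpu == 4 */
	int cpu = array_index_nospec(p->sq_thread_cpu, nr_cpu_ids);
	/* cpu == 0 here, so the later cpu_online(cpu) check passes
	 * and the setup succeeds instead of failing with -EINVAL */

The fix below checks 'cpu >= nr_cpu_ids' explicitly before the cpu_online() test.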
Signed-off-by: Jens Axboe axboe@kernel.dk Signed-off-by: Zhihao Cheng chengzhihao1@huawei.com Signed-off-by: yangerkun yangerkun@huawei.com Reviewed-by: zhangyi (F) yi.zhang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- fs/io_uring.c | 5 +++-- 1 file changed, 3 insertions(+), 2 deletions(-)
diff --git a/fs/io_uring.c b/fs/io_uring.c index 52226adb7739..5f5c37d2764b 100644 --- a/fs/io_uring.c +++ b/fs/io_uring.c @@ -2451,10 +2451,11 @@ static int io_sq_offload_start(struct io_ring_ctx *ctx, ctx->sq_thread_idle = HZ;
if (p->flags & IORING_SETUP_SQ_AFF) { - int cpu = array_index_nospec(p->sq_thread_cpu, - nr_cpu_ids); + int cpu = p->sq_thread_cpu;
ret = -EINVAL; + if (cpu >= nr_cpu_ids) + goto err; if (!cpu_online(cpu)) goto err;
From: Jens Axboe axboe@kernel.dk
mainline inclusion from mainline-5.2-rc1 commit c71ffb673cd9bb2ddc575ede9055f265b2535690 category: feature bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27 CVE: NA ---------------------------
We always pass in 0 for the cqe flags argument, since the support for "this read hit page cache" hint was dropped.
Signed-off-by: Jens Axboe axboe@kernel.dk Signed-off-by: Zhihao Cheng chengzhihao1@huawei.com Signed-off-by: yangerkun yangerkun@huawei.com Reviewed-by: zhangyi (F) yi.zhang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- fs/io_uring.c | 28 ++++++++++++++-------------- 1 file changed, 14 insertions(+), 14 deletions(-)
diff --git a/fs/io_uring.c b/fs/io_uring.c index 5f5c37d2764b..6c5dc75f62c2 100644 --- a/fs/io_uring.c +++ b/fs/io_uring.c @@ -489,7 +489,7 @@ static struct io_uring_cqe *io_get_cqring(struct io_ring_ctx *ctx) }
static void io_cqring_fill_event(struct io_ring_ctx *ctx, u64 ki_user_data, - long res, unsigned ev_flags) + long res) { struct io_uring_cqe *cqe;
@@ -502,7 +502,7 @@ static void io_cqring_fill_event(struct io_ring_ctx *ctx, u64 ki_user_data, if (cqe) { WRITE_ONCE(cqe->user_data, ki_user_data); WRITE_ONCE(cqe->res, res); - WRITE_ONCE(cqe->flags, ev_flags); + WRITE_ONCE(cqe->flags, 0); } else { unsigned overflow = READ_ONCE(ctx->cq_ring->overflow);
@@ -521,12 +521,12 @@ static void io_cqring_ev_posted(struct io_ring_ctx *ctx) }
static void io_cqring_add_event(struct io_ring_ctx *ctx, u64 user_data, - long res, unsigned ev_flags) + long res) { unsigned long flags;
spin_lock_irqsave(&ctx->completion_lock, flags); - io_cqring_fill_event(ctx, user_data, res, ev_flags); + io_cqring_fill_event(ctx, user_data, res); io_commit_cqring(ctx); spin_unlock_irqrestore(&ctx->completion_lock, flags);
@@ -628,7 +628,7 @@ static void io_iopoll_complete(struct io_ring_ctx *ctx, unsigned int *nr_events, req = list_first_entry(done, struct io_kiocb, list); list_del(&req->list);
- io_cqring_fill_event(ctx, req->user_data, req->error, 0); + io_cqring_fill_event(ctx, req->user_data, req->error); (*nr_events)++;
if (refcount_dec_and_test(&req->refs)) { @@ -776,7 +776,7 @@ static void io_complete_rw(struct kiocb *kiocb, long res, long res2)
kiocb_end_write(kiocb);
- io_cqring_add_event(req->ctx, req->user_data, res, 0); + io_cqring_add_event(req->ctx, req->user_data, res); io_put_req(req); }
@@ -1208,7 +1208,7 @@ static int io_nop(struct io_kiocb *req, u64 user_data) if (unlikely(ctx->flags & IORING_SETUP_IOPOLL)) return -EINVAL;
- io_cqring_add_event(ctx, user_data, err, 0); + io_cqring_add_event(ctx, user_data, err); io_put_req(req); return 0; } @@ -1253,7 +1253,7 @@ static int io_fsync(struct io_kiocb *req, const struct io_uring_sqe *sqe, end > 0 ? end : LLONG_MAX, fsync_flags & IORING_FSYNC_DATASYNC);
- io_cqring_add_event(req->ctx, sqe->user_data, ret, 0); + io_cqring_add_event(req->ctx, sqe->user_data, ret); io_put_req(req); return 0; } @@ -1297,7 +1297,7 @@ static int io_sync_file_range(struct io_kiocb *req,
ret = sync_file_range(req->rw.ki_filp, sqe_off, sqe_len, flags);
- io_cqring_add_event(req->ctx, sqe->user_data, ret, 0); + io_cqring_add_event(req->ctx, sqe->user_data, ret); io_put_req(req); return 0; } @@ -1355,7 +1355,7 @@ static int io_poll_remove(struct io_kiocb *req, const struct io_uring_sqe *sqe) } spin_unlock_irq(&ctx->completion_lock);
- io_cqring_add_event(req->ctx, sqe->user_data, ret, 0); + io_cqring_add_event(req->ctx, sqe->user_data, ret); io_put_req(req); return 0; } @@ -1364,7 +1364,7 @@ static void io_poll_complete(struct io_ring_ctx *ctx, struct io_kiocb *req, __poll_t mask) { req->poll.done = true; - io_cqring_fill_event(ctx, req->user_data, mangle_poll(mask), 0); + io_cqring_fill_event(ctx, req->user_data, mangle_poll(mask)); io_commit_cqring(ctx); }
@@ -1684,7 +1684,7 @@ static void io_sq_wq_submit_work(struct work_struct *work) io_put_req(req);
if (ret) { - io_cqring_add_event(ctx, sqe->user_data, ret, 0); + io_cqring_add_event(ctx, sqe->user_data, ret); io_put_req(req); }
@@ -1989,7 +1989,7 @@ static int io_submit_sqes(struct io_ring_ctx *ctx, struct sqe_submit *sqes, continue; }
- io_cqring_add_event(ctx, sqes[i].sqe->user_data, ret, 0); + io_cqring_add_event(ctx, sqes[i].sqe->user_data, ret); }
if (statep) @@ -2154,7 +2154,7 @@ static int io_ring_submit(struct io_ring_ctx *ctx, unsigned int to_submit)
ret = io_submit_sqe(ctx, &s, statep); if (ret) - io_cqring_add_event(ctx, s.sqe->user_data, ret, 0); + io_cqring_add_event(ctx, s.sqe->user_data, ret); } io_commit_sqring(ctx);
From: Roman Penyaev rpenyaev@suse.de
mainline inclusion from mainline-5.2-rc1 commit 2bbcd6d3b36a75a19be4917807f54ae32dd26aba category: feature bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27 CVE: NA ---------------------------
This fixes a couple of races which lead to an infinite wait for park completion, with the following backtraces:
[20801.303319] Call Trace:
[20801.303321]  ? __schedule+0x284/0x650
[20801.303323]  schedule+0x33/0xc0
[20801.303324]  schedule_timeout+0x1bc/0x210
[20801.303326]  ? schedule+0x3d/0xc0
[20801.303327]  ? schedule_timeout+0x1bc/0x210
[20801.303329]  ? preempt_count_add+0x79/0xb0
[20801.303330]  wait_for_completion+0xa5/0x120
[20801.303331]  ? wake_up_q+0x70/0x70
[20801.303333]  kthread_park+0x48/0x80
[20801.303335]  io_finish_async+0x2c/0x70
[20801.303336]  io_ring_ctx_wait_and_kill+0x95/0x180
[20801.303338]  io_uring_release+0x1c/0x20
[20801.303339]  __fput+0xad/0x210
[20801.303341]  task_work_run+0x8f/0xb0
[20801.303342]  exit_to_usermode_loop+0xa0/0xb0
[20801.303343]  do_syscall_64+0xe0/0x100
[20801.303349]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
[20801.303380] Call Trace:
[20801.303383]  ? __schedule+0x284/0x650
[20801.303384]  schedule+0x33/0xc0
[20801.303386]  io_sq_thread+0x38a/0x410
[20801.303388]  ? __switch_to_asm+0x40/0x70
[20801.303390]  ? wait_woken+0x80/0x80
[20801.303392]  ? _raw_spin_lock_irqsave+0x17/0x40
[20801.303394]  ? io_submit_sqes+0x120/0x120
[20801.303395]  kthread+0x112/0x130
[20801.303396]  ? kthread_create_on_node+0x60/0x60
[20801.303398]  ret_from_fork+0x35/0x40
o kthread_park() waits for park completion, so the io_sq_thread() loop should check kthread_should_park() along with kthread_should_stop(); otherwise, if kthread_park() is called before prepare_to_wait(), the following schedule() never returns:
CPU#0                                    CPU#1

io_sq_thread_stop():                     io_sq_thread():

                                         while (!kthread_should_stop() &&
                                                !ctx->sqo_stop) {

ctx->sqo_stop = 1;
kthread_park()

                                             prepare_to_wait();
                                             if (kthread_should_stop()) {
                                             }
                                             schedule();   <<< nobody checks the park flag,
                                                           <<< so we schedule and never return
o if the flag ctx->sqo_stop is observed by the io_sq_thread() loop, it is quite possible that the kthread_should_park() check and the following kthread_parkme() are never called, because kthread_park() has not yet been called; a few moments later it is called and waits there for park completion, which never happens, because the kthread has already exited:
CPU#0                                    CPU#1

io_sq_thread_stop():                     io_sq_thread():

ctx->sqo_stop = 1;
                                         while (!kthread_should_stop() &&
                                                !ctx->sqo_stop) {
                                             <<< observe sqo_stop and exit the loop
                                         }

                                         if (kthread_should_park())
                                             kthread_parkme();   <<< never called, since we
                                                                 <<< were never parked

kthread_park()   <<< waits forever for park completion
In the current patch we quit the loop on the kthread_should_park() check alone (kthread_park() is synchronous, so kthread_should_stop() is never observed), and we abandon the ->sqo_stop flag, since it is racy. At the end of io_sq_thread() we unconditionally call kthread_parkme(), since we can only have exited the loop via the park flag.
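The resulting shape of the sq thread, sketched from the diff below:

	static int io_sq_thread(void *data)
	{
		...
		while (!kthread_should_park()) {
			/* grab SQEs, submit, wait on ctx->sqo_wait */
		}
		...
		kthread_parkme();	/* unconditional: the loop only exits
					 * once kthread_park() was requested */
		return 0;
	}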
Signed-off-by: Roman Penyaev rpenyaev@suse.de Cc: Jens Axboe axboe@kernel.dk Cc: linux-block@vger.kernel.org Signed-off-by: Jens Axboe axboe@kernel.dk Signed-off-by: Zhihao Cheng chengzhihao1@huawei.com Signed-off-by: yangerkun yangerkun@huawei.com Reviewed-by: zhangyi (F) yi.zhang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- fs/io_uring.c | 15 ++++++++------- 1 file changed, 8 insertions(+), 7 deletions(-)
diff --git a/fs/io_uring.c b/fs/io_uring.c index 6c5dc75f62c2..6df8a9aa975d 100644 --- a/fs/io_uring.c +++ b/fs/io_uring.c @@ -231,7 +231,6 @@ struct io_ring_ctx { struct task_struct *sqo_thread; /* if using sq thread polling */ struct mm_struct *sqo_mm; wait_queue_head_t sqo_wait; - unsigned sqo_stop;
struct { /* CQ ring */ @@ -2012,7 +2011,7 @@ static int io_sq_thread(void *data) set_fs(USER_DS);
timeout = inflight = 0; - while (!kthread_should_stop() && !ctx->sqo_stop) { + while (!kthread_should_park()) { bool all_fixed, mm_fault = false; int i;
@@ -2074,7 +2073,7 @@ static int io_sq_thread(void *data) smp_mb();
if (!io_get_sqring(ctx, &sqes[0])) { - if (kthread_should_stop()) { + if (kthread_should_park()) { finish_wait(&ctx->sqo_wait, &wait); break; } @@ -2124,8 +2123,7 @@ static int io_sq_thread(void *data) mmput(cur_mm); }
- if (kthread_should_park()) - kthread_parkme(); + kthread_parkme();
return 0; } @@ -2257,8 +2255,11 @@ static int io_sqe_files_unregister(struct io_ring_ctx *ctx) static void io_sq_thread_stop(struct io_ring_ctx *ctx) { if (ctx->sqo_thread) { - ctx->sqo_stop = 1; - mb(); + /* + * The park is a bit of a work-around, without it we get + * warning spews on shutdown with SQPOLL set and affinity + * set to a single CPU. + */ kthread_park(ctx->sqo_thread); kthread_stop(ctx->sqo_thread); ctx->sqo_thread = NULL;
From: Jackie Liu liuyun01@kylinos.cn
mainline inclusion from mainline-5.2-rc1 commit dc6ce4bc2b355a47f225a0205046b3ebf29a7f72 category: feature bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27 CVE: NA ---------------------------
smp_rmb() is required whenever io_cqring_events() is used, so keep the smp_rmb() inside io_cqring_events() itself rather than at every call site.
Signed-off-by: Jackie Liu liuyun01@kylinos.cn Signed-off-by: Jens Axboe axboe@kernel.dk Signed-off-by: Zhihao Cheng chengzhihao1@huawei.com Signed-off-by: yangerkun yangerkun@huawei.com Reviewed-by: zhangyi (F) yi.zhang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- fs/io_uring.c | 6 ++---- 1 file changed, 2 insertions(+), 4 deletions(-)
diff --git a/fs/io_uring.c b/fs/io_uring.c index 6df8a9aa975d..a852f67019fe 100644 --- a/fs/io_uring.c +++ b/fs/io_uring.c @@ -2164,6 +2164,8 @@ static int io_ring_submit(struct io_ring_ctx *ctx, unsigned int to_submit)
static unsigned io_cqring_events(struct io_cq_ring *ring) { + /* See comment at the top of this file */ + smp_rmb(); return READ_ONCE(ring->r.tail) - READ_ONCE(ring->r.head); }
@@ -2179,8 +2181,6 @@ static int io_cqring_wait(struct io_ring_ctx *ctx, int min_events, DEFINE_WAIT(wait); int ret;
- /* See comment at the top of this file */ - smp_rmb(); if (io_cqring_events(ring) >= min_events) return 0;
@@ -2202,8 +2202,6 @@ static int io_cqring_wait(struct io_ring_ctx *ctx, int min_events, prepare_to_wait(&ctx->wait, &wait, TASK_INTERRUPTIBLE);
ret = 0; - /* See comment at the top of this file */ - smp_rmb(); if (io_cqring_events(ring) >= min_events) break;
From: Jackie Liu liuyun01@kylinos.cn
mainline inclusion from mainline-5.2-rc1 commit fdb288a679cdf6a71f3c1ae6f348ba4dae742681 category: feature bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27 CVE: NA ---------------------------
The previous patch ensured that io_cqring_events() contains the required smp_rmb() memory barrier. Now we can use wait_event_interruptible() to keep the code simple.
Signed-off-by: Jackie Liu liuyun01@kylinos.cn Signed-off-by: Jens Axboe axboe@kernel.dk Signed-off-by: Zhihao Cheng chengzhihao1@huawei.com Signed-off-by: yangerkun yangerkun@huawei.com Reviewed-by: zhangyi (F) yi.zhang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- fs/io_uring.c | 17 ++--------------- 1 file changed, 2 insertions(+), 15 deletions(-)
diff --git a/fs/io_uring.c b/fs/io_uring.c index a852f67019fe..2d8abce92aae 100644 --- a/fs/io_uring.c +++ b/fs/io_uring.c @@ -2178,7 +2178,6 @@ static int io_cqring_wait(struct io_ring_ctx *ctx, int min_events, { struct io_cq_ring *ring = ctx->cq_ring; sigset_t ksigmask, sigsaved; - DEFINE_WAIT(wait); int ret;
if (io_cqring_events(ring) >= min_events) @@ -2198,21 +2197,9 @@ static int io_cqring_wait(struct io_ring_ctx *ctx, int min_events, return ret; }
- do { - prepare_to_wait(&ctx->wait, &wait, TASK_INTERRUPTIBLE); - - ret = 0; - if (io_cqring_events(ring) >= min_events) - break; - - schedule(); - + ret = wait_event_interruptible(ctx->wait, io_cqring_events(ring) >= min_events); + if (ret == -ERESTARTSYS) ret = -EINTR; - if (signal_pending(current)) - break; - } while (1); - - finish_wait(&ctx->wait, &wait);
if (sig) restore_user_sigmask(sig, &sigsaved);
From: Oleg Nesterov oleg@redhat.com
mainline inclusion from mainline-5.2-rc7 commit 97abc889ee296faf95ca0e978340fb7b942a3e32 category: feature bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27 CVE: NA ---------------------------
This is the minimal fix for stable; I'll send cleanups later.

Commit 854a6ed56839 ("signal: Add restore_user_sigmask()") introduced a visible change which breaks user-space: a signal temporarily unblocked by set_user_sigmask() can be delivered even if the caller returns success or a timeout.
Change restore_user_sigmask() to accept the additional "interrupted" argument which should be used instead of signal_pending() check, and update the callers.
Eric said:
: For clarity. I don't think this is required by posix, or fundamentally to : remove the races in select. It is what linux has always done and we have : applications who care so I agree this fix is needed. : : Further in any case where the semantic change that this patch rolls back : (aka where allowing a signal to be delivered and the select like call to : complete) would be advantage we can do as well if not better by using : signalfd. : : Michael is there any chance we can get this guarantee of the linux : implementation of pselect and friends clearly documented. The guarantee : that if the system call completes successfully we are guaranteed that no : signal that is unblocked by using sigmask will be delivered?
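The user-space pattern that relies on this guarantee, as a hedged sketch (fds/nfds set up elsewhere): a signal is kept blocked in normal operation and atomically unblocked only for the duration of the wait, so a successful return means the handler did not run behind the caller's back:

	sigset_t block_set, wait_mask;

	sigemptyset(&block_set);
	sigaddset(&block_set, SIGINT);
	/* block SIGINT; remember the original mask for the wait */
	sigprocmask(SIG_BLOCK, &block_set, &wait_mask);

	int ret = ppoll(fds, nfds, NULL, &wait_mask);
	if (ret > 0) {
		/* with this fix: a SIGINT raised meanwhile is still
		 * pending; it was not delivered during ppoll() */
	} else if (ret < 0 && errno == EINTR) {
		/* the handler ran here, inside the temporary mask */
	}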
Link: http://lkml.kernel.org/r/20190604134117.GA29963@redhat.com Fixes: 854a6ed56839a40f6b5d02a2962f48841482eec4 ("signal: Add restore_user_sigmask()") Signed-off-by: Oleg Nesterov oleg@redhat.com Reported-by: Eric Wong e@80x24.org Tested-by: Eric Wong e@80x24.org Acked-by: "Eric W. Biederman" ebiederm@xmission.com Acked-by: Arnd Bergmann arnd@arndb.de Acked-by: Deepa Dinamani deepa.kernel@gmail.com Cc: Michael Kerrisk mtk.manpages@gmail.com Cc: Jens Axboe axboe@kernel.dk Cc: Davidlohr Bueso dave@stgolabs.net Cc: Jason Baron jbaron@akamai.com Cc: Thomas Gleixner tglx@linutronix.de Cc: Al Viro viro@ZenIV.linux.org.uk Cc: David Laight David.Laight@ACULAB.COM Cc: stable@vger.kernel.org [5.0+] Signed-off-by: Andrew Morton akpm@linux-foundation.org Signed-off-by: Linus Torvalds torvalds@linux-foundation.org
Conflicts: fs/aio.c [ Patch 9afc5eee65c("y2038: globally rename compat_time to old_time32") is not applied. ]
Signed-off-by: Zhihao Cheng chengzhihao1@huawei.com Signed-off-by: yangerkun yangerkun@huawei.com Reviewed-by: zhangyi (F) yi.zhang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- fs/aio.c | 28 ++++++++++++++++++++-------- fs/eventpoll.c | 4 ++-- fs/io_uring.c | 7 ++++--- fs/select.c | 18 ++++++------------ include/linux/signal.h | 2 +- kernel/signal.c | 5 +++-- 6 files changed, 36 insertions(+), 28 deletions(-)
diff --git a/fs/aio.c b/fs/aio.c index 1458922faa56..c0d061c355e1 100644 --- a/fs/aio.c +++ b/fs/aio.c @@ -2134,6 +2134,7 @@ SYSCALL_DEFINE6(io_pgetevents, struct __aio_sigset ksig = { NULL, }; sigset_t ksigmask, sigsaved; struct timespec64 ts; + bool interrupted; int ret;
if (timeout && unlikely(get_timespec64(&ts, timeout))) @@ -2147,8 +2148,10 @@ SYSCALL_DEFINE6(io_pgetevents, return ret;
ret = do_io_getevents(ctx_id, min_nr, nr, events, timeout ? &ts : NULL); - restore_user_sigmask(ksig.sigmask, &sigsaved); - if (signal_pending(current) && !ret) + + interrupted = signal_pending(current); + restore_user_sigmask(ksig.sigmask, &sigsaved, interrupted); + if (interrupted && !ret) ret = -ERESTARTNOHAND;
return ret; @@ -2167,6 +2170,7 @@ SYSCALL_DEFINE6(io_pgetevents_time32, struct __aio_sigset ksig = { NULL, }; sigset_t ksigmask, sigsaved; struct timespec64 ts; + bool interrupted; int ret;
if (timeout && unlikely(compat_get_timespec64(&ts, timeout))) @@ -2181,8 +2185,10 @@ SYSCALL_DEFINE6(io_pgetevents_time32, return ret;
ret = do_io_getevents(ctx_id, min_nr, nr, events, timeout ? &ts : NULL); - restore_user_sigmask(ksig.sigmask, &sigsaved); - if (signal_pending(current) && !ret) + + interrupted = signal_pending(current); + restore_user_sigmask(ksig.sigmask, &sigsaved, interrupted); + if (interrupted && !ret) ret = -ERESTARTNOHAND;
return ret; @@ -2232,6 +2238,7 @@ COMPAT_SYSCALL_DEFINE6(io_pgetevents, struct __compat_aio_sigset ksig = { NULL, }; sigset_t ksigmask, sigsaved; struct timespec64 t; + bool interrupted; int ret;
if (timeout && compat_get_timespec64(&t, timeout)) @@ -2245,8 +2252,10 @@ COMPAT_SYSCALL_DEFINE6(io_pgetevents, return ret;
ret = do_io_getevents(ctx_id, min_nr, nr, events, timeout ? &t : NULL); - restore_user_sigmask(ksig.sigmask, &sigsaved); - if (signal_pending(current) && !ret) + + interrupted = signal_pending(current); + restore_user_sigmask(ksig.sigmask, &sigsaved, interrupted); + if (interrupted && !ret) ret = -ERESTARTNOHAND;
return ret; @@ -2265,6 +2274,7 @@ COMPAT_SYSCALL_DEFINE6(io_pgetevents_time64, struct __compat_aio_sigset ksig = { NULL, }; sigset_t ksigmask, sigsaved; struct timespec64 t; + bool interrupted; int ret;
if (timeout && get_timespec64(&t, timeout)) @@ -2278,8 +2288,10 @@ COMPAT_SYSCALL_DEFINE6(io_pgetevents_time64, return ret;
ret = do_io_getevents(ctx_id, min_nr, nr, events, timeout ? &t : NULL); - restore_user_sigmask(ksig.sigmask, &sigsaved); - if (signal_pending(current) && !ret) + + interrupted = signal_pending(current); + restore_user_sigmask(ksig.sigmask, &sigsaved, interrupted); + if (interrupted && !ret) ret = -ERESTARTNOHAND;
return ret; diff --git a/fs/eventpoll.c b/fs/eventpoll.c index fb096e3c9fdc..f7f185779d25 100644 --- a/fs/eventpoll.c +++ b/fs/eventpoll.c @@ -2223,7 +2223,7 @@ SYSCALL_DEFINE6(epoll_pwait, int, epfd, struct epoll_event __user *, events,
error = do_epoll_wait(epfd, events, maxevents, timeout);
- restore_user_sigmask(sigmask, &sigsaved); + restore_user_sigmask(sigmask, &sigsaved, error == -EINTR);
return error; } @@ -2248,7 +2248,7 @@ COMPAT_SYSCALL_DEFINE6(epoll_pwait, int, epfd,
err = do_epoll_wait(epfd, events, maxevents, timeout);
- restore_user_sigmask(sigmask, &sigsaved); + restore_user_sigmask(sigmask, &sigsaved, err == -EINTR);
return err; } diff --git a/fs/io_uring.c b/fs/io_uring.c index 2d8abce92aae..ad2e9a3b5ad7 100644 --- a/fs/io_uring.c +++ b/fs/io_uring.c @@ -2198,11 +2198,12 @@ static int io_cqring_wait(struct io_ring_ctx *ctx, int min_events, }
ret = wait_event_interruptible(ctx->wait, io_cqring_events(ring) >= min_events); - if (ret == -ERESTARTSYS) - ret = -EINTR;
if (sig) - restore_user_sigmask(sig, &sigsaved); + restore_user_sigmask(sig, &sigsaved, ret == -ERESTARTSYS); + + if (ret == -ERESTARTSYS) + ret = -EINTR;
return READ_ONCE(ring->r.head) == READ_ONCE(ring->r.tail) ? ret : 0; } diff --git a/fs/select.c b/fs/select.c index ea99f0620a9c..6cfe6965284c 100644 --- a/fs/select.c +++ b/fs/select.c @@ -758,10 +758,9 @@ static long do_pselect(int n, fd_set __user *inp, fd_set __user *outp, return ret;
ret = core_sys_select(n, inp, outp, exp, to); + restore_user_sigmask(sigmask, &sigsaved, ret == -ERESTARTNOHAND); ret = poll_select_copy_remaining(&end_time, tsp, type, ret);
- restore_user_sigmask(sigmask, &sigsaved); - return ret; }
@@ -1104,8 +1103,7 @@ SYSCALL_DEFINE5(ppoll, struct pollfd __user *, ufds, unsigned int, nfds,
ret = do_sys_poll(ufds, nfds, to);
- restore_user_sigmask(sigmask, &sigsaved); - + restore_user_sigmask(sigmask, &sigsaved, ret == -EINTR); /* We can restart this syscall, usually */ if (ret == -EINTR) ret = -ERESTARTNOHAND; @@ -1140,8 +1138,7 @@ SYSCALL_DEFINE5(ppoll_time32, struct pollfd __user *, ufds, unsigned int, nfds,
ret = do_sys_poll(ufds, nfds, to);
- restore_user_sigmask(sigmask, &sigsaved); - + restore_user_sigmask(sigmask, &sigsaved, ret == -EINTR); /* We can restart this syscall, usually */ if (ret == -EINTR) ret = -ERESTARTNOHAND; @@ -1348,10 +1345,9 @@ static long do_compat_pselect(int n, compat_ulong_t __user *inp, return ret;
ret = compat_core_sys_select(n, inp, outp, exp, to); + restore_user_sigmask(sigmask, &sigsaved, ret == -ERESTARTNOHAND); ret = poll_select_copy_remaining(&end_time, tsp, type, ret);
- restore_user_sigmask(sigmask, &sigsaved); - return ret; }
@@ -1423,8 +1419,7 @@ COMPAT_SYSCALL_DEFINE5(ppoll, struct pollfd __user *, ufds,
ret = do_sys_poll(ufds, nfds, to);
- restore_user_sigmask(sigmask, &sigsaved); - + restore_user_sigmask(sigmask, &sigsaved, ret == -EINTR); /* We can restart this syscall, usually */ if (ret == -EINTR) ret = -ERESTARTNOHAND; @@ -1459,8 +1454,7 @@ COMPAT_SYSCALL_DEFINE5(ppoll_time64, struct pollfd __user *, ufds,
ret = do_sys_poll(ufds, nfds, to);
- restore_user_sigmask(sigmask, &sigsaved); - + restore_user_sigmask(sigmask, &sigsaved, ret == -EINTR); /* We can restart this syscall, usually */ if (ret == -EINTR) ret = -ERESTARTNOHAND; diff --git a/include/linux/signal.h b/include/linux/signal.h index 5172526c90ce..b41c6a4a5362 100644 --- a/include/linux/signal.h +++ b/include/linux/signal.h @@ -266,7 +266,7 @@ extern int sigprocmask(int, sigset_t *, sigset_t *); extern int set_user_sigmask(const sigset_t __user *usigmask, sigset_t *set, sigset_t *oldset, size_t sigsetsize); extern void restore_user_sigmask(const void __user *usigmask, - sigset_t *sigsaved); + sigset_t *sigsaved, bool interrupted); extern void set_current_blocked(sigset_t *); extern void __set_current_blocked(const sigset_t *); extern int show_unhandled_signals; diff --git a/kernel/signal.c b/kernel/signal.c index 24b48a689972..59da2ae4aea0 100644 --- a/kernel/signal.c +++ b/kernel/signal.c @@ -2870,7 +2870,8 @@ EXPORT_SYMBOL(set_compat_user_sigmask); * This is useful for syscalls such as ppoll, pselect, io_pgetevents and * epoll_pwait where a new sigmask is passed in from userland for the syscalls. */ -void restore_user_sigmask(const void __user *usigmask, sigset_t *sigsaved) +void restore_user_sigmask(const void __user *usigmask, sigset_t *sigsaved, + bool interrupted) {
if (!usigmask) @@ -2880,7 +2881,7 @@ void restore_user_sigmask(const void __user *usigmask, sigset_t *sigsaved) * Restoring sigmask here can lead to delivering signals that the above * syscalls are intended to block because of the sigmask passed in. */ - if (signal_pending(current)) { + if (interrupted) { current->saved_sigmask = *sigsaved; set_restore_sigmask(); return;
From: Oleg Nesterov oleg@redhat.com
mainline inclusion from mainline-5.3-rc1 commit b772434be0891ed1081a08ae7cfd4666728f8e82 category: feature bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27 CVE: NA ---------------------------
task->saved_sigmask and ->restore_sigmask are only used in the ret-from- syscall paths. This means that set_user_sigmask() can save ->blocked in ->saved_sigmask and do set_restore_sigmask() to indicate that ->blocked was modified.
This way the callers do not need two sigset_t's passed to set/restore, and restore_user_sigmask(), renamed to restore_saved_sigmask_unless(), turns into a trivial helper which just calls restore_saved_sigmask().
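After the change every caller collapses to the same pattern (sketched from the epoll_pwait() hunk below; do_wait() stands in for the actual work):

	ret = set_user_sigmask(sigmask, sigsetsize);
	if (ret)		/* saves ->blocked and sets the restore flag */
		return ret;

	ret = do_wait(...);

	/* if a signal must be delivered first, leave the saved mask
	 * pending for the ret-from-syscall path; otherwise restore now */
	restore_saved_sigmask_unless(ret == -EINTR);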
Link: http://lkml.kernel.org/r/20190606113206.GA9464@redhat.com Signed-off-by: Oleg Nesterov oleg@redhat.com Cc: Deepa Dinamani deepa.kernel@gmail.com Cc: Arnd Bergmann arnd@arndb.de Cc: Jens Axboe axboe@kernel.dk Cc: Davidlohr Bueso dave@stgolabs.net Cc: Eric Wong e@80x24.org Cc: Jason Baron jbaron@akamai.com Cc: Thomas Gleixner tglx@linutronix.de Cc: Al Viro viro@ZenIV.linux.org.uk Cc: Eric W. Biederman ebiederm@xmission.com Cc: David Laight David.Laight@aculab.com Signed-off-by: Andrew Morton akpm@linux-foundation.org Signed-off-by: Linus Torvalds torvalds@linux-foundation.org
Conflicts: fs/select.c [ Patch 9afc5eee65c("y2038: globally rename compat_time to old_time32") is not applied. ] [ Patch 8dabe7245b("y2038: syscalls: rename y2038 compat syscalls") is not applied. ]
Signed-off-by: Zhihao Cheng chengzhihao1@huawei.com Signed-off-by: yangerkun yangerkun@huawei.com Reviewed-by: zhangyi (F) yi.zhang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- fs/aio.c | 20 +++++------ fs/eventpoll.c | 12 +++---- fs/io_uring.c | 11 ++---- fs/select.c | 34 +++++++----------- include/linux/compat.h | 3 +- include/linux/sched/signal.h | 12 +++++-- include/linux/signal.h | 4 --- kernel/signal.c | 69 ++++++++++-------------------------- 8 files changed, 57 insertions(+), 108 deletions(-)
diff --git a/fs/aio.c b/fs/aio.c index c0d061c355e1..4561f9ba56c4 100644 --- a/fs/aio.c +++ b/fs/aio.c @@ -2132,7 +2132,6 @@ SYSCALL_DEFINE6(io_pgetevents, const struct __aio_sigset __user *, usig) { struct __aio_sigset ksig = { NULL, }; - sigset_t ksigmask, sigsaved; struct timespec64 ts; bool interrupted; int ret; @@ -2143,14 +2142,14 @@ SYSCALL_DEFINE6(io_pgetevents, if (usig && copy_from_user(&ksig, usig, sizeof(ksig))) return -EFAULT;
- ret = set_user_sigmask(ksig.sigmask, &ksigmask, &sigsaved, ksig.sigsetsize); + ret = set_user_sigmask(ksig.sigmask, ksig.sigsetsize); if (ret) return ret;
ret = do_io_getevents(ctx_id, min_nr, nr, events, timeout ? &ts : NULL);
interrupted = signal_pending(current); - restore_user_sigmask(ksig.sigmask, &sigsaved, interrupted); + restore_saved_sigmask_unless(interrupted); if (interrupted && !ret) ret = -ERESTARTNOHAND;
@@ -2168,7 +2167,6 @@ SYSCALL_DEFINE6(io_pgetevents_time32, const struct __aio_sigset __user *, usig) { struct __aio_sigset ksig = { NULL, }; - sigset_t ksigmask, sigsaved; struct timespec64 ts; bool interrupted; int ret; @@ -2180,14 +2178,14 @@ SYSCALL_DEFINE6(io_pgetevents_time32, return -EFAULT;
- ret = set_user_sigmask(ksig.sigmask, &ksigmask, &sigsaved, ksig.sigsetsize); + ret = set_user_sigmask(ksig.sigmask, ksig.sigsetsize); if (ret) return ret;
ret = do_io_getevents(ctx_id, min_nr, nr, events, timeout ? &ts : NULL);
interrupted = signal_pending(current); - restore_user_sigmask(ksig.sigmask, &sigsaved, interrupted); + restore_saved_sigmask_unless(interrupted); if (interrupted && !ret) ret = -ERESTARTNOHAND;
@@ -2236,7 +2234,6 @@ COMPAT_SYSCALL_DEFINE6(io_pgetevents, const struct __compat_aio_sigset __user *, usig) { struct __compat_aio_sigset ksig = { NULL, }; - sigset_t ksigmask, sigsaved; struct timespec64 t; bool interrupted; int ret; @@ -2247,14 +2244,14 @@ COMPAT_SYSCALL_DEFINE6(io_pgetevents, if (usig && copy_from_user(&ksig, usig, sizeof(ksig))) return -EFAULT;
- ret = set_compat_user_sigmask(ksig.sigmask, &ksigmask, &sigsaved, ksig.sigsetsize); + ret = set_compat_user_sigmask(ksig.sigmask, ksig.sigsetsize); if (ret) return ret;
ret = do_io_getevents(ctx_id, min_nr, nr, events, timeout ? &t : NULL);
interrupted = signal_pending(current); - restore_user_sigmask(ksig.sigmask, &sigsaved, interrupted); + restore_saved_sigmask_unless(interrupted); if (interrupted && !ret) ret = -ERESTARTNOHAND;
@@ -2272,7 +2269,6 @@ COMPAT_SYSCALL_DEFINE6(io_pgetevents_time64, const struct __compat_aio_sigset __user *, usig) { struct __compat_aio_sigset ksig = { NULL, }; - sigset_t ksigmask, sigsaved; struct timespec64 t; bool interrupted; int ret; @@ -2283,14 +2279,14 @@ COMPAT_SYSCALL_DEFINE6(io_pgetevents_time64, if (usig && copy_from_user(&ksig, usig, sizeof(ksig))) return -EFAULT;
- ret = set_compat_user_sigmask(ksig.sigmask, &ksigmask, &sigsaved, ksig.sigsetsize); + ret = set_compat_user_sigmask(ksig.sigmask, ksig.sigsetsize); if (ret) return ret;
ret = do_io_getevents(ctx_id, min_nr, nr, events, timeout ? &t : NULL);
interrupted = signal_pending(current); - restore_user_sigmask(ksig.sigmask, &sigsaved, interrupted); + restore_saved_sigmask_unless(interrupted); if (interrupted && !ret) ret = -ERESTARTNOHAND;
diff --git a/fs/eventpoll.c b/fs/eventpoll.c index f7f185779d25..6d4d73faabfd 100644 --- a/fs/eventpoll.c +++ b/fs/eventpoll.c @@ -2211,19 +2211,17 @@ SYSCALL_DEFINE6(epoll_pwait, int, epfd, struct epoll_event __user *, events, size_t, sigsetsize) { int error; - sigset_t ksigmask, sigsaved;
/* * If the caller wants a certain signal mask to be set during the wait, * we apply it here. */ - error = set_user_sigmask(sigmask, &ksigmask, &sigsaved, sigsetsize); + error = set_user_sigmask(sigmask, sigsetsize); if (error) return error;
error = do_epoll_wait(epfd, events, maxevents, timeout); - - restore_user_sigmask(sigmask, &sigsaved, error == -EINTR); + restore_saved_sigmask_unless(error == -EINTR);
return error; } @@ -2236,19 +2234,17 @@ COMPAT_SYSCALL_DEFINE6(epoll_pwait, int, epfd, compat_size_t, sigsetsize) { long err; - sigset_t ksigmask, sigsaved;
/* * If the caller wants a certain signal mask to be set during the wait, * we apply it here. */ - err = set_compat_user_sigmask(sigmask, &ksigmask, &sigsaved, sigsetsize); + err = set_compat_user_sigmask(sigmask, sigsetsize); if (err) return err;
err = do_epoll_wait(epfd, events, maxevents, timeout); - - restore_user_sigmask(sigmask, &sigsaved, err == -EINTR); + restore_saved_sigmask_unless(err == -EINTR);
return err; } diff --git a/fs/io_uring.c b/fs/io_uring.c index ad2e9a3b5ad7..62a73623601d 100644 --- a/fs/io_uring.c +++ b/fs/io_uring.c @@ -2177,7 +2177,6 @@ static int io_cqring_wait(struct io_ring_ctx *ctx, int min_events, const sigset_t __user *sig, size_t sigsz) { struct io_cq_ring *ring = ctx->cq_ring; - sigset_t ksigmask, sigsaved; int ret;
if (io_cqring_events(ring) >= min_events) @@ -2187,21 +2186,17 @@ static int io_cqring_wait(struct io_ring_ctx *ctx, int min_events, #ifdef CONFIG_COMPAT if (in_compat_syscall()) ret = set_compat_user_sigmask((const compat_sigset_t __user *)sig, - &ksigmask, &sigsaved, sigsz); + sigsz); else #endif - ret = set_user_sigmask(sig, &ksigmask, - &sigsaved, sigsz); + ret = set_user_sigmask(sig, sigsz);
if (ret) return ret; }
ret = wait_event_interruptible(ctx->wait, io_cqring_events(ring) >= min_events); - - if (sig) - restore_user_sigmask(sig, &sigsaved, ret == -ERESTARTSYS); - + restore_saved_sigmask_unless(ret == -ERESTARTSYS); if (ret == -ERESTARTSYS) ret = -EINTR;
diff --git a/fs/select.c b/fs/select.c index 6cfe6965284c..baed50c60083 100644 --- a/fs/select.c +++ b/fs/select.c @@ -730,7 +730,6 @@ static long do_pselect(int n, fd_set __user *inp, fd_set __user *outp, const sigset_t __user *sigmask, size_t sigsetsize, enum poll_time_type type) { - sigset_t ksigmask, sigsaved; struct timespec64 ts, end_time, *to = NULL; int ret;
@@ -753,12 +752,12 @@ static long do_pselect(int n, fd_set __user *inp, fd_set __user *outp, return -EINVAL; }
- ret = set_user_sigmask(sigmask, &ksigmask, &sigsaved, sigsetsize); + ret = set_user_sigmask(sigmask, sigsetsize); if (ret) return ret;
ret = core_sys_select(n, inp, outp, exp, to); - restore_user_sigmask(sigmask, &sigsaved, ret == -ERESTARTNOHAND); + restore_saved_sigmask_unless(ret == -ERESTARTNOHAND); ret = poll_select_copy_remaining(&end_time, tsp, type, ret);
return ret; @@ -1084,7 +1083,6 @@ SYSCALL_DEFINE5(ppoll, struct pollfd __user *, ufds, unsigned int, nfds, struct __kernel_timespec __user *, tsp, const sigset_t __user *, sigmask, size_t, sigsetsize) { - sigset_t ksigmask, sigsaved; struct timespec64 ts, end_time, *to = NULL; int ret;
@@ -1097,17 +1095,16 @@ SYSCALL_DEFINE5(ppoll, struct pollfd __user *, ufds, unsigned int, nfds, return -EINVAL; }
- ret = set_user_sigmask(sigmask, &ksigmask, &sigsaved, sigsetsize); + ret = set_user_sigmask(sigmask, sigsetsize); if (ret) return ret;
ret = do_sys_poll(ufds, nfds, to);
- restore_user_sigmask(sigmask, &sigsaved, ret == -EINTR); + restore_saved_sigmask_unless(ret == -EINTR); /* We can restart this syscall, usually */ if (ret == -EINTR) ret = -ERESTARTNOHAND; - ret = poll_select_copy_remaining(&end_time, tsp, PT_TIMESPEC, ret);
return ret; @@ -1119,7 +1116,6 @@ SYSCALL_DEFINE5(ppoll_time32, struct pollfd __user *, ufds, unsigned int, nfds, struct compat_timespec __user *, tsp, const sigset_t __user *, sigmask, size_t, sigsetsize) { - sigset_t ksigmask, sigsaved; struct timespec64 ts, end_time, *to = NULL; int ret;
@@ -1132,17 +1128,16 @@ SYSCALL_DEFINE5(ppoll_time32, struct pollfd __user *, ufds, unsigned int, nfds, return -EINVAL; }
- ret = set_user_sigmask(sigmask, &ksigmask, &sigsaved, sigsetsize); + ret = set_user_sigmask(sigmask, sigsetsize); if (ret) return ret;
ret = do_sys_poll(ufds, nfds, to);
- restore_user_sigmask(sigmask, &sigsaved, ret == -EINTR); + restore_saved_sigmask_unless(ret == -EINTR); /* We can restart this syscall, usually */ if (ret == -EINTR) ret = -ERESTARTNOHAND; - ret = poll_select_copy_remaining(&end_time, tsp, PT_OLD_TIMESPEC, ret);
return ret; @@ -1317,7 +1312,6 @@ static long do_compat_pselect(int n, compat_ulong_t __user *inp, void __user *tsp, compat_sigset_t __user *sigmask, compat_size_t sigsetsize, enum poll_time_type type) { - sigset_t ksigmask, sigsaved; struct timespec64 ts, end_time, *to = NULL; int ret;
@@ -1340,12 +1334,12 @@ static long do_compat_pselect(int n, compat_ulong_t __user *inp, return -EINVAL; }
- ret = set_compat_user_sigmask(sigmask, &ksigmask, &sigsaved, sigsetsize); + ret = set_compat_user_sigmask(sigmask, sigsetsize); if (ret) return ret;
ret = compat_core_sys_select(n, inp, outp, exp, to); - restore_user_sigmask(sigmask, &sigsaved, ret == -ERESTARTNOHAND); + restore_saved_sigmask_unless(ret == -ERESTARTNOHAND); ret = poll_select_copy_remaining(&end_time, tsp, type, ret);
return ret; @@ -1400,7 +1394,6 @@ COMPAT_SYSCALL_DEFINE5(ppoll, struct pollfd __user *, ufds, unsigned int, nfds, struct compat_timespec __user *, tsp, const compat_sigset_t __user *, sigmask, compat_size_t, sigsetsize) { - sigset_t ksigmask, sigsaved; struct timespec64 ts, end_time, *to = NULL; int ret;
@@ -1413,17 +1406,16 @@ COMPAT_SYSCALL_DEFINE5(ppoll, struct pollfd __user *, ufds, return -EINVAL; }
- ret = set_compat_user_sigmask(sigmask, &ksigmask, &sigsaved, sigsetsize); + ret = set_compat_user_sigmask(sigmask, sigsetsize); if (ret) return ret;
ret = do_sys_poll(ufds, nfds, to);
- restore_user_sigmask(sigmask, &sigsaved, ret == -EINTR); + restore_saved_sigmask_unless(ret == -EINTR); /* We can restart this syscall, usually */ if (ret == -EINTR) ret = -ERESTARTNOHAND; - ret = poll_select_copy_remaining(&end_time, tsp, PT_OLD_TIMESPEC, ret);
return ret; @@ -1435,7 +1427,6 @@ COMPAT_SYSCALL_DEFINE5(ppoll_time64, struct pollfd __user *, ufds, unsigned int, nfds, struct __kernel_timespec __user *, tsp, const compat_sigset_t __user *, sigmask, compat_size_t, sigsetsize) { - sigset_t ksigmask, sigsaved; struct timespec64 ts, end_time, *to = NULL; int ret;
@@ -1448,17 +1439,16 @@ COMPAT_SYSCALL_DEFINE5(ppoll_time64, struct pollfd __user *, ufds, return -EINVAL; }
- ret = set_compat_user_sigmask(sigmask, &ksigmask, &sigsaved, sigsetsize); + ret = set_compat_user_sigmask(sigmask, sigsetsize); if (ret) return ret;
ret = do_sys_poll(ufds, nfds, to);
- restore_user_sigmask(sigmask, &sigsaved, ret == -EINTR); + restore_saved_sigmask_unless(ret == -EINTR); /* We can restart this syscall, usually */ if (ret == -EINTR) ret = -ERESTARTNOHAND; - ret = poll_select_copy_remaining(&end_time, tsp, PT_TIMESPEC, ret);
return ret; diff --git a/include/linux/compat.h b/include/linux/compat.h index 996eba7c11cd..90769f00f152 100644 --- a/include/linux/compat.h +++ b/include/linux/compat.h @@ -176,8 +176,7 @@ typedef struct { compat_sigset_word sig[_COMPAT_NSIG_WORDS]; } compat_sigset_t;
-int set_compat_user_sigmask(const compat_sigset_t __user *usigmask, - sigset_t *set, sigset_t *oldset, +int set_compat_user_sigmask(const compat_sigset_t __user *umask, size_t sigsetsize);
struct compat_sigaction { diff --git a/include/linux/sched/signal.h b/include/linux/sched/signal.h index db7ee0d0e874..8bde85ac0e77 100644 --- a/include/linux/sched/signal.h +++ b/include/linux/sched/signal.h @@ -421,7 +421,6 @@ void task_join_group_stop(struct task_struct *task); static inline void set_restore_sigmask(void) { set_thread_flag(TIF_RESTORE_SIGMASK); - WARN_ON(!test_thread_flag(TIF_SIGPENDING)); }
static inline void clear_tsk_restore_sigmask(struct task_struct *tsk) @@ -452,7 +451,6 @@ static inline bool test_and_clear_restore_sigmask(void) static inline void set_restore_sigmask(void) { current->restore_sigmask = true; - WARN_ON(!test_thread_flag(TIF_SIGPENDING)); } static inline void clear_tsk_restore_sigmask(struct task_struct *tsk) { @@ -485,6 +483,16 @@ static inline void restore_saved_sigmask(void) __set_current_blocked(¤t->saved_sigmask); }
+extern int set_user_sigmask(const sigset_t __user *umask, size_t sigsetsize); + +static inline void restore_saved_sigmask_unless(bool interrupted) +{ + if (interrupted) + WARN_ON(!test_thread_flag(TIF_SIGPENDING)); + else + restore_saved_sigmask(); +} + static inline sigset_t *sigmask_to_save(void) { sigset_t *res = ¤t->blocked; diff --git a/include/linux/signal.h b/include/linux/signal.h index b41c6a4a5362..0be5ce2375cb 100644 --- a/include/linux/signal.h +++ b/include/linux/signal.h @@ -263,10 +263,6 @@ extern int group_send_sig_info(int sig, struct siginfo *info, struct task_struct *p, enum pid_type type); extern int __group_send_sig_info(int, struct siginfo *, struct task_struct *); extern int sigprocmask(int, sigset_t *, sigset_t *); -extern int set_user_sigmask(const sigset_t __user *usigmask, sigset_t *set, - sigset_t *oldset, size_t sigsetsize); -extern void restore_user_sigmask(const void __user *usigmask, - sigset_t *sigsaved, bool interrupted); extern void set_current_blocked(sigset_t *); extern void __set_current_blocked(const sigset_t *); extern int show_unhandled_signals; diff --git a/kernel/signal.c b/kernel/signal.c index 59da2ae4aea0..03c0fbd586b4 100644 --- a/kernel/signal.c +++ b/kernel/signal.c @@ -2821,80 +2821,49 @@ int sigprocmask(int how, sigset_t *set, sigset_t *oldset) * * This is useful for syscalls such as ppoll, pselect, io_pgetevents and * epoll_pwait where a new sigmask is passed from userland for the syscalls. + * + * Note that it does set_restore_sigmask() in advance, so it must be always + * paired with restore_saved_sigmask_unless() before return from syscall. */ -int set_user_sigmask(const sigset_t __user *usigmask, sigset_t *set, - sigset_t *oldset, size_t sigsetsize) +int set_user_sigmask(const sigset_t __user *umask, size_t sigsetsize) { - if (!usigmask) - return 0; + sigset_t kmask;
+ if (!umask) + return 0; if (sigsetsize != sizeof(sigset_t)) return -EINVAL; - if (copy_from_user(set, usigmask, sizeof(sigset_t))) + if (copy_from_user(&kmask, umask, sizeof(sigset_t))) return -EFAULT;
- *oldset = current->blocked; - set_current_blocked(set); + set_restore_sigmask(); + current->saved_sigmask = current->blocked; + set_current_blocked(&kmask);
return 0; } -EXPORT_SYMBOL(set_user_sigmask);
#ifdef CONFIG_COMPAT -int set_compat_user_sigmask(const compat_sigset_t __user *usigmask, - sigset_t *set, sigset_t *oldset, +int set_compat_user_sigmask(const compat_sigset_t __user *umask, size_t sigsetsize) { - if (!usigmask) - return 0; + sigset_t kmask;
+ if (!umask) + return 0; if (sigsetsize != sizeof(compat_sigset_t)) return -EINVAL; - if (get_compat_sigset(set, usigmask)) + if (get_compat_sigset(&kmask, umask)) return -EFAULT;
- *oldset = current->blocked; - set_current_blocked(set); + set_restore_sigmask(); + current->saved_sigmask = current->blocked; + set_current_blocked(&kmask);
return 0; } -EXPORT_SYMBOL(set_compat_user_sigmask); #endif
-/* - * restore_user_sigmask: - * usigmask: sigmask passed in from userland. - * sigsaved: saved sigmask when the syscall started and changed the sigmask to - * usigmask. - * - * This is useful for syscalls such as ppoll, pselect, io_pgetevents and - * epoll_pwait where a new sigmask is passed in from userland for the syscalls. - */ -void restore_user_sigmask(const void __user *usigmask, sigset_t *sigsaved, - bool interrupted) -{ - - if (!usigmask) - return; - /* - * When signals are pending, do not restore them here. - * Restoring sigmask here can lead to delivering signals that the above - * syscalls are intended to block because of the sigmask passed in. - */ - if (interrupted) { - current->saved_sigmask = *sigsaved; - set_restore_sigmask(); - return; - } - - /* - * This is needed because the fast syscall return path does not restore - * saved_sigmask when signals are not pending. - */ - set_current_blocked(sigsaved); -} -EXPORT_SYMBOL(restore_user_sigmask); - /** * sys_rt_sigprocmask - change the list of currently blocked signals * @how: whether to add, remove, or set signals
From: Oleg Nesterov oleg@redhat.com
mainline inclusion from mainline-5.3-rc1 commit 8cf8b5539a414da3257db6d121bcee2d883135cb category: feature bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27 CVE: NA ---------------------------
do_poll() returns -EINTR if interrupted, and after that all of its callers have to translate it into -ERESTARTNOHAND. Change do_poll() to return -ERESTARTNOHAND directly and update (simplify) the callers.

Note that this also unifies all users of restore_saved_sigmask_unless(); see the next patch.
Linus:
: The *right* return value will actually be then chosen by : poll_select_copy_remaining(), which will turn ERESTARTNOHAND to EINTR : when it can't update the timeout. : : Except for the cases that use restart_block and do that instead and : don't have the whole timeout restart issue as a result.
Link: http://lkml.kernel.org/r/20190606140852.GB13440@redhat.com Signed-off-by: Oleg Nesterov oleg@redhat.com Acked-by: Linus Torvalds torvalds@linux-foundation.org Cc: Al Viro viro@ZenIV.linux.org.uk Cc: Arnd Bergmann arnd@arndb.de Cc: David Laight David.Laight@aculab.com Cc: Davidlohr Bueso dave@stgolabs.net Cc: Deepa Dinamani deepa.kernel@gmail.com Cc: Eric W. Biederman ebiederm@xmission.com Cc: Eric Wong e@80x24.org Cc: Jason Baron jbaron@akamai.com Cc: Jens Axboe axboe@kernel.dk Cc: Thomas Gleixner tglx@linutronix.de Signed-off-by: Andrew Morton akpm@linux-foundation.org Signed-off-by: Linus Torvalds torvalds@linux-foundation.org Signed-off-by: Zhihao Cheng chengzhihao1@huawei.com
Conflicts: fs/select.c
Signed-off-by: yangerkun yangerkun@huawei.com Reviewed-by: zhangyi (F) yi.zhang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- fs/select.c | 30 +++++++----------------------- 1 file changed, 7 insertions(+), 23 deletions(-)
diff --git a/fs/select.c b/fs/select.c index baed50c60083..bf2395de6437 100644 --- a/fs/select.c +++ b/fs/select.c @@ -925,7 +925,7 @@ static int do_poll(struct poll_list *list, struct poll_wqueues *wait, if (!count) { count = wait->error; if (signal_pending(current)) - count = -EINTR; + count = -ERESTARTNOHAND; } if (count || timed_out) break; @@ -1040,7 +1040,7 @@ static long do_restart_poll(struct restart_block *restart_block)
ret = do_sys_poll(ufds, nfds, to);
- if (ret == -EINTR) + if (ret == -ERESTARTNOHAND) ret = set_restart_fn(restart_block, do_restart_poll);
return ret; @@ -1060,7 +1060,7 @@ SYSCALL_DEFINE3(poll, struct pollfd __user *, ufds, unsigned int, nfds,
ret = do_sys_poll(ufds, nfds, to);
- if (ret == -EINTR) { + if (ret == -ERESTARTNOHAND) { struct restart_block *restart_block;
restart_block = ¤t->restart_block; @@ -1100,11 +1100,7 @@ SYSCALL_DEFINE5(ppoll, struct pollfd __user *, ufds, unsigned int, nfds, return ret;
ret = do_sys_poll(ufds, nfds, to); - - restore_saved_sigmask_unless(ret == -EINTR); - /* We can restart this syscall, usually */ - if (ret == -EINTR) - ret = -ERESTARTNOHAND; + restore_saved_sigmask_unless(ret == -ERESTARTNOHAND); ret = poll_select_copy_remaining(&end_time, tsp, PT_TIMESPEC, ret);
return ret; @@ -1133,11 +1129,7 @@ SYSCALL_DEFINE5(ppoll_time32, struct pollfd __user *, ufds, unsigned int, nfds, return ret;
ret = do_sys_poll(ufds, nfds, to); - - restore_saved_sigmask_unless(ret == -EINTR); - /* We can restart this syscall, usually */ - if (ret == -EINTR) - ret = -ERESTARTNOHAND; + restore_saved_sigmask_unless(ret == -ERESTARTNOHAND); ret = poll_select_copy_remaining(&end_time, tsp, PT_OLD_TIMESPEC, ret);
return ret; @@ -1411,11 +1403,7 @@ COMPAT_SYSCALL_DEFINE5(ppoll, struct pollfd __user *, ufds, return ret;
ret = do_sys_poll(ufds, nfds, to); - - restore_saved_sigmask_unless(ret == -EINTR); - /* We can restart this syscall, usually */ - if (ret == -EINTR) - ret = -ERESTARTNOHAND; + restore_saved_sigmask_unless(ret == -ERESTARTNOHAND); ret = poll_select_copy_remaining(&end_time, tsp, PT_OLD_TIMESPEC, ret);
return ret; @@ -1444,11 +1432,7 @@ COMPAT_SYSCALL_DEFINE5(ppoll_time64, struct pollfd __user *, ufds, return ret;
ret = do_sys_poll(ufds, nfds, to); - - restore_saved_sigmask_unless(ret == -EINTR); - /* We can restart this syscall, usually */ - if (ret == -EINTR) - ret = -ERESTARTNOHAND; + restore_saved_sigmask_unless(ret == -ERESTARTNOHAND); ret = poll_select_copy_remaining(&end_time, tsp, PT_TIMESPEC, ret);
return ret;
From: Oleg Nesterov oleg@redhat.com
mainline inclusion from mainline-5.3-rc1 commit ac301020627e258a304f40cab5b35b6814a6f033 category: feature bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27 CVE: NA ---------------------------
Now that restore_saved_sigmask_unless() is always called with the same argument right before poll_select_copy_remaining() we can move it into poll_select_copy_remaining() and make it the only caller of restore() in fs/select.c.
The patch also renames poll_select_copy_remaining() to poll_select_finish(), which looks better after this change.
kern_select() doesn't use set_user_sigmask(), so in this case poll_select_finish() does restore_saved_sigmask_unless() "for no reason". But this won't hurt, and WARN_ON(!TIF_SIGPENDING) is still valid.
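For illustration, here is a minimal userspace sketch of the ppoll() semantics these patches preserve; wait_readable() is a hypothetical helper, not part of the patch. The kernel applies the temporary mask atomically around the wait and restores the caller's mask on return, unless the syscall is being transparently restarted, which is why the restore is keyed to -ERESTARTNOHAND rather than -EINTR:

    #define _GNU_SOURCE
    #include <poll.h>
    #include <signal.h>

    /* Block everything but SIGTERM for the duration of one ppoll() */
    static int wait_readable(int fd)
    {
            struct pollfd pfd = { .fd = fd, .events = POLLIN };
            sigset_t mask;

            sigfillset(&mask);
            sigdelset(&mask, SIGTERM);      /* SIGTERM may still interrupt */

            /* NULL timeout: wait indefinitely; the temporary mask is in
             * effect only while we sleep in the kernel. */
            return ppoll(&pfd, 1, NULL, &mask);
    }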
Link: http://lkml.kernel.org/r/20190606140915.GC13440@redhat.com Signed-off-by: Oleg Nesterov oleg@redhat.com Cc: Al Viro viro@ZenIV.linux.org.uk Cc: Arnd Bergmann arnd@arndb.de Cc: David Laight David.Laight@aculab.com Cc: Davidlohr Bueso dave@stgolabs.net Cc: Deepa Dinamani deepa.kernel@gmail.com Cc: Eric W. Biederman ebiederm@xmission.com Cc: Eric Wong e@80x24.org Cc: Jason Baron jbaron@akamai.com Cc: Jens Axboe axboe@kernel.dk Cc: Thomas Gleixner tglx@linutronix.de Signed-off-by: Andrew Morton akpm@linux-foundation.org Signed-off-by: Linus Torvalds torvalds@linux-foundation.org Signed-off-by: Zhihao Cheng chengzhihao1@huawei.com Signed-off-by: yangerkun yangerkun@huawei.com Reviewed-by: zhangyi (F) yi.zhang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- fs/select.c | 46 +++++++++++++--------------------------------- 1 file changed, 13 insertions(+), 33 deletions(-)
diff --git a/fs/select.c b/fs/select.c index bf2395de6437..b684f0dd6db8 100644 --- a/fs/select.c +++ b/fs/select.c @@ -294,12 +294,14 @@ enum poll_time_type { PT_OLD_TIMESPEC = 3, };
-static int poll_select_copy_remaining(struct timespec64 *end_time, - void __user *p, - enum poll_time_type pt_type, int ret) +static int poll_select_finish(struct timespec64 *end_time, + void __user *p, + enum poll_time_type pt_type, int ret) { struct timespec64 rts;
+ restore_saved_sigmask_unless(ret == -ERESTARTNOHAND); + if (!p) return ret;
@@ -714,9 +716,7 @@ static int kern_select(int n, fd_set __user *inp, fd_set __user *outp, }
ret = core_sys_select(n, inp, outp, exp, to); - ret = poll_select_copy_remaining(&end_time, tvp, PT_TIMEVAL, ret); - - return ret; + return poll_select_finish(&end_time, tvp, PT_TIMEVAL, ret); }
SYSCALL_DEFINE5(select, int, n, fd_set __user *, inp, fd_set __user *, outp, @@ -757,10 +757,7 @@ static long do_pselect(int n, fd_set __user *inp, fd_set __user *outp, return ret;
ret = core_sys_select(n, inp, outp, exp, to); - restore_saved_sigmask_unless(ret == -ERESTARTNOHAND); - ret = poll_select_copy_remaining(&end_time, tsp, type, ret); - - return ret; + return poll_select_finish(&end_time, tsp, type, ret); }
/* @@ -1100,10 +1097,7 @@ SYSCALL_DEFINE5(ppoll, struct pollfd __user *, ufds, unsigned int, nfds, return ret;
ret = do_sys_poll(ufds, nfds, to); - restore_saved_sigmask_unless(ret == -ERESTARTNOHAND); - ret = poll_select_copy_remaining(&end_time, tsp, PT_TIMESPEC, ret); - - return ret; + return poll_select_finish(&end_time, tsp, PT_TIMESPEC, ret); }
#if defined(CONFIG_COMPAT_32BIT_TIME) && !defined(CONFIG_64BIT) @@ -1129,10 +1123,7 @@ SYSCALL_DEFINE5(ppoll_time32, struct pollfd __user *, ufds, unsigned int, nfds, return ret;
ret = do_sys_poll(ufds, nfds, to); - restore_saved_sigmask_unless(ret == -ERESTARTNOHAND); - ret = poll_select_copy_remaining(&end_time, tsp, PT_OLD_TIMESPEC, ret); - - return ret; + return poll_select_finish(&end_time, tsp, PT_OLD_TIMESPEC, ret); } #endif
@@ -1269,9 +1260,7 @@ static int do_compat_select(int n, compat_ulong_t __user *inp, }
ret = compat_core_sys_select(n, inp, outp, exp, to); - ret = poll_select_copy_remaining(&end_time, tvp, PT_OLD_TIMEVAL, ret); - - return ret; + return poll_select_finish(&end_time, tvp, PT_OLD_TIMEVAL, ret); }
COMPAT_SYSCALL_DEFINE5(select, int, n, compat_ulong_t __user *, inp, @@ -1331,10 +1320,7 @@ static long do_compat_pselect(int n, compat_ulong_t __user *inp, return ret;
ret = compat_core_sys_select(n, inp, outp, exp, to); - restore_saved_sigmask_unless(ret == -ERESTARTNOHAND); - ret = poll_select_copy_remaining(&end_time, tsp, type, ret); - - return ret; + return poll_select_finish(&end_time, tsp, type, ret); }
COMPAT_SYSCALL_DEFINE6(pselect6_time64, int, n, compat_ulong_t __user *, inp, @@ -1403,10 +1389,7 @@ COMPAT_SYSCALL_DEFINE5(ppoll, struct pollfd __user *, ufds, return ret;
ret = do_sys_poll(ufds, nfds, to); - restore_saved_sigmask_unless(ret == -ERESTARTNOHAND); - ret = poll_select_copy_remaining(&end_time, tsp, PT_OLD_TIMESPEC, ret); - - return ret; + return poll_select_finish(&end_time, tsp, PT_OLD_TIMESPEC, ret); } #endif
@@ -1432,10 +1415,7 @@ COMPAT_SYSCALL_DEFINE5(ppoll_time64, struct pollfd __user *, ufds, return ret;
ret = do_sys_poll(ufds, nfds, to); - restore_saved_sigmask_unless(ret == -ERESTARTNOHAND); - ret = poll_select_copy_remaining(&end_time, tsp, PT_TIMESPEC, ret); - - return ret; + return poll_select_finish(&end_time, tsp, PT_TIMESPEC, ret); }
#endif
From: Jens Axboe axboe@kernel.dk
mainline inclusion from mainline-5.3-rc6 commit 500f9fbadef86466a435726192f4ca4df7d94236 category: feature bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27 CVE: NA ---------------------------
If a request issue ends up being punted to async context to avoid blocking, we can get into a situation where the original application enters the poll loop for that very request before it has been issued. This should not be an issue, except that the polling will hold the io_uring ctx->uring_lock mutex for the duration of the poll. When the async worker has actually issued the request, it needs to acquire this mutex to add the request to the poll issued list. Since the application polling is already holding this mutex, the workqueue sleeps on the mutex forever, and the application thus never gets a chance to poll for the very request it was interested in.

Fix this by ensuring that the polling drops the uring_lock occasionally if it's not making any progress.
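Reduced to a userspace sketch with pthreads (the names are illustrative, this is not the kernel code), the shape of the fix is a spin loop that periodically drops and re-takes the mutex so that a blocked worker can make progress:

    #include <pthread.h>

    /* Spin until reap() reports progress, but every 8th iteration
     * release the lock briefly so a worker blocked on it can run. */
    static void poll_loop(pthread_mutex_t *lock, int (*reap)(void))
    {
            unsigned iters = 0;

            pthread_mutex_lock(lock);
            while (!reap()) {
                    if (!(++iters & 7)) {
                            pthread_mutex_unlock(lock);
                            pthread_mutex_lock(lock);
                    }
            }
            pthread_mutex_unlock(lock);
    }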
Reported-by: Jeffrey M. Birnbaum jmbnyc@gmail.com Signed-off-by: Jens Axboe axboe@kernel.dk Signed-off-by: Zhihao Cheng chengzhihao1@huawei.com Signed-off-by: yangerkun yangerkun@huawei.com Reviewed-by: zhangyi (F) yi.zhang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- fs/io_uring.c | 36 +++++++++++++++++++++++++----------- 1 file changed, 25 insertions(+), 11 deletions(-)
diff --git a/fs/io_uring.c b/fs/io_uring.c index 571a8c1b6ec5..023dacbefb38 100644 --- a/fs/io_uring.c +++ b/fs/io_uring.c @@ -804,11 +804,34 @@ static void io_iopoll_reap_events(struct io_ring_ctx *ctx) static int io_iopoll_check(struct io_ring_ctx *ctx, unsigned *nr_events, long min) { - int ret = 0; + int iters, ret = 0; + + /* + * We disallow the app entering submit/complete with polling, but we + * still need to lock the ring to prevent racing with polled issue + * that got punted to a workqueue. + */ + mutex_lock(&ctx->uring_lock);
+ iters = 0; do { int tmin = 0;
+ /* + * If a submit got punted to a workqueue, we can have the + * application entering polling for a command before it gets + * issued. That app will hold the uring_lock for the duration + * of the poll right here, so we need to take a breather every + * now and then to ensure that the issue has a chance to add + * the poll to the issued list. Otherwise we can spin here + * forever, while the workqueue is stuck trying to acquire the + * very same mutex. + */ + if (!(++iters & 7)) { + mutex_unlock(&ctx->uring_lock); + mutex_lock(&ctx->uring_lock); + } + if (*nr_events < min) tmin = min - *nr_events;
@@ -818,6 +841,7 @@ static int io_iopoll_check(struct io_ring_ctx *ctx, unsigned *nr_events, ret = 0; } while (min && !*nr_events && !need_resched());
+ mutex_unlock(&ctx->uring_lock); return ret; }
@@ -2279,15 +2303,7 @@ static int io_sq_thread(void *data) unsigned nr_events = 0;
if (ctx->flags & IORING_SETUP_IOPOLL) { - /* - * We disallow the app entering submit/complete - * with polling, but we still need to lock the - * ring to prevent racing with polled issue - * that got punted to a workqueue. - */ - mutex_lock(&ctx->uring_lock); io_iopoll_check(ctx, &nr_events, 0); - mutex_unlock(&ctx->uring_lock); } else { /* * Normal IO, just pretend everything completed. @@ -3188,9 +3204,7 @@ SYSCALL_DEFINE6(io_uring_enter, unsigned int, fd, u32, to_submit, min_complete = min(min_complete, ctx->cq_entries);
if (ctx->flags & IORING_SETUP_IOPOLL) { - mutex_lock(&ctx->uring_lock); ret = io_iopoll_check(ctx, &nr_events, min_complete); - mutex_unlock(&ctx->uring_lock); } else { ret = io_cqring_wait(ctx, min_complete, sig, sigsz); }
From: Jens Axboe axboe@kernel.dk
mainline inclusion from mainline-5.3-rc6 commit a3a0e43fd77013819e4b6f55e37e0efe8e35d805 category: feature bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27 CVE: NA ---------------------------
We need to check if we have CQEs pending before starting a poll loop, as those could be the events we will be spinning for (and hence we'll find none). This can happen if a CQE triggers an error, or if it is found by eg an IRQ before we get a chance to find it through polling.
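For context, a userspace analogue of io_cqring_events() under the documented ring ABI: the number of pending CQEs is the distance between the shared tail (written by the kernel) and head (written by the application). This sketch assumes cq_ptr is the mmap of IORING_OFF_CQ_RING and p is the io_uring_params filled in by io_uring_setup():

    #include <linux/io_uring.h>
    #include <stdatomic.h>

    static unsigned cq_ready(void *cq_ptr, struct io_uring_params *p)
    {
            _Atomic unsigned *khead = (void *)((char *)cq_ptr + p->cq_off.head);
            _Atomic unsigned *ktail = (void *)((char *)cq_ptr + p->cq_off.tail);

            /* acquire pairs with the kernel's release store of the tail */
            return atomic_load_explicit(ktail, memory_order_acquire) -
                   atomic_load_explicit(khead, memory_order_relaxed);
    }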
Signed-off-by: Jens Axboe axboe@kernel.dk Signed-off-by: Zhihao Cheng chengzhihao1@huawei.com Signed-off-by: yangerkun yangerkun@huawei.com Reviewed-by: zhangyi (F) yi.zhang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- fs/io_uring.c | 22 +++++++++++++++------- 1 file changed, 15 insertions(+), 7 deletions(-)
diff --git a/fs/io_uring.c b/fs/io_uring.c index 023dacbefb38..c059b065aca2 100644 --- a/fs/io_uring.c +++ b/fs/io_uring.c @@ -678,6 +678,13 @@ static void io_put_req(struct io_kiocb *req) io_free_req(req); }
+static unsigned io_cqring_events(struct io_cq_ring *ring) +{ + /* See comment at the top of this file */ + smp_rmb(); + return READ_ONCE(ring->r.tail) - READ_ONCE(ring->r.head); +} + /* * Find and free completed poll iocbs */ @@ -817,6 +824,14 @@ static int io_iopoll_check(struct io_ring_ctx *ctx, unsigned *nr_events, do { int tmin = 0;
+ /* + * Don't enter poll loop if we already have events pending. + * If we do, we can potentially be spinning for commands that + * already triggered a CQE (eg in error). + */ + if (io_cqring_events(ctx->cq_ring)) + break; + /* * If a submit got punted to a workqueue, we can have the * application entering polling for a command before it gets @@ -2448,13 +2463,6 @@ static int io_ring_submit(struct io_ring_ctx *ctx, unsigned int to_submit) return submit; }
-static unsigned io_cqring_events(struct io_cq_ring *ring) -{ - /* See comment at the top of this file */ - smp_rmb(); - return READ_ONCE(ring->r.tail) - READ_ONCE(ring->r.head); -} - /* * Wait until events become available, if we don't already have some. The * application must reap them itself, as they reside on the shared cq ring.
From: Jens Axboe axboe@kernel.dk
mainline inclusion from mainline-5.3-rc6 commit 08f5439f1df25a6cf6cf4c72cf6c13025599ce67 category: feature bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27 CVE: NA ---------------------------
The outer poll loop checks whether we need to reschedule, and returns to userspace if we do. However, it's possible to get stuck in the inner loop as well, if the CPU we are running on needs to reschedule to finish the IO work.
Add the need_resched() check in the inner loop as well. This fixes a potential hang if the kernel is configured with CONFIG_PREEMPT_VOLUNTARY=y.
Reported-by: Sagi Grimberg sagi@grimberg.me Reviewed-by: Sagi Grimberg sagi@grimberg.me Tested-by: Sagi Grimberg sagi@grimberg.me Signed-off-by: Jens Axboe axboe@kernel.dk Signed-off-by: Zhihao Cheng chengzhihao1@huawei.com Signed-off-by: yangerkun yangerkun@huawei.com Reviewed-by: zhangyi (F) yi.zhang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- fs/io_uring.c | 8 +++++++- 1 file changed, 7 insertions(+), 1 deletion(-)
diff --git a/fs/io_uring.c b/fs/io_uring.c index c059b065aca2..a4127b301106 100644 --- a/fs/io_uring.c +++ b/fs/io_uring.c @@ -777,7 +777,7 @@ static int io_do_iopoll(struct io_ring_ctx *ctx, unsigned int *nr_events, static int io_iopoll_getevents(struct io_ring_ctx *ctx, unsigned int *nr_events, long min) { - while (!list_empty(&ctx->poll_list)) { + while (!list_empty(&ctx->poll_list) && !need_resched()) { int ret;
ret = io_do_iopoll(ctx, nr_events, min); @@ -804,6 +804,12 @@ static void io_iopoll_reap_events(struct io_ring_ctx *ctx) unsigned int nr_events = 0;
io_iopoll_getevents(ctx, &nr_events, 1); + + /* + * Ensure we allow local-to-the-cpu processing to take place, + * in this case we need to ensure that we reap all events. + */ + cond_resched(); } mutex_unlock(&ctx->uring_lock); }
From: Hristo Venev hristo@venev.name
mainline inclusion from mainline-5.4-rc1 commit 75b28affdd6aed1c68073ef53907c7bd822aff84 category: feature bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27 CVE: NA ---------------------------
Both the sq and the cq rings have sizes just over a power of two, and the sq ring is significantly smaller. By bundling them in a single allocation, we get the sq ring for free.
This also means that IORING_OFF_SQ_RING and IORING_OFF_CQ_RING now mean the same thing. If we indicate this to userspace, we can save a mmap call.
Signed-off-by: Hristo Venev hristo@venev.name Signed-off-by: Jens Axboe axboe@kernel.dk Signed-off-by: Zhihao Cheng chengzhihao1@huawei.com Signed-off-by: yangerkun yangerkun@huawei.com Reviewed-by: zhangyi (F) yi.zhang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- fs/io_uring.c | 255 +++++++++++++++++++++++++------------------------- 1 file changed, 128 insertions(+), 127 deletions(-)
diff --git a/fs/io_uring.c b/fs/io_uring.c index a4127b301106..a901f4c91c28 100644 --- a/fs/io_uring.c +++ b/fs/io_uring.c @@ -84,27 +84,29 @@ struct io_uring { };
/* - * This data is shared with the application through the mmap at offset - * IORING_OFF_SQ_RING. + * This data is shared with the application through the mmap at offsets + * IORING_OFF_SQ_RING and IORING_OFF_CQ_RING. * * The offsets to the member fields are published through struct * io_sqring_offsets when calling io_uring_setup. */ -struct io_sq_ring { +struct io_rings { /* * Head and tail offsets into the ring; the offsets need to be * masked to get valid indices. * - * The kernel controls head and the application controls tail. + * The kernel controls head of the sq ring and the tail of the cq ring, + * and the application controls tail of the sq ring and the head of the + * cq ring. */ - struct io_uring r; + struct io_uring sq, cq; /* - * Bitmask to apply to head and tail offsets (constant, equals + * Bitmasks to apply to head and tail offsets (constant, equals * ring_entries - 1) */ - u32 ring_mask; - /* Ring size (constant, power of 2) */ - u32 ring_entries; + u32 sq_ring_mask, cq_ring_mask; + /* Ring sizes (constant, power of 2) */ + u32 sq_ring_entries, cq_ring_entries; /* * Number of invalid entries dropped by the kernel due to * invalid index stored in array @@ -117,7 +119,7 @@ struct io_sq_ring { * counter includes all submissions that were dropped reaching * the new SQ head (and possibly more). */ - u32 dropped; + u32 sq_dropped; /* * Runtime flags * @@ -127,43 +129,7 @@ struct io_sq_ring { * The application needs a full memory barrier before checking * for IORING_SQ_NEED_WAKEUP after updating the sq tail. */ - u32 flags; - /* - * Ring buffer of indices into array of io_uring_sqe, which is - * mmapped by the application using the IORING_OFF_SQES offset. - * - * This indirection could e.g. be used to assign fixed - * io_uring_sqe entries to operations and only submit them to - * the queue when needed. - * - * The kernel modifies neither the indices array nor the entries - * array. - */ - u32 array[]; -}; - -/* - * This data is shared with the application through the mmap at offset - * IORING_OFF_CQ_RING. - * - * The offsets to the member fields are published through struct - * io_cqring_offsets when calling io_uring_setup. - */ -struct io_cq_ring { - /* - * Head and tail offsets into the ring; the offsets need to be - * masked to get valid indices. - * - * The application controls head and the kernel tail. - */ - struct io_uring r; - /* - * Bitmask to apply to head and tail offsets (constant, equals - * ring_entries - 1) - */ - u32 ring_mask; - /* Ring size (constant, power of 2) */ - u32 ring_entries; + u32 sq_flags; /* * Number of completion events lost because the queue was full; * this should be avoided by the application by making sure @@ -177,7 +143,7 @@ struct io_cq_ring { * As completion events come in out of order this counter is not * ordered with any other data. */ - u32 overflow; + u32 cq_overflow; /* * Ring buffer of completion events. * @@ -185,7 +151,7 @@ struct io_cq_ring { * produced, so the application is allowed to modify pending * entries. */ - struct io_uring_cqe cqes[]; + struct io_uring_cqe cqes[] ____cacheline_aligned_in_smp; };
struct io_mapped_ubuf { @@ -215,8 +181,18 @@ struct io_ring_ctx { bool compat; bool account_mem;
- /* SQ ring */ - struct io_sq_ring *sq_ring; + /* + * Ring buffer of indices into array of io_uring_sqe, which is + * mmapped by the application using the IORING_OFF_SQES offset. + * + * This indirection could e.g. be used to assign fixed + * io_uring_sqe entries to operations and only submit them to + * the queue when needed. + * + * The kernel modifies neither the indices array nor the entries + * array. + */ + u32 *sq_array; unsigned cached_sq_head; unsigned sq_entries; unsigned sq_mask; @@ -234,8 +210,6 @@ struct io_ring_ctx { struct completion sqo_thread_started;
struct { - /* CQ ring */ - struct io_cq_ring *cq_ring; unsigned cached_cq_tail; unsigned cq_entries; unsigned cq_mask; @@ -244,6 +218,8 @@ struct io_ring_ctx { struct eventfd_ctx *cq_ev_fd; } ____cacheline_aligned_in_smp;
+ struct io_rings *rings; + /* * If used, fixed file set. Writers must ensure that ->refs is dead, * readers must ensure that ->refs is alive as long as the file* is @@ -429,7 +405,7 @@ static inline bool io_sequence_defer(struct io_ring_ctx *ctx, if ((req->flags & (REQ_F_IO_DRAIN|REQ_F_IO_DRAINED)) != REQ_F_IO_DRAIN) return false;
- return req->sequence != ctx->cached_cq_tail + ctx->sq_ring->dropped; + return req->sequence != ctx->cached_cq_tail + ctx->rings->sq_dropped; }
static struct io_kiocb *io_get_deferred_req(struct io_ring_ctx *ctx) @@ -450,11 +426,11 @@ static struct io_kiocb *io_get_deferred_req(struct io_ring_ctx *ctx)
static void __io_commit_cqring(struct io_ring_ctx *ctx) { - struct io_cq_ring *ring = ctx->cq_ring; + struct io_rings *rings = ctx->rings;
- if (ctx->cached_cq_tail != READ_ONCE(ring->r.tail)) { + if (ctx->cached_cq_tail != READ_ONCE(rings->cq.tail)) { /* order cqe stores with ring update */ - smp_store_release(&ring->r.tail, ctx->cached_cq_tail); + smp_store_release(&rings->cq.tail, ctx->cached_cq_tail);
if (wq_has_sleeper(&ctx->cq_wait)) { wake_up_interruptible(&ctx->cq_wait); @@ -477,7 +453,7 @@ static void io_commit_cqring(struct io_ring_ctx *ctx)
static struct io_uring_cqe *io_get_cqring(struct io_ring_ctx *ctx) { - struct io_cq_ring *ring = ctx->cq_ring; + struct io_rings *rings = ctx->rings; unsigned tail;
tail = ctx->cached_cq_tail; @@ -486,11 +462,11 @@ static struct io_uring_cqe *io_get_cqring(struct io_ring_ctx *ctx) * control dependency is enough as we're using WRITE_ONCE to * fill the cq entry */ - if (tail - READ_ONCE(ring->r.head) == ring->ring_entries) + if (tail - READ_ONCE(rings->cq.head) == rings->cq_ring_entries) return NULL;
ctx->cached_cq_tail++; - return &ring->cqes[tail & ctx->cq_mask]; + return &rings->cqes[tail & ctx->cq_mask]; }
static void io_cqring_fill_event(struct io_ring_ctx *ctx, u64 ki_user_data, @@ -509,9 +485,9 @@ static void io_cqring_fill_event(struct io_ring_ctx *ctx, u64 ki_user_data, WRITE_ONCE(cqe->res, res); WRITE_ONCE(cqe->flags, 0); } else { - unsigned overflow = READ_ONCE(ctx->cq_ring->overflow); + unsigned overflow = READ_ONCE(ctx->rings->cq_overflow);
- WRITE_ONCE(ctx->cq_ring->overflow, overflow + 1); + WRITE_ONCE(ctx->rings->cq_overflow, overflow + 1); } }
@@ -678,11 +654,11 @@ static void io_put_req(struct io_kiocb *req) io_free_req(req); }
-static unsigned io_cqring_events(struct io_cq_ring *ring) +static unsigned io_cqring_events(struct io_rings *rings) { /* See comment at the top of this file */ smp_rmb(); - return READ_ONCE(ring->r.tail) - READ_ONCE(ring->r.head); + return READ_ONCE(rings->cq.tail) - READ_ONCE(rings->cq.head); }
/* @@ -835,7 +811,7 @@ static int io_iopoll_check(struct io_ring_ctx *ctx, unsigned *nr_events, * If we do, we can potentially be spinning for commands that * already triggered a CQE (eg in error). */ - if (io_cqring_events(ctx->cq_ring)) + if (io_cqring_events(ctx->rings)) break;
/* @@ -2204,15 +2180,15 @@ static void io_submit_state_start(struct io_submit_state *state,
static void io_commit_sqring(struct io_ring_ctx *ctx) { - struct io_sq_ring *ring = ctx->sq_ring; + struct io_rings *rings = ctx->rings;
- if (ctx->cached_sq_head != READ_ONCE(ring->r.head)) { + if (ctx->cached_sq_head != READ_ONCE(rings->sq.head)) { /* * Ensure any loads from the SQEs are done at this point, * since once we write the new head, the application could * write new data to them. */ - smp_store_release(&ring->r.head, ctx->cached_sq_head); + smp_store_release(&rings->sq.head, ctx->cached_sq_head); } }
@@ -2226,7 +2202,8 @@ static void io_commit_sqring(struct io_ring_ctx *ctx) */ static bool io_get_sqring(struct io_ring_ctx *ctx, struct sqe_submit *s) { - struct io_sq_ring *ring = ctx->sq_ring; + struct io_rings *rings = ctx->rings; + u32 *sq_array = ctx->sq_array; unsigned head;
/* @@ -2239,10 +2216,10 @@ static bool io_get_sqring(struct io_ring_ctx *ctx, struct sqe_submit *s) */ head = ctx->cached_sq_head; /* make sure SQ entry isn't read before tail */ - if (head == smp_load_acquire(&ring->r.tail)) + if (head == smp_load_acquire(&rings->sq.tail)) return false;
- head = READ_ONCE(ring->array[head & ctx->sq_mask]); + head = READ_ONCE(sq_array[head & ctx->sq_mask]); if (head < ctx->sq_entries) { s->index = head; s->sqe = &ctx->sq_sqes[head]; @@ -2252,7 +2229,7 @@ static bool io_get_sqring(struct io_ring_ctx *ctx, struct sqe_submit *s)
/* drop invalid entries */ ctx->cached_sq_head++; - ring->dropped++; + rings->sq_dropped++; return false; }
@@ -2365,7 +2342,7 @@ static int io_sq_thread(void *data) TASK_INTERRUPTIBLE);
/* Tell userspace we may need a wakeup call */ - ctx->sq_ring->flags |= IORING_SQ_NEED_WAKEUP; + ctx->rings->sq_flags |= IORING_SQ_NEED_WAKEUP; /* make sure to read SQ tail after writing flags */ smp_mb();
@@ -2379,12 +2356,12 @@ static int io_sq_thread(void *data) schedule(); finish_wait(&ctx->sqo_wait, &wait);
- ctx->sq_ring->flags &= ~IORING_SQ_NEED_WAKEUP; + ctx->rings->sq_flags &= ~IORING_SQ_NEED_WAKEUP; continue; } finish_wait(&ctx->sqo_wait, &wait);
- ctx->sq_ring->flags &= ~IORING_SQ_NEED_WAKEUP; + ctx->rings->sq_flags &= ~IORING_SQ_NEED_WAKEUP; }
i = 0; @@ -2476,10 +2453,10 @@ static int io_ring_submit(struct io_ring_ctx *ctx, unsigned int to_submit) static int io_cqring_wait(struct io_ring_ctx *ctx, int min_events, const sigset_t __user *sig, size_t sigsz) { - struct io_cq_ring *ring = ctx->cq_ring; + struct io_rings *rings = ctx->rings; int ret;
- if (io_cqring_events(ring) >= min_events) + if (io_cqring_events(rings) >= min_events) return 0;
if (sig) { @@ -2495,12 +2472,12 @@ static int io_cqring_wait(struct io_ring_ctx *ctx, int min_events, return ret; }
- ret = wait_event_interruptible(ctx->wait, io_cqring_events(ring) >= min_events); + ret = wait_event_interruptible(ctx->wait, io_cqring_events(rings) >= min_events); restore_saved_sigmask_unless(ret == -ERESTARTSYS); if (ret == -ERESTARTSYS) ret = -EINTR;
- return READ_ONCE(ring->r.head) == READ_ONCE(ring->r.tail) ? ret : 0; + return READ_ONCE(rings->cq.head) == READ_ONCE(rings->cq.tail) ? ret : 0; }
static void __io_sqe_files_unregister(struct io_ring_ctx *ctx) @@ -2820,17 +2797,45 @@ static void *io_mem_alloc(size_t size) return (void *) __get_free_pages(gfp_flags, get_order(size)); }
+static unsigned long rings_size(unsigned sq_entries, unsigned cq_entries, + size_t *sq_offset) +{ + struct io_rings *rings; + size_t off, sq_array_size; + + off = struct_size(rings, cqes, cq_entries); + if (off == SIZE_MAX) + return SIZE_MAX; + +#ifdef CONFIG_SMP + off = ALIGN(off, SMP_CACHE_BYTES); + if (off == 0) + return SIZE_MAX; +#endif + + sq_array_size = array_size(sizeof(u32), sq_entries); + if (sq_array_size == SIZE_MAX) + return SIZE_MAX; + + if (check_add_overflow(off, sq_array_size, &off)) + return SIZE_MAX; + + if (sq_offset) + *sq_offset = off; + + return off; +} + static unsigned long ring_pages(unsigned sq_entries, unsigned cq_entries) { - struct io_sq_ring *sq_ring; - struct io_cq_ring *cq_ring; - size_t bytes; + size_t pages;
- bytes = struct_size(sq_ring, array, sq_entries); - bytes += array_size(sizeof(struct io_uring_sqe), sq_entries); - bytes += struct_size(cq_ring, cqes, cq_entries); + pages = (size_t)1 << get_order( + rings_size(sq_entries, cq_entries, NULL)); + pages += (size_t)1 << get_order( + array_size(sizeof(struct io_uring_sqe), sq_entries));
- return (bytes + PAGE_SIZE - 1) / PAGE_SIZE; + return pages; }
static int io_sqe_buffer_unregister(struct io_ring_ctx *ctx) @@ -3076,9 +3081,8 @@ static void io_ring_ctx_free(struct io_ring_ctx *ctx) } #endif
- io_mem_free(ctx->sq_ring); + io_mem_free(ctx->rings); io_mem_free(ctx->sq_sqes); - io_mem_free(ctx->cq_ring);
percpu_ref_exit(&ctx->refs); if (ctx->account_mem) @@ -3099,10 +3103,10 @@ static __poll_t io_uring_poll(struct file *file, poll_table *wait) * io_commit_cqring */ smp_rmb(); - if (READ_ONCE(ctx->sq_ring->r.tail) - ctx->cached_sq_head != - ctx->sq_ring->ring_entries) + if (READ_ONCE(ctx->rings->sq.tail) - ctx->cached_sq_head != + ctx->rings->sq_ring_entries) mask |= EPOLLOUT | EPOLLWRNORM; - if (READ_ONCE(ctx->cq_ring->r.head) != ctx->cached_cq_tail) + if (READ_ONCE(ctx->rings->sq.head) != ctx->cached_cq_tail) mask |= EPOLLIN | EPOLLRDNORM;
return mask; @@ -3147,14 +3151,12 @@ static int io_uring_mmap(struct file *file, struct vm_area_struct *vma)
switch (offset) { case IORING_OFF_SQ_RING: - ptr = ctx->sq_ring; + case IORING_OFF_CQ_RING: + ptr = ctx->rings; break; case IORING_OFF_SQES: ptr = ctx->sq_sqes; break; - case IORING_OFF_CQ_RING: - ptr = ctx->cq_ring; - break; default: return -EINVAL; } @@ -3241,19 +3243,27 @@ static const struct file_operations io_uring_fops = { static int io_allocate_scq_urings(struct io_ring_ctx *ctx, struct io_uring_params *p) { - struct io_sq_ring *sq_ring; - struct io_cq_ring *cq_ring; - size_t size; + struct io_rings *rings; + size_t size, sq_array_offset;
- sq_ring = io_mem_alloc(struct_size(sq_ring, array, p->sq_entries)); - if (!sq_ring) + size = rings_size(p->sq_entries, p->cq_entries, &sq_array_offset); + if (size == SIZE_MAX) + return -EOVERFLOW; + + rings = io_mem_alloc(size); + if (!rings) return -ENOMEM;
- ctx->sq_ring = sq_ring; - sq_ring->ring_mask = p->sq_entries - 1; - sq_ring->ring_entries = p->sq_entries; - ctx->sq_mask = sq_ring->ring_mask; - ctx->sq_entries = sq_ring->ring_entries; + ctx->rings = rings; + ctx->sq_array = (u32 *)((char *)rings + sq_array_offset); + rings->sq_ring_mask = p->sq_entries - 1; + rings->cq_ring_mask = p->cq_entries - 1; + rings->sq_ring_entries = p->sq_entries; + rings->cq_ring_entries = p->cq_entries; + ctx->sq_mask = rings->sq_ring_mask; + ctx->cq_mask = rings->cq_ring_mask; + ctx->sq_entries = rings->sq_ring_entries; + ctx->cq_entries = rings->cq_ring_entries;
size = array_size(sizeof(struct io_uring_sqe), p->sq_entries); if (size == SIZE_MAX) @@ -3263,15 +3273,6 @@ static int io_allocate_scq_urings(struct io_ring_ctx *ctx, if (!ctx->sq_sqes) return -ENOMEM;
- cq_ring = io_mem_alloc(struct_size(cq_ring, cqes, p->cq_entries)); - if (!cq_ring) - return -ENOMEM; - - ctx->cq_ring = cq_ring; - cq_ring->ring_mask = p->cq_entries - 1; - cq_ring->ring_entries = p->cq_entries; - ctx->cq_mask = cq_ring->ring_mask; - ctx->cq_entries = cq_ring->ring_entries; return 0; }
@@ -3375,21 +3376,21 @@ static int io_uring_create(unsigned entries, struct io_uring_params *p) goto err;
memset(&p->sq_off, 0, sizeof(p->sq_off)); - p->sq_off.head = offsetof(struct io_sq_ring, r.head); - p->sq_off.tail = offsetof(struct io_sq_ring, r.tail); - p->sq_off.ring_mask = offsetof(struct io_sq_ring, ring_mask); - p->sq_off.ring_entries = offsetof(struct io_sq_ring, ring_entries); - p->sq_off.flags = offsetof(struct io_sq_ring, flags); - p->sq_off.dropped = offsetof(struct io_sq_ring, dropped); - p->sq_off.array = offsetof(struct io_sq_ring, array); + p->sq_off.head = offsetof(struct io_rings, sq.head); + p->sq_off.tail = offsetof(struct io_rings, sq.tail); + p->sq_off.ring_mask = offsetof(struct io_rings, sq_ring_mask); + p->sq_off.ring_entries = offsetof(struct io_rings, sq_ring_entries); + p->sq_off.flags = offsetof(struct io_rings, sq_flags); + p->sq_off.dropped = offsetof(struct io_rings, sq_dropped); + p->sq_off.array = (char *)ctx->sq_array - (char *)ctx->rings;
memset(&p->cq_off, 0, sizeof(p->cq_off)); - p->cq_off.head = offsetof(struct io_cq_ring, r.head); - p->cq_off.tail = offsetof(struct io_cq_ring, r.tail); - p->cq_off.ring_mask = offsetof(struct io_cq_ring, ring_mask); - p->cq_off.ring_entries = offsetof(struct io_cq_ring, ring_entries); - p->cq_off.overflow = offsetof(struct io_cq_ring, overflow); - p->cq_off.cqes = offsetof(struct io_cq_ring, cqes); + p->cq_off.head = offsetof(struct io_rings, cq.head); + p->cq_off.tail = offsetof(struct io_rings, cq.tail); + p->cq_off.ring_mask = offsetof(struct io_rings, cq_ring_mask); + p->cq_off.ring_entries = offsetof(struct io_rings, cq_ring_entries); + p->cq_off.overflow = offsetof(struct io_rings, cq_overflow); + p->cq_off.cqes = offsetof(struct io_rings, cqes); return ret; err: io_ring_ctx_wait_and_kill(ctx);
From: Jens Axboe axboe@kernel.dk
mainline inclusion from mainline-5.4-rc1 commit ac90f249e15cd2a850daa9e36e15f81ce1ff6550 category: feature bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27 CVE: NA ---------------------------
After commit 75b28affdd6a we can get by with just a single mmap to map both the sq and cq ring. However, userspace doesn't know that.
Add a features variable to io_uring_params, and notify userspace that the kernel has this ability. This can then be used in liburing (or in applications directly) to avoid the second mmap.
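From userspace this could look like the following sketch, where fd and p come from io_uring_setup(); a real application must fall back to two separate maps when the feature bit is absent:

    #include <linux/io_uring.h>
    #include <sys/mman.h>

    /* Map both rings with one mmap() when the kernel supports it. */
    static void *map_rings(int fd, struct io_uring_params *p, size_t *len)
    {
            size_t sq_len = p->sq_off.array + p->sq_entries * sizeof(__u32);
            size_t cq_len = p->cq_off.cqes +
                            p->cq_entries * sizeof(struct io_uring_cqe);

            if (!(p->features & IORING_FEAT_SINGLE_MMAP))
                    return MAP_FAILED;      /* caller falls back to two maps */

            *len = sq_len > cq_len ? sq_len : cq_len;
            return mmap(NULL, *len, PROT_READ | PROT_WRITE,
                        MAP_SHARED | MAP_POPULATE, fd, IORING_OFF_SQ_RING);
    }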
Signed-off-by: Jens Axboe axboe@kernel.dk Signed-off-by: Zhihao Cheng chengzhihao1@huawei.com Signed-off-by: yangerkun yangerkun@huawei.com Reviewed-by: zhangyi (F) yi.zhang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- fs/io_uring.c | 2 ++ include/uapi/linux/io_uring.h | 8 +++++++- 2 files changed, 9 insertions(+), 1 deletion(-)
diff --git a/fs/io_uring.c b/fs/io_uring.c index a901f4c91c28..b721fb3dce92 100644 --- a/fs/io_uring.c +++ b/fs/io_uring.c @@ -3391,6 +3391,8 @@ static int io_uring_create(unsigned entries, struct io_uring_params *p) p->cq_off.ring_entries = offsetof(struct io_rings, cq_ring_entries); p->cq_off.overflow = offsetof(struct io_rings, cq_overflow); p->cq_off.cqes = offsetof(struct io_rings, cqes); + + p->features = IORING_FEAT_SINGLE_MMAP; return ret; err: io_ring_ctx_wait_and_kill(ctx); diff --git a/include/uapi/linux/io_uring.h b/include/uapi/linux/io_uring.h index 1e1652f25cc1..96ee9d94b73e 100644 --- a/include/uapi/linux/io_uring.h +++ b/include/uapi/linux/io_uring.h @@ -128,11 +128,17 @@ struct io_uring_params { __u32 flags; __u32 sq_thread_cpu; __u32 sq_thread_idle; - __u32 resv[5]; + __u32 features; + __u32 resv[4]; struct io_sqring_offsets sq_off; struct io_cqring_offsets cq_off; };
+/* + * io_uring_params->features flags + */ +#define IORING_FEAT_SINGLE_MMAP (1U << 0) + /* * io_uring_register(2) opcodes and arguments */
From: Jackie Liu liuyun01@kylinos.cn
mainline inclusion from mainline-5.4-rc1 commit 8776f3fa15a5cd213c4dfab7ddaf557983374ea6 category: feature bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27 CVE: NA ---------------------------
The sqo_thread will get sqring entries in batches, which causes ctx->cached_sq_head to be advanced in batches. If one of these sqes is set with the DRAIN flag, it will never get a chance to be processed, and eventually the sqo_thread will not exit.
Fixes: de0617e4671 ("io_uring: add support for marking commands as draining") Signed-off-by: Jackie Liu liuyun01@kylinos.cn Signed-off-by: Jens Axboe axboe@kernel.dk Signed-off-by: Zhihao Cheng chengzhihao1@huawei.com Signed-off-by: yangerkun yangerkun@huawei.com Reviewed-by: zhangyi (F) yi.zhang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- fs/io_uring.c | 4 +++- 1 file changed, 3 insertions(+), 1 deletion(-)
diff --git a/fs/io_uring.c b/fs/io_uring.c index b721fb3dce92..a48331512bed 100644 --- a/fs/io_uring.c +++ b/fs/io_uring.c @@ -264,6 +264,7 @@ struct io_ring_ctx { struct sqe_submit { const struct io_uring_sqe *sqe; unsigned short index; + u32 sequence; bool has_user; bool needs_lock; bool needs_fixed_file; @@ -2015,7 +2016,7 @@ static int io_req_set_file(struct io_ring_ctx *ctx, const struct sqe_submit *s,
if (flags & IOSQE_IO_DRAIN) { req->flags |= REQ_F_IO_DRAIN; - req->sequence = ctx->cached_sq_head - 1; + req->sequence = s->sequence; }
if (!io_op_needs_file(s->sqe)) @@ -2223,6 +2224,7 @@ static bool io_get_sqring(struct io_ring_ctx *ctx, struct sqe_submit *s) if (head < ctx->sq_entries) { s->index = head; s->sqe = &ctx->sq_sqes[head]; + s->sequence = ctx->cached_sq_head; ctx->cached_sq_head++; return true; }
From: Jackie Liu liuyun01@kylinos.cn
mainline inclusion from mainline-5.4-rc1 commit 4fe2c963154c31227bec2f2d690e01f9cab383ea category: feature bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27 CVE: NA ---------------------------
To support the link with drain, we need to do two things.

Consider the following sequence of sqes:
     0     1     2     3     4     5     6
  +-----+-----+-----+-----+-----+-----+-----+
  |  N  |  L  |  L  | L+D |  N  |  N  |  N  |
  +-----+-----+-----+-----+-----+-----+-----+
First, we need to ensure that the io before the link is completed. An easy way to do this is to set the drain flag on the head of the link list, so that all subsequent io will be inserted into the defer_list.
        +-----+
   (0)  |  N  |
        +-----+
           |
                      (2)         (3)         (4)
        +-----+     +-----+     +-----+     +-----+
   (1)  | L+D | --> |  L  | --> | L+D | --> |  N  |
        +-----+     +-----+     +-----+     +-----+
           |
        +-----+
   (5)  |  N  |
        +-----+
           |
        +-----+
   (6)  |  N  |
        +-----+
Second, we need to ensure that the following io will not be completed first. An easy way is to create a mirror of the drain io and insert it into the defer_list; as long as the drain io has not been processed, the following io in the defer_list will not be actively processed.
        +-----+
   (0)  |  N  |
        +-----+
           |
                      (2)         (3)         (4)
        +-----+     +-----+     +-----+     +-----+
   (1)  | L+D | --> |  L  | --> | L+D | --> |  N  |
        +-----+     +-----+     +-----+     +-----+
           |
        +-----+
  ('3)  |  D  |  <== This is a shadow of (3)
        +-----+
           |
        +-----+
   (5)  |  N  |
        +-----+
           |
        +-----+
   (6)  |  N  |
        +-----+
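In userspace terms, the chain in the diagrams above would be built with sqe flags along these lines (a sketch only; the sqe array comes from the application's own ring setup, and the indices match the boxes above):

    #include <linux/io_uring.h>

    static void build_chain(struct io_uring_sqe *sqe)
    {
            /* (1) and (2) carry IOSQE_IO_LINK, chaining (1)->(2)->(3) */
            sqe[1].flags |= IOSQE_IO_LINK;
            sqe[2].flags |= IOSQE_IO_LINK;
            /* (3) ends the chain and must drain everything before it */
            sqe[3].flags |= IOSQE_IO_DRAIN;
    }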
Signed-off-by: Jackie Liu liuyun01@kylinos.cn Signed-off-by: Jens Axboe axboe@kernel.dk Signed-off-by: Zhihao Cheng chengzhihao1@huawei.com Signed-off-by: yangerkun yangerkun@huawei.com Reviewed-by: zhangyi (F) yi.zhang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- fs/io_uring.c | 114 ++++++++++++++++++++++++++++++++++++++++++-------- 1 file changed, 97 insertions(+), 17 deletions(-)
diff --git a/fs/io_uring.c b/fs/io_uring.c index a48331512bed..43adfc58a2cd 100644 --- a/fs/io_uring.c +++ b/fs/io_uring.c @@ -312,6 +312,7 @@ struct io_kiocb { #define REQ_F_LINK 64 /* linked sqes */ #define REQ_F_LINK_DONE 128 /* linked sqes done */ #define REQ_F_FAIL_LINK 256 /* fail rest of links */ +#define REQ_F_SHADOW_DRAIN 512 /* link-drain shadow req */ u64 user_data; u32 result; u32 sequence; @@ -343,6 +344,7 @@ struct io_submit_state { };
static void io_sq_wq_submit_work(struct work_struct *work); +static void __io_free_req(struct io_kiocb *req);
static struct kmem_cache *req_cachep;
@@ -447,6 +449,11 @@ static void io_commit_cqring(struct io_ring_ctx *ctx) __io_commit_cqring(ctx);
while ((req = io_get_deferred_req(ctx)) != NULL) { + if (req->flags & REQ_F_SHADOW_DRAIN) { + /* Just for drain, free it. */ + __io_free_req(req); + continue; + } req->flags |= REQ_F_IO_DRAINED; queue_work(ctx->sqo_wq, &req->work); } @@ -2014,10 +2021,14 @@ static int io_req_set_file(struct io_ring_ctx *ctx, const struct sqe_submit *s, flags = READ_ONCE(s->sqe->flags); fd = READ_ONCE(s->sqe->fd);
- if (flags & IOSQE_IO_DRAIN) { + if (flags & IOSQE_IO_DRAIN) req->flags |= REQ_F_IO_DRAIN; - req->sequence = s->sequence; - } + /* + * All io need record the previous position, if LINK vs DARIN, + * it can be used to mark the position of the first IO in the + * link list. + */ + req->sequence = s->sequence;
if (!io_op_needs_file(s->sqe)) return 0; @@ -2039,20 +2050,11 @@ static int io_req_set_file(struct io_ring_ctx *ctx, const struct sqe_submit *s, return 0; }
-static int io_queue_sqe(struct io_ring_ctx *ctx, struct io_kiocb *req, +static int __io_queue_sqe(struct io_ring_ctx *ctx, struct io_kiocb *req, struct sqe_submit *s) { int ret;
- ret = io_req_defer(ctx, req, s->sqe); - if (ret) { - if (ret != -EIOCBQUEUED) { - io_free_req(req); - io_cqring_add_event(ctx, s->sqe->user_data, ret); - } - return 0; - } - ret = __io_submit_sqe(ctx, req, s, true); if (ret == -EAGAIN && !(req->flags & REQ_F_NOWAIT)) { struct io_uring_sqe *sqe_copy; @@ -2095,6 +2097,64 @@ static int io_queue_sqe(struct io_ring_ctx *ctx, struct io_kiocb *req, return ret; }
+static int io_queue_sqe(struct io_ring_ctx *ctx, struct io_kiocb *req, + struct sqe_submit *s) +{ + int ret; + + ret = io_req_defer(ctx, req, s->sqe); + if (ret) { + if (ret != -EIOCBQUEUED) { + io_free_req(req); + io_cqring_add_event(ctx, s->sqe->user_data, ret); + } + return 0; + } + + return __io_queue_sqe(ctx, req, s); +} + +static int io_queue_link_head(struct io_ring_ctx *ctx, struct io_kiocb *req, + struct sqe_submit *s, struct io_kiocb *shadow) +{ + int ret; + int need_submit = false; + + if (!shadow) + return io_queue_sqe(ctx, req, s); + + /* + * Mark the first IO in link list as DRAIN, let all the following + * IOs enter the defer list. all IO needs to be completed before link + * list. + */ + req->flags |= REQ_F_IO_DRAIN; + ret = io_req_defer(ctx, req, s->sqe); + if (ret) { + if (ret != -EIOCBQUEUED) { + io_free_req(req); + io_cqring_add_event(ctx, s->sqe->user_data, ret); + return 0; + } + } else { + /* + * If ret == 0 means that all IOs in front of link io are + * running done. let's queue link head. + */ + need_submit = true; + } + + /* Insert shadow req to defer_list, blocking next IOs */ + spin_lock_irq(&ctx->completion_lock); + list_add_tail(&shadow->list, &ctx->defer_list); + spin_unlock_irq(&ctx->completion_lock); + + if (need_submit) + return __io_queue_sqe(ctx, req, s); + + return 0; +} + #define SQE_VALID_FLAGS (IOSQE_FIXED_FILE|IOSQE_IO_DRAIN|IOSQE_IO_LINK)
static void io_submit_sqe(struct io_ring_ctx *ctx, struct sqe_submit *s, @@ -2240,6 +2300,7 @@ static int io_submit_sqes(struct io_ring_ctx *ctx, struct sqe_submit *sqes, { struct io_submit_state state, *statep = NULL; struct io_kiocb *link = NULL; + struct io_kiocb *shadow_req = NULL; bool prev_was_link = false; int i, submitted = 0;
@@ -2254,11 +2315,20 @@ static int io_submit_sqes(struct io_ring_ctx *ctx, struct sqe_submit *sqes, * that's the end of the chain. Submit the previous link. */ if (!prev_was_link && link) { - io_queue_sqe(ctx, link, &link->submit); + io_queue_link_head(ctx, link, &link->submit, shadow_req); link = NULL; } prev_was_link = (sqes[i].sqe->flags & IOSQE_IO_LINK) != 0;
+ if (link && (sqes[i].sqe->flags & IOSQE_IO_DRAIN)) { + if (!shadow_req) { + shadow_req = io_get_req(ctx, NULL); + shadow_req->flags |= (REQ_F_IO_DRAIN | REQ_F_SHADOW_DRAIN); + refcount_dec(&shadow_req->refs); + } + shadow_req->sequence = sqes[i].sequence; + } + if (unlikely(mm_fault)) { io_cqring_add_event(ctx, sqes[i].sqe->user_data, -EFAULT); @@ -2272,7 +2342,7 @@ static int io_submit_sqes(struct io_ring_ctx *ctx, struct sqe_submit *sqes, }
if (link) - io_queue_sqe(ctx, link, &link->submit); + io_queue_link_head(ctx, link, &link->submit, shadow_req); if (statep) io_submit_state_end(&state);
@@ -2408,6 +2478,7 @@ static int io_ring_submit(struct io_ring_ctx *ctx, unsigned int to_submit) { struct io_submit_state state, *statep = NULL; struct io_kiocb *link = NULL; + struct io_kiocb *shadow_req = NULL; bool prev_was_link = false; int i, submit = 0;
@@ -2427,11 +2498,20 @@ static int io_ring_submit(struct io_ring_ctx *ctx, unsigned int to_submit) * that's the end of the chain. Submit the previous link. */ if (!prev_was_link && link) { - io_queue_sqe(ctx, link, &link->submit); + io_queue_link_head(ctx, link, &link->submit, shadow_req); link = NULL; } prev_was_link = (s.sqe->flags & IOSQE_IO_LINK) != 0;
+ if (link && (s.sqe->flags & IOSQE_IO_DRAIN)) { + if (!shadow_req) { + shadow_req = io_get_req(ctx, NULL); + shadow_req->flags |= (REQ_F_IO_DRAIN | REQ_F_SHADOW_DRAIN); + refcount_dec(&shadow_req->refs); + } + shadow_req->sequence = s.sequence; + } + s.has_user = true; s.needs_lock = false; s.needs_fixed_file = false; @@ -2441,7 +2521,7 @@ static int io_ring_submit(struct io_ring_ctx *ctx, unsigned int to_submit) io_commit_sqring(ctx);
if (link) - io_queue_sqe(ctx, link, &link->submit); + io_queue_link_head(ctx, link, &link->submit, shadow_req); if (statep) io_submit_state_end(statep);
From: Jens Axboe axboe@kernel.dk
mainline inclusion from mainline-5.4-rc1 commit c576666863b788c2d7e8ab4ef4edd0e9059cb47b category: feature bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27 CVE: NA ---------------------------
For some applications that end up using a submit-and-wait type of approach for certain batches of IO, we can make that a bit more efficient by allowing the application to block for the last IO submission. This prevents an async punt when we don't need it, as the application will be blocking for the completion event(s) anyway.
Typical use cases are using the liburing io_uring_submit_and_wait() API, or just using io_uring_enter() doing both submissions and completions. As a specific example, RocksDB doing MultiGet() is sped up quite a bit with this change.
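Sketched with the raw syscall (assuming __NR_io_uring_enter is available via <sys/syscall.h>), the submit-and-wait pattern this optimizes is a single io_uring_enter() with to_submit equal to min_complete:

    #include <linux/io_uring.h>
    #include <sys/syscall.h>
    #include <unistd.h>

    /* Submit n sqes and wait for n completions in one trip */
    static int submit_and_wait(int ring_fd, unsigned n)
    {
            return syscall(__NR_io_uring_enter, ring_fd, n, n,
                           IORING_ENTER_GETEVENTS, NULL, 0);
    }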
Signed-off-by: Jens Axboe axboe@kernel.dk Signed-off-by: Zhihao Cheng chengzhihao1@huawei.com Signed-off-by: yangerkun yangerkun@huawei.com Reviewed-by: zhangyi (F) yi.zhang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- fs/io_uring.c | 63 +++++++++++++++++++++++++++++++++++++-------------- 1 file changed, 46 insertions(+), 17 deletions(-)
diff --git a/fs/io_uring.c b/fs/io_uring.c index 43adfc58a2cd..4b417c7b18c2 100644 --- a/fs/io_uring.c +++ b/fs/io_uring.c @@ -2051,11 +2051,11 @@ static int io_req_set_file(struct io_ring_ctx *ctx, const struct sqe_submit *s, }
static int __io_queue_sqe(struct io_ring_ctx *ctx, struct io_kiocb *req, - struct sqe_submit *s) + struct sqe_submit *s, bool force_nonblock) { int ret;
- ret = __io_submit_sqe(ctx, req, s, true); + ret = __io_submit_sqe(ctx, req, s, force_nonblock); if (ret == -EAGAIN && !(req->flags & REQ_F_NOWAIT)) { struct io_uring_sqe *sqe_copy;
@@ -2098,7 +2098,7 @@ static int __io_queue_sqe(struct io_ring_ctx *ctx, struct io_kiocb *req, }
static int io_queue_sqe(struct io_ring_ctx *ctx, struct io_kiocb *req, - struct sqe_submit *s) + struct sqe_submit *s, bool force_nonblock) { int ret;
@@ -2111,17 +2111,18 @@ static int io_queue_sqe(struct io_ring_ctx *ctx, struct io_kiocb *req, return 0; }
- return __io_queue_sqe(ctx, req, s); + return __io_queue_sqe(ctx, req, s, force_nonblock); }
static int io_queue_link_head(struct io_ring_ctx *ctx, struct io_kiocb *req, - struct sqe_submit *s, struct io_kiocb *shadow) + struct sqe_submit *s, struct io_kiocb *shadow, + bool force_nonblock) { int ret; int need_submit = false;
if (!shadow) - return io_queue_sqe(ctx, req, s); + return io_queue_sqe(ctx, req, s, force_nonblock);
/* * Mark the first IO in link list as DRAIN, let all the following @@ -2150,7 +2151,7 @@ static int io_queue_link_head(struct io_ring_ctx *ctx, struct io_kiocb *req, spin_unlock_irq(&ctx->completion_lock);
if (need_submit) - return __io_queue_sqe(ctx, req, s); + return __io_queue_sqe(ctx, req, s, force_nonblock);
return 0; } @@ -2158,7 +2159,8 @@ static int io_queue_link_head(struct io_ring_ctx *ctx, struct io_kiocb *req, #define SQE_VALID_FLAGS (IOSQE_FIXED_FILE|IOSQE_IO_DRAIN|IOSQE_IO_LINK)
static void io_submit_sqe(struct io_ring_ctx *ctx, struct sqe_submit *s, - struct io_submit_state *state, struct io_kiocb **link) + struct io_submit_state *state, struct io_kiocb **link, + bool force_nonblock) { struct io_uring_sqe *sqe_copy; struct io_kiocb *req; @@ -2211,7 +2213,7 @@ static void io_submit_sqe(struct io_ring_ctx *ctx, struct sqe_submit *s, INIT_LIST_HEAD(&req->link_list); *link = req; } else { - io_queue_sqe(ctx, req, s); + io_queue_sqe(ctx, req, s, force_nonblock); } }
@@ -2315,7 +2317,8 @@ static int io_submit_sqes(struct io_ring_ctx *ctx, struct sqe_submit *sqes, * that's the end of the chain. Submit the previous link. */ if (!prev_was_link && link) { - io_queue_link_head(ctx, link, &link->submit, shadow_req); + io_queue_link_head(ctx, link, &link->submit, shadow_req, + true); link = NULL; } prev_was_link = (sqes[i].sqe->flags & IOSQE_IO_LINK) != 0; @@ -2336,13 +2339,13 @@ static int io_submit_sqes(struct io_ring_ctx *ctx, struct sqe_submit *sqes, sqes[i].has_user = has_user; sqes[i].needs_lock = true; sqes[i].needs_fixed_file = true; - io_submit_sqe(ctx, &sqes[i], statep, &link); + io_submit_sqe(ctx, &sqes[i], statep, &link, true); submitted++; } }
if (link) - io_queue_link_head(ctx, link, &link->submit, shadow_req); + io_queue_link_head(ctx, link, &link->submit, shadow_req, true); if (statep) io_submit_state_end(&state);
@@ -2474,7 +2477,8 @@ static int io_sq_thread(void *data) return 0; }
-static int io_ring_submit(struct io_ring_ctx *ctx, unsigned int to_submit) +static int io_ring_submit(struct io_ring_ctx *ctx, unsigned int to_submit, + bool block_for_last) { struct io_submit_state state, *statep = NULL; struct io_kiocb *link = NULL; @@ -2488,6 +2492,7 @@ static int io_ring_submit(struct io_ring_ctx *ctx, unsigned int to_submit) }
for (i = 0; i < to_submit; i++) { + bool force_nonblock = true; struct sqe_submit s;
if (!io_get_sqring(ctx, &s)) @@ -2498,7 +2503,8 @@ static int io_ring_submit(struct io_ring_ctx *ctx, unsigned int to_submit) * that's the end of the chain. Submit the previous link. */ if (!prev_was_link && link) { - io_queue_link_head(ctx, link, &link->submit, shadow_req); + io_queue_link_head(ctx, link, &link->submit, shadow_req, + force_nonblock); link = NULL; } prev_was_link = (s.sqe->flags & IOSQE_IO_LINK) != 0; @@ -2516,12 +2522,24 @@ static int io_ring_submit(struct io_ring_ctx *ctx, unsigned int to_submit) s.needs_lock = false; s.needs_fixed_file = false; submit++; - io_submit_sqe(ctx, &s, statep, &link); + + /* + * The caller will block for events after submit, submit the + * last IO non-blocking. This is either the only IO it's + * submitting, or it already submitted the previous ones. This + * improves performance by avoiding an async punt that we don't + * need to do. + */ + if (block_for_last && submit == to_submit) + force_nonblock = false; + + io_submit_sqe(ctx, &s, statep, &link, force_nonblock); } io_commit_sqring(ctx);
if (link) - io_queue_link_head(ctx, link, &link->submit, shadow_req); + io_queue_link_head(ctx, link, &link->submit, shadow_req, + block_for_last); if (statep) io_submit_state_end(statep);
@@ -3290,10 +3308,21 @@ SYSCALL_DEFINE6(io_uring_enter, unsigned int, fd, u32, to_submit,
ret = 0; if (to_submit) { + bool block_for_last = false; + to_submit = min(to_submit, ctx->sq_entries);
+ /* + * Allow last submission to block in a series, IFF the caller + * asked to wait for events and we don't currently have + * enough. This potentially avoids an async punt. + */ + if (to_submit == min_complete && + io_cqring_events(ctx->rings) < min_complete) + block_for_last = true; + mutex_lock(&ctx->uring_lock); - submitted = io_ring_submit(ctx, to_submit); + submitted = io_ring_submit(ctx, to_submit, block_for_last); mutex_unlock(&ctx->uring_lock); } if (flags & IORING_ENTER_GETEVENTS) {
From: Jens Axboe axboe@kernel.dk
mainline inclusion from mainline-5.4-rc1 commit 18d9be1a970c3704366df902b00871bea88d9f14 category: feature bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27 CVE: NA ---------------------------
Add a helper for queueing a request for async execution, in preparation for optimizing it.
No functional change in this patch.
Signed-off-by: Jens Axboe axboe@kernel.dk Signed-off-by: Zhihao Cheng chengzhihao1@huawei.com Signed-off-by: yangerkun yangerkun@huawei.com Reviewed-by: zhangyi (F) yi.zhang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- fs/io_uring.c | 16 +++++++++++----- 1 file changed, 11 insertions(+), 5 deletions(-)
diff --git a/fs/io_uring.c b/fs/io_uring.c index 4b417c7b18c2..b23a4f3b6c61 100644 --- a/fs/io_uring.c +++ b/fs/io_uring.c @@ -442,6 +442,12 @@ static void __io_commit_cqring(struct io_ring_ctx *ctx) } }
+static inline void io_queue_async_work(struct io_ring_ctx *ctx, + struct io_kiocb *req) +{ + queue_work(ctx->sqo_wq, &req->work); +} + static void io_commit_cqring(struct io_ring_ctx *ctx) { struct io_kiocb *req; @@ -455,7 +461,7 @@ static void io_commit_cqring(struct io_ring_ctx *ctx) continue; } req->flags |= REQ_F_IO_DRAINED; - queue_work(ctx->sqo_wq, &req->work); + io_queue_async_work(ctx, req); } }
@@ -618,7 +624,7 @@ static void io_req_link_next(struct io_kiocb *req)
nxt->flags |= REQ_F_LINK_DONE; INIT_WORK(&nxt->work, io_sq_wq_submit_work); - queue_work(req->ctx->sqo_wq, &nxt->work); + io_queue_async_work(req->ctx, nxt); } }
@@ -1518,7 +1524,7 @@ static void io_poll_remove_one(struct io_kiocb *req) WRITE_ONCE(poll->canceled, true); if (!list_empty(&poll->wait.entry)) { list_del_init(&poll->wait.entry); - queue_work(req->ctx->sqo_wq, &req->work); + io_queue_async_work(req->ctx, req); } spin_unlock(&poll->head->lock);
@@ -1632,7 +1638,7 @@ static int io_poll_wake(struct wait_queue_entry *wait, unsigned mode, int sync, io_cqring_ev_posted(ctx); io_put_req(req); } else { - queue_work(ctx->sqo_wq, &req->work); + io_queue_async_work(ctx, req); }
return 1; @@ -2072,7 +2078,7 @@ static int __io_queue_sqe(struct io_ring_ctx *ctx, struct io_kiocb *req, if (list) atomic_inc(&list->cnt); INIT_WORK(&req->work, io_sq_wq_submit_work); - queue_work(ctx->sqo_wq, &req->work); + io_queue_async_work(ctx, req); }
/*
From: Jens Axboe axboe@kernel.dk
mainline inclusion from mainline-5.4-rc1 commit 54a91f3bb9b96ed86bc12b2f7e06b3fce8e86503 category: feature bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27 CVE: NA ---------------------------
All the popular filesystems need to grab the inode lock for buffered writes. With io_uring punting buffered writes to async context, we observe a lot of contention with all workers hammering this mutex.
For buffered writes, we generally don't need a lot of parallelism on the submission side, as the flushing will take care of that for us. Hence we don't need a deep queue on the write side, as long as we can safely punt from the original submission context.
Add a workqueue with a limit of 2 that we can use for buffered writes. This greatly improves the performance and efficiency of higher queue depth buffered async writes with io_uring.
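The knob doing the work here is the max_active argument of alloc_workqueue(); reduced to a sketch covering just the write-side queue:

    #include <linux/workqueue.h>

    /* Narrow queue (max_active = 2) for buffered writes, so writers
     * serialize here rather than piling up on the inode lock. */
    static struct workqueue_struct *setup_write_wq(void)
    {
            return alloc_workqueue("io_ring-write-wq",
                                   WQ_UNBOUND | WQ_FREEZABLE, 2);
    }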
Reported-by: Andres Freund andres@anarazel.de Signed-off-by: Jens Axboe axboe@kernel.dk Signed-off-by: Zhihao Cheng chengzhihao1@huawei.com Signed-off-by: yangerkun yangerkun@huawei.com Reviewed-by: zhangyi (F) yi.zhang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- fs/io_uring.c | 47 +++++++++++++++++++++++++++++++++++++++-------- 1 file changed, 39 insertions(+), 8 deletions(-)
diff --git a/fs/io_uring.c b/fs/io_uring.c index b23a4f3b6c61..2bd3c4cc1394 100644 --- a/fs/io_uring.c +++ b/fs/io_uring.c @@ -203,7 +203,7 @@ struct io_ring_ctx { } ____cacheline_aligned_in_smp;
/* IO offload */ - struct workqueue_struct *sqo_wq; + struct workqueue_struct *sqo_wq[2]; struct task_struct *sqo_thread; /* if using sq thread polling */ struct mm_struct *sqo_mm; wait_queue_head_t sqo_wait; @@ -445,7 +445,19 @@ static void __io_commit_cqring(struct io_ring_ctx *ctx) static inline void io_queue_async_work(struct io_ring_ctx *ctx, struct io_kiocb *req) { - queue_work(ctx->sqo_wq, &req->work); + int rw; + + switch (req->submit.sqe->opcode) { + case IORING_OP_WRITEV: + case IORING_OP_WRITE_FIXED: + rw = !(req->rw.ki_flags & IOCB_DIRECT); + break; + default: + rw = 0; + break; + } + + queue_work(ctx->sqo_wq[rw], &req->work); }
static void io_commit_cqring(struct io_ring_ctx *ctx) @@ -2633,11 +2645,15 @@ static void io_sq_thread_stop(struct io_ring_ctx *ctx)
static void io_finish_async(struct io_ring_ctx *ctx) { + int i; + io_sq_thread_stop(ctx);
- if (ctx->sqo_wq) { - destroy_workqueue(ctx->sqo_wq); - ctx->sqo_wq = NULL; + for (i = 0; i < ARRAY_SIZE(ctx->sqo_wq); i++) { + if (ctx->sqo_wq[i]) { + destroy_workqueue(ctx->sqo_wq[i]); + ctx->sqo_wq[i] = NULL; + } } }
@@ -2845,16 +2861,31 @@ static int io_sq_offload_start(struct io_ring_ctx *ctx, }
/* Do QD, or 2 * CPUS, whatever is smallest */ - ctx->sqo_wq = alloc_workqueue("io_ring-wq", WQ_UNBOUND | WQ_FREEZABLE, + ctx->sqo_wq[0] = alloc_workqueue("io_ring-wq", + WQ_UNBOUND | WQ_FREEZABLE, min(ctx->sq_entries - 1, 2 * num_online_cpus())); - if (!ctx->sqo_wq) { + if (!ctx->sqo_wq[0]) { + ret = -ENOMEM; + goto err; + } + + /* + * This is for buffered writes, where we want to limit the parallelism + * due to file locking in file systems. As "normal" buffered writes + * should parellelize on writeout quite nicely, limit us to having 2 + * pending. This avoids massive contention on the inode when doing + * buffered async writes. + */ + ctx->sqo_wq[1] = alloc_workqueue("io_ring-write-wq", + WQ_UNBOUND | WQ_FREEZABLE, 2); + if (!ctx->sqo_wq[1]) { ret = -ENOMEM; goto err; }
return 0; err: - io_sq_thread_stop(ctx); + io_finish_async(ctx); mmdrop(ctx->sqo_mm); ctx->sqo_mm = NULL; return ret;
From: Jens Axboe axboe@kernel.dk
mainline inclusion from mainline-5.4-rc1 commit 6d5d5ac522b20b65167dafe0656b7cad05ec48b3 category: feature bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27 CVE: NA ---------------------------
We currently merge async work items if we see a strict sequential hit. This helps avoid unnecessary workqueue switches when we don't need them. We can extend this merging to cover cases where it's not a strict sequential hit, but the IO still fits within the same page. If an application is doing multiple requests within the same page, we don't want separate workers waiting on the same page to complete IO. It's much faster to let the first worker bring in the page, then operate on that page from the same worker to complete the next request(s).
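The merge-window test in isolation, assuming 4K pages for the sketch (the kernel uses PAGE_SIZE and PAGE_MASK; the names here are illustrative):

    #include <stdbool.h>
    #include <sys/types.h>

    #define PG_SIZE 4096UL
    #define PG_MASK (~(PG_SIZE - 1))

    /* May the I/O at new_pos piggy-back on the worker that last
     * touched [prev_start, prev_start + prev_len)? Yes if it lands
     * anywhere in the page range already brought in. */
    static bool in_merge_window(off_t prev_start, size_t prev_len,
                                off_t new_pos)
    {
            off_t start = prev_start & PG_MASK;
            off_t end = (prev_start + prev_len + PG_SIZE - 1) & PG_MASK;

            return new_pos >= start && new_pos <= end;
    }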
Reviewed-by: Jeff Moyer jmoyer@redhat.com Signed-off-by: Jens Axboe axboe@kernel.dk Signed-off-by: Zhihao Cheng chengzhihao1@huawei.com Signed-off-by: yangerkun yangerkun@huawei.com Reviewed-by: zhangyi (F) yi.zhang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- fs/io_uring.c | 36 ++++++++++++++++++++++++++++-------- 1 file changed, 28 insertions(+), 8 deletions(-)
diff --git a/fs/io_uring.c b/fs/io_uring.c index 2bd3c4cc1394..4af831003956 100644 --- a/fs/io_uring.c +++ b/fs/io_uring.c @@ -167,7 +167,7 @@ struct async_list { struct list_head list;
struct file *file; - off_t io_end; + off_t io_start; size_t io_len; };
@@ -1188,6 +1188,28 @@ static ssize_t io_import_iovec(struct io_ring_ctx *ctx, int rw, return import_iovec(rw, buf, sqe_len, UIO_FASTIOV, iovec, iter); }
+static inline bool io_should_merge(struct async_list *al, struct kiocb *kiocb) +{ + if (al->file == kiocb->ki_filp) { + off_t start, end; + + /* + * Allow merging if we're anywhere in the range of the same + * page. Generally this happens for sub-page reads or writes, + * and it's beneficial to allow the first worker to bring the + * page in and the piggy backed work can then work on the + * cached page. + */ + start = al->io_start & PAGE_MASK; + end = (al->io_start + al->io_len + PAGE_SIZE - 1) & PAGE_MASK; + if (kiocb->ki_pos >= start && kiocb->ki_pos <= end) + return true; + } + + al->file = NULL; + return false; +} + /* * Make a note of the last file/offset/direction we punted to async * context. We'll use this information to see if we can piggy back a @@ -1199,9 +1221,8 @@ static void io_async_list_note(int rw, struct io_kiocb *req, size_t len) struct async_list *async_list = &req->ctx->pending_async[rw]; struct kiocb *kiocb = &req->rw; struct file *filp = kiocb->ki_filp; - off_t io_end = kiocb->ki_pos + len;
- if (filp == async_list->file && kiocb->ki_pos == async_list->io_end) { + if (io_should_merge(async_list, kiocb)) { unsigned long max_bytes;
/* Use 8x RA size as a decent limiter for both reads/writes */ @@ -1214,17 +1235,16 @@ static void io_async_list_note(int rw, struct io_kiocb *req, size_t len) req->flags |= REQ_F_SEQ_PREV; async_list->io_len += len; } else { - io_end = 0; - async_list->io_len = 0; + async_list->file = NULL; } }
/* New file? Reset state. */ if (async_list->file != filp) { - async_list->io_len = 0; + async_list->io_start = kiocb->ki_pos; + async_list->io_len = len; async_list->file = filp; } - async_list->io_end = io_end; }
static int io_read(struct io_kiocb *req, const struct sqe_submit *s, @@ -1993,7 +2013,7 @@ static void io_sq_wq_submit_work(struct work_struct *work) */ static bool io_add_to_prev_work(struct async_list *list, struct io_kiocb *req) { - bool ret = false; + bool ret;
if (!list) return false;
From: Jens Axboe axboe@kernel.dk
mainline inclusion from mainline-5.4-rc1 commit b2a9eadab85730935f5a6fe19f3f61faaaced601 category: feature bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27 CVE: NA ---------------------------
The way the logic is set up in io_uring_enter() means that you can't wake up the SQ poller thread while at the same time waiting (or polling) for completions afterwards. There's no reason for that to be the case.
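From userspace, that means a single io_uring_enter() call can now both kick the SQPOLL thread and wait for completions. A minimal sketch, assuming ring_fd was created with IORING_SETUP_SQPOLL and that the headers are new enough to define __NR_io_uring_enter and the IORING_ENTER_* flags:

#include <unistd.h>
#include <sys/syscall.h>
#include <linux/io_uring.h>

/* wake the SQ poller and wait for at least one completion, in one syscall */
static long wake_and_wait(int ring_fd, unsigned int to_submit)
{
	return syscall(__NR_io_uring_enter, ring_fd, to_submit,
		       1 /* min_complete */,
		       IORING_ENTER_SQ_WAKEUP | IORING_ENTER_GETEVENTS,
		       NULL, 0);
}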
Reported-by: Lewis Baker lbaker@fb.com Reviewed-by: Jeff Moyer jmoyer@redhat.com Signed-off-by: Jens Axboe axboe@kernel.dk Signed-off-by: Zhihao Cheng chengzhihao1@huawei.com Signed-off-by: yangerkun yangerkun@huawei.com Reviewed-by: zhangyi (F) yi.zhang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- fs/io_uring.c | 8 ++------ 1 file changed, 2 insertions(+), 6 deletions(-)
diff --git a/fs/io_uring.c b/fs/io_uring.c index 4af831003956..f9d570bda423 100644 --- a/fs/io_uring.c +++ b/fs/io_uring.c @@ -3356,15 +3356,12 @@ SYSCALL_DEFINE6(io_uring_enter, unsigned int, fd, u32, to_submit, * Just return the requested submit count, and wake the thread if * we were asked to. */ + ret = 0; if (ctx->flags & IORING_SETUP_SQPOLL) { if (flags & IORING_ENTER_SQ_WAKEUP) wake_up(&ctx->sqo_wait); submitted = to_submit; - goto out_ctx; - } - - ret = 0; - if (to_submit) { + } else if (to_submit) { bool block_for_last = false;
to_submit = min(to_submit, ctx->sq_entries); @@ -3394,7 +3391,6 @@ SYSCALL_DEFINE6(io_uring_enter, unsigned int, fd, u32, to_submit, } }
-out_ctx: io_ring_drop_ctx_refs(ctx, 1); out_fput: fdput(f);
From: Daniel Xu dxu@dxuuu.xyz
mainline inclusion from mainline-5.4-rc1 commit 5277deaab9f98229bdfb8d1e30019b6c25052708 category: feature bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27 CVE: NA ---------------------------
Some workloads can require far more than 4K outstanding entries. For example memcached can have ~300K sockets over ~40 cores. Bumping the max to 32K seems to work pretty well.
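For illustration, a hypothetical raw-syscall sketch (a liburing setup call would work equally well):

#include <string.h>
#include <unistd.h>
#include <sys/syscall.h>
#include <linux/io_uring.h>

/* ring sizes above the old 4096 cap are now accepted, up to 32768 */
static int setup_big_ring(void)
{
	struct io_uring_params p;

	memset(&p, 0, sizeof(p));
	return syscall(__NR_io_uring_setup, 32768, &p);
}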
Reported-by: Dan Melnic dmm@fb.com Signed-off-by: Daniel Xu dxu@dxuuu.xyz Signed-off-by: Jens Axboe axboe@kernel.dk Signed-off-by: Zhihao Cheng chengzhihao1@huawei.com Signed-off-by: yangerkun yangerkun@huawei.com Reviewed-by: zhangyi (F) yi.zhang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- fs/io_uring.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/fs/io_uring.c b/fs/io_uring.c index f9d570bda423..5d111591b620 100644 --- a/fs/io_uring.c +++ b/fs/io_uring.c @@ -75,7 +75,7 @@
#include "internal.h"
-#define IORING_MAX_ENTRIES 4096 +#define IORING_MAX_ENTRIES 32768 #define IORING_MAX_FIXED_FILES 1024
struct io_uring {
From: Jackie Liu liuyun01@kylinos.cn
mainline inclusion from mainline-5.4-rc1 commit 954dab193d19cbbff8f83b58c9360bf00ddb273c category: feature bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27 CVE: NA ---------------------------
Just clean up the code, no functional changes.
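The general shape of the change, as a sketch with placeholder names rather than the io_uring call site:

#include <linux/slab.h>
#include <linux/string.h>

/* before: open-coded allocate-and-copy */
static void *dup_open_coded(const void *src, size_t len)
{
	void *copy = kmalloc(len, GFP_KERNEL);

	if (copy)
		memcpy(copy, src, len);
	return copy;
}

/* after: kmemdup() does exactly the above in one call */
static void *dup_helper(const void *src, size_t len)
{
	return kmemdup(src, len, GFP_KERNEL);
}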
Signed-off-by: Jackie Liu liuyun01@kylinos.cn Signed-off-by: Jens Axboe axboe@kernel.dk Signed-off-by: Zhihao Cheng chengzhihao1@huawei.com Signed-off-by: yangerkun yangerkun@huawei.com Reviewed-by: zhangyi (F) yi.zhang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- fs/io_uring.c | 4 +--- 1 file changed, 1 insertion(+), 3 deletions(-)
diff --git a/fs/io_uring.c b/fs/io_uring.c index 5d111591b620..41022fd5dce7 100644 --- a/fs/io_uring.c +++ b/fs/io_uring.c @@ -2097,13 +2097,11 @@ static int __io_queue_sqe(struct io_ring_ctx *ctx, struct io_kiocb *req, if (ret == -EAGAIN && !(req->flags & REQ_F_NOWAIT)) { struct io_uring_sqe *sqe_copy;
- sqe_copy = kmalloc(sizeof(*sqe_copy), GFP_KERNEL); + sqe_copy = kmemdup(s->sqe, sizeof(*sqe_copy), GFP_KERNEL); if (sqe_copy) { struct async_list *list;
- memcpy(sqe_copy, s->sqe, sizeof(*sqe_copy)); s->sqe = sqe_copy; - memcpy(&req->submit, s, sizeof(*s)); list = io_async_list_from_sqe(ctx, s->sqe); if (!io_add_to_prev_work(list, req)) {
From: Jackie Liu liuyun01@kylinos.cn
mainline inclusion from mainline-5.4-rc1 commit 5f5ad9ced33621d353be6429c3900f8a526fcae8 category: feature bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27 CVE: NA ---------------------------
There is a potential dangling pointer problem: we never clear shadow_req, so if there are multiple link lists in this series of sqes, shadow_req is not reallocated and keeps pointing at the previous one, whose memory has already been released. Clear it so that every new link list allocates a fresh shadow_req.
Fixes: 4fe2c963154c ("io_uring: add support for link with drain") Signed-off-by: Jackie Liu liuyun01@kylinos.cn Signed-off-by: Jens Axboe axboe@kernel.dk Signed-off-by: Zhihao Cheng chengzhihao1@huawei.com Signed-off-by: yangerkun yangerkun@huawei.com Reviewed-by: zhangyi (F) yi.zhang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- fs/io_uring.c | 2 ++ 1 file changed, 2 insertions(+)
diff --git a/fs/io_uring.c b/fs/io_uring.c index 41022fd5dce7..55db30a9ebed 100644 --- a/fs/io_uring.c +++ b/fs/io_uring.c @@ -2356,6 +2356,7 @@ static int io_submit_sqes(struct io_ring_ctx *ctx, struct sqe_submit *sqes, io_queue_link_head(ctx, link, &link->submit, shadow_req, true); link = NULL; + shadow_req = NULL; } prev_was_link = (sqes[i].sqe->flags & IOSQE_IO_LINK) != 0;
@@ -2542,6 +2543,7 @@ static int io_ring_submit(struct io_ring_ctx *ctx, unsigned int to_submit, io_queue_link_head(ctx, link, &link->submit, shadow_req, force_nonblock); link = NULL; + shadow_req = NULL; } prev_was_link = (s.sqe->flags & IOSQE_IO_LINK) != 0;
From: Jens Axboe axboe@kernel.dk
mainline inclusion from mainline-5.4-rc1 commit 6cc47d1d2a9b631f62405f56df651975c7587a97 category: feature bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27 CVE: NA ---------------------------
If we end up getting woken in poll (due to a signal), then we may need to punt the poll request to an async worker. When we do that, we look up the list to queue at, dereferencing req->submit.sqe; however, that is only set for requests we initially decided to queue async.
This fixes a crash with poll command usage and wakeups that need to punt to async context.
Fixes: 54a91f3bb9b9 ("io_uring: limit parallelism of buffered writes") Signed-off-by: Jens Axboe axboe@kernel.dk Signed-off-by: Zhihao Cheng chengzhihao1@huawei.com Signed-off-by: yangerkun yangerkun@huawei.com Reviewed-by: zhangyi (F) yi.zhang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- fs/io_uring.c | 18 +++++++++--------- 1 file changed, 9 insertions(+), 9 deletions(-)
diff --git a/fs/io_uring.c b/fs/io_uring.c index 55db30a9ebed..6b2295fcb355 100644 --- a/fs/io_uring.c +++ b/fs/io_uring.c @@ -445,16 +445,15 @@ static void __io_commit_cqring(struct io_ring_ctx *ctx) static inline void io_queue_async_work(struct io_ring_ctx *ctx, struct io_kiocb *req) { - int rw; + int rw = 0;
- switch (req->submit.sqe->opcode) { - case IORING_OP_WRITEV: - case IORING_OP_WRITE_FIXED: - rw = !(req->rw.ki_flags & IOCB_DIRECT); - break; - default: - rw = 0; - break; + if (req->submit.sqe) { + switch (req->submit.sqe->opcode) { + case IORING_OP_WRITEV: + case IORING_OP_WRITE_FIXED: + rw = !(req->rw.ki_flags & IOCB_DIRECT); + break; + } }
queue_work(ctx->sqo_wq[rw], &req->work); @@ -1713,6 +1712,7 @@ static int io_poll_add(struct io_kiocb *req, const struct io_uring_sqe *sqe) if (!poll->file) return -EBADF;
+ req->submit.sqe = NULL; INIT_WORK(&req->work, io_poll_complete_work); events = READ_ONCE(sqe->poll_events); poll->events = demangle_poll(events) | EPOLLERR | EPOLLHUP;
From: Jackie Liu liuyun01@kylinos.cn
mainline inclusion from mainline-5.4-rc1 commit a1041c27b64ce744632147e19701c95fed14fab1 category: feature bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27 CVE: NA ---------------------------
Sometimes io_get_req() returns NULL; we then need to do the correct error handling, otherwise it causes a kernel NULL pointer dereference.
Fixes: 4fe2c963154c ("io_uring: add support for link with drain") Signed-off-by: Jackie Liu liuyun01@kylinos.cn Signed-off-by: Jens Axboe axboe@kernel.dk Signed-off-by: Zhihao Cheng chengzhihao1@huawei.com Signed-off-by: yangerkun yangerkun@huawei.com Reviewed-by: zhangyi (F) yi.zhang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- fs/io_uring.c | 6 ++++++ 1 file changed, 6 insertions(+)
diff --git a/fs/io_uring.c b/fs/io_uring.c index 6b2295fcb355..3eeee5a03fc8 100644 --- a/fs/io_uring.c +++ b/fs/io_uring.c @@ -2363,12 +2363,15 @@ static int io_submit_sqes(struct io_ring_ctx *ctx, struct sqe_submit *sqes, if (link && (sqes[i].sqe->flags & IOSQE_IO_DRAIN)) { if (!shadow_req) { shadow_req = io_get_req(ctx, NULL); + if (unlikely(!shadow_req)) + goto out; shadow_req->flags |= (REQ_F_IO_DRAIN | REQ_F_SHADOW_DRAIN); refcount_dec(&shadow_req->refs); } shadow_req->sequence = sqes[i].sequence; }
+out: if (unlikely(mm_fault)) { io_cqring_add_event(ctx, sqes[i].sqe->user_data, -EFAULT); @@ -2550,12 +2553,15 @@ static int io_ring_submit(struct io_ring_ctx *ctx, unsigned int to_submit, if (link && (s.sqe->flags & IOSQE_IO_DRAIN)) { if (!shadow_req) { shadow_req = io_get_req(ctx, NULL); + if (unlikely(!shadow_req)) + goto out; shadow_req->flags |= (REQ_F_IO_DRAIN | REQ_F_SHADOW_DRAIN); refcount_dec(&shadow_req->refs); } shadow_req->sequence = s.sequence; }
+out: s.has_user = true; s.needs_lock = false; s.needs_fixed_file = false;
From: Thomas Gleixner tglx@linutronix.de
mainline inclusion from mainline-5.1-rc1 commit 15917dc02841862840efcbfe1da0830f88078b5c category: feature bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27 CVE: NA ---------------------------
The RTMUTEX tester was removed long ago but the PF bit stayed around. Remove it and free up the space.
Signed-off-by: Thomas Gleixner tglx@linutronix.de
Conflicts: include/linux/sched.h [ Patch 73ab1cb2de9e3("umh: add exit routine for UMH process") is not applied. ]
Signed-off-by: Zhihao Cheng chengzhihao1@huawei.com Signed-off-by: yangerkun yangerkun@huawei.com Reviewed-by: Cheng Jian cj.chengjian@huawei.com Reviewed-by: zhangyi (F) yi.zhang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- include/linux/sched.h | 1 - 1 file changed, 1 deletion(-)
diff --git a/include/linux/sched.h b/include/linux/sched.h index 9dc064305c13..67d4cfefc99d 100644 --- a/include/linux/sched.h +++ b/include/linux/sched.h @@ -1421,7 +1421,6 @@ extern struct pid *cad_pid; #define PF_SWAPWRITE 0x00800000 /* Allowed to write to swap */ #define PF_NO_SETAFFINITY 0x04000000 /* Userland is not allowed to meddle with cpus_allowed */ #define PF_MCE_EARLY 0x08000000 /* Early kill for mce process policy */ -#define PF_MUTEX_TESTER 0x20000000 /* Thread belongs to the rt mutex tester */ #define PF_FREEZER_SKIP 0x40000000 /* Freezer should not count it as freezable */ #define PF_SUSPEND_TASK 0x80000000 /* This thread called freeze_processes() and should not be frozen */
From: Thomas Gleixner tglx@linutronix.de
mainline inclusion from mainline-5.2-rc1 commit 6d25be5782e482eb93e3de0c94d0a517879377d0 category: feature bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27 CVE: NA ---------------------------
The worker accounting for CPU bound workers is plugged into the core scheduler code and the wakeup code. This is not a hard requirement and can be avoided by keeping track of the state in the workqueue code itself.
Keep track of the sleeping state in the worker itself and call the notifier before entering the core scheduler. There might be false positives when the task is woken between that call and actually scheduling, but that's not really different from scheduling and being woken immediately after switching away. nr_running is now updated when the task returns from schedule(), and it is later compared when the wakeup is done from ttwu().
[ bigeasy: preempt_disable() around wq_worker_sleeping() by Daniel Bristot de Oliveira ]
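For orientation, the hook placement after this change looks roughly like this (a call-flow sketch reconstructed from the diff below, not literal code):

schedule()
    sched_submit_work(tsk)
        preempt_disable()
        wq_worker_sleeping(tsk)      /* mark worker->sleeping, maybe wake an idle worker */
        preempt_enable_no_resched()
    __schedule(false)                /* core scheduler: no workqueue hooks anymore */
    sched_update_worker(tsk)
        wq_worker_running(tsk)       /* clear worker->sleeping, restore nr_running */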
Signed-off-by: Thomas Gleixner tglx@linutronix.de Signed-off-by: Sebastian Andrzej Siewior bigeasy@linutronix.de Signed-off-by: Peter Zijlstra (Intel) peterz@infradead.org Acked-by: Tejun Heo tj@kernel.org Cc: Daniel Bristot de Oliveira bristot@redhat.com Cc: Lai Jiangshan jiangshanlai@gmail.com Cc: Linus Torvalds torvalds@linux-foundation.org Cc: Peter Zijlstra peterz@infradead.org Link: http://lkml.kernel.org/r/ad2b29b5715f970bffc1a7026cabd6ff0b24076a.1532952814... Signed-off-by: Ingo Molnar mingo@kernel.org
Conflicts: kernel/workqueue_internal.h [ Patch 1b69ac6b40ebd("psi: fix aggregation idle shut-off") is not applied. ]
Signed-off-by: Zhihao Cheng chengzhihao1@huawei.com Signed-off-by: yangerkun yangerkun@huawei.com Reviewed-by: zhangyi (F) yi.zhang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- kernel/sched/core.c | 88 +++++++++---------------------------- kernel/workqueue.c | 54 ++++++++++------------- kernel/workqueue_internal.h | 5 ++- 3 files changed, 48 insertions(+), 99 deletions(-)
diff --git a/kernel/sched/core.c b/kernel/sched/core.c index 43d58409607b..0fac7e9aa9fe 100644 --- a/kernel/sched/core.c +++ b/kernel/sched/core.c @@ -1663,10 +1663,6 @@ static inline void ttwu_activate(struct rq *rq, struct task_struct *p, int en_fl { activate_task(rq, p, en_flags); p->on_rq = TASK_ON_RQ_QUEUED; - - /* If a worker is waking up, notify the workqueue: */ - if (p->flags & PF_WQ_WORKER) - wq_worker_waking_up(p, cpu_of(rq)); }
/* @@ -2083,56 +2079,6 @@ try_to_wake_up(struct task_struct *p, unsigned int state, int wake_flags) return success; }
-/** - * try_to_wake_up_local - try to wake up a local task with rq lock held - * @p: the thread to be awakened - * @rf: request-queue flags for pinning - * - * Put @p on the run-queue if it's not already there. The caller must - * ensure that this_rq() is locked, @p is bound to this_rq() and not - * the current task. - */ -static void try_to_wake_up_local(struct task_struct *p, struct rq_flags *rf) -{ - struct rq *rq = task_rq(p); - - if (WARN_ON_ONCE(rq != this_rq()) || - WARN_ON_ONCE(p == current)) - return; - - lockdep_assert_held(&rq->lock); - - if (!raw_spin_trylock(&p->pi_lock)) { - /* - * This is OK, because current is on_cpu, which avoids it being - * picked for load-balance and preemption/IRQs are still - * disabled avoiding further scheduler activity on it and we've - * not yet picked a replacement task. - */ - rq_unlock(rq, rf); - raw_spin_lock(&p->pi_lock); - rq_relock(rq, rf); - } - - if (!(p->state & TASK_NORMAL)) - goto out; - - trace_sched_waking(p); - - if (!task_on_rq_queued(p)) { - if (p->in_iowait) { - delayacct_blkio_end(p); - atomic_dec(&rq->nr_iowait); - } - ttwu_activate(rq, p, ENQUEUE_WAKEUP | ENQUEUE_NOCLOCK); - } - - ttwu_do_wakeup(rq, p, 0, rf); - ttwu_stat(p, smp_processor_id(), 0); -out: - raw_spin_unlock(&p->pi_lock); -} - /** * wake_up_process - Wake up a specific process * @p: The process to be woken up. @@ -3538,19 +3484,6 @@ static void __sched notrace __schedule(bool preempt) atomic_inc(&rq->nr_iowait); delayacct_blkio_start(); } - - /* - * If a worker went to sleep, notify and ask workqueue - * whether it wants to wake up a task to maintain - * concurrency. - */ - if (prev->flags & PF_WQ_WORKER) { - struct task_struct *to_wakeup; - - to_wakeup = wq_worker_sleeping(prev); - if (to_wakeup) - try_to_wake_up_local(to_wakeup, &rf); - } } switch_count = &prev->nvcsw; } @@ -3610,6 +3543,20 @@ static inline void sched_submit_work(struct task_struct *tsk) { if (!tsk->state || tsk_is_pi_blocked(tsk)) return; + + /* + * If a worker went to sleep, notify and ask workqueue whether + * it wants to wake up a task to maintain concurrency. + * As this function is called inside the schedule() context, + * we disable preemption to avoid it calling schedule() again + * in the possible wakeup of a kworker. + */ + if (tsk->flags & PF_WQ_WORKER) { + preempt_disable(); + wq_worker_sleeping(tsk); + preempt_enable_no_resched(); + } + /* * If we are going to sleep and we have plugged IO queued, * make sure to submit it to avoid deadlocks. @@ -3618,6 +3565,12 @@ static inline void sched_submit_work(struct task_struct *tsk) blk_schedule_flush_plug(tsk); }
+static void sched_update_worker(struct task_struct *tsk) +{ + if (tsk->flags & PF_WQ_WORKER) + wq_worker_running(tsk); +} + asmlinkage __visible void __sched schedule(void) { struct task_struct *tsk = current; @@ -3628,6 +3581,7 @@ asmlinkage __visible void __sched schedule(void) __schedule(false); sched_preempt_enable_no_resched(); } while (need_resched()); + sched_update_worker(tsk); } EXPORT_SYMBOL(schedule);
diff --git a/kernel/workqueue.c b/kernel/workqueue.c index 1ffc523edb65..a07aa758571e 100644 --- a/kernel/workqueue.c +++ b/kernel/workqueue.c @@ -840,43 +840,32 @@ static void wake_up_worker(struct worker_pool *pool) }
/** - * wq_worker_waking_up - a worker is waking up + * wq_worker_running - a worker is running again * @task: task waking up - * @cpu: CPU @task is waking up to * - * This function is called during try_to_wake_up() when a worker is - * being awoken. - * - * CONTEXT: - * spin_lock_irq(rq->lock) + * This function is called when a worker returns from schedule() */ -void wq_worker_waking_up(struct task_struct *task, int cpu) +void wq_worker_running(struct task_struct *task) { struct worker *worker = kthread_data(task);
- if (!(worker->flags & WORKER_NOT_RUNNING)) { - WARN_ON_ONCE(worker->pool->cpu != cpu); + if (!worker->sleeping) + return; + if (!(worker->flags & WORKER_NOT_RUNNING)) atomic_inc(&worker->pool->nr_running); - } + worker->sleeping = 0; }
/** * wq_worker_sleeping - a worker is going to sleep * @task: task going to sleep * - * This function is called during schedule() when a busy worker is - * going to sleep. Worker on the same cpu can be woken up by - * returning pointer to its task. - * - * CONTEXT: - * spin_lock_irq(rq->lock) - * - * Return: - * Worker task on @cpu to wake up, %NULL if none. + * This function is called from schedule() when a busy worker is + * going to sleep. */ -struct task_struct *wq_worker_sleeping(struct task_struct *task) +void wq_worker_sleeping(struct task_struct *task) { - struct worker *worker = kthread_data(task), *to_wakeup = NULL; + struct worker *next, *worker = kthread_data(task); struct worker_pool *pool;
/* @@ -885,13 +874,15 @@ struct task_struct *wq_worker_sleeping(struct task_struct *task) * checking NOT_RUNNING. */ if (worker->flags & WORKER_NOT_RUNNING) - return NULL; + return;
pool = worker->pool;
- /* this can only happen on the local cpu */ - if (WARN_ON_ONCE(pool->cpu != raw_smp_processor_id())) - return NULL; + if (WARN_ON_ONCE(worker->sleeping)) + return; + + worker->sleeping = 1; + spin_lock_irq(&pool->lock);
/* * The counterpart of the following dec_and_test, implied mb, @@ -905,9 +896,12 @@ struct task_struct *wq_worker_sleeping(struct task_struct *task) * lock is safe. */ if (atomic_dec_and_test(&pool->nr_running) && - !list_empty(&pool->worklist)) - to_wakeup = first_idle_worker(pool); - return to_wakeup ? to_wakeup->task : NULL; + !list_empty(&pool->worklist)) { + next = first_idle_worker(pool); + if (next) + wake_up_process(next->task); + } + spin_unlock_irq(&pool->lock); }
/** @@ -4891,7 +4885,7 @@ static void rebind_workers(struct worker_pool *pool) * * WRITE_ONCE() is necessary because @worker->flags may be * tested without holding any lock in - * wq_worker_waking_up(). Without it, NOT_RUNNING test may + * wq_worker_running(). Without it, NOT_RUNNING test may * fail incorrectly leading to premature concurrency * management operations. */ diff --git a/kernel/workqueue_internal.h b/kernel/workqueue_internal.h index 66fbb5a9e633..30cfed226b39 100644 --- a/kernel/workqueue_internal.h +++ b/kernel/workqueue_internal.h @@ -44,6 +44,7 @@ struct worker { unsigned long last_active; /* L: last active timestamp */ unsigned int flags; /* X: flags */ int id; /* I: worker id */ + int sleeping; /* None */
/* * Opaque string set with work_set_desc(). Printed out with task @@ -69,7 +70,7 @@ static inline struct worker *current_wq_worker(void) * Scheduler hooks for concurrency managed workqueue. Only to be used from * sched/core.c and workqueue.c. */ -void wq_worker_waking_up(struct task_struct *task, int cpu); -struct task_struct *wq_worker_sleeping(struct task_struct *task); +void wq_worker_running(struct task_struct *task); +void wq_worker_sleeping(struct task_struct *task);
#endif /* _KERNEL_WORKQUEUE_INTERNAL_H */
From: YueHaibing yuehaibing@huawei.com
mainline inclusion from mainline-5.5-rc1 commit 364b05fd06e87e53dc03396f73afeac48d8e0998 category: feature bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27 CVE: NA ---------------------------
The callback function of call_rcu() just calls kfree(), so we can use kfree_rcu() instead of call_rcu() + callback function.
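In general terms, the transformation looks like this (a sketch with placeholder struct and function names):

#include <linux/kernel.h>
#include <linux/rcupdate.h>
#include <linux/slab.h>

struct foo {
	struct rcu_head rcu;
	/* payload ... */
};

/* before: a callback whose only job is to kfree() the object */
static void foo_free_rcu(struct rcu_head *head)
{
	kfree(container_of(head, struct foo, rcu));
}

static void foo_release_old(struct foo *f)
{
	call_rcu(&f->rcu, foo_free_rcu);
}

/* after: kfree_rcu() takes the object and the name of its rcu_head field */
static void foo_release_new(struct foo *f)
{
	kfree_rcu(f, rcu);
}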
Signed-off-by: YueHaibing yuehaibing@huawei.com Signed-off-by: Jens Axboe axboe@kernel.dk Signed-off-by: Zhihao Cheng chengzhihao1@huawei.com Signed-off-by: yangerkun yangerkun@huawei.com Reviewed-by: zhangyi (F) yi.zhang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- fs/io-wq.c | 9 +-------- 1 file changed, 1 insertion(+), 8 deletions(-)
diff --git a/fs/io-wq.c b/fs/io-wq.c index 4fab4917938e..7f94fab46c22 100644 --- a/fs/io-wq.c +++ b/fs/io-wq.c @@ -103,13 +103,6 @@ struct io_wq { struct completion done; };
-static void io_wq_free_worker(struct rcu_head *head) -{ - struct io_worker *worker = container_of(head, struct io_worker, rcu); - - kfree(worker); -} - static bool io_worker_get(struct io_worker *worker) { return refcount_inc_not_zero(&worker->ref); @@ -195,7 +188,7 @@ static void io_worker_exit(struct io_worker *worker) if (all_done && refcount_dec_and_test(&wqe->wq->refs)) complete(&wqe->wq->done);
- call_rcu(&worker->rcu, io_wq_free_worker); + kfree_rcu(worker, rcu); }
static void io_worker_start(struct io_wqe *wqe, struct io_worker *worker)
From: Jens Axboe axboe@kernel.dk
mainline inclusion from mainline-5.5-rc1 commit 51c3ff62cac635ae9d75f875ce5b7bdafc97abd5 category: feature bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27 CVE: NA ---------------------------
We currently don't have a completion event trace; add one. To better be able to match up submissions and completions, add user_data to the submission trace as well.
Signed-off-by: Jens Axboe axboe@kernel.dk Signed-off-by: Zhihao Cheng chengzhihao1@huawei.com Signed-off-by: yangerkun yangerkun@huawei.com Reviewed-by: zhangyi (F) yi.zhang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- fs/io_uring.c | 6 ++-- include/trace/events/io_uring.h | 54 +++++++++++++++++++++++++++------ 2 files changed, 49 insertions(+), 11 deletions(-)
diff --git a/fs/io_uring.c b/fs/io_uring.c index f8beb5f5be98..132845095db8 100644 --- a/fs/io_uring.c +++ b/fs/io_uring.c @@ -591,6 +591,8 @@ static void io_cqring_fill_event(struct io_ring_ctx *ctx, u64 ki_user_data, { struct io_uring_cqe *cqe;
+ trace_io_uring_complete(ctx, ki_user_data, res); + /* * If we can't get a cq entry, userspace overflowed the * submission (by quite a lot). Increment the overflow count in @@ -2733,7 +2735,7 @@ static int io_submit_sqes(struct io_ring_ctx *ctx, unsigned int nr, s.has_user = *mm != NULL; s.in_async = true; s.needs_fixed_file = true; - trace_io_uring_submit_sqe(ctx, true, true); + trace_io_uring_submit_sqe(ctx, s.sqe->user_data, true, true); io_submit_sqe(ctx, &s, statep, &link); submitted++; } @@ -2913,7 +2915,7 @@ static int io_ring_submit(struct io_ring_ctx *ctx, unsigned int to_submit, s.needs_fixed_file = false; s.ring_fd = ring_fd; submit++; - trace_io_uring_submit_sqe(ctx, true, false); + trace_io_uring_submit_sqe(ctx, s.sqe->user_data, true, false); io_submit_sqe(ctx, &s, statep, &link); }
diff --git a/include/trace/events/io_uring.h b/include/trace/events/io_uring.h index b85255121b98..9e80fa2415d2 100644 --- a/include/trace/events/io_uring.h +++ b/include/trace/events/io_uring.h @@ -313,10 +313,43 @@ TRACE_EVENT(io_uring_fail_link, TP_printk("request %p, link %p", __entry->req, __entry->link) );
+/** + * io_uring_complete - called when completing an SQE + * + * @ctx: pointer to a ring context structure + * @user_data: user data associated with the request + * @res: result of the request + * + */ +TRACE_EVENT(io_uring_complete, + + TP_PROTO(void *ctx, u64 user_data, long res), + + TP_ARGS(ctx, user_data, res), + + TP_STRUCT__entry ( + __field( void *, ctx ) + __field( u64, user_data ) + __field( long, res ) + ), + + TP_fast_assign( + __entry->ctx = ctx; + __entry->user_data = user_data; + __entry->res = res; + ), + + TP_printk("ring %p, user_data 0x%llx, result %ld", + __entry->ctx, (unsigned long long)__entry->user_data, + __entry->res) +); + + /** * io_uring_submit_sqe - called before submitting one SQE * - * @ctx: pointer to a ring context structure + * @ctx: pointer to a ring context structure + * @user_data: user data associated with the request * @force_nonblock: whether a context blocking or not * @sq_thread: true if sq_thread has submitted this SQE * @@ -325,24 +358,27 @@ TRACE_EVENT(io_uring_fail_link, */ TRACE_EVENT(io_uring_submit_sqe,
- TP_PROTO(void *ctx, bool force_nonblock, bool sq_thread), + TP_PROTO(void *ctx, u64 user_data, bool force_nonblock, bool sq_thread),
- TP_ARGS(ctx, force_nonblock, sq_thread), + TP_ARGS(ctx, user_data, force_nonblock, sq_thread),
TP_STRUCT__entry ( - __field( void *, ctx ) + __field( void *, ctx ) + __field( u64, user_data ) __field( bool, force_nonblock ) - __field( bool, sq_thread ) + __field( bool, sq_thread ) ),
TP_fast_assign( - __entry->ctx = ctx; + __entry->ctx = ctx; + __entry->user_data = user_data; __entry->force_nonblock = force_nonblock; - __entry->sq_thread = sq_thread; + __entry->sq_thread = sq_thread; ),
- TP_printk("ring %p, non block %d, sq_thread %d", - __entry->ctx, __entry->force_nonblock, __entry->sq_thread) + TP_printk("ring %p, user data 0x%llx, non block %d, sq_thread %d", + __entry->ctx, (unsigned long long) __entry->user_data, + __entry->force_nonblock, __entry->sq_thread) );
#endif /* _TRACE_IO_URING_H */
From: Jens Axboe axboe@kernel.dk
mainline inclusion from mainline-5.5-rc1 commit 3aa5fa030558e2b0da284fd069aeb7178543c987 category: feature bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27 CVE: NA ---------------------------
We have no more use for this flag after the conversion to io-wq; kill it off.
Fixes: 561fb04a6a22 ("io_uring: replace workqueue usage with io-wq") Signed-off-by: Jens Axboe axboe@kernel.dk Signed-off-by: Zhihao Cheng chengzhihao1@huawei.com Signed-off-by: yangerkun yangerkun@huawei.com Reviewed-by: zhangyi (F) yi.zhang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- fs/io_uring.c | 2 -- 1 file changed, 2 deletions(-)
diff --git a/fs/io_uring.c b/fs/io_uring.c index 212c0f5c4065..ab576c29b1a7 100644 --- a/fs/io_uring.c +++ b/fs/io_uring.c @@ -329,7 +329,6 @@ struct io_kiocb { #define REQ_F_IO_DRAIN 16 /* drain existing IO first */ #define REQ_F_IO_DRAINED 32 /* drain done */ #define REQ_F_LINK 64 /* linked sqes */ -#define REQ_F_LINK_DONE 128 /* linked sqes done */ #define REQ_F_FAIL_LINK 256 /* fail rest of links */ #define REQ_F_SHADOW_DRAIN 512 /* link-drain shadow req */ #define REQ_F_TIMEOUT 1024 /* timeout request */ @@ -730,7 +729,6 @@ static void io_req_link_next(struct io_kiocb *req, struct io_kiocb **nxtptr) nxt->flags |= REQ_F_LINK; }
- nxt->flags |= REQ_F_LINK_DONE; /* * If we're in async work, we can continue processing the chain * in this context instead of having to queue up new async work.
From: Pavel Begunkov asml.silence@gmail.com
mainline inclusion from mainline-v5.5-rc1 commit e5eb6366ac2d1df8ad5b010718ac1997ceae45be category: feature bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27 CVE: NA ---------------------------
After a call to io_submit_sqe(), it's already known whether it needs to queue a link or not. Do it there, as it's simpler and doesn't keep an extra variable across the loop.
Reviewed-by: Bob Liu bob.liu@oracle.com Signed-off-by: Pavel Begunkov asml.silence@gmail.com Signed-off-by: Jens Axboe axboe@kernel.dk Signed-off-by: Zhihao Cheng chengzhihao1@huawei.com Signed-off-by: yangerkun yangerkun@huawei.com Reviewed-by: zhangyi (F) yi.zhang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- fs/io_uring.c | 22 ++++++++++------------ 1 file changed, 10 insertions(+), 12 deletions(-)
diff --git a/fs/io_uring.c b/fs/io_uring.c index 1b7bdbc12c8f..eb56e79f1f29 100644 --- a/fs/io_uring.c +++ b/fs/io_uring.c @@ -2700,7 +2700,6 @@ static int io_submit_sqes(struct io_ring_ctx *ctx, unsigned int nr, struct io_submit_state state, *statep = NULL; struct io_kiocb *link = NULL; struct io_kiocb *shadow_req = NULL; - bool prev_was_link = false; int i, submitted = 0; bool mm_fault = false;
@@ -2723,17 +2722,6 @@ static int io_submit_sqes(struct io_ring_ctx *ctx, unsigned int nr, } }
- /* - * If previous wasn't linked and we have a linked command, - * that's the end of the chain. Submit the previous link. - */ - if (!prev_was_link && link) { - io_queue_link_head(ctx, link, &link->submit, shadow_req); - link = NULL; - shadow_req = NULL; - } - prev_was_link = (s.sqe->flags & IOSQE_IO_LINK) != 0; - if (link && (s.sqe->flags & IOSQE_IO_DRAIN)) { if (!shadow_req) { shadow_req = io_get_req(ctx, NULL); @@ -2754,6 +2742,16 @@ static int io_submit_sqes(struct io_ring_ctx *ctx, unsigned int nr, trace_io_uring_submit_sqe(ctx, s.sqe->user_data, true, async); io_submit_sqe(ctx, &s, statep, &link); submitted++; + + /* + * If previous wasn't linked and we have a linked command, + * that's the end of the chain. Submit the previous link. + */ + if (!(s.sqe->flags & IOSQE_IO_LINK) && link) { + io_queue_link_head(ctx, link, &link->submit, shadow_req); + link = NULL; + shadow_req = NULL; + } }
if (link)
From: Jens Axboe axboe@kernel.dk
mainline inclusion from mainline-5.5-rc1 commit 5f8fd2d3e0a7aa7fc9d97226be24286edd289835 category: feature bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27 CVE: NA ---------------------------
Now that io-wq supports separating the two request lifetime types, mark the following IO as having unbounded runtimes:
- Any read/write to a non-regular file
- Any specific networked IO
- Any poll command
Signed-off-by: Jens Axboe axboe@kernel.dk Signed-off-by: Zhihao Cheng chengzhihao1@huawei.com Signed-off-by: yangerkun yangerkun@huawei.com Reviewed-by: zhangyi (F) yi.zhang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- fs/io_uring.c | 16 +++++++++++++++- 1 file changed, 15 insertions(+), 1 deletion(-)
diff --git a/fs/io_uring.c b/fs/io_uring.c index 3b0e132a21f8..6f5edbb83f86 100644 --- a/fs/io_uring.c +++ b/fs/io_uring.c @@ -505,6 +505,20 @@ static inline bool io_prep_async_work(struct io_kiocb *req) case IORING_OP_WRITEV: case IORING_OP_WRITE_FIXED: do_hashed = true; + /* fall-through */ + case IORING_OP_READV: + case IORING_OP_READ_FIXED: + case IORING_OP_SENDMSG: + case IORING_OP_RECVMSG: + case IORING_OP_ACCEPT: + case IORING_OP_POLL_ADD: + /* + * We know REQ_F_ISREG is not set on some of these + * opcodes, but this enables us to keep the check in + * just one place. + */ + if (!(req->flags & REQ_F_ISREG)) + req->work.flags |= IO_WQ_WORK_UNBOUND; break; } if (io_sqe_needs_user(req->submit.sqe)) @@ -3745,7 +3759,7 @@ static int io_sq_offload_start(struct io_ring_ctx *ctx,
/* Do QD, or 4 * CPUS, whatever is smallest */ concurrency = min(ctx->sq_entries, 4 * num_online_cpus()); - ctx->io_wq = io_wq_create(concurrency, ctx->sqo_mm, NULL); + ctx->io_wq = io_wq_create(concurrency, ctx->sqo_mm, ctx->user); if (IS_ERR(ctx->io_wq)) { ret = PTR_ERR(ctx->io_wq); ctx->io_wq = NULL;
From: Jens Axboe axboe@kernel.dk
mainline inclusion from mainline-5.5-rc1 commit 206aefde4f886fdeb3b6339aacab3a85fb74cb7e category: feature bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27 CVE: NA ---------------------------
With the recent flurry of additions and changes to io_uring, the layout of io_ring_ctx has become a bit stale. We're right now at 704 bytes in size on my x86-64 build, or 11 cachelines. This patch does two things:
- We have two completion structs embedded, which we only use for quiesce of the ctx (or shutdown) and for sqthread init cases. That's 2x32 bytes right there; let's dynamically allocate them.
- Reorder the struct a bit with an eye on cachelines, use cases, and holes.
With this patch, we're down to 512 bytes, or 8 cachelines.
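One way to keep an eye on regressions here is a compile-time size check (an illustrative sketch, not part of the patch; the 8x64-byte figure matches the x86-64 build quoted above, and SMP_CACHE_BYTES varies by config):

#include <linux/build_bug.h>
#include <linux/cache.h>

static void io_ring_ctx_layout_check(void)
{
	/* 8 cachelines of SMP_CACHE_BYTES each, 512 bytes on this config */
	BUILD_BUG_ON(sizeof(struct io_ring_ctx) > 8 * SMP_CACHE_BYTES);
}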
Reviewed-by: Jackie Liu liuyun01@kylinos.cn Signed-off-by: Jens Axboe axboe@kernel.dk
Conflicts: fs/io_uring.c [ Patch 214828962de("io_uring: initialize percpu refcounters using PERCU_REF_ALLOW_REINIT") is not applied. ]
Signed-off-by: Zhihao Cheng chengzhihao1@huawei.com Signed-off-by: yangerkun yangerkun@huawei.com Reviewed-by: zhangyi (F) yi.zhang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- fs/io_uring.c | 69 ++++++++++++++++++++++++++++----------------------- 1 file changed, 38 insertions(+), 31 deletions(-)
diff --git a/fs/io_uring.c b/fs/io_uring.c index 6f5edbb83f86..914a999a458b 100644 --- a/fs/io_uring.c +++ b/fs/io_uring.c @@ -204,6 +204,7 @@ struct io_ring_ctx { unsigned sq_mask; unsigned sq_thread_idle; unsigned cached_sq_dropped; + atomic_t cached_cq_overflow; struct io_uring_sqe *sq_sqes;
struct list_head defer_list; @@ -213,25 +214,13 @@ struct io_ring_ctx { wait_queue_head_t inflight_wait; } ____cacheline_aligned_in_smp;
+ struct io_rings *rings; + /* IO offload */ struct io_wq *io_wq; struct task_struct *sqo_thread; /* if using sq thread polling */ struct mm_struct *sqo_mm; wait_queue_head_t sqo_wait; - struct completion sqo_thread_started; - - struct { - unsigned cached_cq_tail; - atomic_t cached_cq_overflow; - unsigned cq_entries; - unsigned cq_mask; - struct wait_queue_head cq_wait; - struct fasync_struct *cq_fasync; - struct eventfd_ctx *cq_ev_fd; - atomic_t cq_timeouts; - } ____cacheline_aligned_in_smp; - - struct io_rings *rings;
/* * If used, fixed file set. Writers must ensure that ->refs is dead, @@ -247,7 +236,22 @@ struct io_ring_ctx {
struct user_struct *user;
- struct completion ctx_done; + /* 0 is for ctx quiesce/reinit/free, 1 is for sqo_thread started */ + struct completion *completions; + +#if defined(CONFIG_UNIX) + struct socket *ring_sock; +#endif + + struct { + unsigned cached_cq_tail; + unsigned cq_entries; + unsigned cq_mask; + atomic_t cq_timeouts; + struct wait_queue_head cq_wait; + struct fasync_struct *cq_fasync; + struct eventfd_ctx *cq_ev_fd; + } ____cacheline_aligned_in_smp;
struct { struct mutex uring_lock; @@ -269,10 +273,6 @@ struct io_ring_ctx { spinlock_t inflight_lock; struct list_head inflight_list; } ____cacheline_aligned_in_smp; - -#if defined(CONFIG_UNIX) - struct socket *ring_sock; -#endif };
struct sqe_submit { @@ -397,7 +397,7 @@ static void io_ring_ctx_ref_free(struct percpu_ref *ref) { struct io_ring_ctx *ctx = container_of(ref, struct io_ring_ctx, refs);
- complete(&ctx->ctx_done); + complete(&ctx->completions[0]); }
static struct io_ring_ctx *io_ring_ctx_alloc(struct io_uring_params *p) @@ -408,16 +408,18 @@ static struct io_ring_ctx *io_ring_ctx_alloc(struct io_uring_params *p) if (!ctx) return NULL;
- if (percpu_ref_init(&ctx->refs, io_ring_ctx_ref_free, 0, GFP_KERNEL)) { - kfree(ctx); - return NULL; - } + ctx->completions = kmalloc(2 * sizeof(struct completion), GFP_KERNEL); + if (!ctx->completions) + goto err; + + if (percpu_ref_init(&ctx->refs, io_ring_ctx_ref_free, 0, GFP_KERNEL)) + goto err;
ctx->flags = p->flags; init_waitqueue_head(&ctx->cq_wait); INIT_LIST_HEAD(&ctx->cq_overflow_list); - init_completion(&ctx->ctx_done); - init_completion(&ctx->sqo_thread_started); + init_completion(&ctx->completions[0]); + init_completion(&ctx->completions[1]); mutex_init(&ctx->uring_lock); init_waitqueue_head(&ctx->wait); spin_lock_init(&ctx->completion_lock); @@ -429,6 +431,10 @@ static struct io_ring_ctx *io_ring_ctx_alloc(struct io_uring_params *p) spin_lock_init(&ctx->inflight_lock); INIT_LIST_HEAD(&ctx->inflight_list); return ctx; +err: + kfree(ctx->completions); + kfree(ctx); + return NULL; }
static inline bool __io_sequence_defer(struct io_ring_ctx *ctx, @@ -3046,7 +3052,7 @@ static int io_sq_thread(void *data) unsigned inflight; unsigned long timeout;
- complete(&ctx->sqo_thread_started); + complete(&ctx->completions[1]);
old_fs = get_fs(); set_fs(USER_DS); @@ -3286,7 +3292,7 @@ static int io_sqe_files_unregister(struct io_ring_ctx *ctx) static void io_sq_thread_stop(struct io_ring_ctx *ctx) { if (ctx->sqo_thread) { - wait_for_completion(&ctx->sqo_thread_started); + wait_for_completion(&ctx->completions[1]); /* * The park is a bit of a work-around, without it we get * warning spews on shutdown with SQPOLL set and affinity @@ -4109,6 +4115,7 @@ static void io_ring_ctx_free(struct io_ring_ctx *ctx) io_unaccount_mem(ctx->user, ring_pages(ctx->sq_entries, ctx->cq_entries)); free_uid(ctx->user); + kfree(ctx->completions); kfree(ctx); }
@@ -4153,7 +4160,7 @@ static void io_ring_ctx_wait_and_kill(struct io_ring_ctx *ctx)
io_iopoll_reap_events(ctx); io_cqring_overflow_flush(ctx, true); - wait_for_completion(&ctx->ctx_done); + wait_for_completion(&ctx->completions[0]); io_ring_ctx_free(ctx); }
@@ -4556,7 +4563,7 @@ static int __io_uring_register(struct io_ring_ctx *ctx, unsigned opcode, * no new references will come in after we've killed the percpu ref. */ mutex_unlock(&ctx->uring_lock); - wait_for_completion(&ctx->ctx_done); + wait_for_completion(&ctx->completions[0]); mutex_lock(&ctx->uring_lock);
switch (opcode) { @@ -4599,7 +4606,7 @@ static int __io_uring_register(struct io_ring_ctx *ctx, unsigned opcode, }
/* bring the ctx back to life */ - reinit_completion(&ctx->ctx_done); + reinit_completion(&ctx->completions[0]); percpu_ref_reinit(&ctx->refs); return ret; }
From: Jackie Liu liuyun01@kylinos.cn
mainline inclusion from mainline-5.5-rc1 commit a197f664a0db8a6219d9ce949f5f29b89f60fb2b category: feature bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27 CVE: NA ---------------------------
In most of these functions the request is the central object, and req->ctx is already set at initialization time, so there is no need to pass the ctx in from the caller.
Cleanup, no functional change.
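The shape of the conversion, with placeholder function names:

/* before: ctx threaded through as a separate argument */
static void io_do_thing_old(struct io_ring_ctx *ctx, struct io_kiocb *req);

/* after: the request already carries its ctx */
static void io_do_thing_new(struct io_kiocb *req)
{
	struct io_ring_ctx *ctx = req->ctx;

	/* ... body unchanged, ctx used as before ... */
}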
Signed-off-by: Jackie Liu liuyun01@kylinos.cn Signed-off-by: Jens Axboe axboe@kernel.dk Signed-off-by: Zhihao Cheng chengzhihao1@huawei.com Signed-off-by: yangerkun yangerkun@huawei.com Reviewed-by: zhangyi (F) yi.zhang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- fs/io_uring.c | 108 ++++++++++++++++++++++++++------------------------ 1 file changed, 56 insertions(+), 52 deletions(-)
diff --git a/fs/io_uring.c b/fs/io_uring.c index 914a999a458b..51c9b3d2d2ff 100644 --- a/fs/io_uring.c +++ b/fs/io_uring.c @@ -437,20 +437,20 @@ static struct io_ring_ctx *io_ring_ctx_alloc(struct io_uring_params *p) return NULL; }
-static inline bool __io_sequence_defer(struct io_ring_ctx *ctx, - struct io_kiocb *req) +static inline bool __io_sequence_defer(struct io_kiocb *req) { + struct io_ring_ctx *ctx = req->ctx; + return req->sequence != ctx->cached_cq_tail + ctx->cached_sq_dropped + atomic_read(&ctx->cached_cq_overflow); }
-static inline bool io_sequence_defer(struct io_ring_ctx *ctx, - struct io_kiocb *req) +static inline bool io_sequence_defer(struct io_kiocb *req) { if ((req->flags & (REQ_F_IO_DRAIN|REQ_F_IO_DRAINED)) != REQ_F_IO_DRAIN) return false;
- return __io_sequence_defer(ctx, req); + return __io_sequence_defer(req); }
static struct io_kiocb *io_get_deferred_req(struct io_ring_ctx *ctx) @@ -458,7 +458,7 @@ static struct io_kiocb *io_get_deferred_req(struct io_ring_ctx *ctx) struct io_kiocb *req;
req = list_first_entry_or_null(&ctx->defer_list, struct io_kiocb, list); - if (req && !io_sequence_defer(ctx, req)) { + if (req && !io_sequence_defer(req)) { list_del_init(&req->list); return req; } @@ -471,7 +471,7 @@ static struct io_kiocb *io_get_timeout_req(struct io_ring_ctx *ctx) struct io_kiocb *req;
req = list_first_entry_or_null(&ctx->timeout_list, struct io_kiocb, list); - if (req && !__io_sequence_defer(ctx, req)) { + if (req && !__io_sequence_defer(req)) { list_del_init(&req->list); return req; } @@ -534,10 +534,10 @@ static inline bool io_prep_async_work(struct io_kiocb *req) return do_hashed; }
-static inline void io_queue_async_work(struct io_ring_ctx *ctx, - struct io_kiocb *req) +static inline void io_queue_async_work(struct io_kiocb *req) { bool do_hashed = io_prep_async_work(req); + struct io_ring_ctx *ctx = req->ctx;
trace_io_uring_queue_async_work(ctx, do_hashed, req, &req->work, req->flags); @@ -588,7 +588,7 @@ static void io_commit_cqring(struct io_ring_ctx *ctx) continue; } req->flags |= REQ_F_IO_DRAINED; - io_queue_async_work(ctx, req); + io_queue_async_work(req); } }
@@ -791,9 +791,9 @@ static void __io_free_req(struct io_kiocb *req) kmem_cache_free(req_cachep, req); }
-static bool io_link_cancel_timeout(struct io_ring_ctx *ctx, - struct io_kiocb *req) +static bool io_link_cancel_timeout(struct io_kiocb *req) { + struct io_ring_ctx *ctx = req->ctx; int ret;
ret = hrtimer_try_to_cancel(&req->timeout.timer); @@ -833,7 +833,7 @@ static void io_req_link_next(struct io_kiocb *req, struct io_kiocb **nxtptr) * in this context instead of having to queue up new async work. */ if (req->flags & REQ_F_LINK_TIMEOUT) { - wake_ev = io_link_cancel_timeout(ctx, nxt); + wake_ev = io_link_cancel_timeout(nxt);
/* we dropped this link, get next */ nxt = list_first_entry_or_null(&req->link_list, @@ -842,7 +842,7 @@ static void io_req_link_next(struct io_kiocb *req, struct io_kiocb **nxtptr) *nxtptr = nxt; break; } else { - io_queue_async_work(req->ctx, nxt); + io_queue_async_work(nxt); break; } } @@ -870,7 +870,7 @@ static void io_fail_links(struct io_kiocb *req)
if ((req->flags & REQ_F_LINK_TIMEOUT) && link->submit.sqe->opcode == IORING_OP_LINK_TIMEOUT) { - io_link_cancel_timeout(ctx, link); + io_link_cancel_timeout(link); } else { io_cqring_fill_event(link, -ECANCELED); io_double_put_req(link); @@ -939,7 +939,7 @@ static void io_put_req(struct io_kiocb *req, struct io_kiocb **nxtptr) if (nxtptr) *nxtptr = nxt; else - io_queue_async_work(nxt->ctx, nxt); + io_queue_async_work(nxt); } }
@@ -1899,7 +1899,7 @@ static void io_poll_remove_one(struct io_kiocb *req) WRITE_ONCE(poll->canceled, true); if (!list_empty(&poll->wait.entry)) { list_del_init(&poll->wait.entry); - io_queue_async_work(req->ctx, req); + io_queue_async_work(req); } spin_unlock(&poll->head->lock);
@@ -1951,9 +1951,10 @@ static int io_poll_remove(struct io_kiocb *req, const struct io_uring_sqe *sqe) return 0; }
-static void io_poll_complete(struct io_ring_ctx *ctx, struct io_kiocb *req, - __poll_t mask) +static void io_poll_complete(struct io_kiocb *req, __poll_t mask) { + struct io_ring_ctx *ctx = req->ctx; + req->poll.done = true; io_cqring_fill_event(req, mangle_poll(mask)); io_commit_cqring(ctx); @@ -1989,7 +1990,7 @@ static void io_poll_complete_work(struct io_wq_work **workptr) return; } list_del_init(&req->list); - io_poll_complete(ctx, req, mask); + io_poll_complete(req, mask); spin_unlock_irq(&ctx->completion_lock);
io_cqring_ev_posted(ctx); @@ -2017,13 +2018,13 @@ static int io_poll_wake(struct wait_queue_entry *wait, unsigned mode, int sync,
if (mask && spin_trylock_irqsave(&ctx->completion_lock, flags)) { list_del(&req->list); - io_poll_complete(ctx, req, mask); + io_poll_complete(req, mask); spin_unlock_irqrestore(&ctx->completion_lock, flags);
io_cqring_ev_posted(ctx); io_put_req(req, NULL); } else { - io_queue_async_work(ctx, req); + io_queue_async_work(req); }
return 1; @@ -2108,7 +2109,7 @@ static int io_poll_add(struct io_kiocb *req, const struct io_uring_sqe *sqe, } if (mask) { /* no async, we'd stolen it */ ipt.error = 0; - io_poll_complete(ctx, req, mask); + io_poll_complete(req, mask); } spin_unlock_irq(&ctx->completion_lock);
@@ -2355,12 +2356,13 @@ static int io_async_cancel(struct io_kiocb *req, const struct io_uring_sqe *sqe, return 0; }
-static int io_req_defer(struct io_ring_ctx *ctx, struct io_kiocb *req) +static int io_req_defer(struct io_kiocb *req) { const struct io_uring_sqe *sqe = req->submit.sqe; struct io_uring_sqe *sqe_copy; + struct io_ring_ctx *ctx = req->ctx;
- if (!io_sequence_defer(ctx, req) && list_empty(&ctx->defer_list)) + if (!io_sequence_defer(req) && list_empty(&ctx->defer_list)) return 0;
sqe_copy = kmalloc(sizeof(*sqe_copy), GFP_KERNEL); @@ -2368,7 +2370,7 @@ static int io_req_defer(struct io_ring_ctx *ctx, struct io_kiocb *req) return -EAGAIN;
spin_lock_irq(&ctx->completion_lock); - if (!io_sequence_defer(ctx, req) && list_empty(&ctx->defer_list)) { + if (!io_sequence_defer(req) && list_empty(&ctx->defer_list)) { spin_unlock_irq(&ctx->completion_lock); kfree(sqe_copy); return 0; @@ -2383,11 +2385,12 @@ static int io_req_defer(struct io_ring_ctx *ctx, struct io_kiocb *req) return -EIOCBQUEUED; }
-static int __io_submit_sqe(struct io_ring_ctx *ctx, struct io_kiocb *req, - struct io_kiocb **nxt, bool force_nonblock) +static int __io_submit_sqe(struct io_kiocb *req, struct io_kiocb **nxt, + bool force_nonblock) { int ret, opcode; struct sqe_submit *s = &req->submit; + struct io_ring_ctx *ctx = req->ctx;
opcode = READ_ONCE(s->sqe->opcode); switch (opcode) { @@ -2467,7 +2470,6 @@ static void io_wq_submit_work(struct io_wq_work **workptr) { struct io_wq_work *work = *workptr; struct io_kiocb *req = container_of(work, struct io_kiocb, work); - struct io_ring_ctx *ctx = req->ctx; struct sqe_submit *s = &req->submit; const struct io_uring_sqe *sqe = s->sqe; struct io_kiocb *nxt = NULL; @@ -2483,7 +2485,7 @@ static void io_wq_submit_work(struct io_wq_work **workptr) s->has_user = (work->flags & IO_WQ_WORK_HAS_MM) != 0; s->in_async = true; do { - ret = __io_submit_sqe(ctx, req, &nxt, false); + ret = __io_submit_sqe(req, &nxt, false); /* * We can get EAGAIN for polled IO even though we're * forcing a sync submission from here, since we can't @@ -2537,10 +2539,10 @@ static inline struct file *io_file_from_index(struct io_ring_ctx *ctx, return table->files[index & IORING_FILE_TABLE_MASK]; }
-static int io_req_set_file(struct io_ring_ctx *ctx, - struct io_submit_state *state, struct io_kiocb *req) +static int io_req_set_file(struct io_submit_state *state, struct io_kiocb *req) { struct sqe_submit *s = &req->submit; + struct io_ring_ctx *ctx = req->ctx; unsigned flags; int fd;
@@ -2580,9 +2582,10 @@ static int io_req_set_file(struct io_ring_ctx *ctx, return 0; }
-static int io_grab_files(struct io_ring_ctx *ctx, struct io_kiocb *req) +static int io_grab_files(struct io_kiocb *req) { int ret = -EBADF; + struct io_ring_ctx *ctx = req->ctx;
rcu_read_lock(); spin_lock_irq(&ctx->inflight_lock); @@ -2698,7 +2701,7 @@ static inline struct io_kiocb *io_get_linked_timeout(struct io_kiocb *req) return NULL; }
-static int __io_queue_sqe(struct io_ring_ctx *ctx, struct io_kiocb *req) +static int __io_queue_sqe(struct io_kiocb *req) { struct io_kiocb *nxt; int ret; @@ -2710,7 +2713,7 @@ static int __io_queue_sqe(struct io_ring_ctx *ctx, struct io_kiocb *req) goto err; }
- ret = __io_submit_sqe(ctx, req, NULL, true); + ret = __io_submit_sqe(req, NULL, true);
/* * We async punt it if the file wasn't marked NOWAIT, or if the file @@ -2725,7 +2728,7 @@ static int __io_queue_sqe(struct io_ring_ctx *ctx, struct io_kiocb *req) if (sqe_copy) { s->sqe = sqe_copy; if (req->work.flags & IO_WQ_WORK_NEEDS_FILES) { - ret = io_grab_files(ctx, req); + ret = io_grab_files(req); if (ret) { kfree(sqe_copy); goto err; @@ -2736,7 +2739,7 @@ static int __io_queue_sqe(struct io_ring_ctx *ctx, struct io_kiocb *req) * Queued up for async execution, worker will release * submit reference when the iocb is actually submitted. */ - io_queue_async_work(ctx, req); + io_queue_async_work(req); return 0; } } @@ -2756,11 +2759,11 @@ static int __io_queue_sqe(struct io_ring_ctx *ctx, struct io_kiocb *req) return ret; }
-static int io_queue_sqe(struct io_ring_ctx *ctx, struct io_kiocb *req) +static int io_queue_sqe(struct io_kiocb *req) { int ret;
- ret = io_req_defer(ctx, req); + ret = io_req_defer(req); if (ret) { if (ret != -EIOCBQUEUED) { io_cqring_add_event(req, ret); @@ -2769,17 +2772,17 @@ static int io_queue_sqe(struct io_ring_ctx *ctx, struct io_kiocb *req) return 0; }
- return __io_queue_sqe(ctx, req); + return __io_queue_sqe(req); }
-static int io_queue_link_head(struct io_ring_ctx *ctx, struct io_kiocb *req, - struct io_kiocb *shadow) +static int io_queue_link_head(struct io_kiocb *req, struct io_kiocb *shadow) { int ret; int need_submit = false; + struct io_ring_ctx *ctx = req->ctx;
if (!shadow) - return io_queue_sqe(ctx, req); + return io_queue_sqe(req);
/* * Mark the first IO in link list as DRAIN, let all the following @@ -2787,7 +2790,7 @@ static int io_queue_link_head(struct io_ring_ctx *ctx, struct io_kiocb *req, * list. */ req->flags |= REQ_F_IO_DRAIN; - ret = io_req_defer(ctx, req); + ret = io_req_defer(req); if (ret) { if (ret != -EIOCBQUEUED) { io_cqring_add_event(req, ret); @@ -2810,18 +2813,19 @@ static int io_queue_link_head(struct io_ring_ctx *ctx, struct io_kiocb *req, spin_unlock_irq(&ctx->completion_lock);
if (need_submit) - return __io_queue_sqe(ctx, req); + return __io_queue_sqe(req);
return 0; }
#define SQE_VALID_FLAGS (IOSQE_FIXED_FILE|IOSQE_IO_DRAIN|IOSQE_IO_LINK)
-static void io_submit_sqe(struct io_ring_ctx *ctx, struct io_kiocb *req, - struct io_submit_state *state, struct io_kiocb **link) +static void io_submit_sqe(struct io_kiocb *req, struct io_submit_state *state, + struct io_kiocb **link) { struct io_uring_sqe *sqe_copy; struct sqe_submit *s = &req->submit; + struct io_ring_ctx *ctx = req->ctx; int ret;
req->user_data = s->sqe->user_data; @@ -2832,7 +2836,7 @@ static void io_submit_sqe(struct io_ring_ctx *ctx, struct io_kiocb *req, goto err_req; }
- ret = io_req_set_file(ctx, state, req); + ret = io_req_set_file(state, req); if (unlikely(ret)) { err_req: io_cqring_add_event(req, ret); @@ -2869,7 +2873,7 @@ static void io_submit_sqe(struct io_ring_ctx *ctx, struct io_kiocb *req, ret = -EINVAL; goto err_req; } else { - io_queue_sqe(ctx, req); + io_queue_sqe(req); } }
@@ -3018,7 +3022,7 @@ static int io_submit_sqes(struct io_ring_ctx *ctx, unsigned int nr, req->submit.needs_fixed_file = async; trace_io_uring_submit_sqe(ctx, req->submit.sqe->user_data, true, async); - io_submit_sqe(ctx, req, statep, &link); + io_submit_sqe(req, statep, &link); submitted++;
/* @@ -3026,14 +3030,14 @@ static int io_submit_sqes(struct io_ring_ctx *ctx, unsigned int nr, * that's the end of the chain. Submit the previous link. */ if (!(sqe_flags & IOSQE_IO_LINK) && link) { - io_queue_link_head(ctx, link, shadow_req); + io_queue_link_head(link, shadow_req); link = NULL; shadow_req = NULL; } }
if (link) - io_queue_link_head(ctx, link, shadow_req); + io_queue_link_head(link, shadow_req); if (statep) io_submit_state_end(&state);
From: Jackie Liu liuyun01@kylinos.cn
mainline inclusion from mainline-5.5-rc1 commit ec9c02ad4c3808d6d9ed28ad1d0485d6e2a33ac5 category: feature bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27 CVE: NA ---------------------------
We already have io_put_req_find_next() to find the next request in a link; io_put_req() should not also be used for that. The two should be functions at the same level.
Signed-off-by: Jackie Liu liuyun01@kylinos.cn Signed-off-by: Jens Axboe axboe@kernel.dk Signed-off-by: Zhihao Cheng chengzhihao1@huawei.com Signed-off-by: yangerkun yangerkun@huawei.com Reviewed-by: zhangyi (F) yi.zhang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- fs/io_uring.c | 73 ++++++++++++++++++++++++++------------------------- 1 file changed, 37 insertions(+), 36 deletions(-)
diff --git a/fs/io_uring.c b/fs/io_uring.c index 51c9b3d2d2ff..b55ad3414218 100644 --- a/fs/io_uring.c +++ b/fs/io_uring.c @@ -373,7 +373,7 @@ struct io_submit_state { static void io_wq_submit_work(struct io_wq_work **workptr); static void io_cqring_fill_event(struct io_kiocb *req, long res); static void __io_free_req(struct io_kiocb *req); -static void io_put_req(struct io_kiocb *req, struct io_kiocb **nxtptr); +static void io_put_req(struct io_kiocb *req); static void io_double_put_req(struct io_kiocb *req);
static struct kmem_cache *req_cachep; @@ -558,7 +558,7 @@ static void io_kill_timeout(struct io_kiocb *req) atomic_inc(&req->ctx->cq_timeouts); list_del_init(&req->list); io_cqring_fill_event(req, 0); - io_put_req(req, NULL); + io_put_req(req); } }
@@ -667,7 +667,7 @@ static void io_cqring_overflow_flush(struct io_ring_ctx *ctx, bool force) while (!list_empty(&list)) { req = list_first_entry(&list, struct io_kiocb, list); list_del(&req->list); - io_put_req(req, NULL); + io_put_req(req); } }
@@ -801,7 +801,7 @@ static bool io_link_cancel_timeout(struct io_kiocb *req) io_cqring_fill_event(req, -ECANCELED); io_commit_cqring(ctx); req->flags &= ~REQ_F_LINK; - io_put_req(req, NULL); + io_put_req(req); return true; }
@@ -920,21 +920,13 @@ static void io_free_req(struct io_kiocb *req, struct io_kiocb **nxt) * Drop reference to request, return next in chain (if there is one) if this * was the last reference to this request. */ -static struct io_kiocb *io_put_req_find_next(struct io_kiocb *req) +static void io_put_req_find_next(struct io_kiocb *req, struct io_kiocb **nxtptr) { struct io_kiocb *nxt = NULL;
if (refcount_dec_and_test(&req->refs)) io_free_req(req, &nxt);
- return nxt; -} - -static void io_put_req(struct io_kiocb *req, struct io_kiocb **nxtptr) -{ - struct io_kiocb *nxt; - - nxt = io_put_req_find_next(req); if (nxt) { if (nxtptr) *nxtptr = nxt; @@ -943,6 +935,12 @@ static void io_put_req(struct io_kiocb *req, struct io_kiocb **nxtptr) } }
+static void io_put_req(struct io_kiocb *req) +{ + if (refcount_dec_and_test(&req->refs)) + io_free_req(req, NULL); +} + static void io_double_put_req(struct io_kiocb *req) { /* drop both submit and complete references */ @@ -1196,15 +1194,18 @@ static void io_complete_rw(struct kiocb *kiocb, long res, long res2) struct io_kiocb *req = container_of(kiocb, struct io_kiocb, rw);
io_complete_rw_common(kiocb, res); - io_put_req(req, NULL); + io_put_req(req); }
static struct io_kiocb *__io_complete_rw(struct kiocb *kiocb, long res) { struct io_kiocb *req = container_of(kiocb, struct io_kiocb, rw); + struct io_kiocb *nxt = NULL;
io_complete_rw_common(kiocb, res); - return io_put_req_find_next(req); + io_put_req_find_next(req, &nxt); + + return nxt; }
static void io_complete_rw_iopoll(struct kiocb *kiocb, long res, long res2) @@ -1698,7 +1699,7 @@ static int io_nop(struct io_kiocb *req) return -EINVAL;
io_cqring_add_event(req, 0); - io_put_req(req, NULL); + io_put_req(req); return 0; }
@@ -1745,7 +1746,7 @@ static int io_fsync(struct io_kiocb *req, const struct io_uring_sqe *sqe, if (ret < 0 && (req->flags & REQ_F_LINK)) req->flags |= REQ_F_FAIL_LINK; io_cqring_add_event(req, ret); - io_put_req(req, nxt); + io_put_req_find_next(req, nxt); return 0; }
@@ -1792,7 +1793,7 @@ static int io_sync_file_range(struct io_kiocb *req, if (ret < 0 && (req->flags & REQ_F_LINK)) req->flags |= REQ_F_FAIL_LINK; io_cqring_add_event(req, ret); - io_put_req(req, nxt); + io_put_req_find_next(req, nxt); return 0; }
@@ -1830,7 +1831,7 @@ static int io_send_recvmsg(struct io_kiocb *req, const struct io_uring_sqe *sqe, io_cqring_add_event(req, ret); if (ret < 0 && (req->flags & REQ_F_LINK)) req->flags |= REQ_F_FAIL_LINK; - io_put_req(req, nxt); + io_put_req_find_next(req, nxt); return 0; } #endif @@ -1884,7 +1885,7 @@ static int io_accept(struct io_kiocb *req, const struct io_uring_sqe *sqe, if (ret < 0 && (req->flags & REQ_F_LINK)) req->flags |= REQ_F_FAIL_LINK; io_cqring_add_event(req, ret); - io_put_req(req, nxt); + io_put_req_find_next(req, nxt); return 0; #else return -EOPNOTSUPP; @@ -1947,7 +1948,7 @@ static int io_poll_remove(struct io_kiocb *req, const struct io_uring_sqe *sqe) io_cqring_add_event(req, ret); if (ret < 0 && (req->flags & REQ_F_LINK)) req->flags |= REQ_F_FAIL_LINK; - io_put_req(req, NULL); + io_put_req(req); return 0; }
@@ -1995,7 +1996,7 @@ static void io_poll_complete_work(struct io_wq_work **workptr)
io_cqring_ev_posted(ctx);
- io_put_req(req, &nxt); + io_put_req_find_next(req, &nxt); if (nxt) *workptr = &nxt->work; } @@ -2022,7 +2023,7 @@ static int io_poll_wake(struct wait_queue_entry *wait, unsigned mode, int sync, spin_unlock_irqrestore(&ctx->completion_lock, flags);
io_cqring_ev_posted(ctx); - io_put_req(req, NULL); + io_put_req(req); } else { io_queue_async_work(req); } @@ -2115,7 +2116,7 @@ static int io_poll_add(struct io_kiocb *req, const struct io_uring_sqe *sqe,
if (mask) { io_cqring_ev_posted(ctx); - io_put_req(req, nxt); + io_put_req_find_next(req, nxt); } return ipt.error; } @@ -2157,7 +2158,7 @@ static enum hrtimer_restart io_timeout_fn(struct hrtimer *timer) io_cqring_ev_posted(ctx); if (req->flags & REQ_F_LINK) req->flags |= REQ_F_FAIL_LINK; - io_put_req(req, NULL); + io_put_req(req); return HRTIMER_NORESTART; }
@@ -2200,7 +2201,7 @@ static int io_timeout_remove(struct io_kiocb *req, io_cqring_ev_posted(ctx); if (req->flags & REQ_F_LINK) req->flags |= REQ_F_FAIL_LINK; - io_put_req(req, NULL); + io_put_req(req); return 0; }
@@ -2216,8 +2217,8 @@ static int io_timeout_remove(struct io_kiocb *req, spin_unlock_irq(&ctx->completion_lock); io_cqring_ev_posted(ctx);
- io_put_req(treq, NULL); - io_put_req(req, NULL); + io_put_req(treq); + io_put_req(req); return 0; }
@@ -2352,7 +2353,7 @@ static int io_async_cancel(struct io_kiocb *req, const struct io_uring_sqe *sqe, if (ret < 0 && (req->flags & REQ_F_LINK)) req->flags |= REQ_F_FAIL_LINK; io_cqring_add_event(req, ret); - io_put_req(req, nxt); + io_put_req_find_next(req, nxt); return 0; }
@@ -2498,13 +2499,13 @@ static void io_wq_submit_work(struct io_wq_work **workptr) }
/* drop submission reference */ - io_put_req(req, NULL); + io_put_req(req);
if (ret) { if (req->flags & REQ_F_LINK) req->flags |= REQ_F_FAIL_LINK; io_cqring_add_event(req, ret); - io_put_req(req, NULL); + io_put_req(req); }
/* async context always use a copy of the sqe */ @@ -2635,7 +2636,7 @@ static enum hrtimer_restart io_link_timeout_fn(struct hrtimer *timer) }
io_cqring_add_event(req, ret); - io_put_req(req, NULL); + io_put_req(req); return HRTIMER_NORESTART; }
@@ -2667,7 +2668,7 @@ static int io_queue_linked_timeout(struct io_kiocb *req, struct io_kiocb *nxt) ret = 0; err: /* drop submission reference */ - io_put_req(nxt, NULL); + io_put_req(nxt);
if (ret) { struct io_ring_ctx *ctx = req->ctx; @@ -2680,7 +2681,7 @@ static int io_queue_linked_timeout(struct io_kiocb *req, struct io_kiocb *nxt) io_cqring_fill_event(nxt, ret); trace_io_uring_fail_link(req, nxt); io_commit_cqring(ctx); - io_put_req(nxt, NULL); + io_put_req(nxt); ret = -ECANCELED; }
@@ -2746,14 +2747,14 @@ static int __io_queue_sqe(struct io_kiocb *req)
/* drop submission reference */ err: - io_put_req(req, NULL); + io_put_req(req);
/* and drop final reference, if we failed */ if (ret) { io_cqring_add_event(req, ret); if (req->flags & REQ_F_LINK) req->flags |= REQ_F_FAIL_LINK; - io_put_req(req, NULL); + io_put_req(req); }
return ret;
From: Jackie Liu liuyun01@kylinos.cn
mainline inclusion from mainline-5.5-rc1 commit c69f8dbe2426cbf6150407b7e86ce85bb463c1dc category: feature bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27 CVE: NA ---------------------------
Mirroring the distinction between io_put_req and io_put_req_find_next, io_free_req is split the same way into io_free_req and io_free_req_find_next. No functional changes.
Signed-off-by: Jackie Liu liuyun01@kylinos.cn Signed-off-by: Jens Axboe axboe@kernel.dk Signed-off-by: Zhihao Cheng chengzhihao1@huawei.com Signed-off-by: yangerkun yangerkun@huawei.com Reviewed-by: zhangyi (F) yi.zhang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- fs/io_uring.c | 13 +++++++++---- 1 file changed, 9 insertions(+), 4 deletions(-)
diff --git a/fs/io_uring.c b/fs/io_uring.c index b55ad3414218..2b0995ab8998 100644 --- a/fs/io_uring.c +++ b/fs/io_uring.c @@ -882,7 +882,7 @@ static void io_fail_links(struct io_kiocb *req) io_cqring_ev_posted(ctx); }
-static void io_free_req(struct io_kiocb *req, struct io_kiocb **nxt) +static void io_free_req_find_next(struct io_kiocb *req, struct io_kiocb **nxt) { if (likely(!(req->flags & REQ_F_LINK))) { __io_free_req(req); @@ -916,6 +916,11 @@ static void io_free_req(struct io_kiocb *req, struct io_kiocb **nxt) __io_free_req(req); }
+static void io_free_req(struct io_kiocb *req) +{ + io_free_req_find_next(req, NULL); +} + /* * Drop reference to request, return next in chain (if there is one) if this * was the last reference to this request. @@ -925,7 +930,7 @@ static void io_put_req_find_next(struct io_kiocb *req, struct io_kiocb **nxtptr) struct io_kiocb *nxt = NULL;
if (refcount_dec_and_test(&req->refs)) - io_free_req(req, &nxt); + io_free_req_find_next(req, &nxt);
if (nxt) { if (nxtptr) @@ -938,7 +943,7 @@ static void io_put_req_find_next(struct io_kiocb *req, struct io_kiocb **nxtptr) static void io_put_req(struct io_kiocb *req) { if (refcount_dec_and_test(&req->refs)) - io_free_req(req, NULL); + io_free_req(req); }
static void io_double_put_req(struct io_kiocb *req) @@ -1005,7 +1010,7 @@ static void io_iopoll_complete(struct io_ring_ctx *ctx, unsigned int *nr_events, if (to_free == ARRAY_SIZE(reqs)) io_free_req_many(ctx, reqs, &to_free); } else { - io_free_req(req, NULL); + io_free_req(req); } } }
From: Jens Axboe axboe@kernel.dk
mainline inclusion from mainline-5.5-rc1 commit 46568e9be70ff8211d986685f08d919376c32998 category: feature bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27 CVE: NA ---------------------------
syzbot reports that when using failslab and friends, we can get a double free in io_sqe_files_unregister():
BUG: KASAN: double-free or invalid-free in io_sqe_files_unregister+0x20b/0x300 fs/io_uring.c:3185
CPU: 1 PID: 8819 Comm: syz-executor452 Not tainted 5.4.0-rc6-next-20191108 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011 Call Trace: __dump_stack lib/dump_stack.c:77 [inline] dump_stack+0x197/0x210 lib/dump_stack.c:118 print_address_description.constprop.0.cold+0xd4/0x30b mm/kasan/report.c:374 kasan_report_invalid_free+0x65/0xa0 mm/kasan/report.c:468 __kasan_slab_free+0x13a/0x150 mm/kasan/common.c:450 kasan_slab_free+0xe/0x10 mm/kasan/common.c:480 __cache_free mm/slab.c:3426 [inline] kfree+0x10a/0x2c0 mm/slab.c:3757 io_sqe_files_unregister+0x20b/0x300 fs/io_uring.c:3185 io_ring_ctx_free fs/io_uring.c:3998 [inline] io_ring_ctx_wait_and_kill+0x348/0x700 fs/io_uring.c:4060 io_uring_release+0x42/0x50 fs/io_uring.c:4068 __fput+0x2ff/0x890 fs/file_table.c:280 ____fput+0x16/0x20 fs/file_table.c:313 task_work_run+0x145/0x1c0 kernel/task_work.c:113 exit_task_work include/linux/task_work.h:22 [inline] do_exit+0x904/0x2e60 kernel/exit.c:817 do_group_exit+0x135/0x360 kernel/exit.c:921 __do_sys_exit_group kernel/exit.c:932 [inline] __se_sys_exit_group kernel/exit.c:930 [inline] __x64_sys_exit_group+0x44/0x50 kernel/exit.c:930 do_syscall_64+0xfa/0x760 arch/x86/entry/common.c:290 entry_SYSCALL_64_after_hwframe+0x49/0xbe RIP: 0033:0x43f2c8 Code: 31 b8 c5 f7 ff ff 48 8b 5c 24 28 48 8b 6c 24 30 4c 8b 64 24 38 4c 8b 6c 24 40 4c 8b 74 24 48 4c 8b 7c 24 50 48 83 c4 58 c3 66 <0f> 1f 84 00 00 00 00 00 48 8d 35 59 ca 00 00 0f b6 d2 48 89 fb 48 RSP: 002b:00007ffd5b976008 EFLAGS: 00000246 ORIG_RAX: 00000000000000e7 RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 000000000043f2c8 RDX: 0000000000000000 RSI: 000000000000003c RDI: 0000000000000000 RBP: 00000000004bf0a8 R08: 00000000000000e7 R09: ffffffffffffffd0 R10: 0000000000000001 R11: 0000000000000246 R12: 0000000000000001 R13: 00000000006d1180 R14: 0000000000000000 R15: 0000000000000000
This happens if we fail allocating the file tables. For that case we do free the file table correctly, but we forget to set it to NULL. This means that ring teardown will see it as being non-NULL, and attempt to free it again.
Fix this by clearing the file_table pointer if we free the table.
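The fix follows the usual rule for making teardown idempotent: whoever frees a structure that an error path can revisit also clears the pointer to it. A minimal userspace sketch of the pattern (hypothetical names, not the kernel code):

#include <stdlib.h>

struct ctx {
	void *file_table;
};

/* Safe to run twice: the first call frees and clears the pointer,
 * and a later teardown path sees NULL -- free(NULL) is a no-op. */
static void free_file_table(struct ctx *ctx)
{
	free(ctx->file_table);
	ctx->file_table = NULL;
}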
Reported-by: syzbot+3254bc44113ae1e331ee@syzkaller.appspotmail.com Fixes: 65e19f54d29c ("io_uring: support for larger fixed file sets") Reviewed-by: Bob Liu bob.liu@oracle.com Signed-off-by: Jens Axboe axboe@kernel.dk Signed-off-by: Zhihao Cheng chengzhihao1@huawei.com Signed-off-by: yangerkun yangerkun@huawei.com Reviewed-by: zhangyi (F) yi.zhang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- fs/io_uring.c | 1 + 1 file changed, 1 insertion(+)
diff --git a/fs/io_uring.c b/fs/io_uring.c index 2b0995ab8998..7c3d8208eef8 100644 --- a/fs/io_uring.c +++ b/fs/io_uring.c @@ -3488,6 +3488,7 @@ static int io_sqe_files_register(struct io_ring_ctx *ctx, void __user *arg,
if (io_sqe_alloc_file_tables(ctx, nr_tables, nr_args)) { kfree(ctx->file_table); + ctx->file_table = NULL; return -ENOMEM; }
From: Jens Axboe axboe@kernel.dk
mainline inclusion from mainline-5.5-rc1 commit 8e3cca12706231daf8daf90dbde59f1665135e48 category: feature bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27 CVE: NA ---------------------------
If we cancel a pending accept operation with a signal, we get -ERESTARTSYS returned. Turn that into -EINTR for userspace; we should not be returning -ERESTARTSYS.
Fixes: 17f2fe35d080 ("io_uring: add support for IORING_OP_ACCEPT") Reported-by: Hrvoje Zeba zeba.hrvoje@gmail.com Signed-off-by: Jens Axboe axboe@kernel.dk Signed-off-by: Zhihao Cheng chengzhihao1@huawei.com Signed-off-by: yangerkun yangerkun@huawei.com Reviewed-by: zhangyi (F) yi.zhang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- fs/io_uring.c | 2 ++ 1 file changed, 2 insertions(+)
diff --git a/fs/io_uring.c b/fs/io_uring.c index 7c3d8208eef8..1a82dd6ddfd5 100644 --- a/fs/io_uring.c +++ b/fs/io_uring.c @@ -1887,6 +1887,8 @@ static int io_accept(struct io_kiocb *req, const struct io_uring_sqe *sqe, req->work.flags |= IO_WQ_WORK_NEEDS_FILES; return -EAGAIN; } + if (ret == -ERESTARTSYS) + ret = -EINTR; if (ret < 0 && (req->flags & REQ_F_LINK)) req->flags |= REQ_F_FAIL_LINK; io_cqring_add_event(req, ret);
From: Jens Axboe axboe@kernel.dk
mainline inclusion from mainline-5.5-rc1 commit 0ddf92e848ab7abf216f218ee363eb9b9650e98f category: feature bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27 CVE: NA ---------------------------
One thing that really sucks for userspace APIs is if the kernel passes back -ENOMEM/-EAGAIN for resource shortages. The application really has no idea of what to do in those cases. Should it try and reap completions? Probably a good idea. Will it solve the issue? Who knows.
This patch adds a simple fallback mechanism: if we fail to allocate memory from the slab for a request, we punt to a pre-allocated request. There's just one of these per io_ring_ctx, but the important part is that if we ever return -EBUSY to the application, the application knows it can wait for events and make forward progress once events have completed.
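As a rough userspace analog of the mechanism (the kernel version instead claims the fallback request by atomically tagging the low bit of the pointer with test_and_set_bit_lock()), the pattern looks like this; all names are illustrative:

#include <stdatomic.h>
#include <stdlib.h>

struct request { int data; };

/* One pre-allocated request per context; the flag serializes its use. */
static struct request fallback_req;
static atomic_flag fallback_in_use = ATOMIC_FLAG_INIT;

static struct request *get_request(void)
{
	struct request *req = malloc(sizeof(*req));

	if (req)
		return req;
	/* Allocation failed: fall back to the pre-allocated request,
	 * but only if nobody else is currently using it. */
	if (!atomic_flag_test_and_set(&fallback_in_use))
		return &fallback_req;
	return NULL;	/* caller reports -EBUSY; retry after reaping events */
}

static void put_request(struct request *req)
{
	if (req == &fallback_req)
		atomic_flag_clear(&fallback_in_use);
	else
		free(req);
}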
Signed-off-by: Jens Axboe axboe@kernel.dk Signed-off-by: Zhihao Cheng chengzhihao1@huawei.com Signed-off-by: yangerkun yangerkun@huawei.com Reviewed-by: zhangyi (F) yi.zhang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- fs/io_uring.c | 46 ++++++++++++++++++++++++++++++++++++++++------ 1 file changed, 40 insertions(+), 6 deletions(-)
diff --git a/fs/io_uring.c b/fs/io_uring.c index 1a82dd6ddfd5..0cbe02ace776 100644 --- a/fs/io_uring.c +++ b/fs/io_uring.c @@ -239,6 +239,9 @@ struct io_ring_ctx { /* 0 is for ctx quiesce/reinit/free, 1 is for sqo_thread started */ struct completion *completions;
+ /* if all else fails... */ + struct io_kiocb *fallback_req; + #if defined(CONFIG_UNIX) struct socket *ring_sock; #endif @@ -408,6 +411,10 @@ static struct io_ring_ctx *io_ring_ctx_alloc(struct io_uring_params *p) if (!ctx) return NULL;
+ ctx->fallback_req = kmem_cache_alloc(req_cachep, GFP_KERNEL); + if (!ctx->fallback_req) + goto err; + ctx->completions = kmalloc(2 * sizeof(struct completion), GFP_KERNEL); if (!ctx->completions) goto err; @@ -432,6 +439,8 @@ static struct io_ring_ctx *io_ring_ctx_alloc(struct io_uring_params *p) INIT_LIST_HEAD(&ctx->inflight_list); return ctx; err: + if (ctx->fallback_req) + kmem_cache_free(req_cachep, ctx->fallback_req); kfree(ctx->completions); kfree(ctx); return NULL; @@ -711,6 +720,23 @@ static void io_cqring_add_event(struct io_kiocb *req, long res) io_cqring_ev_posted(ctx); }
+static inline bool io_is_fallback_req(struct io_kiocb *req) +{ + return req == (struct io_kiocb *) + ((unsigned long) req->ctx->fallback_req & ~1UL); +} + +static struct io_kiocb *io_get_fallback_req(struct io_ring_ctx *ctx) +{ + struct io_kiocb *req; + + req = ctx->fallback_req; + if (!test_and_set_bit_lock(0, (unsigned long *) ctx->fallback_req)) + return req; + + return NULL; +} + static struct io_kiocb *io_get_req(struct io_ring_ctx *ctx, struct io_submit_state *state) { @@ -723,7 +749,7 @@ static struct io_kiocb *io_get_req(struct io_ring_ctx *ctx, if (!state) { req = kmem_cache_alloc(req_cachep, gfp); if (unlikely(!req)) - goto out; + goto fallback; } else if (!state->free_reqs) { size_t sz; int ret; @@ -738,7 +764,7 @@ static struct io_kiocb *io_get_req(struct io_ring_ctx *ctx, if (unlikely(ret <= 0)) { state->reqs[0] = kmem_cache_alloc(req_cachep, gfp); if (!state->reqs[0]) - goto out; + goto fallback; ret = 1; } state->free_reqs = ret - 1; @@ -750,6 +776,7 @@ static struct io_kiocb *io_get_req(struct io_ring_ctx *ctx, state->cur_req++; }
+got_it: req->file = NULL; req->ctx = ctx; req->flags = 0; @@ -758,7 +785,10 @@ static struct io_kiocb *io_get_req(struct io_ring_ctx *ctx, req->result = 0; INIT_IO_WORK(&req->work, io_wq_submit_work); return req; -out: +fallback: + req = io_get_fallback_req(ctx); + if (req) + goto got_it; percpu_ref_put(&ctx->refs); return NULL; } @@ -788,7 +818,10 @@ static void __io_free_req(struct io_kiocb *req) spin_unlock_irqrestore(&ctx->inflight_lock, flags); } percpu_ref_put(&ctx->refs); - kmem_cache_free(req_cachep, req); + if (likely(!io_is_fallback_req(req))) + kmem_cache_free(req_cachep, req); + else + clear_bit_unlock(0, (unsigned long *) ctx->fallback_req); }
static bool io_link_cancel_timeout(struct io_kiocb *req) @@ -1004,8 +1037,8 @@ static void io_iopoll_complete(struct io_ring_ctx *ctx, unsigned int *nr_events, * completions for those, only batch free for fixed * file and non-linked commands. */ - if ((req->flags & (REQ_F_FIXED_FILE|REQ_F_LINK)) == - REQ_F_FIXED_FILE) { + if (((req->flags & (REQ_F_FIXED_FILE|REQ_F_LINK)) == + REQ_F_FIXED_FILE) && !io_is_fallback_req(req)) { reqs[to_free++] = req; if (to_free == ARRAY_SIZE(reqs)) io_free_req_many(ctx, reqs, &to_free); @@ -4129,6 +4162,7 @@ static void io_ring_ctx_free(struct io_ring_ctx *ctx) ring_pages(ctx->sq_entries, ctx->cq_entries)); free_uid(ctx->user); kfree(ctx->completions); + kmem_cache_free(req_cachep, ctx->fallback_req); kfree(ctx); }
From: Jens Axboe axboe@kernel.dk
mainline inclusion from mainline-5.5-rc1 commit 47f467686ec02fc07fd5c6bb34b6f6736e2884b0 category: feature bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27 CVE: NA ---------------------------
It's a little confusing that we have multiple types of command cancellation opcodes now that we have a generic one. Make the generic one work with POLL_ADD and TIMEOUT commands as well; that makes for an easier-to-use API for the application, and removes the confusion of those commands currently not being covered.
Add a helper that takes care of it, so we can use it from both IORING_OP_ASYNC_CANCEL and from the linked timeout cancellation.
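In outline, the helper cascades through the possible homes of the request and reports the first definitive result. A compilable sketch, with stub cancellers standing in for the real io-wq, timeout and poll lookups (each returns 0 on success, -EALREADY if too late, or -ENOENT if not found here):

#include <errno.h>
#include <stdint.h>

struct ctx;	/* opaque in this sketch */

/* Stubs for io_async_cancel_one(), io_timeout_cancel(), io_poll_cancel() */
static int cancel_async_work(struct ctx *c, uint64_t d) { return -ENOENT; }
static int cancel_timeout(struct ctx *c, uint64_t d)    { return -ENOENT; }
static int cancel_poll(struct ctx *c, uint64_t d)       { return -ENOENT; }

/* Try each source in turn; -ENOENT means "not found here, keep looking". */
static int find_and_cancel(struct ctx *ctx, uint64_t user_data)
{
	int ret;

	ret = cancel_async_work(ctx, user_data);
	if (ret != -ENOENT)
		return ret;
	ret = cancel_timeout(ctx, user_data);
	if (ret != -ENOENT)
		return ret;
	return cancel_poll(ctx, user_data);
}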
Reported-by: Hrvoje Zeba zeba.hrvoje@gmail.com Signed-off-by: Jens Axboe axboe@kernel.dk Signed-off-by: Zhihao Cheng chengzhihao1@huawei.com Signed-off-by: yangerkun yangerkun@huawei.com Reviewed-by: zhangyi (F) yi.zhang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- fs/io_uring.c | 138 +++++++++++++++++++++++++++++--------------------- 1 file changed, 80 insertions(+), 58 deletions(-)
diff --git a/fs/io_uring.c b/fs/io_uring.c index 0cbe02ace776..645939e864db 100644 --- a/fs/io_uring.c +++ b/fs/io_uring.c @@ -1959,6 +1959,20 @@ static void io_poll_remove_all(struct io_ring_ctx *ctx) spin_unlock_irq(&ctx->completion_lock); }
+static int io_poll_cancel(struct io_ring_ctx *ctx, __u64 sqe_addr) +{ + struct io_kiocb *req; + + list_for_each_entry(req, &ctx->cancel_list, list) { + if (req->user_data != sqe_addr) + continue; + io_poll_remove_one(req); + return 0; + } + + return -ENOENT; +} + /* * Find a running poll command that matches one specified in sqe->addr, * and remove it if found. @@ -1966,8 +1980,7 @@ static void io_poll_remove_all(struct io_ring_ctx *ctx) static int io_poll_remove(struct io_kiocb *req, const struct io_uring_sqe *sqe) { struct io_ring_ctx *ctx = req->ctx; - struct io_kiocb *poll_req, *next; - int ret = -ENOENT; + int ret;
if (unlikely(req->ctx->flags & IORING_SETUP_IOPOLL)) return -EINVAL; @@ -1976,13 +1989,7 @@ static int io_poll_remove(struct io_kiocb *req, const struct io_uring_sqe *sqe) return -EINVAL;
spin_lock_irq(&ctx->completion_lock); - list_for_each_entry_safe(poll_req, next, &ctx->cancel_list, list) { - if (READ_ONCE(sqe->addr) == poll_req->user_data) { - io_poll_remove_one(poll_req); - ret = 0; - break; - } - } + ret = io_poll_cancel(ctx, READ_ONCE(sqe->addr)); spin_unlock_irq(&ctx->completion_lock);
io_cqring_add_event(req, ret); @@ -2202,6 +2209,31 @@ static enum hrtimer_restart io_timeout_fn(struct hrtimer *timer) return HRTIMER_NORESTART; }
+static int io_timeout_cancel(struct io_ring_ctx *ctx, __u64 user_data) +{ + struct io_kiocb *req; + int ret = -ENOENT; + + list_for_each_entry(req, &ctx->timeout_list, list) { + if (user_data == req->user_data) { + list_del_init(&req->list); + ret = 0; + break; + } + } + + if (ret == -ENOENT) + return ret; + + ret = hrtimer_try_to_cancel(&req->timeout.timer); + if (ret == -1) + return -EALREADY; + + io_cqring_fill_event(req, -ECANCELED); + io_put_req(req); + return 0; +} + /* * Remove or update an existing timeout command */ @@ -2209,10 +2241,8 @@ static int io_timeout_remove(struct io_kiocb *req, const struct io_uring_sqe *sqe) { struct io_ring_ctx *ctx = req->ctx; - struct io_kiocb *treq; - int ret = -ENOENT; - __u64 user_data; unsigned flags; + int ret;
if (unlikely(ctx->flags & IORING_SETUP_IOPOLL)) return -EINVAL; @@ -2222,42 +2252,15 @@ static int io_timeout_remove(struct io_kiocb *req, if (flags) return -EINVAL;
- user_data = READ_ONCE(sqe->addr); spin_lock_irq(&ctx->completion_lock); - list_for_each_entry(treq, &ctx->timeout_list, list) { - if (user_data == treq->user_data) { - list_del_init(&treq->list); - ret = 0; - break; - } - } + ret = io_timeout_cancel(ctx, READ_ONCE(sqe->addr));
- /* didn't find timeout */ - if (ret) { -fill_ev: - io_cqring_fill_event(req, ret); - io_commit_cqring(ctx); - spin_unlock_irq(&ctx->completion_lock); - io_cqring_ev_posted(ctx); - if (req->flags & REQ_F_LINK) - req->flags |= REQ_F_FAIL_LINK; - io_put_req(req); - return 0; - } - - ret = hrtimer_try_to_cancel(&treq->timeout.timer); - if (ret == -1) { - ret = -EBUSY; - goto fill_ev; - } - - io_cqring_fill_event(req, 0); - io_cqring_fill_event(treq, -ECANCELED); + io_cqring_fill_event(req, ret); io_commit_cqring(ctx); spin_unlock_irq(&ctx->completion_lock); io_cqring_ev_posted(ctx); - - io_put_req(treq); + if (ret < 0 && req->flags & REQ_F_LINK) + req->flags |= REQ_F_FAIL_LINK; io_put_req(req); return 0; } @@ -2374,12 +2377,39 @@ static int io_async_cancel_one(struct io_ring_ctx *ctx, void *sqe_addr) return ret; }
+static void io_async_find_and_cancel(struct io_ring_ctx *ctx, + struct io_kiocb *req, __u64 sqe_addr, + struct io_kiocb **nxt) +{ + unsigned long flags; + int ret; + + ret = io_async_cancel_one(ctx, (void *) (unsigned long) sqe_addr); + if (ret != -ENOENT) { + spin_lock_irqsave(&ctx->completion_lock, flags); + goto done; + } + + spin_lock_irqsave(&ctx->completion_lock, flags); + ret = io_timeout_cancel(ctx, sqe_addr); + if (ret != -ENOENT) + goto done; + ret = io_poll_cancel(ctx, sqe_addr); +done: + io_cqring_fill_event(req, ret); + io_commit_cqring(ctx); + spin_unlock_irqrestore(&ctx->completion_lock, flags); + io_cqring_ev_posted(ctx); + + if (ret < 0 && (req->flags & REQ_F_LINK)) + req->flags |= REQ_F_FAIL_LINK; + io_put_req_find_next(req, nxt); +} + static int io_async_cancel(struct io_kiocb *req, const struct io_uring_sqe *sqe, struct io_kiocb **nxt) { struct io_ring_ctx *ctx = req->ctx; - void *sqe_addr; - int ret;
if (unlikely(ctx->flags & IORING_SETUP_IOPOLL)) return -EINVAL; @@ -2387,13 +2417,7 @@ static int io_async_cancel(struct io_kiocb *req, const struct io_uring_sqe *sqe, sqe->cancel_flags) return -EINVAL;
- sqe_addr = (void *) (unsigned long) READ_ONCE(sqe->addr); - ret = io_async_cancel_one(ctx, sqe_addr); - - if (ret < 0 && (req->flags & REQ_F_LINK)) - req->flags |= REQ_F_FAIL_LINK; - io_cqring_add_event(req, ret); - io_put_req_find_next(req, nxt); + io_async_find_and_cancel(ctx, req, READ_ONCE(sqe->addr), NULL); return 0; }
@@ -2655,7 +2679,6 @@ static enum hrtimer_restart io_link_timeout_fn(struct hrtimer *timer) struct io_ring_ctx *ctx = req->ctx; struct io_kiocb *prev = NULL; unsigned long flags; - int ret = -ETIME;
spin_lock_irqsave(&ctx->completion_lock, flags);
@@ -2671,12 +2694,11 @@ static enum hrtimer_restart io_link_timeout_fn(struct hrtimer *timer) spin_unlock_irqrestore(&ctx->completion_lock, flags);
if (prev) { - void *user_data = (void *) (unsigned long) prev->user_data; - ret = io_async_cancel_one(ctx, user_data); + io_async_find_and_cancel(ctx, req, prev->user_data, NULL); + } else { + io_cqring_add_event(req, -ETIME); + io_put_req(req); } - - io_cqring_add_event(req, ret); - io_put_req(req); return HRTIMER_NORESTART; }
From: Jens Axboe axboe@kernel.dk
mainline inclusion from mainline-5.5-rc1 commit c1edbf5f081be9fbbea68c1d564b773e59c1acf3 category: feature bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27 CVE: NA ---------------------------
Now that we have backpressure, for SQPOLL, we have one more condition that warrants flagging that the application needs to enter the kernel: we failed to submit IO due to backpressure. Make sure we catch that and flag it appropriately.
If we run into backpressure issues with the SQPOLL thread, flag it as such to the application by setting IORING_SQ_NEED_WAKEUP. This will cause the application to enter the kernel, and that will flush the backlog and clear the condition.
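From the application side the contract is unchanged: watch the SQ ring flags and enter the kernel when asked. A sketch of the submission-side check, assuming ring_fd and a pointer to the mmap'ed SQ flags word came from the usual io_uring_setup()/mmap() sequence (error handling omitted):

#include <linux/io_uring.h>
#include <sys/syscall.h>
#include <unistd.h>

static void maybe_enter(int ring_fd, unsigned to_submit, unsigned *sq_flags)
{
	/* Real code should use an acquire load here. */
	if (*(volatile unsigned *)sq_flags & IORING_SQ_NEED_WAKEUP) {
		/* Covers both the idle SQPOLL thread and, after this
		 * patch, the backpressure (-EBUSY) case: entering the
		 * kernel flushes the backlog and clears the flag. */
		syscall(__NR_io_uring_enter, ring_fd, to_submit, 0,
			IORING_ENTER_SQ_WAKEUP, NULL, 0);
	}
}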
Signed-off-by: Jens Axboe axboe@kernel.dk Signed-off-by: Zhihao Cheng chengzhihao1@huawei.com Signed-off-by: yangerkun yangerkun@huawei.com Reviewed-by: zhangyi (F) yi.zhang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- fs/io_uring.c | 22 ++++++++++++++++------ 1 file changed, 16 insertions(+), 6 deletions(-)
diff --git a/fs/io_uring.c b/fs/io_uring.c index 645939e864db..85c21fee7ac0 100644 --- a/fs/io_uring.c +++ b/fs/io_uring.c @@ -3118,16 +3118,16 @@ static int io_sq_thread(void *data) DEFINE_WAIT(wait); unsigned inflight; unsigned long timeout; + int ret;
complete(&ctx->completions[1]);
old_fs = get_fs(); set_fs(USER_DS);
- timeout = inflight = 0; + ret = timeout = inflight = 0; while (!kthread_should_park()) { unsigned int to_submit; - int ret;
if (inflight) { unsigned nr_events = 0; @@ -3161,13 +3161,21 @@ static int io_sq_thread(void *data) }
to_submit = io_sqring_entries(ctx); - if (!to_submit) { + + /* + * If submit got -EBUSY, flag us as needing the application + * to enter the kernel to reap and flush events. + */ + if (!to_submit || ret == -EBUSY) { /* * We're polling. If we're within the defined idle * period, then let us spin without work before going - * to sleep. + * to sleep. The exception is if we got EBUSY doing + * more IO, we should wait for the application to + * reap events and wake us up. */ - if (inflight || !time_after(jiffies, timeout)) { + if (inflight || + (!time_after(jiffies, timeout) && ret != -EBUSY)) { cond_resched(); continue; } @@ -3193,7 +3201,7 @@ static int io_sq_thread(void *data) smp_mb();
to_submit = io_sqring_entries(ctx); - if (!to_submit) { + if (!to_submit || ret == -EBUSY) { if (kthread_should_park()) { finish_wait(&ctx->sqo_wait, &wait); break; @@ -4352,6 +4360,8 @@ SYSCALL_DEFINE6(io_uring_enter, unsigned int, fd, u32, to_submit, */ ret = 0; if (ctx->flags & IORING_SETUP_SQPOLL) { + if (!list_empty_careful(&ctx->cq_overflow_list)) + io_cqring_overflow_flush(ctx, false); if (flags & IORING_ENTER_SQ_WAKEUP) wake_up(&ctx->sqo_wait); submitted = to_submit;
From: Jens Axboe axboe@kernel.dk
mainline inclusion from mainline-5.5-rc1 commit 768134d4f48109b90f4248feecbeeb7d684e410c category: feature bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27 CVE: NA ---------------------------
We can't safely cancel under the inflight lock. If the work hasn't been started yet, then io_wq_cancel_work() simply marks the work as cancelled and invokes the work handler. But if the work completion needs to grab the inflight lock because it's grabbing user files, then we'll deadlock trying to finish the work as we already hold that lock.
Instead grab a reference to the request, if it isn't already zero. If it's zero, then we know it's going through completion anyway, and we can safely ignore it. If it's not zero, then we can drop the lock and attempt to cancel from there.
This also fixes a missing finish_wait() at the end of io_uring_cancel_files().
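The key primitive is "take a reference only if one still exists". A self-contained C11 analog of refcount_inc_not_zero():

#include <stdatomic.h>
#include <stdbool.h>

/* Take a reference only if the object is still live (count > 0).
 * A zero count means the object is already on its way to being freed,
 * so the caller must not touch it. */
static bool ref_get_unless_zero(atomic_int *refs)
{
	int old = atomic_load(refs);

	while (old != 0) {
		if (atomic_compare_exchange_weak(refs, &old, old + 1))
			return true;
		/* old was reloaded by the failed CAS; retry */
	}
	return false;
}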
Signed-off-by: Jens Axboe axboe@kernel.dk Signed-off-by: Zhihao Cheng chengzhihao1@huawei.com Signed-off-by: yangerkun yangerkun@huawei.com Reviewed-by: zhangyi (F) yi.zhang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- fs/io_uring.c | 35 ++++++++++++++++++----------------- 1 file changed, 18 insertions(+), 17 deletions(-)
diff --git a/fs/io_uring.c b/fs/io_uring.c index 85c21fee7ac0..d751e1eb245e 100644 --- a/fs/io_uring.c +++ b/fs/io_uring.c @@ -4258,33 +4258,34 @@ static void io_uring_cancel_files(struct io_ring_ctx *ctx,
while (!list_empty_careful(&ctx->inflight_list)) { enum io_wq_cancel ret = IO_WQ_CANCEL_NOTFOUND; + struct io_kiocb *cancel_req = NULL;
spin_lock_irq(&ctx->inflight_lock); list_for_each_entry(req, &ctx->inflight_list, inflight_entry) { - if (req->work.files == files) { - ret = io_wq_cancel_work(ctx->io_wq, &req->work); - break; - } + if (req->work.files != files) + continue; + /* req is being completed, ignore */ + if (!refcount_inc_not_zero(&req->refs)) + continue; + cancel_req = req; + break; } - if (ret == IO_WQ_CANCEL_RUNNING) + if (cancel_req) prepare_to_wait(&ctx->inflight_wait, &wait, - TASK_UNINTERRUPTIBLE); - + TASK_UNINTERRUPTIBLE); spin_unlock_irq(&ctx->inflight_lock);
- /* - * We need to keep going until we get NOTFOUND. We only cancel - * one work at the time. - * - * If we get CANCEL_RUNNING, then wait for a work to complete - * before continuing. - */ - if (ret == IO_WQ_CANCEL_OK) - continue; - else if (ret != IO_WQ_CANCEL_RUNNING) + if (cancel_req) { + ret = io_wq_cancel_work(ctx->io_wq, &cancel_req->work); + io_put_req(cancel_req); + } + + /* We need to keep going until we don't find a matching req */ + if (!cancel_req) break; schedule(); } + finish_wait(&ctx->inflight_wait, &wait); }
static int io_uring_flush(struct file *file, void *data)
From: Jens Axboe axboe@kernel.dk
mainline inclusion from mainline-5.5-rc1 commit 76a46e066e2d93bd333599d1c84c605c2c4cc909 category: feature bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27 CVE: NA ---------------------------
If you prep a read (for example) that needs to get punted to async context with a linked timer, and the timeout is sufficiently short, the timer request will complete with -ENOENT because it could not find the read.
The issue is that we prep and start the timer before we start the read. Hence the timer can trigger before the read is even started, and the end result is then that the timer completes with -ENOENT, while the read starts instead of being cancelled by the timer.
Fix this by splitting the linked timer into two parts:
1) Prep and validate the linked timer
2) Start timer
The read is then started between steps 1 and 2, so we know that the timer will always have a consistent view of the read request state.
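The same two-phase shape can be illustrated with POSIX timers (an analogy only, not the kernel hrtimer API): create and validate the timer first, start the guarded operation, and only then arm it.

#include <signal.h>
#include <time.h>

/* Phase 1: prep -- create and validate, but do not arm. */
static int prep_timer(timer_t *timer)
{
	struct sigevent sev = {
		.sigev_notify = SIGEV_SIGNAL,
		.sigev_signo  = SIGALRM,
	};

	return timer_create(CLOCK_MONOTONIC, &sev, timer);
}

/* Phase 2: arm, called only after the guarded request is in flight,
 * so the timer can never observe a not-yet-started request. */
static int start_timer(timer_t timer, long nsec)
{
	struct itimerspec its = { .it_value = { .tv_nsec = nsec } };

	return timer_settime(timer, 0, &its, NULL);
}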
Reported-by: Hrvoje Zeba zeba.hrvoje@gmail.com Signed-off-by: Jens Axboe axboe@kernel.dk Signed-off-by: Zhihao Cheng chengzhihao1@huawei.com Signed-off-by: yangerkun yangerkun@huawei.com Reviewed-by: zhangyi (F) yi.zhang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- fs/io_uring.c | 123 +++++++++++++++++++++++++++++--------------------- 1 file changed, 72 insertions(+), 51 deletions(-)
diff --git a/fs/io_uring.c b/fs/io_uring.c index d751e1eb245e..5e487fa88a82 100644 --- a/fs/io_uring.c +++ b/fs/io_uring.c @@ -854,7 +854,7 @@ static void io_req_link_next(struct io_kiocb *req, struct io_kiocb **nxtptr) */ nxt = list_first_entry_or_null(&req->link_list, struct io_kiocb, list); while (nxt) { - list_del(&nxt->list); + list_del_init(&nxt->list); if (!list_empty(&req->link_list)) { INIT_LIST_HEAD(&nxt->link_list); list_splice(&req->link_list, &nxt->link_list); @@ -2688,13 +2688,17 @@ static enum hrtimer_restart io_link_timeout_fn(struct hrtimer *timer) */ if (!list_empty(&req->list)) { prev = list_entry(req->list.prev, struct io_kiocb, link_list); - list_del_init(&req->list); + if (refcount_inc_not_zero(&prev->refs)) + list_del_init(&req->list); + else + prev = NULL; }
spin_unlock_irqrestore(&ctx->completion_lock, flags);
if (prev) { io_async_find_and_cancel(ctx, req, prev->user_data, NULL); + io_put_req(prev); } else { io_cqring_add_event(req, -ETIME); io_put_req(req); @@ -2702,78 +2706,84 @@ static enum hrtimer_restart io_link_timeout_fn(struct hrtimer *timer) return HRTIMER_NORESTART; }
-static int io_queue_linked_timeout(struct io_kiocb *req, struct io_kiocb *nxt) +static void io_queue_linked_timeout(struct io_kiocb *req, struct timespec64 *ts, + enum hrtimer_mode *mode) { - const struct io_uring_sqe *sqe = nxt->submit.sqe; - enum hrtimer_mode mode; - struct timespec64 ts; - int ret = -EINVAL; + struct io_ring_ctx *ctx = req->ctx;
- if (sqe->ioprio || sqe->buf_index || sqe->len != 1 || sqe->off) - goto err; - if (sqe->timeout_flags & ~IORING_TIMEOUT_ABS) - goto err; - if (get_timespec64(&ts, u64_to_user_ptr(sqe->addr))) { - ret = -EFAULT; - goto err; + /* + * If the list is now empty, then our linked request finished before + * we got a chance to setup the timer + */ + spin_lock_irq(&ctx->completion_lock); + if (!list_empty(&req->list)) { + req->timeout.timer.function = io_link_timeout_fn; + hrtimer_start(&req->timeout.timer, timespec64_to_ktime(*ts), + *mode); } + spin_unlock_irq(&ctx->completion_lock);
- req->flags |= REQ_F_LINK_TIMEOUT; - - if (sqe->timeout_flags & IORING_TIMEOUT_ABS) - mode = HRTIMER_MODE_ABS; - else - mode = HRTIMER_MODE_REL; - hrtimer_init(&nxt->timeout.timer, CLOCK_MONOTONIC, mode); - nxt->timeout.timer.function = io_link_timeout_fn; - hrtimer_start(&nxt->timeout.timer, timespec64_to_ktime(ts), mode); - ret = 0; -err: /* drop submission reference */ - io_put_req(nxt); - - if (ret) { - struct io_ring_ctx *ctx = req->ctx; + io_put_req(req); +}
- /* - * Break the link and fail linked timeout, parent will get - * failed by the regular submission path. - */ - list_del(&nxt->list); - io_cqring_fill_event(nxt, ret); - trace_io_uring_fail_link(req, nxt); - io_commit_cqring(ctx); - io_put_req(nxt); - ret = -ECANCELED; - } +static int io_validate_link_timeout(const struct io_uring_sqe *sqe, + struct timespec64 *ts) +{ + if (sqe->ioprio || sqe->buf_index || sqe->len != 1 || sqe->off) + return -EINVAL; + if (sqe->timeout_flags & ~IORING_TIMEOUT_ABS) + return -EINVAL; + if (get_timespec64(ts, u64_to_user_ptr(sqe->addr))) + return -EFAULT;
- return ret; + return 0; }
-static inline struct io_kiocb *io_get_linked_timeout(struct io_kiocb *req) +static struct io_kiocb *io_prep_linked_timeout(struct io_kiocb *req, + struct timespec64 *ts, + enum hrtimer_mode *mode) { struct io_kiocb *nxt; + int ret;
if (!(req->flags & REQ_F_LINK)) return NULL;
nxt = list_first_entry_or_null(&req->link_list, struct io_kiocb, list); - if (nxt && nxt->submit.sqe->opcode == IORING_OP_LINK_TIMEOUT) - return nxt; + if (!nxt || nxt->submit.sqe->opcode != IORING_OP_LINK_TIMEOUT) + return NULL;
- return NULL; + ret = io_validate_link_timeout(nxt->submit.sqe, ts); + if (ret) { + list_del_init(&nxt->list); + io_cqring_add_event(nxt, ret); + io_double_put_req(nxt); + return ERR_PTR(-ECANCELED); + } + + if (nxt->submit.sqe->timeout_flags & IORING_TIMEOUT_ABS) + *mode = HRTIMER_MODE_ABS; + else + *mode = HRTIMER_MODE_REL; + + req->flags |= REQ_F_LINK_TIMEOUT; + hrtimer_init(&nxt->timeout.timer, CLOCK_MONOTONIC, *mode); + return nxt; }
static int __io_queue_sqe(struct io_kiocb *req) { + enum hrtimer_mode mode; struct io_kiocb *nxt; + struct timespec64 ts; int ret;
- nxt = io_get_linked_timeout(req); - if (unlikely(nxt)) { - ret = io_queue_linked_timeout(req, nxt); - if (ret) - goto err; + nxt = io_prep_linked_timeout(req, &ts, &mode); + if (IS_ERR(nxt)) { + ret = PTR_ERR(nxt); + nxt = NULL; + goto err; }
ret = __io_submit_sqe(req, NULL, true); @@ -2803,14 +2813,25 @@ static int __io_queue_sqe(struct io_kiocb *req) * submit reference when the iocb is actually submitted. */ io_queue_async_work(req); + + if (nxt) + io_queue_linked_timeout(nxt, &ts, &mode); + return 0; } }
- /* drop submission reference */ err: + /* drop submission reference */ io_put_req(req);
+ if (nxt) { + if (!ret) + io_queue_linked_timeout(nxt, &ts, &mode); + else + io_put_req(nxt); + } + /* and drop final reference, if we failed */ if (ret) { io_cqring_add_event(req, ret);
From: Jens Axboe axboe@kernel.dk
mainline inclusion from mainline-5.5-rc1 commit 15dff286d0e0087d4dcd7049911f179e4e4cfd94 category: feature bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27 CVE: NA ---------------------------
Normally the rings are always valid; the exception is if we failed to allocate them at setup time. syzbot reports this:
RSP: 002b:00007ffd6e8aa078 EFLAGS: 00000246 ORIG_RAX: 00000000000001a9 RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 0000000000441229 RDX: 0000000000000002 RSI: 0000000020000140 RDI: 0000000000000d0d RBP: 00007ffd6e8aa090 R08: 0000000000000001 R09: 0000000000000000 R10: 0000000000000000 R11: 0000000000000246 R12: ffffffffffffffff R13: 0000000000000003 R14: 0000000000000000 R15: 0000000000000000 kasan: CONFIG_KASAN_INLINE enabled kasan: GPF could be caused by NULL-ptr deref or user memory access general protection fault: 0000 [#1] PREEMPT SMP KASAN CPU: 1 PID: 8903 Comm: syz-executor410 Not tainted 5.4.0-rc7-next-20191113 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011 RIP: 0010:__read_once_size include/linux/compiler.h:199 [inline] RIP: 0010:__io_commit_cqring fs/io_uring.c:496 [inline] RIP: 0010:io_commit_cqring+0x1e1/0xdb0 fs/io_uring.c:592 Code: 03 0f 8e df 09 00 00 48 8b 45 d0 4c 8d a3 c0 00 00 00 4c 89 e2 48 c1 ea 03 44 8b b8 c0 01 00 00 48 b8 00 00 00 00 00 fc ff df <0f> b6 14 02 4c 89 e0 83 e0 07 83 c0 03 38 d0 7c 08 84 d2 0f 85 61 RSP: 0018:ffff88808f51fc08 EFLAGS: 00010006 RAX: dffffc0000000000 RBX: 0000000000000000 RCX: ffffffff815abe4a RDX: 0000000000000018 RSI: ffffffff81d168d5 RDI: ffff8880a9166100 RBP: ffff88808f51fc70 R08: 0000000000000004 R09: ffffed1011ea3f7d R10: ffffed1011ea3f7c R11: 0000000000000003 R12: 00000000000000c0 R13: ffff8880a91661c0 R14: 1ffff1101522cc10 R15: 0000000000000000 FS: 0000000001e7a880(0000) GS:ffff8880ae900000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 0000000020000140 CR3: 000000009a74c000 CR4: 00000000001406e0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 Call Trace: io_cqring_overflow_flush+0x6b9/0xa90 fs/io_uring.c:673 io_ring_ctx_wait_and_kill+0x24f/0x7c0 fs/io_uring.c:4260 io_uring_create fs/io_uring.c:4600 [inline] io_uring_setup+0x1256/0x1cc0 fs/io_uring.c:4626 __do_sys_io_uring_setup fs/io_uring.c:4639 [inline] __se_sys_io_uring_setup fs/io_uring.c:4636 [inline] __x64_sys_io_uring_setup+0x54/0x80 fs/io_uring.c:4636 do_syscall_64+0xfa/0x760 arch/x86/entry/common.c:290 entry_SYSCALL_64_after_hwframe+0x49/0xbe RIP: 0033:0x441229 Code: e8 5c ae 02 00 48 83 c4 18 c3 0f 1f 80 00 00 00 00 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 0f 83 bb 0a fc ff c3 66 2e 0f 1f 84 00 00 00 00 RSP: 002b:00007ffd6e8aa078 EFLAGS: 00000246 ORIG_RAX: 00000000000001a9 RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 0000000000441229 RDX: 0000000000000002 RSI: 0000000020000140 RDI: 0000000000000d0d RBP: 00007ffd6e8aa090 R08: 0000000000000001 R09: 0000000000000000 R10: 0000000000000000 R11: 0000000000000246 R12: ffffffffffffffff R13: 0000000000000003 R14: 0000000000000000 R15: 0000000000000000 Modules linked in: ---[ end trace b0f5b127a57f623f ]--- RIP: 0010:__read_once_size include/linux/compiler.h:199 [inline] RIP: 0010:__io_commit_cqring fs/io_uring.c:496 [inline] RIP: 0010:io_commit_cqring+0x1e1/0xdb0 fs/io_uring.c:592 Code: 03 0f 8e df 09 00 00 48 8b 45 d0 4c 8d a3 c0 00 00 00 4c 89 e2 48 c1 ea 03 44 8b b8 c0 01 00 00 48 b8 00 00 00 00 00 fc ff df <0f> b6 14 02 4c 89 e0 83 e0 07 83 c0 03 38 d0 7c 08 84 d2 0f 85 61 RSP: 0018:ffff88808f51fc08 EFLAGS: 00010006 RAX: dffffc0000000000 RBX: 0000000000000000 RCX: ffffffff815abe4a RDX: 0000000000000018 RSI: ffffffff81d168d5 RDI: ffff8880a9166100 RBP: ffff88808f51fc70 R08: 0000000000000004 R09: 
ffffed1011ea3f7d R10: ffffed1011ea3f7c R11: 0000000000000003 R12: 00000000000000c0 R13: ffff8880a91661c0 R14: 1ffff1101522cc10 R15: 0000000000000000 FS: 0000000001e7a880(0000) GS:ffff8880ae900000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 0000000020000140 CR3: 000000009a74c000 CR4: 00000000001406e0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
which is exactly the case of failing to allocate the SQ/CQ rings, and then entering shutdown. Check if the rings are valid before trying to access them at shutdown time.
Reported-by: syzbot+21147d79607d724bd6f3@syzkaller.appspotmail.com Fixes: 1d7bb1d50fb4 ("io_uring: add support for backlogged CQ ring") Signed-off-by: Jens Axboe axboe@kernel.dk Signed-off-by: Zhihao Cheng chengzhihao1@huawei.com Signed-off-by: yangerkun yangerkun@huawei.com Reviewed-by: zhangyi (F) yi.zhang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- fs/io_uring.c | 4 +++- 1 file changed, 3 insertions(+), 1 deletion(-)
diff --git a/fs/io_uring.c b/fs/io_uring.c index 8f8bb5c7e791..cc69f38c77e5 100644 --- a/fs/io_uring.c +++ b/fs/io_uring.c @@ -4280,7 +4280,9 @@ static void io_ring_ctx_wait_and_kill(struct io_ring_ctx *ctx) io_wq_cancel_all(ctx->io_wq);
io_iopoll_reap_events(ctx); - io_cqring_overflow_flush(ctx, true); + /* if we failed setting up the ctx, we might not have any rings */ + if (ctx->rings) + io_cqring_overflow_flush(ctx, true); wait_for_completion(&ctx->completions[0]); io_ring_ctx_free(ctx); }
From: Jens Axboe axboe@kernel.dk
mainline inclusion from mainline-5.5-rc1 commit 7d7230652e7c788ef908536fd79f4cca077f269f category: feature bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27 CVE: NA ---------------------------
For cancellation, we need to ensure that the work item stays valid for as long as ->cur_work is valid. Right now we can't safely dereference the work item even under the wqe->lock, because while the ->cur_work pointer will remain valid, the work could be completing and be freed in parallel.
Only invoke ->get/put_work() on items we know that the caller queued themselves. Add IO_WQ_WORK_INTERNAL for io-wq to use, which is needed when we're queueing a flush item, for instance.
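The shape of the resulting dispatch loop, reduced to its essentials (illustrative names; the real code threads this through io_worker_handle_work()):

#define WORK_INTERNAL	0x1	/* stands in for IO_WQ_WORK_INTERNAL */

struct work {
	unsigned flags;
	void (*func)(struct work *);
};

static void work_get(struct work *w) { (void)w; /* e.g. refcount_inc() */ }
static void work_put(struct work *w) { (void)w; /* e.g. io_put_req() */ }

static void handle_work(struct work *w)
{
	struct work *pinned = NULL;

	/* Only pin work the submitter queued; internal items such as a
	 * flush marker never carried a reference to take. */
	if (!(w->flags & WORK_INTERNAL)) {
		pinned = w;
		work_get(pinned);
	}

	w->func(w);	/* handler may complete and free unpinned work */

	if (pinned)
		work_put(pinned);	/* item stayed valid until here */
}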
Signed-off-by: Jens Axboe axboe@kernel.dk Signed-off-by: Zhihao Cheng chengzhihao1@huawei.com Signed-off-by: yangerkun yangerkun@huawei.com Reviewed-by: zhangyi (F) yi.zhang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- fs/io-wq.c | 25 +++++++++++++++++++++++-- fs/io-wq.h | 7 ++++++- fs/io_uring.c | 17 ++++++++++++++++- 3 files changed, 45 insertions(+), 4 deletions(-)
diff --git a/fs/io-wq.c b/fs/io-wq.c index e70da9583377..97b18dfad163 100644 --- a/fs/io-wq.c +++ b/fs/io-wq.c @@ -107,6 +107,9 @@ struct io_wq { unsigned long state; unsigned nr_wqes;
+ get_work_fn *get_work; + put_work_fn *put_work; + struct task_struct *manager; struct user_struct *user; struct mm_struct *mm; @@ -393,7 +396,7 @@ static struct io_wq_work *io_get_next_work(struct io_wqe *wqe, unsigned *hash) static void io_worker_handle_work(struct io_worker *worker) __releases(wqe->lock) { - struct io_wq_work *work, *old_work; + struct io_wq_work *work, *old_work = NULL, *put_work = NULL; struct io_wqe *wqe = worker->wqe; struct io_wq *wq = wqe->wq;
@@ -425,6 +428,8 @@ static void io_worker_handle_work(struct io_worker *worker) wqe->flags |= IO_WQE_FLAG_STALLED;
spin_unlock_irq(&wqe->lock); + if (put_work && wq->put_work) + wq->put_work(old_work); if (!work) break; next: @@ -445,6 +450,11 @@ static void io_worker_handle_work(struct io_worker *worker) if (worker->mm) work->flags |= IO_WQ_WORK_HAS_MM;
+ if (wq->get_work && !(work->flags & IO_WQ_WORK_INTERNAL)) { + put_work = work; + wq->get_work(work); + } + old_work = work; work->func(&work);
@@ -456,6 +466,12 @@ static void io_worker_handle_work(struct io_worker *worker) } if (work && work != old_work) { spin_unlock_irq(&wqe->lock); + + if (put_work && wq->put_work) { + wq->put_work(put_work); + put_work = NULL; + } + /* dependent work not hashed */ hash = -1U; goto next; @@ -951,13 +967,15 @@ void io_wq_flush(struct io_wq *wq)
init_completion(&data.done); INIT_IO_WORK(&data.work, io_wq_flush_func); + data.work.flags |= IO_WQ_WORK_INTERNAL; io_wqe_enqueue(wqe, &data.work); wait_for_completion(&data.done); } }
struct io_wq *io_wq_create(unsigned bounded, struct mm_struct *mm, - struct user_struct *user) + struct user_struct *user, get_work_fn *get_work, + put_work_fn *put_work) { int ret = -ENOMEM, i, node; struct io_wq *wq; @@ -973,6 +991,9 @@ struct io_wq *io_wq_create(unsigned bounded, struct mm_struct *mm, return ERR_PTR(-ENOMEM); }
+ wq->get_work = get_work; + wq->put_work = put_work; + /* caller must already hold a reference to this */ wq->user = user;
diff --git a/fs/io-wq.h b/fs/io-wq.h index cc50754d028c..4b29f922f80c 100644 --- a/fs/io-wq.h +++ b/fs/io-wq.h @@ -10,6 +10,7 @@ enum { IO_WQ_WORK_NEEDS_USER = 8, IO_WQ_WORK_NEEDS_FILES = 16, IO_WQ_WORK_UNBOUND = 32, + IO_WQ_WORK_INTERNAL = 64,
IO_WQ_HASH_SHIFT = 24, /* upper 8 bits are used for hash key */ }; @@ -34,8 +35,12 @@ struct io_wq_work { (work)->files = NULL; \ } while (0) \
+typedef void (get_work_fn)(struct io_wq_work *); +typedef void (put_work_fn)(struct io_wq_work *); + struct io_wq *io_wq_create(unsigned bounded, struct mm_struct *mm, - struct user_struct *user); + struct user_struct *user, + get_work_fn *get_work, put_work_fn *put_work); void io_wq_destroy(struct io_wq *wq);
void io_wq_enqueue(struct io_wq *wq, struct io_wq_work *work); diff --git a/fs/io_uring.c b/fs/io_uring.c index 77e8d403b3e7..9ceb7af472bf 100644 --- a/fs/io_uring.c +++ b/fs/io_uring.c @@ -3837,6 +3837,20 @@ static int io_sqe_files_update(struct io_ring_ctx *ctx, void __user *arg, return done ? done : err; }
+static void io_put_work(struct io_wq_work *work) +{ + struct io_kiocb *req = container_of(work, struct io_kiocb, work); + + io_put_req(req); +} + +static void io_get_work(struct io_wq_work *work) +{ + struct io_kiocb *req = container_of(work, struct io_kiocb, work); + + refcount_inc(&req->refs); +} + static int io_sq_offload_start(struct io_ring_ctx *ctx, struct io_uring_params *p) { @@ -3886,7 +3900,8 @@ static int io_sq_offload_start(struct io_ring_ctx *ctx,
/* Do QD, or 4 * CPUS, whatever is smallest */ concurrency = min(ctx->sq_entries, 4 * num_online_cpus()); - ctx->io_wq = io_wq_create(concurrency, ctx->sqo_mm, ctx->user); + ctx->io_wq = io_wq_create(concurrency, ctx->sqo_mm, ctx->user, + io_get_work, io_put_work); if (IS_ERR(ctx->io_wq)) { ret = PTR_ERR(ctx->io_wq); ctx->io_wq = NULL;
From: Jens Axboe axboe@kernel.dk
mainline inclusion from mainline-5.5-rc1 commit ad8a48acc23cb13cbf4332ebabb867b1baa81842 category: feature bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27 CVE: NA ---------------------------
There are a few reasons for this:
- As a prep to improving the linked timeout logic
- io_timeout is the biggest member in the io_kiocb opcode union
This also enables a few cleanups, like unifying the timer setup between IORING_OP_TIMEOUT and IORING_OP_LINK_TIMEOUT, and not needing multiple arguments to the link/prep helpers.
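With the timer now living in a separately allocated struct io_timeout_data, the callback recovers its context via container_of() on the embedded member. A runnable demonstration of that idiom (toy types, not the io_uring structures):

#include <stddef.h>
#include <stdio.h>

#define container_of(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))

struct timeout_data {
	int req_id;	/* stands in for the back-pointer to the request */
	int timer;	/* stands in for the embedded hrtimer */
};

static void timer_fired(int *timer)
{
	/* Recover the containing allocation from the embedded member,
	 * as io_timeout_fn() does with struct io_timeout_data. */
	struct timeout_data *data =
		container_of(timer, struct timeout_data, timer);

	printf("timeout for request %d\n", data->req_id);
}

int main(void)
{
	struct timeout_data d = { .req_id = 42 };

	timer_fired(&d.timer);
	return 0;
}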
Signed-off-by: Jens Axboe axboe@kernel.dk Signed-off-by: Zhihao Cheng chengzhihao1@huawei.com Signed-off-by: yangerkun yangerkun@huawei.com Reviewed-by: zhangyi (F) yi.zhang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- fs/io_uring.c | 129 +++++++++++++++++++++++++++----------------------- 1 file changed, 70 insertions(+), 59 deletions(-)
diff --git a/fs/io_uring.c b/fs/io_uring.c index 700ae01d986f..2eb1b9cec145 100644 --- a/fs/io_uring.c +++ b/fs/io_uring.c @@ -301,9 +301,16 @@ struct io_poll_iocb { struct wait_queue_entry wait; };
+struct io_timeout_data { + struct io_kiocb *req; + struct hrtimer timer; + struct timespec64 ts; + enum hrtimer_mode mode; +}; + struct io_timeout { struct file *file; - struct hrtimer timer; + struct io_timeout_data *data; };
/* @@ -572,7 +579,7 @@ static void io_kill_timeout(struct io_kiocb *req) { int ret;
- ret = hrtimer_try_to_cancel(&req->timeout.timer); + ret = hrtimer_try_to_cancel(&req->timeout.data->timer); if (ret != -1) { atomic_inc(&req->ctx->cq_timeouts); list_del_init(&req->list); @@ -827,6 +834,8 @@ static void __io_free_req(struct io_kiocb *req) wake_up(&ctx->inflight_wait); spin_unlock_irqrestore(&ctx->inflight_lock, flags); } + if (req->flags & REQ_F_TIMEOUT) + kfree(req->timeout.data); percpu_ref_put(&ctx->refs); if (likely(!io_is_fallback_req(req))) kmem_cache_free(req_cachep, req); @@ -839,7 +848,7 @@ static bool io_link_cancel_timeout(struct io_kiocb *req) struct io_ring_ctx *ctx = req->ctx; int ret;
- ret = hrtimer_try_to_cancel(&req->timeout.timer); + ret = hrtimer_try_to_cancel(&req->timeout.data->timer); if (ret != -1) { io_cqring_fill_event(req, -ECANCELED); io_commit_cqring(ctx); @@ -2235,12 +2244,12 @@ static int io_poll_add(struct io_kiocb *req, const struct io_uring_sqe *sqe,
static enum hrtimer_restart io_timeout_fn(struct hrtimer *timer) { - struct io_ring_ctx *ctx; - struct io_kiocb *req; + struct io_timeout_data *data = container_of(timer, + struct io_timeout_data, timer); + struct io_kiocb *req = data->req; + struct io_ring_ctx *ctx = req->ctx; unsigned long flags;
- req = container_of(timer, struct io_kiocb, timeout.timer); - ctx = req->ctx; atomic_inc(&ctx->cq_timeouts);
spin_lock_irqsave(&ctx->completion_lock, flags); @@ -2290,7 +2299,7 @@ static int io_timeout_cancel(struct io_ring_ctx *ctx, __u64 user_data) if (ret == -ENOENT) return ret;
- ret = hrtimer_try_to_cancel(&req->timeout.timer); + ret = hrtimer_try_to_cancel(&req->timeout.data->timer); if (ret == -1) return -EALREADY;
@@ -2330,34 +2339,54 @@ static int io_timeout_remove(struct io_kiocb *req, return 0; }
-static int io_timeout(struct io_kiocb *req, const struct io_uring_sqe *sqe) +static int io_timeout_setup(struct io_kiocb *req) { - unsigned count; - struct io_ring_ctx *ctx = req->ctx; - struct list_head *entry; - enum hrtimer_mode mode; - struct timespec64 ts; - unsigned span = 0; + const struct io_uring_sqe *sqe = req->submit.sqe; + struct io_timeout_data *data; unsigned flags;
- if (unlikely(ctx->flags & IORING_SETUP_IOPOLL)) + if (unlikely(req->ctx->flags & IORING_SETUP_IOPOLL)) return -EINVAL; - if (sqe->flags || sqe->ioprio || sqe->buf_index || sqe->len != 1) + if (sqe->ioprio || sqe->buf_index || sqe->len != 1) return -EINVAL; flags = READ_ONCE(sqe->timeout_flags); if (flags & ~IORING_TIMEOUT_ABS) return -EINVAL;
- if (get_timespec64(&ts, u64_to_user_ptr(sqe->addr))) + data = kzalloc(sizeof(struct io_timeout_data), GFP_KERNEL); + if (!data) + return -ENOMEM; + data->req = req; + req->timeout.data = data; + req->flags |= REQ_F_TIMEOUT; + + if (get_timespec64(&data->ts, u64_to_user_ptr(sqe->addr))) return -EFAULT;
if (flags & IORING_TIMEOUT_ABS) - mode = HRTIMER_MODE_ABS; + data->mode = HRTIMER_MODE_ABS; else - mode = HRTIMER_MODE_REL; + data->mode = HRTIMER_MODE_REL;
- hrtimer_init(&req->timeout.timer, CLOCK_MONOTONIC, mode); - req->flags |= REQ_F_TIMEOUT; + hrtimer_init(&data->timer, CLOCK_MONOTONIC, data->mode); + return 0; +} + +static int io_timeout(struct io_kiocb *req, const struct io_uring_sqe *sqe) +{ + unsigned count; + struct io_ring_ctx *ctx = req->ctx; + struct io_timeout_data *data; + struct list_head *entry; + unsigned span = 0; + int ret; + + ret = io_timeout_setup(req); + /* common setup allows flags (like links) set, we don't */ + if (!ret && sqe->flags) + ret = -EINVAL; + if (ret) + return ret;
/* * sqe->off holds how many events that need to occur for this @@ -2417,8 +2446,9 @@ static int io_timeout(struct io_kiocb *req, const struct io_uring_sqe *sqe) req->sequence -= span; add: list_add(&req->list, entry); - req->timeout.timer.function = io_timeout_fn; - hrtimer_start(&req->timeout.timer, timespec64_to_ktime(ts), mode); + data = req->timeout.data; + data->timer.function = io_timeout_fn; + hrtimer_start(&data->timer, timespec64_to_ktime(data->ts), data->mode); spin_unlock_irq(&ctx->completion_lock); return 0; } @@ -2753,8 +2783,9 @@ static int io_grab_files(struct io_kiocb *req)
static enum hrtimer_restart io_link_timeout_fn(struct hrtimer *timer) { - struct io_kiocb *req = container_of(timer, struct io_kiocb, - timeout.timer); + struct io_timeout_data *data = container_of(timer, + struct io_timeout_data, timer); + struct io_kiocb *req = data->req; struct io_ring_ctx *ctx = req->ctx; struct io_kiocb *prev = NULL; unsigned long flags; @@ -2785,9 +2816,9 @@ static enum hrtimer_restart io_link_timeout_fn(struct hrtimer *timer) return HRTIMER_NORESTART; }
-static void io_queue_linked_timeout(struct io_kiocb *req, struct timespec64 *ts, - enum hrtimer_mode *mode) +static void io_queue_linked_timeout(struct io_kiocb *req) { + struct io_timeout_data *data = req->timeout.data; struct io_ring_ctx *ctx = req->ctx;
/* @@ -2796,9 +2827,9 @@ static void io_queue_linked_timeout(struct io_kiocb *req, struct timespec64 *ts, */ spin_lock_irq(&ctx->completion_lock); if (!list_empty(&req->list)) { - req->timeout.timer.function = io_link_timeout_fn; - hrtimer_start(&req->timeout.timer, timespec64_to_ktime(*ts), - *mode); + data->timer.function = io_link_timeout_fn; + hrtimer_start(&data->timer, timespec64_to_ktime(data->ts), + data->mode); } spin_unlock_irq(&ctx->completion_lock);
@@ -2806,22 +2837,7 @@ static void io_queue_linked_timeout(struct io_kiocb *req, struct timespec64 *ts, io_put_req(req); }
-static int io_validate_link_timeout(const struct io_uring_sqe *sqe, - struct timespec64 *ts) -{ - if (sqe->ioprio || sqe->buf_index || sqe->len != 1 || sqe->off) - return -EINVAL; - if (sqe->timeout_flags & ~IORING_TIMEOUT_ABS) - return -EINVAL; - if (get_timespec64(ts, u64_to_user_ptr(sqe->addr))) - return -EFAULT; - - return 0; -} - -static struct io_kiocb *io_prep_linked_timeout(struct io_kiocb *req, - struct timespec64 *ts, - enum hrtimer_mode *mode) +static struct io_kiocb *io_prep_linked_timeout(struct io_kiocb *req) { struct io_kiocb *nxt; int ret; @@ -2833,7 +2849,10 @@ static struct io_kiocb *io_prep_linked_timeout(struct io_kiocb *req, if (!nxt || nxt->submit.sqe->opcode != IORING_OP_LINK_TIMEOUT) return NULL;
- ret = io_validate_link_timeout(nxt->submit.sqe, ts); + ret = io_timeout_setup(nxt); + /* common setup allows offset being set, we don't */ + if (!ret && nxt->submit.sqe->off) + ret = -EINVAL; if (ret) { list_del_init(&nxt->list); io_cqring_add_event(nxt, ret); @@ -2841,24 +2860,16 @@ static struct io_kiocb *io_prep_linked_timeout(struct io_kiocb *req, return ERR_PTR(-ECANCELED); }
- if (nxt->submit.sqe->timeout_flags & IORING_TIMEOUT_ABS) - *mode = HRTIMER_MODE_ABS; - else - *mode = HRTIMER_MODE_REL; - req->flags |= REQ_F_LINK_TIMEOUT; - hrtimer_init(&nxt->timeout.timer, CLOCK_MONOTONIC, *mode); return nxt; }
static void __io_queue_sqe(struct io_kiocb *req) { - enum hrtimer_mode mode; struct io_kiocb *nxt; - struct timespec64 ts; int ret;
- nxt = io_prep_linked_timeout(req, &ts, &mode); + nxt = io_prep_linked_timeout(req); if (IS_ERR(nxt)) { ret = PTR_ERR(nxt); nxt = NULL; @@ -2894,7 +2905,7 @@ static void __io_queue_sqe(struct io_kiocb *req) io_queue_async_work(req);
if (nxt) - io_queue_linked_timeout(nxt, &ts, &mode); + io_queue_linked_timeout(nxt);
return; } @@ -2906,7 +2917,7 @@ static void __io_queue_sqe(struct io_kiocb *req)
if (nxt) { if (!ret) - io_queue_linked_timeout(nxt, &ts, &mode); + io_queue_linked_timeout(nxt); else io_put_req(nxt); }
From: Jens Axboe axboe@kernel.dk
mainline inclusion from mainline-5.5-rc1 commit 94ae5e77a9150a8c6c57432e2db290c6868ddfad category: feature bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27 CVE: NA ---------------------------
We have an issue with timeout links that are deeper in the submit chain, because we only handle it upfront, not from later submissions. Move the prep + issue of the timeout link to the async work prep handler, and do it normally for non-async queue. If we validate and prepare the timeout links upfront when we first see them, there's nothing stopping us from supporting any sort of nesting.
Fixes: 2665abfd757f ("io_uring: add support for linked SQE timeouts") Reported-by: Pavel Begunkov asml.silence@gmail.com Signed-off-by: Jens Axboe axboe@kernel.dk Signed-off-by: Zhihao Cheng chengzhihao1@huawei.com Signed-off-by: yangerkun yangerkun@huawei.com Reviewed-by: zhangyi (F) yi.zhang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- fs/io_uring.c | 102 ++++++++++++++++++++++++++++++-------------------- 1 file changed, 61 insertions(+), 41 deletions(-)
diff --git a/fs/io_uring.c b/fs/io_uring.c index 2eb1b9cec145..9cdced780c9f 100644 --- a/fs/io_uring.c +++ b/fs/io_uring.c @@ -353,6 +353,7 @@ struct io_kiocb { #define REQ_F_TIMEOUT_NOSEQ 8192 /* no timeout sequence */ #define REQ_F_INFLIGHT 16384 /* on inflight list */ #define REQ_F_COMP_LOCKED 32768 /* completion under lock */ +#define REQ_F_FREE_SQE 65536 /* free sqe if not async queued */ u64 user_data; u32 result; u32 sequence; @@ -391,6 +392,8 @@ static void __io_free_req(struct io_kiocb *req); static void io_put_req(struct io_kiocb *req); static void io_double_put_req(struct io_kiocb *req); static void __io_double_put_req(struct io_kiocb *req); +static struct io_kiocb *io_prep_linked_timeout(struct io_kiocb *req); +static void io_queue_linked_timeout(struct io_kiocb *req);
static struct kmem_cache *req_cachep;
@@ -528,7 +531,8 @@ static inline bool io_sqe_needs_user(const struct io_uring_sqe *sqe) opcode == IORING_OP_WRITE_FIXED); }
-static inline bool io_prep_async_work(struct io_kiocb *req) +static inline bool io_prep_async_work(struct io_kiocb *req, + struct io_kiocb **link) { bool do_hashed = false;
@@ -557,13 +561,17 @@ static inline bool io_prep_async_work(struct io_kiocb *req) req->work.flags |= IO_WQ_WORK_NEEDS_USER; }
+ *link = io_prep_linked_timeout(req); return do_hashed; }
static inline void io_queue_async_work(struct io_kiocb *req) { - bool do_hashed = io_prep_async_work(req); struct io_ring_ctx *ctx = req->ctx; + struct io_kiocb *link; + bool do_hashed; + + do_hashed = io_prep_async_work(req, &link);
trace_io_uring_queue_async_work(ctx, do_hashed, req, &req->work, req->flags); @@ -573,6 +581,9 @@ static inline void io_queue_async_work(struct io_kiocb *req) io_wq_enqueue_hashed(ctx->io_wq, &req->work, file_inode(req->file)); } + + if (link) + io_queue_linked_timeout(link); }
static void io_kill_timeout(struct io_kiocb *req) @@ -874,6 +885,15 @@ static void io_req_link_next(struct io_kiocb *req, struct io_kiocb **nxtptr) nxt = list_first_entry_or_null(&req->link_list, struct io_kiocb, list); while (nxt) { list_del_init(&nxt->list); + + if ((req->flags & REQ_F_LINK_TIMEOUT) && + (nxt->flags & REQ_F_TIMEOUT)) { + wake_ev |= io_link_cancel_timeout(nxt); + nxt = list_first_entry_or_null(&req->link_list, + struct io_kiocb, list); + req->flags &= ~REQ_F_LINK_TIMEOUT; + continue; + } if (!list_empty(&req->link_list)) { INIT_LIST_HEAD(&nxt->link_list); list_splice(&req->link_list, &nxt->link_list); @@ -884,19 +904,13 @@ static void io_req_link_next(struct io_kiocb *req, struct io_kiocb **nxtptr) * If we're in async work, we can continue processing the chain * in this context instead of having to queue up new async work. */ - if (req->flags & REQ_F_LINK_TIMEOUT) { - wake_ev = io_link_cancel_timeout(nxt); - - /* we dropped this link, get next */ - nxt = list_first_entry_or_null(&req->link_list, - struct io_kiocb, list); - } else if (nxtptr && io_wq_current_is_worker()) { - *nxtptr = nxt; - break; - } else { - io_queue_async_work(nxt); - break; + if (nxt) { + if (nxtptr && io_wq_current_is_worker()) + *nxtptr = nxt; + else + io_queue_async_work(nxt); } + break; }
if (wake_ev) @@ -915,11 +929,16 @@ static void io_fail_links(struct io_kiocb *req) spin_lock_irqsave(&ctx->completion_lock, flags);
while (!list_empty(&req->link_list)) { + const struct io_uring_sqe *sqe_to_free = NULL; + link = list_first_entry(&req->link_list, struct io_kiocb, list); list_del_init(&link->list);
trace_io_uring_fail_link(req, link);
+ if (link->flags & REQ_F_FREE_SQE) + sqe_to_free = link->submit.sqe; + if ((req->flags & REQ_F_LINK_TIMEOUT) && link->submit.sqe->opcode == IORING_OP_LINK_TIMEOUT) { io_link_cancel_timeout(link); @@ -927,6 +946,7 @@ static void io_fail_links(struct io_kiocb *req) io_cqring_fill_event(link, -ECANCELED); __io_double_put_req(link); } + kfree(sqe_to_free); }
io_commit_cqring(ctx); @@ -2682,8 +2702,12 @@ static void io_wq_submit_work(struct io_wq_work **workptr)
/* if a dependent link is ready, pass it back */ if (!ret && nxt) { - io_prep_async_work(nxt); + struct io_kiocb *link; + + io_prep_async_work(nxt, &link); *workptr = &nxt->work; + if (link) + io_queue_linked_timeout(link); } }
@@ -2818,7 +2842,6 @@ static enum hrtimer_restart io_link_timeout_fn(struct hrtimer *timer)
static void io_queue_linked_timeout(struct io_kiocb *req) { - struct io_timeout_data *data = req->timeout.data; struct io_ring_ctx *ctx = req->ctx;
/* @@ -2827,6 +2850,8 @@ static void io_queue_linked_timeout(struct io_kiocb *req) */ spin_lock_irq(&ctx->completion_lock); if (!list_empty(&req->list)) { + struct io_timeout_data *data = req->timeout.data; + data->timer.function = io_link_timeout_fn; hrtimer_start(&data->timer, timespec64_to_ktime(data->ts), data->mode); @@ -2840,7 +2865,6 @@ static void io_queue_linked_timeout(struct io_kiocb *req) static struct io_kiocb *io_prep_linked_timeout(struct io_kiocb *req) { struct io_kiocb *nxt; - int ret;
if (!(req->flags & REQ_F_LINK)) return NULL; @@ -2849,33 +2873,15 @@ static struct io_kiocb *io_prep_linked_timeout(struct io_kiocb *req) if (!nxt || nxt->submit.sqe->opcode != IORING_OP_LINK_TIMEOUT) return NULL;
- ret = io_timeout_setup(nxt); - /* common setup allows offset being set, we don't */ - if (!ret && nxt->submit.sqe->off) - ret = -EINVAL; - if (ret) { - list_del_init(&nxt->list); - io_cqring_add_event(nxt, ret); - io_double_put_req(nxt); - return ERR_PTR(-ECANCELED); - } - req->flags |= REQ_F_LINK_TIMEOUT; return nxt; }
static void __io_queue_sqe(struct io_kiocb *req) { - struct io_kiocb *nxt; + struct io_kiocb *nxt = io_prep_linked_timeout(req); int ret;
- nxt = io_prep_linked_timeout(req); - if (IS_ERR(nxt)) { - ret = PTR_ERR(nxt); - nxt = NULL; - goto err; - } - ret = __io_submit_sqe(req, NULL, true);
/* @@ -2903,10 +2909,6 @@ static void __io_queue_sqe(struct io_kiocb *req) * submit reference when the iocb is actually submitted. */ io_queue_async_work(req); - - if (nxt) - io_queue_linked_timeout(nxt); - return; } } @@ -2951,6 +2953,10 @@ static void io_queue_link_head(struct io_kiocb *req, struct io_kiocb *shadow) int need_submit = false; struct io_ring_ctx *ctx = req->ctx;
+ if (unlikely(req->flags & REQ_F_FAIL_LINK)) { + ret = -ECANCELED; + goto err; + } if (!shadow) { io_queue_sqe(req); return; @@ -2965,9 +2971,11 @@ static void io_queue_link_head(struct io_kiocb *req, struct io_kiocb *shadow) ret = io_req_defer(req); if (ret) { if (ret != -EIOCBQUEUED) { +err: io_cqring_add_event(req, ret); io_double_put_req(req); - __io_free_req(shadow); + if (shadow) + __io_free_req(shadow); return; } } else { @@ -3024,6 +3032,17 @@ static void io_submit_sqe(struct io_kiocb *req, struct io_submit_state *state, if (*link) { struct io_kiocb *prev = *link;
+ if (READ_ONCE(s->sqe->opcode) == IORING_OP_LINK_TIMEOUT) { + ret = io_timeout_setup(req); + /* common setup allows offset being set, we don't */ + if (!ret && s->sqe->off) + ret = -EINVAL; + if (ret) { + prev->flags |= REQ_F_FAIL_LINK; + goto err_req; + } + } + sqe_copy = kmemdup(s->sqe, sizeof(*sqe_copy), GFP_KERNEL); if (!sqe_copy) { ret = -EAGAIN; @@ -3031,6 +3050,7 @@ static void io_submit_sqe(struct io_kiocb *req, struct io_submit_state *state, }
s->sqe = sqe_copy; + req->flags |= REQ_F_FREE_SQE; trace_io_uring_link(ctx, req, prev); list_add_tail(&req->list, &prev->link_list); } else if (s->sqe->flags & IOSQE_IO_LINK) {
From: Jens Axboe <axboe@kernel.dk>

mainline inclusion
from mainline-5.5-rc1
commit e0e328c4b330712e45ba799dc589bda751323110
category: feature
bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27
CVE: NA

---------------------------
With the conversion to io-wq, we no longer use that flag. Kill it.
Fixes: 561fb04a6a22 ("io_uring: replace workqueue usage with io-wq")
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Zhihao Cheng <chengzhihao1@huawei.com>
Signed-off-by: yangerkun <yangerkun@huawei.com>
Reviewed-by: zhangyi (F) <yi.zhang@huawei.com>
Signed-off-by: Cheng Jian <cj.chengjian@huawei.com>
---
 fs/io_uring.c | 1 -
 1 file changed, 1 deletion(-)
diff --git a/fs/io_uring.c b/fs/io_uring.c
index 9cdced780c9f..5e34c660faef 100644
--- a/fs/io_uring.c
+++ b/fs/io_uring.c
@@ -340,7 +340,6 @@ struct io_kiocb {
 #define REQ_F_NOWAIT 1 /* must not punt to workers */
 #define REQ_F_IOPOLL_COMPLETED 2 /* polled IO has completed */
 #define REQ_F_FIXED_FILE 4 /* ctx owns file */
-#define REQ_F_SEQ_PREV 8 /* sequential with previous */
 #define REQ_F_IO_DRAIN 16 /* drain existing IO first */
 #define REQ_F_IO_DRAINED 32 /* drain done */
 #define REQ_F_LINK 64 /* linked sqes */
From: Jens Axboe <axboe@kernel.dk>

mainline inclusion
from mainline-5.5-rc1
commit b0dd8a412699afe3420a08f841333f3474ad45c5
category: feature
bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27
CVE: NA

---------------------------
Currently a poll request fills a completion entry of 0, even if it got cancelled. This is odd, and it makes it harder to support with chains. Ensure that it returns -ECANCELED in the completion events if it got cancelled, and furthermore ensure that the linked timeout that triggered it completes with -ETIME if we did indeed trigger the completions through a timeout.
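To make the new semantics concrete, here is a minimal userspace sketch against liburing (the fd, poll mask, and timeout length are invented for illustration, and a liburing/kernel new enough for IORING_OP_LINK_TIMEOUT is assumed). If the fd never becomes readable, the poll now completes with -ECANCELED and the linked timeout with -ETIME:

    /* sketch: a poll request guarded by a linked timeout */
    #include <liburing.h>
    #include <poll.h>
    #include <stdio.h>

    int poll_with_timeout(struct io_uring *ring, int fd)
    {
        struct __kernel_timespec ts = { .tv_sec = 0, .tv_nsec = 100000000 };
        struct io_uring_sqe *sqe;
        struct io_uring_cqe *cqe;
        int i;

        sqe = io_uring_get_sqe(ring);
        io_uring_prep_poll_add(sqe, fd, POLLIN);
        sqe->flags |= IOSQE_IO_LINK;    /* the next sqe is this request's timeout */

        sqe = io_uring_get_sqe(ring);
        io_uring_prep_link_timeout(sqe, &ts, 0);

        io_uring_submit(ring);

        /* expect -ECANCELED for the poll and -ETIME for the timeout */
        for (i = 0; i < 2; i++) {
            io_uring_wait_cqe(ring, &cqe);
            printf("res=%d\n", cqe->res);
            io_uring_cqe_seen(ring, cqe);
        }
        return 0;
    }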
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Zhihao Cheng <chengzhihao1@huawei.com>
Signed-off-by: yangerkun <yangerkun@huawei.com>
Reviewed-by: zhangyi (F) <yi.zhang@huawei.com>
Signed-off-by: Cheng Jian <cj.chengjian@huawei.com>
---
 fs/io_uring.c | 33 ++++++++++++++++++++++-----------
 1 file changed, 22 insertions(+), 11 deletions(-)
diff --git a/fs/io_uring.c b/fs/io_uring.c index 5e34c660faef..f892ef9b848f 100644 --- a/fs/io_uring.c +++ b/fs/io_uring.c @@ -2065,12 +2065,15 @@ static int io_poll_remove(struct io_kiocb *req, const struct io_uring_sqe *sqe) return 0; }
-static void io_poll_complete(struct io_kiocb *req, __poll_t mask) +static void io_poll_complete(struct io_kiocb *req, __poll_t mask, int error) { struct io_ring_ctx *ctx = req->ctx;
req->poll.done = true; - io_cqring_fill_event(req, mangle_poll(mask)); + if (error) + io_cqring_fill_event(req, error); + else + io_cqring_fill_event(req, mangle_poll(mask)); io_commit_cqring(ctx); }
@@ -2083,11 +2086,16 @@ static void io_poll_complete_work(struct io_wq_work **workptr) struct io_ring_ctx *ctx = req->ctx; struct io_kiocb *nxt = NULL; __poll_t mask = 0; + int ret = 0;
- if (work->flags & IO_WQ_WORK_CANCEL) + if (work->flags & IO_WQ_WORK_CANCEL) { WRITE_ONCE(poll->canceled, true); + ret = -ECANCELED; + } else if (READ_ONCE(poll->canceled)) { + ret = -ECANCELED; + }
- if (!READ_ONCE(poll->canceled)) + if (ret != -ECANCELED) mask = vfs_poll(poll->file, &pt) & poll->events;
/* @@ -2098,13 +2106,13 @@ static void io_poll_complete_work(struct io_wq_work **workptr) * avoid further branches in the fast path. */ spin_lock_irq(&ctx->completion_lock); - if (!mask && !READ_ONCE(poll->canceled)) { + if (!mask && ret != -ECANCELED) { add_wait_queue(poll->head, &poll->wait); spin_unlock_irq(&ctx->completion_lock); return; } io_poll_remove_req(req); - io_poll_complete(req, mask); + io_poll_complete(req, mask, ret); spin_unlock_irq(&ctx->completion_lock);
io_cqring_ev_posted(ctx); @@ -2138,7 +2146,7 @@ static int io_poll_wake(struct wait_queue_entry *wait, unsigned mode, int sync, */ if (mask && spin_trylock_irqsave(&ctx->completion_lock, flags)) { io_poll_remove_req(req); - io_poll_complete(req, mask); + io_poll_complete(req, mask, 0); req->flags |= REQ_F_COMP_LOCKED; io_put_req(req); spin_unlock_irqrestore(&ctx->completion_lock, flags); @@ -2250,7 +2258,7 @@ static int io_poll_add(struct io_kiocb *req, const struct io_uring_sqe *sqe, } if (mask) { /* no async, we'd stolen it */ ipt.error = 0; - io_poll_complete(req, mask); + io_poll_complete(req, mask, 0); } spin_unlock_irq(&ctx->completion_lock);
@@ -2502,7 +2510,7 @@ static int io_async_cancel_one(struct io_ring_ctx *ctx, void *sqe_addr)
static void io_async_find_and_cancel(struct io_ring_ctx *ctx, struct io_kiocb *req, __u64 sqe_addr, - struct io_kiocb **nxt) + struct io_kiocb **nxt, int success_ret) { unsigned long flags; int ret; @@ -2519,6 +2527,8 @@ static void io_async_find_and_cancel(struct io_ring_ctx *ctx, goto done; ret = io_poll_cancel(ctx, sqe_addr); done: + if (!ret) + ret = success_ret; io_cqring_fill_event(req, ret); io_commit_cqring(ctx); spin_unlock_irqrestore(&ctx->completion_lock, flags); @@ -2540,7 +2550,7 @@ static int io_async_cancel(struct io_kiocb *req, const struct io_uring_sqe *sqe, sqe->cancel_flags) return -EINVAL;
- io_async_find_and_cancel(ctx, req, READ_ONCE(sqe->addr), nxt); + io_async_find_and_cancel(ctx, req, READ_ONCE(sqe->addr), nxt, 0); return 0; }
@@ -2830,7 +2840,8 @@ static enum hrtimer_restart io_link_timeout_fn(struct hrtimer *timer) spin_unlock_irqrestore(&ctx->completion_lock, flags);
if (prev) { - io_async_find_and_cancel(ctx, req, prev->user_data, NULL); + io_async_find_and_cancel(ctx, req, prev->user_data, NULL, + -ETIME); io_put_req(prev); } else { io_cqring_add_event(req, -ETIME);
From: Jens Axboe <axboe@kernel.dk>

mainline inclusion
from mainline-5.5-rc1
commit fba38c272a0385148935d6443cb9dc68cf1f37a7
category: feature
bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27
CVE: NA

---------------------------
We currently don't explicitly break links if a request is cancelled, but we should. Add explicit link breakage for all types of request cancellation that we support.
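As a user-visible illustration (a hedged sketch with arbitrary user_data values, assuming liburing and a kernel carrying this series), cancelling the head of a chain now fails its dependents instead of leaving them dangling:

    /* sketch: removing a linked timeout breaks the rest of the chain */
    #include <liburing.h>
    #include <stdio.h>

    int main(void)
    {
        struct __kernel_timespec ts = { .tv_sec = 10 };
        struct io_uring ring;
        struct io_uring_sqe *sqe;
        struct io_uring_cqe *cqe;
        int i;

        io_uring_queue_init(8, &ring, 0);

        sqe = io_uring_get_sqe(&ring);
        io_uring_prep_timeout(sqe, &ts, 0, 0);    /* chain head */
        sqe->flags |= IOSQE_IO_LINK;
        sqe->user_data = 1;

        sqe = io_uring_get_sqe(&ring);
        io_uring_prep_nop(sqe);                   /* depends on the timeout */
        sqe->user_data = 2;

        sqe = io_uring_get_sqe(&ring);
        io_uring_prep_timeout_remove(sqe, 1, 0);  /* cancel user_data 1 */
        sqe->user_data = 3;

        io_uring_submit(&ring);

        /* expect: 1 -> -ECANCELED, 2 -> -ECANCELED (link broken), 3 -> 0 */
        for (i = 0; i < 3; i++) {
            io_uring_wait_cqe(&ring, &cqe);
            printf("user_data=%llu res=%d\n",
                   (unsigned long long) cqe->user_data, cqe->res);
            io_uring_cqe_seen(&ring, cqe);
        }
        io_uring_queue_exit(&ring);
        return 0;
    }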
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Zhihao Cheng <chengzhihao1@huawei.com>
Signed-off-by: yangerkun <yangerkun@huawei.com>
Reviewed-by: zhangyi (F) <yi.zhang@huawei.com>
Signed-off-by: Cheng Jian <cj.chengjian@huawei.com>
---
 fs/io_uring.c | 6 ++++++
 1 file changed, 6 insertions(+)
diff --git a/fs/io_uring.c b/fs/io_uring.c
index f892ef9b848f..b18844ca8484 100644
--- a/fs/io_uring.c
+++ b/fs/io_uring.c
@@ -2117,6 +2117,8 @@ static void io_poll_complete_work(struct io_wq_work **workptr)
 
     io_cqring_ev_posted(ctx);
 
+    if (ret < 0 && req->flags & REQ_F_LINK)
+        req->flags |= REQ_F_FAIL_LINK;
     io_put_req_find_next(req, &nxt);
     if (nxt)
         *workptr = &nxt->work;
@@ -2330,6 +2332,8 @@ static int io_timeout_cancel(struct io_ring_ctx *ctx, __u64 user_data)
     if (ret == -1)
         return -EALREADY;
 
+    if (req->flags & REQ_F_LINK)
+        req->flags |= REQ_F_FAIL_LINK;
     io_cqring_fill_event(req, -ECANCELED);
     io_put_req(req);
     return 0;
@@ -2840,6 +2844,8 @@ static enum hrtimer_restart io_link_timeout_fn(struct hrtimer *timer)
     spin_unlock_irqrestore(&ctx->completion_lock, flags);
 
     if (prev) {
+        if (prev->flags & REQ_F_LINK)
+            prev->flags |= REQ_F_FAIL_LINK;
         io_async_find_and_cancel(ctx, req, prev->user_data, NULL,
                         -ETIME);
         io_put_req(prev);
From: Jens Axboe <axboe@kernel.dk>

mainline inclusion
from mainline-5.5-rc1
commit b60fda6000a99a7ccac36005ab78b14b47c06de3
category: feature
bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27
CVE: NA

---------------------------
We currently have a race where if setup is really slow, we can be calling io_wq_destroy() before we're done setting up. This will cause the caller to get stuck waiting for the manager to set things up, but the manager has already exited.

Fix this by doing a sync setup of the manager. This also fixes the case where we'd get stuck if creating workers failed.
In practice this race window was really small, as we already wait for the manager to start. Hence someone would have to call io_wq_destroy() after the task has started, but before it started the first loop. The reported test case forked tons of these, which is why it became an issue.
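The shape of the fix is the classic synchronous-startup handshake. A minimal userspace analogue with pthreads (all names invented for illustration; the kernel code uses a struct completion instead of a condition variable):

    /* sketch: the creator waits until the worker has finished its setup */
    #include <pthread.h>
    #include <stdbool.h>

    struct manager {
        pthread_mutex_t lock;
        pthread_cond_t cond;
        bool ready;
        bool error;
    };

    static void *manager_fn(void *arg)
    {
        struct manager *m = arg;
        bool ok = true;            /* the real setup work would happen here */

        pthread_mutex_lock(&m->lock);
        m->error = !ok;
        m->ready = true;           /* setup finished (or failed) before anyone waits */
        pthread_cond_signal(&m->cond);
        pthread_mutex_unlock(&m->lock);
        return NULL;
    }

    int start_manager(struct manager *m, pthread_t *thread)
    {
        pthread_mutex_init(&m->lock, NULL);
        pthread_cond_init(&m->cond, NULL);
        m->ready = m->error = false;

        if (pthread_create(thread, NULL, manager_fn, m))
            return -1;

        pthread_mutex_lock(&m->lock);
        while (!m->ready)          /* like wait_for_completion(&wq->done) */
            pthread_cond_wait(&m->cond, &m->lock);
        pthread_mutex_unlock(&m->lock);

        return m->error ? -1 : 0;  /* like testing IO_WQ_BIT_ERROR */
    }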
Reported-by: syzbot+0f1cc17f85154f400465@syzkaller.appspotmail.com
Fixes: 771b53d033e8 ("io-wq: small threadpool implementation for io_uring")
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Zhihao Cheng <chengzhihao1@huawei.com>
Signed-off-by: yangerkun <yangerkun@huawei.com>
Reviewed-by: zhangyi (F) <yi.zhang@huawei.com>
Signed-off-by: Cheng Jian <cj.chengjian@huawei.com>
---
 fs/io-wq.c | 50 +++++++++++++++++++++++++++++---------------
 1 file changed, 35 insertions(+), 15 deletions(-)
diff --git a/fs/io-wq.c b/fs/io-wq.c index b7eae2e866a3..f9b5a1f94aa3 100644 --- a/fs/io-wq.c +++ b/fs/io-wq.c @@ -34,6 +34,7 @@ enum { enum { IO_WQ_BIT_EXIT = 0, /* wq exiting */ IO_WQ_BIT_CANCEL = 1, /* cancel work on list */ + IO_WQ_BIT_ERROR = 2, /* error on setup */ };
enum { @@ -563,14 +564,14 @@ void io_wq_worker_sleeping(struct task_struct *tsk) spin_unlock_irq(&wqe->lock); }
-static void create_io_worker(struct io_wq *wq, struct io_wqe *wqe, int index) +static bool create_io_worker(struct io_wq *wq, struct io_wqe *wqe, int index) { struct io_wqe_acct *acct =&wqe->acct[index]; struct io_worker *worker;
worker = kcalloc_node(1, sizeof(*worker), GFP_KERNEL, wqe->node); if (!worker) - return; + return false;
refcount_set(&worker->ref, 1); worker->nulls_node.pprev = NULL; @@ -582,7 +583,7 @@ static void create_io_worker(struct io_wq *wq, struct io_wqe *wqe, int index) "io_wqe_worker-%d/%d", index, wqe->node); if (IS_ERR(worker->task)) { kfree(worker); - return; + return false; }
spin_lock_irq(&wqe->lock); @@ -600,6 +601,7 @@ static void create_io_worker(struct io_wq *wq, struct io_wqe *wqe, int index) atomic_inc(&wq->user->processes);
wake_up_process(worker->task); + return true; }
static inline bool io_wqe_need_worker(struct io_wqe *wqe, int index) @@ -607,9 +609,6 @@ static inline bool io_wqe_need_worker(struct io_wqe *wqe, int index) { struct io_wqe_acct *acct = &wqe->acct[index];
- /* always ensure we have one bounded worker */ - if (index == IO_WQ_ACCT_BOUND && !acct->nr_workers) - return true; /* if we have available workers or no work, no need */ if (!hlist_nulls_empty(&wqe->free_list) || !io_wqe_run_queue(wqe)) return false; @@ -622,10 +621,19 @@ static inline bool io_wqe_need_worker(struct io_wqe *wqe, int index) static int io_wq_manager(void *data) { struct io_wq *wq = data; + int i;
- while (!kthread_should_stop()) { - int i; + /* create fixed workers */ + refcount_set(&wq->refs, wq->nr_wqes); + for (i = 0; i < wq->nr_wqes; i++) { + if (create_io_worker(wq, wq->wqes[i], IO_WQ_ACCT_BOUND)) + continue; + goto err; + }
+ complete(&wq->done); + + while (!kthread_should_stop()) { for (i = 0; i < wq->nr_wqes; i++) { struct io_wqe *wqe = wq->wqes[i]; bool fork_worker[2] = { false, false }; @@ -646,6 +654,12 @@ static int io_wq_manager(void *data) }
return 0; +err: + set_bit(IO_WQ_BIT_ERROR, &wq->state); + set_bit(IO_WQ_BIT_EXIT, &wq->state); + if (refcount_sub_and_test(wq->nr_wqes - i, &wq->refs)) + complete(&wq->done); + return 0; }
static bool io_wq_can_queue(struct io_wqe *wqe, struct io_wqe_acct *acct, @@ -983,7 +997,6 @@ struct io_wq *io_wq_create(unsigned bounded, struct mm_struct *mm, wq->user = user;
i = 0; - refcount_set(&wq->refs, wq->nr_wqes); for_each_online_node(node) { struct io_wqe *wqe;
@@ -1021,14 +1034,22 @@ struct io_wq *io_wq_create(unsigned bounded, struct mm_struct *mm, wq->manager = kthread_create(io_wq_manager, wq, "io_wq_manager"); if (!IS_ERR(wq->manager)) { wake_up_process(wq->manager); + wait_for_completion(&wq->done); + if (test_bit(IO_WQ_BIT_ERROR, &wq->state)) { + ret = -ENOMEM; + goto err; + } + reinit_completion(&wq->done); return wq; }
ret = PTR_ERR(wq->manager); - wq->manager = NULL; -err: complete(&wq->done); - io_wq_destroy(wq); +err: + for (i = 0; i < wq->nr_wqes; i++) + kfree(wq->wqes[i]); + kfree(wq->wqes); + kfree(wq); return ERR_PTR(ret); }
@@ -1042,10 +1063,9 @@ void io_wq_destroy(struct io_wq *wq) { int i;
- if (wq->manager) { - set_bit(IO_WQ_BIT_EXIT, &wq->state); + set_bit(IO_WQ_BIT_EXIT, &wq->state); + if (wq->manager) kthread_stop(wq->manager); - }
rcu_read_lock(); for (i = 0; i < wq->nr_wqes; i++) {
From: Dan Carpenter <dan.carpenter@oracle.com>

mainline inclusion
from mainline-5.5-rc1
commit b2e9c7d64b7ecacc1d0f15a6af88a73cab7d8db9
category: feature
bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27
CVE: NA

---------------------------
These lines are indented an extra space character.
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Zhihao Cheng <chengzhihao1@huawei.com>
Signed-off-by: yangerkun <yangerkun@huawei.com>
Reviewed-by: zhangyi (F) <yi.zhang@huawei.com>
Signed-off-by: Cheng Jian <cj.chengjian@huawei.com>
---
 fs/io-wq.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)
diff --git a/fs/io-wq.c b/fs/io-wq.c
index f9b5a1f94aa3..fc83200e04ca 100644
--- a/fs/io-wq.c
+++ b/fs/io-wq.c
@@ -329,9 +329,9 @@ static void __io_worker_busy(struct io_wqe *wqe, struct io_worker *worker,
     * If worker is moving from bound to unbound (or vice versa), then
     * ensure we update the running accounting.
     */
-     worker_bound = (worker->flags & IO_WORKER_F_BOUND) != 0;
-     work_bound = (work->flags & IO_WQ_WORK_UNBOUND) == 0;
-     if (worker_bound != work_bound) {
+    worker_bound = (worker->flags & IO_WORKER_F_BOUND) != 0;
+    work_bound = (work->flags & IO_WQ_WORK_UNBOUND) == 0;
+    if (worker_bound != work_bound) {
         io_wqe_dec_running(wqe, worker);
         if (work_bound) {
             worker->flags |= IO_WORKER_F_BOUND;
From: Pavel Begunkov <asml.silence@gmail.com>

mainline inclusion
from mainline-5.5-rc1
commit d3b35796b1e3f118017491d621f624e0de7ff9fb
category: feature
bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27
CVE: NA

---------------------------
If io_req_defer() fails, it needs to cancel a dependent link.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Zhihao Cheng <chengzhihao1@huawei.com>
Signed-off-by: yangerkun <yangerkun@huawei.com>
Reviewed-by: zhangyi (F) <yi.zhang@huawei.com>
Signed-off-by: Cheng Jian <cj.chengjian@huawei.com>
---
 fs/io_uring.c | 4 ++++
 1 file changed, 4 insertions(+)
diff --git a/fs/io_uring.c b/fs/io_uring.c
index b18844ca8484..3e223d0cd26b 100644
--- a/fs/io_uring.c
+++ b/fs/io_uring.c
@@ -2957,6 +2957,8 @@ static void io_queue_sqe(struct io_kiocb *req)
     if (ret) {
         if (ret != -EIOCBQUEUED) {
             io_cqring_add_event(req, ret);
+            if (req->flags & REQ_F_LINK)
+                req->flags |= REQ_F_FAIL_LINK;
             io_double_put_req(req);
         }
     } else
@@ -2989,6 +2991,8 @@ static void io_queue_link_head(struct io_kiocb *req, struct io_kiocb *shadow)
         if (ret != -EIOCBQUEUED) {
 err:
             io_cqring_add_event(req, ret);
+            if (req->flags & REQ_F_LINK)
+                req->flags |= REQ_F_FAIL_LINK;
             io_double_put_req(req);
             if (shadow)
                 __io_free_req(shadow);
From: Pavel Begunkov <asml.silence@gmail.com>

mainline inclusion
from mainline-5.5-rc1
commit f70193d6d8cad4cc614223fef349e6ea9d48c61f
category: feature
bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27
CVE: NA

---------------------------
Pass any IORING_OP_LINK_TIMEOUT request further, where it will eventually fail in io_issue_sqe().
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Zhihao Cheng <chengzhihao1@huawei.com>
Signed-off-by: yangerkun <yangerkun@huawei.com>
Reviewed-by: zhangyi (F) <yi.zhang@huawei.com>
Signed-off-by: Cheng Jian <cj.chengjian@huawei.com>
---
 fs/io_uring.c | 4 ----
 1 file changed, 4 deletions(-)
diff --git a/fs/io_uring.c b/fs/io_uring.c
index 3e223d0cd26b..b96dc17afd75 100644
--- a/fs/io_uring.c
+++ b/fs/io_uring.c
@@ -3078,10 +3078,6 @@ static void io_submit_sqe(struct io_kiocb *req, struct io_submit_state *state,
 
         INIT_LIST_HEAD(&req->link_list);
         *link = req;
-    } else if (READ_ONCE(s->sqe->opcode) == IORING_OP_LINK_TIMEOUT) {
-        /* Only valid as a linked SQE */
-        ret = -EINVAL;
-        goto err_req;
     } else {
         io_queue_sqe(req);
     }
From: Pavel Begunkov <asml.silence@gmail.com>

mainline inclusion
from mainline-5.5-rc1
commit 09fbb0a83ec6ab5a4037766261c031151985fff6
category: feature
bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27
CVE: NA

---------------------------
Let's say we have a dependent link: REQ -> LINK_TIMEOUT -> LINK_TIMEOUT

1. submission stage: the submission references for REQ and the first LINK_TIMEOUT are dropped. So, the references are respectively (1, 1, 2).

2. io_put(REQ) + FAIL_LINKS stage: this calls io_fail_links(), which for all linked timeouts will call cancel_timeout() and drop 1 reference. So, the references afterwards are (0, 0, 1). That's a leak.
Make it treat only the first linked timeout as such, and pass others through __io_double_put_req().
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Zhihao Cheng <chengzhihao1@huawei.com>
Signed-off-by: yangerkun <yangerkun@huawei.com>
Reviewed-by: zhangyi (F) <yi.zhang@huawei.com>
Signed-off-by: Cheng Jian <cj.chengjian@huawei.com>
---
 fs/io_uring.c | 1 +
 1 file changed, 1 insertion(+)
diff --git a/fs/io_uring.c b/fs/io_uring.c
index b96dc17afd75..94ee48d6cdf7 100644
--- a/fs/io_uring.c
+++ b/fs/io_uring.c
@@ -941,6 +941,7 @@ static void io_fail_links(struct io_kiocb *req)
         if ((req->flags & REQ_F_LINK_TIMEOUT) &&
             link->submit.sqe->opcode == IORING_OP_LINK_TIMEOUT) {
             io_link_cancel_timeout(link);
+            req->flags &= ~REQ_F_LINK_TIMEOUT;
         } else {
             io_cqring_fill_event(link, -ECANCELED);
             __io_double_put_req(link);
From: Jens Axboe <axboe@kernel.dk>

mainline inclusion
from mainline-5.5-rc1
commit 5d960724b0cb0d12469d1c62912e4a8c09c9fd92
category: feature
bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27
CVE: NA

---------------------------
We currently clear the linked timeout field if we cancel such a timeout, but we should only attempt to cancel if it's the first one we see. Others should simply be freed like other requests, as they haven't been started yet.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Zhihao Cheng <chengzhihao1@huawei.com>
Signed-off-by: yangerkun <yangerkun@huawei.com>
Reviewed-by: zhangyi (F) <yi.zhang@huawei.com>
Signed-off-by: Cheng Jian <cj.chengjian@huawei.com>
---
 fs/io_uring.c | 7 ++++---
 1 file changed, 4 insertions(+), 3 deletions(-)
diff --git a/fs/io_uring.c b/fs/io_uring.c
index 94ee48d6cdf7..b05bdd5d523e 100644
--- a/fs/io_uring.c
+++ b/fs/io_uring.c
@@ -941,12 +941,12 @@ static void io_fail_links(struct io_kiocb *req)
         if ((req->flags & REQ_F_LINK_TIMEOUT) &&
             link->submit.sqe->opcode == IORING_OP_LINK_TIMEOUT) {
             io_link_cancel_timeout(link);
-            req->flags &= ~REQ_F_LINK_TIMEOUT;
         } else {
             io_cqring_fill_event(link, -ECANCELED);
             __io_double_put_req(link);
         }
         kfree(sqe_to_free);
+        req->flags &= ~REQ_F_LINK_TIMEOUT;
     }
 
     io_commit_cqring(ctx);
@@ -2836,9 +2836,10 @@ static enum hrtimer_restart io_link_timeout_fn(struct hrtimer *timer)
     */
    if (!list_empty(&req->list)) {
        prev = list_entry(req->list.prev, struct io_kiocb, link_list);
-       if (refcount_inc_not_zero(&prev->refs))
+       if (refcount_inc_not_zero(&prev->refs)) {
            list_del_init(&req->list);
-       else
+           prev->flags &= ~REQ_F_LINK_TIMEOUT;
+       } else
            prev = NULL;
    }
 
From: Pavel Begunkov <asml.silence@gmail.com>

mainline inclusion
from mainline-5.5-rc1
commit bbad27b2f622fa26d107f8a72c0cd5cc102dc56e
category: feature
bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27
CVE: NA

---------------------------
Always mark requests that have an allocated sqe copy, and deallocate it in __io_free_req(). It's easier to follow and doesn't add edge cases.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Zhihao Cheng <chengzhihao1@huawei.com>
Signed-off-by: yangerkun <yangerkun@huawei.com>
Reviewed-by: zhangyi (F) <yi.zhang@huawei.com>
Signed-off-by: Cheng Jian <cj.chengjian@huawei.com>
---
 fs/io_uring.c | 49 ++++++++++++++++++++++---------------------------
 1 file changed, 22 insertions(+), 27 deletions(-)
diff --git a/fs/io_uring.c b/fs/io_uring.c index b05bdd5d523e..ceb7ae870cf1 100644 --- a/fs/io_uring.c +++ b/fs/io_uring.c @@ -833,6 +833,8 @@ static void __io_free_req(struct io_kiocb *req) { struct io_ring_ctx *ctx = req->ctx;
+ if (req->flags & REQ_F_FREE_SQE) + kfree(req->submit.sqe); if (req->file && !(req->flags & REQ_F_FIXED_FILE)) fput(req->file); if (req->flags & REQ_F_INFLIGHT) { @@ -928,16 +930,11 @@ static void io_fail_links(struct io_kiocb *req) spin_lock_irqsave(&ctx->completion_lock, flags);
while (!list_empty(&req->link_list)) { - const struct io_uring_sqe *sqe_to_free = NULL; - link = list_first_entry(&req->link_list, struct io_kiocb, list); list_del_init(&link->list);
trace_io_uring_fail_link(req, link);
- if (link->flags & REQ_F_FREE_SQE) - sqe_to_free = link->submit.sqe; - if ((req->flags & REQ_F_LINK_TIMEOUT) && link->submit.sqe->opcode == IORING_OP_LINK_TIMEOUT) { io_link_cancel_timeout(link); @@ -945,7 +942,6 @@ static void io_fail_links(struct io_kiocb *req) io_cqring_fill_event(link, -ECANCELED); __io_double_put_req(link); } - kfree(sqe_to_free); req->flags &= ~REQ_F_LINK_TIMEOUT; }
@@ -1088,7 +1084,8 @@ static void io_iopoll_complete(struct io_ring_ctx *ctx, unsigned int *nr_events, * completions for those, only batch free for fixed * file and non-linked commands. */ - if (((req->flags & (REQ_F_FIXED_FILE|REQ_F_LINK)) == + if (((req->flags & + (REQ_F_FIXED_FILE|REQ_F_LINK|REQ_F_FREE_SQE)) == REQ_F_FIXED_FILE) && !io_is_fallback_req(req)) { reqs[to_free++] = req; if (to_free == ARRAY_SIZE(reqs)) @@ -2581,6 +2578,7 @@ static int io_req_defer(struct io_kiocb *req) }
memcpy(sqe_copy, sqe, sizeof(*sqe_copy)); + req->flags |= REQ_F_FREE_SQE; req->submit.sqe = sqe_copy;
trace_io_uring_defer(ctx, req, false); @@ -2675,7 +2673,6 @@ static void io_wq_submit_work(struct io_wq_work **workptr) struct io_wq_work *work = *workptr; struct io_kiocb *req = container_of(work, struct io_kiocb, work); struct sqe_submit *s = &req->submit; - const struct io_uring_sqe *sqe = s->sqe; struct io_kiocb *nxt = NULL; int ret = 0;
@@ -2711,9 +2708,6 @@ static void io_wq_submit_work(struct io_wq_work **workptr) io_put_req(req); }
- /* async context always use a copy of the sqe */ - kfree(sqe); - /* if a dependent link is ready, pass it back */ if (!ret && nxt) { struct io_kiocb *link; @@ -2912,23 +2906,24 @@ static void __io_queue_sqe(struct io_kiocb *req) struct io_uring_sqe *sqe_copy;
sqe_copy = kmemdup(s->sqe, sizeof(*sqe_copy), GFP_KERNEL); - if (sqe_copy) { - s->sqe = sqe_copy; - if (req->work.flags & IO_WQ_WORK_NEEDS_FILES) { - ret = io_grab_files(req); - if (ret) { - kfree(sqe_copy); - goto err; - } - } + if (!sqe_copy) + goto err;
- /* - * Queued up for async execution, worker will release - * submit reference when the iocb is actually submitted. - */ - io_queue_async_work(req); - return; + s->sqe = sqe_copy; + req->flags |= REQ_F_FREE_SQE; + + if (req->work.flags & IO_WQ_WORK_NEEDS_FILES) { + ret = io_grab_files(req); + if (ret) + goto err; } + + /* + * Queued up for async execution, worker will release + * submit reference when the iocb is actually submitted. + */ + io_queue_async_work(req); + return; }
err: @@ -3023,7 +3018,6 @@ static void io_queue_link_head(struct io_kiocb *req, struct io_kiocb *shadow) static void io_submit_sqe(struct io_kiocb *req, struct io_submit_state *state, struct io_kiocb **link) { - struct io_uring_sqe *sqe_copy; struct sqe_submit *s = &req->submit; struct io_ring_ctx *ctx = req->ctx; int ret; @@ -3053,6 +3047,7 @@ static void io_submit_sqe(struct io_kiocb *req, struct io_submit_state *state, */ if (*link) { struct io_kiocb *prev = *link; + struct io_uring_sqe *sqe_copy;
if (READ_ONCE(s->sqe->opcode) == IORING_OP_LINK_TIMEOUT) { ret = io_timeout_setup(req);
From: Jens Axboe <axboe@kernel.dk>

mainline inclusion
from mainline-5.5-rc1
commit eb065d301e8c83643367bdb0898becc364046bda
category: feature
bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27
CVE: NA

---------------------------
We currently rely on ring destruction to clean things up in case of failure, but io_allocate_scq_urings() can leave things half initialized if only part of it fails.

Be nice and either return with everything fully set up on success, or return an error with things nicely cleaned up.
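The fix follows the usual unwind-partial-allocations idiom; a tiny self-contained sketch of the same shape (names invented):

    /* sketch: either both allocations succeed, or neither is left behind */
    #include <stdlib.h>

    struct rings { void *cq_ring; void *sq_entries; };

    int rings_alloc(struct rings *r, size_t cq_size, size_t sq_size)
    {
        r->cq_ring = malloc(cq_size);
        if (!r->cq_ring)
            return -1;

        r->sq_entries = malloc(sq_size);
        if (!r->sq_entries) {
            free(r->cq_ring);      /* undo the first step on failure */
            r->cq_ring = NULL;
            return -1;
        }
        return 0;
    }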
Reported-by: syzbot+0d818c0d39399188f393@syzkaller.appspotmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Zhihao Cheng <chengzhihao1@huawei.com>
Signed-off-by: yangerkun <yangerkun@huawei.com>
Reviewed-by: zhangyi (F) <yi.zhang@huawei.com>
Signed-off-by: Cheng Jian <cj.chengjian@huawei.com>
---
 fs/io_uring.c | 10 ++++++++--
 1 file changed, 8 insertions(+), 2 deletions(-)
diff --git a/fs/io_uring.c b/fs/io_uring.c
index ceb7ae870cf1..79ee6964ff6c 100644
--- a/fs/io_uring.c
+++ b/fs/io_uring.c
@@ -4583,12 +4583,18 @@ static int io_allocate_scq_urings(struct io_ring_ctx *ctx,
     ctx->cq_entries = rings->cq_ring_entries;
 
     size = array_size(sizeof(struct io_uring_sqe), p->sq_entries);
-    if (size == SIZE_MAX)
+    if (size == SIZE_MAX) {
+        io_mem_free(ctx->rings);
+        ctx->rings = NULL;
         return -EOVERFLOW;
+    }
 
     ctx->sq_sqes = io_mem_alloc(size);
-    if (!ctx->sq_sqes)
+    if (!ctx->sq_sqes) {
+        io_mem_free(ctx->rings);
+        ctx->rings = NULL;
         return -ENOMEM;
+    }
 
     return 0;
 }
From: Jens Axboe <axboe@kernel.dk>

mainline inclusion
from mainline-5.5-rc1
commit 4d7dd462971405c65bfb3821dbb6b9ce13b5e8d6
category: feature
bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27
CVE: NA

---------------------------
We currently try and start the next link when we put the request, and only if we were going to free it. This means that the optimization to continue executing requests from the same context often fails, as we're not putting the final reference.
Add REQ_F_LINK_NEXT to keep track of this, and allow io_uring to find the next request more efficiently.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Zhihao Cheng <chengzhihao1@huawei.com>
Signed-off-by: yangerkun <yangerkun@huawei.com>
Reviewed-by: zhangyi (F) <yi.zhang@huawei.com>
Signed-off-by: Cheng Jian <cj.chengjian@huawei.com>
---
 fs/io_uring.c | 20 +++++++++++++++-----
 1 file changed, 15 insertions(+), 5 deletions(-)
diff --git a/fs/io_uring.c b/fs/io_uring.c
index 79ee6964ff6c..3531ffbeacfc 100644
--- a/fs/io_uring.c
+++ b/fs/io_uring.c
@@ -340,6 +340,7 @@ struct io_kiocb {
 #define REQ_F_NOWAIT 1 /* must not punt to workers */
 #define REQ_F_IOPOLL_COMPLETED 2 /* polled IO has completed */
 #define REQ_F_FIXED_FILE 4 /* ctx owns file */
+#define REQ_F_LINK_NEXT 8 /* already grabbed next link */
 #define REQ_F_IO_DRAIN 16 /* drain existing IO first */
 #define REQ_F_IO_DRAINED 32 /* drain done */
 #define REQ_F_LINK 64 /* linked sqes */
@@ -878,6 +879,10 @@ static void io_req_link_next(struct io_kiocb *req, struct io_kiocb **nxtptr)
     struct io_kiocb *nxt;
     bool wake_ev = false;
 
+    /* Already got next link */
+    if (req->flags & REQ_F_LINK_NEXT)
+        return;
+
     /*
     * The list should never be empty when we are called here. But could
     * potentially happen if the chain is messed up, check to be on the
@@ -914,6 +919,7 @@ static void io_req_link_next(struct io_kiocb *req, struct io_kiocb **nxtptr)
         break;
     }
 
+    req->flags |= REQ_F_LINK_NEXT;
     if (wake_ev)
         io_cqring_ev_posted(ctx);
 }
@@ -950,12 +956,10 @@ static void io_fail_links(struct io_kiocb *req)
     io_cqring_ev_posted(ctx);
 }
 
-static void io_free_req_find_next(struct io_kiocb *req, struct io_kiocb **nxt)
+static void io_req_find_next(struct io_kiocb *req, struct io_kiocb **nxt)
 {
-    if (likely(!(req->flags & REQ_F_LINK))) {
-        __io_free_req(req);
+    if (likely(!(req->flags & REQ_F_LINK)))
         return;
-    }
 
     /*
     * If LINK is set, we have dependent requests in this chain. If we
@@ -981,7 +985,11 @@ static void io_free_req_find_next(struct io_kiocb *req, struct io_kiocb **nxt)
     } else {
         io_req_link_next(req, nxt);
     }
+}
 
+static void io_free_req_find_next(struct io_kiocb *req, struct io_kiocb **nxt)
+{
+    io_req_find_next(req, nxt);
     __io_free_req(req);
 }
 
@@ -998,8 +1006,10 @@ static void io_put_req_find_next(struct io_kiocb *req, struct io_kiocb **nxtptr)
 {
     struct io_kiocb *nxt = NULL;
 
+    io_req_find_next(req, &nxt);
+
     if (refcount_dec_and_test(&req->refs))
-        io_free_req_find_next(req, &nxt);
+        __io_free_req(req);
 
     if (nxt) {
         if (nxtptr)
From: Jens Axboe <axboe@kernel.dk>

mainline inclusion
from mainline-5.5-rc1
commit b76da70fc3759df13e0991706451f1a2e06ba19e
category: feature
bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27
CVE: NA

---------------------------
When we find new work to process within the work handler, we queue the linked timeout before we have issued the new work. This can be problematic for very short timeouts, as we have a window where the new work isn't visible.
Allow the work handler to store a callback function for this in the work item, and flag it with IO_WQ_WORK_CB if the caller has done so. If that is set, then io-wq will call the callback when it has set up the new work item.
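A compact userspace model of the mechanism (struct and flag names shortened; purely illustrative, not the io-wq API itself):

    /* sketch: a work item whose func is first a setup hook, then the handler */
    #include <stdio.h>

    #define WORK_CB 1

    struct work {
        union {
            struct work *next;     /* queue linkage while pending */
            void *data;            /* callback payload once flagged WORK_CB */
        };
        void (*func)(struct work **);
        unsigned flags;
    };

    static void real_handler(struct work **wp)
    {
        printf("executing work %p\n", (void *) *wp);
    }

    static void setup_cb(struct work **wp)
    {
        struct work *w = *wp;

        printf("late setup with payload %p\n", w->data);
        w->func = real_handler;    /* hand off to the real handler */
    }

    static void handle_work(struct work *w)
    {
        if (w->flags & WORK_CB)    /* run the setup hook first, like io-wq */
            w->func(&w);
        w->func(&w);
    }

    int main(void)
    {
        struct work w = { .data = (void *) 0x1234, .func = setup_cb,
                          .flags = WORK_CB };

        handle_work(&w);
        return 0;
    }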
Reported-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Zhihao Cheng <chengzhihao1@huawei.com>
Signed-off-by: yangerkun <yangerkun@huawei.com>
Reviewed-by: zhangyi (F) <yi.zhang@huawei.com>
Signed-off-by: Cheng Jian <cj.chengjian@huawei.com>
---
 fs/io-wq.c    |  3 +++
 fs/io-wq.h    |  6 +++++-
 fs/io_uring.c | 16 ++++++++++++++--
 3 files changed, 22 insertions(+), 3 deletions(-)
diff --git a/fs/io-wq.c b/fs/io-wq.c
index fc83200e04ca..36553ae81eda 100644
--- a/fs/io-wq.c
+++ b/fs/io-wq.c
@@ -428,6 +428,9 @@ static void io_worker_handle_work(struct io_worker *worker)
         worker->cur_work = work;
         spin_unlock_irq(&worker->lock);
 
+        if (work->flags & IO_WQ_WORK_CB)
+            work->func(&work);
+
         if ((work->flags & IO_WQ_WORK_NEEDS_FILES) &&
             current->files != work->files) {
             task_lock(current);
diff --git a/fs/io-wq.h b/fs/io-wq.h
index 4b29f922f80c..b68b11bf3633 100644
--- a/fs/io-wq.h
+++ b/fs/io-wq.h
@@ -11,6 +11,7 @@ enum {
     IO_WQ_WORK_NEEDS_FILES = 16,
     IO_WQ_WORK_UNBOUND = 32,
     IO_WQ_WORK_INTERNAL = 64,
+    IO_WQ_WORK_CB = 128,
 
     IO_WQ_HASH_SHIFT = 24, /* upper 8 bits are used for hash key */
 };
@@ -22,7 +23,10 @@ enum io_wq_cancel {
 };
 
 struct io_wq_work {
-    struct list_head list;
+    union {
+        struct list_head list;
+        void *data;
+    };
     void (*func)(struct io_wq_work **);
     unsigned flags;
     struct files_struct *files;
diff --git a/fs/io_uring.c b/fs/io_uring.c
index 3531ffbeacfc..146b0febb54b 100644
--- a/fs/io_uring.c
+++ b/fs/io_uring.c
@@ -2678,6 +2678,15 @@ static int __io_submit_sqe(struct io_kiocb *req, struct io_kiocb **nxt,
     return 0;
 }
 
+static void io_link_work_cb(struct io_wq_work **workptr)
+{
+    struct io_wq_work *work = *workptr;
+    struct io_kiocb *link = work->data;
+
+    io_queue_linked_timeout(link);
+    work->func = io_wq_submit_work;
+}
+
 static void io_wq_submit_work(struct io_wq_work **workptr)
 {
     struct io_wq_work *work = *workptr;
@@ -2724,8 +2733,11 @@ static void io_wq_submit_work(struct io_wq_work **workptr)
 
         io_prep_async_work(nxt, &link);
         *workptr = &nxt->work;
-        if (link)
-            io_queue_linked_timeout(link);
+        if (link) {
+            nxt->work.flags |= IO_WQ_WORK_CB;
+            nxt->work.func = io_link_work_cb;
+            nxt->work.data = link;
+        }
     }
 }
From: Pavel Begunkov <asml.silence@gmail.com>

mainline inclusion
from mainline-5.5-rc1
commit 1b4a51b6d03d21f55effbcf609ba5526d87d9e9d
category: feature
bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27
CVE: NA

---------------------------
There's an issue with the shadow drain logic in that we drop the completion lock after deciding to defer a request, then re-grab it later and assume that the state is still the same. In the meantime, someone else completing a request could have found and issued it. This can cause a stall in the queue, by having a shadow request inserted that nobody is going to drain.
Additionally, if we fail allocating the shadow request, we simply ignore the drain.
Instead of using a shadow request, defer the next request/link instead. This also has the following advantages:
- removes semi-duplicated code - doesn't allocate memory for shadows - works better if only the head marked for drain - doesn't need complex synchronisation
On the flip side, it removes the shadow->seq == last_drain_in_link->seq optimization. That shouldn't be a common case, and can always be added back, if needed.
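For reference, the user-visible contract is unchanged: a drain flag on the head of a link chain still orders the entire chain after all previously submitted requests. A hedged liburing sketch (nops stand in for real requests):

    /* sketch: IOSQE_IO_DRAIN on a link head */
    #include <liburing.h>

    void submit_drained_chain(struct io_uring *ring)
    {
        struct io_uring_sqe *sqe;

        sqe = io_uring_get_sqe(ring);
        io_uring_prep_nop(sqe);        /* earlier request, completes first */

        sqe = io_uring_get_sqe(ring);
        io_uring_prep_nop(sqe);        /* chain head: drain, then link */
        sqe->flags |= IOSQE_IO_DRAIN | IOSQE_IO_LINK;

        sqe = io_uring_get_sqe(ring);
        io_uring_prep_nop(sqe);        /* runs only after the head completes */

        io_uring_submit(ring);
    }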
Fixes: 4fe2c963154c ("io_uring: add support for link with drain")
Cc: Jackie Liu <liuyun01@kylinos.cn>
Reported-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Zhihao Cheng <chengzhihao1@huawei.com>
Signed-off-by: yangerkun <yangerkun@huawei.com>
Reviewed-by: zhangyi (F) <yi.zhang@huawei.com>
Signed-off-by: Cheng Jian <cj.chengjian@huawei.com>
---
 fs/io_uring.c | 86 +++++++++++----------------------
 1 file changed, 18 insertions(+), 68 deletions(-)
diff --git a/fs/io_uring.c b/fs/io_uring.c index 146b0febb54b..b066a2300f68 100644 --- a/fs/io_uring.c +++ b/fs/io_uring.c @@ -186,6 +186,7 @@ struct io_ring_ctx { bool compat; bool account_mem; bool cq_overflow_flushed; + bool drain_next;
/* * Ring buffer of indices into array of io_uring_sqe, which is @@ -346,7 +347,7 @@ struct io_kiocb { #define REQ_F_LINK 64 /* linked sqes */ #define REQ_F_LINK_TIMEOUT 128 /* has linked timeout */ #define REQ_F_FAIL_LINK 256 /* fail rest of links */ -#define REQ_F_SHADOW_DRAIN 512 /* link-drain shadow req */ +#define REQ_F_DRAIN_LINK 512 /* link should be fully drained */ #define REQ_F_TIMEOUT 1024 /* timeout request */ #define REQ_F_ISREG 2048 /* regular file */ #define REQ_F_MUST_PUNT 4096 /* must be punted even for NONBLOCK */ @@ -619,11 +620,6 @@ static void io_commit_cqring(struct io_ring_ctx *ctx) __io_commit_cqring(ctx);
while ((req = io_get_deferred_req(ctx)) != NULL) { - if (req->flags & REQ_F_SHADOW_DRAIN) { - /* Just for drain, free it. */ - __io_free_req(req); - continue; - } req->flags |= REQ_F_IO_DRAINED; io_queue_async_work(req); } @@ -2972,6 +2968,12 @@ static void io_queue_sqe(struct io_kiocb *req) { int ret;
+ if (unlikely(req->ctx->drain_next)) { + req->flags |= REQ_F_IO_DRAIN; + req->ctx->drain_next = false; + } + req->ctx->drain_next = (req->flags & REQ_F_DRAIN_LINK); + ret = io_req_defer(req); if (ret) { if (ret != -EIOCBQUEUED) { @@ -2984,57 +2986,16 @@ static void io_queue_sqe(struct io_kiocb *req) __io_queue_sqe(req); }
-static void io_queue_link_head(struct io_kiocb *req, struct io_kiocb *shadow) +static inline void io_queue_link_head(struct io_kiocb *req) { - int ret; - int need_submit = false; - struct io_ring_ctx *ctx = req->ctx; - if (unlikely(req->flags & REQ_F_FAIL_LINK)) { - ret = -ECANCELED; - goto err; - } - if (!shadow) { + io_cqring_add_event(req, -ECANCELED); + io_double_put_req(req); + } else io_queue_sqe(req); - return; - } - - /* - * Mark the first IO in link list as DRAIN, let all the following - * IOs enter the defer list. all IO needs to be completed before link - * list. - */ - req->flags |= REQ_F_IO_DRAIN; - ret = io_req_defer(req); - if (ret) { - if (ret != -EIOCBQUEUED) { -err: - io_cqring_add_event(req, ret); - if (req->flags & REQ_F_LINK) - req->flags |= REQ_F_FAIL_LINK; - io_double_put_req(req); - if (shadow) - __io_free_req(shadow); - return; - } - } else { - /* - * If ret == 0 means that all IOs in front of link io are - * running done. let's queue link head. - */ - need_submit = true; - } - - /* Insert shadow req to defer_list, blocking next IOs */ - spin_lock_irq(&ctx->completion_lock); - trace_io_uring_defer(ctx, shadow, true); - list_add_tail(&shadow->list, &ctx->defer_list); - spin_unlock_irq(&ctx->completion_lock); - - if (need_submit) - __io_queue_sqe(req); }
+ #define SQE_VALID_FLAGS (IOSQE_FIXED_FILE|IOSQE_IO_DRAIN|IOSQE_IO_LINK)
static void io_submit_sqe(struct io_kiocb *req, struct io_submit_state *state, @@ -3071,6 +3032,9 @@ static void io_submit_sqe(struct io_kiocb *req, struct io_submit_state *state, struct io_kiocb *prev = *link; struct io_uring_sqe *sqe_copy;
+ if (s->sqe->flags & IOSQE_IO_DRAIN) + (*link)->flags |= REQ_F_DRAIN_LINK | REQ_F_IO_DRAIN; + if (READ_ONCE(s->sqe->opcode) == IORING_OP_LINK_TIMEOUT) { ret = io_timeout_setup(req); /* common setup allows offset being set, we don't */ @@ -3189,7 +3153,6 @@ static int io_submit_sqes(struct io_ring_ctx *ctx, unsigned int nr, { struct io_submit_state state, *statep = NULL; struct io_kiocb *link = NULL; - struct io_kiocb *shadow_req = NULL; int i, submitted = 0; bool mm_fault = false;
@@ -3228,18 +3191,6 @@ static int io_submit_sqes(struct io_ring_ctx *ctx, unsigned int nr,
sqe_flags = req->submit.sqe->flags;
- if (link && (sqe_flags & IOSQE_IO_DRAIN)) { - if (!shadow_req) { - shadow_req = io_get_req(ctx, NULL); - if (unlikely(!shadow_req)) - goto out; - shadow_req->flags |= (REQ_F_IO_DRAIN | REQ_F_SHADOW_DRAIN); - refcount_dec(&shadow_req->refs); - } - shadow_req->sequence = req->submit.sequence; - } - -out: req->submit.ring_file = ring_file; req->submit.ring_fd = ring_fd; req->submit.has_user = *mm != NULL; @@ -3255,14 +3206,13 @@ static int io_submit_sqes(struct io_ring_ctx *ctx, unsigned int nr, * that's the end of the chain. Submit the previous link. */ if (!(sqe_flags & IOSQE_IO_LINK) && link) { - io_queue_link_head(link, shadow_req); + io_queue_link_head(link); link = NULL; - shadow_req = NULL; } }
if (link) - io_queue_link_head(link, shadow_req); + io_queue_link_head(link); if (statep) io_submit_state_end(&state);
From: Jens Axboe <axboe@kernel.dk>

mainline inclusion
from mainline-5.5-rc1
commit 915967f69c591b34c5a18d6618af021a81ffd700
category: feature
bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27
CVE: NA

---------------------------
We don't have shadow requests anymore, so get rid of the shadow argument. Add the user_data argument, as that's often useful to easily match up requests, instead of having to look at request pointers.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Zhihao Cheng <chengzhihao1@huawei.com>
Signed-off-by: yangerkun <yangerkun@huawei.com>
Reviewed-by: zhangyi (F) <yi.zhang@huawei.com>
Signed-off-by: Cheng Jian <cj.chengjian@huawei.com>
---
 fs/io_uring.c                   |  2 +-
 include/trace/events/io_uring.h | 16 ++++++++--------
 2 files changed, 9 insertions(+), 9 deletions(-)
diff --git a/fs/io_uring.c b/fs/io_uring.c
index b066a2300f68..a094787d9bab 100644
--- a/fs/io_uring.c
+++ b/fs/io_uring.c
@@ -2587,7 +2587,7 @@ static int io_req_defer(struct io_kiocb *req)
     req->flags |= REQ_F_FREE_SQE;
     req->submit.sqe = sqe_copy;
 
-    trace_io_uring_defer(ctx, req, false);
+    trace_io_uring_defer(ctx, req, req->user_data);
     list_add_tail(&req->list, &ctx->defer_list);
     spin_unlock_irq(&ctx->completion_lock);
     return -EIOCBQUEUED;
diff --git a/include/trace/events/io_uring.h b/include/trace/events/io_uring.h
index 9e80fa2415d2..5ff28df57763 100644
--- a/include/trace/events/io_uring.h
+++ b/include/trace/events/io_uring.h
@@ -163,35 +163,35 @@ TRACE_EVENT(io_uring_queue_async_work,
 );
 
 /**
- * io_uring_defer_list - called before the io_uring work added into defer_list
+ * io_uring_defer - called when an io_uring request is deferred
 *
 * @ctx: pointer to a ring context structure
 * @req: pointer to a deferred request
- * @shadow: whether request is shadow or not
+ * @user_data: user data associated with the request
 *
 * Allows to track deferred requests, to get an insight about what requests are
 * not started immediately.
 */
 TRACE_EVENT(io_uring_defer,
 
-    TP_PROTO(void *ctx, void *req, bool shadow),
+    TP_PROTO(void *ctx, void *req, unsigned long long user_data),
 
-    TP_ARGS(ctx, req, shadow),
+    TP_ARGS(ctx, req, user_data),
 
     TP_STRUCT__entry (
         __field( void *, ctx )
         __field( void *, req )
-        __field( bool, shadow )
+        __field( unsigned long long, data )
     ),
 
     TP_fast_assign(
         __entry->ctx = ctx;
         __entry->req = req;
-        __entry->shadow = shadow;
+        __entry->data = user_data;
     ),
 
-    TP_printk("ring %p, request %p%s", __entry->ctx, __entry->req,
-          __entry->shadow ? ", shadow": "")
+    TP_printk("ring %p, request %p user_data %llu", __entry->ctx,
+          __entry->req, __entry->data)
 );
 
 /**
From: Pavel Begunkov <asml.silence@gmail.com>

mainline inclusion
from mainline-5.5-rc1
commit d732447fed7d6b4c22907f630cd25d574bae5276
category: feature
bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27
CVE: NA

---------------------------
__io_submit_sqe() is issuing requests, so call it as such. Moreover, it ends by calling io_iopoll_req_issued().
Rename it and make terminology clearer.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Zhihao Cheng <chengzhihao1@huawei.com>
Signed-off-by: yangerkun <yangerkun@huawei.com>
Reviewed-by: zhangyi (F) <yi.zhang@huawei.com>
Signed-off-by: Cheng Jian <cj.chengjian@huawei.com>
---
 fs/io_uring.c | 8 ++++----
 1 file changed, 4 insertions(+), 4 deletions(-)
diff --git a/fs/io_uring.c b/fs/io_uring.c
index a094787d9bab..bfacf7a8954b 100644
--- a/fs/io_uring.c
+++ b/fs/io_uring.c
@@ -2593,8 +2593,8 @@ static int io_req_defer(struct io_kiocb *req)
     return -EIOCBQUEUED;
 }
 
-static int __io_submit_sqe(struct io_kiocb *req, struct io_kiocb **nxt,
-               bool force_nonblock)
+static int io_issue_sqe(struct io_kiocb *req, struct io_kiocb **nxt,
+            bool force_nonblock)
 {
     int ret, opcode;
     struct sqe_submit *s = &req->submit;
@@ -2701,7 +2701,7 @@ static void io_wq_submit_work(struct io_wq_work **workptr)
     s->has_user = (work->flags & IO_WQ_WORK_HAS_MM) != 0;
     s->in_async = true;
     do {
-        ret = __io_submit_sqe(req, &nxt, false);
+        ret = io_issue_sqe(req, &nxt, false);
         /*
         * We can get EAGAIN for polled IO even though we're
         * forcing a sync submission from here, since we can't
@@ -2912,7 +2912,7 @@ static void __io_queue_sqe(struct io_kiocb *req)
     struct io_kiocb *nxt = io_prep_linked_timeout(req);
     int ret;
 
-    ret = __io_submit_sqe(req, NULL, true);
+    ret = io_issue_sqe(req, NULL, true);
 
     /*
     * We async punt it if the file wasn't marked NOWAIT, or if the file
From: Pavel Begunkov <asml.silence@gmail.com>

mainline inclusion
from mainline-5.5-rc1
commit 9835d6fafba58e6d9386a6d5af800789bdb52e5b
category: feature
bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27
CVE: NA

---------------------------
The number of SQEs to submit is specified by the user, so io_get_sqring() succeeds in most cases. Hint the compiler about that.

Checking the assembly generated by gcc 9.2.0 for x86-64, there is one branch misprediction.
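likely() and unlikely() are thin wrappers around __builtin_expect(); a standalone illustration of the same hints in plain C:

    /* sketch: branch-prediction hints of the kind added above */
    #include <stdio.h>

    #define likely(x)   __builtin_expect(!!(x), 1)
    #define unlikely(x) __builtin_expect(!!(x), 0)

    int consume(unsigned head, unsigned tail, unsigned nr_entries)
    {
        if (unlikely(head == tail))      /* ring empty: the rare case */
            return -1;

        if (likely(head < nr_entries))   /* well-formed index: the hot path */
            return (int) head;

        return -2;                       /* corrupted index, also rare */
    }

    int main(void)
    {
        printf("%d\n", consume(3, 10, 32));
        return 0;
    }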
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Zhihao Cheng <chengzhihao1@huawei.com>
Signed-off-by: yangerkun <yangerkun@huawei.com>
Reviewed-by: zhangyi (F) <yi.zhang@huawei.com>
Signed-off-by: Cheng Jian <cj.chengjian@huawei.com>
---
 fs/io_uring.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)
diff --git a/fs/io_uring.c b/fs/io_uring.c
index bfacf7a8954b..d7ea7e0ee473 100644
--- a/fs/io_uring.c
+++ b/fs/io_uring.c
@@ -3128,11 +3128,11 @@ static bool io_get_sqring(struct io_ring_ctx *ctx, struct sqe_submit *s)
     */
    head = ctx->cached_sq_head;
    /* make sure SQ entry isn't read before tail */
-   if (head == smp_load_acquire(&rings->sq.tail))
+   if (unlikely(head == smp_load_acquire(&rings->sq.tail)))
        return false;
 
    head = READ_ONCE(sq_array[head & ctx->sq_mask]);
-   if (head < ctx->sq_entries) {
+   if (likely(head < ctx->sq_entries)) {
        s->ring_file = NULL;
        s->sqe = &ctx->sq_sqes[head];
        s->sequence = ctx->cached_sq_head;
From: Jens Axboe <axboe@kernel.dk>

mainline inclusion
from mainline-5.5-rc1
commit 0b8c0ec7eedcd8f9f1a1f238d87f9b512b09e71a
category: feature
bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27
CVE: NA

---------------------------
syzbot reports:
kasan: CONFIG_KASAN_INLINE enabled
kasan: GPF could be caused by NULL-ptr deref or user memory access
general protection fault: 0000 [#1] PREEMPT SMP KASAN
CPU: 0 PID: 9217 Comm: io_uring-sq Not tainted 5.4.0-syzkaller #0
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
RIP: 0010:creds_are_invalid kernel/cred.c:792 [inline]
RIP: 0010:__validate_creds include/linux/cred.h:187 [inline]
RIP: 0010:override_creds+0x9f/0x170 kernel/cred.c:550
Code: ac 25 00 81 fb 64 65 73 43 0f 85 a3 37 00 00 e8 17 ab 25 00 49 8d 7c 24 10 48 b8 00 00 00 00 00 fc ff df 48 89 fa 48 c1 ea 03 <0f> b6 04 02 84 c0 74 08 3c 03 0f 8e 96 00 00 00 41 8b 5c 24 10 bf
RSP: 0018:ffff88809c45fda0 EFLAGS: 00010202
RAX: dffffc0000000000 RBX: 0000000043736564 RCX: ffffffff814f3318
RDX: 0000000000000002 RSI: ffffffff814f3329 RDI: 0000000000000010
RBP: ffff88809c45fdb8 R08: ffff8880a3aac240 R09: ffffed1014755849
R10: ffffed1014755848 R11: ffff8880a3aac247 R12: 0000000000000000
R13: ffff888098ab1600 R14: 0000000000000000 R15: 0000000000000000
FS:  0000000000000000(0000) GS:ffff8880ae800000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007ffd51c40664 CR3: 0000000092641000 CR4: 00000000001406f0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Call Trace:
 io_sq_thread+0x1c7/0xa20 fs/io_uring.c:3274
 kthread+0x361/0x430 kernel/kthread.c:255
 ret_from_fork+0x24/0x30 arch/x86/entry/entry_64.S:352
Modules linked in:
---[ end trace f2e1a4307fbe2245 ]---
RIP: 0010:creds_are_invalid kernel/cred.c:792 [inline]
RIP: 0010:__validate_creds include/linux/cred.h:187 [inline]
RIP: 0010:override_creds+0x9f/0x170 kernel/cred.c:550
Code: ac 25 00 81 fb 64 65 73 43 0f 85 a3 37 00 00 e8 17 ab 25 00 49 8d 7c 24 10 48 b8 00 00 00 00 00 fc ff df 48 89 fa 48 c1 ea 03 <0f> b6 04 02 84 c0 74 08 3c 03 0f 8e 96 00 00 00 41 8b 5c 24 10 bf
RSP: 0018:ffff88809c45fda0 EFLAGS: 00010202
RAX: dffffc0000000000 RBX: 0000000043736564 RCX: ffffffff814f3318
RDX: 0000000000000002 RSI: ffffffff814f3329 RDI: 0000000000000010
RBP: ffff88809c45fdb8 R08: ffff8880a3aac240 R09: ffffed1014755849
R10: ffffed1014755848 R11: ffff8880a3aac247 R12: 0000000000000000
R13: ffff888098ab1600 R14: 0000000000000000 R15: 0000000000000000
FS:  0000000000000000(0000) GS:ffff8880ae800000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007ffd51c40664 CR3: 0000000092641000 CR4: 00000000001406f0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
which is caused by slab fault injection triggering a failure in prepare_creds(). We don't actually need to create a copy of the creds as we're not modifying them; we just need a reference to the current task's creds. This avoids the failure case as well, and propagates the const throughout the stack.
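The underlying idea, taking a reference to an object that is only read instead of deep-copying it, in a generic self-contained form (this is an illustration, not the kernel's cred API):

    /* sketch: referencing cannot fail, copying can */
    #include <stdlib.h>
    #include <string.h>

    struct creds {
        int refcount;
        unsigned uid, gid;
    };

    /* copying allocates, so it has a failure path that must be handled */
    struct creds *creds_copy(const struct creds *c)
    {
        struct creds *copy = malloc(sizeof(*copy));

        if (!copy)
            return NULL;               /* the failure case the patch removes */
        memcpy(copy, c, sizeof(*copy));
        copy->refcount = 1;
        return copy;
    }

    /* taking a reference always succeeds */
    const struct creds *creds_get(struct creds *c)
    {
        c->refcount++;
        return c;
    }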
Fixes: 181e448d8709 ("io_uring: async workers should inherit the user creds")
Reported-by: syzbot+5320383e16029ba057ff@syzkaller.appspotmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Zhihao Cheng <chengzhihao1@huawei.com>
Signed-off-by: yangerkun <yangerkun@huawei.com>
Reviewed-by: zhangyi (F) <yi.zhang@huawei.com>
Signed-off-by: Cheng Jian <cj.chengjian@huawei.com>
---
 fs/io-wq.c    | 2 +-
 fs/io-wq.h    | 2 +-
 fs/io_uring.c | 4 ++--
 3 files changed, 4 insertions(+), 4 deletions(-)
diff --git a/fs/io-wq.c b/fs/io-wq.c
index cadbc77542f7..25654b5bf853 100644
--- a/fs/io-wq.c
+++ b/fs/io-wq.c
@@ -112,7 +112,7 @@ struct io_wq {
 
     struct task_struct *manager;
     struct user_struct *user;
-    struct cred *creds;
+    const struct cred *creds;
     struct mm_struct *mm;
     refcount_t refs;
     struct completion done;
diff --git a/fs/io-wq.h b/fs/io-wq.h
index 600e0158cba7..dd0af0d7376c 100644
--- a/fs/io-wq.h
+++ b/fs/io-wq.h
@@ -87,7 +87,7 @@ typedef void (put_work_fn)(struct io_wq_work *);
 struct io_wq_data {
     struct mm_struct *mm;
     struct user_struct *user;
-    struct cred *creds;
+    const struct cred *creds;
 
     get_work_fn *get_work;
     put_work_fn *put_work;
diff --git a/fs/io_uring.c b/fs/io_uring.c
index b22d30fecb60..da8e3bbddc1b 100644
--- a/fs/io_uring.c
+++ b/fs/io_uring.c
@@ -238,7 +238,7 @@ struct io_ring_ctx {
 
     struct user_struct *user;
 
-    struct cred *creds;
+    const struct cred *creds;
 
     /* 0 is for ctx quiesce/reinit/free, 1 is for sqo_thread started */
     struct completion *completions;
@@ -4759,7 +4759,7 @@ static int io_uring_create(unsigned entries, struct io_uring_params *p)
     ctx->compat = in_compat_syscall();
     ctx->account_mem = account_mem;
     ctx->user = user;
-    ctx->creds = prepare_creds();
+    ctx->creds = get_current_cred();
 
     ret = io_allocate_scq_urings(ctx, p);
     if (ret)
From: Jens Axboe <axboe@kernel.dk>

mainline inclusion
from mainline-5.5-rc1
commit 441cdbd5449b4923cd413d3ba748124f91388be9
category: feature
bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27
CVE: NA

---------------------------
We should never return -ERESTARTSYS to userspace; transform it into -EINTR.
Cc: stable@vger.kernel.org # v5.3+
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: yangerkun <yangerkun@huawei.com>
Reviewed-by: zhangyi (F) <yi.zhang@huawei.com>
Signed-off-by: Cheng Jian <cj.chengjian@huawei.com>
---
 fs/io_uring.c | 2 ++
 1 file changed, 2 insertions(+)
diff --git a/fs/io_uring.c b/fs/io_uring.c
index da8e3bbddc1b..0780574e1843 100644
--- a/fs/io_uring.c
+++ b/fs/io_uring.c
@@ -1916,6 +1916,8 @@ static int io_send_recvmsg(struct io_kiocb *req, const struct io_uring_sqe *sqe,
         ret = fn(sock, msg, flags);
         if (force_nonblock && ret == -EAGAIN)
             return ret;
+        if (ret == -ERESTARTSYS)
+            ret = -EINTR;
     }
 
     io_cqring_add_event(req, ret);
From: Jens Axboe <axboe@kernel.dk>

mainline inclusion
from mainline-5.5-rc1
commit 1a6b74fc87024db59d41cd7346bd437f20fb3e2d
category: feature
bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27
CVE: NA

---------------------------
Right now we just copy the sqe for async offload, but we want to store more context across an async punt. In preparation for doing so, put the sqe copy inside a structure that we can expand. With this pointer added, we can get rid of REQ_F_FREE_SQE, as that is now indicated by whether req->io is NULL or not.
No functional changes in this patch.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: yangerkun <yangerkun@huawei.com>
Reviewed-by: zhangyi (F) <yi.zhang@huawei.com>
Signed-off-by: Cheng Jian <cj.chengjian@huawei.com>
---
 fs/io_uring.c | 56 +++++++++++++++++++++++++++++----------------------
 1 file changed, 32 insertions(+), 24 deletions(-)
diff --git a/fs/io_uring.c b/fs/io_uring.c index 0780574e1843..12db5162dae8 100644 --- a/fs/io_uring.c +++ b/fs/io_uring.c @@ -308,6 +308,10 @@ struct io_timeout { struct io_timeout_data *data; };
+struct io_async_ctx { + struct io_uring_sqe sqe; +}; + /* * NOTE! Each of the iocb union members has the file pointer * as the first entry in their struct definition. So you can @@ -323,6 +327,7 @@ struct io_kiocb { };
const struct io_uring_sqe *sqe; + struct io_async_ctx *io; struct file *ring_file; int ring_fd; bool has_user; @@ -353,7 +358,6 @@ struct io_kiocb { #define REQ_F_TIMEOUT_NOSEQ 8192 /* no timeout sequence */ #define REQ_F_INFLIGHT 16384 /* on inflight list */ #define REQ_F_COMP_LOCKED 32768 /* completion under lock */ -#define REQ_F_FREE_SQE 65536 /* free sqe if not async queued */ u64 user_data; u32 result; u32 sequence; @@ -805,6 +809,7 @@ static struct io_kiocb *io_get_req(struct io_ring_ctx *ctx, }
got_it: + req->io = NULL; req->ring_file = NULL; req->file = NULL; req->ctx = ctx; @@ -835,8 +840,8 @@ static void __io_free_req(struct io_kiocb *req) { struct io_ring_ctx *ctx = req->ctx;
- if (req->flags & REQ_F_FREE_SQE) - kfree(req->sqe); + if (req->io) + kfree(req->io); if (req->file && !(req->flags & REQ_F_FIXED_FILE)) fput(req->file); if (req->flags & REQ_F_INFLIGHT) { @@ -1078,9 +1083,9 @@ static void io_iopoll_complete(struct io_ring_ctx *ctx, unsigned int *nr_events, * completions for those, only batch free for fixed * file and non-linked commands. */ - if (((req->flags & - (REQ_F_FIXED_FILE|REQ_F_LINK|REQ_F_FREE_SQE)) == - REQ_F_FIXED_FILE) && !io_is_fallback_req(req)) { + if (((req->flags & (REQ_F_FIXED_FILE|REQ_F_LINK)) == + REQ_F_FIXED_FILE) && !io_is_fallback_req(req) && + !req->io) { reqs[to_free++] = req; if (to_free == ARRAY_SIZE(reqs)) io_free_req_many(ctx, reqs, &to_free); @@ -2258,7 +2263,7 @@ static int io_poll_add(struct io_kiocb *req, const struct io_uring_sqe *sqe, if (!poll->wait) return -ENOMEM;
- req->sqe = NULL; + req->io = NULL; INIT_IO_WORK(&req->work, io_poll_complete_work); events = READ_ONCE(sqe->poll_events); poll->events = demangle_poll(events) | EPOLLERR | EPOLLHUP; @@ -2601,27 +2606,27 @@ static int io_async_cancel(struct io_kiocb *req, const struct io_uring_sqe *sqe,
static int io_req_defer(struct io_kiocb *req) { - struct io_uring_sqe *sqe_copy; struct io_ring_ctx *ctx = req->ctx; + struct io_async_ctx *io;
/* Still need defer if there is pending req in defer list. */ if (!req_need_defer(req) && list_empty(&ctx->defer_list)) return 0;
- sqe_copy = kmalloc(sizeof(*sqe_copy), GFP_KERNEL); - if (!sqe_copy) + io = kmalloc(sizeof(*io), GFP_KERNEL); + if (!io) return -EAGAIN;
spin_lock_irq(&ctx->completion_lock); if (!req_need_defer(req) && list_empty(&ctx->defer_list)) { spin_unlock_irq(&ctx->completion_lock); - kfree(sqe_copy); + kfree(io); return 0; }
- memcpy(sqe_copy, req->sqe, sizeof(*sqe_copy)); - req->flags |= REQ_F_FREE_SQE; - req->sqe = sqe_copy; + memcpy(&io->sqe, req->sqe, sizeof(io->sqe)); + req->sqe = &io->sqe; + req->io = io;
trace_io_uring_defer(ctx, req, req->user_data); list_add_tail(&req->list, &ctx->defer_list); @@ -2954,14 +2959,16 @@ static void __io_queue_sqe(struct io_kiocb *req) */ if (ret == -EAGAIN && (!(req->flags & REQ_F_NOWAIT) || (req->flags & REQ_F_MUST_PUNT))) { - struct io_uring_sqe *sqe_copy; + struct io_async_ctx *io;
- sqe_copy = kmemdup(req->sqe, sizeof(*sqe_copy), GFP_KERNEL); - if (!sqe_copy) + io = kmalloc(sizeof(*io), GFP_KERNEL); + if (!io) goto err;
- req->sqe = sqe_copy; - req->flags |= REQ_F_FREE_SQE; + memcpy(&io->sqe, req->sqe, sizeof(io->sqe)); + + req->sqe = &io->sqe; + req->io = io;
if (req->work.flags & IO_WQ_WORK_NEEDS_FILES) { ret = io_grab_files(req); @@ -3062,7 +3069,7 @@ static void io_submit_sqe(struct io_kiocb *req, struct io_submit_state *state, */ if (*link) { struct io_kiocb *prev = *link; - struct io_uring_sqe *sqe_copy; + struct io_async_ctx *io;
if (req->sqe->flags & IOSQE_IO_DRAIN) (*link)->flags |= REQ_F_DRAIN_LINK | REQ_F_IO_DRAIN; @@ -3078,14 +3085,15 @@ static void io_submit_sqe(struct io_kiocb *req, struct io_submit_state *state, } }
- sqe_copy = kmemdup(req->sqe, sizeof(*sqe_copy), GFP_KERNEL); - if (!sqe_copy) { + io = kmalloc(sizeof(*io), GFP_KERNEL); + if (!io) { ret = -EAGAIN; goto err_req; }
- req->sqe = sqe_copy; - req->flags |= REQ_F_FREE_SQE; + memcpy(&io->sqe, req->sqe, sizeof(io->sqe)); + req->sqe = &io->sqe; + req->io = io; trace_io_uring_link(ctx, req, prev); list_add_tail(&req->list, &prev->link_list); } else if (req->sqe->flags & IOSQE_IO_LINK) {
From: Jens Axboe axboe@kernel.dk
mainline inclusion from mainline-5.5-rc1 commit f67676d160c6ee2ed82917fadfed6d29cab8237c category: feature bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27 CVE: NA ---------------------------
Currently we don't copy the iovecs when we punt to async context. This can be problematic for applications that store the iovec on the stack, as they often assume that it's safe to let the iovec go out of scope as soon as IO submission has been called. This isn't always safe: we re-import the iovec once we're in async context, and by then the application's iovec may already be gone.

Make this 100% safe by copying the iovec just once. With this change, applications may safely store the iovec on the stack in all cases.
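To make the guarantee concrete, here is a minimal userspace sketch of the pattern this change protects. It assumes liburing is available; the file path, buffer size, and queue depth are arbitrary illustration choices, not anything mandated by the patch.

/*
 * Minimal sketch, assuming liburing. The iovec is a stack local that
 * goes out of scope as soon as queue_read() returns; with this change
 * that is safe even if the request is punted to async context.
 */
#include <liburing.h>
#include <sys/uio.h>
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

static int queue_read(struct io_uring *ring, int fd, void *buf, size_t len)
{
	struct iovec iov = { .iov_base = buf, .iov_len = len };
	struct io_uring_sqe *sqe = io_uring_get_sqe(ring);

	if (!sqe)
		return -1;
	io_uring_prep_readv(sqe, fd, &iov, 1, 0);
	return io_uring_submit(ring);	/* iov goes out of scope after this */
}

int main(void)
{
	struct io_uring ring;
	struct io_uring_cqe *cqe;
	char buf[4096];
	int fd = open("/etc/hostname", O_RDONLY);	/* arbitrary file */

	if (fd < 0 || io_uring_queue_init(8, &ring, 0) < 0)
		return 1;
	if (queue_read(&ring, fd, buf, sizeof(buf)) < 0)
		return 1;
	/* if punted, the kernel completes using its own iovec copy */
	io_uring_wait_cqe(&ring, &cqe);
	printf("read returned %d\n", cqe->res);
	io_uring_cqe_seen(&ring, cqe);
	io_uring_queue_exit(&ring);
	close(fd);
	return 0;
}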
Reported-by: 李通洲 carter.li@eoitek.com Signed-off-by: Jens Axboe axboe@kernel.dk Signed-off-by: yangerkun yangerkun@huawei.com Reviewed-by: zhangyi (F) yi.zhang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- fs/io_uring.c | 243 +++++++++++++++++++++++++++++++++++++------------- 1 file changed, 181 insertions(+), 62 deletions(-)
diff --git a/fs/io_uring.c b/fs/io_uring.c index 12db5162dae8..2060fb7b4450 100644 --- a/fs/io_uring.c +++ b/fs/io_uring.c @@ -308,8 +308,18 @@ struct io_timeout { struct io_timeout_data *data; };
+struct io_async_rw { + struct iovec fast_iov[UIO_FASTIOV]; + struct iovec *iov; + ssize_t nr_segs; + ssize_t size; +}; + struct io_async_ctx { struct io_uring_sqe sqe; + union { + struct io_async_rw rw; + }; };
/* @@ -1414,15 +1424,6 @@ static int io_prep_rw(struct io_kiocb *req, bool force_nonblock) if (S_ISREG(file_inode(req->file)->i_mode)) req->flags |= REQ_F_ISREG;
- /* - * If the file doesn't support async, mark it as REQ_F_MUST_PUNT so - * we know to async punt it even if it was opened O_NONBLOCK - */ - if (force_nonblock && !io_file_supports_async(req->file)) { - req->flags |= REQ_F_MUST_PUNT; - return -EAGAIN; - } - kiocb->ki_pos = READ_ONCE(sqe->off); kiocb->ki_flags = iocb_flags(kiocb->ki_filp); kiocb->ki_hint = ki_hint_validate(file_write_hint(kiocb->ki_filp)); @@ -1591,6 +1592,16 @@ static ssize_t io_import_iovec(int rw, struct io_kiocb *req, return io_import_fixed(req->ctx, rw, sqe, iter); }
+ if (req->io) { + struct io_async_rw *iorw = &req->io->rw; + + *iovec = iorw->iov; + iov_iter_init(iter, rw, *iovec, iorw->nr_segs, iorw->size); + if (iorw->iov == iorw->fast_iov) + *iovec = NULL; + return iorw->size; + } + if (!req->has_user) return -EFAULT;
@@ -1661,6 +1672,50 @@ static ssize_t loop_rw_iter(int rw, struct file *file, struct kiocb *kiocb, return ret; }
+static void io_req_map_io(struct io_kiocb *req, ssize_t io_size, + struct iovec *iovec, struct iovec *fast_iov, + struct iov_iter *iter) +{ + req->io->rw.nr_segs = iter->nr_segs; + req->io->rw.size = io_size; + req->io->rw.iov = iovec; + if (!req->io->rw.iov) { + req->io->rw.iov = req->io->rw.fast_iov; + memcpy(req->io->rw.iov, fast_iov, + sizeof(struct iovec) * iter->nr_segs); + } +} + +static int io_setup_async_io(struct io_kiocb *req, ssize_t io_size, + struct iovec *iovec, struct iovec *fast_iov, + struct iov_iter *iter) +{ + req->io = kmalloc(sizeof(*req->io), GFP_KERNEL); + if (req->io) { + io_req_map_io(req, io_size, iovec, fast_iov, iter); + memcpy(&req->io->sqe, req->sqe, sizeof(req->io->sqe)); + req->sqe = &req->io->sqe; + return 0; + } + + return -ENOMEM; +} + +static int io_read_prep(struct io_kiocb *req, struct iovec **iovec, + struct iov_iter *iter, bool force_nonblock) +{ + ssize_t ret; + + ret = io_prep_rw(req, force_nonblock); + if (ret) + return ret; + + if (unlikely(!(req->file->f_mode & FMODE_READ))) + return -EBADF; + + return io_import_iovec(READ, req, iovec, iter); +} + static int io_read(struct io_kiocb *req, struct io_kiocb **nxt, bool force_nonblock) { @@ -1669,23 +1724,31 @@ static int io_read(struct io_kiocb *req, struct io_kiocb **nxt, struct iov_iter iter; struct file *file; size_t iov_count; - ssize_t read_size, ret; + ssize_t io_size, ret;
- ret = io_prep_rw(req, force_nonblock); - if (ret) - return ret; - file = kiocb->ki_filp; - - if (unlikely(!(file->f_mode & FMODE_READ))) - return -EBADF; - - ret = io_import_iovec(READ, req, &iovec, &iter); - if (ret < 0) - return ret; + if (!req->io) { + ret = io_read_prep(req, &iovec, &iter, force_nonblock); + if (ret < 0) + return ret; + } else { + ret = io_import_iovec(READ, req, &iovec, &iter); + if (ret < 0) + return ret; + }
- read_size = ret; + file = req->file; + io_size = ret; if (req->flags & REQ_F_LINK) - req->result = read_size; + req->result = io_size; + + /* + * If the file doesn't support async, mark it as REQ_F_MUST_PUNT so + * we know to async punt it even if it was opened O_NONBLOCK + */ + if (force_nonblock && !io_file_supports_async(file)) { + req->flags |= REQ_F_MUST_PUNT; + goto copy_iov; + }
iov_count = iov_iter_count(&iter); ret = rw_verify_area(READ, file, &kiocb->ki_pos, iov_count); @@ -1707,18 +1770,40 @@ static int io_read(struct io_kiocb *req, struct io_kiocb **nxt, */ if (force_nonblock && !(req->flags & REQ_F_NOWAIT) && (req->flags & REQ_F_ISREG) && - ret2 > 0 && ret2 < read_size) + ret2 > 0 && ret2 < io_size) ret2 = -EAGAIN; /* Catch -EAGAIN return for forced non-blocking submission */ - if (!force_nonblock || ret2 != -EAGAIN) + if (!force_nonblock || ret2 != -EAGAIN) { kiocb_done(kiocb, ret2, nxt, req->in_async); - else - ret = -EAGAIN; + } else { +copy_iov: + ret = io_setup_async_io(req, io_size, iovec, + inline_vecs, &iter); + if (ret) + goto out_free; + return -EAGAIN; + } } +out_free: kfree(iovec); return ret; }
+static int io_write_prep(struct io_kiocb *req, struct iovec **iovec, + struct iov_iter *iter, bool force_nonblock) +{ + ssize_t ret; + + ret = io_prep_rw(req, force_nonblock); + if (ret) + return ret; + + if (unlikely(!(req->file->f_mode & FMODE_WRITE))) + return -EBADF; + + return io_import_iovec(WRITE, req, iovec, iter); +} + static int io_write(struct io_kiocb *req, struct io_kiocb **nxt, bool force_nonblock) { @@ -1727,29 +1812,36 @@ static int io_write(struct io_kiocb *req, struct io_kiocb **nxt, struct iov_iter iter; struct file *file; size_t iov_count; - ssize_t ret; + ssize_t ret, io_size;
- ret = io_prep_rw(req, force_nonblock); - if (ret) - return ret; + if (!req->io) { + ret = io_write_prep(req, &iovec, &iter, force_nonblock); + if (ret < 0) + return ret; + } else { + ret = io_import_iovec(WRITE, req, &iovec, &iter); + if (ret < 0) + return ret; + }
file = kiocb->ki_filp; - if (unlikely(!(file->f_mode & FMODE_WRITE))) - return -EBADF; - - ret = io_import_iovec(WRITE, req, &iovec, &iter); - if (ret < 0) - return ret; - + io_size = ret; if (req->flags & REQ_F_LINK) - req->result = ret; + req->result = io_size;
- iov_count = iov_iter_count(&iter); + /* + * If the file doesn't support async, mark it as REQ_F_MUST_PUNT so + * we know to async punt it even if it was opened O_NONBLOCK + */ + if (force_nonblock && !io_file_supports_async(req->file)) { + req->flags |= REQ_F_MUST_PUNT; + goto copy_iov; + }
- ret = -EAGAIN; if (force_nonblock && !(kiocb->ki_flags & IOCB_DIRECT)) - goto out_free; + goto copy_iov;
+ iov_count = iov_iter_count(&iter); ret = rw_verify_area(WRITE, file, &kiocb->ki_pos, iov_count); if (!ret) { ssize_t ret2; @@ -1773,10 +1865,16 @@ static int io_write(struct io_kiocb *req, struct io_kiocb **nxt, ret2 = call_write_iter(file, kiocb, &iter); else ret2 = loop_rw_iter(WRITE, file, kiocb, &iter); - if (!force_nonblock || ret2 != -EAGAIN) + if (!force_nonblock || ret2 != -EAGAIN) { kiocb_done(kiocb, ret2, nxt, req->in_async); - else - ret = -EAGAIN; + } else { +copy_iov: + ret = io_setup_async_io(req, io_size, iovec, + inline_vecs, &iter); + if (ret) + goto out_free; + return -EAGAIN; + } } out_free: kfree(iovec); @@ -2604,10 +2702,42 @@ static int io_async_cancel(struct io_kiocb *req, const struct io_uring_sqe *sqe, return 0; }
+static int io_req_defer_prep(struct io_kiocb *req, struct io_async_ctx *io) +{ + struct iovec inline_vecs[UIO_FASTIOV], *iovec = inline_vecs; + struct iov_iter iter; + ssize_t ret; + + memcpy(&io->sqe, req->sqe, sizeof(io->sqe)); + req->sqe = &io->sqe; + + switch (io->sqe.opcode) { + case IORING_OP_READV: + case IORING_OP_READ_FIXED: + ret = io_read_prep(req, &iovec, &iter, true); + break; + case IORING_OP_WRITEV: + case IORING_OP_WRITE_FIXED: + ret = io_write_prep(req, &iovec, &iter, true); + break; + default: + req->io = io; + return 0; + } + + if (ret < 0) + return ret; + + req->io = io; + io_req_map_io(req, ret, iovec, inline_vecs, &iter); + return 0; +} + static int io_req_defer(struct io_kiocb *req) { struct io_ring_ctx *ctx = req->ctx; struct io_async_ctx *io; + int ret;
/* Still need defer if there is pending req in defer list. */ if (!req_need_defer(req) && list_empty(&ctx->defer_list)) @@ -2624,9 +2754,9 @@ static int io_req_defer(struct io_kiocb *req) return 0; }
- memcpy(&io->sqe, req->sqe, sizeof(io->sqe)); - req->sqe = &io->sqe; - req->io = io; + ret = io_req_defer_prep(req, io); + if (ret < 0) + return ret;
trace_io_uring_defer(ctx, req, req->user_data); list_add_tail(&req->list, &ctx->defer_list); @@ -2959,17 +3089,6 @@ static void __io_queue_sqe(struct io_kiocb *req) */ if (ret == -EAGAIN && (!(req->flags & REQ_F_NOWAIT) || (req->flags & REQ_F_MUST_PUNT))) { - struct io_async_ctx *io; - - io = kmalloc(sizeof(*io), GFP_KERNEL); - if (!io) - goto err; - - memcpy(&io->sqe, req->sqe, sizeof(io->sqe)); - - req->sqe = &io->sqe; - req->io = io; - if (req->work.flags & IO_WQ_WORK_NEEDS_FILES) { ret = io_grab_files(req); if (ret) @@ -3091,9 +3210,9 @@ static void io_submit_sqe(struct io_kiocb *req, struct io_submit_state *state, goto err_req; }
- memcpy(&io->sqe, req->sqe, sizeof(io->sqe)); - req->sqe = &io->sqe; - req->io = io; + ret = io_req_defer_prep(req, io); + if (ret) + goto err_req; trace_io_uring_link(ctx, req, prev); list_add_tail(&req->list, &prev->link_list); } else if (req->sqe->flags & IOSQE_IO_LINK) {
From: Jens Axboe axboe@kernel.dk
mainline inclusion from mainline-5.5-rc1 commit 4257c8ca13b084550574b8c9a667d9c90ff746eb category: feature bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27 CVE: NA ---------------------------
This is in preparation for letting the io_uring sendmsg and recvmsg helpers copy the msghdr up front for validation, before continuing with the operation.
There should be no functional changes in this patch.
Acked-by: David S. Miller davem@davemloft.net Signed-off-by: Jens Axboe axboe@kernel.dk Signed-off-by: yangerkun yangerkun@huawei.com Reviewed-by: zhangyi (F) yi.zhang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- net/socket.c | 141 ++++++++++++++++++++++++++++++++++----------------- 1 file changed, 95 insertions(+), 46 deletions(-)
diff --git a/net/socket.c b/net/socket.c index 8faf6ea75c61..b4fd9c96e2ed 100644 --- a/net/socket.c +++ b/net/socket.c @@ -2068,15 +2068,10 @@ static int copy_msghdr_from_user(struct msghdr *kmsg, return err < 0 ? err : 0; }
-static int ___sys_sendmsg(struct socket *sock, struct user_msghdr __user *msg, - struct msghdr *msg_sys, unsigned int flags, - struct used_address *used_address, - unsigned int allowed_msghdr_flags) +static int ____sys_sendmsg(struct socket *sock, struct msghdr *msg_sys, + unsigned int flags, struct used_address *used_address, + unsigned int allowed_msghdr_flags) { - struct compat_msghdr __user *msg_compat = - (struct compat_msghdr __user *)msg; - struct sockaddr_storage address; - struct iovec iovstack[UIO_FASTIOV], *iov = iovstack; unsigned char ctl[sizeof(struct cmsghdr) + 20] __aligned(sizeof(__kernel_size_t)); /* 20 is size of ipv6_pktinfo */ @@ -2084,19 +2079,10 @@ static int ___sys_sendmsg(struct socket *sock, struct user_msghdr __user *msg, int ctl_len; ssize_t err;
- msg_sys->msg_name = &address; - - if (MSG_CMSG_COMPAT & flags) - err = get_compat_msghdr(msg_sys, msg_compat, NULL, &iov); - else - err = copy_msghdr_from_user(msg_sys, msg, NULL, &iov); - if (err < 0) - return err; - err = -ENOBUFS;
if (msg_sys->msg_controllen > INT_MAX) - goto out_freeiov; + goto out; flags |= (msg_sys->msg_flags & allowed_msghdr_flags); ctl_len = msg_sys->msg_controllen; if ((MSG_CMSG_COMPAT & flags) && ctl_len) { @@ -2104,7 +2090,7 @@ static int ___sys_sendmsg(struct socket *sock, struct user_msghdr __user *msg, cmsghdr_from_user_compat_to_kern(msg_sys, sock->sk, ctl, sizeof(ctl)); if (err) - goto out_freeiov; + goto out; ctl_buf = msg_sys->msg_control; ctl_len = msg_sys->msg_controllen; } else if (ctl_len) { @@ -2113,7 +2099,7 @@ static int ___sys_sendmsg(struct socket *sock, struct user_msghdr __user *msg, if (ctl_len > sizeof(ctl)) { ctl_buf = sock_kmalloc(sock->sk, ctl_len, GFP_KERNEL); if (ctl_buf == NULL) - goto out_freeiov; + goto out; } err = -EFAULT; /* @@ -2159,7 +2145,47 @@ static int ___sys_sendmsg(struct socket *sock, struct user_msghdr __user *msg, out_freectl: if (ctl_buf != ctl) sock_kfree_s(sock->sk, ctl_buf, ctl_len); -out_freeiov: +out: + return err; +} + +static int sendmsg_copy_msghdr(struct msghdr *msg, + struct user_msghdr __user *umsg, unsigned flags, + struct iovec **iov) +{ + int err; + + if (flags & MSG_CMSG_COMPAT) { + struct compat_msghdr __user *msg_compat; + + msg_compat = (struct compat_msghdr __user *) umsg; + err = get_compat_msghdr(msg, msg_compat, NULL, iov); + } else { + err = copy_msghdr_from_user(msg, umsg, NULL, iov); + } + if (err < 0) + return err; + + return 0; +} + +static int ___sys_sendmsg(struct socket *sock, struct user_msghdr __user *msg, + struct msghdr *msg_sys, unsigned int flags, + struct used_address *used_address, + unsigned int allowed_msghdr_flags) +{ + struct sockaddr_storage address; + struct iovec iovstack[UIO_FASTIOV], *iov = iovstack; + ssize_t err; + + msg_sys->msg_name = &address; + + err = sendmsg_copy_msghdr(msg_sys, msg, flags, &iov); + if (err < 0) + return err; + + err = ____sys_sendmsg(sock, msg_sys, flags, used_address, + allowed_msghdr_flags); kfree(iov); return err; } @@ -2278,33 +2304,41 @@ SYSCALL_DEFINE4(sendmmsg, int, fd, struct mmsghdr __user *, mmsg, return __sys_sendmmsg(fd, mmsg, vlen, flags, true); }
-static int ___sys_recvmsg(struct socket *sock, struct user_msghdr __user *msg, - struct msghdr *msg_sys, unsigned int flags, int nosec) +static int recvmsg_copy_msghdr(struct msghdr *msg, + struct user_msghdr __user *umsg, unsigned flags, + struct sockaddr __user **uaddr, + struct iovec **iov) { - struct compat_msghdr __user *msg_compat = - (struct compat_msghdr __user *)msg; - struct iovec iovstack[UIO_FASTIOV]; - struct iovec *iov = iovstack; - unsigned long cmsg_ptr; - int len; ssize_t err;
- /* kernel mode address */ - struct sockaddr_storage addr; - - /* user mode address pointers */ - struct sockaddr __user *uaddr; - int __user *uaddr_len = COMPAT_NAMELEN(msg); - - msg_sys->msg_name = &addr; + if (MSG_CMSG_COMPAT & flags) { + struct compat_msghdr __user *msg_compat;
- if (MSG_CMSG_COMPAT & flags) - err = get_compat_msghdr(msg_sys, msg_compat, &uaddr, &iov); - else - err = copy_msghdr_from_user(msg_sys, msg, &uaddr, &iov); + msg_compat = (struct compat_msghdr __user *) umsg; + err = get_compat_msghdr(msg, msg_compat, uaddr, iov); + } else { + err = copy_msghdr_from_user(msg, umsg, uaddr, iov); + } if (err < 0) return err;
+ return 0; +} + +static int ____sys_recvmsg(struct socket *sock, struct msghdr *msg_sys, + struct user_msghdr __user *msg, + struct sockaddr __user *uaddr, + unsigned int flags, int nosec) +{ + struct compat_msghdr __user *msg_compat = + (struct compat_msghdr __user *) msg; + int __user *uaddr_len = COMPAT_NAMELEN(msg); + struct sockaddr_storage addr; + unsigned long cmsg_ptr; + int len; + ssize_t err; + + msg_sys->msg_name = &addr; cmsg_ptr = (unsigned long)msg_sys->msg_control; msg_sys->msg_flags = flags & (MSG_CMSG_CLOEXEC|MSG_CMSG_COMPAT);
@@ -2315,7 +2349,7 @@ static int ___sys_recvmsg(struct socket *sock, struct user_msghdr __user *msg, flags |= MSG_DONTWAIT; err = (nosec ? sock_recvmsg_nosec : sock_recvmsg)(sock, msg_sys, flags); if (err < 0) - goto out_freeiov; + goto out; len = err;
if (uaddr != NULL) { @@ -2323,12 +2357,12 @@ static int ___sys_recvmsg(struct socket *sock, struct user_msghdr __user *msg, msg_sys->msg_namelen, uaddr, uaddr_len); if (err < 0) - goto out_freeiov; + goto out; } err = __put_user((msg_sys->msg_flags & ~MSG_CMSG_COMPAT), COMPAT_FLAGS(msg)); if (err) - goto out_freeiov; + goto out; if (MSG_CMSG_COMPAT & flags) err = __put_user((unsigned long)msg_sys->msg_control - cmsg_ptr, &msg_compat->msg_controllen); @@ -2336,10 +2370,25 @@ static int ___sys_recvmsg(struct socket *sock, struct user_msghdr __user *msg, err = __put_user((unsigned long)msg_sys->msg_control - cmsg_ptr, &msg->msg_controllen); if (err) - goto out_freeiov; + goto out; err = len; +out: + return err; +} + +static int ___sys_recvmsg(struct socket *sock, struct user_msghdr __user *msg, + struct msghdr *msg_sys, unsigned int flags, int nosec) +{ + struct iovec iovstack[UIO_FASTIOV], *iov = iovstack; + /* user mode address pointers */ + struct sockaddr __user *uaddr; + ssize_t err; + + err = recvmsg_copy_msghdr(msg_sys, msg, flags, &uaddr, &iov); + if (err < 0) + return err;
-out_freeiov: + err = ____sys_recvmsg(sock, msg_sys, msg, uaddr, flags, nosec); kfree(iov); return err; }
From: Jens Axboe axboe@kernel.dk
mainline inclusion from mainline-5.5-rc1 commit d69e07793f891524c6bbf1e75b9ae69db4450953 category: feature bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27 CVE: NA ---------------------------
Only io_uring uses (and added) these helpers, and we want to disallow the use of sendmsg/recvmsg for anything but regular data transfers. Use the newly added prep helper to split the msghdr copy out from the core function and check the msg_control and msg_controllen settings there. If either is set, we return -EINVAL.
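A quick way to observe the new behavior from userspace is the sketch below. The liburing calls and the socketpair setup are illustrative assumptions, not part of this patch: a sendmsg SQE that carries ancillary data now completes with -EINVAL in the CQE, while plain data transfers are unaffected.

/* Sketch, assuming liburing: ancillary data on the io_uring sendmsg
 * path is rejected with -EINVAL before any data is transferred. */
#include <liburing.h>
#include <sys/socket.h>
#include <sys/uio.h>
#include <string.h>
#include <stdio.h>

int main(void)
{
	struct io_uring ring;
	struct io_uring_sqe *sqe;
	struct io_uring_cqe *cqe;
	int sv[2];
	char data[] = "hello";
	char cbuf[CMSG_SPACE(sizeof(int))] = { 0 };
	struct iovec iov = { .iov_base = data, .iov_len = sizeof(data) };
	struct msghdr msg = {
		.msg_iov = &iov,
		.msg_iovlen = 1,
		/* ancillary data: this is what the new check refuses */
		.msg_control = cbuf,
		.msg_controllen = sizeof(cbuf),
	};

	if (socketpair(AF_UNIX, SOCK_DGRAM, 0, sv) < 0 ||
	    io_uring_queue_init(4, &ring, 0) < 0)
		return 1;
	sqe = io_uring_get_sqe(&ring);
	if (!sqe)
		return 1;
	io_uring_prep_sendmsg(sqe, sv[0], &msg, 0);
	io_uring_submit(&ring);
	io_uring_wait_cqe(&ring, &cqe);
	printf("cqe->res = %d (expect -EINVAL)\n", cqe->res);
	io_uring_cqe_seen(&ring, cqe);
	io_uring_queue_exit(&ring);
	return 0;
}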
Acked-by: David S. Miller davem@davemloft.net Signed-off-by: Jens Axboe axboe@kernel.dk Signed-off-by: yangerkun yangerkun@huawei.com Reviewed-by: zhangyi (F) yi.zhang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- net/socket.c | 43 +++++++++++++++++++++++++++++++++++++------ 1 file changed, 37 insertions(+), 6 deletions(-)
diff --git a/net/socket.c b/net/socket.c index b4fd9c96e2ed..b3ffa502d62a 100644 --- a/net/socket.c +++ b/net/socket.c @@ -2193,12 +2193,27 @@ static int ___sys_sendmsg(struct socket *sock, struct user_msghdr __user *msg, /* * BSD sendmsg interface */ -long __sys_sendmsg_sock(struct socket *sock, struct user_msghdr __user *msg, +long __sys_sendmsg_sock(struct socket *sock, struct user_msghdr __user *umsg, unsigned int flags) { - struct msghdr msg_sys; + struct iovec iovstack[UIO_FASTIOV], *iov = iovstack; + struct sockaddr_storage address; + struct msghdr msg = { .msg_name = &address }; + ssize_t err; + + err = sendmsg_copy_msghdr(&msg, umsg, flags, &iov); + if (err) + return err; + /* disallow ancillary data requests from this path */ + if (msg.msg_control || msg.msg_controllen) { + err = -EINVAL; + goto out; + }
- return ___sys_sendmsg(sock, msg, &msg_sys, flags, NULL, 0); + err = ____sys_sendmsg(sock, &msg, flags, NULL, 0); +out: + kfree(iov); + return err; }
long __sys_sendmsg(int fd, struct user_msghdr __user *msg, unsigned int flags, @@ -2397,12 +2412,28 @@ static int ___sys_recvmsg(struct socket *sock, struct user_msghdr __user *msg, * BSD recvmsg interface */
-long __sys_recvmsg_sock(struct socket *sock, struct user_msghdr __user *msg, +long __sys_recvmsg_sock(struct socket *sock, struct user_msghdr __user *umsg, unsigned int flags) { - struct msghdr msg_sys; + struct iovec iovstack[UIO_FASTIOV], *iov = iovstack; + struct sockaddr_storage address; + struct msghdr msg = { .msg_name = &address }; + struct sockaddr __user *uaddr; + ssize_t err;
- return ___sys_recvmsg(sock, msg, &msg_sys, flags, 0); + err = recvmsg_copy_msghdr(&msg, umsg, flags, &uaddr, &iov); + if (err) + return err; + /* disallow ancillary data requests from this path */ + if (msg.msg_control || msg.msg_controllen) { + err = -EINVAL; + goto out; + } + + err = ____sys_recvmsg(sock, &msg, umsg, uaddr, flags, 0); +out: + kfree(iov); + return err; }
long __sys_recvmsg(int fd, struct user_msghdr __user *msg, unsigned int flags,
From: Jens Axboe axboe@kernel.dk
mainline inclusion from mainline-5.5-rc1 commit 03b1230ca12a12e045d83b0357792075bf94a1e0 category: feature bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27 CVE: NA ---------------------------
Just like commit f67676d160c6 for read/write requests, this one ensures that the msghdr data is fully copied if we need to punt a recvmsg or sendmsg system call to async context.
Signed-off-by: Jens Axboe axboe@kernel.dk Signed-off-by: yangerkun yangerkun@huawei.com Reviewed-by: zhangyi (F) yi.zhang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- fs/io_uring.c | 145 ++++++++++++++++++++++++++++++++++++----- include/linux/socket.h | 15 +++-- net/socket.c | 60 +++++------------ 3 files changed, 156 insertions(+), 64 deletions(-)
diff --git a/fs/io_uring.c b/fs/io_uring.c index 2060fb7b4450..4de95825e878 100644 --- a/fs/io_uring.c +++ b/fs/io_uring.c @@ -308,6 +308,13 @@ struct io_timeout { struct io_timeout_data *data; };
+struct io_async_msghdr { + struct iovec fast_iov[UIO_FASTIOV]; + struct iovec *iov; + struct sockaddr __user *uaddr; + struct msghdr msg; +}; + struct io_async_rw { struct iovec fast_iov[UIO_FASTIOV]; struct iovec *iov; @@ -319,6 +326,7 @@ struct io_async_ctx { struct io_uring_sqe sqe; union { struct io_async_rw rw; + struct io_async_msghdr msg; }; };
@@ -1990,12 +1998,25 @@ static int io_sync_file_range(struct io_kiocb *req, return 0; }
+static int io_sendmsg_prep(struct io_kiocb *req, struct io_async_ctx *io) +{ #if defined(CONFIG_NET) -static int io_send_recvmsg(struct io_kiocb *req, const struct io_uring_sqe *sqe, - struct io_kiocb **nxt, bool force_nonblock, - long (*fn)(struct socket *, struct user_msghdr __user *, - unsigned int)) + const struct io_uring_sqe *sqe = req->sqe; + struct user_msghdr __user *msg; + unsigned flags; + + flags = READ_ONCE(sqe->msg_flags); + msg = (struct user_msghdr __user *)(unsigned long) READ_ONCE(sqe->addr); + return sendmsg_copy_msghdr(&io->msg.msg, msg, flags, &io->msg.iov); +#else + return 0; +#endif +} + +static int io_sendmsg(struct io_kiocb *req, const struct io_uring_sqe *sqe, + struct io_kiocb **nxt, bool force_nonblock) { +#if defined(CONFIG_NET) struct socket *sock; int ret;
@@ -2004,7 +2025,9 @@ static int io_send_recvmsg(struct io_kiocb *req, const struct io_uring_sqe *sqe,
sock = sock_from_file(req->file, &ret); if (sock) { - struct user_msghdr __user *msg; + struct io_async_ctx io, *copy; + struct sockaddr_storage addr; + struct msghdr *kmsg; unsigned flags;
flags = READ_ONCE(sqe->msg_flags); @@ -2013,32 +2036,59 @@ static int io_send_recvmsg(struct io_kiocb *req, const struct io_uring_sqe *sqe, else if (force_nonblock) flags |= MSG_DONTWAIT;
- msg = (struct user_msghdr __user *) (unsigned long) - READ_ONCE(sqe->addr); + if (req->io) { + kmsg = &req->io->msg.msg; + kmsg->msg_name = &addr; + } else { + kmsg = &io.msg.msg; + kmsg->msg_name = &addr; + io.msg.iov = io.msg.fast_iov; + ret = io_sendmsg_prep(req, &io); + if (ret) + goto out; + }
- ret = fn(sock, msg, flags); - if (force_nonblock && ret == -EAGAIN) + ret = __sys_sendmsg_sock(sock, kmsg, flags); + if (force_nonblock && ret == -EAGAIN) { + copy = kmalloc(sizeof(*copy), GFP_KERNEL); + if (!copy) { + ret = -ENOMEM; + goto out; + } + memcpy(©->msg, &io.msg, sizeof(copy->msg)); + req->io = copy; + memcpy(&req->io->sqe, req->sqe, sizeof(*req->sqe)); + req->sqe = &req->io->sqe; return ret; + } if (ret == -ERESTARTSYS) ret = -EINTR; }
+out: io_cqring_add_event(req, ret); if (ret < 0 && (req->flags & REQ_F_LINK)) req->flags |= REQ_F_FAIL_LINK; io_put_req_find_next(req, nxt); return 0; -} +#else + return -EOPNOTSUPP; #endif +}
-static int io_sendmsg(struct io_kiocb *req, const struct io_uring_sqe *sqe, - struct io_kiocb **nxt, bool force_nonblock) +static int io_recvmsg_prep(struct io_kiocb *req, struct io_async_ctx *io) { #if defined(CONFIG_NET) - return io_send_recvmsg(req, sqe, nxt, force_nonblock, - __sys_sendmsg_sock); + const struct io_uring_sqe *sqe = req->sqe; + struct user_msghdr __user *msg; + unsigned flags; + + flags = READ_ONCE(sqe->msg_flags); + msg = (struct user_msghdr __user *)(unsigned long) READ_ONCE(sqe->addr); + return recvmsg_copy_msghdr(&io->msg.msg, msg, flags, &io->msg.uaddr, + &io->msg.iov); #else - return -EOPNOTSUPP; + return 0; #endif }
@@ -2046,8 +2096,63 @@ static int io_recvmsg(struct io_kiocb *req, const struct io_uring_sqe *sqe, struct io_kiocb **nxt, bool force_nonblock) { #if defined(CONFIG_NET) - return io_send_recvmsg(req, sqe, nxt, force_nonblock, - __sys_recvmsg_sock); + struct socket *sock; + int ret; + + if (unlikely(req->ctx->flags & IORING_SETUP_IOPOLL)) + return -EINVAL; + + sock = sock_from_file(req->file, &ret); + if (sock) { + struct user_msghdr __user *msg; + struct io_async_ctx io, *copy; + struct sockaddr_storage addr; + struct msghdr *kmsg; + unsigned flags; + + flags = READ_ONCE(sqe->msg_flags); + if (flags & MSG_DONTWAIT) + req->flags |= REQ_F_NOWAIT; + else if (force_nonblock) + flags |= MSG_DONTWAIT; + + msg = (struct user_msghdr __user *) (unsigned long) + READ_ONCE(sqe->addr); + if (req->io) { + kmsg = &req->io->msg.msg; + kmsg->msg_name = &addr; + } else { + kmsg = &io.msg.msg; + kmsg->msg_name = &addr; + io.msg.iov = io.msg.fast_iov; + ret = io_recvmsg_prep(req, &io); + if (ret) + goto out; + } + + ret = __sys_recvmsg_sock(sock, kmsg, msg, io.msg.uaddr, flags); + if (force_nonblock && ret == -EAGAIN) { + copy = kmalloc(sizeof(*copy), GFP_KERNEL); + if (!copy) { + ret = -ENOMEM; + goto out; + } + memcpy(copy, &io, sizeof(*copy)); + req->io = copy; + memcpy(&req->io->sqe, req->sqe, sizeof(*req->sqe)); + req->sqe = &req->io->sqe; + return ret; + } + if (ret == -ERESTARTSYS) + ret = -EINTR; + } + +out: + io_cqring_add_event(req, ret); + if (ret < 0 && (req->flags & REQ_F_LINK)) + req->flags |= REQ_F_FAIL_LINK; + io_put_req_find_next(req, nxt); + return 0; #else return -EOPNOTSUPP; #endif @@ -2720,6 +2825,12 @@ static int io_req_defer_prep(struct io_kiocb *req, struct io_async_ctx *io) case IORING_OP_WRITE_FIXED: ret = io_write_prep(req, &iovec, &iter, true); break; + case IORING_OP_SENDMSG: + ret = io_sendmsg_prep(req, io); + break; + case IORING_OP_RECVMSG: + ret = io_recvmsg_prep(req, io); + break; default: req->io = io; return 0; diff --git a/include/linux/socket.h b/include/linux/socket.h index 841f18488954..9ea24dbab8b7 100644 --- a/include/linux/socket.h +++ b/include/linux/socket.h @@ -364,12 +364,19 @@ extern int __sys_recvmmsg(int fd, struct mmsghdr __user *mmsg, unsigned int vlen extern int __sys_sendmmsg(int fd, struct mmsghdr __user *mmsg, unsigned int vlen, unsigned int flags, bool forbid_cmsg_compat); -extern long __sys_sendmsg_sock(struct socket *sock, - struct user_msghdr __user *msg, +extern long __sys_sendmsg_sock(struct socket *sock, struct msghdr *msg, unsigned int flags); -extern long __sys_recvmsg_sock(struct socket *sock, - struct user_msghdr __user *msg, +extern long __sys_recvmsg_sock(struct socket *sock, struct msghdr *msg, + struct user_msghdr __user *umsg, + struct sockaddr __user *uaddr, unsigned int flags); +extern int sendmsg_copy_msghdr(struct msghdr *msg, + struct user_msghdr __user *umsg, unsigned flags, + struct iovec **iov); +extern int recvmsg_copy_msghdr(struct msghdr *msg, + struct user_msghdr __user *umsg, unsigned flags, + struct sockaddr __user **uaddr, + struct iovec **iov);
/* helpers which do the actual work for syscalls */ extern int __sys_recvfrom(int fd, void __user *ubuf, size_t size, diff --git a/net/socket.c b/net/socket.c index b3ffa502d62a..cf06a55d2f18 100644 --- a/net/socket.c +++ b/net/socket.c @@ -2149,9 +2149,9 @@ static int ____sys_sendmsg(struct socket *sock, struct msghdr *msg_sys, return err; }
-static int sendmsg_copy_msghdr(struct msghdr *msg, - struct user_msghdr __user *umsg, unsigned flags, - struct iovec **iov) +int sendmsg_copy_msghdr(struct msghdr *msg, + struct user_msghdr __user *umsg, unsigned flags, + struct iovec **iov) { int err;
@@ -2193,27 +2193,14 @@ static int ___sys_sendmsg(struct socket *sock, struct user_msghdr __user *msg, /* * BSD sendmsg interface */ -long __sys_sendmsg_sock(struct socket *sock, struct user_msghdr __user *umsg, +long __sys_sendmsg_sock(struct socket *sock, struct msghdr *msg, unsigned int flags) { - struct iovec iovstack[UIO_FASTIOV], *iov = iovstack; - struct sockaddr_storage address; - struct msghdr msg = { .msg_name = &address }; - ssize_t err; - - err = sendmsg_copy_msghdr(&msg, umsg, flags, &iov); - if (err) - return err; /* disallow ancillary data requests from this path */ - if (msg.msg_control || msg.msg_controllen) { - err = -EINVAL; - goto out; - } + if (msg->msg_control || msg->msg_controllen) + return -EINVAL;
- err = ____sys_sendmsg(sock, &msg, flags, NULL, 0); -out: - kfree(iov); - return err; + return ____sys_sendmsg(sock, msg, flags, NULL, 0); }
long __sys_sendmsg(int fd, struct user_msghdr __user *msg, unsigned int flags, @@ -2319,10 +2306,10 @@ SYSCALL_DEFINE4(sendmmsg, int, fd, struct mmsghdr __user *, mmsg, return __sys_sendmmsg(fd, mmsg, vlen, flags, true); }
-static int recvmsg_copy_msghdr(struct msghdr *msg, - struct user_msghdr __user *umsg, unsigned flags, - struct sockaddr __user **uaddr, - struct iovec **iov) +int recvmsg_copy_msghdr(struct msghdr *msg, + struct user_msghdr __user *umsg, unsigned flags, + struct sockaddr __user **uaddr, + struct iovec **iov) { ssize_t err;
@@ -2412,28 +2399,15 @@ static int ___sys_recvmsg(struct socket *sock, struct user_msghdr __user *msg, * BSD recvmsg interface */
-long __sys_recvmsg_sock(struct socket *sock, struct user_msghdr __user *umsg, - unsigned int flags) +long __sys_recvmsg_sock(struct socket *sock, struct msghdr *msg, + struct user_msghdr __user *umsg, + struct sockaddr __user *uaddr, unsigned int flags) { - struct iovec iovstack[UIO_FASTIOV], *iov = iovstack; - struct sockaddr_storage address; - struct msghdr msg = { .msg_name = &address }; - struct sockaddr __user *uaddr; - ssize_t err; - - err = recvmsg_copy_msghdr(&msg, umsg, flags, &uaddr, &iov); - if (err) - return err; /* disallow ancillary data requests from this path */ - if (msg.msg_control || msg.msg_controllen) { - err = -EINVAL; - goto out; - } + if (msg->msg_control || msg->msg_controllen) + return -EINVAL;
- err = ____sys_recvmsg(sock, &msg, umsg, uaddr, flags, 0); -out: - kfree(iov); - return err; + return ____sys_recvmsg(sock, msg, umsg, uaddr, flags, 0); }
long __sys_recvmsg(int fd, struct user_msghdr __user *msg, unsigned int flags,
From: Jens Axboe axboe@kernel.dk
mainline inclusion from mainline-5.5-rc1 commit f499a021ea8c9f70321fce3d674d8eca5bbeee2c category: feature bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27 CVE: NA ---------------------------
Just like commit f67676d160c6 for read/write requests, this one ensures that the sockaddr data has been copied for IORING_OP_CONNECT if we need to punt the request to async context.
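A minimal sketch of what this makes safe, again assuming liburing; the loopback address and port are placeholders. The sockaddr is a stack local that may legally go out of scope as soon as the request has been submitted:

/* Sketch, assuming liburing and a connected-capable TCP socket in
 * sockfd. The address and port below are placeholders. */
#include <liburing.h>
#include <arpa/inet.h>
#include <netinet/in.h>

static int queue_connect(struct io_uring *ring, int sockfd)
{
	struct sockaddr_in addr = {
		.sin_family = AF_INET,
		.sin_port = htons(8080),	/* placeholder port */
	};
	struct io_uring_sqe *sqe = io_uring_get_sqe(ring);

	inet_pton(AF_INET, "127.0.0.1", &addr.sin_addr);
	if (!sqe)
		return -1;
	io_uring_prep_connect(sqe, sockfd, (struct sockaddr *)&addr,
			      sizeof(addr));
	/* addr goes out of scope on return; the kernel holds its own copy */
	return io_uring_submit(ring);
}

Before this change, the same pattern could leave the kernel reading a stale sockaddr if the connect was punted to the async workers.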
Signed-off-by: Jens Axboe axboe@kernel.dk Signed-off-by: yangerkun yangerkun@huawei.com Reviewed-by: zhangyi (F) yi.zhang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- fs/io_uring.c | 51 ++++++++++++++++++++++++++++++++++++++---- include/linux/socket.h | 5 ++--- net/socket.c | 16 ++++++------- 3 files changed, 57 insertions(+), 15 deletions(-)
diff --git a/fs/io_uring.c b/fs/io_uring.c index 4de95825e878..128eb4c89f70 100644 --- a/fs/io_uring.c +++ b/fs/io_uring.c @@ -308,6 +308,10 @@ struct io_timeout { struct io_timeout_data *data; };
+struct io_async_connect { + struct sockaddr_storage address; +}; + struct io_async_msghdr { struct iovec fast_iov[UIO_FASTIOV]; struct iovec *iov; @@ -327,6 +331,7 @@ struct io_async_ctx { union { struct io_async_rw rw; struct io_async_msghdr msg; + struct io_async_connect connect; }; };
@@ -2194,11 +2199,26 @@ static int io_accept(struct io_kiocb *req, const struct io_uring_sqe *sqe, #endif }
+static int io_connect_prep(struct io_kiocb *req, struct io_async_ctx *io) +{ +#if defined(CONFIG_NET) + const struct io_uring_sqe *sqe = req->sqe; + struct sockaddr __user *addr; + int addr_len; + + addr = (struct sockaddr __user *) (unsigned long) READ_ONCE(sqe->addr); + addr_len = READ_ONCE(sqe->addr2); + return move_addr_to_kernel(addr, addr_len, &io->connect.address); +#else + return 0; +#endif +} + static int io_connect(struct io_kiocb *req, const struct io_uring_sqe *sqe, struct io_kiocb **nxt, bool force_nonblock) { #if defined(CONFIG_NET) - struct sockaddr __user *addr; + struct io_async_ctx __io, *io; unsigned file_flags; int addr_len, ret;
@@ -2207,15 +2227,35 @@ static int io_connect(struct io_kiocb *req, const struct io_uring_sqe *sqe, if (sqe->ioprio || sqe->len || sqe->buf_index || sqe->rw_flags) return -EINVAL;
- addr = (struct sockaddr __user *) (unsigned long) READ_ONCE(sqe->addr); addr_len = READ_ONCE(sqe->addr2); file_flags = force_nonblock ? O_NONBLOCK : 0;
- ret = __sys_connect_file(req->file, addr, addr_len, file_flags); - if (ret == -EAGAIN && force_nonblock) + if (req->io) { + io = req->io; + } else { + ret = io_connect_prep(req, &__io); + if (ret) + goto out; + io = &__io; + } + + ret = __sys_connect_file(req->file, &io->connect.address, addr_len, + file_flags); + if (ret == -EAGAIN && force_nonblock) { + io = kmalloc(sizeof(*io), GFP_KERNEL); + if (!io) { + ret = -ENOMEM; + goto out; + } + memcpy(&io->connect, &__io.connect, sizeof(io->connect)); + req->io = io; + memcpy(&io->sqe, req->sqe, sizeof(*req->sqe)); + req->sqe = &io->sqe; return -EAGAIN; + } if (ret == -ERESTARTSYS) ret = -EINTR; +out: if (ret < 0 && (req->flags & REQ_F_LINK)) req->flags |= REQ_F_FAIL_LINK; io_cqring_add_event(req, ret); @@ -2831,6 +2871,9 @@ static int io_req_defer_prep(struct io_kiocb *req, struct io_async_ctx *io) case IORING_OP_RECVMSG: ret = io_recvmsg_prep(req, io); break; + case IORING_OP_CONNECT: + ret = io_connect_prep(req, io); + break; default: req->io = io; return 0; diff --git a/include/linux/socket.h b/include/linux/socket.h index 9ea24dbab8b7..06fa9883d702 100644 --- a/include/linux/socket.h +++ b/include/linux/socket.h @@ -392,9 +392,8 @@ extern int __sys_accept4(int fd, struct sockaddr __user *upeer_sockaddr, int __user *upeer_addrlen, int flags); extern int __sys_socket(int family, int type, int protocol); extern int __sys_bind(int fd, struct sockaddr __user *umyaddr, int addrlen); -extern int __sys_connect_file(struct file *file, - struct sockaddr __user *uservaddr, int addrlen, - int file_flags); +extern int __sys_connect_file(struct file *file, struct sockaddr_storage *addr, + int addrlen, int file_flags); extern int __sys_connect(int fd, struct sockaddr __user *uservaddr, int addrlen); extern int __sys_listen(int fd, int backlog); diff --git a/net/socket.c b/net/socket.c index cf06a55d2f18..fd58966a04d1 100644 --- a/net/socket.c +++ b/net/socket.c @@ -1659,26 +1659,22 @@ SYSCALL_DEFINE3(accept, int, fd, struct sockaddr __user *, upeer_sockaddr, * include the -EINPROGRESS status for such sockets. */
-int __sys_connect_file(struct file *file, struct sockaddr __user *uservaddr, +int __sys_connect_file(struct file *file, struct sockaddr_storage *address, int addrlen, int file_flags) { struct socket *sock; - struct sockaddr_storage address; int err;
sock = sock_from_file(file, &err); if (!sock) goto out; - err = move_addr_to_kernel(uservaddr, addrlen, &address); - if (err < 0) - goto out;
err = - security_socket_connect(sock, (struct sockaddr *)&address, addrlen); + security_socket_connect(sock, (struct sockaddr *)address, addrlen); if (err) goto out;
- err = sock->ops->connect(sock, (struct sockaddr *)&address, addrlen, + err = sock->ops->connect(sock, (struct sockaddr *)address, addrlen, sock->file->f_flags | file_flags); out: return err; @@ -1691,7 +1687,11 @@ int __sys_connect(int fd, struct sockaddr __user *uservaddr, int addrlen)
f = fdget(fd); if (f.file) { - ret = __sys_connect_file(f.file, uservaddr, addrlen, 0); + struct sockaddr_storage address; + + ret = move_addr_to_kernel(uservaddr, addrlen, &address); + if (!ret) + ret = __sys_connect_file(f.file, &address, addrlen, 0); if (f.flags) fput(f.file); }
From: Jens Axboe axboe@kernel.dk
mainline inclusion from mainline-5.5-rc1 commit da8c96906990f1108cb626ee7865e69267a3263b category: feature bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27 CVE: NA ---------------------------
If the IORING_FEAT_SUBMIT_STABLE flag is set, applications can be certain that any data needed for async offload has been consumed by the time the kernel has consumed the SQE.
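Applications can probe for the flag at ring setup time. A minimal sketch using the raw syscall (this assumes a libc that defines __NR_io_uring_setup and kernel headers new enough to carry the flag):

#include <linux/io_uring.h>
#include <sys/syscall.h>
#include <string.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
	struct io_uring_params p;
	int fd;

	memset(&p, 0, sizeof(p));
	/* the kernel fills in p.features on success */
	fd = syscall(__NR_io_uring_setup, 4, &p);
	if (fd < 0)
		return 1;
	if (p.features & IORING_FEAT_SUBMIT_STABLE)
		printf("SQE data is stable once the SQE is consumed\n");
	close(fd);
	return 0;
}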
Signed-off-by: Jens Axboe axboe@kernel.dk Signed-off-by: yangerkun yangerkun@huawei.com Reviewed-by: zhangyi (F) yi.zhang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- fs/io_uring.c | 3 ++- include/uapi/linux/io_uring.h | 1 + 2 files changed, 3 insertions(+), 1 deletion(-)
diff --git a/fs/io_uring.c b/fs/io_uring.c index 128eb4c89f70..6e488009f961 100644 --- a/fs/io_uring.c +++ b/fs/io_uring.c @@ -5077,7 +5077,8 @@ static int io_uring_create(unsigned entries, struct io_uring_params *p) if (ret < 0) goto err;
- p->features = IORING_FEAT_SINGLE_MMAP | IORING_FEAT_NODROP; + p->features = IORING_FEAT_SINGLE_MMAP | IORING_FEAT_NODROP | + IORING_FEAT_SUBMIT_STABLE; trace_io_uring_create(ret, ctx, p->sq_entries, p->cq_entries, p->flags); return ret; err: diff --git a/include/uapi/linux/io_uring.h b/include/uapi/linux/io_uring.h index 4637ed1d9949..eabccb46edd1 100644 --- a/include/uapi/linux/io_uring.h +++ b/include/uapi/linux/io_uring.h @@ -157,6 +157,7 @@ struct io_uring_params { */ #define IORING_FEAT_SINGLE_MMAP (1U << 0) #define IORING_FEAT_NODROP (1U << 1) +#define IORING_FEAT_SUBMIT_STABLE (1U << 2)
/* * io_uring_register(2) opcodes and arguments
From: Jackie Liu liuyun01@kylinos.cn
mainline inclusion from mainline-5.5-rc1 commit 22efde5998657f6d1f31592c659aa3a9c7ad65f1 category: feature bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27 CVE: NA ---------------------------
The ctx parameter has never been used; clean it up.
Signed-off-by: Jackie Liu liuyun01@kylinos.cn Signed-off-by: Jens Axboe axboe@kernel.dk Signed-off-by: yangerkun yangerkun@huawei.com Reviewed-by: zhangyi (F) yi.zhang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- fs/io_uring.c | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-)
diff --git a/fs/io_uring.c b/fs/io_uring.c index 6e488009f961..965399150227 100644 --- a/fs/io_uring.c +++ b/fs/io_uring.c @@ -3395,7 +3395,7 @@ static void io_submit_state_end(struct io_submit_state *state) * Start submission side cache. */ static void io_submit_state_start(struct io_submit_state *state, - struct io_ring_ctx *ctx, unsigned max_ios) + unsigned int max_ios) { blk_start_plug(&state->plug); state->free_reqs = 0; @@ -3479,7 +3479,7 @@ static int io_submit_sqes(struct io_ring_ctx *ctx, unsigned int nr, return -EBUSY;
if (nr > IO_PLUG_THRESHOLD) { - io_submit_state_start(&state, ctx, nr); + io_submit_state_start(&state, nr); statep = &state; }
From: Jackie Liu liuyun01@kylinos.cn
mainline inclusion from mainline-5.5-rc1 commit 8cdda87a4414092cd210e766189cf0353a844861 category: feature bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27 CVE: NA ---------------------------
Since commit b18fdf71e01f ("io_uring: simplify io_req_link_next()"), the io_wq_current_is_worker() function is no longer needed; clean it up.
Signed-off-by: Jackie Liu liuyun01@kylinos.cn Signed-off-by: Jens Axboe axboe@kernel.dk Signed-off-by: yangerkun yangerkun@huawei.com Reviewed-by: zhangyi (F) yi.zhang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- fs/io-wq.h | 8 ++------ 1 file changed, 2 insertions(+), 6 deletions(-)
diff --git a/fs/io-wq.h b/fs/io-wq.h index dd0af0d7376c..892db0bb64b1 100644 --- a/fs/io-wq.h +++ b/fs/io-wq.h @@ -118,10 +118,6 @@ static inline void io_wq_worker_sleeping(struct task_struct *tsk) static inline void io_wq_worker_running(struct task_struct *tsk) { } -#endif +#endif /* CONFIG_IO_WQ */
-static inline bool io_wq_current_is_worker(void) -{ - return in_task() && (current->flags & PF_IO_WORKER); -} -#endif +#endif /* INTERNAL_IO_WQ_H */
From: Jens Axboe axboe@kernel.dk
mainline inclusion from mainline-5.5-rc1 commit 87f80d623c6c93c721b2aaead8a45e848bc8ffbf category: feature bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27 CVE: NA ---------------------------
Right now we return -EINPROGRESS from a non-blocking connect straight to userspace, which means the application has to poll for the socket to become writable itself. Let's just treat it like -EAGAIN and have io_uring handle it internally; this makes it much easier to use.
Signed-off-by: Jens Axboe axboe@kernel.dk Signed-off-by: yangerkun yangerkun@huawei.com Reviewed-by: zhangyi (F) yi.zhang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- fs/io_uring.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/fs/io_uring.c b/fs/io_uring.c index 965399150227..cf0f09545395 100644 --- a/fs/io_uring.c +++ b/fs/io_uring.c @@ -2241,7 +2241,7 @@ static int io_connect(struct io_kiocb *req, const struct io_uring_sqe *sqe,
ret = __sys_connect_file(req->file, &io->connect.address, addr_len, file_flags); - if (ret == -EAGAIN && force_nonblock) { + if ((ret == -EAGAIN || ret == -EINPROGRESS) && force_nonblock) { io = kmalloc(sizeof(*io), GFP_KERNEL); if (!io) { ret = -ENOMEM;
From: Jens Axboe axboe@kernel.dk
mainline inclusion from mainline-5.5-rc1 commit 901e59bba9ddad4bc6994ecb8598ea60a993da4c category: feature bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27 CVE: NA ---------------------------
There's really no reason why we forbid things like link/drain etc. on regular timeout commands. Enable the usual SQE flags on timeouts.
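For illustration, a hedged liburing-based sketch of one combination this enables: a timeout SQE flagged with IOSQE_IO_DRAIN, which is not started until every previously submitted request has completed. Before this change, any SQE flag on a timeout was rejected with -EINVAL. The helper name and the one-second value are illustrative choices.

/* Sketch, assuming liburing and an initialized ring: queue a pure 1s
 * timeout (count == 0) that is drain-flagged, so it only starts once
 * all prior requests have completed. */
#include <liburing.h>

static int queue_drained_timeout(struct io_uring *ring)
{
	struct __kernel_timespec ts = { .tv_sec = 1, .tv_nsec = 0 };
	struct io_uring_sqe *sqe = io_uring_get_sqe(ring);

	if (!sqe)
		return -1;
	io_uring_prep_timeout(sqe, &ts, 0, 0);
	sqe->flags |= IOSQE_IO_DRAIN;	/* previously rejected with -EINVAL */
	return io_uring_submit(ring);
}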
Reported-by: 李通洲 carter.li@eoitek.com Signed-off-by: Jens Axboe axboe@kernel.dk Signed-off-by: yangerkun yangerkun@huawei.com Reviewed-by: zhangyi (F) yi.zhang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- fs/io_uring.c | 3 --- 1 file changed, 3 deletions(-)
diff --git a/fs/io_uring.c b/fs/io_uring.c index cf0f09545395..c5bcb751b688 100644 --- a/fs/io_uring.c +++ b/fs/io_uring.c @@ -2702,9 +2702,6 @@ static int io_timeout(struct io_kiocb *req, const struct io_uring_sqe *sqe) int ret;
ret = io_timeout_setup(req); - /* common setup allows flags (like links) set, we don't */ - if (!ret && sqe->flags) - ret = -EINVAL; if (ret) return ret;
From: Jens Axboe axboe@kernel.dk
mainline inclusion from mainline-5.5-rc1 commit 2d28390aff879238f00e209e38c2a0b78717360e category: feature bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27 CVE: NA ---------------------------
If we defer a timeout, we should ensure that we copy the timespec when we have consumed the sqe. This is similar to commit f67676d160c6 for read/write requests. We already did this correctly for timeouts deferred as links, but do it generally and use the infrastructure added by commit 1a6b74fc8702 instead of having the timeout deferral use its own.
Signed-off-by: Jens Axboe axboe@kernel.dk Signed-off-by: yangerkun yangerkun@huawei.com Reviewed-by: zhangyi (F) yi.zhang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- fs/io_uring.c | 83 ++++++++++++++++++++++++++------------------------- 1 file changed, 42 insertions(+), 41 deletions(-)
diff --git a/fs/io_uring.c b/fs/io_uring.c index c5bcb751b688..7d9001280fb5 100644 --- a/fs/io_uring.c +++ b/fs/io_uring.c @@ -303,11 +303,6 @@ struct io_timeout_data { u32 seq_offset; };
-struct io_timeout { - struct file *file; - struct io_timeout_data *data; -}; - struct io_async_connect { struct sockaddr_storage address; }; @@ -332,6 +327,7 @@ struct io_async_ctx { struct io_async_rw rw; struct io_async_msghdr msg; struct io_async_connect connect; + struct io_timeout_data timeout; }; };
@@ -346,7 +342,6 @@ struct io_kiocb { struct file *file; struct kiocb rw; struct io_poll_iocb poll; - struct io_timeout timeout; };
const struct io_uring_sqe *sqe; @@ -618,7 +613,7 @@ static void io_kill_timeout(struct io_kiocb *req) { int ret;
- ret = hrtimer_try_to_cancel(&req->timeout.data->timer); + ret = hrtimer_try_to_cancel(&req->io->timeout.timer); if (ret != -1) { atomic_inc(&req->ctx->cq_timeouts); list_del_init(&req->list); @@ -876,8 +871,6 @@ static void __io_free_req(struct io_kiocb *req) wake_up(&ctx->inflight_wait); spin_unlock_irqrestore(&ctx->inflight_lock, flags); } - if (req->flags & REQ_F_TIMEOUT) - kfree(req->timeout.data); percpu_ref_put(&ctx->refs); if (likely(!io_is_fallback_req(req))) kmem_cache_free(req_cachep, req); @@ -890,7 +883,7 @@ static bool io_link_cancel_timeout(struct io_kiocb *req) struct io_ring_ctx *ctx = req->ctx; int ret;
- ret = hrtimer_try_to_cancel(&req->timeout.data->timer); + ret = hrtimer_try_to_cancel(&req->io->timeout.timer); if (ret != -1) { io_cqring_fill_event(req, -ECANCELED); io_commit_cqring(ctx); @@ -2617,7 +2610,7 @@ static int io_timeout_cancel(struct io_ring_ctx *ctx, __u64 user_data) if (ret == -ENOENT) return ret;
- ret = hrtimer_try_to_cancel(&req->timeout.data->timer); + ret = hrtimer_try_to_cancel(&req->io->timeout.timer); if (ret == -1) return -EALREADY;
@@ -2659,7 +2652,8 @@ static int io_timeout_remove(struct io_kiocb *req, return 0; }
-static int io_timeout_setup(struct io_kiocb *req) +static int io_timeout_prep(struct io_kiocb *req, struct io_async_ctx *io, + bool is_timeout_link) { const struct io_uring_sqe *sqe = req->sqe; struct io_timeout_data *data; @@ -2669,15 +2663,14 @@ static int io_timeout_setup(struct io_kiocb *req) return -EINVAL; if (sqe->ioprio || sqe->buf_index || sqe->len != 1) return -EINVAL; + if (sqe->off && is_timeout_link) + return -EINVAL; flags = READ_ONCE(sqe->timeout_flags); if (flags & ~IORING_TIMEOUT_ABS) return -EINVAL;
- data = kzalloc(sizeof(struct io_timeout_data), GFP_KERNEL); - if (!data) - return -ENOMEM; + data = &io->timeout; data->req = req; - req->timeout.data = data; req->flags |= REQ_F_TIMEOUT;
if (get_timespec64(&data->ts, u64_to_user_ptr(sqe->addr))) @@ -2689,6 +2682,7 @@ static int io_timeout_setup(struct io_kiocb *req) data->mode = HRTIMER_MODE_REL;
hrtimer_init(&data->timer, CLOCK_MONOTONIC, data->mode); + req->io = io; return 0; }
@@ -2697,13 +2691,24 @@ static int io_timeout(struct io_kiocb *req, const struct io_uring_sqe *sqe) unsigned count; struct io_ring_ctx *ctx = req->ctx; struct io_timeout_data *data; + struct io_async_ctx *io; struct list_head *entry; unsigned span = 0; - int ret;
- ret = io_timeout_setup(req); - if (ret) - return ret; + io = req->io; + if (!io) { + int ret; + + io = kmalloc(sizeof(*io), GFP_KERNEL); + if (!io) + return -ENOMEM; + ret = io_timeout_prep(req, io, false); + if (ret) { + kfree(io); + return ret; + } + } + data = &req->io->timeout;
/* * sqe->off holds how many events that need to occur for this @@ -2719,7 +2724,7 @@ static int io_timeout(struct io_kiocb *req, const struct io_uring_sqe *sqe) }
req->sequence = ctx->cached_sq_head + count - 1; - req->timeout.data->seq_offset = count; + data->seq_offset = count;
/* * Insertion sort, ensuring the first entry in the list is always @@ -2730,7 +2735,7 @@ static int io_timeout(struct io_kiocb *req, const struct io_uring_sqe *sqe) struct io_kiocb *nxt = list_entry(entry, struct io_kiocb, list); unsigned nxt_sq_head; long long tmp, tmp_nxt; - u32 nxt_offset = nxt->timeout.data->seq_offset; + u32 nxt_offset = nxt->io->timeout.seq_offset;
if (nxt->flags & REQ_F_TIMEOUT_NOSEQ) continue; @@ -2763,7 +2768,6 @@ static int io_timeout(struct io_kiocb *req, const struct io_uring_sqe *sqe) req->sequence -= span; add: list_add(&req->list, entry); - data = req->timeout.data; data->timer.function = io_timeout_fn; hrtimer_start(&data->timer, timespec64_to_ktime(data->ts), data->mode); spin_unlock_irq(&ctx->completion_lock); @@ -2871,6 +2875,10 @@ static int io_req_defer_prep(struct io_kiocb *req, struct io_async_ctx *io) case IORING_OP_CONNECT: ret = io_connect_prep(req, io); break; + case IORING_OP_TIMEOUT: + return io_timeout_prep(req, io, false); + case IORING_OP_LINK_TIMEOUT: + return io_timeout_prep(req, io, true); default: req->io = io; return 0; @@ -2898,17 +2906,18 @@ static int io_req_defer(struct io_kiocb *req) if (!io) return -EAGAIN;
+ ret = io_req_defer_prep(req, io); + if (ret < 0) { + kfree(io); + return ret; + } + spin_lock_irq(&ctx->completion_lock); if (!req_need_defer(req) && list_empty(&ctx->defer_list)) { spin_unlock_irq(&ctx->completion_lock); - kfree(io); return 0; }
- ret = io_req_defer_prep(req, io); - if (ret < 0) - return ret; - trace_io_uring_defer(ctx, req, req->user_data); list_add_tail(&req->list, &ctx->defer_list); spin_unlock_irq(&ctx->completion_lock); @@ -3197,7 +3206,7 @@ static void io_queue_linked_timeout(struct io_kiocb *req) */ spin_lock_irq(&ctx->completion_lock); if (!list_empty(&req->list)) { - struct io_timeout_data *data = req->timeout.data; + struct io_timeout_data *data = &req->io->timeout;
data->timer.function = io_link_timeout_fn; hrtimer_start(&data->timer, timespec64_to_ktime(data->ts), @@ -3344,17 +3353,6 @@ static void io_submit_sqe(struct io_kiocb *req, struct io_submit_state *state, if (req->sqe->flags & IOSQE_IO_DRAIN) (*link)->flags |= REQ_F_DRAIN_LINK | REQ_F_IO_DRAIN;
- if (READ_ONCE(req->sqe->opcode) == IORING_OP_LINK_TIMEOUT) { - ret = io_timeout_setup(req); - /* common setup allows offset being set, we don't */ - if (!ret && req->sqe->off) - ret = -EINVAL; - if (ret) { - prev->flags |= REQ_F_FAIL_LINK; - goto err_req; - } - } - io = kmalloc(sizeof(*io), GFP_KERNEL); if (!io) { ret = -EAGAIN; @@ -3362,8 +3360,11 @@ static void io_submit_sqe(struct io_kiocb *req, struct io_submit_state *state, }
ret = io_req_defer_prep(req, io); - if (ret) + if (ret) { + kfree(io); + prev->flags |= REQ_F_FAIL_LINK; goto err_req; + } trace_io_uring_link(ctx, req, prev); list_add_tail(&req->list, &prev->link_list); } else if (req->sqe->flags & IOSQE_IO_LINK) {
From: Jens Axboe axboe@kernel.dk
mainline inclusion from mainline-5.5-rc1 commit 08bdcc35f00c91b325195723cceaba0937a89ddf category: feature bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27 CVE: NA ---------------------------
If someone removes a node from a list, and then later adds it back to a list, we can have invalid data in ->next. This can cause all sorts of issues. One such use case is the IORING_OP_POLL_ADD command, which will do just that if we race and get woken twice without any pending events. This is a pretty rare case, but can happen under extreme loads. Dan reports that he saw the following crash:
BUG: kernel NULL pointer dereference, address: 0000000000000000
PGD d283ce067 P4D d283ce067 PUD e5ca04067 PMD 0
Oops: 0002 [#1] SMP
CPU: 17 PID: 10726 Comm: tao:fast-fiber Kdump: loaded Not tainted 5.2.9-02851-gac7bc042d2d1 #116
Hardware name: Quanta Twin Lakes MP/Twin Lakes Passive MP, BIOS F09_3A17 05/03/2019
RIP: 0010:io_wqe_enqueue+0x3e/0xd0
Code: 34 24 74 55 8b 47 58 48 8d 6f 50 85 c0 74 50 48 89 df e8 35 7c 75 00 48 83 7b 08 00 48 8b 14 24 0f 84 84 00 00 00 48 8b 4b 10 <48> 89 11 48 89 53 10 83 63 20 fe 48 89 c6 48 89 df e8 0c 7a 75 00
RSP: 0000:ffffc90006858a08 EFLAGS: 00010082
RAX: 0000000000000002 RBX: ffff889037492fc0 RCX: 0000000000000000
RDX: ffff888e40cc11a8 RSI: ffff888e40cc11a8 RDI: ffff889037492fc0
RBP: ffff889037493010 R08: 00000000000000c3 R09: ffffc90006858ab8
R10: 0000000000000000 R11: 0000000000000000 R12: ffff888e40cc11a8
R13: 0000000000000000 R14: 00000000000000c3 R15: ffff888e40cc1100
FS:  00007fcddc9db700(0000) GS:ffff88903fa40000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000000000000000 CR3: 0000000e479f5003 CR4: 00000000007606e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
PKRU: 55555554
Call Trace:
 <IRQ>
 io_poll_wake+0x12f/0x2a0
 __wake_up_common+0x86/0x120
 __wake_up_common_lock+0x7a/0xc0
 sock_def_readable+0x3c/0x70
 tcp_rcv_established+0x557/0x630
 tcp_v6_do_rcv+0x118/0x3c0
 tcp_v6_rcv+0x97e/0x9d0
 ip6_protocol_deliver_rcu+0xe3/0x440
 ip6_input+0x3d/0xc0
 ? ip6_protocol_deliver_rcu+0x440/0x440
 ipv6_rcv+0x56/0xd0
 ? ip6_rcv_finish_core.isra.18+0x80/0x80
 __netif_receive_skb_one_core+0x50/0x70
 netif_receive_skb_internal+0x2f/0xa0
 napi_gro_receive+0x125/0x150
 mlx5e_handle_rx_cqe+0x1d9/0x5a0
 ? mlx5e_poll_tx_cq+0x305/0x560
 mlx5e_poll_rx_cq+0x49f/0x9c5
 mlx5e_napi_poll+0xee/0x640
 ? smp_reschedule_interrupt+0x16/0xd0
 ? reschedule_interrupt+0xf/0x20
 net_rx_action+0x286/0x3d0
 __do_softirq+0xca/0x297
 irq_exit+0x96/0xa0
 do_IRQ+0x54/0xe0
 common_interrupt+0xf/0xf
 </IRQ>
RIP: 0033:0x7fdc627a2e3a
Code: 31 c0 85 d2 0f 88 f6 00 00 00 55 48 89 e5 41 57 41 56 4c 63 f2 41 55 41 54 53 48 83 ec 18 48 85 ff 0f 84 c7 00 00 00 48 8b 07 <41> 89 d4 49 89 f5 48 89 fb 48 85 c0 0f 84 64 01 00 00 48 83 78 10
when running a networked workload with about 5000 sockets being polled for. Fix this by clearing node->next when the node is being removed from the list.
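The failure mode is easy to demonstrate outside the kernel. The sketch below is a deliberately simplified re-implementation of a singly linked work list, not the kernel code itself; it shows why a deleted node must have ->next cleared before it can safely be enqueued again.

#include <assert.h>
#include <stddef.h>

struct wq_node { struct wq_node *next; };
struct wq_list { struct wq_node *first, *last; };

static void wq_enqueue(struct wq_list *list, struct wq_node *node)
{
	/* like the kernel helper, this trusts node->next to be clean */
	if (list->last)
		list->last->next = node;
	else
		list->first = node;
	list->last = node;
}

static void wq_del_first(struct wq_list *list)
{
	struct wq_node *node = list->first;

	list->first = node->next;
	if (list->last == node)
		list->last = NULL;
	node->next = NULL;	/* the fix: without this, re-adding the node
				 * splices its stale successor back in */
}

int main(void)
{
	struct wq_list list = { 0 };
	struct wq_node a = { 0 }, b = { 0 };

	wq_enqueue(&list, &a);
	wq_enqueue(&list, &b);	/* a.next == &b */
	wq_del_first(&list);	/* remove a */
	wq_del_first(&list);	/* remove b; list is now empty */
	wq_enqueue(&list, &a);	/* re-add a: a stale a.next would make a
				 * traversal walk into the off-list b */
	assert(list.first == &a && a.next == NULL);
	return 0;
}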
Fixes: 6206f0e180d4 ("io-wq: shrink io_wq_work a bit") Reported-by: Dan Melnic dmm@fb.com Signed-off-by: Jens Axboe axboe@kernel.dk Signed-off-by: yangerkun yangerkun@huawei.com Reviewed-by: zhangyi (F) yi.zhang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- fs/io-wq.h | 1 + 1 file changed, 1 insertion(+)
diff --git a/fs/io-wq.h b/fs/io-wq.h index 892db0bb64b1..7c333a28e2a7 100644 --- a/fs/io-wq.h +++ b/fs/io-wq.h @@ -52,6 +52,7 @@ static inline void wq_node_del(struct io_wq_work_list *list, list->last = prev; if (prev) prev->next = node->next; + node->next = NULL; }
#define wq_list_for_each(pos, prv, head) \
From: Jens Axboe axboe@kernel.dk
mainline inclusion from mainline-5.5-rc1 commit 78076bb64aa8ba5b7207c38b2660a9e10ffa8cc7 category: feature bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27 CVE: NA ---------------------------
We recently changed this from a single list to an rbtree, but for some real-life workloads the rbtree slows down the submission/insertion case enough that it becomes the top cycle consumer on the io_uring side. In testing, a hash table is a more well-rounded compromise: it is fast for insertion, and as long as it's sized appropriately, it works well for the cancellation case too. Running TAO with a lot of network sockets, this change stops io_poll_req_insert() from consuming 2% of the CPU cycles.
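For a feel of the sizing, here is a standalone sketch of the arithmetic the patch uses (mirroring hash_bits = ilog2(cq_entries) - 5, with a GCC/Clang builtin standing in for the kernel's ilog2 helper):

#include <stdio.h>

/* floor(log2(n)); for a power-of-two n this is the set bit position */
static int ilog2_u32(unsigned int n)
{
	return 31 - __builtin_clz(n);
}

int main(void)
{
	unsigned int cq_entries = 4096;			/* example ring size */
	int hash_bits = ilog2_u32(cq_entries) - 5;	/* 12 - 5 = 7 */

	if (hash_bits <= 0)				/* kernel's clamp */
		hash_bits = 1;
	printf("%u CQ entries -> %u buckets, ~%u reqs/bucket when full\n",
	       cq_entries, 1U << hash_bits, cq_entries >> hash_bits);
	/* prints: 4096 CQ entries -> 128 buckets, ~32 reqs/bucket when full */
	return 0;
}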
Reported-by: Dan Melnic dmm@fb.com Signed-off-by: Jens Axboe axboe@kernel.dk
Conflicts: fs/io_uring.c [214828962dea io_uring: initialize percpu refcounters using PERCU_REF_ALLOW_REINIT not applied]
Signed-off-by: yangerkun yangerkun@huawei.com Reviewed-by: zhangyi (F) yi.zhang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- fs/io_uring.c | 84 +++++++++++++++++++++++++-------------------------- 1 file changed, 41 insertions(+), 43 deletions(-)
diff --git a/fs/io_uring.c b/fs/io_uring.c index 7d9001280fb5..d2f9fc82810b 100644 --- a/fs/io_uring.c +++ b/fs/io_uring.c @@ -275,7 +275,8 @@ struct io_ring_ctx { * manipulate the list, hence no extra locking is needed there. */ struct list_head poll_list; - struct rb_root cancel_tree; + struct hlist_head *cancel_hash; + unsigned cancel_hash_bits;
spinlock_t inflight_lock; struct list_head inflight_list; @@ -355,7 +356,7 @@ struct io_kiocb { struct io_ring_ctx *ctx; union { struct list_head list; - struct rb_node rb_node; + struct hlist_node hash_node; }; struct list_head link_list; unsigned int flags; @@ -444,6 +445,7 @@ static void io_ring_ctx_ref_free(struct percpu_ref *ref) static struct io_ring_ctx *io_ring_ctx_alloc(struct io_uring_params *p) { struct io_ring_ctx *ctx; + int hash_bits;
ctx = kzalloc(sizeof(*ctx), GFP_KERNEL); if (!ctx) @@ -457,6 +459,21 @@ static struct io_ring_ctx *io_ring_ctx_alloc(struct io_uring_params *p) if (!ctx->completions) goto err;
+ /* + * Use 5 bits less than the max cq entries, that should give us around + * 32 entries per hash list if totally full and uniformly spread. + */ + hash_bits = ilog2(p->cq_entries); + hash_bits -= 5; + if (hash_bits <= 0) + hash_bits = 1; + ctx->cancel_hash_bits = hash_bits; + ctx->cancel_hash = kmalloc((1U << hash_bits) * sizeof(struct hlist_head), + GFP_KERNEL); + if (!ctx->cancel_hash) + goto err; + __hash_init(ctx->cancel_hash, 1U << hash_bits); + if (percpu_ref_init(&ctx->refs, io_ring_ctx_ref_free, 0, GFP_KERNEL)) goto err;
@@ -469,7 +486,6 @@ static struct io_ring_ctx *io_ring_ctx_alloc(struct io_uring_params *p) init_waitqueue_head(&ctx->wait); spin_lock_init(&ctx->completion_lock); INIT_LIST_HEAD(&ctx->poll_list); - ctx->cancel_tree = RB_ROOT; INIT_LIST_HEAD(&ctx->defer_list); INIT_LIST_HEAD(&ctx->timeout_list); init_waitqueue_head(&ctx->inflight_wait); @@ -480,6 +496,7 @@ static struct io_ring_ctx *io_ring_ctx_alloc(struct io_uring_params *p) if (ctx->fallback_req) kmem_cache_free(req_cachep, ctx->fallback_req); kfree(ctx->completions); + kfree(ctx->cancel_hash); kfree(ctx); return NULL; } @@ -2259,14 +2276,6 @@ static int io_connect(struct io_kiocb *req, const struct io_uring_sqe *sqe, #endif }
-static inline void io_poll_remove_req(struct io_kiocb *req) -{ - if (!RB_EMPTY_NODE(&req->rb_node)) { - rb_erase(&req->rb_node, &req->ctx->cancel_tree); - RB_CLEAR_NODE(&req->rb_node); - } -} - static void io_poll_remove_one(struct io_kiocb *req) { struct io_poll_iocb *poll = &req->poll; @@ -2278,36 +2287,34 @@ static void io_poll_remove_one(struct io_kiocb *req) io_queue_async_work(req); } spin_unlock(&poll->head->lock); - io_poll_remove_req(req); + hash_del(&req->hash_node); }
static void io_poll_remove_all(struct io_ring_ctx *ctx) { - struct rb_node *node; + struct hlist_node *tmp; struct io_kiocb *req; + int i;
spin_lock_irq(&ctx->completion_lock); - while ((node = rb_first(&ctx->cancel_tree)) != NULL) { - req = rb_entry(node, struct io_kiocb, rb_node); - io_poll_remove_one(req); + for (i = 0; i < (1U << ctx->cancel_hash_bits); i++) { + struct hlist_head *list; + + list = &ctx->cancel_hash[i]; + hlist_for_each_entry_safe(req, tmp, list, hash_node) + io_poll_remove_one(req); } spin_unlock_irq(&ctx->completion_lock); }
static int io_poll_cancel(struct io_ring_ctx *ctx, __u64 sqe_addr) { - struct rb_node *p, *parent = NULL; + struct hlist_head *list; struct io_kiocb *req;
- p = ctx->cancel_tree.rb_node; - while (p) { - parent = p; - req = rb_entry(parent, struct io_kiocb, rb_node); - if (sqe_addr < req->user_data) { - p = p->rb_left; - } else if (sqe_addr > req->user_data) { - p = p->rb_right; - } else { + list = &ctx->cancel_hash[hash_long(sqe_addr, ctx->cancel_hash_bits)]; + hlist_for_each_entry(req, list, hash_node) { + if (sqe_addr == req->user_data) { io_poll_remove_one(req); return 0; } @@ -2389,7 +2396,7 @@ static void io_poll_complete_work(struct io_wq_work **workptr) spin_unlock_irq(&ctx->completion_lock); return; } - io_poll_remove_req(req); + hash_del(&req->hash_node); io_poll_complete(req, mask, ret); spin_unlock_irq(&ctx->completion_lock);
@@ -2424,7 +2431,7 @@ static int io_poll_wake(struct wait_queue_entry *wait, unsigned mode, int sync, * for finalizing the request, mark us as having grabbed that already. */ if (mask && spin_trylock_irqsave(&ctx->completion_lock, flags)) { - io_poll_remove_req(req); + hash_del(&req->hash_node); io_poll_complete(req, mask, 0); req->flags |= REQ_F_COMP_LOCKED; io_put_req(req); @@ -2462,20 +2469,10 @@ static void io_poll_queue_proc(struct file *file, struct wait_queue_head *head, static void io_poll_req_insert(struct io_kiocb *req) { struct io_ring_ctx *ctx = req->ctx; - struct rb_node **p = &ctx->cancel_tree.rb_node; - struct rb_node *parent = NULL; - struct io_kiocb *tmp; - - while (*p) { - parent = *p; - tmp = rb_entry(parent, struct io_kiocb, rb_node); - if (req->user_data < tmp->user_data) - p = &(*p)->rb_left; - else - p = &(*p)->rb_right; - } - rb_link_node(&req->rb_node, parent, p); - rb_insert_color(&req->rb_node, &ctx->cancel_tree); + struct hlist_head *list; + + list = &ctx->cancel_hash[hash_long(req->user_data, ctx->cancel_hash_bits)]; + hlist_add_head(&req->hash_node, list); }
static int io_poll_add(struct io_kiocb *req, const struct io_uring_sqe *sqe, @@ -2503,7 +2500,7 @@ static int io_poll_add(struct io_kiocb *req, const struct io_uring_sqe *sqe, INIT_IO_WORK(&req->work, io_poll_complete_work); events = READ_ONCE(sqe->poll_events); poll->events = demangle_poll(events) | EPOLLERR | EPOLLHUP; - RB_CLEAR_NODE(&req->rb_node); + INIT_HLIST_NODE(&req->hash_node);
poll->head = NULL; poll->done = false; @@ -4644,6 +4641,7 @@ static void io_ring_ctx_free(struct io_ring_ctx *ctx) free_uid(ctx->user); put_cred(ctx->creds); kfree(ctx->completions); + kfree(ctx->cancel_hash); kmem_cache_free(req_cachep, ctx->fallback_req); kfree(ctx); }
From: Pavel Begunkov asml.silence@gmail.com
mainline inclusion from mainline-5.5-rc1 commit 2e6e1fde32d7d41cf076c21060c329d3fdbce25c category: feature bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27 CVE: NA ---------------------------
In case of an error, io_submit_sqe() drops the request and continues without it, even if the request was part of a link. Not only does it fail to cancel the link, it may also execute the wrong sequence of actions.
Stop consuming sqes, and let the user handle errors.
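A minimal model of the reworked loop (a userspace sketch, not the kernel code): the sqe is counted as consumed either way, but a setup failure now stops the batch instead of continuing past a broken link.

    #include <stdbool.h>
    #include <stdio.h>

    /* stand-in for io_submit_sqe(): returns false when request setup
     * fails and the rest of the batch should not be consumed */
    static bool submit_sqe(int sqe)
    {
        if (sqe < 0) {
            printf("sqe %d: setup failed, dropping\n", sqe);
            return false;
        }
        printf("sqe %d: queued\n", sqe);
        return true;
    }

    int main(void)
    {
        int sqes[] = { 1, 2, -3, 4 };   /* third entry is broken */
        int submitted = 0;

        for (int i = 0; i < 4; i++) {
            submitted++;                /* counted even if it fails... */
            if (!submit_sqe(sqes[i]))
                break;                  /* ...but nothing after it runs */
        }
        printf("consumed %d of 4 sqes\n", submitted);   /* 3 */
        return 0;
    }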
Signed-off-by: Pavel Begunkov asml.silence@gmail.com Signed-off-by: Jens Axboe axboe@kernel.dk Signed-off-by: yangerkun yangerkun@huawei.com Reviewed-by: zhangyi (F) yi.zhang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- fs/io_uring.c | 12 +++++++----- 1 file changed, 7 insertions(+), 5 deletions(-)
diff --git a/fs/io_uring.c b/fs/io_uring.c index d2f9fc82810b..f58ab64d2617 100644 --- a/fs/io_uring.c +++ b/fs/io_uring.c @@ -3314,7 +3314,7 @@ static inline void io_queue_link_head(struct io_kiocb *req)
#define SQE_VALID_FLAGS (IOSQE_FIXED_FILE|IOSQE_IO_DRAIN|IOSQE_IO_LINK)
-static void io_submit_sqe(struct io_kiocb *req, struct io_submit_state *state, +static bool io_submit_sqe(struct io_kiocb *req, struct io_submit_state *state, struct io_kiocb **link) { struct io_ring_ctx *ctx = req->ctx; @@ -3333,7 +3333,7 @@ static void io_submit_sqe(struct io_kiocb *req, struct io_submit_state *state, err_req: io_cqring_add_event(req, ret); io_double_put_req(req); - return; + return false; }
/* @@ -3372,6 +3372,8 @@ static void io_submit_sqe(struct io_kiocb *req, struct io_submit_state *state, } else { io_queue_sqe(req); } + + return true; }
/* @@ -3501,6 +3503,7 @@ static int io_submit_sqes(struct io_ring_ctx *ctx, unsigned int nr, } }
+ submitted++; sqe_flags = req->sqe->flags;
req->ring_file = ring_file; @@ -3510,9 +3513,8 @@ static int io_submit_sqes(struct io_ring_ctx *ctx, unsigned int nr, req->needs_fixed_file = async; trace_io_uring_submit_sqe(ctx, req->sqe->user_data, true, async); - io_submit_sqe(req, statep, &link); - submitted++; - + if (!io_submit_sqe(req, statep, &link)) + break; /* * If previous wasn't linked and we have a linked command, * that's the end of the chain. Submit the previous link.
From: Jens Axboe axboe@kernel.dk
mainline inclusion from mainline-5.5-rc3 commit 0969e783e3a8913f79df27286501a6c21e961524 category: feature bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27 CVE: NA ---------------------------
If we defer these commands as part of a link, we have to make sure that the SQE data has been read upfront. Integrate the poll add/remove into the prep handling to make it safe for SQE reuse.
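The essence of the pattern, as a userspace sketch with simplified field names: prep copies everything it needs out of the shared SQE slot exactly once and marks the request, so later execution, possibly after the ring slot has been reused for another submission, only ever touches the request's own copy.

    #include <stdio.h>

    #define REQ_F_PREPPED   1u

    struct sqe { unsigned long long addr; };    /* ring memory, reusable */
    struct req { unsigned flags; unsigned long long addr; };

    /* read the sqe exactly once, even if prep is invoked again later */
    static int poll_remove_prep(struct req *req, const struct sqe *sqe)
    {
        if (req->flags & REQ_F_PREPPED)
            return 0;
        req->addr = sqe->addr;
        req->flags |= REQ_F_PREPPED;
        return 0;
    }

    /* execution never looks at the sqe again */
    static void poll_remove(const struct req *req)
    {
        printf("cancel poll with user_data %llu\n", req->addr);
    }

    int main(void)
    {
        struct sqe slot = { .addr = 42 };
        struct req req = { 0 };

        poll_remove_prep(&req, &slot);  /* e.g. at defer time */
        slot.addr = 7;                  /* ring slot reused meanwhile */
        poll_remove(&req);              /* still sees 42 */
        return 0;
    }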
Signed-off-by: Jens Axboe axboe@kernel.dk Signed-off-by: yangerkun yangerkun@huawei.com Reviewed-by: zhangyi (F) yi.zhang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- fs/io_uring.c | 68 ++++++++++++++++++++++++++++++++++++++++----------- 1 file changed, 54 insertions(+), 14 deletions(-)
diff --git a/fs/io_uring.c b/fs/io_uring.c index 1d6c4ee18daf..3ea74527361f 100644 --- a/fs/io_uring.c +++ b/fs/io_uring.c @@ -289,7 +289,10 @@ struct io_ring_ctx { */ struct io_poll_iocb { struct file *file; - struct wait_queue_head *head; + union { + struct wait_queue_head *head; + u64 addr; + }; __poll_t events; bool done; bool canceled; @@ -2489,24 +2492,40 @@ static int io_poll_cancel(struct io_ring_ctx *ctx, __u64 sqe_addr) return -ENOENT; }
+static int io_poll_remove_prep(struct io_kiocb *req) +{ + const struct io_uring_sqe *sqe = req->sqe; + + if (req->flags & REQ_F_PREPPED) + return 0; + if (unlikely(req->ctx->flags & IORING_SETUP_IOPOLL)) + return -EINVAL; + if (sqe->ioprio || sqe->off || sqe->len || sqe->buf_index || + sqe->poll_events) + return -EINVAL; + + req->poll.addr = READ_ONCE(sqe->addr); + req->flags |= REQ_F_PREPPED; + return 0; +} + /* * Find a running poll command that matches one specified in sqe->addr, * and remove it if found. */ static int io_poll_remove(struct io_kiocb *req) { - const struct io_uring_sqe *sqe = req->sqe; struct io_ring_ctx *ctx = req->ctx; + u64 addr; int ret;
- if (unlikely(req->ctx->flags & IORING_SETUP_IOPOLL)) - return -EINVAL; - if (sqe->ioprio || sqe->off || sqe->len || sqe->buf_index || - sqe->poll_events) - return -EINVAL; + ret = io_poll_remove_prep(req); + if (ret) + return ret;
+ addr = req->poll.addr; spin_lock_irq(&ctx->completion_lock); - ret = io_poll_cancel(ctx, READ_ONCE(sqe->addr)); + ret = io_poll_cancel(ctx, addr); spin_unlock_irq(&ctx->completion_lock);
io_cqring_add_event(req, ret); @@ -2641,16 +2660,14 @@ static void io_poll_req_insert(struct io_kiocb *req) hlist_add_head(&req->hash_node, list); }
-static int io_poll_add(struct io_kiocb *req, struct io_kiocb **nxt) +static int io_poll_add_prep(struct io_kiocb *req) { const struct io_uring_sqe *sqe = req->sqe; struct io_poll_iocb *poll = &req->poll; - struct io_ring_ctx *ctx = req->ctx; - struct io_poll_table ipt; - bool cancel = false; - __poll_t mask; u16 events;
+ if (req->flags & REQ_F_PREPPED) + return 0; if (unlikely(req->ctx->flags & IORING_SETUP_IOPOLL)) return -EINVAL; if (sqe->addr || sqe->ioprio || sqe->off || sqe->len || sqe->buf_index) @@ -2658,9 +2675,26 @@ static int io_poll_add(struct io_kiocb *req, struct io_kiocb **nxt) if (!poll->file) return -EBADF;
- INIT_IO_WORK(&req->work, io_poll_complete_work); + req->flags |= REQ_F_PREPPED; events = READ_ONCE(sqe->poll_events); poll->events = demangle_poll(events) | EPOLLERR | EPOLLHUP; + return 0; +} + +static int io_poll_add(struct io_kiocb *req, struct io_kiocb **nxt) +{ + struct io_poll_iocb *poll = &req->poll; + struct io_ring_ctx *ctx = req->ctx; + struct io_poll_table ipt; + bool cancel = false; + __poll_t mask; + int ret; + + ret = io_poll_add_prep(req); + if (ret) + return ret; + + INIT_IO_WORK(&req->work, io_poll_complete_work); INIT_HLIST_NODE(&req->hash_node);
poll->head = NULL; @@ -3028,6 +3062,12 @@ static int io_req_defer_prep(struct io_kiocb *req) io_req_map_rw(req, ret, iovec, inline_vecs, &iter); ret = 0; break; + case IORING_OP_POLL_ADD: + ret = io_poll_add_prep(req); + break; + case IORING_OP_POLL_REMOVE: + ret = io_poll_remove_prep(req); + break; case IORING_OP_FSYNC: ret = io_prep_fsync(req); break;
From: Jens Axboe axboe@kernel.dk
mainline inclusion from mainline-5.5-rc4 commit 3fbb51c18f5c15a23db74c4da79d3d035176c480 category: feature bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27 CVE: NA ---------------------------
Add struct io_connect in our io_kiocb per-command union, and ensure that io_connect_prep() has grabbed what it needs from the SQE.
Signed-off-by: Jens Axboe axboe@kernel.dk Signed-off-by: yangerkun yangerkun@huawei.com Reviewed-by: zhangyi (F) yi.zhang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- fs/io_uring.c | 40 ++++++++++++++++++++++------------------ 1 file changed, 22 insertions(+), 18 deletions(-)
diff --git a/fs/io_uring.c b/fs/io_uring.c index c38e34925bb3..e97d6e98d6bf 100644 --- a/fs/io_uring.c +++ b/fs/io_uring.c @@ -339,6 +339,12 @@ struct io_rw { u64 len; };
+struct io_connect { + struct file *file; + struct sockaddr __user *addr; + int addr_len; +}; + struct io_async_connect { struct sockaddr_storage address; }; @@ -382,6 +388,7 @@ struct io_kiocb { struct io_sync sync; struct io_cancel cancel; struct io_timeout timeout; + struct io_connect connect; };
const struct io_uring_sqe *sqe; @@ -2405,14 +2412,18 @@ static int io_connect_prep(struct io_kiocb *req, struct io_async_ctx *io) { #if defined(CONFIG_NET) const struct io_uring_sqe *sqe = req->sqe; - struct sockaddr __user *addr; - int addr_len;
- addr = u64_to_user_ptr(READ_ONCE(sqe->addr)); - addr_len = READ_ONCE(sqe->addr2); - return move_addr_to_kernel(addr, addr_len, &io->connect.address); + if (unlikely(req->ctx->flags & (IORING_SETUP_IOPOLL|IORING_SETUP_SQPOLL))) + return -EINVAL; + if (sqe->ioprio || sqe->len || sqe->buf_index || sqe->rw_flags) + return -EINVAL; + + req->connect.addr = u64_to_user_ptr(READ_ONCE(sqe->addr)); + req->connect.addr_len = READ_ONCE(sqe->addr2); + return move_addr_to_kernel(req->connect.addr, req->connect.addr_len, + &io->connect.address); #else - return 0; + return -EOPNOTSUPP; #endif }
@@ -2420,18 +2431,9 @@ static int io_connect(struct io_kiocb *req, struct io_kiocb **nxt, bool force_nonblock) { #if defined(CONFIG_NET) - const struct io_uring_sqe *sqe = req->sqe; struct io_async_ctx __io, *io; unsigned file_flags; - int addr_len, ret; - - if (unlikely(req->ctx->flags & (IORING_SETUP_IOPOLL|IORING_SETUP_SQPOLL))) - return -EINVAL; - if (sqe->ioprio || sqe->len || sqe->buf_index || sqe->rw_flags) - return -EINVAL; - - addr_len = READ_ONCE(sqe->addr2); - file_flags = force_nonblock ? O_NONBLOCK : 0; + int ret;
if (req->io) { io = req->io; @@ -2442,8 +2444,10 @@ static int io_connect(struct io_kiocb *req, struct io_kiocb **nxt, io = &__io; }
- ret = __sys_connect_file(req->file, &io->connect.address, addr_len, - file_flags); + file_flags = force_nonblock ? O_NONBLOCK : 0; + + ret = __sys_connect_file(req->file, &io->connect.address, + req->connect.addr_len, file_flags); if ((ret == -EAGAIN || ret == -EINPROGRESS) && force_nonblock) { if (req->io) return -EAGAIN;
From: Jens Axboe axboe@kernel.dk
mainline inclusion from mainline-5.5-rc4 commit e47293fdf98998292a89d516c8f7b8b9eb5c5213 category: feature bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27 CVE: NA ---------------------------
Add struct io_sr_msg in our io_kiocb per-command union, and ensure that the send/recvmsg prep handlers have grabbed what they need from the SQE by the time prep is done.
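One detail worth spelling out (a sketch under simplified names; MSG_DONTWAIT_F is a made-up stand-in for MSG_DONTWAIT): msg_flags is captured at prep time, but the effective flags are recomputed on every attempt, because only the inline submission path forces non-blocking behaviour; the async retry from the workqueue must not.

    #include <stdio.h>

    #define MSG_DONTWAIT_F  0x40    /* stand-in for MSG_DONTWAIT */
    #define REQ_F_NOWAIT    0x1

    struct sr_msg { int msg_flags; };   /* captured from the sqe at prep */
    struct req { unsigned flags; struct sr_msg sr_msg; };

    /* per-attempt flags, mirroring the logic in io_sendmsg()/io_recvmsg():
     * a user-requested DONTWAIT makes the whole request non-blocking,
     * otherwise inline submission forces DONTWAIT for this attempt only */
    static int attempt_flags(struct req *req, int force_nonblock)
    {
        int flags = req->sr_msg.msg_flags;

        if (flags & MSG_DONTWAIT_F)
            req->flags |= REQ_F_NOWAIT;
        else if (force_nonblock)
            flags |= MSG_DONTWAIT_F;
        return flags;
    }

    int main(void)
    {
        struct req r = { .sr_msg = { .msg_flags = 0 } };

        printf("inline attempt: flags %#x\n", attempt_flags(&r, 1));
        printf("async retry:    flags %#x\n", attempt_flags(&r, 0));
        return 0;
    }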
Signed-off-by: Jens Axboe axboe@kernel.dk Signed-off-by: yangerkun yangerkun@huawei.com Reviewed-by: zhangyi (F) yi.zhang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- fs/io_uring.c | 64 ++++++++++++++++++++++++++------------------------- 1 file changed, 33 insertions(+), 31 deletions(-)
diff --git a/fs/io_uring.c b/fs/io_uring.c index e97d6e98d6bf..05463be5e320 100644 --- a/fs/io_uring.c +++ b/fs/io_uring.c @@ -345,6 +345,12 @@ struct io_connect { int addr_len; };
+struct io_sr_msg { + struct file *file; + struct user_msghdr __user *msg; + int msg_flags; +}; + struct io_async_connect { struct sockaddr_storage address; }; @@ -389,6 +395,7 @@ struct io_kiocb { struct io_cancel cancel; struct io_timeout timeout; struct io_connect connect; + struct io_sr_msg sr_msg; };
const struct io_uring_sqe *sqe; @@ -2163,15 +2170,15 @@ static int io_sendmsg_prep(struct io_kiocb *req, struct io_async_ctx *io) { #if defined(CONFIG_NET) const struct io_uring_sqe *sqe = req->sqe; - struct user_msghdr __user *msg; - unsigned flags; + struct io_sr_msg *sr = &req->sr_msg;
- flags = READ_ONCE(sqe->msg_flags); - msg = u64_to_user_ptr(READ_ONCE(sqe->addr)); + sr->msg_flags = READ_ONCE(sqe->msg_flags); + sr->msg = u64_to_user_ptr(READ_ONCE(sqe->addr)); io->msg.iov = io->msg.fast_iov; - return sendmsg_copy_msghdr(&io->msg.msg, msg, flags, &io->msg.iov); + return sendmsg_copy_msghdr(&io->msg.msg, sr->msg, sr->msg_flags, + &io->msg.iov); #else - return 0; + return -EOPNOTSUPP; #endif }
@@ -2179,7 +2186,6 @@ static int io_sendmsg(struct io_kiocb *req, struct io_kiocb **nxt, bool force_nonblock) { #if defined(CONFIG_NET) - const struct io_uring_sqe *sqe = req->sqe; struct io_async_msghdr *kmsg = NULL; struct socket *sock; int ret; @@ -2193,12 +2199,6 @@ static int io_sendmsg(struct io_kiocb *req, struct io_kiocb **nxt, struct sockaddr_storage addr; unsigned flags;
- flags = READ_ONCE(sqe->msg_flags); - if (flags & MSG_DONTWAIT) - req->flags |= REQ_F_NOWAIT; - else if (force_nonblock) - flags |= MSG_DONTWAIT; - if (req->io) { kmsg = &req->io->msg; kmsg->msg.msg_name = &addr; @@ -2214,6 +2214,12 @@ static int io_sendmsg(struct io_kiocb *req, struct io_kiocb **nxt, goto out; }
+ flags = req->sr_msg.msg_flags; + if (flags & MSG_DONTWAIT) + req->flags |= REQ_F_NOWAIT; + else if (force_nonblock) + flags |= MSG_DONTWAIT; + ret = __sys_sendmsg_sock(sock, &kmsg->msg, flags); if (force_nonblock && ret == -EAGAIN) { if (req->io) @@ -2244,17 +2250,15 @@ static int io_sendmsg(struct io_kiocb *req, struct io_kiocb **nxt, static int io_recvmsg_prep(struct io_kiocb *req, struct io_async_ctx *io) { #if defined(CONFIG_NET) - const struct io_uring_sqe *sqe = req->sqe; - struct user_msghdr __user *msg; - unsigned flags; + struct io_sr_msg *sr = &req->sr_msg;
- flags = READ_ONCE(sqe->msg_flags); - msg = u64_to_user_ptr(READ_ONCE(sqe->addr)); + sr->msg_flags = READ_ONCE(req->sqe->msg_flags); + sr->msg = u64_to_user_ptr(READ_ONCE(req->sqe->addr)); io->msg.iov = io->msg.fast_iov; - return recvmsg_copy_msghdr(&io->msg.msg, msg, flags, &io->msg.uaddr, - &io->msg.iov); + return recvmsg_copy_msghdr(&io->msg.msg, sr->msg, sr->msg_flags, + &io->msg.uaddr, &io->msg.iov); #else - return 0; + return -EOPNOTSUPP; #endif }
@@ -2262,7 +2266,6 @@ static int io_recvmsg(struct io_kiocb *req, struct io_kiocb **nxt, bool force_nonblock) { #if defined(CONFIG_NET) - const struct io_uring_sqe *sqe = req->sqe; struct io_async_msghdr *kmsg = NULL; struct socket *sock; int ret; @@ -2272,18 +2275,10 @@ static int io_recvmsg(struct io_kiocb *req, struct io_kiocb **nxt,
sock = sock_from_file(req->file, &ret); if (sock) { - struct user_msghdr __user *msg; struct io_async_ctx io; struct sockaddr_storage addr; unsigned flags;
- flags = READ_ONCE(sqe->msg_flags); - if (flags & MSG_DONTWAIT) - req->flags |= REQ_F_NOWAIT; - else if (force_nonblock) - flags |= MSG_DONTWAIT; - - msg = u64_to_user_ptr(READ_ONCE(sqe->addr)); if (req->io) { kmsg = &req->io->msg; kmsg->msg.msg_name = &addr; @@ -2299,7 +2294,14 @@ static int io_recvmsg(struct io_kiocb *req, struct io_kiocb **nxt, goto out; }
- ret = __sys_recvmsg_sock(sock, &kmsg->msg, msg, kmsg->uaddr, flags); + flags = req->sr_msg.msg_flags; + if (flags & MSG_DONTWAIT) + req->flags |= REQ_F_NOWAIT; + else if (force_nonblock) + flags |= MSG_DONTWAIT; + + ret = __sys_recvmsg_sock(sock, &kmsg->msg, req->sr_msg.msg, + kmsg->uaddr, flags); if (force_nonblock && ret == -EAGAIN) { if (req->io) return -EAGAIN;
From: Jens Axboe axboe@kernel.dk
mainline inclusion from mainline-5.5-rc4 commit 26a61679f10c6f041726411964b172565021c2eb category: feature bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27 CVE: NA ---------------------------
Add the count field to struct io_timeout, and ensure the prep handler has read it. A timeout also always needs an async context; set it up in the prep handler if we don't have one.
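For reference, the count copied out of sqe->off selects between the two timeout modes; a trivial sketch of the decision io_timeout() makes in the diff below:

    #include <stdbool.h>
    #include <stdio.h>

    struct timeout_model { unsigned count; };   /* from sqe->off at prep */

    /* count == 0: pure timer, CQ-sequence tracking is skipped
     * count  > 0: completes after that many CQEs or when the timer fires */
    static bool pure_timeout(const struct timeout_model *t)
    {
        return t->count == 0;
    }

    int main(void)
    {
        struct timeout_model plain = { 0 }, seq = { 32 };

        printf("count=0:  pure=%d\n", pure_timeout(&plain));
        printf("count=32: pure=%d\n", pure_timeout(&seq));
        return 0;
    }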
Signed-off-by: Jens Axboe axboe@kernel.dk Signed-off-by: yangerkun yangerkun@huawei.com Reviewed-by: zhangyi (F) yi.zhang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- fs/io_uring.c | 11 ++++++++--- 1 file changed, 8 insertions(+), 3 deletions(-)
diff --git a/fs/io_uring.c b/fs/io_uring.c index 05463be5e320..5badcd315eef 100644 --- a/fs/io_uring.c +++ b/fs/io_uring.c @@ -330,6 +330,7 @@ struct io_timeout { struct file *file; u64 addr; int flags; + unsigned count; };
struct io_rw { @@ -2901,7 +2902,12 @@ static int io_timeout_prep(struct io_kiocb *req, struct io_async_ctx *io, if (flags & ~IORING_TIMEOUT_ABS) return -EINVAL;
- data = &io->timeout; + req->timeout.count = READ_ONCE(sqe->off); + + if (!io && io_alloc_async_ctx(req)) + return -ENOMEM; + + data = &req->io->timeout; data->req = req; req->flags |= REQ_F_TIMEOUT;
@@ -2919,7 +2925,6 @@ static int io_timeout_prep(struct io_kiocb *req, struct io_async_ctx *io,
static int io_timeout(struct io_kiocb *req) { - const struct io_uring_sqe *sqe = req->sqe; unsigned count; struct io_ring_ctx *ctx = req->ctx; struct io_timeout_data *data; @@ -2941,7 +2946,7 @@ static int io_timeout(struct io_kiocb *req) * timeout event to be satisfied. If it isn't set, then this is * a pure timeout request, sequence isn't used. */ - count = READ_ONCE(sqe->off); + count = req->timeout.count; if (!count) { req->flags |= REQ_F_TIMEOUT_NOSEQ; spin_lock_irq(&ctx->completion_lock);
From: Jens Axboe axboe@kernel.dk
mainline inclusion from mainline-5.5-rc4 commit 06b76d44ba25e52711dc7cc4fc75b50907bc6b8e category: feature bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27 CVE: NA ---------------------------
We currently have a mix of use cases. Most of the newer ones are pretty uniform, but we have some older ones that use different calling conventions. This is confusing.
For the opcodes that currently rely on the req->io->sqe copy saving them from reuse, add a request type struct in the io_kiocb command union to store the data they need.
Prepare for all opcodes having a standard prep method, so we can call it in a uniform fashion and outside of the opcode handler. This is in preparation for passing in the 'sqe' pointer, rather than storing it in the io_kiocb. Once we have uniform prep handlers, we can leave all the prep work to that part, and not even pass in the sqe to the opcode handler. This ensures that we don't reuse sqe data inadvertently.
Signed-off-by: Jens Axboe axboe@kernel.dk Signed-off-by: yangerkun yangerkun@huawei.com Reviewed-by: zhangyi (F) yi.zhang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- fs/io_uring.c | 128 +++++++++++++++++++++++++------------------------- 1 file changed, 63 insertions(+), 65 deletions(-)
diff --git a/fs/io_uring.c b/fs/io_uring.c index 5badcd315eef..05abe7bf6a81 100644 --- a/fs/io_uring.c +++ b/fs/io_uring.c @@ -371,7 +371,6 @@ struct io_async_rw { };
struct io_async_ctx { - struct io_uring_sqe sqe; union { struct io_async_rw rw; struct io_async_msghdr msg; @@ -433,7 +432,6 @@ struct io_kiocb { #define REQ_F_INFLIGHT 16384 /* on inflight list */ #define REQ_F_COMP_LOCKED 32768 /* completion under lock */ #define REQ_F_HARDLINK 65536 /* doesn't sever on completion < 0 */ -#define REQ_F_PREPPED 131072 /* request already opcode prepared */ u64 user_data; u32 result; u32 sequence; @@ -1500,6 +1498,8 @@ static int io_prep_rw(struct io_kiocb *req, bool force_nonblock) unsigned ioprio; int ret;
+ if (!sqe) + return 0; if (!req->file) return -EBADF;
@@ -1551,6 +1551,7 @@ static int io_prep_rw(struct io_kiocb *req, bool force_nonblock) /* we own ->private, reuse it for the buffer index */ req->rw.kiocb.private = (void *) (unsigned long) READ_ONCE(req->sqe->buf_index); + req->sqe = NULL; return 0; }
@@ -1772,13 +1773,7 @@ static void io_req_map_rw(struct io_kiocb *req, ssize_t io_size, static int io_alloc_async_ctx(struct io_kiocb *req) { req->io = kmalloc(sizeof(*req->io), GFP_KERNEL); - if (req->io) { - memcpy(&req->io->sqe, req->sqe, sizeof(req->io->sqe)); - req->sqe = &req->io->sqe; - return 0; - } - - return 1; + return req->io == NULL; }
static void io_rw_async(struct io_wq_work **workptr) @@ -1809,12 +1804,14 @@ static int io_read_prep(struct io_kiocb *req, struct iovec **iovec, { ssize_t ret;
- ret = io_prep_rw(req, force_nonblock); - if (ret) - return ret; + if (req->sqe) { + ret = io_prep_rw(req, force_nonblock); + if (ret) + return ret;
- if (unlikely(!(req->file->f_mode & FMODE_READ))) - return -EBADF; + if (unlikely(!(req->file->f_mode & FMODE_READ))) + return -EBADF; + }
return io_import_iovec(READ, req, iovec, iter); } @@ -1828,15 +1825,9 @@ static int io_read(struct io_kiocb *req, struct io_kiocb **nxt, size_t iov_count; ssize_t io_size, ret;
- if (!req->io) { - ret = io_read_prep(req, &iovec, &iter, force_nonblock); - if (ret < 0) - return ret; - } else { - ret = io_import_iovec(READ, req, &iovec, &iter); - if (ret < 0) - return ret; - } + ret = io_read_prep(req, &iovec, &iter, force_nonblock); + if (ret < 0) + return ret;
/* Ensure we clear previously set non-block flag */ if (!force_nonblock) @@ -1900,12 +1891,14 @@ static int io_write_prep(struct io_kiocb *req, struct iovec **iovec, { ssize_t ret;
- ret = io_prep_rw(req, force_nonblock); - if (ret) - return ret; + if (req->sqe) { + ret = io_prep_rw(req, force_nonblock); + if (ret) + return ret;
- if (unlikely(!(req->file->f_mode & FMODE_WRITE))) - return -EBADF; + if (unlikely(!(req->file->f_mode & FMODE_WRITE))) + return -EBADF; + }
return io_import_iovec(WRITE, req, iovec, iter); } @@ -1919,15 +1912,9 @@ static int io_write(struct io_kiocb *req, struct io_kiocb **nxt, size_t iov_count; ssize_t ret, io_size;
- if (!req->io) { - ret = io_write_prep(req, &iovec, &iter, force_nonblock); - if (ret < 0) - return ret; - } else { - ret = io_import_iovec(WRITE, req, &iovec, &iter); - if (ret < 0) - return ret; - } + ret = io_write_prep(req, &iovec, &iter, force_nonblock); + if (ret < 0) + return ret;
/* Ensure we clear previously set non-block flag */ if (!force_nonblock) @@ -2012,7 +1999,7 @@ static int io_prep_fsync(struct io_kiocb *req) const struct io_uring_sqe *sqe = req->sqe; struct io_ring_ctx *ctx = req->ctx;
- if (req->flags & REQ_F_PREPPED) + if (!req->sqe) return 0; if (!req->file) return -EBADF; @@ -2028,7 +2015,7 @@ static int io_prep_fsync(struct io_kiocb *req)
req->sync.off = READ_ONCE(sqe->off); req->sync.len = READ_ONCE(sqe->len); - req->flags |= REQ_F_PREPPED; + req->sqe = NULL; return 0; }
@@ -2094,7 +2081,7 @@ static int io_prep_sfr(struct io_kiocb *req) const struct io_uring_sqe *sqe = req->sqe; struct io_ring_ctx *ctx = req->ctx;
- if (req->flags & REQ_F_PREPPED) + if (!sqe) return 0; if (!req->file) return -EBADF; @@ -2107,7 +2094,7 @@ static int io_prep_sfr(struct io_kiocb *req) req->sync.off = READ_ONCE(sqe->off); req->sync.len = READ_ONCE(sqe->len); req->sync.flags = READ_ONCE(sqe->sync_range_flags); - req->flags |= REQ_F_PREPPED; + req->sqe = NULL; return 0; }
@@ -2172,12 +2159,17 @@ static int io_sendmsg_prep(struct io_kiocb *req, struct io_async_ctx *io) #if defined(CONFIG_NET) const struct io_uring_sqe *sqe = req->sqe; struct io_sr_msg *sr = &req->sr_msg; + int ret;
+ if (!sqe) + return 0; sr->msg_flags = READ_ONCE(sqe->msg_flags); sr->msg = u64_to_user_ptr(READ_ONCE(sqe->addr)); io->msg.iov = io->msg.fast_iov; - return sendmsg_copy_msghdr(&io->msg.msg, sr->msg, sr->msg_flags, + ret = sendmsg_copy_msghdr(&io->msg.msg, sr->msg, sr->msg_flags, &io->msg.iov); + req->sqe = NULL; + return ret; #else return -EOPNOTSUPP; #endif @@ -2252,12 +2244,18 @@ static int io_recvmsg_prep(struct io_kiocb *req, struct io_async_ctx *io) { #if defined(CONFIG_NET) struct io_sr_msg *sr = &req->sr_msg; + int ret; + + if (!req->sqe) + return 0;
sr->msg_flags = READ_ONCE(req->sqe->msg_flags); sr->msg = u64_to_user_ptr(READ_ONCE(req->sqe->addr)); io->msg.iov = io->msg.fast_iov; - return recvmsg_copy_msghdr(&io->msg.msg, sr->msg, sr->msg_flags, + ret = recvmsg_copy_msghdr(&io->msg.msg, sr->msg, sr->msg_flags, &io->msg.uaddr, &io->msg.iov); + req->sqe = NULL; + return ret; #else return -EOPNOTSUPP; #endif @@ -2335,7 +2333,7 @@ static int io_accept_prep(struct io_kiocb *req) const struct io_uring_sqe *sqe = req->sqe; struct io_accept *accept = &req->accept;
- if (req->flags & REQ_F_PREPPED) + if (!req->sqe) return 0;
if (unlikely(req->ctx->flags & (IORING_SETUP_IOPOLL|IORING_SETUP_SQPOLL))) @@ -2346,7 +2344,7 @@ static int io_accept_prep(struct io_kiocb *req) accept->addr = u64_to_user_ptr(READ_ONCE(sqe->addr)); accept->addr_len = u64_to_user_ptr(READ_ONCE(sqe->addr2)); accept->flags = READ_ONCE(sqe->accept_flags); - req->flags |= REQ_F_PREPPED; + req->sqe = NULL; return 0; #else return -EOPNOTSUPP; @@ -2415,7 +2413,10 @@ static int io_connect_prep(struct io_kiocb *req, struct io_async_ctx *io) { #if defined(CONFIG_NET) const struct io_uring_sqe *sqe = req->sqe; + int ret;
+ if (!sqe) + return 0; if (unlikely(req->ctx->flags & (IORING_SETUP_IOPOLL|IORING_SETUP_SQPOLL))) return -EINVAL; if (sqe->ioprio || sqe->len || sqe->buf_index || sqe->rw_flags) @@ -2423,8 +2424,10 @@ static int io_connect_prep(struct io_kiocb *req, struct io_async_ctx *io)
req->connect.addr = u64_to_user_ptr(READ_ONCE(sqe->addr)); req->connect.addr_len = READ_ONCE(sqe->addr2); - return move_addr_to_kernel(req->connect.addr, req->connect.addr_len, + ret = move_addr_to_kernel(req->connect.addr, req->connect.addr_len, &io->connect.address); + req->sqe = NULL; + return ret; #else return -EOPNOTSUPP; #endif @@ -2525,7 +2528,7 @@ static int io_poll_remove_prep(struct io_kiocb *req) { const struct io_uring_sqe *sqe = req->sqe;
- if (req->flags & REQ_F_PREPPED) + if (!sqe) return 0; if (unlikely(req->ctx->flags & IORING_SETUP_IOPOLL)) return -EINVAL; @@ -2534,7 +2537,7 @@ static int io_poll_remove_prep(struct io_kiocb *req) return -EINVAL;
req->poll.addr = READ_ONCE(sqe->addr); - req->flags |= REQ_F_PREPPED; + req->sqe = NULL; return 0; }
@@ -2695,7 +2698,7 @@ static int io_poll_add_prep(struct io_kiocb *req) struct io_poll_iocb *poll = &req->poll; u16 events;
- if (req->flags & REQ_F_PREPPED) + if (!sqe) return 0; if (unlikely(req->ctx->flags & IORING_SETUP_IOPOLL)) return -EINVAL; @@ -2704,9 +2707,9 @@ static int io_poll_add_prep(struct io_kiocb *req) if (!poll->file) return -EBADF;
- req->flags |= REQ_F_PREPPED; events = READ_ONCE(sqe->poll_events); poll->events = demangle_poll(events) | EPOLLERR | EPOLLHUP; + req->sqe = NULL; return 0; }
@@ -2844,7 +2847,7 @@ static int io_timeout_remove_prep(struct io_kiocb *req) { const struct io_uring_sqe *sqe = req->sqe;
- if (req->flags & REQ_F_PREPPED) + if (!sqe) return 0; if (unlikely(req->ctx->flags & IORING_SETUP_IOPOLL)) return -EINVAL; @@ -2856,7 +2859,7 @@ static int io_timeout_remove_prep(struct io_kiocb *req) if (req->timeout.flags) return -EINVAL;
- req->flags |= REQ_F_PREPPED; + req->sqe = NULL; return 0; }
@@ -2892,6 +2895,8 @@ static int io_timeout_prep(struct io_kiocb *req, struct io_async_ctx *io, struct io_timeout_data *data; unsigned flags;
+ if (!sqe) + return 0; if (unlikely(req->ctx->flags & IORING_SETUP_IOPOLL)) return -EINVAL; if (sqe->ioprio || sqe->buf_index || sqe->len != 1) @@ -2920,6 +2925,7 @@ static int io_timeout_prep(struct io_kiocb *req, struct io_async_ctx *io, data->mode = HRTIMER_MODE_REL;
hrtimer_init(&data->timer, CLOCK_MONOTONIC, data->mode); + req->sqe = NULL; return 0; }
@@ -2932,13 +2938,9 @@ static int io_timeout(struct io_kiocb *req) unsigned span = 0; int ret;
- if (!req->io) { - if (io_alloc_async_ctx(req)) - return -ENOMEM; - ret = io_timeout_prep(req, req->io, false); - if (ret) - return ret; - } + ret = io_timeout_prep(req, req->io, false); + if (ret) + return ret; data = &req->io->timeout;
/* @@ -3068,7 +3070,7 @@ static int io_async_cancel_prep(struct io_kiocb *req) { const struct io_uring_sqe *sqe = req->sqe;
- if (req->flags & REQ_F_PREPPED) + if (!sqe) return 0; if (unlikely(req->ctx->flags & IORING_SETUP_IOPOLL)) return -EINVAL; @@ -3076,8 +3078,8 @@ static int io_async_cancel_prep(struct io_kiocb *req) sqe->cancel_flags) return -EINVAL;
- req->flags |= REQ_F_PREPPED; req->cancel.addr = READ_ONCE(sqe->addr); + req->sqe = NULL; return 0; }
@@ -3212,13 +3214,9 @@ static int io_issue_sqe(struct io_kiocb *req, struct io_kiocb **nxt, ret = io_nop(req); break; case IORING_OP_READV: - if (unlikely(req->sqe->buf_index)) - return -EINVAL; ret = io_read(req, nxt, force_nonblock); break; case IORING_OP_WRITEV: - if (unlikely(req->sqe->buf_index)) - return -EINVAL; ret = io_write(req, nxt, force_nonblock); break; case IORING_OP_READ_FIXED:
From: Jens Axboe axboe@kernel.dk
mainline inclusion from mainline-5.5-rc4 commit 3529d8c2b353e6e446277ae96a36e7471cb070fc category: feature bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27 CVE: NA ---------------------------
This moves the prep handlers outside of the opcode handlers, and allows us to pass in the sqe directly. If the sqe is non-NULL, it means that the request should be prepared for the first time.
With the opcode handlers not having access to the sqe at all, we are guaranteed that the prep handler has set up the request fully by the time we get there. As before, for opcodes that need to copy in more data than the io_kiocb allows for, the io_async_ctx holds that info. If a prep handler is invoked with req->io set, it must use that to retain information for later.
Finally, we can remove io_kiocb->sqe as well.
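The resulting calling convention, reduced to a userspace sketch with made-up opcode and helper names: the first, inline submission passes the live sqe so prep runs exactly once; any later invocation (async punt from io-wq, deferred retry) passes NULL and runs purely off the prepared request state.

    #include <stdio.h>

    enum op { OP_FSYNC };

    struct sqe { enum op opcode; long long off, len; };
    struct req { enum op opcode; long long off, len; };

    static int prep_fsync(struct req *req, const struct sqe *sqe)
    {
        req->off = sqe->off;
        req->len = sqe->len;
        return 0;
    }

    static int do_fsync(const struct req *req)
    {
        printf("fsync range %lld+%lld\n", req->off, req->len);
        return 0;
    }

    /* sqe != NULL: first (inline) issue, prep from the sqe first.
     * sqe == NULL: re-issue from the workqueue, prep already ran. */
    static int issue_sqe(struct req *req, const struct sqe *sqe)
    {
        int ret;

        switch (req->opcode) {
        case OP_FSYNC:
            if (sqe) {
                ret = prep_fsync(req, sqe);
                if (ret)
                    return ret;
            }
            return do_fsync(req);   /* handlers never see the sqe */
        }
        return -1;
    }

    int main(void)
    {
        struct sqe slot = { OP_FSYNC, 0, 4096 };
        struct req req = { .opcode = OP_FSYNC };

        issue_sqe(&req, &slot);     /* inline submission */
        issue_sqe(&req, NULL);      /* async punt reuses prepped state */
        return 0;
    }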
Signed-off-by: Jens Axboe axboe@kernel.dk Signed-off-by: yangerkun yangerkun@huawei.com Reviewed-by: zhangyi (F) yi.zhang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- fs/io_uring.c | 493 +++++++++++++++++++++++++------------------------- 1 file changed, 251 insertions(+), 242 deletions(-)
diff --git a/fs/io_uring.c b/fs/io_uring.c index 05abe7bf6a81..8b4faa21e2f1 100644 --- a/fs/io_uring.c +++ b/fs/io_uring.c @@ -398,7 +398,6 @@ struct io_kiocb { struct io_sr_msg sr_msg; };
- const struct io_uring_sqe *sqe; struct io_async_ctx *io; struct file *ring_file; int ring_fd; @@ -628,33 +627,31 @@ static inline bool io_prep_async_work(struct io_kiocb *req, { bool do_hashed = false;
- if (req->sqe) { - switch (req->opcode) { - case IORING_OP_WRITEV: - case IORING_OP_WRITE_FIXED: - /* only regular files should be hashed for writes */ - if (req->flags & REQ_F_ISREG) - do_hashed = true; - /* fall-through */ - case IORING_OP_READV: - case IORING_OP_READ_FIXED: - case IORING_OP_SENDMSG: - case IORING_OP_RECVMSG: - case IORING_OP_ACCEPT: - case IORING_OP_POLL_ADD: - case IORING_OP_CONNECT: - /* - * We know REQ_F_ISREG is not set on some of these - * opcodes, but this enables us to keep the check in - * just one place. - */ - if (!(req->flags & REQ_F_ISREG)) - req->work.flags |= IO_WQ_WORK_UNBOUND; - break; - } - if (io_req_needs_user(req)) - req->work.flags |= IO_WQ_WORK_NEEDS_USER; + switch (req->opcode) { + case IORING_OP_WRITEV: + case IORING_OP_WRITE_FIXED: + /* only regular files should be hashed for writes */ + if (req->flags & REQ_F_ISREG) + do_hashed = true; + /* fall-through */ + case IORING_OP_READV: + case IORING_OP_READ_FIXED: + case IORING_OP_SENDMSG: + case IORING_OP_RECVMSG: + case IORING_OP_ACCEPT: + case IORING_OP_POLL_ADD: + case IORING_OP_CONNECT: + /* + * We know REQ_F_ISREG is not set on some of these + * opcodes, but this enables us to keep the check in + * just one place. + */ + if (!(req->flags & REQ_F_ISREG)) + req->work.flags |= IO_WQ_WORK_UNBOUND; + break; } + if (io_req_needs_user(req)) + req->work.flags |= IO_WQ_WORK_NEEDS_USER;
*link = io_prep_linked_timeout(req); return do_hashed; @@ -1490,16 +1487,14 @@ static bool io_file_supports_async(struct file *file) return false; }
-static int io_prep_rw(struct io_kiocb *req, bool force_nonblock) +static int io_prep_rw(struct io_kiocb *req, const struct io_uring_sqe *sqe, + bool force_nonblock) { - const struct io_uring_sqe *sqe = req->sqe; struct io_ring_ctx *ctx = req->ctx; struct kiocb *kiocb = &req->rw.kiocb; unsigned ioprio; int ret;
- if (!sqe) - return 0; if (!req->file) return -EBADF;
@@ -1546,12 +1541,11 @@ static int io_prep_rw(struct io_kiocb *req, bool force_nonblock) kiocb->ki_complete = io_complete_rw; }
- req->rw.addr = READ_ONCE(req->sqe->addr); - req->rw.len = READ_ONCE(req->sqe->len); + req->rw.addr = READ_ONCE(sqe->addr); + req->rw.len = READ_ONCE(sqe->len); /* we own ->private, reuse it for the buffer index */ req->rw.kiocb.private = (void *) (unsigned long) - READ_ONCE(req->sqe->buf_index); - req->sqe = NULL; + READ_ONCE(sqe->buf_index); return 0; }
@@ -1799,21 +1793,33 @@ static int io_setup_async_rw(struct io_kiocb *req, ssize_t io_size, return 0; }
-static int io_read_prep(struct io_kiocb *req, struct iovec **iovec, - struct iov_iter *iter, bool force_nonblock) +static int io_read_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe, + bool force_nonblock) { + struct io_async_ctx *io; + struct iov_iter iter; ssize_t ret;
- if (req->sqe) { - ret = io_prep_rw(req, force_nonblock); - if (ret) - return ret; + ret = io_prep_rw(req, sqe, force_nonblock); + if (ret) + return ret;
- if (unlikely(!(req->file->f_mode & FMODE_READ))) - return -EBADF; - } + if (unlikely(!(req->file->f_mode & FMODE_READ))) + return -EBADF;
- return io_import_iovec(READ, req, iovec, iter); + if (!req->io) + return 0; + + io = req->io; + io->rw.iov = io->rw.fast_iov; + req->io = NULL; + ret = io_import_iovec(READ, req, &io->rw.iov, &iter); + req->io = io; + if (ret < 0) + return ret; + + io_req_map_rw(req, ret, io->rw.iov, io->rw.fast_iov, &iter); + return 0; }
static int io_read(struct io_kiocb *req, struct io_kiocb **nxt, @@ -1825,7 +1831,7 @@ static int io_read(struct io_kiocb *req, struct io_kiocb **nxt, size_t iov_count; ssize_t io_size, ret;
- ret = io_read_prep(req, &iovec, &iter, force_nonblock); + ret = io_import_iovec(READ, req, &iovec, &iter); if (ret < 0) return ret;
@@ -1886,21 +1892,33 @@ static int io_read(struct io_kiocb *req, struct io_kiocb **nxt, return ret; }
-static int io_write_prep(struct io_kiocb *req, struct iovec **iovec, - struct iov_iter *iter, bool force_nonblock) +static int io_write_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe, + bool force_nonblock) { + struct io_async_ctx *io; + struct iov_iter iter; ssize_t ret;
- if (req->sqe) { - ret = io_prep_rw(req, force_nonblock); - if (ret) - return ret; + ret = io_prep_rw(req, sqe, force_nonblock); + if (ret) + return ret;
- if (unlikely(!(req->file->f_mode & FMODE_WRITE))) - return -EBADF; - } + if (unlikely(!(req->file->f_mode & FMODE_WRITE))) + return -EBADF;
- return io_import_iovec(WRITE, req, iovec, iter); + if (!req->io) + return 0; + + io = req->io; + io->rw.iov = io->rw.fast_iov; + req->io = NULL; + ret = io_import_iovec(WRITE, req, &io->rw.iov, &iter); + req->io = io; + if (ret < 0) + return ret; + + io_req_map_rw(req, ret, io->rw.iov, io->rw.fast_iov, &iter); + return 0; }
static int io_write(struct io_kiocb *req, struct io_kiocb **nxt, @@ -1912,7 +1930,7 @@ static int io_write(struct io_kiocb *req, struct io_kiocb **nxt, size_t iov_count; ssize_t ret, io_size;
- ret = io_write_prep(req, &iovec, &iter, force_nonblock); + ret = io_import_iovec(WRITE, req, &iovec, &iter); if (ret < 0) return ret;
@@ -1994,13 +2012,10 @@ static int io_nop(struct io_kiocb *req) return 0; }
-static int io_prep_fsync(struct io_kiocb *req) +static int io_prep_fsync(struct io_kiocb *req, const struct io_uring_sqe *sqe) { - const struct io_uring_sqe *sqe = req->sqe; struct io_ring_ctx *ctx = req->ctx;
- if (!req->sqe) - return 0; if (!req->file) return -EBADF;
@@ -2015,7 +2030,6 @@ static int io_prep_fsync(struct io_kiocb *req)
req->sync.off = READ_ONCE(sqe->off); req->sync.len = READ_ONCE(sqe->len); - req->sqe = NULL; return 0; }
@@ -2056,11 +2070,6 @@ static int io_fsync(struct io_kiocb *req, struct io_kiocb **nxt, bool force_nonblock) { struct io_wq_work *work, *old_work; - int ret; - - ret = io_prep_fsync(req); - if (ret) - return ret;
/* fsync always requires a blocking context */ if (force_nonblock) { @@ -2076,13 +2085,10 @@ static int io_fsync(struct io_kiocb *req, struct io_kiocb **nxt, return 0; }
-static int io_prep_sfr(struct io_kiocb *req) +static int io_prep_sfr(struct io_kiocb *req, const struct io_uring_sqe *sqe) { - const struct io_uring_sqe *sqe = req->sqe; struct io_ring_ctx *ctx = req->ctx;
- if (!sqe) - return 0; if (!req->file) return -EBADF;
@@ -2094,7 +2100,6 @@ static int io_prep_sfr(struct io_kiocb *req) req->sync.off = READ_ONCE(sqe->off); req->sync.len = READ_ONCE(sqe->len); req->sync.flags = READ_ONCE(sqe->sync_range_flags); - req->sqe = NULL; return 0; }
@@ -2121,11 +2126,6 @@ static int io_sync_file_range(struct io_kiocb *req, struct io_kiocb **nxt, bool force_nonblock) { struct io_wq_work *work, *old_work; - int ret; - - ret = io_prep_sfr(req); - if (ret) - return ret;
/* sync_file_range always requires a blocking context */ if (force_nonblock) { @@ -2154,22 +2154,21 @@ static void io_sendrecv_async(struct io_wq_work **workptr) } #endif
-static int io_sendmsg_prep(struct io_kiocb *req, struct io_async_ctx *io) +static int io_sendmsg_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe) { #if defined(CONFIG_NET) - const struct io_uring_sqe *sqe = req->sqe; struct io_sr_msg *sr = &req->sr_msg; - int ret; + struct io_async_ctx *io = req->io;
- if (!sqe) - return 0; sr->msg_flags = READ_ONCE(sqe->msg_flags); sr->msg = u64_to_user_ptr(READ_ONCE(sqe->addr)); + + if (!io) + return 0; + io->msg.iov = io->msg.fast_iov; - ret = sendmsg_copy_msghdr(&io->msg.msg, sr->msg, sr->msg_flags, + return sendmsg_copy_msghdr(&io->msg.msg, sr->msg, sr->msg_flags, &io->msg.iov); - req->sqe = NULL; - return ret; #else return -EOPNOTSUPP; #endif @@ -2200,11 +2199,16 @@ static int io_sendmsg(struct io_kiocb *req, struct io_kiocb **nxt, kmsg->iov = kmsg->fast_iov; kmsg->msg.msg_iter.iov = kmsg->iov; } else { + struct io_sr_msg *sr = &req->sr_msg; + kmsg = &io.msg; kmsg->msg.msg_name = &addr; - ret = io_sendmsg_prep(req, &io); + + io.msg.iov = io.msg.fast_iov; + ret = sendmsg_copy_msghdr(&io.msg.msg, sr->msg, + sr->msg_flags, &io.msg.iov); if (ret) - goto out; + return ret; }
flags = req->sr_msg.msg_flags; @@ -2227,7 +2231,6 @@ static int io_sendmsg(struct io_kiocb *req, struct io_kiocb **nxt, ret = -EINTR; }
-out: if (!io_wq_current_is_worker() && kmsg && kmsg->iov != kmsg->fast_iov) kfree(kmsg->iov); io_cqring_add_event(req, ret); @@ -2240,22 +2243,22 @@ static int io_sendmsg(struct io_kiocb *req, struct io_kiocb **nxt, #endif }
-static int io_recvmsg_prep(struct io_kiocb *req, struct io_async_ctx *io) +static int io_recvmsg_prep(struct io_kiocb *req, + const struct io_uring_sqe *sqe) { #if defined(CONFIG_NET) struct io_sr_msg *sr = &req->sr_msg; - int ret; + struct io_async_ctx *io = req->io; + + sr->msg_flags = READ_ONCE(sqe->msg_flags); + sr->msg = u64_to_user_ptr(READ_ONCE(sqe->addr));
- if (!req->sqe) + if (!io) return 0;
- sr->msg_flags = READ_ONCE(req->sqe->msg_flags); - sr->msg = u64_to_user_ptr(READ_ONCE(req->sqe->addr)); io->msg.iov = io->msg.fast_iov; - ret = recvmsg_copy_msghdr(&io->msg.msg, sr->msg, sr->msg_flags, + return recvmsg_copy_msghdr(&io->msg.msg, sr->msg, sr->msg_flags, &io->msg.uaddr, &io->msg.iov); - req->sqe = NULL; - return ret; #else return -EOPNOTSUPP; #endif @@ -2286,11 +2289,17 @@ static int io_recvmsg(struct io_kiocb *req, struct io_kiocb **nxt, kmsg->iov = kmsg->fast_iov; kmsg->msg.msg_iter.iov = kmsg->iov; } else { + struct io_sr_msg *sr = &req->sr_msg; + kmsg = &io.msg; kmsg->msg.msg_name = &addr; - ret = io_recvmsg_prep(req, &io); + + io.msg.iov = io.msg.fast_iov; + ret = recvmsg_copy_msghdr(&io.msg.msg, sr->msg, + sr->msg_flags, &io.msg.uaddr, + &io.msg.iov); if (ret) - goto out; + return ret; }
flags = req->sr_msg.msg_flags; @@ -2314,7 +2323,6 @@ static int io_recvmsg(struct io_kiocb *req, struct io_kiocb **nxt, ret = -EINTR; }
-out: if (!io_wq_current_is_worker() && kmsg && kmsg->iov != kmsg->fast_iov) kfree(kmsg->iov); io_cqring_add_event(req, ret); @@ -2327,15 +2335,11 @@ static int io_recvmsg(struct io_kiocb *req, struct io_kiocb **nxt, #endif }
-static int io_accept_prep(struct io_kiocb *req) +static int io_accept_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe) { #if defined(CONFIG_NET) - const struct io_uring_sqe *sqe = req->sqe; struct io_accept *accept = &req->accept;
- if (!req->sqe) - return 0; - if (unlikely(req->ctx->flags & (IORING_SETUP_IOPOLL|IORING_SETUP_SQPOLL))) return -EINVAL; if (sqe->ioprio || sqe->len || sqe->buf_index) @@ -2344,7 +2348,6 @@ static int io_accept_prep(struct io_kiocb *req) accept->addr = u64_to_user_ptr(READ_ONCE(sqe->addr)); accept->addr_len = u64_to_user_ptr(READ_ONCE(sqe->addr2)); accept->flags = READ_ONCE(sqe->accept_flags); - req->sqe = NULL; return 0; #else return -EOPNOTSUPP; @@ -2392,10 +2395,6 @@ static int io_accept(struct io_kiocb *req, struct io_kiocb **nxt, #if defined(CONFIG_NET) int ret;
- ret = io_accept_prep(req); - if (ret) - return ret; - ret = __io_accept(req, nxt, force_nonblock); if (ret == -EAGAIN && force_nonblock) { req->work.func = io_accept_finish; @@ -2409,25 +2408,25 @@ static int io_accept(struct io_kiocb *req, struct io_kiocb **nxt, #endif }
-static int io_connect_prep(struct io_kiocb *req, struct io_async_ctx *io) +static int io_connect_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe) { #if defined(CONFIG_NET) - const struct io_uring_sqe *sqe = req->sqe; - int ret; + struct io_connect *conn = &req->connect; + struct io_async_ctx *io = req->io;
- if (!sqe) - return 0; if (unlikely(req->ctx->flags & (IORING_SETUP_IOPOLL|IORING_SETUP_SQPOLL))) return -EINVAL; if (sqe->ioprio || sqe->len || sqe->buf_index || sqe->rw_flags) return -EINVAL;
- req->connect.addr = u64_to_user_ptr(READ_ONCE(sqe->addr)); - req->connect.addr_len = READ_ONCE(sqe->addr2); - ret = move_addr_to_kernel(req->connect.addr, req->connect.addr_len, + conn->addr = u64_to_user_ptr(READ_ONCE(sqe->addr)); + conn->addr_len = READ_ONCE(sqe->addr2); + + if (!io) + return 0; + + return move_addr_to_kernel(conn->addr, conn->addr_len, &io->connect.address); - req->sqe = NULL; - return ret; #else return -EOPNOTSUPP; #endif @@ -2444,7 +2443,9 @@ static int io_connect(struct io_kiocb *req, struct io_kiocb **nxt, if (req->io) { io = req->io; } else { - ret = io_connect_prep(req, &__io); + ret = move_addr_to_kernel(req->connect.addr, + req->connect.addr_len, + &__io.connect.address); if (ret) goto out; io = &__io; @@ -2524,12 +2525,9 @@ static int io_poll_cancel(struct io_ring_ctx *ctx, __u64 sqe_addr) return -ENOENT; }
-static int io_poll_remove_prep(struct io_kiocb *req) +static int io_poll_remove_prep(struct io_kiocb *req, + const struct io_uring_sqe *sqe) { - const struct io_uring_sqe *sqe = req->sqe; - - if (!sqe) - return 0; if (unlikely(req->ctx->flags & IORING_SETUP_IOPOLL)) return -EINVAL; if (sqe->ioprio || sqe->off || sqe->len || sqe->buf_index || @@ -2537,7 +2535,6 @@ static int io_poll_remove_prep(struct io_kiocb *req) return -EINVAL;
req->poll.addr = READ_ONCE(sqe->addr); - req->sqe = NULL; return 0; }
@@ -2551,10 +2548,6 @@ static int io_poll_remove(struct io_kiocb *req) u64 addr; int ret;
- ret = io_poll_remove_prep(req); - if (ret) - return ret; - addr = req->poll.addr; spin_lock_irq(&ctx->completion_lock); ret = io_poll_cancel(ctx, addr); @@ -2692,14 +2685,11 @@ static void io_poll_req_insert(struct io_kiocb *req) hlist_add_head(&req->hash_node, list); }
-static int io_poll_add_prep(struct io_kiocb *req) +static int io_poll_add_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe) { - const struct io_uring_sqe *sqe = req->sqe; struct io_poll_iocb *poll = &req->poll; u16 events;
- if (!sqe) - return 0; if (unlikely(req->ctx->flags & IORING_SETUP_IOPOLL)) return -EINVAL; if (sqe->addr || sqe->ioprio || sqe->off || sqe->len || sqe->buf_index) @@ -2709,7 +2699,6 @@ static int io_poll_add_prep(struct io_kiocb *req)
events = READ_ONCE(sqe->poll_events); poll->events = demangle_poll(events) | EPOLLERR | EPOLLHUP; - req->sqe = NULL; return 0; }
@@ -2720,11 +2709,6 @@ static int io_poll_add(struct io_kiocb *req, struct io_kiocb **nxt) struct io_poll_table ipt; bool cancel = false; __poll_t mask; - int ret; - - ret = io_poll_add_prep(req); - if (ret) - return ret;
INIT_IO_WORK(&req->work, io_poll_complete_work); INIT_HLIST_NODE(&req->hash_node); @@ -2843,12 +2827,9 @@ static int io_timeout_cancel(struct io_ring_ctx *ctx, __u64 user_data) return 0; }
-static int io_timeout_remove_prep(struct io_kiocb *req) +static int io_timeout_remove_prep(struct io_kiocb *req, + const struct io_uring_sqe *sqe) { - const struct io_uring_sqe *sqe = req->sqe; - - if (!sqe) - return 0; if (unlikely(req->ctx->flags & IORING_SETUP_IOPOLL)) return -EINVAL; if (sqe->flags || sqe->ioprio || sqe->buf_index || sqe->len) @@ -2859,7 +2840,6 @@ static int io_timeout_remove_prep(struct io_kiocb *req) if (req->timeout.flags) return -EINVAL;
- req->sqe = NULL; return 0; }
@@ -2871,10 +2851,6 @@ static int io_timeout_remove(struct io_kiocb *req) struct io_ring_ctx *ctx = req->ctx; int ret;
- ret = io_timeout_remove_prep(req); - if (ret) - return ret; - spin_lock_irq(&ctx->completion_lock); ret = io_timeout_cancel(ctx, req->timeout.addr);
@@ -2888,15 +2864,12 @@ static int io_timeout_remove(struct io_kiocb *req) return 0; }
-static int io_timeout_prep(struct io_kiocb *req, struct io_async_ctx *io, +static int io_timeout_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe, bool is_timeout_link) { - const struct io_uring_sqe *sqe = req->sqe; struct io_timeout_data *data; unsigned flags;
- if (!sqe) - return 0; if (unlikely(req->ctx->flags & IORING_SETUP_IOPOLL)) return -EINVAL; if (sqe->ioprio || sqe->buf_index || sqe->len != 1) @@ -2909,7 +2882,7 @@ static int io_timeout_prep(struct io_kiocb *req, struct io_async_ctx *io,
req->timeout.count = READ_ONCE(sqe->off);
- if (!io && io_alloc_async_ctx(req)) + if (!req->io && io_alloc_async_ctx(req)) return -ENOMEM;
data = &req->io->timeout; @@ -2925,7 +2898,6 @@ static int io_timeout_prep(struct io_kiocb *req, struct io_async_ctx *io, data->mode = HRTIMER_MODE_REL;
hrtimer_init(&data->timer, CLOCK_MONOTONIC, data->mode); - req->sqe = NULL; return 0; }
@@ -2936,11 +2908,7 @@ static int io_timeout(struct io_kiocb *req) struct io_timeout_data *data; struct list_head *entry; unsigned span = 0; - int ret;
- ret = io_timeout_prep(req, req->io, false); - if (ret) - return ret; data = &req->io->timeout;
/* @@ -3066,12 +3034,9 @@ static void io_async_find_and_cancel(struct io_ring_ctx *ctx, io_put_req_find_next(req, nxt); }
-static int io_async_cancel_prep(struct io_kiocb *req) +static int io_async_cancel_prep(struct io_kiocb *req, + const struct io_uring_sqe *sqe) { - const struct io_uring_sqe *sqe = req->sqe; - - if (!sqe) - return 0; if (unlikely(req->ctx->flags & IORING_SETUP_IOPOLL)) return -EINVAL; if (sqe->flags || sqe->ioprio || sqe->off || sqe->len || @@ -3079,28 +3044,20 @@ static int io_async_cancel_prep(struct io_kiocb *req) return -EINVAL;
req->cancel.addr = READ_ONCE(sqe->addr); - req->sqe = NULL; return 0; }
static int io_async_cancel(struct io_kiocb *req, struct io_kiocb **nxt) { struct io_ring_ctx *ctx = req->ctx; - int ret; - - ret = io_async_cancel_prep(req); - if (ret) - return ret;
io_async_find_and_cancel(ctx, req, req->cancel.addr, nxt, 0); return 0; }
-static int io_req_defer_prep(struct io_kiocb *req) +static int io_req_defer_prep(struct io_kiocb *req, + const struct io_uring_sqe *sqe) { - struct iovec inline_vecs[UIO_FASTIOV], *iovec = inline_vecs; - struct io_async_ctx *io = req->io; - struct iov_iter iter; ssize_t ret = 0;
switch (req->opcode) { @@ -3108,61 +3065,47 @@ static int io_req_defer_prep(struct io_kiocb *req) break; case IORING_OP_READV: case IORING_OP_READ_FIXED: - /* ensure prep does right import */ - req->io = NULL; - ret = io_read_prep(req, &iovec, &iter, true); - req->io = io; - if (ret < 0) - break; - io_req_map_rw(req, ret, iovec, inline_vecs, &iter); - ret = 0; + ret = io_read_prep(req, sqe, true); break; case IORING_OP_WRITEV: case IORING_OP_WRITE_FIXED: - /* ensure prep does right import */ - req->io = NULL; - ret = io_write_prep(req, &iovec, &iter, true); - req->io = io; - if (ret < 0) - break; - io_req_map_rw(req, ret, iovec, inline_vecs, &iter); - ret = 0; + ret = io_write_prep(req, sqe, true); break; case IORING_OP_POLL_ADD: - ret = io_poll_add_prep(req); + ret = io_poll_add_prep(req, sqe); break; case IORING_OP_POLL_REMOVE: - ret = io_poll_remove_prep(req); + ret = io_poll_remove_prep(req, sqe); break; case IORING_OP_FSYNC: - ret = io_prep_fsync(req); + ret = io_prep_fsync(req, sqe); break; case IORING_OP_SYNC_FILE_RANGE: - ret = io_prep_sfr(req); + ret = io_prep_sfr(req, sqe); break; case IORING_OP_SENDMSG: - ret = io_sendmsg_prep(req, io); + ret = io_sendmsg_prep(req, sqe); break; case IORING_OP_RECVMSG: - ret = io_recvmsg_prep(req, io); + ret = io_recvmsg_prep(req, sqe); break; case IORING_OP_CONNECT: - ret = io_connect_prep(req, io); + ret = io_connect_prep(req, sqe); break; case IORING_OP_TIMEOUT: - ret = io_timeout_prep(req, io, false); + ret = io_timeout_prep(req, sqe, false); break; case IORING_OP_TIMEOUT_REMOVE: - ret = io_timeout_remove_prep(req); + ret = io_timeout_remove_prep(req, sqe); break; case IORING_OP_ASYNC_CANCEL: - ret = io_async_cancel_prep(req); + ret = io_async_cancel_prep(req, sqe); break; case IORING_OP_LINK_TIMEOUT: - ret = io_timeout_prep(req, io, true); + ret = io_timeout_prep(req, sqe, true); break; case IORING_OP_ACCEPT: - ret = io_accept_prep(req); + ret = io_accept_prep(req, sqe); break; default: printk_once(KERN_WARNING "io_uring: unhandled opcode %d\n", @@ -3174,7 +3117,7 @@ static int io_req_defer_prep(struct io_kiocb *req) return ret; }
-static int io_req_defer(struct io_kiocb *req) +static int io_req_defer(struct io_kiocb *req, const struct io_uring_sqe *sqe) { struct io_ring_ctx *ctx = req->ctx; int ret; @@ -3183,10 +3126,10 @@ static int io_req_defer(struct io_kiocb *req) if (!req_need_defer(req) && list_empty(&ctx->defer_list)) return 0;
- if (io_alloc_async_ctx(req)) + if (!req->io && io_alloc_async_ctx(req)) return -EAGAIN;
- ret = io_req_defer_prep(req); + ret = io_req_defer_prep(req, sqe); if (ret < 0) return ret;
@@ -3202,9 +3145,8 @@ static int io_req_defer(struct io_kiocb *req) return -EIOCBQUEUED; }
-__attribute__((nonnull)) -static int io_issue_sqe(struct io_kiocb *req, struct io_kiocb **nxt, - bool force_nonblock) +static int io_issue_sqe(struct io_kiocb *req, const struct io_uring_sqe *sqe, + struct io_kiocb **nxt, bool force_nonblock) { struct io_ring_ctx *ctx = req->ctx; int ret; @@ -3214,48 +3156,109 @@ static int io_issue_sqe(struct io_kiocb *req, struct io_kiocb **nxt, ret = io_nop(req); break; case IORING_OP_READV: - ret = io_read(req, nxt, force_nonblock); - break; - case IORING_OP_WRITEV: - ret = io_write(req, nxt, force_nonblock); - break; case IORING_OP_READ_FIXED: + if (sqe) { + ret = io_read_prep(req, sqe, force_nonblock); + if (ret < 0) + break; + } ret = io_read(req, nxt, force_nonblock); break; + case IORING_OP_WRITEV: case IORING_OP_WRITE_FIXED: + if (sqe) { + ret = io_write_prep(req, sqe, force_nonblock); + if (ret < 0) + break; + } ret = io_write(req, nxt, force_nonblock); break; case IORING_OP_FSYNC: + if (sqe) { + ret = io_prep_fsync(req, sqe); + if (ret < 0) + break; + } ret = io_fsync(req, nxt, force_nonblock); break; case IORING_OP_POLL_ADD: + if (sqe) { + ret = io_poll_add_prep(req, sqe); + if (ret) + break; + } ret = io_poll_add(req, nxt); break; case IORING_OP_POLL_REMOVE: + if (sqe) { + ret = io_poll_remove_prep(req, sqe); + if (ret < 0) + break; + } ret = io_poll_remove(req); break; case IORING_OP_SYNC_FILE_RANGE: + if (sqe) { + ret = io_prep_sfr(req, sqe); + if (ret < 0) + break; + } ret = io_sync_file_range(req, nxt, force_nonblock); break; case IORING_OP_SENDMSG: + if (sqe) { + ret = io_sendmsg_prep(req, sqe); + if (ret < 0) + break; + } ret = io_sendmsg(req, nxt, force_nonblock); break; case IORING_OP_RECVMSG: + if (sqe) { + ret = io_recvmsg_prep(req, sqe); + if (ret) + break; + } ret = io_recvmsg(req, nxt, force_nonblock); break; case IORING_OP_TIMEOUT: + if (sqe) { + ret = io_timeout_prep(req, sqe, false); + if (ret) + break; + } ret = io_timeout(req); break; case IORING_OP_TIMEOUT_REMOVE: + if (sqe) { + ret = io_timeout_remove_prep(req, sqe); + if (ret) + break; + } ret = io_timeout_remove(req); break; case IORING_OP_ACCEPT: + if (sqe) { + ret = io_accept_prep(req, sqe); + if (ret) + break; + } ret = io_accept(req, nxt, force_nonblock); break; case IORING_OP_CONNECT: + if (sqe) { + ret = io_connect_prep(req, sqe); + if (ret) + break; + } ret = io_connect(req, nxt, force_nonblock); break; case IORING_OP_ASYNC_CANCEL: + if (sqe) { + ret = io_async_cancel_prep(req, sqe); + if (ret) + break; + } ret = io_async_cancel(req, nxt); break; default: @@ -3299,7 +3302,7 @@ static void io_wq_submit_work(struct io_wq_work **workptr) req->has_user = (work->flags & IO_WQ_WORK_HAS_MM) != 0; req->in_async = true; do { - ret = io_issue_sqe(req, &nxt, false); + ret = io_issue_sqe(req, NULL, &nxt, false); /* * We can get EAGAIN for polled IO even though we're * forcing a sync submission from here, since we can't @@ -3365,14 +3368,15 @@ static inline struct file *io_file_from_index(struct io_ring_ctx *ctx, return table->files[index & IORING_FILE_TABLE_MASK]; }
-static int io_req_set_file(struct io_submit_state *state, struct io_kiocb *req) +static int io_req_set_file(struct io_submit_state *state, struct io_kiocb *req, + const struct io_uring_sqe *sqe) { struct io_ring_ctx *ctx = req->ctx; unsigned flags; int fd, ret;
- flags = READ_ONCE(req->sqe->flags); - fd = READ_ONCE(req->sqe->fd); + flags = READ_ONCE(sqe->flags); + fd = READ_ONCE(sqe->fd);
if (flags & IOSQE_IO_DRAIN) req->flags |= REQ_F_IO_DRAIN; @@ -3504,7 +3508,7 @@ static struct io_kiocb *io_prep_linked_timeout(struct io_kiocb *req) return nxt; }
-static void __io_queue_sqe(struct io_kiocb *req) +static void __io_queue_sqe(struct io_kiocb *req, const struct io_uring_sqe *sqe) { struct io_kiocb *linked_timeout; struct io_kiocb *nxt = NULL; @@ -3513,7 +3517,7 @@ static void __io_queue_sqe(struct io_kiocb *req) again: linked_timeout = io_prep_linked_timeout(req);
- ret = io_issue_sqe(req, &nxt, true); + ret = io_issue_sqe(req, sqe, &nxt, true);
/* * We async punt it if the file wasn't marked NOWAIT, or if the file @@ -3560,7 +3564,7 @@ static void __io_queue_sqe(struct io_kiocb *req) } }
-static void io_queue_sqe(struct io_kiocb *req) +static void io_queue_sqe(struct io_kiocb *req, const struct io_uring_sqe *sqe) { int ret;
@@ -3570,7 +3574,7 @@ static void io_queue_sqe(struct io_kiocb *req) } req->ctx->drain_next = (req->flags & REQ_F_DRAIN_LINK);
- ret = io_req_defer(req); + ret = io_req_defer(req, sqe); if (ret) { if (ret != -EIOCBQUEUED) { io_cqring_add_event(req, ret); @@ -3578,7 +3582,7 @@ static void io_queue_sqe(struct io_kiocb *req) io_double_put_req(req); } } else - __io_queue_sqe(req); + __io_queue_sqe(req, sqe); }
static inline void io_queue_link_head(struct io_kiocb *req) @@ -3587,25 +3591,25 @@ static inline void io_queue_link_head(struct io_kiocb *req) io_cqring_add_event(req, -ECANCELED); io_double_put_req(req); } else - io_queue_sqe(req); + io_queue_sqe(req, NULL); }
#define SQE_VALID_FLAGS (IOSQE_FIXED_FILE|IOSQE_IO_DRAIN|IOSQE_IO_LINK| \ IOSQE_IO_HARDLINK)
-static bool io_submit_sqe(struct io_kiocb *req, struct io_submit_state *state, - struct io_kiocb **link) +static bool io_submit_sqe(struct io_kiocb *req, const struct io_uring_sqe *sqe, + struct io_submit_state *state, struct io_kiocb **link) { struct io_ring_ctx *ctx = req->ctx; int ret;
/* enforce forwards compatibility on users */ - if (unlikely(req->sqe->flags & ~SQE_VALID_FLAGS)) { + if (unlikely(sqe->flags & ~SQE_VALID_FLAGS)) { ret = -EINVAL; goto err_req; }
- ret = io_req_set_file(state, req); + ret = io_req_set_file(state, req, sqe); if (unlikely(ret)) { err_req: io_cqring_add_event(req, ret); @@ -3623,10 +3627,10 @@ static bool io_submit_sqe(struct io_kiocb *req, struct io_submit_state *state, if (*link) { struct io_kiocb *prev = *link;
- if (req->sqe->flags & IOSQE_IO_DRAIN) + if (sqe->flags & IOSQE_IO_DRAIN) (*link)->flags |= REQ_F_DRAIN_LINK | REQ_F_IO_DRAIN;
- if (req->sqe->flags & IOSQE_IO_HARDLINK) + if (sqe->flags & IOSQE_IO_HARDLINK) req->flags |= REQ_F_HARDLINK;
if (io_alloc_async_ctx(req)) { @@ -3634,7 +3638,7 @@ static bool io_submit_sqe(struct io_kiocb *req, struct io_submit_state *state, goto err_req; }
- ret = io_req_defer_prep(req); + ret = io_req_defer_prep(req, sqe); if (ret) { /* fail even hard links since we don't submit */ prev->flags |= REQ_F_FAIL_LINK; @@ -3642,15 +3646,18 @@ static bool io_submit_sqe(struct io_kiocb *req, struct io_submit_state *state, } trace_io_uring_link(ctx, req, prev); list_add_tail(&req->link_list, &prev->link_list); - } else if (req->sqe->flags & (IOSQE_IO_LINK|IOSQE_IO_HARDLINK)) { + } else if (sqe->flags & (IOSQE_IO_LINK|IOSQE_IO_HARDLINK)) { req->flags |= REQ_F_LINK; - if (req->sqe->flags & IOSQE_IO_HARDLINK) + if (sqe->flags & IOSQE_IO_HARDLINK) req->flags |= REQ_F_HARDLINK;
INIT_LIST_HEAD(&req->link_list); + ret = io_req_defer_prep(req, sqe); + if (ret) + req->flags |= REQ_F_FAIL_LINK; *link = req; } else { - io_queue_sqe(req); + io_queue_sqe(req, sqe); }
return true; @@ -3695,14 +3702,15 @@ static void io_commit_sqring(struct io_ring_ctx *ctx) }
/* - * Fetch an sqe, if one is available. Note that req->sqe will point to memory + * Fetch an sqe, if one is available. Note that sqe_ptr will point to memory * that is mapped by userspace. This means that care needs to be taken to * ensure that reads are stable, as we cannot rely on userspace always * being a good citizen. If members of the sqe are validated and then later * used, it's important that those reads are done through READ_ONCE() to * prevent a re-load down the line. */ -static bool io_get_sqring(struct io_ring_ctx *ctx, struct io_kiocb *req) +static bool io_get_sqring(struct io_ring_ctx *ctx, struct io_kiocb *req, + const struct io_uring_sqe **sqe_ptr) { struct io_rings *rings = ctx->rings; u32 *sq_array = ctx->sq_array; @@ -3729,9 +3737,9 @@ static bool io_get_sqring(struct io_ring_ctx *ctx, struct io_kiocb *req) * link list. */ req->sequence = ctx->cached_sq_head; - req->sqe = &ctx->sq_sqes[head]; - req->opcode = READ_ONCE(req->sqe->opcode); - req->user_data = READ_ONCE(req->sqe->user_data); + *sqe_ptr = &ctx->sq_sqes[head]; + req->opcode = READ_ONCE((*sqe_ptr)->opcode); + req->user_data = READ_ONCE((*sqe_ptr)->user_data); ctx->cached_sq_head++; return true; } @@ -3763,6 +3771,7 @@ static int io_submit_sqes(struct io_ring_ctx *ctx, unsigned int nr, }
for (i = 0; i < nr; i++) { + const struct io_uring_sqe *sqe; struct io_kiocb *req; unsigned int sqe_flags;
@@ -3772,7 +3781,7 @@ static int io_submit_sqes(struct io_ring_ctx *ctx, unsigned int nr, submitted = -EAGAIN; break; } - if (!io_get_sqring(ctx, req)) { + if (!io_get_sqring(ctx, req, &sqe)) { __io_free_req(req); break; } @@ -3786,7 +3795,7 @@ static int io_submit_sqes(struct io_ring_ctx *ctx, unsigned int nr, }
submitted++; - sqe_flags = req->sqe->flags; + sqe_flags = sqe->flags;
req->ring_file = ring_file; req->ring_fd = ring_fd; @@ -3794,7 +3803,7 @@ static int io_submit_sqes(struct io_ring_ctx *ctx, unsigned int nr, req->in_async = async; req->needs_fixed_file = async; trace_io_uring_submit_sqe(ctx, req->user_data, true, async); - if (!io_submit_sqe(req, statep, &link)) + if (!io_submit_sqe(req, sqe, statep, &link)) break; /* * If previous wasn't linked and we have a linked command,
From: Hillf Danton hdanton@sina.com
mainline inclusion from mainline-5.5-rc4 commit 1f424e8bd18754d27b15f49359004b0cea344fb5 category: feature bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27 CVE: NA ---------------------------
Commit e61df66c69b1 ("io-wq: ensure free/busy list browsing see all items") added a list of all io workers in addition to the free and busy lists. That not only made the worker walk cleaner, it also left the busy list unused. Let's remove it.
Signed-off-by: Hillf Danton hdanton@sina.com Signed-off-by: Jens Axboe axboe@kernel.dk Signed-off-by: yangerkun yangerkun@huawei.com Reviewed-by: zhangyi (F) yi.zhang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- fs/io-wq.c | 8 -------- 1 file changed, 8 deletions(-)
diff --git a/fs/io-wq.c b/fs/io-wq.c index e38e3c6e30f7..8adc2821b0cc 100644 --- a/fs/io-wq.c +++ b/fs/io-wq.c @@ -93,7 +93,6 @@ struct io_wqe { struct io_wqe_acct acct[2];
struct hlist_nulls_head free_list; - struct hlist_nulls_head busy_list; struct list_head all_list;
struct io_wq *wq; @@ -328,7 +327,6 @@ static void __io_worker_busy(struct io_wqe *wqe, struct io_worker *worker, if (worker->flags & IO_WORKER_F_FREE) { worker->flags &= ~IO_WORKER_F_FREE; hlist_nulls_del_init_rcu(&worker->nulls_node); - hlist_nulls_add_head_rcu(&worker->nulls_node, &wqe->busy_list); }
/* @@ -366,7 +364,6 @@ static bool __io_worker_idle(struct io_wqe *wqe, struct io_worker *worker) { if (!(worker->flags & IO_WORKER_F_FREE)) { worker->flags |= IO_WORKER_F_FREE; - hlist_nulls_del_init_rcu(&worker->nulls_node); hlist_nulls_add_head_rcu(&worker->nulls_node, &wqe->free_list); }
@@ -799,10 +796,6 @@ void io_wq_cancel_all(struct io_wq *wq)
set_bit(IO_WQ_BIT_CANCEL, &wq->state);
- /* - * Browse both lists, as there's a gap between handing work off - * to a worker and the worker putting itself on the busy_list - */ rcu_read_lock(); for_each_node(node) { struct io_wqe *wqe = wq->wqes[node]; @@ -1050,7 +1043,6 @@ struct io_wq *io_wq_create(unsigned bounded, struct io_wq_data *data) spin_lock_init(&wqe->lock); INIT_WQ_LIST(&wqe->work_list); INIT_HLIST_NULLS_HEAD(&wqe->free_list, 0); - INIT_HLIST_NULLS_HEAD(&wqe->busy_list, 1); INIT_LIST_HEAD(&wqe->all_list); }
From: Hillf Danton hdanton@sina.com
mainline inclusion from mainline-5.5-rc4 commit fd1c4bc6e9b34a5e4fe7a3130a49380ef9d7037c category: feature bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27 CVE: NA ---------------------------
Reschedule the current IO worker between work items to reduce the risk of it becoming a cpu hog.
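For context, this is the generic pattern for any long-running kernel loop. A minimal sketch of the idea (not the io-wq code itself; get_next_work() and do_work() are hypothetical stand-ins):

    #include <linux/kthread.h>
    #include <linux/sched.h>
    #include <linux/sched/signal.h>

    void *get_next_work(void);   /* hypothetical: pull the next queued item */
    void do_work(void *work);    /* hypothetical: process it */

    static int worker_fn(void *data)
    {
        while (!kthread_should_stop()) {
            void *work = get_next_work();

            if (signal_pending(current))
                flush_signals(current);

            /* yield the cpu if something else is runnable, so a steady
             * stream of work items can't monopolize this thread */
            cond_resched();

            if (work)
                do_work(work);
        }
        return 0;
    }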
Signed-off-by: Hillf Danton hdanton@sina.com Signed-off-by: Jens Axboe axboe@kernel.dk Signed-off-by: yangerkun yangerkun@huawei.com Reviewed-by: zhangyi (F) yi.zhang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- fs/io-wq.c | 2 ++ 1 file changed, 2 insertions(+)
diff --git a/fs/io-wq.c b/fs/io-wq.c index 8adc2821b0cc..1b5889c89e9b 100644 --- a/fs/io-wq.c +++ b/fs/io-wq.c @@ -430,6 +430,8 @@ static void io_worker_handle_work(struct io_worker *worker) if (signal_pending(current)) flush_signals(current);
+ cond_resched(); + spin_lock_irq(&worker->lock); worker->cur_work = work; spin_unlock_irq(&worker->lock);
From: Jens Axboe axboe@kernel.dk
mainline inclusion from mainline-5.5-rc6 commit eacc6dfaea963ef61540abb31ad7829be5eff284 category: feature bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27 CVE: NA ---------------------------
We currently punt any short read on a regular file to async context, but this fails if the short read is due to running into EOF. This is especially problematic now that we only do a single prep for commands, as we don't reset kiocb->ki_pos. A 4k read on a 1k file can then return zero: we detect the short read and retry from async context, but at the time of the retry the position is already 1k, so we end up reading nothing and hence return 0.
Instead of trying to patch around the fact that short reads can be legitimate and that a retry won't succeed, remove the logic that punts a short read to async context. Simply return it.
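To illustrate the visible effect, here is a minimal userspace sketch, assuming liburing's helpers and a pre-existing 1k test file (the file name is an assumption). With this change the CQE carries the short read (res == 1024) rather than a spurious 0 from the retry:

    #include <liburing.h>
    #include <fcntl.h>
    #include <stdio.h>

    int main(void)
    {
        struct io_uring ring;
        struct io_uring_cqe *cqe;
        static char buf[4096];
        struct iovec iov = { .iov_base = buf, .iov_len = sizeof(buf) };
        int fd = open("one-kb-file", O_RDONLY);   /* assumed 1k test file */

        if (fd < 0 || io_uring_queue_init(8, &ring, 0) < 0)
            return 1;

        struct io_uring_sqe *sqe = io_uring_get_sqe(&ring);
        io_uring_prep_readv(sqe, fd, &iov, 1, 0);
        io_uring_submit(&ring);

        io_uring_wait_cqe(&ring, &cqe);
        printf("res = %d\n", cqe->res);   /* expect 1024, the short read */
        io_uring_cqe_seen(&ring, cqe);
        io_uring_queue_exit(&ring);
        return 0;
    }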
Signed-off-by: Jens Axboe axboe@kernel.dk Signed-off-by: yangerkun yangerkun@huawei.com Reviewed-by: zhangyi (F) yi.zhang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- fs/io_uring.c | 12 ------------ 1 file changed, 12 deletions(-)
diff --git a/fs/io_uring.c b/fs/io_uring.c index 8b4faa21e2f1..4bb2e5bb843e 100644 --- a/fs/io_uring.c +++ b/fs/io_uring.c @@ -1862,18 +1862,6 @@ static int io_read(struct io_kiocb *req, struct io_kiocb **nxt, else ret2 = loop_rw_iter(READ, req->file, kiocb, &iter);
- /* - * In case of a short read, punt to async. This can happen - * if we have data partially cached. Alternatively we can - * return the short read, in which case the application will - * need to issue another SQE and wait for it. That SQE will - * need async punt anyway, so it's more efficient to do it - * here. - */ - if (force_nonblock && !(req->flags & REQ_F_NOWAIT) && - (req->flags & REQ_F_ISREG) && - ret2 > 0 && ret2 < io_size) - ret2 = -EAGAIN; /* Catch -EAGAIN return for forced non-blocking submission */ if (!force_nonblock || ret2 != -EAGAIN) { kiocb_done(kiocb, ret2, nxt, req->in_async);
From: Jens Axboe axboe@kernel.dk
mainline inclusion from mainline-5.5-rc7 commit 74566df3a71c1b92da608868cca787557d8be7b2 category: feature bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27 CVE: NA ---------------------------
We don't need it, and if we have it, the retry handler will attempt to copy the non-existent iovec into the inline iovec, with a segment count that doesn't make sense.
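For reference, the fixed-buffer ops this exempts are driven entirely by a pre-registered buffer index, with no iovec import. A fragment, assuming liburing and an initialized ring:

    #include <liburing.h>

    /* assumes 'ring' was set up with io_uring_queue_init() */
    static int read_with_fixed_buf(struct io_uring *ring, int fd, void *buf,
                                   unsigned len)
    {
        struct iovec iov = { .iov_base = buf, .iov_len = len };
        struct io_uring_sqe *sqe;

        /* pin the buffer once up front; index 0 refers to it afterwards */
        if (io_uring_register_buffers(ring, &iov, 1) < 0)
            return -1;

        sqe = io_uring_get_sqe(ring);
        io_uring_prep_read_fixed(sqe, fd, buf, len, 0, 0 /* buf_index */);
        return io_uring_submit(ring);
    }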
Fixes: f67676d160c6 ("io_uring: ensure async punted read/write requests copy iovec") Reported-by: Jonathan Lemon jonathan.lemon@gmail.com Signed-off-by: Jens Axboe axboe@kernel.dk Signed-off-by: yangerkun yangerkun@huawei.com Reviewed-by: zhangyi (F) yi.zhang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- fs/io_uring.c | 3 +++ 1 file changed, 3 insertions(+)
diff --git a/fs/io_uring.c b/fs/io_uring.c index 4bb2e5bb843e..83bcc0a0a851 100644 --- a/fs/io_uring.c +++ b/fs/io_uring.c @@ -1785,6 +1785,9 @@ static int io_setup_async_rw(struct io_kiocb *req, ssize_t io_size, struct iovec *iovec, struct iovec *fast_iov, struct iov_iter *iter) { + if (req->opcode == IORING_OP_READ_FIXED || + req->opcode == IORING_OP_WRITE_FIXED) + return 0; if (!req->io && io_alloc_async_ctx(req)) return -ENOMEM;
From: Jens Axboe axboe@kernel.dk
mainline inclusion from mainline-5.6-rc1 commit ce35a47a3a0208a77b4d31b7f2e8ed57d624093d category: feature bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27 CVE: NA ---------------------------
io_uring defaults to always doing inline submissions, if at all possible. But for larger copies, even if the data is fully cached, that can take a long time. Add an IOSQE_ASYNC flag that the application can set on the SQE - if set, it'll ensure that we always go async for those kinds of requests. Use the io-wq IO_WQ_WORK_CONCURRENT flag to ensure we get the concurrency we desire for this case.
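From the application side, opting in is one flag on the SQE. A minimal fragment, assuming liburing's helpers and an already-initialized ring (not part of the patch):

    #include <liburing.h>

    /* force a large buffered read straight to async context */
    static int queue_async_read(struct io_uring *ring, int fd,
                                struct iovec *iov, unsigned nr)
    {
        struct io_uring_sqe *sqe = io_uring_get_sqe(ring);

        io_uring_prep_readv(sqe, fd, iov, nr, 0);
        /* skip the inline submission attempt entirely, go to io-wq */
        io_uring_sqe_set_flags(sqe, IOSQE_ASYNC);
        return io_uring_submit(ring);
    }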
Signed-off-by: Jens Axboe axboe@kernel.dk Signed-off-by: yangerkun yangerkun@huawei.com Reviewed-by: zhangyi (F) yi.zhang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- fs/io_uring.c | 16 ++++++++++++++-- include/uapi/linux/io_uring.h | 1 + 2 files changed, 15 insertions(+), 2 deletions(-)
diff --git a/fs/io_uring.c b/fs/io_uring.c index b8e5b742a00a..29d67e40e81d 100644 --- a/fs/io_uring.c +++ b/fs/io_uring.c @@ -483,6 +483,7 @@ struct io_kiocb { #define REQ_F_INFLIGHT 16384 /* on inflight list */ #define REQ_F_COMP_LOCKED 32768 /* completion under lock */ #define REQ_F_HARDLINK 65536 /* doesn't sever on completion < 0 */ +#define REQ_F_FORCE_ASYNC 131072 /* IOSQE_ASYNC */ u64 user_data; u32 result; u32 sequence; @@ -4014,8 +4015,17 @@ static void io_queue_sqe(struct io_kiocb *req, const struct io_uring_sqe *sqe) req_set_fail_links(req); io_double_put_req(req); } - } else + } else if ((req->flags & REQ_F_FORCE_ASYNC) && + !io_wq_current_is_worker()) { + /* + * Never try inline submit of IOSQE_ASYNC is set, go straight + * to async execution. + */ + req->work.flags |= IO_WQ_WORK_CONCURRENT; + io_queue_async_work(req); + } else { __io_queue_sqe(req, sqe); + } }
static inline void io_queue_link_head(struct io_kiocb *req) @@ -4028,7 +4038,7 @@ static inline void io_queue_link_head(struct io_kiocb *req) }
#define SQE_VALID_FLAGS (IOSQE_FIXED_FILE|IOSQE_IO_DRAIN|IOSQE_IO_LINK| \ - IOSQE_IO_HARDLINK) + IOSQE_IO_HARDLINK | IOSQE_ASYNC)
static bool io_submit_sqe(struct io_kiocb *req, const struct io_uring_sqe *sqe, struct io_submit_state *state, struct io_kiocb **link) @@ -4041,6 +4051,8 @@ static bool io_submit_sqe(struct io_kiocb *req, const struct io_uring_sqe *sqe, ret = -EINVAL; goto err_req; } + if (sqe->flags & IOSQE_ASYNC) + req->flags |= REQ_F_FORCE_ASYNC;
ret = io_req_set_file(state, req, sqe); if (unlikely(ret)) { diff --git a/include/uapi/linux/io_uring.h b/include/uapi/linux/io_uring.h index 3f45f7c543de..d7ec50247a3a 100644 --- a/include/uapi/linux/io_uring.h +++ b/include/uapi/linux/io_uring.h @@ -51,6 +51,7 @@ struct io_uring_sqe { #define IOSQE_IO_DRAIN (1U << 1) /* issue after inflight IO */ #define IOSQE_IO_LINK (1U << 2) /* links next sqe */ #define IOSQE_IO_HARDLINK (1U << 3) /* like LINK, but stronger */ +#define IOSQE_ASYNC (1U << 4) /* always go async */
/* * io_uring_setup() flags
From: Pavel Begunkov asml.silence@gmail.com
mainline inclusion from mainline-5.6-rc1 commit 9d76377f7e13c19441fdd066033345289f89b5fe category: feature bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27 CVE: NA ---------------------------
Calling "prev" a head of a link is a bit misleading. Rename it
Signed-off-by: Pavel Begunkov asml.silence@gmail.com Signed-off-by: Jens Axboe axboe@kernel.dk Signed-off-by: yangerkun yangerkun@huawei.com Reviewed-by: zhangyi (F) yi.zhang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- fs/io_uring.c | 10 +++++----- 1 file changed, 5 insertions(+), 5 deletions(-)
diff --git a/fs/io_uring.c b/fs/io_uring.c index 29d67e40e81d..d481b9ae8715 100644 --- a/fs/io_uring.c +++ b/fs/io_uring.c @@ -4070,10 +4070,10 @@ static bool io_submit_sqe(struct io_kiocb *req, const struct io_uring_sqe *sqe, * conditions are true (normal request), then just queue it. */ if (*link) { - struct io_kiocb *prev = *link; + struct io_kiocb *head = *link;
if (sqe->flags & IOSQE_IO_DRAIN) - (*link)->flags |= REQ_F_DRAIN_LINK | REQ_F_IO_DRAIN; + head->flags |= REQ_F_DRAIN_LINK | REQ_F_IO_DRAIN;
if (sqe->flags & IOSQE_IO_HARDLINK) req->flags |= REQ_F_HARDLINK; @@ -4086,11 +4086,11 @@ static bool io_submit_sqe(struct io_kiocb *req, const struct io_uring_sqe *sqe, ret = io_req_defer_prep(req, sqe); if (ret) { /* fail even hard links since we don't submit */ - prev->flags |= REQ_F_FAIL_LINK; + head->flags |= REQ_F_FAIL_LINK; goto err_req; } - trace_io_uring_link(ctx, req, prev); - list_add_tail(&req->link_list, &prev->link_list); + trace_io_uring_link(ctx, req, head); + list_add_tail(&req->link_list, &head->link_list); } else if (sqe->flags & (IOSQE_IO_LINK|IOSQE_IO_HARDLINK)) { req->flags |= REQ_F_LINK; if (sqe->flags & IOSQE_IO_HARDLINK)
From: Pavel Begunkov asml.silence@gmail.com
mainline inclusion from mainline-v5.6-rc1 commit 32fe525b6d10fec956cfe68f0db76839cd7f0ea5 category: feature bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27 CVE: NA ---------------------------
Move io_queue_link_head() to the link handling code in io_submit_sqe(), so it doesn't need extra checks and gets better data locality.
Signed-off-by: Pavel Begunkov asml.silence@gmail.com Signed-off-by: Jens Axboe axboe@kernel.dk Signed-off-by: yangerkun yangerkun@huawei.com Reviewed-by: zhangyi (F) yi.zhang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- fs/io_uring.c | 32 +++++++++++++++----------------- 1 file changed, 15 insertions(+), 17 deletions(-)
diff --git a/fs/io_uring.c b/fs/io_uring.c index d481b9ae8715..23e549dcc3a1 100644 --- a/fs/io_uring.c +++ b/fs/io_uring.c @@ -4044,14 +4044,17 @@ static bool io_submit_sqe(struct io_kiocb *req, const struct io_uring_sqe *sqe, struct io_submit_state *state, struct io_kiocb **link) { struct io_ring_ctx *ctx = req->ctx; + unsigned int sqe_flags; int ret;
+ sqe_flags = READ_ONCE(sqe->flags); + /* enforce forwards compatibility on users */ - if (unlikely(sqe->flags & ~SQE_VALID_FLAGS)) { + if (unlikely(sqe_flags & ~SQE_VALID_FLAGS)) { ret = -EINVAL; goto err_req; } - if (sqe->flags & IOSQE_ASYNC) + if (sqe_flags & IOSQE_ASYNC) req->flags |= REQ_F_FORCE_ASYNC;
ret = io_req_set_file(state, req, sqe); @@ -4072,10 +4075,10 @@ static bool io_submit_sqe(struct io_kiocb *req, const struct io_uring_sqe *sqe, if (*link) { struct io_kiocb *head = *link;
- if (sqe->flags & IOSQE_IO_DRAIN) + if (sqe_flags & IOSQE_IO_DRAIN) head->flags |= REQ_F_DRAIN_LINK | REQ_F_IO_DRAIN;
- if (sqe->flags & IOSQE_IO_HARDLINK) + if (sqe_flags & IOSQE_IO_HARDLINK) req->flags |= REQ_F_HARDLINK;
if (io_alloc_async_ctx(req)) { @@ -4091,9 +4094,15 @@ static bool io_submit_sqe(struct io_kiocb *req, const struct io_uring_sqe *sqe, } trace_io_uring_link(ctx, req, head); list_add_tail(&req->link_list, &head->link_list); - } else if (sqe->flags & (IOSQE_IO_LINK|IOSQE_IO_HARDLINK)) { + + /* last request of a link, enqueue the link */ + if (!(sqe_flags & (IOSQE_IO_LINK|IOSQE_IO_HARDLINK))) { + io_queue_link_head(head); + *link = NULL; + } + } else if (sqe_flags & (IOSQE_IO_LINK|IOSQE_IO_HARDLINK)) { req->flags |= REQ_F_LINK; - if (sqe->flags & IOSQE_IO_HARDLINK) + if (sqe_flags & IOSQE_IO_HARDLINK) req->flags |= REQ_F_HARDLINK;
INIT_LIST_HEAD(&req->link_list); @@ -4218,7 +4227,6 @@ static int io_submit_sqes(struct io_ring_ctx *ctx, unsigned int nr, for (i = 0; i < nr; i++) { const struct io_uring_sqe *sqe; struct io_kiocb *req; - unsigned int sqe_flags;
req = io_get_req(ctx, statep); if (unlikely(!req)) { @@ -4240,8 +4248,6 @@ static int io_submit_sqes(struct io_ring_ctx *ctx, unsigned int nr, }
submitted++; - sqe_flags = sqe->flags; - req->ring_file = ring_file; req->ring_fd = ring_fd; req->has_user = *mm != NULL; @@ -4250,14 +4256,6 @@ static int io_submit_sqes(struct io_ring_ctx *ctx, unsigned int nr, trace_io_uring_submit_sqe(ctx, req->user_data, true, async); if (!io_submit_sqe(req, sqe, statep, &link)) break; - /* - * If previous wasn't linked and we have a linked command, - * that's the end of the chain. Submit the previous link. - */ - if (!(sqe_flags & (IOSQE_IO_LINK|IOSQE_IO_HARDLINK)) && link) { - io_queue_link_head(link); - link = NULL; - } }
if (link)
From: Jens Axboe axboe@kernel.dk
mainline inclusion from mainline-5.6-rc1 commit add7b6b85a4dfa89283834d181e87ea2144b9028 category: feature bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27 CVE: NA ---------------------------
__io_free_req() and io_double_put_req() aren't used before they are defined, so we can kill these two forward declarations.
Signed-off-by: Jens Axboe axboe@kernel.dk Signed-off-by: yangerkun yangerkun@huawei.com Reviewed-by: zhangyi (F) yi.zhang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- fs/io_uring.c | 2 -- 1 file changed, 2 deletions(-)
diff --git a/fs/io_uring.c b/fs/io_uring.c index 23e549dcc3a1..e50de3e3c341 100644 --- a/fs/io_uring.c +++ b/fs/io_uring.c @@ -518,9 +518,7 @@ struct io_submit_state {
static void io_wq_submit_work(struct io_wq_work **workptr); static void io_cqring_fill_event(struct io_kiocb *req, long res); -static void __io_free_req(struct io_kiocb *req); static void io_put_req(struct io_kiocb *req); -static void io_double_put_req(struct io_kiocb *req); static void __io_double_put_req(struct io_kiocb *req); static struct io_kiocb *io_prep_linked_timeout(struct io_kiocb *req); static void io_queue_linked_timeout(struct io_kiocb *req);
From: Jens Axboe axboe@kernel.dk
mainline inclusion from mainline-5.6-rc1 commit d3656344fea0339fb0365c8df4d2beba4e0089cd category: feature bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27 CVE: NA ---------------------------
We currently have various switch statements that check if an opcode needs a file, mm, etc. These are hard to keep in sync as opcodes are added. Add a struct io_op_def that holds all of this information, so we have just one spot to update when opcodes are added.
This also enables us to NOT allocate req->io if a deferred command doesn't need it, and corrects some mistakes we had in terms of what commands need mm context.
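As a sketch of what the table buys us: wiring up a hypothetical new opcode becomes one table entry next to its handlers, and the BUILD_BUG_ON() the patch adds at init time catches a forgotten entry at compile time. The entry below is purely illustrative, not from the patch:

    /* hypothetical entry, appended in enum order */
    {
        /* IORING_OP_EXAMPLE (hypothetical) */
        .async_ctx = 1,              /* needs req->io for async retry */
        .needs_mm = 1,               /* touches user memory */
        .needs_file = 1,             /* operates on req->file */
        .unbound_nonreg_file = 1,    /* may block forever on sockets/pipes */
    },

    /* the guard from the patch, evaluated at init time: */
    BUILD_BUG_ON(ARRAY_SIZE(io_op_defs) != IORING_OP_LAST);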
Signed-off-by: Jens Axboe axboe@kernel.dk Signed-off-by: yangerkun yangerkun@huawei.com Reviewed-by: zhangyi (F) yi.zhang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- fs/io_uring.c | 208 +++++++++++++++++++++++++++++++++++++------------- 1 file changed, 155 insertions(+), 53 deletions(-)
diff --git a/fs/io_uring.c b/fs/io_uring.c index e50de3e3c341..9216f407ab03 100644 --- a/fs/io_uring.c +++ b/fs/io_uring.c @@ -516,6 +516,135 @@ struct io_submit_state { unsigned int ios_left; };
+struct io_op_def { + /* needs req->io allocated for deferral/async */ + unsigned async_ctx : 1; + /* needs current->mm setup, does mm access */ + unsigned needs_mm : 1; + /* needs req->file assigned */ + unsigned needs_file : 1; + /* needs req->file assigned IFF fd is >= 0 */ + unsigned fd_non_neg : 1; + /* hash wq insertion if file is a regular file */ + unsigned hash_reg_file : 1; + /* unbound wq insertion if file is a non-regular file */ + unsigned unbound_nonreg_file : 1; +}; + +static const struct io_op_def io_op_defs[] = { + { + /* IORING_OP_NOP */ + }, + { + /* IORING_OP_READV */ + .async_ctx = 1, + .needs_mm = 1, + .needs_file = 1, + .unbound_nonreg_file = 1, + }, + { + /* IORING_OP_WRITEV */ + .async_ctx = 1, + .needs_mm = 1, + .needs_file = 1, + .hash_reg_file = 1, + .unbound_nonreg_file = 1, + }, + { + /* IORING_OP_FSYNC */ + .needs_file = 1, + }, + { + /* IORING_OP_READ_FIXED */ + .needs_file = 1, + .unbound_nonreg_file = 1, + }, + { + /* IORING_OP_WRITE_FIXED */ + .needs_file = 1, + .hash_reg_file = 1, + .unbound_nonreg_file = 1, + }, + { + /* IORING_OP_POLL_ADD */ + .needs_file = 1, + .unbound_nonreg_file = 1, + }, + { + /* IORING_OP_POLL_REMOVE */ + }, + { + /* IORING_OP_SYNC_FILE_RANGE */ + .needs_file = 1, + }, + { + /* IORING_OP_SENDMSG */ + .async_ctx = 1, + .needs_mm = 1, + .needs_file = 1, + .unbound_nonreg_file = 1, + }, + { + /* IORING_OP_RECVMSG */ + .async_ctx = 1, + .needs_mm = 1, + .needs_file = 1, + .unbound_nonreg_file = 1, + }, + { + /* IORING_OP_TIMEOUT */ + .async_ctx = 1, + .needs_mm = 1, + }, + { + /* IORING_OP_TIMEOUT_REMOVE */ + }, + { + /* IORING_OP_ACCEPT */ + .needs_mm = 1, + .needs_file = 1, + .unbound_nonreg_file = 1, + }, + { + /* IORING_OP_ASYNC_CANCEL */ + }, + { + /* IORING_OP_LINK_TIMEOUT */ + .async_ctx = 1, + .needs_mm = 1, + }, + { + /* IORING_OP_CONNECT */ + .async_ctx = 1, + .needs_mm = 1, + .needs_file = 1, + .unbound_nonreg_file = 1, + }, + { + /* IORING_OP_FALLOCATE */ + .needs_file = 1, + }, + { + /* IORING_OP_OPENAT */ + .needs_file = 1, + .fd_non_neg = 1, + }, + { + /* IORING_OP_CLOSE */ + .needs_file = 1, + }, + { + /* IORING_OP_FILES_UPDATE */ + .needs_mm = 1, + }, + { + /* IORING_OP_STATX */ + .needs_mm = 1, + .needs_file = 1, + .fd_non_neg = 1, + }, +}; + static void io_wq_submit_work(struct io_wq_work **workptr); static void io_cqring_fill_event(struct io_kiocb *req, long res); static void io_put_req(struct io_kiocb *req); @@ -670,41 +799,20 @@ static void __io_commit_cqring(struct io_ring_ctx *ctx) } }
-static inline bool io_req_needs_user(struct io_kiocb *req) -{ - return !(req->opcode == IORING_OP_READ_FIXED || - req->opcode == IORING_OP_WRITE_FIXED); -} - static inline bool io_prep_async_work(struct io_kiocb *req, struct io_kiocb **link) { + const struct io_op_def *def = &io_op_defs[req->opcode]; bool do_hashed = false;
- switch (req->opcode) { - case IORING_OP_WRITEV: - case IORING_OP_WRITE_FIXED: - /* only regular files should be hashed for writes */ - if (req->flags & REQ_F_ISREG) + if (req->flags & REQ_F_ISREG) { + if (def->hash_reg_file) do_hashed = true; - /* fall-through */ - case IORING_OP_READV: - case IORING_OP_READ_FIXED: - case IORING_OP_SENDMSG: - case IORING_OP_RECVMSG: - case IORING_OP_ACCEPT: - case IORING_OP_POLL_ADD: - case IORING_OP_CONNECT: - /* - * We know REQ_F_ISREG is not set on some of these - * opcodes, but this enables us to keep the check in - * just one place. - */ - if (!(req->flags & REQ_F_ISREG)) + } else { + if (def->unbound_nonreg_file) req->work.flags |= IO_WQ_WORK_UNBOUND; - break; } - if (io_req_needs_user(req)) + if (def->needs_mm) req->work.flags |= IO_WQ_WORK_NEEDS_USER;
*link = io_prep_linked_timeout(req); @@ -1825,6 +1933,8 @@ static void io_req_map_rw(struct io_kiocb *req, ssize_t io_size,
static int io_alloc_async_ctx(struct io_kiocb *req) { + if (!io_op_defs[req->opcode].async_ctx) + return 0; req->io = kmalloc(sizeof(*req->io), GFP_KERNEL); return req->io == NULL; } @@ -3762,29 +3872,13 @@ static void io_wq_submit_work(struct io_wq_work **workptr) io_wq_assign_next(workptr, nxt); }
-static bool io_req_op_valid(int op) -{ - return op >= IORING_OP_NOP && op < IORING_OP_LAST; -} - static int io_req_needs_file(struct io_kiocb *req, int fd) { - switch (req->opcode) { - case IORING_OP_NOP: - case IORING_OP_POLL_REMOVE: - case IORING_OP_TIMEOUT: - case IORING_OP_TIMEOUT_REMOVE: - case IORING_OP_ASYNC_CANCEL: - case IORING_OP_LINK_TIMEOUT: + if (!io_op_defs[req->opcode].needs_file) return 0; - case IORING_OP_OPENAT: - case IORING_OP_STATX: - return fd != -1; - default: - if (io_req_op_valid(req->opcode)) - return 1; - return -EINVAL; - } + if (fd == -1 && io_op_defs[req->opcode].fd_non_neg) + return 0; + return 1; }
static inline struct file *io_file_from_index(struct io_ring_ctx *ctx, @@ -3801,7 +3895,7 @@ static int io_req_set_file(struct io_submit_state *state, struct io_kiocb *req, { struct io_ring_ctx *ctx = req->ctx; unsigned flags; - int fd, ret; + int fd;
flags = READ_ONCE(sqe->flags); fd = READ_ONCE(sqe->fd); @@ -3809,9 +3903,8 @@ static int io_req_set_file(struct io_submit_state *state, struct io_kiocb *req, if (flags & IOSQE_IO_DRAIN) req->flags |= REQ_F_IO_DRAIN;
- ret = io_req_needs_file(req, fd); - if (ret <= 0) - return ret; + if (!io_req_needs_file(req, fd)) + return 0;
if (flags & IOSQE_FIXED_FILE) { if (unlikely(!ctx->file_data || @@ -4237,7 +4330,16 @@ static int io_submit_sqes(struct io_ring_ctx *ctx, unsigned int nr, break; }
- if (io_req_needs_user(req) && !*mm) { + /* will complete beyond this point, count as submitted */ + submitted++; + + if (unlikely(req->opcode >= IORING_OP_LAST)) { + io_cqring_add_event(req, -EINVAL); + io_double_put_req(req); + break; + } + + if (io_op_defs[req->opcode].needs_mm && !*mm) { mm_fault = mm_fault || !mmget_not_zero(ctx->sqo_mm); if (!mm_fault) { use_mm(ctx->sqo_mm); @@ -4245,7 +4347,6 @@ static int io_submit_sqes(struct io_ring_ctx *ctx, unsigned int nr, } }
- submitted++; req->ring_file = ring_file; req->ring_fd = ring_fd; req->has_user = *mm != NULL; @@ -6092,6 +6193,7 @@ SYSCALL_DEFINE4(io_uring_register, unsigned int, fd, unsigned int, opcode,
static int __init io_uring_init(void) { + BUILD_BUG_ON(ARRAY_SIZE(io_op_defs) != IORING_OP_LAST); req_cachep = KMEM_CACHE(io_kiocb, SLAB_HWCACHE_ALIGN | SLAB_PANIC); return 0; };
From: Jens Axboe axboe@kernel.dk
mainline inclusion from mainline-5.6-rc1 commit ad3eb2c89fb24d14ac81f43eff8e85fece2c934d category: feature bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27 CVE: NA ---------------------------
We currently check ->cq_overflow_list from both SQ and CQ context, which causes some bouncing of that cache line. Add separate bits of state for this instead, so that the SQ side can check using its own state, and likewise for the CQ side.
This adds ->sq_check_overflow with the SQ state, and ->cq_check_overflow with the CQ state. If we hit an overflow condition, both of these bits are set. Likewise for overflow flush clear, we clear both bits. For the fast path of just checking if there's an overflow condition on either the SQ or CQ side, we can use our own private bit for this.
Signed-off-by: Jens Axboe axboe@kernel.dk Signed-off-by: yangerkun yangerkun@huawei.com Reviewed-by: zhangyi (F) yi.zhang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- fs/io_uring.c | 40 +++++++++++++++++++++++++++------------- 1 file changed, 27 insertions(+), 13 deletions(-)
diff --git a/fs/io_uring.c b/fs/io_uring.c index 9216f407ab03..44a0166f7d85 100644 --- a/fs/io_uring.c +++ b/fs/io_uring.c @@ -224,13 +224,14 @@ struct io_ring_ctx { unsigned sq_thread_idle; unsigned cached_sq_dropped; atomic_t cached_cq_overflow; - struct io_uring_sqe *sq_sqes; + unsigned long sq_check_overflow;
struct list_head defer_list; struct list_head timeout_list; struct list_head cq_overflow_list;
wait_queue_head_t inflight_wait; + struct io_uring_sqe *sq_sqes; } ____cacheline_aligned_in_smp;
struct io_rings *rings; @@ -272,6 +273,7 @@ struct io_ring_ctx { unsigned cq_entries; unsigned cq_mask; atomic_t cq_timeouts; + unsigned long cq_check_overflow; struct wait_queue_head cq_wait; struct fasync_struct *cq_fasync; struct eventfd_ctx *cq_ev_fd; @@ -949,6 +951,10 @@ static bool io_cqring_overflow_flush(struct io_ring_ctx *ctx, bool force) }
io_commit_cqring(ctx); + if (cqe) { + clear_bit(0, &ctx->sq_check_overflow); + clear_bit(0, &ctx->cq_check_overflow); + } spin_unlock_irqrestore(&ctx->completion_lock, flags); io_cqring_ev_posted(ctx);
@@ -982,6 +988,10 @@ static void io_cqring_fill_event(struct io_kiocb *req, long res) WRITE_ONCE(ctx->rings->cq_overflow, atomic_inc_return(&ctx->cached_cq_overflow)); } else { + if (list_empty(&ctx->cq_overflow_list)) { + set_bit(0, &ctx->sq_check_overflow); + set_bit(0, &ctx->cq_check_overflow); + } refcount_inc(&req->refs); req->result = res; list_add_tail(&req->list, &ctx->cq_overflow_list); @@ -1284,19 +1294,21 @@ static unsigned io_cqring_events(struct io_ring_ctx *ctx, bool noflush) { struct io_rings *rings = ctx->rings;
- /* - * noflush == true is from the waitqueue handler, just ensure we wake - * up the task, and the next invocation will flush the entries. We - * cannot safely to it from here. - */ - if (noflush && !list_empty(&ctx->cq_overflow_list)) - return -1U; + if (test_bit(0, &ctx->cq_check_overflow)) { + /* + * noflush == true is from the waitqueue handler, just ensure + * we wake up the task, and the next invocation will flush the + * entries. We cannot safely to it from here. + */ + if (noflush && !list_empty(&ctx->cq_overflow_list)) + return -1U;
- io_cqring_overflow_flush(ctx, false); + io_cqring_overflow_flush(ctx, false); + }
/* See comment at the top of this file */ smp_rmb(); - return READ_ONCE(rings->cq.tail) - READ_ONCE(rings->cq.head); + return ctx->cached_cq_tail - READ_ONCE(rings->cq.head); }
static inline unsigned int io_sqring_entries(struct io_ring_ctx *ctx) @@ -4306,9 +4318,11 @@ static int io_submit_sqes(struct io_ring_ctx *ctx, unsigned int nr, bool mm_fault = false;
/* if we have a backlog and couldn't flush it all, return BUSY */ - if (!list_empty(&ctx->cq_overflow_list) && - !io_cqring_overflow_flush(ctx, false)) - return -EBUSY; + if (test_bit(0, &ctx->sq_check_overflow)) { + if (!list_empty(&ctx->cq_overflow_list) && + !io_cqring_overflow_flush(ctx, false)) + return -EBUSY; + }
if (nr > IO_PLUG_THRESHOLD) { io_submit_state_start(&state, nr);
From: Jens Axboe axboe@kernel.dk
mainline inclusion from mainline-5.6-rc1 commit e94f141bd248ebdadcb7351f1e70b31cee5add53 category: feature bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27 CVE: NA ---------------------------
For busy IORING_OP_POLL_ADD workloads, we can have enough contention on the completion lock that we fail the inline completion path quite often, as we fail the trylock on that lock. Add a list for deferred completions that we can use in that case. This helps reduce the number of async offloads we have to do: if we get multiple completions in a row, we'll piggyback onto the poll_llist instead of having to queue our own offload.
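The lock-free handoff relies on the standard <linux/llist.h> primitives. A condensed sketch of the pattern, with schedule_flush() and complete_one() as hypothetical stand-ins:

    #include <linux/llist.h>

    struct pending {
        struct llist_node node;
        int result;
    };

    static LLIST_HEAD(pending_list);

    void schedule_flush(void);               /* hypothetical */
    void complete_one(struct pending *p);    /* hypothetical */

    /* producer: callable from any context, no lock needed */
    static void defer_completion(struct pending *p)
    {
        /* llist_add() returns true if the list was previously empty,
         * i.e. this caller is the one that must arrange the flush */
        if (llist_add(&p->node, &pending_list))
            schedule_flush();
    }

    /* single consumer: take the whole batch in one atomic op */
    static void flush_completions(void)
    {
        struct llist_node *nodes = llist_del_all(&pending_list);
        struct pending *p, *tmp;

        llist_for_each_entry_safe(p, tmp, nodes, node)
            complete_one(p);
    }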
Signed-off-by: Jens Axboe axboe@kernel.dk Signed-off-by: yangerkun yangerkun@huawei.com Reviewed-by: zhangyi (F) yi.zhang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- fs/io_uring.c | 108 ++++++++++++++++++++++++++++++++++++++++---------- 1 file changed, 88 insertions(+), 20 deletions(-)
diff --git a/fs/io_uring.c b/fs/io_uring.c index 44a0166f7d85..c96694d7b0fb 100644 --- a/fs/io_uring.c +++ b/fs/io_uring.c @@ -286,7 +286,8 @@ struct io_ring_ctx {
struct { spinlock_t completion_lock; - bool poll_multi_file; + struct llist_head poll_llist; + /* * ->poll_list is protected by the ctx->uring_lock for * io_uring instances that don't use IORING_SETUP_SQPOLL. @@ -296,6 +297,7 @@ struct io_ring_ctx { struct list_head poll_list; struct hlist_head *cancel_hash; unsigned cancel_hash_bits; + bool poll_multi_file;
spinlock_t inflight_lock; struct list_head inflight_list; @@ -453,7 +455,14 @@ struct io_kiocb { };
struct io_async_ctx *io; - struct file *ring_file; + union { + /* + * ring_file is only used in the submission path, and + * llist_node is only used for poll deferred completions + */ + struct file *ring_file; + struct llist_node llist_node; + }; int ring_fd; bool has_user; bool in_async; @@ -724,6 +733,7 @@ static struct io_ring_ctx *io_ring_ctx_alloc(struct io_uring_params *p) mutex_init(&ctx->uring_lock); init_waitqueue_head(&ctx->wait); spin_lock_init(&ctx->completion_lock); + init_llist_head(&ctx->poll_llist); INIT_LIST_HEAD(&ctx->poll_list); INIT_LIST_HEAD(&ctx->defer_list); INIT_LIST_HEAD(&ctx->timeout_list); @@ -1319,6 +1329,20 @@ static inline unsigned int io_sqring_entries(struct io_ring_ctx *ctx) return smp_load_acquire(&rings->sq.tail) - ctx->cached_sq_head; }
+static inline bool io_req_multi_free(struct io_kiocb *req) +{ + /* + * If we're not using fixed files, we have to pair the completion part + * with the file put. Use regular completions for those, only batch + * free for fixed file and non-linked commands. + */ + if (((req->flags & (REQ_F_FIXED_FILE|REQ_F_LINK)) == REQ_F_FIXED_FILE) + && !io_is_fallback_req(req) && !req->io) + return true; + + return false; +} + /* * Find and free completed poll iocbs */ @@ -1338,14 +1362,7 @@ static void io_iopoll_complete(struct io_ring_ctx *ctx, unsigned int *nr_events, (*nr_events)++;
if (refcount_dec_and_test(&req->refs)) { - /* If we're not using fixed files, we have to pair the - * completion part with the file put. Use regular - * completions for those, only batch free for fixed - * file and non-linked commands. - */ - if (((req->flags & (REQ_F_FIXED_FILE|REQ_F_LINK)) == - REQ_F_FIXED_FILE) && !io_is_fallback_req(req) && - !req->io) { + if (io_req_multi_free(req)) { reqs[to_free++] = req; if (to_free == ARRAY_SIZE(reqs)) io_free_req_many(ctx, reqs, &to_free); @@ -3078,6 +3095,44 @@ static void io_poll_complete_work(struct io_wq_work **workptr) io_wq_assign_next(workptr, nxt); }
+static void __io_poll_flush(struct io_ring_ctx *ctx, struct llist_node *nodes) +{ + void *reqs[IO_IOPOLL_BATCH]; + struct io_kiocb *req, *tmp; + int to_free = 0; + + spin_lock_irq(&ctx->completion_lock); + llist_for_each_entry_safe(req, tmp, nodes, llist_node) { + hash_del(&req->hash_node); + io_poll_complete(req, req->result, 0); + + if (refcount_dec_and_test(&req->refs)) { + if (io_req_multi_free(req)) { + reqs[to_free++] = req; + if (to_free == ARRAY_SIZE(reqs)) + io_free_req_many(ctx, reqs, &to_free); + } else { + req->flags |= REQ_F_COMP_LOCKED; + io_free_req(req); + } + } + } + spin_unlock_irq(&ctx->completion_lock); + + io_cqring_ev_posted(ctx); + io_free_req_many(ctx, reqs, &to_free); +} + +static void io_poll_flush(struct io_wq_work **workptr) +{ + struct io_kiocb *req = container_of(*workptr, struct io_kiocb, work); + struct llist_node *nodes; + + nodes = llist_del_all(&req->ctx->poll_llist); + if (nodes) + __io_poll_flush(req->ctx, nodes); +} + static int io_poll_wake(struct wait_queue_entry *wait, unsigned mode, int sync, void *key) { @@ -3085,7 +3140,6 @@ static int io_poll_wake(struct wait_queue_entry *wait, unsigned mode, int sync, struct io_kiocb *req = container_of(poll, struct io_kiocb, poll); struct io_ring_ctx *ctx = req->ctx; __poll_t mask = key_to_poll(key); - unsigned long flags;
/* for instances that support it check for an event match first: */ if (mask && !(mask & poll->events)) @@ -3099,17 +3153,31 @@ static int io_poll_wake(struct wait_queue_entry *wait, unsigned mode, int sync, * If we have a link timeout we're going to need the completion_lock * for finalizing the request, mark us as having grabbed that already. */ - if (mask && spin_trylock_irqsave(&ctx->completion_lock, flags)) { - hash_del(&req->hash_node); - io_poll_complete(req, mask, 0); - req->flags |= REQ_F_COMP_LOCKED; - io_put_req(req); - spin_unlock_irqrestore(&ctx->completion_lock, flags); + if (mask) { + unsigned long flags;
- io_cqring_ev_posted(ctx); - } else { - io_queue_async_work(req); + if (llist_empty(&ctx->poll_llist) && + spin_trylock_irqsave(&ctx->completion_lock, flags)) { + hash_del(&req->hash_node); + io_poll_complete(req, mask, 0); + req->flags |= REQ_F_COMP_LOCKED; + io_put_req(req); + spin_unlock_irqrestore(&ctx->completion_lock, flags); + + io_cqring_ev_posted(ctx); + req = NULL; + } else { + req->result = mask; + req->llist_node.next = NULL; + /* if the list wasn't empty, we're done */ + if (!llist_add(&req->llist_node, &ctx->poll_llist)) + req = NULL; + else + req->work.func = io_poll_flush; + } } + if (req) + io_queue_async_work(req);
return 1; }
From: Jens Axboe axboe@kernel.dk
mainline inclusion from mainline-5.6-rc1 commit 3a6820f2bb8a079975109c25a5d1f29f46bce5d2 category: feature bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27 CVE: NA ---------------------------
For use cases that don't already naturally have an iovec, it's easier (or more convenient) to just use a buffer address + length. This is particularly true if the use case comes from languages that want to create a memory-safe abstraction on top of io_uring, where introducing the need for an iovec may impose an ownership issue. Those cases currently need an indirection buffer, which means allocating data just for this purpose.
Add basic read/write variants that don't require the iovec.
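From userspace the difference is just the prep helper. A fragment, assuming liburing's io_uring_prep_read() and an initialized ring:

    #include <liburing.h>

    /* plain buffer + length, no iovec needed */
    static int queue_read(struct io_uring *ring, int fd, void *buf,
                          unsigned len, __u64 offset)
    {
        struct io_uring_sqe *sqe = io_uring_get_sqe(ring);

        io_uring_prep_read(sqe, fd, buf, len, offset);
        return io_uring_submit(ring);
    }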
Signed-off-by: Jens Axboe axboe@kernel.dk Signed-off-by: yangerkun yangerkun@huawei.com Reviewed-by: zhangyi (F) yi.zhang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- fs/io_uring.c | 23 +++++++++++++++++++++++ include/uapi/linux/io_uring.h | 2 ++ 2 files changed, 25 insertions(+)
diff --git a/fs/io_uring.c b/fs/io_uring.c index c96694d7b0fb..8cb06ca5f21c 100644 --- a/fs/io_uring.c +++ b/fs/io_uring.c @@ -654,6 +654,18 @@ static const struct io_op_def io_op_defs[] = { .needs_file = 1, .fd_non_neg = 1, }, + { + /* IORING_OP_READ */ + .needs_mm = 1, + .needs_file = 1, + .unbound_nonreg_file = 1, + }, + { + /* IORING_OP_WRITE */ + .needs_mm = 1, + .needs_file = 1, + .unbound_nonreg_file = 1, + }, };
static void io_wq_submit_work(struct io_wq_work **workptr); @@ -1866,6 +1878,13 @@ static ssize_t io_import_iovec(int rw, struct io_kiocb *req, if (req->rw.kiocb.private) return -EINVAL;
+ if (opcode == IORING_OP_READ || opcode == IORING_OP_WRITE) { + ssize_t ret; + ret = import_single_range(rw, buf, sqe_len, *iovec, iter); + *iovec = NULL; + return ret; + } + if (req->io) { struct io_async_rw *iorw = &req->io->rw;
@@ -3631,10 +3650,12 @@ static int io_req_defer_prep(struct io_kiocb *req, break; case IORING_OP_READV: case IORING_OP_READ_FIXED: + case IORING_OP_READ: ret = io_read_prep(req, sqe, true); break; case IORING_OP_WRITEV: case IORING_OP_WRITE_FIXED: + case IORING_OP_WRITE: ret = io_write_prep(req, sqe, true); break; case IORING_OP_POLL_ADD: @@ -3738,6 +3759,7 @@ static int io_issue_sqe(struct io_kiocb *req, const struct io_uring_sqe *sqe, break; case IORING_OP_READV: case IORING_OP_READ_FIXED: + case IORING_OP_READ: if (sqe) { ret = io_read_prep(req, sqe, force_nonblock); if (ret < 0) @@ -3747,6 +3769,7 @@ static int io_issue_sqe(struct io_kiocb *req, const struct io_uring_sqe *sqe, break; case IORING_OP_WRITEV: case IORING_OP_WRITE_FIXED: + case IORING_OP_WRITE: if (sqe) { ret = io_write_prep(req, sqe, force_nonblock); if (ret < 0) diff --git a/include/uapi/linux/io_uring.h b/include/uapi/linux/io_uring.h index d7ec50247a3a..7fdf994f3313 100644 --- a/include/uapi/linux/io_uring.h +++ b/include/uapi/linux/io_uring.h @@ -84,6 +84,8 @@ enum { IORING_OP_CLOSE, IORING_OP_FILES_UPDATE, IORING_OP_STATX, + IORING_OP_READ, + IORING_OP_WRITE,
/* this goes last, obviously */ IORING_OP_LAST,
From: Jens Axboe axboe@kernel.dk
mainline inclusion from mainline-5.6-rc1 commit ba04291eb66ed895f194ae5abd3748d72bf8aaea category: feature bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27 CVE: NA ---------------------------
This behaves like preadv2/pwritev2 with offset == -1: it'll use (and update) the current file position. This obviously comes with the caveat that if the application has multiple reads/writes in flight, the end result will not be as expected. This is similar to threads sharing a file descriptor and doing IO using the current file position.
Since this feature isn't easily detectable by doing a read or write, add a feature flag, IORING_FEAT_RW_CUR_POS, to allow applications to detect its presence.
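A minimal userspace sketch, assuming liburing, that probes for the feature and then reads at the current file position:

    #include <liburing.h>
    #include <string.h>

    static int ring_has_cur_pos(void)
    {
        struct io_uring ring;
        struct io_uring_params p;
        int ok;

        memset(&p, 0, sizeof(p));
        if (io_uring_queue_init_params(8, &ring, &p) < 0)
            return 0;
        ok = (p.features & IORING_FEAT_RW_CUR_POS) != 0;
        io_uring_queue_exit(&ring);
        return ok;
    }

    static int read_at_cur_pos(struct io_uring *ring, int fd, void *buf,
                               unsigned len)
    {
        struct io_uring_sqe *sqe = io_uring_get_sqe(ring);

        /* offset of -1 means: use and update the file position */
        io_uring_prep_read(sqe, fd, buf, len, (__u64) -1);
        return io_uring_submit(ring);
    }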
Reported-by: 李通洲 carter.li@eoitek.com Signed-off-by: Jens Axboe axboe@kernel.dk Signed-off-by: yangerkun yangerkun@huawei.com Reviewed-by: zhangyi (F) yi.zhang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- fs/io_uring.c | 11 ++++++++++- include/uapi/linux/io_uring.h | 1 + 2 files changed, 11 insertions(+), 1 deletion(-)
diff --git a/fs/io_uring.c b/fs/io_uring.c index 8cb06ca5f21c..4385714506c2 100644 --- a/fs/io_uring.c +++ b/fs/io_uring.c @@ -495,6 +495,7 @@ struct io_kiocb { #define REQ_F_COMP_LOCKED 32768 /* completion under lock */ #define REQ_F_HARDLINK 65536 /* doesn't sever on completion < 0 */ #define REQ_F_FORCE_ASYNC 131072 /* IOSQE_ASYNC */ +#define REQ_F_CUR_POS 262144 /* read/write uses file position */ u64 user_data; u32 result; u32 sequence; @@ -1710,6 +1711,10 @@ static int io_prep_rw(struct io_kiocb *req, const struct io_uring_sqe *sqe, req->flags |= REQ_F_ISREG;
kiocb->ki_pos = READ_ONCE(sqe->off); + if (kiocb->ki_pos == -1 && !(req->file->f_mode & FMODE_STREAM)) { + req->flags |= REQ_F_CUR_POS; + kiocb->ki_pos = req->file->f_pos; + } kiocb->ki_flags = iocb_flags(kiocb->ki_filp); kiocb->ki_hint = ki_hint_validate(file_write_hint(kiocb->ki_filp));
@@ -1781,6 +1786,10 @@ static inline void io_rw_done(struct kiocb *kiocb, ssize_t ret) static void kiocb_done(struct kiocb *kiocb, ssize_t ret, struct io_kiocb **nxt, bool in_async) { + struct io_kiocb *req = container_of(kiocb, struct io_kiocb, rw.kiocb); + + if (req->flags & REQ_F_CUR_POS) + req->file->f_pos = kiocb->ki_pos; if (in_async && ret >= 0 && kiocb->ki_complete == io_complete_rw) *nxt = __io_complete_rw(kiocb, ret); else @@ -6142,7 +6151,7 @@ static int io_uring_create(unsigned entries, struct io_uring_params *p) goto err;
p->features = IORING_FEAT_SINGLE_MMAP | IORING_FEAT_NODROP | - IORING_FEAT_SUBMIT_STABLE; + IORING_FEAT_SUBMIT_STABLE | IORING_FEAT_RW_CUR_POS; trace_io_uring_create(ret, ctx, p->sq_entries, p->cq_entries, p->flags); return ret; err: diff --git a/include/uapi/linux/io_uring.h b/include/uapi/linux/io_uring.h index 7fdf994f3313..1f96136eb6ee 100644 --- a/include/uapi/linux/io_uring.h +++ b/include/uapi/linux/io_uring.h @@ -174,6 +174,7 @@ struct io_uring_params { #define IORING_FEAT_SINGLE_MMAP (1U << 0) #define IORING_FEAT_NODROP (1U << 1) #define IORING_FEAT_SUBMIT_STABLE (1U << 2) +#define IORING_FEAT_RW_CUR_POS (1U << 3)
/* * io_uring_register(2) opcodes and arguments
From: Jens Axboe axboe@kernel.dk
mainline inclusion from mainline-5.6-rc1 commit 4840e418c2fc533d55ff6caa5b9313eed1d26cfd category: feature bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27 CVE: NA ---------------------------
This adds support for doing fadvise through io_uring. We assume that WILLNEED doesn't block, but that DONTNEED may block.
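A fragment, assuming liburing's io_uring_prep_fadvise() helper and an initialized ring, that drops a file's page cache asynchronously:

    #include <liburing.h>
    #include <fcntl.h>

    static int drop_cache_async(struct io_uring *ring, int fd, __u64 off,
                                off_t len)
    {
        struct io_uring_sqe *sqe = io_uring_get_sqe(ring);

        /* DONTNEED may block, so the kernel will punt this to io-wq */
        io_uring_prep_fadvise(sqe, fd, off, len, POSIX_FADV_DONTNEED);
        return io_uring_submit(ring);
    }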
Reviewed-by: Pavel Begunkov asml.silence@gmail.com Signed-off-by: Jens Axboe axboe@kernel.dk Signed-off-by: yangerkun yangerkun@huawei.com Reviewed-by: zhangyi (F) yi.zhang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- fs/io_uring.c | 53 +++++++++++++++++++++++++++++++++++ include/uapi/linux/io_uring.h | 2 ++ 2 files changed, 55 insertions(+)
diff --git a/fs/io_uring.c b/fs/io_uring.c index 4385714506c2..c47ab9ce390e 100644 --- a/fs/io_uring.c +++ b/fs/io_uring.c @@ -72,6 +72,7 @@ #include <linux/highmem.h> #include <linux/namei.h> #include <linux/fsnotify.h> +#include <linux/fadvise.h>
#define CREATE_TRACE_POINTS #include <trace/events/io_uring.h> @@ -400,6 +401,13 @@ struct io_files_update { u32 offset; };
+struct io_fadvise { + struct file *file; + u64 offset; + u32 len; + u32 advice; +}; + struct io_async_connect { struct sockaddr_storage address; }; @@ -452,6 +460,7 @@ struct io_kiocb { struct io_open open; struct io_close close; struct io_files_update files_update; + struct io_fadvise fadvise; };
struct io_async_ctx *io; @@ -667,6 +676,10 @@ static const struct io_op_def io_op_defs[] = { .needs_file = 1, .unbound_nonreg_file = 1, }, + { + /* IORING_OP_FADVISE */ + .needs_file = 1, + }, };
static void io_wq_submit_work(struct io_wq_work **workptr); @@ -2433,6 +2446,35 @@ static int io_openat(struct io_kiocb *req, struct io_kiocb **nxt, return 0; }
+static int io_fadvise_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe) +{ + if (sqe->ioprio || sqe->buf_index || sqe->addr) + return -EINVAL; + + req->fadvise.offset = READ_ONCE(sqe->off); + req->fadvise.len = READ_ONCE(sqe->len); + req->fadvise.advice = READ_ONCE(sqe->fadvise_advice); + return 0; +} + +static int io_fadvise(struct io_kiocb *req, struct io_kiocb **nxt, + bool force_nonblock) +{ + struct io_fadvise *fa = &req->fadvise; + int ret; + + /* DONTNEED may block, others _should_ not */ + if (fa->advice == POSIX_FADV_DONTNEED && force_nonblock) + return -EAGAIN; + + ret = vfs_fadvise(req->file, fa->offset, fa->len, fa->advice); + if (ret < 0) + req_set_fail_links(req); + io_cqring_add_event(req, ret); + io_put_req_find_next(req, nxt); + return 0; +} + static int io_statx_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe) { unsigned lookup_flags; @@ -3718,6 +3760,9 @@ static int io_req_defer_prep(struct io_kiocb *req, case IORING_OP_STATX: ret = io_statx_prep(req, sqe); break; + case IORING_OP_FADVISE: + ret = io_fadvise_prep(req, sqe); + break; default: printk_once(KERN_WARNING "io_uring: unhandled opcode %d\n", req->opcode); @@ -3914,6 +3959,14 @@ static int io_issue_sqe(struct io_kiocb *req, const struct io_uring_sqe *sqe, } ret = io_statx(req, nxt, force_nonblock); break; + case IORING_OP_FADVISE: + if (sqe) { + ret = io_fadvise_prep(req, sqe); + if (ret) + break; + } + ret = io_fadvise(req, nxt, force_nonblock); + break; default: ret = -EINVAL; break; diff --git a/include/uapi/linux/io_uring.h b/include/uapi/linux/io_uring.h index 1f96136eb6ee..f86d1c776078 100644 --- a/include/uapi/linux/io_uring.h +++ b/include/uapi/linux/io_uring.h @@ -36,6 +36,7 @@ struct io_uring_sqe { __u32 cancel_flags; __u32 open_flags; __u32 statx_flags; + __u32 fadvise_advice; }; __u64 user_data; /* data to be passed back at completion time */ union { @@ -86,6 +87,7 @@ enum { IORING_OP_STATX, IORING_OP_READ, IORING_OP_WRITE, + IORING_OP_FADVISE,
/* this goes last, obviously */ IORING_OP_LAST,
From: Pavel Begunkov asml.silence@gmail.com
mainline inclusion from mainline-5.6-rc1 commit caf582c652feccd42c50923f0467c4f2dcef279e category: feature bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27 CVE: NA ---------------------------
It should be pretty rare not to submit anything when there is something in the ring, so there's no need to keep a heuristic for this case.
Signed-off-by: Pavel Begunkov asml.silence@gmail.com Signed-off-by: Jens Axboe axboe@kernel.dk Signed-off-by: yangerkun yangerkun@huawei.com Reviewed-by: zhangyi (F) yi.zhang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- fs/io_uring.c | 14 ++++++-------- 1 file changed, 6 insertions(+), 8 deletions(-)
diff --git a/fs/io_uring.c b/fs/io_uring.c index 185575f027e9..1778679ad595 100644 --- a/fs/io_uring.c +++ b/fs/io_uring.c @@ -4498,14 +4498,12 @@ static void io_commit_sqring(struct io_ring_ctx *ctx) { struct io_rings *rings = ctx->rings;
- if (ctx->cached_sq_head != READ_ONCE(rings->sq.head)) { - /* - * Ensure any loads from the SQEs are done at this point, - * since once we write the new head, the application could - * write new data to them. - */ - smp_store_release(&rings->sq.head, ctx->cached_sq_head); - } + /* + * Ensure any loads from the SQEs are done at this point, + * since once we write the new head, the application could + * write new data to them. + */ + smp_store_release(&rings->sq.head, ctx->cached_sq_head); }
/*
From: Pavel Begunkov asml.silence@gmail.com
mainline inclusion from mainline-5.6-rc1 commit 2550878f8421f7912fdd56b38c630b797f95c749 category: feature bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27 CVE: NA ---------------------------
io_wq workers use io_issue_sqe() to forward sqes and never io_queue_sqe(). Remove the extra check for io_wq_current_is_worker().
Signed-off-by: Pavel Begunkov asml.silence@gmail.com Signed-off-by: Jens Axboe axboe@kernel.dk Signed-off-by: yangerkun yangerkun@huawei.com Reviewed-by: zhangyi (F) yi.zhang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- fs/io_uring.c | 3 +-- 1 file changed, 1 insertion(+), 2 deletions(-)
diff --git a/fs/io_uring.c b/fs/io_uring.c index 1778679ad595..4e881562e638 100644 --- a/fs/io_uring.c +++ b/fs/io_uring.c @@ -4368,8 +4368,7 @@ static void io_queue_sqe(struct io_kiocb *req, const struct io_uring_sqe *sqe) req_set_fail_links(req); io_double_put_req(req); } - } else if ((req->flags & REQ_F_FORCE_ASYNC) && - !io_wq_current_is_worker()) { + } else if (req->flags & REQ_F_FORCE_ASYNC) { /* * Never try inline submit of IOSQE_ASYNC is set, go straight * to async execution.
From: Jens Axboe axboe@kernel.dk
mainline inclusion from mainline-5.6-rc1 commit fddafacee287b3140212c92464077e971401f860 category: feature bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27 CVE: NA ---------------------------
This adds IORING_OP_SEND for send(2) support, and IORING_OP_RECV for recv(2) support.
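From userspace these mirror send(2)/recv(2). A fragment, assuming liburing and an already-connected socket:

    #include <liburing.h>

    static int queue_send(struct io_uring *ring, int sockfd,
                          const void *buf, unsigned len)
    {
        struct io_uring_sqe *sqe = io_uring_get_sqe(ring);

        io_uring_prep_send(sqe, sockfd, buf, len, 0);
        return io_uring_submit(ring);
    }

    static int queue_recv(struct io_uring *ring, int sockfd,
                          void *buf, unsigned len)
    {
        struct io_uring_sqe *sqe = io_uring_get_sqe(ring);

        io_uring_prep_recv(sqe, sockfd, buf, len, 0);
        return io_uring_submit(ring);
    }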
Signed-off-by: Jens Axboe axboe@kernel.dk Signed-off-by: yangerkun yangerkun@huawei.com Reviewed-by: zhangyi (F) yi.zhang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- fs/io_uring.c | 140 ++++++++++++++++++++++++++++++++-- include/uapi/linux/io_uring.h | 2 + 2 files changed, 137 insertions(+), 5 deletions(-)
diff --git a/fs/io_uring.c b/fs/io_uring.c index 4e881562e638..eddba18d46f8 100644 --- a/fs/io_uring.c +++ b/fs/io_uring.c @@ -377,8 +377,12 @@ struct io_connect {
struct io_sr_msg { struct file *file; - struct user_msghdr __user *msg; + union { + struct user_msghdr __user *msg; + void __user *buf; + }; int msg_flags; + size_t len; };
struct io_open { @@ -692,6 +696,18 @@ static const struct io_op_def io_op_defs[] = { /* IORING_OP_MADVISE */ .needs_mm = 1, }, + { + /* IORING_OP_SEND */ + .needs_mm = 1, + .needs_file = 1, + .unbound_nonreg_file = 1, + }, + { + /* IORING_OP_RECV */ + .needs_mm = 1, + .needs_file = 1, + .unbound_nonreg_file = 1, + }, };
static void io_wq_submit_work(struct io_wq_work **workptr); @@ -2799,8 +2815,9 @@ static int io_sendmsg_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe)
sr->msg_flags = READ_ONCE(sqe->msg_flags); sr->msg = u64_to_user_ptr(READ_ONCE(sqe->addr)); + sr->len = READ_ONCE(sqe->len);
- if (!io) + if (!io || req->opcode == IORING_OP_SEND) return 0;
io->msg.iov = io->msg.fast_iov; @@ -2880,6 +2897,56 @@ static int io_sendmsg(struct io_kiocb *req, struct io_kiocb **nxt, #endif }
+static int io_send(struct io_kiocb *req, struct io_kiocb **nxt, + bool force_nonblock) +{ +#if defined(CONFIG_NET) + struct socket *sock; + int ret; + + if (unlikely(req->ctx->flags & IORING_SETUP_IOPOLL)) + return -EINVAL; + + sock = sock_from_file(req->file, &ret); + if (sock) { + struct io_sr_msg *sr = &req->sr_msg; + struct msghdr msg; + struct iovec iov; + unsigned flags; + + ret = import_single_range(WRITE, sr->buf, sr->len, &iov, + &msg.msg_iter); + if (ret) + return ret; + + msg.msg_name = NULL; + msg.msg_control = NULL; + msg.msg_controllen = 0; + msg.msg_namelen = 0; + + flags = req->sr_msg.msg_flags; + if (flags & MSG_DONTWAIT) + req->flags |= REQ_F_NOWAIT; + else if (force_nonblock) + flags |= MSG_DONTWAIT; + + ret = __sys_sendmsg_sock(sock, &msg, flags); + if (force_nonblock && ret == -EAGAIN) + return -EAGAIN; + if (ret == -ERESTARTSYS) + ret = -EINTR; + } + + io_cqring_add_event(req, ret); + if (ret < 0) + req_set_fail_links(req); + io_put_req_find_next(req, nxt); + return 0; +#else + return -EOPNOTSUPP; +#endif +} + static int io_recvmsg_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe) { @@ -2890,7 +2957,7 @@ static int io_recvmsg_prep(struct io_kiocb *req, sr->msg_flags = READ_ONCE(sqe->msg_flags); sr->msg = u64_to_user_ptr(READ_ONCE(sqe->addr));
- if (!io) + if (!io || req->opcode == IORING_OP_RECV) return 0;
io->msg.iov = io->msg.fast_iov; @@ -2972,6 +3039,59 @@ static int io_recvmsg(struct io_kiocb *req, struct io_kiocb **nxt, #endif }
+static int io_recv(struct io_kiocb *req, struct io_kiocb **nxt, + bool force_nonblock) +{ +#if defined(CONFIG_NET) + struct socket *sock; + int ret; + + if (unlikely(req->ctx->flags & IORING_SETUP_IOPOLL)) + return -EINVAL; + + sock = sock_from_file(req->file, &ret); + if (sock) { + struct io_sr_msg *sr = &req->sr_msg; + struct msghdr msg; + struct iovec iov; + unsigned flags; + + ret = import_single_range(READ, sr->buf, sr->len, &iov, + &msg.msg_iter); + if (ret) + return ret; + + msg.msg_name = NULL; + msg.msg_control = NULL; + msg.msg_controllen = 0; + msg.msg_namelen = 0; + msg.msg_iocb = NULL; + msg.msg_flags = 0; + + flags = req->sr_msg.msg_flags; + if (flags & MSG_DONTWAIT) + req->flags |= REQ_F_NOWAIT; + else if (force_nonblock) + flags |= MSG_DONTWAIT; + + ret = __sys_recvmsg_sock(sock, &msg, NULL, NULL, flags); + if (force_nonblock && ret == -EAGAIN) + return -EAGAIN; + if (ret == -ERESTARTSYS) + ret = -EINTR; + } + + io_cqring_add_event(req, ret); + if (ret < 0) + req_set_fail_links(req); + io_put_req_find_next(req, nxt); + return 0; +#else + return -EOPNOTSUPP; +#endif +} + + static int io_accept_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe) { #if defined(CONFIG_NET) @@ -3808,9 +3928,11 @@ static int io_req_defer_prep(struct io_kiocb *req, ret = io_prep_sfr(req, sqe); break; case IORING_OP_SENDMSG: + case IORING_OP_SEND: ret = io_sendmsg_prep(req, sqe); break; case IORING_OP_RECVMSG: + case IORING_OP_RECV: ret = io_recvmsg_prep(req, sqe); break; case IORING_OP_CONNECT: @@ -3953,20 +4075,28 @@ static int io_issue_sqe(struct io_kiocb *req, const struct io_uring_sqe *sqe, ret = io_sync_file_range(req, nxt, force_nonblock); break; case IORING_OP_SENDMSG: + case IORING_OP_SEND: if (sqe) { ret = io_sendmsg_prep(req, sqe); if (ret < 0) break; } - ret = io_sendmsg(req, nxt, force_nonblock); + if (req->opcode == IORING_OP_SENDMSG) + ret = io_sendmsg(req, nxt, force_nonblock); + else + ret = io_send(req, nxt, force_nonblock); break; case IORING_OP_RECVMSG: + case IORING_OP_RECV: if (sqe) { ret = io_recvmsg_prep(req, sqe); if (ret) break; } - ret = io_recvmsg(req, nxt, force_nonblock); + if (req->opcode == IORING_OP_RECVMSG) + ret = io_recvmsg(req, nxt, force_nonblock); + else + ret = io_recv(req, nxt, force_nonblock); break; case IORING_OP_TIMEOUT: if (sqe) { diff --git a/include/uapi/linux/io_uring.h b/include/uapi/linux/io_uring.h index 29fae13395a8..0fe270ab191c 100644 --- a/include/uapi/linux/io_uring.h +++ b/include/uapi/linux/io_uring.h @@ -90,6 +90,8 @@ enum { IORING_OP_WRITE, IORING_OP_FADVISE, IORING_OP_MADVISE, + IORING_OP_SEND, + IORING_OP_RECV,
/* this goes last, obviously */ IORING_OP_LAST,
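A rough userspace sketch of driving the new opcode (not part of the patch set; it assumes headers that already carry IORING_OP_SEND and a ring that is set up and mmap'ed). The field mapping follows the prep code above: addr carries the single buffer, len its length, and msg_flags the send flags.

#include <string.h>
#include <linux/io_uring.h>

static void prep_send_sqe(struct io_uring_sqe *sqe, int sockfd,
                          const void *buf, unsigned int len, int flags)
{
        memset(sqe, 0, sizeof(*sqe));
        sqe->opcode = IORING_OP_SEND;
        sqe->fd = sockfd;                    /* socket to send on */
        sqe->addr = (unsigned long)buf;      /* single buffer, not an iovec */
        sqe->len = len;
        sqe->msg_flags = flags;              /* e.g. MSG_DONTWAIT */
        sqe->user_data = 42;                 /* echoed back in the CQE */
}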
From: YueHaibing yuehaibing@huawei.com
mainline inclusion from mainline-5.6-rc1 commit 96fd84d83a778450ffae737d9efa546ac3983b1f category: feature bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27 CVE: NA ---------------------------
A NULL check before kfree() is redundant since kfree(NULL) is a no-op, so remove it. Detected by coccinelle.
Signed-off-by: YueHaibing yuehaibing@huawei.com Signed-off-by: Jens Axboe axboe@kernel.dk Signed-off-by: yangerkun yangerkun@huawei.com Reviewed-by: zhangyi (F) yi.zhang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- fs/io_uring.c | 3 +-- 1 file changed, 1 insertion(+), 2 deletions(-)
diff --git a/fs/io_uring.c b/fs/io_uring.c index eddba18d46f8..9da48b9f5fd8 100644 --- a/fs/io_uring.c +++ b/fs/io_uring.c @@ -1159,8 +1159,7 @@ static void __io_req_aux_free(struct io_kiocb *req) { struct io_ring_ctx *ctx = req->ctx;
- if (req->io) - kfree(req->io); + kfree(req->io); if (req->file) { if (req->flags & REQ_F_FIXED_FILE) percpu_ref_put(&ctx->file_data->refs);
From: Jens Axboe axboe@kernel.dk
mainline inclusion from mainline-5.6-rc1 commit c150368b496837cb207712e78f903ccfd7633b93 category: feature bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27 CVE: NA ---------------------------
If an application attempts to register a set with unbounded requests pending, we can be stuck here forever if they don't complete. We can make this wait interruptible, and just abort if we get signaled.
Signed-off-by: Jens Axboe axboe@kernel.dk Signed-off-by: yangerkun yangerkun@huawei.com Reviewed-by: zhangyi (F) yi.zhang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- fs/io_uring.c | 10 ++++++++-- 1 file changed, 8 insertions(+), 2 deletions(-)
diff --git a/fs/io_uring.c b/fs/io_uring.c index 9da48b9f5fd8..ca84a708b6b8 100644 --- a/fs/io_uring.c +++ b/fs/io_uring.c @@ -6512,8 +6512,13 @@ static int __io_uring_register(struct io_ring_ctx *ctx, unsigned opcode, * after we've killed the percpu ref. */ mutex_unlock(&ctx->uring_lock); - wait_for_completion(&ctx->completions[0]); + ret = wait_for_completion_interruptible(&ctx->completions[0]); mutex_lock(&ctx->uring_lock); + if (ret) { + percpu_ref_resurrect(&ctx->refs); + ret = -EINTR; + goto out; + } }
switch (opcode) { @@ -6559,8 +6564,9 @@ static int __io_uring_register(struct io_ring_ctx *ctx, unsigned opcode, if (opcode != IORING_UNREGISTER_FILES && opcode != IORING_REGISTER_FILES_UPDATE) { /* bring the ctx back to life */ - reinit_completion(&ctx->completions[0]); percpu_ref_reinit(&ctx->refs); +out: + reinit_completion(&ctx->completions[0]); } return ret; }
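For callers, the visible effect is that a quiescing io_uring_register() call can now fail with -EINTR, with the ring refs resurrected so the ring stays usable. A hedged userspace sketch of handling that (raw syscall; the retry policy is the application's choice, not something the patch mandates):

#include <errno.h>
#include <unistd.h>
#include <sys/syscall.h>
#include <linux/io_uring.h>

static int register_files_retry(int ring_fd, const int *fds, unsigned int nr)
{
        int ret;

        do {
                ret = syscall(__NR_io_uring_register, ring_fd,
                              IORING_REGISTER_FILES, fds, nr);
        } while (ret < 0 && errno == EINTR);  /* interrupted quiesce: safe to retry */
        return ret;
}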
From: Jens Axboe axboe@kernel.dk
mainline inclusion from mainline-5.6-rc1 commit 69b3e546139a21b3046b6bf0cb79d5e8c9a3fa75 category: feature bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27 CVE: NA ---------------------------
In preparation for adding another one, which would make us spill into another long (and hence bump the size of the ctx), change them to bit fields.
Signed-off-by: Jens Axboe axboe@kernel.dk Signed-off-by: yangerkun yangerkun@huawei.com Reviewed-by: zhangyi (F) yi.zhang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- fs/io_uring.c | 14 +++++++------- 1 file changed, 7 insertions(+), 7 deletions(-)
diff --git a/fs/io_uring.c b/fs/io_uring.c index ca84a708b6b8..346f2298837d 100644 --- a/fs/io_uring.c +++ b/fs/io_uring.c @@ -202,10 +202,10 @@ struct io_ring_ctx {
struct { unsigned int flags; - bool compat; - bool account_mem; - bool cq_overflow_flushed; - bool drain_next; + int compat: 1; + int account_mem: 1; + int cq_overflow_flushed: 1; + int drain_next: 1;
/* * Ring buffer of indices into array of io_uring_sqe, which is @@ -993,7 +993,7 @@ static bool io_cqring_overflow_flush(struct io_ring_ctx *ctx, bool force)
/* if force is set, the ring is going away. always drop after that */ if (force) - ctx->cq_overflow_flushed = true; + ctx->cq_overflow_flushed = 1;
cqe = NULL; while (!list_empty(&ctx->cq_overflow_list)) { @@ -4486,9 +4486,9 @@ static void io_queue_sqe(struct io_kiocb *req, const struct io_uring_sqe *sqe)
if (unlikely(req->ctx->drain_next)) { req->flags |= REQ_F_IO_DRAIN; - req->ctx->drain_next = false; + req->ctx->drain_next = 0; } - req->ctx->drain_next = (req->flags & REQ_F_DRAIN_LINK); + req->ctx->drain_next = (req->flags & REQ_F_DRAIN_LINK) != 0;
ret = io_req_defer(req, sqe); if (ret) {
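A standalone sketch of the sizing argument (ordinary userspace C, not kernel code): a fifth bool would widen the struct further, while single-bit fields keep many flags inside one int-sized word.

#include <stdio.h>
#include <stdbool.h>

struct with_bools { unsigned int flags; bool a, b, c, d, e; };
struct with_bits  { unsigned int flags; int a : 1, b : 1, c : 1, d : 1, e : 1; };

int main(void)
{
        /* typically prints 12 vs 8 bytes on x86-64 */
        printf("bools: %zu bytes, bitfields: %zu bytes\n",
               sizeof(struct with_bools), sizeof(struct with_bits));
        return 0;
}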
From: Jens Axboe axboe@kernel.dk
mainline inclusion from mainline-5.6-rc1 commit f2842ab5b72d7ee5f7f8385c2d4f32c133f5837b category: feature bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27 CVE: NA ---------------------------
If an application is using eventfd notifications with poll to know when new SQEs can be issued, it's expecting the following read/writes to complete inline. And with that, it knows there are events available, and doesn't want spurious wakeups on the eventfd for those requests.
This adds IORING_REGISTER_EVENTFD_ASYNC, which works just like IORING_REGISTER_EVENTFD, except it only triggers notifications for events that happen from async completions (IRQ, or io-wq worker completions). Any completions inline from the submission itself will not trigger notifications.
Suggested-by: Mark Papadakis markuspapadakis@icloud.com Signed-off-by: Jens Axboe axboe@kernel.dk Signed-off-by: yangerkun yangerkun@huawei.com Reviewed-by: zhangyi (F) yi.zhang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- fs/io_uring.c | 17 ++++++++++++++++- include/uapi/linux/io_uring.h | 1 + 2 files changed, 17 insertions(+), 1 deletion(-)
diff --git a/fs/io_uring.c b/fs/io_uring.c index 346f2298837d..b3ca3f380b37 100644 --- a/fs/io_uring.c +++ b/fs/io_uring.c @@ -206,6 +206,7 @@ struct io_ring_ctx { int account_mem: 1; int cq_overflow_flushed: 1; int drain_next: 1; + int eventfd_async: 1;
/* * Ring buffer of indices into array of io_uring_sqe, which is @@ -962,13 +963,20 @@ static struct io_uring_cqe *io_get_cqring(struct io_ring_ctx *ctx) return &rings->cqes[tail & ctx->cq_mask]; }
+static inline bool io_should_trigger_evfd(struct io_ring_ctx *ctx) +{ + if (!ctx->eventfd_async) + return true; + return io_wq_current_is_worker() || in_interrupt(); +} + static void io_cqring_ev_posted(struct io_ring_ctx *ctx) { if (waitqueue_active(&ctx->wait)) wake_up(&ctx->wait); if (waitqueue_active(&ctx->sqo_wait)) wake_up(&ctx->sqo_wait); - if (ctx->cq_ev_fd) + if (ctx->cq_ev_fd && io_should_trigger_evfd(ctx)) eventfd_signal(ctx->cq_ev_fd, 1); }
@@ -6544,10 +6552,17 @@ static int __io_uring_register(struct io_ring_ctx *ctx, unsigned opcode, ret = io_sqe_files_update(ctx, arg, nr_args); break; case IORING_REGISTER_EVENTFD: + case IORING_REGISTER_EVENTFD_ASYNC: ret = -EINVAL; if (nr_args != 1) break; ret = io_eventfd_register(ctx, arg); + if (ret) + break; + if (opcode == IORING_REGISTER_EVENTFD_ASYNC) + ctx->eventfd_async = 1; + else + ctx->eventfd_async = 0; break; case IORING_UNREGISTER_EVENTFD: ret = -EINVAL; diff --git a/include/uapi/linux/io_uring.h b/include/uapi/linux/io_uring.h index 0fe270ab191c..66772a90a7f2 100644 --- a/include/uapi/linux/io_uring.h +++ b/include/uapi/linux/io_uring.h @@ -192,6 +192,7 @@ struct io_uring_params { #define IORING_REGISTER_EVENTFD 4 #define IORING_UNREGISTER_EVENTFD 5 #define IORING_REGISTER_FILES_UPDATE 6 +#define IORING_REGISTER_EVENTFD_ASYNC 7
struct io_uring_files_update { __u32 offset;
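A hedged userspace sketch of the new registration mode (raw syscall, error handling trimmed): the returned eventfd then only fires for async completions, so an eventfd-driven loop sees no spurious wakeups for inline completions.

#include <unistd.h>
#include <sys/eventfd.h>
#include <sys/syscall.h>
#include <linux/io_uring.h>

static int register_evfd_async(int ring_fd)
{
        int efd = eventfd(0, EFD_CLOEXEC);

        if (efd < 0)
                return -1;
        /* nr_args must be 1, matching the check in __io_uring_register() */
        if (syscall(__NR_io_uring_register, ring_fd,
                    IORING_REGISTER_EVENTFD_ASYNC, &efd, 1) < 0) {
                close(efd);
                return -1;
        }
        return efd;
}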
From: Jens Axboe axboe@kernel.dk
mainline inclusion from mainline-5.6-rc1 commit f8748881b17dc56b3faa1d30c823f071c56593e5 category: feature bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27 CVE: NA ---------------------------
We only use it internally in the prep functions for both statx and openat, so we don't need it to be persistent across the request.
Signed-off-by: Jens Axboe axboe@kernel.dk
Conflicts: fs/io_uring.c [ commit c12cedf24e78("io_uring: add 'struct open_how' to the openat request context") is not applied ] Signed-off-by: yangerkun yangerkun@huawei.com Reviewed-by: zhangyi (F) yi.zhang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- fs/io_uring.c | 11 ++++++----- 1 file changed, 6 insertions(+), 5 deletions(-)
diff --git a/fs/io_uring.c b/fs/io_uring.c index b3ca3f380b37..90cfc595b3c1 100644 --- a/fs/io_uring.c +++ b/fs/io_uring.c @@ -393,7 +393,6 @@ struct io_open { umode_t mode; unsigned mask; }; - const char __user *fname; struct filename *filename; struct statx __user *buffer; int flags; @@ -2467,6 +2466,7 @@ static int io_fallocate(struct io_kiocb *req, struct io_kiocb **nxt,
static int io_openat_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe) { + const char __user *fname; int ret;
if (sqe->ioprio || sqe->buf_index) @@ -2474,10 +2474,10 @@ static int io_openat_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe)
req->open.dfd = READ_ONCE(sqe->fd); req->open.mode = READ_ONCE(sqe->len); - req->open.fname = u64_to_user_ptr(READ_ONCE(sqe->addr)); + fname = u64_to_user_ptr(READ_ONCE(sqe->addr)); req->open.flags = READ_ONCE(sqe->open_flags);
- req->open.filename = getname(req->open.fname); + req->open.filename = getname(fname); if (IS_ERR(req->open.filename)) { ret = PTR_ERR(req->open.filename); req->open.filename = NULL; @@ -2591,6 +2591,7 @@ static int io_fadvise(struct io_kiocb *req, struct io_kiocb **nxt,
static int io_statx_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe) { + const char __user *fname; unsigned lookup_flags; int ret;
@@ -2599,14 +2600,14 @@ static int io_statx_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe)
req->open.dfd = READ_ONCE(sqe->fd); req->open.mask = READ_ONCE(sqe->len); - req->open.fname = u64_to_user_ptr(READ_ONCE(sqe->addr)); + fname = u64_to_user_ptr(READ_ONCE(sqe->addr)); req->open.buffer = u64_to_user_ptr(READ_ONCE(sqe->addr2)); req->open.flags = READ_ONCE(sqe->statx_flags);
if (vfs_stat_set_lookup_flags(&lookup_flags, req->open.flags)) return -EINVAL;
- req->open.filename = getname_flags(req->open.fname, lookup_flags, NULL); + req->open.filename = getname_flags(fname, lookup_flags, NULL); if (IS_ERR(req->open.filename)) { ret = PTR_ERR(req->open.filename); req->open.filename = NULL;
From: Jens Axboe axboe@kernel.dk
mainline inclusion from mainline-5.6-rc1 commit 354420f705ccd0aa2d41249f3bb55b4afbed1873 category: feature bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27 CVE: NA ---------------------------
For some test apps at least, user_data is just zeroes. So it's not a good way to tell what the command actually is. Add the opcode to the issue trace point.
Signed-off-by: Jens Axboe axboe@kernel.dk Signed-off-by: yangerkun yangerkun@huawei.com Reviewed-by: zhangyi (F) yi.zhang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- fs/io_uring.c | 3 ++- include/trace/events/io_uring.h | 13 +++++++++---- 2 files changed, 11 insertions(+), 5 deletions(-)
diff --git a/fs/io_uring.c b/fs/io_uring.c index 90cfc595b3c1..0a616a4d9125 100644 --- a/fs/io_uring.c +++ b/fs/io_uring.c @@ -4751,7 +4751,8 @@ static int io_submit_sqes(struct io_ring_ctx *ctx, unsigned int nr, req->has_user = *mm != NULL; req->in_async = async; req->needs_fixed_file = async; - trace_io_uring_submit_sqe(ctx, req->user_data, true, async); + trace_io_uring_submit_sqe(ctx, req->opcode, req->user_data, + true, async); if (!io_submit_sqe(req, sqe, statep, &link)) break; } diff --git a/include/trace/events/io_uring.h b/include/trace/events/io_uring.h index 5ff28df57763..b116de688a0e 100644 --- a/include/trace/events/io_uring.h +++ b/include/trace/events/io_uring.h @@ -349,6 +349,7 @@ TRACE_EVENT(io_uring_complete, * io_uring_submit_sqe - called before submitting one SQE * * @ctx: pointer to a ring context structure + * @opcode: opcode of request * @user_data: user data associated with the request * @force_nonblock: whether a context blocking or not * @sq_thread: true if sq_thread has submitted this SQE @@ -358,12 +359,14 @@ TRACE_EVENT(io_uring_complete, */ TRACE_EVENT(io_uring_submit_sqe,
- TP_PROTO(void *ctx, u64 user_data, bool force_nonblock, bool sq_thread), + TP_PROTO(void *ctx, u8 opcode, u64 user_data, bool force_nonblock, + bool sq_thread),
- TP_ARGS(ctx, user_data, force_nonblock, sq_thread), + TP_ARGS(ctx, opcode, user_data, force_nonblock, sq_thread),
TP_STRUCT__entry ( __field( void *, ctx ) + __field( u8, opcode ) __field( u64, user_data ) __field( bool, force_nonblock ) __field( bool, sq_thread ) @@ -371,13 +374,15 @@ TRACE_EVENT(io_uring_submit_sqe,
TP_fast_assign( __entry->ctx = ctx; + __entry->opcode = opcode; __entry->user_data = user_data; __entry->force_nonblock = force_nonblock; __entry->sq_thread = sq_thread; ),
- TP_printk("ring %p, user data 0x%llx, non block %d, sq_thread %d", - __entry->ctx, (unsigned long long) __entry->user_data, + TP_printk("ring %p, op %d, data 0x%llx, non block %d, sq_thread %d", + __entry->ctx, __entry->opcode, + (unsigned long long) __entry->user_data, __entry->force_nonblock, __entry->sq_thread) );
From: Jens Axboe axboe@kernel.dk
mainline inclusion from mainline-5.6-rc1 commit 10fef4bebf979bb705feed087611293d5864adfe category: feature bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27 CVE: NA ---------------------------
We can't assume that the whole batch has fixed files in it. If it's a mix, or none at all, then we can end up doing a ref put that either messes up accounting, or causes an oops if we have no fixed files at all.
Also ensure we free requests properly between inflight accounted and normal requests.
Fixes: 82c721577011 ("io_uring: extend batch freeing to cover more cases") Reported-by: Dmitrii Dolgov 9erthalion6@gmail.com Reported-by: Pavel Begunkov asml.silence@gmail.com Tested-by: Dmitrii Dolgov 9erthalion6@gmail.com Signed-off-by: Jens Axboe axboe@kernel.dk Signed-off-by: yangerkun yangerkun@huawei.com Reviewed-by: zhangyi (F) yi.zhang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- fs/io_uring.c | 14 +++++++++----- 1 file changed, 9 insertions(+), 5 deletions(-)
diff --git a/fs/io_uring.c b/fs/io_uring.c index 0a616a4d9125..68881cb7f7c0 100644 --- a/fs/io_uring.c +++ b/fs/io_uring.c @@ -1202,21 +1202,24 @@ struct req_batch {
static void io_free_req_many(struct io_ring_ctx *ctx, struct req_batch *rb) { + int fixed_refs = rb->to_free; + if (!rb->to_free) return; if (rb->need_iter) { int i, inflight = 0; unsigned long flags;
+ fixed_refs = 0; for (i = 0; i < rb->to_free; i++) { struct io_kiocb *req = rb->reqs[i];
- if (req->flags & REQ_F_FIXED_FILE) + if (req->flags & REQ_F_FIXED_FILE) { req->file = NULL; + fixed_refs++; + } if (req->flags & REQ_F_INFLIGHT) inflight++; - else - rb->reqs[i] = NULL; __io_req_aux_free(req); } if (!inflight) @@ -1226,7 +1229,7 @@ static void io_free_req_many(struct io_ring_ctx *ctx, struct req_batch *rb) for (i = 0; i < rb->to_free; i++) { struct io_kiocb *req = rb->reqs[i];
- if (req) { + if (req->flags & REQ_F_INFLIGHT) { list_del(&req->inflight_entry); if (!--inflight) break; @@ -1239,8 +1242,9 @@ static void io_free_req_many(struct io_ring_ctx *ctx, struct req_batch *rb) } do_free: kmem_cache_free_bulk(req_cachep, rb->to_free, rb->reqs); + if (fixed_refs) + percpu_ref_put_many(&ctx->file_data->refs, fixed_refs); percpu_ref_put_many(&ctx->refs, rb->to_free); - percpu_ref_put_many(&ctx->file_data->refs, rb->to_free); rb->to_free = rb->need_iter = 0; }
From: Jens Axboe axboe@kernel.dk
mainline inclusion from mainline-5.6-rc1 commit 66f4af93da5761d2fa05c0dc673a47003cdb9cfe category: feature bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27 CVE: NA ---------------------------
The application currently has no way of knowing if a given opcode is supported or not without having to try and issue one and see if we get -EINVAL or not. And even this approach is fraught with peril, as maybe we're getting -EINVAL due to some fields being missing, or maybe it's just not that easy to issue that particular command without doing some other leg work in terms of setup first.
This adds IORING_REGISTER_PROBE, which fills in a structure with info on what is supported or not. This will work even with sparse opcode fields, which may happen in the future or even today if someone backports specific features to older kernels.
Signed-off-by: Jens Axboe axboe@kernel.dk Signed-off-by: yangerkun yangerkun@huawei.com Reviewed-by: zhangyi (F) yi.zhang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- fs/io_uring.c | 53 +++++++++++++++++++++++++++++++++-- include/uapi/linux/io_uring.h | 18 ++++++++++++ 2 files changed, 69 insertions(+), 2 deletions(-)
diff --git a/fs/io_uring.c b/fs/io_uring.c index 68881cb7f7c0..3e0c6b8919ed 100644 --- a/fs/io_uring.c +++ b/fs/io_uring.c @@ -562,6 +562,8 @@ struct io_op_def { unsigned hash_reg_file : 1; /* unbound wq insertion if file is a non-regular file */ unsigned unbound_nonreg_file : 1; + /* opcode is not supported by this kernel */ + unsigned not_supported : 1; };
static const struct io_op_def io_op_defs[] = { @@ -6498,6 +6500,45 @@ SYSCALL_DEFINE2(io_uring_setup, u32, entries, return io_uring_setup(entries, params); }
+static int io_probe(struct io_ring_ctx *ctx, void __user *arg, unsigned nr_args) +{ + struct io_uring_probe *p; + size_t size; + int i, ret; + + size = struct_size(p, ops, nr_args); + if (size == SIZE_MAX) + return -EOVERFLOW; + p = kzalloc(size, GFP_KERNEL); + if (!p) + return -ENOMEM; + + ret = -EFAULT; + if (copy_from_user(p, arg, size)) + goto out; + ret = -EINVAL; + if (memchr_inv(p, 0, size)) + goto out; + + p->last_op = IORING_OP_LAST - 1; + if (nr_args > IORING_OP_LAST) + nr_args = IORING_OP_LAST; + + for (i = 0; i < nr_args; i++) { + p->ops[i].op = i; + if (!io_op_defs[i].not_supported) + p->ops[i].flags = IO_URING_OP_SUPPORTED; + } + p->ops_len = i; + + ret = 0; + if (copy_to_user(arg, p, size)) + ret = -EFAULT; +out: + kfree(p); + return ret; +} + static int __io_uring_register(struct io_ring_ctx *ctx, unsigned opcode, void __user *arg, unsigned nr_args) __releases(ctx->uring_lock) @@ -6514,7 +6555,8 @@ static int __io_uring_register(struct io_ring_ctx *ctx, unsigned opcode, return -ENXIO;
if (opcode != IORING_UNREGISTER_FILES && - opcode != IORING_REGISTER_FILES_UPDATE) { + opcode != IORING_REGISTER_FILES_UPDATE && + opcode != IORING_REGISTER_PROBE) { percpu_ref_kill(&ctx->refs);
/* @@ -6576,6 +6618,12 @@ static int __io_uring_register(struct io_ring_ctx *ctx, unsigned opcode, break; ret = io_eventfd_unregister(ctx); break; + case IORING_REGISTER_PROBE: + ret = -EINVAL; + if (!arg || nr_args > 256) + break; + ret = io_probe(ctx, arg, nr_args); + break; default: ret = -EINVAL; break; @@ -6583,7 +6631,8 @@ static int __io_uring_register(struct io_ring_ctx *ctx, unsigned opcode,
if (opcode != IORING_UNREGISTER_FILES && - opcode != IORING_REGISTER_FILES_UPDATE) { + opcode != IORING_REGISTER_FILES_UPDATE && + opcode != IORING_REGISTER_PROBE) { /* bring the ctx back to life */ percpu_ref_reinit(&ctx->refs); out: diff --git a/include/uapi/linux/io_uring.h b/include/uapi/linux/io_uring.h index 66772a90a7f2..c9629bf48695 100644 --- a/include/uapi/linux/io_uring.h +++ b/include/uapi/linux/io_uring.h @@ -193,6 +193,7 @@ struct io_uring_params { #define IORING_UNREGISTER_EVENTFD 5 #define IORING_REGISTER_FILES_UPDATE 6 #define IORING_REGISTER_EVENTFD_ASYNC 7 +#define IORING_REGISTER_PROBE 8
struct io_uring_files_update { __u32 offset; @@ -200,4 +201,21 @@ struct io_uring_files_update { __aligned_u64 /* __s32 * */ fds; };
+#define IO_URING_OP_SUPPORTED (1U << 0) + +struct io_uring_probe_op { + __u8 op; + __u8 resv; + __u16 flags; /* IO_URING_OP_* flags */ + __u32 resv2; +}; + +struct io_uring_probe { + __u8 last_op; /* last opcode supported */ + __u8 ops_len; /* length of ops[] array below */ + __u16 resv; + __u32 resv2[3]; + struct io_uring_probe_op ops[0]; +}; + #endif
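A userspace sketch of consuming the probe interface (raw syscall; 256 matches the nr_args cap above, and the buffer must be zeroed since non-zero input is rejected with -EINVAL):

#include <stdlib.h>
#include <unistd.h>
#include <sys/syscall.h>
#include <linux/io_uring.h>

static int opcode_supported(int ring_fd, int opcode)
{
        size_t len = sizeof(struct io_uring_probe) +
                     256 * sizeof(struct io_uring_probe_op);
        struct io_uring_probe *p = calloc(1, len);
        int ok = 0;

        if (!p)
                return 0;
        if (syscall(__NR_io_uring_register, ring_fd,
                    IORING_REGISTER_PROBE, p, 256) == 0)
                ok = opcode < p->ops_len &&
                     (p->ops[opcode].flags & IO_URING_OP_SUPPORTED);
        free(p);
        return ok;
}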
From: Pavel Begunkov asml.silence@gmail.com
mainline inclusion from mainline-5.6-rc1 commit 711be0312df4d350fb5bf1671c132cccae5aaf9a category: feature bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27 CVE: NA ---------------------------
Move setting ctx->drain_next to the only place it can be set: when a linked non-head request is submitted. The same goes for checking it; it's only interesting for the head of a link or a non-linked request.
No functional changes here. This removes some code from the common path and also removes the REQ_F_DRAIN_LINK flag, as it is no longer needed.
Signed-off-by: Pavel Begunkov asml.silence@gmail.com Signed-off-by: Jens Axboe axboe@kernel.dk Signed-off-by: yangerkun yangerkun@huawei.com Reviewed-by: zhangyi (F) yi.zhang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- fs/io_uring.c | 41 +++++++++++++++++++++-------------------- 1 file changed, 21 insertions(+), 20 deletions(-)
diff --git a/fs/io_uring.c b/fs/io_uring.c index 3e0c6b8919ed..fa3ebfa7f3fc 100644 --- a/fs/io_uring.c +++ b/fs/io_uring.c @@ -507,7 +507,6 @@ struct io_kiocb { #define REQ_F_LINK 64 /* linked sqes */ #define REQ_F_LINK_TIMEOUT 128 /* has linked timeout */ #define REQ_F_FAIL_LINK 256 /* fail rest of links */ -#define REQ_F_DRAIN_LINK 512 /* link should be fully drained */ #define REQ_F_TIMEOUT 1024 /* timeout request */ #define REQ_F_ISREG 2048 /* regular file */ #define REQ_F_MUST_PUNT 4096 /* must be punted even for NONBLOCK */ @@ -4499,12 +4498,6 @@ static void io_queue_sqe(struct io_kiocb *req, const struct io_uring_sqe *sqe) { int ret;
- if (unlikely(req->ctx->drain_next)) { - req->flags |= REQ_F_IO_DRAIN; - req->ctx->drain_next = 0; - } - req->ctx->drain_next = (req->flags & REQ_F_DRAIN_LINK) != 0; - ret = io_req_defer(req, sqe); if (ret) { if (ret != -EIOCBQUEUED) { @@ -4571,8 +4564,10 @@ static bool io_submit_sqe(struct io_kiocb *req, const struct io_uring_sqe *sqe, if (*link) { struct io_kiocb *head = *link;
- if (sqe_flags & IOSQE_IO_DRAIN) - head->flags |= REQ_F_DRAIN_LINK | REQ_F_IO_DRAIN; + if (sqe_flags & IOSQE_IO_DRAIN) { + head->flags |= REQ_F_IO_DRAIN; + ctx->drain_next = 1; + }
if (sqe_flags & IOSQE_IO_HARDLINK) req->flags |= REQ_F_HARDLINK; @@ -4596,18 +4591,24 @@ static bool io_submit_sqe(struct io_kiocb *req, const struct io_uring_sqe *sqe, io_queue_link_head(head); *link = NULL; } - } else if (sqe_flags & (IOSQE_IO_LINK|IOSQE_IO_HARDLINK)) { - req->flags |= REQ_F_LINK; - if (sqe_flags & IOSQE_IO_HARDLINK) - req->flags |= REQ_F_HARDLINK; - - INIT_LIST_HEAD(&req->link_list); - ret = io_req_defer_prep(req, sqe); - if (ret) - req->flags |= REQ_F_FAIL_LINK; - *link = req; } else { - io_queue_sqe(req, sqe); + if (unlikely(ctx->drain_next)) { + req->flags |= REQ_F_IO_DRAIN; + req->ctx->drain_next = 0; + } + if (sqe_flags & (IOSQE_IO_LINK|IOSQE_IO_HARDLINK)) { + req->flags |= REQ_F_LINK; + if (sqe_flags & IOSQE_IO_HARDLINK) + req->flags |= REQ_F_HARDLINK; + + INIT_LIST_HEAD(&req->link_list); + ret = io_req_defer_prep(req, sqe); + if (ret) + req->flags |= REQ_F_FAIL_LINK; + *link = req; + } else { + io_queue_sqe(req, sqe); + } }
return true;
From: Pavel Begunkov asml.silence@gmail.com
mainline inclusion from mainline-5.6-rc1 commit 0791015837f1520dd72918355dcb1f1e79175255 category: feature bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27 CVE: NA ---------------------------
__io_commit_cqring() is almost always called when there is a change in the rings, so the check is rather pessimising.
Signed-off-by: Pavel Begunkov asml.silence@gmail.com Signed-off-by: Jens Axboe axboe@kernel.dk Signed-off-by: yangerkun yangerkun@huawei.com Reviewed-by: zhangyi (F) yi.zhang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- fs/io_uring.c | 12 +++++------- 1 file changed, 5 insertions(+), 7 deletions(-)
diff --git a/fs/io_uring.c b/fs/io_uring.c index fa3ebfa7f3fc..cca10716ed34 100644 --- a/fs/io_uring.c +++ b/fs/io_uring.c @@ -855,14 +855,12 @@ static void __io_commit_cqring(struct io_ring_ctx *ctx) { struct io_rings *rings = ctx->rings;
- if (ctx->cached_cq_tail != READ_ONCE(rings->cq.tail)) { - /* order cqe stores with ring update */ - smp_store_release(&rings->cq.tail, ctx->cached_cq_tail); + /* order cqe stores with ring update */ + smp_store_release(&rings->cq.tail, ctx->cached_cq_tail);
- if (wq_has_sleeper(&ctx->cq_wait)) { - wake_up_interruptible(&ctx->cq_wait); - kill_fasync(&ctx->cq_fasync, SIGIO, POLL_IN); - } + if (wq_has_sleeper(&ctx->cq_wait)) { + wake_up_interruptible(&ctx->cq_wait); + kill_fasync(&ctx->cq_fasync, SIGIO, POLL_IN); } }
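The invariant the release store preserves is that all CQE payload stores are visible before the new tail. A simplified userspace analogue with C11 atomics (the ring layout is a toy stand-in, not the kernel's):

#include <stdatomic.h>
#include <stdint.h>

struct cqe  { uint64_t user_data; int32_t res; };
struct ring { struct cqe cqes[64]; _Atomic uint32_t tail; uint32_t cached_tail; };

static void publish_cqe(struct ring *r, uint64_t user_data, int32_t res)
{
        struct cqe *cqe = &r->cqes[r->cached_tail & 63];

        cqe->user_data = user_data;             /* payload stores first... */
        cqe->res = res;
        atomic_store_explicit(&r->tail, ++r->cached_tail,
                              memory_order_release);  /* ...then publish */
}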
From: Pavel Begunkov asml.silence@gmail.com
mainline inclusion from mainline-5.6-rc1 commit b14cca0c84c760fbd39ad6bb7e1181e2df103d25 category: feature bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27 CVE: NA ---------------------------
req->ring_fd and req->ring_file are used only during the prep stage of submission, which is protected by the mutex. There is no need to store them per-request; place them in the ctx instead.
Signed-off-by: Pavel Begunkov asml.silence@gmail.com Signed-off-by: Jens Axboe axboe@kernel.dk Signed-off-by: yangerkun yangerkun@huawei.com Reviewed-by: zhangyi (F) yi.zhang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- fs/io_uring.c | 27 ++++++++++++--------------- 1 file changed, 12 insertions(+), 15 deletions(-)
diff --git a/fs/io_uring.c b/fs/io_uring.c index cca10716ed34..40cbb76ed770 100644 --- a/fs/io_uring.c +++ b/fs/io_uring.c @@ -251,6 +251,8 @@ struct io_ring_ctx { */ struct fixed_file_data *file_data; unsigned nr_user_files; + int ring_fd; + struct file *ring_file;
/* if used, fixed mapped user buffers */ unsigned nr_user_bufs; @@ -476,15 +478,10 @@ struct io_kiocb { };
struct io_async_ctx *io; - union { - /* - * ring_file is only used in the submission path, and - * llist_node is only used for poll deferred completions - */ - struct file *ring_file; - struct llist_node llist_node; - }; - int ring_fd; + /* + * llist_node is only used for poll deferred completions + */ + struct llist_node llist_node; bool has_user; bool in_async; bool needs_fixed_file; @@ -1136,7 +1133,6 @@ static struct io_kiocb *io_get_req(struct io_ring_ctx *ctx,
got_it: req->io = NULL; - req->ring_file = NULL; req->file = NULL; req->ctx = ctx; req->flags = 0; @@ -2677,7 +2673,7 @@ static int io_close_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe)
req->close.fd = READ_ONCE(sqe->fd); if (req->file->f_op == &io_uring_fops || - req->close.fd == req->ring_fd) + req->close.fd == req->ctx->ring_fd) return -EBADF;
return 0; @@ -4336,7 +4332,7 @@ static int io_grab_files(struct io_kiocb *req) int ret = -EBADF; struct io_ring_ctx *ctx = req->ctx;
- if (!req->ring_file) + if (!ctx->ring_file) return -EBADF;
rcu_read_lock(); @@ -4347,7 +4343,7 @@ static int io_grab_files(struct io_kiocb *req) * the fd has changed since we started down this path, and disallow * this operation if it has. */ - if (fcheck(req->ring_fd) == req->ring_file) { + if (fcheck(ctx->ring_fd) == ctx->ring_file) { list_add(&req->inflight_entry, &ctx->inflight_list); req->flags |= REQ_F_INFLIGHT; req->work.files = current->files; @@ -4719,6 +4715,9 @@ static int io_submit_sqes(struct io_ring_ctx *ctx, unsigned int nr, statep = &state; }
+ ctx->ring_fd = ring_fd; + ctx->ring_file = ring_file; + for (i = 0; i < nr; i++) { const struct io_uring_sqe *sqe; struct io_kiocb *req; @@ -4751,8 +4750,6 @@ static int io_submit_sqes(struct io_ring_ctx *ctx, unsigned int nr, } }
- req->ring_file = ring_file; - req->ring_fd = ring_fd; req->has_user = *mm != NULL; req->in_async = async; req->needs_fixed_file = async;
From: Jens Axboe axboe@kernel.dk
mainline inclusion from mainline-5.6-rc1 commit e46a7950d362231a4d0b078af5f4c109b8e5ac9e category: feature bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27 CVE: NA ---------------------------
We currently flush early, but if we have something in progress and a new switch is scheduled, we need to ensure we flush after our teardown as well.
Signed-off-by: Jens Axboe axboe@kernel.dk Signed-off-by: yangerkun yangerkun@huawei.com Reviewed-by: zhangyi (F) yi.zhang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- fs/io_uring.c | 5 ++++- 1 file changed, 4 insertions(+), 1 deletion(-)
diff --git a/fs/io_uring.c b/fs/io_uring.c index 40cbb76ed770..ed348a47cea1 100644 --- a/fs/io_uring.c +++ b/fs/io_uring.c @@ -5030,11 +5030,14 @@ static int io_sqe_files_unregister(struct io_ring_ctx *ctx) return -ENXIO;
/* protect against inflight atomic switch, which drops the ref */ - flush_work(&data->ref_work); percpu_ref_get(&data->refs); + /* wait for existing switches */ + flush_work(&data->ref_work); percpu_ref_kill_and_confirm(&data->refs, io_file_ref_kill); wait_for_completion(&data->done); percpu_ref_put(&data->refs); + /* flush potential new switch */ + flush_work(&data->ref_work); percpu_ref_exit(&data->refs);
__io_sqe_files_unregister(ctx);
From: Pavel Begunkov asml.silence@gmail.com
mainline inclusion from mainline-5.6-rc1 commit 87987898a1dbc69b1138f7c10eb9abd655c03396 category: feature bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27 CVE: NA ---------------------------
A request can get into the defer list only once, so there is no need to mark it as drained; remove the flag. It was probably left over after extracting __need_defer() for use in timeouts.
Signed-off-by: Pavel Begunkov asml.silence@gmail.com Signed-off-by: Jens Axboe axboe@kernel.dk Signed-off-by: yangerkun yangerkun@huawei.com Reviewed-by: zhangyi (F) yi.zhang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- fs/io_uring.c | 7 ++----- 1 file changed, 2 insertions(+), 5 deletions(-)
diff --git a/fs/io_uring.c b/fs/io_uring.c index ed348a47cea1..4a70be498332 100644 --- a/fs/io_uring.c +++ b/fs/io_uring.c @@ -500,7 +500,6 @@ struct io_kiocb { #define REQ_F_FIXED_FILE 4 /* ctx owns file */ #define REQ_F_LINK_NEXT 8 /* already grabbed next link */ #define REQ_F_IO_DRAIN 16 /* drain existing IO first */ -#define REQ_F_IO_DRAINED 32 /* drain done */ #define REQ_F_LINK 64 /* linked sqes */ #define REQ_F_LINK_TIMEOUT 128 /* has linked timeout */ #define REQ_F_FAIL_LINK 256 /* fail rest of links */ @@ -812,7 +811,7 @@ static inline bool __req_need_defer(struct io_kiocb *req)
static inline bool req_need_defer(struct io_kiocb *req) { - if ((req->flags & (REQ_F_IO_DRAIN|REQ_F_IO_DRAINED)) == REQ_F_IO_DRAIN) + if (unlikely(req->flags & REQ_F_IO_DRAIN)) return __req_need_defer(req);
return false; @@ -934,10 +933,8 @@ static void io_commit_cqring(struct io_ring_ctx *ctx)
__io_commit_cqring(ctx);
- while ((req = io_get_deferred_req(ctx)) != NULL) { - req->flags |= REQ_F_IO_DRAINED; + while ((req = io_get_deferred_req(ctx)) != NULL) io_queue_async_work(req); - } }
static struct io_uring_cqe *io_get_cqring(struct io_ring_ctx *ctx)
From: Pavel Begunkov asml.silence@gmail.com
mainline inclusion from mainline-5.6-rc1 commit 6b47ee6ecab142f938a40bf3b297abac74218ee2 category: feature bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27 CVE: NA ---------------------------
For each IOSQE_* flag there is a corresponding REQ_F_* flag. And there is a repetitive pattern of their translation: e.g. if (sqe->flags & SQE_FLAG*) req->flags |= REQ_F_FLAG*
Use the same numeric values/bits for them and copy the flags in one go instead of handling each one manually.
Signed-off-by: Pavel Begunkov asml.silence@gmail.com Signed-off-by: Jens Axboe axboe@kernel.dk Signed-off-by: yangerkun yangerkun@huawei.com Reviewed-by: zhangyi (F) yi.zhang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- fs/io_uring.c | 92 ++++++++++++++++++++++++----------- include/uapi/linux/io_uring.h | 23 +++++++-- 2 files changed, 81 insertions(+), 34 deletions(-)
diff --git a/fs/io_uring.c b/fs/io_uring.c index 4a70be498332..fd67fbe0ffea 100644 --- a/fs/io_uring.c +++ b/fs/io_uring.c @@ -46,6 +46,7 @@ #include <linux/compat.h> #include <linux/refcount.h> #include <linux/uio.h> +#include <linux/bits.h>
#include <linux/sched/signal.h> #include <linux/fs.h> @@ -453,6 +454,65 @@ struct io_async_ctx { }; };
+enum { + REQ_F_FIXED_FILE_BIT = IOSQE_FIXED_FILE_BIT, + REQ_F_IO_DRAIN_BIT = IOSQE_IO_DRAIN_BIT, + REQ_F_LINK_BIT = IOSQE_IO_LINK_BIT, + REQ_F_HARDLINK_BIT = IOSQE_IO_HARDLINK_BIT, + REQ_F_FORCE_ASYNC_BIT = IOSQE_ASYNC_BIT, + + REQ_F_LINK_NEXT_BIT, + REQ_F_FAIL_LINK_BIT, + REQ_F_INFLIGHT_BIT, + REQ_F_CUR_POS_BIT, + REQ_F_NOWAIT_BIT, + REQ_F_IOPOLL_COMPLETED_BIT, + REQ_F_LINK_TIMEOUT_BIT, + REQ_F_TIMEOUT_BIT, + REQ_F_ISREG_BIT, + REQ_F_MUST_PUNT_BIT, + REQ_F_TIMEOUT_NOSEQ_BIT, + REQ_F_COMP_LOCKED_BIT, +}; + +enum { + /* ctx owns file */ + REQ_F_FIXED_FILE = BIT(REQ_F_FIXED_FILE_BIT), + /* drain existing IO first */ + REQ_F_IO_DRAIN = BIT(REQ_F_IO_DRAIN_BIT), + /* linked sqes */ + REQ_F_LINK = BIT(REQ_F_LINK_BIT), + /* doesn't sever on completion < 0 */ + REQ_F_HARDLINK = BIT(REQ_F_HARDLINK_BIT), + /* IOSQE_ASYNC */ + REQ_F_FORCE_ASYNC = BIT(REQ_F_FORCE_ASYNC_BIT), + + /* already grabbed next link */ + REQ_F_LINK_NEXT = BIT(REQ_F_LINK_NEXT_BIT), + /* fail rest of links */ + REQ_F_FAIL_LINK = BIT(REQ_F_FAIL_LINK_BIT), + /* on inflight list */ + REQ_F_INFLIGHT = BIT(REQ_F_INFLIGHT_BIT), + /* read/write uses file position */ + REQ_F_CUR_POS = BIT(REQ_F_CUR_POS_BIT), + /* must not punt to workers */ + REQ_F_NOWAIT = BIT(REQ_F_NOWAIT_BIT), + /* polled IO has completed */ + REQ_F_IOPOLL_COMPLETED = BIT(REQ_F_IOPOLL_COMPLETED_BIT), + /* has linked timeout */ + REQ_F_LINK_TIMEOUT = BIT(REQ_F_LINK_TIMEOUT_BIT), + /* timeout request */ + REQ_F_TIMEOUT = BIT(REQ_F_TIMEOUT_BIT), + /* regular file */ + REQ_F_ISREG = BIT(REQ_F_ISREG_BIT), + /* must be punted even for NONBLOCK */ + REQ_F_MUST_PUNT = BIT(REQ_F_MUST_PUNT_BIT), + /* no timeout sequence */ + REQ_F_TIMEOUT_NOSEQ = BIT(REQ_F_TIMEOUT_NOSEQ_BIT), + /* completion under lock */ + REQ_F_COMP_LOCKED = BIT(REQ_F_COMP_LOCKED_BIT), +}; + /* * NOTE! Each of the iocb union members has the file pointer * as the first entry in their struct definition. So you can @@ -495,23 +555,6 @@ struct io_kiocb { struct list_head link_list; unsigned int flags; refcount_t refs; -#define REQ_F_NOWAIT 1 /* must not punt to workers */ -#define REQ_F_IOPOLL_COMPLETED 2 /* polled IO has completed */ -#define REQ_F_FIXED_FILE 4 /* ctx owns file */ -#define REQ_F_LINK_NEXT 8 /* already grabbed next link */ -#define REQ_F_IO_DRAIN 16 /* drain existing IO first */ -#define REQ_F_LINK 64 /* linked sqes */ -#define REQ_F_LINK_TIMEOUT 128 /* has linked timeout */ -#define REQ_F_FAIL_LINK 256 /* fail rest of links */ -#define REQ_F_TIMEOUT 1024 /* timeout request */ -#define REQ_F_ISREG 2048 /* regular file */ -#define REQ_F_MUST_PUNT 4096 /* must be punted even for NONBLOCK */ -#define REQ_F_TIMEOUT_NOSEQ 8192 /* no timeout sequence */ -#define REQ_F_INFLIGHT 16384 /* on inflight list */ -#define REQ_F_COMP_LOCKED 32768 /* completion under lock */ -#define REQ_F_HARDLINK 65536 /* doesn't sever on completion < 0 */ -#define REQ_F_FORCE_ASYNC 131072 /* IOSQE_ASYNC */ -#define REQ_F_CUR_POS 262144 /* read/write uses file position */ u64 user_data; u32 result; u32 sequence; @@ -4296,9 +4339,6 @@ static int io_req_set_file(struct io_submit_state *state, struct io_kiocb *req, flags = READ_ONCE(sqe->flags); fd = READ_ONCE(sqe->fd);
- if (flags & IOSQE_IO_DRAIN) - req->flags |= REQ_F_IO_DRAIN; - if (!io_req_needs_file(req, fd)) return 0;
@@ -4534,8 +4574,9 @@ static bool io_submit_sqe(struct io_kiocb *req, const struct io_uring_sqe *sqe, ret = -EINVAL; goto err_req; } - if (sqe_flags & IOSQE_ASYNC) - req->flags |= REQ_F_FORCE_ASYNC; + /* same numerical values with corresponding REQ_F_*, safe to copy */ + req->flags |= sqe_flags & (IOSQE_IO_DRAIN|IOSQE_IO_HARDLINK| + IOSQE_ASYNC);
ret = io_req_set_file(state, req, sqe); if (unlikely(ret)) { @@ -4559,10 +4600,6 @@ static bool io_submit_sqe(struct io_kiocb *req, const struct io_uring_sqe *sqe, head->flags |= REQ_F_IO_DRAIN; ctx->drain_next = 1; } - - if (sqe_flags & IOSQE_IO_HARDLINK) - req->flags |= REQ_F_HARDLINK; - if (io_alloc_async_ctx(req)) { ret = -EAGAIN; goto err_req; @@ -4589,9 +4626,6 @@ static bool io_submit_sqe(struct io_kiocb *req, const struct io_uring_sqe *sqe, } if (sqe_flags & (IOSQE_IO_LINK|IOSQE_IO_HARDLINK)) { req->flags |= REQ_F_LINK; - if (sqe_flags & IOSQE_IO_HARDLINK) - req->flags |= REQ_F_HARDLINK; - INIT_LIST_HEAD(&req->link_list); ret = io_req_defer_prep(req, sqe); if (ret) diff --git a/include/uapi/linux/io_uring.h b/include/uapi/linux/io_uring.h index c9629bf48695..f51bb2291185 100644 --- a/include/uapi/linux/io_uring.h +++ b/include/uapi/linux/io_uring.h @@ -45,14 +45,27 @@ struct io_uring_sqe { }; };
+enum { + IOSQE_FIXED_FILE_BIT, + IOSQE_IO_DRAIN_BIT, + IOSQE_IO_LINK_BIT, + IOSQE_IO_HARDLINK_BIT, + IOSQE_ASYNC_BIT, +}; + /* * sqe->flags */ -#define IOSQE_FIXED_FILE (1U << 0) /* use fixed fileset */ -#define IOSQE_IO_DRAIN (1U << 1) /* issue after inflight IO */ -#define IOSQE_IO_LINK (1U << 2) /* links next sqe */ -#define IOSQE_IO_HARDLINK (1U << 3) /* like LINK, but stronger */ -#define IOSQE_ASYNC (1U << 4) /* always go async */ +/* use fixed fileset */ +#define IOSQE_FIXED_FILE (1U << IOSQE_FIXED_FILE_BIT) +/* issue after inflight IO */ +#define IOSQE_IO_DRAIN (1U << IOSQE_IO_DRAIN_BIT) +/* links next sqe */ +#define IOSQE_IO_LINK (1U << IOSQE_IO_LINK_BIT) +/* like LINK, but stronger */ +#define IOSQE_IO_HARDLINK (1U << IOSQE_IO_HARDLINK_BIT) +/* always go async */ +#define IOSQE_ASYNC (1U << IOSQE_ASYNC_BIT)
/* * io_uring_setup() flags
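A minimal standalone illustration of the shared-bit trick (the REQ_F_-style names are mimicked locally, since the real ones are kernel-internal): defining request flags from the same bit numbers turns per-flag branches into one masked copy.

#include <linux/io_uring.h>

enum {
        SKETCH_F_IO_DRAIN    = 1U << IOSQE_IO_DRAIN_BIT,
        SKETCH_F_HARDLINK    = 1U << IOSQE_IO_HARDLINK_BIT,
        SKETCH_F_FORCE_ASYNC = 1U << IOSQE_ASYNC_BIT,
};

static inline unsigned int copy_sqe_flags(unsigned int sqe_flags)
{
        /* same numerical values, so a single AND replaces three if-blocks */
        return sqe_flags & (IOSQE_IO_DRAIN | IOSQE_IO_HARDLINK | IOSQE_ASYNC);
}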
From: Pavel Begunkov asml.silence@gmail.com
mainline inclusion from mainline-5.6-rc1 commit 0463b6c58e557118d602b2f225fa3bbe9b6f3560 category: feature bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27 CVE: NA ---------------------------
Don't rely on the implicit ordering of IORING_OP_* entries; explicitly place them at the right index in io_op_defs. The former comments are now part of the code and can never go stale.
Signed-off-by: Pavel Begunkov asml.silence@gmail.com Signed-off-by: Jens Axboe axboe@kernel.dk
Conflicts: fs/io_uring.c [ commit cebdb98617ae("io_uring: add support for IORING_OP_OPENAT2") is not applied ] Signed-off-by: yangerkun yangerkun@huawei.com Reviewed-by: zhangyi (F) yi.zhang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- fs/io_uring.c | 88 ++++++++++++++++----------------------------------- 1 file changed, 28 insertions(+), 60 deletions(-)
diff --git a/fs/io_uring.c b/fs/io_uring.c index fd67fbe0ffea..c026ba6359ab 100644 --- a/fs/io_uring.c +++ b/fs/io_uring.c @@ -605,145 +605,113 @@ struct io_op_def { };
static const struct io_op_def io_op_defs[] = { - { - /* IORING_OP_NOP */ - }, - { - /* IORING_OP_READV */ + [IORING_OP_NOP] = {}, + [IORING_OP_READV] = { .async_ctx = 1, .needs_mm = 1, .needs_file = 1, .unbound_nonreg_file = 1, }, - { - /* IORING_OP_WRITEV */ + [IORING_OP_WRITEV] = { .async_ctx = 1, .needs_mm = 1, .needs_file = 1, .hash_reg_file = 1, .unbound_nonreg_file = 1, }, - { - /* IORING_OP_FSYNC */ + [IORING_OP_FSYNC] = { .needs_file = 1, }, - { - /* IORING_OP_READ_FIXED */ + [IORING_OP_READ_FIXED] = { .needs_file = 1, .unbound_nonreg_file = 1, }, - { - /* IORING_OP_WRITE_FIXED */ + [IORING_OP_WRITE_FIXED] = { .needs_file = 1, .hash_reg_file = 1, .unbound_nonreg_file = 1, }, - { - /* IORING_OP_POLL_ADD */ + [IORING_OP_POLL_ADD] = { .needs_file = 1, .unbound_nonreg_file = 1, }, - { - /* IORING_OP_POLL_REMOVE */ - }, - { - /* IORING_OP_SYNC_FILE_RANGE */ + [IORING_OP_POLL_REMOVE] = {}, + [IORING_OP_SYNC_FILE_RANGE] = { .needs_file = 1, }, - { - /* IORING_OP_SENDMSG */ + [IORING_OP_SENDMSG] = { .async_ctx = 1, .needs_mm = 1, .needs_file = 1, .unbound_nonreg_file = 1, }, - { - /* IORING_OP_RECVMSG */ + [IORING_OP_RECVMSG] = { .async_ctx = 1, .needs_mm = 1, .needs_file = 1, .unbound_nonreg_file = 1, }, - { - /* IORING_OP_TIMEOUT */ + [IORING_OP_TIMEOUT] = { .async_ctx = 1, .needs_mm = 1, }, - { - /* IORING_OP_TIMEOUT_REMOVE */ - }, - { - /* IORING_OP_ACCEPT */ + [IORING_OP_TIMEOUT_REMOVE] = {}, + [IORING_OP_ACCEPT] = { .needs_mm = 1, .needs_file = 1, .unbound_nonreg_file = 1, }, - { - /* IORING_OP_ASYNC_CANCEL */ - }, - { - /* IORING_OP_LINK_TIMEOUT */ + [IORING_OP_ASYNC_CANCEL] = {}, + [IORING_OP_LINK_TIMEOUT] = { .async_ctx = 1, .needs_mm = 1, }, - { - /* IORING_OP_CONNECT */ + [IORING_OP_CONNECT] = { .async_ctx = 1, .needs_mm = 1, .needs_file = 1, .unbound_nonreg_file = 1, }, - { - /* IORING_OP_FALLOCATE */ + [IORING_OP_FALLOCATE] = { .needs_file = 1, }, - { - /* IORING_OP_OPENAT */ + [IORING_OP_OPENAT] = { .needs_file = 1, .fd_non_neg = 1, }, - { - /* IORING_OP_CLOSE */ + [IORING_OP_CLOSE] = { .needs_file = 1, }, - { - /* IORING_OP_FILES_UPDATE */ + [IORING_OP_FILES_UPDATE] = { .needs_mm = 1, }, - { - /* IORING_OP_STATX */ + [IORING_OP_STATX] = { .needs_mm = 1, .needs_file = 1, .fd_non_neg = 1, }, - { - /* IORING_OP_READ */ + [IORING_OP_READ] = { .needs_mm = 1, .needs_file = 1, .unbound_nonreg_file = 1, }, - { - /* IORING_OP_WRITE */ + [IORING_OP_WRITE] = { .needs_mm = 1, .needs_file = 1, .unbound_nonreg_file = 1, }, - { - /* IORING_OP_FADVISE */ + [IORING_OP_FADVISE] = { .needs_file = 1, }, - { - /* IORING_OP_MADVISE */ + [IORING_OP_MADVISE] = { .needs_mm = 1, }, - { - /* IORING_OP_SEND */ + [IORING_OP_SEND] = { .needs_mm = 1, .needs_file = 1, .unbound_nonreg_file = 1, }, - { - /* IORING_OP_RECV */ + [IORING_OP_RECV] = { .needs_mm = 1, .needs_file = 1, .unbound_nonreg_file = 1,
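A generic C sketch of why designated array initializers are safer here (toy names, unrelated to io_uring): each entry lands at the index named by its enum constant, so reordering entries or leaving gaps cannot silently misalign the table.

enum { OP_A, OP_B, OP_C, OP_COUNT };

static const struct { unsigned int needs_file : 1; } defs[OP_COUNT] = {
        [OP_C] = { .needs_file = 1 },   /* source order no longer matters */
        [OP_A] = { .needs_file = 0 },
        /* OP_B is implicitly zero-initialized */
};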
From: Pavel Begunkov asml.silence@gmail.com
mainline inclusion from mainline-5.6-rc1 commit 1118591ab883f46df4ab614cc976bc4c8e04a464 category: feature bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27 CVE: NA ---------------------------
Whenever IOSQE_ASYNC is set, requests will be punted to async without getting into io_issue_req() and without proper preparation done (e.g. io_req_defer_prep()). Hence they will be left uninitialised.
Prepare them before punting.
Signed-off-by: Pavel Begunkov asml.silence@gmail.com Signed-off-by: Jens Axboe axboe@kernel.dk Signed-off-by: yangerkun yangerkun@huawei.com Reviewed-by: zhangyi (F) yi.zhang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- fs/io_uring.c | 4 ++++ 1 file changed, 4 insertions(+)
diff --git a/fs/io_uring.c b/fs/io_uring.c index c026ba6359ab..59281a91b30a 100644 --- a/fs/io_uring.c +++ b/fs/io_uring.c @@ -4500,11 +4500,15 @@ static void io_queue_sqe(struct io_kiocb *req, const struct io_uring_sqe *sqe) ret = io_req_defer(req, sqe); if (ret) { if (ret != -EIOCBQUEUED) { +fail_req: io_cqring_add_event(req, ret); req_set_fail_links(req); io_double_put_req(req); } } else if (req->flags & REQ_F_FORCE_ASYNC) { + ret = io_req_defer_prep(req, sqe); + if (unlikely(ret < 0)) + goto fail_req; /* * Never try inline submit of IOSQE_ASYNC is set, go straight * to async execution.
From: Pavel Begunkov asml.silence@gmail.com
mainline inclusion from mainline-5.6-rc1 commit 86a761f81ec87a96572214f5db606f60d36aaf08 category: feature bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27 CVE: NA ---------------------------
REQ_F_FORCE_ASYNC is checked only for the head of a link. Fix it.
Signed-off-by: Pavel Begunkov asml.silence@gmail.com Signed-off-by: Jens Axboe axboe@kernel.dk Signed-off-by: yangerkun yangerkun@huawei.com Reviewed-by: zhangyi (F) yi.zhang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- fs/io_uring.c | 4 ++++ 1 file changed, 4 insertions(+)
diff --git a/fs/io_uring.c b/fs/io_uring.c index 59281a91b30a..2fb71b69cff3 100644 --- a/fs/io_uring.c +++ b/fs/io_uring.c @@ -4454,6 +4454,7 @@ static void __io_queue_sqe(struct io_kiocb *req, const struct io_uring_sqe *sqe) */ if (ret == -EAGAIN && (!(req->flags & REQ_F_NOWAIT) || (req->flags & REQ_F_MUST_PUNT))) { +punt: if (req->work.flags & IO_WQ_WORK_NEEDS_FILES) { ret = io_grab_files(req); if (ret) @@ -4489,6 +4490,9 @@ static void __io_queue_sqe(struct io_kiocb *req, const struct io_uring_sqe *sqe) if (nxt) { req = nxt; nxt = NULL; + + if (req->flags & REQ_F_FORCE_ASYNC) + goto punt; goto again; } }
From: Jens Axboe axboe@kernel.dk
mainline inclusion from mainline-5.6-rc1 commit 980ad26304abf11e78caaa68023411b9c088b848 category: feature bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27 CVE: NA ---------------------------
For the non-vectored variant of READV/WRITEV, we don't need to set up an async io context, and we flag that appropriately in the io_op_defs array. However, in fixing this for the 5.5 kernel in commit 74566df3a71c we didn't have these opcodes, so the check there was added just for the READ_FIXED and WRITE_FIXED opcodes. Replace that check with a single check for needing async context, which covers all four of these read/write variants that don't use an iovec.
Signed-off-by: Jens Axboe axboe@kernel.dk Signed-off-by: yangerkun yangerkun@huawei.com Reviewed-by: zhangyi (F) yi.zhang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- fs/io_uring.c | 3 +-- 1 file changed, 1 insertion(+), 2 deletions(-)
diff --git a/fs/io_uring.c b/fs/io_uring.c index 2fb71b69cff3..58a204cb5337 100644 --- a/fs/io_uring.c +++ b/fs/io_uring.c @@ -2108,8 +2108,7 @@ static int io_setup_async_rw(struct io_kiocb *req, ssize_t io_size, struct iovec *iovec, struct iovec *fast_iov, struct iov_iter *iter) { - if (req->opcode == IORING_OP_READ_FIXED || - req->opcode == IORING_OP_WRITE_FIXED) + if (!io_op_defs[req->opcode].async_ctx) return 0; if (!req->io && io_alloc_async_ctx(req)) return -ENOMEM;
From: Pavel Begunkov asml.silence@gmail.com
mainline inclusion from mainline-5.6-rc1 commit 8cdf2193a3335b4cfb6e023b41ac293d0843d287 category: feature bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27 CVE: NA ---------------------------
Draining the middle of a link is tricky, so leave a comment there.
Signed-off-by: Pavel Begunkov asml.silence@gmail.com Signed-off-by: Jens Axboe axboe@kernel.dk Signed-off-by: yangerkun yangerkun@huawei.com Reviewed-by: zhangyi (F) yi.zhang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- fs/io_uring.c | 7 +++++++ 1 file changed, 7 insertions(+)
diff --git a/fs/io_uring.c b/fs/io_uring.c index 58a204cb5337..14ca1fadd7b5 100644 --- a/fs/io_uring.c +++ b/fs/io_uring.c @@ -4571,6 +4571,13 @@ static bool io_submit_sqe(struct io_kiocb *req, const struct io_uring_sqe *sqe, if (*link) { struct io_kiocb *head = *link;
+ /* + * Taking sequential execution of a link, draining both sides + * of the link also fullfils IOSQE_IO_DRAIN semantics for all + * requests in the link. So, it drains the head and the + * next after the link request. The last one is done via + * drain_next flag to persist the effect across calls. + */ if (sqe_flags & IOSQE_IO_DRAIN) { head->flags |= REQ_F_IO_DRAIN; ctx->drain_next = 1;
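An illustrative pair of SQEs matching the semantics in that comment (a hypothetical helper without liburing; fd, buf and len are assumptions): drain on the head makes the whole chain wait for earlier requests, and drain_next extends the barrier to the request after the chain.

#include <string.h>
#include <linux/io_uring.h>

static void prep_drained_link(struct io_uring_sqe *sqe0, struct io_uring_sqe *sqe1,
                              int fd, void *buf, unsigned int len)
{
        memset(sqe0, 0, sizeof(*sqe0));
        sqe0->opcode = IORING_OP_FSYNC;
        sqe0->fd = fd;
        sqe0->flags = IOSQE_IO_DRAIN | IOSQE_IO_LINK;   /* barrier + link head */

        memset(sqe1, 0, sizeof(*sqe1));
        sqe1->opcode = IORING_OP_READ;                  /* runs after the fsync */
        sqe1->fd = fd;
        sqe1->addr = (unsigned long)buf;
        sqe1->len = len;
}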
From: Pavel Begunkov asml.silence@gmail.com
mainline inclusion from mainline-5.6-rc1 commit 9466f43741bc08edd7b1bee642dd6f5561091634 category: feature bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27 CVE: NA ---------------------------
In case of out of memory, the second argument of percpu_ref_put_many() in io_submit_sqes() may evaluate to "nr - (-EAGAIN)", which is clearly wrong.
Fixes: 2b85edfc0c90 ("io_uring: batch getting pcpu references") Signed-off-by: Pavel Begunkov asml.silence@gmail.com Signed-off-by: Jens Axboe axboe@kernel.dk Signed-off-by: yangerkun yangerkun@huawei.com Reviewed-by: zhangyi (F) yi.zhang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- fs/io_uring.c | 7 +++++-- 1 file changed, 5 insertions(+), 2 deletions(-)
diff --git a/fs/io_uring.c b/fs/io_uring.c index 14ca1fadd7b5..d3f6e3778392 100644 --- a/fs/io_uring.c +++ b/fs/io_uring.c @@ -4772,8 +4772,11 @@ static int io_submit_sqes(struct io_ring_ctx *ctx, unsigned int nr, break; }
- if (submitted != nr) - percpu_ref_put_many(&ctx->refs, nr - submitted); + if (unlikely(submitted != nr)) { + int ref_used = (submitted == -EAGAIN) ? 0 : submitted; + + percpu_ref_put_many(&ctx->refs, nr - ref_used); + } if (link) io_queue_link_head(link); if (statep)
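A standalone worked example of the arithmetic (plain C, not kernel code), showing why the -EAGAIN case must be translated to zero used references:

#include <errno.h>
#include <stdio.h>

int main(void)
{
        int nr = 8, submitted = -EAGAIN;        /* -EAGAIN == -11 on Linux */
        int old_puts = nr - submitted;          /* 8 - (-11) = 19: over-put */
        int ref_used = (submitted == -EAGAIN) ? 0 : submitted;
        int new_puts = nr - ref_used;           /* 8 - 0 = 8: correct */

        printf("old: %d puts, fixed: %d puts\n", old_puts, new_puts);
        return 0;
}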
From: Jens Axboe axboe@kernel.dk
mainline inclusion from mainline-5.6-rc1 commit 848f7e1887c46f21679c2c12b9e8022f17750721 category: feature bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27 CVE: NA ---------------------------
In preparation for sharing an io-wq across different users, add a reference count that manages its destruction.
Reviewed-by: Pavel Begunkov asml.silence@gmail.com Signed-off-by: Jens Axboe axboe@kernel.dk Signed-off-by: yangerkun yangerkun@huawei.com Reviewed-by: zhangyi (F) yi.zhang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- fs/io-wq.c | 11 ++++++++++- 1 file changed, 10 insertions(+), 1 deletion(-)
diff --git a/fs/io-wq.c b/fs/io-wq.c index 09896c7e4205..ec467c17e160 100644 --- a/fs/io-wq.c +++ b/fs/io-wq.c @@ -114,6 +114,8 @@ struct io_wq { struct mm_struct *mm; refcount_t refs; struct completion done; + + refcount_t use_refs; };
static bool io_worker_get(struct io_worker *worker) @@ -1074,6 +1076,7 @@ struct io_wq *io_wq_create(unsigned bounded, struct io_wq_data *data) ret = -ENOMEM; goto err; } + refcount_set(&wq->use_refs, 1); reinit_completion(&wq->done); return wq; } @@ -1094,7 +1097,7 @@ static bool io_wq_worker_wake(struct io_worker *worker, void *data) return false; }
-void io_wq_destroy(struct io_wq *wq) +static void __io_wq_destroy(struct io_wq *wq) { int node;
@@ -1114,3 +1117,9 @@ void io_wq_destroy(struct io_wq *wq) kfree(wq->wqes); kfree(wq); } + +void io_wq_destroy(struct io_wq *wq) +{ + if (refcount_dec_and_test(&wq->use_refs)) + __io_wq_destroy(wq); +}
From: Jens Axboe axboe@kernel.dk
mainline inclusion from mainline-5.6-rc1 commit cccf0ee834559ae0b327b40290e14f6a2a017177 category: feature bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27 CVE: NA ---------------------------
We currently set up the io_wq with a static set of mm and creds. Even for a single-use io-wq per io_uring, this is suboptimal as we may have multiple enters of the ring. For sharing the io-wq backend, it doesn't work at all.
Switch to passing in the creds and mm when the work item is set up. This means that async work is no longer deferred to the io_uring mm and creds; it is done with the current mm and creds.
Flag this behavior with IORING_FEAT_CUR_PERSONALITY, so applications know they can rely on the current personality (mm and creds) being the same for direct issue and async issue.
Reviewed-by: Stefan Metzmacher metze@samba.org Signed-off-by: Jens Axboe axboe@kernel.dk Signed-off-by: yangerkun yangerkun@huawei.com Reviewed-by: zhangyi (F) yi.zhang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- fs/io-wq.c | 68 +++++++++++++++++++++++------------ fs/io-wq.h | 7 ++-- fs/io_uring.c | 36 ++++++++++++++++--- include/uapi/linux/io_uring.h | 1 + 4 files changed, 82 insertions(+), 30 deletions(-)
diff --git a/fs/io-wq.c b/fs/io-wq.c index ec467c17e160..0596627c1c0b 100644 --- a/fs/io-wq.c +++ b/fs/io-wq.c @@ -57,7 +57,8 @@ struct io_worker {
struct rcu_head rcu; struct mm_struct *mm; - const struct cred *creds; + const struct cred *cur_creds; + const struct cred *saved_creds; struct files_struct *restore_files; };
@@ -110,8 +111,6 @@ struct io_wq {
struct task_struct *manager; struct user_struct *user; - const struct cred *creds; - struct mm_struct *mm; refcount_t refs; struct completion done;
@@ -138,9 +137,9 @@ static bool __io_worker_unuse(struct io_wqe *wqe, struct io_worker *worker) { bool dropped_lock = false;
- if (worker->creds) { - revert_creds(worker->creds); - worker->creds = NULL; + if (worker->saved_creds) { + revert_creds(worker->saved_creds); + worker->cur_creds = worker->saved_creds = NULL; }
if (current->files != worker->restore_files) { @@ -399,6 +398,43 @@ static struct io_wq_work *io_get_next_work(struct io_wqe *wqe, unsigned *hash) return NULL; }
+static void io_wq_switch_mm(struct io_worker *worker, struct io_wq_work *work) +{ + if (worker->mm) { + unuse_mm(worker->mm); + mmput(worker->mm); + worker->mm = NULL; + } + if (!work->mm) { + set_fs(KERNEL_DS); + return; + } + if (mmget_not_zero(work->mm)) { + use_mm(work->mm); + if (!worker->mm) + set_fs(USER_DS); + worker->mm = work->mm; + /* hang on to this mm */ + work->mm = NULL; + return; + } + + /* failed grabbing mm, ensure work gets cancelled */ + work->flags |= IO_WQ_WORK_CANCEL; +} + +static void io_wq_switch_creds(struct io_worker *worker, + struct io_wq_work *work) +{ + const struct cred *old_creds = override_creds(work->creds); + + worker->cur_creds = work->creds; + if (worker->saved_creds) + put_cred(old_creds); /* creds set by previous switch */ + else + worker->saved_creds = old_creds; +} + static void io_worker_handle_work(struct io_worker *worker) __releases(wqe->lock) { @@ -447,18 +483,10 @@ static void io_worker_handle_work(struct io_worker *worker) current->files = work->files; task_unlock(current); } - if ((work->flags & IO_WQ_WORK_NEEDS_USER) && !worker->mm && - wq->mm) { - if (mmget_not_zero(wq->mm)) { - use_mm(wq->mm); - set_fs(USER_DS); - worker->mm = wq->mm; - } else { - work->flags |= IO_WQ_WORK_CANCEL; - } - } - if (!worker->creds) - worker->creds = override_creds(wq->creds); + if (work->mm != worker->mm) + io_wq_switch_mm(worker, work); + if (worker->cur_creds != work->creds) + io_wq_switch_creds(worker, work); /* * OK to set IO_WQ_WORK_CANCEL even for uncancellable work, * the worker function will do the right thing. @@ -1038,7 +1066,6 @@ struct io_wq *io_wq_create(unsigned bounded, struct io_wq_data *data)
/* caller must already hold a reference to this */ wq->user = data->user; - wq->creds = data->creds;
for_each_node(node) { struct io_wqe *wqe; @@ -1065,9 +1092,6 @@ struct io_wq *io_wq_create(unsigned bounded, struct io_wq_data *data)
init_completion(&wq->done);
- /* caller must have already done mmgrab() on this mm */ - wq->mm = data->mm; - wq->manager = kthread_create(io_wq_manager, wq, "io_wq_manager"); if (!IS_ERR(wq->manager)) { wake_up_process(wq->manager); diff --git a/fs/io-wq.h b/fs/io-wq.h index 1cd039af8813..167316ad447e 100644 --- a/fs/io-wq.h +++ b/fs/io-wq.h @@ -7,7 +7,6 @@ enum { IO_WQ_WORK_CANCEL = 1, IO_WQ_WORK_HAS_MM = 2, IO_WQ_WORK_HASHED = 4, - IO_WQ_WORK_NEEDS_USER = 8, IO_WQ_WORK_NEEDS_FILES = 16, IO_WQ_WORK_UNBOUND = 32, IO_WQ_WORK_INTERNAL = 64, @@ -74,6 +73,8 @@ struct io_wq_work { }; void (*func)(struct io_wq_work **); struct files_struct *files; + struct mm_struct *mm; + const struct cred *creds; unsigned flags; };
@@ -83,15 +84,15 @@ struct io_wq_work { (work)->func = _func; \ (work)->flags = 0; \ (work)->files = NULL; \ + (work)->mm = NULL; \ + (work)->creds = NULL; \ } while (0) \
typedef void (get_work_fn)(struct io_wq_work *); typedef void (put_work_fn)(struct io_wq_work *);
struct io_wq_data { - struct mm_struct *mm; struct user_struct *user; - const struct cred *creds;
get_work_fn *get_work; put_work_fn *put_work; diff --git a/fs/io_uring.c b/fs/io_uring.c index d3f6e3778392..25932635a228 100644 --- a/fs/io_uring.c +++ b/fs/io_uring.c @@ -871,6 +871,29 @@ static void __io_commit_cqring(struct io_ring_ctx *ctx) } }
+static inline void io_req_work_grab_env(struct io_kiocb *req, + const struct io_op_def *def) +{ + if (!req->work.mm && def->needs_mm) { + mmgrab(current->mm); + req->work.mm = current->mm; + } + if (!req->work.creds) + req->work.creds = get_current_cred(); +} + +static inline void io_req_work_drop_env(struct io_kiocb *req) +{ + if (req->work.mm) { + mmdrop(req->work.mm); + req->work.mm = NULL; + } + if (req->work.creds) { + put_cred(req->work.creds); + req->work.creds = NULL; + } +} + static inline bool io_prep_async_work(struct io_kiocb *req, struct io_kiocb **link) { @@ -884,8 +907,8 @@ static inline bool io_prep_async_work(struct io_kiocb *req, if (def->unbound_nonreg_file) req->work.flags |= IO_WQ_WORK_UNBOUND; } - if (def->needs_mm) - req->work.flags |= IO_WQ_WORK_NEEDS_USER; + + io_req_work_grab_env(req, def);
*link = io_prep_linked_timeout(req); return do_hashed; @@ -1176,6 +1199,8 @@ static void __io_req_aux_free(struct io_kiocb *req) else fput(req->file); } + + io_req_work_drop_env(req); }
static void __io_free_req(struct io_kiocb *req) @@ -3916,6 +3941,8 @@ static int io_req_defer_prep(struct io_kiocb *req, { ssize_t ret = 0;
+ io_req_work_grab_env(req, &io_op_defs[req->opcode]); + switch (req->opcode) { case IORING_OP_NOP: break; @@ -5667,9 +5694,7 @@ static int io_sq_offload_start(struct io_ring_ctx *ctx, goto err; }
- data.mm = ctx->sqo_mm; data.user = ctx->user; - data.creds = ctx->creds; data.get_work = io_get_work; data.put_work = io_put_work;
@@ -6468,7 +6493,8 @@ static int io_uring_create(unsigned entries, struct io_uring_params *p) goto err;
p->features = IORING_FEAT_SINGLE_MMAP | IORING_FEAT_NODROP | - IORING_FEAT_SUBMIT_STABLE | IORING_FEAT_RW_CUR_POS; + IORING_FEAT_SUBMIT_STABLE | IORING_FEAT_RW_CUR_POS | + IORING_FEAT_CUR_PERSONALITY; trace_io_uring_create(ret, ctx, p->sq_entries, p->cq_entries, p->flags); return ret; err: diff --git a/include/uapi/linux/io_uring.h b/include/uapi/linux/io_uring.h index f51bb2291185..ffba7b1bf171 100644 --- a/include/uapi/linux/io_uring.h +++ b/include/uapi/linux/io_uring.h @@ -194,6 +194,7 @@ struct io_uring_params { #define IORING_FEAT_NODROP (1U << 1) #define IORING_FEAT_SUBMIT_STABLE (1U << 2) #define IORING_FEAT_RW_CUR_POS (1U << 3) +#define IORING_FEAT_CUR_PERSONALITY (1U << 4)
/* * io_uring_register(2) opcodes and arguments
From: Pavel Begunkov asml.silence@gmail.com
mainline inclusion from mainline-5.6-rc1 commit eba6f5a330cf042bb0001f0b5e8cbf21be1b25d6 category: feature bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27 CVE: NA ---------------------------
Export a helper to attach to an existing io-wq, rather than setting up a new one. This is doable now that we have reference counted io_wq's.
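The attach only succeeds while the wq's use_refs hasn't dropped to zero, i.e. the classic inc-not-zero idiom. A minimal userspace sketch of the same idiom with C11 atomics (type and function names here are illustrative, not kernel code):

#include <stdatomic.h>
#include <stdbool.h>

struct shared { atomic_int refs; /* ... payload ... */ };

/* Take a reference only if the object is still live (refs > 0). */
static bool get_if_live(struct shared *s)
{
	int old = atomic_load(&s->refs);

	while (old != 0) {
		/* on CAS failure, 'old' is reloaded and the check repeats */
		if (atomic_compare_exchange_weak(&s->refs, &old, old + 1))
			return true;
	}
	return false;	/* already dying, caller must not attach */
}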
Reported-by: Jens Axboe axboe@kernel.dk Signed-off-by: Pavel Begunkov asml.silence@gmail.com Signed-off-by: Jens Axboe axboe@kernel.dk Signed-off-by: yangerkun yangerkun@huawei.com Reviewed-by: zhangyi (F) yi.zhang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- fs/io-wq.c | 8 ++++++++ fs/io-wq.h | 1 + 2 files changed, 9 insertions(+)
diff --git a/fs/io-wq.c b/fs/io-wq.c index 0596627c1c0b..852bedffd508 100644 --- a/fs/io-wq.c +++ b/fs/io-wq.c @@ -1115,6 +1115,14 @@ struct io_wq *io_wq_create(unsigned bounded, struct io_wq_data *data) return ERR_PTR(ret); }
+bool io_wq_get(struct io_wq *wq, struct io_wq_data *data) +{ + if (data->get_work != wq->get_work || data->put_work != wq->put_work) + return false; + + return refcount_inc_not_zero(&wq->use_refs); +} + static bool io_wq_worker_wake(struct io_worker *worker, void *data) { wake_up_process(worker->task); diff --git a/fs/io-wq.h b/fs/io-wq.h index 167316ad447e..c42602c58c56 100644 --- a/fs/io-wq.h +++ b/fs/io-wq.h @@ -99,6 +99,7 @@ struct io_wq_data { };
struct io_wq *io_wq_create(unsigned bounded, struct io_wq_data *data); +bool io_wq_get(struct io_wq *wq, struct io_wq_data *data); void io_wq_destroy(struct io_wq *wq);
void io_wq_enqueue(struct io_wq *wq, struct io_wq_work *work);
From: Pavel Begunkov asml.silence@gmail.com
mainline inclusion from mainline-5.6-rc1 commit 24369c2e3bb06d8c4e71fd6ceaf4f8a01ae79b7c category: feature bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27 CVE: NA ---------------------------
If IORING_SETUP_ATTACH_WQ is set, it expects wq_fd in io_uring_params to be a valid io_uring fd, whose io-wq will be shared with the newly created io_uring instance. If the flag is set but the io-wq can't be shared, setup fails.
This allows creation of "sibling" io_urings, where we prefer to keep the SQ/CQ private, but want to share the async backend to minimize the amount of overhead associated with having multiple rings that belong to the same backend.
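For context, a userspace sketch of setting up a sibling ring with the raw syscall interface added here (error handling trimmed; assumes UAPI headers that already carry wq_fd and IORING_SETUP_ATTACH_WQ from this series):

#include <string.h>
#include <sys/syscall.h>
#include <unistd.h>
#include <linux/io_uring.h>

static int ring_setup(unsigned entries, struct io_uring_params *p)
{
	return syscall(__NR_io_uring_setup, entries, p);
}

int make_sibling_rings(int *fd1, int *fd2)
{
	struct io_uring_params p;

	memset(&p, 0, sizeof(p));
	*fd1 = ring_setup(64, &p);		/* first ring owns the io-wq */
	if (*fd1 < 0)
		return -1;

	memset(&p, 0, sizeof(p));
	p.flags = IORING_SETUP_ATTACH_WQ;	/* share the async backend */
	p.wq_fd = *fd1;				/* ring whose io-wq we attach to */
	*fd2 = ring_setup(64, &p);		/* private SQ/CQ, shared io-wq */
	return *fd2 < 0 ? -1 : 0;
}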
Reported-by: Jens Axboe axboe@kernel.dk Reported-by: Daurnimator quae@daurnimator.com Signed-off-by: Pavel Begunkov asml.silence@gmail.com Signed-off-by: Jens Axboe axboe@kernel.dk Signed-off-by: yangerkun yangerkun@huawei.com Reviewed-by: zhangyi (F) yi.zhang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- fs/io_uring.c | 64 +++++++++++++++++++++++++++-------- include/uapi/linux/io_uring.h | 4 ++- 2 files changed, 53 insertions(+), 15 deletions(-)
diff --git a/fs/io_uring.c b/fs/io_uring.c index 25932635a228..cc4a5e92153b 100644 --- a/fs/io_uring.c +++ b/fs/io_uring.c @@ -5646,11 +5646,56 @@ static void io_get_work(struct io_wq_work *work) refcount_inc(&req->refs); }
+static int io_init_wq_offload(struct io_ring_ctx *ctx, + struct io_uring_params *p) +{ + struct io_wq_data data; + struct fd f; + struct io_ring_ctx *ctx_attach; + unsigned int concurrency; + int ret = 0; + + data.user = ctx->user; + data.get_work = io_get_work; + data.put_work = io_put_work; + + if (!(p->flags & IORING_SETUP_ATTACH_WQ)) { + /* Do QD, or 4 * CPUS, whatever is smallest */ + concurrency = min(ctx->sq_entries, 4 * num_online_cpus()); + + ctx->io_wq = io_wq_create(concurrency, &data); + if (IS_ERR(ctx->io_wq)) { + ret = PTR_ERR(ctx->io_wq); + ctx->io_wq = NULL; + } + return ret; + } + + f = fdget(p->wq_fd); + if (!f.file) + return -EBADF; + + if (f.file->f_op != &io_uring_fops) { + ret = -EINVAL; + goto out_fput; + } + + ctx_attach = f.file->private_data; + /* @io_wq is protected by holding the fd */ + if (!io_wq_get(ctx_attach->io_wq, &data)) { + ret = -EINVAL; + goto out_fput; + } + + ctx->io_wq = ctx_attach->io_wq; +out_fput: + fdput(f); + return ret; +} + static int io_sq_offload_start(struct io_ring_ctx *ctx, struct io_uring_params *p) { - struct io_wq_data data; - unsigned concurrency; int ret;
init_waitqueue_head(&ctx->sqo_wait); @@ -5694,18 +5739,9 @@ static int io_sq_offload_start(struct io_ring_ctx *ctx, goto err; }
- data.user = ctx->user; - data.get_work = io_get_work; - data.put_work = io_put_work; - - /* Do QD, or 4 * CPUS, whatever is smallest */ - concurrency = min(ctx->sq_entries, 4 * num_online_cpus()); - ctx->io_wq = io_wq_create(concurrency, &data); - if (IS_ERR(ctx->io_wq)) { - ret = PTR_ERR(ctx->io_wq); - ctx->io_wq = NULL; + ret = io_init_wq_offload(ctx, p); + if (ret) goto err; - }
return 0; err: @@ -6522,7 +6558,7 @@ static long io_uring_setup(u32 entries, struct io_uring_params __user *params)
if (p.flags & ~(IORING_SETUP_IOPOLL | IORING_SETUP_SQPOLL | IORING_SETUP_SQ_AFF | IORING_SETUP_CQSIZE | - IORING_SETUP_CLAMP)) + IORING_SETUP_CLAMP | IORING_SETUP_ATTACH_WQ)) return -EINVAL;
ret = io_uring_create(entries, &p); diff --git a/include/uapi/linux/io_uring.h b/include/uapi/linux/io_uring.h index ffba7b1bf171..4b5a3376d959 100644 --- a/include/uapi/linux/io_uring.h +++ b/include/uapi/linux/io_uring.h @@ -75,6 +75,7 @@ enum { #define IORING_SETUP_SQ_AFF (1U << 2) /* sq_thread_cpu is valid */ #define IORING_SETUP_CQSIZE (1U << 3) /* app defines CQ size */ #define IORING_SETUP_CLAMP (1U << 4) /* clamp SQ/CQ ring sizes */ +#define IORING_SETUP_ATTACH_WQ (1U << 5) /* attach to existing wq */
enum { IORING_OP_NOP, @@ -182,7 +183,8 @@ struct io_uring_params { __u32 sq_thread_cpu; __u32 sq_thread_idle; __u32 features; - __u32 resv[4]; + __u32 wq_fd; + __u32 resv[3]; struct io_sqring_offsets sq_off; struct io_cqring_offsets cq_off; };
From: Jens Axboe axboe@kernel.dk
mainline inclusion from mainline-5.6-rc1 commit 071698e13ac6ba786dfa22349a7b62deb5a9464d category: feature bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27 CVE: NA ---------------------------
If an application wants to use a ring with different kinds of credentials, it can register them upfront. We don't look up credentials; the credentials of the task calling IORING_REGISTER_PERSONALITY are used.
An 'id' is returned for the application to use in subsequent personality selection.
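As a usage sketch, registration is a single io_uring_register(2) call with no arguments; the raw syscall form (no liburing assumed) might look like:

#include <sys/syscall.h>
#include <unistd.h>
#include <linux/io_uring.h>

/* Returns the personality id on success, -1 with errno set on error.
 * Credentials are snapshotted from the calling task at this point. */
static int register_personality(int ring_fd)
{
	return syscall(__NR_io_uring_register, ring_fd,
		       IORING_REGISTER_PERSONALITY, NULL, 0);
}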
Reviewed-by: Pavel Begunkov asml.silence@gmail.com Signed-off-by: Jens Axboe axboe@kernel.dk Signed-off-by: yangerkun yangerkun@huawei.com Reviewed-by: zhangyi (F) yi.zhang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- fs/io_uring.c | 75 +++++++++++++++++++++++++++++++---- include/uapi/linux/io_uring.h | 2 + 2 files changed, 70 insertions(+), 7 deletions(-)
diff --git a/fs/io_uring.c b/fs/io_uring.c index cc4a5e92153b..dc5ff7771c26 100644 --- a/fs/io_uring.c +++ b/fs/io_uring.c @@ -273,6 +273,8 @@ struct io_ring_ctx { struct socket *ring_sock; #endif
+ struct idr personality_idr; + struct { unsigned cached_cq_tail; unsigned cq_entries; @@ -792,6 +794,7 @@ static struct io_ring_ctx *io_ring_ctx_alloc(struct io_uring_params *p) INIT_LIST_HEAD(&ctx->cq_overflow_list); init_completion(&ctx->completions[0]); init_completion(&ctx->completions[1]); + idr_init(&ctx->personality_idr); mutex_init(&ctx->uring_lock); init_waitqueue_head(&ctx->wait); spin_lock_init(&ctx->completion_lock); @@ -6120,6 +6123,17 @@ static int io_uring_fasync(int fd, struct file *file, int on) return fasync_helper(fd, file, on, &ctx->cq_fasync); }
+static int io_remove_personalities(int id, void *p, void *data) +{ + struct io_ring_ctx *ctx = data; + const struct cred *cred; + + cred = idr_remove(&ctx->personality_idr, id); + if (cred) + put_cred(cred); + return 0; +} + static void io_ring_ctx_wait_and_kill(struct io_ring_ctx *ctx) { mutex_lock(&ctx->uring_lock); @@ -6136,6 +6150,7 @@ static void io_ring_ctx_wait_and_kill(struct io_ring_ctx *ctx) /* if we failed setting up the ctx, we might not have any rings */ if (ctx->rings) io_cqring_overflow_flush(ctx, true); + idr_for_each(&ctx->personality_idr, io_remove_personalities, ctx); wait_for_completion(&ctx->completions[0]); io_ring_ctx_free(ctx); } @@ -6616,6 +6631,45 @@ static int io_probe(struct io_ring_ctx *ctx, void __user *arg, unsigned nr_args) return ret; }
+static int io_register_personality(struct io_ring_ctx *ctx) +{ + const struct cred *creds = get_current_cred(); + int id; + + id = idr_alloc_cyclic(&ctx->personality_idr, (void *) creds, 1, + USHRT_MAX, GFP_KERNEL); + if (id < 0) + put_cred(creds); + return id; +} + +static int io_unregister_personality(struct io_ring_ctx *ctx, unsigned id) +{ + const struct cred *old_creds; + + old_creds = idr_remove(&ctx->personality_idr, id); + if (old_creds) { + put_cred(old_creds); + return 0; + } + + return -EINVAL; +} + +static bool io_register_op_must_quiesce(int op) +{ + switch (op) { + case IORING_UNREGISTER_FILES: + case IORING_REGISTER_FILES_UPDATE: + case IORING_REGISTER_PROBE: + case IORING_REGISTER_PERSONALITY: + case IORING_UNREGISTER_PERSONALITY: + return false; + default: + return true; + } +} + static int __io_uring_register(struct io_ring_ctx *ctx, unsigned opcode, void __user *arg, unsigned nr_args) __releases(ctx->uring_lock) @@ -6631,9 +6685,7 @@ static int __io_uring_register(struct io_ring_ctx *ctx, unsigned opcode, if (percpu_ref_is_dying(&ctx->refs)) return -ENXIO;
- if (opcode != IORING_UNREGISTER_FILES && - opcode != IORING_REGISTER_FILES_UPDATE && - opcode != IORING_REGISTER_PROBE) { + if (io_register_op_must_quiesce(opcode)) { percpu_ref_kill(&ctx->refs);
/* @@ -6701,15 +6753,24 @@ static int __io_uring_register(struct io_ring_ctx *ctx, unsigned opcode, break; ret = io_probe(ctx, arg, nr_args); break; + case IORING_REGISTER_PERSONALITY: + ret = -EINVAL; + if (arg || nr_args) + break; + ret = io_register_personality(ctx); + break; + case IORING_UNREGISTER_PERSONALITY: + ret = -EINVAL; + if (arg) + break; + ret = io_unregister_personality(ctx, nr_args); + break; default: ret = -EINVAL; break; }
- - if (opcode != IORING_UNREGISTER_FILES && - opcode != IORING_REGISTER_FILES_UPDATE && - opcode != IORING_REGISTER_PROBE) { + if (io_register_op_must_quiesce(opcode)) { /* bring the ctx back to life */ percpu_ref_reinit(&ctx->refs); out: diff --git a/include/uapi/linux/io_uring.h b/include/uapi/linux/io_uring.h index 4b5a3376d959..3c65bb6c3e97 100644 --- a/include/uapi/linux/io_uring.h +++ b/include/uapi/linux/io_uring.h @@ -210,6 +210,8 @@ struct io_uring_params { #define IORING_REGISTER_FILES_UPDATE 6 #define IORING_REGISTER_EVENTFD_ASYNC 7 #define IORING_REGISTER_PROBE 8 +#define IORING_REGISTER_PERSONALITY 9 +#define IORING_UNREGISTER_PERSONALITY 10
struct io_uring_files_update { __u32 offset;
From: Jens Axboe axboe@kernel.dk
mainline inclusion from mainline-5.6-rc1 commit 75c6a03904e0dd414a4d99a3072075cb5117e5bc category: feature bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27 CVE: NA ---------------------------
For personalities previously registered via IORING_REGISTER_PERSONALITY, allow any command to select them. This is done by setting sqe->personality to the id returned from registration; a zero id means the submitting task's current credentials are used.
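A sketch of selecting a registered personality on a request; per the UAPI change below, the id goes into the sqe's new personality field (shown on a NOP for brevity, helper name illustrative):

#include <string.h>
#include <linux/io_uring.h>

static void prep_nop_with_creds(struct io_uring_sqe *sqe, __u16 personality_id)
{
	memset(sqe, 0, sizeof(*sqe));
	sqe->opcode = IORING_OP_NOP;
	/* id from IORING_REGISTER_PERSONALITY; 0 keeps current creds */
	sqe->personality = personality_id;
}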
Reviewed-by: Pavel Begunkov asml.silence@gmail.com Signed-off-by: Jens Axboe axboe@kernel.dk Signed-off-by: yangerkun yangerkun@huawei.com Reviewed-by: zhangyi (F) yi.zhang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- fs/io_uring.c | 20 +++++++++++++++++++- include/uapi/linux/io_uring.h | 7 ++++++- 2 files changed, 25 insertions(+), 2 deletions(-)
diff --git a/fs/io_uring.c b/fs/io_uring.c index dc5ff7771c26..3c4ddee91f69 100644 --- a/fs/io_uring.c +++ b/fs/io_uring.c @@ -4568,9 +4568,10 @@ static inline void io_queue_link_head(struct io_kiocb *req) static bool io_submit_sqe(struct io_kiocb *req, const struct io_uring_sqe *sqe, struct io_submit_state *state, struct io_kiocb **link) { + const struct cred *old_creds = NULL; struct io_ring_ctx *ctx = req->ctx; unsigned int sqe_flags; - int ret; + int ret, id;
sqe_flags = READ_ONCE(sqe->flags);
@@ -4579,6 +4580,19 @@ static bool io_submit_sqe(struct io_kiocb *req, const struct io_uring_sqe *sqe, ret = -EINVAL; goto err_req; } + + id = READ_ONCE(sqe->personality); + if (id) { + const struct cred *personality_creds; + + personality_creds = idr_find(&ctx->personality_idr, id); + if (unlikely(!personality_creds)) { + ret = -EINVAL; + goto err_req; + } + old_creds = override_creds(personality_creds); + } + /* same numerical values with corresponding REQ_F_*, safe to copy */ req->flags |= sqe_flags & (IOSQE_IO_DRAIN|IOSQE_IO_HARDLINK| IOSQE_ASYNC); @@ -4588,6 +4602,8 @@ static bool io_submit_sqe(struct io_kiocb *req, const struct io_uring_sqe *sqe, err_req: io_cqring_add_event(req, ret); io_double_put_req(req); + if (old_creds) + revert_creds(old_creds); return false; }
@@ -4648,6 +4664,8 @@ static bool io_submit_sqe(struct io_kiocb *req, const struct io_uring_sqe *sqe, } }
+ if (old_creds) + revert_creds(old_creds); return true; }
diff --git a/include/uapi/linux/io_uring.h b/include/uapi/linux/io_uring.h index 3c65bb6c3e97..ad96791b34cf 100644 --- a/include/uapi/linux/io_uring.h +++ b/include/uapi/linux/io_uring.h @@ -40,7 +40,12 @@ struct io_uring_sqe { }; __u64 user_data; /* data to be passed back at completion time */ union { - __u16 buf_index; /* index into fixed buffers, if used */ + struct { + /* index into fixed buffers, if used */ + __u16 buf_index; + /* personality to use, if used */ + __u16 personality; + }; __u64 __pad2[3]; }; };
From: Jens Axboe axboe@kernel.dk
mainline inclusion from mainline-5.6-rc1 commit f86cd20c9454847a524ddbdcdec32c0380ed7c9b category: feature bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27 CVE: NA ---------------------------
We're not consistent in how the file table is grabbed and assigned when a linked command requires it.
Add ->file_table to the io_op_defs[] array, and use that to determine when to grab the table instead of having the handlers set it if they need to defer. This also means we can kill the IO_WQ_WORK_NEEDS_FILES flag. We always initialize work->files, so io-wq can just check for that.
Signed-off-by: Jens Axboe axboe@kernel.dk Conflicts: fs/io_uring.c [commit cebdb98617ae("io_uring: add support for IORING_OP_OPENAT2") is not applied] Signed-off-by: yangerkun yangerkun@huawei.com Reviewed-by: zhangyi (F) yi.zhang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- fs/io-wq.c | 3 +-- fs/io-wq.h | 1 - fs/io_uring.c | 30 +++++++++++++++++++----------- 3 files changed, 20 insertions(+), 14 deletions(-)
diff --git a/fs/io-wq.c b/fs/io-wq.c index 852bedffd508..41ce88543b81 100644 --- a/fs/io-wq.c +++ b/fs/io-wq.c @@ -477,8 +477,7 @@ static void io_worker_handle_work(struct io_worker *worker) if (work->flags & IO_WQ_WORK_CB) work->func(&work);
- if ((work->flags & IO_WQ_WORK_NEEDS_FILES) && - current->files != work->files) { + if (work->files && current->files != work->files) { task_lock(current); current->files = work->files; task_unlock(current); diff --git a/fs/io-wq.h b/fs/io-wq.h index c42602c58c56..50b3378febf2 100644 --- a/fs/io-wq.h +++ b/fs/io-wq.h @@ -7,7 +7,6 @@ enum { IO_WQ_WORK_CANCEL = 1, IO_WQ_WORK_HAS_MM = 2, IO_WQ_WORK_HASHED = 4, - IO_WQ_WORK_NEEDS_FILES = 16, IO_WQ_WORK_UNBOUND = 32, IO_WQ_WORK_INTERNAL = 64, IO_WQ_WORK_CB = 128, diff --git a/fs/io_uring.c b/fs/io_uring.c index 3c4ddee91f69..f92e1f261dea 100644 --- a/fs/io_uring.c +++ b/fs/io_uring.c @@ -604,6 +604,8 @@ struct io_op_def { unsigned unbound_nonreg_file : 1; /* opcode is not supported by this kernel */ unsigned not_supported : 1; + /* needs file table */ + unsigned file_table : 1; };
static const struct io_op_def io_op_defs[] = { @@ -662,6 +664,7 @@ static const struct io_op_def io_op_defs[] = { .needs_mm = 1, .needs_file = 1, .unbound_nonreg_file = 1, + .file_table = 1, }, [IORING_OP_ASYNC_CANCEL] = {}, [IORING_OP_LINK_TIMEOUT] = { @@ -680,12 +683,15 @@ static const struct io_op_def io_op_defs[] = { [IORING_OP_OPENAT] = { .needs_file = 1, .fd_non_neg = 1, + .file_table = 1, }, [IORING_OP_CLOSE] = { .needs_file = 1, + .file_table = 1, }, [IORING_OP_FILES_UPDATE] = { .needs_mm = 1, + .file_table = 1, }, [IORING_OP_STATX] = { .needs_mm = 1, @@ -729,6 +735,7 @@ static void io_queue_linked_timeout(struct io_kiocb *req); static int __io_sqe_files_update(struct io_ring_ctx *ctx, struct io_uring_files_update *ip, unsigned nr_args); +static int io_grab_files(struct io_kiocb *req);
static struct kmem_cache *req_cachep;
@@ -2528,10 +2535,8 @@ static int io_openat(struct io_kiocb *req, struct io_kiocb **nxt, struct file *file; int ret;
- if (force_nonblock) { - req->work.flags |= IO_WQ_WORK_NEEDS_FILES; + if (force_nonblock) return -EAGAIN; - }
ret = build_open_flags(req->open.flags, req->open.mode, &op); if (ret) @@ -2750,10 +2755,8 @@ static int io_close(struct io_kiocb *req, struct io_kiocb **nxt, return ret;
/* if the file has a flush method, be safe and punt to async */ - if (req->close.put_file->f_op->flush && !io_wq_current_is_worker()) { - req->work.flags |= IO_WQ_WORK_NEEDS_FILES; + if (req->close.put_file->f_op->flush && !io_wq_current_is_worker()) goto eagain; - }
/* * No ->flush(), safely close from here and just punt the @@ -3197,7 +3200,6 @@ static int io_accept(struct io_kiocb *req, struct io_kiocb **nxt, ret = __io_accept(req, nxt, force_nonblock); if (ret == -EAGAIN && force_nonblock) { req->work.func = io_accept_finish; - req->work.flags |= IO_WQ_WORK_NEEDS_FILES; io_put_req(req); return -EAGAIN; } @@ -3920,10 +3922,8 @@ static int io_files_update(struct io_kiocb *req, bool force_nonblock) struct io_uring_files_update up; int ret;
- if (force_nonblock) { - req->work.flags |= IO_WQ_WORK_NEEDS_FILES; + if (force_nonblock) return -EAGAIN; - }
up.offset = req->files_update.offset; up.fds = req->files_update.arg; @@ -3944,6 +3944,12 @@ static int io_req_defer_prep(struct io_kiocb *req, { ssize_t ret = 0;
+ if (io_op_defs[req->opcode].file_table) { + ret = io_grab_files(req); + if (unlikely(ret)) + return ret; + } + io_req_work_grab_env(req, &io_op_defs[req->opcode]);
switch (req->opcode) { @@ -4366,6 +4372,8 @@ static int io_grab_files(struct io_kiocb *req) int ret = -EBADF; struct io_ring_ctx *ctx = req->ctx;
+ if (req->work.files) + return 0; if (!ctx->ring_file) return -EBADF;
@@ -4484,7 +4492,7 @@ static void __io_queue_sqe(struct io_kiocb *req, const struct io_uring_sqe *sqe) if (ret == -EAGAIN && (!(req->flags & REQ_F_NOWAIT) || (req->flags & REQ_F_MUST_PUNT))) { punt: - if (req->work.flags & IO_WQ_WORK_NEEDS_FILES) { + if (io_op_defs[req->opcode].file_table) { ret = io_grab_files(req); if (ret) goto err;
From: Jens Axboe axboe@kernel.dk
mainline inclusion from mainline-5.6-rc1 commit 58e41a44c488f3e9601fd8150f58377ef8f44889 category: feature bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27 CVE: NA ---------------------------
No functional changes in this patch.
Signed-off-by: Jens Axboe axboe@kernel.dk Signed-off-by: yangerkun yangerkun@huawei.com Reviewed-by: zhangyi (F) yi.zhang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- fs/eventpoll.c | 45 +++++++++++++++++++++++++-------------------- 1 file changed, 25 insertions(+), 20 deletions(-)
diff --git a/fs/eventpoll.c b/fs/eventpoll.c index 6d4d73faabfd..cfe8dbf8199d 100644 --- a/fs/eventpoll.c +++ b/fs/eventpoll.c @@ -1991,27 +1991,15 @@ SYSCALL_DEFINE1(epoll_create, int, size) return do_epoll_create(0); }
-/* - * The following function implements the controller interface for - * the eventpoll file that enables the insertion/removal/change of - * file descriptors inside the interest set. - */ -SYSCALL_DEFINE4(epoll_ctl, int, epfd, int, op, int, fd, - struct epoll_event __user *, event) +static int do_epoll_ctl(int epfd, int op, int fd, struct epoll_event *epds) { int error; int full_check = 0; struct fd f, tf; struct eventpoll *ep; struct epitem *epi; - struct epoll_event epds; struct eventpoll *tep = NULL;
- error = -EFAULT; - if (ep_op_has_event(op) && - copy_from_user(&epds, event, sizeof(struct epoll_event))) - goto error_return; - error = -EBADF; f = fdget(epfd); if (!f.file) @@ -2029,7 +2017,7 @@ SYSCALL_DEFINE4(epoll_ctl, int, epfd, int, op, int, fd,
/* Check if EPOLLWAKEUP is allowed */ if (ep_op_has_event(op)) - ep_take_care_of_epollwakeup(&epds); + ep_take_care_of_epollwakeup(epds);
/* * We have to check that the file structure underneath the file descriptor @@ -2045,11 +2033,11 @@ SYSCALL_DEFINE4(epoll_ctl, int, epfd, int, op, int, fd, * so EPOLLEXCLUSIVE is not allowed for a EPOLL_CTL_MOD operation. * Also, we do not currently supported nested exclusive wakeups. */ - if (ep_op_has_event(op) && (epds.events & EPOLLEXCLUSIVE)) { + if (ep_op_has_event(op) && (epds->events & EPOLLEXCLUSIVE)) { if (op == EPOLL_CTL_MOD) goto error_tgt_fput; if (op == EPOLL_CTL_ADD && (is_file_epoll(tf.file) || - (epds.events & ~EPOLLEXCLUSIVE_OK_BITS))) + (epds->events & ~EPOLLEXCLUSIVE_OK_BITS))) goto error_tgt_fput; }
@@ -2110,8 +2098,8 @@ SYSCALL_DEFINE4(epoll_ctl, int, epfd, int, op, int, fd, switch (op) { case EPOLL_CTL_ADD: if (!epi) { - epds.events |= EPOLLERR | EPOLLHUP; - error = ep_insert(ep, &epds, tf.file, fd, full_check); + epds->events |= EPOLLERR | EPOLLHUP; + error = ep_insert(ep, epds, tf.file, fd, full_check); } else error = -EEXIST; break; @@ -2124,8 +2112,8 @@ SYSCALL_DEFINE4(epoll_ctl, int, epfd, int, op, int, fd, case EPOLL_CTL_MOD: if (epi) { if (!(epi->event.events & EPOLLEXCLUSIVE)) { - epds.events |= EPOLLERR | EPOLLHUP; - error = ep_modify(ep, epi, &epds); + epds->events |= EPOLLERR | EPOLLHUP; + error = ep_modify(ep, epi, epds); } } else error = -ENOENT; @@ -2150,6 +2138,23 @@ SYSCALL_DEFINE4(epoll_ctl, int, epfd, int, op, int, fd, return error; }
+/* + * The following function implements the controller interface for + * the eventpoll file that enables the insertion/removal/change of + * file descriptors inside the interest set. + */ +SYSCALL_DEFINE4(epoll_ctl, int, epfd, int, op, int, fd, + struct epoll_event __user *, event) +{ + struct epoll_event epds; + + if (ep_op_has_event(op) && + copy_from_user(&epds, event, sizeof(struct epoll_event))) + return -EFAULT; + + return do_epoll_ctl(epfd, op, fd, &epds); +} + /* * Implement the event wait interface for the eventpoll file. It is the kernel * part of the user space epoll_wait(2).
From: Jens Axboe axboe@kernel.dk
mainline inclusion from mainline-5.6-rc1 commit 39220e8d4a2aaab045ea03cc16d737e85d0817bf category: feature bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27 CVE: NA ---------------------------
Also make it available outside of epoll, along with the helper that decides if we need to copy the passed in epoll_event.
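The nonblock variant added below (epoll_mutex_lock()) hinges on a trylock: if the mutex can't be taken immediately, bail with -EAGAIN so the caller can retry from a context that may sleep. The same shape in portable C with pthreads, purely as an illustration:

#include <errno.h>
#include <pthread.h>

static int lock_maybe_nonblock(pthread_mutex_t *m, int nonblock)
{
	if (!nonblock) {
		pthread_mutex_lock(m);	/* may block */
		return 0;
	}
	if (pthread_mutex_trylock(m) == 0)
		return 0;
	return -EAGAIN;			/* caller retries from a blocking context */
}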
Signed-off-by: Jens Axboe axboe@kernel.dk Conflicts: fs/eventpoll.c [conflicts with get_file(tf.file); in commit 492a9215c4e6 ("epoll: Keep a reference on files added to the check list")] Signed-off-by: yangerkun yangerkun@huawei.com Reviewed-by: zhangyi (F) yi.zhang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- fs/eventpoll.c | 46 ++++++++++++++++++++++++++++----------- include/linux/eventpoll.h | 9 ++++++++ 2 files changed, 42 insertions(+), 13 deletions(-)
diff --git a/fs/eventpoll.c b/fs/eventpoll.c index cfe8dbf8199d..d46007154250 100644 --- a/fs/eventpoll.c +++ b/fs/eventpoll.c @@ -356,12 +356,6 @@ static inline struct epitem *ep_item_from_epqueue(poll_table *p) return container_of(p, struct ep_pqueue, pt)->epi; }
-/* Tells if the epoll_ctl(2) operation needs an event copy from userspace */ -static inline int ep_op_has_event(int op) -{ - return op != EPOLL_CTL_DEL; -} - /* Initialize the poll safe wake up structure */ static void ep_nested_calls_init(struct nested_calls *ncalls) { @@ -1991,7 +1985,20 @@ SYSCALL_DEFINE1(epoll_create, int, size) return do_epoll_create(0); }
-static int do_epoll_ctl(int epfd, int op, int fd, struct epoll_event *epds) +static inline int epoll_mutex_lock(struct mutex *mutex, int depth, + bool nonblock) +{ + if (!nonblock) { + mutex_lock_nested(mutex, depth); + return 0; + } + if (mutex_trylock(mutex)) + return 0; + return -EAGAIN; +} + +int do_epoll_ctl(int epfd, int op, int fd, struct epoll_event *epds, + bool nonblock) { int error; int full_check = 0; @@ -2062,14 +2069,18 @@ static int do_epoll_ctl(int epfd, int op, int fd, struct epoll_event *epds) * deep wakeup paths from forming in parallel through multiple * EPOLL_CTL_ADD operations. */ - mutex_lock_nested(&ep->mtx, 0); + error = epoll_mutex_lock(&ep->mtx, 0, nonblock); + if (error) + goto error_tgt_fput; if (op == EPOLL_CTL_ADD) { if (!list_empty(&f.file->f_ep_links) || ep->gen == loop_check_gen || is_file_epoll(tf.file)) { - full_check = 1; mutex_unlock(&ep->mtx); - mutex_lock(&epmutex); + error = epoll_mutex_lock(&epmutex, 0, nonblock); + if (error) + goto error_tgt_fput; + full_check = 1; if (is_file_epoll(tf.file)) { error = -ELOOP; if (ep_loop_check(ep, tf.file) != 0) @@ -2079,10 +2090,19 @@ static int do_epoll_ctl(int epfd, int op, int fd, struct epoll_event *epds) list_add(&tf.file->f_tfile_llink, &tfile_check_list); } - mutex_lock_nested(&ep->mtx, 0); + error = epoll_mutex_lock(&ep->mtx, 0, nonblock); + if (error) { +out_del: + list_del(&tf.file->f_tfile_llink); + goto error_tgt_fput; + } if (is_file_epoll(tf.file)) { tep = tf.file->private_data; - mutex_lock_nested(&tep->mtx, 1); + error = epoll_mutex_lock(&tep->mtx, 1, nonblock); + if (error) { + mutex_unlock(&ep->mtx); + goto out_del; + } } } } @@ -2152,7 +2172,7 @@ SYSCALL_DEFINE4(epoll_ctl, int, epfd, int, op, int, fd, copy_from_user(&epds, event, sizeof(struct epoll_event))) return -EFAULT;
- return do_epoll_ctl(epfd, op, fd, &epds); + return do_epoll_ctl(epfd, op, fd, &epds, false); }
/* diff --git a/include/linux/eventpoll.h b/include/linux/eventpoll.h index 2f14ac73d01d..48dedbafe5fa 100644 --- a/include/linux/eventpoll.h +++ b/include/linux/eventpoll.h @@ -66,6 +66,15 @@ static inline void eventpoll_release(struct file *file) eventpoll_release_file(file); }
+int do_epoll_ctl(int epfd, int op, int fd, struct epoll_event *epds, + bool nonblock); + +/* Tells if the epoll_ctl(2) operation needs an event copy from userspace */ +static inline int ep_op_has_event(int op) +{ + return op != EPOLL_CTL_DEL; +} + #else
static inline void eventpoll_init_file(struct file *file) {}
From: Jens Axboe axboe@kernel.dk
mainline inclusion from mainline-5.6-rc1 commit 3e4827b05d2ac2d377ed136a52829ec46787bf4b category: feature bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27 CVE: NA ---------------------------
This adds IORING_OP_EPOLL_CTL, which can perform the same work as the epoll_ctl(2) system call.
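Based on the field mapping in io_epoll_ctl_prep() below (epfd in sqe->fd, op in sqe->len, target fd in sqe->off, event pointer in sqe->addr), a userspace prep sketch could look like this (helper name is illustrative):

#include <stdint.h>
#include <string.h>
#include <sys/epoll.h>
#include <linux/io_uring.h>

static void prep_epoll_ctl(struct io_uring_sqe *sqe, int epfd, int op,
			   int fd, struct epoll_event *ev)
{
	memset(sqe, 0, sizeof(*sqe));
	sqe->opcode = IORING_OP_EPOLL_CTL;
	sqe->fd = epfd;				/* the epoll instance */
	sqe->len = op;				/* EPOLL_CTL_ADD/MOD/DEL */
	sqe->off = fd;				/* fd to operate on */
	sqe->addr = (uint64_t)(uintptr_t)ev;	/* copied during prep */
}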
Signed-off-by: Jens Axboe axboe@kernel.dk
Conflicts: fs/io_uring.c include/uapi/linux/io_uring.h [commit cebdb98617ae("io_uring: add support for IORING_OP_OPENAT2") is not applied] Signed-off-by: yangerkun yangerkun@huawei.com Reviewed-by: zhangyi (F) yi.zhang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- fs/io_uring.c | 71 +++++++++++++++++++++++++++++++++++ include/uapi/linux/io_uring.h | 1 + 2 files changed, 72 insertions(+)
diff --git a/fs/io_uring.c b/fs/io_uring.c index f92e1f261dea..d4e5f2ec8151 100644 --- a/fs/io_uring.c +++ b/fs/io_uring.c @@ -74,6 +74,7 @@ #include <linux/namei.h> #include <linux/fsnotify.h> #include <linux/fadvise.h> +#include <linux/eventpoll.h>
#define CREATE_TRACE_POINTS #include <trace/events/io_uring.h> @@ -424,6 +425,14 @@ struct io_madvise { u32 advice; };
+struct io_epoll { + struct file *file; + int epfd; + int op; + int fd; + struct epoll_event event; +}; + struct io_async_connect { struct sockaddr_storage address; }; @@ -537,6 +546,7 @@ struct io_kiocb { struct io_files_update files_update; struct io_fadvise fadvise; struct io_madvise madvise; + struct io_epoll epoll; };
struct io_async_ctx *io; @@ -724,6 +734,10 @@ static const struct io_op_def io_op_defs[] = { .needs_file = 1, .unbound_nonreg_file = 1, }, + [IORING_OP_EPOLL_CTL] = { + .unbound_nonreg_file = 1, + .file_table = 1, + }, };
static void io_wq_submit_work(struct io_wq_work **workptr); @@ -2563,6 +2577,52 @@ static int io_openat(struct io_kiocb *req, struct io_kiocb **nxt, return 0; }
+static int io_epoll_ctl_prep(struct io_kiocb *req, + const struct io_uring_sqe *sqe) +{ +#if defined(CONFIG_EPOLL) + if (sqe->ioprio || sqe->buf_index) + return -EINVAL; + + req->epoll.epfd = READ_ONCE(sqe->fd); + req->epoll.op = READ_ONCE(sqe->len); + req->epoll.fd = READ_ONCE(sqe->off); + + if (ep_op_has_event(req->epoll.op)) { + struct epoll_event __user *ev; + + ev = u64_to_user_ptr(READ_ONCE(sqe->addr)); + if (copy_from_user(&req->epoll.event, ev, sizeof(*ev))) + return -EFAULT; + } + + return 0; +#else + return -EOPNOTSUPP; +#endif +} + +static int io_epoll_ctl(struct io_kiocb *req, struct io_kiocb **nxt, + bool force_nonblock) +{ +#if defined(CONFIG_EPOLL) + struct io_epoll *ie = &req->epoll; + int ret; + + ret = do_epoll_ctl(ie->epfd, ie->op, ie->fd, &ie->event, force_nonblock); + if (force_nonblock && ret == -EAGAIN) + return -EAGAIN; + + if (ret < 0) + req_set_fail_links(req); + io_cqring_add_event(req, ret); + io_put_req_find_next(req, nxt); + return 0; +#else + return -EOPNOTSUPP; +#endif +} + static int io_madvise_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe) { #if defined(CONFIG_ADVISE_SYSCALLS) && defined(CONFIG_MMU) @@ -4024,6 +4084,9 @@ static int io_req_defer_prep(struct io_kiocb *req, case IORING_OP_MADVISE: ret = io_madvise_prep(req, sqe); break; + case IORING_OP_EPOLL_CTL: + ret = io_epoll_ctl_prep(req, sqe); + break; default: printk_once(KERN_WARNING "io_uring: unhandled opcode %d\n", req->opcode); @@ -4244,6 +4307,14 @@ static int io_issue_sqe(struct io_kiocb *req, const struct io_uring_sqe *sqe, } ret = io_madvise(req, nxt, force_nonblock); break; + case IORING_OP_EPOLL_CTL: + if (sqe) { + ret = io_epoll_ctl_prep(req, sqe); + if (ret) + break; + } + ret = io_epoll_ctl(req, nxt, force_nonblock); + break; default: ret = -EINVAL; break; diff --git a/include/uapi/linux/io_uring.h b/include/uapi/linux/io_uring.h index ad96791b34cf..90fed30a38b7 100644 --- a/include/uapi/linux/io_uring.h +++ b/include/uapi/linux/io_uring.h @@ -111,6 +111,7 @@ enum { IORING_OP_MADVISE, IORING_OP_SEND, IORING_OP_RECV, + IORING_OP_EPOLL_CTL,
/* this goes last, obviously */ IORING_OP_LAST,
From: Jens Axboe axboe@kernel.dk
mainline inclusion from mainline-5.6-rc1 commit 87ce955b24c9940cb2ca7e5173fcf175578d9fe9 category: feature bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27 CVE: NA ---------------------------
It can be hard to know exactly what is registered with the ring. Especially for credentials, it'd be handy to be able to see which ones are registered, what personalities they have, and what the ID of each of them is.
This adds support for showing information registered in the ring from the fdinfo of the io_uring fd. Here's an example from a test case that registers 4 files (two of them sparse), 4 buffers, and 2 personalities:
pos:	0
flags:	02000002
mnt_id:	14
UserFiles:	4
    0: file-no-1
    1: file-no-2
    2: <none>
    3: <none>
UserBufs:	4
    0: 0x563817c46000/128
    1: 0x563817c47000/256
    2: 0x563817c48000/512
    3: 0x563817c49000/1024
Personalities:
    1
	Uid:	0	0	0	0
	Gid:	0	0	0	0
	Groups:	0
	CapEff:	0000003fffffffff
    2
	Uid:	0	0	0	0
	Gid:	0	0	0	0
	Groups:	0
	CapEff:	0000003fffffffff
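For reference, the output above comes from procfs; a small C sketch that dumps it for a ring fd in the current process:

#include <stdio.h>

static void dump_ring_fdinfo(int ring_fd)
{
	char path[64], line[256];
	FILE *f;

	snprintf(path, sizeof(path), "/proc/self/fdinfo/%d", ring_fd);
	f = fopen(path, "r");
	if (!f)
		return;
	while (fgets(line, sizeof(line), f))
		fputs(line, stdout);
	fclose(f);
}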
Suggested-by: Jann Horn jannh@google.com Signed-off-by: Jens Axboe axboe@kernel.dk Signed-off-by: yangerkun yangerkun@huawei.com Reviewed-by: zhangyi (F) yi.zhang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- fs/io_uring.c | 75 +++++++++++++++++++++++++++++++++++++++++++++++++++ 1 file changed, 75 insertions(+)
diff --git a/fs/io_uring.c b/fs/io_uring.c index d4e5f2ec8151..b60e528741d5 100644 --- a/fs/io_uring.c +++ b/fs/io_uring.c @@ -6443,6 +6443,80 @@ SYSCALL_DEFINE6(io_uring_enter, unsigned int, fd, u32, to_submit, return submitted ? submitted : ret; }
+static int io_uring_show_cred(int id, void *p, void *data) +{ + const struct cred *cred = p; + struct seq_file *m = data; + struct user_namespace *uns = seq_user_ns(m); + struct group_info *gi; + kernel_cap_t cap; + unsigned __capi; + int g; + + seq_printf(m, "%5d\n", id); + seq_put_decimal_ull(m, "\tUid:\t", from_kuid_munged(uns, cred->uid)); + seq_put_decimal_ull(m, "\t\t", from_kuid_munged(uns, cred->euid)); + seq_put_decimal_ull(m, "\t\t", from_kuid_munged(uns, cred->suid)); + seq_put_decimal_ull(m, "\t\t", from_kuid_munged(uns, cred->fsuid)); + seq_put_decimal_ull(m, "\n\tGid:\t", from_kgid_munged(uns, cred->gid)); + seq_put_decimal_ull(m, "\t\t", from_kgid_munged(uns, cred->egid)); + seq_put_decimal_ull(m, "\t\t", from_kgid_munged(uns, cred->sgid)); + seq_put_decimal_ull(m, "\t\t", from_kgid_munged(uns, cred->fsgid)); + seq_puts(m, "\n\tGroups:\t"); + gi = cred->group_info; + for (g = 0; g < gi->ngroups; g++) { + seq_put_decimal_ull(m, g ? " " : "", + from_kgid_munged(uns, gi->gid[g])); + } + seq_puts(m, "\n\tCapEff:\t"); + cap = cred->cap_effective; + CAP_FOR_EACH_U32(__capi) + seq_put_hex_ll(m, NULL, cap.cap[CAP_LAST_U32 - __capi], 8); + seq_putc(m, '\n'); + return 0; +} + +static void __io_uring_show_fdinfo(struct io_ring_ctx *ctx, struct seq_file *m) +{ + int i; + + mutex_lock(&ctx->uring_lock); + seq_printf(m, "UserFiles:\t%u\n", ctx->nr_user_files); + for (i = 0; i < ctx->nr_user_files; i++) { + struct fixed_file_table *table; + struct file *f; + + table = &ctx->file_data->table[i >> IORING_FILE_TABLE_SHIFT]; + f = table->files[i & IORING_FILE_TABLE_MASK]; + if (f) + seq_printf(m, "%5u: %s\n", i, file_dentry(f)->d_iname); + else + seq_printf(m, "%5u: <none>\n", i); + } + seq_printf(m, "UserBufs:\t%u\n", ctx->nr_user_bufs); + for (i = 0; i < ctx->nr_user_bufs; i++) { + struct io_mapped_ubuf *buf = &ctx->user_bufs[i]; + + seq_printf(m, "%5u: 0x%llx/%u\n", i, buf->ubuf, + (unsigned int) buf->len); + } + if (!idr_is_empty(&ctx->personality_idr)) { + seq_printf(m, "Personalities:\n"); + idr_for_each(&ctx->personality_idr, io_uring_show_cred, m); + } + mutex_unlock(&ctx->uring_lock); +} + +static void io_uring_show_fdinfo(struct seq_file *m, struct file *f) +{ + struct io_ring_ctx *ctx = f->private_data; + + if (percpu_ref_tryget(&ctx->refs)) { + __io_uring_show_fdinfo(ctx, m); + percpu_ref_put(&ctx->refs); + } +} + static const struct file_operations io_uring_fops = { .release = io_uring_release, .flush = io_uring_flush, @@ -6453,6 +6527,7 @@ static const struct file_operations io_uring_fops = { #endif .poll = io_uring_poll, .fasync = io_uring_fasync, + .show_fdinfo = io_uring_show_fdinfo, };
static int io_allocate_scq_urings(struct io_ring_ctx *ctx,
From: Stefan Metzmacher metze@samba.org
mainline inclusion from mainline-5.6-rc1 commit d7f62e825fd19202a0749d10fb439714c51f67d2 category: feature bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27 CVE: NA ---------------------------
With nesting of anonymous unions and structs it's hard to review layout changes. It's better to ask the compiler for these things.
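Userspace consumers of the UAPI header can do the same kind of compile-time layout check with C11 _Static_assert; a sketch pinning a couple of the offsets verified in the diff below:

#include <stddef.h>
#include <linux/io_uring.h>

_Static_assert(sizeof(struct io_uring_sqe) == 64,
	       "sqe must stay 64 bytes");
_Static_assert(offsetof(struct io_uring_sqe, opcode) == 0,
	       "opcode at offset 0");
_Static_assert(offsetof(struct io_uring_sqe, user_data) == 32,
	       "user_data at offset 32");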
Signed-off-by: Stefan Metzmacher metze@samba.org Signed-off-by: Jens Axboe axboe@kernel.dk Signed-off-by: yangerkun yangerkun@huawei.com Reviewed-by: zhangyi (F) yi.zhang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- fs/io_uring.c | 33 +++++++++++++++++++++++++++++++++ 1 file changed, 33 insertions(+)
diff --git a/fs/io_uring.c b/fs/io_uring.c index b60e528741d5..c42bf74a3537 100644 --- a/fs/io_uring.c +++ b/fs/io_uring.c @@ -6980,6 +6980,39 @@ SYSCALL_DEFINE4(io_uring_register, unsigned int, fd, unsigned int, opcode,
static int __init io_uring_init(void) { +#define __BUILD_BUG_VERIFY_ELEMENT(stype, eoffset, etype, ename) do { \ + BUILD_BUG_ON(offsetof(stype, ename) != eoffset); \ + BUILD_BUG_ON(sizeof(etype) != sizeof_field(stype, ename)); \ +} while (0) + +#define BUILD_BUG_SQE_ELEM(eoffset, etype, ename) \ + __BUILD_BUG_VERIFY_ELEMENT(struct io_uring_sqe, eoffset, etype, ename) + BUILD_BUG_ON(sizeof(struct io_uring_sqe) != 64); + BUILD_BUG_SQE_ELEM(0, __u8, opcode); + BUILD_BUG_SQE_ELEM(1, __u8, flags); + BUILD_BUG_SQE_ELEM(2, __u16, ioprio); + BUILD_BUG_SQE_ELEM(4, __s32, fd); + BUILD_BUG_SQE_ELEM(8, __u64, off); + BUILD_BUG_SQE_ELEM(8, __u64, addr2); + BUILD_BUG_SQE_ELEM(16, __u64, addr); + BUILD_BUG_SQE_ELEM(24, __u32, len); + BUILD_BUG_SQE_ELEM(28, __kernel_rwf_t, rw_flags); + BUILD_BUG_SQE_ELEM(28, /* compat */ int, rw_flags); + BUILD_BUG_SQE_ELEM(28, /* compat */ __u32, rw_flags); + BUILD_BUG_SQE_ELEM(28, __u32, fsync_flags); + BUILD_BUG_SQE_ELEM(28, __u16, poll_events); + BUILD_BUG_SQE_ELEM(28, __u32, sync_range_flags); + BUILD_BUG_SQE_ELEM(28, __u32, msg_flags); + BUILD_BUG_SQE_ELEM(28, __u32, timeout_flags); + BUILD_BUG_SQE_ELEM(28, __u32, accept_flags); + BUILD_BUG_SQE_ELEM(28, __u32, cancel_flags); + BUILD_BUG_SQE_ELEM(28, __u32, open_flags); + BUILD_BUG_SQE_ELEM(28, __u32, statx_flags); + BUILD_BUG_SQE_ELEM(28, __u32, fadvise_advice); + BUILD_BUG_SQE_ELEM(32, __u64, user_data); + BUILD_BUG_SQE_ELEM(40, __u16, buf_index); + BUILD_BUG_SQE_ELEM(42, __u16, personality); + BUILD_BUG_ON(ARRAY_SIZE(io_op_defs) != IORING_OP_LAST); req_cachep = KMEM_CACHE(io_kiocb, SLAB_HWCACHE_ALIGN | SLAB_PANIC); return 0;
From: Jens Axboe axboe@kernel.dk
mainline inclusion from mainline-5.6-rc1 commit f0b493e6b9a8959356983f57112229e69c2f7b8c category: feature bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27 CVE: NA ---------------------------
If we have nested or circular eventfd wakeups, then we can deadlock if we run them inline from our poll waitqueue wakeup handler. It's also possible to have very long chains of notifications, to the extent where we could risk blowing the stack.
Check the eventfd recursion count before calling eventfd_signal(). If it's non-zero, then punt the signaling to async context. This is always safe, as it takes us out-of-line in terms of stack and locking context.
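For reference, the eventfd wiring this patch hardens is set up with a single registration call; a raw-syscall sketch (no liburing assumed):

#include <sys/eventfd.h>
#include <sys/syscall.h>
#include <unistd.h>
#include <linux/io_uring.h>

/* Returns an eventfd that is signaled on CQE postings, or -1 on error. */
static int attach_cq_eventfd(int ring_fd)
{
	int efd = eventfd(0, 0);

	if (efd < 0)
		return -1;
	if (syscall(__NR_io_uring_register, ring_fd,
		    IORING_REGISTER_EVENTFD, &efd, 1) < 0) {
		close(efd);
		return -1;
	}
	return efd;
}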
Cc: stable@vger.kernel.org # 5.1+ Signed-off-by: Jens Axboe axboe@kernel.dk Signed-off-by: yangerkun yangerkun@huawei.com Reviewed-by: zhangyi (F) yi.zhang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- fs/io_uring.c | 37 ++++++++++++++++++++++++++++++------- 1 file changed, 30 insertions(+), 7 deletions(-)
diff --git a/fs/io_uring.c b/fs/io_uring.c index c42bf74a3537..bb569a31882d 100644 --- a/fs/io_uring.c +++ b/fs/io_uring.c @@ -1015,21 +1015,28 @@ static struct io_uring_cqe *io_get_cqring(struct io_ring_ctx *ctx)
static inline bool io_should_trigger_evfd(struct io_ring_ctx *ctx) { + if (!ctx->cq_ev_fd) + return false; if (!ctx->eventfd_async) return true; return io_wq_current_is_worker() || in_interrupt(); }
-static void io_cqring_ev_posted(struct io_ring_ctx *ctx) +static void __io_cqring_ev_posted(struct io_ring_ctx *ctx, bool trigger_ev) { if (waitqueue_active(&ctx->wait)) wake_up(&ctx->wait); if (waitqueue_active(&ctx->sqo_wait)) wake_up(&ctx->sqo_wait); - if (ctx->cq_ev_fd && io_should_trigger_evfd(ctx)) + if (trigger_ev) eventfd_signal(ctx->cq_ev_fd, 1); }
+static void io_cqring_ev_posted(struct io_ring_ctx *ctx) +{ + __io_cqring_ev_posted(ctx, io_should_trigger_evfd(ctx)); +} + /* Returns true if there are no backlogged entries after the flush */ static bool io_cqring_overflow_flush(struct io_ring_ctx *ctx, bool force) { @@ -3513,6 +3520,14 @@ static void io_poll_flush(struct io_wq_work **workptr) __io_poll_flush(req->ctx, nodes); }
+static void io_poll_trigger_evfd(struct io_wq_work **workptr) +{ + struct io_kiocb *req = container_of(*workptr, struct io_kiocb, work); + + eventfd_signal(req->ctx->cq_ev_fd, 1); + io_put_req(req); +} + static int io_poll_wake(struct wait_queue_entry *wait, unsigned mode, int sync, void *key) { @@ -3538,14 +3553,22 @@ static int io_poll_wake(struct wait_queue_entry *wait, unsigned mode, int sync,
if (llist_empty(&ctx->poll_llist) && spin_trylock_irqsave(&ctx->completion_lock, flags)) { + bool trigger_ev; + hash_del(&req->hash_node); io_poll_complete(req, mask, 0); - req->flags |= REQ_F_COMP_LOCKED; - io_put_req(req); - spin_unlock_irqrestore(&ctx->completion_lock, flags);
- io_cqring_ev_posted(ctx); - req = NULL; + trigger_ev = io_should_trigger_evfd(ctx); + if (trigger_ev && eventfd_signal_count()) { + trigger_ev = false; + req->work.func = io_poll_trigger_evfd; + } else { + req->flags |= REQ_F_COMP_LOCKED; + io_put_req(req); + req = NULL; + } + spin_unlock_irqrestore(&ctx->completion_lock, flags); + __io_cqring_ev_posted(ctx, trigger_ev); } else { req->result = mask; req->llist_node.next = NULL;
From: Jens Axboe axboe@kernel.dk
mainline inclusion from mainline-5.6-rc1 commit 0b7b21e42ba2d6ac9595a4358a9354249605a3af category: feature bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27 CVE: NA ---------------------------
Don't use the recvmsg/sendmsg helpers; use the same helpers that the recv(2) and send(2) system calls use.
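As a reminder of the userspace side, IORING_OP_SEND/RECV take a plain buffer rather than a msghdr; a prep sketch consistent with io_send()/io_recv() reading sqe->addr, sqe->len and sqe->msg_flags (helper name is illustrative):

#include <stdint.h>
#include <string.h>
#include <linux/io_uring.h>

static void prep_send(struct io_uring_sqe *sqe, int sockfd,
		      const void *buf, unsigned len, int flags)
{
	memset(sqe, 0, sizeof(*sqe));
	sqe->opcode = IORING_OP_SEND;
	sqe->fd = sockfd;
	sqe->addr = (uint64_t)(uintptr_t)buf;
	sqe->len = len;
	sqe->msg_flags = flags;		/* e.g. MSG_DONTWAIT */
}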
Reported-by: 李通洲 carter.li@eoitek.com Signed-off-by: Jens Axboe axboe@kernel.dk Signed-off-by: yangerkun yangerkun@huawei.com Reviewed-by: zhangyi (F) yi.zhang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- fs/io_uring.c | 6 ++++-- 1 file changed, 4 insertions(+), 2 deletions(-)
diff --git a/fs/io_uring.c b/fs/io_uring.c index bb569a31882d..31359a6eab42 100644 --- a/fs/io_uring.c +++ b/fs/io_uring.c @@ -3042,7 +3042,8 @@ static int io_send(struct io_kiocb *req, struct io_kiocb **nxt, else if (force_nonblock) flags |= MSG_DONTWAIT;
- ret = __sys_sendmsg_sock(sock, &msg, flags); + msg.msg_flags = flags; + ret = sock_sendmsg(sock, &msg); if (force_nonblock && ret == -EAGAIN) return -EAGAIN; if (ret == -ERESTARTSYS) @@ -3068,6 +3069,7 @@ static int io_recvmsg_prep(struct io_kiocb *req,
sr->msg_flags = READ_ONCE(sqe->msg_flags); sr->msg = u64_to_user_ptr(READ_ONCE(sqe->addr)); + sr->len = READ_ONCE(sqe->len);
if (!io || req->opcode == IORING_OP_RECV) return 0; @@ -3186,7 +3188,7 @@ static int io_recv(struct io_kiocb *req, struct io_kiocb **nxt, else if (force_nonblock) flags |= MSG_DONTWAIT;
- ret = __sys_recvmsg_sock(sock, &msg, NULL, NULL, flags); + ret = sock_recvmsg(sock, &msg, flags); if (force_nonblock && ret == -EAGAIN) return -EAGAIN; if (ret == -ERESTARTSYS)
From: Jens Axboe axboe@kernel.dk
mainline inclusion from mainline-5.6-rc1 commit 5d204bcfa09330972ad3428a8f81c23f371d3e6d category: feature bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27 CVE: NA ---------------------------
If we have a read/write that is deferred, we already setup the async IO context for that request, and mapped it. When we later try and execute the request and we get -EAGAIN, we don't want to attempt to re-map it. If we do, we end up with garbage in the iovec, which typically leads to an -EFAULT or -EINVAL completion.
Cc: stable@vger.kernel.org # 5.5 Reported-by: Dan Melnic dmm@fb.com Signed-off-by: Jens Axboe axboe@kernel.dk Signed-off-by: yangerkun yangerkun@huawei.com Reviewed-by: zhangyi (F) yi.zhang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- fs/io_uring.c | 8 +++++--- 1 file changed, 5 insertions(+), 3 deletions(-)
diff --git a/fs/io_uring.c b/fs/io_uring.c index 31359a6eab42..63261cd05831 100644 --- a/fs/io_uring.c +++ b/fs/io_uring.c @@ -2166,10 +2166,12 @@ static int io_setup_async_rw(struct io_kiocb *req, ssize_t io_size, { if (!io_op_defs[req->opcode].async_ctx) return 0; - if (!req->io && io_alloc_async_ctx(req)) - return -ENOMEM; + if (!req->io) { + if (io_alloc_async_ctx(req)) + return -ENOMEM;
- io_req_map_rw(req, io_size, iovec, fast_iov, iter); + io_req_map_rw(req, io_size, iovec, fast_iov, iter); + } req->work.func = io_rw_async; return 0; }
From: Pavel Begunkov asml.silence@gmail.com
mainline inclusion from mainline-5.6-rc2 commit 8fef80bf56a49c60b457dedb99fd6c5279a5dbe1 category: feature bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27 CVE: NA ---------------------------
openat() and statx() may have allocated ->open.filename, which should be put. Add cleanup handlers for them.
Signed-off-by: Pavel Begunkov asml.silence@gmail.com Signed-off-by: Jens Axboe axboe@kernel.dk
Conflicts: fs/io_uring.c [remove IORING_OP_OPENAT2 for commit cebdb98617ae ("io_uring: add support for IORING_OP_OPENAT2") is not applied] Signed-off-by: yangerkun yangerkun@huawei.com Reviewed-by: zhangyi (F) yi.zhang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- fs/io_uring.c | 8 ++++++++ 1 file changed, 8 insertions(+)
diff --git a/fs/io_uring.c b/fs/io_uring.c index 6d2e1d1411ae..98243d7f5f3d 100644 --- a/fs/io_uring.c +++ b/fs/io_uring.c @@ -2532,6 +2532,7 @@ static int io_openat_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe) return ret; }
+ req->flags |= REQ_F_NEED_CLEANUP; return 0; }
@@ -2563,6 +2564,7 @@ static int io_openat(struct io_kiocb *req, struct io_kiocb **nxt, } err: putname(req->open.filename); + req->flags &= ~REQ_F_NEED_CLEANUP; if (ret < 0) req_set_fail_links(req); io_cqring_add_event(req, ret); @@ -2715,6 +2717,7 @@ static int io_statx_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe) return ret; }
+ req->flags |= REQ_F_NEED_CLEANUP; return 0; }
@@ -2752,6 +2755,7 @@ static int io_statx(struct io_kiocb *req, struct io_kiocb **nxt, ret = cp_statx(&stat, ctx->buffer); err: putname(ctx->filename); + req->flags &= ~REQ_F_NEED_CLEANUP; if (ret < 0) req_set_fail_links(req); io_cqring_add_event(req, ret); @@ -4170,6 +4174,10 @@ static void io_cleanup_req(struct io_kiocb *req) if (io->msg.iov != io->msg.fast_iov) kfree(io->msg.iov); break; + case IORING_OP_OPENAT: + case IORING_OP_STATX: + putname(req->open.filename); + break; }
req->flags &= ~REQ_F_NEED_CLEANUP;
From: Jens Axboe axboe@kernel.dk
mainline inclusion from mainline-5.6-rc2 commit ff002b30181d30cdfbca316dadd099c3ca0d739c category: feature bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27 CVE: NA ---------------------------
This passes the current task's ->fs in to io-wq, so that io-wq assumes the right fs_struct when executing async work that may need to do path lookups.
Cc: stable@vger.kernel.org # 5.3+ Signed-off-by: Jens Axboe axboe@kernel.dk
Conflicts: fs/io_uring.c [commit cebdb98617ae("io_uring: add support for IORING_OP_OPENAT2") is not merged] Signed-off-by: yangerkun yangerkun@huawei.com Reviewed-by: zhangyi (F) yi.zhang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- fs/io_uring.c | 27 +++++++++++++++++++++++++++ 1 file changed, 27 insertions(+)
diff --git a/fs/io_uring.c b/fs/io_uring.c index c06b0c0808ab..64faee19c82f 100644 --- a/fs/io_uring.c +++ b/fs/io_uring.c @@ -75,6 +75,7 @@ #include <linux/fsnotify.h> #include <linux/fadvise.h> #include <linux/eventpoll.h> +#include <linux/fs_struct.h>
#define CREATE_TRACE_POINTS #include <trace/events/io_uring.h> @@ -612,6 +613,8 @@ struct io_op_def { unsigned not_supported : 1; /* needs file table */ unsigned file_table : 1; + /* needs ->fs */ + unsigned needs_fs : 1; };
static const struct io_op_def io_op_defs[] = { @@ -654,12 +657,14 @@ static const struct io_op_def io_op_defs[] = { .needs_mm = 1, .needs_file = 1, .unbound_nonreg_file = 1, + .needs_fs = 1, }, [IORING_OP_RECVMSG] = { .async_ctx = 1, .needs_mm = 1, .needs_file = 1, .unbound_nonreg_file = 1, + .needs_fs = 1, }, [IORING_OP_TIMEOUT] = { .async_ctx = 1, @@ -690,6 +695,7 @@ static const struct io_op_def io_op_defs[] = { .needs_file = 1, .fd_non_neg = 1, .file_table = 1, + .needs_fs = 1, }, [IORING_OP_CLOSE] = { .needs_file = 1, @@ -703,6 +709,7 @@ static const struct io_op_def io_op_defs[] = { .needs_mm = 1, .needs_file = 1, .fd_non_neg = 1, + .needs_fs = 1, }, [IORING_OP_READ] = { .needs_mm = 1, @@ -902,6 +909,16 @@ static inline void io_req_work_grab_env(struct io_kiocb *req, } if (!req->work.creds) req->work.creds = get_current_cred(); + if (!req->work.fs && def->needs_fs) { + spin_lock(¤t->fs->lock); + if (!current->fs->in_exec) { + req->work.fs = current->fs; + req->work.fs->users++; + } else { + req->work.flags |= IO_WQ_WORK_CANCEL; + } + spin_unlock(¤t->fs->lock); + } }
static inline void io_req_work_drop_env(struct io_kiocb *req) @@ -914,6 +931,16 @@ static inline void io_req_work_drop_env(struct io_kiocb *req) put_cred(req->work.creds); req->work.creds = NULL; } + if (req->work.fs) { + struct fs_struct *fs = req->work.fs; + + spin_lock(&req->work.fs->lock); + if (--fs->users) + fs = NULL; + spin_unlock(&req->work.fs->lock); + if (fs) + free_fs_struct(fs); + } }
static inline bool io_prep_async_work(struct io_kiocb *req,
From: Jens Axboe axboe@kernel.dk
mainline inclusion from mainline-5.6-rc2 commit 0b5faf6ba7fb78bb1fe7336d23ea1978386a6c3a category: feature bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27 CVE: NA ---------------------------
Don't just check for dirfd == -1; we should allow AT_FDCWD as well for relative lookups.
Signed-off-by: Jens Axboe axboe@kernel.dk Signed-off-by: yangerkun yangerkun@huawei.com Reviewed-by: zhangyi (F) yi.zhang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- fs/io_uring.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/fs/io_uring.c b/fs/io_uring.c index 64faee19c82f..50576b9403df 100644 --- a/fs/io_uring.c +++ b/fs/io_uring.c @@ -4479,7 +4479,7 @@ static int io_req_needs_file(struct io_kiocb *req, int fd) { if (!io_op_defs[req->opcode].needs_file) return 0; - if (fd == -1 && io_op_defs[req->opcode].fd_non_neg) + if ((fd == -1 || fd == AT_FDCWD) && io_op_defs[req->opcode].fd_non_neg) return 0; return 1; }
From: Pavel Begunkov asml.silence@gmail.com
mainline inclusion from mainline-5.6-rc2 commit a93b33312f63ef6d5997f45d6fdf4de84c5396cc category: feature bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27 CVE: NA ---------------------------
First, io_close() misses filp_close() and io_cqring_add_event() when f_op->flush is defined. That's because in this case it will io_queue_async_work() itself without grabbing files, so the corresponding chunk in io_close_finish() won't be executed.
Second, when submitted through io_wq_submit_work(), it will do filp_close() and *_add_event() twice: first inline in io_close(), and a second time in the call to io_close_finish() from io_close(). The second one will also fire, because the request was submitted async through the generic path, and so will have grabbed files.
And the last nice thing is to remove this weird pilgrimage with checking work/old_work and casting it to nxt. Just use a helper instead.
Signed-off-by: Pavel Begunkov asml.silence@gmail.com Signed-off-by: Jens Axboe axboe@kernel.dk Signed-off-by: yangerkun yangerkun@huawei.com Reviewed-by: zhangyi (F) yi.zhang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- fs/io_uring.c | 45 ++++++++++++++++----------------------------- 1 file changed, 16 insertions(+), 29 deletions(-)
diff --git a/fs/io_uring.c b/fs/io_uring.c index 50576b9403df..bb8167692eb3 100644 --- a/fs/io_uring.c +++ b/fs/io_uring.c @@ -2818,24 +2818,25 @@ static int io_close_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe) return 0; }
+/* only called when __close_fd_get_file() is done */ +static void __io_close_finish(struct io_kiocb *req, struct io_kiocb **nxt) +{ + int ret; + + ret = filp_close(req->close.put_file, req->work.files); + if (ret < 0) + req_set_fail_links(req); + io_cqring_add_event(req, ret); + fput(req->close.put_file); + io_put_req_find_next(req, nxt); +} + static void io_close_finish(struct io_wq_work **workptr) { struct io_kiocb *req = container_of(*workptr, struct io_kiocb, work); struct io_kiocb *nxt = NULL;
- /* Invoked with files, we need to do the close */ - if (req->work.files) { - int ret; - - ret = filp_close(req->close.put_file, req->work.files); - if (ret < 0) - req_set_fail_links(req); - io_cqring_add_event(req, ret); - } - - fput(req->close.put_file); - - io_put_req_find_next(req, &nxt); + __io_close_finish(req, &nxt); if (nxt) io_wq_assign_next(workptr, nxt); } @@ -2858,22 +2859,8 @@ static int io_close(struct io_kiocb *req, struct io_kiocb **nxt, * No ->flush(), safely close from here and just punt the * fput() to async context. */ - ret = filp_close(req->close.put_file, current->files); - - if (ret < 0) - req_set_fail_links(req); - io_cqring_add_event(req, ret); - - if (io_wq_current_is_worker()) { - struct io_wq_work *old_work, *work; - - old_work = work = &req->work; - io_close_finish(&work); - if (work && work != old_work) - *nxt = container_of(work, struct io_kiocb, work); - return 0; - } - + __io_close_finish(req, nxt); + return 0; eagain: req->work.func = io_close_finish; /*
From: Pavel Begunkov asml.silence@gmail.com
mainline inclusion from mainline-5.6-rc2 commit 5f798beaf35d79355cbf18019c1993a84475a2c3 category: feature bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27 CVE: NA ---------------------------
Requests may be prepared multiple times with ->io allocated (i.e. async prepared). Preparation functions don't handle it and forget about previously allocated resources. This may happen in case of:
- spurious defer_check
- non-head (i.e. async prepared) request executed in sync (via nxt)
Make the handlers check whether they have already allocated resources, which is true iff REQ_F_NEED_CLEANUP is set.
Cc: stable@vger.kernel.org # 5.5 Signed-off-by: Pavel Begunkov asml.silence@gmail.com Signed-off-by: Jens Axboe axboe@kernel.dk Signed-off-by: yangerkun yangerkun@huawei.com Reviewed-by: zhangyi (F) yi.zhang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- fs/io_uring.c | 12 ++++++++++-- 1 file changed, 10 insertions(+), 2 deletions(-)
diff --git a/fs/io_uring.c b/fs/io_uring.c index bb8167692eb3..2e61433e6da6 100644 --- a/fs/io_uring.c +++ b/fs/io_uring.c @@ -2199,7 +2199,8 @@ static int io_read_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe, if (unlikely(!(req->file->f_mode & FMODE_READ))) return -EBADF;
- if (!req->io) + /* either don't need iovec imported or already have it */ + if (!req->io || req->flags & REQ_F_NEED_CLEANUP) return 0;
io = req->io; @@ -2287,7 +2288,8 @@ static int io_write_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe, if (unlikely(!(req->file->f_mode & FMODE_WRITE))) return -EBADF;
- if (!req->io) + /* either don't need iovec imported or already have it */ + if (!req->io || req->flags & REQ_F_NEED_CLEANUP) return 0;
io = req->io; @@ -2941,6 +2943,9 @@ static int io_sendmsg_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe)
if (!io || req->opcode == IORING_OP_SEND) return 0; + /* iovec is already imported */ + if (req->flags & REQ_F_NEED_CLEANUP) + return 0;
io->msg.iov = io->msg.fast_iov; ret = sendmsg_copy_msghdr(&io->msg.msg, sr->msg, sr->msg_flags, @@ -3091,6 +3096,9 @@ static int io_recvmsg_prep(struct io_kiocb *req,
if (!io || req->opcode == IORING_OP_RECV) return 0; + /* iovec is already imported */ + if (req->flags & REQ_F_NEED_CLEANUP) + return 0;
io->msg.iov = io->msg.fast_iov; ret = recvmsg_copy_msghdr(&io->msg.msg, sr->msg, sr->msg_flags,
From: Pavel Begunkov asml.silence@gmail.com
mainline inclusion from mainline-5.6-rc2 commit 0bdbdd08a8f991bdaee54465a168c0795ea5d28b category: feature bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27 CVE: NA ---------------------------
As in the previous patch, make openat*_prep() and statx_prep() handle double preparation to avoid resource leakage.
Signed-off-by: Pavel Begunkov asml.silence@gmail.com Signed-off-by: Jens Axboe axboe@kernel.dk Conflicts: fs/io_uring.c [skip io_openat2_prep() for commit cebdb98617ae ("io_uring: add support for IORING_OP_OPENAT2") not merged] Signed-off-by: yangerkun yangerkun@huawei.com Reviewed-by: zhangyi (F) yi.zhang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- fs/io_uring.c | 4 ++++ 1 file changed, 4 insertions(+)
diff --git a/fs/io_uring.c b/fs/io_uring.c index 2e61433e6da6..465c46f48025 100644 --- a/fs/io_uring.c +++ b/fs/io_uring.c @@ -2554,6 +2554,8 @@ static int io_openat_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe) return -EINVAL; if (sqe->flags & IOSQE_FIXED_FILE) return -EBADF; + if (req->flags & REQ_F_NEED_CLEANUP) + return 0;
req->open.dfd = READ_ONCE(sqe->fd); req->open.mode = READ_ONCE(sqe->len); @@ -2735,6 +2737,8 @@ static int io_statx_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe) return -EINVAL; if (sqe->flags & IOSQE_FIXED_FILE) return -EBADF; + if (req->flags & REQ_F_NEED_CLEANUP) + return 0;
req->open.dfd = READ_ONCE(sqe->fd); req->open.mask = READ_ONCE(sqe->len);
From: Jens Axboe axboe@kernel.dk
mainline inclusion from mainline-5.6-rc2 commit 00bcda13dcbf6bf7fa6f2a5886dd555362de8cfa category: feature bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27 CVE: NA ---------------------------
We want to use the cancel functionality for canceling based on more than just the work item itself. Instead of matching on the work address manually, allow a match handler to tell us whether we found the right work item.
No functional changes in this patch.
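The predicate-plus-data pairing used here is the standard C callback idiom; a minimal generic sketch of the shape (names illustrative, not from the kernel):

#include <stdbool.h>
#include <stddef.h>

struct match {
	bool (*fn)(const void *item, void *data);	/* true on a hit */
	void *data;					/* caller's match key */
};

/* Pointer-equality predicate, analogous to io_wq_work_match(). */
static bool ptr_match(const void *item, void *data)
{
	return item == data;
}

/* Return the first element of a NULL-terminated array the predicate accepts. */
static const void *find_match(const void *const *items, struct match *m)
{
	for (; *items; items++)
		if (m->fn(*items, m->data))
			return *items;
	return NULL;
}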
Signed-off-by: Jens Axboe axboe@kernel.dk Signed-off-by: yangerkun yangerkun@huawei.com Reviewed-by: zhangyi (F) yi.zhang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- fs/io-wq.c | 33 ++++++++++++++++++++++----------- 1 file changed, 22 insertions(+), 11 deletions(-)
diff --git a/fs/io-wq.c b/fs/io-wq.c index 0f02f35f45d0..248efd65b869 100644 --- a/fs/io-wq.c +++ b/fs/io-wq.c @@ -938,17 +938,19 @@ enum io_wq_cancel io_wq_cancel_cb(struct io_wq *wq, work_cancel_fn *cancel, return ret; }
+struct work_match { + bool (*fn)(struct io_wq_work *, void *data); + void *data; +}; + static bool io_wq_worker_cancel(struct io_worker *worker, void *data) { - struct io_wq_work *work = data; + struct work_match *match = data; unsigned long flags; bool ret = false;
- if (worker->cur_work != work) - return false; - spin_lock_irqsave(&worker->lock, flags); - if (worker->cur_work == work && + if (match->fn(worker->cur_work, match->data) && !(worker->cur_work->flags & IO_WQ_WORK_NO_CANCEL)) { send_sig(SIGINT, worker->task, 1); ret = true; @@ -959,15 +961,13 @@ static bool io_wq_worker_cancel(struct io_worker *worker, void *data) }
static enum io_wq_cancel io_wqe_cancel_work(struct io_wqe *wqe, - struct io_wq_work *cwork) + struct work_match *match) { struct io_wq_work_node *node, *prev; struct io_wq_work *work; unsigned long flags; bool found = false;
- cwork->flags |= IO_WQ_WORK_CANCEL; - /* * First check pending list, if we're lucky we can just remove it * from there. CANCEL_OK means that the work is returned as-new, @@ -977,7 +977,7 @@ static enum io_wq_cancel io_wqe_cancel_work(struct io_wqe *wqe, wq_list_for_each(node, prev, &wqe->work_list) { work = container_of(node, struct io_wq_work, list);
- if (work == cwork) { + if (match->fn(work, match->data)) { wq_node_del(&wqe->work_list, node, prev); found = true; break; @@ -998,20 +998,31 @@ static enum io_wq_cancel io_wqe_cancel_work(struct io_wqe *wqe, * completion will run normally in this case. */ rcu_read_lock(); - found = io_wq_for_each_worker(wqe, io_wq_worker_cancel, cwork); + found = io_wq_for_each_worker(wqe, io_wq_worker_cancel, match); rcu_read_unlock(); return found ? IO_WQ_CANCEL_RUNNING : IO_WQ_CANCEL_NOTFOUND; }
+static bool io_wq_work_match(struct io_wq_work *work, void *data) +{ + return work == data; +} + enum io_wq_cancel io_wq_cancel_work(struct io_wq *wq, struct io_wq_work *cwork) { + struct work_match match = { + .fn = io_wq_work_match, + .data = cwork + }; enum io_wq_cancel ret = IO_WQ_CANCEL_NOTFOUND; int node;
+ cwork->flags |= IO_WQ_WORK_CANCEL; + for_each_node(node) { struct io_wqe *wqe = wq->wqes[node];
- ret = io_wqe_cancel_work(wqe, cwork); + ret = io_wqe_cancel_work(wqe, &match); if (ret != IO_WQ_CANCEL_NOTFOUND) break; }
From: Jens Axboe axboe@kernel.dk
mainline inclusion from mainline-5.6-rc2 commit 36282881a795cbf717aca79392ae9cdf0fef59c9 category: feature bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27 CVE: NA ---------------------------
Add a helper that allows the caller to cancel work based on the pid it belongs to. This allows io_uring to cancel work from a given task or thread when it exits.
Signed-off-by: Jens Axboe axboe@kernel.dk Signed-off-by: yangerkun yangerkun@huawei.com Reviewed-by: zhangyi (F) yi.zhang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- fs/io-wq.c | 29 +++++++++++++++++++++++++++++ fs/io-wq.h | 2 ++ 2 files changed, 31 insertions(+)
diff --git a/fs/io-wq.c b/fs/io-wq.c index 248efd65b869..419845d514df 100644 --- a/fs/io-wq.c +++ b/fs/io-wq.c @@ -1030,6 +1030,35 @@ enum io_wq_cancel io_wq_cancel_work(struct io_wq *wq, struct io_wq_work *cwork) return ret; }
+static bool io_wq_pid_match(struct io_wq_work *work, void *data) +{ + pid_t pid = (pid_t) (unsigned long) data; + + if (work) + return work->task_pid == pid; + return false; +} + +enum io_wq_cancel io_wq_cancel_pid(struct io_wq *wq, pid_t pid) +{ + struct work_match match = { + .fn = io_wq_pid_match, + .data = (void *) (unsigned long) pid + }; + enum io_wq_cancel ret = IO_WQ_CANCEL_NOTFOUND; + int node; + + for_each_node(node) { + struct io_wqe *wqe = wq->wqes[node]; + + ret = io_wqe_cancel_work(wqe, &match); + if (ret != IO_WQ_CANCEL_NOTFOUND) + break; + } + + return ret; +} + struct io_wq_flush_data { struct io_wq_work work; struct completion done; diff --git a/fs/io-wq.h b/fs/io-wq.h index f152ba677d8f..ccc7d84af57d 100644 --- a/fs/io-wq.h +++ b/fs/io-wq.h @@ -76,6 +76,7 @@ struct io_wq_work { const struct cred *creds; struct fs_struct *fs; unsigned flags; + pid_t task_pid; };
#define INIT_IO_WORK(work, _func) \ @@ -109,6 +110,7 @@ void io_wq_flush(struct io_wq *wq);
void io_wq_cancel_all(struct io_wq *wq); enum io_wq_cancel io_wq_cancel_work(struct io_wq *wq, struct io_wq_work *cwork); +enum io_wq_cancel io_wq_cancel_pid(struct io_wq *wq, pid_t pid);
typedef bool (work_cancel_fn)(struct io_wq_work *, void *);
From: Jens Axboe axboe@kernel.dk
mainline inclusion from mainline-5.6-rc2 commit 6ab231448fdc5e37c15a94a4700fca11e80007f7 category: feature bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27 CVE: NA ---------------------------
Normally we cancel all work we track, but for untracked work we could leave the async worker behind until that work completes. This is totally fine, but does leave resources pending after the task is gone until that work completes.
Cancel work that this task queued up when it goes away.
Signed-off-by: Jens Axboe axboe@kernel.dk Signed-off-by: yangerkun yangerkun@huawei.com Reviewed-by: zhangyi (F) yi.zhang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- fs/io_uring.c | 9 +++++++++ 1 file changed, 9 insertions(+)
diff --git a/fs/io_uring.c b/fs/io_uring.c index 465c46f48025..a63302ba21ae 100644 --- a/fs/io_uring.c +++ b/fs/io_uring.c @@ -919,6 +919,8 @@ static inline void io_req_work_grab_env(struct io_kiocb *req, } spin_unlock(¤t->fs->lock); } + if (!req->work.task_pid) + req->work.task_pid = task_pid_vnr(current); }
static inline void io_req_work_drop_env(struct io_kiocb *req) @@ -6409,6 +6411,13 @@ static int io_uring_flush(struct file *file, void *data) struct io_ring_ctx *ctx = file->private_data;
io_uring_cancel_files(ctx, data); + + /* + * If the task is going away, cancel work it may have pending + */ + if (fatal_signal_pending(current) || (current->flags & PF_EXITING)) + io_wq_cancel_pid(ctx->io_wq, task_pid_vnr(current)); + return 0; }
From: Jens Axboe axboe@kernel.dk
mainline inclusion from mainline-5.6-rc2 commit b537916ca5107c3a8714b8ab3099c0ec205aec12 category: feature bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27 CVE: NA ---------------------------
Jonas reports that he sometimes sees -97 (-EAFNOSUPPORT) / -22 (-EINVAL) error returns from sendmsg if it gets punted async. This is due to not retaining the sockaddr_storage between calls. Include it in the state we copy when going async.
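The failure mode is a dangling stack pointer; a minimal sketch with toy types (not the io_uring code) of why embedding the storage fixes it:

#include <sys/socket.h>

struct saved_state {
	struct msghdr msg;
	struct sockaddr_storage addr;	/* now lives in the saved state */
};

/* Before the fix, msg.msg_name pointed at a sockaddr_storage on the
 * submitter's stack; once the request was retried from a worker
 * thread, that stack slot was gone. Pointing it into the state that
 * is copied for the async retry keeps it valid across calls. */
static void prepare_async_msg(struct saved_state *state)
{
	state->msg.msg_name = &state->addr;
	state->msg.msg_namelen = sizeof(state->addr);
}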
Cc: stable@vger.kernel.org # 5.3+ Reported-by: Jonas Bonn jonas@norrbonn.se Tested-by: Jonas Bonn jonas@norrbonn.se Signed-off-by: Jens Axboe axboe@kernel.dk Signed-off-by: yangerkun yangerkun@huawei.com Reviewed-by: zhangyi (F) yi.zhang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- fs/io_uring.c | 11 +++++------ 1 file changed, 5 insertions(+), 6 deletions(-)
diff --git a/fs/io_uring.c b/fs/io_uring.c index a63302ba21ae..cae36041e12b 100644 --- a/fs/io_uring.c +++ b/fs/io_uring.c @@ -443,6 +443,7 @@ struct io_async_msghdr { struct iovec *iov; struct sockaddr __user *uaddr; struct msghdr msg; + struct sockaddr_storage addr; };
struct io_async_rw { @@ -2978,12 +2979,11 @@ static int io_sendmsg(struct io_kiocb *req, struct io_kiocb **nxt, sock = sock_from_file(req->file, &ret); if (sock) { struct io_async_ctx io; - struct sockaddr_storage addr; unsigned flags;
if (req->io) { kmsg = &req->io->msg; - kmsg->msg.msg_name = &addr; + kmsg->msg.msg_name = &req->io->msg.addr; /* if iov is set, it's allocated already */ if (!kmsg->iov) kmsg->iov = kmsg->fast_iov; @@ -2992,7 +2992,7 @@ static int io_sendmsg(struct io_kiocb *req, struct io_kiocb **nxt, struct io_sr_msg *sr = &req->sr_msg;
kmsg = &io.msg; - kmsg->msg.msg_name = &addr; + kmsg->msg.msg_name = &io.msg.addr;
io.msg.iov = io.msg.fast_iov; ret = sendmsg_copy_msghdr(&io.msg.msg, sr->msg, @@ -3131,12 +3131,11 @@ static int io_recvmsg(struct io_kiocb *req, struct io_kiocb **nxt, sock = sock_from_file(req->file, &ret); if (sock) { struct io_async_ctx io; - struct sockaddr_storage addr; unsigned flags;
if (req->io) { kmsg = &req->io->msg; - kmsg->msg.msg_name = &addr; + kmsg->msg.msg_name = &req->io->msg.addr; /* if iov is set, it's allocated already */ if (!kmsg->iov) kmsg->iov = kmsg->fast_iov; @@ -3145,7 +3144,7 @@ static int io_recvmsg(struct io_kiocb *req, struct io_kiocb **nxt, struct io_sr_msg *sr = &req->sr_msg;
kmsg = &io.msg; - kmsg->msg.msg_name = &addr; + kmsg->msg.msg_name = &io.msg.addr;
io.msg.iov = io.msg.fast_iov; ret = recvmsg_copy_msghdr(&io.msg.msg, sr->msg,
From: Jens Axboe axboe@kernel.dk
mainline inclusion from mainline-5.6-rc2 commit 7563439adfae153b20331f1567c8b5d0e5cbd8a7 category: feature bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27 CVE: NA ---------------------------
Glauber reports a crash on init on a box he has:
RIP: 0010:__alloc_pages_nodemask+0x132/0x340 Code: 18 01 75 04 41 80 ce 80 89 e8 48 8b 54 24 08 8b 74 24 1c c1 e8 0c 48 8b 3c 24 83 e0 01 88 44 24 20 48 85 d2 0f 85 74 01 00 00 <3b> 77 08 0f 82 6b 01 00 00 48 89 7c 24 10 89 ea 48 8b 07 b9 00 02 RSP: 0018:ffffb8be4d0b7c28 EFLAGS: 00010246 RAX: 0000000000000000 RBX: 0000000000000000 RCX: 000000000000e8e8 RDX: 0000000000000000 RSI: 0000000000000002 RDI: 0000000000002080 RBP: 0000000000012cc0 R08: 0000000000000000 R09: 0000000000000002 R10: 0000000000000dc0 R11: ffff995c60400100 R12: 0000000000000000 R13: 0000000000012cc0 R14: 0000000000000001 R15: ffff995c60db00f0 FS: 00007f4d115ca900(0000) GS:ffff995c60d80000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 0000000000002088 CR3: 00000017cca66002 CR4: 00000000007606e0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 PKRU: 55555554 Call Trace: alloc_slab_page+0x46/0x320 new_slab+0x9d/0x4e0 ___slab_alloc+0x507/0x6a0 ? io_wq_create+0xb4/0x2a0 __slab_alloc+0x1c/0x30 kmem_cache_alloc_node_trace+0xa6/0x260 io_wq_create+0xb4/0x2a0 io_uring_setup+0x97f/0xaa0 ? io_remove_personalities+0x30/0x30 ? io_poll_trigger_evfd+0x30/0x30 do_syscall_64+0x5b/0x1c0 entry_SYSCALL_64_after_hwframe+0x44/0xa9 RIP: 0033:0x7f4d116cb1ed
which is due to the 'wqe' and 'worker' allocations being node-affine. But it isn't valid to request a node-affine allocation if the node isn't online.
Set up structures even for offline nodes, as usual, but skip them during thread setup so as not to waste resources. If the node isn't online, just allocate the memory with NUMA_NO_NODE.
Reported-by: Glauber Costa glauber@scylladb.com Tested-by: Glauber Costa glauber@scylladb.com Signed-off-by: Jens Axboe axboe@kernel.dk Signed-off-by: yangerkun yangerkun@huawei.com Reviewed-by: zhangyi (F) yi.zhang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- fs/io-wq.c | 22 ++++++++++++++++++---- 1 file changed, 18 insertions(+), 4 deletions(-)
diff --git a/fs/io-wq.c b/fs/io-wq.c index 419845d514df..4e9a202362e5 100644 --- a/fs/io-wq.c +++ b/fs/io-wq.c @@ -700,11 +700,16 @@ static int io_wq_manager(void *data) /* create fixed workers */ refcount_set(&wq->refs, workers_to_create); for_each_node(node) { + if (!node_online(node)) + continue; if (!create_io_worker(wq, wq->wqes[node], IO_WQ_ACCT_BOUND)) goto err; workers_to_create--; }
+ while (workers_to_create--) + refcount_dec(&wq->refs); + complete(&wq->done);
while (!kthread_should_stop()) { @@ -712,6 +717,9 @@ static int io_wq_manager(void *data) struct io_wqe *wqe = wq->wqes[node]; bool fork_worker[2] = { false, false };
+ if (!node_online(node)) + continue; + spin_lock_irq(&wqe->lock); if (io_wqe_need_worker(wqe, IO_WQ_ACCT_BOUND)) fork_worker[IO_WQ_ACCT_BOUND] = true; @@ -830,7 +838,9 @@ static bool io_wq_for_each_worker(struct io_wqe *wqe,
list_for_each_entry_rcu(worker, &wqe->all_list, all_list) { if (io_worker_get(worker)) { - ret = func(worker, data); + /* no task if node is/was offline */ + if (worker->task) + ret = func(worker, data); io_worker_release(worker); if (ret) break; @@ -1085,6 +1095,8 @@ void io_wq_flush(struct io_wq *wq) for_each_node(node) { struct io_wqe *wqe = wq->wqes[node];
+ if (!node_online(node)) + continue; init_completion(&data.done); INIT_IO_WORK(&data.work, io_wq_flush_func); data.work.flags |= IO_WQ_WORK_INTERNAL; @@ -1116,12 +1128,15 @@ struct io_wq *io_wq_create(unsigned bounded, struct io_wq_data *data)
for_each_node(node) { struct io_wqe *wqe; + int alloc_node = node;
- wqe = kzalloc_node(sizeof(struct io_wqe), GFP_KERNEL, node); + if (!node_online(alloc_node)) + alloc_node = NUMA_NO_NODE; + wqe = kzalloc_node(sizeof(struct io_wqe), GFP_KERNEL, alloc_node); if (!wqe) goto err; wq->wqes[node] = wqe; - wqe->node = node; + wqe->node = alloc_node; wqe->acct[IO_WQ_ACCT_BOUND].max_workers = bounded; atomic_set(&wqe->acct[IO_WQ_ACCT_BOUND].nr_running, 0); if (wq->user) { @@ -1129,7 +1144,6 @@ struct io_wq *io_wq_create(unsigned bounded, struct io_wq_data *data) task_rlimit(current, RLIMIT_NPROC); } atomic_set(&wqe->acct[IO_WQ_ACCT_UNBOUND].nr_running, 0); - wqe->node = node; wqe->wq = wq; spin_lock_init(&wqe->lock); INIT_WQ_LIST(&wqe->work_list);
From: Jens Axboe axboe@kernel.dk
mainline inclusion from mainline-5.6-rc2 commit 2ca10259b4189a433c309054496dd6af1415f992 category: feature bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27 CVE: NA ---------------------------
Carter reported an issue where he could produce a stall on ring exit, when we're cleaning up requests that match the given file table. For this particular test case, a combination of a few things caused the issue:
- The cq ring had overflowed
- The request being canceled was on the overflow list
The combination of the above means that the cq overflow list holds a reference to the request. The request is canceled correctly, but since the overflow list holds a reference to it, the final put won't happen. Since the final put doesn't happen, the request remains on the inflight list. Hence we never finish the cancellation flush.
Fix this by removing requests from the overflow list if we're canceling them.
Cc: stable@vger.kernel.org # 5.5 Reported-by: Carter Li 李通洲 carter.li@eoitek.com Signed-off-by: Jens Axboe axboe@kernel.dk Signed-off-by: yangerkun yangerkun@huawei.com Reviewed-by: zhangyi (F) yi.zhang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- fs/io_uring.c | 28 ++++++++++++++++++++++++++++ 1 file changed, 28 insertions(+)
diff --git a/fs/io_uring.c b/fs/io_uring.c index cae36041e12b..bbb5a45f3718 100644 --- a/fs/io_uring.c +++ b/fs/io_uring.c @@ -482,6 +482,7 @@ enum { REQ_F_TIMEOUT_NOSEQ_BIT, REQ_F_COMP_LOCKED_BIT, REQ_F_NEED_CLEANUP_BIT, + REQ_F_OVERFLOW_BIT, };
enum { @@ -522,6 +523,8 @@ enum { REQ_F_COMP_LOCKED = BIT(REQ_F_COMP_LOCKED_BIT), /* needs cleanup */ REQ_F_NEED_CLEANUP = BIT(REQ_F_NEED_CLEANUP_BIT), + /* in overflow list */ + REQ_F_OVERFLOW = BIT(REQ_F_OVERFLOW_BIT), };
/* @@ -1097,6 +1100,7 @@ static bool io_cqring_overflow_flush(struct io_ring_ctx *ctx, bool force) req = list_first_entry(&ctx->cq_overflow_list, struct io_kiocb, list); list_move(&req->list, &list); + req->flags &= ~REQ_F_OVERFLOW; if (cqe) { WRITE_ONCE(cqe->user_data, req->user_data); WRITE_ONCE(cqe->res, req->result); @@ -1149,6 +1153,7 @@ static void io_cqring_fill_event(struct io_kiocb *req, long res) set_bit(0, &ctx->sq_check_overflow); set_bit(0, &ctx->cq_check_overflow); } + req->flags |= REQ_F_OVERFLOW; refcount_inc(&req->refs); req->result = res; list_add_tail(&req->list, &ctx->cq_overflow_list); @@ -6398,6 +6403,29 @@ static void io_uring_cancel_files(struct io_ring_ctx *ctx, if (!cancel_req) break;
+ if (cancel_req->flags & REQ_F_OVERFLOW) { + spin_lock_irq(&ctx->completion_lock); + list_del(&cancel_req->list); + cancel_req->flags &= ~REQ_F_OVERFLOW; + if (list_empty(&ctx->cq_overflow_list)) { + clear_bit(0, &ctx->sq_check_overflow); + clear_bit(0, &ctx->cq_check_overflow); + } + spin_unlock_irq(&ctx->completion_lock); + + WRITE_ONCE(ctx->rings->cq_overflow, + atomic_inc_return(&ctx->cached_cq_overflow)); + + /* + * Put inflight ref and overflow ref. If that's + * all we had, then we're done with this request. + */ + if (refcount_sub_and_test(2, &cancel_req->refs)) { + io_put_req(cancel_req); + continue; + } + } + io_wq_cancel_work(ctx->io_wq, &cancel_req->work); io_put_req(cancel_req); schedule();
From: Pavel Begunkov asml.silence@gmail.com
mainline inclusion from mainline-5.6-rc3 commit 7fbeb95d0f68e21e6ca61284f1ac681630976947 category: feature bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27 CVE: NA ---------------------------
io_fallocate_finish() is missing a cancellation check. Add one. It's safe to do so, as only flag setup and sqe field copies are done before the request gets into __io_fallocate().
Signed-off-by: Pavel Begunkov asml.silence@gmail.com Signed-off-by: Jens Axboe axboe@kernel.dk Signed-off-by: yangerkun yangerkun@huawei.com Reviewed-by: zhangyi (F) yi.zhang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- fs/io_uring.c | 4 ++++ 1 file changed, 4 insertions(+)
diff --git a/fs/io_uring.c b/fs/io_uring.c index bbb5a45f3718..c0f3400f6ceb 100644 --- a/fs/io_uring.c +++ b/fs/io_uring.c @@ -2511,6 +2511,9 @@ static void io_fallocate_finish(struct io_wq_work **workptr) struct io_kiocb *nxt = NULL; int ret;
+ if (io_req_cancelled(req)) + return; + ret = vfs_fallocate(req->file, req->sync.mode, req->sync.off, req->sync.len); if (ret < 0) @@ -2850,6 +2853,7 @@ static void io_close_finish(struct io_wq_work **workptr) struct io_kiocb *req = container_of(*workptr, struct io_kiocb, work); struct io_kiocb *nxt = NULL;
+ /* not cancellable, don't do io_req_cancelled() */ __io_close_finish(req, &nxt); if (nxt) io_wq_assign_next(workptr, nxt);
From: Dan Carpenter dan.carpenter@oracle.com
mainline inclusion from mainline-5.6-rc3 commit 297a31e3e8318f533cff4fe33ffaefb74f72c6e2 category: feature bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27 CVE: NA ---------------------------
The "kmsg" pointer can't be NULL and we have already dereferenced it so a check here would be useless.
Reviewed-by: Stefano Garzarella sgarzare@redhat.com Signed-off-by: Dan Carpenter dan.carpenter@oracle.com Signed-off-by: Jens Axboe axboe@kernel.dk Signed-off-by: yangerkun yangerkun@huawei.com Reviewed-by: zhangyi (F) yi.zhang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- fs/io_uring.c | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-)
diff --git a/fs/io_uring.c b/fs/io_uring.c index c0f3400f6ceb..4d82b04a92c9 100644 --- a/fs/io_uring.c +++ b/fs/io_uring.c @@ -3021,7 +3021,7 @@ static int io_sendmsg(struct io_kiocb *req, struct io_kiocb **nxt, if (req->io) return -EAGAIN; if (io_alloc_async_ctx(req)) { - if (kmsg && kmsg->iov != kmsg->fast_iov) + if (kmsg->iov != kmsg->fast_iov) kfree(kmsg->iov); return -ENOMEM; } @@ -3175,7 +3175,7 @@ static int io_recvmsg(struct io_kiocb *req, struct io_kiocb **nxt, if (req->io) return -EAGAIN; if (io_alloc_async_ctx(req)) { - if (kmsg && kmsg->iov != kmsg->fast_iov) + if (kmsg->iov != kmsg->fast_iov) kfree(kmsg->iov); return -ENOMEM; }
From: Pavel Begunkov asml.silence@gmail.com
mainline inclusion from mainline-5.6-rc3 commit 929a3af90f0f4bd7132d83552c1a98c83f60ef7e category: feature bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27 CVE: NA ---------------------------
io_cleanup_req() should be called before req->io is freed, and so shouldn't be done after __io_free_req() -> __io_req_aux_free(). Also, it is currently skipped in io_free_req_many(), which uses __io_req_aux_free().
Place io_cleanup_req() into __io_req_aux_free().
Fixes: 99bc4c38537d774 ("io_uring: fix iovec leaks") Signed-off-by: Pavel Begunkov asml.silence@gmail.com Signed-off-by: Jens Axboe axboe@kernel.dk Signed-off-by: yangerkun yangerkun@huawei.com Reviewed-by: zhangyi (F) yi.zhang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- fs/io_uring.c | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-)
diff --git a/fs/io_uring.c b/fs/io_uring.c index 4d82b04a92c9..4ed5a7d97640 100644 --- a/fs/io_uring.c +++ b/fs/io_uring.c @@ -1254,6 +1254,9 @@ static void __io_req_aux_free(struct io_kiocb *req) { struct io_ring_ctx *ctx = req->ctx;
+ if (req->flags & REQ_F_NEED_CLEANUP) + io_cleanup_req(req); + kfree(req->io); if (req->file) { if (req->flags & REQ_F_FIXED_FILE) @@ -1269,9 +1272,6 @@ static void __io_free_req(struct io_kiocb *req) { __io_req_aux_free(req);
- if (req->flags & REQ_F_NEED_CLEANUP) - io_cleanup_req(req); - if (req->flags & REQ_F_INFLIGHT) { struct io_ring_ctx *ctx = req->ctx; unsigned long flags;
From: Stefano Garzarella sgarzare@redhat.com
mainline inclusion from mainline-5.6-rc3 commit 7143b5ac5750f404ff3a594b34fdf3fc2f99f828 category: feature bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27 CVE: NA ---------------------------
This patch drops 'cur_mm' before calling cond_resched(), to prevent the sq_thread from spinning even when the user process is finished.
Before this patch, if the user process ended without closing the io_uring fd, the sq_thread continues to spin until the 'sq_thread_idle' timeout ends.
In the worst case where the 'sq_thread_idle' parameter is bigger than INT_MAX, the sq_thread will spin forever.
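For context, 'sq_thread_idle' comes from the ring setup parameters; a hedged userspace sketch using the raw syscall (assuming __NR_io_uring_setup is available):

#include <linux/io_uring.h>
#include <string.h>
#include <sys/syscall.h>
#include <unistd.h>

/* sq_thread_idle bounds how long the kernel-side thread spins without
 * work before sleeping; 0 falls back to a kernel default. */
static int setup_sqpoll_ring(unsigned entries, unsigned idle_ms)
{
	struct io_uring_params p;

	memset(&p, 0, sizeof(p));
	p.flags = IORING_SETUP_SQPOLL;
	p.sq_thread_idle = idle_ms;
	return (int) syscall(__NR_io_uring_setup, entries, &p);
}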
Fixes: 6c271ce2f1d5 ("io_uring: add submission polling") Signed-off-by: Stefano Garzarella sgarzare@redhat.com Signed-off-by: Jens Axboe axboe@kernel.dk Signed-off-by: yangerkun yangerkun@huawei.com Reviewed-by: zhangyi (F) yi.zhang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- fs/io_uring.c | 24 ++++++++++++------------ 1 file changed, 12 insertions(+), 12 deletions(-)
diff --git a/fs/io_uring.c b/fs/io_uring.c index 4ed5a7d97640..c3f0df489a57 100644 --- a/fs/io_uring.c +++ b/fs/io_uring.c @@ -5076,6 +5076,18 @@ static int io_sq_thread(void *data) * to enter the kernel to reap and flush events. */ if (!to_submit || ret == -EBUSY) { + /* + * Drop cur_mm before scheduling, we can't hold it for + * long periods (or over schedule()). Do this before + * adding ourselves to the waitqueue, as the unuse/drop + * may sleep. + */ + if (cur_mm) { + unuse_mm(cur_mm); + mmput(cur_mm); + cur_mm = NULL; + } + /* * We're polling. If we're within the defined idle * period, then let us spin without work before going @@ -5090,18 +5102,6 @@ static int io_sq_thread(void *data) continue; }
- /* - * Drop cur_mm before scheduling, we can't hold it for - * long periods (or over schedule()). Do this before - * adding ourselves to the waitqueue, as the unuse/drop - * may sleep. - */ - if (cur_mm) { - unuse_mm(cur_mm); - mmput(cur_mm); - cur_mm = NULL; - } - prepare_to_wait(&ctx->sqo_wait, &wait, TASK_INTERRUPTIBLE);
From: Xiaoguang Wang xiaoguang.wang@linux.alibaba.com
mainline inclusion from mainline-5.6-rc3 commit c7849be9cc2dd2754c48ddbaca27c2de6d80a95d category: feature bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27 CVE: NA ---------------------------
Since commit a3a0e43fd770 ("io_uring: don't enter poll loop if we have CQEs pending"), if we already have events pending, we won't enter the poll loop. In case SETUP_IOPOLL and SETUP_SQPOLL are both enabled, if the app has been terminated without reaping the pending events already in the cq ring, and there are still reqs in the poll_list, io_sq_thread will enter __io_iopoll_check(), find pending events, and return; this loop never gets a chance to exit.
I have seen this issue in fio stress tests. To fix it, let io_sq_thread call io_iopoll_getevents() with the argument 'min' set to zero, and remove __io_iopoll_check().
Fixes: a3a0e43fd770 ("io_uring: don't enter poll loop if we have CQEs pending") Signed-off-by: Xiaoguang Wang xiaoguang.wang@linux.alibaba.com Signed-off-by: Jens Axboe axboe@kernel.dk Signed-off-by: yangerkun yangerkun@huawei.com Reviewed-by: zhangyi (F) yi.zhang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- fs/io_uring.c | 27 +++++++++------------------ 1 file changed, 9 insertions(+), 18 deletions(-)
diff --git a/fs/io_uring.c b/fs/io_uring.c index c3f0df489a57..717055df430a 100644 --- a/fs/io_uring.c +++ b/fs/io_uring.c @@ -1666,11 +1666,17 @@ static void io_iopoll_reap_events(struct io_ring_ctx *ctx) mutex_unlock(&ctx->uring_lock); }
-static int __io_iopoll_check(struct io_ring_ctx *ctx, unsigned *nr_events, - long min) +static int io_iopoll_check(struct io_ring_ctx *ctx, unsigned *nr_events, + long min) { int iters = 0, ret = 0;
+ /* + * We disallow the app entering submit/complete with polling, but we + * still need to lock the ring to prevent racing with polled issue + * that got punted to a workqueue. + */ + mutex_lock(&ctx->uring_lock); do { int tmin = 0;
@@ -1706,21 +1712,6 @@ static int __io_iopoll_check(struct io_ring_ctx *ctx, unsigned *nr_events, ret = 0; } while (min && !*nr_events && !need_resched());
- return ret; -} - -static int io_iopoll_check(struct io_ring_ctx *ctx, unsigned *nr_events, - long min) -{ - int ret; - - /* - * We disallow the app entering submit/complete with polling, but we - * still need to lock the ring to prevent racing with polled issue - * that got punted to a workqueue. - */ - mutex_lock(&ctx->uring_lock); - ret = __io_iopoll_check(ctx, nr_events, min); mutex_unlock(&ctx->uring_lock); return ret; } @@ -5052,7 +5043,7 @@ static int io_sq_thread(void *data) */ mutex_lock(&ctx->uring_lock); if (!list_empty(&ctx->poll_list)) - __io_iopoll_check(ctx, &nr_events, 0); + io_iopoll_getevents(ctx, &nr_events, 0); else inflight = 0; mutex_unlock(&ctx->uring_lock);
From: Jens Axboe axboe@kernel.dk
mainline inclusion from mainline-5.6-rc4 commit 193155c8c9429f57400daf1f2ef0075016767112 category: feature bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27 CVE: NA ---------------------------
If we have a chain of requests and they don't all use the same credentials, then the head of the chain will be issued with the credentials of the tail of the chain.
Ensure __io_queue_sqe() overrides the credentials, if they are different.
Once we do that, we can clean up the creds handling as well, by only having io_submit_sqe() do the lookup of a personality. It doesn't need to assign it, since __io_queue_sqe() now always does the right thing.
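For reference, the userspace side of the personality mechanism looks roughly like this (a sketch against the uapi; error handling elided):

#include <linux/io_uring.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Register the calling task's current credentials as a personality,
 * then tag an SQE with the returned id so it is issued with them. */
static int issue_with_personality(int ring_fd, struct io_uring_sqe *sqe)
{
	int id = (int) syscall(__NR_io_uring_register, ring_fd,
			       IORING_REGISTER_PERSONALITY, NULL, 0);
	if (id < 0)
		return -1;
	sqe->personality = (__u16) id;
	return 0;
}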
Fixes: 75c6a03904e0 ("io_uring: support using a registered personality for commands") Reported-by: Pavel Begunkov asml.silence@gmail.com Signed-off-by: Jens Axboe axboe@kernel.dk Signed-off-by: yangerkun yangerkun@huawei.com Reviewed-by: zhangyi (F) yi.zhang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- fs/io_uring.c | 25 +++++++++++++++---------- 1 file changed, 15 insertions(+), 10 deletions(-)
diff --git a/fs/io_uring.c b/fs/io_uring.c index 717055df430a..b02907e824f3 100644 --- a/fs/io_uring.c +++ b/fs/io_uring.c @@ -4639,11 +4639,21 @@ static void __io_queue_sqe(struct io_kiocb *req, const struct io_uring_sqe *sqe) { struct io_kiocb *linked_timeout; struct io_kiocb *nxt = NULL; + const struct cred *old_creds = NULL; int ret;
again: linked_timeout = io_prep_linked_timeout(req);
+ if (req->work.creds && req->work.creds != current_cred()) { + if (old_creds) + revert_creds(old_creds); + if (old_creds == req->work.creds) + old_creds = NULL; /* restored original creds */ + else + old_creds = override_creds(req->work.creds); + } + ret = io_issue_sqe(req, sqe, &nxt, true);
/* @@ -4693,6 +4703,8 @@ static void __io_queue_sqe(struct io_kiocb *req, const struct io_uring_sqe *sqe) goto punt; goto again; } + if (old_creds) + revert_creds(old_creds); }
static void io_queue_sqe(struct io_kiocb *req, const struct io_uring_sqe *sqe) @@ -4737,7 +4749,6 @@ static inline void io_queue_link_head(struct io_kiocb *req) static bool io_submit_sqe(struct io_kiocb *req, const struct io_uring_sqe *sqe, struct io_submit_state *state, struct io_kiocb **link) { - const struct cred *old_creds = NULL; struct io_ring_ctx *ctx = req->ctx; unsigned int sqe_flags; int ret, id; @@ -4752,14 +4763,12 @@ static bool io_submit_sqe(struct io_kiocb *req, const struct io_uring_sqe *sqe,
id = READ_ONCE(sqe->personality); if (id) { - const struct cred *personality_creds; - - personality_creds = idr_find(&ctx->personality_idr, id); - if (unlikely(!personality_creds)) { + req->work.creds = idr_find(&ctx->personality_idr, id); + if (unlikely(!req->work.creds)) { ret = -EINVAL; goto err_req; } - old_creds = override_creds(personality_creds); + get_cred(req->work.creds); }
/* same numerical values with corresponding REQ_F_*, safe to copy */ @@ -4771,8 +4780,6 @@ static bool io_submit_sqe(struct io_kiocb *req, const struct io_uring_sqe *sqe, err_req: io_cqring_add_event(req, ret); io_double_put_req(req); - if (old_creds) - revert_creds(old_creds); return false; }
@@ -4833,8 +4840,6 @@ static bool io_submit_sqe(struct io_kiocb *req, const struct io_uring_sqe *sqe, } }
- if (old_creds) - revert_creds(old_creds); return true; }
From: Jens Axboe axboe@kernel.dk
mainline inclusion from mainline-5.6-rc4 commit 41726c9a50e7464beca7112d0aebf3a0090c62d2 category: feature bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27 CVE: NA ---------------------------
We somehow never free the idr, even though we init it for every ctx. Free it when the rest of the ring data is freed.
Fixes: 071698e13ac6 ("io_uring: allow registering credentials") Reviewed-by: Stefano Garzarella sgarzare@redhat.com Signed-off-by: Jens Axboe axboe@kernel.dk Signed-off-by: yangerkun yangerkun@huawei.com Reviewed-by: zhangyi (F) yi.zhang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- fs/io_uring.c | 1 + 1 file changed, 1 insertion(+)
diff --git a/fs/io_uring.c b/fs/io_uring.c index b02907e824f3..99dd85b92ec5 100644 --- a/fs/io_uring.c +++ b/fs/io_uring.c @@ -6274,6 +6274,7 @@ static void io_ring_ctx_free(struct io_ring_ctx *ctx) io_sqe_buffer_unregister(ctx); io_sqe_files_unregister(ctx); io_eventfd_unregister(ctx); + idr_destroy(&ctx->personality_idr);
#if defined(CONFIG_UNIX) if (ctx->ring_sock) {
From: Xiaoguang Wang xiaoguang.wang@linux.alibaba.com
mainline inclusion from mainline-5.6-rc4 commit bdcd3eab2a9ae0ac93f27275b6895dd95e5bf360 category: feature bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27 CVE: NA ---------------------------
After making ext4 support the iopoll method (letting ext4_file_operations' iopoll method be iomap_dio_iopoll()), we found fio can easily hang in fio_ioring_getevents() with the below fio job:

rm -f testfile; sync; sudo fio -name=fiotest -filename=testfile -iodepth=128 -thread -rw=write -ioengine=io_uring -hipri=1 -sqthread_poll=1 -direct=1 -bs=4k -size=10G -numjobs=8 -runtime=2000 -group_reporting

with IORING_SETUP_SQPOLL and IORING_SETUP_IOPOLL enabled.
There are two issues that result in this hang. One reason is that when IORING_SETUP_SQPOLL and IORING_SETUP_IOPOLL are enabled, fio does not use io_uring_enter to get completed events; it relies on the kernel io_sq_thread to poll for them.
Another reason is that there is a race: when io_submit_sqes() in io_sq_thread() submits a batch of sqes, the variable 'inflight' records the number of submitted reqs, and io_sq_thread then polls for reqs which have been added to the poll_list. But note that if some previous reqs were punted to an io worker, those reqs won't show up in the poll_list in time. io_sq_thread() will only poll for a part of the previously submitted reqs, then find the poll_list empty and reset the variable 'inflight' to zero. If the app just waits on these deferred reqs and does not wake up io_sq_thread again, the hang happens.
For an app that entirely relies on io_sq_thread to poll for completed requests, let io_iopoll_req_issued() wake up io_sq_thread properly when adding a new element to the poll_list, and when io_sq_thread prepares to sleep, check whether the poll_list is empty again; if it is not, continue polling.
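The userspace half of that wakeup contract, for reference (sketch; sq_flags points into the mmap'ed SQ ring at params.sq_off.flags):

#include <linux/io_uring.h>
#include <sys/syscall.h>
#include <unistd.h>

/* When the poller thread goes to sleep it sets IORING_SQ_NEED_WAKEUP
 * in the shared flags, and the app must enter the kernel to wake it. */
static void nudge_sq_thread(int ring_fd, const unsigned *sq_flags)
{
	if (*(volatile const unsigned *) sq_flags & IORING_SQ_NEED_WAKEUP)
		syscall(__NR_io_uring_enter, ring_fd, 0, 0,
			IORING_ENTER_SQ_WAKEUP, NULL, 0);
}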
Signed-off-by: Xiaoguang Wang xiaoguang.wang@linux.alibaba.com Signed-off-by: Jens Axboe axboe@kernel.dk Signed-off-by: yangerkun yangerkun@huawei.com Reviewed-by: zhangyi (F) yi.zhang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- fs/io_uring.c | 59 +++++++++++++++++++++++---------------------------- 1 file changed, 27 insertions(+), 32 deletions(-)
diff --git a/fs/io_uring.c b/fs/io_uring.c index 99dd85b92ec5..52b21cd6cca0 100644 --- a/fs/io_uring.c +++ b/fs/io_uring.c @@ -1815,6 +1815,10 @@ static void io_iopoll_req_issued(struct io_kiocb *req) list_add(&req->list, &ctx->poll_list); else list_add_tail(&req->list, &ctx->poll_list); + + if ((ctx->flags & IORING_SETUP_SQPOLL) && + wq_has_sleeper(&ctx->sqo_wait)) + wake_up(&ctx->sqo_wait); }
static void io_file_put(struct io_submit_state *state) @@ -5020,9 +5024,8 @@ static int io_sq_thread(void *data) const struct cred *old_cred; mm_segment_t old_fs; DEFINE_WAIT(wait); - unsigned inflight; unsigned long timeout; - int ret; + int ret = 0;
complete(&ctx->completions[1]);
@@ -5030,39 +5033,19 @@ static int io_sq_thread(void *data) set_fs(USER_DS); old_cred = override_creds(ctx->creds);
- ret = timeout = inflight = 0; + timeout = jiffies + ctx->sq_thread_idle; while (!kthread_should_park()) { unsigned int to_submit;
- if (inflight) { + if (!list_empty(&ctx->poll_list)) { unsigned nr_events = 0;
- if (ctx->flags & IORING_SETUP_IOPOLL) { - /* - * inflight is the count of the maximum possible - * entries we submitted, but it can be smaller - * if we dropped some of them. If we don't have - * poll entries available, then we know that we - * have nothing left to poll for. Reset the - * inflight count to zero in that case. - */ - mutex_lock(&ctx->uring_lock); - if (!list_empty(&ctx->poll_list)) - io_iopoll_getevents(ctx, &nr_events, 0); - else - inflight = 0; - mutex_unlock(&ctx->uring_lock); - } else { - /* - * Normal IO, just pretend everything completed. - * We don't have to poll completions for that. - */ - nr_events = inflight; - } - - inflight -= nr_events; - if (!inflight) + mutex_lock(&ctx->uring_lock); + if (!list_empty(&ctx->poll_list)) + io_iopoll_getevents(ctx, &nr_events, 0); + else timeout = jiffies + ctx->sq_thread_idle; + mutex_unlock(&ctx->uring_lock); }
to_submit = io_sqring_entries(ctx); @@ -5091,7 +5074,7 @@ static int io_sq_thread(void *data) * more IO, we should wait for the application to * reap events and wake us up. */ - if (inflight || + if (!list_empty(&ctx->poll_list) || (!time_after(jiffies, timeout) && ret != -EBUSY && !percpu_ref_is_dying(&ctx->refs))) { cond_resched(); @@ -5101,6 +5084,19 @@ static int io_sq_thread(void *data) prepare_to_wait(&ctx->sqo_wait, &wait, TASK_INTERRUPTIBLE);
+ /* + * While doing polled IO, before going to sleep, we need + * to check if there are new reqs added to poll_list, it + * is because reqs may have been punted to io worker and + * will be added to poll_list later, hence check the + * poll_list again. + */ + if ((ctx->flags & IORING_SETUP_IOPOLL) && + !list_empty_careful(&ctx->poll_list)) { + finish_wait(&ctx->sqo_wait, &wait); + continue; + } + /* Tell userspace we may need a wakeup call */ ctx->rings->sq_flags |= IORING_SQ_NEED_WAKEUP; /* make sure to read SQ tail after writing flags */ @@ -5128,8 +5124,7 @@ static int io_sq_thread(void *data) mutex_lock(&ctx->uring_lock); ret = io_submit_sqes(ctx, to_submit, NULL, -1, &cur_mm, true); mutex_unlock(&ctx->uring_lock); - if (ret > 0) - inflight += ret; + timeout = jiffies + ctx->sq_thread_idle; }
set_fs(old_fs);
From: Jens Axboe axboe@kernel.dk
mainline inclusion from mainline-5.6-rc4 commit 3030fd4cb783377eca0e2a3eee63724a5c66ee15 category: feature bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27 CVE: NA ---------------------------
Andres reports that buffered IO seems to suck up more cycles than we would like, and he narrowed it down to the fact that the io-wq workers will briefly spin for more work on completion of a work item. This was a win on the networking side, but apparently some other cases take a hit because of it. Remove the optimization to avoid burning more CPU than we have to for disk IO.
Reported-by: Andres Freund andres@anarazel.de Signed-off-by: Jens Axboe axboe@kernel.dk Signed-off-by: yangerkun yangerkun@huawei.com Reviewed-by: zhangyi (F) yi.zhang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- fs/io-wq.c | 19 ------------------- 1 file changed, 19 deletions(-)
diff --git a/fs/io-wq.c b/fs/io-wq.c index 4e9a202362e5..88f34f66c387 100644 --- a/fs/io-wq.c +++ b/fs/io-wq.c @@ -536,42 +536,23 @@ static void io_worker_handle_work(struct io_worker *worker) } while (1); }
-static inline void io_worker_spin_for_work(struct io_wqe *wqe) -{ - int i = 0; - - while (++i < 1000) { - if (io_wqe_run_queue(wqe)) - break; - if (need_resched()) - break; - cpu_relax(); - } -} - static int io_wqe_worker(void *data) { struct io_worker *worker = data; struct io_wqe *wqe = worker->wqe; struct io_wq *wq = wqe->wq; - bool did_work;
io_worker_start(wqe, worker);
- did_work = false; while (!test_bit(IO_WQ_BIT_EXIT, &wq->state)) { set_current_state(TASK_INTERRUPTIBLE); loop: - if (did_work) - io_worker_spin_for_work(wqe); spin_lock_irq(&wqe->lock); if (io_wqe_run_queue(wqe)) { __set_current_state(TASK_RUNNING); io_worker_handle_work(worker); - did_work = true; goto loop; } - did_work = false; /* drops the lock on success, retry */ if (__io_worker_idle(wqe, worker)) { __release(&wqe->lock);
From: Jens Axboe axboe@kernel.dk
mainline inclusion from mainline-5.6-rc4 commit 2d141dd2caa78fbaf87b57c27769bdc14975ab3d category: feature bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27 CVE: NA ---------------------------
We use ->task_pid for exit cancellation, but we need to ensure it's cleared to zero for io_req_work_grab_env() to do the right thing. Take a suggestion from Bart and clear the whole struct, setting just the function passed in. This makes it more future-proof as well.
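Why the compound-literal form clears everything (generic C semantics, not kernel-specific): members not named in a designated initializer are zero-initialized, so new fields like ->task_pid start out cleared automatically:

struct toy_work {
	void (*func)(void);
	int task_pid;
	unsigned flags;
};

static void toy_init_work(struct toy_work *w, void (*fn)(void))
{
	/* Every member not named here is zeroed by the language,
	 * including fields added to the struct later on. */
	*w = (struct toy_work){ .func = fn };
}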
Fixes: 36282881a795 ("io-wq: add io_wq_cancel_pid() to cancel based on a specific pid") Signed-off-by: Jens Axboe axboe@kernel.dk Signed-off-by: yangerkun yangerkun@huawei.com Reviewed-by: zhangyi (F) yi.zhang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- fs/io-wq.h | 14 ++++---------- 1 file changed, 4 insertions(+), 10 deletions(-)
diff --git a/fs/io-wq.h b/fs/io-wq.h index ccc7d84af57d..33baba4370c5 100644 --- a/fs/io-wq.h +++ b/fs/io-wq.h @@ -79,16 +79,10 @@ struct io_wq_work { pid_t task_pid; };
-#define INIT_IO_WORK(work, _func) \ - do { \ - (work)->list.next = NULL; \ - (work)->func = _func; \ - (work)->files = NULL; \ - (work)->mm = NULL; \ - (work)->creds = NULL; \ - (work)->fs = NULL; \ - (work)->flags = 0; \ - } while (0) \ +#define INIT_IO_WORK(work, _func) \ + do { \ + *(work) = (struct io_wq_work){ .func = _func }; \ + } while (0) \
typedef void (get_work_fn)(struct io_wq_work *); typedef void (put_work_fn)(struct io_wq_work *);
From: Pavel Begunkov asml.silence@gmail.com
mainline inclusion from mainline-5.7-rc1 commit 8da11c19940ddbc22fc835bce3f361f4d2417fb0 category: feature bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27 CVE: NA ---------------------------
Preparation without functional changes. Adds io_file_get(), which allows grabbing a file into something other than req->file.
Signed-off-by: Pavel Begunkov asml.silence@gmail.com Signed-off-by: Jens Axboe axboe@kernel.dk Signed-off-by: yangerkun yangerkun@huawei.com Reviewed-by: zhangyi (F) yi.zhang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- fs/io_uring.c | 72 ++++++++++++++++++++++++++++++--------------------- 1 file changed, 43 insertions(+), 29 deletions(-)
diff --git a/fs/io_uring.c b/fs/io_uring.c index e5e1ca32f21d..3fbc9f02f630 100644 --- a/fs/io_uring.c +++ b/fs/io_uring.c @@ -1250,6 +1250,15 @@ static struct io_kiocb *io_get_req(struct io_ring_ctx *ctx, return NULL; }
+static inline void io_put_file(struct io_kiocb *req, struct file *file, + bool fixed) +{ + if (fixed) + percpu_ref_put(&req->ctx->file_data->refs); + else + fput(file); +} + static void __io_req_do_free(struct io_kiocb *req) { if (likely(!io_is_fallback_req(req))) @@ -1260,18 +1269,12 @@ static void __io_req_do_free(struct io_kiocb *req)
static void __io_req_aux_free(struct io_kiocb *req) { - struct io_ring_ctx *ctx = req->ctx; - if (req->flags & REQ_F_NEED_CLEANUP) io_cleanup_req(req);
kfree(req->io); - if (req->file) { - if (req->flags & REQ_F_FIXED_FILE) - percpu_ref_put(&ctx->file_data->refs); - else - fput(req->file); - } + if (req->file) + io_put_file(req, req->file, (req->flags & REQ_F_FIXED_FILE));
io_req_work_drop_env(req); } @@ -1845,7 +1848,7 @@ static void io_file_put(struct io_submit_state *state) * assuming most submissions are for one file, or at least that each file * has more than one submission. */ -static struct file *io_file_get(struct io_submit_state *state, int fd) +static struct file *__io_file_get(struct io_submit_state *state, int fd) { if (!state) return fget(fd); @@ -4521,41 +4524,52 @@ static inline struct file *io_file_from_index(struct io_ring_ctx *ctx, return table->files[index & IORING_FILE_TABLE_MASK];; }
-static int io_req_set_file(struct io_submit_state *state, struct io_kiocb *req, - const struct io_uring_sqe *sqe) +static int io_file_get(struct io_submit_state *state, struct io_kiocb *req, + int fd, struct file **out_file, bool fixed) { struct io_ring_ctx *ctx = req->ctx; - unsigned flags; - int fd; - - flags = READ_ONCE(sqe->flags); - fd = READ_ONCE(sqe->fd); - - if (!io_req_needs_file(req, fd)) - return 0; + struct file *file;
- if (flags & IOSQE_FIXED_FILE) { + if (fixed) { if (unlikely(!ctx->file_data || (unsigned) fd >= ctx->nr_user_files)) return -EBADF; fd = array_index_nospec(fd, ctx->nr_user_files); - req->file = io_file_from_index(ctx, fd); - if (!req->file) + file = io_file_from_index(ctx, fd); + if (!file) return -EBADF; - req->flags |= REQ_F_FIXED_FILE; percpu_ref_get(&ctx->file_data->refs); } else { - if (req->needs_fixed_file) - return -EBADF; trace_io_uring_file_get(ctx, fd); - req->file = io_file_get(state, fd); - if (unlikely(!req->file)) + file = __io_file_get(state, fd); + if (unlikely(!file)) return -EBADF; }
+ *out_file = file; return 0; }
+static int io_req_set_file(struct io_submit_state *state, struct io_kiocb *req, + const struct io_uring_sqe *sqe) +{ + unsigned flags; + int fd; + bool fixed; + + flags = READ_ONCE(sqe->flags); + fd = READ_ONCE(sqe->fd); + + if (!io_req_needs_file(req, fd)) + return 0; + + fixed = (flags & IOSQE_FIXED_FILE); + if (unlikely(!fixed && req->needs_fixed_file)) + return -EBADF; + + return io_file_get(state, req, fd, &req->file, fixed); +} + static int io_grab_files(struct io_kiocb *req) { int ret = -EBADF; @@ -4800,8 +4814,8 @@ static bool io_submit_sqe(struct io_kiocb *req, const struct io_uring_sqe *sqe, }
/* same numerical values with corresponding REQ_F_*, safe to copy */ - req->flags |= sqe_flags & (IOSQE_IO_DRAIN|IOSQE_IO_HARDLINK| - IOSQE_ASYNC); + req->flags |= sqe_flags & (IOSQE_IO_DRAIN | IOSQE_IO_HARDLINK | + IOSQE_ASYNC | IOSQE_FIXED_FILE);
ret = io_req_set_file(state, req, sqe); if (unlikely(ret)) {
From: Pavel Begunkov asml.silence@gmail.com
mainline inclusion from mainline-5.7-rc1 commit 7d67af2c013402537385dae343a2d0f6a4cb3bfd category: feature bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27 CVE: NA ---------------------------
Add support for splice(2).
- the output file is specified as sqe->fd, so it's handled by generic code
- hash_reg_file is handled by generic code as well
- len is 32-bit, but should be fine
- fd_in is a registered file when SPLICE_F_FD_IN_FIXED is set, which is a splice flag (i.e. sqe->splice_flags); see the SQE sketch below
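How a raw SQE for the new opcode could be filled in, per the uapi changes below (a hedged sketch; helper libraries may differ):

#include <linux/io_uring.h>
#include <string.h>

/* Splice 'len' bytes from fd_in (at its current position) to fd_out. */
static void prep_splice_sqe(struct io_uring_sqe *sqe,
			    int fd_in, int fd_out, unsigned len)
{
	memset(sqe, 0, sizeof(*sqe));
	sqe->opcode = IORING_OP_SPLICE;
	sqe->splice_fd_in = fd_in;
	sqe->splice_off_in = (__u64) -1;	/* -1: use the file position */
	sqe->fd = fd_out;			/* output file, generic code */
	sqe->off = (__u64) -1;
	sqe->len = len;
	sqe->splice_flags = 0;	/* or SPLICE_F_FD_IN_FIXED for a fixed fd_in */
}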
Signed-off-by: Pavel Begunkov asml.silence@gmail.com Signed-off-by: Jens Axboe axboe@kernel.dk Signed-off-by: yangerkun yangerkun@huawei.com Reviewed-by: zhangyi (F) yi.zhang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- fs/io_uring.c | 109 ++++++++++++++++++++++++++++++++++ include/uapi/linux/io_uring.h | 14 ++++- 2 files changed, 122 insertions(+), 1 deletion(-)
diff --git a/fs/io_uring.c b/fs/io_uring.c index 3fbc9f02f630..cdfcc578fe6b 100644 --- a/fs/io_uring.c +++ b/fs/io_uring.c @@ -76,6 +76,7 @@ #include <linux/fadvise.h> #include <linux/eventpoll.h> #include <linux/fs_struct.h> +#include <linux/splice.h>
#define CREATE_TRACE_POINTS #include <trace/events/io_uring.h> @@ -431,6 +432,15 @@ struct io_epoll { struct epoll_event event; };
+struct io_splice { + struct file *file_out; + struct file *file_in; + loff_t off_out; + loff_t off_in; + u64 len; + unsigned int flags; +}; + struct io_async_connect { struct sockaddr_storage address; }; @@ -547,6 +557,7 @@ struct io_kiocb { struct io_fadvise fadvise; struct io_madvise madvise; struct io_epoll epoll; + struct io_splice splice; };
struct io_async_ctx *io; @@ -741,6 +752,11 @@ static const struct io_op_def io_op_defs[] = { .unbound_nonreg_file = 1, .file_table = 1, }, + [IORING_OP_SPLICE] = { + .needs_file = 1, + .hash_reg_file = 1, + .unbound_nonreg_file = 1, + } };
static void io_wq_submit_work(struct io_wq_work **workptr); @@ -755,6 +771,10 @@ static int __io_sqe_files_update(struct io_ring_ctx *ctx, static int io_grab_files(struct io_kiocb *req); static void io_ring_file_ref_flush(struct fixed_file_data *data); static void io_cleanup_req(struct io_kiocb *req); +static int io_file_get(struct io_submit_state *state, + struct io_kiocb *req, + int fd, struct file **out_file, + bool fixed);
static struct kmem_cache *req_cachep;
@@ -2401,6 +2421,77 @@ static int io_write(struct io_kiocb *req, struct io_kiocb **nxt, return ret; }
+static int io_splice_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe) +{ + struct io_splice* sp = &req->splice; + unsigned int valid_flags = SPLICE_F_FD_IN_FIXED | SPLICE_F_ALL; + int ret; + + if (req->flags & REQ_F_NEED_CLEANUP) + return 0; + + sp->file_in = NULL; + sp->off_in = READ_ONCE(sqe->splice_off_in); + sp->off_out = READ_ONCE(sqe->off); + sp->len = READ_ONCE(sqe->len); + sp->flags = READ_ONCE(sqe->splice_flags); + + if (unlikely(sp->flags & ~valid_flags)) + return -EINVAL; + + ret = io_file_get(NULL, req, READ_ONCE(sqe->splice_fd_in), &sp->file_in, + (sp->flags & SPLICE_F_FD_IN_FIXED)); + if (ret) + return ret; + req->flags |= REQ_F_NEED_CLEANUP; + + if (!S_ISREG(file_inode(sp->file_in)->i_mode)) + req->work.flags |= IO_WQ_WORK_UNBOUND; + + return 0; +} + +static bool io_splice_punt(struct file *file) +{ + if (get_pipe_info(file)) + return false; + if (!io_file_supports_async(file)) + return true; + return !(file->f_mode & O_NONBLOCK); +} + +static int io_splice(struct io_kiocb *req, struct io_kiocb **nxt, + bool force_nonblock) +{ + struct io_splice *sp = &req->splice; + struct file *in = sp->file_in; + struct file *out = sp->file_out; + unsigned int flags = sp->flags & ~SPLICE_F_FD_IN_FIXED; + loff_t *poff_in, *poff_out; + long ret; + + if (force_nonblock) { + if (io_splice_punt(in) || io_splice_punt(out)) + return -EAGAIN; + flags |= SPLICE_F_NONBLOCK; + } + + poff_in = (sp->off_in == -1) ? NULL : &sp->off_in; + poff_out = (sp->off_out == -1) ? NULL : &sp->off_out; + ret = do_splice(in, poff_in, out, poff_out, sp->len, flags); + if (force_nonblock && ret == -EAGAIN) + return -EAGAIN; + + io_put_file(req, in, (sp->flags & SPLICE_F_FD_IN_FIXED)); + req->flags &= ~REQ_F_NEED_CLEANUP; + + io_cqring_add_event(req, ret); + if (ret != sp->len) + req_set_fail_links(req); + io_put_req_find_next(req, nxt); + return 0; +} + /* * IORING_OP_NOP just posts a completion event, nothing else. */ @@ -4182,6 +4273,9 @@ static int io_req_defer_prep(struct io_kiocb *req, case IORING_OP_EPOLL_CTL: ret = io_epoll_ctl_prep(req, sqe); break; + case IORING_OP_SPLICE: + ret = io_splice_prep(req, sqe); + break; default: printk_once(KERN_WARNING "io_uring: unhandled opcode %d\n", req->opcode); @@ -4243,6 +4337,10 @@ static void io_cleanup_req(struct io_kiocb *req) case IORING_OP_STATX: putname(req->open.filename); break; + case IORING_OP_SPLICE: + io_put_file(req, req->splice.file_in, + (req->splice.flags & SPLICE_F_FD_IN_FIXED)); + break; }
req->flags &= ~REQ_F_NEED_CLEANUP; @@ -4438,6 +4536,14 @@ static int io_issue_sqe(struct io_kiocb *req, const struct io_uring_sqe *sqe, } ret = io_epoll_ctl(req, nxt, force_nonblock); break; + case IORING_OP_SPLICE: + if (sqe) { + ret = io_splice_prep(req, sqe); + if (ret < 0) + break; + } + ret = io_splice(req, nxt, force_nonblock); + break; default: ret = -EINVAL; break; @@ -7196,6 +7302,7 @@ static int __init io_uring_init(void) BUILD_BUG_SQE_ELEM(8, __u64, off); BUILD_BUG_SQE_ELEM(8, __u64, addr2); BUILD_BUG_SQE_ELEM(16, __u64, addr); + BUILD_BUG_SQE_ELEM(16, __u64, splice_off_in); BUILD_BUG_SQE_ELEM(24, __u32, len); BUILD_BUG_SQE_ELEM(28, __kernel_rwf_t, rw_flags); BUILD_BUG_SQE_ELEM(28, /* compat */ int, rw_flags); @@ -7210,9 +7317,11 @@ static int __init io_uring_init(void) BUILD_BUG_SQE_ELEM(28, __u32, open_flags); BUILD_BUG_SQE_ELEM(28, __u32, statx_flags); BUILD_BUG_SQE_ELEM(28, __u32, fadvise_advice); + BUILD_BUG_SQE_ELEM(28, __u32, splice_flags); BUILD_BUG_SQE_ELEM(32, __u64, user_data); BUILD_BUG_SQE_ELEM(40, __u16, buf_index); BUILD_BUG_SQE_ELEM(42, __u16, personality); + BUILD_BUG_SQE_ELEM(44, __s32, splice_fd_in);
BUILD_BUG_ON(ARRAY_SIZE(io_op_defs) != IORING_OP_LAST); req_cachep = KMEM_CACHE(io_kiocb, SLAB_HWCACHE_ALIGN | SLAB_PANIC); diff --git a/include/uapi/linux/io_uring.h b/include/uapi/linux/io_uring.h index 90fed30a38b7..6c607e42db68 100644 --- a/include/uapi/linux/io_uring.h +++ b/include/uapi/linux/io_uring.h @@ -23,7 +23,10 @@ struct io_uring_sqe { __u64 off; /* offset into file */ __u64 addr2; }; - __u64 addr; /* pointer to buffer or iovecs */ + union { + __u64 addr; /* pointer to buffer or iovecs */ + __u64 splice_off_in; + }; __u32 len; /* buffer size or number of iovecs */ union { __kernel_rwf_t rw_flags; @@ -37,6 +40,7 @@ struct io_uring_sqe { __u32 open_flags; __u32 statx_flags; __u32 fadvise_advice; + __u32 splice_flags; }; __u64 user_data; /* data to be passed back at completion time */ union { @@ -45,6 +49,7 @@ struct io_uring_sqe { __u16 buf_index; /* personality to use, if used */ __u16 personality; + __s32 splice_fd_in; }; __u64 __pad2[3]; }; @@ -112,6 +117,7 @@ enum { IORING_OP_SEND, IORING_OP_RECV, IORING_OP_EPOLL_CTL, + IORING_OP_SPLICE,
/* this goes last, obviously */ IORING_OP_LAST, @@ -127,6 +133,12 @@ enum { */ #define IORING_TIMEOUT_ABS (1U << 0)
+/* + * sqe->splice_flags + * extends splice(2) flags + */ +#define SPLICE_F_FD_IN_FIXED (1U << 31) /* the last bit of __u32 */ + /* * IO completion data structure (Completion Queue Entry) */
From: Pavel Begunkov asml.silence@gmail.com
mainline inclusion from mainline-5.7-rc1 commit b0a20349f212dc725f5ddfd060e426fe6181d9c5 category: feature bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27 CVE: NA ---------------------------
Deduplicate the call to io_cqring_fill_event(), plain and easy.
Signed-off-by: Pavel Begunkov asml.silence@gmail.com Signed-off-by: Jens Axboe axboe@kernel.dk Signed-off-by: yangerkun yangerkun@huawei.com Reviewed-by: zhangyi (F) yi.zhang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- fs/io_uring.c | 5 +---- 1 file changed, 1 insertion(+), 4 deletions(-)
diff --git a/fs/io_uring.c b/fs/io_uring.c index cdfcc578fe6b..40f6e95b471a 100644 --- a/fs/io_uring.c +++ b/fs/io_uring.c @@ -3593,10 +3593,7 @@ static void io_poll_complete(struct io_kiocb *req, __poll_t mask, int error) struct io_ring_ctx *ctx = req->ctx;
req->poll.done = true; - if (error) - io_cqring_fill_event(req, error); - else - io_cqring_fill_event(req, mangle_poll(mask)); + io_cqring_fill_event(req, error ? error : mangle_poll(mask)); io_commit_cqring(ctx); }
From: Pavel Begunkov asml.silence@gmail.com
mainline inclusion from mainline-5.7-rc1 commit 02d27d895323c4baa3234e4bed015eb3a196e1dd category: feature bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27 CVE: NA ---------------------------
io_recvmsg() and io_sendmsg() duplicate the nonblocking -EAGAIN finalising code, so add a helper for it.
Signed-off-by: Pavel Begunkov asml.silence@gmail.com Signed-off-by: Jens Axboe axboe@kernel.dk Signed-off-by: yangerkun yangerkun@huawei.com Reviewed-by: zhangyi (F) yi.zhang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- fs/io_uring.c | 43 +++++++++++++++++++------------------------ 1 file changed, 19 insertions(+), 24 deletions(-)
diff --git a/fs/io_uring.c b/fs/io_uring.c index 40f6e95b471a..7891229e1b5c 100644 --- a/fs/io_uring.c +++ b/fs/io_uring.c @@ -3044,6 +3044,21 @@ static int io_sync_file_range(struct io_kiocb *req, struct io_kiocb **nxt, return 0; }
+static int io_setup_async_msg(struct io_kiocb *req, + struct io_async_msghdr *kmsg) +{ + if (req->io) + return -EAGAIN; + if (io_alloc_async_ctx(req)) { + if (kmsg->iov != kmsg->fast_iov) + kfree(kmsg->iov); + return -ENOMEM; + } + req->flags |= REQ_F_NEED_CLEANUP; + memcpy(&req->io->msg, kmsg, sizeof(*kmsg)); + return -EAGAIN; +} + static int io_sendmsg_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe) { #if defined(CONFIG_NET) @@ -3120,18 +3135,8 @@ static int io_sendmsg(struct io_kiocb *req, struct io_kiocb **nxt, flags |= MSG_DONTWAIT;
ret = __sys_sendmsg_sock(sock, &kmsg->msg, flags); - if (force_nonblock && ret == -EAGAIN) { - if (req->io) - return -EAGAIN; - if (io_alloc_async_ctx(req)) { - if (kmsg->iov != kmsg->fast_iov) - kfree(kmsg->iov); - return -ENOMEM; - } - req->flags |= REQ_F_NEED_CLEANUP; - memcpy(&req->io->msg, &io.msg, sizeof(io.msg)); - return -EAGAIN; - } + if (force_nonblock && ret == -EAGAIN) + return io_setup_async_msg(req, kmsg); if (ret == -ERESTARTSYS) ret = -EINTR; } @@ -3279,18 +3284,8 @@ static int io_recvmsg(struct io_kiocb *req, struct io_kiocb **nxt,
ret = __sys_recvmsg_sock(sock, &kmsg->msg, req->sr_msg.msg, kmsg->uaddr, flags); - if (force_nonblock && ret == -EAGAIN) { - if (req->io) - return -EAGAIN; - if (io_alloc_async_ctx(req)) { - if (kmsg->iov != kmsg->fast_iov) - kfree(kmsg->iov); - return -ENOMEM; - } - memcpy(&req->io->msg, &io.msg, sizeof(io.msg)); - req->flags |= REQ_F_NEED_CLEANUP; - return -EAGAIN; - } + if (force_nonblock && ret == -EAGAIN) + return io_setup_async_msg(req, kmsg); if (ret == -ERESTARTSYS) ret = -EINTR; }
From: Pavel Begunkov asml.silence@gmail.com
mainline inclusion from mainline-5.7-rc1 commit e85530ddda4f08d4f9ed6506d4a1f42e086e3b21 category: feature bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27 CVE: NA ---------------------------
IO_WQ_WORK_HAS_MM is set but never used, remove it.
Signed-off-by: Pavel Begunkov asml.silence@gmail.com Signed-off-by: Jens Axboe axboe@kernel.dk Signed-off-by: yangerkun yangerkun@huawei.com Reviewed-by: zhangyi (F) yi.zhang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- fs/io-wq.c | 2 -- fs/io-wq.h | 1 - 2 files changed, 3 deletions(-)
diff --git a/fs/io-wq.c b/fs/io-wq.c index 587815b8b088..90767828ad01 100644 --- a/fs/io-wq.c +++ b/fs/io-wq.c @@ -500,8 +500,6 @@ static void io_worker_handle_work(struct io_worker *worker) */ if (test_bit(IO_WQ_BIT_CANCEL, &wq->state)) work->flags |= IO_WQ_WORK_CANCEL; - if (worker->mm) - work->flags |= IO_WQ_WORK_HAS_MM;
if (wq->get_work) { put_work = work; diff --git a/fs/io-wq.h b/fs/io-wq.h index e5e15f2c93ec..d500d88ab84e 100644 --- a/fs/io-wq.h +++ b/fs/io-wq.h @@ -5,7 +5,6 @@ struct io_wq;
enum { IO_WQ_WORK_CANCEL = 1, - IO_WQ_WORK_HAS_MM = 2, IO_WQ_WORK_HASHED = 4, IO_WQ_WORK_UNBOUND = 32, IO_WQ_WORK_CB = 128,
From: Pavel Begunkov asml.silence@gmail.com
mainline inclusion from mainline-5.7-rc1 commit 5eae8619907a1389dbd1b4a1049caf52782c0916 category: feature bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27 CVE: NA ---------------------------
IO_WQ_WORK_CB is used only for linked timeouts, which will be armed before the work setup (i.e. mm, override creds, etc). The setup shouldn't take long, so it's ok to arm it a bit later and get rid of IO_WQ_WORK_CB.
Make io-wq call work->func() only once; callbacks will handle the rest, i.e. the linked timeout handler will do the actual issue. And as a bonus, it removes an extra indirect call.
Signed-off-by: Pavel Begunkov asml.silence@gmail.com Signed-off-by: Jens Axboe axboe@kernel.dk Signed-off-by: yangerkun yangerkun@huawei.com Reviewed-by: zhangyi (F) yi.zhang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- fs/io-wq.c | 3 --- fs/io-wq.h | 1 - fs/io_uring.c | 3 +-- 3 files changed, 1 insertion(+), 6 deletions(-)
diff --git a/fs/io-wq.c b/fs/io-wq.c index 90767828ad01..f3894022d467 100644 --- a/fs/io-wq.c +++ b/fs/io-wq.c @@ -480,9 +480,6 @@ static void io_worker_handle_work(struct io_worker *worker) worker->cur_work = work; spin_unlock_irq(&worker->lock);
- if (work->flags & IO_WQ_WORK_CB) - work->func(&work); - if (work->files && current->files != work->files) { task_lock(current); current->files = work->files; diff --git a/fs/io-wq.h b/fs/io-wq.h index d500d88ab84e..a0978d6958f0 100644 --- a/fs/io-wq.h +++ b/fs/io-wq.h @@ -7,7 +7,6 @@ enum { IO_WQ_WORK_CANCEL = 1, IO_WQ_WORK_HASHED = 4, IO_WQ_WORK_UNBOUND = 32, - IO_WQ_WORK_CB = 128, IO_WQ_WORK_NO_CANCEL = 256, IO_WQ_WORK_CONCURRENT = 512,
diff --git a/fs/io_uring.c b/fs/io_uring.c index 7891229e1b5c..a3f93c6ebe4d 100644 --- a/fs/io_uring.c +++ b/fs/io_uring.c @@ -2546,7 +2546,7 @@ static void io_link_work_cb(struct io_wq_work **workptr) struct io_kiocb *link = work->data;
io_queue_linked_timeout(link); - work->func = io_wq_submit_work; + io_wq_submit_work(workptr); }
static void io_wq_assign_next(struct io_wq_work **workptr, struct io_kiocb *nxt) @@ -2556,7 +2556,6 @@ static void io_wq_assign_next(struct io_wq_work **workptr, struct io_kiocb *nxt) io_prep_next_work(nxt, &link); *workptr = &nxt->work; if (link) { - nxt->work.flags |= IO_WQ_WORK_CB; nxt->work.func = io_link_work_cb; nxt->work.data = link; }
From: Pavel Begunkov asml.silence@gmail.com
mainline inclusion from mainline-5.7-rc1 commit 3684f24653534c71c7dc9f44d7281a838f4e4979 category: feature bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27 CVE: NA ---------------------------
@hash_map is unsigned long, but BIT_ULL() is used for manipulations. BIT() is a better match as it returns exactly an unsigned long value.
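For reference, the two macros differ only in the width of the shifted constant (roughly, from include/linux/bits.h):

#define BIT(nr)		(1UL << (nr))
#define BIT_ULL(nr)	(1ULL << (nr))

On 32-bit targets unsigned long is 32 bits while unsigned long long is 64, so BIT() matches the type of @hash_map exactly and avoids an implicit conversion when the result is masked into it.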
Signed-off-by: Pavel Begunkov asml.silence@gmail.com Signed-off-by: Jens Axboe axboe@kernel.dk Signed-off-by: yangerkun yangerkun@huawei.com Reviewed-by: zhangyi (F) yi.zhang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- fs/io-wq.c | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-)
diff --git a/fs/io-wq.c b/fs/io-wq.c index f3894022d467..0ca2b17c82f9 100644 --- a/fs/io-wq.c +++ b/fs/io-wq.c @@ -394,8 +394,8 @@ static struct io_wq_work *io_get_next_work(struct io_wqe *wqe, unsigned *hash)
/* hashed, can run if not already running */ *hash = work->flags >> IO_WQ_HASH_SHIFT; - if (!(wqe->hash_map & BIT_ULL(*hash))) { - wqe->hash_map |= BIT_ULL(*hash); + if (!(wqe->hash_map & BIT(*hash))) { + wqe->hash_map |= BIT(*hash); wq_node_del(&wqe->work_list, node, prev); return work; } @@ -513,7 +513,7 @@ static void io_worker_handle_work(struct io_worker *worker) spin_lock_irq(&wqe->lock);
if (hash != -1U) { - wqe->hash_map &= ~BIT_ULL(hash); + wqe->hash_map &= ~BIT(hash); wqe->flags &= ~IO_WQE_FLAG_STALLED; } if (work && work != old_work) {
From: Oleg Nesterov oleg@redhat.com
mainline inclusion from mainline-5.7-rc1 commit 6fb614920b38bbf3c1c7fcd944c6d9b5d746103d category: feature bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27 CVE: NA ---------------------------
As Peter pointed out, task_work() can avoid ->pi_lock and cmpxchg() if task->task_works == NULL && !PF_EXITING.
And in fact the only reason why task_work_run() needs ->pi_lock is the possible race with task_work_cancel(); we can optimize this code and make the locking clearer.
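The idiom in the hunk below is an empty lock/unlock pair used purely as a synchronization point; a condensed sketch of the shape (not the complete function):

	work = READ_ONCE(task->task_works);
	/* ... cmpxchg() detaches the list without taking ->pi_lock ... */

	/*
	 * Wait out a racing task_work_cancel(): it runs under ->pi_lock,
	 * so acquiring and releasing the lock here guarantees any canceller
	 * that saw the old list head is done before we walk the entries.
	 */
	raw_spin_lock_irq(&task->pi_lock);
	raw_spin_unlock_irq(&task->pi_lock);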
Signed-off-by: Oleg Nesterov oleg@redhat.com Signed-off-by: Jens Axboe axboe@kernel.dk Signed-off-by: yangerkun yangerkun@huawei.com Reviewed-by: zhangyi (F) yi.zhang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- kernel/task_work.c | 18 ++++++++++++++---- 1 file changed, 14 insertions(+), 4 deletions(-)
diff --git a/kernel/task_work.c b/kernel/task_work.c index 0fef395662a6..825f28259a19 100644 --- a/kernel/task_work.c +++ b/kernel/task_work.c @@ -97,16 +97,26 @@ void task_work_run(void) * work->func() can do task_work_add(), do not set * work_exited unless the list is empty. */ - raw_spin_lock_irq(&task->pi_lock); do { + head = NULL; work = READ_ONCE(task->task_works); - head = !work && (task->flags & PF_EXITING) ? - &work_exited : NULL; + if (!work) { + if (task->flags & PF_EXITING) + head = &work_exited; + else + break; + } } while (cmpxchg(&task->task_works, work, head) != work); - raw_spin_unlock_irq(&task->pi_lock);
if (!work) break; + /* + * Synchronize with task_work_cancel(). It can not remove + * the first entry == work, cmpxchg(task_works) must fail. + * But it can remove another entry from the ->next list. + */ + raw_spin_lock_irq(&task->pi_lock); + raw_spin_unlock_irq(&task->pi_lock);
do { next = work->next;
From: Jens Axboe axboe@kernel.dk
mainline inclusion from mainline-5.7-rc1 commit c2f2eb7d2c1cdc37fa9633bae96f381d33ee7a14 category: feature bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27 CVE: NA ---------------------------
Store the io_kiocb in the private field instead of the poll entry, this is in preparation for allowing multiple waitqueues.
No functional changes in this patch.
Signed-off-by: Jens Axboe axboe@kernel.dk Signed-off-by: yangerkun yangerkun@huawei.com Reviewed-by: zhangyi (F) yi.zhang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- fs/io_uring.c | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-)
diff --git a/fs/io_uring.c b/fs/io_uring.c index a3f93c6ebe4d..f845a9a55f02 100644 --- a/fs/io_uring.c +++ b/fs/io_uring.c @@ -3682,8 +3682,8 @@ static void io_poll_trigger_evfd(struct io_wq_work **workptr) static int io_poll_wake(struct wait_queue_entry *wait, unsigned mode, int sync, void *key) { - struct io_poll_iocb *poll = wait->private; - struct io_kiocb *req = container_of(poll, struct io_kiocb, poll); + struct io_kiocb *req = wait->private; + struct io_poll_iocb *poll = &req->poll; struct io_ring_ctx *ctx = req->ctx; __poll_t mask = key_to_poll(key);
@@ -3806,7 +3806,7 @@ static int io_poll_add(struct io_kiocb *req, struct io_kiocb **nxt) /* initialized the list so that we can do list_empty checks */ INIT_LIST_HEAD(&poll->wait.entry); init_waitqueue_func_entry(&poll->wait, io_poll_wake); - poll->wait.private = poll; + poll->wait.private = req;
INIT_LIST_HEAD(&req->list);
From: Jens Axboe axboe@kernel.dk
mainline inclusion from mainline-5.7-rc1 commit b41e98524e424d104aa7851d54fd65820759875a category: feature bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27 CVE: NA ---------------------------
For poll requests, it's not uncommon to link a read (or write) after the poll to execute immediately after the file is marked as ready. Since the poll completion is called inside the waitqueue wake up handler, we have to punt that linked request to async context. This slows down the processing, and actually means it's faster to not use a link for this use case.
We also run into problems if the completion_lock is contended, as we're doing a different lock ordering than the issue side is. Hence we have to do trylock for completion, and if that fails, go async. Poll removal needs to go async as well, for the same reason.
eventfd notification needs special casing as well, to avoid stack-blowing recursion or deadlocks.
These are all deficiencies that were inherited from the aio poll implementation, but I think we can do better. When a poll completes, simply queue it up in the task poll list. When the task completes the list, we can run dependent links inline as well. This means we never have to go async, and we can remove a bunch of code associated with that, and optimizations to try and make that run faster. The diffstat speaks for itself.
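As a concrete example of the poll->link pattern this speeds up, a userspace submission could look like the following (a hedged liburing sketch; submit_poll_then_read() is an illustrative helper, assuming a liburing that provides io_uring_prep_read()):

#include <liburing.h>
#include <poll.h>

static void submit_poll_then_read(struct io_uring *ring, int fd,
				  void *buf, unsigned int len)
{
	struct io_uring_sqe *sqe;

	/* arm a poll for readability ... */
	sqe = io_uring_get_sqe(ring);
	io_uring_prep_poll_add(sqe, fd, POLLIN);
	sqe->flags |= IOSQE_IO_LINK;

	/* ... and link the read so it runs once the fd is ready */
	sqe = io_uring_get_sqe(ring);
	io_uring_prep_read(sqe, fd, buf, len, 0);

	io_uring_submit(ring);
}

With this patch the linked read is issued inline from task_work when the poll fires, instead of being punted to an async worker.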
Signed-off-by: Jens Axboe axboe@kernel.dk Signed-off-by: yangerkun yangerkun@huawei.com Reviewed-by: zhangyi (F) yi.zhang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- fs/io_uring.c | 218 ++++++++++++++++++-------------------------------- 1 file changed, 76 insertions(+), 142 deletions(-)
diff --git a/fs/io_uring.c b/fs/io_uring.c index f845a9a55f02..df66bf2ea600 100644 --- a/fs/io_uring.c +++ b/fs/io_uring.c @@ -77,6 +77,7 @@ #include <linux/eventpoll.h> #include <linux/fs_struct.h> #include <linux/splice.h> +#include <linux/task_work.h>
#define CREATE_TRACE_POINTS #include <trace/events/io_uring.h> @@ -291,7 +292,6 @@ struct io_ring_ctx {
struct { spinlock_t completion_lock; - struct llist_head poll_llist;
/* * ->poll_list is protected by the ctx->uring_lock for @@ -561,10 +561,6 @@ struct io_kiocb { };
struct io_async_ctx *io; - /* - * llist_node is only used for poll deferred completions - */ - struct llist_node llist_node; bool needs_fixed_file; u8 opcode;
@@ -582,7 +578,17 @@ struct io_kiocb {
struct list_head inflight_entry;
- struct io_wq_work work; + union { + /* + * Only commands that never go async can use the below fields, + * obviously. Right now only IORING_OP_POLL_ADD uses them. + */ + struct { + struct task_struct *task; + struct callback_head task_work; + }; + struct io_wq_work work; + }; };
#define IO_PLUG_THRESHOLD 2 @@ -771,10 +777,10 @@ static int __io_sqe_files_update(struct io_ring_ctx *ctx, static int io_grab_files(struct io_kiocb *req); static void io_ring_file_ref_flush(struct fixed_file_data *data); static void io_cleanup_req(struct io_kiocb *req); -static int io_file_get(struct io_submit_state *state, - struct io_kiocb *req, - int fd, struct file **out_file, - bool fixed); +static int io_file_get(struct io_submit_state *state, struct io_kiocb *req, + int fd, struct file **out_file, bool fixed); +static void __io_queue_sqe(struct io_kiocb *req, + const struct io_uring_sqe *sqe);
static struct kmem_cache *req_cachep;
@@ -844,7 +850,6 @@ static struct io_ring_ctx *io_ring_ctx_alloc(struct io_uring_params *p) mutex_init(&ctx->uring_lock); init_waitqueue_head(&ctx->wait); spin_lock_init(&ctx->completion_lock); - init_llist_head(&ctx->poll_llist); INIT_LIST_HEAD(&ctx->poll_list); INIT_LIST_HEAD(&ctx->defer_list); INIT_LIST_HEAD(&ctx->timeout_list); @@ -1078,24 +1083,19 @@ static inline bool io_should_trigger_evfd(struct io_ring_ctx *ctx) return false; if (!ctx->eventfd_async) return true; - return io_wq_current_is_worker() || in_interrupt(); + return io_wq_current_is_worker(); }
-static void __io_cqring_ev_posted(struct io_ring_ctx *ctx, bool trigger_ev) +static void io_cqring_ev_posted(struct io_ring_ctx *ctx) { if (waitqueue_active(&ctx->wait)) wake_up(&ctx->wait); if (waitqueue_active(&ctx->sqo_wait)) wake_up(&ctx->sqo_wait); - if (trigger_ev) + if (io_should_trigger_evfd(ctx)) eventfd_signal(ctx->cq_ev_fd, 1); }
-static void io_cqring_ev_posted(struct io_ring_ctx *ctx) -{ - __io_cqring_ev_posted(ctx, io_should_trigger_evfd(ctx)); -} - /* Returns true if there are no backlogged entries after the flush */ static bool io_cqring_overflow_flush(struct io_ring_ctx *ctx, bool force) { @@ -3500,18 +3500,27 @@ static int io_connect(struct io_kiocb *req, struct io_kiocb **nxt, #endif }
-static void io_poll_remove_one(struct io_kiocb *req) +static bool io_poll_remove_one(struct io_kiocb *req) { struct io_poll_iocb *poll = &req->poll; + bool do_complete = false;
spin_lock(&poll->head->lock); WRITE_ONCE(poll->canceled, true); if (!list_empty(&poll->wait.entry)) { list_del_init(&poll->wait.entry); - io_queue_async_work(req); + do_complete = true; } spin_unlock(&poll->head->lock); hash_del(&req->hash_node); + if (do_complete) { + io_cqring_fill_event(req, -ECANCELED); + io_commit_cqring(req->ctx); + req->flags |= REQ_F_COMP_LOCKED; + io_put_req(req); + } + + return do_complete; }
static void io_poll_remove_all(struct io_ring_ctx *ctx) @@ -3529,6 +3538,8 @@ static void io_poll_remove_all(struct io_ring_ctx *ctx) io_poll_remove_one(req); } spin_unlock_irq(&ctx->completion_lock); + + io_cqring_ev_posted(ctx); }
static int io_poll_cancel(struct io_ring_ctx *ctx, __u64 sqe_addr) @@ -3538,10 +3549,11 @@ static int io_poll_cancel(struct io_ring_ctx *ctx, __u64 sqe_addr)
list = &ctx->cancel_hash[hash_long(sqe_addr, ctx->cancel_hash_bits)]; hlist_for_each_entry(req, list, hash_node) { - if (sqe_addr == req->user_data) { - io_poll_remove_one(req); + if (sqe_addr != req->user_data) + continue; + if (io_poll_remove_one(req)) return 0; - } + return -EALREADY; }
return -ENOENT; @@ -3591,92 +3603,28 @@ static void io_poll_complete(struct io_kiocb *req, __poll_t mask, int error) io_commit_cqring(ctx); }
-static void io_poll_complete_work(struct io_wq_work **workptr) +static void io_poll_task_handler(struct io_kiocb *req, struct io_kiocb **nxt) { - struct io_wq_work *work = *workptr; - struct io_kiocb *req = container_of(work, struct io_kiocb, work); - struct io_poll_iocb *poll = &req->poll; - struct poll_table_struct pt = { ._key = poll->events }; struct io_ring_ctx *ctx = req->ctx; - struct io_kiocb *nxt = NULL; - __poll_t mask = 0; - int ret = 0;
- if (work->flags & IO_WQ_WORK_CANCEL) { - WRITE_ONCE(poll->canceled, true); - ret = -ECANCELED; - } else if (READ_ONCE(poll->canceled)) { - ret = -ECANCELED; - } - - if (ret != -ECANCELED) - mask = vfs_poll(poll->file, &pt) & poll->events; - - /* - * Note that ->ki_cancel callers also delete iocb from active_reqs after - * calling ->ki_cancel. We need the ctx_lock roundtrip here to - * synchronize with them. In the cancellation case the list_del_init - * itself is not actually needed, but harmless so we keep it in to - * avoid further branches in the fast path. - */ spin_lock_irq(&ctx->completion_lock); - if (!mask && ret != -ECANCELED) { - add_wait_queue(poll->head, &poll->wait); - spin_unlock_irq(&ctx->completion_lock); - return; - } hash_del(&req->hash_node); - io_poll_complete(req, mask, ret); - spin_unlock_irq(&ctx->completion_lock); - - io_cqring_ev_posted(ctx); - - if (ret < 0) - req_set_fail_links(req); - io_put_req_find_next(req, &nxt); - if (nxt) - io_wq_assign_next(workptr, nxt); -} - -static void __io_poll_flush(struct io_ring_ctx *ctx, struct llist_node *nodes) -{ - struct io_kiocb *req, *tmp; - struct req_batch rb; - - rb.to_free = rb.need_iter = 0; - spin_lock_irq(&ctx->completion_lock); - llist_for_each_entry_safe(req, tmp, nodes, llist_node) { - hash_del(&req->hash_node); - io_poll_complete(req, req->result, 0); - - if (refcount_dec_and_test(&req->refs) && - !io_req_multi_free(&rb, req)) { - req->flags |= REQ_F_COMP_LOCKED; - io_free_req(req); - } - } + io_poll_complete(req, req->result, 0); + req->flags |= REQ_F_COMP_LOCKED; + io_put_req_find_next(req, nxt); spin_unlock_irq(&ctx->completion_lock);
io_cqring_ev_posted(ctx); - io_free_req_many(ctx, &rb); -} - -static void io_poll_flush(struct io_wq_work **workptr) -{ - struct io_kiocb *req = container_of(*workptr, struct io_kiocb, work); - struct llist_node *nodes; - - nodes = llist_del_all(&req->ctx->poll_llist); - if (nodes) - __io_poll_flush(req->ctx, nodes); }
-static void io_poll_trigger_evfd(struct io_wq_work **workptr) +static void io_poll_task_func(struct callback_head *cb) { - struct io_kiocb *req = container_of(*workptr, struct io_kiocb, work); + struct io_kiocb *req = container_of(cb, struct io_kiocb, task_work); + struct io_kiocb *nxt = NULL;
- eventfd_signal(req->ctx->cq_ev_fd, 1); - io_put_req(req); + io_poll_task_handler(req, &nxt); + if (nxt) + __io_queue_sqe(nxt, NULL); }
static int io_poll_wake(struct wait_queue_entry *wait, unsigned mode, int sync, @@ -3684,8 +3632,8 @@ static int io_poll_wake(struct wait_queue_entry *wait, unsigned mode, int sync, { struct io_kiocb *req = wait->private; struct io_poll_iocb *poll = &req->poll; - struct io_ring_ctx *ctx = req->ctx; __poll_t mask = key_to_poll(key); + struct task_struct *tsk;
/* for instances that support it check for an event match first: */ if (mask && !(mask & poll->events)) @@ -3693,46 +3641,11 @@ static int io_poll_wake(struct wait_queue_entry *wait, unsigned mode, int sync,
list_del_init(&poll->wait.entry);
- /* - * Run completion inline if we can. We're using trylock here because - * we are violating the completion_lock -> poll wq lock ordering. - * If we have a link timeout we're going to need the completion_lock - * for finalizing the request, mark us as having grabbed that already. - */ - if (mask) { - unsigned long flags; - - if (llist_empty(&ctx->poll_llist) && - spin_trylock_irqsave(&ctx->completion_lock, flags)) { - bool trigger_ev; - - hash_del(&req->hash_node); - io_poll_complete(req, mask, 0); - - trigger_ev = io_should_trigger_evfd(ctx); - if (trigger_ev && eventfd_signal_count()) { - trigger_ev = false; - req->work.func = io_poll_trigger_evfd; - } else { - req->flags |= REQ_F_COMP_LOCKED; - io_put_req(req); - req = NULL; - } - spin_unlock_irqrestore(&ctx->completion_lock, flags); - __io_cqring_ev_posted(ctx, trigger_ev); - } else { - req->result = mask; - req->llist_node.next = NULL; - /* if the list wasn't empty, we're done */ - if (!llist_add(&req->llist_node, &ctx->poll_llist)) - req = NULL; - else - req->work.func = io_poll_flush; - } - } - if (req) - io_queue_async_work(req); - + tsk = req->task; + req->result = mask; + init_task_work(&req->task_work, io_poll_task_func); + task_work_add(tsk, &req->task_work, true); + wake_up_process(tsk); return 1; }
@@ -3780,6 +3693,9 @@ static int io_poll_add_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe
events = READ_ONCE(sqe->poll_events); poll->events = demangle_poll(events) | EPOLLERR | EPOLLHUP; + + /* task will wait for requests on exit, don't need a ref */ + req->task = current; return 0; }
@@ -3791,7 +3707,6 @@ static int io_poll_add(struct io_kiocb *req, struct io_kiocb **nxt) bool cancel = false; __poll_t mask;
- INIT_IO_WORK(&req->work, io_poll_complete_work); INIT_HLIST_NODE(&req->hash_node);
poll->head = NULL; @@ -5216,6 +5131,8 @@ static int io_sq_thread(void *data) if (!list_empty(&ctx->poll_list) || (!time_after(jiffies, timeout) && ret != -EBUSY && !percpu_ref_is_dying(&ctx->refs))) { + if (current->task_works) + task_work_run(); cond_resched(); continue; } @@ -5247,6 +5164,10 @@ static int io_sq_thread(void *data) finish_wait(&ctx->sqo_wait, &wait); break; } + if (current->task_works) { + task_work_run(); + continue; + } if (signal_pending(current)) flush_signals(current); schedule(); @@ -5266,6 +5187,9 @@ static int io_sq_thread(void *data) timeout = jiffies + ctx->sq_thread_idle; }
+ if (current->task_works) + task_work_run(); + set_fs(old_fs); if (cur_mm) { unuse_mm(cur_mm); @@ -5330,8 +5254,13 @@ static int io_cqring_wait(struct io_ring_ctx *ctx, int min_events, struct io_rings *rings = ctx->rings; int ret = 0;
- if (io_cqring_events(ctx, false) >= min_events) - return 0; + do { + if (io_cqring_events(ctx, false) >= min_events) + return 0; + if (!current->task_works) + break; + task_work_run(); + } while (1);
if (sig) { #ifdef CONFIG_COMPAT @@ -5351,6 +5280,8 @@ static int io_cqring_wait(struct io_ring_ctx *ctx, int min_events, do { prepare_to_wait_exclusive(&ctx->wait, &iowq.wq, TASK_INTERRUPTIBLE); + if (current->task_works) + task_work_run(); if (io_should_wake(&iowq, false)) break; schedule(); @@ -6677,6 +6608,9 @@ SYSCALL_DEFINE6(io_uring_enter, unsigned int, fd, u32, to_submit, int submitted = 0; struct fd f;
+ if (current->task_works) + task_work_run(); + if (flags & ~(IORING_ENTER_GETEVENTS | IORING_ENTER_SQ_WAKEUP)) return -EINVAL;
From: Jens Axboe axboe@kernel.dk
mainline inclusion from mainline-5.7-rc1 commit 8a72758c51f8a5501a0e01ea95069630edb9ca07 category: feature bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27 CVE: NA ---------------------------
Add a pollin/pollout field to the request table, and have commands that we can safely poll for properly marked.
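These bits are consumed when arming the internal poll handler in the follow-up patch; condensed, the mapping is:

	__poll_t mask = 0;

	if (def->pollin)
		mask |= POLLIN | POLLRDNORM;	/* wait for readability */
	if (def->pollout)
		mask |= POLLOUT | POLLWRNORM;	/* wait for writability */
	mask |= POLLERR | POLLPRI;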
Signed-off-by: Jens Axboe axboe@kernel.dk Signed-off-by: yangerkun yangerkun@huawei.com Reviewed-by: zhangyi (F) yi.zhang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- fs/io_uring.c | 15 +++++++++++++++ 1 file changed, 15 insertions(+)
diff --git a/fs/io_uring.c b/fs/io_uring.c index df66bf2ea600..0deaeb894892 100644 --- a/fs/io_uring.c +++ b/fs/io_uring.c @@ -632,6 +632,9 @@ struct io_op_def { unsigned file_table : 1; /* needs ->fs */ unsigned needs_fs : 1; + /* set if opcode supports polled "wait" */ + unsigned pollin : 1; + unsigned pollout : 1; };
static const struct io_op_def io_op_defs[] = { @@ -641,6 +644,7 @@ static const struct io_op_def io_op_defs[] = { .needs_mm = 1, .needs_file = 1, .unbound_nonreg_file = 1, + .pollin = 1, }, [IORING_OP_WRITEV] = { .async_ctx = 1, @@ -648,6 +652,7 @@ static const struct io_op_def io_op_defs[] = { .needs_file = 1, .hash_reg_file = 1, .unbound_nonreg_file = 1, + .pollout = 1, }, [IORING_OP_FSYNC] = { .needs_file = 1, @@ -655,11 +660,13 @@ static const struct io_op_def io_op_defs[] = { [IORING_OP_READ_FIXED] = { .needs_file = 1, .unbound_nonreg_file = 1, + .pollin = 1, }, [IORING_OP_WRITE_FIXED] = { .needs_file = 1, .hash_reg_file = 1, .unbound_nonreg_file = 1, + .pollout = 1, }, [IORING_OP_POLL_ADD] = { .needs_file = 1, @@ -675,6 +682,7 @@ static const struct io_op_def io_op_defs[] = { .needs_file = 1, .unbound_nonreg_file = 1, .needs_fs = 1, + .pollout = 1, }, [IORING_OP_RECVMSG] = { .async_ctx = 1, @@ -682,6 +690,7 @@ static const struct io_op_def io_op_defs[] = { .needs_file = 1, .unbound_nonreg_file = 1, .needs_fs = 1, + .pollin = 1, }, [IORING_OP_TIMEOUT] = { .async_ctx = 1, @@ -693,6 +702,7 @@ static const struct io_op_def io_op_defs[] = { .needs_file = 1, .unbound_nonreg_file = 1, .file_table = 1, + .pollin = 1, }, [IORING_OP_ASYNC_CANCEL] = {}, [IORING_OP_LINK_TIMEOUT] = { @@ -704,6 +714,7 @@ static const struct io_op_def io_op_defs[] = { .needs_mm = 1, .needs_file = 1, .unbound_nonreg_file = 1, + .pollout = 1, }, [IORING_OP_FALLOCATE] = { .needs_file = 1, @@ -732,11 +743,13 @@ static const struct io_op_def io_op_defs[] = { .needs_mm = 1, .needs_file = 1, .unbound_nonreg_file = 1, + .pollin = 1, }, [IORING_OP_WRITE] = { .needs_mm = 1, .needs_file = 1, .unbound_nonreg_file = 1, + .pollout = 1, }, [IORING_OP_FADVISE] = { .needs_file = 1, @@ -748,11 +761,13 @@ static const struct io_op_def io_op_defs[] = { .needs_mm = 1, .needs_file = 1, .unbound_nonreg_file = 1, + .pollout = 1, }, [IORING_OP_RECV] = { .needs_mm = 1, .needs_file = 1, .unbound_nonreg_file = 1, + .pollin = 1, }, [IORING_OP_EPOLL_CTL] = { .unbound_nonreg_file = 1,
From: Jens Axboe axboe@kernel.dk
mainline inclusion from mainline-5.7-rc1 commit d7718a9d25a61442da8ee8aeeff6a0097f0ccfd6 category: feature bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27 CVE: NA ---------------------------
Currently io_uring tries any request in a non-blocking manner, if it can, and then retries from a worker thread if we get -EAGAIN. Now that we have a new and fancy poll based retry backend, use that to retry requests if the file supports it.
This means that, for example, an IORING_OP_RECVMSG on a socket no longer requires an async thread to complete the IO. If we get -EAGAIN reading from the socket in a non-blocking manner, we arm a poll handler for notification on when the socket becomes readable. When it does, the pending read is executed directly by the task again, through the io_uring task work handlers. Not only is this faster and more efficient, it also means we're not generating potentially tons of async threads that just sit and block, waiting for the IO to complete.
The feature is marked with IORING_FEAT_FAST_POLL, meaning that async pollable IO is fast, and that poll<link>other_op is fast as well.
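Userspace can probe for the new behaviour via the feature flag; a minimal sketch using liburing's io_uring_queue_init_params():

#include <liburing.h>
#include <stdio.h>

int main(void)
{
	struct io_uring ring;
	struct io_uring_params p = { };

	if (io_uring_queue_init_params(8, &ring, &p) < 0)
		return 1;

	if (p.features & IORING_FEAT_FAST_POLL)
		printf("pollable IO is retried via internal poll, not async threads\n");

	io_uring_queue_exit(&ring);
	return 0;
}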
Signed-off-by: Jens Axboe axboe@kernel.dk Signed-off-by: yangerkun yangerkun@huawei.com Reviewed-by: zhangyi (F) yi.zhang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- fs/io_uring.c | 354 ++++++++++++++++++++++++-------- include/trace/events/io_uring.h | 103 ++++++++++ include/uapi/linux/io_uring.h | 1 + 3 files changed, 375 insertions(+), 83 deletions(-)
diff --git a/fs/io_uring.c b/fs/io_uring.c index 0deaeb894892..aba21e017cb9 100644 --- a/fs/io_uring.c +++ b/fs/io_uring.c @@ -490,6 +490,7 @@ enum { REQ_F_COMP_LOCKED_BIT, REQ_F_NEED_CLEANUP_BIT, REQ_F_OVERFLOW_BIT, + REQ_F_POLLED_BIT, };
enum { @@ -532,6 +533,13 @@ enum { REQ_F_NEED_CLEANUP = BIT(REQ_F_NEED_CLEANUP_BIT), /* in overflow list */ REQ_F_OVERFLOW = BIT(REQ_F_OVERFLOW_BIT), + /* already went through poll handler */ + REQ_F_POLLED = BIT(REQ_F_POLLED_BIT), +}; + +struct async_poll { + struct io_poll_iocb poll; + struct io_wq_work work; };
/* @@ -565,27 +573,29 @@ struct io_kiocb { u8 opcode;
struct io_ring_ctx *ctx; - union { - struct list_head list; - struct hlist_node hash_node; - }; - struct list_head link_list; + struct list_head list; unsigned int flags; refcount_t refs; + struct task_struct *task; u64 user_data; u32 result; u32 sequence;
+ struct list_head link_list; + struct list_head inflight_entry;
union { /* * Only commands that never go async can use the below fields, - * obviously. Right now only IORING_OP_POLL_ADD uses them. + * obviously. Right now only IORING_OP_POLL_ADD uses them, and + * async armed poll handlers for regular commands. The latter + * restore the work, if needed. */ struct { - struct task_struct *task; struct callback_head task_work; + struct hlist_node hash_node; + struct async_poll *apoll; }; struct io_wq_work work; }; @@ -3515,9 +3525,209 @@ static int io_connect(struct io_kiocb *req, struct io_kiocb **nxt, #endif }
-static bool io_poll_remove_one(struct io_kiocb *req) +struct io_poll_table { + struct poll_table_struct pt; + struct io_kiocb *req; + int error; +}; + +static void __io_queue_proc(struct io_poll_iocb *poll, struct io_poll_table *pt, + struct wait_queue_head *head) +{ + if (unlikely(poll->head)) { + pt->error = -EINVAL; + return; + } + + pt->error = 0; + poll->head = head; + add_wait_queue(head, &poll->wait); +} + +static void io_async_queue_proc(struct file *file, struct wait_queue_head *head, + struct poll_table_struct *p) +{ + struct io_poll_table *pt = container_of(p, struct io_poll_table, pt); + + __io_queue_proc(&pt->req->apoll->poll, pt, head); +} + +static int __io_async_wake(struct io_kiocb *req, struct io_poll_iocb *poll, + __poll_t mask, task_work_func_t func) +{ + struct task_struct *tsk; + + /* for instances that support it check for an event match first: */ + if (mask && !(mask & poll->events)) + return 0; + + trace_io_uring_task_add(req->ctx, req->opcode, req->user_data, mask); + + list_del_init(&poll->wait.entry); + + tsk = req->task; + req->result = mask; + init_task_work(&req->task_work, func); + /* + * If this fails, then the task is exiting. If that is the case, then + * the exit check will ultimately cancel these work items. Hence we + * don't need to check here and handle it specifically. + */ + task_work_add(tsk, &req->task_work, true); + wake_up_process(tsk); + return 1; +} + +static void io_async_task_func(struct callback_head *cb) +{ + struct io_kiocb *req = container_of(cb, struct io_kiocb, task_work); + struct async_poll *apoll = req->apoll; + struct io_ring_ctx *ctx = req->ctx; + + trace_io_uring_task_run(req->ctx, req->opcode, req->user_data); + + WARN_ON_ONCE(!list_empty(&req->apoll->poll.wait.entry)); + + if (hash_hashed(&req->hash_node)) { + spin_lock_irq(&ctx->completion_lock); + hash_del(&req->hash_node); + spin_unlock_irq(&ctx->completion_lock); + } + + /* restore ->work in case we need to retry again */ + memcpy(&req->work, &apoll->work, sizeof(req->work)); + + __set_current_state(TASK_RUNNING); + mutex_lock(&ctx->uring_lock); + __io_queue_sqe(req, NULL); + mutex_unlock(&ctx->uring_lock); + + kfree(apoll); +} + +static int io_async_wake(struct wait_queue_entry *wait, unsigned mode, int sync, + void *key) +{ + struct io_kiocb *req = wait->private; + struct io_poll_iocb *poll = &req->apoll->poll; + + trace_io_uring_poll_wake(req->ctx, req->opcode, req->user_data, + key_to_poll(key)); + + return __io_async_wake(req, poll, key_to_poll(key), io_async_task_func); +} + +static void io_poll_req_insert(struct io_kiocb *req) +{ + struct io_ring_ctx *ctx = req->ctx; + struct hlist_head *list; + + list = &ctx->cancel_hash[hash_long(req->user_data, ctx->cancel_hash_bits)]; + hlist_add_head(&req->hash_node, list); +} + +static __poll_t __io_arm_poll_handler(struct io_kiocb *req, + struct io_poll_iocb *poll, + struct io_poll_table *ipt, __poll_t mask, + wait_queue_func_t wake_func) + __acquires(&ctx->completion_lock) +{ + struct io_ring_ctx *ctx = req->ctx; + bool cancel = false; + + poll->file = req->file; + poll->head = NULL; + poll->done = poll->canceled = false; + poll->events = mask; + + ipt->pt._key = mask; + ipt->req = req; + ipt->error = -EINVAL; + + INIT_LIST_HEAD(&poll->wait.entry); + init_waitqueue_func_entry(&poll->wait, wake_func); + poll->wait.private = req; + + mask = vfs_poll(req->file, &ipt->pt) & poll->events; + + spin_lock_irq(&ctx->completion_lock); + if (likely(poll->head)) { + spin_lock(&poll->head->lock); + if 
(unlikely(list_empty(&poll->wait.entry))) { + if (ipt->error) + cancel = true; + ipt->error = 0; + mask = 0; + } + if (mask || ipt->error) + list_del_init(&poll->wait.entry); + else if (cancel) + WRITE_ONCE(poll->canceled, true); + else if (!poll->done) /* actually waiting for an event */ + io_poll_req_insert(req); + spin_unlock(&poll->head->lock); + } + + return mask; +} + +static bool io_arm_poll_handler(struct io_kiocb *req) +{ + const struct io_op_def *def = &io_op_defs[req->opcode]; + struct io_ring_ctx *ctx = req->ctx; + struct async_poll *apoll; + struct io_poll_table ipt; + __poll_t mask, ret; + + if (!req->file || !file_can_poll(req->file)) + return false; + if (req->flags & (REQ_F_MUST_PUNT | REQ_F_POLLED)) + return false; + if (!def->pollin && !def->pollout) + return false; + + apoll = kmalloc(sizeof(*apoll), GFP_ATOMIC); + if (unlikely(!apoll)) + return false; + + req->flags |= REQ_F_POLLED; + memcpy(&apoll->work, &req->work, sizeof(req->work)); + + /* + * Don't need a reference here, as we're adding it to the task + * task_works list. If the task exits, the list is pruned. + */ + req->task = current; + req->apoll = apoll; + INIT_HLIST_NODE(&req->hash_node); + + if (def->pollin) + mask = POLLIN | POLLRDNORM; + if (def->pollout) + mask |= POLLOUT | POLLWRNORM; + mask |= POLLERR | POLLPRI; + + ipt.pt._qproc = io_async_queue_proc; + + ret = __io_arm_poll_handler(req, &apoll->poll, &ipt, mask, + io_async_wake); + if (ret) { + ipt.error = 0; + apoll->poll.done = true; + spin_unlock_irq(&ctx->completion_lock); + memcpy(&req->work, &apoll->work, sizeof(req->work)); + kfree(apoll); + return false; + } + spin_unlock_irq(&ctx->completion_lock); + trace_io_uring_poll_arm(ctx, req->opcode, req->user_data, mask, + apoll->poll.events); + return true; +} + +static bool __io_poll_remove_one(struct io_kiocb *req, + struct io_poll_iocb *poll) { - struct io_poll_iocb *poll = &req->poll; bool do_complete = false;
spin_lock(&poll->head->lock); @@ -3527,7 +3737,24 @@ static bool io_poll_remove_one(struct io_kiocb *req) do_complete = true; } spin_unlock(&poll->head->lock); + return do_complete; +} + +static bool io_poll_remove_one(struct io_kiocb *req) +{ + bool do_complete; + + if (req->opcode == IORING_OP_POLL_ADD) { + do_complete = __io_poll_remove_one(req, &req->poll); + } else { + /* non-poll requests have submit ref still */ + do_complete = __io_poll_remove_one(req, &req->apoll->poll); + if (do_complete) + io_put_req(req); + } + hash_del(&req->hash_node); + if (do_complete) { io_cqring_fill_event(req, -ECANCELED); io_commit_cqring(req->ctx); @@ -3638,8 +3865,13 @@ static void io_poll_task_func(struct callback_head *cb) struct io_kiocb *nxt = NULL;
io_poll_task_handler(req, &nxt); - if (nxt) + if (nxt) { + struct io_ring_ctx *ctx = nxt->ctx; + + mutex_lock(&ctx->uring_lock); __io_queue_sqe(nxt, NULL); + mutex_unlock(&ctx->uring_lock); + } }
static int io_poll_wake(struct wait_queue_entry *wait, unsigned mode, int sync, @@ -3647,51 +3879,16 @@ static int io_poll_wake(struct wait_queue_entry *wait, unsigned mode, int sync, { struct io_kiocb *req = wait->private; struct io_poll_iocb *poll = &req->poll; - __poll_t mask = key_to_poll(key); - struct task_struct *tsk;
- /* for instances that support it check for an event match first: */ - if (mask && !(mask & poll->events)) - return 0; - - list_del_init(&poll->wait.entry); - - tsk = req->task; - req->result = mask; - init_task_work(&req->task_work, io_poll_task_func); - task_work_add(tsk, &req->task_work, true); - wake_up_process(tsk); - return 1; + return __io_async_wake(req, poll, key_to_poll(key), io_poll_task_func); }
-struct io_poll_table { - struct poll_table_struct pt; - struct io_kiocb *req; - int error; -}; - static void io_poll_queue_proc(struct file *file, struct wait_queue_head *head, struct poll_table_struct *p) { struct io_poll_table *pt = container_of(p, struct io_poll_table, pt);
- if (unlikely(pt->req->poll.head)) { - pt->error = -EINVAL; - return; - } - - pt->error = 0; - pt->req->poll.head = head; - add_wait_queue(head, &pt->req->poll.wait); -} - -static void io_poll_req_insert(struct io_kiocb *req) -{ - struct io_ring_ctx *ctx = req->ctx; - struct hlist_head *list; - - list = &ctx->cancel_hash[hash_long(req->user_data, ctx->cancel_hash_bits)]; - hlist_add_head(&req->hash_node, list); + __io_queue_proc(&pt->req->poll, pt, head); }
static int io_poll_add_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe) @@ -3709,7 +3906,10 @@ static int io_poll_add_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe events = READ_ONCE(sqe->poll_events); poll->events = demangle_poll(events) | EPOLLERR | EPOLLHUP;
- /* task will wait for requests on exit, don't need a ref */ + /* + * Don't need a reference here, as we're adding it to the task + * task_works list. If the task exits, the list is pruned. + */ req->task = current; return 0; } @@ -3719,46 +3919,15 @@ static int io_poll_add(struct io_kiocb *req, struct io_kiocb **nxt) struct io_poll_iocb *poll = &req->poll; struct io_ring_ctx *ctx = req->ctx; struct io_poll_table ipt; - bool cancel = false; __poll_t mask;
INIT_HLIST_NODE(&req->hash_node); - - poll->head = NULL; - poll->done = false; - poll->canceled = false; - - ipt.pt._qproc = io_poll_queue_proc; - ipt.pt._key = poll->events; - ipt.req = req; - ipt.error = -EINVAL; /* same as no support for IOCB_CMD_POLL */ - - /* initialized the list so that we can do list_empty checks */ - INIT_LIST_HEAD(&poll->wait.entry); - init_waitqueue_func_entry(&poll->wait, io_poll_wake); - poll->wait.private = req; - INIT_LIST_HEAD(&req->list); + ipt.pt._qproc = io_poll_queue_proc;
- mask = vfs_poll(poll->file, &ipt.pt) & poll->events; + mask = __io_arm_poll_handler(req, &req->poll, &ipt, poll->events, + io_poll_wake);
- spin_lock_irq(&ctx->completion_lock); - if (likely(poll->head)) { - spin_lock(&poll->head->lock); - if (unlikely(list_empty(&poll->wait.entry))) { - if (ipt.error) - cancel = true; - ipt.error = 0; - mask = 0; - } - if (mask || ipt.error) - list_del_init(&poll->wait.entry); - else if (cancel) - WRITE_ONCE(poll->canceled, true); - else if (!poll->done) /* actually waiting for an event */ - io_poll_req_insert(req); - spin_unlock(&poll->head->lock); - } if (mask) { /* no async, we'd stolen it */ ipt.error = 0; io_poll_complete(req, mask, 0); @@ -4694,6 +4863,9 @@ static struct io_kiocb *io_prep_linked_timeout(struct io_kiocb *req)
if (!(req->flags & REQ_F_LINK)) return NULL; + /* for polled retry, if flag is set, we already went through here */ + if (req->flags & REQ_F_POLLED) + return NULL;
nxt = list_first_entry_or_null(&req->link_list, struct io_kiocb, link_list); @@ -4731,6 +4903,11 @@ static void __io_queue_sqe(struct io_kiocb *req, const struct io_uring_sqe *sqe) */ if (ret == -EAGAIN && (!(req->flags & REQ_F_NOWAIT) || (req->flags & REQ_F_MUST_PUNT))) { + if (io_arm_poll_handler(req)) { + if (linked_timeout) + io_queue_linked_timeout(linked_timeout); + goto done_req; + } punt: if (io_op_defs[req->opcode].file_table) { ret = io_grab_files(req); @@ -6748,6 +6925,17 @@ static void __io_uring_show_fdinfo(struct io_ring_ctx *ctx, struct seq_file *m) seq_printf(m, "Personalities:\n"); idr_for_each(&ctx->personality_idr, io_uring_show_cred, m); } + seq_printf(m, "PollList:\n"); + spin_lock_irq(&ctx->completion_lock); + for (i = 0; i < (1U << ctx->cancel_hash_bits); i++) { + struct hlist_head *list = &ctx->cancel_hash[i]; + struct io_kiocb *req; + + hlist_for_each_entry(req, list, hash_node) + seq_printf(m, " op=%d, task_works=%d\n", req->opcode, + req->task->task_works != NULL); + } + spin_unlock_irq(&ctx->completion_lock); mutex_unlock(&ctx->uring_lock); }
@@ -6964,7 +7152,7 @@ static int io_uring_create(unsigned entries, struct io_uring_params *p)
p->features = IORING_FEAT_SINGLE_MMAP | IORING_FEAT_NODROP | IORING_FEAT_SUBMIT_STABLE | IORING_FEAT_RW_CUR_POS | - IORING_FEAT_CUR_PERSONALITY; + IORING_FEAT_CUR_PERSONALITY | IORING_FEAT_FAST_POLL; trace_io_uring_create(ret, ctx, p->sq_entries, p->cq_entries, p->flags); return ret; err: diff --git a/include/trace/events/io_uring.h b/include/trace/events/io_uring.h index b116de688a0e..be97b7fa0ac9 100644 --- a/include/trace/events/io_uring.h +++ b/include/trace/events/io_uring.h @@ -386,6 +386,109 @@ TRACE_EVENT(io_uring_submit_sqe, __entry->force_nonblock, __entry->sq_thread) );
+TRACE_EVENT(io_uring_poll_arm, + + TP_PROTO(void *ctx, u8 opcode, u64 user_data, int mask, int events), + + TP_ARGS(ctx, opcode, user_data, mask, events), + + TP_STRUCT__entry ( + __field( void *, ctx ) + __field( u8, opcode ) + __field( u64, user_data ) + __field( int, mask ) + __field( int, events ) + ), + + TP_fast_assign( + __entry->ctx = ctx; + __entry->opcode = opcode; + __entry->user_data = user_data; + __entry->mask = mask; + __entry->events = events; + ), + + TP_printk("ring %p, op %d, data 0x%llx, mask 0x%x, events 0x%x", + __entry->ctx, __entry->opcode, + (unsigned long long) __entry->user_data, + __entry->mask, __entry->events) +); + +TRACE_EVENT(io_uring_poll_wake, + + TP_PROTO(void *ctx, u8 opcode, u64 user_data, int mask), + + TP_ARGS(ctx, opcode, user_data, mask), + + TP_STRUCT__entry ( + __field( void *, ctx ) + __field( u8, opcode ) + __field( u64, user_data ) + __field( int, mask ) + ), + + TP_fast_assign( + __entry->ctx = ctx; + __entry->opcode = opcode; + __entry->user_data = user_data; + __entry->mask = mask; + ), + + TP_printk("ring %p, op %d, data 0x%llx, mask 0x%x", + __entry->ctx, __entry->opcode, + (unsigned long long) __entry->user_data, + __entry->mask) +); + +TRACE_EVENT(io_uring_task_add, + + TP_PROTO(void *ctx, u8 opcode, u64 user_data, int mask), + + TP_ARGS(ctx, opcode, user_data, mask), + + TP_STRUCT__entry ( + __field( void *, ctx ) + __field( u8, opcode ) + __field( u64, user_data ) + __field( int, mask ) + ), + + TP_fast_assign( + __entry->ctx = ctx; + __entry->opcode = opcode; + __entry->user_data = user_data; + __entry->mask = mask; + ), + + TP_printk("ring %p, op %d, data 0x%llx, mask %x", + __entry->ctx, __entry->opcode, + (unsigned long long) __entry->user_data, + __entry->mask) +); + +TRACE_EVENT(io_uring_task_run, + + TP_PROTO(void *ctx, u8 opcode, u64 user_data), + + TP_ARGS(ctx, opcode, user_data), + + TP_STRUCT__entry ( + __field( void *, ctx ) + __field( u8, opcode ) + __field( u64, user_data ) + ), + + TP_fast_assign( + __entry->ctx = ctx; + __entry->opcode = opcode; + __entry->user_data = user_data; + ), + + TP_printk("ring %p, op %d, data 0x%llx", + __entry->ctx, __entry->opcode, + (unsigned long long) __entry->user_data) +); + #endif /* _TRACE_IO_URING_H */
/* This part must be outside protection */ diff --git a/include/uapi/linux/io_uring.h b/include/uapi/linux/io_uring.h index 6c607e42db68..14b4f075068f 100644 --- a/include/uapi/linux/io_uring.h +++ b/include/uapi/linux/io_uring.h @@ -215,6 +215,7 @@ struct io_uring_params { #define IORING_FEAT_SUBMIT_STABLE (1U << 2) #define IORING_FEAT_RW_CUR_POS (1U << 3) #define IORING_FEAT_CUR_PERSONALITY (1U << 4) +#define IORING_FEAT_FAST_POLL (1U << 5)
/* * io_uring_register(2) opcodes and arguments
From: Pavel Begunkov asml.silence@gmail.com
mainline inclusion from mainline-5.7-rc1 commit 4bc4494ec7c97ee38e2aa3d1cd76e289c49ac083 category: feature bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27 CVE: NA ---------------------------
After __io_queue_sqe() ended up in io_queue_async_work(), it's already known that there is no @nxt req, so skip the check and return from the function.
Also, @nxt initialisation now can be done just before io_put_req_find_next(), as there is no jumping until it's checked.
Signed-off-by: Pavel Begunkov asml.silence@gmail.com Signed-off-by: Jens Axboe axboe@kernel.dk Signed-off-by: yangerkun yangerkun@huawei.com Reviewed-by: zhangyi (F) yi.zhang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- fs/io_uring.c | 10 +++++----- 1 file changed, 5 insertions(+), 5 deletions(-)
diff --git a/fs/io_uring.c b/fs/io_uring.c index aba21e017cb9..ab68201407a2 100644 --- a/fs/io_uring.c +++ b/fs/io_uring.c @@ -4879,7 +4879,7 @@ static struct io_kiocb *io_prep_linked_timeout(struct io_kiocb *req) static void __io_queue_sqe(struct io_kiocb *req, const struct io_uring_sqe *sqe) { struct io_kiocb *linked_timeout; - struct io_kiocb *nxt = NULL; + struct io_kiocb *nxt; const struct cred *old_creds = NULL; int ret;
@@ -4906,7 +4906,7 @@ static void __io_queue_sqe(struct io_kiocb *req, const struct io_uring_sqe *sqe) if (io_arm_poll_handler(req)) { if (linked_timeout) io_queue_linked_timeout(linked_timeout); - goto done_req; + goto exit; } punt: if (io_op_defs[req->opcode].file_table) { @@ -4920,10 +4920,11 @@ static void __io_queue_sqe(struct io_kiocb *req, const struct io_uring_sqe *sqe) * submit reference when the iocb is actually submitted. */ io_queue_async_work(req); - goto done_req; + goto exit; }
err: + nxt = NULL; /* drop submission reference */ io_put_req_find_next(req, &nxt);
@@ -4940,15 +4941,14 @@ static void __io_queue_sqe(struct io_kiocb *req, const struct io_uring_sqe *sqe) req_set_fail_links(req); io_put_req(req); } -done_req: if (nxt) { req = nxt; - nxt = NULL;
if (req->flags & REQ_F_FORCE_ASYNC) goto punt; goto again; } +exit: if (old_creds) revert_creds(old_creds); }
From: Pavel Begunkov asml.silence@gmail.com
mainline inclusion from mainline-5.7-rc1 commit 3b17cf5a58f2a38e23ee980b5dece717d0464fb7 category: feature bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27 CVE: NA ---------------------------
io-wq cares about the IO_WQ_WORK_UNBOUND flag only while enqueueing, so it's useless setting it for a next req of a link. Thus, remove it from io_prep_linked_timeout(), and inline the function.
Signed-off-by: Pavel Begunkov asml.silence@gmail.com Signed-off-by: Jens Axboe axboe@kernel.dk Signed-off-by: yangerkun yangerkun@huawei.com Reviewed-by: zhangyi (F) yi.zhang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- fs/io_uring.c | 13 +------------ 1 file changed, 1 insertion(+), 12 deletions(-)
diff --git a/fs/io_uring.c b/fs/io_uring.c index ab68201407a2..6b0b5d6ad145 100644 --- a/fs/io_uring.c +++ b/fs/io_uring.c @@ -995,17 +995,6 @@ static inline void io_req_work_drop_env(struct io_kiocb *req) } }
-static inline void io_prep_next_work(struct io_kiocb *req, - struct io_kiocb **link) -{ - const struct io_op_def *def = &io_op_defs[req->opcode]; - - if (!(req->flags & REQ_F_ISREG) && def->unbound_nonreg_file) - req->work.flags |= IO_WQ_WORK_UNBOUND; - - *link = io_prep_linked_timeout(req); -} - static inline bool io_prep_async_work(struct io_kiocb *req, struct io_kiocb **link) { @@ -2578,8 +2567,8 @@ static void io_wq_assign_next(struct io_wq_work **workptr, struct io_kiocb *nxt) { struct io_kiocb *link;
- io_prep_next_work(nxt, &link); *workptr = &nxt->work; + link = io_prep_linked_timeout(nxt); if (link) { nxt->work.func = io_link_work_cb; nxt->work.data = link;
From: Nathan Chancellor natechancellor@gmail.com
mainline inclusion from mainline-5.7-rc1 commit 8755d97a09fed0de206772bcad1838301293c4d8 category: feature bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27 CVE: NA ---------------------------
Clang warns:
fs/io_uring.c:4178:6: warning: variable 'mask' is used uninitialized whenever 'if' condition is false [-Wsometimes-uninitialized] if (def->pollin) ^~~~~~~~~~~ fs/io_uring.c:4182:2: note: uninitialized use occurs here mask |= POLLERR | POLLPRI; ^~~~ fs/io_uring.c:4178:2: note: remove the 'if' if its condition is always true if (def->pollin) ^~~~~~~~~~~~~~~~ fs/io_uring.c:4154:15: note: initialize the variable 'mask' to silence this warning __poll_t mask, ret; ^ = 0 1 warning generated.
io_op_defs has many definitions where pollin is not set, so mask indeed might be uninitialized. Initialize it to zero and change the next assignment to |=, so that if further masks are added in the future, the assignment won't need changing again.
Fixes: d7718a9d25a6 ("io_uring: use poll driven retry for files that support it") Link: https://github.com/ClangBuiltLinux/linux/issues/916 Signed-off-by: Nathan Chancellor natechancellor@gmail.com Signed-off-by: Jens Axboe axboe@kernel.dk Signed-off-by: yangerkun yangerkun@huawei.com Reviewed-by: zhangyi (F) yi.zhang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- fs/io_uring.c | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-)
diff --git a/fs/io_uring.c b/fs/io_uring.c index 6b0b5d6ad145..5a97d110602a 100644 --- a/fs/io_uring.c +++ b/fs/io_uring.c @@ -3690,8 +3690,9 @@ static bool io_arm_poll_handler(struct io_kiocb *req) req->apoll = apoll; INIT_HLIST_NODE(&req->hash_node);
+ mask = 0; if (def->pollin) - mask = POLLIN | POLLRDNORM; + mask |= POLLIN | POLLRDNORM; if (def->pollout) mask |= POLLOUT | POLLWRNORM; mask |= POLLERR | POLLPRI;
From: Pavel Begunkov asml.silence@gmail.com
mainline inclusion from mainline-5.7-rc1 commit a2100672f3b2afdd55ccc2e640d1a8bd99ff6338 category: feature bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27 CVE: NA ---------------------------
Don't abuse labels for plain and straightforward code.
Signed-off-by: Pavel Begunkov asml.silence@gmail.com Signed-off-by: Jens Axboe axboe@kernel.dk Signed-off-by: yangerkun yangerkun@huawei.com Reviewed-by: zhangyi (F) yi.zhang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- fs/io_uring.c | 21 ++++++++++----------- 1 file changed, 10 insertions(+), 11 deletions(-)
diff --git a/fs/io_uring.c b/fs/io_uring.c index 5a97d110602a..54acd816c7dd 100644 --- a/fs/io_uring.c +++ b/fs/io_uring.c @@ -2980,8 +2980,16 @@ static int io_close(struct io_kiocb *req, struct io_kiocb **nxt, return ret;
/* if the file has a flush method, be safe and punt to async */ - if (req->close.put_file->f_op->flush && !io_wq_current_is_worker()) - goto eagain; + if (req->close.put_file->f_op->flush && force_nonblock) { + req->work.func = io_close_finish; + /* + * Do manual async queue here to avoid grabbing files - we don't + * need the files, and it'll cause io_close_finish() to close + * the file again and cause a double CQE entry for this request + */ + io_queue_async_work(req); + return 0; + }
/* * No ->flush(), safely close from here and just punt the @@ -2989,15 +2997,6 @@ static int io_close(struct io_kiocb *req, struct io_kiocb **nxt, */ __io_close_finish(req, nxt); return 0; -eagain: - req->work.func = io_close_finish; - /* - * Do manual async queue here to avoid grabbing files - we don't - * need the files, and it'll cause io_close_finish() to close - * the file again and cause a double CQE entry for this request - */ - io_queue_async_work(req); - return 0; }
static int io_prep_sfr(struct io_kiocb *req, const struct io_uring_sqe *sqe)
From: Pavel Begunkov asml.silence@gmail.com
mainline inclusion from mainline-5.7-rc1 commit 594506fec5faec2b1ec82ad6fb0c8132512fc459 category: feature bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27 CVE: NA ---------------------------
The rule is simple: any async handler gets a submission ref and should put it at the end. Make them all follow it, and so be more consistent.
This is a preparation patch, and as io_wq_assign_next() currently won't ever work, this doesn't care to use io_put_req_find_next() instead of io_put_req().
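After this patch every *_finish() handler follows the same shape; a sketch with a hypothetical io_foo opcode (io_foo_finish() and __io_foo() are illustrative names, the helpers around them are the real ones):

static void io_foo_finish(struct io_wq_work **workptr)
{
	struct io_kiocb *req = container_of(*workptr, struct io_kiocb, work);
	struct io_kiocb *nxt = NULL;

	if (io_req_cancelled(req))	/* drops both refs via io_double_put_req() */
		return;
	__io_foo(req, &nxt);		/* blocking work; may produce a linked req */
	io_put_req(req);		/* drop submission reference */
	if (nxt)
		io_wq_assign_next(workptr, nxt);
}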
Signed-off-by: Pavel Begunkov asml.silence@gmail.com
refcount_inc_not_zero() -> refcount_inc() fix.
Signed-off-by: Jens Axboe axboe@kernel.dk Signed-off-by: yangerkun yangerkun@huawei.com Reviewed-by: zhangyi (F) yi.zhang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- fs/io_uring.c | 27 +++++++++++++-------------- 1 file changed, 13 insertions(+), 14 deletions(-)
diff --git a/fs/io_uring.c b/fs/io_uring.c index 54acd816c7dd..b56b3ff5e519 100644 --- a/fs/io_uring.c +++ b/fs/io_uring.c @@ -2547,7 +2547,7 @@ static bool io_req_cancelled(struct io_kiocb *req) if (req->work.flags & IO_WQ_WORK_CANCEL) { req_set_fail_links(req); io_cqring_add_event(req, -ECANCELED); - io_put_req(req); + io_double_put_req(req); return true; }
@@ -2597,6 +2597,7 @@ static void io_fsync_finish(struct io_wq_work **workptr) if (io_req_cancelled(req)) return; __io_fsync(req, &nxt); + io_put_req(req); /* drop submission reference */ if (nxt) io_wq_assign_next(workptr, nxt); } @@ -2606,7 +2607,6 @@ static int io_fsync(struct io_kiocb *req, struct io_kiocb **nxt, { /* fsync always requires a blocking context */ if (force_nonblock) { - io_put_req(req); req->work.func = io_fsync_finish; return -EAGAIN; } @@ -2618,9 +2618,6 @@ static void __io_fallocate(struct io_kiocb *req, struct io_kiocb **nxt) { int ret;
- if (io_req_cancelled(req)) - return; - ret = vfs_fallocate(req->file, req->sync.mode, req->sync.off, req->sync.len); if (ret < 0) @@ -2634,7 +2631,10 @@ static void io_fallocate_finish(struct io_wq_work **workptr) struct io_kiocb *req = container_of(*workptr, struct io_kiocb, work); struct io_kiocb *nxt = NULL;
+ if (io_req_cancelled(req)) + return; __io_fallocate(req, &nxt); + io_put_req(req); /* drop submission reference */ if (nxt) io_wq_assign_next(workptr, nxt); } @@ -2656,7 +2656,6 @@ static int io_fallocate(struct io_kiocb *req, struct io_kiocb **nxt, { /* fallocate always requiring blocking context */ if (force_nonblock) { - io_put_req(req); req->work.func = io_fallocate_finish; return -EAGAIN; } @@ -2965,6 +2964,7 @@ static void io_close_finish(struct io_wq_work **workptr)
/* not cancellable, don't do io_req_cancelled() */ __io_close_finish(req, &nxt); + io_put_req(req); /* drop submission reference */ if (nxt) io_wq_assign_next(workptr, nxt); } @@ -2981,6 +2981,9 @@ static int io_close(struct io_kiocb *req, struct io_kiocb **nxt,
/* if the file has a flush method, be safe and punt to async */ if (req->close.put_file->f_op->flush && force_nonblock) { + /* submission ref will be dropped, take it for async */ + refcount_inc(&req->refs); + req->work.func = io_close_finish; /* * Do manual async queue here to avoid grabbing files - we don't @@ -3038,6 +3041,7 @@ static void io_sync_file_range_finish(struct io_wq_work **workptr) if (io_req_cancelled(req)) return; __io_sync_file_range(req, &nxt); + io_put_req(req); /* put submission ref */ if (nxt) io_wq_assign_next(workptr, nxt); } @@ -3047,7 +3051,6 @@ static int io_sync_file_range(struct io_kiocb *req, struct io_kiocb **nxt, { /* sync_file_range always requires a blocking context */ if (force_nonblock) { - io_put_req(req); req->work.func = io_sync_file_range_finish; return -EAGAIN; } @@ -3416,11 +3419,10 @@ static void io_accept_finish(struct io_wq_work **workptr) struct io_kiocb *req = container_of(*workptr, struct io_kiocb, work); struct io_kiocb *nxt = NULL;
- io_put_req(req); - if (io_req_cancelled(req)) return; __io_accept(req, &nxt, false); + io_put_req(req); /* drop submission reference */ if (nxt) io_wq_assign_next(workptr, nxt); } @@ -4677,17 +4679,14 @@ static void io_wq_submit_work(struct io_wq_work **workptr) } while (1); }
- /* drop submission reference */ - io_put_req(req); - if (ret) { req_set_fail_links(req); io_cqring_add_event(req, ret); io_put_req(req); }
- /* if a dependent link is ready, pass it back */ - if (!ret && nxt) + io_put_req(req); /* drop submission reference */ + if (nxt) io_wq_assign_next(workptr, nxt); }
From: Pavel Begunkov asml.silence@gmail.com
mainline inclusion from mainline-5.7-rc1 commit 014db0073cc6a12e1f421b9231d6f3aa35735823 category: feature bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27 CVE: NA ---------------------------
There will be no use for @nxt in the handlers, and it doesn't work anyway, so purge it.
Signed-off-by: Pavel Begunkov asml.silence@gmail.com Signed-off-by: Jens Axboe axboe@kernel.dk
Conflicts: fs/io_uring.c [ignore openat2 for commit cebdb98617ae ("io_uring: add support for IORING_OP_OPENAT2") not merged] Signed-off-by: yangerkun yangerkun@huawei.com Reviewed-by: zhangyi (F) yi.zhang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- fs/io_uring.c | 197 +++++++++++++++++++++----------------------------- 1 file changed, 83 insertions(+), 114 deletions(-)
diff --git a/fs/io_uring.c b/fs/io_uring.c index b56b3ff5e519..270c1d0fe5e1 100644 --- a/fs/io_uring.c +++ b/fs/io_uring.c @@ -1801,17 +1801,6 @@ static void io_complete_rw(struct kiocb *kiocb, long res, long res2) io_put_req(req); }
-static struct io_kiocb *__io_complete_rw(struct kiocb *kiocb, long res) -{ - struct io_kiocb *req = container_of(kiocb, struct io_kiocb, rw.kiocb); - struct io_kiocb *nxt = NULL; - - io_complete_rw_common(kiocb, res); - io_put_req_find_next(req, &nxt); - - return nxt; -} - static void io_complete_rw_iopoll(struct kiocb *kiocb, long res, long res2) { struct io_kiocb *req = container_of(kiocb, struct io_kiocb, rw.kiocb); @@ -2006,14 +1995,14 @@ static inline void io_rw_done(struct kiocb *kiocb, ssize_t ret) } }
-static void kiocb_done(struct kiocb *kiocb, ssize_t ret, struct io_kiocb **nxt) +static void kiocb_done(struct kiocb *kiocb, ssize_t ret) { struct io_kiocb *req = container_of(kiocb, struct io_kiocb, rw.kiocb);
if (req->flags & REQ_F_CUR_POS) req->file->f_pos = kiocb->ki_pos; if (ret >= 0 && kiocb->ki_complete == io_complete_rw) - *nxt = __io_complete_rw(kiocb, ret); + io_complete_rw(kiocb, ret, 0); else io_rw_done(kiocb, ret); } @@ -2262,8 +2251,7 @@ static int io_read_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe, return 0; }
-static int io_read(struct io_kiocb *req, struct io_kiocb **nxt, - bool force_nonblock) +static int io_read(struct io_kiocb *req, bool force_nonblock) { struct iovec inline_vecs[UIO_FASTIOV], *iovec = inline_vecs; struct kiocb *kiocb = &req->rw.kiocb; @@ -2303,7 +2291,7 @@ static int io_read(struct io_kiocb *req, struct io_kiocb **nxt,
/* Catch -EAGAIN return for forced non-blocking submission */ if (!force_nonblock || ret2 != -EAGAIN) { - kiocb_done(kiocb, ret2, nxt); + kiocb_done(kiocb, ret2); } else { copy_iov: ret = io_setup_async_rw(req, io_size, iovec, @@ -2352,8 +2340,7 @@ static int io_write_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe, return 0; }
-static int io_write(struct io_kiocb *req, struct io_kiocb **nxt, - bool force_nonblock) +static int io_write(struct io_kiocb *req, bool force_nonblock) { struct iovec inline_vecs[UIO_FASTIOV], *iovec = inline_vecs; struct kiocb *kiocb = &req->rw.kiocb; @@ -2417,7 +2404,7 @@ static int io_write(struct io_kiocb *req, struct io_kiocb **nxt, if (ret2 == -EOPNOTSUPP && (kiocb->ki_flags & IOCB_NOWAIT)) ret2 = -EAGAIN; if (!force_nonblock || ret2 != -EAGAIN) { - kiocb_done(kiocb, ret2, nxt); + kiocb_done(kiocb, ret2); } else { copy_iov: ret = io_setup_async_rw(req, io_size, iovec, @@ -2474,8 +2461,7 @@ static bool io_splice_punt(struct file *file) return !(file->f_mode & O_NONBLOCK); }
-static int io_splice(struct io_kiocb *req, struct io_kiocb **nxt, - bool force_nonblock) +static int io_splice(struct io_kiocb *req, bool force_nonblock) { struct io_splice *sp = &req->splice; struct file *in = sp->file_in; @@ -2502,7 +2488,7 @@ static int io_splice(struct io_kiocb *req, struct io_kiocb **nxt, io_cqring_add_event(req, ret); if (ret != sp->len) req_set_fail_links(req); - io_put_req_find_next(req, nxt); + io_put_req(req); return 0; }
@@ -2575,7 +2561,7 @@ static void io_wq_assign_next(struct io_wq_work **workptr, struct io_kiocb *nxt) } }
-static void __io_fsync(struct io_kiocb *req, struct io_kiocb **nxt) +static void __io_fsync(struct io_kiocb *req) { loff_t end = req->sync.off + req->sync.len; int ret; @@ -2586,7 +2572,7 @@ static void __io_fsync(struct io_kiocb *req, struct io_kiocb **nxt) if (ret < 0) req_set_fail_links(req); io_cqring_add_event(req, ret); - io_put_req_find_next(req, nxt); + io_put_req(req); }
static void io_fsync_finish(struct io_wq_work **workptr) @@ -2596,25 +2582,24 @@ static void io_fsync_finish(struct io_wq_work **workptr)
if (io_req_cancelled(req)) return; - __io_fsync(req, &nxt); + __io_fsync(req); io_put_req(req); /* drop submission reference */ if (nxt) io_wq_assign_next(workptr, nxt); }
-static int io_fsync(struct io_kiocb *req, struct io_kiocb **nxt, - bool force_nonblock) +static int io_fsync(struct io_kiocb *req, bool force_nonblock) { /* fsync always requires a blocking context */ if (force_nonblock) { req->work.func = io_fsync_finish; return -EAGAIN; } - __io_fsync(req, nxt); + __io_fsync(req); return 0; }
-static void __io_fallocate(struct io_kiocb *req, struct io_kiocb **nxt) +static void __io_fallocate(struct io_kiocb *req) { int ret;
@@ -2623,7 +2608,7 @@ static void __io_fallocate(struct io_kiocb *req, struct io_kiocb **nxt) if (ret < 0) req_set_fail_links(req); io_cqring_add_event(req, ret); - io_put_req_find_next(req, nxt); + io_put_req(req); }
static void io_fallocate_finish(struct io_wq_work **workptr) @@ -2633,7 +2618,7 @@ static void io_fallocate_finish(struct io_wq_work **workptr)
if (io_req_cancelled(req)) return; - __io_fallocate(req, &nxt); + __io_fallocate(req); io_put_req(req); /* drop submission reference */ if (nxt) io_wq_assign_next(workptr, nxt); @@ -2651,8 +2636,7 @@ static int io_fallocate_prep(struct io_kiocb *req, return 0; }
-static int io_fallocate(struct io_kiocb *req, struct io_kiocb **nxt, - bool force_nonblock) +static int io_fallocate(struct io_kiocb *req, bool force_nonblock) { /* fallocate always requiring blocking context */ if (force_nonblock) { @@ -2660,7 +2644,7 @@ static int io_fallocate(struct io_kiocb *req, struct io_kiocb **nxt, return -EAGAIN; }
- __io_fallocate(req, nxt); + __io_fallocate(req); return 0; }
@@ -2693,8 +2677,7 @@ static int io_openat_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe) return 0; }
-static int io_openat(struct io_kiocb *req, struct io_kiocb **nxt, - bool force_nonblock) +static int io_openat(struct io_kiocb *req, bool force_nonblock) { struct open_flags op; struct file *file; @@ -2725,7 +2708,7 @@ static int io_openat(struct io_kiocb *req, struct io_kiocb **nxt, if (ret < 0) req_set_fail_links(req); io_cqring_add_event(req, ret); - io_put_req_find_next(req, nxt); + io_put_req(req); return 0; }
@@ -2754,8 +2737,7 @@ static int io_epoll_ctl_prep(struct io_kiocb *req, #endif }
-static int io_epoll_ctl(struct io_kiocb *req, struct io_kiocb **nxt, - bool force_nonblock) +static int io_epoll_ctl(struct io_kiocb *req, bool force_nonblock) { #if defined(CONFIG_EPOLL) struct io_epoll *ie = &req->epoll; @@ -2768,7 +2750,7 @@ static int io_epoll_ctl(struct io_kiocb *req, struct io_kiocb **nxt, if (ret < 0) req_set_fail_links(req); io_cqring_add_event(req, ret); - io_put_req_find_next(req, nxt); + io_put_req(req); return 0; #else return -EOPNOTSUPP; @@ -2790,8 +2772,7 @@ static int io_madvise_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe) #endif }
-static int io_madvise(struct io_kiocb *req, struct io_kiocb **nxt, - bool force_nonblock) +static int io_madvise(struct io_kiocb *req, bool force_nonblock) { #if defined(CONFIG_ADVISE_SYSCALLS) && defined(CONFIG_MMU) struct io_madvise *ma = &req->madvise; @@ -2804,7 +2785,7 @@ static int io_madvise(struct io_kiocb *req, struct io_kiocb **nxt, if (ret < 0) req_set_fail_links(req); io_cqring_add_event(req, ret); - io_put_req_find_next(req, nxt); + io_put_req(req); return 0; #else return -EOPNOTSUPP; @@ -2822,8 +2803,7 @@ static int io_fadvise_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe) return 0; }
-static int io_fadvise(struct io_kiocb *req, struct io_kiocb **nxt, - bool force_nonblock) +static int io_fadvise(struct io_kiocb *req, bool force_nonblock) { struct io_fadvise *fa = &req->fadvise; int ret; @@ -2843,7 +2823,7 @@ static int io_fadvise(struct io_kiocb *req, struct io_kiocb **nxt, if (ret < 0) req_set_fail_links(req); io_cqring_add_event(req, ret); - io_put_req_find_next(req, nxt); + io_put_req(req); return 0; }
@@ -2880,8 +2860,7 @@ static int io_statx_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe) return 0; }
-static int io_statx(struct io_kiocb *req, struct io_kiocb **nxt, - bool force_nonblock) +static int io_statx(struct io_kiocb *req, bool force_nonblock) { struct io_open *ctx = &req->open; unsigned lookup_flags; @@ -2918,7 +2897,7 @@ static int io_statx(struct io_kiocb *req, struct io_kiocb **nxt, if (ret < 0) req_set_fail_links(req); io_cqring_add_event(req, ret); - io_put_req_find_next(req, nxt); + io_put_req(req); return 0; }
@@ -2945,7 +2924,7 @@ static int io_close_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe) }
/* only called when __close_fd_get_file() is done */ -static void __io_close_finish(struct io_kiocb *req, struct io_kiocb **nxt) +static void __io_close_finish(struct io_kiocb *req) { int ret;
@@ -2954,7 +2933,7 @@ static void __io_close_finish(struct io_kiocb *req, struct io_kiocb **nxt) req_set_fail_links(req); io_cqring_add_event(req, ret); fput(req->close.put_file); - io_put_req_find_next(req, nxt); + io_put_req(req); }
static void io_close_finish(struct io_wq_work **workptr) @@ -2963,14 +2942,13 @@ static void io_close_finish(struct io_wq_work **workptr) struct io_kiocb *nxt = NULL;
/* not cancellable, don't do io_req_cancelled() */ - __io_close_finish(req, &nxt); + __io_close_finish(req); io_put_req(req); /* drop submission reference */ if (nxt) io_wq_assign_next(workptr, nxt); }
-static int io_close(struct io_kiocb *req, struct io_kiocb **nxt, - bool force_nonblock) +static int io_close(struct io_kiocb *req, bool force_nonblock) { int ret;
@@ -2998,7 +2976,7 @@ static int io_close(struct io_kiocb *req, struct io_kiocb **nxt, * No ->flush(), safely close from here and just punt the * fput() to async context. */ - __io_close_finish(req, nxt); + __io_close_finish(req); return 0; }
@@ -3020,7 +2998,7 @@ static int io_prep_sfr(struct io_kiocb *req, const struct io_uring_sqe *sqe) return 0; }
-static void __io_sync_file_range(struct io_kiocb *req, struct io_kiocb **nxt) +static void __io_sync_file_range(struct io_kiocb *req) { int ret;
@@ -3029,7 +3007,7 @@ static void __io_sync_file_range(struct io_kiocb *req, struct io_kiocb **nxt) if (ret < 0) req_set_fail_links(req); io_cqring_add_event(req, ret); - io_put_req_find_next(req, nxt); + io_put_req(req); }
@@ -3040,14 +3018,13 @@ static void io_sync_file_range_finish(struct io_wq_work **workptr)
if (io_req_cancelled(req)) return; - __io_sync_file_range(req, &nxt); + __io_sync_file_range(req); io_put_req(req); /* put submission ref */ if (nxt) io_wq_assign_next(workptr, nxt); }
-static int io_sync_file_range(struct io_kiocb *req, struct io_kiocb **nxt, - bool force_nonblock) +static int io_sync_file_range(struct io_kiocb *req, bool force_nonblock) { /* sync_file_range always requires a blocking context */ if (force_nonblock) { @@ -3055,7 +3032,7 @@ static int io_sync_file_range(struct io_kiocb *req, struct io_kiocb **nxt, return -EAGAIN; }
- __io_sync_file_range(req, nxt); + __io_sync_file_range(req); return 0; }
@@ -3107,8 +3084,7 @@ static int io_sendmsg_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe) #endif }
-static int io_sendmsg(struct io_kiocb *req, struct io_kiocb **nxt, - bool force_nonblock) +static int io_sendmsg(struct io_kiocb *req, bool force_nonblock) { #if defined(CONFIG_NET) struct io_async_msghdr *kmsg = NULL; @@ -3162,15 +3138,14 @@ static int io_sendmsg(struct io_kiocb *req, struct io_kiocb **nxt, io_cqring_add_event(req, ret); if (ret < 0) req_set_fail_links(req); - io_put_req_find_next(req, nxt); + io_put_req(req); return 0; #else return -EOPNOTSUPP; #endif }
-static int io_send(struct io_kiocb *req, struct io_kiocb **nxt, - bool force_nonblock) +static int io_send(struct io_kiocb *req, bool force_nonblock) { #if defined(CONFIG_NET) struct socket *sock; @@ -3213,7 +3188,7 @@ static int io_send(struct io_kiocb *req, struct io_kiocb **nxt, io_cqring_add_event(req, ret); if (ret < 0) req_set_fail_links(req); - io_put_req_find_next(req, nxt); + io_put_req(req); return 0; #else return -EOPNOTSUPP; @@ -3254,8 +3229,7 @@ static int io_recvmsg_prep(struct io_kiocb *req, #endif }
-static int io_recvmsg(struct io_kiocb *req, struct io_kiocb **nxt, - bool force_nonblock) +static int io_recvmsg(struct io_kiocb *req, bool force_nonblock) { #if defined(CONFIG_NET) struct io_async_msghdr *kmsg = NULL; @@ -3311,15 +3285,14 @@ static int io_recvmsg(struct io_kiocb *req, struct io_kiocb **nxt, io_cqring_add_event(req, ret); if (ret < 0) req_set_fail_links(req); - io_put_req_find_next(req, nxt); + io_put_req(req); return 0; #else return -EOPNOTSUPP; #endif }
-static int io_recv(struct io_kiocb *req, struct io_kiocb **nxt, - bool force_nonblock) +static int io_recv(struct io_kiocb *req, bool force_nonblock) { #if defined(CONFIG_NET) struct socket *sock; @@ -3363,7 +3336,7 @@ static int io_recv(struct io_kiocb *req, struct io_kiocb **nxt, io_cqring_add_event(req, ret); if (ret < 0) req_set_fail_links(req); - io_put_req_find_next(req, nxt); + io_put_req(req); return 0; #else return -EOPNOTSUPP; @@ -3392,8 +3365,7 @@ static int io_accept_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe) }
#if defined(CONFIG_NET) -static int __io_accept(struct io_kiocb *req, struct io_kiocb **nxt, - bool force_nonblock) +static int __io_accept(struct io_kiocb *req, bool force_nonblock) { struct io_accept *accept = &req->accept; unsigned file_flags; @@ -3410,7 +3382,7 @@ static int __io_accept(struct io_kiocb *req, struct io_kiocb **nxt, if (ret < 0) req_set_fail_links(req); io_cqring_add_event(req, ret); - io_put_req_find_next(req, nxt); + io_put_req(req); return 0; }
@@ -3421,20 +3393,19 @@ static void io_accept_finish(struct io_wq_work **workptr)
if (io_req_cancelled(req)) return; - __io_accept(req, &nxt, false); + __io_accept(req, false); io_put_req(req); /* drop submission reference */ if (nxt) io_wq_assign_next(workptr, nxt); } #endif
-static int io_accept(struct io_kiocb *req, struct io_kiocb **nxt, - bool force_nonblock) +static int io_accept(struct io_kiocb *req, bool force_nonblock) { #if defined(CONFIG_NET) int ret;
- ret = __io_accept(req, nxt, force_nonblock); + ret = __io_accept(req, force_nonblock); if (ret == -EAGAIN && force_nonblock) { req->work.func = io_accept_finish; return -EAGAIN; @@ -3469,8 +3440,7 @@ static int io_connect_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe) #endif }
-static int io_connect(struct io_kiocb *req, struct io_kiocb **nxt, - bool force_nonblock) +static int io_connect(struct io_kiocb *req, bool force_nonblock) { #if defined(CONFIG_NET) struct io_async_ctx __io, *io; @@ -3508,7 +3478,7 @@ static int io_connect(struct io_kiocb *req, struct io_kiocb **nxt, if (ret < 0) req_set_fail_links(req); io_cqring_add_event(req, ret); - io_put_req_find_next(req, nxt); + io_put_req(req); return 0; #else return -EOPNOTSUPP; @@ -3905,7 +3875,7 @@ static int io_poll_add_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe return 0; }
-static int io_poll_add(struct io_kiocb *req, struct io_kiocb **nxt) +static int io_poll_add(struct io_kiocb *req) { struct io_poll_iocb *poll = &req->poll; struct io_ring_ctx *ctx = req->ctx; @@ -3927,7 +3897,7 @@ static int io_poll_add(struct io_kiocb *req, struct io_kiocb **nxt)
if (mask) { io_cqring_ev_posted(ctx); - io_put_req_find_next(req, nxt); + io_put_req(req); } return ipt.error; } @@ -4176,7 +4146,7 @@ static int io_async_cancel_one(struct io_ring_ctx *ctx, void *sqe_addr)
static void io_async_find_and_cancel(struct io_ring_ctx *ctx, struct io_kiocb *req, __u64 sqe_addr, - struct io_kiocb **nxt, int success_ret) + int success_ret) { unsigned long flags; int ret; @@ -4202,7 +4172,7 @@ static void io_async_find_and_cancel(struct io_ring_ctx *ctx,
if (ret < 0) req_set_fail_links(req); - io_put_req_find_next(req, nxt); + io_put_req(req); }
static int io_async_cancel_prep(struct io_kiocb *req, @@ -4218,11 +4188,11 @@ static int io_async_cancel_prep(struct io_kiocb *req, return 0; }
-static int io_async_cancel(struct io_kiocb *req, struct io_kiocb **nxt) +static int io_async_cancel(struct io_kiocb *req) { struct io_ring_ctx *ctx = req->ctx;
- io_async_find_and_cancel(ctx, req, req->cancel.addr, nxt, 0); + io_async_find_and_cancel(ctx, req, req->cancel.addr, 0); return 0; }
@@ -4428,7 +4398,7 @@ static void io_cleanup_req(struct io_kiocb *req) }
static int io_issue_sqe(struct io_kiocb *req, const struct io_uring_sqe *sqe, - struct io_kiocb **nxt, bool force_nonblock) + bool force_nonblock) { struct io_ring_ctx *ctx = req->ctx; int ret; @@ -4445,7 +4415,7 @@ static int io_issue_sqe(struct io_kiocb *req, const struct io_uring_sqe *sqe, if (ret < 0) break; } - ret = io_read(req, nxt, force_nonblock); + ret = io_read(req, force_nonblock); break; case IORING_OP_WRITEV: case IORING_OP_WRITE_FIXED: @@ -4455,7 +4425,7 @@ static int io_issue_sqe(struct io_kiocb *req, const struct io_uring_sqe *sqe, if (ret < 0) break; } - ret = io_write(req, nxt, force_nonblock); + ret = io_write(req, force_nonblock); break; case IORING_OP_FSYNC: if (sqe) { @@ -4463,7 +4433,7 @@ static int io_issue_sqe(struct io_kiocb *req, const struct io_uring_sqe *sqe, if (ret < 0) break; } - ret = io_fsync(req, nxt, force_nonblock); + ret = io_fsync(req, force_nonblock); break; case IORING_OP_POLL_ADD: if (sqe) { @@ -4471,7 +4441,7 @@ static int io_issue_sqe(struct io_kiocb *req, const struct io_uring_sqe *sqe, if (ret) break; } - ret = io_poll_add(req, nxt); + ret = io_poll_add(req); break; case IORING_OP_POLL_REMOVE: if (sqe) { @@ -4487,7 +4457,7 @@ static int io_issue_sqe(struct io_kiocb *req, const struct io_uring_sqe *sqe, if (ret < 0) break; } - ret = io_sync_file_range(req, nxt, force_nonblock); + ret = io_sync_file_range(req, force_nonblock); break; case IORING_OP_SENDMSG: case IORING_OP_SEND: @@ -4497,9 +4467,9 @@ static int io_issue_sqe(struct io_kiocb *req, const struct io_uring_sqe *sqe, break; } if (req->opcode == IORING_OP_SENDMSG) - ret = io_sendmsg(req, nxt, force_nonblock); + ret = io_sendmsg(req, force_nonblock); else - ret = io_send(req, nxt, force_nonblock); + ret = io_send(req, force_nonblock); break; case IORING_OP_RECVMSG: case IORING_OP_RECV: @@ -4509,9 +4479,9 @@ static int io_issue_sqe(struct io_kiocb *req, const struct io_uring_sqe *sqe, break; } if (req->opcode == IORING_OP_RECVMSG) - ret = io_recvmsg(req, nxt, force_nonblock); + ret = io_recvmsg(req, force_nonblock); else - ret = io_recv(req, nxt, force_nonblock); + ret = io_recv(req, force_nonblock); break; case IORING_OP_TIMEOUT: if (sqe) { @@ -4535,7 +4505,7 @@ static int io_issue_sqe(struct io_kiocb *req, const struct io_uring_sqe *sqe, if (ret) break; } - ret = io_accept(req, nxt, force_nonblock); + ret = io_accept(req, force_nonblock); break; case IORING_OP_CONNECT: if (sqe) { @@ -4543,7 +4513,7 @@ static int io_issue_sqe(struct io_kiocb *req, const struct io_uring_sqe *sqe, if (ret) break; } - ret = io_connect(req, nxt, force_nonblock); + ret = io_connect(req, force_nonblock); break; case IORING_OP_ASYNC_CANCEL: if (sqe) { @@ -4551,7 +4521,7 @@ static int io_issue_sqe(struct io_kiocb *req, const struct io_uring_sqe *sqe, if (ret) break; } - ret = io_async_cancel(req, nxt); + ret = io_async_cancel(req); break; case IORING_OP_FALLOCATE: if (sqe) { @@ -4559,7 +4529,7 @@ static int io_issue_sqe(struct io_kiocb *req, const struct io_uring_sqe *sqe, if (ret) break; } - ret = io_fallocate(req, nxt, force_nonblock); + ret = io_fallocate(req, force_nonblock); break; case IORING_OP_OPENAT: if (sqe) { @@ -4567,7 +4537,7 @@ static int io_issue_sqe(struct io_kiocb *req, const struct io_uring_sqe *sqe, if (ret) break; } - ret = io_openat(req, nxt, force_nonblock); + ret = io_openat(req, force_nonblock); break; case IORING_OP_CLOSE: if (sqe) { @@ -4575,7 +4545,7 @@ static int io_issue_sqe(struct io_kiocb *req, const struct io_uring_sqe *sqe, if (ret) break; } - ret = io_close(req, nxt, force_nonblock); + ret = io_close(req, force_nonblock); break; case IORING_OP_FILES_UPDATE: if (sqe) { @@ -4591,7 +4561,7 @@ static int io_issue_sqe(struct io_kiocb *req, const struct io_uring_sqe *sqe, if (ret) break; } - ret = io_statx(req, nxt, force_nonblock); + ret = io_statx(req, force_nonblock); break; case IORING_OP_FADVISE: if (sqe) { @@ -4599,7 +4569,7 @@ static int io_issue_sqe(struct io_kiocb *req, const struct io_uring_sqe *sqe, if (ret) break; } - ret = io_fadvise(req, nxt, force_nonblock); + ret = io_fadvise(req, force_nonblock); break; case IORING_OP_MADVISE: if (sqe) { @@ -4607,7 +4577,7 @@ static int io_issue_sqe(struct io_kiocb *req, const struct io_uring_sqe *sqe, if (ret) break; } - ret = io_madvise(req, nxt, force_nonblock); + ret = io_madvise(req, force_nonblock); break; case IORING_OP_EPOLL_CTL: if (sqe) { @@ -4615,7 +4585,7 @@ static int io_issue_sqe(struct io_kiocb *req, const struct io_uring_sqe *sqe, if (ret) break; } - ret = io_epoll_ctl(req, nxt, force_nonblock); + ret = io_epoll_ctl(req, force_nonblock); break; case IORING_OP_SPLICE: if (sqe) { @@ -4623,7 +4593,7 @@ static int io_issue_sqe(struct io_kiocb *req, const struct io_uring_sqe *sqe, if (ret < 0) break; } - ret = io_splice(req, nxt, force_nonblock); + ret = io_splice(req, force_nonblock); break; default: ret = -EINVAL; @@ -4667,7 +4637,7 @@ static void io_wq_submit_work(struct io_wq_work **workptr)
if (!ret) { do { - ret = io_issue_sqe(req, NULL, &nxt, false); + ret = io_issue_sqe(req, NULL, false); /* * We can get EAGAIN for polled IO even though we're * forcing a sync submission from here, since we can't @@ -4813,8 +4783,7 @@ static enum hrtimer_restart io_link_timeout_fn(struct hrtimer *timer)
if (prev) { req_set_fail_links(prev); - io_async_find_and_cancel(ctx, req, prev->user_data, NULL, - -ETIME); + io_async_find_and_cancel(ctx, req, prev->user_data, -ETIME); io_put_req(prev); } else { io_cqring_add_event(req, -ETIME); @@ -4883,7 +4852,7 @@ static void __io_queue_sqe(struct io_kiocb *req, const struct io_uring_sqe *sqe) old_creds = override_creds(req->work.creds); }
- ret = io_issue_sqe(req, sqe, &nxt, true); + ret = io_issue_sqe(req, sqe, true);
/* * We async punt it if the file wasn't marked NOWAIT, or if the file
From: Pavel Begunkov <asml.silence@gmail.com>
mainline inclusion
from mainline-5.7-rc1
commit 7a743e225b2a9da772b28a50031e1ccd8a8ce404
category: feature
bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27
CVE: NA

---------------------------
If, after dropping the submission reference, req->refs == 1, the request is done: the remaining reference is the one held for io_put_work() and will be dropped synchronously shortly after. In this case it's safe to steal the next work from the request.
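
As a minimal standalone sketch of this invariant (plain userspace C with C11 atomics, not kernel code; names are illustrative), the reference lifecycle the argument above relies on looks like this:

#include <assert.h>
#include <stdatomic.h>

int main(void)
{
	/* a request starts with two references: the submission reference,
	 * and the io-wq reference that io_put_work() drops synchronously
	 * right after the work handler returns */
	atomic_int refs = 2;

	atomic_fetch_sub(&refs, 1);	/* completion drops the submission ref */

	/* only the synchronous io_put_work() reference is left, so no
	 * other context can still be using the request; stealing its
	 * linked next work here cannot race with anything */
	if (atomic_load(&refs) == 1) {
		/* ... take over req->nxt ... */
	}

	atomic_fetch_sub(&refs, 1);	/* io_put_work() drops the last ref */
	assert(atomic_load(&refs) == 0);
	return 0;
}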
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: yangerkun <yangerkun@huawei.com>
Reviewed-by: zhangyi (F) <yi.zhang@huawei.com>
Signed-off-by: Cheng Jian <cj.chengjian@huawei.com>
---
 fs/io_uring.c | 89 +++++++++++++++++++++++++++------------------------
 1 file changed, 48 insertions(+), 41 deletions(-)
diff --git a/fs/io_uring.c b/fs/io_uring.c index 270c1d0fe5e1..d6eaafea0aa1 100644 --- a/fs/io_uring.c +++ b/fs/io_uring.c @@ -1515,6 +1515,27 @@ static void io_free_req(struct io_kiocb *req) io_queue_async_work(nxt); }
+static void io_link_work_cb(struct io_wq_work **workptr) +{ + struct io_wq_work *work = *workptr; + struct io_kiocb *link = work->data; + + io_queue_linked_timeout(link); + io_wq_submit_work(workptr); +} + +static void io_wq_assign_next(struct io_wq_work **workptr, struct io_kiocb *nxt) +{ + struct io_kiocb *link; + + *workptr = &nxt->work; + link = io_prep_linked_timeout(nxt); + if (link) { + nxt->work.func = io_link_work_cb; + nxt->work.data = link; + } +} + /* * Drop reference to request, return next in chain (if there is one) if this * was the last reference to this request. @@ -1534,6 +1555,27 @@ static void io_put_req(struct io_kiocb *req) io_free_req(req); }
+static void io_put_req_async_completion(struct io_kiocb *req, + struct io_wq_work **workptr) +{ + /* + * It's in an io-wq worker, so there always should be at least + * one reference, which will be dropped in io_put_work() just + * after the current handler returns. + * + * It also means, that if the counter dropped to 1, then there is + * no asynchronous users left, so it's safe to steal the next work. + */ + refcount_dec(&req->refs); + if (refcount_read(&req->refs) == 1) { + struct io_kiocb *nxt = NULL; + + io_req_find_next(req, &nxt); + if (nxt) + io_wq_assign_next(workptr, nxt); + } +} + /* * Must only be used if we don't need to care about links, usually from * within the completion handling itself. @@ -2540,27 +2582,6 @@ static bool io_req_cancelled(struct io_kiocb *req) return false; }
-static void io_link_work_cb(struct io_wq_work **workptr) -{ - struct io_wq_work *work = *workptr; - struct io_kiocb *link = work->data; - - io_queue_linked_timeout(link); - io_wq_submit_work(workptr); -} - -static void io_wq_assign_next(struct io_wq_work **workptr, struct io_kiocb *nxt) -{ - struct io_kiocb *link; - - *workptr = &nxt->work; - link = io_prep_linked_timeout(nxt); - if (link) { - nxt->work.func = io_link_work_cb; - nxt->work.data = link; - } -} - static void __io_fsync(struct io_kiocb *req) { loff_t end = req->sync.off + req->sync.len; @@ -2578,14 +2599,11 @@ static void __io_fsync(struct io_kiocb *req) static void io_fsync_finish(struct io_wq_work **workptr) { struct io_kiocb *req = container_of(*workptr, struct io_kiocb, work); - struct io_kiocb *nxt = NULL;
if (io_req_cancelled(req)) return; __io_fsync(req); - io_put_req(req); /* drop submission reference */ - if (nxt) - io_wq_assign_next(workptr, nxt); + io_put_req_async_completion(req, workptr); }
static int io_fsync(struct io_kiocb *req, bool force_nonblock) @@ -2614,14 +2632,11 @@ static void __io_fallocate(struct io_kiocb *req) static void io_fallocate_finish(struct io_wq_work **workptr) { struct io_kiocb *req = container_of(*workptr, struct io_kiocb, work); - struct io_kiocb *nxt = NULL;
if (io_req_cancelled(req)) return; __io_fallocate(req); - io_put_req(req); /* drop submission reference */ - if (nxt) - io_wq_assign_next(workptr, nxt); + io_put_req_async_completion(req, workptr); }
static int io_fallocate_prep(struct io_kiocb *req, @@ -2939,13 +2954,10 @@ static void __io_close_finish(struct io_kiocb *req) static void io_close_finish(struct io_wq_work **workptr) { struct io_kiocb *req = container_of(*workptr, struct io_kiocb, work); - struct io_kiocb *nxt = NULL;
/* not cancellable, don't do io_req_cancelled() */ __io_close_finish(req); - io_put_req(req); /* drop submission reference */ - if (nxt) - io_wq_assign_next(workptr, nxt); + io_put_req_async_completion(req, workptr); }
static int io_close(struct io_kiocb *req, bool force_nonblock) @@ -3389,14 +3401,11 @@ static int __io_accept(struct io_kiocb *req, bool force_nonblock) static void io_accept_finish(struct io_wq_work **workptr) { struct io_kiocb *req = container_of(*workptr, struct io_kiocb, work); - struct io_kiocb *nxt = NULL;
if (io_req_cancelled(req)) return; __io_accept(req, false); - io_put_req(req); /* drop submission reference */ - if (nxt) - io_wq_assign_next(workptr, nxt); + io_put_req_async_completion(req, workptr); } #endif
@@ -4626,7 +4635,6 @@ static void io_wq_submit_work(struct io_wq_work **workptr) { struct io_wq_work *work = *workptr; struct io_kiocb *req = container_of(work, struct io_kiocb, work); - struct io_kiocb *nxt = NULL; int ret = 0;
/* if NO_CANCEL is set, we must still run the work */ @@ -4655,9 +4663,7 @@ static void io_wq_submit_work(struct io_wq_work **workptr) io_put_req(req); }
- io_put_req(req); /* drop submission reference */ - if (nxt) - io_wq_assign_next(workptr, nxt); + io_put_req_async_completion(req, workptr); }
static int io_req_needs_file(struct io_kiocb *req, int fd) @@ -6069,6 +6075,7 @@ static void io_put_work(struct io_wq_work *work) { struct io_kiocb *req = container_of(work, struct io_kiocb, work);
+ /* Consider that io_put_req_async_completion() relies on this ref */ io_put_req(req); }
From: Pavel Begunkov <asml.silence@gmail.com>
mainline inclusion
from mainline-5.7-rc1
commit dc026a73c7221b4d9d146ed0bde69ff578ebe8dc
category: feature
bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27
CVE: NA

---------------------------
This is a preparation patch: it adds some helpers and makes the next patches cleaner.
- extract io_impersonate_work() and io_assign_current_work()
- replace @next label with nested do-while
- move put_work() right after NULL'ing cur_work.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: yangerkun <yangerkun@huawei.com>
Reviewed-by: zhangyi (F) <yi.zhang@huawei.com>
Signed-off-by: Cheng Jian <cj.chengjian@huawei.com>
---
 fs/io-wq.c | 123 ++++++++++++++++++++++++++++-------------------------
 1 file changed, 64 insertions(+), 59 deletions(-)
diff --git a/fs/io-wq.c b/fs/io-wq.c index 0ca2b17c82f9..d6479bfbfd51 100644 --- a/fs/io-wq.c +++ b/fs/io-wq.c @@ -441,14 +441,43 @@ static void io_wq_switch_creds(struct io_worker *worker, worker->saved_creds = old_creds; }
+static void io_impersonate_work(struct io_worker *worker, + struct io_wq_work *work) +{ + if (work->files && current->files != work->files) { + task_lock(current); + current->files = work->files; + task_unlock(current); + } + if (work->fs && current->fs != work->fs) + current->fs = work->fs; + if (work->mm != worker->mm) + io_wq_switch_mm(worker, work); + if (worker->cur_creds != work->creds) + io_wq_switch_creds(worker, work); +} + +static void io_assign_current_work(struct io_worker *worker, + struct io_wq_work *work) +{ + /* flush pending signals before assigning new work */ + if (signal_pending(current)) + flush_signals(current); + cond_resched(); + + spin_lock_irq(&worker->lock); + worker->cur_work = work; + spin_unlock_irq(&worker->lock); +} + static void io_worker_handle_work(struct io_worker *worker) __releases(wqe->lock) { - struct io_wq_work *work, *old_work = NULL, *put_work = NULL; struct io_wqe *wqe = worker->wqe; struct io_wq *wq = wqe->wq;
do { + struct io_wq_work *work, *old_work; unsigned hash = -1U;
/* @@ -465,69 +494,45 @@ static void io_worker_handle_work(struct io_worker *worker) wqe->flags |= IO_WQE_FLAG_STALLED;
spin_unlock_irq(&wqe->lock); - if (put_work && wq->put_work) - wq->put_work(old_work); if (!work) break; -next: - /* flush any pending signals before assigning new work */ - if (signal_pending(current)) - flush_signals(current); - - cond_resched();
- spin_lock_irq(&worker->lock); - worker->cur_work = work; - spin_unlock_irq(&worker->lock); - - if (work->files && current->files != work->files) { - task_lock(current); - current->files = work->files; - task_unlock(current); - } - if (work->fs && current->fs != work->fs) - current->fs = work->fs; - if (work->mm != worker->mm) - io_wq_switch_mm(worker, work); - if (worker->cur_creds != work->creds) - io_wq_switch_creds(worker, work); - /* - * OK to set IO_WQ_WORK_CANCEL even for uncancellable work, - * the worker function will do the right thing. - */ - if (test_bit(IO_WQ_BIT_CANCEL, &wq->state)) - work->flags |= IO_WQ_WORK_CANCEL; - - if (wq->get_work) { - put_work = work; - wq->get_work(work); - } - - old_work = work; - work->func(&work); - - spin_lock_irq(&worker->lock); - worker->cur_work = NULL; - spin_unlock_irq(&worker->lock); - - spin_lock_irq(&wqe->lock); - - if (hash != -1U) { - wqe->hash_map &= ~BIT(hash); - wqe->flags &= ~IO_WQE_FLAG_STALLED; - } - if (work && work != old_work) { - spin_unlock_irq(&wqe->lock); - - if (put_work && wq->put_work) { - wq->put_work(put_work); - put_work = NULL; + /* handle a whole dependent link */ + do { + io_assign_current_work(worker, work); + io_impersonate_work(worker, work); + + /* + * OK to set IO_WQ_WORK_CANCEL even for uncancellable + * work, the worker function will do the right thing. + */ + if (test_bit(IO_WQ_BIT_CANCEL, &wq->state)) + work->flags |= IO_WQ_WORK_CANCEL; + + if (wq->get_work) + wq->get_work(work); + + old_work = work; + work->func(&work); + + spin_lock_irq(&worker->lock); + worker->cur_work = NULL; + spin_unlock_irq(&worker->lock); + + if (wq->put_work) + wq->put_work(old_work); + + if (hash != -1U) { + spin_lock_irq(&wqe->lock); + wqe->hash_map &= ~BIT_ULL(hash); + wqe->flags &= ~IO_WQE_FLAG_STALLED; + spin_unlock_irq(&wqe->lock); + /* dependent work is not hashed */ + hash = -1U; } + } while (work && work != old_work);
- /* dependent work not hashed */ - hash = -1U; - goto next; - } + spin_lock_irq(&wqe->lock); } while (1); }
From: Pavel Begunkov <asml.silence@gmail.com>
mainline inclusion
from mainline-5.7-rc1
commit 58e3931987377d3f4ec7bbc13e4ea0aab52dc6b0
category: feature
bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27
CVE: NA

---------------------------
There are 2 optimisations:

- Now, io_worker_handle_work() does io_assign_current_work() twice per
  request, and each call adds a lock/unlock(worker->lock) pair. The first
  resets worker->cur_work to NULL, and the second sets a real work shortly
  after. If there is a dependent work, set it immediately, which
  effectively removes the extra NULL'ing.

- There is no use in taking wqe->lock for linked works, as they are not
  hashed now. Optimise it out.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: yangerkun <yangerkun@huawei.com>
Reviewed-by: zhangyi (F) <yi.zhang@huawei.com>
Signed-off-by: Cheng Jian <cj.chengjian@huawei.com>
---
 fs/io-wq.c | 15 +++++++--------
 1 file changed, 7 insertions(+), 8 deletions(-)
diff --git a/fs/io-wq.c b/fs/io-wq.c index d6479bfbfd51..05f2fdc6bdce 100644 --- a/fs/io-wq.c +++ b/fs/io-wq.c @@ -477,7 +477,7 @@ static void io_worker_handle_work(struct io_worker *worker) struct io_wq *wq = wqe->wq;
do { - struct io_wq_work *work, *old_work; + struct io_wq_work *work; unsigned hash = -1U;
/* @@ -496,12 +496,13 @@ static void io_worker_handle_work(struct io_worker *worker) spin_unlock_irq(&wqe->lock); if (!work) break; + io_assign_current_work(worker, work);
/* handle a whole dependent link */ do { - io_assign_current_work(worker, work); - io_impersonate_work(worker, work); + struct io_wq_work *old_work;
+ io_impersonate_work(worker, work); /* * OK to set IO_WQ_WORK_CANCEL even for uncancellable * work, the worker function will do the right thing. @@ -514,10 +515,8 @@ static void io_worker_handle_work(struct io_worker *worker)
old_work = work; work->func(&work); - - spin_lock_irq(&worker->lock); - worker->cur_work = NULL; - spin_unlock_irq(&worker->lock); + work = (old_work == work) ? NULL : work; + io_assign_current_work(worker, work);
if (wq->put_work) wq->put_work(old_work); @@ -530,7 +529,7 @@ static void io_worker_handle_work(struct io_worker *worker) /* dependent work is not hashed */ hash = -1U; } - } while (work && work != old_work); + } while (work);
spin_lock_irq(&wqe->lock); } while (1);
From: Pavel Begunkov <asml.silence@gmail.com>
mainline inclusion
from mainline-v5.7-rc1
commit f462fd36fc43662eeb42c95a9b8da8659af6d75e
category: feature
bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27
CVE: NA

---------------------------
When executing non-linked hashed work, io_worker_handle_work() will lock/unlock wqe->lock to update the hash, and then immediately lock/unlock it again to get the next work. Optimise this case and do the lock/unlock only once.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: yangerkun <yangerkun@huawei.com>
Reviewed-by: zhangyi (F) <yi.zhang@huawei.com>
Signed-off-by: Cheng Jian <cj.chengjian@huawei.com>
---
 fs/io-wq.c | 9 ++++++---
 1 file changed, 6 insertions(+), 3 deletions(-)
diff --git a/fs/io-wq.c b/fs/io-wq.c index 05f2fdc6bdce..3a3a818f5416 100644 --- a/fs/io-wq.c +++ b/fs/io-wq.c @@ -475,11 +475,11 @@ static void io_worker_handle_work(struct io_worker *worker) { struct io_wqe *wqe = worker->wqe; struct io_wq *wq = wqe->wq; + unsigned hash = -1U;
do { struct io_wq_work *work; - unsigned hash = -1U; - +get_next: /* * If we got some work, mark us as busy. If we didn't, but * the list isn't empty, it means we stalled on hashed work. @@ -525,9 +525,12 @@ static void io_worker_handle_work(struct io_worker *worker) spin_lock_irq(&wqe->lock); wqe->hash_map &= ~BIT_ULL(hash); wqe->flags &= ~IO_WQE_FLAG_STALLED; - spin_unlock_irq(&wqe->lock); /* dependent work is not hashed */ hash = -1U; + /* skip unnecessary unlock-lock wqe->lock */ + if (!work) + goto get_next; + spin_unlock_irq(&wqe->lock); } } while (work);
From: Pavel Begunkov <asml.silence@gmail.com>
mainline inclusion
from mainline-5.7-rc1
commit e9fd939654f17651ff65e7e55aa6934d29eb4335
category: feature
bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27
CVE: NA

---------------------------
First it changes the io-wq interfaces. It replaces {get,put}_work() with free_work(), which is guaranteed to be called exactly once. It also enforces the free_work() callback to be non-NULL.
io_uring follows the changes: instead of putting the submission reference in io_put_req_async_completion(), it will be done in io_free_work(). As this also removes io_get_work() with its corresponding refcount_inc(), the ref balance is maintained.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: yangerkun <yangerkun@huawei.com>
Reviewed-by: zhangyi (F) <yi.zhang@huawei.com>
Signed-off-by: Cheng Jian <cj.chengjian@huawei.com>
---
 fs/io-wq.c    | 29 ++++++++++++++---------------
 fs/io-wq.h    |  6 ++----
 fs/io_uring.c | 31 +++++++++++--------------------
 3 files changed, 27 insertions(+), 39 deletions(-)
diff --git a/fs/io-wq.c b/fs/io-wq.c index 3a3a818f5416..73c5bb244730 100644 --- a/fs/io-wq.c +++ b/fs/io-wq.c @@ -108,8 +108,7 @@ struct io_wq { struct io_wqe **wqes; unsigned long state;
- get_work_fn *get_work; - put_work_fn *put_work; + free_work_fn *free_work;
struct task_struct *manager; struct user_struct *user; @@ -510,16 +509,11 @@ static void io_worker_handle_work(struct io_worker *worker) if (test_bit(IO_WQ_BIT_CANCEL, &wq->state)) work->flags |= IO_WQ_WORK_CANCEL;
- if (wq->get_work) - wq->get_work(work); - old_work = work; work->func(&work); work = (old_work == work) ? NULL : work; io_assign_current_work(worker, work); - - if (wq->put_work) - wq->put_work(old_work); + wq->free_work(old_work);
if (hash != -1U) { spin_lock_irq(&wqe->lock); @@ -750,14 +744,17 @@ static bool io_wq_can_queue(struct io_wqe *wqe, struct io_wqe_acct *acct, return true; }
-static void io_run_cancel(struct io_wq_work *work) +static void io_run_cancel(struct io_wq_work *work, struct io_wqe *wqe) { + struct io_wq *wq = wqe->wq; + do { struct io_wq_work *old_work = work;
work->flags |= IO_WQ_WORK_CANCEL; work->func(&work); work = (work == old_work) ? NULL : work; + wq->free_work(old_work); } while (work); }
@@ -774,7 +771,7 @@ static void io_wqe_enqueue(struct io_wqe *wqe, struct io_wq_work *work) * It's close enough to not be an issue, fork() has the same delay. */ if (unlikely(!io_wq_can_queue(wqe, acct, work))) { - io_run_cancel(work); + io_run_cancel(work, wqe); return; }
@@ -913,7 +910,7 @@ static enum io_wq_cancel io_wqe_cancel_cb_work(struct io_wqe *wqe, spin_unlock_irqrestore(&wqe->lock, flags);
if (found) { - io_run_cancel(work); + io_run_cancel(work, wqe); return IO_WQ_CANCEL_OK; }
@@ -988,7 +985,7 @@ static enum io_wq_cancel io_wqe_cancel_work(struct io_wqe *wqe, spin_unlock_irqrestore(&wqe->lock, flags);
if (found) { - io_run_cancel(work); + io_run_cancel(work, wqe); return IO_WQ_CANCEL_OK; }
@@ -1065,6 +1062,9 @@ struct io_wq *io_wq_create(unsigned bounded, struct io_wq_data *data) int ret = -ENOMEM, node; struct io_wq *wq;
+ if (WARN_ON_ONCE(!data->free_work)) + return ERR_PTR(-EINVAL); + wq = kzalloc(sizeof(*wq), GFP_KERNEL); if (!wq) return ERR_PTR(-ENOMEM); @@ -1075,8 +1075,7 @@ struct io_wq *io_wq_create(unsigned bounded, struct io_wq_data *data) return ERR_PTR(-ENOMEM); }
- wq->get_work = data->get_work; - wq->put_work = data->put_work; + wq->free_work = data->free_work;
/* caller must already hold a reference to this */ wq->user = data->user; @@ -1133,7 +1132,7 @@ struct io_wq *io_wq_create(unsigned bounded, struct io_wq_data *data)
bool io_wq_get(struct io_wq *wq, struct io_wq_data *data) { - if (data->get_work != wq->get_work || data->put_work != wq->put_work) + if (data->free_work != wq->free_work) return false;
return refcount_inc_not_zero(&wq->use_refs); diff --git a/fs/io-wq.h b/fs/io-wq.h index a0978d6958f0..2117b9a4f161 100644 --- a/fs/io-wq.h +++ b/fs/io-wq.h @@ -81,14 +81,12 @@ struct io_wq_work { *(work) = (struct io_wq_work){ .func = _func }; \ } while (0) \
-typedef void (get_work_fn)(struct io_wq_work *); -typedef void (put_work_fn)(struct io_wq_work *); +typedef void (free_work_fn)(struct io_wq_work *);
struct io_wq_data { struct user_struct *user;
- get_work_fn *get_work; - put_work_fn *put_work; + free_work_fn *free_work; };
struct io_wq *io_wq_create(unsigned bounded, struct io_wq_data *data); diff --git a/fs/io_uring.c b/fs/io_uring.c index d6eaafea0aa1..d1b0a7845e1c 100644 --- a/fs/io_uring.c +++ b/fs/io_uring.c @@ -1555,8 +1555,8 @@ static void io_put_req(struct io_kiocb *req) io_free_req(req); }
-static void io_put_req_async_completion(struct io_kiocb *req, - struct io_wq_work **workptr) +static void io_steal_work(struct io_kiocb *req, + struct io_wq_work **workptr) { /* * It's in an io-wq worker, so there always should be at least @@ -1566,7 +1566,6 @@ static void io_put_req_async_completion(struct io_kiocb *req, * It also means, that if the counter dropped to 1, then there is * no asynchronous users left, so it's safe to steal the next work. */ - refcount_dec(&req->refs); if (refcount_read(&req->refs) == 1) { struct io_kiocb *nxt = NULL;
@@ -2575,7 +2574,7 @@ static bool io_req_cancelled(struct io_kiocb *req) if (req->work.flags & IO_WQ_WORK_CANCEL) { req_set_fail_links(req); io_cqring_add_event(req, -ECANCELED); - io_double_put_req(req); + io_put_req(req); return true; }
@@ -2603,7 +2602,7 @@ static void io_fsync_finish(struct io_wq_work **workptr) if (io_req_cancelled(req)) return; __io_fsync(req); - io_put_req_async_completion(req, workptr); + io_steal_work(req, workptr); }
static int io_fsync(struct io_kiocb *req, bool force_nonblock) @@ -2636,7 +2635,7 @@ static void io_fallocate_finish(struct io_wq_work **workptr) if (io_req_cancelled(req)) return; __io_fallocate(req); - io_put_req_async_completion(req, workptr); + io_steal_work(req, workptr); }
static int io_fallocate_prep(struct io_kiocb *req, @@ -2957,7 +2956,7 @@ static void io_close_finish(struct io_wq_work **workptr)
/* not cancellable, don't do io_req_cancelled() */ __io_close_finish(req); - io_put_req_async_completion(req, workptr); + io_steal_work(req, workptr); }
static int io_close(struct io_kiocb *req, bool force_nonblock) @@ -3405,7 +3404,7 @@ static void io_accept_finish(struct io_wq_work **workptr) if (io_req_cancelled(req)) return; __io_accept(req, false); - io_put_req_async_completion(req, workptr); + io_steal_work(req, workptr); } #endif
@@ -4663,7 +4662,7 @@ static void io_wq_submit_work(struct io_wq_work **workptr) io_put_req(req); }
- io_put_req_async_completion(req, workptr); + io_steal_work(req, workptr); }
static int io_req_needs_file(struct io_kiocb *req, int fd) @@ -6071,21 +6070,14 @@ static int io_sqe_files_update(struct io_ring_ctx *ctx, void __user *arg, return __io_sqe_files_update(ctx, &up, nr_args); }
-static void io_put_work(struct io_wq_work *work) +static void io_free_work(struct io_wq_work *work) { struct io_kiocb *req = container_of(work, struct io_kiocb, work);
- /* Consider that io_put_req_async_completion() relies on this ref */ + /* Consider that io_steal_work() relies on this ref */ io_put_req(req); }
-static void io_get_work(struct io_wq_work *work) -{ - struct io_kiocb *req = container_of(work, struct io_kiocb, work); - - refcount_inc(&req->refs); -} - static int io_init_wq_offload(struct io_ring_ctx *ctx, struct io_uring_params *p) { @@ -6096,8 +6088,7 @@ static int io_init_wq_offload(struct io_ring_ctx *ctx, int ret = 0;
data.user = ctx->user; - data.get_work = io_get_work; - data.put_work = io_put_work; + data.free_work = io_free_work;
if (!(p->flags & IORING_SETUP_ATTACH_WQ)) { /* Do QD, or 4 * CPUS, whatever is smallest */
From: Jens Axboe <axboe@kernel.dk>
mainline inclusion
from mainline-5.7-rc1
commit 5a2e745d4d430c4dbeeeb448c3d5c0c3109e511e
category: feature
bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27
CVE: NA

---------------------------
This just prepares the ring for having lists of buffers associated with it, which the application can provide for SQEs to consume instead of providing their own.
The buffers are organized by group ID.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: yangerkun <yangerkun@huawei.com>
Reviewed-by: zhangyi (F) <yi.zhang@huawei.com>
Signed-off-by: Cheng Jian <cj.chengjian@huawei.com>
---
 fs/io_uring.c | 35 +++++++++++++++++++++++++++++++++++
 1 file changed, 35 insertions(+)
diff --git a/fs/io_uring.c b/fs/io_uring.c index d1b0a7845e1c..dc5381515877 100644 --- a/fs/io_uring.c +++ b/fs/io_uring.c @@ -195,6 +195,13 @@ struct fixed_file_data { struct completion done; };
+struct io_buffer { + struct list_head list; + __u64 addr; + __s32 len; + __u16 bid; +}; + struct io_ring_ctx { struct { struct percpu_ref refs; @@ -272,6 +279,8 @@ struct io_ring_ctx { struct socket *ring_sock; #endif
+ struct idr io_buffer_idr; + struct idr personality_idr;
struct { @@ -871,6 +880,7 @@ static struct io_ring_ctx *io_ring_ctx_alloc(struct io_uring_params *p) INIT_LIST_HEAD(&ctx->cq_overflow_list); init_completion(&ctx->completions[0]); init_completion(&ctx->completions[1]); + idr_init(&ctx->io_buffer_idr); idr_init(&ctx->personality_idr); mutex_init(&ctx->uring_lock); init_waitqueue_head(&ctx->wait); @@ -6491,6 +6501,30 @@ static int io_eventfd_unregister(struct io_ring_ctx *ctx) return -ENXIO; }
+static int __io_destroy_buffers(int id, void *p, void *data) +{ + struct io_ring_ctx *ctx = data; + struct io_buffer *buf = p; + + /* the head kbuf is the list itself */ + while (!list_empty(&buf->list)) { + struct io_buffer *nxt; + + nxt = list_first_entry(&buf->list, struct io_buffer, list); + list_del(&nxt->list); + kfree(nxt); + } + kfree(buf); + idr_remove(&ctx->io_buffer_idr, id); + return 0; +} + +static void io_destroy_buffers(struct io_ring_ctx *ctx) +{ + idr_for_each(&ctx->io_buffer_idr, __io_destroy_buffers, ctx); + idr_destroy(&ctx->io_buffer_idr); +} + static void io_ring_ctx_free(struct io_ring_ctx *ctx) { io_finish_async(ctx); @@ -6501,6 +6535,7 @@ static void io_ring_ctx_free(struct io_ring_ctx *ctx) io_sqe_buffer_unregister(ctx); io_sqe_files_unregister(ctx); io_eventfd_unregister(ctx); + io_destroy_buffers(ctx); idr_destroy(&ctx->personality_idr);
#if defined(CONFIG_UNIX)
From: Jens Axboe <axboe@kernel.dk>
mainline inclusion
from mainline-5.7-rc1
commit ddf0322db79c5984dc1a1db890f946dd19b7d6d9
category: feature
bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27
CVE: NA

---------------------------
IORING_OP_PROVIDE_BUFFERS uses the buffer registration infrastructure to support passing in an addr/len that is associated with a buffer ID and buffer group ID. The group ID is used to index and look up the buffers, while the buffer ID can be used to notify the application which buffer in the group was used. The addr passed in is the starting buffer address, and len is the length of each buffer. A number of buffers to add can be specified, in which case addr is incremented by len for each addition, and the buffer ID is incremented by one for each buffer.
No validation is done of the buffer ID. If the application provides buffers within the same group with identical buffer IDs, then it'll have a hard time telling which buffer was used. The only restriction is that the buffer ID is at most 16 bits in size, so USHRT_MAX is the maximum ID that can be used.
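
As a hedged userspace sketch of the resulting ABI (the helper name is made up, error handling is omitted, and the field mapping is taken from io_provide_buffers_prep() in the diff below), providing four 256-byte buffers with IDs 10..13 in group 7 could look like this:

#include <stdint.h>
#include <string.h>
#include <linux/io_uring.h>	/* assumes a uapi header with this patch applied */

static void prep_provide_buffers(struct io_uring_sqe *sqe, void *base)
{
	memset(sqe, 0, sizeof(*sqe));
	sqe->opcode = IORING_OP_PROVIDE_BUFFERS;
	sqe->fd = 4;				/* nbufs: how many buffers to add */
	sqe->addr = (uint64_t)(uintptr_t)base;	/* address of the first buffer */
	sqe->len = 256;				/* length of each buffer */
	sqe->off = 10;				/* buffer ID of the first buffer */
	sqe->buf_group = 7;			/* group ID to add the buffers to */
}

The kernel then registers base as buffer 10, base + 256 as buffer 11, and so on.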
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Conflicts:
	fs/io_uring.c
	[commit cebdb98617ae ("io_uring: add support for IORING_OP_OPENAT2")
	 is not merged]
Signed-off-by: yangerkun <yangerkun@huawei.com>
Reviewed-by: zhangyi (F) <yi.zhang@huawei.com>
Signed-off-by: Cheng Jian <cj.chengjian@huawei.com>
---
 fs/io_uring.c                 | 138 +++++++++++++++++++++++++++-
 include/uapi/linux/io_uring.h |  10 ++-
 2 files changed, 145 insertions(+), 3 deletions(-)
diff --git a/fs/io_uring.c b/fs/io_uring.c index dc5381515877..b665cc71ba23 100644 --- a/fs/io_uring.c +++ b/fs/io_uring.c @@ -450,6 +450,15 @@ struct io_splice { unsigned int flags; };
+struct io_provide_buf { + struct file *file; + __u64 addr; + __s32 len; + __u32 bgid; + __u16 nbufs; + __u16 bid; +}; + struct io_async_connect { struct sockaddr_storage address; }; @@ -575,6 +584,7 @@ struct io_kiocb { struct io_madvise madvise; struct io_epoll epoll; struct io_splice splice; + struct io_provide_buf pbuf; };
struct io_async_ctx *io; @@ -796,7 +806,8 @@ static const struct io_op_def io_op_defs[] = { .needs_file = 1, .hash_reg_file = 1, .unbound_nonreg_file = 1, - } + }, + [IORING_OP_PROVIDE_BUFFERS] = {}, };
static void io_wq_submit_work(struct io_wq_work **workptr); @@ -2736,6 +2747,120 @@ static int io_openat(struct io_kiocb *req, bool force_nonblock) return 0; }
+static int io_provide_buffers_prep(struct io_kiocb *req, + const struct io_uring_sqe *sqe) +{ + struct io_provide_buf *p = &req->pbuf; + u64 tmp; + + if (sqe->ioprio || sqe->rw_flags) + return -EINVAL; + + tmp = READ_ONCE(sqe->fd); + if (!tmp || tmp > USHRT_MAX) + return -E2BIG; + p->nbufs = tmp; + p->addr = READ_ONCE(sqe->addr); + p->len = READ_ONCE(sqe->len); + + if (!access_ok(u64_to_user_ptr(p->addr), p->len)) + return -EFAULT; + + p->bgid = READ_ONCE(sqe->buf_group); + tmp = READ_ONCE(sqe->off); + if (tmp > USHRT_MAX) + return -E2BIG; + p->bid = tmp; + return 0; +} + +static int io_add_buffers(struct io_provide_buf *pbuf, struct io_buffer **head) +{ + struct io_buffer *buf; + u64 addr = pbuf->addr; + int i, bid = pbuf->bid; + + for (i = 0; i < pbuf->nbufs; i++) { + buf = kmalloc(sizeof(*buf), GFP_KERNEL); + if (!buf) + break; + + buf->addr = addr; + buf->len = pbuf->len; + buf->bid = bid; + addr += pbuf->len; + bid++; + if (!*head) { + INIT_LIST_HEAD(&buf->list); + *head = buf; + } else { + list_add_tail(&buf->list, &(*head)->list); + } + } + + return i ? i : -ENOMEM; +} + +static void io_ring_submit_unlock(struct io_ring_ctx *ctx, bool needs_lock) +{ + if (needs_lock) + mutex_unlock(&ctx->uring_lock); +} + +static void io_ring_submit_lock(struct io_ring_ctx *ctx, bool needs_lock) +{ + /* + * "Normal" inline submissions always hold the uring_lock, since we + * grab it from the system call. Same is true for the SQPOLL offload. + * The only exception is when we've detached the request and issue it + * from an async worker thread, grab the lock for that case. + */ + if (needs_lock) + mutex_lock(&ctx->uring_lock); +} + +static int io_provide_buffers(struct io_kiocb *req, bool force_nonblock) +{ + struct io_provide_buf *p = &req->pbuf; + struct io_ring_ctx *ctx = req->ctx; + struct io_buffer *head, *list; + int ret = 0; + + io_ring_submit_lock(ctx, !force_nonblock); + + lockdep_assert_held(&ctx->uring_lock); + + list = head = idr_find(&ctx->io_buffer_idr, p->bgid); + + ret = io_add_buffers(p, &head); + if (ret < 0) + goto out; + + if (!list) { + ret = idr_alloc(&ctx->io_buffer_idr, head, p->bgid, p->bgid + 1, + GFP_KERNEL); + if (ret < 0) { + while (!list_empty(&head->list)) { + struct io_buffer *buf; + + buf = list_first_entry(&head->list, + struct io_buffer, list); + list_del(&buf->list); + kfree(buf); + } + kfree(head); + goto out; + } + } +out: + io_ring_submit_unlock(ctx, !force_nonblock); + if (ret < 0) + req_set_fail_links(req); + io_cqring_add_event(req, ret); + io_put_req(req); + return 0; +} + static int io_epoll_ctl_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe) { @@ -4345,6 +4470,9 @@ static int io_req_defer_prep(struct io_kiocb *req, case IORING_OP_SPLICE: ret = io_splice_prep(req, sqe); break; + case IORING_OP_PROVIDE_BUFFERS: + ret = io_provide_buffers_prep(req, sqe); + break; default: printk_once(KERN_WARNING "io_uring: unhandled opcode %d\n", req->opcode); @@ -4613,6 +4741,14 @@ static int io_issue_sqe(struct io_kiocb *req, const struct io_uring_sqe *sqe, } ret = io_splice(req, force_nonblock); break; + case IORING_OP_PROVIDE_BUFFERS: + if (sqe) { + ret = io_provide_buffers_prep(req, sqe); + if (ret) + break; + } + ret = io_provide_buffers(req, force_nonblock); + break; default: ret = -EINVAL; break; diff --git a/include/uapi/linux/io_uring.h b/include/uapi/linux/io_uring.h index 14b4f075068f..5a3c5dd07e82 100644 --- a/include/uapi/linux/io_uring.h +++ b/include/uapi/linux/io_uring.h @@ -45,8 +45,13 @@ struct io_uring_sqe { __u64 user_data; /* data to be passed back at completion time */ union { struct { - /* index into fixed buffers, if used */ - __u16 buf_index; + /* pack this to avoid bogus arm OABI complaints */ + union { + /* index into fixed buffers, if used */ + __u16 buf_index; + /* for grouped buffer selection */ + __u16 buf_group; + } __attribute__((packed)); /* personality to use, if used */ __u16 personality; __s32 splice_fd_in; @@ -118,6 +123,7 @@ enum { IORING_OP_RECV, IORING_OP_EPOLL_CTL, IORING_OP_SPLICE, + IORING_OP_PROVIDE_BUFFERS,
/* this goes last, obviously */ IORING_OP_LAST,
From: Jens Axboe <axboe@kernel.dk>
mainline inclusion
from mainline-5.7-rc1
commit bcda7baaa3f15c7a95db3c024bb046d6e298f76b
category: feature
bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27
CVE: NA

---------------------------
If a server process has tons of pending socket connections, generally it uses epoll to wait for activity. When the socket is ready for reading (or writing), the task can select a buffer and issue a recv/send on the given fd.
Now that we have fast (non-async thread) support, a task can have tons of reads or writes pending. But that means it needs buffers to back that data, and if the number of connections is high enough, having them preallocated for all possible connections is unfeasible.
With IORING_OP_PROVIDE_BUFFERS, an application can register buffers to use for any request. The request then sets IOSQE_BUFFER_SELECT in the sqe, and a given group ID in sqe->buf_group. When the fd becomes ready, a free buffer from the specified group is selected. If none are available, the request is terminated with -ENOBUFS. If successful, the CQE on completion will contain the buffer ID chosen in the cqe->flags member, encoded as:
(buffer_id << IORING_CQE_BUFFER_SHIFT) | IORING_CQE_F_BUFFER;
Once a buffer has been consumed by a request, it is no longer available and must be registered again with IORING_OP_PROVIDE_BUFFERS.
Requests need to support this feature. For now, IORING_OP_READ and IORING_OP_RECV support it. This is checked on SQE submission; a CQE with res == -EOPNOTSUPP will be posted if attempted on unsupported requests.
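
As a hedged userspace sketch of the flow (helper names are made up; assumes a uapi header with this patch applied), a recv that lets the kernel pick a buffer from group 7, plus the decoding of the completion, could look like this:

#include <string.h>
#include <linux/io_uring.h>

static void prep_recv_select(struct io_uring_sqe *sqe, int sockfd)
{
	memset(sqe, 0, sizeof(*sqe));
	sqe->opcode = IORING_OP_RECV;
	sqe->fd = sockfd;
	sqe->addr = 0;			/* no buffer passed in by the application */
	sqe->len = 256;			/* capped to the selected buffer's length */
	sqe->buf_group = 7;		/* group to select a buffer from */
	sqe->flags = IOSQE_BUFFER_SELECT;
}

static int cqe_buffer_id(const struct io_uring_cqe *cqe)
{
	if (!(cqe->flags & IORING_CQE_F_BUFFER))
		return -1;	/* no buffer was selected for this request */
	return cqe->flags >> IORING_CQE_BUFFER_SHIFT;
}

For example, buffer ID 13 comes back as cqe->flags == (13 << IORING_CQE_BUFFER_SHIFT) | IORING_CQE_F_BUFFER.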
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: yangerkun <yangerkun@huawei.com>
Reviewed-by: zhangyi (F) <yi.zhang@huawei.com>
Signed-off-by: Cheng Jian <cj.chengjian@huawei.com>
---
 fs/io_uring.c                 | 224 ++++++++++++++++++++++++++++------
 include/uapi/linux/io_uring.h |  14 +++
 2 files changed, 199 insertions(+), 39 deletions(-)
diff --git a/fs/io_uring.c b/fs/io_uring.c index b665cc71ba23..afd71ea5c918 100644 --- a/fs/io_uring.c +++ b/fs/io_uring.c @@ -396,7 +396,9 @@ struct io_sr_msg { void __user *buf; }; int msg_flags; + int bgid; size_t len; + struct io_buffer *kbuf; };
struct io_open { @@ -493,6 +495,7 @@ enum { REQ_F_LINK_BIT = IOSQE_IO_LINK_BIT, REQ_F_HARDLINK_BIT = IOSQE_IO_HARDLINK_BIT, REQ_F_FORCE_ASYNC_BIT = IOSQE_ASYNC_BIT, + REQ_F_BUFFER_SELECT_BIT = IOSQE_BUFFER_SELECT_BIT,
REQ_F_LINK_NEXT_BIT, REQ_F_FAIL_LINK_BIT, @@ -509,6 +512,7 @@ enum { REQ_F_NEED_CLEANUP_BIT, REQ_F_OVERFLOW_BIT, REQ_F_POLLED_BIT, + REQ_F_BUFFER_SELECTED_BIT, };
enum { @@ -522,6 +526,8 @@ enum { REQ_F_HARDLINK = BIT(REQ_F_HARDLINK_BIT), /* IOSQE_ASYNC */ REQ_F_FORCE_ASYNC = BIT(REQ_F_FORCE_ASYNC_BIT), + /* IOSQE_BUFFER_SELECT */ + REQ_F_BUFFER_SELECT = BIT(REQ_F_BUFFER_SELECT_BIT),
/* already grabbed next link */ REQ_F_LINK_NEXT = BIT(REQ_F_LINK_NEXT_BIT), @@ -553,6 +559,8 @@ enum { REQ_F_OVERFLOW = BIT(REQ_F_OVERFLOW_BIT), /* already went through poll handler */ REQ_F_POLLED = BIT(REQ_F_POLLED_BIT), + /* buffer already selected */ + REQ_F_BUFFER_SELECTED = BIT(REQ_F_BUFFER_SELECTED_BIT), };
struct async_poll { @@ -615,6 +623,7 @@ struct io_kiocb { struct callback_head task_work; struct hlist_node hash_node; struct async_poll *apoll; + int cflags; }; struct io_wq_work work; }; @@ -664,6 +673,8 @@ struct io_op_def { /* set if opcode supports polled "wait" */ unsigned pollin : 1; unsigned pollout : 1; + /* op supports buffer selection */ + unsigned buffer_select : 1; };
static const struct io_op_def io_op_defs[] = { @@ -773,6 +784,7 @@ static const struct io_op_def io_op_defs[] = { .needs_file = 1, .unbound_nonreg_file = 1, .pollin = 1, + .buffer_select = 1, }, [IORING_OP_WRITE] = { .needs_mm = 1, @@ -797,6 +809,7 @@ static const struct io_op_def io_op_defs[] = { .needs_file = 1, .unbound_nonreg_file = 1, .pollin = 1, + .buffer_select = 1, }, [IORING_OP_EPOLL_CTL] = { .unbound_nonreg_file = 1, @@ -1167,7 +1180,7 @@ static bool io_cqring_overflow_flush(struct io_ring_ctx *ctx, bool force) if (cqe) { WRITE_ONCE(cqe->user_data, req->user_data); WRITE_ONCE(cqe->res, req->result); - WRITE_ONCE(cqe->flags, 0); + WRITE_ONCE(cqe->flags, req->cflags); } else { WRITE_ONCE(ctx->rings->cq_overflow, atomic_inc_return(&ctx->cached_cq_overflow)); @@ -1191,7 +1204,7 @@ static bool io_cqring_overflow_flush(struct io_ring_ctx *ctx, bool force) return cqe != NULL; }
-static void io_cqring_fill_event(struct io_kiocb *req, long res) +static void __io_cqring_fill_event(struct io_kiocb *req, long res, long cflags) { struct io_ring_ctx *ctx = req->ctx; struct io_uring_cqe *cqe; @@ -1207,7 +1220,7 @@ static void io_cqring_fill_event(struct io_kiocb *req, long res) if (likely(cqe)) { WRITE_ONCE(cqe->user_data, req->user_data); WRITE_ONCE(cqe->res, res); - WRITE_ONCE(cqe->flags, 0); + WRITE_ONCE(cqe->flags, cflags); } else if (ctx->cq_overflow_flushed) { WRITE_ONCE(ctx->rings->cq_overflow, atomic_inc_return(&ctx->cached_cq_overflow)); @@ -1219,23 +1232,34 @@ static void io_cqring_fill_event(struct io_kiocb *req, long res) req->flags |= REQ_F_OVERFLOW; refcount_inc(&req->refs); req->result = res; + req->cflags = cflags; list_add_tail(&req->list, &ctx->cq_overflow_list); } }
-static void io_cqring_add_event(struct io_kiocb *req, long res) +static void io_cqring_fill_event(struct io_kiocb *req, long res) +{ + __io_cqring_fill_event(req, res, 0); +} + +static void __io_cqring_add_event(struct io_kiocb *req, long res, long cflags) { struct io_ring_ctx *ctx = req->ctx; unsigned long flags;
spin_lock_irqsave(&ctx->completion_lock, flags); - io_cqring_fill_event(req, res); + __io_cqring_fill_event(req, res, cflags); io_commit_cqring(ctx); spin_unlock_irqrestore(&ctx->completion_lock, flags);
io_cqring_ev_posted(ctx); }
+static void io_cqring_add_event(struct io_kiocb *req, long res) +{ + __io_cqring_add_event(req, res, 0); +} + static inline bool io_is_fallback_req(struct io_kiocb *req) { return req == (struct io_kiocb *) @@ -1657,6 +1681,18 @@ static inline bool io_req_multi_free(struct req_batch *rb, struct io_kiocb *req) return true; }
+static int io_put_kbuf(struct io_kiocb *req) +{ + struct io_buffer *kbuf = (struct io_buffer *) req->rw.addr; + int cflags; + + cflags = kbuf->bid << IORING_CQE_BUFFER_SHIFT; + cflags |= IORING_CQE_F_BUFFER; + req->rw.addr = 0; + kfree(kbuf); + return cflags; +} + /* * Find and free completed poll iocbs */ @@ -1668,10 +1704,15 @@ static void io_iopoll_complete(struct io_ring_ctx *ctx, unsigned int *nr_events,
rb.to_free = rb.need_iter = 0; while (!list_empty(done)) { + int cflags = 0; + req = list_first_entry(done, struct io_kiocb, list); list_del(&req->list);
- io_cqring_fill_event(req, req->result); + if (req->flags & REQ_F_BUFFER_SELECTED) + cflags = io_put_kbuf(req); + + __io_cqring_fill_event(req, req->result, cflags); (*nr_events)++;
if (refcount_dec_and_test(&req->refs) && @@ -1846,13 +1887,16 @@ static inline void req_set_fail_links(struct io_kiocb *req) static void io_complete_rw_common(struct kiocb *kiocb, long res) { struct io_kiocb *req = container_of(kiocb, struct io_kiocb, rw.kiocb); + int cflags = 0;
if (kiocb->ki_flags & IOCB_WRITE) kiocb_end_write(req);
if (res != req->result) req_set_fail_links(req); - io_cqring_add_event(req, res); + if (req->flags & REQ_F_BUFFER_SELECTED) + cflags = io_put_kbuf(req); + __io_cqring_add_event(req, res, cflags); }
static void io_complete_rw(struct kiocb *kiocb, long res, long res2) @@ -2030,7 +2074,7 @@ static int io_prep_rw(struct io_kiocb *req, const struct io_uring_sqe *sqe,
req->rw.addr = READ_ONCE(sqe->addr); req->rw.len = READ_ONCE(sqe->len); - /* we own ->private, reuse it for the buffer index */ + /* we own ->private, reuse it for the buffer index / buffer ID */ req->rw.kiocb.private = (void *) (unsigned long) READ_ONCE(sqe->buf_index); return 0; @@ -2143,8 +2187,61 @@ static ssize_t io_import_fixed(struct io_kiocb *req, int rw, return len; }
+static void io_ring_submit_unlock(struct io_ring_ctx *ctx, bool needs_lock) +{ + if (needs_lock) + mutex_unlock(&ctx->uring_lock); +} + +static void io_ring_submit_lock(struct io_ring_ctx *ctx, bool needs_lock) +{ + /* + * "Normal" inline submissions always hold the uring_lock, since we + * grab it from the system call. Same is true for the SQPOLL offload. + * The only exception is when we've detached the request and issue it + * from an async worker thread, grab the lock for that case. + */ + if (needs_lock) + mutex_lock(&ctx->uring_lock); +} + +static struct io_buffer *io_buffer_select(struct io_kiocb *req, size_t *len, + int bgid, struct io_buffer *kbuf, + bool needs_lock) +{ + struct io_buffer *head; + + if (req->flags & REQ_F_BUFFER_SELECTED) + return kbuf; + + io_ring_submit_lock(req->ctx, needs_lock); + + lockdep_assert_held(&req->ctx->uring_lock); + + head = idr_find(&req->ctx->io_buffer_idr, bgid); + if (head) { + if (!list_empty(&head->list)) { + kbuf = list_last_entry(&head->list, struct io_buffer, + list); + list_del(&kbuf->list); + } else { + kbuf = head; + idr_remove(&req->ctx->io_buffer_idr, bgid); + } + if (*len > kbuf->len) + *len = kbuf->len; + } else { + kbuf = ERR_PTR(-ENOBUFS); + } + + io_ring_submit_unlock(req->ctx, needs_lock); + + return kbuf; +} + static ssize_t io_import_iovec(int rw, struct io_kiocb *req, - struct iovec **iovec, struct iov_iter *iter) + struct iovec **iovec, struct iov_iter *iter, + bool needs_lock) { void __user *buf = u64_to_user_ptr(req->rw.addr); size_t sqe_len = req->rw.len; @@ -2156,12 +2253,29 @@ static ssize_t io_import_iovec(int rw, struct io_kiocb *req, return io_import_fixed(req, rw, iter); }
- /* buffer index only valid with fixed read/write */ - if (req->rw.kiocb.private) + /* buffer index only valid with fixed read/write, or buffer select */ + if (req->rw.kiocb.private && !(req->flags & REQ_F_BUFFER_SELECT)) return -EINVAL;
if (opcode == IORING_OP_READ || opcode == IORING_OP_WRITE) { ssize_t ret; + + if (req->flags & REQ_F_BUFFER_SELECT) { + struct io_buffer *kbuf = (struct io_buffer *) req->rw.addr; + int bgid; + + bgid = (int) (unsigned long) req->rw.kiocb.private; + kbuf = io_buffer_select(req, &sqe_len, bgid, kbuf, + needs_lock); + if (IS_ERR(kbuf)) { + *iovec = NULL; + return PTR_ERR(kbuf); + } + req->rw.addr = (u64) kbuf; + req->flags |= REQ_F_BUFFER_SELECTED; + buf = u64_to_user_ptr(kbuf->addr); + } + ret = import_single_range(rw, buf, sqe_len, *iovec, iter); *iovec = NULL; return ret < 0 ? ret : sqe_len; @@ -2304,7 +2418,7 @@ static int io_read_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe, io = req->io; io->rw.iov = io->rw.fast_iov; req->io = NULL; - ret = io_import_iovec(READ, req, &io->rw.iov, &iter); + ret = io_import_iovec(READ, req, &io->rw.iov, &iter, !force_nonblock); req->io = io; if (ret < 0) return ret; @@ -2321,7 +2435,7 @@ static int io_read(struct io_kiocb *req, bool force_nonblock) size_t iov_count; ssize_t io_size, ret;
- ret = io_import_iovec(READ, req, &iovec, &iter); + ret = io_import_iovec(READ, req, &iovec, &iter, !force_nonblock); if (ret < 0) return ret;
@@ -2393,7 +2507,7 @@ static int io_write_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe, io = req->io; io->rw.iov = io->rw.fast_iov; req->io = NULL; - ret = io_import_iovec(WRITE, req, &io->rw.iov, &iter); + ret = io_import_iovec(WRITE, req, &io->rw.iov, &iter, !force_nonblock); req->io = io; if (ret < 0) return ret; @@ -2410,7 +2524,7 @@ static int io_write(struct io_kiocb *req, bool force_nonblock) size_t iov_count; ssize_t ret, io_size;
- ret = io_import_iovec(WRITE, req, &iovec, &iter); + ret = io_import_iovec(WRITE, req, &iovec, &iter, !force_nonblock); if (ret < 0) return ret;
@@ -2801,24 +2915,6 @@ static int io_add_buffers(struct io_provide_buf *pbuf, struct io_buffer **head) return i ? i : -ENOMEM; }
-static void io_ring_submit_unlock(struct io_ring_ctx *ctx, bool needs_lock) -{ - if (needs_lock) - mutex_unlock(&ctx->uring_lock); -} - -static void io_ring_submit_lock(struct io_ring_ctx *ctx, bool needs_lock) -{ - /* - * "Normal" inline submissions always hold the uring_lock, since we - * grab it from the system call. Same is true for the SQPOLL offload. - * The only exception is when we've detached the request and issue it - * from an async worker thread, grab the lock for that case. - */ - if (needs_lock) - mutex_lock(&ctx->uring_lock); -} - static int io_provide_buffers(struct io_kiocb *req, bool force_nonblock) { struct io_provide_buf *p = &req->pbuf; @@ -3341,6 +3437,27 @@ static int io_send(struct io_kiocb *req, bool force_nonblock) #endif }
+static struct io_buffer *io_recv_buffer_select(struct io_kiocb *req, + int *cflags, bool needs_lock) +{ + struct io_sr_msg *sr = &req->sr_msg; + struct io_buffer *kbuf; + + if (!(req->flags & REQ_F_BUFFER_SELECT)) + return NULL; + + kbuf = io_buffer_select(req, &sr->len, sr->bgid, sr->kbuf, needs_lock); + if (IS_ERR(kbuf)) + return kbuf; + + sr->kbuf = kbuf; + req->flags |= REQ_F_BUFFER_SELECTED; + + *cflags = kbuf->bid << IORING_CQE_BUFFER_SHIFT; + *cflags |= IORING_CQE_F_BUFFER; + return kbuf; +} + static int io_recvmsg_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe) { @@ -3352,6 +3469,7 @@ static int io_recvmsg_prep(struct io_kiocb *req, sr->msg_flags = READ_ONCE(sqe->msg_flags); sr->msg = u64_to_user_ptr(READ_ONCE(sqe->addr)); sr->len = READ_ONCE(sqe->len); + sr->bgid = READ_ONCE(sqe->buf_group);
#ifdef CONFIG_COMPAT if (req->ctx->compat) @@ -3441,8 +3559,9 @@ static int io_recvmsg(struct io_kiocb *req, bool force_nonblock) static int io_recv(struct io_kiocb *req, bool force_nonblock) { #if defined(CONFIG_NET) + struct io_buffer *kbuf = NULL; struct socket *sock; - int ret; + int ret, cflags = 0;
if (unlikely(req->ctx->flags & IORING_SETUP_IOPOLL)) return -EINVAL; @@ -3450,15 +3569,25 @@ static int io_recv(struct io_kiocb *req, bool force_nonblock) sock = sock_from_file(req->file, &ret); if (sock) { struct io_sr_msg *sr = &req->sr_msg; + void __user *buf = sr->buf; struct msghdr msg; struct iovec iov; unsigned flags;
- ret = import_single_range(READ, sr->buf, sr->len, &iov, + kbuf = io_recv_buffer_select(req, &cflags, !force_nonblock); + if (IS_ERR(kbuf)) + return PTR_ERR(kbuf); + else if (kbuf) + buf = u64_to_user_ptr(kbuf->addr); + + ret = import_single_range(READ, buf, sr->len, &iov, &msg.msg_iter); - if (ret) + if (ret) { + kfree(kbuf); return ret; + }
+ req->flags |= REQ_F_NEED_CLEANUP; msg.msg_name = NULL; msg.msg_control = NULL; msg.msg_controllen = 0; @@ -3479,7 +3608,9 @@ static int io_recv(struct io_kiocb *req, bool force_nonblock) ret = -EINTR; }
- io_cqring_add_event(req, ret); + kfree(kbuf); + req->flags &= ~REQ_F_NEED_CLEANUP; + __io_cqring_add_event(req, ret, cflags); if (ret < 0) req_set_fail_links(req); io_put_req(req); @@ -4519,6 +4650,9 @@ static void io_cleanup_req(struct io_kiocb *req) case IORING_OP_READV: case IORING_OP_READ_FIXED: case IORING_OP_READ: + if (req->flags & REQ_F_BUFFER_SELECTED) + kfree((void *)(unsigned long)req->rw.addr); + /* fallthrough */ case IORING_OP_WRITEV: case IORING_OP_WRITE_FIXED: case IORING_OP_WRITE: @@ -4530,6 +4664,10 @@ static void io_cleanup_req(struct io_kiocb *req) if (io->msg.iov != io->msg.fast_iov) kfree(io->msg.iov); break; + case IORING_OP_RECV: + if (req->flags & REQ_F_BUFFER_SELECTED) + kfree(req->sr_msg.kbuf); + break; case IORING_OP_OPENAT: case IORING_OP_STATX: putname(req->open.filename); @@ -5098,7 +5236,8 @@ static inline void io_queue_link_head(struct io_kiocb *req) }
#define SQE_VALID_FLAGS (IOSQE_FIXED_FILE|IOSQE_IO_DRAIN|IOSQE_IO_LINK| \ - IOSQE_IO_HARDLINK | IOSQE_ASYNC) + IOSQE_IO_HARDLINK | IOSQE_ASYNC | \ + IOSQE_BUFFER_SELECT)
static bool io_submit_sqe(struct io_kiocb *req, const struct io_uring_sqe *sqe, struct io_submit_state *state, struct io_kiocb **link) @@ -5115,6 +5254,12 @@ static bool io_submit_sqe(struct io_kiocb *req, const struct io_uring_sqe *sqe, goto err_req; }
+ if ((sqe_flags & IOSQE_BUFFER_SELECT) && + !io_op_defs[req->opcode].buffer_select) { + ret = -EOPNOTSUPP; + goto err_req; + } + id = READ_ONCE(sqe->personality); if (id) { req->work.creds = idr_find(&ctx->personality_idr, id); @@ -5127,7 +5272,8 @@ static bool io_submit_sqe(struct io_kiocb *req, const struct io_uring_sqe *sqe,
/* same numerical values with corresponding REQ_F_*, safe to copy */ req->flags |= sqe_flags & (IOSQE_IO_DRAIN | IOSQE_IO_HARDLINK | - IOSQE_ASYNC | IOSQE_FIXED_FILE); + IOSQE_ASYNC | IOSQE_FIXED_FILE | + IOSQE_BUFFER_SELECT);
ret = io_req_set_file(state, req, sqe); if (unlikely(ret)) { diff --git a/include/uapi/linux/io_uring.h b/include/uapi/linux/io_uring.h index 5a3c5dd07e82..28a85bdff505 100644 --- a/include/uapi/linux/io_uring.h +++ b/include/uapi/linux/io_uring.h @@ -66,6 +66,7 @@ enum { IOSQE_IO_LINK_BIT, IOSQE_IO_HARDLINK_BIT, IOSQE_ASYNC_BIT, + IOSQE_BUFFER_SELECT_BIT, };
/* @@ -81,6 +82,8 @@ enum { #define IOSQE_IO_HARDLINK (1U << IOSQE_IO_HARDLINK_BIT) /* always go async */ #define IOSQE_ASYNC (1U << IOSQE_ASYNC_BIT) +/* select buffer from sqe->buf_group */ +#define IOSQE_BUFFER_SELECT (1U << IOSQE_BUFFER_SELECT_BIT)
/* * io_uring_setup() flags @@ -154,6 +157,17 @@ struct io_uring_cqe { __u32 flags; };
+/* + * cqe->flags + * + * IORING_CQE_F_BUFFER If set, the upper 16 bits are the buffer ID + */ +#define IORING_CQE_F_BUFFER (1U << 0) + +enum { + IORING_CQE_BUFFER_SHIFT = 16, +}; + /* * Magic offsets for the application to mmap the data it needs */
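For illustration only, here is a minimal consumer-side sketch of decoding these flags; the helper name cqe_buffer_id() is hypothetical and not part of the patch. When IORING_CQE_F_BUFFER is set, the ID of the buffer that serviced the request sits in the upper 16 bits of cqe->flags:

/* hypothetical helper: recover the provided-buffer ID from a CQE */
static inline int cqe_buffer_id(const struct io_uring_cqe *cqe)
{
	if (!(cqe->flags & IORING_CQE_F_BUFFER))
		return -1;	/* no buffer was selected for this request */
	return cqe->flags >> IORING_CQE_BUFFER_SHIFT;
}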
From: Jens Axboe axboe@kernel.dk
mainline inclusion from mainline-5.7-rc1 commit 4d954c258a0c365a85a2d1b1cccf63aec38fca4c category: feature bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27 CVE: NA ---------------------------
This adds support for buffer selection with the vectored read, IORING_OP_READV. This is limited to supporting just 1 segment in the iov, and is provided just for convenience for applications that already use IORING_OP_READV.
The iov helpers will be used for IORING_OP_RECVMSG as well.
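As a rough userspace sketch (raw SQE fields only; fd, bgid and the ring setup are assumed to exist already), a buffer-selected READV passes one iovec whose base is ignored and whose length caps the read, and names a buffer group instead of a buffer:

	struct iovec iov = { .iov_base = NULL, .iov_len = 4096 };
	struct io_uring_sqe sqe = {0};

	sqe.opcode	= IORING_OP_READV;
	sqe.fd		= fd;
	sqe.addr	= (unsigned long) &iov;	/* exactly 1 segment */
	sqe.len		= 1;
	sqe.buf_group	= bgid;	/* group filled via IORING_OP_PROVIDE_BUFFERS */
	sqe.flags	= IOSQE_BUFFER_SELECT;

On completion, IORING_CQE_F_BUFFER in cqe->flags indicates which buffer ID serviced the read.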
Signed-off-by: Jens Axboe axboe@kernel.dk Signed-off-by: yangerkun yangerkun@huawei.com Reviewed-by: zhangyi (F) yi.zhang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- fs/io_uring.c | 111 +++++++++++++++++++++++++++++++++++++++++++------- 1 file changed, 97 insertions(+), 14 deletions(-)
diff --git a/fs/io_uring.c b/fs/io_uring.c index afd71ea5c918..a1111cc25bac 100644 --- a/fs/io_uring.c +++ b/fs/io_uring.c @@ -685,6 +685,7 @@ static const struct io_op_def io_op_defs[] = { .needs_file = 1, .unbound_nonreg_file = 1, .pollin = 1, + .buffer_select = 1, }, [IORING_OP_WRITEV] = { .async_ctx = 1, @@ -1683,9 +1684,10 @@ static inline bool io_req_multi_free(struct req_batch *rb, struct io_kiocb *req)
static int io_put_kbuf(struct io_kiocb *req) { - struct io_buffer *kbuf = (struct io_buffer *) req->rw.addr; + struct io_buffer *kbuf; int cflags;
+ kbuf = (struct io_buffer *) (unsigned long) req->rw.addr; cflags = kbuf->bid << IORING_CQE_BUFFER_SHIFT; cflags |= IORING_CQE_F_BUFFER; req->rw.addr = 0; @@ -2239,12 +2241,95 @@ static struct io_buffer *io_buffer_select(struct io_kiocb *req, size_t *len, return kbuf; }
+static void __user *io_rw_buffer_select(struct io_kiocb *req, size_t *len, + bool needs_lock) +{ + struct io_buffer *kbuf; + int bgid; + + kbuf = (struct io_buffer *) (unsigned long) req->rw.addr; + bgid = (int) (unsigned long) req->rw.kiocb.private; + kbuf = io_buffer_select(req, len, bgid, kbuf, needs_lock); + if (IS_ERR(kbuf)) + return kbuf; + req->rw.addr = (u64) (unsigned long) kbuf; + req->flags |= REQ_F_BUFFER_SELECTED; + return u64_to_user_ptr(kbuf->addr); +} + +#ifdef CONFIG_COMPAT +static ssize_t io_compat_import(struct io_kiocb *req, struct iovec *iov, + bool needs_lock) +{ + struct compat_iovec __user *uiov; + compat_ssize_t clen; + void __user *buf; + ssize_t len; + + uiov = u64_to_user_ptr(req->rw.addr); + if (!access_ok(uiov, sizeof(*uiov))) + return -EFAULT; + if (__get_user(clen, &uiov->iov_len)) + return -EFAULT; + if (clen < 0) + return -EINVAL; + + len = clen; + buf = io_rw_buffer_select(req, &len, needs_lock); + if (IS_ERR(buf)) + return PTR_ERR(buf); + iov[0].iov_base = buf; + iov[0].iov_len = (compat_size_t) len; + return 0; +} +#endif + +static ssize_t __io_iov_buffer_select(struct io_kiocb *req, struct iovec *iov, + bool needs_lock) +{ + struct iovec __user *uiov = u64_to_user_ptr(req->rw.addr); + void __user *buf; + ssize_t len; + + if (copy_from_user(iov, uiov, sizeof(*uiov))) + return -EFAULT; + + len = iov[0].iov_len; + if (len < 0) + return -EINVAL; + buf = io_rw_buffer_select(req, &len, needs_lock); + if (IS_ERR(buf)) + return PTR_ERR(buf); + iov[0].iov_base = buf; + iov[0].iov_len = len; + return 0; +} + +static ssize_t io_iov_buffer_select(struct io_kiocb *req, struct iovec *iov, + bool needs_lock) +{ + if (req->flags & REQ_F_BUFFER_SELECTED) + return 0; + if (!req->rw.len) + return 0; + else if (req->rw.len > 1) + return -EINVAL; + +#ifdef CONFIG_COMPAT + if (req->ctx->compat) + return io_compat_import(req, iov, needs_lock); +#endif + + return __io_iov_buffer_select(req, iov, needs_lock); +} + static ssize_t io_import_iovec(int rw, struct io_kiocb *req, struct iovec **iovec, struct iov_iter *iter, bool needs_lock) { void __user *buf = u64_to_user_ptr(req->rw.addr); size_t sqe_len = req->rw.len; + ssize_t ret; u8 opcode;
opcode = req->opcode; @@ -2258,22 +2343,12 @@ static ssize_t io_import_iovec(int rw, struct io_kiocb *req, return -EINVAL;
if (opcode == IORING_OP_READ || opcode == IORING_OP_WRITE) { - ssize_t ret; - if (req->flags & REQ_F_BUFFER_SELECT) { - struct io_buffer *kbuf = (struct io_buffer *) req->rw.addr; - int bgid; - - bgid = (int) (unsigned long) req->rw.kiocb.private; - kbuf = io_buffer_select(req, &sqe_len, bgid, kbuf, - needs_lock); - if (IS_ERR(kbuf)) { + buf = io_rw_buffer_select(req, &sqe_len, needs_lock); + if (IS_ERR(buf)) { *iovec = NULL; - return PTR_ERR(kbuf); + return PTR_ERR(buf); } - req->rw.addr = (u64) kbuf; - req->flags |= REQ_F_BUFFER_SELECTED; - buf = u64_to_user_ptr(kbuf->addr); }
ret = import_single_range(rw, buf, sqe_len, *iovec, iter); @@ -2291,6 +2366,14 @@ static ssize_t io_import_iovec(int rw, struct io_kiocb *req, return iorw->size; }
+ if (req->flags & REQ_F_BUFFER_SELECT) { + ret = io_iov_buffer_select(req, *iovec, needs_lock); + if (!ret) + iov_iter_init(iter, rw, *iovec, 1, (*iovec)->iov_len); + *iovec = NULL; + return ret; + } + #ifdef CONFIG_COMPAT if (req->ctx->compat) return compat_import_iovec(rw, buf, sqe_len, UIO_FASTIOV,
From: Jens Axboe axboe@kernel.dk
mainline inclusion from mainline-5.7-rc1 commit 0a384abfae66651b28e4bbe16883b1ff046ba3b3 category: feature bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27 CVE: NA ---------------------------
This splits it into two parts, one that imports the message, and one that imports the iovec. This allows a caller to only do the first part, and import the iovec manually afterwards.
No functional changes in this patch.
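In-kernel, a caller can now perform the two steps separately; a condensed sketch of the intended usage (error paths trimmed, mirroring how io_uring consumes it):

	struct msghdr kmsg;
	struct sockaddr __user *uaddr;
	struct iovec __user *uiov;
	struct iovec fast_iov[UIO_FASTIOV];
	struct iovec *iov = fast_iov;
	size_t nsegs;
	int ret;

	/* step 1: copy the msghdr shell, leaving the iovec alone */
	ret = __copy_msghdr_from_user(&kmsg, umsg, &uaddr, &uiov, &nsegs);
	if (ret)
		return ret;

	/* step 2: import the iovec whenever the caller is ready */
	ret = import_iovec(READ, uiov, nsegs, UIO_FASTIOV, &iov,
			   &kmsg.msg_iter);
	return ret < 0 ? ret : 0;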
Acked-by: David Miller davem@davemloft.net Signed-off-by: Jens Axboe axboe@kernel.dk Signed-off-by: yangerkun yangerkun@huawei.com Reviewed-by: zhangyi (F) yi.zhang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- include/linux/socket.h | 4 ++++ include/net/compat.h | 3 +++ net/compat.c | 30 +++++++++++++++++++++++------- net/socket.c | 25 +++++++++++++++++++++---- 4 files changed, 51 insertions(+), 11 deletions(-)
diff --git a/include/linux/socket.h b/include/linux/socket.h index 97f2a929b2bf..05c87e849a87 100644 --- a/include/linux/socket.h +++ b/include/linux/socket.h @@ -377,6 +377,10 @@ extern int recvmsg_copy_msghdr(struct msghdr *msg, struct user_msghdr __user *umsg, unsigned flags, struct sockaddr __user **uaddr, struct iovec **iov); +extern int __copy_msghdr_from_user(struct msghdr *kmsg, + struct user_msghdr __user *umsg, + struct sockaddr __user **save_addr, + struct iovec __user **uiov, size_t *nsegs);
/* helpers which do the actual work for syscalls */ extern int __sys_recvfrom(int fd, void __user *ubuf, size_t size, diff --git a/include/net/compat.h b/include/net/compat.h index 4c6d75612b6c..2f861518cc89 100644 --- a/include/net/compat.h +++ b/include/net/compat.h @@ -41,6 +41,9 @@ int compat_sock_get_timestampns(struct sock *, struct timespec __user *); #define compat_mmsghdr mmsghdr #endif /* defined(CONFIG_COMPAT) */
+int __get_compat_msghdr(struct msghdr *kmsg, struct compat_msghdr __user *umsg, + struct sockaddr __user **save_addr, compat_uptr_t *ptr, + compat_size_t *len); int get_compat_msghdr(struct msghdr *, struct compat_msghdr __user *, struct sockaddr __user **, struct iovec **); struct sock_fprog __user *get_compat_bpf_fprog(char __user *optval); diff --git a/net/compat.c b/net/compat.c index 2582a9223d80..42afe8f45ff8 100644 --- a/net/compat.c +++ b/net/compat.c @@ -32,10 +32,10 @@ #include <linux/uaccess.h> #include <net/compat.h>
-int get_compat_msghdr(struct msghdr *kmsg, - struct compat_msghdr __user *umsg, - struct sockaddr __user **save_addr, - struct iovec **iov) +int __get_compat_msghdr(struct msghdr *kmsg, + struct compat_msghdr __user *umsg, + struct sockaddr __user **save_addr, + compat_uptr_t *ptr, compat_size_t *len) { struct compat_msghdr msg; ssize_t err; @@ -78,10 +78,26 @@ int get_compat_msghdr(struct msghdr *kmsg, return -EMSGSIZE;
kmsg->msg_iocb = NULL; + *ptr = msg.msg_iov; + *len = msg.msg_iovlen; + return 0; +} + +int get_compat_msghdr(struct msghdr *kmsg, + struct compat_msghdr __user *umsg, + struct sockaddr __user **save_addr, + struct iovec **iov) +{ + compat_uptr_t ptr; + compat_size_t len; + ssize_t err; + + err = __get_compat_msghdr(kmsg, umsg, save_addr, &ptr, &len); + if (err) + return err;
- err = compat_import_iovec(save_addr ? READ : WRITE, - compat_ptr(msg.msg_iov), msg.msg_iovlen, - UIO_FASTIOV, iov, &kmsg->msg_iter); + err = compat_import_iovec(save_addr ? READ : WRITE, compat_ptr(ptr), + len, UIO_FASTIOV, iov, &kmsg->msg_iter); return err < 0 ? err : 0; }
diff --git a/net/socket.c b/net/socket.c index 06c544fafa63..50403ebdd8f6 100644 --- a/net/socket.c +++ b/net/socket.c @@ -2018,10 +2018,10 @@ struct used_address { unsigned int name_len; };
-static int copy_msghdr_from_user(struct msghdr *kmsg, - struct user_msghdr __user *umsg, - struct sockaddr __user **save_addr, - struct iovec **iov) +int __copy_msghdr_from_user(struct msghdr *kmsg, + struct user_msghdr __user *umsg, + struct sockaddr __user **save_addr, + struct iovec __user **uiov, size_t *nsegs) { struct user_msghdr msg; ssize_t err; @@ -2063,6 +2063,23 @@ static int copy_msghdr_from_user(struct msghdr *kmsg, return -EMSGSIZE;
kmsg->msg_iocb = NULL; + *uiov = msg.msg_iov; + *nsegs = msg.msg_iovlen; + return 0; +} + +static int copy_msghdr_from_user(struct msghdr *kmsg, + struct user_msghdr __user *umsg, + struct sockaddr __user **save_addr, + struct iovec **iov) +{ + struct user_msghdr msg; + ssize_t err; + + err = __copy_msghdr_from_user(kmsg, umsg, save_addr, &msg.msg_iov, + &msg.msg_iovlen); + if (err) + return err;
err = import_iovec(save_addr ? READ : WRITE, msg.msg_iov, msg.msg_iovlen,
From: Jens Axboe axboe@kernel.dk
mainline inclusion from mainline-5.7-rc1 commit 52de1fe122408d7a62b6cff9ed3895ebb882d71f category: feature bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27 CVE: NA ---------------------------
This adds support for buffer selection with IORING_OP_RECVMSG. Like IORING_OP_READV, this is limited to supporting just a single segment in the iovec passed in.
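A rough userspace sketch of the wire-up (raw SQE fields; sockfd and bgid are assumed to be set up elsewhere): the msghdr carries exactly one iovec, its base is ignored, and the group ID goes in sqe->buf_group:

	struct iovec iov = { .iov_base = NULL, .iov_len = 4096 };
	struct msghdr msg = {0};
	struct io_uring_sqe sqe = {0};

	msg.msg_iov	= &iov;
	msg.msg_iovlen	= 1;		/* more than 1 segment is -EINVAL */

	sqe.opcode	= IORING_OP_RECVMSG;
	sqe.fd		= sockfd;
	sqe.addr	= (unsigned long) &msg;
	sqe.buf_group	= bgid;
	sqe.flags	= IOSQE_BUFFER_SELECT;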
Signed-off-by: Jens Axboe axboe@kernel.dk
Modified: include/net/compat.h [move __get_compat_msghdr inside CONFIG_COMPAT] Signed-off-by: yangerkun yangerkun@huawei.com Reviewed-by: zhangyi (F) yi.zhang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- fs/io_uring.c | 118 ++++++++++++++++++++++++++++++++++++++----- include/net/compat.h | 6 +-- 2 files changed, 109 insertions(+), 15 deletions(-)
diff --git a/fs/io_uring.c b/fs/io_uring.c index a1111cc25bac..97ddc9fbf625 100644 --- a/fs/io_uring.c +++ b/fs/io_uring.c @@ -44,6 +44,7 @@ #include <linux/errno.h> #include <linux/syscalls.h> #include <linux/compat.h> +#include <net/compat.h> #include <linux/refcount.h> #include <linux/uio.h> #include <linux/bits.h> @@ -732,6 +733,7 @@ static const struct io_op_def io_op_defs[] = { .unbound_nonreg_file = 1, .needs_fs = 1, .pollin = 1, + .buffer_select = 1, }, [IORING_OP_TIMEOUT] = { .async_ctx = 1, @@ -3520,6 +3522,92 @@ static int io_send(struct io_kiocb *req, bool force_nonblock) #endif }
+static int __io_recvmsg_copy_hdr(struct io_kiocb *req, struct io_async_ctx *io) +{ + struct io_sr_msg *sr = &req->sr_msg; + struct iovec __user *uiov; + size_t iov_len; + int ret; + + ret = __copy_msghdr_from_user(&io->msg.msg, sr->msg, &io->msg.uaddr, + &uiov, &iov_len); + if (ret) + return ret; + + if (req->flags & REQ_F_BUFFER_SELECT) { + if (iov_len > 1) + return -EINVAL; + if (copy_from_user(io->msg.iov, uiov, sizeof(*uiov))) + return -EFAULT; + sr->len = io->msg.iov[0].iov_len; + iov_iter_init(&io->msg.msg.msg_iter, READ, io->msg.iov, 1, + sr->len); + io->msg.iov = NULL; + } else { + ret = import_iovec(READ, uiov, iov_len, UIO_FASTIOV, + &io->msg.iov, &io->msg.msg.msg_iter); + if (ret > 0) + ret = 0; + } + + return ret; +} + +#ifdef CONFIG_COMPAT +static int __io_compat_recvmsg_copy_hdr(struct io_kiocb *req, + struct io_async_ctx *io) +{ + struct compat_msghdr __user *msg_compat; + struct io_sr_msg *sr = &req->sr_msg; + struct compat_iovec __user *uiov; + compat_uptr_t ptr; + compat_size_t len; + int ret; + + msg_compat = (struct compat_msghdr __user *) sr->msg; + ret = __get_compat_msghdr(&io->msg.msg, msg_compat, &io->msg.uaddr, + &ptr, &len); + if (ret) + return ret; + + uiov = compat_ptr(ptr); + if (req->flags & REQ_F_BUFFER_SELECT) { + compat_ssize_t clen; + + if (len > 1) + return -EINVAL; + if (!access_ok(uiov, sizeof(*uiov))) + return -EFAULT; + if (__get_user(clen, &uiov->iov_len)) + return -EFAULT; + if (clen < 0) + return -EINVAL; + sr->len = io->msg.iov[0].iov_len; + io->msg.iov = NULL; + } else { + ret = compat_import_iovec(READ, uiov, len, UIO_FASTIOV, + &io->msg.iov, + &io->msg.msg.msg_iter); + if (ret < 0) + return ret; + } + + return 0; +} +#endif + +static int io_recvmsg_copy_hdr(struct io_kiocb *req, struct io_async_ctx *io) +{ + io->msg.iov = io->msg.fast_iov; + +#ifdef CONFIG_COMPAT + if (req->ctx->compat) + return __io_compat_recvmsg_copy_hdr(req, io); +#endif + + return __io_recvmsg_copy_hdr(req, io); +} + static struct io_buffer *io_recv_buffer_select(struct io_kiocb *req, int *cflags, bool needs_lock) { @@ -3565,9 +3653,7 @@ static int io_recvmsg_prep(struct io_kiocb *req, if (req->flags & REQ_F_NEED_CLEANUP) return 0;
- io->msg.iov = io->msg.fast_iov; - ret = recvmsg_copy_msghdr(&io->msg.msg, sr->msg, sr->msg_flags, - &io->msg.uaddr, &io->msg.iov); + ret = io_recvmsg_copy_hdr(req, io); if (!ret) req->flags |= REQ_F_NEED_CLEANUP; return ret; @@ -3581,13 +3667,14 @@ static int io_recvmsg(struct io_kiocb *req, bool force_nonblock) #if defined(CONFIG_NET) struct io_async_msghdr *kmsg = NULL; struct socket *sock; - int ret; + int ret, cflags = 0;
if (unlikely(req->ctx->flags & IORING_SETUP_IOPOLL)) return -EINVAL;
sock = sock_from_file(req->file, &ret); if (sock) { + struct io_buffer *kbuf; struct io_async_ctx io; unsigned flags;
@@ -3599,19 +3686,23 @@ static int io_recvmsg(struct io_kiocb *req, bool force_nonblock) kmsg->iov = kmsg->fast_iov; kmsg->msg.msg_iter.iov = kmsg->iov; } else { - struct io_sr_msg *sr = &req->sr_msg; - kmsg = &io.msg; kmsg->msg.msg_name = &io.msg.addr;
- io.msg.iov = io.msg.fast_iov; - ret = recvmsg_copy_msghdr(&io.msg.msg, sr->msg, - sr->msg_flags, &io.msg.uaddr, - &io.msg.iov); + ret = io_recvmsg_copy_hdr(req, &io); if (ret) return ret; }
+ kbuf = io_recv_buffer_select(req, &cflags, !force_nonblock); + if (IS_ERR(kbuf)) { + return PTR_ERR(kbuf); + } else if (kbuf) { + kmsg->fast_iov[0].iov_base = u64_to_user_ptr(kbuf->addr); + iov_iter_init(&kmsg->msg.msg_iter, READ, kmsg->iov, + 1, req->sr_msg.len); + } + flags = req->sr_msg.msg_flags; if (flags & MSG_DONTWAIT) req->flags |= REQ_F_NOWAIT; @@ -3629,7 +3720,7 @@ static int io_recvmsg(struct io_kiocb *req, bool force_nonblock) if (kmsg && kmsg->iov != kmsg->fast_iov) kfree(kmsg->iov); req->flags &= ~REQ_F_NEED_CLEANUP; - io_cqring_add_event(req, ret); + __io_cqring_add_event(req, ret, cflags); if (ret < 0) req_set_fail_links(req); io_put_req(req); @@ -4742,8 +4833,11 @@ static void io_cleanup_req(struct io_kiocb *req) if (io->rw.iov != io->rw.fast_iov) kfree(io->rw.iov); break; - case IORING_OP_SENDMSG: case IORING_OP_RECVMSG: + if (req->flags & REQ_F_BUFFER_SELECTED) + kfree(req->sr_msg.kbuf); + /* fallthrough */ + case IORING_OP_SENDMSG: if (io->msg.iov != io->msg.fast_iov) kfree(io->msg.iov); break; diff --git a/include/net/compat.h b/include/net/compat.h index 2f861518cc89..5db8429b5947 100644 --- a/include/net/compat.h +++ b/include/net/compat.h @@ -32,6 +32,9 @@ struct compat_cmsghdr {
int compat_sock_get_timestamp(struct sock *, struct timeval __user *); int compat_sock_get_timestampns(struct sock *, struct timespec __user *); +int __get_compat_msghdr(struct msghdr *kmsg, struct compat_msghdr __user *umsg, + struct sockaddr __user **save_addr, compat_uptr_t *ptr, + compat_size_t *len);
#else /* defined(CONFIG_COMPAT) */ /* @@ -41,9 +44,6 @@ int compat_sock_get_timestampns(struct sock *, struct timespec __user *); #define compat_mmsghdr mmsghdr #endif /* defined(CONFIG_COMPAT) */
-int __get_compat_msghdr(struct msghdr *kmsg, struct compat_msghdr __user *umsg, - struct sockaddr __user **save_addr, compat_uptr_t *ptr, - compat_size_t *len); int get_compat_msghdr(struct msghdr *, struct compat_msghdr __user *, struct sockaddr __user **, struct iovec **); struct sock_fprog __user *get_compat_bpf_fprog(char __user *optval);
From: Jens Axboe axboe@kernel.dk
mainline inclusion from mainline-5.7-rc1 commit 067524e914cb23e20d59480b318fe2625eaee7c8 category: feature bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27 CVE: NA ---------------------------
We have IORING_OP_PROVIDE_BUFFERS, but the only way to remove buffers is to trigger IO on them. The usual way to shrink a buffer pool is to simply not replenish buffers when IO completes, and instead free them. But it may be nice to have a way to manually remove a number of buffers from a given group, and IORING_OP_REMOVE_BUFFERS provides that functionality.
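The SQE encoding is visible in io_remove_buffers_prep() below: ->fd carries the number of buffers to remove and ->buf_group the group ID. A hedged userspace sketch (raw SQE; bgid is assumed to have been populated earlier):

	struct io_uring_sqe sqe = {0};

	sqe.opcode	= IORING_OP_REMOVE_BUFFERS;
	sqe.fd		= 16;		/* remove up to 16 buffers (1..USHRT_MAX) */
	sqe.buf_group	= bgid;		/* group to shrink */

	/* cqe->res: number of buffers removed, or -ENOENT for an unknown group */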
Signed-off-by: Jens Axboe axboe@kernel.dk Conflicts: fs/io_uring.c [commit cebdb98617ae ("io_uring: add support for IORING_OP_OPENAT2") is not merged] Signed-off-by: yangerkun yangerkun@huawei.com Reviewed-by: zhangyi (F) yi.zhang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- fs/io_uring.c | 102 +++++++++++++++++++++++++++------- include/uapi/linux/io_uring.h | 1 + 2 files changed, 84 insertions(+), 19 deletions(-)
diff --git a/fs/io_uring.c b/fs/io_uring.c index 97ddc9fbf625..f91e154a9a61 100644 --- a/fs/io_uring.c +++ b/fs/io_uring.c @@ -824,6 +824,7 @@ static const struct io_op_def io_op_defs[] = { .unbound_nonreg_file = 1, }, [IORING_OP_PROVIDE_BUFFERS] = {}, + [IORING_OP_REMOVE_BUFFERS] = {}, };
static void io_wq_submit_work(struct io_wq_work **workptr); @@ -2946,6 +2947,75 @@ static int io_openat(struct io_kiocb *req, bool force_nonblock) return 0; }
+static int io_remove_buffers_prep(struct io_kiocb *req, + const struct io_uring_sqe *sqe) +{ + struct io_provide_buf *p = &req->pbuf; + u64 tmp; + + if (sqe->ioprio || sqe->rw_flags || sqe->addr || sqe->len || sqe->off) + return -EINVAL; + + tmp = READ_ONCE(sqe->fd); + if (!tmp || tmp > USHRT_MAX) + return -EINVAL; + + memset(p, 0, sizeof(*p)); + p->nbufs = tmp; + p->bgid = READ_ONCE(sqe->buf_group); + return 0; +} + +static int __io_remove_buffers(struct io_ring_ctx *ctx, struct io_buffer *buf, + int bgid, unsigned nbufs) +{ + unsigned i = 0; + + /* shouldn't happen */ + if (!nbufs) + return 0; + + /* the head kbuf is the list itself */ + while (!list_empty(&buf->list)) { + struct io_buffer *nxt; + + nxt = list_first_entry(&buf->list, struct io_buffer, list); + list_del(&nxt->list); + kfree(nxt); + if (++i == nbufs) + return i; + } + i++; + kfree(buf); + idr_remove(&ctx->io_buffer_idr, bgid); + + return i; +} + +static int io_remove_buffers(struct io_kiocb *req, bool force_nonblock) +{ + struct io_provide_buf *p = &req->pbuf; + struct io_ring_ctx *ctx = req->ctx; + struct io_buffer *head; + int ret = 0; + + io_ring_submit_lock(ctx, !force_nonblock); + + lockdep_assert_held(&ctx->uring_lock); + + ret = -ENOENT; + head = idr_find(&ctx->io_buffer_idr, p->bgid); + if (head) + ret = __io_remove_buffers(ctx, head, p->bgid, p->nbufs); + + io_ring_submit_unlock(ctx, !force_nonblock); + if (ret < 0) + req_set_fail_links(req); + io_cqring_add_event(req, ret); + io_put_req(req); + return 0; +} + static int io_provide_buffers_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe) { @@ -3021,15 +3091,7 @@ static int io_provide_buffers(struct io_kiocb *req, bool force_nonblock) ret = idr_alloc(&ctx->io_buffer_idr, head, p->bgid, p->bgid + 1, GFP_KERNEL); if (ret < 0) { - while (!list_empty(&head->list)) { - struct io_buffer *buf; - - buf = list_first_entry(&head->list, - struct io_buffer, list); - list_del(&buf->list); - kfree(buf); - } - kfree(head); + __io_remove_buffers(ctx, head, p->bgid, -1U); goto out; } } @@ -4778,6 +4840,9 @@ static int io_req_defer_prep(struct io_kiocb *req, case IORING_OP_PROVIDE_BUFFERS: ret = io_provide_buffers_prep(req, sqe); break; + case IORING_OP_REMOVE_BUFFERS: + ret = io_remove_buffers_prep(req, sqe); + break; default: printk_once(KERN_WARNING "io_uring: unhandled opcode %d\n", req->opcode); @@ -5064,6 +5129,14 @@ static int io_issue_sqe(struct io_kiocb *req, const struct io_uring_sqe *sqe, } ret = io_provide_buffers(req, force_nonblock); break; + case IORING_OP_REMOVE_BUFFERS: + if (sqe) { + ret = io_remove_buffers_prep(req, sqe); + if (ret) + break; + } + ret = io_remove_buffers(req, force_nonblock); + break; default: ret = -EINVAL; break; @@ -6965,16 +7038,7 @@ static int __io_destroy_buffers(int id, void *p, void *data) struct io_ring_ctx *ctx = data; struct io_buffer *buf = p;
- /* the head kbuf is the list itself */ - while (!list_empty(&buf->list)) { - struct io_buffer *nxt; - - nxt = list_first_entry(&buf->list, struct io_buffer, list); - list_del(&nxt->list); - kfree(nxt); - } - kfree(buf); - idr_remove(&ctx->io_buffer_idr, id); + __io_remove_buffers(ctx, buf, id, -1U); return 0; }
diff --git a/include/uapi/linux/io_uring.h b/include/uapi/linux/io_uring.h index 28a85bdff505..b8c6c1a9cbb4 100644 --- a/include/uapi/linux/io_uring.h +++ b/include/uapi/linux/io_uring.h @@ -127,6 +127,7 @@ enum { IORING_OP_EPOLL_CTL, IORING_OP_SPLICE, IORING_OP_PROVIDE_BUFFERS, + IORING_OP_REMOVE_BUFFERS,
/* this goes last, obviously */ IORING_OP_LAST,
From: Jens Axboe axboe@kernel.dk
mainline inclusion from mainline-5.7-rc1 commit 84557871f2ff332edd445d70349c8724c313c683 category: feature bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27 CVE: NA ---------------------------
Not easy to tell if we're going over the size of bits we can shove in req->flags, so add an end-of-bits marker and a BUILD_BUG_ON() check for it.
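The same sentinel pattern works for any flags enum; a minimal standalone sketch (the names are illustrative, not from the patch):

enum {
	MY_FLAG_A_BIT,
	MY_FLAG_B_BIT,

	/* not a real bit, just marks the end of the enum */
	__MY_FLAG_LAST_BIT,
};

static int __init my_module_init(void)
{
	/* flags live in an unsigned int; fail the build if we outgrow it */
	BUILD_BUG_ON(__MY_FLAG_LAST_BIT >= 8 * sizeof(unsigned int));
	return 0;
}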
Signed-off-by: Jens Axboe axboe@kernel.dk Signed-off-by: yangerkun yangerkun@huawei.com Reviewed-by: zhangyi (F) yi.zhang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- fs/io_uring.c | 4 ++++ 1 file changed, 4 insertions(+)
diff --git a/fs/io_uring.c b/fs/io_uring.c index f91e154a9a61..f32a430b2729 100644 --- a/fs/io_uring.c +++ b/fs/io_uring.c @@ -514,6 +514,9 @@ enum { REQ_F_OVERFLOW_BIT, REQ_F_POLLED_BIT, REQ_F_BUFFER_SELECTED_BIT, + + /* not a real bit, just to check we're not overflowing the space */ + __REQ_F_LAST_BIT, };
enum { @@ -7965,6 +7968,7 @@ static int __init io_uring_init(void) BUILD_BUG_SQE_ELEM(44, __s32, splice_fd_in);
BUILD_BUG_ON(ARRAY_SIZE(io_op_defs) != IORING_OP_LAST); + BUILD_BUG_ON(__REQ_F_LAST_BIT >= 8 * sizeof(int)); req_cachep = KMEM_CACHE(io_kiocb, SLAB_HWCACHE_ALIGN | SLAB_PANIC); return 0; };
From: YueHaibing yuehaibing@huawei.com
mainline inclusion from mainline-5.7-rc1 commit 469956e853ccdba72bb82ad2eea6e8ab6b15791f category: feature bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27 CVE: NA ---------------------------
If CONFIG_NET is not set, gcc warns:
fs/io_uring.c:3110:12: warning: io_setup_async_msg defined but not used [-Wunused-function]
 static int io_setup_async_msg(struct io_kiocb *req,
            ^~~~~~~~~~~~~~~~~~
There are many functions wrapped by CONFIG_NET; move them together to simplify the code, which also fixes this warning.
Reported-by: Hulk Robot hulkci@huawei.com Signed-off-by: YueHaibing yuehaibing@huawei.com
Minor tweaks.
Signed-off-by: Jens Axboe axboe@kernel.dk Signed-off-by: yangerkun yangerkun@huawei.com Reviewed-by: zhangyi (F) yi.zhang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- fs/io_uring.c | 94 ++++++++++++++++++++++++++++----------------------- 1 file changed, 52 insertions(+), 42 deletions(-)
diff --git a/fs/io_uring.c b/fs/io_uring.c index f32a430b2729..68b20cfe855e 100644 --- a/fs/io_uring.c +++ b/fs/io_uring.c @@ -3428,6 +3428,7 @@ static int io_sync_file_range(struct io_kiocb *req, bool force_nonblock) return 0; }
+#if defined(CONFIG_NET) static int io_setup_async_msg(struct io_kiocb *req, struct io_async_msghdr *kmsg) { @@ -3445,7 +3446,6 @@ static int io_setup_async_msg(struct io_kiocb *req,
static int io_sendmsg_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe) { -#if defined(CONFIG_NET) struct io_sr_msg *sr = &req->sr_msg; struct io_async_ctx *io = req->io; int ret; @@ -3471,14 +3471,10 @@ static int io_sendmsg_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe) if (!ret) req->flags |= REQ_F_NEED_CLEANUP; return ret; -#else - return -EOPNOTSUPP; -#endif }
static int io_sendmsg(struct io_kiocb *req, bool force_nonblock) { -#if defined(CONFIG_NET) struct io_async_msghdr *kmsg = NULL; struct socket *sock; int ret; @@ -3532,14 +3528,10 @@ static int io_sendmsg(struct io_kiocb *req, bool force_nonblock) req_set_fail_links(req); io_put_req(req); return 0; -#else - return -EOPNOTSUPP; -#endif }
static int io_send(struct io_kiocb *req, bool force_nonblock) { -#if defined(CONFIG_NET) struct socket *sock; int ret;
@@ -3582,9 +3574,6 @@ static int io_send(struct io_kiocb *req, bool force_nonblock) req_set_fail_links(req); io_put_req(req); return 0; -#else - return -EOPNOTSUPP; -#endif }
static int __io_recvmsg_copy_hdr(struct io_kiocb *req, struct io_async_ctx *io) @@ -3697,7 +3686,6 @@ static struct io_buffer *io_recv_buffer_select(struct io_kiocb *req, static int io_recvmsg_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe) { -#if defined(CONFIG_NET) struct io_sr_msg *sr = &req->sr_msg; struct io_async_ctx *io = req->io; int ret; @@ -3722,14 +3710,10 @@ static int io_recvmsg_prep(struct io_kiocb *req, if (!ret) req->flags |= REQ_F_NEED_CLEANUP; return ret; -#else - return -EOPNOTSUPP; -#endif }
static int io_recvmsg(struct io_kiocb *req, bool force_nonblock) { -#if defined(CONFIG_NET) struct io_async_msghdr *kmsg = NULL; struct socket *sock; int ret, cflags = 0; @@ -3790,14 +3774,10 @@ static int io_recvmsg(struct io_kiocb *req, bool force_nonblock) req_set_fail_links(req); io_put_req(req); return 0; -#else - return -EOPNOTSUPP; -#endif }
static int io_recv(struct io_kiocb *req, bool force_nonblock) { -#if defined(CONFIG_NET) struct io_buffer *kbuf = NULL; struct socket *sock; int ret, cflags = 0; @@ -3854,15 +3834,10 @@ static int io_recv(struct io_kiocb *req, bool force_nonblock) req_set_fail_links(req); io_put_req(req); return 0; -#else - return -EOPNOTSUPP; -#endif }
- static int io_accept_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe) { -#if defined(CONFIG_NET) struct io_accept *accept = &req->accept;
if (unlikely(req->ctx->flags & (IORING_SETUP_IOPOLL|IORING_SETUP_SQPOLL))) @@ -3875,12 +3850,8 @@ static int io_accept_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe) accept->flags = READ_ONCE(sqe->accept_flags); accept->nofile = rlimit(RLIMIT_NOFILE); return 0; -#else - return -EOPNOTSUPP; -#endif }
-#if defined(CONFIG_NET) static int __io_accept(struct io_kiocb *req, bool force_nonblock) { struct io_accept *accept = &req->accept; @@ -3911,11 +3882,9 @@ static void io_accept_finish(struct io_wq_work **workptr) __io_accept(req, false); io_steal_work(req, workptr); } -#endif
static int io_accept(struct io_kiocb *req, bool force_nonblock) { -#if defined(CONFIG_NET) int ret;
ret = __io_accept(req, force_nonblock); @@ -3924,14 +3893,10 @@ static int io_accept(struct io_kiocb *req, bool force_nonblock) return -EAGAIN; } return 0; -#else - return -EOPNOTSUPP; -#endif }
static int io_connect_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe) { -#if defined(CONFIG_NET) struct io_connect *conn = &req->connect; struct io_async_ctx *io = req->io;
@@ -3948,14 +3913,10 @@ static int io_connect_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe)
return move_addr_to_kernel(conn->addr, conn->addr_len, &io->connect.address); -#else - return -EOPNOTSUPP; -#endif }
static int io_connect(struct io_kiocb *req, bool force_nonblock) { -#if defined(CONFIG_NET) struct io_async_ctx __io, *io; unsigned file_flags; int ret; @@ -3993,10 +3954,59 @@ static int io_connect(struct io_kiocb *req, bool force_nonblock) io_cqring_add_event(req, ret); io_put_req(req); return 0; -#else +} +#else /* !CONFIG_NET */ +static int io_sendmsg_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe) +{ + return -EOPNOTSUPP; +} + +static int io_sendmsg(struct io_kiocb *req, bool force_nonblock) +{ + return -EOPNOTSUPP; +} + +static int io_send(struct io_kiocb *req, bool force_nonblock) +{ + return -EOPNOTSUPP; +} + +static int io_recvmsg_prep(struct io_kiocb *req, + const struct io_uring_sqe *sqe) +{ + return -EOPNOTSUPP; +} + +static int io_recvmsg(struct io_kiocb *req, bool force_nonblock) +{ + return -EOPNOTSUPP; +} + +static int io_recv(struct io_kiocb *req, bool force_nonblock) +{ + return -EOPNOTSUPP; +} + +static int io_accept_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe) +{ + return -EOPNOTSUPP; +} + +static int io_accept(struct io_kiocb *req, bool force_nonblock) +{ + return -EOPNOTSUPP; +} + +static int io_connect_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe) +{ + return -EOPNOTSUPP; +} + +static int io_connect(struct io_kiocb *req, bool force_nonblock) +{ return -EOPNOTSUPP; -#endif } +#endif /* CONFIG_NET */
struct io_poll_table { struct poll_table_struct pt;
From: Xiaoguang Wang xiaoguang.wang@linux.alibaba.com
mainline inclusion from mainline-5.7-rc1 commit 32b2244a840a90ea94ba42392de5c48d53f521f5 category: feature bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27 CVE: NA ---------------------------
When SETUP_IOPOLL and SETUP_SQPOLL are both enabled, applications don't need to poll for io completion events themselves; they can rely on io_sq_thread to do that polling work, which reduces cpu usage and uring_lock contention.
I modified the fio io_uring engine code a bit to evaluate the performance:

static int fio_ioring_getevents(struct thread_data *td, unsigned int min,
...
			continue;
		}

-		if (!o->sqpoll_thread) {
+		if (o->sqpoll_thread && o->hipri) {
			r = io_uring_enter(ld, 0, actual_min,
					   IORING_ENTER_GETEVENTS);
			if (r < 0) {
and use "fio -name=fiotest -filename=/dev/nvme0n1 -iodepth=$depth -thread -rw=read -ioengine=io_uring -hipri=1 -sqthread_poll=1 -direct=1 -bs=4k -size=10G -numjobs=1 -time_based -runtime=120"
original codes
--------------------------------------------------------------------
iodepth       | 4        | 8        | 16       | 32       | 64
bw            | 1133MB/s | 1519MB/s | 2090MB/s | 2710MB/s | 3012MB/s
fio cpu usage | 100%     | 100%     | 100%     | 100%     | 100%
--------------------------------------------------------------------
with patch
--------------------------------------------------------------------
iodepth       | 4        | 8        | 16       | 32       | 64
bw            | 1196MB/s | 1721MB/s | 2351MB/s | 2977MB/s | 3357MB/s
fio cpu usage | 63.8%    | 74.4%    | 81.1%    | 83.7%    | 82.4%
--------------------------------------------------------------------
bw improve    | 5.5%     | 13.2%    | 12.3%    | 9.8%     | 11.5%
--------------------------------------------------------------------
From the above test results, we can see that bandwidth improves by about 5.5%~13%, and the fio process's cpu usage also drops considerably. Note this won't improve io_sq_thread's cpu usage when SETUP_IOPOLL|SETUP_SQPOLL are both enabled; in that case, io_sq_thread always has 100% cpu usage. I think this patch will be friendly to applications which often use io_uring_wait_cqe() or similar from liburing.
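On the application side this is transparent; a rough sketch of the setup (via the io_uring_setup(2) syscall, with a syscall wrapper, queue depth QD and the mmap/submission plumbing all assumed):

	struct io_uring_params p = {0};

	p.flags = IORING_SETUP_IOPOLL | IORING_SETUP_SQPOLL;
	p.sq_thread_idle = 2000;	/* ms before the sq thread idles */

	ring_fd = io_uring_setup(QD, &p);
	/*
	 * Submit polled reads/writes as usual. With this patch the app
	 * no longer needs its own IORING_ENTER_GETEVENTS reaping loop:
	 * io_sq_thread polls completions and the app just waits on CQEs.
	 */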
Signed-off-by: Xiaoguang Wang xiaoguang.wang@linux.alibaba.com Signed-off-by: Jens Axboe axboe@kernel.dk Signed-off-by: yangerkun yangerkun@huawei.com Reviewed-by: zhangyi (F) yi.zhang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- fs/io_uring.c | 11 ++++++++++- 1 file changed, 10 insertions(+), 1 deletion(-)
diff --git a/fs/io_uring.c b/fs/io_uring.c index 68b20cfe855e..46fd2f417edf 100644 --- a/fs/io_uring.c +++ b/fs/io_uring.c @@ -1729,6 +1729,8 @@ static void io_iopoll_complete(struct io_ring_ctx *ctx, unsigned int *nr_events, }
io_commit_cqring(ctx); + if (ctx->flags & IORING_SETUP_SQPOLL) + io_cqring_ev_posted(ctx); io_free_req_many(ctx, &rb); }
@@ -7375,7 +7377,14 @@ SYSCALL_DEFINE6(io_uring_enter, unsigned int, fd, u32, to_submit,
min_complete = min(min_complete, ctx->cq_entries);
- if (ctx->flags & IORING_SETUP_IOPOLL) { + /* + * When SETUP_IOPOLL and SETUP_SQPOLL are both enabled, user + * space applications don't need to do io completion events + * polling again, they can rely on io_sq_thread to do polling + * work, which can reduce cpu usage and uring_lock contention. + */ + if (ctx->flags & IORING_SETUP_IOPOLL && + !(ctx->flags & IORING_SETUP_SQPOLL)) { ret = io_iopoll_check(ctx, &nr_events, min_complete); } else { ret = io_cqring_wait(ctx, min_complete, sig, sigsz);
From: Jens Axboe axboe@kernel.dk
mainline inclusion from mainline-5.7-rc1 commit bbbdeb4720a0759ec90e3bcb20ad28d19e531346 category: feature bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27 CVE: NA ---------------------------
This just syncs the header with the liburing version, so there's no confusion about the license of the header parts.
Signed-off-by: Jens Axboe axboe@kernel.dk Signed-off-by: yangerkun yangerkun@huawei.com Reviewed-by: zhangyi (F) yi.zhang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- include/uapi/linux/io_uring.h | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/include/uapi/linux/io_uring.h b/include/uapi/linux/io_uring.h index b8c6c1a9cbb4..ed90e7f75f15 100644 --- a/include/uapi/linux/io_uring.h +++ b/include/uapi/linux/io_uring.h @@ -1,4 +1,4 @@ -/* SPDX-License-Identifier: GPL-2.0 WITH Linux-syscall-note */ +/* SPDX-License-Identifier: GPL-2.0 WITH Linux-syscall-note OR MIT */ /* * Header file for the io_uring interface. *
From: Jens Axboe axboe@kernel.dk
mainline inclusion from mainline-5.7-rc1 commit 3f9d64415fdaa73017fcb168930006648617b488 category: feature bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27 CVE: NA ---------------------------
Ensure we keep the truncated value, if we did truncate it. If not, we might read/write more than the registered buffer size.
Also for retry, ensure that we return the truncated mapped value for the vectorized versions of the read/write commands.
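Distilled, the fixed pattern is: let the buffer-select step shrink the requested length, then write the shrunk value back so a retry re-imports the same range. A sketch of the pattern (not the full kernel code):

	size_t len = req->rw.len;	/* length the application asked for */

	buf = io_rw_buffer_select(req, &len, needs_lock);	/* may shrink len */
	if (IS_ERR(buf))
		return PTR_ERR(buf);
	req->rw.len = len;	/* keep the truncated value for any retry */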
Signed-off-by: Jens Axboe axboe@kernel.dk Signed-off-by: yangerkun yangerkun@huawei.com Reviewed-by: zhangyi (F) yi.zhang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- fs/io_uring.c | 7 +++++-- 1 file changed, 5 insertions(+), 2 deletions(-)
diff --git a/fs/io_uring.c b/fs/io_uring.c index 46fd2f417edf..6faf90e6d20d 100644 --- a/fs/io_uring.c +++ b/fs/io_uring.c @@ -2357,6 +2357,7 @@ static ssize_t io_import_iovec(int rw, struct io_kiocb *req, *iovec = NULL; return PTR_ERR(buf); } + req->rw.len = sqe_len; }
ret = import_single_range(rw, buf, sqe_len, *iovec, iter); @@ -2376,8 +2377,10 @@ static ssize_t io_import_iovec(int rw, struct io_kiocb *req,
if (req->flags & REQ_F_BUFFER_SELECT) { ret = io_iov_buffer_select(req, *iovec, needs_lock); - if (!ret) - iov_iter_init(iter, rw, *iovec, 1, (*iovec)->iov_len); + if (!ret) { + ret = (*iovec)->iov_len; + iov_iter_init(iter, rw, *iovec, 1, ret); + } *iovec = NULL; return ret; }
From: Pavel Begunkov asml.silence@gmail.com
mainline inclusion from mainline-5.7-rc1 commit 2293b4195800f88de2c454a24b25874be56d87f3 category: feature bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27 CVE: NA ---------------------------
Deduplicate the cancellation parts, as many of them look the same, e.g.:

- io_wqe_cancel_cb_work() and io_wqe_cancel_work()
- io_wq_worker_cancel() and io_work_cancel()
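After the deduplication, every cancellation path funnels through one callback-driven helper; e.g. cancel-by-pid reduces to a predicate plus io_wq_cancel_cb() (both pieces taken from the patch below):

static bool io_wq_pid_match(struct io_wq_work *work, void *data)
{
	pid_t pid = (pid_t) (unsigned long) data;

	return work->task_pid == pid;
}

	/* one generic walker covers both queued and running work */
	ret = io_wq_cancel_cb(wq, io_wq_pid_match, (void *) (unsigned long) pid);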
Signed-off-by: Pavel Begunkov asml.silence@gmail.com Signed-off-by: Jens Axboe axboe@kernel.dk Signed-off-by: yangerkun yangerkun@huawei.com Reviewed-by: zhangyi (F) yi.zhang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- fs/io-wq.c | 136 ++++++++++------------------------------------------- 1 file changed, 24 insertions(+), 112 deletions(-)
diff --git a/fs/io-wq.c b/fs/io-wq.c index 73c5bb244730..d2fb0796eaf9 100644 --- a/fs/io-wq.c +++ b/fs/io-wq.c @@ -856,14 +856,13 @@ void io_wq_cancel_all(struct io_wq *wq) }
struct io_cb_cancel_data { - struct io_wqe *wqe; - work_cancel_fn *cancel; - void *caller_data; + work_cancel_fn *fn; + void *data; };
-static bool io_work_cancel(struct io_worker *worker, void *cancel_data) +static bool io_wq_worker_cancel(struct io_worker *worker, void *data) { - struct io_cb_cancel_data *data = cancel_data; + struct io_cb_cancel_data *match = data; unsigned long flags; bool ret = false;
@@ -874,83 +873,7 @@ static bool io_work_cancel(struct io_worker *worker, void *cancel_data) spin_lock_irqsave(&worker->lock, flags); if (worker->cur_work && !(worker->cur_work->flags & IO_WQ_WORK_NO_CANCEL) && - data->cancel(worker->cur_work, data->caller_data)) { - send_sig(SIGINT, worker->task, 1); - ret = true; - } - spin_unlock_irqrestore(&worker->lock, flags); - - return ret; -} - -static enum io_wq_cancel io_wqe_cancel_cb_work(struct io_wqe *wqe, - work_cancel_fn *cancel, - void *cancel_data) -{ - struct io_cb_cancel_data data = { - .wqe = wqe, - .cancel = cancel, - .caller_data = cancel_data, - }; - struct io_wq_work_node *node, *prev; - struct io_wq_work *work; - unsigned long flags; - bool found = false; - - spin_lock_irqsave(&wqe->lock, flags); - wq_list_for_each(node, prev, &wqe->work_list) { - work = container_of(node, struct io_wq_work, list); - - if (cancel(work, cancel_data)) { - wq_node_del(&wqe->work_list, node, prev); - found = true; - break; - } - } - spin_unlock_irqrestore(&wqe->lock, flags); - - if (found) { - io_run_cancel(work, wqe); - return IO_WQ_CANCEL_OK; - } - - rcu_read_lock(); - found = io_wq_for_each_worker(wqe, io_work_cancel, &data); - rcu_read_unlock(); - return found ? IO_WQ_CANCEL_RUNNING : IO_WQ_CANCEL_NOTFOUND; -} - -enum io_wq_cancel io_wq_cancel_cb(struct io_wq *wq, work_cancel_fn *cancel, - void *data) -{ - enum io_wq_cancel ret = IO_WQ_CANCEL_NOTFOUND; - int node; - - for_each_node(node) { - struct io_wqe *wqe = wq->wqes[node]; - - ret = io_wqe_cancel_cb_work(wqe, cancel, data); - if (ret != IO_WQ_CANCEL_NOTFOUND) - break; - } - - return ret; -} - -struct work_match { - bool (*fn)(struct io_wq_work *, void *data); - void *data; -}; - -static bool io_wq_worker_cancel(struct io_worker *worker, void *data) -{ - struct work_match *match = data; - unsigned long flags; - bool ret = false; - - spin_lock_irqsave(&worker->lock, flags); - if (match->fn(worker->cur_work, match->data) && - !(worker->cur_work->flags & IO_WQ_WORK_NO_CANCEL)) { + match->fn(worker->cur_work, match->data)) { send_sig(SIGINT, worker->task, 1); ret = true; } @@ -960,7 +883,7 @@ static bool io_wq_worker_cancel(struct io_worker *worker, void *data) }
static enum io_wq_cancel io_wqe_cancel_work(struct io_wqe *wqe, - struct work_match *match) + struct io_cb_cancel_data *match) { struct io_wq_work_node *node, *prev; struct io_wq_work *work; @@ -1001,22 +924,16 @@ static enum io_wq_cancel io_wqe_cancel_work(struct io_wqe *wqe, return found ? IO_WQ_CANCEL_RUNNING : IO_WQ_CANCEL_NOTFOUND; }
-static bool io_wq_work_match(struct io_wq_work *work, void *data) -{ - return work == data; -} - -enum io_wq_cancel io_wq_cancel_work(struct io_wq *wq, struct io_wq_work *cwork) +enum io_wq_cancel io_wq_cancel_cb(struct io_wq *wq, work_cancel_fn *cancel, + void *data) { - struct work_match match = { - .fn = io_wq_work_match, - .data = cwork + struct io_cb_cancel_data match = { + .fn = cancel, + .data = data, }; enum io_wq_cancel ret = IO_WQ_CANCEL_NOTFOUND; int node;
- cwork->flags |= IO_WQ_WORK_CANCEL; - for_each_node(node) { struct io_wqe *wqe = wq->wqes[node];
@@ -1028,33 +945,28 @@ enum io_wq_cancel io_wq_cancel_work(struct io_wq *wq, struct io_wq_work *cwork) return ret; }
+static bool io_wq_io_cb_cancel_data(struct io_wq_work *work, void *data) +{ + return work == data; +} + +enum io_wq_cancel io_wq_cancel_work(struct io_wq *wq, struct io_wq_work *cwork) +{ + return io_wq_cancel_cb(wq, io_wq_io_cb_cancel_data, (void *)cwork); +} + static bool io_wq_pid_match(struct io_wq_work *work, void *data) { pid_t pid = (pid_t) (unsigned long) data;
- if (work) - return work->task_pid == pid; - return false; + return work->task_pid == pid; }
enum io_wq_cancel io_wq_cancel_pid(struct io_wq *wq, pid_t pid) { - struct work_match match = { - .fn = io_wq_pid_match, - .data = (void *) (unsigned long) pid - }; - enum io_wq_cancel ret = IO_WQ_CANCEL_NOTFOUND; - int node; - - for_each_node(node) { - struct io_wqe *wqe = wq->wqes[node]; + void *data = (void *) (unsigned long) pid;
- ret = io_wqe_cancel_work(wqe, &match); - if (ret != IO_WQ_CANCEL_NOTFOUND) - break; - } - - return ret; + return io_wq_cancel_cb(wq, io_wq_pid_match, data); }
struct io_wq *io_wq_create(unsigned bounded, struct io_wq_data *data)
From: Pavel Begunkov asml.silence@gmail.com
mainline inclusion from mainline-5.7-rc1 commit d78298e73a3443a3c1766fa89f5370f52a4efd94 category: feature bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27 CVE: NA ---------------------------
This little tweak restores the behaviour that was there before the recent io_worker_handle_work() optimisation patches. It makes the function do cond_resched() and flush_signals() only if there is actual work to execute.
Signed-off-by: Pavel Begunkov asml.silence@gmail.com Signed-off-by: Jens Axboe axboe@kernel.dk Signed-off-by: yangerkun yangerkun@huawei.com Reviewed-by: zhangyi (F) yi.zhang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- fs/io-wq.c | 10 ++++++---- 1 file changed, 6 insertions(+), 4 deletions(-)
diff --git a/fs/io-wq.c b/fs/io-wq.c index d2fb0796eaf9..584db08f0547 100644 --- a/fs/io-wq.c +++ b/fs/io-wq.c @@ -459,10 +459,12 @@ static void io_impersonate_work(struct io_worker *worker, static void io_assign_current_work(struct io_worker *worker, struct io_wq_work *work) { - /* flush pending signals before assigning new work */ - if (signal_pending(current)) - flush_signals(current); - cond_resched(); + if (work) { + /* flush pending signals before assigning new work */ + if (signal_pending(current)) + flush_signals(current); + cond_resched(); + }
spin_lock_irq(&worker->lock); worker->cur_work = work;
From: Pavel Begunkov asml.silence@gmail.com
mainline inclusion from mainline-5.7-rc1 commit 8766dd516c535abf04491dca674d0ef6c95d814f category: feature bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27 CVE: NA ---------------------------
This is a preparation patch that removes io_wq_enqueue_hashed(); its job should now be done by io_wq_hash_work() + io_wq_enqueue().
Also, set the hash value for dependent works, and do it as late as possible, because req->file can be unavailable before that point. This hash will be ignored by io-wq.
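The call-site change is mechanical: tag the work first, then use the normal enqueue path. As in the patch below, for regular-file writes that must be serialized per inode:

	/* same hash value => io-wq will not run these in parallel */
	if (def->hash_reg_file)
		io_wq_hash_work(&req->work, file_inode(req->file));

	io_wq_enqueue(ctx->io_wq, &req->work);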
Signed-off-by: Pavel Begunkov asml.silence@gmail.com Signed-off-by: Jens Axboe axboe@kernel.dk Signed-off-by: yangerkun yangerkun@huawei.com Reviewed-by: zhangyi (F) yi.zhang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- fs/io-wq.c | 14 +++++--------- fs/io-wq.h | 7 ++++++- fs/io_uring.c | 24 ++++++++++-------------- 3 files changed, 21 insertions(+), 24 deletions(-)
diff --git a/fs/io-wq.c b/fs/io-wq.c index 584db08f0547..c6569e14d847 100644 --- a/fs/io-wq.c +++ b/fs/io-wq.c @@ -386,7 +386,7 @@ static struct io_wq_work *io_get_next_work(struct io_wqe *wqe, unsigned *hash) work = container_of(node, struct io_wq_work, list);
/* not hashed, can run anytime */ - if (!(work->flags & IO_WQ_WORK_HASHED)) { + if (!io_wq_is_hashed(work)) { wq_node_del(&wqe->work_list, node, prev); return work; } @@ -796,19 +796,15 @@ void io_wq_enqueue(struct io_wq *wq, struct io_wq_work *work) }
/* - * Enqueue work, hashed by some key. Work items that hash to the same value - * will not be done in parallel. Used to limit concurrent writes, generally - * hashed by inode. + * Work items that hash to the same value will not be done in parallel. + * Used to limit concurrent writes, generally hashed by inode. */ -void io_wq_enqueue_hashed(struct io_wq *wq, struct io_wq_work *work, void *val) +void io_wq_hash_work(struct io_wq_work *work, void *val) { - struct io_wqe *wqe = wq->wqes[numa_node_id()]; - unsigned bit; - + unsigned int bit;
bit = hash_ptr(val, IO_WQ_HASH_ORDER); work->flags |= (IO_WQ_WORK_HASHED | (bit << IO_WQ_HASH_SHIFT)); - io_wqe_enqueue(wqe, work); }
static bool io_wqe_worker_send_sig(struct io_worker *worker, void *data) diff --git a/fs/io-wq.h b/fs/io-wq.h index 2117b9a4f161..298b21f4a4d2 100644 --- a/fs/io-wq.h +++ b/fs/io-wq.h @@ -94,7 +94,12 @@ bool io_wq_get(struct io_wq *wq, struct io_wq_data *data); void io_wq_destroy(struct io_wq *wq);
void io_wq_enqueue(struct io_wq *wq, struct io_wq_work *work); -void io_wq_enqueue_hashed(struct io_wq *wq, struct io_wq_work *work, void *val); +void io_wq_hash_work(struct io_wq_work *work, void *val); + +static inline bool io_wq_is_hashed(struct io_wq_work *work) +{ + return work->flags & IO_WQ_WORK_HASHED; +}
void io_wq_cancel_all(struct io_wq *wq); enum io_wq_cancel io_wq_cancel_work(struct io_wq *wq, struct io_wq_work *cwork); diff --git a/fs/io_uring.c b/fs/io_uring.c index 6faf90e6d20d..c59250bffc7a 100644 --- a/fs/io_uring.c +++ b/fs/io_uring.c @@ -1036,15 +1036,14 @@ static inline void io_req_work_drop_env(struct io_kiocb *req) } }
-static inline bool io_prep_async_work(struct io_kiocb *req, +static inline void io_prep_async_work(struct io_kiocb *req, struct io_kiocb **link) { const struct io_op_def *def = &io_op_defs[req->opcode]; - bool do_hashed = false;
if (req->flags & REQ_F_ISREG) { if (def->hash_reg_file) - do_hashed = true; + io_wq_hash_work(&req->work, file_inode(req->file)); } else { if (def->unbound_nonreg_file) req->work.flags |= IO_WQ_WORK_UNBOUND; @@ -1053,25 +1052,18 @@ static inline bool io_prep_async_work(struct io_kiocb *req, io_req_work_grab_env(req, def);
*link = io_prep_linked_timeout(req); - return do_hashed; }
static inline void io_queue_async_work(struct io_kiocb *req) { struct io_ring_ctx *ctx = req->ctx; struct io_kiocb *link; - bool do_hashed;
- do_hashed = io_prep_async_work(req, &link); + io_prep_async_work(req, &link);
- trace_io_uring_queue_async_work(ctx, do_hashed, req, &req->work, - req->flags); - if (!do_hashed) { - io_wq_enqueue(ctx->io_wq, &req->work); - } else { - io_wq_enqueue_hashed(ctx->io_wq, &req->work, - file_inode(req->file)); - } + trace_io_uring_queue_async_work(ctx, io_wq_is_hashed(&req->work), req, + &req->work, req->flags); + io_wq_enqueue(ctx->io_wq, &req->work);
if (link) io_queue_linked_timeout(link); @@ -1579,6 +1571,10 @@ static void io_link_work_cb(struct io_wq_work **workptr) static void io_wq_assign_next(struct io_wq_work **workptr, struct io_kiocb *nxt) { struct io_kiocb *link; + const struct io_op_def *def = &io_op_defs[nxt->opcode]; + + if ((nxt->flags & REQ_F_ISREG) && def->hash_reg_file) + io_wq_hash_work(&nxt->work, file_inode(nxt->file));
*workptr = &nxt->work; link = io_prep_linked_timeout(nxt);
From: Pavel Begunkov asml.silence@gmail.com
mainline inclusion from mainline-5.7-rc1 commit 60cf46ae605446feb0c43c472c0fd1af4cd96231 category: feature bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27 CVE: NA ---------------------------
Enable io-wq hashing for dependent works simply by re-enqueueing such requests.
Signed-off-by: Pavel Begunkov asml.silence@gmail.com Signed-off-by: Jens Axboe axboe@kernel.dk Signed-off-by: yangerkun yangerkun@huawei.com Reviewed-by: zhangyi (F) yi.zhang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- fs/io-wq.c | 25 +++++++++++++++++++------ 1 file changed, 19 insertions(+), 6 deletions(-)
diff --git a/fs/io-wq.c b/fs/io-wq.c index c6569e14d847..4f7bdb3fd73c 100644 --- a/fs/io-wq.c +++ b/fs/io-wq.c @@ -376,11 +376,17 @@ static bool __io_worker_idle(struct io_wqe *wqe, struct io_worker *worker) return __io_worker_unuse(wqe, worker); }
-static struct io_wq_work *io_get_next_work(struct io_wqe *wqe, unsigned *hash) +static inline unsigned int io_get_work_hash(struct io_wq_work *work) +{ + return work->flags >> IO_WQ_HASH_SHIFT; +} + +static struct io_wq_work *io_get_next_work(struct io_wqe *wqe) __must_hold(wqe->lock) { struct io_wq_work_node *node, *prev; struct io_wq_work *work; + unsigned int hash;
wq_list_for_each(node, prev, &wqe->work_list) { work = container_of(node, struct io_wq_work, list); @@ -392,9 +398,9 @@ static struct io_wq_work *io_get_next_work(struct io_wqe *wqe, unsigned *hash) }
/* hashed, can run if not already running */ - *hash = work->flags >> IO_WQ_HASH_SHIFT; - if (!(wqe->hash_map & BIT(*hash))) { - wqe->hash_map |= BIT(*hash); + hash = io_get_work_hash(work); + if (!(wqe->hash_map & BIT(hash))) { + wqe->hash_map |= BIT(hash); wq_node_del(&wqe->work_list, node, prev); return work; } @@ -471,15 +477,17 @@ static void io_assign_current_work(struct io_worker *worker, spin_unlock_irq(&worker->lock); }
+static void io_wqe_enqueue(struct io_wqe *wqe, struct io_wq_work *work); + static void io_worker_handle_work(struct io_worker *worker) __releases(wqe->lock) { struct io_wqe *wqe = worker->wqe; struct io_wq *wq = wqe->wq; - unsigned hash = -1U;
do { struct io_wq_work *work; + unsigned int hash; get_next: /* * If we got some work, mark us as busy. If we didn't, but @@ -488,7 +496,7 @@ static void io_worker_handle_work(struct io_worker *worker) * can't make progress, any work completion or insertion will * clear the stalled flag. */ - work = io_get_next_work(wqe, &hash); + work = io_get_next_work(wqe); if (work) __io_worker_busy(wqe, worker, work); else if (!wq_list_empty(&wqe->work_list)) @@ -512,11 +520,16 @@ static void io_worker_handle_work(struct io_worker *worker) work->flags |= IO_WQ_WORK_CANCEL;
old_work = work; + hash = io_get_work_hash(work); work->func(&work); work = (old_work == work) ? NULL : work; io_assign_current_work(worker, work); wq->free_work(old_work);
+ if (work && io_wq_is_hashed(work)) { + io_wqe_enqueue(wqe, work); + work = NULL; + } if (hash != -1U) { spin_lock_irq(&wqe->lock); wqe->hash_map &= ~BIT_ULL(hash);
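For readers following the hashing scheme: the bucket index lives in the upper bits of work->flags, which is why io_get_work_hash() above is a plain shift. Below is a minimal, self-contained user-space sketch of that encoding; the constants and the trivial bucket function are illustrative stand-ins, not the kernel's actual values.

#include <stdio.h>

/* Simplified analogue of io-wq's hashed-work marking: one HASHED bit plus
 * the bucket index packed into the top bits of a single flags word.
 * WORK_HASHED, HASH_SHIFT and the bucket derivation are made up for this
 * sketch; only the encoding idea mirrors the kernel code. */
#define WORK_HASHED	(1u << 4)
#define HASH_SHIFT	24
#define HASH_ORDER	5		/* 32 buckets, as with IO_WQ_HASH_ORDER */

static unsigned int hash_work(unsigned int flags, unsigned long key)
{
	unsigned int bucket = key & ((1u << HASH_ORDER) - 1); /* stand-in for hash_ptr() */

	return flags | WORK_HASHED | (bucket << HASH_SHIFT);
}

static unsigned int get_work_hash(unsigned int flags)
{
	return flags >> HASH_SHIFT;	/* mirrors io_get_work_hash() */
}

int main(void)
{
	unsigned int flags = hash_work(0, 0x123457UL);

	printf("hashed=%d bucket=%u\n", !!(flags & WORK_HASHED), get_work_hash(flags));
	return 0;
}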
From: Jens Axboe axboe@kernel.dk
mainline inclusion from mainline-5.7-rc1 commit 4ed734b0d0913e566a9d871e15d24eb240f269f7 category: feature bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27 CVE: NA ---------------------------
With the previous fixes for number of files open checking, I added some debug code to see if we had other spots where we're checking rlimit() against the async io-wq workers. The only one I found was file size checking, which we should also honor.
During write and fallocate prep, store the max file size and override that for the current ask if we're in io-wq worker context.
Cc: stable@vger.kernel.org # 5.1+ Signed-off-by: Jens Axboe axboe@kernel.dk Signed-off-by: yangerkun yangerkun@huawei.com Reviewed-by: zhangyi (F) yi.zhang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- fs/io_uring.c | 17 ++++++++++++++++- 1 file changed, 16 insertions(+), 1 deletion(-)
diff --git a/fs/io_uring.c b/fs/io_uring.c index c59250bffc7a..9141aa266007 100644 --- a/fs/io_uring.c +++ b/fs/io_uring.c @@ -607,7 +607,10 @@ struct io_kiocb { struct list_head list; unsigned int flags; refcount_t refs; - struct task_struct *task; + union { + struct task_struct *task; + unsigned long fsize; + }; u64 user_data; u32 result; u32 sequence; @@ -2590,6 +2593,8 @@ static int io_write_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe, if (unlikely(!(req->file->f_mode & FMODE_WRITE))) return -EBADF;
+ req->fsize = rlimit(RLIMIT_FSIZE); + /* either don't need iovec imported or already have it */ if (!req->io || req->flags & REQ_F_NEED_CLEANUP) return 0; @@ -2659,10 +2664,17 @@ static int io_write(struct io_kiocb *req, bool force_nonblock) } kiocb->ki_flags |= IOCB_WRITE;
+ if (!force_nonblock) + current->signal->rlim[RLIMIT_FSIZE].rlim_cur = req->fsize; + if (req->file->f_op->write_iter) ret2 = call_write_iter(req->file, kiocb, &iter); else ret2 = loop_rw_iter(WRITE, req->file, kiocb, &iter); + + if (!force_nonblock) + current->signal->rlim[RLIMIT_FSIZE].rlim_cur = RLIM_INFINITY; + /* * Raw bdev writes will -EOPNOTSUPP for IOCB_NOWAIT. Just * retry them without IOCB_NOWAIT. @@ -2845,8 +2857,10 @@ static void __io_fallocate(struct io_kiocb *req) { int ret;
+ current->signal->rlim[RLIMIT_FSIZE].rlim_cur = req->fsize; ret = vfs_fallocate(req->file, req->sync.mode, req->sync.off, req->sync.len); + current->signal->rlim[RLIMIT_FSIZE].rlim_cur = RLIM_INFINITY; if (ret < 0) req_set_fail_links(req); io_cqring_add_event(req, ret); @@ -2872,6 +2886,7 @@ static int io_fallocate_prep(struct io_kiocb *req, req->sync.off = READ_ONCE(sqe->off); req->sync.len = READ_ONCE(sqe->addr); req->sync.mode = READ_ONCE(sqe->len); + req->fsize = rlimit(RLIMIT_FSIZE); return 0; }
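The pattern here is capture-at-prep, install-around-the-call, restore-after. A rough user-space analogue using the getrlimit(2)/setrlimit(2) wrappers is sketched below; the kernel side writes current->signal->rlim directly, and this is only the shape of the pattern, not the patch itself.

#include <stdio.h>
#include <sys/resource.h>

int main(void)
{
	struct rlimit fsize, unlimited = { RLIM_INFINITY, RLIM_INFINITY };

	/* prep time: "req->fsize = rlimit(RLIMIT_FSIZE)" */
	if (getrlimit(RLIMIT_FSIZE, &fsize))
		return 1;

	/* worker context: enforce the submitter's limit around the I/O... */
	setrlimit(RLIMIT_FSIZE, &fsize);
	/* ... perform the write or fallocate here ... */
	/* ...then lift it back to the unlimited value the worker runs with */
	setrlimit(RLIMIT_FSIZE, &unlimited);

	printf("enforced soft limit: %llu\n", (unsigned long long)fsize.rlim_cur);
	return 0;
}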
From: Lukas Bulwahn lukas.bulwahn@gmail.com
mainline inclusion from mainline-5.7-rc1 commit 9f5834c868e901b00f1bfe4d0052b5906b4a2b7f category: feature bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27 CVE: NA ---------------------------
Commit bbbdeb4720a0 ("io_uring: dual license io_uring.h uapi header") uses a nested SPDX-License-Identifier to dual license the header.
Since then, ./scripts/spdxcheck.py complains:
include/uapi/linux/io_uring.h: 1:60 Missing parentheses: OR
Add parentheses to make spdxcheck.py happy.
Signed-off-by: Lukas Bulwahn lukas.bulwahn@gmail.com Signed-off-by: Jens Axboe axboe@kernel.dk Signed-off-by: yangerkun yangerkun@huawei.com Reviewed-by: zhangyi (F) yi.zhang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- include/uapi/linux/io_uring.h | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/include/uapi/linux/io_uring.h b/include/uapi/linux/io_uring.h index ed90e7f75f15..6e35b534c4b8 100644 --- a/include/uapi/linux/io_uring.h +++ b/include/uapi/linux/io_uring.h @@ -1,4 +1,4 @@ -/* SPDX-License-Identifier: GPL-2.0 WITH Linux-syscall-note OR MIT */ +/* SPDX-License-Identifier: (GPL-2.0 WITH Linux-syscall-note) OR MIT */ /* * Header file for the io_uring interface. *
From: Pavel Begunkov asml.silence@gmail.com
mainline inclusion from mainline-5.7-rc1 commit f2cf11492b8b30d89b2fbf525c9ea5e8c4ccc842 category: feature bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27 CVE: NA ---------------------------
After io_assign_current_work() of a linked work, it can be decided to offload it to another thread via io_wqe_enqueue(). However, until the next io_assign_current_work() it can be cancelled, and that case isn't handled.
Don't assign it, if it's not going to be executed.
Fixes: 60cf46ae6054 ("io-wq: hash dependent work") Signed-off-by: Pavel Begunkov asml.silence@gmail.com Signed-off-by: Jens Axboe axboe@kernel.dk Signed-off-by: yangerkun yangerkun@huawei.com Reviewed-by: zhangyi (F) yi.zhang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- fs/io-wq.c | 10 +++++++--- 1 file changed, 7 insertions(+), 3 deletions(-)
diff --git a/fs/io-wq.c b/fs/io-wq.c index 4f7bdb3fd73c..db03fe55179a 100644 --- a/fs/io-wq.c +++ b/fs/io-wq.c @@ -486,7 +486,7 @@ static void io_worker_handle_work(struct io_worker *worker) struct io_wq *wq = wqe->wq;
do { - struct io_wq_work *work; + struct io_wq_work *work, *assign_work; unsigned int hash; get_next: /* @@ -523,10 +523,14 @@ static void io_worker_handle_work(struct io_worker *worker) hash = io_get_work_hash(work); work->func(&work); work = (old_work == work) ? NULL : work; - io_assign_current_work(worker, work); + + assign_work = work; + if (work && io_wq_is_hashed(work)) + assign_work = NULL; + io_assign_current_work(worker, assign_work); wq->free_work(old_work);
- if (work && io_wq_is_hashed(work)) { + if (work && !assign_work) { io_wqe_enqueue(wqe, work); work = NULL; }
From: Pavel Begunkov asml.silence@gmail.com
mainline inclusion from mainline-5.7-rc1 commit 18a542ff19ad149fac9e5a36a4012e3cac7b3b3b category: feature bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27 CVE: NA ---------------------------
work->data and work->list share a union. io_wq_assign_next() sets ->data if a req has a linked timeout, but io-wq may then want to use work->list, e.g. to re-enqueue the request, corrupting ->data.
->data is not necessary; just remove it and extract the linked timeout through @link_list.
Fixes: 60cf46ae6054 ("io-wq: hash dependent work") Signed-off-by: Pavel Begunkov asml.silence@gmail.com Signed-off-by: Jens Axboe axboe@kernel.dk Signed-off-by: yangerkun yangerkun@huawei.com Reviewed-by: zhangyi (F) yi.zhang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- fs/io-wq.h | 5 +---- fs/io_uring.c | 9 ++++----- 2 files changed, 5 insertions(+), 9 deletions(-)
diff --git a/fs/io-wq.h b/fs/io-wq.h index 298b21f4a4d2..d2a5684bf673 100644 --- a/fs/io-wq.h +++ b/fs/io-wq.h @@ -63,10 +63,7 @@ static inline void wq_node_del(struct io_wq_work_list *list, } while (0)
struct io_wq_work { - union { - struct io_wq_work_node list; - void *data; - }; + struct io_wq_work_node list; void (*func)(struct io_wq_work **); struct files_struct *files; struct mm_struct *mm; diff --git a/fs/io_uring.c b/fs/io_uring.c index 9141aa266007..846632fbdc7c 100644 --- a/fs/io_uring.c +++ b/fs/io_uring.c @@ -1564,9 +1564,10 @@ static void io_free_req(struct io_kiocb *req)
static void io_link_work_cb(struct io_wq_work **workptr) { - struct io_wq_work *work = *workptr; - struct io_kiocb *link = work->data; + struct io_kiocb *req = container_of(*workptr, struct io_kiocb, work); + struct io_kiocb *link;
+ link = list_first_entry(&req->link_list, struct io_kiocb, link_list); io_queue_linked_timeout(link); io_wq_submit_work(workptr); } @@ -1581,10 +1582,8 @@ static void io_wq_assign_next(struct io_wq_work **workptr, struct io_kiocb *nxt)
*workptr = &nxt->work; link = io_prep_linked_timeout(nxt); - if (link) { + if (link) nxt->work.func = io_link_work_cb; - nxt->work.data = link; - } }
/*
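The aliasing bug removed above is easy to reproduce in miniature: when two meanings share one union, the second writer silently destroys the first value. A toy illustration follows; the field names echo the old io_wq_work layout, but nothing else about it is literal.

#include <stdio.h>

struct work {
	union {
		struct work *next;	/* used by the queue when re-enqueueing */
		void *data;		/* used to stash the linked timeout */
	};
};

int main(void)
{
	struct work w, other;
	int timeout = 100;

	w.data = &timeout;	/* io_wq_assign_next() stashes the link */
	w.next = &other;	/* a re-enqueue reuses the same storage... */

	/* ...so the stashed pointer is gone */
	printf("data now %p (was %p)\n", w.data, (void *)&timeout);
	return 0;
}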
From: Hillf Danton hdanton@sina.com
mainline inclusion from mainline-5.7-rc1 commit 4afdb733b1606c6cb86e7833f9335f4870cf7ddd category: feature bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27 CVE: NA ---------------------------
A case of task hung was reported by syzbot,
INFO: task syz-executor975:9880 blocked for more than 143 seconds. Not tainted 5.6.0-rc6-syzkaller #0 "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. syz-executor975 D27576 9880 9878 0x80004000 Call Trace: schedule+0xd0/0x2a0 kernel/sched/core.c:4154 schedule_timeout+0x6db/0xba0 kernel/time/timer.c:1871 do_wait_for_common kernel/sched/completion.c:83 [inline] __wait_for_common kernel/sched/completion.c:104 [inline] wait_for_common kernel/sched/completion.c:115 [inline] wait_for_completion+0x26a/0x3c0 kernel/sched/completion.c:136 io_queue_file_removal+0x1af/0x1e0 fs/io_uring.c:5826 __io_sqe_files_update.isra.0+0x3a1/0xb00 fs/io_uring.c:5867 io_sqe_files_update fs/io_uring.c:5918 [inline] __io_uring_register+0x377/0x2c00 fs/io_uring.c:7131 __do_sys_io_uring_register fs/io_uring.c:7202 [inline] __se_sys_io_uring_register fs/io_uring.c:7184 [inline] __x64_sys_io_uring_register+0x192/0x560 fs/io_uring.c:7184 do_syscall_64+0xf6/0x7d0 arch/x86/entry/common.c:294 entry_SYSCALL_64_after_hwframe+0x49/0xbe
and bisect pointed to 05f3fb3c5397 ("io_uring: avoid ring quiesce for fixed file set unregister and update").
It comes down to ordering: we wait for the work to be done before flushing it, while nobody is likely to wake us up.
We can drop that on-stack completion, since flushing the work is itself the synchronous operation we need and nothing is left behind it.
To that end, io_file_put::done is reused to indicate whether it can be freed in the workqueue worker context.
Reported-and-Inspired-by: syzbot syzbot+538d1957ce178382a394@syzkaller.appspotmail.com Signed-off-by: Hillf Danton hdanton@sina.com
Rename ->done to ->free_pfile
Signed-off-by: Jens Axboe axboe@kernel.dk Signed-off-by: yangerkun yangerkun@huawei.com Reviewed-by: zhangyi (F) yi.zhang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- fs/io_uring.c | 13 +++++-------- 1 file changed, 5 insertions(+), 8 deletions(-)
diff --git a/fs/io_uring.c b/fs/io_uring.c index 846632fbdc7c..378c5e3b6ad8 100644 --- a/fs/io_uring.c +++ b/fs/io_uring.c @@ -6315,7 +6315,7 @@ static void io_ring_file_put(struct io_ring_ctx *ctx, struct file *file) struct io_file_put { struct llist_node llist; struct file *file; - struct completion *done; + bool free_pfile; };
static void io_ring_file_ref_flush(struct fixed_file_data *data) @@ -6326,9 +6326,7 @@ static void io_ring_file_ref_flush(struct fixed_file_data *data) while ((node = llist_del_all(&data->put_llist)) != NULL) { llist_for_each_entry_safe(pfile, tmp, node, llist) { io_ring_file_put(data->ctx, pfile->file); - if (pfile->done) - complete(pfile->done); - else + if (pfile->free_pfile) kfree(pfile); } } @@ -6528,7 +6526,6 @@ static bool io_queue_file_removal(struct fixed_file_data *data, struct file *file) { struct io_file_put *pfile, pfile_stack; - DECLARE_COMPLETION_ONSTACK(done);
/* * If we fail allocating the struct we need for doing async reomval @@ -6537,15 +6534,15 @@ static bool io_queue_file_removal(struct fixed_file_data *data, pfile = kzalloc(sizeof(*pfile), GFP_KERNEL); if (!pfile) { pfile = &pfile_stack; - pfile->done = &done; - } + pfile->free_pfile = false; + } else + pfile->free_pfile = true;
pfile->file = file; llist_add(&pfile->llist, &data->put_llist);
if (pfile == &pfile_stack) { percpu_ref_switch_to_atomic(&data->refs, io_atomic_switch); - wait_for_completion(&done); flush_work(&data->ref_work); return false; }
From: Hillf Danton hdanton@sina.com
mainline inclusion from mainline-5.7-rc1 commit a5318d3cdffbecf075928363d7e4becfeddabfcb category: feature bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27 CVE: NA ---------------------------
Sync removal of a file is only used in case of a GFP_KERNEL kmalloc failure, at the cost of io_file_put::done and a work flush, while a glitch like that can be handled at the call site without too much pain.
What is proposed, then, is to drop sync removal of files, and that kink in the neck along with it.
Signed-off-by: Hillf Danton hdanton@sina.com Signed-off-by: Jens Axboe axboe@kernel.dk Signed-off-by: yangerkun yangerkun@huawei.com Reviewed-by: zhangyi (F) yi.zhang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- fs/io_uring.c | 34 ++++++++++------------------------ 1 file changed, 10 insertions(+), 24 deletions(-)
diff --git a/fs/io_uring.c b/fs/io_uring.c index 378c5e3b6ad8..cd1fd6908cbd 100644 --- a/fs/io_uring.c +++ b/fs/io_uring.c @@ -6315,7 +6315,6 @@ static void io_ring_file_put(struct io_ring_ctx *ctx, struct file *file) struct io_file_put { struct llist_node llist; struct file *file; - bool free_pfile; };
static void io_ring_file_ref_flush(struct fixed_file_data *data) @@ -6326,8 +6325,7 @@ static void io_ring_file_ref_flush(struct fixed_file_data *data) while ((node = llist_del_all(&data->put_llist)) != NULL) { llist_for_each_entry_safe(pfile, tmp, node, llist) { io_ring_file_put(data->ctx, pfile->file); - if (pfile->free_pfile) - kfree(pfile); + kfree(pfile); } } } @@ -6522,32 +6520,18 @@ static void io_atomic_switch(struct percpu_ref *ref) percpu_ref_get(&data->refs); }
-static bool io_queue_file_removal(struct fixed_file_data *data, +static int io_queue_file_removal(struct fixed_file_data *data, struct file *file) { - struct io_file_put *pfile, pfile_stack; + struct io_file_put *pfile;
- /* - * If we fail allocating the struct we need for doing async reomval - * of this file, just punt to sync and wait for it. - */ pfile = kzalloc(sizeof(*pfile), GFP_KERNEL); - if (!pfile) { - pfile = &pfile_stack; - pfile->free_pfile = false; - } else - pfile->free_pfile = true; + if (!pfile) + return -ENOMEM;
pfile->file = file; llist_add(&pfile->llist, &data->put_llist); - - if (pfile == &pfile_stack) { - percpu_ref_switch_to_atomic(&data->refs, io_atomic_switch); - flush_work(&data->ref_work); - return false; - } - - return true; + return 0; }
static int __io_sqe_files_update(struct io_ring_ctx *ctx, @@ -6582,9 +6566,11 @@ static int __io_sqe_files_update(struct io_ring_ctx *ctx, index = i & IORING_FILE_TABLE_MASK; if (table->files[index]) { file = io_file_from_index(ctx, index); + err = io_queue_file_removal(data, file); + if (err) + break; table->files[index] = NULL; - if (io_queue_file_removal(data, file)) - ref_switch = true; + ref_switch = true; } if (fd != -1) { file = fget(fd);
From: Pavel Begunkov asml.silence@gmail.com
mainline inclusion from mainline-5.7-rc1 commit 86f3cd1b589a10dbdca98c52cc0cd0f56523c9b3 category: feature bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27 CVE: NA ---------------------------
We always punt async buffered writes to an io-wq helper, as the core kernel does not have IOCB_NOWAIT support for that. Most buffered async writes complete very quickly, as it's just a copy operation. This means that doing multiple locking roundtrips on the shared wqe lock for each buffered write is wasteful. Additionally, buffered writes are hashed work items, which means that any buffered write to a given file is serialized.
Keep identically hashed work items contiguous in @wqe->work_list, and track a tail for each hash bucket. On dequeue of a hashed item, splice all items of the same hash out in one go using the tracked tail. Until the batch is done, the caller doesn't have to synchronize with the wqe or worker locks again.
Signed-off-by: Pavel Begunkov asml.silence@gmail.com Signed-off-by: Jens Axboe axboe@kernel.dk Signed-off-by: yangerkun yangerkun@huawei.com Reviewed-by: zhangyi (F) yi.zhang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- fs/io-wq.c | 68 ++++++++++++++++++++++++++++++++++++++---------------- fs/io-wq.h | 45 +++++++++++++++++++++++++++++------- 2 files changed, 85 insertions(+), 28 deletions(-)
diff --git a/fs/io-wq.c b/fs/io-wq.c index db03fe55179a..4fd7b31c40a3 100644 --- a/fs/io-wq.c +++ b/fs/io-wq.c @@ -70,6 +70,8 @@ struct io_worker { #define IO_WQ_HASH_ORDER 5 #endif
+#define IO_WQ_NR_HASH_BUCKETS (1u << IO_WQ_HASH_ORDER) + struct io_wqe_acct { unsigned nr_workers; unsigned max_workers; @@ -99,6 +101,7 @@ struct io_wqe { struct list_head all_list;
struct io_wq *wq; + struct io_wq_work *hash_tail[IO_WQ_NR_HASH_BUCKETS]; };
/* @@ -385,7 +388,7 @@ static struct io_wq_work *io_get_next_work(struct io_wqe *wqe) __must_hold(wqe->lock) { struct io_wq_work_node *node, *prev; - struct io_wq_work *work; + struct io_wq_work *work, *tail; unsigned int hash;
wq_list_for_each(node, prev, &wqe->work_list) { @@ -393,7 +396,7 @@ static struct io_wq_work *io_get_next_work(struct io_wqe *wqe)
/* not hashed, can run anytime */ if (!io_wq_is_hashed(work)) { - wq_node_del(&wqe->work_list, node, prev); + wq_list_del(&wqe->work_list, node, prev); return work; }
@@ -401,7 +404,10 @@ static struct io_wq_work *io_get_next_work(struct io_wqe *wqe) hash = io_get_work_hash(work); if (!(wqe->hash_map & BIT(hash))) { wqe->hash_map |= BIT(hash); - wq_node_del(&wqe->work_list, node, prev); + /* all items with this hash lie in [work, tail] */ + tail = wqe->hash_tail[hash]; + wqe->hash_tail[hash] = NULL; + wq_list_cut(&wqe->work_list, &tail->list, prev); return work; } } @@ -486,7 +492,7 @@ static void io_worker_handle_work(struct io_worker *worker) struct io_wq *wq = wqe->wq;
do { - struct io_wq_work *work, *assign_work; + struct io_wq_work *work; unsigned int hash; get_next: /* @@ -509,8 +515,9 @@ static void io_worker_handle_work(struct io_worker *worker)
/* handle a whole dependent link */ do { - struct io_wq_work *old_work; + struct io_wq_work *old_work, *next_hashed, *linked;
+ next_hashed = wq_next_work(work); io_impersonate_work(worker, work); /* * OK to set IO_WQ_WORK_CANCEL even for uncancellable @@ -519,22 +526,23 @@ static void io_worker_handle_work(struct io_worker *worker) if (test_bit(IO_WQ_BIT_CANCEL, &wq->state)) work->flags |= IO_WQ_WORK_CANCEL;
- old_work = work; hash = io_get_work_hash(work); - work->func(&work); - work = (old_work == work) ? NULL : work; - - assign_work = work; - if (work && io_wq_is_hashed(work)) - assign_work = NULL; - io_assign_current_work(worker, assign_work); + linked = old_work = work; + linked->func(&linked); + linked = (old_work == linked) ? NULL : linked; + + work = next_hashed; + if (!work && linked && !io_wq_is_hashed(linked)) { + work = linked; + linked = NULL; + } + io_assign_current_work(worker, work); wq->free_work(old_work);
- if (work && !assign_work) { - io_wqe_enqueue(wqe, work); - work = NULL; - } - if (hash != -1U) { + if (linked) + io_wqe_enqueue(wqe, linked); + + if (hash != -1U && !next_hashed) { spin_lock_irq(&wqe->lock); wqe->hash_map &= ~BIT_ULL(hash); wqe->flags &= ~IO_WQE_FLAG_STALLED; @@ -777,6 +785,26 @@ static void io_run_cancel(struct io_wq_work *work, struct io_wqe *wqe) } while (work); }
+static void io_wqe_insert_work(struct io_wqe *wqe, struct io_wq_work *work) +{ + unsigned int hash; + struct io_wq_work *tail; + + if (!io_wq_is_hashed(work)) { +append: + wq_list_add_tail(&work->list, &wqe->work_list); + return; + } + + hash = io_get_work_hash(work); + tail = wqe->hash_tail[hash]; + wqe->hash_tail[hash] = work; + if (!tail) + goto append; + + wq_list_add_after(&work->list, &tail->list, &wqe->work_list); +} + static void io_wqe_enqueue(struct io_wqe *wqe, struct io_wq_work *work) { struct io_wqe_acct *acct = io_work_get_acct(wqe, work); @@ -796,7 +824,7 @@ static void io_wqe_enqueue(struct io_wqe *wqe, struct io_wq_work *work)
work_flags = work->flags; spin_lock_irqsave(&wqe->lock, flags); - wq_list_add_tail(&work->list, &wqe->work_list); + io_wqe_insert_work(wqe, work); wqe->flags &= ~IO_WQE_FLAG_STALLED; spin_unlock_irqrestore(&wqe->lock, flags);
@@ -915,7 +943,7 @@ static enum io_wq_cancel io_wqe_cancel_work(struct io_wqe *wqe, work = container_of(node, struct io_wq_work, list);
if (match->fn(work, match->data)) { - wq_node_del(&wqe->work_list, node, prev); + wq_list_del(&wqe->work_list, node, prev); found = true; break; } diff --git a/fs/io-wq.h b/fs/io-wq.h index d2a5684bf673..3ee7356d6be5 100644 --- a/fs/io-wq.h +++ b/fs/io-wq.h @@ -28,6 +28,18 @@ struct io_wq_work_list { struct io_wq_work_node *last; };
+static inline void wq_list_add_after(struct io_wq_work_node *node, + struct io_wq_work_node *pos, + struct io_wq_work_list *list) +{ + struct io_wq_work_node *next = pos->next; + + pos->next = node; + node->next = next; + if (!next) + list->last = node; +} + static inline void wq_list_add_tail(struct io_wq_work_node *node, struct io_wq_work_list *list) { @@ -40,17 +52,26 @@ static inline void wq_list_add_tail(struct io_wq_work_node *node, } }
-static inline void wq_node_del(struct io_wq_work_list *list, - struct io_wq_work_node *node, +static inline void wq_list_cut(struct io_wq_work_list *list, + struct io_wq_work_node *last, struct io_wq_work_node *prev) { - if (node == list->first) - WRITE_ONCE(list->first, node->next); - if (node == list->last) + /* first in the list, if prev==NULL */ + if (!prev) + WRITE_ONCE(list->first, last->next); + else + prev->next = last->next; + + if (last == list->last) list->last = prev; - if (prev) - prev->next = node->next; - node->next = NULL; + last->next = NULL; +} + +static inline void wq_list_del(struct io_wq_work_list *list, + struct io_wq_work_node *node, + struct io_wq_work_node *prev) +{ + wq_list_cut(list, node, prev); }
#define wq_list_for_each(pos, prv, head) \ @@ -78,6 +99,14 @@ struct io_wq_work { *(work) = (struct io_wq_work){ .func = _func }; \ } while (0) \
+static inline struct io_wq_work *wq_next_work(struct io_wq_work *work) +{ + if (!work->list.next) + return NULL; + + return container_of(work->list.next, struct io_wq_work, list); +} + typedef void (free_work_fn)(struct io_wq_work *);
struct io_wq_data {
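The batching boils down to plain singly linked list surgery: same-hash items are inserted right after the bucket's tracked tail so they stay contiguous, and dequeue splices the whole [first, tail] run out in one operation. The self-contained sketch below simplifies the wq_list_* helpers above (the kernel versions also handle a mid-list cut via a prev pointer, omitted here).

#include <stdio.h>
#include <stddef.h>

struct node { struct node *next; int hash; };
struct list { struct node *first, *last; };

static void list_add_after(struct list *l, struct node *n, struct node *pos)
{
	n->next = pos->next;
	pos->next = n;
	if (!n->next)
		l->last = n;
}

static void list_add_tail(struct list *l, struct node *n)
{
	n->next = NULL;
	if (!l->first)
		l->first = n;
	else
		l->last->next = n;
	l->last = n;
}

/* cut everything from the head through 'tail' out of the list */
static struct node *list_cut_head(struct list *l, struct node *tail)
{
	struct node *first = l->first;

	l->first = tail->next;
	if (l->last == tail)
		l->last = NULL;
	tail->next = NULL;
	return first;
}

int main(void)
{
	struct list wl = { 0 };
	struct node a = { .hash = 1 }, b = { .hash = 1 }, c = { .hash = 2 };
	struct node *hash_tail[4] = { 0 };

	list_add_tail(&wl, &a);
	hash_tail[1] = &a;
	list_add_tail(&wl, &c);
	/* a same-hash insert goes right after the bucket's current tail */
	list_add_after(&wl, &b, hash_tail[1]);
	hash_tail[1] = &b;

	/* dequeue of hash 1 splices both items out together, leaving c */
	for (struct node *n = list_cut_head(&wl, hash_tail[1]); n; n = n->next)
		printf("spliced hash=%d\n", n->hash);
	return 0;
}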
From: Chucheng Luo luochucheng@vivo.com
mainline inclusion from mainline-5.7-rc1 commit bff6035d0c40fa1dd195aa41f61814d622883420 category: feature bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27 CVE: NA ---------------------------
The missing 'return' in the comment may make it hard for other developers to understand the code.
Signed-off-by: Chucheng Luo luochucheng@vivo.com Signed-off-by: Jens Axboe axboe@kernel.dk Signed-off-by: yangerkun yangerkun@huawei.com Reviewed-by: zhangyi (F) yi.zhang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- fs/io_uring.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/fs/io_uring.c b/fs/io_uring.c index cd1fd6908cbd..8ab0bafebf5e 100644 --- a/fs/io_uring.c +++ b/fs/io_uring.c @@ -2675,7 +2675,7 @@ static int io_write(struct io_kiocb *req, bool force_nonblock) current->signal->rlim[RLIMIT_FSIZE].rlim_cur = RLIM_INFINITY;
/* - * Raw bdev writes will -EOPNOTSUPP for IOCB_NOWAIT. Just + * Raw bdev writes will return -EOPNOTSUPP for IOCB_NOWAIT. Just * retry them without IOCB_NOWAIT. */ if (ret2 == -EOPNOTSUPP && (kiocb->ki_flags & IOCB_NOWAIT))
From: Xiaoguang Wang xiaoguang.wang@linux.alibaba.com
mainline inclusion from mainline-5.7-rc1 commit 3d9932a8b240c9019f48358e8a6928c53c2c7f6b category: feature bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27 CVE: NA ---------------------------
Clean up io_alloc_async_ctx() a bit and add a new __io_alloc_async_ctx(), so that io_setup_async_rw() won't need to check whether async_ctx is true or false again.
Reviewed-by: Stefano Garzarella sgarzare@redhat.com Signed-off-by: Xiaoguang Wang xiaoguang.wang@linux.alibaba.com Signed-off-by: Jens Axboe axboe@kernel.dk Signed-off-by: yangerkun yangerkun@huawei.com Reviewed-by: zhangyi (F) yi.zhang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- fs/io_uring.c | 12 +++++++++--- 1 file changed, 9 insertions(+), 3 deletions(-)
diff --git a/fs/io_uring.c b/fs/io_uring.c index 8ab0bafebf5e..9d311d535efc 100644 --- a/fs/io_uring.c +++ b/fs/io_uring.c @@ -2466,12 +2466,18 @@ static void io_req_map_rw(struct io_kiocb *req, ssize_t io_size, } }
+static inline int __io_alloc_async_ctx(struct io_kiocb *req) +{ + req->io = kmalloc(sizeof(*req->io), GFP_KERNEL); + return req->io == NULL; +} + static int io_alloc_async_ctx(struct io_kiocb *req) { if (!io_op_defs[req->opcode].async_ctx) return 0; - req->io = kmalloc(sizeof(*req->io), GFP_KERNEL); - return req->io == NULL; + + return __io_alloc_async_ctx(req); }
static int io_setup_async_rw(struct io_kiocb *req, ssize_t io_size, @@ -2481,7 +2487,7 @@ static int io_setup_async_rw(struct io_kiocb *req, ssize_t io_size, if (!io_op_defs[req->opcode].async_ctx) return 0; if (!req->io) { - if (io_alloc_async_ctx(req)) + if (__io_alloc_async_ctx(req)) return -ENOMEM;
io_req_map_rw(req, io_size, iovec, fast_iov, iter);
From: Xiaoguang Wang xiaoguang.wang@linux.alibaba.com
mainline inclusion from mainline-5.7-rc1 commit 0558955373023b08f638c9ede36741b0e4200f58 category: feature bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27 CVE: NA ---------------------------
While diving into the io_uring fileset register/unregister/update code, we found one bug in the fileset update handling. The io_uring fileset update uses a percpu_ref variable to check whether we can put a previously registered file: only when the refcount of the percpu_ref variable reaches zero can we safely put these files. But this doesn't work so well. If applications issue requests continually, this percpu_ref never gets a chance to reach zero and always stays in atomic mode, which also defeats the gains of the fileset register/unregister/update feature, namely reducing the atomic operation overhead of fput/fget.
To fix this issue, while applications do IORING_REGISTER_FILES or IORING_REGISTER_FILES_UPDATE operations, we allocate a new percpu_ref and kill the old percpu_ref, new requests will use the new percpu_ref. Once all previous old requests complete, old percpu_refs will be dropped and registered files will be put safely.
Link: https://lore.kernel.org/io-uring/5a8dac33-4ca2-4847-b091-f7dcd3ad0ff3@linux.... Signed-off-by: Xiaoguang Wang xiaoguang.wang@linux.alibaba.com Signed-off-by: Jens Axboe axboe@kernel.dk Signed-off-by: yangerkun yangerkun@huawei.com Reviewed-by: zhangyi (F) yi.zhang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- fs/io_uring.c | 204 ++++++++++++++++++++++++++++++-------------------- 1 file changed, 124 insertions(+), 80 deletions(-)
diff --git a/fs/io_uring.c b/fs/io_uring.c index 9d311d535efc..a1ac44e506db 100644 --- a/fs/io_uring.c +++ b/fs/io_uring.c @@ -186,14 +186,23 @@ struct fixed_file_table { struct file **files; };
+struct fixed_file_ref_node { + struct percpu_ref refs; + struct list_head node; + struct list_head file_list; + struct fixed_file_data *file_data; + struct work_struct work; +}; + struct fixed_file_data { struct fixed_file_table *table; struct io_ring_ctx *ctx;
+ struct percpu_ref *cur_refs; struct percpu_ref refs; - struct llist_head put_llist; - struct work_struct ref_work; struct completion done; + struct list_head ref_list; + spinlock_t lock; };
struct io_buffer { @@ -619,6 +628,8 @@ struct io_kiocb {
struct list_head inflight_entry;
+ struct percpu_ref *fixed_file_refs; + union { /* * Only commands that never go async can use the below fields, @@ -843,7 +854,6 @@ static int __io_sqe_files_update(struct io_ring_ctx *ctx, struct io_uring_files_update *ip, unsigned nr_args); static int io_grab_files(struct io_kiocb *req); -static void io_ring_file_ref_flush(struct fixed_file_data *data); static void io_cleanup_req(struct io_kiocb *req); static int io_file_get(struct io_submit_state *state, struct io_kiocb *req, int fd, struct file **out_file, bool fixed); @@ -1335,7 +1345,7 @@ static inline void io_put_file(struct io_kiocb *req, struct file *file, bool fixed) { if (fixed) - percpu_ref_put(&req->ctx->file_data->refs); + percpu_ref_put(req->fixed_file_refs); else fput(file); } @@ -1387,21 +1397,18 @@ struct req_batch {
static void io_free_req_many(struct io_ring_ctx *ctx, struct req_batch *rb) { - int fixed_refs = rb->to_free; - if (!rb->to_free) return; if (rb->need_iter) { int i, inflight = 0; unsigned long flags;
- fixed_refs = 0; for (i = 0; i < rb->to_free; i++) { struct io_kiocb *req = rb->reqs[i];
if (req->flags & REQ_F_FIXED_FILE) { req->file = NULL; - fixed_refs++; + percpu_ref_put(req->fixed_file_refs); } if (req->flags & REQ_F_INFLIGHT) inflight++; @@ -1427,8 +1434,6 @@ static void io_free_req_many(struct io_ring_ctx *ctx, struct req_batch *rb) } do_free: kmem_cache_free_bulk(req_cachep, rb->to_free, rb->reqs); - if (fixed_refs) - percpu_ref_put_many(&ctx->file_data->refs, fixed_refs); percpu_ref_put_many(&ctx->refs, rb->to_free); rb->to_free = rb->need_iter = 0; } @@ -5265,7 +5270,8 @@ static int io_file_get(struct io_submit_state *state, struct io_kiocb *req, file = io_file_from_index(ctx, fd); if (!file) return -EBADF; - percpu_ref_get(&ctx->file_data->refs); + req->fixed_file_refs = ctx->file_data->cur_refs; + percpu_ref_get(req->fixed_file_refs); } else { trace_io_uring_file_get(ctx, fd); file = __io_file_get(state, fd); @@ -6058,43 +6064,36 @@ static void io_file_ref_kill(struct percpu_ref *ref) complete(&data->done); }
-static void io_file_ref_exit_and_free(struct work_struct *work) -{ - struct fixed_file_data *data; - - data = container_of(work, struct fixed_file_data, ref_work); - - /* - * Ensure any percpu-ref atomic switch callback has run, it could have - * been in progress when the files were being unregistered. Once - * that's done, we can safely exit and free the ref and containing - * data structure. - */ - rcu_barrier(); - percpu_ref_exit(&data->refs); - kfree(data); -} - static int io_sqe_files_unregister(struct io_ring_ctx *ctx) { struct fixed_file_data *data = ctx->file_data; + struct fixed_file_ref_node *ref_node = NULL; unsigned nr_tables, i; + unsigned long flags;
if (!data) return -ENXIO;
- percpu_ref_kill_and_confirm(&data->refs, io_file_ref_kill); - flush_work(&data->ref_work); + spin_lock_irqsave(&data->lock, flags); + if (!list_empty(&data->ref_list)) + ref_node = list_first_entry(&data->ref_list, + struct fixed_file_ref_node, node); + spin_unlock_irqrestore(&data->lock, flags); + if (ref_node) + percpu_ref_kill(&ref_node->refs); + + percpu_ref_kill(&data->refs); + + /* wait for all refs nodes to complete */ wait_for_completion(&data->done); - io_ring_file_ref_flush(data);
__io_sqe_files_unregister(ctx); nr_tables = DIV_ROUND_UP(ctx->nr_user_files, IORING_MAX_FILES_TABLE); for (i = 0; i < nr_tables; i++) kfree(data->table[i].files); kfree(data->table); - INIT_WORK(&data->ref_work, io_file_ref_exit_and_free); - queue_work(system_wq, &data->ref_work); + percpu_ref_exit(&data->refs); + kfree(data); ctx->file_data = NULL; ctx->nr_user_files = 0; return 0; @@ -6319,46 +6318,72 @@ static void io_ring_file_put(struct io_ring_ctx *ctx, struct file *file) }
struct io_file_put { - struct llist_node llist; + struct list_head list; struct file *file; };
-static void io_ring_file_ref_flush(struct fixed_file_data *data) +static void io_file_put_work(struct work_struct *work) { + struct fixed_file_ref_node *ref_node; + struct fixed_file_data *file_data; + struct io_ring_ctx *ctx; struct io_file_put *pfile, *tmp; - struct llist_node *node; + unsigned long flags;
- while ((node = llist_del_all(&data->put_llist)) != NULL) { - llist_for_each_entry_safe(pfile, tmp, node, llist) { - io_ring_file_put(data->ctx, pfile->file); - kfree(pfile); - } + ref_node = container_of(work, struct fixed_file_ref_node, work); + file_data = ref_node->file_data; + ctx = file_data->ctx; + + list_for_each_entry_safe(pfile, tmp, &ref_node->file_list, list) { + list_del_init(&pfile->list); + io_ring_file_put(ctx, pfile->file); + kfree(pfile); } + + spin_lock_irqsave(&file_data->lock, flags); + list_del_init(&ref_node->node); + spin_unlock_irqrestore(&file_data->lock, flags); + + percpu_ref_exit(&ref_node->refs); + kfree(ref_node); + percpu_ref_put(&file_data->refs); }
-static void io_ring_file_ref_switch(struct work_struct *work) +static void io_file_data_ref_zero(struct percpu_ref *ref) { - struct fixed_file_data *data; + struct fixed_file_ref_node *ref_node;
- data = container_of(work, struct fixed_file_data, ref_work); - io_ring_file_ref_flush(data); - percpu_ref_switch_to_percpu(&data->refs); + ref_node = container_of(ref, struct fixed_file_ref_node, refs); + + queue_work(system_wq, &ref_node->work); }
-static void io_file_data_ref_zero(struct percpu_ref *ref) +static struct fixed_file_ref_node *alloc_fixed_file_ref_node( + struct io_ring_ctx *ctx) { - struct fixed_file_data *data; + struct fixed_file_ref_node *ref_node;
- data = container_of(ref, struct fixed_file_data, refs); + ref_node = kzalloc(sizeof(*ref_node), GFP_KERNEL); + if (!ref_node) + return ERR_PTR(-ENOMEM);
- /* - * We can't safely switch from inside this context, punt to wq. If - * the table ref is going away, the table is being unregistered. - * Don't queue up the async work for that case, the caller will - * handle it. - */ - if (!percpu_ref_is_dying(&data->refs)) - queue_work(system_wq, &data->ref_work); + if (percpu_ref_init(&ref_node->refs, io_file_data_ref_zero, + 0, GFP_KERNEL)) { + kfree(ref_node); + return ERR_PTR(-ENOMEM); + } + INIT_LIST_HEAD(&ref_node->node); + INIT_LIST_HEAD(&ref_node->file_list); + INIT_WORK(&ref_node->work, io_file_put_work); + ref_node->file_data = ctx->file_data; + return ref_node; + +} + +static void destroy_fixed_file_ref_node(struct fixed_file_ref_node *ref_node) +{ + percpu_ref_exit(&ref_node->refs); + kfree(ref_node); }
static int io_sqe_files_register(struct io_ring_ctx *ctx, void __user *arg, @@ -6369,6 +6394,8 @@ static int io_sqe_files_register(struct io_ring_ctx *ctx, void __user *arg, struct file *file; int fd, ret = 0; unsigned i; + struct fixed_file_ref_node *ref_node; + unsigned long flags;
if (ctx->file_data) return -EBUSY; @@ -6382,6 +6409,7 @@ static int io_sqe_files_register(struct io_ring_ctx *ctx, void __user *arg, return -ENOMEM; ctx->file_data->ctx = ctx; init_completion(&ctx->file_data->done); + INIT_LIST_HEAD(&ctx->file_data->ref_list);
nr_tables = DIV_ROUND_UP(nr_args, IORING_MAX_FILES_TABLE); ctx->file_data->table = kcalloc(nr_tables, @@ -6393,15 +6421,13 @@ static int io_sqe_files_register(struct io_ring_ctx *ctx, void __user *arg, return -ENOMEM; }
- if (percpu_ref_init(&ctx->file_data->refs, io_file_data_ref_zero, + if (percpu_ref_init(&ctx->file_data->refs, io_file_ref_kill, PERCPU_REF_ALLOW_REINIT, GFP_KERNEL)) { kfree(ctx->file_data->table); kfree(ctx->file_data); ctx->file_data = NULL; return -ENOMEM; } - ctx->file_data->put_llist.first = NULL; - INIT_WORK(&ctx->file_data->ref_work, io_ring_file_ref_switch);
if (io_sqe_alloc_file_tables(ctx, nr_tables, nr_args)) { percpu_ref_exit(&ctx->file_data->refs); @@ -6464,9 +6490,22 @@ static int io_sqe_files_register(struct io_ring_ctx *ctx, void __user *arg, }
ret = io_sqe_files_scm(ctx); - if (ret) + if (ret) { io_sqe_files_unregister(ctx); + return ret; + }
+ ref_node = alloc_fixed_file_ref_node(ctx); + if (IS_ERR(ref_node)) { + io_sqe_files_unregister(ctx); + return PTR_ERR(ref_node); + } + + ctx->file_data->cur_refs = &ref_node->refs; + spin_lock_irqsave(&ctx->file_data->lock, flags); + list_add(&ref_node->node, &ctx->file_data->ref_list); + spin_unlock_irqrestore(&ctx->file_data->lock, flags); + percpu_ref_get(&ctx->file_data->refs); return ret; }
@@ -6513,30 +6552,21 @@ static int io_sqe_file_register(struct io_ring_ctx *ctx, struct file *file, #endif }
-static void io_atomic_switch(struct percpu_ref *ref) -{ - struct fixed_file_data *data; - - /* - * Juggle reference to ensure we hit zero, if needed, so we can - * switch back to percpu mode - */ - data = container_of(ref, struct fixed_file_data, refs); - percpu_ref_put(&data->refs); - percpu_ref_get(&data->refs); -} - static int io_queue_file_removal(struct fixed_file_data *data, - struct file *file) + struct file *file) { struct io_file_put *pfile; + struct percpu_ref *refs = data->cur_refs; + struct fixed_file_ref_node *ref_node;
pfile = kzalloc(sizeof(*pfile), GFP_KERNEL); if (!pfile) return -ENOMEM;
+ ref_node = container_of(refs, struct fixed_file_ref_node, refs); pfile->file = file; - llist_add(&pfile->llist, &data->put_llist); + list_add(&pfile->list, &ref_node->file_list); + return 0; }
@@ -6545,17 +6575,23 @@ static int __io_sqe_files_update(struct io_ring_ctx *ctx, unsigned nr_args) { struct fixed_file_data *data = ctx->file_data; - bool ref_switch = false; + struct fixed_file_ref_node *ref_node; struct file *file; __s32 __user *fds; int fd, i, err; __u32 done; + unsigned long flags; + bool needs_switch = false;
if (check_add_overflow(up->offset, nr_args, &done)) return -EOVERFLOW; if (done > ctx->nr_user_files) return -EINVAL;
+ ref_node = alloc_fixed_file_ref_node(ctx); + if (IS_ERR(ref_node)) + return PTR_ERR(ref_node); + done = 0; fds = u64_to_user_ptr(up->fds); while (nr_args) { @@ -6576,7 +6612,7 @@ static int __io_sqe_files_update(struct io_ring_ctx *ctx, if (err) break; table->files[index] = NULL; - ref_switch = true; + needs_switch = true; } if (fd != -1) { file = fget(fd); @@ -6607,11 +6643,19 @@ static int __io_sqe_files_update(struct io_ring_ctx *ctx, up->offset++; }
- if (ref_switch) - percpu_ref_switch_to_atomic(&data->refs, io_atomic_switch); + if (needs_switch) { + percpu_ref_kill(data->cur_refs); + spin_lock_irqsave(&data->lock, flags); + list_add(&ref_node->node, &data->ref_list); + data->cur_refs = &ref_node->refs; + spin_unlock_irqrestore(&data->lock, flags); + percpu_ref_get(&ctx->file_data->refs); + } else + destroy_fixed_file_ref_node(ref_node);
return done ? done : err; } + static int io_sqe_files_update(struct io_ring_ctx *ctx, void __user *arg, unsigned nr_args) {
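Stripped of the percpu machinery, the lifecycle above is: each update installs a fresh ref node as cur_refs, requests pin whichever node was current at submit time, and a node releases its files only once it is both superseded (killed) and drained to zero. The deliberately single-threaded toy model below uses plain counters in place of percpu_ref to show just that ordering.

#include <stdio.h>
#include <stdlib.h>

struct ref_node {
	int count;
	int killed;
	int id;
};

static struct ref_node *cur_refs;

static struct ref_node *node_alloc(int id)
{
	struct ref_node *n = calloc(1, sizeof(*n));

	n->id = id;
	return n;
}

static void node_put(struct ref_node *n)
{
	/* files go only when the node is dead *and* nothing pins it */
	if (--n->count == 0 && n->killed) {
		printf("node %d: zero refs, putting its files\n", n->id);
		free(n);
	}
}

int main(void)
{
	struct ref_node *pinned, *old;

	cur_refs = node_alloc(0);

	pinned = cur_refs;	/* request submit: pin the current node */
	pinned->count++;

	/* files update: kill the old node, switch to a new one */
	old = cur_refs;
	cur_refs = node_alloc(1);
	old->killed = 1;

	node_put(pinned);	/* request completes -> node 0 released */
	free(cur_refs);
	return 0;
}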
From: Hillf Danton hdanton@sina.com
mainline inclusion from mainline-5.7-rc1 commit 10bea96dcc13ad841d53bdcc9d8e731e9e0ad58f category: feature bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27 CVE: NA ---------------------------
Add it to pair with prepare_to_wait() in an attempt to avoid anything weird in the field.
Fixes: b41e98524e42 ("io_uring: add per-task callback handler") Reported-by: syzbot+0c3370f235b74b3cfd97@syzkaller.appspotmail.com Signed-off-by: Hillf Danton hdanton@sina.com Signed-off-by: Jens Axboe axboe@kernel.dk Signed-off-by: yangerkun yangerkun@huawei.com Reviewed-by: zhangyi (F) yi.zhang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- fs/io_uring.c | 1 + 1 file changed, 1 insertion(+)
diff --git a/fs/io_uring.c b/fs/io_uring.c index a1ac44e506db..2b3b77999094 100644 --- a/fs/io_uring.c +++ b/fs/io_uring.c @@ -5902,6 +5902,7 @@ static int io_sq_thread(void *data) } if (current->task_works) { task_work_run(); + finish_wait(&ctx->sqo_wait, &wait); continue; } if (signal_pending(current))
From: Xiaoguang Wang xiaoguang.wang@linux.alibaba.com
mainline inclusion from mainline-5.7-rc1 commit 45097daea2f4e89bdb1c98359f78d0d6feb8e5c8 category: feature bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27 CVE: NA ---------------------------
In io_read_prep() and io_write_prep(), io_req_map_rw() passes struct io_async_rw's fast_iov to io_import_iovec(); if io_import_iovec() ends up using that fast_iov as the valid iovec array, io_req_map_rw() does not need to do the memcpy later, because source and destination are the same pointer.
Signed-off-by: Xiaoguang Wang xiaoguang.wang@linux.alibaba.com Signed-off-by: Jens Axboe axboe@kernel.dk Signed-off-by: yangerkun yangerkun@huawei.com Reviewed-by: zhangyi (F) yi.zhang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- fs/io_uring.c | 5 +++-- 1 file changed, 3 insertions(+), 2 deletions(-)
diff --git a/fs/io_uring.c b/fs/io_uring.c index 80dc4b0dd1f0..2f0f65eb59a4 100644 --- a/fs/io_uring.c +++ b/fs/io_uring.c @@ -2487,8 +2487,9 @@ static void io_req_map_rw(struct io_kiocb *req, ssize_t io_size, req->io->rw.iov = iovec; if (!req->io->rw.iov) { req->io->rw.iov = req->io->rw.fast_iov; - memcpy(req->io->rw.iov, fast_iov, - sizeof(struct iovec) * iter->nr_segs); + if (req->io->rw.iov != fast_iov) + memcpy(req->io->rw.iov, fast_iov, + sizeof(struct iovec) * iter->nr_segs); } else { req->flags |= REQ_F_NEED_CLEANUP; }
From: Pavel Begunkov asml.silence@gmail.com
mainline inclusion from mainline-5.7-rc1 commit 709b302faddfac757d87df2080f900eccb1dc9e2 category: feature bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27 CVE: NA ---------------------------
Make io_get_sqring() care only about the sqes themselves, not about initialising the io_kiocb. Also, split it into get + consume, which will be helpful in the future.
Signed-off-by: Pavel Begunkov asml.silence@gmail.com Signed-off-by: Jens Axboe axboe@kernel.dk Signed-off-by: yangerkun yangerkun@huawei.com Reviewed-by: zhangyi (F) yi.zhang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- fs/io_uring.c | 40 ++++++++++++++++++++++------------------ 1 file changed, 22 insertions(+), 18 deletions(-)
diff --git a/fs/io_uring.c b/fs/io_uring.c index 2f0f65eb59a4..2349602fd013 100644 --- a/fs/io_uring.c +++ b/fs/io_uring.c @@ -5715,8 +5715,7 @@ static void io_commit_sqring(struct io_ring_ctx *ctx) * used, it's important that those reads are done through READ_ONCE() to * prevent a re-load down the line. */ -static bool io_get_sqring(struct io_ring_ctx *ctx, struct io_kiocb *req, - const struct io_uring_sqe **sqe_ptr) +static const struct io_uring_sqe *io_get_sqe(struct io_ring_ctx *ctx) { u32 *sq_array = ctx->sq_array; unsigned head; @@ -5730,25 +5729,18 @@ static bool io_get_sqring(struct io_ring_ctx *ctx, struct io_kiocb *req, * though the application is the one updating it. */ head = READ_ONCE(sq_array[ctx->cached_sq_head & ctx->sq_mask]); - if (likely(head < ctx->sq_entries)) { - /* - * All io need record the previous position, if LINK vs DARIN, - * it can be used to mark the position of the first IO in the - * link list. - */ - req->sequence = ctx->cached_sq_head; - *sqe_ptr = &ctx->sq_sqes[head]; - req->opcode = READ_ONCE((*sqe_ptr)->opcode); - req->user_data = READ_ONCE((*sqe_ptr)->user_data); - ctx->cached_sq_head++; - return true; - } + if (likely(head < ctx->sq_entries)) + return &ctx->sq_sqes[head];
/* drop invalid entries */ - ctx->cached_sq_head++; ctx->cached_sq_dropped++; WRITE_ONCE(ctx->rings->sq_dropped, ctx->cached_sq_dropped); - return false; + return NULL; +} + +static inline void io_consume_sqe(struct io_ring_ctx *ctx) +{ + ctx->cached_sq_head++; }
static int io_submit_sqes(struct io_ring_ctx *ctx, unsigned int nr, @@ -5792,11 +5784,23 @@ static int io_submit_sqes(struct io_ring_ctx *ctx, unsigned int nr, submitted = -EAGAIN; break; } - if (!io_get_sqring(ctx, req, &sqe)) { + sqe = io_get_sqe(ctx); + if (!sqe) { __io_req_do_free(req); + io_consume_sqe(ctx); break; }
+ /* + * All io need record the previous position, if LINK vs DARIN, + * it can be used to mark the position of the first IO in the + * link list. + */ + req->sequence = ctx->cached_sq_head; + req->opcode = READ_ONCE(sqe->opcode); + req->user_data = READ_ONCE(sqe->user_data); + io_consume_sqe(ctx); + /* will complete beyond this point, count as submitted */ submitted++;
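The split matters because the two failure modes want different handling: an invalid ring entry should be consumed and skipped, while a failed request allocation should leave the entry in place for a later retry. A toy model of the peek/advance pair is sketched below; the sizes and fields are illustrative only.

#include <stdio.h>

#define SQ_ENTRIES 8
#define SQ_MASK    (SQ_ENTRIES - 1)

struct ring {
	unsigned int sq_array[SQ_ENTRIES];	/* indices into sqes[] */
	int sqes[SQ_ENTRIES];
	unsigned int cached_head;
};

/* peek only, no side effects: mirrors io_get_sqe() */
static int *get_sqe(struct ring *r)
{
	unsigned int head = r->sq_array[r->cached_head & SQ_MASK];

	return head < SQ_ENTRIES ? &r->sqes[head] : NULL;
}

/* advance only: mirrors io_consume_sqe() */
static void consume_sqe(struct ring *r)
{
	r->cached_head++;
}

int main(void)
{
	struct ring r = { .sq_array = { 3 }, .sqes = { [3] = 42 } };
	int *sqe = get_sqe(&r);

	if (sqe) {
		printf("sqe value %d at head %u\n", *sqe, r.cached_head);
		consume_sqe(&r);	/* advance only once we will submit */
	} else {
		consume_sqe(&r);	/* drop an invalid entry */
	}
	return 0;
}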
From: Pavel Begunkov asml.silence@gmail.com
mainline inclusion from mainline-5.7-rc1 commit b1e50e549b1372d9742509230dc4af7dd521d984 category: feature bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27 CVE: NA ---------------------------
Since io_get_sqe() was split into two stages, get and consume, get an sqe before allocating the io_kiocb, so no free_req*() is needed for the failure case; also inline back __io_req_do_free(), which has only one user.
Signed-off-by: Pavel Begunkov asml.silence@gmail.com Signed-off-by: Jens Axboe axboe@kernel.dk Signed-off-by: yangerkun yangerkun@huawei.com Reviewed-by: zhangyi (F) yi.zhang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- fs/io_uring.c | 24 +++++++++--------------- 1 file changed, 9 insertions(+), 15 deletions(-)
diff --git a/fs/io_uring.c b/fs/io_uring.c index 2349602fd013..d7dd8f3655fe 100644 --- a/fs/io_uring.c +++ b/fs/io_uring.c @@ -1348,14 +1348,6 @@ static inline void io_put_file(struct io_kiocb *req, struct file *file, fput(file); }
-static void __io_req_do_free(struct io_kiocb *req) -{ - if (likely(!io_is_fallback_req(req))) - kmem_cache_free(req_cachep, req); - else - clear_bit_unlock(0, (unsigned long *) req->ctx->fallback_req); -} - static void __io_req_aux_free(struct io_kiocb *req) { if (req->flags & REQ_F_NEED_CLEANUP) @@ -1386,7 +1378,10 @@ static void __io_free_req(struct io_kiocb *req) }
percpu_ref_put(&req->ctx->refs); - __io_req_do_free(req); + if (likely(!io_is_fallback_req(req))) + kmem_cache_free(req_cachep, req); + else + clear_bit_unlock(0, (unsigned long *) req->ctx->fallback_req); }
struct req_batch { @@ -5778,18 +5773,17 @@ static int io_submit_sqes(struct io_ring_ctx *ctx, unsigned int nr, struct io_kiocb *req; int err;
+ sqe = io_get_sqe(ctx); + if (unlikely(!sqe)) { + io_consume_sqe(ctx); + break; + } req = io_get_req(ctx, statep); if (unlikely(!req)) { if (!submitted) submitted = -EAGAIN; break; } - sqe = io_get_sqe(ctx); - if (!sqe) { - __io_req_do_free(req); - io_consume_sqe(ctx); - break; - }
/* * All io need record the previous position, if LINK vs DARIN,
From: Pavel Begunkov asml.silence@gmail.com
mainline inclusion from mainline-5.7-rc1 commit 0553b8bda8709c47863eab3fff7ac32ad04ca52b category: feature bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27 CVE: NA ---------------------------
io_get_req() does two different things: io_kiocb allocation and initialisation. Move the init part out of it and rename it to io_alloc_req(). It's simpler this way and also has better data locality.
Signed-off-by: Pavel Begunkov asml.silence@gmail.com Signed-off-by: Jens Axboe axboe@kernel.dk Signed-off-by: yangerkun yangerkun@huawei.com Reviewed-by: zhangyi (F) yi.zhang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- fs/io_uring.c | 53 ++++++++++++++++++++++++++------------------------- 1 file changed, 27 insertions(+), 26 deletions(-)
diff --git a/fs/io_uring.c b/fs/io_uring.c index d7dd8f3655fe..13d0be87bb2d 100644 --- a/fs/io_uring.c +++ b/fs/io_uring.c @@ -1287,8 +1287,8 @@ static struct io_kiocb *io_get_fallback_req(struct io_ring_ctx *ctx) return NULL; }
-static struct io_kiocb *io_get_req(struct io_ring_ctx *ctx, - struct io_submit_state *state) +static struct io_kiocb *io_alloc_req(struct io_ring_ctx *ctx, + struct io_submit_state *state) { gfp_t gfp = GFP_KERNEL | __GFP_NOWARN; struct io_kiocb *req; @@ -1321,22 +1321,9 @@ static struct io_kiocb *io_get_req(struct io_ring_ctx *ctx, req = state->reqs[state->free_reqs]; }
-got_it: - req->io = NULL; - req->file = NULL; - req->ctx = ctx; - req->flags = 0; - /* one is dropped after submission, the other at completion */ - refcount_set(&req->refs, 2); - req->task = NULL; - req->result = 0; - INIT_IO_WORK(&req->work, io_wq_submit_work); return req; fallback: - req = io_get_fallback_req(ctx); - if (req) - goto got_it; - return NULL; + return io_get_fallback_req(ctx); }
static inline void io_put_file(struct io_kiocb *req, struct file *file, @@ -5738,6 +5725,28 @@ static inline void io_consume_sqe(struct io_ring_ctx *ctx) ctx->cached_sq_head++; }
+static void io_init_req(struct io_ring_ctx *ctx, struct io_kiocb *req, + const struct io_uring_sqe *sqe) +{ + /* + * All io need record the previous position, if LINK vs DARIN, + * it can be used to mark the position of the first IO in the + * link list. + */ + req->sequence = ctx->cached_sq_head; + req->opcode = READ_ONCE(sqe->opcode); + req->user_data = READ_ONCE(sqe->user_data); + req->io = NULL; + req->file = NULL; + req->ctx = ctx; + req->flags = 0; + /* one is dropped after submission, the other at completion */ + refcount_set(&req->refs, 2); + req->task = NULL; + req->result = 0; + INIT_IO_WORK(&req->work, io_wq_submit_work); +} + static int io_submit_sqes(struct io_ring_ctx *ctx, unsigned int nr, struct file *ring_file, int ring_fd, struct mm_struct **mm, bool async) @@ -5778,23 +5787,15 @@ static int io_submit_sqes(struct io_ring_ctx *ctx, unsigned int nr, io_consume_sqe(ctx); break; } - req = io_get_req(ctx, statep); + req = io_alloc_req(ctx, statep); if (unlikely(!req)) { if (!submitted) submitted = -EAGAIN; break; }
- /* - * All io need record the previous position, if LINK vs DARIN, - * it can be used to mark the position of the first IO in the - * link list. - */ - req->sequence = ctx->cached_sq_head; - req->opcode = READ_ONCE(sqe->opcode); - req->user_data = READ_ONCE(sqe->user_data); + io_init_req(ctx, req, sqe); io_consume_sqe(ctx); - /* will complete beyond this point, count as submitted */ submitted++;
From: Pavel Begunkov asml.silence@gmail.com
mainline inclusion from mainline-5.7-rc1 commit 9c280f9087118099f50566e906b9d9d5a0fb4529 category: feature bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27 CVE: NA ---------------------------
Don't re-read userspace-shared sqe->flags; it can be exploited. sqe->flags is copied into req->flags in io_submit_sqe(), so check it there instead.
Signed-off-by: Pavel Begunkov asml.silence@gmail.com Signed-off-by: Jens Axboe axboe@kernel.dk
Conflicts: fs/io_uring.c [skip io_openat2_prep for commit cebdb98617ae ("io_uring: add support for IORING_OP_OPENAT2") is not merged] Signed-off-by: yangerkun yangerkun@huawei.com Reviewed-by: zhangyi (F) yi.zhang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- fs/io_uring.c | 18 +++++++----------- 1 file changed, 7 insertions(+), 11 deletions(-)
diff --git a/fs/io_uring.c b/fs/io_uring.c index 13d0be87bb2d..1a00bcd64616 100644 --- a/fs/io_uring.c +++ b/fs/io_uring.c @@ -2925,7 +2925,7 @@ static int io_openat_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe)
if (sqe->ioprio || sqe->buf_index) return -EINVAL; - if (sqe->flags & IOSQE_FIXED_FILE) + if (req->flags & REQ_F_FIXED_FILE) return -EBADF; if (req->flags & REQ_F_NEED_CLEANUP) return 0; @@ -3264,7 +3264,7 @@ static int io_statx_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe)
if (sqe->ioprio || sqe->buf_index) return -EINVAL; - if (sqe->flags & IOSQE_FIXED_FILE) + if (req->flags & REQ_F_FIXED_FILE) return -EBADF; if (req->flags & REQ_F_NEED_CLEANUP) return 0; @@ -3341,7 +3341,7 @@ static int io_close_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe) if (sqe->ioprio || sqe->off || sqe->addr || sqe->len || sqe->rw_flags || sqe->buf_index) return -EINVAL; - if (sqe->flags & IOSQE_FIXED_FILE) + if (req->flags & REQ_F_FIXED_FILE) return -EBADF;
req->close.fd = READ_ONCE(sqe->fd); @@ -5300,15 +5300,10 @@ static int io_file_get(struct io_submit_state *state, struct io_kiocb *req, }
static int io_req_set_file(struct io_submit_state *state, struct io_kiocb *req, - const struct io_uring_sqe *sqe) + int fd, unsigned int flags) { - unsigned flags; - int fd; bool fixed;
- flags = READ_ONCE(sqe->flags); - fd = READ_ONCE(sqe->fd); - if (!io_req_needs_file(req, fd)) return 0;
@@ -5550,7 +5545,7 @@ static bool io_submit_sqe(struct io_kiocb *req, const struct io_uring_sqe *sqe, { struct io_ring_ctx *ctx = req->ctx; unsigned int sqe_flags; - int ret, id; + int ret, id, fd;
sqe_flags = READ_ONCE(sqe->flags);
@@ -5581,7 +5576,8 @@ static bool io_submit_sqe(struct io_kiocb *req, const struct io_uring_sqe *sqe, IOSQE_ASYNC | IOSQE_FIXED_FILE | IOSQE_BUFFER_SELECT);
- ret = io_req_set_file(state, req, sqe); + fd = READ_ONCE(sqe->fd); + ret = io_req_set_file(state, req, fd, sqe_flags); if (unlikely(ret)) { err_req: io_cqring_add_event(req, ret);
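The hazard closed here is a classic double fetch: the SQ ring is mapped into userspace, so any field read twice can change between the reads. The defence is to snapshot shared fields once and validate only the snapshot. A schematic illustration follows, with volatile standing in for READ_ONCE() and an entirely made-up struct layout.

#include <stdio.h>

struct shared_sqe {
	volatile unsigned char flags;	/* userspace may flip this at any time */
};

struct request {
	unsigned int flags;		/* kernel-private snapshot */
};

static int prep(struct request *req, const struct shared_sqe *sqe)
{
	req->flags = sqe->flags;	/* single snapshot at submit time */

	/* every later check uses req->flags, never sqe->flags again */
	return (req->flags & 0x01) ? -1 : 0;
}

int main(void)
{
	struct shared_sqe sqe = { .flags = 0 };
	struct request req;

	printf("prep: %d\n", prep(&req, &sqe));
	return 0;
}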
From: Pavel Begunkov asml.silence@gmail.com
mainline inclusion from mainline-5.7-rc1 commit c398ecb3d611925e4a5411afdf7489914a5c0460 category: feature bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27 CVE: NA ---------------------------
If completion queue overflow occurs, __io_cqring_fill_event() will update req->cflags, which is in a union with req->work and happens to be aliased to req->work.fs. Following io_free_req() -> io_req_work_drop_env() may get a bunch of different problems (miscount fs->users, segfault, etc) on cleaning @fs.
Signed-off-by: Pavel Begunkov asml.silence@gmail.com Signed-off-by: Jens Axboe axboe@kernel.dk Signed-off-by: yangerkun yangerkun@huawei.com Reviewed-by: zhangyi (F) yi.zhang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- fs/io_uring.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/fs/io_uring.c b/fs/io_uring.c index 1a00bcd64616..091997a55009 100644 --- a/fs/io_uring.c +++ b/fs/io_uring.c @@ -609,6 +609,7 @@ struct io_kiocb { };
struct io_async_ctx *io; + int cflags; bool needs_fixed_file; u8 opcode;
@@ -639,7 +640,6 @@ struct io_kiocb { struct callback_head task_work; struct hlist_node hash_node; struct async_poll *apoll; - int cflags; }; struct io_wq_work work; };
From: Jens Axboe axboe@kernel.dk
mainline inclusion from mainline-5.7-rc1 commit 85faa7b8346ebef0606d2d0df6d3f8c76acb3654 category: feature bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27 CVE: NA ---------------------------
We can't reliably wait in io_ring_ctx_wait_and_kill(), since the task_works list isn't ordered (in fact it's LIFO ordered). We could either fix this with a separate task_works list for io_uring work, or just punt the wait-and-free to async context. This ensures that task_work that comes in while we're shutting down is processed correctly. If we don't go async, we could have work past the fput() work for the ring that depends on work that won't be executed until after we're done with the wait-and-free. But as this operation is blocking, it'll never get a chance to run.
This was reproduced with hundreds of thousands of sockets running memcached; we haven't been able to reproduce it synthetically.
Reported-by: Dan Melnic dmm@fb.com Signed-off-by: Jens Axboe axboe@kernel.dk Signed-off-by: yangerkun yangerkun@huawei.com Reviewed-by: zhangyi (F) yi.zhang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- fs/io_uring.c | 18 ++++++++++++++++-- 1 file changed, 16 insertions(+), 2 deletions(-)
diff --git a/fs/io_uring.c b/fs/io_uring.c index 091997a55009..5fec669db67a 100644 --- a/fs/io_uring.c +++ b/fs/io_uring.c @@ -326,6 +326,8 @@ struct io_ring_ctx { spinlock_t inflight_lock; struct list_head inflight_list; } ____cacheline_aligned_in_smp; + + struct work_struct exit_work; };
/* @@ -7206,6 +7208,18 @@ static int io_remove_personalities(int id, void *p, void *data) return 0; }
+static void io_ring_exit_work(struct work_struct *work) +{ + struct io_ring_ctx *ctx; + + ctx = container_of(work, struct io_ring_ctx, exit_work); + if (ctx->rings) + io_cqring_overflow_flush(ctx, true); + + wait_for_completion(&ctx->completions[0]); + io_ring_ctx_free(ctx); +} + static void io_ring_ctx_wait_and_kill(struct io_ring_ctx *ctx) { mutex_lock(&ctx->uring_lock); @@ -7233,8 +7247,8 @@ static void io_ring_ctx_wait_and_kill(struct io_ring_ctx *ctx) if (ctx->rings) io_cqring_overflow_flush(ctx, true); idr_for_each(&ctx->personality_idr, io_remove_personalities, ctx); - wait_for_completion(&ctx->completions[0]); - io_ring_ctx_free(ctx); + INIT_WORK(&ctx->exit_work, io_ring_exit_work); + queue_work(system_wq, &ctx->exit_work); }
static int io_uring_release(struct inode *inode, struct file *file)
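The shape of the fix is general: a teardown path must not block on completions that may themselves be queued behind it; instead, hand the wait-and-free to another execution context and return. The user-space analogue below (compile with -pthread) uses a thread in the role of system_wq, with a spin-wait merely standing in for wait_for_completion().

#include <pthread.h>
#include <stdio.h>

struct ctx {
	pthread_t exit_worker;
	int refs_done;			/* stands in for completions[0] */
};

static void *exit_work(void *arg)
{
	struct ctx *ctx = arg;

	while (!__atomic_load_n(&ctx->refs_done, __ATOMIC_ACQUIRE))
		;			/* "wait_for_completion(...)" */
	printf("ctx freed from async context\n");
	return NULL;
}

int main(void)
{
	struct ctx ctx = { .refs_done = 0 };

	/* release path: queue the work instead of waiting inline */
	pthread_create(&ctx.exit_worker, NULL, exit_work, &ctx);

	/* ... remaining references complete elsewhere ... */
	__atomic_store_n(&ctx.refs_done, 1, __ATOMIC_RELEASE);

	pthread_join(ctx.exit_worker, NULL);
	return 0;
}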
From: Pavel Begunkov asml.silence@gmail.com
mainline inclusion from mainline-5.7-rc2 commit dccc587f6c07ccc734588226fdf62f685558e89f category: feature bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27 CVE: NA ---------------------------
If io_submit_sqes() can't grab an mm, it fails and exits right away. There is no need to track the fact of the failure. Remove @mm_fault.
Signed-off-by: Pavel Begunkov asml.silence@gmail.com Signed-off-by: Jens Axboe axboe@kernel.dk Signed-off-by: yangerkun yangerkun@huawei.com Reviewed-by: zhangyi (F) yi.zhang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- fs/io_uring.c | 4 +--- 1 file changed, 1 insertion(+), 3 deletions(-)
diff --git a/fs/io_uring.c b/fs/io_uring.c index 5fec669db67a..8306eb7aff79 100644 --- a/fs/io_uring.c +++ b/fs/io_uring.c @@ -5752,7 +5752,6 @@ static int io_submit_sqes(struct io_ring_ctx *ctx, unsigned int nr, struct io_submit_state state, *statep = NULL; struct io_kiocb *link = NULL; int i, submitted = 0; - bool mm_fault = false;
/* if we have a backlog and couldn't flush it all, return BUSY */ if (test_bit(0, &ctx->sq_check_overflow)) { @@ -5806,8 +5805,7 @@ static int io_submit_sqes(struct io_ring_ctx *ctx, unsigned int nr, }
if (io_op_defs[req->opcode].needs_mm && !*mm) { - mm_fault = mm_fault || !mmget_not_zero(ctx->sqo_mm); - if (unlikely(mm_fault)) { + if (unlikely(!mmget_not_zero(ctx->sqo_mm))) { err = -EFAULT; goto fail_req; }
From: Pavel Begunkov asml.silence@gmail.com
mainline inclusion from mainline-5.7-rc2 commit bf9c2f1cdcc718b6d2d41172f6ca005fe22cc7ff category: feature bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27 CVE: NA ---------------------------
As a preparation for extracting the request init bits, remove the self-coded mm tracking from io_submit_sqes() and rely on current->mm instead. It's more convenient than passing this piece of state around to other functions.
Signed-off-by: Pavel Begunkov asml.silence@gmail.com Signed-off-by: Jens Axboe axboe@kernel.dk Signed-off-by: yangerkun yangerkun@huawei.com Reviewed-by: zhangyi (F) yi.zhang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- fs/io_uring.c | 37 ++++++++++++++++--------------------- 1 file changed, 16 insertions(+), 21 deletions(-)
diff --git a/fs/io_uring.c b/fs/io_uring.c index 8306eb7aff79..521b67216a74 100644 --- a/fs/io_uring.c +++ b/fs/io_uring.c @@ -5746,8 +5746,7 @@ static void io_init_req(struct io_ring_ctx *ctx, struct io_kiocb *req, }
static int io_submit_sqes(struct io_ring_ctx *ctx, unsigned int nr, - struct file *ring_file, int ring_fd, - struct mm_struct **mm, bool async) + struct file *ring_file, int ring_fd, bool async) { struct io_submit_state state, *statep = NULL; struct io_kiocb *link = NULL; @@ -5804,13 +5803,12 @@ static int io_submit_sqes(struct io_ring_ctx *ctx, unsigned int nr, break; }
- if (io_op_defs[req->opcode].needs_mm && !*mm) { + if (io_op_defs[req->opcode].needs_mm && !current->mm) { if (unlikely(!mmget_not_zero(ctx->sqo_mm))) { err = -EFAULT; goto fail_req; } use_mm(ctx->sqo_mm); - *mm = ctx->sqo_mm; }
req->needs_fixed_file = async; @@ -5836,10 +5834,19 @@ static int io_submit_sqes(struct io_ring_ctx *ctx, unsigned int nr, return submitted; }
+static inline void io_sq_thread_drop_mm(struct io_ring_ctx *ctx) +{ + struct mm_struct *mm = current->mm; + + if (mm) { + unuse_mm(mm); + mmput(mm); + } +} + static int io_sq_thread(void *data) { struct io_ring_ctx *ctx = data; - struct mm_struct *cur_mm = NULL; const struct cred *old_cred; mm_segment_t old_fs; DEFINE_WAIT(wait); @@ -5880,11 +5887,7 @@ static int io_sq_thread(void *data) * adding ourselves to the waitqueue, as the unuse/drop * may sleep. */ - if (cur_mm) { - unuse_mm(cur_mm); - mmput(cur_mm); - cur_mm = NULL; - } + io_sq_thread_drop_mm(ctx);
/* * We're polling. If we're within the defined idle @@ -5948,7 +5951,7 @@ static int io_sq_thread(void *data) }
mutex_lock(&ctx->uring_lock); - ret = io_submit_sqes(ctx, to_submit, NULL, -1, &cur_mm, true); + ret = io_submit_sqes(ctx, to_submit, NULL, -1, true); mutex_unlock(&ctx->uring_lock); timeout = jiffies + ctx->sq_thread_idle; } @@ -5957,10 +5960,7 @@ static int io_sq_thread(void *data) task_work_run();
set_fs(old_fs); - if (cur_mm) { - unuse_mm(cur_mm); - mmput(cur_mm); - } + io_sq_thread_drop_mm(ctx); revert_creds(old_cred);
kthread_parkme(); @@ -7442,13 +7442,8 @@ SYSCALL_DEFINE6(io_uring_enter, unsigned int, fd, u32, to_submit, wake_up(&ctx->sqo_wait); submitted = to_submit; } else if (to_submit) { - struct mm_struct *cur_mm; - mutex_lock(&ctx->uring_lock); - /* already have mm, so io_submit_sqes() won't try to grab it */ - cur_mm = ctx->sqo_mm; - submitted = io_submit_sqes(ctx, to_submit, f.file, fd, - &cur_mm, false); + submitted = io_submit_sqes(ctx, to_submit, f.file, fd, false); mutex_unlock(&ctx->uring_lock);
if (submitted != to_submit)
From: Xiaoguang Wang xiaoguang.wang@linux.alibaba.com
mainline inclusion from mainline-5.7-rc3 commit 44575a67314b3768d4415252271e8f60c5c70118 category: feature bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27 CVE: NA ---------------------------
When testing io_uring IORING_FEAT_FAST_POLL feature, I got below panic: BUG: kernel NULL pointer dereference, address: 0000000000000030 PGD 0 P4D 0 Oops: 0000 [#1] SMP PTI CPU: 5 PID: 2154 Comm: io_uring_echo_s Not tainted 5.6.0+ #359 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.11.1-0-g0551a4be2c-prebuilt.qemu-project.org 04/01/2014 RIP: 0010:io_wq_submit_work+0xf/0xa0 Code: ff ff ff be 02 00 00 00 e8 ae c9 19 00 e9 58 ff ff ff 66 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 41 54 49 89 fc 55 53 48 8b 2f <8b> 45 30 48 8d 9d 48 ff ff ff 25 01 01 00 00 83 f8 01 75 07 eb 2a RSP: 0018:ffffbef543e93d58 EFLAGS: 00010286 RAX: ffffffff84364f50 RBX: ffffa3eb50f046b8 RCX: 0000000000000000 RDX: ffffa3eb0efc1840 RSI: 0000000000000006 RDI: ffffa3eb50f046b8 RBP: 0000000000000000 R08: 00000000fffd070d R09: 0000000000000000 R10: 0000000000000000 R11: 0000000000000000 R12: ffffa3eb50f046b8 R13: ffffa3eb0efc2088 R14: ffffffff85b69be0 R15: ffffa3eb0effa4b8 FS: 00007fe9f69cc4c0(0000) GS:ffffa3eb5ef40000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 0000000000000030 CR3: 0000000020410000 CR4: 00000000000006e0 Call Trace: task_work_run+0x6d/0xa0 do_exit+0x39a/0xb80 ? get_signal+0xfe/0xbc0 do_group_exit+0x47/0xb0 get_signal+0x14b/0xbc0 ? __x64_sys_io_uring_enter+0x1b7/0x450 do_signal+0x2c/0x260 ? __x64_sys_io_uring_enter+0x228/0x450 exit_to_usermode_loop+0x87/0xf0 do_syscall_64+0x209/0x230 entry_SYSCALL_64_after_hwframe+0x49/0xb3 RIP: 0033:0x7fe9f64f8df9 Code: Bad RIP value.
task_work_run calls io_wq_submit_work unexpectedly; it's obvious that struct callback_head's func member has been changed. After looking into the code, I found this issue is again due to the union definition:

union {
	/*
	 * Only commands that never go async can use the below fields,
	 * obviously. Right now only IORING_OP_POLL_ADD uses them, and
	 * async armed poll handlers for regular commands. The latter
	 * restore the work, if needed.
	 */
	struct {
		struct callback_head	task_work;
		struct hlist_node	hash_node;
		struct async_poll	*apoll;
	};
	struct io_wq_work	work;
};
When task_work_run has multiple works to execute, the work that calls io_poll_remove_all() always restores req->work for non-poll requests. But if a non-poll request has already been added to a new callback_head, a subsequent callback will call io_async_task_func() to handle it, which means we should not restore ->work for such a non-poll request. Meanwhile, in io_async_task_func() we should drop the submission ref when the req has been canceled.
Fix both issues.
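For illustration, here is a minimal standalone sketch (not the kernel code; the struct layouts are simplified stand-ins) showing how writing one union member clobbers the other, which is exactly how the restore of ->work can destroy a pending task_work callback:

#include <stdio.h>
#include <string.h>

/* simplified stand-ins for the kernel types involved */
struct callback_head { void (*func)(void); };
struct io_wq_work { void (*func)(void); unsigned long pad[4]; };

struct req {
	union {
		struct callback_head task_work;
		struct io_wq_work work;
	};
};

static void pending_cb(void) { }

int main(void)
{
	struct req r;
	struct io_wq_work saved = { .func = NULL };

	r.task_work.func = pending_cb;			/* task_work armed */
	memcpy(&r.work, &saved, sizeof(r.work));	/* "restore" ->work */
	/* the armed task_work callback has been overwritten */
	printf("task_work.func = %p (was %p)\n",
	       (void *)r.task_work.func, (void *)pending_cb);
	return 0;
}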
Fixes: b1f573bd15fd ("io_uring: restore req->work when canceling poll request") Signed-off-by: Xiaoguang Wang xiaoguang.wang@linux.alibaba.com
Use io_double_put_req()
Signed-off-by: Jens Axboe axboe@kernel.dk Signed-off-by: yangerkun yangerkun@huawei.com Reviewed-by: zhangyi (F) yi.zhang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- fs/io_uring.c | 10 +++++----- 1 file changed, 5 insertions(+), 5 deletions(-)
diff --git a/fs/io_uring.c b/fs/io_uring.c index e8157c223164..b14137bacc3b 100644 --- a/fs/io_uring.c +++ b/fs/io_uring.c @@ -4146,17 +4146,17 @@ static void io_async_task_func(struct callback_head *cb)
spin_unlock_irq(&ctx->completion_lock);
+ /* restore ->work in case we need to retry again */ + memcpy(&req->work, &apoll->work, sizeof(req->work)); + if (canceled) { kfree(apoll); io_cqring_ev_posted(ctx); req_set_fail_links(req); - io_put_req(req); + io_double_put_req(req); return; }
- /* restore ->work in case we need to retry again */ - memcpy(&req->work, &apoll->work, sizeof(req->work)); - __set_current_state(TASK_RUNNING); mutex_lock(&ctx->uring_lock); __io_queue_sqe(req, NULL); @@ -4315,7 +4315,7 @@ static bool io_poll_remove_one(struct io_kiocb *req)
hash_del(&req->hash_node);
- if (apoll) { + if (do_complete && apoll) { /* * restore ->work because we need to call io_req_work_drop_env. */
From: Jens Axboe axboe@kernel.dk
mainline inclusion from mainline-5.7-rc4 commit 5b0bbee4732cbd58aa98213d4c11a366356bba3d category: feature bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27 CVE: NA ---------------------------
Clay reports that OP_STATX fails for a test case with a valid fd and empty path:
-- Test 0: statx:fd 3: SUCCEED, file mode 100755
-- Test 1: statx:path ./uring_statx: SUCCEED, file mode 100755
-- Test 2: io_uring_statx:fd 3: FAIL, errno 9: Bad file descriptor
-- Test 3: io_uring_statx:path ./uring_statx: SUCCEED, file mode 100755
This is due to statx not grabbing the process file table, hence we can't look up the fd in async context. If the fd is valid, ensure that we grab the file table so we can grab the file from async context.
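As a usage illustration, here is a minimal liburing sketch of the fd-plus-empty-path case this patch fixes (assuming a liburing version that provides io_uring_prep_statx(); uring_statx_fd() is a hypothetical helper):

#define _GNU_SOURCE
#include <fcntl.h>		/* AT_EMPTY_PATH */
#include <liburing.h>
#include <linux/stat.h>		/* struct statx, STATX_ALL */

/* statx on an already-open fd: an empty path plus AT_EMPTY_PATH makes
 * the kernel operate on the fd itself */
int uring_statx_fd(struct io_uring *ring, int fd, struct statx *stx)
{
	struct io_uring_sqe *sqe = io_uring_get_sqe(ring);
	struct io_uring_cqe *cqe;
	int ret;

	io_uring_prep_statx(sqe, fd, "", AT_EMPTY_PATH, STATX_ALL, stx);
	io_uring_submit(ring);

	ret = io_uring_wait_cqe(ring, &cqe);
	if (ret < 0)
		return ret;
	ret = cqe->res;	/* -EBADF before this fix, 0 on success after */
	io_uring_cqe_seen(ring, cqe);
	return ret;
}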
Cc: stable@vger.kernel.org # v5.6 Reported-by: Clay Harris bugs@claycon.org Signed-off-by: Jens Axboe axboe@kernel.dk Signed-off-by: yangerkun yangerkun@huawei.com Reviewed-by: zhangyi (F) yi.zhang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- fs/io_uring.c | 12 ++++++++++-- 1 file changed, 10 insertions(+), 2 deletions(-)
diff --git a/fs/io_uring.c b/fs/io_uring.c index b14137bacc3b..e99a4d0dcba7 100644 --- a/fs/io_uring.c +++ b/fs/io_uring.c @@ -525,6 +525,7 @@ enum { REQ_F_OVERFLOW_BIT, REQ_F_POLLED_BIT, REQ_F_BUFFER_SELECTED_BIT, + REQ_F_NO_FILE_TABLE_BIT,
/* not a real bit, just to check we're not overflowing the space */ __REQ_F_LAST_BIT, @@ -578,6 +579,8 @@ enum { REQ_F_POLLED = BIT(REQ_F_POLLED_BIT), /* buffer already selected */ REQ_F_BUFFER_SELECTED = BIT(REQ_F_BUFFER_SELECTED_BIT), + /* doesn't need file table for this request */ + REQ_F_NO_FILE_TABLE = BIT(REQ_F_NO_FILE_TABLE_BIT), };
struct async_poll { @@ -800,6 +803,7 @@ static const struct io_op_def io_op_defs[] = { .needs_file = 1, .fd_non_neg = 1, .needs_fs = 1, + .file_table = 1, }, [IORING_OP_READ] = { .needs_mm = 1, @@ -3301,8 +3305,12 @@ static int io_statx(struct io_kiocb *req, bool force_nonblock) struct kstat stat; int ret;
- if (force_nonblock) + if (force_nonblock) { + /* only need file table for an actual valid fd */ + if (ctx->dfd == -1 || ctx->dfd == AT_FDCWD) + req->flags |= REQ_F_NO_FILE_TABLE; return -EAGAIN; + }
if (vfs_stat_set_lookup_flags(&lookup_flags, ctx->flags)) return -EINVAL; @@ -5363,7 +5371,7 @@ static int io_grab_files(struct io_kiocb *req) int ret = -EBADF; struct io_ring_ctx *ctx = req->ctx;
- if (req->work.files) + if (req->work.files || (req->flags & REQ_F_NO_FILE_TABLE)) return 0; if (!ctx->ring_file) return -EBADF;
From: Jens Axboe axboe@kernel.dk
mainline inclusion from mainline-5.7-rc4 commit af197f50ac53fff1241598c73ca606754a3bb808 category: feature bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27 CVE: NA ---------------------------
We can have files like eventfd where it's perfectly fine to do poll-based retry on them; right now io_file_supports_async() doesn't take that into account.
Pass in data direction and check the f_op instead of just always needing an async worker.
Signed-off-by: Jens Axboe axboe@kernel.dk Signed-off-by: yangerkun yangerkun@huawei.com Reviewed-by: zhangyi (F) yi.zhang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- fs/io_uring.c | 20 +++++++++++++------- 1 file changed, 13 insertions(+), 7 deletions(-)
diff --git a/fs/io_uring.c b/fs/io_uring.c index e99a4d0dcba7..97df63ded1ae 100644 --- a/fs/io_uring.c +++ b/fs/io_uring.c @@ -2032,7 +2032,7 @@ static struct file *__io_file_get(struct io_submit_state *state, int fd) * any file. For now, just ensure that anything potentially problematic is done * inline. */ -static bool io_file_supports_async(struct file *file) +static bool io_file_supports_async(struct file *file, int rw) { umode_t mode = file_inode(file)->i_mode;
@@ -2041,7 +2041,13 @@ static bool io_file_supports_async(struct file *file) if (S_ISREG(mode) && file->f_op != &io_uring_fops) return true;
- return false; + if (!(file->f_mode & FMODE_NOWAIT)) + return false; + + if (rw == READ) + return file->f_op->read_iter != NULL; + + return file->f_op->write_iter != NULL; }
static int io_prep_rw(struct io_kiocb *req, const struct io_uring_sqe *sqe, @@ -2569,7 +2575,7 @@ static int io_read(struct io_kiocb *req, bool force_nonblock) * If the file doesn't support async, mark it as REQ_F_MUST_PUNT so * we know to async punt it even if it was opened O_NONBLOCK */ - if (force_nonblock && !io_file_supports_async(req->file)) + if (force_nonblock && !io_file_supports_async(req->file, READ)) goto copy_iov;
iov_count = iov_iter_count(&iter); @@ -2660,7 +2666,7 @@ static int io_write(struct io_kiocb *req, bool force_nonblock) * If the file doesn't support async, mark it as REQ_F_MUST_PUNT so * we know to async punt it even if it was opened O_NONBLOCK */ - if (force_nonblock && !io_file_supports_async(req->file)) + if (force_nonblock && !io_file_supports_async(req->file, WRITE)) goto copy_iov;
/* file path doesn't support NOWAIT for non-direct_IO */ @@ -2754,11 +2760,11 @@ static int io_splice_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe) return 0; }
-static bool io_splice_punt(struct file *file) +static bool io_splice_punt(struct file *file, int rw) { if (get_pipe_info(file)) return false; - if (!io_file_supports_async(file)) + if (!io_file_supports_async(file, rw)) return true; return !(file->f_flags & O_NONBLOCK); } @@ -2773,7 +2779,7 @@ static int io_splice(struct io_kiocb *req, bool force_nonblock) long ret;
if (force_nonblock) { - if (io_splice_punt(in) || io_splice_punt(out)) + if (io_splice_punt(in, READ) || io_splice_punt(out, WRITE)) return -EAGAIN; flags |= SPLICE_F_NONBLOCK; }
From: Jens Axboe axboe@kernel.dk
mainline inclusion from mainline-5.7-rc4 commit 490e89676a523c688343d6cb8ca5f5dc476414df category: feature bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27 CVE: NA ---------------------------
We do blocking retry from our poll handler if the file supports polled notifications. Only mark the request as needing an async worker if we can't poll for it.
Signed-off-by: Jens Axboe axboe@kernel.dk Signed-off-by: yangerkun yangerkun@huawei.com Reviewed-by: zhangyi (F) yi.zhang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- fs/io_uring.c | 6 ++++-- 1 file changed, 4 insertions(+), 2 deletions(-)
diff --git a/fs/io_uring.c b/fs/io_uring.c index 97df63ded1ae..50e250ae87a2 100644 --- a/fs/io_uring.c +++ b/fs/io_uring.c @@ -2598,7 +2598,8 @@ static int io_read(struct io_kiocb *req, bool force_nonblock) if (ret) goto out_free; /* any defer here is final, must blocking retry */ - if (!(req->flags & REQ_F_NOWAIT)) + if (!(req->flags & REQ_F_NOWAIT) && + !file_can_poll(req->file)) req->flags |= REQ_F_MUST_PUNT; return -EAGAIN; } @@ -2720,7 +2721,8 @@ static int io_write(struct io_kiocb *req, bool force_nonblock) if (ret) goto out_free; /* any defer here is final, must blocking retry */ - req->flags |= REQ_F_MUST_PUNT; + if (!file_can_poll(req->file)) + req->flags |= REQ_F_MUST_PUNT; return -EAGAIN; } }
From: Bijan Mottahedeh bijan.mottahedeh@oracle.com
mainline inclusion from mainline-5.7-rc4 commit dd461af65946de060bff2dab08a63676d2731afe category: feature bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27 CVE: NA ---------------------------
Use ctx->fallback_req address for test_and_set_bit_lock() and clear_bit_unlock().
Signed-off-by: Bijan Mottahedeh bijan.mottahedeh@oracle.com Signed-off-by: Jens Axboe axboe@kernel.dk Signed-off-by: yangerkun yangerkun@huawei.com Reviewed-by: zhangyi (F) yi.zhang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- fs/io_uring.c | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-)
diff --git a/fs/io_uring.c b/fs/io_uring.c index 50e250ae87a2..e503f6332a90 100644 --- a/fs/io_uring.c +++ b/fs/io_uring.c @@ -1289,7 +1289,7 @@ static struct io_kiocb *io_get_fallback_req(struct io_ring_ctx *ctx) struct io_kiocb *req;
req = ctx->fallback_req; - if (!test_and_set_bit_lock(0, (unsigned long *) ctx->fallback_req)) + if (!test_and_set_bit_lock(0, (unsigned long *) &ctx->fallback_req)) return req;
return NULL; @@ -1376,7 +1376,7 @@ static void __io_free_req(struct io_kiocb *req) if (likely(!io_is_fallback_req(req))) kmem_cache_free(req_cachep, req); else - clear_bit_unlock(0, (unsigned long *) req->ctx->fallback_req); + clear_bit_unlock(0, (unsigned long *) &req->ctx->fallback_req); }
struct req_batch {
From: Xiaoguang Wang xiaoguang.wang@linux.alibaba.com
mainline inclusion from mainline-5.7-rc4 commit 3fd44c86711f71156b586c22b0495c58f69358bb category: feature bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27 CVE: NA ---------------------------
While working on making io_uring's sqpoll mode support syscalls that need struct files_struct, I got a cpu soft lockup in io_ring_ctx_wait_and_kill():
	while (ctx->sqo_thread && !wq_has_sleeper(&ctx->sqo_wait))
		cpu_relax();
The above loop never gets a chance to exit: preemption isn't enabled in the kernel, and the context calling io_ring_ctx_wait_and_kill() and io_sq_thread() run on the same cpu. If io_sq_thread() calls cond_resched() to yield the cpu and another context then enters the above loop, io_sq_thread() stays in the runqueue but never gets to run again.

Using cond_resched() in the loop fixes this issue.
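A simplified sketch of the difference (kernel-style fragment with a hypothetical condition helper, not the actual patch):

	/* cpu_relax() only hints the CPU pipeline; on a non-preemptible
	 * kernel, the task that must set the condition never gets to run
	 * if it shares this CPU with the spinner */
	while (!condition_set_by_other_task())
		cpu_relax();		/* can spin forever */

	/* cond_resched() voluntarily yields, so the setter can run */
	while (!condition_set_by_other_task())
		cond_resched();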
Reported-by: syzbot+66243bb7126c410cefe6@syzkaller.appspotmail.com Signed-off-by: Xiaoguang Wang xiaoguang.wang@linux.alibaba.com Signed-off-by: Jens Axboe axboe@kernel.dk Signed-off-by: yangerkun yangerkun@huawei.com Reviewed-by: zhangyi (F) yi.zhang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- fs/io_uring.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/fs/io_uring.c b/fs/io_uring.c index e503f6332a90..1cfc81208079 100644 --- a/fs/io_uring.c +++ b/fs/io_uring.c @@ -7278,7 +7278,7 @@ static void io_ring_ctx_wait_and_kill(struct io_ring_ctx *ctx) * it could cause shutdown to hang. */ while (ctx->sqo_thread && !wq_has_sleeper(&ctx->sqo_wait)) - cpu_relax(); + cond_resched();
io_kill_timeouts(ctx); io_poll_remove_all(ctx);
From: Pavel Begunkov asml.silence@gmail.com
mainline inclusion from mainline-5.7-rc4 commit 7759a0bfadceef3910d0e50f86d63b6ed58b4e70 category: feature bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27 CVE: NA ---------------------------
[ 40.179474] refcount_t: underflow; use-after-free. [ 40.179499] WARNING: CPU: 6 PID: 1848 at lib/refcount.c:28 refcount_warn_saturate+0xae/0xf0 ... [ 40.179612] RIP: 0010:refcount_warn_saturate+0xae/0xf0 [ 40.179617] Code: 28 44 0a 01 01 e8 d7 01 c2 ff 0f 0b 5d c3 80 3d 15 44 0a 01 00 75 91 48 c7 c7 b8 f5 75 be c6 05 05 44 0a 01 01 e8 b7 01 c2 ff <0f> 0b 5d c3 80 3d f3 43 0a 01 00 0f 85 6d ff ff ff 48 c7 c7 10 f6 [ 40.179619] RSP: 0018:ffffb252423ebe18 EFLAGS: 00010286 [ 40.179623] RAX: 0000000000000000 RBX: ffff98d65e929400 RCX: 0000000000000000 [ 40.179625] RDX: 0000000000000001 RSI: 0000000000000086 RDI: 00000000ffffffff [ 40.179627] RBP: ffffb252423ebe18 R08: 0000000000000001 R09: 000000000000055d [ 40.179629] R10: 0000000000000c8c R11: 0000000000000001 R12: 0000000000000000 [ 40.179631] R13: ffff98d68c434400 R14: ffff98d6a9cbaa20 R15: ffff98d6a609ccb8 [ 40.179634] FS: 0000000000000000(0000) GS:ffff98d6af580000(0000) knlGS:0000000000000000 [ 40.179636] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 40.179638] CR2: 00000000033e3194 CR3: 000000006480a003 CR4: 00000000003606e0 [ 40.179641] Call Trace: [ 40.179652] io_put_req+0x36/0x40 [ 40.179657] io_free_work+0x15/0x20 [ 40.179661] io_worker_handle_work+0x2f5/0x480 [ 40.179667] io_wqe_worker+0x2a9/0x360 [ 40.179674] ? _raw_spin_unlock_irqrestore+0x24/0x40 [ 40.179681] kthread+0x12c/0x170 [ 40.179685] ? io_worker_handle_work+0x480/0x480 [ 40.179690] ? kthread_park+0x90/0x90 [ 40.179695] ret_from_fork+0x35/0x40 [ 40.179702] ---[ end trace 85027405f00110aa ]---
An opcode handler must never put the submission ref, but that's what io_sync_file_range_finish() does. Use io_steal_work() there instead.
Signed-off-by: Pavel Begunkov asml.silence@gmail.com Signed-off-by: Jens Axboe axboe@kernel.dk Signed-off-by: yangerkun yangerkun@huawei.com Reviewed-by: zhangyi (F) yi.zhang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- fs/io_uring.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/fs/io_uring.c b/fs/io_uring.c index 1cfc81208079..3652a2e49ccc 100644 --- a/fs/io_uring.c +++ b/fs/io_uring.c @@ -3464,7 +3464,7 @@ static void io_sync_file_range_finish(struct io_wq_work **workptr) if (io_req_cancelled(req)) return; __io_sync_file_range(req); - io_put_req(req); /* put submission ref */ + io_steal_work(req, workptr); }
static int io_sync_file_range(struct io_kiocb *req, bool force_nonblock)
From: Pavel Begunkov asml.silence@gmail.com
mainline inclusion from mainline-5.7-rc4 commit 4ee3631451c9a62e6b6bc7ee51fb9a5b34e33509 category: feature bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27 CVE: NA ---------------------------
io_req_defer() does double-checked locking. Use the proper helper for that, i.e. list_empty_careful().
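A simplified sketch of the double-checked locking pattern (condensed from the io_uring code, not a literal copy): the lockless fast path uses list_empty_careful(), which is safe to call without the lock, and the slow path re-validates under the lock before deciding to defer.

	/* lockless fast path */
	if (!req_need_defer(req) && list_empty_careful(&ctx->defer_list))
		return 0;

	spin_lock_irq(&ctx->completion_lock);
	if (!req_need_defer(req) && list_empty(&ctx->defer_list)) {
		/* raced with the list draining; nothing to defer behind */
		spin_unlock_irq(&ctx->completion_lock);
		return 0;
	}
	/* ... add req to ctx->defer_list under the lock ... */
	spin_unlock_irq(&ctx->completion_lock);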
Signed-off-by: Pavel Begunkov asml.silence@gmail.com Signed-off-by: Jens Axboe axboe@kernel.dk Signed-off-by: yangerkun yangerkun@huawei.com Reviewed-by: zhangyi (F) yi.zhang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- fs/io_uring.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/fs/io_uring.c b/fs/io_uring.c index 3652a2e49ccc..1468a79acf3f 100644 --- a/fs/io_uring.c +++ b/fs/io_uring.c @@ -4974,7 +4974,7 @@ static int io_req_defer(struct io_kiocb *req, const struct io_uring_sqe *sqe) int ret;
/* Still need defer if there is pending req in defer list. */ - if (!req_need_defer(req) && list_empty(&ctx->defer_list)) + if (!req_need_defer(req) && list_empty_careful(&ctx->defer_list)) return 0;
if (!req->io && io_alloc_async_ctx(req))
From: Pavel Begunkov asml.silence@gmail.com
mainline inclusion from mainline-5.7-rc4 commit 2fb3e82284fca40afbde5351907f0a5b3be717f9 category: feature bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27 CVE: NA ---------------------------
A nonblocking do_splice() may still wait for some time on an inode mutex. Let's play it safe and always punt it async.
Reported-by: Jens Axboe axboe@kernel.dk Signed-off-by: Pavel Begunkov asml.silence@gmail.com Signed-off-by: Jens Axboe axboe@kernel.dk Signed-off-by: yangerkun yangerkun@huawei.com Reviewed-by: zhangyi (F) yi.zhang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- fs/io_uring.c | 16 ++-------------- 1 file changed, 2 insertions(+), 14 deletions(-)
diff --git a/fs/io_uring.c b/fs/io_uring.c index 1468a79acf3f..b80bbf8fc0ea 100644 --- a/fs/io_uring.c +++ b/fs/io_uring.c @@ -2762,15 +2762,6 @@ static int io_splice_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe) return 0; }
-static bool io_splice_punt(struct file *file, int rw) -{ - if (get_pipe_info(file)) - return false; - if (!io_file_supports_async(file, rw)) - return true; - return !(file->f_flags & O_NONBLOCK); -} - static int io_splice(struct io_kiocb *req, bool force_nonblock) { struct io_splice *sp = &req->splice; @@ -2780,11 +2771,8 @@ static int io_splice(struct io_kiocb *req, bool force_nonblock) loff_t *poff_in, *poff_out; long ret;
- if (force_nonblock) { - if (io_splice_punt(in, READ) || io_splice_punt(out, WRITE)) - return -EAGAIN; - flags |= SPLICE_F_NONBLOCK; - } + if (force_nonblock) + return -EAGAIN;
poff_in = (sp->off_in == -1) ? NULL : &sp->off_in; poff_out = (sp->off_out == -1) ? NULL : &sp->off_out;
From: Xiaoguang Wang xiaoguang.wang@linux.alibaba.com
mainline inclusion from mainline-5.7-rc5 commit d8f1b9716cfd1a1f74c0fedad40c5f65a25aa208 category: feature bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27 CVE: NA ---------------------------
The prepare_to_wait() and finish_wait() calls in io_uring_cancel_files() are mismatched. Currently I don't see any issues caused by this bug; I just found it while reading the code.
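For reference, the canonical pairing looks like this (a generic sketch with a hypothetical waitqueue 'wq' and condition, not the io_uring code): every prepare_to_wait() is matched by a finish_wait() before the wait entry is reused or goes out of scope.

	DEFINE_WAIT(wait);

	while (!condition) {
		prepare_to_wait(&wq, &wait, TASK_UNINTERRUPTIBLE);
		if (!condition)
			schedule();
		finish_wait(&wq, &wait);
	}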
Signed-off-by: Xiaoguang Wang xiaoguang.wang@linux.alibaba.com Signed-off-by: Jens Axboe axboe@kernel.dk Signed-off-by: yangerkun yangerkun@huawei.com Reviewed-by: zhangyi (F) yi.zhang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- fs/io_uring.c | 9 ++++----- 1 file changed, 4 insertions(+), 5 deletions(-)
diff --git a/fs/io_uring.c b/fs/io_uring.c index b80bbf8fc0ea..1e69744b9ed0 100644 --- a/fs/io_uring.c +++ b/fs/io_uring.c @@ -7295,11 +7295,9 @@ static int io_uring_release(struct inode *inode, struct file *file) static void io_uring_cancel_files(struct io_ring_ctx *ctx, struct files_struct *files) { - struct io_kiocb *req; - DEFINE_WAIT(wait); - while (!list_empty_careful(&ctx->inflight_list)) { - struct io_kiocb *cancel_req = NULL; + struct io_kiocb *cancel_req = NULL, *req; + DEFINE_WAIT(wait);
spin_lock_irq(&ctx->inflight_lock); list_for_each_entry(req, &ctx->inflight_list, inflight_entry) { @@ -7339,6 +7337,7 @@ static void io_uring_cancel_files(struct io_ring_ctx *ctx, */ if (refcount_sub_and_test(2, &cancel_req->refs)) { io_put_req(cancel_req); + finish_wait(&ctx->inflight_wait, &wait); continue; } } @@ -7346,8 +7345,8 @@ static void io_uring_cancel_files(struct io_ring_ctx *ctx, io_wq_cancel_work(ctx->io_wq, &cancel_req->work); io_put_req(cancel_req); schedule(); + finish_wait(&ctx->inflight_wait, &wait); } - finish_wait(&ctx->inflight_wait, &wait); }
static int io_uring_flush(struct file *file, void *data)
From: Xiaoguang Wang xiaoguang.wang@linux.alibaba.com
mainline inclusion from mainline-5.7-rc5 commit 7f13657d141346125f4d0bb93eab4777f40c406e category: feature bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27 CVE: NA ---------------------------
If copy_to_user() in io_uring_setup() fails, we leak many kernel resources, which won't be reclaimed until the process terminates. This bug can be reproduced by using mprotect to set the params page to PROT_READ. To fix this issue, refactor io_uring_create() a bit to take a new 'struct io_uring_params __user *params' parameter and move the copy_to_user() from io_uring_setup() into io_uring_create(); if copy_to_user() fails there, we can free the kernel resources properly.
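A hypothetical userspace repro sketch along the lines described above (the params page stays readable so the initial copy_from_user() succeeds, but the final copy_to_user() fails):

#include <string.h>
#include <sys/mman.h>
#include <sys/syscall.h>
#include <unistd.h>
#include <linux/io_uring.h>

int main(void)
{
	struct io_uring_params *p = mmap(NULL, 4096,
					 PROT_READ | PROT_WRITE,
					 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

	memset(p, 0, sizeof(*p));
	mprotect(p, 4096, PROT_READ);	/* kernel can read but not write */
	/* pre-fix: returns -EFAULT after the rings were already
	 * allocated, leaking them until the process exits */
	return syscall(__NR_io_uring_setup, 8, p) < 0;
}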
Suggested-by: Jens Axboe axboe@kernel.dk Signed-off-by: Xiaoguang Wang xiaoguang.wang@linux.alibaba.com Signed-off-by: Jens Axboe axboe@kernel.dk Signed-off-by: yangerkun yangerkun@huawei.com Reviewed-by: zhangyi (F) yi.zhang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- fs/io_uring.c | 24 +++++++++++------------- 1 file changed, 11 insertions(+), 13 deletions(-)
diff --git a/fs/io_uring.c b/fs/io_uring.c index 1e69744b9ed0..4421100bdd12 100644 --- a/fs/io_uring.c +++ b/fs/io_uring.c @@ -7695,7 +7695,8 @@ static int io_uring_get_fd(struct io_ring_ctx *ctx) return ret; }
-static int io_uring_create(unsigned entries, struct io_uring_params *p) +static int io_uring_create(unsigned entries, struct io_uring_params *p, + struct io_uring_params __user *params) { struct user_struct *user = NULL; struct io_ring_ctx *ctx; @@ -7787,6 +7788,14 @@ static int io_uring_create(unsigned entries, struct io_uring_params *p) p->cq_off.overflow = offsetof(struct io_rings, cq_overflow); p->cq_off.cqes = offsetof(struct io_rings, cqes);
+ p->features = IORING_FEAT_SINGLE_MMAP | IORING_FEAT_NODROP | + IORING_FEAT_SUBMIT_STABLE | IORING_FEAT_RW_CUR_POS | + IORING_FEAT_CUR_PERSONALITY | IORING_FEAT_FAST_POLL; + + if (copy_to_user(params, p, sizeof(*p))) { + ret = -EFAULT; + goto err; + } /* * Install ring fd as the very last thing, so we don't risk someone * having closed it before we finish setup @@ -7795,9 +7804,6 @@ static int io_uring_create(unsigned entries, struct io_uring_params *p) if (ret < 0) goto err;
- p->features = IORING_FEAT_SINGLE_MMAP | IORING_FEAT_NODROP | - IORING_FEAT_SUBMIT_STABLE | IORING_FEAT_RW_CUR_POS | - IORING_FEAT_CUR_PERSONALITY | IORING_FEAT_FAST_POLL; trace_io_uring_create(ret, ctx, p->sq_entries, p->cq_entries, p->flags); return ret; err: @@ -7813,7 +7819,6 @@ static int io_uring_create(unsigned entries, struct io_uring_params *p) static long io_uring_setup(u32 entries, struct io_uring_params __user *params) { struct io_uring_params p; - long ret; int i;
if (copy_from_user(&p, params, sizeof(p))) @@ -7828,14 +7833,7 @@ static long io_uring_setup(u32 entries, struct io_uring_params __user *params) IORING_SETUP_CLAMP | IORING_SETUP_ATTACH_WQ)) return -EINVAL;
- ret = io_uring_create(entries, &p); - if (ret < 0) - return ret; - - if (copy_to_user(params, &p, sizeof(p))) - return -EFAULT; - - return ret; + return io_uring_create(entries, &p, params); }
SYSCALL_DEFINE2(io_uring_setup, u32, entries,
From: Pavel Begunkov asml.silence@gmail.com
mainline inclusion from mainline-5.7-rc5 commit 90da2e3f25c8b4d742b2687b8fed8fc4eb8851da category: feature bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27 CVE: NA ---------------------------
do_splice() is used by io_uring, as will be do_tee(). Move f_mode checks from sys_{splice,tee}() to do_{splice,tee}(), so they're enforced for io_uring as well.
Fixes: 7d67af2c0134 ("io_uring: add splice(2) support") Reported-by: Jann Horn jannh@google.com Signed-off-by: Pavel Begunkov asml.silence@gmail.com Signed-off-by: Jens Axboe axboe@kernel.dk Signed-off-by: yangerkun yangerkun@huawei.com Reviewed-by: zhangyi (F) yi.zhang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- fs/splice.c | 45 ++++++++++++++++++--------------------------- 1 file changed, 18 insertions(+), 27 deletions(-)
diff --git a/fs/splice.c b/fs/splice.c index 4bfc3c5f7cad..f8aa86070b22 100644 --- a/fs/splice.c +++ b/fs/splice.c @@ -1108,6 +1108,10 @@ long do_splice(struct file *in, loff_t __user *off_in, loff_t offset; long ret;
+ if (unlikely(!(in->f_mode & FMODE_READ) || + !(out->f_mode & FMODE_WRITE))) + return -EBADF; + ipipe = get_pipe_info(in); opipe = get_pipe_info(out);
@@ -1115,12 +1119,6 @@ long do_splice(struct file *in, loff_t __user *off_in, if (off_in || off_out) return -ESPIPE;
- if (!(in->f_mode & FMODE_READ)) - return -EBADF; - - if (!(out->f_mode & FMODE_WRITE)) - return -EBADF; - /* Splicing to self would be fun, but... */ if (ipipe == opipe) return -EINVAL; @@ -1140,9 +1138,6 @@ long do_splice(struct file *in, loff_t __user *off_in, offset = out->f_pos; }
- if (unlikely(!(out->f_mode & FMODE_WRITE))) - return -EBADF; - if (unlikely(out->f_flags & O_APPEND)) return -EINVAL;
@@ -1421,15 +1416,11 @@ SYSCALL_DEFINE6(splice, int, fd_in, loff_t __user *, off_in, error = -EBADF; in = fdget(fd_in); if (in.file) { - if (in.file->f_mode & FMODE_READ) { - out = fdget(fd_out); - if (out.file) { - if (out.file->f_mode & FMODE_WRITE) - error = do_splice(in.file, off_in, - out.file, off_out, - len, flags); - fdput(out); - } + out = fdget(fd_out); + if (out.file) { + error = do_splice(in.file, off_in, out.file, off_out, + len, flags); + fdput(out); } fdput(in); } @@ -1733,6 +1724,10 @@ static long do_tee(struct file *in, struct file *out, size_t len, struct pipe_inode_info *opipe = get_pipe_info(out); int ret = -EINVAL;
+ if (unlikely(!(in->f_mode & FMODE_READ) || + !(out->f_mode & FMODE_WRITE))) + return -EBADF; + /* * Duplicate the contents of ipipe to opipe without actually * copying the data. @@ -1755,7 +1750,7 @@ static long do_tee(struct file *in, struct file *out, size_t len,
SYSCALL_DEFINE4(tee, int, fdin, int, fdout, size_t, len, unsigned int, flags) { - struct fd in; + struct fd in, out; int error;
if (unlikely(flags & ~SPLICE_F_ALL)) @@ -1767,14 +1762,10 @@ SYSCALL_DEFINE4(tee, int, fdin, int, fdout, size_t, len, unsigned int, flags) error = -EBADF; in = fdget(fdin); if (in.file) { - if (in.file->f_mode & FMODE_READ) { - struct fd out = fdget(fdout); - if (out.file) { - if (out.file->f_mode & FMODE_WRITE) - error = do_tee(in.file, out.file, - len, flags); - fdput(out); - } + out = fdget(fdout); + if (out.file) { + error = do_tee(in.file, out.file, len, flags); + fdput(out); } fdput(in); }
From: Pavel Begunkov asml.silence@gmail.com
mainline inclusion from mainline-5.7-rc6 commit c96874265cd04b4bd4a8e114ac9af039a6d83cfe category: feature bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27 CVE: NA ---------------------------
do_splice() doesn't expect len to be 0. Just always return 0 in this case as splice(2) does.
Fixes: 7d67af2c0134 ("io_uring: add splice(2) support") Reported-by: Jann Horn jannh@google.com Signed-off-by: Pavel Begunkov asml.silence@gmail.com Signed-off-by: Jens Axboe axboe@kernel.dk Signed-off-by: yangerkun yangerkun@huawei.com Reviewed-by: zhangyi (F) yi.zhang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- fs/io_uring.c | 11 +++++++---- 1 file changed, 7 insertions(+), 4 deletions(-)
diff --git a/fs/io_uring.c b/fs/io_uring.c index 380d29bfbc5a..88eede355116 100644 --- a/fs/io_uring.c +++ b/fs/io_uring.c @@ -2763,16 +2763,19 @@ static int io_splice(struct io_kiocb *req, bool force_nonblock) struct file *out = sp->file_out; unsigned int flags = sp->flags & ~SPLICE_F_FD_IN_FIXED; loff_t *poff_in, *poff_out; - long ret; + long ret = 0;
if (force_nonblock) return -EAGAIN;
poff_in = (sp->off_in == -1) ? NULL : &sp->off_in; poff_out = (sp->off_out == -1) ? NULL : &sp->off_out; - ret = do_splice(in, poff_in, out, poff_out, sp->len, flags); - if (force_nonblock && ret == -EAGAIN) - return -EAGAIN; + + if (sp->len) { + ret = do_splice(in, poff_in, out, poff_out, sp->len, flags); + if (force_nonblock && ret == -EAGAIN) + return -EAGAIN; + }
io_put_file(req, in, (sp->flags & SPLICE_F_FD_IN_FIXED)); req->flags &= ~REQ_F_NEED_CLEANUP;
From: Jens Axboe axboe@kernel.dk
mainline inclusion from mainline-5.7-rc6 commit 9d9e88a24c1f20ebfc2f28b1762ce78c0b9e1cb3 category: feature bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27 CVE: NA ---------------------------
When we changed the file registration handling, it became important to iterate the bulk request freeing list for fixed files as well, or we miss dropping the fixed file reference. If not, we're leaking references, and we'll get a kworker stuck waiting for file references to disappear.
This also means we can remove the special casing of fixed vs non-fixed files, we need to iterate for both and we can just rely on __io_req_aux_free() doing io_put_file() instead of doing it manually.
Fixes: 055895537302 ("io_uring: refactor file register/unregister/update handling") Signed-off-by: Jens Axboe axboe@kernel.dk Signed-off-by: yangerkun yangerkun@huawei.com Reviewed-by: zhangyi (F) yi.zhang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- fs/io_uring.c | 6 +----- 1 file changed, 1 insertion(+), 5 deletions(-)
diff --git a/fs/io_uring.c b/fs/io_uring.c index 88eede355116..b992b020a819 100644 --- a/fs/io_uring.c +++ b/fs/io_uring.c @@ -1390,10 +1390,6 @@ static void io_free_req_many(struct io_ring_ctx *ctx, struct req_batch *rb) for (i = 0; i < rb->to_free; i++) { struct io_kiocb *req = rb->reqs[i];
- if (req->flags & REQ_F_FIXED_FILE) { - req->file = NULL; - percpu_ref_put(req->fixed_file_refs); - } if (req->flags & REQ_F_INFLIGHT) inflight++; __io_req_aux_free(req); @@ -1666,7 +1662,7 @@ static inline bool io_req_multi_free(struct req_batch *rb, struct io_kiocb *req) if ((req->flags & REQ_F_LINK_HEAD) || io_is_fallback_req(req)) return false;
- if (!(req->flags & REQ_F_FIXED_FILE) || req->io) + if (req->file || req->io) rb->need_iter++;
rb->reqs[rb->to_free++] = req;
From: Jens Axboe axboe@kernel.dk
mainline inclusion from mainline-5.7-rc7 commit 583863ed918136412ddf14de2e12534f17cfdc6f category: feature bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27 CVE: NA ---------------------------
Ensure that ctx->sqo_wait is initialized as soon as the ctx is allocated, instead of deferring it to the offload setup. This fixes a syzbot-reported lockdep complaint, which is really due to trying to wake_up on an uninitialized wait queue:
RSP: 002b:00007fffb1fb9aa8 EFLAGS: 00000246 ORIG_RAX: 00000000000001a9 RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 0000000000441319 RDX: 0000000000000001 RSI: 0000000020000140 RDI: 000000000000047b RBP: 0000000000010475 R08: 0000000000000001 R09: 00000000004002c8 R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000402260 R13: 00000000004022f0 R14: 0000000000000000 R15: 0000000000000000 INFO: trying to register non-static key. the code is fine but needs lockdep annotation. turning off the locking correctness validator. CPU: 1 PID: 7090 Comm: syz-executor222 Not tainted 5.7.0-rc1-next-20200415-syzkaller #0 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011 Call Trace: __dump_stack lib/dump_stack.c:77 [inline] dump_stack+0x188/0x20d lib/dump_stack.c:118 assign_lock_key kernel/locking/lockdep.c:913 [inline] register_lock_class+0x1664/0x1760 kernel/locking/lockdep.c:1225 __lock_acquire+0x104/0x4c50 kernel/locking/lockdep.c:4234 lock_acquire+0x1f2/0x8f0 kernel/locking/lockdep.c:4934 __raw_spin_lock_irqsave include/linux/spinlock_api_smp.h:110 [inline] _raw_spin_lock_irqsave+0x8c/0xbf kernel/locking/spinlock.c:159 __wake_up_common_lock+0xb4/0x130 kernel/sched/wait.c:122 io_cqring_ev_posted+0xa5/0x1e0 fs/io_uring.c:1160 io_poll_remove_all fs/io_uring.c:4357 [inline] io_ring_ctx_wait_and_kill+0x2bc/0x5a0 fs/io_uring.c:7305 io_uring_create fs/io_uring.c:7843 [inline] io_uring_setup+0x115e/0x22b0 fs/io_uring.c:7870 do_syscall_64+0xf6/0x7d0 arch/x86/entry/common.c:295 entry_SYSCALL_64_after_hwframe+0x49/0xb3 RIP: 0033:0x441319 Code: e8 5c ae 02 00 48 83 c4 18 c3 0f 1f 80 00 00 00 00 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 0f 83 bb 0a fc ff c3 66 2e 0f 1f 84 00 00 00 00 RSP: 002b:00007fffb1fb9aa8 EFLAGS: 00000246 ORIG_RAX: 00000000000001a9
Reported-by: syzbot+8c91f5d054e998721c57@syzkaller.appspotmail.com Signed-off-by: Jens Axboe axboe@kernel.dk Signed-off-by: yangerkun yangerkun@huawei.com Reviewed-by: zhangyi (F) yi.zhang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- fs/io_uring.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/fs/io_uring.c b/fs/io_uring.c index b992b020a819..71def07b1c94 100644 --- a/fs/io_uring.c +++ b/fs/io_uring.c @@ -920,6 +920,7 @@ static struct io_ring_ctx *io_ring_ctx_alloc(struct io_uring_params *p) goto err;
ctx->flags = p->flags; + init_waitqueue_head(&ctx->sqo_wait); init_waitqueue_head(&ctx->cq_wait); INIT_LIST_HEAD(&ctx->cq_overflow_list); init_completion(&ctx->completions[0]); @@ -6773,7 +6774,6 @@ static int io_sq_offload_start(struct io_ring_ctx *ctx, { int ret;
- init_waitqueue_head(&ctx->sqo_wait); mmgrab(current->mm); ctx->sqo_mm = current->mm;
From: Pavel Begunkov asml.silence@gmail.com
mainline inclusion from mainline-5.7-rc7 commit bd2ab18a1d6267446eae1b47dd839050452bdf7f category: feature bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27 CVE: NA ---------------------------
As with other non-inlined requests, allocate req->io for FORCE_ASYNC requests, so they can be prepared properly.
Signed-off-by: Pavel Begunkov asml.silence@gmail.com Signed-off-by: Jens Axboe axboe@kernel.dk Signed-off-by: yangerkun yangerkun@huawei.com Reviewed-by: zhangyi (F) yi.zhang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- fs/io_uring.c | 12 +++++++++--- 1 file changed, 9 insertions(+), 3 deletions(-)
diff --git a/fs/io_uring.c b/fs/io_uring.c index d491df308235..63e9ae556bae 100644 --- a/fs/io_uring.c +++ b/fs/io_uring.c @@ -5544,9 +5544,15 @@ static void io_queue_sqe(struct io_kiocb *req, const struct io_uring_sqe *sqe) io_double_put_req(req); } } else if (req->flags & REQ_F_FORCE_ASYNC) { - ret = io_req_defer_prep(req, sqe); - if (unlikely(ret < 0)) - goto fail_req; + if (!req->io) { + ret = -EAGAIN; + if (io_alloc_async_ctx(req)) + goto fail_req; + ret = io_req_defer_prep(req, sqe); + if (unlikely(ret < 0)) + goto fail_req; + } + /* * Never try inline submit of IOSQE_ASYNC is set, go straight * to async execution.
From: Stefano Garzarella sgarzare@redhat.com
mainline inclusion from mainline-5.8-rc1 commit 0d9b5b3af134cddfdc1dd31d41946a0ad389bbf2 category: feature bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27 CVE: NA ---------------------------
This patch adds the new 'cq_flags' field that should be written by the application and read by the kernel.
This new field is available to the userspace application through 'cq_off.flags'. We are using 4 bytes that were previously reserved and set to zero. This means that if the application finds this field to be zero, the new functionality is not supported.
In the next patch we will introduce the first flag available.
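A hedged userspace sketch of the feature detection (map_cq_flags() is a hypothetical helper; the ring-size arithmetic follows the usual io_uring mmap convention):

#include <string.h>
#include <sys/mman.h>
#include <sys/syscall.h>
#include <unistd.h>
#include <linux/io_uring.h>

/* Return a pointer to the CQ flags word, or NULL if the kernel
 * predates the field (cq_off.flags left as reserved zero). */
unsigned *map_cq_flags(unsigned entries)
{
	struct io_uring_params p;
	size_t cq_size;
	void *cq;
	int fd;

	memset(&p, 0, sizeof(p));
	fd = syscall(__NR_io_uring_setup, entries, &p);
	if (fd < 0 || !p.cq_off.flags)
		return NULL;

	cq_size = p.cq_off.cqes + p.cq_entries * sizeof(struct io_uring_cqe);
	cq = mmap(NULL, cq_size, PROT_READ | PROT_WRITE,
		  MAP_SHARED | MAP_POPULATE, fd, IORING_OFF_CQ_RING);
	if (cq == MAP_FAILED)
		return NULL;
	return (unsigned *)((char *)cq + p.cq_off.flags);
}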
Signed-off-by: Stefano Garzarella sgarzare@redhat.com Signed-off-by: Jens Axboe axboe@kernel.dk Signed-off-by: yangerkun yangerkun@huawei.com Reviewed-by: zhangyi (F) yi.zhang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- fs/io_uring.c | 10 +++++++++- include/uapi/linux/io_uring.h | 4 +++- 2 files changed, 12 insertions(+), 2 deletions(-)
diff --git a/fs/io_uring.c b/fs/io_uring.c index 278ac42b269e..dfb1a9e9a9b9 100644 --- a/fs/io_uring.c +++ b/fs/io_uring.c @@ -142,7 +142,7 @@ struct io_rings { */ u32 sq_dropped; /* - * Runtime flags + * Runtime SQ flags * * Written by the kernel, shouldn't be modified by the * application. @@ -151,6 +151,13 @@ struct io_rings { * for IORING_SQ_NEED_WAKEUP after updating the sq tail. */ u32 sq_flags; + /* + * Runtime CQ flags + * + * Written by the application, shouldn't be modified by the + * kernel. + */ + u32 cq_flags; /* * Number of completion events lost because the queue was full; * this should be avoided by the application by making sure @@ -7874,6 +7881,7 @@ static int io_uring_create(unsigned entries, struct io_uring_params *p, p->cq_off.ring_entries = offsetof(struct io_rings, cq_ring_entries); p->cq_off.overflow = offsetof(struct io_rings, cq_overflow); p->cq_off.cqes = offsetof(struct io_rings, cqes); + p->cq_off.flags = offsetof(struct io_rings, cq_flags);
p->features = IORING_FEAT_SINGLE_MMAP | IORING_FEAT_NODROP | IORING_FEAT_SUBMIT_STABLE | IORING_FEAT_RW_CUR_POS | diff --git a/include/uapi/linux/io_uring.h b/include/uapi/linux/io_uring.h index 6e35b534c4b8..94e3359249ab 100644 --- a/include/uapi/linux/io_uring.h +++ b/include/uapi/linux/io_uring.h @@ -203,7 +203,9 @@ struct io_cqring_offsets { __u32 ring_entries; __u32 overflow; __u32 cqes; - __u64 resv[2]; + __u32 flags; + __u32 resv1; + __u64 resv2; };
/*
From: Pavel Begunkov asml.silence@gmail.com
mainline inclusion from mainline-5.8-rc1 commit f2a8d5c7a218b9c24befb756c4eb30aa550ce822 category: feature bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27 CVE: NA ---------------------------
Add IORING_OP_TEE implementing tee(2) support. Almost identical to splice bits, but without offsets.
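As a usage illustration, a minimal liburing sketch (assuming a liburing version that provides io_uring_prep_tee(); uring_tee() is a hypothetical helper):

#include <liburing.h>

/* duplicate up to 'len' bytes from one pipe to another without
 * consuming them, tee(2)-style, as a single SQE */
int uring_tee(struct io_uring *ring, int pipe_in, int pipe_out, unsigned len)
{
	struct io_uring_sqe *sqe = io_uring_get_sqe(ring);
	struct io_uring_cqe *cqe;
	int ret;

	io_uring_prep_tee(sqe, pipe_in, pipe_out, len, 0);
	io_uring_submit(ring);

	ret = io_uring_wait_cqe(ring, &cqe);
	if (ret < 0)
		return ret;
	ret = cqe->res;		/* bytes duplicated, or -errno */
	io_uring_cqe_seen(ring, cqe);
	return ret;
}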
Signed-off-by: Pavel Begunkov asml.silence@gmail.com Signed-off-by: Jens Axboe axboe@kernel.dk Signed-off-by: yangerkun yangerkun@huawei.com Reviewed-by: zhangyi (F) yi.zhang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- fs/io_uring.c | 62 +++++++++++++++++++++++++++++++++-- include/uapi/linux/io_uring.h | 1 + 2 files changed, 60 insertions(+), 3 deletions(-)
diff --git a/fs/io_uring.c b/fs/io_uring.c index 2b9678f91395..9db2f55082a6 100644 --- a/fs/io_uring.c +++ b/fs/io_uring.c @@ -852,6 +852,11 @@ static const struct io_op_def io_op_defs[] = { }, [IORING_OP_PROVIDE_BUFFERS] = {}, [IORING_OP_REMOVE_BUFFERS] = {}, + [IORING_OP_TEE] = { + .needs_file = 1, + .hash_reg_file = 1, + .unbound_nonreg_file = 1, + }, };
static void io_wq_submit_work(struct io_wq_work **workptr); @@ -2741,7 +2746,8 @@ static int io_write(struct io_kiocb *req, bool force_nonblock) return ret; }
-static int io_splice_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe) +static int __io_splice_prep(struct io_kiocb *req, + const struct io_uring_sqe *sqe) { struct io_splice* sp = &req->splice; unsigned int valid_flags = SPLICE_F_FD_IN_FIXED | SPLICE_F_ALL; @@ -2751,8 +2757,6 @@ static int io_splice_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe) return 0;
sp->file_in = NULL; - sp->off_in = READ_ONCE(sqe->splice_off_in); - sp->off_out = READ_ONCE(sqe->off); sp->len = READ_ONCE(sqe->len); sp->flags = READ_ONCE(sqe->splice_flags);
@@ -2771,6 +2775,46 @@ static int io_splice_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe) return 0; }
+static int io_tee_prep(struct io_kiocb *req, + const struct io_uring_sqe *sqe) +{ + if (READ_ONCE(sqe->splice_off_in) || READ_ONCE(sqe->off)) + return -EINVAL; + return __io_splice_prep(req, sqe); +} + +static int io_tee(struct io_kiocb *req, bool force_nonblock) +{ + struct io_splice *sp = &req->splice; + struct file *in = sp->file_in; + struct file *out = sp->file_out; + unsigned int flags = sp->flags & ~SPLICE_F_FD_IN_FIXED; + long ret = 0; + + if (force_nonblock) + return -EAGAIN; + if (sp->len) + ret = do_tee(in, out, sp->len, flags); + + io_put_file(req, in, (sp->flags & SPLICE_F_FD_IN_FIXED)); + req->flags &= ~REQ_F_NEED_CLEANUP; + + io_cqring_add_event(req, ret); + if (ret != sp->len) + req_set_fail_links(req); + io_put_req(req); + return 0; +} + +static int io_splice_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe) +{ + struct io_splice* sp = &req->splice; + + sp->off_in = READ_ONCE(sqe->splice_off_in); + sp->off_out = READ_ONCE(sqe->off); + return __io_splice_prep(req, sqe); +} + static int io_splice(struct io_kiocb *req, bool force_nonblock) { struct io_splice *sp = &req->splice; @@ -5029,6 +5073,9 @@ static int io_req_defer_prep(struct io_kiocb *req, case IORING_OP_REMOVE_BUFFERS: ret = io_remove_buffers_prep(req, sqe); break; + case IORING_OP_TEE: + ret = io_tee_prep(req, sqe); + break; default: printk_once(KERN_WARNING "io_uring: unhandled opcode %d\n", req->opcode); @@ -5102,6 +5149,7 @@ static void io_cleanup_req(struct io_kiocb *req) putname(req->open.filename); break; case IORING_OP_SPLICE: + case IORING_OP_TEE: io_put_file(req, req->splice.file_in, (req->splice.flags & SPLICE_F_FD_IN_FIXED)); break; @@ -5324,6 +5372,14 @@ static int io_issue_sqe(struct io_kiocb *req, const struct io_uring_sqe *sqe, } ret = io_remove_buffers(req, force_nonblock); break; + case IORING_OP_TEE: + if (sqe) { + ret = io_tee_prep(req, sqe); + if (ret < 0) + break; + } + ret = io_tee(req, force_nonblock); + break; default: ret = -EINVAL; break; diff --git a/include/uapi/linux/io_uring.h b/include/uapi/linux/io_uring.h index 15aed20c6789..9afedee24e5b 100644 --- a/include/uapi/linux/io_uring.h +++ b/include/uapi/linux/io_uring.h @@ -128,6 +128,7 @@ enum { IORING_OP_SPLICE, IORING_OP_PROVIDE_BUFFERS, IORING_OP_REMOVE_BUFFERS, + IORING_OP_TEE,
/* this goes last, obviously */ IORING_OP_LAST,
From: Jens Axboe axboe@kernel.dk
mainline inclusion from mainline-5.8-rc1 commit 310672552f4aea2ad50704711aa3cdd45f5441e9 category: feature bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27 CVE: NA ---------------------------
If the request is still hashed in io_async_task_func(), then it cannot have been canceled and it's pointless to check. So save that check.
Signed-off-by: Jens Axboe axboe@kernel.dk Signed-off-by: yangerkun yangerkun@huawei.com Reviewed-by: zhangyi (F) yi.zhang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- fs/io_uring.c | 33 ++++++++++++++++----------------- 1 file changed, 16 insertions(+), 17 deletions(-)
diff --git a/fs/io_uring.c b/fs/io_uring.c index 9db2f55082a6..11decc52f7b7 100644 --- a/fs/io_uring.c +++ b/fs/io_uring.c @@ -4301,7 +4301,7 @@ static void io_async_task_func(struct callback_head *cb) struct io_kiocb *req = container_of(cb, struct io_kiocb, task_work); struct async_poll *apoll = req->apoll; struct io_ring_ctx *ctx = req->ctx; - bool canceled; + bool canceled = false;
trace_io_uring_task_run(req->ctx, req->opcode, req->user_data);
@@ -4310,34 +4310,33 @@ static void io_async_task_func(struct callback_head *cb) return; }
- if (hash_hashed(&req->hash_node)) + /* If req is still hashed, it cannot have been canceled. Don't check. */ + if (hash_hashed(&req->hash_node)) { hash_del(&req->hash_node); - - canceled = READ_ONCE(apoll->poll.canceled); - if (canceled) { - io_cqring_fill_event(req, -ECANCELED); - io_commit_cqring(ctx); + } else { + canceled = READ_ONCE(apoll->poll.canceled); + if (canceled) { + io_cqring_fill_event(req, -ECANCELED); + io_commit_cqring(ctx); + } }
spin_unlock_irq(&ctx->completion_lock);
/* restore ->work in case we need to retry again */ memcpy(&req->work, &apoll->work, sizeof(req->work)); + kfree(apoll);
- if (canceled) { - kfree(apoll); + if (!canceled) { + __set_current_state(TASK_RUNNING); + mutex_lock(&ctx->uring_lock); + __io_queue_sqe(req, NULL); + mutex_unlock(&ctx->uring_lock); + } else { io_cqring_ev_posted(ctx); req_set_fail_links(req); io_double_put_req(req); - return; } - - __set_current_state(TASK_RUNNING); - mutex_lock(&ctx->uring_lock); - __io_queue_sqe(req, NULL); - mutex_unlock(&ctx->uring_lock); - - kfree(apoll); }
static int io_async_wake(struct wait_queue_entry *wait, unsigned mode, int sync,
From: Xiaoguang Wang xiaoguang.wang@linux.alibaba.com
mainline inclusion from mainline-5.8-rc1 commit 6b668c9b7fc6fc0c313cdaee8b75d17f4d954ab5 category: feature bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27 CVE: NA ---------------------------
When IORING_SETUP_SQPOLL is enabled, io_ring_ctx_wait_and_kill() busy-waits for the sq thread to idle:
	while (ctx->sqo_thread && !wq_has_sleeper(&ctx->sqo_wait))
		cond_resched();
The above loop isn't very CPU friendly; it may introduce a short cpu burst on the current cpu.
If ctx->refs is dying, we forbid sq_thread from submitting any further SQEs. Instead they just get discarded when we exit.
Signed-off-by: Xiaoguang Wang xiaoguang.wang@linux.alibaba.com Signed-off-by: Jens Axboe axboe@kernel.dk Signed-off-by: yangerkun yangerkun@huawei.com Reviewed-by: zhangyi (F) yi.zhang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- fs/io_uring.c | 13 ++----------- 1 file changed, 2 insertions(+), 11 deletions(-)
diff --git a/fs/io_uring.c b/fs/io_uring.c index 11decc52f7b7..bd7c862f2d67 100644 --- a/fs/io_uring.c +++ b/fs/io_uring.c @@ -6119,7 +6119,8 @@ static int io_sq_thread(void *data) }
mutex_lock(&ctx->uring_lock); - ret = io_submit_sqes(ctx, to_submit, NULL, -1); + if (likely(!percpu_ref_is_dying(&ctx->refs))) + ret = io_submit_sqes(ctx, to_submit, NULL, -1); mutex_unlock(&ctx->uring_lock); timeout = jiffies + ctx->sq_thread_idle; } @@ -7409,16 +7410,6 @@ static void io_ring_ctx_wait_and_kill(struct io_ring_ctx *ctx) percpu_ref_kill(&ctx->refs); mutex_unlock(&ctx->uring_lock);
- /* - * Wait for sq thread to idle, if we have one. It won't spin on new - * work after we've killed the ctx ref above. This is important to do - * before we cancel existing commands, as the thread could otherwise - * be queueing new work post that. If that's work we need to cancel, - * it could cause shutdown to hang. - */ - while (ctx->sqo_thread && !wq_has_sleeper(&ctx->sqo_wait)) - cond_resched(); - io_kill_timeouts(ctx); io_poll_remove_all(ctx);
From: Pavel Begunkov asml.silence@gmail.com
mainline inclusion from mainline-5.8-rc1 commit 4518a3cc273cf82efdd36522fb1f13baad173c70 category: feature bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27 CVE: NA ---------------------------
In io_uring_cancel_files(), after refcount_sub_and_test() drops req->refs to 0, the code calls io_put_req(), which would put yet another ref. Call io_free_req() instead.
Cc: stable@vger.kernel.org Fixes: 2ca10259b418 ("io_uring: prune request from overflow list on flush") Signed-off-by: Pavel Begunkov asml.silence@gmail.com Signed-off-by: Jens Axboe axboe@kernel.dk Signed-off-by: yangerkun yangerkun@huawei.com Reviewed-by: zhangyi (F) yi.zhang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- fs/io_uring.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/fs/io_uring.c b/fs/io_uring.c index bd7c862f2d67..48965063ea68 100644 --- a/fs/io_uring.c +++ b/fs/io_uring.c @@ -7478,7 +7478,7 @@ static void io_uring_cancel_files(struct io_ring_ctx *ctx, * all we had, then we're done with this request. */ if (refcount_sub_and_test(2, &cancel_req->refs)) { - io_put_req(cancel_req); + io_free_req(cancel_req); finish_wait(&ctx->inflight_wait, &wait); continue; }
From: Pavel Begunkov asml.silence@gmail.com
mainline inclusion from mainline-5.8-rc1 commit 733f5c95e6fdabd05b8dfc15e04512809c9652c2 category: feature bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27 CVE: NA ---------------------------
Move spin_lock_irq() earlier so there is only one call site of it in io_timeout(). It makes the flow easier to follow.
Signed-off-by: Pavel Begunkov asml.silence@gmail.com Signed-off-by: Jens Axboe axboe@kernel.dk Signed-off-by: yangerkun yangerkun@huawei.com Reviewed-by: zhangyi (F) yi.zhang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- fs/io_uring.c | 3 +-- 1 file changed, 1 insertion(+), 2 deletions(-)
diff --git a/fs/io_uring.c b/fs/io_uring.c index 48965063ea68..80fc3d7179d7 100644 --- a/fs/io_uring.c +++ b/fs/io_uring.c @@ -4792,6 +4792,7 @@ static int io_timeout(struct io_kiocb *req) u32 seq = req->sequence;
data = &req->io->timeout; + spin_lock_irq(&ctx->completion_lock);
/* * sqe->off holds how many events that need to occur for this @@ -4800,7 +4801,6 @@ static int io_timeout(struct io_kiocb *req) */ if (!count) { req->flags |= REQ_F_TIMEOUT_NOSEQ; - spin_lock_irq(&ctx->completion_lock); entry = ctx->timeout_list.prev; goto add; } @@ -4811,7 +4811,6 @@ static int io_timeout(struct io_kiocb *req) * Insertion sort, ensuring the first entry in the list is always * the one we need first. */ - spin_lock_irq(&ctx->completion_lock); list_for_each_prev(entry, &ctx->timeout_list) { struct io_kiocb *nxt = list_entry(entry, struct io_kiocb, list); unsigned nxt_seq;
From: Pavel Begunkov asml.silence@gmail.com
mainline inclusion from mainline-5.8-rc1 commit 56080b02ed6e71fbc0add2d05a32ed7361dd736a category: feature bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27 CVE: NA ---------------------------
SQEs are user writable; don't read sqe->off twice in io_timeout_prep().
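A sketch of the time-of-check/time-of-use hazard being fixed (kernel-style fragment, condensed from the patch below):

	/* racy: the value validated and the value stored can differ,
	 * because userspace may rewrite sqe->off between the loads */
	if (sqe->off && is_timeout_link)
		return -EINVAL;
	req->timeout.count = READ_ONCE(sqe->off);

	/* safe: load once, then validate and use the same local copy */
	u32 off = READ_ONCE(sqe->off);

	if (off && is_timeout_link)
		return -EINVAL;
	req->timeout.count = off;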
Signed-off-by: Pavel Begunkov asml.silence@gmail.com Signed-off-by: Jens Axboe axboe@kernel.dk Signed-off-by: yangerkun yangerkun@huawei.com Reviewed-by: zhangyi (F) yi.zhang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- fs/io_uring.c | 5 +++-- 1 file changed, 3 insertions(+), 2 deletions(-)
diff --git a/fs/io_uring.c b/fs/io_uring.c index 80fc3d7179d7..a90a548da824 100644 --- a/fs/io_uring.c +++ b/fs/io_uring.c @@ -4750,18 +4750,19 @@ static int io_timeout_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe, { struct io_timeout_data *data; unsigned flags; + u32 off = READ_ONCE(sqe->off);
if (unlikely(req->ctx->flags & IORING_SETUP_IOPOLL)) return -EINVAL; if (sqe->ioprio || sqe->buf_index || sqe->len != 1) return -EINVAL; - if (sqe->off && is_timeout_link) + if (off && is_timeout_link) return -EINVAL; flags = READ_ONCE(sqe->timeout_flags); if (flags & ~IORING_TIMEOUT_ABS) return -EINVAL;
- req->timeout.count = READ_ONCE(sqe->off); + req->timeout.count = off;
if (!req->io && io_alloc_async_ctx(req)) return -ENOMEM;
From: Pavel Begunkov asml.silence@gmail.com
mainline inclusion from mainline-5.8-rc1 commit 0451894522108d6c72934aff6ef89023743a9ed4 category: feature bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27 CVE: NA ---------------------------
io_commit_cqring()'s assembly doesn't look good with the extra code for handling drained requests. IOSQE_IO_DRAIN is slow and its use in a hot path is discouraged, so try to minimise its impact by moving that code into a helper and doing only a fast check inline.
Signed-off-by: Pavel Begunkov asml.silence@gmail.com Signed-off-by: Jens Axboe axboe@kernel.dk Signed-off-by: yangerkun yangerkun@huawei.com Reviewed-by: zhangyi (F) yi.zhang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- fs/io_uring.c | 30 +++++++++++++++--------------- 1 file changed, 15 insertions(+), 15 deletions(-)
diff --git a/fs/io_uring.c b/fs/io_uring.c index a90a548da824..cedbf117450a 100644 --- a/fs/io_uring.c +++ b/fs/io_uring.c @@ -981,19 +981,6 @@ static inline bool req_need_defer(struct io_kiocb *req) return false; }
-static struct io_kiocb *io_get_deferred_req(struct io_ring_ctx *ctx) -{ - struct io_kiocb *req; - - req = list_first_entry_or_null(&ctx->defer_list, struct io_kiocb, list); - if (req && !req_need_defer(req)) { - list_del_init(&req->list); - return req; - } - - return NULL; -} - static struct io_kiocb *io_get_timeout_req(struct io_ring_ctx *ctx) { struct io_kiocb *req; @@ -1126,6 +1113,19 @@ static void io_kill_timeouts(struct io_ring_ctx *ctx) spin_unlock_irq(&ctx->completion_lock); }
+static void __io_queue_deferred(struct io_ring_ctx *ctx) +{ + do { + struct io_kiocb *req = list_first_entry(&ctx->defer_list, + struct io_kiocb, list); + + if (req_need_defer(req)) + break; + list_del_init(&req->list); + io_queue_async_work(req); + } while (!list_empty(&ctx->defer_list)); +} + static void io_commit_cqring(struct io_ring_ctx *ctx) { struct io_kiocb *req; @@ -1135,8 +1135,8 @@ static void io_commit_cqring(struct io_ring_ctx *ctx)
__io_commit_cqring(ctx);
- while ((req = io_get_deferred_req(ctx)) != NULL) - io_queue_async_work(req); + if (unlikely(!list_empty(&ctx->defer_list))) + __io_queue_deferred(ctx); }
static struct io_uring_cqe *io_get_cqring(struct io_ring_ctx *ctx)
From: Pavel Begunkov asml.silence@gmail.com
mainline inclusion from mainline-5.8-rc1 commit 0bf0eefdab52d9f9f3a1eeda32a4fc7afe4e9219 category: feature bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27 CVE: NA ---------------------------
io_close() was punting async manually to skip grabbing files. Use REQ_F_NO_FILE_TABLE instead, and pass it through the generic path with -EAGAIN.
Signed-off-by: Pavel Begunkov asml.silence@gmail.com Signed-off-by: Jens Axboe axboe@kernel.dk Signed-off-by: yangerkun yangerkun@huawei.com Reviewed-by: zhangyi (F) yi.zhang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- fs/io_uring.c | 20 +++++--------------- 1 file changed, 5 insertions(+), 15 deletions(-)
diff --git a/fs/io_uring.c b/fs/io_uring.c index cedbf117450a..e50734123350 100644 --- a/fs/io_uring.c +++ b/fs/io_uring.c @@ -3437,25 +3437,15 @@ static int io_close(struct io_kiocb *req, bool force_nonblock)
req->close.put_file = NULL; ret = __close_fd_get_file(req->close.fd, &req->close.put_file); - if (ret < 0) { - if (ret == -ENOENT) - ret = -EBADF; - return ret; - } + if (ret < 0) + return (ret == -ENOENT) ? -EBADF : ret;
/* if the file has a flush method, be safe and punt to async */ if (req->close.put_file->f_op->flush && force_nonblock) { - /* submission ref will be dropped, take it for async */ - refcount_inc(&req->refs); - + /* avoid grabbing files - we don't need the files */ + req->flags |= REQ_F_NO_FILE_TABLE | REQ_F_MUST_PUNT; req->work.func = io_close_finish; - /* - * Do manual async queue here to avoid grabbing files - we don't - * need the files, and it'll cause io_close_finish() to close - * the file again and cause a double CQE entry for this request - */ - io_queue_async_work(req); - return 0; + return -EAGAIN; }
/*
From: Bijan Mottahedeh bijan.mottahedeh@oracle.com
mainline inclusion from mainline-5.8-rc1 commit 1d9e1288039a47dc1189c3c1fed5cf3c215e94b7 category: feature bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27 CVE: NA ---------------------------
Separate statx data from open in io_kiocb. No functional changes.
Signed-off-by: Bijan Mottahedeh bijan.mottahedeh@oracle.com Signed-off-by: Jens Axboe axboe@kernel.dk
Conflicts: fs/io_uring.c [commit c12cedf24e78("io_uring: add 'struct open_how' to the openat request context") is not merged] Signed-off-by: yangerkun yangerkun@huawei.com Reviewed-by: zhangyi (F) yi.zhang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- fs/io_uring.c | 32 ++++++++++++++++++++------------ 1 file changed, 20 insertions(+), 12 deletions(-)
diff --git a/fs/io_uring.c b/fs/io_uring.c index e50734123350..08ee4e0e815f 100644 --- a/fs/io_uring.c +++ b/fs/io_uring.c @@ -427,10 +427,8 @@ struct io_open { int dfd; union { umode_t mode; - unsigned mask; }; struct filename *filename; - struct statx __user *buffer; int flags; unsigned long nofile; }; @@ -482,6 +480,15 @@ struct io_provide_buf { __u16 bid; };
+struct io_statx { + struct file *file; + int dfd; + unsigned int mask; + unsigned int flags; + struct filename *filename; + struct statx __user *buffer; +}; + struct io_async_connect { struct sockaddr_storage address; }; @@ -623,6 +630,7 @@ struct io_kiocb { struct io_epoll epoll; struct io_splice splice; struct io_provide_buf pbuf; + struct io_statx statx; };
struct io_async_ctx *io; @@ -3326,19 +3334,19 @@ static int io_statx_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe) if (req->flags & REQ_F_NEED_CLEANUP) return 0;
- req->open.dfd = READ_ONCE(sqe->fd); - req->open.mask = READ_ONCE(sqe->len); + req->statx.dfd = READ_ONCE(sqe->fd); + req->statx.mask = READ_ONCE(sqe->len); fname = u64_to_user_ptr(READ_ONCE(sqe->addr)); - req->open.buffer = u64_to_user_ptr(READ_ONCE(sqe->addr2)); - req->open.flags = READ_ONCE(sqe->statx_flags); + req->statx.buffer = u64_to_user_ptr(READ_ONCE(sqe->addr2)); + req->statx.flags = READ_ONCE(sqe->statx_flags);
- if (vfs_stat_set_lookup_flags(&lookup_flags, req->open.flags)) + if (vfs_stat_set_lookup_flags(&lookup_flags, req->statx.flags)) return -EINVAL;
- req->open.filename = getname_flags(fname, lookup_flags, NULL); - if (IS_ERR(req->open.filename)) { - ret = PTR_ERR(req->open.filename); - req->open.filename = NULL; + req->statx.filename = getname_flags(fname, lookup_flags, NULL); + if (IS_ERR(req->statx.filename)) { + ret = PTR_ERR(req->statx.filename); + req->statx.filename = NULL; return ret; }
@@ -3348,7 +3356,7 @@ static int io_statx_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe)
static int io_statx(struct io_kiocb *req, bool force_nonblock) { - struct io_open *ctx = &req->open; + struct io_statx *ctx = &req->statx; unsigned lookup_flags; struct path path; struct kstat stat;
From: Bijan Mottahedeh bijan.mottahedeh@oracle.com
mainline inclusion from mainline-5.8-rc1 commit 0018784fc84f636d473a0d2a65a34f9d01893c0a category: feature bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27 CVE: NA ---------------------------
This is a preparatory patch to allow io_uring to invoke statx directly.
Signed-off-by: Bijan Mottahedeh bijan.mottahedeh@oracle.com Signed-off-by: Jens Axboe axboe@kernel.dk Signed-off-by: yangerkun yangerkun@huawei.com Reviewed-by: zhangyi (F) yi.zhang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- fs/internal.h | 2 ++ fs/stat.c | 32 +++++++++++++++++++------------- 2 files changed, 21 insertions(+), 13 deletions(-)
diff --git a/fs/internal.h b/fs/internal.h index acbc60a8e13e..6aa0e08161ac 100644 --- a/fs/internal.h +++ b/fs/internal.h @@ -195,3 +195,5 @@ int sb_init_dio_done_wq(struct super_block *sb); */ unsigned vfs_stat_set_lookup_flags(unsigned *lookup_flags, int flags); int cp_statx(const struct kstat *stat, struct statx __user *buffer); +int do_statx(int dfd, const char __user *filename, unsigned flags, + unsigned int mask, struct statx __user *buffer); diff --git a/fs/stat.c b/fs/stat.c index 46dfe0df1a71..a69de0897b74 100644 --- a/fs/stat.c +++ b/fs/stat.c @@ -562,6 +562,24 @@ cp_statx(const struct kstat *stat, struct statx __user *buffer) return copy_to_user(buffer, &tmp, sizeof(tmp)) ? -EFAULT : 0; }
+int do_statx(int dfd, const char __user *filename, unsigned flags, + unsigned int mask, struct statx __user *buffer) +{ + struct kstat stat; + int error; + + if (mask & STATX__RESERVED) + return -EINVAL; + if ((flags & AT_STATX_SYNC_TYPE) == AT_STATX_SYNC_TYPE) + return -EINVAL; + + error = vfs_statx(dfd, filename, flags, &stat, mask); + if (error) + return error; + + return cp_statx(&stat, buffer); +} + /** * sys_statx - System call to get enhanced stats * @dfd: Base directory to pathwalk from *or* fd to stat. @@ -578,19 +596,7 @@ SYSCALL_DEFINE5(statx, unsigned int, mask, struct statx __user *, buffer) { - struct kstat stat; - int error; - - if (mask & STATX__RESERVED) - return -EINVAL; - if ((flags & AT_STATX_SYNC_TYPE) == AT_STATX_SYNC_TYPE) - return -EINVAL; - - error = vfs_statx(dfd, filename, flags, &stat, mask); - if (error) - return error; - - return cp_statx(&stat, buffer); + return do_statx(dfd, filename, flags, mask, buffer); }
#ifdef CONFIG_COMPAT
From: Bijan Mottahedeh bijan.mottahedeh@oracle.com
mainline inclusion from mainline-5.8-rc1 commit e62753e4e2926f249d088cc0517be5ed4efec6d6 category: feature bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27 CVE: NA ---------------------------
Calling statx directly both simplifies the interface and avoids potential incompatibilities between sync and async invocations.
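[editor's note: from userspace the directly-invoked op looks like this; a minimal liburing sketch, assuming liburing >= 0.6 and a kernel providing IORING_OP_STATX, with an arbitrary example path and mask]

#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <sys/stat.h>
#include <liburing.h>

int main(void)
{
    struct io_uring ring;
    struct io_uring_sqe *sqe;
    struct io_uring_cqe *cqe;
    struct statx stx;

    if (io_uring_queue_init(8, &ring, 0) < 0)
        return 1;

    sqe = io_uring_get_sqe(&ring);
    /* same arguments as statx(2): dfd, path, flags, mask, buffer */
    io_uring_prep_statx(sqe, AT_FDCWD, "/etc/hostname", 0, STATX_SIZE, &stx);
    io_uring_submit(&ring);

    io_uring_wait_cqe(&ring, &cqe);
    if (cqe->res == 0)
        printf("size: %llu bytes\n", (unsigned long long)stx.stx_size);
    else
        fprintf(stderr, "statx failed: %d\n", cqe->res);
    io_uring_cqe_seen(&ring, cqe);
    io_uring_queue_exit(&ring);
    return 0;
}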
Signed-off-by: Bijan Mottahedeh bijan.mottahedeh@oracle.com Signed-off-by: Jens Axboe axboe@kernel.dk
Conflicts: fs/io_uring.c [commit cebdb98617ae("io_uring: add support for IORING_OP_OPENAT2") is not merged] Signed-off-by: yangerkun yangerkun@huawei.com Reviewed-by: zhangyi (F) yi.zhang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- fs/io_uring.c | 50 ++++---------------------------------------------- 1 file changed, 4 insertions(+), 46 deletions(-)
diff --git a/fs/io_uring.c b/fs/io_uring.c index 08ee4e0e815f..72991fb10d28 100644 --- a/fs/io_uring.c +++ b/fs/io_uring.c @@ -485,7 +485,7 @@ struct io_statx { int dfd; unsigned int mask; unsigned int flags; - struct filename *filename; + const char __user *filename; struct statx __user *buffer; };
@@ -3323,43 +3323,23 @@ static int io_fadvise(struct io_kiocb *req, bool force_nonblock)
static int io_statx_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe) { - const char __user *fname; - unsigned lookup_flags; - int ret; - if (sqe->ioprio || sqe->buf_index) return -EINVAL; if (req->flags & REQ_F_FIXED_FILE) return -EBADF; - if (req->flags & REQ_F_NEED_CLEANUP) - return 0;
req->statx.dfd = READ_ONCE(sqe->fd); req->statx.mask = READ_ONCE(sqe->len); - fname = u64_to_user_ptr(READ_ONCE(sqe->addr)); + req->statx.filename = u64_to_user_ptr(READ_ONCE(sqe->addr)); req->statx.buffer = u64_to_user_ptr(READ_ONCE(sqe->addr2)); req->statx.flags = READ_ONCE(sqe->statx_flags);
- if (vfs_stat_set_lookup_flags(&lookup_flags, req->statx.flags)) - return -EINVAL; - - req->statx.filename = getname_flags(fname, lookup_flags, NULL); - if (IS_ERR(req->statx.filename)) { - ret = PTR_ERR(req->statx.filename); - req->statx.filename = NULL; - return ret; - } - - req->flags |= REQ_F_NEED_CLEANUP; return 0; }
static int io_statx(struct io_kiocb *req, bool force_nonblock) { struct io_statx *ctx = &req->statx; - unsigned lookup_flags; - struct path path; - struct kstat stat; int ret;
if (force_nonblock) { @@ -3369,29 +3349,9 @@ static int io_statx(struct io_kiocb *req, bool force_nonblock) return -EAGAIN; }
- if (vfs_stat_set_lookup_flags(&lookup_flags, ctx->flags)) - return -EINVAL; - -retry: - /* filename_lookup() drops it, keep a reference */ - ctx->filename->refcnt++; - - ret = filename_lookup(ctx->dfd, ctx->filename, lookup_flags, &path, - NULL); - if (ret) - goto err; + ret = do_statx(ctx->dfd, ctx->filename, ctx->flags, ctx->mask, + ctx->buffer);
- ret = vfs_getattr(&path, &stat, ctx->mask, ctx->flags); - path_put(&path); - if (retry_estale(ret, lookup_flags)) { - lookup_flags |= LOOKUP_REVAL; - goto retry; - } - if (!ret) - ret = cp_statx(&stat, ctx->buffer); -err: - putname(ctx->filename); - req->flags &= ~REQ_F_NEED_CLEANUP; if (ret < 0) req_set_fail_links(req); io_cqring_add_event(req, ret); @@ -5142,8 +5102,6 @@ static void io_cleanup_req(struct io_kiocb *req) kfree(req->sr_msg.kbuf); break; case IORING_OP_OPENAT: - case IORING_OP_STATX: - putname(req->open.filename); break; case IORING_OP_SPLICE: case IORING_OP_TEE:
From: Pavel Begunkov asml.silence@gmail.com
mainline inclusion from mainline-5.8-rc1 commit bfe68a221905de37e65394a6d58c1e5f3e545d2f category: feature bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27 CVE: NA ---------------------------
Offset timeouts currently wait not for sqe->off non-timeout CQEs, but for sqe->off plus the number of prior inflight requests. Make them wait for exactly sqe->off non-timeout completions.
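[editor's note: the fixed semantics are visible from userspace; a hedged liburing sketch in which a timeout with off == 2 completes with res 0 after exactly two non-timeout CQEs, or with -ETIME if the timer fires first]

#include <stdio.h>
#include <liburing.h>

int main(void)
{
    struct io_uring ring;
    struct io_uring_cqe *cqe;
    struct __kernel_timespec ts = { .tv_sec = 5 };
    int i;

    if (io_uring_queue_init(8, &ring, 0) < 0)
        return 1;

    /* wait for 2 completions, or 5 seconds, whichever comes first */
    io_uring_prep_timeout(io_uring_get_sqe(&ring), &ts, 2, 0);
    io_uring_prep_nop(io_uring_get_sqe(&ring));
    io_uring_prep_nop(io_uring_get_sqe(&ring));
    io_uring_submit(&ring);

    for (i = 0; i < 3; i++) {
        io_uring_wait_cqe(&ring, &cqe);
        /* the timeout CQE reports 0 when satisfied by count, -ETIME on expiry */
        printf("res=%d\n", cqe->res);
        io_uring_cqe_seen(&ring, cqe);
    }
    io_uring_queue_exit(&ring);
    return 0;
}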
Reported-by: Jens Axboe axboe@kernel.dk Signed-off-by: Pavel Begunkov asml.silence@gmail.com Signed-off-by: Jens Axboe axboe@kernel.dk Signed-off-by: yangerkun yangerkun@huawei.com Reviewed-by: zhangyi (F) yi.zhang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- fs/io_uring.c | 65 +++++++++++---------------------------------------- 1 file changed, 14 insertions(+), 51 deletions(-)
diff --git a/fs/io_uring.c b/fs/io_uring.c index 4ec02d11110f..5757474c0754 100644 --- a/fs/io_uring.c +++ b/fs/io_uring.c @@ -394,7 +394,8 @@ struct io_timeout { struct file *file; u64 addr; int flags; - u32 count; + u32 off; + u32 target_seq; };
struct io_rw { @@ -1125,8 +1126,10 @@ static void io_flush_timeouts(struct io_ring_ctx *ctx)
if (req->flags & REQ_F_TIMEOUT_NOSEQ) break; - if (__req_need_defer(req)) + if (req->timeout.target_seq != ctx->cached_cq_tail + - atomic_read(&ctx->cq_timeouts)) break; + list_del_init(&req->list); io_kill_timeout(req); } @@ -4609,20 +4612,8 @@ static enum hrtimer_restart io_timeout_fn(struct hrtimer *timer) * We could be racing with timeout deletion. If the list is empty, * then timeout lookup already found it and will be handling it. */ - if (!list_empty(&req->list)) { - struct io_kiocb *prev; - - /* - * Adjust the reqs sequence before the current one because it - * will consume a slot in the cq_ring and the cq_tail - * pointer will be increased, otherwise other timeout reqs may - * return in advance without waiting for enough wait_nr. - */ - prev = req; - list_for_each_entry_continue_reverse(prev, &ctx->timeout_list, list) - prev->sequence++; + if (!list_empty(&req->list)) list_del_init(&req->list); - }
io_cqring_fill_event(req, -ETIME); io_commit_cqring(ctx); @@ -4714,7 +4705,7 @@ static int io_timeout_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe, if (flags & ~IORING_TIMEOUT_ABS) return -EINVAL;
- req->timeout.count = off; + req->timeout.off = off;
if (!req->io && io_alloc_async_ctx(req)) return -ENOMEM; @@ -4738,13 +4729,10 @@ static int io_timeout_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe, static int io_timeout(struct io_kiocb *req) { struct io_ring_ctx *ctx = req->ctx; - struct io_timeout_data *data; + struct io_timeout_data *data = &req->io->timeout; struct list_head *entry; - unsigned span = 0; - u32 count = req->timeout.count; - u32 seq = req->sequence; + u32 tail, off = req->timeout.off;
- data = &req->io->timeout; spin_lock_irq(&ctx->completion_lock);
/* @@ -4752,13 +4740,14 @@ static int io_timeout(struct io_kiocb *req) * timeout event to be satisfied. If it isn't set, then this is * a pure timeout request, sequence isn't used. */ - if (!count) { + if (!off) { req->flags |= REQ_F_TIMEOUT_NOSEQ; entry = ctx->timeout_list.prev; goto add; }
- req->sequence = seq + count; + tail = ctx->cached_cq_tail - atomic_read(&ctx->cq_timeouts); + req->timeout.target_seq = tail + off;
/* * Insertion sort, ensuring the first entry in the list is always @@ -4766,39 +4755,13 @@ static int io_timeout(struct io_kiocb *req) */ list_for_each_prev(entry, &ctx->timeout_list) { struct io_kiocb *nxt = list_entry(entry, struct io_kiocb, list); - unsigned nxt_seq; - long long tmp, tmp_nxt; - u32 nxt_offset = nxt->timeout.count;
if (nxt->flags & REQ_F_TIMEOUT_NOSEQ) continue; - - /* - * Since seq + count can overflow, use type long - * long to store it. - */ - tmp = (long long)seq + count; - nxt_seq = nxt->sequence - nxt_offset; - tmp_nxt = (long long)nxt_seq + nxt_offset; - - /* - * cached_sq_head may overflow, and it will never overflow twice - * once there is some timeout req still be valid. - */ - if (seq < nxt_seq) - tmp += UINT_MAX; - - if (tmp > tmp_nxt) + /* nxt.seq is behind @tail, otherwise would've been completed */ + if (off >= nxt->timeout.target_seq - tail) break; - - /* - * Sequence of reqs after the insert one and itself should - * be adjusted because each timeout req consumes a slot. - */ - span++; - nxt->sequence++; } - req->sequence -= span; add: list_add(&req->list, entry); data->timer.function = io_timeout_fn;
From: Pavel Begunkov asml.silence@gmail.com
mainline inclusion from mainline-5.8-rc1 commit 7b53d59859bc932b37895d2d37388e7fa29af7a5 category: feature bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27 CVE: NA ---------------------------
Overflowed requests in io_uring_cancel_files() should be stripped only of their inflight and overflow references; all other remaining references are owned by someone else.
If refcount_sub_and_test() fails, the code goes further and puts an extra reference; don't do that. Also, there is no need to do io_wq_cancel_work() for overflowed reqs, as they will be let go shortly anyway.
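[editor's note: a plain C11 illustration of the rule being enforced, with hypothetical names rather than the kernel's refcount_t implementation: a "false" sub_and_test means the remaining references belong to other owners, so an extra put would underflow the counter and free the object prematurely]

#include <stdatomic.h>
#include <stdbool.h>
#include <stdio.h>

static atomic_int refs;

static bool ref_sub_and_test(int n)
{
    /* true only if this drop released the last reference */
    return atomic_fetch_sub(&refs, n) == n;
}

int main(void)
{
    atomic_store(&refs, 3);        /* 2 owned elsewhere + 1 overflow ref */

    if (ref_sub_and_test(1)) {     /* shed only the overflow ref */
        puts("freed");             /* must not happen here */
    } else {
        /* wrong: another put here would underflow the counter and
         * free the object while other owners still hold references */
        puts("still owned elsewhere");
    }
    return 0;
}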
Signed-off-by: Pavel Begunkov asml.silence@gmail.com Signed-off-by: Jens Axboe axboe@kernel.dk Signed-off-by: yangerkun yangerkun@huawei.com Reviewed-by: zhangyi (F) yi.zhang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- fs/io_uring.c | 5 +++-- 1 file changed, 3 insertions(+), 2 deletions(-)
diff --git a/fs/io_uring.c b/fs/io_uring.c index 5757474c0754..8516dffe6649 100644 --- a/fs/io_uring.c +++ b/fs/io_uring.c @@ -7395,10 +7395,11 @@ static void io_uring_cancel_files(struct io_ring_ctx *ctx, finish_wait(&ctx->inflight_wait, &wait); continue; } + } else { + io_wq_cancel_work(ctx->io_wq, &cancel_req->work); + io_put_req(cancel_req); }
- io_wq_cancel_work(ctx->io_wq, &cancel_req->work); - io_put_req(cancel_req); schedule(); finish_wait(&ctx->inflight_wait, &wait); }
From: Jens Axboe axboe@kernel.dk
mainline inclusion from mainline-5.8-rc1 commit fd2206e4e97b5bae422d9f2f9ebbc79bc97e44a5 category: feature bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27 CVE: NA ---------------------------
A previous commit enabled this functionality, which also enabled O_PATH to work correctly with io_uring. But we can't safely close the ring itself, as the file handle isn't reference counted inside io_uring_enter(). Instead of jumping through hoops to enable ring closure, add a "soft" ->needs_file option, ->needs_file_no_error. This enables O_PATH file descriptors to work, but still catches the case of trying to close the ring itself.
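[editor's note: the user-visible contract can be checked with a short liburing program; a hedged sketch assuming liburing with io_uring_prep_close, exercising the ring-fd half of the rule]

#include <stdio.h>
#include <liburing.h>

int main(void)
{
    struct io_uring ring;
    struct io_uring_cqe *cqe;

    if (io_uring_queue_init(8, &ring, 0) < 0)
        return 1;

    /* closing the ring's own fd through the ring must be refused */
    io_uring_prep_close(io_uring_get_sqe(&ring), ring.ring_fd);
    io_uring_submit(&ring);
    io_uring_wait_cqe(&ring, &cqe);
    printf("close(ring_fd) via io_uring: res=%d (expect -EBADF = -9)\n",
           cqe->res);
    io_uring_cqe_seen(&ring, cqe);
    io_uring_queue_exit(&ring);
    return 0;
}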
Reported-by: Jann Horn jannh@google.com Fixes: 904fbcb115c8 ("io_uring: remove 'fd is io_uring' from close path") Signed-off-by: Jens Axboe axboe@kernel.dk Signed-off-by: yangerkun yangerkun@huawei.com Reviewed-by: zhangyi (F) yi.zhang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- fs/io_uring.c | 25 +++++++++++++++++-------- 1 file changed, 17 insertions(+), 8 deletions(-)
diff --git a/fs/io_uring.c b/fs/io_uring.c index 8516dffe6649..aceede48ccf2 100644 --- a/fs/io_uring.c +++ b/fs/io_uring.c @@ -701,6 +701,8 @@ struct io_op_def { unsigned needs_mm : 1; /* needs req->file assigned */ unsigned needs_file : 1; + /* don't fail if file grab fails */ + unsigned needs_file_no_error : 1; /* hash wq insertion if file is a regular file */ unsigned hash_reg_file : 1; /* unbound wq insertion if file is a non-regular file */ @@ -807,6 +809,8 @@ static const struct io_op_def io_op_defs[] = { .needs_fs = 1, }, [IORING_OP_CLOSE] = { + .needs_file = 1, + .needs_file_no_error = 1, .file_table = 1, }, [IORING_OP_FILES_UPDATE] = { @@ -3371,6 +3375,10 @@ static int io_close_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe) return -EBADF;
req->close.fd = READ_ONCE(sqe->fd); + if ((req->file && req->file->f_op == &io_uring_fops) || + req->close.fd == req->ctx->ring_fd) + return -EBADF; + return 0; }
@@ -5376,19 +5384,20 @@ static int io_file_get(struct io_submit_state *state, struct io_kiocb *req, return -EBADF; fd = array_index_nospec(fd, ctx->nr_user_files); file = io_file_from_index(ctx, fd); - if (!file) - return -EBADF; - req->fixed_file_refs = ctx->file_data->cur_refs; - percpu_ref_get(req->fixed_file_refs); + if (file) { + req->fixed_file_refs = ctx->file_data->cur_refs; + percpu_ref_get(req->fixed_file_refs); + } } else { trace_io_uring_file_get(ctx, fd); file = __io_file_get(state, fd); - if (unlikely(!file)) - return -EBADF; }
- *out_file = file; - return 0; + if (file || io_op_defs[req->opcode].needs_file_no_error) { + *out_file = file; + return 0; + } + return -EBADF; }
static int io_req_set_file(struct io_submit_state *state, struct io_kiocb *req,
From: Pavel Begunkov asml.silence@gmail.com
mainline inclusion from mainline-5.8-rc1 commit 3232dd02af65f2d01be641120d2a710176b0c7a7 category: feature bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27 CVE: NA ---------------------------
IORING_SETUP_IOPOLL is defined only for read/write; other opcodes should be disallowed, otherwise they can trigger an oops like the one below. Also refuse open/close with SQPOLL, as the polling thread wouldn't know which file table to use.
RIP: 0010:io_iopoll_getevents+0x111/0x5a0
Call Trace:
 ? _raw_spin_unlock_irqrestore+0x24/0x40
 ? do_send_sig_info+0x64/0x90
 io_iopoll_reap_events.part.0+0x5e/0xa0
 io_ring_ctx_wait_and_kill+0x132/0x1c0
 io_uring_release+0x20/0x30
 __fput+0xcd/0x230
 ____fput+0xe/0x10
 task_work_run+0x67/0xa0
 do_exit+0x353/0xb10
 ? handle_mm_fault+0xd4/0x200
 ? syscall_trace_enter+0x18c/0x2c0
 do_group_exit+0x43/0xa0
 __x64_sys_exit_group+0x18/0x20
 do_syscall_64+0x60/0x1e0
 entry_SYSCALL_64_after_hwframe+0x44/0xa9
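[editor's note: with the checks in place the failure mode becomes a clean -EINVAL completion instead of the oops above; a hedged liburing sketch, with fadvise chosen as an arbitrary non-pollable opcode]

#include <fcntl.h>
#include <stdio.h>
#include <liburing.h>

int main(void)
{
    struct io_uring ring;
    struct io_uring_cqe *cqe;

    if (io_uring_queue_init(8, &ring, IORING_SETUP_IOPOLL) < 0)
        return 1;

    /* fadvise is not polled I/O, so it must be refused on this ring */
    io_uring_prep_fadvise(io_uring_get_sqe(&ring), 0, 0, 0,
                          POSIX_FADV_NORMAL);
    io_uring_submit(&ring);
    io_uring_wait_cqe(&ring, &cqe);
    printf("fadvise on IOPOLL ring: res=%d (expect -EINVAL = -22)\n",
           cqe->res);
    io_uring_cqe_seen(&ring, cqe);
    io_uring_queue_exit(&ring);
    return 0;
}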
Signed-off-by: Pavel Begunkov asml.silence@gmail.com [axboe: allow provide/remove buffers and files update] Signed-off-by: Jens Axboe axboe@kernel.dk
Conflicts: fs/io_uring.c [commit cebdb98617ae("io_uring: add support for IORING_OP_OPENAT2") is not merged] Signed-off-by: yangerkun yangerkun@huawei.com Reviewed-by: zhangyi (F) yi.zhang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- fs/io_uring.c | 16 ++++++++++++++++ 1 file changed, 16 insertions(+)
diff --git a/fs/io_uring.c b/fs/io_uring.c index aceede48ccf2..fd0b428c965d 100644 --- a/fs/io_uring.c +++ b/fs/io_uring.c @@ -2764,6 +2764,8 @@ static int __io_splice_prep(struct io_kiocb *req,
if (req->flags & REQ_F_NEED_CLEANUP) return 0; + if (unlikely(req->ctx->flags & IORING_SETUP_IOPOLL)) + return -EINVAL;
sp->file_in = NULL; sp->len = READ_ONCE(sqe->len); @@ -2964,6 +2966,8 @@ static int io_fallocate_prep(struct io_kiocb *req, { if (sqe->ioprio || sqe->buf_index || sqe->rw_flags) return -EINVAL; + if (unlikely(req->ctx->flags & IORING_SETUP_IOPOLL)) + return -EINVAL;
req->sync.off = READ_ONCE(sqe->off); req->sync.len = READ_ONCE(sqe->addr); @@ -2989,6 +2993,8 @@ static int io_openat_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe) const char __user *fname; int ret;
+ if (unlikely(req->ctx->flags & (IORING_SETUP_IOPOLL|IORING_SETUP_SQPOLL))) + return -EINVAL; if (sqe->ioprio || sqe->buf_index) return -EINVAL; if (req->flags & REQ_F_FIXED_FILE) @@ -3213,6 +3219,8 @@ static int io_epoll_ctl_prep(struct io_kiocb *req, #if defined(CONFIG_EPOLL) if (sqe->ioprio || sqe->buf_index) return -EINVAL; + if (unlikely(req->ctx->flags & IORING_SETUP_IOPOLL)) + return -EINVAL;
req->epoll.epfd = READ_ONCE(sqe->fd); req->epoll.op = READ_ONCE(sqe->len); @@ -3257,6 +3265,8 @@ static int io_madvise_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe) #if defined(CONFIG_ADVISE_SYSCALLS) && defined(CONFIG_MMU) if (sqe->ioprio || sqe->buf_index || sqe->off) return -EINVAL; + if (unlikely(req->ctx->flags & IORING_SETUP_IOPOLL)) + return -EINVAL;
req->madvise.addr = READ_ONCE(sqe->addr); req->madvise.len = READ_ONCE(sqe->len); @@ -3291,6 +3301,8 @@ static int io_fadvise_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe) { if (sqe->ioprio || sqe->buf_index || sqe->addr) return -EINVAL; + if (unlikely(req->ctx->flags & IORING_SETUP_IOPOLL)) + return -EINVAL;
req->fadvise.offset = READ_ONCE(sqe->off); req->fadvise.len = READ_ONCE(sqe->len); @@ -3324,6 +3336,8 @@ static int io_fadvise(struct io_kiocb *req, bool force_nonblock)
static int io_statx_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe) { + if (unlikely(req->ctx->flags & IORING_SETUP_IOPOLL)) + return -EINVAL; if (sqe->ioprio || sqe->buf_index) return -EINVAL; if (req->flags & REQ_F_FIXED_FILE) @@ -3368,6 +3382,8 @@ static int io_close_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe) */ req->work.flags |= IO_WQ_WORK_NO_CANCEL;
+ if (unlikely(req->ctx->flags & (IORING_SETUP_IOPOLL|IORING_SETUP_SQPOLL))) + return -EINVAL; if (sqe->ioprio || sqe->off || sqe->addr || sqe->len || sqe->rw_flags || sqe->buf_index) return -EINVAL;
From: Pavel Begunkov asml.silence@gmail.com
mainline inclusion from mainline-5.8-rc1 commit d2b6f48b691ed67569786c332f0173b918d3fd1b category: feature bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27 CVE: NA ---------------------------
Fail recv/send in case of IORING_SETUP_IOPOLL earlier, during prep, so the check is done only once. This removes duplication as well.
Signed-off-by: Pavel Begunkov asml.silence@gmail.com Signed-off-by: Jens Axboe axboe@kernel.dk Signed-off-by: yangerkun yangerkun@huawei.com Reviewed-by: zhangyi (F) yi.zhang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- fs/io_uring.c | 18 ++++++------------ 1 file changed, 6 insertions(+), 12 deletions(-)
diff --git a/fs/io_uring.c b/fs/io_uring.c index fd0b428c965d..7633c2de7430 100644 --- a/fs/io_uring.c +++ b/fs/io_uring.c @@ -3520,6 +3520,9 @@ static int io_sendmsg_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe) struct io_async_ctx *io = req->io; int ret;
+ if (unlikely(req->ctx->flags & IORING_SETUP_IOPOLL)) + return -EINVAL; + sr->msg_flags = READ_ONCE(sqe->msg_flags); sr->msg = u64_to_user_ptr(READ_ONCE(sqe->addr)); sr->len = READ_ONCE(sqe->len); @@ -3549,9 +3552,6 @@ static int io_sendmsg(struct io_kiocb *req, bool force_nonblock) struct socket *sock; int ret;
- if (unlikely(req->ctx->flags & IORING_SETUP_IOPOLL)) - return -EINVAL; - sock = sock_from_file(req->file, &ret); if (sock) { struct io_async_ctx io; @@ -3605,9 +3605,6 @@ static int io_send(struct io_kiocb *req, bool force_nonblock) struct socket *sock; int ret;
- if (unlikely(req->ctx->flags & IORING_SETUP_IOPOLL)) - return -EINVAL; - sock = sock_from_file(req->file, &ret); if (sock) { struct io_sr_msg *sr = &req->sr_msg; @@ -3760,6 +3757,9 @@ static int io_recvmsg_prep(struct io_kiocb *req, struct io_async_ctx *io = req->io; int ret;
+ if (unlikely(req->ctx->flags & IORING_SETUP_IOPOLL)) + return -EINVAL; + sr->msg_flags = READ_ONCE(sqe->msg_flags); sr->msg = u64_to_user_ptr(READ_ONCE(sqe->addr)); sr->len = READ_ONCE(sqe->len); @@ -3788,9 +3788,6 @@ static int io_recvmsg(struct io_kiocb *req, bool force_nonblock) struct socket *sock; int ret, cflags = 0;
- if (unlikely(req->ctx->flags & IORING_SETUP_IOPOLL)) - return -EINVAL; - sock = sock_from_file(req->file, &ret); if (sock) { struct io_buffer *kbuf; @@ -3852,9 +3849,6 @@ static int io_recv(struct io_kiocb *req, bool force_nonblock) struct socket *sock; int ret, cflags = 0;
- if (unlikely(req->ctx->flags & IORING_SETUP_IOPOLL)) - return -EINVAL; - sock = sock_from_file(req->file, &ret); if (sock) { struct io_sr_msg *sr = &req->sr_msg;
From: Pavel Begunkov asml.silence@gmail.com
mainline inclusion from mainline-5.8-rc2 commit 801dd57bd1d8c2c253f43635a3045bfa32a810b3 category: feature bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27 CVE: NA ---------------------------
For an exiting process, io_uring tries to cancel all of its inflight requests. Use req->task to match such requests instead of work.pid. req->task is always set, and it is valid here because we only match against the currently exiting task.
Also, remove work.pid and everything related, it's useless now.
Reported-by: Eric W. Biederman ebiederm@xmission.com Signed-off-by: Pavel Begunkov asml.silence@gmail.com Signed-off-by: Jens Axboe axboe@kernel.dk Signed-off-by: yangerkun yangerkun@huawei.com Reviewed-by: zhangyi (F) yi.zhang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- fs/io-wq.h | 1 - fs/io_uring.c | 16 ++++++---------- 2 files changed, 6 insertions(+), 11 deletions(-)
diff --git a/fs/io-wq.h b/fs/io-wq.h index b72538fe5afd..071f1a997800 100644 --- a/fs/io-wq.h +++ b/fs/io-wq.h @@ -90,7 +90,6 @@ struct io_wq_work { const struct cred *creds; struct fs_struct *fs; unsigned flags; - pid_t task_pid; };
static inline struct io_wq_work *wq_next_work(struct io_wq_work *work) diff --git a/fs/io_uring.c b/fs/io_uring.c index cb032f2730a8..2639dcc4945e 100644 --- a/fs/io_uring.c +++ b/fs/io_uring.c @@ -1062,8 +1062,6 @@ static inline void io_req_work_grab_env(struct io_kiocb *req, } spin_unlock(¤t->fs->lock); } - if (!req->work.task_pid) - req->work.task_pid = task_pid_vnr(current); }
static inline void io_req_work_drop_env(struct io_kiocb *req) @@ -7409,11 +7407,12 @@ static void io_uring_cancel_files(struct io_ring_ctx *ctx, } }
-static bool io_cancel_pid_cb(struct io_wq_work *work, void *data) +static bool io_cancel_task_cb(struct io_wq_work *work, void *data) { - pid_t pid = (pid_t) (unsigned long) data; + struct io_kiocb *req = container_of(work, struct io_kiocb, work); + struct task_struct *task = data;
- return work->task_pid == pid; + return req->task == task; }
static int io_uring_flush(struct file *file, void *data) @@ -7425,11 +7424,8 @@ static int io_uring_flush(struct file *file, void *data) /* * If the task is going away, cancel work it may have pending */ - if (fatal_signal_pending(current) || (current->flags & PF_EXITING)) { - void *data = (void *) (unsigned long)task_pid_vnr(current); - - io_wq_cancel_cb(ctx->io_wq, io_cancel_pid_cb, data, true); - } + if (fatal_signal_pending(current) || (current->flags & PF_EXITING)) + io_wq_cancel_cb(ctx->io_wq, io_cancel_task_cb, current, true);
return 0; }
From: Xiaoguang Wang xiaoguang.wang@linux.alibaba.com
mainline inclusion from mainline-5.8-rc2 commit 2d7d67920e5c8e0854df23ca77da2dd5880ce5dd category: feature bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27 CVE: NA ---------------------------
In IOPOLL mode, on an -EAGAIN error we will try to submit the io request again using io-wq, so don't fail the rest of the links if this io request has links.
Cc: stable@vger.kernel.org Signed-off-by: Xiaoguang Wang xiaoguang.wang@linux.alibaba.com Signed-off-by: Jens Axboe axboe@kernel.dk Signed-off-by: yangerkun yangerkun@huawei.com Reviewed-by: zhangyi (F) yi.zhang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- fs/io_uring.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/fs/io_uring.c b/fs/io_uring.c index 2639dcc4945e..b99c64bcfbdc 100644 --- a/fs/io_uring.c +++ b/fs/io_uring.c @@ -1987,7 +1987,7 @@ static void io_complete_rw_iopoll(struct kiocb *kiocb, long res, long res2) if (kiocb->ki_flags & IOCB_WRITE) kiocb_end_write(req);
- if (res != req->result) + if (res != -EAGAIN && res != req->result) req_set_fail_links(req); req->result = res; if (res != -EAGAIN)
From: Xiaoguang Wang xiaoguang.wang@linux.alibaba.com
mainline inclusion from mainline-5.8-rc2 commit bbde017a32b32d2fa8d5fddca25fade20132abf8 category: feature bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27 CVE: NA ---------------------------
In io_complete_rw_iopoll(), the stores to an io_kiocb's result and iopoll_completed are two independent store operations. To ensure that once iopoll_completed is true, req->result is also visible to the cpu executing io_do_iopoll(), a proper memory barrier must be used.
And in io_do_iopoll(), we check whether req->result is -EAGAIN; if it is, this io request needs to be issued again using io-wq. In order to issue just a single smp_rmb() on the completion side, move the re-submit work to io_iopoll_complete().
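[editor's note: the pairing is the classic publish/consume pattern; a self-contained C11 userspace analogue, not the kernel code, with explicit fences standing in for smp_wmb()/smp_rmb()]

#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>

static int result;            /* plain data, like req->result */
static atomic_int completed;  /* like req->iopoll_completed */

static void *completion_side(void *arg)
{
    (void)arg;
    result = -11;                                  /* store the result... */
    atomic_thread_fence(memory_order_release);     /* smp_wmb() analogue */
    atomic_store_explicit(&completed, 1, memory_order_relaxed);
    return NULL;
}

int main(void)
{
    pthread_t t;

    pthread_create(&t, NULL, completion_side, NULL);

    while (!atomic_load_explicit(&completed, memory_order_relaxed))
        ;                                          /* poll the flag */
    atomic_thread_fence(memory_order_acquire);     /* smp_rmb() analogue */
    printf("result=%d\n", result);                 /* guaranteed to be -11 */

    pthread_join(t, NULL);
    return 0;
}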
Cc: stable@vger.kernel.org Signed-off-by: Xiaoguang Wang xiaoguang.wang@linux.alibaba.com [axboe: don't set ->iopoll_completed for -EAGAIN retry] Signed-off-by: Jens Axboe axboe@kernel.dk Signed-off-by: yangerkun yangerkun@huawei.com Reviewed-by: zhangyi (F) yi.zhang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- fs/io_uring.c | 53 ++++++++++++++++++++++++++++----------------------- 1 file changed, 29 insertions(+), 24 deletions(-)
diff --git a/fs/io_uring.c b/fs/io_uring.c index b99c64bcfbdc..9a27b2224f30 100644 --- a/fs/io_uring.c +++ b/fs/io_uring.c @@ -1741,6 +1741,18 @@ static int io_put_kbuf(struct io_kiocb *req) return cflags; }
+static void io_iopoll_queue(struct list_head *again) +{ + struct io_kiocb *req; + + do { + req = list_first_entry(again, struct io_kiocb, list); + list_del(&req->list); + refcount_inc(&req->refs); + io_queue_async_work(req); + } while (!list_empty(again)); +} + /* * Find and free completed poll iocbs */ @@ -1749,12 +1761,21 @@ static void io_iopoll_complete(struct io_ring_ctx *ctx, unsigned int *nr_events, { struct req_batch rb; struct io_kiocb *req; + LIST_HEAD(again); + + /* order with ->result store in io_complete_rw_iopoll() */ + smp_rmb();
rb.to_free = rb.need_iter = 0; while (!list_empty(done)) { int cflags = 0;
req = list_first_entry(done, struct io_kiocb, list); + if (READ_ONCE(req->result) == -EAGAIN) { + req->iopoll_completed = 0; + list_move_tail(&req->list, &again); + continue; + } list_del(&req->list);
if (req->flags & REQ_F_BUFFER_SELECTED) @@ -1772,18 +1793,9 @@ static void io_iopoll_complete(struct io_ring_ctx *ctx, unsigned int *nr_events, if (ctx->flags & IORING_SETUP_SQPOLL) io_cqring_ev_posted(ctx); io_free_req_many(ctx, &rb); -} - -static void io_iopoll_queue(struct list_head *again) -{ - struct io_kiocb *req;
- do { - req = list_first_entry(again, struct io_kiocb, list); - list_del(&req->list); - refcount_inc(&req->refs); - io_queue_async_work(req); - } while (!list_empty(again)); + if (!list_empty(&again)) + io_iopoll_queue(&again); }
static int io_do_iopoll(struct io_ring_ctx *ctx, unsigned int *nr_events, @@ -1791,7 +1803,6 @@ static int io_do_iopoll(struct io_ring_ctx *ctx, unsigned int *nr_events, { struct io_kiocb *req, *tmp; LIST_HEAD(done); - LIST_HEAD(again); bool spin; int ret;
@@ -1817,13 +1828,6 @@ static int io_do_iopoll(struct io_ring_ctx *ctx, unsigned int *nr_events, if (!list_empty(&done)) break;
- if (req->result == -EAGAIN) { - list_move_tail(&req->list, &again); - continue; - } - if (!list_empty(&again)) - break; - ret = kiocb->ki_filp->f_op->iopoll(kiocb, spin); if (ret < 0) break; @@ -1836,9 +1840,6 @@ static int io_do_iopoll(struct io_ring_ctx *ctx, unsigned int *nr_events, if (!list_empty(&done)) io_iopoll_complete(ctx, nr_events, &done);
- if (!list_empty(&again)) - io_iopoll_queue(&again); - return ret; }
@@ -1989,9 +1990,13 @@ static void io_complete_rw_iopoll(struct kiocb *kiocb, long res, long res2)
if (res != -EAGAIN && res != req->result) req_set_fail_links(req); - req->result = res; - if (res != -EAGAIN) + + WRITE_ONCE(req->result, res); + /* order with io_poll_complete() checking ->result */ + if (res != -EAGAIN) { + smp_wmb(); WRITE_ONCE(req->iopoll_completed, 1); + } }
/*
From: Jens Axboe axboe@kernel.dk
mainline inclusion from mainline-5.8-rc2 commit 9d8426a09195e2dcf2aa249de2aaadd792d491c7 category: feature bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27 CVE: NA ---------------------------
If we're unlucky with timing, we could be running task_work after having dropped the memory context in the sq thread. Since dropping the context requires a runnable task state, we cannot reliably drop it as part of our check-for-work loop in io_sq_thread(). Instead, abstract out the mm acquire for the sq thread into a helper, and call it from the async task work handler.
Cc: stable@vger.kernel.org # v5.7 Signed-off-by: Jens Axboe axboe@kernel.dk
Conflicts: fs/io_uring.c [use *un*use_mm instead of kthread_*un*use_mm for commit f5678e7f2ac3 ("kernel: better document the use_mm/unuse_mm API contract") not merged] Signed-off-by: yangerkun yangerkun@huawei.com Reviewed-by: zhangyi (F) yi.zhang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- fs/io_uring.c | 44 +++++++++++++++++++++++++++++--------------- 1 file changed, 29 insertions(+), 15 deletions(-)
diff --git a/fs/io_uring.c b/fs/io_uring.c index 9a27b2224f30..e2a2191c9f53 100644 --- a/fs/io_uring.c +++ b/fs/io_uring.c @@ -4221,6 +4221,28 @@ static void io_async_queue_proc(struct file *file, struct wait_queue_head *head, __io_queue_proc(&pt->req->apoll->poll, pt, head); }
+static void io_sq_thread_drop_mm(struct io_ring_ctx *ctx) +{ + struct mm_struct *mm = current->mm; + + if (mm) { + unuse_mm(mm); + mmput(mm); + } +} + +static int io_sq_thread_acquire_mm(struct io_ring_ctx *ctx, + struct io_kiocb *req) +{ + if (io_op_defs[req->opcode].needs_mm && !current->mm) { + if (unlikely(!mmget_not_zero(ctx->sqo_mm))) + return -EFAULT; + use_mm(ctx->sqo_mm); + } + + return 0; +} + static void io_async_task_func(struct callback_head *cb) { struct io_kiocb *req = container_of(cb, struct io_kiocb, task_work); @@ -4255,11 +4277,16 @@ static void io_async_task_func(struct callback_head *cb)
if (!canceled) { __set_current_state(TASK_RUNNING); + if (io_sq_thread_acquire_mm(ctx, req)) { + io_cqring_add_event(req, -EFAULT); + goto end_req; + } mutex_lock(&ctx->uring_lock); __io_queue_sqe(req, NULL); mutex_unlock(&ctx->uring_lock); } else { io_cqring_ev_posted(ctx); +end_req: req_set_fail_links(req); io_double_put_req(req); } @@ -5794,11 +5821,8 @@ static int io_init_req(struct io_ring_ctx *ctx, struct io_kiocb *req, if (unlikely(req->opcode >= IORING_OP_LAST)) return -EINVAL;
- if (io_op_defs[req->opcode].needs_mm && !current->mm) { - if (unlikely(!mmget_not_zero(ctx->sqo_mm))) - return -EFAULT; - use_mm(ctx->sqo_mm); - } + if (unlikely(io_sq_thread_acquire_mm(ctx, req))) + return -EFAULT;
sqe_flags = READ_ONCE(sqe->flags); /* enforce forwards compatibility on users */ @@ -5907,16 +5931,6 @@ static int io_submit_sqes(struct io_ring_ctx *ctx, unsigned int nr, return submitted; }
-static inline void io_sq_thread_drop_mm(struct io_ring_ctx *ctx) -{ - struct mm_struct *mm = current->mm; - - if (mm) { - unuse_mm(mm); - mmput(mm); - } -} - static int io_sq_thread(void *data) { struct io_ring_ctx *ctx = data;
From: Jens Axboe axboe@kernel.dk
mainline inclusion from mainline-5.8-rc2 commit 56952e91acc93ed624fe9da840900defb75f1323 category: feature bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27 CVE: NA ---------------------------
If we're doing polled IO and end up having requests being submitted async, then completions can come in while we're waiting for refs to drop. We need to reap these manually, as nobody else will be looking for them.
Break the wait into 1/20th-of-a-second waits, and check for completed poll requests if we time out. Otherwise completed poll requests can sit in ctx->poll_list waiting to be reaped while we are merely waiting for them.
Cc: stable@vger.kernel.org Signed-off-by: Jens Axboe axboe@kernel.dk Signed-off-by: yangerkun yangerkun@huawei.com Reviewed-by: zhangyi (F) yi.zhang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- fs/io_uring.c | 12 +++++++++++- 1 file changed, 11 insertions(+), 1 deletion(-)
diff --git a/fs/io_uring.c b/fs/io_uring.c index e2a2191c9f53..41db322af299 100644 --- a/fs/io_uring.c +++ b/fs/io_uring.c @@ -7321,7 +7321,17 @@ static void io_ring_exit_work(struct work_struct *work) if (ctx->rings) io_cqring_overflow_flush(ctx, true);
- wait_for_completion(&ctx->ref_comp); + /* + * If we're doing polled IO and end up having requests being + * submitted async (out-of-line), then completions can come in while + * we're waiting for refs to drop. We need to reap these manually, + * as nobody else will be looking for them. + */ + while (!wait_for_completion_timeout(&ctx->ref_comp, HZ/20)) { + io_iopoll_reap_events(ctx); + if (ctx->rings) + io_cqring_overflow_flush(ctx, true); + } io_ring_ctx_free(ctx); }
From: Xiaoguang Wang xiaoguang.wang@linux.alibaba.com
mainline inclusion from mainline-5.8-rc2 commit 6f2cc1664db20676069cff27a461ccc97dbfd114 category: feature bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27 CVE: NA ---------------------------
In io_read() or io_write(), when an io request is submitted successfully, it goes through the sequence below:
kfree(iovec);
req->flags &= ~REQ_F_NEED_CLEANUP;
return ret;
But clearing REQ_F_NEED_CLEANUP might be unsafe. The io request may already have been completed, and then io_complete_rw_iopoll() and io_complete_rw() will be called, both of which will also modify req->flags if needed. This causes a race condition, with concurrent non-atomic modification of req->flags.
To eliminate this race, in io_read() or io_write() we don't clear the REQ_F_NEED_CLEANUP flag when the io request is submitted successfully. If REQ_F_NEED_CLEANUP is still set, the iovec cleanup is left to __io_req_aux_free() instead.
Cc: stable@vger.kernel.org Signed-off-by: Xiaoguang Wang xiaoguang.wang@linux.alibaba.com Signed-off-by: Jens Axboe axboe@kernel.dk Signed-off-by: yangerkun yangerkun@huawei.com Reviewed-by: zhangyi (F) yi.zhang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- fs/io_uring.c | 8 ++++---- 1 file changed, 4 insertions(+), 4 deletions(-)
diff --git a/fs/io_uring.c b/fs/io_uring.c index 41db322af299..6856eec77aae 100644 --- a/fs/io_uring.c +++ b/fs/io_uring.c @@ -2669,8 +2669,8 @@ static int io_read(struct io_kiocb *req, bool force_nonblock) } } out_free: - kfree(iovec); - req->flags &= ~REQ_F_NEED_CLEANUP; + if (!(req->flags & REQ_F_NEED_CLEANUP)) + kfree(iovec); return ret; }
@@ -2792,8 +2792,8 @@ static int io_write(struct io_kiocb *req, bool force_nonblock) } } out_free: - req->flags &= ~REQ_F_NEED_CLEANUP; - kfree(iovec); + if (!(req->flags & REQ_F_NEED_CLEANUP)) + kfree(iovec); return ret; }
From: Jiufei Xue jiufei.xue@linux.alibaba.com
mainline inclusion from mainline-5.9-rc1 commit 5769a351b89cd4d82016f18fa5f6c4077403564d category: feature bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27 CVE: NA ---------------------------
Poll events should be 32 bits wide to cover EPOLLEXCLUSIVE.
Explicitly word-swap the poll32_events on big-endian to make sure the ABI is not changed. We call this feature IORING_FEAT_POLL_32BITS; applications that want to use EPOLLEXCLUSIVE should check the feature bit first.
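[editor's note: a consumer-side sketch of that feature check, assuming liburing and headers that define IORING_FEAT_POLL_32BITS; fd 0 is just an example]

#include <stdio.h>
#include <sys/epoll.h>
#include <liburing.h>

int main(void)
{
    struct io_uring_params p = { 0 };
    struct io_uring ring;

    if (io_uring_queue_init_params(8, &ring, &p) < 0)
        return 1;

    if (p.features & IORING_FEAT_POLL_32BITS) {
        struct io_uring_sqe *sqe = io_uring_get_sqe(&ring);

        /* 32-bit mask: EPOLLEXCLUSIVE does not fit the old u16 field */
        io_uring_prep_poll_add(sqe, 0, EPOLLIN | EPOLLEXCLUSIVE);
        io_uring_submit(&ring);
        puts("poll armed with EPOLLEXCLUSIVE");
    } else {
        puts("kernel too old for 32-bit poll events");
    }
    io_uring_queue_exit(&ring);
    return 0;
}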
Signed-off-by: Jiufei Xue jiufei.xue@linux.alibaba.com Signed-off-by: Jens Axboe axboe@kernel.dk Signed-off-by: yangerkun yangerkun@huawei.com Reviewed-by: zhangyi (F) yi.zhang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- fs/io_uring.c | 13 +++++++++---- include/uapi/linux/io_uring.h | 4 +++- tools/io_uring/liburing.h | 6 +++++- 3 files changed, 17 insertions(+), 6 deletions(-)
diff --git a/fs/io_uring.c b/fs/io_uring.c index 6856eec77aae..78f67c41efd5 100644 --- a/fs/io_uring.c +++ b/fs/io_uring.c @@ -4554,7 +4554,7 @@ static void io_poll_queue_proc(struct file *file, struct wait_queue_head *head, static int io_poll_add_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe) { struct io_poll_iocb *poll = &req->poll; - u16 events; + u32 events;
if (unlikely(req->ctx->flags & IORING_SETUP_IOPOLL)) return -EINVAL; @@ -4563,7 +4563,10 @@ static int io_poll_add_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe if (!poll->file) return -EBADF;
- events = READ_ONCE(sqe->poll_events); + events = READ_ONCE(sqe->poll32_events); +#ifdef __BIG_ENDIAN + events = swahw32(events); +#endif poll->events = demangle_poll(events) | EPOLLERR | EPOLLHUP;
io_get_req_task(req); @@ -7886,7 +7889,8 @@ static int io_uring_create(unsigned entries, struct io_uring_params *p,
p->features = IORING_FEAT_SINGLE_MMAP | IORING_FEAT_NODROP | IORING_FEAT_SUBMIT_STABLE | IORING_FEAT_RW_CUR_POS | - IORING_FEAT_CUR_PERSONALITY | IORING_FEAT_FAST_POLL; + IORING_FEAT_CUR_PERSONALITY | IORING_FEAT_FAST_POLL | + IORING_FEAT_POLL_32BITS;
if (copy_to_user(params, p, sizeof(*p))) { ret = -EFAULT; @@ -8175,7 +8179,8 @@ static int __init io_uring_init(void) BUILD_BUG_SQE_ELEM(28, /* compat */ int, rw_flags); BUILD_BUG_SQE_ELEM(28, /* compat */ __u32, rw_flags); BUILD_BUG_SQE_ELEM(28, __u32, fsync_flags); - BUILD_BUG_SQE_ELEM(28, __u16, poll_events); + BUILD_BUG_SQE_ELEM(28, /* compat */ __u16, poll_events); + BUILD_BUG_SQE_ELEM(28, __u32, poll32_events); BUILD_BUG_SQE_ELEM(28, __u32, sync_range_flags); BUILD_BUG_SQE_ELEM(28, __u32, msg_flags); BUILD_BUG_SQE_ELEM(28, __u32, timeout_flags); diff --git a/include/uapi/linux/io_uring.h b/include/uapi/linux/io_uring.h index 9afedee24e5b..83b790cf3c8d 100644 --- a/include/uapi/linux/io_uring.h +++ b/include/uapi/linux/io_uring.h @@ -31,7 +31,8 @@ struct io_uring_sqe { union { __kernel_rwf_t rw_flags; __u32 fsync_flags; - __u16 poll_events; + __u16 poll_events; /* compatibility */ + __u32 poll32_events; /* word-reversed for BE */ __u32 sync_range_flags; __u32 msg_flags; __u32 timeout_flags; @@ -247,6 +248,7 @@ struct io_uring_params { #define IORING_FEAT_RW_CUR_POS (1U << 3) #define IORING_FEAT_CUR_PERSONALITY (1U << 4) #define IORING_FEAT_FAST_POLL (1U << 5) +#define IORING_FEAT_POLL_32BITS (1U << 6)
/* * io_uring_register(2) opcodes and arguments diff --git a/tools/io_uring/liburing.h b/tools/io_uring/liburing.h index 5f305c86b892..28a837b6069d 100644 --- a/tools/io_uring/liburing.h +++ b/tools/io_uring/liburing.h @@ -10,6 +10,7 @@ extern "C" { #include <string.h> #include "../../include/uapi/linux/io_uring.h" #include <inttypes.h> +#include <linux/swab.h> #include "barrier.h"
/* @@ -145,11 +146,14 @@ static inline void io_uring_prep_write_fixed(struct io_uring_sqe *sqe, int fd, }
static inline void io_uring_prep_poll_add(struct io_uring_sqe *sqe, int fd, - short poll_mask) + unsigned poll_mask) { memset(sqe, 0, sizeof(*sqe)); sqe->opcode = IORING_OP_POLL_ADD; sqe->fd = fd; +#if __BYTE_ORDER == __BIG_ENDIAN + poll_mask = __swahw32(poll_mask); +#endif sqe->poll_events = poll_mask; }
From: Jiufei Xue jiufei.xue@linux.alibaba.com
mainline inclusion from mainline-5.9-rc1 commit a31eb4a2f1650fa578082ad9e9845487ecd90abe category: feature bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27 CVE: NA ---------------------------
Applications can pass this flag in to avoid the accept thundering-herd problem.
Signed-off-by: Jiufei Xue jiufei.xue@linux.alibaba.com Signed-off-by: Jens Axboe axboe@kernel.dk Signed-off-by: yangerkun yangerkun@huawei.com Reviewed-by: zhangyi (F) yi.zhang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- fs/io_uring.c | 9 +++++++-- 1 file changed, 7 insertions(+), 2 deletions(-)
diff --git a/fs/io_uring.c b/fs/io_uring.c index 78f67c41efd5..f7ffea95c907 100644 --- a/fs/io_uring.c +++ b/fs/io_uring.c @@ -4210,7 +4210,11 @@ static void __io_queue_proc(struct io_poll_iocb *poll, struct io_poll_table *pt,
pt->error = 0; poll->head = head; - add_wait_queue(head, &poll->wait); + + if (poll->events & EPOLLEXCLUSIVE) + add_wait_queue_exclusive(head, &poll->wait); + else + add_wait_queue(head, &poll->wait); }
static void io_async_queue_proc(struct file *file, struct wait_queue_head *head, @@ -4567,7 +4571,8 @@ static int io_poll_add_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe #ifdef __BIG_ENDIAN events = swahw32(events); #endif - poll->events = demangle_poll(events) | EPOLLERR | EPOLLHUP; + poll->events = demangle_poll(events) | EPOLLERR | EPOLLHUP | + (events & EPOLLEXCLUSIVE);
io_get_req_task(req); return 0;
From: Bijan Mottahedeh bijan.mottahedeh@oracle.com
mainline inclusion from mainline-5.9-rc1 commit a087e2b519929152fdde8299457e32d5a8994a7c category: feature bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27 CVE: NA ---------------------------
Facilitate separation of locked memory usage reporting vs. limiting for upcoming patches. No functional changes.
Signed-off-by: Bijan Mottahedeh bijan.mottahedeh@oracle.com [axboe: kill unnecessary () around return in io_account_mem()] Signed-off-by: Jens Axboe axboe@kernel.dk Conflicts: fs/io_uring.c [commit f1f6a7dd9b("mm, tree-wide: rename put_user_page*() to unpin_user_page*()) is not merged] Signed-off-by: yangerkun yangerkun@huawei.com Reviewed-by: zhangyi (F) yi.zhang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- fs/io_uring.c | 48 ++++++++++++++++++++++++++++-------------------- 1 file changed, 28 insertions(+), 20 deletions(-)
diff --git a/fs/io_uring.c b/fs/io_uring.c index f7ffea95c907..e4585dd74cb8 100644 --- a/fs/io_uring.c +++ b/fs/io_uring.c @@ -6925,12 +6925,14 @@ static int io_sq_offload_start(struct io_ring_ctx *ctx, return ret; }
-static void io_unaccount_mem(struct user_struct *user, unsigned long nr_pages) +static inline void __io_unaccount_mem(struct user_struct *user, + unsigned long nr_pages) { atomic_long_sub(nr_pages, &user->locked_vm); }
-static int io_account_mem(struct user_struct *user, unsigned long nr_pages) +static inline int __io_account_mem(struct user_struct *user, + unsigned long nr_pages) { unsigned long page_limit, cur_pages, new_pages;
@@ -6948,6 +6950,20 @@ static int io_account_mem(struct user_struct *user, unsigned long nr_pages) return 0; }
+static void io_unaccount_mem(struct io_ring_ctx *ctx, unsigned long nr_pages) +{ + if (ctx->account_mem) + __io_unaccount_mem(ctx->user, nr_pages); +} + +static int io_account_mem(struct io_ring_ctx *ctx, unsigned long nr_pages) +{ + if (ctx->account_mem) + return __io_account_mem(ctx->user, nr_pages); + + return 0; +} + static void io_mem_free(void *ptr) { struct page *page; @@ -7022,8 +7038,7 @@ static int io_sqe_buffer_unregister(struct io_ring_ctx *ctx) for (j = 0; j < imu->nr_bvecs; j++) put_page(imu->bvec[j].bv_page);
- if (ctx->account_mem) - io_unaccount_mem(ctx->user, imu->nr_bvecs); + io_unaccount_mem(ctx, imu->nr_bvecs); kvfree(imu->bvec); imu->nr_bvecs = 0; } @@ -7106,11 +7121,9 @@ static int io_sqe_buffer_register(struct io_ring_ctx *ctx, void __user *arg, start = ubuf >> PAGE_SHIFT; nr_pages = end - start;
- if (ctx->account_mem) { - ret = io_account_mem(ctx->user, nr_pages); - if (ret) - goto err; - } + ret = io_account_mem(ctx, nr_pages); + if (ret) + goto err;
ret = 0; if (!pages || nr_pages > got_pages) { @@ -7123,8 +7136,7 @@ static int io_sqe_buffer_register(struct io_ring_ctx *ctx, void __user *arg, GFP_KERNEL); if (!pages || !vmas) { ret = -ENOMEM; - if (ctx->account_mem) - io_unaccount_mem(ctx->user, nr_pages); + io_unaccount_mem(ctx, nr_pages); goto err; } got_pages = nr_pages; @@ -7134,8 +7146,7 @@ static int io_sqe_buffer_register(struct io_ring_ctx *ctx, void __user *arg, GFP_KERNEL); ret = -ENOMEM; if (!imu->bvec) { - if (ctx->account_mem) - io_unaccount_mem(ctx->user, nr_pages); + io_unaccount_mem(ctx, nr_pages); goto err; }
@@ -7167,8 +7178,7 @@ static int io_sqe_buffer_register(struct io_ring_ctx *ctx, void __user *arg, for (j = 0; j < pret; j++) put_page(pages[j]); } - if (ctx->account_mem) - io_unaccount_mem(ctx->user, nr_pages); + io_unaccount_mem(ctx, nr_pages); kvfree(imu->bvec); goto err; } @@ -7273,9 +7283,7 @@ static void io_ring_ctx_free(struct io_ring_ctx *ctx) io_mem_free(ctx->sq_sqes);
percpu_ref_exit(&ctx->refs); - if (ctx->account_mem) - io_unaccount_mem(ctx->user, - ring_pages(ctx->sq_entries, ctx->cq_entries)); + io_unaccount_mem(ctx, ring_pages(ctx->sq_entries, ctx->cq_entries)); free_uid(ctx->user); put_cred(ctx->creds); kfree(ctx->cancel_hash); @@ -7845,7 +7853,7 @@ static int io_uring_create(unsigned entries, struct io_uring_params *p, account_mem = !capable(CAP_IPC_LOCK);
if (account_mem) { - ret = io_account_mem(user, + ret = __io_account_mem(user, ring_pages(p->sq_entries, p->cq_entries)); if (ret) { free_uid(user); @@ -7856,7 +7864,7 @@ static int io_uring_create(unsigned entries, struct io_uring_params *p, ctx = io_ring_ctx_alloc(p); if (!ctx) { if (account_mem) - io_unaccount_mem(user, ring_pages(p->sq_entries, + __io_unaccount_mem(user, ring_pages(p->sq_entries, p->cq_entries)); free_uid(user); return -ENOMEM;
From: Bijan Mottahedeh bijan.mottahedeh@oracle.com
mainline inclusion from mainline-5.9-rc1 commit aad5d8da1b301fe399d65f2dcb84df2ec60caaa3 category: feature bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27 CVE: NA ---------------------------
Rename account_mem to limit_mem to clarify its purpose.
Signed-off-by: Bijan Mottahedeh bijan.mottahedeh@oracle.com Signed-off-by: Jens Axboe axboe@kernel.dk Signed-off-by: yangerkun yangerkun@huawei.com Reviewed-by: zhangyi (F) yi.zhang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- fs/io_uring.c | 16 ++++++++-------- 1 file changed, 8 insertions(+), 8 deletions(-)
diff --git a/fs/io_uring.c b/fs/io_uring.c index e4585dd74cb8..6e7d3d69010e 100644 --- a/fs/io_uring.c +++ b/fs/io_uring.c @@ -227,7 +227,7 @@ struct io_ring_ctx { struct { unsigned int flags; unsigned int compat: 1; - unsigned int account_mem: 1; + unsigned int limit_mem: 1; unsigned int cq_overflow_flushed: 1; unsigned int drain_next: 1; unsigned int eventfd_async: 1; @@ -6952,13 +6952,13 @@ static inline int __io_account_mem(struct user_struct *user,
static void io_unaccount_mem(struct io_ring_ctx *ctx, unsigned long nr_pages) { - if (ctx->account_mem) + if (ctx->limit_mem) __io_unaccount_mem(ctx->user, nr_pages); }
static int io_account_mem(struct io_ring_ctx *ctx, unsigned long nr_pages) { - if (ctx->account_mem) + if (ctx->limit_mem) return __io_account_mem(ctx->user, nr_pages);
return 0; @@ -7811,7 +7811,7 @@ static int io_uring_create(unsigned entries, struct io_uring_params *p, { struct user_struct *user = NULL; struct io_ring_ctx *ctx; - bool account_mem; + bool limit_mem; int ret;
if (!entries) @@ -7850,9 +7850,9 @@ static int io_uring_create(unsigned entries, struct io_uring_params *p, }
user = get_uid(current_user()); - account_mem = !capable(CAP_IPC_LOCK); + limit_mem = !capable(CAP_IPC_LOCK);
- if (account_mem) { + if (limit_mem) { ret = __io_account_mem(user, ring_pages(p->sq_entries, p->cq_entries)); if (ret) { @@ -7863,14 +7863,14 @@ static int io_uring_create(unsigned entries, struct io_uring_params *p,
ctx = io_ring_ctx_alloc(p); if (!ctx) { - if (account_mem) + if (limit_mem) __io_unaccount_mem(user, ring_pages(p->sq_entries, p->cq_entries)); free_uid(user); return -ENOMEM; } ctx->compat = in_compat_syscall(); - ctx->account_mem = account_mem; + ctx->limit_mem = limit_mem; ctx->user = user; ctx->creds = get_current_cred();
From: Pavel Begunkov asml.silence@gmail.com
mainline inclusion from mainline-5.8 commit d5e16d8e23825304c6a9945116cc6b6f8d51f28c category: feature bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27 CVE: NA ---------------------------
req->work might already be initialised by the time it gets into __io_arm_poll_handler(), which will corrupt it by using fields that are in a union with req->work. Luckily, the only side effect is a missing put_creds(). Clean req->work before going there.
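[editor's note: a standalone toy, with a hypothetical struct rather than io_kiocb's real layout, showing the corruption mechanism: writing one union member clobbers whatever another member had stored there, here losing a credentials pointer the way the missing put_creds() arises]

#include <stdio.h>
#include <string.h>

struct req {
    union {
        struct { void *creds; int flags; } work;   /* like req->work */
        struct { void *head; void *next; } hash;   /* like hash_node */
    };
};

int main(void)
{
    struct req r;
    int dummy;

    r.work.creds = &dummy;               /* ->work initialised first */
    memset(&r.hash, 0, sizeof(r.hash));  /* poll path reuses the union */
    printf("creds now %p - that reference can never be put\n",
           r.work.creds);
    return 0;
}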
Suggested-by: Jens Axboe axboe@kernel.dk Signed-off-by: Pavel Begunkov asml.silence@gmail.com Signed-off-by: Jens Axboe axboe@kernel.dk Signed-off-by: yangerkun yangerkun@huawei.com Reviewed-by: zhangyi (F) yi.zhang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- fs/io_uring.c | 4 ++++ 1 file changed, 4 insertions(+)
diff --git a/fs/io_uring.c b/fs/io_uring.c index ef4bde014013..3734323fcfa9 100644 --- a/fs/io_uring.c +++ b/fs/io_uring.c @@ -4626,6 +4626,10 @@ static int io_poll_add(struct io_kiocb *req) struct io_poll_table ipt; __poll_t mask;
+ /* ->work is in union with hash_node and others */ + io_req_work_drop_env(req); + req->flags &= ~REQ_F_WORK_INITIALIZED; + INIT_HLIST_NODE(&req->hash_node); INIT_LIST_HEAD(&req->list); ipt.pt._qproc = io_poll_queue_proc;
From: Pavel Begunkov asml.silence@gmail.com
mainline inclusion from mainline-5.8 commit 4ae6dbd683860b9edc254ea8acf5e04b5ae242e5 category: feature bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27 CVE: NA ---------------------------
io_fail_links() doesn't consider REQ_F_COMP_LOCKED, leading to a nested spin_lock(completion_lock) and a lockup.
[ 197.680409] rcu: INFO: rcu_preempt detected expedited stalls on CPUs/tasks: { 6-... } 18239 jiffies s: 1421 root: 0x40/.
[ 197.680411] rcu: blocking rcu_node structures:
[ 197.680412] Task dump for CPU 6:
[ 197.680413] link-timeout R running task 0 1669 1 0x8000008a
[ 197.680414] Call Trace:
[ 197.680420]  ? io_req_find_next+0xa0/0x200
[ 197.680422]  ? io_put_req_find_next+0x2a/0x50
[ 197.680423]  ? io_poll_task_func+0xcf/0x140
[ 197.680425]  ? task_work_run+0x67/0xa0
[ 197.680426]  ? do_exit+0x35d/0xb70
[ 197.680429]  ? syscall_trace_enter+0x187/0x2c0
[ 197.680430]  ? do_group_exit+0x43/0xa0
[ 197.680448]  ? __x64_sys_exit_group+0x18/0x20
[ 197.680450]  ? do_syscall_64+0x52/0xa0
[ 197.680452]  ? entry_SYSCALL_64_after_hwframe+0x44/0xa9
Signed-off-by: Pavel Begunkov asml.silence@gmail.com Signed-off-by: Jens Axboe axboe@kernel.dk Signed-off-by: yangerkun yangerkun@huawei.com Reviewed-by: zhangyi (F) yi.zhang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- fs/io_uring.c | 3 +-- 1 file changed, 1 insertion(+), 2 deletions(-)
diff --git a/fs/io_uring.c b/fs/io_uring.c index 3734323fcfa9..42d399fc01dc 100644 --- a/fs/io_uring.c +++ b/fs/io_uring.c @@ -4159,10 +4159,9 @@ static void io_poll_task_handler(struct io_kiocb *req, struct io_kiocb **nxt)
hash_del(&req->hash_node); io_poll_complete(req, req->result, 0); - req->flags |= REQ_F_COMP_LOCKED; - io_put_req_find_next(req, nxt); spin_unlock_irq(&ctx->completion_lock);
+ io_put_req_find_next(req, nxt); io_cqring_ev_posted(ctx); }
From: Dmitry Vyukov dvyukov@google.com
mainline inclusion from mainline-5.9-rc1 commit b36200f543ff07a1cb346aa582349141df2c8068 category: feature bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27 CVE: NA ---------------------------
rings_size() sets sq_offset to the total size of the rings (the returned value, which is used for the memory allocation). This is wrong: the sq array should be located within the rings, not after them. Set sq_offset to where it should be.
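[editor's note: a standalone sketch of the corrected computation, with hypothetical sizes rather than the kernel function: sq_offset must be captured before the sq array size is added to the running total]

#include <stdint.h>
#include <stdio.h>

static size_t rings_size(size_t ring_hdr, size_t cq_bytes,
                         unsigned sq_entries, size_t *sq_offset)
{
    size_t off = ring_hdr + cq_bytes;   /* rings + cq array come first */

    if (sq_offset)
        *sq_offset = off;               /* sq array starts here... */

    return off + sq_entries * sizeof(uint32_t);  /* ...inside the region */
}

int main(void)
{
    size_t sq_off;
    size_t total = rings_size(64, 16 * 64, 16, &sq_off);

    printf("total=%zu, sq array at offset %zu\n", total, sq_off);
    return 0;
}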
Fixes: 75b28affdd6a ("io_uring: allocate the two rings together") Signed-off-by: Dmitry Vyukov dvyukov@google.com Acked-by: Hristo Venev hristo@venev.name Cc: io-uring@vger.kernel.org Signed-off-by: Jens Axboe axboe@kernel.dk Signed-off-by: yangerkun yangerkun@huawei.com Reviewed-by: zhangyi (F) yi.zhang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- fs/io_uring.c | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-)
diff --git a/fs/io_uring.c b/fs/io_uring.c index 42d399fc01dc..adf67d940f55 100644 --- a/fs/io_uring.c +++ b/fs/io_uring.c @@ -7082,6 +7082,9 @@ static unsigned long rings_size(unsigned sq_entries, unsigned cq_entries, return SIZE_MAX; #endif
+ if (sq_offset) + *sq_offset = off; + sq_array_size = array_size(sizeof(u32), sq_entries); if (sq_array_size == SIZE_MAX) return SIZE_MAX; @@ -7089,9 +7092,6 @@ static unsigned long rings_size(unsigned sq_entries, unsigned cq_entries, if (check_add_overflow(off, sq_array_size, &off)) return SIZE_MAX;
- if (sq_offset) - *sq_offset = off; - return off; }
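To see the fix concretely, here is a minimal sketch of the corrected layout math, with the shared ring header and the overflow checks omitted and a made-up CQE type; the only point is that *sq_offset must be captured before the sq array size is added to the running total:

#include <stdint.h>
#include <stdio.h>

/* stand-in for the real CQE; the exact layout doesn't matter here */
struct cqe { uint64_t user_data; int32_t res; uint32_t flags; };

static size_t rings_size(unsigned sq_entries, unsigned cq_entries,
                         size_t *sq_offset)
{
        size_t off = sizeof(struct cqe) * cq_entries; /* header omitted */

        if (sq_offset)
                *sq_offset = off; /* sq array begins here, inside the rings */

        return off + sizeof(uint32_t) * sq_entries; /* total allocation */
}

int main(void)
{
        size_t sq_off;
        size_t total = rings_size(128, 256, &sq_off);

        printf("total=%zu, sq array at offset %zu\n", total, sq_off);
        return 0;
}

Before the fix, the capture happened after the addition, so the sq array was placed past the end of the allocation.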
From: Pavel Begunkov asml.silence@gmail.com
mainline inclusion from mainline-5.9-rc1 commit 270a5940700bb6cf9abf36ea10cf1fa0d453aa7a category: feature bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27 CVE: NA ---------------------------
Every second field in send/recv is called msg; make it a bit more understandable by renaming ->msg, which is a user-provided ptr, to ->umsg.
Signed-off-by: Pavel Begunkov asml.silence@gmail.com Signed-off-by: Jens Axboe axboe@kernel.dk Signed-off-by: yangerkun yangerkun@huawei.com Reviewed-by: zhangyi (F) yi.zhang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- fs/io_uring.c | 18 +++++++++--------- 1 file changed, 9 insertions(+), 9 deletions(-)
diff --git a/fs/io_uring.c b/fs/io_uring.c index adf67d940f55..c4305ed8d2c8 100644 --- a/fs/io_uring.c +++ b/fs/io_uring.c @@ -414,7 +414,7 @@ struct io_connect { struct io_sr_msg { struct file *file; union { - struct user_msghdr __user *msg; + struct user_msghdr __user *umsg; void __user *buf; }; int msg_flags; @@ -3501,7 +3501,7 @@ static int io_sendmsg_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe) return -EINVAL;
sr->msg_flags = READ_ONCE(sqe->msg_flags); - sr->msg = u64_to_user_ptr(READ_ONCE(sqe->addr)); + sr->umsg = u64_to_user_ptr(READ_ONCE(sqe->addr)); sr->len = READ_ONCE(sqe->len);
#ifdef CONFIG_COMPAT @@ -3517,7 +3517,7 @@ static int io_sendmsg_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe)
io->msg.msg.msg_name = &io->msg.addr; io->msg.iov = io->msg.fast_iov; - ret = sendmsg_copy_msghdr(&io->msg.msg, sr->msg, sr->msg_flags, + ret = sendmsg_copy_msghdr(&io->msg.msg, sr->umsg, sr->msg_flags, &io->msg.iov); if (!ret) req->flags |= REQ_F_NEED_CLEANUP; @@ -3549,7 +3549,7 @@ static int io_sendmsg(struct io_kiocb *req, bool force_nonblock) kmsg->msg.msg_name = &io.msg.addr;
io.msg.iov = io.msg.fast_iov; - ret = sendmsg_copy_msghdr(&io.msg.msg, sr->msg, + ret = sendmsg_copy_msghdr(&io.msg.msg, sr->umsg, sr->msg_flags, &io.msg.iov); if (ret) return ret; @@ -3628,8 +3628,8 @@ static int __io_recvmsg_copy_hdr(struct io_kiocb *req, struct io_async_ctx *io) size_t iov_len; int ret;
- ret = __copy_msghdr_from_user(&io->msg.msg, sr->msg, &io->msg.uaddr, - &uiov, &iov_len); + ret = __copy_msghdr_from_user(&io->msg.msg, sr->umsg, + &io->msg.uaddr, &uiov, &iov_len); if (ret) return ret;
@@ -3663,7 +3663,7 @@ static int __io_compat_recvmsg_copy_hdr(struct io_kiocb *req, compat_size_t len; int ret;
- msg_compat = (struct compat_msghdr __user *) sr->msg; + msg_compat = (struct compat_msghdr __user *) sr->umsg; ret = __get_compat_msghdr(&io->msg.msg, msg_compat, &io->msg.uaddr, &ptr, &len); if (ret) @@ -3740,7 +3740,7 @@ static int io_recvmsg_prep(struct io_kiocb *req, return -EINVAL;
sr->msg_flags = READ_ONCE(sqe->msg_flags); - sr->msg = u64_to_user_ptr(READ_ONCE(sqe->addr)); + sr->umsg = u64_to_user_ptr(READ_ONCE(sqe->addr)); sr->len = READ_ONCE(sqe->len); sr->bgid = READ_ONCE(sqe->buf_group);
@@ -3804,7 +3804,7 @@ static int io_recvmsg(struct io_kiocb *req, bool force_nonblock) else if (force_nonblock) flags |= MSG_DONTWAIT;
- ret = __sys_recvmsg_sock(sock, &kmsg->msg, req->sr_msg.msg, + ret = __sys_recvmsg_sock(sock, &kmsg->msg, req->sr_msg.umsg, kmsg->uaddr, flags); if (force_nonblock && ret == -EAGAIN) { ret = io_setup_async_msg(req, kmsg);
From: Pavel Begunkov asml.silence@gmail.com
mainline inclusion from mainline-5.9-rc1 commit 1400e69705baf98d1c9cb73b592a3a68aab1d852 category: feature bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27 CVE: NA ---------------------------
send/recv msghdr initialisation works with struct io_async_msghdr, but pulls in the whole struct io_async_ctx for no reason. That complicates it with composite accessing, e.g. io->msg.
Use and pass the most specific type, which is struct io_async_msghdr. It is the largest field in union io_async_ctx and doesn't save stack space, but looks clearer. Most of the changes are replacing "io->msg." with "iomsg->".
Signed-off-by: Pavel Begunkov asml.silence@gmail.com Signed-off-by: Jens Axboe axboe@kernel.dk Signed-off-by: yangerkun yangerkun@huawei.com Reviewed-by: zhangyi (F) yi.zhang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- fs/io_uring.c | 63 +++++++++++++++++++++++++-------------------------- 1 file changed, 31 insertions(+), 32 deletions(-)
diff --git a/fs/io_uring.c b/fs/io_uring.c index c4305ed8d2c8..03e4fef26567 100644 --- a/fs/io_uring.c +++ b/fs/io_uring.c @@ -3532,7 +3532,7 @@ static int io_sendmsg(struct io_kiocb *req, bool force_nonblock)
sock = sock_from_file(req->file, &ret); if (sock) { - struct io_async_ctx io; + struct io_async_msghdr iomsg; unsigned flags;
if (req->io) { @@ -3545,14 +3545,13 @@ static int io_sendmsg(struct io_kiocb *req, bool force_nonblock) } else { struct io_sr_msg *sr = &req->sr_msg;
- kmsg = &io.msg; - kmsg->msg.msg_name = &io.msg.addr; - - io.msg.iov = io.msg.fast_iov; - ret = sendmsg_copy_msghdr(&io.msg.msg, sr->umsg, - sr->msg_flags, &io.msg.iov); + iomsg.msg.msg_name = &iomsg.addr; + iomsg.iov = iomsg.fast_iov; + ret = sendmsg_copy_msghdr(&iomsg.msg, sr->umsg, + sr->msg_flags, &iomsg.iov); if (ret) return ret; + kmsg = &iomsg; }
flags = req->sr_msg.msg_flags; @@ -3621,30 +3620,31 @@ static int io_send(struct io_kiocb *req, bool force_nonblock) return 0; }
-static int __io_recvmsg_copy_hdr(struct io_kiocb *req, struct io_async_ctx *io) +static int __io_recvmsg_copy_hdr(struct io_kiocb *req, + struct io_async_msghdr *iomsg) { struct io_sr_msg *sr = &req->sr_msg; struct iovec __user *uiov; size_t iov_len; int ret;
- ret = __copy_msghdr_from_user(&io->msg.msg, sr->umsg, - &io->msg.uaddr, &uiov, &iov_len); + ret = __copy_msghdr_from_user(&iomsg->msg, sr->umsg, + &iomsg->uaddr, &uiov, &iov_len); if (ret) return ret;
if (req->flags & REQ_F_BUFFER_SELECT) { if (iov_len > 1) return -EINVAL; - if (copy_from_user(io->msg.iov, uiov, sizeof(*uiov))) + if (copy_from_user(iomsg->iov, uiov, sizeof(*uiov))) return -EFAULT; - sr->len = io->msg.iov[0].iov_len; - iov_iter_init(&io->msg.msg.msg_iter, READ, io->msg.iov, 1, + sr->len = iomsg->iov[0].iov_len; + iov_iter_init(&iomsg->msg.msg_iter, READ, iomsg->iov, 1, sr->len); - io->msg.iov = NULL; + iomsg->iov = NULL; } else { ret = import_iovec(READ, uiov, iov_len, UIO_FASTIOV, - &io->msg.iov, &io->msg.msg.msg_iter); + &iomsg->iov, &iomsg->msg.msg_iter); if (ret > 0) ret = 0; } @@ -3654,7 +3654,7 @@ static int __io_recvmsg_copy_hdr(struct io_kiocb *req, struct io_async_ctx *io)
#ifdef CONFIG_COMPAT static int __io_compat_recvmsg_copy_hdr(struct io_kiocb *req, - struct io_async_ctx *io) + struct io_async_msghdr *iomsg) { struct compat_msghdr __user *msg_compat; struct io_sr_msg *sr = &req->sr_msg; @@ -3664,7 +3664,7 @@ static int __io_compat_recvmsg_copy_hdr(struct io_kiocb *req, int ret;
msg_compat = (struct compat_msghdr __user *) sr->umsg; - ret = __get_compat_msghdr(&io->msg.msg, msg_compat, &io->msg.uaddr, + ret = __get_compat_msghdr(&iomsg->msg, msg_compat, &iomsg->uaddr, &ptr, &len); if (ret) return ret; @@ -3681,12 +3681,12 @@ static int __io_compat_recvmsg_copy_hdr(struct io_kiocb *req, return -EFAULT; if (clen < 0) return -EINVAL; - sr->len = io->msg.iov[0].iov_len; - io->msg.iov = NULL; + sr->len = iomsg->iov[0].iov_len; + iomsg->iov = NULL; } else { ret = compat_import_iovec(READ, uiov, len, UIO_FASTIOV, - &io->msg.iov, - &io->msg.msg.msg_iter); + &iomsg->iov, + &iomsg->msg.msg_iter); if (ret < 0) return ret; } @@ -3695,17 +3695,18 @@ static int __io_compat_recvmsg_copy_hdr(struct io_kiocb *req, } #endif
-static int io_recvmsg_copy_hdr(struct io_kiocb *req, struct io_async_ctx *io) +static int io_recvmsg_copy_hdr(struct io_kiocb *req, + struct io_async_msghdr *iomsg) { - io->msg.msg.msg_name = &io->msg.addr; - io->msg.iov = io->msg.fast_iov; + iomsg->msg.msg_name = &iomsg->addr; + iomsg->iov = iomsg->fast_iov;
#ifdef CONFIG_COMPAT if (req->ctx->compat) - return __io_compat_recvmsg_copy_hdr(req, io); + return __io_compat_recvmsg_copy_hdr(req, iomsg); #endif
- return __io_recvmsg_copy_hdr(req, io); + return __io_recvmsg_copy_hdr(req, iomsg); }
static struct io_buffer *io_recv_buffer_select(struct io_kiocb *req, @@ -3755,7 +3756,7 @@ static int io_recvmsg_prep(struct io_kiocb *req, if (req->flags & REQ_F_NEED_CLEANUP) return 0;
- ret = io_recvmsg_copy_hdr(req, io); + ret = io_recvmsg_copy_hdr(req, &io->msg); if (!ret) req->flags |= REQ_F_NEED_CLEANUP; return ret; @@ -3770,7 +3771,7 @@ static int io_recvmsg(struct io_kiocb *req, bool force_nonblock) sock = sock_from_file(req->file, &ret); if (sock) { struct io_buffer *kbuf; - struct io_async_ctx io; + struct io_async_msghdr iomsg; unsigned flags;
if (req->io) { @@ -3781,12 +3782,10 @@ static int io_recvmsg(struct io_kiocb *req, bool force_nonblock) kmsg->iov = kmsg->fast_iov; kmsg->msg.msg_iter.iov = kmsg->iov; } else { - kmsg = &io.msg; - kmsg->msg.msg_name = &io.msg.addr; - - ret = io_recvmsg_copy_hdr(req, &io); + ret = io_recvmsg_copy_hdr(req, &iomsg); if (ret) return ret; + kmsg = &iomsg; }
kbuf = io_recv_buffer_select(req, &cflags, !force_nonblock);
From: Pavel Begunkov asml.silence@gmail.com
mainline inclusion from mainline-5.9-rc1 commit 2ae523ed07f14391d685651f671a7858fe8c368a category: feature bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27 CVE: NA ---------------------------
Don't repeat the send msg initialisation code; it's error prone. Extract and use a helper function.
Signed-off-by: Pavel Begunkov asml.silence@gmail.com Signed-off-by: Jens Axboe axboe@kernel.dk Signed-off-by: yangerkun yangerkun@huawei.com Reviewed-by: zhangyi (F) yi.zhang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- fs/io_uring.c | 21 +++++++++++---------- 1 file changed, 11 insertions(+), 10 deletions(-)
diff --git a/fs/io_uring.c b/fs/io_uring.c index 03e4fef26567..fe61ce18c6c2 100644 --- a/fs/io_uring.c +++ b/fs/io_uring.c @@ -3491,6 +3491,15 @@ static int io_setup_async_msg(struct io_kiocb *req, return -EAGAIN; }
+static int io_sendmsg_copy_hdr(struct io_kiocb *req, + struct io_async_msghdr *iomsg) +{ + iomsg->iov = iomsg->fast_iov; + iomsg->msg.msg_name = &iomsg->addr; + return sendmsg_copy_msghdr(&iomsg->msg, req->sr_msg.umsg, + req->sr_msg.msg_flags, &iomsg->iov); +} + static int io_sendmsg_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe) { struct io_sr_msg *sr = &req->sr_msg; @@ -3515,10 +3524,7 @@ static int io_sendmsg_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe) if (req->flags & REQ_F_NEED_CLEANUP) return 0;
- io->msg.msg.msg_name = &io->msg.addr; - io->msg.iov = io->msg.fast_iov; - ret = sendmsg_copy_msghdr(&io->msg.msg, sr->umsg, sr->msg_flags, - &io->msg.iov); + ret = io_sendmsg_copy_hdr(req, &io->msg); if (!ret) req->flags |= REQ_F_NEED_CLEANUP; return ret; @@ -3543,12 +3549,7 @@ static int io_sendmsg(struct io_kiocb *req, bool force_nonblock) kmsg->iov = kmsg->fast_iov; kmsg->msg.msg_iter.iov = kmsg->iov; } else { - struct io_sr_msg *sr = &req->sr_msg; - - iomsg.msg.msg_name = &iomsg.addr; - iomsg.iov = iomsg.fast_iov; - ret = sendmsg_copy_msghdr(&iomsg.msg, sr->umsg, - sr->msg_flags, &iomsg.iov); + ret = io_sendmsg_copy_hdr(req, &iomsg); if (ret) return ret; kmsg = &iomsg;
From: Pavel Begunkov asml.silence@gmail.com
mainline inclusion from mainline-5.9-rc1 commit b64e3444d4e1c71fe148a4f4535395b1fdd73200 category: feature bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27 CVE: NA ---------------------------
Don't deref req->io->rw every time; put it in a local variable. This looks prettier, generates fewer instructions, and doesn't break alias analysis.
Signed-off-by: Pavel Begunkov asml.silence@gmail.com Signed-off-by: Jens Axboe axboe@kernel.dk Signed-off-by: yangerkun yangerkun@huawei.com Reviewed-by: zhangyi (F) yi.zhang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- fs/io_uring.c | 16 +++++++++------- 1 file changed, 9 insertions(+), 7 deletions(-)
diff --git a/fs/io_uring.c b/fs/io_uring.c index fe61ce18c6c2..b70a41bad00a 100644 --- a/fs/io_uring.c +++ b/fs/io_uring.c @@ -2548,15 +2548,17 @@ static void io_req_map_rw(struct io_kiocb *req, ssize_t io_size, struct iovec *iovec, struct iovec *fast_iov, struct iov_iter *iter) { - req->io->rw.nr_segs = iter->nr_segs; - req->io->rw.size = io_size; - req->io->rw.iov = iovec; - if (!req->io->rw.iov) { - req->io->rw.iov = req->io->rw.fast_iov; - if (req->io->rw.iov != fast_iov) - memcpy(req->io->rw.iov, fast_iov, + struct io_async_rw *rw = &req->io->rw; + + rw->nr_segs = iter->nr_segs; + rw->size = io_size; + if (!iovec) { + rw->iov = rw->fast_iov; + if (rw->iov != fast_iov) + memcpy(rw->iov, fast_iov, sizeof(struct iovec) * iter->nr_segs); } else { + rw->iov = iovec; req->flags |= REQ_F_NEED_CLEANUP; } }
From: Pavel Begunkov asml.silence@gmail.com
mainline inclusion from mainline-5.9-rc1 commit c3e330a493740a2a8312dcb7b1cffceaec7f619a category: feature bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27 CVE: NA ---------------------------
Preparing reads/writes for async is a bit tricky. Extract a helper to avoid repeating it twice.
Signed-off-by: Pavel Begunkov asml.silence@gmail.com Signed-off-by: Jens Axboe axboe@kernel.dk Signed-off-by: yangerkun yangerkun@huawei.com Reviewed-by: zhangyi (F) yi.zhang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- fs/io_uring.c | 46 ++++++++++++++++++++-------------------------- 1 file changed, 20 insertions(+), 26 deletions(-)
diff --git a/fs/io_uring.c b/fs/io_uring.c index b70a41bad00a..70adbafb37bf 100644 --- a/fs/io_uring.c +++ b/fs/io_uring.c @@ -2592,11 +2592,27 @@ static int io_setup_async_rw(struct io_kiocb *req, ssize_t io_size, return 0; }
+static inline int io_rw_prep_async(struct io_kiocb *req, int rw, + bool force_nonblock) +{ + struct io_async_ctx *io = req->io; + struct iov_iter iter; + ssize_t ret; + + io->rw.iov = io->rw.fast_iov; + req->io = NULL; + ret = io_import_iovec(rw, req, &io->rw.iov, &iter, !force_nonblock); + req->io = io; + if (unlikely(ret < 0)) + return ret; + + io_req_map_rw(req, ret, io->rw.iov, io->rw.fast_iov, &iter); + return 0; +} + static int io_read_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe, bool force_nonblock) { - struct io_async_ctx *io; - struct iov_iter iter; ssize_t ret;
ret = io_prep_rw(req, sqe, force_nonblock); @@ -2609,17 +2625,7 @@ static int io_read_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe, /* either don't need iovec imported or already have it */ if (!req->io || req->flags & REQ_F_NEED_CLEANUP) return 0; - - io = req->io; - io->rw.iov = io->rw.fast_iov; - req->io = NULL; - ret = io_import_iovec(READ, req, &io->rw.iov, &iter, !force_nonblock); - req->io = io; - if (ret < 0) - return ret; - - io_req_map_rw(req, ret, io->rw.iov, io->rw.fast_iov, &iter); - return 0; + return io_rw_prep_async(req, READ, force_nonblock); }
static int io_read(struct io_kiocb *req, bool force_nonblock) @@ -2685,8 +2691,6 @@ static int io_read(struct io_kiocb *req, bool force_nonblock) static int io_write_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe, bool force_nonblock) { - struct io_async_ctx *io; - struct iov_iter iter; ssize_t ret;
ret = io_prep_rw(req, sqe, force_nonblock); @@ -2701,17 +2705,7 @@ static int io_write_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe, /* either don't need iovec imported or already have it */ if (!req->io || req->flags & REQ_F_NEED_CLEANUP) return 0; - - io = req->io; - io->rw.iov = io->rw.fast_iov; - req->io = NULL; - ret = io_import_iovec(WRITE, req, &io->rw.iov, &iter, !force_nonblock); - req->io = io; - if (ret < 0) - return ret; - - io_req_map_rw(req, ret, io->rw.iov, io->rw.fast_iov, &iter); - return 0; + return io_rw_prep_async(req, WRITE, force_nonblock); }
static int io_write(struct io_kiocb *req, bool force_nonblock)
From: Xiaoguang Wang xiaoguang.wang@linux.alibaba.com
mainline inclusion from mainline-5.9-rc1 commit 23b3628e45924419399da48c2b3a522b05557c91 category: feature bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27 CVE: NA ---------------------------
In io_sq_thread(), if there are task works to handle, the current code will skip schedule() and go on polling the sq again, but forgets to clear the IORING_SQ_NEED_WAKEUP flag; fix this issue. Also add two helpers to set and clear the IORING_SQ_NEED_WAKEUP flag.
Signed-off-by: Xiaoguang Wang xiaoguang.wang@linux.alibaba.com Signed-off-by: Jens Axboe axboe@kernel.dk Signed-off-by: yangerkun yangerkun@huawei.com Reviewed-by: zhangyi (F) yi.zhang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- fs/io_uring.c | 29 +++++++++++++++++++---------- 1 file changed, 19 insertions(+), 10 deletions(-)
diff --git a/fs/io_uring.c b/fs/io_uring.c index 70adbafb37bf..a50e598336fb 100644 --- a/fs/io_uring.c +++ b/fs/io_uring.c @@ -5981,6 +5981,21 @@ static int io_submit_sqes(struct io_ring_ctx *ctx, unsigned int nr, return submitted; }
+static inline void io_ring_set_wakeup_flag(struct io_ring_ctx *ctx) +{ + /* Tell userspace we may need a wakeup call */ + spin_lock_irq(&ctx->completion_lock); + ctx->rings->sq_flags |= IORING_SQ_NEED_WAKEUP; + spin_unlock_irq(&ctx->completion_lock); +} + +static inline void io_ring_clear_wakeup_flag(struct io_ring_ctx *ctx) +{ + spin_lock_irq(&ctx->completion_lock); + ctx->rings->sq_flags &= ~IORING_SQ_NEED_WAKEUP; + spin_unlock_irq(&ctx->completion_lock); +} + static int io_sq_thread(void *data) { struct io_ring_ctx *ctx = data; @@ -6058,10 +6073,7 @@ static int io_sq_thread(void *data) continue; }
- /* Tell userspace we may need a wakeup call */ - spin_lock_irq(&ctx->completion_lock); - ctx->rings->sq_flags |= IORING_SQ_NEED_WAKEUP; - spin_unlock_irq(&ctx->completion_lock); + io_ring_set_wakeup_flag(ctx);
to_submit = io_sqring_entries(ctx); if (!to_submit || ret == -EBUSY) { @@ -6072,6 +6084,7 @@ static int io_sq_thread(void *data) if (current->task_works) { task_work_run(); finish_wait(&ctx->sqo_wait, &wait); + io_ring_clear_wakeup_flag(ctx); continue; } if (signal_pending(current)) @@ -6079,17 +6092,13 @@ static int io_sq_thread(void *data) schedule(); finish_wait(&ctx->sqo_wait, &wait);
- spin_lock_irq(&ctx->completion_lock); - ctx->rings->sq_flags &= ~IORING_SQ_NEED_WAKEUP; - spin_unlock_irq(&ctx->completion_lock); + io_ring_clear_wakeup_flag(ctx); ret = 0; continue; } finish_wait(&ctx->sqo_wait, &wait);
- spin_lock_irq(&ctx->completion_lock); - ctx->rings->sq_flags &= ~IORING_SQ_NEED_WAKEUP; - spin_unlock_irq(&ctx->completion_lock); + io_ring_clear_wakeup_flag(ctx); }
mutex_lock(&ctx->uring_lock);
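The flag matters because it is userspace's only signal that the SQPOLL thread went to sleep. A sketch of the submitter-side check, in the style liburing uses (ring_fd and sq_flags are hypothetical names here; __NR_io_uring_enter and the flag constants come from recent kernel headers):

#include <linux/io_uring.h>
#include <sys/syscall.h>
#include <unistd.h>

/*
 * sq_flags points into the shared SQ ring mapping; ring_fd is the
 * io_uring file descriptor.
 */
static void sq_ring_submitted(int ring_fd, const unsigned *sq_flags)
{
        unsigned flags = __atomic_load_n(sq_flags, __ATOMIC_ACQUIRE);

        /* kernel SQPOLL thread went to sleep: it needs an explicit kick */
        if (flags & IORING_SQ_NEED_WAKEUP)
                syscall(__NR_io_uring_enter, ring_fd, 0, 0,
                        IORING_ENTER_SQ_WAKEUP, NULL, 0);
}

A stale IORING_SQ_NEED_WAKEUP therefore costs the application an io_uring_enter() syscall per submission even though the SQPOLL thread is still awake, which is what the added io_ring_clear_wakeup_flag() call avoids.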
From: Colin Ian King colin.king@canonical.com
mainline inclusion from mainline-5.10-rc1 commit 035fbafc7a54b8c7755b3c508b8f3ab6ff3c8d65 category: feature bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27 CVE: NA ---------------------------
An incorrect sizeof() is being used: sizeof(file_data->table) is not correct, it should be sizeof(*file_data->table).
Fixes: 5398ae698525 ("io_uring: clean file_data access in files_register") Signed-off-by: Colin Ian King colin.king@canonical.com Addresses-Coverity: ("Sizeof not portable (SIZEOF_MISMATCH)") Signed-off-by: Jens Axboe axboe@kernel.dk Signed-off-by: yangerkun yangerkun@huawei.com Reviewed-by: zhangyi (F) yi.zhang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- fs/io_uring.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/fs/io_uring.c b/fs/io_uring.c index 7c4418715867..6b5b035a968d 100644 --- a/fs/io_uring.c +++ b/fs/io_uring.c @@ -6638,7 +6638,7 @@ static int io_sqe_files_register(struct io_ring_ctx *ctx, void __user *arg, spin_lock_init(&file_data->lock);
nr_tables = DIV_ROUND_UP(nr_args, IORING_MAX_FILES_TABLE); - file_data->table = kcalloc(nr_tables, sizeof(file_data->table), + file_data->table = kcalloc(nr_tables, sizeof(*file_data->table), GFP_KERNEL); if (!file_data->table) goto out_free;
From: Pavel Begunkov asml.silence@gmail.com
mainline inclusion from mainline-5.10-rc1 commit 58852d4d673760cf7c88b9360b3c24a041bec298 category: feature bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27 CVE: NA ---------------------------
__io_queue_proc() is used by both poll reqs and apoll. Don't use req->poll.events to copy the poll mask, because for apoll it aliases with private data of the request.
Signed-off-by: Pavel Begunkov asml.silence@gmail.com Signed-off-by: Jens Axboe axboe@kernel.dk Signed-off-by: yangerkun yangerkun@huawei.com Reviewed-by: zhangyi (F) yi.zhang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- fs/io_uring.c | 4 +++- 1 file changed, 3 insertions(+), 1 deletion(-)
diff --git a/fs/io_uring.c b/fs/io_uring.c index 6b5b035a968d..10ce1ceeef0b 100644 --- a/fs/io_uring.c +++ b/fs/io_uring.c @@ -4245,6 +4245,8 @@ static void __io_queue_proc(struct io_poll_iocb *poll, struct io_poll_table *pt, * for write). Setup a separate io_poll_iocb if this happens. */ if (unlikely(poll->head)) { + struct io_poll_iocb *poll_one = poll; + /* already have a 2nd entry, fail a third attempt */ if (*poll_ptr) { pt->error = -EINVAL; @@ -4255,7 +4257,7 @@ static void __io_queue_proc(struct io_poll_iocb *poll, struct io_poll_table *pt, pt->error = -ENOMEM; return; } - io_init_poll_iocb(poll, req->poll.events, io_poll_double_wake); + io_init_poll_iocb(poll, poll_one->events, io_poll_double_wake); refcount_inc(&req->refs); poll->wait.private = req; *poll_ptr = poll;
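The hazard being fixed is reading a union member through the wrong view. A toy, self-contained illustration with hypothetical types, nothing io_uring-specific:

#include <stdio.h>

struct poll_iocb { unsigned events; };
struct op_private { unsigned cookie; };

/* one storage area reused depending on the request type */
union per_request {
        struct poll_iocb poll;
        struct op_private priv;
};

int main(void)
{
        union per_request u;

        u.priv.cookie = 0xdeadbeef; /* apoll-style: private data lives here */

        /* reading the poll view now returns cookie bits, not an event mask */
        printf("events as (mis)read: %#x\n", u.poll.events);
        return 0;
}

Copying the mask from poll_one (the entry actually passed into __io_queue_proc()) instead of req->poll avoids reading through the aliased view.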
From: Jens Axboe axboe@kernel.dk
mainline inclusion from mainline-5.10-rc2 commit c8b5e2600a2cfa1cdfbecf151afd67aee227381d category: feature bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27 CVE: NA ---------------------------
io_poll_double_wake() is called for both request types: pure poll requests and internal polls. This means that we should be using the right handler based on the request type. Use the one that the original caller already assigned for the waitqueue handling; that will always match the correct type.
Cc: stable@vger.kernel.org # v5.8+ Reported-by: Pavel Begunkov asml.silence@gmail.com Signed-off-by: Jens Axboe axboe@kernel.dk Signed-off-by: yangerkun yangerkun@huawei.com Reviewed-by: zhangyi (F) yi.zhang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- fs/io_uring.c | 6 ++++-- 1 file changed, 4 insertions(+), 2 deletions(-)
diff --git a/fs/io_uring.c b/fs/io_uring.c index 10ce1ceeef0b..1cec7adeec2b 100644 --- a/fs/io_uring.c +++ b/fs/io_uring.c @@ -4215,8 +4215,10 @@ static int io_poll_double_wake(struct wait_queue_entry *wait, unsigned mode, if (!done) list_del_init(&poll->wait.entry); spin_unlock(&poll->head->lock); - if (!done) - __io_async_wake(req, poll, mask, io_poll_task_func); + if (!done) { + /* use wait func handler, so it matches the rq type */ + poll->wait.func(&poll->wait, mode, sync, key); + } } refcount_dec(&req->refs); return 1;
From: Jens Axboe axboe@kernel.dk
mainline inclusion from mainline-5.10-rc4 commit 88ec3211e46344a7d10cf6cb5045f839f7785f8e category: feature bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27 CVE: NA ---------------------------
If an application specifies IORING_SETUP_CQSIZE to set the CQ ring size to a specific size, we ensure that the CQ size is at least that of the SQ ring size. But in doing so, we compare the SQ size, which has already been rounded up to a power of two, with the as-of-yet unrounded CQ size. This means that if an application passes in non power of two sizes, we can return -EINVAL when the final value would've been fine. As an example, an application passing in 100/100 for sq/cq size should end up with 128 for both. But since we round the SQ size first, we compare the CQ size of 100 to 128, and return -EINVAL as that is too small.
Cc: stable@vger.kernel.org Fixes: 33a107f0a1b8 ("io_uring: allow application controlled CQ ring size") Reported-by: Dan Melnic dmm@fb.com Signed-off-by: Jens Axboe axboe@kernel.dk Signed-off-by: yangerkun yangerkun@huawei.com Reviewed-by: zhangyi (F) yi.zhang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- fs/io_uring.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/fs/io_uring.c b/fs/io_uring.c index 1cec7adeec2b..4d20ff944cdf 100644 --- a/fs/io_uring.c +++ b/fs/io_uring.c @@ -7951,6 +7951,7 @@ static int io_uring_create(unsigned entries, struct io_uring_params *p, * to a power-of-two, if it isn't already. We do NOT impose * any cq vs sq ring sizing. */ + p->cq_entries = roundup_pow_of_two(p->cq_entries); if (p->cq_entries < p->sq_entries) return -EINVAL; if (p->cq_entries > IORING_MAX_CQ_ENTRIES) { @@ -7958,7 +7959,6 @@ static int io_uring_create(unsigned entries, struct io_uring_params *p, return -EINVAL; p->cq_entries = IORING_MAX_CQ_ENTRIES; } - p->cq_entries = roundup_pow_of_two(p->cq_entries); } else { p->cq_entries = 2 * p->sq_entries; }
From: Joseph Qi joseph.qi@linux.alibaba.com
mainline inclusion from mainline-5.10-rc6 commit eb2667b343361863da7b79be26de641e22844ba0 category: feature bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27 CVE: NA ---------------------------
Abaci Fuzz reported a shift-out-of-bounds BUG in io_uring_create():
[ 59.598207] UBSAN: shift-out-of-bounds in ./include/linux/log2.h:57:13 [ 59.599665] shift exponent 64 is too large for 64-bit type 'long unsigned int' [ 59.601230] CPU: 0 PID: 963 Comm: a.out Not tainted 5.10.0-rc4+ #3 [ 59.602502] Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011 [ 59.603673] Call Trace: [ 59.604286] dump_stack+0x107/0x163 [ 59.605237] ubsan_epilogue+0xb/0x5a [ 59.606094] __ubsan_handle_shift_out_of_bounds.cold+0xb2/0x20e [ 59.607335] ? lock_downgrade+0x6c0/0x6c0 [ 59.608182] ? rcu_read_lock_sched_held+0xaf/0xe0 [ 59.609166] io_uring_create.cold+0x99/0x149 [ 59.610114] io_uring_setup+0xd6/0x140 [ 59.610975] ? io_uring_create+0x2510/0x2510 [ 59.611945] ? lockdep_hardirqs_on_prepare+0x286/0x400 [ 59.613007] ? syscall_enter_from_user_mode+0x27/0x80 [ 59.614038] ? trace_hardirqs_on+0x5b/0x180 [ 59.615056] do_syscall_64+0x2d/0x40 [ 59.615940] entry_SYSCALL_64_after_hwframe+0x44/0xa9 [ 59.617007] RIP: 0033:0x7f2bb8a0b239
This is caused by roundup_pow_of_two() if the input entries are large enough, e.g. 2^32-1. For sq_entries, we check first and allow at most IORING_MAX_ENTRIES, so it is okay. But for cq_entries, we round up first, which may overflow and truncate it to 0; that is not the expected behavior. So check the cq size first and then round up.
Fixes: 88ec3211e463 ("io_uring: round-up cq size before comparing with rounded sq size") Reported-by: Abaci Fuzz abaci@linux.alibaba.com Signed-off-by: Joseph Qi joseph.qi@linux.alibaba.com Reviewed-by: Stefano Garzarella sgarzare@redhat.com Signed-off-by: Jens Axboe axboe@kernel.dk Signed-off-by: yangerkun yangerkun@huawei.com Reviewed-by: zhangyi (F) yi.zhang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- fs/io_uring.c | 6 ++++-- 1 file changed, 4 insertions(+), 2 deletions(-)
diff --git a/fs/io_uring.c b/fs/io_uring.c index 4d20ff944cdf..c3960af74820 100644 --- a/fs/io_uring.c +++ b/fs/io_uring.c @@ -7951,14 +7951,16 @@ static int io_uring_create(unsigned entries, struct io_uring_params *p, * to a power-of-two, if it isn't already. We do NOT impose * any cq vs sq ring sizing. */ - p->cq_entries = roundup_pow_of_two(p->cq_entries); - if (p->cq_entries < p->sq_entries) + if (!p->cq_entries) return -EINVAL; if (p->cq_entries > IORING_MAX_CQ_ENTRIES) { if (!(p->flags & IORING_SETUP_CLAMP)) return -EINVAL; p->cq_entries = IORING_MAX_CQ_ENTRIES; } + p->cq_entries = roundup_pow_of_two(p->cq_entries); + if (p->cq_entries < p->sq_entries) + return -EINVAL; } else { p->cq_entries = 2 * p->sq_entries; }
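Taking this patch and the previous one together, the safe order is: reject zero, clamp to the maximum, round up, then compare against the already-rounded SQ size. A compressed, standalone sketch; the maximum is an assumed stand-in value, and a return of 0 stands in for -EINVAL:

#include <stdint.h>
#include <stdio.h>

#define MAX_CQ_ENTRIES  (2U * 32768U)   /* assumed value */

static uint32_t roundup_pow2_u32(uint32_t n)
{
        n--;
        n |= n >> 1;  n |= n >> 2;  n |= n >> 4;
        n |= n >> 8;  n |= n >> 16;
        return n + 1;
}

/* returns the rounded cq_entries, or 0 for the -EINVAL cases */
static uint32_t size_cq_ring(uint32_t sq_rounded, uint32_t cq_req, int clamp)
{
        if (!cq_req)
                return 0;
        if (cq_req > MAX_CQ_ENTRIES) {
                if (!clamp)
                        return 0;
                cq_req = MAX_CQ_ENTRIES;        /* clamp before rounding... */
        }
        cq_req = roundup_pow2_u32(cq_req);      /* ...so this can't overflow */
        return cq_req < sq_rounded ? 0 : cq_req;
}

int main(void)
{
        /* the 100/100 case from the log: both end up at 128 */
        printf("cq=%u\n", size_cq_ring(roundup_pow2_u32(100), 100, 0));
        return 0;
}

Clamping before rounding is what keeps roundup from being fed a value whose next power of two no longer fits in 32 bits.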
From: Pavel Begunkov asml.silence@gmail.com
mainline inclusion from mainline-5.10-rc7 commit 2d280bc8930ba9ed1705cfd548c6c8924949eaf1 category: feature bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27 CVE: NA ---------------------------
__io_compat_recvmsg_copy_hdr() with REQ_F_BUFFER_SELECT reads out the iov len but never assigns it to iov/fast_iov, leaving sr->len with garbage. Hopefully, the following io_buffer_select() truncates it to the selected buffer size, but the value may still be less than what was specified.
Cc: stable@vger.kernel.org # 5.7 Signed-off-by: Pavel Begunkov asml.silence@gmail.com Signed-off-by: Jens Axboe axboe@kernel.dk Signed-off-by: yangerkun yangerkun@huawei.com Reviewed-by: zhangyi (F) yi.zhang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- fs/io_uring.c | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-)
diff --git a/fs/io_uring.c b/fs/io_uring.c index c3960af74820..0420e098ad54 100644 --- a/fs/io_uring.c +++ b/fs/io_uring.c @@ -3692,7 +3692,8 @@ static int __io_compat_recvmsg_copy_hdr(struct io_kiocb *req, return -EFAULT; if (clen < 0) return -EINVAL; - sr->len = iomsg->iov[0].iov_len; + sr->len = clen; + iomsg->iov[0].iov_len = clen; iomsg->iov = NULL; } else { ret = compat_import_iovec(READ, uiov, len, UIO_FASTIOV,
From: Jens Axboe axboe@kernel.dk
mainline inclusion from mainline-5.9-rc1 commit 4349f30ecb8068d146a1e57bb12f46e745323b4c category: feature bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27 CVE: NA ---------------------------
We don't use 'ctx' at all in io_sq_thread_drop_mm(); it just works on the mm of the current task. Drop the argument.
Move io_file_put_work() to where we have the other forward declarations of functions.
Signed-off-by: Jens Axboe axboe@kernel.dk
Conflicts: fs/io_uring.c
Signed-off-by: yangerkun yangerkun@huawei.com Reviewed-by: zhangyi (F) yi.zhang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- fs/io_uring.c | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-)
diff --git a/fs/io_uring.c b/fs/io_uring.c index 0420e098ad54..e4bc79041880 100644 --- a/fs/io_uring.c +++ b/fs/io_uring.c @@ -4284,7 +4284,7 @@ static void io_async_queue_proc(struct file *file, struct wait_queue_head *head, __io_queue_proc(&apoll->poll, pt, head, &apoll->double_poll); }
-static void io_sq_thread_drop_mm(struct io_ring_ctx *ctx) +static void io_sq_thread_drop_mm(void) { struct mm_struct *mm = current->mm;
@@ -6066,7 +6066,7 @@ static int io_sq_thread(void *data) * adding ourselves to the waitqueue, as the unuse/drop * may sleep. */ - io_sq_thread_drop_mm(ctx); + io_sq_thread_drop_mm();
/* * We're polling. If we're within the defined idle @@ -6139,7 +6139,7 @@ static int io_sq_thread(void *data) task_work_run();
set_fs(old_fs); - io_sq_thread_drop_mm(ctx); + io_sq_thread_drop_mm(); revert_creds(old_cred);
kthread_parkme();
From: Jens Axboe axboe@kernel.dk
mainline inclusion from mainline-5.9-rc1 commit d1719f70d0a5b83b12786a7dbc5b9fe396469016 category: feature bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27 CVE: NA ---------------------------
As soon as we install the file descriptor, we have to assume that it can get arbitrarily closed. We currently account memory (and note that we did) after installing the ring fd, which means there is a potential use-after-free condition if the fd is closed right after being installed, but before we fiddle with the ctx.
In fact, syzbot reported this exact scenario:
BUG: KASAN: use-after-free in io_account_mem fs/io_uring.c:7397 [inline] BUG: KASAN: use-after-free in io_uring_create fs/io_uring.c:8369 [inline] BUG: KASAN: use-after-free in io_uring_setup+0x2797/0x2910 fs/io_uring.c:8400 Read of size 1 at addr ffff888087a41044 by task syz-executor.5/18145
CPU: 0 PID: 18145 Comm: syz-executor.5 Not tainted 5.8.0-rc7-next-20200729-syzkaller #0 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011 Call Trace: __dump_stack lib/dump_stack.c:77 [inline] dump_stack+0x18f/0x20d lib/dump_stack.c:118 print_address_description.constprop.0.cold+0xae/0x497 mm/kasan/report.c:383 __kasan_report mm/kasan/report.c:513 [inline] kasan_report.cold+0x1f/0x37 mm/kasan/report.c:530 io_account_mem fs/io_uring.c:7397 [inline] io_uring_create fs/io_uring.c:8369 [inline] io_uring_setup+0x2797/0x2910 fs/io_uring.c:8400 do_syscall_64+0x2d/0x70 arch/x86/entry/common.c:46 entry_SYSCALL_64_after_hwframe+0x44/0xa9 RIP: 0033:0x45c429 Code: 8d b6 fb ff c3 66 2e 0f 1f 84 00 00 00 00 00 66 90 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 0f 83 5b b6 fb ff c3 66 2e 0f 1f 84 00 00 00 00 RSP: 002b:00007f8f121d0c78 EFLAGS: 00000246 ORIG_RAX: 00000000000001a9 RAX: ffffffffffffffda RBX: 0000000000008540 RCX: 000000000045c429 RDX: 0000000000000000 RSI: 0000000020000040 RDI: 0000000000000196 RBP: 000000000078bf38 R08: 0000000000000000 R09: 0000000000000000 R10: 0000000000000000 R11: 0000000000000246 R12: 000000000078bf0c R13: 00007fff86698cff R14: 00007f8f121d19c0 R15: 000000000078bf0c
Move the accounting of the ring's locked memory to before we get and install the ring file descriptor.
Cc: stable@vger.kernel.org Reported-by: syzbot+9d46305e76057f30c74e@syzkaller.appspotmail.com Fixes: 309758254ea6 ("io_uring: report pinned memory usage") Reviewed-by: Stefano Garzarella sgarzare@redhat.com Signed-off-by: Jens Axboe axboe@kernel.dk Signed-off-by: yangerkun yangerkun@huawei.com Reviewed-by: zhangyi (F) yi.zhang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- fs/io_uring.c | 12 +++++++++--- 1 file changed, 9 insertions(+), 3 deletions(-)
diff --git a/fs/io_uring.c b/fs/io_uring.c index e4bc79041880..fa1ce29d5f67 100644 --- a/fs/io_uring.c +++ b/fs/io_uring.c @@ -8025,6 +8025,15 @@ static int io_uring_create(unsigned entries, struct io_uring_params *p, ret = -EFAULT; goto err; } + + /* + * Account memory _before_ installing the file descriptor. Once + * the descriptor is installed, it can get closed at any time. + */ + io_account_mem(ctx, ring_pages(p->sq_entries, p->cq_entries), + ACCT_LOCKED); + ctx->limit_mem = limit_mem; + /* * Install ring fd as the very last thing, so we don't risk someone * having closed it before we finish setup @@ -8034,9 +8043,6 @@ static int io_uring_create(unsigned entries, struct io_uring_params *p, goto err;
trace_io_uring_create(ret, ctx, p->sq_entries, p->cq_entries, p->flags); - io_account_mem(ctx, ring_pages(p->sq_entries, p->cq_entries), - ACCT_LOCKED); - ctx->limit_mem = limit_mem; return ret; err: io_ring_ctx_wait_and_kill(ctx);
From: Jens Axboe axboe@kernel.dk
mainline inclusion from mainline-5.9-rc1 commit f74441e6311a28f0ee89b9c8e296a33730f812fc category: feature bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27 CVE: NA ---------------------------
The tear down path will always unaccount the memory, so ensure that we have accounted it before hitting any of the error paths.
Reported-by: Tomáš Chaloupka chalucha@gmail.com Reviewed-by: Stefano Garzarella sgarzare@redhat.com Signed-off-by: Jens Axboe axboe@kernel.dk Signed-off-by: yangerkun yangerkun@huawei.com Reviewed-by: zhangyi (F) yi.zhang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- fs/io_uring.c | 18 ++++++++++-------- 1 file changed, 10 insertions(+), 8 deletions(-)
diff --git a/fs/io_uring.c b/fs/io_uring.c index fa1ce29d5f67..a8218ff4df42 100644 --- a/fs/io_uring.c +++ b/fs/io_uring.c @@ -7990,6 +7990,16 @@ static int io_uring_create(unsigned entries, struct io_uring_params *p, ctx->user = user; ctx->creds = get_current_cred();
+ /* + * Account memory _before_ installing the file descriptor. Once + * the descriptor is installed, it can get closed at any time. Also + * do this before hitting the general error path, as ring freeing + * will un-account as well. + */ + io_account_mem(ctx, ring_pages(p->sq_entries, p->cq_entries), + ACCT_LOCKED); + ctx->limit_mem = limit_mem; + ret = io_allocate_scq_urings(ctx, p); if (ret) goto err; @@ -8026,14 +8036,6 @@ static int io_uring_create(unsigned entries, struct io_uring_params *p, goto err; }
- /* - * Account memory _before_ installing the file descriptor. Once - * the descriptor is installed, it can get closed at any time. - */ - io_account_mem(ctx, ring_pages(p->sq_entries, p->cq_entries), - ACCT_LOCKED); - ctx->limit_mem = limit_mem; - /* * Install ring fd as the very last thing, so we don't risk someone * having closed it before we finish setup
From: Jens Axboe axboe@kernel.dk
mainline inclusion from mainline-5.9-rc1 commit d4e7cd36a90e38e0276d6ce0c20f5ccef17ec38c category: feature bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27 CVE: NA ---------------------------
There's a bit of confusion on the matching pairs of poll vs double poll, depending on if the request is a pure poll (IORING_OP_POLL_ADD) or poll driven retry.
Add io_poll_get_double() that returns the double poll waitqueue, if any, and io_poll_get_single() that returns the original poll waitqueue. With that, remove the argument to io_poll_remove_double().
Finally ensure that wait->private is cleared once the double poll handler has run, so that remove knows it's already been seen.
Cc: stable@vger.kernel.org # v5.8 Reported-by: syzbot+7f617d4a9369028b8a2c@syzkaller.appspotmail.com Fixes: 18bceab101ad ("io_uring: allow POLL_ADD with double poll_wait() users") Signed-off-by: Jens Axboe axboe@kernel.dk Signed-off-by: yangerkun yangerkun@huawei.com Reviewed-by: zhangyi (F) yi.zhang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- fs/io_uring.c | 34 +++++++++++++++++++++++++--------- 1 file changed, 25 insertions(+), 9 deletions(-)
diff --git a/fs/io_uring.c b/fs/io_uring.c index 79e858d83b71..2dd7a04ad721 100644 --- a/fs/io_uring.c +++ b/fs/io_uring.c @@ -4145,9 +4145,24 @@ static bool io_poll_rewait(struct io_kiocb *req, struct io_poll_iocb *poll) return false; }
-static void io_poll_remove_double(struct io_kiocb *req, void *data) +static struct io_poll_iocb *io_poll_get_double(struct io_kiocb *req) { - struct io_poll_iocb *poll = data; + /* pure poll stashes this in ->io, poll driven retry elsewhere */ + if (req->opcode == IORING_OP_POLL_ADD) + return (struct io_poll_iocb *) req->io; + return req->apoll->double_poll; +} + +static struct io_poll_iocb *io_poll_get_single(struct io_kiocb *req) +{ + if (req->opcode == IORING_OP_POLL_ADD) + return &req->poll; + return &req->apoll->poll; +} + +static void io_poll_remove_double(struct io_kiocb *req) +{ + struct io_poll_iocb *poll = io_poll_get_double(req);
lockdep_assert_held(&req->ctx->completion_lock);
@@ -4167,7 +4182,7 @@ static void io_poll_complete(struct io_kiocb *req, __poll_t mask, int error) { struct io_ring_ctx *ctx = req->ctx;
- io_poll_remove_double(req, req->io); + io_poll_remove_double(req); req->poll.done = true; io_cqring_fill_event(req, error ? error : mangle_poll(mask)); io_commit_cqring(ctx); @@ -4211,7 +4226,7 @@ static int io_poll_double_wake(struct wait_queue_entry *wait, unsigned mode, int sync, void *key) { struct io_kiocb *req = wait->private; - struct io_poll_iocb *poll = req->apoll->double_poll; + struct io_poll_iocb *poll = io_poll_get_single(req); __poll_t mask = key_to_poll(key);
/* for instances that support it check for an event match first: */ @@ -4227,6 +4242,8 @@ static int io_poll_double_wake(struct wait_queue_entry *wait, unsigned mode, done = list_empty(&poll->wait.entry); if (!done) list_del_init(&poll->wait.entry); + /* make sure double remove sees this as being gone */ + wait->private = NULL; spin_unlock(&poll->head->lock); if (!done) { /* use wait func handler, so it matches the rq type */ @@ -4410,7 +4427,7 @@ static void io_async_task_func(struct callback_head *cb) } }
- io_poll_remove_double(req, apoll->double_poll); + io_poll_remove_double(req); spin_unlock_irq(&ctx->completion_lock);
/* restore ->work in case we need to retry again */ @@ -4539,7 +4556,7 @@ static bool io_arm_poll_handler(struct io_kiocb *req) ret = __io_arm_poll_handler(req, &apoll->poll, &ipt, mask, io_async_wake); if (ret || ipt.error) { - io_poll_remove_double(req, apoll->double_poll); + io_poll_remove_double(req); spin_unlock_irq(&ctx->completion_lock); if (req->flags & REQ_F_WORK_INITIALIZED) memcpy(&req->work, &apoll->work, sizeof(req->work)); @@ -4573,14 +4590,13 @@ static bool io_poll_remove_one(struct io_kiocb *req) { bool do_complete;
+ io_poll_remove_double(req); + if (req->opcode == IORING_OP_POLL_ADD) { - io_poll_remove_double(req, req->io); do_complete = __io_poll_remove_one(req, &req->poll); } else { struct async_poll *apoll = req->apoll;
- io_poll_remove_double(req, apoll->double_poll); - /* non-poll requests have submit ref still */ do_complete = __io_poll_remove_one(req, &apoll->poll); if (do_complete) {
From: Jens Axboe axboe@kernel.dk
mainline inclusion from mainline-5.9-rc7 commit f3cd4850504ff612d0ea77a0aaf29b66c98fcefe category: feature bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27 CVE: NA ---------------------------
If we cancel these requests, we'll leak the memory associated with the filename. Add them to the table of ops that need cleaning, if REQ_F_NEED_CLEANUP is set.
Cc: stable@vger.kernel.org Fixes: e62753e4e292 ("io_uring: call statx directly") Reviewed-by: Stefano Garzarella sgarzare@redhat.com Signed-off-by: Jens Axboe axboe@kernel.dk
Conflicts: fs/io_uring.c [openat2 is skipped, and commit 1c2da9e8839d ("io_uring: remove empty cleanup of OP_OPEN* reqs") is not merged] Signed-off-by: yangerkun yangerkun@huawei.com Reviewed-by: zhangyi (F) yi.zhang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- fs/io_uring.c | 2 ++ 1 file changed, 2 insertions(+)
diff --git a/fs/io_uring.c b/fs/io_uring.c index 2dd7a04ad721..22fd806cfb62 100644 --- a/fs/io_uring.c +++ b/fs/io_uring.c @@ -5236,6 +5236,8 @@ static void io_cleanup_req(struct io_kiocb *req) kfree(req->sr_msg.kbuf); break; case IORING_OP_OPENAT: + if (req->open.filename) + putname(req->open.filename); break; case IORING_OP_SPLICE: case IORING_OP_TEE:
From: Jens Axboe axboe@kernel.dk
mainline inclusion from mainline-5.10-rc1 commit 55cbc2564ab2fd555ec0fc39311a9cfb811d7da5 category: feature bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27 CVE: NA ---------------------------
syzbot reports the following crash:
general protection fault, probably for non-canonical address 0xdffffc0000000000: 0000 [#1] PREEMPT SMP KASAN KASAN: null-ptr-deref in range [0x0000000000000000-0x0000000000000007] CPU: 1 PID: 8927 Comm: syz-executor.3 Not tainted 5.9.0-syzkaller #0 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011 RIP: 0010:io_file_from_index fs/io_uring.c:5963 [inline] RIP: 0010:io_sqe_files_register fs/io_uring.c:7369 [inline] RIP: 0010:__io_uring_register fs/io_uring.c:9463 [inline] RIP: 0010:__do_sys_io_uring_register+0x2fd2/0x3ee0 fs/io_uring.c:9553 Code: ec 03 49 c1 ee 03 49 01 ec 49 01 ee e8 57 61 9c ff 41 80 3c 24 00 0f 85 9b 09 00 00 4d 8b af b8 01 00 00 4c 89 e8 48 c1 e8 03 <80> 3c 28 00 0f 85 76 09 00 00 49 8b 55 00 89 d8 c1 f8 09 48 98 4c RSP: 0018:ffffc90009137d68 EFLAGS: 00010246 RAX: 0000000000000000 RBX: 0000000000000000 RCX: ffffc9000ef2a000 RDX: 0000000000040000 RSI: ffffffff81d81dd9 RDI: 0000000000000005 RBP: dffffc0000000000 R08: 0000000000000001 R09: 0000000000000000 R10: 0000000000000000 R11: 0000000000000000 R12: ffffed1012882a37 R13: 0000000000000000 R14: ffffed1012882a38 R15: ffff888094415000 FS: 00007f4266f3c700(0000) GS:ffff8880ae500000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 000000000118c000 CR3: 000000008e57d000 CR4: 00000000001506e0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 Call Trace: do_syscall_64+0x2d/0x70 arch/x86/entry/common.c:46 entry_SYSCALL_64_after_hwframe+0x44/0xa9 RIP: 0033:0x45de59 Code: 0d b4 fb ff c3 66 2e 0f 1f 84 00 00 00 00 00 66 90 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 0f 83 db b3 fb ff c3 66 2e 0f 1f 84 00 00 00 00 RSP: 002b:00007f4266f3bc78 EFLAGS: 00000246 ORIG_RAX: 00000000000001ab RAX: ffffffffffffffda RBX: 00000000000083c0 RCX: 000000000045de59 RDX: 0000000020000280 RSI: 0000000000000002 RDI: 0000000000000005 RBP: 000000000118bf68 R08: 0000000000000000 R09: 0000000000000000 R10: 40000000000000a1 R11: 0000000000000246 R12: 000000000118bf2c R13: 00007fff2fa4f12f R14: 00007f4266f3c9c0 R15: 000000000118bf2c Modules linked in: ---[ end trace 2a40a195e2d5e6e6 ]--- RIP: 0010:io_file_from_index fs/io_uring.c:5963 [inline] RIP: 0010:io_sqe_files_register fs/io_uring.c:7369 [inline] RIP: 0010:__io_uring_register fs/io_uring.c:9463 [inline] RIP: 0010:__do_sys_io_uring_register+0x2fd2/0x3ee0 fs/io_uring.c:9553 Code: ec 03 49 c1 ee 03 49 01 ec 49 01 ee e8 57 61 9c ff 41 80 3c 24 00 0f 85 9b 09 00 00 4d 8b af b8 01 00 00 4c 89 e8 48 c1 e8 03 <80> 3c 28 00 0f 85 76 09 00 00 49 8b 55 00 89 d8 c1 f8 09 48 98 4c RSP: 0018:ffffc90009137d68 EFLAGS: 00010246 RAX: 0000000000000000 RBX: 0000000000000000 RCX: ffffc9000ef2a000 RDX: 0000000000040000 RSI: ffffffff81d81dd9 RDI: 0000000000000005 RBP: dffffc0000000000 R08: 0000000000000001 R09: 0000000000000000 R10: 0000000000000000 R11: 0000000000000000 R12: ffffed1012882a37 R13: 0000000000000000 R14: ffffed1012882a38 R15: ffff888094415000 FS: 00007f4266f3c700(0000) GS:ffff8880ae400000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 000000000074a918 CR3: 000000008e57d000 CR4: 00000000001506f0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
which is the fget failure condition jumping to cleanup, but the cleanup requires ctx->file_data to be assigned. Assign it during setup, and ensure that we clear it again on the error path exit.
Fixes: 5398ae698525 ("io_uring: clean file_data access in files_register") Reported-by: syzbot+f4ebcc98223dafd8991e@syzkaller.appspotmail.com Signed-off-by: Jens Axboe axboe@kernel.dk Signed-off-by: yangerkun yangerkun@huawei.com Reviewed-by: zhangyi (F) yi.zhang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- fs/io_uring.c | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-)
diff --git a/fs/io_uring.c b/fs/io_uring.c index 22fd806cfb62..5da7b3fbaf77 100644 --- a/fs/io_uring.c +++ b/fs/io_uring.c @@ -6757,6 +6757,7 @@ static int io_sqe_files_register(struct io_ring_ctx *ctx, void __user *arg,
if (io_sqe_alloc_file_tables(file_data, nr_tables, nr_args)) goto out_ref; + ctx->file_data = file_data;
for (i = 0; i < nr_args; i++, ctx->nr_user_files++) { struct fixed_file_table *table; @@ -6791,7 +6792,6 @@ static int io_sqe_files_register(struct io_ring_ctx *ctx, void __user *arg, table->files[index] = file; }
- ctx->file_data = file_data; ret = io_sqe_files_scm(ctx); if (ret) { io_sqe_files_unregister(ctx); @@ -6824,6 +6824,7 @@ static int io_sqe_files_register(struct io_ring_ctx *ctx, void __user *arg, out_free: kfree(file_data->table); kfree(file_data); + ctx->file_data = NULL; return ret; }
From: Jens Axboe axboe@kernel.dk
mainline inclusion from mainline-5.11-rc1 commit 00c18640c2430c4bafaaeede1f9dd6f7ec0e4b25 category: feature bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27 CVE: NA ---------------------------
Before IORING_SETUP_ATTACH_WQ, we could just cancel everything on the io-wq when exiting. But that's not the case if they are shared, so cancel for the specific ctx instead.
Cc: stable@vger.kernel.org Fixes: 24369c2e3bb0 ("io_uring: add io-wq workqueue sharing") Signed-off-by: Jens Axboe axboe@kernel.dk Signed-off-by: yangerkun yangerkun@huawei.com Reviewed-by: zhangyi (F) yi.zhang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- fs/io_uring.c | 9 ++++++++- 1 file changed, 8 insertions(+), 1 deletion(-)
diff --git a/fs/io_uring.c b/fs/io_uring.c index 5da7b3fbaf77..628b262c10c8 100644 --- a/fs/io_uring.c +++ b/fs/io_uring.c @@ -7546,6 +7546,13 @@ static void io_ring_exit_work(struct work_struct *work) io_ring_ctx_free(ctx); }
+static bool io_cancel_ctx_cb(struct io_wq_work *work, void *data) +{ + struct io_kiocb *req = container_of(work, struct io_kiocb, work); + + return req->ctx == data; +} + static void io_ring_ctx_wait_and_kill(struct io_ring_ctx *ctx) { mutex_lock(&ctx->uring_lock); @@ -7556,7 +7563,7 @@ static void io_ring_ctx_wait_and_kill(struct io_ring_ctx *ctx) io_poll_remove_all(ctx);
if (ctx->io_wq) - io_wq_cancel_all(ctx->io_wq); + io_wq_cancel_cb(ctx->io_wq, io_cancel_ctx_cb, ctx, true);
/* if we failed setting up the ctx, we might not have any rings */ if (ctx->rings)
From: Jens Axboe axboe@kernel.dk
mainline inclusion from mainline-5.11-rc5 commit 607ec89ed18f49ca59689572659b9c0076f1991f category: feature bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27 CVE: NA ---------------------------
IORING_OP_CLOSE is special in terms of cancelation, since it has an intermediate state where we've removed the file descriptor but haven't closed the file yet. For that reason, it's currently marked with IO_WQ_WORK_NO_CANCEL to prevent cancelation. This ensures that the op is always run even if canceled, to prevent leaving us with a live file but an fd that is gone. However, with SQPOLL, since a cancel request doesn't carry any resources on behalf of the request being canceled, if we cancel before any part of the close op has run, we can end up with io-wq not having the ->files assigned. This can result in the following oops reported by Joseph:
BUG: kernel NULL pointer dereference, address: 00000000000000d8 PGD 800000010b76f067 P4D 800000010b76f067 PUD 10b462067 PMD 0 Oops: 0000 [#1] SMP PTI CPU: 1 PID: 1788 Comm: io_uring-sq Not tainted 5.11.0-rc4 #1 Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011 RIP: 0010:__lock_acquire+0x19d/0x18c0 Code: 00 00 8b 1d fd 56 dd 08 85 db 0f 85 43 05 00 00 48 c7 c6 98 7b 95 82 48 c7 c7 57 96 93 82 e8 9a bc f5 ff 0f 0b e9 2b 05 00 00 <48> 81 3f c0 ca 67 8a b8 00 00 00 00 41 0f 45 c0 89 04 24 e9 81 fe RSP: 0018:ffffc90001933828 EFLAGS: 00010002 RAX: 0000000000000001 RBX: 0000000000000001 RCX: 0000000000000000 RDX: 0000000000000000 RSI: 0000000000000000 RDI: 00000000000000d8 RBP: 0000000000000246 R08: 0000000000000001 R09: 0000000000000000 R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000 R13: 0000000000000000 R14: ffff888106e8a140 R15: 00000000000000d8 FS: 0000000000000000(0000) GS:ffff88813bd00000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00000000000000d8 CR3: 0000000106efa004 CR4: 00000000003706e0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 Call Trace: lock_acquire+0x31a/0x440 ? close_fd_get_file+0x39/0x160 ? __lock_acquire+0x647/0x18c0 _raw_spin_lock+0x2c/0x40 ? close_fd_get_file+0x39/0x160 close_fd_get_file+0x39/0x160 io_issue_sqe+0x1334/0x14e0 ? lock_acquire+0x31a/0x440 ? __io_free_req+0xcf/0x2e0 ? __io_free_req+0x175/0x2e0 ? find_held_lock+0x28/0xb0 ? io_wq_submit_work+0x7f/0x240 io_wq_submit_work+0x7f/0x240 io_wq_cancel_cb+0x161/0x580 ? io_wqe_wake_worker+0x114/0x360 ? io_uring_get_socket+0x40/0x40 io_async_find_and_cancel+0x3b/0x140 io_issue_sqe+0xbe1/0x14e0 ? __lock_acquire+0x647/0x18c0 ? __io_queue_sqe+0x10b/0x5f0 __io_queue_sqe+0x10b/0x5f0 ? io_req_prep+0xdb/0x1150 ? mark_held_locks+0x6d/0xb0 ? mark_held_locks+0x6d/0xb0 ? io_queue_sqe+0x235/0x4b0 io_queue_sqe+0x235/0x4b0 io_submit_sqes+0xd7e/0x12a0 ? _raw_spin_unlock_irq+0x24/0x30 ? io_sq_thread+0x3ae/0x940 io_sq_thread+0x207/0x940 ? do_wait_intr_irq+0xc0/0xc0 ? __ia32_sys_io_uring_enter+0x650/0x650 kthread+0x134/0x180 ? kthread_create_worker_on_cpu+0x90/0x90 ret_from_fork+0x1f/0x30
Fix this by moving the IO_WQ_WORK_NO_CANCEL until _after_ we've modified the fdtable. Canceling before this point is totally fine, and running it in the io-wq context _after_ that point is also fine.
For 5.12, we'll handle this internally and get rid of the no-cancel flag, as IORING_OP_CLOSE is the only user of it.
Cc: stable@vger.kernel.org Fixes: b5dba59e0cf7 ("io_uring: add support for IORING_OP_CLOSE") Reported-by: "Abaci abaci@linux.alibaba.com" Reviewed-and-tested-by: Joseph Qi joseph.qi@linux.alibaba.com Signed-off-by: Jens Axboe axboe@kernel.dk
Conflicts: fs/io_uring.c [commit 24c74678634b("io_uring: remove REQ_F_MUST_PUNT") is not merged] Signed-off-by: yangerkun yangerkun@huawei.com Reviewed-by: zhangyi (F) yi.zhang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- fs/io_uring.c | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-)
diff --git a/fs/io_uring.c b/fs/io_uring.c index 628b262c10c8..375c901be2cd 100644 --- a/fs/io_uring.c +++ b/fs/io_uring.c @@ -3413,7 +3413,6 @@ static int io_close_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe) * io_wq_work.flags, so initialize io_wq_work firstly. */ io_req_init_async(req); - req->work.flags |= IO_WQ_WORK_NO_CANCEL;
if (unlikely(req->ctx->flags & (IORING_SETUP_IOPOLL|IORING_SETUP_SQPOLL))) return -EINVAL; @@ -3446,6 +3445,8 @@ static int io_close(struct io_kiocb *req, bool force_nonblock)
/* if the file has a flush method, be safe and punt to async */ if (close->put_file->f_op->flush && force_nonblock) { + /* not safe to cancel at this point */ + req->work.flags |= IO_WQ_WORK_NO_CANCEL; /* avoid grabbing files - we don't need the files */ req->flags |= REQ_F_NO_FILE_TABLE | REQ_F_MUST_PUNT; return -EAGAIN;
From: Jens Axboe axboe@kernel.dk
mainline inclusion from remotes/origin/master commit 1c3b3e6527e57156bf4082f11c2151957560fe6a category: feature bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27 CVE: NA ---------------------------
syzbot reports a deadlock, attempting to lock the same spinlock twice:
============================================ WARNING: possible recursive locking detected 5.11.0-syzkaller #0 Not tainted -------------------------------------------- swapper/1/0 is trying to acquire lock: ffff88801b2b1130 (&runtime->sleep){..-.}-{2:2}, at: spin_lock include/linux/spinlock.h:354 [inline] ffff88801b2b1130 (&runtime->sleep){..-.}-{2:2}, at: io_poll_double_wake+0x25f/0x6a0 fs/io_uring.c:4960
but task is already holding lock: ffff88801b2b3130 (&runtime->sleep){..-.}-{2:2}, at: __wake_up_common_lock+0xb4/0x130 kernel/sched/wait.c:137
other info that might help us debug this:
 Possible unsafe locking scenario:

       CPU0
       ----
  lock(&runtime->sleep);
  lock(&runtime->sleep);

 *** DEADLOCK ***

 May be due to missing lock nesting notation
2 locks held by swapper/1/0: #0: ffff888147474908 (&group->lock){..-.}-{2:2}, at: _snd_pcm_stream_lock_irqsave+0x9f/0xd0 sound/core/pcm_native.c:170 #1: ffff88801b2b3130 (&runtime->sleep){..-.}-{2:2}, at: __wake_up_common_lock+0xb4/0x130 kernel/sched/wait.c:137
stack backtrace: CPU: 1 PID: 0 Comm: swapper/1 Not tainted 5.11.0-syzkaller #0 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011 Call Trace: <IRQ> __dump_stack lib/dump_stack.c:79 [inline] dump_stack+0xfa/0x151 lib/dump_stack.c:120 print_deadlock_bug kernel/locking/lockdep.c:2829 [inline] check_deadlock kernel/locking/lockdep.c:2872 [inline] validate_chain kernel/locking/lockdep.c:3661 [inline] __lock_acquire.cold+0x14c/0x3b4 kernel/locking/lockdep.c:4900 lock_acquire kernel/locking/lockdep.c:5510 [inline] lock_acquire+0x1ab/0x730 kernel/locking/lockdep.c:5475 __raw_spin_lock include/linux/spinlock_api_smp.h:142 [inline] _raw_spin_lock+0x2a/0x40 kernel/locking/spinlock.c:151 spin_lock include/linux/spinlock.h:354 [inline] io_poll_double_wake+0x25f/0x6a0 fs/io_uring.c:4960 __wake_up_common+0x147/0x650 kernel/sched/wait.c:108 __wake_up_common_lock+0xd0/0x130 kernel/sched/wait.c:138 snd_pcm_update_state+0x46a/0x540 sound/core/pcm_lib.c:203 snd_pcm_update_hw_ptr0+0xa75/0x1a50 sound/core/pcm_lib.c:464 snd_pcm_period_elapsed+0x160/0x250 sound/core/pcm_lib.c:1805 dummy_hrtimer_callback+0x94/0x1b0 sound/drivers/dummy.c:378 __run_hrtimer kernel/time/hrtimer.c:1519 [inline] __hrtimer_run_queues+0x609/0xe40 kernel/time/hrtimer.c:1583 hrtimer_run_softirq+0x17b/0x360 kernel/time/hrtimer.c:1600 __do_softirq+0x29b/0x9f6 kernel/softirq.c:345 invoke_softirq kernel/softirq.c:221 [inline] __irq_exit_rcu kernel/softirq.c:422 [inline] irq_exit_rcu+0x134/0x200 kernel/softirq.c:434 sysvec_apic_timer_interrupt+0x93/0xc0 arch/x86/kernel/apic/apic.c:1100 </IRQ> asm_sysvec_apic_timer_interrupt+0x12/0x20 arch/x86/include/asm/idtentry.h:632 RIP: 0010:native_save_fl arch/x86/include/asm/irqflags.h:29 [inline] RIP: 0010:arch_local_save_flags arch/x86/include/asm/irqflags.h:70 [inline] RIP: 0010:arch_irqs_disabled arch/x86/include/asm/irqflags.h:137 [inline] RIP: 0010:acpi_safe_halt drivers/acpi/processor_idle.c:111 [inline] RIP: 0010:acpi_idle_do_entry+0x1c9/0x250 drivers/acpi/processor_idle.c:516 Code: dd 38 6e f8 84 db 75 ac e8 54 32 6e f8 e8 0f 1c 74 f8 e9 0c 00 00 00 e8 45 32 6e f8 0f 00 2d 4e 4a c5 00 e8 39 32 6e f8 fb f4 <9c> 5b 81 e3 00 02 00 00 fa 31 ff 48 89 de e8 14 3a 6e f8 48 85 db RSP: 0018:ffffc90000d47d18 EFLAGS: 00000293 RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000000 RDX: ffff8880115c3780 RSI: ffffffff89052537 RDI: 0000000000000000 RBP: ffff888141127064 R08: 0000000000000001 R09: 0000000000000001 R10: ffffffff81794168 R11: 0000000000000000 R12: 0000000000000001 R13: ffff888141127000 R14: ffff888141127064 R15: ffff888143331804 acpi_idle_enter+0x361/0x500 drivers/acpi/processor_idle.c:647 cpuidle_enter_state+0x1b1/0xc80 drivers/cpuidle/cpuidle.c:237 cpuidle_enter+0x4a/0xa0 drivers/cpuidle/cpuidle.c:351 call_cpuidle kernel/sched/idle.c:158 [inline] cpuidle_idle_call kernel/sched/idle.c:239 [inline] do_idle+0x3e1/0x590 kernel/sched/idle.c:300 cpu_startup_entry+0x14/0x20 kernel/sched/idle.c:397 start_secondary+0x274/0x350 arch/x86/kernel/smpboot.c:272 secondary_startup_64_no_verify+0xb0/0xbb
This is due to the driver doing poll_wait() twice on the same wait_queue_head. That is perfectly valid, but from checking the rest of the kernel tree, it appears to be the only driver that does this.

We can handle this just fine; we simply need to ignore the second addition, as the first one is enough to get us woken.
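For illustration, here is the shape of a driver ->poll handler that registers the same wait_queue_head twice. This is a hedged, kernel-style sketch with a made-up demo_dev type, not the sound driver's actual code:

#include <linux/fs.h>
#include <linux/poll.h>
#include <linux/wait.h>

/* hypothetical device, for illustration only */
struct demo_dev {
	wait_queue_head_t wq;
	bool ready;
};

static __poll_t demo_poll(struct file *file, poll_table *wait)
{
	struct demo_dev *dev = file->private_data;

	/* both calls register on the same wait_queue_head; this is
	 * legal, and io_uring's poll handling must cope with it */
	poll_wait(file, &dev->wq, wait);
	poll_wait(file, &dev->wq, wait);

	return dev->ready ? (EPOLLIN | EPOLLRDNORM) : 0;
}

With the fix below, __io_queue_proc() simply returns when it sees poll->head == head instead of allocating a second poll entry for the same head.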
Cc: stable@vger.kernel.org # 5.8+ Fixes: 18bceab101ad ("io_uring: allow POLL_ADD with double poll_wait() users") Reported-by: syzbot+28abd693db9e92c160d8@syzkaller.appspotmail.com Signed-off-by: Jens Axboe axboe@kernel.dk Signed-off-by: yangerkun yangerkun@huawei.com Reviewed-by: zhangyi (F) yi.zhang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- fs/io_uring.c | 3 +++ 1 file changed, 3 insertions(+)
diff --git a/fs/io_uring.c b/fs/io_uring.c index 375c901be2cd..509ccdacab70 100644 --- a/fs/io_uring.c +++ b/fs/io_uring.c @@ -4285,6 +4285,9 @@ static void __io_queue_proc(struct io_poll_iocb *poll, struct io_poll_table *pt, pt->error = -EINVAL; return; } + /* double add on the same waitqueue head, ignore */ + if (poll->head == head) + return; poll = kmalloc(sizeof(*poll), GFP_ATOMIC); if (!poll) { pt->error = -ENOMEM;
From: Pavel Begunkov asml.silence@gmail.com
mainline inclusion from mainline-v5.12-rc5 commit d81269fecb8ce16eb07efafc9ff5520b2a31c486 category: feature bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27 CVE: NA ---------------------------
io_provide_buffers_prep()'s "p->len * p->nbufs" multiplication is prone to sign extension problems. It's not a huge problem, as the result is only used for access_ok() and the sign extension merely increases the checked length, but it's better to keep the typing right.
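A small standalone userspace demonstration of this class of problem (assuming a signed 32-bit length and a 16-bit count, mirroring the io_provide_buf fields; not kernel code):

#include <stdint.h>
#include <stdio.h>

int main(void)
{
	int32_t len = INT32_MAX;	/* hypothetical p->len */
	uint16_t nbufs = 2;		/* hypothetical p->nbufs */

	/* the 32-bit product wraps to -2 (on typical two's-complement
	 * targets), then sign-extends into a huge 64-bit length */
	int32_t wrapped = (int32_t)((uint32_t)len * nbufs);
	uint64_t bad = (uint64_t)wrapped;

	/* widening before the multiply, as the fix does, keeps the
	 * real value */
	uint64_t good = (uint64_t)len * nbufs;

	printf("bad=%#jx good=%#jx\n", (uintmax_t)bad, (uintmax_t)good);
	return 0;
}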
Reported-by: Colin Ian King colin.king@canonical.com Fixes: efe68c1ca8f49 ("io_uring: validate the full range of provided buffers for access") Signed-off-by: Pavel Begunkov asml.silence@gmail.com Reviewed-by: Colin Ian King colin.king@canonical.com Link: https://lore.kernel.org/r/562376a39509e260d8532186a06226e56eb1f594.161614923... Signed-off-by: Jens Axboe axboe@kernel.dk Signed-off-by: yangerkun yangerkun@huawei.com Reviewed-by: zhangyi (F) yi.zhang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- fs/io_uring.c | 4 +++- 1 file changed, 3 insertions(+), 1 deletion(-)
diff --git a/fs/io_uring.c b/fs/io_uring.c index 509ccdacab70..be440687fdb9 100644 --- a/fs/io_uring.c +++ b/fs/io_uring.c @@ -3159,6 +3159,7 @@ static int io_remove_buffers(struct io_kiocb *req, bool force_nonblock) static int io_provide_buffers_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe) { + unsigned long size; struct io_provide_buf *p = &req->pbuf; u64 tmp;
@@ -3172,7 +3173,8 @@ static int io_provide_buffers_prep(struct io_kiocb *req, p->addr = READ_ONCE(sqe->addr); p->len = READ_ONCE(sqe->len);
- if (!access_ok(u64_to_user_ptr(p->addr), (p->len * p->nbufs))) + size = (unsigned long)p->len * p->nbufs; + if (!access_ok(u64_to_user_ptr(p->addr), size)) return -EFAULT;
p->bgid = READ_ONCE(sqe->buf_group);
From: yangerkun yangerkun@huawei.com
hulk inclusion category: feature bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27 CVE: NA ---------------------------
Backporting io_uring adds new syscall numbers, which would extend the syscall table and change KABI (as with bpf_trace_run1). Avoid that by special-casing the new numbers in do_syscall_64(); a userspace sketch that exercises this path follows the diff.
Signed-off-by: yangerkun yangerkun@huawei.com Reviewed-by: zhangyi (F) yi.zhang@huawei.com Reviewed-by: Chen Zhou chenzhou10@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- arch/x86/entry/common.c | 7 +++++++ arch/x86/entry/syscalls/syscall_32.tbl | 3 --- arch/x86/entry/syscalls/syscall_64.tbl | 3 --- arch/x86/include/asm/syscall_wrapper.h | 3 +++ 4 files changed, 10 insertions(+), 6 deletions(-)
diff --git a/arch/x86/entry/common.c b/arch/x86/entry/common.c index 8353348ddeaf..0723098a3961 100644 --- a/arch/x86/entry/common.c +++ b/arch/x86/entry/common.c @@ -291,6 +291,13 @@ __visible void do_syscall_64(unsigned long nr, struct pt_regs *regs) if (likely(nr < NR_syscalls)) { nr = array_index_nospec(nr, NR_syscalls); regs->ax = sys_call_table[nr](regs); + } else { + if (nr == 425) + regs->ax = __x64_sys_io_uring_setup(regs); + else if (likely(nr == 426)) + regs->ax = __x64_sys_io_uring_enter(regs); + else if (nr == 427) + regs->ax = __x64_sys_io_uring_register(regs); }
syscall_return_slowpath(regs); diff --git a/arch/x86/entry/syscalls/syscall_32.tbl b/arch/x86/entry/syscalls/syscall_32.tbl index 2eefd2a7c1ce..3cf7b533b3d1 100644 --- a/arch/x86/entry/syscalls/syscall_32.tbl +++ b/arch/x86/entry/syscalls/syscall_32.tbl @@ -398,6 +398,3 @@ 384 i386 arch_prctl sys_arch_prctl __ia32_compat_sys_arch_prctl 385 i386 io_pgetevents sys_io_pgetevents __ia32_compat_sys_io_pgetevents 386 i386 rseq sys_rseq __ia32_sys_rseq -425 i386 io_uring_setup sys_io_uring_setup __ia32_sys_io_uring_setup -426 i386 io_uring_enter sys_io_uring_enter __ia32_sys_io_uring_enter -427 i386 io_uring_register sys_io_uring_register __ia32_sys_io_uring_register diff --git a/arch/x86/entry/syscalls/syscall_64.tbl b/arch/x86/entry/syscalls/syscall_64.tbl index 65c026185e61..f0b1709a5ffb 100644 --- a/arch/x86/entry/syscalls/syscall_64.tbl +++ b/arch/x86/entry/syscalls/syscall_64.tbl @@ -343,9 +343,6 @@ 332 common statx __x64_sys_statx 333 common io_pgetevents __x64_sys_io_pgetevents 334 common rseq __x64_sys_rseq -425 common io_uring_setup __x64_sys_io_uring_setup -426 common io_uring_enter __x64_sys_io_uring_enter -427 common io_uring_register __x64_sys_io_uring_register
# # x32-specific system call numbers start at 512 to avoid cache impact diff --git a/arch/x86/include/asm/syscall_wrapper.h b/arch/x86/include/asm/syscall_wrapper.h index 90eb70df0b18..46e125b2d08a 100644 --- a/arch/x86/include/asm/syscall_wrapper.h +++ b/arch/x86/include/asm/syscall_wrapper.h @@ -206,5 +206,8 @@ struct pt_regs; asmlinkage long __x64_sys_getcpu(const struct pt_regs *regs); asmlinkage long __x64_sys_gettimeofday(const struct pt_regs *regs); asmlinkage long __x64_sys_time(const struct pt_regs *regs); +asmlinkage long __x64_sys_io_uring_setup(const struct pt_regs *regs); +asmlinkage long __x64_sys_io_uring_enter(const struct pt_regs *regs); +asmlinkage long __x64_sys_io_uring_register(const struct pt_regs *regs);
#endif /* _ASM_X86_SYSCALL_WRAPPER_H */
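Since the numbers never enter the generated tables, a quick way to check the dispatch hack from userspace is to invoke io_uring_setup by raw syscall number. A minimal sketch, assuming this patched kernel (425 is __NR_io_uring_setup on x86-64):

#define _GNU_SOURCE
#include <linux/io_uring.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
	struct io_uring_params p;
	long fd;

	memset(&p, 0, sizeof(p));
	/* 425 == __NR_io_uring_setup; removed from the table above but
	 * still dispatched by the special case in do_syscall_64() */
	fd = syscall(425, 4, &p);
	if (fd < 0) {
		perror("io_uring_setup");
		return 1;
	}
	printf("ring fd %ld, sq entries %u\n", fd, p.sq_entries);
	return 0;
}

liburing invokes the syscalls the same way, as libc provides no wrappers for them.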
From: yangerkun yangerkun@huawei.com
hulk inclusion category: feature bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27 CVE: NA ---------------------------
Do the same for arm64 as was done for x86 above.
Signed-off-by: yangerkun yangerkun@huawei.com Reviewed-by: zhangyi (F) yi.zhang@huawei.com Reviewed-by: Chen Zhou chenzhou10@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- arch/arm64/include/asm/syscall_wrapper.h | 5 +++++ arch/arm64/kernel/syscall.c | 9 ++++++++- include/uapi/asm-generic/unistd.h | 8 +------- 3 files changed, 14 insertions(+), 8 deletions(-)
diff --git a/arch/arm64/include/asm/syscall_wrapper.h b/arch/arm64/include/asm/syscall_wrapper.h index 507d0ee6bc69..8523ac1281f9 100644 --- a/arch/arm64/include/asm/syscall_wrapper.h +++ b/arch/arm64/include/asm/syscall_wrapper.h @@ -77,4 +77,9 @@ #define SYS_NI(name) SYSCALL_ALIAS(__arm64_sys_##name, sys_ni_posix_timers); #endif
+struct pt_regs; +asmlinkage long __arm64_sys_io_uring_setup(const struct pt_regs *regs); +asmlinkage long __arm64_sys_io_uring_enter(const struct pt_regs *regs); +asmlinkage long __arm64_sys_io_uring_register(const struct pt_regs *regs); + #endif /* __ASM_SYSCALL_WRAPPER_H */ diff --git a/arch/arm64/kernel/syscall.c b/arch/arm64/kernel/syscall.c index cee2933bd6c1..e36ad39d8d14 100644 --- a/arch/arm64/kernel/syscall.c +++ b/arch/arm64/kernel/syscall.c @@ -47,7 +47,14 @@ static void invoke_syscall(struct pt_regs *regs, unsigned int scno, syscall_fn = syscall_table[array_index_nospec(scno, sc_nr)]; ret = __invoke_syscall(regs, syscall_fn); } else { - ret = do_ni_syscall(regs, scno); + if (scno == 425) + ret = __arm64_sys_io_uring_setup(regs); + else if (likely(scno == 426)) + ret = __arm64_sys_io_uring_enter(regs); + else if (scno == 427) + ret = __arm64_sys_io_uring_register(regs); + else + ret = do_ni_syscall(regs, scno); }
regs->regs[0] = ret; diff --git a/include/uapi/asm-generic/unistd.h b/include/uapi/asm-generic/unistd.h index 4c1ba6d0dac8..b538ed1be4eb 100644 --- a/include/uapi/asm-generic/unistd.h +++ b/include/uapi/asm-generic/unistd.h @@ -740,15 +740,9 @@ __SYSCALL(__NR_statx, sys_statx) __SC_COMP(__NR_io_pgetevents, sys_io_pgetevents, compat_sys_io_pgetevents) #define __NR_rseq 293 __SYSCALL(__NR_rseq, sys_rseq) -#define __NR_io_uring_setup 425 -__SYSCALL(__NR_io_uring_setup, sys_io_uring_setup) -#define __NR_io_uring_enter 426 -__SYSCALL(__NR_io_uring_enter, sys_io_uring_enter) -#define __NR_io_uring_register 427 -__SYSCALL(__NR_io_uring_register, sys_io_uring_register)
#undef __NR_syscalls -#define __NR_syscalls 428 +#define __NR_syscalls 294
/* * 32 bit systems traditionally used different
From: Jens Axboe axboe@kernel.dk
mainline inclusion from mainline-5.9-rc1 commit a1d7c393c4711a9ce6c239c3ab053a50dc96505a category: feature bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27 CVE: NA ---------------------------
A bit more surgery is required here, as completions are generally done through the kiocb->ki_complete() callback, even if they complete inline. This enables the regular read/write path to use the io_comp_state logic to batch inline completions.
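The io_comp_state idea boils down to parking completions and posting them in one flush instead of one at a time. A toy userspace sketch of the pattern (all names here are hypothetical stand-ins, not the kernel's):

#include <stdio.h>

#define BATCH 32

/* toy stand-in for io_comp_state: completions are parked in an
 * array and posted in one flush */
struct comp_state {
	int results[BATCH];
	int nr;
};

static void flush(struct comp_state *cs)
{
	/* in the kernel this would take completion_lock once and fill
	 * one CQE per parked request; here we just print */
	for (int i = 0; i < cs->nr; i++)
		printf("posting completion, res=%d\n", cs->results[i]);
	cs->nr = 0;
}

static void complete(struct comp_state *cs, int res)
{
	cs->results[cs->nr++] = res;
	if (cs->nr == BATCH)
		flush(cs);
}

int main(void)
{
	struct comp_state cs = { .nr = 0 };

	for (int res = 0; res < 5; res++)
		complete(&cs, res);
	flush(&cs);	/* final partial batch */
	return 0;
}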
Signed-off-by: Jens Axboe axboe@kernel.dk
Conflicts: fs/io_uring.c [include merge conflict 2237d76530eb]
Signed-off-by: yangerkun yangerkun@huawei.com Reviewed-by: zhangyi (F) yi.zhang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- fs/io_uring.c | 38 ++++++++++++++++++++++++-------------- 1 file changed, 24 insertions(+), 14 deletions(-)
diff --git a/fs/io_uring.c b/fs/io_uring.c index cdb068e6e642..81ddaafca443 100644 --- a/fs/io_uring.c +++ b/fs/io_uring.c @@ -912,7 +912,8 @@ static int __io_sqe_files_update(struct io_ring_ctx *ctx, struct io_uring_files_update *ip, unsigned nr_args); static int io_grab_files(struct io_kiocb *req); -static void io_complete_rw_common(struct kiocb *kiocb, long res); +static void io_complete_rw_common(struct kiocb *kiocb, long res, + struct io_comp_state *cs); static void io_cleanup_req(struct io_kiocb *req); static int io_file_get(struct io_submit_state *state, struct io_kiocb *req, int fd, struct file **out_file, bool fixed); @@ -1828,7 +1829,7 @@ static void io_iopoll_queue(struct list_head *again)
/* shouldn't happen unless io_uring is dying, cancel reqs */ if (unlikely(!current->mm)) { - io_complete_rw_common(&req->rw.kiocb, -EAGAIN); + io_complete_rw_common(&req->rw.kiocb, -EAGAIN, NULL); io_put_req(req); continue; } @@ -2047,7 +2048,8 @@ static inline void req_set_fail_links(struct io_kiocb *req) req->flags |= REQ_F_FAIL_LINK; }
-static void io_complete_rw_common(struct kiocb *kiocb, long res) +static void io_complete_rw_common(struct kiocb *kiocb, long res, + struct io_comp_state *cs) { struct io_kiocb *req = container_of(kiocb, struct io_kiocb, rw.kiocb); int cflags = 0; @@ -2059,15 +2061,20 @@ static void io_complete_rw_common(struct kiocb *kiocb, long res) req_set_fail_links(req); if (req->flags & REQ_F_BUFFER_SELECTED) cflags = io_put_kbuf(req); - io_cqring_add_event(req, res, cflags); + __io_req_complete(req, res, cflags, cs); +} + +static void __io_complete_rw(struct io_kiocb *req, long res, long res2, + struct io_comp_state *cs) +{ + io_complete_rw_common(&req->rw.kiocb, res, cs); }
static void io_complete_rw(struct kiocb *kiocb, long res, long res2) { struct io_kiocb *req = container_of(kiocb, struct io_kiocb, rw.kiocb);
- io_complete_rw_common(kiocb, res); - io_put_req(req); + __io_complete_rw(req, res, res2, NULL); }
static void io_complete_rw_iopoll(struct kiocb *kiocb, long res, long res2) @@ -2278,14 +2285,15 @@ static inline void io_rw_done(struct kiocb *kiocb, ssize_t ret) } }
-static void kiocb_done(struct kiocb *kiocb, ssize_t ret) +static void kiocb_done(struct kiocb *kiocb, ssize_t ret, + struct io_comp_state *cs) { struct io_kiocb *req = container_of(kiocb, struct io_kiocb, rw.kiocb);
if (req->flags & REQ_F_CUR_POS) req->file->f_pos = kiocb->ki_pos; if (ret >= 0 && kiocb->ki_complete == io_complete_rw) - io_complete_rw(kiocb, ret, 0); + __io_complete_rw(req, ret, 0, cs); else io_rw_done(kiocb, ret); } @@ -2709,7 +2717,8 @@ static int io_read_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe, return io_rw_prep_async(req, READ, force_nonblock); }
-static int io_read(struct io_kiocb *req, bool force_nonblock) +static int io_read(struct io_kiocb *req, bool force_nonblock, + struct io_comp_state *cs) { struct iovec inline_vecs[UIO_FASTIOV], *iovec = inline_vecs; struct kiocb *kiocb = &req->rw.kiocb; @@ -2755,7 +2764,7 @@ static int io_read(struct io_kiocb *req, bool force_nonblock) if ((req->ctx->flags & IORING_SETUP_IOPOLL) && ret2 == -EAGAIN) goto copy_iov; - kiocb_done(kiocb, ret2); + kiocb_done(kiocb, ret2, cs); } else { copy_iov: ret = io_setup_async_rw(req, io_size, iovec, @@ -2795,7 +2804,8 @@ static int io_write_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe, return io_rw_prep_async(req, WRITE, force_nonblock); }
-static int io_write(struct io_kiocb *req, bool force_nonblock) +static int io_write(struct io_kiocb *req, bool force_nonblock, + struct io_comp_state *cs) { struct iovec inline_vecs[UIO_FASTIOV], *iovec = inline_vecs; struct kiocb *kiocb = &req->rw.kiocb; @@ -2872,7 +2882,7 @@ static int io_write(struct io_kiocb *req, bool force_nonblock) if ((req->ctx->flags & IORING_SETUP_IOPOLL) && ret2 == -EAGAIN) goto copy_iov; - kiocb_done(kiocb, ret2); + kiocb_done(kiocb, ret2, cs); } else { copy_iov: ret = io_setup_async_rw(req, io_size, iovec, @@ -5330,7 +5340,7 @@ static int io_issue_sqe(struct io_kiocb *req, const struct io_uring_sqe *sqe, if (ret < 0) break; } - ret = io_read(req, force_nonblock); + ret = io_read(req, force_nonblock, cs); break; case IORING_OP_WRITEV: case IORING_OP_WRITE_FIXED: @@ -5340,7 +5350,7 @@ static int io_issue_sqe(struct io_kiocb *req, const struct io_uring_sqe *sqe, if (ret < 0) break; } - ret = io_write(req, force_nonblock); + ret = io_write(req, force_nonblock, cs); break; case IORING_OP_FSYNC: if (sqe) {
From: Pavel Begunkov asml.silence@gmail.com
mainline inclusion from mainline-5.9-rc1 commit 8c9cb6cd9a46ae6fb7cb6c39cf6a48a53440feef category: feature bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27 CVE: NA ---------------------------
Now that io_complete_rw_common() puts a ref, the extra io_put_req() in io_iopoll_queue() causes an underflow. Remove it.
[ 455.998620] refcount_t: underflow; use-after-free.
[ 455.998743] WARNING: CPU: 6 PID: 285394 at lib/refcount.c:28 refcount_warn_saturate+0xae/0xf0
[ 455.998772] CPU: 6 PID: 285394 Comm: read-write2 Tainted: G I E 5.8.0-rc2-00048-g1b1aa738f167-dirty #509
[ 455.998772] RIP: 0010:refcount_warn_saturate+0xae/0xf0
...
[ 455.998778] Call Trace:
[ 455.998778]  io_put_req+0x44/0x50
[ 455.998778]  io_iopoll_complete+0x245/0x370
[ 455.998779]  io_iopoll_getevents+0x12f/0x1a0
[ 455.998779]  io_iopoll_reap_events.part.0+0x5e/0xa0
[ 455.998780]  io_ring_ctx_wait_and_kill+0x132/0x1c0
[ 455.998780]  io_uring_release+0x20/0x30
[ 455.998780]  __fput+0xcd/0x230
[ 455.998781]  ____fput+0xe/0x10
[ 455.998781]  task_work_run+0x67/0xa0
[ 455.998781]  do_exit+0x35d/0xb70
[ 455.998782]  do_group_exit+0x43/0xa0
[ 455.998783]  get_signal+0x140/0x900
[ 455.998783]  do_signal+0x37/0x780
[ 455.998784]  __prepare_exit_to_usermode+0x126/0x1c0
[ 455.998785]  __syscall_return_slowpath+0x3b/0x1c0
[ 455.998785]  do_syscall_64+0x5f/0xa0
[ 455.998785]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
Fixes: a1d7c393c47 ("io_uring: enable READ/WRITE to use deferred completions") Signed-off-by: Pavel Begunkov asml.silence@gmail.com Signed-off-by: Jens Axboe axboe@kernel.dk Signed-off-by: yangerkun yangerkun@huawei.com Reviewed-by: zhangyi (F) yi.zhang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- fs/io_uring.c | 1 - 1 file changed, 1 deletion(-)
diff --git a/fs/io_uring.c b/fs/io_uring.c index 81ddaafca443..edb129bd316f 100644 --- a/fs/io_uring.c +++ b/fs/io_uring.c @@ -1830,7 +1830,6 @@ static void io_iopoll_queue(struct list_head *again) /* shouldn't happen unless io_uring is dying, cancel reqs */ if (unlikely(!current->mm)) { io_complete_rw_common(&req->rw.kiocb, -EAGAIN, NULL); - io_put_req(req); continue; }
From: Pavel Begunkov asml.silence@gmail.com
mainline inclusion from mainline-5.9-rc1 commit e6543a816edca00b6b4c48625d142059d7211059 category: feature bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27 CVE: NA ---------------------------
io_free_req_many() is used only for iopoll requests, i.e. reads/writes. Hence there is no need to batch inflight unhooking. For safety, it'll be done by io_dismantle_req(), which replaces __io_req_aux_free() and looks more solid and cleaner.
Signed-off-by: Pavel Begunkov asml.silence@gmail.com Signed-off-by: Jens Axboe axboe@kernel.dk
Conflicts: fs/io_uring.c [ecfc51777487 ("io_uring: fix potential use after free on fallback request free") include first]
Signed-off-by: yangerkun yangerkun@huawei.com Reviewed-by: zhangyi (F) yi.zhang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- fs/io_uring.c | 46 +++++++++++----------------------------------- 1 file changed, 11 insertions(+), 35 deletions(-)
diff --git a/fs/io_uring.c b/fs/io_uring.c index edb129bd316f..64d652c2d776 100644 --- a/fs/io_uring.c +++ b/fs/io_uring.c @@ -1475,7 +1475,7 @@ static inline void io_put_file(struct io_kiocb *req, struct file *file, fput(file); }
-static void __io_req_aux_free(struct io_kiocb *req) +static void io_dismantle_req(struct io_kiocb *req) { if (req->flags & REQ_F_NEED_CLEANUP) io_cleanup_req(req); @@ -1485,15 +1485,9 @@ static void __io_req_aux_free(struct io_kiocb *req) io_put_file(req, req->file, (req->flags & REQ_F_FIXED_FILE)); __io_put_req_task(req); io_req_work_drop_env(req); -} - -static void __io_free_req(struct io_kiocb *req) -{ - struct io_ring_ctx *ctx = req->ctx; - - __io_req_aux_free(req);
if (req->flags & REQ_F_INFLIGHT) { + struct io_ring_ctx *ctx = req->ctx; unsigned long flags;
spin_lock_irqsave(&ctx->inflight_lock, flags); @@ -1502,7 +1496,13 @@ static void __io_free_req(struct io_kiocb *req) wake_up(&ctx->inflight_wait); spin_unlock_irqrestore(&ctx->inflight_lock, flags); } +} + +static void __io_free_req(struct io_kiocb *req) +{ + struct io_ring_ctx *ctx = req->ctx;
+ io_dismantle_req(req); if (likely(!io_is_fallback_req(req))) kmem_cache_free(req_cachep, req); else @@ -1521,35 +1521,11 @@ static void io_free_req_many(struct io_ring_ctx *ctx, struct req_batch *rb) if (!rb->to_free) return; if (rb->need_iter) { - int i, inflight = 0; - unsigned long flags; + int i;
- for (i = 0; i < rb->to_free; i++) { - struct io_kiocb *req = rb->reqs[i]; - - if (req->flags & REQ_F_INFLIGHT) - inflight++; - __io_req_aux_free(req); - } - if (!inflight) - goto do_free; - - spin_lock_irqsave(&ctx->inflight_lock, flags); - for (i = 0; i < rb->to_free; i++) { - struct io_kiocb *req = rb->reqs[i]; - - if (req->flags & REQ_F_INFLIGHT) { - list_del(&req->inflight_entry); - if (!--inflight) - break; - } - } - spin_unlock_irqrestore(&ctx->inflight_lock, flags); - - if (waitqueue_active(&ctx->inflight_wait)) - wake_up(&ctx->inflight_wait); + for (i = 0; i < rb->to_free; i++) + io_dismantle_req(rb->reqs[i]); } -do_free: kmem_cache_free_bulk(req_cachep, rb->to_free, rb->reqs); percpu_ref_put_many(&ctx->refs, rb->to_free); rb->to_free = rb->need_iter = 0;
From: Pavel Begunkov asml.silence@gmail.com
mainline inclusion from mainline-5.9-rc1 commit 2757a23e7f6441eabf605ca59eeb88c34071757d category: feature bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27 CVE: NA ---------------------------
Every request in io_req_multi_free() has ->file set. Instead of pointlessly deferring and counting reqs with a file, dismantle each one in place and save it for batch deallocation.

This also saves us from potentially skipping io_cleanup_req(), put_task(), etc. That never happens in practice, though, because ->file is always there.
Signed-off-by: Pavel Begunkov asml.silence@gmail.com Signed-off-by: Jens Axboe axboe@kernel.dk Signed-off-by: yangerkun yangerkun@huawei.com Reviewed-by: zhangyi (F) yi.zhang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- fs/io_uring.c | 14 +++----------- 1 file changed, 3 insertions(+), 11 deletions(-)
diff --git a/fs/io_uring.c b/fs/io_uring.c index 64d652c2d776..75cbe22382c5 100644 --- a/fs/io_uring.c +++ b/fs/io_uring.c @@ -1513,22 +1513,16 @@ static void __io_free_req(struct io_kiocb *req) struct req_batch { void *reqs[IO_IOPOLL_BATCH]; int to_free; - int need_iter; };
static void io_free_req_many(struct io_ring_ctx *ctx, struct req_batch *rb) { if (!rb->to_free) return; - if (rb->need_iter) { - int i;
- for (i = 0; i < rb->to_free; i++) - io_dismantle_req(rb->reqs[i]); - } kmem_cache_free_bulk(req_cachep, rb->to_free, rb->reqs); percpu_ref_put_many(&ctx->refs, rb->to_free); - rb->to_free = rb->need_iter = 0; + rb->to_free = 0; }
static bool io_link_cancel_timeout(struct io_kiocb *req) @@ -1773,9 +1767,7 @@ static inline bool io_req_multi_free(struct req_batch *rb, struct io_kiocb *req) if ((req->flags & REQ_F_LINK_HEAD) || io_is_fallback_req(req)) return false;
- if (req->file || req->io) - rb->need_iter++; - + io_dismantle_req(req); rb->reqs[rb->to_free++] = req; if (unlikely(rb->to_free == ARRAY_SIZE(rb->reqs))) io_free_req_many(req->ctx, rb); @@ -1827,7 +1819,7 @@ static void io_iopoll_complete(struct io_ring_ctx *ctx, unsigned int *nr_events, /* order with ->result store in io_complete_rw_iopoll() */ smp_rmb();
- rb.to_free = rb.need_iter = 0; + rb.to_free = 0; while (!list_empty(done)) { int cflags = 0;
From: Pavel Begunkov asml.silence@gmail.com
mainline inclusion from mainline-5.9-rc1 commit c3524383333e4ff2f720ab0c02b3a329f72de78b category: feature bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27 CVE: NA ---------------------------
There is no reason not to batch deallocation of linked requests. Take away the next req first and handle it like everything else in io_req_multi_free().
Signed-off-by: Pavel Begunkov asml.silence@gmail.com Signed-off-by: Jens Axboe axboe@kernel.dk
Conflicts: fs/io_uring.c
Signed-off-by: yangerkun yangerkun@huawei.com Reviewed-by: zhangyi (F) yi.zhang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- fs/io_uring.c | 25 ++++++++++++++++--------- 1 file changed, 16 insertions(+), 9 deletions(-)
diff --git a/fs/io_uring.c b/fs/io_uring.c index 75cbe22382c5..571c57fdfd17 100644 --- a/fs/io_uring.c +++ b/fs/io_uring.c @@ -1660,17 +1660,22 @@ static void io_req_find_next(struct io_kiocb *req, struct io_kiocb **nxt) io_req_link_next(req, nxt); }
-static void io_free_req(struct io_kiocb *req) +static void io_queue_next(struct io_kiocb *req) { struct io_kiocb *nxt = NULL;
io_req_find_next(req, &nxt); - __io_free_req(req);
if (nxt) io_queue_async_work(nxt); }
+static void io_free_req(struct io_kiocb *req) +{ + io_queue_next(req); + __io_free_req(req); +} + /* * Drop reference to request, return next in chain (if there is one) if this * was the last reference to this request. @@ -1762,16 +1767,19 @@ static inline unsigned int io_sqring_entries(struct io_ring_ctx *ctx) return smp_load_acquire(&rings->sq.tail) - ctx->cached_sq_head; }
-static inline bool io_req_multi_free(struct req_batch *rb, struct io_kiocb *req) +static inline void io_req_multi_free(struct req_batch *rb, struct io_kiocb *req) { - if ((req->flags & REQ_F_LINK_HEAD) || io_is_fallback_req(req)) - return false; + if (unlikely(io_is_fallback_req(req))) { + io_free_req(req); + return; + } + if (req->flags & REQ_F_LINK_HEAD) + io_queue_next(req);
io_dismantle_req(req); rb->reqs[rb->to_free++] = req; if (unlikely(rb->to_free == ARRAY_SIZE(rb->reqs))) io_free_req_many(req->ctx, rb); - return true; }
static int io_put_kbuf(struct io_kiocb *req) @@ -1838,9 +1846,8 @@ static void io_iopoll_complete(struct io_ring_ctx *ctx, unsigned int *nr_events, __io_cqring_fill_event(req, req->result, cflags); (*nr_events)++;
- if (refcount_dec_and_test(&req->refs) && - !io_req_multi_free(&rb, req)) - io_free_req(req); + if (refcount_dec_and_test(&req->refs)) + io_req_multi_free(&rb, req); }
io_commit_cqring(ctx);
From: Pavel Begunkov asml.silence@gmail.com
mainline inclusion from mainline-5.9-rc1 commit 2d6500d44c1374808040d120e625a22b013c9f0d category: feature bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27 CVE: NA ---------------------------
Move all batch free bits close to each other and rename in a consistent way.
Signed-off-by: Pavel Begunkov asml.silence@gmail.com Signed-off-by: Jens Axboe axboe@kernel.dk
Conflicts: fs/io_uring.c [ecfc51777487 ("io_uring: fix potential use after free on fallback request free") include first]
Signed-off-by: yangerkun yangerkun@huawei.com Reviewed-by: zhangyi (F) yi.zhang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- fs/io_uring.c | 69 +++++++++++++++++++++++++++------------------------ 1 file changed, 37 insertions(+), 32 deletions(-)
diff --git a/fs/io_uring.c b/fs/io_uring.c index 571c57fdfd17..07be9ad70461 100644 --- a/fs/io_uring.c +++ b/fs/io_uring.c @@ -1510,21 +1510,6 @@ static void __io_free_req(struct io_kiocb *req) percpu_ref_put(&ctx->refs); }
-struct req_batch { - void *reqs[IO_IOPOLL_BATCH]; - int to_free; -}; - -static void io_free_req_many(struct io_ring_ctx *ctx, struct req_batch *rb) -{ - if (!rb->to_free) - return; - - kmem_cache_free_bulk(req_cachep, rb->to_free, rb->reqs); - percpu_ref_put_many(&ctx->refs, rb->to_free); - rb->to_free = 0; -} - static bool io_link_cancel_timeout(struct io_kiocb *req) { struct io_ring_ctx *ctx = req->ctx; @@ -1676,6 +1661,41 @@ static void io_free_req(struct io_kiocb *req) __io_free_req(req); }
+struct req_batch { + void *reqs[IO_IOPOLL_BATCH]; + int to_free; +}; + +static void __io_req_free_batch_flush(struct io_ring_ctx *ctx, + struct req_batch *rb) +{ + kmem_cache_free_bulk(req_cachep, rb->to_free, rb->reqs); + percpu_ref_put_many(&ctx->refs, rb->to_free); + rb->to_free = 0; +} + +static void io_req_free_batch_finish(struct io_ring_ctx *ctx, + struct req_batch *rb) +{ + if (rb->to_free) + __io_req_free_batch_flush(ctx, rb); +} + +static void io_req_free_batch(struct req_batch *rb, struct io_kiocb *req) +{ + if (unlikely(io_is_fallback_req(req))) { + io_free_req(req); + return; + } + if (req->flags & REQ_F_LINK_HEAD) + io_queue_next(req); + + io_dismantle_req(req); + rb->reqs[rb->to_free++] = req; + if (unlikely(rb->to_free == ARRAY_SIZE(rb->reqs))) + __io_req_free_batch_flush(req->ctx, rb); +} + /* * Drop reference to request, return next in chain (if there is one) if this * was the last reference to this request. @@ -1767,21 +1787,6 @@ static inline unsigned int io_sqring_entries(struct io_ring_ctx *ctx) return smp_load_acquire(&rings->sq.tail) - ctx->cached_sq_head; }
-static inline void io_req_multi_free(struct req_batch *rb, struct io_kiocb *req) -{ - if (unlikely(io_is_fallback_req(req))) { - io_free_req(req); - return; - } - if (req->flags & REQ_F_LINK_HEAD) - io_queue_next(req); - - io_dismantle_req(req); - rb->reqs[rb->to_free++] = req; - if (unlikely(rb->to_free == ARRAY_SIZE(rb->reqs))) - io_free_req_many(req->ctx, rb); -} - static int io_put_kbuf(struct io_kiocb *req) { struct io_buffer *kbuf; @@ -1847,13 +1852,13 @@ static void io_iopoll_complete(struct io_ring_ctx *ctx, unsigned int *nr_events, (*nr_events)++;
if (refcount_dec_and_test(&req->refs)) - io_req_multi_free(&rb, req); + io_req_free_batch(&rb, req); }
io_commit_cqring(ctx); if (ctx->flags & IORING_SETUP_SQPOLL) io_cqring_ev_posted(ctx); - io_free_req_many(ctx, &rb); + io_req_free_batch_finish(ctx, &rb);
if (!list_empty(&again)) io_iopoll_queue(&again);
From: Pavel Begunkov asml.silence@gmail.com
mainline inclusion from mainline-5.9-rc1 commit 6795c5aba247653f99d1f336ff496dd74659b322 category: feature bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27 CVE: NA ---------------------------
Set req->result to io_size early in io_{read,write}(); that's enough and makes the code more straightforward.
Signed-off-by: Pavel Begunkov asml.silence@gmail.com Signed-off-by: Jens Axboe axboe@kernel.dk
Conflicts: fs/io_uring.c
Signed-off-by: yangerkun yangerkun@huawei.com Reviewed-by: zhangyi (F) yi.zhang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- fs/io_uring.c | 9 ++------- 1 file changed, 2 insertions(+), 7 deletions(-)
diff --git a/fs/io_uring.c b/fs/io_uring.c index 07be9ad70461..fde956653f32 100644 --- a/fs/io_uring.c +++ b/fs/io_uring.c @@ -2229,7 +2229,6 @@ static int io_prep_rw(struct io_kiocb *req, const struct io_uring_sqe *sqe,
kiocb->ki_flags |= IOCB_HIPRI; kiocb->ki_complete = io_complete_rw_iopoll; - req->result = 0; req->iopoll_completed = 0; } else { if (kiocb->ki_flags & IOCB_HIPRI) @@ -2714,10 +2713,8 @@ static int io_read(struct io_kiocb *req, bool force_nonblock, if (!force_nonblock) kiocb->ki_flags &= ~IOCB_NOWAIT;
- req->result = 0; io_size = ret; - if (req->flags & REQ_F_LINK_HEAD) - req->result = io_size; + req->result = io_size;
/* * If the file doesn't support async, mark it as REQ_F_MUST_PUNT so @@ -2801,10 +2798,8 @@ static int io_write(struct io_kiocb *req, bool force_nonblock, if (!force_nonblock) req->rw.kiocb.ki_flags &= ~IOCB_NOWAIT;
- req->result = 0; io_size = ret; - if (req->flags & REQ_F_LINK_HEAD) - req->result = io_size; + req->result = io_size;
/* * If the file doesn't support async, mark it as REQ_F_MUST_PUNT so
From: Pavel Begunkov asml.silence@gmail.com
mainline inclusion from mainline-5.9-rc1 commit 3adfecaa647ff8afa4b6f5907193cf751a0f8351 category: feature bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27 CVE: NA ---------------------------
There are a lot of new users of task_work, and a task_work_add() may happen while we're doing I/O polling. Make iopoll call task_work_run() from time to time, so we don't spin polling for requests whose completion is sitting in unprocessed task_work.
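For context, this is the path an I/O-polled ring exercises. A hedged liburing sketch of such a ring from userspace (assumes a file on a device that supports polled I/O, opened with O_DIRECT, and a liburing recent enough to have io_uring_prep_read()); the wait below is where the kernel-side poll loop now also runs pending task_work every few iterations:

#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <liburing.h>

int main(int argc, char **argv)
{
	struct io_uring ring;
	struct io_uring_sqe *sqe;
	struct io_uring_cqe *cqe;
	void *buf;
	int fd;

	if (argc != 2)
		return 1;
	fd = open(argv[1], O_RDONLY | O_DIRECT);
	if (fd < 0 || posix_memalign(&buf, 4096, 4096))
		return 1;
	if (io_uring_queue_init(8, &ring, IORING_SETUP_IOPOLL))
		return 1;

	sqe = io_uring_get_sqe(&ring);
	io_uring_prep_read(sqe, fd, buf, 4096, 0);
	io_uring_submit(&ring);

	/* with IOPOLL there is no interrupt-driven completion: waiting
	 * actively polls the device until the request is done */
	if (io_uring_wait_cqe(&ring, &cqe) == 0) {
		printf("read: %d\n", cqe->res);
		io_uring_cqe_seen(&ring, cqe);
	}
	io_uring_queue_exit(&ring);
	return 0;
}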
Signed-off-by: Pavel Begunkov asml.silence@gmail.com Signed-off-by: Jens Axboe axboe@kernel.dk Signed-off-by: yangerkun yangerkun@huawei.com Reviewed-by: zhangyi (F) yi.zhang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- fs/io_uring.c | 2 ++ 1 file changed, 2 insertions(+)
diff --git a/fs/io_uring.c b/fs/io_uring.c index fde956653f32..108fb65aeb80 100644 --- a/fs/io_uring.c +++ b/fs/io_uring.c @@ -1994,6 +1994,8 @@ static int io_iopoll_check(struct io_ring_ctx *ctx, unsigned *nr_events, */ if (!(++iters & 7)) { mutex_unlock(&ctx->uring_lock); + if (current->task_works) + task_work_run(); mutex_lock(&ctx->uring_lock); }
From: Jens Axboe axboe@kernel.dk
mainline inclusion from mainline-5.9-rc1 commit c40f63790ec957e9449056fb78d8c2523eff96b5 category: feature bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27 CVE: NA ---------------------------
Currently links are always done in an async fashion, unless we catch them inline after we successfully complete a request without having to resort to blocking. This isn't necessarily the most efficient approach, it'd be more ideal if we could just use the task_work handling for this.
Outside of saving an async jump, we can also do less prep work for these kinds of requests.
Running dependent links from the task_work handler yields some nice performance benefits. As an example, examples/link-cp from the liburing repository uses read+write links to implement a copy operation. Without this patch, a cache cold 4G file read from a VM runs in about 3 seconds:
$ time examples/link-cp /data/file /dev/null
real	0m2.986s
user	0m0.051s
sys	0m2.843s
and a subsequent cache hot run looks like this:
$ time examples/link-cp /data/file /dev/null
real	0m0.898s
user	0m0.069s
sys	0m0.797s
With this patch in place, the cold case takes about 2.4 seconds:
$ time examples/link-cp /data/file /dev/null
real	0m2.400s
user	0m0.020s
sys	0m2.366s
and the cache hot case looks like this:
$ time examples/link-cp /data/file /dev/null
real	0m0.676s
user	0m0.010s
sys	0m0.665s
As expected, the (mostly) cache hot case yields the biggest improvement, running about 25% faster with this change, while the cache cold case yields about a 20% increase in performance. Outside of the performance increase, we're using less CPU as well, as we're not using the async offload threads at all for this anymore.
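For reference, the read+write link pattern that link-cp builds on, sketched with liburing. This is a single-chunk toy that ignores short reads, unlike the real examples/link-cp:

#include <fcntl.h>
#include <liburing.h>

int main(int argc, char **argv)
{
	struct io_uring ring;
	struct io_uring_sqe *sqe;
	struct io_uring_cqe *cqe;
	static char buf[4096];
	int infd, outfd, i;

	if (argc != 3)
		return 1;
	infd = open(argv[1], O_RDONLY);
	outfd = open(argv[2], O_WRONLY | O_CREAT | O_TRUNC, 0644);
	if (infd < 0 || outfd < 0 || io_uring_queue_init(8, &ring, 0) < 0)
		return 1;

	sqe = io_uring_get_sqe(&ring);
	io_uring_prep_read(sqe, infd, buf, sizeof(buf), 0);
	/* link: the write below only starts once this read completes */
	io_uring_sqe_set_flags(sqe, IOSQE_IO_LINK);

	sqe = io_uring_get_sqe(&ring);
	io_uring_prep_write(sqe, outfd, buf, sizeof(buf), 0);

	io_uring_submit(&ring);
	for (i = 0; i < 2; i++) {
		if (io_uring_wait_cqe(&ring, &cqe) < 0)
			return 1;
		io_uring_cqe_seen(&ring, cqe);
	}
	io_uring_queue_exit(&ring);
	return 0;
}

With this patch, the dependent write is issued from the task_work handler rather than bounced through the async offload threads.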
Signed-off-by: Jens Axboe axboe@kernel.dk
Conflicts: fs/io_uring.c [we backported 6d816e088c35 ("io_uring: hold 'ctx' reference around task_work queue + execute") early, so only the remaining changes of this patch need to be backported]
Signed-off-by: yangerkun yangerkun@huawei.com Reviewed-by: zhangyi (F) yi.zhang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- fs/io_uring.c | 285 +++++++++++++++++++++++++++++++------------------- 1 file changed, 177 insertions(+), 108 deletions(-)
diff --git a/fs/io_uring.c b/fs/io_uring.c index 108fb65aeb80..046017a5eeae 100644 --- a/fs/io_uring.c +++ b/fs/io_uring.c @@ -905,6 +905,7 @@ enum io_mem_account {
static void io_cqring_fill_event(struct io_kiocb *req, long res); static void io_put_req(struct io_kiocb *req); +static void io_double_put_req(struct io_kiocb *req); static void __io_double_put_req(struct io_kiocb *req); static struct io_kiocb *io_prep_linked_timeout(struct io_kiocb *req); static void io_queue_linked_timeout(struct io_kiocb *req); @@ -953,6 +954,100 @@ static void __io_put_req_task(struct io_kiocb *req) put_task_struct(req->task); }
+static void io_sq_thread_drop_mm_files(void) +{ + struct files_struct *files = current->files; + struct mm_struct *mm = current->mm; + + if (mm) { + unuse_mm(mm); + mmput(mm); + current->mm = NULL; + } + if (files) { + struct nsproxy *nsproxy = current->nsproxy; + + task_lock(current); + current->files = NULL; + current->nsproxy = NULL; + task_unlock(current); + put_files_struct(files); + put_nsproxy(nsproxy); + } +} + +static void __io_sq_thread_acquire_files(struct io_ring_ctx *ctx) +{ + if (!current->files) { + struct files_struct *files; + struct nsproxy *nsproxy; + + task_lock(ctx->sqo_task); + files = ctx->sqo_task->files; + if (!files) { + task_unlock(ctx->sqo_task); + return; + } + atomic_inc(&files->count); + get_nsproxy(ctx->sqo_task->nsproxy); + nsproxy = ctx->sqo_task->nsproxy; + task_unlock(ctx->sqo_task); + + task_lock(current); + current->files = files; + current->nsproxy = nsproxy; + task_unlock(current); + } +} + +static int __io_sq_thread_acquire_mm(struct io_ring_ctx *ctx) +{ + struct mm_struct *mm; + + if (current->mm) + return 0; + + /* Should never happen */ + if (unlikely(!(ctx->flags & IORING_SETUP_SQPOLL))) + return -EFAULT; + + task_lock(ctx->sqo_task); + mm = ctx->sqo_task->mm; + if (unlikely(!mm || !mmget_not_zero(mm))) + mm = NULL; + task_unlock(ctx->sqo_task); + + if (mm) { + use_mm(mm); + return 0; + } + + return -EFAULT; +} + +static int io_sq_thread_acquire_mm_files(struct io_ring_ctx *ctx, + struct io_kiocb *req) +{ + const struct io_op_def *def = &io_op_defs[req->opcode]; + + if (def->needs_mm) { + int ret = __io_sq_thread_acquire_mm(ctx); + if (unlikely(ret)) + return ret; + } + + if (def->needs_file || def->file_table) + __io_sq_thread_acquire_files(ctx); + + return 0; +} + +static inline void req_set_fail_links(struct io_kiocb *req) +{ + if ((req->flags & (REQ_F_LINK | REQ_F_HARDLINK)) == REQ_F_LINK) + req->flags |= REQ_F_FAIL_LINK; +} + static void io_file_put_work(struct work_struct *work);
/* @@ -1645,14 +1740,79 @@ static void io_req_find_next(struct io_kiocb *req, struct io_kiocb **nxt) io_req_link_next(req, nxt); }
+static void __io_req_task_cancel(struct io_kiocb *req, int error) +{ + struct io_ring_ctx *ctx = req->ctx; + + spin_lock_irq(&ctx->completion_lock); + io_cqring_fill_event(req, error); + io_commit_cqring(ctx); + spin_unlock_irq(&ctx->completion_lock); + + io_cqring_ev_posted(ctx); + req_set_fail_links(req); + io_double_put_req(req); +} + +static void io_req_task_cancel(struct callback_head *cb) +{ + struct io_kiocb *req = container_of(cb, struct io_kiocb, task_work); + + __io_req_task_cancel(req, -ECANCELED); +} + +static void __io_req_task_submit(struct io_kiocb *req) +{ + struct io_ring_ctx *ctx = req->ctx; + + __set_current_state(TASK_RUNNING); + if (!__io_sq_thread_acquire_mm(ctx)) { + mutex_lock(&ctx->uring_lock); + __io_queue_sqe(req, NULL, NULL); + mutex_unlock(&ctx->uring_lock); + } else { + __io_req_task_cancel(req, -EFAULT); + } +} + +static void io_req_task_submit(struct callback_head *cb) +{ + struct io_kiocb *req = container_of(cb, struct io_kiocb, task_work); + struct io_ring_ctx *ctx = req->ctx; + + __io_req_task_submit(req); + percpu_ref_put(&ctx->refs); +} + +static void io_req_task_queue(struct io_kiocb *req) +{ + struct task_struct *tsk = req->task; + int ret; + + init_task_work(&req->task_work, io_req_task_submit); + percpu_ref_get(&req->ctx->refs); + + ret = task_work_add(tsk, &req->task_work, true); + if (unlikely(ret)) { + init_task_work(&req->task_work, io_req_task_cancel); + tsk = io_wq_get_task(req->ctx->io_wq); + task_work_add(tsk, &req->task_work, true); + } + wake_up_process(tsk); +} + static void io_queue_next(struct io_kiocb *req) { struct io_kiocb *nxt = NULL;
io_req_find_next(req, &nxt);
- if (nxt) - io_queue_async_work(nxt); + if (nxt) { + if (nxt->flags & REQ_F_WORK_INITIALIZED) + io_queue_async_work(nxt); + else + io_req_task_queue(nxt); + } }
static void io_free_req(struct io_kiocb *req) @@ -2023,12 +2183,6 @@ static void kiocb_end_write(struct io_kiocb *req) file_end_write(req->file); }
-static inline void req_set_fail_links(struct io_kiocb *req) -{ - if ((req->flags & (REQ_F_LINK | REQ_F_HARDLINK)) == REQ_F_LINK) - req->flags |= REQ_F_FAIL_LINK; -} - static void io_complete_rw_common(struct kiocb *kiocb, long res, struct io_comp_state *cs) { @@ -4362,94 +4516,6 @@ static void io_async_queue_proc(struct file *file, struct wait_queue_head *head, __io_queue_proc(&apoll->poll, pt, head, &apoll->double_poll); }
-static void io_sq_thread_drop_mm_files(void) -{ - struct files_struct *files = current->files; - struct mm_struct *mm = current->mm; - - if (mm) { - unuse_mm(mm); - mmput(mm); - current->mm = NULL; - } - if (files) { - struct nsproxy *nsproxy = current->nsproxy; - - task_lock(current); - current->files = NULL; - current->nsproxy = NULL; - task_unlock(current); - put_files_struct(files); - put_nsproxy(nsproxy); - } -} - -static void __io_sq_thread_acquire_files(struct io_ring_ctx *ctx) -{ - if (!current->files) { - struct files_struct *files; - struct nsproxy *nsproxy; - - task_lock(ctx->sqo_task); - files = ctx->sqo_task->files; - if (!files) { - task_unlock(ctx->sqo_task); - return; - } - atomic_inc(&files->count); - get_nsproxy(ctx->sqo_task->nsproxy); - nsproxy = ctx->sqo_task->nsproxy; - task_unlock(ctx->sqo_task); - - task_lock(current); - current->files = files; - current->nsproxy = nsproxy; - task_unlock(current); - } -} - -static int __io_sq_thread_acquire_mm(struct io_ring_ctx *ctx) -{ - struct mm_struct *mm; - - if (current->mm) - return 0; - - /* Should never happen */ - if (unlikely(!(ctx->flags & IORING_SETUP_SQPOLL))) - return -EFAULT; - - task_lock(ctx->sqo_task); - mm = ctx->sqo_task->mm; - if (unlikely(!mm || !mmget_not_zero(mm))) - mm = NULL; - task_unlock(ctx->sqo_task); - - if (mm) { - use_mm(mm); - return 0; - } - - return -EFAULT; -} - -static int io_sq_thread_acquire_mm_files(struct io_ring_ctx *ctx, - struct io_kiocb *req) -{ - const struct io_op_def *def = &io_op_defs[req->opcode]; - - if (def->needs_mm) { - int ret = __io_sq_thread_acquire_mm(ctx); - if (unlikely(ret)) - return ret; - } - - if (def->needs_file || def->file_table) - __io_sq_thread_acquire_files(ctx); - - return 0; -} - static void io_async_task_func(struct callback_head *cb) { struct io_kiocb *req = container_of(cb, struct io_kiocb, task_work); @@ -5112,22 +5178,24 @@ static int io_files_update(struct io_kiocb *req, bool force_nonblock, }
static int io_req_defer_prep(struct io_kiocb *req, - const struct io_uring_sqe *sqe) + const struct io_uring_sqe *sqe, bool for_async) { ssize_t ret = 0;
if (!sqe) return 0;
- io_req_init_async(req); + if (for_async || (req->flags & REQ_F_WORK_INITIALIZED)) { + io_req_init_async(req);
- if (io_op_defs[req->opcode].file_table) { - ret = io_grab_files(req); - if (unlikely(ret)) - return ret; - } + if (io_op_defs[req->opcode].file_table) { + ret = io_grab_files(req); + if (unlikely(ret)) + return ret; + }
- io_req_work_grab_env(req, &io_op_defs[req->opcode]); + io_req_work_grab_env(req, &io_op_defs[req->opcode]); + }
switch (req->opcode) { case IORING_OP_NOP: @@ -5238,7 +5306,7 @@ static int io_req_defer(struct io_kiocb *req, const struct io_uring_sqe *sqe) if (!req->io) { if (io_alloc_async_ctx(req)) return -EAGAIN; - ret = io_req_defer_prep(req, sqe); + ret = io_req_defer_prep(req, sqe, true); if (ret < 0) return ret; } @@ -5841,7 +5909,7 @@ static void io_queue_sqe(struct io_kiocb *req, const struct io_uring_sqe *sqe, ret = -EAGAIN; if (io_alloc_async_ctx(req)) goto fail_req; - ret = io_req_defer_prep(req, sqe); + ret = io_req_defer_prep(req, sqe, true); if (unlikely(ret < 0)) goto fail_req; } @@ -5898,13 +5966,14 @@ static int io_submit_sqe(struct io_kiocb *req, const struct io_uring_sqe *sqe, if (io_alloc_async_ctx(req)) return -EAGAIN;
- ret = io_req_defer_prep(req, sqe); + ret = io_req_defer_prep(req, sqe, false); if (ret) { /* fail even hard links since we don't submit */ head->flags |= REQ_F_FAIL_LINK; return ret; } trace_io_uring_link(ctx, req, head); + io_get_req_task(req); list_add_tail(&req->link_list, &head->link_list);
/* last request of a link, enqueue the link */ @@ -5924,7 +5993,7 @@ static int io_submit_sqe(struct io_kiocb *req, const struct io_uring_sqe *sqe, if (io_alloc_async_ctx(req)) return -EAGAIN;
- ret = io_req_defer_prep(req, sqe); + ret = io_req_defer_prep(req, sqe, true); if (ret) req->flags |= REQ_F_FAIL_LINK; *link = req;
From: Pavel Begunkov asml.silence@gmail.com
mainline inclusion from mainline-5.9-rc1 commit ea1164e574e9af0a15ab730ead0861a4c7724142 category: feature bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27 CVE: NA ---------------------------
io_poll_task_func() hand-coded link submission, forgetting to set TASK_RUNNING, acquire the mm, etc. Call the existing helper for that instead, i.e. __io_req_task_submit().
Signed-off-by: Pavel Begunkov asml.silence@gmail.com Signed-off-by: Jens Axboe axboe@kernel.dk
Conflicts: fs/io_uring.c [6d816e088c35 ("io_uring: hold 'ctx' reference around task_work queue + execute") include first]
Signed-off-by: yangerkun yangerkun@huawei.com Reviewed-by: zhangyi (F) yi.zhang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- fs/io_uring.c | 8 ++------ 1 file changed, 2 insertions(+), 6 deletions(-)
diff --git a/fs/io_uring.c b/fs/io_uring.c index 046017a5eeae..af1a4dc6c9c8 100644 --- a/fs/io_uring.c +++ b/fs/io_uring.c @@ -4412,13 +4412,9 @@ static void io_poll_task_func(struct callback_head *cb) struct io_kiocb *nxt = NULL;
io_poll_task_handler(req, &nxt); - if (nxt) { - struct io_ring_ctx *ctx = nxt->ctx; + if (nxt) + __io_req_task_submit(nxt);
- mutex_lock(&ctx->uring_lock); - __io_queue_sqe(nxt, NULL, NULL); - mutex_unlock(&ctx->uring_lock); - } percpu_ref_put(&ctx->refs); }
From: Pavel Begunkov asml.silence@gmail.com
mainline inclusion from mainline-5.9-rc1 commit 8ef77766ba8694968ed4ba24311b4bacee14f235 category: feature bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27 CVE: NA ---------------------------
req->work and req->task_work are in a union, so io_req_task_queue() clobbers everything that was in ->work. De-union them for now.
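A toy userspace illustration of the aliasing (stand-in fields, not the real io_wq_work/callback_head layouts):

#include <stdio.h>

/* toy: writing one union member tramples state kept in the other,
 * which is exactly what queueing task_work did to req->work */
struct req {
	union {
		struct { long fn; long data; } work;	/* stand-in for io_wq_work */
		struct { long cb; } task_work;		/* stand-in for callback_head */
	};
};

int main(void)
{
	struct req r = { .work = { .fn = 0x1234, .data = 0x5678 } };

	r.task_work.cb = 0;	/* "queueing task_work" overwrites work.fn */
	printf("work.fn is now %#lx\n", r.work.fn);
	return 0;
}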
[ 704.367253] BUG: unable to handle page fault for address: ffffffffaf7330d0
[ 704.367256] #PF: supervisor write access in kernel mode
[ 704.367256] #PF: error_code(0x0003) - permissions violation
[ 704.367261] CPU: 6 PID: 1654 Comm: io_wqe_worker-0 Tainted: G I 5.8.0-rc2-00038-ge28d0bdc4863-dirty #498
[ 704.367265] RIP: 0010:_raw_spin_lock+0x1e/0x36
...
[ 704.367276]  __alloc_fd+0x35/0x150
[ 704.367279]  __get_unused_fd_flags+0x25/0x30
[ 704.367280]  io_openat2+0xcb/0x1b0
[ 704.367283]  io_issue_sqe+0x36a/0x1320
[ 704.367294]  io_wq_submit_work+0x58/0x160
[ 704.367295]  io_worker_handle_work+0x2a3/0x430
[ 704.367296]  io_wqe_worker+0x2a0/0x350
[ 704.367301]  kthread+0x136/0x180
[ 704.367304]  ret_from_fork+0x22/0x30
Signed-off-by: Pavel Begunkov asml.silence@gmail.com Signed-off-by: Jens Axboe axboe@kernel.dk
Conflicts: fs/io_uring.c [ac8691c415e0 ("io_uring: always plug for any number of IOs") not include]
Signed-off-by: yangerkun yangerkun@huawei.com Reviewed-by: zhangyi (F) yi.zhang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- fs/io_uring.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/fs/io_uring.c b/fs/io_uring.c index 14dcba1da6be..f9dcdb0c9f7a 100644 --- a/fs/io_uring.c +++ b/fs/io_uring.c @@ -679,12 +679,12 @@ struct io_kiocb { * restore the work, if needed. */ struct { - struct callback_head task_work; struct hlist_node hash_node; struct async_poll *apoll; }; struct io_wq_work work; }; + struct callback_head task_work; };
#define IO_PLUG_THRESHOLD 2
From: Pavel Begunkov asml.silence@gmail.com
mainline inclusion from mainline-5.9-rc1 commit 1bcb8c5d65a845e0ecb9e82237c399b29b8d15ea category: feature bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27 CVE: NA ---------------------------
io_steal_work() can't be sure that @nxt has req->work properly set, so we can't pass it to io-wq as is.
A quick-and-dirty fix: drag it through io_req_task_queue(), and always return NULL from io_steal_work().
e.g.
[ 50.770161] BUG: kernel NULL pointer dereference, address: 00000000
[ 50.770164] #PF: supervisor write access in kernel mode
[ 50.770164] #PF: error_code(0x0002) - not-present page
[ 50.770168] CPU: 1 PID: 1448 Comm: io_wqe_worker-0 Tainted: G I 5.8.0-rc2-00035-g2237d76530eb-dirty #494
[ 50.770172] RIP: 0010:override_creds+0x19/0x30
...
[ 50.770183]  io_worker_handle_work+0x25c/0x430
[ 50.770185]  io_wqe_worker+0x2a0/0x350
[ 50.770190]  kthread+0x136/0x180
[ 50.770194]  ret_from_fork+0x22/0x30
Signed-off-by: Pavel Begunkov asml.silence@gmail.com Signed-off-by: Jens Axboe axboe@kernel.dk Signed-off-by: yangerkun yangerkun@huawei.com Reviewed-by: zhangyi (F) yi.zhang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- fs/io_uring.c | 15 ++++++++++----- 1 file changed, 10 insertions(+), 5 deletions(-)
diff --git a/fs/io_uring.c b/fs/io_uring.c index e38a48c71a72..2535ac88d97a 100644 --- a/fs/io_uring.c +++ b/fs/io_uring.c @@ -1874,7 +1874,7 @@ static void io_put_req(struct io_kiocb *req)
static struct io_wq_work *io_steal_work(struct io_kiocb *req) { - struct io_kiocb *link, *nxt = NULL; + struct io_kiocb *nxt = NULL;
/* * A ref is owned by io-wq in which context we're. So, if that's the @@ -1891,10 +1891,15 @@ static struct io_wq_work *io_steal_work(struct io_kiocb *req) if ((nxt->flags & REQ_F_ISREG) && io_op_defs[nxt->opcode].hash_reg_file) io_wq_hash_work(&nxt->work, file_inode(nxt->file));
- link = io_prep_linked_timeout(nxt); - if (link) - nxt->flags |= REQ_F_QUEUE_TIMEOUT; - return &nxt->work; + io_req_task_queue(nxt); + /* + * If we're going to return actual work, here should be timeout prep: + * + * link = io_prep_linked_timeout(nxt); + * if (link) + * nxt->flags |= REQ_F_QUEUE_TIMEOUT; + */ + return NULL; }
/*
From: Pavel Begunkov asml.silence@gmail.com
mainline inclusion from mainline-5.9-rc1 commit a6d45dd0d43e6d1275e002704540688b6768bc22 category: feature bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27 CVE: NA ---------------------------
No reason to mark the head of a link as for-async in io_req_defer_prep(): grab_env() and friends will be done later during submission, if necessary.

Passing for_async=false saves an extra grab_env() in many cases.
Signed-off-by: Pavel Begunkov asml.silence@gmail.com Signed-off-by: Jens Axboe axboe@kernel.dk Signed-off-by: yangerkun yangerkun@huawei.com Reviewed-by: zhangyi (F) yi.zhang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- fs/io_uring.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/fs/io_uring.c b/fs/io_uring.c index 2535ac88d97a..c7cda1284fe0 100644 --- a/fs/io_uring.c +++ b/fs/io_uring.c @@ -5971,7 +5971,7 @@ static int io_submit_sqe(struct io_kiocb *req, const struct io_uring_sqe *sqe, if (io_alloc_async_ctx(req)) return -EAGAIN;
- ret = io_req_defer_prep(req, sqe, true); + ret = io_req_defer_prep(req, sqe, false); if (ret) req->flags |= REQ_F_FAIL_LINK; *link = req;
From: Pavel Begunkov asml.silence@gmail.com
mainline inclusion from mainline-5.9-rc1 commit a1a4661691c5f1a3af4c04f56ad68e2d1dbee3af category: feature bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27 CVE: NA ---------------------------
REQ_F_TIMEOUT is now set but never used; kill it.
Signed-off-by: Pavel Begunkov asml.silence@gmail.com Signed-off-by: Jens Axboe axboe@kernel.dk Signed-off-by: yangerkun yangerkun@huawei.com Reviewed-by: zhangyi (F) yi.zhang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- fs/io_uring.c | 4 ---- 1 file changed, 4 deletions(-)
diff --git a/fs/io_uring.c b/fs/io_uring.c index 289423385746..c744c45088cd 100644 --- a/fs/io_uring.c +++ b/fs/io_uring.c @@ -541,7 +541,6 @@ enum { REQ_F_CUR_POS_BIT, REQ_F_NOWAIT_BIT, REQ_F_LINK_TIMEOUT_BIT, - REQ_F_TIMEOUT_BIT, REQ_F_ISREG_BIT, REQ_F_MUST_PUNT_BIT, REQ_F_TIMEOUT_NOSEQ_BIT, @@ -585,8 +584,6 @@ enum { REQ_F_NOWAIT = BIT(REQ_F_NOWAIT_BIT), /* has linked timeout */ REQ_F_LINK_TIMEOUT = BIT(REQ_F_LINK_TIMEOUT_BIT), - /* timeout request */ - REQ_F_TIMEOUT = BIT(REQ_F_TIMEOUT_BIT), /* regular file */ REQ_F_ISREG = BIT(REQ_F_ISREG_BIT), /* must be punted even for NONBLOCK */ @@ -4977,7 +4974,6 @@ static int io_timeout_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe,
data = &req->io->timeout; data->req = req; - req->flags |= REQ_F_TIMEOUT;
if (get_timespec64(&data->ts, u64_to_user_ptr(sqe->addr))) return -EFAULT;
From: Pavel Begunkov asml.silence@gmail.com
mainline inclusion from mainline-5.9-rc1 commit 8eb7e2d00763367f345ef0b2a2eb4f8001ae40ce category: feature bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27 CVE: NA ---------------------------
There are too many useless flags; kill REQ_F_TIMEOUT_NOSEQ, which can easily be inferred from req.timeout itself.
Signed-off-by: Pavel Begunkov asml.silence@gmail.com Signed-off-by: Jens Axboe axboe@kernel.dk Signed-off-by: yangerkun yangerkun@huawei.com Reviewed-by: zhangyi (F) yi.zhang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- fs/io_uring.c | 15 ++++++++------- 1 file changed, 8 insertions(+), 7 deletions(-)
diff --git a/fs/io_uring.c b/fs/io_uring.c index c744c45088cd..254343e64aba 100644 --- a/fs/io_uring.c +++ b/fs/io_uring.c @@ -543,7 +543,6 @@ enum { REQ_F_LINK_TIMEOUT_BIT, REQ_F_ISREG_BIT, REQ_F_MUST_PUNT_BIT, - REQ_F_TIMEOUT_NOSEQ_BIT, REQ_F_COMP_LOCKED_BIT, REQ_F_NEED_CLEANUP_BIT, REQ_F_OVERFLOW_BIT, @@ -588,8 +587,6 @@ enum { REQ_F_ISREG = BIT(REQ_F_ISREG_BIT), /* must be punted even for NONBLOCK */ REQ_F_MUST_PUNT = BIT(REQ_F_MUST_PUNT_BIT), - /* no timeout sequence */ - REQ_F_TIMEOUT_NOSEQ = BIT(REQ_F_TIMEOUT_NOSEQ_BIT), /* completion under lock */ REQ_F_COMP_LOCKED = BIT(REQ_F_COMP_LOCKED_BIT), /* needs cleanup */ @@ -1072,6 +1069,11 @@ static void io_ring_ctx_ref_free(struct percpu_ref *ref) complete(&ctx->ref_comp); }
+static inline bool io_is_timeout_noseq(struct io_kiocb *req) +{ + return !req->timeout.off; +} + static struct io_ring_ctx *io_ring_ctx_alloc(struct io_uring_params *p) { struct io_ring_ctx *ctx; @@ -1284,7 +1286,7 @@ static void io_flush_timeouts(struct io_ring_ctx *ctx) struct io_kiocb *req = list_first_entry(&ctx->timeout_list, struct io_kiocb, list);
- if (req->flags & REQ_F_TIMEOUT_NOSEQ) + if (io_is_timeout_noseq(req)) break; if (req->timeout.target_seq != ctx->cached_cq_tail - atomic_read(&ctx->cq_timeouts)) @@ -5001,8 +5003,7 @@ static int io_timeout(struct io_kiocb *req) * timeout event to be satisfied. If it isn't set, then this is * a pure timeout request, sequence isn't used. */ - if (!off) { - req->flags |= REQ_F_TIMEOUT_NOSEQ; + if (io_is_timeout_noseq(req)) { entry = ctx->timeout_list.prev; goto add; } @@ -5017,7 +5018,7 @@ static int io_timeout(struct io_kiocb *req) list_for_each_prev(entry, &ctx->timeout_list) { struct io_kiocb *nxt = list_entry(entry, struct io_kiocb, list);
- if (nxt->flags & REQ_F_TIMEOUT_NOSEQ) + if (io_is_timeout_noseq(nxt)) continue; /* nxt.seq is behind @tail, otherwise would've been completed */ if (off >= nxt->timeout.target_seq - tail)
From: Pavel Begunkov asml.silence@gmail.com
mainline inclusion from mainline-5.9-rc1 commit 540e32a0855e700affa29b1112bf2dbb1fa7702a category: feature bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27 CVE: NA ---------------------------
io_uring supports both polling and I/O polling. Rename ctx->poll_list to ctx->iopoll_list to clearly show that it's only used in the I/O-poll case.
Signed-off-by: Pavel Begunkov asml.silence@gmail.com Signed-off-by: Jens Axboe axboe@kernel.dk Signed-off-by: yangerkun yangerkun@huawei.com Reviewed-by: zhangyi (F) yi.zhang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- fs/io_uring.c | 36 ++++++++++++++++++------------------ 1 file changed, 18 insertions(+), 18 deletions(-)
diff --git a/fs/io_uring.c b/fs/io_uring.c index 28fedc96b17d..b38da6025c97 100644 --- a/fs/io_uring.c +++ b/fs/io_uring.c @@ -329,12 +329,12 @@ struct io_ring_ctx { spinlock_t completion_lock;
/* - * ->poll_list is protected by the ctx->uring_lock for + * ->iopoll_list is protected by the ctx->uring_lock for * io_uring instances that don't use IORING_SETUP_SQPOLL. * For SQPOLL, only the single threaded io_sq_thread() will * manipulate the list, hence no extra locking is needed there. */ - struct list_head poll_list; + struct list_head iopoll_list; struct hlist_head *cancel_hash; unsigned cancel_hash_bits; bool poll_multi_file; @@ -1123,7 +1123,7 @@ static struct io_ring_ctx *io_ring_ctx_alloc(struct io_uring_params *p) mutex_init(&ctx->uring_lock); init_waitqueue_head(&ctx->wait); spin_lock_init(&ctx->completion_lock); - INIT_LIST_HEAD(&ctx->poll_list); + INIT_LIST_HEAD(&ctx->iopoll_list); INIT_LIST_HEAD(&ctx->defer_list); INIT_LIST_HEAD(&ctx->timeout_list); init_waitqueue_head(&ctx->inflight_wait); @@ -2085,7 +2085,7 @@ static int io_do_iopoll(struct io_ring_ctx *ctx, unsigned int *nr_events, spin = !ctx->poll_multi_file && *nr_events < min;
ret = 0; - list_for_each_entry_safe(req, tmp, &ctx->poll_list, list) { + list_for_each_entry_safe(req, tmp, &ctx->iopoll_list, list) { struct kiocb *kiocb = &req->rw.kiocb;
/* @@ -2127,7 +2127,7 @@ static int io_do_iopoll(struct io_ring_ctx *ctx, unsigned int *nr_events, static int io_iopoll_getevents(struct io_ring_ctx *ctx, unsigned int *nr_events, long min) { - while (!list_empty(&ctx->poll_list) && !need_resched()) { + while (!list_empty(&ctx->iopoll_list) && !need_resched()) { int ret;
ret = io_do_iopoll(ctx, nr_events, min); @@ -2150,7 +2150,7 @@ static void io_iopoll_try_reap_events(struct io_ring_ctx *ctx) return;
mutex_lock(&ctx->uring_lock); - while (!list_empty(&ctx->poll_list)) { + while (!list_empty(&ctx->iopoll_list)) { unsigned int nr_events = 0;
io_do_iopoll(ctx, &nr_events, 0); @@ -2292,12 +2292,12 @@ static void io_iopoll_req_issued(struct io_kiocb *req) * how we do polling eventually, not spinning if we're on potentially * different devices. */ - if (list_empty(&ctx->poll_list)) { + if (list_empty(&ctx->iopoll_list)) { ctx->poll_multi_file = false; } else if (!ctx->poll_multi_file) { struct io_kiocb *list_req;
- list_req = list_first_entry(&ctx->poll_list, struct io_kiocb, + list_req = list_first_entry(&ctx->iopoll_list, struct io_kiocb, list); if (list_req->file != req->file) ctx->poll_multi_file = true; @@ -2308,9 +2308,9 @@ static void io_iopoll_req_issued(struct io_kiocb *req) * it to the front so we find it first. */ if (READ_ONCE(req->iopoll_completed)) - list_add(&req->list, &ctx->poll_list); + list_add(&req->list, &ctx->iopoll_list); else - list_add_tail(&req->list, &ctx->poll_list); + list_add_tail(&req->list, &ctx->iopoll_list);
if ((ctx->flags & IORING_SETUP_SQPOLL) && wq_has_sleeper(&ctx->sqo_wait)) @@ -6241,11 +6241,11 @@ static int io_sq_thread(void *data) while (!kthread_should_park()) { unsigned int to_submit;
- if (!list_empty(&ctx->poll_list)) { + if (!list_empty(&ctx->iopoll_list)) { unsigned nr_events = 0;
mutex_lock(&ctx->uring_lock); - if (!list_empty(&ctx->poll_list) && !need_resched()) + if (!list_empty(&ctx->iopoll_list) && !need_resched()) io_do_iopoll(ctx, &nr_events, 0); else timeout = jiffies + ctx->sq_thread_idle; @@ -6274,7 +6274,7 @@ static int io_sq_thread(void *data) * more IO, we should wait for the application to * reap events and wake us up. */ - if (!list_empty(&ctx->poll_list) || need_resched() || + if (!list_empty(&ctx->iopoll_list) || need_resched() || (!time_after(jiffies, timeout) && ret != -EBUSY && !percpu_ref_is_dying(&ctx->refs))) { io_run_task_work(); @@ -6287,13 +6287,13 @@ static int io_sq_thread(void *data)
/* * While doing polled IO, before going to sleep, we need - * to check if there are new reqs added to poll_list, it - * is because reqs may have been punted to io worker and - * will be added to poll_list later, hence check the - * poll_list again. + * to check if there are new reqs added to iopoll_list, + * it is because reqs may have been punted to io worker + * and will be added to iopoll_list later, hence check + * the iopoll_list again. */ if ((ctx->flags & IORING_SETUP_IOPOLL) && - !list_empty_careful(&ctx->poll_list)) { + !list_empty_careful(&ctx->iopoll_list)) { finish_wait(&ctx->sqo_wait, &wait); continue; }
From: Pavel Begunkov asml.silence@gmail.com
mainline inclusion from mainline-5.9-rc1 commit d21ffe7eca82d47b489760899912f81e30456e2e category: feature bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27 CVE: NA ---------------------------
req->inflight_entry is used to track requests that grabbed files_struct. Let's share it with the iopoll list, because the only iopoll'ed ops are reads and writes, which don't need a file table.
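To illustrate the idea outside the kernel: a single embedded list node can serve two roles as long as the request is never on both lists at the same time. A minimal userspace sketch of that reuse follows; the list helpers and all names are illustrative, not the kernel's.

#include <stdio.h>
#include <stddef.h>

struct list_head { struct list_head *prev, *next; };

static void list_init(struct list_head *h) { h->prev = h->next = h; }
static void list_add_tail(struct list_head *n, struct list_head *h)
{
        n->prev = h->prev; n->next = h;
        h->prev->next = n; h->prev = n;
}
static void list_del(struct list_head *n)
{
        n->prev->next = n->next; n->next->prev = n->prev;
        list_init(n);
}
#define container_of(ptr, type, member) \
        ((type *)((char *)(ptr) - offsetof(type, member)))

struct request {
        int id;
        /* reused node: tracks ->files ownership OR iopoll membership */
        struct list_head entry;
};

int main(void)
{
        struct list_head inflight, iopoll;
        struct request r = { .id = 1 };

        list_init(&inflight); list_init(&iopoll); list_init(&r.entry);

        /* phase 1: the request pins a files table -> inflight list */
        list_add_tail(&r.entry, &inflight);
        list_del(&r.entry);             /* files reference dropped */

        /* phase 2: the same node now tracks iopoll membership */
        list_add_tail(&r.entry, &iopoll);
        printf("req %d is on the iopoll list\n",
               container_of(iopoll.next, struct request, entry)->id);
        return 0;
}

The reuse is only safe because the two roles never overlap in a request's lifetime, which is exactly the invariant the patch relies on.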
Signed-off-by: Pavel Begunkov asml.silence@gmail.com Signed-off-by: Jens Axboe axboe@kernel.dk
Conflicts: fs/io_uring.c [b63534c41e20 ("io_uring: re-issue block requests that failed because of resources") not included; 56450c20fe10 ("io_uring: clear req->result on IOPOLL re-issue") included first]
Signed-off-by: yangerkun yangerkun@huawei.com Reviewed-by: zhangyi (F) yi.zhang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- fs/io_uring.c | 26 +++++++++++++++----------- 1 file changed, 15 insertions(+), 11 deletions(-)
diff --git a/fs/io_uring.c b/fs/io_uring.c index b38da6025c97..3c0633fe4c8a 100644 --- a/fs/io_uring.c +++ b/fs/io_uring.c @@ -662,6 +662,10 @@ struct io_kiocb {
struct list_head link_list;
+ /* + * 1. used with ctx->iopoll_list with reads/writes + * 2. to track reqs with ->files (see io_op_def::file_table) + */ struct list_head inflight_entry;
struct percpu_ref *fixed_file_refs; @@ -2011,8 +2015,8 @@ static void io_iopoll_queue(struct list_head *again) struct io_kiocb *req;
do { - req = list_first_entry(again, struct io_kiocb, list); - list_del(&req->list); + req = list_first_entry(again, struct io_kiocb, inflight_entry); + list_del(&req->inflight_entry);
/* shouldn't happen unless io_uring is dying, cancel reqs */ if (unlikely(!current->mm)) { @@ -2042,14 +2046,14 @@ static void io_iopoll_complete(struct io_ring_ctx *ctx, unsigned int *nr_events, while (!list_empty(done)) { int cflags = 0;
- req = list_first_entry(done, struct io_kiocb, list); + req = list_first_entry(done, struct io_kiocb, inflight_entry); if (READ_ONCE(req->result) == -EAGAIN) { req->result = 0; req->iopoll_completed = 0; - list_move_tail(&req->list, &again); + list_move_tail(&req->inflight_entry, &again); continue; } - list_del(&req->list); + list_del(&req->inflight_entry);
if (req->flags & REQ_F_BUFFER_SELECTED) cflags = io_put_kbuf(req); @@ -2085,7 +2089,7 @@ static int io_do_iopoll(struct io_ring_ctx *ctx, unsigned int *nr_events, spin = !ctx->poll_multi_file && *nr_events < min;
ret = 0; - list_for_each_entry_safe(req, tmp, &ctx->iopoll_list, list) { + list_for_each_entry_safe(req, tmp, &ctx->iopoll_list, inflight_entry) { struct kiocb *kiocb = &req->rw.kiocb;
/* @@ -2094,7 +2098,7 @@ static int io_do_iopoll(struct io_ring_ctx *ctx, unsigned int *nr_events, * and complete those lists first, if we have entries there. */ if (READ_ONCE(req->iopoll_completed)) { - list_move_tail(&req->list, &done); + list_move_tail(&req->inflight_entry, &done); continue; } if (!list_empty(&done)) @@ -2106,7 +2110,7 @@ static int io_do_iopoll(struct io_ring_ctx *ctx, unsigned int *nr_events,
/* iopoll may have completed current req */ if (READ_ONCE(req->iopoll_completed)) - list_move_tail(&req->list, &done); + list_move_tail(&req->inflight_entry, &done);
if (ret && spin) spin = false; @@ -2298,7 +2302,7 @@ static void io_iopoll_req_issued(struct io_kiocb *req) struct io_kiocb *list_req;
list_req = list_first_entry(&ctx->iopoll_list, struct io_kiocb, - list); + inflight_entry); if (list_req->file != req->file) ctx->poll_multi_file = true; } @@ -2308,9 +2312,9 @@ static void io_iopoll_req_issued(struct io_kiocb *req) * it to the front so we find it first. */ if (READ_ONCE(req->iopoll_completed)) - list_add(&req->list, &ctx->iopoll_list); + list_add(&req->inflight_entry, &ctx->iopoll_list); else - list_add_tail(&req->list, &ctx->iopoll_list); + list_add_tail(&req->inflight_entry, &ctx->iopoll_list);
if ((ctx->flags & IORING_SETUP_SQPOLL) && wq_has_sleeper(&ctx->sqo_wait))
From: Pavel Begunkov asml.silence@gmail.com
mainline inclusion from mainline-5.9-rc1 commit 40d8ddd4facb80760d5a0c61a7cf026d5ff73ff0 category: feature bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27 CVE: NA ---------------------------
As with the completion path, also use compl.list for overflowed requests. If cleaned up properly, nobody needs per-op data there anymore.
Signed-off-by: Pavel Begunkov asml.silence@gmail.com Signed-off-by: Jens Axboe axboe@kernel.dk Signed-off-by: yangerkun yangerkun@huawei.com Reviewed-by: zhangyi (F) yi.zhang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- fs/io_uring.c | 15 ++++++++------- 1 file changed, 8 insertions(+), 7 deletions(-)
diff --git a/fs/io_uring.c b/fs/io_uring.c index 3c0633fe4c8a..6ce523412878 100644 --- a/fs/io_uring.c +++ b/fs/io_uring.c @@ -1398,8 +1398,8 @@ static bool io_cqring_overflow_flush(struct io_ring_ctx *ctx, bool force) break;
req = list_first_entry(&ctx->cq_overflow_list, struct io_kiocb, - list); - list_move(&req->list, &list); + compl.list); + list_move(&req->compl.list, &list); req->flags &= ~REQ_F_OVERFLOW; if (cqe) { WRITE_ONCE(cqe->user_data, req->user_data); @@ -1421,8 +1421,8 @@ static bool io_cqring_overflow_flush(struct io_ring_ctx *ctx, bool force) io_cqring_ev_posted(ctx);
while (!list_empty(&list)) { - req = list_first_entry(&list, struct io_kiocb, list); - list_del(&req->list); + req = list_first_entry(&list, struct io_kiocb, compl.list); + list_del(&req->compl.list); io_put_req(req); }
@@ -1455,11 +1455,12 @@ static void __io_cqring_fill_event(struct io_kiocb *req, long res, long cflags) set_bit(0, &ctx->cq_check_overflow); ctx->rings->sq_flags |= IORING_SQ_CQ_OVERFLOW; } + io_clean_op(req); req->flags |= REQ_F_OVERFLOW; - refcount_inc(&req->refs); req->result = res; req->cflags = cflags; - list_add_tail(&req->list, &ctx->cq_overflow_list); + refcount_inc(&req->refs); + list_add_tail(&req->compl.list, &ctx->cq_overflow_list); } }
@@ -7734,7 +7735,7 @@ static void io_uring_cancel_files(struct io_ring_ctx *ctx,
if (cancel_req->flags & REQ_F_OVERFLOW) { spin_lock_irq(&ctx->completion_lock); - list_del(&cancel_req->list); + list_del(&cancel_req->compl.list); cancel_req->flags &= ~REQ_F_OVERFLOW; if (list_empty(&ctx->cq_overflow_list)) { clear_bit(0, &ctx->sq_check_overflow);
From: Pavel Begunkov asml.silence@gmail.com
mainline inclusion from mainline-5.9-rc1 commit 135fcde8496b03d31648171dbc038990112e41d5 category: feature bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27 CVE: NA ---------------------------
Instead of using the shared req->list, hang timeouts up on their own list entry. struct io_timeout has enough extra space for it, but if that ever becomes a problem, ->inflight_entry can be reused for it.
Signed-off-by: Pavel Begunkov asml.silence@gmail.com Signed-off-by: Jens Axboe axboe@kernel.dk Signed-off-by: yangerkun yangerkun@huawei.com Reviewed-by: zhangyi (F) yi.zhang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- fs/io_uring.c | 22 ++++++++++++---------- 1 file changed, 12 insertions(+), 10 deletions(-)
diff --git a/fs/io_uring.c b/fs/io_uring.c index 6ce523412878..9a50a0de2395 100644 --- a/fs/io_uring.c +++ b/fs/io_uring.c @@ -405,6 +405,7 @@ struct io_timeout { int flags; u32 off; u32 target_seq; + struct list_head list; };
struct io_rw { @@ -1272,7 +1273,7 @@ static void io_kill_timeout(struct io_kiocb *req) ret = hrtimer_try_to_cancel(&req->io->timeout.timer); if (ret != -1) { atomic_inc(&req->ctx->cq_timeouts); - list_del_init(&req->list); + list_del_init(&req->timeout.list); req->flags |= REQ_F_COMP_LOCKED; io_cqring_fill_event(req, 0); io_put_req(req); @@ -1284,7 +1285,7 @@ static void io_kill_timeouts(struct io_ring_ctx *ctx) struct io_kiocb *req, *tmp;
spin_lock_irq(&ctx->completion_lock); - list_for_each_entry_safe(req, tmp, &ctx->timeout_list, list) + list_for_each_entry_safe(req, tmp, &ctx->timeout_list, timeout.list) io_kill_timeout(req); spin_unlock_irq(&ctx->completion_lock); } @@ -1307,7 +1308,7 @@ static void io_flush_timeouts(struct io_ring_ctx *ctx) { while (!list_empty(&ctx->timeout_list)) { struct io_kiocb *req = list_first_entry(&ctx->timeout_list, - struct io_kiocb, list); + struct io_kiocb, timeout.list);
if (io_is_timeout_noseq(req)) break; @@ -1315,7 +1316,7 @@ static void io_flush_timeouts(struct io_ring_ctx *ctx) - atomic_read(&ctx->cq_timeouts)) break;
- list_del_init(&req->list); + list_del_init(&req->timeout.list); io_kill_timeout(req); } } @@ -4898,8 +4899,8 @@ static enum hrtimer_restart io_timeout_fn(struct hrtimer *timer) * We could be racing with timeout deletion. If the list is empty, * then timeout lookup already found it and will be handling it. */ - if (!list_empty(&req->list)) - list_del_init(&req->list); + if (!list_empty(&req->timeout.list)) + list_del_init(&req->timeout.list);
io_cqring_fill_event(req, -ETIME); io_commit_cqring(ctx); @@ -4916,9 +4917,9 @@ static int io_timeout_cancel(struct io_ring_ctx *ctx, __u64 user_data) struct io_kiocb *req; int ret = -ENOENT;
- list_for_each_entry(req, &ctx->timeout_list, list) { + list_for_each_entry(req, &ctx->timeout_list, timeout.list) { if (user_data == req->user_data) { - list_del_init(&req->list); + list_del_init(&req->timeout.list); ret = 0; break; } @@ -5041,7 +5042,8 @@ static int io_timeout(struct io_kiocb *req) * the one we need first. */ list_for_each_prev(entry, &ctx->timeout_list) { - struct io_kiocb *nxt = list_entry(entry, struct io_kiocb, list); + struct io_kiocb *nxt = list_entry(entry, struct io_kiocb, + timeout.list);
if (io_is_timeout_noseq(nxt)) continue; @@ -5050,7 +5052,7 @@ static int io_timeout(struct io_kiocb *req) break; } add: - list_add(&req->list, entry); + list_add(&req->timeout.list, entry); data->timer.function = io_timeout_fn; hrtimer_start(&data->timer, timespec64_to_ktime(data->ts), data->mode); spin_unlock_irq(&ctx->completion_lock);
From: Pavel Begunkov asml.silence@gmail.com
mainline inclusion from mainline-5.9-rc1 commit 7d6ddea6beaf6639cf3a2b291dcdac6fe1edc584 category: feature bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27 CVE: NA ---------------------------
poll*() doesn't use req->list, don't init it.
Signed-off-by: Pavel Begunkov asml.silence@gmail.com Signed-off-by: Jens Axboe axboe@kernel.dk Signed-off-by: yangerkun yangerkun@huawei.com Reviewed-by: zhangyi (F) yi.zhang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- fs/io_uring.c | 1 - 1 file changed, 1 deletion(-)
diff --git a/fs/io_uring.c b/fs/io_uring.c index 9a50a0de2395..119b7ab91718 100644 --- a/fs/io_uring.c +++ b/fs/io_uring.c @@ -4865,7 +4865,6 @@ static int io_poll_add(struct io_kiocb *req) req->flags &= ~REQ_F_WORK_INITIALIZED;
INIT_HLIST_NODE(&req->hash_node); - INIT_LIST_HEAD(&req->list); ipt.pt._qproc = io_poll_queue_proc;
mask = __io_arm_poll_handler(req, &req->poll, &ipt, poll->events,
From: Pavel Begunkov asml.silence@gmail.com
mainline inclusion from mainline-5.9-rc1 commit 27dc8338e5fb0e0ed5b272e792f4ffad7f3bc03e category: feature bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27 CVE: NA ---------------------------
The only remaining user of req->list is DRAIN, hence instead of keeping a separate per-request list for it, do that with old-fashioned non-intrusive lists allocated on demand. That's a really slow path, so that's OK.
This removes req->list and so sheds 16 bytes from io_kiocb.
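The allocation pattern this introduces is worth spelling out: allocate the wrapper before taking the lock, then re-check the defer condition under the lock and free the wrapper if the race was lost. A userspace sketch of that pattern with illustrative names (a plain flag stands in for req_need_defer()):

#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>

struct request { int id; };

struct defer_entry {                    /* non-intrusive wrapper */
        struct defer_entry *next;
        struct request *req;
};

static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static struct defer_entry *defer_list;  /* slow path only */
static int drain_active;                /* stand-in for req_need_defer() */

/* Returns 1 if the request was queued for deferral, 0 if it may run now. */
static int defer_request(struct request *req)
{
        struct defer_entry *de;

        if (!drain_active)              /* fast path: no allocation at all */
                return 0;

        de = malloc(sizeof(*de));       /* allocate before taking the lock */
        if (!de)
                return -1;

        pthread_mutex_lock(&lock);
        if (!drain_active) {            /* re-check: we raced with the drain */
                pthread_mutex_unlock(&lock);
                free(de);               /* lost the race, wrapper unused */
                return 0;
        }
        de->req = req;
        de->next = defer_list;          /* link the wrapper, not the request */
        defer_list = de;
        pthread_mutex_unlock(&lock);
        return 1;
}

int main(void)
{
        struct request r = { .id = 1 };

        drain_active = 1;
        printf("deferred: %d\n", defer_request(&r));
        return 0;
}

Paying for an allocation only when a drain is actually pending is what makes shrinking io_kiocb free: the common path never sees the wrapper.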
Signed-off-by: Pavel Begunkov asml.silence@gmail.com Signed-off-by: Jens Axboe axboe@kernel.dk
Conflicts: fs/io_uring.c [ac8691c415e0 ("io_uring: always plug for any number of IOs") not included]
Signed-off-by: yangerkun yangerkun@huawei.com Reviewed-by: zhangyi (F) yi.zhang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- fs/io_uring.c | 25 ++++++++++++++++++------- 1 file changed, 18 insertions(+), 7 deletions(-)
diff --git a/fs/io_uring.c b/fs/io_uring.c index 119b7ab91718..4fa5633c8661 100644 --- a/fs/io_uring.c +++ b/fs/io_uring.c @@ -652,7 +652,6 @@ struct io_kiocb { u16 buf_index;
struct io_ring_ctx *ctx; - struct list_head list; unsigned int flags; refcount_t refs; struct task_struct *task; @@ -687,6 +686,11 @@ struct io_kiocb { struct callback_head task_work; };
+struct io_defer_entry { + struct list_head list; + struct io_kiocb *req; +}; + #define IO_PLUG_THRESHOLD 2 #define IO_IOPOLL_BATCH 8
@@ -1293,14 +1297,15 @@ static void io_kill_timeouts(struct io_ring_ctx *ctx) static void __io_queue_deferred(struct io_ring_ctx *ctx) { do { - struct io_kiocb *req = list_first_entry(&ctx->defer_list, - struct io_kiocb, list); + struct io_defer_entry *de = list_first_entry(&ctx->defer_list, + struct io_defer_entry, list);
- if (req_need_defer(req)) + if (req_need_defer(de->req)) break; - list_del_init(&req->list); + list_del_init(&de->list); /* punt-init is done before queueing for defer */ - __io_queue_async_work(req); + __io_queue_async_work(de->req); + kfree(de); } while (!list_empty(&ctx->defer_list)); }
@@ -5293,6 +5298,7 @@ static int io_req_defer_prep(struct io_kiocb *req, static int io_req_defer(struct io_kiocb *req, const struct io_uring_sqe *sqe) { struct io_ring_ctx *ctx = req->ctx; + struct io_defer_entry *de; int ret;
/* Still need defer if there is pending req in defer list. */ @@ -5307,15 +5313,20 @@ static int io_req_defer(struct io_kiocb *req, const struct io_uring_sqe *sqe) return ret; } io_prep_async_link(req); + de = kmalloc(sizeof(*de), GFP_KERNEL); + if (!de) + return -ENOMEM;
spin_lock_irq(&ctx->completion_lock); if (!req_need_defer(req) && list_empty(&ctx->defer_list)) { spin_unlock_irq(&ctx->completion_lock); + kfree(de); return 0; }
trace_io_uring_defer(ctx, req, req->user_data); - list_add_tail(&req->list, &ctx->defer_list); + de->req = req; + list_add_tail(&de->list, &ctx->defer_list); spin_unlock_irq(&ctx->completion_lock); return -EIOCBQUEUED; }
From: Pavel Begunkov asml.silence@gmail.com
mainline inclusion from mainline-5.9-rc1 commit 9cf7c104deaef52d6fd7c103a716e31d9815ede8 category: feature bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27 CVE: NA ---------------------------
req->sequence is used only for deferred (i.e. DRAIN) requests, but is initialised for every request. Remove req->sequence from io_kiocb together with its initialisation in io_init_req().

Replace it with a new field in struct io_defer_entry that will be calculated only when needed in io_req_defer(), which is a slow path.
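The deferred sequence is plain unsigned arithmetic over the submission counters, computed only on the slow path. A small standalone sketch of the computation (names are illustrative):

#include <stdio.h>

static unsigned int cached_sq_head;     /* SQEs consumed so far */
static unsigned int cached_sq_dropped;  /* SQEs dropped as invalid */

/* Sequence of the *first* request of a link of nr_reqs requests. */
static unsigned int get_sequence(unsigned int nr_reqs)
{
        unsigned int total_submitted = cached_sq_head - cached_sq_dropped;

        return total_submitted - nr_reqs;
}

int main(void)
{
        cached_sq_head = 10;
        cached_sq_dropped = 1;
        /* a 3-request link whose head was submitted at sequence 6 */
        printf("seq = %u\n", get_sequence(3));
        return 0;
}

Because everything is unsigned, the subtractions stay correct even after the counters wrap.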
Signed-off-by: Pavel Begunkov asml.silence@gmail.com Signed-off-by: Jens Axboe axboe@kernel.dk
Conflicts: fs/io_uring.c [ac8691c415e0 ("io_uring: always plug for any number of IOs") not included]
Signed-off-by: yangerkun yangerkun@huawei.com Reviewed-by: zhangyi (F) yi.zhang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- fs/io_uring.c | 44 ++++++++++++++++++++++++++++++-------------- 1 file changed, 30 insertions(+), 14 deletions(-)
diff --git a/fs/io_uring.c b/fs/io_uring.c index 4fa5633c8661..95b11f0fc1f5 100644 --- a/fs/io_uring.c +++ b/fs/io_uring.c @@ -650,6 +650,7 @@ struct io_kiocb { u8 iopoll_completed;
u16 buf_index; + u32 result;
struct io_ring_ctx *ctx; unsigned int flags; @@ -657,8 +658,6 @@ struct io_kiocb { struct task_struct *task; unsigned long fsize; u64 user_data; - u32 result; - u32 sequence;
struct list_head link_list;
@@ -689,6 +688,7 @@ struct io_kiocb { struct io_defer_entry { struct list_head list; struct io_kiocb *req; + u32 seq; };
#define IO_PLUG_THRESHOLD 2 @@ -1149,13 +1149,13 @@ static struct io_ring_ctx *io_ring_ctx_alloc(struct io_uring_params *p) return NULL; }
-static inline bool req_need_defer(struct io_kiocb *req) +static bool req_need_defer(struct io_kiocb *req, u32 seq) { if (unlikely(req->flags & REQ_F_IO_DRAIN)) { struct io_ring_ctx *ctx = req->ctx;
- return req->sequence != ctx->cached_cq_tail - + atomic_read(&ctx->cached_cq_overflow); + return seq != ctx->cached_cq_tail + + atomic_read(&ctx->cached_cq_overflow); }
return false; @@ -1300,7 +1300,7 @@ static void __io_queue_deferred(struct io_ring_ctx *ctx) struct io_defer_entry *de = list_first_entry(&ctx->defer_list, struct io_defer_entry, list);
- if (req_need_defer(de->req)) + if (req_need_defer(de->req, de->seq)) break; list_del_init(&de->list); /* punt-init is done before queueing for defer */ @@ -5295,14 +5295,35 @@ static int io_req_defer_prep(struct io_kiocb *req, return ret; }
+static u32 io_get_sequence(struct io_kiocb *req) +{ + struct io_kiocb *pos; + struct io_ring_ctx *ctx = req->ctx; + u32 total_submitted, nr_reqs = 1; + + if (req->flags & REQ_F_LINK_HEAD) + list_for_each_entry(pos, &req->link_list, link_list) + nr_reqs++; + + total_submitted = ctx->cached_sq_head - ctx->cached_sq_dropped; + return total_submitted - nr_reqs; +} + static int io_req_defer(struct io_kiocb *req, const struct io_uring_sqe *sqe) { struct io_ring_ctx *ctx = req->ctx; struct io_defer_entry *de; int ret; + u32 seq;
/* Still need defer if there is pending req in defer list. */ - if (!req_need_defer(req) && list_empty_careful(&ctx->defer_list)) + if (likely(list_empty_careful(&ctx->defer_list) && + !(req->flags & REQ_F_IO_DRAIN))) + return 0; + + seq = io_get_sequence(req); + /* Still a chance to pass the sequence check */ + if (!req_need_defer(req, seq) && list_empty_careful(&ctx->defer_list)) return 0;
if (!req->io) { @@ -5318,7 +5339,7 @@ static int io_req_defer(struct io_kiocb *req, const struct io_uring_sqe *sqe) return -ENOMEM;
spin_lock_irq(&ctx->completion_lock); - if (!req_need_defer(req) && list_empty(&ctx->defer_list)) { + if (!req_need_defer(req, seq) && list_empty(&ctx->defer_list)) { spin_unlock_irq(&ctx->completion_lock); kfree(de); return 0; @@ -5326,6 +5347,7 @@ static int io_req_defer(struct io_kiocb *req, const struct io_uring_sqe *sqe)
trace_io_uring_defer(ctx, req, req->user_data); de->req = req; + de->seq = seq; list_add_tail(&de->list, &ctx->defer_list); spin_unlock_irq(&ctx->completion_lock); return -EIOCBQUEUED; @@ -6087,12 +6109,6 @@ static int io_init_req(struct io_ring_ctx *ctx, struct io_kiocb *req, unsigned int sqe_flags; int id;
- /* - * All io need record the previous position, if LINK vs DARIN, - * it can be used to mark the position of the first IO in the - * link list. - */ - req->sequence = ctx->cached_sq_head - ctx->cached_sq_dropped; req->opcode = READ_ONCE(sqe->opcode); req->user_data = READ_ONCE(sqe->user_data); req->io = NULL;
From: Pavel Begunkov asml.silence@gmail.com
mainline inclusion from mainline-5.9-rc1 commit 0f7e466b393abab86be96ffcf00af383afddc0d1 category: feature bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27 CVE: NA ---------------------------
req->cflags is used only on the defer-completion path, so just use the completion data to store it. With the 4 bytes from the ->sequence patch and compacting io_kiocb, this frees 8 bytes.
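The saving comes from union member reuse: a field that is live only while completing can move into the completion member of the request's per-op union, so it costs nothing in the common layout. A rough sketch of the shape (field names and sizes are illustrative, not the kernel's):

#include <stdio.h>

struct completion_data {
        void *list_node;        /* compl.list in the patch */
        int cflags;             /* only meaningful once completing */
};

struct request {
        union {                 /* per-op data: one member live at a time */
                struct { char payload[24]; } rw;        /* while executing */
                struct completion_data compl;           /* while completing */
        };
        int result;
};

int main(void)
{
        /* cflags rides inside the union instead of widening the struct */
        printf("sizeof(struct request) = %zu\n", sizeof(struct request));
        return 0;
}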
Signed-off-by: Pavel Begunkov asml.silence@gmail.com Signed-off-by: Jens Axboe axboe@kernel.dk Signed-off-by: yangerkun yangerkun@huawei.com Reviewed-by: zhangyi (F) yi.zhang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- fs/io_uring.c | 10 +++++----- 1 file changed, 5 insertions(+), 5 deletions(-)
diff --git a/fs/io_uring.c b/fs/io_uring.c index 95b11f0fc1f5..069907a467be 100644 --- a/fs/io_uring.c +++ b/fs/io_uring.c @@ -503,6 +503,7 @@ struct io_statx { struct io_completion { struct file *file; struct list_head list; + int cflags; };
struct io_async_connect { @@ -644,7 +645,6 @@ struct io_kiocb { };
struct io_async_ctx *io; - int cflags; u8 opcode; /* polled IO has completed */ u8 iopoll_completed; @@ -1410,7 +1410,7 @@ static bool io_cqring_overflow_flush(struct io_ring_ctx *ctx, bool force) if (cqe) { WRITE_ONCE(cqe->user_data, req->user_data); WRITE_ONCE(cqe->res, req->result); - WRITE_ONCE(cqe->flags, req->cflags); + WRITE_ONCE(cqe->flags, req->compl.cflags); } else { WRITE_ONCE(ctx->rings->cq_overflow, atomic_inc_return(&ctx->cached_cq_overflow)); @@ -1464,7 +1464,7 @@ static void __io_cqring_fill_event(struct io_kiocb *req, long res, long cflags) io_clean_op(req); req->flags |= REQ_F_OVERFLOW; req->result = res; - req->cflags = cflags; + req->compl.cflags = cflags; refcount_inc(&req->refs); list_add_tail(&req->compl.list, &ctx->cq_overflow_list); } @@ -1498,7 +1498,7 @@ static void io_submit_flush_completions(struct io_comp_state *cs)
req = list_first_entry(&cs->list, struct io_kiocb, compl.list); list_del(&req->compl.list); - __io_cqring_fill_event(req, req->result, req->cflags); + __io_cqring_fill_event(req, req->result, req->compl.cflags); if (!(req->flags & REQ_F_LINK_HEAD)) { req->flags |= REQ_F_COMP_LOCKED; io_put_req(req); @@ -1524,7 +1524,7 @@ static void __io_req_complete(struct io_kiocb *req, long res, unsigned cflags, } else { io_clean_op(req); req->result = res; - req->cflags = cflags; + req->compl.cflags = cflags; list_add_tail(&req->compl.list, &cs->list); if (++cs->nr >= 32) io_submit_flush_completions(cs);
From: Jens Axboe axboe@kernel.dk
mainline inclusion from mainline-5.9-rc1 commit f254ac04c8744cf7bfed012717eac34eacc65dfb category: feature bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27 CVE: NA ---------------------------
When a process exits, we cancel whatever requests it has pending that are referencing the file table. However, if a link is holding a reference, then we cannot find it by simply looking at the inflight list.
Enable checking of the poll and timeout list to find the link, and cancel it appropriately.
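The check needed for this is "is req linked somewhere behind this head?". A simplified sketch of that membership walk, with a plain singly linked chain standing in for link_list and all names illustrative:

#include <stdbool.h>
#include <stdio.h>

struct request {
        int id;
        bool link_head;         /* true only for the head of a chain */
        struct request *next;   /* next request in the link chain */
};

/* Is 'req' one of the requests linked behind head 'preq'? */
static bool match_link(struct request *preq, struct request *req)
{
        struct request *link;

        if (!preq->link_head)
                return false;
        for (link = preq->next; link; link = link->next)
                if (link == req)
                        return true;
        return false;
}

int main(void)
{
        struct request c = { .id = 3 };
        struct request b = { .id = 2, .next = &c };
        struct request a = { .id = 1, .link_head = true, .next = &b };

        printf("c behind a: %d\n", match_link(&a, &c));  /* 1 */
        printf("a behind c: %d\n", match_link(&c, &a));  /* 0 */
        return 0;
}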
Cc: stable@vger.kernel.org Reported-by: Josef josef.grieb@gmail.com Signed-off-by: Jens Axboe axboe@kernel.dk Signed-off-by: yangerkun yangerkun@huawei.com Reviewed-by: zhangyi (F) yi.zhang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- fs/io_uring.c | 97 +++++++++++++++++++++++++++++++++++++++++++++------ 1 file changed, 87 insertions(+), 10 deletions(-)
diff --git a/fs/io_uring.c b/fs/io_uring.c index 069907a467be..1ce7395c8939 100644 --- a/fs/io_uring.c +++ b/fs/io_uring.c @@ -4741,6 +4741,7 @@ static bool io_poll_remove_one(struct io_kiocb *req) io_cqring_fill_event(req, -ECANCELED); io_commit_cqring(req->ctx); req->flags |= REQ_F_COMP_LOCKED; + req_set_fail_links(req); io_put_req(req); }
@@ -4916,6 +4917,23 @@ static enum hrtimer_restart io_timeout_fn(struct hrtimer *timer) return HRTIMER_NORESTART; }
+static int __io_timeout_cancel(struct io_kiocb *req) +{ + int ret; + + list_del_init(&req->timeout.list); + + ret = hrtimer_try_to_cancel(&req->io->timeout.timer); + if (ret == -1) + return -EALREADY; + + req_set_fail_links(req); + req->flags |= REQ_F_COMP_LOCKED; + io_cqring_fill_event(req, -ECANCELED); + io_put_req(req); + return 0; +} + static int io_timeout_cancel(struct io_ring_ctx *ctx, __u64 user_data) { struct io_kiocb *req; @@ -4923,7 +4941,6 @@ static int io_timeout_cancel(struct io_ring_ctx *ctx, __u64 user_data)
list_for_each_entry(req, &ctx->timeout_list, timeout.list) { if (user_data == req->user_data) { - list_del_init(&req->timeout.list); ret = 0; break; } @@ -4932,15 +4949,7 @@ static int io_timeout_cancel(struct io_ring_ctx *ctx, __u64 user_data) if (ret == -ENOENT) return ret;
- ret = hrtimer_try_to_cancel(&req->io->timeout.timer); - if (ret == -1) - return -EALREADY; - - req_set_fail_links(req); - req->flags |= REQ_F_COMP_LOCKED; - io_cqring_fill_event(req, -ECANCELED); - io_put_req(req); - return 0; + return __io_timeout_cancel(req); }
static int io_timeout_remove_prep(struct io_kiocb *req, @@ -7729,6 +7738,71 @@ static bool io_wq_files_match(struct io_wq_work *work, void *data) return work->files == files; }
+/* + * Returns true if 'preq' is the link parent of 'req' + */ +static bool io_match_link(struct io_kiocb *preq, struct io_kiocb *req) +{ + struct io_kiocb *link; + + if (!(preq->flags & REQ_F_LINK_HEAD)) + return false; + + list_for_each_entry(link, &preq->link_list, link_list) { + if (link == req) + return true; + } + + return false; +} + +/* + * We're looking to cancel 'req' because it's holding on to our files, but + * 'req' could be a link to another request. See if it is, and cancel that + * parent request if so. + */ +static bool io_poll_remove_link(struct io_ring_ctx *ctx, struct io_kiocb *req) +{ + struct hlist_node *tmp; + struct io_kiocb *preq; + bool found = false; + int i; + + spin_lock_irq(&ctx->completion_lock); + for (i = 0; i < (1U << ctx->cancel_hash_bits); i++) { + struct hlist_head *list; + + list = &ctx->cancel_hash[i]; + hlist_for_each_entry_safe(preq, tmp, list, hash_node) { + found = io_match_link(preq, req); + if (found) { + io_poll_remove_one(preq); + break; + } + } + } + spin_unlock_irq(&ctx->completion_lock); + return found; +} + +static bool io_timeout_remove_link(struct io_ring_ctx *ctx, + struct io_kiocb *req) +{ + struct io_kiocb *preq; + bool found = false; + + spin_lock_irq(&ctx->completion_lock); + list_for_each_entry(preq, &ctx->timeout_list, timeout.list) { + found = io_match_link(preq, req); + if (found) { + __io_timeout_cancel(preq); + break; + } + } + spin_unlock_irq(&ctx->completion_lock); + return found; +} + static void io_uring_cancel_files(struct io_ring_ctx *ctx, struct files_struct *files) { @@ -7786,6 +7860,9 @@ static void io_uring_cancel_files(struct io_ring_ctx *ctx, } } else { io_wq_cancel_work(ctx->io_wq, &cancel_req->work); + /* could be a link, check and remove if it is */ + if (!io_poll_remove_link(ctx, cancel_req)) + io_timeout_remove_link(ctx, cancel_req); io_put_req(cancel_req); }
From: Jens Axboe axboe@kernel.dk
mainline inclusion from mainline-5.9-rc1 commit 7271ef3a93a832180068c7aade3f130b7f39b17e category: feature bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27 CVE: NA ---------------------------
syzbot reports a scenario where we recurse on the completion lock when flushing an overflow:
1 lock held by syz-executor287/6816:
 #0: ffff888093cdb4d8 (&ctx->completion_lock){....}-{2:2}, at: io_cqring_overflow_flush+0xc6/0xab0 fs/io_uring.c:1333

stack backtrace:
CPU: 1 PID: 6816 Comm: syz-executor287 Not tainted 5.8.0-syzkaller #0
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
Call Trace:
 __dump_stack lib/dump_stack.c:77 [inline]
 dump_stack+0x1f0/0x31e lib/dump_stack.c:118
 print_deadlock_bug kernel/locking/lockdep.c:2391 [inline]
 check_deadlock kernel/locking/lockdep.c:2432 [inline]
 validate_chain+0x69a4/0x88a0 kernel/locking/lockdep.c:3202
 __lock_acquire+0x1161/0x2ab0 kernel/locking/lockdep.c:4426
 lock_acquire+0x160/0x730 kernel/locking/lockdep.c:5005
 __raw_spin_lock_irq include/linux/spinlock_api_smp.h:128 [inline]
 _raw_spin_lock_irq+0x67/0x80 kernel/locking/spinlock.c:167
 spin_lock_irq include/linux/spinlock.h:379 [inline]
 io_queue_linked_timeout fs/io_uring.c:5928 [inline]
 __io_queue_async_work fs/io_uring.c:1192 [inline]
 __io_queue_deferred+0x36a/0x790 fs/io_uring.c:1237
 io_cqring_overflow_flush+0x774/0xab0 fs/io_uring.c:1359
 io_ring_ctx_wait_and_kill+0x2a1/0x570 fs/io_uring.c:7808
 io_uring_release+0x59/0x70 fs/io_uring.c:7829
 __fput+0x34f/0x7b0 fs/file_table.c:281
 task_work_run+0x137/0x1c0 kernel/task_work.c:135
 exit_task_work include/linux/task_work.h:25 [inline]
 do_exit+0x5f3/0x1f20 kernel/exit.c:806
 do_group_exit+0x161/0x2d0 kernel/exit.c:903
 __do_sys_exit_group+0x13/0x20 kernel/exit.c:914
 __se_sys_exit_group+0x10/0x10 kernel/exit.c:912
 __x64_sys_exit_group+0x37/0x40 kernel/exit.c:912
 do_syscall_64+0x31/0x70 arch/x86/entry/common.c:46
 entry_SYSCALL_64_after_hwframe+0x44/0xa9
Fix this by passing back the link from __io_queue_async_work(), and then let the caller handle the queueing of the link. Take care to also punt the submission reference put to the caller, as we're holding the completion lock for the __io_queue_deferred() case. Hence we need to mark the io_kiocb appropriately for that case.
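The shape of the fix is a classic way to avoid self-deadlock: a helper that used to take the lock internally instead hands the follow-up work back, so a caller that already holds the lock can queue it without a second acquisition. A userspace sketch of the pattern (a pthread mutex stands in for completion_lock; all names are illustrative):

#include <pthread.h>
#include <stdio.h>

struct work { const char *name; };

static pthread_mutex_t completion_lock = PTHREAD_MUTEX_INITIALIZER;

/* Queue 'w'; the caller must already hold completion_lock. */
static void __queue_locked(struct work *w)
{
        printf("queued %s (lock held by caller)\n", w->name);
}

/* Does the main job and *returns* any linked work instead of queueing it. */
static struct work *queue_async_work(struct work *w, struct work *link)
{
        printf("punted %s to the workqueue\n", w->name);
        return link;    /* the caller decides how to queue this */
}

int main(void)
{
        struct work req = { "request" }, timeout = { "linked timeout" };
        struct work *link;

        pthread_mutex_lock(&completion_lock);   /* e.g. flushing deferred reqs */
        link = queue_async_work(&req, &timeout);
        if (link)
                __queue_locked(link);           /* no second lock acquisition */
        pthread_mutex_unlock(&completion_lock);
        return 0;
}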
Reported-by: syzbot+996f91b6ec3812c48042@syzkaller.appspotmail.com Signed-off-by: Jens Axboe axboe@kernel.dk Signed-off-by: yangerkun yangerkun@huawei.com Reviewed-by: zhangyi (F) yi.zhang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- fs/io_uring.c | 36 ++++++++++++++++++++++++++---------- 1 file changed, 26 insertions(+), 10 deletions(-)
diff --git a/fs/io_uring.c b/fs/io_uring.c index 1ce7395c8939..a7e0bae86df9 100644 --- a/fs/io_uring.c +++ b/fs/io_uring.c @@ -912,6 +912,7 @@ static void io_put_req(struct io_kiocb *req); static void io_double_put_req(struct io_kiocb *req); static void __io_double_put_req(struct io_kiocb *req); static struct io_kiocb *io_prep_linked_timeout(struct io_kiocb *req); +static void __io_queue_linked_timeout(struct io_kiocb *req); static void io_queue_linked_timeout(struct io_kiocb *req); static int __io_sqe_files_update(struct io_ring_ctx *ctx, struct io_uring_files_update *ip, @@ -1250,7 +1251,7 @@ static void io_prep_async_link(struct io_kiocb *req) io_prep_async_work(cur); }
-static void __io_queue_async_work(struct io_kiocb *req) +static struct io_kiocb *__io_queue_async_work(struct io_kiocb *req) { struct io_ring_ctx *ctx = req->ctx; struct io_kiocb *link = io_prep_linked_timeout(req); @@ -1258,16 +1259,19 @@ static void __io_queue_async_work(struct io_kiocb *req) trace_io_uring_queue_async_work(ctx, io_wq_is_hashed(&req->work), req, &req->work, req->flags); io_wq_enqueue(ctx->io_wq, &req->work); - - if (link) - io_queue_linked_timeout(link); + return link; }
static void io_queue_async_work(struct io_kiocb *req) { + struct io_kiocb *link; + /* init ->work of the whole link before punting */ io_prep_async_link(req); - __io_queue_async_work(req); + link = __io_queue_async_work(req); + + if (link) + io_queue_linked_timeout(link); }
static void io_kill_timeout(struct io_kiocb *req) @@ -1299,12 +1303,19 @@ static void __io_queue_deferred(struct io_ring_ctx *ctx) do { struct io_defer_entry *de = list_first_entry(&ctx->defer_list, struct io_defer_entry, list); + struct io_kiocb *link;
if (req_need_defer(de->req, de->seq)) break; list_del_init(&de->list); /* punt-init is done before queueing for defer */ - __io_queue_async_work(de->req); + link = __io_queue_async_work(de->req); + if (link) { + __io_queue_linked_timeout(link); + /* drop submission reference */ + link->flags |= REQ_F_COMP_LOCKED; + io_put_req(link); + } kfree(de); } while (!list_empty(&ctx->defer_list)); } @@ -5800,15 +5811,12 @@ static enum hrtimer_restart io_link_timeout_fn(struct hrtimer *timer) return HRTIMER_NORESTART; }
-static void io_queue_linked_timeout(struct io_kiocb *req) +static void __io_queue_linked_timeout(struct io_kiocb *req) { - struct io_ring_ctx *ctx = req->ctx; - /* * If the list is now empty, then our linked request finished before * we got a chance to setup the timer */ - spin_lock_irq(&ctx->completion_lock); if (!list_empty(&req->link_list)) { struct io_timeout_data *data = &req->io->timeout;
@@ -5816,6 +5824,14 @@ static void io_queue_linked_timeout(struct io_kiocb *req) hrtimer_start(&data->timer, timespec64_to_ktime(data->ts), data->mode); } +} + +static void io_queue_linked_timeout(struct io_kiocb *req) +{ + struct io_ring_ctx *ctx = req->ctx; + + spin_lock_irq(&ctx->completion_lock); + __io_queue_linked_timeout(req); spin_unlock_irq(&ctx->completion_lock);
/* drop submission reference */
From: Jens Axboe axboe@kernel.dk
mainline inclusion from mainline-5.9-rc1 commit ac8691c415e0ce0b8734cb6d9df2df18608eebed category: feature bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27 CVE: NA ---------------------------
Currently we only plug if we're doing more than two requests. We're going to be relying on always having the plug there to pass down information, so plug unconditionally.
Signed-off-by: Jens Axboe axboe@kernel.dk
Conflicts: fs/io_uring.c [We need this to transfer arg]
Signed-off-by: yangerkun yangerkun@huawei.com Reviewed-by: zhangyi (F) yi.zhang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- fs/io_uring.c | 15 +++++---------- 1 file changed, 5 insertions(+), 10 deletions(-)
diff --git a/fs/io_uring.c b/fs/io_uring.c index a7e0bae86df9..c455d9ed5795 100644 --- a/fs/io_uring.c +++ b/fs/io_uring.c @@ -691,7 +691,6 @@ struct io_defer_entry { u32 seq; };
-#define IO_PLUG_THRESHOLD 2 #define IO_IOPOLL_BATCH 8
struct io_comp_state { @@ -6181,7 +6180,7 @@ static int io_init_req(struct io_ring_ctx *ctx, struct io_kiocb *req, static int io_submit_sqes(struct io_ring_ctx *ctx, unsigned int nr, struct file *ring_file, int ring_fd) { - struct io_submit_state state, *statep = NULL; + struct io_submit_state state; struct io_kiocb *link = NULL; int i, submitted = 0;
@@ -6198,10 +6197,7 @@ static int io_submit_sqes(struct io_ring_ctx *ctx, unsigned int nr, if (!percpu_ref_tryget_many(&ctx->refs, nr)) return -EAGAIN;
- if (nr > IO_PLUG_THRESHOLD) { - io_submit_state_start(&state, ctx, nr); - statep = &state; - } + io_submit_state_start(&state, ctx, nr);
ctx->ring_fd = ring_fd; ctx->ring_file = ring_file; @@ -6216,14 +6212,14 @@ static int io_submit_sqes(struct io_ring_ctx *ctx, unsigned int nr, io_consume_sqe(ctx); break; } - req = io_alloc_req(ctx, statep); + req = io_alloc_req(ctx, &state); if (unlikely(!req)) { if (!submitted) submitted = -EAGAIN; break; }
- err = io_init_req(ctx, req, sqe, statep); + err = io_init_req(ctx, req, sqe, &state); io_consume_sqe(ctx); /* will complete beyond this point, count as submitted */ submitted++; @@ -6249,8 +6245,7 @@ static int io_submit_sqes(struct io_ring_ctx *ctx, unsigned int nr, } if (link) io_queue_link_head(link, &state.comp); - if (statep) - io_submit_state_end(&state); + io_submit_state_end(&state);
/* Commit SQ ring head once we've consumed and submitted all SQEs */ io_commit_sqring(ctx);
From: Jens Axboe axboe@kernel.dk
mainline inclusion from mainline-5.9-rc2 commit b711d4eaf0c408a811311ee3e94d6e9e5a230a9a category: feature bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27 CVE: NA ---------------------------
Commit f254ac04c874 ("io_uring: enable lookup of links holding inflight files") only handled two of the three head link cases we have; we also need to look up and cancel work that is blocked in io-wq if that work has a link that's holding a reference to the files structure.

Put the "cancel head links that hold this request pending" logic into io_attempt_cancel(), which will go through the motions of finding and canceling head links that hold the current inflight request pending.
Cc: stable@vger.kernel.org Reported-by: Pavel Begunkov asml.silence@gmail.com Signed-off-by: Jens Axboe axboe@kernel.dk Signed-off-by: yangerkun yangerkun@huawei.com Reviewed-by: zhangyi (F) yi.zhang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- fs/io_uring.c | 33 +++++++++++++++++++++++++++++---- 1 file changed, 29 insertions(+), 4 deletions(-)
diff --git a/fs/io_uring.c b/fs/io_uring.c index c455d9ed5795..22d778c7a45e 100644 --- a/fs/io_uring.c +++ b/fs/io_uring.c @@ -7814,6 +7814,33 @@ static bool io_timeout_remove_link(struct io_ring_ctx *ctx, return found; }
+static bool io_cancel_link_cb(struct io_wq_work *work, void *data) +{ + return io_match_link(container_of(work, struct io_kiocb, work), data); +} + +static void io_attempt_cancel(struct io_ring_ctx *ctx, struct io_kiocb *req) +{ + enum io_wq_cancel cret; + + /* cancel this particular work, if it's running */ + cret = io_wq_cancel_work(ctx->io_wq, &req->work); + if (cret != IO_WQ_CANCEL_NOTFOUND) + return; + + /* find links that hold this pending, cancel those */ + cret = io_wq_cancel_cb(ctx->io_wq, io_cancel_link_cb, req, true); + if (cret != IO_WQ_CANCEL_NOTFOUND) + return; + + /* if we have a poll link holding this pending, cancel that */ + if (io_poll_remove_link(ctx, req)) + return; + + /* final option, timeout link is holding this req pending */ + io_timeout_remove_link(ctx, req); +} + static void io_uring_cancel_files(struct io_ring_ctx *ctx, struct files_struct *files) { @@ -7870,10 +7897,8 @@ static void io_uring_cancel_files(struct io_ring_ctx *ctx, continue; } } else { - io_wq_cancel_work(ctx->io_wq, &cancel_req->work); - /* could be a link, check and remove if it is */ - if (!io_poll_remove_link(ctx, cancel_req)) - io_timeout_remove_link(ctx, cancel_req); + /* cancel this request, or head link requests */ + io_attempt_cancel(ctx, cancel_req); io_put_req(cancel_req); }
From: Marcelo Diop-Gonzalez marcelo827@gmail.com
mainline inclusion from mainline-5.11-rc4 commit f010505b78a4fa8d5b6480752566e7313fb5ca6e category: feature bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27 CVE: NA ---------------------------
Right now io_flush_timeouts() checks if the current number of events is equal to ->timeout.target_seq, but this will miss some timeouts if there has been more than one event added since the last time they were flushed (possible in io_submit_flush_completions(), for example). Fix it by recording the last sequence at which timeouts were flushed so that the number of events seen can be compared to the number of events needed without overflow.
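The comparison can be checked in isolation: measure both the target and the current sequence as distances from the last flush point, so that a u32 wraparound between flushes cannot invert the test. A small standalone sketch:

#include <stdint.h>
#include <stdio.h>

/* Has 'target' been reached, given 'seq' events and the last flush point? */
static int timeout_due(uint32_t target, uint32_t seq, uint32_t last_flush)
{
        uint32_t events_needed = target - last_flush;
        uint32_t events_got = seq - last_flush;

        return events_got >= events_needed;
}

int main(void)
{
        /* plain case: flushed at 100, target 105, now at 107 */
        printf("%d\n", timeout_due(105, 107, 100));             /* 1 */

        /* wraparound: flushed just before UINT32_MAX, counters wrapped */
        printf("%d\n", timeout_due(3, 5, UINT32_MAX - 2));      /* 1 */
        printf("%d\n", timeout_due(9, 5, UINT32_MAX - 2));      /* 0 */
        return 0;
}

This holds as long as fewer than 2^31-1 events occur between flushes, the same assumption the patch's comment states.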
Signed-off-by: Marcelo Diop-Gonzalez marcelo827@gmail.com Reviewed-by: Pavel Begunkov asml.silence@gmail.com Signed-off-by: Jens Axboe axboe@kernel.dk Signed-off-by: yangerkun yangerkun@huawei.com Reviewed-by: zhangyi (F) yi.zhang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- fs/io_uring.c | 34 ++++++++++++++++++++++++++++++---- 1 file changed, 30 insertions(+), 4 deletions(-)
diff --git a/fs/io_uring.c b/fs/io_uring.c index 22d778c7a45e..7163271d14c3 100644 --- a/fs/io_uring.c +++ b/fs/io_uring.c @@ -314,6 +314,7 @@ struct io_ring_ctx { unsigned cq_entries; unsigned cq_mask; atomic_t cq_timeouts; + unsigned cq_last_tm_flush; unsigned long cq_check_overflow; struct wait_queue_head cq_wait; struct fasync_struct *cq_fasync; @@ -1321,19 +1322,38 @@ static void __io_queue_deferred(struct io_ring_ctx *ctx)
static void io_flush_timeouts(struct io_ring_ctx *ctx) { - while (!list_empty(&ctx->timeout_list)) { + u32 seq; + + if (list_empty(&ctx->timeout_list)) + return; + + seq = ctx->cached_cq_tail - atomic_read(&ctx->cq_timeouts); + + do { + u32 events_needed, events_got; struct io_kiocb *req = list_first_entry(&ctx->timeout_list, struct io_kiocb, timeout.list);
if (io_is_timeout_noseq(req)) break; - if (req->timeout.target_seq != ctx->cached_cq_tail - - atomic_read(&ctx->cq_timeouts)) + + /* + * Since seq can easily wrap around over time, subtract + * the last seq at which timeouts were flushed before comparing. + * Assuming not more than 2^31-1 events have happened since, + * these subtractions won't have wrapped, so we can check if + * target is in [last_seq, current_seq] by comparing the two. + */ + events_needed = req->timeout.target_seq - ctx->cq_last_tm_flush; + events_got = seq - ctx->cq_last_tm_flush; + if (events_got < events_needed) break;
list_del_init(&req->timeout.list); io_kill_timeout(req); - } + } while (!list_empty(&ctx->timeout_list)); + + ctx->cq_last_tm_flush = seq; }
static void io_commit_cqring(struct io_ring_ctx *ctx) @@ -5060,6 +5080,12 @@ static int io_timeout(struct io_kiocb *req) tail = ctx->cached_cq_tail - atomic_read(&ctx->cq_timeouts); req->timeout.target_seq = tail + off;
+ /* Update the last seq here in case io_flush_timeouts() hasn't. + * This is safe because ->completion_lock is held, and submissions + * and completions are never mixed in the same ->completion_lock section. + */ + ctx->cq_last_tm_flush = tail; + /* * Insertion sort, ensuring the first entry in the list is always * the one we need first.
From: Pavel Begunkov asml.silence@gmail.com
mainline inclusion from mainline-5.9-rc4 commit b7ddce3cbf010edbfac6c6d8cc708560a7bcd7a4 category: feature bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27 CVE: NA ---------------------------
While trying to cancel requests with ->files, it should also look for requests in ->defer_list, otherwise it might end up hanging a thread.

Cancel all requests in ->defer_list up to the last request there with matching ->files; that's needed to follow drain ordering semantics.
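The ordering rule can be shown with a toy model: scan from the tail for the last deferred request whose ->files match, then cancel everything up to and including it, so earlier requests the drain had ordered in front of it go away too. An array stands in for ->defer_list; all names are illustrative:

#include <stdio.h>

struct dreq { int id; int files; };     /* files: owner tag */

int main(void)
{
        struct dreq defer[] = {
                { 1, 7 }, { 2, 9 }, { 3, 7 }, { 4, 9 },
        };
        int n = 4, files = 7, last = -1, i;

        for (i = n - 1; i >= 0; i--)    /* reverse scan: last match wins */
                if (defer[i].files == files) {
                        last = i;
                        break;
                }

        for (i = 0; i <= last; i++)     /* cut [head, last] out of the list */
                printf("cancel deferred req %d\n", defer[i].id);
        /* req 4 stays queued: it sits after the last match */
        return 0;
}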
Signed-off-by: Pavel Begunkov asml.silence@gmail.com Signed-off-by: Jens Axboe axboe@kernel.dk Signed-off-by: yangerkun yangerkun@huawei.com Reviewed-by: zhangyi (F) yi.zhang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- fs/io_uring.c | 27 +++++++++++++++++++++++++++ 1 file changed, 27 insertions(+)
diff --git a/fs/io_uring.c b/fs/io_uring.c index 7163271d14c3..36536ed5659e 100644 --- a/fs/io_uring.c +++ b/fs/io_uring.c @@ -7867,12 +7867,39 @@ static void io_attempt_cancel(struct io_ring_ctx *ctx, struct io_kiocb *req) io_timeout_remove_link(ctx, req); }
+static void io_cancel_defer_files(struct io_ring_ctx *ctx, + struct files_struct *files) +{ + struct io_defer_entry *de = NULL; + LIST_HEAD(list); + + spin_lock_irq(&ctx->completion_lock); + list_for_each_entry_reverse(de, &ctx->defer_list, list) { + if ((de->req->flags & REQ_F_WORK_INITIALIZED) + && de->req->work.files == files) { + list_cut_position(&list, &ctx->defer_list, &de->list); + break; + } + } + spin_unlock_irq(&ctx->completion_lock); + + while (!list_empty(&list)) { + de = list_first_entry(&list, struct io_defer_entry, list); + list_del_init(&de->list); + req_set_fail_links(de->req); + io_put_req(de->req); + io_req_complete(de->req, -ECANCELED); + kfree(de); + } +} + static void io_uring_cancel_files(struct io_ring_ctx *ctx, struct files_struct *files) { if (list_empty_careful(&ctx->inflight_list)) return;
+ io_cancel_defer_files(ctx, files); /* cancel all at once, should be faster than doing it one by one*/ io_wq_cancel_cb(ctx->io_wq, io_wq_files_match, files, true);
From: Pavel Begunkov asml.silence@gmail.com
mainline inclusion from mainline-5.9-rc4 commit c127a2a1b7baa5eb40a7e2de4b7f0c51ccbbb2ef category: feature bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27 CVE: NA ---------------------------
While looking for ->files in ->defer_list, consider that requests there may actually be links.
Signed-off-by: Pavel Begunkov asml.silence@gmail.com Signed-off-by: Jens Axboe axboe@kernel.dk Signed-off-by: yangerkun yangerkun@huawei.com Reviewed-by: zhangyi (F) yi.zhang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- fs/io_uring.c | 25 +++++++++++++++++++++++-- 1 file changed, 23 insertions(+), 2 deletions(-)
diff --git a/fs/io_uring.c b/fs/io_uring.c index 36536ed5659e..b4d684321724 100644 --- a/fs/io_uring.c +++ b/fs/io_uring.c @@ -7793,6 +7793,28 @@ static bool io_match_link(struct io_kiocb *preq, struct io_kiocb *req) return false; }
+static inline bool io_match_files(struct io_kiocb *req, + struct files_struct *files) +{ + return (req->flags & REQ_F_WORK_INITIALIZED) && req->work.files == files; +} + +static bool io_match_link_files(struct io_kiocb *req, + struct files_struct *files) +{ + struct io_kiocb *link; + + if (io_match_files(req, files)) + return true; + if (req->flags & REQ_F_LINK_HEAD) { + list_for_each_entry(link, &req->link_list, link_list) { + if (io_match_files(link, files)) + return true; + } + } + return false; +} + /* * We're looking to cancel 'req' because it's holding on to our files, but * 'req' could be a link to another request. See if it is, and cancel that @@ -7875,8 +7897,7 @@ static void io_cancel_defer_files(struct io_ring_ctx *ctx,
spin_lock_irq(&ctx->completion_lock); list_for_each_entry_reverse(de, &ctx->defer_list, list) { - if ((de->req->flags & REQ_F_WORK_INITIALIZED) - && de->req->work.files == files) { + if (io_match_link_files(de->req, files)) { list_cut_position(&list, &ctx->defer_list, &de->list); break; }
From: Jens Axboe axboe@kernel.dk
mainline inclusion from mainline-5.9-rc3 commit 9dab14b81807a40dab8e464ec87043935c562c2c category: feature bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27 CVE: NA ---------------------------
There's no point in using the poll handler if we can't do a nonblocking IO attempt of the operation, since we'll need to go async anyway. In fact this is actively harmful, as reading from e.g. pipes won't return 0 to indicate EOF.
Cc: stable@vger.kernel.org # v5.7+ Reported-by: Benedikt Ames wisp3rwind@posteo.eu Signed-off-by: Jens Axboe axboe@kernel.dk Signed-off-by: yangerkun yangerkun@huawei.com Reviewed-by: zhangyi (F) yi.zhang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- fs/io_uring.c | 10 +++++++++- 1 file changed, 9 insertions(+), 1 deletion(-)
diff --git a/fs/io_uring.c b/fs/io_uring.c index b4d684321724..bf1c73d334c3 100644 --- a/fs/io_uring.c +++ b/fs/io_uring.c @@ -4676,12 +4676,20 @@ static bool io_arm_poll_handler(struct io_kiocb *req) struct async_poll *apoll; struct io_poll_table ipt; __poll_t mask, ret; + int rw;
if (!req->file || !file_can_poll(req->file)) return false; if (req->flags & REQ_F_POLLED) return false; - if (!def->pollin && !def->pollout) + if (def->pollin) + rw = READ; + else if (def->pollout) + rw = WRITE; + else + return false; + /* if we can't nonblock try, then no point in arming a poll handler */ + if (!io_file_supports_async(req->file, rw)) return false;
apoll = kmalloc(sizeof(*apoll), GFP_ATOMIC);
From: Jens Axboe axboe@kernel.dk
mainline inclusion from mainline-5.9-rc3 commit fd7d6de2241453fc7d042336d366a939a25bc5a9 category: feature bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27 CVE: NA ---------------------------
If an application is doing reads on signalfd, and we arm the poll handler because there's no data available, then the wakeup can recurse on the task's sighand->siglock, as the signal delivery from task_work_add() will use TWA_SIGNAL and that attempts to lock it again.

We can detect the signalfd case pretty easily by comparing the poll->head wait_queue_head_t with the target task's signalfd wait queue. Just use normal task wakeup for this case.
Cc: stable@vger.kernel.org # v5.7+ Signed-off-by: Jens Axboe axboe@kernel.dk
Conflicts: fs/io_uring.c [b63534c41e20 ("io_uring: re-issue block requests that failed because of resources") not merged]
Signed-off-by: yangerkun yangerkun@huawei.com Reviewed-by: zhangyi (F) yi.zhang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- fs/io_uring.c | 18 ++++++++++++++---- 1 file changed, 14 insertions(+), 4 deletions(-)
diff --git a/fs/io_uring.c b/fs/io_uring.c index bf1c73d334c3..8c3c158a713e 100644 --- a/fs/io_uring.c +++ b/fs/io_uring.c @@ -1805,7 +1805,8 @@ static struct io_kiocb *io_req_find_next(struct io_kiocb *req) return __io_req_find_next(req); }
-static int io_req_task_work_add(struct io_kiocb *req, struct callback_head *cb) +static int io_req_task_work_add(struct io_kiocb *req, struct callback_head *cb, + bool twa_signal_ok) { struct task_struct *tsk = req->task; struct io_ring_ctx *ctx = req->ctx; @@ -1818,7 +1819,7 @@ static int io_req_task_work_add(struct io_kiocb *req, struct callback_head *cb) * will do the job. */ notify = 0; - if (!(ctx->flags & IORING_SETUP_SQPOLL)) + if (!(ctx->flags & IORING_SETUP_SQPOLL) && twa_signal_ok) notify = TWA_SIGNAL;
ret = task_work_add(tsk, cb, notify); @@ -1879,7 +1880,7 @@ static void io_req_task_queue(struct io_kiocb *req) init_task_work(&req->task_work, io_req_task_submit); percpu_ref_get(&req->ctx->refs);
- ret = io_req_task_work_add(req, &req->task_work); + ret = io_req_task_work_add(req, &req->task_work, true); if (unlikely(ret)) { struct task_struct *tsk;
@@ -4354,6 +4355,7 @@ struct io_poll_table { static int __io_async_wake(struct io_kiocb *req, struct io_poll_iocb *poll, __poll_t mask, task_work_func_t func) { + bool twa_signal_ok; int ret;
/* for instances that support it check for an event match first: */ @@ -4368,13 +4370,21 @@ static int __io_async_wake(struct io_kiocb *req, struct io_poll_iocb *poll, init_task_work(&req->task_work, func); percpu_ref_get(&req->ctx->refs);
+ /* + * If we using the signalfd wait_queue_head for this wakeup, then + * it's not safe to use TWA_SIGNAL as we could be recursing on the + * tsk->sighand->siglock on doing the wakeup. Should not be needed + * either, as the normal wakeup will suffice. + */ + twa_signal_ok = (poll->head != &req->task->sighand->signalfd_wqh); + /* * If this fails, then the task is exiting. When a task exits, the * work gets canceled, so just cancel this request as well instead * of executing it. We can't safely execute it anyway, as we may not * have the needed state needed for it anyway. */ - ret = io_req_task_work_add(req, &req->task_work); + ret = io_req_task_work_add(req, &req->task_work, twa_signal_ok); if (unlikely(ret)) { struct task_struct *tsk;
From: Xiaoguang Wang xiaoguang.wang@linux.alibaba.com
mainline inclusion from mainline-5.11-rc1 commit dad1b1242fd5717af18ae4ac9d12b9f65849e13a category: feature bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27 CVE: NA ---------------------------
Abaci Fuzz reported a double-free or invalid-free BUG in io_commit_cqring():
[ 95.504842] BUG: KASAN: double-free or invalid-free in io_commit_cqring+0x3ec/0x8e0
[ 95.505921]
[ 95.506225] CPU: 0 PID: 4037 Comm: io_wqe_worker-0 Tainted: G B W 5.10.0-rc5+ #1
[ 95.507434] Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011
[ 95.508248] Call Trace:
[ 95.508683]  dump_stack+0x107/0x163
[ 95.509323]  ? io_commit_cqring+0x3ec/0x8e0
[ 95.509982]  print_address_description.constprop.0+0x3e/0x60
[ 95.510814]  ? vprintk_func+0x98/0x140
[ 95.511399]  ? io_commit_cqring+0x3ec/0x8e0
[ 95.512036]  ? io_commit_cqring+0x3ec/0x8e0
[ 95.512733]  kasan_report_invalid_free+0x51/0x80
[ 95.513431]  ? io_commit_cqring+0x3ec/0x8e0
[ 95.514047]  __kasan_slab_free+0x141/0x160
[ 95.514699]  kfree+0xd1/0x390
[ 95.515182]  io_commit_cqring+0x3ec/0x8e0
[ 95.515799]  __io_req_complete.part.0+0x64/0x90
[ 95.516483]  io_wq_submit_work+0x1fa/0x260
[ 95.517117]  io_worker_handle_work+0xeac/0x1c00
[ 95.517828]  io_wqe_worker+0xc94/0x11a0
[ 95.518438]  ? io_worker_handle_work+0x1c00/0x1c00
[ 95.519151]  ? __kthread_parkme+0x11d/0x1d0
[ 95.519806]  ? io_worker_handle_work+0x1c00/0x1c00
[ 95.520512]  ? io_worker_handle_work+0x1c00/0x1c00
[ 95.521211]  kthread+0x396/0x470
[ 95.521727]  ? _raw_spin_unlock_irq+0x24/0x30
[ 95.522380]  ? kthread_mod_delayed_work+0x180/0x180
[ 95.523108]  ret_from_fork+0x22/0x30
[ 95.523684]
[ 95.523985] Allocated by task 4035:
[ 95.524543]  kasan_save_stack+0x1b/0x40
[ 95.525136]  __kasan_kmalloc.constprop.0+0xc2/0xd0
[ 95.525882]  kmem_cache_alloc_trace+0x17b/0x310
[ 95.533930]  io_queue_sqe+0x225/0xcb0
[ 95.534505]  io_submit_sqes+0x1768/0x25f0
[ 95.535164]  __x64_sys_io_uring_enter+0x89e/0xd10
[ 95.535900]  do_syscall_64+0x33/0x40
[ 95.536465]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
[ 95.537199]
[ 95.537505] Freed by task 4035:
[ 95.538003]  kasan_save_stack+0x1b/0x40
[ 95.538599]  kasan_set_track+0x1c/0x30
[ 95.539177]  kasan_set_free_info+0x1b/0x30
[ 95.539798]  __kasan_slab_free+0x112/0x160
[ 95.540427]  kfree+0xd1/0x390
[ 95.540910]  io_commit_cqring+0x3ec/0x8e0
[ 95.541516]  io_iopoll_complete+0x914/0x1390
[ 95.542150]  io_do_iopoll+0x580/0x700
[ 95.542724]  io_iopoll_try_reap_events.part.0+0x108/0x200
[ 95.543512]  io_ring_ctx_wait_and_kill+0x118/0x340
[ 95.544206]  io_uring_release+0x43/0x50
[ 95.544791]  __fput+0x28d/0x940
[ 95.545291]  task_work_run+0xea/0x1b0
[ 95.545873]  do_exit+0xb6a/0x2c60
[ 95.546400]  do_group_exit+0x12a/0x320
[ 95.546967]  __x64_sys_exit_group+0x3f/0x50
[ 95.547605]  do_syscall_64+0x33/0x40
[ 95.548155]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
The reason is that once we get a non-EAGAIN error in io_wq_submit_work(), we'll complete the req by calling io_req_complete(), which will hold completion_lock to call io_commit_cqring(). But for polled io, io_iopoll_complete() won't hold completion_lock to call io_commit_cqring(), so there may be concurrent access to ctx->defer_list, and a double free may happen.
To fix this bug, we always let io_iopoll_complete() complete polled io.
Cc: stable@vger.kernel.org # 5.5+ Reported-by: Abaci Fuzz abaci@linux.alibaba.com Signed-off-by: Xiaoguang Wang xiaoguang.wang@linux.alibaba.com Reviewed-by: Pavel Begunkov asml.silence@gmail.com Reviewed-by: Joseph Qi joseph.qi@linux.alibaba.com Signed-off-by: Pavel Begunkov asml.silence@gmail.com Signed-off-by: Jens Axboe axboe@kernel.dk Signed-off-by: yangerkun yangerkun@huawei.com Reviewed-by: zhangyi (F) yi.zhang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- fs/io_uring.c | 15 +++++++++++++-- 1 file changed, 13 insertions(+), 2 deletions(-)
diff --git a/fs/io_uring.c b/fs/io_uring.c index 8c3c158a713e..78fbab206606 100644 --- a/fs/io_uring.c +++ b/fs/io_uring.c @@ -5737,8 +5737,19 @@ static struct io_wq_work *io_wq_submit_work(struct io_wq_work *work) }
if (ret) { - req_set_fail_links(req); - io_req_complete(req, ret); + /* + * io_iopoll_complete() does not hold completion_lock to complete + * polled io, so here for polled io, just mark it done and still let + * io_iopoll_complete() complete it. + */ + if (req->ctx->flags & IORING_SETUP_IOPOLL) { + struct kiocb *kiocb = &req->rw.kiocb; + + kiocb_done(kiocb, ret, NULL); + } else { + req_set_fail_links(req); + io_req_complete(req, ret); + } }
return io_steal_work(req);