io_uring support
Aleix Roca Nonell (1): io_uring: fix manual setup of iov_iter for fixed buffers
Arnd Bergmann (2): io_uring: fix big-endian compat signal mask handling io_uring: use __kernel_timespec in timeout ABI
Bart Van Assche (1): percpu-refcount: Introduce percpu_ref_resurrect()
Bijan Mottahedeh (13): io_uring: clear req->result always before issuing a read/write request io_uring: process requests completed with -EAGAIN on poll list io_uring: use proper references for fallback_req locking io_uring: don't use kiocb.private to store buf_index io_uring: add io_statx structure statx: allow system call to be invoked from io_uring io_uring: call statx directly statx: hide interfaces no longer used by io_uring io_uring: validate the full range of provided buffers for access io_uring: add wrappers for memory accounting io_uring: rename ctx->account_mem field io_uring: report pinned memory usage io_uring: separate reporting of ring pages from registered pages
Bob Liu (2): io_uring: clean up io_uring_cancel_files() io_uring: introduce req_need_defer()
Brian Gianforcaro (1): io_uring: fix stale comment and a few typos
Christoph Hellwig (2): fs: add an iopoll method to struct file_operations io_uring: add fsync support
Chucheng Luo (1): io_uring: fix missing 'return' in comment
Colin Ian King (3): io_uring: fix shadowed variable ret return code being not checked io_uring: remove redundant variable pointer nxt and io_wq_assign_next call io_uring: Fix sizeof() mismatch
Damien Le Moal (5): aio: Comment use of IOCB_FLAG_IOPRIO aio flag block: Introduce get_current_ioprio() aio: Fix fallback I/O priority value block: prevent merging of requests with different priorities block: Initialize BIO I/O priority early
Dan Carpenter (3): io-wq: remove extra space characters io_uring: remove unnecessary NULL checks io_uring: fix a use after free in io_async_task_func()
Daniel Xu (1): io_uring: increase IORING_MAX_ENTRIES to 32K
Daniele Albano (1): io_uring: always allow drain/link/hardlink/async sqe flags
Deepa Dinamani (5): signal: Add set_user_sigmask() signal: Add restore_user_sigmask() ppoll: use __kernel_timespec pselect6: use __kernel_timespec io_pgetevents: use __kernel_timespec
Denis Efremov (1): io_uring: use kvfree() in io_sqe_buffer_register()
Dmitrii Dolgov (1): io_uring: add set of tracing events
Dmitry Vyukov (1): io_uring: fix sq array offset calculation
Eric Biggers (1): io_uring: fix memory leak of UNIX domain socket inode
Eric W. Biederman (2): signal/cifs: Fix cifs_put_tcp_session to call send_sig instead of force_sig signal: Allow cifs and drbd to receive their terminating signals
Eugene Syromiatnikov (1): io_uring: fix compat for IORING_REGISTER_FILES_UPDATE
Guoyu Huang (1): io_uring: Fix NULL pointer dereference in loop_rw_iter()
Hillf Danton (6): io-wq: remove unused busy list from io_sqe io-wq: add cond_resched() to worker thread io-uring: drop completion when removing file io-uring: drop 'free_pfile' in struct io_file_put io_uring: add missing finish_wait() in io_sq_thread() io-wq: fix use-after-free in io_wq_worker_running
Hristo Venev (1): io_uring: allocate the two rings together
Hrvoje Zeba (1): io_uring: remove superfluous check for sqe->off in io_accept()
Jackie Liu (17): io_uring: adjust smp_rmb inside io_cqring_events io_uring: use wait_event_interruptible for cq_wait conditional wait io_uring: fix io_sq_thread_stop running in front of io_sq_thread io_uring: fix KASAN use after free in io_sq_wq_submit_work io_uring: fix an issue when IOSQE_IO_LINK is inserted into defer list io_uring: fix wrong sequence setting logic io_uring: add support for link with drain io_uring: use kmemdup instead of kmalloc and memcpy io_uring: fix use-after-free of shadow_req io_uring: fix potential crash issue due to io_get_req failure io_uring: replace s->needs_lock with s->in_async io_uring: set -EINTR directly when a signal wakes up in io_cqring_wait io_uring: remove passed in 'ctx' function parameter ctx if possible io_uring: keep io_put_req only responsible for release and put req io_uring: separate the io_free_req and io_free_req_find_next interface io_uring: remove parameter ctx of io_submit_state_start io_uring: remove io_wq_current_is_worker
Jann Horn (2): io_uring: use kzalloc instead of kcalloc for single-element allocations io-wq: fix handling of NUMA node IDs
Jens Axboe (352): Add io_uring IO interface io_uring: support for IO polling fs: add fget_many() and fput_many() io_uring: use fget/fput_many() for file references io_uring: batch io_kiocb allocation io_uring: add support for pre-mapped user IO buffers net: split out functions related to registering inflight socket files io_uring: add file set registration io_uring: add submission polling io_uring: add io_kiocb ref count io_uring: add support for IORING_OP_POLL io_uring: allow workqueue item to handle multiple buffered requests io_uring: add a few test tools tools/io_uring: remove IOCQE_FLAG_CACHEHIT io_uring: use regular request ref counts io_uring: make io_read/write return an integer io_uring: add prepped flag io_uring: fix fget/fput handling io_uring: fix poll races io_uring: retry bulk slab allocs as single allocs io_uring: fix double free in case of fileset regitration failure io_uring: restrict IORING_SETUP_SQPOLL to root io_uring: park SQPOLL thread if it's percpu io_uring: only test SQPOLL cpu after we've verified it io_uring: drop io_file_put() 'file' argument io_uring: fix possible deadlock between io_uring_{enter,register} io_uring: fix CQ overflow condition io_uring: fail io_uring_register(2) on a dying io_uring instance io_uring: remove 'state' argument from io_{read,write} path io_uring: have submission side sqe errors post a cqe io_uring: drop req submit reference always in async punt fs: add sync_file_range() helper io_uring: add support for marking commands as draining io_uring: add support for IORING_OP_SYNC_FILE_RANGE io_uring: add support for eventfd notifications io_uring: fix failure to verify SQ_AFF cpu io_uring: remove 'ev_flags' argument tools/io_uring: fix Makefile for pthread library link tools/io_uring: sync with liburing io_uring: ensure req->file is cleared on allocation uio: make import_iovec()/compat_import_iovec() return bytes on success io_uring: punt short reads to async context io_uring: add support for sqe links io_uring: add support for sendmsg() io_uring: add support for recvmsg() io_uring: don't use iov_iter_advance() for fixed buffers io_uring: ensure ->list is initialized for poll commands io_uring: fix potential hang with polled IO io_uring: don't enter poll loop if we have CQEs pending io_uring: add need_resched() check in inner poll loop io_uring: expose single mmap capability io_uring: optimize submit_and_wait API io_uring: add io_queue_async_work() helper io_uring: limit parallelism of buffered writes io_uring: extend async work merging io_uring: make sqpoll wakeup possible with getevents io_uring: ensure poll commands clear ->sqe io_uring: use cond_resched() in sqthread io_uring: IORING_OP_TIMEOUT support io_uring: correctly handle non ->{read,write}_iter() file_operations io_uring: make CQ ring wakeups be more efficient io_uring: only flush workqueues on fileset removal io_uring: fix sequence logic for timeout requests io_uring: fix up O_NONBLOCK handling for sockets io_uring: revert "io_uring: optimize submit_and_wait API" io_uring: used cached copies of sq->dropped and cq->overflow io_uring: fix bad inflight accounting for SETUP_IOPOLL|SETUP_SQTHREAD io_uring: don't touch ctx in setup after ring fd install io_uring: run dependent links inline if possible io_uring: allow sparse fixed file sets io_uring: add support for IORING_REGISTER_FILES_UPDATE io_uring: allow application controlled CQ ring size io_uring: add support for absolute timeouts io_uring: add support for canceling timeout requests io-wq: small threadpool implementation for 
io_uring io_uring: replace workqueue usage with io-wq io_uring: io_uring: add support for async work inheriting files net: add __sys_accept4_file() helper io_uring: add support for IORING_OP_ACCEPT io_uring: protect fixed file indexing with array_index_nospec() io_uring: support for larger fixed file sets io_uring: fix race with canceling timeouts io_uring: io_wq_create() returns an error pointer, not NULL io_uring: ensure we clear io_kiocb->result before each issue io_uring: support for generic async request cancel io_uring: add completion trace event io-wq: use proper nesting IRQ disabling spinlocks for cancel io_uring: enable optimized link handling for IORING_OP_POLL_ADD io_uring: fixup a few spots where link failure isn't flagged io_uring: kill dead REQ_F_LINK_DONE flag io_uring: abstract out io_async_cancel_one() helper io_uring: add support for linked SQE timeouts io_uring: make io_cqring_events() take 'ctx' as argument io_uring: pass in io_kiocb to fill/add CQ handlers io_uring: add support for backlogged CQ ring io-wq: io_wqe_run_queue() doesn't need to use list_empty_careful() io-wq: add support for bounded vs unbunded work io_uring: properly mark async work as bounded vs unbounded io_uring: reduce/pack size of io_ring_ctx io_uring: fix error clear of ->file_table in io_sqe_files_register() io_uring: convert accept4() -ERESTARTSYS into -EINTR io_uring: provide fallback request for OOM situations io_uring: make ASYNC_CANCEL work with poll and timeout io_uring: flag SQPOLL busy condition to userspace io_uring: don't do flush cancel under inflight_lock io_uring: fix -ENOENT issue with linked timer with short timeout io_uring: make timeout sequence == 0 mean no sequence io_uring: use correct "is IO worker" helper io_uring: fix potential deadlock in io_poll_wake() io_uring: check for validity of ->rings in teardown io_wq: add get/put_work handlers to io_wq_create() io-wq: ensure we have a stable view of ->cur_work for cancellations io_uring: ensure registered buffer import returns the IO length io-wq: ensure free/busy list browsing see all items io-wq: remove now redundant struct io_wq_nulls_list io_uring: make POLL_ADD/POLL_REMOVE scale better io_uring: io_async_cancel() should pass in 'nxt' request pointer io_uring: cleanup return values from the queueing functions io_uring: make io_double_put_req() use normal completion path io_uring: make req->timeout be dynamically allocated io_uring: fix sequencing issues with linked timeouts io_uring: remove dead REQ_F_SEQ_PREV flag io_uring: correct poll cancel and linked timeout expiration completion io_uring: request cancellations should break links io-wq: wait for io_wq_create() to setup necessary workers io_uring: io_fail_links() should only consider first linked timeout io_uring: io_allocate_scq_urings() should return a sane state io_uring: allow finding next link independent of req reference count io_uring: close lookup gap for dependent next work io_uring: improve trace_io_uring_defer() trace point io_uring: only return -EBUSY for submit on non-flushed backlog net: add __sys_connect_file() helper io_uring: add support for IORING_OP_CONNECT io-wq: have io_wq_create() take a 'data' argument io_uring: async workers should inherit the user creds io-wq: shrink io_wq_work a bit io_uring: make poll->wait dynamically allocated io_uring: fix missing kmap() declaration on powerpc io_uring: use current task creds instead of allocating a new one io_uring: transform send/recvmsg() -ERESTARTSYS to -EINTR io_uring: add general async offload context 
io_uring: ensure async punted read/write requests copy iovec net: separate out the msghdr copy from ___sys_{send,recv}msg() net: disallow ancillary data for __sys_{send,recv}msg_file() io_uring: ensure async punted sendmsg/recvmsg requests copy data io_uring: ensure async punted connect requests copy data io_uring: mark us with IORING_FEAT_SUBMIT_STABLE io_uring: handle connect -EINPROGRESS like -EAGAIN io_uring: allow IO_SQE_* flags on IORING_OP_TIMEOUT io_uring: ensure deferred timeouts copy necessary data io-wq: clear node->next on list deletion io_uring: use hash table for poll command lookups io_uring: allow unbreakable links io-wq: remove worker->wait waitqueue io-wq: briefly spin for new work after finishing work io_uring: sqthread should grab ctx->uring_lock for submissions io_uring: deferred send/recvmsg should assign iov io_uring: don't dynamically allocate poll data io_uring: run next sqe inline if possible io_uring: only hash regular files for async work execution io_uring: add sockets to list of files that support non-blocking issue io_uring: ensure we return -EINVAL on unknown opcode io_uring: fix sporadic -EFAULT from IORING_OP_RECVMSG io-wq: re-add io_wq_current_is_worker() io_uring: fix pre-prepped issue with force_nonblock == true io_uring: remove 'sqe' parameter to the OP helpers that take it io_uring: any deferred command must have stable sqe data io_uring: make IORING_POLL_ADD and IORING_POLL_REMOVE deferrable io_uring: make IORING_OP_CANCEL_ASYNC deferrable io_uring: make IORING_OP_TIMEOUT_REMOVE deferrable io_uring: read opcode and user_data from SQE exactly once io_uring: warn about unhandled opcode io_uring: io_wq_submit_work() should not touch req->rw io_uring: use u64_to_user_ptr() consistently io_uring: add and use struct io_rw for read/writes io_uring: move all prep state for IORING_OP_CONNECT to prep handler io_uring: move all prep state for IORING_OP_{SEND,RECV}_MGS to prep handler io_uring: read 'count' for IORING_OP_TIMEOUT in prep handler io_uring: standardize the prep methods io_uring: pass in 'sqe' to the prep handlers io_uring: remove punt of short reads to async context io_uring: don't setup async context for read/write fixed io-wq: cancel work if we fail getting a mm reference io_uring: be consistent in assigning next work from handler io_uring: ensure workqueue offload grabs ring mutex for poll list io_uring: only allow submit from owning task Revert "io_uring: only allow submit from owning task" io_uring: don't cancel all work on process exit io_uring: add support for fallocate() fs: make build_open_flags() available internally io_uring: add support for IORING_OP_OPENAT io-wq: add support for uncancellable work io_uring: add support for IORING_OP_CLOSE io_uring: avoid ring quiesce for fixed file set unregister and update fs: make two stat prep helpers available io_uring: add support for IORING_OP_STATX io-wq: support concurrent non-blocking work io_uring: add IOSQE_ASYNC io_uring: remove two unnecessary function declarations io_uring: add lookup table for various opcode needs io_uring: split overflow state into SQ and CQ side io_uring: improve poll completion performance io_uring: add non-vectored read/write commands io_uring: allow use of offset == -1 to mean file position io_uring: add IORING_OP_FADVISE mm: make do_madvise() available internally io_uring: add IORING_OP_MADVISE io_uring: wrap multi-req freeing in struct req_batch io_uring: extend batch freeing to cover more cases io_uring: add support for IORING_SETUP_CLAMP io_uring: add support 
for send(2) and recv(2) io_uring: file set registration should use interruptible waits io_uring: change io_ring_ctx bool fields into bit fields io_uring: enable option to only trigger eventfd for async completions io_uring: remove 'fname' from io_open structure io_uring: add opcode to issue trace event io_uring: account fixed file references correctly in batch io_uring: add support for probing opcodes io_uring: file switch work needs to get flushed on exit io_uring: don't attempt to copy iovec for READ/WRITE io-wq: make the io_wq ref counted io_uring/io-wq: don't use static creds/mm assignments io_uring: allow registering credentials io_uring: support using a registered personality for commands io_uring: fix linked command file table usage eventpoll: abstract out epoll_ctl() handler eventpoll: support non-blocking do_epoll_ctl() calls io_uring: add support for epoll_ctl(2) io_uring: add ->show_fdinfo() for the io_uring file descriptor io_uring: prevent potential eventfd recursion on poll io_uring: use the proper helpers for io_send/recv io_uring: don't map read/write iovec potentially twice io_uring: fix sporadic double CQE entry for close io_uring: punt even fadvise() WILLNEED to async context io_uring: spin for sq thread to idle on shutdown io_uring: cleanup fixed file data table references io_uring: statx/openat/openat2 don't support fixed files io_uring: retry raw bdev writes if we hit -EOPNOTSUPP io-wq: add support for inheriting ->fs io_uring: grab ->fs as part of async preparation io_uring: allow AT_FDCWD for non-file openat/openat2/statx io-wq: make io_wqe_cancel_work() take a match handler io-wq: add io_wq_cancel_pid() to cancel based on a specific pid io_uring: cancel pending async work if task exits io_uring: retain sockaddr_storage across send/recvmsg async punt io-wq: don't call kXalloc_node() with non-online node io_uring: prune request from overflow list on flush io_uring: handle multiple personalities in link chains io_uring: fix personality idr leak io-wq: remove spin-for-work optimization io-wq: ensure work->task_pid is cleared on init io_uring: pick up link work on submit reference drop io_uring: import_single_range() returns 0/-ERROR io_uring: drop file set ref put/get on switch io_uring: fix 32-bit compatability with sendmsg/recvmsg io_uring: free fixed_file_data after RCU grace period io_uring: ensure RCU callback ordering with rcu_barrier() io_uring: make sure openat/openat2 honor rlimit nofile io_uring: make sure accept honor rlimit nofile io_uring: consider any io_read/write -EAGAIN as final io_uring: io_accept() should hold on to submit reference on retry io_uring: store io_kiocb in wait->private io_uring: add per-task callback handler io_uring: mark requests that we can do poll async in io_op_defs io_uring: use poll driven retry for files that support it io_uring: buffer registration infrastructure io_uring: add IORING_OP_PROVIDE_BUFFERS io_uring: support buffer selection for OP_READ and OP_RECV io_uring: add IOSQE_BUFFER_SELECT support for IORING_OP_READV net: abstract out normal and compat msghdr import io_uring: add IOSQE_BUFFER_SELECT support for IORING_OP_RECVMSG io_uring: provide means of removing buffers io_uring: add end-of-bits marker and build time verify it io_uring: dual license io_uring.h uapi header io_uring: fix truncated async read/readv and write/writev retry io_uring: honor original task RLIMIT_FSIZE io_uring: retry poll if we got woken with non-matching mask io_uring: grab task reference for poll requests io_uring: use io-wq manager as backup 
task if task is exiting io_uring: remove bogus RLIMIT_NOFILE check in file registration io_uring: ensure openat sets O_LARGEFILE if needed io_uring: punt final io_ring_ctx wait-and-free to workqueue io_uring: correct O_NONBLOCK check for splice punt io_uring: check for need to re-wait in polled async handling io_uring: io_async_task_func() should check and honor cancelation io_uring: only post events in io_poll_remove_all() if we completed some io_uring: statx must grab the file table for valid fd io_uring: enable poll retry for any file with ->read_iter / ->write_iter io_uring: only force async punt if poll based retry can't handle it io_uring: don't use 'fd' for openat/openat2/statx io_uring: polled fixed file must go through free iteration io_uring: initialize ctx->sqo_wait earlier io_uring: remove dead check in io_splice() io_uring: cancel work if task_work_add() fails io_uring: don't add non-IO requests to iopoll pending list io_uring: remove 'fd is io_uring' from close path io_uring: name sq thread and ref completions io_uring: batch reap of dead file registrations io_uring: allow POLL_ADD with double poll_wait() users io_uring: file registration list and lock optimization io_uring: cleanup io_poll_remove_one() logic io_uring: async task poll trigger cleanup io_uring: disallow close of ring itself io_uring: re-set iov base/len for buffer select retry io_uring: allow O_NONBLOCK async retry io_uring: acquire 'mm' for task_work for SQPOLL io_uring: reap poll completions while waiting for refs to drop on exit io_uring: use signal based task_work running io_uring: fix regression with always ignoring signals in io_cqring_wait() io_uring: account user memory freed when exit has been queued io_uring: ensure double poll additions work with both request types io_uring: use TWA_SIGNAL for task_work uncondtionally io_uring: hold 'ctx' reference around task_work queue + execute io_uring: clear req->result on IOPOLL re-issue io_uring: fix IOPOLL -EAGAIN retries io_uring: always delete double poll wait entry on match io_uring: fix potential ABBA deadlock in ->show_fdinfo() io_uring: use type appropriate io_kiocb handler for double poll io_uring: round-up cq size before comparing with rounded sq size io_uring: remove dead 'ctx' argument and move forward declaration io_uring: don't touch 'ctx' after installing file descriptor io_uring: account locked memory before potential error case io_uring: fix imbalanced sqo_mm accounting io_uring: stash ctx task reference for SQPOLL io_uring: ensure consistent view of original task ->mm from SQPOLL io_uring: allow non-fixed files with SQPOLL io_uring: fail poll arm on queue proc failure io_uring: sanitize double poll handling io_uring: ensure open/openat2 name is cleaned on cancelation io_uring: fix error path cleanup in io_sqe_files_register() io_uring: make ctx cancel on exit targeted to actual ctx io_uring: fix SQPOLL IORING_OP_CLOSE cancelation state io_uring: ignore double poll add on the same waitqueue head io_uring: clean up io_kill_linked_timeout() locking io_uring: add missing REQ_F_COMP_LOCKED for nested requests io_uring: provide generic io_req_complete() helper io_uring: add 'io_comp_state' to struct io_submit_state io_uring: pass down completion state on the issue side io_uring: pass in completion state to appropriate issue side handlers io_uring: enable READ/WRITE to use deferred completions io_uring: use task_work for links if possible io_uring: abstract out task work running io_uring: use new io_req_task_work_add() helper throughout io_uring: 
only call kfree() for a non-zero pointer io_uring: get rid of __req_need_defer() io_uring: enable lookup of links holding inflight files io_uring: fix recursive completion locking on oveflow flush io_uring: always plug for any number of IOs io_uring: find and cancel head link async work on files exit io_uring: don't use poll handler if file can't be nonblocking read/written io_uring: don't recurse on tsk->sighand->siglock with signalfd io_uring: defer file table grabbing request cleanup for locked requests
Jiufei Xue (5): io_uring: check file O_NONBLOCK state for accept io_uring: change the poll type to be 32-bits io_uring: use EPOLLEXCLUSIVE flag to aoid thundering herd type behavior io_uring: fix removing the wrong file in __io_sqe_files_update() io_uring: set table->files[i] to NULL when io_sqe_file_register failed
Joseph Qi (1): io_uring: fix shift-out-of-bounds when round up cq size
LimingWu (1): io_uring: fix a typo in a comment
Lukas Bulwahn (1): io_uring: make spdxcheck.py happy
Marcelo Diop-Gonzalez (1): io_uring: flush timeouts that should already have expired
Mark Rutland (3): io_uring: fix SQPOLL cpu validation io_uring: free allocated io_memory once io_uring: avoid page allocation warnings
Nathan Chancellor (1): io_uring: Ensure mask is initialized in io_arm_poll_handler
Oleg Nesterov (6): signal: remove the wrong signal_pending() check in restore_user_sigmask() signal: simplify set_user_sigmask/restore_user_sigmask select: change do_poll() to return -ERESTARTNOHAND rather than -EINTR select: shift restore_saved_sigmask_unless() into poll_select_copy_remaining() task_work_run: don't take ->pi_lock unconditionally task_work: teach task_work_add() to do signal_wake_up()
Pavel Begunkov (247): io_uring: Fix __io_uring_register() false success io_uring: fix reversed nonblock flag for link submission io_uring: remove wait loop spurious wakeups io_uring: Fix corrupted user_data io_uring: Fix broken links with offloading io_uring: Fix race for sqes with userspace io_uring: Fix leaked shadow_req io_uring: remove index from sqe_submit io_uring: Fix mm_fault with READ/WRITE_FIXED io_uring: Merge io_submit_sqes and io_ring_submit io_uring: io_queue_link*() right after submit io_uring: allocate io_kiocb upfront io_uring: Use submit info inlined into req io_uring: use inlined struct sqe_submit io_uring: Fix getting file for timeout io_uring: Fix getting file for non-fd opcodes io_uring: break links for failed defer io_uring: remove redundant check io_uring: Fix leaking linked timeouts io_uring: Always REQ_F_FREE_SQE for allocated sqe io_uring: drain next sqe instead of shadowing io_uring: rename __io_submit_sqe() io_uring: add likely/unlikely in io_get_sqring() io_uring: remove io_free_req_find_next() io_uring: pass only !null to io_req_find_next() io_uring: simplify io_req_link_next() io_uring: only !null ptr to io_issue_sqe() io_uring: fix dead-hung for non-iter fixed rw io_uring: store timeout's sqe->off in proper place io_uring: inline struct sqe_submit io_uring: cleanup io_import_fixed() io_uring: fix error handling in io_queue_link_head io_uring: hook all linked requests via link_list io_uring: make HARDLINK imply LINK io_uring: don't wait when under-submitting io_uring: rename prev to head io_uring: move *queue_link_head() from common path pcpu_ref: add percpu_ref_tryget_many() io_uring: batch getting pcpu references io_uring: clamp to_submit in io_submit_sqes() io_uring: optimise head checks in io_get_sqring() io_uring: optimise commit_sqring() for common case io_uring: remove extra io_wq_current_is_worker() io_uring: optimise use of ctx->drain_next io_uring: remove extra check in __io_commit_cqring io_uring: hide uring_fd in ctx io_uring: remove REQ_F_IO_DRAINED io_uring: optimise sqe-to-req flags translation io_uring: use labeled array init in io_op_defs io_uring: prep req when do IOSQE_ASYNC io_uring: honor IOSQE_ASYNC for linked reqs io_uring: add comment for drain_next io_uring: fix refcounting with batched allocations at OOM io-wq: allow grabbing existing io-wq io_uring: add io-wq workqueue sharing io_uring: remove extra ->file check io_uring: iterate req cache backwards io_uring: put the flag changing code in the same spot io_uring: get rid of delayed mm check io_uring: fix deferred req iovec leak io_uring: remove unused struct io_async_open io_uring: fix iovec leaks io_uring: add cleanup for openat()/statx() io_uring: fix async close() with f_op->flush() io_uring: fix double prep iovec leak io_uring: fix openat/statx's filename leak io_uring: add missing io_req_cancelled() io_uring: fix use-after-free by io_cleanup_req() io-wq: fix IO_WQ_WORK_NO_CANCEL cancellation io-wq: remove io_wq_flush and IO_WQ_WORK_INTERNAL io_uring: fix lockup with timeouts io_uring: NULL-deref for IOSQE_{ASYNC,DRAIN} io_uring: don't call work.func from sync ctx io_uring: don't do full *prep_worker() from io-wq io_uring: remove req->in_async splice: make do_splice public io_uring: add interface for getting files io_uring: add splice(2) support io_uring: clean io_poll_complete io_uring: extract kmsg copy helper io-wq: remove unused IO_WQ_WORK_HAS_MM io_uring: remove IO_WQ_WORK_CB io-wq: use BIT for ulong hash io_uring: remove extra nxt check after punt io_uring: remove 
io_prep_next_work() io_uring: clean up io_close io_uring: make submission ref putting consistent io_uring: remove @nxt from handlers io_uring: get next work with submission ref drop io-wq: shuffle io_worker_handle_work() code io-wq: optimise locking in io_worker_handle_work() io-wq: optimise out *next_work() double lock io_uring/io-wq: forward submission ref to async io-wq: remove duplicated cancel code io-wq: don't resched if there is no work io-wq: split hashing and enqueueing io-wq: hash dependent work io-wq: close cancel gap for hashed linked work io_uring: Fix ->data corruption on re-enqueue io-wq: handle hashed writes in chains io_uring: fix ctx refcounting in io_submit_sqes() io_uring: simplify io_get_sqring io_uring: alloc req only after getting sqe io_uring: remove req init from io_get_req() io_uring: don't read user-shared sqe flags twice io_uring: fix fs cleanup on cqe overflow io_uring: remove obsolete @mm_fault io_uring: track mm through current->mm io_uring: early submission req fail code io_uring: keep all sqe->flags in req->flags io_uring: move all request init code in one place io_uring: fix cached_sq_head in io_timeout() io_uring: kill already cached timeout.seq_offset io_uring: don't count rqs failed after current one io_uring: fix extra put in sync_file_range() io_uring: check non-sync defer_list carefully io_uring: punt splice async because of inode mutex splice: move f_mode checks to do_{splice,tee}() io_uring: fix zero len do_splice() io_uring: don't prepare DRAIN reqs twice io_uring: fix FORCE_ASYNC req preparation io_uring: remove req->needs_fixed_files io_uring: rename io_file_put() io_uring: don't repeat valid flag list splice: export do_tee() io_uring: add tee(2) support io_uring: fix flush req->refs underflow io_uring: simplify io_timeout locking io_uring: don't re-read sqe->off in timeout_prep() io_uring: separate DRAIN flushing into a cold path io_uring: get rid of manual punting in io_close io_uring: move timeouts flushing to a helper io_uring: off timeouts based only on completions io_uring: fix overflowed reqs cancellation io_uring: fix {SQ,IO}POLL with unsupported opcodes io_uring: move send/recv IOPOLL check into prep io_uring: don't derive close state from ->func io_uring: remove custom ->func handlers io_uring: don't arm a timeout through work.func io_wq: add per-wq work handler instead of per work io_uring: fix lazy work init io-wq: reorder cancellation pending -> running io-wq: add an option to cancel all matched reqs io_uring: cancel all task's requests on exit io_uring: batch cancel in io_uring_cancel_files() io_uring: lazy get task io_uring: cancel by ->task not pid io-wq: compact io-wq flags numbers io-wq: return next work from ->do_work() directly io_uring: fix hanging iopoll in case of -EAGAIN io_uring: fix current->mm NULL dereference on exit io_uring: fix missing msg_name assignment io_uring: fix not initialised work->flags io_uring: fix recvmsg memory leak with buffer selection io_uring: missed req_init_async() for IOSQE_ASYNC io_uring: fix ->work corruption with poll_add io_uring: fix lockup in io_fail_links() io_uring: rename sr->msg into umsg io_uring: use more specific type in rcv/snd msg cp io_uring: extract io_sendmsg_copy_hdr() io_uring: simplify io_req_map_rw() io_uring: add a helper for async rw iovec prep io_uring: fix potential use after free on fallback request free io_uring: fix stopping iopoll'ing too early io_uring: briefly loose locks while reaping events io_uring: partially inline io_iopoll_getevents() io_uring: fix racy 
overflow count reporting io-wq: fix hang after cancelling pending hashed work io_uring: clean file_data access in files_register io_uring: refactor *files_register()'s error paths io_uring: keep a pointer ref_node in file_data io_uring: fix double poll mask init io_uring: fix recvmsg setup with compat buf-select io_uring: fix NULL-mm for linked reqs io_uring: fix missing ->mm on exit io_uring: return locked and pinned page accounting io_uring: don't burn CPU for iopoll on exit io_uring: don't miscount pinned memory io_uring: fix provide_buffers sign extension io_uring: fix stalled deferred requests io_uring: kill REQ_F_LINK_NEXT io_uring: deduplicate freeing linked timeouts io_uring: fix refs underflow in io_iopoll_queue() io_uring: remove inflight batching in free_many() io_uring: dismantle req early and remove need_iter io_uring: batch-free linked requests as well io_uring: cosmetic changes for batch free io_uring: clean up req->result setting by rw io_uring: do task_work_run() during iopoll io_uring: fix NULL mm in io_poll_task_func() io_uring: simplify io_async_task_func() io_uring: fix req->work corruption io_uring: fix punting req w/o grabbed env io_uring: fix feeding io-wq with uninit reqs io_uring: don't mark link's head for_async io_uring: fix missing io_grab_files() io_uring: replace find_next() out param with ret io_uring: kill REQ_F_TIMEOUT io_uring: kill REQ_F_TIMEOUT_NOSEQ io_uring: optimise io_req_find_next() fast check io_uring: remove setting REQ_F_MUST_PUNT in rw io_uring: remove REQ_F_MUST_PUNT io_uring: set @poll->file after @poll init io_uring: don't pass def into io_req_work_grab_env io_uring: do init work in grab_env() io_uring: factor out grab_env() from defer_prep() io_uring: do grab_env() just before punting io_uring: fix mis-refcounting linked timeouts io_uring: keep queue_sqe()'s fail path separately io_uring: fix lost cqe->flags io_uring: don't delay iopoll'ed req completion io_uring: remove nr_events arg from iopoll_check() io_uring: share completion list w/ per-op space io_uring: rename ctx->poll into ctx->iopoll io_uring: use inflight_entry list for iopoll'ing io_uring: use completion list for CQ overflow io_uring: add req->timeout.list io_uring: remove init for unused list io_uring: use non-intrusive list for defer io_uring: remove sequence from io_kiocb io_uring: place cflags into completion data io_uring: fix cancel of deferred reqs with ->files io_uring: fix linked deferred ->files cancellation io_uring: fix racy IOPOLL completions io_uring: inline io_req_work_grab_env() io_uring: alloc ->io in io_req_defer_prep() io_uring/io-wq: move RLIMIT_FSIZE to io-wq io_uring: mark ->work uninitialised after cleanup io_uring: follow **iovec idiom in io_import_iovec io_uring: de-unionise io_kiocb io_uring: consolidate *_check_overflow accounting io_uring: get rid of atomic FAA for cq_timeouts io-wq: update hash bits io_uring: indent left {send,recv}[msg]() io_uring: remove extra checks in send/recv io_uring: don't forget cflags in io_recv() io_uring: free selected-bufs if error'ed io_uring: move BUFFER_SELECT check into *recv[msg] io_uring: simplify file ref tracking in submission state io_uring: extract io_put_kbuf() helper io_uring: don't open-code recv kbuf managment io_uring: don't do opcode prep twice io_uring: deduplicate io_grab_files() calls io_uring: fix missing io_queue_linked_timeout() tasks: add put_task_struct_many() io_uring: batch put_task_struct() io_uring: fix racy req->flags modification
Randy Dunlap (2): io_uring: fix 1-bit bitfields to be unsigned io_uring: fix function args for !CONFIG_NET
Roman Gushchin (1): percpu_ref: introduce PERCPU_REF_ALLOW_REINIT flag
Roman Penyaev (3): io_uring: offload write to async worker in case of -EAGAIN io_uring: fix infinite wait in khread_park() on io_finish_async() io_uring: add mapping support for NOMMU archs
Shenghui Wang (1): io_uring: use cpu_online() to check p->sq_thread_cpu instead of cpu_possible()
Stefan Bühler (13): io_uring: fix race condition reading SQ entries io_uring: fix race condition when sq threads goes sleeping io_uring: fix poll full SQ detection io_uring: fix handling SQEs requesting NOWAIT io_uring: fix notes on barriers io_uring: remove unnecessary barrier before wq_has_sleeper io_uring: remove unnecessary barrier before reading cq head io_uring: remove unnecessary barrier after updating SQ head io_uring: remove unnecessary barrier before reading SQ tail io_uring: remove unnecessary barrier after incrementing dropped counter io_uring: remove unnecessary barrier after unsetting IORING_SQ_NEED_WAKEUP req->error only used for iopoll io_uring: fix race condition reading SQE data
Stefan Metzmacher (1): io_uring: add BUILD_BUG_ON() to assert the layout of struct io_uring_sqe
Stefano Garzarella (4): io_uring: flush overflowed CQ events in the io_uring_poll() io_uring: prevent sq_thread from spinning when it should stop io_uring: add 'cq_flags' field for the CQ ring io_uring: add IORING_CQ_EVENTFD_DISABLED to the CQ ring flags
Steve French (1): cifs: fix rmmod regression in cifs.ko caused by force_sig changes
Thomas Gleixner (2): sched: Remove stale PF_MUTEX_TESTER bit sched/core, workqueues: Distangle worker accounting from rq lock
Tobias Klauser (1): io_uring: define and set show_fdinfo only if procfs is enabled
Xiaoguang Wang (24): io_uring: fix __io_iopoll_check deadlock in io_sq_thread io_uring: fix poll_list race for SETUP_IOPOLL|SETUP_SQPOLL io_uring: io_uring_enter(2) don't poll while SETUP_IOPOLL|SETUP_SQPOLL enabled io_uring: cleanup io_alloc_async_ctx() io_uring: refactor file register/unregister/update handling io_uring: initialize fixed_file_data lock io_uring: do not always copy iovec in io_req_map_rw() io_uring: restore req->work when canceling poll request io_uring: only restore req->work for req that needs do completion io_uring: use cond_resched() in io_ring_ctx_wait_and_kill() io_uring: fix mismatched finish_wait() calls in io_uring_cancel_files() io_uring: handle -EFAULT properly in io_uring_setup() io_uring: reset -EBUSY error when io sq thread is waken up io_uring: remove obsolete 'state' parameter io_uring: don't submit sqes when ctx->refs is dying io_uring: avoid whole io_wq_work copy for requests completed inline io_uring: avoid unnecessary io_wq_work copy for fast poll feature io_uring: fix io_kiocb.flags modification race in IOPOLL mode io_uring: don't fail links for EAGAIN error in IOPOLL mode io_uring: add memory barrier to synchronize io_kiocb's result and iopoll_completed io_uring: fix possible race condition against REQ_F_NEED_CLEANUP io_uring: export cq overflow status to userspace io_uring: clear IORING_SQ_NEED_WAKEUP after executing task works io_uring: always let io_iopoll_complete() complete polled io
Xiaoming Ni (1): io_uring: remove duplicate semicolon at the end of line
Xuan Zhuo (1): io_uring: fix io_sq_thread no schedule when busy
Yang Yingliang (2): io_uring: fix memleak in __io_sqe_files_update() io_uring: fix memleak in io_sqe_files_register()
YueHaibing (3): io-wq: use kfree_rcu() to simplify the code io_uring: Remove unnecessary null check io_uring: Fix unused function warnings
Zhengyuan Liu (4): io_uring: fix the sequence comparison in io_sequence_defer io_uring: fix counter inc/dec mismatch in async_list io_uring: add a memory barrier before atomic_read io_uring: track io length in async_list based on bytes
yangerkun (9): fs: fix kabi change since add iopoll io_uring: compare cached_cq_tail with cq.head in_io_uring_poll io_uring: consider the overflow of sequence for timeout req io_uring: fix logic error in io_timeout fs: introduce __close_fd_get_file to support IORING_OP_CLOSE for io_uring fs: make filename_lookup available externally x86: fix kabi with io_uring interface arm64: fix kabi with io_uring interface io_uring: add IORING_OP_OPENAT2 for compatablity
zhangyi (F) (2): io_uring : correct timeout req sequence when waiting timeout io_uring: correct timeout req sequence when inserting a new entry
 Documentation/filesystems/vfs.txt        |    3 +
 arch/arm64/include/asm/syscall_wrapper.h |    5 +
 arch/arm64/kernel/syscall.c              |    9 +-
 arch/x86/entry/common.c                  |    7 +
 arch/x86/include/asm/syscall_wrapper.h   |    3 +
 block/blk-core.c                         |   12 +-
 block/blk-merge.c                        |    7 +-
 drivers/block/drbd/drbd_main.c           |    2 +
 fs/Kconfig                               |    3 +
 fs/Makefile                              |    2 +
 fs/aio.c                                 |  157 +-
 fs/cifs/connect.c                        |    3 +-
 fs/eventpoll.c                           |  143 +-
 fs/file.c                                |   53 +-
 fs/file_table.c                          |    9 +-
 fs/internal.h                            |    9 +
 fs/io-wq.c                               | 1158 +++
 fs/io-wq.h                               |  152 +
 fs/io_uring.c                            | 8811 ++++++++++++++++++++++
 fs/namei.c                               |    4 +-
 fs/open.c                                |    2 +-
 fs/select.c                              |  376 +-
 fs/splice.c                              |   62 +-
 fs/stat.c                                |   65 +-
 fs/sync.c                                |  141 +-
 include/linux/compat.h                   |   19 +
 include/linux/eventpoll.h                |    9 +
 include/linux/fdtable.h                  |    1 +
 include/linux/file.h                     |    3 +
 include/linux/fs.h                       |   22 +-
 include/linux/ioprio.h                   |   13 +
 include/linux/mm.h                       |    1 +
 include/linux/percpu-refcount.h          |   36 +-
 include/linux/sched.h                    |    2 +-
 include/linux/sched/jobctl.h             |    4 +-
 include/linux/sched/signal.h             |   12 +-
 include/linux/sched/task.h               |    6 +
 include/linux/sched/user.h               |    2 +-
 include/linux/signal.h                   |   15 +-
 include/linux/socket.h                   |   25 +
 include/linux/splice.h                   |    6 +
 include/linux/syscalls.h                 |   28 +-
 include/linux/task_work.h                |    5 +-
 include/linux/uio.h                      |    4 +-
 include/net/af_unix.h                    |    1 +
 include/net/compat.h                     |    3 +
 include/trace/events/io_uring.h          |  495 ++
 include/uapi/linux/aio_abi.h             |    2 +
 include/uapi/linux/io_uring.h            |  294 +
 init/Kconfig                             |   10 +
 kernel/sched/core.c                      |   96 +-
 kernel/signal.c                          |   63 +-
 kernel/sys_ni.c                          |    3 +
 kernel/task_work.c                       |   34 +-
 kernel/workqueue.c                       |   54 +-
 kernel/workqueue_internal.h              |    5 +-
 lib/iov_iter.c                           |   15 +-
 lib/percpu-refcount.c                    |   28 +-
 mm/madvise.c                             |    7 +-
 net/Makefile                             |    2 +-
 net/compat.c                             |   31 +-
 net/socket.c                             |  298 +-
 net/unix/Kconfig                         |    5 +
 net/unix/Makefile                        |    2 +
 net/unix/af_unix.c                       |   63 +-
 net/unix/garbage.c                       |   68 +-
 net/unix/scm.c                           |  151 +
 net/unix/scm.h                           |   10 +
 tools/io_uring/Makefile                  |   18 +
 tools/io_uring/README                    |   29 +
 tools/io_uring/barrier.h                 |   16 +
 tools/io_uring/io_uring-bench.c          |  592 ++
 tools/io_uring/io_uring-cp.c             |  260 +
 tools/io_uring/liburing.h                |  187 +
 tools/io_uring/queue.c                   |  156 +
 tools/io_uring/setup.c                   |  107 +
 tools/io_uring/syscall.c                 |   52 +
 77 files changed, 13739 insertions(+), 829 deletions(-)
 create mode 100644 fs/io-wq.c
 create mode 100644 fs/io-wq.h
 create mode 100644 fs/io_uring.c
 create mode 100644 include/trace/events/io_uring.h
 create mode 100644 include/uapi/linux/io_uring.h
 create mode 100644 net/unix/scm.c
 create mode 100644 net/unix/scm.h
 create mode 100644 tools/io_uring/Makefile
 create mode 100644 tools/io_uring/README
 create mode 100644 tools/io_uring/barrier.h
 create mode 100644 tools/io_uring/io_uring-bench.c
 create mode 100644 tools/io_uring/io_uring-cp.c
 create mode 100644 tools/io_uring/liburing.h
 create mode 100644 tools/io_uring/queue.c
 create mode 100644 tools/io_uring/setup.c
 create mode 100644 tools/io_uring/syscall.c
From: "Eric W. Biederman" <ebiederm@xmission.com>

stable inclusion
from linux-4.19.99
commit e6a13c753f912564256d81f7036f9c524b1ef8ae
bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27

---------------------------
[ Upstream commit 72abe3bcf0911d69b46c1e8bdb5612675e0ac42c ]
The locking in force_sig_info is not prepared to deal with a task that exits or execs (as sighand may change). This is not a locking problem in force_sig, as force_sig is only built to handle synchronous exceptions.

Further, the function force_sig_info changes the signal state if the signal is ignored or blocked, or if SIGNAL_UNKILLABLE would prevent the delivery of the signal. The signal SIGKILL cannot be ignored, cannot be blocked, and SIGNAL_UNKILLABLE won't prevent it from being delivered.
So using force_sig rather than send_sig for SIGKILL is confusing and pointless.
Because it won't impact the sending of the signal, and because using force_sig is wrong, replace force_sig with send_sig.
Cc: Namjae Jeon <namjae.jeon@samsung.com>
Cc: Jeff Layton <jlayton@primarydata.com>
Cc: Steve French <smfrench@gmail.com>
Fixes: a5c3e1c725af ("Revert "cifs: No need to send SIGKILL to demux_thread during umount"")
Fixes: e7ddee9037e7 ("cifs: disable sharing session and tcon and add new TCP sharing code")
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
[io_uring need allow_kernel_signal]
Signed-off-by: yangerkun <yangerkun@huawei.com>
Reviewed-by: zhangyi (F) <yi.zhang@huawei.com>
Signed-off-by: Cheng Jian <cj.chengjian@huawei.com>
---
 fs/cifs/connect.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/fs/cifs/connect.c b/fs/cifs/connect.c
index 423bc5b481e4..d4b8f0ecf50a 100644
--- a/fs/cifs/connect.c
+++ b/fs/cifs/connect.c
@@ -2458,7 +2458,7 @@ cifs_put_tcp_session(struct TCP_Server_Info *server, int from_reconnect)
 
 	task = xchg(&server->tsk, NULL);
 	if (task)
-		force_sig(SIGKILL, task);
+		send_sig(SIGKILL, task, 1);
 }
 
 static struct TCP_Server_Info *
From: Steve French <stfrench@microsoft.com>

stable inclusion
from linux-4.19.99
commit 7f6a96dd8223796ffae4dd251be3bff161a28a4b
bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27

---------------------------
[ Upstream commit 247bc9470b1eeefc7b58cdf2c39f2866ba651509 ]
Fixes: 72abe3bcf091 ("signal/cifs: Fix cifs_put_tcp_session to call send_sig instead of force_sig")
The global change away from force_sig caused module unloading of cifs.ko to fail: since the cifsd process could no longer be killed, "rmmod cifs" would now always fail.
Signed-off-by: Steve French <stfrench@microsoft.com>
Reviewed-by: Ronnie Sahlberg <lsahlber@redhat.com>
CC: Eric W. Biederman <ebiederm@xmission.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
[io_uring need allow_kernel_signal]
Signed-off-by: yangerkun <yangerkun@huawei.com>
Reviewed-by: zhangyi (F) <yi.zhang@huawei.com>
Signed-off-by: Cheng Jian <cj.chengjian@huawei.com>
---
 fs/cifs/connect.c | 1 +
 1 file changed, 1 insertion(+)
diff --git a/fs/cifs/connect.c b/fs/cifs/connect.c
index d4b8f0ecf50a..ef7e71b904df 100644
--- a/fs/cifs/connect.c
+++ b/fs/cifs/connect.c
@@ -974,6 +974,7 @@ cifs_demultiplex_thread(void *p)
 	mempool_resize(cifs_req_poolp, length + cifs_min_rcv);
 
 	set_freezable();
+	allow_signal(SIGKILL);
 	while (server->tcpStatus != CifsExiting) {
 		if (try_to_freeze())
 			continue;
From: "Eric W. Biederman" <ebiederm@xmission.com>

stable inclusion
from linux-4.19.99
commit 6db0e28b893aa28af3f7c0197749a5d9cbfded5c
bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27

---------------------------
[ Upstream commit 33da8e7c814f77310250bb54a9db36a44c5de784 ]
My recent change to only use force_sig for synchronous events wound up breaking signal reception in cifs and drbd. I had overlooked the fact that by default kthreads start out with all signals set to SIG_IGN. So a change I thought was safe turned out to have made it impossible for those kernel threads to catch their signals.

Reverting the work on force_sig is a bad idea because what the code was doing was very much a misuse of force_sig: the way force_sig ultimately allowed the signal to happen was to change the signal handler to SIG_DFL, which after the first signal would allow userspace to send signals to these kernel threads. At least for wake_ack_receiver in drbd that does not appear actively wrong.

So correct this problem by adding allow_kernel_signal, which will let through signals whose siginfo reports they were sent by the kernel but will not allow userspace-generated signals, and update cifs and drbd to call allow_kernel_signal in an appropriate place so that their threads can receive this signal.
Fixing things this way ensures that userspace won't be able to send signals and cause problems, that it is clear which signals the threads are expecting to receive, and it guarantees that nothing else in the system will be affected.
This change was partly inspired by similar cifs and drbd patches that added allow_signal.
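To make the intended usage concrete, here is a minimal sketch of a kernel thread that opts in to kernel-sent SIGKILL; the thread body and its name are hypothetical, only allow_kernel_signal() itself comes from this patch:

#include <linux/kthread.h>
#include <linux/sched/signal.h>
#include <linux/signal.h>

/* Hypothetical kthread: reacts to a kernel-sent SIGKILL (e.g. from
 * send_sig(SIGKILL, task, 1)); userspace-generated SIGKILLs remain
 * ignored because the handler is SIG_KTHREAD_KERNEL, not SIG_DFL. */
static int example_thread(void *data)
{
	allow_kernel_signal(SIGKILL);

	while (!kthread_should_stop()) {
		if (signal_pending(current))
			break;		/* the kernel asked us to exit */
		schedule_timeout_interruptible(HZ);
	}
	return 0;
}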
Reported-by: ronnie sahlberg <ronniesahlberg@gmail.com>
Reported-by: Christoph Böhmwalder <christoph.boehmwalder@linbit.com>
Tested-by: Christoph Böhmwalder <christoph.boehmwalder@linbit.com>
Cc: Steve French <smfrench@gmail.com>
Cc: Philipp Reisner <philipp.reisner@linbit.com>
Cc: David Laight <David.Laight@ACULAB.COM>
Fixes: 247bc9470b1e ("cifs: fix rmmod regression in cifs.ko caused by force_sig changes")
Fixes: 72abe3bcf091 ("signal/cifs: Fix cifs_put_tcp_session to call send_sig instead of force_sig")
Fixes: fee109901f39 ("signal/drbd: Use send_sig not force_sig")
Fixes: 3cf5d076fb4d ("signal: Remove task parameter from force_sig")
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
[io_uring need allow_kernel_signal]
Signed-off-by: yangerkun <yangerkun@huawei.com>
Reviewed-by: zhangyi (F) <yi.zhang@huawei.com>
Signed-off-by: Cheng Jian <cj.chengjian@huawei.com>
---
 drivers/block/drbd/drbd_main.c |  2 ++
 fs/cifs/connect.c              |  2 +-
 include/linux/signal.h         | 15 ++++++++++++++-
 kernel/signal.c                |  5 +++++
 4 files changed, 22 insertions(+), 2 deletions(-)
diff --git a/drivers/block/drbd/drbd_main.c b/drivers/block/drbd/drbd_main.c
index a49a8d91a599..5e3885f5729b 100644
--- a/drivers/block/drbd/drbd_main.c
+++ b/drivers/block/drbd/drbd_main.c
@@ -334,6 +334,8 @@ static int drbd_thread_setup(void *arg)
 		 thi->name[0],
 		 resource->name);
 
+	allow_kernel_signal(DRBD_SIGKILL);
+	allow_kernel_signal(SIGXCPU);
 restart:
 	retval = thi->function(thi);
 
diff --git a/fs/cifs/connect.c b/fs/cifs/connect.c
index ef7e71b904df..907be252c5d4 100644
--- a/fs/cifs/connect.c
+++ b/fs/cifs/connect.c
@@ -974,7 +974,7 @@ cifs_demultiplex_thread(void *p)
 	mempool_resize(cifs_req_poolp, length + cifs_min_rcv);
 
 	set_freezable();
-	allow_signal(SIGKILL);
+	allow_kernel_signal(SIGKILL);
 	while (server->tcpStatus != CifsExiting) {
 		if (try_to_freeze())
 			continue;
diff --git a/include/linux/signal.h b/include/linux/signal.h
index e4d01469ed60..0be5ce2375cb 100644
--- a/include/linux/signal.h
+++ b/include/linux/signal.h
@@ -272,6 +272,9 @@ extern void signal_setup_done(int failed, struct ksignal *ksig, int stepping);
 extern void exit_signals(struct task_struct *tsk);
 extern void kernel_sigaction(int, __sighandler_t);
 
+#define SIG_KTHREAD ((__force __sighandler_t)2)
+#define SIG_KTHREAD_KERNEL ((__force __sighandler_t)3)
+
 static inline void allow_signal(int sig)
 {
 	/*
@@ -279,7 +282,17 @@ static inline void allow_signal(int sig)
 	 * know it'll be handled, so that they don't get converted to
 	 * SIGKILL or just silently dropped.
 	 */
-	kernel_sigaction(sig, (__force __sighandler_t)2);
+	kernel_sigaction(sig, SIG_KTHREAD);
+}
+
+static inline void allow_kernel_signal(int sig)
+{
+	/*
+	 * Kernel threads handle their own signals. Let the signal code
+	 * know signals sent by the kernel will be handled, so that they
+	 * don't get silently dropped.
+	 */
+	kernel_sigaction(sig, SIG_KTHREAD_KERNEL);
 }
 
 static inline void disallow_signal(int sig)
diff --git a/kernel/signal.c b/kernel/signal.c
index deba77ef0573..5ded8c6ac789 100644
--- a/kernel/signal.c
+++ b/kernel/signal.c
@@ -88,6 +88,11 @@ static bool sig_task_ignored(struct task_struct *t, int sig, bool force)
 	    handler == SIG_DFL && !(force && sig_kernel_only(sig)))
 		return true;
 
+	/* Only allow kernel generated signals to this kthread */
+	if (unlikely((t->flags & PF_KTHREAD) &&
+		     (handler == SIG_KTHREAD_KERNEL) && !force))
+		return true;
+
 	return sig_handler_ignored(handler, sig);
 }
From: Christoph Hellwig <hch@lst.de>

mainline inclusion
from mainline-5.1-rc1
commit fb7e160019f4abb4082740bfeb27a38f6389c745
category: feature
bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27
CVE: NA

---------------------------
This new method is used to explicitly poll for I/O completion for an iocb. It must be called for any iocb submitted asynchronously (that is, with a non-null ki_complete) which has the IOCB_HIPRI flag set.
The method is assisted by a new ki_cookie field in struct iocb to store the polling cookie.
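For illustration (not part of this patch), a block device implementation of this method ends up being a thin wrapper around blk_poll(); the sketch below is close to what mainline's fs/block_dev.c later used, and assumes the ki_cookie field described above (this backport drops ki_cookie for KABI reasons, see the follow-up patch, so the sketch applies to mainline only):

#include <linux/blkdev.h>
#include <linux/fs.h>

/* Sketch: poll the device queue for the cookie that submit_bio()
 * returned at submission time and that was stashed in ki_cookie. */
static int blkdev_iopoll_sketch(struct kiocb *kiocb, bool spin)
{
	struct block_device *bdev = I_BDEV(kiocb->ki_filp->f_mapping->host);
	struct request_queue *q = bdev_get_queue(bdev);

	return blk_poll(q, READ_ONCE(kiocb->ki_cookie), spin);
}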
Reviewed-by: Hannes Reinecke <hare@suse.com>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

Conflicts:
[Adding ki_cookie to struct kiocb would change KABI and cannot be
worked around, so support for block-layer polling is dropped.]

Signed-off-by: yangerkun <yangerkun@huawei.com>
Reviewed-by: zhangyi (F) <yi.zhang@huawei.com>
Signed-off-by: Cheng Jian <cj.chengjian@huawei.com>
---
 Documentation/filesystems/vfs.txt | 3 +++
 include/linux/fs.h                | 1 +
 2 files changed, 4 insertions(+)
diff --git a/Documentation/filesystems/vfs.txt b/Documentation/filesystems/vfs.txt
index a6c6a8af48a2..0fe9c0dd3269 100644
--- a/Documentation/filesystems/vfs.txt
+++ b/Documentation/filesystems/vfs.txt
@@ -857,6 +857,7 @@ struct file_operations {
 	ssize_t (*write) (struct file *, const char __user *, size_t, loff_t *);
 	ssize_t (*read_iter) (struct kiocb *, struct iov_iter *);
 	ssize_t (*write_iter) (struct kiocb *, struct iov_iter *);
+	int (*iopoll)(struct kiocb *kiocb, bool spin);
 	int (*iterate) (struct file *, struct dir_context *);
 	int (*iterate_shared) (struct file *, struct dir_context *);
 	__poll_t (*poll) (struct file *, struct poll_table_struct *);
@@ -901,6 +902,8 @@ otherwise noted.
 
   write_iter: possibly asynchronous write with iov_iter as source
 
+  iopoll: called when aio wants to poll for completions on HIPRI iocbs
+
   iterate: called when the VFS needs to read the directory contents
 
   iterate_shared: called when the VFS needs to read the directory contents
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 118021c316da..63748acb1444 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -1776,6 +1776,7 @@ struct file_operations {
 	ssize_t (*write) (struct file *, const char __user *, size_t, loff_t *);
 	ssize_t (*read_iter) (struct kiocb *, struct iov_iter *);
 	ssize_t (*write_iter) (struct kiocb *, struct iov_iter *);
+	int (*iopoll)(struct kiocb *kiocb, bool spin);
 	int (*iterate) (struct file *, struct dir_context *);
 	int (*iterate_shared) (struct file *, struct dir_context *);
 	__poll_t (*poll) (struct file *, struct poll_table_struct *);
From: yangerkun <yangerkun@huawei.com>

hulk inclusion
category: feature
bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27
CVE: NA

---------------------------
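This moves ->iopoll() into a KABI-reserved slot: genksyms computes symbol CRCs with __GENKSYMS__ defined, so the checksum run still sees the KABI_RESERVE(1) padding while real builds get the new member in its place. The general shape of the trick, with hypothetical struct and member names:

struct example_ops {
	int (*existing_op)(struct file *);
#ifndef __GENKSYMS__
	int (*new_op)(struct kiocb *, bool);	/* real builds: new member */
#else
	KABI_RESERVE(1)				/* genksyms: old padding */
#endif
	KABI_RESERVE(2)
};

Since a function pointer and the reserved slot have the same size, the structure layout is unchanged for existing modules.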
Signed-off-by: yangerkun <yangerkun@huawei.com>
Reviewed-by: zhangyi (F) <yi.zhang@huawei.com>
Signed-off-by: Cheng Jian <cj.chengjian@huawei.com>
---
 include/linux/fs.h | 5 ++++-
 1 file changed, 4 insertions(+), 1 deletion(-)
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 63748acb1444..3c912284b9cb 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -1776,7 +1776,6 @@ struct file_operations {
 	ssize_t (*write) (struct file *, const char __user *, size_t, loff_t *);
 	ssize_t (*read_iter) (struct kiocb *, struct iov_iter *);
 	ssize_t (*write_iter) (struct kiocb *, struct iov_iter *);
-	int (*iopoll)(struct kiocb *kiocb, bool spin);
 	int (*iterate) (struct file *, struct dir_context *);
 	int (*iterate_shared) (struct file *, struct dir_context *);
 	__poll_t (*poll) (struct file *, struct poll_table_struct *);
@@ -1811,7 +1810,11 @@ struct file_operations {
 			u64);
 	int (*fadvise)(struct file *, loff_t, loff_t, int);
 
+#ifndef __GENKSYMS__
+	int (*iopoll)(struct kiocb *kiocb, bool spin);
+#else
 	KABI_RESERVE(1)
+#endif
 	KABI_RESERVE(2)
 	KABI_RESERVE(3)
 	KABI_RESERVE(4)
From: Damien Le Moal <damien.lemoal@wdc.com>

mainline inclusion
from mainline-5.0-rc1
commit 23464f8c3407b83106463999b64fe10dc66ff6a3
category: feature
bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27
CVE: NA

---------------------------
Comment the use of the IOCB_FLAG_IOPRIO aio flag similarly to the IOCB_FLAG_RESFD flag.
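For reference, userspace sets the flag roughly as follows (a sketch; the IOPRIO_* encoding is defined locally since no UAPI ioprio header existed at the time, and the helper name is hypothetical):

#include <linux/aio_abi.h>
#include <string.h>

#define IOPRIO_CLASS_SHIFT	13
#define IOPRIO_CLASS_BE		2
#define IOPRIO_PRIO_VALUE(class, data) \
	(((class) << IOPRIO_CLASS_SHIFT) | (data))

static void example_fill_iocb(struct iocb *cb)
{
	memset(cb, 0, sizeof(*cb));
	cb->aio_flags |= IOCB_FLAG_IOPRIO;	/* aio_reqprio is valid */
	cb->aio_reqprio = IOPRIO_PRIO_VALUE(IOPRIO_CLASS_BE, 4);
}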
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Zhihao Cheng <chengzhihao1@huawei.com>
Signed-off-by: yangerkun <yangerkun@huawei.com>
Reviewed-by: zhangyi (F) <yi.zhang@huawei.com>
Signed-off-by: Cheng Jian <cj.chengjian@huawei.com>
---
 include/uapi/linux/aio_abi.h | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/include/uapi/linux/aio_abi.h b/include/uapi/linux/aio_abi.h
index ce43d340f010..8387e0af0f76 100644
--- a/include/uapi/linux/aio_abi.h
+++ b/include/uapi/linux/aio_abi.h
@@ -50,6 +50,8 @@ enum {
  *
  * IOCB_FLAG_RESFD - Set if the "aio_resfd" member of the "struct iocb"
  *                   is valid.
+ * IOCB_FLAG_IOPRIO - Set if the "aio_reqprio" member of the "struct iocb"
+ *                    is valid.
  */
 #define IOCB_FLAG_RESFD		(1 << 0)
 #define IOCB_FLAG_IOPRIO	(1 << 1)
From: Damien Le Moal <damien.lemoal@wdc.com>

mainline inclusion
from mainline-5.0-rc1
commit 64845a1ddd655574886eb48e9a5eaeeb9b05bf0d
category: feature
bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27
CVE: NA

---------------------------
Define get_current_ioprio() as an inline helper to obtain the caller I/O priority from its task I/O context. Use this helper in blk_init_request_from_bio() to set a request ioprio.
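In other words, priority selection for a request boils down to the following sketch (hypothetical helper name; the real code sits inline in blk_init_request_from_bio()):

/* Per-bio priority wins; otherwise fall back to the submitting
 * task's ioprio_set() value via the new helper. */
static unsigned short example_pick_ioprio(struct bio *bio)
{
	if (ioprio_valid(bio_prio(bio)))
		return bio_prio(bio);
	return get_current_ioprio();
}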
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Zhihao Cheng <chengzhihao1@huawei.com>

Conflicts:
	block/blk-core.c
[e2b3fa5af70c ("block: Remove bio->bi_ioc") not included]

Signed-off-by: yangerkun <yangerkun@huawei.com>
Reviewed-by: zhangyi (F) <yi.zhang@huawei.com>
Signed-off-by: Cheng Jian <cj.chengjian@huawei.com>
---
 block/blk-core.c       |  6 +-----
 include/linux/ioprio.h | 13 +++++++++++++
 2 files changed, 14 insertions(+), 5 deletions(-)
diff --git a/block/blk-core.c b/block/blk-core.c index ffbe326c70b9..3a9944ddee3a 100644 --- a/block/blk-core.c +++ b/block/blk-core.c @@ -1972,18 +1972,14 @@ unsigned int blk_plug_queued_count(struct request_queue *q)
void blk_init_request_from_bio(struct request *req, struct bio *bio) { - struct io_context *ioc = rq_ioc(bio); - if (bio->bi_opf & REQ_RAHEAD) req->cmd_flags |= REQ_FAILFAST_MASK;
req->__sector = bio->bi_iter.bi_sector; if (ioprio_valid(bio_prio(bio))) req->ioprio = bio_prio(bio); - else if (ioc) - req->ioprio = ioc->ioprio; else - req->ioprio = IOPRIO_PRIO_VALUE(IOPRIO_CLASS_NONE, 0); + req->ioprio = get_current_ioprio(); req->write_hint = bio->bi_write_hint; blk_rq_bio_prep(req->q, req, bio); } diff --git a/include/linux/ioprio.h b/include/linux/ioprio.h index 9e30ed6443db..e9bfe6972aed 100644 --- a/include/linux/ioprio.h +++ b/include/linux/ioprio.h @@ -70,6 +70,19 @@ static inline int task_nice_ioclass(struct task_struct *task) return IOPRIO_CLASS_BE; }
+/* + * If the calling process has set an I/O priority, use that. Otherwise, return + * the default I/O priority. + */ +static inline int get_current_ioprio(void) +{ + struct io_context *ioc = current->io_context; + + if (ioc) + return ioc->ioprio; + return IOPRIO_PRIO_VALUE(IOPRIO_CLASS_NONE, 0); +} + /* * For inheritance, return the highest of the two given priorities */
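For context (not part of this patch), a minimal userspace sketch of where the task I/O context value read by get_current_ioprio() comes from: a prior ioprio_set(2) call. glibc has no wrapper, so the raw syscall is used; the IOPRIO_* macro values are copied from the kernel's linux/ioprio.h.

#include <stdio.h>
#include <unistd.h>
#include <sys/syscall.h>

#define IOPRIO_WHO_PROCESS	1
#define IOPRIO_CLASS_SHIFT	13
#define IOPRIO_CLASS_BE		2
#define IOPRIO_PRIO_VALUE(class, data)	(((class) << IOPRIO_CLASS_SHIFT) | (data))

int main(void)
{
	/* pid 0 targets the calling thread; best-effort class, level 3 */
	int prio = IOPRIO_PRIO_VALUE(IOPRIO_CLASS_BE, 3);

	if (syscall(SYS_ioprio_set, IOPRIO_WHO_PROCESS, 0, prio)) {
		perror("ioprio_set");
		return 1;
	}

	/* I/O submitted by this task now carries BE level 3 */
	return 0;
}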
From: Damien Le Moal damien.lemoal@wdc.com
mainline inclusion from mainline-5.0-rc1 commit 76dc891395dc61e92e2ff31b6161815ce5eb715b category: feature bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27 CVE: NA ---------------------------
For cases where the application does not specify aio_reqprio for an aio, fall back to using get_current_ioprio() to obtain the task I/O priority last set with ioprio_set(), rather than the hardcoded IOPRIO_CLASS_NONE value.
Reviewed-by: Christoph Hellwig hch@lst.de Reviewed-by: Johannes Thumshirn jthumshirn@suse.de Reviewed-by: Adam Manzanares adam.manzanares@wdc.com Signed-off-by: Damien Le Moal damien.lemoal@wdc.com Signed-off-by: Jens Axboe axboe@kernel.dk Signed-off-by: Zhihao Cheng chengzhihao1@huawei.com Signed-off-by: yangerkun yangerkun@huawei.com Reviewed-by: zhangyi (F) yi.zhang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- fs/aio.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/fs/aio.c b/fs/aio.c index 02954e50ef9b..a404047ab453 100644 --- a/fs/aio.c +++ b/fs/aio.c @@ -1459,7 +1459,7 @@ static int aio_prep_rw(struct kiocb *req, const struct iocb *iocb)
req->ki_ioprio = iocb->aio_reqprio; } else - req->ki_ioprio = IOPRIO_PRIO_VALUE(IOPRIO_CLASS_NONE, 0); + req->ki_ioprio = get_current_ioprio();
ret = kiocb_set_rw_flags(req, iocb->aio_rw_flags); if (unlikely(ret))
From: Damien Le Moal damien.lemoal@wdc.com
mainline inclusion from mainline-5.0-rc1 commit 668ffc03418bc779f699797c72ecf968cd6525a9 category: feature bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27 CVE: NA ---------------------------
Growing a high-priority request by merging it with a lower-priority BIO or request increases the request's execution time. This is the opposite of the desired effect of high I/O priorities, namely low I/O latency. Fix this by preventing the merging of requests and BIOs that have different I/O priorities.
Signed-off-by: Damien Le Moal damien.lemoal@wdc.com Signed-off-by: Jens Axboe axboe@kernel.dk
Conflicts: block/blk-merge.c [ Patch 9cf2bab630("block: kill request ->cpu member") is not applied. ]
Signed-off-by: Zhihao Cheng chengzhihao1@huawei.com Signed-off-by: yangerkun yangerkun@huawei.com Reviewed-by: zhangyi (F) yi.zhang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- block/blk-core.c | 3 --- block/blk-merge.c | 7 ++++++- 2 files changed, 6 insertions(+), 4 deletions(-)
diff --git a/block/blk-core.c b/block/blk-core.c index 3a9944ddee3a..a3213e527008 100644 --- a/block/blk-core.c +++ b/block/blk-core.c @@ -1814,7 +1814,6 @@ bool bio_attempt_back_merge(struct request_queue *q, struct request *req, req->biotail->bi_next = bio; req->biotail = bio; req->__data_len += bio->bi_iter.bi_size; - req->ioprio = ioprio_best(req->ioprio, bio_prio(bio));
blk_account_io_start(req, false); return true; @@ -1838,7 +1837,6 @@ bool bio_attempt_front_merge(struct request_queue *q, struct request *req,
req->__sector = bio->bi_iter.bi_sector; req->__data_len += bio->bi_iter.bi_size; - req->ioprio = ioprio_best(req->ioprio, bio_prio(bio));
blk_account_io_start(req, false); return true; @@ -1858,7 +1856,6 @@ bool bio_attempt_discard_merge(struct request_queue *q, struct request *req, req->biotail->bi_next = bio; req->biotail = bio; req->__data_len += bio->bi_iter.bi_size; - req->ioprio = ioprio_best(req->ioprio, bio_prio(bio)); req->nr_phys_segments = segments + 1;
blk_account_io_start(req, false); diff --git a/block/blk-merge.c b/block/blk-merge.c index d24a6c9398ed..7904e45fc5c6 100644 --- a/block/blk-merge.c +++ b/block/blk-merge.c @@ -727,6 +727,9 @@ static struct request *attempt_merge(struct request_queue *q, if (req->write_hint != next->write_hint) return NULL;
+ if (req->ioprio != next->ioprio) + return NULL; + /* * If we are allowed to merge, then append bio list * from next to rq and release next. merge_requests_fn @@ -782,7 +785,6 @@ static struct request *attempt_merge(struct request_queue *q, */ blk_account_io_merge(next);
- req->ioprio = ioprio_best(req->ioprio, next->ioprio); if (blk_rq_cpu_valid(next)) req->cpu = next->cpu;
@@ -865,6 +867,9 @@ bool blk_rq_merge_ok(struct request *rq, struct bio *bio) if (rq->write_hint != bio->bi_write_hint) return false;
+ if (rq->ioprio != bio_prio(bio)) + return false; + return true; }
From: Damien Le Moal damien.lemoal@wdc.com
mainline inclusion from mainline-5.0-rc1 commit 20578bdfd0418efb11ec316229e670d085cd574a category: feature bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27 CVE: NA ---------------------------
For the synchronous I/O path (read(), write() and similar system calls), a BIO's I/O priority is not initialized until blk_init_request_from_bio() runs, when the BIO is submitted and a request is initialized for its execution. This is because the ki_ioprio field of the on-stack struct kiocb is always initialized to IOPRIO_CLASS_NONE, regardless of the I/O context ioprio value the calling process set with ioprio_set(). This late initialization can result in the BIO being merged into pending requests even when the I/O priorities differ.
Fix this by initializing the ki_ioprio field of the on-stack struct kiocb using the get_current_ioprio() helper, ensuring that all BIOs allocated and submitted for the system call's execution see the correct intended I/O priority early. With this, a BIO's I/O priority is always set to the intended effective value for both the sync and async paths, so blk_init_request_from_bio() can be simplified.
Reviewed-by: Christoph Hellwig hch@lst.de Reviewed-by: Adam Manzanares adam.manzanares@wdc.com Signed-off-by: Damien Le Moal damien.lemoal@wdc.com Signed-off-by: Jens Axboe axboe@kernel.dk Signed-off-by: Zhihao Cheng chengzhihao1@huawei.com Signed-off-by: yangerkun yangerkun@huawei.com Reviewed-by: zhangyi (F) yi.zhang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- block/blk-core.c | 5 +---- include/linux/fs.h | 2 +- 2 files changed, 2 insertions(+), 5 deletions(-)
diff --git a/block/blk-core.c b/block/blk-core.c index a3213e527008..3c77408c4559 100644 --- a/block/blk-core.c +++ b/block/blk-core.c @@ -1973,10 +1973,7 @@ void blk_init_request_from_bio(struct request *req, struct bio *bio) req->cmd_flags |= REQ_FAILFAST_MASK;
req->__sector = bio->bi_iter.bi_sector; - if (ioprio_valid(bio_prio(bio))) - req->ioprio = bio_prio(bio); - else - req->ioprio = get_current_ioprio(); + req->ioprio = bio_prio(bio); req->write_hint = bio->bi_write_hint; blk_rq_bio_prep(req->q, req, bio); } diff --git a/include/linux/fs.h b/include/linux/fs.h index 3c912284b9cb..b6bcd1f5bd72 100644 --- a/include/linux/fs.h +++ b/include/linux/fs.h @@ -2036,7 +2036,7 @@ static inline void init_sync_kiocb(struct kiocb *kiocb, struct file *filp) .ki_filp = filp, .ki_flags = iocb_flags(filp), .ki_hint = ki_hint_validate(file_write_hint(filp)), - .ki_ioprio = IOPRIO_PRIO_VALUE(IOPRIO_CLASS_NONE, 0), + .ki_ioprio = get_current_ioprio(), }; }
From: Deepa Dinamani deepa.kernel@gmail.com
mainline inclusion from mainline-5.0-rc1 commit ded653ccbec0335a78fa7a7aff3ec9870349fafb category: feature bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27 CVE: NA ---------------------------
Refactor reading the sigset from userspace and updating the sigmask into an API.
This is useful for versions of syscalls that pass in the sigmask and expect the current->sigmask to be changed during, and restored after, the execution of the syscall.
With the advent of the new y2038 syscalls in the subsequent patches, we add two new versions of these syscalls (for pselect, ppoll, and io_pgetevents) in addition to the existing native and compat versions. Adding such an API reduces the logic that would otherwise need to be replicated.
Note that the previous calls to sigprocmask() ignored its return value, since the function only returns an error for an invalid first argument, and that argument is hardcoded at these call sites. The updated logic uses set_current_blocked() instead.
Signed-off-by: Deepa Dinamani deepa.kernel@gmail.com Signed-off-by: Arnd Bergmann arnd@arndb.de
Conflicts: include/linux/signal.h [ Patch ae7795bc6("signal: Distinguish between kernel_siginfo and siginfo") is not applied. ] kernel/signal.c [ Patch fb50f5a40("signal: Pair exports with their functions") is not applied. ]
Signed-off-by: Zhihao Cheng chengzhihao1@huawei.com Signed-off-by: yangerkun yangerkun@huawei.com Reviewed-by: zhangyi (F) yi.zhang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- fs/aio.c | 23 ++++++------------- fs/eventpoll.c | 22 +++++-------------- fs/select.c | 50 ++++++++++-------------------------------- include/linux/compat.h | 4 ++++ include/linux/signal.h | 2 ++ kernel/signal.c | 45 +++++++++++++++++++++++++++++++++++++ 6 files changed, 76 insertions(+), 70 deletions(-)
diff --git a/fs/aio.c b/fs/aio.c index a404047ab453..9bd3dd57ea8f 100644 --- a/fs/aio.c +++ b/fs/aio.c @@ -2138,14 +2138,10 @@ SYSCALL_DEFINE6(io_pgetevents, if (usig && copy_from_user(&ksig, usig, sizeof(ksig))) return -EFAULT;
- if (ksig.sigmask) { - if (ksig.sigsetsize != sizeof(sigset_t)) - return -EINVAL; - if (copy_from_user(&ksigmask, ksig.sigmask, sizeof(ksigmask))) - return -EFAULT; - sigdelsetmask(&ksigmask, sigmask(SIGKILL) | sigmask(SIGSTOP)); - sigprocmask(SIG_SETMASK, &ksigmask, &sigsaved); - } + + ret = set_user_sigmask(ksig.sigmask, &ksigmask, &sigsaved, ksig.sigsetsize); + if (ret) + return ret;
ret = do_io_getevents(ctx_id, min_nr, nr, events, timeout ? &ts : NULL); if (signal_pending(current)) { @@ -2208,14 +2204,9 @@ COMPAT_SYSCALL_DEFINE6(io_pgetevents, if (usig && copy_from_user(&ksig, usig, sizeof(ksig))) return -EFAULT;
- if (ksig.sigmask) { - if (ksig.sigsetsize != sizeof(compat_sigset_t)) - return -EINVAL; - if (get_compat_sigset(&ksigmask, ksig.sigmask)) - return -EFAULT; - sigdelsetmask(&ksigmask, sigmask(SIGKILL) | sigmask(SIGSTOP)); - sigprocmask(SIG_SETMASK, &ksigmask, &sigsaved); - } + ret = set_compat_user_sigmask(ksig.sigmask, &ksigmask, &sigsaved, ksig.sigsetsize); + if (ret) + return ret;
ret = do_io_getevents(ctx_id, min_nr, nr, events, timeout ? &t : NULL); if (signal_pending(current)) { diff --git a/fs/eventpoll.c b/fs/eventpoll.c index cf332e8f6bdf..3d879caff64c 100644 --- a/fs/eventpoll.c +++ b/fs/eventpoll.c @@ -2217,14 +2217,9 @@ SYSCALL_DEFINE6(epoll_pwait, int, epfd, struct epoll_event __user *, events, * If the caller wants a certain signal mask to be set during the wait, * we apply it here. */ - if (sigmask) { - if (sigsetsize != sizeof(sigset_t)) - return -EINVAL; - if (copy_from_user(&ksigmask, sigmask, sizeof(ksigmask))) - return -EFAULT; - sigsaved = current->blocked; - set_current_blocked(&ksigmask); - } + error = set_user_sigmask(sigmask, &ksigmask, &sigsaved, sigsetsize); + if (error) + return error;
error = do_epoll_wait(epfd, events, maxevents, timeout);
@@ -2260,14 +2255,9 @@ COMPAT_SYSCALL_DEFINE6(epoll_pwait, int, epfd, * If the caller wants a certain signal mask to be set during the wait, * we apply it here. */ - if (sigmask) { - if (sigsetsize != sizeof(compat_sigset_t)) - return -EINVAL; - if (get_compat_sigset(&ksigmask, sigmask)) - return -EFAULT; - sigsaved = current->blocked; - set_current_blocked(&ksigmask); - } + err = set_compat_user_sigmask(sigmask, &ksigmask, &sigsaved, sigsetsize); + if (err) + return err;
err = do_epoll_wait(epfd, events, maxevents, timeout);
diff --git a/fs/select.c b/fs/select.c index be2f66c5cc8a..58594f0d5f67 100644 --- a/fs/select.c +++ b/fs/select.c @@ -714,16 +714,9 @@ static long do_pselect(int n, fd_set __user *inp, fd_set __user *outp, return -EINVAL; }
- if (sigmask) { - /* XXX: Don't preclude handling different sized sigset_t's. */ - if (sigsetsize != sizeof(sigset_t)) - return -EINVAL; - if (copy_from_user(&ksigmask, sigmask, sizeof(ksigmask))) - return -EFAULT; - - sigdelsetmask(&ksigmask, sigmask(SIGKILL)|sigmask(SIGSTOP)); - sigprocmask(SIG_SETMASK, &ksigmask, &sigsaved); - } + ret = set_user_sigmask(sigmask, &ksigmask, &sigsaved, sigsetsize); + if (ret) + return ret;
ret = core_sys_select(n, inp, outp, exp, to); ret = poll_select_copy_remaining(&end_time, tsp, 0, ret); @@ -1056,16 +1049,9 @@ SYSCALL_DEFINE5(ppoll, struct pollfd __user *, ufds, unsigned int, nfds, return -EINVAL; }
- if (sigmask) { - /* XXX: Don't preclude handling different sized sigset_t's. */ - if (sigsetsize != sizeof(sigset_t)) - return -EINVAL; - if (copy_from_user(&ksigmask, sigmask, sizeof(ksigmask))) - return -EFAULT; - - sigdelsetmask(&ksigmask, sigmask(SIGKILL)|sigmask(SIGSTOP)); - sigprocmask(SIG_SETMASK, &ksigmask, &sigsaved); - } + ret = set_user_sigmask(sigmask, &ksigmask, &sigsaved, sigsetsize); + if (ret) + return ret;
ret = do_sys_poll(ufds, nfds, to);
@@ -1318,15 +1304,9 @@ static long do_compat_pselect(int n, compat_ulong_t __user *inp, return -EINVAL; }
- if (sigmask) { - if (sigsetsize != sizeof(compat_sigset_t)) - return -EINVAL; - if (get_compat_sigset(&ksigmask, sigmask)) - return -EFAULT; - - sigdelsetmask(&ksigmask, sigmask(SIGKILL)|sigmask(SIGSTOP)); - sigprocmask(SIG_SETMASK, &ksigmask, &sigsaved); - } + ret = set_compat_user_sigmask(sigmask, &ksigmask, &sigsaved, sigsetsize); + if (ret) + return ret;
ret = compat_core_sys_select(n, inp, outp, exp, to); ret = compat_poll_select_copy_remaining(&end_time, tsp, 0, ret); @@ -1384,15 +1364,9 @@ COMPAT_SYSCALL_DEFINE5(ppoll, struct pollfd __user *, ufds, return -EINVAL; }
- if (sigmask) { - if (sigsetsize != sizeof(compat_sigset_t)) - return -EINVAL; - if (get_compat_sigset(&ksigmask, sigmask)) - return -EFAULT; - - sigdelsetmask(&ksigmask, sigmask(SIGKILL)|sigmask(SIGSTOP)); - sigprocmask(SIG_SETMASK, &ksigmask, &sigsaved); - } + ret = set_compat_user_sigmask(sigmask, &ksigmask, &sigsaved, sigsetsize); + if (ret) + return ret;
ret = do_sys_poll(ufds, nfds, to);
diff --git a/include/linux/compat.h b/include/linux/compat.h index 189d0e111d57..c0476f7c4444 100644 --- a/include/linux/compat.h +++ b/include/linux/compat.h @@ -176,6 +176,10 @@ typedef struct { compat_sigset_word sig[_COMPAT_NSIG_WORDS]; } compat_sigset_t;
+int set_compat_user_sigmask(const compat_sigset_t __user *usigmask, + sigset_t *set, sigset_t *oldset, + size_t sigsetsize); + struct compat_sigaction { #ifndef __ARCH_HAS_IRIX_SIGACTION compat_uptr_t sa_handler; diff --git a/include/linux/signal.h b/include/linux/signal.h index 0be5ce2375cb..2f489e525099 100644 --- a/include/linux/signal.h +++ b/include/linux/signal.h @@ -263,6 +263,8 @@ extern int group_send_sig_info(int sig, struct siginfo *info, struct task_struct *p, enum pid_type type); extern int __group_send_sig_info(int, struct siginfo *, struct task_struct *); extern int sigprocmask(int, sigset_t *, sigset_t *); +extern int set_user_sigmask(const sigset_t __user *usigmask, sigset_t *set, + sigset_t *oldset, size_t sigsetsize); extern void set_current_blocked(sigset_t *); extern void __set_current_blocked(const sigset_t *); extern int show_unhandled_signals; diff --git a/kernel/signal.c b/kernel/signal.c index 5ded8c6ac789..23beedc12eaa 100644 --- a/kernel/signal.c +++ b/kernel/signal.c @@ -2816,6 +2816,51 @@ int sigprocmask(int how, sigset_t *set, sigset_t *oldset) return 0; }
+/* + * The api helps set app-provided sigmasks. + * + * This is useful for syscalls such as ppoll, pselect, io_pgetevents and + * epoll_pwait where a new sigmask is passed from userland for the syscalls. + */ +int set_user_sigmask(const sigset_t __user *usigmask, sigset_t *set, + sigset_t *oldset, size_t sigsetsize) +{ + if (!usigmask) + return 0; + + if (sigsetsize != sizeof(sigset_t)) + return -EINVAL; + if (copy_from_user(set, usigmask, sizeof(sigset_t))) + return -EFAULT; + + *oldset = current->blocked; + set_current_blocked(set); + + return 0; +} +EXPORT_SYMBOL(set_user_sigmask); + +#ifdef CONFIG_COMPAT +int set_compat_user_sigmask(const compat_sigset_t __user *usigmask, + sigset_t *set, sigset_t *oldset, + size_t sigsetsize) +{ + if (!usigmask) + return 0; + + if (sigsetsize != sizeof(compat_sigset_t)) + return -EINVAL; + if (get_compat_sigset(set, usigmask)) + return -EFAULT; + + *oldset = current->blocked; + set_current_blocked(set); + + return 0; +} +EXPORT_SYMBOL(set_compat_user_sigmask); +#endif + /** * sys_rt_sigprocmask - change the list of currently blocked signals * @how: whether to add, remove, or set signals
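For context (not part of this patch), one userspace-visible consumer of this helper: ppoll(2) accepts a temporary signal mask that the kernel installs via set_user_sigmask() for the duration of the wait. A minimal sketch that blocks SIGINT only while waiting on stdin:

#define _GNU_SOURCE
#include <poll.h>
#include <signal.h>
#include <stdio.h>

int main(void)
{
	struct pollfd pfd = { .fd = 0, .events = POLLIN };
	sigset_t mask;

	sigemptyset(&mask);
	sigaddset(&mask, SIGINT);

	/* NULL timeout: block until stdin is readable, SIGINT held off */
	if (ppoll(&pfd, 1, NULL, &mask) < 0) {
		perror("ppoll");
		return 1;
	}
	return 0;
}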
From: Deepa Dinamani deepa.kernel@gmail.com
mainline inclusion from mainline-5.0-rc1 commit 854a6ed56839a40f6b5d02a2962f48841482eec4 category: feature bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27 CVE: NA ---------------------------
Refactor the logic that restores the sigmask before the syscall returns into an API. This is useful for versions of syscalls that pass in the sigmask and expect current->sigmask to be changed during the syscall's execution and restored afterwards.
With the advent of the new y2038 syscalls in the subsequent patches, we add two new versions of these syscalls (for pselect, ppoll and io_pgetevents) in addition to the existing native and compat versions. Adding such an API reduces the logic that would otherwise need to be replicated.
Signed-off-by: Deepa Dinamani deepa.kernel@gmail.com Signed-off-by: Arnd Bergmann arnd@arndb.de Signed-off-by: Zhihao Cheng chengzhihao1@huawei.com Signed-off-by: yangerkun yangerkun@huawei.com Reviewed-by: zhangyi (F) yi.zhang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- fs/aio.c | 29 +++++--------------- fs/eventpoll.c | 30 ++------------------- fs/select.c | 60 ++++++------------------------------------ include/linux/signal.h | 2 ++ kernel/signal.c | 33 +++++++++++++++++++++++ 5 files changed, 51 insertions(+), 103 deletions(-)
diff --git a/fs/aio.c b/fs/aio.c index 9bd3dd57ea8f..3c2162bb9309 100644 --- a/fs/aio.c +++ b/fs/aio.c @@ -2144,18 +2144,9 @@ SYSCALL_DEFINE6(io_pgetevents, return ret;
ret = do_io_getevents(ctx_id, min_nr, nr, events, timeout ? &ts : NULL); - if (signal_pending(current)) { - if (ksig.sigmask) { - current->saved_sigmask = sigsaved; - set_restore_sigmask(); - } - - if (!ret) - ret = -ERESTARTNOHAND; - } else { - if (ksig.sigmask) - sigprocmask(SIG_SETMASK, &sigsaved, NULL); - } + restore_user_sigmask(ksig.sigmask, &sigsaved); + if (signal_pending(current) && !ret) + ret = -ERESTARTNOHAND;
return ret; } @@ -2209,17 +2200,9 @@ COMPAT_SYSCALL_DEFINE6(io_pgetevents, return ret;
ret = do_io_getevents(ctx_id, min_nr, nr, events, timeout ? &t : NULL); - if (signal_pending(current)) { - if (ksig.sigmask) { - current->saved_sigmask = sigsaved; - set_restore_sigmask(); - } - if (!ret) - ret = -ERESTARTNOHAND; - } else { - if (ksig.sigmask) - sigprocmask(SIG_SETMASK, &sigsaved, NULL); - } + restore_user_sigmask(ksig.sigmask, &sigsaved); + if (signal_pending(current) && !ret) + ret = -ERESTARTNOHAND;
return ret; } diff --git a/fs/eventpoll.c b/fs/eventpoll.c index 3d879caff64c..fb096e3c9fdc 100644 --- a/fs/eventpoll.c +++ b/fs/eventpoll.c @@ -2223,20 +2223,7 @@ SYSCALL_DEFINE6(epoll_pwait, int, epfd, struct epoll_event __user *, events,
error = do_epoll_wait(epfd, events, maxevents, timeout);
- /* - * If we changed the signal mask, we need to restore the original one. - * In case we've got a signal while waiting, we do not restore the - * signal mask yet, and we allow do_signal() to deliver the signal on - * the way back to userspace, before the signal mask is restored. - */ - if (sigmask) { - if (error == -EINTR) { - memcpy(&current->saved_sigmask, &sigsaved, - sizeof(sigsaved)); - set_restore_sigmask(); - } else - set_current_blocked(&sigsaved); - } + restore_user_sigmask(sigmask, &sigsaved);
return error; } @@ -2261,20 +2248,7 @@ COMPAT_SYSCALL_DEFINE6(epoll_pwait, int, epfd,
err = do_epoll_wait(epfd, events, maxevents, timeout);
- /* - * If we changed the signal mask, we need to restore the original one. - * In case we've got a signal while waiting, we do not restore the - * signal mask yet, and we allow do_signal() to deliver the signal on - * the way back to userspace, before the signal mask is restored. - */ - if (sigmask) { - if (err == -EINTR) { - memcpy(&current->saved_sigmask, &sigsaved, - sizeof(sigsaved)); - set_restore_sigmask(); - } else - set_current_blocked(&sigsaved); - } + restore_user_sigmask(sigmask, &sigsaved);
return err; } diff --git a/fs/select.c b/fs/select.c index 58594f0d5f67..5989a43813b7 100644 --- a/fs/select.c +++ b/fs/select.c @@ -721,19 +721,7 @@ static long do_pselect(int n, fd_set __user *inp, fd_set __user *outp, ret = core_sys_select(n, inp, outp, exp, to); ret = poll_select_copy_remaining(&end_time, tsp, 0, ret);
- if (ret == -ERESTARTNOHAND) { - /* - * Don't restore the signal mask yet. Let do_signal() deliver - * the signal on the way back to userspace, before the signal - * mask is restored. - */ - if (sigmask) { - memcpy(&current->saved_sigmask, &sigsaved, - sizeof(sigsaved)); - set_restore_sigmask(); - } - } else if (sigmask) - sigprocmask(SIG_SETMASK, &sigsaved, NULL); + restore_user_sigmask(sigmask, &sigsaved);
return ret; } @@ -1055,21 +1043,11 @@ SYSCALL_DEFINE5(ppoll, struct pollfd __user *, ufds, unsigned int, nfds,
ret = do_sys_poll(ufds, nfds, to);
+ restore_user_sigmask(sigmask, &sigsaved); + /* We can restart this syscall, usually */ - if (ret == -EINTR) { - /* - * Don't restore the signal mask yet. Let do_signal() deliver - * the signal on the way back to userspace, before the signal - * mask is restored. - */ - if (sigmask) { - memcpy(&current->saved_sigmask, &sigsaved, - sizeof(sigsaved)); - set_restore_sigmask(); - } + if (ret == -EINTR) ret = -ERESTARTNOHAND; - } else if (sigmask) - sigprocmask(SIG_SETMASK, &sigsaved, NULL);
ret = poll_select_copy_remaining(&end_time, tsp, 0, ret);
@@ -1311,19 +1289,7 @@ static long do_compat_pselect(int n, compat_ulong_t __user *inp, ret = compat_core_sys_select(n, inp, outp, exp, to); ret = compat_poll_select_copy_remaining(&end_time, tsp, 0, ret);
- if (ret == -ERESTARTNOHAND) { - /* - * Don't restore the signal mask yet. Let do_signal() deliver - * the signal on the way back to userspace, before the signal - * mask is restored. - */ - if (sigmask) { - memcpy(&current->saved_sigmask, &sigsaved, - sizeof(sigsaved)); - set_restore_sigmask(); - } - } else if (sigmask) - sigprocmask(SIG_SETMASK, &sigsaved, NULL); + restore_user_sigmask(sigmask, &sigsaved);
return ret; } @@ -1370,21 +1336,11 @@ COMPAT_SYSCALL_DEFINE5(ppoll, struct pollfd __user *, ufds,
ret = do_sys_poll(ufds, nfds, to);
+ restore_user_sigmask(sigmask, &sigsaved); + /* We can restart this syscall, usually */ - if (ret == -EINTR) { - /* - * Don't restore the signal mask yet. Let do_signal() deliver - * the signal on the way back to userspace, before the signal - * mask is restored. - */ - if (sigmask) { - memcpy(&current->saved_sigmask, &sigsaved, - sizeof(sigsaved)); - set_restore_sigmask(); - } + if (ret == -EINTR) ret = -ERESTARTNOHAND; - } else if (sigmask) - sigprocmask(SIG_SETMASK, &sigsaved, NULL);
ret = compat_poll_select_copy_remaining(&end_time, tsp, 0, ret);
diff --git a/include/linux/signal.h b/include/linux/signal.h index 2f489e525099..5172526c90ce 100644 --- a/include/linux/signal.h +++ b/include/linux/signal.h @@ -265,6 +265,8 @@ extern int __group_send_sig_info(int, struct siginfo *, struct task_struct *); extern int sigprocmask(int, sigset_t *, sigset_t *); extern int set_user_sigmask(const sigset_t __user *usigmask, sigset_t *set, sigset_t *oldset, size_t sigsetsize); +extern void restore_user_sigmask(const void __user *usigmask, + sigset_t *sigsaved); extern void set_current_blocked(sigset_t *); extern void __set_current_blocked(const sigset_t *); extern int show_unhandled_signals; diff --git a/kernel/signal.c b/kernel/signal.c index 23beedc12eaa..24b48a689972 100644 --- a/kernel/signal.c +++ b/kernel/signal.c @@ -2861,6 +2861,39 @@ int set_compat_user_sigmask(const compat_sigset_t __user *usigmask, EXPORT_SYMBOL(set_compat_user_sigmask); #endif
+/* + * restore_user_sigmask: + * usigmask: sigmask passed in from userland. + * sigsaved: saved sigmask when the syscall started and changed the sigmask to + * usigmask. + * + * This is useful for syscalls such as ppoll, pselect, io_pgetevents and + * epoll_pwait where a new sigmask is passed in from userland for the syscalls. + */ +void restore_user_sigmask(const void __user *usigmask, sigset_t *sigsaved) +{ + + if (!usigmask) + return; + /* + * When signals are pending, do not restore them here. + * Restoring sigmask here can lead to delivering signals that the above + * syscalls are intended to block because of the sigmask passed in. + */ + if (signal_pending(current)) { + current->saved_sigmask = *sigsaved; + set_restore_sigmask(); + return; + } + + /* + * This is needed because the fast syscall return path does not restore + * saved_sigmask when signals are not pending. + */ + set_current_blocked(sigsaved); +} +EXPORT_SYMBOL(restore_user_sigmask); + /** * sys_rt_sigprocmask - change the list of currently blocked signals * @how: whether to add, remove, or set signals
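Together with set_user_sigmask() from the previous patch, this reduces every sigmask-taking syscall to one pattern. An illustrative sketch (not from this patch; my_wait_syscall() and my_wait_work() are hypothetical placeholders):

extern long my_wait_work(void);	/* hypothetical interruptible wait */

static long my_wait_syscall(const sigset_t __user *sigmask, size_t sigsetsize)
{
	sigset_t ksigmask, sigsaved;
	long ret;

	/* install the caller-supplied mask, saving the current one */
	ret = set_user_sigmask(sigmask, &ksigmask, &sigsaved, sigsetsize);
	if (ret)
		return ret;

	ret = my_wait_work();

	/*
	 * restore the saved mask; with a signal pending this defers the
	 * restore to the return-to-userspace path via saved_sigmask
	 */
	restore_user_sigmask(sigmask, &sigsaved);
	if (ret == -EINTR)
		ret = -ERESTARTNOHAND;	/* per the ppoll convention above */

	return ret;
}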
From: Jens Axboe axboe@kernel.dk
mainline inclusion from mainline-5.1-rc1 commit edafccee56ff31678a091ddb7219aba9b28bc3cb category: feature bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27 CVE: NA ---------------------------
If we have fixed user buffers, we can map them into the kernel when we set up the io_uring instance. That avoids the need to do get_user_pages() for each and every IO.
To utilize this feature, the application must call io_uring_register() after having set up an io_uring instance, passing in IORING_REGISTER_BUFFERS as the opcode. The argument must be a pointer to an iovec array, and nr_args should contain how many iovecs the application wishes to map.
If successful, these buffers are now mapped into the kernel and eligible for IO. To use these fixed buffers, the application must use the IORING_OP_READ_FIXED and IORING_OP_WRITE_FIXED opcodes, and then set sqe->buf_index to the desired buffer index. sqe->addr..sqe->addr+sqe->len must point to somewhere inside the indexed buffer.
The application may register buffers throughout the lifetime of the io_uring instance. It can call io_uring_register() with IORING_UNREGISTER_BUFFERS as the opcode to unregister the current set of buffers, and then register a new set. The application need not unregister buffers explicitly before shutting down the io_uring instance.
It's perfectly valid to setup a larger buffer, and then sometimes only use parts of it for an IO. As long as the range is within the originally mapped region, it will work just fine.
For now, buffers must not be file backed. If file backed buffers are passed in, the registration will fail with -1/EOPNOTSUPP. This restriction may be relaxed in the future.
RLIMIT_MEMLOCK is used to check how much memory we can pin. A somewhat arbitrary 1G per-buffer size limit is also imposed.
Reviewed-by: Hannes Reinecke hare@suse.com Signed-off-by: Jens Axboe axboe@kernel.dk Signed-off-by: Zhihao Cheng chengzhihao1@huawei.com [add ITER_BVEC for iov_iter_bvec in 4.19] Signed-off-by: yangerkun yangerkun@huawei.com Reviewed-by: zhangyi (F) yi.zhang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- arch/x86/entry/syscalls/syscall_32.tbl | 1 + arch/x86/entry/syscalls/syscall_64.tbl | 1 + fs/io_uring.c | 374 ++++++++++++++++++++++++- include/linux/syscalls.h | 2 + include/uapi/asm-generic/unistd.h | 4 +- include/uapi/linux/io_uring.h | 13 +- kernel/sys_ni.c | 1 + 7 files changed, 381 insertions(+), 15 deletions(-)
diff --git a/arch/x86/entry/syscalls/syscall_32.tbl b/arch/x86/entry/syscalls/syscall_32.tbl index 481c126259e9..2eefd2a7c1ce 100644 --- a/arch/x86/entry/syscalls/syscall_32.tbl +++ b/arch/x86/entry/syscalls/syscall_32.tbl @@ -400,3 +400,4 @@ 386 i386 rseq sys_rseq __ia32_sys_rseq 425 i386 io_uring_setup sys_io_uring_setup __ia32_sys_io_uring_setup 426 i386 io_uring_enter sys_io_uring_enter __ia32_sys_io_uring_enter +427 i386 io_uring_register sys_io_uring_register __ia32_sys_io_uring_register diff --git a/arch/x86/entry/syscalls/syscall_64.tbl b/arch/x86/entry/syscalls/syscall_64.tbl index 6a32a430c8e0..65c026185e61 100644 --- a/arch/x86/entry/syscalls/syscall_64.tbl +++ b/arch/x86/entry/syscalls/syscall_64.tbl @@ -345,6 +345,7 @@ 334 common rseq __x64_sys_rseq 425 common io_uring_setup __x64_sys_io_uring_setup 426 common io_uring_enter __x64_sys_io_uring_enter +427 common io_uring_register __x64_sys_io_uring_register
# # x32-specific system call numbers start at 512 to avoid cache impact diff --git a/fs/io_uring.c b/fs/io_uring.c index 31f43ed894ba..762df0beb199 100644 --- a/fs/io_uring.c +++ b/fs/io_uring.c @@ -45,6 +45,7 @@ #include <linux/slab.h> #include <linux/workqueue.h> #include <linux/blkdev.h> +#include <linux/bvec.h> #include <linux/net.h> #include <net/sock.h> #include <net/af_unix.h> @@ -52,6 +53,8 @@ #include <linux/sched/mm.h> #include <linux/uaccess.h> #include <linux/nospec.h> +#include <linux/sizes.h> +#include <linux/hugetlb.h>
#include <uapi/linux/io_uring.h>
@@ -81,6 +84,13 @@ struct io_cq_ring { struct io_uring_cqe cqes[]; };
+struct io_mapped_ubuf { + u64 ubuf; + size_t len; + struct bio_vec *bvec; + unsigned int nr_bvecs; +}; + struct io_ring_ctx { struct { struct percpu_ref refs; @@ -113,6 +123,10 @@ struct io_ring_ctx { struct fasync_struct *cq_fasync; } ____cacheline_aligned_in_smp;
+ /* if used, fixed mapped user buffers */ + unsigned nr_user_bufs; + struct io_mapped_ubuf *user_bufs; + struct user_struct *user;
struct completion ctx_done; @@ -732,6 +746,46 @@ static inline void io_rw_done(struct kiocb *kiocb, ssize_t ret) } }
+static int io_import_fixed(struct io_ring_ctx *ctx, int rw, + const struct io_uring_sqe *sqe, + struct iov_iter *iter) +{ + size_t len = READ_ONCE(sqe->len); + struct io_mapped_ubuf *imu; + unsigned index, buf_index; + size_t offset; + u64 buf_addr; + + /* attempt to use fixed buffers without having provided iovecs */ + if (unlikely(!ctx->user_bufs)) + return -EFAULT; + + buf_index = READ_ONCE(sqe->buf_index); + if (unlikely(buf_index >= ctx->nr_user_bufs)) + return -EFAULT; + + index = array_index_nospec(buf_index, ctx->nr_user_bufs); + imu = &ctx->user_bufs[index]; + buf_addr = READ_ONCE(sqe->addr); + + /* overflow */ + if (buf_addr + len < buf_addr) + return -EFAULT; + /* not inside the mapped region */ + if (buf_addr < imu->ubuf || buf_addr + len > imu->ubuf + imu->len) + return -EFAULT; + + /* + * May not be a start of buffer, set size appropriately + * and advance us to the beginning. + */ + offset = buf_addr - imu->ubuf; + iov_iter_bvec(iter, ITER_BVEC | rw, imu->bvec, imu->nr_bvecs, offset + len); + if (offset) + iov_iter_advance(iter, offset); + return 0; +} + static int io_import_iovec(struct io_ring_ctx *ctx, int rw, const struct sqe_submit *s, struct iovec **iovec, struct iov_iter *iter) @@ -739,6 +793,23 @@ static int io_import_iovec(struct io_ring_ctx *ctx, int rw, const struct io_uring_sqe *sqe = s->sqe; void __user *buf = u64_to_user_ptr(READ_ONCE(sqe->addr)); size_t sqe_len = READ_ONCE(sqe->len); + u8 opcode; + + /* + * We're reading ->opcode for the second time, but the first read + * doesn't care whether it's _FIXED or not, so it doesn't matter + * whether ->opcode changes concurrently. The first read does care + * about whether it is a READ or a WRITE, so we don't trust this read + * for that purpose and instead let the caller pass in the read/write + * flag. + */ + opcode = READ_ONCE(sqe->opcode); + if (opcode == IORING_OP_READ_FIXED || + opcode == IORING_OP_WRITE_FIXED) { + ssize_t ret = io_import_fixed(ctx, rw, sqe, iter); + *iovec = NULL; + return ret; + }
if (!s->has_user) return -EFAULT; @@ -886,7 +957,7 @@ static int io_prep_fsync(struct io_kiocb *req, const struct io_uring_sqe *sqe)
if (unlikely(req->ctx->flags & IORING_SETUP_IOPOLL)) return -EINVAL; - if (unlikely(sqe->addr || sqe->ioprio)) + if (unlikely(sqe->addr || sqe->ioprio || sqe->buf_index)) return -EINVAL;
fd = READ_ONCE(sqe->fd); @@ -945,9 +1016,19 @@ static int __io_submit_sqe(struct io_ring_ctx *ctx, struct io_kiocb *req, ret = io_nop(req, req->user_data); break; case IORING_OP_READV: + if (unlikely(s->sqe->buf_index)) + return -EINVAL; ret = io_read(req, s, force_nonblock, state); break; case IORING_OP_WRITEV: + if (unlikely(s->sqe->buf_index)) + return -EINVAL; + ret = io_write(req, s, force_nonblock, state); + break; + case IORING_OP_READ_FIXED: + ret = io_read(req, s, force_nonblock, state); + break; + case IORING_OP_WRITE_FIXED: ret = io_write(req, s, force_nonblock, state); break; case IORING_OP_FSYNC: @@ -976,28 +1057,46 @@ static int __io_submit_sqe(struct io_ring_ctx *ctx, struct io_kiocb *req, return 0; }
+static inline bool io_sqe_needs_user(const struct io_uring_sqe *sqe) +{ + u8 opcode = READ_ONCE(sqe->opcode); + + return !(opcode == IORING_OP_READ_FIXED || + opcode == IORING_OP_WRITE_FIXED); +} + static void io_sq_wq_submit_work(struct work_struct *work) { struct io_kiocb *req = container_of(work, struct io_kiocb, work); struct sqe_submit *s = &req->submit; const struct io_uring_sqe *sqe = s->sqe; struct io_ring_ctx *ctx = req->ctx; - mm_segment_t old_fs = get_fs(); + mm_segment_t old_fs; + bool needs_user; int ret;
/* Ensure we clear previously set forced non-block flag */ req->flags &= ~REQ_F_FORCE_NONBLOCK; req->rw.ki_flags &= ~IOCB_NOWAIT;
- if (!mmget_not_zero(ctx->sqo_mm)) { - ret = -EFAULT; - goto err; - } - - use_mm(ctx->sqo_mm); - set_fs(USER_DS); - s->has_user = true; s->needs_lock = true; + s->has_user = false; + + /* + * If we're doing IO to fixed buffers, we don't need to get/set + * user context + */ + needs_user = io_sqe_needs_user(s->sqe); + if (needs_user) { + if (!mmget_not_zero(ctx->sqo_mm)) { + ret = -EFAULT; + goto err; + } + use_mm(ctx->sqo_mm); + old_fs = get_fs(); + set_fs(USER_DS); + s->has_user = true; + }
do { ret = __io_submit_sqe(ctx, req, s, false, NULL); @@ -1011,9 +1110,11 @@ static void io_sq_wq_submit_work(struct work_struct *work) cond_resched(); } while (1);
- set_fs(old_fs); - unuse_mm(ctx->sqo_mm); - mmput(ctx->sqo_mm); + if (needs_user) { + set_fs(old_fs); + unuse_mm(ctx->sqo_mm); + mmput(ctx->sqo_mm); + } err: if (ret) { io_cqring_add_event(ctx, sqe->user_data, ret, 0); @@ -1317,6 +1418,198 @@ static unsigned long ring_pages(unsigned sq_entries, unsigned cq_entries) return (bytes + PAGE_SIZE - 1) / PAGE_SIZE; }
+static int io_sqe_buffer_unregister(struct io_ring_ctx *ctx) +{ + int i, j; + + if (!ctx->user_bufs) + return -ENXIO; + + for (i = 0; i < ctx->nr_user_bufs; i++) { + struct io_mapped_ubuf *imu = &ctx->user_bufs[i]; + + for (j = 0; j < imu->nr_bvecs; j++) + put_page(imu->bvec[j].bv_page); + + if (ctx->account_mem) + io_unaccount_mem(ctx->user, imu->nr_bvecs); + kfree(imu->bvec); + imu->nr_bvecs = 0; + } + + kfree(ctx->user_bufs); + ctx->user_bufs = NULL; + ctx->nr_user_bufs = 0; + return 0; +} + +static int io_copy_iov(struct io_ring_ctx *ctx, struct iovec *dst, + void __user *arg, unsigned index) +{ + struct iovec __user *src; + +#ifdef CONFIG_COMPAT + if (ctx->compat) { + struct compat_iovec __user *ciovs; + struct compat_iovec ciov; + + ciovs = (struct compat_iovec __user *) arg; + if (copy_from_user(&ciov, &ciovs[index], sizeof(ciov))) + return -EFAULT; + + dst->iov_base = (void __user *) (unsigned long) ciov.iov_base; + dst->iov_len = ciov.iov_len; + return 0; + } +#endif + src = (struct iovec __user *) arg; + if (copy_from_user(dst, &src[index], sizeof(*dst))) + return -EFAULT; + return 0; +} + +static int io_sqe_buffer_register(struct io_ring_ctx *ctx, void __user *arg, + unsigned nr_args) +{ + struct vm_area_struct **vmas = NULL; + struct page **pages = NULL; + int i, j, got_pages = 0; + int ret = -EINVAL; + + if (ctx->user_bufs) + return -EBUSY; + if (!nr_args || nr_args > UIO_MAXIOV) + return -EINVAL; + + ctx->user_bufs = kcalloc(nr_args, sizeof(struct io_mapped_ubuf), + GFP_KERNEL); + if (!ctx->user_bufs) + return -ENOMEM; + + for (i = 0; i < nr_args; i++) { + struct io_mapped_ubuf *imu = &ctx->user_bufs[i]; + unsigned long off, start, end, ubuf; + int pret, nr_pages; + struct iovec iov; + size_t size; + + ret = io_copy_iov(ctx, &iov, arg, i); + if (ret) + break; + + /* + * Don't impose further limits on the size and buffer + * constraints here, we'll -EINVAL later when IO is + * submitted if they are wrong. + */ + ret = -EFAULT; + if (!iov.iov_base || !iov.iov_len) + goto err; + + /* arbitrary limit, but we need something */ + if (iov.iov_len > SZ_1G) + goto err; + + ubuf = (unsigned long) iov.iov_base; + end = (ubuf + iov.iov_len + PAGE_SIZE - 1) >> PAGE_SHIFT; + start = ubuf >> PAGE_SHIFT; + nr_pages = end - start; + + if (ctx->account_mem) { + ret = io_account_mem(ctx->user, nr_pages); + if (ret) + goto err; + } + + ret = 0; + if (!pages || nr_pages > got_pages) { + kfree(vmas); + kfree(pages); + pages = kmalloc_array(nr_pages, sizeof(struct page *), + GFP_KERNEL); + vmas = kmalloc_array(nr_pages, + sizeof(struct vm_area_struct *), + GFP_KERNEL); + if (!pages || !vmas) { + ret = -ENOMEM; + if (ctx->account_mem) + io_unaccount_mem(ctx->user, nr_pages); + goto err; + } + got_pages = nr_pages; + } + + imu->bvec = kmalloc_array(nr_pages, sizeof(struct bio_vec), + GFP_KERNEL); + ret = -ENOMEM; + if (!imu->bvec) { + if (ctx->account_mem) + io_unaccount_mem(ctx->user, nr_pages); + goto err; + } + + ret = 0; + down_read(&current->mm->mmap_sem); + pret = get_user_pages_longterm(ubuf, nr_pages, FOLL_WRITE, + pages, vmas); + if (pret == nr_pages) { + /* don't support file backed memory */ + for (j = 0; j < nr_pages; j++) { + struct vm_area_struct *vma = vmas[j]; + + if (vma->vm_file && + !is_file_hugepages(vma->vm_file)) { + ret = -EOPNOTSUPP; + break; + } + } + } else { + ret = pret < 0 ? pret : -EFAULT; + } + up_read(&current->mm->mmap_sem); + if (ret) { + /* + * if we did partial map, or found file backed vmas, + * release any pages we did get + */ + if (pret > 0) { + for (j = 0; j < pret; j++) + put_page(pages[j]); + } + if (ctx->account_mem) + io_unaccount_mem(ctx->user, nr_pages); + goto err; + } + + off = ubuf & ~PAGE_MASK; + size = iov.iov_len; + for (j = 0; j < nr_pages; j++) { + size_t vec_len; + + vec_len = min_t(size_t, size, PAGE_SIZE - off); + imu->bvec[j].bv_page = pages[j]; + imu->bvec[j].bv_len = vec_len; + imu->bvec[j].bv_offset = off; + off = 0; + size -= vec_len; + } + /* store original address for later verification */ + imu->ubuf = ubuf; + imu->len = iov.iov_len; + imu->nr_bvecs = nr_pages; + + ctx->nr_user_bufs++; + } + kfree(pages); + kfree(vmas); + return 0; +err: + kfree(pages); + kfree(vmas); + io_sqe_buffer_unregister(ctx); + return ret; +} + static void io_ring_ctx_free(struct io_ring_ctx *ctx) { if (ctx->sqo_wq) @@ -1325,6 +1618,7 @@ static void io_ring_ctx_free(struct io_ring_ctx *ctx) mmdrop(ctx->sqo_mm);
io_iopoll_reap_events(ctx); + io_sqe_buffer_unregister(ctx);
#if defined(CONFIG_UNIX) if (ctx->ring_sock) @@ -1689,6 +1983,60 @@ SYSCALL_DEFINE2(io_uring_setup, u32, entries, return io_uring_setup(entries, params); }
+static int __io_uring_register(struct io_ring_ctx *ctx, unsigned opcode, + void __user *arg, unsigned nr_args) +{ + int ret; + + percpu_ref_kill(&ctx->refs); + wait_for_completion(&ctx->ctx_done); + + switch (opcode) { + case IORING_REGISTER_BUFFERS: + ret = io_sqe_buffer_register(ctx, arg, nr_args); + break; + case IORING_UNREGISTER_BUFFERS: + ret = -EINVAL; + if (arg || nr_args) + break; + ret = io_sqe_buffer_unregister(ctx); + break; + default: + ret = -EINVAL; + break; + } + + /* bring the ctx back to life */ + reinit_completion(&ctx->ctx_done); + percpu_ref_reinit(&ctx->refs); + return ret; +} + +SYSCALL_DEFINE4(io_uring_register, unsigned int, fd, unsigned int, opcode, + void __user *, arg, unsigned int, nr_args) +{ + struct io_ring_ctx *ctx; + long ret = -EBADF; + struct fd f; + + f = fdget(fd); + if (!f.file) + return -EBADF; + + ret = -EOPNOTSUPP; + if (f.file->f_op != &io_uring_fops) + goto out_fput; + + ctx = f.file->private_data; + + mutex_lock(&ctx->uring_lock); + ret = __io_uring_register(ctx, opcode, arg, nr_args); + mutex_unlock(&ctx->uring_lock); +out_fput: + fdput(f); + return ret; +} + static int __init io_uring_init(void) { req_cachep = KMEM_CACHE(io_kiocb, SLAB_HWCACHE_ALIGN | SLAB_PANIC); diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h index d3f921b040ee..82660e1ceaca 100644 --- a/include/linux/syscalls.h +++ b/include/linux/syscalls.h @@ -315,6 +315,8 @@ asmlinkage long sys_io_uring_setup(u32 entries, asmlinkage long sys_io_uring_enter(unsigned int fd, u32 to_submit, u32 min_complete, u32 flags, const sigset_t __user *sig, size_t sigsz); +asmlinkage long sys_io_uring_register(unsigned int fd, unsigned int op, + void __user *arg, unsigned int nr_args);
/* fs/xattr.c */ asmlinkage long sys_setxattr(const char __user *path, const char __user *name, diff --git a/include/uapi/asm-generic/unistd.h b/include/uapi/asm-generic/unistd.h index 5bb61bc98200..4c1ba6d0dac8 100644 --- a/include/uapi/asm-generic/unistd.h +++ b/include/uapi/asm-generic/unistd.h @@ -744,9 +744,11 @@ __SYSCALL(__NR_rseq, sys_rseq) __SYSCALL(__NR_io_uring_setup, sys_io_uring_setup) #define __NR_io_uring_enter 426 __SYSCALL(__NR_io_uring_enter, sys_io_uring_enter) +#define __NR_io_uring_register 427 +__SYSCALL(__NR_io_uring_register, sys_io_uring_register)
#undef __NR_syscalls -#define __NR_syscalls 427 +#define __NR_syscalls 428
/* * 32 bit systems traditionally used different diff --git a/include/uapi/linux/io_uring.h b/include/uapi/linux/io_uring.h index 5c457ea396e6..cf28f7a11f12 100644 --- a/include/uapi/linux/io_uring.h +++ b/include/uapi/linux/io_uring.h @@ -27,7 +27,10 @@ struct io_uring_sqe { __u32 fsync_flags; }; __u64 user_data; /* data to be passed back at completion time */ - __u64 __pad2[3]; + union { + __u16 buf_index; /* index into fixed buffers, if used */ + __u64 __pad2[3]; + }; };
/* @@ -39,6 +42,8 @@ struct io_uring_sqe { #define IORING_OP_READV 1 #define IORING_OP_WRITEV 2 #define IORING_OP_FSYNC 3 +#define IORING_OP_READ_FIXED 4 +#define IORING_OP_WRITE_FIXED 5
/* * sqe->fsync_flags @@ -103,4 +108,10 @@ struct io_uring_params { struct io_cqring_offsets cq_off; };
+/* + * io_uring_register(2) opcodes and arguments + */ +#define IORING_REGISTER_BUFFERS 0 +#define IORING_UNREGISTER_BUFFERS 1 + #endif diff --git a/kernel/sys_ni.c b/kernel/sys_ni.c index 5cc38830e6cc..8cfc5dd43bf0 100644 --- a/kernel/sys_ni.c +++ b/kernel/sys_ni.c @@ -48,6 +48,7 @@ COND_SYSCALL_COMPAT(io_getevents); COND_SYSCALL_COMPAT(io_pgetevents); COND_SYSCALL(io_uring_setup); COND_SYSCALL(io_uring_enter); +COND_SYSCALL(io_uring_register);
/* fs/xattr.c */
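As a usage illustration (not part of this patch), a hedged userspace sketch that registers a single anonymous buffer with the new syscall. ring_fd is assumed to come from an earlier io_uring_setup(2) call; the fallback syscall number and opcode value match the tables added above.

#define _GNU_SOURCE
#include <sys/mman.h>
#include <sys/syscall.h>
#include <sys/uio.h>
#include <unistd.h>

#ifndef __NR_io_uring_register
#define __NR_io_uring_register	427
#endif
#ifndef IORING_REGISTER_BUFFERS
#define IORING_REGISTER_BUFFERS	0
#endif

int register_one_buffer(int ring_fd, size_t len)
{
	struct iovec iov;

	/* anonymous memory: file-backed buffers get -EOPNOTSUPP */
	iov.iov_base = mmap(NULL, len, PROT_READ | PROT_WRITE,
			    MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (iov.iov_base == MAP_FAILED)
		return -1;
	iov.iov_len = len;

	/* pins the pages; accounted against RLIMIT_MEMLOCK */
	return syscall(__NR_io_uring_register, ring_fd,
		       IORING_REGISTER_BUFFERS, &iov, 1);
}

A subsequent IORING_OP_READ_FIXED sqe would then set buf_index to 0 and point sqe->addr/sqe->len inside this mapping.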
From: Jens Axboe axboe@kernel.dk
mainline inclusion from mainline-5.1-rc1 commit f4e65870e5cede5ca1ec0006b6c9803994e5f7b8 category: feature bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27 CVE: NA ---------------------------
We need this functionality for io_uring file registration, but we cannot rely on it being present since CONFIG_UNIX can be modular. Move the helpers to a separate file that is always built into the kernel when CONFIG_UNIX is set to m or y.
No functional changes in this patch, just moving code around.
Reviewed-by: Hannes Reinecke hare@suse.com Acked-by: David S. Miller davem@davemloft.net Signed-off-by: Jens Axboe axboe@kernel.dk Signed-off-by: Zhihao Cheng chengzhihao1@huawei.com Signed-off-by: yangerkun yangerkun@huawei.com Reviewed-by: zhangyi (F) yi.zhang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- include/net/af_unix.h | 1 + net/Makefile | 2 +- net/unix/Kconfig | 5 ++ net/unix/Makefile | 2 + net/unix/af_unix.c | 63 +----------------- net/unix/garbage.c | 71 +------------------- net/unix/scm.c | 151 ++++++++++++++++++++++++++++++++++++++++++ net/unix/scm.h | 10 +++ 8 files changed, 174 insertions(+), 131 deletions(-) create mode 100644 net/unix/scm.c create mode 100644 net/unix/scm.h
diff --git a/include/net/af_unix.h b/include/net/af_unix.h index a5ba41b3b867..7ec1cdb66be8 100644 --- a/include/net/af_unix.h +++ b/include/net/af_unix.h @@ -10,6 +10,7 @@
void unix_inflight(struct user_struct *user, struct file *fp); void unix_notinflight(struct user_struct *user, struct file *fp); +void unix_destruct_scm(struct sk_buff *skb); void unix_gc(void); void wait_for_unix_gc(void); struct sock *unix_get_socket(struct file *filp); diff --git a/net/Makefile b/net/Makefile index bdaf53925acd..449fc0b221f8 100644 --- a/net/Makefile +++ b/net/Makefile @@ -18,7 +18,7 @@ obj-$(CONFIG_NETFILTER) += netfilter/ obj-$(CONFIG_INET) += ipv4/ obj-$(CONFIG_TLS) += tls/ obj-$(CONFIG_XFRM) += xfrm/ -obj-$(CONFIG_UNIX) += unix/ +obj-$(CONFIG_UNIX_SCM) += unix/ obj-$(CONFIG_NET) += ipv6/ obj-$(CONFIG_BPFILTER) += bpfilter/ obj-$(CONFIG_PACKET) += packet/ diff --git a/net/unix/Kconfig b/net/unix/Kconfig index 8b31ab85d050..3b9e450656a4 100644 --- a/net/unix/Kconfig +++ b/net/unix/Kconfig @@ -19,6 +19,11 @@ config UNIX
Say Y unless you know what you are doing.
+config UNIX_SCM + bool + depends on UNIX + default y + config UNIX_DIAG tristate "UNIX: socket monitoring interface" depends on UNIX diff --git a/net/unix/Makefile b/net/unix/Makefile index ffd0a275c3a7..54e58cc4f945 100644 --- a/net/unix/Makefile +++ b/net/unix/Makefile @@ -10,3 +10,5 @@ unix-$(CONFIG_SYSCTL) += sysctl_net_unix.o
obj-$(CONFIG_UNIX_DIAG) += unix_diag.o unix_diag-y := diag.o + +obj-$(CONFIG_UNIX_SCM) += scm.o diff --git a/net/unix/af_unix.c b/net/unix/af_unix.c index 2020306468af..b09f0b567db5 100644 --- a/net/unix/af_unix.c +++ b/net/unix/af_unix.c @@ -119,6 +119,8 @@ #include <linux/freezer.h> #include <linux/file.h>
+#include "scm.h" + struct hlist_head unix_socket_table[2 * UNIX_HASH_SIZE]; EXPORT_SYMBOL_GPL(unix_socket_table); DEFINE_SPINLOCK(unix_table_lock); @@ -1514,67 +1516,6 @@ static int unix_getname(struct socket *sock, struct sockaddr *uaddr, int peer) return err; }
-static void unix_detach_fds(struct scm_cookie *scm, struct sk_buff *skb) -{ - int i; - - scm->fp = UNIXCB(skb).fp; - UNIXCB(skb).fp = NULL; - - for (i = scm->fp->count-1; i >= 0; i--) - unix_notinflight(scm->fp->user, scm->fp->fp[i]); -} - -static void unix_destruct_scm(struct sk_buff *skb) -{ - struct scm_cookie scm; - memset(&scm, 0, sizeof(scm)); - scm.pid = UNIXCB(skb).pid; - if (UNIXCB(skb).fp) - unix_detach_fds(&scm, skb); - - /* Alas, it calls VFS */ - /* So fscking what? fput() had been SMP-safe since the last Summer */ - scm_destroy(&scm); - sock_wfree(skb); -} - -/* - * The "user->unix_inflight" variable is protected by the garbage - * collection lock, and we just read it locklessly here. If you go - * over the limit, there might be a tiny race in actually noticing - * it across threads. Tough. - */ -static inline bool too_many_unix_fds(struct task_struct *p) -{ - struct user_struct *user = current_user(); - - if (unlikely(user->unix_inflight > task_rlimit(p, RLIMIT_NOFILE))) - return !capable(CAP_SYS_RESOURCE) && !capable(CAP_SYS_ADMIN); - return false; -} - -static int unix_attach_fds(struct scm_cookie *scm, struct sk_buff *skb) -{ - int i; - - if (too_many_unix_fds(current)) - return -ETOOMANYREFS; - - /* - * Need to duplicate file references for the sake of garbage - * collection. Otherwise a socket in the fps might become a - * candidate for GC while the skb is not yet queued. - */ - UNIXCB(skb).fp = scm_fp_dup(scm->fp); - if (!UNIXCB(skb).fp) - return -ENOMEM; - - for (i = scm->fp->count - 1; i >= 0; i--) - unix_inflight(scm->fp->user, scm->fp->fp[i]); - return 0; -} - static int unix_scm_to_skb(struct scm_cookie *scm, struct sk_buff *skb, bool send_fds) { int err = 0; diff --git a/net/unix/garbage.c b/net/unix/garbage.c index f81854d74c7d..8bbe1b8e4ff7 100644 --- a/net/unix/garbage.c +++ b/net/unix/garbage.c @@ -86,80 +86,13 @@ #include <net/scm.h> #include <net/tcp_states.h>
+#include "scm.h" + /* Internal data structures and random procedures: */
-static LIST_HEAD(gc_inflight_list); static LIST_HEAD(gc_candidates); -static DEFINE_SPINLOCK(unix_gc_lock); static DECLARE_WAIT_QUEUE_HEAD(unix_gc_wait);
-unsigned int unix_tot_inflight; - -struct sock *unix_get_socket(struct file *filp) -{ - struct sock *u_sock = NULL; - struct inode *inode = file_inode(filp); - - /* Socket ? */ - if (S_ISSOCK(inode->i_mode) && !(filp->f_mode & FMODE_PATH)) { - struct socket *sock = SOCKET_I(inode); - struct sock *s = sock->sk; - - /* PF_UNIX ? */ - if (s && sock->ops && sock->ops->family == PF_UNIX) - u_sock = s; - } else { - /* Could be an io_uring instance */ - u_sock = io_uring_get_socket(filp); - } - return u_sock; -} - -/* Keep the number of times in flight count for the file - * descriptor if it is for an AF_UNIX socket. - */ - -void unix_inflight(struct user_struct *user, struct file *fp) -{ - struct sock *s = unix_get_socket(fp); - - spin_lock(&unix_gc_lock); - - if (s) { - struct unix_sock *u = unix_sk(s); - - if (atomic_long_inc_return(&u->inflight) == 1) { - BUG_ON(!list_empty(&u->link)); - list_add_tail(&u->link, &gc_inflight_list); - } else { - BUG_ON(list_empty(&u->link)); - } - unix_tot_inflight++; - } - user->unix_inflight++; - spin_unlock(&unix_gc_lock); -} - -void unix_notinflight(struct user_struct *user, struct file *fp) -{ - struct sock *s = unix_get_socket(fp); - - spin_lock(&unix_gc_lock); - - if (s) { - struct unix_sock *u = unix_sk(s); - - BUG_ON(!atomic_long_read(&u->inflight)); - BUG_ON(list_empty(&u->link)); - - if (atomic_long_dec_and_test(&u->inflight)) - list_del_init(&u->link); - unix_tot_inflight--; - } - user->unix_inflight--; - spin_unlock(&unix_gc_lock); -} - static void scan_inflight(struct sock *x, void (*func)(struct unix_sock *), struct sk_buff_head *hitlist) { diff --git a/net/unix/scm.c b/net/unix/scm.c new file mode 100644 index 000000000000..8c40f2b32392 --- /dev/null +++ b/net/unix/scm.c @@ -0,0 +1,151 @@ +// SPDX-License-Identifier: GPL-2.0 +#include <linux/module.h> +#include <linux/kernel.h> +#include <linux/string.h> +#include <linux/socket.h> +#include <linux/net.h> +#include <linux/fs.h> +#include <net/af_unix.h> +#include <net/scm.h> +#include <linux/init.h> + +#include "scm.h" + +unsigned int unix_tot_inflight; +EXPORT_SYMBOL(unix_tot_inflight); + +LIST_HEAD(gc_inflight_list); +EXPORT_SYMBOL(gc_inflight_list); + +DEFINE_SPINLOCK(unix_gc_lock); +EXPORT_SYMBOL(unix_gc_lock); + +struct sock *unix_get_socket(struct file *filp) +{ + struct sock *u_sock = NULL; + struct inode *inode = file_inode(filp); + + /* Socket ? */ + if (S_ISSOCK(inode->i_mode) && !(filp->f_mode & FMODE_PATH)) { + struct socket *sock = SOCKET_I(inode); + struct sock *s = sock->sk; + + /* PF_UNIX ? */ + if (s && sock->ops && sock->ops->family == PF_UNIX) + u_sock = s; + } else { + /* Could be an io_uring instance */ + u_sock = io_uring_get_socket(filp); + } + return u_sock; +} +EXPORT_SYMBOL(unix_get_socket); + +/* Keep the number of times in flight count for the file + * descriptor if it is for an AF_UNIX socket. 
+ */ +void unix_inflight(struct user_struct *user, struct file *fp) +{ + struct sock *s = unix_get_socket(fp); + + spin_lock(&unix_gc_lock); + + if (s) { + struct unix_sock *u = unix_sk(s); + + if (atomic_long_inc_return(&u->inflight) == 1) { + BUG_ON(!list_empty(&u->link)); + list_add_tail(&u->link, &gc_inflight_list); + } else { + BUG_ON(list_empty(&u->link)); + } + unix_tot_inflight++; + } + user->unix_inflight++; + spin_unlock(&unix_gc_lock); +} + +void unix_notinflight(struct user_struct *user, struct file *fp) +{ + struct sock *s = unix_get_socket(fp); + + spin_lock(&unix_gc_lock); + + if (s) { + struct unix_sock *u = unix_sk(s); + + BUG_ON(!atomic_long_read(&u->inflight)); + BUG_ON(list_empty(&u->link)); + + if (atomic_long_dec_and_test(&u->inflight)) + list_del_init(&u->link); + unix_tot_inflight--; + } + user->unix_inflight--; + spin_unlock(&unix_gc_lock); +} + +/* + * The "user->unix_inflight" variable is protected by the garbage + * collection lock, and we just read it locklessly here. If you go + * over the limit, there might be a tiny race in actually noticing + * it across threads. Tough. + */ +static inline bool too_many_unix_fds(struct task_struct *p) +{ + struct user_struct *user = current_user(); + + if (unlikely(user->unix_inflight > task_rlimit(p, RLIMIT_NOFILE))) + return !capable(CAP_SYS_RESOURCE) && !capable(CAP_SYS_ADMIN); + return false; +} + +int unix_attach_fds(struct scm_cookie *scm, struct sk_buff *skb) +{ + int i; + + if (too_many_unix_fds(current)) + return -ETOOMANYREFS; + + /* + * Need to duplicate file references for the sake of garbage + * collection. Otherwise a socket in the fps might become a + * candidate for GC while the skb is not yet queued. + */ + UNIXCB(skb).fp = scm_fp_dup(scm->fp); + if (!UNIXCB(skb).fp) + return -ENOMEM; + + for (i = scm->fp->count - 1; i >= 0; i--) + unix_inflight(scm->fp->user, scm->fp->fp[i]); + return 0; +} +EXPORT_SYMBOL(unix_attach_fds); + +void unix_detach_fds(struct scm_cookie *scm, struct sk_buff *skb) +{ + int i; + + scm->fp = UNIXCB(skb).fp; + UNIXCB(skb).fp = NULL; + + for (i = scm->fp->count-1; i >= 0; i--) + unix_notinflight(scm->fp->user, scm->fp->fp[i]); +} +EXPORT_SYMBOL(unix_detach_fds); + +void unix_destruct_scm(struct sk_buff *skb) +{ + struct scm_cookie scm; + + memset(&scm, 0, sizeof(scm)); + scm.pid = UNIXCB(skb).pid; + if (UNIXCB(skb).fp) + unix_detach_fds(&scm, skb); + + /* Alas, it calls VFS */ + /* So fscking what? fput() had been SMP-safe since the last Summer */ + scm_destroy(&scm); + sock_wfree(skb); +} +EXPORT_SYMBOL(unix_destruct_scm); diff --git a/net/unix/scm.h b/net/unix/scm.h new file mode 100644 index 000000000000..5a255a477f16 --- /dev/null +++ b/net/unix/scm.h @@ -0,0 +1,10 @@ +#ifndef NET_UNIX_SCM_H +#define NET_UNIX_SCM_H + +extern struct list_head gc_inflight_list; +extern spinlock_t unix_gc_lock; + +int unix_attach_fds(struct scm_cookie *scm, struct sk_buff *skb); +void unix_detach_fds(struct scm_cookie *scm, struct sk_buff *skb); + +#endif
From: Jens Axboe axboe@kernel.dk
mainline inclusion from mainline-5.1-rc1 commit 6b06314c47e141031be043539900d80d2c7ba10f category: feature bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27 CVE: NA ---------------------------
We normally have to fget/fput for each IO we do on a file. Even with the batching we do, the cost of the atomic inc/dec of the file usage count adds up.
This adds the IORING_REGISTER_FILES and IORING_UNREGISTER_FILES opcodes for the io_uring_register(2) system call. The argument passed in must be an array of __s32 holding file descriptors, and nr_args should hold the number of file descriptors the application wishes to pin for the duration of the io_uring instance (or until IORING_UNREGISTER_FILES is called).
When used, the application must set IOSQE_FIXED_FILE in the sqe->flags member. Then, instead of setting sqe->fd to the real fd, it sets sqe->fd to the index in the array passed in to IORING_REGISTER_FILES.
Files are automatically unregistered when the io_uring instance is torn down. An application need only unregister if it wishes to register a new set of fds.
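Before the diff, an illustrative userspace sketch of that flow (not part of this patch; the helper names are hypothetical, the fallback syscall number matches the earlier patch in this series, and the constants are the ones this patch adds to the uapi header):

#include <string.h>
#include <sys/syscall.h>
#include <sys/uio.h>
#include <unistd.h>
#include <linux/io_uring.h>

#ifndef __NR_io_uring_register
#define __NR_io_uring_register	427
#endif

/* register two descriptors; sqe->fd then indexes this table */
static int register_files(int ring_fd, int fd0, int fd1)
{
	__s32 fds[2] = { fd0, fd1 };

	return syscall(__NR_io_uring_register, ring_fd,
		       IORING_REGISTER_FILES, fds, 2);
}

/* fill an already-acquired sqe for a readv on registered file 'index' */
static void prep_fixed_readv(struct io_uring_sqe *sqe, int index,
			     const struct iovec *iov, unsigned nr_iov,
			     __u64 offset)
{
	memset(sqe, 0, sizeof(*sqe));
	sqe->opcode = IORING_OP_READV;
	sqe->flags = IOSQE_FIXED_FILE;	/* fd is a table index, not a descriptor */
	sqe->fd = index;
	sqe->addr = (unsigned long)iov;
	sqe->len = nr_iov;
	sqe->off = offset;
}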
Reviewed-by: Hannes Reinecke hare@suse.com Signed-off-by: Jens Axboe axboe@kernel.dk Signed-off-by: Zhihao Cheng chengzhihao1@huawei.com Signed-off-by: yangerkun yangerkun@huawei.com Reviewed-by: zhangyi (F) yi.zhang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- fs/io_uring.c | 311 ++++++++++++++++++++++++++++++---- include/uapi/linux/io_uring.h | 9 +- 2 files changed, 288 insertions(+), 32 deletions(-)
diff --git a/fs/io_uring.c b/fs/io_uring.c index 762df0beb199..d5a4f00f7a98 100644 --- a/fs/io_uring.c +++ b/fs/io_uring.c @@ -49,6 +49,7 @@ #include <linux/net.h> #include <net/sock.h> #include <net/af_unix.h> +#include <net/scm.h> #include <linux/anon_inodes.h> #include <linux/sched/mm.h> #include <linux/uaccess.h> @@ -61,6 +62,7 @@ #include "internal.h"
#define IORING_MAX_ENTRIES 4096 +#define IORING_MAX_FIXED_FILES 1024
struct io_uring { u32 head ____cacheline_aligned_in_smp; @@ -123,6 +125,14 @@ struct io_ring_ctx { struct fasync_struct *cq_fasync; } ____cacheline_aligned_in_smp;
+ /* + * If used, fixed file set. Writers must ensure that ->refs is dead, + * readers must ensure that ->refs is alive as long as the file* is + * used. Only updated through io_uring_register(2). + */ + struct file **user_files; + unsigned nr_user_files; + /* if used, fixed mapped user buffers */ unsigned nr_user_bufs; struct io_mapped_ubuf *user_bufs; @@ -170,6 +180,7 @@ struct io_kiocb { unsigned int flags; #define REQ_F_FORCE_NONBLOCK 1 /* inline submission attempt */ #define REQ_F_IOPOLL_COMPLETED 2 /* polled IO has completed */ +#define REQ_F_FIXED_FILE 4 /* ctx owns file */ u64 user_data; u64 error;
@@ -404,15 +415,17 @@ static void io_iopoll_complete(struct io_ring_ctx *ctx, unsigned int *nr_events, * Batched puts of the same file, to avoid dirtying the * file usage count multiple times, if avoidable. */ - if (!file) { - file = req->rw.ki_filp; - file_count = 1; - } else if (file == req->rw.ki_filp) { - file_count++; - } else { - fput_many(file, file_count); - file = req->rw.ki_filp; - file_count = 1; + if (!(req->flags & REQ_F_FIXED_FILE)) { + if (!file) { + file = req->rw.ki_filp; + file_count = 1; + } else if (file == req->rw.ki_filp) { + file_count++; + } else { + fput_many(file, file_count); + file = req->rw.ki_filp; + file_count = 1; + } }
if (to_free == ARRAY_SIZE(reqs)) @@ -544,13 +557,19 @@ static void kiocb_end_write(struct kiocb *kiocb) } }
+static void io_fput(struct io_kiocb *req) +{ + if (!(req->flags & REQ_F_FIXED_FILE)) + fput(req->rw.ki_filp); +} + static void io_complete_rw(struct kiocb *kiocb, long res, long res2) { struct io_kiocb *req = container_of(kiocb, struct io_kiocb, rw);
kiocb_end_write(kiocb);
- fput(kiocb->ki_filp); + io_fput(req); io_cqring_add_event(req->ctx, req->user_data, res, 0); io_free_req(req); } @@ -666,19 +685,29 @@ static int io_prep_rw(struct io_kiocb *req, const struct io_uring_sqe *sqe, { struct io_ring_ctx *ctx = req->ctx; struct kiocb *kiocb = &req->rw; - unsigned ioprio; + unsigned ioprio, flags; int fd, ret;
/* For -EAGAIN retry, everything is already prepped */ if (kiocb->ki_filp) return 0;
+ flags = READ_ONCE(sqe->flags); fd = READ_ONCE(sqe->fd); - kiocb->ki_filp = io_file_get(state, fd); - if (unlikely(!kiocb->ki_filp)) - return -EBADF; - if (force_nonblock && !io_file_supports_async(kiocb->ki_filp)) - force_nonblock = false; + + if (flags & IOSQE_FIXED_FILE) { + if (unlikely(!ctx->user_files || + (unsigned) fd >= ctx->nr_user_files)) + return -EBADF; + kiocb->ki_filp = ctx->user_files[fd]; + req->flags |= REQ_F_FIXED_FILE; + } else { + kiocb->ki_filp = io_file_get(state, fd); + if (unlikely(!kiocb->ki_filp)) + return -EBADF; + if (force_nonblock && !io_file_supports_async(kiocb->ki_filp)) + force_nonblock = false; + } kiocb->ki_pos = READ_ONCE(sqe->off); kiocb->ki_flags = iocb_flags(kiocb->ki_filp); kiocb->ki_hint = ki_hint_validate(file_write_hint(kiocb->ki_filp)); @@ -718,10 +747,14 @@ static int io_prep_rw(struct io_kiocb *req, const struct io_uring_sqe *sqe, } return 0; out_fput: - /* in case of error, we didn't use this file reference. drop it. */ - if (state) - state->used_refs--; - io_file_put(state, kiocb->ki_filp); + if (!(flags & IOSQE_FIXED_FILE)) { + /* + * in case of error, we didn't use this file reference. drop it. + */ + if (state) + state->used_refs--; + io_file_put(state, kiocb->ki_filp); + } return ret; }
@@ -863,7 +896,7 @@ static ssize_t io_read(struct io_kiocb *req, const struct sqe_submit *s, out_fput: /* Hold on to the file for -EAGAIN */ if (unlikely(ret && ret != -EAGAIN)) - fput(file); + io_fput(req); return ret; }
@@ -917,7 +950,7 @@ static ssize_t io_write(struct io_kiocb *req, const struct sqe_submit *s, kfree(iovec); out_fput: if (unlikely(ret)) - fput(file); + io_fput(req); return ret; }
@@ -940,7 +973,7 @@ static int io_nop(struct io_kiocb *req, u64 user_data) */ if (req->rw.ki_filp) { err = -EBADF; - fput(req->rw.ki_filp); + io_fput(req); } io_cqring_add_event(ctx, user_data, err, 0); io_free_req(req); @@ -949,21 +982,32 @@ static int io_nop(struct io_kiocb *req, u64 user_data)
static int io_prep_fsync(struct io_kiocb *req, const struct io_uring_sqe *sqe) { + struct io_ring_ctx *ctx = req->ctx; + unsigned flags; int fd;
/* Prep already done */ if (req->rw.ki_filp) return 0;
- if (unlikely(req->ctx->flags & IORING_SETUP_IOPOLL)) + if (unlikely(ctx->flags & IORING_SETUP_IOPOLL)) return -EINVAL; if (unlikely(sqe->addr || sqe->ioprio || sqe->buf_index)) return -EINVAL;
fd = READ_ONCE(sqe->fd); - req->rw.ki_filp = fget(fd); - if (unlikely(!req->rw.ki_filp)) - return -EBADF; + flags = READ_ONCE(sqe->flags); + + if (flags & IOSQE_FIXED_FILE) { + if (unlikely(!ctx->user_files || fd >= ctx->nr_user_files)) + return -EBADF; + req->rw.ki_filp = ctx->user_files[fd]; + req->flags |= REQ_F_FIXED_FILE; + } else { + req->rw.ki_filp = fget(fd); + if (unlikely(!req->rw.ki_filp)) + return -EBADF; + }
return 0; } @@ -993,7 +1037,7 @@ static int io_fsync(struct io_kiocb *req, const struct io_uring_sqe *sqe, end > 0 ? end : LLONG_MAX, fsync_flags & IORING_FSYNC_DATASYNC);
- fput(req->rw.ki_filp); + io_fput(req); io_cqring_add_event(req->ctx, sqe->user_data, ret, 0); io_free_req(req); return 0; @@ -1132,7 +1176,7 @@ static int io_submit_sqe(struct io_ring_ctx *ctx, struct sqe_submit *s, ssize_t ret;
/* enforce forwards compatibility on users */ - if (unlikely(s->sqe->flags)) + if (unlikely(s->sqe->flags & ~IOSQE_FIXED_FILE)) return -EINVAL;
req = io_get_req(ctx, state); @@ -1344,6 +1388,201 @@ static int io_cqring_wait(struct io_ring_ctx *ctx, int min_events, return READ_ONCE(ring->r.head) == READ_ONCE(ring->r.tail) ? ret : 0; }
+static void __io_sqe_files_unregister(struct io_ring_ctx *ctx) +{ +#if defined(CONFIG_UNIX) + if (ctx->ring_sock) { + struct sock *sock = ctx->ring_sock->sk; + struct sk_buff *skb; + + while ((skb = skb_dequeue(&sock->sk_receive_queue)) != NULL) + kfree_skb(skb); + } +#else + int i; + + for (i = 0; i < ctx->nr_user_files; i++) + fput(ctx->user_files[i]); +#endif +} + +static int io_sqe_files_unregister(struct io_ring_ctx *ctx) +{ + if (!ctx->user_files) + return -ENXIO; + + __io_sqe_files_unregister(ctx); + kfree(ctx->user_files); + ctx->user_files = NULL; + ctx->nr_user_files = 0; + return 0; +} + +static void io_finish_async(struct io_ring_ctx *ctx) +{ + if (ctx->sqo_wq) { + destroy_workqueue(ctx->sqo_wq); + ctx->sqo_wq = NULL; + } +} + +#if defined(CONFIG_UNIX) +static void io_destruct_skb(struct sk_buff *skb) +{ + struct io_ring_ctx *ctx = skb->sk->sk_user_data; + + io_finish_async(ctx); + unix_destruct_scm(skb); +} + +/* + * Ensure the UNIX gc is aware of our file set, so we are certain that + * the io_uring can be safely unregistered on process exit, even if we have + * loops in the file referencing. + */ +static int __io_sqe_files_scm(struct io_ring_ctx *ctx, int nr, int offset) +{ + struct sock *sk = ctx->ring_sock->sk; + struct scm_fp_list *fpl; + struct sk_buff *skb; + int i; + + if (!capable(CAP_SYS_RESOURCE) && !capable(CAP_SYS_ADMIN)) { + unsigned long inflight = ctx->user->unix_inflight + nr; + + if (inflight > task_rlimit(current, RLIMIT_NOFILE)) + return -EMFILE; + } + + fpl = kzalloc(sizeof(*fpl), GFP_KERNEL); + if (!fpl) + return -ENOMEM; + + skb = alloc_skb(0, GFP_KERNEL); + if (!skb) { + kfree(fpl); + return -ENOMEM; + } + + skb->sk = sk; + skb->destructor = io_destruct_skb; + + fpl->user = get_uid(ctx->user); + for (i = 0; i < nr; i++) { + fpl->fp[i] = get_file(ctx->user_files[i + offset]); + unix_inflight(fpl->user, fpl->fp[i]); + } + + fpl->max = fpl->count = nr; + UNIXCB(skb).fp = fpl; + refcount_add(skb->truesize, &sk->sk_wmem_alloc); + skb_queue_head(&sk->sk_receive_queue, skb); + + for (i = 0; i < nr; i++) + fput(fpl->fp[i]); + + return 0; +} + +/* + * If UNIX sockets are enabled, fd passing can cause a reference cycle which + * causes regular reference counting to break down. We rely on the UNIX + * garbage collection to take care of this problem for us. 
+ */ +static int io_sqe_files_scm(struct io_ring_ctx *ctx) +{ + unsigned left, total; + int ret = 0; + + total = 0; + left = ctx->nr_user_files; + while (left) { + unsigned this_files = min_t(unsigned, left, SCM_MAX_FD); + int ret; + + ret = __io_sqe_files_scm(ctx, this_files, total); + if (ret) + break; + left -= this_files; + total += this_files; + } + + if (!ret) + return 0; + + while (total < ctx->nr_user_files) { + fput(ctx->user_files[total]); + total++; + } + + return ret; +} +#else +static int io_sqe_files_scm(struct io_ring_ctx *ctx) +{ + return 0; +} +#endif + +static int io_sqe_files_register(struct io_ring_ctx *ctx, void __user *arg, + unsigned nr_args) +{ + __s32 __user *fds = (__s32 __user *) arg; + int fd, ret = 0; + unsigned i; + + if (ctx->user_files) + return -EBUSY; + if (!nr_args) + return -EINVAL; + if (nr_args > IORING_MAX_FIXED_FILES) + return -EMFILE; + + ctx->user_files = kcalloc(nr_args, sizeof(struct file *), GFP_KERNEL); + if (!ctx->user_files) + return -ENOMEM; + + for (i = 0; i < nr_args; i++) { + ret = -EFAULT; + if (copy_from_user(&fd, &fds[i], sizeof(fd))) + break; + + ctx->user_files[i] = fget(fd); + + ret = -EBADF; + if (!ctx->user_files[i]) + break; + /* + * Don't allow io_uring instances to be registered. If UNIX + * isn't enabled, then this causes a reference cycle and this + * instance can never get freed. If UNIX is enabled we'll + * handle it just fine, but there's still no point in allowing + * a ring fd as it doesn't support regular read/write anyway. + */ + if (ctx->user_files[i]->f_op == &io_uring_fops) { + fput(ctx->user_files[i]); + break; + } + ctx->nr_user_files++; + ret = 0; + } + + if (ret) { + for (i = 0; i < ctx->nr_user_files; i++) + fput(ctx->user_files[i]); + + kfree(ctx->user_files); + ctx->nr_user_files = 0; + return ret; + } + + ret = io_sqe_files_scm(ctx); + if (ret) + io_sqe_files_unregister(ctx); + + return ret; +} + static int io_sq_offload_start(struct io_ring_ctx *ctx) { int ret; @@ -1612,13 +1851,13 @@ static int io_sqe_buffer_register(struct io_ring_ctx *ctx, void __user *arg,
static void io_ring_ctx_free(struct io_ring_ctx *ctx) { - if (ctx->sqo_wq) - destroy_workqueue(ctx->sqo_wq); + io_finish_async(ctx); if (ctx->sqo_mm) mmdrop(ctx->sqo_mm);
io_iopoll_reap_events(ctx); io_sqe_buffer_unregister(ctx); + io_sqe_files_unregister(ctx);
#if defined(CONFIG_UNIX) if (ctx->ring_sock) @@ -1858,6 +2097,7 @@ static int io_uring_get_fd(struct io_ring_ctx *ctx)
#if defined(CONFIG_UNIX) ctx->ring_sock->file = file; + ctx->ring_sock->sk->sk_user_data = ctx; #endif fd_install(ret, file); return ret; @@ -2001,6 +2241,15 @@ static int __io_uring_register(struct io_ring_ctx *ctx, unsigned opcode, break; ret = io_sqe_buffer_unregister(ctx); break; + case IORING_REGISTER_FILES: + ret = io_sqe_files_register(ctx, arg, nr_args); + break; + case IORING_UNREGISTER_FILES: + ret = -EINVAL; + if (arg || nr_args) + break; + ret = io_sqe_files_unregister(ctx); + break; default: ret = -EINVAL; break; diff --git a/include/uapi/linux/io_uring.h b/include/uapi/linux/io_uring.h index cf28f7a11f12..6257478d55e9 100644 --- a/include/uapi/linux/io_uring.h +++ b/include/uapi/linux/io_uring.h @@ -16,7 +16,7 @@ */ struct io_uring_sqe { __u8 opcode; /* type of operation for this sqe */ - __u8 flags; /* as of now unused */ + __u8 flags; /* IOSQE_ flags */ __u16 ioprio; /* ioprio for the request */ __s32 fd; /* file descriptor to do IO on */ __u64 off; /* offset into file */ @@ -33,6 +33,11 @@ struct io_uring_sqe { }; };
+/* + * sqe->flags + */ +#define IOSQE_FIXED_FILE (1U << 0) /* use fixed fileset */ + /* * io_uring_setup() flags */ @@ -113,5 +118,7 @@ struct io_uring_params { */ #define IORING_REGISTER_BUFFERS 0 #define IORING_UNREGISTER_BUFFERS 1 +#define IORING_REGISTER_FILES 2 +#define IORING_UNREGISTER_FILES 3
#endif
From: Jens Axboe axboe@kernel.dk
mainline inclusion from mainline-5.1-rc1 commit 6c271ce2f1d572f7fa225700a13cfe7ced492434 category: feature bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27 CVE: NA ---------------------------
This enables an application to do IO without ever entering the kernel. By using the SQ ring to fill in new sqes and watching for completions on the CQ ring, we can submit and reap IOs without doing a single system call. The kernel-side thread will poll for new submissions, and in case of HIPRI/polled IO, it'll also poll for completions.
By default, we allow 1 second of active spinning. This can be changed by passing a different grace period in the sq_thread_idle field of struct io_uring_params at io_uring_setup(2) time. If the thread exceeds this idle time without having any work to do, it will set:
sq_ring->flags |= IORING_SQ_NEED_WAKEUP.
The application will have to call io_uring_enter() to start things back up again. If IO is kept busy, that will never be needed. Basically, an application that has this feature enabled will guard its io_uring_enter(2) call with:
read_barrier(); if (*sq_ring->flags & IORING_SQ_NEED_WAKEUP) io_uring_enter(fd, 0, 0, IORING_ENTER_SQ_WAKEUP);
instead of calling it unconditionally.
It's mandatory to use fixed files with this feature. Failure to do so will result in the application getting an -EBADF CQ entry when submitting IO.
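For reference, a minimal (hypothetical) setup and wakeup sequence for this mode, using the fields this patch adds to struct io_uring_params, might look like the following; DEPTH, ring_fd, and the mapped sq_ring are assumed to come from the usual io_uring_setup()/mmap() sequence:

    struct io_uring_params p;

    memset(&p, 0, sizeof(p));
    p.flags = IORING_SETUP_SQPOLL | IORING_SETUP_SQ_AFF;
    p.sq_thread_cpu = 0;        /* pin the poller thread to CPU 0 */
    p.sq_thread_idle = 2000;    /* spin for 2000 msec before sleeping */

    ring_fd = io_uring_setup(DEPTH, &p);

    /* later, after filling new sqes and bumping the SQ ring tail: */
    read_barrier();
    if (*sq_ring->flags & IORING_SQ_NEED_WAKEUP)
        io_uring_enter(ring_fd, 0, 0, IORING_ENTER_SQ_WAKEUP);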
Reviewed-by: Hannes Reinecke hare@suse.com Signed-off-by: Jens Axboe axboe@kernel.dk Signed-off-by: Zhihao Cheng chengzhihao1@huawei.com Signed-off-by: yangerkun yangerkun@huawei.com Reviewed-by: zhangyi (F) yi.zhang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- fs/io_uring.c | 249 +++++++++++++++++++++++++++++++++- include/uapi/linux/io_uring.h | 12 +- 2 files changed, 253 insertions(+), 8 deletions(-)
diff --git a/fs/io_uring.c b/fs/io_uring.c index d5a4f00f7a98..5e6197250484 100644 --- a/fs/io_uring.c +++ b/fs/io_uring.c @@ -44,6 +44,7 @@ #include <linux/percpu.h> #include <linux/slab.h> #include <linux/workqueue.h> +#include <linux/kthread.h> #include <linux/blkdev.h> #include <linux/bvec.h> #include <linux/net.h> @@ -108,12 +109,16 @@ struct io_ring_ctx { unsigned cached_sq_head; unsigned sq_entries; unsigned sq_mask; + unsigned sq_thread_idle; struct io_uring_sqe *sq_sqes; } ____cacheline_aligned_in_smp;
/* IO offload */ struct workqueue_struct *sqo_wq; + struct task_struct *sqo_thread; /* if using sq thread polling */ struct mm_struct *sqo_mm; + wait_queue_head_t sqo_wait; + unsigned sqo_stop;
struct { /* CQ ring */ @@ -168,6 +173,7 @@ struct sqe_submit { unsigned short index; bool has_user; bool needs_lock; + bool needs_fixed_file; };
struct io_kiocb { @@ -327,6 +333,8 @@ static void io_cqring_add_event(struct io_ring_ctx *ctx, u64 ki_user_data,
if (waitqueue_active(&ctx->wait)) wake_up(&ctx->wait); + if (waitqueue_active(&ctx->sqo_wait)) + wake_up(&ctx->sqo_wait); }
static void io_ring_drop_ctx_refs(struct io_ring_ctx *ctx, unsigned refs) @@ -680,9 +688,10 @@ static bool io_file_supports_async(struct file *file) return false; }
-static int io_prep_rw(struct io_kiocb *req, const struct io_uring_sqe *sqe, +static int io_prep_rw(struct io_kiocb *req, const struct sqe_submit *s, bool force_nonblock, struct io_submit_state *state) { + const struct io_uring_sqe *sqe = s->sqe; struct io_ring_ctx *ctx = req->ctx; struct kiocb *kiocb = &req->rw; unsigned ioprio, flags; @@ -702,6 +711,8 @@ static int io_prep_rw(struct io_kiocb *req, const struct io_uring_sqe *sqe, kiocb->ki_filp = ctx->user_files[fd]; req->flags |= REQ_F_FIXED_FILE; } else { + if (s->needs_fixed_file) + return -EBADF; kiocb->ki_filp = io_file_get(state, fd); if (unlikely(!kiocb->ki_filp)) return -EBADF; @@ -865,7 +876,7 @@ static ssize_t io_read(struct io_kiocb *req, const struct sqe_submit *s, struct file *file; ssize_t ret;
- ret = io_prep_rw(req, s->sqe, force_nonblock, state); + ret = io_prep_rw(req, s, force_nonblock, state); if (ret) return ret; file = kiocb->ki_filp; @@ -909,7 +920,7 @@ static ssize_t io_write(struct io_kiocb *req, const struct sqe_submit *s, struct file *file; ssize_t ret;
- ret = io_prep_rw(req, s->sqe, force_nonblock, state); + ret = io_prep_rw(req, s, force_nonblock, state); if (ret) return ret; /* Hold on to the file for -EAGAIN */ @@ -1301,6 +1312,169 @@ static bool io_get_sqring(struct io_ring_ctx *ctx, struct sqe_submit *s) return false; }
+static int io_submit_sqes(struct io_ring_ctx *ctx, struct sqe_submit *sqes, + unsigned int nr, bool has_user, bool mm_fault) +{ + struct io_submit_state state, *statep = NULL; + int ret, i, submitted = 0; + + if (nr > IO_PLUG_THRESHOLD) { + io_submit_state_start(&state, ctx, nr); + statep = &state; + } + + for (i = 0; i < nr; i++) { + if (unlikely(mm_fault)) { + ret = -EFAULT; + } else { + sqes[i].has_user = has_user; + sqes[i].needs_lock = true; + sqes[i].needs_fixed_file = true; + ret = io_submit_sqe(ctx, &sqes[i], statep); + } + if (!ret) { + submitted++; + continue; + } + + io_cqring_add_event(ctx, sqes[i].sqe->user_data, ret, 0); + } + + if (statep) + io_submit_state_end(&state); + + return submitted; +} + +static int io_sq_thread(void *data) +{ + struct sqe_submit sqes[IO_IOPOLL_BATCH]; + struct io_ring_ctx *ctx = data; + struct mm_struct *cur_mm = NULL; + mm_segment_t old_fs; + DEFINE_WAIT(wait); + unsigned inflight; + unsigned long timeout; + + old_fs = get_fs(); + set_fs(USER_DS); + + timeout = inflight = 0; + while (!kthread_should_stop() && !ctx->sqo_stop) { + bool all_fixed, mm_fault = false; + int i; + + if (inflight) { + unsigned nr_events = 0; + + if (ctx->flags & IORING_SETUP_IOPOLL) { + /* + * We disallow the app entering submit/complete + * with polling, but we still need to lock the + * ring to prevent racing with polled issue + * that got punted to a workqueue. + */ + mutex_lock(&ctx->uring_lock); + io_iopoll_check(ctx, &nr_events, 0); + mutex_unlock(&ctx->uring_lock); + } else { + /* + * Normal IO, just pretend everything completed. + * We don't have to poll completions for that. + */ + nr_events = inflight; + } + + inflight -= nr_events; + if (!inflight) + timeout = jiffies + ctx->sq_thread_idle; + } + + if (!io_get_sqring(ctx, &sqes[0])) { + /* + * We're polling. If we're within the defined idle + * period, then let us spin without work before going + * to sleep. + */ + if (inflight || !time_after(jiffies, timeout)) { + cpu_relax(); + continue; + } + + /* + * Drop cur_mm before scheduling, we can't hold it for + * long periods (or over schedule()). Do this before + * adding ourselves to the waitqueue, as the unuse/drop + * may sleep. 
+ */ + if (cur_mm) { + unuse_mm(cur_mm); + mmput(cur_mm); + cur_mm = NULL; + } + + prepare_to_wait(&ctx->sqo_wait, &wait, + TASK_INTERRUPTIBLE); + + /* Tell userspace we may need a wakeup call */ + ctx->sq_ring->flags |= IORING_SQ_NEED_WAKEUP; + smp_wmb(); + + if (!io_get_sqring(ctx, &sqes[0])) { + if (kthread_should_stop()) { + finish_wait(&ctx->sqo_wait, &wait); + break; + } + if (signal_pending(current)) + flush_signals(current); + schedule(); + finish_wait(&ctx->sqo_wait, &wait); + + ctx->sq_ring->flags &= ~IORING_SQ_NEED_WAKEUP; + smp_wmb(); + continue; + } + finish_wait(&ctx->sqo_wait, &wait); + + ctx->sq_ring->flags &= ~IORING_SQ_NEED_WAKEUP; + smp_wmb(); + } + + i = 0; + all_fixed = true; + do { + if (all_fixed && io_sqe_needs_user(sqes[i].sqe)) + all_fixed = false; + + i++; + if (i == ARRAY_SIZE(sqes)) + break; + } while (io_get_sqring(ctx, &sqes[i])); + + /* Unless all new commands are FIXED regions, grab mm */ + if (!all_fixed && !cur_mm) { + mm_fault = !mmget_not_zero(ctx->sqo_mm); + if (!mm_fault) { + use_mm(ctx->sqo_mm); + cur_mm = ctx->sqo_mm; + } + } + + inflight += io_submit_sqes(ctx, sqes, i, cur_mm != NULL, + mm_fault); + + /* Commit SQ ring head once we've consumed all SQEs */ + io_commit_sqring(ctx); + } + + set_fs(old_fs); + if (cur_mm) { + unuse_mm(cur_mm); + mmput(cur_mm); + } + return 0; +} + static int io_ring_submit(struct io_ring_ctx *ctx, unsigned int to_submit) { struct io_submit_state state, *statep = NULL; @@ -1319,6 +1493,7 @@ static int io_ring_submit(struct io_ring_ctx *ctx, unsigned int to_submit)
s.has_user = true; s.needs_lock = false; + s.needs_fixed_file = false;
ret = io_submit_sqe(ctx, &s, statep); if (ret) { @@ -1418,8 +1593,20 @@ static int io_sqe_files_unregister(struct io_ring_ctx *ctx) return 0; }
+static void io_sq_thread_stop(struct io_ring_ctx *ctx) +{ + if (ctx->sqo_thread) { + ctx->sqo_stop = 1; + mb(); + kthread_stop(ctx->sqo_thread); + ctx->sqo_thread = NULL; + } +} + static void io_finish_async(struct io_ring_ctx *ctx) { + io_sq_thread_stop(ctx); + if (ctx->sqo_wq) { destroy_workqueue(ctx->sqo_wq); ctx->sqo_wq = NULL; @@ -1583,13 +1770,47 @@ static int io_sqe_files_register(struct io_ring_ctx *ctx, void __user *arg, return ret; }
-static int io_sq_offload_start(struct io_ring_ctx *ctx) +static int io_sq_offload_start(struct io_ring_ctx *ctx, + struct io_uring_params *p) { int ret;
+ init_waitqueue_head(&ctx->sqo_wait); mmgrab(current->mm); ctx->sqo_mm = current->mm;
+ ctx->sq_thread_idle = msecs_to_jiffies(p->sq_thread_idle); + if (!ctx->sq_thread_idle) + ctx->sq_thread_idle = HZ; + + ret = -EINVAL; + if (!cpu_possible(p->sq_thread_cpu)) + goto err; + + if (ctx->flags & IORING_SETUP_SQPOLL) { + if (p->flags & IORING_SETUP_SQ_AFF) { + int cpu; + + cpu = array_index_nospec(p->sq_thread_cpu, NR_CPUS); + ctx->sqo_thread = kthread_create_on_cpu(io_sq_thread, + ctx, cpu, + "io_uring-sq"); + } else { + ctx->sqo_thread = kthread_create(io_sq_thread, ctx, + "io_uring-sq"); + } + if (IS_ERR(ctx->sqo_thread)) { + ret = PTR_ERR(ctx->sqo_thread); + ctx->sqo_thread = NULL; + goto err; + } + wake_up_process(ctx->sqo_thread); + } else if (p->flags & IORING_SETUP_SQ_AFF) { + /* Can't have SQ_AFF without SQPOLL */ + ret = -EINVAL; + goto err; + } + /* Do QD, or 2 * CPUS, whatever is smallest */ ctx->sqo_wq = alloc_workqueue("io_ring-wq", WQ_UNBOUND | WQ_FREEZABLE, min(ctx->sq_entries - 1, 2 * num_online_cpus())); @@ -1600,6 +1821,7 @@ static int io_sq_offload_start(struct io_ring_ctx *ctx)
return 0; err: + io_sq_thread_stop(ctx); mmdrop(ctx->sqo_mm); ctx->sqo_mm = NULL; return ret; @@ -1959,7 +2181,7 @@ SYSCALL_DEFINE6(io_uring_enter, unsigned int, fd, u32, to_submit, int submitted = 0; struct fd f;
- if (flags & ~IORING_ENTER_GETEVENTS) + if (flags & ~(IORING_ENTER_GETEVENTS | IORING_ENTER_SQ_WAKEUP)) return -EINVAL;
f = fdget(fd); @@ -1975,6 +2197,18 @@ SYSCALL_DEFINE6(io_uring_enter, unsigned int, fd, u32, to_submit, if (!percpu_ref_tryget(&ctx->refs)) goto out_fput;
+ /* + * For SQ polling, the thread will do all submissions and completions. + * Just return the requested submit count, and wake the thread if + * we were asked to. + */ + if (ctx->flags & IORING_SETUP_SQPOLL) { + if (flags & IORING_ENTER_SQ_WAKEUP) + wake_up(&ctx->sqo_wait); + submitted = to_submit; + goto out_ctx; + } + ret = 0; if (to_submit) { to_submit = min(to_submit, ctx->sq_entries); @@ -2156,7 +2390,7 @@ static int io_uring_create(unsigned entries, struct io_uring_params *p) if (ret) goto err;
- ret = io_sq_offload_start(ctx); + ret = io_sq_offload_start(ctx, p); if (ret) goto err;
@@ -2204,7 +2438,8 @@ static long io_uring_setup(u32 entries, struct io_uring_params __user *params) return -EINVAL; }
- if (p.flags & ~IORING_SETUP_IOPOLL) + if (p.flags & ~(IORING_SETUP_IOPOLL | IORING_SETUP_SQPOLL | + IORING_SETUP_SQ_AFF)) return -EINVAL;
ret = io_uring_create(entries, &p); diff --git a/include/uapi/linux/io_uring.h b/include/uapi/linux/io_uring.h index 6257478d55e9..0ec74bab8dbe 100644 --- a/include/uapi/linux/io_uring.h +++ b/include/uapi/linux/io_uring.h @@ -42,6 +42,8 @@ struct io_uring_sqe { * io_uring_setup() flags */ #define IORING_SETUP_IOPOLL (1U << 0) /* io_context is polled */ +#define IORING_SETUP_SQPOLL (1U << 1) /* SQ poll thread */ +#define IORING_SETUP_SQ_AFF (1U << 2) /* sq_thread_cpu is valid */
#define IORING_OP_NOP 0 #define IORING_OP_READV 1 @@ -86,6 +88,11 @@ struct io_sqring_offsets { __u64 resv2; };
+/* + * sq_ring->flags + */ +#define IORING_SQ_NEED_WAKEUP (1U << 0) /* needs io_uring_enter wakeup */ + struct io_cqring_offsets { __u32 head; __u32 tail; @@ -100,6 +107,7 @@ struct io_cqring_offsets { * io_uring_enter(2) flags */ #define IORING_ENTER_GETEVENTS (1U << 0) +#define IORING_ENTER_SQ_WAKEUP (1U << 1)
/* * Passed in for io_uring_setup(2). Copied back with updated info on success @@ -108,7 +116,9 @@ struct io_uring_params { __u32 sq_entries; __u32 cq_entries; __u32 flags; - __u32 resv[7]; + __u32 sq_thread_cpu; + __u32 sq_thread_idle; + __u32 resv[5]; struct io_sqring_offsets sq_off; struct io_cqring_offsets cq_off; };
From: Jens Axboe axboe@kernel.dk
mainline inclusion from mainline-5.1-rc1 commit c16361c1d805b6ea50c3c1fc5c314e944c71a984 category: feature bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27 CVE: NA ---------------------------
We'll use this for the POLL implementation. Regular requests will NOT be using references, so initialize it to 0. Any real use of the io_kiocb ref will initialize it to at least 2.
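In other words (illustration only, following the io_free_req() semantics in the diff below), an opted-in request takes two puts to actually free:

    refcount_set(&req->refs, 2);    /* e.g. one for the waitqueue, one
                                       for the submission path */
    ...
    io_free_req(req);               /* refs 2 -> 1, req stays alive */
    io_free_req(req);               /* refs 1 -> 0, req is freed */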
Reviewed-by: Hannes Reinecke hare@suse.com Reviewed-by: Christoph Hellwig hch@lst.de Signed-off-by: Jens Axboe axboe@kernel.dk Signed-off-by: Zhihao Cheng chengzhihao1@huawei.com Signed-off-by: yangerkun yangerkun@huawei.com Reviewed-by: zhangyi (F) yi.zhang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- fs/io_uring.c | 8 ++++++-- 1 file changed, 6 insertions(+), 2 deletions(-)
diff --git a/fs/io_uring.c b/fs/io_uring.c index 5e6197250484..cb6cc135e19b 100644 --- a/fs/io_uring.c +++ b/fs/io_uring.c @@ -184,6 +184,7 @@ struct io_kiocb { struct io_ring_ctx *ctx; struct list_head list; unsigned int flags; + refcount_t refs; #define REQ_F_FORCE_NONBLOCK 1 /* inline submission attempt */ #define REQ_F_IOPOLL_COMPLETED 2 /* polled IO has completed */ #define REQ_F_FIXED_FILE 4 /* ctx owns file */ @@ -377,6 +378,7 @@ static struct io_kiocb *io_get_req(struct io_ring_ctx *ctx,
req->ctx = ctx; req->flags = 0; + refcount_set(&req->refs, 0); return req; out: io_ring_drop_ctx_refs(ctx, 1); @@ -394,8 +396,10 @@ static void io_free_req_many(struct io_ring_ctx *ctx, void **reqs, int *nr)
static void io_free_req(struct io_kiocb *req) { - io_ring_drop_ctx_refs(req->ctx, 1); - kmem_cache_free(req_cachep, req); + if (!refcount_read(&req->refs) || refcount_dec_and_test(&req->refs)) { + io_ring_drop_ctx_refs(req->ctx, 1); + kmem_cache_free(req_cachep, req); + } }
/*
From: Jens Axboe axboe@kernel.dk
mainline inclusion from mainline-5.1-rc1 commit 221c5eb2338232f7340386de1c43decc32682e58 category: feature bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27 CVE: NA ---------------------------
This is basically a direct port of bfe4037e722e, which implements a one-shot poll command through aio. Description below is based on that commit as well. However, instead of adding a POLL command and relying on io_cancel(2) to remove it, we mimic the epoll(2) interface of having a command to add a poll notification, IORING_OP_POLL_ADD, and one to remove it again, IORING_OP_POLL_REMOVE.
To poll for a file descriptor the application should submit an sqe of type IORING_OP_POLL_ADD. It will poll the fd for the events specified in the poll_events field.
Unlike poll or epoll without EPOLLONESHOT, this interface always works in one-shot mode: once the sqe is completed, it has to be resubmitted.
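A hypothetical pair of sqes for this interface, mirroring the fields the patch consumes (fd and poll_events for the add; addr, matched against the original user_data, for the remove):

    /* one-shot poll for readability */
    sqe->opcode = IORING_OP_POLL_ADD;
    sqe->fd = fd;
    sqe->poll_events = POLLIN;
    sqe->user_data = 0xcafe;

    /* later: cancel it, matching on the original user_data */
    sqe->opcode = IORING_OP_POLL_REMOVE;
    sqe->addr = 0xcafe;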
Reviewed-by: Hannes Reinecke hare@suse.com Based-on-code-from: Christoph Hellwig hch@lst.de Signed-off-by: Jens Axboe axboe@kernel.dk Signed-off-by: Zhihao Cheng chengzhihao1@huawei.com Signed-off-by: yangerkun yangerkun@huawei.com Reviewed-by: zhangyi (F) yi.zhang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- fs/io_uring.c | 263 +++++++++++++++++++++++++++++++++- include/uapi/linux/io_uring.h | 3 + 2 files changed, 265 insertions(+), 1 deletion(-)
diff --git a/fs/io_uring.c b/fs/io_uring.c index cb6cc135e19b..9315843d0949 100644 --- a/fs/io_uring.c +++ b/fs/io_uring.c @@ -161,6 +161,7 @@ struct io_ring_ctx { * manipulate the list, hence no extra locking is needed there. */ struct list_head poll_list; + struct list_head cancel_list; } ____cacheline_aligned_in_smp;
#if defined(CONFIG_UNIX) @@ -176,8 +177,20 @@ struct sqe_submit { bool needs_fixed_file; };
+struct io_poll_iocb { + struct file *file; + struct wait_queue_head *head; + __poll_t events; + bool woken; + bool canceled; + struct wait_queue_entry wait; +}; + struct io_kiocb { - struct kiocb rw; + union { + struct kiocb rw; + struct io_poll_iocb poll; + };
struct sqe_submit submit;
@@ -261,6 +274,7 @@ static struct io_ring_ctx *io_ring_ctx_alloc(struct io_uring_params *p) init_waitqueue_head(&ctx->wait); spin_lock_init(&ctx->completion_lock); INIT_LIST_HEAD(&ctx->poll_list); + INIT_LIST_HEAD(&ctx->cancel_list); return ctx; }
@@ -1058,6 +1072,246 @@ static int io_fsync(struct io_kiocb *req, const struct io_uring_sqe *sqe, return 0; }
+static void io_poll_remove_one(struct io_kiocb *req) +{ + struct io_poll_iocb *poll = &req->poll; + + spin_lock(&poll->head->lock); + WRITE_ONCE(poll->canceled, true); + if (!list_empty(&poll->wait.entry)) { + list_del_init(&poll->wait.entry); + queue_work(req->ctx->sqo_wq, &req->work); + } + spin_unlock(&poll->head->lock); + + list_del_init(&req->list); +} + +static void io_poll_remove_all(struct io_ring_ctx *ctx) +{ + struct io_kiocb *req; + + spin_lock_irq(&ctx->completion_lock); + while (!list_empty(&ctx->cancel_list)) { + req = list_first_entry(&ctx->cancel_list, struct io_kiocb,list); + io_poll_remove_one(req); + } + spin_unlock_irq(&ctx->completion_lock); +} + +/* + * Find a running poll command that matches one specified in sqe->addr, + * and remove it if found. + */ +static int io_poll_remove(struct io_kiocb *req, const struct io_uring_sqe *sqe) +{ + struct io_ring_ctx *ctx = req->ctx; + struct io_kiocb *poll_req, *next; + int ret = -ENOENT; + + if (unlikely(req->ctx->flags & IORING_SETUP_IOPOLL)) + return -EINVAL; + if (sqe->ioprio || sqe->off || sqe->len || sqe->buf_index || + sqe->poll_events) + return -EINVAL; + + spin_lock_irq(&ctx->completion_lock); + list_for_each_entry_safe(poll_req, next, &ctx->cancel_list, list) { + if (READ_ONCE(sqe->addr) == poll_req->user_data) { + io_poll_remove_one(poll_req); + ret = 0; + break; + } + } + spin_unlock_irq(&ctx->completion_lock); + + io_cqring_add_event(req->ctx, sqe->user_data, ret, 0); + io_free_req(req); + return 0; +} + +static void io_poll_complete(struct io_kiocb *req, __poll_t mask) +{ + io_cqring_add_event(req->ctx, req->user_data, mangle_poll(mask), 0); + io_fput(req); + io_free_req(req); +} + +static void io_poll_complete_work(struct work_struct *work) +{ + struct io_kiocb *req = container_of(work, struct io_kiocb, work); + struct io_poll_iocb *poll = &req->poll; + struct poll_table_struct pt = { ._key = poll->events }; + struct io_ring_ctx *ctx = req->ctx; + __poll_t mask = 0; + + if (!READ_ONCE(poll->canceled)) + mask = vfs_poll(poll->file, &pt) & poll->events; + + /* + * Note that ->ki_cancel callers also delete iocb from active_reqs after + * calling ->ki_cancel. We need the ctx_lock roundtrip here to + * synchronize with them. In the cancellation case the list_del_init + * itself is not actually needed, but harmless so we keep it in to + * avoid further branches in the fast path. 
+ */ + spin_lock_irq(&ctx->completion_lock); + if (!mask && !READ_ONCE(poll->canceled)) { + add_wait_queue(poll->head, &poll->wait); + spin_unlock_irq(&ctx->completion_lock); + return; + } + list_del_init(&req->list); + spin_unlock_irq(&ctx->completion_lock); + + io_poll_complete(req, mask); +} + +static int io_poll_wake(struct wait_queue_entry *wait, unsigned mode, int sync, + void *key) +{ + struct io_poll_iocb *poll = container_of(wait, struct io_poll_iocb, + wait); + struct io_kiocb *req = container_of(poll, struct io_kiocb, poll); + struct io_ring_ctx *ctx = req->ctx; + __poll_t mask = key_to_poll(key); + + poll->woken = true; + + /* for instances that support it check for an event match first: */ + if (mask) { + unsigned long flags; + + if (!(mask & poll->events)) + return 0; + + /* try to complete the iocb inline if we can: */ + if (spin_trylock_irqsave(&ctx->completion_lock, flags)) { + list_del(&req->list); + spin_unlock_irqrestore(&ctx->completion_lock, flags); + + list_del_init(&poll->wait.entry); + io_poll_complete(req, mask); + return 1; + } + } + + list_del_init(&poll->wait.entry); + queue_work(ctx->sqo_wq, &req->work); + return 1; +} + +struct io_poll_table { + struct poll_table_struct pt; + struct io_kiocb *req; + int error; +}; + +static void io_poll_queue_proc(struct file *file, struct wait_queue_head *head, + struct poll_table_struct *p) +{ + struct io_poll_table *pt = container_of(p, struct io_poll_table, pt); + + if (unlikely(pt->req->poll.head)) { + pt->error = -EINVAL; + return; + } + + pt->error = 0; + pt->req->poll.head = head; + add_wait_queue(head, &pt->req->poll.wait); +} + +static int io_poll_add(struct io_kiocb *req, const struct io_uring_sqe *sqe) +{ + struct io_poll_iocb *poll = &req->poll; + struct io_ring_ctx *ctx = req->ctx; + struct io_poll_table ipt; + unsigned flags; + __poll_t mask; + u16 events; + int fd; + + if (unlikely(req->ctx->flags & IORING_SETUP_IOPOLL)) + return -EINVAL; + if (sqe->addr || sqe->ioprio || sqe->off || sqe->len || sqe->buf_index) + return -EINVAL; + + INIT_WORK(&req->work, io_poll_complete_work); + events = READ_ONCE(sqe->poll_events); + poll->events = demangle_poll(events) | EPOLLERR | EPOLLHUP; + + flags = READ_ONCE(sqe->flags); + fd = READ_ONCE(sqe->fd); + + if (flags & IOSQE_FIXED_FILE) { + if (unlikely(!ctx->user_files || fd >= ctx->nr_user_files)) + return -EBADF; + poll->file = ctx->user_files[fd]; + req->flags |= REQ_F_FIXED_FILE; + } else { + poll->file = fget(fd); + } + if (unlikely(!poll->file)) + return -EBADF; + + poll->head = NULL; + poll->woken = false; + poll->canceled = false; + + ipt.pt._qproc = io_poll_queue_proc; + ipt.pt._key = poll->events; + ipt.req = req; + ipt.error = -EINVAL; /* same as no support for IOCB_CMD_POLL */ + + /* initialized the list so that we can do list_empty checks */ + INIT_LIST_HEAD(&poll->wait.entry); + init_waitqueue_func_entry(&poll->wait, io_poll_wake); + + /* one for removal from waitqueue, one for this function */ + refcount_set(&req->refs, 2); + + mask = vfs_poll(poll->file, &ipt.pt) & poll->events; + if (unlikely(!poll->head)) { + /* we did not manage to set up a waitqueue, done */ + goto out; + } + + spin_lock_irq(&ctx->completion_lock); + spin_lock(&poll->head->lock); + if (poll->woken) { + /* wake_up context handles the rest */ + mask = 0; + ipt.error = 0; + } else if (mask || ipt.error) { + /* if we get an error or a mask we are done */ + WARN_ON_ONCE(list_empty(&poll->wait.entry)); + list_del_init(&poll->wait.entry); + } else { + /* actually waiting for an event */ + 
list_add_tail(&req->list, &ctx->cancel_list); + } + spin_unlock(&poll->head->lock); + spin_unlock_irq(&ctx->completion_lock); + +out: + if (unlikely(ipt.error)) { + if (!(flags & IOSQE_FIXED_FILE)) + fput(poll->file); + /* + * Drop one of our refs to this req, __io_submit_sqe() will + * drop the other one since we're returning an error. + */ + io_free_req(req); + return ipt.error; + } + + if (mask) + io_poll_complete(req, mask); + io_free_req(req); + return 0; +} + static int __io_submit_sqe(struct io_ring_ctx *ctx, struct io_kiocb *req, const struct sqe_submit *s, bool force_nonblock, struct io_submit_state *state) @@ -1093,6 +1347,12 @@ static int __io_submit_sqe(struct io_ring_ctx *ctx, struct io_kiocb *req, case IORING_OP_FSYNC: ret = io_fsync(req, s->sqe, force_nonblock); break; + case IORING_OP_POLL_ADD: + ret = io_poll_add(req, s->sqe); + break; + case IORING_OP_POLL_REMOVE: + ret = io_poll_remove(req, s->sqe); + break; default: ret = -EINVAL; break; @@ -2131,6 +2391,7 @@ static void io_ring_ctx_wait_and_kill(struct io_ring_ctx *ctx) percpu_ref_kill(&ctx->refs); mutex_unlock(&ctx->uring_lock);
+ io_poll_remove_all(ctx); io_iopoll_reap_events(ctx); wait_for_completion(&ctx->ctx_done); io_ring_ctx_free(ctx); diff --git a/include/uapi/linux/io_uring.h b/include/uapi/linux/io_uring.h index 0ec74bab8dbe..e23408692118 100644 --- a/include/uapi/linux/io_uring.h +++ b/include/uapi/linux/io_uring.h @@ -25,6 +25,7 @@ struct io_uring_sqe { union { __kernel_rwf_t rw_flags; __u32 fsync_flags; + __u16 poll_events; }; __u64 user_data; /* data to be passed back at completion time */ union { @@ -51,6 +52,8 @@ struct io_uring_sqe { #define IORING_OP_FSYNC 3 #define IORING_OP_READ_FIXED 4 #define IORING_OP_WRITE_FIXED 5 +#define IORING_OP_POLL_ADD 6 +#define IORING_OP_POLL_REMOVE 7
/* * sqe->fsync_flags
From: Jens Axboe axboe@kernel.dk
mainline inclusion from mainline-5.1-rc1 commit 31b515106428b9717d2b6475b6f6182cf231b1e6 category: feature bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27 CVE: NA ---------------------------
Right now we punt any buffered request that ends up triggering an -EAGAIN to an async workqueue. This works fine in terms of providing async execution of them, but it also can create quite a lot of work queue items. For sequentially buffered IO, it's advantageous to serialize the issue of them. For reads, the first one will trigger a read-ahead, and subsequent requests merely end up waiting on later pages to complete. For writes, devices usually respond better to streamed sequential writes.
Add state to track the last buffered request we punted to a work queue, and if the next one is sequential to the previous, attempt to get the previous work item to handle it. We limit the number of sequential add-ons to a multiple (8x) of the max read-ahead size of the file. This should be a good number for both reads and writes, as it defines the max IO size the device can do directly.
This drastically cuts down on the number of context switches we need to handle buffered sequential IO, and a basic test case of copying a big file with io_uring sees a 5x speedup.
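To make the 8x limit concrete (an editorial aside, assuming 4KB pages and the common 128KB default read-ahead window, i.e. f_ra.ra_pages == 32):

    max_pages = f_ra.ra_pages * 8;  /* 32 pages * 8 = 256 pages */
    /* so up to ~1MB of sequential IO can piggy-back on one work item
       before io_async_list_note() resets the state */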
Reviewed-by: Hannes Reinecke hare@suse.com Signed-off-by: Jens Axboe axboe@kernel.dk Signed-off-by: Zhihao Cheng chengzhihao1@huawei.com Signed-off-by: yangerkun yangerkun@huawei.com Reviewed-by: zhangyi (F) yi.zhang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- fs/io_uring.c | 281 ++++++++++++++++++++++++++++++++++++++++---------- 1 file changed, 229 insertions(+), 52 deletions(-)
diff --git a/fs/io_uring.c b/fs/io_uring.c index 9315843d0949..5dde033ed5a2 100644 --- a/fs/io_uring.c +++ b/fs/io_uring.c @@ -94,6 +94,16 @@ struct io_mapped_ubuf { unsigned int nr_bvecs; };
+struct async_list { + spinlock_t lock; + atomic_t cnt; + struct list_head list; + + struct file *file; + off_t io_end; + size_t io_pages; +}; + struct io_ring_ctx { struct { struct percpu_ref refs; @@ -164,6 +174,8 @@ struct io_ring_ctx { struct list_head cancel_list; } ____cacheline_aligned_in_smp;
+ struct async_list pending_async[2]; + #if defined(CONFIG_UNIX) struct socket *ring_sock; #endif @@ -201,6 +213,7 @@ struct io_kiocb { #define REQ_F_FORCE_NONBLOCK 1 /* inline submission attempt */ #define REQ_F_IOPOLL_COMPLETED 2 /* polled IO has completed */ #define REQ_F_FIXED_FILE 4 /* ctx owns file */ +#define REQ_F_SEQ_PREV 8 /* sequential with previous */ u64 user_data; u64 error;
@@ -257,6 +270,7 @@ static void io_ring_ctx_ref_free(struct percpu_ref *ref) static struct io_ring_ctx *io_ring_ctx_alloc(struct io_uring_params *p) { struct io_ring_ctx *ctx; + int i;
ctx = kzalloc(sizeof(*ctx), GFP_KERNEL); if (!ctx) @@ -272,6 +286,11 @@ static struct io_ring_ctx *io_ring_ctx_alloc(struct io_uring_params *p) init_completion(&ctx->ctx_done); mutex_init(&ctx->uring_lock); init_waitqueue_head(&ctx->wait); + for (i = 0; i < ARRAY_SIZE(ctx->pending_async); i++) { + spin_lock_init(&ctx->pending_async[i].lock); + INIT_LIST_HEAD(&ctx->pending_async[i].list); + atomic_set(&ctx->pending_async[i].cnt, 0); + } spin_lock_init(&ctx->completion_lock); INIT_LIST_HEAD(&ctx->poll_list); INIT_LIST_HEAD(&ctx->cancel_list); @@ -885,6 +904,47 @@ static int io_import_iovec(struct io_ring_ctx *ctx, int rw, return import_iovec(rw, buf, sqe_len, UIO_FASTIOV, iovec, iter); }
+/* + * Make a note of the last file/offset/direction we punted to async + * context. We'll use this information to see if we can piggy back a + * sequential request onto the previous one, if it's still hasn't been + * completed by the async worker. + */ +static void io_async_list_note(int rw, struct io_kiocb *req, size_t len) +{ + struct async_list *async_list = &req->ctx->pending_async[rw]; + struct kiocb *kiocb = &req->rw; + struct file *filp = kiocb->ki_filp; + off_t io_end = kiocb->ki_pos + len; + + if (filp == async_list->file && kiocb->ki_pos == async_list->io_end) { + unsigned long max_pages; + + /* Use 8x RA size as a decent limiter for both reads/writes */ + max_pages = filp->f_ra.ra_pages; + if (!max_pages) + max_pages = VM_MAX_READAHEAD >> (PAGE_SHIFT - 10); + max_pages *= 8; + + /* If max pages are exceeded, reset the state */ + len >>= PAGE_SHIFT; + if (async_list->io_pages + len <= max_pages) { + req->flags |= REQ_F_SEQ_PREV; + async_list->io_pages += len; + } else { + io_end = 0; + async_list->io_pages = 0; + } + } + + /* New file? Reset state. */ + if (async_list->file != filp) { + async_list->io_pages = 0; + async_list->file = filp; + } + async_list->io_end = io_end; +} + static ssize_t io_read(struct io_kiocb *req, const struct sqe_submit *s, bool force_nonblock, struct io_submit_state *state) { @@ -892,6 +952,7 @@ static ssize_t io_read(struct io_kiocb *req, const struct sqe_submit *s, struct kiocb *kiocb = &req->rw; struct iov_iter iter; struct file *file; + size_t iov_count; ssize_t ret;
ret = io_prep_rw(req, s, force_nonblock, state); @@ -910,16 +971,24 @@ static ssize_t io_read(struct io_kiocb *req, const struct sqe_submit *s, if (ret) goto out_fput;
- ret = rw_verify_area(READ, file, &kiocb->ki_pos, iov_iter_count(&iter)); + iov_count = iov_iter_count(&iter); + ret = rw_verify_area(READ, file, &kiocb->ki_pos, iov_count); if (!ret) { ssize_t ret2;
/* Catch -EAGAIN return for forced non-blocking submission */ ret2 = call_read_iter(file, kiocb, &iter); - if (!force_nonblock || ret2 != -EAGAIN) + if (!force_nonblock || ret2 != -EAGAIN) { io_rw_done(kiocb, ret2); - else + } else { + /* + * If ->needs_lock is true, we're already in async + * context. + */ + if (!s->needs_lock) + io_async_list_note(READ, req, iov_count); ret = -EAGAIN; + } } kfree(iovec); out_fput: @@ -936,14 +1005,12 @@ static ssize_t io_write(struct io_kiocb *req, const struct sqe_submit *s, struct kiocb *kiocb = &req->rw; struct iov_iter iter; struct file *file; + size_t iov_count; ssize_t ret;
ret = io_prep_rw(req, s, force_nonblock, state); if (ret) return ret; - /* Hold on to the file for -EAGAIN */ - if (force_nonblock && !(kiocb->ki_flags & IOCB_DIRECT)) - return -EAGAIN;
ret = -EBADF; file = kiocb->ki_filp; @@ -957,8 +1024,17 @@ static ssize_t io_write(struct io_kiocb *req, const struct sqe_submit *s, if (ret) goto out_fput;
- ret = rw_verify_area(WRITE, file, &kiocb->ki_pos, - iov_iter_count(&iter)); + iov_count = iov_iter_count(&iter); + + ret = -EAGAIN; + if (force_nonblock && !(kiocb->ki_flags & IOCB_DIRECT)) { + /* If ->needs_lock is true, we're already in async context. */ + if (!s->needs_lock) + io_async_list_note(WRITE, req, iov_count); + goto out_free; + } + + ret = rw_verify_area(WRITE, file, &kiocb->ki_pos, iov_count); if (!ret) { /* * Open-code file_start_write here to grab freeze protection, @@ -976,9 +1052,11 @@ static ssize_t io_write(struct io_kiocb *req, const struct sqe_submit *s, kiocb->ki_flags |= IOCB_WRITE; io_rw_done(kiocb, call_write_iter(file, kiocb, &iter)); } +out_free: kfree(iovec); out_fput: - if (unlikely(ret)) + /* Hold on to the file for -EAGAIN */ + if (unlikely(ret && ret != -EAGAIN)) io_fput(req); return ret; } @@ -1376,6 +1454,21 @@ static int __io_submit_sqe(struct io_ring_ctx *ctx, struct io_kiocb *req, return 0; }
+static struct async_list *io_async_list_from_sqe(struct io_ring_ctx *ctx, + const struct io_uring_sqe *sqe) +{ + switch (sqe->opcode) { + case IORING_OP_READV: + case IORING_OP_READ_FIXED: + return &ctx->pending_async[READ]; + case IORING_OP_WRITEV: + case IORING_OP_WRITE_FIXED: + return &ctx->pending_async[WRITE]; + default: + return NULL; + } +} + static inline bool io_sqe_needs_user(const struct io_uring_sqe *sqe) { u8 opcode = READ_ONCE(sqe->opcode); @@ -1387,61 +1480,138 @@ static inline bool io_sqe_needs_user(const struct io_uring_sqe *sqe) static void io_sq_wq_submit_work(struct work_struct *work) { struct io_kiocb *req = container_of(work, struct io_kiocb, work); - struct sqe_submit *s = &req->submit; - const struct io_uring_sqe *sqe = s->sqe; struct io_ring_ctx *ctx = req->ctx; + struct mm_struct *cur_mm = NULL; + struct async_list *async_list; + LIST_HEAD(req_list); mm_segment_t old_fs; - bool needs_user; int ret;
- /* Ensure we clear previously set forced non-block flag */ - req->flags &= ~REQ_F_FORCE_NONBLOCK; - req->rw.ki_flags &= ~IOCB_NOWAIT; + async_list = io_async_list_from_sqe(ctx, req->submit.sqe); +restart: + do { + struct sqe_submit *s = &req->submit; + const struct io_uring_sqe *sqe = s->sqe; + + /* Ensure we clear previously set forced non-block flag */ + req->flags &= ~REQ_F_FORCE_NONBLOCK; + req->rw.ki_flags &= ~IOCB_NOWAIT; + + ret = 0; + if (io_sqe_needs_user(sqe) && !cur_mm) { + if (!mmget_not_zero(ctx->sqo_mm)) { + ret = -EFAULT; + } else { + cur_mm = ctx->sqo_mm; + use_mm(cur_mm); + old_fs = get_fs(); + set_fs(USER_DS); + } + } + + if (!ret) { + s->has_user = cur_mm != NULL; + s->needs_lock = true; + do { + ret = __io_submit_sqe(ctx, req, s, false, NULL); + /* + * We can get EAGAIN for polled IO even though + * we're forcing a sync submission from here, + * since we can't wait for request slots on the + * block side. + */ + if (ret != -EAGAIN) + break; + cond_resched(); + } while (1); + } + if (ret) { + io_cqring_add_event(ctx, sqe->user_data, ret, 0); + io_free_req(req); + }
- s->needs_lock = true; - s->has_user = false; + /* async context always use a copy of the sqe */ + kfree(sqe); + + if (!async_list) + break; + if (!list_empty(&req_list)) { + req = list_first_entry(&req_list, struct io_kiocb, + list); + list_del(&req->list); + continue; + } + if (list_empty(&async_list->list)) + break; + + req = NULL; + spin_lock(&async_list->lock); + if (list_empty(&async_list->list)) { + spin_unlock(&async_list->lock); + break; + } + list_splice_init(&async_list->list, &req_list); + spin_unlock(&async_list->lock); + + req = list_first_entry(&req_list, struct io_kiocb, list); + list_del(&req->list); + } while (req);
/* - * If we're doing IO to fixed buffers, we don't need to get/set - * user context + * Rare case of racing with a submitter. If we find the count has + * dropped to zero AND we have pending work items, then restart + * the processing. This is a tiny race window. */ - needs_user = io_sqe_needs_user(s->sqe); - if (needs_user) { - if (!mmget_not_zero(ctx->sqo_mm)) { - ret = -EFAULT; - goto err; + if (async_list) { + ret = atomic_dec_return(&async_list->cnt); + while (!ret && !list_empty(&async_list->list)) { + spin_lock(&async_list->lock); + atomic_inc(&async_list->cnt); + list_splice_init(&async_list->list, &req_list); + spin_unlock(&async_list->lock); + + if (!list_empty(&req_list)) { + req = list_first_entry(&req_list, + struct io_kiocb, list); + list_del(&req->list); + goto restart; + } + ret = atomic_dec_return(&async_list->cnt); } - use_mm(ctx->sqo_mm); - old_fs = get_fs(); - set_fs(USER_DS); - s->has_user = true; }
- do { - ret = __io_submit_sqe(ctx, req, s, false, NULL); - /* - * We can get EAGAIN for polled IO even though we're forcing - * a sync submission from here, since we can't wait for - * request slots on the block side. - */ - if (ret != -EAGAIN) - break; - cond_resched(); - } while (1); - - if (needs_user) { + if (cur_mm) { set_fs(old_fs); - unuse_mm(ctx->sqo_mm); - mmput(ctx->sqo_mm); - } -err: - if (ret) { - io_cqring_add_event(ctx, sqe->user_data, ret, 0); - io_free_req(req); + unuse_mm(cur_mm); + mmput(cur_mm); } +}
- /* async context always use a copy of the sqe */ - kfree(sqe); +/* + * See if we can piggy back onto previously submitted work, that is still + * running. We currently only allow this if the new request is sequential + * to the previous one we punted. + */ +static bool io_add_to_prev_work(struct async_list *list, struct io_kiocb *req) +{ + bool ret = false; + + if (!list) + return false; + if (!(req->flags & REQ_F_SEQ_PREV)) + return false; + if (!atomic_read(&list->cnt)) + return false; + + ret = true; + spin_lock(&list->lock); + list_add_tail(&req->list, &list->list); + if (!atomic_read(&list->cnt)) { + list_del_init(&req->list); + ret = false; + } + spin_unlock(&list->lock); + return ret; }
static int io_submit_sqe(struct io_ring_ctx *ctx, struct sqe_submit *s, @@ -1466,12 +1636,19 @@ static int io_submit_sqe(struct io_ring_ctx *ctx, struct sqe_submit *s,
sqe_copy = kmalloc(sizeof(*sqe_copy), GFP_KERNEL); if (sqe_copy) { + struct async_list *list; + memcpy(sqe_copy, s->sqe, sizeof(*sqe_copy)); s->sqe = sqe_copy;
memcpy(&req->submit, s, sizeof(*s)); - INIT_WORK(&req->work, io_sq_wq_submit_work); - queue_work(ctx->sqo_wq, &req->work); + list = io_async_list_from_sqe(ctx, s->sqe); + if (!io_add_to_prev_work(list, req)) { + if (list) + atomic_inc(&list->cnt); + INIT_WORK(&req->work, io_sq_wq_submit_work); + queue_work(ctx->sqo_wq, &req->work); + } ret = 0; } }
From: Jens Axboe axboe@kernel.dk
mainline inclusion from mainline-5.1-rc1 commit 21b4aa5d20fd07207e73270cadffed5c63fb4343 category: feature bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27 CVE: NA ---------------------------
This adds two test programs in tools/io_uring/ that demonstrate both the raw io_uring API (and all of its features) through a small benchmark app, io_uring-bench, and the liburing-exposed API in a simplified cp(1) implementation, io_uring-cp.
Signed-off-by: Jens Axboe axboe@kernel.dk Signed-off-by: Zhihao Cheng chengzhihao1@huawei.com Signed-off-by: yangerkun yangerkun@huawei.com Reviewed-by: zhangyi (F) yi.zhang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- tools/io_uring/Makefile | 18 + tools/io_uring/README | 29 ++ tools/io_uring/barrier.h | 16 + tools/io_uring/io_uring-bench.c | 616 ++++++++++++++++++++++++++++++++ tools/io_uring/io_uring-cp.c | 251 +++++++++++++ tools/io_uring/liburing.h | 143 ++++++++ tools/io_uring/queue.c | 164 +++++++++ tools/io_uring/setup.c | 103 ++++++ tools/io_uring/syscall.c | 40 +++ 9 files changed, 1380 insertions(+) create mode 100644 tools/io_uring/Makefile create mode 100644 tools/io_uring/README create mode 100644 tools/io_uring/barrier.h create mode 100644 tools/io_uring/io_uring-bench.c create mode 100644 tools/io_uring/io_uring-cp.c create mode 100644 tools/io_uring/liburing.h create mode 100644 tools/io_uring/queue.c create mode 100644 tools/io_uring/setup.c create mode 100644 tools/io_uring/syscall.c
diff --git a/tools/io_uring/Makefile b/tools/io_uring/Makefile new file mode 100644 index 000000000000..f79522fc37b5 --- /dev/null +++ b/tools/io_uring/Makefile @@ -0,0 +1,18 @@ +# SPDX-License-Identifier: GPL-2.0 +# Makefile for io_uring test tools +CFLAGS += -Wall -Wextra -g -D_GNU_SOURCE +LDLIBS += -lpthread + +all: io_uring-cp io_uring-bench +%: %.c + $(CC) $(CFLAGS) -o $@ $^ + +io_uring-bench: syscall.o io_uring-bench.o + $(CC) $(CFLAGS) $(LDLIBS) -o $@ $^ + +io_uring-cp: setup.o syscall.o queue.o + +clean: + $(RM) io_uring-cp io_uring-bench *.o + +.PHONY: all clean diff --git a/tools/io_uring/README b/tools/io_uring/README new file mode 100644 index 000000000000..67fd70115cff --- /dev/null +++ b/tools/io_uring/README @@ -0,0 +1,29 @@ +This directory includes a few programs that demonstrate how to use io_uring +in an application. The examples are: + +io_uring-cp + A very basic io_uring implementation of cp(1). It takes two + arguments, copies the first argument to the second. This example + is part of liburing, and hence uses the simplified liburing API + for setting up an io_uring instance, submitting IO, completing IO, + etc. The support functions in queue.c and setup.c are straight + out of liburing. + +io_uring-bench + Benchmark program that does random reads on a number of files. This + app demonstrates the various features of io_uring, like fixed files, + fixed buffers, and polled IO. There are options in the program to + control which features to use. Arguments is the file (or files) that + io_uring-bench should operate on. This uses the raw io_uring + interface. + +liburing can be cloned with git here: + + git://git.kernel.dk/liburing + +and contains a number of unit tests as well for testing io_uring. It also +comes with man pages for the three system calls. + +Fio includes an io_uring engine, you can clone fio here: + + git://git.kernel.dk/fio diff --git a/tools/io_uring/barrier.h b/tools/io_uring/barrier.h new file mode 100644 index 000000000000..ef00f6722ba9 --- /dev/null +++ b/tools/io_uring/barrier.h @@ -0,0 +1,16 @@ +#ifndef LIBURING_BARRIER_H +#define LIBURING_BARRIER_H + +#if defined(__x86_64) || defined(__i386__) +#define read_barrier() __asm__ __volatile__("":::"memory") +#define write_barrier() __asm__ __volatile__("":::"memory") +#else +/* + * Add arch appropriate definitions. Be safe and use full barriers for + * archs we don't have support for. + */ +#define read_barrier() __sync_synchronize() +#define write_barrier() __sync_synchronize() +#endif + +#endif diff --git a/tools/io_uring/io_uring-bench.c b/tools/io_uring/io_uring-bench.c new file mode 100644 index 000000000000..512306a37531 --- /dev/null +++ b/tools/io_uring/io_uring-bench.c @@ -0,0 +1,616 @@ +// SPDX-License-Identifier: GPL-2.0 +/* + * Simple benchmark program that uses the various features of io_uring + * to provide fast random access to a device/file. It has various + * options that are control how we use io_uring, see the OPTIONS section + * below. This uses the raw io_uring interface. 
+ * + * Copyright (C) 2018-2019 Jens Axboe + */ +#include <stdio.h> +#include <errno.h> +#include <assert.h> +#include <stdlib.h> +#include <stddef.h> +#include <signal.h> +#include <inttypes.h> + +#include <sys/types.h> +#include <sys/stat.h> +#include <sys/ioctl.h> +#include <sys/syscall.h> +#include <sys/resource.h> +#include <sys/mman.h> +#include <sys/uio.h> +#include <linux/fs.h> +#include <fcntl.h> +#include <unistd.h> +#include <string.h> +#include <pthread.h> +#include <sched.h> + +#include "liburing.h" +#include "barrier.h" + +#ifndef IOCQE_FLAG_CACHEHIT +#define IOCQE_FLAG_CACHEHIT (1U << 0) +#endif + +#define min(a, b) ((a < b) ? (a) : (b)) + +struct io_sq_ring { + unsigned *head; + unsigned *tail; + unsigned *ring_mask; + unsigned *ring_entries; + unsigned *flags; + unsigned *array; +}; + +struct io_cq_ring { + unsigned *head; + unsigned *tail; + unsigned *ring_mask; + unsigned *ring_entries; + struct io_uring_cqe *cqes; +}; + +#define DEPTH 128 + +#define BATCH_SUBMIT 32 +#define BATCH_COMPLETE 32 + +#define BS 4096 + +#define MAX_FDS 16 + +static unsigned sq_ring_mask, cq_ring_mask; + +struct file { + unsigned long max_blocks; + unsigned pending_ios; + int real_fd; + int fixed_fd; +}; + +struct submitter { + pthread_t thread; + int ring_fd; + struct drand48_data rand; + struct io_sq_ring sq_ring; + struct io_uring_sqe *sqes; + struct iovec iovecs[DEPTH]; + struct io_cq_ring cq_ring; + int inflight; + unsigned long reaps; + unsigned long done; + unsigned long calls; + unsigned long cachehit, cachemiss; + volatile int finish; + + __s32 *fds; + + struct file files[MAX_FDS]; + unsigned nr_files; + unsigned cur_file; +}; + +static struct submitter submitters[1]; +static volatile int finish; + +/* + * OPTIONS: Set these to test the various features of io_uring. 
+ */ +static int polled = 1; /* use IO polling */ +static int fixedbufs = 1; /* use fixed user buffers */ +static int register_files = 1; /* use fixed files */ +static int buffered = 0; /* use buffered IO, not O_DIRECT */ +static int sq_thread_poll = 0; /* use kernel submission/poller thread */ +static int sq_thread_cpu = -1; /* pin above thread to this CPU */ +static int do_nop = 0; /* no-op SQ ring commands */ + +static int io_uring_register_buffers(struct submitter *s) +{ + if (do_nop) + return 0; + + return io_uring_register(s->ring_fd, IORING_REGISTER_BUFFERS, s->iovecs, + DEPTH); +} + +static int io_uring_register_files(struct submitter *s) +{ + unsigned i; + + if (do_nop) + return 0; + + s->fds = calloc(s->nr_files, sizeof(__s32)); + for (i = 0; i < s->nr_files; i++) { + s->fds[i] = s->files[i].real_fd; + s->files[i].fixed_fd = i; + } + + return io_uring_register(s->ring_fd, IORING_REGISTER_FILES, s->fds, + s->nr_files); +} + +static int gettid(void) +{ + return syscall(__NR_gettid); +} + +static unsigned file_depth(struct submitter *s) +{ + return (DEPTH + s->nr_files - 1) / s->nr_files; +} + +static void init_io(struct submitter *s, unsigned index) +{ + struct io_uring_sqe *sqe = &s->sqes[index]; + unsigned long offset; + struct file *f; + long r; + + if (do_nop) { + sqe->opcode = IORING_OP_NOP; + return; + } + + if (s->nr_files == 1) { + f = &s->files[0]; + } else { + f = &s->files[s->cur_file]; + if (f->pending_ios >= file_depth(s)) { + s->cur_file++; + if (s->cur_file == s->nr_files) + s->cur_file = 0; + f = &s->files[s->cur_file]; + } + } + f->pending_ios++; + + lrand48_r(&s->rand, &r); + offset = (r % (f->max_blocks - 1)) * BS; + + if (register_files) { + sqe->flags = IOSQE_FIXED_FILE; + sqe->fd = f->fixed_fd; + } else { + sqe->flags = 0; + sqe->fd = f->real_fd; + } + if (fixedbufs) { + sqe->opcode = IORING_OP_READ_FIXED; + sqe->addr = (unsigned long) s->iovecs[index].iov_base; + sqe->len = BS; + sqe->buf_index = index; + } else { + sqe->opcode = IORING_OP_READV; + sqe->addr = (unsigned long) &s->iovecs[index]; + sqe->len = 1; + sqe->buf_index = 0; + } + sqe->ioprio = 0; + sqe->off = offset; + sqe->user_data = (unsigned long) f; +} + +static int prep_more_ios(struct submitter *s, unsigned max_ios) +{ + struct io_sq_ring *ring = &s->sq_ring; + unsigned index, tail, next_tail, prepped = 0; + + next_tail = tail = *ring->tail; + do { + next_tail++; + read_barrier(); + if (next_tail == *ring->head) + break; + + index = tail & sq_ring_mask; + init_io(s, index); + ring->array[index] = index; + prepped++; + tail = next_tail; + } while (prepped < max_ios); + + if (*ring->tail != tail) { + /* order tail store with writes to sqes above */ + write_barrier(); + *ring->tail = tail; + write_barrier(); + } + return prepped; +} + +static int get_file_size(struct file *f) +{ + struct stat st; + + if (fstat(f->real_fd, &st) < 0) + return -1; + if (S_ISBLK(st.st_mode)) { + unsigned long long bytes; + + if (ioctl(f->real_fd, BLKGETSIZE64, &bytes) != 0) + return -1; + + f->max_blocks = bytes / BS; + return 0; + } else if (S_ISREG(st.st_mode)) { + f->max_blocks = st.st_size / BS; + return 0; + } + + return -1; +} + +static int reap_events(struct submitter *s) +{ + struct io_cq_ring *ring = &s->cq_ring; + struct io_uring_cqe *cqe; + unsigned head, reaped = 0; + + head = *ring->head; + do { + struct file *f; + + read_barrier(); + if (head == *ring->tail) + break; + cqe = &ring->cqes[head & cq_ring_mask]; + if (!do_nop) { + f = (struct file *) (uintptr_t) cqe->user_data; + f->pending_ios--; + if 
(cqe->res != BS) { + printf("io: unexpected ret=%d\n", cqe->res); + if (polled && cqe->res == -EOPNOTSUPP) + printf("Your filesystem doesn't support poll\n"); + return -1; + } + } + if (cqe->flags & IOCQE_FLAG_CACHEHIT) + s->cachehit++; + else + s->cachemiss++; + reaped++; + head++; + } while (1); + + s->inflight -= reaped; + *ring->head = head; + write_barrier(); + return reaped; +} + +static void *submitter_fn(void *data) +{ + struct submitter *s = data; + struct io_sq_ring *ring = &s->sq_ring; + int ret, prepped; + + printf("submitter=%d\n", gettid()); + + srand48_r(pthread_self(), &s->rand); + + prepped = 0; + do { + int to_wait, to_submit, this_reap, to_prep; + + if (!prepped && s->inflight < DEPTH) { + to_prep = min(DEPTH - s->inflight, BATCH_SUBMIT); + prepped = prep_more_ios(s, to_prep); + } + s->inflight += prepped; +submit_more: + to_submit = prepped; +submit: + if (to_submit && (s->inflight + to_submit <= DEPTH)) + to_wait = 0; + else + to_wait = min(s->inflight + to_submit, BATCH_COMPLETE); + + /* + * Only need to call io_uring_enter if we're not using SQ thread + * poll, or if IORING_SQ_NEED_WAKEUP is set. + */ + if (!sq_thread_poll || (*ring->flags & IORING_SQ_NEED_WAKEUP)) { + unsigned flags = 0; + + if (to_wait) + flags = IORING_ENTER_GETEVENTS; + if ((*ring->flags & IORING_SQ_NEED_WAKEUP)) + flags |= IORING_ENTER_SQ_WAKEUP; + ret = io_uring_enter(s->ring_fd, to_submit, to_wait, + flags, NULL); + s->calls++; + } + + /* + * For non SQ thread poll, we already got the events we needed + * through the io_uring_enter() above. For SQ thread poll, we + * need to loop here until we find enough events. + */ + this_reap = 0; + do { + int r; + r = reap_events(s); + if (r == -1) { + s->finish = 1; + break; + } else if (r > 0) + this_reap += r; + } while (sq_thread_poll && this_reap < to_wait); + s->reaps += this_reap; + + if (ret >= 0) { + if (!ret) { + to_submit = 0; + if (s->inflight) + goto submit; + continue; + } else if (ret < to_submit) { + int diff = to_submit - ret; + + s->done += ret; + prepped -= diff; + goto submit_more; + } + s->done += ret; + prepped = 0; + continue; + } else if (ret < 0) { + if (errno == EAGAIN) { + if (s->finish) + break; + if (this_reap) + goto submit; + to_submit = 0; + goto submit; + } + printf("io_submit: %s\n", strerror(errno)); + break; + } + } while (!s->finish); + + finish = 1; + return NULL; +} + +static void sig_int(int sig) +{ + printf("Exiting on signal %d\n", sig); + submitters[0].finish = 1; + finish = 1; +} + +static void arm_sig_int(void) +{ + struct sigaction act; + + memset(&act, 0, sizeof(act)); + act.sa_handler = sig_int; + act.sa_flags = SA_RESTART; + sigaction(SIGINT, &act, NULL); +} + +static int setup_ring(struct submitter *s) +{ + struct io_sq_ring *sring = &s->sq_ring; + struct io_cq_ring *cring = &s->cq_ring; + struct io_uring_params p; + int ret, fd; + void *ptr; + + memset(&p, 0, sizeof(p)); + + if (polled && !do_nop) + p.flags |= IORING_SETUP_IOPOLL; + if (sq_thread_poll) { + p.flags |= IORING_SETUP_SQPOLL; + if (sq_thread_cpu != -1) { + p.flags |= IORING_SETUP_SQ_AFF; + p.sq_thread_cpu = sq_thread_cpu; + } + } + + fd = io_uring_setup(DEPTH, &p); + if (fd < 0) { + perror("io_uring_setup"); + return 1; + } + s->ring_fd = fd; + + if (fixedbufs) { + ret = io_uring_register_buffers(s); + if (ret < 0) { + perror("io_uring_register_buffers"); + return 1; + } + } + + if (register_files) { + ret = io_uring_register_files(s); + if (ret < 0) { + perror("io_uring_register_files"); + return 1; + } + } + + ptr = mmap(0, p.sq_off.array + 
p.sq_entries * sizeof(__u32), + PROT_READ | PROT_WRITE, MAP_SHARED | MAP_POPULATE, fd, + IORING_OFF_SQ_RING); + printf("sq_ring ptr = 0x%p\n", ptr); + sring->head = ptr + p.sq_off.head; + sring->tail = ptr + p.sq_off.tail; + sring->ring_mask = ptr + p.sq_off.ring_mask; + sring->ring_entries = ptr + p.sq_off.ring_entries; + sring->flags = ptr + p.sq_off.flags; + sring->array = ptr + p.sq_off.array; + sq_ring_mask = *sring->ring_mask; + + s->sqes = mmap(0, p.sq_entries * sizeof(struct io_uring_sqe), + PROT_READ | PROT_WRITE, MAP_SHARED | MAP_POPULATE, fd, + IORING_OFF_SQES); + printf("sqes ptr = 0x%p\n", s->sqes); + + ptr = mmap(0, p.cq_off.cqes + p.cq_entries * sizeof(struct io_uring_cqe), + PROT_READ | PROT_WRITE, MAP_SHARED | MAP_POPULATE, fd, + IORING_OFF_CQ_RING); + printf("cq_ring ptr = 0x%p\n", ptr); + cring->head = ptr + p.cq_off.head; + cring->tail = ptr + p.cq_off.tail; + cring->ring_mask = ptr + p.cq_off.ring_mask; + cring->ring_entries = ptr + p.cq_off.ring_entries; + cring->cqes = ptr + p.cq_off.cqes; + cq_ring_mask = *cring->ring_mask; + return 0; +} + +static void file_depths(char *buf) +{ + struct submitter *s = &submitters[0]; + unsigned i; + char *p; + + buf[0] = '\0'; + p = buf; + for (i = 0; i < s->nr_files; i++) { + struct file *f = &s->files[i]; + + if (i + 1 == s->nr_files) + p += sprintf(p, "%d", f->pending_ios); + else + p += sprintf(p, "%d, ", f->pending_ios); + } +} + +int main(int argc, char *argv[]) +{ + struct submitter *s = &submitters[0]; + unsigned long done, calls, reap, cache_hit, cache_miss; + int err, i, flags, fd; + char *fdepths; + void *ret; + + if (!do_nop && argc < 2) { + printf("%s: filename\n", argv[0]); + return 1; + } + + flags = O_RDONLY | O_NOATIME; + if (!buffered) + flags |= O_DIRECT; + + i = 1; + while (!do_nop && i < argc) { + struct file *f; + + if (s->nr_files == MAX_FDS) { + printf("Max number of files (%d) reached\n", MAX_FDS); + break; + } + fd = open(argv[i], flags); + if (fd < 0) { + perror("open"); + return 1; + } + + f = &s->files[s->nr_files]; + f->real_fd = fd; + if (get_file_size(f)) { + printf("failed getting size of device/file\n"); + return 1; + } + if (f->max_blocks <= 1) { + printf("Zero file/device size?\n"); + return 1; + } + f->max_blocks--; + + printf("Added file %s\n", argv[i]); + s->nr_files++; + i++; + } + + if (fixedbufs) { + struct rlimit rlim; + + rlim.rlim_cur = RLIM_INFINITY; + rlim.rlim_max = RLIM_INFINITY; + if (setrlimit(RLIMIT_MEMLOCK, &rlim) < 0) { + perror("setrlimit"); + return 1; + } + } + + arm_sig_int(); + + for (i = 0; i < DEPTH; i++) { + void *buf; + + if (posix_memalign(&buf, BS, BS)) { + printf("failed alloc\n"); + return 1; + } + s->iovecs[i].iov_base = buf; + s->iovecs[i].iov_len = BS; + } + + err = setup_ring(s); + if (err) { + printf("ring setup failed: %s, %d\n", strerror(errno), err); + return 1; + } + printf("polled=%d, fixedbufs=%d, buffered=%d", polled, fixedbufs, buffered); + printf(" QD=%d, sq_ring=%d, cq_ring=%d\n", DEPTH, *s->sq_ring.ring_entries, *s->cq_ring.ring_entries); + + pthread_create(&s->thread, NULL, submitter_fn, s); + + fdepths = malloc(8 * s->nr_files); + cache_hit = cache_miss = reap = calls = done = 0; + do { + unsigned long this_done = 0; + unsigned long this_reap = 0; + unsigned long this_call = 0; + unsigned long this_cache_hit = 0; + unsigned long this_cache_miss = 0; + unsigned long rpc = 0, ipc = 0; + double hit = 0.0; + + sleep(1); + this_done += s->done; + this_call += s->calls; + this_reap += s->reaps; + this_cache_hit += s->cachehit; + this_cache_miss += 
s->cachemiss; + if (this_cache_hit && this_cache_miss) { + unsigned long hits, total; + + hits = this_cache_hit - cache_hit; + total = hits + this_cache_miss - cache_miss; + hit = (double) hits / (double) total; + hit *= 100.0; + } + if (this_call - calls) { + rpc = (this_done - done) / (this_call - calls); + ipc = (this_reap - reap) / (this_call - calls); + } else + rpc = ipc = -1; + file_depths(fdepths); + printf("IOPS=%lu, IOS/call=%ld/%ld, inflight=%u (%s), Cachehit=%0.2f%%\n", + this_done - done, rpc, ipc, s->inflight, + fdepths, hit); + done = this_done; + calls = this_call; + reap = this_reap; + cache_hit = s->cachehit; + cache_miss = s->cachemiss; + } while (!finish); + + pthread_join(s->thread, &ret); + close(s->ring_fd); + free(fdepths); + return 0; +} diff --git a/tools/io_uring/io_uring-cp.c b/tools/io_uring/io_uring-cp.c new file mode 100644 index 000000000000..633f65bb43a7 --- /dev/null +++ b/tools/io_uring/io_uring-cp.c @@ -0,0 +1,251 @@ +// SPDX-License-Identifier: GPL-2.0 +/* + * Simple test program that demonstrates a file copy through io_uring. This + * uses the API exposed by liburing. + * + * Copyright (C) 2018-2019 Jens Axboe + */ +#include <stdio.h> +#include <fcntl.h> +#include <string.h> +#include <stdlib.h> +#include <unistd.h> +#include <assert.h> +#include <errno.h> +#include <inttypes.h> +#include <sys/stat.h> +#include <sys/ioctl.h> + +#include "liburing.h" + +#define QD 64 +#define BS (32*1024) + +static int infd, outfd; + +struct io_data { + int read; + off_t first_offset, offset; + size_t first_len; + struct iovec iov; +}; + +static int setup_context(unsigned entries, struct io_uring *ring) +{ + int ret; + + ret = io_uring_queue_init(entries, ring, 0); + if (ret < 0) { + fprintf(stderr, "queue_init: %s\n", strerror(-ret)); + return -1; + } + + return 0; +} + +static int get_file_size(int fd, off_t *size) +{ + struct stat st; + + if (fstat(fd, &st) < 0) + return -1; + if (S_ISREG(st.st_mode)) { + *size = st.st_size; + return 0; + } else if (S_ISBLK(st.st_mode)) { + unsigned long long bytes; + + if (ioctl(fd, BLKGETSIZE64, &bytes) != 0) + return -1; + + *size = bytes; + return 0; + } + + return -1; +} + +static void queue_prepped(struct io_uring *ring, struct io_data *data) +{ + struct io_uring_sqe *sqe; + + sqe = io_uring_get_sqe(ring); + assert(sqe); + + if (data->read) + io_uring_prep_readv(sqe, infd, &data->iov, 1, data->offset); + else + io_uring_prep_writev(sqe, outfd, &data->iov, 1, data->offset); + + io_uring_sqe_set_data(sqe, data); +} + +static int queue_read(struct io_uring *ring, off_t size, off_t offset) +{ + struct io_uring_sqe *sqe; + struct io_data *data; + + sqe = io_uring_get_sqe(ring); + if (!sqe) + return 1; + + data = malloc(size + sizeof(*data)); + data->read = 1; + data->offset = data->first_offset = offset; + + data->iov.iov_base = data + 1; + data->iov.iov_len = size; + data->first_len = size; + + io_uring_prep_readv(sqe, infd, &data->iov, 1, offset); + io_uring_sqe_set_data(sqe, data); + return 0; +} + +static void queue_write(struct io_uring *ring, struct io_data *data) +{ + data->read = 0; + data->offset = data->first_offset; + + data->iov.iov_base = data + 1; + data->iov.iov_len = data->first_len; + + queue_prepped(ring, data); + io_uring_submit(ring); +} + +static int copy_file(struct io_uring *ring, off_t insize) +{ + unsigned long reads, writes; + struct io_uring_cqe *cqe; + off_t write_left, offset; + int ret; + + write_left = insize; + writes = reads = offset = 0; + + while (insize || write_left) { + unsigned long had_reads; 
+ int got_comp; + + /* + * Queue up as many reads as we can + */ + had_reads = reads; + while (insize) { + off_t this_size = insize; + + if (reads + writes >= QD) + break; + if (this_size > BS) + this_size = BS; + else if (!this_size) + break; + + if (queue_read(ring, this_size, offset)) + break; + + insize -= this_size; + offset += this_size; + reads++; + } + + if (had_reads != reads) { + ret = io_uring_submit(ring); + if (ret < 0) { + fprintf(stderr, "io_uring_submit: %s\n", strerror(-ret)); + break; + } + } + + /* + * Queue is full at this point. Find at least one completion. + */ + got_comp = 0; + while (write_left) { + struct io_data *data; + + if (!got_comp) { + ret = io_uring_wait_completion(ring, &cqe); + got_comp = 1; + } else + ret = io_uring_get_completion(ring, &cqe); + if (ret < 0) { + fprintf(stderr, "io_uring_get_completion: %s\n", + strerror(-ret)); + return 1; + } + if (!cqe) + break; + + data = (struct io_data *) (uintptr_t) cqe->user_data; + if (cqe->res < 0) { + if (cqe->res == -EAGAIN) { + queue_prepped(ring, data); + continue; + } + fprintf(stderr, "cqe failed: %s\n", + strerror(-cqe->res)); + return 1; + } else if ((size_t) cqe->res != data->iov.iov_len) { + /* Short read/write, adjust and requeue */ + data->iov.iov_base += cqe->res; + data->iov.iov_len -= cqe->res; + data->offset += cqe->res; + queue_prepped(ring, data); + continue; + } + + /* + * All done. if write, nothing else to do. if read, + * queue up corresponding write. + */ + if (data->read) { + queue_write(ring, data); + write_left -= data->first_len; + reads--; + writes++; + } else { + free(data); + writes--; + } + } + } + + return 0; +} + +int main(int argc, char *argv[]) +{ + struct io_uring ring; + off_t insize; + int ret; + + if (argc < 3) { + printf("%s: infile outfile\n", argv[0]); + return 1; + } + + infd = open(argv[1], O_RDONLY); + if (infd < 0) { + perror("open infile"); + return 1; + } + outfd = open(argv[2], O_WRONLY | O_CREAT | O_TRUNC, 0644); + if (outfd < 0) { + perror("open outfile"); + return 1; + } + + if (setup_context(QD, &ring)) + return 1; + if (get_file_size(infd, &insize)) + return 1; + + ret = copy_file(&ring, insize); + + close(infd); + close(outfd); + io_uring_queue_exit(&ring); + return ret; +} diff --git a/tools/io_uring/liburing.h b/tools/io_uring/liburing.h new file mode 100644 index 000000000000..cab0f50257ba --- /dev/null +++ b/tools/io_uring/liburing.h @@ -0,0 +1,143 @@ +#ifndef LIB_URING_H +#define LIB_URING_H + +#include <sys/uio.h> +#include <signal.h> +#include <string.h> +#include "../../include/uapi/linux/io_uring.h" + +/* + * Library interface to io_uring + */ +struct io_uring_sq { + unsigned *khead; + unsigned *ktail; + unsigned *kring_mask; + unsigned *kring_entries; + unsigned *kflags; + unsigned *kdropped; + unsigned *array; + struct io_uring_sqe *sqes; + + unsigned sqe_head; + unsigned sqe_tail; + + size_t ring_sz; +}; + +struct io_uring_cq { + unsigned *khead; + unsigned *ktail; + unsigned *kring_mask; + unsigned *kring_entries; + unsigned *koverflow; + struct io_uring_cqe *cqes; + + size_t ring_sz; +}; + +struct io_uring { + struct io_uring_sq sq; + struct io_uring_cq cq; + int ring_fd; +}; + +/* + * System calls + */ +extern int io_uring_setup(unsigned entries, struct io_uring_params *p); +extern int io_uring_enter(unsigned fd, unsigned to_submit, + unsigned min_complete, unsigned flags, sigset_t *sig); +extern int io_uring_register(int fd, unsigned int opcode, void *arg, + unsigned int nr_args); + +/* + * Library interface + */ +extern int 
io_uring_queue_init(unsigned entries, struct io_uring *ring, + unsigned flags); +extern int io_uring_queue_mmap(int fd, struct io_uring_params *p, + struct io_uring *ring); +extern void io_uring_queue_exit(struct io_uring *ring); +extern int io_uring_get_completion(struct io_uring *ring, + struct io_uring_cqe **cqe_ptr); +extern int io_uring_wait_completion(struct io_uring *ring, + struct io_uring_cqe **cqe_ptr); +extern int io_uring_submit(struct io_uring *ring); +extern struct io_uring_sqe *io_uring_get_sqe(struct io_uring *ring); + +/* + * Command prep helpers + */ +static inline void io_uring_sqe_set_data(struct io_uring_sqe *sqe, void *data) +{ + sqe->user_data = (unsigned long) data; +} + +static inline void io_uring_prep_rw(int op, struct io_uring_sqe *sqe, int fd, + void *addr, unsigned len, off_t offset) +{ + memset(sqe, 0, sizeof(*sqe)); + sqe->opcode = op; + sqe->fd = fd; + sqe->off = offset; + sqe->addr = (unsigned long) addr; + sqe->len = len; +} + +static inline void io_uring_prep_readv(struct io_uring_sqe *sqe, int fd, + struct iovec *iovecs, unsigned nr_vecs, + off_t offset) +{ + io_uring_prep_rw(IORING_OP_READV, sqe, fd, iovecs, nr_vecs, offset); +} + +static inline void io_uring_prep_read_fixed(struct io_uring_sqe *sqe, int fd, + void *buf, unsigned nbytes, + off_t offset) +{ + io_uring_prep_rw(IORING_OP_READ_FIXED, sqe, fd, buf, nbytes, offset); +} + +static inline void io_uring_prep_writev(struct io_uring_sqe *sqe, int fd, + struct iovec *iovecs, unsigned nr_vecs, + off_t offset) +{ + io_uring_prep_rw(IORING_OP_WRITEV, sqe, fd, iovecs, nr_vecs, offset); +} + +static inline void io_uring_prep_write_fixed(struct io_uring_sqe *sqe, int fd, + void *buf, unsigned nbytes, + off_t offset) +{ + io_uring_prep_rw(IORING_OP_WRITE_FIXED, sqe, fd, buf, nbytes, offset); +} + +static inline void io_uring_prep_poll_add(struct io_uring_sqe *sqe, int fd, + short poll_mask) +{ + memset(sqe, 0, sizeof(*sqe)); + sqe->opcode = IORING_OP_POLL_ADD; + sqe->fd = fd; + sqe->poll_events = poll_mask; +} + +static inline void io_uring_prep_poll_remove(struct io_uring_sqe *sqe, + void *user_data) +{ + memset(sqe, 0, sizeof(*sqe)); + sqe->opcode = IORING_OP_POLL_REMOVE; + sqe->addr = (unsigned long) user_data; +} + +static inline void io_uring_prep_fsync(struct io_uring_sqe *sqe, int fd, + int datasync) +{ + memset(sqe, 0, sizeof(*sqe)); + sqe->opcode = IORING_OP_FSYNC; + sqe->fd = fd; + if (datasync) + sqe->fsync_flags = IORING_FSYNC_DATASYNC; +} + +#endif diff --git a/tools/io_uring/queue.c b/tools/io_uring/queue.c new file mode 100644 index 000000000000..88505e873ad9 --- /dev/null +++ b/tools/io_uring/queue.c @@ -0,0 +1,164 @@ +#include <sys/types.h> +#include <sys/stat.h> +#include <sys/mman.h> +#include <unistd.h> +#include <errno.h> +#include <string.h> + +#include "liburing.h" +#include "barrier.h" + +static int __io_uring_get_completion(struct io_uring *ring, + struct io_uring_cqe **cqe_ptr, int wait) +{ + struct io_uring_cq *cq = &ring->cq; + const unsigned mask = *cq->kring_mask; + unsigned head; + int ret; + + *cqe_ptr = NULL; + head = *cq->khead; + do { + /* + * It's necessary to use a read_barrier() before reading + * the CQ tail, since the kernel updates it locklessly. The + * kernel has the matching store barrier for the update. The + * kernel also ensures that previous stores to CQEs are ordered + * with the tail update. 
+ */ + read_barrier(); + if (head != *cq->ktail) { + *cqe_ptr = &cq->cqes[head & mask]; + break; + } + if (!wait) + break; + ret = io_uring_enter(ring->ring_fd, 0, 1, + IORING_ENTER_GETEVENTS, NULL); + if (ret < 0) + return -errno; + } while (1); + + if (*cqe_ptr) { + *cq->khead = head + 1; + /* + * Ensure that the kernel sees our new head, the kernel has + * the matching read barrier. + */ + write_barrier(); + } + + return 0; +} + +/* + * Return an IO completion, if one is readily available + */ +int io_uring_get_completion(struct io_uring *ring, + struct io_uring_cqe **cqe_ptr) +{ + return __io_uring_get_completion(ring, cqe_ptr, 0); +} + +/* + * Return an IO completion, waiting for it if necessary + */ +int io_uring_wait_completion(struct io_uring *ring, + struct io_uring_cqe **cqe_ptr) +{ + return __io_uring_get_completion(ring, cqe_ptr, 1); +} + +/* + * Submit sqes acquired from io_uring_get_sqe() to the kernel. + * + * Returns number of sqes submitted + */ +int io_uring_submit(struct io_uring *ring) +{ + struct io_uring_sq *sq = &ring->sq; + const unsigned mask = *sq->kring_mask; + unsigned ktail, ktail_next, submitted; + int ret; + + /* + * If we have pending IO in the kring, submit it first. We need a + * read barrier here to match the kernels store barrier when updating + * the SQ head. + */ + read_barrier(); + if (*sq->khead != *sq->ktail) { + submitted = *sq->kring_entries; + goto submit; + } + + if (sq->sqe_head == sq->sqe_tail) + return 0; + + /* + * Fill in sqes that we have queued up, adding them to the kernel ring + */ + submitted = 0; + ktail = ktail_next = *sq->ktail; + while (sq->sqe_head < sq->sqe_tail) { + ktail_next++; + read_barrier(); + + sq->array[ktail & mask] = sq->sqe_head & mask; + ktail = ktail_next; + + sq->sqe_head++; + submitted++; + } + + if (!submitted) + return 0; + + if (*sq->ktail != ktail) { + /* + * First write barrier ensures that the SQE stores are updated + * with the tail update. This is needed so that the kernel + * will never see a tail update without the preceeding sQE + * stores being done. + */ + write_barrier(); + *sq->ktail = ktail; + /* + * The kernel has the matching read barrier for reading the + * SQ tail. + */ + write_barrier(); + } + +submit: + ret = io_uring_enter(ring->ring_fd, submitted, 0, + IORING_ENTER_GETEVENTS, NULL); + if (ret < 0) + return -errno; + + return 0; +} + +/* + * Return an sqe to fill. Application must later call io_uring_submit() + * when it's ready to tell the kernel about it. The caller may call this + * function multiple times before calling io_uring_submit(). + * + * Returns a vacant sqe, or NULL if we're full. 
+ */ +struct io_uring_sqe *io_uring_get_sqe(struct io_uring *ring) +{ + struct io_uring_sq *sq = &ring->sq; + unsigned next = sq->sqe_tail + 1; + struct io_uring_sqe *sqe; + + /* + * All sqes are used + */ + if (next - sq->sqe_head > *sq->kring_entries) + return NULL; + + sqe = &sq->sqes[sq->sqe_tail & *sq->kring_mask]; + sq->sqe_tail = next; + return sqe; +} diff --git a/tools/io_uring/setup.c b/tools/io_uring/setup.c new file mode 100644 index 000000000000..4da19a77132c --- /dev/null +++ b/tools/io_uring/setup.c @@ -0,0 +1,103 @@ +#include <sys/types.h> +#include <sys/stat.h> +#include <sys/mman.h> +#include <unistd.h> +#include <errno.h> +#include <string.h> + +#include "liburing.h" + +static int io_uring_mmap(int fd, struct io_uring_params *p, + struct io_uring_sq *sq, struct io_uring_cq *cq) +{ + size_t size; + void *ptr; + int ret; + + sq->ring_sz = p->sq_off.array + p->sq_entries * sizeof(unsigned); + ptr = mmap(0, sq->ring_sz, PROT_READ | PROT_WRITE, + MAP_SHARED | MAP_POPULATE, fd, IORING_OFF_SQ_RING); + if (ptr == MAP_FAILED) + return -errno; + sq->khead = ptr + p->sq_off.head; + sq->ktail = ptr + p->sq_off.tail; + sq->kring_mask = ptr + p->sq_off.ring_mask; + sq->kring_entries = ptr + p->sq_off.ring_entries; + sq->kflags = ptr + p->sq_off.flags; + sq->kdropped = ptr + p->sq_off.dropped; + sq->array = ptr + p->sq_off.array; + + size = p->sq_entries * sizeof(struct io_uring_sqe), + sq->sqes = mmap(0, size, PROT_READ | PROT_WRITE, + MAP_SHARED | MAP_POPULATE, fd, + IORING_OFF_SQES); + if (sq->sqes == MAP_FAILED) { + ret = -errno; +err: + munmap(sq->khead, sq->ring_sz); + return ret; + } + + cq->ring_sz = p->cq_off.cqes + p->cq_entries * sizeof(struct io_uring_cqe); + ptr = mmap(0, cq->ring_sz, PROT_READ | PROT_WRITE, + MAP_SHARED | MAP_POPULATE, fd, IORING_OFF_CQ_RING); + if (ptr == MAP_FAILED) { + ret = -errno; + munmap(sq->sqes, p->sq_entries * sizeof(struct io_uring_sqe)); + goto err; + } + cq->khead = ptr + p->cq_off.head; + cq->ktail = ptr + p->cq_off.tail; + cq->kring_mask = ptr + p->cq_off.ring_mask; + cq->kring_entries = ptr + p->cq_off.ring_entries; + cq->koverflow = ptr + p->cq_off.overflow; + cq->cqes = ptr + p->cq_off.cqes; + return 0; +} + +/* + * For users that want to specify sq_thread_cpu or sq_thread_idle, this + * interface is a convenient helper for mmap()ing the rings. + * Returns -1 on error, or zero on success. On success, 'ring' + * contains the necessary information to read/write to the rings. + */ +int io_uring_queue_mmap(int fd, struct io_uring_params *p, struct io_uring *ring) +{ + int ret; + + memset(ring, 0, sizeof(*ring)); + ret = io_uring_mmap(fd, p, &ring->sq, &ring->cq); + if (!ret) + ring->ring_fd = fd; + return ret; +} + +/* + * Returns -1 on error, or zero on success. On success, 'ring' + * contains the necessary information to read/write to the rings. 
+ */ +int io_uring_queue_init(unsigned entries, struct io_uring *ring, unsigned flags) +{ + struct io_uring_params p; + int fd; + + memset(&p, 0, sizeof(p)); + p.flags = flags; + + fd = io_uring_setup(entries, &p); + if (fd < 0) + return fd; + + return io_uring_queue_mmap(fd, &p, ring); +} + +void io_uring_queue_exit(struct io_uring *ring) +{ + struct io_uring_sq *sq = &ring->sq; + struct io_uring_cq *cq = &ring->cq; + + munmap(sq->sqes, *sq->kring_entries * sizeof(struct io_uring_sqe)); + munmap(sq->khead, sq->ring_sz); + munmap(cq->khead, cq->ring_sz); + close(ring->ring_fd); +} diff --git a/tools/io_uring/syscall.c b/tools/io_uring/syscall.c new file mode 100644 index 000000000000..6b835e5c6a5b --- /dev/null +++ b/tools/io_uring/syscall.c @@ -0,0 +1,40 @@ +/* + * Will go away once libc support is there + */ +#include <unistd.h> +#include <sys/syscall.h> +#include <sys/uio.h> +#include <signal.h> +#include "liburing.h" + +#if defined(__x86_64) || defined(__i386__) +#ifndef __NR_sys_io_uring_setup +#define __NR_sys_io_uring_setup 425 +#endif +#ifndef __NR_sys_io_uring_enter +#define __NR_sys_io_uring_enter 426 +#endif +#ifndef __NR_sys_io_uring_register +#define __NR_sys_io_uring_register 427 +#endif +#else +#error "Arch not supported yet" +#endif + +int io_uring_register(int fd, unsigned int opcode, void *arg, + unsigned int nr_args) +{ + return syscall(__NR_sys_io_uring_register, fd, opcode, arg, nr_args); +} + +int io_uring_setup(unsigned entries, struct io_uring_params *p) +{ + return syscall(__NR_sys_io_uring_setup, entries, p); +} + +int io_uring_enter(unsigned fd, unsigned to_submit, unsigned min_complete, + unsigned flags, sigset_t *sig) +{ + return syscall(__NR_sys_io_uring_enter, fd, to_submit, min_complete, + flags, sig, _NSIG / 8); +}
From: Jens Axboe axboe@kernel.dk
mainline inclusion from mainline-5.1-rc5 commit 704236672edacf353c362bab70c3d3eda7bb4a51 category: feature bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27 CVE: NA ---------------------------
The cache hit/miss accounting (IOCQE_FLAG_CACHEHIT) ended up not being included in the mainline version of io_uring, so drop it from the test app as well.
Signed-off-by: Jens Axboe axboe@kernel.dk Signed-off-by: Zhihao Cheng chengzhihao1@huawei.com Signed-off-by: yangerkun yangerkun@huawei.com Reviewed-by: zhangyi (F) yi.zhang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- tools/io_uring/io_uring-bench.c | 32 ++++---------------------------- 1 file changed, 4 insertions(+), 28 deletions(-)
diff --git a/tools/io_uring/io_uring-bench.c b/tools/io_uring/io_uring-bench.c index 512306a37531..0f257139b003 100644 --- a/tools/io_uring/io_uring-bench.c +++ b/tools/io_uring/io_uring-bench.c @@ -32,10 +32,6 @@ #include "liburing.h" #include "barrier.h"
-#ifndef IOCQE_FLAG_CACHEHIT -#define IOCQE_FLAG_CACHEHIT (1U << 0) -#endif - #define min(a, b) ((a < b) ? (a) : (b))
struct io_sq_ring { @@ -85,7 +81,6 @@ struct submitter { unsigned long reaps; unsigned long done; unsigned long calls; - unsigned long cachehit, cachemiss; volatile int finish;
__s32 *fds; @@ -270,10 +265,6 @@ static int reap_events(struct submitter *s) return -1; } } - if (cqe->flags & IOCQE_FLAG_CACHEHIT) - s->cachehit++; - else - s->cachemiss++; reaped++; head++; } while (1); @@ -489,7 +480,7 @@ static void file_depths(char *buf) int main(int argc, char *argv[]) { struct submitter *s = &submitters[0]; - unsigned long done, calls, reap, cache_hit, cache_miss; + unsigned long done, calls, reap; int err, i, flags, fd; char *fdepths; void *ret; @@ -569,44 +560,29 @@ int main(int argc, char *argv[]) pthread_create(&s->thread, NULL, submitter_fn, s);
fdepths = malloc(8 * s->nr_files); - cache_hit = cache_miss = reap = calls = done = 0; + reap = calls = done = 0; do { unsigned long this_done = 0; unsigned long this_reap = 0; unsigned long this_call = 0; - unsigned long this_cache_hit = 0; - unsigned long this_cache_miss = 0; unsigned long rpc = 0, ipc = 0; - double hit = 0.0;
sleep(1); this_done += s->done; this_call += s->calls; this_reap += s->reaps; - this_cache_hit += s->cachehit; - this_cache_miss += s->cachemiss; - if (this_cache_hit && this_cache_miss) { - unsigned long hits, total; - - hits = this_cache_hit - cache_hit; - total = hits + this_cache_miss - cache_miss; - hit = (double) hits / (double) total; - hit *= 100.0; - } if (this_call - calls) { rpc = (this_done - done) / (this_call - calls); ipc = (this_reap - reap) / (this_call - calls); } else rpc = ipc = -1; file_depths(fdepths); - printf("IOPS=%lu, IOS/call=%ld/%ld, inflight=%u (%s), Cachehit=%0.2f%%\n", + printf("IOPS=%lu, IOS/call=%ld/%ld, inflight=%u (%s)\n", this_done - done, rpc, ipc, s->inflight, - fdepths, hit); + fdepths); done = this_done; calls = this_call; reap = this_reap; - cache_hit = s->cachehit; - cache_miss = s->cachemiss; } while (!finish);
pthread_join(s->thread, &ret);
From: Jens Axboe axboe@kernel.dk
mainline inclusion from mainline-5.1-rc2 commit e65ef56db4945fb18a0d522e056c02ddf939e644 category: feature bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27 CVE: NA ---------------------------
Get rid of the special casing of "normal" requests not having any references to the io_kiocb. We initialize the ref count to 2, one for the submission side, and one for the completion side.
Signed-off-by: Jens Axboe axboe@kernel.dk Signed-off-by: Zhihao Cheng chengzhihao1@huawei.com Signed-off-by: yangerkun yangerkun@huawei.com Reviewed-by: zhangyi (F) yi.zhang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- fs/io_uring.c | 54 +++++++++++++++++++++++++++++++++------------------ 1 file changed, 35 insertions(+), 19 deletions(-)
diff --git a/fs/io_uring.c b/fs/io_uring.c index 5dde033ed5a2..d4d42040884a 100644 --- a/fs/io_uring.c +++ b/fs/io_uring.c @@ -411,7 +411,8 @@ static struct io_kiocb *io_get_req(struct io_ring_ctx *ctx,
req->ctx = ctx; req->flags = 0; - refcount_set(&req->refs, 0); + /* one is dropped after submission, the other at completion */ + refcount_set(&req->refs, 2); return req; out: io_ring_drop_ctx_refs(ctx, 1); @@ -429,10 +430,14 @@ static void io_free_req_many(struct io_ring_ctx *ctx, void **reqs, int *nr)
static void io_free_req(struct io_kiocb *req) { - if (!refcount_read(&req->refs) || refcount_dec_and_test(&req->refs)) { - io_ring_drop_ctx_refs(req->ctx, 1); - kmem_cache_free(req_cachep, req); - } + io_ring_drop_ctx_refs(req->ctx, 1); + kmem_cache_free(req_cachep, req); +} + +static void io_put_req(struct io_kiocb *req) +{ + if (refcount_dec_and_test(&req->refs)) + io_free_req(req); }
/* @@ -453,7 +458,8 @@ static void io_iopoll_complete(struct io_ring_ctx *ctx, unsigned int *nr_events,
io_cqring_fill_event(ctx, req->user_data, req->error, 0);
- reqs[to_free++] = req; + if (refcount_dec_and_test(&req->refs)) + reqs[to_free++] = req; (*nr_events)++;
/* @@ -616,7 +622,7 @@ static void io_complete_rw(struct kiocb *kiocb, long res, long res2)
io_fput(req); io_cqring_add_event(req->ctx, req->user_data, res, 0); - io_free_req(req); + io_put_req(req); }
static void io_complete_rw_iopoll(struct kiocb *kiocb, long res, long res2) @@ -1083,7 +1089,7 @@ static int io_nop(struct io_kiocb *req, u64 user_data) io_fput(req); } io_cqring_add_event(ctx, user_data, err, 0); - io_free_req(req); + io_put_req(req); return 0; }
@@ -1146,7 +1152,7 @@ static int io_fsync(struct io_kiocb *req, const struct io_uring_sqe *sqe,
io_fput(req); io_cqring_add_event(req->ctx, sqe->user_data, ret, 0); - io_free_req(req); + io_put_req(req); return 0; }
@@ -1204,7 +1210,7 @@ static int io_poll_remove(struct io_kiocb *req, const struct io_uring_sqe *sqe) spin_unlock_irq(&ctx->completion_lock);
io_cqring_add_event(req->ctx, sqe->user_data, ret, 0); - io_free_req(req); + io_put_req(req); return 0; }
@@ -1212,7 +1218,7 @@ static void io_poll_complete(struct io_kiocb *req, __poll_t mask) { io_cqring_add_event(req->ctx, req->user_data, mangle_poll(mask), 0); io_fput(req); - io_free_req(req); + io_put_req(req); }
static void io_poll_complete_work(struct work_struct *work) @@ -1346,9 +1352,6 @@ static int io_poll_add(struct io_kiocb *req, const struct io_uring_sqe *sqe) INIT_LIST_HEAD(&poll->wait.entry); init_waitqueue_func_entry(&poll->wait, io_poll_wake);
- /* one for removal from waitqueue, one for this function */ - refcount_set(&req->refs, 2); - mask = vfs_poll(poll->file, &ipt.pt) & poll->events; if (unlikely(!poll->head)) { /* we did not manage to set up a waitqueue, done */ @@ -1380,13 +1383,12 @@ static int io_poll_add(struct io_kiocb *req, const struct io_uring_sqe *sqe) * Drop one of our refs to this req, __io_submit_sqe() will * drop the other one since we're returning an error. */ - io_free_req(req); + io_put_req(req); return ipt.error; }
if (mask) io_poll_complete(req, mask); - io_free_req(req); return 0; }
@@ -1524,10 +1526,13 @@ static void io_sq_wq_submit_work(struct work_struct *work) break; cond_resched(); } while (1); + + /* drop submission reference */ + io_put_req(req); } if (ret) { io_cqring_add_event(ctx, sqe->user_data, ret, 0); - io_free_req(req); + io_put_req(req); }
/* async context always use a copy of the sqe */ @@ -1649,11 +1654,22 @@ static int io_submit_sqe(struct io_ring_ctx *ctx, struct sqe_submit *s, INIT_WORK(&req->work, io_sq_wq_submit_work); queue_work(ctx->sqo_wq, &req->work); } - ret = 0; + + /* + * Queued up for async execution, worker will release + * submit reference when the iocb is actually + * submitted. + */ + return 0; } } + + /* drop submission reference */ + io_put_req(req); + + /* and drop final reference, if we failed */ if (ret) - io_free_req(req); + io_put_req(req);
return ret; }
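The lifetime rule this establishes is easy to model in user space. A sketch only, with C11 atomics standing in for the kernel's refcount_t and illustrative names: the request is born with two references, the submission and completion paths each drop one, and whichever drop reaches zero frees the request.

    #include <stdatomic.h>
    #include <stdlib.h>

    struct request {
        atomic_int refs;
    };

    static struct request *req_alloc(void)
    {
        struct request *req = malloc(sizeof(*req));

        /* one ref dropped after submission, the other at completion */
        if (req)
            atomic_init(&req->refs, 2);
        return req;
    }

    static void req_put(struct request *req)
    {
        /* fetch_sub returns the old value; 1 means we dropped the last ref */
        if (atomic_fetch_sub(&req->refs, 1) == 1)
            free(req);
    }

    int main(void)
    {
        struct request *req = req_alloc();

        if (!req)
            return 1;
        req_put(req);   /* submission side done */
        req_put(req);   /* completion side done: frees the request */
        return 0;
    }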
From: Jens Axboe axboe@kernel.dk
mainline inclusion from mainline-5.1-rc2 commit e0c5c576d5074b5bb7b1b4b59848c25ceb521331 category: feature bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27 CVE: NA ---------------------------
The callers all convert to an integer, and we only return 0/-ERROR anyway.
Signed-off-by: Jens Axboe axboe@kernel.dk Signed-off-by: Zhihao Cheng chengzhihao1@huawei.com Signed-off-by: yangerkun yangerkun@huawei.com Reviewed-by: zhangyi (F) yi.zhang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- fs/io_uring.c | 19 +++++++++---------- 1 file changed, 9 insertions(+), 10 deletions(-)
diff --git a/fs/io_uring.c b/fs/io_uring.c index d4d42040884a..901d0132e9ae 100644 --- a/fs/io_uring.c +++ b/fs/io_uring.c @@ -893,7 +893,7 @@ static int io_import_iovec(struct io_ring_ctx *ctx, int rw, opcode = READ_ONCE(sqe->opcode); if (opcode == IORING_OP_READ_FIXED || opcode == IORING_OP_WRITE_FIXED) { - ssize_t ret = io_import_fixed(ctx, rw, sqe, iter); + int ret = io_import_fixed(ctx, rw, sqe, iter); *iovec = NULL; return ret; } @@ -951,15 +951,15 @@ static void io_async_list_note(int rw, struct io_kiocb *req, size_t len) async_list->io_end = io_end; }
-static ssize_t io_read(struct io_kiocb *req, const struct sqe_submit *s, - bool force_nonblock, struct io_submit_state *state) +static int io_read(struct io_kiocb *req, const struct sqe_submit *s, + bool force_nonblock, struct io_submit_state *state) { struct iovec inline_vecs[UIO_FASTIOV], *iovec = inline_vecs; struct kiocb *kiocb = &req->rw; struct iov_iter iter; struct file *file; size_t iov_count; - ssize_t ret; + int ret;
ret = io_prep_rw(req, s, force_nonblock, state); if (ret) @@ -1004,15 +1004,15 @@ static ssize_t io_read(struct io_kiocb *req, const struct sqe_submit *s, return ret; }
-static ssize_t io_write(struct io_kiocb *req, const struct sqe_submit *s, - bool force_nonblock, struct io_submit_state *state) +static int io_write(struct io_kiocb *req, const struct sqe_submit *s, + bool force_nonblock, struct io_submit_state *state) { struct iovec inline_vecs[UIO_FASTIOV], *iovec = inline_vecs; struct kiocb *kiocb = &req->rw; struct iov_iter iter; struct file *file; size_t iov_count; - ssize_t ret; + int ret;
ret = io_prep_rw(req, s, force_nonblock, state); if (ret) @@ -1396,8 +1396,7 @@ static int __io_submit_sqe(struct io_ring_ctx *ctx, struct io_kiocb *req, const struct sqe_submit *s, bool force_nonblock, struct io_submit_state *state) { - ssize_t ret; - int opcode; + int ret, opcode;
if (unlikely(s->index >= ctx->sq_entries)) return -EINVAL; @@ -1623,7 +1622,7 @@ static int io_submit_sqe(struct io_ring_ctx *ctx, struct sqe_submit *s, struct io_submit_state *state) { struct io_kiocb *req; - ssize_t ret; + int ret;
/* enforce forwards compatibility on users */ if (unlikely(s->sqe->flags & ~IOSQE_FIXED_FILE))
From: Jens Axboe axboe@kernel.dk
mainline inclusion from mainline-5.1-rc7 commit 8358e3a8264a228cf2dfb6f3a05c0328f4118f12 category: feature bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27 CVE: NA ---------------------------
Since commit 09bb839434b we don't use the state argument for any sort of on-stack caching in the io read and write path. Remove the stale and unused argument from them, and bubble it up to __io_submit_sqe() and down to io_prep_rw().
Signed-off-by: Jens Axboe axboe@kernel.dk Signed-off-by: Zhihao Cheng chengzhihao1@huawei.com Signed-off-by: yangerkun yangerkun@huawei.com Reviewed-by: zhangyi (F) yi.zhang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- fs/io_uring.c | 25 ++++++++++++------------- 1 file changed, 12 insertions(+), 13 deletions(-)
diff --git a/fs/io_uring.c b/fs/io_uring.c index dcbb2beb2050..d1efb389661c 100644 --- a/fs/io_uring.c +++ b/fs/io_uring.c @@ -740,7 +740,7 @@ static bool io_file_supports_async(struct file *file) }
static int io_prep_rw(struct io_kiocb *req, const struct sqe_submit *s, - bool force_nonblock, struct io_submit_state *state) + bool force_nonblock) { const struct io_uring_sqe *sqe = s->sqe; struct io_ring_ctx *ctx = req->ctx; @@ -935,7 +935,7 @@ static void io_async_list_note(int rw, struct io_kiocb *req, size_t len) }
static int io_read(struct io_kiocb *req, const struct sqe_submit *s, - bool force_nonblock, struct io_submit_state *state) + bool force_nonblock) { struct iovec inline_vecs[UIO_FASTIOV], *iovec = inline_vecs; struct kiocb *kiocb = &req->rw; @@ -944,7 +944,7 @@ static int io_read(struct io_kiocb *req, const struct sqe_submit *s, size_t iov_count; int ret;
- ret = io_prep_rw(req, s, force_nonblock, state); + ret = io_prep_rw(req, s, force_nonblock); if (ret) return ret; file = kiocb->ki_filp; @@ -982,7 +982,7 @@ static int io_read(struct io_kiocb *req, const struct sqe_submit *s, }
static int io_write(struct io_kiocb *req, const struct sqe_submit *s, - bool force_nonblock, struct io_submit_state *state) + bool force_nonblock) { struct iovec inline_vecs[UIO_FASTIOV], *iovec = inline_vecs; struct kiocb *kiocb = &req->rw; @@ -991,7 +991,7 @@ static int io_write(struct io_kiocb *req, const struct sqe_submit *s, size_t iov_count; int ret;
- ret = io_prep_rw(req, s, force_nonblock, state); + ret = io_prep_rw(req, s, force_nonblock); if (ret) return ret;
@@ -1333,8 +1333,7 @@ static int io_poll_add(struct io_kiocb *req, const struct io_uring_sqe *sqe) }
static int __io_submit_sqe(struct io_ring_ctx *ctx, struct io_kiocb *req, - const struct sqe_submit *s, bool force_nonblock, - struct io_submit_state *state) + const struct sqe_submit *s, bool force_nonblock) { int ret, opcode;
@@ -1350,18 +1349,18 @@ static int __io_submit_sqe(struct io_ring_ctx *ctx, struct io_kiocb *req, case IORING_OP_READV: if (unlikely(s->sqe->buf_index)) return -EINVAL; - ret = io_read(req, s, force_nonblock, state); + ret = io_read(req, s, force_nonblock); break; case IORING_OP_WRITEV: if (unlikely(s->sqe->buf_index)) return -EINVAL; - ret = io_write(req, s, force_nonblock, state); + ret = io_write(req, s, force_nonblock); break; case IORING_OP_READ_FIXED: - ret = io_read(req, s, force_nonblock, state); + ret = io_read(req, s, force_nonblock); break; case IORING_OP_WRITE_FIXED: - ret = io_write(req, s, force_nonblock, state); + ret = io_write(req, s, force_nonblock); break; case IORING_OP_FSYNC: ret = io_fsync(req, s->sqe, force_nonblock); @@ -1454,7 +1453,7 @@ static void io_sq_wq_submit_work(struct work_struct *work) s->has_user = cur_mm != NULL; s->needs_lock = true; do { - ret = __io_submit_sqe(ctx, req, s, false, NULL); + ret = __io_submit_sqe(ctx, req, s, false); /* * We can get EAGAIN for polled IO even though * we're forcing a sync submission from here, @@ -1620,7 +1619,7 @@ static int io_submit_sqe(struct io_ring_ctx *ctx, struct sqe_submit *s, if (unlikely(ret)) goto out;
- ret = __io_submit_sqe(ctx, req, s, true, state); + ret = __io_submit_sqe(ctx, req, s, true); if (ret == -EAGAIN) { struct io_uring_sqe *sqe_copy;
From: Stefan Bühler source@stbuehler.de
mainline inclusion from mainline-5.1 commit b841f19524a16cd93a39f9306191f85c549a2bc2 category: feature bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27 CVE: NA ---------------------------
The smp_store_release() in io_commit_sqring() already orders the store to the dropped counter before the update to the SQ head, so the explicit smp_wmb() after updating the counter is redundant and can be removed.
Signed-off-by: Stefan Bühler source@stbuehler.de Signed-off-by: Jens Axboe axboe@kernel.dk Signed-off-by: Zhihao Cheng chengzhihao1@huawei.com Signed-off-by: yangerkun yangerkun@huawei.com Reviewed-by: zhangyi (F) yi.zhang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- fs/io_uring.c | 2 -- 1 file changed, 2 deletions(-)
diff --git a/fs/io_uring.c b/fs/io_uring.c index ba3fb2d8ec27..0a0fbb236147 100644 --- a/fs/io_uring.c +++ b/fs/io_uring.c @@ -1843,8 +1843,6 @@ static bool io_get_sqring(struct io_ring_ctx *ctx, struct sqe_submit *s) /* drop invalid entries */ ctx->cached_sq_head++; ring->dropped++; - /* See comment at the top of this file */ - smp_wmb(); return false; }
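The ordering argument can be sketched in user space, with C11 atomics standing in for smp_store_release() and smp_wmb() (illustrative only, not kernel code):

    #include <stdatomic.h>

    struct sqring {
        unsigned dropped;       /* plain counter */
        atomic_uint head;       /* consumer head, shared with the other side */
    };

    static void commit(struct sqring *ring, unsigned new_head)
    {
        ring->dropped++;        /* plain store */
        /*
         * The release store below already orders the dropped update
         * before the new head becomes visible, so a separate write
         * barrier between the two stores would be redundant.
         */
        atomic_store_explicit(&ring->head, new_head, memory_order_release);
    }

    int main(void)
    {
        struct sqring ring = { 0 };

        commit(&ring, 1);
        return 0;
    }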
From: Jens Axboe axboe@kernel.dk
mainline inclusion from mainline-5.1 commit 5c8b0b54db22c54f2aec991b388f550d3a927f26 category: feature bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27 CVE: NA ---------------------------
Currently we only post a cqe if we get an error OUTSIDE of submission. For submission, we return the error directly through io_uring_enter(). This is a bit awkward for applications, and it makes more sense to always post a cqe with an error, if the error happens on behalf of an sqe.
This changes submission behavior a bit. io_uring_enter() returns -ERROR for an error, and > 0 for number of sqes submitted. Before this change, if you wanted to submit 8 entries and had an error on the 5th entry, io_uring_enter() would return 4 (for number of entries successfully submitted) and rewind the sqring. The application would then have to peek at the sqring and figure out what was wrong with the head sqe, and then skip it itself. With this change, we'll return 5 since we did consume 5 sqes, and the last sqe (with the error) will result in a cqe being posted with the error.
This makes the logic easier to handle in the application, and it cleans up the submission part.
Suggested-by: Stefan Bühler source@stbuehler.de Signed-off-by: Jens Axboe axboe@kernel.dk Signed-off-by: Zhihao Cheng chengzhihao1@huawei.com Signed-off-by: yangerkun yangerkun@huawei.com Reviewed-by: zhangyi (F) yi.zhang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- fs/io_uring.c | 34 ++++++---------------------------- 1 file changed, 6 insertions(+), 28 deletions(-)
diff --git a/fs/io_uring.c b/fs/io_uring.c index a3a78d3cab7a..5bcc25a77291 100644 --- a/fs/io_uring.c +++ b/fs/io_uring.c @@ -1798,14 +1798,6 @@ static void io_commit_sqring(struct io_ring_ctx *ctx) } }
-/* - * Undo last io_get_sqring() - */ -static void io_drop_sqring(struct io_ring_ctx *ctx) -{ - ctx->cached_sq_head--; -} - /* * Fetch an sqe, if one is available. Note that s->sqe will point to memory * that is mapped by userspace. This means that care needs to be taken to @@ -2015,7 +2007,7 @@ static int io_sq_thread(void *data) static int io_ring_submit(struct io_ring_ctx *ctx, unsigned int to_submit) { struct io_submit_state state, *statep = NULL; - int i, ret = 0, submit = 0; + int i, submit = 0;
if (to_submit > IO_PLUG_THRESHOLD) { io_submit_state_start(&state, ctx, to_submit); @@ -2024,6 +2016,7 @@ static int io_ring_submit(struct io_ring_ctx *ctx, unsigned int to_submit)
for (i = 0; i < to_submit; i++) { struct sqe_submit s; + int ret;
if (!io_get_sqring(ctx, &s)) break; @@ -2031,21 +2024,18 @@ static int io_ring_submit(struct io_ring_ctx *ctx, unsigned int to_submit) s.has_user = true; s.needs_lock = false; s.needs_fixed_file = false; + submit++;
ret = io_submit_sqe(ctx, &s, statep); - if (ret) { - io_drop_sqring(ctx); - break; - } - - submit++; + if (ret) + io_cqring_add_event(ctx, s.sqe->user_data, ret, 0); } io_commit_sqring(ctx);
if (statep) io_submit_state_end(statep);
- return submit ? submit : ret; + return submit; }
static unsigned io_cqring_events(struct io_cq_ring *ring) @@ -2776,24 +2766,12 @@ SYSCALL_DEFINE6(io_uring_enter, unsigned int, fd, u32, to_submit, mutex_lock(&ctx->uring_lock); submitted = io_ring_submit(ctx, to_submit); mutex_unlock(&ctx->uring_lock); - - if (submitted < 0) - goto out_ctx; } if (flags & IORING_ENTER_GETEVENTS) { unsigned nr_events = 0;
min_complete = min(min_complete, ctx->cq_entries);
- /* - * The application could have included the 'to_submit' count - * in how many events it wanted to wait for. If we failed to - * submit the desired count, we may need to adjust the number - * of events to poll/wait for. - */ - if (submitted < to_submit) - min_complete = min_t(unsigned, submitted, min_complete); - if (ctx->flags & IORING_SETUP_IOPOLL) { mutex_lock(&ctx->uring_lock); ret = io_iopoll_check(ctx, &nr_events, min_complete);
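On the application side, the resulting pattern is to treat the io_uring_enter() return purely as the number of sqes consumed and pick per-sqe errors out of the cqes. A sketch using the in-tree liburing helpers added earlier in this series, where nr is the submit count previously reported:

    #include <stdio.h>
    #include <string.h>
    #include "liburing.h"

    /* reap 'nr' completions; a failed sqe now shows up as a cqe with a
     * negative res instead of a rewound sq ring */
    static int reap_errors(struct io_uring *ring, int nr)
    {
        struct io_uring_cqe *cqe;
        int i, ret;

        for (i = 0; i < nr; i++) {
            ret = io_uring_wait_completion(ring, &cqe);
            if (ret < 0)
                return ret;
            if (cqe->res < 0)
                fprintf(stderr, "sqe failed: %s\n", strerror(-cqe->res));
        }
        return 0;
    }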
From: Mark Rutland mark.rutland@arm.com
mainline inclusion from mainline-5.1 commit 975554b03eddc1df73bda3a764a09e18cadd5f1c category: feature bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27 CVE: NA ---------------------------
In io_sq_offload_start(), we call cpu_possible() on an unbounded cpu value from userspace. On v5.1-rc7 on arm64 with CONFIG_DEBUG_PER_CPU_MAPS, this results in a splat:
WARNING: CPU: 1 PID: 27601 at include/linux/cpumask.h:121 cpu_max_bits_warn include/linux/cpumask.h:121 [inline]
There was an attempt to fix this in commit:
917257daa0fea7a0 ("io_uring: only test SQPOLL cpu after we've verified it")
... by adding a check after the cpu value had been limited to NR_CPU_IDS using array_index_nospec(). However, this left an unbound check at the start of the function, for which the warning still fires.
Let's fix this correctly by checking that the cpu value is bounded by nr_cpu_ids before passing it to cpu_possible(). Note that only the first nr_cpu_ids bits of a cpumask are guaranteed to exist at runtime, and nr_cpu_ids can be significantly smaller than NR_CPUS. For example, an arm64 defconfig has NR_CPUS=256, while my test VM has 4 vCPUs.
Following the intent from the commit message for 917257daa0fea7a0, the check is moved under the SQ_AFF branch, which is the only branch where the cpu value is consumed. The check is performed before bounding the value with array_index_nospec() so that we don't silently accept bogus cpu values from userspace, where array_index_nospec() would force these values to 0.
I suspect we can remove the array_index_nospec() call entirely, but I've conservatively left that in place, updated to use nr_cpu_ids to match the prior check.
Tested on arm64 with the Syzkaller reproducer:
https://syzkaller.appspot.com/bug?extid=cd714a07c6de2bc34293 https://syzkaller.appspot.com/x/repro.syz?x=15d8b397200000
Full splat from before this patch:
WARNING: CPU: 1 PID: 27601 at include/linux/cpumask.h:121 cpu_max_bits_warn include/linux/cpumask.h:121 [inline] WARNING: CPU: 1 PID: 27601 at include/linux/cpumask.h:121 cpumask_check include/linux/cpumask.h:128 [inline] WARNING: CPU: 1 PID: 27601 at include/linux/cpumask.h:121 cpumask_test_cpu include/linux/cpumask.h:344 [inline] WARNING: CPU: 1 PID: 27601 at include/linux/cpumask.h:121 io_sq_offload_start fs/io_uring.c:2244 [inline] WARNING: CPU: 1 PID: 27601 at include/linux/cpumask.h:121 io_uring_create fs/io_uring.c:2864 [inline] WARNING: CPU: 1 PID: 27601 at include/linux/cpumask.h:121 io_uring_setup+0x1108/0x15a0 fs/io_uring.c:2916 Kernel panic - not syncing: panic_on_warn set ... CPU: 1 PID: 27601 Comm: syz-executor.0 Not tainted 5.1.0-rc7 #3 Hardware name: linux,dummy-virt (DT) Call trace: dump_backtrace+0x0/0x2f0 include/linux/compiler.h:193 show_stack+0x20/0x30 arch/arm64/kernel/traps.c:158 __dump_stack lib/dump_stack.c:77 [inline] dump_stack+0x110/0x190 lib/dump_stack.c:113 panic+0x384/0x68c kernel/panic.c:214 __warn+0x2bc/0x2c0 kernel/panic.c:571 report_bug+0x228/0x2d8 lib/bug.c:186 bug_handler+0xa0/0x1a0 arch/arm64/kernel/traps.c:956 call_break_hook arch/arm64/kernel/debug-monitors.c:301 [inline] brk_handler+0x1d4/0x388 arch/arm64/kernel/debug-monitors.c:316 do_debug_exception+0x1a0/0x468 arch/arm64/mm/fault.c:831 el1_dbg+0x18/0x8c cpu_max_bits_warn include/linux/cpumask.h:121 [inline] cpumask_check include/linux/cpumask.h:128 [inline] cpumask_test_cpu include/linux/cpumask.h:344 [inline] io_sq_offload_start fs/io_uring.c:2244 [inline] io_uring_create fs/io_uring.c:2864 [inline] io_uring_setup+0x1108/0x15a0 fs/io_uring.c:2916 __do_sys_io_uring_setup fs/io_uring.c:2929 [inline] __se_sys_io_uring_setup fs/io_uring.c:2926 [inline] __arm64_sys_io_uring_setup+0x50/0x70 fs/io_uring.c:2926 __invoke_syscall arch/arm64/kernel/syscall.c:35 [inline] invoke_syscall arch/arm64/kernel/syscall.c:47 [inline] el0_svc_common.constprop.0+0x148/0x2e0 arch/arm64/kernel/syscall.c:83 el0_svc_handler+0xdc/0x100 arch/arm64/kernel/syscall.c:129 el0_svc+0x8/0xc arch/arm64/kernel/entry.S:948 SMP: stopping secondary CPUs Dumping ftrace buffer: (ftrace buffer empty) Kernel Offset: disabled CPU features: 0x002,23000438 Memory Limit: none Rebooting in 1 seconds..
Fixes: 917257daa0fea7a0 ("io_uring: only test SQPOLL cpu after we've verified it") Signed-off-by: Mark Rutland mark.rutland@arm.com Cc: Jens Axboe axboe@kernel.dk Cc: Alexander Viro viro@zeniv.linux.org.uk Cc: linux-block@vger.kernel.org Cc: linux-fsdevel@vger.kernel.org Cc: linux-kernel@vger.kernel.org
Simplified the logic
Signed-off-by: Jens Axboe axboe@kernel.dk Signed-off-by: Zhihao Cheng chengzhihao1@huawei.com Signed-off-by: yangerkun yangerkun@huawei.com Reviewed-by: zhangyi (F) yi.zhang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- fs/io_uring.c | 10 +++------- 1 file changed, 3 insertions(+), 7 deletions(-)
diff --git a/fs/io_uring.c b/fs/io_uring.c index 5bcc25a77291..919789957544 100644 --- a/fs/io_uring.c +++ b/fs/io_uring.c @@ -2316,10 +2316,6 @@ static int io_sq_offload_start(struct io_ring_ctx *ctx, mmgrab(current->mm); ctx->sqo_mm = current->mm;
- ret = -EINVAL; - if (!cpu_possible(p->sq_thread_cpu)) - goto err; - if (ctx->flags & IORING_SETUP_SQPOLL) { ret = -EPERM; if (!capable(CAP_SYS_ADMIN)) @@ -2330,11 +2326,11 @@ static int io_sq_offload_start(struct io_ring_ctx *ctx, ctx->sq_thread_idle = HZ;
if (p->flags & IORING_SETUP_SQ_AFF) { - int cpu; + int cpu = array_index_nospec(p->sq_thread_cpu, + nr_cpu_ids);
- cpu = array_index_nospec(p->sq_thread_cpu, NR_CPUS); ret = -EINVAL; - if (!cpu_possible(p->sq_thread_cpu)) + if (!cpu_possible(cpu)) goto err;
ctx->sqo_thread = kthread_create_on_cpu(io_sq_thread,
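The shape of the check reduces to a validate-then-index pattern. A user-space sketch, where NR_RUNTIME_CPUS and index_nospec() are stand-ins for nr_cpu_ids and array_index_nospec():

    #include <stdio.h>

    #define NR_RUNTIME_CPUS 4   /* stand-in for nr_cpu_ids */

    /* after the bounds check this is an identity mapping; it only exists
     * to keep speculative execution from indexing out of range */
    static unsigned index_nospec(unsigned idx, unsigned size)
    {
        return idx < size ? idx : 0;
    }

    static int pick_cpu(unsigned user_cpu)
    {
        /* reject the raw user value first, don't silently clamp it */
        if (user_cpu >= NR_RUNTIME_CPUS)
            return -1;          /* -EINVAL in the kernel */
        return index_nospec(user_cpu, NR_RUNTIME_CPUS);
    }

    int main(void)
    {
        printf("%d %d\n", pick_cpu(2), pick_cpu(100));
        return 0;
    }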
From: Mark Rutland mark.rutland@arm.com
mainline inclusion from mainline-5.1 commit 52e04ef4c9d459cba3afd86ec335a411b40b7fd2 category: feature bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27 CVE: NA ---------------------------
If io_allocate_scq_urings() fails to allocate an sq_* region, it will call io_mem_free() for any previously allocated regions, but leave dangling pointers to these regions in the ctx. Any regions which have not yet been allocated are left NULL. Note that when returning -EOVERFLOW, the previously allocated sq_ring is not freed, which appears to be an unintentional leak.
When io_allocate_scq_urings() fails, io_uring_create() will call io_ring_ctx_wait_and_kill(), which calls io_mem_free() on all the sq_* regions, assuming the pointers are valid and not NULL.
This can result in pages being freed multiple times, which has been observed to corrupt the page state, leading to subsequent fun. This can also result in virt_to_page() on NULL, resulting in the use of bogus page addresses, and yet more subsequent fun. The latter can be detected with CONFIG_DEBUG_VIRTUAL on arm64.
Adding a cleanup path to io_allocate_scq_urings() complicates the logic, so let's leave it to io_ring_ctx_free() to consistently free these pointers, and simplify the io_allocate_scq_urings() error paths.
Full splats from before this patch below. Note that the pointer logged by the DEBUG_VIRTUAL "non-linear address" warning has been hashed, and is actually NULL.
[ 26.098129] page:ffff80000e949a00 count:0 mapcount:-128 mapping:0000000000000000 index:0x0 [ 26.102976] flags: 0x63fffc000000() [ 26.104373] raw: 000063fffc000000 ffff80000e86c188 ffff80000ea3df08 0000000000000000 [ 26.108917] raw: 0000000000000000 0000000000000001 00000000ffffff7f 0000000000000000 [ 26.137235] page dumped because: VM_BUG_ON_PAGE(page_ref_count(page) == 0) [ 26.143960] ------------[ cut here ]------------ [ 26.146020] kernel BUG at include/linux/mm.h:547! [ 26.147586] Internal error: Oops - BUG: 0 [#1] PREEMPT SMP [ 26.149163] Modules linked in: [ 26.150287] Process syz-executor.21 (pid: 20204, stack limit = 0x000000000e9cefeb) [ 26.153307] CPU: 2 PID: 20204 Comm: syz-executor.21 Not tainted 5.1.0-rc7-00004-g7d30b2ea43d6 #18 [ 26.156566] Hardware name: linux,dummy-virt (DT) [ 26.158089] pstate: 40400005 (nZcv daif +PAN -UAO) [ 26.159869] pc : io_mem_free+0x9c/0xa8 [ 26.161436] lr : io_mem_free+0x9c/0xa8 [ 26.162720] sp : ffff000013003d60 [ 26.164048] x29: ffff000013003d60 x28: ffff800025048040 [ 26.165804] x27: 0000000000000000 x26: ffff800025048040 [ 26.167352] x25: 00000000000000c0 x24: ffff0000112c2820 [ 26.169682] x23: 0000000000000000 x22: 0000000020000080 [ 26.171899] x21: ffff80002143b418 x20: ffff80002143b400 [ 26.174236] x19: ffff80002143b280 x18: 0000000000000000 [ 26.176607] x17: 0000000000000000 x16: 0000000000000000 [ 26.178997] x15: 0000000000000000 x14: 0000000000000000 [ 26.181508] x13: 00009178a5e077b2 x12: 0000000000000001 [ 26.183863] x11: 0000000000000000 x10: 0000000000000980 [ 26.186437] x9 : ffff000013003a80 x8 : ffff800025048a20 [ 26.189006] x7 : ffff8000250481c0 x6 : ffff80002ffe9118 [ 26.191359] x5 : ffff80002ffe9118 x4 : 0000000000000000 [ 26.193863] x3 : ffff80002ffefe98 x2 : 44c06ddd107d1f00 [ 26.196642] x1 : 0000000000000000 x0 : 000000000000003e [ 26.198892] Call trace: [ 26.199893] io_mem_free+0x9c/0xa8 [ 26.201155] io_ring_ctx_wait_and_kill+0xec/0x180 [ 26.202688] io_uring_setup+0x6c4/0x6f0 [ 26.204091] __arm64_sys_io_uring_setup+0x18/0x20 [ 26.205576] el0_svc_common.constprop.0+0x7c/0xe8 [ 26.207186] el0_svc_handler+0x28/0x78 [ 26.208389] el0_svc+0x8/0xc [ 26.209408] Code: aa0203e0 d0006861 9133a021 97fcdc3c (d4210000) [ 26.211995] ---[ end trace bdb81cd43a21e50d ]---
[ 81.770626] ------------[ cut here ]------------ [ 81.825015] virt_to_phys used for non-linear address: 000000000d42f2c7 ( (null)) [ 81.827860] WARNING: CPU: 1 PID: 30171 at arch/arm64/mm/physaddr.c:15 __virt_to_phys+0x48/0x68 [ 81.831202] Modules linked in: [ 81.832212] CPU: 1 PID: 30171 Comm: syz-executor.20 Not tainted 5.1.0-rc7-00004-g7d30b2ea43d6 #19 [ 81.835616] Hardware name: linux,dummy-virt (DT) [ 81.836863] pstate: 60400005 (nZCv daif +PAN -UAO) [ 81.838727] pc : __virt_to_phys+0x48/0x68 [ 81.840572] lr : __virt_to_phys+0x48/0x68 [ 81.842264] sp : ffff80002cf67c70 [ 81.843858] x29: ffff80002cf67c70 x28: ffff800014358e18 [ 81.846463] x27: 0000000000000000 x26: 0000000020000080 [ 81.849148] x25: 0000000000000000 x24: ffff80001bb01f40 [ 81.851986] x23: ffff200011db06c8 x22: ffff2000127e3c60 [ 81.854351] x21: ffff800014358cc0 x20: ffff800014358d98 [ 81.856711] x19: 0000000000000000 x18: 0000000000000000 [ 81.859132] x17: 0000000000000000 x16: 0000000000000000 [ 81.861586] x15: 0000000000000000 x14: 0000000000000000 [ 81.863905] x13: 0000000000000000 x12: ffff1000037603e9 [ 81.866226] x11: 1ffff000037603e8 x10: 0000000000000980 [ 81.868776] x9 : ffff80002cf67840 x8 : ffff80001bb02920 [ 81.873272] x7 : ffff1000037603e9 x6 : ffff80001bb01f47 [ 81.875266] x5 : ffff1000037603e9 x4 : dfff200000000000 [ 81.876875] x3 : ffff200010087528 x2 : ffff1000059ecf58 [ 81.878751] x1 : 44c06ddd107d1f00 x0 : 0000000000000000 [ 81.880453] Call trace: [ 81.881164] __virt_to_phys+0x48/0x68 [ 81.882919] io_mem_free+0x18/0x110 [ 81.886585] io_ring_ctx_wait_and_kill+0x13c/0x1f0 [ 81.891212] io_uring_setup+0xa60/0xad0 [ 81.892881] __arm64_sys_io_uring_setup+0x2c/0x38 [ 81.894398] el0_svc_common.constprop.0+0xac/0x150 [ 81.896306] el0_svc_handler+0x34/0x88 [ 81.897744] el0_svc+0x8/0xc [ 81.898715] ---[ end trace b4a703802243cbba ]---
Fixes: 2b188cc1bb857a9d ("Add io_uring IO interface") Signed-off-by: Mark Rutland mark.rutland@arm.com Cc: Jens Axboe axboe@kernel.dk Cc: Alexander Viro viro@zeniv.linux.org.uk Cc: linux-block@vger.kernel.org Cc: linux-fsdevel@vger.kernel.org Cc: linux-kernel@vger.kernel.org Signed-off-by: Jens Axboe axboe@kernel.dk Signed-off-by: Zhihao Cheng chengzhihao1@huawei.com Signed-off-by: yangerkun yangerkun@huawei.com Reviewed-by: zhangyi (F) yi.zhang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- fs/io_uring.c | 15 +++++++-------- 1 file changed, 7 insertions(+), 8 deletions(-)
diff --git a/fs/io_uring.c b/fs/io_uring.c index 919789957544..6dd523adacab 100644 --- a/fs/io_uring.c +++ b/fs/io_uring.c @@ -2393,8 +2393,12 @@ static int io_account_mem(struct user_struct *user, unsigned long nr_pages)
static void io_mem_free(void *ptr) { - struct page *page = virt_to_head_page(ptr); + struct page *page; + + if (!ptr) + return;
+ page = virt_to_head_page(ptr); if (put_page_testzero(page)) free_compound_page(page); } @@ -2813,17 +2817,12 @@ static int io_allocate_scq_urings(struct io_ring_ctx *ctx, return -EOVERFLOW;
ctx->sq_sqes = io_mem_alloc(size); - if (!ctx->sq_sqes) { - io_mem_free(ctx->sq_ring); + if (!ctx->sq_sqes) return -ENOMEM; - }
cq_ring = io_mem_alloc(struct_size(cq_ring, cqes, p->cq_entries)); - if (!cq_ring) { - io_mem_free(ctx->sq_ring); - io_mem_free(ctx->sq_sqes); + if (!cq_ring) return -ENOMEM; - }
ctx->cq_ring = cq_ring; cq_ring->ring_mask = p->cq_entries - 1;
From: Jens Axboe axboe@kernel.dk
mainline inclusion from mainline-5.1 commit 817869d2519f0cb7be5b3482129dadc806dfb747 category: feature bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27 CVE: NA ---------------------------
If we don't end up actually calling submit in io_sq_wq_submit_work(), we still need to drop the submit reference to the request. If we don't, then we can leak the request. This can happen if we race with ring shutdown while flushing the workqueue for requests that require use of the mm_struct.
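As context for the fix below, the request lifetime is built on two references; a simplified sketch of the scheme (paraphrased, not the literal code):

    /* io_get_req(): one ref is dropped after submission, the other
     * at completion */
    refcount_set(&req->refs, 2);

    io_put_req(req);        /* submission side is done with req */
    /* ... CQE posted ... */
    io_put_req(req);        /* completion side; last ref frees req */

The work handler could bail out without ever dropping the first of these, leaving the request pinned forever.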
Fixes: e65ef56db494 ("io_uring: use regular request ref counts") Signed-off-by: Jens Axboe axboe@kernel.dk Signed-off-by: Zhihao Cheng chengzhihao1@huawei.com Signed-off-by: yangerkun yangerkun@huawei.com Reviewed-by: zhangyi (F) yi.zhang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- fs/io_uring.c | 7 ++++--- 1 file changed, 4 insertions(+), 3 deletions(-)
diff --git a/fs/io_uring.c b/fs/io_uring.c index 6dd523adacab..a6cd6b3ac4f6 100644 --- a/fs/io_uring.c +++ b/fs/io_uring.c @@ -1565,10 +1565,11 @@ static void io_sq_wq_submit_work(struct work_struct *work) break; cond_resched(); } while (1); - - /* drop submission reference */ - io_put_req(req); } + + /* drop submission reference */ + io_put_req(req); + if (ret) { io_cqring_add_event(ctx, sqe->user_data, ret, 0); io_put_req(req);
From: Mark Rutland mark.rutland@arm.com
mainline inclusion from mainline-5.1 commit d4ef647510b1200fe1c996ff1cbf5ac47eb930cc category: feature bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27 CVE: NA ---------------------------
In io_sqe_buffer_register() we allocate a number of arrays based on the iov_len from the user-provided iov. While we limit iov_len to SZ_1G, we can still attempt to allocate arrays exceeding MAX_ORDER.
On a 64-bit system with 4KiB pages, an iov with iov_base = 0x10 and iov_len = SZ_1G works out to nr_pages = 262145. Allocating the corresponding array of (16-byte) bio_vecs then requires 4194320 bytes, which is greater than 4MiB. This results in SLUB warning that we're trying to allocate greater than MAX_ORDER, and failing the allocation.
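Worked out, assuming 4KiB pages (PAGE_SHIFT = 12):

    end      = 0x10 + SZ_1G = 0x40000010
    nr_pages = (round_up(end, 4096) >> 12) - (0x10 >> 12)
             = 0x40001 - 0x0 = 262145
    262145 * sizeof(struct bio_vec) = 262145 * 16 = 4194320 bytes

which is just past the 4MiB (4194304 bytes) that the page allocator can hand out as a single contiguous allocation with the default MAX_ORDER.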
Avoid this by using kvmalloc() for allocations dependent on the user-provided iov_len. At the same time, fix a leak of imu->bvec when registration fails.
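A minimal sketch of the resulting pattern (nr_pages as in the example above):

    struct bio_vec *bvec;

    /* kvmalloc_array() tries kmalloc() first and transparently falls
     * back to vmalloc() when the request is too large for a
     * physically contiguous allocation */
    bvec = kvmalloc_array(nr_pages, sizeof(struct bio_vec), GFP_KERNEL);
    if (!bvec)
            return -ENOMEM;
    /* ... */
    kvfree(bvec);   /* handles both kmalloc- and vmalloc-backed memory */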
Full splat from before this patch:
WARNING: CPU: 1 PID: 2314 at mm/page_alloc.c:4595 __alloc_pages_nodemask+0x7ac/0x2938 mm/page_alloc.c:4595 Kernel panic - not syncing: panic_on_warn set ... CPU: 1 PID: 2314 Comm: syz-executor326 Not tainted 5.1.0-rc7-dirty #4 Hardware name: linux,dummy-virt (DT) Call trace: dump_backtrace+0x0/0x2f0 include/linux/compiler.h:193 show_stack+0x20/0x30 arch/arm64/kernel/traps.c:158 __dump_stack lib/dump_stack.c:77 [inline] dump_stack+0x110/0x190 lib/dump_stack.c:113 panic+0x384/0x68c kernel/panic.c:214 __warn+0x2bc/0x2c0 kernel/panic.c:571 report_bug+0x228/0x2d8 lib/bug.c:186 bug_handler+0xa0/0x1a0 arch/arm64/kernel/traps.c:956 call_break_hook arch/arm64/kernel/debug-monitors.c:301 [inline] brk_handler+0x1d4/0x388 arch/arm64/kernel/debug-monitors.c:316 do_debug_exception+0x1a0/0x468 arch/arm64/mm/fault.c:831 el1_dbg+0x18/0x8c __alloc_pages_nodemask+0x7ac/0x2938 mm/page_alloc.c:4595 alloc_pages_current+0x164/0x278 mm/mempolicy.c:2132 alloc_pages include/linux/gfp.h:509 [inline] kmalloc_order+0x20/0x50 mm/slab_common.c:1231 kmalloc_order_trace+0x30/0x2b0 mm/slab_common.c:1243 kmalloc_large include/linux/slab.h:480 [inline] __kmalloc+0x3dc/0x4f0 mm/slub.c:3791 kmalloc_array include/linux/slab.h:670 [inline] io_sqe_buffer_register fs/io_uring.c:2472 [inline] __io_uring_register fs/io_uring.c:2962 [inline] __do_sys_io_uring_register fs/io_uring.c:3008 [inline] __se_sys_io_uring_register fs/io_uring.c:2990 [inline] __arm64_sys_io_uring_register+0x9e0/0x1bc8 fs/io_uring.c:2990 __invoke_syscall arch/arm64/kernel/syscall.c:35 [inline] invoke_syscall arch/arm64/kernel/syscall.c:47 [inline] el0_svc_common.constprop.0+0x148/0x2e0 arch/arm64/kernel/syscall.c:83 el0_svc_handler+0xdc/0x100 arch/arm64/kernel/syscall.c:129 el0_svc+0x8/0xc arch/arm64/kernel/entry.S:948 SMP: stopping secondary CPUs Dumping ftrace buffer: (ftrace buffer empty) Kernel Offset: disabled CPU features: 0x002,23000438 Memory Limit: none Rebooting in 1 seconds..
Fixes: edafccee56ff3167 ("io_uring: add support for pre-mapped user IO buffers") Signed-off-by: Mark Rutland mark.rutland@arm.com Cc: Alexander Viro viro@zeniv.linux.org.uk Cc: Jens Axboe axboe@kernel.dk Cc: linux-fsdevel@vger.kernel.org Cc: linux-block@vger.kernel.org Cc: linux-kernel@vger.kernel.org Signed-off-by: Jens Axboe axboe@kernel.dk Signed-off-by: Zhihao Cheng chengzhihao1@huawei.com Signed-off-by: yangerkun yangerkun@huawei.com Reviewed-by: zhangyi (F) yi.zhang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- fs/io_uring.c | 17 +++++++++-------- 1 file changed, 9 insertions(+), 8 deletions(-)
diff --git a/fs/io_uring.c b/fs/io_uring.c index a6cd6b3ac4f6..ae1d4793013b 100644 --- a/fs/io_uring.c +++ b/fs/io_uring.c @@ -2440,7 +2440,7 @@ static int io_sqe_buffer_unregister(struct io_ring_ctx *ctx)
if (ctx->account_mem) io_unaccount_mem(ctx->user, imu->nr_bvecs); - kfree(imu->bvec); + kvfree(imu->bvec); imu->nr_bvecs = 0; }
@@ -2532,9 +2532,9 @@ static int io_sqe_buffer_register(struct io_ring_ctx *ctx, void __user *arg, if (!pages || nr_pages > got_pages) { kfree(vmas); kfree(pages); - pages = kmalloc_array(nr_pages, sizeof(struct page *), + pages = kvmalloc_array(nr_pages, sizeof(struct page *), GFP_KERNEL); - vmas = kmalloc_array(nr_pages, + vmas = kvmalloc_array(nr_pages, sizeof(struct vm_area_struct *), GFP_KERNEL); if (!pages || !vmas) { @@ -2546,7 +2546,7 @@ static int io_sqe_buffer_register(struct io_ring_ctx *ctx, void __user *arg, got_pages = nr_pages; }
- imu->bvec = kmalloc_array(nr_pages, sizeof(struct bio_vec), + imu->bvec = kvmalloc_array(nr_pages, sizeof(struct bio_vec), GFP_KERNEL); ret = -ENOMEM; if (!imu->bvec) { @@ -2585,6 +2585,7 @@ static int io_sqe_buffer_register(struct io_ring_ctx *ctx, void __user *arg, } if (ctx->account_mem) io_unaccount_mem(ctx->user, nr_pages); + kvfree(imu->bvec); goto err; }
@@ -2607,12 +2608,12 @@ static int io_sqe_buffer_register(struct io_ring_ctx *ctx, void __user *arg,
ctx->nr_user_bufs++; } - kfree(pages); - kfree(vmas); + kvfree(pages); + kvfree(vmas); return 0; err: - kfree(pages); - kfree(vmas); + kvfree(pages); + kvfree(vmas); io_sqe_buffer_unregister(ctx); return ret; }
From: Jens Axboe axboe@kernel.dk
mainline inclusion from mainline-5.2-rc1 commit 22f96b3808c12a218e9a3bce6e1bfbd74efbe374 category: feature bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27 CVE: NA ---------------------------
This just pulls out the ksys_sync_file_range() code to work on a struct file instead of an fd, so we can use it elsewhere.
Signed-off-by: Jens Axboe axboe@kernel.dk
Conflicts: fs/sync.c include/linux/fs.h [ Patch c553ea4fdf("fs/sync.c: sync_file_range(2) may use WB_SYNC_ALL writeback") applied earlier. ]
Signed-off-by: Zhihao Cheng chengzhihao1@huawei.com Signed-off-by: yangerkun yangerkun@huawei.com Reviewed-by: zhangyi (F) yi.zhang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- fs/sync.c | 141 ++++++++++++++++++++++++--------------------- include/linux/fs.h | 3 + 2 files changed, 77 insertions(+), 67 deletions(-)
diff --git a/fs/sync.c b/fs/sync.c index 9e8cd90e890f..4d1ff010bc5a 100644 --- a/fs/sync.c +++ b/fs/sync.c @@ -234,61 +234,10 @@ SYSCALL_DEFINE1(fdatasync, unsigned int, fd) return do_fsync(fd, 1); }
-/* - * ksys_sync_file_range() permits finely controlled syncing over a segment of - * a file in the range offset .. (offset+nbytes-1) inclusive. If nbytes is - * zero then ksys_sync_file_range() will operate from offset out to EOF. - * - * The flag bits are: - * - * SYNC_FILE_RANGE_WAIT_BEFORE: wait upon writeout of all pages in the range - * before performing the write. - * - * SYNC_FILE_RANGE_WRITE: initiate writeout of all those dirty pages in the - * range which are not presently under writeback. Note that this may block for - * significant periods due to exhaustion of disk request structures. - * - * SYNC_FILE_RANGE_WAIT_AFTER: wait upon writeout of all pages in the range - * after performing the write. - * - * Useful combinations of the flag bits are: - * - * SYNC_FILE_RANGE_WAIT_BEFORE|SYNC_FILE_RANGE_WRITE: ensures that all pages - * in the range which were dirty on entry to ksys_sync_file_range() are placed - * under writeout. This is a start-write-for-data-integrity operation. - * - * SYNC_FILE_RANGE_WRITE: start writeout of all dirty pages in the range which - * are not presently under writeout. This is an asynchronous flush-to-disk - * operation. Not suitable for data integrity operations. - * - * SYNC_FILE_RANGE_WAIT_BEFORE (or SYNC_FILE_RANGE_WAIT_AFTER): wait for - * completion of writeout of all pages in the range. This will be used after an - * earlier SYNC_FILE_RANGE_WAIT_BEFORE|SYNC_FILE_RANGE_WRITE operation to wait - * for that operation to complete and to return the result. - * - * SYNC_FILE_RANGE_WAIT_BEFORE|SYNC_FILE_RANGE_WRITE|SYNC_FILE_RANGE_WAIT_AFTER - * (a.k.a. SYNC_FILE_RANGE_WRITE_AND_WAIT): - * a traditional sync() operation. This is a write-for-data-integrity operation - * which will ensure that all pages in the range which were dirty on entry to - * ksys_sync_file_range() are written to disk. It should be noted that disk - * caches are not flushed by this call, so there are no guarantees here that the - * data will be available on disk after a crash. - * - * - * SYNC_FILE_RANGE_WAIT_BEFORE and SYNC_FILE_RANGE_WAIT_AFTER will detect any - * I/O errors or ENOSPC conditions and will return those to the caller, after - * clearing the EIO and ENOSPC flags in the address_space. - * - * It should be noted that none of these operations write out the file's - * metadata. So unless the application is strictly performing overwrites of - * already-instantiated disk blocks, there are no guarantees here that the data - * will be available after a crash. - */ -int ksys_sync_file_range(int fd, loff_t offset, loff_t nbytes, - unsigned int flags) +int sync_file_range(struct file *file, loff_t offset, loff_t nbytes, + unsigned int flags) { int ret; - struct fd f; struct address_space *mapping; loff_t endbyte; /* inclusive */ umode_t i_mode; @@ -328,23 +277,18 @@ int ksys_sync_file_range(int fd, loff_t offset, loff_t nbytes, else endbyte--; /* inclusive */
- ret = -EBADF; - f = fdget(fd); - if (!f.file) - goto out; - - i_mode = file_inode(f.file)->i_mode; + i_mode = file_inode(file)->i_mode; ret = -ESPIPE; if (!S_ISREG(i_mode) && !S_ISBLK(i_mode) && !S_ISDIR(i_mode) && !S_ISLNK(i_mode)) - goto out_put; + goto out;
- mapping = f.file->f_mapping; + mapping = file->f_mapping; ret = 0; if (flags & SYNC_FILE_RANGE_WAIT_BEFORE) { - ret = file_fdatawait_range(f.file, offset, endbyte); + ret = file_fdatawait_range(file, offset, endbyte); if (ret < 0) - goto out_put; + goto out; }
if (flags & SYNC_FILE_RANGE_WRITE) { @@ -357,18 +301,81 @@ int ksys_sync_file_range(int fd, loff_t offset, loff_t nbytes, ret = __filemap_fdatawrite_range(mapping, offset, endbyte, sync_mode); if (ret < 0) - goto out_put; + goto out; }
if (flags & SYNC_FILE_RANGE_WAIT_AFTER) - ret = file_fdatawait_range(f.file, offset, endbyte); + ret = file_fdatawait_range(file, offset, endbyte);
-out_put: - fdput(f); out: return ret; }
+/* + * ksys_sync_file_range() permits finely controlled syncing over a segment of + * a file in the range offset .. (offset+nbytes-1) inclusive. If nbytes is + * zero then ksys_sync_file_range() will operate from offset out to EOF. + * + * The flag bits are: + * + * SYNC_FILE_RANGE_WAIT_BEFORE: wait upon writeout of all pages in the range + * before performing the write. + * + * SYNC_FILE_RANGE_WRITE: initiate writeout of all those dirty pages in the + * range which are not presently under writeback. Note that this may block for + * significant periods due to exhaustion of disk request structures. + * + * SYNC_FILE_RANGE_WAIT_AFTER: wait upon writeout of all pages in the range + * after performing the write. + * + * Useful combinations of the flag bits are: + * + * SYNC_FILE_RANGE_WAIT_BEFORE|SYNC_FILE_RANGE_WRITE: ensures that all pages + * in the range which were dirty on entry to ksys_sync_file_range() are placed + * under writeout. This is a start-write-for-data-integrity operation. + * + * SYNC_FILE_RANGE_WRITE: start writeout of all dirty pages in the range which + * are not presently under writeout. This is an asynchronous flush-to-disk + * operation. Not suitable for data integrity operations. + * + * SYNC_FILE_RANGE_WAIT_BEFORE (or SYNC_FILE_RANGE_WAIT_AFTER): wait for + * completion of writeout of all pages in the range. This will be used after an + * earlier SYNC_FILE_RANGE_WAIT_BEFORE|SYNC_FILE_RANGE_WRITE operation to wait + * for that operation to complete and to return the result. + * + * SYNC_FILE_RANGE_WAIT_BEFORE|SYNC_FILE_RANGE_WRITE|SYNC_FILE_RANGE_WAIT_AFTER + * (a.k.a. SYNC_FILE_RANGE_WRITE_AND_WAIT): + * a traditional sync() operation. This is a write-for-data-integrity operation + * which will ensure that all pages in the range which were dirty on entry to + * ksys_sync_file_range() are written to disk. It should be noted that disk + * caches are not flushed by this call, so there are no guarantees here that the + * data will be available on disk after a crash. + * + * + * SYNC_FILE_RANGE_WAIT_BEFORE and SYNC_FILE_RANGE_WAIT_AFTER will detect any + * I/O errors or ENOSPC conditions and will return those to the caller, after + * clearing the EIO and ENOSPC flags in the address_space. + * + * It should be noted that none of these operations write out the file's + * metadata. So unless the application is strictly performing overwrites of + * already-instantiated disk blocks, there are no guarantees here that the data + * will be available after a crash. + */ +int ksys_sync_file_range(int fd, loff_t offset, loff_t nbytes, + unsigned int flags) +{ + int ret; + struct fd f; + + ret = -EBADF; + f = fdget(fd); + if (f.file) + ret = sync_file_range(f.file, offset, nbytes, flags); + + fdput(f); + return ret; +} + SYSCALL_DEFINE4(sync_file_range, int, fd, loff_t, offset, loff_t, nbytes, unsigned int, flags) { diff --git a/include/linux/fs.h b/include/linux/fs.h index db7dd25ce645..36d828c741d5 100644 --- a/include/linux/fs.h +++ b/include/linux/fs.h @@ -2804,6 +2804,9 @@ extern int vfs_fsync_range(struct file *file, loff_t start, loff_t end, int datasync); extern int vfs_fsync(struct file *file, int datasync);
+extern int sync_file_range(struct file *file, loff_t offset, loff_t nbytes, + unsigned int flags); + /* * Sync the bytes written if this was a synchronous write. Expect ki_pos * to already be updated for the write, and will return either the amount
From: Jens Axboe axboe@kernel.dk
mainline inclusion from mainline-5.2-rc1 commit de0617e467171ba44c73efd1ba63f101b164a035 category: feature bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27 CVE: NA ---------------------------
There are no ordering constraints between the submission and completion side of io_uring, but sometimes ordering would be useful to have. One common example is an fsync that should be ordered after previous writes. Without support for that, the application must do this tracking itself.
This adds a general SQE flag, IOSQE_IO_DRAIN. If a command is marked with this flag, it will not be issued before previously submitted commands have completed, and commands submitted after the drain will not be issued before the drain has started. If there are no pending commands, setting this flag does not change how the command is issued.
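From userspace this looks like the following sketch, using the in-tree liburing helpers (ring and fd setup plus error handling omitted):

    struct io_uring_sqe *sqe;

    /* an fsync that is only issued once every previously submitted
     * SQE has completed */
    sqe = io_uring_get_sqe(&ring);
    io_uring_prep_fsync(sqe, fd, 0);
    sqe->flags |= IOSQE_IO_DRAIN;
    io_uring_submit(&ring);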
Signed-off-by: Jens Axboe axboe@kernel.dk Signed-off-by: Zhihao Cheng chengzhihao1@huawei.com Signed-off-by: yangerkun yangerkun@huawei.com Reviewed-by: zhangyi (F) yi.zhang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- fs/io_uring.c | 91 +++++++++++++++++++++++++++++++++-- include/uapi/linux/io_uring.h | 1 + 2 files changed, 89 insertions(+), 3 deletions(-)
diff --git a/fs/io_uring.c b/fs/io_uring.c index ae1d4793013b..e10adb340c26 100644 --- a/fs/io_uring.c +++ b/fs/io_uring.c @@ -222,6 +222,8 @@ struct io_ring_ctx { unsigned sq_mask; unsigned sq_thread_idle; struct io_uring_sqe *sq_sqes; + + struct list_head defer_list; } ____cacheline_aligned_in_smp;
/* IO offload */ @@ -327,8 +329,11 @@ struct io_kiocb { #define REQ_F_FIXED_FILE 4 /* ctx owns file */ #define REQ_F_SEQ_PREV 8 /* sequential with previous */ #define REQ_F_PREPPED 16 /* prep already done */ +#define REQ_F_IO_DRAIN 32 /* drain existing IO first */ +#define REQ_F_IO_DRAINED 64 /* drain done */ u64 user_data; - u64 error; + u32 error; + u32 sequence;
struct work_struct work; }; @@ -356,6 +361,8 @@ struct io_submit_state { unsigned int ios_left; };
+static void io_sq_wq_submit_work(struct work_struct *work); + static struct kmem_cache *req_cachep;
static const struct file_operations io_uring_fops; @@ -407,10 +414,36 @@ static struct io_ring_ctx *io_ring_ctx_alloc(struct io_uring_params *p) spin_lock_init(&ctx->completion_lock); INIT_LIST_HEAD(&ctx->poll_list); INIT_LIST_HEAD(&ctx->cancel_list); + INIT_LIST_HEAD(&ctx->defer_list); return ctx; }
-static void io_commit_cqring(struct io_ring_ctx *ctx) +static inline bool io_sequence_defer(struct io_ring_ctx *ctx, + struct io_kiocb *req) +{ + if ((req->flags & (REQ_F_IO_DRAIN|REQ_F_IO_DRAINED)) != REQ_F_IO_DRAIN) + return false; + + return req->sequence > ctx->cached_cq_tail + ctx->sq_ring->dropped; +} + +static struct io_kiocb *io_get_deferred_req(struct io_ring_ctx *ctx) +{ + struct io_kiocb *req; + + if (list_empty(&ctx->defer_list)) + return NULL; + + req = list_first_entry(&ctx->defer_list, struct io_kiocb, list); + if (!io_sequence_defer(ctx, req)) { + list_del_init(&req->list); + return req; + } + + return NULL; +} + +static void __io_commit_cqring(struct io_ring_ctx *ctx) { struct io_cq_ring *ring = ctx->cq_ring;
@@ -425,6 +458,18 @@ static void io_commit_cqring(struct io_ring_ctx *ctx) } }
+static void io_commit_cqring(struct io_ring_ctx *ctx) +{ + struct io_kiocb *req; + + __io_commit_cqring(ctx); + + while ((req = io_get_deferred_req(ctx)) != NULL) { + req->flags |= REQ_F_IO_DRAINED; + queue_work(ctx->sqo_wq, &req->work); + } +} + static struct io_uring_cqe *io_get_cqring(struct io_ring_ctx *ctx) { struct io_cq_ring *ring = ctx->cq_ring; @@ -1434,6 +1479,34 @@ static int io_poll_add(struct io_kiocb *req, const struct io_uring_sqe *sqe) return ipt.error; }
+static int io_req_defer(struct io_ring_ctx *ctx, struct io_kiocb *req, + const struct io_uring_sqe *sqe) +{ + struct io_uring_sqe *sqe_copy; + + if (!io_sequence_defer(ctx, req) && list_empty(&ctx->defer_list)) + return 0; + + sqe_copy = kmalloc(sizeof(*sqe_copy), GFP_KERNEL); + if (!sqe_copy) + return -EAGAIN; + + spin_lock_irq(&ctx->completion_lock); + if (!io_sequence_defer(ctx, req) && list_empty(&ctx->defer_list)) { + spin_unlock_irq(&ctx->completion_lock); + kfree(sqe_copy); + return 0; + } + + memcpy(sqe_copy, sqe, sizeof(*sqe_copy)); + req->submit.sqe = sqe_copy; + + INIT_WORK(&req->work, io_sq_wq_submit_work); + list_add_tail(&req->list, &ctx->defer_list); + spin_unlock_irq(&ctx->completion_lock); + return -EIOCBQUEUED; +} + static int __io_submit_sqe(struct io_ring_ctx *ctx, struct io_kiocb *req, const struct sqe_submit *s, bool force_nonblock) { @@ -1681,6 +1754,11 @@ static int io_req_set_file(struct io_ring_ctx *ctx, const struct sqe_submit *s, flags = READ_ONCE(s->sqe->flags); fd = READ_ONCE(s->sqe->fd);
+ if (flags & IOSQE_IO_DRAIN) { + req->flags |= REQ_F_IO_DRAIN; + req->sequence = ctx->cached_sq_head - 1; + } + if (!io_op_needs_file(s->sqe)) { req->file = NULL; return 0; @@ -1710,7 +1788,7 @@ static int io_submit_sqe(struct io_ring_ctx *ctx, struct sqe_submit *s, int ret;
/* enforce forwards compatibility on users */ - if (unlikely(s->sqe->flags & ~IOSQE_FIXED_FILE)) + if (unlikely(s->sqe->flags & ~(IOSQE_FIXED_FILE | IOSQE_IO_DRAIN))) return -EINVAL;
req = io_get_req(ctx, state); @@ -1721,6 +1799,13 @@ static int io_submit_sqe(struct io_ring_ctx *ctx, struct sqe_submit *s, if (unlikely(ret)) goto out;
+ ret = io_req_defer(ctx, req, s->sqe); + if (ret) { + if (ret == -EIOCBQUEUED) + ret = 0; + return ret; + } + ret = __io_submit_sqe(ctx, req, s, true); if (ret == -EAGAIN && !(req->flags & REQ_F_NOWAIT)) { struct io_uring_sqe *sqe_copy; diff --git a/include/uapi/linux/io_uring.h b/include/uapi/linux/io_uring.h index e23408692118..a7a6384d0c70 100644 --- a/include/uapi/linux/io_uring.h +++ b/include/uapi/linux/io_uring.h @@ -38,6 +38,7 @@ struct io_uring_sqe { * sqe->flags */ #define IOSQE_FIXED_FILE (1U << 0) /* use fixed fileset */ +#define IOSQE_IO_DRAIN (1U << 1) /* issue after inflight IO */
/* * io_uring_setup() flags
From: Jens Axboe axboe@kernel.dk
mainline inclusion from mainline-5.2-rc1 commit 5d17b4a4b7fa172b205be8a05051ae705d1dc3bb category: feature bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27 CVE: NA ---------------------------
This behaves just like sync_file_range(2) does.
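The in-tree liburing has no prep helper for this opcode yet, so a sketch has to fill in the raw SQE (ring and fd setup plus error handling omitted; the offset and length values are arbitrary):

    struct io_uring_sqe *sqe = io_uring_get_sqe(&ring);

    memset(sqe, 0, sizeof(*sqe));
    sqe->opcode = IORING_OP_SYNC_FILE_RANGE;
    sqe->fd  = fd;
    sqe->off = 0;           /* byte offset into the file */
    sqe->len = 65536;       /* nbytes; 0 means out to EOF */
    sqe->sync_range_flags = SYNC_FILE_RANGE_WRITE;
    io_uring_submit(&ring);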
Signed-off-by: Jens Axboe axboe@kernel.dk Signed-off-by: Zhihao Cheng chengzhihao1@huawei.com Signed-off-by: yangerkun yangerkun@huawei.com Reviewed-by: zhangyi (F) yi.zhang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- fs/io_uring.c | 51 +++++++++++++++++++++++++++++++++++ include/uapi/linux/io_uring.h | 2 ++ 2 files changed, 53 insertions(+)
diff --git a/fs/io_uring.c b/fs/io_uring.c index e10adb340c26..b61e9838d34a 100644 --- a/fs/io_uring.c +++ b/fs/io_uring.c @@ -1264,6 +1264,54 @@ static int io_fsync(struct io_kiocb *req, const struct io_uring_sqe *sqe, return 0; }
+static int io_prep_sfr(struct io_kiocb *req, const struct io_uring_sqe *sqe) +{ + struct io_ring_ctx *ctx = req->ctx; + int ret = 0; + + if (!req->file) + return -EBADF; + /* Prep already done (EAGAIN retry) */ + if (req->flags & REQ_F_PREPPED) + return 0; + + if (unlikely(ctx->flags & IORING_SETUP_IOPOLL)) + return -EINVAL; + if (unlikely(sqe->addr || sqe->ioprio || sqe->buf_index)) + return -EINVAL; + + req->flags |= REQ_F_PREPPED; + return ret; +} + +static int io_sync_file_range(struct io_kiocb *req, + const struct io_uring_sqe *sqe, + bool force_nonblock) +{ + loff_t sqe_off; + loff_t sqe_len; + unsigned flags; + int ret; + + ret = io_prep_sfr(req, sqe); + if (ret) + return ret; + + /* sync_file_range always requires a blocking context */ + if (force_nonblock) + return -EAGAIN; + + sqe_off = READ_ONCE(sqe->off); + sqe_len = READ_ONCE(sqe->len); + flags = READ_ONCE(sqe->sync_range_flags); + + ret = sync_file_range(req->rw.ki_filp, sqe_off, sqe_len, flags); + + io_cqring_add_event(req->ctx, sqe->user_data, ret, 0); + io_put_req(req); + return 0; +} + static void io_poll_remove_one(struct io_kiocb *req) { struct io_poll_iocb *poll = &req->poll; @@ -1546,6 +1594,9 @@ static int __io_submit_sqe(struct io_ring_ctx *ctx, struct io_kiocb *req, case IORING_OP_POLL_REMOVE: ret = io_poll_remove(req, s->sqe); break; + case IORING_OP_SYNC_FILE_RANGE: + ret = io_sync_file_range(req, s->sqe, force_nonblock); + break; default: ret = -EINVAL; break; diff --git a/include/uapi/linux/io_uring.h b/include/uapi/linux/io_uring.h index a7a6384d0c70..e707a17c6908 100644 --- a/include/uapi/linux/io_uring.h +++ b/include/uapi/linux/io_uring.h @@ -26,6 +26,7 @@ struct io_uring_sqe { __kernel_rwf_t rw_flags; __u32 fsync_flags; __u16 poll_events; + __u32 sync_range_flags; }; __u64 user_data; /* data to be passed back at completion time */ union { @@ -55,6 +56,7 @@ struct io_uring_sqe { #define IORING_OP_WRITE_FIXED 5 #define IORING_OP_POLL_ADD 6 #define IORING_OP_POLL_REMOVE 7 +#define IORING_OP_SYNC_FILE_RANGE 8
/* * sqe->fsync_flags
From: Jens Axboe axboe@kernel.dk
mainline inclusion from mainline-5.2-rc1 commit 9b402849e80c85eee10bbd341aab3f1a0f942d4f category: feature bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27 CVE: NA ---------------------------
Allow registration of an eventfd, which will trigger an event every time a completion event happens for this io_uring instance.
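Sketched from the application side, with the raw register wrapper from tools/io_uring (ring_fd is assumed to be the fd returned by io_uring_setup(); error handling omitted):

    int efd = eventfd(0, 0);
    uint64_t count;

    io_uring_register(ring_fd, IORING_REGISTER_EVENTFD, &efd, 1);

    /* every posted CQE now signals efd; this blocks until at least
     * one completion has been posted */
    read(efd, &count, sizeof(count));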
Signed-off-by: Jens Axboe axboe@kernel.dk Signed-off-by: Zhihao Cheng chengzhihao1@huawei.com Signed-off-by: yangerkun yangerkun@huawei.com Reviewed-by: zhangyi (F) yi.zhang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- fs/io_uring.c | 48 +++++++++++++++++++++++++++++++++++ include/uapi/linux/io_uring.h | 2 ++ 2 files changed, 50 insertions(+)
diff --git a/fs/io_uring.c b/fs/io_uring.c index b61e9838d34a..723575f8a8c4 100644 --- a/fs/io_uring.c +++ b/fs/io_uring.c @@ -241,6 +241,7 @@ struct io_ring_ctx { unsigned cq_mask; struct wait_queue_head cq_wait; struct fasync_struct *cq_fasync; + struct eventfd_ctx *cq_ev_fd; } ____cacheline_aligned_in_smp;
/* @@ -516,6 +517,8 @@ static void io_cqring_ev_posted(struct io_ring_ctx *ctx) wake_up(&ctx->wait); if (waitqueue_active(&ctx->sqo_wait)) wake_up(&ctx->sqo_wait); + if (ctx->cq_ev_fd) + eventfd_signal(ctx->cq_ev_fd, 1); }
static void io_cqring_add_event(struct io_ring_ctx *ctx, u64 user_data, @@ -2754,6 +2757,38 @@ static int io_sqe_buffer_register(struct io_ring_ctx *ctx, void __user *arg, return ret; }
+static int io_eventfd_register(struct io_ring_ctx *ctx, void __user *arg) +{ + __s32 __user *fds = arg; + int fd; + + if (ctx->cq_ev_fd) + return -EBUSY; + + if (copy_from_user(&fd, fds, sizeof(*fds))) + return -EFAULT; + + ctx->cq_ev_fd = eventfd_ctx_fdget(fd); + if (IS_ERR(ctx->cq_ev_fd)) { + int ret = PTR_ERR(ctx->cq_ev_fd); + ctx->cq_ev_fd = NULL; + return ret; + } + + return 0; +} + +static int io_eventfd_unregister(struct io_ring_ctx *ctx) +{ + if (ctx->cq_ev_fd) { + eventfd_ctx_put(ctx->cq_ev_fd); + ctx->cq_ev_fd = NULL; + return 0; + } + + return -ENXIO; +} + static void io_ring_ctx_free(struct io_ring_ctx *ctx) { io_finish_async(ctx); @@ -2763,6 +2798,7 @@ static void io_ring_ctx_free(struct io_ring_ctx *ctx) io_iopoll_reap_events(ctx); io_sqe_buffer_unregister(ctx); io_sqe_files_unregister(ctx); + io_eventfd_unregister(ctx);
#if defined(CONFIG_UNIX) if (ctx->ring_sock) @@ -3176,6 +3212,18 @@ static int __io_uring_register(struct io_ring_ctx *ctx, unsigned opcode, break; ret = io_sqe_files_unregister(ctx); break; + case IORING_REGISTER_EVENTFD: + ret = -EINVAL; + if (nr_args != 1) + break; + ret = io_eventfd_register(ctx, arg); + break; + case IORING_UNREGISTER_EVENTFD: + ret = -EINVAL; + if (arg || nr_args) + break; + ret = io_eventfd_unregister(ctx); + break; default: ret = -EINVAL; break; diff --git a/include/uapi/linux/io_uring.h b/include/uapi/linux/io_uring.h index e707a17c6908..a0c460025036 100644 --- a/include/uapi/linux/io_uring.h +++ b/include/uapi/linux/io_uring.h @@ -136,5 +136,7 @@ struct io_uring_params { #define IORING_UNREGISTER_BUFFERS 1 #define IORING_REGISTER_FILES 2 #define IORING_UNREGISTER_FILES 3 +#define IORING_REGISTER_EVENTFD 4 +#define IORING_UNREGISTER_EVENTFD 5
#endif
From: Stefan Bühler source@stbuehler.de
mainline inclusion from mainline-5.2-rc1 commit 5dcf877fb13f3c6a8ba0777ef766c4af32df725d category: feature bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27 CVE: NA ---------------------------
No need to set it in io_poll_add; io_poll_complete doesn't use it to set the result in the CQE.
Signed-off-by: Stefan Bühler source@stbuehler.de Signed-off-by: Jens Axboe axboe@kernel.dk Signed-off-by: Zhihao Cheng chengzhihao1@huawei.com Signed-off-by: yangerkun yangerkun@huawei.com Reviewed-by: zhangyi (F) yi.zhang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- fs/io_uring.c | 3 +-- 1 file changed, 1 insertion(+), 2 deletions(-)
diff --git a/fs/io_uring.c b/fs/io_uring.c index 723575f8a8c4..39e89e13addd 100644 --- a/fs/io_uring.c +++ b/fs/io_uring.c @@ -333,7 +333,7 @@ struct io_kiocb { #define REQ_F_IO_DRAIN 32 /* drain existing IO first */ #define REQ_F_IO_DRAINED 64 /* drain done */ u64 user_data; - u32 error; + u32 error; /* iopoll result from callback */ u32 sequence;
struct work_struct work; @@ -1517,7 +1517,6 @@ static int io_poll_add(struct io_kiocb *req, const struct io_uring_sqe *sqe) spin_unlock(&poll->head->lock); } if (mask) { /* no async, we'd stolen it */ - req->error = mangle_poll(mask); ipt.error = 0; io_poll_complete(ctx, req, mask); }
From: Colin Ian King colin.king@canonical.com
mainline inclusion from mainline-5.2-rc1 commit efeb862bd5bc001636e690debf6f9fbba98e5bfd category: feature bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27 CVE: NA ---------------------------
Currently, variable ret is declared in a while-loop code block and shadows another variable ret. When an error occurs in the while-loop, the error return lands in the inner ret and the outer ret is never set, so the error check after the loop always tests the wrong ret variable; the check is therefore always true and a premature return occurs.
Fix this by removing the declaration of the inner while-loop variable ret so that shadowing does not occur.
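The shape of the bug, reduced to a hypothetical minimal example (do_work() is a stand-in):

    int ret = -EINVAL;              /* outer ret */
    while (left) {
            int ret;                /* shadows the outer ret */
            ret = do_work();        /* error lands in the inner ret... */
            if (ret)
                    break;          /* ...and is lost on loop exit */
    }
    if (ret)                        /* tests the stale outer value: */
            return ret;             /* always true, premature return */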
Addresses-Coverity: ("'Constant' variable guards dead code") Fixes: 6b06314c47e1 ("io_uring: add file set registration") Signed-off-by: Colin Ian King colin.king@canonical.com Signed-off-by: Jens Axboe axboe@kernel.dk Signed-off-by: Zhihao Cheng chengzhihao1@huawei.com Signed-off-by: yangerkun yangerkun@huawei.com Reviewed-by: zhangyi (F) yi.zhang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- fs/io_uring.c | 1 - 1 file changed, 1 deletion(-)
diff --git a/fs/io_uring.c b/fs/io_uring.c index 39e89e13addd..27d0e4ed6f21 100644 --- a/fs/io_uring.c +++ b/fs/io_uring.c @@ -2360,7 +2360,6 @@ static int io_sqe_files_scm(struct io_ring_ctx *ctx) left = ctx->nr_user_files; while (left) { unsigned this_files = min_t(unsigned, left, SCM_MAX_FD); - int ret;
ret = __io_sqe_files_scm(ctx, this_files, total); if (ret)
From: Oleg Nesterov oleg@redhat.com
mainline inclusion from mainline-5.3-rc1 commit 8cf8b5539a414da3257db6d121bcee2d883135cb category: feature bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27 CVE: NA ---------------------------
do_poll() returns -EINTR if interrupted and after that all its callers have to translate it into -ERESTARTNOHAND. Change do_poll() to return -ERESTARTNOHAND and update (simplify) the callers.
Note that this also unifies all users of restore_saved_sigmask_unless(); see the next patch.
Linus:
: The *right* return value will actually be then chosen by : poll_select_copy_remaining(), which will turn ERESTARTNOHAND to EINTR : when it can't update the timeout. : : Except for the cases that use restart_block and do that instead and : don't have the whole timeout restart issue as a result.
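Put as (hypothetical) code, the resulting division of labour is roughly:

    /* do_poll(): report the restartable error directly */
    if (signal_pending(current))
            count = -ERESTARTNOHAND;

    /* poll_select_copy_remaining(): keep -ERESTARTNOHAND when the
     * remaining timeout was written back, otherwise fall back to
     * -EINTR (copy_failed is a stand-in for that condition) */
    if (ret == -ERESTARTNOHAND && copy_failed)
            ret = -EINTR;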
Link: http://lkml.kernel.org/r/20190606140852.GB13440@redhat.com Signed-off-by: Oleg Nesterov oleg@redhat.com Acked-by: Linus Torvalds torvalds@linux-foundation.org Cc: Al Viro viro@ZenIV.linux.org.uk Cc: Arnd Bergmann arnd@arndb.de Cc: David Laight David.Laight@aculab.com Cc: Davidlohr Bueso dave@stgolabs.net Cc: Deepa Dinamani deepa.kernel@gmail.com Cc: Eric W. Biederman ebiederm@xmission.com Cc: Eric Wong e@80x24.org Cc: Jason Baron jbaron@akamai.com Cc: Jens Axboe axboe@kernel.dk Cc: Thomas Gleixner tglx@linutronix.de Signed-off-by: Andrew Morton akpm@linux-foundation.org Signed-off-by: Linus Torvalds torvalds@linux-foundation.org Signed-off-by: Zhihao Cheng chengzhihao1@huawei.com
Conflicts: fs/select.c
Signed-off-by: yangerkun yangerkun@huawei.com Reviewed-by: zhangyi (F) yi.zhang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- fs/select.c | 30 +++++++----------------------- 1 file changed, 7 insertions(+), 23 deletions(-)
diff --git a/fs/select.c b/fs/select.c index baed50c60083..bf2395de6437 100644 --- a/fs/select.c +++ b/fs/select.c @@ -925,7 +925,7 @@ static int do_poll(struct poll_list *list, struct poll_wqueues *wait, if (!count) { count = wait->error; if (signal_pending(current)) - count = -EINTR; + count = -ERESTARTNOHAND; } if (count || timed_out) break; @@ -1040,7 +1040,7 @@ static long do_restart_poll(struct restart_block *restart_block)
ret = do_sys_poll(ufds, nfds, to);
- if (ret == -EINTR) + if (ret == -ERESTARTNOHAND) ret = set_restart_fn(restart_block, do_restart_poll);
return ret; @@ -1060,7 +1060,7 @@ SYSCALL_DEFINE3(poll, struct pollfd __user *, ufds, unsigned int, nfds,
ret = do_sys_poll(ufds, nfds, to);
- if (ret == -EINTR) { + if (ret == -ERESTARTNOHAND) { struct restart_block *restart_block;
restart_block = ¤t->restart_block; @@ -1100,11 +1100,7 @@ SYSCALL_DEFINE5(ppoll, struct pollfd __user *, ufds, unsigned int, nfds, return ret;
ret = do_sys_poll(ufds, nfds, to); - - restore_saved_sigmask_unless(ret == -EINTR); - /* We can restart this syscall, usually */ - if (ret == -EINTR) - ret = -ERESTARTNOHAND; + restore_saved_sigmask_unless(ret == -ERESTARTNOHAND); ret = poll_select_copy_remaining(&end_time, tsp, PT_TIMESPEC, ret);
return ret; @@ -1133,11 +1129,7 @@ SYSCALL_DEFINE5(ppoll_time32, struct pollfd __user *, ufds, unsigned int, nfds, return ret;
ret = do_sys_poll(ufds, nfds, to); - - restore_saved_sigmask_unless(ret == -EINTR); - /* We can restart this syscall, usually */ - if (ret == -EINTR) - ret = -ERESTARTNOHAND; + restore_saved_sigmask_unless(ret == -ERESTARTNOHAND); ret = poll_select_copy_remaining(&end_time, tsp, PT_OLD_TIMESPEC, ret);
return ret; @@ -1411,11 +1403,7 @@ COMPAT_SYSCALL_DEFINE5(ppoll, struct pollfd __user *, ufds, return ret;
ret = do_sys_poll(ufds, nfds, to); - - restore_saved_sigmask_unless(ret == -EINTR); - /* We can restart this syscall, usually */ - if (ret == -EINTR) - ret = -ERESTARTNOHAND; + restore_saved_sigmask_unless(ret == -ERESTARTNOHAND); ret = poll_select_copy_remaining(&end_time, tsp, PT_OLD_TIMESPEC, ret);
return ret; @@ -1444,11 +1432,7 @@ COMPAT_SYSCALL_DEFINE5(ppoll_time64, struct pollfd __user *, ufds, return ret;
ret = do_sys_poll(ufds, nfds, to); - - restore_saved_sigmask_unless(ret == -EINTR); - /* We can restart this syscall, usually */ - if (ret == -EINTR) - ret = -ERESTARTNOHAND; + restore_saved_sigmask_unless(ret == -ERESTARTNOHAND); ret = poll_select_copy_remaining(&end_time, tsp, PT_TIMESPEC, ret);
return ret;
From: Oleg Nesterov oleg@redhat.com
mainline inclusion from mainline-5.3-rc1 commit ac301020627e258a304f40cab5b35b6814a6f033 category: feature bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27 CVE: NA ---------------------------
Now that restore_saved_sigmask_unless() is always called with the same argument right before poll_select_copy_remaining(), we can move it into poll_select_copy_remaining() and make that the only caller of restore_saved_sigmask_unless() in fs/select.c.
The patch also renames poll_select_copy_remaining() to poll_select_finish(), which reads better after this change.
kern_select() doesn't use set_user_sigmask(), so in this case poll_select_finish() does restore_saved_sigmask_unless() "for no reason". But this won't hurt, and WARN_ON(!TIF_SIGPENDING) is still valid.
Link: http://lkml.kernel.org/r/20190606140915.GC13440@redhat.com Signed-off-by: Oleg Nesterov oleg@redhat.com Cc: Al Viro viro@ZenIV.linux.org.uk Cc: Arnd Bergmann arnd@arndb.de Cc: David Laight David.Laight@aculab.com Cc: Davidlohr Bueso dave@stgolabs.net Cc: Deepa Dinamani deepa.kernel@gmail.com Cc: Eric W. Biederman ebiederm@xmission.com Cc: Eric Wong e@80x24.org Cc: Jason Baron jbaron@akamai.com Cc: Jens Axboe axboe@kernel.dk Cc: Thomas Gleixner tglx@linutronix.de Signed-off-by: Andrew Morton akpm@linux-foundation.org Signed-off-by: Linus Torvalds torvalds@linux-foundation.org Signed-off-by: Zhihao Cheng chengzhihao1@huawei.com Signed-off-by: yangerkun yangerkun@huawei.com Reviewed-by: zhangyi (F) yi.zhang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- fs/select.c | 46 +++++++++++++--------------------------------- 1 file changed, 13 insertions(+), 33 deletions(-)
diff --git a/fs/select.c b/fs/select.c index bf2395de6437..b684f0dd6db8 100644 --- a/fs/select.c +++ b/fs/select.c @@ -294,12 +294,14 @@ enum poll_time_type { PT_OLD_TIMESPEC = 3, };
-static int poll_select_copy_remaining(struct timespec64 *end_time, - void __user *p, - enum poll_time_type pt_type, int ret) +static int poll_select_finish(struct timespec64 *end_time, + void __user *p, + enum poll_time_type pt_type, int ret) { struct timespec64 rts;
+ restore_saved_sigmask_unless(ret == -ERESTARTNOHAND); + if (!p) return ret;
@@ -714,9 +716,7 @@ static int kern_select(int n, fd_set __user *inp, fd_set __user *outp, }
ret = core_sys_select(n, inp, outp, exp, to); - ret = poll_select_copy_remaining(&end_time, tvp, PT_TIMEVAL, ret); - - return ret; + return poll_select_finish(&end_time, tvp, PT_TIMEVAL, ret); }
SYSCALL_DEFINE5(select, int, n, fd_set __user *, inp, fd_set __user *, outp, @@ -757,10 +757,7 @@ static long do_pselect(int n, fd_set __user *inp, fd_set __user *outp, return ret;
ret = core_sys_select(n, inp, outp, exp, to); - restore_saved_sigmask_unless(ret == -ERESTARTNOHAND); - ret = poll_select_copy_remaining(&end_time, tsp, type, ret); - - return ret; + return poll_select_finish(&end_time, tsp, type, ret); }
/* @@ -1100,10 +1097,7 @@ SYSCALL_DEFINE5(ppoll, struct pollfd __user *, ufds, unsigned int, nfds, return ret;
ret = do_sys_poll(ufds, nfds, to); - restore_saved_sigmask_unless(ret == -ERESTARTNOHAND); - ret = poll_select_copy_remaining(&end_time, tsp, PT_TIMESPEC, ret); - - return ret; + return poll_select_finish(&end_time, tsp, PT_TIMESPEC, ret); }
#if defined(CONFIG_COMPAT_32BIT_TIME) && !defined(CONFIG_64BIT) @@ -1129,10 +1123,7 @@ SYSCALL_DEFINE5(ppoll_time32, struct pollfd __user *, ufds, unsigned int, nfds, return ret;
ret = do_sys_poll(ufds, nfds, to); - restore_saved_sigmask_unless(ret == -ERESTARTNOHAND); - ret = poll_select_copy_remaining(&end_time, tsp, PT_OLD_TIMESPEC, ret); - - return ret; + return poll_select_finish(&end_time, tsp, PT_OLD_TIMESPEC, ret); } #endif
@@ -1269,9 +1260,7 @@ static int do_compat_select(int n, compat_ulong_t __user *inp, }
ret = compat_core_sys_select(n, inp, outp, exp, to); - ret = poll_select_copy_remaining(&end_time, tvp, PT_OLD_TIMEVAL, ret); - - return ret; + return poll_select_finish(&end_time, tvp, PT_OLD_TIMEVAL, ret); }
COMPAT_SYSCALL_DEFINE5(select, int, n, compat_ulong_t __user *, inp, @@ -1331,10 +1320,7 @@ static long do_compat_pselect(int n, compat_ulong_t __user *inp, return ret;
ret = compat_core_sys_select(n, inp, outp, exp, to); - restore_saved_sigmask_unless(ret == -ERESTARTNOHAND); - ret = poll_select_copy_remaining(&end_time, tsp, type, ret); - - return ret; + return poll_select_finish(&end_time, tsp, type, ret); }
COMPAT_SYSCALL_DEFINE6(pselect6_time64, int, n, compat_ulong_t __user *, inp, @@ -1403,10 +1389,7 @@ COMPAT_SYSCALL_DEFINE5(ppoll, struct pollfd __user *, ufds, return ret;
ret = do_sys_poll(ufds, nfds, to); - restore_saved_sigmask_unless(ret == -ERESTARTNOHAND); - ret = poll_select_copy_remaining(&end_time, tsp, PT_OLD_TIMESPEC, ret); - - return ret; + return poll_select_finish(&end_time, tsp, PT_OLD_TIMESPEC, ret); } #endif
@@ -1432,10 +1415,7 @@ COMPAT_SYSCALL_DEFINE5(ppoll_time64, struct pollfd __user *, ufds, return ret;
ret = do_sys_poll(ufds, nfds, to); - restore_saved_sigmask_unless(ret == -ERESTARTNOHAND); - ret = poll_select_copy_remaining(&end_time, tsp, PT_TIMESPEC, ret); - - return ret; + return poll_select_finish(&end_time, tsp, PT_TIMESPEC, ret); }
#endif
From: Jens Axboe axboe@kernel.dk
mainline inclusion from mainline-5.2-rc2 commit 486f069253c3c738dec62daeb16f7232b2cca065 category: feature bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27 CVE: NA ---------------------------
Currently fails with:
io_uring-bench.o: In function `main': /home/axboe/git/linux-block/tools/io_uring/io_uring-bench.c:560: undefined reference to `pthread_create' /home/axboe/git/linux-block/tools/io_uring/io_uring-bench.c:588: undefined reference to `pthread_join' collect2: error: ld returned 1 exit status Makefile:11: recipe for target 'io_uring-bench' failed make: *** [io_uring-bench] Error 1
Move -lpthread to the end: the linker only resolves symbols from libraries that appear after the object files referencing them.
Signed-off-by: Jens Axboe axboe@kernel.dk Signed-off-by: Zhihao Cheng chengzhihao1@huawei.com Signed-off-by: yangerkun yangerkun@huawei.com Reviewed-by: zhangyi (F) yi.zhang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- tools/io_uring/Makefile | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/tools/io_uring/Makefile b/tools/io_uring/Makefile index f79522fc37b5..00f146c54c53 100644 --- a/tools/io_uring/Makefile +++ b/tools/io_uring/Makefile @@ -8,7 +8,7 @@ all: io_uring-cp io_uring-bench $(CC) $(CFLAGS) -o $@ $^
io_uring-bench: syscall.o io_uring-bench.o - $(CC) $(CFLAGS) $(LDLIBS) -o $@ $^ + $(CC) $(CFLAGS) -o $@ $^ $(LDLIBS)
io_uring-cp: setup.o syscall.o queue.o
From: Jens Axboe axboe@kernel.dk
mainline inclusion from mainline-5.2-rc2 commit 004d564f908790efe815a6510a542ac1227ef2a2 category: feature bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27 CVE: NA ---------------------------
Various fixes and changes have been applied to liburing since we copied some select bits to the kernel testing/examples part; sync up with liburing to pick up those changes.
Most notable is the change that splits CQE reading into separate peek and seen steps, instead of a single function. This also fixes an unsigned wrap issue in io_uring_submit(), a leak of 'fd' in setup if we fail, and various other little issues.
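The consumer-side loop after this change, as a minimal sketch (handle() is a stand-in for application logic):

    struct io_uring_cqe *cqe;
    int ret;

    ret = io_uring_wait_cqe(&ring, &cqe);  /* or io_uring_peek_cqe() */
    if (ret < 0)
            return ret;

    handle(io_uring_cqe_get_data(cqe), cqe->res);
    io_uring_cqe_seen(&ring, cqe);         /* advances the CQ ring head */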
Signed-off-by: Jens Axboe axboe@kernel.dk Signed-off-by: Zhihao Cheng chengzhihao1@huawei.com Signed-off-by: yangerkun yangerkun@huawei.com Reviewed-by: zhangyi (F) yi.zhang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- tools/io_uring/io_uring-cp.c | 21 ++++++++---- tools/io_uring/liburing.h | 64 +++++++++++++++++++++++++++++------- tools/io_uring/queue.c | 36 ++++++++------------ tools/io_uring/setup.c | 10 ++++-- tools/io_uring/syscall.c | 48 +++++++++++++++++---------- 5 files changed, 118 insertions(+), 61 deletions(-)
diff --git a/tools/io_uring/io_uring-cp.c b/tools/io_uring/io_uring-cp.c index 633f65bb43a7..81461813ec62 100644 --- a/tools/io_uring/io_uring-cp.c +++ b/tools/io_uring/io_uring-cp.c @@ -13,6 +13,7 @@ #include <assert.h> #include <errno.h> #include <inttypes.h> +#include <sys/types.h> #include <sys/stat.h> #include <sys/ioctl.h>
@@ -85,11 +86,16 @@ static int queue_read(struct io_uring *ring, off_t size, off_t offset) struct io_uring_sqe *sqe; struct io_data *data;
+ data = malloc(size + sizeof(*data)); + if (!data) + return 1; + sqe = io_uring_get_sqe(ring); - if (!sqe) + if (!sqe) { + free(data); return 1; + }
- data = malloc(size + sizeof(*data)); data->read = 1; data->offset = data->first_offset = offset;
@@ -166,22 +172,23 @@ static int copy_file(struct io_uring *ring, off_t insize) struct io_data *data;
if (!got_comp) { - ret = io_uring_wait_completion(ring, &cqe); + ret = io_uring_wait_cqe(ring, &cqe); got_comp = 1; } else - ret = io_uring_get_completion(ring, &cqe); + ret = io_uring_peek_cqe(ring, &cqe); if (ret < 0) { - fprintf(stderr, "io_uring_get_completion: %s\n", + fprintf(stderr, "io_uring_peek_cqe: %s\n", strerror(-ret)); return 1; } if (!cqe) break;
- data = (struct io_data *) (uintptr_t) cqe->user_data; + data = io_uring_cqe_get_data(cqe); if (cqe->res < 0) { if (cqe->res == -EAGAIN) { queue_prepped(ring, data); + io_uring_cqe_seen(ring, cqe); continue; } fprintf(stderr, "cqe failed: %s\n", @@ -193,6 +200,7 @@ static int copy_file(struct io_uring *ring, off_t insize) data->iov.iov_len -= cqe->res; data->offset += cqe->res; queue_prepped(ring, data); + io_uring_cqe_seen(ring, cqe); continue; }
@@ -209,6 +217,7 @@ static int copy_file(struct io_uring *ring, off_t insize) free(data); writes--; } + io_uring_cqe_seen(ring, cqe); } }
diff --git a/tools/io_uring/liburing.h b/tools/io_uring/liburing.h index cab0f50257ba..5f305c86b892 100644 --- a/tools/io_uring/liburing.h +++ b/tools/io_uring/liburing.h @@ -1,10 +1,16 @@ #ifndef LIB_URING_H #define LIB_URING_H
+#ifdef __cplusplus +extern "C" { +#endif + #include <sys/uio.h> #include <signal.h> #include <string.h> #include "../../include/uapi/linux/io_uring.h" +#include <inttypes.h> +#include "barrier.h"
/* * Library interface to io_uring @@ -46,7 +52,7 @@ struct io_uring { * System calls */ extern int io_uring_setup(unsigned entries, struct io_uring_params *p); -extern int io_uring_enter(unsigned fd, unsigned to_submit, +extern int io_uring_enter(int fd, unsigned to_submit, unsigned min_complete, unsigned flags, sigset_t *sig); extern int io_uring_register(int fd, unsigned int opcode, void *arg, unsigned int nr_args); @@ -59,13 +65,32 @@ extern int io_uring_queue_init(unsigned entries, struct io_uring *ring, extern int io_uring_queue_mmap(int fd, struct io_uring_params *p, struct io_uring *ring); extern void io_uring_queue_exit(struct io_uring *ring); -extern int io_uring_get_completion(struct io_uring *ring, +extern int io_uring_peek_cqe(struct io_uring *ring, struct io_uring_cqe **cqe_ptr); -extern int io_uring_wait_completion(struct io_uring *ring, +extern int io_uring_wait_cqe(struct io_uring *ring, struct io_uring_cqe **cqe_ptr); extern int io_uring_submit(struct io_uring *ring); extern struct io_uring_sqe *io_uring_get_sqe(struct io_uring *ring);
+/* + * Must be called after io_uring_{peek,wait}_cqe() after the cqe has + * been processed by the application. + */ +static inline void io_uring_cqe_seen(struct io_uring *ring, + struct io_uring_cqe *cqe) +{ + if (cqe) { + struct io_uring_cq *cq = &ring->cq; + + (*cq->khead)++; + /* + * Ensure that the kernel sees our new head, the kernel has + * the matching read barrier. + */ + write_barrier(); + } +} + /* * Command prep helpers */ @@ -74,8 +99,14 @@ static inline void io_uring_sqe_set_data(struct io_uring_sqe *sqe, void *data) sqe->user_data = (unsigned long) data; }
+static inline void *io_uring_cqe_get_data(struct io_uring_cqe *cqe) +{ + return (void *) (uintptr_t) cqe->user_data; +} + static inline void io_uring_prep_rw(int op, struct io_uring_sqe *sqe, int fd, - void *addr, unsigned len, off_t offset) + const void *addr, unsigned len, + off_t offset) { memset(sqe, 0, sizeof(*sqe)); sqe->opcode = op; @@ -86,8 +117,8 @@ static inline void io_uring_prep_rw(int op, struct io_uring_sqe *sqe, int fd, }
static inline void io_uring_prep_readv(struct io_uring_sqe *sqe, int fd, - struct iovec *iovecs, unsigned nr_vecs, - off_t offset) + const struct iovec *iovecs, + unsigned nr_vecs, off_t offset) { io_uring_prep_rw(IORING_OP_READV, sqe, fd, iovecs, nr_vecs, offset); } @@ -100,14 +131,14 @@ static inline void io_uring_prep_read_fixed(struct io_uring_sqe *sqe, int fd, }
static inline void io_uring_prep_writev(struct io_uring_sqe *sqe, int fd, - struct iovec *iovecs, unsigned nr_vecs, - off_t offset) + const struct iovec *iovecs, + unsigned nr_vecs, off_t offset) { io_uring_prep_rw(IORING_OP_WRITEV, sqe, fd, iovecs, nr_vecs, offset); }
static inline void io_uring_prep_write_fixed(struct io_uring_sqe *sqe, int fd, - void *buf, unsigned nbytes, + const void *buf, unsigned nbytes, off_t offset) { io_uring_prep_rw(IORING_OP_WRITE_FIXED, sqe, fd, buf, nbytes, offset); @@ -131,13 +162,22 @@ static inline void io_uring_prep_poll_remove(struct io_uring_sqe *sqe, }
static inline void io_uring_prep_fsync(struct io_uring_sqe *sqe, int fd, - int datasync) + unsigned fsync_flags) { memset(sqe, 0, sizeof(*sqe)); sqe->opcode = IORING_OP_FSYNC; sqe->fd = fd; - if (datasync) - sqe->fsync_flags = IORING_FSYNC_DATASYNC; + sqe->fsync_flags = fsync_flags; +} + +static inline void io_uring_prep_nop(struct io_uring_sqe *sqe) +{ + memset(sqe, 0, sizeof(*sqe)); + sqe->opcode = IORING_OP_NOP; +} + +#ifdef __cplusplus } +#endif
#endif diff --git a/tools/io_uring/queue.c b/tools/io_uring/queue.c index 88505e873ad9..321819c132c7 100644 --- a/tools/io_uring/queue.c +++ b/tools/io_uring/queue.c @@ -8,8 +8,8 @@ #include "liburing.h" #include "barrier.h"
-static int __io_uring_get_completion(struct io_uring *ring, - struct io_uring_cqe **cqe_ptr, int wait) +static int __io_uring_get_cqe(struct io_uring *ring, + struct io_uring_cqe **cqe_ptr, int wait) { struct io_uring_cq *cq = &ring->cq; const unsigned mask = *cq->kring_mask; @@ -39,34 +39,25 @@ static int __io_uring_get_completion(struct io_uring *ring, return -errno; } while (1);
- if (*cqe_ptr) { - *cq->khead = head + 1; - /* - * Ensure that the kernel sees our new head, the kernel has - * the matching read barrier. - */ - write_barrier(); - } - return 0; }
/* - * Return an IO completion, if one is readily available + * Return an IO completion, if one is readily available. Returns 0 with + * cqe_ptr filled in on success, -errno on failure. */ -int io_uring_get_completion(struct io_uring *ring, - struct io_uring_cqe **cqe_ptr) +int io_uring_peek_cqe(struct io_uring *ring, struct io_uring_cqe **cqe_ptr) { - return __io_uring_get_completion(ring, cqe_ptr, 0); + return __io_uring_get_cqe(ring, cqe_ptr, 0); }
/* - * Return an IO completion, waiting for it if necessary + * Return an IO completion, waiting for it if necessary. Returns 0 with + * cqe_ptr filled in on success, -errno on failure. */ -int io_uring_wait_completion(struct io_uring *ring, - struct io_uring_cqe **cqe_ptr) +int io_uring_wait_cqe(struct io_uring *ring, struct io_uring_cqe **cqe_ptr) { - return __io_uring_get_completion(ring, cqe_ptr, 1); + return __io_uring_get_cqe(ring, cqe_ptr, 1); }
/* @@ -78,7 +69,7 @@ int io_uring_submit(struct io_uring *ring) { struct io_uring_sq *sq = &ring->sq; const unsigned mask = *sq->kring_mask; - unsigned ktail, ktail_next, submitted; + unsigned ktail, ktail_next, submitted, to_submit; int ret;
/* @@ -100,7 +91,8 @@ int io_uring_submit(struct io_uring *ring) */ submitted = 0; ktail = ktail_next = *sq->ktail; - while (sq->sqe_head < sq->sqe_tail) { + to_submit = sq->sqe_tail - sq->sqe_head; + while (to_submit--) { ktail_next++; read_barrier();
@@ -136,7 +128,7 @@ int io_uring_submit(struct io_uring *ring) if (ret < 0) return -errno;
- return 0; + return ret; }
/* diff --git a/tools/io_uring/setup.c b/tools/io_uring/setup.c index 4da19a77132c..0b50fcd78520 100644 --- a/tools/io_uring/setup.c +++ b/tools/io_uring/setup.c @@ -27,7 +27,7 @@ static int io_uring_mmap(int fd, struct io_uring_params *p, sq->kdropped = ptr + p->sq_off.dropped; sq->array = ptr + p->sq_off.array;
- size = p->sq_entries * sizeof(struct io_uring_sqe), + size = p->sq_entries * sizeof(struct io_uring_sqe); sq->sqes = mmap(0, size, PROT_READ | PROT_WRITE, MAP_SHARED | MAP_POPULATE, fd, IORING_OFF_SQES); @@ -79,7 +79,7 @@ int io_uring_queue_mmap(int fd, struct io_uring_params *p, struct io_uring *ring int io_uring_queue_init(unsigned entries, struct io_uring *ring, unsigned flags) { struct io_uring_params p; - int fd; + int fd, ret;
memset(&p, 0, sizeof(p)); p.flags = flags; @@ -88,7 +88,11 @@ int io_uring_queue_init(unsigned entries, struct io_uring *ring, unsigned flags) if (fd < 0) return fd;
- return io_uring_queue_mmap(fd, &p, ring); + ret = io_uring_queue_mmap(fd, &p, ring); + if (ret) + close(fd); + + return ret; }
void io_uring_queue_exit(struct io_uring *ring) diff --git a/tools/io_uring/syscall.c b/tools/io_uring/syscall.c index 6b835e5c6a5b..b22e0aa54e9d 100644 --- a/tools/io_uring/syscall.c +++ b/tools/io_uring/syscall.c @@ -7,34 +7,46 @@ #include <signal.h> #include "liburing.h"
-#if defined(__x86_64) || defined(__i386__) -#ifndef __NR_sys_io_uring_setup -#define __NR_sys_io_uring_setup 425 -#endif -#ifndef __NR_sys_io_uring_enter -#define __NR_sys_io_uring_enter 426 -#endif -#ifndef __NR_sys_io_uring_register -#define __NR_sys_io_uring_register 427 -#endif -#else -#error "Arch not supported yet" +#ifdef __alpha__ +/* + * alpha is the only exception, all other architectures + * have common numbers for new system calls. + */ +# ifndef __NR_io_uring_setup +# define __NR_io_uring_setup 535 +# endif +# ifndef __NR_io_uring_enter +# define __NR_io_uring_enter 536 +# endif +# ifndef __NR_io_uring_register +# define __NR_io_uring_register 537 +# endif +#else /* !__alpha__ */ +# ifndef __NR_io_uring_setup +# define __NR_io_uring_setup 425 +# endif +# ifndef __NR_io_uring_enter +# define __NR_io_uring_enter 426 +# endif +# ifndef __NR_io_uring_register +# define __NR_io_uring_register 427 +# endif #endif
int io_uring_register(int fd, unsigned int opcode, void *arg, unsigned int nr_args) { - return syscall(__NR_sys_io_uring_register, fd, opcode, arg, nr_args); + return syscall(__NR_io_uring_register, fd, opcode, arg, nr_args); }
-int io_uring_setup(unsigned entries, struct io_uring_params *p) +int io_uring_setup(unsigned int entries, struct io_uring_params *p) { - return syscall(__NR_sys_io_uring_setup, entries, p); + return syscall(__NR_io_uring_setup, entries, p); }
-int io_uring_enter(unsigned fd, unsigned to_submit, unsigned min_complete, - unsigned flags, sigset_t *sig) +int io_uring_enter(int fd, unsigned int to_submit, unsigned int min_complete, + unsigned int flags, sigset_t *sig) { - return syscall(__NR_sys_io_uring_enter, fd, to_submit, min_complete, + return syscall(__NR_io_uring_enter, fd, to_submit, min_complete, flags, sig, _NSIG / 8); }
From: Pavel Begunkov asml.silence@gmail.com
mainline inclusion from mainline-5.2-rc3 commit a278682dad37fd2f8d2f30d8e84e376a856ab472 category: feature bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27 CVE: NA ---------------------------
If io_copy_iov() fails, it will break the loop and report success even though the operation only partially completed.
Signed-off-by: Pavel Begunkov asml.silence@gmail.com Signed-off-by: Jens Axboe axboe@kernel.dk Signed-off-by: Zhihao Cheng chengzhihao1@huawei.com Signed-off-by: yangerkun yangerkun@huawei.com Reviewed-by: zhangyi (F) yi.zhang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- fs/io_uring.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/fs/io_uring.c b/fs/io_uring.c index 62a73623601d..3229b34e0fe0 100644 --- a/fs/io_uring.c +++ b/fs/io_uring.c @@ -2609,7 +2609,7 @@ static int io_sqe_buffer_register(struct io_ring_ctx *ctx, void __user *arg,
ret = io_copy_iov(ctx, &iov, arg, i); if (ret) - break; + goto err;
/* * Don't impose further limits on the size and buffer
From: Eric Biggers ebiggers@google.com
mainline inclusion from mainline-5.2-rc5 commit 355e8d26f719c207aa2e00e6f3cfab3acf21769b category: feature bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27 CVE: NA ---------------------------
Opening and closing an io_uring instance leaks a UNIX domain socket inode. This is because the ->file of the io_uring instance's internal UNIX domain socket is set to point to the io_uring file, but then sock_release() sees the non-NULL ->file and assumes the inode reference is held by the file so doesn't call iput(). That's not the case here, since the reference is still meant to be held by the socket; the actual inode of the io_uring file is different.
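For reference, the release path looks roughly like this (a simplified paraphrase of net/socket.c, not the verbatim source):

    /* sock_release(), simplified */
    if (!sock->file) {
            iput(SOCK_INODE(sock)); /* socket owns its inode: drop it */
            return;
    }
    /* otherwise the inode reference is assumed to be held by
     * sock->file, which is not true for io_uring's internal socket,
     * so the inode is never put */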
Fix this leak by NULL-ing out ->file before releasing the socket.
Reported-by: syzbot+111cb28d9f583693aefa@syzkaller.appspotmail.com Fixes: 2b188cc1bb85 ("Add io_uring IO interface") Cc: stable@vger.kernel.org # v5.1+ Signed-off-by: Eric Biggers ebiggers@google.com Signed-off-by: Jens Axboe axboe@kernel.dk Signed-off-by: Zhihao Cheng chengzhihao1@huawei.com Signed-off-by: yangerkun yangerkun@huawei.com Reviewed-by: zhangyi (F) yi.zhang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- fs/io_uring.c | 4 +++- 1 file changed, 3 insertions(+), 1 deletion(-)
diff --git a/fs/io_uring.c b/fs/io_uring.c index 3229b34e0fe0..efad9106576b 100644 --- a/fs/io_uring.c +++ b/fs/io_uring.c @@ -2769,8 +2769,10 @@ static void io_ring_ctx_free(struct io_ring_ctx *ctx) io_eventfd_unregister(ctx);
#if defined(CONFIG_UNIX) - if (ctx->ring_sock) + if (ctx->ring_sock) { + ctx->ring_sock->file = NULL; /* so that iput() is called */ sock_release(ctx->ring_sock); + } #endif
io_mem_free(ctx->sq_ring);
From: Jens Axboe axboe@kernel.dk
mainline inclusion from mainline-5.2-rc7 commit 60c112b0ada09826cc4ae6a4e55df677f76f1313 category: feature bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27 CVE: NA ---------------------------
Stephen reports:
I hit the following General Protection Fault when testing io_uring via the io_uring engine in fio. This was on a VM running 5.2-rc5 and the latest version of fio. The issue occurs for both null_blk and fake NVMe drives. I have not tested bare metal or real NVMe SSDs. The fio script used is given below.
[io_uring] time_based=1 runtime=60 filename=/dev/nvme2n1 (note /dev/nullb0 also fails) ioengine=io_uring bs=4k rw=readwrite direct=1 fixedbufs=1 sqthread_poll=1 sqthread_poll_cpu=0
general protection fault: 0000 [#1] SMP PTI
CPU: 0 PID: 872 Comm: io_uring-sq Not tainted 5.2.0-rc5-cpacket-io-uring #1
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Ubuntu-1.8.2-1ubuntu1 04/01/2014
RIP: 0010:fput_many+0x7/0x90
Code: 01 48 85 ff 74 17 55 48 89 e5 53 48 8b 1f e8 a0 f9 ff ff 48 85 db 48 89 df 75 f0 5b 5d f3 c3 0f 1f 40 00 0f 1f 44 00 00 89 f6 <f0> 48 29 77 38 74 01 c3 55 48 89 e5 53 48 89 fb 65 48 \
RSP: 0018:ffffadeb817ebc50 EFLAGS: 00010246
RAX: 0000000000000004 RBX: ffff8f46ad477480 RCX: 0000000000001805
RDX: 0000000000000000 RSI: 0000000000000001 RDI: f18b51b9a39552b5
RBP: ffffadeb817ebc58 R08: ffff8f46b7a318c0 R09: 000000000000015d
R10: ffffadeb817ebce8 R11: 0000000000000020 R12: ffff8f46ad4cd000
R13: 00000000fffffff7 R14: ffffadeb817ebe30 R15: 0000000000000004
FS:  0000000000000000(0000) GS:ffff8f46b7a00000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 000055828f0bbbf0 CR3: 0000000232176004 CR4: 00000000003606f0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Call Trace:
 ? fput+0x13/0x20
 io_free_req+0x20/0x40
 io_put_req+0x1b/0x20
 io_submit_sqe+0x40a/0x680
 ? __switch_to_asm+0x34/0x70
 ? __switch_to_asm+0x40/0x70
 io_submit_sqes+0xb9/0x160
 ? io_submit_sqes+0xb9/0x160
 ? __switch_to_asm+0x40/0x70
 ? __switch_to_asm+0x34/0x70
 ? __schedule+0x3f2/0x6a0
 ? __switch_to_asm+0x34/0x70
 io_sq_thread+0x1af/0x470
 ? __switch_to_asm+0x34/0x70
 ? wait_woken+0x80/0x80
 ? __switch_to+0x85/0x410
 ? __switch_to_asm+0x40/0x70
 ? __switch_to_asm+0x34/0x70
 ? __schedule+0x3f2/0x6a0
 kthread+0x105/0x140
 ? io_submit_sqes+0x160/0x160
 ? kthread+0x105/0x140
 ? io_submit_sqes+0x160/0x160
 ? kthread_destroy_worker+0x50/0x50
 ret_from_fork+0x35/0x40
which occurs because using a kernel-side submission thread isn't valid without using fixed files (registered through io_uring_register()). This causes io_uring to put the request after logging an error, but before the file field is set in the request. If that field happens to be non-zero, we attempt to fput() garbage.
Fix this by ensuring that req->file is initialized when the request is allocated.
Cc: stable@vger.kernel.org # 5.1+ Reported-by: Stephen Bates sbates@raithlin.com Tested-by: Stephen Bates sbates@raithlin.com Signed-off-by: Jens Axboe axboe@kernel.dk Signed-off-by: Zhihao Cheng chengzhihao1@huawei.com Signed-off-by: yangerkun yangerkun@huawei.com Reviewed-by: zhangyi (F) yi.zhang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- fs/io_uring.c | 5 ++--- 1 file changed, 2 insertions(+), 3 deletions(-)
diff --git a/fs/io_uring.c b/fs/io_uring.c index efad9106576b..617023002fd3 100644 --- a/fs/io_uring.c +++ b/fs/io_uring.c @@ -579,6 +579,7 @@ static struct io_kiocb *io_get_req(struct io_ring_ctx *ctx, state->cur_req++; }
+ req->file = NULL; req->ctx = ctx; req->flags = 0; /* one is dropped after submission, the other at completion */ @@ -1798,10 +1799,8 @@ static int io_req_set_file(struct io_ring_ctx *ctx, const struct sqe_submit *s, req->sequence = ctx->cached_sq_head - 1; }
- if (!io_op_needs_file(s->sqe)) { - req->file = NULL; + if (!io_op_needs_file(s->sqe)) return 0; - }
if (flags & IOSQE_FIXED_FILE) { if (unlikely(!ctx->user_files ||
From: Jens Axboe axboe@kernel.dk
mainline inclusion from mainline-5.3-rc1 commit 87e5e6dab6c2a21fab2620f37786276d202e2ce0 category: feature bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27 CVE: NA ---------------------------
Currently these functions return < 0 on error, and 0 on success. Change that so we still return < 0 on error, but the number of bytes imported on success.
Some callers already treat the return value that way, others need a slight tweak.
Signed-off-by: Jens Axboe axboe@kernel.dk
Conflicts: include/linux/uio.h [ Patch d05f443554b ("iov_iter: introduce hash_and_copy_to_iter helper") is not applied. ]
Signed-off-by: Zhihao Cheng chengzhihao1@huawei.com Signed-off-by: yangerkun yangerkun@huawei.com Reviewed-by: zhangyi (F) yi.zhang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- fs/aio.c | 9 +++++---- fs/io_uring.c | 16 ++++++++-------- fs/splice.c | 8 ++++---- include/linux/uio.h | 4 ++-- lib/iov_iter.c | 15 ++++++++------- net/compat.c | 3 ++- net/socket.c | 3 ++- 7 files changed, 31 insertions(+), 27 deletions(-)
diff --git a/fs/aio.c b/fs/aio.c index 4561f9ba56c4..190f8a7d85c5 100644 --- a/fs/aio.c +++ b/fs/aio.c @@ -1469,8 +1469,9 @@ static int aio_prep_rw(struct kiocb *req, const struct iocb *iocb) return 0; }
-static int aio_setup_rw(int rw, const struct iocb *iocb, struct iovec **iovec, - bool vectored, bool compat, struct iov_iter *iter) +static ssize_t aio_setup_rw(int rw, const struct iocb *iocb, + struct iovec **iovec, bool vectored, bool compat, + struct iov_iter *iter) { void __user *buf = (void __user *)(uintptr_t)iocb->aio_buf; size_t len = iocb->aio_nbytes; @@ -1527,7 +1528,7 @@ static ssize_t aio_read(struct kiocb *req, const struct iocb *iocb, return -EINVAL;
ret = aio_setup_rw(READ, iocb, &iovec, vectored, compat, &iter); - if (ret) + if (ret < 0) return ret; ret = rw_verify_area(READ, file, &req->ki_pos, iov_iter_count(&iter)); if (!ret) @@ -1555,7 +1556,7 @@ static ssize_t aio_write(struct kiocb *req, const struct iocb *iocb, return -EINVAL;
ret = aio_setup_rw(WRITE, iocb, &iovec, vectored, compat, &iter); - if (ret) + if (ret < 0) return ret; ret = rw_verify_area(WRITE, file, &req->ki_pos, iov_iter_count(&iter)); if (!ret) { diff --git a/fs/io_uring.c b/fs/io_uring.c index 617023002fd3..6def560f2259 100644 --- a/fs/io_uring.c +++ b/fs/io_uring.c @@ -1001,9 +1001,9 @@ static int io_import_fixed(struct io_ring_ctx *ctx, int rw, return 0; }
-static int io_import_iovec(struct io_ring_ctx *ctx, int rw, - const struct sqe_submit *s, struct iovec **iovec, - struct iov_iter *iter) +static ssize_t io_import_iovec(struct io_ring_ctx *ctx, int rw, + const struct sqe_submit *s, struct iovec **iovec, + struct iov_iter *iter) { const struct io_uring_sqe *sqe = s->sqe; void __user *buf = u64_to_user_ptr(READ_ONCE(sqe->addr)); @@ -1021,7 +1021,7 @@ static int io_import_iovec(struct io_ring_ctx *ctx, int rw, opcode = READ_ONCE(sqe->opcode); if (opcode == IORING_OP_READ_FIXED || opcode == IORING_OP_WRITE_FIXED) { - int ret = io_import_fixed(ctx, rw, sqe, iter); + ssize_t ret = io_import_fixed(ctx, rw, sqe, iter); *iovec = NULL; return ret; } @@ -1087,7 +1087,7 @@ static int io_read(struct io_kiocb *req, const struct sqe_submit *s, struct iov_iter iter; struct file *file; size_t iov_count; - int ret; + ssize_t ret;
ret = io_prep_rw(req, s, force_nonblock); if (ret) @@ -1100,7 +1100,7 @@ static int io_read(struct io_kiocb *req, const struct sqe_submit *s, return -EINVAL;
ret = io_import_iovec(req->ctx, READ, s, &iovec, &iter); - if (ret) + if (ret < 0) return ret;
iov_count = iov_iter_count(&iter); @@ -1134,7 +1134,7 @@ static int io_write(struct io_kiocb *req, const struct sqe_submit *s, struct iov_iter iter; struct file *file; size_t iov_count; - int ret; + ssize_t ret;
ret = io_prep_rw(req, s, force_nonblock); if (ret) @@ -1147,7 +1147,7 @@ static int io_write(struct io_kiocb *req, const struct sqe_submit *s, return -EINVAL;
ret = io_import_iovec(req->ctx, WRITE, s, &iovec, &iter); - if (ret) + if (ret < 0) return ret;
iov_count = iov_iter_count(&iter); diff --git a/fs/splice.c b/fs/splice.c index fd28c7da3c83..62645e2141d1 100644 --- a/fs/splice.c +++ b/fs/splice.c @@ -1359,7 +1359,7 @@ SYSCALL_DEFINE4(vmsplice, int, fd, const struct iovec __user *, uiov, struct iovec iovstack[UIO_FASTIOV]; struct iovec *iov = iovstack; struct iov_iter iter; - long error; + ssize_t error; struct fd f; int type;
@@ -1370,7 +1370,7 @@ SYSCALL_DEFINE4(vmsplice, int, fd, const struct iovec __user *, uiov,
error = import_iovec(type, uiov, nr_segs, ARRAY_SIZE(iovstack), &iov, &iter); - if (!error) { + if (error >= 0) { error = do_vmsplice(f.file, &iter, flags); kfree(iov); } @@ -1385,7 +1385,7 @@ COMPAT_SYSCALL_DEFINE4(vmsplice, int, fd, const struct compat_iovec __user *, io struct iovec iovstack[UIO_FASTIOV]; struct iovec *iov = iovstack; struct iov_iter iter; - long error; + ssize_t error; struct fd f; int type;
@@ -1396,7 +1396,7 @@ COMPAT_SYSCALL_DEFINE4(vmsplice, int, fd, const struct compat_iovec __user *, io
error = compat_import_iovec(type, iov32, nr_segs, ARRAY_SIZE(iovstack), &iov, &iter); - if (!error) { + if (error >= 0) { error = do_vmsplice(f.file, &iter, flags); kfree(iov); } diff --git a/include/linux/uio.h b/include/linux/uio.h index 422b1c01ee0d..4af82ff60264 100644 --- a/include/linux/uio.h +++ b/include/linux/uio.h @@ -245,13 +245,13 @@ size_t csum_and_copy_to_iter(const void *addr, size_t bytes, __wsum *csum, struc size_t csum_and_copy_from_iter(void *addr, size_t bytes, __wsum *csum, struct iov_iter *i); bool csum_and_copy_from_iter_full(void *addr, size_t bytes, __wsum *csum, struct iov_iter *i);
-int import_iovec(int type, const struct iovec __user * uvector, +ssize_t import_iovec(int type, const struct iovec __user * uvector, unsigned nr_segs, unsigned fast_segs, struct iovec **iov, struct iov_iter *i);
#ifdef CONFIG_COMPAT struct compat_iovec; -int compat_import_iovec(int type, const struct compat_iovec __user * uvector, +ssize_t compat_import_iovec(int type, const struct compat_iovec __user * uvector, unsigned nr_segs, unsigned fast_segs, struct iovec **iov, struct iov_iter *i); #endif diff --git a/lib/iov_iter.c b/lib/iov_iter.c index d51bd7283243..a19d3423d775 100644 --- a/lib/iov_iter.c +++ b/lib/iov_iter.c @@ -1530,9 +1530,9 @@ EXPORT_SYMBOL(dup_iter); * on-stack array was used or not (and regardless of whether this function * returns an error or not). * - * Return: 0 on success or negative error code on error. + * Return: Negative error code on error, bytes imported on success */ -int import_iovec(int type, const struct iovec __user * uvector, +ssize_t import_iovec(int type, const struct iovec __user * uvector, unsigned nr_segs, unsigned fast_segs, struct iovec **iov, struct iov_iter *i) { @@ -1548,16 +1548,17 @@ int import_iovec(int type, const struct iovec __user * uvector, } iov_iter_init(i, type, p, nr_segs, n); *iov = p == *iov ? NULL : p; - return 0; + return n; } EXPORT_SYMBOL(import_iovec);
#ifdef CONFIG_COMPAT #include <linux/compat.h>
-int compat_import_iovec(int type, const struct compat_iovec __user * uvector, - unsigned nr_segs, unsigned fast_segs, - struct iovec **iov, struct iov_iter *i) +ssize_t compat_import_iovec(int type, + const struct compat_iovec __user * uvector, + unsigned nr_segs, unsigned fast_segs, + struct iovec **iov, struct iov_iter *i) { ssize_t n; struct iovec *p; @@ -1571,7 +1572,7 @@ int compat_import_iovec(int type, const struct compat_iovec __user * uvector, } iov_iter_init(i, type, p, nr_segs, n); *iov = p == *iov ? NULL : p; - return 0; + return n; } #endif
diff --git a/net/compat.c b/net/compat.c index 981424bd707d..2582a9223d80 100644 --- a/net/compat.c +++ b/net/compat.c @@ -79,9 +79,10 @@ int get_compat_msghdr(struct msghdr *kmsg,
kmsg->msg_iocb = NULL;
- return compat_import_iovec(save_addr ? READ : WRITE, + err = compat_import_iovec(save_addr ? READ : WRITE, compat_ptr(msg.msg_iov), msg.msg_iovlen, UIO_FASTIOV, iov, &kmsg->msg_iter); + return err < 0 ? err : 0; }
/* Bleech... */ diff --git a/net/socket.c b/net/socket.c index 29169045dcfe..17d4aa0c0ba3 100644 --- a/net/socket.c +++ b/net/socket.c @@ -2031,9 +2031,10 @@ static int copy_msghdr_from_user(struct msghdr *kmsg,
kmsg->msg_iocb = NULL;
- return import_iovec(save_addr ? READ : WRITE, + err = import_iovec(save_addr ? READ : WRITE, msg.msg_iov, msg.msg_iovlen, UIO_FASTIOV, iov, &kmsg->msg_iter); + return err < 0 ? err : 0; }
static int ___sys_sendmsg(struct socket *sock, struct user_msghdr __user *msg,
From: Jens Axboe axboe@kernel.dk
mainline inclusion from mainline-5.3-rc1 commit 9d93a3f5a0c0d0f79aebc597d47c7cedc852aeb5 category: feature bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27 CVE: NA ---------------------------
We can encounter a short read when we're doing buffered reads and the data is partially cached. Right now we just return the short read, but that forces the application to read that CQE, then issue another SQE to finish the read. That read will not be cached, and hence will result in an async punt.
It's more efficient to do that async punt from within the kernel, as it spares the application the two extra round trips to the kernel.
Signed-off-by: Jens Axboe axboe@kernel.dk Signed-off-by: Zhihao Cheng chengzhihao1@huawei.com Signed-off-by: yangerkun yangerkun@huawei.com Reviewed-by: zhangyi (F) yi.zhang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- fs/io_uring.c | 15 +++++++++++++-- 1 file changed, 13 insertions(+), 2 deletions(-)
diff --git a/fs/io_uring.c b/fs/io_uring.c index 6def560f2259..590909b8bf20 100644 --- a/fs/io_uring.c +++ b/fs/io_uring.c @@ -1087,7 +1087,7 @@ static int io_read(struct io_kiocb *req, const struct sqe_submit *s, struct iov_iter iter; struct file *file; size_t iov_count; - ssize_t ret; + ssize_t read_size, ret;
ret = io_prep_rw(req, s, force_nonblock); if (ret) @@ -1103,13 +1103,24 @@ static int io_read(struct io_kiocb *req, const struct sqe_submit *s, if (ret < 0) return ret;
+ read_size = ret; iov_count = iov_iter_count(&iter); ret = rw_verify_area(READ, file, &kiocb->ki_pos, iov_count); if (!ret) { ssize_t ret2;
- /* Catch -EAGAIN return for forced non-blocking submission */ ret2 = call_read_iter(file, kiocb, &iter); + /* + * In case of a short read, punt to async. This can happen + * if we have data partially cached. Alternatively we can + * return the short read, in which case the application will + * need to issue another SQE and wait for it. That SQE will + * need async punt anyway, so it's more efficient to do it + * here. + */ + if (force_nonblock && ret2 > 0 && ret2 < read_size) + ret2 = -EAGAIN; + /* Catch -EAGAIN return for forced non-blocking submission */ if (!force_nonblock || ret2 != -EAGAIN) { io_rw_done(kiocb, ret2); } else {
From: Jens Axboe axboe@kernel.dk
mainline inclusion from mainline-5.3-rc1 commit 9e645e1105ca60fbbc6bddf2fd5ef7e57ed3dca8 category: feature bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27 CVE: NA ---------------------------
With SQE links, we can create chains of dependent SQEs. One example would be queueing an SQE that's a read from one file descriptor, with the linked SQE being a write to another with the same set of buffers.
An SQE link will not stall the pipeline; it'll just ensure that dependent SQEs aren't issued before the previous link has completed.
Any error at submission or completion time will break the chain of SQEs. For completions, this also includes short reads or writes, as the next SQE could depend on the previous one being fully completed.
Any SQE in a chain that gets canceled due to any of the above errors will get a CQE filled with -ECANCELED as the error value.
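As an illustration of the new flag from userspace, here is a minimal sketch using liburing (hypothetical file descriptors and buffer; error handling trimmed; not part of this patch):

#include <liburing.h>
#include <sys/uio.h>
#include <stdio.h>

/* read from in_fd, then write the same buffer to out_fd, in order */
int copy_once(int in_fd, int out_fd, void *buf, unsigned len)
{
	struct io_uring ring;
	struct io_uring_sqe *sqe;
	struct io_uring_cqe *cqe;
	struct iovec iov = { .iov_base = buf, .iov_len = len };
	int i, ret;

	ret = io_uring_queue_init(8, &ring, 0);
	if (ret < 0)
		return ret;

	sqe = io_uring_get_sqe(&ring);
	io_uring_prep_readv(sqe, in_fd, &iov, 1, 0);
	sqe->flags |= IOSQE_IO_LINK;	/* next SQE waits for this one */

	sqe = io_uring_get_sqe(&ring);
	io_uring_prep_writev(sqe, out_fd, &iov, 1, 0);

	io_uring_submit(&ring);

	/* a failed or short read completes the write with -ECANCELED */
	for (i = 0; i < 2; i++) {
		ret = io_uring_wait_cqe(&ring, &cqe);
		if (ret < 0)
			break;
		printf("cqe res=%d\n", cqe->res);
		io_uring_cqe_seen(&ring, cqe);
	}

	io_uring_queue_exit(&ring);
	return ret;
}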
Signed-off-by: Jens Axboe axboe@kernel.dk Signed-off-by: Zhihao Cheng chengzhihao1@huawei.com Signed-off-by: yangerkun yangerkun@huawei.com Reviewed-by: zhangyi (F) yi.zhang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- fs/io_uring.c | 241 +++++++++++++++++++++++++++------- include/uapi/linux/io_uring.h | 1 + 2 files changed, 194 insertions(+), 48 deletions(-)
diff --git a/fs/io_uring.c b/fs/io_uring.c index 590909b8bf20..215486d08d25 100644 --- a/fs/io_uring.c +++ b/fs/io_uring.c @@ -322,6 +322,7 @@ struct io_kiocb {
struct io_ring_ctx *ctx; struct list_head list; + struct list_head link_list; unsigned int flags; refcount_t refs; #define REQ_F_NOWAIT 1 /* must not punt to workers */ @@ -330,8 +331,10 @@ struct io_kiocb { #define REQ_F_SEQ_PREV 8 /* sequential with previous */ #define REQ_F_IO_DRAIN 16 /* drain existing IO first */ #define REQ_F_IO_DRAINED 32 /* drain done */ +#define REQ_F_LINK 64 /* linked sqes */ +#define REQ_F_FAIL_LINK 128 /* fail rest of links */ u64 user_data; - u32 error; /* iopoll result from callback */ + u32 result; u32 sequence;
struct work_struct work; @@ -584,6 +587,7 @@ static struct io_kiocb *io_get_req(struct io_ring_ctx *ctx, req->flags = 0; /* one is dropped after submission, the other at completion */ refcount_set(&req->refs, 2); + req->result = 0; return req; out: io_ring_drop_ctx_refs(ctx, 1); @@ -599,7 +603,7 @@ static void io_free_req_many(struct io_ring_ctx *ctx, void **reqs, int *nr) } }
-static void io_free_req(struct io_kiocb *req) +static void __io_free_req(struct io_kiocb *req) { if (req->file && !(req->flags & REQ_F_FIXED_FILE)) fput(req->file); @@ -607,6 +611,63 @@ static void io_free_req(struct io_kiocb *req) kmem_cache_free(req_cachep, req); }
+static void io_req_link_next(struct io_kiocb *req) +{ + struct io_kiocb *nxt; + + /* + * The list should never be empty when we are called here. But could + * potentially happen if the chain is messed up, check to be on the + * safe side. + */ + nxt = list_first_entry_or_null(&req->link_list, struct io_kiocb, list); + if (nxt) { + list_del(&nxt->list); + if (!list_empty(&req->link_list)) { + INIT_LIST_HEAD(&nxt->link_list); + list_splice(&req->link_list, &nxt->link_list); + nxt->flags |= REQ_F_LINK; + } + + INIT_WORK(&nxt->work, io_sq_wq_submit_work); + queue_work(req->ctx->sqo_wq, &nxt->work); + } +} + +/* + * Called if REQ_F_LINK is set, and we fail the head request + */ +static void io_fail_links(struct io_kiocb *req) +{ + struct io_kiocb *link; + + while (!list_empty(&req->link_list)) { + link = list_first_entry(&req->link_list, struct io_kiocb, list); + list_del(&link->list); + + io_cqring_add_event(req->ctx, link->user_data, -ECANCELED); + __io_free_req(link); + } +} + +static void io_free_req(struct io_kiocb *req) +{ + /* + * If LINK is set, we have dependent requests in this chain. If we + * didn't fail this request, queue the first one up, moving any other + * dependencies to the next request. In case of failure, fail the rest + * of the chain. + */ + if (req->flags & REQ_F_LINK) { + if (req->flags & REQ_F_FAIL_LINK) + io_fail_links(req); + else + io_req_link_next(req); + } + + __io_free_req(req); +} + static void io_put_req(struct io_kiocb *req) { if (refcount_dec_and_test(&req->refs)) @@ -628,16 +689,17 @@ static void io_iopoll_complete(struct io_ring_ctx *ctx, unsigned int *nr_events, req = list_first_entry(done, struct io_kiocb, list); list_del(&req->list);
- io_cqring_fill_event(ctx, req->user_data, req->error); + io_cqring_fill_event(ctx, req->user_data, req->result); (*nr_events)++;
if (refcount_dec_and_test(&req->refs)) { /* If we're not using fixed files, we have to pair the * completion part with the file put. Use regular * completions for those, only batch free for fixed - * file. + * file and non-linked commands. */ - if (req->flags & REQ_F_FIXED_FILE) { + if ((req->flags & (REQ_F_FIXED_FILE|REQ_F_LINK)) == + REQ_F_FIXED_FILE) { reqs[to_free++] = req; if (to_free == ARRAY_SIZE(reqs)) io_free_req_many(ctx, reqs, &to_free); @@ -776,6 +838,8 @@ static void io_complete_rw(struct kiocb *kiocb, long res, long res2)
kiocb_end_write(kiocb);
+ if ((req->flags & REQ_F_LINK) && res != req->result) + req->flags |= REQ_F_FAIL_LINK; io_cqring_add_event(req->ctx, req->user_data, res); io_put_req(req); } @@ -786,7 +850,9 @@ static void io_complete_rw_iopoll(struct kiocb *kiocb, long res, long res2)
kiocb_end_write(kiocb);
- req->error = res; + if ((req->flags & REQ_F_LINK) && res != req->result) + req->flags |= REQ_F_FAIL_LINK; + req->result = res; if (res != -EAGAIN) req->flags |= REQ_F_IOPOLL_COMPLETED; } @@ -929,7 +995,6 @@ static int io_prep_rw(struct io_kiocb *req, const struct sqe_submit *s, !kiocb->ki_filp->f_op->iopoll) return -EOPNOTSUPP;
- req->error = 0; kiocb->ki_flags |= IOCB_HIPRI; kiocb->ki_complete = io_complete_rw_iopoll; } else { @@ -1104,6 +1169,9 @@ static int io_read(struct io_kiocb *req, const struct sqe_submit *s, return ret;
read_size = ret; + if (req->flags & REQ_F_LINK) + req->result = read_size; + iov_count = iov_iter_count(&iter); ret = rw_verify_area(READ, file, &kiocb->ki_pos, iov_count); if (!ret) { @@ -1161,6 +1229,9 @@ static int io_write(struct io_kiocb *req, const struct sqe_submit *s, if (ret < 0) return ret;
+ if (req->flags & REQ_F_LINK) + req->result = ret; + iov_count = iov_iter_count(&iter);
ret = -EAGAIN; @@ -1264,6 +1335,8 @@ static int io_fsync(struct io_kiocb *req, const struct io_uring_sqe *sqe, end > 0 ? end : LLONG_MAX, fsync_flags & IORING_FSYNC_DATASYNC);
+ if (ret < 0 && (req->flags & REQ_F_LINK)) + req->flags |= REQ_F_FAIL_LINK; io_cqring_add_event(req->ctx, sqe->user_data, ret); io_put_req(req); return 0; @@ -1308,6 +1381,8 @@ static int io_sync_file_range(struct io_kiocb *req,
ret = sync_file_range(req->rw.ki_filp, sqe_off, sqe_len, flags);
+ if (ret < 0 && (req->flags & REQ_F_LINK)) + req->flags |= REQ_F_FAIL_LINK; io_cqring_add_event(req->ctx, sqe->user_data, ret); io_put_req(req); return 0; @@ -1560,9 +1635,10 @@ static int __io_submit_sqe(struct io_ring_ctx *ctx, struct io_kiocb *req, { int ret, opcode;
+ req->user_data = READ_ONCE(s->sqe->user_data); + if (unlikely(s->index >= ctx->sq_entries)) return -EINVAL; - req->user_data = READ_ONCE(s->sqe->user_data);
opcode = READ_ONCE(s->sqe->opcode); switch (opcode) { @@ -1606,7 +1682,7 @@ static int __io_submit_sqe(struct io_ring_ctx *ctx, struct io_kiocb *req, return ret;
if (ctx->flags & IORING_SETUP_IOPOLL) { - if (req->error == -EAGAIN) + if (req->result == -EAGAIN) return -EAGAIN;
/* workqueue context doesn't hold uring_lock, grab it now */ @@ -1830,31 +1906,11 @@ static int io_req_set_file(struct io_ring_ctx *ctx, const struct sqe_submit *s, return 0; }
-static int io_submit_sqe(struct io_ring_ctx *ctx, struct sqe_submit *s, - struct io_submit_state *state) +static int io_queue_sqe(struct io_ring_ctx *ctx, struct io_kiocb *req, + struct sqe_submit *s) { - struct io_kiocb *req; int ret;
- /* enforce forwards compatibility on users */ - if (unlikely(s->sqe->flags & ~(IOSQE_FIXED_FILE | IOSQE_IO_DRAIN))) - return -EINVAL; - - req = io_get_req(ctx, state); - if (unlikely(!req)) - return -EAGAIN; - - ret = io_req_set_file(ctx, s, state, req); - if (unlikely(ret)) - goto out; - - ret = io_req_defer(ctx, req, s->sqe); - if (ret) { - if (ret == -EIOCBQUEUED) - ret = 0; - return ret; - } - ret = __io_submit_sqe(ctx, req, s, true); if (ret == -EAGAIN && !(req->flags & REQ_F_NOWAIT)) { struct io_uring_sqe *sqe_copy; @@ -1877,24 +1933,93 @@ static int io_submit_sqe(struct io_ring_ctx *ctx, struct sqe_submit *s,
/* * Queued up for async execution, worker will release - * submit reference when the iocb is actually - * submitted. + * submit reference when the iocb is actually submitted. */ return 0; } }
-out: /* drop submission reference */ io_put_req(req);
/* and drop final reference, if we failed */ - if (ret) + if (ret) { + io_cqring_add_event(ctx, req->user_data, ret); + if (req->flags & REQ_F_LINK) + req->flags |= REQ_F_FAIL_LINK; io_put_req(req); + }
return ret; }
+#define SQE_VALID_FLAGS (IOSQE_FIXED_FILE|IOSQE_IO_DRAIN|IOSQE_IO_LINK) + +static void io_submit_sqe(struct io_ring_ctx *ctx, struct sqe_submit *s, + struct io_submit_state *state, struct io_kiocb **link) +{ + struct io_uring_sqe *sqe_copy; + struct io_kiocb *req; + int ret; + + /* enforce forwards compatibility on users */ + if (unlikely(s->sqe->flags & ~SQE_VALID_FLAGS)) { + ret = -EINVAL; + goto err; + } + + req = io_get_req(ctx, state); + if (unlikely(!req)) { + ret = -EAGAIN; + goto err; + } + + ret = io_req_set_file(ctx, s, state, req); + if (unlikely(ret)) { +err_req: + io_free_req(req); +err: + io_cqring_add_event(ctx, s->sqe->user_data, ret); + return; + } + + ret = io_req_defer(ctx, req, s->sqe); + if (ret) { + if (ret != -EIOCBQUEUED) + goto err_req; + return; + } + + /* + * If we already have a head request, queue this one for async + * submittal once the head completes. If we don't have a head but + * IOSQE_IO_LINK is set in the sqe, start a new head. This one will be + * submitted sync once the chain is complete. If none of those + * conditions are true (normal request), then just queue it. + */ + if (*link) { + struct io_kiocb *prev = *link; + + sqe_copy = kmemdup(s->sqe, sizeof(*sqe_copy), GFP_KERNEL); + if (!sqe_copy) { + ret = -EAGAIN; + goto err_req; + } + + s->sqe = sqe_copy; + memcpy(&req->submit, s, sizeof(*s)); + list_add_tail(&req->list, &prev->link_list); + } else if (s->sqe->flags & IOSQE_IO_LINK) { + req->flags |= REQ_F_LINK; + + memcpy(&req->submit, s, sizeof(*s)); + INIT_LIST_HEAD(&req->link_list); + *link = req; + } else { + io_queue_sqe(ctx, req, s); + } +} + /* * Batched submission is done, ensure local IO is flushed out. */ @@ -1977,7 +2102,9 @@ static int io_submit_sqes(struct io_ring_ctx *ctx, struct sqe_submit *sqes, unsigned int nr, bool has_user, bool mm_fault) { struct io_submit_state state, *statep = NULL; - int ret, i, submitted = 0; + struct io_kiocb *link = NULL; + bool prev_was_link = false; + int i, submitted = 0;
if (nr > IO_PLUG_THRESHOLD) { io_submit_state_start(&state, ctx, nr); @@ -1985,22 +2112,30 @@ static int io_submit_sqes(struct io_ring_ctx *ctx, struct sqe_submit *sqes, }
for (i = 0; i < nr; i++) { + /* + * If previous wasn't linked and we have a linked command, + * that's the end of the chain. Submit the previous link. + */ + if (!prev_was_link && link) { + io_queue_sqe(ctx, link, &link->submit); + link = NULL; + } + prev_was_link = (sqes[i].sqe->flags & IOSQE_IO_LINK) != 0; + if (unlikely(mm_fault)) { - ret = -EFAULT; + io_cqring_add_event(ctx, sqes[i].sqe->user_data, + -EFAULT); } else { sqes[i].has_user = has_user; sqes[i].needs_lock = true; sqes[i].needs_fixed_file = true; - ret = io_submit_sqe(ctx, &sqes[i], statep); - } - if (!ret) { + io_submit_sqe(ctx, &sqes[i], statep, &link); submitted++; - continue; } - - io_cqring_add_event(ctx, sqes[i].sqe->user_data, ret); }
+ if (link) + io_queue_sqe(ctx, link, &link->submit); if (statep) io_submit_state_end(&state);
@@ -2141,6 +2276,8 @@ static int io_sq_thread(void *data) static int io_ring_submit(struct io_ring_ctx *ctx, unsigned int to_submit) { struct io_submit_state state, *statep = NULL; + struct io_kiocb *link = NULL; + bool prev_was_link = false; int i, submit = 0;
if (to_submit > IO_PLUG_THRESHOLD) { @@ -2150,22 +2287,30 @@ static int io_ring_submit(struct io_ring_ctx *ctx, unsigned int to_submit)
for (i = 0; i < to_submit; i++) { struct sqe_submit s; - int ret;
if (!io_get_sqring(ctx, &s)) break;
+ /* + * If previous wasn't linked and we have a linked command, + * that's the end of the chain. Submit the previous link. + */ + if (!prev_was_link && link) { + io_queue_sqe(ctx, link, &link->submit); + link = NULL; + } + prev_was_link = (s.sqe->flags & IOSQE_IO_LINK) != 0; + s.has_user = true; s.needs_lock = false; s.needs_fixed_file = false; submit++; - - ret = io_submit_sqe(ctx, &s, statep); - if (ret) - io_cqring_add_event(ctx, s.sqe->user_data, ret); + io_submit_sqe(ctx, &s, statep, &link); } io_commit_sqring(ctx);
+ if (link) + io_queue_sqe(ctx, link, &link->submit); if (statep) io_submit_state_end(statep);
diff --git a/include/uapi/linux/io_uring.h b/include/uapi/linux/io_uring.h index a0c460025036..10b7c45f6d57 100644 --- a/include/uapi/linux/io_uring.h +++ b/include/uapi/linux/io_uring.h @@ -40,6 +40,7 @@ struct io_uring_sqe { */ #define IOSQE_FIXED_FILE (1U << 0) /* use fixed fileset */ #define IOSQE_IO_DRAIN (1U << 1) /* issue after inflight IO */ +#define IOSQE_IO_LINK (1U << 2) /* links next sqe */
/* * io_uring_setup() flags
From: Jens Axboe axboe@kernel.dk
mainline inclusion from mainline-5.3-rc1 commit 0fa03c624d8fc9932d0f27c39a9deca6a37e0e17 category: feature bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27 CVE: NA ---------------------------
This is done through IORING_OP_SENDMSG. There's a new sqe->msg_flags for the flags argument, and the msghdr struct is passed in the sqe->addr field.
We use MSG_DONTWAIT to force an inline fast path if sendmsg() doesn't block, and punt to async execution if it would have.
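For reference, filling the raw SQE fields for this opcode from userspace would look roughly like the sketch below (illustrative only, with a hypothetical helper name; recent liburing wraps the same thing as io_uring_prep_sendmsg()):

#include <string.h>
#include <sys/socket.h>
#include <linux/io_uring.h>

/* prepare one SQE for IORING_OP_SENDMSG; msg must stay valid until
 * completion, since the kernel reads it when the request is issued */
static void prep_sendmsg_sqe(struct io_uring_sqe *sqe, int sockfd,
			     const struct msghdr *msg, unsigned int flags)
{
	memset(sqe, 0, sizeof(*sqe));
	sqe->opcode = IORING_OP_SENDMSG;
	sqe->fd = sockfd;
	/* userspace struct msghdr lines up with the kernel's user_msghdr */
	sqe->addr = (unsigned long) msg;
	sqe->len = 1;			/* liburing convention; not read here */
	sqe->msg_flags = flags;		/* e.g. MSG_DONTWAIT */
	sqe->user_data = (unsigned long) msg;
}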
Acked-by: David S. Miller davem@davemloft.net Signed-off-by: Jens Axboe axboe@kernel.dk Signed-off-by: Zhihao Cheng chengzhihao1@huawei.com Signed-off-by: yangerkun yangerkun@huawei.com Reviewed-by: zhangyi (F) yi.zhang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- fs/io_uring.c | 40 +++++++++++++++++++++++++++++++++++ include/linux/socket.h | 4 ++++ include/uapi/linux/io_uring.h | 2 ++ net/socket.c | 7 ++++++ 4 files changed, 53 insertions(+)
diff --git a/fs/io_uring.c b/fs/io_uring.c index 215486d08d25..0e1ed1444ef3 100644 --- a/fs/io_uring.c +++ b/fs/io_uring.c @@ -1388,6 +1388,43 @@ static int io_sync_file_range(struct io_kiocb *req, return 0; }
+static int io_sendmsg(struct io_kiocb *req, const struct io_uring_sqe *sqe, + bool force_nonblock) +{ +#if defined(CONFIG_NET) + struct socket *sock; + int ret; + + if (unlikely(req->ctx->flags & IORING_SETUP_IOPOLL)) + return -EINVAL; + + sock = sock_from_file(req->file, &ret); + if (sock) { + struct user_msghdr __user *msg; + unsigned flags; + + flags = READ_ONCE(sqe->msg_flags); + if (flags & MSG_DONTWAIT) + req->flags |= REQ_F_NOWAIT; + else if (force_nonblock) + flags |= MSG_DONTWAIT; + + msg = (struct user_msghdr __user *) (unsigned long) + READ_ONCE(sqe->addr); + + ret = __sys_sendmsg_sock(sock, msg, flags); + if (force_nonblock && ret == -EAGAIN) + return ret; + } + + io_cqring_add_event(req->ctx, sqe->user_data, ret); + io_put_req(req); + return 0; +#else + return -EOPNOTSUPP; +#endif +} + static void io_poll_remove_one(struct io_kiocb *req) { struct io_poll_iocb *poll = &req->poll; @@ -1673,6 +1710,9 @@ static int __io_submit_sqe(struct io_ring_ctx *ctx, struct io_kiocb *req, case IORING_OP_SYNC_FILE_RANGE: ret = io_sync_file_range(req, s->sqe, force_nonblock); break; + case IORING_OP_SENDMSG: + ret = io_sendmsg(req, s->sqe, force_nonblock); + break; default: ret = -EINVAL; break; diff --git a/include/linux/socket.h b/include/linux/socket.h index 7ed4713d5337..eac46dd8fb73 100644 --- a/include/linux/socket.h +++ b/include/linux/socket.h @@ -12,6 +12,7 @@
struct pid; struct cred; +struct socket;
#define __sockaddr_check_size(size) \ BUILD_BUG_ON(((size) > sizeof(struct __kernel_sockaddr_storage))) @@ -362,6 +363,9 @@ extern int __sys_recvmmsg(int fd, struct mmsghdr __user *mmsg, unsigned int vlen extern int __sys_sendmmsg(int fd, struct mmsghdr __user *mmsg, unsigned int vlen, unsigned int flags, bool forbid_cmsg_compat); +extern long __sys_sendmsg_sock(struct socket *sock, + struct user_msghdr __user *msg, + unsigned int flags);
/* helpers which do the actual work for syscalls */ extern int __sys_recvfrom(int fd, void __user *ubuf, size_t size, diff --git a/include/uapi/linux/io_uring.h b/include/uapi/linux/io_uring.h index 10b7c45f6d57..d74742d6269f 100644 --- a/include/uapi/linux/io_uring.h +++ b/include/uapi/linux/io_uring.h @@ -27,6 +27,7 @@ struct io_uring_sqe { __u32 fsync_flags; __u16 poll_events; __u32 sync_range_flags; + __u32 msg_flags; }; __u64 user_data; /* data to be passed back at completion time */ union { @@ -58,6 +59,7 @@ struct io_uring_sqe { #define IORING_OP_POLL_ADD 6 #define IORING_OP_POLL_REMOVE 7 #define IORING_OP_SYNC_FILE_RANGE 8 +#define IORING_OP_SENDMSG 9
/* * sqe->fsync_flags diff --git a/net/socket.c b/net/socket.c index 17d4aa0c0ba3..bf44fbc7488b 100644 --- a/net/socket.c +++ b/net/socket.c @@ -2136,6 +2136,13 @@ static int ___sys_sendmsg(struct socket *sock, struct user_msghdr __user *msg, /* * BSD sendmsg interface */ +long __sys_sendmsg_sock(struct socket *sock, struct user_msghdr __user *msg, + unsigned int flags) +{ + struct msghdr msg_sys; + + return ___sys_sendmsg(sock, msg, &msg_sys, flags, NULL, 0); +}
long __sys_sendmsg(int fd, struct user_msghdr __user *msg, unsigned int flags, bool forbid_cmsg_compat)
From: Jens Axboe axboe@kernel.dk
mainline inclusion from mainline-5.3-rc1 commit aa1fa28fc73ea6b740ee7b62bf3b07141883dbb8 category: feature bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27 CVE: NA ---------------------------
This is done through IORING_OP_RECVMSG. This opcode uses the same sqe->msg_flags that IORING_OP_SENDMSG added, and we pass in the msghdr struct in the sqe->addr field as well.
We use MSG_DONTWAIT to force an inline fast path if recvmsg() doesn't block, and punt to async execution if it would have.
Acked-by: David S. Miller davem@davemloft.net Signed-off-by: Jens Axboe axboe@kernel.dk Signed-off-by: Zhihao Cheng chengzhihao1@huawei.com Signed-off-by: yangerkun yangerkun@huawei.com Reviewed-by: zhangyi (F) yi.zhang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- fs/io_uring.c | 31 +++++++++++++++++++++++++++---- include/linux/socket.h | 3 +++ include/uapi/linux/io_uring.h | 1 + net/socket.c | 8 ++++++++ 4 files changed, 39 insertions(+), 4 deletions(-)
diff --git a/fs/io_uring.c b/fs/io_uring.c index 0e1ed1444ef3..cd27cd473b6f 100644 --- a/fs/io_uring.c +++ b/fs/io_uring.c @@ -1388,10 +1388,12 @@ static int io_sync_file_range(struct io_kiocb *req, return 0; }
-static int io_sendmsg(struct io_kiocb *req, const struct io_uring_sqe *sqe, - bool force_nonblock) -{ #if defined(CONFIG_NET) +static int io_send_recvmsg(struct io_kiocb *req, const struct io_uring_sqe *sqe, + bool force_nonblock, + long (*fn)(struct socket *, struct user_msghdr __user *, + unsigned int)) +{ struct socket *sock; int ret;
@@ -1412,7 +1414,7 @@ static int io_sendmsg(struct io_kiocb *req, const struct io_uring_sqe *sqe, msg = (struct user_msghdr __user *) (unsigned long) READ_ONCE(sqe->addr);
- ret = __sys_sendmsg_sock(sock, msg, flags); + ret = fn(sock, msg, flags); if (force_nonblock && ret == -EAGAIN) return ret; } @@ -1420,6 +1422,24 @@ static int io_sendmsg(struct io_kiocb *req, const struct io_uring_sqe *sqe, io_cqring_add_event(req->ctx, sqe->user_data, ret); io_put_req(req); return 0; +} +#endif + +static int io_sendmsg(struct io_kiocb *req, const struct io_uring_sqe *sqe, + bool force_nonblock) +{ +#if defined(CONFIG_NET) + return io_send_recvmsg(req, sqe, force_nonblock, __sys_sendmsg_sock); +#else + return -EOPNOTSUPP; +#endif +} + +static int io_recvmsg(struct io_kiocb *req, const struct io_uring_sqe *sqe, + bool force_nonblock) +{ +#if defined(CONFIG_NET) + return io_send_recvmsg(req, sqe, force_nonblock, __sys_recvmsg_sock); #else return -EOPNOTSUPP; #endif @@ -1713,6 +1733,9 @@ static int __io_submit_sqe(struct io_ring_ctx *ctx, struct io_kiocb *req, case IORING_OP_SENDMSG: ret = io_sendmsg(req, s->sqe, force_nonblock); break; + case IORING_OP_RECVMSG: + ret = io_recvmsg(req, s->sqe, force_nonblock); + break; default: ret = -EINVAL; break; diff --git a/include/linux/socket.h b/include/linux/socket.h index eac46dd8fb73..70d2578085cf 100644 --- a/include/linux/socket.h +++ b/include/linux/socket.h @@ -366,6 +366,9 @@ extern int __sys_sendmmsg(int fd, struct mmsghdr __user *mmsg, extern long __sys_sendmsg_sock(struct socket *sock, struct user_msghdr __user *msg, unsigned int flags); +extern long __sys_recvmsg_sock(struct socket *sock, + struct user_msghdr __user *msg, + unsigned int flags);
/* helpers which do the actual work for syscalls */ extern int __sys_recvfrom(int fd, void __user *ubuf, size_t size, diff --git a/include/uapi/linux/io_uring.h b/include/uapi/linux/io_uring.h index d74742d6269f..1e1652f25cc1 100644 --- a/include/uapi/linux/io_uring.h +++ b/include/uapi/linux/io_uring.h @@ -60,6 +60,7 @@ struct io_uring_sqe { #define IORING_OP_POLL_REMOVE 7 #define IORING_OP_SYNC_FILE_RANGE 8 #define IORING_OP_SENDMSG 9 +#define IORING_OP_RECVMSG 10
/* * sqe->fsync_flags diff --git a/net/socket.c b/net/socket.c index bf44fbc7488b..b3a9bf2622b3 100644 --- a/net/socket.c +++ b/net/socket.c @@ -2317,6 +2317,14 @@ static int ___sys_recvmsg(struct socket *sock, struct user_msghdr __user *msg, * BSD recvmsg interface */
+long __sys_recvmsg_sock(struct socket *sock, struct user_msghdr __user *msg, + unsigned int flags) +{ + struct msghdr msg_sys; + + return ___sys_recvmsg(sock, msg, &msg_sys, flags, 0); +} + long __sys_recvmsg(int fd, struct user_msghdr __user *msg, unsigned int flags, bool forbid_cmsg_compat) {
From: Jackie Liu liuyun01@kylinos.cn
mainline inclusion from mainline-5.3-rc1 commit a4c0b3decb33fb4a2b5ecc6234a50680f0b21e7d category: feature bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27 CVE: NA ---------------------------
INFO: task syz-executor.5:8634 blocked for more than 143 seconds.
      Not tainted 5.2.0-rc5+ #3
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
syz-executor.5  D25632  8634   8224 0x00004004
Call Trace:
 context_switch kernel/sched/core.c:2818 [inline]
 __schedule+0x658/0x9e0 kernel/sched/core.c:3445
 schedule+0x131/0x1d0 kernel/sched/core.c:3509
 schedule_timeout+0x9a/0x2b0 kernel/time/timer.c:1783
 do_wait_for_common+0x35e/0x5a0 kernel/sched/completion.c:83
 __wait_for_common kernel/sched/completion.c:104 [inline]
 wait_for_common kernel/sched/completion.c:115 [inline]
 wait_for_completion+0x47/0x60 kernel/sched/completion.c:136
 kthread_stop+0xb4/0x150 kernel/kthread.c:559
 io_sq_thread_stop fs/io_uring.c:2252 [inline]
 io_finish_async fs/io_uring.c:2259 [inline]
 io_ring_ctx_free fs/io_uring.c:2770 [inline]
 io_ring_ctx_wait_and_kill+0x268/0x880 fs/io_uring.c:2834
 io_uring_release+0x5d/0x70 fs/io_uring.c:2842
 __fput+0x2e4/0x740 fs/file_table.c:280
 ____fput+0x15/0x20 fs/file_table.c:313
 task_work_run+0x17e/0x1b0 kernel/task_work.c:113
 tracehook_notify_resume include/linux/tracehook.h:185 [inline]
 exit_to_usermode_loop arch/x86/entry/common.c:168 [inline]
 prepare_exit_to_usermode+0x402/0x4f0 arch/x86/entry/common.c:199
 syscall_return_slowpath+0x110/0x440 arch/x86/entry/common.c:279
 do_syscall_64+0x126/0x140 arch/x86/entry/common.c:304
 entry_SYSCALL_64_after_hwframe+0x49/0xbe
RIP: 0033:0x412fb1
Code: 80 3b 7c 0f 84 c7 02 00 00 c7 85 d0 00 00 00 00 00 00 00 48 8b 05 cf a6 24 00 49 8b 14 24 41 b9 cb 2a 44 00 48 89 ee 48 89 df <48> 85 c0 4c 0f 45 c8 45 31 c0 31 c9 e8 0e 5b 00 00 85 c0 41 89 c7
RSP: 002b:00007ffe7ee6a180 EFLAGS: 00000293 ORIG_RAX: 0000000000000003
RAX: 0000000000000000 RBX: 0000000000000004 RCX: 0000000000412fb1
RDX: 0000001b2d920000 RSI: 0000000000000000 RDI: 0000000000000003
RBP: 0000000000000001 R08: 00000000f3a3e1f8 R09: 00000000f3a3e1fc
R10: 00007ffe7ee6a260 R11: 0000000000000293 R12: 000000000075c9a0
R13: 000000000075c9a0 R14: 0000000000024c00 R15: 000000000075bf2c
=============================================
There is a logic error when kthread_park() runs before io_sq_thread():
CPU#0                                   CPU#1

io_sq_thread_stop:                      int kthread(void *_create):

kthread_park()
                                        __kthread_parkme(self);  <<< Wrong
kthread_stop()
    << wait for self->exited
    << clear_bit KTHREAD_SHOULD_PARK

                                        ret = threadfn(data);
                                           |
                                           |- io_sq_thread
                                           |- kthread_should_park()  << false
                                           |- schedule()  <<< nobody wake up

stuck CPU#0                             stuck CPU#1
So, use a new completion, sqo_thread_started, to ensure that io_sq_thread() runs first, and only then io_sq_thread_stop().
Reported-by: syzbot+94324416c485d422fe15@syzkaller.appspotmail.com Suggested-by: Jens Axboe axboe@kernel.dk Signed-off-by: Jackie Liu liuyun01@kylinos.cn Signed-off-by: Jens Axboe axboe@kernel.dk Signed-off-by: Zhihao Cheng chengzhihao1@huawei.com Signed-off-by: yangerkun yangerkun@huawei.com Reviewed-by: zhangyi (F) yi.zhang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- fs/io_uring.c | 5 +++++ 1 file changed, 5 insertions(+)
diff --git a/fs/io_uring.c b/fs/io_uring.c index cd27cd473b6f..7a5db8261c1c 100644 --- a/fs/io_uring.c +++ b/fs/io_uring.c @@ -231,6 +231,7 @@ struct io_ring_ctx { struct task_struct *sqo_thread; /* if using sq thread polling */ struct mm_struct *sqo_mm; wait_queue_head_t sqo_wait; + struct completion sqo_thread_started;
struct { /* CQ ring */ @@ -406,6 +407,7 @@ static struct io_ring_ctx *io_ring_ctx_alloc(struct io_uring_params *p) ctx->flags = p->flags; init_waitqueue_head(&ctx->cq_wait); init_completion(&ctx->ctx_done); + init_completion(&ctx->sqo_thread_started); mutex_init(&ctx->uring_lock); init_waitqueue_head(&ctx->wait); for (i = 0; i < ARRAY_SIZE(ctx->pending_async); i++) { @@ -2215,6 +2217,8 @@ static int io_sq_thread(void *data) unsigned inflight; unsigned long timeout;
+ complete(&ctx->sqo_thread_started); + old_fs = get_fs(); set_fs(USER_DS);
@@ -2454,6 +2458,7 @@ static int io_sqe_files_unregister(struct io_ring_ctx *ctx) static void io_sq_thread_stop(struct io_ring_ctx *ctx) { if (ctx->sqo_thread) { + wait_for_completion(&ctx->sqo_thread_started); /* * The park is a bit of a work-around, without it we get * warning spews on shutdown with SQPOLL set and affinity
From: Zhengyuan Liu liuzhengyuan@kylinos.cn
mainline inclusion from mainline-5.3-rc2 commit dbd0f6d6c2a11eb9c31ca9cd454f95bb5713e92e category: feature bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27 CVE: NA ---------------------------
sq->cached_sq_head and cq->cached_cq_tail are both unsigned int. If cached_sq_head overflows before cached_cq_tail, then we may miss a barrier req. As cached_cq_tail always follows cached_sq_head, checking for inequality (!=) instead of ordering (>) is sufficient, and is immune to wraparound.
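A small standalone illustration (not kernel code) of why the ordered comparison misbehaves at the wrap point while inequality still works:

#include <stdio.h>

int main(void)
{
	unsigned int tail = 0xfffffffdu;	/* cq tail lagging behind */
	unsigned int seq  = 0xffffffffu;	/* sq head about to wrap */

	seq += 3;	/* wraps to 0x00000002, still 5 ahead of tail */

	/* ordered compare: 2 > 0xfffffffd is false, defer is skipped */
	printf("seq > tail : %d\n", seq > tail);
	/* inequality: still detects the pending gap after the wrap */
	printf("seq != tail: %d\n", seq != tail);
	return 0;
}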
Cc: stable@vger.kernel.org Fixes: de0617e46717 ("io_uring: add support for marking commands as draining") Signed-off-by: Zhengyuan Liu liuzhengyuan@kylinos.cn Signed-off-by: Jens Axboe axboe@kernel.dk Signed-off-by: Zhihao Cheng chengzhihao1@huawei.com Signed-off-by: yangerkun yangerkun@huawei.com Reviewed-by: zhangyi (F) yi.zhang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- fs/io_uring.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/fs/io_uring.c b/fs/io_uring.c index 7a5db8261c1c..98298147d8cb 100644 --- a/fs/io_uring.c +++ b/fs/io_uring.c @@ -428,7 +428,7 @@ static inline bool io_sequence_defer(struct io_ring_ctx *ctx, if ((req->flags & (REQ_F_IO_DRAIN|REQ_F_IO_DRAINED)) != REQ_F_IO_DRAIN) return false;
- return req->sequence > ctx->cached_cq_tail + ctx->sq_ring->dropped; + return req->sequence != ctx->cached_cq_tail + ctx->sq_ring->dropped; }
static struct io_kiocb *io_get_deferred_req(struct io_ring_ctx *ctx)
From: Zhengyuan Liu liuzhengyuan@kylinos.cn
mainline inclusion from mainline-5.3-rc2 commit f7b76ac9d17e16e44feebb6d2749fec92bfd6dd4 category: feature bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27 CVE: NA ---------------------------
We can queue a work item for each req on the defer and link lists without increasing async_list->cnt, so we shouldn't decrease the count when exiting the workqueue either, if the req wasn't processed from the async list.
Thanks to Jens Axboe axboe@kernel.dk for his guidance.
Fixes: 31b515106428 ("io_uring: allow workqueue item to handle multiple buffered requests") Signed-off-by: Zhengyuan Liu liuzhengyuan@kylinos.cn Signed-off-by: Jens Axboe axboe@kernel.dk Signed-off-by: Zhihao Cheng chengzhihao1@huawei.com Signed-off-by: yangerkun yangerkun@huawei.com Reviewed-by: zhangyi (F) yi.zhang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- fs/io_uring.c | 9 ++++++++- 1 file changed, 8 insertions(+), 1 deletion(-)
diff --git a/fs/io_uring.c b/fs/io_uring.c index 98298147d8cb..2703edeadc6a 100644 --- a/fs/io_uring.c +++ b/fs/io_uring.c @@ -333,7 +333,8 @@ struct io_kiocb { #define REQ_F_IO_DRAIN 16 /* drain existing IO first */ #define REQ_F_IO_DRAINED 32 /* drain done */ #define REQ_F_LINK 64 /* linked sqes */ -#define REQ_F_FAIL_LINK 128 /* fail rest of links */ +#define REQ_F_LINK_DONE 128 /* linked sqes done */ +#define REQ_F_FAIL_LINK 256 /* fail rest of links */ u64 user_data; u32 result; u32 sequence; @@ -631,6 +632,7 @@ static void io_req_link_next(struct io_kiocb *req) nxt->flags |= REQ_F_LINK; }
+ nxt->flags |= REQ_F_LINK_DONE; INIT_WORK(&nxt->work, io_sq_wq_submit_work); queue_work(req->ctx->sqo_wq, &nxt->work); } @@ -1843,6 +1845,10 @@ static void io_sq_wq_submit_work(struct work_struct *work) /* async context always use a copy of the sqe */ kfree(sqe);
+ /* req from defer and link list needn't decrease async cnt */ + if (req->flags & (REQ_F_IO_DRAINED | REQ_F_LINK_DONE)) + goto out; + if (!async_list) break; if (!list_empty(&req_list)) { @@ -1890,6 +1896,7 @@ static void io_sq_wq_submit_work(struct work_struct *work) } }
+out: if (cur_mm) { set_fs(old_fs); unuse_mm(cur_mm);
From: Zhengyuan Liu liuzhengyuan@kylinos.cn
mainline inclusion from mainline-5.3-rc2 commit c0e48f9dea9129aa11bec3ed13803bcc26e96e49 category: feature bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27 CVE: NA ---------------------------
There is a hang issue while using fio to do some basic test. The issue can be easily reproduced using the below script:
while true
do
        fio --ioengine=io_uring -rw=write -bs=4k -numjobs=1 \
            -size=1G -iodepth=64 -name=uring --filename=/dev/zero
done
After several minutes (or more), fio would block at io_uring_enter->io_cqring_wait in order to waiting for previously committed sqes to be completed and can't return to user anymore until we send a SIGTERM to fio. After receiving SIGTERM, fio hangs at io_ring_ctx_wait_and_kill with a backtrace like this:
[54133.243816] Call Trace:
[54133.243842]  __schedule+0x3a0/0x790
[54133.243868]  schedule+0x38/0xa0
[54133.243880]  schedule_timeout+0x218/0x3b0
[54133.243891]  ? sched_clock+0x9/0x10
[54133.243903]  ? wait_for_completion+0xa3/0x130
[54133.243916]  ? _raw_spin_unlock_irq+0x2c/0x40
[54133.243930]  ? trace_hardirqs_on+0x3f/0xe0
[54133.243951]  wait_for_completion+0xab/0x130
[54133.243962]  ? wake_up_q+0x70/0x70
[54133.243984]  io_ring_ctx_wait_and_kill+0xa0/0x1d0
[54133.243998]  io_uring_release+0x20/0x30
[54133.244008]  __fput+0xcf/0x270
[54133.244029]  ____fput+0xe/0x10
[54133.244040]  task_work_run+0x7f/0xa0
[54133.244056]  do_exit+0x305/0xc40
[54133.244067]  ? get_signal+0x13b/0xbd0
[54133.244088]  do_group_exit+0x50/0xd0
[54133.244103]  get_signal+0x18d/0xbd0
[54133.244112]  ? _raw_spin_unlock_irqrestore+0x36/0x60
[54133.244142]  do_signal+0x34/0x720
[54133.244171]  ? exit_to_usermode_loop+0x7e/0x130
[54133.244190]  exit_to_usermode_loop+0xc0/0x130
[54133.244209]  do_syscall_64+0x16b/0x1d0
[54133.244221]  entry_SYSCALL_64_after_hwframe+0x49/0xbe
The reason is that we had added a req to ctx->pending_async at the very end, but it didn't get a chance to be processed. How could this happen?
fio#cpu0                                wq#cpu1

io_add_to_prev_work                     io_sq_wq_submit_work

                                        atomic_read() <<< 1

                                        atomic_dec_return() << 1->0
                                        list_empty();  <<< true;

list_add_tail()
atomic_read() << 0 or 1?
As atomic_ops.rst states, atomic_read does not guarantee that the runtime modification by any other thread is visible yet, so we must take care of that with a proper implicit or explicit memory barrier.
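Spelled out, the ordering requirement on the submitter side looks like this (a simplified restatement of the fixed path in the diff below; the worker side is already ordered because its value-returning atomic_dec_return() implies a full barrier in the kernel memory model):

static bool add_to_prev_work(struct async_list *list, struct io_kiocb *req)
{
	bool ret = true;

	spin_lock(&list->lock);
	list_add_tail(&req->list, &list->list);
	/*
	 * Full barrier: publish the list insertion before sampling the
	 * worker's count, so the two sides cannot both miss each
	 * other's update.
	 */
	smp_mb();
	if (!atomic_read(&list->cnt)) {
		/* worker already exited; pull the req back, run it inline */
		list_del_init(&req->list);
		ret = false;
	}
	spin_unlock(&list->lock);
	return ret;
}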
This issue was detected with the help of Jackie's liuyun01@kylinos.cn
Fixes: 31b515106428 ("io_uring: allow workqueue item to handle multiple buffered requests") Signed-off-by: Zhengyuan Liu liuzhengyuan@kylinos.cn Signed-off-by: Jens Axboe axboe@kernel.dk Signed-off-by: Zhihao Cheng chengzhihao1@huawei.com Signed-off-by: yangerkun yangerkun@huawei.com Reviewed-by: zhangyi (F) yi.zhang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- fs/io_uring.c | 4 ++++ 1 file changed, 4 insertions(+)
diff --git a/fs/io_uring.c b/fs/io_uring.c index 2703edeadc6a..8169b327d4ac 100644 --- a/fs/io_uring.c +++ b/fs/io_uring.c @@ -1923,6 +1923,10 @@ static bool io_add_to_prev_work(struct async_list *list, struct io_kiocb *req) ret = true; spin_lock(&list->lock); list_add_tail(&req->list, &list->list); + /* + * Ensure we see a simultaneous modification from io_sq_wq_submit_work() + */ + smp_mb(); if (!atomic_read(&list->cnt)) { list_del_init(&req->list); ret = false;
From: Jens Axboe axboe@kernel.dk
mainline inclusion from mainline-5.3-rc2 commit bd11b3a391e3df6fa958facbe4b3f9f4cca9bd49 category: feature bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27 CVE: NA ---------------------------
Hrvoje reports that when a large fixed buffer is registered and IO is being done to the latter pages of said buffer, the IO submission time is much worse:
reading to the start of the buffer:   11238 ns
reading to the end of the buffer:   1039879 ns
In fact, it's worse by two orders of magnitude. The reason for that is how io_uring figures out how to setup the iov_iter. We point the iter at the first bvec, and then use iov_iter_advance() to fast-forward to the offset within that buffer we need.
However, that is abysmally slow, as it entails iterating the bvecs that we setup as part of buffer registration. There's really no need to use this generic helper, as we know it's a BVEC type iterator, and we also know that each bvec is PAGE_SIZE in size, apart from possibly the first and last. Hence we can just use a shift on the offset to find the right index, and then adjust the iov_iter appropriately. After this fix, the timings are:
reading to the start of the buffer:   10135 ns
reading to the end of the buffer:      1377 ns
Or about a 755x improvement for the tail page.
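To make the index math concrete, here is a standalone illustration (assumes 4K pages and a first bvec that is one full page; not kernel code):

#include <stdio.h>

#define PAGE_SHIFT	12
#define PAGE_SIZE	(1UL << PAGE_SHIFT)
#define PAGE_MASK	(~(PAGE_SIZE - 1))

int main(void)
{
	unsigned long first_bv_len = PAGE_SIZE;		/* first segment */
	unsigned long offset = 5 * 1024 * 1024 + 100;	/* 5MB+100B in */
	unsigned long seg_skip, iov_offset;

	/* skip the first vec, then shift to find the segment index */
	offset -= first_bv_len;
	seg_skip = 1 + (offset >> PAGE_SHIFT);
	iov_offset = offset & ~PAGE_MASK;

	/* prints: start at bvec 1280, 100 bytes in */
	printf("start at bvec %lu, %lu bytes in\n", seg_skip, iov_offset);
	return 0;
}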
Reported-by: Hrvoje Zeba zeba.hrvoje@gmail.com Tested-by: Hrvoje Zeba zeba.hrvoje@gmail.com Signed-off-by: Jens Axboe axboe@kernel.dk Signed-off-by: Zhihao Cheng chengzhihao1@huawei.com
Conflicts: fs/io_uring.c [conflicts with ITER_BVEC in iov_iter_bvec] Signed-off-by: yangerkun yangerkun@huawei.com Reviewed-by: zhangyi (F) yi.zhang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- fs/io_uring.c | 40 ++++++++++++++++++++++++++++++++++++++-- 1 file changed, 38 insertions(+), 2 deletions(-)
diff --git a/fs/io_uring.c b/fs/io_uring.c index 8169b327d4ac..79a38405be68 100644 --- a/fs/io_uring.c +++ b/fs/io_uring.c @@ -1065,8 +1065,44 @@ static int io_import_fixed(struct io_ring_ctx *ctx, int rw, */ offset = buf_addr - imu->ubuf; iov_iter_bvec(iter, ITER_BVEC | rw, imu->bvec, imu->nr_bvecs, offset + len); - if (offset) - iov_iter_advance(iter, offset); + + if (offset) { + /* + * Don't use iov_iter_advance() here, as it's really slow for + * using the latter parts of a big fixed buffer - it iterates + * over each segment manually. We can cheat a bit here, because + * we know that: + * + * 1) it's a BVEC iter, we set it up + * 2) all bvecs are PAGE_SIZE in size, except potentially the + * first and last bvec + * + * So just find our index, and adjust the iterator afterwards. + * If the offset is within the first bvec (or the whole first + * bvec, just use iov_iter_advance(). This makes it easier + * since we can just skip the first segment, which may not + * be PAGE_SIZE aligned. + */ + const struct bio_vec *bvec = imu->bvec; + + if (offset <= bvec->bv_len) { + iov_iter_advance(iter, offset); + } else { + unsigned long seg_skip; + + /* skip first vec */ + offset -= bvec->bv_len; + seg_skip = 1 + (offset >> PAGE_SHIFT); + + iter->bvec = bvec + seg_skip; + iter->nr_segs -= seg_skip; + iter->count -= (seg_skip << PAGE_SHIFT); + iter->iov_offset = offset & ~PAGE_MASK; + if (iter->iov_offset) + iter->count -= iter->iov_offset; + } + } + return 0; }
From: Zhengyuan Liu liuzhengyuan@kylinos.cn
mainline inclusion from mainline-5.3-rc2 commit 9310a7ba6de8cce6209e3e8a3cdf733f824cdd9b category: feature bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27 CVE: NA ---------------------------
We are using PAGE_SIZE as the unit to determine whether the total len in async_list has exceeded max_pages, which is inaccurate for smaller IO sizes. For example, if we are doing 1k-sized IO streams, we will never exceed max_pages, since len >>= PAGE_SHIFT always yields zero. So track the original byte count instead to make it accurate, as the sketch below illustrates.
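A standalone illustration of the accounting skew (assumes 4K pages; not kernel code):

#include <stdio.h>

#define PAGE_SHIFT	12

int main(void)
{
	unsigned long len = 1024;	/* one 1k io */

	/* page-granularity accounting never sees sub-page IOs ... */
	printf("pages accounted per io: %lu\n", len >> PAGE_SHIFT);	/* 0 */
	/* ... while byte accounting accumulates as expected */
	printf("bytes accounted per io: %lu\n", len);			/* 1024 */
	return 0;
}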
Signed-off-by: Zhengyuan Liu liuzhengyuan@kylinos.cn Signed-off-by: Jens Axboe axboe@kernel.dk
Conflicts: fs/io_uring.c [ Patch b5420237ec81("mm: refactor readahead defines in mm.h") is not applied. ]
Signed-off-by: Zhihao Cheng chengzhihao1@huawei.com Signed-off-by: yangerkun yangerkun@huawei.com Reviewed-by: zhangyi (F) yi.zhang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- fs/io_uring.c | 24 +++++++++++------------- 1 file changed, 11 insertions(+), 13 deletions(-)
diff --git a/fs/io_uring.c b/fs/io_uring.c index 79a38405be68..e027f5bde537 100644 --- a/fs/io_uring.c +++ b/fs/io_uring.c @@ -202,7 +202,7 @@ struct async_list {
struct file *file; off_t io_end; - size_t io_pages; + size_t io_len; };
struct io_ring_ctx { @@ -1157,28 +1157,26 @@ static void io_async_list_note(int rw, struct io_kiocb *req, size_t len) off_t io_end = kiocb->ki_pos + len;
if (filp == async_list->file && kiocb->ki_pos == async_list->io_end) { - unsigned long max_pages; + unsigned long max_bytes;
/* Use 8x RA size as a decent limiter for both reads/writes */ - max_pages = filp->f_ra.ra_pages; - if (!max_pages) - max_pages = VM_MAX_READAHEAD >> (PAGE_SHIFT - 10); - max_pages *= 8; - - /* If max pages are exceeded, reset the state */ - len >>= PAGE_SHIFT; - if (async_list->io_pages + len <= max_pages) { + max_bytes = filp->f_ra.ra_pages << (PAGE_SHIFT + 3); + if (!max_bytes) + max_bytes = VM_MAX_READAHEAD << 7; + + /* If max len are exceeded, reset the state */ + if (async_list->io_len + len <= max_bytes) { req->flags |= REQ_F_SEQ_PREV; - async_list->io_pages += len; + async_list->io_len += len; } else { io_end = 0; - async_list->io_pages = 0; + async_list->io_len = 0; } }
/* New file? Reset state. */ if (async_list->file != filp) { - async_list->io_pages = 0; + async_list->io_len = 0; async_list->file = filp; } async_list->io_end = io_end;
From: Jens Axboe axboe@kernel.dk
mainline inclusion from mainline-5.3-rc2 commit 36703247d5f52a679df9da51192b6950fe81689f category: feature bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27 CVE: NA ---------------------------
Daniel reports that when testing an http server that uses io_uring to poll for incoming connections, sometimes it hard crashes. This is due to an uninitialized list member for the io_uring request. Normally this doesn't trigger and none of the test cases caught it.
Reported-by: Daniel Kozak kozzi11@gmail.com Tested-by: Daniel Kozak kozzi11@gmail.com Signed-off-by: Jens Axboe axboe@kernel.dk Signed-off-by: Zhihao Cheng chengzhihao1@huawei.com Signed-off-by: yangerkun yangerkun@huawei.com Reviewed-by: zhangyi (F) yi.zhang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- fs/io_uring.c | 2 ++ 1 file changed, 2 insertions(+)
diff --git a/fs/io_uring.c b/fs/io_uring.c index e027f5bde537..255aaec5a5d6 100644 --- a/fs/io_uring.c +++ b/fs/io_uring.c @@ -1665,6 +1665,8 @@ static int io_poll_add(struct io_kiocb *req, const struct io_uring_sqe *sqe) INIT_LIST_HEAD(&poll->wait.entry); init_waitqueue_func_entry(&poll->wait, io_poll_wake);
+ INIT_LIST_HEAD(&req->list); + mask = vfs_poll(poll->file, &ipt.pt) & poll->events;
spin_lock_irq(&ctx->completion_lock);
From: Jackie Liu liuyun01@kylinos.cn
mainline inclusion from mainline-5.4-rc1 commit a1041c27b64ce744632147e19701c95fed14fab1 category: feature bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27 CVE: NA ---------------------------
Sometimes io_get_req() will return NULL, in which case we need to do the correct error handling; otherwise it will cause a kernel NULL pointer dereference.
Fixes: 4fe2c963154c ("io_uring: add support for link with drain") Signed-off-by: Jackie Liu liuyun01@kylinos.cn Signed-off-by: Jens Axboe axboe@kernel.dk Signed-off-by: Zhihao Cheng chengzhihao1@huawei.com Signed-off-by: yangerkun yangerkun@huawei.com Reviewed-by: zhangyi (F) yi.zhang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- fs/io_uring.c | 6 ++++++ 1 file changed, 6 insertions(+)
diff --git a/fs/io_uring.c b/fs/io_uring.c index 6b2295fcb355..3eeee5a03fc8 100644 --- a/fs/io_uring.c +++ b/fs/io_uring.c @@ -2363,12 +2363,15 @@ static int io_submit_sqes(struct io_ring_ctx *ctx, struct sqe_submit *sqes, if (link && (sqes[i].sqe->flags & IOSQE_IO_DRAIN)) { if (!shadow_req) { shadow_req = io_get_req(ctx, NULL); + if (unlikely(!shadow_req)) + goto out; shadow_req->flags |= (REQ_F_IO_DRAIN | REQ_F_SHADOW_DRAIN); refcount_dec(&shadow_req->refs); } shadow_req->sequence = sqes[i].sequence; }
+out: if (unlikely(mm_fault)) { io_cqring_add_event(ctx, sqes[i].sqe->user_data, -EFAULT); @@ -2550,12 +2553,15 @@ static int io_ring_submit(struct io_ring_ctx *ctx, unsigned int to_submit, if (link && (s.sqe->flags & IOSQE_IO_DRAIN)) { if (!shadow_req) { shadow_req = io_get_req(ctx, NULL); + if (unlikely(!shadow_req)) + goto out; shadow_req->flags |= (REQ_F_IO_DRAIN | REQ_F_SHADOW_DRAIN); refcount_dec(&shadow_req->refs); } shadow_req->sequence = s.sequence; }
+out: s.has_user = true; s.needs_lock = false; s.needs_fixed_file = false;
From: Jens Axboe axboe@kernel.dk
mainline inclusion from mainline-5.4-rc1 commit 9831a90ce64362f8429e8fd23838a9db2cdf7803 category: feature bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27 CVE: NA ---------------------------
If preempt isn't enabled in the kernel, we can run into hang issues with sqthread submissions. Use cond_resched() to play nice instead of cpu_relax(), if we end up starting the loop and not having any events pending for submissions.
Signed-off-by: Jens Axboe axboe@kernel.dk Signed-off-by: Zhihao Cheng chengzhihao1@huawei.com Signed-off-by: yangerkun yangerkun@huawei.com Reviewed-by: zhangyi (F) yi.zhang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- fs/io_uring.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/fs/io_uring.c b/fs/io_uring.c index 3eeee5a03fc8..896ca486d4ae 100644 --- a/fs/io_uring.c +++ b/fs/io_uring.c @@ -2437,7 +2437,7 @@ static int io_sq_thread(void *data) * to sleep. */ if (inflight || !time_after(jiffies, timeout)) { - cpu_relax(); + cond_resched(); continue; }
From: Jens Axboe axboe@kernel.dk
mainline inclusion from mainline-5.4-rc1 commit 5262f567987d3c30052b22e78c35c2313d07b230 category: feature bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27 CVE: NA ---------------------------
There have been a few requests for functionality similar to io_getevents() and epoll_wait(), where the user can specify a timeout for waiting on events. I deliberately did not add support for this through the system call initially to avoid overloading the args, but I can see that the use cases for this are valid.
This adds support for IORING_OP_TIMEOUT. If a user wants to get woken while waiting for events, simply submit one of these timeout commands alongside (or before) the wait call. This ensures that the application sleeping on the CQ ring waiting for events will get woken. The timeout command is passed in as a pointer to a struct timespec, and timeouts are relative. The timeout command also includes a way to auto-cancel after N events have passed.
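As a rough userspace illustration (not part of the patch), an SQE for this opcode could be filled in as below. The field names come from the UAPI hunk in this patch; note that at this point in the series the argument is still a plain struct timespec (a later patch in this series switches the ABI to __kernel_timespec).

/* Hypothetical sketch: arm a relative 1s timeout that auto-cancels
 * once 8 completions have been posted. */
#include <string.h>
#include <time.h>
#include <linux/io_uring.h>

static void prep_timeout(struct io_uring_sqe *sqe, struct timespec *ts)
{
	ts->tv_sec = 1;
	ts->tv_nsec = 0;

	memset(sqe, 0, sizeof(*sqe));
	sqe->opcode = IORING_OP_TIMEOUT;
	sqe->addr = (unsigned long) ts;	/* pointer to the timespec */
	sqe->len = 1;			/* must be 1 */
	sqe->off = 8;			/* auto-cancel after 8 events */
	sqe->user_data = 0x1234;	/* echoed back in the CQE */
}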
Signed-off-by: Jens Axboe axboe@kernel.dk Signed-off-by: Zhihao Cheng chengzhihao1@huawei.com Signed-off-by: yangerkun yangerkun@huawei.com Reviewed-by: zhangyi (F) yi.zhang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- fs/io_uring.c | 149 ++++++++++++++++++++++++++++++++-- include/uapi/linux/io_uring.h | 2 + 2 files changed, 146 insertions(+), 5 deletions(-)
diff --git a/fs/io_uring.c b/fs/io_uring.c index 896ca486d4ae..defb917aa2aa 100644 --- a/fs/io_uring.c +++ b/fs/io_uring.c @@ -200,6 +200,7 @@ struct io_ring_ctx { struct io_uring_sqe *sq_sqes;
struct list_head defer_list; + struct list_head timeout_list; } ____cacheline_aligned_in_smp;
/* IO offload */ @@ -216,6 +217,7 @@ struct io_ring_ctx { struct wait_queue_head cq_wait; struct fasync_struct *cq_fasync; struct eventfd_ctx *cq_ev_fd; + atomic_t cq_timeouts; } ____cacheline_aligned_in_smp;
struct io_rings *rings; @@ -283,6 +285,11 @@ struct io_poll_iocb { struct wait_queue_entry wait; };
+struct io_timeout { + struct file *file; + struct hrtimer timer; +}; + /* * NOTE! Each of the iocb union members has the file pointer * as the first entry in their struct definition. So you can @@ -294,6 +301,7 @@ struct io_kiocb { struct file *file; struct kiocb rw; struct io_poll_iocb poll; + struct io_timeout timeout; };
struct sqe_submit submit; @@ -313,6 +321,7 @@ struct io_kiocb { #define REQ_F_LINK_DONE 128 /* linked sqes done */ #define REQ_F_FAIL_LINK 256 /* fail rest of links */ #define REQ_F_SHADOW_DRAIN 512 /* link-drain shadow req */ +#define REQ_F_TIMEOUT 1024 /* timeout request */ u64 user_data; u32 result; u32 sequence; @@ -344,6 +353,8 @@ struct io_submit_state { };
static void io_sq_wq_submit_work(struct work_struct *work); +static void io_cqring_fill_event(struct io_ring_ctx *ctx, u64 ki_user_data, + long res); static void __io_free_req(struct io_kiocb *req);
static struct kmem_cache *req_cachep; @@ -399,26 +410,30 @@ static struct io_ring_ctx *io_ring_ctx_alloc(struct io_uring_params *p) INIT_LIST_HEAD(&ctx->poll_list); INIT_LIST_HEAD(&ctx->cancel_list); INIT_LIST_HEAD(&ctx->defer_list); + INIT_LIST_HEAD(&ctx->timeout_list); return ctx; }
static inline bool io_sequence_defer(struct io_ring_ctx *ctx, struct io_kiocb *req) { - if ((req->flags & (REQ_F_IO_DRAIN|REQ_F_IO_DRAINED)) != REQ_F_IO_DRAIN) + /* timeout requests always honor sequence */ + if (!(req->flags & REQ_F_TIMEOUT) && + (req->flags & (REQ_F_IO_DRAIN|REQ_F_IO_DRAINED)) != REQ_F_IO_DRAIN) return false;
return req->sequence != ctx->cached_cq_tail + ctx->rings->sq_dropped; }
-static struct io_kiocb *io_get_deferred_req(struct io_ring_ctx *ctx) +static struct io_kiocb *__io_get_deferred_req(struct io_ring_ctx *ctx, + struct list_head *list) { struct io_kiocb *req;
- if (list_empty(&ctx->defer_list)) + if (list_empty(list)) return NULL;
- req = list_first_entry(&ctx->defer_list, struct io_kiocb, list); + req = list_first_entry(list, struct io_kiocb, list); if (!io_sequence_defer(ctx, req)) { list_del_init(&req->list); return req; @@ -427,6 +442,16 @@ static struct io_kiocb *io_get_deferred_req(struct io_ring_ctx *ctx) return NULL; }
+static struct io_kiocb *io_get_deferred_req(struct io_ring_ctx *ctx) +{ + return __io_get_deferred_req(ctx, &ctx->defer_list); +} + +static struct io_kiocb *io_get_timeout_req(struct io_ring_ctx *ctx) +{ + return __io_get_deferred_req(ctx, &ctx->timeout_list); +} + static void __io_commit_cqring(struct io_ring_ctx *ctx) { struct io_rings *rings = ctx->rings; @@ -459,10 +484,36 @@ static inline void io_queue_async_work(struct io_ring_ctx *ctx, queue_work(ctx->sqo_wq[rw], &req->work); }
+static void io_kill_timeout(struct io_kiocb *req) +{ + int ret; + + ret = hrtimer_try_to_cancel(&req->timeout.timer); + if (ret != -1) { + atomic_inc(&req->ctx->cq_timeouts); + list_del(&req->list); + io_cqring_fill_event(req->ctx, req->user_data, 0); + __io_free_req(req); + } +} + +static void io_kill_timeouts(struct io_ring_ctx *ctx) +{ + struct io_kiocb *req, *tmp; + + spin_lock_irq(&ctx->completion_lock); + list_for_each_entry_safe(req, tmp, &ctx->timeout_list, list) + io_kill_timeout(req); + spin_unlock_irq(&ctx->completion_lock); +} + static void io_commit_cqring(struct io_ring_ctx *ctx) { struct io_kiocb *req;
+ while ((req = io_get_timeout_req(ctx)) != NULL) + io_kill_timeout(req); + __io_commit_cqring(ctx);
while ((req = io_get_deferred_req(ctx)) != NULL) { @@ -1764,6 +1815,81 @@ static int io_poll_add(struct io_kiocb *req, const struct io_uring_sqe *sqe) return ipt.error; }
+static enum hrtimer_restart io_timeout_fn(struct hrtimer *timer) +{ + struct io_ring_ctx *ctx; + struct io_kiocb *req; + unsigned long flags; + + req = container_of(timer, struct io_kiocb, timeout.timer); + ctx = req->ctx; + atomic_inc(&ctx->cq_timeouts); + + spin_lock_irqsave(&ctx->completion_lock, flags); + list_del(&req->list); + + io_cqring_fill_event(ctx, req->user_data, -ETIME); + io_commit_cqring(ctx); + spin_unlock_irqrestore(&ctx->completion_lock, flags); + + io_cqring_ev_posted(ctx); + + io_put_req(req); + return HRTIMER_NORESTART; +} + +static int io_timeout(struct io_kiocb *req, const struct io_uring_sqe *sqe) +{ + unsigned count, req_dist, tail_index; + struct io_ring_ctx *ctx = req->ctx; + struct list_head *entry; + struct timespec ts; + + if (unlikely(ctx->flags & IORING_SETUP_IOPOLL)) + return -EINVAL; + if (sqe->flags || sqe->ioprio || sqe->buf_index || sqe->timeout_flags || + sqe->len != 1) + return -EINVAL; + if (copy_from_user(&ts, (void __user *) (unsigned long) sqe->addr, + sizeof(ts))) + return -EFAULT; + + /* + * sqe->off holds how many events that need to occur for this + * timeout event to be satisfied. + */ + count = READ_ONCE(sqe->off); + if (!count) + count = 1; + + req->sequence = ctx->cached_sq_head + count - 1; + req->flags |= REQ_F_TIMEOUT; + + /* + * Insertion sort, ensuring the first entry in the list is always + * the one we need first. + */ + tail_index = ctx->cached_cq_tail - ctx->rings->sq_dropped; + req_dist = req->sequence - tail_index; + spin_lock_irq(&ctx->completion_lock); + list_for_each_prev(entry, &ctx->timeout_list) { + struct io_kiocb *nxt = list_entry(entry, struct io_kiocb, list); + unsigned dist; + + dist = nxt->sequence - tail_index; + if (req_dist >= dist) + break; + } + list_add(&req->list, entry); + spin_unlock_irq(&ctx->completion_lock); + + hrtimer_init(&req->timeout.timer, CLOCK_MONOTONIC, HRTIMER_MODE_REL); + req->timeout.timer.function = io_timeout_fn; + hrtimer_start(&req->timeout.timer, timespec_to_ktime(ts), + HRTIMER_MODE_REL); + return 0; +} + static int io_req_defer(struct io_ring_ctx *ctx, struct io_kiocb *req, const struct io_uring_sqe *sqe) { @@ -1841,6 +1967,9 @@ static int __io_submit_sqe(struct io_ring_ctx *ctx, struct io_kiocb *req, case IORING_OP_RECVMSG: ret = io_recvmsg(req, s->sqe, force_nonblock); break; + case IORING_OP_TIMEOUT: + ret = io_timeout(req, s->sqe); + break; default: ret = -EINVAL; break; @@ -2598,6 +2727,7 @@ static int io_cqring_wait(struct io_ring_ctx *ctx, int min_events, const sigset_t __user *sig, size_t sigsz) { struct io_rings *rings = ctx->rings; + unsigned nr_timeouts; int ret;
if (io_cqring_events(rings) >= min_events) @@ -2616,7 +2746,15 @@ static int io_cqring_wait(struct io_ring_ctx *ctx, int min_events, return ret; }
- ret = wait_event_interruptible(ctx->wait, io_cqring_events(rings) >= min_events); + nr_timeouts = atomic_read(&ctx->cq_timeouts); + /* + * Return if we have enough events, or if a timeout occured since + * we started waiting. For timeouts, we always want to return to + * userspace. + */ + ret = wait_event_interruptible(ctx->wait, + io_cqring_events(rings) >= min_events || + atomic_read(&ctx->cq_timeouts) != nr_timeouts); restore_saved_sigmask_unless(ret == -ERESTARTSYS); if (ret == -ERESTARTSYS) ret = -EINTR; @@ -3288,6 +3426,7 @@ static void io_ring_ctx_wait_and_kill(struct io_ring_ctx *ctx) percpu_ref_kill(&ctx->refs); mutex_unlock(&ctx->uring_lock);
+ io_kill_timeouts(ctx); io_poll_remove_all(ctx); io_iopoll_reap_events(ctx); wait_for_completion(&ctx->ctx_done); diff --git a/include/uapi/linux/io_uring.h b/include/uapi/linux/io_uring.h index 96ee9d94b73e..ea57526a5b89 100644 --- a/include/uapi/linux/io_uring.h +++ b/include/uapi/linux/io_uring.h @@ -28,6 +28,7 @@ struct io_uring_sqe { __u16 poll_events; __u32 sync_range_flags; __u32 msg_flags; + __u32 timeout_flags; }; __u64 user_data; /* data to be passed back at completion time */ union { @@ -61,6 +62,7 @@ struct io_uring_sqe { #define IORING_OP_SYNC_FILE_RANGE 8 #define IORING_OP_SENDMSG 9 #define IORING_OP_RECVMSG 10 +#define IORING_OP_TIMEOUT 11
/* * sqe->fsync_flags
From: Jens Axboe axboe@kernel.dk
mainline inclusion from mainline-5.4-rc1 commit 32960613b7c3352ddf38c42596e28a16ae36335e category: feature bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27 CVE: NA ---------------------------
Currently we just -EINVAL a read or write to an fd that isn't backed by ->read_iter() or ->write_iter(). But we can handle them just fine, as long as we punt to async context first.
Implement a simple loop function for doing ->read() or ->write() instead, and ensure we call it appropriately.
Reported-by: 李通洲 carter.li@eoitek.com Signed-off-by: Jens Axboe axboe@kernel.dk Signed-off-by: Zhihao Cheng chengzhihao1@huawei.com Signed-off-by: yangerkun yangerkun@huawei.com Reviewed-by: zhangyi (F) yi.zhang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- fs/io_uring.c | 60 +++++++++++++++++++++++++++++++++++++++++++++------ 1 file changed, 54 insertions(+), 6 deletions(-)
diff --git a/fs/io_uring.c b/fs/io_uring.c index defb917aa2aa..8236e80c99e3 100644 --- a/fs/io_uring.c +++ b/fs/io_uring.c @@ -1297,6 +1297,51 @@ static void io_async_list_note(int rw, struct io_kiocb *req, size_t len) } }
+/* + * For files that don't have ->read_iter() and ->write_iter(), handle them + * by looping over ->read() or ->write() manually. + */ +static ssize_t loop_rw_iter(int rw, struct file *file, struct kiocb *kiocb, + struct iov_iter *iter) +{ + ssize_t ret = 0; + + /* + * Don't support polled IO through this interface, and we can't + * support non-blocking either. For the latter, this just causes + * the kiocb to be handled from an async context. + */ + if (kiocb->ki_flags & IOCB_HIPRI) + return -EOPNOTSUPP; + if (kiocb->ki_flags & IOCB_NOWAIT) + return -EAGAIN; + + while (iov_iter_count(iter)) { + struct iovec iovec = iov_iter_iovec(iter); + ssize_t nr; + + if (rw == READ) { + nr = file->f_op->read(file, iovec.iov_base, + iovec.iov_len, &kiocb->ki_pos); + } else { + nr = file->f_op->write(file, iovec.iov_base, + iovec.iov_len, &kiocb->ki_pos); + } + + if (nr < 0) { + if (!ret) + ret = nr; + break; + } + ret += nr; + if (nr != iovec.iov_len) + break; + iov_iter_advance(iter, nr); + } + + return ret; +} + static int io_read(struct io_kiocb *req, const struct sqe_submit *s, bool force_nonblock) { @@ -1314,8 +1359,6 @@ static int io_read(struct io_kiocb *req, const struct sqe_submit *s,
if (unlikely(!(file->f_mode & FMODE_READ))) return -EBADF; - if (unlikely(!file->f_op->read_iter)) - return -EINVAL;
ret = io_import_iovec(req->ctx, READ, s, &iovec, &iter); if (ret < 0) @@ -1330,7 +1373,11 @@ static int io_read(struct io_kiocb *req, const struct sqe_submit *s, if (!ret) { ssize_t ret2;
- ret2 = call_read_iter(file, kiocb, &iter); + if (file->f_op->read_iter) + ret2 = call_read_iter(file, kiocb, &iter); + else + ret2 = loop_rw_iter(READ, file, kiocb, &iter); + /* * In case of a short read, punt to async. This can happen * if we have data partially cached. Alternatively we can @@ -1375,8 +1422,6 @@ static int io_write(struct io_kiocb *req, const struct sqe_submit *s, file = kiocb->ki_filp; if (unlikely(!(file->f_mode & FMODE_WRITE))) return -EBADF; - if (unlikely(!file->f_op->write_iter)) - return -EINVAL;
ret = io_import_iovec(req->ctx, WRITE, s, &iovec, &iter); if (ret < 0) @@ -1414,7 +1459,10 @@ static int io_write(struct io_kiocb *req, const struct sqe_submit *s, } kiocb->ki_flags |= IOCB_WRITE;
- ret2 = call_write_iter(file, kiocb, &iter); + if (file->f_op->write_iter) + ret2 = call_write_iter(file, kiocb, &iter); + else + ret2 = loop_rw_iter(WRITE, file, kiocb, &iter); if (!force_nonblock || ret2 != -EAGAIN) { io_rw_done(kiocb, ret2); } else {
From: yangerkun yangerkun@huawei.com
mainline inclusion from mainline-5.4-rc1 commit daa5de5415849b9a53056ec1e1e88fe4c5c9aa2b category: feature bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27 CVE: NA ---------------------------
After 75b28af ("io_uring: allocate the two rings together"), we compare sq.head with cached_cq_tail to determine whether there are any CQEs pending for the application. Actually, we should use cq.head.
Fixes: 75b28affdd6a ("io_uring: allocate the two rings together") Signed-off-by: yangerkun yangerkun@huawei.com Signed-off-by: Jens Axboe axboe@kernel.dk Signed-off-by: Zhihao Cheng chengzhihao1@huawei.com Signed-off-by: yangerkun yangerkun@huawei.com Reviewed-by: zhangyi (F) yi.zhang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- fs/io_uring.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/fs/io_uring.c b/fs/io_uring.c index 8236e80c99e3..ea88223e6dc6 100644 --- a/fs/io_uring.c +++ b/fs/io_uring.c @@ -3455,7 +3455,7 @@ static __poll_t io_uring_poll(struct file *file, poll_table *wait) if (READ_ONCE(ctx->rings->sq.tail) - ctx->cached_sq_head != ctx->rings->sq_ring_entries) mask |= EPOLLOUT | EPOLLWRNORM; - if (READ_ONCE(ctx->rings->sq.head) != ctx->cached_cq_tail) + if (READ_ONCE(ctx->rings->cq.head) != ctx->cached_cq_tail) mask |= EPOLLIN | EPOLLRDNORM;
return mask;
From: Jens Axboe axboe@kernel.dk
mainline inclusion from mainline-5.4-rc1 commit bda521624e75c665c407b3d9cece6e7a28178cd8 category: feature bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27 CVE: NA ---------------------------
For batched IO, it's not uncommon for waiters to ask for more than 1 IO to complete before being woken up. This is a problem with wait_event(), since tasks will get woken for every IO that completes, re-check the condition, then go back to sleep. For the batch counts used for high IOPS, that can result in tens of extra wakeups for the waiting task.
Add a private wake function that checks for the wake up count criteria being met before calling autoremove_wake_function(). Pavel reports that one test case he has runs 40% faster with proper batching of wakeups.
Reported-by: Pavel Begunkov asml.silence@gmail.com Tested-by: Pavel Begunkov asml.silence@gmail.com Reviewed-by: Pavel Begunkov asml.silence@gmail.com Signed-off-by: Jens Axboe axboe@kernel.dk Signed-off-by: Zhihao Cheng chengzhihao1@huawei.com Signed-off-by: yangerkun yangerkun@huawei.com Reviewed-by: zhangyi (F) yi.zhang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- fs/io_uring.c | 66 +++++++++++++++++++++++++++++++++++++++++++-------- 1 file changed, 56 insertions(+), 10 deletions(-)
diff --git a/fs/io_uring.c b/fs/io_uring.c index ea88223e6dc6..cb67fc03f2f8 100644 --- a/fs/io_uring.c +++ b/fs/io_uring.c @@ -2767,6 +2767,38 @@ static int io_ring_submit(struct io_ring_ctx *ctx, unsigned int to_submit, return submit; }
+struct io_wait_queue { + struct wait_queue_entry wq; + struct io_ring_ctx *ctx; + unsigned to_wait; + unsigned nr_timeouts; +}; + +static inline bool io_should_wake(struct io_wait_queue *iowq) +{ + struct io_ring_ctx *ctx = iowq->ctx; + + /* + * Wake up if we have enough events, or if a timeout occured since we + * started waiting. For timeouts, we always want to return to userspace, + * regardless of event count. + */ + return io_cqring_events(ctx->rings) >= iowq->to_wait || + atomic_read(&ctx->cq_timeouts) != iowq->nr_timeouts; +} + +static int io_wake_function(struct wait_queue_entry *curr, unsigned int mode, + int wake_flags, void *key) +{ + struct io_wait_queue *iowq = container_of(curr, struct io_wait_queue, + wq); + + if (!io_should_wake(iowq)) + return -1; + + return autoremove_wake_function(curr, mode, wake_flags, key); +} + /* * Wait until events become available, if we don't already have some. The * application must reap them itself, as they reside on the shared cq ring. @@ -2774,8 +2806,16 @@ static int io_ring_submit(struct io_ring_ctx *ctx, unsigned int to_submit, static int io_cqring_wait(struct io_ring_ctx *ctx, int min_events, const sigset_t __user *sig, size_t sigsz) { + struct io_wait_queue iowq = { + .wq = { + .private = current, + .func = io_wake_function, + .entry = LIST_HEAD_INIT(iowq.wq.entry), + }, + .ctx = ctx, + .to_wait = min_events, + }; struct io_rings *rings = ctx->rings; - unsigned nr_timeouts; int ret;
if (io_cqring_events(rings) >= min_events) @@ -2794,15 +2834,21 @@ static int io_cqring_wait(struct io_ring_ctx *ctx, int min_events, return ret; }
- nr_timeouts = atomic_read(&ctx->cq_timeouts); - /* - * Return if we have enough events, or if a timeout occured since - * we started waiting. For timeouts, we always want to return to - * userspace. - */ - ret = wait_event_interruptible(ctx->wait, - io_cqring_events(rings) >= min_events || - atomic_read(&ctx->cq_timeouts) != nr_timeouts); + ret = 0; + iowq.nr_timeouts = atomic_read(&ctx->cq_timeouts); + do { + prepare_to_wait_exclusive(&ctx->wait, &iowq.wq, + TASK_INTERRUPTIBLE); + if (io_should_wake(&iowq)) + break; + schedule(); + if (signal_pending(current)) { + ret = -ERESTARTSYS; + break; + } + } while (1); + finish_wait(&ctx->wait, &iowq.wq); + restore_saved_sigmask_unless(ret == -ERESTARTSYS); if (ret == -ERESTARTSYS) ret = -EINTR;
From: Arnd Bergmann arnd@arndb.de
mainline inclusion from mainline-5.4-rc2 commit bdf200731145f07a6127cb16753e2e8fdc159cf4 category: feature bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27 CVE: NA ---------------------------
All system calls use struct __kernel_timespec instead of the old struct timespec, but this one was just added with the old-style ABI. Change it now to enforce the use of __kernel_timespec, avoiding ABI confusion and the need for compat handlers on 32-bit architectures.
Any user space caller will have to use __kernel_timespec now, but this is unambiguous and works for any C library regardless of the time_t definition. A nicer way to specify the timeout would have been a less ambiguous 64-bit nanosecond value, but I suppose it's too late now to change that as this would impact both 32-bit and 64-bit users.
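For illustration only (not from the commit), the caller-side declaration now looks like the sketch below; struct __kernel_timespec comes from <linux/time_types.h> and uses a 64-bit tv_sec on every architecture, which is what removes the compat ambiguity.

#include <linux/time_types.h>

/* Sketch: identical layout on 32-bit and 64-bit userland. */
struct __kernel_timespec ts = {
	.tv_sec  = 2,		/* 64-bit even where time_t is 32-bit */
	.tv_nsec = 500000000,	/* 2.5s relative timeout */
};
/* then: sqe->addr = (unsigned long) &ts; as in the earlier sketch */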
Fixes: 5262f567987d ("io_uring: IORING_OP_TIMEOUT support") Signed-off-by: Arnd Bergmann arnd@arndb.de Signed-off-by: Jens Axboe axboe@kernel.dk Signed-off-by: Zhihao Cheng chengzhihao1@huawei.com Signed-off-by: yangerkun yangerkun@huawei.com Reviewed-by: zhangyi (F) yi.zhang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- fs/io_uring.c | 8 ++++---- 1 file changed, 4 insertions(+), 4 deletions(-)
diff --git a/fs/io_uring.c b/fs/io_uring.c index cb67fc03f2f8..d78e64d66acc 100644 --- a/fs/io_uring.c +++ b/fs/io_uring.c @@ -1891,15 +1891,15 @@ static int io_timeout(struct io_kiocb *req, const struct io_uring_sqe *sqe) unsigned count, req_dist, tail_index; struct io_ring_ctx *ctx = req->ctx; struct list_head *entry; - struct timespec ts; + struct timespec64 ts;
if (unlikely(ctx->flags & IORING_SETUP_IOPOLL)) return -EINVAL; if (sqe->flags || sqe->ioprio || sqe->buf_index || sqe->timeout_flags || sqe->len != 1) return -EINVAL; - if (copy_from_user(&ts, (void __user *) (unsigned long) sqe->addr, - sizeof(ts))) + + if (get_timespec64(&ts, u64_to_user_ptr(sqe->addr))) return -EFAULT;
/* @@ -1933,7 +1933,7 @@ static int io_timeout(struct io_kiocb *req, const struct io_uring_sqe *sqe)
hrtimer_init(&req->timeout.timer, CLOCK_MONOTONIC, HRTIMER_MODE_REL); req->timeout.timer.function = io_timeout_fn; - hrtimer_start(&req->timeout.timer, timespec_to_ktime(ts), + hrtimer_start(&req->timeout.timer, timespec64_to_ktime(ts), HRTIMER_MODE_REL); return 0; }
From: Pavel Begunkov asml.silence@gmail.com
mainline inclusion from mainline-5.4-rc3 commit bf7ec93c644cb0064ba7d2fc40d4841c5ba382ab category: feature bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27 CVE: NA ---------------------------
io_queue_link_head() accepts a @force_nonblock flag, but io_ring_submit() passes the opposite.
Fixes: c576666863b78 ("io_uring: optimize submit_and_wait API") Reported-by: kbuild test robot lkp@intel.com Signed-off-by: Pavel Begunkov asml.silence@gmail.com Signed-off-by: Jens Axboe axboe@kernel.dk Signed-off-by: Zhihao Cheng chengzhihao1@huawei.com Signed-off-by: yangerkun yangerkun@huawei.com Reviewed-by: zhangyi (F) yi.zhang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- fs/io_uring.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/fs/io_uring.c b/fs/io_uring.c index d78e64d66acc..2c827efa4f53 100644 --- a/fs/io_uring.c +++ b/fs/io_uring.c @@ -2760,7 +2760,7 @@ static int io_ring_submit(struct io_ring_ctx *ctx, unsigned int to_submit,
if (link) io_queue_link_head(ctx, link, &link->submit, shadow_req, - block_for_last); + !block_for_last); if (statep) io_submit_state_end(statep);
From: Jens Axboe axboe@kernel.dk
mainline inclusion from mainline-5.5-rc1 commit c3a31e605620c279163c14068a60869ea3fda203 category: feature bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27 CVE: NA ---------------------------
Allows the application to remove/replace/add files to/from a file set. Passes in a struct:
struct io_uring_files_update { __u32 offset; __s32 *fds; };
that holds an array of fds, with the size of the array passed in through the usual nr_args argument of the io_uring_register() system call. The logic is as follows:
1) If ->fds[i] is -1, the existing file at i + ->offset is removed from the set. 2) If ->fds[i] is a valid fd, the existing file at i + ->offset is replaced with ->fds[i].
For case #2, if the existing file slot is currently empty (fd == -1), the new fd is simply added to the array.
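A hedged userspace sketch of driving this (io_uring_register(2) has no libc wrapper, so raw syscall(2) is used; the function name and slot numbers are made up, the UAPI names come from the hunk below):

#include <sys/syscall.h>
#include <unistd.h>
#include <linux/io_uring.h>

/* Sketch: replace the file in slot 4 with new_fd and clear slot 5.
 * Returns the number of slots processed, or -1 with errno set. */
int update_files(int ring_fd, int new_fd)
{
	__s32 fds[2] = { new_fd, -1 };		/* -1 removes a file */
	struct io_uring_files_update up = {
		.offset = 4,			/* first slot to touch */
		.fds = fds,
	};

	return syscall(__NR_io_uring_register, ring_fd,
		       IORING_REGISTER_FILES_UPDATE, &up, 2);
}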
Reviewed-by: Jeff Moyer jmoyer@redhat.com Signed-off-by: Jens Axboe axboe@kernel.dk Signed-off-by: Zhihao Cheng chengzhihao1@huawei.com Signed-off-by: yangerkun yangerkun@huawei.com Reviewed-by: zhangyi (F) yi.zhang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- fs/io_uring.c | 175 ++++++++++++++++++++++++++++++++++ include/uapi/linux/io_uring.h | 6 ++ 2 files changed, 181 insertions(+)
diff --git a/fs/io_uring.c b/fs/io_uring.c index e1dceb57d7d4..0b09dc80c100 100644 --- a/fs/io_uring.c +++ b/fs/io_uring.c @@ -3223,6 +3223,178 @@ static int io_sqe_files_register(struct io_ring_ctx *ctx, void __user *arg, return ret; }
+static void io_sqe_file_unregister(struct io_ring_ctx *ctx, int index) +{ +#if defined(CONFIG_UNIX) + struct file *file = ctx->user_files[index]; + struct sock *sock = ctx->ring_sock->sk; + struct sk_buff_head list, *head = &sock->sk_receive_queue; + struct sk_buff *skb; + int i; + + __skb_queue_head_init(&list); + + /* + * Find the skb that holds this file in its SCM_RIGHTS. When found, + * remove this entry and rearrange the file array. + */ + skb = skb_dequeue(head); + while (skb) { + struct scm_fp_list *fp; + + fp = UNIXCB(skb).fp; + for (i = 0; i < fp->count; i++) { + int left; + + if (fp->fp[i] != file) + continue; + + unix_notinflight(fp->user, fp->fp[i]); + left = fp->count - 1 - i; + if (left) { + memmove(&fp->fp[i], &fp->fp[i + 1], + left * sizeof(struct file *)); + } + fp->count--; + if (!fp->count) { + kfree_skb(skb); + skb = NULL; + } else { + __skb_queue_tail(&list, skb); + } + fput(file); + file = NULL; + break; + } + + if (!file) + break; + + __skb_queue_tail(&list, skb); + + skb = skb_dequeue(head); + } + + if (skb_peek(&list)) { + spin_lock_irq(&head->lock); + while ((skb = __skb_dequeue(&list)) != NULL) + __skb_queue_tail(head, skb); + spin_unlock_irq(&head->lock); + } +#else + fput(ctx->user_files[index]); +#endif +} + +static int io_sqe_file_register(struct io_ring_ctx *ctx, struct file *file, + int index) +{ +#if defined(CONFIG_UNIX) + struct sock *sock = ctx->ring_sock->sk; + struct sk_buff_head *head = &sock->sk_receive_queue; + struct sk_buff *skb; + + /* + * See if we can merge this file into an existing skb SCM_RIGHTS + * file set. If there's no room, fall back to allocating a new skb + * and filling it in. + */ + spin_lock_irq(&head->lock); + skb = skb_peek(head); + if (skb) { + struct scm_fp_list *fpl = UNIXCB(skb).fp; + + if (fpl->count < SCM_MAX_FD) { + __skb_unlink(skb, head); + spin_unlock_irq(&head->lock); + fpl->fp[fpl->count] = get_file(file); + unix_inflight(fpl->user, fpl->fp[fpl->count]); + fpl->count++; + spin_lock_irq(&head->lock); + __skb_queue_head(head, skb); + } else { + skb = NULL; + } + } + spin_unlock_irq(&head->lock); + + if (skb) { + fput(file); + return 0; + } + + return __io_sqe_files_scm(ctx, 1, index); +#else + return 0; +#endif +} + +static int io_sqe_files_update(struct io_ring_ctx *ctx, void __user *arg, + unsigned nr_args) +{ + struct io_uring_files_update up; + __s32 __user *fds; + int fd, i, err; + __u32 done; + + if (!ctx->user_files) + return -ENXIO; + if (!nr_args) + return -EINVAL; + if (copy_from_user(&up, arg, sizeof(up))) + return -EFAULT; + if (check_add_overflow(up.offset, nr_args, &done)) + return -EOVERFLOW; + if (done > ctx->nr_user_files) + return -EINVAL; + + done = 0; + fds = (__s32 __user *) up.fds; + while (nr_args) { + err = 0; + if (copy_from_user(&fd, &fds[done], sizeof(fd))) { + err = -EFAULT; + break; + } + i = array_index_nospec(up.offset, ctx->nr_user_files); + if (ctx->user_files[i]) { + io_sqe_file_unregister(ctx, i); + ctx->user_files[i] = NULL; + } + if (fd != -1) { + struct file *file; + + file = fget(fd); + if (!file) { + err = -EBADF; + break; + } + /* + * Don't allow io_uring instances to be registered. If + * UNIX isn't enabled, then this causes a reference + * cycle and this instance can never get freed. If UNIX + * is enabled we'll handle it just fine, but there's + * still no point in allowing a ring fd as it doesn't + * support regular read/write anyway. 
+ */ + if (file->f_op == &io_uring_fops) { + fput(file); + err = -EBADF; + break; + } + ctx->user_files[i] = file; + err = io_sqe_file_register(ctx, file, i); + if (err) + break; + } + nr_args--; + done++; + up.offset++; + } + + return done ? done : err; +} + static int io_sq_offload_start(struct io_ring_ctx *ctx, struct io_uring_params *p) { @@ -4031,6 +4203,9 @@ static int __io_uring_register(struct io_ring_ctx *ctx, unsigned opcode, break; ret = io_sqe_files_unregister(ctx); break; + case IORING_REGISTER_FILES_UPDATE: + ret = io_sqe_files_update(ctx, arg, nr_args); + break; case IORING_REGISTER_EVENTFD: ret = -EINVAL; if (nr_args != 1) diff --git a/include/uapi/linux/io_uring.h b/include/uapi/linux/io_uring.h index ea57526a5b89..4f532d9c0554 100644 --- a/include/uapi/linux/io_uring.h +++ b/include/uapi/linux/io_uring.h @@ -150,5 +150,11 @@ struct io_uring_params { #define IORING_UNREGISTER_FILES 3 #define IORING_REGISTER_EVENTFD 4 #define IORING_UNREGISTER_EVENTFD 5 +#define IORING_REGISTER_FILES_UPDATE 6 + +struct io_uring_files_update { + __u32 offset; + __s32 *fds; +};
#endif
From: Jens Axboe axboe@kernel.dk
mainline inclusion from mainline-5.5-rc1 commit 33a107f0a1b8df0ad925e39d8afc97bb78e0cec1 category: feature bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27 CVE: NA ---------------------------
We currently size the CQ ring as twice the SQ ring, to allow some flexibility in not overflowing the CQ ring. This is done because the SQE lifetime is different from that of the IO request itself; the SQE is consumed as soon as the kernel has seen the entry.
Certain applications don't need a huge SQ ring size, since they just submit IO in batches. But they may have a lot of requests pending, and hence need a big CQ ring to hold them all. By allowing the application to control the CQ ring size multiplier, we can cater to those applications more efficiently.
If an application wants to define its own CQ ring size, it must set IORING_SETUP_CQSIZE in the setup flags, and fill out io_uring_params->cq_entries. The value must be a power of two.
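A minimal sketch of the new flag, assuming the UAPI above (io_uring_setup(2) is likewise a raw syscall):

#include <string.h>
#include <sys/syscall.h>
#include <unistd.h>
#include <linux/io_uring.h>

/* Sketch: small SQ ring, large application-chosen CQ ring. */
int setup_big_cq(void)
{
	struct io_uring_params p;

	memset(&p, 0, sizeof(p));
	p.flags = IORING_SETUP_CQSIZE;
	p.cq_entries = 4096;	/* must be >= sq_entries; rounded up to a power of two */

	return syscall(__NR_io_uring_setup, 8, &p);	/* 8 SQ entries */
}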
Signed-off-by: Jens Axboe axboe@kernel.dk Signed-off-by: Zhihao Cheng chengzhihao1@huawei.com Signed-off-by: yangerkun yangerkun@huawei.com Reviewed-by: zhangyi (F) yi.zhang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- fs/io_uring.c | 20 +++++++++++++++++--- include/uapi/linux/io_uring.h | 1 + 2 files changed, 18 insertions(+), 3 deletions(-)
diff --git a/fs/io_uring.c b/fs/io_uring.c index 0b09dc80c100..cefd9e685610 100644 --- a/fs/io_uring.c +++ b/fs/io_uring.c @@ -76,6 +76,7 @@ #include "internal.h"
#define IORING_MAX_ENTRIES 32768 +#define IORING_MAX_CQ_ENTRIES (2 * IORING_MAX_ENTRIES) #define IORING_MAX_FIXED_FILES 1024
struct io_uring { @@ -4049,10 +4050,23 @@ static int io_uring_create(unsigned entries, struct io_uring_params *p) * Use twice as many entries for the CQ ring. It's possible for the * application to drive a higher depth than the size of the SQ ring, * since the sqes are only used at submission time. This allows for - * some flexibility in overcommitting a bit. + * some flexibility in overcommitting a bit. If the application has + * set IORING_SETUP_CQSIZE, it will have passed in the desired number + * of CQ ring entries manually. */ p->sq_entries = roundup_pow_of_two(entries); - p->cq_entries = 2 * p->sq_entries; + if (p->flags & IORING_SETUP_CQSIZE) { + /* + * If IORING_SETUP_CQSIZE is set, we do the same roundup + * to a power-of-two, if it isn't already. We do NOT impose + * any cq vs sq ring sizing. + */ + if (p->cq_entries < p->sq_entries || p->cq_entries > IORING_MAX_CQ_ENTRIES) + return -EINVAL; + p->cq_entries = roundup_pow_of_two(p->cq_entries); + } else { + p->cq_entries = 2 * p->sq_entries; + }
user = get_uid(current_user()); account_mem = !capable(CAP_IPC_LOCK); @@ -4137,7 +4151,7 @@ static long io_uring_setup(u32 entries, struct io_uring_params __user *params) }
if (p.flags & ~(IORING_SETUP_IOPOLL | IORING_SETUP_SQPOLL | - IORING_SETUP_SQ_AFF)) + IORING_SETUP_SQ_AFF | IORING_SETUP_CQSIZE)) return -EINVAL;
ret = io_uring_create(entries, &p); diff --git a/include/uapi/linux/io_uring.h b/include/uapi/linux/io_uring.h index 4f532d9c0554..e0137ea6ad79 100644 --- a/include/uapi/linux/io_uring.h +++ b/include/uapi/linux/io_uring.h @@ -50,6 +50,7 @@ struct io_uring_sqe { #define IORING_SETUP_IOPOLL (1U << 0) /* io_context is polled */ #define IORING_SETUP_SQPOLL (1U << 1) /* SQ poll thread */ #define IORING_SETUP_SQ_AFF (1U << 2) /* sq_thread_cpu is valid */ +#define IORING_SETUP_CQSIZE (1U << 3) /* app defines CQ size */
#define IORING_OP_NOP 0 #define IORING_OP_READV 1
From: Jackie Liu liuyun01@kylinos.cn
mainline inclusion from mainline-5.5-rc1 commit ba5290ccb6b57fc5e274ae46d051fba1f0ece262 category: feature bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27 CVE: NA ---------------------------
There is no functional change, just a code cleanup: use s->in_async to make it clear whether the code is running in async context.
Signed-off-by: Jackie Liu liuyun01@kylinos.cn Signed-off-by: Jens Axboe axboe@kernel.dk Signed-off-by: Zhihao Cheng chengzhihao1@huawei.com Signed-off-by: yangerkun yangerkun@huawei.com Reviewed-by: zhangyi (F) yi.zhang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- fs/io_uring.c | 31 +++++++++++-------------------- 1 file changed, 11 insertions(+), 20 deletions(-)
diff --git a/fs/io_uring.c b/fs/io_uring.c index cefd9e685610..c41c7d52a27e 100644 --- a/fs/io_uring.c +++ b/fs/io_uring.c @@ -271,7 +271,7 @@ struct sqe_submit { unsigned short index; u32 sequence; bool has_user; - bool needs_lock; + bool in_async; bool needs_fixed_file; };
@@ -1473,13 +1473,9 @@ static int io_read(struct io_kiocb *req, const struct sqe_submit *s, ret2 = -EAGAIN; /* Catch -EAGAIN return for forced non-blocking submission */ if (!force_nonblock || ret2 != -EAGAIN) { - kiocb_done(kiocb, ret2, nxt, s->needs_lock); + kiocb_done(kiocb, ret2, nxt, s->in_async); } else { - /* - * If ->needs_lock is true, we're already in async - * context. - */ - if (!s->needs_lock) + if (!s->in_async) io_async_list_note(READ, req, iov_count); ret = -EAGAIN; } @@ -1517,8 +1513,7 @@ static int io_write(struct io_kiocb *req, const struct sqe_submit *s,
ret = -EAGAIN; if (force_nonblock && !(kiocb->ki_flags & IOCB_DIRECT)) { - /* If ->needs_lock is true, we're already in async context. */ - if (!s->needs_lock) + if (!s->in_async) io_async_list_note(WRITE, req, iov_count); goto out_free; } @@ -1547,13 +1542,9 @@ static int io_write(struct io_kiocb *req, const struct sqe_submit *s, else ret2 = loop_rw_iter(WRITE, file, kiocb, &iter); if (!force_nonblock || ret2 != -EAGAIN) { - kiocb_done(kiocb, ret2, nxt, s->needs_lock); + kiocb_done(kiocb, ret2, nxt, s->in_async); } else { - /* - * If ->needs_lock is true, we're already in async - * context. - */ - if (!s->needs_lock) + if (!s->in_async) io_async_list_note(WRITE, req, iov_count); ret = -EAGAIN; } @@ -2151,10 +2142,10 @@ static int __io_submit_sqe(struct io_ring_ctx *ctx, struct io_kiocb *req, return -EAGAIN;
/* workqueue context doesn't hold uring_lock, grab it now */ - if (s->needs_lock) + if (s->in_async) mutex_lock(&ctx->uring_lock); io_iopoll_req_issued(req); - if (s->needs_lock) + if (s->in_async) mutex_unlock(&ctx->uring_lock); }
@@ -2219,7 +2210,7 @@ static void io_sq_wq_submit_work(struct work_struct *work)
if (!ret) { s->has_user = cur_mm != NULL; - s->needs_lock = true; + s->in_async = true; do { ret = __io_submit_sqe(ctx, req, s, &nxt, false); /* @@ -2695,7 +2686,7 @@ static int io_submit_sqes(struct io_ring_ctx *ctx, unsigned int nr, -EFAULT); } else { s.has_user = has_user; - s.needs_lock = true; + s.in_async = true; s.needs_fixed_file = true; io_submit_sqe(ctx, &s, statep, &link); submitted++; @@ -2882,7 +2873,7 @@ static int io_ring_submit(struct io_ring_ctx *ctx, unsigned int to_submit)
out: s.has_user = true; - s.needs_lock = false; + s.in_async = false; s.needs_fixed_file = false; submit++; io_submit_sqe(ctx, &s, statep, &link);
From: Jens Axboe axboe@kernel.dk
mainline inclusion from mainline-5.5-rc1 commit a41525ab2e75987e809926352ebc6f1397da900e category: feature bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27 CVE: NA ---------------------------
This is a pretty trivial addition on top of the relative timeouts we have now, but it's handy for ensuring tighter timing for those that are building scheduling primitives on top of io_uring.
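For illustration (not from the commit), arming an absolute timeout might look like this sketch; the kernel side uses CLOCK_MONOTONIC, so the deadline is computed on that clock:

#include <string.h>
#include <time.h>
#include <linux/io_uring.h>
#include <linux/time_types.h>

/* Sketch: fire at an absolute CLOCK_MONOTONIC point 5s from now. */
static void prep_abs_timeout(struct io_uring_sqe *sqe,
			     struct __kernel_timespec *ts)
{
	struct timespec now;

	clock_gettime(CLOCK_MONOTONIC, &now);
	ts->tv_sec = now.tv_sec + 5;
	ts->tv_nsec = now.tv_nsec;

	memset(sqe, 0, sizeof(*sqe));
	sqe->opcode = IORING_OP_TIMEOUT;
	sqe->addr = (unsigned long) ts;
	sqe->len = 1;
	sqe->timeout_flags = IORING_TIMEOUT_ABS;
}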
Signed-off-by: Jens Axboe axboe@kernel.dk Signed-off-by: Zhihao Cheng chengzhihao1@huawei.com Signed-off-by: yangerkun yangerkun@huawei.com Reviewed-by: zhangyi (F) yi.zhang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- fs/io_uring.c | 17 ++++++++++++----- include/uapi/linux/io_uring.h | 5 +++++ 2 files changed, 17 insertions(+), 5 deletions(-)
diff --git a/fs/io_uring.c b/fs/io_uring.c index c41c7d52a27e..fc338c697e42 100644 --- a/fs/io_uring.c +++ b/fs/io_uring.c @@ -1977,13 +1977,17 @@ static int io_timeout(struct io_kiocb *req, const struct io_uring_sqe *sqe) unsigned count; struct io_ring_ctx *ctx = req->ctx; struct list_head *entry; + enum hrtimer_mode mode; struct timespec64 ts; unsigned span = 0; + unsigned flags;
if (unlikely(ctx->flags & IORING_SETUP_IOPOLL)) return -EINVAL; - if (sqe->flags || sqe->ioprio || sqe->buf_index || sqe->timeout_flags || - sqe->len != 1) + if (sqe->flags || sqe->ioprio || sqe->buf_index || sqe->len != 1) + return -EINVAL; + flags = READ_ONCE(sqe->timeout_flags); + if (flags & ~IORING_TIMEOUT_ABS) return -EINVAL;
if (get_timespec64(&ts, u64_to_user_ptr(sqe->addr))) @@ -2041,10 +2045,13 @@ static int io_timeout(struct io_kiocb *req, const struct io_uring_sqe *sqe) list_add(&req->list, entry); spin_unlock_irq(&ctx->completion_lock);
- hrtimer_init(&req->timeout.timer, CLOCK_MONOTONIC, HRTIMER_MODE_REL); + if (flags & IORING_TIMEOUT_ABS) + mode = HRTIMER_MODE_ABS; + else + mode = HRTIMER_MODE_REL; + hrtimer_init(&req->timeout.timer, CLOCK_MONOTONIC, mode); req->timeout.timer.function = io_timeout_fn; - hrtimer_start(&req->timeout.timer, timespec64_to_ktime(ts), - HRTIMER_MODE_REL); + hrtimer_start(&req->timeout.timer, timespec64_to_ktime(ts), mode); return 0; }
diff --git a/include/uapi/linux/io_uring.h b/include/uapi/linux/io_uring.h index e0137ea6ad79..b402dfee5e15 100644 --- a/include/uapi/linux/io_uring.h +++ b/include/uapi/linux/io_uring.h @@ -70,6 +70,11 @@ struct io_uring_sqe { */ #define IORING_FSYNC_DATASYNC (1U << 0)
+/* + * sqe->timeout_flags + */ +#define IORING_TIMEOUT_ABS (1U << 0) + /* * IO completion data structure (Completion Queue Entry) */
From: Jens Axboe axboe@kernel.dk
mainline inclusion from mainline-5.5-rc1 commit 11365043e5271fea4c92189a976833da477a3a44 category: feature bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27 CVE: NA ---------------------------
We might have cases where the need for a specific timeout is gone; add support for canceling an existing timeout operation. This works like the POLL_REMOVE command, where the application passes in the user_data of the timeout it wishes to cancel in the sqe->addr field.
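A hedged sketch of the cancel side (the user_data values are made up; the result conventions follow the hunk below):

#include <string.h>
#include <linux/io_uring.h>

/* Sketch: cancel the timeout that was submitted with user_data 0x1234.
 * On success the remove op completes with res == 0 and the canceled
 * timeout posts -ECANCELED; -ENOENT means no such timeout was found,
 * and -EBUSY that it was already firing. */
static void prep_timeout_remove(struct io_uring_sqe *sqe)
{
	memset(sqe, 0, sizeof(*sqe));
	sqe->opcode = IORING_OP_TIMEOUT_REMOVE;
	sqe->addr = 0x1234;		/* user_data of the timeout to cancel */
	sqe->user_data = 0x5678;	/* identifies this remove operation */
}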
Signed-off-by: Jens Axboe axboe@kernel.dk Signed-off-by: Zhihao Cheng chengzhihao1@huawei.com Signed-off-by: yangerkun yangerkun@huawei.com Reviewed-by: zhangyi (F) yi.zhang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- fs/io_uring.c | 109 ++++++++++++++++++++++++++++------ include/uapi/linux/io_uring.h | 1 + 2 files changed, 92 insertions(+), 18 deletions(-)
diff --git a/fs/io_uring.c b/fs/io_uring.c index fc338c697e42..9bb289aedda6 100644 --- a/fs/io_uring.c +++ b/fs/io_uring.c @@ -1943,8 +1943,9 @@ static int io_poll_add(struct io_kiocb *req, const struct io_uring_sqe *sqe) static enum hrtimer_restart io_timeout_fn(struct hrtimer *timer) { struct io_ring_ctx *ctx; - struct io_kiocb *req, *prev; + struct io_kiocb *req; unsigned long flags; + bool comp;
req = container_of(timer, struct io_kiocb, timeout.timer); ctx = req->ctx; @@ -1952,24 +1953,92 @@ static enum hrtimer_restart io_timeout_fn(struct hrtimer *timer)
spin_lock_irqsave(&ctx->completion_lock, flags); /* - * Adjust the reqs sequence before the current one because it - * will consume a slot in the cq_ring and the the cq_tail pointer - * will be increased, otherwise other timeout reqs may return in - * advance without waiting for enough wait_nr. + * We could be racing with timeout deletion. If the list is empty, + * then timeout lookup already found it and will be handling it. */ - prev = req; - list_for_each_entry_continue_reverse(prev, &ctx->timeout_list, list) - prev->sequence++; - list_del(&req->list); + comp = !list_empty(&req->list); + if (comp) { + struct io_kiocb *prev;
- io_cqring_fill_event(ctx, req->user_data, -ETIME); - io_commit_cqring(ctx); + /* + * Adjust the reqs sequence before the current one because it + * will consume a slot in the cq_ring and the the cq_tail + * pointer will be increased, otherwise other timeout reqs may + * return in advance without waiting for enough wait_nr. + */ + prev = req; + list_for_each_entry_continue_reverse(prev, &ctx->timeout_list, list) + prev->sequence++; + + list_del_init(&req->list); + io_cqring_fill_event(ctx, req->user_data, -ETIME); + io_commit_cqring(ctx); + } spin_unlock_irqrestore(&ctx->completion_lock, flags);
+ if (comp) { + io_cqring_ev_posted(ctx); + io_put_req(req, NULL); + } + return HRTIMER_NORESTART; +} + +/* + * Remove or update an existing timeout command + */ +static int io_timeout_remove(struct io_kiocb *req, + const struct io_uring_sqe *sqe) +{ + struct io_ring_ctx *ctx = req->ctx; + struct io_kiocb *treq; + int ret = -ENOENT; + __u64 user_data; + unsigned flags; + + if (unlikely(ctx->flags & IORING_SETUP_IOPOLL)) + return -EINVAL; + if (sqe->flags || sqe->ioprio || sqe->buf_index || sqe->len) + return -EINVAL; + flags = READ_ONCE(sqe->timeout_flags); + if (flags) + return -EINVAL; + + user_data = READ_ONCE(sqe->addr); + spin_lock_irq(&ctx->completion_lock); + list_for_each_entry(treq, &ctx->timeout_list, list) { + if (user_data == treq->user_data) { + list_del_init(&treq->list); + ret = 0; + break; + } + } + + /* didn't find timeout */ + if (ret) { +fill_ev: + io_cqring_fill_event(ctx, req->user_data, ret); + io_commit_cqring(ctx); + spin_unlock_irq(&ctx->completion_lock); + io_cqring_ev_posted(ctx); + io_put_req(req, NULL); + return 0; + } + + ret = hrtimer_try_to_cancel(&treq->timeout.timer); + if (ret == -1) { + ret = -EBUSY; + goto fill_ev; + } + + io_cqring_fill_event(ctx, req->user_data, 0); + io_cqring_fill_event(ctx, treq->user_data, -ECANCELED); + io_commit_cqring(ctx); + spin_unlock_irq(&ctx->completion_lock); io_cqring_ev_posted(ctx);
+ io_put_req(treq, NULL); io_put_req(req, NULL); - return HRTIMER_NORESTART; + return 0; }
static int io_timeout(struct io_kiocb *req, const struct io_uring_sqe *sqe) @@ -1993,6 +2062,13 @@ static int io_timeout(struct io_kiocb *req, const struct io_uring_sqe *sqe) if (get_timespec64(&ts, u64_to_user_ptr(sqe->addr))) return -EFAULT;
+ if (flags & IORING_TIMEOUT_ABS) + mode = HRTIMER_MODE_ABS; + else + mode = HRTIMER_MODE_REL; + + hrtimer_init(&req->timeout.timer, CLOCK_MONOTONIC, mode); + /* * sqe->off holds how many events that need to occur for this * timeout event to be satisfied. @@ -2044,12 +2120,6 @@ static int io_timeout(struct io_kiocb *req, const struct io_uring_sqe *sqe) req->sequence -= span; list_add(&req->list, entry); spin_unlock_irq(&ctx->completion_lock); - - if (flags & IORING_TIMEOUT_ABS) - mode = HRTIMER_MODE_ABS; - else - mode = HRTIMER_MODE_REL; - hrtimer_init(&req->timeout.timer, CLOCK_MONOTONIC, mode); req->timeout.timer.function = io_timeout_fn; hrtimer_start(&req->timeout.timer, timespec64_to_ktime(ts), mode); return 0; @@ -2136,6 +2206,9 @@ static int __io_submit_sqe(struct io_ring_ctx *ctx, struct io_kiocb *req, case IORING_OP_TIMEOUT: ret = io_timeout(req, s->sqe); break; + case IORING_OP_TIMEOUT_REMOVE: + ret = io_timeout_remove(req, s->sqe); + break; default: ret = -EINVAL; break; diff --git a/include/uapi/linux/io_uring.h b/include/uapi/linux/io_uring.h index b402dfee5e15..6dc5ced1c37a 100644 --- a/include/uapi/linux/io_uring.h +++ b/include/uapi/linux/io_uring.h @@ -64,6 +64,7 @@ struct io_uring_sqe { #define IORING_OP_SENDMSG 9 #define IORING_OP_RECVMSG 10 #define IORING_OP_TIMEOUT 11 +#define IORING_OP_TIMEOUT_REMOVE 12
/* * sqe->fsync_flags
From: Dmitrii Dolgov 9erthalion6@gmail.com
mainline inclusion from mainline-5.5-rc1 commit c826bd7a743f275e2b68c16d595534063b400deb category: feature bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27 CVE: NA ---------------------------
To trace io_uring activity one can get information from the workqueue and io trace events, but it looks like some parts can be hard to identify via this approach. Making what happens inside io_uring more transparent is important for reasoning about many aspects of it, hence introduce this set of tracing events.
All such events could be roughly divided into two categories:
* those that help to understand correctness (from both the kernel and the application point of view), e.g. ring creation, file registration, or waiting for an available CQE. The proposed approach is to get a pointer to the original structure of interest (ring context, or request) and then find the relevant events. io_uring_queue_async_work also exposes a pointer to the work_struct, so corresponding workqueue events can be tracked down.
* those that provide performance-related information. Mostly these are events that change the flow of requests, e.g. whether an async work was queued, or delayed due to some dependencies. Another important case is how io_uring optimizations (e.g. registered files) are utilized.
Signed-off-by: Dmitrii Dolgov 9erthalion6@gmail.com Signed-off-by: Jens Axboe axboe@kernel.dk
Conflicts: include/Kbuild [ Patch 43c78d88036e47("kbuild: compile-test kernel headers to ensure they are self-contained") is not applied. ]
Signed-off-by: Zhihao Cheng chengzhihao1@huawei.com Signed-off-by: yangerkun yangerkun@huawei.com Reviewed-by: zhangyi (F) yi.zhang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- fs/io_uring.c | 17 ++ include/trace/events/io_uring.h | 349 ++++++++++++++++++++++++++++++++ 2 files changed, 366 insertions(+) create mode 100644 include/trace/events/io_uring.h
diff --git a/fs/io_uring.c b/fs/io_uring.c index 9bb289aedda6..d3b60862c883 100644 --- a/fs/io_uring.c +++ b/fs/io_uring.c @@ -71,6 +71,9 @@ #include <linux/sizes.h> #include <linux/hugetlb.h>
+#define CREATE_TRACE_POINTS +#include <trace/events/io_uring.h> + #include <uapi/linux/io_uring.h>
#include "internal.h" @@ -490,6 +493,7 @@ static inline void io_queue_async_work(struct io_ring_ctx *ctx, } }
+ trace_io_uring_queue_async_work(ctx, rw, req, &req->work, req->flags); queue_work(ctx->sqo_wq[rw], &req->work); }
@@ -709,6 +713,7 @@ static void io_fail_links(struct io_kiocb *req) link = list_first_entry(&req->link_list, struct io_kiocb, list); list_del(&link->list);
+ trace_io_uring_fail_link(req, link); io_cqring_add_event(req->ctx, link->user_data, -ECANCELED); __io_free_req(link); } @@ -2148,6 +2153,7 @@ static int io_req_defer(struct io_ring_ctx *ctx, struct io_kiocb *req, req->submit.sqe = sqe_copy;
INIT_WORK(&req->work, io_sq_wq_submit_work); + trace_io_uring_defer(ctx, req, false); list_add_tail(&req->list, &ctx->defer_list); spin_unlock_irq(&ctx->completion_lock); return -EIOCBQUEUED; @@ -2409,6 +2415,8 @@ static bool io_add_to_prev_work(struct async_list *list, struct io_kiocb *req) ret = false; } spin_unlock(&list->lock); + + trace_io_uring_add_to_prev(req, ret); return ret; }
@@ -2457,6 +2465,7 @@ static int io_req_set_file(struct io_ring_ctx *ctx, const struct sqe_submit *s, } else { if (s->needs_fixed_file) return -EBADF; + trace_io_uring_file_get(ctx, fd); req->file = io_file_get(state, fd); if (unlikely(!req->file)) return -EBADF; @@ -2566,6 +2575,7 @@ static int io_queue_link_head(struct io_ring_ctx *ctx, struct io_kiocb *req,
/* Insert shadow req to defer_list, blocking next IOs */ spin_lock_irq(&ctx->completion_lock); + trace_io_uring_defer(ctx, shadow, true); list_add_tail(&shadow->list, &ctx->defer_list); spin_unlock_irq(&ctx->completion_lock);
@@ -2625,6 +2635,7 @@ static void io_submit_sqe(struct io_ring_ctx *ctx, struct sqe_submit *s,
s->sqe = sqe_copy; memcpy(&req->submit, s, sizeof(*s)); + trace_io_uring_link(ctx, req, prev); list_add_tail(&req->list, &prev->link_list); } else if (s->sqe->flags & IOSQE_IO_LINK) { req->flags |= REQ_F_LINK; @@ -2768,6 +2779,7 @@ static int io_submit_sqes(struct io_ring_ctx *ctx, unsigned int nr, s.has_user = has_user; s.in_async = true; s.needs_fixed_file = true; + trace_io_uring_submit_sqe(ctx, true, true); io_submit_sqe(ctx, &s, statep, &link); submitted++; } @@ -2956,6 +2968,7 @@ static int io_ring_submit(struct io_ring_ctx *ctx, unsigned int to_submit) s.in_async = false; s.needs_fixed_file = false; submit++; + trace_io_uring_submit_sqe(ctx, true, false); io_submit_sqe(ctx, &s, statep, &link); }
@@ -3038,6 +3051,7 @@ static int io_cqring_wait(struct io_ring_ctx *ctx, int min_events,
ret = 0; iowq.nr_timeouts = atomic_read(&ctx->cq_timeouts); + trace_io_uring_cqring_wait(ctx, min_events); do { prepare_to_wait_exclusive(&ctx->wait, &iowq.wq, TASK_INTERRUPTIBLE); @@ -4197,6 +4211,7 @@ static int io_uring_create(unsigned entries, struct io_uring_params *p) goto err;
p->features = IORING_FEAT_SINGLE_MMAP; + trace_io_uring_create(ret, ctx, p->sq_entries, p->cq_entries, p->flags); return ret; err: io_ring_ctx_wait_and_kill(ctx); @@ -4334,6 +4349,8 @@ SYSCALL_DEFINE4(io_uring_register, unsigned int, fd, unsigned int, opcode, mutex_lock(&ctx->uring_lock); ret = __io_uring_register(ctx, opcode, arg, nr_args); mutex_unlock(&ctx->uring_lock); + trace_io_uring_register(ctx, opcode, ctx->nr_user_files, ctx->nr_user_bufs, + ctx->cq_ev_fd != NULL, ret); out_fput: fdput(f); return ret; diff --git a/include/trace/events/io_uring.h b/include/trace/events/io_uring.h new file mode 100644 index 000000000000..c5a905fbf1da --- /dev/null +++ b/include/trace/events/io_uring.h @@ -0,0 +1,349 @@ +/* SPDX-License-Identifier: GPL-2.0 */ +#undef TRACE_SYSTEM +#define TRACE_SYSTEM io_uring + +#if !defined(_TRACE_IO_URING_H) || defined(TRACE_HEADER_MULTI_READ) +#define _TRACE_IO_URING_H + +#include <linux/tracepoint.h> + +/** + * io_uring_create - called after a new io_uring context was prepared + * + * @fd: corresponding file descriptor + * @ctx: pointer to a ring context structure + * @sq_entries: actual SQ size + * @cq_entries: actual CQ size + * @flags: SQ ring flags, provided to io_uring_setup(2) + * + * Allows to trace io_uring creation and provide pointer to a context, that can + * be used later to find correlated events. + */ +TRACE_EVENT(io_uring_create, + + TP_PROTO(int fd, void *ctx, u32 sq_entries, u32 cq_entries, u32 flags), + + TP_ARGS(fd, ctx, sq_entries, cq_entries, flags), + + TP_STRUCT__entry ( + __field( int, fd ) + __field( void *, ctx ) + __field( u32, sq_entries ) + __field( u32, cq_entries ) + __field( u32, flags ) + ), + + TP_fast_assign( + __entry->fd = fd; + __entry->ctx = ctx; + __entry->sq_entries = sq_entries; + __entry->cq_entries = cq_entries; + __entry->flags = flags; + ), + + TP_printk("ring %p, fd %d sq size %d, cq size %d, flags %d", + __entry->ctx, __entry->fd, __entry->sq_entries, + __entry->cq_entries, __entry->flags) +); + +/** + * io_uring_register - called after a buffer/file/eventfd was succesfully + * registered for a ring + * + * @ctx: pointer to a ring context structure + * @opcode: describes which operation to perform + * @nr_user_files: number of registered files + * @nr_user_bufs: number of registered buffers + * @cq_ev_fd: whether eventfs registered or not + * @ret: return code + * + * Allows to trace fixed files/buffers/eventfds, that could be registered to + * avoid an overhead of getting references to them for every operation. This + * event, together with io_uring_file_get, can provide a full picture of how + * much overhead one can reduce via fixing. 
+ */ +TRACE_EVENT(io_uring_register, + + TP_PROTO(void *ctx, unsigned opcode, unsigned nr_files, + unsigned nr_bufs, bool eventfd, long ret), + + TP_ARGS(ctx, opcode, nr_files, nr_bufs, eventfd, ret), + + TP_STRUCT__entry ( + __field( void *, ctx ) + __field( unsigned, opcode ) + __field( unsigned, nr_files ) + __field( unsigned, nr_bufs ) + __field( bool, eventfd ) + __field( long, ret ) + ), + + TP_fast_assign( + __entry->ctx = ctx; + __entry->opcode = opcode; + __entry->nr_files = nr_files; + __entry->nr_bufs = nr_bufs; + __entry->eventfd = eventfd; + __entry->ret = ret; + ), + + TP_printk("ring %p, opcode %d, nr_user_files %d, nr_user_bufs %d, " + "eventfd %d, ret %ld", + __entry->ctx, __entry->opcode, __entry->nr_files, + __entry->nr_bufs, __entry->eventfd, __entry->ret) +); + +/** + * io_uring_file_get - called before getting references to an SQE file + * + * @ctx: pointer to a ring context structure + * @fd: SQE file descriptor + * + * Allows to trace out how often an SQE file reference is obtained, which can + * help figuring out if it makes sense to use fixed files, or check that fixed + * files are used correctly. + */ +TRACE_EVENT(io_uring_file_get, + + TP_PROTO(void *ctx, int fd), + + TP_ARGS(ctx, fd), + + TP_STRUCT__entry ( + __field( void *, ctx ) + __field( int, fd ) + ), + + TP_fast_assign( + __entry->ctx = ctx; + __entry->fd = fd; + ), + + TP_printk("ring %p, fd %d", __entry->ctx, __entry->fd) +); + +/** + * io_uring_queue_async_work - called before submitting a new async work + * + * @ctx: pointer to a ring context structure + * @rw: type of workqueue, normal or buffered writes + * @req: pointer to a submitted request + * @work: pointer to a submitted work_struct + * + * Allows to trace asynchronous work submission. + */ +TRACE_EVENT(io_uring_queue_async_work, + + TP_PROTO(void *ctx, int rw, void * req, struct work_struct *work, + unsigned int flags), + + TP_ARGS(ctx, rw, req, work, flags), + + TP_STRUCT__entry ( + __field( void *, ctx ) + __field( int, rw ) + __field( void *, req ) + __field( struct work_struct *, work ) + __field( unsigned int, flags ) + ), + + TP_fast_assign( + __entry->ctx = ctx; + __entry->rw = rw; + __entry->req = req; + __entry->work = work; + __entry->flags = flags; + ), + + TP_printk("ring %p, request %p, flags %d, %s queue, work %p", + __entry->ctx, __entry->req, __entry->flags, + __entry->rw ? "buffered" : "normal", __entry->work) +); + +/** + * io_uring_defer_list - called before the io_uring work added into defer_list + * + * @ctx: pointer to a ring context structure + * @req: pointer to a deferred request + * @shadow: whether request is shadow or not + * + * Allows to track deferred requests, to get an insight about what requests are + * not started immediately. + */ +TRACE_EVENT(io_uring_defer, + + TP_PROTO(void *ctx, void *req, bool shadow), + + TP_ARGS(ctx, req, shadow), + + TP_STRUCT__entry ( + __field( void *, ctx ) + __field( void *, req ) + __field( bool, shadow ) + ), + + TP_fast_assign( + __entry->ctx = ctx; + __entry->req = req; + __entry->shadow = shadow; + ), + + TP_printk("ring %p, request %p%s", __entry->ctx, __entry->req, + __entry->shadow ? 
", shadow": "") +); + +/** + * io_uring_link - called before the io_uring request added into link_list of + * another request + * + * @ctx: pointer to a ring context structure + * @req: pointer to a linked request + * @target_req: pointer to a previous request, that would contain @req + * + * Allows to track linked requests, to understand dependencies between requests + * and how does it influence their execution flow. + */ +TRACE_EVENT(io_uring_link, + + TP_PROTO(void *ctx, void *req, void *target_req), + + TP_ARGS(ctx, req, target_req), + + TP_STRUCT__entry ( + __field( void *, ctx ) + __field( void *, req ) + __field( void *, target_req ) + ), + + TP_fast_assign( + __entry->ctx = ctx; + __entry->req = req; + __entry->target_req = target_req; + ), + + TP_printk("ring %p, request %p linked after %p", + __entry->ctx, __entry->req, __entry->target_req) +); + +/** + * io_uring_add_to_prev - called after a request was added into a previously + * submitted work + * + * @req: pointer to a request, added to a previous + * @ret: whether or not it was completed successfully + * + * Allows to track merged work, to figure out how often requests are piggy + * backed into other ones, changing the execution flow. + */ +TRACE_EVENT(io_uring_add_to_prev, + + TP_PROTO(void *req, bool ret), + + TP_ARGS(req, ret), + + TP_STRUCT__entry ( + __field( void *, req ) + __field( bool, ret ) + ), + + TP_fast_assign( + __entry->req = req; + __entry->ret = ret; + ), + + TP_printk("request %p, ret %d", __entry->req, __entry->ret) +); + +/** + * io_uring_cqring_wait - called before start waiting for an available CQE + * + * @ctx: pointer to a ring context structure + * @min_events: minimal number of events to wait for + * + * Allows to track waiting for CQE, so that we can e.g. troubleshoot + * situations, when an application wants to wait for an event, that never + * comes. + */ +TRACE_EVENT(io_uring_cqring_wait, + + TP_PROTO(void *ctx, int min_events), + + TP_ARGS(ctx, min_events), + + TP_STRUCT__entry ( + __field( void *, ctx ) + __field( int, min_events ) + ), + + TP_fast_assign( + __entry->ctx = ctx; + __entry->min_events = min_events; + ), + + TP_printk("ring %p, min_events %d", __entry->ctx, __entry->min_events) +); + +/** + * io_uring_fail_link - called before failing a linked request + * + * @req: request, which links were cancelled + * @link: cancelled link + * + * Allows to track linked requests cancellation, to see not only that some work + * was cancelled, but also which request was the reason. + */ +TRACE_EVENT(io_uring_fail_link, + + TP_PROTO(void *req, void *link), + + TP_ARGS(req, link), + + TP_STRUCT__entry ( + __field( void *, req ) + __field( void *, link ) + ), + + TP_fast_assign( + __entry->req = req; + __entry->link = link; + ), + + TP_printk("request %p, link %p", __entry->req, __entry->link) +); + +/** + * io_uring_submit_sqe - called before submitting one SQE + * + * @ctx: pointer to a ring context structure + * @force_nonblock: whether a context blocking or not + * @sq_thread: true if sq_thread has submitted this SQE + * + * Allows to track SQE submitting, to understand what was the source of it, SQ + * thread or io_uring_enter call. 
+ */ +TRACE_EVENT(io_uring_submit_sqe, + + TP_PROTO(void *ctx, bool force_nonblock, bool sq_thread), + + TP_ARGS(ctx, force_nonblock, sq_thread), + + TP_STRUCT__entry ( + __field( void *, ctx ) + __field( bool, force_nonblock ) + __field( bool, sq_thread ) + ), + + TP_fast_assign( + __entry->ctx = ctx; + __entry->force_nonblock = force_nonblock; + __entry->sq_thread = sq_thread; + ), + + TP_printk("ring %p, non block %d, sq_thread %d", + __entry->ctx, __entry->force_nonblock, __entry->sq_thread) +); + +#endif /* _TRACE_IO_URING_H */ + +/* This part must be outside protection */ +#include <trace/define_trace.h>
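As background for the tracepoints above: a TRACE_EVENT() header only generates the trace_*() helper bodies once one .c file defines CREATE_TRACE_POINTS before including it (standard kernel tracepoint machinery; in mainline that file is fs/io_uring.c). A minimal sketch of the wiring follows; example_call_sites() is a hypothetical caller, the trace helpers and their signatures come straight from the TP_PROTO() lines above.

/* In exactly one compilation unit, so the event bodies are emitted once: */
#define CREATE_TRACE_POINTS
#include <trace/events/io_uring.h>

/*
 * Every other caller just includes the header and uses the generated
 * trace_*() helpers. example_call_sites() is purely illustrative.
 */
static void example_call_sites(void *ctx, int fd, int min_events)
{
	trace_io_uring_file_get(ctx, fd);
	trace_io_uring_cqring_wait(ctx, min_events);
}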
From: Pavel Begunkov asml.silence@gmail.com
mainline inclusion from mainline-5.5-rc1 commit fa4562280889ad372dfb1413833a8b8675721b17 category: feature bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27 CVE: NA ---------------------------
submit->index is used only for an inbound check in the submission path (i.e. head < ctx->sq_entries). However, it will always be true, as 1) it's already validated by io_get_sqring(), and 2) ctx->sq_entries can't be changed in between, because of the held ctx->uring_lock and ctx->refs.
Signed-off-by: Pavel Begunkov asml.silence@gmail.com Signed-off-by: Jens Axboe axboe@kernel.dk Signed-off-by: Zhihao Cheng chengzhihao1@huawei.com Signed-off-by: yangerkun yangerkun@huawei.com Reviewed-by: zhangyi (F) yi.zhang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- fs/io_uring.c | 5 ----- 1 file changed, 5 deletions(-)
diff --git a/fs/io_uring.c b/fs/io_uring.c index d3b60862c883..6ec3edd96453 100644 --- a/fs/io_uring.c +++ b/fs/io_uring.c @@ -271,7 +271,6 @@ struct io_ring_ctx {
struct sqe_submit { const struct io_uring_sqe *sqe; - unsigned short index; u32 sequence; bool has_user; bool in_async; @@ -2167,9 +2166,6 @@ static int __io_submit_sqe(struct io_ring_ctx *ctx, struct io_kiocb *req,
req->user_data = READ_ONCE(s->sqe->user_data);
- if (unlikely(s->index >= ctx->sq_entries)) - return -EINVAL; - opcode = READ_ONCE(s->sqe->opcode); switch (opcode) { case IORING_OP_NOP: @@ -2715,7 +2711,6 @@ static bool io_get_sqring(struct io_ring_ctx *ctx, struct sqe_submit *s)
head = READ_ONCE(sq_array[head & ctx->sq_mask]); if (head < ctx->sq_entries) { - s->index = head; s->sqe = &ctx->sq_sqes[head]; s->sequence = ctx->cached_sq_head; ctx->cached_sq_head++;
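To make the argument concrete, here is a small user-space model of the invariant the patch relies on (hypothetical names, plain C, compiles standalone): io_get_sqring() is the only producer of the index and already rejects out-of-range values, and sq_entries cannot change while uring_lock is held, so a second range check downstream can never fire.

#include <assert.h>
#include <stdbool.h>

/* Toy model of the SQ ring: sq_array holds user-supplied indices. */
struct ring {
	unsigned sq_entries;
	unsigned sq_mask;	/* sq_entries - 1, power of two */
};

/* Models io_get_sqring(): the one and only range check. */
static bool get_sqring(const struct ring *r, const unsigned *sq_array,
		       unsigned cached_head, unsigned *index)
{
	unsigned head = sq_array[cached_head & r->sq_mask];

	if (head >= r->sq_entries)
		return false;	/* bogus user index is dropped here */
	*index = head;
	return true;
}

int main(void)
{
	struct ring r = { .sq_entries = 8, .sq_mask = 7 };
	unsigned sq_array[8] = { 3, 9 /* out of range */ };
	unsigned idx;

	/* a valid index always satisfies idx < sq_entries downstream */
	assert(get_sqring(&r, sq_array, 0, &idx) && idx < r.sq_entries);
	/* an invalid one never reaches the submission path at all */
	assert(!get_sqring(&r, sq_array, 1, &idx));
	return 0;
}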
From: Pavel Begunkov asml.silence@gmail.com
mainline inclusion from mainline-5.5-rc1 commit 95a1b3ff9a3e4ea2f26c4e802067d58831f415db category: feature bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27 CVE: NA ---------------------------
Commit fb5ccc98782f ("io_uring: Fix broken links with offloading") introduced a potential performance regression by unconditionally taking the mm even for READ/WRITE_FIXED operations.

Bring that logic back: mm-faulted requests will still go through the generic submission path, thus honoring links and drains, but will fail further along on the req->has_user check.
Fixes: fb5ccc98782f ("io_uring: Fix broken links with offloading") Cc: stable@vger.kernel.org # v5.4 Signed-off-by: Pavel Begunkov asml.silence@gmail.com Signed-off-by: Jens Axboe axboe@kernel.dk Signed-off-by: Zhihao Cheng chengzhihao1@huawei.com Signed-off-by: yangerkun yangerkun@huawei.com Reviewed-by: zhangyi (F) yi.zhang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- fs/io_uring.c | 41 +++++++++++++++++------------------------ 1 file changed, 17 insertions(+), 24 deletions(-)
diff --git a/fs/io_uring.c b/fs/io_uring.c index 6ec3edd96453..74f194cbef9b 100644 --- a/fs/io_uring.c +++ b/fs/io_uring.c @@ -2725,13 +2725,14 @@ static bool io_get_sqring(struct io_ring_ctx *ctx, struct sqe_submit *s) }
static int io_submit_sqes(struct io_ring_ctx *ctx, unsigned int nr, - bool has_user, bool mm_fault) + struct mm_struct **mm) { struct io_submit_state state, *statep = NULL; struct io_kiocb *link = NULL; struct io_kiocb *shadow_req = NULL; bool prev_was_link = false; int i, submitted = 0; + bool mm_fault = false;
if (nr > IO_PLUG_THRESHOLD) { io_submit_state_start(&state, ctx, nr); @@ -2744,6 +2745,14 @@ static int io_submit_sqes(struct io_ring_ctx *ctx, unsigned int nr, if (!io_get_sqring(ctx, &s)) break;
+ if (io_sqe_needs_user(s.sqe) && !*mm) { + mm_fault = mm_fault || !mmget_not_zero(ctx->sqo_mm); + if (!mm_fault) { + use_mm(ctx->sqo_mm); + *mm = ctx->sqo_mm; + } + } + /* * If previous wasn't linked and we have a linked command, * that's the end of the chain. Submit the previous link. @@ -2767,17 +2776,12 @@ static int io_submit_sqes(struct io_ring_ctx *ctx, unsigned int nr, }
out: - if (unlikely(mm_fault)) { - io_cqring_add_event(ctx, s.sqe->user_data, - -EFAULT); - } else { - s.has_user = has_user; - s.in_async = true; - s.needs_fixed_file = true; - trace_io_uring_submit_sqe(ctx, true, true); - io_submit_sqe(ctx, &s, statep, &link); - submitted++; - } + s.has_user = *mm != NULL; + s.in_async = true; + s.needs_fixed_file = true; + trace_io_uring_submit_sqe(ctx, true, true); + io_submit_sqe(ctx, &s, statep, &link); + submitted++; }
if (link) @@ -2804,7 +2808,6 @@ static int io_sq_thread(void *data)
timeout = inflight = 0; while (!kthread_should_park()) { - bool mm_fault = false; unsigned int to_submit;
if (inflight) { @@ -2889,18 +2892,8 @@ static int io_sq_thread(void *data) ctx->rings->sq_flags &= ~IORING_SQ_NEED_WAKEUP; }
- /* Unless all new commands are FIXED regions, grab mm */ - if (!cur_mm) { - mm_fault = !mmget_not_zero(ctx->sqo_mm); - if (!mm_fault) { - use_mm(ctx->sqo_mm); - cur_mm = ctx->sqo_mm; - } - } - to_submit = min(to_submit, ctx->sq_entries); - inflight += io_submit_sqes(ctx, to_submit, cur_mm != NULL, - mm_fault); + inflight += io_submit_sqes(ctx, to_submit, &cur_mm);
/* Commit SQ ring head once we've consumed all SQEs */ io_commit_sqring(ctx);
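A compact user-space model of the flow above (hypothetical names; a sketch, not the kernel code): the mm is grabbed lazily on the first SQE that actually needs user memory, a failed grab is sticky for the rest of the batch, and requests are still submitted either way so links and drains stay intact while has_user makes them fail later.

#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct mm { int users; };

/* Models mmget_not_zero(): fails once the mm has no users left. */
static bool mm_get_not_zero(struct mm *mm)
{
	if (mm->users == 0)
		return false;
	mm->users++;
	return true;
}

/* Models the reworked io_submit_sqes() loop from the diff above. */
static int submit_batch(struct mm *sqo_mm, const bool needs_user[], int nr)
{
	struct mm *cur_mm = NULL;
	bool mm_fault = false;
	int i, submitted = 0;

	for (i = 0; i < nr; i++) {
		if (needs_user[i] && !cur_mm) {
			mm_fault = mm_fault || !mm_get_not_zero(sqo_mm);
			if (!mm_fault)
				cur_mm = sqo_mm;
		}
		/* has_user = (cur_mm != NULL); the request is submitted
		 * either way, so links and drains are honored, and a
		 * request that needed the mm fails later on has_user. */
		submitted++;
	}
	return submitted;
}

int main(void)
{
	struct mm dead = { .users = 0 };	/* grabbing will fail */
	const bool needs_user[3] = { false, true, true };

	/* all three SQEs are still submitted, none dropped up front */
	assert(submit_batch(&dead, needs_user, 3) == 3);
	return 0;
}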
From: Thomas Gleixner tglx@linutronix.de
mainline inclusion from mainline-5.1-rc1 commit 15917dc02841862840efcbfe1da0830f88078b5c category: feature bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27 CVE: NA ---------------------------
The RTMUTEX tester was removed long ago but the PF bit stayed around. Remove it and free up the space.
Signed-off-by: Thomas Gleixner tglx@linutronix.de
Conflicts: include/linux/sched.h [ Patch 73ab1cb2de9e3("umh: add exit routine for UMH process") is not applied. ]
Signed-off-by: Zhihao Cheng chengzhihao1@huawei.com Signed-off-by: yangerkun yangerkun@huawei.com Reviewed-by: Cheng Jian cj.chengjian@huawei.com Reviewed-by: zhangyi (F) yi.zhang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- include/linux/sched.h | 1 - 1 file changed, 1 deletion(-)
diff --git a/include/linux/sched.h b/include/linux/sched.h index 9dc064305c13..67d4cfefc99d 100644 --- a/include/linux/sched.h +++ b/include/linux/sched.h @@ -1421,7 +1421,6 @@ extern struct pid *cad_pid; #define PF_SWAPWRITE 0x00800000 /* Allowed to write to swap */ #define PF_NO_SETAFFINITY 0x04000000 /* Userland is not allowed to meddle with cpus_allowed */ #define PF_MCE_EARLY 0x08000000 /* Early kill for mce process policy */ -#define PF_MUTEX_TESTER 0x20000000 /* Thread belongs to the rt mutex tester */ #define PF_FREEZER_SKIP 0x40000000 /* Freezer should not count it as freezable */ #define PF_SUSPEND_TASK 0x80000000 /* This thread called freeze_processes() and should not be frozen */
From: Thomas Gleixner tglx@linutronix.de
mainline inclusion from mainline-5.2-rc1 commit 6d25be5782e482eb93e3de0c94d0a517879377d0 category: feature bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27 CVE: NA ---------------------------
The worker accounting for CPU bound workers is plugged into the core scheduler code and the wakeup code. This is not a hard requirement and can be avoided by keeping track of the state in the workqueue code itself.
Keep track of the sleeping state in the worker itself and call the notifier before entering the core scheduler. There might be false positives when the task is woken between that call and actually scheduling, but that's not really different from scheduling and being woken immediately after switching away. nr_running is updated when the task returns from schedule() and is later compared when the wakeup is done from ttwu().
[ bigeasy: preempt_disable() around wq_worker_sleeping() by Daniel Bristot de Oliveira ]
Signed-off-by: Thomas Gleixner tglx@linutronix.de Signed-off-by: Sebastian Andrzej Siewior bigeasy@linutronix.de Signed-off-by: Peter Zijlstra (Intel) peterz@infradead.org Acked-by: Tejun Heo tj@kernel.org Cc: Daniel Bristot de Oliveira bristot@redhat.com Cc: Lai Jiangshan jiangshanlai@gmail.com Cc: Linus Torvalds torvalds@linux-foundation.org Cc: Peter Zijlstra peterz@infradead.org Link: http://lkml.kernel.org/r/ad2b29b5715f970bffc1a7026cabd6ff0b24076a.1532952814... Signed-off-by: Ingo Molnar mingo@kernel.org
Conflicts: kernel/workqueue_internal.h [ Patch 1b69ac6b40ebd("psi: fix aggregation idle shut-off") is not applied. ]
Signed-off-by: Zhihao Cheng chengzhihao1@huawei.com Signed-off-by: yangerkun yangerkun@huawei.com Reviewed-by: zhangyi (F) yi.zhang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- kernel/sched/core.c | 88 +++++++++---------------------------- kernel/workqueue.c | 54 ++++++++++------------- kernel/workqueue_internal.h | 5 ++- 3 files changed, 48 insertions(+), 99 deletions(-)
diff --git a/kernel/sched/core.c b/kernel/sched/core.c index 43d58409607b..0fac7e9aa9fe 100644 --- a/kernel/sched/core.c +++ b/kernel/sched/core.c @@ -1663,10 +1663,6 @@ static inline void ttwu_activate(struct rq *rq, struct task_struct *p, int en_fl { activate_task(rq, p, en_flags); p->on_rq = TASK_ON_RQ_QUEUED; - - /* If a worker is waking up, notify the workqueue: */ - if (p->flags & PF_WQ_WORKER) - wq_worker_waking_up(p, cpu_of(rq)); }
/* @@ -2083,56 +2079,6 @@ try_to_wake_up(struct task_struct *p, unsigned int state, int wake_flags) return success; }
-/** - * try_to_wake_up_local - try to wake up a local task with rq lock held - * @p: the thread to be awakened - * @rf: request-queue flags for pinning - * - * Put @p on the run-queue if it's not already there. The caller must - * ensure that this_rq() is locked, @p is bound to this_rq() and not - * the current task. - */ -static void try_to_wake_up_local(struct task_struct *p, struct rq_flags *rf) -{ - struct rq *rq = task_rq(p); - - if (WARN_ON_ONCE(rq != this_rq()) || - WARN_ON_ONCE(p == current)) - return; - - lockdep_assert_held(&rq->lock); - - if (!raw_spin_trylock(&p->pi_lock)) { - /* - * This is OK, because current is on_cpu, which avoids it being - * picked for load-balance and preemption/IRQs are still - * disabled avoiding further scheduler activity on it and we've - * not yet picked a replacement task. - */ - rq_unlock(rq, rf); - raw_spin_lock(&p->pi_lock); - rq_relock(rq, rf); - } - - if (!(p->state & TASK_NORMAL)) - goto out; - - trace_sched_waking(p); - - if (!task_on_rq_queued(p)) { - if (p->in_iowait) { - delayacct_blkio_end(p); - atomic_dec(&rq->nr_iowait); - } - ttwu_activate(rq, p, ENQUEUE_WAKEUP | ENQUEUE_NOCLOCK); - } - - ttwu_do_wakeup(rq, p, 0, rf); - ttwu_stat(p, smp_processor_id(), 0); -out: - raw_spin_unlock(&p->pi_lock); -} - /** * wake_up_process - Wake up a specific process * @p: The process to be woken up. @@ -3538,19 +3484,6 @@ static void __sched notrace __schedule(bool preempt) atomic_inc(&rq->nr_iowait); delayacct_blkio_start(); } - - /* - * If a worker went to sleep, notify and ask workqueue - * whether it wants to wake up a task to maintain - * concurrency. - */ - if (prev->flags & PF_WQ_WORKER) { - struct task_struct *to_wakeup; - - to_wakeup = wq_worker_sleeping(prev); - if (to_wakeup) - try_to_wake_up_local(to_wakeup, &rf); - } } switch_count = &prev->nvcsw; } @@ -3610,6 +3543,20 @@ static inline void sched_submit_work(struct task_struct *tsk) { if (!tsk->state || tsk_is_pi_blocked(tsk)) return; + + /* + * If a worker went to sleep, notify and ask workqueue whether + * it wants to wake up a task to maintain concurrency. + * As this function is called inside the schedule() context, + * we disable preemption to avoid it calling schedule() again + * in the possible wakeup of a kworker. + */ + if (tsk->flags & PF_WQ_WORKER) { + preempt_disable(); + wq_worker_sleeping(tsk); + preempt_enable_no_resched(); + } + /* * If we are going to sleep and we have plugged IO queued, * make sure to submit it to avoid deadlocks. @@ -3618,6 +3565,12 @@ static inline void sched_submit_work(struct task_struct *tsk) blk_schedule_flush_plug(tsk); }
+static void sched_update_worker(struct task_struct *tsk) +{ + if (tsk->flags & PF_WQ_WORKER) + wq_worker_running(tsk); +} + asmlinkage __visible void __sched schedule(void) { struct task_struct *tsk = current; @@ -3628,6 +3581,7 @@ asmlinkage __visible void __sched schedule(void) __schedule(false); sched_preempt_enable_no_resched(); } while (need_resched()); + sched_update_worker(tsk); } EXPORT_SYMBOL(schedule);
diff --git a/kernel/workqueue.c b/kernel/workqueue.c index 1ffc523edb65..a07aa758571e 100644 --- a/kernel/workqueue.c +++ b/kernel/workqueue.c @@ -840,43 +840,32 @@ static void wake_up_worker(struct worker_pool *pool) }
/** - * wq_worker_waking_up - a worker is waking up + * wq_worker_running - a worker is running again * @task: task waking up - * @cpu: CPU @task is waking up to * - * This function is called during try_to_wake_up() when a worker is - * being awoken. - * - * CONTEXT: - * spin_lock_irq(rq->lock) + * This function is called when a worker returns from schedule() */ -void wq_worker_waking_up(struct task_struct *task, int cpu) +void wq_worker_running(struct task_struct *task) { struct worker *worker = kthread_data(task);
- if (!(worker->flags & WORKER_NOT_RUNNING)) { - WARN_ON_ONCE(worker->pool->cpu != cpu); + if (!worker->sleeping) + return; + if (!(worker->flags & WORKER_NOT_RUNNING)) atomic_inc(&worker->pool->nr_running); - } + worker->sleeping = 0; }
/** * wq_worker_sleeping - a worker is going to sleep * @task: task going to sleep * - * This function is called during schedule() when a busy worker is - * going to sleep. Worker on the same cpu can be woken up by - * returning pointer to its task. - * - * CONTEXT: - * spin_lock_irq(rq->lock) - * - * Return: - * Worker task on @cpu to wake up, %NULL if none. + * This function is called from schedule() when a busy worker is + * going to sleep. */ -struct task_struct *wq_worker_sleeping(struct task_struct *task) +void wq_worker_sleeping(struct task_struct *task) { - struct worker *worker = kthread_data(task), *to_wakeup = NULL; + struct worker *next, *worker = kthread_data(task); struct worker_pool *pool;
/* @@ -885,13 +874,15 @@ struct task_struct *wq_worker_sleeping(struct task_struct *task) * checking NOT_RUNNING. */ if (worker->flags & WORKER_NOT_RUNNING) - return NULL; + return;
pool = worker->pool;
- /* this can only happen on the local cpu */ - if (WARN_ON_ONCE(pool->cpu != raw_smp_processor_id())) - return NULL; + if (WARN_ON_ONCE(worker->sleeping)) + return; + + worker->sleeping = 1; + spin_lock_irq(&pool->lock);
/* * The counterpart of the following dec_and_test, implied mb, @@ -905,9 +896,12 @@ struct task_struct *wq_worker_sleeping(struct task_struct *task) * lock is safe. */ if (atomic_dec_and_test(&pool->nr_running) && - !list_empty(&pool->worklist)) - to_wakeup = first_idle_worker(pool); - return to_wakeup ? to_wakeup->task : NULL; + !list_empty(&pool->worklist)) { + next = first_idle_worker(pool); + if (next) + wake_up_process(next->task); + } + spin_unlock_irq(&pool->lock); }
/** @@ -4891,7 +4885,7 @@ static void rebind_workers(struct worker_pool *pool) * * WRITE_ONCE() is necessary because @worker->flags may be * tested without holding any lock in - * wq_worker_waking_up(). Without it, NOT_RUNNING test may + * wq_worker_running(). Without it, NOT_RUNNING test may * fail incorrectly leading to premature concurrency * management operations. */ diff --git a/kernel/workqueue_internal.h b/kernel/workqueue_internal.h index 66fbb5a9e633..30cfed226b39 100644 --- a/kernel/workqueue_internal.h +++ b/kernel/workqueue_internal.h @@ -44,6 +44,7 @@ struct worker { unsigned long last_active; /* L: last active timestamp */ unsigned int flags; /* X: flags */ int id; /* I: worker id */ + int sleeping; /* None */
/* * Opaque string set with work_set_desc(). Printed out with task @@ -69,7 +70,7 @@ static inline struct worker *current_wq_worker(void) * Scheduler hooks for concurrency managed workqueue. Only to be used from * sched/core.c and workqueue.c. */ -void wq_worker_waking_up(struct task_struct *task, int cpu); -struct task_struct *wq_worker_sleeping(struct task_struct *task); +void wq_worker_running(struct task_struct *task); +void wq_worker_sleeping(struct task_struct *task);
#endif /* _KERNEL_WORKQUEUE_INTERNAL_H */
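The protocol is small enough to model in user space; a hedged sketch (single-threaded, hypothetical names) of why the per-worker sleeping flag keeps the two hooks paired even across spurious wakeups:

#include <assert.h>
#include <stdatomic.h>

struct pool { atomic_int nr_running; };
struct worker { int sleeping; struct pool *pool; };

/* Models wq_worker_sleeping(): called before entering schedule(). */
static void worker_sleeping(struct worker *w)
{
	if (w->sleeping)
		return;			/* already accounted for */
	w->sleeping = 1;
	atomic_fetch_sub(&w->pool->nr_running, 1);
}

/* Models wq_worker_running(): called when returning from schedule(). */
static void worker_running(struct worker *w)
{
	if (!w->sleeping)
		return;			/* woken without ever sleeping */
	atomic_fetch_add(&w->pool->nr_running, 1);
	w->sleeping = 0;
}

int main(void)
{
	struct pool p = { .nr_running = 1 };
	struct worker w = { .sleeping = 0, .pool = &p };

	worker_sleeping(&w);	/* notifier fires before schedule() */
	worker_running(&w);	/* accounting restored on return */
	assert(atomic_load(&p.nr_running) == 1);

	worker_running(&w);	/* a stray "running" call is a no-op */
	assert(atomic_load(&p.nr_running) == 1);
	return 0;
}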
From: Jens Axboe axboe@kernel.dk
mainline inclusion from mainline-5.5-rc1 commit 771b53d033e8663abdf59704806aa856b236dcdb category: feature bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27 CVE: NA ---------------------------
This adds support for io-wq, a smaller and specialized thread pool implementation. This is meant to replace workqueues for io_uring. Among the reasons for this addition are:
- We can assign memory context smarter and more persistently if we manage the lifetime of threads.
- We can drop various work-arounds we have in io_uring, like the async_list.
- We can implement hashed work insertion, to manage concurrency of buffered writes without needing a) an extra workqueue, or b) needlessly making the concurrency of said workqueue very low, which hurts performance of multiple buffered file writers.
- We can implement cancel through signals, for cancelling interruptible work like read/write (or send/recv) to/from sockets.
- We need the above cancel to be able to assign and use file tables from a process.
- We can implement a more thorough cancel operation in general.
- We need it to move towards a syslet/threadlet model for even faster async execution. For that we need to take ownership of the used threads.
This list is just off the top of my head. Performance should be the same or better; at least that's what I've seen in my testing. io-wq supports basic NUMA functionality, setting up a pool per node.
io-wq hooks into the scheduler's schedule in/out points just like workqueue does, and uses that to drive the need for more or fewer workers.
Acked-by: Peter Zijlstra (Intel) peterz@infradead.org Signed-off-by: Jens Axboe axboe@kernel.dk
Conflicts: fs/Kconfig fs/Makefile include/linux/sched.h [ Patch d7fefcc8de9("mm/cma: add PF flag to force non cma alloc") is not applied. ]
Signed-off-by: Zhihao Cheng chengzhihao1@huawei.com Signed-off-by: yangerkun yangerkun@huawei.com Reviewed-by: zhangyi (F) yi.zhang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- fs/Kconfig | 3 + fs/Makefile | 1 + fs/io-wq.c | 826 ++++++++++++++++++++++++++++++++++++++++++ fs/io-wq.h | 55 +++ include/linux/sched.h | 1 + kernel/sched/core.c | 16 +- 6 files changed, 898 insertions(+), 4 deletions(-) create mode 100644 fs/io-wq.c create mode 100644 fs/io-wq.h
diff --git a/fs/Kconfig b/fs/Kconfig index 2d9d472d8ba8..5921bfbebee4 100644 --- a/fs/Kconfig +++ b/fs/Kconfig @@ -335,6 +335,9 @@ endif # NETWORK_FILESYSTEMS source "fs/nls/Kconfig" source "fs/dlm/Kconfig"
+config IO_WQ + bool + endmenu
config RESCTRL diff --git a/fs/Makefile b/fs/Makefile index a3d3479e6b49..612dc8785aa0 100644 --- a/fs/Makefile +++ b/fs/Makefile @@ -31,6 +31,7 @@ obj-$(CONFIG_EVENTFD) += eventfd.o obj-$(CONFIG_USERFAULTFD) += userfaultfd.o obj-$(CONFIG_AIO) += aio.o obj-$(CONFIG_IO_URING) += io_uring.o +obj-$(CONFIG_IO_WQ) += io-wq.o obj-$(CONFIG_FS_DAX) += dax.o obj-$(CONFIG_FS_ENCRYPTION) += crypto/ obj-$(CONFIG_FILE_LOCKING) += locks.o diff --git a/fs/io-wq.c b/fs/io-wq.c new file mode 100644 index 000000000000..88acfd0bf139 --- /dev/null +++ b/fs/io-wq.c @@ -0,0 +1,826 @@ +// SPDX-License-Identifier: GPL-2.0 +/* + * Basic worker thread pool for io_uring + * + * Copyright (C) 2019 Jens Axboe + * + */ +#include <linux/uaccess.h> +#include <linux/kernel.h> +#include <linux/init.h> +#include <linux/errno.h> +#include <linux/sched/signal.h> +#include <linux/mm.h> +#include <linux/mmu_context.h> +#include <linux/sched/mm.h> +#include <linux/percpu.h> +#include <linux/slab.h> +#include <linux/kthread.h> +#include <linux/rculist_nulls.h> + +#include "io-wq.h" + +#define WORKER_IDLE_TIMEOUT (5 * HZ) + +enum { + IO_WORKER_F_UP = 1, /* up and active */ + IO_WORKER_F_RUNNING = 2, /* account as running */ + IO_WORKER_F_FREE = 4, /* worker on free list */ + IO_WORKER_F_EXITING = 8, /* worker exiting */ + IO_WORKER_F_FIXED = 16, /* static idle worker */ +}; + +enum { + IO_WQ_BIT_EXIT = 0, /* wq exiting */ + IO_WQ_BIT_CANCEL = 1, /* cancel work on list */ +}; + +enum { + IO_WQE_FLAG_STALLED = 1, /* stalled on hash */ +}; + +/* + * One for each thread in a wqe pool + */ +struct io_worker { + refcount_t ref; + unsigned flags; + struct hlist_nulls_node nulls_node; + struct task_struct *task; + wait_queue_head_t wait; + struct io_wqe *wqe; + struct io_wq_work *cur_work; + + struct rcu_head rcu; + struct mm_struct *mm; +}; + +struct io_wq_nulls_list { + struct hlist_nulls_head head; + unsigned long nulls; +}; + +#if BITS_PER_LONG == 64 +#define IO_WQ_HASH_ORDER 6 +#else +#define IO_WQ_HASH_ORDER 5 +#endif + +/* + * Per-node worker thread pool + */ +struct io_wqe { + struct { + spinlock_t lock; + struct list_head work_list; + unsigned long hash_map; + unsigned flags; + } ____cacheline_aligned_in_smp; + + int node; + unsigned nr_workers; + unsigned max_workers; + atomic_t nr_running; + + struct io_wq_nulls_list free_list; + struct io_wq_nulls_list busy_list; + + struct io_wq *wq; +}; + +/* + * Per io_wq state + */ +struct io_wq { + struct io_wqe **wqes; + unsigned long state; + unsigned nr_wqes; + + struct task_struct *manager; + struct mm_struct *mm; + refcount_t refs; + struct completion done; +}; + +static void io_wq_free_worker(struct rcu_head *head) +{ + struct io_worker *worker = container_of(head, struct io_worker, rcu); + + kfree(worker); +} + +static bool io_worker_get(struct io_worker *worker) +{ + return refcount_inc_not_zero(&worker->ref); +} + +static void io_worker_release(struct io_worker *worker) +{ + if (refcount_dec_and_test(&worker->ref)) + wake_up_process(worker->task); +} + +/* + * Note: drops the wqe->lock if returning true! The caller must re-acquire + * the lock in that case. Some callers need to restart handling if this + * happens, so we can't just re-acquire the lock on behalf of the caller. + */ +static bool __io_worker_unuse(struct io_wqe *wqe, struct io_worker *worker) +{ + /* + * If we have an active mm, we need to drop the wq lock before unusing + * it. If we do, return true and let the caller retry the idle loop. 
+ */ + if (worker->mm) { + __acquire(&wqe->lock); + spin_unlock_irq(&wqe->lock); + __set_current_state(TASK_RUNNING); + set_fs(KERNEL_DS); + unuse_mm(worker->mm); + mmput(worker->mm); + worker->mm = NULL; + return true; + } + + return false; +} + +static void io_worker_exit(struct io_worker *worker) +{ + struct io_wqe *wqe = worker->wqe; + bool all_done = false; + + /* + * If we're not at zero, someone else is holding a brief reference + * to the worker. Wait for that to go away. + */ + set_current_state(TASK_INTERRUPTIBLE); + if (!refcount_dec_and_test(&worker->ref)) + schedule(); + __set_current_state(TASK_RUNNING); + + preempt_disable(); + current->flags &= ~PF_IO_WORKER; + if (worker->flags & IO_WORKER_F_RUNNING) + atomic_dec(&wqe->nr_running); + worker->flags = 0; + preempt_enable(); + + spin_lock_irq(&wqe->lock); + hlist_nulls_del_rcu(&worker->nulls_node); + if (__io_worker_unuse(wqe, worker)) { + __release(&wqe->lock); + spin_lock_irq(&wqe->lock); + } + wqe->nr_workers--; + all_done = !wqe->nr_workers; + spin_unlock_irq(&wqe->lock); + + /* all workers gone, wq exit can proceed */ + if (all_done && refcount_dec_and_test(&wqe->wq->refs)) + complete(&wqe->wq->done); + + call_rcu(&worker->rcu, io_wq_free_worker); +} + +static void io_worker_start(struct io_wqe *wqe, struct io_worker *worker) +{ + allow_kernel_signal(SIGINT); + + current->flags |= PF_IO_WORKER; + + worker->flags |= (IO_WORKER_F_UP | IO_WORKER_F_RUNNING); + atomic_inc(&wqe->nr_running); +} + +/* + * Worker will start processing some work. Move it to the busy list, if + * it's currently on the freelist + */ +static void __io_worker_busy(struct io_wqe *wqe, struct io_worker *worker, + struct io_wq_work *work) + __must_hold(wqe->lock) +{ + if (worker->flags & IO_WORKER_F_FREE) { + worker->flags &= ~IO_WORKER_F_FREE; + hlist_nulls_del_init_rcu(&worker->nulls_node); + hlist_nulls_add_head_rcu(&worker->nulls_node, + &wqe->busy_list.head); + } + worker->cur_work = work; +} + +/* + * No work, worker going to sleep. Move to freelist, and unuse mm if we + * have one attached. Dropping the mm may potentially sleep, so we drop + * the lock in that case and return success. Since the caller has to + * retry the loop in that case (we changed task state), we don't regrab + * the lock if we return success. + */ +static bool __io_worker_idle(struct io_wqe *wqe, struct io_worker *worker) + __must_hold(wqe->lock) +{ + if (!(worker->flags & IO_WORKER_F_FREE)) { + worker->flags |= IO_WORKER_F_FREE; + hlist_nulls_del_init_rcu(&worker->nulls_node); + hlist_nulls_add_head_rcu(&worker->nulls_node, + &wqe->free_list.head); + } + + return __io_worker_unuse(wqe, worker); +} + +static struct io_wq_work *io_get_next_work(struct io_wqe *wqe, unsigned *hash) + __must_hold(wqe->lock) +{ + struct io_wq_work *work; + + list_for_each_entry(work, &wqe->work_list, list) { + /* not hashed, can run anytime */ + if (!(work->flags & IO_WQ_WORK_HASHED)) { + list_del(&work->list); + return work; + } + + /* hashed, can run if not already running */ + *hash = work->flags >> IO_WQ_HASH_SHIFT; + if (!(wqe->hash_map & BIT_ULL(*hash))) { + wqe->hash_map |= BIT_ULL(*hash); + list_del(&work->list); + return work; + } + } + + return NULL; +} + +static void io_worker_handle_work(struct io_worker *worker) + __releases(wqe->lock) +{ + struct io_wq_work *work, *old_work; + struct io_wqe *wqe = worker->wqe; + struct io_wq *wq = wqe->wq; + + do { + unsigned hash = -1U; + + /* + * Signals are either sent to cancel specific work, or to just + * cancel all work items. 
For the former, ->cur_work must + * match. ->cur_work is NULL at this point, since we haven't + * assigned any work, so it's safe to flush signals for that + * case. For the latter case of cancelling all work, the caller + * wil have set IO_WQ_BIT_CANCEL. + */ + if (signal_pending(current)) + flush_signals(current); + + /* + * If we got some work, mark us as busy. If we didn't, but + * the list isn't empty, it means we stalled on hashed work. + * Mark us stalled so we don't keep looking for work when we + * can't make progress, any work completion or insertion will + * clear the stalled flag. + */ + work = io_get_next_work(wqe, &hash); + if (work) + __io_worker_busy(wqe, worker, work); + else if (!list_empty(&wqe->work_list)) + wqe->flags |= IO_WQE_FLAG_STALLED; + + spin_unlock_irq(&wqe->lock); + if (!work) + break; +next: + if ((work->flags & IO_WQ_WORK_NEEDS_USER) && !worker->mm && + wq->mm && mmget_not_zero(wq->mm)) { + use_mm(wq->mm); + set_fs(USER_DS); + worker->mm = wq->mm; + } + if (test_bit(IO_WQ_BIT_CANCEL, &wq->state)) + work->flags |= IO_WQ_WORK_CANCEL; + if (worker->mm) + work->flags |= IO_WQ_WORK_HAS_MM; + + old_work = work; + work->func(&work); + + spin_lock_irq(&wqe->lock); + worker->cur_work = NULL; + if (hash != -1U) { + wqe->hash_map &= ~BIT_ULL(hash); + wqe->flags &= ~IO_WQE_FLAG_STALLED; + } + if (work && work != old_work) { + spin_unlock_irq(&wqe->lock); + /* dependent work not hashed */ + hash = -1U; + goto next; + } + } while (1); +} + +static inline bool io_wqe_run_queue(struct io_wqe *wqe) + __must_hold(wqe->lock) +{ + if (!list_empty_careful(&wqe->work_list) && + !(wqe->flags & IO_WQE_FLAG_STALLED)) + return true; + return false; +} + +static int io_wqe_worker(void *data) +{ + struct io_worker *worker = data; + struct io_wqe *wqe = worker->wqe; + struct io_wq *wq = wqe->wq; + DEFINE_WAIT(wait); + + io_worker_start(wqe, worker); + + while (!test_bit(IO_WQ_BIT_EXIT, &wq->state)) { + prepare_to_wait(&worker->wait, &wait, TASK_INTERRUPTIBLE); + + spin_lock_irq(&wqe->lock); + if (io_wqe_run_queue(wqe)) { + __set_current_state(TASK_RUNNING); + io_worker_handle_work(worker); + continue; + } + /* drops the lock on success, retry */ + if (__io_worker_idle(wqe, worker)) { + __release(&wqe->lock); + continue; + } + spin_unlock_irq(&wqe->lock); + if (signal_pending(current)) + flush_signals(current); + if (schedule_timeout(WORKER_IDLE_TIMEOUT)) + continue; + /* timed out, exit unless we're the fixed worker */ + if (test_bit(IO_WQ_BIT_EXIT, &wq->state) || + !(worker->flags & IO_WORKER_F_FIXED)) + break; + } + + finish_wait(&worker->wait, &wait); + + if (test_bit(IO_WQ_BIT_EXIT, &wq->state)) { + spin_lock_irq(&wqe->lock); + if (!list_empty(&wqe->work_list)) + io_worker_handle_work(worker); + else + spin_unlock_irq(&wqe->lock); + } + + io_worker_exit(worker); + return 0; +} + +/* + * Check head of free list for an available worker. If one isn't available, + * caller must wake up the wq manager to create one. + */ +static bool io_wqe_activate_free_worker(struct io_wqe *wqe) + __must_hold(RCU) +{ + struct hlist_nulls_node *n; + struct io_worker *worker; + + n = rcu_dereference(hlist_nulls_first_rcu(&wqe->free_list.head)); + if (is_a_nulls(n)) + return false; + + worker = hlist_nulls_entry(n, struct io_worker, nulls_node); + if (io_worker_get(worker)) { + wake_up(&worker->wait); + io_worker_release(worker); + return true; + } + + return false; +} + +/* + * We need a worker. If we find a free one, we're good. 
If not, and we're + * below the max number of workers, wake up the manager to create one. + */ +static void io_wqe_wake_worker(struct io_wqe *wqe) +{ + bool ret; + + rcu_read_lock(); + ret = io_wqe_activate_free_worker(wqe); + rcu_read_unlock(); + + if (!ret && wqe->nr_workers < wqe->max_workers) + wake_up_process(wqe->wq->manager); +} + +/* + * Called when a worker is scheduled in. Mark us as currently running. + */ +void io_wq_worker_running(struct task_struct *tsk) +{ + struct io_worker *worker = kthread_data(tsk); + struct io_wqe *wqe = worker->wqe; + + if (!(worker->flags & IO_WORKER_F_UP)) + return; + if (worker->flags & IO_WORKER_F_RUNNING) + return; + worker->flags |= IO_WORKER_F_RUNNING; + atomic_inc(&wqe->nr_running); +} + +/* + * Called when worker is going to sleep. If there are no workers currently + * running and we have work pending, wake up a free one or have the manager + * set one up. + */ +void io_wq_worker_sleeping(struct task_struct *tsk) +{ + struct io_worker *worker = kthread_data(tsk); + struct io_wqe *wqe = worker->wqe; + + if (!(worker->flags & IO_WORKER_F_UP)) + return; + if (!(worker->flags & IO_WORKER_F_RUNNING)) + return; + + worker->flags &= ~IO_WORKER_F_RUNNING; + + spin_lock_irq(&wqe->lock); + if (atomic_dec_and_test(&wqe->nr_running) && io_wqe_run_queue(wqe)) + io_wqe_wake_worker(wqe); + spin_unlock_irq(&wqe->lock); +} + +static void create_io_worker(struct io_wq *wq, struct io_wqe *wqe) +{ + struct io_worker *worker; + + worker = kcalloc_node(1, sizeof(*worker), GFP_KERNEL, wqe->node); + if (!worker) + return; + + refcount_set(&worker->ref, 1); + worker->nulls_node.pprev = NULL; + init_waitqueue_head(&worker->wait); + worker->wqe = wqe; + + worker->task = kthread_create_on_node(io_wqe_worker, worker, wqe->node, + "io_wqe_worker-%d", wqe->node); + if (IS_ERR(worker->task)) { + kfree(worker); + return; + } + + spin_lock_irq(&wqe->lock); + hlist_nulls_add_head_rcu(&worker->nulls_node, &wqe->free_list.head); + worker->flags |= IO_WORKER_F_FREE; + if (!wqe->nr_workers) + worker->flags |= IO_WORKER_F_FIXED; + wqe->nr_workers++; + spin_unlock_irq(&wqe->lock); + + wake_up_process(worker->task); +} + +static inline bool io_wqe_need_new_worker(struct io_wqe *wqe) + __must_hold(wqe->lock) +{ + if (!wqe->nr_workers) + return true; + if (hlist_nulls_empty(&wqe->free_list.head) && + wqe->nr_workers < wqe->max_workers && io_wqe_run_queue(wqe)) + return true; + + return false; +} + +/* + * Manager thread. Tasked with creating new workers, if we need them. + */ +static int io_wq_manager(void *data) +{ + struct io_wq *wq = data; + + while (!kthread_should_stop()) { + int i; + + for (i = 0; i < wq->nr_wqes; i++) { + struct io_wqe *wqe = wq->wqes[i]; + bool fork_worker = false; + + spin_lock_irq(&wqe->lock); + fork_worker = io_wqe_need_new_worker(wqe); + spin_unlock_irq(&wqe->lock); + if (fork_worker) + create_io_worker(wq, wqe); + } + set_current_state(TASK_INTERRUPTIBLE); + schedule_timeout(HZ); + } + + return 0; +} + +static void io_wqe_enqueue(struct io_wqe *wqe, struct io_wq_work *work) +{ + unsigned long flags; + + spin_lock_irqsave(&wqe->lock, flags); + list_add_tail(&work->list, &wqe->work_list); + wqe->flags &= ~IO_WQE_FLAG_STALLED; + spin_unlock_irqrestore(&wqe->lock, flags); + + if (!atomic_read(&wqe->nr_running)) + io_wqe_wake_worker(wqe); +} + +void io_wq_enqueue(struct io_wq *wq, struct io_wq_work *work) +{ + struct io_wqe *wqe = wq->wqes[numa_node_id()]; + + io_wqe_enqueue(wqe, work); +} + +/* + * Enqueue work, hashed by some key. 
Work items that hash to the same value + * will not be done in parallel. Used to limit concurrent writes, generally + * hashed by inode. + */ +void io_wq_enqueue_hashed(struct io_wq *wq, struct io_wq_work *work, void *val) +{ + struct io_wqe *wqe = wq->wqes[numa_node_id()]; + unsigned bit; + + + bit = hash_ptr(val, IO_WQ_HASH_ORDER); + work->flags |= (IO_WQ_WORK_HASHED | (bit << IO_WQ_HASH_SHIFT)); + io_wqe_enqueue(wqe, work); +} + +static bool io_wqe_worker_send_sig(struct io_worker *worker, void *data) +{ + send_sig(SIGINT, worker->task, 1); + return false; +} + +/* + * Iterate the passed in list and call the specific function for each + * worker that isn't exiting + */ +static bool io_wq_for_each_worker(struct io_wqe *wqe, + struct io_wq_nulls_list *list, + bool (*func)(struct io_worker *, void *), + void *data) +{ + struct hlist_nulls_node *n; + struct io_worker *worker; + bool ret = false; + +restart: + hlist_nulls_for_each_entry_rcu(worker, n, &list->head, nulls_node) { + if (io_worker_get(worker)) { + ret = func(worker, data); + io_worker_release(worker); + if (ret) + break; + } + } + if (!ret && get_nulls_value(n) != list->nulls) + goto restart; + return ret; +} + +void io_wq_cancel_all(struct io_wq *wq) +{ + int i; + + set_bit(IO_WQ_BIT_CANCEL, &wq->state); + + /* + * Browse both lists, as there's a gap between handing work off + * to a worker and the worker putting itself on the busy_list + */ + rcu_read_lock(); + for (i = 0; i < wq->nr_wqes; i++) { + struct io_wqe *wqe = wq->wqes[i]; + + io_wq_for_each_worker(wqe, &wqe->busy_list, + io_wqe_worker_send_sig, NULL); + io_wq_for_each_worker(wqe, &wqe->free_list, + io_wqe_worker_send_sig, NULL); + } + rcu_read_unlock(); +} + +static bool io_wq_worker_cancel(struct io_worker *worker, void *data) +{ + struct io_wq_work *work = data; + + if (worker->cur_work == work) { + send_sig(SIGINT, worker->task, 1); + return true; + } + + return false; +} + +static enum io_wq_cancel io_wqe_cancel_work(struct io_wqe *wqe, + struct io_wq_work *cwork) +{ + struct io_wq_work *work; + bool found = false; + + cwork->flags |= IO_WQ_WORK_CANCEL; + + /* + * First check pending list, if we're lucky we can just remove it + * from there. CANCEL_OK means that the work is returned as-new, + * no completion will be posted for it. + */ + spin_lock_irq(&wqe->lock); + list_for_each_entry(work, &wqe->work_list, list) { + if (work == cwork) { + list_del(&work->list); + found = true; + break; + } + } + spin_unlock_irq(&wqe->lock); + + if (found) { + work->flags |= IO_WQ_WORK_CANCEL; + work->func(&work); + return IO_WQ_CANCEL_OK; + } + + /* + * Now check if a free (going busy) or busy worker has the work + * currently running. If we find it there, we'll return CANCEL_RUNNING + * as an indication that we attempte to signal cancellation. The + * completion will run normally in this case. + */ + rcu_read_lock(); + found = io_wq_for_each_worker(wqe, &wqe->free_list, io_wq_worker_cancel, + cwork); + if (found) + goto done; + + found = io_wq_for_each_worker(wqe, &wqe->busy_list, io_wq_worker_cancel, + cwork); +done: + rcu_read_unlock(); + return found ? 
IO_WQ_CANCEL_RUNNING : IO_WQ_CANCEL_NOTFOUND; +} + +enum io_wq_cancel io_wq_cancel_work(struct io_wq *wq, struct io_wq_work *cwork) +{ + enum io_wq_cancel ret = IO_WQ_CANCEL_NOTFOUND; + int i; + + for (i = 0; i < wq->nr_wqes; i++) { + struct io_wqe *wqe = wq->wqes[i]; + + ret = io_wqe_cancel_work(wqe, cwork); + if (ret != IO_WQ_CANCEL_NOTFOUND) + break; + } + + return ret; +} + +struct io_wq_flush_data { + struct io_wq_work work; + struct completion done; +}; + +static void io_wq_flush_func(struct io_wq_work **workptr) +{ + struct io_wq_work *work = *workptr; + struct io_wq_flush_data *data; + + data = container_of(work, struct io_wq_flush_data, work); + complete(&data->done); +} + +/* + * Doesn't wait for previously queued work to finish. When this completes, + * it just means that previously queued work was started. + */ +void io_wq_flush(struct io_wq *wq) +{ + struct io_wq_flush_data data; + int i; + + for (i = 0; i < wq->nr_wqes; i++) { + struct io_wqe *wqe = wq->wqes[i]; + + init_completion(&data.done); + INIT_IO_WORK(&data.work, io_wq_flush_func); + io_wqe_enqueue(wqe, &data.work); + wait_for_completion(&data.done); + } +} + +struct io_wq *io_wq_create(unsigned concurrency, struct mm_struct *mm) +{ + int ret = -ENOMEM, i, node; + struct io_wq *wq; + + wq = kcalloc(1, sizeof(*wq), GFP_KERNEL); + if (!wq) + return ERR_PTR(-ENOMEM); + + wq->nr_wqes = num_online_nodes(); + wq->wqes = kcalloc(wq->nr_wqes, sizeof(struct io_wqe *), GFP_KERNEL); + if (!wq->wqes) { + kfree(wq); + return ERR_PTR(-ENOMEM); + } + + i = 0; + refcount_set(&wq->refs, wq->nr_wqes); + for_each_online_node(node) { + struct io_wqe *wqe; + + wqe = kcalloc_node(1, sizeof(struct io_wqe), GFP_KERNEL, node); + if (!wqe) + break; + wq->wqes[i] = wqe; + wqe->node = node; + wqe->max_workers = concurrency; + wqe->node = node; + wqe->wq = wq; + spin_lock_init(&wqe->lock); + INIT_LIST_HEAD(&wqe->work_list); + INIT_HLIST_NULLS_HEAD(&wqe->free_list.head, 0); + wqe->free_list.nulls = 0; + INIT_HLIST_NULLS_HEAD(&wqe->busy_list.head, 1); + wqe->busy_list.nulls = 1; + atomic_set(&wqe->nr_running, 0); + + i++; + } + + init_completion(&wq->done); + + if (i != wq->nr_wqes) + goto err; + + /* caller must have already done mmgrab() on this mm */ + wq->mm = mm; + + wq->manager = kthread_create(io_wq_manager, wq, "io_wq_manager"); + if (!IS_ERR(wq->manager)) { + wake_up_process(wq->manager); + return wq; + } + + ret = PTR_ERR(wq->manager); + wq->manager = NULL; +err: + complete(&wq->done); + io_wq_destroy(wq); + return ERR_PTR(ret); +} + +static bool io_wq_worker_wake(struct io_worker *worker, void *data) +{ + wake_up_process(worker->task); + return false; +} + +void io_wq_destroy(struct io_wq *wq) +{ + int i; + + if (wq->manager) { + set_bit(IO_WQ_BIT_EXIT, &wq->state); + kthread_stop(wq->manager); + } + + rcu_read_lock(); + for (i = 0; i < wq->nr_wqes; i++) { + struct io_wqe *wqe = wq->wqes[i]; + + if (!wqe) + continue; + io_wq_for_each_worker(wqe, &wqe->free_list, io_wq_worker_wake, + NULL); + io_wq_for_each_worker(wqe, &wqe->busy_list, io_wq_worker_wake, + NULL); + } + rcu_read_unlock(); + + wait_for_completion(&wq->done); + + for (i = 0; i < wq->nr_wqes; i++) + kfree(wq->wqes[i]); + kfree(wq->wqes); + kfree(wq); +} diff --git a/fs/io-wq.h b/fs/io-wq.h new file mode 100644 index 000000000000..be8f22c8937b --- /dev/null +++ b/fs/io-wq.h @@ -0,0 +1,55 @@ +#ifndef INTERNAL_IO_WQ_H +#define INTERNAL_IO_WQ_H + +struct io_wq; + +enum { + IO_WQ_WORK_CANCEL = 1, + IO_WQ_WORK_HAS_MM = 2, + IO_WQ_WORK_HASHED = 4, + IO_WQ_WORK_NEEDS_USER = 8, + + 
IO_WQ_HASH_SHIFT = 24, /* upper 8 bits are used for hash key */ +}; + +enum io_wq_cancel { + IO_WQ_CANCEL_OK, /* cancelled before started */ + IO_WQ_CANCEL_RUNNING, /* found, running, and attempted cancelled */ + IO_WQ_CANCEL_NOTFOUND, /* work not found */ +}; + +struct io_wq_work { + struct list_head list; + void (*func)(struct io_wq_work **); + unsigned flags; +}; + +#define INIT_IO_WORK(work, _func) \ + do { \ + (work)->func = _func; \ + (work)->flags = 0; \ + } while (0) \ + +struct io_wq *io_wq_create(unsigned concurrency, struct mm_struct *mm); +void io_wq_destroy(struct io_wq *wq); + +void io_wq_enqueue(struct io_wq *wq, struct io_wq_work *work); +void io_wq_enqueue_hashed(struct io_wq *wq, struct io_wq_work *work, void *val); +void io_wq_flush(struct io_wq *wq); + +void io_wq_cancel_all(struct io_wq *wq); +enum io_wq_cancel io_wq_cancel_work(struct io_wq *wq, struct io_wq_work *cwork); + +#if defined(CONFIG_IO_WQ) +extern void io_wq_worker_sleeping(struct task_struct *); +extern void io_wq_worker_running(struct task_struct *); +#else +static inline void io_wq_worker_sleeping(struct task_struct *tsk) +{ +} +static inline void io_wq_worker_running(struct task_struct *tsk) +{ +} +#endif + +#endif diff --git a/include/linux/sched.h b/include/linux/sched.h index 67d4cfefc99d..835654168b42 100644 --- a/include/linux/sched.h +++ b/include/linux/sched.h @@ -1421,6 +1421,7 @@ extern struct pid *cad_pid; #define PF_SWAPWRITE 0x00800000 /* Allowed to write to swap */ #define PF_NO_SETAFFINITY 0x04000000 /* Userland is not allowed to meddle with cpus_allowed */ #define PF_MCE_EARLY 0x08000000 /* Early kill for mce process policy */ +#define PF_IO_WORKER 0x20000000 /* Task is an IO worker */ #define PF_FREEZER_SKIP 0x40000000 /* Freezer should not count it as freezable */ #define PF_SUSPEND_TASK 0x80000000 /* This thread called freeze_processes() and should not be frozen */
diff --git a/kernel/sched/core.c b/kernel/sched/core.c index 0fac7e9aa9fe..41fee321ef83 100644 --- a/kernel/sched/core.c +++ b/kernel/sched/core.c @@ -15,6 +15,7 @@ #include <asm/tlb.h>
#include "../workqueue_internal.h" +#include "../../fs/io-wq.h" #include "../smpboot.h"
#include "pelt.h" @@ -3551,9 +3552,12 @@ static inline void sched_submit_work(struct task_struct *tsk) * we disable preemption to avoid it calling schedule() again * in the possible wakeup of a kworker. */ - if (tsk->flags & PF_WQ_WORKER) { + if (tsk->flags & (PF_WQ_WORKER | PF_IO_WORKER)) { preempt_disable(); - wq_worker_sleeping(tsk); + if (tsk->flags & PF_WQ_WORKER) + wq_worker_sleeping(tsk); + else + io_wq_worker_sleeping(tsk); preempt_enable_no_resched(); }
@@ -3567,8 +3571,12 @@ static inline void sched_submit_work(struct task_struct *tsk)
static void sched_update_worker(struct task_struct *tsk) { - if (tsk->flags & PF_WQ_WORKER) - wq_worker_running(tsk); + if (tsk->flags & (PF_WQ_WORKER | PF_IO_WORKER)) { + if (tsk->flags & PF_WQ_WORKER) + wq_worker_running(tsk); + else + io_wq_worker_running(tsk); + } }
asmlinkage __visible void __sched schedule(void)
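For orientation, a short kernel-style sketch of how a consumer of the io-wq API above would create a pool and queue work; example_setup() and my_work_fn() are illustrative, while the calls themselves come from the new fs/io-wq.h.

#include <linux/err.h>
#include "io-wq.h"

/*
 * Work functions get a double pointer so they can hand back dependent
 * work: storing another item in *workptr makes the worker run it next.
 */
static void my_work_fn(struct io_wq_work **workptr)
{
	struct io_wq_work *work = *workptr;

	if (work->flags & IO_WQ_WORK_CANCEL)
		return;			/* pool is being torn down */
	/* ... perform the possibly-blocking operation here ... */
}

static int example_setup(struct mm_struct *mm, struct io_wq_work *work)
{
	/* one pool per online NUMA node, at most 4 workers each; the
	 * mm must already be pinned, per io_wq_create()'s contract */
	struct io_wq *wq = io_wq_create(4, mm);

	if (IS_ERR(wq))
		return PTR_ERR(wq);

	INIT_IO_WORK(work, my_work_fn);
	io_wq_enqueue(wq, work);	/* queued on the local node's pool */

	io_wq_flush(wq);	/* previously queued work has now started */
	io_wq_destroy(wq);
	return 0;
}

The double-pointer calling convention is what later lets a worker chain dependent link work directly, without bouncing it through the queue again.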
From: Jens Axboe axboe@kernel.dk
mainline inclusion from mainline-5.5-rc1 commit 561fb04a6a2257716738dac2ed812f377c2634c2 category: feature bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27 CVE: NA ---------------------------
Drop various work-arounds we have for workqueues:
- We no longer need the async_list for tracking sequential IO.
- We don't have to maintain our own mm tracking/setting.
- We don't need a separate workqueue for buffered writes. This didn't even work that well to begin with, as it was suboptimal for multiple buffered writers on multiple files.
- We can properly cancel pending interruptible work. This fixes deadlocks, particularly with socket IO, where we cannot cancel the work when the io_uring is closed. Without cancellation, the ring waits forever for these requests to complete, which may never happen. This is different from disk IO, where we know requests will complete in a finite amount of time.
- Because we can cancel interruptible work that is already running, we can implement file table support for work. We need that to support system calls that add to a process file table.
- It gets us one step closer to adding async support for any system call.
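The buffered-write point deserves a concrete shape; a hedged sketch of the new punt path (mirroring io_queue_async_work() in the diff below, with the inode as the serialization key):

/*
 * Sketch of the punt path that replaces the two workqueues: buffered
 * writes are hashed by inode, so io-wq never runs two of them against
 * the same file in parallel, while all other work fans out freely.
 */
static void queue_async_work_sketch(struct io_ring_ctx *ctx,
				    struct io_kiocb *req)
{
	bool do_hashed = io_prep_async_work(req);  /* WRITEV/WRITE_FIXED? */

	if (do_hashed)
		io_wq_enqueue_hashed(ctx->io_wq, &req->work,
				     file_inode(req->file));
	else
		io_wq_enqueue(ctx->io_wq, &req->work);
}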
Signed-off-by: Jens Axboe axboe@kernel.dk
Conflicts: fs/io_uring.c [ Patch b5420237ec81("mm: refactor readahead defines in mm.h") is not applied. ]
Signed-off-by: Zhihao Cheng chengzhihao1@huawei.com Signed-off-by: yangerkun yangerkun@huawei.com Reviewed-by: zhangyi (F) yi.zhang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- fs/io_uring.c | 417 ++++++++------------------------ include/trace/events/io_uring.h | 12 +- init/Kconfig | 1 + 3 files changed, 107 insertions(+), 323 deletions(-)
diff --git a/fs/io_uring.c b/fs/io_uring.c index 74f194cbef9b..facf3caec6d0 100644 --- a/fs/io_uring.c +++ b/fs/io_uring.c @@ -56,7 +56,6 @@ #include <linux/mmu_context.h> #include <linux/percpu.h> #include <linux/slab.h> -#include <linux/workqueue.h> #include <linux/kthread.h> #include <linux/blkdev.h> #include <linux/bvec.h> @@ -77,6 +76,7 @@ #include <uapi/linux/io_uring.h>
#include "internal.h" +#include "io-wq.h"
#define IORING_MAX_ENTRIES 32768 #define IORING_MAX_CQ_ENTRIES (2 * IORING_MAX_ENTRIES) @@ -165,16 +165,6 @@ struct io_mapped_ubuf { unsigned int nr_bvecs; };
-struct async_list { - spinlock_t lock; - atomic_t cnt; - struct list_head list; - - struct file *file; - off_t io_start; - size_t io_len; -}; - struct io_ring_ctx { struct { struct percpu_ref refs; @@ -209,7 +199,7 @@ struct io_ring_ctx { } ____cacheline_aligned_in_smp;
/* IO offload */ - struct workqueue_struct *sqo_wq[2]; + struct io_wq *io_wq; struct task_struct *sqo_thread; /* if using sq thread polling */ struct mm_struct *sqo_mm; wait_queue_head_t sqo_wait; @@ -262,8 +252,6 @@ struct io_ring_ctx { struct list_head cancel_list; } ____cacheline_aligned_in_smp;
- struct async_list pending_async[2]; - #if defined(CONFIG_UNIX) struct socket *ring_sock; #endif @@ -333,7 +321,7 @@ struct io_kiocb { u32 result; u32 sequence;
- struct work_struct work; + struct io_wq_work work; };
#define IO_PLUG_THRESHOLD 2 @@ -359,7 +347,7 @@ struct io_submit_state { unsigned int ios_left; };
-static void io_sq_wq_submit_work(struct work_struct *work); +static void io_wq_submit_work(struct io_wq_work **workptr); static void io_cqring_fill_event(struct io_ring_ctx *ctx, u64 ki_user_data, long res); static void __io_free_req(struct io_kiocb *req); @@ -391,7 +379,6 @@ static void io_ring_ctx_ref_free(struct percpu_ref *ref) static struct io_ring_ctx *io_ring_ctx_alloc(struct io_uring_params *p) { struct io_ring_ctx *ctx; - int i;
ctx = kzalloc(sizeof(*ctx), GFP_KERNEL); if (!ctx) @@ -408,11 +395,6 @@ static struct io_ring_ctx *io_ring_ctx_alloc(struct io_uring_params *p) init_completion(&ctx->sqo_thread_started); mutex_init(&ctx->uring_lock); init_waitqueue_head(&ctx->wait); - for (i = 0; i < ARRAY_SIZE(ctx->pending_async); i++) { - spin_lock_init(&ctx->pending_async[i].lock); - INIT_LIST_HEAD(&ctx->pending_async[i].list); - atomic_set(&ctx->pending_async[i].cnt, 0); - } spin_lock_init(&ctx->completion_lock); INIT_LIST_HEAD(&ctx->poll_list); INIT_LIST_HEAD(&ctx->cancel_list); @@ -478,22 +460,45 @@ static void __io_commit_cqring(struct io_ring_ctx *ctx) } }
-static inline void io_queue_async_work(struct io_ring_ctx *ctx, - struct io_kiocb *req) +static inline bool io_sqe_needs_user(const struct io_uring_sqe *sqe) { - int rw = 0; + u8 opcode = READ_ONCE(sqe->opcode); + + return !(opcode == IORING_OP_READ_FIXED || + opcode == IORING_OP_WRITE_FIXED); +} + +static inline bool io_prep_async_work(struct io_kiocb *req) +{ + bool do_hashed = false;
if (req->submit.sqe) { switch (req->submit.sqe->opcode) { case IORING_OP_WRITEV: case IORING_OP_WRITE_FIXED: - rw = !(req->rw.ki_flags & IOCB_DIRECT); + do_hashed = true; break; } + if (io_sqe_needs_user(req->submit.sqe)) + req->work.flags |= IO_WQ_WORK_NEEDS_USER; }
- trace_io_uring_queue_async_work(ctx, rw, req, &req->work, req->flags); - queue_work(ctx->sqo_wq[rw], &req->work); + return do_hashed; +} + +static inline void io_queue_async_work(struct io_ring_ctx *ctx, + struct io_kiocb *req) +{ + bool do_hashed = io_prep_async_work(req); + + trace_io_uring_queue_async_work(ctx, do_hashed, req, &req->work, + req->flags); + if (!do_hashed) { + io_wq_enqueue(ctx->io_wq, &req->work); + } else { + io_wq_enqueue_hashed(ctx->io_wq, &req->work, + file_inode(req->file)); + } }
static void io_kill_timeout(struct io_kiocb *req) @@ -646,6 +651,7 @@ static struct io_kiocb *io_get_req(struct io_ring_ctx *ctx, /* one is dropped after submission, the other at completion */ refcount_set(&req->refs, 2); req->result = 0; + INIT_IO_WORK(&req->work, io_wq_submit_work); return req; out: percpu_ref_put(&ctx->refs); @@ -692,12 +698,10 @@ static void io_req_link_next(struct io_kiocb *req, struct io_kiocb **nxtptr) * If we're in async work, we can continue processing the chain * in this context instead of having to queue up new async work. */ - if (nxtptr && current_work()) { + if (nxtptr && current_work()) *nxtptr = nxt; - } else { - INIT_WORK(&nxt->work, io_sq_wq_submit_work); + else io_queue_async_work(req->ctx, nxt); - } } }
@@ -756,12 +760,10 @@ static void io_put_req(struct io_kiocb *req, struct io_kiocb **nxtptr)
nxt = io_put_req_find_next(req); if (nxt) { - if (nxtptr) { + if (nxtptr) *nxtptr = nxt; - } else { - INIT_WORK(&nxt->work, io_sq_wq_submit_work); + else io_queue_async_work(nxt->ctx, nxt); - } } }
@@ -1323,65 +1325,6 @@ static ssize_t io_import_iovec(struct io_ring_ctx *ctx, int rw, return import_iovec(rw, buf, sqe_len, UIO_FASTIOV, iovec, iter); }
-static inline bool io_should_merge(struct async_list *al, struct kiocb *kiocb) -{ - if (al->file == kiocb->ki_filp) { - off_t start, end; - - /* - * Allow merging if we're anywhere in the range of the same - * page. Generally this happens for sub-page reads or writes, - * and it's beneficial to allow the first worker to bring the - * page in and the piggy backed work can then work on the - * cached page. - */ - start = al->io_start & PAGE_MASK; - end = (al->io_start + al->io_len + PAGE_SIZE - 1) & PAGE_MASK; - if (kiocb->ki_pos >= start && kiocb->ki_pos <= end) - return true; - } - - al->file = NULL; - return false; -} - -/* - * Make a note of the last file/offset/direction we punted to async - * context. We'll use this information to see if we can piggy back a - * sequential request onto the previous one, if it's still hasn't been - * completed by the async worker. - */ -static void io_async_list_note(int rw, struct io_kiocb *req, size_t len) -{ - struct async_list *async_list = &req->ctx->pending_async[rw]; - struct kiocb *kiocb = &req->rw; - struct file *filp = kiocb->ki_filp; - - if (io_should_merge(async_list, kiocb)) { - unsigned long max_bytes; - - /* Use 8x RA size as a decent limiter for both reads/writes */ - max_bytes = filp->f_ra.ra_pages << (PAGE_SHIFT + 3); - if (!max_bytes) - max_bytes = VM_MAX_READAHEAD << 7; - - /* If max len are exceeded, reset the state */ - if (async_list->io_len + len <= max_bytes) { - req->flags |= REQ_F_SEQ_PREV; - async_list->io_len += len; - } else { - async_list->file = NULL; - } - } - - /* New file? Reset state. */ - if (async_list->file != filp) { - async_list->io_start = kiocb->ki_pos; - async_list->io_len = len; - async_list->file = filp; - } -} - /* * For files that don't have ->read_iter() and ->write_iter(), handle them * by looping over ->read() or ->write() manually. @@ -1476,13 +1419,10 @@ static int io_read(struct io_kiocb *req, const struct sqe_submit *s, ret2 > 0 && ret2 < read_size) ret2 = -EAGAIN; /* Catch -EAGAIN return for forced non-blocking submission */ - if (!force_nonblock || ret2 != -EAGAIN) { + if (!force_nonblock || ret2 != -EAGAIN) kiocb_done(kiocb, ret2, nxt, s->in_async); - } else { - if (!s->in_async) - io_async_list_note(READ, req, iov_count); + else ret = -EAGAIN; - } } kfree(iovec); return ret; @@ -1516,11 +1456,8 @@ static int io_write(struct io_kiocb *req, const struct sqe_submit *s, iov_count = iov_iter_count(&iter);
ret = -EAGAIN; - if (force_nonblock && !(kiocb->ki_flags & IOCB_DIRECT)) { - if (!s->in_async) - io_async_list_note(WRITE, req, iov_count); + if (force_nonblock && !(kiocb->ki_flags & IOCB_DIRECT)) goto out_free; - }
ret = rw_verify_area(WRITE, file, &kiocb->ki_pos, iov_count); if (!ret) { @@ -1545,13 +1482,10 @@ static int io_write(struct io_kiocb *req, const struct sqe_submit *s, ret2 = call_write_iter(file, kiocb, &iter); else ret2 = loop_rw_iter(WRITE, file, kiocb, &iter); - if (!force_nonblock || ret2 != -EAGAIN) { + if (!force_nonblock || ret2 != -EAGAIN) kiocb_done(kiocb, ret2, nxt, s->in_async); - } else { - if (!s->in_async) - io_async_list_note(WRITE, req, iov_count); + else ret = -EAGAIN; - } } out_free: kfree(iovec); @@ -1793,14 +1727,18 @@ static void io_poll_complete(struct io_ring_ctx *ctx, struct io_kiocb *req, io_commit_cqring(ctx); }
-static void io_poll_complete_work(struct work_struct *work) +static void io_poll_complete_work(struct io_wq_work **workptr) { + struct io_wq_work *work = *workptr; struct io_kiocb *req = container_of(work, struct io_kiocb, work); struct io_poll_iocb *poll = &req->poll; struct poll_table_struct pt = { ._key = poll->events }; struct io_ring_ctx *ctx = req->ctx; __poll_t mask = 0;
+ if (work->flags & IO_WQ_WORK_CANCEL) + WRITE_ONCE(poll->canceled, true); + if (!READ_ONCE(poll->canceled)) mask = vfs_poll(poll->file, &pt) & poll->events;
@@ -1893,7 +1831,7 @@ static int io_poll_add(struct io_kiocb *req, const struct io_uring_sqe *sqe) return -EBADF;
req->submit.sqe = NULL; - INIT_WORK(&req->work, io_poll_complete_work); + INIT_IO_WORK(&req->work, io_poll_complete_work); events = READ_ONCE(sqe->poll_events); poll->events = demangle_poll(events) | EPOLLERR | EPOLLHUP;
@@ -2151,7 +2089,6 @@ static int io_req_defer(struct io_ring_ctx *ctx, struct io_kiocb *req, memcpy(sqe_copy, sqe, sizeof(*sqe_copy)); req->submit.sqe = sqe_copy;
- INIT_WORK(&req->work, io_sq_wq_submit_work); trace_io_uring_defer(ctx, req, false); list_add_tail(&req->list, &ctx->defer_list); spin_unlock_irq(&ctx->completion_lock); @@ -2234,186 +2171,54 @@ static int __io_submit_sqe(struct io_ring_ctx *ctx, struct io_kiocb *req, return 0; }
-static struct async_list *io_async_list_from_sqe(struct io_ring_ctx *ctx, - const struct io_uring_sqe *sqe) -{ - switch (sqe->opcode) { - case IORING_OP_READV: - case IORING_OP_READ_FIXED: - return &ctx->pending_async[READ]; - case IORING_OP_WRITEV: - case IORING_OP_WRITE_FIXED: - return &ctx->pending_async[WRITE]; - default: - return NULL; - } -} - -static inline bool io_sqe_needs_user(const struct io_uring_sqe *sqe) -{ - u8 opcode = READ_ONCE(sqe->opcode); - - return !(opcode == IORING_OP_READ_FIXED || - opcode == IORING_OP_WRITE_FIXED); -} - -static void io_sq_wq_submit_work(struct work_struct *work) +static void io_wq_submit_work(struct io_wq_work **workptr) { + struct io_wq_work *work = *workptr; struct io_kiocb *req = container_of(work, struct io_kiocb, work); struct io_ring_ctx *ctx = req->ctx; - struct mm_struct *cur_mm = NULL; - struct async_list *async_list; - LIST_HEAD(req_list); - mm_segment_t old_fs; - int ret; + struct sqe_submit *s = &req->submit; + const struct io_uring_sqe *sqe = s->sqe; + struct io_kiocb *nxt = NULL; + int ret = 0;
- async_list = io_async_list_from_sqe(ctx, req->submit.sqe); -restart: - do { - struct sqe_submit *s = &req->submit; - const struct io_uring_sqe *sqe = s->sqe; - unsigned int flags = req->flags; - struct io_kiocb *nxt = NULL; + /* Ensure we clear previously set non-block flag */ + req->rw.ki_flags &= ~IOCB_NOWAIT;
- /* Ensure we clear previously set non-block flag */ - req->rw.ki_flags &= ~IOCB_NOWAIT; + if (work->flags & IO_WQ_WORK_CANCEL) + ret = -ECANCELED;
- ret = 0; - if (io_sqe_needs_user(sqe) && !cur_mm) { - if (!mmget_not_zero(ctx->sqo_mm)) { - ret = -EFAULT; - } else { - cur_mm = ctx->sqo_mm; - use_mm(cur_mm); - old_fs = get_fs(); - set_fs(USER_DS); - } - } + if (!ret) { + s->has_user = (work->flags & IO_WQ_WORK_HAS_MM) != 0; + s->in_async = true; + do { + ret = __io_submit_sqe(ctx, req, s, &nxt, false); + /* + * We can get EAGAIN for polled IO even though we're + * forcing a sync submission from here, since we can't + * wait for request slots on the block side. + */ + if (ret != -EAGAIN) + break; + cond_resched(); + } while (1); + }
- if (!ret) { - s->has_user = cur_mm != NULL; - s->in_async = true; - do { - ret = __io_submit_sqe(ctx, req, s, &nxt, false); - /* - * We can get EAGAIN for polled IO even though - * we're forcing a sync submission from here, - * since we can't wait for request slots on the - * block side. - */ - if (ret != -EAGAIN) - break; - cond_resched(); - } while (1); - } + /* drop submission reference */ + io_put_req(req, NULL);
- /* drop submission reference */ + if (ret) { + io_cqring_add_event(ctx, sqe->user_data, ret); io_put_req(req, NULL); - - if (ret) { - io_cqring_add_event(ctx, sqe->user_data, ret); - io_put_req(req, NULL); - } - - /* async context always use a copy of the sqe */ - kfree(sqe); - - /* if a dependent link is ready, do that as the next one */ - if (!ret && nxt) { - req = nxt; - continue; - } - - /* req from defer and link list needn't decrease async cnt */ - if (flags & (REQ_F_IO_DRAINED | REQ_F_LINK_DONE)) - goto out; - - if (!async_list) - break; - if (!list_empty(&req_list)) { - req = list_first_entry(&req_list, struct io_kiocb, - list); - list_del(&req->list); - continue; - } - if (list_empty(&async_list->list)) - break; - - req = NULL; - spin_lock(&async_list->lock); - if (list_empty(&async_list->list)) { - spin_unlock(&async_list->lock); - break; - } - list_splice_init(&async_list->list, &req_list); - spin_unlock(&async_list->lock); - - req = list_first_entry(&req_list, struct io_kiocb, list); - list_del(&req->list); - } while (req); - - /* - * Rare case of racing with a submitter. If we find the count has - * dropped to zero AND we have pending work items, then restart - * the processing. This is a tiny race window. - */ - if (async_list) { - ret = atomic_dec_return(&async_list->cnt); - while (!ret && !list_empty(&async_list->list)) { - spin_lock(&async_list->lock); - atomic_inc(&async_list->cnt); - list_splice_init(&async_list->list, &req_list); - spin_unlock(&async_list->lock); - - if (!list_empty(&req_list)) { - req = list_first_entry(&req_list, - struct io_kiocb, list); - list_del(&req->list); - goto restart; - } - ret = atomic_dec_return(&async_list->cnt); - } - } - -out: - if (cur_mm) { - set_fs(old_fs); - unuse_mm(cur_mm); - mmput(cur_mm); } -} - -/* - * See if we can piggy back onto previously submitted work, that is still - * running. We currently only allow this if the new request is sequential - * to the previous one we punted. - */ -static bool io_add_to_prev_work(struct async_list *list, struct io_kiocb *req) -{ - bool ret;
- if (!list) - return false; - if (!(req->flags & REQ_F_SEQ_PREV)) - return false; - if (!atomic_read(&list->cnt)) - return false; + /* async context always use a copy of the sqe */ + kfree(sqe);
- ret = true; - spin_lock(&list->lock); - list_add_tail(&req->list, &list->list); - /* - * Ensure we see a simultaneous modification from io_sq_wq_submit_work() - */ - smp_mb(); - if (!atomic_read(&list->cnt)) { - list_del_init(&req->list); - ret = false; + /* if a dependent link is ready, pass it back */ + if (!ret && nxt) { + io_prep_async_work(nxt); + *workptr = &nxt->work; } - spin_unlock(&list->lock); - - trace_io_uring_add_to_prev(req, ret); - return ret; }
static bool io_op_needs_file(const struct io_uring_sqe *sqe) @@ -2487,17 +2292,9 @@ static int __io_queue_sqe(struct io_ring_ctx *ctx, struct io_kiocb *req,
sqe_copy = kmemdup(s->sqe, sizeof(*sqe_copy), GFP_KERNEL); if (sqe_copy) { - struct async_list *list; - s->sqe = sqe_copy; memcpy(&req->submit, s, sizeof(*s)); - list = io_async_list_from_sqe(ctx, s->sqe); - if (!io_add_to_prev_work(list, req)) { - if (list) - atomic_inc(&list->cnt); - INIT_WORK(&req->work, io_sq_wq_submit_work); - io_queue_async_work(ctx, req); - } + io_queue_async_work(ctx, req);
/* * Queued up for async execution, worker will release @@ -3108,15 +2905,11 @@ static void io_sq_thread_stop(struct io_ring_ctx *ctx)
static void io_finish_async(struct io_ring_ctx *ctx) { - int i; - io_sq_thread_stop(ctx);
- for (i = 0; i < ARRAY_SIZE(ctx->sqo_wq); i++) { - if (ctx->sqo_wq[i]) { - destroy_workqueue(ctx->sqo_wq[i]); - ctx->sqo_wq[i] = NULL; - } + if (ctx->io_wq) { + io_wq_destroy(ctx->io_wq); + ctx->io_wq = NULL; } }
@@ -3124,11 +2917,9 @@ static void io_finish_async(struct io_ring_ctx *ctx) static void io_destruct_skb(struct sk_buff *skb) { struct io_ring_ctx *ctx = skb->sk->sk_user_data; - int i;
- for (i = 0; i < ARRAY_SIZE(ctx->sqo_wq); i++) - if (ctx->sqo_wq[i]) - flush_workqueue(ctx->sqo_wq[i]); + if (ctx->io_wq) + io_wq_flush(ctx->io_wq);
unix_destruct_scm(skb); } @@ -3472,6 +3263,7 @@ static int io_sqe_files_update(struct io_ring_ctx *ctx, void __user *arg, static int io_sq_offload_start(struct io_ring_ctx *ctx, struct io_uring_params *p) { + unsigned concurrency; int ret;
init_waitqueue_head(&ctx->sqo_wait); @@ -3515,25 +3307,10 @@ static int io_sq_offload_start(struct io_ring_ctx *ctx, goto err; }
- /* Do QD, or 2 * CPUS, whatever is smallest */ - ctx->sqo_wq[0] = alloc_workqueue("io_ring-wq", - WQ_UNBOUND | WQ_FREEZABLE, - min(ctx->sq_entries - 1, 2 * num_online_cpus())); - if (!ctx->sqo_wq[0]) { - ret = -ENOMEM; - goto err; - } - - /* - * This is for buffered writes, where we want to limit the parallelism - * due to file locking in file systems. As "normal" buffered writes - * should parellelize on writeout quite nicely, limit us to having 2 - * pending. This avoids massive contention on the inode when doing - * buffered async writes. - */ - ctx->sqo_wq[1] = alloc_workqueue("io_ring-write-wq", - WQ_UNBOUND | WQ_FREEZABLE, 2); - if (!ctx->sqo_wq[1]) { + /* Do QD, or 4 * CPUS, whatever is smallest */ + concurrency = min(ctx->sq_entries, 4 * num_online_cpus()); + ctx->io_wq = io_wq_create(concurrency, ctx->sqo_mm); + if (!ctx->io_wq) { ret = -ENOMEM; goto err; } @@ -3919,6 +3696,10 @@ static void io_ring_ctx_wait_and_kill(struct io_ring_ctx *ctx)
io_kill_timeouts(ctx); io_poll_remove_all(ctx); + + if (ctx->io_wq) + io_wq_cancel_all(ctx->io_wq); + io_iopoll_reap_events(ctx); wait_for_completion(&ctx->ctx_done); io_ring_ctx_free(ctx); diff --git a/include/trace/events/io_uring.h b/include/trace/events/io_uring.h index c5a905fbf1da..b85255121b98 100644 --- a/include/trace/events/io_uring.h +++ b/include/trace/events/io_uring.h @@ -7,6 +7,8 @@
#include <linux/tracepoint.h>
+struct io_wq_work; + /** * io_uring_create - called after a new io_uring context was prepared * @@ -126,15 +128,15 @@ TRACE_EVENT(io_uring_file_get, * io_uring_queue_async_work - called before submitting a new async work * * @ctx: pointer to a ring context structure - * @rw: type of workqueue, normal or buffered writes + * @hashed: type of workqueue, hashed or normal * @req: pointer to a submitted request - * @work: pointer to a submitted work_struct + * @work: pointer to a submitted io_wq_work * * Allows to trace asynchronous work submission. */ TRACE_EVENT(io_uring_queue_async_work,
- TP_PROTO(void *ctx, int rw, void * req, struct work_struct *work, + TP_PROTO(void *ctx, int rw, void * req, struct io_wq_work *work, unsigned int flags),
TP_ARGS(ctx, rw, req, work, flags), @@ -143,7 +145,7 @@ TRACE_EVENT(io_uring_queue_async_work, __field( void *, ctx ) __field( int, rw ) __field( void *, req ) - __field( struct work_struct *, work ) + __field( struct io_wq_work *, work ) __field( unsigned int, flags ) ),
@@ -157,7 +159,7 @@ TRACE_EVENT(io_uring_queue_async_work,
TP_printk("ring %p, request %p, flags %d, %s queue, work %p", __entry->ctx, __entry->req, __entry->flags, - __entry->rw ? "buffered" : "normal", __entry->work) + __entry->rw ? "hashed" : "normal", __entry->work) );
/** diff --git a/init/Kconfig b/init/Kconfig index 850061203eeb..1386cf410c6a 100644 --- a/init/Kconfig +++ b/init/Kconfig @@ -1423,6 +1423,7 @@ config AIO config IO_URING bool "Enable IO uring support" if EXPERT select ANON_INODES + select IO_WQ default y help This option enables support for the io_uring interface, enabling
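As a rough sketch of what the conversion buys (not verbatim kernel code; INIT_IO_WORK is shown in the io-wq header hunk above, and the worker entry point and enqueue call are the names this series uses), punting a request now reduces to initializing its embedded work item and queueing it on the per-ring io-wq:

    /* async punt after the io-wq conversion */
    INIT_IO_WORK(&req->work, io_wq_submit_work); /* worker invokes func */
    io_wq_enqueue(ctx->io_wq, &req->work);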
From: Jens Axboe <axboe@kernel.dk>
mainline inclusion from mainline-5.5-rc1 commit fcb323cc53e29d9cc696d606bb42736b32dd9825 category: feature bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27 CVE: NA ---------------------------
This is in preparation for adding opcodes that need to add new files to a process file table, such as the open(2) or accept4(2) system calls.
If an opcode needs this, it must set IO_WQ_WORK_NEEDS_FILES in the work item. If work that needs to get punted to async context has this set, the async worker will assume the original task's file table before executing the work.
Note that opcodes that need access to the current files of an application cannot be done through IORING_SETUP_SQPOLL.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Zhihao Cheng <chengzhihao1@huawei.com>
Signed-off-by: yangerkun <yangerkun@huawei.com>
Reviewed-by: zhangyi (F) <yi.zhang@huawei.com>
Signed-off-by: Cheng Jian <cj.chengjian@huawei.com>
---
 fs/io-wq.c    |  30 +++++++++++--
 fs/io-wq.h    |   3 ++
 fs/io_uring.c | 116 ++++++++++++++++++++++++++++++++++++++++++++++++--
 3 files changed, 141 insertions(+), 8 deletions(-)
diff --git a/fs/io-wq.c b/fs/io-wq.c index 88acfd0bf139..3e0f6dfbdcd9 100644 --- a/fs/io-wq.c +++ b/fs/io-wq.c @@ -53,6 +53,7 @@ struct io_worker {
struct rcu_head rcu; struct mm_struct *mm; + struct files_struct *restore_files; };
struct io_wq_nulls_list { @@ -127,22 +128,36 @@ static void io_worker_release(struct io_worker *worker) */ static bool __io_worker_unuse(struct io_wqe *wqe, struct io_worker *worker) { + bool dropped_lock = false; + + if (current->files != worker->restore_files) { + __acquire(&wqe->lock); + spin_unlock_irq(&wqe->lock); + dropped_lock = true; + + task_lock(current); + current->files = worker->restore_files; + task_unlock(current); + } + /* * If we have an active mm, we need to drop the wq lock before unusing * it. If we do, return true and let the caller retry the idle loop. */ if (worker->mm) { - __acquire(&wqe->lock); - spin_unlock_irq(&wqe->lock); + if (!dropped_lock) { + __acquire(&wqe->lock); + spin_unlock_irq(&wqe->lock); + dropped_lock = true; + } __set_current_state(TASK_RUNNING); set_fs(KERNEL_DS); unuse_mm(worker->mm); mmput(worker->mm); worker->mm = NULL; - return true; }
- return false; + return dropped_lock; }
static void io_worker_exit(struct io_worker *worker) @@ -190,6 +205,7 @@ static void io_worker_start(struct io_wqe *wqe, struct io_worker *worker) current->flags |= PF_IO_WORKER;
worker->flags |= (IO_WORKER_F_UP | IO_WORKER_F_RUNNING); + worker->restore_files = current->files; atomic_inc(&wqe->nr_running); }
@@ -292,6 +308,12 @@ static void io_worker_handle_work(struct io_worker *worker) if (!work) break; next: + if ((work->flags & IO_WQ_WORK_NEEDS_FILES) && + current->files != work->files) { + task_lock(current); + current->files = work->files; + task_unlock(current); + } if ((work->flags & IO_WQ_WORK_NEEDS_USER) && !worker->mm && wq->mm && mmget_not_zero(wq->mm)) { use_mm(wq->mm); diff --git a/fs/io-wq.h b/fs/io-wq.h index be8f22c8937b..e93f764b1fa4 100644 --- a/fs/io-wq.h +++ b/fs/io-wq.h @@ -8,6 +8,7 @@ enum { IO_WQ_WORK_HAS_MM = 2, IO_WQ_WORK_HASHED = 4, IO_WQ_WORK_NEEDS_USER = 8, + IO_WQ_WORK_NEEDS_FILES = 16,
IO_WQ_HASH_SHIFT = 24, /* upper 8 bits are used for hash key */ }; @@ -22,12 +23,14 @@ struct io_wq_work { struct list_head list; void (*func)(struct io_wq_work **); unsigned flags; + struct files_struct *files; };
#define INIT_IO_WORK(work, _func) \ do { \ (work)->func = _func; \ (work)->flags = 0; \ + (work)->files = NULL; \ } while (0) \
struct io_wq *io_wq_create(unsigned concurrency, struct mm_struct *mm); diff --git a/fs/io_uring.c b/fs/io_uring.c index facf3caec6d0..7cf028b1de47 100644 --- a/fs/io_uring.c +++ b/fs/io_uring.c @@ -196,6 +196,8 @@ struct io_ring_ctx {
struct list_head defer_list; struct list_head timeout_list; + + wait_queue_head_t inflight_wait; } ____cacheline_aligned_in_smp;
/* IO offload */ @@ -250,6 +252,9 @@ struct io_ring_ctx { */ struct list_head poll_list; struct list_head cancel_list; + + spinlock_t inflight_lock; + struct list_head inflight_list; } ____cacheline_aligned_in_smp;
#if defined(CONFIG_UNIX) @@ -259,6 +264,8 @@ struct io_ring_ctx {
struct sqe_submit { const struct io_uring_sqe *sqe; + struct file *ring_file; + int ring_fd; u32 sequence; bool has_user; bool in_async; @@ -317,10 +324,13 @@ struct io_kiocb { #define REQ_F_TIMEOUT 1024 /* timeout request */ #define REQ_F_ISREG 2048 /* regular file */ #define REQ_F_MUST_PUNT 4096 /* must be punted even for NONBLOCK */ +#define REQ_F_INFLIGHT 8192 /* on inflight list */ u64 user_data; u32 result; u32 sequence;
+ struct list_head inflight_entry; + struct io_wq_work work; };
@@ -400,6 +410,9 @@ static struct io_ring_ctx *io_ring_ctx_alloc(struct io_uring_params *p) INIT_LIST_HEAD(&ctx->cancel_list); INIT_LIST_HEAD(&ctx->defer_list); INIT_LIST_HEAD(&ctx->timeout_list); + init_waitqueue_head(&ctx->inflight_wait); + spin_lock_init(&ctx->inflight_lock); + INIT_LIST_HEAD(&ctx->inflight_list); return ctx; }
@@ -669,9 +682,20 @@ static void io_free_req_many(struct io_ring_ctx *ctx, void **reqs, int *nr)
static void __io_free_req(struct io_kiocb *req) { + struct io_ring_ctx *ctx = req->ctx; + if (req->file && !(req->flags & REQ_F_FIXED_FILE)) fput(req->file); - percpu_ref_put(&req->ctx->refs); + if (req->flags & REQ_F_INFLIGHT) { + unsigned long flags; + + spin_lock_irqsave(&ctx->inflight_lock, flags); + list_del(&req->inflight_entry); + if (waitqueue_active(&ctx->inflight_wait)) + wake_up(&ctx->inflight_wait); + spin_unlock_irqrestore(&ctx->inflight_lock, flags); + } + percpu_ref_put(&ctx->refs); kmem_cache_free(req_cachep, req); }
@@ -2275,6 +2299,30 @@ static int io_req_set_file(struct io_ring_ctx *ctx, const struct sqe_submit *s, return 0; }
+static int io_grab_files(struct io_ring_ctx *ctx, struct io_kiocb *req) +{ + int ret = -EBADF; + + rcu_read_lock(); + spin_lock_irq(&ctx->inflight_lock); + /* + * We use the f_ops->flush() handler to ensure that we can flush + * out work accessing these files if the fd is closed. Check if + * the fd has changed since we started down this path, and disallow + * this operation if it has. + */ + if (fcheck(req->submit.ring_fd) == req->submit.ring_file) { + list_add(&req->inflight_entry, &ctx->inflight_list); + req->flags |= REQ_F_INFLIGHT; + req->work.files = current->files; + ret = 0; + } + spin_unlock_irq(&ctx->inflight_lock); + rcu_read_unlock(); + + return ret; +} + static int __io_queue_sqe(struct io_ring_ctx *ctx, struct io_kiocb *req, struct sqe_submit *s) { @@ -2294,17 +2342,25 @@ static int __io_queue_sqe(struct io_ring_ctx *ctx, struct io_kiocb *req, if (sqe_copy) { s->sqe = sqe_copy; memcpy(&req->submit, s, sizeof(*s)); - io_queue_async_work(ctx, req); + if (req->work.flags & IO_WQ_WORK_NEEDS_FILES) { + ret = io_grab_files(ctx, req); + if (ret) { + kfree(sqe_copy); + goto err; + } + }
/* * Queued up for async execution, worker will release * submit reference when the iocb is actually submitted. */ + io_queue_async_work(ctx, req); return 0; } }
/* drop submission reference */ +err: io_put_req(req, NULL);
/* and drop final reference, if we failed */ @@ -2508,6 +2564,7 @@ static bool io_get_sqring(struct io_ring_ctx *ctx, struct sqe_submit *s)
head = READ_ONCE(sq_array[head & ctx->sq_mask]); if (head < ctx->sq_entries) { + s->ring_file = NULL; s->sqe = &ctx->sq_sqes[head]; s->sequence = ctx->cached_sq_head; ctx->cached_sq_head++; @@ -2707,7 +2764,8 @@ static int io_sq_thread(void *data) return 0; }
-static int io_ring_submit(struct io_ring_ctx *ctx, unsigned int to_submit) +static int io_ring_submit(struct io_ring_ctx *ctx, unsigned int to_submit, + struct file *ring_file, int ring_fd) { struct io_submit_state state, *statep = NULL; struct io_kiocb *link = NULL; @@ -2749,9 +2807,11 @@ static int io_ring_submit(struct io_ring_ctx *ctx, unsigned int to_submit) }
out: + s.ring_file = ring_file; s.has_user = true; s.in_async = false; s.needs_fixed_file = false; + s.ring_fd = ring_fd; submit++; trace_io_uring_submit_sqe(ctx, true, false); io_submit_sqe(ctx, &s, statep, &link); @@ -3714,6 +3774,53 @@ static int io_uring_release(struct inode *inode, struct file *file) return 0; }
+static void io_uring_cancel_files(struct io_ring_ctx *ctx, + struct files_struct *files) +{ + struct io_kiocb *req; + DEFINE_WAIT(wait); + + while (!list_empty_careful(&ctx->inflight_list)) { + enum io_wq_cancel ret = IO_WQ_CANCEL_NOTFOUND; + + spin_lock_irq(&ctx->inflight_lock); + list_for_each_entry(req, &ctx->inflight_list, inflight_entry) { + if (req->work.files == files) { + ret = io_wq_cancel_work(ctx->io_wq, &req->work); + break; + } + } + if (ret == IO_WQ_CANCEL_RUNNING) + prepare_to_wait(&ctx->inflight_wait, &wait, + TASK_UNINTERRUPTIBLE); + + spin_unlock_irq(&ctx->inflight_lock); + + /* + * We need to keep going until we get NOTFOUND. We only cancel + * one work at the time. + * + * If we get CANCEL_RUNNING, then wait for a work to complete + * before continuing. + */ + if (ret == IO_WQ_CANCEL_OK) + continue; + else if (ret != IO_WQ_CANCEL_RUNNING) + break; + schedule(); + } +} + +static int io_uring_flush(struct file *file, void *data) +{ + struct io_ring_ctx *ctx = file->private_data; + + io_uring_cancel_files(ctx, data); + if (fatal_signal_pending(current) || (current->flags & PF_EXITING)) + io_wq_cancel_all(ctx->io_wq); + return 0; +} + static int io_uring_mmap(struct file *file, struct vm_area_struct *vma) { loff_t offset = (loff_t) vma->vm_pgoff << PAGE_SHIFT; @@ -3782,7 +3889,7 @@ SYSCALL_DEFINE6(io_uring_enter, unsigned int, fd, u32, to_submit, to_submit = min(to_submit, ctx->sq_entries);
mutex_lock(&ctx->uring_lock); - submitted = io_ring_submit(ctx, to_submit); + submitted = io_ring_submit(ctx, to_submit, f.file, fd); mutex_unlock(&ctx->uring_lock); } if (flags & IORING_ENTER_GETEVENTS) { @@ -3805,6 +3912,7 @@ SYSCALL_DEFINE6(io_uring_enter, unsigned int, fd, u32, to_submit,
static const struct file_operations io_uring_fops = { .release = io_uring_release, + .flush = io_uring_flush, .mmap = io_uring_mmap, .poll = io_uring_poll, .fasync = io_uring_fasync,
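The consumer side of the new flag is small. A fragment from an opcode handler (this exact pattern appears in the accept support added later in this series): if a nonblocking attempt would block and the async worker will need the submitter's file table, mark the work before punting:

    if (ret == -EAGAIN && force_nonblock) {
        /* worker must adopt current->files before running this */
        req->work.flags |= IO_WQ_WORK_NEEDS_FILES;
        return -EAGAIN;
    }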
From: Jens Axboe <axboe@kernel.dk>
mainline inclusion from mainline-5.5-rc1 commit de2ea4b64b75a79ed9cdf9bf30e0e197901084e4 category: feature bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27 CVE: NA ---------------------------
This is identical to __sys_accept4(), except it takes a struct file instead of an fd, and it also allows passing in extra flags to be OR'ed into file->f_flags. The latter is done to support masking in O_NONBLOCK without manipulating the original file flags.
No functional changes in this patch.
Cc: netdev@vger.kernel.org
Acked-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Zhihao Cheng <chengzhihao1@huawei.com>
[yek: Add declaration of struct file in include/linux/socket.h for patch 7999096fa9cf ("iov_iter: Move unnecessary inclusion of crypto/hash.h") is not applied.]
Signed-off-by: yangerkun <yangerkun@huawei.com>
Reviewed-by: zhangyi (F) <yi.zhang@huawei.com>
Signed-off-by: Cheng Jian <cj.chengjian@huawei.com>
---
 include/linux/socket.h |  4 +++
 net/socket.c           | 65 ++++++++++++++++++++++++++----------------
 2 files changed, 45 insertions(+), 24 deletions(-)
diff --git a/include/linux/socket.h b/include/linux/socket.h index 70d2578085cf..b5f99ade825d 100644 --- a/include/linux/socket.h +++ b/include/linux/socket.h @@ -10,6 +10,7 @@ #include <linux/compiler.h> /* __user */ #include <uapi/linux/socket.h>
+struct file; struct pid; struct cred; struct socket; @@ -377,6 +378,9 @@ extern int __sys_recvfrom(int fd, void __user *ubuf, size_t size, extern int __sys_sendto(int fd, void __user *buff, size_t len, unsigned int flags, struct sockaddr __user *addr, int addr_len); +extern int __sys_accept4_file(struct file *file, unsigned file_flags, + struct sockaddr __user *upeer_sockaddr, + int __user *upeer_addrlen, int flags); extern int __sys_accept4(int fd, struct sockaddr __user *upeer_sockaddr, int __user *upeer_addrlen, int flags); extern int __sys_socket(int family, int type, int protocol); diff --git a/net/socket.c b/net/socket.c index b3a9bf2622b3..5ef7a4fc17d2 100644 --- a/net/socket.c +++ b/net/socket.c @@ -1525,24 +1525,13 @@ SYSCALL_DEFINE2(listen, int, fd, int, backlog) return __sys_listen(fd, backlog); }
-/* - * For accept, we attempt to create a new socket, set up the link - * with the client, wake up the client, then return the new - * connected fd. We collect the address of the connector in kernel - * space and move it to user at the very end. This is unclean because - * we open the socket then return an error. - * - * 1003.1g adds the ability to recvmsg() to query connection pending - * status to recvmsg. We need to add that support in a way thats - * clean when we restructure accept also. - */ - -int __sys_accept4(int fd, struct sockaddr __user *upeer_sockaddr, - int __user *upeer_addrlen, int flags) +int __sys_accept4_file(struct file *file, unsigned file_flags, + struct sockaddr __user *upeer_sockaddr, + int __user *upeer_addrlen, int flags) { struct socket *sock, *newsock; struct file *newfile; - int err, len, newfd, fput_needed; + int err, len, newfd; struct sockaddr_storage address;
if (flags & ~(SOCK_CLOEXEC | SOCK_NONBLOCK)) @@ -1551,14 +1540,14 @@ int __sys_accept4(int fd, struct sockaddr __user *upeer_sockaddr, if (SOCK_NONBLOCK != O_NONBLOCK && (flags & SOCK_NONBLOCK)) flags = (flags & ~SOCK_NONBLOCK) | O_NONBLOCK;
- sock = sockfd_lookup_light(fd, &err, &fput_needed); + sock = sock_from_file(file, &err); if (!sock) goto out;
err = -ENFILE; newsock = sock_alloc(); if (!newsock) - goto out_put; + goto out;
newsock->type = sock->type; newsock->ops = sock->ops; @@ -1573,20 +1562,21 @@ int __sys_accept4(int fd, struct sockaddr __user *upeer_sockaddr, if (unlikely(newfd < 0)) { err = newfd; sock_release(newsock); - goto out_put; + goto out; } newfile = sock_alloc_file(newsock, flags, sock->sk->sk_prot_creator->name); if (IS_ERR(newfile)) { err = PTR_ERR(newfile); put_unused_fd(newfd); - goto out_put; + goto out; }
err = security_socket_accept(sock, newsock); if (err) goto out_fd;
- err = sock->ops->accept(sock, newsock, sock->file->f_flags, false); + err = sock->ops->accept(sock, newsock, sock->file->f_flags | file_flags, + false); if (err < 0) goto out_fd;
@@ -1607,15 +1597,42 @@ int __sys_accept4(int fd, struct sockaddr __user *upeer_sockaddr,
fd_install(newfd, newfile); err = newfd; - -out_put: - fput_light(sock->file, fput_needed); out: return err; out_fd: fput(newfile); put_unused_fd(newfd); - goto out_put; + goto out; + +} + +/* + * For accept, we attempt to create a new socket, set up the link + * with the client, wake up the client, then return the new + * connected fd. We collect the address of the connector in kernel + * space and move it to user at the very end. This is unclean because + * we open the socket then return an error. + * + * 1003.1g adds the ability to recvmsg() to query connection pending + * status to recvmsg. We need to add that support in a way thats + * clean when we restructure accept also. + */ + +int __sys_accept4(int fd, struct sockaddr __user *upeer_sockaddr, + int __user *upeer_addrlen, int flags) +{ + int ret = -EBADF; + struct fd f; + + f = fdget(fd); + if (f.file) { + ret = __sys_accept4_file(f.file, 0, upeer_sockaddr, + upeer_addrlen, flags); + if (f.flags) + fput(f.file); + } + + return ret; }
SYSCALL_DEFINE4(accept4, int, fd, struct sockaddr __user *, upeer_sockaddr,
From: Jens Axboe <axboe@kernel.dk>
mainline inclusion from mainline-5.5-rc1 commit 17f2fe35d080d8f64e86a60cdcd3a97edcbc213b category: feature bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27 CVE: NA ---------------------------
This allows an application to call accept4() in an async fashion. Like other opcodes, we first try a non-blocking accept, then punt to async context if we have to.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Zhihao Cheng <chengzhihao1@huawei.com>
Signed-off-by: yangerkun <yangerkun@huawei.com>
Reviewed-by: zhangyi (F) <yi.zhang@huawei.com>
Signed-off-by: Cheng Jian <cj.chengjian@huawei.com>
---
 fs/io_uring.c                 | 37 +++++++++++++++++++++++++++++++++++
 include/uapi/linux/io_uring.h |  7 ++++++-
 2 files changed, 43 insertions(+), 1 deletion(-)
diff --git a/fs/io_uring.c b/fs/io_uring.c index 7cf028b1de47..18b42f21aadd 100644 --- a/fs/io_uring.c +++ b/fs/io_uring.c @@ -1685,6 +1685,40 @@ static int io_recvmsg(struct io_kiocb *req, const struct io_uring_sqe *sqe, #endif }
+static int io_accept(struct io_kiocb *req, const struct io_uring_sqe *sqe, + struct io_kiocb **nxt, bool force_nonblock) +{ +#if defined(CONFIG_NET) + struct sockaddr __user *addr; + int __user *addr_len; + unsigned file_flags; + int flags, ret; + + if (unlikely(req->ctx->flags & (IORING_SETUP_IOPOLL|IORING_SETUP_SQPOLL))) + return -EINVAL; + if (sqe->ioprio || sqe->off || sqe->len || sqe->buf_index) + return -EINVAL; + + addr = (struct sockaddr __user *) (unsigned long) READ_ONCE(sqe->addr); + addr_len = (int __user *) (unsigned long) READ_ONCE(sqe->addr2); + flags = READ_ONCE(sqe->accept_flags); + file_flags = force_nonblock ? O_NONBLOCK : 0; + + ret = __sys_accept4_file(req->file, file_flags, addr, addr_len, flags); + if (ret == -EAGAIN && force_nonblock) { + req->work.flags |= IO_WQ_WORK_NEEDS_FILES; + return -EAGAIN; + } + if (ret < 0 && (req->flags & REQ_F_LINK)) + req->flags |= REQ_F_FAIL_LINK; + io_cqring_add_event(req->ctx, sqe->user_data, ret); + io_put_req(req, nxt); + return 0; +#else + return -EOPNOTSUPP; +#endif +} + static void io_poll_remove_one(struct io_kiocb *req) { struct io_poll_iocb *poll = &req->poll; @@ -2172,6 +2206,9 @@ static int __io_submit_sqe(struct io_ring_ctx *ctx, struct io_kiocb *req, case IORING_OP_TIMEOUT_REMOVE: ret = io_timeout_remove(req, s->sqe); break; + case IORING_OP_ACCEPT: + ret = io_accept(req, s->sqe, nxt, force_nonblock); + break; default: ret = -EINVAL; break; diff --git a/include/uapi/linux/io_uring.h b/include/uapi/linux/io_uring.h index 6dc5ced1c37a..f82d90e617a6 100644 --- a/include/uapi/linux/io_uring.h +++ b/include/uapi/linux/io_uring.h @@ -19,7 +19,10 @@ struct io_uring_sqe { __u8 flags; /* IOSQE_ flags */ __u16 ioprio; /* ioprio for the request */ __s32 fd; /* file descriptor to do IO on */ - __u64 off; /* offset into file */ + union { + __u64 off; /* offset into file */ + __u64 addr2; + }; __u64 addr; /* pointer to buffer or iovecs */ __u32 len; /* buffer size or number of iovecs */ union { @@ -29,6 +32,7 @@ struct io_uring_sqe { __u32 sync_range_flags; __u32 msg_flags; __u32 timeout_flags; + __u32 accept_flags; }; __u64 user_data; /* data to be passed back at completion time */ union { @@ -65,6 +69,7 @@ struct io_uring_sqe { #define IORING_OP_RECVMSG 10 #define IORING_OP_TIMEOUT 11 #define IORING_OP_TIMEOUT_REMOVE 12 +#define IORING_OP_ACCEPT 13
/* * sqe->fsync_flags
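For illustration, a minimal userspace sketch of driving the new opcode through liburing (io_uring_prep_accept() is liburing's helper for this ABI; listen_fd and the lack of error handling are simplifications):

    #include <liburing.h>
    #include <netinet/in.h>

    static int accept_via_ring(struct io_uring *ring, int listen_fd)
    {
        struct sockaddr_in peer;
        socklen_t peer_len = sizeof(peer);
        struct io_uring_sqe *sqe = io_uring_get_sqe(ring);
        struct io_uring_cqe *cqe;
        int res;

        /* fd in sqe->fd, addr in sqe->addr, addrlen pointer in sqe->addr2 */
        io_uring_prep_accept(sqe, listen_fd, (struct sockaddr *) &peer,
                             &peer_len, 0);
        sqe->user_data = 42;

        io_uring_submit(ring);
        io_uring_wait_cqe(ring, &cqe);
        res = cqe->res;     /* new socket fd on success, -errno on failure */
        io_uring_cqe_seen(ring, cqe);
        return res;
    }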
From: Jens Axboe <axboe@kernel.dk>
mainline inclusion from mainline-5.5-rc1 commit b7620121dc04e44ce654297050f9eaf39d414a34 category: feature bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27 CVE: NA ---------------------------
We index the file tables with a user-given value. After we check that it's within our limits, use array_index_nospec() to prevent any Spectre attacks here.
Suggested-by: Jann Horn <jannh@google.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Zhihao Cheng <chengzhihao1@huawei.com>
Signed-off-by: yangerkun <yangerkun@huawei.com>
Reviewed-by: zhangyi (F) <yi.zhang@huawei.com>
Signed-off-by: Cheng Jian <cj.chengjian@huawei.com>
---
 fs/io_uring.c | 1 +
 1 file changed, 1 insertion(+)
diff --git a/fs/io_uring.c b/fs/io_uring.c index 18b42f21aadd..22e66c2dd904 100644 --- a/fs/io_uring.c +++ b/fs/io_uring.c @@ -2320,6 +2320,7 @@ static int io_req_set_file(struct io_ring_ctx *ctx, const struct sqe_submit *s, if (unlikely(!ctx->user_files || (unsigned) fd >= ctx->nr_user_files)) return -EBADF; + fd = array_index_nospec(fd, ctx->nr_user_files); if (!ctx->user_files[fd]) return -EBADF; req->file = ctx->user_files[fd];
From: Jens Axboe <axboe@kernel.dk>
mainline inclusion from mainline-5.5-rc1 commit 65e19f54d29cd8559ce60cfd0d751bef7afbdc5c category: feature bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27 CVE: NA ---------------------------
There have been a few requests for supporting more fixed files than 1024. This isn't really tricky to do; we just need to split up the file table into multiple tables and index appropriately. As we do so, reduce the max single file table to 512 entries. This enables us to always do single-page allocations for the tables, which is an improvement over the prior situation.
This patch adds support for up to 64K files, which should be enough for everyone.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Zhihao Cheng <chengzhihao1@huawei.com>
Signed-off-by: yangerkun <yangerkun@huawei.com>
Reviewed-by: zhangyi (F) <yi.zhang@huawei.com>
Signed-off-by: Cheng Jian <cj.chengjian@huawei.com>
---
 fs/io_uring.c | 150 +++++++++++++++++++++++++++++++++++++++-----------
 1 file changed, 117 insertions(+), 33 deletions(-)
diff --git a/fs/io_uring.c b/fs/io_uring.c index 22e66c2dd904..994f4762bbe9 100644 --- a/fs/io_uring.c +++ b/fs/io_uring.c @@ -80,7 +80,14 @@
#define IORING_MAX_ENTRIES 32768 #define IORING_MAX_CQ_ENTRIES (2 * IORING_MAX_ENTRIES) -#define IORING_MAX_FIXED_FILES 1024 + +/* + * Shift of 9 is 512 entries, or exactly one page on 64-bit archs + */ +#define IORING_FILE_TABLE_SHIFT 9 +#define IORING_MAX_FILES_TABLE (1U << IORING_FILE_TABLE_SHIFT) +#define IORING_FILE_TABLE_MASK (IORING_MAX_FILES_TABLE - 1) +#define IORING_MAX_FIXED_FILES (64 * IORING_MAX_FILES_TABLE)
struct io_uring { u32 head ____cacheline_aligned_in_smp; @@ -165,6 +172,10 @@ struct io_mapped_ubuf { unsigned int nr_bvecs; };
+struct fixed_file_table { + struct file **files; +}; + struct io_ring_ctx { struct { struct percpu_ref refs; @@ -225,7 +236,7 @@ struct io_ring_ctx { * readers must ensure that ->refs is alive as long as the file* is * used. Only updated through io_uring_register(2). */ - struct file **user_files; + struct fixed_file_table *file_table; unsigned nr_user_files;
/* if used, fixed mapped user buffers */ @@ -2295,6 +2306,15 @@ static bool io_op_needs_file(const struct io_uring_sqe *sqe) } }
+static inline struct file *io_file_from_index(struct io_ring_ctx *ctx, + int index) +{ + struct fixed_file_table *table; + + table = &ctx->file_table[index >> IORING_FILE_TABLE_SHIFT]; + return table->files[index & IORING_FILE_TABLE_MASK]; +} + static int io_req_set_file(struct io_ring_ctx *ctx, const struct sqe_submit *s, struct io_submit_state *state, struct io_kiocb *req) { @@ -2317,13 +2337,13 @@ static int io_req_set_file(struct io_ring_ctx *ctx, const struct sqe_submit *s, return 0;
if (flags & IOSQE_FIXED_FILE) { - if (unlikely(!ctx->user_files || + if (unlikely(!ctx->file_table || (unsigned) fd >= ctx->nr_user_files)) return -EBADF; fd = array_index_nospec(fd, ctx->nr_user_files); - if (!ctx->user_files[fd]) + req->file = io_file_from_index(ctx, fd); + if (!req->file) return -EBADF; - req->file = ctx->user_files[fd]; req->flags |= REQ_F_FIXED_FILE; } else { if (s->needs_fixed_file) @@ -2968,20 +2988,29 @@ static void __io_sqe_files_unregister(struct io_ring_ctx *ctx) #else int i;
- for (i = 0; i < ctx->nr_user_files; i++) - if (ctx->user_files[i]) - fput(ctx->user_files[i]); + for (i = 0; i < ctx->nr_user_files; i++) { + struct file *file; + + file = io_file_from_index(ctx, i); + if (file) + fput(file); + } #endif }
static int io_sqe_files_unregister(struct io_ring_ctx *ctx) { - if (!ctx->user_files) + unsigned nr_tables, i; + + if (!ctx->file_table) return -ENXIO;
__io_sqe_files_unregister(ctx); - kfree(ctx->user_files); - ctx->user_files = NULL; + nr_tables = DIV_ROUND_UP(ctx->nr_user_files, IORING_MAX_FILES_TABLE); + for (i = 0; i < nr_tables; i++) + kfree(ctx->file_table[i].files); + kfree(ctx->file_table); + ctx->file_table = NULL; ctx->nr_user_files = 0; return 0; } @@ -3056,9 +3085,11 @@ static int __io_sqe_files_scm(struct io_ring_ctx *ctx, int nr, int offset) nr_files = 0; fpl->user = get_uid(ctx->user); for (i = 0; i < nr; i++) { - if (!ctx->user_files[i + offset]) + struct file *file = io_file_from_index(ctx, i + offset); + + if (!file) continue; - fpl->fp[nr_files] = get_file(ctx->user_files[i + offset]); + fpl->fp[nr_files] = get_file(file); unix_inflight(fpl->user, fpl->fp[nr_files]); nr_files++; } @@ -3107,8 +3138,10 @@ static int io_sqe_files_scm(struct io_ring_ctx *ctx) return 0;
while (total < ctx->nr_user_files) { - if (ctx->user_files[total]) - fput(ctx->user_files[total]); + struct file *file = io_file_from_index(ctx, total); + + if (file) + fput(file); total++; }
@@ -3121,25 +3154,63 @@ static int io_sqe_files_scm(struct io_ring_ctx *ctx) } #endif
+static int io_sqe_alloc_file_tables(struct io_ring_ctx *ctx, unsigned nr_tables, + unsigned nr_files) +{ + int i; + + for (i = 0; i < nr_tables; i++) { + struct fixed_file_table *table = &ctx->file_table[i]; + unsigned this_files; + + this_files = min(nr_files, IORING_MAX_FILES_TABLE); + table->files = kcalloc(this_files, sizeof(struct file *), + GFP_KERNEL); + if (!table->files) + break; + nr_files -= this_files; + } + + if (i == nr_tables) + return 0; + + for (i = 0; i < nr_tables; i++) { + struct fixed_file_table *table = &ctx->file_table[i]; + kfree(table->files); + } + return 1; +} + static int io_sqe_files_register(struct io_ring_ctx *ctx, void __user *arg, unsigned nr_args) { __s32 __user *fds = (__s32 __user *) arg; + unsigned nr_tables; int fd, ret = 0; unsigned i;
- if (ctx->user_files) + if (ctx->file_table) return -EBUSY; if (!nr_args) return -EINVAL; if (nr_args > IORING_MAX_FIXED_FILES) return -EMFILE;
- ctx->user_files = kcalloc(nr_args, sizeof(struct file *), GFP_KERNEL); - if (!ctx->user_files) + nr_tables = DIV_ROUND_UP(nr_args, IORING_MAX_FILES_TABLE); + ctx->file_table = kcalloc(nr_tables, sizeof(struct fixed_file_table), + GFP_KERNEL); + if (!ctx->file_table) return -ENOMEM;
+ if (io_sqe_alloc_file_tables(ctx, nr_tables, nr_args)) { + kfree(ctx->file_table); + return -ENOMEM; + } + for (i = 0; i < nr_args; i++, ctx->nr_user_files++) { + struct fixed_file_table *table; + unsigned index; + ret = -EFAULT; if (copy_from_user(&fd, &fds[i], sizeof(fd))) break; @@ -3149,10 +3220,12 @@ static int io_sqe_files_register(struct io_ring_ctx *ctx, void __user *arg, continue; }
- ctx->user_files[i] = fget(fd); + table = &ctx->file_table[i >> IORING_FILE_TABLE_SHIFT]; + index = i & IORING_FILE_TABLE_MASK; + table->files[index] = fget(fd);
ret = -EBADF; - if (!ctx->user_files[i]) + if (!table->files[index]) break; /* * Don't allow io_uring instances to be registered. If UNIX @@ -3161,20 +3234,26 @@ static int io_sqe_files_register(struct io_ring_ctx *ctx, void __user *arg, * handle it just fine, but there's still no point in allowing * a ring fd as it doesn't support regular read/write anyway. */ - if (ctx->user_files[i]->f_op == &io_uring_fops) { - fput(ctx->user_files[i]); + if (table->files[index]->f_op == &io_uring_fops) { + fput(table->files[index]); break; } ret = 0; }
if (ret) { - for (i = 0; i < ctx->nr_user_files; i++) - if (ctx->user_files[i]) - fput(ctx->user_files[i]); + for (i = 0; i < ctx->nr_user_files; i++) { + struct file *file;
- kfree(ctx->user_files); - ctx->user_files = NULL; + file = io_file_from_index(ctx, i); + if (file) + fput(file); + } + for (i = 0; i < nr_tables; i++) + kfree(ctx->file_table[i].files); + + kfree(ctx->file_table); + ctx->file_table = NULL; ctx->nr_user_files = 0; return ret; } @@ -3189,7 +3268,7 @@ static int io_sqe_files_register(struct io_ring_ctx *ctx, void __user *arg, static void io_sqe_file_unregister(struct io_ring_ctx *ctx, int index) { #if defined(CONFIG_UNIX) - struct file *file = ctx->user_files[index]; + struct file *file = io_file_from_index(ctx, index); struct sock *sock = ctx->ring_sock->sk; struct sk_buff_head list, *head = &sock->sk_receive_queue; struct sk_buff *skb; @@ -3245,7 +3324,7 @@ static void io_sqe_file_unregister(struct io_ring_ctx *ctx, int index) spin_unlock_irq(&head->lock); } #else - fput(ctx->user_files[index]); + fput(io_file_from_index(ctx, index)); #endif }
@@ -3300,7 +3379,7 @@ static int io_sqe_files_update(struct io_ring_ctx *ctx, void __user *arg, int fd, i, err; __u32 done;
- if (!ctx->user_files) + if (!ctx->file_table) return -ENXIO; if (!nr_args) return -EINVAL; @@ -3314,15 +3393,20 @@ static int io_sqe_files_update(struct io_ring_ctx *ctx, void __user *arg, done = 0; fds = (__s32 __user *) up.fds; while (nr_args) { + struct fixed_file_table *table; + unsigned index; + err = 0; if (copy_from_user(&fd, &fds[done], sizeof(fd))) { err = -EFAULT; break; } i = array_index_nospec(up.offset, ctx->nr_user_files); - if (ctx->user_files[i]) { + table = &ctx->file_table[i >> IORING_FILE_TABLE_SHIFT]; + index = i & IORING_FILE_TABLE_MASK; + if (table->files[index]) { io_sqe_file_unregister(ctx, i); - ctx->user_files[i] = NULL; + table->files[index] = NULL; } if (fd != -1) { struct file *file; @@ -3345,7 +3429,7 @@ static int io_sqe_files_update(struct io_ring_ctx *ctx, void __user *arg, err = -EBADF; break; } - ctx->user_files[i] = file; + table->files[index] = file; err = io_sqe_file_register(ctx, file, i); if (err) break;
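The index arithmetic is worth spelling out. A standalone restatement of the two-level lookup, assuming the struct fixed_file_table from the hunk above (mirrors io_file_from_index()):

    #define IORING_FILE_TABLE_SHIFT 9
    #define IORING_FILE_TABLE_MASK  ((1U << IORING_FILE_TABLE_SHIFT) - 1)

    /* e.g. fixed-file index 1234: table 1234 >> 9 == 2, slot 1234 & 511 == 210 */
    static struct file *lookup_fixed_file(struct fixed_file_table *tables,
                                          unsigned index)
    {
        struct fixed_file_table *table = &tables[index >> IORING_FILE_TABLE_SHIFT];

        return table->files[index & IORING_FILE_TABLE_MASK];
    }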
From: Jens Axboe <axboe@kernel.dk>
mainline inclusion from mainline-5.5-rc1 commit 842f96124c5617b060cc0f071dcfb6ab24bdd042 category: feature bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27 CVE: NA ---------------------------
If we get -1 from hrtimer_try_to_cancel(), we know that the timer is running. Hence leave all completion to the timeout handler. If we don't, we can corrupt the list and miss a completion.
Fixes: 11365043e527 ("io_uring: add support for canceling timeout requests")
Reported-by: Hrvoje Zeba <zeba.hrvoje@gmail.com>
Tested-by: Hrvoje Zeba <zeba.hrvoje@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Zhihao Cheng <chengzhihao1@huawei.com>
Signed-off-by: yangerkun <yangerkun@huawei.com>
Reviewed-by: zhangyi (F) <yi.zhang@huawei.com>
Signed-off-by: Cheng Jian <cj.chengjian@huawei.com>
---
 fs/io_uring.c | 20 ++++++++------------
 1 file changed, 8 insertions(+), 12 deletions(-)
diff --git a/fs/io_uring.c b/fs/io_uring.c index 994f4762bbe9..3c76aa56dd7f 100644 --- a/fs/io_uring.c +++ b/fs/io_uring.c @@ -532,7 +532,7 @@ static void io_kill_timeout(struct io_kiocb *req) ret = hrtimer_try_to_cancel(&req->timeout.timer); if (ret != -1) { atomic_inc(&req->ctx->cq_timeouts); - list_del(&req->list); + list_del_init(&req->list); io_cqring_fill_event(req->ctx, req->user_data, 0); __io_free_req(req); } @@ -1956,7 +1956,6 @@ static enum hrtimer_restart io_timeout_fn(struct hrtimer *timer) struct io_ring_ctx *ctx; struct io_kiocb *req; unsigned long flags; - bool comp;
req = container_of(timer, struct io_kiocb, timeout.timer); ctx = req->ctx; @@ -1967,8 +1966,7 @@ static enum hrtimer_restart io_timeout_fn(struct hrtimer *timer) * We could be racing with timeout deletion. If the list is empty, * then timeout lookup already found it and will be handling it. */ - comp = !list_empty(&req->list); - if (comp) { + if (!list_empty(&req->list)) { struct io_kiocb *prev;
/* @@ -1980,17 +1978,15 @@ static enum hrtimer_restart io_timeout_fn(struct hrtimer *timer) prev = req; list_for_each_entry_continue_reverse(prev, &ctx->timeout_list, list) prev->sequence++; - list_del_init(&req->list); - io_cqring_fill_event(ctx, req->user_data, -ETIME); - io_commit_cqring(ctx); } + + io_cqring_fill_event(ctx, req->user_data, -ETIME); + io_commit_cqring(ctx); spin_unlock_irqrestore(&ctx->completion_lock, flags);
- if (comp) { - io_cqring_ev_posted(ctx); - io_put_req(req, NULL); - } + io_cqring_ev_posted(ctx); + io_put_req(req, NULL); return HRTIMER_NORESTART; }
@@ -2130,9 +2126,9 @@ static int io_timeout(struct io_kiocb *req, const struct io_uring_sqe *sqe) } req->sequence -= span; list_add(&req->list, entry); - spin_unlock_irq(&ctx->completion_lock); req->timeout.timer.function = io_timeout_fn; hrtimer_start(&req->timeout.timer, timespec64_to_ktime(ts), mode); + spin_unlock_irq(&ctx->completion_lock); return 0; }
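The fix hinges on hrtimer_try_to_cancel()'s tristate return value; a sketch of the rule being enforced:

    int ret = hrtimer_try_to_cancel(&req->timeout.timer);

    switch (ret) {
    case 0:  /* timer was not active */
    case 1:  /* timer was queued and is now cancelled */
        /* safe to unlink and complete the request right here */
        break;
    case -1: /* callback is executing right now */
        /* hands off: the timeout handler owns removal and completion */
        break;
    }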
From: Jens Axboe <axboe@kernel.dk>
mainline inclusion from mainline-5.5-rc1 commit 975c99a570967dd48e917dd7853867fee3febabd category: feature bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27 CVE: NA ---------------------------
syzbot reported an issue where we crash at setup time if failslab is used. The issue is that io_wq_create() returns an error pointer on failure, not NULL. Hence io_uring thought the io-wq was set up just fine, but in reality it's a garbage error pointer.
Use IS_ERR() instead of a NULL check, and assign ret appropriately.
Reported-by: syzbot+221cc24572a2fed23b6b@syzkaller.appspotmail.com
Fixes: 561fb04a6a22 ("io_uring: replace workqueue usage with io-wq")
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Zhihao Cheng <chengzhihao1@huawei.com>
Signed-off-by: yangerkun <yangerkun@huawei.com>
Reviewed-by: zhangyi (F) <yi.zhang@huawei.com>
Signed-off-by: Cheng Jian <cj.chengjian@huawei.com>
---
 fs/io_uring.c | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)
diff --git a/fs/io_uring.c b/fs/io_uring.c index 3c76aa56dd7f..960690b3d7b1 100644 --- a/fs/io_uring.c +++ b/fs/io_uring.c @@ -3488,8 +3488,9 @@ static int io_sq_offload_start(struct io_ring_ctx *ctx, /* Do QD, or 4 * CPUS, whatever is smallest */ concurrency = min(ctx->sq_entries, 4 * num_online_cpus()); ctx->io_wq = io_wq_create(concurrency, ctx->sqo_mm); - if (!ctx->io_wq) { - ret = -ENOMEM; + if (IS_ERR(ctx->io_wq)) { + ret = PTR_ERR(ctx->io_wq); + ctx->io_wq = NULL; goto err; }
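The underlying convention, for reference: allocators in this style fail with ERR_PTR(-errno), a non-NULL encoded cookie, so the caller must test with IS_ERR() and decode with PTR_ERR(). A minimal sketch:

    struct io_wq *wq = io_wq_create(concurrency, mm);

    if (IS_ERR(wq))         /* a plain NULL check would never fire here */
        return PTR_ERR(wq); /* recovers the encoded -ENOMEM etc. */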
From: Jens Axboe <axboe@kernel.dk>
mainline inclusion from mainline-5.4-rc6 commit 6873e0bd6a9cb14ecfadd89d9ed9698ff1761902 category: feature bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27 CVE: NA ---------------------------
We use io_kiocb->result == -EAGAIN as a way to know if we need to re-submit a polled request, as -EAGAIN reporting happens out-of-line for IO submission failures. This field is cleared when we originally allocate the request, but it isn't reset when we retry the submission from async context. This can cause issues where we think something needs a re-issue, but we're really just reading stale data.
Reset ->result whenever we re-prep a request for polled submission.
Cc: stable@vger.kernel.org
Fixes: 9e645e1105ca ("io_uring: add support for sqe links")
Reported-by: Bijan Mottahedeh <bijan.mottahedeh@oracle.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Zhihao Cheng <chengzhihao1@huawei.com>
Signed-off-by: yangerkun <yangerkun@huawei.com>
Reviewed-by: zhangyi (F) <yi.zhang@huawei.com>
Signed-off-by: Cheng Jian <cj.chengjian@huawei.com>
---
 fs/io_uring.c | 1 +
 1 file changed, 1 insertion(+)
diff --git a/fs/io_uring.c b/fs/io_uring.c index 960690b3d7b1..57241937c9a0 100644 --- a/fs/io_uring.c +++ b/fs/io_uring.c @@ -1211,6 +1211,7 @@ static int io_prep_rw(struct io_kiocb *req, const struct sqe_submit *s,
kiocb->ki_flags |= IOCB_HIPRI; kiocb->ki_complete = io_complete_rw_iopoll; + req->result = 0; } else { if (kiocb->ki_flags & IOCB_HIPRI) return -EINVAL;
From: Jens Axboe <axboe@kernel.dk>
mainline inclusion from mainline-5.5-rc1 commit 62755e35dfb2b113c52b81cd96d01c20971c8e02 category: feature bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27 CVE: NA ---------------------------
This adds support for IORING_OP_ASYNC_CANCEL, which will attempt to cancel requests that have been punted to async context and are now in-flight. This works for regular read/write requests to files, as long as they haven't been started yet. For socket based IO (or things like accept4(2)), we can cancel work that is already running as well.
To cancel a request, the sqe must have ->addr set to the user_data of the request it wishes to cancel. If the request is cancelled successfully, the original request is completed with -ECANCELED and the cancel request is completed with a result of 0. If the request was already running, the original may or may not complete in error. The cancel request will complete with -EALREADY for that case. And finally, if the request to cancel wasn't found, the cancel request is completed with -ENOENT.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Zhihao Cheng <chengzhihao1@huawei.com>
Signed-off-by: yangerkun <yangerkun@huawei.com>
Reviewed-by: zhangyi (F) <yi.zhang@huawei.com>
Signed-off-by: Cheng Jian <cj.chengjian@huawei.com>
---
 fs/io-wq.c                    | 85 +++++++++++++++++++++++++++++++++++
 fs/io-wq.h                    |  5 +++
 fs/io_uring.c                 | 45 +++++++++++++++++++
 include/uapi/linux/io_uring.h |  2 +
 4 files changed, 137 insertions(+)
diff --git a/fs/io-wq.c b/fs/io-wq.c index 3e0f6dfbdcd9..4fab4917938e 100644 --- a/fs/io-wq.c +++ b/fs/io-wq.c @@ -640,6 +640,91 @@ void io_wq_cancel_all(struct io_wq *wq) rcu_read_unlock(); }
+struct io_cb_cancel_data { + struct io_wqe *wqe; + work_cancel_fn *cancel; + void *caller_data; +}; + +static bool io_work_cancel(struct io_worker *worker, void *cancel_data) +{ + struct io_cb_cancel_data *data = cancel_data; + struct io_wqe *wqe = data->wqe; + bool ret = false; + + /* + * Hold the lock to avoid ->cur_work going out of scope, caller + * may deference the passed in work. + */ + spin_lock_irq(&wqe->lock); + if (worker->cur_work && + data->cancel(worker->cur_work, data->caller_data)) { + send_sig(SIGINT, worker->task, 1); + ret = true; + } + spin_unlock_irq(&wqe->lock); + + return ret; +} + +static enum io_wq_cancel io_wqe_cancel_cb_work(struct io_wqe *wqe, + work_cancel_fn *cancel, + void *cancel_data) +{ + struct io_cb_cancel_data data = { + .wqe = wqe, + .cancel = cancel, + .caller_data = cancel_data, + }; + struct io_wq_work *work; + bool found = false; + + spin_lock_irq(&wqe->lock); + list_for_each_entry(work, &wqe->work_list, list) { + if (cancel(work, cancel_data)) { + list_del(&work->list); + found = true; + break; + } + } + spin_unlock_irq(&wqe->lock); + + if (found) { + work->flags |= IO_WQ_WORK_CANCEL; + work->func(&work); + return IO_WQ_CANCEL_OK; + } + + rcu_read_lock(); + found = io_wq_for_each_worker(wqe, &wqe->free_list, io_work_cancel, + &data); + if (found) + goto done; + + found = io_wq_for_each_worker(wqe, &wqe->busy_list, io_work_cancel, + &data); +done: + rcu_read_unlock(); + return found ? IO_WQ_CANCEL_RUNNING : IO_WQ_CANCEL_NOTFOUND; +} + +enum io_wq_cancel io_wq_cancel_cb(struct io_wq *wq, work_cancel_fn *cancel, + void *data) +{ + enum io_wq_cancel ret = IO_WQ_CANCEL_NOTFOUND; + int i; + + for (i = 0; i < wq->nr_wqes; i++) { + struct io_wqe *wqe = wq->wqes[i]; + + ret = io_wqe_cancel_cb_work(wqe, cancel, data); + if (ret != IO_WQ_CANCEL_NOTFOUND) + break; + } + + return ret; +} + static bool io_wq_worker_cancel(struct io_worker *worker, void *data) { struct io_wq_work *work = data; diff --git a/fs/io-wq.h b/fs/io-wq.h index e93f764b1fa4..3de192dc73fc 100644 --- a/fs/io-wq.h +++ b/fs/io-wq.h @@ -43,6 +43,11 @@ void io_wq_flush(struct io_wq *wq); void io_wq_cancel_all(struct io_wq *wq); enum io_wq_cancel io_wq_cancel_work(struct io_wq *wq, struct io_wq_work *cwork);
+typedef bool (work_cancel_fn)(struct io_wq_work *, void *); + +enum io_wq_cancel io_wq_cancel_cb(struct io_wq *wq, work_cancel_fn *cancel, + void *data); + #if defined(CONFIG_IO_WQ) extern void io_wq_worker_sleeping(struct task_struct *); extern void io_wq_worker_running(struct task_struct *); diff --git a/fs/io_uring.c b/fs/io_uring.c index 57241937c9a0..a18d4d47f9df 100644 --- a/fs/io_uring.c +++ b/fs/io_uring.c @@ -2133,6 +2133,48 @@ static int io_timeout(struct io_kiocb *req, const struct io_uring_sqe *sqe) return 0; }
+static bool io_cancel_cb(struct io_wq_work *work, void *data) +{ + struct io_kiocb *req = container_of(work, struct io_kiocb, work); + + return req->user_data == (unsigned long) data; +} + +static int io_async_cancel(struct io_kiocb *req, const struct io_uring_sqe *sqe, + struct io_kiocb **nxt) +{ + struct io_ring_ctx *ctx = req->ctx; + enum io_wq_cancel cancel_ret; + void *sqe_addr; + int ret = 0; + + if (unlikely(ctx->flags & IORING_SETUP_IOPOLL)) + return -EINVAL; + if (sqe->flags || sqe->ioprio || sqe->off || sqe->len || + sqe->cancel_flags) + return -EINVAL; + + sqe_addr = (void *) (unsigned long) READ_ONCE(sqe->addr); + cancel_ret = io_wq_cancel_cb(ctx->io_wq, io_cancel_cb, sqe_addr); + switch (cancel_ret) { + case IO_WQ_CANCEL_OK: + ret = 0; + break; + case IO_WQ_CANCEL_RUNNING: + ret = -EALREADY; + break; + case IO_WQ_CANCEL_NOTFOUND: + ret = -ENOENT; + break; + } + + if (ret < 0 && (req->flags & REQ_F_LINK)) + req->flags |= REQ_F_FAIL_LINK; + io_cqring_add_event(req->ctx, sqe->user_data, ret); + io_put_req(req, nxt); + return 0; +} + static int io_req_defer(struct io_ring_ctx *ctx, struct io_kiocb *req, const struct io_uring_sqe *sqe) { @@ -2217,6 +2259,9 @@ static int __io_submit_sqe(struct io_ring_ctx *ctx, struct io_kiocb *req, case IORING_OP_ACCEPT: ret = io_accept(req, s->sqe, nxt, force_nonblock); break; + case IORING_OP_ASYNC_CANCEL: + ret = io_async_cancel(req, s->sqe, nxt); + break; default: ret = -EINVAL; break; diff --git a/include/uapi/linux/io_uring.h b/include/uapi/linux/io_uring.h index f82d90e617a6..6877cf8894db 100644 --- a/include/uapi/linux/io_uring.h +++ b/include/uapi/linux/io_uring.h @@ -33,6 +33,7 @@ struct io_uring_sqe { __u32 msg_flags; __u32 timeout_flags; __u32 accept_flags; + __u32 cancel_flags; }; __u64 user_data; /* data to be passed back at completion time */ union { @@ -70,6 +71,7 @@ struct io_uring_sqe { #define IORING_OP_TIMEOUT 11 #define IORING_OP_TIMEOUT_REMOVE 12 #define IORING_OP_ACCEPT 13 +#define IORING_OP_ASYNC_CANCEL 14
/* * sqe->fsync_flags
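A userspace sketch of issuing a cancel, filling the SQE by hand so the example doesn't depend on any particular liburing version (target is the user_data of the request to cancel; the tag value is arbitrary):

    #include <liburing.h>
    #include <string.h>

    static void submit_cancel(struct io_uring *ring, __u64 target)
    {
        struct io_uring_sqe *sqe = io_uring_get_sqe(ring);

        memset(sqe, 0, sizeof(*sqe));
        sqe->opcode    = IORING_OP_ASYNC_CANCEL;
        sqe->addr      = target;  /* user_data of the victim request */
        sqe->user_data = 99;      /* tags the cancel's own CQE */
        io_uring_submit(ring);
        /* cancel CQE res: 0 cancelled, -EALREADY running, -ENOENT not found */
    }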
From: Jens Axboe <axboe@kernel.dk>
mainline inclusion from mainline-5.5-rc1 commit c1edbf5f081be9fbbea68c1d564b773e59c1acf3 category: feature bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27 CVE: NA ---------------------------
Now that we have backpressure, for SQPOLL, we have one more condition that warrants flagging that the application needs to enter the kernel: we failed to submit IO due to backpressure. Make sure we catch that and flag it appropriately.
If we run into backpressure issues with the SQPOLL thread, flag it as such to the application by setting IORING_SQ_NEED_WAKEUP. This will cause the application to enter the kernel, and that will flush the backlog and clear the condition.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Zhihao Cheng <chengzhihao1@huawei.com>
Signed-off-by: yangerkun <yangerkun@huawei.com>
Reviewed-by: zhangyi (F) <yi.zhang@huawei.com>
Signed-off-by: Cheng Jian <cj.chengjian@huawei.com>
---
 fs/io_uring.c | 22 ++++++++++++++++------
 1 file changed, 16 insertions(+), 6 deletions(-)
diff --git a/fs/io_uring.c b/fs/io_uring.c index 645939e864db..85c21fee7ac0 100644 --- a/fs/io_uring.c +++ b/fs/io_uring.c @@ -3118,16 +3118,16 @@ static int io_sq_thread(void *data) DEFINE_WAIT(wait); unsigned inflight; unsigned long timeout; + int ret;
complete(&ctx->completions[1]);
old_fs = get_fs(); set_fs(USER_DS);
- timeout = inflight = 0; + ret = timeout = inflight = 0; while (!kthread_should_park()) { unsigned int to_submit; - int ret;
if (inflight) { unsigned nr_events = 0; @@ -3161,13 +3161,21 @@ static int io_sq_thread(void *data) }
to_submit = io_sqring_entries(ctx); - if (!to_submit) { + + /* + * If submit got -EBUSY, flag us as needing the application + * to enter the kernel to reap and flush events. + */ + if (!to_submit || ret == -EBUSY) { /* * We're polling. If we're within the defined idle * period, then let us spin without work before going - * to sleep. + * to sleep. The exception is if we got EBUSY doing + * more IO, we should wait for the application to + * reap events and wake us up. */ - if (inflight || !time_after(jiffies, timeout)) { + if (inflight || + (!time_after(jiffies, timeout) && ret != -EBUSY)) { cond_resched(); continue; } @@ -3193,7 +3201,7 @@ static int io_sq_thread(void *data) smp_mb();
to_submit = io_sqring_entries(ctx); - if (!to_submit) { + if (!to_submit || ret == -EBUSY) { if (kthread_should_park()) { finish_wait(&ctx->sqo_wait, &wait); break; @@ -4352,6 +4360,8 @@ SYSCALL_DEFINE6(io_uring_enter, unsigned int, fd, u32, to_submit, */ ret = 0; if (ctx->flags & IORING_SETUP_SQPOLL) { + if (!list_empty_careful(&ctx->cq_overflow_list)) + io_cqring_overflow_flush(ctx, false); if (flags & IORING_ENTER_SQ_WAKEUP) wake_up(&ctx->sqo_wait); submitted = to_submit;
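The userspace half of that contract, roughly (a fragment; assumes liburing's mmap'd SQ state in ring.sq.kflags and a raw io_uring_enter syscall via <sys/syscall.h>; a production submitter would use an acquire load on the flags):

    /* after writing new SQEs with IORING_SETUP_SQPOLL active */
    if (*ring.sq.kflags & IORING_SQ_NEED_WAKEUP)
        syscall(__NR_io_uring_enter, ring.ring_fd, 0, 0,
                IORING_ENTER_SQ_WAKEUP, NULL, 0);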
From: Jens Axboe <axboe@kernel.dk>
mainline inclusion from mainline-5.5-rc1 commit 768134d4f48109b90f4248feecbeeb7d684e410c category: feature bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27 CVE: NA ---------------------------
We can't safely cancel under the inflight lock. If the work hasn't been started yet, then io_wq_cancel_work() simply marks the work as cancelled and invokes the work handler. But if the work completion needs to grab the inflight lock because it's grabbing user files, then we'll deadlock trying to finish the work as we already hold that lock.
Instead grab a reference to the request, if its refcount isn't already zero. If it's zero, then we know it's going through completion anyway, and we can safely ignore it. If it's not zero, then we can drop the lock and attempt to cancel from there.
This also fixes a missing finish_wait() at the end of io_uring_cancel_files().
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Zhihao Cheng <chengzhihao1@huawei.com>
Signed-off-by: yangerkun <yangerkun@huawei.com>
Reviewed-by: zhangyi (F) <yi.zhang@huawei.com>
Signed-off-by: Cheng Jian <cj.chengjian@huawei.com>
---
 fs/io_uring.c | 35 ++++++++++++++++++-----------------
 1 file changed, 18 insertions(+), 17 deletions(-)
diff --git a/fs/io_uring.c b/fs/io_uring.c index 85c21fee7ac0..d751e1eb245e 100644 --- a/fs/io_uring.c +++ b/fs/io_uring.c @@ -4258,33 +4258,34 @@ static void io_uring_cancel_files(struct io_ring_ctx *ctx,
while (!list_empty_careful(&ctx->inflight_list)) { enum io_wq_cancel ret = IO_WQ_CANCEL_NOTFOUND; + struct io_kiocb *cancel_req = NULL;
spin_lock_irq(&ctx->inflight_lock); list_for_each_entry(req, &ctx->inflight_list, inflight_entry) { - if (req->work.files == files) { - ret = io_wq_cancel_work(ctx->io_wq, &req->work); - break; - } + if (req->work.files != files) + continue; + /* req is being completed, ignore */ + if (!refcount_inc_not_zero(&req->refs)) + continue; + cancel_req = req; + break; } - if (ret == IO_WQ_CANCEL_RUNNING) + if (cancel_req) prepare_to_wait(&ctx->inflight_wait, &wait, - TASK_UNINTERRUPTIBLE); - + TASK_UNINTERRUPTIBLE); spin_unlock_irq(&ctx->inflight_lock);
- /* - * We need to keep going until we get NOTFOUND. We only cancel - * one work at the time. - * - * If we get CANCEL_RUNNING, then wait for a work to complete - * before continuing. - */ - if (ret == IO_WQ_CANCEL_OK) - continue; - else if (ret != IO_WQ_CANCEL_RUNNING) + if (cancel_req) { + ret = io_wq_cancel_work(ctx->io_wq, &cancel_req->work); + io_put_req(cancel_req); + } + + /* We need to keep going until we don't find a matching req */ + if (!cancel_req) break; schedule(); } + finish_wait(&ctx->inflight_wait, &wait); }
static int io_uring_flush(struct file *file, void *data)
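The idiom at the heart of the fix, restated in isolation:

    /*
     * refcount_inc_not_zero() takes a reference only while the count is
     * still non-zero. A zero count means the request is already on its
     * release path, so the canceller must leave it alone.
     */
    if (refcount_inc_not_zero(&req->refs))
        cancel_req = req; /* we hold a ref; usable after dropping the lock */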
From: Jens Axboe <axboe@kernel.dk>
mainline inclusion from mainline-5.5-rc1 commit 76a46e066e2d93bd333599d1c84c605c2c4cc909 category: feature bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27 CVE: NA ---------------------------
If you prep a read (for example) that needs to get punted to async context with a linked timer, and the timeout is sufficiently short, the timer request will get completed with -ENOENT because it could not find the read.
The issue is that we prep and start the timer before we start the read. Hence the timer can trigger before the read is even started, and the end result is then that the timer completes with -ENOENT, while the read starts instead of being cancelled by the timer.
Fix this by splitting the linked timer into two parts:
1) Prep and validate the linked timer
2) Start timer
The read is then started between steps 1 and 2, so we know that the timer will always have a consistent view of the read request state.
Reported-by: Hrvoje Zeba <zeba.hrvoje@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Zhihao Cheng <chengzhihao1@huawei.com>
Signed-off-by: yangerkun <yangerkun@huawei.com>
Reviewed-by: zhangyi (F) <yi.zhang@huawei.com>
Signed-off-by: Cheng Jian <cj.chengjian@huawei.com>
---
 fs/io_uring.c | 123 +++++++++++++++++++++++++++++---------------------
 1 file changed, 72 insertions(+), 51 deletions(-)
diff --git a/fs/io_uring.c b/fs/io_uring.c index d751e1eb245e..5e487fa88a82 100644 --- a/fs/io_uring.c +++ b/fs/io_uring.c @@ -854,7 +854,7 @@ static void io_req_link_next(struct io_kiocb *req, struct io_kiocb **nxtptr) */ nxt = list_first_entry_or_null(&req->link_list, struct io_kiocb, list); while (nxt) { - list_del(&nxt->list); + list_del_init(&nxt->list); if (!list_empty(&req->link_list)) { INIT_LIST_HEAD(&nxt->link_list); list_splice(&req->link_list, &nxt->link_list); @@ -2688,13 +2688,17 @@ static enum hrtimer_restart io_link_timeout_fn(struct hrtimer *timer) */ if (!list_empty(&req->list)) { prev = list_entry(req->list.prev, struct io_kiocb, link_list); - list_del_init(&req->list); + if (refcount_inc_not_zero(&prev->refs)) + list_del_init(&req->list); + else + prev = NULL; }
spin_unlock_irqrestore(&ctx->completion_lock, flags);
if (prev) { io_async_find_and_cancel(ctx, req, prev->user_data, NULL); + io_put_req(prev); } else { io_cqring_add_event(req, -ETIME); io_put_req(req); @@ -2702,78 +2706,84 @@ static enum hrtimer_restart io_link_timeout_fn(struct hrtimer *timer) return HRTIMER_NORESTART; }
-static int io_queue_linked_timeout(struct io_kiocb *req, struct io_kiocb *nxt) +static void io_queue_linked_timeout(struct io_kiocb *req, struct timespec64 *ts, + enum hrtimer_mode *mode) { - const struct io_uring_sqe *sqe = nxt->submit.sqe; - enum hrtimer_mode mode; - struct timespec64 ts; - int ret = -EINVAL; + struct io_ring_ctx *ctx = req->ctx;
- if (sqe->ioprio || sqe->buf_index || sqe->len != 1 || sqe->off) - goto err; - if (sqe->timeout_flags & ~IORING_TIMEOUT_ABS) - goto err; - if (get_timespec64(&ts, u64_to_user_ptr(sqe->addr))) { - ret = -EFAULT; - goto err; + /* + * If the list is now empty, then our linked request finished before + * we got a chance to setup the timer + */ + spin_lock_irq(&ctx->completion_lock); + if (!list_empty(&req->list)) { + req->timeout.timer.function = io_link_timeout_fn; + hrtimer_start(&req->timeout.timer, timespec64_to_ktime(*ts), + *mode); } + spin_unlock_irq(&ctx->completion_lock);
- req->flags |= REQ_F_LINK_TIMEOUT; - - if (sqe->timeout_flags & IORING_TIMEOUT_ABS) - mode = HRTIMER_MODE_ABS; - else - mode = HRTIMER_MODE_REL; - hrtimer_init(&nxt->timeout.timer, CLOCK_MONOTONIC, mode); - nxt->timeout.timer.function = io_link_timeout_fn; - hrtimer_start(&nxt->timeout.timer, timespec64_to_ktime(ts), mode); - ret = 0; -err: /* drop submission reference */ - io_put_req(nxt); - - if (ret) { - struct io_ring_ctx *ctx = req->ctx; + io_put_req(req); +}
- /* - * Break the link and fail linked timeout, parent will get - * failed by the regular submission path. - */ - list_del(&nxt->list); - io_cqring_fill_event(nxt, ret); - trace_io_uring_fail_link(req, nxt); - io_commit_cqring(ctx); - io_put_req(nxt); - ret = -ECANCELED; - } +static int io_validate_link_timeout(const struct io_uring_sqe *sqe, + struct timespec64 *ts) +{ + if (sqe->ioprio || sqe->buf_index || sqe->len != 1 || sqe->off) + return -EINVAL; + if (sqe->timeout_flags & ~IORING_TIMEOUT_ABS) + return -EINVAL; + if (get_timespec64(ts, u64_to_user_ptr(sqe->addr))) + return -EFAULT;
- return ret; + return 0; }
-static inline struct io_kiocb *io_get_linked_timeout(struct io_kiocb *req) +static struct io_kiocb *io_prep_linked_timeout(struct io_kiocb *req, + struct timespec64 *ts, + enum hrtimer_mode *mode) { struct io_kiocb *nxt; + int ret;
if (!(req->flags & REQ_F_LINK)) return NULL;
nxt = list_first_entry_or_null(&req->link_list, struct io_kiocb, list); - if (nxt && nxt->submit.sqe->opcode == IORING_OP_LINK_TIMEOUT) - return nxt; + if (!nxt || nxt->submit.sqe->opcode != IORING_OP_LINK_TIMEOUT) + return NULL;
- return NULL; + ret = io_validate_link_timeout(nxt->submit.sqe, ts); + if (ret) { + list_del_init(&nxt->list); + io_cqring_add_event(nxt, ret); + io_double_put_req(nxt); + return ERR_PTR(-ECANCELED); + } + + if (nxt->submit.sqe->timeout_flags & IORING_TIMEOUT_ABS) + *mode = HRTIMER_MODE_ABS; + else + *mode = HRTIMER_MODE_REL; + + req->flags |= REQ_F_LINK_TIMEOUT; + hrtimer_init(&nxt->timeout.timer, CLOCK_MONOTONIC, *mode); + return nxt; }
static int __io_queue_sqe(struct io_kiocb *req) { + enum hrtimer_mode mode; struct io_kiocb *nxt; + struct timespec64 ts; int ret;
- nxt = io_get_linked_timeout(req); - if (unlikely(nxt)) { - ret = io_queue_linked_timeout(req, nxt); - if (ret) - goto err; + nxt = io_prep_linked_timeout(req, &ts, &mode); + if (IS_ERR(nxt)) { + ret = PTR_ERR(nxt); + nxt = NULL; + goto err; }
ret = __io_submit_sqe(req, NULL, true); @@ -2803,14 +2813,25 @@ static int __io_queue_sqe(struct io_kiocb *req) * submit reference when the iocb is actually submitted. */ io_queue_async_work(req); + + if (nxt) + io_queue_linked_timeout(nxt, &ts, &mode); + return 0; } }
- /* drop submission reference */ err: + /* drop submission reference */ io_put_req(req);
+ if (nxt) { + if (!ret) + io_queue_linked_timeout(nxt, &ts, &mode); + else + io_put_req(nxt); + } + /* and drop final reference, if we failed */ if (ret) { io_cqring_add_event(req, ret);
From: Jens Axboe axboe@kernel.dk
mainline inclusion from mainline-5.4-rc8 commit 93bd25bb69f46367ba8f82c578e0c05702ceb482 category: feature bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27 CVE: NA ---------------------------
Currently we make sequence == 0 behave the same as sequence == 1, but that's not useful if the intent is really a pure timeout that isn't tied to any completion count.
If the user passes in sqe->off == 0, then don't apply any sequence logic to the request, let it purely be driven by the timeout specified.
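For reference, here's a minimal userspace sketch of what this enables. It assumes liburing and its io_uring_prep_timeout() helper, which are not part of this patch. With the count argument (sqe->off) set to 0, the request behaves as a plain timer and completes with -ETIME when it fires:

	#include <liburing.h>

	/* Sketch, assuming liburing: queue a pure 1-second timeout. With this
	 * change, count == 0 means no completion-count sequence logic is
	 * applied to the request. */
	static int queue_pure_timeout(struct io_uring *ring)
	{
		struct __kernel_timespec ts = { .tv_sec = 1, .tv_nsec = 0 };
		struct io_uring_sqe *sqe = io_uring_get_sqe(ring);

		if (!sqe)
			return -EBUSY;
		io_uring_prep_timeout(sqe, &ts, 0, 0);	/* count 0: pure timeout */
		return io_uring_submit(ring);
	}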
Reported-by: 李通洲 carter.li@eoitek.com Reviewed-by: 李通洲 carter.li@eoitek.com Signed-off-by: Jens Axboe axboe@kernel.dk Signed-off-by: Zhihao Cheng chengzhihao1@huawei.com Signed-off-by: yangerkun yangerkun@huawei.com Reviewed-by: zhangyi (F) yi.zhang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- fs/io_uring.c | 30 ++++++++++++++++++++++-------- 1 file changed, 22 insertions(+), 8 deletions(-)
diff --git a/fs/io_uring.c b/fs/io_uring.c index 5e487fa88a82..77688c4fba50 100644 --- a/fs/io_uring.c +++ b/fs/io_uring.c @@ -340,7 +340,8 @@ struct io_kiocb { #define REQ_F_TIMEOUT 1024 /* timeout request */ #define REQ_F_ISREG 2048 /* regular file */ #define REQ_F_MUST_PUNT 4096 /* must be punted even for NONBLOCK */ -#define REQ_F_INFLIGHT 8192 /* on inflight list */ +#define REQ_F_TIMEOUT_NOSEQ 8192 /* no timeout sequence */ +#define REQ_F_INFLIGHT 16384 /* on inflight list */ u64 user_data; u32 result; u32 sequence; @@ -480,9 +481,13 @@ static struct io_kiocb *io_get_timeout_req(struct io_ring_ctx *ctx) struct io_kiocb *req;
req = list_first_entry_or_null(&ctx->timeout_list, struct io_kiocb, list); - if (req && !__io_sequence_defer(req)) { - list_del_init(&req->list); - return req; + if (req) { + if (req->flags & REQ_F_TIMEOUT_NOSEQ) + return NULL; + if (!__io_sequence_defer(req)) { + list_del_init(&req->list); + return req; + } }
return NULL; @@ -2292,19 +2297,24 @@ static int io_timeout(struct io_kiocb *req, const struct io_uring_sqe *sqe) mode = HRTIMER_MODE_REL;
hrtimer_init(&req->timeout.timer, CLOCK_MONOTONIC, mode); + req->flags |= REQ_F_TIMEOUT;
/* * sqe->off holds how many events that need to occur for this - * timeout event to be satisfied. + * timeout event to be satisfied. If it isn't set, then this is + * a pure timeout request, sequence isn't used. */ count = READ_ONCE(sqe->off); - if (!count) - count = 1; + if (!count) { + req->flags |= REQ_F_TIMEOUT_NOSEQ; + spin_lock_irq(&ctx->completion_lock); + entry = ctx->timeout_list.prev; + goto add; + }
req->sequence = ctx->cached_sq_head + count - 1; /* reuse it to store the count */ req->submit.sequence = count; - req->flags |= REQ_F_TIMEOUT;
/* * Insertion sort, ensuring the first entry in the list is always @@ -2316,6 +2326,9 @@ static int io_timeout(struct io_kiocb *req, const struct io_uring_sqe *sqe) unsigned nxt_sq_head; long long tmp, tmp_nxt;
+ if (nxt->flags & REQ_F_TIMEOUT_NOSEQ) + continue; + /* * Since cached_sq_head + count - 1 can overflow, use type long * long to store it. @@ -2342,6 +2355,7 @@ static int io_timeout(struct io_kiocb *req, const struct io_uring_sqe *sqe) nxt->sequence++; } req->sequence -= span; +add: list_add(&req->list, entry); req->timeout.timer.function = io_timeout_fn; hrtimer_start(&req->timeout.timer, timespec64_to_ktime(ts), mode);
From: Jens Axboe axboe@kernel.dk
mainline inclusion from mainline-5.5-rc1 commit 960e432dfa5927892a9b170d14de874597b84849 category: feature bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27 CVE: NA ---------------------------
Since we switched to io-wq, the dependent link optimization for when to pass back work inline has been broken. Fix this by providing a suitable io-wq helper for io_uring to use to detect when to do this.
Fixes: 561fb04a6a22 ("io_uring: replace workqueue usage with io-wq") Signed-off-by: Jens Axboe axboe@kernel.dk Signed-off-by: Zhihao Cheng chengzhihao1@huawei.com Signed-off-by: yangerkun yangerkun@huawei.com Reviewed-by: zhangyi (F) yi.zhang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- fs/io-wq.h | 4 ++++ fs/io_uring.c | 2 +- 2 files changed, 5 insertions(+), 1 deletion(-)
diff --git a/fs/io-wq.h b/fs/io-wq.h index 8cb345256f35..cc50754d028c 100644 --- a/fs/io-wq.h +++ b/fs/io-wq.h @@ -62,4 +62,8 @@ static inline void io_wq_worker_running(struct task_struct *tsk) } #endif
+static inline bool io_wq_current_is_worker(void) +{ + return in_task() && (current->flags & PF_IO_WORKER); +} #endif diff --git a/fs/io_uring.c b/fs/io_uring.c index 77688c4fba50..866527e2aaf9 100644 --- a/fs/io_uring.c +++ b/fs/io_uring.c @@ -876,7 +876,7 @@ static void io_req_link_next(struct io_kiocb *req, struct io_kiocb **nxtptr) /* we dropped this link, get next */ nxt = list_first_entry_or_null(&req->link_list, struct io_kiocb, list); - } else if (nxtptr && current_work()) { + } else if (nxtptr && io_wq_current_is_worker()) { *nxtptr = nxt; break; } else {
From: Jens Axboe axboe@kernel.dk
mainline inclusion from mainline-5.5-rc1 commit 7c9e7f0fe0d8abf856a957c150c48778e75154c1 category: feature bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27 CVE: NA ---------------------------
We attempt to run the poll completion inline, but we're using trylock to do so. This avoids a deadlock, since we're grabbing the locks in reverse order at this point: we already hold the poll wq lock and we're trying to grab the completion lock, while the normal locking rules are the reverse of that order.
IO completion for a timeout link will need to grab the completion lock, but that's not safe from this context. Put the completion under the completion_lock in io_poll_wake(), and mark the request as entering the completion with the completion_lock already held.
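To see why trylock is required, here is a standalone sketch of the ABBA pattern using pthreads. It is purely illustrative, not the kernel code: 'a' stands in for the completion_lock, 'b' for the poll waitqueue lock.

	#include <pthread.h>

	static pthread_mutex_t a = PTHREAD_MUTEX_INITIALIZER;	/* completion_lock */
	static pthread_mutex_t b = PTHREAD_MUTEX_INITIALIZER;	/* poll wq lock */

	/* Normal completion path: a, then b. */
	static void normal_completion(void)
	{
		pthread_mutex_lock(&a);
		pthread_mutex_lock(&b);
		/* ... complete request ... */
		pthread_mutex_unlock(&b);
		pthread_mutex_unlock(&a);
	}

	/* The wakeup path already holds b. Blocking on a here could deadlock
	 * against normal_completion(); trylock either succeeds, or we give up
	 * and punt the completion to async context. */
	static int wakeup_path_holding_b(void)
	{
		if (pthread_mutex_trylock(&a) == 0) {
			/* complete inline, marked as holding the lock */
			pthread_mutex_unlock(&a);
			return 1;
		}
		return 0;	/* caller queues async work instead */
	}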
Fixes: 2665abfd757f ("io_uring: add support for linked SQE timeouts") Signed-off-by: Jens Axboe axboe@kernel.dk Signed-off-by: Zhihao Cheng chengzhihao1@huawei.com Signed-off-by: yangerkun yangerkun@huawei.com Reviewed-by: zhangyi (F) yi.zhang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- fs/io_uring.c | 15 ++++++++++++--- 1 file changed, 12 insertions(+), 3 deletions(-)
diff --git a/fs/io_uring.c b/fs/io_uring.c index 866527e2aaf9..8f8bb5c7e791 100644 --- a/fs/io_uring.c +++ b/fs/io_uring.c @@ -342,6 +342,7 @@ struct io_kiocb { #define REQ_F_MUST_PUNT 4096 /* must be punted even for NONBLOCK */ #define REQ_F_TIMEOUT_NOSEQ 8192 /* no timeout sequence */ #define REQ_F_INFLIGHT 16384 /* on inflight list */ +#define REQ_F_COMP_LOCKED 32768 /* completion under lock */ u64 user_data; u32 result; u32 sequence; @@ -935,14 +936,15 @@ static void io_free_req_find_next(struct io_kiocb *req, struct io_kiocb **nxt) */ if (req->flags & REQ_F_FAIL_LINK) { io_fail_links(req); - } else if (req->flags & REQ_F_LINK_TIMEOUT) { + } else if ((req->flags & (REQ_F_LINK_TIMEOUT | REQ_F_COMP_LOCKED)) == + REQ_F_LINK_TIMEOUT) { struct io_ring_ctx *ctx = req->ctx; unsigned long flags;
/* * If this is a timeout link, we could be racing with the * timeout timer. Grab the completion lock for this case to - * protection against that. + * protect against that. */ spin_lock_irqsave(&ctx->completion_lock, flags); io_req_link_next(req, nxt); @@ -2069,13 +2071,20 @@ static int io_poll_wake(struct wait_queue_entry *wait, unsigned mode, int sync,
list_del_init(&poll->wait.entry);
+ /* + * Run completion inline if we can. We're using trylock here because + * we are violating the completion_lock -> poll wq lock ordering. + * If we have a link timeout we're going to need the completion_lock + * for finalizing the request, mark us as having grabbed that already. + */ if (mask && spin_trylock_irqsave(&ctx->completion_lock, flags)) { list_del(&req->list); io_poll_complete(req, mask); + req->flags |= REQ_F_COMP_LOCKED; + io_put_req(req); spin_unlock_irqrestore(&ctx->completion_lock, flags);
io_cqring_ev_posted(ctx); - io_put_req(req); } else { io_queue_async_work(req); }
From: Jens Axboe axboe@kernel.dk
mainline inclusion from mainline-5.5-rc1 commit 15dff286d0e0087d4dcd7049911f179e4e4cfd94 category: feature bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27 CVE: NA ---------------------------
Normally the rings are always valid; the exception is if we failed to allocate the rings at setup time. syzbot reports this:
RSP: 002b:00007ffd6e8aa078 EFLAGS: 00000246 ORIG_RAX: 00000000000001a9
RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 0000000000441229
RDX: 0000000000000002 RSI: 0000000020000140 RDI: 0000000000000d0d
RBP: 00007ffd6e8aa090 R08: 0000000000000001 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000246 R12: ffffffffffffffff
R13: 0000000000000003 R14: 0000000000000000 R15: 0000000000000000
kasan: CONFIG_KASAN_INLINE enabled
kasan: GPF could be caused by NULL-ptr deref or user memory access
general protection fault: 0000 [#1] PREEMPT SMP KASAN
CPU: 1 PID: 8903 Comm: syz-executor410 Not tainted 5.4.0-rc7-next-20191113
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
RIP: 0010:__read_once_size include/linux/compiler.h:199 [inline]
RIP: 0010:__io_commit_cqring fs/io_uring.c:496 [inline]
RIP: 0010:io_commit_cqring+0x1e1/0xdb0 fs/io_uring.c:592
Code: 03 0f 8e df 09 00 00 48 8b 45 d0 4c 8d a3 c0 00 00 00 4c 89 e2 48 c1 ea 03 44 8b b8 c0 01 00 00 48 b8 00 00 00 00 00 fc ff df <0f> b6 14 02 4c 89 e0 83 e0 07 83 c0 03 38 d0 7c 08 84 d2 0f 85 61
RSP: 0018:ffff88808f51fc08 EFLAGS: 00010006
RAX: dffffc0000000000 RBX: 0000000000000000 RCX: ffffffff815abe4a
RDX: 0000000000000018 RSI: ffffffff81d168d5 RDI: ffff8880a9166100
RBP: ffff88808f51fc70 R08: 0000000000000004 R09: ffffed1011ea3f7d
R10: ffffed1011ea3f7c R11: 0000000000000003 R12: 00000000000000c0
R13: ffff8880a91661c0 R14: 1ffff1101522cc10 R15: 0000000000000000
FS: 0000000001e7a880(0000) GS:ffff8880ae900000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000000020000140 CR3: 000000009a74c000 CR4: 00000000001406e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Call Trace:
 io_cqring_overflow_flush+0x6b9/0xa90 fs/io_uring.c:673
 io_ring_ctx_wait_and_kill+0x24f/0x7c0 fs/io_uring.c:4260
 io_uring_create fs/io_uring.c:4600 [inline]
 io_uring_setup+0x1256/0x1cc0 fs/io_uring.c:4626
 __do_sys_io_uring_setup fs/io_uring.c:4639 [inline]
 __se_sys_io_uring_setup fs/io_uring.c:4636 [inline]
 __x64_sys_io_uring_setup+0x54/0x80 fs/io_uring.c:4636
 do_syscall_64+0xfa/0x760 arch/x86/entry/common.c:290
 entry_SYSCALL_64_after_hwframe+0x49/0xbe
RIP: 0033:0x441229
Code: e8 5c ae 02 00 48 83 c4 18 c3 0f 1f 80 00 00 00 00 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 0f 83 bb 0a fc ff c3 66 2e 0f 1f 84 00 00 00 00
RSP: 002b:00007ffd6e8aa078 EFLAGS: 00000246 ORIG_RAX: 00000000000001a9
RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 0000000000441229
RDX: 0000000000000002 RSI: 0000000020000140 RDI: 0000000000000d0d
RBP: 00007ffd6e8aa090 R08: 0000000000000001 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000246 R12: ffffffffffffffff
R13: 0000000000000003 R14: 0000000000000000 R15: 0000000000000000
Modules linked in:
---[ end trace b0f5b127a57f623f ]---
RIP: 0010:__read_once_size include/linux/compiler.h:199 [inline]
RIP: 0010:__io_commit_cqring fs/io_uring.c:496 [inline]
RIP: 0010:io_commit_cqring+0x1e1/0xdb0 fs/io_uring.c:592
Code: 03 0f 8e df 09 00 00 48 8b 45 d0 4c 8d a3 c0 00 00 00 4c 89 e2 48 c1 ea 03 44 8b b8 c0 01 00 00 48 b8 00 00 00 00 00 fc ff df <0f> b6 14 02 4c 89 e0 83 e0 07 83 c0 03 38 d0 7c 08 84 d2 0f 85 61
RSP: 0018:ffff88808f51fc08 EFLAGS: 00010006
RAX: dffffc0000000000 RBX: 0000000000000000 RCX: ffffffff815abe4a
RDX: 0000000000000018 RSI: ffffffff81d168d5 RDI: ffff8880a9166100
RBP: ffff88808f51fc70 R08: 0000000000000004 R09: ffffed1011ea3f7d
R10: ffffed1011ea3f7c R11: 0000000000000003 R12: 00000000000000c0
R13: ffff8880a91661c0 R14: 1ffff1101522cc10 R15: 0000000000000000
FS: 0000000001e7a880(0000) GS:ffff8880ae900000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000000020000140 CR3: 000000009a74c000 CR4: 00000000001406e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
which is exactly the case of failing to allocate the SQ/CQ rings, and then entering shutdown. Check if the rings are valid before trying to access them at shutdown time.
Reported-by: syzbot+21147d79607d724bd6f3@syzkaller.appspotmail.com Fixes: 1d7bb1d50fb4 ("io_uring: add support for backlogged CQ ring") Signed-off-by: Jens Axboe axboe@kernel.dk Signed-off-by: Zhihao Cheng chengzhihao1@huawei.com Signed-off-by: yangerkun yangerkun@huawei.com Reviewed-by: zhangyi (F) yi.zhang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- fs/io_uring.c | 4 +++- 1 file changed, 3 insertions(+), 1 deletion(-)
diff --git a/fs/io_uring.c b/fs/io_uring.c index 8f8bb5c7e791..cc69f38c77e5 100644 --- a/fs/io_uring.c +++ b/fs/io_uring.c @@ -4280,7 +4280,9 @@ static void io_ring_ctx_wait_and_kill(struct io_ring_ctx *ctx) io_wq_cancel_all(ctx->io_wq);
io_iopoll_reap_events(ctx); - io_cqring_overflow_flush(ctx, true); + /* if we failed setting up the ctx, we might not have any rings */ + if (ctx->rings) + io_cqring_overflow_flush(ctx, true); wait_for_completion(&ctx->completions[0]); io_ring_ctx_free(ctx); }
From: Pavel Begunkov asml.silence@gmail.com
mainline inclusion from mainline-5.4-rc8 commit 5683e5406e94ae1bfb0d9516a18fdb281d0f8d1d category: feature bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27 CVE: NA ---------------------------
For timeout requests io_uring tries to grab a file with the specified fd, which is usually stdin/fd=0, even though a timeout doesn't operate on a file at all. Update io_op_needs_file() so IORING_OP_TIMEOUT is not treated as needing one.
Signed-off-by: Pavel Begunkov asml.silence@gmail.com Signed-off-by: Jens Axboe axboe@kernel.dk Signed-off-by: Zhihao Cheng chengzhihao1@huawei.com Signed-off-by: yangerkun yangerkun@huawei.com Reviewed-by: zhangyi (F) yi.zhang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- fs/io_uring.c | 1 + 1 file changed, 1 insertion(+)
diff --git a/fs/io_uring.c b/fs/io_uring.c index cc69f38c77e5..77e8d403b3e7 100644 --- a/fs/io_uring.c +++ b/fs/io_uring.c @@ -2612,6 +2612,7 @@ static bool io_op_needs_file(const struct io_uring_sqe *sqe) switch (op) { case IORING_OP_NOP: case IORING_OP_POLL_REMOVE: + case IORING_OP_TIMEOUT: return false; default: return true;
From: Jens Axboe axboe@kernel.dk
mainline inclusion from mainline-5.5-rc1 commit 7d7230652e7c788ef908536fd79f4cca077f269f category: feature bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27 CVE: NA ---------------------------
For cancellation, we need to ensure that the work item stays valid for as long as ->cur_work is valid. Right now we can't safely dereference the work item even under the wqe->lock, because while the ->cur_work pointer will remain valid, the work could be completing and be freed in parallel.
Only invoke ->get/put_work() on items we know the caller queued themselves. Add IO_WQ_WORK_INTERNAL for io-wq to use, which is needed when we're queueing a flush item, for instance.
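The shape of that contract, as a hedged standalone sketch (types and names here are illustrative, not the kernel's): the worker takes a reference before running caller-queued work and drops it afterwards, so a canceller that saw the item as ->cur_work under the lock can't race with the free.

	#include <stdatomic.h>
	#include <stdlib.h>

	struct work_item {
		atomic_int refs;
		void (*func)(struct work_item *);
	};

	/* Worker, before running caller-queued (non-internal) work. */
	static void get_work(struct work_item *w)
	{
		atomic_fetch_add(&w->refs, 1);
	}

	/* Worker, after the work has run; the item stayed valid for any
	 * concurrent canceller that dereferenced it in between. */
	static void put_work(struct work_item *w)
	{
		if (atomic_fetch_sub(&w->refs, 1) == 1)
			free(w);
	}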
Signed-off-by: Jens Axboe axboe@kernel.dk Signed-off-by: Zhihao Cheng chengzhihao1@huawei.com Signed-off-by: yangerkun yangerkun@huawei.com Reviewed-by: zhangyi (F) yi.zhang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- fs/io-wq.c | 25 +++++++++++++++++++++++-- fs/io-wq.h | 7 ++++++- fs/io_uring.c | 17 ++++++++++++++++- 3 files changed, 45 insertions(+), 4 deletions(-)
diff --git a/fs/io-wq.c b/fs/io-wq.c index e70da9583377..97b18dfad163 100644 --- a/fs/io-wq.c +++ b/fs/io-wq.c @@ -107,6 +107,9 @@ struct io_wq { unsigned long state; unsigned nr_wqes;
+ get_work_fn *get_work; + put_work_fn *put_work; + struct task_struct *manager; struct user_struct *user; struct mm_struct *mm; @@ -393,7 +396,7 @@ static struct io_wq_work *io_get_next_work(struct io_wqe *wqe, unsigned *hash) static void io_worker_handle_work(struct io_worker *worker) __releases(wqe->lock) { - struct io_wq_work *work, *old_work; + struct io_wq_work *work, *old_work = NULL, *put_work = NULL; struct io_wqe *wqe = worker->wqe; struct io_wq *wq = wqe->wq;
@@ -425,6 +428,8 @@ static void io_worker_handle_work(struct io_worker *worker) wqe->flags |= IO_WQE_FLAG_STALLED;
spin_unlock_irq(&wqe->lock); + if (put_work && wq->put_work) + wq->put_work(old_work); if (!work) break; next: @@ -445,6 +450,11 @@ static void io_worker_handle_work(struct io_worker *worker) if (worker->mm) work->flags |= IO_WQ_WORK_HAS_MM;
+ if (wq->get_work && !(work->flags & IO_WQ_WORK_INTERNAL)) { + put_work = work; + wq->get_work(work); + } + old_work = work; work->func(&work);
@@ -456,6 +466,12 @@ static void io_worker_handle_work(struct io_worker *worker) } if (work && work != old_work) { spin_unlock_irq(&wqe->lock); + + if (put_work && wq->put_work) { + wq->put_work(put_work); + put_work = NULL; + } + /* dependent work not hashed */ hash = -1U; goto next; @@ -951,13 +967,15 @@ void io_wq_flush(struct io_wq *wq)
init_completion(&data.done); INIT_IO_WORK(&data.work, io_wq_flush_func); + data.work.flags |= IO_WQ_WORK_INTERNAL; io_wqe_enqueue(wqe, &data.work); wait_for_completion(&data.done); } }
struct io_wq *io_wq_create(unsigned bounded, struct mm_struct *mm, - struct user_struct *user) + struct user_struct *user, get_work_fn *get_work, + put_work_fn *put_work) { int ret = -ENOMEM, i, node; struct io_wq *wq; @@ -973,6 +991,9 @@ struct io_wq *io_wq_create(unsigned bounded, struct mm_struct *mm, return ERR_PTR(-ENOMEM); }
+ wq->get_work = get_work; + wq->put_work = put_work; + /* caller must already hold a reference to this */ wq->user = user;
diff --git a/fs/io-wq.h b/fs/io-wq.h index cc50754d028c..4b29f922f80c 100644 --- a/fs/io-wq.h +++ b/fs/io-wq.h @@ -10,6 +10,7 @@ enum { IO_WQ_WORK_NEEDS_USER = 8, IO_WQ_WORK_NEEDS_FILES = 16, IO_WQ_WORK_UNBOUND = 32, + IO_WQ_WORK_INTERNAL = 64,
IO_WQ_HASH_SHIFT = 24, /* upper 8 bits are used for hash key */ }; @@ -34,8 +35,12 @@ struct io_wq_work { (work)->files = NULL; \ } while (0) \
+typedef void (get_work_fn)(struct io_wq_work *); +typedef void (put_work_fn)(struct io_wq_work *); + struct io_wq *io_wq_create(unsigned bounded, struct mm_struct *mm, - struct user_struct *user); + struct user_struct *user, + get_work_fn *get_work, put_work_fn *put_work); void io_wq_destroy(struct io_wq *wq);
void io_wq_enqueue(struct io_wq *wq, struct io_wq_work *work); diff --git a/fs/io_uring.c b/fs/io_uring.c index 77e8d403b3e7..9ceb7af472bf 100644 --- a/fs/io_uring.c +++ b/fs/io_uring.c @@ -3837,6 +3837,20 @@ static int io_sqe_files_update(struct io_ring_ctx *ctx, void __user *arg, return done ? done : err; }
+static void io_put_work(struct io_wq_work *work) +{ + struct io_kiocb *req = container_of(work, struct io_kiocb, work); + + io_put_req(req); +} + +static void io_get_work(struct io_wq_work *work) +{ + struct io_kiocb *req = container_of(work, struct io_kiocb, work); + + refcount_inc(&req->refs); +} + static int io_sq_offload_start(struct io_ring_ctx *ctx, struct io_uring_params *p) { @@ -3886,7 +3900,8 @@ static int io_sq_offload_start(struct io_ring_ctx *ctx,
/* Do QD, or 4 * CPUS, whatever is smallest */ concurrency = min(ctx->sq_entries, 4 * num_online_cpus()); - ctx->io_wq = io_wq_create(concurrency, ctx->sqo_mm, ctx->user); + ctx->io_wq = io_wq_create(concurrency, ctx->sqo_mm, ctx->user, + io_get_work, io_put_work); if (IS_ERR(ctx->io_wq)) { ret = PTR_ERR(ctx->io_wq); ctx->io_wq = NULL;
From: Jens Axboe axboe@kernel.dk
mainline inclusion from mainline-5.5-rc1 commit 978db57e2c329fc612ff669cab9bf0007efd3ca3 category: feature bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27 CVE: NA ---------------------------
If we don't use the normal completion path, we may skip killing links that should be errored and freed. Add __io_double_put_req() for use within the completion path itself, other calls should just use io_double_put_req().
Signed-off-by: Jens Axboe axboe@kernel.dk Signed-off-by: Zhihao Cheng chengzhihao1@huawei.com Signed-off-by: yangerkun yangerkun@huawei.com Reviewed-by: zhangyi (F) yi.zhang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- fs/io_uring.c | 16 ++++++++++++++-- 1 file changed, 14 insertions(+), 2 deletions(-)
diff --git a/fs/io_uring.c b/fs/io_uring.c index 9b10335b68fc..700ae01d986f 100644 --- a/fs/io_uring.c +++ b/fs/io_uring.c @@ -383,6 +383,7 @@ static void io_cqring_fill_event(struct io_kiocb *req, long res); static void __io_free_req(struct io_kiocb *req); static void io_put_req(struct io_kiocb *req); static void io_double_put_req(struct io_kiocb *req); +static void __io_double_put_req(struct io_kiocb *req);
static struct kmem_cache *req_cachep;
@@ -915,7 +916,7 @@ static void io_fail_links(struct io_kiocb *req) io_link_cancel_timeout(link); } else { io_cqring_fill_event(link, -ECANCELED); - io_double_put_req(link); + __io_double_put_req(link); } }
@@ -989,13 +990,24 @@ static void io_put_req(struct io_kiocb *req) io_free_req(req); }
-static void io_double_put_req(struct io_kiocb *req) +/* + * Must only be used if we don't need to care about links, usually from + * within the completion handling itself. + */ +static void __io_double_put_req(struct io_kiocb *req) { /* drop both submit and complete references */ if (refcount_sub_and_test(2, &req->refs)) __io_free_req(req); }
+static void io_double_put_req(struct io_kiocb *req) +{ + /* drop both submit and complete references */ + if (refcount_sub_and_test(2, &req->refs)) + io_free_req(req); +} + static unsigned io_cqring_events(struct io_ring_ctx *ctx, bool noflush) { struct io_rings *rings = ctx->rings;
From: Jens Axboe axboe@kernel.dk
mainline inclusion from mainline-5.5-rc1 commit ad8a48acc23cb13cbf4332ebabb867b1baa81842 category: feature bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27 CVE: NA ---------------------------
There are a few reasons for moving the timeout state into its own io_timeout_data struct:
- As a prep to improving the linked timeout logic
- io_timeout is the biggest member in the io_kiocb opcode union
This also enables a few cleanups, like unifying the timer setup between IORING_OP_TIMEOUT and IORING_OP_LINK_TIMEOUT, and not needing multiple arguments to the link/prep helpers.
Signed-off-by: Jens Axboe axboe@kernel.dk Signed-off-by: Zhihao Cheng chengzhihao1@huawei.com Signed-off-by: yangerkun yangerkun@huawei.com Reviewed-by: zhangyi (F) yi.zhang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- fs/io_uring.c | 129 +++++++++++++++++++++++++++----------------------- 1 file changed, 70 insertions(+), 59 deletions(-)
diff --git a/fs/io_uring.c b/fs/io_uring.c index 700ae01d986f..2eb1b9cec145 100644 --- a/fs/io_uring.c +++ b/fs/io_uring.c @@ -301,9 +301,16 @@ struct io_poll_iocb { struct wait_queue_entry wait; };
+struct io_timeout_data { + struct io_kiocb *req; + struct hrtimer timer; + struct timespec64 ts; + enum hrtimer_mode mode; +}; + struct io_timeout { struct file *file; - struct hrtimer timer; + struct io_timeout_data *data; };
/* @@ -572,7 +579,7 @@ static void io_kill_timeout(struct io_kiocb *req) { int ret;
- ret = hrtimer_try_to_cancel(&req->timeout.timer); + ret = hrtimer_try_to_cancel(&req->timeout.data->timer); if (ret != -1) { atomic_inc(&req->ctx->cq_timeouts); list_del_init(&req->list); @@ -827,6 +834,8 @@ static void __io_free_req(struct io_kiocb *req) wake_up(&ctx->inflight_wait); spin_unlock_irqrestore(&ctx->inflight_lock, flags); } + if (req->flags & REQ_F_TIMEOUT) + kfree(req->timeout.data); percpu_ref_put(&ctx->refs); if (likely(!io_is_fallback_req(req))) kmem_cache_free(req_cachep, req); @@ -839,7 +848,7 @@ static bool io_link_cancel_timeout(struct io_kiocb *req) struct io_ring_ctx *ctx = req->ctx; int ret;
- ret = hrtimer_try_to_cancel(&req->timeout.timer); + ret = hrtimer_try_to_cancel(&req->timeout.data->timer); if (ret != -1) { io_cqring_fill_event(req, -ECANCELED); io_commit_cqring(ctx); @@ -2235,12 +2244,12 @@ static int io_poll_add(struct io_kiocb *req, const struct io_uring_sqe *sqe,
static enum hrtimer_restart io_timeout_fn(struct hrtimer *timer) { - struct io_ring_ctx *ctx; - struct io_kiocb *req; + struct io_timeout_data *data = container_of(timer, + struct io_timeout_data, timer); + struct io_kiocb *req = data->req; + struct io_ring_ctx *ctx = req->ctx; unsigned long flags;
- req = container_of(timer, struct io_kiocb, timeout.timer); - ctx = req->ctx; atomic_inc(&ctx->cq_timeouts);
spin_lock_irqsave(&ctx->completion_lock, flags); @@ -2290,7 +2299,7 @@ static int io_timeout_cancel(struct io_ring_ctx *ctx, __u64 user_data) if (ret == -ENOENT) return ret;
- ret = hrtimer_try_to_cancel(&req->timeout.timer); + ret = hrtimer_try_to_cancel(&req->timeout.data->timer); if (ret == -1) return -EALREADY;
@@ -2330,34 +2339,54 @@ static int io_timeout_remove(struct io_kiocb *req, return 0; }
-static int io_timeout(struct io_kiocb *req, const struct io_uring_sqe *sqe) +static int io_timeout_setup(struct io_kiocb *req) { - unsigned count; - struct io_ring_ctx *ctx = req->ctx; - struct list_head *entry; - enum hrtimer_mode mode; - struct timespec64 ts; - unsigned span = 0; + const struct io_uring_sqe *sqe = req->submit.sqe; + struct io_timeout_data *data; unsigned flags;
- if (unlikely(ctx->flags & IORING_SETUP_IOPOLL)) + if (unlikely(req->ctx->flags & IORING_SETUP_IOPOLL)) return -EINVAL; - if (sqe->flags || sqe->ioprio || sqe->buf_index || sqe->len != 1) + if (sqe->ioprio || sqe->buf_index || sqe->len != 1) return -EINVAL; flags = READ_ONCE(sqe->timeout_flags); if (flags & ~IORING_TIMEOUT_ABS) return -EINVAL;
- if (get_timespec64(&ts, u64_to_user_ptr(sqe->addr))) + data = kzalloc(sizeof(struct io_timeout_data), GFP_KERNEL); + if (!data) + return -ENOMEM; + data->req = req; + req->timeout.data = data; + req->flags |= REQ_F_TIMEOUT; + + if (get_timespec64(&data->ts, u64_to_user_ptr(sqe->addr))) return -EFAULT;
if (flags & IORING_TIMEOUT_ABS) - mode = HRTIMER_MODE_ABS; + data->mode = HRTIMER_MODE_ABS; else - mode = HRTIMER_MODE_REL; + data->mode = HRTIMER_MODE_REL;
- hrtimer_init(&req->timeout.timer, CLOCK_MONOTONIC, mode); - req->flags |= REQ_F_TIMEOUT; + hrtimer_init(&data->timer, CLOCK_MONOTONIC, data->mode); + return 0; +} + +static int io_timeout(struct io_kiocb *req, const struct io_uring_sqe *sqe) +{ + unsigned count; + struct io_ring_ctx *ctx = req->ctx; + struct io_timeout_data *data; + struct list_head *entry; + unsigned span = 0; + int ret; + + ret = io_timeout_setup(req); + /* common setup allows flags (like links) set, we don't */ + if (!ret && sqe->flags) + ret = -EINVAL; + if (ret) + return ret;
/* * sqe->off holds how many events that need to occur for this @@ -2417,8 +2446,9 @@ static int io_timeout(struct io_kiocb *req, const struct io_uring_sqe *sqe) req->sequence -= span; add: list_add(&req->list, entry); - req->timeout.timer.function = io_timeout_fn; - hrtimer_start(&req->timeout.timer, timespec64_to_ktime(ts), mode); + data = req->timeout.data; + data->timer.function = io_timeout_fn; + hrtimer_start(&data->timer, timespec64_to_ktime(data->ts), data->mode); spin_unlock_irq(&ctx->completion_lock); return 0; } @@ -2753,8 +2783,9 @@ static int io_grab_files(struct io_kiocb *req)
static enum hrtimer_restart io_link_timeout_fn(struct hrtimer *timer) { - struct io_kiocb *req = container_of(timer, struct io_kiocb, - timeout.timer); + struct io_timeout_data *data = container_of(timer, + struct io_timeout_data, timer); + struct io_kiocb *req = data->req; struct io_ring_ctx *ctx = req->ctx; struct io_kiocb *prev = NULL; unsigned long flags; @@ -2785,9 +2816,9 @@ static enum hrtimer_restart io_link_timeout_fn(struct hrtimer *timer) return HRTIMER_NORESTART; }
-static void io_queue_linked_timeout(struct io_kiocb *req, struct timespec64 *ts, - enum hrtimer_mode *mode) +static void io_queue_linked_timeout(struct io_kiocb *req) { + struct io_timeout_data *data = req->timeout.data; struct io_ring_ctx *ctx = req->ctx;
/* @@ -2796,9 +2827,9 @@ static void io_queue_linked_timeout(struct io_kiocb *req, struct timespec64 *ts, */ spin_lock_irq(&ctx->completion_lock); if (!list_empty(&req->list)) { - req->timeout.timer.function = io_link_timeout_fn; - hrtimer_start(&req->timeout.timer, timespec64_to_ktime(*ts), - *mode); + data->timer.function = io_link_timeout_fn; + hrtimer_start(&data->timer, timespec64_to_ktime(data->ts), + data->mode); } spin_unlock_irq(&ctx->completion_lock);
@@ -2806,22 +2837,7 @@ static void io_queue_linked_timeout(struct io_kiocb *req, struct timespec64 *ts, io_put_req(req); }
-static int io_validate_link_timeout(const struct io_uring_sqe *sqe, - struct timespec64 *ts) -{ - if (sqe->ioprio || sqe->buf_index || sqe->len != 1 || sqe->off) - return -EINVAL; - if (sqe->timeout_flags & ~IORING_TIMEOUT_ABS) - return -EINVAL; - if (get_timespec64(ts, u64_to_user_ptr(sqe->addr))) - return -EFAULT; - - return 0; -} - -static struct io_kiocb *io_prep_linked_timeout(struct io_kiocb *req, - struct timespec64 *ts, - enum hrtimer_mode *mode) +static struct io_kiocb *io_prep_linked_timeout(struct io_kiocb *req) { struct io_kiocb *nxt; int ret; @@ -2833,7 +2849,10 @@ static struct io_kiocb *io_prep_linked_timeout(struct io_kiocb *req, if (!nxt || nxt->submit.sqe->opcode != IORING_OP_LINK_TIMEOUT) return NULL;
- ret = io_validate_link_timeout(nxt->submit.sqe, ts); + ret = io_timeout_setup(nxt); + /* common setup allows offset being set, we don't */ + if (!ret && nxt->submit.sqe->off) + ret = -EINVAL; if (ret) { list_del_init(&nxt->list); io_cqring_add_event(nxt, ret); @@ -2841,24 +2860,16 @@ static struct io_kiocb *io_prep_linked_timeout(struct io_kiocb *req, return ERR_PTR(-ECANCELED); }
- if (nxt->submit.sqe->timeout_flags & IORING_TIMEOUT_ABS) - *mode = HRTIMER_MODE_ABS; - else - *mode = HRTIMER_MODE_REL; - req->flags |= REQ_F_LINK_TIMEOUT; - hrtimer_init(&nxt->timeout.timer, CLOCK_MONOTONIC, *mode); return nxt; }
static void __io_queue_sqe(struct io_kiocb *req) { - enum hrtimer_mode mode; struct io_kiocb *nxt; - struct timespec64 ts; int ret;
- nxt = io_prep_linked_timeout(req, &ts, &mode); + nxt = io_prep_linked_timeout(req); if (IS_ERR(nxt)) { ret = PTR_ERR(nxt); nxt = NULL; @@ -2894,7 +2905,7 @@ static void __io_queue_sqe(struct io_kiocb *req) io_queue_async_work(req);
if (nxt) - io_queue_linked_timeout(nxt, &ts, &mode); + io_queue_linked_timeout(nxt);
return; } @@ -2906,7 +2917,7 @@ static void __io_queue_sqe(struct io_kiocb *req)
if (nxt) { if (!ret) - io_queue_linked_timeout(nxt, &ts, &mode); + io_queue_linked_timeout(nxt); else io_put_req(nxt); }
From: Jens Axboe axboe@kernel.dk
mainline inclusion from mainline-5.5-rc1 commit 94ae5e77a9150a8c6c57432e2db290c6868ddfad category: feature bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27 CVE: NA ---------------------------
We have an issue with timeout links that are deeper in the submit chain, because we only handle them upfront, not from later submissions. Move the prep + issue of the timeout link to the async work prep handler, and do it normally for the non-async queue path. If we validate and prepare the timeout links upfront when we first see them, there's nothing stopping us from supporting any sort of nesting.
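In userspace terms, this is the kind of chain that now works. A hedged liburing sketch (io_uring_prep_link_timeout() and the other helpers below are liburing assumptions, not part of this patch), where every step of the chain carries its own linked timeout:

	#include <liburing.h>
	#include <sys/uio.h>

	/* Sketch: two reads in one chain, each bounded by its own
	 * IORING_OP_LINK_TIMEOUT. NULL checks on the sqes elided. */
	static int queue_bounded_chain(struct io_uring *ring, int fd,
				       struct iovec *iov1, struct iovec *iov2)
	{
		struct __kernel_timespec ts = { .tv_sec = 0, .tv_nsec = 500000000 };
		struct io_uring_sqe *sqe;

		sqe = io_uring_get_sqe(ring);
		io_uring_prep_readv(sqe, fd, iov1, 1, 0);
		sqe->flags |= IOSQE_IO_LINK;	/* timeout below bounds this read */

		sqe = io_uring_get_sqe(ring);
		io_uring_prep_link_timeout(sqe, &ts, 0);
		sqe->flags |= IOSQE_IO_LINK;	/* chain continues past the timeout */

		sqe = io_uring_get_sqe(ring);
		io_uring_prep_readv(sqe, fd, iov2, 1, 4096);
		sqe->flags |= IOSQE_IO_LINK;

		sqe = io_uring_get_sqe(ring);
		io_uring_prep_link_timeout(sqe, &ts, 0);	/* bounds the 2nd read */

		return io_uring_submit(ring);
	}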
Fixes: 2665abfd757f ("io_uring: add support for linked SQE timeouts") Reported-by: Pavel Begunkov asml.silence@gmail.com Signed-off-by: Jens Axboe axboe@kernel.dk Signed-off-by: Zhihao Cheng chengzhihao1@huawei.com Signed-off-by: yangerkun yangerkun@huawei.com Reviewed-by: zhangyi (F) yi.zhang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- fs/io_uring.c | 102 ++++++++++++++++++++++++++++++-------------------- 1 file changed, 61 insertions(+), 41 deletions(-)
diff --git a/fs/io_uring.c b/fs/io_uring.c index 2eb1b9cec145..9cdced780c9f 100644 --- a/fs/io_uring.c +++ b/fs/io_uring.c @@ -353,6 +353,7 @@ struct io_kiocb { #define REQ_F_TIMEOUT_NOSEQ 8192 /* no timeout sequence */ #define REQ_F_INFLIGHT 16384 /* on inflight list */ #define REQ_F_COMP_LOCKED 32768 /* completion under lock */ +#define REQ_F_FREE_SQE 65536 /* free sqe if not async queued */ u64 user_data; u32 result; u32 sequence; @@ -391,6 +392,8 @@ static void __io_free_req(struct io_kiocb *req); static void io_put_req(struct io_kiocb *req); static void io_double_put_req(struct io_kiocb *req); static void __io_double_put_req(struct io_kiocb *req); +static struct io_kiocb *io_prep_linked_timeout(struct io_kiocb *req); +static void io_queue_linked_timeout(struct io_kiocb *req);
static struct kmem_cache *req_cachep;
@@ -528,7 +531,8 @@ static inline bool io_sqe_needs_user(const struct io_uring_sqe *sqe) opcode == IORING_OP_WRITE_FIXED); }
-static inline bool io_prep_async_work(struct io_kiocb *req) +static inline bool io_prep_async_work(struct io_kiocb *req, + struct io_kiocb **link) { bool do_hashed = false;
@@ -557,13 +561,17 @@ static inline bool io_prep_async_work(struct io_kiocb *req) req->work.flags |= IO_WQ_WORK_NEEDS_USER; }
+ *link = io_prep_linked_timeout(req); return do_hashed; }
static inline void io_queue_async_work(struct io_kiocb *req) { - bool do_hashed = io_prep_async_work(req); struct io_ring_ctx *ctx = req->ctx; + struct io_kiocb *link; + bool do_hashed; + + do_hashed = io_prep_async_work(req, &link);
trace_io_uring_queue_async_work(ctx, do_hashed, req, &req->work, req->flags); @@ -573,6 +581,9 @@ static inline void io_queue_async_work(struct io_kiocb *req) io_wq_enqueue_hashed(ctx->io_wq, &req->work, file_inode(req->file)); } + + if (link) + io_queue_linked_timeout(link); }
static void io_kill_timeout(struct io_kiocb *req) @@ -874,6 +885,15 @@ static void io_req_link_next(struct io_kiocb *req, struct io_kiocb **nxtptr) nxt = list_first_entry_or_null(&req->link_list, struct io_kiocb, list); while (nxt) { list_del_init(&nxt->list); + + if ((req->flags & REQ_F_LINK_TIMEOUT) && + (nxt->flags & REQ_F_TIMEOUT)) { + wake_ev |= io_link_cancel_timeout(nxt); + nxt = list_first_entry_or_null(&req->link_list, + struct io_kiocb, list); + req->flags &= ~REQ_F_LINK_TIMEOUT; + continue; + } if (!list_empty(&req->link_list)) { INIT_LIST_HEAD(&nxt->link_list); list_splice(&req->link_list, &nxt->link_list); @@ -884,19 +904,13 @@ static void io_req_link_next(struct io_kiocb *req, struct io_kiocb **nxtptr) * If we're in async work, we can continue processing the chain * in this context instead of having to queue up new async work. */ - if (req->flags & REQ_F_LINK_TIMEOUT) { - wake_ev = io_link_cancel_timeout(nxt); - - /* we dropped this link, get next */ - nxt = list_first_entry_or_null(&req->link_list, - struct io_kiocb, list); - } else if (nxtptr && io_wq_current_is_worker()) { - *nxtptr = nxt; - break; - } else { - io_queue_async_work(nxt); - break; + if (nxt) { + if (nxtptr && io_wq_current_is_worker()) + *nxtptr = nxt; + else + io_queue_async_work(nxt); } + break; }
if (wake_ev) @@ -915,11 +929,16 @@ static void io_fail_links(struct io_kiocb *req) spin_lock_irqsave(&ctx->completion_lock, flags);
while (!list_empty(&req->link_list)) { + const struct io_uring_sqe *sqe_to_free = NULL; + link = list_first_entry(&req->link_list, struct io_kiocb, list); list_del_init(&link->list);
trace_io_uring_fail_link(req, link);
+ if (link->flags & REQ_F_FREE_SQE) + sqe_to_free = link->submit.sqe; + if ((req->flags & REQ_F_LINK_TIMEOUT) && link->submit.sqe->opcode == IORING_OP_LINK_TIMEOUT) { io_link_cancel_timeout(link); @@ -927,6 +946,7 @@ static void io_fail_links(struct io_kiocb *req) io_cqring_fill_event(link, -ECANCELED); __io_double_put_req(link); } + kfree(sqe_to_free); }
io_commit_cqring(ctx); @@ -2682,8 +2702,12 @@ static void io_wq_submit_work(struct io_wq_work **workptr)
/* if a dependent link is ready, pass it back */ if (!ret && nxt) { - io_prep_async_work(nxt); + struct io_kiocb *link; + + io_prep_async_work(nxt, &link); *workptr = &nxt->work; + if (link) + io_queue_linked_timeout(link); } }
@@ -2818,7 +2842,6 @@ static enum hrtimer_restart io_link_timeout_fn(struct hrtimer *timer)
static void io_queue_linked_timeout(struct io_kiocb *req) { - struct io_timeout_data *data = req->timeout.data; struct io_ring_ctx *ctx = req->ctx;
/* @@ -2827,6 +2850,8 @@ static void io_queue_linked_timeout(struct io_kiocb *req) */ spin_lock_irq(&ctx->completion_lock); if (!list_empty(&req->list)) { + struct io_timeout_data *data = req->timeout.data; + data->timer.function = io_link_timeout_fn; hrtimer_start(&data->timer, timespec64_to_ktime(data->ts), data->mode); @@ -2840,7 +2865,6 @@ static void io_queue_linked_timeout(struct io_kiocb *req) static struct io_kiocb *io_prep_linked_timeout(struct io_kiocb *req) { struct io_kiocb *nxt; - int ret;
if (!(req->flags & REQ_F_LINK)) return NULL; @@ -2849,33 +2873,15 @@ static struct io_kiocb *io_prep_linked_timeout(struct io_kiocb *req) if (!nxt || nxt->submit.sqe->opcode != IORING_OP_LINK_TIMEOUT) return NULL;
- ret = io_timeout_setup(nxt); - /* common setup allows offset being set, we don't */ - if (!ret && nxt->submit.sqe->off) - ret = -EINVAL; - if (ret) { - list_del_init(&nxt->list); - io_cqring_add_event(nxt, ret); - io_double_put_req(nxt); - return ERR_PTR(-ECANCELED); - } - req->flags |= REQ_F_LINK_TIMEOUT; return nxt; }
static void __io_queue_sqe(struct io_kiocb *req) { - struct io_kiocb *nxt; + struct io_kiocb *nxt = io_prep_linked_timeout(req); int ret;
- nxt = io_prep_linked_timeout(req); - if (IS_ERR(nxt)) { - ret = PTR_ERR(nxt); - nxt = NULL; - goto err; - } - ret = __io_submit_sqe(req, NULL, true);
/* @@ -2903,10 +2909,6 @@ static void __io_queue_sqe(struct io_kiocb *req) * submit reference when the iocb is actually submitted. */ io_queue_async_work(req); - - if (nxt) - io_queue_linked_timeout(nxt); - return; } } @@ -2951,6 +2953,10 @@ static void io_queue_link_head(struct io_kiocb *req, struct io_kiocb *shadow) int need_submit = false; struct io_ring_ctx *ctx = req->ctx;
+ if (unlikely(req->flags & REQ_F_FAIL_LINK)) { + ret = -ECANCELED; + goto err; + } if (!shadow) { io_queue_sqe(req); return; @@ -2965,9 +2971,11 @@ static void io_queue_link_head(struct io_kiocb *req, struct io_kiocb *shadow) ret = io_req_defer(req); if (ret) { if (ret != -EIOCBQUEUED) { +err: io_cqring_add_event(req, ret); io_double_put_req(req); - __io_free_req(shadow); + if (shadow) + __io_free_req(shadow); return; } } else { @@ -3024,6 +3032,17 @@ static void io_submit_sqe(struct io_kiocb *req, struct io_submit_state *state, if (*link) { struct io_kiocb *prev = *link;
+ if (READ_ONCE(s->sqe->opcode) == IORING_OP_LINK_TIMEOUT) { + ret = io_timeout_setup(req); + /* common setup allows offset being set, we don't */ + if (!ret && s->sqe->off) + ret = -EINVAL; + if (ret) { + prev->flags |= REQ_F_FAIL_LINK; + goto err_req; + } + } + sqe_copy = kmemdup(s->sqe, sizeof(*sqe_copy), GFP_KERNEL); if (!sqe_copy) { ret = -EAGAIN; @@ -3031,6 +3050,7 @@ static void io_submit_sqe(struct io_kiocb *req, struct io_submit_state *state, }
s->sqe = sqe_copy; + req->flags |= REQ_F_FREE_SQE; trace_io_uring_link(ctx, req, prev); list_add_tail(&req->list, &prev->link_list); } else if (s->sqe->flags & IOSQE_IO_LINK) {
From: Jens Axboe axboe@kernel.dk
mainline inclusion from mainline-5.5-rc1 commit e0e328c4b330712e45ba799dc589bda751323110 category: feature bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27 CVE: NA ---------------------------
With the conversion to io-wq, we no longer use that flag. Kill it.
Fixes: 561fb04a6a22 ("io_uring: replace workqueue usage with io-wq") Signed-off-by: Jens Axboe axboe@kernel.dk Signed-off-by: Zhihao Cheng chengzhihao1@huawei.com Signed-off-by: yangerkun yangerkun@huawei.com Reviewed-by: zhangyi (F) yi.zhang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- fs/io_uring.c | 1 - 1 file changed, 1 deletion(-)
diff --git a/fs/io_uring.c b/fs/io_uring.c index 9cdced780c9f..5e34c660faef 100644 --- a/fs/io_uring.c +++ b/fs/io_uring.c @@ -340,7 +340,6 @@ struct io_kiocb { #define REQ_F_NOWAIT 1 /* must not punt to workers */ #define REQ_F_IOPOLL_COMPLETED 2 /* polled IO has completed */ #define REQ_F_FIXED_FILE 4 /* ctx owns file */ -#define REQ_F_SEQ_PREV 8 /* sequential with previous */ #define REQ_F_IO_DRAIN 16 /* drain existing IO first */ #define REQ_F_IO_DRAINED 32 /* drain done */ #define REQ_F_LINK 64 /* linked sqes */
From: Jens Axboe axboe@kernel.dk
mainline inclusion from mainline-5.5-rc1 commit b0dd8a412699afe3420a08f841333f3474ad45c5 category: feature bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27 CVE: NA ---------------------------
Currently a poll request fills a completion entry of 0, even if it got cancelled. This is odd, and it makes it harder to support with chains. Ensure that it returns -ECANCELED in the completion events if it got cancelled, and furthermore ensure that the linked timeout that triggered it completes with -ETIME if we did indeed trigger the completion through a timeout.
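From the consumer side, the distinction now shows up directly in cqe->res. A hedged liburing sketch of reaping these completions:

	#include <liburing.h>

	static void reap_one(struct io_uring *ring)
	{
		struct io_uring_cqe *cqe;

		if (io_uring_wait_cqe(ring, &cqe))
			return;
		if (cqe->res == -ECANCELED) {
			/* the poll was cancelled, e.g. by its linked timeout */
		} else if (cqe->res == -ETIME) {
			/* the linked timeout itself fired and cancelled the poll */
		}
		io_uring_cqe_seen(ring, cqe);
	}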
Signed-off-by: Jens Axboe axboe@kernel.dk Signed-off-by: Zhihao Cheng chengzhihao1@huawei.com Signed-off-by: yangerkun yangerkun@huawei.com Reviewed-by: zhangyi (F) yi.zhang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- fs/io_uring.c | 33 ++++++++++++++++++++++----------- 1 file changed, 22 insertions(+), 11 deletions(-)
diff --git a/fs/io_uring.c b/fs/io_uring.c index 5e34c660faef..f892ef9b848f 100644 --- a/fs/io_uring.c +++ b/fs/io_uring.c @@ -2065,12 +2065,15 @@ static int io_poll_remove(struct io_kiocb *req, const struct io_uring_sqe *sqe) return 0; }
-static void io_poll_complete(struct io_kiocb *req, __poll_t mask) +static void io_poll_complete(struct io_kiocb *req, __poll_t mask, int error) { struct io_ring_ctx *ctx = req->ctx;
req->poll.done = true; - io_cqring_fill_event(req, mangle_poll(mask)); + if (error) + io_cqring_fill_event(req, error); + else + io_cqring_fill_event(req, mangle_poll(mask)); io_commit_cqring(ctx); }
@@ -2083,11 +2086,16 @@ static void io_poll_complete_work(struct io_wq_work **workptr) struct io_ring_ctx *ctx = req->ctx; struct io_kiocb *nxt = NULL; __poll_t mask = 0; + int ret = 0;
- if (work->flags & IO_WQ_WORK_CANCEL) + if (work->flags & IO_WQ_WORK_CANCEL) { WRITE_ONCE(poll->canceled, true); + ret = -ECANCELED; + } else if (READ_ONCE(poll->canceled)) { + ret = -ECANCELED; + }
- if (!READ_ONCE(poll->canceled)) + if (ret != -ECANCELED) mask = vfs_poll(poll->file, &pt) & poll->events;
/* @@ -2098,13 +2106,13 @@ static void io_poll_complete_work(struct io_wq_work **workptr) * avoid further branches in the fast path. */ spin_lock_irq(&ctx->completion_lock); - if (!mask && !READ_ONCE(poll->canceled)) { + if (!mask && ret != -ECANCELED) { add_wait_queue(poll->head, &poll->wait); spin_unlock_irq(&ctx->completion_lock); return; } io_poll_remove_req(req); - io_poll_complete(req, mask); + io_poll_complete(req, mask, ret); spin_unlock_irq(&ctx->completion_lock);
io_cqring_ev_posted(ctx); @@ -2138,7 +2146,7 @@ static int io_poll_wake(struct wait_queue_entry *wait, unsigned mode, int sync, */ if (mask && spin_trylock_irqsave(&ctx->completion_lock, flags)) { io_poll_remove_req(req); - io_poll_complete(req, mask); + io_poll_complete(req, mask, 0); req->flags |= REQ_F_COMP_LOCKED; io_put_req(req); spin_unlock_irqrestore(&ctx->completion_lock, flags); @@ -2250,7 +2258,7 @@ static int io_poll_add(struct io_kiocb *req, const struct io_uring_sqe *sqe, } if (mask) { /* no async, we'd stolen it */ ipt.error = 0; - io_poll_complete(req, mask); + io_poll_complete(req, mask, 0); } spin_unlock_irq(&ctx->completion_lock);
@@ -2502,7 +2510,7 @@ static int io_async_cancel_one(struct io_ring_ctx *ctx, void *sqe_addr)
static void io_async_find_and_cancel(struct io_ring_ctx *ctx, struct io_kiocb *req, __u64 sqe_addr, - struct io_kiocb **nxt) + struct io_kiocb **nxt, int success_ret) { unsigned long flags; int ret; @@ -2519,6 +2527,8 @@ static void io_async_find_and_cancel(struct io_ring_ctx *ctx, goto done; ret = io_poll_cancel(ctx, sqe_addr); done: + if (!ret) + ret = success_ret; io_cqring_fill_event(req, ret); io_commit_cqring(ctx); spin_unlock_irqrestore(&ctx->completion_lock, flags); @@ -2540,7 +2550,7 @@ static int io_async_cancel(struct io_kiocb *req, const struct io_uring_sqe *sqe, sqe->cancel_flags) return -EINVAL;
- io_async_find_and_cancel(ctx, req, READ_ONCE(sqe->addr), nxt); + io_async_find_and_cancel(ctx, req, READ_ONCE(sqe->addr), nxt, 0); return 0; }
@@ -2830,7 +2840,8 @@ static enum hrtimer_restart io_link_timeout_fn(struct hrtimer *timer) spin_unlock_irqrestore(&ctx->completion_lock, flags);
if (prev) { - io_async_find_and_cancel(ctx, req, prev->user_data, NULL); + io_async_find_and_cancel(ctx, req, prev->user_data, NULL, + -ETIME); io_put_req(prev); } else { io_cqring_add_event(req, -ETIME);
From: Jens Axboe axboe@kernel.dk
mainline inclusion from mainline-5.5-rc1 commit fba38c272a0385148935d6443cb9dc68cf1f37a7 category: feature bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27 CVE: NA ---------------------------
We currently don't explicitly break links if a request is cancelled, but we should. Add explicit link breakage for all types of request cancellations that we support.
Signed-off-by: Jens Axboe axboe@kernel.dk Signed-off-by: Zhihao Cheng chengzhihao1@huawei.com Signed-off-by: yangerkun yangerkun@huawei.com Reviewed-by: zhangyi (F) yi.zhang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- fs/io_uring.c | 6 ++++++ 1 file changed, 6 insertions(+)
diff --git a/fs/io_uring.c b/fs/io_uring.c index f892ef9b848f..b18844ca8484 100644 --- a/fs/io_uring.c +++ b/fs/io_uring.c @@ -2117,6 +2117,8 @@ static void io_poll_complete_work(struct io_wq_work **workptr)
io_cqring_ev_posted(ctx);
+ if (ret < 0 && req->flags & REQ_F_LINK) + req->flags |= REQ_F_FAIL_LINK; io_put_req_find_next(req, &nxt); if (nxt) *workptr = &nxt->work; @@ -2330,6 +2332,8 @@ static int io_timeout_cancel(struct io_ring_ctx *ctx, __u64 user_data) if (ret == -1) return -EALREADY;
+ if (req->flags & REQ_F_LINK) + req->flags |= REQ_F_FAIL_LINK; io_cqring_fill_event(req, -ECANCELED); io_put_req(req); return 0; @@ -2840,6 +2844,8 @@ static enum hrtimer_restart io_link_timeout_fn(struct hrtimer *timer) spin_unlock_irqrestore(&ctx->completion_lock, flags);
if (prev) { + if (prev->flags & REQ_F_LINK) + prev->flags |= REQ_F_FAIL_LINK; io_async_find_and_cancel(ctx, req, prev->user_data, NULL, -ETIME); io_put_req(prev);
From: Jens Axboe axboe@kernel.dk
mainline inclusion from mainline-5.5-rc1 commit b60fda6000a99a7ccac36005ab78b14b47c06de3 category: feature bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27 CVE: NA ---------------------------
We currently have a race where, if setup is really slow, we can be calling io_wq_destroy() before we're done setting up. This will cause the caller to get stuck waiting for the manager to set things up, but the manager has already exited.
Fix this by doing a sync setup of the manager. This also fixes the case where if we failed creating workers, we'd also get stuck.
In practice this race window was really small, as we already wait for the manager to start. Hence someone would have to call io_wq_destroy() after the task has started, but before it started the first loop. The reported test case forked tons of these, which is why it became an issue.
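The handshake pattern itself, as an illustrative pthread sketch (not the kernel's kthread/completion API): the creator doesn't return until the manager thread has finished setup or reported failure, so destroy can never run against a half-set-up manager.

	#include <pthread.h>
	#include <stdbool.h>

	static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
	static pthread_cond_t done = PTHREAD_COND_INITIALIZER;
	static bool setup_done, setup_failed;

	static void *manager(void *arg)
	{
		bool failed = false;	/* ... set on worker creation failure ... */

		pthread_mutex_lock(&lock);
		setup_done = true;
		setup_failed = failed;
		pthread_cond_signal(&done);
		pthread_mutex_unlock(&lock);
		/* main loop runs only after a successful handshake */
		return NULL;
	}

	static int start_manager(void)
	{
		pthread_t tid;

		if (pthread_create(&tid, NULL, manager, NULL))
			return -1;
		pthread_mutex_lock(&lock);
		while (!setup_done)
			pthread_cond_wait(&done, &lock);
		pthread_mutex_unlock(&lock);
		return setup_failed ? -1 : 0;
	}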
Reported-by: syzbot+0f1cc17f85154f400465@syzkaller.appspotmail.com Fixes: 771b53d033e8 ("io-wq: small threadpool implementation for io_uring") Signed-off-by: Jens Axboe axboe@kernel.dk Signed-off-by: Zhihao Cheng chengzhihao1@huawei.com Signed-off-by: yangerkun yangerkun@huawei.com Reviewed-by: zhangyi (F) yi.zhang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- fs/io-wq.c | 50 +++++++++++++++++++++++++++++++++++--------------- 1 file changed, 35 insertions(+), 15 deletions(-)
diff --git a/fs/io-wq.c b/fs/io-wq.c index b7eae2e866a3..f9b5a1f94aa3 100644 --- a/fs/io-wq.c +++ b/fs/io-wq.c @@ -34,6 +34,7 @@ enum { enum { IO_WQ_BIT_EXIT = 0, /* wq exiting */ IO_WQ_BIT_CANCEL = 1, /* cancel work on list */ + IO_WQ_BIT_ERROR = 2, /* error on setup */ };
enum { @@ -563,14 +564,14 @@ void io_wq_worker_sleeping(struct task_struct *tsk) spin_unlock_irq(&wqe->lock); }
-static void create_io_worker(struct io_wq *wq, struct io_wqe *wqe, int index) +static bool create_io_worker(struct io_wq *wq, struct io_wqe *wqe, int index) { struct io_wqe_acct *acct =&wqe->acct[index]; struct io_worker *worker;
worker = kcalloc_node(1, sizeof(*worker), GFP_KERNEL, wqe->node); if (!worker) - return; + return false;
refcount_set(&worker->ref, 1); worker->nulls_node.pprev = NULL; @@ -582,7 +583,7 @@ static void create_io_worker(struct io_wq *wq, struct io_wqe *wqe, int index) "io_wqe_worker-%d/%d", index, wqe->node); if (IS_ERR(worker->task)) { kfree(worker); - return; + return false; }
spin_lock_irq(&wqe->lock); @@ -600,6 +601,7 @@ static void create_io_worker(struct io_wq *wq, struct io_wqe *wqe, int index) atomic_inc(&wq->user->processes);
wake_up_process(worker->task); + return true; }
static inline bool io_wqe_need_worker(struct io_wqe *wqe, int index) @@ -607,9 +609,6 @@ static inline bool io_wqe_need_worker(struct io_wqe *wqe, int index) { struct io_wqe_acct *acct = &wqe->acct[index];
- /* always ensure we have one bounded worker */ - if (index == IO_WQ_ACCT_BOUND && !acct->nr_workers) - return true; /* if we have available workers or no work, no need */ if (!hlist_nulls_empty(&wqe->free_list) || !io_wqe_run_queue(wqe)) return false; @@ -622,10 +621,19 @@ static inline bool io_wqe_need_worker(struct io_wqe *wqe, int index) static int io_wq_manager(void *data) { struct io_wq *wq = data; + int i;
- while (!kthread_should_stop()) { - int i; + /* create fixed workers */ + refcount_set(&wq->refs, wq->nr_wqes); + for (i = 0; i < wq->nr_wqes; i++) { + if (create_io_worker(wq, wq->wqes[i], IO_WQ_ACCT_BOUND)) + continue; + goto err; + }
+ complete(&wq->done); + + while (!kthread_should_stop()) { for (i = 0; i < wq->nr_wqes; i++) { struct io_wqe *wqe = wq->wqes[i]; bool fork_worker[2] = { false, false }; @@ -646,6 +654,12 @@ static int io_wq_manager(void *data) }
return 0; +err: + set_bit(IO_WQ_BIT_ERROR, &wq->state); + set_bit(IO_WQ_BIT_EXIT, &wq->state); + if (refcount_sub_and_test(wq->nr_wqes - i, &wq->refs)) + complete(&wq->done); + return 0; }
static bool io_wq_can_queue(struct io_wqe *wqe, struct io_wqe_acct *acct, @@ -983,7 +997,6 @@ struct io_wq *io_wq_create(unsigned bounded, struct mm_struct *mm, wq->user = user;
i = 0; - refcount_set(&wq->refs, wq->nr_wqes); for_each_online_node(node) { struct io_wqe *wqe;
@@ -1021,14 +1034,22 @@ struct io_wq *io_wq_create(unsigned bounded, struct mm_struct *mm, wq->manager = kthread_create(io_wq_manager, wq, "io_wq_manager"); if (!IS_ERR(wq->manager)) { wake_up_process(wq->manager); + wait_for_completion(&wq->done); + if (test_bit(IO_WQ_BIT_ERROR, &wq->state)) { + ret = -ENOMEM; + goto err; + } + reinit_completion(&wq->done); return wq; }
ret = PTR_ERR(wq->manager); - wq->manager = NULL; -err: complete(&wq->done); - io_wq_destroy(wq); +err: + for (i = 0; i < wq->nr_wqes; i++) + kfree(wq->wqes[i]); + kfree(wq->wqes); + kfree(wq); return ERR_PTR(ret); }
@@ -1042,10 +1063,9 @@ void io_wq_destroy(struct io_wq *wq) { int i;
- if (wq->manager) { - set_bit(IO_WQ_BIT_EXIT, &wq->state); + set_bit(IO_WQ_BIT_EXIT, &wq->state); + if (wq->manager) kthread_stop(wq->manager); - }
rcu_read_lock(); for (i = 0; i < wq->nr_wqes; i++) {
From: Dan Carpenter dan.carpenter@oracle.com
mainline inclusion from mainline-5.5-rc1 commit b2e9c7d64b7ecacc1d0f15a6af88a73cab7d8db9 category: feature bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27 CVE: NA ---------------------------
These lines are indented an extra space character.
Signed-off-by: Dan Carpenter dan.carpenter@oracle.com Signed-off-by: Jens Axboe axboe@kernel.dk Signed-off-by: Zhihao Cheng chengzhihao1@huawei.com Signed-off-by: yangerkun yangerkun@huawei.com Reviewed-by: zhangyi (F) yi.zhang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- fs/io-wq.c | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-)
diff --git a/fs/io-wq.c b/fs/io-wq.c index f9b5a1f94aa3..fc83200e04ca 100644 --- a/fs/io-wq.c +++ b/fs/io-wq.c @@ -329,9 +329,9 @@ static void __io_worker_busy(struct io_wqe *wqe, struct io_worker *worker, * If worker is moving from bound to unbound (or vice versa), then * ensure we update the running accounting. */ - worker_bound = (worker->flags & IO_WORKER_F_BOUND) != 0; - work_bound = (work->flags & IO_WQ_WORK_UNBOUND) == 0; - if (worker_bound != work_bound) { + worker_bound = (worker->flags & IO_WORKER_F_BOUND) != 0; + work_bound = (work->flags & IO_WQ_WORK_UNBOUND) == 0; + if (worker_bound != work_bound) { io_wqe_dec_running(wqe, worker); if (work_bound) { worker->flags |= IO_WORKER_F_BOUND;
From: Pavel Begunkov asml.silence@gmail.com
mainline inclusion from mainline-5.5-rc1 commit d3b35796b1e3f118017491d621f624e0de7ff9fb category: feature bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27 CVE: NA ---------------------------
If io_req_defer() failed, it needs to cancel a dependent link.
Signed-off-by: Pavel Begunkov asml.silence@gmail.com Signed-off-by: Jens Axboe axboe@kernel.dk Signed-off-by: Zhihao Cheng chengzhihao1@huawei.com Signed-off-by: yangerkun yangerkun@huawei.com Reviewed-by: zhangyi (F) yi.zhang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- fs/io_uring.c | 4 ++++ 1 file changed, 4 insertions(+)
diff --git a/fs/io_uring.c b/fs/io_uring.c index b18844ca8484..3e223d0cd26b 100644 --- a/fs/io_uring.c +++ b/fs/io_uring.c @@ -2957,6 +2957,8 @@ static void io_queue_sqe(struct io_kiocb *req) if (ret) { if (ret != -EIOCBQUEUED) { io_cqring_add_event(req, ret); + if (req->flags & REQ_F_LINK) + req->flags |= REQ_F_FAIL_LINK; io_double_put_req(req); } } else @@ -2989,6 +2991,8 @@ static void io_queue_link_head(struct io_kiocb *req, struct io_kiocb *shadow) if (ret != -EIOCBQUEUED) { err: io_cqring_add_event(req, ret); + if (req->flags & REQ_F_LINK) + req->flags |= REQ_F_FAIL_LINK; io_double_put_req(req); if (shadow) __io_free_req(shadow);
From: Pavel Begunkov asml.silence@gmail.com
mainline inclusion from mainline-5.5-rc1 commit d732447fed7d6b4c22907f630cd25d574bae5276 category: feature bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27 CVE: NA ---------------------------
__io_submit_sqe() is issuing requests, so call it as such. Moreover, it ends by calling io_iopoll_req_issued().
Rename it and make terminology clearer.
Signed-off-by: Pavel Begunkov asml.silence@gmail.com Signed-off-by: Jens Axboe axboe@kernel.dk Signed-off-by: Zhihao Cheng chengzhihao1@huawei.com Signed-off-by: yangerkun yangerkun@huawei.com Reviewed-by: zhangyi (F) yi.zhang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- fs/io_uring.c | 8 ++++---- 1 file changed, 4 insertions(+), 4 deletions(-)
diff --git a/fs/io_uring.c b/fs/io_uring.c index a094787d9bab..bfacf7a8954b 100644 --- a/fs/io_uring.c +++ b/fs/io_uring.c @@ -2593,8 +2593,8 @@ static int io_req_defer(struct io_kiocb *req) return -EIOCBQUEUED; }
-static int __io_submit_sqe(struct io_kiocb *req, struct io_kiocb **nxt, - bool force_nonblock) +static int io_issue_sqe(struct io_kiocb *req, struct io_kiocb **nxt, + bool force_nonblock) { int ret, opcode; struct sqe_submit *s = &req->submit; @@ -2701,7 +2701,7 @@ static void io_wq_submit_work(struct io_wq_work **workptr) s->has_user = (work->flags & IO_WQ_WORK_HAS_MM) != 0; s->in_async = true; do { - ret = __io_submit_sqe(req, &nxt, false); + ret = io_issue_sqe(req, &nxt, false); /* * We can get EAGAIN for polled IO even though we're * forcing a sync submission from here, since we can't @@ -2912,7 +2912,7 @@ static void __io_queue_sqe(struct io_kiocb *req) struct io_kiocb *nxt = io_prep_linked_timeout(req); int ret;
- ret = __io_submit_sqe(req, NULL, true); + ret = io_issue_sqe(req, NULL, true);
/* * We async punt it if the file wasn't marked NOWAIT, or if the file
From: Pavel Begunkov asml.silence@gmail.com
mainline inclusion from mainline-5.5-rc1 commit 9835d6fafba58e6d9386a6d5af800789bdb52e5b category: feature bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27 CVE: NA ---------------------------
The number of SQEs to submit is specified by the user, so io_get_sqring() succeeds in most cases. Hint the compiler about that.
Checking the asm generated by gcc 9.2.0 for x64, there is one branch misprediction.
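For reference, likely()/unlikely() expand to __builtin_expect() (the definitions live in include/linux/compiler.h). A minimal, self-contained userspace sketch of the same hinting pattern:

    #include <stdio.h>

    /* Same definitions the kernel uses in include/linux/compiler.h */
    #define likely(x)   __builtin_expect(!!(x), 1)
    #define unlikely(x) __builtin_expect(!!(x), 0)

    static int ring_has_entry(unsigned head, unsigned tail)
    {
        if (unlikely(head == tail))  /* empty ring is the rare case */
            return 0;
        return 1;                    /* hot path stays on the fall-through */
    }

    int main(void)
    {
        printf("%d\n", ring_has_entry(0, 1));
        return 0;
    }

The compiler keeps the expected case on the straight-line fall-through path, which is what removes the misprediction noted above.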
Signed-off-by: Pavel Begunkov asml.silence@gmail.com Signed-off-by: Jens Axboe axboe@kernel.dk Signed-off-by: Zhihao Cheng chengzhihao1@huawei.com Signed-off-by: yangerkun yangerkun@huawei.com Reviewed-by: zhangyi (F) yi.zhang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- fs/io_uring.c | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-)
diff --git a/fs/io_uring.c b/fs/io_uring.c index bfacf7a8954b..d7ea7e0ee473 100644 --- a/fs/io_uring.c +++ b/fs/io_uring.c @@ -3128,11 +3128,11 @@ static bool io_get_sqring(struct io_ring_ctx *ctx, struct sqe_submit *s) */ head = ctx->cached_sq_head; /* make sure SQ entry isn't read before tail */ - if (head == smp_load_acquire(&rings->sq.tail)) + if (unlikely(head == smp_load_acquire(&rings->sq.tail))) return false;
head = READ_ONCE(sq_array[head & ctx->sq_mask]); - if (head < ctx->sq_entries) { + if (likely(head < ctx->sq_entries)) { s->ring_file = NULL; s->sqe = &ctx->sq_sqes[head]; s->sequence = ctx->cached_sq_head;
From: Pavel Begunkov asml.silence@gmail.com
mainline inclusion from mainline-5.5-rc1 commit 70cf9f3270a5c5148e93a526dc1e51965259e70c category: feature bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27 CVE: NA ---------------------------
There is only one one-liner user of io_free_req_find_next(). Inline it.
Signed-off-by: Pavel Begunkov asml.silence@gmail.com Signed-off-by: Jens Axboe axboe@kernel.dk Signed-off-by: Zhihao Cheng chengzhihao1@huawei.com Signed-off-by: yangerkun yangerkun@huawei.com Reviewed-by: zhangyi (F) yi.zhang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- fs/io_uring.c | 9 ++------- 1 file changed, 2 insertions(+), 7 deletions(-)
diff --git a/fs/io_uring.c b/fs/io_uring.c index d7ea7e0ee473..192a5903df34 100644 --- a/fs/io_uring.c +++ b/fs/io_uring.c @@ -983,15 +983,10 @@ static void io_req_find_next(struct io_kiocb *req, struct io_kiocb **nxt) } }
-static void io_free_req_find_next(struct io_kiocb *req, struct io_kiocb **nxt) -{ - io_req_find_next(req, nxt); - __io_free_req(req); -} - static void io_free_req(struct io_kiocb *req) { - io_free_req_find_next(req, NULL); + io_req_find_next(req, NULL); + __io_free_req(req); }
/*
From: Pavel Begunkov asml.silence@gmail.com
mainline inclusion from mainline-5.5-rc1 commit 944e58bfeda0e9b97cd611adafc823c78e0bc464 category: feature bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27 CVE: NA ---------------------------
Make io_req_find_next() and io_req_link_next() accept only a non-null nxt, and handle it in the callers.
Signed-off-by: Pavel Begunkov asml.silence@gmail.com Signed-off-by: Jens Axboe axboe@kernel.dk Signed-off-by: Zhihao Cheng chengzhihao1@huawei.com Signed-off-by: yangerkun yangerkun@huawei.com Reviewed-by: zhangyi (F) yi.zhang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- fs/io_uring.c | 9 +++++++-- 1 file changed, 7 insertions(+), 2 deletions(-)
diff --git a/fs/io_uring.c b/fs/io_uring.c index 192a5903df34..c9e15ac37178 100644 --- a/fs/io_uring.c +++ b/fs/io_uring.c @@ -907,7 +907,7 @@ static void io_req_link_next(struct io_kiocb *req, struct io_kiocb **nxtptr) * in this context instead of having to queue up new async work. */ if (nxt) { - if (nxtptr && io_wq_current_is_worker()) + if (io_wq_current_is_worker()) *nxtptr = nxt; else io_queue_async_work(nxt); @@ -985,8 +985,13 @@ static void io_req_find_next(struct io_kiocb *req, struct io_kiocb **nxt)
static void io_free_req(struct io_kiocb *req) { - io_req_find_next(req, NULL); + struct io_kiocb *nxt = NULL; + + io_req_find_next(req, &nxt); __io_free_req(req); + + if (nxt) + io_queue_async_work(nxt); }
/*
From: Pavel Begunkov asml.silence@gmail.com
mainline inclusion from mainline-5.5-rc1 commit b18fdf71e01fba29a804d63f8c1e2ed61011170d category: feature bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27 CVE: NA ---------------------------
"if (nxt)" is always true, as it was checked in the while's condition. io_wq_current_is_worker() is unnecessary, as non-async callers don't pass nxt, so io_queue_async_work() will be called for them anyway.
Signed-off-by: Pavel Begunkov asml.silence@gmail.com Signed-off-by: Jens Axboe axboe@kernel.dk Signed-off-by: Zhihao Cheng chengzhihao1@huawei.com Signed-off-by: yangerkun yangerkun@huawei.com Reviewed-by: zhangyi (F) yi.zhang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- fs/io_uring.c | 11 +---------- 1 file changed, 1 insertion(+), 10 deletions(-)
diff --git a/fs/io_uring.c b/fs/io_uring.c index c9e15ac37178..cd04220944b3 100644 --- a/fs/io_uring.c +++ b/fs/io_uring.c @@ -902,16 +902,7 @@ static void io_req_link_next(struct io_kiocb *req, struct io_kiocb **nxtptr) nxt->flags |= REQ_F_LINK; }
- /* - * If we're in async work, we can continue processing the chain - * in this context instead of having to queue up new async work. - */ - if (nxt) { - if (io_wq_current_is_worker()) - *nxtptr = nxt; - else - io_queue_async_work(nxt); - } + *nxtptr = nxt; break; }
From: Pavel Begunkov asml.silence@gmail.com
mainline inclusion from mainline-5.5-rc1 commit f9bd67f69af56d712bfd498f5ad9cf7bb177d600 category: feature bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27 CVE: NA ---------------------------
Pass only a non-null @nxt to io_issue_sqe() and handle it on the caller's side, propagating it as needed.
- kiocb_done() is only called from io_read() and io_write(), which are only called from io_issue_sqe(), so @nxt is guaranteed to be non-NULL there
- io_put_req_find_next() is called either with an explicitly non-null local nxt, or from one of the functions in the io_issue_sqe() switch (or their callees).
Signed-off-by: Pavel Begunkov asml.silence@gmail.com Signed-off-by: Jens Axboe axboe@kernel.dk Signed-off-by: Zhihao Cheng chengzhihao1@huawei.com Signed-off-by: yangerkun yangerkun@huawei.com Reviewed-by: zhangyi (F) yi.zhang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- fs/io_uring.c | 28 ++++++++++++---------------- 1 file changed, 12 insertions(+), 16 deletions(-)
diff --git a/fs/io_uring.c b/fs/io_uring.c index cd04220944b3..f73f2d9a5c56 100644 --- a/fs/io_uring.c +++ b/fs/io_uring.c @@ -989,21 +989,13 @@ static void io_free_req(struct io_kiocb *req) * Drop reference to request, return next in chain (if there is one) if this * was the last reference to this request. */ +__attribute__((nonnull)) static void io_put_req_find_next(struct io_kiocb *req, struct io_kiocb **nxtptr) { - struct io_kiocb *nxt = NULL; - - io_req_find_next(req, &nxt); + io_req_find_next(req, nxtptr);
if (refcount_dec_and_test(&req->refs)) __io_free_req(req); - - if (nxt) { - if (nxtptr) - *nxtptr = nxt; - else - io_queue_async_work(nxt); - } }
static void io_put_req(struct io_kiocb *req) @@ -1487,7 +1479,7 @@ static inline void io_rw_done(struct kiocb *kiocb, ssize_t ret) static void kiocb_done(struct kiocb *kiocb, ssize_t ret, struct io_kiocb **nxt, bool in_async) { - if (in_async && ret >= 0 && nxt && kiocb->ki_complete == io_complete_rw) + if (in_async && ret >= 0 && kiocb->ki_complete == io_complete_rw) *nxt = __io_complete_rw(kiocb, ret); else io_rw_done(kiocb, ret); @@ -2584,6 +2576,7 @@ static int io_req_defer(struct io_kiocb *req) return -EIOCBQUEUED; }
+__attribute__((nonnull)) static int io_issue_sqe(struct io_kiocb *req, struct io_kiocb **nxt, bool force_nonblock) { @@ -2900,10 +2893,13 @@ static struct io_kiocb *io_prep_linked_timeout(struct io_kiocb *req)
static void __io_queue_sqe(struct io_kiocb *req) { - struct io_kiocb *nxt = io_prep_linked_timeout(req); + struct io_kiocb *linked_timeout = io_prep_linked_timeout(req); + struct io_kiocb *nxt = NULL; int ret;
- ret = io_issue_sqe(req, NULL, true); + ret = io_issue_sqe(req, &nxt, true); + if (nxt) + io_queue_async_work(nxt);
/* * We async punt it if the file wasn't marked NOWAIT, or if the file @@ -2939,11 +2935,11 @@ static void __io_queue_sqe(struct io_kiocb *req) /* drop submission reference */ io_put_req(req);
- if (nxt) { + if (linked_timeout) { if (!ret) - io_queue_linked_timeout(nxt); + io_queue_linked_timeout(linked_timeout); else - io_put_req(nxt); + io_put_req(linked_timeout); }
/* and drop final reference, if we failed */
From: Jens Axboe axboe@kernel.dk
mainline inclusion from mainline-5.5-rc1 commit c4a2ed72c9a61594b6afc23e1fbc78878d32b5a3 category: feature bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27 CVE: NA ---------------------------
We return -EBUSY on submit when we have a CQ ring overflow backlog, but that can be a bit problematic if the application is using pure userspace poll of the CQ ring. For that case, if the ring briefly overflowed and we have pending entries in the backlog, the submit flushes the backlog successfully but still returns -EBUSY. If we're able to fully flush the CQ ring backlog, let the submission proceed.
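From the application side, -EBUSY now only means the backlog could not be fully flushed. A minimal sketch of a submit loop that reaps completions and retries, assuming liburing (submit_retry is an illustrative name, not a liburing API):

    #include <liburing.h>

    static int submit_retry(struct io_uring *ring)
    {
        int ret;

        while ((ret = io_uring_submit(ring)) == -EBUSY) {
            struct io_uring_cqe *cqe;

            /* reap one completion so the kernel can flush its backlog */
            if (io_uring_wait_cqe(ring, &cqe) == 0)
                io_uring_cqe_seen(ring, cqe);
        }
        return ret;
    }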
Reported-by: Dan Melnic dmm@fb.com Signed-off-by: Jens Axboe axboe@kernel.dk Signed-off-by: Zhihao Cheng chengzhihao1@huawei.com Signed-off-by: yangerkun yangerkun@huawei.com Reviewed-by: zhangyi (F) yi.zhang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- fs/io_uring.c | 16 ++++++++++------ 1 file changed, 10 insertions(+), 6 deletions(-)
diff --git a/fs/io_uring.c b/fs/io_uring.c index f73f2d9a5c56..33d04821c1db 100644 --- a/fs/io_uring.c +++ b/fs/io_uring.c @@ -653,7 +653,8 @@ static void io_cqring_ev_posted(struct io_ring_ctx *ctx) eventfd_signal(ctx->cq_ev_fd, 1); }
-static void io_cqring_overflow_flush(struct io_ring_ctx *ctx, bool force) +/* Returns true if there are no backlogged entries after the flush */ +static bool io_cqring_overflow_flush(struct io_ring_ctx *ctx, bool force) { struct io_rings *rings = ctx->rings; struct io_uring_cqe *cqe; @@ -663,10 +664,10 @@ static void io_cqring_overflow_flush(struct io_ring_ctx *ctx, bool force)
if (!force) { if (list_empty_careful(&ctx->cq_overflow_list)) - return; + return true; if ((ctx->cached_cq_tail - READ_ONCE(rings->cq.head) == rings->cq_ring_entries)) - return; + return false; }
spin_lock_irqsave(&ctx->completion_lock, flags); @@ -675,6 +676,7 @@ static void io_cqring_overflow_flush(struct io_ring_ctx *ctx, bool force) if (force) ctx->cq_overflow_flushed = true;
+ cqe = NULL; while (!list_empty(&ctx->cq_overflow_list)) { cqe = io_get_cqring(ctx); if (!cqe && !force) @@ -702,6 +704,8 @@ static void io_cqring_overflow_flush(struct io_ring_ctx *ctx, bool force) list_del(&req->list); io_put_req(req); } + + return cqe != NULL; }
static void io_cqring_fill_event(struct io_kiocb *req, long res) @@ -3143,10 +3147,10 @@ static int io_submit_sqes(struct io_ring_ctx *ctx, unsigned int nr, int i, submitted = 0; bool mm_fault = false;
- if (!list_empty(&ctx->cq_overflow_list)) { - io_cqring_overflow_flush(ctx, false); + /* if we have a backlog and couldn't flush it all, return BUSY */ + if (!list_empty(&ctx->cq_overflow_list) && + !io_cqring_overflow_flush(ctx, false)) return -EBUSY; - }
if (nr > IO_PLUG_THRESHOLD) { io_submit_state_start(&state, ctx, nr);
From: Jens Axboe axboe@kernel.dk
mainline inclusion from mainline-5.5-rc1 commit bd3ded3146daa2cbb57ed353749ef99cf75371b0 category: feature bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27 CVE: NA ---------------------------
This is identical to __sys_connect(), except it takes a struct file instead of an fd, and it also allows passing in extra file->f_flags flags. The latter is done to support masking in O_NONBLOCK without manipulating the original file flags.
No functional changes in this patch.
Cc: netdev@vger.kernel.org Acked-by: David S. Miller davem@davemloft.net Signed-off-by: Jens Axboe axboe@kernel.dk Signed-off-by: Zhihao Cheng chengzhihao1@huawei.com Signed-off-by: yangerkun yangerkun@huawei.com Reviewed-by: zhangyi (F) yi.zhang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- include/linux/socket.h | 3 +++ net/socket.c | 30 ++++++++++++++++++++++-------- 2 files changed, 25 insertions(+), 8 deletions(-)
diff --git a/include/linux/socket.h b/include/linux/socket.h index b5f99ade825d..841f18488954 100644 --- a/include/linux/socket.h +++ b/include/linux/socket.h @@ -385,6 +385,9 @@ extern int __sys_accept4(int fd, struct sockaddr __user *upeer_sockaddr, int __user *upeer_addrlen, int flags); extern int __sys_socket(int family, int type, int protocol); extern int __sys_bind(int fd, struct sockaddr __user *umyaddr, int addrlen); +extern int __sys_connect_file(struct file *file, + struct sockaddr __user *uservaddr, int addrlen, + int file_flags); extern int __sys_connect(int fd, struct sockaddr __user *uservaddr, int addrlen); extern int __sys_listen(int fd, int backlog); diff --git a/net/socket.c b/net/socket.c index 5ef7a4fc17d2..8faf6ea75c61 100644 --- a/net/socket.c +++ b/net/socket.c @@ -1659,32 +1659,46 @@ SYSCALL_DEFINE3(accept, int, fd, struct sockaddr __user *, upeer_sockaddr, * include the -EINPROGRESS status for such sockets. */
-int __sys_connect(int fd, struct sockaddr __user *uservaddr, int addrlen) +int __sys_connect_file(struct file *file, struct sockaddr __user *uservaddr, + int addrlen, int file_flags) { struct socket *sock; struct sockaddr_storage address; - int err, fput_needed; + int err;
- sock = sockfd_lookup_light(fd, &err, &fput_needed); + sock = sock_from_file(file, &err); if (!sock) goto out; err = move_addr_to_kernel(uservaddr, addrlen, &address); if (err < 0) - goto out_put; + goto out;
err = security_socket_connect(sock, (struct sockaddr *)&address, addrlen); if (err) - goto out_put; + goto out;
err = sock->ops->connect(sock, (struct sockaddr *)&address, addrlen, - sock->file->f_flags); -out_put: - fput_light(sock->file, fput_needed); + sock->file->f_flags | file_flags); out: return err; }
+int __sys_connect(int fd, struct sockaddr __user *uservaddr, int addrlen) +{ + int ret = -EBADF; + struct fd f; + + f = fdget(fd); + if (f.file) { + ret = __sys_connect_file(f.file, uservaddr, addrlen, 0); + if (f.flags) + fput(f.file); + } + + return ret; +} + SYSCALL_DEFINE3(connect, int, fd, struct sockaddr __user *, uservaddr, int, addrlen) {
From: Jens Axboe axboe@kernel.dk
mainline inclusion from mainline-5.5-rc1 commit f8e85cf255ad57d65eeb9a9d0e59e3dec55bdd9e category: feature bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27 CVE: NA ---------------------------
This allows an application to call connect() in an async fashion. Like other opcodes, we first try a non-blocking connect, then punt to async context if we have to.
Note that we can still return -EINPROGRESS, and in that case the caller should use IORING_OP_POLL_ADD to do an async wait for completion of the connect request (just like for regular connect(2), except we can do it async here too).
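For illustration, this is roughly how an application could drive the new opcode, assuming liburing's wrappers (async_connect is an illustrative helper; per the diff below, the sockaddr pointer goes in sqe->addr and the addrlen in sqe->addr2):

    #include <liburing.h>
    #include <netinet/in.h>

    static int async_connect(struct io_uring *ring, int sockfd,
                             const struct sockaddr_in *addr)
    {
        struct io_uring_sqe *sqe = io_uring_get_sqe(ring);
        struct io_uring_cqe *cqe;
        int ret;

        if (!sqe)
            return -1;
        io_uring_prep_connect(sqe, sockfd, (const struct sockaddr *) addr,
                              sizeof(*addr));
        io_uring_submit(ring);

        ret = io_uring_wait_cqe(ring, &cqe);
        if (ret < 0)
            return ret;
        ret = cqe->res;  /* 0 on success, -EINPROGRESS, or -errno */
        io_uring_cqe_seen(ring, cqe);
        return ret;
    }

On -EINPROGRESS the application would then arm an IORING_OP_POLL_ADD on the socket, as the note above describes.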
Signed-off-by: Jens Axboe axboe@kernel.dk Signed-off-by: Zhihao Cheng chengzhihao1@huawei.com Signed-off-by: yangerkun yangerkun@huawei.com Reviewed-by: zhangyi (F) yi.zhang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- fs/io_uring.c | 36 +++++++++++++++++++++++++++++++++++ include/uapi/linux/io_uring.h | 1 + 2 files changed, 37 insertions(+)
diff --git a/fs/io_uring.c b/fs/io_uring.c index 33d04821c1db..702cbb5c0d47 100644 --- a/fs/io_uring.c +++ b/fs/io_uring.c @@ -549,6 +549,7 @@ static inline bool io_prep_async_work(struct io_kiocb *req, case IORING_OP_RECVMSG: case IORING_OP_ACCEPT: case IORING_OP_POLL_ADD: + case IORING_OP_CONNECT: /* * We know REQ_F_ISREG is not set on some of these * opcodes, but this enables us to keep the check in @@ -1973,6 +1974,38 @@ static int io_accept(struct io_kiocb *req, const struct io_uring_sqe *sqe, #endif }
+static int io_connect(struct io_kiocb *req, const struct io_uring_sqe *sqe, + struct io_kiocb **nxt, bool force_nonblock) +{ +#if defined(CONFIG_NET) + struct sockaddr __user *addr; + unsigned file_flags; + int addr_len, ret; + + if (unlikely(req->ctx->flags & (IORING_SETUP_IOPOLL|IORING_SETUP_SQPOLL))) + return -EINVAL; + if (sqe->ioprio || sqe->len || sqe->buf_index || sqe->rw_flags) + return -EINVAL; + + addr = (struct sockaddr __user *) (unsigned long) READ_ONCE(sqe->addr); + addr_len = READ_ONCE(sqe->addr2); + file_flags = force_nonblock ? O_NONBLOCK : 0; + + ret = __sys_connect_file(req->file, addr, addr_len, file_flags); + if (ret == -EAGAIN && force_nonblock) + return -EAGAIN; + if (ret == -ERESTARTSYS) + ret = -EINTR; + if (ret < 0 && (req->flags & REQ_F_LINK)) + req->flags |= REQ_F_FAIL_LINK; + io_cqring_add_event(req, ret); + io_put_req_find_next(req, nxt); + return 0; +#else + return -EOPNOTSUPP; +#endif +} + static inline void io_poll_remove_req(struct io_kiocb *req) { if (!RB_EMPTY_NODE(&req->rb_node)) { @@ -2636,6 +2669,9 @@ static int io_issue_sqe(struct io_kiocb *req, struct io_kiocb **nxt, case IORING_OP_ACCEPT: ret = io_accept(req, s->sqe, nxt, force_nonblock); break; + case IORING_OP_CONNECT: + ret = io_connect(req, s->sqe, nxt, force_nonblock); + break; case IORING_OP_ASYNC_CANCEL: ret = io_async_cancel(req, s->sqe, nxt); break; diff --git a/include/uapi/linux/io_uring.h b/include/uapi/linux/io_uring.h index 2a1569211d87..4637ed1d9949 100644 --- a/include/uapi/linux/io_uring.h +++ b/include/uapi/linux/io_uring.h @@ -73,6 +73,7 @@ struct io_uring_sqe { #define IORING_OP_ACCEPT 13 #define IORING_OP_ASYNC_CANCEL 14 #define IORING_OP_LINK_TIMEOUT 15 +#define IORING_OP_CONNECT 16
/* * sqe->fsync_flags
From: Pavel Begunkov asml.silence@gmail.com
mainline inclusion from mainline-5.5-rc1 commit 311ae9e159d81a1ec1cf645daf40b39ae5a0bd84 category: feature bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27 CVE: NA ---------------------------
Read/write requests with fixed buffers to devices that do not implement read_iter/write_iter can cause a general protection fault, which completely hangs the machine.
io_import_fixed() initialises the iov_iter with a bvec, but loop_rw_iter() accesses it as an iovec, dereferencing a random address.
kmap() the pages one by one in this case.
Cc: stable@vger.kernel.org Signed-off-by: Pavel Begunkov asml.silence@gmail.com Signed-off-by: Jens Axboe axboe@kernel.dk Signed-off-by: Zhihao Cheng chengzhihao1@huawei.com Signed-off-by: yangerkun yangerkun@huawei.com Reviewed-by: zhangyi (F) yi.zhang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- fs/io_uring.c | 15 ++++++++++++++- 1 file changed, 14 insertions(+), 1 deletion(-)
diff --git a/fs/io_uring.c b/fs/io_uring.c index 702cbb5c0d47..e4fe2c140bfb 100644 --- a/fs/io_uring.c +++ b/fs/io_uring.c @@ -1621,9 +1621,19 @@ static ssize_t loop_rw_iter(int rw, struct file *file, struct kiocb *kiocb, return -EAGAIN;
while (iov_iter_count(iter)) { - struct iovec iovec = iov_iter_iovec(iter); + struct iovec iovec; ssize_t nr;
+ if (!((iter->type & ~(READ | WRITE)) == ITER_BVEC)) { + iovec = iov_iter_iovec(iter); + } else { + /* fixed buffers import bvec */ + iovec.iov_base = kmap(iter->bvec->bv_page) + + iter->iov_offset; + iovec.iov_len = min(iter->count, + iter->bvec->bv_len - iter->iov_offset); + } + if (rw == READ) { nr = file->f_op->read(file, iovec.iov_base, iovec.iov_len, &kiocb->ki_pos); @@ -1632,6 +1642,9 @@ static ssize_t loop_rw_iter(int rw, struct file *file, struct kiocb *kiocb, iovec.iov_len, &kiocb->ki_pos); }
+ if ((iter->type & ~(READ | WRITE)) == ITER_BVEC) + kunmap(iter->bvec->bv_page); + if (nr < 0) { if (!ret) ret = nr;
From: Jens Axboe axboe@kernel.dk
mainline inclusion from mainline-5.5-rc1 commit e944475e69849273ca8f1fe04a3ce81b5901d165 category: feature bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27 CVE: NA ---------------------------
In the quest to bring io_kiocb down to 3 cachelines, this one does the trick. Make the wait_queue_entry for the poll command come out of kmalloc instead of embedding it in struct io_poll_iocb, as the latter is the largest member of io_kiocb. Once we trim this down a bit, we're back at a healthy 192 bytes for struct io_kiocb.
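A userspace sketch of the size trade-off being applied here: boxing a struct's largest member behind a pointer shrinks every instance at the cost of an allocation when the member is actually needed. The struct names below are illustrative stand-ins, not the kernel's:

    #include <stdio.h>

    struct wait_entry { char pad[40]; };  /* stand-in for wait_queue_entry */

    struct poll_embedded {
        unsigned events;
        struct wait_entry wait;   /* embedded: inflates every request */
    };

    struct poll_boxed {
        unsigned events;
        struct wait_entry *wait;  /* allocated only when a poll is armed */
    };

    int main(void)
    {
        printf("embedded: %zu bytes, boxed: %zu bytes\n",
               sizeof(struct poll_embedded), sizeof(struct poll_boxed));
        return 0;
    }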
Signed-off-by: Jens Axboe axboe@kernel.dk Signed-off-by: Zhihao Cheng chengzhihao1@huawei.com Signed-off-by: yangerkun yangerkun@huawei.com Reviewed-by: zhangyi (F) yi.zhang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- fs/io_uring.c | 29 +++++++++++++++++------------ 1 file changed, 17 insertions(+), 12 deletions(-)
diff --git a/fs/io_uring.c b/fs/io_uring.c index c560bad4988c..7e783220d425 100644 --- a/fs/io_uring.c +++ b/fs/io_uring.c @@ -291,7 +291,7 @@ struct io_poll_iocb { __poll_t events; bool done; bool canceled; - struct wait_queue_entry wait; + struct wait_queue_entry *wait; };
struct io_timeout_data { @@ -2029,8 +2029,8 @@ static void io_poll_remove_one(struct io_kiocb *req)
spin_lock(&poll->head->lock); WRITE_ONCE(poll->canceled, true); - if (!list_empty(&poll->wait.entry)) { - list_del_init(&poll->wait.entry); + if (!list_empty(&poll->wait->entry)) { + list_del_init(&poll->wait->entry); io_queue_async_work(req); } spin_unlock(&poll->head->lock); @@ -2103,6 +2103,7 @@ static void io_poll_complete(struct io_kiocb *req, __poll_t mask, int error) struct io_ring_ctx *ctx = req->ctx;
req->poll.done = true; + kfree(req->poll.wait); if (error) io_cqring_fill_event(req, error); else @@ -2140,7 +2141,7 @@ static void io_poll_complete_work(struct io_wq_work **workptr) */ spin_lock_irq(&ctx->completion_lock); if (!mask && ret != -ECANCELED) { - add_wait_queue(poll->head, &poll->wait); + add_wait_queue(poll->head, poll->wait); spin_unlock_irq(&ctx->completion_lock); return; } @@ -2160,8 +2161,7 @@ static void io_poll_complete_work(struct io_wq_work **workptr) static int io_poll_wake(struct wait_queue_entry *wait, unsigned mode, int sync, void *key) { - struct io_poll_iocb *poll = container_of(wait, struct io_poll_iocb, - wait); + struct io_poll_iocb *poll = wait->private; struct io_kiocb *req = container_of(poll, struct io_kiocb, poll); struct io_ring_ctx *ctx = req->ctx; __poll_t mask = key_to_poll(key); @@ -2171,7 +2171,7 @@ static int io_poll_wake(struct wait_queue_entry *wait, unsigned mode, int sync, if (mask && !(mask & poll->events)) return 0;
- list_del_init(&poll->wait.entry); + list_del_init(&poll->wait->entry);
/* * Run completion inline if we can. We're using trylock here because @@ -2212,7 +2212,7 @@ static void io_poll_queue_proc(struct file *file, struct wait_queue_head *head,
pt->error = 0; pt->req->poll.head = head; - add_wait_queue(head, &pt->req->poll.wait); + add_wait_queue(head, pt->req->poll.wait); }
static void io_poll_req_insert(struct io_kiocb *req) @@ -2251,6 +2251,10 @@ static int io_poll_add(struct io_kiocb *req, const struct io_uring_sqe *sqe, if (!poll->file) return -EBADF;
+ poll->wait = kmalloc(sizeof(*poll->wait), GFP_KERNEL); + if (!poll->wait) + return -ENOMEM; + req->sqe = NULL; INIT_IO_WORK(&req->work, io_poll_complete_work); events = READ_ONCE(sqe->poll_events); @@ -2267,8 +2271,9 @@ static int io_poll_add(struct io_kiocb *req, const struct io_uring_sqe *sqe, ipt.error = -EINVAL; /* same as no support for IOCB_CMD_POLL */
/* initialized the list so that we can do list_empty checks */ - INIT_LIST_HEAD(&poll->wait.entry); - init_waitqueue_func_entry(&poll->wait, io_poll_wake); + INIT_LIST_HEAD(&poll->wait->entry); + init_waitqueue_func_entry(poll->wait, io_poll_wake); + poll->wait->private = poll;
INIT_LIST_HEAD(&req->list);
@@ -2277,14 +2282,14 @@ static int io_poll_add(struct io_kiocb *req, const struct io_uring_sqe *sqe, spin_lock_irq(&ctx->completion_lock); if (likely(poll->head)) { spin_lock(&poll->head->lock); - if (unlikely(list_empty(&poll->wait.entry))) { + if (unlikely(list_empty(&poll->wait->entry))) { if (ipt.error) cancel = true; ipt.error = 0; mask = 0; } if (mask || ipt.error) - list_del_init(&poll->wait.entry); + list_del_init(&poll->wait->entry); else if (cancel) WRITE_ONCE(poll->canceled, true); else if (!poll->done) /* actually waiting for an event */
From: Roman Penyaev rpenyaev@suse.de
mainline inclusion from mainline-5.5-rc1 commit 6c5c240e412682f97aecd233c1e706822704aa28 category: feature bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27 CVE: NA ---------------------------
That is a bit of a weird scenario, but I find it interesting to run fio loads using LKL Linux, where the MMU is disabled. Other real architectures that run uClinux can probably also benefit from this patch.
Signed-off-by: Roman Penyaev rpenyaev@suse.de Signed-off-by: Jens Axboe axboe@kernel.dk
Conflicts: fs/io_uring.c [ Patch a50b854e07("mm: introduce page_size()") is not applied. ]
Signed-off-by: Zhihao Cheng chengzhihao1@huawei.com Signed-off-by: yangerkun yangerkun@huawei.com Reviewed-by: zhangyi (F) yi.zhang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- fs/io_uring.c | 57 +++++++++++++++++++++++++++++++++++++++++++++------ 1 file changed, 51 insertions(+), 6 deletions(-)
diff --git a/fs/io_uring.c b/fs/io_uring.c index 7e783220d425..98cd3ff11008 100644 --- a/fs/io_uring.c +++ b/fs/io_uring.c @@ -4469,12 +4469,11 @@ static int io_uring_flush(struct file *file, void *data) return 0; }
-static int io_uring_mmap(struct file *file, struct vm_area_struct *vma) +static void *io_uring_validate_mmap_request(struct file *file, + loff_t pgoff, size_t sz) { - loff_t offset = (loff_t) vma->vm_pgoff << PAGE_SHIFT; - unsigned long sz = vma->vm_end - vma->vm_start; struct io_ring_ctx *ctx = file->private_data; - unsigned long pfn; + loff_t offset = pgoff << PAGE_SHIFT; struct page *page; void *ptr;
@@ -4487,17 +4486,59 @@ static int io_uring_mmap(struct file *file, struct vm_area_struct *vma) ptr = ctx->sq_sqes; break; default: - return -EINVAL; + return ERR_PTR(-EINVAL); }
page = virt_to_head_page(ptr); if (sz > (PAGE_SIZE << compound_order(page))) - return -EINVAL; + return ERR_PTR(-EINVAL); + + return ptr; +} + +#ifdef CONFIG_MMU + +static int io_uring_mmap(struct file *file, struct vm_area_struct *vma) +{ + size_t sz = vma->vm_end - vma->vm_start; + unsigned long pfn; + void *ptr; + + ptr = io_uring_validate_mmap_request(file, vma->vm_pgoff, sz); + if (IS_ERR(ptr)) + return PTR_ERR(ptr);
pfn = virt_to_phys(ptr) >> PAGE_SHIFT; return remap_pfn_range(vma, vma->vm_start, pfn, sz, vma->vm_page_prot); }
+#else /* !CONFIG_MMU */ + +static int io_uring_mmap(struct file *file, struct vm_area_struct *vma) +{ + return vma->vm_flags & (VM_SHARED | VM_MAYSHARE) ? 0 : -EINVAL; +} + +static unsigned int io_uring_nommu_mmap_capabilities(struct file *file) +{ + return NOMMU_MAP_DIRECT | NOMMU_MAP_READ | NOMMU_MAP_WRITE; +} + +static unsigned long io_uring_nommu_get_unmapped_area(struct file *file, + unsigned long addr, unsigned long len, + unsigned long pgoff, unsigned long flags) +{ + void *ptr; + + ptr = io_uring_validate_mmap_request(file, pgoff, len); + if (IS_ERR(ptr)) + return PTR_ERR(ptr); + + return (unsigned long) ptr; +} + +#endif /* !CONFIG_MMU */ + SYSCALL_DEFINE6(io_uring_enter, unsigned int, fd, u32, to_submit, u32, min_complete, u32, flags, const sigset_t __user *, sig, size_t, sigsz) @@ -4568,6 +4609,10 @@ static const struct file_operations io_uring_fops = { .release = io_uring_release, .flush = io_uring_flush, .mmap = io_uring_mmap, +#ifndef CONFIG_MMU + .get_unmapped_area = io_uring_nommu_get_unmapped_area, + .mmap_capabilities = io_uring_nommu_mmap_capabilities, +#endif .poll = io_uring_poll, .fasync = io_uring_fasync, };
From: Jens Axboe axboe@kernel.dk
mainline inclusion from mainline-5.5-rc1 commit aa4c3967756c6c576a38a23ac511be211462a6b7 category: feature bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27 CVE: NA ---------------------------
Christophe reports that current master fails to build on powerpc with this error:
CC fs/io_uring.o fs/io_uring.c: In function ‘loop_rw_iter’: fs/io_uring.c:1628:21: error: implicit declaration of function ‘kmap’ [-Werror=implicit-function-declaration] iovec.iov_base = kmap(iter->bvec->bv_page) ^ fs/io_uring.c:1628:19: warning: assignment makes pointer from integer without a cast [-Wint-conversion] iovec.iov_base = kmap(iter->bvec->bv_page) ^ fs/io_uring.c:1643:4: error: implicit declaration of function ‘kunmap’ [-Werror=implicit-function-declaration] kunmap(iter->bvec->bv_page); ^
which is caused by a missing highmem.h include. Fix it by including it.
Fixes: 311ae9e159d8 ("io_uring: fix dead-hung for non-iter fixed rw") Reported-by: Christophe Leroy christophe.leroy@c-s.fr Tested-by: Christophe Leroy christophe.leroy@c-s.fr Signed-off-by: Jens Axboe axboe@kernel.dk Signed-off-by: Zhihao Cheng chengzhihao1@huawei.com Signed-off-by: yangerkun yangerkun@huawei.com Reviewed-by: zhangyi (F) yi.zhang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- fs/io_uring.c | 1 + 1 file changed, 1 insertion(+)
diff --git a/fs/io_uring.c b/fs/io_uring.c index 98cd3ff11008..b22d30fecb60 100644 --- a/fs/io_uring.c +++ b/fs/io_uring.c @@ -69,6 +69,7 @@ #include <linux/nospec.h> #include <linux/sizes.h> #include <linux/hugetlb.h> +#include <linux/highmem.h>
#define CREATE_TRACE_POINTS #include <trace/events/io_uring.h>
From: Jens Axboe axboe@kernel.dk
mainline inclusion from mainline-5.5-rc1 commit 0b8c0ec7eedcd8f9f1a1f238d87f9b512b09e71a category: feature bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27 CVE: NA ---------------------------
syzbot reports:
kasan: CONFIG_KASAN_INLINE enabled kasan: GPF could be caused by NULL-ptr deref or user memory access general protection fault: 0000 [#1] PREEMPT SMP KASAN CPU: 0 PID: 9217 Comm: io_uring-sq Not tainted 5.4.0-syzkaller #0 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011 RIP: 0010:creds_are_invalid kernel/cred.c:792 [inline] RIP: 0010:__validate_creds include/linux/cred.h:187 [inline] RIP: 0010:override_creds+0x9f/0x170 kernel/cred.c:550 Code: ac 25 00 81 fb 64 65 73 43 0f 85 a3 37 00 00 e8 17 ab 25 00 49 8d 7c 24 10 48 b8 00 00 00 00 00 fc ff df 48 89 fa 48 c1 ea 03 <0f> b6 04 02 84 c0 74 08 3c 03 0f 8e 96 00 00 00 41 8b 5c 24 10 bf RSP: 0018:ffff88809c45fda0 EFLAGS: 00010202 RAX: dffffc0000000000 RBX: 0000000043736564 RCX: ffffffff814f3318 RDX: 0000000000000002 RSI: ffffffff814f3329 RDI: 0000000000000010 RBP: ffff88809c45fdb8 R08: ffff8880a3aac240 R09: ffffed1014755849 R10: ffffed1014755848 R11: ffff8880a3aac247 R12: 0000000000000000 R13: ffff888098ab1600 R14: 0000000000000000 R15: 0000000000000000 FS: 0000000000000000(0000) GS:ffff8880ae800000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00007ffd51c40664 CR3: 0000000092641000 CR4: 00000000001406f0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 Call Trace: io_sq_thread+0x1c7/0xa20 fs/io_uring.c:3274 kthread+0x361/0x430 kernel/kthread.c:255 ret_from_fork+0x24/0x30 arch/x86/entry/entry_64.S:352 Modules linked in: ---[ end trace f2e1a4307fbe2245 ]--- RIP: 0010:creds_are_invalid kernel/cred.c:792 [inline] RIP: 0010:__validate_creds include/linux/cred.h:187 [inline] RIP: 0010:override_creds+0x9f/0x170 kernel/cred.c:550 Code: ac 25 00 81 fb 64 65 73 43 0f 85 a3 37 00 00 e8 17 ab 25 00 49 8d 7c 24 10 48 b8 00 00 00 00 00 fc ff df 48 89 fa 48 c1 ea 03 <0f> b6 04 02 84 c0 74 08 3c 03 0f 8e 96 00 00 00 41 8b 5c 24 10 bf RSP: 0018:ffff88809c45fda0 EFLAGS: 00010202 RAX: dffffc0000000000 RBX: 0000000043736564 RCX: ffffffff814f3318 RDX: 0000000000000002 RSI: ffffffff814f3329 RDI: 0000000000000010 RBP: ffff88809c45fdb8 R08: ffff8880a3aac240 R09: ffffed1014755849 R10: ffffed1014755848 R11: ffff8880a3aac247 R12: 0000000000000000 R13: ffff888098ab1600 R14: 0000000000000000 R15: 0000000000000000 FS: 0000000000000000(0000) GS:ffff8880ae800000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00007ffd51c40664 CR3: 0000000092641000 CR4: 00000000001406f0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
which is caused by slab fault injection triggering a failure in prepare_creds(). We don't actually need to create a copy of the creds, as we're not modifying them; we just need a reference on the current task's creds. This avoids the failure case as well, and propagates the const throughout the stack.
Fixes: 181e448d8709 ("io_uring: async workers should inherit the user creds") Reported-by: syzbot+5320383e16029ba057ff@syzkaller.appspotmail.com Signed-off-by: Jens Axboe axboe@kernel.dk Signed-off-by: Zhihao Cheng chengzhihao1@huawei.com Signed-off-by: yangerkun yangerkun@huawei.com Reviewed-by: zhangyi (F) yi.zhang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- fs/io-wq.c | 2 +- fs/io-wq.h | 2 +- fs/io_uring.c | 4 ++-- 3 files changed, 4 insertions(+), 4 deletions(-)
diff --git a/fs/io-wq.c b/fs/io-wq.c index cadbc77542f7..25654b5bf853 100644 --- a/fs/io-wq.c +++ b/fs/io-wq.c @@ -112,7 +112,7 @@ struct io_wq {
struct task_struct *manager; struct user_struct *user; - struct cred *creds; + const struct cred *creds; struct mm_struct *mm; refcount_t refs; struct completion done; diff --git a/fs/io-wq.h b/fs/io-wq.h index 600e0158cba7..dd0af0d7376c 100644 --- a/fs/io-wq.h +++ b/fs/io-wq.h @@ -87,7 +87,7 @@ typedef void (put_work_fn)(struct io_wq_work *); struct io_wq_data { struct mm_struct *mm; struct user_struct *user; - struct cred *creds; + const struct cred *creds;
get_work_fn *get_work; put_work_fn *put_work; diff --git a/fs/io_uring.c b/fs/io_uring.c index b22d30fecb60..da8e3bbddc1b 100644 --- a/fs/io_uring.c +++ b/fs/io_uring.c @@ -238,7 +238,7 @@ struct io_ring_ctx {
struct user_struct *user;
- struct cred *creds; + const struct cred *creds;
/* 0 is for ctx quiesce/reinit/free, 1 is for sqo_thread started */ struct completion *completions; @@ -4759,7 +4759,7 @@ static int io_uring_create(unsigned entries, struct io_uring_params *p) ctx->compat = in_compat_syscall(); ctx->account_mem = account_mem; ctx->user = user; - ctx->creds = prepare_creds(); + ctx->creds = get_current_cred();
ret = io_allocate_scq_urings(ctx, p); if (ret)
From: Jens Axboe axboe@kernel.dk
mainline inclusion from mainline-5.5-rc1 commit 441cdbd5449b4923cd413d3ba748124f91388be9 category: feature bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27 CVE: NA ---------------------------
We should never return -ERESTARTSYS to userspace; transform it into -EINTR instead.
Cc: stable@vger.kernel.org # v5.3+ Signed-off-by: Jens Axboe axboe@kernel.dk Signed-off-by: yangerkun yangerkun@huawei.com Reviewed-by: zhangyi (F) yi.zhang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- fs/io_uring.c | 2 ++ 1 file changed, 2 insertions(+)
diff --git a/fs/io_uring.c b/fs/io_uring.c index da8e3bbddc1b..0780574e1843 100644 --- a/fs/io_uring.c +++ b/fs/io_uring.c @@ -1916,6 +1916,8 @@ static int io_send_recvmsg(struct io_kiocb *req, const struct io_uring_sqe *sqe, ret = fn(sock, msg, flags); if (force_nonblock && ret == -EAGAIN) return ret; + if (ret == -ERESTARTSYS) + ret = -EINTR; }
io_cqring_add_event(req, ret);
From: Jens Axboe axboe@kernel.dk
mainline inclusion from mainline-5.5-rc1 commit 1a6b74fc87024db59d41cd7346bd437f20fb3e2d category: feature bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27 CVE: NA ---------------------------
Right now we just copy the sqe for async offload, but we want to store more context across an async punt. In preparation for doing so, put the sqe copy inside a structure that we can expand. With this pointer added, we can get rid of REQ_F_FREE_SQE, as that is now indicated by whether req->io is NULL or not.
No functional changes in this patch.
Signed-off-by: Jens Axboe axboe@kernel.dk Signed-off-by: yangerkun yangerkun@huawei.com Reviewed-by: zhangyi (F) yi.zhang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- fs/io_uring.c | 56 +++++++++++++++++++++++++++++---------------------- 1 file changed, 32 insertions(+), 24 deletions(-)
diff --git a/fs/io_uring.c b/fs/io_uring.c index 0780574e1843..12db5162dae8 100644 --- a/fs/io_uring.c +++ b/fs/io_uring.c @@ -308,6 +308,10 @@ struct io_timeout { struct io_timeout_data *data; };
+struct io_async_ctx { + struct io_uring_sqe sqe; +}; + /* * NOTE! Each of the iocb union members has the file pointer * as the first entry in their struct definition. So you can @@ -323,6 +327,7 @@ struct io_kiocb { };
const struct io_uring_sqe *sqe; + struct io_async_ctx *io; struct file *ring_file; int ring_fd; bool has_user; @@ -353,7 +358,6 @@ struct io_kiocb { #define REQ_F_TIMEOUT_NOSEQ 8192 /* no timeout sequence */ #define REQ_F_INFLIGHT 16384 /* on inflight list */ #define REQ_F_COMP_LOCKED 32768 /* completion under lock */ -#define REQ_F_FREE_SQE 65536 /* free sqe if not async queued */ u64 user_data; u32 result; u32 sequence; @@ -805,6 +809,7 @@ static struct io_kiocb *io_get_req(struct io_ring_ctx *ctx, }
got_it: + req->io = NULL; req->ring_file = NULL; req->file = NULL; req->ctx = ctx; @@ -835,8 +840,8 @@ static void __io_free_req(struct io_kiocb *req) { struct io_ring_ctx *ctx = req->ctx;
- if (req->flags & REQ_F_FREE_SQE) - kfree(req->sqe); + if (req->io) + kfree(req->io); if (req->file && !(req->flags & REQ_F_FIXED_FILE)) fput(req->file); if (req->flags & REQ_F_INFLIGHT) { @@ -1078,9 +1083,9 @@ static void io_iopoll_complete(struct io_ring_ctx *ctx, unsigned int *nr_events, * completions for those, only batch free for fixed * file and non-linked commands. */ - if (((req->flags & - (REQ_F_FIXED_FILE|REQ_F_LINK|REQ_F_FREE_SQE)) == - REQ_F_FIXED_FILE) && !io_is_fallback_req(req)) { + if (((req->flags & (REQ_F_FIXED_FILE|REQ_F_LINK)) == + REQ_F_FIXED_FILE) && !io_is_fallback_req(req) && + !req->io) { reqs[to_free++] = req; if (to_free == ARRAY_SIZE(reqs)) io_free_req_many(ctx, reqs, &to_free); @@ -2258,7 +2263,7 @@ static int io_poll_add(struct io_kiocb *req, const struct io_uring_sqe *sqe, if (!poll->wait) return -ENOMEM;
- req->sqe = NULL; + req->io = NULL; INIT_IO_WORK(&req->work, io_poll_complete_work); events = READ_ONCE(sqe->poll_events); poll->events = demangle_poll(events) | EPOLLERR | EPOLLHUP; @@ -2601,27 +2606,27 @@ static int io_async_cancel(struct io_kiocb *req, const struct io_uring_sqe *sqe,
static int io_req_defer(struct io_kiocb *req) { - struct io_uring_sqe *sqe_copy; struct io_ring_ctx *ctx = req->ctx; + struct io_async_ctx *io;
/* Still need defer if there is pending req in defer list. */ if (!req_need_defer(req) && list_empty(&ctx->defer_list)) return 0;
- sqe_copy = kmalloc(sizeof(*sqe_copy), GFP_KERNEL); - if (!sqe_copy) + io = kmalloc(sizeof(*io), GFP_KERNEL); + if (!io) return -EAGAIN;
spin_lock_irq(&ctx->completion_lock); if (!req_need_defer(req) && list_empty(&ctx->defer_list)) { spin_unlock_irq(&ctx->completion_lock); - kfree(sqe_copy); + kfree(io); return 0; }
- memcpy(sqe_copy, req->sqe, sizeof(*sqe_copy)); - req->flags |= REQ_F_FREE_SQE; - req->sqe = sqe_copy; + memcpy(&io->sqe, req->sqe, sizeof(io->sqe)); + req->sqe = &io->sqe; + req->io = io;
trace_io_uring_defer(ctx, req, req->user_data); list_add_tail(&req->list, &ctx->defer_list); @@ -2954,14 +2959,16 @@ static void __io_queue_sqe(struct io_kiocb *req) */ if (ret == -EAGAIN && (!(req->flags & REQ_F_NOWAIT) || (req->flags & REQ_F_MUST_PUNT))) { - struct io_uring_sqe *sqe_copy; + struct io_async_ctx *io;
- sqe_copy = kmemdup(req->sqe, sizeof(*sqe_copy), GFP_KERNEL); - if (!sqe_copy) + io = kmalloc(sizeof(*io), GFP_KERNEL); + if (!io) goto err;
- req->sqe = sqe_copy; - req->flags |= REQ_F_FREE_SQE; + memcpy(&io->sqe, req->sqe, sizeof(io->sqe)); + + req->sqe = &io->sqe; + req->io = io;
if (req->work.flags & IO_WQ_WORK_NEEDS_FILES) { ret = io_grab_files(req); @@ -3062,7 +3069,7 @@ static void io_submit_sqe(struct io_kiocb *req, struct io_submit_state *state, */ if (*link) { struct io_kiocb *prev = *link; - struct io_uring_sqe *sqe_copy; + struct io_async_ctx *io;
if (req->sqe->flags & IOSQE_IO_DRAIN) (*link)->flags |= REQ_F_DRAIN_LINK | REQ_F_IO_DRAIN; @@ -3078,14 +3085,15 @@ static void io_submit_sqe(struct io_kiocb *req, struct io_submit_state *state, } }
- sqe_copy = kmemdup(req->sqe, sizeof(*sqe_copy), GFP_KERNEL); - if (!sqe_copy) { + io = kmalloc(sizeof(*io), GFP_KERNEL); + if (!io) { ret = -EAGAIN; goto err_req; }
- req->sqe = sqe_copy; - req->flags |= REQ_F_FREE_SQE; + memcpy(&io->sqe, req->sqe, sizeof(io->sqe)); + req->sqe = &io->sqe; + req->io = io; trace_io_uring_link(ctx, req, prev); list_add_tail(&req->list, &prev->link_list); } else if (req->sqe->flags & IOSQE_IO_LINK) {
From: Jens Axboe axboe@kernel.dk
mainline inclusion from mainline-5.5-rc1 commit f67676d160c6ee2ed82917fadfed6d29cab8237c category: feature bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27 CVE: NA ---------------------------
Currently we don't copy the iovecs when we punt to async context. This can be problematic for applications that store the iovec on the stack, as they often assume that it's safe to let the iovec go out of scope as soon as IO submission has been called. This isn't always safe, as we will re-copy the iovec once we're in async context.
Make this 100% safe by copying the iovec just once. With this change, applications may safely store the iovec on the stack for all cases.
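A minimal sketch of the guarantee from the application's point of view, assuming liburing (queue_read is an illustrative helper, not a liburing API):

    #include <liburing.h>
    #include <sys/uio.h>

    static int queue_read(struct io_uring *ring, int fd, void *buf, size_t len)
    {
        struct iovec iov = { .iov_base = buf, .iov_len = len };
        struct io_uring_sqe *sqe = io_uring_get_sqe(ring);

        if (!sqe)
            return -1;
        io_uring_prep_readv(sqe, fd, &iov, 1, 0);
        /* with this change, iov may go out of scope once submit returns */
        return io_uring_submit(ring);
    }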
Reported-by: 李通洲 carter.li@eoitek.com Signed-off-by: Jens Axboe axboe@kernel.dk Signed-off-by: yangerkun yangerkun@huawei.com Reviewed-by: zhangyi (F) yi.zhang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- fs/io_uring.c | 243 +++++++++++++++++++++++++++++++++++++------------- 1 file changed, 181 insertions(+), 62 deletions(-)
diff --git a/fs/io_uring.c b/fs/io_uring.c index 12db5162dae8..2060fb7b4450 100644 --- a/fs/io_uring.c +++ b/fs/io_uring.c @@ -308,8 +308,18 @@ struct io_timeout { struct io_timeout_data *data; };
+struct io_async_rw { + struct iovec fast_iov[UIO_FASTIOV]; + struct iovec *iov; + ssize_t nr_segs; + ssize_t size; +}; + struct io_async_ctx { struct io_uring_sqe sqe; + union { + struct io_async_rw rw; + }; };
/* @@ -1414,15 +1424,6 @@ static int io_prep_rw(struct io_kiocb *req, bool force_nonblock) if (S_ISREG(file_inode(req->file)->i_mode)) req->flags |= REQ_F_ISREG;
- /* - * If the file doesn't support async, mark it as REQ_F_MUST_PUNT so - * we know to async punt it even if it was opened O_NONBLOCK - */ - if (force_nonblock && !io_file_supports_async(req->file)) { - req->flags |= REQ_F_MUST_PUNT; - return -EAGAIN; - } - kiocb->ki_pos = READ_ONCE(sqe->off); kiocb->ki_flags = iocb_flags(kiocb->ki_filp); kiocb->ki_hint = ki_hint_validate(file_write_hint(kiocb->ki_filp)); @@ -1591,6 +1592,16 @@ static ssize_t io_import_iovec(int rw, struct io_kiocb *req, return io_import_fixed(req->ctx, rw, sqe, iter); }
+ if (req->io) { + struct io_async_rw *iorw = &req->io->rw; + + *iovec = iorw->iov; + iov_iter_init(iter, rw, *iovec, iorw->nr_segs, iorw->size); + if (iorw->iov == iorw->fast_iov) + *iovec = NULL; + return iorw->size; + } + if (!req->has_user) return -EFAULT;
@@ -1661,6 +1672,50 @@ static ssize_t loop_rw_iter(int rw, struct file *file, struct kiocb *kiocb, return ret; }
+static void io_req_map_io(struct io_kiocb *req, ssize_t io_size, + struct iovec *iovec, struct iovec *fast_iov, + struct iov_iter *iter) +{ + req->io->rw.nr_segs = iter->nr_segs; + req->io->rw.size = io_size; + req->io->rw.iov = iovec; + if (!req->io->rw.iov) { + req->io->rw.iov = req->io->rw.fast_iov; + memcpy(req->io->rw.iov, fast_iov, + sizeof(struct iovec) * iter->nr_segs); + } +} + +static int io_setup_async_io(struct io_kiocb *req, ssize_t io_size, + struct iovec *iovec, struct iovec *fast_iov, + struct iov_iter *iter) +{ + req->io = kmalloc(sizeof(*req->io), GFP_KERNEL); + if (req->io) { + io_req_map_io(req, io_size, iovec, fast_iov, iter); + memcpy(&req->io->sqe, req->sqe, sizeof(req->io->sqe)); + req->sqe = &req->io->sqe; + return 0; + } + + return -ENOMEM; +} + +static int io_read_prep(struct io_kiocb *req, struct iovec **iovec, + struct iov_iter *iter, bool force_nonblock) +{ + ssize_t ret; + + ret = io_prep_rw(req, force_nonblock); + if (ret) + return ret; + + if (unlikely(!(req->file->f_mode & FMODE_READ))) + return -EBADF; + + return io_import_iovec(READ, req, iovec, iter); +} + static int io_read(struct io_kiocb *req, struct io_kiocb **nxt, bool force_nonblock) { @@ -1669,23 +1724,31 @@ static int io_read(struct io_kiocb *req, struct io_kiocb **nxt, struct iov_iter iter; struct file *file; size_t iov_count; - ssize_t read_size, ret; + ssize_t io_size, ret;
- ret = io_prep_rw(req, force_nonblock); - if (ret) - return ret; - file = kiocb->ki_filp; - - if (unlikely(!(file->f_mode & FMODE_READ))) - return -EBADF; - - ret = io_import_iovec(READ, req, &iovec, &iter); - if (ret < 0) - return ret; + if (!req->io) { + ret = io_read_prep(req, &iovec, &iter, force_nonblock); + if (ret < 0) + return ret; + } else { + ret = io_import_iovec(READ, req, &iovec, &iter); + if (ret < 0) + return ret; + }
- read_size = ret; + file = req->file; + io_size = ret; if (req->flags & REQ_F_LINK) - req->result = read_size; + req->result = io_size; + + /* + * If the file doesn't support async, mark it as REQ_F_MUST_PUNT so + * we know to async punt it even if it was opened O_NONBLOCK + */ + if (force_nonblock && !io_file_supports_async(file)) { + req->flags |= REQ_F_MUST_PUNT; + goto copy_iov; + }
iov_count = iov_iter_count(&iter); ret = rw_verify_area(READ, file, &kiocb->ki_pos, iov_count); @@ -1707,18 +1770,40 @@ static int io_read(struct io_kiocb *req, struct io_kiocb **nxt, */ if (force_nonblock && !(req->flags & REQ_F_NOWAIT) && (req->flags & REQ_F_ISREG) && - ret2 > 0 && ret2 < read_size) + ret2 > 0 && ret2 < io_size) ret2 = -EAGAIN; /* Catch -EAGAIN return for forced non-blocking submission */ - if (!force_nonblock || ret2 != -EAGAIN) + if (!force_nonblock || ret2 != -EAGAIN) { kiocb_done(kiocb, ret2, nxt, req->in_async); - else - ret = -EAGAIN; + } else { +copy_iov: + ret = io_setup_async_io(req, io_size, iovec, + inline_vecs, &iter); + if (ret) + goto out_free; + return -EAGAIN; + } } +out_free: kfree(iovec); return ret; }
+static int io_write_prep(struct io_kiocb *req, struct iovec **iovec, + struct iov_iter *iter, bool force_nonblock) +{ + ssize_t ret; + + ret = io_prep_rw(req, force_nonblock); + if (ret) + return ret; + + if (unlikely(!(req->file->f_mode & FMODE_WRITE))) + return -EBADF; + + return io_import_iovec(WRITE, req, iovec, iter); +} + static int io_write(struct io_kiocb *req, struct io_kiocb **nxt, bool force_nonblock) { @@ -1727,29 +1812,36 @@ static int io_write(struct io_kiocb *req, struct io_kiocb **nxt, struct iov_iter iter; struct file *file; size_t iov_count; - ssize_t ret; + ssize_t ret, io_size;
- ret = io_prep_rw(req, force_nonblock); - if (ret) - return ret; + if (!req->io) { + ret = io_write_prep(req, &iovec, &iter, force_nonblock); + if (ret < 0) + return ret; + } else { + ret = io_import_iovec(WRITE, req, &iovec, &iter); + if (ret < 0) + return ret; + }
file = kiocb->ki_filp; - if (unlikely(!(file->f_mode & FMODE_WRITE))) - return -EBADF; - - ret = io_import_iovec(WRITE, req, &iovec, &iter); - if (ret < 0) - return ret; - + io_size = ret; if (req->flags & REQ_F_LINK) - req->result = ret; + req->result = io_size;
- iov_count = iov_iter_count(&iter); + /* + * If the file doesn't support async, mark it as REQ_F_MUST_PUNT so + * we know to async punt it even if it was opened O_NONBLOCK + */ + if (force_nonblock && !io_file_supports_async(req->file)) { + req->flags |= REQ_F_MUST_PUNT; + goto copy_iov; + }
- ret = -EAGAIN; if (force_nonblock && !(kiocb->ki_flags & IOCB_DIRECT)) - goto out_free; + goto copy_iov;
+ iov_count = iov_iter_count(&iter); ret = rw_verify_area(WRITE, file, &kiocb->ki_pos, iov_count); if (!ret) { ssize_t ret2; @@ -1773,10 +1865,16 @@ static int io_write(struct io_kiocb *req, struct io_kiocb **nxt, ret2 = call_write_iter(file, kiocb, &iter); else ret2 = loop_rw_iter(WRITE, file, kiocb, &iter); - if (!force_nonblock || ret2 != -EAGAIN) + if (!force_nonblock || ret2 != -EAGAIN) { kiocb_done(kiocb, ret2, nxt, req->in_async); - else - ret = -EAGAIN; + } else { +copy_iov: + ret = io_setup_async_io(req, io_size, iovec, + inline_vecs, &iter); + if (ret) + goto out_free; + return -EAGAIN; + } } out_free: kfree(iovec); @@ -2604,10 +2702,42 @@ static int io_async_cancel(struct io_kiocb *req, const struct io_uring_sqe *sqe, return 0; }
+static int io_req_defer_prep(struct io_kiocb *req, struct io_async_ctx *io) +{ + struct iovec inline_vecs[UIO_FASTIOV], *iovec = inline_vecs; + struct iov_iter iter; + ssize_t ret; + + memcpy(&io->sqe, req->sqe, sizeof(io->sqe)); + req->sqe = &io->sqe; + + switch (io->sqe.opcode) { + case IORING_OP_READV: + case IORING_OP_READ_FIXED: + ret = io_read_prep(req, &iovec, &iter, true); + break; + case IORING_OP_WRITEV: + case IORING_OP_WRITE_FIXED: + ret = io_write_prep(req, &iovec, &iter, true); + break; + default: + req->io = io; + return 0; + } + + if (ret < 0) + return ret; + + req->io = io; + io_req_map_io(req, ret, iovec, inline_vecs, &iter); + return 0; +} + static int io_req_defer(struct io_kiocb *req) { struct io_ring_ctx *ctx = req->ctx; struct io_async_ctx *io; + int ret;
/* Still need defer if there is pending req in defer list. */ if (!req_need_defer(req) && list_empty(&ctx->defer_list)) @@ -2624,9 +2754,9 @@ static int io_req_defer(struct io_kiocb *req) return 0; }
- memcpy(&io->sqe, req->sqe, sizeof(io->sqe)); - req->sqe = &io->sqe; - req->io = io; + ret = io_req_defer_prep(req, io); + if (ret < 0) + return ret;
trace_io_uring_defer(ctx, req, req->user_data); list_add_tail(&req->list, &ctx->defer_list); @@ -2959,17 +3089,6 @@ static void __io_queue_sqe(struct io_kiocb *req) */ if (ret == -EAGAIN && (!(req->flags & REQ_F_NOWAIT) || (req->flags & REQ_F_MUST_PUNT))) { - struct io_async_ctx *io; - - io = kmalloc(sizeof(*io), GFP_KERNEL); - if (!io) - goto err; - - memcpy(&io->sqe, req->sqe, sizeof(io->sqe)); - - req->sqe = &io->sqe; - req->io = io; - if (req->work.flags & IO_WQ_WORK_NEEDS_FILES) { ret = io_grab_files(req); if (ret) @@ -3091,9 +3210,9 @@ static void io_submit_sqe(struct io_kiocb *req, struct io_submit_state *state, goto err_req; }
- memcpy(&io->sqe, req->sqe, sizeof(io->sqe)); - req->sqe = &io->sqe; - req->io = io; + ret = io_req_defer_prep(req, io); + if (ret) + goto err_req; trace_io_uring_link(ctx, req, prev); list_add_tail(&req->list, &prev->link_list); } else if (req->sqe->flags & IOSQE_IO_LINK) {
From: Jens Axboe axboe@kernel.dk
mainline inclusion from mainline-5.5-rc1 commit 4257c8ca13b084550574b8c9a667d9c90ff746eb category: feature bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27 CVE: NA ---------------------------
This is in preparation for enabling the io_uring helpers for sendmsg and recvmsg to first copy the header for validation before continuing with the operation.
There should be no functional changes in this patch.
Acked-by: David S. Miller davem@davemloft.net Signed-off-by: Jens Axboe axboe@kernel.dk Signed-off-by: yangerkun yangerkun@huawei.com Reviewed-by: zhangyi (F) yi.zhang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- net/socket.c | 141 ++++++++++++++++++++++++++++++++++----------------- 1 file changed, 95 insertions(+), 46 deletions(-)
diff --git a/net/socket.c b/net/socket.c index 8faf6ea75c61..b4fd9c96e2ed 100644 --- a/net/socket.c +++ b/net/socket.c @@ -2068,15 +2068,10 @@ static int copy_msghdr_from_user(struct msghdr *kmsg, return err < 0 ? err : 0; }
-static int ___sys_sendmsg(struct socket *sock, struct user_msghdr __user *msg, - struct msghdr *msg_sys, unsigned int flags, - struct used_address *used_address, - unsigned int allowed_msghdr_flags) +static int ____sys_sendmsg(struct socket *sock, struct msghdr *msg_sys, + unsigned int flags, struct used_address *used_address, + unsigned int allowed_msghdr_flags) { - struct compat_msghdr __user *msg_compat = - (struct compat_msghdr __user *)msg; - struct sockaddr_storage address; - struct iovec iovstack[UIO_FASTIOV], *iov = iovstack; unsigned char ctl[sizeof(struct cmsghdr) + 20] __aligned(sizeof(__kernel_size_t)); /* 20 is size of ipv6_pktinfo */ @@ -2084,19 +2079,10 @@ static int ___sys_sendmsg(struct socket *sock, struct user_msghdr __user *msg, int ctl_len; ssize_t err;
- msg_sys->msg_name = &address; - - if (MSG_CMSG_COMPAT & flags) - err = get_compat_msghdr(msg_sys, msg_compat, NULL, &iov); - else - err = copy_msghdr_from_user(msg_sys, msg, NULL, &iov); - if (err < 0) - return err; - err = -ENOBUFS;
if (msg_sys->msg_controllen > INT_MAX) - goto out_freeiov; + goto out; flags |= (msg_sys->msg_flags & allowed_msghdr_flags); ctl_len = msg_sys->msg_controllen; if ((MSG_CMSG_COMPAT & flags) && ctl_len) { @@ -2104,7 +2090,7 @@ static int ___sys_sendmsg(struct socket *sock, struct user_msghdr __user *msg, cmsghdr_from_user_compat_to_kern(msg_sys, sock->sk, ctl, sizeof(ctl)); if (err) - goto out_freeiov; + goto out; ctl_buf = msg_sys->msg_control; ctl_len = msg_sys->msg_controllen; } else if (ctl_len) { @@ -2113,7 +2099,7 @@ static int ___sys_sendmsg(struct socket *sock, struct user_msghdr __user *msg, if (ctl_len > sizeof(ctl)) { ctl_buf = sock_kmalloc(sock->sk, ctl_len, GFP_KERNEL); if (ctl_buf == NULL) - goto out_freeiov; + goto out; } err = -EFAULT; /* @@ -2159,7 +2145,47 @@ static int ___sys_sendmsg(struct socket *sock, struct user_msghdr __user *msg, out_freectl: if (ctl_buf != ctl) sock_kfree_s(sock->sk, ctl_buf, ctl_len); -out_freeiov: +out: + return err; +} + +static int sendmsg_copy_msghdr(struct msghdr *msg, + struct user_msghdr __user *umsg, unsigned flags, + struct iovec **iov) +{ + int err; + + if (flags & MSG_CMSG_COMPAT) { + struct compat_msghdr __user *msg_compat; + + msg_compat = (struct compat_msghdr __user *) umsg; + err = get_compat_msghdr(msg, msg_compat, NULL, iov); + } else { + err = copy_msghdr_from_user(msg, umsg, NULL, iov); + } + if (err < 0) + return err; + + return 0; +} + +static int ___sys_sendmsg(struct socket *sock, struct user_msghdr __user *msg, + struct msghdr *msg_sys, unsigned int flags, + struct used_address *used_address, + unsigned int allowed_msghdr_flags) +{ + struct sockaddr_storage address; + struct iovec iovstack[UIO_FASTIOV], *iov = iovstack; + ssize_t err; + + msg_sys->msg_name = &address; + + err = sendmsg_copy_msghdr(msg_sys, msg, flags, &iov); + if (err < 0) + return err; + + err = ____sys_sendmsg(sock, msg_sys, flags, used_address, + allowed_msghdr_flags); kfree(iov); return err; } @@ -2278,33 +2304,41 @@ SYSCALL_DEFINE4(sendmmsg, int, fd, struct mmsghdr __user *, mmsg, return __sys_sendmmsg(fd, mmsg, vlen, flags, true); }
-static int ___sys_recvmsg(struct socket *sock, struct user_msghdr __user *msg, - struct msghdr *msg_sys, unsigned int flags, int nosec) +static int recvmsg_copy_msghdr(struct msghdr *msg, + struct user_msghdr __user *umsg, unsigned flags, + struct sockaddr __user **uaddr, + struct iovec **iov) { - struct compat_msghdr __user *msg_compat = - (struct compat_msghdr __user *)msg; - struct iovec iovstack[UIO_FASTIOV]; - struct iovec *iov = iovstack; - unsigned long cmsg_ptr; - int len; ssize_t err;
- /* kernel mode address */ - struct sockaddr_storage addr; - - /* user mode address pointers */ - struct sockaddr __user *uaddr; - int __user *uaddr_len = COMPAT_NAMELEN(msg); - - msg_sys->msg_name = &addr; + if (MSG_CMSG_COMPAT & flags) { + struct compat_msghdr __user *msg_compat;
- if (MSG_CMSG_COMPAT & flags) - err = get_compat_msghdr(msg_sys, msg_compat, &uaddr, &iov); - else - err = copy_msghdr_from_user(msg_sys, msg, &uaddr, &iov); + msg_compat = (struct compat_msghdr __user *) umsg; + err = get_compat_msghdr(msg, msg_compat, uaddr, iov); + } else { + err = copy_msghdr_from_user(msg, umsg, uaddr, iov); + } if (err < 0) return err;
+ return 0; +} + +static int ____sys_recvmsg(struct socket *sock, struct msghdr *msg_sys, + struct user_msghdr __user *msg, + struct sockaddr __user *uaddr, + unsigned int flags, int nosec) +{ + struct compat_msghdr __user *msg_compat = + (struct compat_msghdr __user *) msg; + int __user *uaddr_len = COMPAT_NAMELEN(msg); + struct sockaddr_storage addr; + unsigned long cmsg_ptr; + int len; + ssize_t err; + + msg_sys->msg_name = &addr; cmsg_ptr = (unsigned long)msg_sys->msg_control; msg_sys->msg_flags = flags & (MSG_CMSG_CLOEXEC|MSG_CMSG_COMPAT);
@@ -2315,7 +2349,7 @@ static int ___sys_recvmsg(struct socket *sock, struct user_msghdr __user *msg, flags |= MSG_DONTWAIT; err = (nosec ? sock_recvmsg_nosec : sock_recvmsg)(sock, msg_sys, flags); if (err < 0) - goto out_freeiov; + goto out; len = err;
if (uaddr != NULL) { @@ -2323,12 +2357,12 @@ static int ___sys_recvmsg(struct socket *sock, struct user_msghdr __user *msg, msg_sys->msg_namelen, uaddr, uaddr_len); if (err < 0) - goto out_freeiov; + goto out; } err = __put_user((msg_sys->msg_flags & ~MSG_CMSG_COMPAT), COMPAT_FLAGS(msg)); if (err) - goto out_freeiov; + goto out; if (MSG_CMSG_COMPAT & flags) err = __put_user((unsigned long)msg_sys->msg_control - cmsg_ptr, &msg_compat->msg_controllen); @@ -2336,10 +2370,25 @@ static int ___sys_recvmsg(struct socket *sock, struct user_msghdr __user *msg, err = __put_user((unsigned long)msg_sys->msg_control - cmsg_ptr, &msg->msg_controllen); if (err) - goto out_freeiov; + goto out; err = len; +out: + return err; +} + +static int ___sys_recvmsg(struct socket *sock, struct user_msghdr __user *msg, + struct msghdr *msg_sys, unsigned int flags, int nosec) +{ + struct iovec iovstack[UIO_FASTIOV], *iov = iovstack; + /* user mode address pointers */ + struct sockaddr __user *uaddr; + ssize_t err; + + err = recvmsg_copy_msghdr(msg_sys, msg, flags, &uaddr, &iov); + if (err < 0) + return err;
-out_freeiov: + err = ____sys_recvmsg(sock, msg_sys, msg, uaddr, flags, nosec); kfree(iov); return err; }
From: Jens Axboe axboe@kernel.dk
mainline inclusion from mainline-5.5-rc1 commit d69e07793f891524c6bbf1e75b9ae69db4450953 category: feature bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27 CVE: NA ---------------------------
Only io_uring uses (and added) these, and we want to disallow the use of sendmsg/recvmsg for anything but regular data transfers. Use the newly added prep helper to split the msghdr copy out from the core function, to check for msg_control and msg_controllen settings. If either is set, we return -EINVAL.
Acked-by: David S. Miller davem@davemloft.net Signed-off-by: Jens Axboe axboe@kernel.dk Signed-off-by: yangerkun yangerkun@huawei.com Reviewed-by: zhangyi (F) yi.zhang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- net/socket.c | 43 +++++++++++++++++++++++++++++++++++++------ 1 file changed, 37 insertions(+), 6 deletions(-)
diff --git a/net/socket.c b/net/socket.c index b4fd9c96e2ed..b3ffa502d62a 100644 --- a/net/socket.c +++ b/net/socket.c @@ -2193,12 +2193,27 @@ static int ___sys_sendmsg(struct socket *sock, struct user_msghdr __user *msg, /* * BSD sendmsg interface */ -long __sys_sendmsg_sock(struct socket *sock, struct user_msghdr __user *msg, +long __sys_sendmsg_sock(struct socket *sock, struct user_msghdr __user *umsg, unsigned int flags) { - struct msghdr msg_sys; + struct iovec iovstack[UIO_FASTIOV], *iov = iovstack; + struct sockaddr_storage address; + struct msghdr msg = { .msg_name = &address }; + ssize_t err; + + err = sendmsg_copy_msghdr(&msg, umsg, flags, &iov); + if (err) + return err; + /* disallow ancillary data requests from this path */ + if (msg.msg_control || msg.msg_controllen) { + err = -EINVAL; + goto out; + }
- return ___sys_sendmsg(sock, msg, &msg_sys, flags, NULL, 0); + err = ____sys_sendmsg(sock, &msg, flags, NULL, 0); +out: + kfree(iov); + return err; }
long __sys_sendmsg(int fd, struct user_msghdr __user *msg, unsigned int flags, @@ -2397,12 +2412,28 @@ static int ___sys_recvmsg(struct socket *sock, struct user_msghdr __user *msg, * BSD recvmsg interface */
-long __sys_recvmsg_sock(struct socket *sock, struct user_msghdr __user *msg, +long __sys_recvmsg_sock(struct socket *sock, struct user_msghdr __user *umsg, unsigned int flags) { - struct msghdr msg_sys; + struct iovec iovstack[UIO_FASTIOV], *iov = iovstack; + struct sockaddr_storage address; + struct msghdr msg = { .msg_name = &address }; + struct sockaddr __user *uaddr; + ssize_t err;
- return ___sys_recvmsg(sock, msg, &msg_sys, flags, 0); + err = recvmsg_copy_msghdr(&msg, umsg, flags, &uaddr, &iov); + if (err) + return err; + /* disallow ancillary data requests from this path */ + if (msg.msg_control || msg.msg_controllen) { + err = -EINVAL; + goto out; + } + + err = ____sys_recvmsg(sock, &msg, umsg, uaddr, flags, 0); +out: + kfree(iov); + return err; }
long __sys_recvmsg(int fd, struct user_msghdr __user *msg, unsigned int flags,
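The user-visible effect: an IORING_OP_SENDMSG or IORING_OP_RECVMSG whose msghdr carries ancillary data now completes with -EINVAL instead of transferring control messages. A minimal userspace sketch of that behavior, assuming liburing; the helper name and the socket fd are illustrative, not from the patch:

#include <liburing.h>
#include <string.h>
#include <sys/socket.h>

/* Sketch: sendmsg with SCM_RIGHTS ancillary data via io_uring; after
 * this patch the completion is expected to carry -EINVAL. */
static int try_cmsg_sendmsg(struct io_uring *ring, int sock, int fd_to_pass)
{
	char byte = 'x';
	struct iovec iov = { .iov_base = &byte, .iov_len = 1 };
	char ctl[CMSG_SPACE(sizeof(int))];
	struct msghdr msg = {
		.msg_iov = &iov,
		.msg_iovlen = 1,
		.msg_control = ctl,		/* ancillary data: now rejected */
		.msg_controllen = sizeof(ctl),
	};
	struct cmsghdr *cmsg = CMSG_FIRSTHDR(&msg);
	struct io_uring_sqe *sqe;
	struct io_uring_cqe *cqe;
	int res;

	cmsg->cmsg_level = SOL_SOCKET;
	cmsg->cmsg_type = SCM_RIGHTS;
	cmsg->cmsg_len = CMSG_LEN(sizeof(int));
	memcpy(CMSG_DATA(cmsg), &fd_to_pass, sizeof(int));

	sqe = io_uring_get_sqe(ring);
	io_uring_prep_sendmsg(sqe, sock, &msg, 0);
	io_uring_submit(ring);
	io_uring_wait_cqe(ring, &cqe);
	res = cqe->res;				/* expected: -EINVAL */
	io_uring_cqe_seen(ring, cqe);
	return res;
}

Plain sendmsg(2) remains the way to pass file descriptors or other control messages.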
From: Jens Axboe axboe@kernel.dk
mainline inclusion from mainline-5.5-rc1 commit 03b1230ca12a12e045d83b0357792075bf94a1e0 category: feature bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27 CVE: NA ---------------------------
Just like commit f67676d160c6 for read/write requests, this one ensures that the msghdr data is fully copied if we need to punt a recvmsg or sendmsg system call to async context.
Signed-off-by: Jens Axboe axboe@kernel.dk Signed-off-by: yangerkun yangerkun@huawei.com Reviewed-by: zhangyi (F) yi.zhang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- fs/io_uring.c | 145 ++++++++++++++++++++++++++++++++++++----- include/linux/socket.h | 15 +++-- net/socket.c | 60 +++++------------ 3 files changed, 156 insertions(+), 64 deletions(-)
diff --git a/fs/io_uring.c b/fs/io_uring.c index 2060fb7b4450..4de95825e878 100644 --- a/fs/io_uring.c +++ b/fs/io_uring.c @@ -308,6 +308,13 @@ struct io_timeout { struct io_timeout_data *data; };
+struct io_async_msghdr { + struct iovec fast_iov[UIO_FASTIOV]; + struct iovec *iov; + struct sockaddr __user *uaddr; + struct msghdr msg; +}; + struct io_async_rw { struct iovec fast_iov[UIO_FASTIOV]; struct iovec *iov; @@ -319,6 +326,7 @@ struct io_async_ctx { struct io_uring_sqe sqe; union { struct io_async_rw rw; + struct io_async_msghdr msg; }; };
@@ -1990,12 +1998,25 @@ static int io_sync_file_range(struct io_kiocb *req, return 0; }
+static int io_sendmsg_prep(struct io_kiocb *req, struct io_async_ctx *io) +{ #if defined(CONFIG_NET) -static int io_send_recvmsg(struct io_kiocb *req, const struct io_uring_sqe *sqe, - struct io_kiocb **nxt, bool force_nonblock, - long (*fn)(struct socket *, struct user_msghdr __user *, - unsigned int)) + const struct io_uring_sqe *sqe = req->sqe; + struct user_msghdr __user *msg; + unsigned flags; + + flags = READ_ONCE(sqe->msg_flags); + msg = (struct user_msghdr __user *)(unsigned long) READ_ONCE(sqe->addr); + return sendmsg_copy_msghdr(&io->msg.msg, msg, flags, &io->msg.iov); +#else + return 0; +#endif +} + +static int io_sendmsg(struct io_kiocb *req, const struct io_uring_sqe *sqe, + struct io_kiocb **nxt, bool force_nonblock) { +#if defined(CONFIG_NET) struct socket *sock; int ret;
@@ -2004,7 +2025,9 @@ static int io_send_recvmsg(struct io_kiocb *req, const struct io_uring_sqe *sqe,
sock = sock_from_file(req->file, &ret); if (sock) { - struct user_msghdr __user *msg; + struct io_async_ctx io, *copy; + struct sockaddr_storage addr; + struct msghdr *kmsg; unsigned flags;
flags = READ_ONCE(sqe->msg_flags); @@ -2013,32 +2036,59 @@ static int io_send_recvmsg(struct io_kiocb *req, const struct io_uring_sqe *sqe, else if (force_nonblock) flags |= MSG_DONTWAIT;
- msg = (struct user_msghdr __user *) (unsigned long) - READ_ONCE(sqe->addr); + if (req->io) { + kmsg = &req->io->msg.msg; + kmsg->msg_name = &addr; + } else { + kmsg = &io.msg.msg; + kmsg->msg_name = &addr; + io.msg.iov = io.msg.fast_iov; + ret = io_sendmsg_prep(req, &io); + if (ret) + goto out; + }
- ret = fn(sock, msg, flags); - if (force_nonblock && ret == -EAGAIN) + ret = __sys_sendmsg_sock(sock, kmsg, flags); + if (force_nonblock && ret == -EAGAIN) { + copy = kmalloc(sizeof(*copy), GFP_KERNEL); + if (!copy) { + ret = -ENOMEM; + goto out; + } + memcpy(©->msg, &io.msg, sizeof(copy->msg)); + req->io = copy; + memcpy(&req->io->sqe, req->sqe, sizeof(*req->sqe)); + req->sqe = &req->io->sqe; return ret; + } if (ret == -ERESTARTSYS) ret = -EINTR; }
+out: io_cqring_add_event(req, ret); if (ret < 0 && (req->flags & REQ_F_LINK)) req->flags |= REQ_F_FAIL_LINK; io_put_req_find_next(req, nxt); return 0; -} +#else + return -EOPNOTSUPP; #endif +}
-static int io_sendmsg(struct io_kiocb *req, const struct io_uring_sqe *sqe, - struct io_kiocb **nxt, bool force_nonblock) +static int io_recvmsg_prep(struct io_kiocb *req, struct io_async_ctx *io) { #if defined(CONFIG_NET) - return io_send_recvmsg(req, sqe, nxt, force_nonblock, - __sys_sendmsg_sock); + const struct io_uring_sqe *sqe = req->sqe; + struct user_msghdr __user *msg; + unsigned flags; + + flags = READ_ONCE(sqe->msg_flags); + msg = (struct user_msghdr __user *)(unsigned long) READ_ONCE(sqe->addr); + return recvmsg_copy_msghdr(&io->msg.msg, msg, flags, &io->msg.uaddr, + &io->msg.iov); #else - return -EOPNOTSUPP; + return 0; #endif }
@@ -2046,8 +2096,63 @@ static int io_recvmsg(struct io_kiocb *req, const struct io_uring_sqe *sqe, struct io_kiocb **nxt, bool force_nonblock) { #if defined(CONFIG_NET) - return io_send_recvmsg(req, sqe, nxt, force_nonblock, - __sys_recvmsg_sock); + struct socket *sock; + int ret; + + if (unlikely(req->ctx->flags & IORING_SETUP_IOPOLL)) + return -EINVAL; + + sock = sock_from_file(req->file, &ret); + if (sock) { + struct user_msghdr __user *msg; + struct io_async_ctx io, *copy; + struct sockaddr_storage addr; + struct msghdr *kmsg; + unsigned flags; + + flags = READ_ONCE(sqe->msg_flags); + if (flags & MSG_DONTWAIT) + req->flags |= REQ_F_NOWAIT; + else if (force_nonblock) + flags |= MSG_DONTWAIT; + + msg = (struct user_msghdr __user *) (unsigned long) + READ_ONCE(sqe->addr); + if (req->io) { + kmsg = &req->io->msg.msg; + kmsg->msg_name = &addr; + } else { + kmsg = &io.msg.msg; + kmsg->msg_name = &addr; + io.msg.iov = io.msg.fast_iov; + ret = io_recvmsg_prep(req, &io); + if (ret) + goto out; + } + + ret = __sys_recvmsg_sock(sock, kmsg, msg, io.msg.uaddr, flags); + if (force_nonblock && ret == -EAGAIN) { + copy = kmalloc(sizeof(*copy), GFP_KERNEL); + if (!copy) { + ret = -ENOMEM; + goto out; + } + memcpy(copy, &io, sizeof(*copy)); + req->io = copy; + memcpy(&req->io->sqe, req->sqe, sizeof(*req->sqe)); + req->sqe = &req->io->sqe; + return ret; + } + if (ret == -ERESTARTSYS) + ret = -EINTR; + } + +out: + io_cqring_add_event(req, ret); + if (ret < 0 && (req->flags & REQ_F_LINK)) + req->flags |= REQ_F_FAIL_LINK; + io_put_req_find_next(req, nxt); + return 0; #else return -EOPNOTSUPP; #endif @@ -2720,6 +2825,12 @@ static int io_req_defer_prep(struct io_kiocb *req, struct io_async_ctx *io) case IORING_OP_WRITE_FIXED: ret = io_write_prep(req, &iovec, &iter, true); break; + case IORING_OP_SENDMSG: + ret = io_sendmsg_prep(req, io); + break; + case IORING_OP_RECVMSG: + ret = io_recvmsg_prep(req, io); + break; default: req->io = io; return 0; diff --git a/include/linux/socket.h b/include/linux/socket.h index 841f18488954..9ea24dbab8b7 100644 --- a/include/linux/socket.h +++ b/include/linux/socket.h @@ -364,12 +364,19 @@ extern int __sys_recvmmsg(int fd, struct mmsghdr __user *mmsg, unsigned int vlen extern int __sys_sendmmsg(int fd, struct mmsghdr __user *mmsg, unsigned int vlen, unsigned int flags, bool forbid_cmsg_compat); -extern long __sys_sendmsg_sock(struct socket *sock, - struct user_msghdr __user *msg, +extern long __sys_sendmsg_sock(struct socket *sock, struct msghdr *msg, unsigned int flags); -extern long __sys_recvmsg_sock(struct socket *sock, - struct user_msghdr __user *msg, +extern long __sys_recvmsg_sock(struct socket *sock, struct msghdr *msg, + struct user_msghdr __user *umsg, + struct sockaddr __user *uaddr, unsigned int flags); +extern int sendmsg_copy_msghdr(struct msghdr *msg, + struct user_msghdr __user *umsg, unsigned flags, + struct iovec **iov); +extern int recvmsg_copy_msghdr(struct msghdr *msg, + struct user_msghdr __user *umsg, unsigned flags, + struct sockaddr __user **uaddr, + struct iovec **iov);
/* helpers which do the actual work for syscalls */ extern int __sys_recvfrom(int fd, void __user *ubuf, size_t size, diff --git a/net/socket.c b/net/socket.c index b3ffa502d62a..cf06a55d2f18 100644 --- a/net/socket.c +++ b/net/socket.c @@ -2149,9 +2149,9 @@ static int ____sys_sendmsg(struct socket *sock, struct msghdr *msg_sys, return err; }
-static int sendmsg_copy_msghdr(struct msghdr *msg, - struct user_msghdr __user *umsg, unsigned flags, - struct iovec **iov) +int sendmsg_copy_msghdr(struct msghdr *msg, + struct user_msghdr __user *umsg, unsigned flags, + struct iovec **iov) { int err;
@@ -2193,27 +2193,14 @@ static int ___sys_sendmsg(struct socket *sock, struct user_msghdr __user *msg, /* * BSD sendmsg interface */ -long __sys_sendmsg_sock(struct socket *sock, struct user_msghdr __user *umsg, +long __sys_sendmsg_sock(struct socket *sock, struct msghdr *msg, unsigned int flags) { - struct iovec iovstack[UIO_FASTIOV], *iov = iovstack; - struct sockaddr_storage address; - struct msghdr msg = { .msg_name = &address }; - ssize_t err; - - err = sendmsg_copy_msghdr(&msg, umsg, flags, &iov); - if (err) - return err; /* disallow ancillary data requests from this path */ - if (msg.msg_control || msg.msg_controllen) { - err = -EINVAL; - goto out; - } + if (msg->msg_control || msg->msg_controllen) + return -EINVAL;
- err = ____sys_sendmsg(sock, &msg, flags, NULL, 0); -out: - kfree(iov); - return err; + return ____sys_sendmsg(sock, msg, flags, NULL, 0); }
long __sys_sendmsg(int fd, struct user_msghdr __user *msg, unsigned int flags, @@ -2319,10 +2306,10 @@ SYSCALL_DEFINE4(sendmmsg, int, fd, struct mmsghdr __user *, mmsg, return __sys_sendmmsg(fd, mmsg, vlen, flags, true); }
-static int recvmsg_copy_msghdr(struct msghdr *msg, - struct user_msghdr __user *umsg, unsigned flags, - struct sockaddr __user **uaddr, - struct iovec **iov) +int recvmsg_copy_msghdr(struct msghdr *msg, + struct user_msghdr __user *umsg, unsigned flags, + struct sockaddr __user **uaddr, + struct iovec **iov) { ssize_t err;
@@ -2412,28 +2399,15 @@ static int ___sys_recvmsg(struct socket *sock, struct user_msghdr __user *msg, * BSD recvmsg interface */
-long __sys_recvmsg_sock(struct socket *sock, struct user_msghdr __user *umsg, - unsigned int flags) +long __sys_recvmsg_sock(struct socket *sock, struct msghdr *msg, + struct user_msghdr __user *umsg, + struct sockaddr __user *uaddr, unsigned int flags) { - struct iovec iovstack[UIO_FASTIOV], *iov = iovstack; - struct sockaddr_storage address; - struct msghdr msg = { .msg_name = &address }; - struct sockaddr __user *uaddr; - ssize_t err; - - err = recvmsg_copy_msghdr(&msg, umsg, flags, &uaddr, &iov); - if (err) - return err; /* disallow ancillary data requests from this path */ - if (msg.msg_control || msg.msg_controllen) { - err = -EINVAL; - goto out; - } + if (msg->msg_control || msg->msg_controllen) + return -EINVAL;
- err = ____sys_recvmsg(sock, &msg, umsg, uaddr, flags, 0); -out: - kfree(iov); - return err; + return ____sys_recvmsg(sock, msg, umsg, uaddr, flags, 0); }
long __sys_recvmsg(int fd, struct user_msghdr __user *msg, unsigned int flags,
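For applications, the intended effect of the copy in io_sendmsg_prep()/io_recvmsg_prep() is that the struct msghdr and the iovec array only have to stay valid until submission; the data buffers themselves must still live until the completion arrives. A minimal sketch of that lifetime, assuming liburing (the helper name is illustrative):

/* msg and iov may live on the submitting function's stack; only `buf`
 * must remain valid until the cqe is reaped. */
static int submit_sendmsg(struct io_uring *ring, int sock, void *buf,
			  size_t len)
{
	struct iovec iov = { .iov_base = buf, .iov_len = len };
	struct msghdr msg = { .msg_iov = &iov, .msg_iovlen = 1 };
	struct io_uring_sqe *sqe = io_uring_get_sqe(ring);

	io_uring_prep_sendmsg(sqe, sock, &msg, 0);
	return io_uring_submit(ring);	/* msg and iov are dead after this */
}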
From: Jens Axboe axboe@kernel.dk
mainline inclusion from mainline-5.5-rc1 commit 2d28390aff879238f00e209e38c2a0b78717360e category: feature bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27 CVE: NA ---------------------------
If we defer a timeout, we should ensure that we copy the timespec when we have consumed the sqe. This is similar to commit f67676d160c6 for read/write requests. We already did this correctly for timeouts deferred as links, but do it generally and use the infrastructure added by commit 1a6b74fc8702 instead of having the timeout deferral use its own.
Signed-off-by: Jens Axboe axboe@kernel.dk Signed-off-by: yangerkun yangerkun@huawei.com Reviewed-by: zhangyi (F) yi.zhang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- fs/io_uring.c | 83 ++++++++++++++++++++++++++------------------------- 1 file changed, 42 insertions(+), 41 deletions(-)
diff --git a/fs/io_uring.c b/fs/io_uring.c index c5bcb751b688..7d9001280fb5 100644 --- a/fs/io_uring.c +++ b/fs/io_uring.c @@ -303,11 +303,6 @@ struct io_timeout_data { u32 seq_offset; };
-struct io_timeout { - struct file *file; - struct io_timeout_data *data; -}; - struct io_async_connect { struct sockaddr_storage address; }; @@ -332,6 +327,7 @@ struct io_async_ctx { struct io_async_rw rw; struct io_async_msghdr msg; struct io_async_connect connect; + struct io_timeout_data timeout; }; };
@@ -346,7 +342,6 @@ struct io_kiocb { struct file *file; struct kiocb rw; struct io_poll_iocb poll; - struct io_timeout timeout; };
const struct io_uring_sqe *sqe; @@ -618,7 +613,7 @@ static void io_kill_timeout(struct io_kiocb *req) { int ret;
- ret = hrtimer_try_to_cancel(&req->timeout.data->timer); + ret = hrtimer_try_to_cancel(&req->io->timeout.timer); if (ret != -1) { atomic_inc(&req->ctx->cq_timeouts); list_del_init(&req->list); @@ -876,8 +871,6 @@ static void __io_free_req(struct io_kiocb *req) wake_up(&ctx->inflight_wait); spin_unlock_irqrestore(&ctx->inflight_lock, flags); } - if (req->flags & REQ_F_TIMEOUT) - kfree(req->timeout.data); percpu_ref_put(&ctx->refs); if (likely(!io_is_fallback_req(req))) kmem_cache_free(req_cachep, req); @@ -890,7 +883,7 @@ static bool io_link_cancel_timeout(struct io_kiocb *req) struct io_ring_ctx *ctx = req->ctx; int ret;
- ret = hrtimer_try_to_cancel(&req->timeout.data->timer); + ret = hrtimer_try_to_cancel(&req->io->timeout.timer); if (ret != -1) { io_cqring_fill_event(req, -ECANCELED); io_commit_cqring(ctx); @@ -2617,7 +2610,7 @@ static int io_timeout_cancel(struct io_ring_ctx *ctx, __u64 user_data) if (ret == -ENOENT) return ret;
- ret = hrtimer_try_to_cancel(&req->timeout.data->timer); + ret = hrtimer_try_to_cancel(&req->io->timeout.timer); if (ret == -1) return -EALREADY;
@@ -2659,7 +2652,8 @@ static int io_timeout_remove(struct io_kiocb *req, return 0; }
-static int io_timeout_setup(struct io_kiocb *req) +static int io_timeout_prep(struct io_kiocb *req, struct io_async_ctx *io, + bool is_timeout_link) { const struct io_uring_sqe *sqe = req->sqe; struct io_timeout_data *data; @@ -2669,15 +2663,14 @@ static int io_timeout_setup(struct io_kiocb *req) return -EINVAL; if (sqe->ioprio || sqe->buf_index || sqe->len != 1) return -EINVAL; + if (sqe->off && is_timeout_link) + return -EINVAL; flags = READ_ONCE(sqe->timeout_flags); if (flags & ~IORING_TIMEOUT_ABS) return -EINVAL;
- data = kzalloc(sizeof(struct io_timeout_data), GFP_KERNEL); - if (!data) - return -ENOMEM; + data = &io->timeout; data->req = req; - req->timeout.data = data; req->flags |= REQ_F_TIMEOUT;
if (get_timespec64(&data->ts, u64_to_user_ptr(sqe->addr))) @@ -2689,6 +2682,7 @@ static int io_timeout_setup(struct io_kiocb *req) data->mode = HRTIMER_MODE_REL;
hrtimer_init(&data->timer, CLOCK_MONOTONIC, data->mode); + req->io = io; return 0; }
@@ -2697,13 +2691,24 @@ static int io_timeout(struct io_kiocb *req, const struct io_uring_sqe *sqe) unsigned count; struct io_ring_ctx *ctx = req->ctx; struct io_timeout_data *data; + struct io_async_ctx *io; struct list_head *entry; unsigned span = 0; - int ret;
- ret = io_timeout_setup(req); - if (ret) - return ret; + io = req->io; + if (!io) { + int ret; + + io = kmalloc(sizeof(*io), GFP_KERNEL); + if (!io) + return -ENOMEM; + ret = io_timeout_prep(req, io, false); + if (ret) { + kfree(io); + return ret; + } + } + data = &req->io->timeout;
/* * sqe->off holds how many events that need to occur for this @@ -2719,7 +2724,7 @@ static int io_timeout(struct io_kiocb *req, const struct io_uring_sqe *sqe) }
req->sequence = ctx->cached_sq_head + count - 1; - req->timeout.data->seq_offset = count; + data->seq_offset = count;
/* * Insertion sort, ensuring the first entry in the list is always @@ -2730,7 +2735,7 @@ static int io_timeout(struct io_kiocb *req, const struct io_uring_sqe *sqe) struct io_kiocb *nxt = list_entry(entry, struct io_kiocb, list); unsigned nxt_sq_head; long long tmp, tmp_nxt; - u32 nxt_offset = nxt->timeout.data->seq_offset; + u32 nxt_offset = nxt->io->timeout.seq_offset;
if (nxt->flags & REQ_F_TIMEOUT_NOSEQ) continue; @@ -2763,7 +2768,6 @@ static int io_timeout(struct io_kiocb *req, const struct io_uring_sqe *sqe) req->sequence -= span; add: list_add(&req->list, entry); - data = req->timeout.data; data->timer.function = io_timeout_fn; hrtimer_start(&data->timer, timespec64_to_ktime(data->ts), data->mode); spin_unlock_irq(&ctx->completion_lock); @@ -2871,6 +2875,10 @@ static int io_req_defer_prep(struct io_kiocb *req, struct io_async_ctx *io) case IORING_OP_CONNECT: ret = io_connect_prep(req, io); break; + case IORING_OP_TIMEOUT: + return io_timeout_prep(req, io, false); + case IORING_OP_LINK_TIMEOUT: + return io_timeout_prep(req, io, true); default: req->io = io; return 0; @@ -2898,17 +2906,18 @@ static int io_req_defer(struct io_kiocb *req) if (!io) return -EAGAIN;
+ ret = io_req_defer_prep(req, io); + if (ret < 0) { + kfree(io); + return ret; + } + spin_lock_irq(&ctx->completion_lock); if (!req_need_defer(req) && list_empty(&ctx->defer_list)) { spin_unlock_irq(&ctx->completion_lock); - kfree(io); return 0; }
- ret = io_req_defer_prep(req, io); - if (ret < 0) - return ret; - trace_io_uring_defer(ctx, req, req->user_data); list_add_tail(&req->list, &ctx->defer_list); spin_unlock_irq(&ctx->completion_lock); @@ -3197,7 +3206,7 @@ static void io_queue_linked_timeout(struct io_kiocb *req) */ spin_lock_irq(&ctx->completion_lock); if (!list_empty(&req->list)) { - struct io_timeout_data *data = req->timeout.data; + struct io_timeout_data *data = &req->io->timeout;
data->timer.function = io_link_timeout_fn; hrtimer_start(&data->timer, timespec64_to_ktime(data->ts), @@ -3344,17 +3353,6 @@ static void io_submit_sqe(struct io_kiocb *req, struct io_submit_state *state, if (req->sqe->flags & IOSQE_IO_DRAIN) (*link)->flags |= REQ_F_DRAIN_LINK | REQ_F_IO_DRAIN;
- if (READ_ONCE(req->sqe->opcode) == IORING_OP_LINK_TIMEOUT) { - ret = io_timeout_setup(req); - /* common setup allows offset being set, we don't */ - if (!ret && req->sqe->off) - ret = -EINVAL; - if (ret) { - prev->flags |= REQ_F_FAIL_LINK; - goto err_req; - } - } - io = kmalloc(sizeof(*io), GFP_KERNEL); if (!io) { ret = -EAGAIN; @@ -3362,8 +3360,11 @@ static void io_submit_sqe(struct io_kiocb *req, struct io_submit_state *state, }
ret = io_req_defer_prep(req, io); - if (ret) + if (ret) { + kfree(io); + prev->flags |= REQ_F_FAIL_LINK; goto err_req; + } trace_io_uring_link(ctx, req, prev); list_add_tail(&req->list, &prev->link_list); } else if (req->sqe->flags & IOSQE_IO_LINK) {
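The same lifetime rule now holds for timeouts: get_timespec64() runs in io_timeout_prep() when the sqe is consumed, so the timespec does not need to outlive submission. A minimal sketch, assuming liburing (the helper name is illustrative):

static int submit_timeout(struct io_uring *ring)
{
	struct __kernel_timespec ts = { .tv_sec = 1, .tv_nsec = 0 };
	struct io_uring_sqe *sqe = io_uring_get_sqe(ring);

	/* count == 0 and no flags: a pure relative timer that completes
	 * with -ETIME unless cancelled first */
	io_uring_prep_timeout(sqe, &ts, 0, 0);
	return io_uring_submit(ring);	/* ts may go out of scope now */
}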
From: Jens Axboe axboe@kernel.dk
mainline inclusion from mainline-5.5-rc1 commit 78076bb64aa8ba5b7207c38b2660a9e10ffa8cc7 category: feature bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27 CVE: NA ---------------------------
We recently changed this from a single list to an rbtree, but for some real-life workloads the rbtree slows down the submission/insertion case enough that it becomes the top cycle consumer on the io_uring side. In testing, a hash table is a more well-rounded compromise: it is fast for insertion and, as long as it is sized appropriately, it works well for the cancellation case too. Running TAO with a lot of network sockets, this change stops io_poll_req_insert() from consuming 2% of the CPU cycles.
Reported-by: Dan Melnic dmm@fb.com Signed-off-by: Jens Axboe axboe@kernel.dk
Conflicts: fs/io_uring.c [214828962dea io_uring: initialize percpu refcounters using PERCU_REF_ALLOW_REINIT not applied]
Signed-off-by: yangerkun yangerkun@huawei.com Reviewed-by: zhangyi (F) yi.zhang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- fs/io_uring.c | 84 +++++++++++++++++++++++++-------------------------- 1 file changed, 41 insertions(+), 43 deletions(-)
diff --git a/fs/io_uring.c b/fs/io_uring.c index 7d9001280fb5..d2f9fc82810b 100644 --- a/fs/io_uring.c +++ b/fs/io_uring.c @@ -275,7 +275,8 @@ struct io_ring_ctx { * manipulate the list, hence no extra locking is needed there. */ struct list_head poll_list; - struct rb_root cancel_tree; + struct hlist_head *cancel_hash; + unsigned cancel_hash_bits;
spinlock_t inflight_lock; struct list_head inflight_list; @@ -355,7 +356,7 @@ struct io_kiocb { struct io_ring_ctx *ctx; union { struct list_head list; - struct rb_node rb_node; + struct hlist_node hash_node; }; struct list_head link_list; unsigned int flags; @@ -444,6 +445,7 @@ static void io_ring_ctx_ref_free(struct percpu_ref *ref) static struct io_ring_ctx *io_ring_ctx_alloc(struct io_uring_params *p) { struct io_ring_ctx *ctx; + int hash_bits;
ctx = kzalloc(sizeof(*ctx), GFP_KERNEL); if (!ctx) @@ -457,6 +459,21 @@ static struct io_ring_ctx *io_ring_ctx_alloc(struct io_uring_params *p) if (!ctx->completions) goto err;
+ /* + * Use 5 bits less than the max cq entries, that should give us around + * 32 entries per hash list if totally full and uniformly spread. + */ + hash_bits = ilog2(p->cq_entries); + hash_bits -= 5; + if (hash_bits <= 0) + hash_bits = 1; + ctx->cancel_hash_bits = hash_bits; + ctx->cancel_hash = kmalloc((1U << hash_bits) * sizeof(struct hlist_head), + GFP_KERNEL); + if (!ctx->cancel_hash) + goto err; + __hash_init(ctx->cancel_hash, 1U << hash_bits); + if (percpu_ref_init(&ctx->refs, io_ring_ctx_ref_free, 0, GFP_KERNEL)) goto err;
@@ -469,7 +486,6 @@ static struct io_ring_ctx *io_ring_ctx_alloc(struct io_uring_params *p) init_waitqueue_head(&ctx->wait); spin_lock_init(&ctx->completion_lock); INIT_LIST_HEAD(&ctx->poll_list); - ctx->cancel_tree = RB_ROOT; INIT_LIST_HEAD(&ctx->defer_list); INIT_LIST_HEAD(&ctx->timeout_list); init_waitqueue_head(&ctx->inflight_wait); @@ -480,6 +496,7 @@ static struct io_ring_ctx *io_ring_ctx_alloc(struct io_uring_params *p) if (ctx->fallback_req) kmem_cache_free(req_cachep, ctx->fallback_req); kfree(ctx->completions); + kfree(ctx->cancel_hash); kfree(ctx); return NULL; } @@ -2259,14 +2276,6 @@ static int io_connect(struct io_kiocb *req, const struct io_uring_sqe *sqe, #endif }
-static inline void io_poll_remove_req(struct io_kiocb *req) -{ - if (!RB_EMPTY_NODE(&req->rb_node)) { - rb_erase(&req->rb_node, &req->ctx->cancel_tree); - RB_CLEAR_NODE(&req->rb_node); - } -} - static void io_poll_remove_one(struct io_kiocb *req) { struct io_poll_iocb *poll = &req->poll; @@ -2278,36 +2287,34 @@ static void io_poll_remove_one(struct io_kiocb *req) io_queue_async_work(req); } spin_unlock(&poll->head->lock); - io_poll_remove_req(req); + hash_del(&req->hash_node); }
static void io_poll_remove_all(struct io_ring_ctx *ctx) { - struct rb_node *node; + struct hlist_node *tmp; struct io_kiocb *req; + int i;
spin_lock_irq(&ctx->completion_lock); - while ((node = rb_first(&ctx->cancel_tree)) != NULL) { - req = rb_entry(node, struct io_kiocb, rb_node); - io_poll_remove_one(req); + for (i = 0; i < (1U << ctx->cancel_hash_bits); i++) { + struct hlist_head *list; + + list = &ctx->cancel_hash[i]; + hlist_for_each_entry_safe(req, tmp, list, hash_node) + io_poll_remove_one(req); } spin_unlock_irq(&ctx->completion_lock); }
static int io_poll_cancel(struct io_ring_ctx *ctx, __u64 sqe_addr) { - struct rb_node *p, *parent = NULL; + struct hlist_head *list; struct io_kiocb *req;
- p = ctx->cancel_tree.rb_node; - while (p) { - parent = p; - req = rb_entry(parent, struct io_kiocb, rb_node); - if (sqe_addr < req->user_data) { - p = p->rb_left; - } else if (sqe_addr > req->user_data) { - p = p->rb_right; - } else { + list = &ctx->cancel_hash[hash_long(sqe_addr, ctx->cancel_hash_bits)]; + hlist_for_each_entry(req, list, hash_node) { + if (sqe_addr == req->user_data) { io_poll_remove_one(req); return 0; } @@ -2389,7 +2396,7 @@ static void io_poll_complete_work(struct io_wq_work **workptr) spin_unlock_irq(&ctx->completion_lock); return; } - io_poll_remove_req(req); + hash_del(&req->hash_node); io_poll_complete(req, mask, ret); spin_unlock_irq(&ctx->completion_lock);
@@ -2424,7 +2431,7 @@ static int io_poll_wake(struct wait_queue_entry *wait, unsigned mode, int sync, * for finalizing the request, mark us as having grabbed that already. */ if (mask && spin_trylock_irqsave(&ctx->completion_lock, flags)) { - io_poll_remove_req(req); + hash_del(&req->hash_node); io_poll_complete(req, mask, 0); req->flags |= REQ_F_COMP_LOCKED; io_put_req(req); @@ -2462,20 +2469,10 @@ static void io_poll_queue_proc(struct file *file, struct wait_queue_head *head, static void io_poll_req_insert(struct io_kiocb *req) { struct io_ring_ctx *ctx = req->ctx; - struct rb_node **p = &ctx->cancel_tree.rb_node; - struct rb_node *parent = NULL; - struct io_kiocb *tmp; - - while (*p) { - parent = *p; - tmp = rb_entry(parent, struct io_kiocb, rb_node); - if (req->user_data < tmp->user_data) - p = &(*p)->rb_left; - else - p = &(*p)->rb_right; - } - rb_link_node(&req->rb_node, parent, p); - rb_insert_color(&req->rb_node, &ctx->cancel_tree); + struct hlist_head *list; + + list = &ctx->cancel_hash[hash_long(req->user_data, ctx->cancel_hash_bits)]; + hlist_add_head(&req->hash_node, list); }
static int io_poll_add(struct io_kiocb *req, const struct io_uring_sqe *sqe, @@ -2503,7 +2500,7 @@ static int io_poll_add(struct io_kiocb *req, const struct io_uring_sqe *sqe, INIT_IO_WORK(&req->work, io_poll_complete_work); events = READ_ONCE(sqe->poll_events); poll->events = demangle_poll(events) | EPOLLERR | EPOLLHUP; - RB_CLEAR_NODE(&req->rb_node); + INIT_HLIST_NODE(&req->hash_node);
poll->head = NULL; poll->done = false; @@ -4644,6 +4641,7 @@ static void io_ring_ctx_free(struct io_ring_ctx *ctx) free_uid(ctx->user); put_cred(ctx->creds); kfree(ctx->completions); + kfree(ctx->cancel_hash); kmem_cache_free(req_cachep, ctx->fallback_req); kfree(ctx); }
From: Pavel Begunkov asml.silence@gmail.com
mainline inclusion from mainline-5.5-rc1 commit 2e6e1fde32d7d41cf076c21060c329d3fdbce25c category: feature bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27 CVE: NA ---------------------------
In case of an error, io_submit_sqe() drops the request and continues without it, even if the request was part of a link. Not only does it fail to cancel links, it may also execute the wrong sequence of actions.
Stop consuming sqes, and let the user handle errors.
Signed-off-by: Pavel Begunkov asml.silence@gmail.com Signed-off-by: Jens Axboe axboe@kernel.dk Signed-off-by: yangerkun yangerkun@huawei.com Reviewed-by: zhangyi (F) yi.zhang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- fs/io_uring.c | 12 +++++++----- 1 file changed, 7 insertions(+), 5 deletions(-)
diff --git a/fs/io_uring.c b/fs/io_uring.c index d2f9fc82810b..f58ab64d2617 100644 --- a/fs/io_uring.c +++ b/fs/io_uring.c @@ -3314,7 +3314,7 @@ static inline void io_queue_link_head(struct io_kiocb *req)
#define SQE_VALID_FLAGS (IOSQE_FIXED_FILE|IOSQE_IO_DRAIN|IOSQE_IO_LINK)
-static void io_submit_sqe(struct io_kiocb *req, struct io_submit_state *state, +static bool io_submit_sqe(struct io_kiocb *req, struct io_submit_state *state, struct io_kiocb **link) { struct io_ring_ctx *ctx = req->ctx; @@ -3333,7 +3333,7 @@ static void io_submit_sqe(struct io_kiocb *req, struct io_submit_state *state, err_req: io_cqring_add_event(req, ret); io_double_put_req(req); - return; + return false; }
/* @@ -3372,6 +3372,8 @@ static void io_submit_sqe(struct io_kiocb *req, struct io_submit_state *state, } else { io_queue_sqe(req); } + + return true; }
/* @@ -3501,6 +3503,7 @@ static int io_submit_sqes(struct io_ring_ctx *ctx, unsigned int nr, } }
+ submitted++; sqe_flags = req->sqe->flags;
req->ring_file = ring_file; @@ -3510,9 +3513,8 @@ static int io_submit_sqes(struct io_ring_ctx *ctx, unsigned int nr, req->needs_fixed_file = async; trace_io_uring_submit_sqe(ctx, req->sqe->user_data, true, async); - io_submit_sqe(req, statep, &link); - submitted++; - + if (!io_submit_sqe(req, statep, &link)) + break; /* * If previous wasn't linked and we have a linked command, * that's the end of the chain. Submit the previous link.
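For userspace this means a submit can consume fewer sqes than were prepared: the first failing sqe gets a cqe carrying the error, and the sqes after it stay in the ring. A hedged fragment of the check, assuming liburing:

static void submit_linked_pair(struct io_uring *ring, int prepared)
{
	int ret = io_uring_submit(ring);

	if (ret >= 0 && ret < prepared) {
		/* the last consumed sqe failed and posted an error cqe;
		 * the remaining sqes were not consumed and will be picked
		 * up by the next submit */
	}
}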
From: Pavel Begunkov asml.silence@gmail.com
mainline inclusion from mainline-5.5-rc1 commit 4493233edcfc0ad0a7f76f1c83f95b1bcf280547 category: feature bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27 CVE: NA ---------------------------
Links are created by chaining requests through req->list, with the exception that the head uses req->link_list (i.e. link_list->list->list). Because of that, io_req_link_next() needs complex splicing to advance.
Link them all through link_list instead. It is also simpler and more consistent.
Signed-off-by: Pavel Begunkov asml.silence@gmail.com Signed-off-by: Jens Axboe axboe@kernel.dk Signed-off-by: yangerkun yangerkun@huawei.com Reviewed-by: zhangyi (F) yi.zhang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- fs/io_uring.c | 42 ++++++++++++++++++++---------------------- 1 file changed, 20 insertions(+), 22 deletions(-)
diff --git a/fs/io_uring.c b/fs/io_uring.c index f58ab64d2617..54aaa737ddca 100644 --- a/fs/io_uring.c +++ b/fs/io_uring.c @@ -915,7 +915,6 @@ static bool io_link_cancel_timeout(struct io_kiocb *req) static void io_req_link_next(struct io_kiocb *req, struct io_kiocb **nxtptr) { struct io_ring_ctx *ctx = req->ctx; - struct io_kiocb *nxt; bool wake_ev = false;
/* Already got next link */ @@ -927,24 +926,21 @@ static void io_req_link_next(struct io_kiocb *req, struct io_kiocb **nxtptr) * potentially happen if the chain is messed up, check to be on the * safe side. */ - nxt = list_first_entry_or_null(&req->link_list, struct io_kiocb, list); - while (nxt) { - list_del_init(&nxt->list); + while (!list_empty(&req->link_list)) { + struct io_kiocb *nxt = list_first_entry(&req->link_list, + struct io_kiocb, link_list);
- if ((req->flags & REQ_F_LINK_TIMEOUT) && - (nxt->flags & REQ_F_TIMEOUT)) { + if (unlikely((req->flags & REQ_F_LINK_TIMEOUT) && + (nxt->flags & REQ_F_TIMEOUT))) { + list_del_init(&nxt->link_list); wake_ev |= io_link_cancel_timeout(nxt); - nxt = list_first_entry_or_null(&req->link_list, - struct io_kiocb, list); req->flags &= ~REQ_F_LINK_TIMEOUT; continue; } - if (!list_empty(&req->link_list)) { - INIT_LIST_HEAD(&nxt->link_list); - list_splice(&req->link_list, &nxt->link_list); - nxt->flags |= REQ_F_LINK; - }
+ list_del_init(&req->link_list); + if (!list_empty(&nxt->link_list)) + nxt->flags |= REQ_F_LINK; *nxtptr = nxt; break; } @@ -960,15 +956,15 @@ static void io_req_link_next(struct io_kiocb *req, struct io_kiocb **nxtptr) static void io_fail_links(struct io_kiocb *req) { struct io_ring_ctx *ctx = req->ctx; - struct io_kiocb *link; unsigned long flags;
spin_lock_irqsave(&ctx->completion_lock, flags);
while (!list_empty(&req->link_list)) { - link = list_first_entry(&req->link_list, struct io_kiocb, list); - list_del_init(&link->list); + struct io_kiocb *link = list_first_entry(&req->link_list, + struct io_kiocb, link_list);
+ list_del_init(&link->link_list); trace_io_uring_fail_link(req, link);
if ((req->flags & REQ_F_LINK_TIMEOUT) && @@ -3169,10 +3165,11 @@ static enum hrtimer_restart io_link_timeout_fn(struct hrtimer *timer) * We don't expect the list to be empty, that will only happen if we * race with the completion of the linked work. */ - if (!list_empty(&req->list)) { - prev = list_entry(req->list.prev, struct io_kiocb, link_list); + if (!list_empty(&req->link_list)) { + prev = list_entry(req->link_list.prev, struct io_kiocb, + link_list); if (refcount_inc_not_zero(&prev->refs)) { - list_del_init(&req->list); + list_del_init(&req->link_list); prev->flags &= ~REQ_F_LINK_TIMEOUT; } else prev = NULL; @@ -3202,7 +3199,7 @@ static void io_queue_linked_timeout(struct io_kiocb *req) * we got a chance to setup the timer */ spin_lock_irq(&ctx->completion_lock); - if (!list_empty(&req->list)) { + if (!list_empty(&req->link_list)) { struct io_timeout_data *data = &req->io->timeout;
data->timer.function = io_link_timeout_fn; @@ -3222,7 +3219,8 @@ static struct io_kiocb *io_prep_linked_timeout(struct io_kiocb *req) if (!(req->flags & REQ_F_LINK)) return NULL;
- nxt = list_first_entry_or_null(&req->link_list, struct io_kiocb, list); + nxt = list_first_entry_or_null(&req->link_list, struct io_kiocb, + link_list); if (!nxt || nxt->sqe->opcode != IORING_OP_LINK_TIMEOUT) return NULL;
@@ -3363,7 +3361,7 @@ static bool io_submit_sqe(struct io_kiocb *req, struct io_submit_state *state, goto err_req; } trace_io_uring_link(ctx, req, prev); - list_add_tail(&req->list, &prev->link_list); + list_add_tail(&req->link_list, &prev->link_list); } else if (req->sqe->flags & IOSQE_IO_LINK) { req->flags |= REQ_F_LINK;
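With every member threaded through ->link_list, advancing a chain reduces to ordinary list operations; a distilled fragment of the shape (simplified from io_fail_links() above, not new code):

while (!list_empty(&head->link_list)) {
	struct io_kiocb *nxt = list_first_entry(&head->link_list,
						struct io_kiocb, link_list);

	list_del_init(&nxt->link_list);
	/* hand nxt to the next stage: complete, cancel, or queue it */
}

No splicing of the remainder onto the next head is needed any more, which is exactly the complexity io_req_link_next() loses.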
From: LimingWu 19092205@suning.com
mainline inclusion from mainline-5.5-rc1 commit 0b4295b5e2b9b42f3f3096496fe4775b656c9ba6 category: feature bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27 CVE: NA ---------------------------
thatn -> than.
Signed-off-by: Liming Wu 19092205@suning.com Signed-off-by: Jens Axboe axboe@kernel.dk Signed-off-by: yangerkun yangerkun@huawei.com Reviewed-by: zhangyi (F) yi.zhang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- fs/io_uring.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/fs/io_uring.c b/fs/io_uring.c index 54aaa737ddca..d4dc4e1729a5 100644 --- a/fs/io_uring.c +++ b/fs/io_uring.c @@ -145,7 +145,7 @@ struct io_rings { /* * Number of completion events lost because the queue was full; * this should be avoided by the application by making sure - * there are not more requests pending thatn there is space in + * there are not more requests pending than there is space in * the completion queue. * * Written by the kernel, shouldn't be modified by the
From: Jens Axboe axboe@kernel.dk
mainline inclusion from mainline-5.5-rc2 commit 4e88d6e7793f2f445f43bd608828541d7f43b608 category: feature bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27 CVE: NA ---------------------------
Some commands will invariably end in a failure in the sense that the completion result will be less than zero. One such example is timeouts that don't have a completion count set: they will always complete with -ETIME unless cancelled.
For linked commands, we sever links and fail the rest of the chain if the result is less than zero. Since we have commands where we know that will happen, add IOSQE_IO_HARDLINK as a stronger link that doesn't sever, regardless of the completion result. Note that the link will still sever if we fail submitting the parent request; hard links are only resilient in the presence of completion results for requests that did submit correctly.
Cc: stable@vger.kernel.org # v5.4 Reviewed-by: Pavel Begunkov asml.silence@gmail.com Reported-by: 李通洲 carter.li@eoitek.com Signed-off-by: Jens Axboe axboe@kernel.dk Signed-off-by: yangerkun yangerkun@huawei.com Reviewed-by: zhangyi (F) yi.zhang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- fs/io_uring.c | 84 +++++++++++++++++++---------------- include/uapi/linux/io_uring.h | 1 + 2 files changed, 47 insertions(+), 38 deletions(-)
diff --git a/fs/io_uring.c b/fs/io_uring.c index d4dc4e1729a5..7cf5bc8bd3d9 100644 --- a/fs/io_uring.c +++ b/fs/io_uring.c @@ -377,6 +377,7 @@ struct io_kiocb { #define REQ_F_TIMEOUT_NOSEQ 8192 /* no timeout sequence */ #define REQ_F_INFLIGHT 16384 /* on inflight list */ #define REQ_F_COMP_LOCKED 32768 /* completion under lock */ +#define REQ_F_HARDLINK 65536 /* doesn't sever on completion < 0 */ u64 user_data; u32 result; u32 sequence; @@ -1291,6 +1292,12 @@ static void kiocb_end_write(struct io_kiocb *req) file_end_write(req->file); }
+static inline void req_set_fail_links(struct io_kiocb *req) +{ + if ((req->flags & (REQ_F_LINK | REQ_F_HARDLINK)) == REQ_F_LINK) + req->flags |= REQ_F_FAIL_LINK; +} + static void io_complete_rw_common(struct kiocb *kiocb, long res) { struct io_kiocb *req = container_of(kiocb, struct io_kiocb, rw); @@ -1298,8 +1305,8 @@ static void io_complete_rw_common(struct kiocb *kiocb, long res) if (kiocb->ki_flags & IOCB_WRITE) kiocb_end_write(req);
- if ((req->flags & REQ_F_LINK) && res != req->result) - req->flags |= REQ_F_FAIL_LINK; + if (res != req->result) + req_set_fail_links(req); io_cqring_add_event(req, res); }
@@ -1329,8 +1336,8 @@ static void io_complete_rw_iopoll(struct kiocb *kiocb, long res, long res2) if (kiocb->ki_flags & IOCB_WRITE) kiocb_end_write(req);
- if ((req->flags & REQ_F_LINK) && res != req->result) - req->flags |= REQ_F_FAIL_LINK; + if (res != req->result) + req_set_fail_links(req); req->result = res; if (res != -EAGAIN) req->flags |= REQ_F_IOPOLL_COMPLETED; @@ -1955,8 +1962,8 @@ static int io_fsync(struct io_kiocb *req, const struct io_uring_sqe *sqe, end > 0 ? end : LLONG_MAX, fsync_flags & IORING_FSYNC_DATASYNC);
- if (ret < 0 && (req->flags & REQ_F_LINK)) - req->flags |= REQ_F_FAIL_LINK; + if (ret < 0) + req_set_fail_links(req); io_cqring_add_event(req, ret); io_put_req_find_next(req, nxt); return 0; @@ -2002,8 +2009,8 @@ static int io_sync_file_range(struct io_kiocb *req,
ret = sync_file_range(req->rw.ki_filp, sqe_off, sqe_len, flags);
- if (ret < 0 && (req->flags & REQ_F_LINK)) - req->flags |= REQ_F_FAIL_LINK; + if (ret < 0) + req_set_fail_links(req); io_cqring_add_event(req, ret); io_put_req_find_next(req, nxt); return 0; @@ -2078,8 +2085,8 @@ static int io_sendmsg(struct io_kiocb *req, const struct io_uring_sqe *sqe,
out: io_cqring_add_event(req, ret); - if (ret < 0 && (req->flags & REQ_F_LINK)) - req->flags |= REQ_F_FAIL_LINK; + if (ret < 0) + req_set_fail_links(req); io_put_req_find_next(req, nxt); return 0; #else @@ -2160,8 +2167,8 @@ static int io_recvmsg(struct io_kiocb *req, const struct io_uring_sqe *sqe,
out: io_cqring_add_event(req, ret); - if (ret < 0 && (req->flags & REQ_F_LINK)) - req->flags |= REQ_F_FAIL_LINK; + if (ret < 0) + req_set_fail_links(req); io_put_req_find_next(req, nxt); return 0; #else @@ -2195,8 +2202,8 @@ static int io_accept(struct io_kiocb *req, const struct io_uring_sqe *sqe, } if (ret == -ERESTARTSYS) ret = -EINTR; - if (ret < 0 && (req->flags & REQ_F_LINK)) - req->flags |= REQ_F_FAIL_LINK; + if (ret < 0) + req_set_fail_links(req); io_cqring_add_event(req, ret); io_put_req_find_next(req, nxt); return 0; @@ -2262,8 +2269,8 @@ static int io_connect(struct io_kiocb *req, const struct io_uring_sqe *sqe, if (ret == -ERESTARTSYS) ret = -EINTR; out: - if (ret < 0 && (req->flags & REQ_F_LINK)) - req->flags |= REQ_F_FAIL_LINK; + if (ret < 0) + req_set_fail_links(req); io_cqring_add_event(req, ret); io_put_req_find_next(req, nxt); return 0; @@ -2339,8 +2346,8 @@ static int io_poll_remove(struct io_kiocb *req, const struct io_uring_sqe *sqe) spin_unlock_irq(&ctx->completion_lock);
io_cqring_add_event(req, ret); - if (ret < 0 && (req->flags & REQ_F_LINK)) - req->flags |= REQ_F_FAIL_LINK; + if (ret < 0) + req_set_fail_links(req); io_put_req(req); return 0; } @@ -2398,8 +2405,8 @@ static void io_poll_complete_work(struct io_wq_work **workptr)
io_cqring_ev_posted(ctx);
- if (ret < 0 && req->flags & REQ_F_LINK) - req->flags |= REQ_F_FAIL_LINK; + if (ret < 0) + req_set_fail_links(req); io_put_req_find_next(req, &nxt); if (nxt) *workptr = &nxt->work; @@ -2581,8 +2588,7 @@ static enum hrtimer_restart io_timeout_fn(struct hrtimer *timer) spin_unlock_irqrestore(&ctx->completion_lock, flags);
io_cqring_ev_posted(ctx); - if (req->flags & REQ_F_LINK) - req->flags |= REQ_F_FAIL_LINK; + req_set_fail_links(req); io_put_req(req); return HRTIMER_NORESTART; } @@ -2607,8 +2613,7 @@ static int io_timeout_cancel(struct io_ring_ctx *ctx, __u64 user_data) if (ret == -1) return -EALREADY;
- if (req->flags & REQ_F_LINK) - req->flags |= REQ_F_FAIL_LINK; + req_set_fail_links(req); io_cqring_fill_event(req, -ECANCELED); io_put_req(req); return 0; @@ -2639,8 +2644,8 @@ static int io_timeout_remove(struct io_kiocb *req, io_commit_cqring(ctx); spin_unlock_irq(&ctx->completion_lock); io_cqring_ev_posted(ctx); - if (ret < 0 && req->flags & REQ_F_LINK) - req->flags |= REQ_F_FAIL_LINK; + if (ret < 0) + req_set_fail_links(req); io_put_req(req); return 0; } @@ -2821,8 +2826,8 @@ static void io_async_find_and_cancel(struct io_ring_ctx *ctx, spin_unlock_irqrestore(&ctx->completion_lock, flags); io_cqring_ev_posted(ctx);
- if (ret < 0 && (req->flags & REQ_F_LINK)) - req->flags |= REQ_F_FAIL_LINK; + if (ret < 0) + req_set_fail_links(req); io_put_req_find_next(req, nxt); }
@@ -3043,8 +3048,7 @@ static void io_wq_submit_work(struct io_wq_work **workptr) io_put_req(req);
if (ret) { - if (req->flags & REQ_F_LINK) - req->flags |= REQ_F_FAIL_LINK; + req_set_fail_links(req); io_cqring_add_event(req, ret); io_put_req(req); } @@ -3178,8 +3182,7 @@ static enum hrtimer_restart io_link_timeout_fn(struct hrtimer *timer) spin_unlock_irqrestore(&ctx->completion_lock, flags);
if (prev) { - if (prev->flags & REQ_F_LINK) - prev->flags |= REQ_F_FAIL_LINK; + req_set_fail_links(prev); io_async_find_and_cancel(ctx, req, prev->user_data, NULL, -ETIME); io_put_req(prev); @@ -3272,8 +3275,7 @@ static void __io_queue_sqe(struct io_kiocb *req) /* and drop final reference, if we failed */ if (ret) { io_cqring_add_event(req, ret); - if (req->flags & REQ_F_LINK) - req->flags |= REQ_F_FAIL_LINK; + req_set_fail_links(req); io_put_req(req); } } @@ -3292,8 +3294,7 @@ static void io_queue_sqe(struct io_kiocb *req) if (ret) { if (ret != -EIOCBQUEUED) { io_cqring_add_event(req, ret); - if (req->flags & REQ_F_LINK) - req->flags |= REQ_F_FAIL_LINK; + req_set_fail_links(req); io_double_put_req(req); } } else @@ -3310,7 +3311,8 @@ static inline void io_queue_link_head(struct io_kiocb *req) }
-#define SQE_VALID_FLAGS (IOSQE_FIXED_FILE|IOSQE_IO_DRAIN|IOSQE_IO_LINK) +#define SQE_VALID_FLAGS (IOSQE_FIXED_FILE|IOSQE_IO_DRAIN|IOSQE_IO_LINK| \ + IOSQE_IO_HARDLINK)
static bool io_submit_sqe(struct io_kiocb *req, struct io_submit_state *state, struct io_kiocb **link) @@ -3348,6 +3350,9 @@ static bool io_submit_sqe(struct io_kiocb *req, struct io_submit_state *state, if (req->sqe->flags & IOSQE_IO_DRAIN) (*link)->flags |= REQ_F_DRAIN_LINK | REQ_F_IO_DRAIN;
+ if (req->sqe->flags & IOSQE_IO_HARDLINK) + req->flags |= REQ_F_HARDLINK; + io = kmalloc(sizeof(*io), GFP_KERNEL); if (!io) { ret = -EAGAIN; @@ -3357,13 +3362,16 @@ static bool io_submit_sqe(struct io_kiocb *req, struct io_submit_state *state, ret = io_req_defer_prep(req, io); if (ret) { kfree(io); + /* fail even hard links since we don't submit */ prev->flags |= REQ_F_FAIL_LINK; goto err_req; } trace_io_uring_link(ctx, req, prev); list_add_tail(&req->link_list, &prev->link_list); - } else if (req->sqe->flags & IOSQE_IO_LINK) { + } else if (req->sqe->flags & (IOSQE_IO_LINK|IOSQE_IO_HARDLINK)) { req->flags |= REQ_F_LINK; + if (req->sqe->flags & IOSQE_IO_HARDLINK) + req->flags |= REQ_F_HARDLINK;
INIT_LIST_HEAD(&req->link_list); *link = req; diff --git a/include/uapi/linux/io_uring.h b/include/uapi/linux/io_uring.h index eabccb46edd1..ea231366f5fd 100644 --- a/include/uapi/linux/io_uring.h +++ b/include/uapi/linux/io_uring.h @@ -48,6 +48,7 @@ struct io_uring_sqe { #define IOSQE_FIXED_FILE (1U << 0) /* use fixed fileset */ #define IOSQE_IO_DRAIN (1U << 1) /* issue after inflight IO */ #define IOSQE_IO_LINK (1U << 2) /* links next sqe */ +#define IOSQE_IO_HARDLINK (1U << 3) /* like LINK, but stronger */
/* * io_uring_setup() flags
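A userspace sketch of the new flag, assuming liburing: chain a pure timeout (which always completes with -ETIME) in front of a read, with the hard link keeping the chain alive across the negative result. The helper name is illustrative.

#include <liburing.h>

/* Without IOSQE_IO_HARDLINK the timeout's -ETIME would sever the link
 * and the readv would complete with -ECANCELED. */
static void queue_delayed_read(struct io_uring *ring, int fd,
			       struct iovec *iov)
{
	struct __kernel_timespec ts = { .tv_sec = 1 };
	struct io_uring_sqe *sqe;

	sqe = io_uring_get_sqe(ring);
	io_uring_prep_timeout(sqe, &ts, 0, 0);	/* completes with -ETIME */
	sqe->flags |= IOSQE_IO_HARDLINK;	/* don't sever on res < 0 */

	sqe = io_uring_get_sqe(ring);
	io_uring_prep_readv(sqe, fd, iov, 1, 0);

	io_uring_submit(ring);
}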
From: Jens Axboe axboe@kernel.dk
mainline inclusion from mainline-5.5-rc2 commit 506d95ff5d6aa0a099a116c49d3884e29801d843 category: feature bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27 CVE: NA ---------------------------
We only have one case of using the waitqueue to wake the worker; the rest use wake_up_process(). Since we can save some cycles by not fiddling with the waitqueue in io_wqe_worker(), switch the work activation to task wakeup and get rid of the now unused wait_queue_head_t in struct io_worker.
Signed-off-by: Jens Axboe axboe@kernel.dk Signed-off-by: yangerkun yangerkun@huawei.com Reviewed-by: zhangyi (F) yi.zhang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- fs/io-wq.c | 10 ++-------- 1 file changed, 2 insertions(+), 8 deletions(-)
diff --git a/fs/io-wq.c b/fs/io-wq.c index 25654b5bf853..544da3426954 100644 --- a/fs/io-wq.c +++ b/fs/io-wq.c @@ -50,7 +50,6 @@ struct io_worker { struct hlist_nulls_node nulls_node; struct list_head all_list; struct task_struct *task; - wait_queue_head_t wait; struct io_wqe *wqe;
struct io_wq_work *cur_work; @@ -259,7 +258,7 @@ static bool io_wqe_activate_free_worker(struct io_wqe *wqe)
worker = hlist_nulls_entry(n, struct io_worker, nulls_node); if (io_worker_get(worker)) { - wake_up(&worker->wait); + wake_up_process(worker->task); io_worker_release(worker); return true; } @@ -498,13 +497,11 @@ static int io_wqe_worker(void *data) struct io_worker *worker = data; struct io_wqe *wqe = worker->wqe; struct io_wq *wq = wqe->wq; - DEFINE_WAIT(wait);
io_worker_start(wqe, worker);
while (!test_bit(IO_WQ_BIT_EXIT, &wq->state)) { - prepare_to_wait(&worker->wait, &wait, TASK_INTERRUPTIBLE); - + set_current_state(TASK_INTERRUPTIBLE); spin_lock_irq(&wqe->lock); if (io_wqe_run_queue(wqe)) { __set_current_state(TASK_RUNNING); @@ -527,8 +524,6 @@ static int io_wqe_worker(void *data) break; }
- finish_wait(&worker->wait, &wait); - if (test_bit(IO_WQ_BIT_EXIT, &wq->state)) { spin_lock_irq(&wqe->lock); if (!wq_list_empty(&wqe->work_list)) @@ -590,7 +585,6 @@ static bool create_io_worker(struct io_wq *wq, struct io_wqe *wqe, int index)
refcount_set(&worker->ref, 1); worker->nulls_node.pprev = NULL; - init_waitqueue_head(&worker->wait); worker->wqe = wqe; spin_lock_init(&worker->lock);
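The idiom this relies on, as a generic sketch (work_available() is a stand-in for io_wqe_run_queue(), not a real function): the worker publishes TASK_INTERRUPTIBLE before re-checking the queue, so a producer's wake_up_process() either finds the worker still running or moves it back to runnable, and no waitqueue is required.

/* consumer (worker thread) */
set_current_state(TASK_INTERRUPTIBLE);
if (!work_available())		/* re-check after publishing the state */
	schedule();
__set_current_state(TASK_RUNNING);

/* producer */
wake_up_process(worker->task);	/* replaces the prepare_to_wait() dance */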
From: Jens Axboe axboe@kernel.dk
mainline inclusion from mainline-5.5-rc2 commit e995d5123ed433e37a8d63ac528737c912592e3d category: feature bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27 CVE: NA ---------------------------
To avoid going to sleep only to be woken shortly thereafter, spin briefly for new work after completing a piece of work.
Signed-off-by: Jens Axboe axboe@kernel.dk Signed-off-by: yangerkun yangerkun@huawei.com Reviewed-by: zhangyi (F) yi.zhang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- fs/io-wq.c | 24 ++++++++++++++++++++++-- fs/io-wq.h | 7 ++++--- 2 files changed, 26 insertions(+), 5 deletions(-)
diff --git a/fs/io-wq.c b/fs/io-wq.c index 544da3426954..1be307a174b5 100644 --- a/fs/io-wq.c +++ b/fs/io-wq.c @@ -492,26 +492,46 @@ static void io_worker_handle_work(struct io_worker *worker) } while (1); }
+static inline void io_worker_spin_for_work(struct io_wqe *wqe) +{ + int i = 0; + + while (++i < 1000) { + if (io_wqe_run_queue(wqe)) + break; + if (need_resched()) + break; + cpu_relax(); + } +} + static int io_wqe_worker(void *data) { struct io_worker *worker = data; struct io_wqe *wqe = worker->wqe; struct io_wq *wq = wqe->wq; + bool did_work;
io_worker_start(wqe, worker);
+ did_work = false; while (!test_bit(IO_WQ_BIT_EXIT, &wq->state)) { set_current_state(TASK_INTERRUPTIBLE); +loop: + if (did_work) + io_worker_spin_for_work(wqe); spin_lock_irq(&wqe->lock); if (io_wqe_run_queue(wqe)) { __set_current_state(TASK_RUNNING); io_worker_handle_work(worker); - continue; + did_work = true; + goto loop; } + did_work = false; /* drops the lock on success, retry */ if (__io_worker_idle(wqe, worker)) { __release(&wqe->lock); - continue; + goto loop; } spin_unlock_irq(&wqe->lock); if (signal_pending(current)) diff --git a/fs/io-wq.h b/fs/io-wq.h index 7c333a28e2a7..fb993b2bd0ef 100644 --- a/fs/io-wq.h +++ b/fs/io-wq.h @@ -35,7 +35,8 @@ static inline void wq_list_add_tail(struct io_wq_work_node *node, struct io_wq_work_list *list) { if (!list->first) { - list->first = list->last = node; + list->last = node; + WRITE_ONCE(list->first, node); } else { list->last->next = node; list->last = node; @@ -47,7 +48,7 @@ static inline void wq_node_del(struct io_wq_work_list *list, struct io_wq_work_node *prev) { if (node == list->first) - list->first = node->next; + WRITE_ONCE(list->first, node->next); if (node == list->last) list->last = prev; if (prev) @@ -58,7 +59,7 @@ static inline void wq_node_del(struct io_wq_work_list *list, #define wq_list_for_each(pos, prv, head) \ for (pos = (head)->first, prv = NULL; pos; prv = pos, pos = (pos)->next)
-#define wq_list_empty(list) ((list)->first == NULL) +#define wq_list_empty(list) (READ_ONCE((list)->first) == NULL) #define INIT_WQ_LIST(list) do { \ (list)->first = NULL; \ (list)->last = NULL; \
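The READ_ONCE()/WRITE_ONCE() pairing is needed because the spinner now polls the list head without taking wqe->lock. A re-statement of io_worker_spin_for_work() with the memory-ordering rationale spelled out in comments (a sketch, not new code; mutation still happens under the lock):

/* Without READ_ONCE() the compiler could hoist the load out of the
 * loop and spin on a stale register copy until need_resched() fires. */
while (++i < 1000) {
	if (!wq_list_empty(&wqe->work_list))	/* READ_ONCE inside */
		break;
	if (need_resched())
		break;
	cpu_relax();
}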
From: Jens Axboe axboe@kernel.dk
mainline inclusion from mainline-5.5-rc2 commit 8a4955ff1cca7d4da480774034a16e7c28bafec8 category: feature bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27 CVE: NA ---------------------------
We use the uring_lock mutex to guard against registered file updates, for instance. Ensure we hold it while accessing that state, so we are safe against concurrent updates.
Signed-off-by: Jens Axboe axboe@kernel.dk Signed-off-by: yangerkun yangerkun@huawei.com Reviewed-by: zhangyi (F) yi.zhang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- fs/io_uring.c | 7 ++----- 1 file changed, 2 insertions(+), 5 deletions(-)
diff --git a/fs/io_uring.c b/fs/io_uring.c index 7cf5bc8bd3d9..58346af2fc13 100644 --- a/fs/io_uring.c +++ b/fs/io_uring.c @@ -2995,12 +2995,7 @@ static int io_issue_sqe(struct io_kiocb *req, struct io_kiocb **nxt, if (req->result == -EAGAIN) return -EAGAIN;
- /* workqueue context doesn't hold uring_lock, grab it now */ - if (req->in_async) - mutex_lock(&ctx->uring_lock); io_iopoll_req_issued(req); - if (req->in_async) - mutex_unlock(&ctx->uring_lock); }
return 0; @@ -3654,7 +3649,9 @@ static int io_sq_thread(void *data) }
to_submit = min(to_submit, ctx->sq_entries); + mutex_lock(&ctx->uring_lock); ret = io_submit_sqes(ctx, to_submit, NULL, -1, &cur_mm, true); + mutex_unlock(&ctx->uring_lock); if (ret > 0) inflight += ret; }
From: Jens Axboe axboe@kernel.dk
mainline inclusion from mainline-5.5-rc2 commit d96885658d9971fc2c752b8699f17a42ef745db6 category: feature bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27 CVE: NA ---------------------------
Don't just assign the iovec pointer from the main call path; that misses the case where the prep helper is invoked from issue deferral, leaving io->msg.iov uninitialized.
Signed-off-by: Jens Axboe axboe@kernel.dk Signed-off-by: yangerkun yangerkun@huawei.com Reviewed-by: zhangyi (F) yi.zhang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- fs/io_uring.c | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-)
diff --git a/fs/io_uring.c b/fs/io_uring.c index 58346af2fc13..544ac00f32a1 100644 --- a/fs/io_uring.c +++ b/fs/io_uring.c @@ -2025,6 +2025,7 @@ static int io_sendmsg_prep(struct io_kiocb *req, struct io_async_ctx *io)
flags = READ_ONCE(sqe->msg_flags); msg = (struct user_msghdr __user *)(unsigned long) READ_ONCE(sqe->addr); + io->msg.iov = io->msg.fast_iov; return sendmsg_copy_msghdr(&io->msg.msg, msg, flags, &io->msg.iov); #else return 0; @@ -2060,7 +2061,6 @@ static int io_sendmsg(struct io_kiocb *req, const struct io_uring_sqe *sqe, } else { kmsg = &io.msg.msg; kmsg->msg_name = &addr; - io.msg.iov = io.msg.fast_iov; ret = io_sendmsg_prep(req, &io); if (ret) goto out; @@ -2103,6 +2103,7 @@ static int io_recvmsg_prep(struct io_kiocb *req, struct io_async_ctx *io)
flags = READ_ONCE(sqe->msg_flags); msg = (struct user_msghdr __user *)(unsigned long) READ_ONCE(sqe->addr); + io->msg.iov = io->msg.fast_iov; return recvmsg_copy_msghdr(&io->msg.msg, msg, flags, &io->msg.uaddr, &io->msg.iov); #else @@ -2142,7 +2143,6 @@ static int io_recvmsg(struct io_kiocb *req, const struct io_uring_sqe *sqe, } else { kmsg = &io.msg.msg; kmsg->msg_name = &addr; - io.msg.iov = io.msg.fast_iov; ret = io_recvmsg_prep(req, &io); if (ret) goto out;
From: Jens Axboe axboe@kernel.dk
mainline inclusion from mainline-5.5-rc2 commit 392edb45b24337eaa0bc1ecd4e3cf897e662ec61 category: feature bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27 CVE: NA ---------------------------
This essentially reverts commit e944475e6984. For workloads with a high rate of poll operations, like TAO, the dynamic allocation of the wait_queue entry for IORING_OP_POLL_ADD adds considerable extra overhead. Go back to embedding the wait_queue_entry, but keep the use of wait->private for the pointer stashing.
Signed-off-by: Jens Axboe axboe@kernel.dk Signed-off-by: yangerkun yangerkun@huawei.com Reviewed-by: zhangyi (F) yi.zhang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- fs/io_uring.c | 27 +++++++++++---------------- 1 file changed, 11 insertions(+), 16 deletions(-)
diff --git a/fs/io_uring.c b/fs/io_uring.c index 544ac00f32a1..1c5d199fb2d1 100644 --- a/fs/io_uring.c +++ b/fs/io_uring.c @@ -293,7 +293,7 @@ struct io_poll_iocb { __poll_t events; bool done; bool canceled; - struct wait_queue_entry *wait; + struct wait_queue_entry wait; };
struct io_timeout_data { @@ -2285,8 +2285,8 @@ static void io_poll_remove_one(struct io_kiocb *req)
spin_lock(&poll->head->lock); WRITE_ONCE(poll->canceled, true); - if (!list_empty(&poll->wait->entry)) { - list_del_init(&poll->wait->entry); + if (!list_empty(&poll->wait.entry)) { + list_del_init(&poll->wait.entry); io_queue_async_work(req); } spin_unlock(&poll->head->lock); @@ -2357,7 +2357,6 @@ static void io_poll_complete(struct io_kiocb *req, __poll_t mask, int error) struct io_ring_ctx *ctx = req->ctx;
req->poll.done = true; - kfree(req->poll.wait); if (error) io_cqring_fill_event(req, error); else @@ -2395,7 +2394,7 @@ static void io_poll_complete_work(struct io_wq_work **workptr) */ spin_lock_irq(&ctx->completion_lock); if (!mask && ret != -ECANCELED) { - add_wait_queue(poll->head, poll->wait); + add_wait_queue(poll->head, &poll->wait); spin_unlock_irq(&ctx->completion_lock); return; } @@ -2425,7 +2424,7 @@ static int io_poll_wake(struct wait_queue_entry *wait, unsigned mode, int sync, if (mask && !(mask & poll->events)) return 0;
- list_del_init(&poll->wait->entry); + list_del_init(&poll->wait.entry);
/* * Run completion inline if we can. We're using trylock here because @@ -2466,7 +2465,7 @@ static void io_poll_queue_proc(struct file *file, struct wait_queue_head *head,
pt->error = 0; pt->req->poll.head = head; - add_wait_queue(head, pt->req->poll.wait); + add_wait_queue(head, &pt->req->poll.wait); }
static void io_poll_req_insert(struct io_kiocb *req) @@ -2495,10 +2494,6 @@ static int io_poll_add(struct io_kiocb *req, const struct io_uring_sqe *sqe, if (!poll->file) return -EBADF;
- poll->wait = kmalloc(sizeof(*poll->wait), GFP_KERNEL); - if (!poll->wait) - return -ENOMEM; - req->io = NULL; INIT_IO_WORK(&req->work, io_poll_complete_work); events = READ_ONCE(sqe->poll_events); @@ -2515,9 +2510,9 @@ static int io_poll_add(struct io_kiocb *req, const struct io_uring_sqe *sqe, ipt.error = -EINVAL; /* same as no support for IOCB_CMD_POLL */
/* initialized the list so that we can do list_empty checks */ - INIT_LIST_HEAD(&poll->wait->entry); - init_waitqueue_func_entry(poll->wait, io_poll_wake); - poll->wait->private = poll; + INIT_LIST_HEAD(&poll->wait.entry); + init_waitqueue_func_entry(&poll->wait, io_poll_wake); + poll->wait.private = poll;
INIT_LIST_HEAD(&req->list);
@@ -2526,14 +2521,14 @@ static int io_poll_add(struct io_kiocb *req, const struct io_uring_sqe *sqe, spin_lock_irq(&ctx->completion_lock); if (likely(poll->head)) { spin_lock(&poll->head->lock); - if (unlikely(list_empty(&poll->wait->entry))) { + if (unlikely(list_empty(&poll->wait.entry))) { if (ipt.error) cancel = true; ipt.error = 0; mask = 0; } if (mask || ipt.error) - list_del_init(&poll->wait->entry); + list_del_init(&poll->wait.entry); else if (cancel) WRITE_ONCE(poll->canceled, true); else if (!poll->done) /* actually waiting for an event */
From: Jens Axboe axboe@kernel.dk
mainline inclusion from mainline-5.5-rc2 commit 4a0a7a187453e65bdd24b9ede045b4c36b958868 category: feature bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27 CVE: NA ---------------------------
One major use case of linked commands is the ability to run the next link inline, if at all possible. This is done correctly for async offload, but somewhere along the line we lost the ability to do so when we were able to complete a request without having to punt it. Ensure that we do so correctly.
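Condensed from the hunks below, the resulting control flow in __io_queue_sqe() (error and punt paths elided):

static void __io_queue_sqe(struct io_kiocb *req)
{
	struct io_kiocb *linked_timeout;
	struct io_kiocb *nxt = NULL;
	int ret;

again:
	linked_timeout = io_prep_linked_timeout(req);
	ret = io_issue_sqe(req, &nxt, true);
	/* on punt: io_queue_async_work(req) and goto done_req */
done_req:
	if (nxt) {
		/* the request completed inline; issue its link inline too */
		req = nxt;
		nxt = NULL;
		goto again;
	}
}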
Signed-off-by: Jens Axboe axboe@kernel.dk Signed-off-by: yangerkun yangerkun@huawei.com Reviewed-by: zhangyi (F) yi.zhang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- fs/io_uring.c | 15 +++++++++++---- 1 file changed, 11 insertions(+), 4 deletions(-)
diff --git a/fs/io_uring.c b/fs/io_uring.c index 1c5d199fb2d1..b5103c9202f1 100644 --- a/fs/io_uring.c +++ b/fs/io_uring.c @@ -3223,13 +3223,14 @@ static struct io_kiocb *io_prep_linked_timeout(struct io_kiocb *req)
static void __io_queue_sqe(struct io_kiocb *req) { - struct io_kiocb *linked_timeout = io_prep_linked_timeout(req); + struct io_kiocb *linked_timeout; struct io_kiocb *nxt = NULL; int ret;
+again: + linked_timeout = io_prep_linked_timeout(req); + ret = io_issue_sqe(req, &nxt, true); - if (nxt) - io_queue_async_work(nxt);
/* * We async punt it if the file wasn't marked NOWAIT, or if the file @@ -3248,7 +3249,7 @@ static void __io_queue_sqe(struct io_kiocb *req) * submit reference when the iocb is actually submitted. */ io_queue_async_work(req); - return; + goto done_req; }
err: @@ -3268,6 +3269,12 @@ static void __io_queue_sqe(struct io_kiocb *req) req_set_fail_links(req); io_put_req(req); } +done_req: + if (nxt) { + req = nxt; + nxt = NULL; + goto again; + } }
static void io_queue_sqe(struct io_kiocb *req)
From: Jens Axboe axboe@kernel.dk
mainline inclusion from mainline-5.5-rc2 commit 53108d476a105ab2597d7a4e6040b127829391b5 category: feature bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27 CVE: NA ---------------------------
We hash regular files to avoid having multiple threads hammer on the inode mutex, but it should not be needed on other types of files (like sockets).
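Condensed from the hunk below (surrounding work-queue setup elided):

	switch (req->sqe->opcode) {
	case IORING_OP_WRITEV:
	case IORING_OP_WRITE_FIXED:
		/* only regular files should be hashed for writes */
		if (req->flags & REQ_F_ISREG)
			do_hashed = true;
		/* fall-through */
	case IORING_OP_READV:
	case IORING_OP_READ_FIXED:
		/* read/write handling continues unchanged */
		break;
	}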
Signed-off-by: Jens Axboe axboe@kernel.dk Signed-off-by: yangerkun yangerkun@huawei.com Reviewed-by: zhangyi (F) yi.zhang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- fs/io_uring.c | 4 +++- 1 file changed, 3 insertions(+), 1 deletion(-)
diff --git a/fs/io_uring.c b/fs/io_uring.c index b5103c9202f1..3f5dff057c67 100644 --- a/fs/io_uring.c +++ b/fs/io_uring.c @@ -580,7 +580,9 @@ static inline bool io_prep_async_work(struct io_kiocb *req, switch (req->sqe->opcode) { case IORING_OP_WRITEV: case IORING_OP_WRITE_FIXED: - do_hashed = true; + /* only regular files should be hashed for writes */ + if (req->flags & REQ_F_ISREG) + do_hashed = true; /* fall-through */ case IORING_OP_READV: case IORING_OP_READ_FIXED:
From: Jens Axboe axboe@kernel.dk
mainline inclusion from mainline-5.5-rc2 commit 10d59345578a116042c1a5d737a18234aaf3e0e6 category: feature bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27 CVE: NA ---------------------------
In chasing a performance issue between using IORING_OP_RECVMSG and IORING_OP_READV on sockets, tracing showed that we always punt the socket reads to async offload. This is due to io_file_supports_async() not checking for S_ISSOCK on the inode. Since sockets support the O_NONBLOCK (or MSG_DONTWAIT) flag just fine, add sockets to the list of file types that we can do a non-blocking issue to.
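With the change, the check reads as follows (sketch from the hunks below; the trailing return is assumed from the unchanged function body):

static bool io_file_supports_async(struct file *file)
{
	umode_t mode = file_inode(file)->i_mode;

	/* block/char devices and now sockets take a non-blocking issue */
	if (S_ISBLK(mode) || S_ISCHR(mode) || S_ISSOCK(mode))
		return true;
	if (S_ISREG(mode) && file->f_op != &io_uring_fops)
		return true;

	return false;
}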
Signed-off-by: Jens Axboe axboe@kernel.dk Signed-off-by: yangerkun yangerkun@huawei.com Reviewed-by: zhangyi (F) yi.zhang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- fs/io_uring.c | 6 ++++-- 1 file changed, 4 insertions(+), 2 deletions(-)
diff --git a/fs/io_uring.c b/fs/io_uring.c index 3f5dff057c67..3276be109b98 100644 --- a/fs/io_uring.c +++ b/fs/io_uring.c @@ -1430,7 +1430,7 @@ static bool io_file_supports_async(struct file *file) { umode_t mode = file_inode(file)->i_mode;
- if (S_ISBLK(mode) || S_ISCHR(mode)) + if (S_ISBLK(mode) || S_ISCHR(mode) || S_ISSOCK(mode)) return true; if (S_ISREG(mode) && file->f_op != &io_uring_fops) return true; @@ -1866,7 +1866,9 @@ static int io_write(struct io_kiocb *req, struct io_kiocb **nxt, goto copy_iov; }
- if (force_nonblock && !(kiocb->ki_flags & IOCB_DIRECT)) + /* file path doesn't support NOWAIT for non-direct_IO */ + if (force_nonblock && !(kiocb->ki_flags & IOCB_DIRECT) && + (req->flags & REQ_F_ISREG)) goto copy_iov;
iov_count = iov_iter_count(&iter);
From: Jens Axboe axboe@kernel.dk
mainline inclusion from mainline-5.5-rc2 commit 9e3aa61ae3e01ce1ce6361a41ef725e1f4d1d2bf category: feature bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27 CVE: NA ---------------------------
If we submit an unknown opcode and have fd == -1, io_op_needs_file() will return true as we default to needing a file. Then when we go and assign the file, we find the 'fd' invalid and return -EBADF. We really should be returning -EINVAL for that case, as we normally do for unsupported opcodes.
Change io_op_needs_file() to have the following return values:
 0   - does not need a file
 1   - does need a file
 < 0 - error value
and use this to pass back the right value for this invalid case.
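Sketched from the hunks below (the no-file case labels are abbreviated):

static bool io_req_op_valid(int op)
{
	return op >= IORING_OP_NOP && op < IORING_OP_LAST;
}

static int io_op_needs_file(const struct io_uring_sqe *sqe)
{
	int op = READ_ONCE(sqe->opcode);

	switch (op) {
	case IORING_OP_TIMEOUT_REMOVE:
	case IORING_OP_ASYNC_CANCEL:
	case IORING_OP_LINK_TIMEOUT:
		/* ...and the other opcodes that take no file... */
		return 0;
	default:
		if (io_req_op_valid(op))
			return 1;
		return -EINVAL;	/* unknown opcode */
	}
}

and the caller in io_req_set_file() propagates it:

	ret = io_op_needs_file(req->sqe);
	if (ret <= 0)
		return ret;	/* 0: no file needed; < 0: -EINVAL for bad opcode */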
Signed-off-by: Jens Axboe axboe@kernel.dk Signed-off-by: yangerkun yangerkun@huawei.com Reviewed-by: zhangyi (F) yi.zhang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- fs/io_uring.c | 21 ++++++++++++------- include/uapi/linux/io_uring.h | 39 ++++++++++++++++++++--------------- 2 files changed, 36 insertions(+), 24 deletions(-)
diff --git a/fs/io_uring.c b/fs/io_uring.c index 3276be109b98..17103383d146 100644 --- a/fs/io_uring.c +++ b/fs/io_uring.c @@ -3061,7 +3061,12 @@ static void io_wq_submit_work(struct io_wq_work **workptr) } }
-static bool io_op_needs_file(const struct io_uring_sqe *sqe) +static bool io_req_op_valid(int op) +{ + return op >= IORING_OP_NOP && op < IORING_OP_LAST; +} + +static int io_op_needs_file(const struct io_uring_sqe *sqe) { int op = READ_ONCE(sqe->opcode);
@@ -3072,9 +3077,11 @@ static bool io_op_needs_file(const struct io_uring_sqe *sqe) case IORING_OP_TIMEOUT_REMOVE: case IORING_OP_ASYNC_CANCEL: case IORING_OP_LINK_TIMEOUT: - return false; + return 0; default: - return true; + if (io_req_op_valid(op)) + return 1; + return -EINVAL; } }
@@ -3091,7 +3098,7 @@ static int io_req_set_file(struct io_submit_state *state, struct io_kiocb *req) { struct io_ring_ctx *ctx = req->ctx; unsigned flags; - int fd; + int fd, ret;
flags = READ_ONCE(req->sqe->flags); fd = READ_ONCE(req->sqe->fd); @@ -3099,8 +3106,9 @@ static int io_req_set_file(struct io_submit_state *state, struct io_kiocb *req) if (flags & IOSQE_IO_DRAIN) req->flags |= REQ_F_IO_DRAIN;
- if (!io_op_needs_file(req->sqe)) - return 0; + ret = io_op_needs_file(req->sqe); + if (ret <= 0) + return ret;
if (flags & IOSQE_FIXED_FILE) { if (unlikely(!ctx->file_table || @@ -3311,7 +3319,6 @@ static inline void io_queue_link_head(struct io_kiocb *req) io_queue_sqe(req); }
- #define SQE_VALID_FLAGS (IOSQE_FIXED_FILE|IOSQE_IO_DRAIN|IOSQE_IO_LINK| \ IOSQE_IO_HARDLINK)
diff --git a/include/uapi/linux/io_uring.h b/include/uapi/linux/io_uring.h index ea231366f5fd..a3300e1b9a01 100644 --- a/include/uapi/linux/io_uring.h +++ b/include/uapi/linux/io_uring.h @@ -58,23 +58,28 @@ struct io_uring_sqe { #define IORING_SETUP_SQ_AFF (1U << 2) /* sq_thread_cpu is valid */ #define IORING_SETUP_CQSIZE (1U << 3) /* app defines CQ size */
-#define IORING_OP_NOP 0 -#define IORING_OP_READV 1 -#define IORING_OP_WRITEV 2 -#define IORING_OP_FSYNC 3 -#define IORING_OP_READ_FIXED 4 -#define IORING_OP_WRITE_FIXED 5 -#define IORING_OP_POLL_ADD 6 -#define IORING_OP_POLL_REMOVE 7 -#define IORING_OP_SYNC_FILE_RANGE 8 -#define IORING_OP_SENDMSG 9 -#define IORING_OP_RECVMSG 10 -#define IORING_OP_TIMEOUT 11 -#define IORING_OP_TIMEOUT_REMOVE 12 -#define IORING_OP_ACCEPT 13 -#define IORING_OP_ASYNC_CANCEL 14 -#define IORING_OP_LINK_TIMEOUT 15 -#define IORING_OP_CONNECT 16 +enum { + IORING_OP_NOP, + IORING_OP_READV, + IORING_OP_WRITEV, + IORING_OP_FSYNC, + IORING_OP_READ_FIXED, + IORING_OP_WRITE_FIXED, + IORING_OP_POLL_ADD, + IORING_OP_POLL_REMOVE, + IORING_OP_SYNC_FILE_RANGE, + IORING_OP_SENDMSG, + IORING_OP_RECVMSG, + IORING_OP_TIMEOUT, + IORING_OP_TIMEOUT_REMOVE, + IORING_OP_ACCEPT, + IORING_OP_ASYNC_CANCEL, + IORING_OP_LINK_TIMEOUT, + IORING_OP_CONNECT, + + /* this goes last, obviously */ + IORING_OP_LAST, +};
/* * sqe->fsync_flags
From: Brian Gianforcaro b.gianfo@gmail.com
mainline inclusion from mainline-5.5-rc3 commit d195a66e367b3d24fdd3c3565f37ab7c6882b9d2 category: feature bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27 CVE: NA ---------------------------
- Fix a few typos found while reading the code.
- Fix stale io_get_sqring comment referencing s->sqe, the 's' parameter was renamed to 'req', but the comment still holds.
Signed-off-by: Brian Gianforcaro b.gianfo@gmail.com Signed-off-by: Jens Axboe axboe@kernel.dk Signed-off-by: yangerkun yangerkun@huawei.com Reviewed-by: zhangyi (F) yi.zhang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- fs/io-wq.c | 2 +- fs/io_uring.c | 8 ++++---- 2 files changed, 5 insertions(+), 5 deletions(-)
diff --git a/fs/io-wq.c b/fs/io-wq.c index 1be307a174b5..e38e3c6e30f7 100644 --- a/fs/io-wq.c +++ b/fs/io-wq.c @@ -949,7 +949,7 @@ static enum io_wq_cancel io_wqe_cancel_work(struct io_wqe *wqe, /* * Now check if a free (going busy) or busy worker has the work * currently running. If we find it there, we'll return CANCEL_RUNNING - * as an indication that we attempte to signal cancellation. The + * as an indication that we attempt to signal cancellation. The * completion will run normally in this case. */ rcu_read_lock(); diff --git a/fs/io_uring.c b/fs/io_uring.c index 17103383d146..7c388b4331a3 100644 --- a/fs/io_uring.c +++ b/fs/io_uring.c @@ -1177,7 +1177,7 @@ static int io_do_iopoll(struct io_ring_ctx *ctx, unsigned int *nr_events, }
/* - * Poll for a mininum of 'min' events. Note that if min == 0 we consider that a + * Poll for a minimum of 'min' events. Note that if min == 0 we consider that a * non-spinning poll check - we'll still enter the driver poll loop, but only * as a non-spinning completion check. */ @@ -2572,7 +2572,7 @@ static enum hrtimer_restart io_timeout_fn(struct hrtimer *timer)
/* * Adjust the reqs sequence before the current one because it - * will consume a slot in the cq_ring and the the cq_tail + * will consume a slot in the cq_ring and the cq_tail * pointer will be increased, otherwise other timeout reqs may * return in advance without waiting for enough wait_nr. */ @@ -3429,7 +3429,7 @@ static void io_commit_sqring(struct io_ring_ctx *ctx) }
/* - * Fetch an sqe, if one is available. Note that s->sqe will point to memory + * Fetch an sqe, if one is available. Note that req->sqe will point to memory * that is mapped by userspace. This means that care needs to be taken to * ensure that reads are stable, as we cannot rely on userspace always * being a good citizen. If members of the sqe are validated and then later @@ -3693,7 +3693,7 @@ static inline bool io_should_wake(struct io_wait_queue *iowq, bool noflush) struct io_ring_ctx *ctx = iowq->ctx;
/* - * Wake up if we have enough events, or if a timeout occured since we + * Wake up if we have enough events, or if a timeout occurred since we * started waiting. For timeouts, we always want to return to userspace, * regardless of event count. */
From: Jens Axboe axboe@kernel.dk
mainline inclusion from mainline-5.5-rc3 commit 0b416c3e1345fd696db4c422643468d844410877 category: feature bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27 CVE: NA ---------------------------
If we have to punt the recvmsg to async context, we copy all the context. But since the iovec used can be either on-stack (if small) or dynamically allocated, if it's on-stack, then we need to ensure we reset the iov pointer. If we don't, then we're reusing old stack data, and that can lead to -EFAULTs if things get overwritten.
Ensure we retain the right pointers for the iov, and free it as well if we end up having to go beyond UIO_FASTIOV number of vectors.
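The key fix, condensed from the hunks below: when an already-copied async context is reused, point the iterator back at a valid iovec before issuing, and free a dynamically allocated one on the way out.

	if (req->io) {
		kmsg = &req->io->msg;
		kmsg->msg.msg_name = &addr;
		/* if iov is set, it's allocated already */
		if (!kmsg->iov)
			kmsg->iov = kmsg->fast_iov;	/* reset stale on-stack copy */
		kmsg->msg.msg_iter.iov = kmsg->iov;
	}
	/* issue and -EAGAIN handling elided */
out:
	/* only allocated when more than UIO_FASTIOV vectors were needed */
	if (kmsg && kmsg->iov != kmsg->fast_iov)
		kfree(kmsg->iov);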
Fixes: 03b1230ca12a ("io_uring: ensure async punted sendmsg/recvmsg requests copy data") Reported-by: 李通洲 carter.li@eoitek.com Signed-off-by: Jens Axboe axboe@kernel.dk Signed-off-by: yangerkun yangerkun@huawei.com Reviewed-by: zhangyi (F) yi.zhang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- fs/io_uring.c | 40 ++++++++++++++++++++++++++-------------- 1 file changed, 26 insertions(+), 14 deletions(-)
diff --git a/fs/io_uring.c b/fs/io_uring.c index 7c388b4331a3..2dd3f52614ea 100644 --- a/fs/io_uring.c +++ b/fs/io_uring.c @@ -2040,6 +2040,7 @@ static int io_sendmsg(struct io_kiocb *req, const struct io_uring_sqe *sqe, struct io_kiocb **nxt, bool force_nonblock) { #if defined(CONFIG_NET) + struct io_async_msghdr *kmsg = NULL; struct socket *sock; int ret;
@@ -2050,7 +2051,6 @@ static int io_sendmsg(struct io_kiocb *req, const struct io_uring_sqe *sqe, if (sock) { struct io_async_ctx io, *copy; struct sockaddr_storage addr; - struct msghdr *kmsg; unsigned flags;
flags = READ_ONCE(sqe->msg_flags); @@ -2060,17 +2060,21 @@ static int io_sendmsg(struct io_kiocb *req, const struct io_uring_sqe *sqe, flags |= MSG_DONTWAIT;
if (req->io) { - kmsg = &req->io->msg.msg; - kmsg->msg_name = &addr; + kmsg = &req->io->msg; + kmsg->msg.msg_name = &addr; + /* if iov is set, it's allocated already */ + if (!kmsg->iov) + kmsg->iov = kmsg->fast_iov; + kmsg->msg.msg_iter.iov = kmsg->iov; } else { - kmsg = &io.msg.msg; - kmsg->msg_name = &addr; + kmsg = &io.msg; + kmsg->msg.msg_name = &addr; ret = io_sendmsg_prep(req, &io); if (ret) goto out; }
- ret = __sys_sendmsg_sock(sock, kmsg, flags); + ret = __sys_sendmsg_sock(sock, &kmsg->msg, flags); if (force_nonblock && ret == -EAGAIN) { copy = kmalloc(sizeof(*copy), GFP_KERNEL); if (!copy) { @@ -2081,13 +2085,15 @@ static int io_sendmsg(struct io_kiocb *req, const struct io_uring_sqe *sqe, req->io = copy; memcpy(&req->io->sqe, req->sqe, sizeof(*req->sqe)); req->sqe = &req->io->sqe; - return ret; + return -EAGAIN; } if (ret == -ERESTARTSYS) ret = -EINTR; }
out: + if (kmsg && kmsg->iov != kmsg->fast_iov) + kfree(kmsg->iov); io_cqring_add_event(req, ret); if (ret < 0) req_set_fail_links(req); @@ -2119,6 +2125,7 @@ static int io_recvmsg(struct io_kiocb *req, const struct io_uring_sqe *sqe, struct io_kiocb **nxt, bool force_nonblock) { #if defined(CONFIG_NET) + struct io_async_msghdr *kmsg = NULL; struct socket *sock; int ret;
@@ -2130,7 +2137,6 @@ static int io_recvmsg(struct io_kiocb *req, const struct io_uring_sqe *sqe, struct user_msghdr __user *msg; struct io_async_ctx io, *copy; struct sockaddr_storage addr; - struct msghdr *kmsg; unsigned flags;
flags = READ_ONCE(sqe->msg_flags); @@ -2142,17 +2148,21 @@ static int io_recvmsg(struct io_kiocb *req, const struct io_uring_sqe *sqe, msg = (struct user_msghdr __user *) (unsigned long) READ_ONCE(sqe->addr); if (req->io) { - kmsg = &req->io->msg.msg; - kmsg->msg_name = &addr; + kmsg = &req->io->msg; + kmsg->msg.msg_name = &addr; + /* if iov is set, it's allocated already */ + if (!kmsg->iov) + kmsg->iov = kmsg->fast_iov; + kmsg->msg.msg_iter.iov = kmsg->iov; } else { - kmsg = &io.msg.msg; - kmsg->msg_name = &addr; + kmsg = &io.msg; + kmsg->msg.msg_name = &addr; ret = io_recvmsg_prep(req, &io); if (ret) goto out; }
- ret = __sys_recvmsg_sock(sock, kmsg, msg, io.msg.uaddr, flags); + ret = __sys_recvmsg_sock(sock, &kmsg->msg, msg, kmsg->uaddr, flags); if (force_nonblock && ret == -EAGAIN) { copy = kmalloc(sizeof(*copy), GFP_KERNEL); if (!copy) { @@ -2163,13 +2173,15 @@ static int io_recvmsg(struct io_kiocb *req, const struct io_uring_sqe *sqe, req->io = copy; memcpy(&req->io->sqe, req->sqe, sizeof(*req->sqe)); req->sqe = &req->io->sqe; - return ret; + return -EAGAIN; } if (ret == -ERESTARTSYS) ret = -EINTR; }
out: + if (kmsg && kmsg->iov != kmsg->fast_iov) + kfree(kmsg->iov); io_cqring_add_event(req, ret); if (ret < 0) req_set_fail_links(req);
From: Jens Axboe axboe@kernel.dk
mainline inclusion from mainline-5.5-rc3 commit 525b305d61ede489ce2118b000a5dabd6d869dac category: feature bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27 CVE: NA ---------------------------
This reverts commit 8cdda87a4414; we now have several use cases for this helper. Reinstate it.
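The reinstated helper, from the hunk below:

static inline bool io_wq_current_is_worker(void)
{
	return in_task() && (current->flags & PF_IO_WORKER);
}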
Signed-off-by: Jens Axboe axboe@kernel.dk Signed-off-by: yangerkun yangerkun@huawei.com Reviewed-by: zhangyi (F) yi.zhang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- fs/io-wq.h | 8 ++++++-- 1 file changed, 6 insertions(+), 2 deletions(-)
diff --git a/fs/io-wq.h b/fs/io-wq.h index fb993b2bd0ef..3f5e356de980 100644 --- a/fs/io-wq.h +++ b/fs/io-wq.h @@ -120,6 +120,10 @@ static inline void io_wq_worker_sleeping(struct task_struct *tsk) static inline void io_wq_worker_running(struct task_struct *tsk) { } -#endif /* CONFIG_IO_WQ */ +#endif
-#endif /* INTERNAL_IO_WQ_H */ +static inline bool io_wq_current_is_worker(void) +{ + return in_task() && (current->flags & PF_IO_WORKER); +} +#endif
From: Jens Axboe axboe@kernel.dk
mainline inclusion from mainline-5.5-rc3 commit b7bb4f7da0a1a92f142697f1c9ce335e7a44f4b1 category: feature bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27 CVE: NA ---------------------------
Some of these code paths assume that any force_nonblock == true issue is not prepped, but that's not true if we did prep as part of link setup earlier. Check if we already have an async context allocated before setting up a new one.
Clean up the async context setup in general; we have a lot of duplicated code there.
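Condensed from the hunks below, the duplicated kmalloc()/memcpy() sites collapse into one allocator plus a reuse check in the -EAGAIN paths:

static int io_alloc_async_ctx(struct io_kiocb *req)
{
	req->io = kmalloc(sizeof(*req->io), GFP_KERNEL);
	if (req->io) {
		memcpy(&req->io->sqe, req->sqe, sizeof(req->io->sqe));
		req->sqe = &req->io->sqe;
		return 0;
	}
	return 1;
}

	/* typical handler pattern when forced to punt: */
	if (req->io)			/* already prepped during link setup */
		return -EAGAIN;
	if (io_alloc_async_ctx(req))
		return -ENOMEM;
	memcpy(&req->io->msg, &io.msg, sizeof(io.msg));
	req->work.func = io_sendrecv_async;
	return -EAGAIN;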
Fixes: 03b1230ca12a ("io_uring: ensure async punted sendmsg/recvmsg requests copy data") Fixes: f67676d160c6 ("io_uring: ensure async punted read/write requests copy iovec") Signed-off-by: Jens Axboe axboe@kernel.dk Signed-off-by: yangerkun yangerkun@huawei.com Reviewed-by: zhangyi (F) yi.zhang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- fs/io_uring.c | 175 ++++++++++++++++++++++++++++---------------------- 1 file changed, 98 insertions(+), 77 deletions(-)
diff --git a/fs/io_uring.c b/fs/io_uring.c index 2dd3f52614ea..d45c6f8b8270 100644 --- a/fs/io_uring.c +++ b/fs/io_uring.c @@ -1700,7 +1700,7 @@ static ssize_t loop_rw_iter(int rw, struct file *file, struct kiocb *kiocb, return ret; }
-static void io_req_map_io(struct io_kiocb *req, ssize_t io_size, +static void io_req_map_rw(struct io_kiocb *req, ssize_t io_size, struct iovec *iovec, struct iovec *fast_iov, struct iov_iter *iter) { @@ -1714,19 +1714,39 @@ static void io_req_map_io(struct io_kiocb *req, ssize_t io_size, } }
-static int io_setup_async_io(struct io_kiocb *req, ssize_t io_size, - struct iovec *iovec, struct iovec *fast_iov, - struct iov_iter *iter) +static int io_alloc_async_ctx(struct io_kiocb *req) { req->io = kmalloc(sizeof(*req->io), GFP_KERNEL); if (req->io) { - io_req_map_io(req, io_size, iovec, fast_iov, iter); memcpy(&req->io->sqe, req->sqe, sizeof(req->io->sqe)); req->sqe = &req->io->sqe; return 0; }
- return -ENOMEM; + return 1; +} + +static void io_rw_async(struct io_wq_work **workptr) +{ + struct io_kiocb *req = container_of(*workptr, struct io_kiocb, work); + struct iovec *iov = NULL; + + if (req->io->rw.iov != req->io->rw.fast_iov) + iov = req->io->rw.iov; + io_wq_submit_work(workptr); + kfree(iov); +} + +static int io_setup_async_rw(struct io_kiocb *req, ssize_t io_size, + struct iovec *iovec, struct iovec *fast_iov, + struct iov_iter *iter) +{ + if (!req->io && io_alloc_async_ctx(req)) + return -ENOMEM; + + io_req_map_rw(req, io_size, iovec, fast_iov, iter); + req->work.func = io_rw_async; + return 0; }
static int io_read_prep(struct io_kiocb *req, struct iovec **iovec, @@ -1805,7 +1825,7 @@ static int io_read(struct io_kiocb *req, struct io_kiocb **nxt, kiocb_done(kiocb, ret2, nxt, req->in_async); } else { copy_iov: - ret = io_setup_async_io(req, io_size, iovec, + ret = io_setup_async_rw(req, io_size, iovec, inline_vecs, &iter); if (ret) goto out_free; @@ -1813,7 +1833,8 @@ static int io_read(struct io_kiocb *req, struct io_kiocb **nxt, } } out_free: - kfree(iovec); + if (!io_wq_current_is_worker()) + kfree(iovec); return ret; }
@@ -1899,7 +1920,7 @@ static int io_write(struct io_kiocb *req, struct io_kiocb **nxt, kiocb_done(kiocb, ret2, nxt, req->in_async); } else { copy_iov: - ret = io_setup_async_io(req, io_size, iovec, + ret = io_setup_async_rw(req, io_size, iovec, inline_vecs, &iter); if (ret) goto out_free; @@ -1907,7 +1928,8 @@ static int io_write(struct io_kiocb *req, struct io_kiocb **nxt, } } out_free: - kfree(iovec); + if (!io_wq_current_is_worker()) + kfree(iovec); return ret; }
@@ -2020,6 +2042,19 @@ static int io_sync_file_range(struct io_kiocb *req, return 0; }
+#if defined(CONFIG_NET) +static void io_sendrecv_async(struct io_wq_work **workptr) +{ + struct io_kiocb *req = container_of(*workptr, struct io_kiocb, work); + struct iovec *iov = NULL; + + if (req->io->rw.iov != req->io->rw.fast_iov) + iov = req->io->msg.iov; + io_wq_submit_work(workptr); + kfree(iov); +} +#endif + static int io_sendmsg_prep(struct io_kiocb *req, struct io_async_ctx *io) { #if defined(CONFIG_NET) @@ -2049,7 +2084,7 @@ static int io_sendmsg(struct io_kiocb *req, const struct io_uring_sqe *sqe,
sock = sock_from_file(req->file, &ret); if (sock) { - struct io_async_ctx io, *copy; + struct io_async_ctx io; struct sockaddr_storage addr; unsigned flags;
@@ -2076,15 +2111,12 @@ static int io_sendmsg(struct io_kiocb *req, const struct io_uring_sqe *sqe,
ret = __sys_sendmsg_sock(sock, &kmsg->msg, flags); if (force_nonblock && ret == -EAGAIN) { - copy = kmalloc(sizeof(*copy), GFP_KERNEL); - if (!copy) { - ret = -ENOMEM; - goto out; - } - memcpy(©->msg, &io.msg, sizeof(copy->msg)); - req->io = copy; - memcpy(&req->io->sqe, req->sqe, sizeof(*req->sqe)); - req->sqe = &req->io->sqe; + if (req->io) + return -EAGAIN; + if (io_alloc_async_ctx(req)) + return -ENOMEM; + memcpy(&req->io->msg, &io.msg, sizeof(io.msg)); + req->work.func = io_sendrecv_async; return -EAGAIN; } if (ret == -ERESTARTSYS) @@ -2092,7 +2124,7 @@ static int io_sendmsg(struct io_kiocb *req, const struct io_uring_sqe *sqe, }
out: - if (kmsg && kmsg->iov != kmsg->fast_iov) + if (!io_wq_current_is_worker() && kmsg && kmsg->iov != kmsg->fast_iov) kfree(kmsg->iov); io_cqring_add_event(req, ret); if (ret < 0) @@ -2135,7 +2167,7 @@ static int io_recvmsg(struct io_kiocb *req, const struct io_uring_sqe *sqe, sock = sock_from_file(req->file, &ret); if (sock) { struct user_msghdr __user *msg; - struct io_async_ctx io, *copy; + struct io_async_ctx io; struct sockaddr_storage addr; unsigned flags;
@@ -2164,15 +2196,12 @@ static int io_recvmsg(struct io_kiocb *req, const struct io_uring_sqe *sqe,
ret = __sys_recvmsg_sock(sock, &kmsg->msg, msg, kmsg->uaddr, flags); if (force_nonblock && ret == -EAGAIN) { - copy = kmalloc(sizeof(*copy), GFP_KERNEL); - if (!copy) { - ret = -ENOMEM; - goto out; - } - memcpy(copy, &io, sizeof(*copy)); - req->io = copy; - memcpy(&req->io->sqe, req->sqe, sizeof(*req->sqe)); - req->sqe = &req->io->sqe; + if (req->io) + return -EAGAIN; + if (io_alloc_async_ctx(req)) + return -ENOMEM; + memcpy(&req->io->msg, &io.msg, sizeof(io.msg)); + req->work.func = io_sendrecv_async; return -EAGAIN; } if (ret == -ERESTARTSYS) @@ -2180,7 +2209,7 @@ static int io_recvmsg(struct io_kiocb *req, const struct io_uring_sqe *sqe, }
out: - if (kmsg && kmsg->iov != kmsg->fast_iov) + if (!io_wq_current_is_worker() && kmsg && kmsg->iov != kmsg->fast_iov) kfree(kmsg->iov); io_cqring_add_event(req, ret); if (ret < 0) @@ -2271,15 +2300,13 @@ static int io_connect(struct io_kiocb *req, const struct io_uring_sqe *sqe, ret = __sys_connect_file(req->file, &io->connect.address, addr_len, file_flags); if ((ret == -EAGAIN || ret == -EINPROGRESS) && force_nonblock) { - io = kmalloc(sizeof(*io), GFP_KERNEL); - if (!io) { + if (req->io) + return -EAGAIN; + if (io_alloc_async_ctx(req)) { ret = -ENOMEM; goto out; } - memcpy(&io->connect, &__io.connect, sizeof(io->connect)); - req->io = io; - memcpy(&io->sqe, req->sqe, sizeof(*req->sqe)); - req->sqe = &io->sqe; + memcpy(&req->io->connect, &__io.connect, sizeof(__io.connect)); return -EAGAIN; } if (ret == -ERESTARTSYS) @@ -2510,7 +2537,6 @@ static int io_poll_add(struct io_kiocb *req, const struct io_uring_sqe *sqe, if (!poll->file) return -EBADF;
- req->io = NULL; INIT_IO_WORK(&req->work, io_poll_complete_work); events = READ_ONCE(sqe->poll_events); poll->events = demangle_poll(events) | EPOLLERR | EPOLLHUP; @@ -2691,7 +2717,6 @@ static int io_timeout_prep(struct io_kiocb *req, struct io_async_ctx *io, data->mode = HRTIMER_MODE_REL;
hrtimer_init(&data->timer, CLOCK_MONOTONIC, data->mode); - req->io = io; return 0; }
@@ -2700,22 +2725,16 @@ static int io_timeout(struct io_kiocb *req, const struct io_uring_sqe *sqe) unsigned count; struct io_ring_ctx *ctx = req->ctx; struct io_timeout_data *data; - struct io_async_ctx *io; struct list_head *entry; unsigned span = 0; + int ret;
- io = req->io; - if (!io) { - int ret; - - io = kmalloc(sizeof(*io), GFP_KERNEL); - if (!io) + if (!req->io) { + if (io_alloc_async_ctx(req)) return -ENOMEM; - ret = io_timeout_prep(req, io, false); - if (ret) { - kfree(io); + ret = io_timeout_prep(req, req->io, false); + if (ret) return ret; - } } data = &req->io->timeout;
@@ -2857,23 +2876,35 @@ static int io_async_cancel(struct io_kiocb *req, const struct io_uring_sqe *sqe, return 0; }
-static int io_req_defer_prep(struct io_kiocb *req, struct io_async_ctx *io) +static int io_req_defer_prep(struct io_kiocb *req) { struct iovec inline_vecs[UIO_FASTIOV], *iovec = inline_vecs; + struct io_async_ctx *io = req->io; struct iov_iter iter; ssize_t ret;
- memcpy(&io->sqe, req->sqe, sizeof(io->sqe)); - req->sqe = &io->sqe; - switch (io->sqe.opcode) { case IORING_OP_READV: case IORING_OP_READ_FIXED: + /* ensure prep does right import */ + req->io = NULL; ret = io_read_prep(req, &iovec, &iter, true); + req->io = io; + if (ret < 0) + break; + io_req_map_rw(req, ret, iovec, inline_vecs, &iter); + ret = 0; break; case IORING_OP_WRITEV: case IORING_OP_WRITE_FIXED: + /* ensure prep does right import */ + req->io = NULL; ret = io_write_prep(req, &iovec, &iter, true); + req->io = io; + if (ret < 0) + break; + io_req_map_rw(req, ret, iovec, inline_vecs, &iter); + ret = 0; break; case IORING_OP_SENDMSG: ret = io_sendmsg_prep(req, io); @@ -2885,41 +2916,34 @@ static int io_req_defer_prep(struct io_kiocb *req, struct io_async_ctx *io) ret = io_connect_prep(req, io); break; case IORING_OP_TIMEOUT: - return io_timeout_prep(req, io, false); + ret = io_timeout_prep(req, io, false); + break; case IORING_OP_LINK_TIMEOUT: - return io_timeout_prep(req, io, true); + ret = io_timeout_prep(req, io, true); + break; default: - req->io = io; - return 0; + ret = 0; + break; }
- if (ret < 0) - return ret; - - req->io = io; - io_req_map_io(req, ret, iovec, inline_vecs, &iter); - return 0; + return ret; }
static int io_req_defer(struct io_kiocb *req) { struct io_ring_ctx *ctx = req->ctx; - struct io_async_ctx *io; int ret;
/* Still need defer if there is pending req in defer list. */ if (!req_need_defer(req) && list_empty(&ctx->defer_list)) return 0;
- io = kmalloc(sizeof(*io), GFP_KERNEL); - if (!io) + if (io_alloc_async_ctx(req)) return -EAGAIN;
- ret = io_req_defer_prep(req, io); - if (ret < 0) { - kfree(io); + ret = io_req_defer_prep(req); + if (ret < 0) return ret; - }
spin_lock_irq(&ctx->completion_lock); if (!req_need_defer(req) && list_empty(&ctx->defer_list)) { @@ -3365,7 +3389,6 @@ static bool io_submit_sqe(struct io_kiocb *req, struct io_submit_state *state, */ if (*link) { struct io_kiocb *prev = *link; - struct io_async_ctx *io;
if (req->sqe->flags & IOSQE_IO_DRAIN) (*link)->flags |= REQ_F_DRAIN_LINK | REQ_F_IO_DRAIN; @@ -3373,15 +3396,13 @@ static bool io_submit_sqe(struct io_kiocb *req, struct io_submit_state *state, if (req->sqe->flags & IOSQE_IO_HARDLINK) req->flags |= REQ_F_HARDLINK;
- io = kmalloc(sizeof(*io), GFP_KERNEL); - if (!io) { + if (io_alloc_async_ctx(req)) { ret = -EAGAIN; goto err_req; }
- ret = io_req_defer_prep(req, io); + ret = io_req_defer_prep(req); if (ret) { - kfree(io); /* fail even hard links since we don't submit */ prev->flags |= REQ_F_FAIL_LINK; goto err_req;
From: Jens Axboe axboe@kernel.dk
mainline inclusion from mainline-5.5-rc3 commit fc4df999e24fc3006441acd4ce6250e6a76ac851 category: feature bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27 CVE: NA ---------------------------
We pass in req->sqe for all of them, so there is no need to pass the SQE in separately when the request itself is always passed in. This is a necessary prep patch to be able to clean up and fix the request prep path.
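The resulting pattern, taking io_fsync() from the hunks below as one example (body elided):

static int io_fsync(struct io_kiocb *req, struct io_kiocb **nxt,
		    bool force_nonblock)
{
	const struct io_uring_sqe *sqe = req->sqe;	/* fetched internally */
	/* ... body unchanged ... */
}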
Signed-off-by: Jens Axboe axboe@kernel.dk Signed-off-by: yangerkun yangerkun@huawei.com Reviewed-by: zhangyi (F) yi.zhang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- fs/io_uring.c | 80 ++++++++++++++++++++++++++++----------------------- 1 file changed, 44 insertions(+), 36 deletions(-)
diff --git a/fs/io_uring.c b/fs/io_uring.c index d45c6f8b8270..9bfcf4d6f9c6 100644 --- a/fs/io_uring.c +++ b/fs/io_uring.c @@ -1948,8 +1948,9 @@ static int io_nop(struct io_kiocb *req) return 0; }
-static int io_prep_fsync(struct io_kiocb *req, const struct io_uring_sqe *sqe) +static int io_prep_fsync(struct io_kiocb *req) { + const struct io_uring_sqe *sqe = req->sqe; struct io_ring_ctx *ctx = req->ctx;
if (!req->file) @@ -1963,9 +1964,10 @@ static int io_prep_fsync(struct io_kiocb *req, const struct io_uring_sqe *sqe) return 0; }
-static int io_fsync(struct io_kiocb *req, const struct io_uring_sqe *sqe, - struct io_kiocb **nxt, bool force_nonblock) +static int io_fsync(struct io_kiocb *req, struct io_kiocb **nxt, + bool force_nonblock) { + const struct io_uring_sqe *sqe = req->sqe; loff_t sqe_off = READ_ONCE(sqe->off); loff_t sqe_len = READ_ONCE(sqe->len); loff_t end = sqe_off + sqe_len; @@ -1976,7 +1978,7 @@ static int io_fsync(struct io_kiocb *req, const struct io_uring_sqe *sqe, if (unlikely(fsync_flags & ~IORING_FSYNC_DATASYNC)) return -EINVAL;
- ret = io_prep_fsync(req, sqe); + ret = io_prep_fsync(req); if (ret) return ret;
@@ -1995,8 +1997,9 @@ static int io_fsync(struct io_kiocb *req, const struct io_uring_sqe *sqe, return 0; }
-static int io_prep_sfr(struct io_kiocb *req, const struct io_uring_sqe *sqe) +static int io_prep_sfr(struct io_kiocb *req) { + const struct io_uring_sqe *sqe = req->sqe; struct io_ring_ctx *ctx = req->ctx; int ret = 0;
@@ -2011,17 +2014,16 @@ static int io_prep_sfr(struct io_kiocb *req, const struct io_uring_sqe *sqe) return ret; }
-static int io_sync_file_range(struct io_kiocb *req, - const struct io_uring_sqe *sqe, - struct io_kiocb **nxt, +static int io_sync_file_range(struct io_kiocb *req, struct io_kiocb **nxt, bool force_nonblock) { + const struct io_uring_sqe *sqe = req->sqe; loff_t sqe_off; loff_t sqe_len; unsigned flags; int ret;
- ret = io_prep_sfr(req, sqe); + ret = io_prep_sfr(req); if (ret) return ret;
@@ -2071,10 +2073,11 @@ static int io_sendmsg_prep(struct io_kiocb *req, struct io_async_ctx *io) #endif }
-static int io_sendmsg(struct io_kiocb *req, const struct io_uring_sqe *sqe, - struct io_kiocb **nxt, bool force_nonblock) +static int io_sendmsg(struct io_kiocb *req, struct io_kiocb **nxt, + bool force_nonblock) { #if defined(CONFIG_NET) + const struct io_uring_sqe *sqe = req->sqe; struct io_async_msghdr *kmsg = NULL; struct socket *sock; int ret; @@ -2153,10 +2156,11 @@ static int io_recvmsg_prep(struct io_kiocb *req, struct io_async_ctx *io) #endif }
-static int io_recvmsg(struct io_kiocb *req, const struct io_uring_sqe *sqe, - struct io_kiocb **nxt, bool force_nonblock) +static int io_recvmsg(struct io_kiocb *req, struct io_kiocb **nxt, + bool force_nonblock) { #if defined(CONFIG_NET) + const struct io_uring_sqe *sqe = req->sqe; struct io_async_msghdr *kmsg = NULL; struct socket *sock; int ret; @@ -2221,10 +2225,11 @@ static int io_recvmsg(struct io_kiocb *req, const struct io_uring_sqe *sqe, #endif }
-static int io_accept(struct io_kiocb *req, const struct io_uring_sqe *sqe, - struct io_kiocb **nxt, bool force_nonblock) +static int io_accept(struct io_kiocb *req, struct io_kiocb **nxt, + bool force_nonblock) { #if defined(CONFIG_NET) + const struct io_uring_sqe *sqe = req->sqe; struct sockaddr __user *addr; int __user *addr_len; unsigned file_flags; @@ -2272,10 +2277,11 @@ static int io_connect_prep(struct io_kiocb *req, struct io_async_ctx *io) #endif }
-static int io_connect(struct io_kiocb *req, const struct io_uring_sqe *sqe, - struct io_kiocb **nxt, bool force_nonblock) +static int io_connect(struct io_kiocb *req, struct io_kiocb **nxt, + bool force_nonblock) { #if defined(CONFIG_NET) + const struct io_uring_sqe *sqe = req->sqe; struct io_async_ctx __io, *io; unsigned file_flags; int addr_len, ret; @@ -2373,8 +2379,9 @@ static int io_poll_cancel(struct io_ring_ctx *ctx, __u64 sqe_addr) * Find a running poll command that matches one specified in sqe->addr, * and remove it if found. */ -static int io_poll_remove(struct io_kiocb *req, const struct io_uring_sqe *sqe) +static int io_poll_remove(struct io_kiocb *req) { + const struct io_uring_sqe *sqe = req->sqe; struct io_ring_ctx *ctx = req->ctx; int ret;
@@ -2520,9 +2527,9 @@ static void io_poll_req_insert(struct io_kiocb *req) hlist_add_head(&req->hash_node, list); }
-static int io_poll_add(struct io_kiocb *req, const struct io_uring_sqe *sqe, - struct io_kiocb **nxt) +static int io_poll_add(struct io_kiocb *req, struct io_kiocb **nxt) { + const struct io_uring_sqe *sqe = req->sqe; struct io_poll_iocb *poll = &req->poll; struct io_ring_ctx *ctx = req->ctx; struct io_poll_table ipt; @@ -2659,9 +2666,9 @@ static int io_timeout_cancel(struct io_ring_ctx *ctx, __u64 user_data) /* * Remove or update an existing timeout command */ -static int io_timeout_remove(struct io_kiocb *req, - const struct io_uring_sqe *sqe) +static int io_timeout_remove(struct io_kiocb *req) { + const struct io_uring_sqe *sqe = req->sqe; struct io_ring_ctx *ctx = req->ctx; unsigned flags; int ret; @@ -2720,8 +2727,9 @@ static int io_timeout_prep(struct io_kiocb *req, struct io_async_ctx *io, return 0; }
-static int io_timeout(struct io_kiocb *req, const struct io_uring_sqe *sqe) +static int io_timeout(struct io_kiocb *req) { + const struct io_uring_sqe *sqe = req->sqe; unsigned count; struct io_ring_ctx *ctx = req->ctx; struct io_timeout_data *data; @@ -2861,9 +2869,9 @@ static void io_async_find_and_cancel(struct io_ring_ctx *ctx, io_put_req_find_next(req, nxt); }
-static int io_async_cancel(struct io_kiocb *req, const struct io_uring_sqe *sqe, - struct io_kiocb **nxt) +static int io_async_cancel(struct io_kiocb *req, struct io_kiocb **nxt) { + const struct io_uring_sqe *sqe = req->sqe; struct io_ring_ctx *ctx = req->ctx;
if (unlikely(ctx->flags & IORING_SETUP_IOPOLL)) @@ -2986,37 +2994,37 @@ static int io_issue_sqe(struct io_kiocb *req, struct io_kiocb **nxt, ret = io_write(req, nxt, force_nonblock); break; case IORING_OP_FSYNC: - ret = io_fsync(req, req->sqe, nxt, force_nonblock); + ret = io_fsync(req, nxt, force_nonblock); break; case IORING_OP_POLL_ADD: - ret = io_poll_add(req, req->sqe, nxt); + ret = io_poll_add(req, nxt); break; case IORING_OP_POLL_REMOVE: - ret = io_poll_remove(req, req->sqe); + ret = io_poll_remove(req); break; case IORING_OP_SYNC_FILE_RANGE: - ret = io_sync_file_range(req, req->sqe, nxt, force_nonblock); + ret = io_sync_file_range(req, nxt, force_nonblock); break; case IORING_OP_SENDMSG: - ret = io_sendmsg(req, req->sqe, nxt, force_nonblock); + ret = io_sendmsg(req, nxt, force_nonblock); break; case IORING_OP_RECVMSG: - ret = io_recvmsg(req, req->sqe, nxt, force_nonblock); + ret = io_recvmsg(req, nxt, force_nonblock); break; case IORING_OP_TIMEOUT: - ret = io_timeout(req, req->sqe); + ret = io_timeout(req); break; case IORING_OP_TIMEOUT_REMOVE: - ret = io_timeout_remove(req, req->sqe); + ret = io_timeout_remove(req); break; case IORING_OP_ACCEPT: - ret = io_accept(req, req->sqe, nxt, force_nonblock); + ret = io_accept(req, nxt, force_nonblock); break; case IORING_OP_CONNECT: - ret = io_connect(req, req->sqe, nxt, force_nonblock); + ret = io_connect(req, nxt, force_nonblock); break; case IORING_OP_ASYNC_CANCEL: - ret = io_async_cancel(req, req->sqe, nxt); + ret = io_async_cancel(req, nxt); break; default: ret = -EINVAL;
From: Jens Axboe axboe@kernel.dk
mainline inclusion from mainline-5.5-rc3 commit 8ed8d3c3bc32bf5b442c9f54013b4a47d5cae740 category: feature bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27 CVE: NA ---------------------------
We're currently not retaining sqe data for accept, fsync, and sync_file_range. None of these commands need data outside of what is directly provided, hence it can't go stale when the request is deferred. However, it can get reused if an application reuses SQE entries.
Ensure that we retain the information we need and only read the sqe contents once, off the submission path. Most of this is just moving code into a prep and finish function.
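The pattern, condensed from the fsync hunks below: prep reads the SQE exactly once and latches the result behind REQ_F_PREPPED, so a deferred or reused SQE can no longer change the operation underneath us.

static int io_prep_fsync(struct io_kiocb *req)
{
	const struct io_uring_sqe *sqe = req->sqe;

	if (req->flags & REQ_F_PREPPED)	/* SQE already consumed */
		return 0;
	/* ... file and sqe field validation elided ... */
	req->sync.flags = READ_ONCE(sqe->fsync_flags);
	if (unlikely(req->sync.flags & ~IORING_FSYNC_DATASYNC))
		return -EINVAL;
	req->sync.off = READ_ONCE(sqe->off);
	req->sync.len = READ_ONCE(sqe->len);
	req->flags |= REQ_F_PREPPED;
	return 0;
}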
Signed-off-by: Jens Axboe axboe@kernel.dk Signed-off-by: yangerkun yangerkun@huawei.com Reviewed-by: zhangyi (F) yi.zhang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- fs/io_uring.c | 221 +++++++++++++++++++++++++++++++++++++++----------- 1 file changed, 172 insertions(+), 49 deletions(-)
diff --git a/fs/io_uring.c b/fs/io_uring.c index 9bfcf4d6f9c6..cb3b7fb78dff 100644 --- a/fs/io_uring.c +++ b/fs/io_uring.c @@ -304,6 +304,20 @@ struct io_timeout_data { u32 seq_offset; };
+struct io_accept { + struct file *file; + struct sockaddr __user *addr; + int __user *addr_len; + int flags; +}; + +struct io_sync { + struct file *file; + loff_t len; + loff_t off; + int flags; +}; + struct io_async_connect { struct sockaddr_storage address; }; @@ -343,6 +357,8 @@ struct io_kiocb { struct file *file; struct kiocb rw; struct io_poll_iocb poll; + struct io_accept accept; + struct io_sync sync; };
const struct io_uring_sqe *sqe; @@ -378,6 +394,7 @@ struct io_kiocb { #define REQ_F_INFLIGHT 16384 /* on inflight list */ #define REQ_F_COMP_LOCKED 32768 /* completion under lock */ #define REQ_F_HARDLINK 65536 /* doesn't sever on completion < 0 */ +#define REQ_F_PREPPED 131072 /* request already opcode prepared */ u64 user_data; u32 result; u32 sequence; @@ -1953,6 +1970,8 @@ static int io_prep_fsync(struct io_kiocb *req) const struct io_uring_sqe *sqe = req->sqe; struct io_ring_ctx *ctx = req->ctx;
+ if (req->flags & REQ_F_PREPPED) + return 0; if (!req->file) return -EBADF;
@@ -1961,39 +1980,70 @@ static int io_prep_fsync(struct io_kiocb *req) if (unlikely(sqe->addr || sqe->ioprio || sqe->buf_index)) return -EINVAL;
+ req->sync.flags = READ_ONCE(sqe->fsync_flags); + if (unlikely(req->sync.flags & ~IORING_FSYNC_DATASYNC)) + return -EINVAL; + + req->sync.off = READ_ONCE(sqe->off); + req->sync.len = READ_ONCE(sqe->len); + req->flags |= REQ_F_PREPPED; return 0; }
+static bool io_req_cancelled(struct io_kiocb *req) +{ + if (req->work.flags & IO_WQ_WORK_CANCEL) { + req_set_fail_links(req); + io_cqring_add_event(req, -ECANCELED); + io_put_req(req); + return true; + } + + return false; +} + +static void io_fsync_finish(struct io_wq_work **workptr) +{ + struct io_kiocb *req = container_of(*workptr, struct io_kiocb, work); + loff_t end = req->sync.off + req->sync.len; + struct io_kiocb *nxt = NULL; + int ret; + + if (io_req_cancelled(req)) + return; + + ret = vfs_fsync_range(req->rw.ki_filp, req->sync.off, + end > 0 ? end : LLONG_MAX, + req->sync.flags & IORING_FSYNC_DATASYNC); + if (ret < 0) + req_set_fail_links(req); + io_cqring_add_event(req, ret); + io_put_req_find_next(req, &nxt); + if (nxt) + *workptr = &nxt->work; +} + static int io_fsync(struct io_kiocb *req, struct io_kiocb **nxt, bool force_nonblock) { - const struct io_uring_sqe *sqe = req->sqe; - loff_t sqe_off = READ_ONCE(sqe->off); - loff_t sqe_len = READ_ONCE(sqe->len); - loff_t end = sqe_off + sqe_len; - unsigned fsync_flags; + struct io_wq_work *work, *old_work; int ret;
- fsync_flags = READ_ONCE(sqe->fsync_flags); - if (unlikely(fsync_flags & ~IORING_FSYNC_DATASYNC)) - return -EINVAL; - ret = io_prep_fsync(req); if (ret) return ret;
/* fsync always requires a blocking context */ - if (force_nonblock) + if (force_nonblock) { + io_put_req(req); + req->work.func = io_fsync_finish; return -EAGAIN; + }
- ret = vfs_fsync_range(req->rw.ki_filp, sqe_off, - end > 0 ? end : LLONG_MAX, - fsync_flags & IORING_FSYNC_DATASYNC); - - if (ret < 0) - req_set_fail_links(req); - io_cqring_add_event(req, ret); - io_put_req_find_next(req, nxt); + work = old_work = &req->work; + io_fsync_finish(&work); + if (work && work != old_work) + *nxt = container_of(work, struct io_kiocb, work); return 0; }
@@ -2001,8 +2051,9 @@ static int io_prep_sfr(struct io_kiocb *req) { const struct io_uring_sqe *sqe = req->sqe; struct io_ring_ctx *ctx = req->ctx; - int ret = 0;
+ if (req->flags & REQ_F_PREPPED) + return 0; if (!req->file) return -EBADF;
@@ -2011,16 +2062,36 @@ static int io_prep_sfr(struct io_kiocb *req) if (unlikely(sqe->addr || sqe->ioprio || sqe->buf_index)) return -EINVAL;
- return ret; + req->sync.off = READ_ONCE(sqe->off); + req->sync.len = READ_ONCE(sqe->len); + req->sync.flags = READ_ONCE(sqe->sync_range_flags); + req->flags |= REQ_F_PREPPED; + return 0; +} + +static void io_sync_file_range_finish(struct io_wq_work **workptr) +{ + struct io_kiocb *req = container_of(*workptr, struct io_kiocb, work); + struct io_kiocb *nxt = NULL; + int ret; + + if (io_req_cancelled(req)) + return; + + ret = sync_file_range(req->rw.ki_filp, req->sync.off, req->sync.len, + req->sync.flags); + if (ret < 0) + req_set_fail_links(req); + io_cqring_add_event(req, ret); + io_put_req_find_next(req, &nxt); + if (nxt) + *workptr = &nxt->work; }
static int io_sync_file_range(struct io_kiocb *req, struct io_kiocb **nxt, bool force_nonblock) { - const struct io_uring_sqe *sqe = req->sqe; - loff_t sqe_off; - loff_t sqe_len; - unsigned flags; + struct io_wq_work *work, *old_work; int ret;
ret = io_prep_sfr(req); @@ -2028,19 +2099,16 @@ static int io_sync_file_range(struct io_kiocb *req, struct io_kiocb **nxt, return ret;
/* sync_file_range always requires a blocking context */ - if (force_nonblock) + if (force_nonblock) { + io_put_req(req); + req->work.func = io_sync_file_range_finish; return -EAGAIN; + }
- sqe_off = READ_ONCE(sqe->off); - sqe_len = READ_ONCE(sqe->len); - flags = READ_ONCE(sqe->sync_range_flags); - - ret = sync_file_range(req->rw.ki_filp, sqe_off, sqe_len, flags); - - if (ret < 0) - req_set_fail_links(req); - io_cqring_add_event(req, ret); - io_put_req_find_next(req, nxt); + work = old_work = &req->work; + io_sync_file_range_finish(&work); + if (work && work != old_work) + *nxt = container_of(work, struct io_kiocb, work); return 0; }
@@ -2225,31 +2293,44 @@ static int io_recvmsg(struct io_kiocb *req, struct io_kiocb **nxt, #endif }
-static int io_accept(struct io_kiocb *req, struct io_kiocb **nxt, - bool force_nonblock) +static int io_accept_prep(struct io_kiocb *req) { #if defined(CONFIG_NET) const struct io_uring_sqe *sqe = req->sqe; - struct sockaddr __user *addr; - int __user *addr_len; - unsigned file_flags; - int flags, ret; + struct io_accept *accept = &req->accept; + + if (req->flags & REQ_F_PREPPED) + return 0;
if (unlikely(req->ctx->flags & (IORING_SETUP_IOPOLL|IORING_SETUP_SQPOLL))) return -EINVAL; if (sqe->ioprio || sqe->len || sqe->buf_index) return -EINVAL;
- addr = (struct sockaddr __user *) (unsigned long) READ_ONCE(sqe->addr); - addr_len = (int __user *) (unsigned long) READ_ONCE(sqe->addr2); - flags = READ_ONCE(sqe->accept_flags); - file_flags = force_nonblock ? O_NONBLOCK : 0; + accept->addr = (struct sockaddr __user *) + (unsigned long) READ_ONCE(sqe->addr); + accept->addr_len = (int __user *) (unsigned long) READ_ONCE(sqe->addr2); + accept->flags = READ_ONCE(sqe->accept_flags); + req->flags |= REQ_F_PREPPED; + return 0; +#else + return -EOPNOTSUPP; +#endif +}
- ret = __sys_accept4_file(req->file, file_flags, addr, addr_len, flags); - if (ret == -EAGAIN && force_nonblock) { - req->work.flags |= IO_WQ_WORK_NEEDS_FILES; +#if defined(CONFIG_NET) +static int __io_accept(struct io_kiocb *req, struct io_kiocb **nxt, + bool force_nonblock) +{ + struct io_accept *accept = &req->accept; + unsigned file_flags; + int ret; + + file_flags = force_nonblock ? O_NONBLOCK : 0; + ret = __sys_accept4_file(req->file, file_flags, accept->addr, + accept->addr_len, accept->flags); + if (ret == -EAGAIN && force_nonblock) return -EAGAIN; - } if (ret == -ERESTARTSYS) ret = -EINTR; if (ret < 0) @@ -2257,6 +2338,39 @@ static int io_accept(struct io_kiocb *req, struct io_kiocb **nxt, io_cqring_add_event(req, ret); io_put_req_find_next(req, nxt); return 0; +} + +static void io_accept_finish(struct io_wq_work **workptr) +{ + struct io_kiocb *req = container_of(*workptr, struct io_kiocb, work); + struct io_kiocb *nxt = NULL; + + if (io_req_cancelled(req)) + return; + __io_accept(req, &nxt, false); + if (nxt) + *workptr = &nxt->work; +} +#endif + +static int io_accept(struct io_kiocb *req, struct io_kiocb **nxt, + bool force_nonblock) +{ +#if defined(CONFIG_NET) + int ret; + + ret = io_accept_prep(req); + if (ret) + return ret; + + ret = __io_accept(req, nxt, force_nonblock); + if (ret == -EAGAIN && force_nonblock) { + req->work.func = io_accept_finish; + req->work.flags |= IO_WQ_WORK_NEEDS_FILES; + io_put_req(req); + return -EAGAIN; + } + return 0; #else return -EOPNOTSUPP; #endif @@ -2914,6 +3028,12 @@ static int io_req_defer_prep(struct io_kiocb *req) io_req_map_rw(req, ret, iovec, inline_vecs, &iter); ret = 0; break; + case IORING_OP_FSYNC: + ret = io_prep_fsync(req); + break; + case IORING_OP_SYNC_FILE_RANGE: + ret = io_prep_sfr(req); + break; case IORING_OP_SENDMSG: ret = io_sendmsg_prep(req, io); break; @@ -2929,6 +3049,9 @@ static int io_req_defer_prep(struct io_kiocb *req) case IORING_OP_LINK_TIMEOUT: ret = io_timeout_prep(req, io, true); break; + case IORING_OP_ACCEPT: + ret = io_accept_prep(req); + break; default: ret = 0; break;
From: Pavel Begunkov asml.silence@gmail.com
mainline inclusion from mainline-5.5-rc3 commit ffbb8d6b76910d4f3a2bafeaf68c419011e98d05 category: feature bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27 CVE: NA ---------------------------
The rules are as follows: if IOSQE_IO_HARDLINK is specified, then it's a link and there is no need to set IOSQE_IO_LINK separately, though it may also be set. Add a proper check and ensure that IOSQE_IO_HARDLINK implies IOSQE_IO_LINK.
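The submission-side chain-termination check then becomes (from the hunk below):

	/*
	 * If previous wasn't linked and we have a linked command,
	 * that's the end of the chain. Submit the previous link.
	 */
	if (!(sqe_flags & (IOSQE_IO_LINK|IOSQE_IO_HARDLINK)) && link) {
		io_queue_link_head(link);
		link = NULL;
	}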
Signed-off-by: Pavel Begunkov asml.silence@gmail.com Signed-off-by: Jens Axboe axboe@kernel.dk Signed-off-by: yangerkun yangerkun@huawei.com Reviewed-by: zhangyi (F) yi.zhang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- fs/io_uring.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/fs/io_uring.c b/fs/io_uring.c index cb3b7fb78dff..1d6c4ee18daf 100644 --- a/fs/io_uring.c +++ b/fs/io_uring.c @@ -3697,7 +3697,7 @@ static int io_submit_sqes(struct io_ring_ctx *ctx, unsigned int nr, * If previous wasn't linked and we have a linked command, * that's the end of the chain. Submit the previous link. */ - if (!(sqe_flags & IOSQE_IO_LINK) && link) { + if (!(sqe_flags & (IOSQE_IO_LINK|IOSQE_IO_HARDLINK)) && link) { io_queue_link_head(link); link = NULL; }
From: Jens Axboe axboe@kernel.dk
mainline inclusion from mainline-5.5-rc3 commit 0969e783e3a8913f79df27286501a6c21e961524 category: feature bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27 CVE: NA ---------------------------
If we defer these commands as part of a link, we have to make sure that the SQE data has been read upfront. Integrate the poll add/remove into the prep handling to make it safe for SQE reuse.
Signed-off-by: Jens Axboe axboe@kernel.dk Signed-off-by: yangerkun yangerkun@huawei.com Reviewed-by: zhangyi (F) yi.zhang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- fs/io_uring.c | 68 ++++++++++++++++++++++++++++++++++++++++----------- 1 file changed, 54 insertions(+), 14 deletions(-)
diff --git a/fs/io_uring.c b/fs/io_uring.c index 1d6c4ee18daf..3ea74527361f 100644 --- a/fs/io_uring.c +++ b/fs/io_uring.c @@ -289,7 +289,10 @@ struct io_ring_ctx { */ struct io_poll_iocb { struct file *file; - struct wait_queue_head *head; + union { + struct wait_queue_head *head; + u64 addr; + }; __poll_t events; bool done; bool canceled; @@ -2489,24 +2492,40 @@ static int io_poll_cancel(struct io_ring_ctx *ctx, __u64 sqe_addr) return -ENOENT; }
+static int io_poll_remove_prep(struct io_kiocb *req) +{ + const struct io_uring_sqe *sqe = req->sqe; + + if (req->flags & REQ_F_PREPPED) + return 0; + if (unlikely(req->ctx->flags & IORING_SETUP_IOPOLL)) + return -EINVAL; + if (sqe->ioprio || sqe->off || sqe->len || sqe->buf_index || + sqe->poll_events) + return -EINVAL; + + req->poll.addr = READ_ONCE(sqe->addr); + req->flags |= REQ_F_PREPPED; + return 0; +} + /* * Find a running poll command that matches one specified in sqe->addr, * and remove it if found. */ static int io_poll_remove(struct io_kiocb *req) { - const struct io_uring_sqe *sqe = req->sqe; struct io_ring_ctx *ctx = req->ctx; + u64 addr; int ret;
- if (unlikely(req->ctx->flags & IORING_SETUP_IOPOLL)) - return -EINVAL; - if (sqe->ioprio || sqe->off || sqe->len || sqe->buf_index || - sqe->poll_events) - return -EINVAL; + ret = io_poll_remove_prep(req); + if (ret) + return ret;
+ addr = req->poll.addr; spin_lock_irq(&ctx->completion_lock); - ret = io_poll_cancel(ctx, READ_ONCE(sqe->addr)); + ret = io_poll_cancel(ctx, addr); spin_unlock_irq(&ctx->completion_lock);
io_cqring_add_event(req, ret); @@ -2641,16 +2660,14 @@ static void io_poll_req_insert(struct io_kiocb *req) hlist_add_head(&req->hash_node, list); }
-static int io_poll_add(struct io_kiocb *req, struct io_kiocb **nxt) +static int io_poll_add_prep(struct io_kiocb *req) { const struct io_uring_sqe *sqe = req->sqe; struct io_poll_iocb *poll = &req->poll; - struct io_ring_ctx *ctx = req->ctx; - struct io_poll_table ipt; - bool cancel = false; - __poll_t mask; u16 events;
+ if (req->flags & REQ_F_PREPPED) + return 0; if (unlikely(req->ctx->flags & IORING_SETUP_IOPOLL)) return -EINVAL; if (sqe->addr || sqe->ioprio || sqe->off || sqe->len || sqe->buf_index) @@ -2658,9 +2675,26 @@ static int io_poll_add(struct io_kiocb *req, struct io_kiocb **nxt) if (!poll->file) return -EBADF;
- INIT_IO_WORK(&req->work, io_poll_complete_work); + req->flags |= REQ_F_PREPPED; events = READ_ONCE(sqe->poll_events); poll->events = demangle_poll(events) | EPOLLERR | EPOLLHUP; + return 0; +} + +static int io_poll_add(struct io_kiocb *req, struct io_kiocb **nxt) +{ + struct io_poll_iocb *poll = &req->poll; + struct io_ring_ctx *ctx = req->ctx; + struct io_poll_table ipt; + bool cancel = false; + __poll_t mask; + int ret; + + ret = io_poll_add_prep(req); + if (ret) + return ret; + + INIT_IO_WORK(&req->work, io_poll_complete_work); INIT_HLIST_NODE(&req->hash_node);
poll->head = NULL; @@ -3028,6 +3062,12 @@ static int io_req_defer_prep(struct io_kiocb *req) io_req_map_rw(req, ret, iovec, inline_vecs, &iter); ret = 0; break; + case IORING_OP_POLL_ADD: + ret = io_poll_add_prep(req); + break; + case IORING_OP_POLL_REMOVE: + ret = io_poll_remove_prep(req); + break; case IORING_OP_FSYNC: ret = io_prep_fsync(req); break;
From: Jens Axboe axboe@kernel.dk
mainline inclusion from mainline-5.5-rc3 commit fbf23849b1724d3ea362e346d0877a8d87978fe6 category: feature bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27 CVE: NA ---------------------------
If we defer this command as part of a link, we have to make sure that the SQE data has been read upfront. Integrate the async cancel op into the prep handling to make it safe for SQE reuse.
Signed-off-by: Jens Axboe axboe@kernel.dk Signed-off-by: yangerkun yangerkun@huawei.com Reviewed-by: zhangyi (F) yi.zhang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- fs/io_uring.c | 32 ++++++++++++++++++++++++++++---- 1 file changed, 28 insertions(+), 4 deletions(-)
diff --git a/fs/io_uring.c b/fs/io_uring.c index 3ea74527361f..bf6111474e66 100644 --- a/fs/io_uring.c +++ b/fs/io_uring.c @@ -321,6 +321,11 @@ struct io_sync { int flags; };
+struct io_cancel { + struct file *file; + u64 addr; +}; + struct io_async_connect { struct sockaddr_storage address; }; @@ -362,6 +367,7 @@ struct io_kiocb { struct io_poll_iocb poll; struct io_accept accept; struct io_sync sync; + struct io_cancel cancel; };
const struct io_uring_sqe *sqe; @@ -3017,18 +3023,33 @@ static void io_async_find_and_cancel(struct io_ring_ctx *ctx, io_put_req_find_next(req, nxt); }
-static int io_async_cancel(struct io_kiocb *req, struct io_kiocb **nxt) +static int io_async_cancel_prep(struct io_kiocb *req) { const struct io_uring_sqe *sqe = req->sqe; - struct io_ring_ctx *ctx = req->ctx;
- if (unlikely(ctx->flags & IORING_SETUP_IOPOLL)) + if (req->flags & REQ_F_PREPPED) + return 0; + if (unlikely(req->ctx->flags & IORING_SETUP_IOPOLL)) return -EINVAL; if (sqe->flags || sqe->ioprio || sqe->off || sqe->len || sqe->cancel_flags) return -EINVAL;
- io_async_find_and_cancel(ctx, req, READ_ONCE(sqe->addr), nxt, 0); + req->flags |= REQ_F_PREPPED; + req->cancel.addr = READ_ONCE(sqe->addr); + return 0; +} + +static int io_async_cancel(struct io_kiocb *req, struct io_kiocb **nxt) +{ + struct io_ring_ctx *ctx = req->ctx; + int ret; + + ret = io_async_cancel_prep(req); + if (ret) + return ret; + + io_async_find_and_cancel(ctx, req, req->cancel.addr, nxt, 0); return 0; }
@@ -3086,6 +3107,9 @@ static int io_req_defer_prep(struct io_kiocb *req) case IORING_OP_TIMEOUT: ret = io_timeout_prep(req, io, false); break; + case IORING_OP_ASYNC_CANCEL: + ret = io_async_cancel_prep(req); + break; case IORING_OP_LINK_TIMEOUT: ret = io_timeout_prep(req, io, true); break;
From: Jens Axboe axboe@kernel.dk
mainline inclusion from mainline-5.5-rc3 commit b29472ee7b53784f44011069fad15e539fd25bcf category: feature bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27 CVE: NA ---------------------------
If we defer this command as part of a link, we have to make sure that the SQE data has been read upfront. Integrate the timeout remove op into the prep handling to make it safe for SQE reuse.
Signed-off-by: Jens Axboe axboe@kernel.dk Signed-off-by: yangerkun yangerkun@huawei.com Reviewed-by: zhangyi (F) yi.zhang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- fs/io_uring.c | 44 ++++++++++++++++++++++++++++++++++---------- 1 file changed, 34 insertions(+), 10 deletions(-)
diff --git a/fs/io_uring.c b/fs/io_uring.c index bf6111474e66..0911ad41c3f8 100644 --- a/fs/io_uring.c +++ b/fs/io_uring.c @@ -326,6 +326,12 @@ struct io_cancel { u64 addr; };
+struct io_timeout { + struct file *file; + u64 addr; + int flags; +}; + struct io_async_connect { struct sockaddr_storage address; }; @@ -368,6 +374,7 @@ struct io_kiocb { struct io_accept accept; struct io_sync sync; struct io_cancel cancel; + struct io_timeout timeout; };
const struct io_uring_sqe *sqe; @@ -2817,26 +2824,40 @@ static int io_timeout_cancel(struct io_ring_ctx *ctx, __u64 user_data) return 0; }
+static int io_timeout_remove_prep(struct io_kiocb *req) +{ + const struct io_uring_sqe *sqe = req->sqe; + + if (req->flags & REQ_F_PREPPED) + return 0; + if (unlikely(req->ctx->flags & IORING_SETUP_IOPOLL)) + return -EINVAL; + if (sqe->flags || sqe->ioprio || sqe->buf_index || sqe->len) + return -EINVAL; + + req->timeout.addr = READ_ONCE(sqe->addr); + req->timeout.flags = READ_ONCE(sqe->timeout_flags); + if (req->timeout.flags) + return -EINVAL; + + req->flags |= REQ_F_PREPPED; + return 0; +} + /* * Remove or update an existing timeout command */ static int io_timeout_remove(struct io_kiocb *req) { - const struct io_uring_sqe *sqe = req->sqe; struct io_ring_ctx *ctx = req->ctx; - unsigned flags; int ret;
- if (unlikely(ctx->flags & IORING_SETUP_IOPOLL)) - return -EINVAL; - if (sqe->flags || sqe->ioprio || sqe->buf_index || sqe->len) - return -EINVAL; - flags = READ_ONCE(sqe->timeout_flags); - if (flags) - return -EINVAL; + ret = io_timeout_remove_prep(req); + if (ret) + return ret;
spin_lock_irq(&ctx->completion_lock); - ret = io_timeout_cancel(ctx, READ_ONCE(sqe->addr)); + ret = io_timeout_cancel(ctx, req->timeout.addr);
io_cqring_fill_event(req, ret); io_commit_cqring(ctx); @@ -3107,6 +3128,9 @@ static int io_req_defer_prep(struct io_kiocb *req) case IORING_OP_TIMEOUT: ret = io_timeout_prep(req, io, false); break; + case IORING_OP_TIMEOUT_REMOVE: + ret = io_timeout_remove_prep(req); + break; case IORING_OP_ASYNC_CANCEL: ret = io_async_cancel_prep(req); break;
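To make the pattern above concrete, here is a minimal userspace sketch (not kernel code; struct sqe/struct req and the field names are simplified stand-ins) of the prep-once idea: everything needed is copied out of the reusable SQE on the first prep call, REQ_F_PREPPED marks the request, and any later call becomes a no-op, so SQE reuse can no longer change the request's view of its own data.

#include <stdint.h>
#include <stdio.h>

#define REQ_F_PREPPED (1u << 0)

struct sqe { uint64_t addr; uint32_t timeout_flags; };
struct req { unsigned flags; uint64_t addr; uint32_t tflags; };

static int timeout_remove_prep(struct req *r, const struct sqe *s)
{
	if (r->flags & REQ_F_PREPPED)	/* idempotent: second call is a no-op */
		return 0;
	r->addr = s->addr;		/* stash SQE data in the request */
	r->tflags = s->timeout_flags;
	if (r->tflags)			/* no flags supported yet */
		return -1;
	r->flags |= REQ_F_PREPPED;
	return 0;
}

int main(void)
{
	struct sqe s = { .addr = 0x1234, .timeout_flags = 0 };
	struct req r = { 0 };

	timeout_remove_prep(&r, &s);
	s.addr = 0;			/* userspace may reuse the SQE slot... */
	timeout_remove_prep(&r, &s);	/* ...but the request keeps its copy */
	printf("addr=0x%llx\n", (unsigned long long)r.addr);	/* 0x1234 */
	return 0;
}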
From: Jens Axboe axboe@kernel.dk
mainline inclusion from mainline-5.5-rc3 commit d625c6ee4975000140c57da7e1ff244efefde274 category: feature bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27 CVE: NA ---------------------------
If we defer a request, we can't be reading the opcode again later. Ensure that the user_data and opcode fields are stable. For the user_data we already have a place for it; for the opcode we can fill a one-byte hole and store it as well. For both of them, assign them when we originally read the SQE in io_get_sqring(). Any code that uses sqe->opcode or sqe->user_data is switched to req->opcode and req->user_data.
Signed-off-by: Jens Axboe axboe@kernel.dk Signed-off-by: yangerkun yangerkun@huawei.com Reviewed-by: zhangyi (F) yi.zhang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- fs/io_uring.c | 45 ++++++++++++++++++++------------------------- 1 file changed, 20 insertions(+), 25 deletions(-)
diff --git a/fs/io_uring.c b/fs/io_uring.c index 0911ad41c3f8..dbaad8a562de 100644 --- a/fs/io_uring.c +++ b/fs/io_uring.c @@ -384,6 +384,7 @@ struct io_kiocb { bool has_user; bool in_async; bool needs_fixed_file; + u8 opcode;
struct io_ring_ctx *ctx; union { @@ -596,12 +597,10 @@ static void __io_commit_cqring(struct io_ring_ctx *ctx) } }
-static inline bool io_sqe_needs_user(const struct io_uring_sqe *sqe) +static inline bool io_req_needs_user(struct io_kiocb *req) { - u8 opcode = READ_ONCE(sqe->opcode); - - return !(opcode == IORING_OP_READ_FIXED || - opcode == IORING_OP_WRITE_FIXED); + return !(req->opcode == IORING_OP_READ_FIXED || + req->opcode == IORING_OP_WRITE_FIXED); }
static inline bool io_prep_async_work(struct io_kiocb *req, @@ -610,7 +609,7 @@ static inline bool io_prep_async_work(struct io_kiocb *req, bool do_hashed = false;
if (req->sqe) { - switch (req->sqe->opcode) { + switch (req->opcode) { case IORING_OP_WRITEV: case IORING_OP_WRITE_FIXED: /* only regular files should be hashed for writes */ @@ -633,7 +632,7 @@ static inline bool io_prep_async_work(struct io_kiocb *req, req->work.flags |= IO_WQ_WORK_UNBOUND; break; } - if (io_sqe_needs_user(req->sqe)) + if (io_req_needs_user(req)) req->work.flags |= IO_WQ_WORK_NEEDS_USER; }
@@ -1004,7 +1003,7 @@ static void io_fail_links(struct io_kiocb *req) trace_io_uring_fail_link(req, link);
if ((req->flags & REQ_F_LINK_TIMEOUT) && - link->sqe->opcode == IORING_OP_LINK_TIMEOUT) { + link->opcode == IORING_OP_LINK_TIMEOUT) { io_link_cancel_timeout(link); } else { io_cqring_fill_event(link, -ECANCELED); @@ -1647,7 +1646,7 @@ static ssize_t io_import_iovec(int rw, struct io_kiocb *req, * for that purpose and instead let the caller pass in the read/write * flag. */ - opcode = READ_ONCE(sqe->opcode); + opcode = req->opcode; if (opcode == IORING_OP_READ_FIXED || opcode == IORING_OP_WRITE_FIXED) { *iovec = NULL; return io_import_fixed(req->ctx, rw, sqe, iter); @@ -3081,7 +3080,7 @@ static int io_req_defer_prep(struct io_kiocb *req) struct iov_iter iter; ssize_t ret;
- switch (io->sqe.opcode) { + switch (req->opcode) { case IORING_OP_READV: case IORING_OP_READ_FIXED: /* ensure prep does right import */ @@ -3180,11 +3179,10 @@ __attribute__((nonnull)) static int io_issue_sqe(struct io_kiocb *req, struct io_kiocb **nxt, bool force_nonblock) { - int ret, opcode; struct io_ring_ctx *ctx = req->ctx; + int ret;
- opcode = READ_ONCE(req->sqe->opcode); - switch (opcode) { + switch (req->opcode) { case IORING_OP_NOP: ret = io_nop(req); break; @@ -3321,11 +3319,9 @@ static bool io_req_op_valid(int op) return op >= IORING_OP_NOP && op < IORING_OP_LAST; }
-static int io_op_needs_file(const struct io_uring_sqe *sqe) +static int io_req_needs_file(struct io_kiocb *req) { - int op = READ_ONCE(sqe->opcode); - - switch (op) { + switch (req->opcode) { case IORING_OP_NOP: case IORING_OP_POLL_REMOVE: case IORING_OP_TIMEOUT: @@ -3334,7 +3330,7 @@ static int io_op_needs_file(const struct io_uring_sqe *sqe) case IORING_OP_LINK_TIMEOUT: return 0; default: - if (io_req_op_valid(op)) + if (io_req_op_valid(req->opcode)) return 1; return -EINVAL; } @@ -3361,7 +3357,7 @@ static int io_req_set_file(struct io_submit_state *state, struct io_kiocb *req) if (flags & IOSQE_IO_DRAIN) req->flags |= REQ_F_IO_DRAIN;
- ret = io_op_needs_file(req->sqe); + ret = io_req_needs_file(req); if (ret <= 0) return ret;
@@ -3481,7 +3477,7 @@ static struct io_kiocb *io_prep_linked_timeout(struct io_kiocb *req)
nxt = list_first_entry_or_null(&req->link_list, struct io_kiocb, link_list); - if (!nxt || nxt->sqe->opcode != IORING_OP_LINK_TIMEOUT) + if (!nxt || nxt->opcode != IORING_OP_LINK_TIMEOUT) return NULL;
req->flags |= REQ_F_LINK_TIMEOUT; @@ -3583,8 +3579,6 @@ static bool io_submit_sqe(struct io_kiocb *req, struct io_submit_state *state, struct io_ring_ctx *ctx = req->ctx; int ret;
- req->user_data = req->sqe->user_data; - /* enforce forwards compatibility on users */ if (unlikely(req->sqe->flags & ~SQE_VALID_FLAGS)) { ret = -EINVAL; @@ -3716,6 +3710,8 @@ static bool io_get_sqring(struct io_ring_ctx *ctx, struct io_kiocb *req) */ req->sequence = ctx->cached_sq_head; req->sqe = &ctx->sq_sqes[head]; + req->opcode = READ_ONCE(req->sqe->opcode); + req->user_data = READ_ONCE(req->sqe->user_data); ctx->cached_sq_head++; return true; } @@ -3761,7 +3757,7 @@ static int io_submit_sqes(struct io_ring_ctx *ctx, unsigned int nr, break; }
- if (io_sqe_needs_user(req->sqe) && !*mm) { + if (io_req_needs_user(req) && !*mm) { mm_fault = mm_fault || !mmget_not_zero(ctx->sqo_mm); if (!mm_fault) { use_mm(ctx->sqo_mm); @@ -3777,8 +3773,7 @@ static int io_submit_sqes(struct io_ring_ctx *ctx, unsigned int nr, req->has_user = *mm != NULL; req->in_async = async; req->needs_fixed_file = async; - trace_io_uring_submit_sqe(ctx, req->sqe->user_data, - true, async); + trace_io_uring_submit_sqe(ctx, req->user_data, true, async); if (!io_submit_sqe(req, statep, &link)) break; /*
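The hazard being closed here is a classic TOCTOU on shared memory: the SQE array is mapped into userspace, so two reads of sqe->opcode may observe two different values. A userspace sketch of the fix (READ_ONCE here is the usual volatile-cast analogue; the structs are simplified stand-ins): read each shared field exactly once into the request, then use only the stable copy.

#include <stdint.h>
#include <stdio.h>

#define READ_ONCE(x) (*(volatile typeof(x) *)&(x))

struct sqe { uint8_t opcode; uint64_t user_data; };
struct req { uint8_t opcode; uint64_t user_data; };

static void get_sqring(struct req *r, struct sqe *shared)
{
	/* one racy read each; everything after this uses the copies */
	r->opcode = READ_ONCE(shared->opcode);
	r->user_data = READ_ONCE(shared->user_data);
}

int main(void)
{
	struct sqe s = { .opcode = 1, .user_data = 7 };
	struct req r;

	get_sqring(&r, &s);
	s.opcode = 99;	/* later SQE reuse can no longer confuse us */
	printf("opcode=%u user_data=%llu\n", r.opcode,
	       (unsigned long long)r.user_data);
	return 0;
}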
From: Jens Axboe axboe@kernel.dk
mainline inclusion from mainline-5.5-rc3 commit e781573e2fb1b75acdba61dcb9bcbfc16f288442 category: feature bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27 CVE: NA ---------------------------
Now that we have all the opcodes handled in terms of command prep and SQE reuse, add a printk_once() to warn about any potentially new and unhandled ones.
Signed-off-by: Jens Axboe axboe@kernel.dk Signed-off-by: yangerkun yangerkun@huawei.com Reviewed-by: zhangyi (F) yi.zhang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- fs/io_uring.c | 8 ++++++-- 1 file changed, 6 insertions(+), 2 deletions(-)
diff --git a/fs/io_uring.c b/fs/io_uring.c index dbaad8a562de..61b468153815 100644 --- a/fs/io_uring.c +++ b/fs/io_uring.c @@ -3078,9 +3078,11 @@ static int io_req_defer_prep(struct io_kiocb *req) struct iovec inline_vecs[UIO_FASTIOV], *iovec = inline_vecs; struct io_async_ctx *io = req->io; struct iov_iter iter; - ssize_t ret; + ssize_t ret = 0;
switch (req->opcode) { + case IORING_OP_NOP: + break; case IORING_OP_READV: case IORING_OP_READ_FIXED: /* ensure prep does right import */ @@ -3140,7 +3142,9 @@ static int io_req_defer_prep(struct io_kiocb *req) ret = io_accept_prep(req); break; default: - ret = 0; + printk_once(KERN_WARNING "io_uring: unhandled opcode %d\n", + req->opcode); + ret = -EINVAL; break; }
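For readers unfamiliar with printk_once(): it keeps a static flag per call site, so the warning fires at most once no matter how often an unhandled opcode is submitted. A small userspace analogue (pr_warn_once is my own stand-in name):

#include <stdio.h>

#define pr_warn_once(fmt, ...) do {			\
	static int __warned;				\
	if (!__warned) {				\
		__warned = 1;				\
		fprintf(stderr, fmt, ##__VA_ARGS__);	\
	}						\
} while (0)

int main(void)
{
	for (int i = 0; i < 3; i++)
		pr_warn_once("unhandled opcode %d\n", 42);	/* prints once */
	return 0;
}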
From: Pavel Begunkov asml.silence@gmail.com
mainline inclusion from mainline-5.5-rc3 commit 7c504e65206a4379ff38fe41d21b32b6c2c3e53e category: feature bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27 CVE: NA ---------------------------
There is no reliable way to submit and wait in a single syscall, as io_submit_sqes() may under-consume sqes (in case of an early error). Then it will wait for not-yet-submitted requests, deadlocking the user in most cases.
Don't wait/poll if we can't submit all sqes
Signed-off-by: Pavel Begunkov asml.silence@gmail.com Signed-off-by: Jens Axboe axboe@kernel.dk Signed-off-by: yangerkun yangerkun@huawei.com Reviewed-by: zhangyi (F) yi.zhang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- fs/io_uring.c | 4 ++++ 1 file changed, 4 insertions(+)
diff --git a/fs/io_uring.c b/fs/io_uring.c index 61b468153815..e0b372819d8e 100644 --- a/fs/io_uring.c +++ b/fs/io_uring.c @@ -5135,6 +5135,9 @@ SYSCALL_DEFINE6(io_uring_enter, unsigned int, fd, u32, to_submit, submitted = io_submit_sqes(ctx, to_submit, f.file, fd, &cur_mm, false); mutex_unlock(&ctx->uring_lock); + + if (submitted != to_submit) + goto out; } if (flags & IORING_ENTER_GETEVENTS) { unsigned nr_events = 0; @@ -5148,6 +5151,7 @@ SYSCALL_DEFINE6(io_uring_enter, unsigned int, fd, u32, to_submit, } }
+out: percpu_ref_put(&ctx->refs); out_fput: fdput(f);
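Seen from userspace, the guard looks like the sketch below: if the kernel consumed fewer SQEs than to_submit (early error), waiting for to_submit completions would block forever. The patch applies this check inside one io_uring_enter() call; the sketch splits submit and wait into two calls purely to make the contract visible, and is also the defensive pattern on kernels without the fix. It assumes 5.1+ headers for __NR_io_uring_enter and IORING_ENTER_GETEVENTS; ring setup and error handling are omitted.

#include <sys/syscall.h>
#include <unistd.h>
#include <linux/io_uring.h>

static int enter(int ring_fd, unsigned to_submit, unsigned min_complete,
		 unsigned flags)
{
	return syscall(__NR_io_uring_enter, ring_fd, to_submit,
		       min_complete, flags, NULL, 0);
}

int submit_and_wait(int ring_fd, unsigned to_submit)
{
	int submitted = enter(ring_fd, to_submit, 0, 0);

	if (submitted < 0)
		return submitted;
	if ((unsigned)submitted != to_submit)	/* under-consumed: don't wait */
		return submitted;
	/* safe: every request we wait on really was submitted */
	return enter(ring_fd, 0, to_submit, IORING_ENTER_GETEVENTS);
}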
From: Jens Axboe axboe@kernel.dk
mainline inclusion from mainline-5.5-rc3 commit fd6c2e4c063d64511657ad0031a1677b6a914859 category: feature bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27 CVE: NA ---------------------------
I've been chasing a weird and obscure crash that turned out to be userspace stack corruption, and finally narrowed it down to a bit flip that made a stack address invalid. io_wq_submit_work() unconditionally flips the req->rw.ki_flags IOCB_NOWAIT bit, but since it's a generic work handler, this isn't valid. Normal read/write operations own that part of the request; on other request types it could be something else entirely.
Move the IOCB_NOWAIT clear to the read/write handlers where it belongs.
Signed-off-by: Jens Axboe axboe@kernel.dk Signed-off-by: yangerkun yangerkun@huawei.com Reviewed-by: zhangyi (F) yi.zhang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- fs/io_uring.c | 11 ++++++++--- 1 file changed, 8 insertions(+), 3 deletions(-)
diff --git a/fs/io_uring.c b/fs/io_uring.c index e0b372819d8e..67d578c36351 100644 --- a/fs/io_uring.c +++ b/fs/io_uring.c @@ -1816,6 +1816,10 @@ static int io_read(struct io_kiocb *req, struct io_kiocb **nxt, return ret; }
+ /* Ensure we clear previously set non-block flag */ + if (!force_nonblock) + req->rw.ki_flags &= ~IOCB_NOWAIT; + file = req->file; io_size = ret; if (req->flags & REQ_F_LINK) @@ -1905,6 +1909,10 @@ static int io_write(struct io_kiocb *req, struct io_kiocb **nxt, return ret; }
+ /* Ensure we clear previously set non-block flag */ + if (!force_nonblock) + req->rw.ki_flags &= ~IOCB_NOWAIT; + file = kiocb->ki_filp; io_size = ret; if (req->flags & REQ_F_LINK) @@ -3273,9 +3281,6 @@ static void io_wq_submit_work(struct io_wq_work **workptr) struct io_kiocb *nxt = NULL; int ret = 0;
- /* Ensure we clear previously set non-block flag */ - req->rw.ki_flags &= ~IOCB_NOWAIT; - if (work->flags & IO_WQ_WORK_CANCEL) ret = -ECANCELED;
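Why the unconditional clear corrupts memory: req->rw lives in a per-command union, so poking a bit through the kiocb view of a non-read/write request flips a bit in whatever member is actually live. A simplified little-endian demonstration (the structs and the literal 4 standing in for IOCB_NOWAIT are illustrative, not the real layout):

#include <stdint.h>
#include <stdio.h>

struct kiocb_s { int ki_flags; };
struct timeout_s { uint64_t addr; };

struct req {
	union {
		struct kiocb_s rw;		/* live for read/write only */
		struct timeout_s timeout;	/* live for timeout requests */
	};
};

int main(void)
{
	struct req r = { .timeout = { .addr = 0x100000004ULL } };

	r.rw.ki_flags &= ~4;	/* "clear IOCB_NOWAIT" on a timeout req... */
	printf("addr=0x%llx\n",	/* ...silently becomes 0x100000000 */
	       (unsigned long long)r.timeout.addr);
	return 0;
}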
From: Jens Axboe axboe@kernel.dk
mainline inclusion from mainline-5.5-rc4 commit d55e5f5b70dd6214ef81fb2313121b72a7dd2200 category: feature bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27 CVE: NA ---------------------------
We use u64_to_user_ptr() in some spots, but not consistently. Convert the rest over; it makes the code easier to read as well.
No functional changes in this patch.
Signed-off-by: Jens Axboe axboe@kernel.dk Signed-off-by: yangerkun yangerkun@huawei.com Reviewed-by: zhangyi (F) yi.zhang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- fs/io_uring.c | 16 +++++++--------- 1 file changed, 7 insertions(+), 9 deletions(-)
diff --git a/fs/io_uring.c b/fs/io_uring.c index 67d578c36351..70c62542fd67 100644 --- a/fs/io_uring.c +++ b/fs/io_uring.c @@ -2156,7 +2156,7 @@ static int io_sendmsg_prep(struct io_kiocb *req, struct io_async_ctx *io) unsigned flags;
flags = READ_ONCE(sqe->msg_flags); - msg = (struct user_msghdr __user *)(unsigned long) READ_ONCE(sqe->addr); + msg = u64_to_user_ptr(READ_ONCE(sqe->addr)); io->msg.iov = io->msg.fast_iov; return sendmsg_copy_msghdr(&io->msg.msg, msg, flags, &io->msg.iov); #else @@ -2238,7 +2238,7 @@ static int io_recvmsg_prep(struct io_kiocb *req, struct io_async_ctx *io) unsigned flags;
flags = READ_ONCE(sqe->msg_flags); - msg = (struct user_msghdr __user *)(unsigned long) READ_ONCE(sqe->addr); + msg = u64_to_user_ptr(READ_ONCE(sqe->addr)); io->msg.iov = io->msg.fast_iov; return recvmsg_copy_msghdr(&io->msg.msg, msg, flags, &io->msg.uaddr, &io->msg.iov); @@ -2272,8 +2272,7 @@ static int io_recvmsg(struct io_kiocb *req, struct io_kiocb **nxt, else if (force_nonblock) flags |= MSG_DONTWAIT;
- msg = (struct user_msghdr __user *) (unsigned long) - READ_ONCE(sqe->addr); + msg = u64_to_user_ptr(READ_ONCE(sqe->addr)); if (req->io) { kmsg = &req->io->msg; kmsg->msg.msg_name = &addr; @@ -2330,9 +2329,8 @@ static int io_accept_prep(struct io_kiocb *req) if (sqe->ioprio || sqe->len || sqe->buf_index) return -EINVAL;
- accept->addr = (struct sockaddr __user *) - (unsigned long) READ_ONCE(sqe->addr); - accept->addr_len = (int __user *) (unsigned long) READ_ONCE(sqe->addr2); + accept->addr = u64_to_user_ptr(READ_ONCE(sqe->addr)); + accept->addr_len = u64_to_user_ptr(READ_ONCE(sqe->addr2)); accept->flags = READ_ONCE(sqe->accept_flags); req->flags |= REQ_F_PREPPED; return 0; @@ -2406,7 +2404,7 @@ static int io_connect_prep(struct io_kiocb *req, struct io_async_ctx *io) struct sockaddr __user *addr; int addr_len;
- addr = (struct sockaddr __user *) (unsigned long) READ_ONCE(sqe->addr); + addr = u64_to_user_ptr(READ_ONCE(sqe->addr)); addr_len = READ_ONCE(sqe->addr2); return move_addr_to_kernel(addr, addr_len, &io->connect.address); #else @@ -4701,7 +4699,7 @@ static int io_copy_iov(struct io_ring_ctx *ctx, struct iovec *dst, if (copy_from_user(&ciov, &ciovs[index], sizeof(ciov))) return -EFAULT;
- dst->iov_base = (void __user *) (unsigned long) ciov.iov_base; + dst->iov_base = u64_to_user_ptr((u64)ciov.iov_base); dst->iov_len = ciov.iov_len; return 0; }
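Roughly what u64_to_user_ptr() hides, as a userspace analogue (u64_to_ptr is my own stand-in name; if I recall correctly, the in-kernel macro additionally typechecks that its argument really is a u64): going through uintptr_t keeps the u64-to-pointer conversion well-defined and warning-free on 32-bit, where a direct cast from a 64-bit integer truncates and trips -Wint-to-pointer-cast.

#include <stdint.h>

#define u64_to_ptr(x) ((void *)(uintptr_t)(x))

void *example(uint64_t user_addr)
{
	return u64_to_ptr(user_addr);	/* one obvious spelling everywhere */
}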
From: Jens Axboe axboe@kernel.dk
mainline inclusion from mainline-5.5-rc4 commit 9adbd45d6d32ffc1a03f3c51d72cfc69ebfc2ddb category: feature bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27 CVE: NA ---------------------------
Put the kiocb in struct io_rw, and add the addr/len for the request as well. Use the kiocb->private field for the buffer index for fixed reads and writes.
Any use of kiocb->ki_filp is flipped to req->file. It's the same thing, and less confusing.
Signed-off-by: Jens Axboe axboe@kernel.dk Signed-off-by: yangerkun yangerkun@huawei.com Reviewed-by: zhangyi (F) yi.zhang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- fs/io_uring.c | 96 +++++++++++++++++++++++++++------------------------ 1 file changed, 50 insertions(+), 46 deletions(-)
diff --git a/fs/io_uring.c b/fs/io_uring.c index 70c62542fd67..c38e34925bb3 100644 --- a/fs/io_uring.c +++ b/fs/io_uring.c @@ -332,6 +332,13 @@ struct io_timeout { int flags; };
+struct io_rw { + /* NOTE: kiocb has the file as the first member, so don't do it here */ + struct kiocb kiocb; + u64 addr; + u64 len; +}; + struct io_async_connect { struct sockaddr_storage address; }; @@ -369,7 +376,7 @@ struct io_async_ctx { struct io_kiocb { union { struct file *file; - struct kiocb rw; + struct io_rw rw; struct io_poll_iocb poll; struct io_accept accept; struct io_sync sync; @@ -1179,7 +1186,7 @@ static int io_do_iopoll(struct io_ring_ctx *ctx, unsigned int *nr_events,
ret = 0; list_for_each_entry_safe(req, tmp, &ctx->poll_list, list) { - struct kiocb *kiocb = &req->rw; + struct kiocb *kiocb = &req->rw.kiocb;
/* * Move completed entries to our local list. If we find a @@ -1334,7 +1341,7 @@ static inline void req_set_fail_links(struct io_kiocb *req)
static void io_complete_rw_common(struct kiocb *kiocb, long res) { - struct io_kiocb *req = container_of(kiocb, struct io_kiocb, rw); + struct io_kiocb *req = container_of(kiocb, struct io_kiocb, rw.kiocb);
if (kiocb->ki_flags & IOCB_WRITE) kiocb_end_write(req); @@ -1346,7 +1353,7 @@ static void io_complete_rw_common(struct kiocb *kiocb, long res)
static void io_complete_rw(struct kiocb *kiocb, long res, long res2) { - struct io_kiocb *req = container_of(kiocb, struct io_kiocb, rw); + struct io_kiocb *req = container_of(kiocb, struct io_kiocb, rw.kiocb);
io_complete_rw_common(kiocb, res); io_put_req(req); @@ -1354,7 +1361,7 @@ static void io_complete_rw(struct kiocb *kiocb, long res, long res2)
static struct io_kiocb *__io_complete_rw(struct kiocb *kiocb, long res) { - struct io_kiocb *req = container_of(kiocb, struct io_kiocb, rw); + struct io_kiocb *req = container_of(kiocb, struct io_kiocb, rw.kiocb); struct io_kiocb *nxt = NULL;
io_complete_rw_common(kiocb, res); @@ -1365,7 +1372,7 @@ static struct io_kiocb *__io_complete_rw(struct kiocb *kiocb, long res)
static void io_complete_rw_iopoll(struct kiocb *kiocb, long res, long res2) { - struct io_kiocb *req = container_of(kiocb, struct io_kiocb, rw); + struct io_kiocb *req = container_of(kiocb, struct io_kiocb, rw.kiocb);
if (kiocb->ki_flags & IOCB_WRITE) kiocb_end_write(req); @@ -1399,7 +1406,7 @@ static void io_iopoll_req_issued(struct io_kiocb *req)
list_req = list_first_entry(&ctx->poll_list, struct io_kiocb, list); - if (list_req->rw.ki_filp != req->rw.ki_filp) + if (list_req->file != req->file) ctx->poll_multi_file = true; }
@@ -1474,7 +1481,7 @@ static int io_prep_rw(struct io_kiocb *req, bool force_nonblock) { const struct io_uring_sqe *sqe = req->sqe; struct io_ring_ctx *ctx = req->ctx; - struct kiocb *kiocb = &req->rw; + struct kiocb *kiocb = &req->rw.kiocb; unsigned ioprio; int ret;
@@ -1523,6 +1530,12 @@ static int io_prep_rw(struct io_kiocb *req, bool force_nonblock) return -EINVAL; kiocb->ki_complete = io_complete_rw; } + + req->rw.addr = READ_ONCE(req->sqe->addr); + req->rw.len = READ_ONCE(req->sqe->len); + /* we own ->private, reuse it for the buffer index */ + req->rw.kiocb.private = (void *) (unsigned long) + READ_ONCE(req->sqe->buf_index); return 0; }
@@ -1556,11 +1569,11 @@ static void kiocb_done(struct kiocb *kiocb, ssize_t ret, struct io_kiocb **nxt, io_rw_done(kiocb, ret); }
-static ssize_t io_import_fixed(struct io_ring_ctx *ctx, int rw, - const struct io_uring_sqe *sqe, +static ssize_t io_import_fixed(struct io_kiocb *req, int rw, struct iov_iter *iter) { - size_t len = READ_ONCE(sqe->len); + struct io_ring_ctx *ctx = req->ctx; + size_t len = req->rw.len; struct io_mapped_ubuf *imu; unsigned index, buf_index; size_t offset; @@ -1570,13 +1583,13 @@ static ssize_t io_import_fixed(struct io_ring_ctx *ctx, int rw, if (unlikely(!ctx->user_bufs)) return -EFAULT;
- buf_index = READ_ONCE(sqe->buf_index); + buf_index = (unsigned long) req->rw.kiocb.private; if (unlikely(buf_index >= ctx->nr_user_bufs)) return -EFAULT;
index = array_index_nospec(buf_index, ctx->nr_user_bufs); imu = &ctx->user_bufs[index]; - buf_addr = READ_ONCE(sqe->addr); + buf_addr = req->rw.addr;
/* overflow */ if (buf_addr + len < buf_addr) @@ -1633,25 +1646,20 @@ static ssize_t io_import_fixed(struct io_ring_ctx *ctx, int rw, static ssize_t io_import_iovec(int rw, struct io_kiocb *req, struct iovec **iovec, struct iov_iter *iter) { - const struct io_uring_sqe *sqe = req->sqe; - void __user *buf = u64_to_user_ptr(READ_ONCE(sqe->addr)); - size_t sqe_len = READ_ONCE(sqe->len); + void __user *buf = u64_to_user_ptr(req->rw.addr); + size_t sqe_len = req->rw.len; u8 opcode;
- /* - * We're reading ->opcode for the second time, but the first read - * doesn't care whether it's _FIXED or not, so it doesn't matter - * whether ->opcode changes concurrently. The first read does care - * about whether it is a READ or a WRITE, so we don't trust this read - * for that purpose and instead let the caller pass in the read/write - * flag. - */ opcode = req->opcode; if (opcode == IORING_OP_READ_FIXED || opcode == IORING_OP_WRITE_FIXED) { *iovec = NULL; - return io_import_fixed(req->ctx, rw, sqe, iter); + return io_import_fixed(req, rw, iter); }
+ /* buffer index only valid with fixed read/write */ + if (req->rw.kiocb.private) + return -EINVAL; + if (req->io) { struct io_async_rw *iorw = &req->io->rw;
@@ -1800,9 +1808,8 @@ static int io_read(struct io_kiocb *req, struct io_kiocb **nxt, bool force_nonblock) { struct iovec inline_vecs[UIO_FASTIOV], *iovec = inline_vecs; - struct kiocb *kiocb = &req->rw; + struct kiocb *kiocb = &req->rw.kiocb; struct iov_iter iter; - struct file *file; size_t iov_count; ssize_t io_size, ret;
@@ -1818,9 +1825,8 @@ static int io_read(struct io_kiocb *req, struct io_kiocb **nxt,
/* Ensure we clear previously set non-block flag */ if (!force_nonblock) - req->rw.ki_flags &= ~IOCB_NOWAIT; + req->rw.kiocb.ki_flags &= ~IOCB_NOWAIT;
- file = req->file; io_size = ret; if (req->flags & REQ_F_LINK) req->result = io_size; @@ -1829,20 +1835,20 @@ static int io_read(struct io_kiocb *req, struct io_kiocb **nxt, * If the file doesn't support async, mark it as REQ_F_MUST_PUNT so * we know to async punt it even if it was opened O_NONBLOCK */ - if (force_nonblock && !io_file_supports_async(file)) { + if (force_nonblock && !io_file_supports_async(req->file)) { req->flags |= REQ_F_MUST_PUNT; goto copy_iov; }
iov_count = iov_iter_count(&iter); - ret = rw_verify_area(READ, file, &kiocb->ki_pos, iov_count); + ret = rw_verify_area(READ, req->file, &kiocb->ki_pos, iov_count); if (!ret) { ssize_t ret2;
- if (file->f_op->read_iter) - ret2 = call_read_iter(file, kiocb, &iter); + if (req->file->f_op->read_iter) + ret2 = call_read_iter(req->file, kiocb, &iter); else - ret2 = loop_rw_iter(READ, file, kiocb, &iter); + ret2 = loop_rw_iter(READ, req->file, kiocb, &iter);
/* * In case of a short read, punt to async. This can happen @@ -1893,9 +1899,8 @@ static int io_write(struct io_kiocb *req, struct io_kiocb **nxt, bool force_nonblock) { struct iovec inline_vecs[UIO_FASTIOV], *iovec = inline_vecs; - struct kiocb *kiocb = &req->rw; + struct kiocb *kiocb = &req->rw.kiocb; struct iov_iter iter; - struct file *file; size_t iov_count; ssize_t ret, io_size;
@@ -1911,9 +1916,8 @@ static int io_write(struct io_kiocb *req, struct io_kiocb **nxt,
/* Ensure we clear previously set non-block flag */ if (!force_nonblock) - req->rw.ki_flags &= ~IOCB_NOWAIT; + req->rw.kiocb.ki_flags &= ~IOCB_NOWAIT;
- file = kiocb->ki_filp; io_size = ret; if (req->flags & REQ_F_LINK) req->result = io_size; @@ -1933,7 +1937,7 @@ static int io_write(struct io_kiocb *req, struct io_kiocb **nxt, goto copy_iov;
iov_count = iov_iter_count(&iter); - ret = rw_verify_area(WRITE, file, &kiocb->ki_pos, iov_count); + ret = rw_verify_area(WRITE, req->file, &kiocb->ki_pos, iov_count); if (!ret) { ssize_t ret2;
@@ -1945,17 +1949,17 @@ static int io_write(struct io_kiocb *req, struct io_kiocb **nxt, * we return to userspace. */ if (req->flags & REQ_F_ISREG) { - __sb_start_write(file_inode(file)->i_sb, + __sb_start_write(file_inode(req->file)->i_sb, SB_FREEZE_WRITE, true); - __sb_writers_release(file_inode(file)->i_sb, + __sb_writers_release(file_inode(req->file)->i_sb, SB_FREEZE_WRITE); } kiocb->ki_flags |= IOCB_WRITE;
- if (file->f_op->write_iter) - ret2 = call_write_iter(file, kiocb, &iter); + if (req->file->f_op->write_iter) + ret2 = call_write_iter(req->file, kiocb, &iter); else - ret2 = loop_rw_iter(WRITE, file, kiocb, &iter); + ret2 = loop_rw_iter(WRITE, req->file, kiocb, &iter); if (!force_nonblock || ret2 != -EAGAIN) { kiocb_done(kiocb, ret2, nxt, req->in_async); } else { @@ -2035,7 +2039,7 @@ static void io_fsync_finish(struct io_wq_work **workptr) if (io_req_cancelled(req)) return;
- ret = vfs_fsync_range(req->rw.ki_filp, req->sync.off, + ret = vfs_fsync_range(req->file, req->sync.off, end > 0 ? end : LLONG_MAX, req->sync.flags & IORING_FSYNC_DATASYNC); if (ret < 0) @@ -2101,7 +2105,7 @@ static void io_sync_file_range_finish(struct io_wq_work **workptr) if (io_req_cancelled(req)) return;
- ret = sync_file_range(req->rw.ki_filp, req->sync.off, req->sync.len, + ret = sync_file_range(req->file, req->sync.off, req->sync.len, req->sync.flags); if (ret < 0) req_set_fail_links(req);
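The NOTE in struct io_rw above relies on first-member aliasing: struct kiocb has the file as its first member, so as long as the kiocb stays first in io_rw, req->file and req->rw.kiocb.ki_filp name the same storage through the union. A simplified sketch of the invariant (structs are stand-ins):

#include <assert.h>
#include <stddef.h>

struct file;

struct kiocb_s { struct file *ki_filp; int ki_flags; };
struct io_rw_s { struct kiocb_s kiocb; /* must stay first */ };

struct req {
	union {
		struct file *file;	/* generic view */
		struct io_rw_s rw;	/* read/write view */
	};
};

int main(void)
{
	assert(offsetof(struct req, file) ==
	       offsetof(struct req, rw.kiocb.ki_filp));
	return 0;
}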
From: Jens Axboe axboe@kernel.dk
mainline inclusion from mainline-5.5-rc4 commit 3fbb51c18f5c15a23db74c4da79d3d035176c480 category: feature bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27 CVE: NA ---------------------------
Add struct io_connect in our io_kiocb per-command union, and ensure that io_connect_prep() has grabbed what it needs from the SQE.
Signed-off-by: Jens Axboe axboe@kernel.dk Signed-off-by: yangerkun yangerkun@huawei.com Reviewed-by: zhangyi (F) yi.zhang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- fs/io_uring.c | 40 ++++++++++++++++++++++------------------ 1 file changed, 22 insertions(+), 18 deletions(-)
diff --git a/fs/io_uring.c b/fs/io_uring.c index c38e34925bb3..e97d6e98d6bf 100644 --- a/fs/io_uring.c +++ b/fs/io_uring.c @@ -339,6 +339,12 @@ struct io_rw { u64 len; };
+struct io_connect { + struct file *file; + struct sockaddr __user *addr; + int addr_len; +}; + struct io_async_connect { struct sockaddr_storage address; }; @@ -382,6 +388,7 @@ struct io_kiocb { struct io_sync sync; struct io_cancel cancel; struct io_timeout timeout; + struct io_connect connect; };
const struct io_uring_sqe *sqe; @@ -2405,14 +2412,18 @@ static int io_connect_prep(struct io_kiocb *req, struct io_async_ctx *io) { #if defined(CONFIG_NET) const struct io_uring_sqe *sqe = req->sqe; - struct sockaddr __user *addr; - int addr_len;
- addr = u64_to_user_ptr(READ_ONCE(sqe->addr)); - addr_len = READ_ONCE(sqe->addr2); - return move_addr_to_kernel(addr, addr_len, &io->connect.address); + if (unlikely(req->ctx->flags & (IORING_SETUP_IOPOLL|IORING_SETUP_SQPOLL))) + return -EINVAL; + if (sqe->ioprio || sqe->len || sqe->buf_index || sqe->rw_flags) + return -EINVAL; + + req->connect.addr = u64_to_user_ptr(READ_ONCE(sqe->addr)); + req->connect.addr_len = READ_ONCE(sqe->addr2); + return move_addr_to_kernel(req->connect.addr, req->connect.addr_len, + &io->connect.address); #else - return 0; + return -EOPNOTSUPP; #endif }
@@ -2420,18 +2431,9 @@ static int io_connect(struct io_kiocb *req, struct io_kiocb **nxt, bool force_nonblock) { #if defined(CONFIG_NET) - const struct io_uring_sqe *sqe = req->sqe; struct io_async_ctx __io, *io; unsigned file_flags; - int addr_len, ret; - - if (unlikely(req->ctx->flags & (IORING_SETUP_IOPOLL|IORING_SETUP_SQPOLL))) - return -EINVAL; - if (sqe->ioprio || sqe->len || sqe->buf_index || sqe->rw_flags) - return -EINVAL; - - addr_len = READ_ONCE(sqe->addr2); - file_flags = force_nonblock ? O_NONBLOCK : 0; + int ret;
if (req->io) { io = req->io; @@ -2442,8 +2444,10 @@ static int io_connect(struct io_kiocb *req, struct io_kiocb **nxt, io = &__io; }
- ret = __sys_connect_file(req->file, &io->connect.address, addr_len, - file_flags); + file_flags = force_nonblock ? O_NONBLOCK : 0; + + ret = __sys_connect_file(req->file, &io->connect.address, + req->connect.addr_len, file_flags); if ((ret == -EAGAIN || ret == -EINPROGRESS) && force_nonblock) { if (req->io) return -EAGAIN;
From: Jens Axboe axboe@kernel.dk
mainline inclusion from mainline-5.5-rc4 commit e47293fdf98998292a89d516c8f7b8b9eb5c5213 category: feature bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27 CVE: NA ---------------------------
Add struct io_sr_msg in our io_kiocb per-command union, and ensure that the send/recvmsg prep handlers have grabbed what they need from the SQE by the time prep is done.
Signed-off-by: Jens Axboe axboe@kernel.dk Signed-off-by: yangerkun yangerkun@huawei.com Reviewed-by: zhangyi (F) yi.zhang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- fs/io_uring.c | 64 ++++++++++++++++++++++++++------------------------- 1 file changed, 33 insertions(+), 31 deletions(-)
diff --git a/fs/io_uring.c b/fs/io_uring.c index e97d6e98d6bf..05463be5e320 100644 --- a/fs/io_uring.c +++ b/fs/io_uring.c @@ -345,6 +345,12 @@ struct io_connect { int addr_len; };
+struct io_sr_msg { + struct file *file; + struct user_msghdr __user *msg; + int msg_flags; +}; + struct io_async_connect { struct sockaddr_storage address; }; @@ -389,6 +395,7 @@ struct io_kiocb { struct io_cancel cancel; struct io_timeout timeout; struct io_connect connect; + struct io_sr_msg sr_msg; };
const struct io_uring_sqe *sqe; @@ -2163,15 +2170,15 @@ static int io_sendmsg_prep(struct io_kiocb *req, struct io_async_ctx *io) { #if defined(CONFIG_NET) const struct io_uring_sqe *sqe = req->sqe; - struct user_msghdr __user *msg; - unsigned flags; + struct io_sr_msg *sr = &req->sr_msg;
- flags = READ_ONCE(sqe->msg_flags); - msg = u64_to_user_ptr(READ_ONCE(sqe->addr)); + sr->msg_flags = READ_ONCE(sqe->msg_flags); + sr->msg = u64_to_user_ptr(READ_ONCE(sqe->addr)); io->msg.iov = io->msg.fast_iov; - return sendmsg_copy_msghdr(&io->msg.msg, msg, flags, &io->msg.iov); + return sendmsg_copy_msghdr(&io->msg.msg, sr->msg, sr->msg_flags, + &io->msg.iov); #else - return 0; + return -EOPNOTSUPP; #endif }
@@ -2179,7 +2186,6 @@ static int io_sendmsg(struct io_kiocb *req, struct io_kiocb **nxt, bool force_nonblock) { #if defined(CONFIG_NET) - const struct io_uring_sqe *sqe = req->sqe; struct io_async_msghdr *kmsg = NULL; struct socket *sock; int ret; @@ -2193,12 +2199,6 @@ static int io_sendmsg(struct io_kiocb *req, struct io_kiocb **nxt, struct sockaddr_storage addr; unsigned flags;
- flags = READ_ONCE(sqe->msg_flags); - if (flags & MSG_DONTWAIT) - req->flags |= REQ_F_NOWAIT; - else if (force_nonblock) - flags |= MSG_DONTWAIT; - if (req->io) { kmsg = &req->io->msg; kmsg->msg.msg_name = &addr; @@ -2214,6 +2214,12 @@ static int io_sendmsg(struct io_kiocb *req, struct io_kiocb **nxt, goto out; }
+ flags = req->sr_msg.msg_flags; + if (flags & MSG_DONTWAIT) + req->flags |= REQ_F_NOWAIT; + else if (force_nonblock) + flags |= MSG_DONTWAIT; + ret = __sys_sendmsg_sock(sock, &kmsg->msg, flags); if (force_nonblock && ret == -EAGAIN) { if (req->io) @@ -2244,17 +2250,15 @@ static int io_sendmsg(struct io_kiocb *req, struct io_kiocb **nxt, static int io_recvmsg_prep(struct io_kiocb *req, struct io_async_ctx *io) { #if defined(CONFIG_NET) - const struct io_uring_sqe *sqe = req->sqe; - struct user_msghdr __user *msg; - unsigned flags; + struct io_sr_msg *sr = &req->sr_msg;
- flags = READ_ONCE(sqe->msg_flags); - msg = u64_to_user_ptr(READ_ONCE(sqe->addr)); + sr->msg_flags = READ_ONCE(req->sqe->msg_flags); + sr->msg = u64_to_user_ptr(READ_ONCE(req->sqe->addr)); io->msg.iov = io->msg.fast_iov; - return recvmsg_copy_msghdr(&io->msg.msg, msg, flags, &io->msg.uaddr, - &io->msg.iov); + return recvmsg_copy_msghdr(&io->msg.msg, sr->msg, sr->msg_flags, + &io->msg.uaddr, &io->msg.iov); #else - return 0; + return -EOPNOTSUPP; #endif }
@@ -2262,7 +2266,6 @@ static int io_recvmsg(struct io_kiocb *req, struct io_kiocb **nxt, bool force_nonblock) { #if defined(CONFIG_NET) - const struct io_uring_sqe *sqe = req->sqe; struct io_async_msghdr *kmsg = NULL; struct socket *sock; int ret; @@ -2272,18 +2275,10 @@ static int io_recvmsg(struct io_kiocb *req, struct io_kiocb **nxt,
sock = sock_from_file(req->file, &ret); if (sock) { - struct user_msghdr __user *msg; struct io_async_ctx io; struct sockaddr_storage addr; unsigned flags;
- flags = READ_ONCE(sqe->msg_flags); - if (flags & MSG_DONTWAIT) - req->flags |= REQ_F_NOWAIT; - else if (force_nonblock) - flags |= MSG_DONTWAIT; - - msg = u64_to_user_ptr(READ_ONCE(sqe->addr)); if (req->io) { kmsg = &req->io->msg; kmsg->msg.msg_name = &addr; @@ -2299,7 +2294,14 @@ static int io_recvmsg(struct io_kiocb *req, struct io_kiocb **nxt, goto out; }
- ret = __sys_recvmsg_sock(sock, &kmsg->msg, msg, kmsg->uaddr, flags); + flags = req->sr_msg.msg_flags; + if (flags & MSG_DONTWAIT) + req->flags |= REQ_F_NOWAIT; + else if (force_nonblock) + flags |= MSG_DONTWAIT; + + ret = __sys_recvmsg_sock(sock, &kmsg->msg, req->sr_msg.msg, + kmsg->uaddr, flags); if (force_nonblock && ret == -EAGAIN) { if (req->io) return -EAGAIN;
From: Jens Axboe axboe@kernel.dk
mainline inclusion from mainline-5.5-rc4 commit 26a61679f10c6f041726411964b172565021c2eb category: feature bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27 CVE: NA ---------------------------
Add the count field to struct io_timeout, and ensure the prep handler has read it. Timeouts also always need an async context; set one up in the prep handler if we don't have it yet.
Signed-off-by: Jens Axboe axboe@kernel.dk Signed-off-by: yangerkun yangerkun@huawei.com Reviewed-by: zhangyi (F) yi.zhang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- fs/io_uring.c | 11 ++++++++--- 1 file changed, 8 insertions(+), 3 deletions(-)
diff --git a/fs/io_uring.c b/fs/io_uring.c index 05463be5e320..5badcd315eef 100644 --- a/fs/io_uring.c +++ b/fs/io_uring.c @@ -330,6 +330,7 @@ struct io_timeout { struct file *file; u64 addr; int flags; + unsigned count; };
struct io_rw { @@ -2901,7 +2902,12 @@ static int io_timeout_prep(struct io_kiocb *req, struct io_async_ctx *io, if (flags & ~IORING_TIMEOUT_ABS) return -EINVAL;
- data = &io->timeout; + req->timeout.count = READ_ONCE(sqe->off); + + if (!io && io_alloc_async_ctx(req)) + return -ENOMEM; + + data = &req->io->timeout; data->req = req; req->flags |= REQ_F_TIMEOUT;
@@ -2919,7 +2925,6 @@ static int io_timeout_prep(struct io_kiocb *req, struct io_async_ctx *io,
static int io_timeout(struct io_kiocb *req) { - const struct io_uring_sqe *sqe = req->sqe; unsigned count; struct io_ring_ctx *ctx = req->ctx; struct io_timeout_data *data; @@ -2941,7 +2946,7 @@ static int io_timeout(struct io_kiocb *req) * timeout event to be satisfied. If it isn't set, then this is * a pure timeout request, sequence isn't used. */ - count = READ_ONCE(sqe->off); + count = req->timeout.count; if (!count) { req->flags |= REQ_F_TIMEOUT_NOSEQ; spin_lock_irq(&ctx->completion_lock);
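A minimal sketch of the allocate-in-prep rule introduced here (userspace stand-ins, with malloc playing the role of io_alloc_async_ctx): timeout data must outlive the SQE, so prep guarantees req->io exists before anything is written through it.

#include <stdlib.h>
#include <stdio.h>

struct async_ctx { long timeout_ns; };	/* data that must outlive the SQE */
struct req { struct async_ctx *io; };

static int timeout_prep(struct req *r, long ns)
{
	if (!r->io) {			/* no async context yet: allocate */
		r->io = malloc(sizeof(*r->io));
		if (!r->io)
			return -1;	/* -ENOMEM in the kernel */
	}
	r->io->timeout_ns = ns;		/* stored off-SQE from here on */
	return 0;
}

int main(void)
{
	struct req r = { 0 };

	if (timeout_prep(&r, 1000000) == 0)
		printf("ns=%ld\n", r.io->timeout_ns);
	free(r.io);
	return 0;
}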
From: Jens Axboe axboe@kernel.dk
mainline inclusion from mainline-5.5-rc4 commit 06b76d44ba25e52711dc7cc4fc75b50907bc6b8e category: feature bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27 CVE: NA ---------------------------
We currently have a mix of use cases. Most of the newer ones are pretty uniform, but we have some older ones that use different calling conventions. This is confusing.
For the opcodes that currently rely on the req->io->sqe copy saving them from reuse, add a request type struct in the io_kiocb command union to store the data they need.
Prepare for all opcodes having a standard prep method, so we can call it in a uniform fashion and outside of the opcode handler. This is in preparation for passing in the 'sqe' pointer, rather than storing it in the io_kiocb. Once we have uniform prep handlers, we can leave all the prep work to that part, and not even pass in the sqe to the opcode handler. This ensures that we don't reuse sqe data inadvertently.
Signed-off-by: Jens Axboe axboe@kernel.dk Signed-off-by: yangerkun yangerkun@huawei.com Reviewed-by: zhangyi (F) yi.zhang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- fs/io_uring.c | 128 +++++++++++++++++++++++++------------------------- 1 file changed, 63 insertions(+), 65 deletions(-)
diff --git a/fs/io_uring.c b/fs/io_uring.c index 5badcd315eef..05abe7bf6a81 100644 --- a/fs/io_uring.c +++ b/fs/io_uring.c @@ -371,7 +371,6 @@ struct io_async_rw { };
struct io_async_ctx { - struct io_uring_sqe sqe; union { struct io_async_rw rw; struct io_async_msghdr msg; @@ -433,7 +432,6 @@ struct io_kiocb { #define REQ_F_INFLIGHT 16384 /* on inflight list */ #define REQ_F_COMP_LOCKED 32768 /* completion under lock */ #define REQ_F_HARDLINK 65536 /* doesn't sever on completion < 0 */ -#define REQ_F_PREPPED 131072 /* request already opcode prepared */ u64 user_data; u32 result; u32 sequence; @@ -1500,6 +1498,8 @@ static int io_prep_rw(struct io_kiocb *req, bool force_nonblock) unsigned ioprio; int ret;
+ if (!sqe) + return 0; if (!req->file) return -EBADF;
@@ -1551,6 +1551,7 @@ static int io_prep_rw(struct io_kiocb *req, bool force_nonblock) /* we own ->private, reuse it for the buffer index */ req->rw.kiocb.private = (void *) (unsigned long) READ_ONCE(req->sqe->buf_index); + req->sqe = NULL; return 0; }
@@ -1772,13 +1773,7 @@ static void io_req_map_rw(struct io_kiocb *req, ssize_t io_size, static int io_alloc_async_ctx(struct io_kiocb *req) { req->io = kmalloc(sizeof(*req->io), GFP_KERNEL); - if (req->io) { - memcpy(&req->io->sqe, req->sqe, sizeof(req->io->sqe)); - req->sqe = &req->io->sqe; - return 0; - } - - return 1; + return req->io == NULL; }
static void io_rw_async(struct io_wq_work **workptr) @@ -1809,12 +1804,14 @@ static int io_read_prep(struct io_kiocb *req, struct iovec **iovec, { ssize_t ret;
- ret = io_prep_rw(req, force_nonblock); - if (ret) - return ret; + if (req->sqe) { + ret = io_prep_rw(req, force_nonblock); + if (ret) + return ret;
- if (unlikely(!(req->file->f_mode & FMODE_READ))) - return -EBADF; + if (unlikely(!(req->file->f_mode & FMODE_READ))) + return -EBADF; + }
return io_import_iovec(READ, req, iovec, iter); } @@ -1828,15 +1825,9 @@ static int io_read(struct io_kiocb *req, struct io_kiocb **nxt, size_t iov_count; ssize_t io_size, ret;
- if (!req->io) { - ret = io_read_prep(req, &iovec, &iter, force_nonblock); - if (ret < 0) - return ret; - } else { - ret = io_import_iovec(READ, req, &iovec, &iter); - if (ret < 0) - return ret; - } + ret = io_read_prep(req, &iovec, &iter, force_nonblock); + if (ret < 0) + return ret;
/* Ensure we clear previously set non-block flag */ if (!force_nonblock) @@ -1900,12 +1891,14 @@ static int io_write_prep(struct io_kiocb *req, struct iovec **iovec, { ssize_t ret;
- ret = io_prep_rw(req, force_nonblock); - if (ret) - return ret; + if (req->sqe) { + ret = io_prep_rw(req, force_nonblock); + if (ret) + return ret;
- if (unlikely(!(req->file->f_mode & FMODE_WRITE))) - return -EBADF; + if (unlikely(!(req->file->f_mode & FMODE_WRITE))) + return -EBADF; + }
return io_import_iovec(WRITE, req, iovec, iter); } @@ -1919,15 +1912,9 @@ static int io_write(struct io_kiocb *req, struct io_kiocb **nxt, size_t iov_count; ssize_t ret, io_size;
- if (!req->io) { - ret = io_write_prep(req, &iovec, &iter, force_nonblock); - if (ret < 0) - return ret; - } else { - ret = io_import_iovec(WRITE, req, &iovec, &iter); - if (ret < 0) - return ret; - } + ret = io_write_prep(req, &iovec, &iter, force_nonblock); + if (ret < 0) + return ret;
/* Ensure we clear previously set non-block flag */ if (!force_nonblock) @@ -2012,7 +1999,7 @@ static int io_prep_fsync(struct io_kiocb *req) const struct io_uring_sqe *sqe = req->sqe; struct io_ring_ctx *ctx = req->ctx;
- if (req->flags & REQ_F_PREPPED) + if (!req->sqe) return 0; if (!req->file) return -EBADF; @@ -2028,7 +2015,7 @@ static int io_prep_fsync(struct io_kiocb *req)
req->sync.off = READ_ONCE(sqe->off); req->sync.len = READ_ONCE(sqe->len); - req->flags |= REQ_F_PREPPED; + req->sqe = NULL; return 0; }
@@ -2094,7 +2081,7 @@ static int io_prep_sfr(struct io_kiocb *req) const struct io_uring_sqe *sqe = req->sqe; struct io_ring_ctx *ctx = req->ctx;
- if (req->flags & REQ_F_PREPPED) + if (!sqe) return 0; if (!req->file) return -EBADF; @@ -2107,7 +2094,7 @@ static int io_prep_sfr(struct io_kiocb *req) req->sync.off = READ_ONCE(sqe->off); req->sync.len = READ_ONCE(sqe->len); req->sync.flags = READ_ONCE(sqe->sync_range_flags); - req->flags |= REQ_F_PREPPED; + req->sqe = NULL; return 0; }
@@ -2172,12 +2159,17 @@ static int io_sendmsg_prep(struct io_kiocb *req, struct io_async_ctx *io) #if defined(CONFIG_NET) const struct io_uring_sqe *sqe = req->sqe; struct io_sr_msg *sr = &req->sr_msg; + int ret;
+ if (!sqe) + return 0; sr->msg_flags = READ_ONCE(sqe->msg_flags); sr->msg = u64_to_user_ptr(READ_ONCE(sqe->addr)); io->msg.iov = io->msg.fast_iov; - return sendmsg_copy_msghdr(&io->msg.msg, sr->msg, sr->msg_flags, + ret = sendmsg_copy_msghdr(&io->msg.msg, sr->msg, sr->msg_flags, &io->msg.iov); + req->sqe = NULL; + return ret; #else return -EOPNOTSUPP; #endif @@ -2252,12 +2244,18 @@ static int io_recvmsg_prep(struct io_kiocb *req, struct io_async_ctx *io) { #if defined(CONFIG_NET) struct io_sr_msg *sr = &req->sr_msg; + int ret; + + if (!req->sqe) + return 0;
sr->msg_flags = READ_ONCE(req->sqe->msg_flags); sr->msg = u64_to_user_ptr(READ_ONCE(req->sqe->addr)); io->msg.iov = io->msg.fast_iov; - return recvmsg_copy_msghdr(&io->msg.msg, sr->msg, sr->msg_flags, + ret = recvmsg_copy_msghdr(&io->msg.msg, sr->msg, sr->msg_flags, &io->msg.uaddr, &io->msg.iov); + req->sqe = NULL; + return ret; #else return -EOPNOTSUPP; #endif @@ -2335,7 +2333,7 @@ static int io_accept_prep(struct io_kiocb *req) const struct io_uring_sqe *sqe = req->sqe; struct io_accept *accept = &req->accept;
- if (req->flags & REQ_F_PREPPED) + if (!req->sqe) return 0;
if (unlikely(req->ctx->flags & (IORING_SETUP_IOPOLL|IORING_SETUP_SQPOLL))) @@ -2346,7 +2344,7 @@ static int io_accept_prep(struct io_kiocb *req) accept->addr = u64_to_user_ptr(READ_ONCE(sqe->addr)); accept->addr_len = u64_to_user_ptr(READ_ONCE(sqe->addr2)); accept->flags = READ_ONCE(sqe->accept_flags); - req->flags |= REQ_F_PREPPED; + req->sqe = NULL; return 0; #else return -EOPNOTSUPP; @@ -2415,7 +2413,10 @@ static int io_connect_prep(struct io_kiocb *req, struct io_async_ctx *io) { #if defined(CONFIG_NET) const struct io_uring_sqe *sqe = req->sqe; + int ret;
+ if (!sqe) + return 0; if (unlikely(req->ctx->flags & (IORING_SETUP_IOPOLL|IORING_SETUP_SQPOLL))) return -EINVAL; if (sqe->ioprio || sqe->len || sqe->buf_index || sqe->rw_flags) @@ -2423,8 +2424,10 @@ static int io_connect_prep(struct io_kiocb *req, struct io_async_ctx *io)
req->connect.addr = u64_to_user_ptr(READ_ONCE(sqe->addr)); req->connect.addr_len = READ_ONCE(sqe->addr2); - return move_addr_to_kernel(req->connect.addr, req->connect.addr_len, + ret = move_addr_to_kernel(req->connect.addr, req->connect.addr_len, &io->connect.address); + req->sqe = NULL; + return ret; #else return -EOPNOTSUPP; #endif @@ -2525,7 +2528,7 @@ static int io_poll_remove_prep(struct io_kiocb *req) { const struct io_uring_sqe *sqe = req->sqe;
- if (req->flags & REQ_F_PREPPED) + if (!sqe) return 0; if (unlikely(req->ctx->flags & IORING_SETUP_IOPOLL)) return -EINVAL; @@ -2534,7 +2537,7 @@ static int io_poll_remove_prep(struct io_kiocb *req) return -EINVAL;
req->poll.addr = READ_ONCE(sqe->addr); - req->flags |= REQ_F_PREPPED; + req->sqe = NULL; return 0; }
@@ -2695,7 +2698,7 @@ static int io_poll_add_prep(struct io_kiocb *req) struct io_poll_iocb *poll = &req->poll; u16 events;
- if (req->flags & REQ_F_PREPPED) + if (!sqe) return 0; if (unlikely(req->ctx->flags & IORING_SETUP_IOPOLL)) return -EINVAL; @@ -2704,9 +2707,9 @@ static int io_poll_add_prep(struct io_kiocb *req) if (!poll->file) return -EBADF;
- req->flags |= REQ_F_PREPPED; events = READ_ONCE(sqe->poll_events); poll->events = demangle_poll(events) | EPOLLERR | EPOLLHUP; + req->sqe = NULL; return 0; }
@@ -2844,7 +2847,7 @@ static int io_timeout_remove_prep(struct io_kiocb *req) { const struct io_uring_sqe *sqe = req->sqe;
- if (req->flags & REQ_F_PREPPED) + if (!sqe) return 0; if (unlikely(req->ctx->flags & IORING_SETUP_IOPOLL)) return -EINVAL; @@ -2856,7 +2859,7 @@ static int io_timeout_remove_prep(struct io_kiocb *req) if (req->timeout.flags) return -EINVAL;
- req->flags |= REQ_F_PREPPED; + req->sqe = NULL; return 0; }
@@ -2892,6 +2895,8 @@ static int io_timeout_prep(struct io_kiocb *req, struct io_async_ctx *io, struct io_timeout_data *data; unsigned flags;
+ if (!sqe) + return 0; if (unlikely(req->ctx->flags & IORING_SETUP_IOPOLL)) return -EINVAL; if (sqe->ioprio || sqe->buf_index || sqe->len != 1) @@ -2920,6 +2925,7 @@ static int io_timeout_prep(struct io_kiocb *req, struct io_async_ctx *io, data->mode = HRTIMER_MODE_REL;
hrtimer_init(&data->timer, CLOCK_MONOTONIC, data->mode); + req->sqe = NULL; return 0; }
@@ -2932,13 +2938,9 @@ static int io_timeout(struct io_kiocb *req) unsigned span = 0; int ret;
- if (!req->io) { - if (io_alloc_async_ctx(req)) - return -ENOMEM; - ret = io_timeout_prep(req, req->io, false); - if (ret) - return ret; - } + ret = io_timeout_prep(req, req->io, false); + if (ret) + return ret; data = &req->io->timeout;
/* @@ -3068,7 +3070,7 @@ static int io_async_cancel_prep(struct io_kiocb *req) { const struct io_uring_sqe *sqe = req->sqe;
- if (req->flags & REQ_F_PREPPED) + if (!sqe) return 0; if (unlikely(req->ctx->flags & IORING_SETUP_IOPOLL)) return -EINVAL; @@ -3076,8 +3078,8 @@ static int io_async_cancel_prep(struct io_kiocb *req) sqe->cancel_flags) return -EINVAL;
- req->flags |= REQ_F_PREPPED; req->cancel.addr = READ_ONCE(sqe->addr); + req->sqe = NULL; return 0; }
@@ -3212,13 +3214,9 @@ static int io_issue_sqe(struct io_kiocb *req, struct io_kiocb **nxt, ret = io_nop(req); break; case IORING_OP_READV: - if (unlikely(req->sqe->buf_index)) - return -EINVAL; ret = io_read(req, nxt, force_nonblock); break; case IORING_OP_WRITEV: - if (unlikely(req->sqe->buf_index)) - return -EINVAL; ret = io_write(req, nxt, force_nonblock); break; case IORING_OP_READ_FIXED:
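The shape this series converges on can be sketched in a few lines (opcode names and structs simplified; not the real table): one switch routes the still-valid SQE to a per-opcode prep handler that copies out everything it needs, and the issue path then runs purely off the request, never touching the SQE.

#include <stdint.h>
#include <stdio.h>

struct sqe { uint8_t opcode; uint64_t addr; };
struct req { uint8_t opcode; uint64_t addr; };

enum { OP_NOP, OP_ASYNC_CANCEL };

static int cancel_prep(struct req *r, const struct sqe *s)
{
	r->addr = s->addr;		/* copied once, up front */
	return 0;
}

static int req_prep(struct req *r, const struct sqe *s)
{
	switch (r->opcode) {
	case OP_NOP:		return 0;
	case OP_ASYNC_CANCEL:	return cancel_prep(r, s);
	default:		return -22;	/* -EINVAL */
	}
}

static int req_issue(const struct req *r)	/* note: no sqe parameter */
{
	switch (r->opcode) {
	case OP_NOP:		return 0;
	case OP_ASYNC_CANCEL:	return r->addr ? 0 : -2;
	default:		return -22;
	}
}

int main(void)
{
	struct sqe s = { .opcode = OP_ASYNC_CANCEL, .addr = 0xdead };
	struct req r = { .opcode = s.opcode };

	if (!req_prep(&r, &s))
		printf("issue=%d\n", req_issue(&r));
	return 0;
}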
From: Jens Axboe axboe@kernel.dk
mainline inclusion from mainline-5.5-rc4 commit 3529d8c2b353e6e446277ae96a36e7471cb070fc category: feature bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27 CVE: NA ---------------------------
This moves the prep handlers outside of the opcode handlers, and allows us to pass in the sqe directly. If the sqe is non-NULL, it means that the request should be prepared for the first time.
With the opcode handlers not having access to the sqe at all, we are guaranteed that the prep handler has set up the request fully by the time we get there. As before, for opcodes that need to copy in more data than the io_kiocb allows for, the io_async_ctx holds that info. If a prep handler is invoked with req->io set, it must use that to retain information for later.
Finally, we can remove io_kiocb->sqe as well.
Signed-off-by: Jens Axboe axboe@kernel.dk Signed-off-by: yangerkun yangerkun@huawei.com Reviewed-by: zhangyi (F) yi.zhang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- fs/io_uring.c | 493 +++++++++++++++++++++++++------------------------- 1 file changed, 251 insertions(+), 242 deletions(-)
diff --git a/fs/io_uring.c b/fs/io_uring.c index 05abe7bf6a81..8b4faa21e2f1 100644 --- a/fs/io_uring.c +++ b/fs/io_uring.c @@ -398,7 +398,6 @@ struct io_kiocb { struct io_sr_msg sr_msg; };
- const struct io_uring_sqe *sqe; struct io_async_ctx *io; struct file *ring_file; int ring_fd; @@ -628,33 +627,31 @@ static inline bool io_prep_async_work(struct io_kiocb *req, { bool do_hashed = false;
- if (req->sqe) { - switch (req->opcode) { - case IORING_OP_WRITEV: - case IORING_OP_WRITE_FIXED: - /* only regular files should be hashed for writes */ - if (req->flags & REQ_F_ISREG) - do_hashed = true; - /* fall-through */ - case IORING_OP_READV: - case IORING_OP_READ_FIXED: - case IORING_OP_SENDMSG: - case IORING_OP_RECVMSG: - case IORING_OP_ACCEPT: - case IORING_OP_POLL_ADD: - case IORING_OP_CONNECT: - /* - * We know REQ_F_ISREG is not set on some of these - * opcodes, but this enables us to keep the check in - * just one place. - */ - if (!(req->flags & REQ_F_ISREG)) - req->work.flags |= IO_WQ_WORK_UNBOUND; - break; - } - if (io_req_needs_user(req)) - req->work.flags |= IO_WQ_WORK_NEEDS_USER; + switch (req->opcode) { + case IORING_OP_WRITEV: + case IORING_OP_WRITE_FIXED: + /* only regular files should be hashed for writes */ + if (req->flags & REQ_F_ISREG) + do_hashed = true; + /* fall-through */ + case IORING_OP_READV: + case IORING_OP_READ_FIXED: + case IORING_OP_SENDMSG: + case IORING_OP_RECVMSG: + case IORING_OP_ACCEPT: + case IORING_OP_POLL_ADD: + case IORING_OP_CONNECT: + /* + * We know REQ_F_ISREG is not set on some of these + * opcodes, but this enables us to keep the check in + * just one place. + */ + if (!(req->flags & REQ_F_ISREG)) + req->work.flags |= IO_WQ_WORK_UNBOUND; + break; } + if (io_req_needs_user(req)) + req->work.flags |= IO_WQ_WORK_NEEDS_USER;
*link = io_prep_linked_timeout(req); return do_hashed; @@ -1490,16 +1487,14 @@ static bool io_file_supports_async(struct file *file) return false; }
-static int io_prep_rw(struct io_kiocb *req, bool force_nonblock) +static int io_prep_rw(struct io_kiocb *req, const struct io_uring_sqe *sqe, + bool force_nonblock) { - const struct io_uring_sqe *sqe = req->sqe; struct io_ring_ctx *ctx = req->ctx; struct kiocb *kiocb = &req->rw.kiocb; unsigned ioprio; int ret;
- if (!sqe) - return 0; if (!req->file) return -EBADF;
@@ -1546,12 +1541,11 @@ static int io_prep_rw(struct io_kiocb *req, bool force_nonblock) kiocb->ki_complete = io_complete_rw; }
- req->rw.addr = READ_ONCE(req->sqe->addr); - req->rw.len = READ_ONCE(req->sqe->len); + req->rw.addr = READ_ONCE(sqe->addr); + req->rw.len = READ_ONCE(sqe->len); /* we own ->private, reuse it for the buffer index */ req->rw.kiocb.private = (void *) (unsigned long) - READ_ONCE(req->sqe->buf_index); - req->sqe = NULL; + READ_ONCE(sqe->buf_index); return 0; }
@@ -1799,21 +1793,33 @@ static int io_setup_async_rw(struct io_kiocb *req, ssize_t io_size, return 0; }
-static int io_read_prep(struct io_kiocb *req, struct iovec **iovec, - struct iov_iter *iter, bool force_nonblock) +static int io_read_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe, + bool force_nonblock) { + struct io_async_ctx *io; + struct iov_iter iter; ssize_t ret;
- if (req->sqe) { - ret = io_prep_rw(req, force_nonblock); - if (ret) - return ret; + ret = io_prep_rw(req, sqe, force_nonblock); + if (ret) + return ret;
- if (unlikely(!(req->file->f_mode & FMODE_READ))) - return -EBADF; - } + if (unlikely(!(req->file->f_mode & FMODE_READ))) + return -EBADF;
- return io_import_iovec(READ, req, iovec, iter); + if (!req->io) + return 0; + + io = req->io; + io->rw.iov = io->rw.fast_iov; + req->io = NULL; + ret = io_import_iovec(READ, req, &io->rw.iov, &iter); + req->io = io; + if (ret < 0) + return ret; + + io_req_map_rw(req, ret, io->rw.iov, io->rw.fast_iov, &iter); + return 0; }
static int io_read(struct io_kiocb *req, struct io_kiocb **nxt, @@ -1825,7 +1831,7 @@ static int io_read(struct io_kiocb *req, struct io_kiocb **nxt, size_t iov_count; ssize_t io_size, ret;
- ret = io_read_prep(req, &iovec, &iter, force_nonblock); + ret = io_import_iovec(READ, req, &iovec, &iter); if (ret < 0) return ret;
@@ -1886,21 +1892,33 @@ static int io_read(struct io_kiocb *req, struct io_kiocb **nxt, return ret; }
-static int io_write_prep(struct io_kiocb *req, struct iovec **iovec, - struct iov_iter *iter, bool force_nonblock) +static int io_write_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe, + bool force_nonblock) { + struct io_async_ctx *io; + struct iov_iter iter; ssize_t ret;
- if (req->sqe) { - ret = io_prep_rw(req, force_nonblock); - if (ret) - return ret; + ret = io_prep_rw(req, sqe, force_nonblock); + if (ret) + return ret;
- if (unlikely(!(req->file->f_mode & FMODE_WRITE))) - return -EBADF; - } + if (unlikely(!(req->file->f_mode & FMODE_WRITE))) + return -EBADF;
- return io_import_iovec(WRITE, req, iovec, iter); + if (!req->io) + return 0; + + io = req->io; + io->rw.iov = io->rw.fast_iov; + req->io = NULL; + ret = io_import_iovec(WRITE, req, &io->rw.iov, &iter); + req->io = io; + if (ret < 0) + return ret; + + io_req_map_rw(req, ret, io->rw.iov, io->rw.fast_iov, &iter); + return 0; }
static int io_write(struct io_kiocb *req, struct io_kiocb **nxt, @@ -1912,7 +1930,7 @@ static int io_write(struct io_kiocb *req, struct io_kiocb **nxt, size_t iov_count; ssize_t ret, io_size;
- ret = io_write_prep(req, &iovec, &iter, force_nonblock); + ret = io_import_iovec(WRITE, req, &iovec, &iter); if (ret < 0) return ret;
@@ -1994,13 +2012,10 @@ static int io_nop(struct io_kiocb *req) return 0; }
-static int io_prep_fsync(struct io_kiocb *req) +static int io_prep_fsync(struct io_kiocb *req, const struct io_uring_sqe *sqe) { - const struct io_uring_sqe *sqe = req->sqe; struct io_ring_ctx *ctx = req->ctx;
- if (!req->sqe) - return 0; if (!req->file) return -EBADF;
@@ -2015,7 +2030,6 @@ static int io_prep_fsync(struct io_kiocb *req)
req->sync.off = READ_ONCE(sqe->off); req->sync.len = READ_ONCE(sqe->len); - req->sqe = NULL; return 0; }
@@ -2056,11 +2070,6 @@ static int io_fsync(struct io_kiocb *req, struct io_kiocb **nxt, bool force_nonblock) { struct io_wq_work *work, *old_work; - int ret; - - ret = io_prep_fsync(req); - if (ret) - return ret;
/* fsync always requires a blocking context */ if (force_nonblock) { @@ -2076,13 +2085,10 @@ static int io_fsync(struct io_kiocb *req, struct io_kiocb **nxt, return 0; }
-static int io_prep_sfr(struct io_kiocb *req) +static int io_prep_sfr(struct io_kiocb *req, const struct io_uring_sqe *sqe) { - const struct io_uring_sqe *sqe = req->sqe; struct io_ring_ctx *ctx = req->ctx;
- if (!sqe) - return 0; if (!req->file) return -EBADF;
@@ -2094,7 +2100,6 @@ static int io_prep_sfr(struct io_kiocb *req) req->sync.off = READ_ONCE(sqe->off); req->sync.len = READ_ONCE(sqe->len); req->sync.flags = READ_ONCE(sqe->sync_range_flags); - req->sqe = NULL; return 0; }
@@ -2121,11 +2126,6 @@ static int io_sync_file_range(struct io_kiocb *req, struct io_kiocb **nxt, bool force_nonblock) { struct io_wq_work *work, *old_work; - int ret; - - ret = io_prep_sfr(req); - if (ret) - return ret;
/* sync_file_range always requires a blocking context */ if (force_nonblock) { @@ -2154,22 +2154,21 @@ static void io_sendrecv_async(struct io_wq_work **workptr) } #endif
-static int io_sendmsg_prep(struct io_kiocb *req, struct io_async_ctx *io) +static int io_sendmsg_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe) { #if defined(CONFIG_NET) - const struct io_uring_sqe *sqe = req->sqe; struct io_sr_msg *sr = &req->sr_msg; - int ret; + struct io_async_ctx *io = req->io;
- if (!sqe) - return 0; sr->msg_flags = READ_ONCE(sqe->msg_flags); sr->msg = u64_to_user_ptr(READ_ONCE(sqe->addr)); + + if (!io) + return 0; + io->msg.iov = io->msg.fast_iov; - ret = sendmsg_copy_msghdr(&io->msg.msg, sr->msg, sr->msg_flags, + return sendmsg_copy_msghdr(&io->msg.msg, sr->msg, sr->msg_flags, &io->msg.iov); - req->sqe = NULL; - return ret; #else return -EOPNOTSUPP; #endif @@ -2200,11 +2199,16 @@ static int io_sendmsg(struct io_kiocb *req, struct io_kiocb **nxt, kmsg->iov = kmsg->fast_iov; kmsg->msg.msg_iter.iov = kmsg->iov; } else { + struct io_sr_msg *sr = &req->sr_msg; + kmsg = &io.msg; kmsg->msg.msg_name = &addr; - ret = io_sendmsg_prep(req, &io); + + io.msg.iov = io.msg.fast_iov; + ret = sendmsg_copy_msghdr(&io.msg.msg, sr->msg, + sr->msg_flags, &io.msg.iov); if (ret) - goto out; + return ret; }
flags = req->sr_msg.msg_flags; @@ -2227,7 +2231,6 @@ static int io_sendmsg(struct io_kiocb *req, struct io_kiocb **nxt, ret = -EINTR; }
-out: if (!io_wq_current_is_worker() && kmsg && kmsg->iov != kmsg->fast_iov) kfree(kmsg->iov); io_cqring_add_event(req, ret); @@ -2240,22 +2243,22 @@ static int io_sendmsg(struct io_kiocb *req, struct io_kiocb **nxt, #endif }
-static int io_recvmsg_prep(struct io_kiocb *req, struct io_async_ctx *io) +static int io_recvmsg_prep(struct io_kiocb *req, + const struct io_uring_sqe *sqe) { #if defined(CONFIG_NET) struct io_sr_msg *sr = &req->sr_msg; - int ret; + struct io_async_ctx *io = req->io; + + sr->msg_flags = READ_ONCE(sqe->msg_flags); + sr->msg = u64_to_user_ptr(READ_ONCE(sqe->addr));
- if (!req->sqe) + if (!io) return 0;
- sr->msg_flags = READ_ONCE(req->sqe->msg_flags); - sr->msg = u64_to_user_ptr(READ_ONCE(req->sqe->addr)); io->msg.iov = io->msg.fast_iov; - ret = recvmsg_copy_msghdr(&io->msg.msg, sr->msg, sr->msg_flags, + return recvmsg_copy_msghdr(&io->msg.msg, sr->msg, sr->msg_flags, &io->msg.uaddr, &io->msg.iov); - req->sqe = NULL; - return ret; #else return -EOPNOTSUPP; #endif @@ -2286,11 +2289,17 @@ static int io_recvmsg(struct io_kiocb *req, struct io_kiocb **nxt, kmsg->iov = kmsg->fast_iov; kmsg->msg.msg_iter.iov = kmsg->iov; } else { + struct io_sr_msg *sr = &req->sr_msg; + kmsg = &io.msg; kmsg->msg.msg_name = &addr; - ret = io_recvmsg_prep(req, &io); + + io.msg.iov = io.msg.fast_iov; + ret = recvmsg_copy_msghdr(&io.msg.msg, sr->msg, + sr->msg_flags, &io.msg.uaddr, + &io.msg.iov); if (ret) - goto out; + return ret; }
flags = req->sr_msg.msg_flags; @@ -2314,7 +2323,6 @@ static int io_recvmsg(struct io_kiocb *req, struct io_kiocb **nxt, ret = -EINTR; }
-out: if (!io_wq_current_is_worker() && kmsg && kmsg->iov != kmsg->fast_iov) kfree(kmsg->iov); io_cqring_add_event(req, ret); @@ -2327,15 +2335,11 @@ static int io_recvmsg(struct io_kiocb *req, struct io_kiocb **nxt, #endif }
-static int io_accept_prep(struct io_kiocb *req) +static int io_accept_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe) { #if defined(CONFIG_NET) - const struct io_uring_sqe *sqe = req->sqe; struct io_accept *accept = &req->accept;
- if (!req->sqe) - return 0; - if (unlikely(req->ctx->flags & (IORING_SETUP_IOPOLL|IORING_SETUP_SQPOLL))) return -EINVAL; if (sqe->ioprio || sqe->len || sqe->buf_index) @@ -2344,7 +2348,6 @@ static int io_accept_prep(struct io_kiocb *req) accept->addr = u64_to_user_ptr(READ_ONCE(sqe->addr)); accept->addr_len = u64_to_user_ptr(READ_ONCE(sqe->addr2)); accept->flags = READ_ONCE(sqe->accept_flags); - req->sqe = NULL; return 0; #else return -EOPNOTSUPP; @@ -2392,10 +2395,6 @@ static int io_accept(struct io_kiocb *req, struct io_kiocb **nxt, #if defined(CONFIG_NET) int ret;
- ret = io_accept_prep(req); - if (ret) - return ret; - ret = __io_accept(req, nxt, force_nonblock); if (ret == -EAGAIN && force_nonblock) { req->work.func = io_accept_finish; @@ -2409,25 +2408,25 @@ static int io_accept(struct io_kiocb *req, struct io_kiocb **nxt, #endif }
-static int io_connect_prep(struct io_kiocb *req, struct io_async_ctx *io) +static int io_connect_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe) { #if defined(CONFIG_NET) - const struct io_uring_sqe *sqe = req->sqe; - int ret; + struct io_connect *conn = &req->connect; + struct io_async_ctx *io = req->io;
- if (!sqe) - return 0; if (unlikely(req->ctx->flags & (IORING_SETUP_IOPOLL|IORING_SETUP_SQPOLL))) return -EINVAL; if (sqe->ioprio || sqe->len || sqe->buf_index || sqe->rw_flags) return -EINVAL;
- req->connect.addr = u64_to_user_ptr(READ_ONCE(sqe->addr)); - req->connect.addr_len = READ_ONCE(sqe->addr2); - ret = move_addr_to_kernel(req->connect.addr, req->connect.addr_len, + conn->addr = u64_to_user_ptr(READ_ONCE(sqe->addr)); + conn->addr_len = READ_ONCE(sqe->addr2); + + if (!io) + return 0; + + return move_addr_to_kernel(conn->addr, conn->addr_len, &io->connect.address); - req->sqe = NULL; - return ret; #else return -EOPNOTSUPP; #endif @@ -2444,7 +2443,9 @@ static int io_connect(struct io_kiocb *req, struct io_kiocb **nxt, if (req->io) { io = req->io; } else { - ret = io_connect_prep(req, &__io); + ret = move_addr_to_kernel(req->connect.addr, + req->connect.addr_len, + &__io.connect.address); if (ret) goto out; io = &__io; @@ -2524,12 +2525,9 @@ static int io_poll_cancel(struct io_ring_ctx *ctx, __u64 sqe_addr) return -ENOENT; }
-static int io_poll_remove_prep(struct io_kiocb *req) +static int io_poll_remove_prep(struct io_kiocb *req, + const struct io_uring_sqe *sqe) { - const struct io_uring_sqe *sqe = req->sqe; - - if (!sqe) - return 0; if (unlikely(req->ctx->flags & IORING_SETUP_IOPOLL)) return -EINVAL; if (sqe->ioprio || sqe->off || sqe->len || sqe->buf_index || @@ -2537,7 +2535,6 @@ static int io_poll_remove_prep(struct io_kiocb *req) return -EINVAL;
req->poll.addr = READ_ONCE(sqe->addr); - req->sqe = NULL; return 0; }
@@ -2551,10 +2548,6 @@ static int io_poll_remove(struct io_kiocb *req) u64 addr; int ret;
- ret = io_poll_remove_prep(req); - if (ret) - return ret; - addr = req->poll.addr; spin_lock_irq(&ctx->completion_lock); ret = io_poll_cancel(ctx, addr); @@ -2692,14 +2685,11 @@ static void io_poll_req_insert(struct io_kiocb *req) hlist_add_head(&req->hash_node, list); }
-static int io_poll_add_prep(struct io_kiocb *req) +static int io_poll_add_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe) { - const struct io_uring_sqe *sqe = req->sqe; struct io_poll_iocb *poll = &req->poll; u16 events;
- if (!sqe) - return 0; if (unlikely(req->ctx->flags & IORING_SETUP_IOPOLL)) return -EINVAL; if (sqe->addr || sqe->ioprio || sqe->off || sqe->len || sqe->buf_index) @@ -2709,7 +2699,6 @@ static int io_poll_add_prep(struct io_kiocb *req)
events = READ_ONCE(sqe->poll_events); poll->events = demangle_poll(events) | EPOLLERR | EPOLLHUP; - req->sqe = NULL; return 0; }
@@ -2720,11 +2709,6 @@ static int io_poll_add(struct io_kiocb *req, struct io_kiocb **nxt) struct io_poll_table ipt; bool cancel = false; __poll_t mask; - int ret; - - ret = io_poll_add_prep(req); - if (ret) - return ret;
INIT_IO_WORK(&req->work, io_poll_complete_work); INIT_HLIST_NODE(&req->hash_node); @@ -2843,12 +2827,9 @@ static int io_timeout_cancel(struct io_ring_ctx *ctx, __u64 user_data) return 0; }
-static int io_timeout_remove_prep(struct io_kiocb *req) +static int io_timeout_remove_prep(struct io_kiocb *req, + const struct io_uring_sqe *sqe) { - const struct io_uring_sqe *sqe = req->sqe; - - if (!sqe) - return 0; if (unlikely(req->ctx->flags & IORING_SETUP_IOPOLL)) return -EINVAL; if (sqe->flags || sqe->ioprio || sqe->buf_index || sqe->len) @@ -2859,7 +2840,6 @@ static int io_timeout_remove_prep(struct io_kiocb *req) if (req->timeout.flags) return -EINVAL;
- req->sqe = NULL; return 0; }
@@ -2871,10 +2851,6 @@ static int io_timeout_remove(struct io_kiocb *req) struct io_ring_ctx *ctx = req->ctx; int ret;
- ret = io_timeout_remove_prep(req); - if (ret) - return ret; - spin_lock_irq(&ctx->completion_lock); ret = io_timeout_cancel(ctx, req->timeout.addr);
@@ -2888,15 +2864,12 @@ static int io_timeout_remove(struct io_kiocb *req) return 0; }
-static int io_timeout_prep(struct io_kiocb *req, struct io_async_ctx *io, +static int io_timeout_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe, bool is_timeout_link) { - const struct io_uring_sqe *sqe = req->sqe; struct io_timeout_data *data; unsigned flags;
- if (!sqe) - return 0; if (unlikely(req->ctx->flags & IORING_SETUP_IOPOLL)) return -EINVAL; if (sqe->ioprio || sqe->buf_index || sqe->len != 1) @@ -2909,7 +2882,7 @@ static int io_timeout_prep(struct io_kiocb *req, struct io_async_ctx *io,
req->timeout.count = READ_ONCE(sqe->off);
- if (!io && io_alloc_async_ctx(req)) + if (!req->io && io_alloc_async_ctx(req)) return -ENOMEM;
data = &req->io->timeout; @@ -2925,7 +2898,6 @@ static int io_timeout_prep(struct io_kiocb *req, struct io_async_ctx *io, data->mode = HRTIMER_MODE_REL;
hrtimer_init(&data->timer, CLOCK_MONOTONIC, data->mode); - req->sqe = NULL; return 0; }
@@ -2936,11 +2908,7 @@ static int io_timeout(struct io_kiocb *req) struct io_timeout_data *data; struct list_head *entry; unsigned span = 0; - int ret;
- ret = io_timeout_prep(req, req->io, false); - if (ret) - return ret; data = &req->io->timeout;
/* @@ -3066,12 +3034,9 @@ static void io_async_find_and_cancel(struct io_ring_ctx *ctx, io_put_req_find_next(req, nxt); }
-static int io_async_cancel_prep(struct io_kiocb *req) +static int io_async_cancel_prep(struct io_kiocb *req, + const struct io_uring_sqe *sqe) { - const struct io_uring_sqe *sqe = req->sqe; - - if (!sqe) - return 0; if (unlikely(req->ctx->flags & IORING_SETUP_IOPOLL)) return -EINVAL; if (sqe->flags || sqe->ioprio || sqe->off || sqe->len || @@ -3079,28 +3044,20 @@ static int io_async_cancel_prep(struct io_kiocb *req) return -EINVAL;
req->cancel.addr = READ_ONCE(sqe->addr); - req->sqe = NULL; return 0; }
static int io_async_cancel(struct io_kiocb *req, struct io_kiocb **nxt) { struct io_ring_ctx *ctx = req->ctx; - int ret; - - ret = io_async_cancel_prep(req); - if (ret) - return ret;
io_async_find_and_cancel(ctx, req, req->cancel.addr, nxt, 0); return 0; }
-static int io_req_defer_prep(struct io_kiocb *req) +static int io_req_defer_prep(struct io_kiocb *req, + const struct io_uring_sqe *sqe) { - struct iovec inline_vecs[UIO_FASTIOV], *iovec = inline_vecs; - struct io_async_ctx *io = req->io; - struct iov_iter iter; ssize_t ret = 0;
switch (req->opcode) { @@ -3108,61 +3065,47 @@ static int io_req_defer_prep(struct io_kiocb *req) break; case IORING_OP_READV: case IORING_OP_READ_FIXED: - /* ensure prep does right import */ - req->io = NULL; - ret = io_read_prep(req, &iovec, &iter, true); - req->io = io; - if (ret < 0) - break; - io_req_map_rw(req, ret, iovec, inline_vecs, &iter); - ret = 0; + ret = io_read_prep(req, sqe, true); break; case IORING_OP_WRITEV: case IORING_OP_WRITE_FIXED: - /* ensure prep does right import */ - req->io = NULL; - ret = io_write_prep(req, &iovec, &iter, true); - req->io = io; - if (ret < 0) - break; - io_req_map_rw(req, ret, iovec, inline_vecs, &iter); - ret = 0; + ret = io_write_prep(req, sqe, true); break; case IORING_OP_POLL_ADD: - ret = io_poll_add_prep(req); + ret = io_poll_add_prep(req, sqe); break; case IORING_OP_POLL_REMOVE: - ret = io_poll_remove_prep(req); + ret = io_poll_remove_prep(req, sqe); break; case IORING_OP_FSYNC: - ret = io_prep_fsync(req); + ret = io_prep_fsync(req, sqe); break; case IORING_OP_SYNC_FILE_RANGE: - ret = io_prep_sfr(req); + ret = io_prep_sfr(req, sqe); break; case IORING_OP_SENDMSG: - ret = io_sendmsg_prep(req, io); + ret = io_sendmsg_prep(req, sqe); break; case IORING_OP_RECVMSG: - ret = io_recvmsg_prep(req, io); + ret = io_recvmsg_prep(req, sqe); break; case IORING_OP_CONNECT: - ret = io_connect_prep(req, io); + ret = io_connect_prep(req, sqe); break; case IORING_OP_TIMEOUT: - ret = io_timeout_prep(req, io, false); + ret = io_timeout_prep(req, sqe, false); break; case IORING_OP_TIMEOUT_REMOVE: - ret = io_timeout_remove_prep(req); + ret = io_timeout_remove_prep(req, sqe); break; case IORING_OP_ASYNC_CANCEL: - ret = io_async_cancel_prep(req); + ret = io_async_cancel_prep(req, sqe); break; case IORING_OP_LINK_TIMEOUT: - ret = io_timeout_prep(req, io, true); + ret = io_timeout_prep(req, sqe, true); break; case IORING_OP_ACCEPT: - ret = io_accept_prep(req); + ret = io_accept_prep(req, sqe); break; default: printk_once(KERN_WARNING "io_uring: unhandled opcode %d\n", @@ -3174,7 +3117,7 @@ static int io_req_defer_prep(struct io_kiocb *req) return ret; }
-static int io_req_defer(struct io_kiocb *req) +static int io_req_defer(struct io_kiocb *req, const struct io_uring_sqe *sqe) { struct io_ring_ctx *ctx = req->ctx; int ret; @@ -3183,10 +3126,10 @@ static int io_req_defer(struct io_kiocb *req) if (!req_need_defer(req) && list_empty(&ctx->defer_list)) return 0;
- if (io_alloc_async_ctx(req)) + if (!req->io && io_alloc_async_ctx(req)) return -EAGAIN;
- ret = io_req_defer_prep(req); + ret = io_req_defer_prep(req, sqe); if (ret < 0) return ret;
@@ -3202,9 +3145,8 @@ static int io_req_defer(struct io_kiocb *req) return -EIOCBQUEUED; }
-__attribute__((nonnull)) -static int io_issue_sqe(struct io_kiocb *req, struct io_kiocb **nxt, - bool force_nonblock) +static int io_issue_sqe(struct io_kiocb *req, const struct io_uring_sqe *sqe, + struct io_kiocb **nxt, bool force_nonblock) { struct io_ring_ctx *ctx = req->ctx; int ret; @@ -3214,48 +3156,109 @@ static int io_issue_sqe(struct io_kiocb *req, struct io_kiocb **nxt, ret = io_nop(req); break; case IORING_OP_READV: - ret = io_read(req, nxt, force_nonblock); - break; - case IORING_OP_WRITEV: - ret = io_write(req, nxt, force_nonblock); - break; case IORING_OP_READ_FIXED: + if (sqe) { + ret = io_read_prep(req, sqe, force_nonblock); + if (ret < 0) + break; + } ret = io_read(req, nxt, force_nonblock); break; + case IORING_OP_WRITEV: case IORING_OP_WRITE_FIXED: + if (sqe) { + ret = io_write_prep(req, sqe, force_nonblock); + if (ret < 0) + break; + } ret = io_write(req, nxt, force_nonblock); break; case IORING_OP_FSYNC: + if (sqe) { + ret = io_prep_fsync(req, sqe); + if (ret < 0) + break; + } ret = io_fsync(req, nxt, force_nonblock); break; case IORING_OP_POLL_ADD: + if (sqe) { + ret = io_poll_add_prep(req, sqe); + if (ret) + break; + } ret = io_poll_add(req, nxt); break; case IORING_OP_POLL_REMOVE: + if (sqe) { + ret = io_poll_remove_prep(req, sqe); + if (ret < 0) + break; + } ret = io_poll_remove(req); break; case IORING_OP_SYNC_FILE_RANGE: + if (sqe) { + ret = io_prep_sfr(req, sqe); + if (ret < 0) + break; + } ret = io_sync_file_range(req, nxt, force_nonblock); break; case IORING_OP_SENDMSG: + if (sqe) { + ret = io_sendmsg_prep(req, sqe); + if (ret < 0) + break; + } ret = io_sendmsg(req, nxt, force_nonblock); break; case IORING_OP_RECVMSG: + if (sqe) { + ret = io_recvmsg_prep(req, sqe); + if (ret) + break; + } ret = io_recvmsg(req, nxt, force_nonblock); break; case IORING_OP_TIMEOUT: + if (sqe) { + ret = io_timeout_prep(req, sqe, false); + if (ret) + break; + } ret = io_timeout(req); break; case IORING_OP_TIMEOUT_REMOVE: + if (sqe) { + ret = io_timeout_remove_prep(req, sqe); + if (ret) + break; + } ret = io_timeout_remove(req); break; case IORING_OP_ACCEPT: + if (sqe) { + ret = io_accept_prep(req, sqe); + if (ret) + break; + } ret = io_accept(req, nxt, force_nonblock); break; case IORING_OP_CONNECT: + if (sqe) { + ret = io_connect_prep(req, sqe); + if (ret) + break; + } ret = io_connect(req, nxt, force_nonblock); break; case IORING_OP_ASYNC_CANCEL: + if (sqe) { + ret = io_async_cancel_prep(req, sqe); + if (ret) + break; + } ret = io_async_cancel(req, nxt); break; default: @@ -3299,7 +3302,7 @@ static void io_wq_submit_work(struct io_wq_work **workptr) req->has_user = (work->flags & IO_WQ_WORK_HAS_MM) != 0; req->in_async = true; do { - ret = io_issue_sqe(req, &nxt, false); + ret = io_issue_sqe(req, NULL, &nxt, false); /* * We can get EAGAIN for polled IO even though we're * forcing a sync submission from here, since we can't @@ -3365,14 +3368,15 @@ static inline struct file *io_file_from_index(struct io_ring_ctx *ctx, return table->files[index & IORING_FILE_TABLE_MASK]; }
-static int io_req_set_file(struct io_submit_state *state, struct io_kiocb *req) +static int io_req_set_file(struct io_submit_state *state, struct io_kiocb *req, + const struct io_uring_sqe *sqe) { struct io_ring_ctx *ctx = req->ctx; unsigned flags; int fd, ret;
- flags = READ_ONCE(req->sqe->flags); - fd = READ_ONCE(req->sqe->fd); + flags = READ_ONCE(sqe->flags); + fd = READ_ONCE(sqe->fd);
if (flags & IOSQE_IO_DRAIN) req->flags |= REQ_F_IO_DRAIN; @@ -3504,7 +3508,7 @@ static struct io_kiocb *io_prep_linked_timeout(struct io_kiocb *req) return nxt; }
-static void __io_queue_sqe(struct io_kiocb *req) +static void __io_queue_sqe(struct io_kiocb *req, const struct io_uring_sqe *sqe) { struct io_kiocb *linked_timeout; struct io_kiocb *nxt = NULL; @@ -3513,7 +3517,7 @@ static void __io_queue_sqe(struct io_kiocb *req) again: linked_timeout = io_prep_linked_timeout(req);
- ret = io_issue_sqe(req, &nxt, true); + ret = io_issue_sqe(req, sqe, &nxt, true);
/* * We async punt it if the file wasn't marked NOWAIT, or if the file @@ -3560,7 +3564,7 @@ static void __io_queue_sqe(struct io_kiocb *req) } }
-static void io_queue_sqe(struct io_kiocb *req) +static void io_queue_sqe(struct io_kiocb *req, const struct io_uring_sqe *sqe) { int ret;
@@ -3570,7 +3574,7 @@ static void io_queue_sqe(struct io_kiocb *req) } req->ctx->drain_next = (req->flags & REQ_F_DRAIN_LINK);
- ret = io_req_defer(req); + ret = io_req_defer(req, sqe); if (ret) { if (ret != -EIOCBQUEUED) { io_cqring_add_event(req, ret); @@ -3578,7 +3582,7 @@ static void io_queue_sqe(struct io_kiocb *req) io_double_put_req(req); } } else - __io_queue_sqe(req); + __io_queue_sqe(req, sqe); }
static inline void io_queue_link_head(struct io_kiocb *req) @@ -3587,25 +3591,25 @@ static inline void io_queue_link_head(struct io_kiocb *req) io_cqring_add_event(req, -ECANCELED); io_double_put_req(req); } else - io_queue_sqe(req); + io_queue_sqe(req, NULL); }
#define SQE_VALID_FLAGS (IOSQE_FIXED_FILE|IOSQE_IO_DRAIN|IOSQE_IO_LINK| \ IOSQE_IO_HARDLINK)
-static bool io_submit_sqe(struct io_kiocb *req, struct io_submit_state *state, - struct io_kiocb **link) +static bool io_submit_sqe(struct io_kiocb *req, const struct io_uring_sqe *sqe, + struct io_submit_state *state, struct io_kiocb **link) { struct io_ring_ctx *ctx = req->ctx; int ret;
/* enforce forwards compatibility on users */ - if (unlikely(req->sqe->flags & ~SQE_VALID_FLAGS)) { + if (unlikely(sqe->flags & ~SQE_VALID_FLAGS)) { ret = -EINVAL; goto err_req; }
- ret = io_req_set_file(state, req); + ret = io_req_set_file(state, req, sqe); if (unlikely(ret)) { err_req: io_cqring_add_event(req, ret); @@ -3623,10 +3627,10 @@ static bool io_submit_sqe(struct io_kiocb *req, struct io_submit_state *state, if (*link) { struct io_kiocb *prev = *link;
- if (req->sqe->flags & IOSQE_IO_DRAIN) + if (sqe->flags & IOSQE_IO_DRAIN) (*link)->flags |= REQ_F_DRAIN_LINK | REQ_F_IO_DRAIN;
- if (req->sqe->flags & IOSQE_IO_HARDLINK) + if (sqe->flags & IOSQE_IO_HARDLINK) req->flags |= REQ_F_HARDLINK;
if (io_alloc_async_ctx(req)) { @@ -3634,7 +3638,7 @@ static bool io_submit_sqe(struct io_kiocb *req, struct io_submit_state *state, goto err_req; }
- ret = io_req_defer_prep(req); + ret = io_req_defer_prep(req, sqe); if (ret) { /* fail even hard links since we don't submit */ prev->flags |= REQ_F_FAIL_LINK; @@ -3642,15 +3646,18 @@ static bool io_submit_sqe(struct io_kiocb *req, struct io_submit_state *state, } trace_io_uring_link(ctx, req, prev); list_add_tail(&req->link_list, &prev->link_list); - } else if (req->sqe->flags & (IOSQE_IO_LINK|IOSQE_IO_HARDLINK)) { + } else if (sqe->flags & (IOSQE_IO_LINK|IOSQE_IO_HARDLINK)) { req->flags |= REQ_F_LINK; - if (req->sqe->flags & IOSQE_IO_HARDLINK) + if (sqe->flags & IOSQE_IO_HARDLINK) req->flags |= REQ_F_HARDLINK;
INIT_LIST_HEAD(&req->link_list); + ret = io_req_defer_prep(req, sqe); + if (ret) + req->flags |= REQ_F_FAIL_LINK; *link = req; } else { - io_queue_sqe(req); + io_queue_sqe(req, sqe); }
return true; @@ -3695,14 +3702,15 @@ static void io_commit_sqring(struct io_ring_ctx *ctx) }
/* - * Fetch an sqe, if one is available. Note that req->sqe will point to memory + * Fetch an sqe, if one is available. Note that sqe_ptr will point to memory * that is mapped by userspace. This means that care needs to be taken to * ensure that reads are stable, as we cannot rely on userspace always * being a good citizen. If members of the sqe are validated and then later * used, it's important that those reads are done through READ_ONCE() to * prevent a re-load down the line. */ -static bool io_get_sqring(struct io_ring_ctx *ctx, struct io_kiocb *req) +static bool io_get_sqring(struct io_ring_ctx *ctx, struct io_kiocb *req, + const struct io_uring_sqe **sqe_ptr) { struct io_rings *rings = ctx->rings; u32 *sq_array = ctx->sq_array; @@ -3729,9 +3737,9 @@ static bool io_get_sqring(struct io_ring_ctx *ctx, struct io_kiocb *req) * link list. */ req->sequence = ctx->cached_sq_head; - req->sqe = &ctx->sq_sqes[head]; - req->opcode = READ_ONCE(req->sqe->opcode); - req->user_data = READ_ONCE(req->sqe->user_data); + *sqe_ptr = &ctx->sq_sqes[head]; + req->opcode = READ_ONCE((*sqe_ptr)->opcode); + req->user_data = READ_ONCE((*sqe_ptr)->user_data); ctx->cached_sq_head++; return true; } @@ -3763,6 +3771,7 @@ static int io_submit_sqes(struct io_ring_ctx *ctx, unsigned int nr, }
for (i = 0; i < nr; i++) { + const struct io_uring_sqe *sqe; struct io_kiocb *req; unsigned int sqe_flags;
@@ -3772,7 +3781,7 @@ static int io_submit_sqes(struct io_ring_ctx *ctx, unsigned int nr, submitted = -EAGAIN; break; } - if (!io_get_sqring(ctx, req)) { + if (!io_get_sqring(ctx, req, &sqe)) { __io_free_req(req); break; } @@ -3786,7 +3795,7 @@ static int io_submit_sqes(struct io_ring_ctx *ctx, unsigned int nr, }
submitted++; - sqe_flags = req->sqe->flags; + sqe_flags = sqe->flags;
req->ring_file = ring_file; req->ring_fd = ring_fd; @@ -3794,7 +3803,7 @@ static int io_submit_sqes(struct io_ring_ctx *ctx, unsigned int nr, req->in_async = async; req->needs_fixed_file = async; trace_io_uring_submit_sqe(ctx, req->user_data, true, async); - if (!io_submit_sqe(req, statep, &link)) + if (!io_submit_sqe(req, sqe, statep, &link)) break; /* * If previous wasn't linked and we have a linked command,
From: Hillf Danton hdanton@sina.com
mainline inclusion from mainline-5.5-rc4 commit 1f424e8bd18754d27b15f49359004b0cea344fb5 category: feature bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27 CVE: NA ---------------------------
Commit e61df66c69b1 ("io-wq: ensure free/busy list browsing see all items") added a list for io workers in addition to the free and busy lists, not only making worker walking cleaner, but also leaving the busy list unused. Let's remove it.
Signed-off-by: Hillf Danton hdanton@sina.com Signed-off-by: Jens Axboe axboe@kernel.dk Signed-off-by: yangerkun yangerkun@huawei.com Reviewed-by: zhangyi (F) yi.zhang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- fs/io-wq.c | 8 -------- 1 file changed, 8 deletions(-)
diff --git a/fs/io-wq.c b/fs/io-wq.c index e38e3c6e30f7..8adc2821b0cc 100644 --- a/fs/io-wq.c +++ b/fs/io-wq.c @@ -93,7 +93,6 @@ struct io_wqe { struct io_wqe_acct acct[2];
struct hlist_nulls_head free_list; - struct hlist_nulls_head busy_list; struct list_head all_list;
struct io_wq *wq; @@ -328,7 +327,6 @@ static void __io_worker_busy(struct io_wqe *wqe, struct io_worker *worker, if (worker->flags & IO_WORKER_F_FREE) { worker->flags &= ~IO_WORKER_F_FREE; hlist_nulls_del_init_rcu(&worker->nulls_node); - hlist_nulls_add_head_rcu(&worker->nulls_node, &wqe->busy_list); }
/* @@ -366,7 +364,6 @@ static bool __io_worker_idle(struct io_wqe *wqe, struct io_worker *worker) { if (!(worker->flags & IO_WORKER_F_FREE)) { worker->flags |= IO_WORKER_F_FREE; - hlist_nulls_del_init_rcu(&worker->nulls_node); hlist_nulls_add_head_rcu(&worker->nulls_node, &wqe->free_list); }
@@ -799,10 +796,6 @@ void io_wq_cancel_all(struct io_wq *wq)
set_bit(IO_WQ_BIT_CANCEL, &wq->state);
- /* - * Browse both lists, as there's a gap between handing work off - * to a worker and the worker putting itself on the busy_list - */ rcu_read_lock(); for_each_node(node) { struct io_wqe *wqe = wq->wqes[node]; @@ -1050,7 +1043,6 @@ struct io_wq *io_wq_create(unsigned bounded, struct io_wq_data *data) spin_lock_init(&wqe->lock); INIT_WQ_LIST(&wqe->work_list); INIT_HLIST_NULLS_HEAD(&wqe->free_list, 0); - INIT_HLIST_NULLS_HEAD(&wqe->busy_list, 1); INIT_LIST_HEAD(&wqe->all_list); }
From: Jens Axboe axboe@kernel.dk
mainline inclusion from mainline-5.6-rc1 commit 3934e36f6099e6277db33f433fe135c6644e8ac2 category: feature bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27 CVE: NA ---------------------------
To implement an async stat, we need to provide the flags mapping and the statx user copy. Make them available internally, through fs/internal.h.
Signed-off-by: Jens Axboe axboe@kernel.dk Signed-off-by: yangerkun yangerkun@huawei.com Reviewed-by: zhangyi (F) yi.zhang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- fs/internal.h | 6 ++++++ fs/stat.c | 34 ++++++++++++++++++++++------------ 2 files changed, 28 insertions(+), 12 deletions(-)
diff --git a/fs/internal.h b/fs/internal.h index 544ae37d15f2..acbc60a8e13e 100644 --- a/fs/internal.h +++ b/fs/internal.h @@ -189,3 +189,9 @@ loff_t iomap_apply(struct inode *inode, loff_t pos, loff_t length,
/* direct-io.c: */ int sb_init_dio_done_wq(struct super_block *sb); + +/* + * fs/stat.c: + */ +unsigned vfs_stat_set_lookup_flags(unsigned *lookup_flags, int flags); +int cp_statx(const struct kstat *stat, struct statx __user *buffer); diff --git a/fs/stat.c b/fs/stat.c index f8e6fb2c3657..46dfe0df1a71 100644 --- a/fs/stat.c +++ b/fs/stat.c @@ -21,6 +21,8 @@ #include <linux/uaccess.h> #include <asm/unistd.h>
+#include "internal.h" + /** * generic_fillattr - Fill in the basic attributes from the inode struct * @inode: Inode to use as the source @@ -148,6 +150,23 @@ int vfs_statx_fd(unsigned int fd, struct kstat *stat, } EXPORT_SYMBOL(vfs_statx_fd);
+inline unsigned vfs_stat_set_lookup_flags(unsigned *lookup_flags, int flags) +{ + if ((flags & ~(AT_SYMLINK_NOFOLLOW | AT_NO_AUTOMOUNT | + AT_EMPTY_PATH | KSTAT_QUERY_FLAGS)) != 0) + return -EINVAL; + + *lookup_flags = LOOKUP_FOLLOW | LOOKUP_AUTOMOUNT; + if (flags & AT_SYMLINK_NOFOLLOW) + *lookup_flags &= ~LOOKUP_FOLLOW; + if (flags & AT_NO_AUTOMOUNT) + *lookup_flags &= ~LOOKUP_AUTOMOUNT; + if (flags & AT_EMPTY_PATH) + *lookup_flags |= LOOKUP_EMPTY; + + return 0; +} + /** * vfs_statx - Get basic and extra attributes by filename * @dfd: A file descriptor representing the base dir for a relative filename @@ -168,19 +187,10 @@ int vfs_statx(int dfd, const char __user *filename, int flags, { struct path path; int error = -EINVAL; - unsigned int lookup_flags = LOOKUP_FOLLOW | LOOKUP_AUTOMOUNT; + unsigned lookup_flags;
- if ((flags & ~(AT_SYMLINK_NOFOLLOW | AT_NO_AUTOMOUNT | - AT_EMPTY_PATH | KSTAT_QUERY_FLAGS)) != 0) + if (vfs_stat_set_lookup_flags(&lookup_flags, flags)) return -EINVAL; - - if (flags & AT_SYMLINK_NOFOLLOW) - lookup_flags &= ~LOOKUP_FOLLOW; - if (flags & AT_NO_AUTOMOUNT) - lookup_flags &= ~LOOKUP_AUTOMOUNT; - if (flags & AT_EMPTY_PATH) - lookup_flags |= LOOKUP_EMPTY; - retry: error = user_path_at(dfd, filename, lookup_flags, &path); if (error) @@ -518,7 +528,7 @@ SYSCALL_DEFINE4(fstatat64, int, dfd, const char __user *, filename, } #endif /* __ARCH_WANT_STAT64 || __ARCH_WANT_COMPAT_STAT64 */
-static noinline_for_stack int +noinline_for_stack int cp_statx(const struct kstat *stat, struct statx __user *buffer) { struct statx tmp;
From: Jens Axboe axboe@kernel.dk
mainline inclusion from mainline-5.6-rc1 commit eddc7ef52a6b37b7ba3d1c8a8fbb63d5d9914f8a category: feature bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27 CVE: NA ---------------------------
This provides support for async statx(2) through io_uring.
Signed-off-by: Jens Axboe axboe@kernel.dk Signed-off-by: yangerkun yangerkun@huawei.com Reviewed-by: zhangyi (F) yi.zhang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- fs/io_uring.c | 86 ++++++++++++++++++++++++++++++++++- include/uapi/linux/io_uring.h | 2 + 2 files changed, 87 insertions(+), 1 deletion(-)
diff --git a/fs/io_uring.c b/fs/io_uring.c index 8f47e53164f1..b8e5b742a00a 100644 --- a/fs/io_uring.c +++ b/fs/io_uring.c @@ -379,9 +379,13 @@ struct io_sr_msg { struct io_open { struct file *file; int dfd; - umode_t mode; + union { + umode_t mode; + unsigned mask; + }; const char __user *fname; struct filename *filename; + struct statx __user *buffer; int flags; };
@@ -2263,6 +2267,74 @@ static int io_openat(struct io_kiocb *req, struct io_kiocb **nxt, return 0; }
+static int io_statx_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe) +{ + unsigned lookup_flags; + int ret; + + if (sqe->ioprio || sqe->buf_index) + return -EINVAL; + + req->open.dfd = READ_ONCE(sqe->fd); + req->open.mask = READ_ONCE(sqe->len); + req->open.fname = u64_to_user_ptr(READ_ONCE(sqe->addr)); + req->open.buffer = u64_to_user_ptr(READ_ONCE(sqe->addr2)); + req->open.flags = READ_ONCE(sqe->statx_flags); + + if (vfs_stat_set_lookup_flags(&lookup_flags, req->open.flags)) + return -EINVAL; + + req->open.filename = getname_flags(req->open.fname, lookup_flags, NULL); + if (IS_ERR(req->open.filename)) { + ret = PTR_ERR(req->open.filename); + req->open.filename = NULL; + return ret; + } + + return 0; +} + +static int io_statx(struct io_kiocb *req, struct io_kiocb **nxt, + bool force_nonblock) +{ + struct io_open *ctx = &req->open; + unsigned lookup_flags; + struct path path; + struct kstat stat; + int ret; + + if (force_nonblock) + return -EAGAIN; + + if (vfs_stat_set_lookup_flags(&lookup_flags, ctx->flags)) + return -EINVAL; + +retry: + /* filename_lookup() drops it, keep a reference */ + ctx->filename->refcnt++; + + ret = filename_lookup(ctx->dfd, ctx->filename, lookup_flags, &path, + NULL); + if (ret) + goto err; + + ret = vfs_getattr(&path, &stat, ctx->mask, ctx->flags); + path_put(&path); + if (retry_estale(ret, lookup_flags)) { + lookup_flags |= LOOKUP_REVAL; + goto retry; + } + if (!ret) + ret = cp_statx(&stat, ctx->buffer); +err: + putname(ctx->filename); + if (ret < 0) + req_set_fail_links(req); + io_cqring_add_event(req, ret); + io_put_req_find_next(req, nxt); + return 0; +} + static int io_close_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe) { /* @@ -3424,6 +3496,9 @@ static int io_req_defer_prep(struct io_kiocb *req, case IORING_OP_FILES_UPDATE: ret = io_files_update_prep(req, sqe); break; + case IORING_OP_STATX: + ret = io_statx_prep(req, sqe); + break; default: printk_once(KERN_WARNING "io_uring: unhandled opcode %d\n", req->opcode); @@ -3610,6 +3685,14 @@ static int io_issue_sqe(struct io_kiocb *req, const struct io_uring_sqe *sqe, } ret = io_files_update(req, force_nonblock); break; + case IORING_OP_STATX: + if (sqe) { + ret = io_statx_prep(req, sqe); + if (ret) + break; + } + ret = io_statx(req, nxt, force_nonblock); + break; default: ret = -EINVAL; break; @@ -3696,6 +3779,7 @@ static int io_req_needs_file(struct io_kiocb *req, int fd) case IORING_OP_LINK_TIMEOUT: return 0; case IORING_OP_OPENAT: + case IORING_OP_STATX: return fd != -1; default: if (io_req_op_valid(req->opcode)) diff --git a/include/uapi/linux/io_uring.h b/include/uapi/linux/io_uring.h index ca436b9d4921..3f45f7c543de 100644 --- a/include/uapi/linux/io_uring.h +++ b/include/uapi/linux/io_uring.h @@ -35,6 +35,7 @@ struct io_uring_sqe { __u32 accept_flags; __u32 cancel_flags; __u32 open_flags; + __u32 statx_flags; }; __u64 user_data; /* data to be passed back at completion time */ union { @@ -81,6 +82,7 @@ enum { IORING_OP_OPENAT, IORING_OP_CLOSE, IORING_OP_FILES_UPDATE, + IORING_OP_STATX,
/* this goes last, obviously */ IORING_OP_LAST,
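As a consumer-side illustration of the patch above, here is a minimal userspace sketch using liburing. It assumes glibc >= 2.28 for struct statx and a liburing recent enough to ship io_uring_prep_statx(); the path is arbitrary and error handling is trimmed.

#define _GNU_SOURCE
#include <liburing.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <stdio.h>

int main(void)
{
	struct io_uring ring;
	struct io_uring_sqe *sqe;
	struct io_uring_cqe *cqe;
	struct statx stx;

	if (io_uring_queue_init(4, &ring, 0) < 0)
		return 1;

	sqe = io_uring_get_sqe(&ring);
	/* queue an async statx: dfd, path, flags, mask, result buffer */
	io_uring_prep_statx(sqe, AT_FDCWD, "/etc/hostname", 0,
			    STATX_BASIC_STATS, &stx);
	io_uring_submit(&ring);

	io_uring_wait_cqe(&ring, &cqe);
	if (cqe->res == 0)
		printf("size=%llu\n", (unsigned long long)stx.stx_size);
	io_uring_cqe_seen(&ring, cqe);
	io_uring_queue_exit(&ring);
	return 0;
}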
From: Jens Axboe axboe@kernel.dk
mainline inclusion from mainline-5.6-rc1 commit 895e2ca0f693c672902191747b548bdc56f0c7de category: feature bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27 CVE: NA ---------------------------
io-wq assumes that work will complete fast (and not block), so it doesn't create a new worker when work is enqueued, if we already have at least one worker running. This is done on the assumption that if work is running, then it will complete fast.
Add an option to force io-wq to fork a new worker for work queued. This is signaled by setting IO_WQ_WORK_CONCURRENT on the work item. For that case, io-wq will create a new worker, even though workers are already running.
Signed-off-by: Jens Axboe axboe@kernel.dk Signed-off-by: yangerkun yangerkun@huawei.com Reviewed-by: zhangyi (F) yi.zhang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- fs/io-wq.c | 5 ++++- fs/io-wq.h | 1 + 2 files changed, 5 insertions(+), 1 deletion(-)
diff --git a/fs/io-wq.c b/fs/io-wq.c index df3d58a02fac..09896c7e4205 100644 --- a/fs/io-wq.c +++ b/fs/io-wq.c @@ -725,6 +725,7 @@ static bool io_wq_can_queue(struct io_wqe *wqe, struct io_wqe_acct *acct, static void io_wqe_enqueue(struct io_wqe *wqe, struct io_wq_work *work) { struct io_wqe_acct *acct = io_work_get_acct(wqe, work); + int work_flags; unsigned long flags;
/* @@ -739,12 +740,14 @@ static void io_wqe_enqueue(struct io_wqe *wqe, struct io_wq_work *work) return; }
+ work_flags = work->flags; spin_lock_irqsave(&wqe->lock, flags); wq_list_add_tail(&work->list, &wqe->work_list); wqe->flags &= ~IO_WQE_FLAG_STALLED; spin_unlock_irqrestore(&wqe->lock, flags);
- if (!atomic_read(&acct->nr_running)) + if ((work_flags & IO_WQ_WORK_CONCURRENT) || + !atomic_read(&acct->nr_running)) io_wqe_wake_worker(wqe, acct); }
diff --git a/fs/io-wq.h b/fs/io-wq.h index 04d60ad38dfc..1cd039af8813 100644 --- a/fs/io-wq.h +++ b/fs/io-wq.h @@ -13,6 +13,7 @@ enum { IO_WQ_WORK_INTERNAL = 64, IO_WQ_WORK_CB = 128, IO_WQ_WORK_NO_CANCEL = 256, + IO_WQ_WORK_CONCURRENT = 512,
IO_WQ_HASH_SHIFT = 24, /* upper 8 bits are used for hash key */ };
From: Jens Axboe axboe@kernel.dk
mainline inclusion from mainline-5.6-rc1 commit ce35a47a3a0208a77b4d31b7f2e8ed57d624093d category: feature bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27 CVE: NA ---------------------------
io_uring defaults to always doing inline submissions, if at all possible. But for larger copies, even if the data is fully cached, that can take a long time. Add an IOSQE_ASYNC flag that the application can set on the SQE - if set, it'll ensure that we always go async for those kinds of requests. Use the io-wq IO_WQ_WORK_CONCURRENT flag to ensure we get the concurrency we desire for this case.
Signed-off-by: Jens Axboe axboe@kernel.dk Signed-off-by: yangerkun yangerkun@huawei.com Reviewed-by: zhangyi (F) yi.zhang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- fs/io_uring.c | 16 ++++++++++++++-- include/uapi/linux/io_uring.h | 1 + 2 files changed, 15 insertions(+), 2 deletions(-)
diff --git a/fs/io_uring.c b/fs/io_uring.c index b8e5b742a00a..29d67e40e81d 100644 --- a/fs/io_uring.c +++ b/fs/io_uring.c @@ -483,6 +483,7 @@ struct io_kiocb { #define REQ_F_INFLIGHT 16384 /* on inflight list */ #define REQ_F_COMP_LOCKED 32768 /* completion under lock */ #define REQ_F_HARDLINK 65536 /* doesn't sever on completion < 0 */ +#define REQ_F_FORCE_ASYNC 131072 /* IOSQE_ASYNC */ u64 user_data; u32 result; u32 sequence; @@ -4014,8 +4015,17 @@ static void io_queue_sqe(struct io_kiocb *req, const struct io_uring_sqe *sqe) req_set_fail_links(req); io_double_put_req(req); } - } else + } else if ((req->flags & REQ_F_FORCE_ASYNC) && + !io_wq_current_is_worker()) { + /* + * Never try inline submit of IOSQE_ASYNC is set, go straight + * to async execution. + */ + req->work.flags |= IO_WQ_WORK_CONCURRENT; + io_queue_async_work(req); + } else { __io_queue_sqe(req, sqe); + } }
static inline void io_queue_link_head(struct io_kiocb *req) @@ -4028,7 +4038,7 @@ static inline void io_queue_link_head(struct io_kiocb *req) }
#define SQE_VALID_FLAGS (IOSQE_FIXED_FILE|IOSQE_IO_DRAIN|IOSQE_IO_LINK| \ - IOSQE_IO_HARDLINK) + IOSQE_IO_HARDLINK | IOSQE_ASYNC)
static bool io_submit_sqe(struct io_kiocb *req, const struct io_uring_sqe *sqe, struct io_submit_state *state, struct io_kiocb **link) @@ -4041,6 +4051,8 @@ static bool io_submit_sqe(struct io_kiocb *req, const struct io_uring_sqe *sqe, ret = -EINVAL; goto err_req; } + if (sqe->flags & IOSQE_ASYNC) + req->flags |= REQ_F_FORCE_ASYNC;
ret = io_req_set_file(state, req, sqe); if (unlikely(ret)) { diff --git a/include/uapi/linux/io_uring.h b/include/uapi/linux/io_uring.h index 3f45f7c543de..d7ec50247a3a 100644 --- a/include/uapi/linux/io_uring.h +++ b/include/uapi/linux/io_uring.h @@ -51,6 +51,7 @@ struct io_uring_sqe { #define IOSQE_IO_DRAIN (1U << 1) /* issue after inflight IO */ #define IOSQE_IO_LINK (1U << 2) /* links next sqe */ #define IOSQE_IO_HARDLINK (1U << 3) /* like LINK, but stronger */ +#define IOSQE_ASYNC (1U << 4) /* always go async */
/* * io_uring_setup() flags
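From the application side, opting out of inline submission is a single SQE flag. A hedged sketch with liburing follows (the helper is hypothetical; io_uring_sqe_set_flags() and IOSQE_ASYNC are the real interfaces):

#include <liburing.h>
#include <sys/uio.h>

/*
 * Force this readv to be punted to io-wq instead of being attempted
 * inline at submit time; useful for large, fully-cached copies that
 * would otherwise stall the submitter.
 */
static int queue_async_readv(struct io_uring *ring, int fd,
			     struct iovec *iov, unsigned nr_vecs)
{
	struct io_uring_sqe *sqe = io_uring_get_sqe(ring);

	if (!sqe)
		return -1;
	io_uring_prep_readv(sqe, fd, iov, nr_vecs, 0);
	io_uring_sqe_set_flags(sqe, IOSQE_ASYNC);
	return io_uring_submit(ring);
}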
From: Pavel Begunkov asml.silence@gmail.com
mainline inclusion from mainline-5.6-rc1 commit 9d76377f7e13c19441fdd066033345289f89b5fe category: feature bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27 CVE: NA ---------------------------
Calling "prev" a head of a link is a bit misleading. Rename it
Signed-off-by: Pavel Begunkov asml.silence@gmail.com Signed-off-by: Jens Axboe axboe@kernel.dk Signed-off-by: yangerkun yangerkun@huawei.com Reviewed-by: zhangyi (F) yi.zhang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- fs/io_uring.c | 10 +++++----- 1 file changed, 5 insertions(+), 5 deletions(-)
diff --git a/fs/io_uring.c b/fs/io_uring.c index 29d67e40e81d..d481b9ae8715 100644 --- a/fs/io_uring.c +++ b/fs/io_uring.c @@ -4070,10 +4070,10 @@ static bool io_submit_sqe(struct io_kiocb *req, const struct io_uring_sqe *sqe, * conditions are true (normal request), then just queue it. */ if (*link) { - struct io_kiocb *prev = *link; + struct io_kiocb *head = *link;
if (sqe->flags & IOSQE_IO_DRAIN) - (*link)->flags |= REQ_F_DRAIN_LINK | REQ_F_IO_DRAIN; + head->flags |= REQ_F_DRAIN_LINK | REQ_F_IO_DRAIN;
if (sqe->flags & IOSQE_IO_HARDLINK) req->flags |= REQ_F_HARDLINK; @@ -4086,11 +4086,11 @@ static bool io_submit_sqe(struct io_kiocb *req, const struct io_uring_sqe *sqe, ret = io_req_defer_prep(req, sqe); if (ret) { /* fail even hard links since we don't submit */ - prev->flags |= REQ_F_FAIL_LINK; + head->flags |= REQ_F_FAIL_LINK; goto err_req; } - trace_io_uring_link(ctx, req, prev); - list_add_tail(&req->link_list, &prev->link_list); + trace_io_uring_link(ctx, req, head); + list_add_tail(&req->link_list, &head->link_list); } else if (sqe->flags & (IOSQE_IO_LINK|IOSQE_IO_HARDLINK)) { req->flags |= REQ_F_LINK; if (sqe->flags & IOSQE_IO_HARDLINK)
From: Jens Axboe axboe@kernel.dk
mainline inclusion from mainline-5.6-rc1 commit add7b6b85a4dfa89283834d181e87ea2144b9028 category: feature bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27 CVE: NA ---------------------------
__io_free_req() and io_double_put_req() aren't used before they are defined, so we can kill these two forward declarations.
Signed-off-by: Jens Axboe axboe@kernel.dk Signed-off-by: yangerkun yangerkun@huawei.com Reviewed-by: zhangyi (F) yi.zhang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- fs/io_uring.c | 2 -- 1 file changed, 2 deletions(-)
diff --git a/fs/io_uring.c b/fs/io_uring.c index 23e549dcc3a1..e50de3e3c341 100644 --- a/fs/io_uring.c +++ b/fs/io_uring.c @@ -518,9 +518,7 @@ struct io_submit_state {
static void io_wq_submit_work(struct io_wq_work **workptr); static void io_cqring_fill_event(struct io_kiocb *req, long res); -static void __io_free_req(struct io_kiocb *req); static void io_put_req(struct io_kiocb *req); -static void io_double_put_req(struct io_kiocb *req); static void __io_double_put_req(struct io_kiocb *req); static struct io_kiocb *io_prep_linked_timeout(struct io_kiocb *req); static void io_queue_linked_timeout(struct io_kiocb *req);
From: Jens Axboe axboe@kernel.dk
mainline inclusion from mainline-5.6-rc1 commit d3656344fea0339fb0365c8df4d2beba4e0089cd category: feature bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27 CVE: NA ---------------------------
We currently have various switch statements that check if an opcode needs a file, mm, etc. These are hard to keep in sync as opcodes are added. Add a struct io_op_def that holds all of this information, so we have just one spot to update when opcodes are added.
This also enables us to NOT allocate req->io if a deferred command doesn't need it, and corrects some mistakes we had in terms of what commands need mm context.
Signed-off-by: Jens Axboe axboe@kernel.dk Signed-off-by: yangerkun yangerkun@huawei.com Reviewed-by: zhangyi (F) yi.zhang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- fs/io_uring.c | 208 +++++++++++++++++++++++++++++++++++++------------- 1 file changed, 155 insertions(+), 53 deletions(-)
diff --git a/fs/io_uring.c b/fs/io_uring.c index e50de3e3c341..9216f407ab03 100644 --- a/fs/io_uring.c +++ b/fs/io_uring.c @@ -516,6 +516,135 @@ struct io_submit_state { unsigned int ios_left; };
+struct io_op_def { + /* needs req->io allocated for deferral/async */ + unsigned async_ctx : 1; + /* needs current->mm setup, does mm access */ + unsigned needs_mm : 1; + /* needs req->file assigned */ + unsigned needs_file : 1; + /* needs req->file assigned IFF fd is >= 0 */ + unsigned fd_non_neg : 1; + /* hash wq insertion if file is a regular file */ + unsigned hash_reg_file : 1; + /* unbound wq insertion if file is a non-regular file */ + unsigned unbound_nonreg_file : 1; +}; + +static const struct io_op_def io_op_defs[] = { + { + /* IORING_OP_NOP */ + }, + { + /* IORING_OP_READV */ + .async_ctx = 1, + .needs_mm = 1, + .needs_file = 1, + .unbound_nonreg_file = 1, + }, + { + /* IORING_OP_WRITEV */ + .async_ctx = 1, + .needs_mm = 1, + .needs_file = 1, + .hash_reg_file = 1, + .unbound_nonreg_file = 1, + }, + { + /* IORING_OP_FSYNC */ + .needs_file = 1, + }, + { + /* IORING_OP_READ_FIXED */ + .needs_file = 1, + .unbound_nonreg_file = 1, + }, + { + /* IORING_OP_WRITE_FIXED */ + .needs_file = 1, + .hash_reg_file = 1, + .unbound_nonreg_file = 1, + }, + { + /* IORING_OP_POLL_ADD */ + .needs_file = 1, + .unbound_nonreg_file = 1, + }, + { + /* IORING_OP_POLL_REMOVE */ + }, + { + /* IORING_OP_SYNC_FILE_RANGE */ + .needs_file = 1, + }, + { + /* IORING_OP_SENDMSG */ + .async_ctx = 1, + .needs_mm = 1, + .needs_file = 1, + .unbound_nonreg_file = 1, + }, + { + /* IORING_OP_RECVMSG */ + .async_ctx = 1, + .needs_mm = 1, + .needs_file = 1, + .unbound_nonreg_file = 1, + }, + { + /* IORING_OP_TIMEOUT */ + .async_ctx = 1, + .needs_mm = 1, + }, + { + /* IORING_OP_TIMEOUT_REMOVE */ + }, + { + /* IORING_OP_ACCEPT */ + .needs_mm = 1, + .needs_file = 1, + .unbound_nonreg_file = 1, + }, + { + /* IORING_OP_ASYNC_CANCEL */ + }, + { + /* IORING_OP_LINK_TIMEOUT */ + .async_ctx = 1, + .needs_mm = 1, + }, + { + /* IORING_OP_CONNECT */ + .async_ctx = 1, + .needs_mm = 1, + .needs_file = 1, + .unbound_nonreg_file = 1, + }, + { + /* IORING_OP_FALLOCATE */ + .needs_file = 1, + }, + { + /* IORING_OP_OPENAT */ + .needs_file = 1, + .fd_non_neg = 1, + }, + { + /* IORING_OP_CLOSE */ + .needs_file = 1, + }, + { + /* IORING_OP_FILES_UPDATE */ + .needs_mm = 1, + }, + { + /* IORING_OP_STATX */ + .needs_mm = 1, + .needs_file = 1, + .fd_non_neg = 1, + }, +}; + static void io_wq_submit_work(struct io_wq_work **workptr); static void io_cqring_fill_event(struct io_kiocb *req, long res); static void io_put_req(struct io_kiocb *req); @@ -670,41 +799,20 @@ static void __io_commit_cqring(struct io_ring_ctx *ctx) } }
-static inline bool io_req_needs_user(struct io_kiocb *req) -{ - return !(req->opcode == IORING_OP_READ_FIXED || - req->opcode == IORING_OP_WRITE_FIXED); -} - static inline bool io_prep_async_work(struct io_kiocb *req, struct io_kiocb **link) { + const struct io_op_def *def = &io_op_defs[req->opcode]; bool do_hashed = false;
- switch (req->opcode) { - case IORING_OP_WRITEV: - case IORING_OP_WRITE_FIXED: - /* only regular files should be hashed for writes */ - if (req->flags & REQ_F_ISREG) + if (req->flags & REQ_F_ISREG) { + if (def->hash_reg_file) do_hashed = true; - /* fall-through */ - case IORING_OP_READV: - case IORING_OP_READ_FIXED: - case IORING_OP_SENDMSG: - case IORING_OP_RECVMSG: - case IORING_OP_ACCEPT: - case IORING_OP_POLL_ADD: - case IORING_OP_CONNECT: - /* - * We know REQ_F_ISREG is not set on some of these - * opcodes, but this enables us to keep the check in - * just one place. - */ - if (!(req->flags & REQ_F_ISREG)) + } else { + if (def->unbound_nonreg_file) req->work.flags |= IO_WQ_WORK_UNBOUND; - break; } - if (io_req_needs_user(req)) + if (def->needs_mm) req->work.flags |= IO_WQ_WORK_NEEDS_USER;
*link = io_prep_linked_timeout(req); @@ -1825,6 +1933,8 @@ static void io_req_map_rw(struct io_kiocb *req, ssize_t io_size,
static int io_alloc_async_ctx(struct io_kiocb *req) { + if (!io_op_defs[req->opcode].async_ctx) + return 0; req->io = kmalloc(sizeof(*req->io), GFP_KERNEL); return req->io == NULL; } @@ -3762,29 +3872,13 @@ static void io_wq_submit_work(struct io_wq_work **workptr) io_wq_assign_next(workptr, nxt); }
-static bool io_req_op_valid(int op) -{ - return op >= IORING_OP_NOP && op < IORING_OP_LAST; -} - static int io_req_needs_file(struct io_kiocb *req, int fd) { - switch (req->opcode) { - case IORING_OP_NOP: - case IORING_OP_POLL_REMOVE: - case IORING_OP_TIMEOUT: - case IORING_OP_TIMEOUT_REMOVE: - case IORING_OP_ASYNC_CANCEL: - case IORING_OP_LINK_TIMEOUT: + if (!io_op_defs[req->opcode].needs_file) return 0; - case IORING_OP_OPENAT: - case IORING_OP_STATX: - return fd != -1; - default: - if (io_req_op_valid(req->opcode)) - return 1; - return -EINVAL; - } + if (fd == -1 && io_op_defs[req->opcode].fd_non_neg) + return 0; + return 1; }
static inline struct file *io_file_from_index(struct io_ring_ctx *ctx, @@ -3801,7 +3895,7 @@ static int io_req_set_file(struct io_submit_state *state, struct io_kiocb *req, { struct io_ring_ctx *ctx = req->ctx; unsigned flags; - int fd, ret; + int fd;
flags = READ_ONCE(sqe->flags); fd = READ_ONCE(sqe->fd); @@ -3809,9 +3903,8 @@ static int io_req_set_file(struct io_submit_state *state, struct io_kiocb *req, if (flags & IOSQE_IO_DRAIN) req->flags |= REQ_F_IO_DRAIN;
- ret = io_req_needs_file(req, fd); - if (ret <= 0) - return ret; + if (!io_req_needs_file(req, fd)) + return 0;
if (flags & IOSQE_FIXED_FILE) { if (unlikely(!ctx->file_data || @@ -4237,7 +4330,16 @@ static int io_submit_sqes(struct io_ring_ctx *ctx, unsigned int nr, break; }
- if (io_req_needs_user(req) && !*mm) { + /* will complete beyond this point, count as submitted */ + submitted++; + + if (unlikely(req->opcode >= IORING_OP_LAST)) { + io_cqring_add_event(req, -EINVAL); + io_double_put_req(req); + break; + } + + if (io_op_defs[req->opcode].needs_mm && !*mm) { mm_fault = mm_fault || !mmget_not_zero(ctx->sqo_mm); if (!mm_fault) { use_mm(ctx->sqo_mm); @@ -4245,7 +4347,6 @@ static int io_submit_sqes(struct io_ring_ctx *ctx, unsigned int nr, } }
- submitted++; req->ring_file = ring_file; req->ring_fd = ring_fd; req->has_user = *mm != NULL; @@ -6092,6 +6193,7 @@ SYSCALL_DEFINE4(io_uring_register, unsigned int, fd, unsigned int, opcode,
static int __init io_uring_init(void) { + BUILD_BUG_ON(ARRAY_SIZE(io_op_defs) != IORING_OP_LAST); req_cachep = KMEM_CACHE(io_kiocb, SLAB_HWCACHE_ALIGN | SLAB_PANIC); return 0; };
From: Jens Axboe axboe@kernel.dk
mainline inclusion from mainline-5.6-rc1 commit ad3eb2c89fb24d14ac81f43eff8e85fece2c934d category: feature bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27 CVE: NA ---------------------------
We currently check ->cq_overflow_list from both SQ and CQ context, which causes some bouncing of that cache line. Add separate bits of state for this instead, so that the SQ side can check using its own state, and likewise for the CQ side.
This adds ->sq_check_overflow with the SQ state, and ->cq_check_overflow with the CQ state. If we hit an overflow condition, both of these bits are set. Likewise for overflow flush clear, we clear both bits. For the fast path of just checking if there's an overflow condition on either the SQ or CQ side, we can use our own private bit for this.
Signed-off-by: Jens Axboe axboe@kernel.dk Signed-off-by: yangerkun yangerkun@huawei.com Reviewed-by: zhangyi (F) yi.zhang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- fs/io_uring.c | 40 +++++++++++++++++++++++++++------------- 1 file changed, 27 insertions(+), 13 deletions(-)
diff --git a/fs/io_uring.c b/fs/io_uring.c index 9216f407ab03..44a0166f7d85 100644 --- a/fs/io_uring.c +++ b/fs/io_uring.c @@ -224,13 +224,14 @@ struct io_ring_ctx { unsigned sq_thread_idle; unsigned cached_sq_dropped; atomic_t cached_cq_overflow; - struct io_uring_sqe *sq_sqes; + unsigned long sq_check_overflow;
struct list_head defer_list; struct list_head timeout_list; struct list_head cq_overflow_list;
wait_queue_head_t inflight_wait; + struct io_uring_sqe *sq_sqes; } ____cacheline_aligned_in_smp;
struct io_rings *rings; @@ -272,6 +273,7 @@ struct io_ring_ctx { unsigned cq_entries; unsigned cq_mask; atomic_t cq_timeouts; + unsigned long cq_check_overflow; struct wait_queue_head cq_wait; struct fasync_struct *cq_fasync; struct eventfd_ctx *cq_ev_fd; @@ -949,6 +951,10 @@ static bool io_cqring_overflow_flush(struct io_ring_ctx *ctx, bool force) }
io_commit_cqring(ctx); + if (cqe) { + clear_bit(0, &ctx->sq_check_overflow); + clear_bit(0, &ctx->cq_check_overflow); + } spin_unlock_irqrestore(&ctx->completion_lock, flags); io_cqring_ev_posted(ctx);
@@ -982,6 +988,10 @@ static void io_cqring_fill_event(struct io_kiocb *req, long res) WRITE_ONCE(ctx->rings->cq_overflow, atomic_inc_return(&ctx->cached_cq_overflow)); } else { + if (list_empty(&ctx->cq_overflow_list)) { + set_bit(0, &ctx->sq_check_overflow); + set_bit(0, &ctx->cq_check_overflow); + } refcount_inc(&req->refs); req->result = res; list_add_tail(&req->list, &ctx->cq_overflow_list); @@ -1284,19 +1294,21 @@ static unsigned io_cqring_events(struct io_ring_ctx *ctx, bool noflush) { struct io_rings *rings = ctx->rings;
- /* - * noflush == true is from the waitqueue handler, just ensure we wake - * up the task, and the next invocation will flush the entries. We - * cannot safely to it from here. - */ - if (noflush && !list_empty(&ctx->cq_overflow_list)) - return -1U; + if (test_bit(0, &ctx->cq_check_overflow)) { + /* + * noflush == true is from the waitqueue handler, just ensure + * we wake up the task, and the next invocation will flush the + * entries. We cannot safely to it from here. + */ + if (noflush && !list_empty(&ctx->cq_overflow_list)) + return -1U;
- io_cqring_overflow_flush(ctx, false); + io_cqring_overflow_flush(ctx, false); + }
/* See comment at the top of this file */ smp_rmb(); - return READ_ONCE(rings->cq.tail) - READ_ONCE(rings->cq.head); + return ctx->cached_cq_tail - READ_ONCE(rings->cq.head); }
static inline unsigned int io_sqring_entries(struct io_ring_ctx *ctx) @@ -4306,9 +4318,11 @@ static int io_submit_sqes(struct io_ring_ctx *ctx, unsigned int nr, bool mm_fault = false;
/* if we have a backlog and couldn't flush it all, return BUSY */ - if (!list_empty(&ctx->cq_overflow_list) && - !io_cqring_overflow_flush(ctx, false)) - return -EBUSY; + if (test_bit(0, &ctx->sq_check_overflow)) { + if (!list_empty(&ctx->cq_overflow_list) && + !io_cqring_overflow_flush(ctx, false)) + return -EBUSY; + }
if (nr > IO_PLUG_THRESHOLD) { io_submit_state_start(&state, nr);
From: Jens Axboe axboe@kernel.dk
mainline inclusion from mainline-5.6-rc1 commit e94f141bd248ebdadcb7351f1e70b31cee5add53 category: feature bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27 CVE: NA ---------------------------
For busy IORING_OP_POLL_ADD workloads, we can have enough contention on the completion lock that we fail the inline completion path quite often as we fail the trylock on that lock. Add a list for deferred completions that we can use in that case. This helps reduce the number of async offloads we have to do, as if we get multiple completions in a row, we'll piggyback onto the poll_llist instead of having to queue our own offload.
Signed-off-by: Jens Axboe axboe@kernel.dk Signed-off-by: yangerkun yangerkun@huawei.com Reviewed-by: zhangyi (F) yi.zhang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- fs/io_uring.c | 108 ++++++++++++++++++++++++++++++++++++++++---------- 1 file changed, 88 insertions(+), 20 deletions(-)
diff --git a/fs/io_uring.c b/fs/io_uring.c index 44a0166f7d85..c96694d7b0fb 100644 --- a/fs/io_uring.c +++ b/fs/io_uring.c @@ -286,7 +286,8 @@ struct io_ring_ctx {
struct { spinlock_t completion_lock; - bool poll_multi_file; + struct llist_head poll_llist; + /* * ->poll_list is protected by the ctx->uring_lock for * io_uring instances that don't use IORING_SETUP_SQPOLL. @@ -296,6 +297,7 @@ struct io_ring_ctx { struct list_head poll_list; struct hlist_head *cancel_hash; unsigned cancel_hash_bits; + bool poll_multi_file;
spinlock_t inflight_lock; struct list_head inflight_list; @@ -453,7 +455,14 @@ struct io_kiocb { };
struct io_async_ctx *io; - struct file *ring_file; + union { + /* + * ring_file is only used in the submission path, and + * llist_node is only used for poll deferred completions + */ + struct file *ring_file; + struct llist_node llist_node; + }; int ring_fd; bool has_user; bool in_async; @@ -724,6 +733,7 @@ static struct io_ring_ctx *io_ring_ctx_alloc(struct io_uring_params *p) mutex_init(&ctx->uring_lock); init_waitqueue_head(&ctx->wait); spin_lock_init(&ctx->completion_lock); + init_llist_head(&ctx->poll_llist); INIT_LIST_HEAD(&ctx->poll_list); INIT_LIST_HEAD(&ctx->defer_list); INIT_LIST_HEAD(&ctx->timeout_list); @@ -1319,6 +1329,20 @@ static inline unsigned int io_sqring_entries(struct io_ring_ctx *ctx) return smp_load_acquire(&rings->sq.tail) - ctx->cached_sq_head; }
+static inline bool io_req_multi_free(struct io_kiocb *req) +{ + /* + * If we're not using fixed files, we have to pair the completion part + * with the file put. Use regular completions for those, only batch + * free for fixed file and non-linked commands. + */ + if (((req->flags & (REQ_F_FIXED_FILE|REQ_F_LINK)) == REQ_F_FIXED_FILE) + && !io_is_fallback_req(req) && !req->io) + return true; + + return false; +} + /* * Find and free completed poll iocbs */ @@ -1338,14 +1362,7 @@ static void io_iopoll_complete(struct io_ring_ctx *ctx, unsigned int *nr_events, (*nr_events)++;
if (refcount_dec_and_test(&req->refs)) { - /* If we're not using fixed files, we have to pair the - * completion part with the file put. Use regular - * completions for those, only batch free for fixed - * file and non-linked commands. - */ - if (((req->flags & (REQ_F_FIXED_FILE|REQ_F_LINK)) == - REQ_F_FIXED_FILE) && !io_is_fallback_req(req) && - !req->io) { + if (io_req_multi_free(req)) { reqs[to_free++] = req; if (to_free == ARRAY_SIZE(reqs)) io_free_req_many(ctx, reqs, &to_free); @@ -3078,6 +3095,44 @@ static void io_poll_complete_work(struct io_wq_work **workptr) io_wq_assign_next(workptr, nxt); }
+static void __io_poll_flush(struct io_ring_ctx *ctx, struct llist_node *nodes) +{ + void *reqs[IO_IOPOLL_BATCH]; + struct io_kiocb *req, *tmp; + int to_free = 0; + + spin_lock_irq(&ctx->completion_lock); + llist_for_each_entry_safe(req, tmp, nodes, llist_node) { + hash_del(&req->hash_node); + io_poll_complete(req, req->result, 0); + + if (refcount_dec_and_test(&req->refs)) { + if (io_req_multi_free(req)) { + reqs[to_free++] = req; + if (to_free == ARRAY_SIZE(reqs)) + io_free_req_many(ctx, reqs, &to_free); + } else { + req->flags |= REQ_F_COMP_LOCKED; + io_free_req(req); + } + } + } + spin_unlock_irq(&ctx->completion_lock); + + io_cqring_ev_posted(ctx); + io_free_req_many(ctx, reqs, &to_free); +} + +static void io_poll_flush(struct io_wq_work **workptr) +{ + struct io_kiocb *req = container_of(*workptr, struct io_kiocb, work); + struct llist_node *nodes; + + nodes = llist_del_all(&req->ctx->poll_llist); + if (nodes) + __io_poll_flush(req->ctx, nodes); +} + static int io_poll_wake(struct wait_queue_entry *wait, unsigned mode, int sync, void *key) { @@ -3085,7 +3140,6 @@ static int io_poll_wake(struct wait_queue_entry *wait, unsigned mode, int sync, struct io_kiocb *req = container_of(poll, struct io_kiocb, poll); struct io_ring_ctx *ctx = req->ctx; __poll_t mask = key_to_poll(key); - unsigned long flags;
/* for instances that support it check for an event match first: */ if (mask && !(mask & poll->events)) @@ -3099,17 +3153,31 @@ static int io_poll_wake(struct wait_queue_entry *wait, unsigned mode, int sync, * If we have a link timeout we're going to need the completion_lock * for finalizing the request, mark us as having grabbed that already. */ - if (mask && spin_trylock_irqsave(&ctx->completion_lock, flags)) { - hash_del(&req->hash_node); - io_poll_complete(req, mask, 0); - req->flags |= REQ_F_COMP_LOCKED; - io_put_req(req); - spin_unlock_irqrestore(&ctx->completion_lock, flags); + if (mask) { + unsigned long flags;
- io_cqring_ev_posted(ctx); - } else { - io_queue_async_work(req); + if (llist_empty(&ctx->poll_llist) && + spin_trylock_irqsave(&ctx->completion_lock, flags)) { + hash_del(&req->hash_node); + io_poll_complete(req, mask, 0); + req->flags |= REQ_F_COMP_LOCKED; + io_put_req(req); + spin_unlock_irqrestore(&ctx->completion_lock, flags); + + io_cqring_ev_posted(ctx); + req = NULL; + } else { + req->result = mask; + req->llist_node.next = NULL; + /* if the list wasn't empty, we're done */ + if (!llist_add(&req->llist_node, &ctx->poll_llist)) + req = NULL; + else + req->work.func = io_poll_flush; + } } + if (req) + io_queue_async_work(req);
return 1; }
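The deferral trick in the patch above generalizes: take the trylock, and on contention push onto a lock-free list whose first adder schedules the flush. Below is a userspace analogue of that pattern, assuming C11 atomics and pthreads; all names here are hypothetical, and the kernel patch uses llist and the ctx completion lock instead.

#include <pthread.h>
#include <stdatomic.h>
#include <stddef.h>

struct deferred {
	struct deferred *next;
};

static pthread_mutex_t completion_lock = PTHREAD_MUTEX_INITIALIZER;
static _Atomic(struct deferred *) defer_list;

static void complete_inline(struct deferred *d) { (void)d; /* post CQE */ }
static void schedule_flush(void) { /* hand the list off to a worker */ }

void complete_or_defer(struct deferred *d)
{
	if (pthread_mutex_trylock(&completion_lock) == 0) {
		complete_inline(d);
		pthread_mutex_unlock(&completion_lock);
		return;
	}
	/* contended: push onto the lock-free deferral list instead */
	struct deferred *head = atomic_load(&defer_list);
	do {
		d->next = head;
	} while (!atomic_compare_exchange_weak(&defer_list, &head, d));
	/* only the adder that found an empty list schedules the flush */
	if (!head)
		schedule_flush();
}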
From: Jens Axboe axboe@kernel.dk
mainline inclusion from mainline-5.6-rc1 commit 3a6820f2bb8a079975109c25a5d1f29f46bce5d2 category: feature bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27 CVE: NA ---------------------------
For use cases that don't already naturally have an iovec, it's easier (or more convenient) to just use a buffer address + length. This is particularly true if the use case is from languages that want to create a memory safe abstraction on top of io_uring, and where introducing the need for the iovec may impose an ownership issue. For those cases, they currently need an indirection buffer, which means allocating data just for this purpose.
Add basic read/write that don't require the iovec.
Signed-off-by: Jens Axboe axboe@kernel.dk Signed-off-by: yangerkun yangerkun@huawei.com Reviewed-by: zhangyi (F) yi.zhang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- fs/io_uring.c | 23 +++++++++++++++++++++++ include/uapi/linux/io_uring.h | 2 ++ 2 files changed, 25 insertions(+)
diff --git a/fs/io_uring.c b/fs/io_uring.c index c96694d7b0fb..8cb06ca5f21c 100644 --- a/fs/io_uring.c +++ b/fs/io_uring.c @@ -654,6 +654,18 @@ static const struct io_op_def io_op_defs[] = { .needs_file = 1, .fd_non_neg = 1, }, + { + /* IORING_OP_READ */ + .needs_mm = 1, + .needs_file = 1, + .unbound_nonreg_file = 1, + }, + { + /* IORING_OP_WRITE */ + .needs_mm = 1, + .needs_file = 1, + .unbound_nonreg_file = 1, + }, };
static void io_wq_submit_work(struct io_wq_work **workptr); @@ -1866,6 +1878,13 @@ static ssize_t io_import_iovec(int rw, struct io_kiocb *req, if (req->rw.kiocb.private) return -EINVAL;
+ if (opcode == IORING_OP_READ || opcode == IORING_OP_WRITE) { + ssize_t ret; + ret = import_single_range(rw, buf, sqe_len, *iovec, iter); + *iovec = NULL; + return ret; + } + if (req->io) { struct io_async_rw *iorw = &req->io->rw;
@@ -3631,10 +3650,12 @@ static int io_req_defer_prep(struct io_kiocb *req, break; case IORING_OP_READV: case IORING_OP_READ_FIXED: + case IORING_OP_READ: ret = io_read_prep(req, sqe, true); break; case IORING_OP_WRITEV: case IORING_OP_WRITE_FIXED: + case IORING_OP_WRITE: ret = io_write_prep(req, sqe, true); break; case IORING_OP_POLL_ADD: @@ -3738,6 +3759,7 @@ static int io_issue_sqe(struct io_kiocb *req, const struct io_uring_sqe *sqe, break; case IORING_OP_READV: case IORING_OP_READ_FIXED: + case IORING_OP_READ: if (sqe) { ret = io_read_prep(req, sqe, force_nonblock); if (ret < 0) @@ -3747,6 +3769,7 @@ static int io_issue_sqe(struct io_kiocb *req, const struct io_uring_sqe *sqe, break; case IORING_OP_WRITEV: case IORING_OP_WRITE_FIXED: + case IORING_OP_WRITE: if (sqe) { ret = io_write_prep(req, sqe, force_nonblock); if (ret < 0) diff --git a/include/uapi/linux/io_uring.h b/include/uapi/linux/io_uring.h index d7ec50247a3a..7fdf994f3313 100644 --- a/include/uapi/linux/io_uring.h +++ b/include/uapi/linux/io_uring.h @@ -84,6 +84,8 @@ enum { IORING_OP_CLOSE, IORING_OP_FILES_UPDATE, IORING_OP_STATX, + IORING_OP_READ, + IORING_OP_WRITE,
/* this goes last, obviously */ IORING_OP_LAST,
From: Jens Axboe axboe@kernel.dk
mainline inclusion from mainline-5.6-rc1 commit ba04291eb66ed895f194ae5abd3748d72bf8aaea category: feature bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27 CVE: NA ---------------------------
This behaves like preadv2/pwritev2 with offset == -1: it'll use (and update) the current file position. This obviously comes with the caveat that if the application has multiple read/writes in flight, then the end result will not be as expected. This is similar to threads sharing a file descriptor and doing IO using the current file position.
Since this feature isn't easily detectable by doing a read or write, add a feature flag, IORING_FEAT_RW_CUR_POS, to allow applications to detect the presence of this feature.
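A sketch of the intended detection flow, assuming liburing's io_uring_queue_init_params() (not part of this patch):

#include <liburing.h>
#include <string.h>
#include <stdio.h>

int main(void)
{
	struct io_uring_params p;
	struct io_uring ring;

	memset(&p, 0, sizeof(p));
	if (io_uring_queue_init_params(8, &ring, &p) < 0)
		return 1;
	/* if set, offset == -1 means "use and update the file position" */
	if (p.features & IORING_FEAT_RW_CUR_POS)
		printf("current-position reads/writes supported\n");
	else
		printf("kernel too old; pass explicit offsets\n");
	io_uring_queue_exit(&ring);
	return 0;
}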
Reported-by: 李通洲 carter.li@eoitek.com Signed-off-by: Jens Axboe axboe@kernel.dk Signed-off-by: yangerkun yangerkun@huawei.com Reviewed-by: zhangyi (F) yi.zhang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- fs/io_uring.c | 11 ++++++++++- include/uapi/linux/io_uring.h | 1 + 2 files changed, 11 insertions(+), 1 deletion(-)
diff --git a/fs/io_uring.c b/fs/io_uring.c index 8cb06ca5f21c..4385714506c2 100644 --- a/fs/io_uring.c +++ b/fs/io_uring.c @@ -495,6 +495,7 @@ struct io_kiocb { #define REQ_F_COMP_LOCKED 32768 /* completion under lock */ #define REQ_F_HARDLINK 65536 /* doesn't sever on completion < 0 */ #define REQ_F_FORCE_ASYNC 131072 /* IOSQE_ASYNC */ +#define REQ_F_CUR_POS 262144 /* read/write uses file position */ u64 user_data; u32 result; u32 sequence; @@ -1710,6 +1711,10 @@ static int io_prep_rw(struct io_kiocb *req, const struct io_uring_sqe *sqe, req->flags |= REQ_F_ISREG;
kiocb->ki_pos = READ_ONCE(sqe->off); + if (kiocb->ki_pos == -1 && !(req->file->f_mode & FMODE_STREAM)) { + req->flags |= REQ_F_CUR_POS; + kiocb->ki_pos = req->file->f_pos; + } kiocb->ki_flags = iocb_flags(kiocb->ki_filp); kiocb->ki_hint = ki_hint_validate(file_write_hint(kiocb->ki_filp));
@@ -1781,6 +1786,10 @@ static inline void io_rw_done(struct kiocb *kiocb, ssize_t ret) static void kiocb_done(struct kiocb *kiocb, ssize_t ret, struct io_kiocb **nxt, bool in_async) { + struct io_kiocb *req = container_of(kiocb, struct io_kiocb, rw.kiocb); + + if (req->flags & REQ_F_CUR_POS) + req->file->f_pos = kiocb->ki_pos; if (in_async && ret >= 0 && kiocb->ki_complete == io_complete_rw) *nxt = __io_complete_rw(kiocb, ret); else @@ -6142,7 +6151,7 @@ static int io_uring_create(unsigned entries, struct io_uring_params *p) goto err;
p->features = IORING_FEAT_SINGLE_MMAP | IORING_FEAT_NODROP | - IORING_FEAT_SUBMIT_STABLE; + IORING_FEAT_SUBMIT_STABLE | IORING_FEAT_RW_CUR_POS; trace_io_uring_create(ret, ctx, p->sq_entries, p->cq_entries, p->flags); return ret; err: diff --git a/include/uapi/linux/io_uring.h b/include/uapi/linux/io_uring.h index 7fdf994f3313..1f96136eb6ee 100644 --- a/include/uapi/linux/io_uring.h +++ b/include/uapi/linux/io_uring.h @@ -174,6 +174,7 @@ struct io_uring_params { #define IORING_FEAT_SINGLE_MMAP (1U << 0) #define IORING_FEAT_NODROP (1U << 1) #define IORING_FEAT_SUBMIT_STABLE (1U << 2) +#define IORING_FEAT_RW_CUR_POS (1U << 3)
/* * io_uring_register(2) opcodes and arguments
From: Jens Axboe axboe@kernel.dk
mainline inclusion from mainline-5.6-rc1 commit 4840e418c2fc533d55ff6caa5b9313eed1d26cfd category: feature bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27 CVE: NA ---------------------------
This adds support for doing fadvise through io_uring. We assume that WILLNEED doesn't block, but that DONTNEED may block.
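A minimal usage sketch, assuming liburing's io_uring_prep_fadvise() helper (not part of this patch):

#include <liburing.h>
#include <fcntl.h>

/* Hint that fd will be read sequentially; cqe->res carries the
 * vfs_fadvise() return value (0 on success). */
static int fadvise_sequential(struct io_uring *ring, int fd)
{
	struct io_uring_sqe *sqe = io_uring_get_sqe(ring);
	struct io_uring_cqe *cqe;
	int ret;

	io_uring_prep_fadvise(sqe, fd, 0, 0, POSIX_FADV_SEQUENTIAL);
	io_uring_submit(ring);
	if (io_uring_wait_cqe(ring, &cqe))
		return -1;
	ret = cqe->res;
	io_uring_cqe_seen(ring, cqe);
	return ret;
}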
Reviewed-by: Pavel Begunkov asml.silence@gmail.com Signed-off-by: Jens Axboe axboe@kernel.dk Signed-off-by: yangerkun yangerkun@huawei.com Reviewed-by: zhangyi (F) yi.zhang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- fs/io_uring.c | 53 +++++++++++++++++++++++++++++++++++ include/uapi/linux/io_uring.h | 2 ++ 2 files changed, 55 insertions(+)
diff --git a/fs/io_uring.c b/fs/io_uring.c index 4385714506c2..c47ab9ce390e 100644 --- a/fs/io_uring.c +++ b/fs/io_uring.c @@ -72,6 +72,7 @@ #include <linux/highmem.h> #include <linux/namei.h> #include <linux/fsnotify.h> +#include <linux/fadvise.h>
#define CREATE_TRACE_POINTS #include <trace/events/io_uring.h> @@ -400,6 +401,13 @@ struct io_files_update { u32 offset; };
+struct io_fadvise { + struct file *file; + u64 offset; + u32 len; + u32 advice; +}; + struct io_async_connect { struct sockaddr_storage address; }; @@ -452,6 +460,7 @@ struct io_kiocb { struct io_open open; struct io_close close; struct io_files_update files_update; + struct io_fadvise fadvise; };
struct io_async_ctx *io; @@ -667,6 +676,10 @@ static const struct io_op_def io_op_defs[] = { .needs_file = 1, .unbound_nonreg_file = 1, }, + { + /* IORING_OP_FADVISE */ + .needs_file = 1, + }, };
static void io_wq_submit_work(struct io_wq_work **workptr); @@ -2433,6 +2446,35 @@ static int io_openat(struct io_kiocb *req, struct io_kiocb **nxt, return 0; }
+static int io_fadvise_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe) +{ + if (sqe->ioprio || sqe->buf_index || sqe->addr) + return -EINVAL; + + req->fadvise.offset = READ_ONCE(sqe->off); + req->fadvise.len = READ_ONCE(sqe->len); + req->fadvise.advice = READ_ONCE(sqe->fadvise_advice); + return 0; +} + +static int io_fadvise(struct io_kiocb *req, struct io_kiocb **nxt, + bool force_nonblock) +{ + struct io_fadvise *fa = &req->fadvise; + int ret; + + /* DONTNEED may block, others _should_ not */ + if (fa->advice == POSIX_FADV_DONTNEED && force_nonblock) + return -EAGAIN; + + ret = vfs_fadvise(req->file, fa->offset, fa->len, fa->advice); + if (ret < 0) + req_set_fail_links(req); + io_cqring_add_event(req, ret); + io_put_req_find_next(req, nxt); + return 0; +} + static int io_statx_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe) { unsigned lookup_flags; @@ -3718,6 +3760,9 @@ static int io_req_defer_prep(struct io_kiocb *req, case IORING_OP_STATX: ret = io_statx_prep(req, sqe); break; + case IORING_OP_FADVISE: + ret = io_fadvise_prep(req, sqe); + break; default: printk_once(KERN_WARNING "io_uring: unhandled opcode %d\n", req->opcode); @@ -3914,6 +3959,14 @@ static int io_issue_sqe(struct io_kiocb *req, const struct io_uring_sqe *sqe, } ret = io_statx(req, nxt, force_nonblock); break; + case IORING_OP_FADVISE: + if (sqe) { + ret = io_fadvise_prep(req, sqe); + if (ret) + break; + } + ret = io_fadvise(req, nxt, force_nonblock); + break; default: ret = -EINVAL; break; diff --git a/include/uapi/linux/io_uring.h b/include/uapi/linux/io_uring.h index 1f96136eb6ee..f86d1c776078 100644 --- a/include/uapi/linux/io_uring.h +++ b/include/uapi/linux/io_uring.h @@ -36,6 +36,7 @@ struct io_uring_sqe { __u32 cancel_flags; __u32 open_flags; __u32 statx_flags; + __u32 fadvise_advice; }; __u64 user_data; /* data to be passed back at completion time */ union { @@ -86,6 +87,7 @@ enum { IORING_OP_STATX, IORING_OP_READ, IORING_OP_WRITE, + IORING_OP_FADVISE,
/* this goes last, obviously */ IORING_OP_LAST,
From: Jens Axboe axboe@kernel.dk
mainline inclusion from mainline-5.6-rc1 commit db08ca25253d56f1f76eb4b3fe32a7ac1fbab741 category: feature bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27 CVE: NA ---------------------------
This is in preparation for enabling this functionality through io_uring. Add a helper that just exports what sys_madvise() does, and have the system call use it.
No functional changes in this patch.
Reviewed-by: Pavel Begunkov asml.silence@gmail.com Signed-off-by: Jens Axboe axboe@kernel.dk
Conflicts: include/linux/mm.h
Signed-off-by: yangerkun yangerkun@huawei.com Reviewed-by: zhangyi (F) yi.zhang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- include/linux/mm.h | 1 + mm/madvise.c | 7 ++++++- 2 files changed, 7 insertions(+), 1 deletion(-)
diff --git a/include/linux/mm.h b/include/linux/mm.h index 61734ef3c184..c0e5a3323036 100644 --- a/include/linux/mm.h +++ b/include/linux/mm.h @@ -2406,6 +2406,7 @@ extern unsigned long do_mmap(struct file *file, unsigned long addr, struct list_head *uf); extern int do_munmap(struct mm_struct *, unsigned long, size_t, struct list_head *uf); +extern int do_madvise(unsigned long start, size_t len_in, int behavior);
static inline unsigned long do_mmap_pgoff(struct file *file, unsigned long addr, diff --git a/mm/madvise.c b/mm/madvise.c index 1369e6d062bc..1317267807b1 100644 --- a/mm/madvise.c +++ b/mm/madvise.c @@ -808,7 +808,7 @@ madvise_behavior_valid(int behavior) * -EBADF - map exists, but area maps something that isn't a file. * -EAGAIN - a kernel resource was temporarily unavailable. */ -SYSCALL_DEFINE3(madvise, unsigned long, start, size_t, len_in, int, behavior) +int do_madvise(unsigned long start, size_t len_in, int behavior) { unsigned long end, tmp; struct vm_area_struct *vma, *prev; @@ -903,3 +903,8 @@ SYSCALL_DEFINE3(madvise, unsigned long, start, size_t, len_in, int, behavior)
return error; } + +SYSCALL_DEFINE3(madvise, unsigned long, start, size_t, len_in, int, behavior) +{ + return do_madvise(start, len_in, behavior); +}
From: Jens Axboe axboe@kernel.dk
mainline inclusion from mainline-5.6-rc1 commit c1ca757bd6f4632c510714631ddcc2d13030fe1e category: feature bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27 CVE: NA ---------------------------
This adds support for doing madvise(2) through io_uring. We assume that any operation can block, and hence punt everything async. This could be improved, but it's hard to make bulletproof. The async punt ensures it's safe.
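A usage sketch, assuming liburing's io_uring_prep_madvise() (not part of this patch). Since the kernel punts the operation async, the submit returns immediately and madvise(2)'s result shows up later in the CQE:

#include <liburing.h>
#include <sys/mman.h>

/* Ask the kernel to drop a mapping's pages; the madvise return value
 * is delivered in cqe->res when the async work completes. */
static int madvise_dontneed(struct io_uring *ring, void *addr, size_t len)
{
	struct io_uring_sqe *sqe = io_uring_get_sqe(ring);

	if (!sqe)
		return -1;
	io_uring_prep_madvise(sqe, addr, len, MADV_DONTNEED);
	return io_uring_submit(ring);
}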
Reviewed-by: Pavel Begunkov asml.silence@gmail.com Signed-off-by: Jens Axboe axboe@kernel.dk Signed-off-by: yangerkun yangerkun@huawei.com Reviewed-by: zhangyi (F) yi.zhang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- fs/io_uring.c | 59 +++++++++++++++++++++++++++++++++++ include/uapi/linux/io_uring.h | 1 + 2 files changed, 60 insertions(+)
diff --git a/fs/io_uring.c b/fs/io_uring.c index c47ab9ce390e..17bf101c270a 100644 --- a/fs/io_uring.c +++ b/fs/io_uring.c @@ -408,6 +408,13 @@ struct io_fadvise { u32 advice; };
+struct io_madvise { + struct file *file; + u64 addr; + u32 len; + u32 advice; +}; + struct io_async_connect { struct sockaddr_storage address; }; @@ -461,6 +468,7 @@ struct io_kiocb { struct io_close close; struct io_files_update files_update; struct io_fadvise fadvise; + struct io_madvise madvise; };
struct io_async_ctx *io; @@ -680,6 +688,10 @@ static const struct io_op_def io_op_defs[] = { /* IORING_OP_FADVISE */ .needs_file = 1, }, + { + /* IORING_OP_MADVISE */ + .needs_mm = 1, + }, };
static void io_wq_submit_work(struct io_wq_work **workptr); @@ -2446,6 +2458,42 @@ static int io_openat(struct io_kiocb *req, struct io_kiocb **nxt, return 0; }
+static int io_madvise_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe) +{ +#if defined(CONFIG_ADVISE_SYSCALLS) && defined(CONFIG_MMU) + if (sqe->ioprio || sqe->buf_index || sqe->off) + return -EINVAL; + + req->madvise.addr = READ_ONCE(sqe->addr); + req->madvise.len = READ_ONCE(sqe->len); + req->madvise.advice = READ_ONCE(sqe->fadvise_advice); + return 0; +#else + return -EOPNOTSUPP; +#endif +} + +static int io_madvise(struct io_kiocb *req, struct io_kiocb **nxt, + bool force_nonblock) +{ +#if defined(CONFIG_ADVISE_SYSCALLS) && defined(CONFIG_MMU) + struct io_madvise *ma = &req->madvise; + int ret; + + if (force_nonblock) + return -EAGAIN; + + ret = do_madvise(ma->addr, ma->len, ma->advice); + if (ret < 0) + req_set_fail_links(req); + io_cqring_add_event(req, ret); + io_put_req_find_next(req, nxt); + return 0; +#else + return -EOPNOTSUPP; +#endif +} + static int io_fadvise_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe) { if (sqe->ioprio || sqe->buf_index || sqe->addr) @@ -3763,6 +3811,9 @@ static int io_req_defer_prep(struct io_kiocb *req, case IORING_OP_FADVISE: ret = io_fadvise_prep(req, sqe); break; + case IORING_OP_MADVISE: + ret = io_madvise_prep(req, sqe); + break; default: printk_once(KERN_WARNING "io_uring: unhandled opcode %d\n", req->opcode); @@ -3967,6 +4018,14 @@ static int io_issue_sqe(struct io_kiocb *req, const struct io_uring_sqe *sqe, } ret = io_fadvise(req, nxt, force_nonblock); break; + case IORING_OP_MADVISE: + if (sqe) { + ret = io_madvise_prep(req, sqe); + if (ret) + break; + } + ret = io_madvise(req, nxt, force_nonblock); + break; default: ret = -EINVAL; break; diff --git a/include/uapi/linux/io_uring.h b/include/uapi/linux/io_uring.h index f86d1c776078..8ad3cece5440 100644 --- a/include/uapi/linux/io_uring.h +++ b/include/uapi/linux/io_uring.h @@ -88,6 +88,7 @@ enum { IORING_OP_READ, IORING_OP_WRITE, IORING_OP_FADVISE, + IORING_OP_MADVISE,
/* this goes last, obviously */ IORING_OP_LAST,
From: Pavel Begunkov asml.silence@gmail.com
mainline inclusion from mainline-5.6-rc1 commit 4e5ef02317b12e2ed3d604281ffb6b75261f7612 category: feature bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27 CVE: NA ---------------------------
Add percpu_ref_tryget_many(), which works the same way as percpu_ref_tryget(), but grabs specified number of refs.
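A sketch of the batched-reference pattern this enables; my_ctx and submit_batch() are illustrative names, only the percpu_ref calls are real kernel API:

#include <linux/percpu-refcount.h>

struct my_ctx {
	struct percpu_ref refs;
};

static int submit_batch(struct my_ctx *ctx, unsigned int nr)
{
	unsigned int done = 0;

	/* one ref per request, taken in a single percpu/atomic op */
	if (!percpu_ref_tryget_many(&ctx->refs, nr))
		return -ENXIO;	/* object is being torn down */

	/* ... submit up to nr requests, counting successes in done ... */

	if (done != nr)
		percpu_ref_put_many(&ctx->refs, nr - done);
	return done;
}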
Signed-off-by: Pavel Begunkov asml.silence@gmail.com Acked-by: Tejun Heo tj@kernel.org Acked-by: Dennis Zhou dennis@kernel.org Cc: Christoph Lameter cl@linux.com Signed-off-by: Jens Axboe axboe@kernel.dk Signed-off-by: yangerkun yangerkun@huawei.com Reviewed-by: zhangyi (F) yi.zhang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- include/linux/percpu-refcount.h | 26 +++++++++++++++++++++----- 1 file changed, 21 insertions(+), 5 deletions(-)
diff --git a/include/linux/percpu-refcount.h b/include/linux/percpu-refcount.h index 0f0240af8520..a01f8b4ebcfe 100644 --- a/include/linux/percpu-refcount.h +++ b/include/linux/percpu-refcount.h @@ -209,15 +209,17 @@ static inline void percpu_ref_get(struct percpu_ref *ref) }
/** - * percpu_ref_tryget - try to increment a percpu refcount + * percpu_ref_tryget_many - try to increment a percpu refcount * @ref: percpu_ref to try-get + * @nr: number of references to get * - * Increment a percpu refcount unless its count already reached zero. + * Increment a percpu refcount by @nr unless its count already reached zero. * Returns %true on success; %false on failure. * * This function is safe to call as long as @ref is between init and exit. */ -static inline bool percpu_ref_tryget(struct percpu_ref *ref) +static inline bool percpu_ref_tryget_many(struct percpu_ref *ref, + unsigned long nr) { unsigned long __percpu *percpu_count; bool ret; @@ -225,10 +227,10 @@ static inline bool percpu_ref_tryget(struct percpu_ref *ref) rcu_read_lock_sched();
if (__ref_is_percpu(ref, &percpu_count)) { - this_cpu_inc(*percpu_count); + this_cpu_add(*percpu_count, nr); ret = true; } else { - ret = atomic_long_inc_not_zero(&ref->count); + ret = atomic_long_add_unless(&ref->count, nr, 0); }
rcu_read_unlock_sched(); @@ -236,6 +238,20 @@ static inline bool percpu_ref_tryget(struct percpu_ref *ref) return ret; }
+/** + * percpu_ref_tryget - try to increment a percpu refcount + * @ref: percpu_ref to try-get + * + * Increment a percpu refcount unless its count already reached zero. + * Returns %true on success; %false on failure. + * + * This function is safe to call as long as @ref is between init and exit. + */ +static inline bool percpu_ref_tryget(struct percpu_ref *ref) +{ + return percpu_ref_tryget_many(ref, 1); +} + /** * percpu_ref_tryget_live - try to increment a live percpu refcount * @ref: percpu_ref to try-get
From: Jens Axboe axboe@kernel.dk
mainline inclusion from mainline-5.6-rc1 commit 39220e8d4a2aaab045ea03cc16d737e85d0817bf category: feature bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27 CVE: NA ---------------------------
Also make it available outside of epoll, along with the helper that decides if we need to copy the passed-in epoll_event.
Signed-off-by: Jens Axboe axboe@kernel.dk Conflicts: fs/eventpoll.c [conflicts with get_file(tf.file); in commit 492a9215c4e6 ("epoll: Keep a reference on files added to the check list")] Signed-off-by: yangerkun yangerkun@huawei.com Reviewed-by: zhangyi (F) yi.zhang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- fs/eventpoll.c | 46 ++++++++++++++++++++++++++++----------- include/linux/eventpoll.h | 9 ++++++++ 2 files changed, 42 insertions(+), 13 deletions(-)
diff --git a/fs/eventpoll.c b/fs/eventpoll.c index cfe8dbf8199d..d46007154250 100644 --- a/fs/eventpoll.c +++ b/fs/eventpoll.c @@ -356,12 +356,6 @@ static inline struct epitem *ep_item_from_epqueue(poll_table *p) return container_of(p, struct ep_pqueue, pt)->epi; }
-/* Tells if the epoll_ctl(2) operation needs an event copy from userspace */ -static inline int ep_op_has_event(int op) -{ - return op != EPOLL_CTL_DEL; -} - /* Initialize the poll safe wake up structure */ static void ep_nested_calls_init(struct nested_calls *ncalls) { @@ -1991,7 +1985,20 @@ SYSCALL_DEFINE1(epoll_create, int, size) return do_epoll_create(0); }
-static int do_epoll_ctl(int epfd, int op, int fd, struct epoll_event *epds) +static inline int epoll_mutex_lock(struct mutex *mutex, int depth, + bool nonblock) +{ + if (!nonblock) { + mutex_lock_nested(mutex, depth); + return 0; + } + if (mutex_trylock(mutex)) + return 0; + return -EAGAIN; +} + +int do_epoll_ctl(int epfd, int op, int fd, struct epoll_event *epds, + bool nonblock) { int error; int full_check = 0; @@ -2062,14 +2069,18 @@ static int do_epoll_ctl(int epfd, int op, int fd, struct epoll_event *epds) * deep wakeup paths from forming in parallel through multiple * EPOLL_CTL_ADD operations. */ - mutex_lock_nested(&ep->mtx, 0); + error = epoll_mutex_lock(&ep->mtx, 0, nonblock); + if (error) + goto error_tgt_fput; if (op == EPOLL_CTL_ADD) { if (!list_empty(&f.file->f_ep_links) || ep->gen == loop_check_gen || is_file_epoll(tf.file)) { - full_check = 1; mutex_unlock(&ep->mtx); - mutex_lock(&epmutex); + error = epoll_mutex_lock(&epmutex, 0, nonblock); + if (error) + goto error_tgt_fput; + full_check = 1; if (is_file_epoll(tf.file)) { error = -ELOOP; if (ep_loop_check(ep, tf.file) != 0) @@ -2079,10 +2090,19 @@ static int do_epoll_ctl(int epfd, int op, int fd, struct epoll_event *epds) list_add(&tf.file->f_tfile_llink, &tfile_check_list); } - mutex_lock_nested(&ep->mtx, 0); + error = epoll_mutex_lock(&ep->mtx, 0, nonblock); + if (error) { +out_del: + list_del(&tf.file->f_tfile_llink); + goto error_tgt_fput; + } if (is_file_epoll(tf.file)) { tep = tf.file->private_data; - mutex_lock_nested(&tep->mtx, 1); + error = epoll_mutex_lock(&tep->mtx, 1, nonblock); + if (error) { + mutex_unlock(&ep->mtx); + goto out_del; + } } } } @@ -2152,7 +2172,7 @@ SYSCALL_DEFINE4(epoll_ctl, int, epfd, int, op, int, fd, copy_from_user(&epds, event, sizeof(struct epoll_event))) return -EFAULT;
- return do_epoll_ctl(epfd, op, fd, &epds); + return do_epoll_ctl(epfd, op, fd, &epds, false); }
/* diff --git a/include/linux/eventpoll.h b/include/linux/eventpoll.h index 2f14ac73d01d..48dedbafe5fa 100644 --- a/include/linux/eventpoll.h +++ b/include/linux/eventpoll.h @@ -66,6 +66,15 @@ static inline void eventpoll_release(struct file *file) eventpoll_release_file(file); }
+int do_epoll_ctl(int epfd, int op, int fd, struct epoll_event *epds, + bool nonblock); + +/* Tells if the epoll_ctl(2) operation needs an event copy from userspace */ +static inline int ep_op_has_event(int op) +{ + return op != EPOLL_CTL_DEL; +} + #else
static inline void eventpoll_init_file(struct file *file) {}
From: Jens Axboe axboe@kernel.dk
mainline inclusion from mainline-5.6-rc1 commit 3e4827b05d2ac2d377ed136a52829ec46787bf4b category: feature bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27 CVE: NA ---------------------------
This adds IORING_OP_EPOLL_CTL, which can perform the same work as the epoll_ctl(2) system call.
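A userspace sketch, assuming liburing's io_uring_prep_epoll_ctl() helper (not part of this patch):

#include <liburing.h>
#include <sys/epoll.h>

/* Add fd to epfd through the ring rather than via epoll_ctl(2). The
 * kernel copies the event during the submit syscall, so a stack-local
 * event is fine for a plain (non-SQPOLL) submission. */
static int epoll_add_via_ring(struct io_uring *ring, int epfd, int fd)
{
	struct io_uring_sqe *sqe = io_uring_get_sqe(ring);
	struct epoll_event ev = {
		.events = EPOLLIN,
		.data.fd = fd,
	};

	if (!sqe)
		return -1;
	io_uring_prep_epoll_ctl(sqe, epfd, fd, EPOLL_CTL_ADD, &ev);
	return io_uring_submit(ring);
}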
Signed-off-by: Jens Axboe axboe@kernel.dk
Conflicts: fs/io_uring.c include/uapi/linux/io_uring.h [commit cebdb98617ae("io_uring: add support for IORING_OP_OPENAT2") is not applied] Signed-off-by: yangerkun yangerkun@huawei.com Reviewed-by: zhangyi (F) yi.zhang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- fs/io_uring.c | 71 +++++++++++++++++++++++++++++++++++ include/uapi/linux/io_uring.h | 1 + 2 files changed, 72 insertions(+)
diff --git a/fs/io_uring.c b/fs/io_uring.c index f92e1f261dea..d4e5f2ec8151 100644 --- a/fs/io_uring.c +++ b/fs/io_uring.c @@ -74,6 +74,7 @@ #include <linux/namei.h> #include <linux/fsnotify.h> #include <linux/fadvise.h> +#include <linux/eventpoll.h>
#define CREATE_TRACE_POINTS #include <trace/events/io_uring.h> @@ -424,6 +425,14 @@ struct io_madvise { u32 advice; };
+struct io_epoll { + struct file *file; + int epfd; + int op; + int fd; + struct epoll_event event; +}; + struct io_async_connect { struct sockaddr_storage address; }; @@ -537,6 +546,7 @@ struct io_kiocb { struct io_files_update files_update; struct io_fadvise fadvise; struct io_madvise madvise; + struct io_epoll epoll; };
struct io_async_ctx *io; @@ -724,6 +734,10 @@ static const struct io_op_def io_op_defs[] = { .needs_file = 1, .unbound_nonreg_file = 1, }, + [IORING_OP_EPOLL_CTL] = { + .unbound_nonreg_file = 1, + .file_table = 1, + }, };
static void io_wq_submit_work(struct io_wq_work **workptr); @@ -2563,6 +2577,52 @@ static int io_openat(struct io_kiocb *req, struct io_kiocb **nxt, return 0; }
+static int io_epoll_ctl_prep(struct io_kiocb *req, + const struct io_uring_sqe *sqe) +{ +#if defined(CONFIG_EPOLL) + if (sqe->ioprio || sqe->buf_index) + return -EINVAL; + + req->epoll.epfd = READ_ONCE(sqe->fd); + req->epoll.op = READ_ONCE(sqe->len); + req->epoll.fd = READ_ONCE(sqe->off); + + if (ep_op_has_event(req->epoll.op)) { + struct epoll_event __user *ev; + + ev = u64_to_user_ptr(READ_ONCE(sqe->addr)); + if (copy_from_user(&req->epoll.event, ev, sizeof(*ev))) + return -EFAULT; + } + + return 0; +#else + return -EOPNOTSUPP; +#endif +} + +static int io_epoll_ctl(struct io_kiocb *req, struct io_kiocb **nxt, + bool force_nonblock) +{ +#if defined(CONFIG_EPOLL) + struct io_epoll *ie = &req->epoll; + int ret; + + ret = do_epoll_ctl(ie->epfd, ie->op, ie->fd, &ie->event, force_nonblock); + if (force_nonblock && ret == -EAGAIN) + return -EAGAIN; + + if (ret < 0) + req_set_fail_links(req); + io_cqring_add_event(req, ret); + io_put_req_find_next(req, nxt); + return 0; +#else + return -EOPNOTSUPP; +#endif +} + static int io_madvise_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe) { #if defined(CONFIG_ADVISE_SYSCALLS) && defined(CONFIG_MMU) @@ -4024,6 +4084,9 @@ static int io_req_defer_prep(struct io_kiocb *req, case IORING_OP_MADVISE: ret = io_madvise_prep(req, sqe); break; + case IORING_OP_EPOLL_CTL: + ret = io_epoll_ctl_prep(req, sqe); + break; default: printk_once(KERN_WARNING "io_uring: unhandled opcode %d\n", req->opcode); @@ -4244,6 +4307,14 @@ static int io_issue_sqe(struct io_kiocb *req, const struct io_uring_sqe *sqe, } ret = io_madvise(req, nxt, force_nonblock); break; + case IORING_OP_EPOLL_CTL: + if (sqe) { + ret = io_epoll_ctl_prep(req, sqe); + if (ret) + break; + } + ret = io_epoll_ctl(req, nxt, force_nonblock); + break; default: ret = -EINVAL; break; diff --git a/include/uapi/linux/io_uring.h b/include/uapi/linux/io_uring.h index ad96791b34cf..90fed30a38b7 100644 --- a/include/uapi/linux/io_uring.h +++ b/include/uapi/linux/io_uring.h @@ -111,6 +111,7 @@ enum { IORING_OP_MADVISE, IORING_OP_SEND, IORING_OP_RECV, + IORING_OP_EPOLL_CTL,
/* this goes last, obviously */ IORING_OP_LAST,
From: Jens Axboe axboe@kernel.dk
mainline inclusion from mainline-5.6-rc1 commit 87ce955b24c9940cb2ca7e5173fcf175578d9fe9 category: feature bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27 CVE: NA ---------------------------
It can be hard to know exactly what is registered with the ring. Especially for credentials, it'd be handy to be able to see which ones are registered, what personalities they have, and what the ID of each of them is.
This adds support for showing information registered in the ring from the fdinfo of the io_uring fd. Here's an example from a test case that registers 4 files (two of them sparse), 4 buffers, and 2 personalities:
pos:	0
flags:	02000002
mnt_id:	14
UserFiles:	4
    0: file-no-1
    1: file-no-2
    2: <none>
    3: <none>
UserBufs:	4
    0: 0x563817c46000/128
    1: 0x563817c47000/256
    2: 0x563817c48000/512
    3: 0x563817c49000/1024
Personalities:
    1
	Uid:	0	0	0	0
	Gid:	0	0	0	0
	Groups:	0
	CapEff:	0000003fffffffff
    2
	Uid:	0	0	0	0
	Gid:	0	0	0	0
	Groups:	0
	CapEff:	0000003fffffffff
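Reading this back programmatically is just a procfs read; a small sketch (the fdinfo path format is standard procfs, nothing io_uring specific):

#include <stdio.h>

static void dump_ring_fdinfo(int ring_fd)
{
	char path[64], line[256];
	FILE *f;

	snprintf(path, sizeof(path), "/proc/self/fdinfo/%d", ring_fd);
	f = fopen(path, "r");
	if (!f)
		return;
	while (fgets(line, sizeof(line), f))
		fputs(line, stdout);	/* UserFiles/UserBufs/Personalities */
	fclose(f);
}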
Suggested-by: Jann Horn jannh@google.com Signed-off-by: Jens Axboe axboe@kernel.dk Signed-off-by: yangerkun yangerkun@huawei.com Reviewed-by: zhangyi (F) yi.zhang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- fs/io_uring.c | 75 +++++++++++++++++++++++++++++++++++++++++++++++++++ 1 file changed, 75 insertions(+)
diff --git a/fs/io_uring.c b/fs/io_uring.c index d4e5f2ec8151..b60e528741d5 100644 --- a/fs/io_uring.c +++ b/fs/io_uring.c @@ -6443,6 +6443,80 @@ SYSCALL_DEFINE6(io_uring_enter, unsigned int, fd, u32, to_submit, return submitted ? submitted : ret; }
+static int io_uring_show_cred(int id, void *p, void *data) +{ + const struct cred *cred = p; + struct seq_file *m = data; + struct user_namespace *uns = seq_user_ns(m); + struct group_info *gi; + kernel_cap_t cap; + unsigned __capi; + int g; + + seq_printf(m, "%5d\n", id); + seq_put_decimal_ull(m, "\tUid:\t", from_kuid_munged(uns, cred->uid)); + seq_put_decimal_ull(m, "\t\t", from_kuid_munged(uns, cred->euid)); + seq_put_decimal_ull(m, "\t\t", from_kuid_munged(uns, cred->suid)); + seq_put_decimal_ull(m, "\t\t", from_kuid_munged(uns, cred->fsuid)); + seq_put_decimal_ull(m, "\n\tGid:\t", from_kgid_munged(uns, cred->gid)); + seq_put_decimal_ull(m, "\t\t", from_kgid_munged(uns, cred->egid)); + seq_put_decimal_ull(m, "\t\t", from_kgid_munged(uns, cred->sgid)); + seq_put_decimal_ull(m, "\t\t", from_kgid_munged(uns, cred->fsgid)); + seq_puts(m, "\n\tGroups:\t"); + gi = cred->group_info; + for (g = 0; g < gi->ngroups; g++) { + seq_put_decimal_ull(m, g ? " " : "", + from_kgid_munged(uns, gi->gid[g])); + } + seq_puts(m, "\n\tCapEff:\t"); + cap = cred->cap_effective; + CAP_FOR_EACH_U32(__capi) + seq_put_hex_ll(m, NULL, cap.cap[CAP_LAST_U32 - __capi], 8); + seq_putc(m, '\n'); + return 0; +} + +static void __io_uring_show_fdinfo(struct io_ring_ctx *ctx, struct seq_file *m) +{ + int i; + + mutex_lock(&ctx->uring_lock); + seq_printf(m, "UserFiles:\t%u\n", ctx->nr_user_files); + for (i = 0; i < ctx->nr_user_files; i++) { + struct fixed_file_table *table; + struct file *f; + + table = &ctx->file_data->table[i >> IORING_FILE_TABLE_SHIFT]; + f = table->files[i & IORING_FILE_TABLE_MASK]; + if (f) + seq_printf(m, "%5u: %s\n", i, file_dentry(f)->d_iname); + else + seq_printf(m, "%5u: <none>\n", i); + } + seq_printf(m, "UserBufs:\t%u\n", ctx->nr_user_bufs); + for (i = 0; i < ctx->nr_user_bufs; i++) { + struct io_mapped_ubuf *buf = &ctx->user_bufs[i]; + + seq_printf(m, "%5u: 0x%llx/%u\n", i, buf->ubuf, + (unsigned int) buf->len); + } + if (!idr_is_empty(&ctx->personality_idr)) { + seq_printf(m, "Personalities:\n"); + idr_for_each(&ctx->personality_idr, io_uring_show_cred, m); + } + mutex_unlock(&ctx->uring_lock); +} + +static void io_uring_show_fdinfo(struct seq_file *m, struct file *f) +{ + struct io_ring_ctx *ctx = f->private_data; + + if (percpu_ref_tryget(&ctx->refs)) { + __io_uring_show_fdinfo(ctx, m); + percpu_ref_put(&ctx->refs); + } +} + static const struct file_operations io_uring_fops = { .release = io_uring_release, .flush = io_uring_flush, @@ -6453,6 +6527,7 @@ static const struct file_operations io_uring_fops = { #endif .poll = io_uring_poll, .fasync = io_uring_fasync, + .show_fdinfo = io_uring_show_fdinfo, };
static int io_allocate_scq_urings(struct io_ring_ctx *ctx,
From: Stefan Metzmacher metze@samba.org
mainline inclusion from mainline-5.6-rc1 commit d7f62e825fd19202a0749d10fb439714c51f67d2 category: feature bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27 CVE: NA ---------------------------
With nesting of anonymous unions and structs it's hard to review layout changes. It's better to ask the compiler for these things.
Signed-off-by: Stefan Metzmacher metze@samba.org Signed-off-by: Jens Axboe axboe@kernel.dk Signed-off-by: yangerkun yangerkun@huawei.com Reviewed-by: zhangyi (F) yi.zhang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- fs/io_uring.c | 33 +++++++++++++++++++++++++++++++++ 1 file changed, 33 insertions(+)
diff --git a/fs/io_uring.c b/fs/io_uring.c index b60e528741d5..c42bf74a3537 100644 --- a/fs/io_uring.c +++ b/fs/io_uring.c @@ -6980,6 +6980,39 @@ SYSCALL_DEFINE4(io_uring_register, unsigned int, fd, unsigned int, opcode,
static int __init io_uring_init(void) { +#define __BUILD_BUG_VERIFY_ELEMENT(stype, eoffset, etype, ename) do { \ + BUILD_BUG_ON(offsetof(stype, ename) != eoffset); \ + BUILD_BUG_ON(sizeof(etype) != sizeof_field(stype, ename)); \ +} while (0) + +#define BUILD_BUG_SQE_ELEM(eoffset, etype, ename) \ + __BUILD_BUG_VERIFY_ELEMENT(struct io_uring_sqe, eoffset, etype, ename) + BUILD_BUG_ON(sizeof(struct io_uring_sqe) != 64); + BUILD_BUG_SQE_ELEM(0, __u8, opcode); + BUILD_BUG_SQE_ELEM(1, __u8, flags); + BUILD_BUG_SQE_ELEM(2, __u16, ioprio); + BUILD_BUG_SQE_ELEM(4, __s32, fd); + BUILD_BUG_SQE_ELEM(8, __u64, off); + BUILD_BUG_SQE_ELEM(8, __u64, addr2); + BUILD_BUG_SQE_ELEM(16, __u64, addr); + BUILD_BUG_SQE_ELEM(24, __u32, len); + BUILD_BUG_SQE_ELEM(28, __kernel_rwf_t, rw_flags); + BUILD_BUG_SQE_ELEM(28, /* compat */ int, rw_flags); + BUILD_BUG_SQE_ELEM(28, /* compat */ __u32, rw_flags); + BUILD_BUG_SQE_ELEM(28, __u32, fsync_flags); + BUILD_BUG_SQE_ELEM(28, __u16, poll_events); + BUILD_BUG_SQE_ELEM(28, __u32, sync_range_flags); + BUILD_BUG_SQE_ELEM(28, __u32, msg_flags); + BUILD_BUG_SQE_ELEM(28, __u32, timeout_flags); + BUILD_BUG_SQE_ELEM(28, __u32, accept_flags); + BUILD_BUG_SQE_ELEM(28, __u32, cancel_flags); + BUILD_BUG_SQE_ELEM(28, __u32, open_flags); + BUILD_BUG_SQE_ELEM(28, __u32, statx_flags); + BUILD_BUG_SQE_ELEM(28, __u32, fadvise_advice); + BUILD_BUG_SQE_ELEM(32, __u64, user_data); + BUILD_BUG_SQE_ELEM(40, __u16, buf_index); + BUILD_BUG_SQE_ELEM(42, __u16, personality); + BUILD_BUG_ON(ARRAY_SIZE(io_op_defs) != IORING_OP_LAST); req_cachep = KMEM_CACHE(io_kiocb, SLAB_HWCACHE_ALIGN | SLAB_PANIC); return 0;
From: Jens Axboe axboe@kernel.dk
mainline inclusion from mainline-5.6-rc1 commit f0b493e6b9a8959356983f57112229e69c2f7b8c category: feature bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27 CVE: NA ---------------------------
If we have nested or circular eventfd wakeups, then we can deadlock if we run them inline from our poll waitqueue wakeup handler. It's also possible to have very long chains of notifications, to the extent where we could risk blowing the stack.
Check the eventfd recursion count before calling eventfd_signal(). If it's non-zero, then punt the signaling to async context. This is always safe, as it takes us out-of-line in terms of stack and locking context.
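For context, the userspace arrangement this protects: an eventfd registered for CQ notifications. A sketch, assuming liburing's io_uring_register_eventfd() (not part of this patch):

#include <liburing.h>
#include <sys/eventfd.h>
#include <unistd.h>
#include <stdint.h>

static int wait_for_cqes(struct io_uring *ring)
{
	uint64_t n;
	int efd = eventfd(0, 0);

	if (efd < 0 || io_uring_register_eventfd(ring, efd) < 0)
		return -1;
	/* the kernel signals efd as CQEs are posted; read() drains the
	 * counter and reports how many signals fired */
	if (read(efd, &n, sizeof(n)) != sizeof(n))
		return -1;
	return (int)n;
}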
Cc: stable@vger.kernel.org # 5.1+ Signed-off-by: Jens Axboe axboe@kernel.dk Signed-off-by: yangerkun yangerkun@huawei.com Reviewed-by: zhangyi (F) yi.zhang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- fs/io_uring.c | 37 ++++++++++++++++++++++++++++++------- 1 file changed, 30 insertions(+), 7 deletions(-)
diff --git a/fs/io_uring.c b/fs/io_uring.c index c42bf74a3537..bb569a31882d 100644 --- a/fs/io_uring.c +++ b/fs/io_uring.c @@ -1015,21 +1015,28 @@ static struct io_uring_cqe *io_get_cqring(struct io_ring_ctx *ctx)
static inline bool io_should_trigger_evfd(struct io_ring_ctx *ctx) { + if (!ctx->cq_ev_fd) + return false; if (!ctx->eventfd_async) return true; return io_wq_current_is_worker() || in_interrupt(); }
-static void io_cqring_ev_posted(struct io_ring_ctx *ctx) +static void __io_cqring_ev_posted(struct io_ring_ctx *ctx, bool trigger_ev) { if (waitqueue_active(&ctx->wait)) wake_up(&ctx->wait); if (waitqueue_active(&ctx->sqo_wait)) wake_up(&ctx->sqo_wait); - if (ctx->cq_ev_fd && io_should_trigger_evfd(ctx)) + if (trigger_ev) eventfd_signal(ctx->cq_ev_fd, 1); }
+static void io_cqring_ev_posted(struct io_ring_ctx *ctx) +{ + __io_cqring_ev_posted(ctx, io_should_trigger_evfd(ctx)); +} + /* Returns true if there are no backlogged entries after the flush */ static bool io_cqring_overflow_flush(struct io_ring_ctx *ctx, bool force) { @@ -3513,6 +3520,14 @@ static void io_poll_flush(struct io_wq_work **workptr) __io_poll_flush(req->ctx, nodes); }
+static void io_poll_trigger_evfd(struct io_wq_work **workptr) +{ + struct io_kiocb *req = container_of(*workptr, struct io_kiocb, work); + + eventfd_signal(req->ctx->cq_ev_fd, 1); + io_put_req(req); +} + static int io_poll_wake(struct wait_queue_entry *wait, unsigned mode, int sync, void *key) { @@ -3538,14 +3553,22 @@ static int io_poll_wake(struct wait_queue_entry *wait, unsigned mode, int sync,
if (llist_empty(&ctx->poll_llist) && spin_trylock_irqsave(&ctx->completion_lock, flags)) { + bool trigger_ev; + hash_del(&req->hash_node); io_poll_complete(req, mask, 0); - req->flags |= REQ_F_COMP_LOCKED; - io_put_req(req); - spin_unlock_irqrestore(&ctx->completion_lock, flags);
- io_cqring_ev_posted(ctx); - req = NULL; + trigger_ev = io_should_trigger_evfd(ctx); + if (trigger_ev && eventfd_signal_count()) { + trigger_ev = false; + req->work.func = io_poll_trigger_evfd; + } else { + req->flags |= REQ_F_COMP_LOCKED; + io_put_req(req); + req = NULL; + } + spin_unlock_irqrestore(&ctx->completion_lock, flags); + __io_cqring_ev_posted(ctx, trigger_ev); } else { req->result = mask; req->llist_node.next = NULL;
From: Jens Axboe axboe@kernel.dk
mainline inclusion from mainline-5.6-rc1 commit 0b7b21e42ba2d6ac9595a4358a9354249605a3af category: feature bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27 CVE: NA ---------------------------
Don't use the recvmsg/sendmsg helpers; use the same helpers that the recv(2) and send(2) system calls use.
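For reference, the opcodes these internal helpers back, seen from userspace (assuming liburing's prep helpers, not part of this patch):

#include <liburing.h>
#include <stddef.h>

/* Plain send on a connected socket: no msghdr, no iovec. */
static int queue_send(struct io_uring *ring, int sockfd,
		      const void *buf, size_t len)
{
	struct io_uring_sqe *sqe = io_uring_get_sqe(ring);

	if (!sqe)
		return -1;
	io_uring_prep_send(sqe, sockfd, buf, len, 0);
	return io_uring_submit(ring);
}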
Reported-by: 李通洲 carter.li@eoitek.com Signed-off-by: Jens Axboe axboe@kernel.dk Signed-off-by: yangerkun yangerkun@huawei.com Reviewed-by: zhangyi (F) yi.zhang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- fs/io_uring.c | 6 ++++-- 1 file changed, 4 insertions(+), 2 deletions(-)
diff --git a/fs/io_uring.c b/fs/io_uring.c index bb569a31882d..31359a6eab42 100644 --- a/fs/io_uring.c +++ b/fs/io_uring.c @@ -3042,7 +3042,8 @@ static int io_send(struct io_kiocb *req, struct io_kiocb **nxt, else if (force_nonblock) flags |= MSG_DONTWAIT;
- ret = __sys_sendmsg_sock(sock, &msg, flags); + msg.msg_flags = flags; + ret = sock_sendmsg(sock, &msg); if (force_nonblock && ret == -EAGAIN) return -EAGAIN; if (ret == -ERESTARTSYS) @@ -3068,6 +3069,7 @@ static int io_recvmsg_prep(struct io_kiocb *req,
sr->msg_flags = READ_ONCE(sqe->msg_flags); sr->msg = u64_to_user_ptr(READ_ONCE(sqe->addr)); + sr->len = READ_ONCE(sqe->len);
if (!io || req->opcode == IORING_OP_RECV) return 0; @@ -3186,7 +3188,7 @@ static int io_recv(struct io_kiocb *req, struct io_kiocb **nxt, else if (force_nonblock) flags |= MSG_DONTWAIT;
- ret = __sys_recvmsg_sock(sock, &msg, NULL, NULL, flags); + ret = sock_recvmsg(sock, &msg, flags); if (force_nonblock && ret == -EAGAIN) return -EAGAIN; if (ret == -ERESTARTSYS)
From: Jens Axboe axboe@kernel.dk
mainline inclusion from mainline-5.6-rc1 commit 5d204bcfa09330972ad3428a8f81c23f371d3e6d category: feature bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27 CVE: NA ---------------------------
If we have a read/write that is deferred, we have already set up the async IO context for that request, and mapped it. When we later try to execute the request and get -EAGAIN, we don't want to attempt to re-map it. If we do, we end up with garbage in the iovec, which typically leads to an -EFAULT or -EINVAL completion.
Cc: stable@vger.kernel.org # 5.5 Reported-by: Dan Melnic dmm@fb.com Signed-off-by: Jens Axboe axboe@kernel.dk Signed-off-by: yangerkun yangerkun@huawei.com Reviewed-by: zhangyi (F) yi.zhang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- fs/io_uring.c | 8 +++++--- 1 file changed, 5 insertions(+), 3 deletions(-)
diff --git a/fs/io_uring.c b/fs/io_uring.c index 31359a6eab42..63261cd05831 100644 --- a/fs/io_uring.c +++ b/fs/io_uring.c @@ -2166,10 +2166,12 @@ static int io_setup_async_rw(struct io_kiocb *req, ssize_t io_size, { if (!io_op_defs[req->opcode].async_ctx) return 0; - if (!req->io && io_alloc_async_ctx(req)) - return -ENOMEM; + if (!req->io) { + if (io_alloc_async_ctx(req)) + return -ENOMEM;
- io_req_map_rw(req, io_size, iovec, fast_iov, iter); + io_req_map_rw(req, io_size, iovec, fast_iov, iter); + } req->work.func = io_rw_async; return 0; }
From: Pavel Begunkov asml.silence@gmail.com
mainline inclusion from mainline-5.6-rc1 commit 9250f9ee194dc3dcee28a42a1533fa2cc0edd215 category: feature bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27 CVE: NA ---------------------------
A request won't ever get into io_prep_rw() when req->file hasn't been set in io_req_set_file(), hence remove the check.
Signed-off-by: Pavel Begunkov asml.silence@gmail.com Signed-off-by: Jens Axboe axboe@kernel.dk Signed-off-by: yangerkun yangerkun@huawei.com Reviewed-by: zhangyi (F) yi.zhang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- fs/io_uring.c | 3 --- 1 file changed, 3 deletions(-)
diff --git a/fs/io_uring.c b/fs/io_uring.c index 63261cd05831..e21e647ae30a 100644 --- a/fs/io_uring.c +++ b/fs/io_uring.c @@ -1857,9 +1857,6 @@ static int io_prep_rw(struct io_kiocb *req, const struct io_uring_sqe *sqe, unsigned ioprio; int ret;
- if (!req->file) - return -EBADF; - if (S_ISREG(file_inode(req->file)->i_mode)) req->flags |= REQ_F_ISREG;
From: Jens Axboe axboe@kernel.dk
mainline inclusion from mainline-5.6-rc1 commit 1a417f4e618e05fba29ba222f1e8555c302376ce category: feature bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27 CVE: NA ---------------------------
We punt close to async for the final fput(), but even in that case we log the completion before the fput happens. We rely on the request not having a files table assigned to detect what the final async close should do. However, if we punt the async queue to __io_queue_sqe(), we'll get ->files assigned, and this makes io_close_finish() think it should both close the filp again (which does no harm) AND log a new CQE event for this request. This causes duplicate CQEs.
Queue the request up for async manually so we don't grab files needlessly and trigger this condition.
Signed-off-by: Jens Axboe axboe@kernel.dk Signed-off-by: yangerkun yangerkun@huawei.com Reviewed-by: zhangyi (F) yi.zhang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- fs/io_uring.c | 13 ++++++++----- 1 file changed, 8 insertions(+), 5 deletions(-)
diff --git a/fs/io_uring.c b/fs/io_uring.c index e21e647ae30a..5c16c1edc40f 100644 --- a/fs/io_uring.c +++ b/fs/io_uring.c @@ -2795,16 +2795,13 @@ static void io_close_finish(struct io_wq_work **workptr) int ret;
ret = filp_close(req->close.put_file, req->work.files); - if (ret < 0) { + if (ret < 0) req_set_fail_links(req); - } io_cqring_add_event(req, ret); }
fput(req->close.put_file);
- /* we bypassed the re-issue, drop the submission reference */ - io_put_req(req); io_put_req_find_next(req, &nxt); if (nxt) io_wq_assign_next(workptr, nxt); @@ -2846,7 +2843,13 @@ static int io_close(struct io_kiocb *req, struct io_kiocb **nxt,
eagain: req->work.func = io_close_finish; - return -EAGAIN; + /* + * Do manual async queue here to avoid grabbing files - we don't + * need the files, and it'll cause io_close_finish() to close + * the file again and cause a double CQE entry for this request + */ + io_queue_async_work(req); + return 0; }
static int io_prep_sfr(struct io_kiocb *req, const struct io_uring_sqe *sqe)
From: Jens Axboe axboe@kernel.dk
mainline inclusion from mainline-5.6-rc1 commit 3e69426da2599677ebbe76e2d97a606c4797bd74 category: feature bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27 CVE: NA ---------------------------
Andres correctly points out that read-ahead can block if it needs to read in metadata (or even just through the page cache page allocations). Play it safe for now and just ensure WILLNEED is also punted to async context.
While in there, allow the file access-pattern hints (NORMAL, RANDOM, SEQUENTIAL) from non-blocking context. They don't need to start or do any IO, and we can safely do them inline.
Fixes: 4840e418c2fc ("io_uring: add IORING_OP_FADVISE") Reported-by: Andres Freund andres@anarazel.de Signed-off-by: Jens Axboe axboe@kernel.dk Signed-off-by: yangerkun yangerkun@huawei.com Reviewed-by: zhangyi (F) yi.zhang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- fs/io_uring.c | 13 ++++++++++--- 1 file changed, 10 insertions(+), 3 deletions(-)
diff --git a/fs/io_uring.c b/fs/io_uring.c index 5c16c1edc40f..b657cc629908 100644 --- a/fs/io_uring.c +++ b/fs/io_uring.c @@ -2682,9 +2682,16 @@ static int io_fadvise(struct io_kiocb *req, struct io_kiocb **nxt, struct io_fadvise *fa = &req->fadvise; int ret;
- /* DONTNEED may block, others _should_ not */ - if (fa->advice == POSIX_FADV_DONTNEED && force_nonblock) - return -EAGAIN; + if (force_nonblock) { + switch (fa->advice) { + case POSIX_FADV_NORMAL: + case POSIX_FADV_RANDOM: + case POSIX_FADV_SEQUENTIAL: + break; + default: + return -EAGAIN; + } + }
ret = vfs_fadvise(req->file, fa->offset, fa->len, fa->advice); if (ret < 0)
From: Jens Axboe axboe@kernel.dk
mainline inclusion from mainline-5.6-rc2 commit 00bcda13dcbf6bf7fa6f2a5886dd555362de8cfa category: feature bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27 CVE: NA ---------------------------
We want to use the cancel functionality for canceling based on more than just the work item itself. Instead of matching on the work address manually, allow a match handler to tell us whether we found the right work item or not.
No functional changes in this patch.
Signed-off-by: Jens Axboe axboe@kernel.dk Signed-off-by: yangerkun yangerkun@huawei.com Reviewed-by: zhangyi (F) yi.zhang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- fs/io-wq.c | 33 ++++++++++++++++++++++----------- 1 file changed, 22 insertions(+), 11 deletions(-)
diff --git a/fs/io-wq.c b/fs/io-wq.c index 0f02f35f45d0..248efd65b869 100644 --- a/fs/io-wq.c +++ b/fs/io-wq.c @@ -938,17 +938,19 @@ enum io_wq_cancel io_wq_cancel_cb(struct io_wq *wq, work_cancel_fn *cancel, return ret; }
+struct work_match { + bool (*fn)(struct io_wq_work *, void *data); + void *data; +}; + static bool io_wq_worker_cancel(struct io_worker *worker, void *data) { - struct io_wq_work *work = data; + struct work_match *match = data; unsigned long flags; bool ret = false;
- if (worker->cur_work != work) - return false; - spin_lock_irqsave(&worker->lock, flags); - if (worker->cur_work == work && + if (match->fn(worker->cur_work, match->data) && !(worker->cur_work->flags & IO_WQ_WORK_NO_CANCEL)) { send_sig(SIGINT, worker->task, 1); ret = true; @@ -959,15 +961,13 @@ static bool io_wq_worker_cancel(struct io_worker *worker, void *data) }
static enum io_wq_cancel io_wqe_cancel_work(struct io_wqe *wqe, - struct io_wq_work *cwork) + struct work_match *match) { struct io_wq_work_node *node, *prev; struct io_wq_work *work; unsigned long flags; bool found = false;
- cwork->flags |= IO_WQ_WORK_CANCEL; - /* * First check pending list, if we're lucky we can just remove it * from there. CANCEL_OK means that the work is returned as-new, @@ -977,7 +977,7 @@ static enum io_wq_cancel io_wqe_cancel_work(struct io_wqe *wqe, wq_list_for_each(node, prev, &wqe->work_list) { work = container_of(node, struct io_wq_work, list);
- if (work == cwork) { + if (match->fn(work, match->data)) { wq_node_del(&wqe->work_list, node, prev); found = true; break; @@ -998,20 +998,31 @@ static enum io_wq_cancel io_wqe_cancel_work(struct io_wqe *wqe, * completion will run normally in this case. */ rcu_read_lock(); - found = io_wq_for_each_worker(wqe, io_wq_worker_cancel, cwork); + found = io_wq_for_each_worker(wqe, io_wq_worker_cancel, match); rcu_read_unlock(); return found ? IO_WQ_CANCEL_RUNNING : IO_WQ_CANCEL_NOTFOUND; }
+static bool io_wq_work_match(struct io_wq_work *work, void *data) +{ + return work == data; +} + enum io_wq_cancel io_wq_cancel_work(struct io_wq *wq, struct io_wq_work *cwork) { + struct work_match match = { + .fn = io_wq_work_match, + .data = cwork + }; enum io_wq_cancel ret = IO_WQ_CANCEL_NOTFOUND; int node;
+ cwork->flags |= IO_WQ_WORK_CANCEL; + for_each_node(node) { struct io_wqe *wqe = wq->wqes[node];
- ret = io_wqe_cancel_work(wqe, cwork); + ret = io_wqe_cancel_work(wqe, &match); if (ret != IO_WQ_CANCEL_NOTFOUND) break; }
From: Jens Axboe axboe@kernel.dk
mainline inclusion from mainline-5.6-rc2 commit 36282881a795cbf717aca79392ae9cdf0fef59c9 category: feature bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27 CVE: NA ---------------------------
Add a helper that allows the caller to cancel work based on what mm it belongs to. This allows io_uring to cancel work from a given task or thread when it exits.
Signed-off-by: Jens Axboe axboe@kernel.dk Signed-off-by: yangerkun yangerkun@huawei.com Reviewed-by: zhangyi (F) yi.zhang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- fs/io-wq.c | 29 +++++++++++++++++++++++++++++ fs/io-wq.h | 2 ++ 2 files changed, 31 insertions(+)
diff --git a/fs/io-wq.c b/fs/io-wq.c index 248efd65b869..419845d514df 100644 --- a/fs/io-wq.c +++ b/fs/io-wq.c @@ -1030,6 +1030,35 @@ enum io_wq_cancel io_wq_cancel_work(struct io_wq *wq, struct io_wq_work *cwork) return ret; }
+static bool io_wq_pid_match(struct io_wq_work *work, void *data) +{ + pid_t pid = (pid_t) (unsigned long) data; + + if (work) + return work->task_pid == pid; + return false; +} + +enum io_wq_cancel io_wq_cancel_pid(struct io_wq *wq, pid_t pid) +{ + struct work_match match = { + .fn = io_wq_pid_match, + .data = (void *) (unsigned long) pid + }; + enum io_wq_cancel ret = IO_WQ_CANCEL_NOTFOUND; + int node; + + for_each_node(node) { + struct io_wqe *wqe = wq->wqes[node]; + + ret = io_wqe_cancel_work(wqe, &match); + if (ret != IO_WQ_CANCEL_NOTFOUND) + break; + } + + return ret; +} + struct io_wq_flush_data { struct io_wq_work work; struct completion done; diff --git a/fs/io-wq.h b/fs/io-wq.h index f152ba677d8f..ccc7d84af57d 100644 --- a/fs/io-wq.h +++ b/fs/io-wq.h @@ -76,6 +76,7 @@ struct io_wq_work { const struct cred *creds; struct fs_struct *fs; unsigned flags; + pid_t task_pid; };
#define INIT_IO_WORK(work, _func) \ @@ -109,6 +110,7 @@ void io_wq_flush(struct io_wq *wq);
void io_wq_cancel_all(struct io_wq *wq); enum io_wq_cancel io_wq_cancel_work(struct io_wq *wq, struct io_wq_work *cwork); +enum io_wq_cancel io_wq_cancel_pid(struct io_wq *wq, pid_t pid);
typedef bool (work_cancel_fn)(struct io_wq_work *, void *);
From: Jens Axboe axboe@kernel.dk
mainline inclusion from mainline-5.6-rc2 commit 6ab231448fdc5e37c15a94a4700fca11e80007f7 category: feature bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27 CVE: NA ---------------------------
Normally we cancel all work we track, but for untracked work we could leave the async worker behind until that work completes. This is totally fine, but does leave resources pending after the task is gone until that work completes.
Cancel work that this task queued up when it goes away.
Signed-off-by: Jens Axboe axboe@kernel.dk Signed-off-by: yangerkun yangerkun@huawei.com Reviewed-by: zhangyi (F) yi.zhang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- fs/io_uring.c | 9 +++++++++ 1 file changed, 9 insertions(+)
diff --git a/fs/io_uring.c b/fs/io_uring.c index 465c46f48025..a63302ba21ae 100644 --- a/fs/io_uring.c +++ b/fs/io_uring.c @@ -919,6 +919,8 @@ static inline void io_req_work_grab_env(struct io_kiocb *req, } spin_unlock(¤t->fs->lock); } + if (!req->work.task_pid) + req->work.task_pid = task_pid_vnr(current); }
static inline void io_req_work_drop_env(struct io_kiocb *req) @@ -6409,6 +6411,13 @@ static int io_uring_flush(struct file *file, void *data) struct io_ring_ctx *ctx = file->private_data;
io_uring_cancel_files(ctx, data); + + /* + * If the task is going away, cancel work it may have pending + */ + if (fatal_signal_pending(current) || (current->flags & PF_EXITING)) + io_wq_cancel_pid(ctx->io_wq, task_pid_vnr(current)); + return 0; }
From: Jens Axboe axboe@kernel.dk
mainline inclusion from mainline-5.6-rc2 commit b537916ca5107c3a8714b8ab3099c0ec205aec12 category: feature bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27 CVE: NA ---------------------------
Jonas reports that he sometimes sees -97/-22 (-EAFNOSUPPORT/-EINVAL) error returns from sendmsg if it gets punted async. This is due to not retaining the sockaddr_storage between calls. Include that in the state we copy when going async.
Cc: stable@vger.kernel.org # 5.3+ Reported-by: Jonas Bonn jonas@norrbonn.se Tested-by: Jonas Bonn jonas@norrbonn.se Signed-off-by: Jens Axboe axboe@kernel.dk Signed-off-by: yangerkun yangerkun@huawei.com Reviewed-by: zhangyi (F) yi.zhang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- fs/io_uring.c | 11 +++++------ 1 file changed, 5 insertions(+), 6 deletions(-)
diff --git a/fs/io_uring.c b/fs/io_uring.c index a63302ba21ae..cae36041e12b 100644 --- a/fs/io_uring.c +++ b/fs/io_uring.c @@ -443,6 +443,7 @@ struct io_async_msghdr { struct iovec *iov; struct sockaddr __user *uaddr; struct msghdr msg; + struct sockaddr_storage addr; };
struct io_async_rw { @@ -2978,12 +2979,11 @@ static int io_sendmsg(struct io_kiocb *req, struct io_kiocb **nxt, sock = sock_from_file(req->file, &ret); if (sock) { struct io_async_ctx io; - struct sockaddr_storage addr; unsigned flags;
if (req->io) { kmsg = &req->io->msg; - kmsg->msg.msg_name = &addr; + kmsg->msg.msg_name = &req->io->msg.addr; /* if iov is set, it's allocated already */ if (!kmsg->iov) kmsg->iov = kmsg->fast_iov; @@ -2992,7 +2992,7 @@ static int io_sendmsg(struct io_kiocb *req, struct io_kiocb **nxt, struct io_sr_msg *sr = &req->sr_msg;
kmsg = &io.msg; - kmsg->msg.msg_name = &addr; + kmsg->msg.msg_name = &io.msg.addr;
io.msg.iov = io.msg.fast_iov; ret = sendmsg_copy_msghdr(&io.msg.msg, sr->msg, @@ -3131,12 +3131,11 @@ static int io_recvmsg(struct io_kiocb *req, struct io_kiocb **nxt, sock = sock_from_file(req->file, &ret); if (sock) { struct io_async_ctx io; - struct sockaddr_storage addr; unsigned flags;
if (req->io) { kmsg = &req->io->msg; - kmsg->msg.msg_name = &addr; + kmsg->msg.msg_name = &req->io->msg.addr; /* if iov is set, it's allocated already */ if (!kmsg->iov) kmsg->iov = kmsg->fast_iov; @@ -3145,7 +3144,7 @@ static int io_recvmsg(struct io_kiocb *req, struct io_kiocb **nxt, struct io_sr_msg *sr = &req->sr_msg;
kmsg = &io.msg; - kmsg->msg.msg_name = &addr; + kmsg->msg.msg_name = &io.msg.addr;
io.msg.iov = io.msg.fast_iov; ret = recvmsg_copy_msghdr(&io.msg.msg, sr->msg,
From: Jens Axboe axboe@kernel.dk
mainline inclusion from mainline-5.6-rc2 commit 7563439adfae153b20331f1567c8b5d0e5cbd8a7 category: feature bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27 CVE: NA ---------------------------
Glauber reports a crash on init on a box he has:
RIP: 0010:__alloc_pages_nodemask+0x132/0x340
Code: 18 01 75 04 41 80 ce 80 89 e8 48 8b 54 24 08 8b 74 24 1c c1 e8 0c 48 8b 3c 24 83 e0 01 88 44 24 20 48 85 d2 0f 85 74 01 00 00 <3b> 77 08 0f 82 6b 01 00 00 48 89 7c 24 10 89 ea 48 8b 07 b9 00 02
RSP: 0018:ffffb8be4d0b7c28 EFLAGS: 00010246
RAX: 0000000000000000 RBX: 0000000000000000 RCX: 000000000000e8e8
RDX: 0000000000000000 RSI: 0000000000000002 RDI: 0000000000002080
RBP: 0000000000012cc0 R08: 0000000000000000 R09: 0000000000000002
R10: 0000000000000dc0 R11: ffff995c60400100 R12: 0000000000000000
R13: 0000000000012cc0 R14: 0000000000000001 R15: ffff995c60db00f0
FS:  00007f4d115ca900(0000) GS:ffff995c60d80000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000000000002088 CR3: 00000017cca66002 CR4: 00000000007606e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
PKRU: 55555554
Call Trace:
 alloc_slab_page+0x46/0x320
 new_slab+0x9d/0x4e0
 ___slab_alloc+0x507/0x6a0
 ? io_wq_create+0xb4/0x2a0
 __slab_alloc+0x1c/0x30
 kmem_cache_alloc_node_trace+0xa6/0x260
 io_wq_create+0xb4/0x2a0
 io_uring_setup+0x97f/0xaa0
 ? io_remove_personalities+0x30/0x30
 ? io_poll_trigger_evfd+0x30/0x30
 do_syscall_64+0x5b/0x1c0
 entry_SYSCALL_64_after_hwframe+0x44/0xa9
RIP: 0033:0x7f4d116cb1ed
which is due to the 'wqe' and 'worker' allocations being node-affine. But it isn't valid to call the node-affine allocation if the node isn't online.
Set up structures even for offline nodes, as usual, but skip them for thread setup so we don't waste resources. If the node isn't online, just alloc memory with NUMA_NO_NODE.
Reported-by: Glauber Costa glauber@scylladb.com Tested-by: Glauber Costa glauber@scylladb.com Signed-off-by: Jens Axboe axboe@kernel.dk Signed-off-by: yangerkun yangerkun@huawei.com Reviewed-by: zhangyi (F) yi.zhang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- fs/io-wq.c | 22 ++++++++++++++++++---- 1 file changed, 18 insertions(+), 4 deletions(-)
diff --git a/fs/io-wq.c b/fs/io-wq.c index 419845d514df..4e9a202362e5 100644 --- a/fs/io-wq.c +++ b/fs/io-wq.c @@ -700,11 +700,16 @@ static int io_wq_manager(void *data) /* create fixed workers */ refcount_set(&wq->refs, workers_to_create); for_each_node(node) { + if (!node_online(node)) + continue; if (!create_io_worker(wq, wq->wqes[node], IO_WQ_ACCT_BOUND)) goto err; workers_to_create--; }
+ while (workers_to_create--) + refcount_dec(&wq->refs); + complete(&wq->done);
while (!kthread_should_stop()) { @@ -712,6 +717,9 @@ static int io_wq_manager(void *data) struct io_wqe *wqe = wq->wqes[node]; bool fork_worker[2] = { false, false };
+ if (!node_online(node)) + continue; + spin_lock_irq(&wqe->lock); if (io_wqe_need_worker(wqe, IO_WQ_ACCT_BOUND)) fork_worker[IO_WQ_ACCT_BOUND] = true; @@ -830,7 +838,9 @@ static bool io_wq_for_each_worker(struct io_wqe *wqe,
list_for_each_entry_rcu(worker, &wqe->all_list, all_list) { if (io_worker_get(worker)) { - ret = func(worker, data); + /* no task if node is/was offline */ + if (worker->task) + ret = func(worker, data); io_worker_release(worker); if (ret) break; @@ -1085,6 +1095,8 @@ void io_wq_flush(struct io_wq *wq) for_each_node(node) { struct io_wqe *wqe = wq->wqes[node];
+ if (!node_online(node)) + continue; init_completion(&data.done); INIT_IO_WORK(&data.work, io_wq_flush_func); data.work.flags |= IO_WQ_WORK_INTERNAL; @@ -1116,12 +1128,15 @@ struct io_wq *io_wq_create(unsigned bounded, struct io_wq_data *data)
for_each_node(node) { struct io_wqe *wqe; + int alloc_node = node;
- wqe = kzalloc_node(sizeof(struct io_wqe), GFP_KERNEL, node); + if (!node_online(alloc_node)) + alloc_node = NUMA_NO_NODE; + wqe = kzalloc_node(sizeof(struct io_wqe), GFP_KERNEL, alloc_node); if (!wqe) goto err; wq->wqes[node] = wqe; - wqe->node = node; + wqe->node = alloc_node; wqe->acct[IO_WQ_ACCT_BOUND].max_workers = bounded; atomic_set(&wqe->acct[IO_WQ_ACCT_BOUND].nr_running, 0); if (wq->user) { @@ -1129,7 +1144,6 @@ struct io_wq *io_wq_create(unsigned bounded, struct io_wq_data *data) task_rlimit(current, RLIMIT_NPROC); } atomic_set(&wqe->acct[IO_WQ_ACCT_UNBOUND].nr_running, 0); - wqe->node = node; wqe->wq = wq; spin_lock_init(&wqe->lock); INIT_WQ_LIST(&wqe->work_list);
From: Jens Axboe axboe@kernel.dk
mainline inclusion from mainline-5.6-rc2 commit 2ca10259b4189a433c309054496dd6af1415f992 category: feature bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27 CVE: NA ---------------------------
Carter reported an issue where he could produce a stall on ring exit, when we're cleaning up requests that match the given file table. For this particular test case, a combination of a few things caused the issue:
- The cq ring had overflowed
- The request being canceled was in the overflow list
The combination of the above means that the cq overflow list holds a reference to the request. The request is canceled correctly, but since the overflow list holds a reference to it, the final put won't happen. Since the final put doesn't happen, the request remains on the inflight list. Hence we never finish the cancellation flush.
Fix this by removing requests from the overflow list if we're canceling them.
Cc: stable@vger.kernel.org # 5.5 Reported-by: Carter Li 李通洲 carter.li@eoitek.com Signed-off-by: Jens Axboe axboe@kernel.dk Signed-off-by: yangerkun yangerkun@huawei.com Reviewed-by: zhangyi (F) yi.zhang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- fs/io_uring.c | 28 ++++++++++++++++++++++++++++ 1 file changed, 28 insertions(+)
diff --git a/fs/io_uring.c b/fs/io_uring.c index cae36041e12b..bbb5a45f3718 100644 --- a/fs/io_uring.c +++ b/fs/io_uring.c @@ -482,6 +482,7 @@ enum { REQ_F_TIMEOUT_NOSEQ_BIT, REQ_F_COMP_LOCKED_BIT, REQ_F_NEED_CLEANUP_BIT, + REQ_F_OVERFLOW_BIT, };
enum { @@ -522,6 +523,8 @@ enum { REQ_F_COMP_LOCKED = BIT(REQ_F_COMP_LOCKED_BIT), /* needs cleanup */ REQ_F_NEED_CLEANUP = BIT(REQ_F_NEED_CLEANUP_BIT), + /* in overflow list */ + REQ_F_OVERFLOW = BIT(REQ_F_OVERFLOW_BIT), };
/* @@ -1097,6 +1100,7 @@ static bool io_cqring_overflow_flush(struct io_ring_ctx *ctx, bool force) req = list_first_entry(&ctx->cq_overflow_list, struct io_kiocb, list); list_move(&req->list, &list); + req->flags &= ~REQ_F_OVERFLOW; if (cqe) { WRITE_ONCE(cqe->user_data, req->user_data); WRITE_ONCE(cqe->res, req->result); @@ -1149,6 +1153,7 @@ static void io_cqring_fill_event(struct io_kiocb *req, long res) set_bit(0, &ctx->sq_check_overflow); set_bit(0, &ctx->cq_check_overflow); } + req->flags |= REQ_F_OVERFLOW; refcount_inc(&req->refs); req->result = res; list_add_tail(&req->list, &ctx->cq_overflow_list); @@ -6398,6 +6403,29 @@ static void io_uring_cancel_files(struct io_ring_ctx *ctx, if (!cancel_req) break;
+ if (cancel_req->flags & REQ_F_OVERFLOW) { + spin_lock_irq(&ctx->completion_lock); + list_del(&cancel_req->list); + cancel_req->flags &= ~REQ_F_OVERFLOW; + if (list_empty(&ctx->cq_overflow_list)) { + clear_bit(0, &ctx->sq_check_overflow); + clear_bit(0, &ctx->cq_check_overflow); + } + spin_unlock_irq(&ctx->completion_lock); + + WRITE_ONCE(ctx->rings->cq_overflow, + atomic_inc_return(&ctx->cached_cq_overflow)); + + /* + * Put inflight ref and overflow ref. If that's + * all we had, then we're done with this request. + */ + if (refcount_sub_and_test(2, &cancel_req->refs)) { + io_put_req(cancel_req); + continue; + } + } + io_wq_cancel_work(ctx->io_wq, &cancel_req->work); io_put_req(cancel_req); schedule();
From: Pavel Begunkov asml.silence@gmail.com
mainline inclusion from mainline-5.6-rc3 commit 7fbeb95d0f68e21e6ca61284f1ac681630976947 category: feature bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27 CVE: NA ---------------------------
io_fallocate_finish() is missing a cancellation check. Add it. It's safe to do so, as only flag setup and sqe field copies are done before the request gets into __io_fallocate().
Signed-off-by: Pavel Begunkov asml.silence@gmail.com Signed-off-by: Jens Axboe axboe@kernel.dk Signed-off-by: yangerkun yangerkun@huawei.com Reviewed-by: zhangyi (F) yi.zhang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- fs/io_uring.c | 4 ++++ 1 file changed, 4 insertions(+)
diff --git a/fs/io_uring.c b/fs/io_uring.c index bbb5a45f3718..c0f3400f6ceb 100644 --- a/fs/io_uring.c +++ b/fs/io_uring.c @@ -2511,6 +2511,9 @@ static void io_fallocate_finish(struct io_wq_work **workptr) struct io_kiocb *nxt = NULL; int ret;
+ if (io_req_cancelled(req)) + return; + ret = vfs_fallocate(req->file, req->sync.mode, req->sync.off, req->sync.len); if (ret < 0) @@ -2850,6 +2853,7 @@ static void io_close_finish(struct io_wq_work **workptr) struct io_kiocb *req = container_of(*workptr, struct io_kiocb, work); struct io_kiocb *nxt = NULL;
+ /* not cancellable, don't do io_req_cancelled() */ __io_close_finish(req, &nxt); if (nxt) io_wq_assign_next(workptr, nxt);
From: Dan Carpenter dan.carpenter@oracle.com
mainline inclusion from mainline-5.6-rc3 commit 297a31e3e8318f533cff4fe33ffaefb74f72c6e2 category: feature bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27 CVE: NA ---------------------------
The "kmsg" pointer can't be NULL and we have already dereferenced it so a check here would be useless.
Reviewed-by: Stefano Garzarella sgarzare@redhat.com Signed-off-by: Dan Carpenter dan.carpenter@oracle.com Signed-off-by: Jens Axboe axboe@kernel.dk Signed-off-by: yangerkun yangerkun@huawei.com Reviewed-by: zhangyi (F) yi.zhang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- fs/io_uring.c | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-)
diff --git a/fs/io_uring.c b/fs/io_uring.c index c0f3400f6ceb..4d82b04a92c9 100644 --- a/fs/io_uring.c +++ b/fs/io_uring.c @@ -3021,7 +3021,7 @@ static int io_sendmsg(struct io_kiocb *req, struct io_kiocb **nxt, if (req->io) return -EAGAIN; if (io_alloc_async_ctx(req)) { - if (kmsg && kmsg->iov != kmsg->fast_iov) + if (kmsg->iov != kmsg->fast_iov) kfree(kmsg->iov); return -ENOMEM; } @@ -3175,7 +3175,7 @@ static int io_recvmsg(struct io_kiocb *req, struct io_kiocb **nxt, if (req->io) return -EAGAIN; if (io_alloc_async_ctx(req)) { - if (kmsg && kmsg->iov != kmsg->fast_iov) + if (kmsg->iov != kmsg->fast_iov) kfree(kmsg->iov); return -ENOMEM; }
From: Pavel Begunkov asml.silence@gmail.com
mainline inclusion from mainline-5.6-rc3 commit 929a3af90f0f4bd7132d83552c1a98c83f60ef7e category: feature bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27 CVE: NA ---------------------------
io_cleanup_req() should be called before req->io is freed, and so shouldn't come after __io_free_req() -> __io_req_aux_free(). Also, it would be skipped entirely in io_free_req_many(), which uses __io_req_aux_free().
Place cleanup_req() into __io_req_aux_free().
Fixes: 99bc4c38537d774 ("io_uring: fix iovec leaks") Signed-off-by: Pavel Begunkov asml.silence@gmail.com Signed-off-by: Jens Axboe axboe@kernel.dk Signed-off-by: yangerkun yangerkun@huawei.com Reviewed-by: zhangyi (F) yi.zhang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- fs/io_uring.c | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-)
diff --git a/fs/io_uring.c b/fs/io_uring.c index 4d82b04a92c9..4ed5a7d97640 100644 --- a/fs/io_uring.c +++ b/fs/io_uring.c @@ -1254,6 +1254,9 @@ static void __io_req_aux_free(struct io_kiocb *req) { struct io_ring_ctx *ctx = req->ctx;
+ if (req->flags & REQ_F_NEED_CLEANUP) + io_cleanup_req(req); + kfree(req->io); if (req->file) { if (req->flags & REQ_F_FIXED_FILE) @@ -1269,9 +1272,6 @@ static void __io_free_req(struct io_kiocb *req) { __io_req_aux_free(req);
- if (req->flags & REQ_F_NEED_CLEANUP) - io_cleanup_req(req); - if (req->flags & REQ_F_INFLIGHT) { struct io_ring_ctx *ctx = req->ctx; unsigned long flags;
From: Stefano Garzarella sgarzare@redhat.com
mainline inclusion from mainline-5.6-rc3 commit 7143b5ac5750f404ff3a594b34fdf3fc2f99f828 category: feature bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27 CVE: NA ---------------------------
This patch drops 'cur_mm' before calling cond_resched(), to prevent the sq_thread from spinning even when the user process is finished.
Before this patch, if the user process exited without closing the io_uring fd, the sq_thread would continue to spin until the 'sq_thread_idle' timeout ended.
In the worst case where the 'sq_thread_idle' parameter is bigger than INT_MAX, the sq_thread will spin forever.
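Annotated, the block being moved is the teardown pair for the mmget()+use_mm() done at submit time. A sketch with a hypothetical helper (sqd_drop_mm is not a kernel function; unuse_mm()/mmput() are the 5.6-era names):

    #include <linux/mmu_context.h>
    #include <linux/sched/mm.h>

    static void sqd_drop_mm(struct mm_struct **cur_mm)
    {
            /* Stop borrowing the address space, then drop the reference
             * pinning it. Doing this before any idle spin or sleep means
             * an exited process's mm isn't held while the thread waits. */
            if (*cur_mm) {
                    unuse_mm(*cur_mm);
                    mmput(*cur_mm);
                    *cur_mm = NULL;
            }
    }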
Fixes: 6c271ce2f1d5 ("io_uring: add submission polling") Signed-off-by: Stefano Garzarella sgarzare@redhat.com Signed-off-by: Jens Axboe axboe@kernel.dk Signed-off-by: yangerkun yangerkun@huawei.com Reviewed-by: zhangyi (F) yi.zhang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- fs/io_uring.c | 24 ++++++++++++------------ 1 file changed, 12 insertions(+), 12 deletions(-)
diff --git a/fs/io_uring.c b/fs/io_uring.c index 4ed5a7d97640..c3f0df489a57 100644 --- a/fs/io_uring.c +++ b/fs/io_uring.c @@ -5076,6 +5076,18 @@ static int io_sq_thread(void *data) * to enter the kernel to reap and flush events. */ if (!to_submit || ret == -EBUSY) { + /* + * Drop cur_mm before scheduling, we can't hold it for + * long periods (or over schedule()). Do this before + * adding ourselves to the waitqueue, as the unuse/drop + * may sleep. + */ + if (cur_mm) { + unuse_mm(cur_mm); + mmput(cur_mm); + cur_mm = NULL; + } + /* * We're polling. If we're within the defined idle * period, then let us spin without work before going @@ -5090,18 +5102,6 @@ static int io_sq_thread(void *data) continue; }
- /* - * Drop cur_mm before scheduling, we can't hold it for - * long periods (or over schedule()). Do this before - * adding ourselves to the waitqueue, as the unuse/drop - * may sleep. - */ - if (cur_mm) { - unuse_mm(cur_mm); - mmput(cur_mm); - cur_mm = NULL; - } - prepare_to_wait(&ctx->sqo_wait, &wait, TASK_INTERRUPTIBLE);
From: Xiaoguang Wang xiaoguang.wang@linux.alibaba.com
mainline inclusion from mainline-5.6-rc3 commit c7849be9cc2dd2754c48ddbaca27c2de6d80a95d category: feature bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27 CVE: NA ---------------------------
Since commit a3a0e43fd770 ("io_uring: don't enter poll loop if we have CQEs pending"), we won't enter the poll loop if we already have events pending. In case SETUP_IOPOLL and SETUP_SQPOLL are both enabled, if the app has been terminated and doesn't reap the pending events already in the cq ring, and there are some reqs in poll_list, io_sq_thread will enter __io_iopoll_check(), find pending events and return; this loop never gets a chance to exit.
I have seen this issue in fio stress tests. To fix it, let io_sq_thread call io_iopoll_getevents() with argument 'min' set to zero, and remove __io_iopoll_check().
Fixes: a3a0e43fd770 ("io_uring: don't enter poll loop if we have CQEs pending") Signed-off-by: Xiaoguang Wang xiaoguang.wang@linux.alibaba.com Signed-off-by: Jens Axboe axboe@kernel.dk Signed-off-by: yangerkun yangerkun@huawei.com Reviewed-by: zhangyi (F) yi.zhang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- fs/io_uring.c | 27 +++++++++------------------ 1 file changed, 9 insertions(+), 18 deletions(-)
diff --git a/fs/io_uring.c b/fs/io_uring.c index c3f0df489a57..717055df430a 100644 --- a/fs/io_uring.c +++ b/fs/io_uring.c @@ -1666,11 +1666,17 @@ static void io_iopoll_reap_events(struct io_ring_ctx *ctx) mutex_unlock(&ctx->uring_lock); }
-static int __io_iopoll_check(struct io_ring_ctx *ctx, unsigned *nr_events, - long min) +static int io_iopoll_check(struct io_ring_ctx *ctx, unsigned *nr_events, + long min) { int iters = 0, ret = 0;
+ /* + * We disallow the app entering submit/complete with polling, but we + * still need to lock the ring to prevent racing with polled issue + * that got punted to a workqueue. + */ + mutex_lock(&ctx->uring_lock); do { int tmin = 0;
@@ -1706,21 +1712,6 @@ static int __io_iopoll_check(struct io_ring_ctx *ctx, unsigned *nr_events, ret = 0; } while (min && !*nr_events && !need_resched());
- return ret; -} - -static int io_iopoll_check(struct io_ring_ctx *ctx, unsigned *nr_events, - long min) -{ - int ret; - - /* - * We disallow the app entering submit/complete with polling, but we - * still need to lock the ring to prevent racing with polled issue - * that got punted to a workqueue. - */ - mutex_lock(&ctx->uring_lock); - ret = __io_iopoll_check(ctx, nr_events, min); mutex_unlock(&ctx->uring_lock); return ret; } @@ -5052,7 +5043,7 @@ static int io_sq_thread(void *data) */ mutex_lock(&ctx->uring_lock); if (!list_empty(&ctx->poll_list)) - __io_iopoll_check(ctx, &nr_events, 0); + io_iopoll_getevents(ctx, &nr_events, 0); else inflight = 0; mutex_unlock(&ctx->uring_lock);
From: Jens Axboe axboe@kernel.dk
mainline inclusion from mainline-5.6-rc4 commit 193155c8c9429f57400daf1f2ef0075016767112 category: feature bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27 CVE: NA ---------------------------
If we have a chain of requests and they don't all use the same credentials, then the head of the chain will be issued with the credentials of the tail of the chain.
Ensure __io_queue_sqe() overrides the credentials, if they are different.
Once we do that, we can clean up the creds handling as well, by only having io_submit_sqe() do the lookup of a personality. It doesn't need to assign it, since __io_queue_sqe() now always does the right thing.
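The API contract being leaned on is the override_creds()/revert_creds() pairing: override installs the request's credentials and hands back the previously active ones, which must be restored exactly once when issuing is done. A hypothetical condensation (issue_with_creds and issue are placeholders, not kernel functions; the diff's real flow also handles the retry loop):

    #include <linux/cred.h>

    static void issue_with_creds(const struct cred *req_creds,
                                 void (*issue)(void))
    {
            const struct cred *old_creds = NULL;

            if (req_creds && req_creds != current_cred())
                    old_creds = override_creds(req_creds);

            issue();        /* stands in for io_issue_sqe() */

            if (old_creds)
                    revert_creds(old_creds);
    }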
Fixes: 75c6a03904e0 ("io_uring: support using a registered personality for commands") Reported-by: Pavel Begunkov asml.silence@gmail.com Signed-off-by: Jens Axboe axboe@kernel.dk Signed-off-by: yangerkun yangerkun@huawei.com Reviewed-by: zhangyi (F) yi.zhang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- fs/io_uring.c | 25 +++++++++++++++---------- 1 file changed, 15 insertions(+), 10 deletions(-)
diff --git a/fs/io_uring.c b/fs/io_uring.c index 717055df430a..b02907e824f3 100644 --- a/fs/io_uring.c +++ b/fs/io_uring.c @@ -4639,11 +4639,21 @@ static void __io_queue_sqe(struct io_kiocb *req, const struct io_uring_sqe *sqe) { struct io_kiocb *linked_timeout; struct io_kiocb *nxt = NULL; + const struct cred *old_creds = NULL; int ret;
again: linked_timeout = io_prep_linked_timeout(req);
+ if (req->work.creds && req->work.creds != current_cred()) { + if (old_creds) + revert_creds(old_creds); + if (old_creds == req->work.creds) + old_creds = NULL; /* restored original creds */ + else + old_creds = override_creds(req->work.creds); + } + ret = io_issue_sqe(req, sqe, &nxt, true);
/* @@ -4693,6 +4703,8 @@ static void __io_queue_sqe(struct io_kiocb *req, const struct io_uring_sqe *sqe) goto punt; goto again; } + if (old_creds) + revert_creds(old_creds); }
static void io_queue_sqe(struct io_kiocb *req, const struct io_uring_sqe *sqe) @@ -4737,7 +4749,6 @@ static inline void io_queue_link_head(struct io_kiocb *req) static bool io_submit_sqe(struct io_kiocb *req, const struct io_uring_sqe *sqe, struct io_submit_state *state, struct io_kiocb **link) { - const struct cred *old_creds = NULL; struct io_ring_ctx *ctx = req->ctx; unsigned int sqe_flags; int ret, id; @@ -4752,14 +4763,12 @@ static bool io_submit_sqe(struct io_kiocb *req, const struct io_uring_sqe *sqe,
id = READ_ONCE(sqe->personality); if (id) { - const struct cred *personality_creds; - - personality_creds = idr_find(&ctx->personality_idr, id); - if (unlikely(!personality_creds)) { + req->work.creds = idr_find(&ctx->personality_idr, id); + if (unlikely(!req->work.creds)) { ret = -EINVAL; goto err_req; } - old_creds = override_creds(personality_creds); + get_cred(req->work.creds); }
/* same numerical values with corresponding REQ_F_*, safe to copy */ @@ -4771,8 +4780,6 @@ static bool io_submit_sqe(struct io_kiocb *req, const struct io_uring_sqe *sqe, err_req: io_cqring_add_event(req, ret); io_double_put_req(req); - if (old_creds) - revert_creds(old_creds); return false; }
@@ -4833,8 +4840,6 @@ static bool io_submit_sqe(struct io_kiocb *req, const struct io_uring_sqe *sqe, } }
- if (old_creds) - revert_creds(old_creds); return true; }
From: Jens Axboe axboe@kernel.dk
mainline inclusion from mainline-5.6-rc4 commit 41726c9a50e7464beca7112d0aebf3a0090c62d2 category: feature bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27 CVE: NA ---------------------------
We somehow never free the idr, even though we init it for every ctx. Free it when the rest of the ring data is freed.
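For context, the init/teardown pairing this restores, as a sketch (personality_idr_lifetime is illustrative, not kernel code): idr_destroy() frees the IDR's internal nodes, while the entries themselves — here the registered creds — are put separately.

    #include <linux/idr.h>

    static void personality_idr_lifetime(void)
    {
            struct idr personality_idr;

            idr_init(&personality_idr);
            /* ... idr_alloc()/idr_find()/idr_remove() while the ctx lives ... */
            idr_destroy(&personality_idr);  /* the call this patch adds */
    }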
Fixes: 071698e13ac6 ("io_uring: allow registering credentials") Reviewed-by: Stefano Garzarella sgarzare@redhat.com Signed-off-by: Jens Axboe axboe@kernel.dk Signed-off-by: yangerkun yangerkun@huawei.com Reviewed-by: zhangyi (F) yi.zhang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- fs/io_uring.c | 1 + 1 file changed, 1 insertion(+)
diff --git a/fs/io_uring.c b/fs/io_uring.c index b02907e824f3..99dd85b92ec5 100644 --- a/fs/io_uring.c +++ b/fs/io_uring.c @@ -6274,6 +6274,7 @@ static void io_ring_ctx_free(struct io_ring_ctx *ctx) io_sqe_buffer_unregister(ctx); io_sqe_files_unregister(ctx); io_eventfd_unregister(ctx); + idr_destroy(&ctx->personality_idr);
#if defined(CONFIG_UNIX) if (ctx->ring_sock) {
From: Xiaoguang Wang xiaoguang.wang@linux.alibaba.com
mainline inclusion from mainline-5.6-rc4 commit bdcd3eab2a9ae0ac93f27275b6895dd95e5bf360 category: feature bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27 CVE: NA ---------------------------
After making ext4 support the iopoll method (letting ext4_file_operations' iopoll method be iomap_dio_iopoll()), we found fio can easily hang in fio_ioring_getevents() with the fio job below: rm -f testfile; sync; sudo fio -name=fiotest -filename=testfile -iodepth=128 -thread -rw=write -ioengine=io_uring -hipri=1 -sqthread_poll=1 -direct=1 -bs=4k -size=10G -numjobs=8 -runtime=2000 -group_reporting with IORING_SETUP_SQPOLL and IORING_SETUP_IOPOLL enabled.
There are two issues that result in this hang. One reason is that when IORING_SETUP_SQPOLL and IORING_SETUP_IOPOLL are enabled, fio does not use io_uring_enter() to get completed events; it relies on the kernel io_sq_thread to poll for completed events.
Another reason is a race: when io_submit_sqes() in io_sq_thread() submits a batch of sqes, the variable 'inflight' records the number of submitted reqs, and io_sq_thread then polls for reqs that have been added to poll_list. But note: if some previous reqs have been punted to an io worker, those reqs won't show up in poll_list in time. io_sq_thread() will only poll for part of the previously submitted reqs, then find poll_list empty and reset 'inflight' to zero. If the app just waits on these deferred reqs and does not wake up io_sq_thread again, the hang happens.
For an app that relies entirely on io_sq_thread to poll for completed requests, let io_iopoll_req_issued() wake up io_sq_thread properly when adding a new element to poll_list, and when io_sq_thread prepares to sleep, check whether poll_list is empty again; if it is not empty, continue polling.
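Both halves of the fix close the classic check-then-sleep race: if the sleeper tests its condition before registering as a waiter, a wakeup that fires in between is lost. A userspace sketch of the same pattern (condition variables in place of the sqo waitqueue; names are illustrative):

    #include <pthread.h>
    #include <stdbool.h>

    static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
    static pthread_cond_t cond = PTHREAD_COND_INITIALIZER;
    static bool work_pending;

    static void producer(void)
    {
            pthread_mutex_lock(&lock);
            work_pending = true;          /* add req to poll_list */
            pthread_cond_signal(&cond);   /* wake_up(&ctx->sqo_wait) */
            pthread_mutex_unlock(&lock);
    }

    static void consumer(void)
    {
            pthread_mutex_lock(&lock);
            while (!work_pending)         /* re-check after registering/waking */
                    pthread_cond_wait(&cond, &lock);
            work_pending = false;
            pthread_mutex_unlock(&lock);
    }

The kernel mirrors this by registering first (prepare_to_wait()) and then re-checking the condition (list_empty_careful() on poll_list) before actually sleeping.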
Signed-off-by: Xiaoguang Wang xiaoguang.wang@linux.alibaba.com Signed-off-by: Jens Axboe axboe@kernel.dk Signed-off-by: yangerkun yangerkun@huawei.com Reviewed-by: zhangyi (F) yi.zhang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- fs/io_uring.c | 59 +++++++++++++++++++++++---------------------------- 1 file changed, 27 insertions(+), 32 deletions(-)
diff --git a/fs/io_uring.c b/fs/io_uring.c index 99dd85b92ec5..52b21cd6cca0 100644 --- a/fs/io_uring.c +++ b/fs/io_uring.c @@ -1815,6 +1815,10 @@ static void io_iopoll_req_issued(struct io_kiocb *req) list_add(&req->list, &ctx->poll_list); else list_add_tail(&req->list, &ctx->poll_list); + + if ((ctx->flags & IORING_SETUP_SQPOLL) && + wq_has_sleeper(&ctx->sqo_wait)) + wake_up(&ctx->sqo_wait); }
static void io_file_put(struct io_submit_state *state) @@ -5020,9 +5024,8 @@ static int io_sq_thread(void *data) const struct cred *old_cred; mm_segment_t old_fs; DEFINE_WAIT(wait); - unsigned inflight; unsigned long timeout; - int ret; + int ret = 0;
complete(&ctx->completions[1]);
@@ -5030,39 +5033,19 @@ static int io_sq_thread(void *data) set_fs(USER_DS); old_cred = override_creds(ctx->creds);
- ret = timeout = inflight = 0; + timeout = jiffies + ctx->sq_thread_idle; while (!kthread_should_park()) { unsigned int to_submit;
- if (inflight) { + if (!list_empty(&ctx->poll_list)) { unsigned nr_events = 0;
- if (ctx->flags & IORING_SETUP_IOPOLL) { - /* - * inflight is the count of the maximum possible - * entries we submitted, but it can be smaller - * if we dropped some of them. If we don't have - * poll entries available, then we know that we - * have nothing left to poll for. Reset the - * inflight count to zero in that case. - */ - mutex_lock(&ctx->uring_lock); - if (!list_empty(&ctx->poll_list)) - io_iopoll_getevents(ctx, &nr_events, 0); - else - inflight = 0; - mutex_unlock(&ctx->uring_lock); - } else { - /* - * Normal IO, just pretend everything completed. - * We don't have to poll completions for that. - */ - nr_events = inflight; - } - - inflight -= nr_events; - if (!inflight) + mutex_lock(&ctx->uring_lock); + if (!list_empty(&ctx->poll_list)) + io_iopoll_getevents(ctx, &nr_events, 0); + else timeout = jiffies + ctx->sq_thread_idle; + mutex_unlock(&ctx->uring_lock); }
to_submit = io_sqring_entries(ctx); @@ -5091,7 +5074,7 @@ static int io_sq_thread(void *data) * more IO, we should wait for the application to * reap events and wake us up. */ - if (inflight || + if (!list_empty(&ctx->poll_list) || (!time_after(jiffies, timeout) && ret != -EBUSY && !percpu_ref_is_dying(&ctx->refs))) { cond_resched(); @@ -5101,6 +5084,19 @@ static int io_sq_thread(void *data) prepare_to_wait(&ctx->sqo_wait, &wait, TASK_INTERRUPTIBLE);
+ /* + * While doing polled IO, before going to sleep, we need + * to check if there are new reqs added to poll_list, it + * is because reqs may have been punted to io worker and + * will be added to poll_list later, hence check the + * poll_list again. + */ + if ((ctx->flags & IORING_SETUP_IOPOLL) && + !list_empty_careful(&ctx->poll_list)) { + finish_wait(&ctx->sqo_wait, &wait); + continue; + } + /* Tell userspace we may need a wakeup call */ ctx->rings->sq_flags |= IORING_SQ_NEED_WAKEUP; /* make sure to read SQ tail after writing flags */ @@ -5128,8 +5124,7 @@ static int io_sq_thread(void *data) mutex_lock(&ctx->uring_lock); ret = io_submit_sqes(ctx, to_submit, NULL, -1, &cur_mm, true); mutex_unlock(&ctx->uring_lock); - if (ret > 0) - inflight += ret; + timeout = jiffies + ctx->sq_thread_idle; }
set_fs(old_fs);
From: Jens Axboe axboe@kernel.dk
mainline inclusion from mainline-5.6-rc4 commit 3030fd4cb783377eca0e2a3eee63724a5c66ee15 category: feature bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27 CVE: NA ---------------------------
Andres reports that buffered IO seems to suck up more cycles than we would like, and he narrowed it down to the fact that the io-wq workers will briefly spin for more work on completion of a work item. This was a win on the networking side, but apparently some other cases take a hit because of it. Remove the optimization to avoid burning more CPU than we have to for disk IO.
Reported-by: Andres Freund andres@anarazel.de Signed-off-by: Jens Axboe axboe@kernel.dk Signed-off-by: yangerkun yangerkun@huawei.com Reviewed-by: zhangyi (F) yi.zhang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- fs/io-wq.c | 19 ------------------- 1 file changed, 19 deletions(-)
diff --git a/fs/io-wq.c b/fs/io-wq.c index 4e9a202362e5..88f34f66c387 100644 --- a/fs/io-wq.c +++ b/fs/io-wq.c @@ -536,42 +536,23 @@ static void io_worker_handle_work(struct io_worker *worker) } while (1); }
-static inline void io_worker_spin_for_work(struct io_wqe *wqe) -{ - int i = 0; - - while (++i < 1000) { - if (io_wqe_run_queue(wqe)) - break; - if (need_resched()) - break; - cpu_relax(); - } -} - static int io_wqe_worker(void *data) { struct io_worker *worker = data; struct io_wqe *wqe = worker->wqe; struct io_wq *wq = wqe->wq; - bool did_work;
io_worker_start(wqe, worker);
- did_work = false; while (!test_bit(IO_WQ_BIT_EXIT, &wq->state)) { set_current_state(TASK_INTERRUPTIBLE); loop: - if (did_work) - io_worker_spin_for_work(wqe); spin_lock_irq(&wqe->lock); if (io_wqe_run_queue(wqe)) { __set_current_state(TASK_RUNNING); io_worker_handle_work(worker); - did_work = true; goto loop; } - did_work = false; /* drops the lock on success, retry */ if (__io_worker_idle(wqe, worker)) { __release(&wqe->lock);
From: Jens Axboe axboe@kernel.dk
mainline inclusion from mainline-5.6-rc4 commit 2d141dd2caa78fbaf87b57c27769bdc14975ab3d category: feature bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27 CVE: NA ---------------------------
We use ->task_pid for exit cancellation, but we need to ensure it's cleared to zero for io_req_work_grab_env() to do the right thing. Take a suggestion from Bart and clear the whole thing, just setting the function passed in. This makes it more future proof as well.
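The "clear the whole thing" rewrite relies on a C guarantee: assigning a compound literal with designated initializers zero-fills every member not named. A standalone illustration (struct and field names invented for the example):

    #include <assert.h>
    #include <stddef.h>

    struct work {
            void (*func)(struct work *);
            void *data;
            unsigned flags;
            int task_pid;
    };

    static void handler(struct work *w) { (void)w; }

    int main(void)
    {
            struct work w = { .data = (void *)0x1, .flags = 0xff, .task_pid = 42 };

            /* Same idiom as the new INIT_IO_WORK(): assigning a compound
             * literal resets every unnamed field to zero, so task_pid
             * can't leak a stale value. */
            w = (struct work){ .func = handler };

            assert(w.func == handler);
            assert(w.data == NULL && w.flags == 0 && w.task_pid == 0);
            return 0;
    }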
Fixes: 36282881a795 ("io-wq: add io_wq_cancel_pid() to cancel based on a specific pid") Signed-off-by: Jens Axboe axboe@kernel.dk Signed-off-by: yangerkun yangerkun@huawei.com Reviewed-by: zhangyi (F) yi.zhang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- fs/io-wq.h | 14 ++++---------- 1 file changed, 4 insertions(+), 10 deletions(-)
diff --git a/fs/io-wq.h b/fs/io-wq.h index ccc7d84af57d..33baba4370c5 100644 --- a/fs/io-wq.h +++ b/fs/io-wq.h @@ -79,16 +79,10 @@ struct io_wq_work { pid_t task_pid; };
-#define INIT_IO_WORK(work, _func) \ - do { \ - (work)->list.next = NULL; \ - (work)->func = _func; \ - (work)->files = NULL; \ - (work)->mm = NULL; \ - (work)->creds = NULL; \ - (work)->fs = NULL; \ - (work)->flags = 0; \ - } while (0) \ +#define INIT_IO_WORK(work, _func) \ + do { \ + *(work) = (struct io_wq_work){ .func = _func }; \ + } while (0) \
typedef void (get_work_fn)(struct io_wq_work *); typedef void (put_work_fn)(struct io_wq_work *);
From: Jens Axboe axboe@kernel.dk
mainline inclusion from mainline-5.6-rc4 commit 2a44f46781617c5040372b59da33553a02b1f46d category: feature bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27 CVE: NA ---------------------------
If work completes inline, then we should pick up a dependent link item in __io_queue_sqe() as well. If we don't do so, we're forced to go async with that item, which is suboptimal.
This also fixes an issue with io_put_req_find_next(), which always looks up the next work item. That should only be done if we're dropping the last reference to the request, to prevent multiple lookups of the same work item.
Outside of being a fix, this also enables a good cleanup series for 5.7, where we never have to pass 'nxt' around or into the work handlers.
Reviewed-by: Pavel Begunkov asml.silence@gmail.com Signed-off-by: Jens Axboe axboe@kernel.dk Signed-off-by: yangerkun yangerkun@huawei.com Reviewed-by: zhangyi (F) yi.zhang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- fs/io_uring.c | 8 ++++---- 1 file changed, 4 insertions(+), 4 deletions(-)
diff --git a/fs/io_uring.c b/fs/io_uring.c index 52b21cd6cca0..6c0ab5096e19 100644 --- a/fs/io_uring.c +++ b/fs/io_uring.c @@ -1477,10 +1477,10 @@ static void io_free_req(struct io_kiocb *req) __attribute__((nonnull)) static void io_put_req_find_next(struct io_kiocb *req, struct io_kiocb **nxtptr) { - io_req_find_next(req, nxtptr); - - if (refcount_dec_and_test(&req->refs)) + if (refcount_dec_and_test(&req->refs)) { + io_req_find_next(req, nxtptr); __io_free_req(req); + } }
static void io_put_req(struct io_kiocb *req) @@ -4683,7 +4683,7 @@ static void __io_queue_sqe(struct io_kiocb *req, const struct io_uring_sqe *sqe)
err: /* drop submission reference */ - io_put_req(req); + io_put_req_find_next(req, &nxt);
if (linked_timeout) { if (!ret)
From: Jens Axboe axboe@kernel.dk
mainline inclusion from mainline-5.6-rc4 commit 3a9015988b3d41027cda61f4fdeaaeee73be8b24 category: feature bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27 CVE: NA ---------------------------
Unlike the other core import helpers, import_single_range() returns 0 on success, not the length imported. This means that links that depend on the result of non-vec based IORING_OP_{READ,WRITE} that were added for 5.5 get errored when they should not be.
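For illustration, a minimal sketch of the normalization the one-line fix applies (the helper name is descriptive, not kernel code):

    /* import_iovec()-style helpers return the number of bytes imported,
     * while import_single_range() returns 0 on success. The fix maps the
     * non-vectored path back to the common "bytes or -errno" convention. */
    static long normalize_import_ret(long ret, unsigned sqe_len)
    {
            return ret < 0 ? ret : (long)sqe_len;
    }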
Fixes: 3a6820f2bb8a ("io_uring: add non-vectored read/write commands") Signed-off-by: Jens Axboe axboe@kernel.dk Signed-off-by: yangerkun yangerkun@huawei.com Reviewed-by: zhangyi (F) yi.zhang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- fs/io_uring.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/fs/io_uring.c b/fs/io_uring.c index 6c0ab5096e19..352ac75afe97 100644 --- a/fs/io_uring.c +++ b/fs/io_uring.c @@ -2069,7 +2069,7 @@ static ssize_t io_import_iovec(int rw, struct io_kiocb *req, ssize_t ret; ret = import_single_range(rw, buf, sqe_len, *iovec, iter); *iovec = NULL; - return ret; + return ret < 0 ? ret : sqe_len; }
if (req->io) {
From: Pavel Begunkov asml.silence@gmail.com
mainline inclusion from mainline-5.7-rc1 commit 7d67af2c013402537385dae343a2d0f6a4cb3bfd category: feature bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27 CVE: NA ---------------------------
Add support for splice(2).
- the output file is specified as sqe->fd, so it's handled by generic code - hash_reg_file is handled by generic code as well - len is 32-bit, but that should be fine - fd_in is a registered file when SPLICE_F_FD_IN_FIXED is set, which is a splice flag (i.e. in sqe->splice_flags).
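To make the field mapping concrete, here is a hand-rolled userspace sketch (not liburing; prep_splice is hypothetical, and it assumes the uapi fields added by this patch) of filling an SQE for the new opcode:

    #include <string.h>
    #include <linux/io_uring.h>

    static void prep_splice(struct io_uring_sqe *sqe, int fd_in,
                            unsigned long long off_in, int fd_out,
                            unsigned long long off_out, unsigned nbytes,
                            unsigned splice_flags)
    {
            memset(sqe, 0, sizeof(*sqe));
            sqe->opcode = IORING_OP_SPLICE;
            sqe->splice_fd_in = fd_in;    /* fixed-file index if
                                             SPLICE_F_FD_IN_FIXED is set */
            sqe->splice_off_in = off_in;  /* (__u64)-1 == no offset,
                                             like splice(2)'s NULL */
            sqe->fd = fd_out;             /* output file, as other ops */
            sqe->off = off_out;
            sqe->len = nbytes;
            sqe->splice_flags = splice_flags;
    }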
Signed-off-by: Pavel Begunkov asml.silence@gmail.com Signed-off-by: Jens Axboe axboe@kernel.dk Signed-off-by: yangerkun yangerkun@huawei.com Reviewed-by: zhangyi (F) yi.zhang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- fs/io_uring.c | 109 ++++++++++++++++++++++++++++++++++ include/uapi/linux/io_uring.h | 14 ++++- 2 files changed, 122 insertions(+), 1 deletion(-)
diff --git a/fs/io_uring.c b/fs/io_uring.c index 3fbc9f02f630..cdfcc578fe6b 100644 --- a/fs/io_uring.c +++ b/fs/io_uring.c @@ -76,6 +76,7 @@ #include <linux/fadvise.h> #include <linux/eventpoll.h> #include <linux/fs_struct.h> +#include <linux/splice.h>
#define CREATE_TRACE_POINTS #include <trace/events/io_uring.h> @@ -431,6 +432,15 @@ struct io_epoll { struct epoll_event event; };
+struct io_splice { + struct file *file_out; + struct file *file_in; + loff_t off_out; + loff_t off_in; + u64 len; + unsigned int flags; +}; + struct io_async_connect { struct sockaddr_storage address; }; @@ -547,6 +557,7 @@ struct io_kiocb { struct io_fadvise fadvise; struct io_madvise madvise; struct io_epoll epoll; + struct io_splice splice; };
struct io_async_ctx *io; @@ -741,6 +752,11 @@ static const struct io_op_def io_op_defs[] = { .unbound_nonreg_file = 1, .file_table = 1, }, + [IORING_OP_SPLICE] = { + .needs_file = 1, + .hash_reg_file = 1, + .unbound_nonreg_file = 1, + } };
static void io_wq_submit_work(struct io_wq_work **workptr); @@ -755,6 +771,10 @@ static int __io_sqe_files_update(struct io_ring_ctx *ctx, static int io_grab_files(struct io_kiocb *req); static void io_ring_file_ref_flush(struct fixed_file_data *data); static void io_cleanup_req(struct io_kiocb *req); +static int io_file_get(struct io_submit_state *state, + struct io_kiocb *req, + int fd, struct file **out_file, + bool fixed);
static struct kmem_cache *req_cachep;
@@ -2401,6 +2421,77 @@ static int io_write(struct io_kiocb *req, struct io_kiocb **nxt, return ret; }
+static int io_splice_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe) +{ + struct io_splice* sp = &req->splice; + unsigned int valid_flags = SPLICE_F_FD_IN_FIXED | SPLICE_F_ALL; + int ret; + + if (req->flags & REQ_F_NEED_CLEANUP) + return 0; + + sp->file_in = NULL; + sp->off_in = READ_ONCE(sqe->splice_off_in); + sp->off_out = READ_ONCE(sqe->off); + sp->len = READ_ONCE(sqe->len); + sp->flags = READ_ONCE(sqe->splice_flags); + + if (unlikely(sp->flags & ~valid_flags)) + return -EINVAL; + + ret = io_file_get(NULL, req, READ_ONCE(sqe->splice_fd_in), &sp->file_in, + (sp->flags & SPLICE_F_FD_IN_FIXED)); + if (ret) + return ret; + req->flags |= REQ_F_NEED_CLEANUP; + + if (!S_ISREG(file_inode(sp->file_in)->i_mode)) + req->work.flags |= IO_WQ_WORK_UNBOUND; + + return 0; +} + +static bool io_splice_punt(struct file *file) +{ + if (get_pipe_info(file)) + return false; + if (!io_file_supports_async(file)) + return true; + return !(file->f_mode & O_NONBLOCK); +} + +static int io_splice(struct io_kiocb *req, struct io_kiocb **nxt, + bool force_nonblock) +{ + struct io_splice *sp = &req->splice; + struct file *in = sp->file_in; + struct file *out = sp->file_out; + unsigned int flags = sp->flags & ~SPLICE_F_FD_IN_FIXED; + loff_t *poff_in, *poff_out; + long ret; + + if (force_nonblock) { + if (io_splice_punt(in) || io_splice_punt(out)) + return -EAGAIN; + flags |= SPLICE_F_NONBLOCK; + } + + poff_in = (sp->off_in == -1) ? NULL : &sp->off_in; + poff_out = (sp->off_out == -1) ? NULL : &sp->off_out; + ret = do_splice(in, poff_in, out, poff_out, sp->len, flags); + if (force_nonblock && ret == -EAGAIN) + return -EAGAIN; + + io_put_file(req, in, (sp->flags & SPLICE_F_FD_IN_FIXED)); + req->flags &= ~REQ_F_NEED_CLEANUP; + + io_cqring_add_event(req, ret); + if (ret != sp->len) + req_set_fail_links(req); + io_put_req_find_next(req, nxt); + return 0; +} + /* * IORING_OP_NOP just posts a completion event, nothing else. */ @@ -4182,6 +4273,9 @@ static int io_req_defer_prep(struct io_kiocb *req, case IORING_OP_EPOLL_CTL: ret = io_epoll_ctl_prep(req, sqe); break; + case IORING_OP_SPLICE: + ret = io_splice_prep(req, sqe); + break; default: printk_once(KERN_WARNING "io_uring: unhandled opcode %d\n", req->opcode); @@ -4243,6 +4337,10 @@ static void io_cleanup_req(struct io_kiocb *req) case IORING_OP_STATX: putname(req->open.filename); break; + case IORING_OP_SPLICE: + io_put_file(req, req->splice.file_in, + (req->splice.flags & SPLICE_F_FD_IN_FIXED)); + break; }
req->flags &= ~REQ_F_NEED_CLEANUP; @@ -4438,6 +4536,14 @@ static int io_issue_sqe(struct io_kiocb *req, const struct io_uring_sqe *sqe, } ret = io_epoll_ctl(req, nxt, force_nonblock); break; + case IORING_OP_SPLICE: + if (sqe) { + ret = io_splice_prep(req, sqe); + if (ret < 0) + break; + } + ret = io_splice(req, nxt, force_nonblock); + break; default: ret = -EINVAL; break; @@ -7196,6 +7302,7 @@ static int __init io_uring_init(void) BUILD_BUG_SQE_ELEM(8, __u64, off); BUILD_BUG_SQE_ELEM(8, __u64, addr2); BUILD_BUG_SQE_ELEM(16, __u64, addr); + BUILD_BUG_SQE_ELEM(16, __u64, splice_off_in); BUILD_BUG_SQE_ELEM(24, __u32, len); BUILD_BUG_SQE_ELEM(28, __kernel_rwf_t, rw_flags); BUILD_BUG_SQE_ELEM(28, /* compat */ int, rw_flags); @@ -7210,9 +7317,11 @@ static int __init io_uring_init(void) BUILD_BUG_SQE_ELEM(28, __u32, open_flags); BUILD_BUG_SQE_ELEM(28, __u32, statx_flags); BUILD_BUG_SQE_ELEM(28, __u32, fadvise_advice); + BUILD_BUG_SQE_ELEM(28, __u32, splice_flags); BUILD_BUG_SQE_ELEM(32, __u64, user_data); BUILD_BUG_SQE_ELEM(40, __u16, buf_index); BUILD_BUG_SQE_ELEM(42, __u16, personality); + BUILD_BUG_SQE_ELEM(44, __s32, splice_fd_in);
BUILD_BUG_ON(ARRAY_SIZE(io_op_defs) != IORING_OP_LAST); req_cachep = KMEM_CACHE(io_kiocb, SLAB_HWCACHE_ALIGN | SLAB_PANIC); diff --git a/include/uapi/linux/io_uring.h b/include/uapi/linux/io_uring.h index 90fed30a38b7..6c607e42db68 100644 --- a/include/uapi/linux/io_uring.h +++ b/include/uapi/linux/io_uring.h @@ -23,7 +23,10 @@ struct io_uring_sqe { __u64 off; /* offset into file */ __u64 addr2; }; - __u64 addr; /* pointer to buffer or iovecs */ + union { + __u64 addr; /* pointer to buffer or iovecs */ + __u64 splice_off_in; + }; __u32 len; /* buffer size or number of iovecs */ union { __kernel_rwf_t rw_flags; @@ -37,6 +40,7 @@ struct io_uring_sqe { __u32 open_flags; __u32 statx_flags; __u32 fadvise_advice; + __u32 splice_flags; }; __u64 user_data; /* data to be passed back at completion time */ union { @@ -45,6 +49,7 @@ struct io_uring_sqe { __u16 buf_index; /* personality to use, if used */ __u16 personality; + __s32 splice_fd_in; }; __u64 __pad2[3]; }; @@ -112,6 +117,7 @@ enum { IORING_OP_SEND, IORING_OP_RECV, IORING_OP_EPOLL_CTL, + IORING_OP_SPLICE,
/* this goes last, obviously */ IORING_OP_LAST, @@ -127,6 +133,12 @@ enum { */ #define IORING_TIMEOUT_ABS (1U << 0)
+/* + * sqe->splice_flags + * extends splice(2) flags + */ +#define SPLICE_F_FD_IN_FIXED (1U << 31) /* the last bit of __u32 */ + /* * IO completion data structure (Completion Queue Entry) */
From: Pavel Begunkov asml.silence@gmail.com
mainline inclusion from mainline-5.7-rc1 commit b0a20349f212dc725f5ddfd060e426fe6181d9c5 category: feature bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27 CVE: NA ---------------------------
Deduplicate the call to io_cqring_fill_event(). Plain and easy.
Signed-off-by: Pavel Begunkov asml.silence@gmail.com Signed-off-by: Jens Axboe axboe@kernel.dk Signed-off-by: yangerkun yangerkun@huawei.com Reviewed-by: zhangyi (F) yi.zhang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- fs/io_uring.c | 5 +---- 1 file changed, 1 insertion(+), 4 deletions(-)
diff --git a/fs/io_uring.c b/fs/io_uring.c index cdfcc578fe6b..40f6e95b471a 100644 --- a/fs/io_uring.c +++ b/fs/io_uring.c @@ -3593,10 +3593,7 @@ static void io_poll_complete(struct io_kiocb *req, __poll_t mask, int error) struct io_ring_ctx *ctx = req->ctx;
req->poll.done = true; - if (error) - io_cqring_fill_event(req, error); - else - io_cqring_fill_event(req, mangle_poll(mask)); + io_cqring_fill_event(req, error ? error : mangle_poll(mask)); io_commit_cqring(ctx); }
From: Pavel Begunkov asml.silence@gmail.com
mainline inclusion from mainline-5.7-rc1 commit 02d27d895323c4baa3234e4bed015eb3a196e1dd category: feature bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27 CVE: NA ---------------------------
io_recvmsg() and io_sendmsg() duplicate the nonblock -EAGAIN finalising part, so add a helper for that.
Signed-off-by: Pavel Begunkov asml.silence@gmail.com Signed-off-by: Jens Axboe axboe@kernel.dk Signed-off-by: yangerkun yangerkun@huawei.com Reviewed-by: zhangyi (F) yi.zhang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- fs/io_uring.c | 43 +++++++++++++++++++------------------------ 1 file changed, 19 insertions(+), 24 deletions(-)
diff --git a/fs/io_uring.c b/fs/io_uring.c index 40f6e95b471a..7891229e1b5c 100644 --- a/fs/io_uring.c +++ b/fs/io_uring.c @@ -3044,6 +3044,21 @@ static int io_sync_file_range(struct io_kiocb *req, struct io_kiocb **nxt, return 0; }
+static int io_setup_async_msg(struct io_kiocb *req, + struct io_async_msghdr *kmsg) +{ + if (req->io) + return -EAGAIN; + if (io_alloc_async_ctx(req)) { + if (kmsg->iov != kmsg->fast_iov) + kfree(kmsg->iov); + return -ENOMEM; + } + req->flags |= REQ_F_NEED_CLEANUP; + memcpy(&req->io->msg, kmsg, sizeof(*kmsg)); + return -EAGAIN; +} + static int io_sendmsg_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe) { #if defined(CONFIG_NET) @@ -3120,18 +3135,8 @@ static int io_sendmsg(struct io_kiocb *req, struct io_kiocb **nxt, flags |= MSG_DONTWAIT;
ret = __sys_sendmsg_sock(sock, &kmsg->msg, flags); - if (force_nonblock && ret == -EAGAIN) { - if (req->io) - return -EAGAIN; - if (io_alloc_async_ctx(req)) { - if (kmsg->iov != kmsg->fast_iov) - kfree(kmsg->iov); - return -ENOMEM; - } - req->flags |= REQ_F_NEED_CLEANUP; - memcpy(&req->io->msg, &io.msg, sizeof(io.msg)); - return -EAGAIN; - } + if (force_nonblock && ret == -EAGAIN) + return io_setup_async_msg(req, kmsg); if (ret == -ERESTARTSYS) ret = -EINTR; } @@ -3279,18 +3284,8 @@ static int io_recvmsg(struct io_kiocb *req, struct io_kiocb **nxt,
ret = __sys_recvmsg_sock(sock, &kmsg->msg, req->sr_msg.msg, kmsg->uaddr, flags); - if (force_nonblock && ret == -EAGAIN) { - if (req->io) - return -EAGAIN; - if (io_alloc_async_ctx(req)) { - if (kmsg->iov != kmsg->fast_iov) - kfree(kmsg->iov); - return -ENOMEM; - } - memcpy(&req->io->msg, &io.msg, sizeof(io.msg)); - req->flags |= REQ_F_NEED_CLEANUP; - return -EAGAIN; - } + if (force_nonblock && ret == -EAGAIN) + return io_setup_async_msg(req, kmsg); if (ret == -ERESTARTSYS) ret = -EINTR; }
From: Pavel Begunkov asml.silence@gmail.com
mainline inclusion from mainline-5.7-rc1 commit e85530ddda4f08d4f9ed6506d4a1f42e086e3b21 category: feature bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27 CVE: NA ---------------------------
IO_WQ_WORK_HAS_MM is set but never used; remove it.
Signed-off-by: Pavel Begunkov asml.silence@gmail.com Signed-off-by: Jens Axboe axboe@kernel.dk Signed-off-by: yangerkun yangerkun@huawei.com Reviewed-by: zhangyi (F) yi.zhang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- fs/io-wq.c | 2 -- fs/io-wq.h | 1 - 2 files changed, 3 deletions(-)
diff --git a/fs/io-wq.c b/fs/io-wq.c index 587815b8b088..90767828ad01 100644 --- a/fs/io-wq.c +++ b/fs/io-wq.c @@ -500,8 +500,6 @@ static void io_worker_handle_work(struct io_worker *worker) */ if (test_bit(IO_WQ_BIT_CANCEL, &wq->state)) work->flags |= IO_WQ_WORK_CANCEL; - if (worker->mm) - work->flags |= IO_WQ_WORK_HAS_MM;
if (wq->get_work) { put_work = work; diff --git a/fs/io-wq.h b/fs/io-wq.h index e5e15f2c93ec..d500d88ab84e 100644 --- a/fs/io-wq.h +++ b/fs/io-wq.h @@ -5,7 +5,6 @@ struct io_wq;
enum { IO_WQ_WORK_CANCEL = 1, - IO_WQ_WORK_HAS_MM = 2, IO_WQ_WORK_HASHED = 4, IO_WQ_WORK_UNBOUND = 32, IO_WQ_WORK_CB = 128,
From: Pavel Begunkov asml.silence@gmail.com
mainline inclusion from mainline-5.7-rc1 commit 5eae8619907a1389dbd1b4a1049caf52782c0916 category: feature bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27 CVE: NA ---------------------------
IO_WQ_WORK_CB is used only for linked timeouts, which will be armed before the work setup (i.e. mm, override creds, etc). The setup shouldn't take long, so it's ok to arm it a bit later and get rid of IO_WQ_WORK_CB.
Make io-wq call work->func() only once; callbacks will handle the rest, i.e. the linked timeout handler will do the actual issue. And as a bonus, it removes an extra indirect call.
Signed-off-by: Pavel Begunkov asml.silence@gmail.com Signed-off-by: Jens Axboe axboe@kernel.dk Signed-off-by: yangerkun yangerkun@huawei.com Reviewed-by: zhangyi (F) yi.zhang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- fs/io-wq.c | 3 --- fs/io-wq.h | 1 - fs/io_uring.c | 3 +-- 3 files changed, 1 insertion(+), 6 deletions(-)
diff --git a/fs/io-wq.c b/fs/io-wq.c index 90767828ad01..f3894022d467 100644 --- a/fs/io-wq.c +++ b/fs/io-wq.c @@ -480,9 +480,6 @@ static void io_worker_handle_work(struct io_worker *worker) worker->cur_work = work; spin_unlock_irq(&worker->lock);
- if (work->flags & IO_WQ_WORK_CB) - work->func(&work); - if (work->files && current->files != work->files) { task_lock(current); current->files = work->files; diff --git a/fs/io-wq.h b/fs/io-wq.h index d500d88ab84e..a0978d6958f0 100644 --- a/fs/io-wq.h +++ b/fs/io-wq.h @@ -7,7 +7,6 @@ enum { IO_WQ_WORK_CANCEL = 1, IO_WQ_WORK_HASHED = 4, IO_WQ_WORK_UNBOUND = 32, - IO_WQ_WORK_CB = 128, IO_WQ_WORK_NO_CANCEL = 256, IO_WQ_WORK_CONCURRENT = 512,
diff --git a/fs/io_uring.c b/fs/io_uring.c index 7891229e1b5c..a3f93c6ebe4d 100644 --- a/fs/io_uring.c +++ b/fs/io_uring.c @@ -2546,7 +2546,7 @@ static void io_link_work_cb(struct io_wq_work **workptr) struct io_kiocb *link = work->data;
io_queue_linked_timeout(link); - work->func = io_wq_submit_work; + io_wq_submit_work(workptr); }
static void io_wq_assign_next(struct io_wq_work **workptr, struct io_kiocb *nxt) @@ -2556,7 +2556,6 @@ static void io_wq_assign_next(struct io_wq_work **workptr, struct io_kiocb *nxt) io_prep_next_work(nxt, &link); *workptr = &nxt->work; if (link) { - nxt->work.flags |= IO_WQ_WORK_CB; nxt->work.func = io_link_work_cb; nxt->work.data = link; }
From: Pavel Begunkov asml.silence@gmail.com
mainline inclusion from mainline-5.7-rc1 commit 3684f24653534c71c7dc9f44d7281a838f4e4979 category: feature bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27 CVE: NA ---------------------------
@hash_map is unsigned long, but BIT_ULL() is used for manipulations. BIT() is a better match, as it returns exactly an unsigned long value.
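The two definitions in question, as in include/linux/bits.h with the UL()/ULL() wrappers expanded:

    #define BIT(nr)     (1UL << (nr))
    #define BIT_ULL(nr) (1ULL << (nr))

    /* On 64-bit both are 64 bits wide, so the change is cosmetic there.
     * On 32-bit, BIT_ULL() yields a 64-bit value whose upper bits are
     * silently dropped when combined with an unsigned long hash_map;
     * BIT() keeps the operand types matched. */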
Signed-off-by: Pavel Begunkov asml.silence@gmail.com Signed-off-by: Jens Axboe axboe@kernel.dk Signed-off-by: yangerkun yangerkun@huawei.com Reviewed-by: zhangyi (F) yi.zhang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- fs/io-wq.c | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-)
diff --git a/fs/io-wq.c b/fs/io-wq.c index f3894022d467..0ca2b17c82f9 100644 --- a/fs/io-wq.c +++ b/fs/io-wq.c @@ -394,8 +394,8 @@ static struct io_wq_work *io_get_next_work(struct io_wqe *wqe, unsigned *hash)
/* hashed, can run if not already running */ *hash = work->flags >> IO_WQ_HASH_SHIFT; - if (!(wqe->hash_map & BIT_ULL(*hash))) { - wqe->hash_map |= BIT_ULL(*hash); + if (!(wqe->hash_map & BIT(*hash))) { + wqe->hash_map |= BIT(*hash); wq_node_del(&wqe->work_list, node, prev); return work; } @@ -513,7 +513,7 @@ static void io_worker_handle_work(struct io_worker *worker) spin_lock_irq(&wqe->lock);
if (hash != -1U) { - wqe->hash_map &= ~BIT_ULL(hash); + wqe->hash_map &= ~BIT(hash); wqe->flags &= ~IO_WQE_FLAG_STALLED; } if (work && work != old_work) {
From: Oleg Nesterov oleg@redhat.com
mainline inclusion from mainline-5.7-rc1 commit 6fb614920b38bbf3c1c7fcd944c6d9b5d746103d category: feature bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27 CVE: NA ---------------------------
As Peter pointed out, task_work() can avoid ->pi_lock and cmpxchg() if task->task_works == NULL && !PF_EXITING.
And in fact the only reason task_work_run() needs ->pi_lock is the possible race with task_work_cancel(); we can optimize this code and make the locking clearer.
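The resulting code (see the diff) takes and immediately drops ->pi_lock with nothing in between. That's a deliberate idiom, sketched here in userspace terms (names are illustrative):

    #include <pthread.h>

    /* "Empty critical section" idiom: locking then immediately
     * unlocking doesn't protect any data. It acts as a barrier that
     * waits for any critical section already in progress on another
     * CPU (here: task_work_cancel() holding ->pi_lock) to finish
     * before we proceed. */
    static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

    static void wait_out_cancelers(void)
    {
            pthread_mutex_lock(&lock);
            pthread_mutex_unlock(&lock);
    }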
Signed-off-by: Oleg Nesterov oleg@redhat.com Signed-off-by: Jens Axboe axboe@kernel.dk Signed-off-by: yangerkun yangerkun@huawei.com Reviewed-by: zhangyi (F) yi.zhang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- kernel/task_work.c | 18 ++++++++++++++---- 1 file changed, 14 insertions(+), 4 deletions(-)
diff --git a/kernel/task_work.c b/kernel/task_work.c index 0fef395662a6..825f28259a19 100644 --- a/kernel/task_work.c +++ b/kernel/task_work.c @@ -97,16 +97,26 @@ void task_work_run(void) * work->func() can do task_work_add(), do not set * work_exited unless the list is empty. */ - raw_spin_lock_irq(&task->pi_lock); do { + head = NULL; work = READ_ONCE(task->task_works); - head = !work && (task->flags & PF_EXITING) ? - &work_exited : NULL; + if (!work) { + if (task->flags & PF_EXITING) + head = &work_exited; + else + break; + } } while (cmpxchg(&task->task_works, work, head) != work); - raw_spin_unlock_irq(&task->pi_lock);
if (!work) break; + /* + * Synchronize with task_work_cancel(). It can not remove + * the first entry == work, cmpxchg(task_works) must fail. + * But it can remove another entry from the ->next list. + */ + raw_spin_lock_irq(&task->pi_lock); + raw_spin_unlock_irq(&task->pi_lock);
do { next = work->next;
From: Jens Axboe axboe@kernel.dk
mainline inclusion from mainline-5.7-rc1 commit c2f2eb7d2c1cdc37fa9633bae96f381d33ee7a14 category: feature bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27 CVE: NA ---------------------------
Store the io_kiocb in the private field instead of the poll entry; this is in preparation for allowing multiple waitqueues.
No functional changes in this patch.
Signed-off-by: Jens Axboe axboe@kernel.dk Signed-off-by: yangerkun yangerkun@huawei.com Reviewed-by: zhangyi (F) yi.zhang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- fs/io_uring.c | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-)
diff --git a/fs/io_uring.c b/fs/io_uring.c index a3f93c6ebe4d..f845a9a55f02 100644 --- a/fs/io_uring.c +++ b/fs/io_uring.c @@ -3682,8 +3682,8 @@ static void io_poll_trigger_evfd(struct io_wq_work **workptr) static int io_poll_wake(struct wait_queue_entry *wait, unsigned mode, int sync, void *key) { - struct io_poll_iocb *poll = wait->private; - struct io_kiocb *req = container_of(poll, struct io_kiocb, poll); + struct io_kiocb *req = wait->private; + struct io_poll_iocb *poll = &req->poll; struct io_ring_ctx *ctx = req->ctx; __poll_t mask = key_to_poll(key);
@@ -3806,7 +3806,7 @@ static int io_poll_add(struct io_kiocb *req, struct io_kiocb **nxt) /* initialized the list so that we can do list_empty checks */ INIT_LIST_HEAD(&poll->wait.entry); init_waitqueue_func_entry(&poll->wait, io_poll_wake); - poll->wait.private = poll; + poll->wait.private = req;
INIT_LIST_HEAD(&req->list);
From: Jens Axboe axboe@kernel.dk
mainline inclusion from mainline-5.7-rc1 commit b41e98524e424d104aa7851d54fd65820759875a category: feature bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27 CVE: NA ---------------------------
For poll requests, it's not uncommon to link a read (or write) after the poll to execute immediately after the file is marked as ready. Since the poll completion is called inside the waitqueue wake up handler, we have to punt that linked request to async context. This slows down the processing, and actually means it's faster to not use a link for this use case.
We also run into problems if the completion_lock is contended, as we're doing a different lock ordering than the issue side is. Hence we have to do trylock for completion, and if that fails, go async. Poll removal needs to go async as well, for the same reason.
eventfd notification needs special casing as well, to avoid stack-blowing recursion or deadlocks.
These are all deficiencies that were inherited from the aio poll implementation, but I think we can do better. When a poll completes, simply queue it up in the task poll list. When the task completes the list, we can run dependent links inline as well. This means we never have to go async, and we can remove a bunch of code associated with that, and optimizations to try and make that run faster. The diffstat speaks for itself.
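The rework leans on the task_work API; a sketch of its shape as used by this series (simplified, and the helper names here are illustrative rather than kernel functions):

    #include <linux/sched.h>
    #include <linux/task_work.h>

    static void poll_done_cb(struct callback_head *cb)
    {
            /* Runs in the context of the owning task, where sleeping
             * and the normal completion_lock ordering are both fine —
             * unlike the waitqueue wakeup handler. */
    }

    static void queue_poll_completion(struct task_struct *tsk,
                                      struct callback_head *work)
    {
            init_task_work(work, poll_done_cb);
            task_work_add(tsk, work, true);  /* true: notify/kick the task */
            wake_up_process(tsk);
    }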
Signed-off-by: Jens Axboe axboe@kernel.dk Signed-off-by: yangerkun yangerkun@huawei.com Reviewed-by: zhangyi (F) yi.zhang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- fs/io_uring.c | 218 ++++++++++++++++++-------------------------------- 1 file changed, 76 insertions(+), 142 deletions(-)
diff --git a/fs/io_uring.c b/fs/io_uring.c index f845a9a55f02..df66bf2ea600 100644 --- a/fs/io_uring.c +++ b/fs/io_uring.c @@ -77,6 +77,7 @@ #include <linux/eventpoll.h> #include <linux/fs_struct.h> #include <linux/splice.h> +#include <linux/task_work.h>
#define CREATE_TRACE_POINTS #include <trace/events/io_uring.h> @@ -291,7 +292,6 @@ struct io_ring_ctx {
struct { spinlock_t completion_lock; - struct llist_head poll_llist;
/* * ->poll_list is protected by the ctx->uring_lock for @@ -561,10 +561,6 @@ struct io_kiocb { };
struct io_async_ctx *io; - /* - * llist_node is only used for poll deferred completions - */ - struct llist_node llist_node; bool needs_fixed_file; u8 opcode;
@@ -582,7 +578,17 @@ struct io_kiocb {
struct list_head inflight_entry;
- struct io_wq_work work; + union { + /* + * Only commands that never go async can use the below fields, + * obviously. Right now only IORING_OP_POLL_ADD uses them. + */ + struct { + struct task_struct *task; + struct callback_head task_work; + }; + struct io_wq_work work; + }; };
#define IO_PLUG_THRESHOLD 2 @@ -771,10 +777,10 @@ static int __io_sqe_files_update(struct io_ring_ctx *ctx, static int io_grab_files(struct io_kiocb *req); static void io_ring_file_ref_flush(struct fixed_file_data *data); static void io_cleanup_req(struct io_kiocb *req); -static int io_file_get(struct io_submit_state *state, - struct io_kiocb *req, - int fd, struct file **out_file, - bool fixed); +static int io_file_get(struct io_submit_state *state, struct io_kiocb *req, + int fd, struct file **out_file, bool fixed); +static void __io_queue_sqe(struct io_kiocb *req, + const struct io_uring_sqe *sqe);
static struct kmem_cache *req_cachep;
@@ -844,7 +850,6 @@ static struct io_ring_ctx *io_ring_ctx_alloc(struct io_uring_params *p) mutex_init(&ctx->uring_lock); init_waitqueue_head(&ctx->wait); spin_lock_init(&ctx->completion_lock); - init_llist_head(&ctx->poll_llist); INIT_LIST_HEAD(&ctx->poll_list); INIT_LIST_HEAD(&ctx->defer_list); INIT_LIST_HEAD(&ctx->timeout_list); @@ -1078,24 +1083,19 @@ static inline bool io_should_trigger_evfd(struct io_ring_ctx *ctx) return false; if (!ctx->eventfd_async) return true; - return io_wq_current_is_worker() || in_interrupt(); + return io_wq_current_is_worker(); }
-static void __io_cqring_ev_posted(struct io_ring_ctx *ctx, bool trigger_ev) +static void io_cqring_ev_posted(struct io_ring_ctx *ctx) { if (waitqueue_active(&ctx->wait)) wake_up(&ctx->wait); if (waitqueue_active(&ctx->sqo_wait)) wake_up(&ctx->sqo_wait); - if (trigger_ev) + if (io_should_trigger_evfd(ctx)) eventfd_signal(ctx->cq_ev_fd, 1); }
-static void io_cqring_ev_posted(struct io_ring_ctx *ctx) -{ - __io_cqring_ev_posted(ctx, io_should_trigger_evfd(ctx)); -} - /* Returns true if there are no backlogged entries after the flush */ static bool io_cqring_overflow_flush(struct io_ring_ctx *ctx, bool force) { @@ -3500,18 +3500,27 @@ static int io_connect(struct io_kiocb *req, struct io_kiocb **nxt, #endif }
-static void io_poll_remove_one(struct io_kiocb *req) +static bool io_poll_remove_one(struct io_kiocb *req) { struct io_poll_iocb *poll = &req->poll; + bool do_complete = false;
spin_lock(&poll->head->lock); WRITE_ONCE(poll->canceled, true); if (!list_empty(&poll->wait.entry)) { list_del_init(&poll->wait.entry); - io_queue_async_work(req); + do_complete = true; } spin_unlock(&poll->head->lock); hash_del(&req->hash_node); + if (do_complete) { + io_cqring_fill_event(req, -ECANCELED); + io_commit_cqring(req->ctx); + req->flags |= REQ_F_COMP_LOCKED; + io_put_req(req); + } + + return do_complete; }
static void io_poll_remove_all(struct io_ring_ctx *ctx) @@ -3529,6 +3538,8 @@ static void io_poll_remove_all(struct io_ring_ctx *ctx) io_poll_remove_one(req); } spin_unlock_irq(&ctx->completion_lock); + + io_cqring_ev_posted(ctx); }
static int io_poll_cancel(struct io_ring_ctx *ctx, __u64 sqe_addr) @@ -3538,10 +3549,11 @@ static int io_poll_cancel(struct io_ring_ctx *ctx, __u64 sqe_addr)
list = &ctx->cancel_hash[hash_long(sqe_addr, ctx->cancel_hash_bits)]; hlist_for_each_entry(req, list, hash_node) { - if (sqe_addr == req->user_data) { - io_poll_remove_one(req); + if (sqe_addr != req->user_data) + continue; + if (io_poll_remove_one(req)) return 0; - } + return -EALREADY; }
return -ENOENT; @@ -3591,92 +3603,28 @@ static void io_poll_complete(struct io_kiocb *req, __poll_t mask, int error) io_commit_cqring(ctx); }
-static void io_poll_complete_work(struct io_wq_work **workptr) +static void io_poll_task_handler(struct io_kiocb *req, struct io_kiocb **nxt) { - struct io_wq_work *work = *workptr; - struct io_kiocb *req = container_of(work, struct io_kiocb, work); - struct io_poll_iocb *poll = &req->poll; - struct poll_table_struct pt = { ._key = poll->events }; struct io_ring_ctx *ctx = req->ctx; - struct io_kiocb *nxt = NULL; - __poll_t mask = 0; - int ret = 0;
- if (work->flags & IO_WQ_WORK_CANCEL) { - WRITE_ONCE(poll->canceled, true); - ret = -ECANCELED; - } else if (READ_ONCE(poll->canceled)) { - ret = -ECANCELED; - } - - if (ret != -ECANCELED) - mask = vfs_poll(poll->file, &pt) & poll->events; - - /* - * Note that ->ki_cancel callers also delete iocb from active_reqs after - * calling ->ki_cancel. We need the ctx_lock roundtrip here to - * synchronize with them. In the cancellation case the list_del_init - * itself is not actually needed, but harmless so we keep it in to - * avoid further branches in the fast path. - */ spin_lock_irq(&ctx->completion_lock); - if (!mask && ret != -ECANCELED) { - add_wait_queue(poll->head, &poll->wait); - spin_unlock_irq(&ctx->completion_lock); - return; - } hash_del(&req->hash_node); - io_poll_complete(req, mask, ret); - spin_unlock_irq(&ctx->completion_lock); - - io_cqring_ev_posted(ctx); - - if (ret < 0) - req_set_fail_links(req); - io_put_req_find_next(req, &nxt); - if (nxt) - io_wq_assign_next(workptr, nxt); -} - -static void __io_poll_flush(struct io_ring_ctx *ctx, struct llist_node *nodes) -{ - struct io_kiocb *req, *tmp; - struct req_batch rb; - - rb.to_free = rb.need_iter = 0; - spin_lock_irq(&ctx->completion_lock); - llist_for_each_entry_safe(req, tmp, nodes, llist_node) { - hash_del(&req->hash_node); - io_poll_complete(req, req->result, 0); - - if (refcount_dec_and_test(&req->refs) && - !io_req_multi_free(&rb, req)) { - req->flags |= REQ_F_COMP_LOCKED; - io_free_req(req); - } - } + io_poll_complete(req, req->result, 0); + req->flags |= REQ_F_COMP_LOCKED; + io_put_req_find_next(req, nxt); spin_unlock_irq(&ctx->completion_lock);
io_cqring_ev_posted(ctx); - io_free_req_many(ctx, &rb); -} - -static void io_poll_flush(struct io_wq_work **workptr) -{ - struct io_kiocb *req = container_of(*workptr, struct io_kiocb, work); - struct llist_node *nodes; - - nodes = llist_del_all(&req->ctx->poll_llist); - if (nodes) - __io_poll_flush(req->ctx, nodes); }
-static void io_poll_trigger_evfd(struct io_wq_work **workptr) +static void io_poll_task_func(struct callback_head *cb) { - struct io_kiocb *req = container_of(*workptr, struct io_kiocb, work); + struct io_kiocb *req = container_of(cb, struct io_kiocb, task_work); + struct io_kiocb *nxt = NULL;
- eventfd_signal(req->ctx->cq_ev_fd, 1); - io_put_req(req); + io_poll_task_handler(req, &nxt); + if (nxt) + __io_queue_sqe(nxt, NULL); }
static int io_poll_wake(struct wait_queue_entry *wait, unsigned mode, int sync, @@ -3684,8 +3632,8 @@ static int io_poll_wake(struct wait_queue_entry *wait, unsigned mode, int sync, { struct io_kiocb *req = wait->private; struct io_poll_iocb *poll = &req->poll; - struct io_ring_ctx *ctx = req->ctx; __poll_t mask = key_to_poll(key); + struct task_struct *tsk;
/* for instances that support it check for an event match first: */ if (mask && !(mask & poll->events)) @@ -3693,46 +3641,11 @@ static int io_poll_wake(struct wait_queue_entry *wait, unsigned mode, int sync,
list_del_init(&poll->wait.entry);
- /* - * Run completion inline if we can. We're using trylock here because - * we are violating the completion_lock -> poll wq lock ordering. - * If we have a link timeout we're going to need the completion_lock - * for finalizing the request, mark us as having grabbed that already. - */ - if (mask) { - unsigned long flags; - - if (llist_empty(&ctx->poll_llist) && - spin_trylock_irqsave(&ctx->completion_lock, flags)) { - bool trigger_ev; - - hash_del(&req->hash_node); - io_poll_complete(req, mask, 0); - - trigger_ev = io_should_trigger_evfd(ctx); - if (trigger_ev && eventfd_signal_count()) { - trigger_ev = false; - req->work.func = io_poll_trigger_evfd; - } else { - req->flags |= REQ_F_COMP_LOCKED; - io_put_req(req); - req = NULL; - } - spin_unlock_irqrestore(&ctx->completion_lock, flags); - __io_cqring_ev_posted(ctx, trigger_ev); - } else { - req->result = mask; - req->llist_node.next = NULL; - /* if the list wasn't empty, we're done */ - if (!llist_add(&req->llist_node, &ctx->poll_llist)) - req = NULL; - else - req->work.func = io_poll_flush; - } - } - if (req) - io_queue_async_work(req); - + tsk = req->task; + req->result = mask; + init_task_work(&req->task_work, io_poll_task_func); + task_work_add(tsk, &req->task_work, true); + wake_up_process(tsk); return 1; }
@@ -3780,6 +3693,9 @@ static int io_poll_add_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe
events = READ_ONCE(sqe->poll_events); poll->events = demangle_poll(events) | EPOLLERR | EPOLLHUP; + + /* task will wait for requests on exit, don't need a ref */ + req->task = current; return 0; }
@@ -3791,7 +3707,6 @@ static int io_poll_add(struct io_kiocb *req, struct io_kiocb **nxt) bool cancel = false; __poll_t mask;
- INIT_IO_WORK(&req->work, io_poll_complete_work); INIT_HLIST_NODE(&req->hash_node);
poll->head = NULL; @@ -5216,6 +5131,8 @@ static int io_sq_thread(void *data) if (!list_empty(&ctx->poll_list) || (!time_after(jiffies, timeout) && ret != -EBUSY && !percpu_ref_is_dying(&ctx->refs))) { + if (current->task_works) + task_work_run(); cond_resched(); continue; } @@ -5247,6 +5164,10 @@ static int io_sq_thread(void *data) finish_wait(&ctx->sqo_wait, &wait); break; } + if (current->task_works) { + task_work_run(); + continue; + } if (signal_pending(current)) flush_signals(current); schedule(); @@ -5266,6 +5187,9 @@ static int io_sq_thread(void *data) timeout = jiffies + ctx->sq_thread_idle; }
+ if (current->task_works) + task_work_run(); + set_fs(old_fs); if (cur_mm) { unuse_mm(cur_mm); @@ -5330,8 +5254,13 @@ static int io_cqring_wait(struct io_ring_ctx *ctx, int min_events, struct io_rings *rings = ctx->rings; int ret = 0;
- if (io_cqring_events(ctx, false) >= min_events) - return 0; + do { + if (io_cqring_events(ctx, false) >= min_events) + return 0; + if (!current->task_works) + break; + task_work_run(); + } while (1);
if (sig) { #ifdef CONFIG_COMPAT @@ -5351,6 +5280,8 @@ static int io_cqring_wait(struct io_ring_ctx *ctx, int min_events, do { prepare_to_wait_exclusive(&ctx->wait, &iowq.wq, TASK_INTERRUPTIBLE); + if (current->task_works) + task_work_run(); if (io_should_wake(&iowq, false)) break; schedule(); @@ -6677,6 +6608,9 @@ SYSCALL_DEFINE6(io_uring_enter, unsigned int, fd, u32, to_submit, int submitted = 0; struct fd f;
+ if (current->task_works) + task_work_run(); + if (flags & ~(IORING_ENTER_GETEVENTS | IORING_ENTER_SQ_WAKEUP)) return -EINVAL;
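Stripped to its shape, the pattern these hunks introduce is: the poll waitqueue callback no longer completes the request inline, it hands completion to the submitting task via task_work, and every point where that task might go to sleep first drains pending work. A minimal kernel-style sketch under assumed names (wake_cb, struct my_req and complete_in_task are hypothetical; init_task_work(), task_work_add() and task_work_run() are the real interfaces used above):

	static int wake_cb(struct wait_queue_entry *wait, unsigned mode,
			   int sync, void *key)
	{
		struct my_req *req = wait->private;	/* stashed when the wait was armed */

		req->result = key_to_poll(key);
		init_task_work(&req->cb, complete_in_task);
		task_work_add(req->task, &req->cb, true);	/* queue on the submitter */
		wake_up_process(req->task);			/* and kick it awake */
		return 1;
	}

	/* submitter side, before deciding to sleep: */
	if (current->task_works)
		task_work_run();	/* drain deferred completions first */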
From: Jens Axboe axboe@kernel.dk
mainline inclusion from mainline-5.7-rc1 commit 8a72758c51f8a5501a0e01ea95069630edb9ca07 category: feature bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27 CVE: NA ---------------------------
Add a pollin/pollout field to the request table, and have commands that we can safely poll for properly marked.
Signed-off-by: Jens Axboe axboe@kernel.dk Signed-off-by: yangerkun yangerkun@huawei.com Reviewed-by: zhangyi (F) yi.zhang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- fs/io_uring.c | 15 +++++++++++++++ 1 file changed, 15 insertions(+)
diff --git a/fs/io_uring.c b/fs/io_uring.c index df66bf2ea600..0deaeb894892 100644 --- a/fs/io_uring.c +++ b/fs/io_uring.c @@ -632,6 +632,9 @@ struct io_op_def { unsigned file_table : 1; /* needs ->fs */ unsigned needs_fs : 1; + /* set if opcode supports polled "wait" */ + unsigned pollin : 1; + unsigned pollout : 1; };
static const struct io_op_def io_op_defs[] = { @@ -641,6 +644,7 @@ static const struct io_op_def io_op_defs[] = { .needs_mm = 1, .needs_file = 1, .unbound_nonreg_file = 1, + .pollin = 1, }, [IORING_OP_WRITEV] = { .async_ctx = 1, @@ -648,6 +652,7 @@ static const struct io_op_def io_op_defs[] = { .needs_file = 1, .hash_reg_file = 1, .unbound_nonreg_file = 1, + .pollout = 1, }, [IORING_OP_FSYNC] = { .needs_file = 1, @@ -655,11 +660,13 @@ static const struct io_op_def io_op_defs[] = { [IORING_OP_READ_FIXED] = { .needs_file = 1, .unbound_nonreg_file = 1, + .pollin = 1, }, [IORING_OP_WRITE_FIXED] = { .needs_file = 1, .hash_reg_file = 1, .unbound_nonreg_file = 1, + .pollout = 1, }, [IORING_OP_POLL_ADD] = { .needs_file = 1, @@ -675,6 +682,7 @@ static const struct io_op_def io_op_defs[] = { .needs_file = 1, .unbound_nonreg_file = 1, .needs_fs = 1, + .pollout = 1, }, [IORING_OP_RECVMSG] = { .async_ctx = 1, @@ -682,6 +690,7 @@ static const struct io_op_def io_op_defs[] = { .needs_file = 1, .unbound_nonreg_file = 1, .needs_fs = 1, + .pollin = 1, }, [IORING_OP_TIMEOUT] = { .async_ctx = 1, @@ -693,6 +702,7 @@ static const struct io_op_def io_op_defs[] = { .needs_file = 1, .unbound_nonreg_file = 1, .file_table = 1, + .pollin = 1, }, [IORING_OP_ASYNC_CANCEL] = {}, [IORING_OP_LINK_TIMEOUT] = { @@ -704,6 +714,7 @@ static const struct io_op_def io_op_defs[] = { .needs_mm = 1, .needs_file = 1, .unbound_nonreg_file = 1, + .pollout = 1, }, [IORING_OP_FALLOCATE] = { .needs_file = 1, @@ -732,11 +743,13 @@ static const struct io_op_def io_op_defs[] = { .needs_mm = 1, .needs_file = 1, .unbound_nonreg_file = 1, + .pollin = 1, }, [IORING_OP_WRITE] = { .needs_mm = 1, .needs_file = 1, .unbound_nonreg_file = 1, + .pollout = 1, }, [IORING_OP_FADVISE] = { .needs_file = 1, @@ -748,11 +761,13 @@ static const struct io_op_def io_op_defs[] = { .needs_mm = 1, .needs_file = 1, .unbound_nonreg_file = 1, + .pollout = 1, }, [IORING_OP_RECV] = { .needs_mm = 1, .needs_file = 1, .unbound_nonreg_file = 1, + .pollin = 1, }, [IORING_OP_EPOLL_CTL] = { .unbound_nonreg_file = 1,
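The two bits are pure metadata at this point; the next patch in the series consults them when deciding whether a request can be retried via poll instead of punted to a worker. Roughly, and shown here with the mask initialisation that a later fix in this series adds:

	const struct io_op_def *def = &io_op_defs[req->opcode];

	if (!def->pollin && !def->pollout)
		return false;			/* opcode has no pollable direction */

	mask = 0;
	if (def->pollin)
		mask |= POLLIN | POLLRDNORM;	/* read-style op: wait for readable */
	if (def->pollout)
		mask |= POLLOUT | POLLWRNORM;	/* write-style op: wait for writable */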
From: Jens Axboe axboe@kernel.dk
mainline inclusion from mainline-5.7-rc1 commit d7718a9d25a61442da8ee8aeeff6a0097f0ccfd6 category: feature bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27 CVE: NA ---------------------------
Currently io_uring tries any request in a non-blocking manner, if it can, and then retries from a worker thread if we get -EAGAIN. Now that we have a new and fancy poll based retry backend, use that to retry requests if the file supports it.
This means that, for example, an IORING_OP_RECVMSG on a socket no longer requires an async thread to complete the IO. If we get -EAGAIN reading from the socket in a non-blocking manner, we arm a poll handler for notification on when the socket becomes readable. When it does, the pending read is executed directly by the task again, through the io_uring task work handlers. Not only is this faster and more efficient, it also means we're not generating potentially tons of async threads that just sit and block, waiting for the IO to complete.
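The control flow, reduced to pseudocode with hypothetical helper names (not the kernel's exact functions):

	ret = issue_nonblocking(req);
	if (ret == -EAGAIN) {
		if (arm_poll(req))
			return;		/* the waitqueue callback re-runs the
					 * request from task_work once the file
					 * signals POLLIN/POLLOUT */
		queue_io_wq_worker(req);	/* fallback: blocking worker thread */
	}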
The feature is marked with IORING_FEAT_FAST_POLL, meaning that async pollable IO is fast, and that poll<link>other_op is fast as well.
Signed-off-by: Jens Axboe axboe@kernel.dk Signed-off-by: yangerkun yangerkun@huawei.com Reviewed-by: zhangyi (F) yi.zhang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- fs/io_uring.c | 354 ++++++++++++++++++++++++-------- include/trace/events/io_uring.h | 103 ++++++++++ include/uapi/linux/io_uring.h | 1 + 3 files changed, 375 insertions(+), 83 deletions(-)
diff --git a/fs/io_uring.c b/fs/io_uring.c index 0deaeb894892..aba21e017cb9 100644 --- a/fs/io_uring.c +++ b/fs/io_uring.c @@ -490,6 +490,7 @@ enum { REQ_F_COMP_LOCKED_BIT, REQ_F_NEED_CLEANUP_BIT, REQ_F_OVERFLOW_BIT, + REQ_F_POLLED_BIT, };
enum { @@ -532,6 +533,13 @@ enum { REQ_F_NEED_CLEANUP = BIT(REQ_F_NEED_CLEANUP_BIT), /* in overflow list */ REQ_F_OVERFLOW = BIT(REQ_F_OVERFLOW_BIT), + /* already went through poll handler */ + REQ_F_POLLED = BIT(REQ_F_POLLED_BIT), +}; + +struct async_poll { + struct io_poll_iocb poll; + struct io_wq_work work; };
/* @@ -565,27 +573,29 @@ struct io_kiocb { u8 opcode;
struct io_ring_ctx *ctx; - union { - struct list_head list; - struct hlist_node hash_node; - }; - struct list_head link_list; + struct list_head list; unsigned int flags; refcount_t refs; + struct task_struct *task; u64 user_data; u32 result; u32 sequence;
+ struct list_head link_list; + struct list_head inflight_entry;
union { /* * Only commands that never go async can use the below fields, - * obviously. Right now only IORING_OP_POLL_ADD uses them. + * obviously. Right now only IORING_OP_POLL_ADD uses them, and + * async armed poll handlers for regular commands. The latter + * restore the work, if needed. */ struct { - struct task_struct *task; struct callback_head task_work; + struct hlist_node hash_node; + struct async_poll *apoll; }; struct io_wq_work work; }; @@ -3515,9 +3525,209 @@ static int io_connect(struct io_kiocb *req, struct io_kiocb **nxt, #endif }
-static bool io_poll_remove_one(struct io_kiocb *req) +struct io_poll_table { + struct poll_table_struct pt; + struct io_kiocb *req; + int error; +}; +
+static void __io_queue_proc(struct io_poll_iocb *poll, struct io_poll_table *pt, + struct wait_queue_head *head) +{ + if (unlikely(poll->head)) { + pt->error = -EINVAL; + return; + } + + pt->error = 0; + poll->head = head; + add_wait_queue(head, &poll->wait); +} +
+static void io_async_queue_proc(struct file *file, struct wait_queue_head *head, + struct poll_table_struct *p) +{ + struct io_poll_table *pt = container_of(p, struct io_poll_table, pt); + + __io_queue_proc(&pt->req->apoll->poll, pt, head); +} +
+static int __io_async_wake(struct io_kiocb *req, struct io_poll_iocb *poll, + __poll_t mask, task_work_func_t func) +{ + struct task_struct *tsk; + + /* for instances that support it check for an event match first: */ + if (mask && !(mask & poll->events)) + return 0; + + trace_io_uring_task_add(req->ctx, req->opcode, req->user_data, mask); + + list_del_init(&poll->wait.entry); + + tsk = req->task; + req->result = mask; + init_task_work(&req->task_work, func); + /* + * If this fails, then the task is exiting. If that is the case, then + * the exit check will ultimately cancel these work items. Hence we + * don't need to check here and handle it specifically. + */ + task_work_add(tsk, &req->task_work, true); + wake_up_process(tsk); + return 1; +} +
+static void io_async_task_func(struct callback_head *cb) +{ + struct io_kiocb *req = container_of(cb, struct io_kiocb, task_work); + struct async_poll *apoll = req->apoll; + struct io_ring_ctx *ctx = req->ctx; + + trace_io_uring_task_run(req->ctx, req->opcode, req->user_data); + + WARN_ON_ONCE(!list_empty(&req->apoll->poll.wait.entry)); + + if (hash_hashed(&req->hash_node)) { + spin_lock_irq(&ctx->completion_lock); + hash_del(&req->hash_node); + spin_unlock_irq(&ctx->completion_lock); + } + + /* restore ->work in case we need to retry again */ + memcpy(&req->work, &apoll->work, sizeof(req->work)); + + __set_current_state(TASK_RUNNING); + mutex_lock(&ctx->uring_lock); + __io_queue_sqe(req, NULL); + mutex_unlock(&ctx->uring_lock); + + kfree(apoll); +} +
+static int io_async_wake(struct wait_queue_entry *wait, unsigned mode, int sync, + void *key) +{ + struct io_kiocb *req = wait->private; + struct io_poll_iocb *poll = &req->apoll->poll; + + trace_io_uring_poll_wake(req->ctx, req->opcode, req->user_data, + key_to_poll(key)); + + return __io_async_wake(req, poll, key_to_poll(key), io_async_task_func); +} +
+static void io_poll_req_insert(struct io_kiocb *req) +{ + struct io_ring_ctx *ctx = req->ctx; + struct hlist_head *list; + + list = &ctx->cancel_hash[hash_long(req->user_data, ctx->cancel_hash_bits)]; + hlist_add_head(&req->hash_node, list); +} +
+static __poll_t __io_arm_poll_handler(struct io_kiocb *req, + struct io_poll_iocb *poll, + struct io_poll_table *ipt, __poll_t mask, + wait_queue_func_t wake_func) + __acquires(&ctx->completion_lock) +{ + struct io_ring_ctx *ctx = req->ctx; + bool cancel = false; + + poll->file = req->file; + poll->head = NULL; + poll->done = poll->canceled = false; + poll->events = mask; + + ipt->pt._key = mask; + ipt->req = req; + ipt->error = -EINVAL; + + INIT_LIST_HEAD(&poll->wait.entry); + init_waitqueue_func_entry(&poll->wait, wake_func); + poll->wait.private = req; + + mask = vfs_poll(req->file, &ipt->pt) & poll->events; + + spin_lock_irq(&ctx->completion_lock); + if (likely(poll->head)) { + spin_lock(&poll->head->lock); + if (unlikely(list_empty(&poll->wait.entry))) { + if (ipt->error) + cancel = true; + ipt->error = 0; + mask = 0; + } + if (mask || ipt->error) + list_del_init(&poll->wait.entry); + else if (cancel) + WRITE_ONCE(poll->canceled, true); + else if (!poll->done) /* actually waiting for an event */ + io_poll_req_insert(req); + spin_unlock(&poll->head->lock); + } + + return mask; +} +
+static bool io_arm_poll_handler(struct io_kiocb *req) +{ + const struct io_op_def *def = &io_op_defs[req->opcode]; + struct io_ring_ctx *ctx = req->ctx; + struct async_poll *apoll; + struct io_poll_table ipt; + __poll_t mask, ret; + + if (!req->file || !file_can_poll(req->file)) + return false; + if (req->flags & (REQ_F_MUST_PUNT | REQ_F_POLLED)) + return false; + if (!def->pollin && !def->pollout) + return false; + + apoll = kmalloc(sizeof(*apoll), GFP_ATOMIC); + if (unlikely(!apoll)) + return false; + + req->flags |= REQ_F_POLLED; + memcpy(&apoll->work, &req->work, sizeof(req->work)); + + /* + * Don't need a reference here, as we're adding it to the task + * task_works list. If the task exits, the list is pruned. + */ + req->task = current; + req->apoll = apoll; + INIT_HLIST_NODE(&req->hash_node); + + if (def->pollin) + mask = POLLIN | POLLRDNORM; + if (def->pollout) + mask |= POLLOUT | POLLWRNORM; + mask |= POLLERR | POLLPRI; + + ipt.pt._qproc = io_async_queue_proc; + + ret = __io_arm_poll_handler(req, &apoll->poll, &ipt, mask, + io_async_wake); + if (ret) { + ipt.error = 0; + apoll->poll.done = true; + spin_unlock_irq(&ctx->completion_lock); + memcpy(&req->work, &apoll->work, sizeof(req->work)); + kfree(apoll); + return false; + } + spin_unlock_irq(&ctx->completion_lock); + trace_io_uring_poll_arm(ctx, req->opcode, req->user_data, mask, + apoll->poll.events); + return true; +} +
+static bool __io_poll_remove_one(struct io_kiocb *req, + struct io_poll_iocb *poll) { - struct io_poll_iocb *poll = &req->poll; bool do_complete = false;
spin_lock(&poll->head->lock); @@ -3527,7 +3737,24 @@ static bool io_poll_remove_one(struct io_kiocb *req) do_complete = true; } spin_unlock(&poll->head->lock); + return do_complete; +} + +static bool io_poll_remove_one(struct io_kiocb *req) +{ + bool do_complete; + + if (req->opcode == IORING_OP_POLL_ADD) { + do_complete = __io_poll_remove_one(req, &req->poll); + } else { + /* non-poll requests have submit ref still */ + do_complete = __io_poll_remove_one(req, &req->apoll->poll); + if (do_complete) + io_put_req(req); + } + hash_del(&req->hash_node); + if (do_complete) { io_cqring_fill_event(req, -ECANCELED); io_commit_cqring(req->ctx); @@ -3638,8 +3865,13 @@ static void io_poll_task_func(struct callback_head *cb) struct io_kiocb *nxt = NULL;
io_poll_task_handler(req, &nxt); - if (nxt) + if (nxt) { + struct io_ring_ctx *ctx = nxt->ctx; + + mutex_lock(&ctx->uring_lock); __io_queue_sqe(nxt, NULL); + mutex_unlock(&ctx->uring_lock); + } }
static int io_poll_wake(struct wait_queue_entry *wait, unsigned mode, int sync, @@ -3647,51 +3879,16 @@ static int io_poll_wake(struct wait_queue_entry *wait, unsigned mode, int sync, { struct io_kiocb *req = wait->private; struct io_poll_iocb *poll = &req->poll; - __poll_t mask = key_to_poll(key); - struct task_struct *tsk;
- /* for instances that support it check for an event match first: */ - if (mask && !(mask & poll->events)) - return 0; - - list_del_init(&poll->wait.entry); - - tsk = req->task; - req->result = mask; - init_task_work(&req->task_work, io_poll_task_func); - task_work_add(tsk, &req->task_work, true); - wake_up_process(tsk); - return 1; + return __io_async_wake(req, poll, key_to_poll(key), io_poll_task_func); }
-struct io_poll_table { - struct poll_table_struct pt; - struct io_kiocb *req; - int error; -}; - static void io_poll_queue_proc(struct file *file, struct wait_queue_head *head, struct poll_table_struct *p) { struct io_poll_table *pt = container_of(p, struct io_poll_table, pt);
- if (unlikely(pt->req->poll.head)) { - pt->error = -EINVAL; - return; - } - - pt->error = 0; - pt->req->poll.head = head; - add_wait_queue(head, &pt->req->poll.wait); -} - -static void io_poll_req_insert(struct io_kiocb *req) -{ - struct io_ring_ctx *ctx = req->ctx; - struct hlist_head *list; - - list = &ctx->cancel_hash[hash_long(req->user_data, ctx->cancel_hash_bits)]; - hlist_add_head(&req->hash_node, list); + __io_queue_proc(&pt->req->poll, pt, head); }
static int io_poll_add_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe) @@ -3709,7 +3906,10 @@ static int io_poll_add_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe events = READ_ONCE(sqe->poll_events); poll->events = demangle_poll(events) | EPOLLERR | EPOLLHUP;
- /* task will wait for requests on exit, don't need a ref */ + /* + * Don't need a reference here, as we're adding it to the task + * task_works list. If the task exits, the list is pruned. + */ req->task = current; return 0; } @@ -3719,46 +3919,15 @@ static int io_poll_add(struct io_kiocb *req, struct io_kiocb **nxt) struct io_poll_iocb *poll = &req->poll; struct io_ring_ctx *ctx = req->ctx; struct io_poll_table ipt; - bool cancel = false; __poll_t mask;
INIT_HLIST_NODE(&req->hash_node); - - poll->head = NULL; - poll->done = false; - poll->canceled = false; - - ipt.pt._qproc = io_poll_queue_proc; - ipt.pt._key = poll->events; - ipt.req = req; - ipt.error = -EINVAL; /* same as no support for IOCB_CMD_POLL */ - - /* initialized the list so that we can do list_empty checks */ - INIT_LIST_HEAD(&poll->wait.entry); - init_waitqueue_func_entry(&poll->wait, io_poll_wake); - poll->wait.private = req; - INIT_LIST_HEAD(&req->list); + ipt.pt._qproc = io_poll_queue_proc;
- mask = vfs_poll(poll->file, &ipt.pt) & poll->events; + mask = __io_arm_poll_handler(req, &req->poll, &ipt, poll->events, + io_poll_wake);
- spin_lock_irq(&ctx->completion_lock); - if (likely(poll->head)) { - spin_lock(&poll->head->lock); - if (unlikely(list_empty(&poll->wait.entry))) { - if (ipt.error) - cancel = true; - ipt.error = 0; - mask = 0; - } - if (mask || ipt.error) - list_del_init(&poll->wait.entry); - else if (cancel) - WRITE_ONCE(poll->canceled, true); - else if (!poll->done) /* actually waiting for an event */ - io_poll_req_insert(req); - spin_unlock(&poll->head->lock); - } if (mask) { /* no async, we'd stolen it */ ipt.error = 0; io_poll_complete(req, mask, 0); @@ -4694,6 +4863,9 @@ static struct io_kiocb *io_prep_linked_timeout(struct io_kiocb *req)
if (!(req->flags & REQ_F_LINK)) return NULL; + /* for polled retry, if flag is set, we already went through here */ + if (req->flags & REQ_F_POLLED) + return NULL;
nxt = list_first_entry_or_null(&req->link_list, struct io_kiocb, link_list); @@ -4731,6 +4903,11 @@ static void __io_queue_sqe(struct io_kiocb *req, const struct io_uring_sqe *sqe) */ if (ret == -EAGAIN && (!(req->flags & REQ_F_NOWAIT) || (req->flags & REQ_F_MUST_PUNT))) { + if (io_arm_poll_handler(req)) { + if (linked_timeout) + io_queue_linked_timeout(linked_timeout); + goto done_req; + } punt: if (io_op_defs[req->opcode].file_table) { ret = io_grab_files(req); @@ -6748,6 +6925,17 @@ static void __io_uring_show_fdinfo(struct io_ring_ctx *ctx, struct seq_file *m) seq_printf(m, "Personalities:\n"); idr_for_each(&ctx->personality_idr, io_uring_show_cred, m); } + seq_printf(m, "PollList:\n"); + spin_lock_irq(&ctx->completion_lock); + for (i = 0; i < (1U << ctx->cancel_hash_bits); i++) { + struct hlist_head *list = &ctx->cancel_hash[i]; + struct io_kiocb *req; + + hlist_for_each_entry(req, list, hash_node) + seq_printf(m, " op=%d, task_works=%d\n", req->opcode, + req->task->task_works != NULL); + } + spin_unlock_irq(&ctx->completion_lock); mutex_unlock(&ctx->uring_lock); }
@@ -6964,7 +7152,7 @@ static int io_uring_create(unsigned entries, struct io_uring_params *p)
p->features = IORING_FEAT_SINGLE_MMAP | IORING_FEAT_NODROP | IORING_FEAT_SUBMIT_STABLE | IORING_FEAT_RW_CUR_POS | - IORING_FEAT_CUR_PERSONALITY; + IORING_FEAT_CUR_PERSONALITY | IORING_FEAT_FAST_POLL; trace_io_uring_create(ret, ctx, p->sq_entries, p->cq_entries, p->flags); return ret; err: diff --git a/include/trace/events/io_uring.h b/include/trace/events/io_uring.h index b116de688a0e..be97b7fa0ac9 100644 --- a/include/trace/events/io_uring.h +++ b/include/trace/events/io_uring.h @@ -386,6 +386,109 @@ TRACE_EVENT(io_uring_submit_sqe, __entry->force_nonblock, __entry->sq_thread) );
+TRACE_EVENT(io_uring_poll_arm, + + TP_PROTO(void *ctx, u8 opcode, u64 user_data, int mask, int events), + + TP_ARGS(ctx, opcode, user_data, mask, events), + + TP_STRUCT__entry ( + __field( void *, ctx ) + __field( u8, opcode ) + __field( u64, user_data ) + __field( int, mask ) + __field( int, events ) + ), + + TP_fast_assign( + __entry->ctx = ctx; + __entry->opcode = opcode; + __entry->user_data = user_data; + __entry->mask = mask; + __entry->events = events; + ), + + TP_printk("ring %p, op %d, data 0x%llx, mask 0x%x, events 0x%x", + __entry->ctx, __entry->opcode, + (unsigned long long) __entry->user_data, + __entry->mask, __entry->events) +); + +TRACE_EVENT(io_uring_poll_wake, + + TP_PROTO(void *ctx, u8 opcode, u64 user_data, int mask), + + TP_ARGS(ctx, opcode, user_data, mask), + + TP_STRUCT__entry ( + __field( void *, ctx ) + __field( u8, opcode ) + __field( u64, user_data ) + __field( int, mask ) + ), + + TP_fast_assign( + __entry->ctx = ctx; + __entry->opcode = opcode; + __entry->user_data = user_data; + __entry->mask = mask; + ), + + TP_printk("ring %p, op %d, data 0x%llx, mask 0x%x", + __entry->ctx, __entry->opcode, + (unsigned long long) __entry->user_data, + __entry->mask) +); + +TRACE_EVENT(io_uring_task_add, + + TP_PROTO(void *ctx, u8 opcode, u64 user_data, int mask), + + TP_ARGS(ctx, opcode, user_data, mask), + + TP_STRUCT__entry ( + __field( void *, ctx ) + __field( u8, opcode ) + __field( u64, user_data ) + __field( int, mask ) + ), + + TP_fast_assign( + __entry->ctx = ctx; + __entry->opcode = opcode; + __entry->user_data = user_data; + __entry->mask = mask; + ), + + TP_printk("ring %p, op %d, data 0x%llx, mask %x", + __entry->ctx, __entry->opcode, + (unsigned long long) __entry->user_data, + __entry->mask) +); + +TRACE_EVENT(io_uring_task_run, + + TP_PROTO(void *ctx, u8 opcode, u64 user_data), + + TP_ARGS(ctx, opcode, user_data), + + TP_STRUCT__entry ( + __field( void *, ctx ) + __field( u8, opcode ) + __field( u64, user_data ) + ), + + TP_fast_assign( + __entry->ctx = ctx; + __entry->opcode = opcode; + __entry->user_data = user_data; + ), + + TP_printk("ring %p, op %d, data 0x%llx", + __entry->ctx, __entry->opcode, + (unsigned long long) __entry->user_data) +); + #endif /* _TRACE_IO_URING_H */
/* This part must be outside protection */ diff --git a/include/uapi/linux/io_uring.h b/include/uapi/linux/io_uring.h index 6c607e42db68..14b4f075068f 100644 --- a/include/uapi/linux/io_uring.h +++ b/include/uapi/linux/io_uring.h @@ -215,6 +215,7 @@ struct io_uring_params { #define IORING_FEAT_SUBMIT_STABLE (1U << 2) #define IORING_FEAT_RW_CUR_POS (1U << 3) #define IORING_FEAT_CUR_PERSONALITY (1U << 4) +#define IORING_FEAT_FAST_POLL (1U << 5)
/* * io_uring_register(2) opcodes and arguments
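Userspace can probe the new bit at ring setup. A minimal sketch, assuming liburing is available (io_uring_queue_init_params() and io_uring_queue_exit() are liburing calls, not part of this patch):

	#include <stdio.h>
	#include <liburing.h>

	int main(void)
	{
		struct io_uring ring;
		struct io_uring_params p = { 0 };

		if (io_uring_queue_init_params(8, &ring, &p))
			return 1;
		if (p.features & IORING_FEAT_FAST_POLL)
			printf("pollable IO is retried without io-wq threads\n");
		io_uring_queue_exit(&ring);
		return 0;
	}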
From: Pavel Begunkov asml.silence@gmail.com
mainline inclusion from mainline-5.7-rc1 commit 4bc4494ec7c97ee38e2aa3d1cd76e289c49ac083 category: feature bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27 CVE: NA ---------------------------
Once __io_queue_sqe() has ended up in io_queue_async_work(), it is already known that there is no @nxt req, so skip the check and return from the function.
Also, the @nxt initialisation can now be done just before io_put_req_find_next(), as there is no jump until it is checked.
Signed-off-by: Pavel Begunkov asml.silence@gmail.com Signed-off-by: Jens Axboe axboe@kernel.dk Signed-off-by: yangerkun yangerkun@huawei.com Reviewed-by: zhangyi (F) yi.zhang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- fs/io_uring.c | 10 +++++----- 1 file changed, 5 insertions(+), 5 deletions(-)
diff --git a/fs/io_uring.c b/fs/io_uring.c index aba21e017cb9..ab68201407a2 100644 --- a/fs/io_uring.c +++ b/fs/io_uring.c @@ -4879,7 +4879,7 @@ static struct io_kiocb *io_prep_linked_timeout(struct io_kiocb *req) static void __io_queue_sqe(struct io_kiocb *req, const struct io_uring_sqe *sqe) { struct io_kiocb *linked_timeout; - struct io_kiocb *nxt = NULL; + struct io_kiocb *nxt; const struct cred *old_creds = NULL; int ret;
@@ -4906,7 +4906,7 @@ static void __io_queue_sqe(struct io_kiocb *req, const struct io_uring_sqe *sqe) if (io_arm_poll_handler(req)) { if (linked_timeout) io_queue_linked_timeout(linked_timeout); - goto done_req; + goto exit; } punt: if (io_op_defs[req->opcode].file_table) { @@ -4920,10 +4920,11 @@ static void __io_queue_sqe(struct io_kiocb *req, const struct io_uring_sqe *sqe) * submit reference when the iocb is actually submitted. */ io_queue_async_work(req); - goto done_req; + goto exit; }
err: + nxt = NULL; /* drop submission reference */ io_put_req_find_next(req, &nxt);
@@ -4940,15 +4941,14 @@ static void __io_queue_sqe(struct io_kiocb *req, const struct io_uring_sqe *sqe) req_set_fail_links(req); io_put_req(req); } -done_req: if (nxt) { req = nxt; - nxt = NULL;
if (req->flags & REQ_F_FORCE_ASYNC) goto punt; goto again; } +exit: if (old_creds) revert_creds(old_creds); }
From: Pavel Begunkov asml.silence@gmail.com
mainline inclusion from mainline-5.7-rc1 commit 3b17cf5a58f2a38e23ee980b5dece717d0464fb7 category: feature bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27 CVE: NA ---------------------------
io-wq cares about the IO_WQ_WORK_UNBOUND flag only while enqueueing, so it is useless to set it for the next req of a link. Thus, remove it from io_prep_linked_timeout() and inline the function.
Signed-off-by: Pavel Begunkov asml.silence@gmail.com Signed-off-by: Jens Axboe axboe@kernel.dk Signed-off-by: yangerkun yangerkun@huawei.com Reviewed-by: zhangyi (F) yi.zhang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- fs/io_uring.c | 13 +------------ 1 file changed, 1 insertion(+), 12 deletions(-)
diff --git a/fs/io_uring.c b/fs/io_uring.c index ab68201407a2..6b0b5d6ad145 100644 --- a/fs/io_uring.c +++ b/fs/io_uring.c @@ -995,17 +995,6 @@ static inline void io_req_work_drop_env(struct io_kiocb *req) } }
-static inline void io_prep_next_work(struct io_kiocb *req, - struct io_kiocb **link) -{ - const struct io_op_def *def = &io_op_defs[req->opcode]; - - if (!(req->flags & REQ_F_ISREG) && def->unbound_nonreg_file) - req->work.flags |= IO_WQ_WORK_UNBOUND; - - *link = io_prep_linked_timeout(req); -} - static inline bool io_prep_async_work(struct io_kiocb *req, struct io_kiocb **link) { @@ -2578,8 +2567,8 @@ static void io_wq_assign_next(struct io_wq_work **workptr, struct io_kiocb *nxt) { struct io_kiocb *link;
- io_prep_next_work(nxt, &link); *workptr = &nxt->work; + link = io_prep_linked_timeout(nxt); if (link) { nxt->work.func = io_link_work_cb; nxt->work.data = link;
From: Nathan Chancellor natechancellor@gmail.com
mainline inclusion from mainline-5.7-rc1 commit 8755d97a09fed0de206772bcad1838301293c4d8 category: feature bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27 CVE: NA ---------------------------
Clang warns:
fs/io_uring.c:4178:6: warning: variable 'mask' is used uninitialized whenever 'if' condition is false [-Wsometimes-uninitialized]
	if (def->pollin)
	    ^~~~~~~~~~~
fs/io_uring.c:4182:2: note: uninitialized use occurs here
	mask |= POLLERR | POLLPRI;
	^~~~
fs/io_uring.c:4178:2: note: remove the 'if' if its condition is always true
	if (def->pollin)
	^~~~~~~~~~~~~~~~
fs/io_uring.c:4154:15: note: initialize the variable 'mask' to silence this warning
	__poll_t mask, ret;
	             ^
	              = 0
1 warning generated.
io_op_defs has many definitions where pollin is not set, so mask might indeed be used uninitialized. Initialize it to zero and switch the first assignment to |=, so that if further mask bits are added in the future, nobody has to remember to change the assignment back.
Fixes: d7718a9d25a6 ("io_uring: use poll driven retry for files that support it") Link: https://github.com/ClangBuiltLinux/linux/issues/916 Signed-off-by: Nathan Chancellor natechancellor@gmail.com Signed-off-by: Jens Axboe axboe@kernel.dk Signed-off-by: yangerkun yangerkun@huawei.com Reviewed-by: zhangyi (F) yi.zhang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- fs/io_uring.c | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-)
diff --git a/fs/io_uring.c b/fs/io_uring.c index 6b0b5d6ad145..5a97d110602a 100644 --- a/fs/io_uring.c +++ b/fs/io_uring.c @@ -3690,8 +3690,9 @@ static bool io_arm_poll_handler(struct io_kiocb *req) req->apoll = apoll; INIT_HLIST_NODE(&req->hash_node);
+ mask = 0; if (def->pollin) - mask = POLLIN | POLLRDNORM; + mask |= POLLIN | POLLRDNORM; if (def->pollout) mask |= POLLOUT | POLLWRNORM; mask |= POLLERR | POLLPRI;
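The pattern the warning catches is generic; a self-contained reduction (hypothetical struct, not io_uring code) showing both the bug and the fix:

	struct op_flags { int pollin, pollout; };

	unsigned classify(const struct op_flags *def)
	{
		unsigned mask = 0;		/* the fix: start from zero */

		if (def->pollin)
			mask |= 0x1;		/* |= is now safe on every path */
		if (def->pollout)
			mask |= 0x2;
		mask |= 0x4;			/* without '= 0' above, this read could
						 * see an uninitialized value whenever
						 * neither branch was taken */
		return mask;
	}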
From: Pavel Begunkov asml.silence@gmail.com
mainline inclusion from mainline-5.7-rc1 commit a2100672f3b2afdd55ccc2e640d1a8bd99ff6338 category: feature bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27 CVE: NA ---------------------------
Don't abuse labels for plain and straightforward code.
Signed-off-by: Pavel Begunkov asml.silence@gmail.com Signed-off-by: Jens Axboe axboe@kernel.dk Signed-off-by: yangerkun yangerkun@huawei.com Reviewed-by: zhangyi (F) yi.zhang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- fs/io_uring.c | 21 ++++++++++----------- 1 file changed, 10 insertions(+), 11 deletions(-)
diff --git a/fs/io_uring.c b/fs/io_uring.c index 5a97d110602a..54acd816c7dd 100644 --- a/fs/io_uring.c +++ b/fs/io_uring.c @@ -2980,8 +2980,16 @@ static int io_close(struct io_kiocb *req, struct io_kiocb **nxt, return ret;
/* if the file has a flush method, be safe and punt to async */ - if (req->close.put_file->f_op->flush && !io_wq_current_is_worker()) - goto eagain; + if (req->close.put_file->f_op->flush && force_nonblock) { + req->work.func = io_close_finish; + /* + * Do manual async queue here to avoid grabbing files - we don't + * need the files, and it'll cause io_close_finish() to close + * the file again and cause a double CQE entry for this request + */ + io_queue_async_work(req); + return 0; + }
/* * No ->flush(), safely close from here and just punt the @@ -2989,15 +2997,6 @@ static int io_close(struct io_kiocb *req, struct io_kiocb **nxt, */ __io_close_finish(req, nxt); return 0; -eagain: - req->work.func = io_close_finish; - /* - * Do manual async queue here to avoid grabbing files - we don't - * need the files, and it'll cause io_close_finish() to close - * the file again and cause a double CQE entry for this request - */ - io_queue_async_work(req); - return 0; }
static int io_prep_sfr(struct io_kiocb *req, const struct io_uring_sqe *sqe)
From: Pavel Begunkov asml.silence@gmail.com
mainline inclusion from mainline-5.7-rc1 commit 594506fec5faec2b1ec82ad6fb0c8132512fc459 category: feature bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27 CVE: NA ---------------------------
The rule is simple: any async handler gets a submission ref and should put it at the end. Make them all follow this rule, which also makes them more consistent.
This is a preparation patch; since io_wq_assign_next() currently never takes effect, it doesn't bother using io_put_req_find_next() instead of io_put_req().
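The reference model behind the rule, as an illustrative sketch (do_the_actual_io() is hypothetical; the counts follow the commit's description): a request carries a completion reference and a submission reference, and when issue punts to an async handler the submission reference travels with it, so the handler's last act must be to drop it.

	static void some_async_handler(struct io_wq_work **workptr)
	{
		struct io_kiocb *req = container_of(*workptr, struct io_kiocb, work);

		do_the_actual_io(req);	/* hypothetical */
		io_put_req(req);	/* drop the submission reference at the end */
	}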
Signed-off-by: Pavel Begunkov asml.silence@gmail.com
refcount_inc_not_zero() -> refcount_inc() fix.
Signed-off-by: Jens Axboe axboe@kernel.dk Signed-off-by: yangerkun yangerkun@huawei.com Reviewed-by: zhangyi (F) yi.zhang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- fs/io_uring.c | 27 +++++++++++++-------------- 1 file changed, 13 insertions(+), 14 deletions(-)
diff --git a/fs/io_uring.c b/fs/io_uring.c index 54acd816c7dd..b56b3ff5e519 100644 --- a/fs/io_uring.c +++ b/fs/io_uring.c @@ -2547,7 +2547,7 @@ static bool io_req_cancelled(struct io_kiocb *req) if (req->work.flags & IO_WQ_WORK_CANCEL) { req_set_fail_links(req); io_cqring_add_event(req, -ECANCELED); - io_put_req(req); + io_double_put_req(req); return true; }
@@ -2597,6 +2597,7 @@ static void io_fsync_finish(struct io_wq_work **workptr) if (io_req_cancelled(req)) return; __io_fsync(req, &nxt); + io_put_req(req); /* drop submission reference */ if (nxt) io_wq_assign_next(workptr, nxt); } @@ -2606,7 +2607,6 @@ static int io_fsync(struct io_kiocb *req, struct io_kiocb **nxt, { /* fsync always requires a blocking context */ if (force_nonblock) { - io_put_req(req); req->work.func = io_fsync_finish; return -EAGAIN; } @@ -2618,9 +2618,6 @@ static void __io_fallocate(struct io_kiocb *req, struct io_kiocb **nxt) { int ret;
- if (io_req_cancelled(req)) - return; - ret = vfs_fallocate(req->file, req->sync.mode, req->sync.off, req->sync.len); if (ret < 0) @@ -2634,7 +2631,10 @@ static void io_fallocate_finish(struct io_wq_work **workptr) struct io_kiocb *req = container_of(*workptr, struct io_kiocb, work); struct io_kiocb *nxt = NULL;
+ if (io_req_cancelled(req)) + return; __io_fallocate(req, &nxt); + io_put_req(req); /* drop submission reference */ if (nxt) io_wq_assign_next(workptr, nxt); } @@ -2656,7 +2656,6 @@ static int io_fallocate(struct io_kiocb *req, struct io_kiocb **nxt, { /* fallocate always requiring blocking context */ if (force_nonblock) { - io_put_req(req); req->work.func = io_fallocate_finish; return -EAGAIN; } @@ -2965,6 +2964,7 @@ static void io_close_finish(struct io_wq_work **workptr)
/* not cancellable, don't do io_req_cancelled() */ __io_close_finish(req, &nxt); + io_put_req(req); /* drop submission reference */ if (nxt) io_wq_assign_next(workptr, nxt); } @@ -2981,6 +2981,9 @@ static int io_close(struct io_kiocb *req, struct io_kiocb **nxt,
/* if the file has a flush method, be safe and punt to async */ if (req->close.put_file->f_op->flush && force_nonblock) { + /* submission ref will be dropped, take it for async */ + refcount_inc(&req->refs); + req->work.func = io_close_finish; /* * Do manual async queue here to avoid grabbing files - we don't @@ -3038,6 +3041,7 @@ static void io_sync_file_range_finish(struct io_wq_work **workptr) if (io_req_cancelled(req)) return; __io_sync_file_range(req, &nxt); + io_put_req(req); /* put submission ref */ if (nxt) io_wq_assign_next(workptr, nxt); } @@ -3047,7 +3051,6 @@ static int io_sync_file_range(struct io_kiocb *req, struct io_kiocb **nxt, { /* sync_file_range always requires a blocking context */ if (force_nonblock) { - io_put_req(req); req->work.func = io_sync_file_range_finish; return -EAGAIN; } @@ -3416,11 +3419,10 @@ static void io_accept_finish(struct io_wq_work **workptr) struct io_kiocb *req = container_of(*workptr, struct io_kiocb, work); struct io_kiocb *nxt = NULL;
- io_put_req(req); - if (io_req_cancelled(req)) return; __io_accept(req, &nxt, false); + io_put_req(req); /* drop submission reference */ if (nxt) io_wq_assign_next(workptr, nxt); } @@ -4677,17 +4679,14 @@ static void io_wq_submit_work(struct io_wq_work **workptr) } while (1); }
- /* drop submission reference */ - io_put_req(req); - if (ret) { req_set_fail_links(req); io_cqring_add_event(req, ret); io_put_req(req); }
- /* if a dependent link is ready, pass it back */ - if (!ret && nxt) + io_put_req(req); /* drop submission reference */ + if (nxt) io_wq_assign_next(workptr, nxt); }
From: Pavel Begunkov asml.silence@gmail.com
mainline inclusion from mainline-5.7-rc1 commit 014db0073cc6a12e1f421b9231d6f3aa35735823 category: feature bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27 CVE: NA ---------------------------
There will be no use for @nxt in the handlers, and it doesn't work anyway, so purge it.
Signed-off-by: Pavel Begunkov asml.silence@gmail.com Signed-off-by: Jens Axboe axboe@kernel.dk
Conflicts: fs/io_uring.c [openat2 handling ignored, since commit cebdb98617ae ("io_uring: add support for IORING_OP_OPENAT2") is not merged] Signed-off-by: yangerkun yangerkun@huawei.com Reviewed-by: zhangyi (F) yi.zhang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- fs/io_uring.c | 197 +++++++++++++++++++++----------------------- 1 file changed, 83 insertions(+), 114 deletions(-)
diff --git a/fs/io_uring.c b/fs/io_uring.c index b56b3ff5e519..270c1d0fe5e1 100644 --- a/fs/io_uring.c +++ b/fs/io_uring.c @@ -1801,17 +1801,6 @@ static void io_complete_rw(struct kiocb *kiocb, long res, long res2) io_put_req(req); }
-static struct io_kiocb *__io_complete_rw(struct kiocb *kiocb, long res) -{ - struct io_kiocb *req = container_of(kiocb, struct io_kiocb, rw.kiocb); - struct io_kiocb *nxt = NULL; - - io_complete_rw_common(kiocb, res); - io_put_req_find_next(req, &nxt); - - return nxt; -} - static void io_complete_rw_iopoll(struct kiocb *kiocb, long res, long res2) { struct io_kiocb *req = container_of(kiocb, struct io_kiocb, rw.kiocb); @@ -2006,14 +1995,14 @@ static inline void io_rw_done(struct kiocb *kiocb, ssize_t ret) } }
-static void kiocb_done(struct kiocb *kiocb, ssize_t ret, struct io_kiocb **nxt) +static void kiocb_done(struct kiocb *kiocb, ssize_t ret) { struct io_kiocb *req = container_of(kiocb, struct io_kiocb, rw.kiocb);
if (req->flags & REQ_F_CUR_POS) req->file->f_pos = kiocb->ki_pos; if (ret >= 0 && kiocb->ki_complete == io_complete_rw) - *nxt = __io_complete_rw(kiocb, ret); + io_complete_rw(kiocb, ret, 0); else io_rw_done(kiocb, ret); } @@ -2262,8 +2251,7 @@ static int io_read_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe, return 0; }
-static int io_read(struct io_kiocb *req, struct io_kiocb **nxt, - bool force_nonblock) +static int io_read(struct io_kiocb *req, bool force_nonblock) { struct iovec inline_vecs[UIO_FASTIOV], *iovec = inline_vecs; struct kiocb *kiocb = &req->rw.kiocb; @@ -2303,7 +2291,7 @@ static int io_read(struct io_kiocb *req, struct io_kiocb **nxt,
/* Catch -EAGAIN return for forced non-blocking submission */ if (!force_nonblock || ret2 != -EAGAIN) { - kiocb_done(kiocb, ret2, nxt); + kiocb_done(kiocb, ret2); } else { copy_iov: ret = io_setup_async_rw(req, io_size, iovec, @@ -2352,8 +2340,7 @@ static int io_write_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe, return 0; }
-static int io_write(struct io_kiocb *req, struct io_kiocb **nxt, - bool force_nonblock) +static int io_write(struct io_kiocb *req, bool force_nonblock) { struct iovec inline_vecs[UIO_FASTIOV], *iovec = inline_vecs; struct kiocb *kiocb = &req->rw.kiocb; @@ -2417,7 +2404,7 @@ static int io_write(struct io_kiocb *req, struct io_kiocb **nxt, if (ret2 == -EOPNOTSUPP && (kiocb->ki_flags & IOCB_NOWAIT)) ret2 = -EAGAIN; if (!force_nonblock || ret2 != -EAGAIN) { - kiocb_done(kiocb, ret2, nxt); + kiocb_done(kiocb, ret2); } else { copy_iov: ret = io_setup_async_rw(req, io_size, iovec, @@ -2474,8 +2461,7 @@ static bool io_splice_punt(struct file *file) return !(file->f_mode & O_NONBLOCK); }
-static int io_splice(struct io_kiocb *req, struct io_kiocb **nxt, - bool force_nonblock) +static int io_splice(struct io_kiocb *req, bool force_nonblock) { struct io_splice *sp = &req->splice; struct file *in = sp->file_in; @@ -2502,7 +2488,7 @@ static int io_splice(struct io_kiocb *req, struct io_kiocb **nxt, io_cqring_add_event(req, ret); if (ret != sp->len) req_set_fail_links(req); - io_put_req_find_next(req, nxt); + io_put_req(req); return 0; }
@@ -2575,7 +2561,7 @@ static void io_wq_assign_next(struct io_wq_work **workptr, struct io_kiocb *nxt) } }
-static void __io_fsync(struct io_kiocb *req, struct io_kiocb **nxt) +static void __io_fsync(struct io_kiocb *req) { loff_t end = req->sync.off + req->sync.len; int ret; @@ -2586,7 +2572,7 @@ static void __io_fsync(struct io_kiocb *req, struct io_kiocb **nxt) if (ret < 0) req_set_fail_links(req); io_cqring_add_event(req, ret); - io_put_req_find_next(req, nxt); + io_put_req(req); }
static void io_fsync_finish(struct io_wq_work **workptr) @@ -2596,25 +2582,24 @@ static void io_fsync_finish(struct io_wq_work **workptr)
if (io_req_cancelled(req)) return; - __io_fsync(req, &nxt); + __io_fsync(req); io_put_req(req); /* drop submission reference */ if (nxt) io_wq_assign_next(workptr, nxt); }
-static int io_fsync(struct io_kiocb *req, struct io_kiocb **nxt, - bool force_nonblock) +static int io_fsync(struct io_kiocb *req, bool force_nonblock) { /* fsync always requires a blocking context */ if (force_nonblock) { req->work.func = io_fsync_finish; return -EAGAIN; } - __io_fsync(req, nxt); + __io_fsync(req); return 0; }
-static void __io_fallocate(struct io_kiocb *req, struct io_kiocb **nxt) +static void __io_fallocate(struct io_kiocb *req) { int ret;
@@ -2623,7 +2608,7 @@ static void __io_fallocate(struct io_kiocb *req, struct io_kiocb **nxt) if (ret < 0) req_set_fail_links(req); io_cqring_add_event(req, ret); - io_put_req_find_next(req, nxt); + io_put_req(req); }
static void io_fallocate_finish(struct io_wq_work **workptr) @@ -2633,7 +2618,7 @@ static void io_fallocate_finish(struct io_wq_work **workptr)
if (io_req_cancelled(req)) return; - __io_fallocate(req, &nxt); + __io_fallocate(req); io_put_req(req); /* drop submission reference */ if (nxt) io_wq_assign_next(workptr, nxt); @@ -2651,8 +2636,7 @@ static int io_fallocate_prep(struct io_kiocb *req, return 0; }
-static int io_fallocate(struct io_kiocb *req, struct io_kiocb **nxt, - bool force_nonblock) +static int io_fallocate(struct io_kiocb *req, bool force_nonblock) { /* fallocate always requiring blocking context */ if (force_nonblock) { @@ -2660,7 +2644,7 @@ static int io_fallocate(struct io_kiocb *req, struct io_kiocb **nxt, return -EAGAIN; }
- __io_fallocate(req, nxt); + __io_fallocate(req); return 0; }
@@ -2693,8 +2677,7 @@ static int io_openat_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe) return 0; }
-static int io_openat(struct io_kiocb *req, struct io_kiocb **nxt, - bool force_nonblock) +static int io_openat(struct io_kiocb *req, bool force_nonblock) { struct open_flags op; struct file *file; @@ -2725,7 +2708,7 @@ static int io_openat(struct io_kiocb *req, struct io_kiocb **nxt, if (ret < 0) req_set_fail_links(req); io_cqring_add_event(req, ret); - io_put_req_find_next(req, nxt); + io_put_req(req); return 0; }
@@ -2754,8 +2737,7 @@ static int io_epoll_ctl_prep(struct io_kiocb *req, #endif }
-static int io_epoll_ctl(struct io_kiocb *req, struct io_kiocb **nxt, - bool force_nonblock) +static int io_epoll_ctl(struct io_kiocb *req, bool force_nonblock) { #if defined(CONFIG_EPOLL) struct io_epoll *ie = &req->epoll; @@ -2768,7 +2750,7 @@ static int io_epoll_ctl(struct io_kiocb *req, struct io_kiocb **nxt, if (ret < 0) req_set_fail_links(req); io_cqring_add_event(req, ret); - io_put_req_find_next(req, nxt); + io_put_req(req); return 0; #else return -EOPNOTSUPP; @@ -2790,8 +2772,7 @@ static int io_madvise_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe) #endif }
-static int io_madvise(struct io_kiocb *req, struct io_kiocb **nxt, - bool force_nonblock) +static int io_madvise(struct io_kiocb *req, bool force_nonblock) { #if defined(CONFIG_ADVISE_SYSCALLS) && defined(CONFIG_MMU) struct io_madvise *ma = &req->madvise; @@ -2804,7 +2785,7 @@ static int io_madvise(struct io_kiocb *req, struct io_kiocb **nxt, if (ret < 0) req_set_fail_links(req); io_cqring_add_event(req, ret); - io_put_req_find_next(req, nxt); + io_put_req(req); return 0; #else return -EOPNOTSUPP; @@ -2822,8 +2803,7 @@ static int io_fadvise_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe) return 0; }
-static int io_fadvise(struct io_kiocb *req, struct io_kiocb **nxt, - bool force_nonblock) +static int io_fadvise(struct io_kiocb *req, bool force_nonblock) { struct io_fadvise *fa = &req->fadvise; int ret; @@ -2843,7 +2823,7 @@ static int io_fadvise(struct io_kiocb *req, struct io_kiocb **nxt, if (ret < 0) req_set_fail_links(req); io_cqring_add_event(req, ret); - io_put_req_find_next(req, nxt); + io_put_req(req); return 0; }
@@ -2880,8 +2860,7 @@ static int io_statx_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe) return 0; }
-static int io_statx(struct io_kiocb *req, struct io_kiocb **nxt, - bool force_nonblock) +static int io_statx(struct io_kiocb *req, bool force_nonblock) { struct io_open *ctx = &req->open; unsigned lookup_flags; @@ -2918,7 +2897,7 @@ static int io_statx(struct io_kiocb *req, struct io_kiocb **nxt, if (ret < 0) req_set_fail_links(req); io_cqring_add_event(req, ret); - io_put_req_find_next(req, nxt); + io_put_req(req); return 0; }
@@ -2945,7 +2924,7 @@ static int io_close_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe) }
/* only called when __close_fd_get_file() is done */ -static void __io_close_finish(struct io_kiocb *req, struct io_kiocb **nxt) +static void __io_close_finish(struct io_kiocb *req) { int ret;
@@ -2954,7 +2933,7 @@ static void __io_close_finish(struct io_kiocb *req, struct io_kiocb **nxt) req_set_fail_links(req); io_cqring_add_event(req, ret); fput(req->close.put_file); - io_put_req_find_next(req, nxt); + io_put_req(req); }
static void io_close_finish(struct io_wq_work **workptr) @@ -2963,14 +2942,13 @@ static void io_close_finish(struct io_wq_work **workptr) struct io_kiocb *nxt = NULL;
/* not cancellable, don't do io_req_cancelled() */ - __io_close_finish(req, &nxt); + __io_close_finish(req); io_put_req(req); /* drop submission reference */ if (nxt) io_wq_assign_next(workptr, nxt); }
-static int io_close(struct io_kiocb *req, struct io_kiocb **nxt, - bool force_nonblock) +static int io_close(struct io_kiocb *req, bool force_nonblock) { int ret;
@@ -2998,7 +2976,7 @@ static int io_close(struct io_kiocb *req, struct io_kiocb **nxt, * No ->flush(), safely close from here and just punt the * fput() to async context. */ - __io_close_finish(req, nxt); + __io_close_finish(req); return 0; }
@@ -3020,7 +2998,7 @@ static int io_prep_sfr(struct io_kiocb *req, const struct io_uring_sqe *sqe) return 0; }
-static void __io_sync_file_range(struct io_kiocb *req, struct io_kiocb **nxt) +static void __io_sync_file_range(struct io_kiocb *req) { int ret;
@@ -3029,7 +3007,7 @@ static void __io_sync_file_range(struct io_kiocb *req, struct io_kiocb **nxt) if (ret < 0) req_set_fail_links(req); io_cqring_add_event(req, ret); - io_put_req_find_next(req, nxt); + io_put_req(req); }
@@ -3040,14 +3018,13 @@ static void io_sync_file_range_finish(struct io_wq_work **workptr)
if (io_req_cancelled(req)) return; - __io_sync_file_range(req, &nxt); + __io_sync_file_range(req); io_put_req(req); /* put submission ref */ if (nxt) io_wq_assign_next(workptr, nxt); }
-static int io_sync_file_range(struct io_kiocb *req, struct io_kiocb **nxt, - bool force_nonblock) +static int io_sync_file_range(struct io_kiocb *req, bool force_nonblock) { /* sync_file_range always requires a blocking context */ if (force_nonblock) { @@ -3055,7 +3032,7 @@ static int io_sync_file_range(struct io_kiocb *req, struct io_kiocb **nxt, return -EAGAIN; }
- __io_sync_file_range(req, nxt); + __io_sync_file_range(req); return 0; }
@@ -3107,8 +3084,7 @@ static int io_sendmsg_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe) #endif }
-static int io_sendmsg(struct io_kiocb *req, struct io_kiocb **nxt, - bool force_nonblock) +static int io_sendmsg(struct io_kiocb *req, bool force_nonblock) { #if defined(CONFIG_NET) struct io_async_msghdr *kmsg = NULL; @@ -3162,15 +3138,14 @@ static int io_sendmsg(struct io_kiocb *req, struct io_kiocb **nxt, io_cqring_add_event(req, ret); if (ret < 0) req_set_fail_links(req); - io_put_req_find_next(req, nxt); + io_put_req(req); return 0; #else return -EOPNOTSUPP; #endif }
-static int io_send(struct io_kiocb *req, struct io_kiocb **nxt, - bool force_nonblock) +static int io_send(struct io_kiocb *req, bool force_nonblock) { #if defined(CONFIG_NET) struct socket *sock; @@ -3213,7 +3188,7 @@ static int io_send(struct io_kiocb *req, struct io_kiocb **nxt, io_cqring_add_event(req, ret); if (ret < 0) req_set_fail_links(req); - io_put_req_find_next(req, nxt); + io_put_req(req); return 0; #else return -EOPNOTSUPP; @@ -3254,8 +3229,7 @@ static int io_recvmsg_prep(struct io_kiocb *req, #endif }
-static int io_recvmsg(struct io_kiocb *req, struct io_kiocb **nxt, - bool force_nonblock) +static int io_recvmsg(struct io_kiocb *req, bool force_nonblock) { #if defined(CONFIG_NET) struct io_async_msghdr *kmsg = NULL; @@ -3311,15 +3285,14 @@ static int io_recvmsg(struct io_kiocb *req, struct io_kiocb **nxt, io_cqring_add_event(req, ret); if (ret < 0) req_set_fail_links(req); - io_put_req_find_next(req, nxt); + io_put_req(req); return 0; #else return -EOPNOTSUPP; #endif }
-static int io_recv(struct io_kiocb *req, struct io_kiocb **nxt, - bool force_nonblock) +static int io_recv(struct io_kiocb *req, bool force_nonblock) { #if defined(CONFIG_NET) struct socket *sock; @@ -3363,7 +3336,7 @@ static int io_recv(struct io_kiocb *req, struct io_kiocb **nxt, io_cqring_add_event(req, ret); if (ret < 0) req_set_fail_links(req); - io_put_req_find_next(req, nxt); + io_put_req(req); return 0; #else return -EOPNOTSUPP; @@ -3392,8 +3365,7 @@ static int io_accept_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe) }
#if defined(CONFIG_NET) -static int __io_accept(struct io_kiocb *req, struct io_kiocb **nxt, - bool force_nonblock) +static int __io_accept(struct io_kiocb *req, bool force_nonblock) { struct io_accept *accept = &req->accept; unsigned file_flags; @@ -3410,7 +3382,7 @@ static int __io_accept(struct io_kiocb *req, struct io_kiocb **nxt, if (ret < 0) req_set_fail_links(req); io_cqring_add_event(req, ret); - io_put_req_find_next(req, nxt); + io_put_req(req); return 0; }
@@ -3421,20 +3393,19 @@ static void io_accept_finish(struct io_wq_work **workptr)
if (io_req_cancelled(req)) return; - __io_accept(req, &nxt, false); + __io_accept(req, false); io_put_req(req); /* drop submission reference */ if (nxt) io_wq_assign_next(workptr, nxt); } #endif
-static int io_accept(struct io_kiocb *req, struct io_kiocb **nxt, - bool force_nonblock) +static int io_accept(struct io_kiocb *req, bool force_nonblock) { #if defined(CONFIG_NET) int ret;
- ret = __io_accept(req, nxt, force_nonblock); + ret = __io_accept(req, force_nonblock); if (ret == -EAGAIN && force_nonblock) { req->work.func = io_accept_finish; return -EAGAIN; @@ -3469,8 +3440,7 @@ static int io_connect_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe) #endif }
-static int io_connect(struct io_kiocb *req, struct io_kiocb **nxt, - bool force_nonblock) +static int io_connect(struct io_kiocb *req, bool force_nonblock) { #if defined(CONFIG_NET) struct io_async_ctx __io, *io; @@ -3508,7 +3478,7 @@ static int io_connect(struct io_kiocb *req, struct io_kiocb **nxt, if (ret < 0) req_set_fail_links(req); io_cqring_add_event(req, ret); - io_put_req_find_next(req, nxt); + io_put_req(req); return 0; #else return -EOPNOTSUPP; @@ -3905,7 +3875,7 @@ static int io_poll_add_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe return 0; }
-static int io_poll_add(struct io_kiocb *req, struct io_kiocb **nxt) +static int io_poll_add(struct io_kiocb *req) { struct io_poll_iocb *poll = &req->poll; struct io_ring_ctx *ctx = req->ctx; @@ -3927,7 +3897,7 @@ static int io_poll_add(struct io_kiocb *req, struct io_kiocb **nxt)
if (mask) { io_cqring_ev_posted(ctx); - io_put_req_find_next(req, nxt); + io_put_req(req); } return ipt.error; } @@ -4176,7 +4146,7 @@ static int io_async_cancel_one(struct io_ring_ctx *ctx, void *sqe_addr)
static void io_async_find_and_cancel(struct io_ring_ctx *ctx, struct io_kiocb *req, __u64 sqe_addr, - struct io_kiocb **nxt, int success_ret) + int success_ret) { unsigned long flags; int ret; @@ -4202,7 +4172,7 @@ static void io_async_find_and_cancel(struct io_ring_ctx *ctx,
if (ret < 0) req_set_fail_links(req); - io_put_req_find_next(req, nxt); + io_put_req(req); }
static int io_async_cancel_prep(struct io_kiocb *req, @@ -4218,11 +4188,11 @@ static int io_async_cancel_prep(struct io_kiocb *req, return 0; }
-static int io_async_cancel(struct io_kiocb *req, struct io_kiocb **nxt) +static int io_async_cancel(struct io_kiocb *req) { struct io_ring_ctx *ctx = req->ctx;
- io_async_find_and_cancel(ctx, req, req->cancel.addr, nxt, 0); + io_async_find_and_cancel(ctx, req, req->cancel.addr, 0); return 0; }
@@ -4428,7 +4398,7 @@ static void io_cleanup_req(struct io_kiocb *req) }
static int io_issue_sqe(struct io_kiocb *req, const struct io_uring_sqe *sqe, - struct io_kiocb **nxt, bool force_nonblock) + bool force_nonblock) { struct io_ring_ctx *ctx = req->ctx; int ret; @@ -4445,7 +4415,7 @@ static int io_issue_sqe(struct io_kiocb *req, const struct io_uring_sqe *sqe, if (ret < 0) break; } - ret = io_read(req, nxt, force_nonblock); + ret = io_read(req, force_nonblock); break; case IORING_OP_WRITEV: case IORING_OP_WRITE_FIXED: @@ -4455,7 +4425,7 @@ static int io_issue_sqe(struct io_kiocb *req, const struct io_uring_sqe *sqe, if (ret < 0) break; } - ret = io_write(req, nxt, force_nonblock); + ret = io_write(req, force_nonblock); break; case IORING_OP_FSYNC: if (sqe) { @@ -4463,7 +4433,7 @@ static int io_issue_sqe(struct io_kiocb *req, const struct io_uring_sqe *sqe, if (ret < 0) break; } - ret = io_fsync(req, nxt, force_nonblock); + ret = io_fsync(req, force_nonblock); break; case IORING_OP_POLL_ADD: if (sqe) { @@ -4471,7 +4441,7 @@ static int io_issue_sqe(struct io_kiocb *req, const struct io_uring_sqe *sqe, if (ret) break; } - ret = io_poll_add(req, nxt); + ret = io_poll_add(req); break; case IORING_OP_POLL_REMOVE: if (sqe) { @@ -4487,7 +4457,7 @@ static int io_issue_sqe(struct io_kiocb *req, const struct io_uring_sqe *sqe, if (ret < 0) break; } - ret = io_sync_file_range(req, nxt, force_nonblock); + ret = io_sync_file_range(req, force_nonblock); break; case IORING_OP_SENDMSG: case IORING_OP_SEND: @@ -4497,9 +4467,9 @@ static int io_issue_sqe(struct io_kiocb *req, const struct io_uring_sqe *sqe, break; } if (req->opcode == IORING_OP_SENDMSG) - ret = io_sendmsg(req, nxt, force_nonblock); + ret = io_sendmsg(req, force_nonblock); else - ret = io_send(req, nxt, force_nonblock); + ret = io_send(req, force_nonblock); break; case IORING_OP_RECVMSG: case IORING_OP_RECV: @@ -4509,9 +4479,9 @@ static int io_issue_sqe(struct io_kiocb *req, const struct io_uring_sqe *sqe, break; } if (req->opcode == IORING_OP_RECVMSG) - ret = io_recvmsg(req, nxt, force_nonblock); + ret = io_recvmsg(req, force_nonblock); else - ret = io_recv(req, nxt, force_nonblock); + ret = io_recv(req, force_nonblock); break; case IORING_OP_TIMEOUT: if (sqe) { @@ -4535,7 +4505,7 @@ static int io_issue_sqe(struct io_kiocb *req, const struct io_uring_sqe *sqe, if (ret) break; } - ret = io_accept(req, nxt, force_nonblock); + ret = io_accept(req, force_nonblock); break; case IORING_OP_CONNECT: if (sqe) { @@ -4543,7 +4513,7 @@ static int io_issue_sqe(struct io_kiocb *req, const struct io_uring_sqe *sqe, if (ret) break; } - ret = io_connect(req, nxt, force_nonblock); + ret = io_connect(req, force_nonblock); break; case IORING_OP_ASYNC_CANCEL: if (sqe) { @@ -4551,7 +4521,7 @@ static int io_issue_sqe(struct io_kiocb *req, const struct io_uring_sqe *sqe, if (ret) break; } - ret = io_async_cancel(req, nxt); + ret = io_async_cancel(req); break; case IORING_OP_FALLOCATE: if (sqe) { @@ -4559,7 +4529,7 @@ static int io_issue_sqe(struct io_kiocb *req, const struct io_uring_sqe *sqe, if (ret) break; } - ret = io_fallocate(req, nxt, force_nonblock); + ret = io_fallocate(req, force_nonblock); break; case IORING_OP_OPENAT: if (sqe) { @@ -4567,7 +4537,7 @@ static int io_issue_sqe(struct io_kiocb *req, const struct io_uring_sqe *sqe, if (ret) break; } - ret = io_openat(req, nxt, force_nonblock); + ret = io_openat(req, force_nonblock); break; case IORING_OP_CLOSE: if (sqe) { @@ -4575,7 +4545,7 @@ static int io_issue_sqe(struct io_kiocb *req, const struct io_uring_sqe *sqe, if (ret) break; } - ret = io_close(req, nxt, force_nonblock); + ret = io_close(req, force_nonblock); break; case IORING_OP_FILES_UPDATE: if (sqe) { @@ -4591,7 +4561,7 @@ static int io_issue_sqe(struct io_kiocb *req, const struct io_uring_sqe *sqe, if (ret) break; } - ret = io_statx(req, nxt, force_nonblock); + ret = io_statx(req, force_nonblock); break; case IORING_OP_FADVISE: if (sqe) { @@ -4599,7 +4569,7 @@ static int io_issue_sqe(struct io_kiocb *req, const struct io_uring_sqe *sqe, if (ret) break; } - ret = io_fadvise(req, nxt, force_nonblock); + ret = io_fadvise(req, force_nonblock); break; case IORING_OP_MADVISE: if (sqe) { @@ -4607,7 +4577,7 @@ static int io_issue_sqe(struct io_kiocb *req, const struct io_uring_sqe *sqe, if (ret) break; } - ret = io_madvise(req, nxt, force_nonblock); + ret = io_madvise(req, force_nonblock); break; case IORING_OP_EPOLL_CTL: if (sqe) { @@ -4615,7 +4585,7 @@ static int io_issue_sqe(struct io_kiocb *req, const struct io_uring_sqe *sqe, if (ret) break; } - ret = io_epoll_ctl(req, nxt, force_nonblock); + ret = io_epoll_ctl(req, force_nonblock); break; case IORING_OP_SPLICE: if (sqe) { @@ -4623,7 +4593,7 @@ static int io_issue_sqe(struct io_kiocb *req, const struct io_uring_sqe *sqe, if (ret < 0) break; } - ret = io_splice(req, nxt, force_nonblock); + ret = io_splice(req, force_nonblock); break; default: ret = -EINVAL; @@ -4667,7 +4637,7 @@ static void io_wq_submit_work(struct io_wq_work **workptr)
if (!ret) { do { - ret = io_issue_sqe(req, NULL, &nxt, false); + ret = io_issue_sqe(req, NULL, false); /* * We can get EAGAIN for polled IO even though we're * forcing a sync submission from here, since we can't @@ -4813,8 +4783,7 @@ static enum hrtimer_restart io_link_timeout_fn(struct hrtimer *timer)
if (prev) { req_set_fail_links(prev); - io_async_find_and_cancel(ctx, req, prev->user_data, NULL, - -ETIME); + io_async_find_and_cancel(ctx, req, prev->user_data, -ETIME); io_put_req(prev); } else { io_cqring_add_event(req, -ETIME); @@ -4883,7 +4852,7 @@ static void __io_queue_sqe(struct io_kiocb *req, const struct io_uring_sqe *sqe) old_creds = override_creds(req->work.creds); }
- ret = io_issue_sqe(req, sqe, &nxt, true); + ret = io_issue_sqe(req, sqe, true);
/* * We async punt it if the file wasn't marked NOWAIT, or if the file
From: Pavel Begunkov <asml.silence@gmail.com>

mainline inclusion
from mainline-5.7-rc1
commit 7a743e225b2a9da772b28a50031e1ccd8a8ce404
category: feature
bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27
CVE: NA

---------------------------
If, after dropping the submission reference, req->refs == 1, the request is done, because the remaining reference is the one held for io_put_work() and will be dropped synchronously shortly after. In this case it's safe to steal the next work from the request.
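As a rough sketch of the invariant this relies on (simplified from the diff below, not a drop-in replacement):

	/*
	 * While a request runs in io-wq, req->refs holds at least two
	 * references: the submission reference and the one io_put_work()
	 * drops synchronously after the handler returns.
	 */
	refcount_dec(&req->refs);		/* drop the submission ref */
	if (refcount_read(&req->refs) == 1) {
		/*
		 * Only io_put_work()'s reference is left, so there are no
		 * concurrent users: the linked (next) work can be taken
		 * over directly instead of being queued again.
		 */
	}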
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: yangerkun <yangerkun@huawei.com>
Reviewed-by: zhangyi (F) <yi.zhang@huawei.com>
Signed-off-by: Cheng Jian <cj.chengjian@huawei.com>
---
 fs/io_uring.c | 89 +++++++++++++++++++++++++++------------------------
 1 file changed, 48 insertions(+), 41 deletions(-)
diff --git a/fs/io_uring.c b/fs/io_uring.c index 270c1d0fe5e1..d6eaafea0aa1 100644 --- a/fs/io_uring.c +++ b/fs/io_uring.c @@ -1515,6 +1515,27 @@ static void io_free_req(struct io_kiocb *req) io_queue_async_work(nxt); }
+static void io_link_work_cb(struct io_wq_work **workptr) +{ + struct io_wq_work *work = *workptr; + struct io_kiocb *link = work->data; + + io_queue_linked_timeout(link); + io_wq_submit_work(workptr); +} + +static void io_wq_assign_next(struct io_wq_work **workptr, struct io_kiocb *nxt) +{ + struct io_kiocb *link; + + *workptr = &nxt->work; + link = io_prep_linked_timeout(nxt); + if (link) { + nxt->work.func = io_link_work_cb; + nxt->work.data = link; + } +} + /* * Drop reference to request, return next in chain (if there is one) if this * was the last reference to this request. @@ -1534,6 +1555,27 @@ static void io_put_req(struct io_kiocb *req) io_free_req(req); }
+static void io_put_req_async_completion(struct io_kiocb *req, + struct io_wq_work **workptr) +{ + /* + * It's in an io-wq worker, so there always should be at least + * one reference, which will be dropped in io_put_work() just + * after the current handler returns. + * + * It also means, that if the counter dropped to 1, then there is + * no asynchronous users left, so it's safe to steal the next work. + */ + refcount_dec(&req->refs); + if (refcount_read(&req->refs) == 1) { + struct io_kiocb *nxt = NULL; + + io_req_find_next(req, &nxt); + if (nxt) + io_wq_assign_next(workptr, nxt); + } +} + /* * Must only be used if we don't need to care about links, usually from * within the completion handling itself. @@ -2540,27 +2582,6 @@ static bool io_req_cancelled(struct io_kiocb *req) return false; }
-static void io_link_work_cb(struct io_wq_work **workptr) -{ - struct io_wq_work *work = *workptr; - struct io_kiocb *link = work->data; - - io_queue_linked_timeout(link); - io_wq_submit_work(workptr); -} - -static void io_wq_assign_next(struct io_wq_work **workptr, struct io_kiocb *nxt) -{ - struct io_kiocb *link; - - *workptr = &nxt->work; - link = io_prep_linked_timeout(nxt); - if (link) { - nxt->work.func = io_link_work_cb; - nxt->work.data = link; - } -} - static void __io_fsync(struct io_kiocb *req) { loff_t end = req->sync.off + req->sync.len; @@ -2578,14 +2599,11 @@ static void __io_fsync(struct io_kiocb *req) static void io_fsync_finish(struct io_wq_work **workptr) { struct io_kiocb *req = container_of(*workptr, struct io_kiocb, work); - struct io_kiocb *nxt = NULL;
if (io_req_cancelled(req)) return; __io_fsync(req); - io_put_req(req); /* drop submission reference */ - if (nxt) - io_wq_assign_next(workptr, nxt); + io_put_req_async_completion(req, workptr); }
static int io_fsync(struct io_kiocb *req, bool force_nonblock) @@ -2614,14 +2632,11 @@ static void __io_fallocate(struct io_kiocb *req) static void io_fallocate_finish(struct io_wq_work **workptr) { struct io_kiocb *req = container_of(*workptr, struct io_kiocb, work); - struct io_kiocb *nxt = NULL;
if (io_req_cancelled(req)) return; __io_fallocate(req); - io_put_req(req); /* drop submission reference */ - if (nxt) - io_wq_assign_next(workptr, nxt); + io_put_req_async_completion(req, workptr); }
static int io_fallocate_prep(struct io_kiocb *req, @@ -2939,13 +2954,10 @@ static void __io_close_finish(struct io_kiocb *req) static void io_close_finish(struct io_wq_work **workptr) { struct io_kiocb *req = container_of(*workptr, struct io_kiocb, work); - struct io_kiocb *nxt = NULL;
/* not cancellable, don't do io_req_cancelled() */ __io_close_finish(req); - io_put_req(req); /* drop submission reference */ - if (nxt) - io_wq_assign_next(workptr, nxt); + io_put_req_async_completion(req, workptr); }
static int io_close(struct io_kiocb *req, bool force_nonblock) @@ -3389,14 +3401,11 @@ static int __io_accept(struct io_kiocb *req, bool force_nonblock) static void io_accept_finish(struct io_wq_work **workptr) { struct io_kiocb *req = container_of(*workptr, struct io_kiocb, work); - struct io_kiocb *nxt = NULL;
if (io_req_cancelled(req)) return; __io_accept(req, false); - io_put_req(req); /* drop submission reference */ - if (nxt) - io_wq_assign_next(workptr, nxt); + io_put_req_async_completion(req, workptr); } #endif
@@ -4626,7 +4635,6 @@ static void io_wq_submit_work(struct io_wq_work **workptr) { struct io_wq_work *work = *workptr; struct io_kiocb *req = container_of(work, struct io_kiocb, work); - struct io_kiocb *nxt = NULL; int ret = 0;
/* if NO_CANCEL is set, we must still run the work */ @@ -4655,9 +4663,7 @@ static void io_wq_submit_work(struct io_wq_work **workptr) io_put_req(req); }
- io_put_req(req); /* drop submission reference */ - if (nxt) - io_wq_assign_next(workptr, nxt); + io_put_req_async_completion(req, workptr); }
static int io_req_needs_file(struct io_kiocb *req, int fd) @@ -6069,6 +6075,7 @@ static void io_put_work(struct io_wq_work *work) { struct io_kiocb *req = container_of(work, struct io_kiocb, work);
+ /* Consider that io_put_req_async_completion() relies on this ref */ io_put_req(req); }
From: Pavel Begunkov <asml.silence@gmail.com>

mainline inclusion
from mainline-5.7-rc1
commit dc026a73c7221b4d9d146ed0bde69ff578ebe8dc
category: feature
bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27
CVE: NA

---------------------------
This is a preparation patch: it adds some helpers and makes the next patches cleaner.
- extract io_impersonate_work() and io_assign_current_work()
- replace @next label with nested do-while
- move put_work() right after NULL'ing cur_work
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: yangerkun <yangerkun@huawei.com>
Reviewed-by: zhangyi (F) <yi.zhang@huawei.com>
Signed-off-by: Cheng Jian <cj.chengjian@huawei.com>
---
 fs/io-wq.c | 123 ++++++++++++++++++++++++++++------------------------
 1 file changed, 64 insertions(+), 59 deletions(-)
diff --git a/fs/io-wq.c b/fs/io-wq.c index 0ca2b17c82f9..d6479bfbfd51 100644 --- a/fs/io-wq.c +++ b/fs/io-wq.c @@ -441,14 +441,43 @@ static void io_wq_switch_creds(struct io_worker *worker, worker->saved_creds = old_creds; }
+static void io_impersonate_work(struct io_worker *worker, + struct io_wq_work *work) +{ + if (work->files && current->files != work->files) { + task_lock(current); + current->files = work->files; + task_unlock(current); + } + if (work->fs && current->fs != work->fs) + current->fs = work->fs; + if (work->mm != worker->mm) + io_wq_switch_mm(worker, work); + if (worker->cur_creds != work->creds) + io_wq_switch_creds(worker, work); +} + +static void io_assign_current_work(struct io_worker *worker, + struct io_wq_work *work) +{ + /* flush pending signals before assigning new work */ + if (signal_pending(current)) + flush_signals(current); + cond_resched(); + + spin_lock_irq(&worker->lock); + worker->cur_work = work; + spin_unlock_irq(&worker->lock); +} + static void io_worker_handle_work(struct io_worker *worker) __releases(wqe->lock) { - struct io_wq_work *work, *old_work = NULL, *put_work = NULL; struct io_wqe *wqe = worker->wqe; struct io_wq *wq = wqe->wq;
do { + struct io_wq_work *work, *old_work; unsigned hash = -1U;
/* @@ -465,69 +494,45 @@ static void io_worker_handle_work(struct io_worker *worker) wqe->flags |= IO_WQE_FLAG_STALLED;
spin_unlock_irq(&wqe->lock); - if (put_work && wq->put_work) - wq->put_work(old_work); if (!work) break; -next: - /* flush any pending signals before assigning new work */ - if (signal_pending(current)) - flush_signals(current); - - cond_resched();
- spin_lock_irq(&worker->lock); - worker->cur_work = work; - spin_unlock_irq(&worker->lock); - - if (work->files && current->files != work->files) { - task_lock(current); - current->files = work->files; - task_unlock(current); - } - if (work->fs && current->fs != work->fs) - current->fs = work->fs; - if (work->mm != worker->mm) - io_wq_switch_mm(worker, work); - if (worker->cur_creds != work->creds) - io_wq_switch_creds(worker, work); - /* - * OK to set IO_WQ_WORK_CANCEL even for uncancellable work, - * the worker function will do the right thing. - */ - if (test_bit(IO_WQ_BIT_CANCEL, &wq->state)) - work->flags |= IO_WQ_WORK_CANCEL; - - if (wq->get_work) { - put_work = work; - wq->get_work(work); - } - - old_work = work; - work->func(&work); - - spin_lock_irq(&worker->lock); - worker->cur_work = NULL; - spin_unlock_irq(&worker->lock); - - spin_lock_irq(&wqe->lock); - - if (hash != -1U) { - wqe->hash_map &= ~BIT(hash); - wqe->flags &= ~IO_WQE_FLAG_STALLED; - } - if (work && work != old_work) { - spin_unlock_irq(&wqe->lock); - - if (put_work && wq->put_work) { - wq->put_work(put_work); - put_work = NULL; + /* handle a whole dependent link */ + do { + io_assign_current_work(worker, work); + io_impersonate_work(worker, work); + + /* + * OK to set IO_WQ_WORK_CANCEL even for uncancellable + * work, the worker function will do the right thing. + */ + if (test_bit(IO_WQ_BIT_CANCEL, &wq->state)) + work->flags |= IO_WQ_WORK_CANCEL; + + if (wq->get_work) + wq->get_work(work); + + old_work = work; + work->func(&work); + + spin_lock_irq(&worker->lock); + worker->cur_work = NULL; + spin_unlock_irq(&worker->lock); + + if (wq->put_work) + wq->put_work(old_work); + + if (hash != -1U) { + spin_lock_irq(&wqe->lock); + wqe->hash_map &= ~BIT_ULL(hash); + wqe->flags &= ~IO_WQE_FLAG_STALLED; + spin_unlock_irq(&wqe->lock); + /* dependent work is not hashed */ + hash = -1U; } + } while (work && work != old_work);
- /* dependent work not hashed */ - hash = -1U; - goto next; - } + spin_lock_irq(&wqe->lock); } while (1); }
From: Pavel Begunkov <asml.silence@gmail.com>

mainline inclusion
from mainline-5.7-rc1
commit 58e3931987377d3f4ec7bbc13e4ea0aab52dc6b0
category: feature
bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27
CVE: NA

---------------------------
There are two optimisations:

- Currently, io_worker_handle_work() does io_assign_current_work() twice per request, and each call adds a lock/unlock(worker->lock) pair. The first resets worker->cur_work to NULL, and the second sets a real work shortly after. If there is a dependent work, set it immediately, which effectively removes the extra NULL'ing.

- There is no use in taking wqe->lock for linked works, as they are not hashed now. Optimise it out.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: yangerkun <yangerkun@huawei.com>
Reviewed-by: zhangyi (F) <yi.zhang@huawei.com>
Signed-off-by: Cheng Jian <cj.chengjian@huawei.com>
---
 fs/io-wq.c | 15 +++++++--------
 1 file changed, 7 insertions(+), 8 deletions(-)
diff --git a/fs/io-wq.c b/fs/io-wq.c index d6479bfbfd51..05f2fdc6bdce 100644 --- a/fs/io-wq.c +++ b/fs/io-wq.c @@ -477,7 +477,7 @@ static void io_worker_handle_work(struct io_worker *worker) struct io_wq *wq = wqe->wq;
do { - struct io_wq_work *work, *old_work; + struct io_wq_work *work; unsigned hash = -1U;
/* @@ -496,12 +496,13 @@ static void io_worker_handle_work(struct io_worker *worker) spin_unlock_irq(&wqe->lock); if (!work) break; + io_assign_current_work(worker, work);
/* handle a whole dependent link */ do { - io_assign_current_work(worker, work); - io_impersonate_work(worker, work); + struct io_wq_work *old_work;
+ io_impersonate_work(worker, work); /* * OK to set IO_WQ_WORK_CANCEL even for uncancellable * work, the worker function will do the right thing. @@ -514,10 +515,8 @@ static void io_worker_handle_work(struct io_worker *worker)
old_work = work; work->func(&work); - - spin_lock_irq(&worker->lock); - worker->cur_work = NULL; - spin_unlock_irq(&worker->lock); + work = (old_work == work) ? NULL : work; + io_assign_current_work(worker, work);
if (wq->put_work) wq->put_work(old_work); @@ -530,7 +529,7 @@ static void io_worker_handle_work(struct io_worker *worker) /* dependent work is not hashed */ hash = -1U; } - } while (work && work != old_work); + } while (work);
spin_lock_irq(&wqe->lock); } while (1);
From: Pavel Begunkov <asml.silence@gmail.com>

mainline inclusion
from mainline-v5.7-rc1
commit f462fd36fc43662eeb42c95a9b8da8659af6d75e
category: feature
bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27
CVE: NA

---------------------------
When executing non-linked hashed work, io_worker_handle_work() will lock and unlock wqe->lock to update the hash, and then immediately lock and unlock it again to get the next work item. Optimise this case by taking the lock only once.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: yangerkun <yangerkun@huawei.com>
Reviewed-by: zhangyi (F) <yi.zhang@huawei.com>
Signed-off-by: Cheng Jian <cj.chengjian@huawei.com>
---
 fs/io-wq.c | 9 ++++++---
 1 file changed, 6 insertions(+), 3 deletions(-)
diff --git a/fs/io-wq.c b/fs/io-wq.c index 05f2fdc6bdce..3a3a818f5416 100644 --- a/fs/io-wq.c +++ b/fs/io-wq.c @@ -475,11 +475,11 @@ static void io_worker_handle_work(struct io_worker *worker) { struct io_wqe *wqe = worker->wqe; struct io_wq *wq = wqe->wq; + unsigned hash = -1U;
do { struct io_wq_work *work; - unsigned hash = -1U; - +get_next: /* * If we got some work, mark us as busy. If we didn't, but * the list isn't empty, it means we stalled on hashed work. @@ -525,9 +525,12 @@ static void io_worker_handle_work(struct io_worker *worker) spin_lock_irq(&wqe->lock); wqe->hash_map &= ~BIT_ULL(hash); wqe->flags &= ~IO_WQE_FLAG_STALLED; - spin_unlock_irq(&wqe->lock); /* dependent work is not hashed */ hash = -1U; + /* skip unnecessary unlock-lock wqe->lock */ + if (!work) + goto get_next; + spin_unlock_irq(&wqe->lock); } } while (work);
From: Pavel Begunkov <asml.silence@gmail.com>

mainline inclusion
from mainline-5.7-rc1
commit e9fd939654f17651ff65e7e55aa6934d29eb4335
category: feature
bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27
CVE: NA

---------------------------
First it changes the io-wq interfaces: it replaces {get,put}_work() with free_work(), which is guaranteed to be called exactly once. It also enforces that the free_work() callback is non-NULL.

io_uring follows the changes: instead of putting the submission reference in io_put_req_async_completion(), it is now done in io_free_work(). As this also removes io_get_work() with its corresponding refcount_inc(), the reference balance is maintained.
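For an io-wq user, the resulting setup reduces to a single callback (a minimal sketch against the post-patch interface; my_free_work is a placeholder name):

	static void my_free_work(struct io_wq_work *work)
	{
		/* called exactly once per work item, whether it ran or was
		 * cancelled; drop whatever reference backs 'work' here */
	}

	struct io_wq_data data = {
		.user      = ctx->user,
		/* must be non-NULL; io_wq_create() WARNs and fails otherwise */
		.free_work = my_free_work,
	};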
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: yangerkun <yangerkun@huawei.com>
Reviewed-by: zhangyi (F) <yi.zhang@huawei.com>
Signed-off-by: Cheng Jian <cj.chengjian@huawei.com>
---
 fs/io-wq.c    | 29 ++++++++++++++---------------
 fs/io-wq.h    |  6 ++----
 fs/io_uring.c | 31 +++++++++++--------------------
 3 files changed, 27 insertions(+), 39 deletions(-)
diff --git a/fs/io-wq.c b/fs/io-wq.c index 3a3a818f5416..73c5bb244730 100644 --- a/fs/io-wq.c +++ b/fs/io-wq.c @@ -108,8 +108,7 @@ struct io_wq { struct io_wqe **wqes; unsigned long state;
- get_work_fn *get_work; - put_work_fn *put_work; + free_work_fn *free_work;
struct task_struct *manager; struct user_struct *user; @@ -510,16 +509,11 @@ static void io_worker_handle_work(struct io_worker *worker) if (test_bit(IO_WQ_BIT_CANCEL, &wq->state)) work->flags |= IO_WQ_WORK_CANCEL;
- if (wq->get_work) - wq->get_work(work); - old_work = work; work->func(&work); work = (old_work == work) ? NULL : work; io_assign_current_work(worker, work); - - if (wq->put_work) - wq->put_work(old_work); + wq->free_work(old_work);
if (hash != -1U) { spin_lock_irq(&wqe->lock); @@ -750,14 +744,17 @@ static bool io_wq_can_queue(struct io_wqe *wqe, struct io_wqe_acct *acct, return true; }
-static void io_run_cancel(struct io_wq_work *work) +static void io_run_cancel(struct io_wq_work *work, struct io_wqe *wqe) { + struct io_wq *wq = wqe->wq; + do { struct io_wq_work *old_work = work;
work->flags |= IO_WQ_WORK_CANCEL; work->func(&work); work = (work == old_work) ? NULL : work; + wq->free_work(old_work); } while (work); }
@@ -774,7 +771,7 @@ static void io_wqe_enqueue(struct io_wqe *wqe, struct io_wq_work *work) * It's close enough to not be an issue, fork() has the same delay. */ if (unlikely(!io_wq_can_queue(wqe, acct, work))) { - io_run_cancel(work); + io_run_cancel(work, wqe); return; }
@@ -913,7 +910,7 @@ static enum io_wq_cancel io_wqe_cancel_cb_work(struct io_wqe *wqe, spin_unlock_irqrestore(&wqe->lock, flags);
if (found) { - io_run_cancel(work); + io_run_cancel(work, wqe); return IO_WQ_CANCEL_OK; }
@@ -988,7 +985,7 @@ static enum io_wq_cancel io_wqe_cancel_work(struct io_wqe *wqe, spin_unlock_irqrestore(&wqe->lock, flags);
if (found) { - io_run_cancel(work); + io_run_cancel(work, wqe); return IO_WQ_CANCEL_OK; }
@@ -1065,6 +1062,9 @@ struct io_wq *io_wq_create(unsigned bounded, struct io_wq_data *data) int ret = -ENOMEM, node; struct io_wq *wq;
+ if (WARN_ON_ONCE(!data->free_work)) + return ERR_PTR(-EINVAL); + wq = kzalloc(sizeof(*wq), GFP_KERNEL); if (!wq) return ERR_PTR(-ENOMEM); @@ -1075,8 +1075,7 @@ struct io_wq *io_wq_create(unsigned bounded, struct io_wq_data *data) return ERR_PTR(-ENOMEM); }
- wq->get_work = data->get_work; - wq->put_work = data->put_work; + wq->free_work = data->free_work;
/* caller must already hold a reference to this */ wq->user = data->user; @@ -1133,7 +1132,7 @@ struct io_wq *io_wq_create(unsigned bounded, struct io_wq_data *data)
bool io_wq_get(struct io_wq *wq, struct io_wq_data *data) { - if (data->get_work != wq->get_work || data->put_work != wq->put_work) + if (data->free_work != wq->free_work) return false;
return refcount_inc_not_zero(&wq->use_refs); diff --git a/fs/io-wq.h b/fs/io-wq.h index a0978d6958f0..2117b9a4f161 100644 --- a/fs/io-wq.h +++ b/fs/io-wq.h @@ -81,14 +81,12 @@ struct io_wq_work { *(work) = (struct io_wq_work){ .func = _func }; \ } while (0) \
-typedef void (get_work_fn)(struct io_wq_work *); -typedef void (put_work_fn)(struct io_wq_work *); +typedef void (free_work_fn)(struct io_wq_work *);
struct io_wq_data { struct user_struct *user;
- get_work_fn *get_work; - put_work_fn *put_work; + free_work_fn *free_work; };
struct io_wq *io_wq_create(unsigned bounded, struct io_wq_data *data); diff --git a/fs/io_uring.c b/fs/io_uring.c index d6eaafea0aa1..d1b0a7845e1c 100644 --- a/fs/io_uring.c +++ b/fs/io_uring.c @@ -1555,8 +1555,8 @@ static void io_put_req(struct io_kiocb *req) io_free_req(req); }
-static void io_put_req_async_completion(struct io_kiocb *req, - struct io_wq_work **workptr) +static void io_steal_work(struct io_kiocb *req, + struct io_wq_work **workptr) { /* * It's in an io-wq worker, so there always should be at least @@ -1566,7 +1566,6 @@ static void io_put_req_async_completion(struct io_kiocb *req, * It also means, that if the counter dropped to 1, then there is * no asynchronous users left, so it's safe to steal the next work. */ - refcount_dec(&req->refs); if (refcount_read(&req->refs) == 1) { struct io_kiocb *nxt = NULL;
@@ -2575,7 +2574,7 @@ static bool io_req_cancelled(struct io_kiocb *req) if (req->work.flags & IO_WQ_WORK_CANCEL) { req_set_fail_links(req); io_cqring_add_event(req, -ECANCELED); - io_double_put_req(req); + io_put_req(req); return true; }
@@ -2603,7 +2602,7 @@ static void io_fsync_finish(struct io_wq_work **workptr) if (io_req_cancelled(req)) return; __io_fsync(req); - io_put_req_async_completion(req, workptr); + io_steal_work(req, workptr); }
static int io_fsync(struct io_kiocb *req, bool force_nonblock) @@ -2636,7 +2635,7 @@ static void io_fallocate_finish(struct io_wq_work **workptr) if (io_req_cancelled(req)) return; __io_fallocate(req); - io_put_req_async_completion(req, workptr); + io_steal_work(req, workptr); }
static int io_fallocate_prep(struct io_kiocb *req, @@ -2957,7 +2956,7 @@ static void io_close_finish(struct io_wq_work **workptr)
/* not cancellable, don't do io_req_cancelled() */ __io_close_finish(req); - io_put_req_async_completion(req, workptr); + io_steal_work(req, workptr); }
static int io_close(struct io_kiocb *req, bool force_nonblock) @@ -3405,7 +3404,7 @@ static void io_accept_finish(struct io_wq_work **workptr) if (io_req_cancelled(req)) return; __io_accept(req, false); - io_put_req_async_completion(req, workptr); + io_steal_work(req, workptr); } #endif
@@ -4663,7 +4662,7 @@ static void io_wq_submit_work(struct io_wq_work **workptr) io_put_req(req); }
- io_put_req_async_completion(req, workptr); + io_steal_work(req, workptr); }
static int io_req_needs_file(struct io_kiocb *req, int fd) @@ -6071,21 +6070,14 @@ static int io_sqe_files_update(struct io_ring_ctx *ctx, void __user *arg, return __io_sqe_files_update(ctx, &up, nr_args); }
-static void io_put_work(struct io_wq_work *work) +static void io_free_work(struct io_wq_work *work) { struct io_kiocb *req = container_of(work, struct io_kiocb, work);
- /* Consider that io_put_req_async_completion() relies on this ref */ + /* Consider that io_steal_work() relies on this ref */ io_put_req(req); }
-static void io_get_work(struct io_wq_work *work) -{ - struct io_kiocb *req = container_of(work, struct io_kiocb, work); - - refcount_inc(&req->refs); -} - static int io_init_wq_offload(struct io_ring_ctx *ctx, struct io_uring_params *p) { @@ -6096,8 +6088,7 @@ static int io_init_wq_offload(struct io_ring_ctx *ctx, int ret = 0;
data.user = ctx->user; - data.get_work = io_get_work; - data.put_work = io_put_work; + data.free_work = io_free_work;
if (!(p->flags & IORING_SETUP_ATTACH_WQ)) { /* Do QD, or 4 * CPUS, whatever is smallest */
From: Jens Axboe <axboe@kernel.dk>

mainline inclusion
from mainline-5.7-rc1
commit 5a2e745d4d430c4dbeeeb448c3d5c0c3109e511e
category: feature
bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27
CVE: NA

---------------------------
This just prepares the ring for having lists of buffers associated with it, which the application can provide for SQEs to consume instead of supplying their own.
The buffers are organized by group ID.
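Concretely, looking up a group is a single idr_find(); the head entry doubles as the list anchor for the rest of the group (a sketch of the lookup pattern the code below introduces):

	/* ctx->io_buffer_idr maps a group ID to the head io_buffer; the
	 * remaining buffers of the group are linked on head->list. */
	struct io_buffer *head = idr_find(&ctx->io_buffer_idr, bgid);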
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: yangerkun <yangerkun@huawei.com>
Reviewed-by: zhangyi (F) <yi.zhang@huawei.com>
Signed-off-by: Cheng Jian <cj.chengjian@huawei.com>
---
 fs/io_uring.c | 35 +++++++++++++++++++++++++++++++++++
 1 file changed, 35 insertions(+)
diff --git a/fs/io_uring.c b/fs/io_uring.c index d1b0a7845e1c..dc5381515877 100644 --- a/fs/io_uring.c +++ b/fs/io_uring.c @@ -195,6 +195,13 @@ struct fixed_file_data { struct completion done; };
+struct io_buffer { + struct list_head list; + __u64 addr; + __s32 len; + __u16 bid; +}; + struct io_ring_ctx { struct { struct percpu_ref refs; @@ -272,6 +279,8 @@ struct io_ring_ctx { struct socket *ring_sock; #endif
+ struct idr io_buffer_idr; + struct idr personality_idr;
struct { @@ -871,6 +880,7 @@ static struct io_ring_ctx *io_ring_ctx_alloc(struct io_uring_params *p) INIT_LIST_HEAD(&ctx->cq_overflow_list); init_completion(&ctx->completions[0]); init_completion(&ctx->completions[1]); + idr_init(&ctx->io_buffer_idr); idr_init(&ctx->personality_idr); mutex_init(&ctx->uring_lock); init_waitqueue_head(&ctx->wait); @@ -6491,6 +6501,30 @@ static int io_eventfd_unregister(struct io_ring_ctx *ctx) return -ENXIO; }
+static int __io_destroy_buffers(int id, void *p, void *data) +{ + struct io_ring_ctx *ctx = data; + struct io_buffer *buf = p; + + /* the head kbuf is the list itself */ + while (!list_empty(&buf->list)) { + struct io_buffer *nxt; + + nxt = list_first_entry(&buf->list, struct io_buffer, list); + list_del(&nxt->list); + kfree(nxt); + } + kfree(buf); + idr_remove(&ctx->io_buffer_idr, id); + return 0; +} + +static void io_destroy_buffers(struct io_ring_ctx *ctx) +{ + idr_for_each(&ctx->io_buffer_idr, __io_destroy_buffers, ctx); + idr_destroy(&ctx->io_buffer_idr); +} + static void io_ring_ctx_free(struct io_ring_ctx *ctx) { io_finish_async(ctx); @@ -6501,6 +6535,7 @@ static void io_ring_ctx_free(struct io_ring_ctx *ctx) io_sqe_buffer_unregister(ctx); io_sqe_files_unregister(ctx); io_eventfd_unregister(ctx); + io_destroy_buffers(ctx); idr_destroy(&ctx->personality_idr);
#if defined(CONFIG_UNIX)
From: Jens Axboe <axboe@kernel.dk>

mainline inclusion
from mainline-5.7-rc1
commit ddf0322db79c5984dc1a1db890f946dd19b7d6d9
category: feature
bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27
CVE: NA

---------------------------
IORING_OP_PROVIDE_BUFFERS uses the buffer registration infrastructure to support passing in an addr/len pair that is associated with a buffer ID and a buffer group ID. The group ID is used to index and look up the buffers, while the buffer ID can be used to notify the application which buffer in the group was used. The addr passed in is the starting buffer address, and len is the length of each buffer. The number of buffers to add can also be specified, in which case addr is incremented by len for each addition, and the buffer ID is incremented for each buffer.

No validation is done of the buffer ID. If the application provides buffers within the same group with identical buffer IDs, it will have a hard time telling which buffer was used. The only restriction is that the buffer ID is at most 16 bits in size, so USHRT_MAX is the maximum ID that can be used.
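From userspace, providing a group of buffers could look like this (a sketch using liburing's io_uring_prep_provide_buffers() helper, which is not part of this series; the constants are arbitrary):

	#include <liburing.h>

	#define GROUP_ID 1
	#define NR_BUFS  8
	#define BUF_LEN  4096

	/* base points at NR_BUFS * BUF_LEN bytes; buffer i covers
	 * base + i * BUF_LEN and is registered with buffer ID i. */
	static int provide_buffers(struct io_uring *ring, void *base)
	{
		struct io_uring_sqe *sqe = io_uring_get_sqe(ring);

		io_uring_prep_provide_buffers(sqe, base, BUF_LEN, NR_BUFS,
					      GROUP_ID, 0 /* starting buffer ID */);
		/* the completion's res carries the number of buffers added,
		 * or a negative error such as -ENOMEM */
		return io_uring_submit(ring);
	}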
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Conflicts:
	fs/io_uring.c
[commit cebdb98617ae ("io_uring: add support for IORING_OP_OPENAT2") is not merged]
Signed-off-by: yangerkun <yangerkun@huawei.com>
Reviewed-by: zhangyi (F) <yi.zhang@huawei.com>
Signed-off-by: Cheng Jian <cj.chengjian@huawei.com>
---
 fs/io_uring.c                 | 138 +++++++++++++++++++++++++++++++++-
 include/uapi/linux/io_uring.h |  10 ++-
 2 files changed, 145 insertions(+), 3 deletions(-)
diff --git a/fs/io_uring.c b/fs/io_uring.c index dc5381515877..b665cc71ba23 100644 --- a/fs/io_uring.c +++ b/fs/io_uring.c @@ -450,6 +450,15 @@ struct io_splice { unsigned int flags; };
+struct io_provide_buf { + struct file *file; + __u64 addr; + __s32 len; + __u32 bgid; + __u16 nbufs; + __u16 bid; +}; + struct io_async_connect { struct sockaddr_storage address; }; @@ -575,6 +584,7 @@ struct io_kiocb { struct io_madvise madvise; struct io_epoll epoll; struct io_splice splice; + struct io_provide_buf pbuf; };
struct io_async_ctx *io; @@ -796,7 +806,8 @@ static const struct io_op_def io_op_defs[] = { .needs_file = 1, .hash_reg_file = 1, .unbound_nonreg_file = 1, - } + }, + [IORING_OP_PROVIDE_BUFFERS] = {}, };
static void io_wq_submit_work(struct io_wq_work **workptr); @@ -2736,6 +2747,120 @@ static int io_openat(struct io_kiocb *req, bool force_nonblock) return 0; }
+static int io_provide_buffers_prep(struct io_kiocb *req, + const struct io_uring_sqe *sqe) +{ + struct io_provide_buf *p = &req->pbuf; + u64 tmp; + + if (sqe->ioprio || sqe->rw_flags) + return -EINVAL; + + tmp = READ_ONCE(sqe->fd); + if (!tmp || tmp > USHRT_MAX) + return -E2BIG; + p->nbufs = tmp; + p->addr = READ_ONCE(sqe->addr); + p->len = READ_ONCE(sqe->len); + + if (!access_ok(u64_to_user_ptr(p->addr), p->len)) + return -EFAULT; + + p->bgid = READ_ONCE(sqe->buf_group); + tmp = READ_ONCE(sqe->off); + if (tmp > USHRT_MAX) + return -E2BIG; + p->bid = tmp; + return 0; +} + +static int io_add_buffers(struct io_provide_buf *pbuf, struct io_buffer **head) +{ + struct io_buffer *buf; + u64 addr = pbuf->addr; + int i, bid = pbuf->bid; + + for (i = 0; i < pbuf->nbufs; i++) { + buf = kmalloc(sizeof(*buf), GFP_KERNEL); + if (!buf) + break; + + buf->addr = addr; + buf->len = pbuf->len; + buf->bid = bid; + addr += pbuf->len; + bid++; + if (!*head) { + INIT_LIST_HEAD(&buf->list); + *head = buf; + } else { + list_add_tail(&buf->list, &(*head)->list); + } + } + + return i ? i : -ENOMEM; +} + +static void io_ring_submit_unlock(struct io_ring_ctx *ctx, bool needs_lock) +{ + if (needs_lock) + mutex_unlock(&ctx->uring_lock); +} + +static void io_ring_submit_lock(struct io_ring_ctx *ctx, bool needs_lock) +{ + /* + * "Normal" inline submissions always hold the uring_lock, since we + * grab it from the system call. Same is true for the SQPOLL offload. + * The only exception is when we've detached the request and issue it + * from an async worker thread, grab the lock for that case. + */ + if (needs_lock) + mutex_lock(&ctx->uring_lock); +} + +static int io_provide_buffers(struct io_kiocb *req, bool force_nonblock) +{ + struct io_provide_buf *p = &req->pbuf; + struct io_ring_ctx *ctx = req->ctx; + struct io_buffer *head, *list; + int ret = 0; + + io_ring_submit_lock(ctx, !force_nonblock); + + lockdep_assert_held(&ctx->uring_lock); + + list = head = idr_find(&ctx->io_buffer_idr, p->bgid); + + ret = io_add_buffers(p, &head); + if (ret < 0) + goto out; + + if (!list) { + ret = idr_alloc(&ctx->io_buffer_idr, head, p->bgid, p->bgid + 1, + GFP_KERNEL); + if (ret < 0) { + while (!list_empty(&head->list)) { + struct io_buffer *buf; + + buf = list_first_entry(&head->list, + struct io_buffer, list); + list_del(&buf->list); + kfree(buf); + } + kfree(head); + goto out; + } + } +out: + io_ring_submit_unlock(ctx, !force_nonblock); + if (ret < 0) + req_set_fail_links(req); + io_cqring_add_event(req, ret); + io_put_req(req); + return 0; +} + static int io_epoll_ctl_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe) { @@ -4345,6 +4470,9 @@ static int io_req_defer_prep(struct io_kiocb *req, case IORING_OP_SPLICE: ret = io_splice_prep(req, sqe); break; + case IORING_OP_PROVIDE_BUFFERS: + ret = io_provide_buffers_prep(req, sqe); + break; default: printk_once(KERN_WARNING "io_uring: unhandled opcode %d\n", req->opcode); @@ -4613,6 +4741,14 @@ static int io_issue_sqe(struct io_kiocb *req, const struct io_uring_sqe *sqe, } ret = io_splice(req, force_nonblock); break; + case IORING_OP_PROVIDE_BUFFERS: + if (sqe) { + ret = io_provide_buffers_prep(req, sqe); + if (ret) + break; + } + ret = io_provide_buffers(req, force_nonblock); + break; default: ret = -EINVAL; break; diff --git a/include/uapi/linux/io_uring.h b/include/uapi/linux/io_uring.h index 14b4f075068f..5a3c5dd07e82 100644 --- a/include/uapi/linux/io_uring.h +++ b/include/uapi/linux/io_uring.h @@ -45,8 +45,13 @@ struct io_uring_sqe { __u64 user_data; /* data to be 
passed back at completion time */ union { struct { - /* index into fixed buffers, if used */ - __u16 buf_index; + /* pack this to avoid bogus arm OABI complaints */ + union { + /* index into fixed buffers, if used */ + __u16 buf_index; + /* for grouped buffer selection */ + __u16 buf_group; + } __attribute__((packed)); /* personality to use, if used */ __u16 personality; __s32 splice_fd_in; @@ -118,6 +123,7 @@ enum { IORING_OP_RECV, IORING_OP_EPOLL_CTL, IORING_OP_SPLICE, + IORING_OP_PROVIDE_BUFFERS,
/* this goes last, obviously */ IORING_OP_LAST,
From: Jens Axboe <axboe@kernel.dk>

mainline inclusion
from mainline-5.7-rc1
commit bcda7baaa3f15c7a95db3c024bb046d6e298f76b
category: feature
bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27
CVE: NA

---------------------------
If a server process has tons of pending socket connections, generally it uses epoll to wait for activity. When the socket is ready for reading (or writing), the task can select a buffer and issue a recv/send on the given fd.
Now that we have fast (non-async thread) support, a task can have tons of reads or writes pending. But that means they need buffers to back that data, and if the number of connections is high enough, having them preallocated for all possible connections is infeasible.
With IORING_OP_PROVIDE_BUFFERS, an application can register buffers to use for any request. The request then sets IOSQE_BUFFER_SELECT in the sqe, and a given group ID in sqe->buf_group. When the fd becomes ready, a free buffer from the specified group is selected. If none are available, the request is terminated with -ENOBUFS. If successful, the CQE on completion will contain the buffer ID chosen in the cqe->flags member, encoded as:
(buffer_id << IORING_CQE_BUFFER_SHIFT) | IORING_CQE_F_BUFFER;
Once a buffer has been consumed by a request, it is no longer available and must be registered again with IORING_OP_PROVIDE_BUFFERS.
Requests need to support this feature. For now, IORING_OP_READ and IORING_OP_RECV support it. This is checked at SQE submission time; a CQE with res == -EOPNOTSUPP will be posted if it is attempted on an unsupported request.
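Put together, a userspace consumer could look like this (again a liburing-based sketch, reusing GROUP_ID and BUF_LEN from the earlier example; error handling trimmed):

	static int recv_selected(struct io_uring *ring, int fd)
	{
		struct io_uring_sqe *sqe = io_uring_get_sqe(ring);
		struct io_uring_cqe *cqe;
		int bid = -1;

		io_uring_prep_recv(sqe, fd, NULL, BUF_LEN, 0);	/* no buffer passed in */
		sqe->flags |= IOSQE_BUFFER_SELECT;		/* kernel picks one ... */
		sqe->buf_group = GROUP_ID;			/* ... from this group */
		io_uring_submit(ring);

		io_uring_wait_cqe(ring, &cqe);
		if (cqe->res >= 0 && (cqe->flags & IORING_CQE_F_BUFFER))
			bid = cqe->flags >> IORING_CQE_BUFFER_SHIFT;
		io_uring_cqe_seen(ring, cqe);
		/* the buffer with ID 'bid' now holds cqe->res bytes and must
		 * be re-provided before it can be selected again */
		return bid;
	}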
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: yangerkun <yangerkun@huawei.com>
Reviewed-by: zhangyi (F) <yi.zhang@huawei.com>
Signed-off-by: Cheng Jian <cj.chengjian@huawei.com>
---
 fs/io_uring.c                 | 224 ++++++++++++++++++++++++++++------
 include/uapi/linux/io_uring.h |  14 +++
 2 files changed, 199 insertions(+), 39 deletions(-)
diff --git a/fs/io_uring.c b/fs/io_uring.c index b665cc71ba23..afd71ea5c918 100644 --- a/fs/io_uring.c +++ b/fs/io_uring.c @@ -396,7 +396,9 @@ struct io_sr_msg { void __user *buf; }; int msg_flags; + int bgid; size_t len; + struct io_buffer *kbuf; };
struct io_open { @@ -493,6 +495,7 @@ enum { REQ_F_LINK_BIT = IOSQE_IO_LINK_BIT, REQ_F_HARDLINK_BIT = IOSQE_IO_HARDLINK_BIT, REQ_F_FORCE_ASYNC_BIT = IOSQE_ASYNC_BIT, + REQ_F_BUFFER_SELECT_BIT = IOSQE_BUFFER_SELECT_BIT,
REQ_F_LINK_NEXT_BIT, REQ_F_FAIL_LINK_BIT, @@ -509,6 +512,7 @@ enum { REQ_F_NEED_CLEANUP_BIT, REQ_F_OVERFLOW_BIT, REQ_F_POLLED_BIT, + REQ_F_BUFFER_SELECTED_BIT, };
enum { @@ -522,6 +526,8 @@ enum { REQ_F_HARDLINK = BIT(REQ_F_HARDLINK_BIT), /* IOSQE_ASYNC */ REQ_F_FORCE_ASYNC = BIT(REQ_F_FORCE_ASYNC_BIT), + /* IOSQE_BUFFER_SELECT */ + REQ_F_BUFFER_SELECT = BIT(REQ_F_BUFFER_SELECT_BIT),
/* already grabbed next link */ REQ_F_LINK_NEXT = BIT(REQ_F_LINK_NEXT_BIT), @@ -553,6 +559,8 @@ enum { REQ_F_OVERFLOW = BIT(REQ_F_OVERFLOW_BIT), /* already went through poll handler */ REQ_F_POLLED = BIT(REQ_F_POLLED_BIT), + /* buffer already selected */ + REQ_F_BUFFER_SELECTED = BIT(REQ_F_BUFFER_SELECTED_BIT), };
struct async_poll { @@ -615,6 +623,7 @@ struct io_kiocb { struct callback_head task_work; struct hlist_node hash_node; struct async_poll *apoll; + int cflags; }; struct io_wq_work work; }; @@ -664,6 +673,8 @@ struct io_op_def { /* set if opcode supports polled "wait" */ unsigned pollin : 1; unsigned pollout : 1; + /* op supports buffer selection */ + unsigned buffer_select : 1; };
static const struct io_op_def io_op_defs[] = { @@ -773,6 +784,7 @@ static const struct io_op_def io_op_defs[] = { .needs_file = 1, .unbound_nonreg_file = 1, .pollin = 1, + .buffer_select = 1, }, [IORING_OP_WRITE] = { .needs_mm = 1, @@ -797,6 +809,7 @@ static const struct io_op_def io_op_defs[] = { .needs_file = 1, .unbound_nonreg_file = 1, .pollin = 1, + .buffer_select = 1, }, [IORING_OP_EPOLL_CTL] = { .unbound_nonreg_file = 1, @@ -1167,7 +1180,7 @@ static bool io_cqring_overflow_flush(struct io_ring_ctx *ctx, bool force) if (cqe) { WRITE_ONCE(cqe->user_data, req->user_data); WRITE_ONCE(cqe->res, req->result); - WRITE_ONCE(cqe->flags, 0); + WRITE_ONCE(cqe->flags, req->cflags); } else { WRITE_ONCE(ctx->rings->cq_overflow, atomic_inc_return(&ctx->cached_cq_overflow)); @@ -1191,7 +1204,7 @@ static bool io_cqring_overflow_flush(struct io_ring_ctx *ctx, bool force) return cqe != NULL; }
-static void io_cqring_fill_event(struct io_kiocb *req, long res) +static void __io_cqring_fill_event(struct io_kiocb *req, long res, long cflags) { struct io_ring_ctx *ctx = req->ctx; struct io_uring_cqe *cqe; @@ -1207,7 +1220,7 @@ static void io_cqring_fill_event(struct io_kiocb *req, long res) if (likely(cqe)) { WRITE_ONCE(cqe->user_data, req->user_data); WRITE_ONCE(cqe->res, res); - WRITE_ONCE(cqe->flags, 0); + WRITE_ONCE(cqe->flags, cflags); } else if (ctx->cq_overflow_flushed) { WRITE_ONCE(ctx->rings->cq_overflow, atomic_inc_return(&ctx->cached_cq_overflow)); @@ -1219,23 +1232,34 @@ static void io_cqring_fill_event(struct io_kiocb *req, long res) req->flags |= REQ_F_OVERFLOW; refcount_inc(&req->refs); req->result = res; + req->cflags = cflags; list_add_tail(&req->list, &ctx->cq_overflow_list); } }
-static void io_cqring_add_event(struct io_kiocb *req, long res) +static void io_cqring_fill_event(struct io_kiocb *req, long res) +{ + __io_cqring_fill_event(req, res, 0); +} + +static void __io_cqring_add_event(struct io_kiocb *req, long res, long cflags) { struct io_ring_ctx *ctx = req->ctx; unsigned long flags;
spin_lock_irqsave(&ctx->completion_lock, flags); - io_cqring_fill_event(req, res); + __io_cqring_fill_event(req, res, cflags); io_commit_cqring(ctx); spin_unlock_irqrestore(&ctx->completion_lock, flags);
io_cqring_ev_posted(ctx); }
+static void io_cqring_add_event(struct io_kiocb *req, long res) +{ + __io_cqring_add_event(req, res, 0); +} + static inline bool io_is_fallback_req(struct io_kiocb *req) { return req == (struct io_kiocb *) @@ -1657,6 +1681,18 @@ static inline bool io_req_multi_free(struct req_batch *rb, struct io_kiocb *req) return true; }
+static int io_put_kbuf(struct io_kiocb *req) +{ + struct io_buffer *kbuf = (struct io_buffer *) req->rw.addr; + int cflags; + + cflags = kbuf->bid << IORING_CQE_BUFFER_SHIFT; + cflags |= IORING_CQE_F_BUFFER; + req->rw.addr = 0; + kfree(kbuf); + return cflags; +} + /* * Find and free completed poll iocbs */ @@ -1668,10 +1704,15 @@ static void io_iopoll_complete(struct io_ring_ctx *ctx, unsigned int *nr_events,
rb.to_free = rb.need_iter = 0; while (!list_empty(done)) { + int cflags = 0; + req = list_first_entry(done, struct io_kiocb, list); list_del(&req->list);
- io_cqring_fill_event(req, req->result); + if (req->flags & REQ_F_BUFFER_SELECTED) + cflags = io_put_kbuf(req); + + __io_cqring_fill_event(req, req->result, cflags); (*nr_events)++;
if (refcount_dec_and_test(&req->refs) && @@ -1846,13 +1887,16 @@ static inline void req_set_fail_links(struct io_kiocb *req) static void io_complete_rw_common(struct kiocb *kiocb, long res) { struct io_kiocb *req = container_of(kiocb, struct io_kiocb, rw.kiocb); + int cflags = 0;
if (kiocb->ki_flags & IOCB_WRITE) kiocb_end_write(req);
if (res != req->result) req_set_fail_links(req); - io_cqring_add_event(req, res); + if (req->flags & REQ_F_BUFFER_SELECTED) + cflags = io_put_kbuf(req); + __io_cqring_add_event(req, res, cflags); }
static void io_complete_rw(struct kiocb *kiocb, long res, long res2) @@ -2030,7 +2074,7 @@ static int io_prep_rw(struct io_kiocb *req, const struct io_uring_sqe *sqe,
req->rw.addr = READ_ONCE(sqe->addr); req->rw.len = READ_ONCE(sqe->len); - /* we own ->private, reuse it for the buffer index */ + /* we own ->private, reuse it for the buffer index / buffer ID */ req->rw.kiocb.private = (void *) (unsigned long) READ_ONCE(sqe->buf_index); return 0; @@ -2143,8 +2187,61 @@ static ssize_t io_import_fixed(struct io_kiocb *req, int rw, return len; }
+static void io_ring_submit_unlock(struct io_ring_ctx *ctx, bool needs_lock) +{ + if (needs_lock) + mutex_unlock(&ctx->uring_lock); +} + +static void io_ring_submit_lock(struct io_ring_ctx *ctx, bool needs_lock) +{ + /* + * "Normal" inline submissions always hold the uring_lock, since we + * grab it from the system call. Same is true for the SQPOLL offload. + * The only exception is when we've detached the request and issue it + * from an async worker thread, grab the lock for that case. + */ + if (needs_lock) + mutex_lock(&ctx->uring_lock); +} + +static struct io_buffer *io_buffer_select(struct io_kiocb *req, size_t *len, + int bgid, struct io_buffer *kbuf, + bool needs_lock) +{ + struct io_buffer *head; + + if (req->flags & REQ_F_BUFFER_SELECTED) + return kbuf; + + io_ring_submit_lock(req->ctx, needs_lock); + + lockdep_assert_held(&req->ctx->uring_lock); + + head = idr_find(&req->ctx->io_buffer_idr, bgid); + if (head) { + if (!list_empty(&head->list)) { + kbuf = list_last_entry(&head->list, struct io_buffer, + list); + list_del(&kbuf->list); + } else { + kbuf = head; + idr_remove(&req->ctx->io_buffer_idr, bgid); + } + if (*len > kbuf->len) + *len = kbuf->len; + } else { + kbuf = ERR_PTR(-ENOBUFS); + } + + io_ring_submit_unlock(req->ctx, needs_lock); + + return kbuf; +} + static ssize_t io_import_iovec(int rw, struct io_kiocb *req, - struct iovec **iovec, struct iov_iter *iter) + struct iovec **iovec, struct iov_iter *iter, + bool needs_lock) { void __user *buf = u64_to_user_ptr(req->rw.addr); size_t sqe_len = req->rw.len; @@ -2156,12 +2253,29 @@ static ssize_t io_import_iovec(int rw, struct io_kiocb *req, return io_import_fixed(req, rw, iter); }
- /* buffer index only valid with fixed read/write */ - if (req->rw.kiocb.private) + /* buffer index only valid with fixed read/write, or buffer select */ + if (req->rw.kiocb.private && !(req->flags & REQ_F_BUFFER_SELECT)) return -EINVAL;
if (opcode == IORING_OP_READ || opcode == IORING_OP_WRITE) { ssize_t ret; + + if (req->flags & REQ_F_BUFFER_SELECT) { + struct io_buffer *kbuf = (struct io_buffer *) req->rw.addr; + int bgid; + + bgid = (int) (unsigned long) req->rw.kiocb.private; + kbuf = io_buffer_select(req, &sqe_len, bgid, kbuf, + needs_lock); + if (IS_ERR(kbuf)) { + *iovec = NULL; + return PTR_ERR(kbuf); + } + req->rw.addr = (u64) kbuf; + req->flags |= REQ_F_BUFFER_SELECTED; + buf = u64_to_user_ptr(kbuf->addr); + } + ret = import_single_range(rw, buf, sqe_len, *iovec, iter); *iovec = NULL; return ret < 0 ? ret : sqe_len; @@ -2304,7 +2418,7 @@ static int io_read_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe, io = req->io; io->rw.iov = io->rw.fast_iov; req->io = NULL; - ret = io_import_iovec(READ, req, &io->rw.iov, &iter); + ret = io_import_iovec(READ, req, &io->rw.iov, &iter, !force_nonblock); req->io = io; if (ret < 0) return ret; @@ -2321,7 +2435,7 @@ static int io_read(struct io_kiocb *req, bool force_nonblock) size_t iov_count; ssize_t io_size, ret;
- ret = io_import_iovec(READ, req, &iovec, &iter); + ret = io_import_iovec(READ, req, &iovec, &iter, !force_nonblock); if (ret < 0) return ret;
@@ -2393,7 +2507,7 @@ static int io_write_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe, io = req->io; io->rw.iov = io->rw.fast_iov; req->io = NULL; - ret = io_import_iovec(WRITE, req, &io->rw.iov, &iter); + ret = io_import_iovec(WRITE, req, &io->rw.iov, &iter, !force_nonblock); req->io = io; if (ret < 0) return ret; @@ -2410,7 +2524,7 @@ static int io_write(struct io_kiocb *req, bool force_nonblock) size_t iov_count; ssize_t ret, io_size;
- ret = io_import_iovec(WRITE, req, &iovec, &iter); + ret = io_import_iovec(WRITE, req, &iovec, &iter, !force_nonblock); if (ret < 0) return ret;
@@ -2801,24 +2915,6 @@ static int io_add_buffers(struct io_provide_buf *pbuf, struct io_buffer **head) return i ? i : -ENOMEM; }
-static void io_ring_submit_unlock(struct io_ring_ctx *ctx, bool needs_lock) -{ - if (needs_lock) - mutex_unlock(&ctx->uring_lock); -} - -static void io_ring_submit_lock(struct io_ring_ctx *ctx, bool needs_lock) -{ - /* - * "Normal" inline submissions always hold the uring_lock, since we - * grab it from the system call. Same is true for the SQPOLL offload. - * The only exception is when we've detached the request and issue it - * from an async worker thread, grab the lock for that case. - */ - if (needs_lock) - mutex_lock(&ctx->uring_lock); -} - static int io_provide_buffers(struct io_kiocb *req, bool force_nonblock) { struct io_provide_buf *p = &req->pbuf; @@ -3341,6 +3437,27 @@ static int io_send(struct io_kiocb *req, bool force_nonblock) #endif }
+static struct io_buffer *io_recv_buffer_select(struct io_kiocb *req, + int *cflags, bool needs_lock) +{ + struct io_sr_msg *sr = &req->sr_msg; + struct io_buffer *kbuf; + + if (!(req->flags & REQ_F_BUFFER_SELECT)) + return NULL; + + kbuf = io_buffer_select(req, &sr->len, sr->bgid, sr->kbuf, needs_lock); + if (IS_ERR(kbuf)) + return kbuf; + + sr->kbuf = kbuf; + req->flags |= REQ_F_BUFFER_SELECTED; + + *cflags = kbuf->bid << IORING_CQE_BUFFER_SHIFT; + *cflags |= IORING_CQE_F_BUFFER; + return kbuf; +} + static int io_recvmsg_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe) { @@ -3352,6 +3469,7 @@ static int io_recvmsg_prep(struct io_kiocb *req, sr->msg_flags = READ_ONCE(sqe->msg_flags); sr->msg = u64_to_user_ptr(READ_ONCE(sqe->addr)); sr->len = READ_ONCE(sqe->len); + sr->bgid = READ_ONCE(sqe->buf_group);
#ifdef CONFIG_COMPAT if (req->ctx->compat) @@ -3441,8 +3559,9 @@ static int io_recvmsg(struct io_kiocb *req, bool force_nonblock) static int io_recv(struct io_kiocb *req, bool force_nonblock) { #if defined(CONFIG_NET) + struct io_buffer *kbuf = NULL; struct socket *sock; - int ret; + int ret, cflags = 0;
if (unlikely(req->ctx->flags & IORING_SETUP_IOPOLL)) return -EINVAL; @@ -3450,15 +3569,25 @@ static int io_recv(struct io_kiocb *req, bool force_nonblock) sock = sock_from_file(req->file, &ret); if (sock) { struct io_sr_msg *sr = &req->sr_msg; + void __user *buf = sr->buf; struct msghdr msg; struct iovec iov; unsigned flags;
- ret = import_single_range(READ, sr->buf, sr->len, &iov, + kbuf = io_recv_buffer_select(req, &cflags, !force_nonblock); + if (IS_ERR(kbuf)) + return PTR_ERR(kbuf); + else if (kbuf) + buf = u64_to_user_ptr(kbuf->addr); + + ret = import_single_range(READ, buf, sr->len, &iov, &msg.msg_iter); - if (ret) + if (ret) { + kfree(kbuf); return ret; + }
+ req->flags |= REQ_F_NEED_CLEANUP; msg.msg_name = NULL; msg.msg_control = NULL; msg.msg_controllen = 0; @@ -3479,7 +3608,9 @@ static int io_recv(struct io_kiocb *req, bool force_nonblock) ret = -EINTR; }
- io_cqring_add_event(req, ret); + kfree(kbuf); + req->flags &= ~REQ_F_NEED_CLEANUP; + __io_cqring_add_event(req, ret, cflags); if (ret < 0) req_set_fail_links(req); io_put_req(req); @@ -4519,6 +4650,9 @@ static void io_cleanup_req(struct io_kiocb *req) case IORING_OP_READV: case IORING_OP_READ_FIXED: case IORING_OP_READ: + if (req->flags & REQ_F_BUFFER_SELECTED) + kfree((void *)(unsigned long)req->rw.addr); + /* fallthrough */ case IORING_OP_WRITEV: case IORING_OP_WRITE_FIXED: case IORING_OP_WRITE: @@ -4530,6 +4664,10 @@ static void io_cleanup_req(struct io_kiocb *req) if (io->msg.iov != io->msg.fast_iov) kfree(io->msg.iov); break; + case IORING_OP_RECV: + if (req->flags & REQ_F_BUFFER_SELECTED) + kfree(req->sr_msg.kbuf); + break; case IORING_OP_OPENAT: case IORING_OP_STATX: putname(req->open.filename); @@ -5098,7 +5236,8 @@ static inline void io_queue_link_head(struct io_kiocb *req) }
#define SQE_VALID_FLAGS (IOSQE_FIXED_FILE|IOSQE_IO_DRAIN|IOSQE_IO_LINK| \ - IOSQE_IO_HARDLINK | IOSQE_ASYNC) + IOSQE_IO_HARDLINK | IOSQE_ASYNC | \ + IOSQE_BUFFER_SELECT)
static bool io_submit_sqe(struct io_kiocb *req, const struct io_uring_sqe *sqe, struct io_submit_state *state, struct io_kiocb **link) @@ -5115,6 +5254,12 @@ static bool io_submit_sqe(struct io_kiocb *req, const struct io_uring_sqe *sqe, goto err_req; }
+ if ((sqe_flags & IOSQE_BUFFER_SELECT) && + !io_op_defs[req->opcode].buffer_select) { + ret = -EOPNOTSUPP; + goto err_req; + } + id = READ_ONCE(sqe->personality); if (id) { req->work.creds = idr_find(&ctx->personality_idr, id); @@ -5127,7 +5272,8 @@ static bool io_submit_sqe(struct io_kiocb *req, const struct io_uring_sqe *sqe,
/* same numerical values with corresponding REQ_F_*, safe to copy */ req->flags |= sqe_flags & (IOSQE_IO_DRAIN | IOSQE_IO_HARDLINK | - IOSQE_ASYNC | IOSQE_FIXED_FILE); + IOSQE_ASYNC | IOSQE_FIXED_FILE | + IOSQE_BUFFER_SELECT);
ret = io_req_set_file(state, req, sqe); if (unlikely(ret)) { diff --git a/include/uapi/linux/io_uring.h b/include/uapi/linux/io_uring.h index 5a3c5dd07e82..28a85bdff505 100644 --- a/include/uapi/linux/io_uring.h +++ b/include/uapi/linux/io_uring.h @@ -66,6 +66,7 @@ enum { IOSQE_IO_LINK_BIT, IOSQE_IO_HARDLINK_BIT, IOSQE_ASYNC_BIT, + IOSQE_BUFFER_SELECT_BIT, };
/* @@ -81,6 +82,8 @@ enum { #define IOSQE_IO_HARDLINK (1U << IOSQE_IO_HARDLINK_BIT) /* always go async */ #define IOSQE_ASYNC (1U << IOSQE_ASYNC_BIT) +/* select buffer from sqe->buf_group */ +#define IOSQE_BUFFER_SELECT (1U << IOSQE_BUFFER_SELECT_BIT)
/* * io_uring_setup() flags @@ -154,6 +157,17 @@ struct io_uring_cqe { __u32 flags; };
+/* + * cqe->flags + * + * IORING_CQE_F_BUFFER If set, the upper 16 bits are the buffer ID + */ +#define IORING_CQE_F_BUFFER (1U << 0) + +enum { + IORING_CQE_BUFFER_SHIFT = 16, +}; + /* * Magic offsets for the application to mmap the data it needs */
From: Jens Axboe <axboe@kernel.dk>

mainline inclusion
from mainline-5.7-rc1
commit 4d954c258a0c365a85a2d1b1cccf63aec38fca4c
category: feature
bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27
CVE: NA

---------------------------
This adds buffer-selection support for the vectored read. It is limited to supporting just one segment in the iov, and is provided for the convenience of applications that already use IORING_OP_READV.
The iov helpers will be used for IORING_OP_RECVMSG as well.
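In practice only the length of the single segment is taken from the iovec; the base is filled in from the selected buffer (a sketch continuing the liburing examples above):

	struct iovec iov = { .iov_base = NULL, .iov_len = BUF_LEN };

	io_uring_prep_readv(sqe, fd, &iov, 1, 0);	/* exactly one segment */
	sqe->flags |= IOSQE_BUFFER_SELECT;
	sqe->buf_group = GROUP_ID;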
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: yangerkun <yangerkun@huawei.com>
Reviewed-by: zhangyi (F) <yi.zhang@huawei.com>
Signed-off-by: Cheng Jian <cj.chengjian@huawei.com>
---
 fs/io_uring.c | 111 +++++++++++++++++++++++++++++++++++++++++++-------
 1 file changed, 97 insertions(+), 14 deletions(-)
diff --git a/fs/io_uring.c b/fs/io_uring.c index afd71ea5c918..a1111cc25bac 100644 --- a/fs/io_uring.c +++ b/fs/io_uring.c @@ -685,6 +685,7 @@ static const struct io_op_def io_op_defs[] = { .needs_file = 1, .unbound_nonreg_file = 1, .pollin = 1, + .buffer_select = 1, }, [IORING_OP_WRITEV] = { .async_ctx = 1, @@ -1683,9 +1684,10 @@ static inline bool io_req_multi_free(struct req_batch *rb, struct io_kiocb *req)
static int io_put_kbuf(struct io_kiocb *req) { - struct io_buffer *kbuf = (struct io_buffer *) req->rw.addr; + struct io_buffer *kbuf; int cflags;
+ kbuf = (struct io_buffer *) (unsigned long) req->rw.addr; cflags = kbuf->bid << IORING_CQE_BUFFER_SHIFT; cflags |= IORING_CQE_F_BUFFER; req->rw.addr = 0; @@ -2239,12 +2241,95 @@ static struct io_buffer *io_buffer_select(struct io_kiocb *req, size_t *len, return kbuf; }
+static void __user *io_rw_buffer_select(struct io_kiocb *req, size_t *len, + bool needs_lock) +{ + struct io_buffer *kbuf; + int bgid; + + kbuf = (struct io_buffer *) (unsigned long) req->rw.addr; + bgid = (int) (unsigned long) req->rw.kiocb.private; + kbuf = io_buffer_select(req, len, bgid, kbuf, needs_lock); + if (IS_ERR(kbuf)) + return kbuf; + req->rw.addr = (u64) (unsigned long) kbuf; + req->flags |= REQ_F_BUFFER_SELECTED; + return u64_to_user_ptr(kbuf->addr); +} + +#ifdef CONFIG_COMPAT +static ssize_t io_compat_import(struct io_kiocb *req, struct iovec *iov, + bool needs_lock) +{ + struct compat_iovec __user *uiov; + compat_ssize_t clen; + void __user *buf; + ssize_t len; + + uiov = u64_to_user_ptr(req->rw.addr); + if (!access_ok(uiov, sizeof(*uiov))) + return -EFAULT; + if (__get_user(clen, &uiov->iov_len)) + return -EFAULT; + if (clen < 0) + return -EINVAL; + + len = clen; + buf = io_rw_buffer_select(req, &len, needs_lock); + if (IS_ERR(buf)) + return PTR_ERR(buf); + iov[0].iov_base = buf; + iov[0].iov_len = (compat_size_t) len; + return 0; +} +#endif + +static ssize_t __io_iov_buffer_select(struct io_kiocb *req, struct iovec *iov, + bool needs_lock) +{ + struct iovec __user *uiov = u64_to_user_ptr(req->rw.addr); + void __user *buf; + ssize_t len; + + if (copy_from_user(iov, uiov, sizeof(*uiov))) + return -EFAULT; + + len = iov[0].iov_len; + if (len < 0) + return -EINVAL; + buf = io_rw_buffer_select(req, &len, needs_lock); + if (IS_ERR(buf)) + return PTR_ERR(buf); + iov[0].iov_base = buf; + iov[0].iov_len = len; + return 0; +} + +static ssize_t io_iov_buffer_select(struct io_kiocb *req, struct iovec *iov, + bool needs_lock) +{ + if (req->flags & REQ_F_BUFFER_SELECTED) + return 0; + if (!req->rw.len) + return 0; + else if (req->rw.len > 1) + return -EINVAL; + +#ifdef CONFIG_COMPAT + if (req->ctx->compat) + return io_compat_import(req, iov, needs_lock); +#endif + + return __io_iov_buffer_select(req, iov, needs_lock); +} + static ssize_t io_import_iovec(int rw, struct io_kiocb *req, struct iovec **iovec, struct iov_iter *iter, bool needs_lock) { void __user *buf = u64_to_user_ptr(req->rw.addr); size_t sqe_len = req->rw.len; + ssize_t ret; u8 opcode;
opcode = req->opcode; @@ -2258,22 +2343,12 @@ static ssize_t io_import_iovec(int rw, struct io_kiocb *req, return -EINVAL;
if (opcode == IORING_OP_READ || opcode == IORING_OP_WRITE) { - ssize_t ret; - if (req->flags & REQ_F_BUFFER_SELECT) { - struct io_buffer *kbuf = (struct io_buffer *) req->rw.addr; - int bgid; - - bgid = (int) (unsigned long) req->rw.kiocb.private; - kbuf = io_buffer_select(req, &sqe_len, bgid, kbuf, - needs_lock); - if (IS_ERR(kbuf)) { + buf = io_rw_buffer_select(req, &sqe_len, needs_lock); + if (IS_ERR(buf)) { *iovec = NULL; - return PTR_ERR(kbuf); + return PTR_ERR(buf); } - req->rw.addr = (u64) kbuf; - req->flags |= REQ_F_BUFFER_SELECTED; - buf = u64_to_user_ptr(kbuf->addr); }
ret = import_single_range(rw, buf, sqe_len, *iovec, iter); @@ -2291,6 +2366,14 @@ static ssize_t io_import_iovec(int rw, struct io_kiocb *req, return iorw->size; }
+ if (req->flags & REQ_F_BUFFER_SELECT) { + ret = io_iov_buffer_select(req, *iovec, needs_lock); + if (!ret) + iov_iter_init(iter, rw, *iovec, 1, (*iovec)->iov_len); + *iovec = NULL; + return ret; + } + #ifdef CONFIG_COMPAT if (req->ctx->compat) return compat_import_iovec(rw, buf, sqe_len, UIO_FASTIOV,
From: Jens Axboe <axboe@kernel.dk>

mainline inclusion
from mainline-5.7-rc1
commit 0a384abfae66651b28e4bbe16883b1ff046ba3b3
category: feature
bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27
CVE: NA

---------------------------
This splits the msghdr import into two parts: one that imports the message header itself, and one that imports the iovec. This allows a caller to do only the first part, and import the iovec manually afterwards.
No functional changes in this patch.
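A caller can now do the two steps explicitly (a sketch against the new helpers; declarations abbreviated and error handling trimmed):

	struct sockaddr __user *uaddr;
	struct iovec __user *uiov;
	struct iovec fast_iov[UIO_FASTIOV], *iov = fast_iov;
	struct msghdr kmsg;
	size_t nsegs;
	int ret;

	/* step 1: copy the header, leaving the iovec untouched */
	ret = __copy_msghdr_from_user(&kmsg, umsg, &uaddr, &uiov, &nsegs);
	if (ret)
		return ret;
	/* step 2: import the iovec separately, possibly after inspecting it */
	ret = import_iovec(READ, uiov, nsegs, UIO_FASTIOV, &iov, &kmsg.msg_iter);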
Acked-by: David Miller <davem@davemloft.net>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: yangerkun <yangerkun@huawei.com>
Reviewed-by: zhangyi (F) <yi.zhang@huawei.com>
Signed-off-by: Cheng Jian <cj.chengjian@huawei.com>
---
 include/linux/socket.h |  4 ++++
 include/net/compat.h   |  3 +++
 net/compat.c           | 30 +++++++++++++++++++++++-------
 net/socket.c           | 25 +++++++++++++++++++++----
 4 files changed, 51 insertions(+), 11 deletions(-)
diff --git a/include/linux/socket.h b/include/linux/socket.h index 97f2a929b2bf..05c87e849a87 100644 --- a/include/linux/socket.h +++ b/include/linux/socket.h @@ -377,6 +377,10 @@ extern int recvmsg_copy_msghdr(struct msghdr *msg, struct user_msghdr __user *umsg, unsigned flags, struct sockaddr __user **uaddr, struct iovec **iov); +extern int __copy_msghdr_from_user(struct msghdr *kmsg, + struct user_msghdr __user *umsg, + struct sockaddr __user **save_addr, + struct iovec __user **uiov, size_t *nsegs);
/* helpers which do the actual work for syscalls */ extern int __sys_recvfrom(int fd, void __user *ubuf, size_t size, diff --git a/include/net/compat.h b/include/net/compat.h index 4c6d75612b6c..2f861518cc89 100644 --- a/include/net/compat.h +++ b/include/net/compat.h @@ -41,6 +41,9 @@ int compat_sock_get_timestampns(struct sock *, struct timespec __user *); #define compat_mmsghdr mmsghdr #endif /* defined(CONFIG_COMPAT) */
+int __get_compat_msghdr(struct msghdr *kmsg, struct compat_msghdr __user *umsg, + struct sockaddr __user **save_addr, compat_uptr_t *ptr, + compat_size_t *len); int get_compat_msghdr(struct msghdr *, struct compat_msghdr __user *, struct sockaddr __user **, struct iovec **); struct sock_fprog __user *get_compat_bpf_fprog(char __user *optval); diff --git a/net/compat.c b/net/compat.c index 2582a9223d80..42afe8f45ff8 100644 --- a/net/compat.c +++ b/net/compat.c @@ -32,10 +32,10 @@ #include <linux/uaccess.h> #include <net/compat.h>
-int get_compat_msghdr(struct msghdr *kmsg, - struct compat_msghdr __user *umsg, - struct sockaddr __user **save_addr, - struct iovec **iov) +int __get_compat_msghdr(struct msghdr *kmsg, + struct compat_msghdr __user *umsg, + struct sockaddr __user **save_addr, + compat_uptr_t *ptr, compat_size_t *len) { struct compat_msghdr msg; ssize_t err; @@ -78,10 +78,26 @@ int get_compat_msghdr(struct msghdr *kmsg, return -EMSGSIZE;
kmsg->msg_iocb = NULL; + *ptr = msg.msg_iov; + *len = msg.msg_iovlen; + return 0; +} + +int get_compat_msghdr(struct msghdr *kmsg, + struct compat_msghdr __user *umsg, + struct sockaddr __user **save_addr, + struct iovec **iov) +{ + compat_uptr_t ptr; + compat_size_t len; + ssize_t err; + + err = __get_compat_msghdr(kmsg, umsg, save_addr, &ptr, &len); + if (err) + return err;
- err = compat_import_iovec(save_addr ? READ : WRITE, - compat_ptr(msg.msg_iov), msg.msg_iovlen, - UIO_FASTIOV, iov, &kmsg->msg_iter); + err = compat_import_iovec(save_addr ? READ : WRITE, compat_ptr(ptr), + len, UIO_FASTIOV, iov, &kmsg->msg_iter); return err < 0 ? err : 0; }
diff --git a/net/socket.c b/net/socket.c index 06c544fafa63..50403ebdd8f6 100644 --- a/net/socket.c +++ b/net/socket.c @@ -2018,10 +2018,10 @@ struct used_address { unsigned int name_len; };
-static int copy_msghdr_from_user(struct msghdr *kmsg, - struct user_msghdr __user *umsg, - struct sockaddr __user **save_addr, - struct iovec **iov) +int __copy_msghdr_from_user(struct msghdr *kmsg, + struct user_msghdr __user *umsg, + struct sockaddr __user **save_addr, + struct iovec __user **uiov, size_t *nsegs) { struct user_msghdr msg; ssize_t err; @@ -2063,6 +2063,23 @@ static int copy_msghdr_from_user(struct msghdr *kmsg, return -EMSGSIZE;
kmsg->msg_iocb = NULL; + *uiov = msg.msg_iov; + *nsegs = msg.msg_iovlen; + return 0; +} + +static int copy_msghdr_from_user(struct msghdr *kmsg, + struct user_msghdr __user *umsg, + struct sockaddr __user **save_addr, + struct iovec **iov) +{ + struct user_msghdr msg; + ssize_t err; + + err = __copy_msghdr_from_user(kmsg, umsg, save_addr, &msg.msg_iov, + &msg.msg_iovlen); + if (err) + return err;
err = import_iovec(save_addr ? READ : WRITE, msg.msg_iov, msg.msg_iovlen,
From: Jens Axboe axboe@kernel.dk
mainline inclusion from mainline-5.7-rc1 commit 52de1fe122408d7a62b6cff9ed3895ebb882d71f category: feature bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27 CVE: NA ---------------------------
Like IORING_OP_READV, this is limited to supporting just a single segment in the iovec passed in.
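As a rough userspace sketch of what this enables (assuming liburing; sockfd and GROUP_ID are placeholders, with the group previously filled via IORING_OP_PROVIDE_BUFFERS):

	struct io_uring_sqe *sqe = io_uring_get_sqe(&ring);
	struct msghdr msg = { 0 };
	struct iovec iov = { 0 };	/* base/len are filled from the selected buffer */

	msg.msg_iov = &iov;
	msg.msg_iovlen = 1;		/* more than one segment gets -EINVAL */

	io_uring_prep_recvmsg(sqe, sockfd, &msg, 0);
	sqe->flags |= IOSQE_BUFFER_SELECT;
	sqe->buf_group = GROUP_ID;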
Signed-off-by: Jens Axboe axboe@kernel.dk
Modified: include/net/compat.h [move __get_compat_msghdr inside CONFIG_COMPAT] Signed-off-by: yangerkun yangerkun@huawei.com Reviewed-by: zhangyi (F) yi.zhang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- fs/io_uring.c | 118 ++++++++++++++++++++++++++++++++++++++----- include/net/compat.h | 6 +-- 2 files changed, 109 insertions(+), 15 deletions(-)
diff --git a/fs/io_uring.c b/fs/io_uring.c index a1111cc25bac..97ddc9fbf625 100644 --- a/fs/io_uring.c +++ b/fs/io_uring.c @@ -44,6 +44,7 @@ #include <linux/errno.h> #include <linux/syscalls.h> #include <linux/compat.h> +#include <net/compat.h> #include <linux/refcount.h> #include <linux/uio.h> #include <linux/bits.h> @@ -732,6 +733,7 @@ static const struct io_op_def io_op_defs[] = { .unbound_nonreg_file = 1, .needs_fs = 1, .pollin = 1, + .buffer_select = 1, }, [IORING_OP_TIMEOUT] = { .async_ctx = 1, @@ -3520,6 +3522,92 @@ static int io_send(struct io_kiocb *req, bool force_nonblock) #endif }
+static int __io_recvmsg_copy_hdr(struct io_kiocb *req, struct io_async_ctx *io) +{ + struct io_sr_msg *sr = &req->sr_msg; + struct iovec __user *uiov; + size_t iov_len; + int ret; + + ret = __copy_msghdr_from_user(&io->msg.msg, sr->msg, &io->msg.uaddr, + &uiov, &iov_len); + if (ret) + return ret; + + if (req->flags & REQ_F_BUFFER_SELECT) { + if (iov_len > 1) + return -EINVAL; + if (copy_from_user(io->msg.iov, uiov, sizeof(*uiov))) + return -EFAULT; + sr->len = io->msg.iov[0].iov_len; + iov_iter_init(&io->msg.msg.msg_iter, READ, io->msg.iov, 1, + sr->len); + io->msg.iov = NULL; + } else { + ret = import_iovec(READ, uiov, iov_len, UIO_FASTIOV, + &io->msg.iov, &io->msg.msg.msg_iter); + if (ret > 0) + ret = 0; + } + + return ret; +} + +#ifdef CONFIG_COMPAT +static int __io_compat_recvmsg_copy_hdr(struct io_kiocb *req, + struct io_async_ctx *io) +{ + struct compat_msghdr __user *msg_compat; + struct io_sr_msg *sr = &req->sr_msg; + struct compat_iovec __user *uiov; + compat_uptr_t ptr; + compat_size_t len; + int ret; + + msg_compat = (struct compat_msghdr __user *) sr->msg; + ret = __get_compat_msghdr(&io->msg.msg, msg_compat, &io->msg.uaddr, + &ptr, &len); + if (ret) + return ret; + + uiov = compat_ptr(ptr); + if (req->flags & REQ_F_BUFFER_SELECT) { + compat_ssize_t clen; + + if (len > 1) + return -EINVAL; + if (!access_ok(uiov, sizeof(*uiov))) + return -EFAULT; + if (__get_user(clen, &uiov->iov_len)) + return -EFAULT; + if (clen < 0) + return -EINVAL; + sr->len = io->msg.iov[0].iov_len; + io->msg.iov = NULL; + } else { + ret = compat_import_iovec(READ, uiov, len, UIO_FASTIOV, + &io->msg.iov, + &io->msg.msg.msg_iter); + if (ret < 0) + return ret; + } + + return 0; +} +#endif + +static int io_recvmsg_copy_hdr(struct io_kiocb *req, struct io_async_ctx *io) +{ + io->msg.iov = io->msg.fast_iov; + +#ifdef CONFIG_COMPAT + if (req->ctx->compat) + return __io_compat_recvmsg_copy_hdr(req, io); +#endif + + return __io_recvmsg_copy_hdr(req, io); +} + static struct io_buffer *io_recv_buffer_select(struct io_kiocb *req, int *cflags, bool needs_lock) { @@ -3565,9 +3653,7 @@ static int io_recvmsg_prep(struct io_kiocb *req, if (req->flags & REQ_F_NEED_CLEANUP) return 0;
- io->msg.iov = io->msg.fast_iov; - ret = recvmsg_copy_msghdr(&io->msg.msg, sr->msg, sr->msg_flags, - &io->msg.uaddr, &io->msg.iov); + ret = io_recvmsg_copy_hdr(req, io); if (!ret) req->flags |= REQ_F_NEED_CLEANUP; return ret; @@ -3581,13 +3667,14 @@ static int io_recvmsg(struct io_kiocb *req, bool force_nonblock) #if defined(CONFIG_NET) struct io_async_msghdr *kmsg = NULL; struct socket *sock; - int ret; + int ret, cflags = 0;
if (unlikely(req->ctx->flags & IORING_SETUP_IOPOLL)) return -EINVAL;
sock = sock_from_file(req->file, &ret); if (sock) { + struct io_buffer *kbuf; struct io_async_ctx io; unsigned flags;
@@ -3599,19 +3686,23 @@ static int io_recvmsg(struct io_kiocb *req, bool force_nonblock) kmsg->iov = kmsg->fast_iov; kmsg->msg.msg_iter.iov = kmsg->iov; } else { - struct io_sr_msg *sr = &req->sr_msg; - kmsg = &io.msg; kmsg->msg.msg_name = &io.msg.addr;
- io.msg.iov = io.msg.fast_iov; - ret = recvmsg_copy_msghdr(&io.msg.msg, sr->msg, - sr->msg_flags, &io.msg.uaddr, - &io.msg.iov); + ret = io_recvmsg_copy_hdr(req, &io); if (ret) return ret; }
+ kbuf = io_recv_buffer_select(req, &cflags, !force_nonblock); + if (IS_ERR(kbuf)) { + return PTR_ERR(kbuf); + } else if (kbuf) { + kmsg->fast_iov[0].iov_base = u64_to_user_ptr(kbuf->addr); + iov_iter_init(&kmsg->msg.msg_iter, READ, kmsg->iov, + 1, req->sr_msg.len); + } + flags = req->sr_msg.msg_flags; if (flags & MSG_DONTWAIT) req->flags |= REQ_F_NOWAIT; @@ -3629,7 +3720,7 @@ static int io_recvmsg(struct io_kiocb *req, bool force_nonblock) if (kmsg && kmsg->iov != kmsg->fast_iov) kfree(kmsg->iov); req->flags &= ~REQ_F_NEED_CLEANUP; - io_cqring_add_event(req, ret); + __io_cqring_add_event(req, ret, cflags); if (ret < 0) req_set_fail_links(req); io_put_req(req); @@ -4742,8 +4833,11 @@ static void io_cleanup_req(struct io_kiocb *req) if (io->rw.iov != io->rw.fast_iov) kfree(io->rw.iov); break; - case IORING_OP_SENDMSG: case IORING_OP_RECVMSG: + if (req->flags & REQ_F_BUFFER_SELECTED) + kfree(req->sr_msg.kbuf); + /* fallthrough */ + case IORING_OP_SENDMSG: if (io->msg.iov != io->msg.fast_iov) kfree(io->msg.iov); break; diff --git a/include/net/compat.h b/include/net/compat.h index 2f861518cc89..5db8429b5947 100644 --- a/include/net/compat.h +++ b/include/net/compat.h @@ -32,6 +32,9 @@ struct compat_cmsghdr {
int compat_sock_get_timestamp(struct sock *, struct timeval __user *); int compat_sock_get_timestampns(struct sock *, struct timespec __user *); +int __get_compat_msghdr(struct msghdr *kmsg, struct compat_msghdr __user *umsg, + struct sockaddr __user **save_addr, compat_uptr_t *ptr, + compat_size_t *len);
#else /* defined(CONFIG_COMPAT) */ /* @@ -41,9 +44,6 @@ int compat_sock_get_timestampns(struct sock *, struct timespec __user *); #define compat_mmsghdr mmsghdr #endif /* defined(CONFIG_COMPAT) */
-int __get_compat_msghdr(struct msghdr *kmsg, struct compat_msghdr __user *umsg, - struct sockaddr __user **save_addr, compat_uptr_t *ptr, - compat_size_t *len); int get_compat_msghdr(struct msghdr *, struct compat_msghdr __user *, struct sockaddr __user **, struct iovec **); struct sock_fprog __user *get_compat_bpf_fprog(char __user *optval);
From: YueHaibing yuehaibing@huawei.com
mainline inclusion from mainline-5.7-rc1 commit 469956e853ccdba72bb82ad2eea6e8ab6b15791f category: feature bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27 CVE: NA ---------------------------
If CONFIG_NET is not set, gcc warns:
fs/io_uring.c:3110:12: warning: io_setup_async_msg defined but not used [-Wunused-function]
 static int io_setup_async_msg(struct io_kiocb *req,
            ^~~~~~~~~~~~~~~~~~
There are many functions wrapped by CONFIG_NET; move them together to simplify the code, and also fix this warning.
Reported-by: Hulk Robot hulkci@huawei.com Signed-off-by: YueHaibing yuehaibing@huawei.com
Minor tweaks.
Signed-off-by: Jens Axboe axboe@kernel.dk Signed-off-by: yangerkun yangerkun@huawei.com Reviewed-by: zhangyi (F) yi.zhang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- fs/io_uring.c | 94 ++++++++++++++++++++++++++++----------------------- 1 file changed, 52 insertions(+), 42 deletions(-)
diff --git a/fs/io_uring.c b/fs/io_uring.c index f32a430b2729..68b20cfe855e 100644 --- a/fs/io_uring.c +++ b/fs/io_uring.c @@ -3428,6 +3428,7 @@ static int io_sync_file_range(struct io_kiocb *req, bool force_nonblock) return 0; }
+#if defined(CONFIG_NET) static int io_setup_async_msg(struct io_kiocb *req, struct io_async_msghdr *kmsg) { @@ -3445,7 +3446,6 @@ static int io_setup_async_msg(struct io_kiocb *req,
static int io_sendmsg_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe) { -#if defined(CONFIG_NET) struct io_sr_msg *sr = &req->sr_msg; struct io_async_ctx *io = req->io; int ret; @@ -3471,14 +3471,10 @@ static int io_sendmsg_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe) if (!ret) req->flags |= REQ_F_NEED_CLEANUP; return ret; -#else - return -EOPNOTSUPP; -#endif }
static int io_sendmsg(struct io_kiocb *req, bool force_nonblock) { -#if defined(CONFIG_NET) struct io_async_msghdr *kmsg = NULL; struct socket *sock; int ret; @@ -3532,14 +3528,10 @@ static int io_sendmsg(struct io_kiocb *req, bool force_nonblock) req_set_fail_links(req); io_put_req(req); return 0; -#else - return -EOPNOTSUPP; -#endif }
static int io_send(struct io_kiocb *req, bool force_nonblock) { -#if defined(CONFIG_NET) struct socket *sock; int ret;
@@ -3582,9 +3574,6 @@ static int io_send(struct io_kiocb *req, bool force_nonblock) req_set_fail_links(req); io_put_req(req); return 0; -#else - return -EOPNOTSUPP; -#endif }
static int __io_recvmsg_copy_hdr(struct io_kiocb *req, struct io_async_ctx *io) @@ -3697,7 +3686,6 @@ static struct io_buffer *io_recv_buffer_select(struct io_kiocb *req, static int io_recvmsg_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe) { -#if defined(CONFIG_NET) struct io_sr_msg *sr = &req->sr_msg; struct io_async_ctx *io = req->io; int ret; @@ -3722,14 +3710,10 @@ static int io_recvmsg_prep(struct io_kiocb *req, if (!ret) req->flags |= REQ_F_NEED_CLEANUP; return ret; -#else - return -EOPNOTSUPP; -#endif }
static int io_recvmsg(struct io_kiocb *req, bool force_nonblock) { -#if defined(CONFIG_NET) struct io_async_msghdr *kmsg = NULL; struct socket *sock; int ret, cflags = 0; @@ -3790,14 +3774,10 @@ static int io_recvmsg(struct io_kiocb *req, bool force_nonblock) req_set_fail_links(req); io_put_req(req); return 0; -#else - return -EOPNOTSUPP; -#endif }
static int io_recv(struct io_kiocb *req, bool force_nonblock) { -#if defined(CONFIG_NET) struct io_buffer *kbuf = NULL; struct socket *sock; int ret, cflags = 0; @@ -3854,15 +3834,10 @@ static int io_recv(struct io_kiocb *req, bool force_nonblock) req_set_fail_links(req); io_put_req(req); return 0; -#else - return -EOPNOTSUPP; -#endif }
- static int io_accept_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe) { -#if defined(CONFIG_NET) struct io_accept *accept = &req->accept;
if (unlikely(req->ctx->flags & (IORING_SETUP_IOPOLL|IORING_SETUP_SQPOLL))) @@ -3875,12 +3850,8 @@ static int io_accept_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe) accept->flags = READ_ONCE(sqe->accept_flags); accept->nofile = rlimit(RLIMIT_NOFILE); return 0; -#else - return -EOPNOTSUPP; -#endif }
-#if defined(CONFIG_NET) static int __io_accept(struct io_kiocb *req, bool force_nonblock) { struct io_accept *accept = &req->accept; @@ -3911,11 +3882,9 @@ static void io_accept_finish(struct io_wq_work **workptr) __io_accept(req, false); io_steal_work(req, workptr); } -#endif
static int io_accept(struct io_kiocb *req, bool force_nonblock) { -#if defined(CONFIG_NET) int ret;
ret = __io_accept(req, force_nonblock); @@ -3924,14 +3893,10 @@ static int io_accept(struct io_kiocb *req, bool force_nonblock) return -EAGAIN; } return 0; -#else - return -EOPNOTSUPP; -#endif }
static int io_connect_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe) { -#if defined(CONFIG_NET) struct io_connect *conn = &req->connect; struct io_async_ctx *io = req->io;
@@ -3948,14 +3913,10 @@ static int io_connect_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe)
return move_addr_to_kernel(conn->addr, conn->addr_len, &io->connect.address); -#else - return -EOPNOTSUPP; -#endif }
static int io_connect(struct io_kiocb *req, bool force_nonblock) { -#if defined(CONFIG_NET) struct io_async_ctx __io, *io; unsigned file_flags; int ret; @@ -3993,10 +3954,59 @@ static int io_connect(struct io_kiocb *req, bool force_nonblock) io_cqring_add_event(req, ret); io_put_req(req); return 0; -#else +} +#else /* !CONFIG_NET */ +static int io_sendmsg_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe) +{ + return -EOPNOTSUPP; +} + +static int io_sendmsg(struct io_kiocb *req, bool force_nonblock) +{ + return -EOPNOTSUPP; +} + +static int io_send(struct io_kiocb *req, bool force_nonblock) +{ + return -EOPNOTSUPP; +} + +static int io_recvmsg_prep(struct io_kiocb *req, + const struct io_uring_sqe *sqe) +{ + return -EOPNOTSUPP; +} + +static int io_recvmsg(struct io_kiocb *req, bool force_nonblock) +{ + return -EOPNOTSUPP; +} + +static int io_recv(struct io_kiocb *req, bool force_nonblock) +{ + return -EOPNOTSUPP; +} + +static int io_accept_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe) +{ + return -EOPNOTSUPP; +} + +static int io_accept(struct io_kiocb *req, bool force_nonblock) +{ + return -EOPNOTSUPP; +} + +static int io_connect_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe) +{ + return -EOPNOTSUPP; +} + +static int io_connect(struct io_kiocb *req, bool force_nonblock) +{ return -EOPNOTSUPP; -#endif } +#endif /* CONFIG_NET */
struct io_poll_table { struct poll_table_struct pt;
From: Xiaoguang Wang xiaoguang.wang@linux.alibaba.com
mainline inclusion from mainline-5.7-rc1 commit 32b2244a840a90ea94ba42392de5c48d53f521f5 category: feature bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27 CVE: NA
When SETUP_IOPOLL and SETUP_SQPOLL are both enabled, applications don't need to do io completion events polling again, they can rely on io_sq_thread to do polling work, which can reduce cpu usage and uring_lock contention.
I modified the fio io_uring engine code a bit to evaluate the performance:

static int fio_ioring_getevents(struct thread_data *td, unsigned int min,
			continue;
		}

-		if (!o->sqpoll_thread) {
+		if (o->sqpoll_thread && o->hipri) {
			r = io_uring_enter(ld, 0, actual_min,
					   IORING_ENTER_GETEVENTS);
			if (r < 0) {
and use "fio -name=fiotest -filename=/dev/nvme0n1 -iodepth=$depth -thread -rw=read -ioengine=io_uring -hipri=1 -sqthread_poll=1 -direct=1 -bs=4k -size=10G -numjobs=1 -time_based -runtime=120"
original codes
--------------------------------------------------------------------
iodepth       | 4        | 8        | 16       | 32       | 64
bw            | 1133MB/s | 1519MB/s | 2090MB/s | 2710MB/s | 3012MB/s
fio cpu usage | 100%     | 100%     | 100%     | 100%     | 100%
--------------------------------------------------------------------
with patch
--------------------------------------------------------------------
iodepth       | 4        | 8        | 16       | 32       | 64
bw            | 1196MB/s | 1721MB/s | 2351MB/s | 2977MB/s | 3357MB/s
fio cpu usage | 63.8%    | 74.4%    | 81.1%    | 83.7%    | 82.4%
--------------------------------------------------------------------
bw improve    | 5.5%     | 13.2%    | 12.3%    | 9.8%     | 11.5%
--------------------------------------------------------------------
From the above test results, we can see that bandwidth improves by roughly 5.5%~13%, and the fio process's cpu usage also drops considerably. Note this won't improve io_sq_thread's cpu usage when SETUP_IOPOLL|SETUP_SQPOLL are both enabled; in that case io_sq_thread always has 100% cpu usage. I think this patch will be friendly to applications which often use io_uring_wait_cqe() or similar from liburing.
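A minimal userspace sketch of this mode (raw syscalls, since libc has no wrappers; QD and want are placeholders):

	struct io_uring_params p = { 0 };
	int ring_fd;

	p.flags = IORING_SETUP_IOPOLL | IORING_SETUP_SQPOLL;
	p.sq_thread_idle = 2000;	/* ms before io_sq_thread goes idle */
	ring_fd = syscall(__NR_io_uring_setup, QD, &p);

	/* fill SQEs; with SQPOLL, io_sq_thread picks them up by itself.
	 * After this patch, waiting for completions is a plain GETEVENTS
	 * wait -- io_sq_thread does the IOPOLL reaping on our behalf.
	 */
	syscall(__NR_io_uring_enter, ring_fd, 0, want,
		IORING_ENTER_GETEVENTS, NULL, 0);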
Signed-off-by: Xiaoguang Wang xiaoguang.wang@linux.alibaba.com Signed-off-by: Jens Axboe axboe@kernel.dk Signed-off-by: yangerkun yangerkun@huawei.com Reviewed-by: zhangyi (F) yi.zhang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- fs/io_uring.c | 11 ++++++++++- 1 file changed, 10 insertions(+), 1 deletion(-)
diff --git a/fs/io_uring.c b/fs/io_uring.c index 68b20cfe855e..46fd2f417edf 100644 --- a/fs/io_uring.c +++ b/fs/io_uring.c @@ -1729,6 +1729,8 @@ static void io_iopoll_complete(struct io_ring_ctx *ctx, unsigned int *nr_events, }
io_commit_cqring(ctx); + if (ctx->flags & IORING_SETUP_SQPOLL) + io_cqring_ev_posted(ctx); io_free_req_many(ctx, &rb); }
@@ -7375,7 +7377,14 @@ SYSCALL_DEFINE6(io_uring_enter, unsigned int, fd, u32, to_submit,
min_complete = min(min_complete, ctx->cq_entries);
- if (ctx->flags & IORING_SETUP_IOPOLL) { + /* + * When SETUP_IOPOLL and SETUP_SQPOLL are both enabled, user + * space applications don't need to do io completion events + * polling again, they can rely on io_sq_thread to do polling + * work, which can reduce cpu usage and uring_lock contention. + */ + if (ctx->flags & IORING_SETUP_IOPOLL && + !(ctx->flags & IORING_SETUP_SQPOLL)) { ret = io_iopoll_check(ctx, &nr_events, min_complete); } else { ret = io_cqring_wait(ctx, min_complete, sig, sigsz);
From: Jens Axboe axboe@kernel.dk
mainline inclusion from mainline-5.7-rc1 commit 3f9d64415fdaa73017fcb168930006648617b488 category: feature bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27 CVE: NA ---------------------------
Ensure we keep the truncated value, if we did truncate it. If not, we might read/write more than the registered buffer size.
Also for retry, ensure that we return the truncated mapped value for the vectorized versions of the read/write commands.
Signed-off-by: Jens Axboe axboe@kernel.dk Signed-off-by: yangerkun yangerkun@huawei.com Reviewed-by: zhangyi (F) yi.zhang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- fs/io_uring.c | 7 +++++-- 1 file changed, 5 insertions(+), 2 deletions(-)
diff --git a/fs/io_uring.c b/fs/io_uring.c index 46fd2f417edf..6faf90e6d20d 100644 --- a/fs/io_uring.c +++ b/fs/io_uring.c @@ -2357,6 +2357,7 @@ static ssize_t io_import_iovec(int rw, struct io_kiocb *req, *iovec = NULL; return PTR_ERR(buf); } + req->rw.len = sqe_len; }
ret = import_single_range(rw, buf, sqe_len, *iovec, iter); @@ -2376,8 +2377,10 @@ static ssize_t io_import_iovec(int rw, struct io_kiocb *req,
if (req->flags & REQ_F_BUFFER_SELECT) { ret = io_iov_buffer_select(req, *iovec, needs_lock); - if (!ret) - iov_iter_init(iter, rw, *iovec, 1, (*iovec)->iov_len); + if (!ret) { + ret = (*iovec)->iov_len; + iov_iter_init(iter, rw, *iovec, 1, ret); + } *iovec = NULL; return ret; }
From: Pavel Begunkov asml.silence@gmail.com
mainline inclusion from mainline-5.7-rc1 commit 2293b4195800f88de2c454a24b25874be56d87f3 category: feature bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27 CVE: NA ---------------------------
Deduplicate the cancellation parts, as many of them look the same, e.g.

- io_wqe_cancel_cb_work() and io_wqe_cancel_work()
- io_wq_worker_cancel() and io_work_cancel()
Signed-off-by: Pavel Begunkov asml.silence@gmail.com Signed-off-by: Jens Axboe axboe@kernel.dk Signed-off-by: yangerkun yangerkun@huawei.com Reviewed-by: zhangyi (F) yi.zhang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- fs/io-wq.c | 136 ++++++++++------------------------------------------- 1 file changed, 24 insertions(+), 112 deletions(-)
diff --git a/fs/io-wq.c b/fs/io-wq.c index 73c5bb244730..d2fb0796eaf9 100644 --- a/fs/io-wq.c +++ b/fs/io-wq.c @@ -856,14 +856,13 @@ void io_wq_cancel_all(struct io_wq *wq) }
struct io_cb_cancel_data { - struct io_wqe *wqe; - work_cancel_fn *cancel; - void *caller_data; + work_cancel_fn *fn; + void *data; };
-static bool io_work_cancel(struct io_worker *worker, void *cancel_data) +static bool io_wq_worker_cancel(struct io_worker *worker, void *data) { - struct io_cb_cancel_data *data = cancel_data; + struct io_cb_cancel_data *match = data; unsigned long flags; bool ret = false;
@@ -874,83 +873,7 @@ static bool io_work_cancel(struct io_worker *worker, void *cancel_data) spin_lock_irqsave(&worker->lock, flags); if (worker->cur_work && !(worker->cur_work->flags & IO_WQ_WORK_NO_CANCEL) && - data->cancel(worker->cur_work, data->caller_data)) { - send_sig(SIGINT, worker->task, 1); - ret = true; - } - spin_unlock_irqrestore(&worker->lock, flags); - - return ret; -} - -static enum io_wq_cancel io_wqe_cancel_cb_work(struct io_wqe *wqe, - work_cancel_fn *cancel, - void *cancel_data) -{ - struct io_cb_cancel_data data = { - .wqe = wqe, - .cancel = cancel, - .caller_data = cancel_data, - }; - struct io_wq_work_node *node, *prev; - struct io_wq_work *work; - unsigned long flags; - bool found = false; - - spin_lock_irqsave(&wqe->lock, flags); - wq_list_for_each(node, prev, &wqe->work_list) { - work = container_of(node, struct io_wq_work, list); - - if (cancel(work, cancel_data)) { - wq_node_del(&wqe->work_list, node, prev); - found = true; - break; - } - } - spin_unlock_irqrestore(&wqe->lock, flags); - - if (found) { - io_run_cancel(work, wqe); - return IO_WQ_CANCEL_OK; - } - - rcu_read_lock(); - found = io_wq_for_each_worker(wqe, io_work_cancel, &data); - rcu_read_unlock(); - return found ? IO_WQ_CANCEL_RUNNING : IO_WQ_CANCEL_NOTFOUND; -} - -enum io_wq_cancel io_wq_cancel_cb(struct io_wq *wq, work_cancel_fn *cancel, - void *data) -{ - enum io_wq_cancel ret = IO_WQ_CANCEL_NOTFOUND; - int node; - - for_each_node(node) { - struct io_wqe *wqe = wq->wqes[node]; - - ret = io_wqe_cancel_cb_work(wqe, cancel, data); - if (ret != IO_WQ_CANCEL_NOTFOUND) - break; - } - - return ret; -} - -struct work_match { - bool (*fn)(struct io_wq_work *, void *data); - void *data; -}; - -static bool io_wq_worker_cancel(struct io_worker *worker, void *data) -{ - struct work_match *match = data; - unsigned long flags; - bool ret = false; - - spin_lock_irqsave(&worker->lock, flags); - if (match->fn(worker->cur_work, match->data) && - !(worker->cur_work->flags & IO_WQ_WORK_NO_CANCEL)) { + match->fn(worker->cur_work, match->data)) { send_sig(SIGINT, worker->task, 1); ret = true; } @@ -960,7 +883,7 @@ static bool io_wq_worker_cancel(struct io_worker *worker, void *data) }
static enum io_wq_cancel io_wqe_cancel_work(struct io_wqe *wqe, - struct work_match *match) + struct io_cb_cancel_data *match) { struct io_wq_work_node *node, *prev; struct io_wq_work *work; @@ -1001,22 +924,16 @@ static enum io_wq_cancel io_wqe_cancel_work(struct io_wqe *wqe, return found ? IO_WQ_CANCEL_RUNNING : IO_WQ_CANCEL_NOTFOUND; }
-static bool io_wq_work_match(struct io_wq_work *work, void *data) -{ - return work == data; -} - -enum io_wq_cancel io_wq_cancel_work(struct io_wq *wq, struct io_wq_work *cwork) +enum io_wq_cancel io_wq_cancel_cb(struct io_wq *wq, work_cancel_fn *cancel, + void *data) { - struct work_match match = { - .fn = io_wq_work_match, - .data = cwork + struct io_cb_cancel_data match = { + .fn = cancel, + .data = data, }; enum io_wq_cancel ret = IO_WQ_CANCEL_NOTFOUND; int node;
- cwork->flags |= IO_WQ_WORK_CANCEL; - for_each_node(node) { struct io_wqe *wqe = wq->wqes[node];
@@ -1028,33 +945,28 @@ enum io_wq_cancel io_wq_cancel_work(struct io_wq *wq, struct io_wq_work *cwork) return ret; }
+static bool io_wq_io_cb_cancel_data(struct io_wq_work *work, void *data) +{ + return work == data; +} + +enum io_wq_cancel io_wq_cancel_work(struct io_wq *wq, struct io_wq_work *cwork) +{ + return io_wq_cancel_cb(wq, io_wq_io_cb_cancel_data, (void *)cwork); +} + static bool io_wq_pid_match(struct io_wq_work *work, void *data) { pid_t pid = (pid_t) (unsigned long) data;
- if (work) - return work->task_pid == pid; - return false; + return work->task_pid == pid; }
enum io_wq_cancel io_wq_cancel_pid(struct io_wq *wq, pid_t pid) { - struct work_match match = { - .fn = io_wq_pid_match, - .data = (void *) (unsigned long) pid - }; - enum io_wq_cancel ret = IO_WQ_CANCEL_NOTFOUND; - int node; - - for_each_node(node) { - struct io_wqe *wqe = wq->wqes[node]; + void *data = (void *) (unsigned long) pid;
- ret = io_wqe_cancel_work(wqe, &match); - if (ret != IO_WQ_CANCEL_NOTFOUND) - break; - } - - return ret; + return io_wq_cancel_cb(wq, io_wq_pid_match, data); }
struct io_wq *io_wq_create(unsigned bounded, struct io_wq_data *data)
From: Pavel Begunkov asml.silence@gmail.com
mainline inclusion from mainline-5.7-rc1 commit 8766dd516c535abf04491dca674d0ef6c95d814f category: feature bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27 CVE: NA ---------------------------
It's a preparation patch removing io_wq_enqueue_hashed(); its job should now be done by io_wq_hash_work() + io_wq_enqueue().
Also, set the hash value for dependent works, and do it as late as possible, because req->file can be unavailable before that point. This hash will be ignored by io-wq.
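At a call site the conversion boils down to (mirroring the diff below):

	/* before: node selection and enqueue hidden in one call */
	io_wq_enqueue_hashed(ctx->io_wq, &req->work, file_inode(req->file));

	/* after: hashing becomes a property of the work item itself */
	io_wq_hash_work(&req->work, file_inode(req->file));
	io_wq_enqueue(ctx->io_wq, &req->work);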
Signed-off-by: Pavel Begunkov asml.silence@gmail.com Signed-off-by: Jens Axboe axboe@kernel.dk Signed-off-by: yangerkun yangerkun@huawei.com Reviewed-by: zhangyi (F) yi.zhang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- fs/io-wq.c | 14 +++++--------- fs/io-wq.h | 7 ++++++- fs/io_uring.c | 24 ++++++++++-------------- 3 files changed, 21 insertions(+), 24 deletions(-)
diff --git a/fs/io-wq.c b/fs/io-wq.c index 584db08f0547..c6569e14d847 100644 --- a/fs/io-wq.c +++ b/fs/io-wq.c @@ -386,7 +386,7 @@ static struct io_wq_work *io_get_next_work(struct io_wqe *wqe, unsigned *hash) work = container_of(node, struct io_wq_work, list);
/* not hashed, can run anytime */ - if (!(work->flags & IO_WQ_WORK_HASHED)) { + if (!io_wq_is_hashed(work)) { wq_node_del(&wqe->work_list, node, prev); return work; } @@ -796,19 +796,15 @@ void io_wq_enqueue(struct io_wq *wq, struct io_wq_work *work) }
/* - * Enqueue work, hashed by some key. Work items that hash to the same value - * will not be done in parallel. Used to limit concurrent writes, generally - * hashed by inode. + * Work items that hash to the same value will not be done in parallel. + * Used to limit concurrent writes, generally hashed by inode. */ -void io_wq_enqueue_hashed(struct io_wq *wq, struct io_wq_work *work, void *val) +void io_wq_hash_work(struct io_wq_work *work, void *val) { - struct io_wqe *wqe = wq->wqes[numa_node_id()]; - unsigned bit; - + unsigned int bit;
bit = hash_ptr(val, IO_WQ_HASH_ORDER); work->flags |= (IO_WQ_WORK_HASHED | (bit << IO_WQ_HASH_SHIFT)); - io_wqe_enqueue(wqe, work); }
static bool io_wqe_worker_send_sig(struct io_worker *worker, void *data) diff --git a/fs/io-wq.h b/fs/io-wq.h index 2117b9a4f161..298b21f4a4d2 100644 --- a/fs/io-wq.h +++ b/fs/io-wq.h @@ -94,7 +94,12 @@ bool io_wq_get(struct io_wq *wq, struct io_wq_data *data); void io_wq_destroy(struct io_wq *wq);
void io_wq_enqueue(struct io_wq *wq, struct io_wq_work *work); -void io_wq_enqueue_hashed(struct io_wq *wq, struct io_wq_work *work, void *val); +void io_wq_hash_work(struct io_wq_work *work, void *val); + +static inline bool io_wq_is_hashed(struct io_wq_work *work) +{ + return work->flags & IO_WQ_WORK_HASHED; +}
void io_wq_cancel_all(struct io_wq *wq); enum io_wq_cancel io_wq_cancel_work(struct io_wq *wq, struct io_wq_work *cwork); diff --git a/fs/io_uring.c b/fs/io_uring.c index 6faf90e6d20d..c59250bffc7a 100644 --- a/fs/io_uring.c +++ b/fs/io_uring.c @@ -1036,15 +1036,14 @@ static inline void io_req_work_drop_env(struct io_kiocb *req) } }
-static inline bool io_prep_async_work(struct io_kiocb *req, +static inline void io_prep_async_work(struct io_kiocb *req, struct io_kiocb **link) { const struct io_op_def *def = &io_op_defs[req->opcode]; - bool do_hashed = false;
if (req->flags & REQ_F_ISREG) { if (def->hash_reg_file) - do_hashed = true; + io_wq_hash_work(&req->work, file_inode(req->file)); } else { if (def->unbound_nonreg_file) req->work.flags |= IO_WQ_WORK_UNBOUND; @@ -1053,25 +1052,18 @@ static inline bool io_prep_async_work(struct io_kiocb *req, io_req_work_grab_env(req, def);
*link = io_prep_linked_timeout(req); - return do_hashed; }
static inline void io_queue_async_work(struct io_kiocb *req) { struct io_ring_ctx *ctx = req->ctx; struct io_kiocb *link; - bool do_hashed;
- do_hashed = io_prep_async_work(req, &link); + io_prep_async_work(req, &link);
- trace_io_uring_queue_async_work(ctx, do_hashed, req, &req->work, - req->flags); - if (!do_hashed) { - io_wq_enqueue(ctx->io_wq, &req->work); - } else { - io_wq_enqueue_hashed(ctx->io_wq, &req->work, - file_inode(req->file)); - } + trace_io_uring_queue_async_work(ctx, io_wq_is_hashed(&req->work), req, + &req->work, req->flags); + io_wq_enqueue(ctx->io_wq, &req->work);
if (link) io_queue_linked_timeout(link); @@ -1579,6 +1571,10 @@ static void io_link_work_cb(struct io_wq_work **workptr) static void io_wq_assign_next(struct io_wq_work **workptr, struct io_kiocb *nxt) { struct io_kiocb *link; + const struct io_op_def *def = &io_op_defs[nxt->opcode]; + + if ((nxt->flags & REQ_F_ISREG) && def->hash_reg_file) + io_wq_hash_work(&nxt->work, file_inode(nxt->file));
*workptr = &nxt->work; link = io_prep_linked_timeout(nxt);
From: Pavel Begunkov asml.silence@gmail.com
mainline inclusion from mainline-5.7-rc1 commit 60cf46ae605446feb0c43c472c0fd1af4cd96231 category: feature bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27 CVE: NA ---------------------------
Enable io-wq hashing stuff for dependent works simply by re-enqueueing such requests.
Signed-off-by: Pavel Begunkov asml.silence@gmail.com Signed-off-by: Jens Axboe axboe@kernel.dk Signed-off-by: yangerkun yangerkun@huawei.com Reviewed-by: zhangyi (F) yi.zhang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- fs/io-wq.c | 25 +++++++++++++++++++------ 1 file changed, 19 insertions(+), 6 deletions(-)
diff --git a/fs/io-wq.c b/fs/io-wq.c index c6569e14d847..4f7bdb3fd73c 100644 --- a/fs/io-wq.c +++ b/fs/io-wq.c @@ -376,11 +376,17 @@ static bool __io_worker_idle(struct io_wqe *wqe, struct io_worker *worker) return __io_worker_unuse(wqe, worker); }
-static struct io_wq_work *io_get_next_work(struct io_wqe *wqe, unsigned *hash) +static inline unsigned int io_get_work_hash(struct io_wq_work *work) +{ + return work->flags >> IO_WQ_HASH_SHIFT; +} + +static struct io_wq_work *io_get_next_work(struct io_wqe *wqe) __must_hold(wqe->lock) { struct io_wq_work_node *node, *prev; struct io_wq_work *work; + unsigned int hash;
wq_list_for_each(node, prev, &wqe->work_list) { work = container_of(node, struct io_wq_work, list); @@ -392,9 +398,9 @@ static struct io_wq_work *io_get_next_work(struct io_wqe *wqe, unsigned *hash) }
/* hashed, can run if not already running */ - *hash = work->flags >> IO_WQ_HASH_SHIFT; - if (!(wqe->hash_map & BIT(*hash))) { - wqe->hash_map |= BIT(*hash); + hash = io_get_work_hash(work); + if (!(wqe->hash_map & BIT(hash))) { + wqe->hash_map |= BIT(hash); wq_node_del(&wqe->work_list, node, prev); return work; } @@ -471,15 +477,17 @@ static void io_assign_current_work(struct io_worker *worker, spin_unlock_irq(&worker->lock); }
+static void io_wqe_enqueue(struct io_wqe *wqe, struct io_wq_work *work); + static void io_worker_handle_work(struct io_worker *worker) __releases(wqe->lock) { struct io_wqe *wqe = worker->wqe; struct io_wq *wq = wqe->wq; - unsigned hash = -1U;
do { struct io_wq_work *work; + unsigned int hash; get_next: /* * If we got some work, mark us as busy. If we didn't, but @@ -488,7 +496,7 @@ static void io_worker_handle_work(struct io_worker *worker) * can't make progress, any work completion or insertion will * clear the stalled flag. */ - work = io_get_next_work(wqe, &hash); + work = io_get_next_work(wqe); if (work) __io_worker_busy(wqe, worker, work); else if (!wq_list_empty(&wqe->work_list)) @@ -512,11 +520,16 @@ static void io_worker_handle_work(struct io_worker *worker) work->flags |= IO_WQ_WORK_CANCEL;
old_work = work; + hash = io_get_work_hash(work); work->func(&work); work = (old_work == work) ? NULL : work; io_assign_current_work(worker, work); wq->free_work(old_work);
+ if (work && io_wq_is_hashed(work)) { + io_wqe_enqueue(wqe, work); + work = NULL; + } if (hash != -1U) { spin_lock_irq(&wqe->lock); wqe->hash_map &= ~BIT_ULL(hash);
From: Jens Axboe axboe@kernel.dk
mainline inclusion from mainline-5.7-rc1 commit 4ed734b0d0913e566a9d871e15d24eb240f269f7 category: feature bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27 CVE: NA ---------------------------
With the previous fixes for number of files open checking, I added some debug code to see if we had other spots where we're checking rlimit() against the async io-wq workers. The only one I found was file size checking, which we should also honor.
During write and fallocate prep, store the max file size and override that for the current ask if we're in io-wq worker context.
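The bracketing pattern, sketched from the diff below (fallocate shown; write does the same around the ->write_iter() call):

	/* prep, in the submitting task's context */
	req->fsize = rlimit(RLIMIT_FSIZE);

	/* issue, possibly from an io-wq worker with an unlimited rlimit */
	current->signal->rlim[RLIMIT_FSIZE].rlim_cur = req->fsize;
	ret = vfs_fallocate(req->file, req->sync.mode, req->sync.off,
			    req->sync.len);
	current->signal->rlim[RLIMIT_FSIZE].rlim_cur = RLIM_INFINITY;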
Cc: stable@vger.kernel.org # 5.1+ Signed-off-by: Jens Axboe axboe@kernel.dk Signed-off-by: yangerkun yangerkun@huawei.com Reviewed-by: zhangyi (F) yi.zhang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- fs/io_uring.c | 17 ++++++++++++++++- 1 file changed, 16 insertions(+), 1 deletion(-)
diff --git a/fs/io_uring.c b/fs/io_uring.c index c59250bffc7a..9141aa266007 100644 --- a/fs/io_uring.c +++ b/fs/io_uring.c @@ -607,7 +607,10 @@ struct io_kiocb { struct list_head list; unsigned int flags; refcount_t refs; - struct task_struct *task; + union { + struct task_struct *task; + unsigned long fsize; + }; u64 user_data; u32 result; u32 sequence; @@ -2590,6 +2593,8 @@ static int io_write_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe, if (unlikely(!(req->file->f_mode & FMODE_WRITE))) return -EBADF;
+ req->fsize = rlimit(RLIMIT_FSIZE); + /* either don't need iovec imported or already have it */ if (!req->io || req->flags & REQ_F_NEED_CLEANUP) return 0; @@ -2659,10 +2664,17 @@ static int io_write(struct io_kiocb *req, bool force_nonblock) } kiocb->ki_flags |= IOCB_WRITE;
+ if (!force_nonblock) + current->signal->rlim[RLIMIT_FSIZE].rlim_cur = req->fsize; + if (req->file->f_op->write_iter) ret2 = call_write_iter(req->file, kiocb, &iter); else ret2 = loop_rw_iter(WRITE, req->file, kiocb, &iter); + + if (!force_nonblock) + current->signal->rlim[RLIMIT_FSIZE].rlim_cur = RLIM_INFINITY; + /* * Raw bdev writes will -EOPNOTSUPP for IOCB_NOWAIT. Just * retry them without IOCB_NOWAIT. @@ -2845,8 +2857,10 @@ static void __io_fallocate(struct io_kiocb *req) { int ret;
+ current->signal->rlim[RLIMIT_FSIZE].rlim_cur = req->fsize; ret = vfs_fallocate(req->file, req->sync.mode, req->sync.off, req->sync.len); + current->signal->rlim[RLIMIT_FSIZE].rlim_cur = RLIM_INFINITY; if (ret < 0) req_set_fail_links(req); io_cqring_add_event(req, ret); @@ -2872,6 +2886,7 @@ static int io_fallocate_prep(struct io_kiocb *req, req->sync.off = READ_ONCE(sqe->off); req->sync.len = READ_ONCE(sqe->addr); req->sync.mode = READ_ONCE(sqe->len); + req->fsize = rlimit(RLIMIT_FSIZE); return 0; }
From: Lukas Bulwahn lukas.bulwahn@gmail.com
mainline inclusion from mainline-5.7-rc1 commit 9f5834c868e901b00f1bfe4d0052b5906b4a2b7f category: feature bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27 CVE: NA ---------------------------
Commit bbbdeb4720a0 ("io_uring: dual license io_uring.h uapi header") uses a nested SPDX-License-Identifier to dual license the header.
Since then, ./scripts/spdxcheck.py complains:
include/uapi/linux/io_uring.h: 1:60 Missing parentheses: OR
Add parentheses to make spdxcheck.py happy.
Signed-off-by: Lukas Bulwahn lukas.bulwahn@gmail.com Signed-off-by: Jens Axboe axboe@kernel.dk Signed-off-by: yangerkun yangerkun@huawei.com Reviewed-by: zhangyi (F) yi.zhang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- include/uapi/linux/io_uring.h | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/include/uapi/linux/io_uring.h b/include/uapi/linux/io_uring.h index ed90e7f75f15..6e35b534c4b8 100644 --- a/include/uapi/linux/io_uring.h +++ b/include/uapi/linux/io_uring.h @@ -1,4 +1,4 @@ -/* SPDX-License-Identifier: GPL-2.0 WITH Linux-syscall-note OR MIT */ +/* SPDX-License-Identifier: (GPL-2.0 WITH Linux-syscall-note) OR MIT */ /* * Header file for the io_uring interface. *
From: Pavel Begunkov asml.silence@gmail.com
mainline inclusion from mainline-5.7-rc1 commit f2cf11492b8b30d89b2fbf525c9ea5e8c4ccc842 category: feature bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27 CVE: NA ---------------------------
After io_assign_current_work() of a linked work, it may be decided to offload it to another thread via io_wqe_enqueue(). However, until the next io_assign_current_work() it can be cancelled, which isn't handled.

Don't assign it if it's not going to be executed.
Fixes: 60cf46ae6054 ("io-wq: hash dependent work") Signed-off-by: Pavel Begunkov asml.silence@gmail.com Signed-off-by: Jens Axboe axboe@kernel.dk Signed-off-by: yangerkun yangerkun@huawei.com Reviewed-by: zhangyi (F) yi.zhang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- fs/io-wq.c | 10 +++++++--- 1 file changed, 7 insertions(+), 3 deletions(-)
diff --git a/fs/io-wq.c b/fs/io-wq.c index 4f7bdb3fd73c..db03fe55179a 100644 --- a/fs/io-wq.c +++ b/fs/io-wq.c @@ -486,7 +486,7 @@ static void io_worker_handle_work(struct io_worker *worker) struct io_wq *wq = wqe->wq;
do { - struct io_wq_work *work; + struct io_wq_work *work, *assign_work; unsigned int hash; get_next: /* @@ -523,10 +523,14 @@ static void io_worker_handle_work(struct io_worker *worker) hash = io_get_work_hash(work); work->func(&work); work = (old_work == work) ? NULL : work; - io_assign_current_work(worker, work); + + assign_work = work; + if (work && io_wq_is_hashed(work)) + assign_work = NULL; + io_assign_current_work(worker, assign_work); wq->free_work(old_work);
- if (work && io_wq_is_hashed(work)) { + if (work && !assign_work) { io_wqe_enqueue(wqe, work); work = NULL; }
From: Pavel Begunkov asml.silence@gmail.com
mainline inclusion from mainline-5.7-rc1 commit 18a542ff19ad149fac9e5a36a4012e3cac7b3b3b category: feature bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27 CVE: NA ---------------------------
work->data and work->list are shared in a union. io_wq_assign_next() sets ->data if a req has a linked_timeout, but io-wq may then want to use work->list, e.g. to re-enqueue a request, corrupting ->data.

->data is not necessary; just remove it and extract the linked_timeout through @link_list.
Fixes: 60cf46ae6054 ("io-wq: hash dependent work") Signed-off-by: Pavel Begunkov asml.silence@gmail.com Signed-off-by: Jens Axboe axboe@kernel.dk Signed-off-by: yangerkun yangerkun@huawei.com Reviewed-by: zhangyi (F) yi.zhang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- fs/io-wq.h | 5 +---- fs/io_uring.c | 9 ++++----- 2 files changed, 5 insertions(+), 9 deletions(-)
diff --git a/fs/io-wq.h b/fs/io-wq.h index 298b21f4a4d2..d2a5684bf673 100644 --- a/fs/io-wq.h +++ b/fs/io-wq.h @@ -63,10 +63,7 @@ static inline void wq_node_del(struct io_wq_work_list *list, } while (0)
struct io_wq_work { - union { - struct io_wq_work_node list; - void *data; - }; + struct io_wq_work_node list; void (*func)(struct io_wq_work **); struct files_struct *files; struct mm_struct *mm; diff --git a/fs/io_uring.c b/fs/io_uring.c index 9141aa266007..846632fbdc7c 100644 --- a/fs/io_uring.c +++ b/fs/io_uring.c @@ -1564,9 +1564,10 @@ static void io_free_req(struct io_kiocb *req)
static void io_link_work_cb(struct io_wq_work **workptr) { - struct io_wq_work *work = *workptr; - struct io_kiocb *link = work->data; + struct io_kiocb *req = container_of(*workptr, struct io_kiocb, work); + struct io_kiocb *link;
+ link = list_first_entry(&req->link_list, struct io_kiocb, link_list); io_queue_linked_timeout(link); io_wq_submit_work(workptr); } @@ -1581,10 +1582,8 @@ static void io_wq_assign_next(struct io_wq_work **workptr, struct io_kiocb *nxt)
*workptr = &nxt->work; link = io_prep_linked_timeout(nxt); - if (link) { + if (link) nxt->work.func = io_link_work_cb; - nxt->work.data = link; - } }
/*
From: Hillf Danton hdanton@sina.com
mainline inclusion from mainline-5.7-rc1 commit 4afdb733b1606c6cb86e7833f9335f4870cf7ddd category: feature bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27 CVE: NA ---------------------------
A case of task hung was reported by syzbot,
INFO: task syz-executor975:9880 blocked for more than 143 seconds.
  Not tainted 5.6.0-rc6-syzkaller #0
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
syz-executor975 D27576  9880   9878 0x80004000
Call Trace:
 schedule+0xd0/0x2a0 kernel/sched/core.c:4154
 schedule_timeout+0x6db/0xba0 kernel/time/timer.c:1871
 do_wait_for_common kernel/sched/completion.c:83 [inline]
 __wait_for_common kernel/sched/completion.c:104 [inline]
 wait_for_common kernel/sched/completion.c:115 [inline]
 wait_for_completion+0x26a/0x3c0 kernel/sched/completion.c:136
 io_queue_file_removal+0x1af/0x1e0 fs/io_uring.c:5826
 __io_sqe_files_update.isra.0+0x3a1/0xb00 fs/io_uring.c:5867
 io_sqe_files_update fs/io_uring.c:5918 [inline]
 __io_uring_register+0x377/0x2c00 fs/io_uring.c:7131
 __do_sys_io_uring_register fs/io_uring.c:7202 [inline]
 __se_sys_io_uring_register fs/io_uring.c:7184 [inline]
 __x64_sys_io_uring_register+0x192/0x560 fs/io_uring.c:7184
 do_syscall_64+0xf6/0x7d0 arch/x86/entry/common.c:294
 entry_SYSCALL_64_after_hwframe+0x49/0xbe
and bisect pointed to 05f3fb3c5397 ("io_uring: avoid ring quiesce for fixed file set unregister and update").
It comes down to the ordering: we wait for the work to be done before flushing it, while nobody is likely to wake us up.

We can drop that on-stack completion, as flushing the work is itself the synchronous operation we need, and nothing more is left behind it.
To that end, io_file_put::done is re-used for indicating if it can be freed in the workqueue worker context.
Reported-and-Inspired-by: syzbot syzbot+538d1957ce178382a394@syzkaller.appspotmail.com Signed-off-by: Hillf Danton hdanton@sina.com
Rename ->done to ->free_pfile
Signed-off-by: Jens Axboe axboe@kernel.dk Signed-off-by: yangerkun yangerkun@huawei.com Reviewed-by: zhangyi (F) yi.zhang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- fs/io_uring.c | 13 +++++-------- 1 file changed, 5 insertions(+), 8 deletions(-)
diff --git a/fs/io_uring.c b/fs/io_uring.c index 846632fbdc7c..378c5e3b6ad8 100644 --- a/fs/io_uring.c +++ b/fs/io_uring.c @@ -6315,7 +6315,7 @@ static void io_ring_file_put(struct io_ring_ctx *ctx, struct file *file) struct io_file_put { struct llist_node llist; struct file *file; - struct completion *done; + bool free_pfile; };
static void io_ring_file_ref_flush(struct fixed_file_data *data) @@ -6326,9 +6326,7 @@ static void io_ring_file_ref_flush(struct fixed_file_data *data) while ((node = llist_del_all(&data->put_llist)) != NULL) { llist_for_each_entry_safe(pfile, tmp, node, llist) { io_ring_file_put(data->ctx, pfile->file); - if (pfile->done) - complete(pfile->done); - else + if (pfile->free_pfile) kfree(pfile); } } @@ -6528,7 +6526,6 @@ static bool io_queue_file_removal(struct fixed_file_data *data, struct file *file) { struct io_file_put *pfile, pfile_stack; - DECLARE_COMPLETION_ONSTACK(done);
/* * If we fail allocating the struct we need for doing async reomval @@ -6537,15 +6534,15 @@ static bool io_queue_file_removal(struct fixed_file_data *data, pfile = kzalloc(sizeof(*pfile), GFP_KERNEL); if (!pfile) { pfile = &pfile_stack; - pfile->done = &done; - } + pfile->free_pfile = false; + } else + pfile->free_pfile = true;
pfile->file = file; llist_add(&pfile->llist, &data->put_llist);
if (pfile == &pfile_stack) { percpu_ref_switch_to_atomic(&data->refs, io_atomic_switch); - wait_for_completion(&done); flush_work(&data->ref_work); return false; }
From: Hillf Danton hdanton@sina.com
mainline inclusion from mainline-5.7-rc1 commit a5318d3cdffbecf075928363d7e4becfeddabfcb category: feature bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27 CVE: NA ---------------------------
Synchronous removal of a file is only used in case of a GFP_KERNEL kmalloc failure, at the cost of io_file_put::done and a work flush, while a glitch like that can be handled at the call site without too much pain.

That said, what is proposed is to drop synchronous file removal, and that kink in the neck as well.
Signed-off-by: Hillf Danton hdanton@sina.com Signed-off-by: Jens Axboe axboe@kernel.dk Signed-off-by: yangerkun yangerkun@huawei.com Reviewed-by: zhangyi (F) yi.zhang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- fs/io_uring.c | 34 ++++++++++------------------------ 1 file changed, 10 insertions(+), 24 deletions(-)
diff --git a/fs/io_uring.c b/fs/io_uring.c index 378c5e3b6ad8..cd1fd6908cbd 100644 --- a/fs/io_uring.c +++ b/fs/io_uring.c @@ -6315,7 +6315,6 @@ static void io_ring_file_put(struct io_ring_ctx *ctx, struct file *file) struct io_file_put { struct llist_node llist; struct file *file; - bool free_pfile; };
static void io_ring_file_ref_flush(struct fixed_file_data *data) @@ -6326,8 +6325,7 @@ static void io_ring_file_ref_flush(struct fixed_file_data *data) while ((node = llist_del_all(&data->put_llist)) != NULL) { llist_for_each_entry_safe(pfile, tmp, node, llist) { io_ring_file_put(data->ctx, pfile->file); - if (pfile->free_pfile) - kfree(pfile); + kfree(pfile); } } } @@ -6522,32 +6520,18 @@ static void io_atomic_switch(struct percpu_ref *ref) percpu_ref_get(&data->refs); }
-static bool io_queue_file_removal(struct fixed_file_data *data, +static int io_queue_file_removal(struct fixed_file_data *data, struct file *file) { - struct io_file_put *pfile, pfile_stack; + struct io_file_put *pfile;
- /* - * If we fail allocating the struct we need for doing async reomval - * of this file, just punt to sync and wait for it. - */ pfile = kzalloc(sizeof(*pfile), GFP_KERNEL); - if (!pfile) { - pfile = &pfile_stack; - pfile->free_pfile = false; - } else - pfile->free_pfile = true; + if (!pfile) + return -ENOMEM;
pfile->file = file; llist_add(&pfile->llist, &data->put_llist); - - if (pfile == &pfile_stack) { - percpu_ref_switch_to_atomic(&data->refs, io_atomic_switch); - flush_work(&data->ref_work); - return false; - } - - return true; + return 0; }
static int __io_sqe_files_update(struct io_ring_ctx *ctx, @@ -6582,9 +6566,11 @@ static int __io_sqe_files_update(struct io_ring_ctx *ctx, index = i & IORING_FILE_TABLE_MASK; if (table->files[index]) { file = io_file_from_index(ctx, index); + err = io_queue_file_removal(data, file); + if (err) + break; table->files[index] = NULL; - if (io_queue_file_removal(data, file)) - ref_switch = true; + ref_switch = true; } if (fd != -1) { file = fget(fd);
From: Pavel Begunkov asml.silence@gmail.com
mainline inclusion from mainline-5.7-rc1 commit 86f3cd1b589a10dbdca98c52cc0cd0f56523c9b3 category: feature bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27 CVE: NA ---------------------------
We always punt async buffered writes to an io-wq helper, as the core kernel does not have IOCB_NOWAIT support for that. Most buffered async writes complete very quickly, as it's just a copy operation. This means that doing multiple locking roundtrips on the shared wqe lock for each buffered write is wasteful. Additionally, buffered writes are hashed work items, which means that any buffered write to a given file is serialized.
Keep identically hashed work items contiguous in @wqe->work_list, and track a tail for each hash bucket. On dequeue of a hashed item, splice all items of the same hash out in one go using the tracked tail. Until the batch is done, the caller doesn't have to synchronize with the wqe or worker locks again.
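An illustration of the resulting list layout (A/B stand for two hash values; purely schematic):

	/*
	 * head -> A1 -> A2 -> A3 -> B1 -> B2 -> NULL
	 *                     ^                 ^
	 *                hash_tail[A]      hash_tail[B]
	 *
	 * io_wqe_insert_work() links a new A-item right after hash_tail[A],
	 * keeping the batch contiguous; a worker dequeuing A1 cuts the whole
	 * [A1, hash_tail[A]] span out with one wq_list_cut() and runs it
	 * without retaking the wqe lock per item.
	 */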
Signed-off-by: Pavel Begunkov asml.silence@gmail.com Signed-off-by: Jens Axboe axboe@kernel.dk Signed-off-by: yangerkun yangerkun@huawei.com Reviewed-by: zhangyi (F) yi.zhang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- fs/io-wq.c | 68 ++++++++++++++++++++++++++++++++++++++---------------- fs/io-wq.h | 45 +++++++++++++++++++++++++++++------- 2 files changed, 85 insertions(+), 28 deletions(-)
diff --git a/fs/io-wq.c b/fs/io-wq.c index db03fe55179a..4fd7b31c40a3 100644 --- a/fs/io-wq.c +++ b/fs/io-wq.c @@ -70,6 +70,8 @@ struct io_worker { #define IO_WQ_HASH_ORDER 5 #endif
+#define IO_WQ_NR_HASH_BUCKETS (1u << IO_WQ_HASH_ORDER) + struct io_wqe_acct { unsigned nr_workers; unsigned max_workers; @@ -99,6 +101,7 @@ struct io_wqe { struct list_head all_list;
struct io_wq *wq; + struct io_wq_work *hash_tail[IO_WQ_NR_HASH_BUCKETS]; };
/* @@ -385,7 +388,7 @@ static struct io_wq_work *io_get_next_work(struct io_wqe *wqe) __must_hold(wqe->lock) { struct io_wq_work_node *node, *prev; - struct io_wq_work *work; + struct io_wq_work *work, *tail; unsigned int hash;
wq_list_for_each(node, prev, &wqe->work_list) { @@ -393,7 +396,7 @@ static struct io_wq_work *io_get_next_work(struct io_wqe *wqe)
/* not hashed, can run anytime */ if (!io_wq_is_hashed(work)) { - wq_node_del(&wqe->work_list, node, prev); + wq_list_del(&wqe->work_list, node, prev); return work; }
@@ -401,7 +404,10 @@ static struct io_wq_work *io_get_next_work(struct io_wqe *wqe) hash = io_get_work_hash(work); if (!(wqe->hash_map & BIT(hash))) { wqe->hash_map |= BIT(hash); - wq_node_del(&wqe->work_list, node, prev); + /* all items with this hash lie in [work, tail] */ + tail = wqe->hash_tail[hash]; + wqe->hash_tail[hash] = NULL; + wq_list_cut(&wqe->work_list, &tail->list, prev); return work; } } @@ -486,7 +492,7 @@ static void io_worker_handle_work(struct io_worker *worker) struct io_wq *wq = wqe->wq;
do { - struct io_wq_work *work, *assign_work; + struct io_wq_work *work; unsigned int hash; get_next: /* @@ -509,8 +515,9 @@ static void io_worker_handle_work(struct io_worker *worker)
/* handle a whole dependent link */ do { - struct io_wq_work *old_work; + struct io_wq_work *old_work, *next_hashed, *linked;
+ next_hashed = wq_next_work(work); io_impersonate_work(worker, work); /* * OK to set IO_WQ_WORK_CANCEL even for uncancellable @@ -519,22 +526,23 @@ static void io_worker_handle_work(struct io_worker *worker) if (test_bit(IO_WQ_BIT_CANCEL, &wq->state)) work->flags |= IO_WQ_WORK_CANCEL;
- old_work = work; hash = io_get_work_hash(work); - work->func(&work); - work = (old_work == work) ? NULL : work; - - assign_work = work; - if (work && io_wq_is_hashed(work)) - assign_work = NULL; - io_assign_current_work(worker, assign_work); + linked = old_work = work; + linked->func(&linked); + linked = (old_work == linked) ? NULL : linked; + + work = next_hashed; + if (!work && linked && !io_wq_is_hashed(linked)) { + work = linked; + linked = NULL; + } + io_assign_current_work(worker, work); wq->free_work(old_work);
- if (work && !assign_work) { - io_wqe_enqueue(wqe, work); - work = NULL; - } - if (hash != -1U) { + if (linked) + io_wqe_enqueue(wqe, linked); + + if (hash != -1U && !next_hashed) { spin_lock_irq(&wqe->lock); wqe->hash_map &= ~BIT_ULL(hash); wqe->flags &= ~IO_WQE_FLAG_STALLED; @@ -777,6 +785,26 @@ static void io_run_cancel(struct io_wq_work *work, struct io_wqe *wqe) } while (work); }
+static void io_wqe_insert_work(struct io_wqe *wqe, struct io_wq_work *work) +{ + unsigned int hash; + struct io_wq_work *tail; + + if (!io_wq_is_hashed(work)) { +append: + wq_list_add_tail(&work->list, &wqe->work_list); + return; + } + + hash = io_get_work_hash(work); + tail = wqe->hash_tail[hash]; + wqe->hash_tail[hash] = work; + if (!tail) + goto append; + + wq_list_add_after(&work->list, &tail->list, &wqe->work_list); +} + static void io_wqe_enqueue(struct io_wqe *wqe, struct io_wq_work *work) { struct io_wqe_acct *acct = io_work_get_acct(wqe, work); @@ -796,7 +824,7 @@ static void io_wqe_enqueue(struct io_wqe *wqe, struct io_wq_work *work)
work_flags = work->flags; spin_lock_irqsave(&wqe->lock, flags); - wq_list_add_tail(&work->list, &wqe->work_list); + io_wqe_insert_work(wqe, work); wqe->flags &= ~IO_WQE_FLAG_STALLED; spin_unlock_irqrestore(&wqe->lock, flags);
@@ -915,7 +943,7 @@ static enum io_wq_cancel io_wqe_cancel_work(struct io_wqe *wqe, work = container_of(node, struct io_wq_work, list);
if (match->fn(work, match->data)) { - wq_node_del(&wqe->work_list, node, prev); + wq_list_del(&wqe->work_list, node, prev); found = true; break; } diff --git a/fs/io-wq.h b/fs/io-wq.h index d2a5684bf673..3ee7356d6be5 100644 --- a/fs/io-wq.h +++ b/fs/io-wq.h @@ -28,6 +28,18 @@ struct io_wq_work_list { struct io_wq_work_node *last; };
+static inline void wq_list_add_after(struct io_wq_work_node *node, + struct io_wq_work_node *pos, + struct io_wq_work_list *list) +{ + struct io_wq_work_node *next = pos->next; + + pos->next = node; + node->next = next; + if (!next) + list->last = node; +} + static inline void wq_list_add_tail(struct io_wq_work_node *node, struct io_wq_work_list *list) { @@ -40,17 +52,26 @@ static inline void wq_list_add_tail(struct io_wq_work_node *node, } }
-static inline void wq_node_del(struct io_wq_work_list *list, - struct io_wq_work_node *node, +static inline void wq_list_cut(struct io_wq_work_list *list, + struct io_wq_work_node *last, struct io_wq_work_node *prev) { - if (node == list->first) - WRITE_ONCE(list->first, node->next); - if (node == list->last) + /* first in the list, if prev==NULL */ + if (!prev) + WRITE_ONCE(list->first, last->next); + else + prev->next = last->next; + + if (last == list->last) list->last = prev; - if (prev) - prev->next = node->next; - node->next = NULL; + last->next = NULL; +} + +static inline void wq_list_del(struct io_wq_work_list *list, + struct io_wq_work_node *node, + struct io_wq_work_node *prev) +{ + wq_list_cut(list, node, prev); }
#define wq_list_for_each(pos, prv, head) \ @@ -78,6 +99,14 @@ struct io_wq_work { *(work) = (struct io_wq_work){ .func = _func }; \ } while (0) \
+static inline struct io_wq_work *wq_next_work(struct io_wq_work *work) +{ + if (!work->list.next) + return NULL; + + return container_of(work->list.next, struct io_wq_work, list); +} + typedef void (free_work_fn)(struct io_wq_work *);
struct io_wq_data {
From: Chucheng Luo luochucheng@vivo.com
mainline inclusion from mainline-5.7-rc1 commit bff6035d0c40fa1dd195aa41f61814d622883420 category: feature bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27 CVE: NA ---------------------------
The missing 'return' word may make it hard for other developers to understand the comment.
Signed-off-by: Chucheng Luo luochucheng@vivo.com Signed-off-by: Jens Axboe axboe@kernel.dk Signed-off-by: yangerkun yangerkun@huawei.com Reviewed-by: zhangyi (F) yi.zhang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- fs/io_uring.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/fs/io_uring.c b/fs/io_uring.c index cd1fd6908cbd..8ab0bafebf5e 100644 --- a/fs/io_uring.c +++ b/fs/io_uring.c @@ -2675,7 +2675,7 @@ static int io_write(struct io_kiocb *req, bool force_nonblock) current->signal->rlim[RLIMIT_FSIZE].rlim_cur = RLIM_INFINITY;
/* - * Raw bdev writes will -EOPNOTSUPP for IOCB_NOWAIT. Just + * Raw bdev writes will return -EOPNOTSUPP for IOCB_NOWAIT. Just * retry them without IOCB_NOWAIT. */ if (ret2 == -EOPNOTSUPP && (kiocb->ki_flags & IOCB_NOWAIT))
From: Xiaoguang Wang xiaoguang.wang@linux.alibaba.com
mainline inclusion from mainline-5.7-rc1 commit 3d9932a8b240c9019f48358e8a6928c53c2c7f6b category: feature bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27 CVE: NA ---------------------------
Clean up io_alloc_async_ctx() a bit and add a new __io_alloc_async_ctx(), so io_setup_async_rw() won't need to check whether async_ctx is true or false again.
Reviewed-by: Stefano Garzarella sgarzare@redhat.com Signed-off-by: Xiaoguang Wang xiaoguang.wang@linux.alibaba.com Signed-off-by: Jens Axboe axboe@kernel.dk Signed-off-by: yangerkun yangerkun@huawei.com Reviewed-by: zhangyi (F) yi.zhang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- fs/io_uring.c | 12 +++++++++--- 1 file changed, 9 insertions(+), 3 deletions(-)
diff --git a/fs/io_uring.c b/fs/io_uring.c index 8ab0bafebf5e..9d311d535efc 100644 --- a/fs/io_uring.c +++ b/fs/io_uring.c @@ -2466,12 +2466,18 @@ static void io_req_map_rw(struct io_kiocb *req, ssize_t io_size, } }
+static inline int __io_alloc_async_ctx(struct io_kiocb *req) +{ + req->io = kmalloc(sizeof(*req->io), GFP_KERNEL); + return req->io == NULL; +} + static int io_alloc_async_ctx(struct io_kiocb *req) { if (!io_op_defs[req->opcode].async_ctx) return 0; - req->io = kmalloc(sizeof(*req->io), GFP_KERNEL); - return req->io == NULL; + + return __io_alloc_async_ctx(req); }
static int io_setup_async_rw(struct io_kiocb *req, ssize_t io_size, @@ -2481,7 +2487,7 @@ static int io_setup_async_rw(struct io_kiocb *req, ssize_t io_size, if (!io_op_defs[req->opcode].async_ctx) return 0; if (!req->io) { - if (io_alloc_async_ctx(req)) + if (__io_alloc_async_ctx(req)) return -ENOMEM;
io_req_map_rw(req, io_size, iovec, fast_iov, iter);
From: Colin Ian King colin.king@canonical.com
mainline inclusion from mainline-5.7-rc1 commit 211fea18a7bb9b8d51cb5d2b9cbe5583af256609 category: feature bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27 CVE: NA ---------------------------
An earlier commit "io_uring: remove @nxt from handlers" removed the setting of the pointer nxt, so it is now always NULL; hence the non-NULL check and the call to io_wq_assign_next() are redundant and can be removed.
Addresses-Coverity: ("'Constant' variable guard") Reviewed-by: Chaitanya Kulkarni chaitanya.kulkarni@wdc.com Signed-off-by: Colin Ian King colin.king@canonical.com Signed-off-by: Jens Axboe axboe@kernel.dk Signed-off-by: yangerkun yangerkun@huawei.com Reviewed-by: zhangyi (F) yi.zhang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- fs/io_uring.c | 3 --- 1 file changed, 3 deletions(-)
diff --git a/fs/io_uring.c b/fs/io_uring.c index bf6defb8f3cf..750454952bbb 100644 --- a/fs/io_uring.c +++ b/fs/io_uring.c @@ -3455,14 +3455,11 @@ static void __io_sync_file_range(struct io_kiocb *req) static void io_sync_file_range_finish(struct io_wq_work **workptr) { struct io_kiocb *req = container_of(*workptr, struct io_kiocb, work); - struct io_kiocb *nxt = NULL;
if (io_req_cancelled(req)) return; __io_sync_file_range(req); io_put_req(req); /* put submission ref */ - if (nxt) - io_wq_assign_next(workptr, nxt); }
static int io_sync_file_range(struct io_kiocb *req, bool force_nonblock)
From: Xiaoguang Wang xiaoguang.wang@linux.alibaba.com
mainline inclusion from mainline-5.7-rc1 commit f7fe9346869a12efe3af3cc9be2e45a1b6ff8761 category: feature bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27 CVE: NA ---------------------------
syzbot reports below warning:
INFO: trying to register non-static key.
the code is fine but needs lockdep annotation.
turning off the locking correctness validator.
CPU: 1 PID: 7099 Comm: syz-executor897 Not tainted 5.6.0-next-20200406-syzkaller #0
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
Call Trace:
 __dump_stack lib/dump_stack.c:77 [inline]
 dump_stack+0x188/0x20d lib/dump_stack.c:118
 assign_lock_key kernel/locking/lockdep.c:913 [inline]
 register_lock_class+0x1664/0x1760 kernel/locking/lockdep.c:1225
 __lock_acquire+0x104/0x4e00 kernel/locking/lockdep.c:4223
 lock_acquire+0x1f2/0x8f0 kernel/locking/lockdep.c:4923
 __raw_spin_lock_irqsave include/linux/spinlock_api_smp.h:110 [inline]
 _raw_spin_lock_irqsave+0x8c/0xbf kernel/locking/spinlock.c:159
 io_sqe_files_register fs/io_uring.c:6599 [inline]
 __io_uring_register+0x1fe8/0x2f00 fs/io_uring.c:8001
 __do_sys_io_uring_register fs/io_uring.c:8081 [inline]
 __se_sys_io_uring_register fs/io_uring.c:8063 [inline]
 __x64_sys_io_uring_register+0x192/0x560 fs/io_uring.c:8063
 do_syscall_64+0xf6/0x7d0 arch/x86/entry/common.c:295
 entry_SYSCALL_64_after_hwframe+0x49/0xb3
RIP: 0033:0x440289
Code: 18 89 d0 c3 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 00 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 0f 83 fb 13 fc ff c3 66 2e 0f 1f 84 00 00 00 00
RSP: 002b:00007ffff1bbf558 EFLAGS: 00000246 ORIG_RAX: 00000000000001ab
RAX: ffffffffffffffda RBX: 00000000004002c8 RCX: 0000000000440289
RDX: 0000000020000280 RSI: 0000000000000002 RDI: 0000000000000003
RBP: 00000000006ca018 R08: 0000000000000000 R09: 00000000004002c8
R10: 0000000000000001 R11: 0000000000000246 R12: 0000000000401b10
R13: 0000000000401ba0 R14: 0000000000000000 R15: 0000000000000000
Initialize struct fixed_file_data's lock to fix this issue.
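For illustration, a minimal kernel-style sketch of this bug class (the structure name is hypothetical, simplified from the real fixed_file_data): a spinlock taken before spin_lock_init() has run has no lockdep class, which is exactly what produces the "trying to register non-static key" splat above.

struct fixed_file_data_like {
	spinlock_t lock;		/* must be initialized before use */
};

static void setup(struct fixed_file_data_like *data)
{
	spin_lock_init(&data->lock);	/* gives lockdep a lock class;
					 * taking the lock before this
					 * runs triggers the splat */
}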
Reported-by: syzbot+e6eeca4a035da76b3065@syzkaller.appspotmail.com Fixes: 055895537302 ("io_uring: refactor file register/unregister/update handling") Signed-off-by: Xiaoguang Wang xiaoguang.wang@linux.alibaba.com Signed-off-by: Jens Axboe axboe@kernel.dk Signed-off-by: yangerkun yangerkun@huawei.com Reviewed-by: zhangyi (F) yi.zhang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- fs/io_uring.c | 1 + 1 file changed, 1 insertion(+)
diff --git a/fs/io_uring.c b/fs/io_uring.c index 750454952bbb..70f9956fa2cd 100644 --- a/fs/io_uring.c +++ b/fs/io_uring.c @@ -6435,6 +6435,7 @@ static int io_sqe_files_register(struct io_ring_ctx *ctx, void __user *arg, ctx->file_data->ctx = ctx; init_completion(&ctx->file_data->done); INIT_LIST_HEAD(&ctx->file_data->ref_list); + spin_lock_init(&ctx->file_data->lock);
nr_tables = DIV_ROUND_UP(nr_args, IORING_MAX_FILES_TABLE); ctx->file_data->table = kcalloc(nr_tables,
From: Jens Axboe axboe@kernel.dk
mainline inclusion from mainline-5.7-rc1 commit 08a1d26eb894a9dcf79f674558a284ad1ffef517 category: feature bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27 CVE: NA ---------------------------
OPENAT2 correctly sets O_LARGEFILE if it has to, but that escaped the OPENAT opcode. Dmitry reports that his test case that compares openat() and IORING_OP_OPENAT sees failures on large files:
*** sync openat
openat succeeded
sync write at offset 0
write succeeded
sync write at offset 4294967296
write succeeded

*** sync openat
openat succeeded
io_uring write at offset 0
write succeeded
io_uring write at offset 4294967296
write succeeded

*** io_uring openat
openat succeeded
sync write at offset 0
write succeeded
sync write at offset 4294967296
write failed: File too large

*** io_uring openat
openat succeeded
io_uring write at offset 0
write succeeded
io_uring write at offset 4294967296
write failed: File too large
Ensure we set O_LARGEFILE, if force_o_largefile() is true.
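A rough userspace reproducer along the lines of Dmitry's test might look like the sketch below (assumes liburing with IORING_OP_OPENAT support; filename and error handling are abbreviated, not taken from the original test):

#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <liburing.h>

int main(void)
{
	struct io_uring ring;
	struct io_uring_sqe *sqe;
	struct io_uring_cqe *cqe;
	char buf[4096];
	int fd;

	io_uring_queue_init(8, &ring, 0);

	/* open the file via io_uring instead of the openat() syscall */
	sqe = io_uring_get_sqe(&ring);
	io_uring_prep_openat(sqe, AT_FDCWD, "testfile",
			     O_WRONLY | O_CREAT, 0644);
	io_uring_submit(&ring);
	io_uring_wait_cqe(&ring, &cqe);
	fd = cqe->res;			/* opened fd, or -errno */
	io_uring_cqe_seen(&ring, cqe);

	memset(buf, 0, sizeof(buf));
	/* offset 4294967296 == 1ULL << 32 needs O_LARGEFILE semantics */
	if (pwrite(fd, buf, sizeof(buf), (off_t)(1ULL << 32)) < 0)
		perror("write failed");	/* "File too large" before the fix */

	io_uring_queue_exit(&ring);
	return 0;
}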
Cc: stable@vger.kernel.org # v5.6 Fixes: 15b71abe7b52 ("io_uring: add support for IORING_OP_OPENAT") Reported-by: Dmitry Kadashev dkadashev@gmail.com Signed-off-by: Jens Axboe axboe@kernel.dk
Conflicts: fs/io_uring.c [commit cebdb98617ae ("io_uring: add support for IORING_OP_OPENAT2") is not merged] Signed-off-by: yangerkun yangerkun@huawei.com Reviewed-by: zhangyi (F) yi.zhang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- fs/io_uring.c | 2 ++ 1 file changed, 2 insertions(+)
diff --git a/fs/io_uring.c b/fs/io_uring.c index 70f9956fa2cd..80dc4b0dd1f0 100644 --- a/fs/io_uring.c +++ b/fs/io_uring.c @@ -2951,6 +2951,8 @@ static int io_openat_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe) req->open.mode = READ_ONCE(sqe->len); fname = u64_to_user_ptr(READ_ONCE(sqe->addr)); req->open.flags = READ_ONCE(sqe->open_flags); + if (force_o_largefile()) + req->open.flags |= O_LARGEFILE;
req->open.filename = getname(fname); if (IS_ERR(req->open.filename)) {
From: Xiaoguang Wang xiaoguang.wang@linux.alibaba.com
mainline inclusion from mainline-5.7-rc1 commit 45097daea2f4e89bdb1c98359f78d0d6feb8e5c8 category: feature bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27 CVE: NA ---------------------------
In io_read_prep() or io_write_prep(), io_req_map_rw() takes struct io_async_rw's fast_iov as an argument when calling io_import_iovec(), and if io_import_iovec() ends up using that fast_iov as the valid iovec array, io_req_map_rw() does not need to do the memcpy later, because source and destination are the same pointer.
Signed-off-by: Xiaoguang Wang xiaoguang.wang@linux.alibaba.com Signed-off-by: Jens Axboe axboe@kernel.dk Signed-off-by: yangerkun yangerkun@huawei.com Reviewed-by: zhangyi (F) yi.zhang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- fs/io_uring.c | 5 +++-- 1 file changed, 3 insertions(+), 2 deletions(-)
diff --git a/fs/io_uring.c b/fs/io_uring.c index 80dc4b0dd1f0..2f0f65eb59a4 100644 --- a/fs/io_uring.c +++ b/fs/io_uring.c @@ -2487,8 +2487,9 @@ static void io_req_map_rw(struct io_kiocb *req, ssize_t io_size, req->io->rw.iov = iovec; if (!req->io->rw.iov) { req->io->rw.iov = req->io->rw.fast_iov; - memcpy(req->io->rw.iov, fast_iov, - sizeof(struct iovec) * iter->nr_segs); + if (req->io->rw.iov != fast_iov) + memcpy(req->io->rw.iov, fast_iov, + sizeof(struct iovec) * iter->nr_segs); } else { req->flags |= REQ_F_NEED_CLEANUP; }
From: Pavel Begunkov asml.silence@gmail.com
mainline inclusion from mainline-5.7-rc1 commit 709b302faddfac757d87df2080f900eccb1dc9e2 category: feature bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27 CVE: NA ---------------------------
Make io_get_sqring() care only about the sqes themselves, not about initialising the io_kiocb. Also, split it into get + consume; that will be helpful in the future.
Signed-off-by: Pavel Begunkov asml.silence@gmail.com Signed-off-by: Jens Axboe axboe@kernel.dk Signed-off-by: yangerkun yangerkun@huawei.com Reviewed-by: zhangyi (F) yi.zhang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- fs/io_uring.c | 40 ++++++++++++++++++++++------------------ 1 file changed, 22 insertions(+), 18 deletions(-)
diff --git a/fs/io_uring.c b/fs/io_uring.c index 2f0f65eb59a4..2349602fd013 100644 --- a/fs/io_uring.c +++ b/fs/io_uring.c @@ -5715,8 +5715,7 @@ static void io_commit_sqring(struct io_ring_ctx *ctx) * used, it's important that those reads are done through READ_ONCE() to * prevent a re-load down the line. */ -static bool io_get_sqring(struct io_ring_ctx *ctx, struct io_kiocb *req, - const struct io_uring_sqe **sqe_ptr) +static const struct io_uring_sqe *io_get_sqe(struct io_ring_ctx *ctx) { u32 *sq_array = ctx->sq_array; unsigned head; @@ -5730,25 +5729,18 @@ static bool io_get_sqring(struct io_ring_ctx *ctx, struct io_kiocb *req, * though the application is the one updating it. */ head = READ_ONCE(sq_array[ctx->cached_sq_head & ctx->sq_mask]); - if (likely(head < ctx->sq_entries)) { - /* - * All io need record the previous position, if LINK vs DARIN, - * it can be used to mark the position of the first IO in the - * link list. - */ - req->sequence = ctx->cached_sq_head; - *sqe_ptr = &ctx->sq_sqes[head]; - req->opcode = READ_ONCE((*sqe_ptr)->opcode); - req->user_data = READ_ONCE((*sqe_ptr)->user_data); - ctx->cached_sq_head++; - return true; - } + if (likely(head < ctx->sq_entries)) + return &ctx->sq_sqes[head];
/* drop invalid entries */ - ctx->cached_sq_head++; ctx->cached_sq_dropped++; WRITE_ONCE(ctx->rings->sq_dropped, ctx->cached_sq_dropped); - return false; + return NULL; +} + +static inline void io_consume_sqe(struct io_ring_ctx *ctx) +{ + ctx->cached_sq_head++; }
static int io_submit_sqes(struct io_ring_ctx *ctx, unsigned int nr, @@ -5792,11 +5784,23 @@ static int io_submit_sqes(struct io_ring_ctx *ctx, unsigned int nr, submitted = -EAGAIN; break; } - if (!io_get_sqring(ctx, req, &sqe)) { + sqe = io_get_sqe(ctx); + if (!sqe) { __io_req_do_free(req); + io_consume_sqe(ctx); break; }
+ /* + * All io need record the previous position, if LINK vs DARIN, + * it can be used to mark the position of the first IO in the + * link list. + */ + req->sequence = ctx->cached_sq_head; + req->opcode = READ_ONCE(sqe->opcode); + req->user_data = READ_ONCE(sqe->user_data); + io_consume_sqe(ctx); + /* will complete beyond this point, count as submitted */ submitted++;
From: Pavel Begunkov asml.silence@gmail.com
mainline inclusion from mainline-5.7-rc1 commit b1e50e549b1372d9742509230dc4af7dd521d984 category: feature bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27 CVE: NA ---------------------------
As io_get_sqe() is now split into two stages, get and consume, get an sqe before allocating the io_kiocb, so no free_req*() is needed for the failure case, and inline back __io_req_do_free(), which has only one user.
Signed-off-by: Pavel Begunkov asml.silence@gmail.com Signed-off-by: Jens Axboe axboe@kernel.dk Signed-off-by: yangerkun yangerkun@huawei.com Reviewed-by: zhangyi (F) yi.zhang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- fs/io_uring.c | 24 +++++++++--------------- 1 file changed, 9 insertions(+), 15 deletions(-)
diff --git a/fs/io_uring.c b/fs/io_uring.c index 2349602fd013..d7dd8f3655fe 100644 --- a/fs/io_uring.c +++ b/fs/io_uring.c @@ -1348,14 +1348,6 @@ static inline void io_put_file(struct io_kiocb *req, struct file *file, fput(file); }
-static void __io_req_do_free(struct io_kiocb *req) -{ - if (likely(!io_is_fallback_req(req))) - kmem_cache_free(req_cachep, req); - else - clear_bit_unlock(0, (unsigned long *) req->ctx->fallback_req); -} - static void __io_req_aux_free(struct io_kiocb *req) { if (req->flags & REQ_F_NEED_CLEANUP) @@ -1386,7 +1378,10 @@ static void __io_free_req(struct io_kiocb *req) }
percpu_ref_put(&req->ctx->refs); - __io_req_do_free(req); + if (likely(!io_is_fallback_req(req))) + kmem_cache_free(req_cachep, req); + else + clear_bit_unlock(0, (unsigned long *) req->ctx->fallback_req); }
struct req_batch { @@ -5778,18 +5773,17 @@ static int io_submit_sqes(struct io_ring_ctx *ctx, unsigned int nr, struct io_kiocb *req; int err;
+ sqe = io_get_sqe(ctx); + if (unlikely(!sqe)) { + io_consume_sqe(ctx); + break; + } req = io_get_req(ctx, statep); if (unlikely(!req)) { if (!submitted) submitted = -EAGAIN; break; } - sqe = io_get_sqe(ctx); - if (!sqe) { - __io_req_do_free(req); - io_consume_sqe(ctx); - break; - }
/* * All io need record the previous position, if LINK vs DARIN,
From: Pavel Begunkov asml.silence@gmail.com
mainline inclusion from mainline-5.7-rc1 commit 0553b8bda8709c47863eab3fff7ac32ad04ca52b category: feature bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27 CVE: NA ---------------------------
io_get_req() does two different things: io_kiocb allocation and initialisation. Move the init part out of it and rename it to io_alloc_req(). It's simpler this way and also has better data locality.
Signed-off-by: Pavel Begunkov asml.silence@gmail.com Signed-off-by: Jens Axboe axboe@kernel.dk Signed-off-by: yangerkun yangerkun@huawei.com Reviewed-by: zhangyi (F) yi.zhang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- fs/io_uring.c | 53 ++++++++++++++++++++++++++------------------------- 1 file changed, 27 insertions(+), 26 deletions(-)
diff --git a/fs/io_uring.c b/fs/io_uring.c index d7dd8f3655fe..13d0be87bb2d 100644 --- a/fs/io_uring.c +++ b/fs/io_uring.c @@ -1287,8 +1287,8 @@ static struct io_kiocb *io_get_fallback_req(struct io_ring_ctx *ctx) return NULL; }
-static struct io_kiocb *io_get_req(struct io_ring_ctx *ctx, - struct io_submit_state *state) +static struct io_kiocb *io_alloc_req(struct io_ring_ctx *ctx, + struct io_submit_state *state) { gfp_t gfp = GFP_KERNEL | __GFP_NOWARN; struct io_kiocb *req; @@ -1321,22 +1321,9 @@ static struct io_kiocb *io_get_req(struct io_ring_ctx *ctx, req = state->reqs[state->free_reqs]; }
-got_it: - req->io = NULL; - req->file = NULL; - req->ctx = ctx; - req->flags = 0; - /* one is dropped after submission, the other at completion */ - refcount_set(&req->refs, 2); - req->task = NULL; - req->result = 0; - INIT_IO_WORK(&req->work, io_wq_submit_work); return req; fallback: - req = io_get_fallback_req(ctx); - if (req) - goto got_it; - return NULL; + return io_get_fallback_req(ctx); }
static inline void io_put_file(struct io_kiocb *req, struct file *file, @@ -5738,6 +5725,28 @@ static inline void io_consume_sqe(struct io_ring_ctx *ctx) ctx->cached_sq_head++; }
+static void io_init_req(struct io_ring_ctx *ctx, struct io_kiocb *req, + const struct io_uring_sqe *sqe) +{ + /* + * All io need record the previous position, if LINK vs DARIN, + * it can be used to mark the position of the first IO in the + * link list. + */ + req->sequence = ctx->cached_sq_head; + req->opcode = READ_ONCE(sqe->opcode); + req->user_data = READ_ONCE(sqe->user_data); + req->io = NULL; + req->file = NULL; + req->ctx = ctx; + req->flags = 0; + /* one is dropped after submission, the other at completion */ + refcount_set(&req->refs, 2); + req->task = NULL; + req->result = 0; + INIT_IO_WORK(&req->work, io_wq_submit_work); +} + static int io_submit_sqes(struct io_ring_ctx *ctx, unsigned int nr, struct file *ring_file, int ring_fd, struct mm_struct **mm, bool async) @@ -5778,23 +5787,15 @@ static int io_submit_sqes(struct io_ring_ctx *ctx, unsigned int nr, io_consume_sqe(ctx); break; } - req = io_get_req(ctx, statep); + req = io_alloc_req(ctx, statep); if (unlikely(!req)) { if (!submitted) submitted = -EAGAIN; break; }
- /* - * All io need record the previous position, if LINK vs DARIN, - * it can be used to mark the position of the first IO in the - * link list. - */ - req->sequence = ctx->cached_sq_head; - req->opcode = READ_ONCE(sqe->opcode); - req->user_data = READ_ONCE(sqe->user_data); + io_init_req(ctx, req, sqe); io_consume_sqe(ctx); - /* will complete beyond this point, count as submitted */ submitted++;
From: Pavel Begunkov asml.silence@gmail.com
mainline inclusion from mainline-5.7-rc1 commit 9c280f9087118099f50566e906b9d9d5a0fb4529 category: feature bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27 CVE: NA ---------------------------
Don't re-read the userspace-shared sqe->flags; it can be exploited. sqe->flags are copied into req->flags in io_submit_sqe(); check them there instead.
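The hazard being closed here is a classic double fetch from shared memory. A self-contained sketch of the pattern (the flag mask and function are hypothetical, not io_uring code):

#define VALID_FLAGS 0x3fu	/* hypothetical mask for this sketch */

int check_and_submit(volatile unsigned char *shared_flags)
{
	unsigned char flags = *shared_flags;	/* single fetch into a local */

	if (flags & ~VALID_FLAGS)
		return -1;	/* reject unknown bits */
	/*
	 * From here on, only 'flags' may be used. Re-reading
	 * *shared_flags would let userspace swap in unvalidated bits
	 * after the check (a TOCTOU / double-fetch race). io_uring's
	 * equivalent of this local copy is req->flags.
	 */
	return 0;
}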
Signed-off-by: Pavel Begunkov asml.silence@gmail.com Signed-off-by: Jens Axboe axboe@kernel.dk
Conflicts: fs/io_uring.c [skip io_openat2_prep for commit cebdb98617ae ("io_uring: add support for IORING_OP_OPENAT2") is not merged] Signed-off-by: yangerkun yangerkun@huawei.com Reviewed-by: zhangyi (F) yi.zhang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- fs/io_uring.c | 18 +++++++----------- 1 file changed, 7 insertions(+), 11 deletions(-)
diff --git a/fs/io_uring.c b/fs/io_uring.c index 13d0be87bb2d..1a00bcd64616 100644 --- a/fs/io_uring.c +++ b/fs/io_uring.c @@ -2925,7 +2925,7 @@ static int io_openat_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe)
if (sqe->ioprio || sqe->buf_index) return -EINVAL; - if (sqe->flags & IOSQE_FIXED_FILE) + if (req->flags & REQ_F_FIXED_FILE) return -EBADF; if (req->flags & REQ_F_NEED_CLEANUP) return 0; @@ -3264,7 +3264,7 @@ static int io_statx_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe)
if (sqe->ioprio || sqe->buf_index) return -EINVAL; - if (sqe->flags & IOSQE_FIXED_FILE) + if (req->flags & REQ_F_FIXED_FILE) return -EBADF; if (req->flags & REQ_F_NEED_CLEANUP) return 0; @@ -3341,7 +3341,7 @@ static int io_close_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe) if (sqe->ioprio || sqe->off || sqe->addr || sqe->len || sqe->rw_flags || sqe->buf_index) return -EINVAL; - if (sqe->flags & IOSQE_FIXED_FILE) + if (req->flags & REQ_F_FIXED_FILE) return -EBADF;
req->close.fd = READ_ONCE(sqe->fd); @@ -5300,15 +5300,10 @@ static int io_file_get(struct io_submit_state *state, struct io_kiocb *req, }
static int io_req_set_file(struct io_submit_state *state, struct io_kiocb *req, - const struct io_uring_sqe *sqe) + int fd, unsigned int flags) { - unsigned flags; - int fd; bool fixed;
- flags = READ_ONCE(sqe->flags); - fd = READ_ONCE(sqe->fd); - if (!io_req_needs_file(req, fd)) return 0;
@@ -5550,7 +5545,7 @@ static bool io_submit_sqe(struct io_kiocb *req, const struct io_uring_sqe *sqe, { struct io_ring_ctx *ctx = req->ctx; unsigned int sqe_flags; - int ret, id; + int ret, id, fd;
sqe_flags = READ_ONCE(sqe->flags);
@@ -5581,7 +5576,8 @@ static bool io_submit_sqe(struct io_kiocb *req, const struct io_uring_sqe *sqe, IOSQE_ASYNC | IOSQE_FIXED_FILE | IOSQE_BUFFER_SELECT);
- ret = io_req_set_file(state, req, sqe); + fd = READ_ONCE(sqe->fd); + ret = io_req_set_file(state, req, fd, sqe_flags); if (unlikely(ret)) { err_req: io_cqring_add_event(req, ret);
From: Pavel Begunkov asml.silence@gmail.com
mainline inclusion from mainline-5.7-rc1 commit c398ecb3d611925e4a5411afdf7489914a5c0460 category: feature bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27 CVE: NA ---------------------------
If a completion queue overflow occurs, __io_cqring_fill_event() will update req->cflags, which is in a union with req->work and happens to alias req->work.fs. The following io_free_req() -> io_req_work_drop_env() can then hit a bunch of different problems (miscounted fs->users, segfault, etc.) when cleaning up @fs.
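A minimal sketch of the aliasing problem and the fix (field names simplified from io_kiocb; the member layout here is illustrative only):

struct req_before {
	union {
		struct {
			void *fs;	/* stands in for req->work.fs */
		} work;
		int cflags;		/* writing this clobbers work.fs */
	};
};

struct req_after {
	int cflags;			/* moved out of the union: safe */
	union {
		struct {
			void *fs;
		} work;
	};
};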
Signed-off-by: Pavel Begunkov asml.silence@gmail.com Signed-off-by: Jens Axboe axboe@kernel.dk Signed-off-by: yangerkun yangerkun@huawei.com Reviewed-by: zhangyi (F) yi.zhang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- fs/io_uring.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/fs/io_uring.c b/fs/io_uring.c index 1a00bcd64616..091997a55009 100644 --- a/fs/io_uring.c +++ b/fs/io_uring.c @@ -609,6 +609,7 @@ struct io_kiocb { };
struct io_async_ctx *io; + int cflags; bool needs_fixed_file; u8 opcode;
@@ -639,7 +640,6 @@ struct io_kiocb { struct callback_head task_work; struct hlist_node hash_node; struct async_poll *apoll; - int cflags; }; struct io_wq_work work; };
From: Pavel Begunkov asml.silence@gmail.com
mainline inclusion from mainline-5.7-rc2 commit 1d4240cc9e7bb101dac58f30283fa24a809f5606 category: feature bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27 CVE: NA ---------------------------
Having only one place for cleaning up a request after a link assembly/submission failure will come in handy in the future. At the least, it allows us to remove a duplicated cleanup sequence.
Signed-off-by: Pavel Begunkov asml.silence@gmail.com Signed-off-by: Jens Axboe axboe@kernel.dk Signed-off-by: yangerkun yangerkun@huawei.com Reviewed-by: zhangyi (F) yi.zhang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- fs/io_uring.c | 50 +++++++++++++++++++------------------------------- 1 file changed, 19 insertions(+), 31 deletions(-)
diff --git a/fs/io_uring.c b/fs/io_uring.c index 521b67216a74..fa13b5d6b5f6 100644 --- a/fs/io_uring.c +++ b/fs/io_uring.c @@ -5542,7 +5542,7 @@ static inline void io_queue_link_head(struct io_kiocb *req) IOSQE_IO_HARDLINK | IOSQE_ASYNC | \ IOSQE_BUFFER_SELECT)
-static bool io_submit_sqe(struct io_kiocb *req, const struct io_uring_sqe *sqe, +static int io_submit_sqe(struct io_kiocb *req, const struct io_uring_sqe *sqe, struct io_submit_state *state, struct io_kiocb **link) { struct io_ring_ctx *ctx = req->ctx; @@ -5552,24 +5552,18 @@ static bool io_submit_sqe(struct io_kiocb *req, const struct io_uring_sqe *sqe, sqe_flags = READ_ONCE(sqe->flags);
/* enforce forwards compatibility on users */ - if (unlikely(sqe_flags & ~SQE_VALID_FLAGS)) { - ret = -EINVAL; - goto err_req; - } + if (unlikely(sqe_flags & ~SQE_VALID_FLAGS)) + return -EINVAL;
if ((sqe_flags & IOSQE_BUFFER_SELECT) && - !io_op_defs[req->opcode].buffer_select) { - ret = -EOPNOTSUPP; - goto err_req; - } + !io_op_defs[req->opcode].buffer_select) + return -EOPNOTSUPP;
id = READ_ONCE(sqe->personality); if (id) { req->work.creds = idr_find(&ctx->personality_idr, id); - if (unlikely(!req->work.creds)) { - ret = -EINVAL; - goto err_req; - } + if (unlikely(!req->work.creds)) + return -EINVAL; get_cred(req->work.creds); }
@@ -5580,12 +5574,8 @@ static bool io_submit_sqe(struct io_kiocb *req, const struct io_uring_sqe *sqe,
fd = READ_ONCE(sqe->fd); ret = io_req_set_file(state, req, fd, sqe_flags); - if (unlikely(ret)) { -err_req: - io_cqring_add_event(req, ret); - io_double_put_req(req); - return false; - } + if (unlikely(ret)) + return ret;
/* * If we already have a head request, queue this one for async @@ -5608,16 +5598,14 @@ static bool io_submit_sqe(struct io_kiocb *req, const struct io_uring_sqe *sqe, head->flags |= REQ_F_IO_DRAIN; ctx->drain_next = 1; } - if (io_alloc_async_ctx(req)) { - ret = -EAGAIN; - goto err_req; - } + if (io_alloc_async_ctx(req)) + return -EAGAIN;
ret = io_req_defer_prep(req, sqe); if (ret) { /* fail even hard links since we don't submit */ head->flags |= REQ_F_FAIL_LINK; - goto err_req; + return ret; } trace_io_uring_link(ctx, req, head); list_add_tail(&req->link_list, &head->link_list); @@ -5636,10 +5624,9 @@ static bool io_submit_sqe(struct io_kiocb *req, const struct io_uring_sqe *sqe, req->flags |= REQ_F_LINK; INIT_LIST_HEAD(&req->link_list);
- if (io_alloc_async_ctx(req)) { - ret = -EAGAIN; - goto err_req; - } + if (io_alloc_async_ctx(req)) + return -EAGAIN; + ret = io_req_defer_prep(req, sqe); if (ret) req->flags |= REQ_F_FAIL_LINK; @@ -5649,7 +5636,7 @@ static bool io_submit_sqe(struct io_kiocb *req, const struct io_uring_sqe *sqe, } }
- return true; + return 0; }
/* @@ -5814,8 +5801,9 @@ static int io_submit_sqes(struct io_ring_ctx *ctx, unsigned int nr, req->needs_fixed_file = async; trace_io_uring_submit_sqe(ctx, req->opcode, req->user_data, true, async); - if (!io_submit_sqe(req, sqe, statep, &link)) - break; + err = io_submit_sqe(req, sqe, statep, &link); + if (err) + goto fail_req; }
if (unlikely(submitted != nr)) {
From: Jens Axboe axboe@kernel.dk
mainline inclusion from mainline-5.7-rc2 commit 88357580854aab29d27e1a443575caaedd081612 category: feature bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27 CVE: NA ---------------------------
The splice file punt check uses file->f_mode to check for O_NONBLOCK, but it should be checking file->f_flags. This leads to punting even for files that have O_NONBLOCK set, which isn't necessary. This equates to checking for FMODE_PATH, which will never be set on the fd in question.
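To illustrate the distinction (a simplified sketch, not the kernel structures): O_NONBLOCK belongs to the open(2) flag space stored in f_flags, while f_mode carries kernel-internal FMODE_* bits, so the old test compared unrelated bit spaces.

#include <fcntl.h>
#include <stdbool.h>

struct file_like {
	unsigned int f_flags;	/* O_* flags from open(2) */
	unsigned int f_mode;	/* kernel-internal FMODE_* bits */
};

static bool splice_should_punt(const struct file_like *file)
{
	/* fixed form: test O_NONBLOCK in the flag space it lives in */
	return !(file->f_flags & O_NONBLOCK);
}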
Fixes: 7d67af2c0134 ("io_uring: add splice(2) support") Signed-off-by: Jens Axboe axboe@kernel.dk Signed-off-by: yangerkun yangerkun@huawei.com Reviewed-by: zhangyi (F) yi.zhang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- fs/io_uring.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/fs/io_uring.c b/fs/io_uring.c index e6fa6f19129c..5341aabfa400 100644 --- a/fs/io_uring.c +++ b/fs/io_uring.c @@ -2757,7 +2757,7 @@ static bool io_splice_punt(struct file *file) return false; if (!io_file_supports_async(file)) return true; - return !(file->f_mode & O_NONBLOCK); + return !(file->f_flags & O_NONBLOCK); }
static int io_splice(struct io_kiocb *req, bool force_nonblock)
From: Pavel Begunkov asml.silence@gmail.com
mainline inclusion from mainline-5.7-rc7 commit 650b548129b60b0d23508351800108196f4aa89f category: feature bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27 CVE: NA ---------------------------
If req->io is not NULL, it's already prepared. Don't do it again, it's dangerous.
Signed-off-by: Pavel Begunkov asml.silence@gmail.com Signed-off-by: Jens Axboe axboe@kernel.dk Signed-off-by: yangerkun yangerkun@huawei.com Reviewed-by: zhangyi (F) yi.zhang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- fs/io_uring.c | 13 +++++++------ 1 file changed, 7 insertions(+), 6 deletions(-)
diff --git a/fs/io_uring.c b/fs/io_uring.c index 71def07b1c94..d491df308235 100644 --- a/fs/io_uring.c +++ b/fs/io_uring.c @@ -4959,12 +4959,13 @@ static int io_req_defer(struct io_kiocb *req, const struct io_uring_sqe *sqe) if (!req_need_defer(req) && list_empty_careful(&ctx->defer_list)) return 0;
- if (!req->io && io_alloc_async_ctx(req)) - return -EAGAIN; - - ret = io_req_defer_prep(req, sqe); - if (ret < 0) - return ret; + if (!req->io) { + if (io_alloc_async_ctx(req)) + return -EAGAIN; + ret = io_req_defer_prep(req, sqe); + if (ret < 0) + return ret; + }
spin_lock_irq(&ctx->completion_lock); if (!req_need_defer(req) && list_empty(&ctx->defer_list)) {
From: Jens Axboe axboe@kernel.dk
mainline inclusion from mainline-5.7-rc7 commit 948a7749454b1712f1b2f2429f9493eb3e4a89b0 category: feature bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27 CVE: NA ---------------------------
We checked for 'force_nonblock' higher up, so it's definitely false at this point. Kill the check, it's a remnant of when we tried to do inline splice without always punting to async context.
Fixes: 2fb3e82284fc ("io_uring: punt splice async because of inode mutex") Signed-off-by: Jens Axboe axboe@kernel.dk Signed-off-by: yangerkun yangerkun@huawei.com Reviewed-by: zhangyi (F) yi.zhang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- fs/io_uring.c | 5 +---- 1 file changed, 1 insertion(+), 4 deletions(-)
diff --git a/fs/io_uring.c b/fs/io_uring.c index 63e9ae556bae..a2e1fc690f9d 100644 --- a/fs/io_uring.c +++ b/fs/io_uring.c @@ -2768,11 +2768,8 @@ static int io_splice(struct io_kiocb *req, bool force_nonblock) poff_in = (sp->off_in == -1) ? NULL : &sp->off_in; poff_out = (sp->off_out == -1) ? NULL : &sp->off_out;
- if (sp->len) { + if (sp->len) ret = do_splice(in, poff_in, out, poff_out, sp->len, flags); - if (force_nonblock && ret == -EAGAIN) - return -EAGAIN; - }
io_put_file(req, in, (sp->flags & SPLICE_F_FD_IN_FIXED)); req->flags &= ~REQ_F_NEED_CLEANUP;
From: Jens Axboe axboe@kernel.dk
mainline inclusion from mainline-5.7-rc7 commit e3aabf9554fd04eb14cd44ae7583fc9d40edd250 category: feature bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27 CVE: NA ---------------------------
We currently move it to the io_wqe_manager for execution, but we cannot safely do so as we may lack some of the state to execute it out of context. As we cancel work anyway when the ring/task exits, just mark this request as canceled and io_async_task_func() will do the right thing.
Fixes: aa96bf8a9ee3 ("io_uring: use io-wq manager as backup task if task is exiting") Signed-off-by: Jens Axboe axboe@kernel.dk Signed-off-by: yangerkun yangerkun@huawei.com Reviewed-by: zhangyi (F) yi.zhang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- fs/io_uring.c | 8 +++++--- 1 file changed, 5 insertions(+), 3 deletions(-)
diff --git a/fs/io_uring.c b/fs/io_uring.c index a2e1fc690f9d..82eb32a8a18f 100644 --- a/fs/io_uring.c +++ b/fs/io_uring.c @@ -4083,12 +4083,14 @@ static int __io_async_wake(struct io_kiocb *req, struct io_poll_iocb *poll, req->result = mask; init_task_work(&req->task_work, func); /* - * If this fails, then the task is exiting. Punt to one of the io-wq - * threads to ensure the work gets run, we can't always rely on exit - * cancelation taking care of this. + * If this fails, then the task is exiting. When a task exits, the + * work gets canceled, so just cancel this request as well instead + * of executing it. We can't safely execute it anyway, as we may not + * have the needed state needed for it anyway. */ ret = task_work_add(tsk, &req->task_work, true); if (unlikely(ret)) { + WRITE_ONCE(poll->canceled, true); tsk = io_wq_get_task(req->ctx->io_wq); task_work_add(tsk, &req->task_work, true); }
From: Bijan Mottahedeh bijan.mottahedeh@oracle.com
mainline inclusion from mainline-5.7-rc7 commit 4f4eeba87cc731b200bff9372d14a80f5996b277 category: feature bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27 CVE: NA ---------------------------
kiocb.private is used in iomap_dio_rw() so store buf_index separately.
Signed-off-by: Bijan Mottahedeh bijan.mottahedeh@oracle.com
Move 'buf_index' to a hole in io_kiocb.
Signed-off-by: Jens Axboe axboe@kernel.dk Signed-off-by: yangerkun yangerkun@huawei.com Reviewed-by: zhangyi (F) yi.zhang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- fs/io_uring.c | 16 ++++++++-------- 1 file changed, 8 insertions(+), 8 deletions(-)
diff --git a/fs/io_uring.c b/fs/io_uring.c index 82eb32a8a18f..f035a8e061c5 100644 --- a/fs/io_uring.c +++ b/fs/io_uring.c @@ -620,6 +620,8 @@ struct io_kiocb { bool needs_fixed_file; u8 opcode;
+ u16 buf_index; + struct io_ring_ctx *ctx; struct list_head list; unsigned int flags; @@ -2097,9 +2099,7 @@ static int io_prep_rw(struct io_kiocb *req, const struct io_uring_sqe *sqe,
req->rw.addr = READ_ONCE(sqe->addr); req->rw.len = READ_ONCE(sqe->len); - /* we own ->private, reuse it for the buffer index / buffer ID */ - req->rw.kiocb.private = (void *) (unsigned long) - READ_ONCE(sqe->buf_index); + req->buf_index = READ_ONCE(sqe->buf_index); return 0; }
@@ -2142,7 +2142,7 @@ static ssize_t io_import_fixed(struct io_kiocb *req, int rw, struct io_ring_ctx *ctx = req->ctx; size_t len = req->rw.len; struct io_mapped_ubuf *imu; - unsigned index, buf_index; + u16 index, buf_index; size_t offset; u64 buf_addr;
@@ -2150,7 +2150,7 @@ static ssize_t io_import_fixed(struct io_kiocb *req, int rw, if (unlikely(!ctx->user_bufs)) return -EFAULT;
- buf_index = (unsigned long) req->rw.kiocb.private; + buf_index = req->buf_index; if (unlikely(buf_index >= ctx->nr_user_bufs)) return -EFAULT;
@@ -2266,10 +2266,10 @@ static void __user *io_rw_buffer_select(struct io_kiocb *req, size_t *len, bool needs_lock) { struct io_buffer *kbuf; - int bgid; + u16 bgid;
kbuf = (struct io_buffer *) (unsigned long) req->rw.addr; - bgid = (int) (unsigned long) req->rw.kiocb.private; + bgid = req->buf_index; kbuf = io_buffer_select(req, len, bgid, kbuf, needs_lock); if (IS_ERR(kbuf)) return kbuf; @@ -2360,7 +2360,7 @@ static ssize_t io_import_iovec(int rw, struct io_kiocb *req, }
/* buffer index only valid with fixed read/write, or buffer select */ - if (req->rw.kiocb.private && !(req->flags & REQ_F_BUFFER_SELECT)) + if (req->buf_index && !(req->flags & REQ_F_BUFFER_SELECT)) return -EINVAL;
if (opcode == IORING_OP_READ || opcode == IORING_OP_WRITE) {
From: Xiaoguang Wang xiaoguang.wang@linux.alibaba.com
mainline inclusion from mainline-5.7-rc7 commit d4ae271dfaae2a5f41c015f2f20d62a1deeec734 category: feature bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27 CVE: NA ---------------------------
In io_sq_thread(), if we currently get an -EBUSY error and go to sleep, we won't clear it again, which results in io_sq_thread() never having a chance to submit sqes again. The test program test.c below can reveal this bug:
#define _GNU_SOURCE		/* for O_DIRECT */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>
#include <fcntl.h>
#include <liburing.h>

int main(int argc, char *argv[])
{
	struct io_uring ring;
	int i, fd, ret;
	struct io_uring_sqe *sqe;
	struct io_uring_cqe *cqe;
	struct iovec *iovecs;
	void *buf;
	struct io_uring_params p;

	if (argc < 2) {
		printf("%s: file\n", argv[0]);
		return 1;
	}

	memset(&p, 0, sizeof(p));
	p.flags = IORING_SETUP_SQPOLL;
	ret = io_uring_queue_init_params(4, &ring, &p);
	if (ret < 0) {
		fprintf(stderr, "queue_init: %s\n", strerror(-ret));
		return 1;
	}

	fd = open(argv[1], O_RDONLY | O_DIRECT);
	if (fd < 0) {
		perror("open");
		return 1;
	}

	iovecs = calloc(10, sizeof(struct iovec));
	for (i = 0; i < 10; i++) {
		if (posix_memalign(&buf, 4096, 4096))
			return 1;
		iovecs[i].iov_base = buf;
		iovecs[i].iov_len = 4096;
	}

	ret = io_uring_register_files(&ring, &fd, 1);
	if (ret < 0) {
		fprintf(stderr, "%s: register %d\n", __FUNCTION__, ret);
		return ret;
	}

	for (i = 0; i < 10; i++) {
		sqe = io_uring_get_sqe(&ring);
		if (!sqe)
			break;

		io_uring_prep_readv(sqe, 0, &iovecs[i], 1, 0);
		sqe->flags |= IOSQE_FIXED_FILE;

		ret = io_uring_submit(&ring);
		sleep(1);
		printf("submit %d\n", i);
	}

	for (i = 0; i < 10; i++) {
		io_uring_wait_cqe(&ring, &cqe);
		printf("receive: %d\n", i);
		if (cqe->res != 4096) {
			fprintf(stderr, "ret=%d, wanted 4096\n", cqe->res);
			ret = 1;
		}
		io_uring_cqe_seen(&ring, cqe);
	}

	close(fd);
	io_uring_queue_exit(&ring);
	return 0;
}

Run as "sudo ./test testfile"; the command will hang on the tenth request. To fix this bug, when the io sq_thread is woken up, we reset the variable 'ret' to zero.
Suggested-by: Jens Axboe axboe@kernel.dk Signed-off-by: Xiaoguang Wang xiaoguang.wang@linux.alibaba.com Signed-off-by: Jens Axboe axboe@kernel.dk Signed-off-by: yangerkun yangerkun@huawei.com Reviewed-by: zhangyi (F) yi.zhang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- fs/io_uring.c | 1 + 1 file changed, 1 insertion(+)
diff --git a/fs/io_uring.c b/fs/io_uring.c index 3b04ad1c695b..1c99ee5cb2ac 100644 --- a/fs/io_uring.c +++ b/fs/io_uring.c @@ -5968,6 +5968,7 @@ static int io_sq_thread(void *data) finish_wait(&ctx->sqo_wait, &wait);
ctx->rings->sq_flags &= ~IORING_SQ_NEED_WAKEUP; + ret = 0; continue; } finish_wait(&ctx->sqo_wait, &wait);
From: Jens Axboe axboe@kernel.dk
mainline inclusion from mainline-5.8-rc1 commit 904fbcb115c85090484dfdffaf7f461d96fe8e53 category: feature bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27 CVE: NA ---------------------------
The attempt at protecting us from closing the ring itself wasn't really complete, and we actually don't need it. The referencing of the requests themselves, and the references they hold on the ring, ensures that the lifetime of the ring is sane. With the check removed, we can also remove the need to have the close operation fget() the file.
Reported-by: Al Viro viro@zeniv.linux.org.uk Signed-off-by: Jens Axboe axboe@kernel.dk Signed-off-by: yangerkun yangerkun@huawei.com Reviewed-by: zhangyi (F) yi.zhang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- fs/io_uring.c | 10 ++++------ 1 file changed, 4 insertions(+), 6 deletions(-)
diff --git a/fs/io_uring.c b/fs/io_uring.c index 1c99ee5cb2ac..999365bb763b 100644 --- a/fs/io_uring.c +++ b/fs/io_uring.c @@ -789,7 +789,6 @@ static const struct io_op_def io_op_defs[] = { .needs_fs = 1, }, [IORING_OP_CLOSE] = { - .needs_file = 1, .file_table = 1, }, [IORING_OP_FILES_UPDATE] = { @@ -3344,10 +3343,6 @@ static int io_close_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe) return -EBADF;
req->close.fd = READ_ONCE(sqe->fd); - if (req->file->f_op == &io_uring_fops || - req->close.fd == req->ctx->ring_fd) - return -EBADF; - return 0; }
@@ -3379,8 +3374,11 @@ static int io_close(struct io_kiocb *req, bool force_nonblock)
req->close.put_file = NULL; ret = __close_fd_get_file(req->close.fd, &req->close.put_file); - if (ret < 0) + if (ret < 0) { + if (ret == -ENOENT) + ret = -EBADF; return ret; + }
/* if the file has a flush method, be safe and punt to async */ if (req->close.put_file->f_op->flush && force_nonblock) {
From: Xiaoguang Wang xiaoguang.wang@linux.alibaba.com
mainline inclusion from mainline-5.8-rc1 commit 7d01bd745a8f52ff2883f661235139ab6e7d23e6 category: feature bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27 CVE: NA ---------------------------
The "struct io_submit_state *state" parameter is not used, remove it.
Signed-off-by: Xiaoguang Wang xiaoguang.wang@linux.alibaba.com Signed-off-by: Jens Axboe axboe@kernel.dk Signed-off-by: yangerkun yangerkun@huawei.com Reviewed-by: zhangyi (F) yi.zhang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- fs/io_uring.c | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-)
diff --git a/fs/io_uring.c b/fs/io_uring.c index 999365bb763b..bc6d5c03c3c9 100644 --- a/fs/io_uring.c +++ b/fs/io_uring.c @@ -5572,7 +5572,7 @@ static inline void io_queue_link_head(struct io_kiocb *req) }
static int io_submit_sqe(struct io_kiocb *req, const struct io_uring_sqe *sqe, - struct io_submit_state *state, struct io_kiocb **link) + struct io_kiocb **link) { struct io_ring_ctx *ctx = req->ctx; int ret; @@ -5836,7 +5836,7 @@ static int io_submit_sqes(struct io_ring_ctx *ctx, unsigned int nr,
trace_io_uring_submit_sqe(ctx, req->opcode, req->user_data, true, async); - err = io_submit_sqe(req, sqe, statep, &link); + err = io_submit_sqe(req, sqe, &link); if (err) goto fail_req; }
From: Xiaoming Ni nixiaoming@huawei.com
mainline inclusion from mainline-5.8-rc1 commit 8469508951d4a324b2df3b5bad75e99922c3b798 category: feature bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27 CVE: NA ---------------------------
Remove duplicate semicolon at the end of line in io_file_from_index()
Signed-off-by: Xiaoming Ni nixiaoming@huawei.com Signed-off-by: Jens Axboe axboe@kernel.dk Signed-off-by: yangerkun yangerkun@huawei.com Reviewed-by: zhangyi (F) yi.zhang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- fs/io_uring.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/fs/io_uring.c b/fs/io_uring.c index bc6d5c03c3c9..d6b9c7a9e5c4 100644 --- a/fs/io_uring.c +++ b/fs/io_uring.c @@ -5301,7 +5301,7 @@ static inline struct file *io_file_from_index(struct io_ring_ctx *ctx, struct fixed_file_table *table;
table = &ctx->file_data->table[index >> IORING_FILE_TABLE_SHIFT]; - return table->files[index & IORING_FILE_TABLE_MASK];; + return table->files[index & IORING_FILE_TABLE_MASK]; }
static int io_file_get(struct io_submit_state *state, struct io_kiocb *req,
From: Jens Axboe axboe@kernel.dk
mainline inclusion from mainline-5.8-rc1 commit 0f158b4cf20e7983d5b33878a6aad118cfac4f05 category: feature bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27 CVE: NA ---------------------------
We used to have three completions, now we just have two. With only two, let's not allocate them dynamically; just embed them in the ctx and name them appropriately.
Signed-off-by: Jens Axboe axboe@kernel.dk Signed-off-by: yangerkun yangerkun@huawei.com Reviewed-by: zhangyi (F) yi.zhang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- fs/io_uring.c | 26 ++++++++++---------------- 1 file changed, 10 insertions(+), 16 deletions(-)
diff --git a/fs/io_uring.c b/fs/io_uring.c index d6b9c7a9e5c4..1d0350d4611f 100644 --- a/fs/io_uring.c +++ b/fs/io_uring.c @@ -279,8 +279,8 @@ struct io_ring_ctx {
const struct cred *creds;
- /* 0 is for ctx quiesce/reinit/free, 1 is for sqo_thread started */ - struct completion *completions; + struct completion ref_comp; + struct completion sq_thread_comp;
/* if all else fails... */ struct io_kiocb *fallback_req; @@ -882,7 +882,7 @@ static void io_ring_ctx_ref_free(struct percpu_ref *ref) { struct io_ring_ctx *ctx = container_of(ref, struct io_ring_ctx, refs);
- complete(&ctx->completions[0]); + complete(&ctx->ref_comp); }
static struct io_ring_ctx *io_ring_ctx_alloc(struct io_uring_params *p) @@ -898,10 +898,6 @@ static struct io_ring_ctx *io_ring_ctx_alloc(struct io_uring_params *p) if (!ctx->fallback_req) goto err;
- ctx->completions = kmalloc(2 * sizeof(struct completion), GFP_KERNEL); - if (!ctx->completions) - goto err; - /* * Use 5 bits less than the max cq entries, that should give us around * 32 entries per hash list if totally full and uniformly spread. @@ -924,8 +920,8 @@ static struct io_ring_ctx *io_ring_ctx_alloc(struct io_uring_params *p) init_waitqueue_head(&ctx->sqo_wait); init_waitqueue_head(&ctx->cq_wait); INIT_LIST_HEAD(&ctx->cq_overflow_list); - init_completion(&ctx->completions[0]); - init_completion(&ctx->completions[1]); + init_completion(&ctx->ref_comp); + init_completion(&ctx->sq_thread_comp); idr_init(&ctx->io_buffer_idr); idr_init(&ctx->personality_idr); mutex_init(&ctx->uring_lock); @@ -941,7 +937,6 @@ static struct io_ring_ctx *io_ring_ctx_alloc(struct io_uring_params *p) err: if (ctx->fallback_req) kmem_cache_free(req_cachep, ctx->fallback_req); - kfree(ctx->completions); kfree(ctx->cancel_hash); kfree(ctx); return NULL; @@ -5876,7 +5871,7 @@ static int io_sq_thread(void *data) unsigned long timeout; int ret = 0;
- complete(&ctx->completions[1]); + complete(&ctx->sq_thread_comp);
old_fs = get_fs(); set_fs(USER_DS); @@ -6156,7 +6151,7 @@ static int io_sqe_files_unregister(struct io_ring_ctx *ctx) static void io_sq_thread_stop(struct io_ring_ctx *ctx) { if (ctx->sqo_thread) { - wait_for_completion(&ctx->completions[1]); + wait_for_completion(&ctx->sq_thread_comp); /* * The park is a bit of a work-around, without it we get * warning spews on shutdown with SQPOLL set and affinity @@ -7185,7 +7180,6 @@ static void io_ring_ctx_free(struct io_ring_ctx *ctx) ring_pages(ctx->sq_entries, ctx->cq_entries)); free_uid(ctx->user); put_cred(ctx->creds); - kfree(ctx->completions); kfree(ctx->cancel_hash); kmem_cache_free(req_cachep, ctx->fallback_req); kfree(ctx); @@ -7237,7 +7231,7 @@ static void io_ring_exit_work(struct work_struct *work) if (ctx->rings) io_cqring_overflow_flush(ctx, true);
- wait_for_completion(&ctx->completions[0]); + wait_for_completion(&ctx->ref_comp); io_ring_ctx_free(ctx); }
@@ -7936,7 +7930,7 @@ static int __io_uring_register(struct io_ring_ctx *ctx, unsigned opcode, * after we've killed the percpu ref. */ mutex_unlock(&ctx->uring_lock); - ret = wait_for_completion_interruptible(&ctx->completions[0]); + ret = wait_for_completion_interruptible(&ctx->ref_comp); mutex_lock(&ctx->uring_lock); if (ret) { percpu_ref_resurrect(&ctx->refs); @@ -8013,7 +8007,7 @@ static int __io_uring_register(struct io_ring_ctx *ctx, unsigned opcode, /* bring the ctx back to life */ percpu_ref_reinit(&ctx->refs); out: - reinit_completion(&ctx->completions[0]); + reinit_completion(&ctx->ref_comp); } return ret; }
From: Jens Axboe axboe@kernel.dk
mainline inclusion from mainline-5.8-rc1 commit 4a38aed2a0a729ccecd84dca5b76d827b9e1294d category: feature bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27 CVE: NA ---------------------------
We currently embed and queue a work item per fixed_file_ref_node that we update, but if the workload does a lot of these, then the associated kworker-events overhead can become quite noticeable.
Since we rarely need to wait on these, batch them at 1 second intervals instead. If we do need to wait for them, we just flush the pending delayed work.
Signed-off-by: Jens Axboe axboe@kernel.dk Signed-off-by: yangerkun yangerkun@huawei.com Reviewed-by: zhangyi (F) yi.zhang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- fs/io_uring.c | 54 +++++++++++++++++++++++++++++++++++++++------------ 1 file changed, 42 insertions(+), 12 deletions(-)
diff --git a/fs/io_uring.c b/fs/io_uring.c index 1d0350d4611f..923458a45360 100644 --- a/fs/io_uring.c +++ b/fs/io_uring.c @@ -191,7 +191,7 @@ struct fixed_file_ref_node { struct list_head node; struct list_head file_list; struct fixed_file_data *file_data; - struct work_struct work; + struct llist_node llist; };
struct fixed_file_data { @@ -327,6 +327,9 @@ struct io_ring_ctx { struct list_head inflight_list; } ____cacheline_aligned_in_smp;
+ struct delayed_work file_put_work; + struct llist_head file_put_llist; + struct work_struct exit_work; };
@@ -878,6 +881,8 @@ struct sock *io_uring_get_socket(struct file *file) } EXPORT_SYMBOL(io_uring_get_socket);
+static void io_file_put_work(struct work_struct *work); + static void io_ring_ctx_ref_free(struct percpu_ref *ref) { struct io_ring_ctx *ctx = container_of(ref, struct io_ring_ctx, refs); @@ -933,6 +938,8 @@ static struct io_ring_ctx *io_ring_ctx_alloc(struct io_uring_params *p) init_waitqueue_head(&ctx->inflight_wait); spin_lock_init(&ctx->inflight_lock); INIT_LIST_HEAD(&ctx->inflight_list); + INIT_DELAYED_WORK(&ctx->file_put_work, io_file_put_work); + init_llist_head(&ctx->file_put_llist); return ctx; err: if (ctx->fallback_req) @@ -6134,6 +6141,7 @@ static int io_sqe_files_unregister(struct io_ring_ctx *ctx) percpu_ref_kill(&data->refs);
/* wait for all refs nodes to complete */ + flush_delayed_work(&ctx->file_put_work); wait_for_completion(&data->done);
__io_sqe_files_unregister(ctx); @@ -6364,18 +6372,13 @@ struct io_file_put { struct file *file; };
-static void io_file_put_work(struct work_struct *work) +static void __io_file_put_work(struct fixed_file_ref_node *ref_node) { - struct fixed_file_ref_node *ref_node; - struct fixed_file_data *file_data; - struct io_ring_ctx *ctx; + struct fixed_file_data *file_data = ref_node->file_data; + struct io_ring_ctx *ctx = file_data->ctx; struct io_file_put *pfile, *tmp; unsigned long flags;
- ref_node = container_of(work, struct fixed_file_ref_node, work); - file_data = ref_node->file_data; - ctx = file_data->ctx; - list_for_each_entry_safe(pfile, tmp, &ref_node->file_list, list) { list_del_init(&pfile->list); io_ring_file_put(ctx, pfile->file); @@ -6391,13 +6394,42 @@ static void io_file_put_work(struct work_struct *work) percpu_ref_put(&file_data->refs); }
+static void io_file_put_work(struct work_struct *work) +{ + struct io_ring_ctx *ctx; + struct llist_node *node; + + ctx = container_of(work, struct io_ring_ctx, file_put_work.work); + node = llist_del_all(&ctx->file_put_llist); + + while (node) { + struct fixed_file_ref_node *ref_node; + struct llist_node *next = node->next; + + ref_node = llist_entry(node, struct fixed_file_ref_node, llist); + __io_file_put_work(ref_node); + node = next; + } +} + static void io_file_data_ref_zero(struct percpu_ref *ref) { struct fixed_file_ref_node *ref_node; + struct io_ring_ctx *ctx; + bool first_add; + int delay = HZ;
ref_node = container_of(ref, struct fixed_file_ref_node, refs); + ctx = ref_node->file_data->ctx;
- queue_work(system_wq, &ref_node->work); + if (percpu_ref_is_dying(&ctx->file_data->refs)) + delay = 0; + + first_add = llist_add(&ref_node->llist, &ctx->file_put_llist); + if (!delay) + mod_delayed_work(system_wq, &ctx->file_put_work, 0); + else if (first_add) + queue_delayed_work(system_wq, &ctx->file_put_work, delay); }
static struct fixed_file_ref_node *alloc_fixed_file_ref_node( @@ -6416,10 +6448,8 @@ static struct fixed_file_ref_node *alloc_fixed_file_ref_node( } INIT_LIST_HEAD(&ref_node->node); INIT_LIST_HEAD(&ref_node->file_list); - INIT_WORK(&ref_node->work, io_file_put_work); ref_node->file_data = ctx->file_data; return ref_node; - }
static void destroy_fixed_file_ref_node(struct fixed_file_ref_node *ref_node)
From: Jens Axboe axboe@kernel.dk
mainline inclusion from mainline-5.8-rc1 commit 18bceab101adde8f38de76016bc77f3f25cf22f4 category: feature bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27 CVE: NA ---------------------------
Some file descriptors use separate waitqueues for their f_ops->poll() handler, most commonly one for read and one for write. The io_uring poll implementation doesn't work with that, as the 2nd poll_wait() call will cause the io_uring poll request to fail with -EINVAL.
This affects (at least) tty devices and /dev/random as well. This is a big problem for event loops where some file descriptors work, and others don't.
With this fix, io_uring handles multiple waitqueues.
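A quick way to observe the behavior from userspace is a plain poll request on a tty-backed fd (a sketch assuming liburing; on pre-fix kernels the request completes with -EINVAL, afterwards with the ready poll mask):

#include <poll.h>
#include <stdio.h>
#include <liburing.h>

int main(void)
{
	struct io_uring ring;
	struct io_uring_sqe *sqe;
	struct io_uring_cqe *cqe;

	io_uring_queue_init(4, &ring, 0);
	sqe = io_uring_get_sqe(&ring);
	io_uring_prep_poll_add(sqe, 0 /* stdin, often a tty */, POLLIN);
	io_uring_submit(&ring);
	io_uring_wait_cqe(&ring, &cqe);
	/* pre-fix kernels: cqe->res == -EINVAL; fixed: POLLIN mask */
	printf("poll result: %d\n", cqe->res);
	io_uring_cqe_seen(&ring, cqe);
	io_uring_queue_exit(&ring);
	return 0;
}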
Signed-off-by: Jens Axboe axboe@kernel.dk Signed-off-by: yangerkun yangerkun@huawei.com Reviewed-by: zhangyi (F) yi.zhang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- fs/io_uring.c | 218 +++++++++++++++++++++++++++++++++----------------- 1 file changed, 146 insertions(+), 72 deletions(-)
diff --git a/fs/io_uring.c b/fs/io_uring.c index 923458a45360..278ac42b269e 100644 --- a/fs/io_uring.c +++ b/fs/io_uring.c @@ -4044,27 +4044,6 @@ struct io_poll_table { int error; };
-static void __io_queue_proc(struct io_poll_iocb *poll, struct io_poll_table *pt, - struct wait_queue_head *head) -{ - if (unlikely(poll->head)) { - pt->error = -EINVAL; - return; - } - - pt->error = 0; - poll->head = head; - add_wait_queue(head, &poll->wait); -} - -static void io_async_queue_proc(struct file *file, struct wait_queue_head *head, - struct poll_table_struct *p) -{ - struct io_poll_table *pt = container_of(p, struct io_poll_table, pt); - - __io_queue_proc(&pt->req->apoll->poll, pt, head); -} - static int __io_async_wake(struct io_kiocb *req, struct io_poll_iocb *poll, __poll_t mask, task_work_func_t func) { @@ -4118,6 +4097,144 @@ static bool io_poll_rewait(struct io_kiocb *req, struct io_poll_iocb *poll) return false; }
+static void io_poll_remove_double(struct io_kiocb *req) +{ + struct io_poll_iocb *poll = (struct io_poll_iocb *) req->io; + + lockdep_assert_held(&req->ctx->completion_lock); + + if (poll && poll->head) { + struct wait_queue_head *head = poll->head; + + spin_lock(&head->lock); + list_del_init(&poll->wait.entry); + if (poll->wait.private) + refcount_dec(&req->refs); + poll->head = NULL; + spin_unlock(&head->lock); + } +} + +static void io_poll_complete(struct io_kiocb *req, __poll_t mask, int error) +{ + struct io_ring_ctx *ctx = req->ctx; + + io_poll_remove_double(req); + req->poll.done = true; + io_cqring_fill_event(req, error ? error : mangle_poll(mask)); + io_commit_cqring(ctx); +} + +static void io_poll_task_handler(struct io_kiocb *req, struct io_kiocb **nxt) +{ + struct io_ring_ctx *ctx = req->ctx; + + if (io_poll_rewait(req, &req->poll)) { + spin_unlock_irq(&ctx->completion_lock); + return; + } + + hash_del(&req->hash_node); + io_poll_complete(req, req->result, 0); + req->flags |= REQ_F_COMP_LOCKED; + io_put_req_find_next(req, nxt); + spin_unlock_irq(&ctx->completion_lock); + + io_cqring_ev_posted(ctx); +} + +static void io_poll_task_func(struct callback_head *cb) +{ + struct io_kiocb *req = container_of(cb, struct io_kiocb, task_work); + struct io_kiocb *nxt = NULL; + + io_poll_task_handler(req, &nxt); + if (nxt) { + struct io_ring_ctx *ctx = nxt->ctx; + + mutex_lock(&ctx->uring_lock); + __io_queue_sqe(nxt, NULL); + mutex_unlock(&ctx->uring_lock); + } +} + +static int io_poll_double_wake(struct wait_queue_entry *wait, unsigned mode, + int sync, void *key) +{ + struct io_kiocb *req = wait->private; + struct io_poll_iocb *poll = (struct io_poll_iocb *) req->io; + __poll_t mask = key_to_poll(key); + + /* for instances that support it check for an event match first: */ + if (mask && !(mask & poll->events)) + return 0; + + if (req->poll.head) { + bool done; + + spin_lock(&req->poll.head->lock); + done = list_empty(&req->poll.wait.entry); + if (!done) + list_del_init(&req->poll.wait.entry); + spin_unlock(&req->poll.head->lock); + if (!done) + __io_async_wake(req, poll, mask, io_poll_task_func); + } + refcount_dec(&req->refs); + return 1; +} + +static void io_init_poll_iocb(struct io_poll_iocb *poll, __poll_t events, + wait_queue_func_t wake_func) +{ + poll->head = NULL; + poll->done = false; + poll->canceled = false; + poll->events = events; + INIT_LIST_HEAD(&poll->wait.entry); + init_waitqueue_func_entry(&poll->wait, wake_func); +} + +static void __io_queue_proc(struct io_poll_iocb *poll, struct io_poll_table *pt, + struct wait_queue_head *head) +{ + struct io_kiocb *req = pt->req; + + /* + * If poll->head is already set, it's because the file being polled + * uses multiple waitqueues for poll handling (eg one for read, one + * for write). Setup a separate io_poll_iocb if this happens. 
+ */ + if (unlikely(poll->head)) { + /* already have a 2nd entry, fail a third attempt */ + if (req->io) { + pt->error = -EINVAL; + return; + } + poll = kmalloc(sizeof(*poll), GFP_ATOMIC); + if (!poll) { + pt->error = -ENOMEM; + return; + } + io_init_poll_iocb(poll, req->poll.events, io_poll_double_wake); + refcount_inc(&req->refs); + poll->wait.private = req; + req->io = (void *) poll; + } + + pt->error = 0; + poll->head = head; + add_wait_queue(head, &poll->wait); +} + +static void io_async_queue_proc(struct file *file, struct wait_queue_head *head, + struct poll_table_struct *p) +{ + struct io_poll_table *pt = container_of(p, struct io_poll_table, pt); + + __io_queue_proc(&pt->req->apoll->poll, pt, head); +} + static void io_async_task_func(struct callback_head *cb) { struct io_kiocb *req = container_of(cb, struct io_kiocb, task_work); @@ -4193,18 +4310,13 @@ static __poll_t __io_arm_poll_handler(struct io_kiocb *req, bool cancel = false;
poll->file = req->file; - poll->head = NULL; - poll->done = poll->canceled = false; - poll->events = mask; + io_init_poll_iocb(poll, mask, wake_func); + poll->wait.private = req;
ipt->pt._key = mask; ipt->req = req; ipt->error = -EINVAL;
- INIT_LIST_HEAD(&poll->wait.entry); - init_waitqueue_func_entry(&poll->wait, wake_func); - poll->wait.private = req; - mask = vfs_poll(req->file, &ipt->pt) & poll->events;
spin_lock_irq(&ctx->completion_lock); @@ -4235,6 +4347,7 @@ static bool io_arm_poll_handler(struct io_kiocb *req) struct async_poll *apoll; struct io_poll_table ipt; __poll_t mask, ret; + bool had_io;
if (!req->file || !file_can_poll(req->file)) return false; @@ -4249,6 +4362,7 @@ static bool io_arm_poll_handler(struct io_kiocb *req)
req->flags |= REQ_F_POLLED; memcpy(&apoll->work, &req->work, sizeof(req->work)); + had_io = req->io != NULL;
get_task_struct(current); req->task = current; @@ -4268,7 +4382,9 @@ static bool io_arm_poll_handler(struct io_kiocb *req) io_async_wake); if (ret) { ipt.error = 0; - apoll->poll.done = true; + /* only remove double add if we did it here */ + if (!had_io) + io_poll_remove_double(req); spin_unlock_irq(&ctx->completion_lock); memcpy(&req->work, &apoll->work, sizeof(req->work)); kfree(apoll); @@ -4301,6 +4417,7 @@ static bool io_poll_remove_one(struct io_kiocb *req) bool do_complete;
if (req->opcode == IORING_OP_POLL_ADD) { + io_poll_remove_double(req); do_complete = __io_poll_remove_one(req, &req->poll); } else { apoll = req->apoll; @@ -4402,49 +4519,6 @@ static int io_poll_remove(struct io_kiocb *req) return 0; }
-static void io_poll_complete(struct io_kiocb *req, __poll_t mask, int error) -{ - struct io_ring_ctx *ctx = req->ctx; - - req->poll.done = true; - io_cqring_fill_event(req, error ? error : mangle_poll(mask)); - io_commit_cqring(ctx); -} - -static void io_poll_task_handler(struct io_kiocb *req, struct io_kiocb **nxt) -{ - struct io_ring_ctx *ctx = req->ctx; - struct io_poll_iocb *poll = &req->poll; - - if (io_poll_rewait(req, poll)) { - spin_unlock_irq(&ctx->completion_lock); - return; - } - - hash_del(&req->hash_node); - io_poll_complete(req, req->result, 0); - req->flags |= REQ_F_COMP_LOCKED; - io_put_req_find_next(req, nxt); - spin_unlock_irq(&ctx->completion_lock); - - io_cqring_ev_posted(ctx); -} - -static void io_poll_task_func(struct callback_head *cb) -{ - struct io_kiocb *req = container_of(cb, struct io_kiocb, task_work); - struct io_kiocb *nxt = NULL; - - io_poll_task_handler(req, &nxt); - if (nxt) { - struct io_ring_ctx *ctx = nxt->ctx; - - mutex_lock(&ctx->uring_lock); - __io_queue_sqe(nxt, NULL); - mutex_unlock(&ctx->uring_lock); - } -} - static int io_poll_wake(struct wait_queue_entry *wait, unsigned mode, int sync, void *key) {
From: Stefano Garzarella sgarzare@redhat.com
mainline inclusion from mainline-5.8-rc1 commit 0d9b5b3af134cddfdc1dd31d41946a0ad389bbf2 category: feature bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27 CVE: NA ---------------------------
This patch adds the new 'cq_flags' field that should be written by the application and read by the kernel.
This new field is available to the userspace application through 'cq_off.flags'. We are using 4 bytes previously reserved and set to zero. This means that if the application finds this field set to zero, the new functionality is not supported.
In the next patch we will introduce the first flag available.
Signed-off-by: Stefano Garzarella sgarzare@redhat.com Signed-off-by: Jens Axboe axboe@kernel.dk Signed-off-by: yangerkun yangerkun@huawei.com Reviewed-by: zhangyi (F) yi.zhang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- fs/io_uring.c | 10 +++++++++- include/uapi/linux/io_uring.h | 4 +++- 2 files changed, 12 insertions(+), 2 deletions(-)
diff --git a/fs/io_uring.c b/fs/io_uring.c index 278ac42b269e..dfb1a9e9a9b9 100644 --- a/fs/io_uring.c +++ b/fs/io_uring.c @@ -142,7 +142,7 @@ struct io_rings { */ u32 sq_dropped; /* - * Runtime flags + * Runtime SQ flags * * Written by the kernel, shouldn't be modified by the * application. @@ -151,6 +151,13 @@ struct io_rings { * for IORING_SQ_NEED_WAKEUP after updating the sq tail. */ u32 sq_flags; + /* + * Runtime CQ flags + * + * Written by the application, shouldn't be modified by the + * kernel. + */ + u32 cq_flags; /* * Number of completion events lost because the queue was full; * this should be avoided by the application by making sure @@ -7874,6 +7881,7 @@ static int io_uring_create(unsigned entries, struct io_uring_params *p, p->cq_off.ring_entries = offsetof(struct io_rings, cq_ring_entries); p->cq_off.overflow = offsetof(struct io_rings, cq_overflow); p->cq_off.cqes = offsetof(struct io_rings, cqes); + p->cq_off.flags = offsetof(struct io_rings, cq_flags);
p->features = IORING_FEAT_SINGLE_MMAP | IORING_FEAT_NODROP | IORING_FEAT_SUBMIT_STABLE | IORING_FEAT_RW_CUR_POS | diff --git a/include/uapi/linux/io_uring.h b/include/uapi/linux/io_uring.h index 6e35b534c4b8..94e3359249ab 100644 --- a/include/uapi/linux/io_uring.h +++ b/include/uapi/linux/io_uring.h @@ -203,7 +203,9 @@ struct io_cqring_offsets { __u32 ring_entries; __u32 overflow; __u32 cqes; - __u64 resv[2]; + __u32 flags; + __u32 resv1; + __u64 resv2; };
/*
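For illustration, a minimal userspace sketch (not part of the patch, and assuming uapi headers that already carry the new cq_off.flags field): map the CQ ring after io_uring_setup() and locate cq_flags. A zero offset means the kernel predates cq_flags, matching the "reserved and set to zero" rule described above.

#include <linux/io_uring.h>
#include <stddef.h>
#include <sys/mman.h>

static unsigned *map_cq_flags(int ring_fd, const struct io_uring_params *p)
{
	size_t sz = p->cq_off.cqes + p->cq_entries * sizeof(struct io_uring_cqe);
	void *cq_ring = mmap(NULL, sz, PROT_READ | PROT_WRITE,
			     MAP_SHARED | MAP_POPULATE, ring_fd,
			     IORING_OFF_CQ_RING);

	if (cq_ring == MAP_FAILED || !p->cq_off.flags)
		return NULL;	/* no cq_flags support on this kernel */
	return (unsigned *)((char *)cq_ring + p->cq_off.flags);
}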
From: Stefano Garzarella sgarzare@redhat.com
mainline inclusion from mainline-5.8-rc1 commit 7e55a19cf6e70ce08964b46dbbfbdb07fbc995fc category: feature bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27 CVE: NA ---------------------------
This new flag should be set/cleared by the application to disable/enable eventfd notifications when a request is completed and queued to the CQ ring.
Before this patch, notifications were always sent if an eventfd was registered, so IORING_CQ_EVENTFD_DISABLED is not set during initialization.
It will be up to the application to set the flag after initialization if no notifications are required at the beginning.
Signed-off-by: Stefano Garzarella sgarzare@redhat.com Signed-off-by: Jens Axboe axboe@kernel.dk Signed-off-by: yangerkun yangerkun@huawei.com Reviewed-by: zhangyi (F) yi.zhang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- fs/io_uring.c | 2 ++ include/uapi/linux/io_uring.h | 7 +++++++ 2 files changed, 9 insertions(+)
diff --git a/fs/io_uring.c b/fs/io_uring.c index dfb1a9e9a9b9..e0dd69d9d0c6 100644 --- a/fs/io_uring.c +++ b/fs/io_uring.c @@ -1152,6 +1152,8 @@ static inline bool io_should_trigger_evfd(struct io_ring_ctx *ctx) { if (!ctx->cq_ev_fd) return false; + if (READ_ONCE(ctx->rings->cq_flags) & IORING_CQ_EVENTFD_DISABLED) + return false; if (!ctx->eventfd_async) return true; return io_wq_current_is_worker(); diff --git a/include/uapi/linux/io_uring.h b/include/uapi/linux/io_uring.h index 94e3359249ab..15aed20c6789 100644 --- a/include/uapi/linux/io_uring.h +++ b/include/uapi/linux/io_uring.h @@ -208,6 +208,13 @@ struct io_cqring_offsets { __u64 resv2; };
+/* + * cq_ring->flags + */ + +/* disable eventfd notifications */ +#define IORING_CQ_EVENTFD_DISABLED (1U << 0) + /* * io_uring_enter(2) flags */
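As a hedged userspace sketch of how an application might toggle the new flag: 'cq_flags' is assumed to be the pointer obtained via cq_off.flags (for example with the helper sketched after the previous patch), and IORING_CQ_EVENTFD_DISABLED comes from the uapi header. The field is application-written and kernel-read (via READ_ONCE on the kernel side), so a plain volatile store suffices in userspace.

#include <linux/io_uring.h>

/* Sketch: suppress or re-enable eventfd notifications at runtime */
static void set_evfd_notifications(unsigned *cq_flags, int enable)
{
	unsigned flags = *(volatile unsigned *)cq_flags;

	if (enable)
		flags &= ~IORING_CQ_EVENTFD_DISABLED;
	else
		flags |= IORING_CQ_EVENTFD_DISABLED;
	*(volatile unsigned *)cq_flags = flags;
}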
From: Jens Axboe axboe@kernel.dk
mainline inclusion from mainline-5.8-rc1 commit 6a4d07cde5778174a35ffc445c1d1388479563ee category: feature bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27 CVE: NA ---------------------------
There's no point in using list_del_init() on entries that are going away, and the associated lock is always used in process context, so let's not use the IRQ disabling+saving variant of the spinlock.
Signed-off-by: Jens Axboe axboe@kernel.dk Signed-off-by: yangerkun yangerkun@huawei.com Reviewed-by: zhangyi (F) yi.zhang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- fs/io_uring.c | 24 ++++++++++-------------- 1 file changed, 10 insertions(+), 14 deletions(-)
diff --git a/fs/io_uring.c b/fs/io_uring.c index e0dd69d9d0c6..7b89fbe3cfa8 100644 --- a/fs/io_uring.c +++ b/fs/io_uring.c @@ -6208,16 +6208,15 @@ static int io_sqe_files_unregister(struct io_ring_ctx *ctx) struct fixed_file_data *data = ctx->file_data; struct fixed_file_ref_node *ref_node = NULL; unsigned nr_tables, i; - unsigned long flags;
if (!data) return -ENXIO;
- spin_lock_irqsave(&data->lock, flags); + spin_lock(&data->lock); if (!list_empty(&data->ref_list)) ref_node = list_first_entry(&data->ref_list, struct fixed_file_ref_node, node); - spin_unlock_irqrestore(&data->lock, flags); + spin_unlock(&data->lock); if (ref_node) percpu_ref_kill(&ref_node->refs);
@@ -6460,17 +6459,16 @@ static void __io_file_put_work(struct fixed_file_ref_node *ref_node) struct fixed_file_data *file_data = ref_node->file_data; struct io_ring_ctx *ctx = file_data->ctx; struct io_file_put *pfile, *tmp; - unsigned long flags;
list_for_each_entry_safe(pfile, tmp, &ref_node->file_list, list) { - list_del_init(&pfile->list); + list_del(&pfile->list); io_ring_file_put(ctx, pfile->file); kfree(pfile); }
- spin_lock_irqsave(&file_data->lock, flags); - list_del_init(&ref_node->node); - spin_unlock_irqrestore(&file_data->lock, flags); + spin_lock(&file_data->lock); + list_del(&ref_node->node); + spin_unlock(&file_data->lock);
percpu_ref_exit(&ref_node->refs); kfree(ref_node); @@ -6550,7 +6548,6 @@ static int io_sqe_files_register(struct io_ring_ctx *ctx, void __user *arg, int fd, ret = 0; unsigned i; struct fixed_file_ref_node *ref_node; - unsigned long flags;
if (ctx->file_data) return -EBUSY; @@ -6658,9 +6655,9 @@ static int io_sqe_files_register(struct io_ring_ctx *ctx, void __user *arg, }
ctx->file_data->cur_refs = &ref_node->refs; - spin_lock_irqsave(&ctx->file_data->lock, flags); + spin_lock(&ctx->file_data->lock); list_add(&ref_node->node, &ctx->file_data->ref_list); - spin_unlock_irqrestore(&ctx->file_data->lock, flags); + spin_unlock(&ctx->file_data->lock); percpu_ref_get(&ctx->file_data->refs); return ret; } @@ -6736,7 +6733,6 @@ static int __io_sqe_files_update(struct io_ring_ctx *ctx, __s32 __user *fds; int fd, i, err; __u32 done; - unsigned long flags; bool needs_switch = false;
if (check_add_overflow(up->offset, nr_args, &done)) @@ -6801,10 +6797,10 @@ static int __io_sqe_files_update(struct io_ring_ctx *ctx,
if (needs_switch) { percpu_ref_kill(data->cur_refs); - spin_lock_irqsave(&data->lock, flags); + spin_lock(&data->lock); list_add(&ref_node->node, &data->ref_list); data->cur_refs = &ref_node->refs; - spin_unlock_irqrestore(&data->lock, flags); + spin_unlock(&data->lock); percpu_ref_get(&ctx->file_data->refs); } else destroy_fixed_file_ref_node(ref_node);
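As a generic illustration of the locking rule this patch applies (a sketch, not code from the patch): the irqsave variant is only needed when a lock can also be taken from hard-irq context, and list_del_init() only pays off when the entry may be looked at again.

#include <linux/list.h>
#include <linux/spinlock.h>

/* Sketch: a process-context-only lock takes the plain variant, and an
 * entry freed right after unlinking needs no re-initialization. */
static void unlink_node(spinlock_t *lock, struct list_head *node)
{
	spin_lock(lock);	/* never taken from IRQ context */
	list_del(node);		/* entry is going away, no list_del_init() */
	spin_unlock(lock);
}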
From: Jens Axboe axboe@kernel.dk
mainline inclusion from mainline-5.8-rc1 commit 3bfa5bcb26f0b52d7ae8416aa0618fff21aceaaf category: feature bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27 CVE: NA ---------------------------
We only need apoll in the one section; do the juggling with the work restoration there. This removes a special case further down as well.
No functional changes in this patch.
Signed-off-by: Jens Axboe axboe@kernel.dk Signed-off-by: yangerkun yangerkun@huawei.com Reviewed-by: zhangyi (F) yi.zhang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- fs/io_uring.c | 27 +++++++++++++-------------- 1 file changed, 13 insertions(+), 14 deletions(-)
diff --git a/fs/io_uring.c b/fs/io_uring.c index 7b89fbe3cfa8..a37d14aed0a1 100644 --- a/fs/io_uring.c +++ b/fs/io_uring.c @@ -4417,33 +4417,32 @@ static bool __io_poll_remove_one(struct io_kiocb *req, do_complete = true; } spin_unlock(&poll->head->lock); + hash_del(&req->hash_node); return do_complete; }
static bool io_poll_remove_one(struct io_kiocb *req) { - struct async_poll *apoll = NULL; bool do_complete;
if (req->opcode == IORING_OP_POLL_ADD) { io_poll_remove_double(req); do_complete = __io_poll_remove_one(req, &req->poll); } else { - apoll = req->apoll; + struct async_poll *apoll = req->apoll; + /* non-poll requests have submit ref still */ - do_complete = __io_poll_remove_one(req, &req->apoll->poll); - if (do_complete) + do_complete = __io_poll_remove_one(req, &apoll->poll); + if (do_complete) { io_put_req(req); - } - - hash_del(&req->hash_node); - - if (do_complete && apoll) { - /* - * restore ->work because we need to call io_req_work_drop_env. - */ - memcpy(&req->work, &apoll->work, sizeof(req->work)); - kfree(apoll); + /* + * restore ->work because we will call + * io_req_work_drop_env below when dropping the + * final reference. + */ + memcpy(&req->work, &apoll->work, sizeof(req->work)); + kfree(apoll); + } }
if (do_complete) {
From: Pavel Begunkov asml.silence@gmail.com
mainline inclusion from mainline-5.8-rc1 commit 0cdaf760f42eb8e8a714c1cc017423e5da6d4936 category: feature bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27 CVE: NA ---------------------------
A submission is "async" IFF it's done by the SQPOLL thread. Instead of passing an @async flag into io_submit_sqes(), deduce it from ctx->flags.
Signed-off-by: Pavel Begunkov asml.silence@gmail.com Signed-off-by: Jens Axboe axboe@kernel.dk Signed-off-by: yangerkun yangerkun@huawei.com Reviewed-by: zhangyi (F) yi.zhang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- fs/io_uring.c | 21 ++++++++++++--------- 1 file changed, 12 insertions(+), 9 deletions(-)
diff --git a/fs/io_uring.c b/fs/io_uring.c index a37d14aed0a1..dbb75517ef37 100644 --- a/fs/io_uring.c +++ b/fs/io_uring.c @@ -627,7 +627,6 @@ struct io_kiocb {
struct io_async_ctx *io; int cflags; - bool needs_fixed_file; u8 opcode;
u16 buf_index; @@ -890,6 +889,11 @@ EXPORT_SYMBOL(io_uring_get_socket);
static void io_file_put_work(struct work_struct *work);
+static inline bool io_async_submit(struct io_ring_ctx *ctx) +{ + return ctx->flags & IORING_SETUP_SQPOLL; +} + static void io_ring_ctx_ref_free(struct percpu_ref *ref) { struct io_ring_ctx *ctx = container_of(ref, struct io_ring_ctx, refs); @@ -5421,7 +5425,7 @@ static int io_req_set_file(struct io_submit_state *state, struct io_kiocb *req, bool fixed;
fixed = (req->flags & REQ_F_FIXED_FILE) != 0; - if (unlikely(!fixed && req->needs_fixed_file)) + if (unlikely(!fixed && io_async_submit(req->ctx))) return -EBADF;
return io_file_get(state, req, fd, &req->file, fixed); @@ -5800,7 +5804,7 @@ static inline void io_consume_sqe(struct io_ring_ctx *ctx)
static int io_init_req(struct io_ring_ctx *ctx, struct io_kiocb *req, const struct io_uring_sqe *sqe, - struct io_submit_state *state, bool async) + struct io_submit_state *state) { unsigned int sqe_flags; int id; @@ -5821,7 +5825,6 @@ static int io_init_req(struct io_ring_ctx *ctx, struct io_kiocb *req, refcount_set(&req->refs, 2); req->task = NULL; req->result = 0; - req->needs_fixed_file = async; INIT_IO_WORK(&req->work, io_wq_submit_work);
if (unlikely(req->opcode >= IORING_OP_LAST)) @@ -5862,7 +5865,7 @@ static int io_init_req(struct io_ring_ctx *ctx, struct io_kiocb *req, }
static int io_submit_sqes(struct io_ring_ctx *ctx, unsigned int nr, - struct file *ring_file, int ring_fd, bool async) + struct file *ring_file, int ring_fd) { struct io_submit_state state, *statep = NULL; struct io_kiocb *link = NULL; @@ -5906,7 +5909,7 @@ static int io_submit_sqes(struct io_ring_ctx *ctx, unsigned int nr, break; }
- err = io_init_req(ctx, req, sqe, statep, async); + err = io_init_req(ctx, req, sqe, statep); io_consume_sqe(ctx); /* will complete beyond this point, count as submitted */ submitted++; @@ -5919,7 +5922,7 @@ static int io_submit_sqes(struct io_ring_ctx *ctx, unsigned int nr, }
trace_io_uring_submit_sqe(ctx, req->opcode, req->user_data, - true, async); + true, io_async_submit(ctx)); err = io_submit_sqe(req, sqe, &link); if (err) goto fail_req; @@ -6059,7 +6062,7 @@ static int io_sq_thread(void *data) }
mutex_lock(&ctx->uring_lock); - ret = io_submit_sqes(ctx, to_submit, NULL, -1, true); + ret = io_submit_sqes(ctx, to_submit, NULL, -1); mutex_unlock(&ctx->uring_lock); timeout = jiffies + ctx->sq_thread_idle; } @@ -7567,7 +7570,7 @@ SYSCALL_DEFINE6(io_uring_enter, unsigned int, fd, u32, to_submit, submitted = to_submit; } else if (to_submit) { mutex_lock(&ctx->uring_lock); - submitted = io_submit_sqes(ctx, to_submit, f.file, fd, false); + submitted = io_submit_sqes(ctx, to_submit, f.file, fd); mutex_unlock(&ctx->uring_lock);
if (submitted != to_submit)
From: Pavel Begunkov asml.silence@gmail.com
mainline inclusion from mainline-5.8-rc1 commit 9f13c35b33fddb186beab9ef21c555a01e45f4d7 category: feature bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27 CVE: NA ---------------------------
io_file_put() deals with flushing the state's file refs; adding "state" to its name makes it a bit clearer. Also, avoid a double check of state->file in __io_file_get() in some cases.
Signed-off-by: Pavel Begunkov asml.silence@gmail.com Signed-off-by: Jens Axboe axboe@kernel.dk Signed-off-by: yangerkun yangerkun@huawei.com Reviewed-by: zhangyi (F) yi.zhang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- fs/io_uring.c | 22 +++++++++++++--------- 1 file changed, 13 insertions(+), 9 deletions(-)
diff --git a/fs/io_uring.c b/fs/io_uring.c index dbb75517ef37..97eef11877ab 100644 --- a/fs/io_uring.c +++ b/fs/io_uring.c @@ -1994,15 +1994,19 @@ static void io_iopoll_req_issued(struct io_kiocb *req) wake_up(&ctx->sqo_wait); }
-static void io_file_put(struct io_submit_state *state) +static void __io_state_file_put(struct io_submit_state *state) { - if (state->file) { - int diff = state->has_refs - state->used_refs; + int diff = state->has_refs - state->used_refs;
- if (diff) - fput_many(state->file, diff); - state->file = NULL; - } + if (diff) + fput_many(state->file, diff); + state->file = NULL; +} + +static inline void io_state_file_put(struct io_submit_state *state) +{ + if (state->file) + __io_state_file_put(state); }
/* @@ -2021,7 +2025,7 @@ static struct file *__io_file_get(struct io_submit_state *state, int fd) state->ios_left--; return state->file; } - io_file_put(state); + __io_state_file_put(state); } state->file = fget_many(fd, state->ios_left); if (!state->file) @@ -5733,7 +5737,7 @@ static int io_submit_sqe(struct io_kiocb *req, const struct io_uring_sqe *sqe, static void io_submit_state_end(struct io_submit_state *state) { blk_finish_plug(&state->plug); - io_file_put(state); + io_state_file_put(state); if (state->free_reqs) kmem_cache_free_bulk(req_cachep, state->free_reqs, state->reqs); }
From: Pavel Begunkov asml.silence@gmail.com
mainline inclusion from mainline-5.8-rc1 commit c11368a57be460de889696f6ff8815fbcacf4db2 category: feature bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27 CVE: NA ---------------------------
req->flags stores all sqe->flags. After checking that sqe->flags is a valid set of IOSQE_* flags, there is no need to double check it; just forward them all.
Signed-off-by: Pavel Begunkov asml.silence@gmail.com Signed-off-by: Jens Axboe axboe@kernel.dk Signed-off-by: yangerkun yangerkun@huawei.com Reviewed-by: zhangyi (F) yi.zhang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- fs/io_uring.c | 4 +--- 1 file changed, 1 insertion(+), 3 deletions(-)
diff --git a/fs/io_uring.c b/fs/io_uring.c index 97eef11877ab..2b9678f91395 100644 --- a/fs/io_uring.c +++ b/fs/io_uring.c @@ -5858,9 +5858,7 @@ static int io_init_req(struct io_ring_ctx *ctx, struct io_kiocb *req, }
/* same numerical values with corresponding REQ_F_*, safe to copy */ - req->flags |= sqe_flags & (IOSQE_IO_DRAIN | IOSQE_IO_HARDLINK | - IOSQE_ASYNC | IOSQE_FIXED_FILE | - IOSQE_BUFFER_SELECT | IOSQE_IO_LINK); + req->flags |= sqe_flags;
if (!io_op_defs[req->opcode].needs_file) return 0;
From: Pavel Begunkov asml.silence@gmail.com
mainline inclusion from mainline-5.8-rc1 commit 9dafdfc2f0a3ae551711098de3d7b621a469f11a category: feature bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27 CVE: NA ---------------------------
Export do_tee() for use in io_uring.
Signed-off-by: Pavel Begunkov asml.silence@gmail.com Signed-off-by: Jens Axboe axboe@kernel.dk Signed-off-by: yangerkun yangerkun@huawei.com Reviewed-by: zhangyi (F) yi.zhang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- fs/splice.c | 3 +-- include/linux/splice.h | 3 +++ 2 files changed, 4 insertions(+), 2 deletions(-)
diff --git a/fs/splice.c b/fs/splice.c index f8aa86070b22..230367f3df41 100644 --- a/fs/splice.c +++ b/fs/splice.c @@ -1717,8 +1717,7 @@ static int link_pipe(struct pipe_inode_info *ipipe, * The 'flags' used are the SPLICE_F_* variants, currently the only * applicable one is SPLICE_F_NONBLOCK. */ -static long do_tee(struct file *in, struct file *out, size_t len, - unsigned int flags) +long do_tee(struct file *in, struct file *out, size_t len, unsigned int flags) { struct pipe_inode_info *ipipe = get_pipe_info(in); struct pipe_inode_info *opipe = get_pipe_info(out); diff --git a/include/linux/splice.h b/include/linux/splice.h index ebbbfea48aa0..5c47013f708e 100644 --- a/include/linux/splice.h +++ b/include/linux/splice.h @@ -82,6 +82,9 @@ extern long do_splice(struct file *in, loff_t __user *off_in, struct file *out, loff_t __user *off_out, size_t len, unsigned int flags);
+extern long do_tee(struct file *in, struct file *out, size_t len, + unsigned int flags); + /* * for dynamic pipe sizing */
From: Pavel Begunkov asml.silence@gmail.com
mainline inclusion from mainline-5.8-rc1 commit f2a8d5c7a218b9c24befb756c4eb30aa550ce822 category: feature bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27 CVE: NA ---------------------------
Add IORING_OP_TEE, implementing tee(2) support. It's almost identical to the splice bits, but without offsets.
Signed-off-by: Pavel Begunkov asml.silence@gmail.com Signed-off-by: Jens Axboe axboe@kernel.dk Signed-off-by: yangerkun yangerkun@huawei.com Reviewed-by: zhangyi (F) yi.zhang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- fs/io_uring.c | 62 +++++++++++++++++++++++++++++++++-- include/uapi/linux/io_uring.h | 1 + 2 files changed, 60 insertions(+), 3 deletions(-)
diff --git a/fs/io_uring.c b/fs/io_uring.c index 2b9678f91395..9db2f55082a6 100644 --- a/fs/io_uring.c +++ b/fs/io_uring.c @@ -852,6 +852,11 @@ static const struct io_op_def io_op_defs[] = { }, [IORING_OP_PROVIDE_BUFFERS] = {}, [IORING_OP_REMOVE_BUFFERS] = {}, + [IORING_OP_TEE] = { + .needs_file = 1, + .hash_reg_file = 1, + .unbound_nonreg_file = 1, + }, };
static void io_wq_submit_work(struct io_wq_work **workptr); @@ -2741,7 +2746,8 @@ static int io_write(struct io_kiocb *req, bool force_nonblock) return ret; }
-static int io_splice_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe) +static int __io_splice_prep(struct io_kiocb *req, + const struct io_uring_sqe *sqe) { struct io_splice* sp = &req->splice; unsigned int valid_flags = SPLICE_F_FD_IN_FIXED | SPLICE_F_ALL; @@ -2751,8 +2757,6 @@ static int io_splice_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe) return 0;
sp->file_in = NULL; - sp->off_in = READ_ONCE(sqe->splice_off_in); - sp->off_out = READ_ONCE(sqe->off); sp->len = READ_ONCE(sqe->len); sp->flags = READ_ONCE(sqe->splice_flags);
@@ -2771,6 +2775,46 @@ static int io_splice_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe) return 0; }
+static int io_tee_prep(struct io_kiocb *req, + const struct io_uring_sqe *sqe) +{ + if (READ_ONCE(sqe->splice_off_in) || READ_ONCE(sqe->off)) + return -EINVAL; + return __io_splice_prep(req, sqe); +} + +static int io_tee(struct io_kiocb *req, bool force_nonblock) +{ + struct io_splice *sp = &req->splice; + struct file *in = sp->file_in; + struct file *out = sp->file_out; + unsigned int flags = sp->flags & ~SPLICE_F_FD_IN_FIXED; + long ret = 0; + + if (force_nonblock) + return -EAGAIN; + if (sp->len) + ret = do_tee(in, out, sp->len, flags); + + io_put_file(req, in, (sp->flags & SPLICE_F_FD_IN_FIXED)); + req->flags &= ~REQ_F_NEED_CLEANUP; + + io_cqring_add_event(req, ret); + if (ret != sp->len) + req_set_fail_links(req); + io_put_req(req); + return 0; +} + +static int io_splice_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe) +{ + struct io_splice* sp = &req->splice; + + sp->off_in = READ_ONCE(sqe->splice_off_in); + sp->off_out = READ_ONCE(sqe->off); + return __io_splice_prep(req, sqe); +} + static int io_splice(struct io_kiocb *req, bool force_nonblock) { struct io_splice *sp = &req->splice; @@ -5029,6 +5073,9 @@ static int io_req_defer_prep(struct io_kiocb *req, case IORING_OP_REMOVE_BUFFERS: ret = io_remove_buffers_prep(req, sqe); break; + case IORING_OP_TEE: + ret = io_tee_prep(req, sqe); + break; default: printk_once(KERN_WARNING "io_uring: unhandled opcode %d\n", req->opcode); @@ -5102,6 +5149,7 @@ static void io_cleanup_req(struct io_kiocb *req) putname(req->open.filename); break; case IORING_OP_SPLICE: + case IORING_OP_TEE: io_put_file(req, req->splice.file_in, (req->splice.flags & SPLICE_F_FD_IN_FIXED)); break; @@ -5324,6 +5372,14 @@ static int io_issue_sqe(struct io_kiocb *req, const struct io_uring_sqe *sqe, } ret = io_remove_buffers(req, force_nonblock); break; + case IORING_OP_TEE: + if (sqe) { + ret = io_tee_prep(req, sqe); + if (ret < 0) + break; + } + ret = io_tee(req, force_nonblock); + break; default: ret = -EINVAL; break; diff --git a/include/uapi/linux/io_uring.h b/include/uapi/linux/io_uring.h index 15aed20c6789..9afedee24e5b 100644 --- a/include/uapi/linux/io_uring.h +++ b/include/uapi/linux/io_uring.h @@ -128,6 +128,7 @@ enum { IORING_OP_SPLICE, IORING_OP_PROVIDE_BUFFERS, IORING_OP_REMOVE_BUFFERS, + IORING_OP_TEE,
/* this goes last, obviously */ IORING_OP_LAST,
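For illustration, a hedged userspace sketch of filling an SQE for the new opcode; prep_tee() is a hypothetical helper, and the field usage (fd/splice_fd_in for the pipes, no offsets) mirrors io_tee_prep() above.

#include <linux/io_uring.h>
#include <string.h>

/* Sketch: tee(2) via io_uring. off and splice_off_in must stay zero,
 * otherwise io_tee_prep() fails the request with -EINVAL. */
static void prep_tee(struct io_uring_sqe *sqe, int pipe_in, int pipe_out,
		     unsigned int nbytes)
{
	memset(sqe, 0, sizeof(*sqe));
	sqe->opcode = IORING_OP_TEE;
	sqe->splice_fd_in = pipe_in;	/* source pipe */
	sqe->fd = pipe_out;		/* destination pipe */
	sqe->len = nbytes;
	sqe->splice_flags = 0;		/* SPLICE_F_* flags if needed */
}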
From: Xiaoguang Wang xiaoguang.wang@linux.alibaba.com
mainline inclusion from mainline-5.8-rc1 commit 6b668c9b7fc6fc0c313cdaee8b75d17f4d954ab5 category: feature bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27 CVE: NA ---------------------------
When IORING_SETUP_SQPOLL is enabled, io_ring_ctx_wait_and_kill() will wait for the sq thread to idle by busy looping:
while (ctx->sqo_thread && !wq_has_sleeper(&ctx->sqo_wait)) cond_resched();
The above loop isn't very CPU friendly; it may introduce a short CPU burst on the current CPU.
When ctx->refs is dying, we forbid the sq_thread from submitting any further SQEs; instead they just get discarded when we exit.
Signed-off-by: Xiaoguang Wang xiaoguang.wang@linux.alibaba.com Signed-off-by: Jens Axboe axboe@kernel.dk Signed-off-by: yangerkun yangerkun@huawei.com Reviewed-by: zhangyi (F) yi.zhang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- fs/io_uring.c | 13 ++----------- 1 file changed, 2 insertions(+), 11 deletions(-)
diff --git a/fs/io_uring.c b/fs/io_uring.c index 11decc52f7b7..bd7c862f2d67 100644 --- a/fs/io_uring.c +++ b/fs/io_uring.c @@ -6119,7 +6119,8 @@ static int io_sq_thread(void *data) }
mutex_lock(&ctx->uring_lock); - ret = io_submit_sqes(ctx, to_submit, NULL, -1); + if (likely(!percpu_ref_is_dying(&ctx->refs))) + ret = io_submit_sqes(ctx, to_submit, NULL, -1); mutex_unlock(&ctx->uring_lock); timeout = jiffies + ctx->sq_thread_idle; } @@ -7409,16 +7410,6 @@ static void io_ring_ctx_wait_and_kill(struct io_ring_ctx *ctx) percpu_ref_kill(&ctx->refs); mutex_unlock(&ctx->uring_lock);
- /* - * Wait for sq thread to idle, if we have one. It won't spin on new - * work after we've killed the ctx ref above. This is important to do - * before we cancel existing commands, as the thread could otherwise - * be queueing new work post that. If that's work we need to cancel, - * it could cause shutdown to hang. - */ - while (ctx->sqo_thread && !wq_has_sleeper(&ctx->sqo_wait)) - cond_resched(); - io_kill_timeouts(ctx); io_poll_remove_all(ctx);
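The resulting pattern, as a sketch (submit_sqes() here is a hypothetical stand-in for io_submit_sqes() with its real arguments): the worker checks for a dying ref instead of teardown spinning until the worker goes to sleep.

#include <linux/compiler.h>
#include <linux/percpu-refcount.h>

/* Sketch: new submissions are refused once the context refs are marked
 * dying, so io_ring_ctx_wait_and_kill() no longer needs to busy-wait
 * for the sq thread to go idle. */
static int sq_thread_submit(struct percpu_ref *refs, unsigned int to_submit)
{
	if (unlikely(percpu_ref_is_dying(refs)))
		return 0;			/* discard, ring is exiting */
	return submit_sqes(to_submit);		/* hypothetical helper */
}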
From: Pavel Begunkov asml.silence@gmail.com
mainline inclusion from mainline-5.8-rc1 commit 4518a3cc273cf82efdd36522fb1f13baad173c70 category: feature bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27 CVE: NA ---------------------------
In io_uring_cancel_files(), after refcount_sub_and_test() drops req->refs to 0, it calls io_put_req(), which would put yet another ref. Call io_free_req() instead.
Cc: stable@vger.kernel.org Fixes: 2ca10259b418 ("io_uring: prune request from overflow list on flush") Signed-off-by: Pavel Begunkov asml.silence@gmail.com Signed-off-by: Jens Axboe axboe@kernel.dk Signed-off-by: yangerkun yangerkun@huawei.com Reviewed-by: zhangyi (F) yi.zhang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- fs/io_uring.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/fs/io_uring.c b/fs/io_uring.c index bd7c862f2d67..48965063ea68 100644 --- a/fs/io_uring.c +++ b/fs/io_uring.c @@ -7478,7 +7478,7 @@ static void io_uring_cancel_files(struct io_ring_ctx *ctx, * all we had, then we're done with this request. */ if (refcount_sub_and_test(2, &cancel_req->refs)) { - io_put_req(cancel_req); + io_free_req(cancel_req); finish_wait(&ctx->inflight_wait, &wait); continue; }
From: Pavel Begunkov asml.silence@gmail.com
mainline inclusion from mainline-5.8-rc1 commit 733f5c95e6fdabd05b8dfc15e04512809c9652c2 category: feature bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27 CVE: NA ---------------------------
Move spin_lock_irq() earlier so that io_timeout() has only one call site of it. It makes the flow easier to follow.
Signed-off-by: Pavel Begunkov asml.silence@gmail.com Signed-off-by: Jens Axboe axboe@kernel.dk Signed-off-by: yangerkun yangerkun@huawei.com Reviewed-by: zhangyi (F) yi.zhang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- fs/io_uring.c | 3 +-- 1 file changed, 1 insertion(+), 2 deletions(-)
diff --git a/fs/io_uring.c b/fs/io_uring.c index 48965063ea68..80fc3d7179d7 100644 --- a/fs/io_uring.c +++ b/fs/io_uring.c @@ -4792,6 +4792,7 @@ static int io_timeout(struct io_kiocb *req) u32 seq = req->sequence;
data = &req->io->timeout; + spin_lock_irq(&ctx->completion_lock);
/* * sqe->off holds how many events that need to occur for this @@ -4800,7 +4801,6 @@ static int io_timeout(struct io_kiocb *req) */ if (!count) { req->flags |= REQ_F_TIMEOUT_NOSEQ; - spin_lock_irq(&ctx->completion_lock); entry = ctx->timeout_list.prev; goto add; } @@ -4811,7 +4811,6 @@ static int io_timeout(struct io_kiocb *req) * Insertion sort, ensuring the first entry in the list is always * the one we need first. */ - spin_lock_irq(&ctx->completion_lock); list_for_each_prev(entry, &ctx->timeout_list) { struct io_kiocb *nxt = list_entry(entry, struct io_kiocb, list); unsigned nxt_seq;
From: Pavel Begunkov asml.silence@gmail.com
mainline inclusion from mainline-5.8-rc1 commit 56080b02ed6e71fbc0add2d05a32ed7361dd736a category: feature bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27 CVE: NA ---------------------------
SQEs are user-writable; don't read sqe->off twice in io_timeout_prep().
Signed-off-by: Pavel Begunkov asml.silence@gmail.com Signed-off-by: Jens Axboe axboe@kernel.dk Signed-off-by: yangerkun yangerkun@huawei.com Reviewed-by: zhangyi (F) yi.zhang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- fs/io_uring.c | 5 +++-- 1 file changed, 3 insertions(+), 2 deletions(-)
diff --git a/fs/io_uring.c b/fs/io_uring.c index 80fc3d7179d7..a90a548da824 100644 --- a/fs/io_uring.c +++ b/fs/io_uring.c @@ -4750,18 +4750,19 @@ static int io_timeout_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe, { struct io_timeout_data *data; unsigned flags; + u32 off = READ_ONCE(sqe->off);
if (unlikely(req->ctx->flags & IORING_SETUP_IOPOLL)) return -EINVAL; if (sqe->ioprio || sqe->buf_index || sqe->len != 1) return -EINVAL; - if (sqe->off && is_timeout_link) + if (off && is_timeout_link) return -EINVAL; flags = READ_ONCE(sqe->timeout_flags); if (flags & ~IORING_TIMEOUT_ABS) return -EINVAL;
- req->timeout.count = READ_ONCE(sqe->off); + req->timeout.count = off;
if (!req->io && io_alloc_async_ctx(req)) return -ENOMEM;
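The underlying hazard is a classic double fetch: the SQ ring is mapped user-writable, so a value can change between its validation and its use. A generic sketch of the safe pattern (check_off() is a hypothetical helper, not from the patch):

#include <linux/compiler.h>
#include <linux/errno.h>
#include <linux/types.h>

/* Sketch: read the user-writable field once, then validate and use the
 * same snapshot; a second read could observe a different value. */
static int check_off(const u32 *user_shared, bool is_timeout_link, u32 *out)
{
	u32 off = READ_ONCE(*user_shared);	/* single fetch */

	if (off && is_timeout_link)
		return -EINVAL;
	*out = off;	/* reuse the validated copy */
	return 0;
}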
From: Pavel Begunkov asml.silence@gmail.com
mainline inclusion from mainline-5.8-rc1 commit 0bf0eefdab52d9f9f3a1eeda32a4fc7afe4e9219 category: feature bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27 CVE: NA ---------------------------
io_close() was manually punting to async to skip grabbing the file table. Use REQ_F_NO_FILE_TABLE instead, and pass it through the generic path with -EAGAIN.
Signed-off-by: Pavel Begunkov asml.silence@gmail.com Signed-off-by: Jens Axboe axboe@kernel.dk Signed-off-by: yangerkun yangerkun@huawei.com Reviewed-by: zhangyi (F) yi.zhang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- fs/io_uring.c | 20 +++++--------------- 1 file changed, 5 insertions(+), 15 deletions(-)
diff --git a/fs/io_uring.c b/fs/io_uring.c index cedbf117450a..e50734123350 100644 --- a/fs/io_uring.c +++ b/fs/io_uring.c @@ -3437,25 +3437,15 @@ static int io_close(struct io_kiocb *req, bool force_nonblock)
req->close.put_file = NULL; ret = __close_fd_get_file(req->close.fd, &req->close.put_file); - if (ret < 0) { - if (ret == -ENOENT) - ret = -EBADF; - return ret; - } + if (ret < 0) + return (ret == -ENOENT) ? -EBADF : ret;
/* if the file has a flush method, be safe and punt to async */ if (req->close.put_file->f_op->flush && force_nonblock) { - /* submission ref will be dropped, take it for async */ - refcount_inc(&req->refs); - + /* avoid grabbing files - we don't need the files */ + req->flags |= REQ_F_NO_FILE_TABLE | REQ_F_MUST_PUNT; req->work.func = io_close_finish; - /* - * Do manual async queue here to avoid grabbing files - we don't - * need the files, and it'll cause io_close_finish() to close - * the file again and cause a double CQE entry for this request - */ - io_queue_async_work(req); - return 0; + return -EAGAIN; }
/*
From: Bijan Mottahedeh bijan.mottahedeh@oracle.com
mainline inclusion from mainline-5.8-rc1 commit 1d9e1288039a47dc1189c3c1fed5cf3c215e94b7 category: feature bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27 CVE: NA ---------------------------
Separate statx data from open in io_kiocb. No functional changes.
Signed-off-by: Bijan Mottahedeh bijan.mottahedeh@oracle.com Signed-off-by: Jens Axboe axboe@kernel.dk
Conflicts: fs/io_uring.c [commit c12cedf24e78("io_uring: add 'struct open_how' to the openat request context") is not merged] Signed-off-by: yangerkun yangerkun@huawei.com Reviewed-by: zhangyi (F) yi.zhang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- fs/io_uring.c | 32 ++++++++++++++++++++------------ 1 file changed, 20 insertions(+), 12 deletions(-)
diff --git a/fs/io_uring.c b/fs/io_uring.c index e50734123350..08ee4e0e815f 100644 --- a/fs/io_uring.c +++ b/fs/io_uring.c @@ -427,10 +427,8 @@ struct io_open { int dfd; union { umode_t mode; - unsigned mask; }; struct filename *filename; - struct statx __user *buffer; int flags; unsigned long nofile; }; @@ -482,6 +480,15 @@ struct io_provide_buf { __u16 bid; };
+struct io_statx { + struct file *file; + int dfd; + unsigned int mask; + unsigned int flags; + struct filename *filename; + struct statx __user *buffer; +}; + struct io_async_connect { struct sockaddr_storage address; }; @@ -623,6 +630,7 @@ struct io_kiocb { struct io_epoll epoll; struct io_splice splice; struct io_provide_buf pbuf; + struct io_statx statx; };
struct io_async_ctx *io; @@ -3326,19 +3334,19 @@ static int io_statx_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe) if (req->flags & REQ_F_NEED_CLEANUP) return 0;
- req->open.dfd = READ_ONCE(sqe->fd); - req->open.mask = READ_ONCE(sqe->len); + req->statx.dfd = READ_ONCE(sqe->fd); + req->statx.mask = READ_ONCE(sqe->len); fname = u64_to_user_ptr(READ_ONCE(sqe->addr)); - req->open.buffer = u64_to_user_ptr(READ_ONCE(sqe->addr2)); - req->open.flags = READ_ONCE(sqe->statx_flags); + req->statx.buffer = u64_to_user_ptr(READ_ONCE(sqe->addr2)); + req->statx.flags = READ_ONCE(sqe->statx_flags);
- if (vfs_stat_set_lookup_flags(&lookup_flags, req->open.flags)) + if (vfs_stat_set_lookup_flags(&lookup_flags, req->statx.flags)) return -EINVAL;
- req->open.filename = getname_flags(fname, lookup_flags, NULL); - if (IS_ERR(req->open.filename)) { - ret = PTR_ERR(req->open.filename); - req->open.filename = NULL; + req->statx.filename = getname_flags(fname, lookup_flags, NULL); + if (IS_ERR(req->statx.filename)) { + ret = PTR_ERR(req->statx.filename); + req->statx.filename = NULL; return ret; }
@@ -3348,7 +3356,7 @@ static int io_statx_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe)
static int io_statx(struct io_kiocb *req, bool force_nonblock) { - struct io_open *ctx = &req->open; + struct io_statx *ctx = &req->statx; unsigned lookup_flags; struct path path; struct kstat stat;
From: Bijan Mottahedeh bijan.mottahedeh@oracle.com
mainline inclusion from mainline-5.8-rc1 commit 0018784fc84f636d473a0d2a65a34f9d01893c0a category: feature bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27 CVE: NA ---------------------------
This is a preparatory patch to allow io_uring to invoke statx directly.
Signed-off-by: Bijan Mottahedeh bijan.mottahedeh@oracle.com Signed-off-by: Jens Axboe axboe@kernel.dk Signed-off-by: yangerkun yangerkun@huawei.com Reviewed-by: zhangyi (F) yi.zhang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- fs/internal.h | 2 ++ fs/stat.c | 32 +++++++++++++++++++------------- 2 files changed, 21 insertions(+), 13 deletions(-)
diff --git a/fs/internal.h b/fs/internal.h index acbc60a8e13e..6aa0e08161ac 100644 --- a/fs/internal.h +++ b/fs/internal.h @@ -195,3 +195,5 @@ int sb_init_dio_done_wq(struct super_block *sb); */ unsigned vfs_stat_set_lookup_flags(unsigned *lookup_flags, int flags); int cp_statx(const struct kstat *stat, struct statx __user *buffer); +int do_statx(int dfd, const char __user *filename, unsigned flags, + unsigned int mask, struct statx __user *buffer); diff --git a/fs/stat.c b/fs/stat.c index 46dfe0df1a71..a69de0897b74 100644 --- a/fs/stat.c +++ b/fs/stat.c @@ -562,6 +562,24 @@ cp_statx(const struct kstat *stat, struct statx __user *buffer) return copy_to_user(buffer, &tmp, sizeof(tmp)) ? -EFAULT : 0; }
+int do_statx(int dfd, const char __user *filename, unsigned flags, + unsigned int mask, struct statx __user *buffer) +{ + struct kstat stat; + int error; + + if (mask & STATX__RESERVED) + return -EINVAL; + if ((flags & AT_STATX_SYNC_TYPE) == AT_STATX_SYNC_TYPE) + return -EINVAL; + + error = vfs_statx(dfd, filename, flags, &stat, mask); + if (error) + return error; + + return cp_statx(&stat, buffer); +} + /** * sys_statx - System call to get enhanced stats * @dfd: Base directory to pathwalk from *or* fd to stat. @@ -578,19 +596,7 @@ SYSCALL_DEFINE5(statx, unsigned int, mask, struct statx __user *, buffer) { - struct kstat stat; - int error; - - if (mask & STATX__RESERVED) - return -EINVAL; - if ((flags & AT_STATX_SYNC_TYPE) == AT_STATX_SYNC_TYPE) - return -EINVAL; - - error = vfs_statx(dfd, filename, flags, &stat, mask); - if (error) - return error; - - return cp_statx(&stat, buffer); + return do_statx(dfd, filename, flags, mask, buffer); }
#ifdef CONFIG_COMPAT
From: Bijan Mottahedeh bijan.mottahedeh@oracle.com
mainline inclusion from mainline-5.8-rc1 commit e62753e4e2926f249d088cc0517be5ed4efec6d6 category: feature bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27 CVE: NA ---------------------------
Calling statx directly both simplifies the interface and avoids potential incompatibilities between sync and async invocations.
Signed-off-by: Bijan Mottahedeh bijan.mottahedeh@oracle.com Signed-off-by: Jens Axboe axboe@kernel.dk
Conflicts: fs/io_uring.c [commit cebdb98617ae("io_uring: add support for IORING_OP_OPENAT2") is not merged] Signed-off-by: yangerkun yangerkun@huawei.com Reviewed-by: zhangyi (F) yi.zhang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- fs/io_uring.c | 50 ++++---------------------------------------------- 1 file changed, 4 insertions(+), 46 deletions(-)
diff --git a/fs/io_uring.c b/fs/io_uring.c index 08ee4e0e815f..72991fb10d28 100644 --- a/fs/io_uring.c +++ b/fs/io_uring.c @@ -485,7 +485,7 @@ struct io_statx { int dfd; unsigned int mask; unsigned int flags; - struct filename *filename; + const char __user *filename; struct statx __user *buffer; };
@@ -3323,43 +3323,23 @@ static int io_fadvise(struct io_kiocb *req, bool force_nonblock)
static int io_statx_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe) { - const char __user *fname; - unsigned lookup_flags; - int ret; - if (sqe->ioprio || sqe->buf_index) return -EINVAL; if (req->flags & REQ_F_FIXED_FILE) return -EBADF; - if (req->flags & REQ_F_NEED_CLEANUP) - return 0;
req->statx.dfd = READ_ONCE(sqe->fd); req->statx.mask = READ_ONCE(sqe->len); - fname = u64_to_user_ptr(READ_ONCE(sqe->addr)); + req->statx.filename = u64_to_user_ptr(READ_ONCE(sqe->addr)); req->statx.buffer = u64_to_user_ptr(READ_ONCE(sqe->addr2)); req->statx.flags = READ_ONCE(sqe->statx_flags);
- if (vfs_stat_set_lookup_flags(&lookup_flags, req->statx.flags)) - return -EINVAL; - - req->statx.filename = getname_flags(fname, lookup_flags, NULL); - if (IS_ERR(req->statx.filename)) { - ret = PTR_ERR(req->statx.filename); - req->statx.filename = NULL; - return ret; - } - - req->flags |= REQ_F_NEED_CLEANUP; return 0; }
static int io_statx(struct io_kiocb *req, bool force_nonblock) { struct io_statx *ctx = &req->statx; - unsigned lookup_flags; - struct path path; - struct kstat stat; int ret;
if (force_nonblock) { @@ -3369,29 +3349,9 @@ static int io_statx(struct io_kiocb *req, bool force_nonblock) return -EAGAIN; }
- if (vfs_stat_set_lookup_flags(&lookup_flags, ctx->flags)) - return -EINVAL; - -retry: - /* filename_lookup() drops it, keep a reference */ - ctx->filename->refcnt++; - - ret = filename_lookup(ctx->dfd, ctx->filename, lookup_flags, &path, - NULL); - if (ret) - goto err; + ret = do_statx(ctx->dfd, ctx->filename, ctx->flags, ctx->mask, + ctx->buffer);
- ret = vfs_getattr(&path, &stat, ctx->mask, ctx->flags); - path_put(&path); - if (retry_estale(ret, lookup_flags)) { - lookup_flags |= LOOKUP_REVAL; - goto retry; - } - if (!ret) - ret = cp_statx(&stat, ctx->buffer); -err: - putname(ctx->filename); - req->flags &= ~REQ_F_NEED_CLEANUP; if (ret < 0) req_set_fail_links(req); io_cqring_add_event(req, ret); @@ -5142,8 +5102,6 @@ static void io_cleanup_req(struct io_kiocb *req) kfree(req->sr_msg.kbuf); break; case IORING_OP_OPENAT: - case IORING_OP_STATX: - putname(req->open.filename); break; case IORING_OP_SPLICE: case IORING_OP_TEE:
From: Bijan Mottahedeh bijan.mottahedeh@oracle.com
mainline inclusion from mainline-5.8-rc1 commit 6f88cc176a3358c54bb6c38c8afee3f3a42faf54 category: feature bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27 CVE: NA ---------------------------
The io_uring interfaces have been replaced by do_statx() and are no longer needed.
Signed-off-by: Bijan Mottahedeh bijan.mottahedeh@oracle.com Signed-off-by: Jens Axboe axboe@kernel.dk Signed-off-by: yangerkun yangerkun@huawei.com Reviewed-by: zhangyi (F) yi.zhang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- fs/internal.h | 2 -- fs/stat.c | 5 +++-- 2 files changed, 3 insertions(+), 4 deletions(-)
diff --git a/fs/internal.h b/fs/internal.h index 6aa0e08161ac..73e9829245f1 100644 --- a/fs/internal.h +++ b/fs/internal.h @@ -193,7 +193,5 @@ int sb_init_dio_done_wq(struct super_block *sb); /* * fs/stat.c: */ -unsigned vfs_stat_set_lookup_flags(unsigned *lookup_flags, int flags); -int cp_statx(const struct kstat *stat, struct statx __user *buffer); int do_statx(int dfd, const char __user *filename, unsigned flags, unsigned int mask, struct statx __user *buffer); diff --git a/fs/stat.c b/fs/stat.c index a69de0897b74..0e36d35e9140 100644 --- a/fs/stat.c +++ b/fs/stat.c @@ -150,7 +150,8 @@ int vfs_statx_fd(unsigned int fd, struct kstat *stat, } EXPORT_SYMBOL(vfs_statx_fd);
-inline unsigned vfs_stat_set_lookup_flags(unsigned *lookup_flags, int flags) +static inline unsigned vfs_stat_set_lookup_flags(unsigned *lookup_flags, + int flags) { if ((flags & ~(AT_SYMLINK_NOFOLLOW | AT_NO_AUTOMOUNT | AT_EMPTY_PATH | KSTAT_QUERY_FLAGS)) != 0) @@ -528,7 +529,7 @@ SYSCALL_DEFINE4(fstatat64, int, dfd, const char __user *, filename, } #endif /* __ARCH_WANT_STAT64 || __ARCH_WANT_COMPAT_STAT64 */
-noinline_for_stack int +static noinline_for_stack int cp_statx(const struct kstat *stat, struct statx __user *buffer) { struct statx tmp;
From: Pavel Begunkov asml.silence@gmail.com
mainline inclusion from mainline-5.8-rc1 commit 360428f8c0cd857006a8a3f515946285370489ac category: feature bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27 CVE: NA ---------------------------
Separate flushing offset timeouts from io_commit_cqring() by moving it into a helper. Just a preparation; it makes the following patches clearer.
Signed-off-by: Pavel Begunkov asml.silence@gmail.com Signed-off-by: Jens Axboe axboe@kernel.dk Signed-off-by: yangerkun yangerkun@huawei.com Reviewed-by: zhangyi (F) yi.zhang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- fs/io_uring.c | 34 ++++++++++++++-------------------- 1 file changed, 14 insertions(+), 20 deletions(-)
diff --git a/fs/io_uring.c b/fs/io_uring.c index 72991fb10d28..4ec02d11110f 100644 --- a/fs/io_uring.c +++ b/fs/io_uring.c @@ -989,23 +989,6 @@ static inline bool req_need_defer(struct io_kiocb *req) return false; }
-static struct io_kiocb *io_get_timeout_req(struct io_ring_ctx *ctx) -{ - struct io_kiocb *req; - - req = list_first_entry_or_null(&ctx->timeout_list, struct io_kiocb, list); - if (req) { - if (req->flags & REQ_F_TIMEOUT_NOSEQ) - return NULL; - if (!__req_need_defer(req)) { - list_del_init(&req->list); - return req; - } - } - - return NULL; -} - static void __io_commit_cqring(struct io_ring_ctx *ctx) { struct io_rings *rings = ctx->rings; @@ -1134,13 +1117,24 @@ static void __io_queue_deferred(struct io_ring_ctx *ctx) } while (!list_empty(&ctx->defer_list)); }
-static void io_commit_cqring(struct io_ring_ctx *ctx) +static void io_flush_timeouts(struct io_ring_ctx *ctx) { - struct io_kiocb *req; + while (!list_empty(&ctx->timeout_list)) { + struct io_kiocb *req = list_first_entry(&ctx->timeout_list, + struct io_kiocb, list);
- while ((req = io_get_timeout_req(ctx)) != NULL) + if (req->flags & REQ_F_TIMEOUT_NOSEQ) + break; + if (__req_need_defer(req)) + break; + list_del_init(&req->list); io_kill_timeout(req); + } +}
+static void io_commit_cqring(struct io_ring_ctx *ctx) +{ + io_flush_timeouts(ctx); __io_commit_cqring(ctx);
if (unlikely(!list_empty(&ctx->defer_list)))
From: Pavel Begunkov asml.silence@gmail.com
mainline inclusion from mainline-5.8-rc1 commit bfe68a221905de37e65394a6d58c1e5f3e545d2f category: feature bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27 CVE: NA ---------------------------
Offset timeouts currently wait not for sqe->off non-timeout CQEs, but rather for sqe->off plus the number of prior inflight requests. Make them wait for exactly sqe->off non-timeout completions.
Reported-by: Jens Axboe axboe@kernel.dk Signed-off-by: Pavel Begunkov asml.silence@gmail.com Signed-off-by: Jens Axboe axboe@kernel.dk Signed-off-by: yangerkun yangerkun@huawei.com Reviewed-by: zhangyi (F) yi.zhang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- fs/io_uring.c | 65 +++++++++++---------------------------------------- 1 file changed, 14 insertions(+), 51 deletions(-)
diff --git a/fs/io_uring.c b/fs/io_uring.c index 4ec02d11110f..5757474c0754 100644 --- a/fs/io_uring.c +++ b/fs/io_uring.c @@ -394,7 +394,8 @@ struct io_timeout { struct file *file; u64 addr; int flags; - u32 count; + u32 off; + u32 target_seq; };
struct io_rw { @@ -1125,8 +1126,10 @@ static void io_flush_timeouts(struct io_ring_ctx *ctx)
if (req->flags & REQ_F_TIMEOUT_NOSEQ) break; - if (__req_need_defer(req)) + if (req->timeout.target_seq != ctx->cached_cq_tail + - atomic_read(&ctx->cq_timeouts)) break; + list_del_init(&req->list); io_kill_timeout(req); } @@ -4609,20 +4612,8 @@ static enum hrtimer_restart io_timeout_fn(struct hrtimer *timer) * We could be racing with timeout deletion. If the list is empty, * then timeout lookup already found it and will be handling it. */ - if (!list_empty(&req->list)) { - struct io_kiocb *prev; - - /* - * Adjust the reqs sequence before the current one because it - * will consume a slot in the cq_ring and the cq_tail - * pointer will be increased, otherwise other timeout reqs may - * return in advance without waiting for enough wait_nr. - */ - prev = req; - list_for_each_entry_continue_reverse(prev, &ctx->timeout_list, list) - prev->sequence++; + if (!list_empty(&req->list)) list_del_init(&req->list); - }
io_cqring_fill_event(req, -ETIME); io_commit_cqring(ctx); @@ -4714,7 +4705,7 @@ static int io_timeout_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe, if (flags & ~IORING_TIMEOUT_ABS) return -EINVAL;
- req->timeout.count = off; + req->timeout.off = off;
if (!req->io && io_alloc_async_ctx(req)) return -ENOMEM; @@ -4738,13 +4729,10 @@ static int io_timeout_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe, static int io_timeout(struct io_kiocb *req) { struct io_ring_ctx *ctx = req->ctx; - struct io_timeout_data *data; + struct io_timeout_data *data = &req->io->timeout; struct list_head *entry; - unsigned span = 0; - u32 count = req->timeout.count; - u32 seq = req->sequence; + u32 tail, off = req->timeout.off;
- data = &req->io->timeout; spin_lock_irq(&ctx->completion_lock);
/* @@ -4752,13 +4740,14 @@ static int io_timeout(struct io_kiocb *req) * timeout event to be satisfied. If it isn't set, then this is * a pure timeout request, sequence isn't used. */ - if (!count) { + if (!off) { req->flags |= REQ_F_TIMEOUT_NOSEQ; entry = ctx->timeout_list.prev; goto add; }
- req->sequence = seq + count; + tail = ctx->cached_cq_tail - atomic_read(&ctx->cq_timeouts); + req->timeout.target_seq = tail + off;
/* * Insertion sort, ensuring the first entry in the list is always @@ -4766,39 +4755,13 @@ static int io_timeout(struct io_kiocb *req) */ list_for_each_prev(entry, &ctx->timeout_list) { struct io_kiocb *nxt = list_entry(entry, struct io_kiocb, list); - unsigned nxt_seq; - long long tmp, tmp_nxt; - u32 nxt_offset = nxt->timeout.count;
if (nxt->flags & REQ_F_TIMEOUT_NOSEQ) continue; - - /* - * Since seq + count can overflow, use type long - * long to store it. - */ - tmp = (long long)seq + count; - nxt_seq = nxt->sequence - nxt_offset; - tmp_nxt = (long long)nxt_seq + nxt_offset; - - /* - * cached_sq_head may overflow, and it will never overflow twice - * once there is some timeout req still be valid. - */ - if (seq < nxt_seq) - tmp += UINT_MAX; - - if (tmp > tmp_nxt) + /* nxt.seq is behind @tail, otherwise would've been completed */ + if (off >= nxt->timeout.target_seq - tail) break; - - /* - * Sequence of reqs after the insert one and itself should - * be adjusted because each timeout req consumes a slot. - */ - span++; - nxt->sequence++; } - req->sequence -= span; add: list_add(&req->list, entry); data->timer.function = io_timeout_fn;
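To make the new arithmetic concrete, a small worked sketch with made-up numbers:

#include <stdio.h>

/* Worked example (illustrative numbers): with 100 CQEs posted so far,
 * of which 3 were timeouts, a timeout armed with sqe->off == 5 fires
 * once the non-timeout completion count reaches (100 - 3) + 5 = 102. */
int main(void)
{
	unsigned int cached_cq_tail = 100;	/* all CQEs posted */
	unsigned int cq_timeouts = 3;		/* timeout CQEs among them */
	unsigned int off = 5;			/* sqe->off */

	unsigned int tail = cached_cq_tail - cq_timeouts;
	unsigned int target_seq = tail + off;

	printf("target_seq = %u\n", target_seq);
	return 0;
}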
From: Pavel Begunkov asml.silence@gmail.com
mainline inclusion from mainline-5.8-rc1 commit 7b53d59859bc932b37895d2d37388e7fa29af7a5 category: feature bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27 CVE: NA ---------------------------
Overflowed requests in io_uring_cancel_files() should be shed only of the inflight and overflowed refs. All other remaining references are owned by someone else.
If refcount_sub_and_test() fails, it will go further and put an extra ref; don't do that. Also, there is no need to do io_wq_cancel_work() for overflowed reqs; they will be let go shortly anyway.
Signed-off-by: Pavel Begunkov asml.silence@gmail.com Signed-off-by: Jens Axboe axboe@kernel.dk Signed-off-by: yangerkun yangerkun@huawei.com Reviewed-by: zhangyi (F) yi.zhang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- fs/io_uring.c | 5 +++-- 1 file changed, 3 insertions(+), 2 deletions(-)
diff --git a/fs/io_uring.c b/fs/io_uring.c index 5757474c0754..8516dffe6649 100644 --- a/fs/io_uring.c +++ b/fs/io_uring.c @@ -7395,10 +7395,11 @@ static void io_uring_cancel_files(struct io_ring_ctx *ctx, finish_wait(&ctx->inflight_wait, &wait); continue; } + } else { + io_wq_cancel_work(ctx->io_wq, &cancel_req->work); + io_put_req(cancel_req); }
- io_wq_cancel_work(ctx->io_wq, &cancel_req->work); - io_put_req(cancel_req); schedule(); finish_wait(&ctx->inflight_wait, &wait); }
From: Jens Axboe axboe@kernel.dk
mainline inclusion from mainline-5.8-rc1 commit fd2206e4e97b5bae422d9f2f9ebbc79bc97e44a5 category: feature bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27 CVE: NA ---------------------------
A previous commit enabled this functionality, which also enabled O_PATH to work correctly with io_uring. But we can't safely close the ring itself, as the file handle isn't reference counted inside io_uring_enter(). Instead of jumping through hoops to enable ring closure, add a "soft" ->needs_file option, ->needs_file_no_error. This enables O_PATH file descriptors to work, but still catches the case of trying to close the ring itself.
Reported-by: Jann Horn jannh@google.com Fixes: 904fbcb115c8 ("io_uring: remove 'fd is io_uring' from close path") Signed-off-by: Jens Axboe axboe@kernel.dk Signed-off-by: yangerkun yangerkun@huawei.com Reviewed-by: zhangyi (F) yi.zhang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- fs/io_uring.c | 25 +++++++++++++++++-------- 1 file changed, 17 insertions(+), 8 deletions(-)
diff --git a/fs/io_uring.c b/fs/io_uring.c index 8516dffe6649..aceede48ccf2 100644 --- a/fs/io_uring.c +++ b/fs/io_uring.c @@ -701,6 +701,8 @@ struct io_op_def { unsigned needs_mm : 1; /* needs req->file assigned */ unsigned needs_file : 1; + /* don't fail if file grab fails */ + unsigned needs_file_no_error : 1; /* hash wq insertion if file is a regular file */ unsigned hash_reg_file : 1; /* unbound wq insertion if file is a non-regular file */ @@ -807,6 +809,8 @@ static const struct io_op_def io_op_defs[] = { .needs_fs = 1, }, [IORING_OP_CLOSE] = { + .needs_file = 1, + .needs_file_no_error = 1, .file_table = 1, }, [IORING_OP_FILES_UPDATE] = { @@ -3371,6 +3375,10 @@ static int io_close_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe) return -EBADF;
req->close.fd = READ_ONCE(sqe->fd); + if ((req->file && req->file->f_op == &io_uring_fops) || + req->close.fd == req->ctx->ring_fd) + return -EBADF; + return 0; }
@@ -5376,19 +5384,20 @@ static int io_file_get(struct io_submit_state *state, struct io_kiocb *req, return -EBADF; fd = array_index_nospec(fd, ctx->nr_user_files); file = io_file_from_index(ctx, fd); - if (!file) - return -EBADF; - req->fixed_file_refs = ctx->file_data->cur_refs; - percpu_ref_get(req->fixed_file_refs); + if (file) { + req->fixed_file_refs = ctx->file_data->cur_refs; + percpu_ref_get(req->fixed_file_refs); + } } else { trace_io_uring_file_get(ctx, fd); file = __io_file_get(state, fd); - if (unlikely(!file)) - return -EBADF; }
- *out_file = file; - return 0; + if (file || io_op_defs[req->opcode].needs_file_no_error) { + *out_file = file; + return 0; + } + return -EBADF; }
static int io_req_set_file(struct io_submit_state *state, struct io_kiocb *req,
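From userspace the behavior is now: IORING_OP_CLOSE works even on O_PATH descriptors, but naming the ring's own fd completes with -EBADF. A hedged sketch (prep_close() is a hypothetical helper):

#include <linux/io_uring.h>
#include <string.h>

/* Sketch: close an fd via io_uring. Passing the ring fd itself is
 * rejected in io_close_prep() with -EBADF rather than closing the ring. */
static void prep_close(struct io_uring_sqe *sqe, int fd)
{
	memset(sqe, 0, sizeof(*sqe));
	sqe->opcode = IORING_OP_CLOSE;
	sqe->fd = fd;
}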
From: Pavel Begunkov asml.silence@gmail.com
mainline inclusion from mainline-5.8-rc1 commit 3232dd02af65f2d01be641120d2a710176b0c7a7 category: feature bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27 CVE: NA ---------------------------
IORING_SETUP_IOPOLL is defined only for read/write, other opcodes should be disallowed, otherwise it'll get an error as below. Also refuse open/close with SQPOLL, as the polling thread wouldn't know which file table to use.
RIP: 0010:io_iopoll_getevents+0x111/0x5a0 Call Trace: ? _raw_spin_unlock_irqrestore+0x24/0x40 ? do_send_sig_info+0x64/0x90 io_iopoll_reap_events.part.0+0x5e/0xa0 io_ring_ctx_wait_and_kill+0x132/0x1c0 io_uring_release+0x20/0x30 __fput+0xcd/0x230 ____fput+0xe/0x10 task_work_run+0x67/0xa0 do_exit+0x353/0xb10 ? handle_mm_fault+0xd4/0x200 ? syscall_trace_enter+0x18c/0x2c0 do_group_exit+0x43/0xa0 __x64_sys_exit_group+0x18/0x20 do_syscall_64+0x60/0x1e0 entry_SYSCALL_64_after_hwframe+0x44/0xa9
Signed-off-by: Pavel Begunkov asml.silence@gmail.com [axboe: allow provide/remove buffers and files update] Signed-off-by: Jens Axboe axboe@kernel.dk
Conflicts: fs/io_uring.c [commit cebdb98617ae("io_uring: add support for IORING_OP_OPENAT2") is not merged] Signed-off-by: yangerkun yangerkun@huawei.com Reviewed-by: zhangyi (F) yi.zhang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- fs/io_uring.c | 16 ++++++++++++++++ 1 file changed, 16 insertions(+)
diff --git a/fs/io_uring.c b/fs/io_uring.c index aceede48ccf2..fd0b428c965d 100644 --- a/fs/io_uring.c +++ b/fs/io_uring.c @@ -2764,6 +2764,8 @@ static int __io_splice_prep(struct io_kiocb *req,
if (req->flags & REQ_F_NEED_CLEANUP) return 0; + if (unlikely(req->ctx->flags & IORING_SETUP_IOPOLL)) + return -EINVAL;
sp->file_in = NULL; sp->len = READ_ONCE(sqe->len); @@ -2964,6 +2966,8 @@ static int io_fallocate_prep(struct io_kiocb *req, { if (sqe->ioprio || sqe->buf_index || sqe->rw_flags) return -EINVAL; + if (unlikely(req->ctx->flags & IORING_SETUP_IOPOLL)) + return -EINVAL;
req->sync.off = READ_ONCE(sqe->off); req->sync.len = READ_ONCE(sqe->addr); @@ -2989,6 +2993,8 @@ static int io_openat_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe) const char __user *fname; int ret;
+ if (unlikely(req->ctx->flags & (IORING_SETUP_IOPOLL|IORING_SETUP_SQPOLL))) + return -EINVAL; if (sqe->ioprio || sqe->buf_index) return -EINVAL; if (req->flags & REQ_F_FIXED_FILE) @@ -3213,6 +3219,8 @@ static int io_epoll_ctl_prep(struct io_kiocb *req, #if defined(CONFIG_EPOLL) if (sqe->ioprio || sqe->buf_index) return -EINVAL; + if (unlikely(req->ctx->flags & IORING_SETUP_IOPOLL)) + return -EINVAL;
req->epoll.epfd = READ_ONCE(sqe->fd); req->epoll.op = READ_ONCE(sqe->len); @@ -3257,6 +3265,8 @@ static int io_madvise_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe) #if defined(CONFIG_ADVISE_SYSCALLS) && defined(CONFIG_MMU) if (sqe->ioprio || sqe->buf_index || sqe->off) return -EINVAL; + if (unlikely(req->ctx->flags & IORING_SETUP_IOPOLL)) + return -EINVAL;
req->madvise.addr = READ_ONCE(sqe->addr); req->madvise.len = READ_ONCE(sqe->len); @@ -3291,6 +3301,8 @@ static int io_fadvise_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe) { if (sqe->ioprio || sqe->buf_index || sqe->addr) return -EINVAL; + if (unlikely(req->ctx->flags & IORING_SETUP_IOPOLL)) + return -EINVAL;
req->fadvise.offset = READ_ONCE(sqe->off); req->fadvise.len = READ_ONCE(sqe->len); @@ -3324,6 +3336,8 @@ static int io_fadvise(struct io_kiocb *req, bool force_nonblock)
static int io_statx_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe) { + if (unlikely(req->ctx->flags & IORING_SETUP_IOPOLL)) + return -EINVAL; if (sqe->ioprio || sqe->buf_index) return -EINVAL; if (req->flags & REQ_F_FIXED_FILE) @@ -3368,6 +3382,8 @@ static int io_close_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe) */ req->work.flags |= IO_WQ_WORK_NO_CANCEL;
+ if (unlikely(req->ctx->flags & (IORING_SETUP_IOPOLL|IORING_SETUP_SQPOLL))) + return -EINVAL; if (sqe->ioprio || sqe->off || sqe->addr || sqe->len || sqe->rw_flags || sqe->buf_index) return -EINVAL;
From: Pavel Begunkov asml.silence@gmail.com
mainline inclusion from mainline-5.8-rc1 commit d2b6f48b691ed67569786c332f0173b918d3fd1b category: feature bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27 CVE: NA ---------------------------
Fail recv/send in case of IORING_SETUP_IOPOLL earlier, during prep, so the check is done only once. This removes duplication as well.
Signed-off-by: Pavel Begunkov asml.silence@gmail.com Signed-off-by: Jens Axboe axboe@kernel.dk Signed-off-by: yangerkun yangerkun@huawei.com Reviewed-by: zhangyi (F) yi.zhang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- fs/io_uring.c | 18 ++++++------------ 1 file changed, 6 insertions(+), 12 deletions(-)
diff --git a/fs/io_uring.c b/fs/io_uring.c index fd0b428c965d..7633c2de7430 100644 --- a/fs/io_uring.c +++ b/fs/io_uring.c @@ -3520,6 +3520,9 @@ static int io_sendmsg_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe) struct io_async_ctx *io = req->io; int ret;
+ if (unlikely(req->ctx->flags & IORING_SETUP_IOPOLL)) + return -EINVAL; + sr->msg_flags = READ_ONCE(sqe->msg_flags); sr->msg = u64_to_user_ptr(READ_ONCE(sqe->addr)); sr->len = READ_ONCE(sqe->len); @@ -3549,9 +3552,6 @@ static int io_sendmsg(struct io_kiocb *req, bool force_nonblock) struct socket *sock; int ret;
- if (unlikely(req->ctx->flags & IORING_SETUP_IOPOLL)) - return -EINVAL; - sock = sock_from_file(req->file, &ret); if (sock) { struct io_async_ctx io; @@ -3605,9 +3605,6 @@ static int io_send(struct io_kiocb *req, bool force_nonblock) struct socket *sock; int ret;
- if (unlikely(req->ctx->flags & IORING_SETUP_IOPOLL)) - return -EINVAL; - sock = sock_from_file(req->file, &ret); if (sock) { struct io_sr_msg *sr = &req->sr_msg; @@ -3760,6 +3757,9 @@ static int io_recvmsg_prep(struct io_kiocb *req, struct io_async_ctx *io = req->io; int ret;
+ if (unlikely(req->ctx->flags & IORING_SETUP_IOPOLL)) + return -EINVAL; + sr->msg_flags = READ_ONCE(sqe->msg_flags); sr->msg = u64_to_user_ptr(READ_ONCE(sqe->addr)); sr->len = READ_ONCE(sqe->len); @@ -3788,9 +3788,6 @@ static int io_recvmsg(struct io_kiocb *req, bool force_nonblock) struct socket *sock; int ret, cflags = 0;
- if (unlikely(req->ctx->flags & IORING_SETUP_IOPOLL)) - return -EINVAL; - sock = sock_from_file(req->file, &ret); if (sock) { struct io_buffer *kbuf; @@ -3852,9 +3849,6 @@ static int io_recv(struct io_kiocb *req, bool force_nonblock) struct socket *sock; int ret, cflags = 0;
- if (unlikely(req->ctx->flags & IORING_SETUP_IOPOLL)) - return -EINVAL; - sock = sock_from_file(req->file, &ret); if (sock) { struct io_sr_msg *sr = &req->sr_msg;
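To see the user-visible effect, here is a minimal user-space sketch (assuming liburing is installed and the kernel carries this change; this is illustration, not the kernel code): a send SQE submitted on an IOPOLL ring completes with -EINVAL, now rejected once at prep time.

	/* Sketch: send on an IORING_SETUP_IOPOLL ring fails with -EINVAL. */
	#include <liburing.h>
	#include <stdio.h>
	#include <sys/socket.h>

	int main(void)
	{
		struct io_uring ring;
		struct io_uring_sqe *sqe;
		struct io_uring_cqe *cqe;
		int fds[2];

		if (io_uring_queue_init(8, &ring, IORING_SETUP_IOPOLL) < 0)
			return 1;
		if (socketpair(AF_UNIX, SOCK_STREAM, 0, fds) < 0)
			return 1;

		sqe = io_uring_get_sqe(&ring);
		io_uring_prep_send(sqe, fds[0], "x", 1, 0);
		io_uring_submit(&ring);

		io_uring_wait_cqe(&ring, &cqe);
		printf("send on IOPOLL ring: res = %d (expect -EINVAL)\n", cqe->res);
		io_uring_cqe_seen(&ring, cqe);
		io_uring_queue_exit(&ring);
		return 0;
	}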
From: Jens Axboe axboe@kernel.dk
mainline inclusion from mainline-5.8-rc1 commit dddb3e26f6d88c5344d28cb5ff9d3d6fa05c4f7a category: feature bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27 CVE: NA ---------------------------
We already have the buffer selected, but we should set up the iter's iovec again so it points at that buffer.
Cc: stable@vger.kernel.org # v5.7 Signed-off-by: Jens Axboe axboe@kernel.dk Signed-off-by: yangerkun yangerkun@huawei.com Reviewed-by: zhangyi (F) yi.zhang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- fs/io_uring.c | 8 +++++++- 1 file changed, 7 insertions(+), 1 deletion(-)
diff --git a/fs/io_uring.c b/fs/io_uring.c index 7633c2de7430..4fccfac7aca0 100644 --- a/fs/io_uring.c +++ b/fs/io_uring.c @@ -2361,8 +2361,14 @@ static ssize_t __io_iov_buffer_select(struct io_kiocb *req, struct iovec *iov, static ssize_t io_iov_buffer_select(struct io_kiocb *req, struct iovec *iov, bool needs_lock) { - if (req->flags & REQ_F_BUFFER_SELECTED) + if (req->flags & REQ_F_BUFFER_SELECTED) { + struct io_buffer *kbuf; + + kbuf = (struct io_buffer *) (unsigned long) req->rw.addr; + iov[0].iov_base = u64_to_user_ptr(kbuf->addr); + iov[0].iov_len = kbuf->len; return 0; + } if (!req->rw.len) return 0; else if (req->rw.len > 1)
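For context on the provided-buffer flow this fix belongs to, a hedged user-space sketch (assuming a 5.7+ kernel and a liburing with the provided-buffer helpers): register a group of buffers, then let the kernel select one for a read via IOSQE_BUFFER_SELECT.

	#include <liburing.h>
	#include <stdio.h>
	#include <unistd.h>

	#define BGID  7
	#define NBUFS 4
	#define BLEN  256

	int main(void)
	{
		static char bufs[NBUFS][BLEN];
		struct io_uring ring;
		struct io_uring_sqe *sqe;
		struct io_uring_cqe *cqe;
		int fds[2], bid;

		io_uring_queue_init(8, &ring, 0);
		pipe(fds);
		write(fds[1], "hello", 5);

		/* register NBUFS buffers of BLEN bytes each as group BGID */
		sqe = io_uring_get_sqe(&ring);
		io_uring_prep_provide_buffers(sqe, bufs, BLEN, NBUFS, BGID, 0);
		io_uring_submit(&ring);
		io_uring_wait_cqe(&ring, &cqe);
		io_uring_cqe_seen(&ring, cqe);

		/* read without naming a buffer; the kernel picks one from BGID */
		sqe = io_uring_get_sqe(&ring);
		io_uring_prep_read(sqe, fds[0], NULL, BLEN, 0);
		sqe->flags |= IOSQE_BUFFER_SELECT;
		sqe->buf_group = BGID;
		io_uring_submit(&ring);
		io_uring_wait_cqe(&ring, &cqe);
		bid = cqe->flags >> IORING_CQE_BUFFER_SHIFT;
		printf("read %d bytes into buffer %d: %.5s\n", cqe->res, bid, bufs[bid]);
		io_uring_cqe_seen(&ring, cqe);
		io_uring_queue_exit(&ring);
		return 0;
	}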
From: Bijan Mottahedeh bijan.mottahedeh@oracle.com
mainline inclusion from mainline-5.8-rc1 commit efe68c1ca8f49e8c06afd74b699411bfbb8ba1ff category: feature bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27 CVE: NA ---------------------------
Account for the number of provided buffers when validating the address range.
Signed-off-by: Bijan Mottahedeh bijan.mottahedeh@oracle.com Signed-off-by: Jens Axboe axboe@kernel.dk Signed-off-by: yangerkun yangerkun@huawei.com Reviewed-by: zhangyi (F) yi.zhang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- fs/io_uring.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/fs/io_uring.c b/fs/io_uring.c index 4fccfac7aca0..1d2d6fc42350 100644 --- a/fs/io_uring.c +++ b/fs/io_uring.c @@ -3147,7 +3147,7 @@ static int io_provide_buffers_prep(struct io_kiocb *req, p->addr = READ_ONCE(sqe->addr); p->len = READ_ONCE(sqe->len);
- if (!access_ok(u64_to_user_ptr(p->addr), p->len)) + if (!access_ok(u64_to_user_ptr(p->addr), (p->len * p->nbufs))) return -EFAULT;
p->bgid = READ_ONCE(sqe->buf_group);
From: Denis Efremov efremov@linux.com
mainline inclusion from mainline-5.8-rc1 commit a8c73c1a614f6da6c0b04c393f87447e28cb6de4 category: feature bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27 CVE: NA ---------------------------
Use kvfree() to free the pages and vmas, since they are allocated by kvmalloc_array() in a loop.
Fixes: d4ef647510b1 ("io_uring: avoid page allocation warnings") Signed-off-by: Denis Efremov efremov@linux.com Signed-off-by: Jens Axboe axboe@kernel.dk Cc: stable@vger.kernel.org Link: https://lore.kernel.org/r/20200605093203.40087-1-efremov@linux.com Signed-off-by: yangerkun yangerkun@huawei.com Reviewed-by: zhangyi (F) yi.zhang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- fs/io_uring.c | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-)
diff --git a/fs/io_uring.c b/fs/io_uring.c index 1d2d6fc42350..fb0bb5a1cc76 100644 --- a/fs/io_uring.c +++ b/fs/io_uring.c @@ -7123,8 +7123,8 @@ static int io_sqe_buffer_register(struct io_ring_ctx *ctx, void __user *arg,
ret = 0; if (!pages || nr_pages > got_pages) { - kfree(vmas); - kfree(pages); + kvfree(vmas); + kvfree(pages); pages = kvmalloc_array(nr_pages, sizeof(struct page *), GFP_KERNEL); vmas = kvmalloc_array(nr_pages,
From: Pavel Begunkov asml.silence@gmail.com
mainline inclusion from mainline-5.8-rc1 commit 3af73b286ccee493dc055fc58da02b2dc7a5304d category: feature bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27 CVE: NA ---------------------------
Relying on having a specific work.func is dangerous, even if an opcode handler set it itself. E.g. io_wq_assign_next() can modify it.
io_close() sets a custom work.func to indicate that __close_fd_get_file() was already called. Fortunately, there are no bugs involving io_wq_assign_next() and close yet.

Still, play it safe and always be prepared to be called through io_wq_submit_work(): zero req->close.put_file in prep, and call __close_fd_get_file() only if it is still NULL.
Signed-off-by: Pavel Begunkov asml.silence@gmail.com Signed-off-by: Jens Axboe axboe@kernel.dk Signed-off-by: yangerkun yangerkun@huawei.com Reviewed-by: zhangyi (F) yi.zhang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- fs/io_uring.c | 50 +++++++++++++++++--------------------------------- 1 file changed, 17 insertions(+), 33 deletions(-)
diff --git a/fs/io_uring.c b/fs/io_uring.c index fb0bb5a1cc76..a05efcec5d02 100644 --- a/fs/io_uring.c +++ b/fs/io_uring.c @@ -3401,53 +3401,37 @@ static int io_close_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe) req->close.fd == req->ctx->ring_fd) return -EBADF;
+ req->close.put_file = NULL; return 0; }
-/* only called when __close_fd_get_file() is done */ -static void __io_close_finish(struct io_kiocb *req) -{ - int ret; - - ret = filp_close(req->close.put_file, req->work.files); - if (ret < 0) - req_set_fail_links(req); - io_cqring_add_event(req, ret); - fput(req->close.put_file); - io_put_req(req); -} - -static void io_close_finish(struct io_wq_work **workptr) -{ - struct io_kiocb *req = container_of(*workptr, struct io_kiocb, work); - - /* not cancellable, don't do io_req_cancelled() */ - __io_close_finish(req); - io_steal_work(req, workptr); -} - static int io_close(struct io_kiocb *req, bool force_nonblock) { + struct io_close *close = &req->close; int ret;
- req->close.put_file = NULL; - ret = __close_fd_get_file(req->close.fd, &req->close.put_file); - if (ret < 0) - return (ret == -ENOENT) ? -EBADF : ret; + /* might be already done during nonblock submission */ + if (!close->put_file) { + ret = __close_fd_get_file(close->fd, &close->put_file); + if (ret < 0) + return (ret == -ENOENT) ? -EBADF : ret; + }
/* if the file has a flush method, be safe and punt to async */ - if (req->close.put_file->f_op->flush && force_nonblock) { + if (close->put_file->f_op->flush && force_nonblock) { /* avoid grabbing files - we don't need the files */ req->flags |= REQ_F_NO_FILE_TABLE | REQ_F_MUST_PUNT; - req->work.func = io_close_finish; return -EAGAIN; }
- /* - * No ->flush(), safely close from here and just punt the - * fput() to async context. - */ - __io_close_finish(req); + /* No ->flush() or already async, safely close from here */ + ret = filp_close(close->put_file, req->work.files); + if (ret < 0) + req_set_fail_links(req); + io_cqring_add_event(req, ret); + fput(close->put_file); + close->put_file = NULL; + io_put_req(req); return 0; }
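The shape of the fix is a general one: a handler that may run twice (a nonblocking attempt, then a blocking retry) caches the result of its first step and skips it on re-entry. A minimal, self-contained user-space analogue (the names here are illustrative, not the kernel's):

	#include <stdbool.h>
	#include <stdio.h>

	struct close_op { FILE *f; };   /* f plays the role of close.put_file */

	/* Returns -1 to ask the caller to retry in a blocking context. */
	static int do_close(struct close_op *op, const char *path, bool nonblock)
	{
		if (!op->f) {                   /* might already be done from the
		                                 * earlier nonblocking attempt */
			op->f = fopen(path, "r");
			if (!op->f)
				return -2;
		}
		if (nonblock)                   /* pretend the final step blocks */
			return -1;
		int ret = fclose(op->f);
		op->f = NULL;                   /* stays idempotent on repeat calls */
		return ret;
	}

	int main(void)
	{
		struct close_op op = { 0 };
		if (do_close(&op, "/etc/hostname", true) == -1)   /* punted */
			printf("blocking retry: %d\n",
			       do_close(&op, "/etc/hostname", false));
		return 0;
	}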
From: Pavel Begunkov asml.silence@gmail.com
mainline inclusion from mainline-5.8-rc1 commit ac45abc0e2a8ed16ecc0eea039fe762ddfefbcad category: feature bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27 CVE: NA ---------------------------
In preparation for getting rid of work.func, remove almost all custom instances of it, leaving only io_wq_submit_work() and io_link_work_cb(). The latter will be dealt with later.

Nothing fancy: routinely remove each *_finish() function and inline what's left, e.g. remove io_fsync_finish() and inline __io_fsync() into io_fsync().

As no users of io_req_cancelled() are left, delete it as well. The patch adds an extra switch lookup on a cold-ish path, but that is outweighed by the nice diffstat and the other benefits for the following patches.
Signed-off-by: Pavel Begunkov asml.silence@gmail.com Signed-off-by: Jens Axboe axboe@kernel.dk Signed-off-by: yangerkun yangerkun@huawei.com Reviewed-by: zhangyi (F) yi.zhang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- fs/io_uring.c | 139 ++++++++++---------------------------------------- 1 file changed, 27 insertions(+), 112 deletions(-)
diff --git a/fs/io_uring.c b/fs/io_uring.c index a05efcec5d02..7e0cea0ffde8 100644 --- a/fs/io_uring.c +++ b/fs/io_uring.c @@ -2896,23 +2896,15 @@ static int io_prep_fsync(struct io_kiocb *req, const struct io_uring_sqe *sqe) return 0; }
-static bool io_req_cancelled(struct io_kiocb *req) -{ - if (req->work.flags & IO_WQ_WORK_CANCEL) { - req_set_fail_links(req); - io_cqring_add_event(req, -ECANCELED); - io_put_req(req); - return true; - } - - return false; -} - -static void __io_fsync(struct io_kiocb *req) +static int io_fsync(struct io_kiocb *req, bool force_nonblock) { loff_t end = req->sync.off + req->sync.len; int ret;
+ /* fsync always requires a blocking context */ + if (force_nonblock) + return -EAGAIN; + ret = vfs_fsync_range(req->file, req->sync.off, end > 0 ? end : LLONG_MAX, req->sync.flags & IORING_FSYNC_DATASYNC); @@ -2920,53 +2912,9 @@ static void __io_fsync(struct io_kiocb *req) req_set_fail_links(req); io_cqring_add_event(req, ret); io_put_req(req); -} - -static void io_fsync_finish(struct io_wq_work **workptr) -{ - struct io_kiocb *req = container_of(*workptr, struct io_kiocb, work); - - if (io_req_cancelled(req)) - return; - __io_fsync(req); - io_steal_work(req, workptr); -} - -static int io_fsync(struct io_kiocb *req, bool force_nonblock) -{ - /* fsync always requires a blocking context */ - if (force_nonblock) { - req->work.func = io_fsync_finish; - return -EAGAIN; - } - __io_fsync(req); return 0; }
-static void __io_fallocate(struct io_kiocb *req) -{ - int ret; - - current->signal->rlim[RLIMIT_FSIZE].rlim_cur = req->fsize; - ret = vfs_fallocate(req->file, req->sync.mode, req->sync.off, - req->sync.len); - current->signal->rlim[RLIMIT_FSIZE].rlim_cur = RLIM_INFINITY; - if (ret < 0) - req_set_fail_links(req); - io_cqring_add_event(req, ret); - io_put_req(req); -} - -static void io_fallocate_finish(struct io_wq_work **workptr) -{ - struct io_kiocb *req = container_of(*workptr, struct io_kiocb, work); - - if (io_req_cancelled(req)) - return; - __io_fallocate(req); - io_steal_work(req, workptr); -} - static int io_fallocate_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe) { @@ -2984,13 +2932,20 @@ static int io_fallocate_prep(struct io_kiocb *req,
static int io_fallocate(struct io_kiocb *req, bool force_nonblock) { + int ret; + /* fallocate always requiring blocking context */ - if (force_nonblock) { - req->work.func = io_fallocate_finish; + if (force_nonblock) return -EAGAIN; - }
- __io_fallocate(req); + current->signal->rlim[RLIMIT_FSIZE].rlim_cur = req->fsize; + ret = vfs_fallocate(req->file, req->sync.mode, req->sync.off, + req->sync.len); + current->signal->rlim[RLIMIT_FSIZE].rlim_cur = RLIM_INFINITY; + if (ret < 0) + req_set_fail_links(req); + io_cqring_add_event(req, ret); + io_put_req(req); return 0; }
@@ -3453,38 +3408,20 @@ static int io_prep_sfr(struct io_kiocb *req, const struct io_uring_sqe *sqe) return 0; }
-static void __io_sync_file_range(struct io_kiocb *req) +static int io_sync_file_range(struct io_kiocb *req, bool force_nonblock) { int ret;
+ /* sync_file_range always requires a blocking context */ + if (force_nonblock) + return -EAGAIN; + ret = sync_file_range(req->file, req->sync.off, req->sync.len, req->sync.flags); if (ret < 0) req_set_fail_links(req); io_cqring_add_event(req, ret); io_put_req(req); -} - - -static void io_sync_file_range_finish(struct io_wq_work **workptr) -{ - struct io_kiocb *req = container_of(*workptr, struct io_kiocb, work); - - if (io_req_cancelled(req)) - return; - __io_sync_file_range(req); - io_steal_work(req, workptr); -} - -static int io_sync_file_range(struct io_kiocb *req, bool force_nonblock) -{ - /* sync_file_range always requires a blocking context */ - if (force_nonblock) { - req->work.func = io_sync_file_range_finish; - return -EAGAIN; - } - - __io_sync_file_range(req); return 0; }
@@ -3906,49 +3843,27 @@ static int io_accept_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe) return 0; }
-static int __io_accept(struct io_kiocb *req, bool force_nonblock) +static int io_accept(struct io_kiocb *req, bool force_nonblock) { struct io_accept *accept = &req->accept; - unsigned file_flags; + unsigned int file_flags = force_nonblock ? O_NONBLOCK : 0; int ret;
- file_flags = force_nonblock ? O_NONBLOCK : 0; ret = __sys_accept4_file(req->file, file_flags, accept->addr, accept->addr_len, accept->flags, accept->nofile); if (ret == -EAGAIN && force_nonblock) return -EAGAIN; - if (ret == -ERESTARTSYS) - ret = -EINTR; - if (ret < 0) + if (ret < 0) { + if (ret == -ERESTARTSYS) + ret = -EINTR; req_set_fail_links(req); + } io_cqring_add_event(req, ret); io_put_req(req); return 0; }
-static void io_accept_finish(struct io_wq_work **workptr) -{ - struct io_kiocb *req = container_of(*workptr, struct io_kiocb, work); - - if (io_req_cancelled(req)) - return; - __io_accept(req, false); - io_steal_work(req, workptr); -} - -static int io_accept(struct io_kiocb *req, bool force_nonblock) -{ - int ret; - - ret = __io_accept(req, force_nonblock); - if (ret == -EAGAIN && force_nonblock) { - req->work.func = io_accept_finish; - return -EAGAIN; - } - return 0; -} - static int io_connect_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe) { struct io_connect *conn = &req->connect;
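The convention this patch converges on can be shown in a small self-contained sketch (user-space illustration, not the kernel code): every handler takes force_nonblock, returns -EAGAIN when it needs a blocking context, and the single worker entry point simply calls it again in blocking mode, with no per-op callback.

	#include <errno.h>
	#include <stdbool.h>
	#include <stdio.h>

	typedef int (*op_handler)(bool force_nonblock);

	static int op_fsync(bool force_nonblock)
	{
		/* fsync always requires a blocking context */
		if (force_nonblock)
			return -EAGAIN;
		puts("fsync done in blocking context");
		return 0;
	}

	static void worker_submit(op_handler op)
	{
		op(false);              /* one generic entry point, no op->func */
	}

	int main(void)
	{
		if (op_fsync(true) == -EAGAIN)   /* inline submission path */
			worker_submit(op_fsync);     /* punt to the worker */
		return 0;
	}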
From: Pavel Begunkov asml.silence@gmail.com
mainline inclusion from mainline-5.8-rc1 commit d4c81f38522f3e7f4be1b472ef9988d0ed7f3696 category: feature bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27 CVE: NA ---------------------------
Remove io_link_work_cb(), the last custom work.func. Not the prettiest thing, but it works. Instead of queueing a linked timeout in io_link_work_cb(), mark the request with REQ_F_QUEUE_TIMEOUT and queue the timeout based on that flag in io_wq_submit_work().
Signed-off-by: Pavel Begunkov asml.silence@gmail.com Signed-off-by: Jens Axboe axboe@kernel.dk Signed-off-by: yangerkun yangerkun@huawei.com Reviewed-by: zhangyi (F) yi.zhang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- fs/io_uring.c | 29 ++++++++++++++++++----------- 1 file changed, 18 insertions(+), 11 deletions(-)
diff --git a/fs/io_uring.c b/fs/io_uring.c index 7e0cea0ffde8..6c7546822f3e 100644 --- a/fs/io_uring.c +++ b/fs/io_uring.c @@ -544,6 +544,7 @@ enum { REQ_F_POLLED_BIT, REQ_F_BUFFER_SELECTED_BIT, REQ_F_NO_FILE_TABLE_BIT, + REQ_F_QUEUE_TIMEOUT_BIT,
/* not a real bit, just to check we're not overflowing the space */ __REQ_F_LAST_BIT, @@ -599,6 +600,8 @@ enum { REQ_F_BUFFER_SELECTED = BIT(REQ_F_BUFFER_SELECTED_BIT), /* doesn't need file table for this request */ REQ_F_NO_FILE_TABLE = BIT(REQ_F_NO_FILE_TABLE_BIT), + /* needs to queue linked timeout */ + REQ_F_QUEUE_TIMEOUT = BIT(REQ_F_QUEUE_TIMEOUT_BIT), };
struct async_poll { @@ -1578,16 +1581,6 @@ static void io_free_req(struct io_kiocb *req) io_queue_async_work(nxt); }
-static void io_link_work_cb(struct io_wq_work **workptr) -{ - struct io_kiocb *req = container_of(*workptr, struct io_kiocb, work); - struct io_kiocb *link; - - link = list_first_entry(&req->link_list, struct io_kiocb, link_list); - io_queue_linked_timeout(link); - io_wq_submit_work(workptr); -} - static void io_wq_assign_next(struct io_wq_work **workptr, struct io_kiocb *nxt) { struct io_kiocb *link; @@ -1599,7 +1592,7 @@ static void io_wq_assign_next(struct io_wq_work **workptr, struct io_kiocb *nxt) *workptr = &nxt->work; link = io_prep_linked_timeout(nxt); if (link) - nxt->work.func = io_link_work_cb; + nxt->flags |= REQ_F_QUEUE_TIMEOUT; }
/* @@ -5243,12 +5236,26 @@ static int io_issue_sqe(struct io_kiocb *req, const struct io_uring_sqe *sqe, return 0; }
+static void io_arm_async_linked_timeout(struct io_kiocb *req) +{ + struct io_kiocb *link; + + /* link head's timeout is queued in io_queue_async_work() */ + if (!(req->flags & REQ_F_QUEUE_TIMEOUT)) + return; + + link = list_first_entry(&req->link_list, struct io_kiocb, link_list); + io_queue_linked_timeout(link); +} + static void io_wq_submit_work(struct io_wq_work **workptr) { struct io_wq_work *work = *workptr; struct io_kiocb *req = container_of(work, struct io_kiocb, work); int ret = 0;
+ io_arm_async_linked_timeout(req); + /* if NO_CANCEL is set, we must still run the work */ if ((work->flags & (IO_WQ_WORK_CANCEL|IO_WQ_WORK_NO_CANCEL)) == IO_WQ_WORK_CANCEL) {
From: Pavel Begunkov asml.silence@gmail.com
mainline inclusion from mainline-5.8-rc1 commit f5fa38c59cb0b40633dee5cdf7465801be3e4928 category: feature bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27 CVE: NA ---------------------------
io_uring is the only user of io-wq, and now it uses only io-wq callback for all its requests, namely io_wq_submit_work(). Instead of storing work->runner callback in each instance of io_wq_work, keep it in io-wq itself.
pros:
- reduces io_wq_work size
- more robust: ->func can't be invalidated by a mem{cpy,set}(req)
- helps other work
Signed-off-by: Pavel Begunkov asml.silence@gmail.com Signed-off-by: Jens Axboe axboe@kernel.dk Signed-off-by: yangerkun yangerkun@huawei.com Reviewed-by: zhangyi (F) yi.zhang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- fs/io-wq.c | 10 ++++++---- fs/io-wq.h | 7 ++++--- fs/io_uring.c | 3 ++- 3 files changed, 12 insertions(+), 8 deletions(-)
diff --git a/fs/io-wq.c b/fs/io-wq.c index a2040d8d5819..e9692b47a342 100644 --- a/fs/io-wq.c +++ b/fs/io-wq.c @@ -113,6 +113,7 @@ struct io_wq { unsigned long state;
free_work_fn *free_work; + io_wq_work_fn *do_work;
struct task_struct *manager; struct user_struct *user; @@ -529,7 +530,7 @@ static void io_worker_handle_work(struct io_worker *worker)
hash = io_get_work_hash(work); linked = old_work = work; - linked->func(&linked); + wq->do_work(&linked); linked = (old_work == linked) ? NULL : linked;
work = next_hashed; @@ -786,7 +787,7 @@ static void io_run_cancel(struct io_wq_work *work, struct io_wqe *wqe) struct io_wq_work *old_work = work;
work->flags |= IO_WQ_WORK_CANCEL; - work->func(&work); + wq->do_work(&work); work = (work == old_work) ? NULL : work; wq->free_work(old_work); } while (work); @@ -1024,7 +1025,7 @@ struct io_wq *io_wq_create(unsigned bounded, struct io_wq_data *data) int ret = -ENOMEM, node; struct io_wq *wq;
- if (WARN_ON_ONCE(!data->free_work)) + if (WARN_ON_ONCE(!data->free_work || !data->do_work)) return ERR_PTR(-EINVAL);
wq = kzalloc(sizeof(*wq), GFP_KERNEL); @@ -1038,6 +1039,7 @@ struct io_wq *io_wq_create(unsigned bounded, struct io_wq_data *data) }
wq->free_work = data->free_work; + wq->do_work = data->do_work;
/* caller must already hold a reference to this */ wq->user = data->user; @@ -1094,7 +1096,7 @@ struct io_wq *io_wq_create(unsigned bounded, struct io_wq_data *data)
bool io_wq_get(struct io_wq *wq, struct io_wq_data *data) { - if (data->free_work != wq->free_work) + if (data->free_work != wq->free_work || data->do_work != wq->do_work) return false;
return refcount_inc_not_zero(&wq->use_refs); diff --git a/fs/io-wq.h b/fs/io-wq.h index 5ba12de7572f..2db24d31fbc5 100644 --- a/fs/io-wq.h +++ b/fs/io-wq.h @@ -85,7 +85,6 @@ static inline void wq_list_del(struct io_wq_work_list *list,
struct io_wq_work { struct io_wq_work_node list; - void (*func)(struct io_wq_work **); struct files_struct *files; struct mm_struct *mm; const struct cred *creds; @@ -94,9 +93,9 @@ struct io_wq_work { pid_t task_pid; };
-#define INIT_IO_WORK(work, _func) \ +#define INIT_IO_WORK(work) \ do { \ - *(work) = (struct io_wq_work){ .func = _func }; \ + *(work) = (struct io_wq_work){}; \ } while (0) \
static inline struct io_wq_work *wq_next_work(struct io_wq_work *work) @@ -108,10 +107,12 @@ static inline struct io_wq_work *wq_next_work(struct io_wq_work *work) }
typedef void (free_work_fn)(struct io_wq_work *); +typedef void (io_wq_work_fn)(struct io_wq_work **);
struct io_wq_data { struct user_struct *user;
+ io_wq_work_fn *do_work; free_work_fn *free_work; };
diff --git a/fs/io_uring.c b/fs/io_uring.c index 6c7546822f3e..e8cb9333f2a6 100644 --- a/fs/io_uring.c +++ b/fs/io_uring.c @@ -5728,7 +5728,7 @@ static int io_init_req(struct io_ring_ctx *ctx, struct io_kiocb *req, refcount_set(&req->refs, 2); req->task = NULL; req->result = 0; - INIT_IO_WORK(&req->work, io_wq_submit_work); + INIT_IO_WORK(&req->work);
if (unlikely(req->opcode >= IORING_OP_LAST)) return -EINVAL; @@ -6748,6 +6748,7 @@ static int io_init_wq_offload(struct io_ring_ctx *ctx,
data.user = ctx->user; data.free_work = io_free_work; + data.do_work = io_wq_submit_work;
if (!(p->flags & IORING_SETUP_ATTACH_WQ)) { /* Do QD, or 4 * CPUS, whatever is smallest */
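The design point generalizes: when every item in a queue uses the same callback, storing it once in the queue is both smaller and safer than storing it in each item. A minimal sketch under that assumption (illustrative names, not the kernel's):

	#include <stdio.h>

	struct work { struct work *next; int payload; };

	struct wq {
		struct work *head;
		void (*do_work)(struct work *);   /* one callback for all items */
	};

	static void run_all(struct wq *q)
	{
		/* memset/memcpy of an item can no longer clobber a function
		 * pointer, because the item doesn't carry one */
		for (struct work *w = q->head; w; w = w->next)
			q->do_work(w);
	}

	static void submit_work(struct work *w) { printf("work %d\n", w->payload); }

	int main(void)
	{
		struct work b = { NULL, 2 }, a = { &b, 1 };
		struct wq q = { &a, submit_work };
		run_all(&q);
		return 0;
	}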
From: Jens Axboe axboe@kernel.dk
mainline inclusion from mainline-5.8-rc1 commit c5b856255cbc3b664d686a83fa9397a835e063de category: feature bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27 CVE: NA ---------------------------
We can assume that O_NONBLOCK is always honored, even if we don't have a ->read/write_iter() for the file type. Also unify the read/write checking for allowing async punt, with the write side factoring in the REQ_F_NOWAIT flag as well.
Cc: stable@vger.kernel.org Fixes: 490e89676a52 ("io_uring: only force async punt if poll based retry can't handle it") Signed-off-by: Jens Axboe axboe@kernel.dk Signed-off-by: yangerkun yangerkun@huawei.com Reviewed-by: zhangyi (F) yi.zhang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- fs/io_uring.c | 10 +++++++--- 1 file changed, 7 insertions(+), 3 deletions(-)
diff --git a/fs/io_uring.c b/fs/io_uring.c index e8cb9333f2a6..748e0f62aefe 100644 --- a/fs/io_uring.c +++ b/fs/io_uring.c @@ -2059,6 +2059,10 @@ static bool io_file_supports_async(struct file *file, int rw) if (S_ISREG(mode) && file->f_op != &io_uring_fops) return true;
+ /* any ->read/write should understand O_NONBLOCK */ + if (file->f_flags & O_NONBLOCK) + return true; + if (!(file->f_mode & FMODE_NOWAIT)) return false;
@@ -2101,8 +2105,7 @@ static int io_prep_rw(struct io_kiocb *req, const struct io_uring_sqe *sqe, kiocb->ki_ioprio = get_current_ioprio();
/* don't allow async punt if RWF_NOWAIT was requested */ - if ((kiocb->ki_flags & IOCB_NOWAIT) || - (req->file->f_flags & O_NONBLOCK)) + if (kiocb->ki_flags & IOCB_NOWAIT) req->flags |= REQ_F_NOWAIT;
if (force_nonblock) @@ -2743,7 +2746,8 @@ static int io_write(struct io_kiocb *req, bool force_nonblock) if (ret) goto out_free; /* any defer here is final, must blocking retry */ - if (!file_can_poll(req->file)) + if (!(req->flags & REQ_F_NOWAIT) && + !file_can_poll(req->file)) req->flags |= REQ_F_MUST_PUNT; return -EAGAIN; }
From: Pavel Begunkov asml.silence@gmail.com
mainline inclusion from mainline-5.8-rc2 commit 801dd57bd1d8c2c253f43635a3045bfa32a810b3 category: feature bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27 CVE: NA ---------------------------
When a process is exiting, io_uring tries to cancel all of its inflight requests. Use req->task to match them instead of work.pid. We always have req->task set, and it is guaranteed valid because we only match against the currently exiting task.

Also remove work.pid and everything related to it; it's useless now.
Reported-by: Eric W. Biederman ebiederm@xmission.com Signed-off-by: Pavel Begunkov asml.silence@gmail.com Signed-off-by: Jens Axboe axboe@kernel.dk Signed-off-by: yangerkun yangerkun@huawei.com Reviewed-by: zhangyi (F) yi.zhang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- fs/io-wq.h | 1 - fs/io_uring.c | 16 ++++++---------- 2 files changed, 6 insertions(+), 11 deletions(-)
diff --git a/fs/io-wq.h b/fs/io-wq.h index b72538fe5afd..071f1a997800 100644 --- a/fs/io-wq.h +++ b/fs/io-wq.h @@ -90,7 +90,6 @@ struct io_wq_work { const struct cred *creds; struct fs_struct *fs; unsigned flags; - pid_t task_pid; };
static inline struct io_wq_work *wq_next_work(struct io_wq_work *work) diff --git a/fs/io_uring.c b/fs/io_uring.c index cb032f2730a8..2639dcc4945e 100644 --- a/fs/io_uring.c +++ b/fs/io_uring.c @@ -1062,8 +1062,6 @@ static inline void io_req_work_grab_env(struct io_kiocb *req, } spin_unlock(¤t->fs->lock); } - if (!req->work.task_pid) - req->work.task_pid = task_pid_vnr(current); }
static inline void io_req_work_drop_env(struct io_kiocb *req) @@ -7409,11 +7407,12 @@ static void io_uring_cancel_files(struct io_ring_ctx *ctx, } }
-static bool io_cancel_pid_cb(struct io_wq_work *work, void *data) +static bool io_cancel_task_cb(struct io_wq_work *work, void *data) { - pid_t pid = (pid_t) (unsigned long) data; + struct io_kiocb *req = container_of(work, struct io_kiocb, work); + struct task_struct *task = data;
- return work->task_pid == pid; + return req->task == task; }
static int io_uring_flush(struct file *file, void *data) @@ -7425,11 +7424,8 @@ static int io_uring_flush(struct file *file, void *data) /* * If the task is going away, cancel work it may have pending */ - if (fatal_signal_pending(current) || (current->flags & PF_EXITING)) { - void *data = (void *) (unsigned long)task_pid_vnr(current); - - io_wq_cancel_cb(ctx->io_wq, io_cancel_pid_cb, data, true); - } + if (fatal_signal_pending(current) || (current->flags & PF_EXITING)) + io_wq_cancel_cb(ctx->io_wq, io_cancel_task_cb, current, true);
return 0; }
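The matcher change is easy to see in isolation: cancellation walks the queue with a predicate, and comparing the task pointer itself is both cheaper and more reliable than re-deriving a pid per work item. A self-contained sketch of that predicate pattern (illustrative, not the kernel code):

	#include <stdbool.h>
	#include <stdio.h>

	struct task { int id; };
	struct work { struct work *next; struct task *task; };

	typedef bool (*match_fn)(struct work *, void *);

	static bool cancel_task_cb(struct work *w, void *data)
	{
		return w->task == (struct task *)data;   /* pointer identity, no pid */
	}

	static int cancel_matching(struct work *head, match_fn match, void *data)
	{
		int n = 0;
		for (struct work *w = head; w; w = w->next)
			if (match(w, data))
				n++;                             /* cancellation would go here */
		return n;
	}

	int main(void)
	{
		struct task t1 = { 1 }, t2 = { 2 };
		struct work b = { NULL, &t2 }, a = { &b, &t1 };
		printf("matched %d work item(s)\n",
		       cancel_matching(&a, cancel_task_cb, &t1));
		return 0;
	}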
From: Xiaoguang Wang xiaoguang.wang@linux.alibaba.com
mainline inclusion from mainline-5.8-rc2 commit 2d7d67920e5c8e0854df23ca77da2dd5880ce5dd category: feature bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27 CVE: NA ---------------------------
In IOPOLL mode, on an -EAGAIN error we'll try to submit the io request again through io-wq, so don't fail the rest of the chain if this io request has links.
Cc: stable@vger.kernel.org Signed-off-by: Xiaoguang Wang xiaoguang.wang@linux.alibaba.com Signed-off-by: Jens Axboe axboe@kernel.dk Signed-off-by: yangerkun yangerkun@huawei.com Reviewed-by: zhangyi (F) yi.zhang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- fs/io_uring.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/fs/io_uring.c b/fs/io_uring.c index 2639dcc4945e..b99c64bcfbdc 100644 --- a/fs/io_uring.c +++ b/fs/io_uring.c @@ -1987,7 +1987,7 @@ static void io_complete_rw_iopoll(struct kiocb *kiocb, long res, long res2) if (kiocb->ki_flags & IOCB_WRITE) kiocb_end_write(req);
- if (res != req->result) + if (res != -EAGAIN && res != req->result) req_set_fail_links(req); req->result = res; if (res != -EAGAIN)
From: Xiaoguang Wang xiaoguang.wang@linux.alibaba.com
mainline inclusion from mainline-5.8-rc2 commit bbde017a32b32d2fa8d5fddca25fade20132abf8 category: feature bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27 CVE: NA ---------------------------
In io_complete_rw_iopoll(), the stores to an io_kiocb's result and iopoll_completed are two independent store operations. To ensure that once iopoll_completed is true, req->result is also visible to the cpu executing io_do_iopoll(), a proper memory barrier is needed.

And in io_do_iopoll(), we check whether req->result is -EAGAIN; if it is, we need to issue this io request again through io-wq. So that only a single smp_rmb() is needed on the completion side, move the re-submit work into io_iopoll_complete().
Cc: stable@vger.kernel.org Signed-off-by: Xiaoguang Wang xiaoguang.wang@linux.alibaba.com [axboe: don't set ->iopoll_completed for -EAGAIN retry] Signed-off-by: Jens Axboe axboe@kernel.dk Signed-off-by: yangerkun yangerkun@huawei.com Reviewed-by: zhangyi (F) yi.zhang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- fs/io_uring.c | 53 ++++++++++++++++++++++++++++----------------------- 1 file changed, 29 insertions(+), 24 deletions(-)
diff --git a/fs/io_uring.c b/fs/io_uring.c index b99c64bcfbdc..9a27b2224f30 100644 --- a/fs/io_uring.c +++ b/fs/io_uring.c @@ -1741,6 +1741,18 @@ static int io_put_kbuf(struct io_kiocb *req) return cflags; }
+static void io_iopoll_queue(struct list_head *again) +{ + struct io_kiocb *req; + + do { + req = list_first_entry(again, struct io_kiocb, list); + list_del(&req->list); + refcount_inc(&req->refs); + io_queue_async_work(req); + } while (!list_empty(again)); +} + /* * Find and free completed poll iocbs */ @@ -1749,12 +1761,21 @@ static void io_iopoll_complete(struct io_ring_ctx *ctx, unsigned int *nr_events, { struct req_batch rb; struct io_kiocb *req; + LIST_HEAD(again); + + /* order with ->result store in io_complete_rw_iopoll() */ + smp_rmb();
rb.to_free = rb.need_iter = 0; while (!list_empty(done)) { int cflags = 0;
req = list_first_entry(done, struct io_kiocb, list); + if (READ_ONCE(req->result) == -EAGAIN) { + req->iopoll_completed = 0; + list_move_tail(&req->list, &again); + continue; + } list_del(&req->list);
if (req->flags & REQ_F_BUFFER_SELECTED) @@ -1772,18 +1793,9 @@ static void io_iopoll_complete(struct io_ring_ctx *ctx, unsigned int *nr_events, if (ctx->flags & IORING_SETUP_SQPOLL) io_cqring_ev_posted(ctx); io_free_req_many(ctx, &rb); -} - -static void io_iopoll_queue(struct list_head *again) -{ - struct io_kiocb *req;
- do { - req = list_first_entry(again, struct io_kiocb, list); - list_del(&req->list); - refcount_inc(&req->refs); - io_queue_async_work(req); - } while (!list_empty(again)); + if (!list_empty(&again)) + io_iopoll_queue(&again); }
static int io_do_iopoll(struct io_ring_ctx *ctx, unsigned int *nr_events, @@ -1791,7 +1803,6 @@ static int io_do_iopoll(struct io_ring_ctx *ctx, unsigned int *nr_events, { struct io_kiocb *req, *tmp; LIST_HEAD(done); - LIST_HEAD(again); bool spin; int ret;
@@ -1817,13 +1828,6 @@ static int io_do_iopoll(struct io_ring_ctx *ctx, unsigned int *nr_events, if (!list_empty(&done)) break;
- if (req->result == -EAGAIN) { - list_move_tail(&req->list, &again); - continue; - } - if (!list_empty(&again)) - break; - ret = kiocb->ki_filp->f_op->iopoll(kiocb, spin); if (ret < 0) break; @@ -1836,9 +1840,6 @@ static int io_do_iopoll(struct io_ring_ctx *ctx, unsigned int *nr_events, if (!list_empty(&done)) io_iopoll_complete(ctx, nr_events, &done);
- if (!list_empty(&again)) - io_iopoll_queue(&again); - return ret; }
@@ -1989,9 +1990,13 @@ static void io_complete_rw_iopoll(struct kiocb *kiocb, long res, long res2)
if (res != -EAGAIN && res != req->result) req_set_fail_links(req); - req->result = res; - if (res != -EAGAIN) + + WRITE_ONCE(req->result, res); + /* order with io_poll_complete() checking ->result */ + if (res != -EAGAIN) { + smp_wmb(); WRITE_ONCE(req->iopoll_completed, 1); + } }
/*
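The ordering being enforced here has a direct user-space analogue in C11 atomics (a sketch assuming C11 threads, e.g. glibc 2.28+; the kernel patch expresses the same publish/observe pairing with smp_wmb()/smp_rmb()): the completion side writes the result before release-storing the completed flag, and the reaping side acquire-loads the flag before reading the result.

	#include <stdatomic.h>
	#include <stdio.h>
	#include <threads.h>

	static int result;              /* plain payload, like req->result */
	static atomic_bool completed;   /* like req->iopoll_completed */

	static int completer(void *arg)
	{
		result = -11;                                        /* e.g. -EAGAIN */
		/* publish: result must be visible before the flag */
		atomic_store_explicit(&completed, true, memory_order_release);
		return 0;
	}

	int main(void)
	{
		thrd_t t;
		thrd_create(&t, completer, NULL);
		/* observe: the flag must be read before the result */
		while (!atomic_load_explicit(&completed, memory_order_acquire))
			;                     /* spin, like io_do_iopoll() polling */
		printf("result = %d\n", result);   /* guaranteed visible */
		thrd_join(t, NULL);
		return 0;
	}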
From: Jens Axboe axboe@kernel.dk
mainline inclusion from mainline-5.8-rc2 commit 56952e91acc93ed624fe9da840900defb75f1323 category: feature bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27 CVE: NA ---------------------------
If we're doing polled IO and end up having requests being submitted async, then completions can come in while we're waiting for refs to drop. We need to reap these manually, as nobody else will be looking for them.
Break the wait into 1/20th-of-a-second waits, and check for completed poll requests if we time out. Otherwise we can have completed poll requests sitting in ctx->poll_list that need us to reap them, while we're just waiting for them.
Cc: stable@vger.kernel.org Signed-off-by: Jens Axboe axboe@kernel.dk Signed-off-by: yangerkun yangerkun@huawei.com Reviewed-by: zhangyi (F) yi.zhang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- fs/io_uring.c | 12 +++++++++++- 1 file changed, 11 insertions(+), 1 deletion(-)
diff --git a/fs/io_uring.c b/fs/io_uring.c index e2a2191c9f53..41db322af299 100644 --- a/fs/io_uring.c +++ b/fs/io_uring.c @@ -7321,7 +7321,17 @@ static void io_ring_exit_work(struct work_struct *work) if (ctx->rings) io_cqring_overflow_flush(ctx, true);
- wait_for_completion(&ctx->ref_comp); + /* + * If we're doing polled IO and end up having requests being + * submitted async (out-of-line), then completions can come in while + * we're waiting for refs to drop. We need to reap these manually, + * as nobody else will be looking for them. + */ + while (!wait_for_completion_timeout(&ctx->ref_comp, HZ/20)) { + io_iopoll_reap_events(ctx); + if (ctx->rings) + io_cqring_overflow_flush(ctx, true); + } io_ring_ctx_free(ctx); }
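The wait-in-slices pattern itself is a common one: replace a single indefinite wait with short timed waits and do the maintenance work on each timeout. A self-contained pthread sketch (compile with -pthread; illustrative, not the kernel code):

	#include <pthread.h>
	#include <stdbool.h>
	#include <stdio.h>
	#include <time.h>
	#include <unistd.h>

	static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
	static pthread_cond_t cond = PTHREAD_COND_INITIALIZER;
	static bool done;

	static void *finisher(void *arg)          /* "refs drop" later on */
	{
		usleep(200 * 1000);
		pthread_mutex_lock(&lock);
		done = true;
		pthread_cond_signal(&cond);
		pthread_mutex_unlock(&lock);
		return NULL;
	}

	int main(void)
	{
		pthread_t t;
		pthread_create(&t, NULL, finisher, NULL);
		pthread_mutex_lock(&lock);
		while (!done) {
			struct timespec ts;
			clock_gettime(CLOCK_REALTIME, &ts);
			ts.tv_nsec += 50 * 1000 * 1000;    /* ~1/20th-of-a-second slice */
			if (ts.tv_nsec >= 1000000000) { ts.tv_sec++; ts.tv_nsec -= 1000000000; }
			if (pthread_cond_timedwait(&cond, &lock, &ts))
				printf("timeout: reap stragglers here\n");
		}
		pthread_mutex_unlock(&lock);
		pthread_join(t, NULL);
		return 0;
	}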
From: Xiaoguang Wang xiaoguang.wang@linux.alibaba.com
mainline inclusion from mainline-5.8-rc2 commit 6f2cc1664db20676069cff27a461ccc97dbfd114 category: feature bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27 CVE: NA ---------------------------
In io_read() or io_write(), when the io request is submitted successfully, it goes through the sequence below:

	kfree(iovec);
	req->flags &= ~REQ_F_NEED_CLEANUP;
	return ret;
But clearing REQ_F_NEED_CLEANUP might be unsafe. The io request may already have been completed, and then io_complete_rw_iopoll() and io_complete_rw() will be called, both of which will also modify req->flags if needed. This causes a race condition, with concurrent non-atomic modification of req->flags.
To eliminate this race, in io_read() or io_write(), if the io request is submitted successfully, we don't clear the REQ_F_NEED_CLEANUP flag. If REQ_F_NEED_CLEANUP is set, we leave the iovec cleanup work to __io_req_aux_free() instead.
Cc: stable@vger.kernel.org Signed-off-by: Xiaoguang Wang xiaoguang.wang@linux.alibaba.com Signed-off-by: Jens Axboe axboe@kernel.dk Signed-off-by: yangerkun yangerkun@huawei.com Reviewed-by: zhangyi (F) yi.zhang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- fs/io_uring.c | 8 ++++---- 1 file changed, 4 insertions(+), 4 deletions(-)
diff --git a/fs/io_uring.c b/fs/io_uring.c index 41db322af299..6856eec77aae 100644 --- a/fs/io_uring.c +++ b/fs/io_uring.c @@ -2669,8 +2669,8 @@ static int io_read(struct io_kiocb *req, bool force_nonblock) } } out_free: - kfree(iovec); - req->flags &= ~REQ_F_NEED_CLEANUP; + if (!(req->flags & REQ_F_NEED_CLEANUP)) + kfree(iovec); return ret; }
@@ -2792,8 +2792,8 @@ static int io_write(struct io_kiocb *req, bool force_nonblock) } } out_free: - req->flags &= ~REQ_F_NEED_CLEANUP; - kfree(iovec); + if (!(req->flags & REQ_F_NEED_CLEANUP)) + kfree(iovec); return ret; }
From: Jiufei Xue jiufei.xue@linux.alibaba.com
mainline inclusion from mainline-5.9-rc1 commit a31eb4a2f1650fa578082ad9e9845487ecd90abe category: feature bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27 CVE: NA ---------------------------
Applications can pass the EPOLLEXCLUSIVE flag in to avoid an accept thundering herd.
Signed-off-by: Jiufei Xue jiufei.xue@linux.alibaba.com Signed-off-by: Jens Axboe axboe@kernel.dk Signed-off-by: yangerkun yangerkun@huawei.com Reviewed-by: zhangyi (F) yi.zhang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- fs/io_uring.c | 9 +++++++-- 1 file changed, 7 insertions(+), 2 deletions(-)
diff --git a/fs/io_uring.c b/fs/io_uring.c index 78f67c41efd5..f7ffea95c907 100644 --- a/fs/io_uring.c +++ b/fs/io_uring.c @@ -4210,7 +4210,11 @@ static void __io_queue_proc(struct io_poll_iocb *poll, struct io_poll_table *pt,
pt->error = 0; poll->head = head; - add_wait_queue(head, &poll->wait); + + if (poll->events & EPOLLEXCLUSIVE) + add_wait_queue_exclusive(head, &poll->wait); + else + add_wait_queue(head, &poll->wait); }
static void io_async_queue_proc(struct file *file, struct wait_queue_head *head, @@ -4567,7 +4571,8 @@ static int io_poll_add_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe #ifdef __BIG_ENDIAN events = swahw32(events); #endif - poll->events = demangle_poll(events) | EPOLLERR | EPOLLHUP; + poll->events = demangle_poll(events) | EPOLLERR | EPOLLHUP | + (events & EPOLLEXCLUSIVE);
io_get_req_task(req); return 0;
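For context, EPOLLEXCLUSIVE is the existing epoll mechanism for waking only one of several waiters on a listening socket; this patch lets io_uring poll requests opt into the same exclusive wakeup. A minimal epoll usage sketch:

	#include <stdio.h>
	#include <sys/epoll.h>
	#include <sys/socket.h>

	int main(void)
	{
		int epfd = epoll_create1(0);
		int lfd = socket(AF_INET, SOCK_STREAM, 0);
		struct epoll_event ev = { .events = EPOLLIN | EPOLLEXCLUSIVE };
		ev.data.fd = lfd;
		/* wake only one of the waiters queued on lfd instead of all */
		if (epoll_ctl(epfd, EPOLL_CTL_ADD, lfd, &ev) == 0)
			puts("registered with EPOLLEXCLUSIVE");
		return 0;
	}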
From: Bijan Mottahedeh bijan.mottahedeh@oracle.com
mainline inclusion from mainline-5.9-rc1 commit a087e2b519929152fdde8299457e32d5a8994a7c category: feature bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27 CVE: NA ---------------------------
Facilitate separation of locked memory usage reporting vs. limiting for upcoming patches. No functional changes.
Signed-off-by: Bijan Mottahedeh bijan.mottahedeh@oracle.com [axboe: kill unnecessary () around return in io_account_mem()] Signed-off-by: Jens Axboe axboe@kernel.dk Conflicts: fs/io_uring.c [commit f1f6a7dd9b("mm, tree-wide: rename put_user_page*() to unpin_user_page*()) is not merged] Signed-off-by: yangerkun yangerkun@huawei.com Reviewed-by: zhangyi (F) yi.zhang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- fs/io_uring.c | 48 ++++++++++++++++++++++++++++-------------------- 1 file changed, 28 insertions(+), 20 deletions(-)
diff --git a/fs/io_uring.c b/fs/io_uring.c index f7ffea95c907..e4585dd74cb8 100644 --- a/fs/io_uring.c +++ b/fs/io_uring.c @@ -6925,12 +6925,14 @@ static int io_sq_offload_start(struct io_ring_ctx *ctx, return ret; }
-static void io_unaccount_mem(struct user_struct *user, unsigned long nr_pages) +static inline void __io_unaccount_mem(struct user_struct *user, + unsigned long nr_pages) { atomic_long_sub(nr_pages, &user->locked_vm); }
-static int io_account_mem(struct user_struct *user, unsigned long nr_pages) +static inline int __io_account_mem(struct user_struct *user, + unsigned long nr_pages) { unsigned long page_limit, cur_pages, new_pages;
@@ -6948,6 +6950,20 @@ static int io_account_mem(struct user_struct *user, unsigned long nr_pages) return 0; }
+static void io_unaccount_mem(struct io_ring_ctx *ctx, unsigned long nr_pages) +{ + if (ctx->account_mem) + __io_unaccount_mem(ctx->user, nr_pages); +} + +static int io_account_mem(struct io_ring_ctx *ctx, unsigned long nr_pages) +{ + if (ctx->account_mem) + return __io_account_mem(ctx->user, nr_pages); + + return 0; +} + static void io_mem_free(void *ptr) { struct page *page; @@ -7022,8 +7038,7 @@ static int io_sqe_buffer_unregister(struct io_ring_ctx *ctx) for (j = 0; j < imu->nr_bvecs; j++) put_page(imu->bvec[j].bv_page);
- if (ctx->account_mem) - io_unaccount_mem(ctx->user, imu->nr_bvecs); + io_unaccount_mem(ctx, imu->nr_bvecs); kvfree(imu->bvec); imu->nr_bvecs = 0; } @@ -7106,11 +7121,9 @@ static int io_sqe_buffer_register(struct io_ring_ctx *ctx, void __user *arg, start = ubuf >> PAGE_SHIFT; nr_pages = end - start;
- if (ctx->account_mem) { - ret = io_account_mem(ctx->user, nr_pages); - if (ret) - goto err; - } + ret = io_account_mem(ctx, nr_pages); + if (ret) + goto err;
ret = 0; if (!pages || nr_pages > got_pages) { @@ -7123,8 +7136,7 @@ static int io_sqe_buffer_register(struct io_ring_ctx *ctx, void __user *arg, GFP_KERNEL); if (!pages || !vmas) { ret = -ENOMEM; - if (ctx->account_mem) - io_unaccount_mem(ctx->user, nr_pages); + io_unaccount_mem(ctx, nr_pages); goto err; } got_pages = nr_pages; @@ -7134,8 +7146,7 @@ static int io_sqe_buffer_register(struct io_ring_ctx *ctx, void __user *arg, GFP_KERNEL); ret = -ENOMEM; if (!imu->bvec) { - if (ctx->account_mem) - io_unaccount_mem(ctx->user, nr_pages); + io_unaccount_mem(ctx, nr_pages); goto err; }
@@ -7167,8 +7178,7 @@ static int io_sqe_buffer_register(struct io_ring_ctx *ctx, void __user *arg, for (j = 0; j < pret; j++) put_page(pages[j]); } - if (ctx->account_mem) - io_unaccount_mem(ctx->user, nr_pages); + io_unaccount_mem(ctx, nr_pages); kvfree(imu->bvec); goto err; } @@ -7273,9 +7283,7 @@ static void io_ring_ctx_free(struct io_ring_ctx *ctx) io_mem_free(ctx->sq_sqes);
percpu_ref_exit(&ctx->refs); - if (ctx->account_mem) - io_unaccount_mem(ctx->user, - ring_pages(ctx->sq_entries, ctx->cq_entries)); + io_unaccount_mem(ctx, ring_pages(ctx->sq_entries, ctx->cq_entries)); free_uid(ctx->user); put_cred(ctx->creds); kfree(ctx->cancel_hash); @@ -7845,7 +7853,7 @@ static int io_uring_create(unsigned entries, struct io_uring_params *p, account_mem = !capable(CAP_IPC_LOCK);
if (account_mem) { - ret = io_account_mem(user, + ret = __io_account_mem(user, ring_pages(p->sq_entries, p->cq_entries)); if (ret) { free_uid(user); @@ -7856,7 +7864,7 @@ static int io_uring_create(unsigned entries, struct io_uring_params *p, ctx = io_ring_ctx_alloc(p); if (!ctx) { if (account_mem) - io_unaccount_mem(user, ring_pages(p->sq_entries, + __io_unaccount_mem(user, ring_pages(p->sq_entries, p->cq_entries)); free_uid(user); return -ENOMEM;
From: Bijan Mottahedeh bijan.mottahedeh@oracle.com
mainline inclusion from mainline-5.9-rc1 commit aad5d8da1b301fe399d65f2dcb84df2ec60caaa3 category: feature bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27 CVE: NA ---------------------------
Rename account_mem to limit_mem to clarify its purpose.
Signed-off-by: Bijan Mottahedeh bijan.mottahedeh@oracle.com Signed-off-by: Jens Axboe axboe@kernel.dk Signed-off-by: yangerkun yangerkun@huawei.com Reviewed-by: zhangyi (F) yi.zhang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- fs/io_uring.c | 16 ++++++++-------- 1 file changed, 8 insertions(+), 8 deletions(-)
diff --git a/fs/io_uring.c b/fs/io_uring.c index e4585dd74cb8..6e7d3d69010e 100644 --- a/fs/io_uring.c +++ b/fs/io_uring.c @@ -227,7 +227,7 @@ struct io_ring_ctx { struct { unsigned int flags; unsigned int compat: 1; - unsigned int account_mem: 1; + unsigned int limit_mem: 1; unsigned int cq_overflow_flushed: 1; unsigned int drain_next: 1; unsigned int eventfd_async: 1; @@ -6952,13 +6952,13 @@ static inline int __io_account_mem(struct user_struct *user,
static void io_unaccount_mem(struct io_ring_ctx *ctx, unsigned long nr_pages) { - if (ctx->account_mem) + if (ctx->limit_mem) __io_unaccount_mem(ctx->user, nr_pages); }
static int io_account_mem(struct io_ring_ctx *ctx, unsigned long nr_pages) { - if (ctx->account_mem) + if (ctx->limit_mem) return __io_account_mem(ctx->user, nr_pages);
return 0; @@ -7811,7 +7811,7 @@ static int io_uring_create(unsigned entries, struct io_uring_params *p, { struct user_struct *user = NULL; struct io_ring_ctx *ctx; - bool account_mem; + bool limit_mem; int ret;
if (!entries) @@ -7850,9 +7850,9 @@ static int io_uring_create(unsigned entries, struct io_uring_params *p, }
user = get_uid(current_user()); - account_mem = !capable(CAP_IPC_LOCK); + limit_mem = !capable(CAP_IPC_LOCK);
- if (account_mem) { + if (limit_mem) { ret = __io_account_mem(user, ring_pages(p->sq_entries, p->cq_entries)); if (ret) { @@ -7863,14 +7863,14 @@ static int io_uring_create(unsigned entries, struct io_uring_params *p,
ctx = io_ring_ctx_alloc(p); if (!ctx) { - if (account_mem) + if (limit_mem) __io_unaccount_mem(user, ring_pages(p->sq_entries, p->cq_entries)); free_uid(user); return -ENOMEM; } ctx->compat = in_compat_syscall(); - ctx->account_mem = account_mem; + ctx->limit_mem = limit_mem; ctx->user = user; ctx->creds = get_current_cred();
From: Bijan Mottahedeh bijan.mottahedeh@oracle.com
mainline inclusion from mainline-5.9-rc1 commit 309758254ea62e07471abcaeca5b5c2173f4ebc2 category: feature bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27 CVE: NA ---------------------------
Report pinned memory usage always, regardless of whether locked memory limit is enforced.
Signed-off-by: Bijan Mottahedeh bijan.mottahedeh@oracle.com Signed-off-by: Jens Axboe axboe@kernel.dk Signed-off-by: yangerkun yangerkun@huawei.com Reviewed-by: zhangyi (F) yi.zhang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- fs/io_uring.c | 22 ++++++++++++++++++---- 1 file changed, 18 insertions(+), 4 deletions(-)
diff --git a/fs/io_uring.c b/fs/io_uring.c index 6e7d3d69010e..ab0918d498ea 100644 --- a/fs/io_uring.c +++ b/fs/io_uring.c @@ -6954,12 +6954,23 @@ static void io_unaccount_mem(struct io_ring_ctx *ctx, unsigned long nr_pages) { if (ctx->limit_mem) __io_unaccount_mem(ctx->user, nr_pages); + + if (ctx->sqo_mm) + atomic64_sub(nr_pages, &ctx->sqo_mm->pinned_vm); }
static int io_account_mem(struct io_ring_ctx *ctx, unsigned long nr_pages) { - if (ctx->limit_mem) - return __io_account_mem(ctx->user, nr_pages); + int ret; + + if (ctx->limit_mem) { + ret = __io_account_mem(ctx->user, nr_pages); + if (ret) + return ret; + } + + if (ctx->sqo_mm) + atomic64_add(nr_pages, &ctx->sqo_mm->pinned_vm);
return 0; } @@ -7262,8 +7273,10 @@ static void io_destroy_buffers(struct io_ring_ctx *ctx) static void io_ring_ctx_free(struct io_ring_ctx *ctx) { io_finish_async(ctx); - if (ctx->sqo_mm) + if (ctx->sqo_mm) { mmdrop(ctx->sqo_mm); + ctx->sqo_mm = NULL; + }
io_iopoll_reap_events(ctx); io_sqe_buffer_unregister(ctx); @@ -7870,7 +7883,6 @@ static int io_uring_create(unsigned entries, struct io_uring_params *p, return -ENOMEM; } ctx->compat = in_compat_syscall(); - ctx->limit_mem = limit_mem; ctx->user = user; ctx->creds = get_current_cred();
@@ -7918,6 +7930,8 @@ static int io_uring_create(unsigned entries, struct io_uring_params *p, goto err;
trace_io_uring_create(ret, ctx, p->sq_entries, p->cq_entries, p->flags); + io_account_mem(ctx, ring_pages(p->sq_entries, p->cq_entries)); + ctx->limit_mem = limit_mem; return ret; err: io_ring_ctx_wait_and_kill(ctx);
From: Bijan Mottahedeh bijan.mottahedeh@oracle.com
mainline inclusion from mainline-5.9-rc1 commit 2e0464d48f32a9e78e2aa85cbbedc77ecbb6ed60 category: feature bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27 CVE: NA ---------------------------
Ring pages are not pinned so it is more appropriate to report them as locked.
Signed-off-by: Bijan Mottahedeh bijan.mottahedeh@oracle.com Signed-off-by: Jens Axboe axboe@kernel.dk
Conflicts: fs/io_uring.c [note mm_struct->locked_vm is atomic_long_t, pinned_vm is unsigned long in 4.19. And commit f1f6a7dd9b("mm, tree-wide: rename put_user_page*() to unpin_user_page*()) is not merged] Signed-off-by: yangerkun yangerkun@huawei.com Reviewed-by: zhangyi (F) yi.zhang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- fs/io_uring.c | 43 ++++++++++++++++++++++++++++++------------- 1 file changed, 30 insertions(+), 13 deletions(-)
diff --git a/fs/io_uring.c b/fs/io_uring.c index ab0918d498ea..6243b3f802f3 100644 --- a/fs/io_uring.c +++ b/fs/io_uring.c @@ -880,6 +880,11 @@ static const struct io_op_def io_op_defs[] = { }, };
+enum io_mem_account { + ACCT_LOCKED, + ACCT_PINNED, +}; + static void io_wq_submit_work(struct io_wq_work **workptr); static void io_cqring_fill_event(struct io_kiocb *req, long res); static void io_put_req(struct io_kiocb *req); @@ -6950,16 +6955,22 @@ static inline int __io_account_mem(struct user_struct *user, return 0; }
-static void io_unaccount_mem(struct io_ring_ctx *ctx, unsigned long nr_pages) +static void io_unaccount_mem(struct io_ring_ctx *ctx, unsigned long nr_pages, + enum io_mem_account acct) { if (ctx->limit_mem) __io_unaccount_mem(ctx->user, nr_pages);
- if (ctx->sqo_mm) - atomic64_sub(nr_pages, &ctx->sqo_mm->pinned_vm); + if (ctx->sqo_mm) { + if (acct == ACCT_LOCKED) + atomic64_sub(nr_pages, &ctx->sqo_mm->locked_vm); + else if (acct == ACCT_PINNED) + ctx->sqo_mm->pinned_vm -= nr_pages; + } }
-static int io_account_mem(struct io_ring_ctx *ctx, unsigned long nr_pages) +static int io_account_mem(struct io_ring_ctx *ctx, unsigned long nr_pages, + enum io_mem_account acct) { int ret;
@@ -6969,8 +6980,12 @@ static int io_account_mem(struct io_ring_ctx *ctx, unsigned long nr_pages) return ret; }
- if (ctx->sqo_mm) - atomic64_add(nr_pages, &ctx->sqo_mm->pinned_vm); + if (ctx->sqo_mm) { + if (acct == ACCT_LOCKED) + atomic64_add(nr_pages, &ctx->sqo_mm->locked_vm); + else if (acct == ACCT_PINNED) + ctx->sqo_mm->pinned_vm += nr_pages; + }
return 0; } @@ -7049,7 +7064,7 @@ static int io_sqe_buffer_unregister(struct io_ring_ctx *ctx) for (j = 0; j < imu->nr_bvecs; j++) put_page(imu->bvec[j].bv_page);
- io_unaccount_mem(ctx, imu->nr_bvecs); + io_unaccount_mem(ctx, imu->nr_bvecs, ACCT_PINNED); kvfree(imu->bvec); imu->nr_bvecs = 0; } @@ -7132,7 +7147,7 @@ static int io_sqe_buffer_register(struct io_ring_ctx *ctx, void __user *arg, start = ubuf >> PAGE_SHIFT; nr_pages = end - start;
- ret = io_account_mem(ctx, nr_pages); + ret = io_account_mem(ctx, nr_pages, ACCT_PINNED); if (ret) goto err;
@@ -7147,7 +7162,7 @@ static int io_sqe_buffer_register(struct io_ring_ctx *ctx, void __user *arg, GFP_KERNEL); if (!pages || !vmas) { ret = -ENOMEM; - io_unaccount_mem(ctx, nr_pages); + io_unaccount_mem(ctx, nr_pages, ACCT_PINNED); goto err; } got_pages = nr_pages; @@ -7157,7 +7172,7 @@ static int io_sqe_buffer_register(struct io_ring_ctx *ctx, void __user *arg, GFP_KERNEL); ret = -ENOMEM; if (!imu->bvec) { - io_unaccount_mem(ctx, nr_pages); + io_unaccount_mem(ctx, nr_pages, ACCT_PINNED); goto err; }
@@ -7189,7 +7204,7 @@ static int io_sqe_buffer_register(struct io_ring_ctx *ctx, void __user *arg, for (j = 0; j < pret; j++) put_page(pages[j]); } - io_unaccount_mem(ctx, nr_pages); + io_unaccount_mem(ctx, nr_pages, ACCT_PINNED); kvfree(imu->bvec); goto err; } @@ -7296,7 +7311,8 @@ static void io_ring_ctx_free(struct io_ring_ctx *ctx) io_mem_free(ctx->sq_sqes);
percpu_ref_exit(&ctx->refs); - io_unaccount_mem(ctx, ring_pages(ctx->sq_entries, ctx->cq_entries)); + io_unaccount_mem(ctx, ring_pages(ctx->sq_entries, ctx->cq_entries), + ACCT_LOCKED); free_uid(ctx->user); put_cred(ctx->creds); kfree(ctx->cancel_hash); @@ -7930,7 +7946,8 @@ static int io_uring_create(unsigned entries, struct io_uring_params *p, goto err;
trace_io_uring_create(ret, ctx, p->sq_entries, p->cq_entries, p->flags); - io_account_mem(ctx, ring_pages(p->sq_entries, p->cq_entries)); + io_account_mem(ctx, ring_pages(p->sq_entries, p->cq_entries), + ACCT_LOCKED); ctx->limit_mem = limit_mem; return ret; err:
From: Pavel Begunkov asml.silence@gmail.com
mainline inclusion from mainline-5.9-rc1 commit e883a79d8ced8e123f8c4042a29a7524c39935ab category: feature bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27 CVE: NA ---------------------------
Renumber the IO_WQ flags so they occupy adjacent bits.
Signed-off-by: Pavel Begunkov asml.silence@gmail.com Signed-off-by: Jens Axboe axboe@kernel.dk Signed-off-by: yangerkun yangerkun@huawei.com Reviewed-by: zhangyi (F) yi.zhang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- fs/io-wq.h | 8 ++++---- 1 file changed, 4 insertions(+), 4 deletions(-)
diff --git a/fs/io-wq.h b/fs/io-wq.h index 071f1a997800..04239dfb12b0 100644 --- a/fs/io-wq.h +++ b/fs/io-wq.h @@ -5,10 +5,10 @@ struct io_wq;
enum { IO_WQ_WORK_CANCEL = 1, - IO_WQ_WORK_HASHED = 4, - IO_WQ_WORK_UNBOUND = 32, - IO_WQ_WORK_NO_CANCEL = 256, - IO_WQ_WORK_CONCURRENT = 512, + IO_WQ_WORK_HASHED = 2, + IO_WQ_WORK_UNBOUND = 4, + IO_WQ_WORK_NO_CANCEL = 8, + IO_WQ_WORK_CONCURRENT = 16,
IO_WQ_HASH_SHIFT = 24, /* upper 8 bits are used for hash key */ };
From: Pavel Begunkov asml.silence@gmail.com
mainline inclusion from mainline-5.9-rc1 commit f4db7182e0de981a3f1b356e0cf43c6815423055 category: feature bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27 CVE: NA ---------------------------
It's easier to return the next work item from ->do_work() than to pass it through an in-out argument. It looks nicer and is easier to reason about. Also, merge io_wq_assign_next() into its only user.
Signed-off-by: Pavel Begunkov asml.silence@gmail.com Signed-off-by: Jens Axboe axboe@kernel.dk Signed-off-by: yangerkun yangerkun@huawei.com Reviewed-by: zhangyi (F) yi.zhang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- fs/io-wq.c | 8 +++----- fs/io-wq.h | 2 +- fs/io_uring.c | 53 ++++++++++++++++++++------------------------------- 3 files changed, 25 insertions(+), 38 deletions(-)
diff --git a/fs/io-wq.c b/fs/io-wq.c index 4a98817bb436..5de32e97304a 100644 --- a/fs/io-wq.c +++ b/fs/io-wq.c @@ -529,9 +529,8 @@ static void io_worker_handle_work(struct io_worker *worker) work->flags |= IO_WQ_WORK_CANCEL;
hash = io_get_work_hash(work); - linked = old_work = work; - wq->do_work(&linked); - linked = (old_work == linked) ? NULL : linked; + old_work = work; + linked = wq->do_work(work);
work = next_hashed; if (!work && linked && !io_wq_is_hashed(linked)) { @@ -787,8 +786,7 @@ static void io_run_cancel(struct io_wq_work *work, struct io_wqe *wqe) struct io_wq_work *old_work = work;
work->flags |= IO_WQ_WORK_CANCEL; - wq->do_work(&work); - work = (work == old_work) ? NULL : work; + work = wq->do_work(work); wq->free_work(old_work); } while (work); } diff --git a/fs/io-wq.h b/fs/io-wq.h index 04239dfb12b0..114f12ec2d65 100644 --- a/fs/io-wq.h +++ b/fs/io-wq.h @@ -101,7 +101,7 @@ static inline struct io_wq_work *wq_next_work(struct io_wq_work *work) }
typedef void (free_work_fn)(struct io_wq_work *); -typedef void (io_wq_work_fn)(struct io_wq_work **); +typedef struct io_wq_work *(io_wq_work_fn)(struct io_wq_work *);
struct io_wq_data { struct user_struct *user; diff --git a/fs/io_uring.c b/fs/io_uring.c index 6243b3f802f3..22b5e9cb7024 100644 --- a/fs/io_uring.c +++ b/fs/io_uring.c @@ -885,7 +885,6 @@ enum io_mem_account { ACCT_PINNED, };
-static void io_wq_submit_work(struct io_wq_work **workptr); static void io_cqring_fill_event(struct io_kiocb *req, long res); static void io_put_req(struct io_kiocb *req); static void __io_double_put_req(struct io_kiocb *req); @@ -1619,20 +1618,6 @@ static void io_free_req(struct io_kiocb *req) io_queue_async_work(nxt); }
-static void io_wq_assign_next(struct io_wq_work **workptr, struct io_kiocb *nxt) -{ - struct io_kiocb *link; - const struct io_op_def *def = &io_op_defs[nxt->opcode]; - - if ((nxt->flags & REQ_F_ISREG) && def->hash_reg_file) - io_wq_hash_work(&nxt->work, file_inode(nxt->file)); - - *workptr = &nxt->work; - link = io_prep_linked_timeout(nxt); - if (link) - nxt->flags |= REQ_F_QUEUE_TIMEOUT; -} - /* * Drop reference to request, return next in chain (if there is one) if this * was the last reference to this request. @@ -1652,24 +1637,29 @@ static void io_put_req(struct io_kiocb *req) io_free_req(req); }
-static void io_steal_work(struct io_kiocb *req, - struct io_wq_work **workptr) +static struct io_wq_work *io_steal_work(struct io_kiocb *req) { + struct io_kiocb *link, *nxt = NULL; + /* - * It's in an io-wq worker, so there always should be at least - * one reference, which will be dropped in io_put_work() just - * after the current handler returns. - * - * It also means, that if the counter dropped to 1, then there is - * no asynchronous users left, so it's safe to steal the next work. + * A ref is owned by io-wq in which context we're. So, if that's the + * last one, it's safe to steal next work. False negatives are Ok, + * it just will be re-punted async in io_put_work() */ - if (refcount_read(&req->refs) == 1) { - struct io_kiocb *nxt = NULL; + if (refcount_read(&req->refs) != 1) + return NULL;
- io_req_find_next(req, &nxt); - if (nxt) - io_wq_assign_next(workptr, nxt); - } + io_req_find_next(req, &nxt); + if (!nxt) + return NULL; + + if ((nxt->flags & REQ_F_ISREG) && io_op_defs[nxt->opcode].hash_reg_file) + io_wq_hash_work(&nxt->work, file_inode(nxt->file)); + + link = io_prep_linked_timeout(nxt); + if (link) + nxt->flags |= REQ_F_QUEUE_TIMEOUT; + return &nxt->work; }
/* @@ -5347,9 +5337,8 @@ static void io_arm_async_linked_timeout(struct io_kiocb *req) io_queue_linked_timeout(link); }
-static void io_wq_submit_work(struct io_wq_work **workptr) +static struct io_wq_work *io_wq_submit_work(struct io_wq_work *work) { - struct io_wq_work *work = *workptr; struct io_kiocb *req = container_of(work, struct io_kiocb, work); int ret = 0;
@@ -5381,7 +5370,7 @@ static void io_wq_submit_work(struct io_wq_work **workptr) io_put_req(req); }
- io_steal_work(req, workptr); + return io_steal_work(req); }
static inline struct file *io_file_from_index(struct io_ring_ctx *ctx,
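The signature change can be boiled down to a few lines (a self-contained sketch with illustrative names, not the kernel code): returning the next item makes the "nothing to steal" case fall out naturally as NULL, where the old in-out pointer had to be conditionally mutated.

	#include <stdio.h>

	struct work { int id; struct work *next; };

	/* old style: void do_work(struct io_wq_work **workptr) mutated *workptr;
	 * new style: return the next item (or NULL) directly */
	static struct work *do_work(struct work *w)
	{
		printf("ran work %d\n", w->id);
		return w->next;               /* NULL when nothing to steal */
	}

	int main(void)
	{
		struct work b = { 2, NULL }, a = { 1, &b };
		for (struct work *w = &a; w; )
			w = do_work(w);
		return 0;
	}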
From: Xuan Zhuo xuanzhuo@linux.alibaba.com
mainline inclusion from mainline-5.8-rc3 commit b772f07add1c0b22e02c0f1e96f647560679d3a9 category: feature bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27 CVE: NA ---------------------------
When the user consumes and generates sqe at a fast rate, io_sqring_entries can always get sqe, and ret will not be equal to -EBUSY, so that io_sq_thread will never call cond_resched or schedule, and then we will get the following system error prompt:
rcu: INFO: rcu_sched self-detected stall on CPU or watchdog: BUG: soft lockup-CPU#23 stuck for 112s! [io_uring-sq:1863]
This patch decides whether to call cond_resched() by checking need_resched() on every loop iteration.
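For illustration, here is a minimal userspace analog of the fixed loop; need_resched_hint(), pending_entries() and submit() are made-up stand-ins for the kernel primitives, and sched_yield() plays the role of cond_resched():

#include <sched.h>
#include <stdbool.h>

/* Stand-ins for kernel primitives; purely illustrative. */
static bool need_resched_hint(void) { return true; }
static unsigned int pending_entries(void) { return 1; }
static void submit(unsigned int n) { (void)n; }

static void sq_thread_loop(void)
{
	for (;;) {
		unsigned int to_submit = pending_entries();

		/*
		 * The fix: take the idle/yield path not only when there is
		 * no work (or submission returned -EBUSY), but also when a
		 * reschedule is due, so a saturated SQ cannot pin the CPU.
		 */
		if (!to_submit || need_resched_hint()) {
			sched_yield();	/* cond_resched() in the kernel */
			continue;
		}
		submit(to_submit);
	}
}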
Suggested-by: Jens Axboe axboe@kernel.dk Signed-off-by: Xuan Zhuo xuanzhuo@linux.alibaba.com Signed-off-by: Jens Axboe axboe@kernel.dk Signed-off-by: yangerkun yangerkun@huawei.com Reviewed-by: zhangyi (F) yi.zhang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- fs/io_uring.c | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-)
diff --git a/fs/io_uring.c b/fs/io_uring.c index 22b5e9cb7024..c8525c70cc77 100644 --- a/fs/io_uring.c +++ b/fs/io_uring.c @@ -5969,7 +5969,7 @@ static int io_sq_thread(void *data) * If submit got -EBUSY, flag us as needing the application * to enter the kernel to reap and flush events. */ - if (!to_submit || ret == -EBUSY) { + if (!to_submit || ret == -EBUSY || need_resched()) { /* * Drop cur_mm before scheduling, we can't hold it for * long periods (or over schedule()). Do this before @@ -5985,7 +5985,7 @@ static int io_sq_thread(void *data) * more IO, we should wait for the application to * reap events and wake us up. */ - if (!list_empty(&ctx->poll_list) || + if (!list_empty(&ctx->poll_list) || need_resched() || (!time_after(jiffies, timeout) && ret != -EBUSY && !percpu_ref_is_dying(&ctx->refs))) { if (current->task_works)
From: Pavel Begunkov asml.silence@gmail.com
mainline inclusion from mainline-5.8-rc3 commit cd664b0e35cb1202f40c259a1a5ea791d18c879d category: feature bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27 CVE: NA ---------------------------
io_do_iopoll() won't do anything with a request unless req->iopoll_completed is set. So io_complete_rw_iopoll() has to set it, otherwise io_do_iopoll() will poll a file again and again even though the request of interest was completed a long time ago.
Also, remove -EAGAIN check from io_issue_sqe() as it races with the changed lines. The request will take the long way and be resubmitted from io_iopoll*().
Fixes: bbde017a32b3 ("io_uring: add memory barrier to synchronize io_kiocb's result and iopoll_completed") Signed-off-by: Pavel Begunkov asml.silence@gmail.com Signed-off-by: Jens Axboe axboe@kernel.dk Signed-off-by: yangerkun yangerkun@huawei.com Reviewed-by: zhangyi (F) yi.zhang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- fs/io_uring.c | 9 ++------- 1 file changed, 2 insertions(+), 7 deletions(-)
diff --git a/fs/io_uring.c b/fs/io_uring.c index c8525c70cc77..2d9419c24248 100644 --- a/fs/io_uring.c +++ b/fs/io_uring.c @@ -1988,10 +1988,8 @@ static void io_complete_rw_iopoll(struct kiocb *kiocb, long res, long res2)
WRITE_ONCE(req->result, res); /* order with io_poll_complete() checking ->result */ - if (res != -EAGAIN) { - smp_wmb(); - WRITE_ONCE(req->iopoll_completed, 1); - } + smp_wmb(); + WRITE_ONCE(req->iopoll_completed, 1); }
/* @@ -5309,9 +5307,6 @@ static int io_issue_sqe(struct io_kiocb *req, const struct io_uring_sqe *sqe, if ((ctx->flags & IORING_SETUP_IOPOLL) && req->file) { const bool in_async = io_wq_current_is_worker();
- if (req->result == -EAGAIN) - return -EAGAIN; - /* workqueue context doesn't hold uring_lock, grab it now */ if (in_async) mutex_lock(&ctx->uring_lock);
From: Pavel Begunkov asml.silence@gmail.com
mainline inclusion from mainline-5.8-rc3 commit d60b5fbc1ce8210759b568da49d149b868e7c6d3 category: feature bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27 CVE: NA ---------------------------
Don't reissue requests from io_iopoll_reap_events(); the task may no longer have an mm, which ends up dereferencing NULL. It's better to kill everything off on exit anyway.
[ 677.734670] RIP: 0010:io_iopoll_complete+0x27e/0x630 ... [ 677.734679] Call Trace: [ 677.734695] ? __send_signal+0x1f2/0x420 [ 677.734698] ? _raw_spin_unlock_irqrestore+0x24/0x40 [ 677.734699] ? send_signal+0xf5/0x140 [ 677.734700] io_iopoll_getevents+0x12f/0x1a0 [ 677.734702] io_iopoll_reap_events.part.0+0x5e/0xa0 [ 677.734703] io_ring_ctx_wait_and_kill+0x132/0x1c0 [ 677.734704] io_uring_release+0x20/0x30 [ 677.734706] __fput+0xcd/0x230 [ 677.734707] ____fput+0xe/0x10 [ 677.734709] task_work_run+0x67/0xa0 [ 677.734710] do_exit+0x35d/0xb70 [ 677.734712] do_group_exit+0x43/0xa0 [ 677.734713] get_signal+0x140/0x900 [ 677.734715] do_signal+0x37/0x780 [ 677.734717] ? enqueue_hrtimer+0x41/0xb0 [ 677.734718] ? recalibrate_cpu_khz+0x10/0x10 [ 677.734720] ? ktime_get+0x3e/0xa0 [ 677.734721] ? lapic_next_deadline+0x26/0x30 [ 677.734723] ? tick_program_event+0x4d/0x90 [ 677.734724] ? __hrtimer_get_next_event+0x4d/0x80 [ 677.734726] __prepare_exit_to_usermode+0x126/0x1c0 [ 677.734741] prepare_exit_to_usermode+0x9/0x40 [ 677.734742] idtentry_exit_cond_rcu+0x4c/0x60 [ 677.734743] sysvec_reschedule_ipi+0x92/0x160 [ 677.734744] ? asm_sysvec_reschedule_ipi+0xa/0x20 [ 677.734745] asm_sysvec_reschedule_ipi+0x12/0x20
Signed-off-by: Pavel Begunkov asml.silence@gmail.com Signed-off-by: Jens Axboe axboe@kernel.dk Signed-off-by: yangerkun yangerkun@huawei.com Reviewed-by: zhangyi (F) yi.zhang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- fs/io_uring.c | 9 +++++++++ 1 file changed, 9 insertions(+)
diff --git a/fs/io_uring.c b/fs/io_uring.c index 2d9419c24248..8792efcf21f9 100644 --- a/fs/io_uring.c +++ b/fs/io_uring.c @@ -894,6 +894,7 @@ static int __io_sqe_files_update(struct io_ring_ctx *ctx, struct io_uring_files_update *ip, unsigned nr_args); static int io_grab_files(struct io_kiocb *req); +static void io_complete_rw_common(struct kiocb *kiocb, long res); static void io_cleanup_req(struct io_kiocb *req); static int io_file_get(struct io_submit_state *state, struct io_kiocb *req, int fd, struct file **out_file, bool fixed); @@ -1743,6 +1744,14 @@ static void io_iopoll_queue(struct list_head *again) do { req = list_first_entry(again, struct io_kiocb, list); list_del(&req->list); + + /* shouldn't happen unless io_uring is dying, cancel reqs */ + if (unlikely(!current->mm)) { + io_complete_rw_common(&req->rw.kiocb, -EAGAIN); + io_put_req(req); + continue; + } + refcount_inc(&req->refs); io_queue_async_work(req); } while (!list_empty(again));
From: Oleg Nesterov oleg@redhat.com
mainline inclusion from mainline-5.8-rc4 commit e91b48162332480f5840902268108bb7fb7a44c7 category: feature bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27 CVE: NA ---------------------------
So that the target task will exit the wait_event_interruptible-like loop and call task_work_run() asap.
The patch turns "bool notify" into a 0/TWA_RESUME/TWA_SIGNAL enum; the new TWA_SIGNAL flag implies signal_wake_up(). However, it needs to avoid racing with recalc_sigpending(), so the patch also adds the new JOBCTL_TASK_WORK bit, included in JOBCTL_PENDING_MASK.
TODO: once this patch is merged we need to change all current users of task_work_add(notify = true) to use TWA_RESUME.
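As a sketch of the resulting calling convention (the wrapper below is hypothetical; only task_work_add() and its modes come from this patch): TWA_RESUME keeps the old behaviour of running the work on the next return to user mode, while TWA_SIGNAL also sets JOBCTL_TASK_WORK and calls signal_wake_up() so a task sleeping interruptibly in the kernel breaks out of its wait loop promptly.

/* Kernel-style sketch, not a complete translation unit. */
static int queue_on_task(struct task_struct *task,
			 struct callback_head *work, bool urgent)
{
	/*
	 * notify == 0 queues without any notification; TWA_RESUME defers
	 * to the next user-mode return; TWA_SIGNAL additionally wakes the
	 * task. Returns -ESRCH if the task is already exiting, in which
	 * case the caller needs a fallback (io_uring punts to io-wq).
	 */
	return task_work_add(task, work, urgent ? TWA_SIGNAL : TWA_RESUME);
}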
Cc: stable@vger.kernel.org # v5.7 Acked-by: Peter Zijlstra (Intel) peterz@infradead.org Signed-off-by: Oleg Nesterov oleg@redhat.com Signed-off-by: Jens Axboe axboe@kernel.dk
Conflicts: include/linux/sched/jobctl.h [commit 76f969e8948d("cgroup: cgroup v2 freezer]") is not merged] Signed-off-by: yangerkun yangerkun@huawei.com Reviewed-by: zhangyi (F) yi.zhang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- include/linux/sched/jobctl.h | 4 +++- include/linux/task_work.h | 5 ++++- kernel/signal.c | 10 +++++++--- kernel/task_work.c | 16 ++++++++++++++-- 4 files changed, 28 insertions(+), 7 deletions(-)
diff --git a/include/linux/sched/jobctl.h b/include/linux/sched/jobctl.h index 98228bd48aee..dfbaab50df5d 100644 --- a/include/linux/sched/jobctl.h +++ b/include/linux/sched/jobctl.h @@ -18,6 +18,7 @@ struct task_struct; #define JOBCTL_TRAP_NOTIFY_BIT 20 /* trap for NOTIFY */ #define JOBCTL_TRAPPING_BIT 21 /* switching to TRACED */ #define JOBCTL_LISTENING_BIT 22 /* ptracer is listening for events */ +#define JOBCTL_TASK_WORK_BIT 24 /* set by TWA_SIGNAL */
#define JOBCTL_STOP_DEQUEUED (1UL << JOBCTL_STOP_DEQUEUED_BIT) #define JOBCTL_STOP_PENDING (1UL << JOBCTL_STOP_PENDING_BIT) @@ -26,9 +27,10 @@ struct task_struct; #define JOBCTL_TRAP_NOTIFY (1UL << JOBCTL_TRAP_NOTIFY_BIT) #define JOBCTL_TRAPPING (1UL << JOBCTL_TRAPPING_BIT) #define JOBCTL_LISTENING (1UL << JOBCTL_LISTENING_BIT) +#define JOBCTL_TASK_WORK (1UL << JOBCTL_TASK_WORK_BIT)
#define JOBCTL_TRAP_MASK (JOBCTL_TRAP_STOP | JOBCTL_TRAP_NOTIFY) -#define JOBCTL_PENDING_MASK (JOBCTL_STOP_PENDING | JOBCTL_TRAP_MASK) +#define JOBCTL_PENDING_MASK (JOBCTL_STOP_PENDING | JOBCTL_TRAP_MASK | JOBCTL_TASK_WORK)
extern bool task_set_jobctl_pending(struct task_struct *task, unsigned long mask); extern void task_clear_jobctl_trapping(struct task_struct *task); diff --git a/include/linux/task_work.h b/include/linux/task_work.h index bd9a6a91c097..0fb93aafa478 100644 --- a/include/linux/task_work.h +++ b/include/linux/task_work.h @@ -13,7 +13,10 @@ init_task_work(struct callback_head *twork, task_work_func_t func) twork->func = func; }
-int task_work_add(struct task_struct *task, struct callback_head *twork, bool); +#define TWA_RESUME 1 +#define TWA_SIGNAL 2 +int task_work_add(struct task_struct *task, struct callback_head *twork, int); + struct callback_head *task_work_cancel(struct task_struct *, task_work_func_t); void task_work_run(void);
diff --git a/kernel/signal.c b/kernel/signal.c index 03c0fbd586b4..b0dac7d2f9fc 100644 --- a/kernel/signal.c +++ b/kernel/signal.c @@ -2398,9 +2398,6 @@ bool get_signal(struct ksignal *ksig) struct signal_struct *signal = current->signal; int signr;
- if (unlikely(current->task_works)) - task_work_run(); - if (unlikely(uprobe_deny_signal())) return false;
@@ -2413,6 +2410,13 @@ bool get_signal(struct ksignal *ksig)
relock: spin_lock_irq(&sighand->siglock); + current->jobctl &= ~JOBCTL_TASK_WORK; + if (unlikely(current->task_works)) { + spin_unlock_irq(&sighand->siglock); + task_work_run(); + goto relock; + } + /* * Every stopped thread goes here after wakeup. Check to see if * we should notify the parent, prepare_signal(SIGCONT) encodes diff --git a/kernel/task_work.c b/kernel/task_work.c index 825f28259a19..5c0848ca1287 100644 --- a/kernel/task_work.c +++ b/kernel/task_work.c @@ -25,9 +25,10 @@ static struct callback_head work_exited; /* all we need is ->next == NULL */ * 0 if succeeds or -ESRCH. */ int -task_work_add(struct task_struct *task, struct callback_head *work, bool notify) +task_work_add(struct task_struct *task, struct callback_head *work, int notify) { struct callback_head *head; + unsigned long flags;
do { head = READ_ONCE(task->task_works); @@ -36,8 +37,19 @@ task_work_add(struct task_struct *task, struct callback_head *work, bool notify) work->next = head; } while (cmpxchg(&task->task_works, head, work) != head);
- if (notify) + switch (notify) { + case TWA_RESUME: set_notify_resume(task); + break; + case TWA_SIGNAL: + if (lock_task_sighand(task, &flags)) { + task->jobctl |= JOBCTL_TASK_WORK; + signal_wake_up(task, 0); + unlock_task_sighand(task, &flags); + } + break; + } + return 0; }
From: Jens Axboe axboe@kernel.dk
mainline inclusion from mainline-5.8-rc4 commit ce593a6c480a22acba08795be313c0c6d49dd35d category: feature bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27 CVE: NA ---------------------------
Since 5.7, we've been using task_work to trigger async running of requests in the context of the original task. This generally works great, but there's a case where if the task is currently blocked in the kernel waiting on a condition to become true, it won't process task_work. Even though the task is woken, it just checks whatever condition it's waiting on, and goes back to sleep if it's still false.
This is a problem if that very condition only becomes true when that task_work is run. An example of that is the task registering an eventfd with io_uring, and it's now blocked waiting on an eventfd read. That read could depend on a completion event, and that completion event won't get triggered until task_work has been run.
Use the TWA_SIGNAL notification for task_work, so that we ensure that the task always runs the work when queued.
Cc: stable@vger.kernel.org # v5.7 Signed-off-by: Jens Axboe axboe@kernel.dk Signed-off-by: yangerkun yangerkun@huawei.com Reviewed-by: zhangyi (F) yi.zhang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- fs/io_uring.c | 32 ++++++++++++++++++++++++-------- 1 file changed, 24 insertions(+), 8 deletions(-)
diff --git a/fs/io_uring.c b/fs/io_uring.c index 8792efcf21f9..8024e7bcb4fc 100644 --- a/fs/io_uring.c +++ b/fs/io_uring.c @@ -4032,6 +4032,21 @@ struct io_poll_table { int error; };
+static int io_req_task_work_add(struct io_kiocb *req, struct callback_head *cb, + int notify) +{ + struct task_struct *tsk = req->task; + int ret; + + if (req->ctx->flags & IORING_SETUP_SQPOLL) + notify = 0; + + ret = task_work_add(tsk, cb, notify); + if (!ret) + wake_up_process(tsk); + return ret; +} + static int __io_async_wake(struct io_kiocb *req, struct io_poll_iocb *poll, __poll_t mask, task_work_func_t func) { @@ -4055,13 +4070,13 @@ static int __io_async_wake(struct io_kiocb *req, struct io_poll_iocb *poll, * of executing it. We can't safely execute it anyway, as we may not * have the needed state needed for it anyway. */ - ret = task_work_add(tsk, &req->task_work, true); + ret = io_req_task_work_add(req, &req->task_work, TWA_SIGNAL); if (unlikely(ret)) { WRITE_ONCE(poll->canceled, true); tsk = io_wq_get_task(req->ctx->io_wq); - task_work_add(tsk, &req->task_work, true); + task_work_add(tsk, &req->task_work, 0); + wake_up_process(tsk); } - wake_up_process(tsk); return 1; }
@@ -6141,19 +6156,20 @@ static int io_cqring_wait(struct io_ring_ctx *ctx, int min_events, do { prepare_to_wait_exclusive(&ctx->wait, &iowq.wq, TASK_INTERRUPTIBLE); + /* make sure we run task_work before checking for signals */ if (current->task_works) task_work_run(); - if (io_should_wake(&iowq, false)) - break; - schedule(); if (signal_pending(current)) { - ret = -EINTR; + ret = -ERESTARTSYS; break; } + if (io_should_wake(&iowq, false)) + break; + schedule(); } while (1); finish_wait(&ctx->wait, &iowq.wq);
- restore_saved_sigmask_unless(ret == -EINTR); + restore_saved_sigmask_unless(ret == -ERESTARTSYS);
return READ_ONCE(rings->cq.head) == READ_ONCE(rings->cq.tail) ? ret : 0; }
From: Jens Axboe axboe@kernel.dk
mainline inclusion from mainline-5.8-rc4 commit b7db41c9e03b5189bc94993bd50e4506ac9e34c1 category: feature bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27 CVE: NA ---------------------------
When switching to TWA_SIGNAL for task_work notifications, we also made any signal based condition in io_cqring_wait() return -ERESTARTSYS. This breaks applications that rely on using signals to abort someone waiting for events.
Check if we have a signal pending because of queued task_work, and repeat the signal check once we've run the task_work. This provides a reliable way of telling the two apart.
Additionally, only use TWA_SIGNAL if we are using an eventfd. If not, we don't have the dependency situation described in the original commit, and we can get by with just using TWA_RESUME like we previously did.
Fixes: ce593a6c480a ("io_uring: use signal based task_work running") Cc: stable@vger.kernel.org # v5.7 Reported-by: Andres Freund andres@anarazel.de Tested-by: Andres Freund andres@anarazel.de Signed-off-by: Jens Axboe axboe@kernel.dk Signed-off-by: yangerkun yangerkun@huawei.com Reviewed-by: zhangyi (F) yi.zhang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- fs/io_uring.c | 29 ++++++++++++++++++++++------- 1 file changed, 22 insertions(+), 7 deletions(-)
diff --git a/fs/io_uring.c b/fs/io_uring.c index 8024e7bcb4fc..f38c24f80537 100644 --- a/fs/io_uring.c +++ b/fs/io_uring.c @@ -4032,14 +4032,22 @@ struct io_poll_table { int error; };
-static int io_req_task_work_add(struct io_kiocb *req, struct callback_head *cb, - int notify) +static int io_req_task_work_add(struct io_kiocb *req, struct callback_head *cb) { struct task_struct *tsk = req->task; - int ret; + struct io_ring_ctx *ctx = req->ctx; + int ret, notify = TWA_RESUME;
- if (req->ctx->flags & IORING_SETUP_SQPOLL) + /* + * SQPOLL kernel thread doesn't need notification, just a wakeup. + * If we're not using an eventfd, then TWA_RESUME is always fine, + * as we won't have dependencies between request completions for + * other kernel wait conditions. + */ + if (ctx->flags & IORING_SETUP_SQPOLL) notify = 0; + else if (ctx->cq_ev_fd) + notify = TWA_SIGNAL;
ret = task_work_add(tsk, cb, notify); if (!ret) @@ -4070,7 +4078,7 @@ static int __io_async_wake(struct io_kiocb *req, struct io_poll_iocb *poll, * of executing it. We can't safely execute it anyway, as we may not * have the needed state needed for it anyway. */ - ret = io_req_task_work_add(req, &req->task_work, TWA_SIGNAL); + ret = io_req_task_work_add(req, &req->task_work); if (unlikely(ret)) { WRITE_ONCE(poll->canceled, true); tsk = io_wq_get_task(req->ctx->io_wq); @@ -6160,7 +6168,14 @@ static int io_cqring_wait(struct io_ring_ctx *ctx, int min_events, if (current->task_works) task_work_run(); if (signal_pending(current)) { - ret = -ERESTARTSYS; + if (current->jobctl & JOBCTL_TASK_WORK) { + spin_lock_irq(¤t->sighand->siglock); + current->jobctl &= ~JOBCTL_TASK_WORK; + recalc_sigpending(); + spin_unlock_irq(¤t->sighand->siglock); + continue; + } + ret = -EINTR; break; } if (io_should_wake(&iowq, false)) @@ -6169,7 +6184,7 @@ static int io_cqring_wait(struct io_ring_ctx *ctx, int min_events, } while (1); finish_wait(&ctx->wait, &iowq.wq);
- restore_saved_sigmask_unless(ret == -ERESTARTSYS); + restore_saved_sigmask_unless(ret == -EINTR);
return READ_ONCE(rings->cq.head) == READ_ONCE(rings->cq.tail) ? ret : 0; }
From: Xiaoguang Wang xiaoguang.wang@linux.alibaba.com
mainline inclusion from mainline-5.8-rc5 commit 6d5f904904608a9cd32854d7d0a4dd65b27f9935 category: feature bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27 CVE: NA ---------------------------
Applications that are not willing to use io_uring_enter() to reap and handle cqes may rely entirely on liburing's io_uring_peek_cqe(). But if the cq ring has overflowed, io_uring_peek_cqe() is currently not aware of it and won't enter the kernel to flush cqes; the test program below reveals this bug:
static void test_cq_overflow(struct io_uring *ring) { struct io_uring_cqe *cqe; struct io_uring_sqe *sqe; int issued = 0; int ret = 0;
do { sqe = io_uring_get_sqe(ring); if (!sqe) { fprintf(stderr, "get sqe failed\n"); break; } ret = io_uring_submit(ring); if (ret <= 0) { if (ret != -EBUSY) fprintf(stderr, "sqe submit failed: %d\n", ret); break; } issued++; } while (ret > 0); assert(ret == -EBUSY);
printf("issued requests: %d\n", issued);
while (issued) { ret = io_uring_peek_cqe(ring, &cqe); if (ret) { if (ret != -EAGAIN) { fprintf(stderr, "peek completion failed: %s\n", strerror(ret)); break; } printf("left requests: %d\n", issued); continue; } io_uring_cqe_seen(ring, cqe); issued--; printf("left requests: %d\n", issued); } }
int main(int argc, char *argv[]) { int ret; struct io_uring ring;
ret = io_uring_queue_init(16, &ring, 0); if (ret) { fprintf(stderr, "ring setup failed: %d\n", ret); return 1; }
test_cq_overflow(&ring); return 0; }
To fix this issue, export the cq overflow status to userspace by adding a new IORING_SQ_CQ_OVERFLOW flag, so that helper functions in liburing, such as io_uring_peek_cqe(), can be aware of the cq overflow and flush accordingly.
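A hedged userspace sketch of how a peek-style helper might consume the new flag (assuming a uapi header new enough to define IORING_SQ_CQ_OVERFLOW, and that sq_flags is the application's pointer into the mmap'd SQ ring flags word; liburing's real helpers may differ):

#include <linux/io_uring.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* Returns true when io_uring_enter(..., IORING_ENTER_GETEVENTS) is
 * needed to flush overflowed CQEs before peeking can make progress. */
static bool cq_needs_flush(const _Atomic uint32_t *sq_flags)
{
	uint32_t flags = atomic_load_explicit(sq_flags, memory_order_acquire);

	return flags & IORING_SQ_CQ_OVERFLOW;
}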
Signed-off-by: Xiaoguang Wang xiaoguang.wang@linux.alibaba.com Signed-off-by: Jens Axboe axboe@kernel.dk Signed-off-by: yangerkun yangerkun@huawei.com Reviewed-by: zhangyi (F) yi.zhang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- fs/io_uring.c | 11 +++++++++-- include/uapi/linux/io_uring.h | 1 + 2 files changed, 10 insertions(+), 2 deletions(-)
diff --git a/fs/io_uring.c b/fs/io_uring.c index f38c24f80537..fd068b3beada 100644 --- a/fs/io_uring.c +++ b/fs/io_uring.c @@ -1277,6 +1277,7 @@ static bool io_cqring_overflow_flush(struct io_ring_ctx *ctx, bool force) if (cqe) { clear_bit(0, &ctx->sq_check_overflow); clear_bit(0, &ctx->cq_check_overflow); + ctx->rings->sq_flags &= ~IORING_SQ_CQ_OVERFLOW; } spin_unlock_irqrestore(&ctx->completion_lock, flags); io_cqring_ev_posted(ctx); @@ -1314,6 +1315,7 @@ static void __io_cqring_fill_event(struct io_kiocb *req, long res, long cflags) if (list_empty(&ctx->cq_overflow_list)) { set_bit(0, &ctx->sq_check_overflow); set_bit(0, &ctx->cq_check_overflow); + ctx->rings->sq_flags |= IORING_SQ_CQ_OVERFLOW; } req->flags |= REQ_F_OVERFLOW; refcount_inc(&req->refs); @@ -6038,9 +6040,9 @@ static int io_sq_thread(void *data) }
/* Tell userspace we may need a wakeup call */ + spin_lock_irq(&ctx->completion_lock); ctx->rings->sq_flags |= IORING_SQ_NEED_WAKEUP; - /* make sure to read SQ tail after writing flags */ - smp_mb(); + spin_unlock_irq(&ctx->completion_lock);
to_submit = io_sqring_entries(ctx); if (!to_submit || ret == -EBUSY) { @@ -6058,13 +6060,17 @@ static int io_sq_thread(void *data) schedule(); finish_wait(&ctx->sqo_wait, &wait);
+ spin_lock_irq(&ctx->completion_lock); ctx->rings->sq_flags &= ~IORING_SQ_NEED_WAKEUP; + spin_unlock_irq(&ctx->completion_lock); ret = 0; continue; } finish_wait(&ctx->sqo_wait, &wait);
+ spin_lock_irq(&ctx->completion_lock); ctx->rings->sq_flags &= ~IORING_SQ_NEED_WAKEUP; + spin_unlock_irq(&ctx->completion_lock); }
mutex_lock(&ctx->uring_lock); @@ -7480,6 +7486,7 @@ static void io_uring_cancel_files(struct io_ring_ctx *ctx, if (list_empty(&ctx->cq_overflow_list)) { clear_bit(0, &ctx->sq_check_overflow); clear_bit(0, &ctx->cq_check_overflow); + ctx->rings->sq_flags &= ~IORING_SQ_CQ_OVERFLOW; } spin_unlock_irq(&ctx->completion_lock);
diff --git a/include/uapi/linux/io_uring.h b/include/uapi/linux/io_uring.h index 83b790cf3c8d..d0bedc4a843b 100644 --- a/include/uapi/linux/io_uring.h +++ b/include/uapi/linux/io_uring.h @@ -197,6 +197,7 @@ struct io_sqring_offsets { * sq_ring->flags */ #define IORING_SQ_NEED_WAKEUP (1U << 0) /* needs io_uring_enter wakeup */ +#define IORING_SQ_CQ_OVERFLOW (1U << 1) /* CQ ring is overflown */
struct io_cqring_offsets { __u32 head;
From: Yang Yingliang yangyingliang@huawei.com
mainline inclusion from mainline-5.8-rc5 commit f3bd9dae3708a0ff6b067e766073ffeb853301f9 category: feature bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27 CVE: NA ---------------------------
I got a memleak report when doing some fuzz test:
BUG: memory leak unreferenced object 0xffff888113e02300 (size 488): comm "syz-executor401", pid 356, jiffies 4294809529 (age 11.954s) hex dump (first 32 bytes): 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................ a0 a4 ce 19 81 88 ff ff 60 ce 09 0d 81 88 ff ff ........`....... backtrace: [<00000000129a84ec>] kmem_cache_zalloc include/linux/slab.h:659 [inline] [<00000000129a84ec>] __alloc_file+0x25/0x310 fs/file_table.c:101 [<000000003050ad84>] alloc_empty_file+0x4f/0x120 fs/file_table.c:151 [<000000004d0a41a3>] alloc_file+0x5e/0x550 fs/file_table.c:193 [<000000002cb242f0>] alloc_file_pseudo+0x16a/0x240 fs/file_table.c:233 [<00000000046a4baa>] anon_inode_getfile fs/anon_inodes.c:91 [inline] [<00000000046a4baa>] anon_inode_getfile+0xac/0x1c0 fs/anon_inodes.c:74 [<0000000035beb745>] __do_sys_perf_event_open+0xd4a/0x2680 kernel/events/core.c:11720 [<0000000049009dc7>] do_syscall_64+0x56/0xa0 arch/x86/entry/common.c:359 [<00000000353731ca>] entry_SYSCALL_64_after_hwframe+0x44/0xa9
BUG: memory leak unreferenced object 0xffff8881152dd5e0 (size 16): comm "syz-executor401", pid 356, jiffies 4294809529 (age 11.954s) hex dump (first 16 bytes): 01 00 00 00 01 00 00 00 00 00 00 00 00 00 00 00 ................ backtrace: [<0000000074caa794>] kmem_cache_zalloc include/linux/slab.h:659 [inline] [<0000000074caa794>] lsm_file_alloc security/security.c:567 [inline] [<0000000074caa794>] security_file_alloc+0x32/0x160 security/security.c:1440 [<00000000c6745ea3>] __alloc_file+0xba/0x310 fs/file_table.c:106 [<000000003050ad84>] alloc_empty_file+0x4f/0x120 fs/file_table.c:151 [<000000004d0a41a3>] alloc_file+0x5e/0x550 fs/file_table.c:193 [<000000002cb242f0>] alloc_file_pseudo+0x16a/0x240 fs/file_table.c:233 [<00000000046a4baa>] anon_inode_getfile fs/anon_inodes.c:91 [inline] [<00000000046a4baa>] anon_inode_getfile+0xac/0x1c0 fs/anon_inodes.c:74 [<0000000035beb745>] __do_sys_perf_event_open+0xd4a/0x2680 kernel/events/core.c:11720 [<0000000049009dc7>] do_syscall_64+0x56/0xa0 arch/x86/entry/common.c:359 [<00000000353731ca>] entry_SYSCALL_64_after_hwframe+0x44/0xa9
If io_sqe_file_register() fails, we need to put the file obtained by fget() to avoid the memleak.
Fixes: c3a31e605620 ("io_uring: add support for IORING_REGISTER_FILES_UPDATE") Cc: stable@vger.kernel.org Reported-by: Hulk Robot hulkci@huawei.com Signed-off-by: Yang Yingliang yangyingliang@huawei.com Signed-off-by: Jens Axboe axboe@kernel.dk Signed-off-by: yangerkun yangerkun@huawei.com Reviewed-by: zhangyi (F) yi.zhang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- fs/io_uring.c | 4 +++- 1 file changed, 3 insertions(+), 1 deletion(-)
diff --git a/fs/io_uring.c b/fs/io_uring.c index fd068b3beada..3985fd1ca03f 100644 --- a/fs/io_uring.c +++ b/fs/io_uring.c @@ -6810,8 +6810,10 @@ static int __io_sqe_files_update(struct io_ring_ctx *ctx, } table->files[index] = file; err = io_sqe_file_register(ctx, file, i); - if (err) + if (err) { + fput(file); break; + } } nr_args--; done++;
From: Yang Yingliang yangyingliang@huawei.com
mainline inclusion from mainline-5.8-rc5 commit 667e57da358f61b6966e12e925a69e42d912e8bb category: feature bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27 CVE: NA ---------------------------
I got a memleak report when doing some fuzz test:
BUG: memory leak unreferenced object 0x607eeac06e78 (size 8): comm "test", pid 295, jiffies 4294735835 (age 31.745s) hex dump (first 8 bytes): 00 00 00 00 00 00 00 00 ........ backtrace: [<00000000932632e6>] percpu_ref_init+0x2a/0x1b0 [<0000000092ddb796>] __io_uring_register+0x111d/0x22a0 [<00000000eadd6c77>] __x64_sys_io_uring_register+0x17b/0x480 [<00000000591b89a6>] do_syscall_64+0x56/0xa0 [<00000000864a281d>] entry_SYSCALL_64_after_hwframe+0x44/0xa9
Call percpu_ref_exit() on the error path to avoid a refcount memleak.
Fixes: 05f3fb3c5397 ("io_uring: avoid ring quiesce for fixed file set unregister and update") Cc: stable@vger.kernel.org Reported-by: Hulk Robot hulkci@huawei.com Signed-off-by: Yang Yingliang yangyingliang@huawei.com Signed-off-by: Jens Axboe axboe@kernel.dk Signed-off-by: yangerkun yangerkun@huawei.com Reviewed-by: zhangyi (F) yi.zhang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- fs/io_uring.c | 1 + 1 file changed, 1 insertion(+)
diff --git a/fs/io_uring.c b/fs/io_uring.c index 3985fd1ca03f..2d1a7951877b 100644 --- a/fs/io_uring.c +++ b/fs/io_uring.c @@ -6658,6 +6658,7 @@ static int io_sqe_files_register(struct io_ring_ctx *ctx, void __user *arg, for (i = 0; i < nr_tables; i++) kfree(ctx->file_data->table[i].files);
+ percpu_ref_exit(&ctx->file_data->refs); kfree(ctx->file_data->table); kfree(ctx->file_data); ctx->file_data = NULL;
From: Jens Axboe axboe@kernel.dk
mainline inclusion from mainline-5.8-rc5 commit 309fc03a3284af62eb6082fb60327045a1dabf57 category: feature bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27 CVE: NA ---------------------------
We currently account the memory after the exit work has been run, but that leaves a gap between a process closing its ring and the memory being accounted as freed. If the memlocked ulimit is borderline, that can introduce spurious setup errors returning -ENOMEM because the free work hasn't run yet.
Account the memory as freed when we close the ring, so as not to expose a tiny gap where setting up a new ring can fail.
Fixes: 85faa7b8346e ("io_uring: punt final io_ring_ctx wait-and-free to workqueue") Cc: stable@vger.kernel.org # v5.7 Signed-off-by: Jens Axboe axboe@kernel.dk
Conflicts: fs/io_uring.c [io_ring_ctx->account_mem has been replace by limit_mem] Signed-off-by: yangerkun yangerkun@huawei.com Reviewed-by: zhangyi (F) yi.zhang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- fs/io_uring.c | 11 +++++++++-- 1 file changed, 9 insertions(+), 2 deletions(-)
diff --git a/fs/io_uring.c b/fs/io_uring.c index 2d1a7951877b..dbcd3b42b9d7 100644 --- a/fs/io_uring.c +++ b/fs/io_uring.c @@ -7344,8 +7344,6 @@ static void io_ring_ctx_free(struct io_ring_ctx *ctx) io_mem_free(ctx->sq_sqes);
percpu_ref_exit(&ctx->refs); - io_unaccount_mem(ctx, ring_pages(ctx->sq_entries, ctx->cq_entries), - ACCT_LOCKED); free_uid(ctx->user); put_cred(ctx->creds); kfree(ctx->cancel_hash); @@ -7430,6 +7428,15 @@ static void io_ring_ctx_wait_and_kill(struct io_ring_ctx *ctx) if (ctx->rings) io_cqring_overflow_flush(ctx, true); idr_for_each(&ctx->personality_idr, io_remove_personalities, ctx); + + /* + * Do this upfront, so we won't have a grace period where the ring + * is closed but resources aren't reaped yet. This can cause + * spurious failure in setting up a new ring. + */ + io_unaccount_mem(ctx, ring_pages(ctx->sq_entries, ctx->cq_entries), + ACCT_LOCKED); + INIT_WORK(&ctx->exit_work, io_ring_exit_work); queue_work(system_wq, &ctx->exit_work); }
From: Pavel Begunkov asml.silence@gmail.com
mainline inclusion from mainline-5.8-rc5 commit dd821e0c95a64b5923a0c57f07d3f7563553e756 category: feature bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27 CVE: NA ---------------------------
Ensure to set msg.msg_name for the async portion of send/recvmsg, as the header copy will copy to/from it.
Cc: stable@vger.kernel.org # v5.5+ Signed-off-by: Pavel Begunkov asml.silence@gmail.com Signed-off-by: Jens Axboe axboe@kernel.dk Signed-off-by: yangerkun yangerkun@huawei.com Reviewed-by: zhangyi (F) yi.zhang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- fs/io_uring.c | 2 ++ 1 file changed, 2 insertions(+)
diff --git a/fs/io_uring.c b/fs/io_uring.c index dbcd3b42b9d7..e95ddddb9a81 100644 --- a/fs/io_uring.c +++ b/fs/io_uring.c @@ -3513,6 +3513,7 @@ static int io_sendmsg_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe) if (req->flags & REQ_F_NEED_CLEANUP) return 0;
+ io->msg.msg.msg_name = &io->msg.addr; io->msg.iov = io->msg.fast_iov; ret = sendmsg_copy_msghdr(&io->msg.msg, sr->msg, sr->msg_flags, &io->msg.iov); @@ -3694,6 +3695,7 @@ static int __io_compat_recvmsg_copy_hdr(struct io_kiocb *req,
static int io_recvmsg_copy_hdr(struct io_kiocb *req, struct io_async_ctx *io) { + io->msg.msg.msg_name = &io->msg.addr; io->msg.iov = io->msg.fast_iov;
#ifdef CONFIG_COMPAT
From: Pavel Begunkov asml.silence@gmail.com
mainline inclusion from mainline-5.8-rc5 commit 16d598030a37853a7a6b4384cad19c9c0af2f021 category: feature bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27 CVE: NA ---------------------------
59960b9deb535 ("io_uring: fix lazy work init") tried to fix missing io_req_init_async(), but left out work.flags and hash. Do it earlier.
Fixes: 7cdaf587de7c ("io_uring: avoid whole io_wq_work copy for requests completed inline") Signed-off-by: Pavel Begunkov asml.silence@gmail.com Signed-off-by: Jens Axboe axboe@kernel.dk Signed-off-by: yangerkun yangerkun@huawei.com Reviewed-by: zhangyi (F) yi.zhang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- fs/io_uring.c | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-)
diff --git a/fs/io_uring.c b/fs/io_uring.c index e95ddddb9a81..c373ffbf2cbd 100644 --- a/fs/io_uring.c +++ b/fs/io_uring.c @@ -1099,6 +1099,8 @@ static inline void io_prep_async_work(struct io_kiocb *req, { const struct io_op_def *def = &io_op_defs[req->opcode];
+ io_req_init_async(req); + if (req->flags & REQ_F_ISREG) { if (def->hash_reg_file) io_wq_hash_work(&req->work, file_inode(req->file)); @@ -1107,7 +1109,6 @@ static inline void io_prep_async_work(struct io_kiocb *req, req->work.flags |= IO_WQ_WORK_UNBOUND; }
- io_req_init_async(req); io_req_work_grab_env(req, def);
*link = io_prep_linked_timeout(req);
From: Pavel Begunkov asml.silence@gmail.com
mainline inclusion from mainline-5.8-rc6 commit 681fda8d27a66f7e65ff7f2d200d7635e64a8d05 category: feature bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27 CVE: NA ---------------------------
io_recvmsg() doesn't free the memory allocated for struct io_buffer. This can cause a leak when used with automatic buffer selection.
Signed-off-by: Pavel Begunkov asml.silence@gmail.com Signed-off-by: Jens Axboe axboe@kernel.dk Signed-off-by: yangerkun yangerkun@huawei.com Reviewed-by: zhangyi (F) yi.zhang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- fs/io_uring.c | 10 ++++++++-- 1 file changed, 8 insertions(+), 2 deletions(-)
diff --git a/fs/io_uring.c b/fs/io_uring.c index c373ffbf2cbd..408b496c6b88 100644 --- a/fs/io_uring.c +++ b/fs/io_uring.c @@ -3805,10 +3805,16 @@ static int io_recvmsg(struct io_kiocb *req, bool force_nonblock)
ret = __sys_recvmsg_sock(sock, &kmsg->msg, req->sr_msg.msg, kmsg->uaddr, flags); - if (force_nonblock && ret == -EAGAIN) - return io_setup_async_msg(req, kmsg); + if (force_nonblock && ret == -EAGAIN) { + ret = io_setup_async_msg(req, kmsg); + if (ret != -EAGAIN) + kfree(kbuf); + return ret; + } if (ret == -ERESTARTSYS) ret = -EINTR; + if (kbuf) + kfree(kbuf); }
if (kmsg && kmsg->iov != kmsg->fast_iov)
From: Jens Axboe axboe@kernel.dk
mainline inclusion from mainline-5.8-rc7 commit 807abcb0883439af5ead73f3308310453b97b624 category: feature bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27 CVE: NA ---------------------------
The double poll additions were centered around doing POLL_ADD on file descriptors that use more than one waitqueue (typically one for read, one for write) when being polled. However, it can also end up being triggered when we use poll-triggered retry. For that case, we cannot safely use req->io, as that could be used by the request type itself.
Add a second io_poll_iocb pointer in the structure we allocate for poll based retry, and ensure we use the right one from the two paths.
Fixes: 18bceab101ad ("io_uring: allow POLL_ADD with double poll_wait() users") Signed-off-by: Jens Axboe axboe@kernel.dk Signed-off-by: yangerkun yangerkun@huawei.com Reviewed-by: zhangyi (F) yi.zhang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- fs/io_uring.c | 47 ++++++++++++++++++++++++++--------------------- 1 file changed, 26 insertions(+), 21 deletions(-)
diff --git a/fs/io_uring.c b/fs/io_uring.c index 408b496c6b88..93a4d6a3ad57 100644 --- a/fs/io_uring.c +++ b/fs/io_uring.c @@ -609,6 +609,7 @@ enum {
struct async_poll { struct io_poll_iocb poll; + struct io_poll_iocb *double_poll; struct io_wq_work work; };
@@ -4119,9 +4120,9 @@ static bool io_poll_rewait(struct io_kiocb *req, struct io_poll_iocb *poll) return false; }
-static void io_poll_remove_double(struct io_kiocb *req) +static void io_poll_remove_double(struct io_kiocb *req, void *data) { - struct io_poll_iocb *poll = (struct io_poll_iocb *) req->io; + struct io_poll_iocb *poll = data;
lockdep_assert_held(&req->ctx->completion_lock);
@@ -4141,7 +4142,7 @@ static void io_poll_complete(struct io_kiocb *req, __poll_t mask, int error) { struct io_ring_ctx *ctx = req->ctx;
- io_poll_remove_double(req); + io_poll_remove_double(req, req->io); req->poll.done = true; io_cqring_fill_event(req, error ? error : mangle_poll(mask)); io_commit_cqring(ctx); @@ -4184,21 +4185,21 @@ static int io_poll_double_wake(struct wait_queue_entry *wait, unsigned mode, int sync, void *key) { struct io_kiocb *req = wait->private; - struct io_poll_iocb *poll = (struct io_poll_iocb *) req->io; + struct io_poll_iocb *poll = req->apoll->double_poll; __poll_t mask = key_to_poll(key);
/* for instances that support it check for an event match first: */ if (mask && !(mask & poll->events)) return 0;
- if (req->poll.head) { + if (poll && poll->head) { bool done;
- spin_lock(&req->poll.head->lock); - done = list_empty(&req->poll.wait.entry); + spin_lock(&poll->head->lock); + done = list_empty(&poll->wait.entry); if (!done) - list_del_init(&req->poll.wait.entry); - spin_unlock(&req->poll.head->lock); + list_del_init(&poll->wait.entry); + spin_unlock(&poll->head->lock); if (!done) __io_async_wake(req, poll, mask, io_poll_task_func); } @@ -4218,7 +4219,8 @@ static void io_init_poll_iocb(struct io_poll_iocb *poll, __poll_t events, }
static void __io_queue_proc(struct io_poll_iocb *poll, struct io_poll_table *pt, - struct wait_queue_head *head) + struct wait_queue_head *head, + struct io_poll_iocb **poll_ptr) { struct io_kiocb *req = pt->req;
@@ -4229,7 +4231,7 @@ static void __io_queue_proc(struct io_poll_iocb *poll, struct io_poll_table *pt, */ if (unlikely(poll->head)) { /* already have a 2nd entry, fail a third attempt */ - if (req->io) { + if (*poll_ptr) { pt->error = -EINVAL; return; } @@ -4241,7 +4243,7 @@ static void __io_queue_proc(struct io_poll_iocb *poll, struct io_poll_table *pt, io_init_poll_iocb(poll, req->poll.events, io_poll_double_wake); refcount_inc(&req->refs); poll->wait.private = req; - req->io = (void *) poll; + *poll_ptr = poll; }
pt->error = 0; @@ -4257,8 +4259,9 @@ static void io_async_queue_proc(struct file *file, struct wait_queue_head *head, struct poll_table_struct *p) { struct io_poll_table *pt = container_of(p, struct io_poll_table, pt); + struct async_poll *apoll = pt->req->apoll;
- __io_queue_proc(&pt->req->apoll->poll, pt, head); + __io_queue_proc(&apoll->poll, pt, head, &apoll->double_poll); }
static void io_sq_thread_drop_mm(struct io_ring_ctx *ctx) @@ -4308,11 +4311,13 @@ static void io_async_task_func(struct callback_head *cb) } }
+ io_poll_remove_double(req, apoll->double_poll); spin_unlock_irq(&ctx->completion_lock);
/* restore ->work in case we need to retry again */ if (req->flags & REQ_F_WORK_INITIALIZED) memcpy(&req->work, &apoll->work, sizeof(req->work)); + kfree(apoll->double_poll); kfree(apoll);
if (!canceled) { @@ -4400,7 +4405,6 @@ static bool io_arm_poll_handler(struct io_kiocb *req) struct async_poll *apoll; struct io_poll_table ipt; __poll_t mask, ret; - bool had_io;
if (!req->file || !file_can_poll(req->file)) return false; @@ -4412,11 +4416,11 @@ static bool io_arm_poll_handler(struct io_kiocb *req) apoll = kmalloc(sizeof(*apoll), GFP_ATOMIC); if (unlikely(!apoll)) return false; + apoll->double_poll = NULL;
req->flags |= REQ_F_POLLED; if (req->flags & REQ_F_WORK_INITIALIZED) memcpy(&apoll->work, &req->work, sizeof(req->work)); - had_io = req->io != NULL;
io_get_req_task(req); req->apoll = apoll; @@ -4434,13 +4438,11 @@ static bool io_arm_poll_handler(struct io_kiocb *req) ret = __io_arm_poll_handler(req, &apoll->poll, &ipt, mask, io_async_wake); if (ret) { - ipt.error = 0; - /* only remove double add if we did it here */ - if (!had_io) - io_poll_remove_double(req); + io_poll_remove_double(req, apoll->double_poll); spin_unlock_irq(&ctx->completion_lock); if (req->flags & REQ_F_WORK_INITIALIZED) memcpy(&req->work, &apoll->work, sizeof(req->work)); + kfree(apoll->double_poll); kfree(apoll); return false; } @@ -4471,11 +4473,13 @@ static bool io_poll_remove_one(struct io_kiocb *req) bool do_complete;
if (req->opcode == IORING_OP_POLL_ADD) { - io_poll_remove_double(req); + io_poll_remove_double(req, req->io); do_complete = __io_poll_remove_one(req, &req->poll); } else { struct async_poll *apoll = req->apoll;
+ io_poll_remove_double(req, apoll->double_poll); + /* non-poll requests have submit ref still */ do_complete = __io_poll_remove_one(req, &apoll->poll); if (do_complete) { @@ -4488,6 +4492,7 @@ static bool io_poll_remove_one(struct io_kiocb *req) if (req->flags & REQ_F_WORK_INITIALIZED) memcpy(&req->work, &apoll->work, sizeof(req->work)); + kfree(apoll->double_poll); kfree(apoll); } } @@ -4588,7 +4593,7 @@ static void io_poll_queue_proc(struct file *file, struct wait_queue_head *head, { struct io_poll_table *pt = container_of(p, struct io_poll_table, pt);
- __io_queue_proc(&pt->req->poll, pt, head); + __io_queue_proc(&pt->req->poll, pt, head, (struct io_poll_iocb **) &pt->req->io); }
static int io_poll_add_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe)
From: Daniele Albano d.albano@gmail.com
mainline inclusion from mainline-5.8-rc7 commit 61710e437f2807e26a3402543bdbb7217a9c8620 category: feature bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27 CVE: NA ---------------------------
We currently filter these for timeout_remove/async_cancel/files_update, but we should only be filtering for fixed file and buffer select. This also causes a second read of sqe->flags, which isn't needed.
Just check req->flags for the relevant bits. This then allows these commands to be used in links, for example, like everything else.
Signed-off-by: Daniele Albano d.albano@gmail.com Signed-off-by: Jens Axboe axboe@kernel.dk Signed-off-by: yangerkun yangerkun@huawei.com Reviewed-by: zhangyi (F) yi.zhang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- fs/io_uring.c | 13 +++++++++---- 1 file changed, 9 insertions(+), 4 deletions(-)
diff --git a/fs/io_uring.c b/fs/io_uring.c index 93a4d6a3ad57..3c86c4acbf86 100644 --- a/fs/io_uring.c +++ b/fs/io_uring.c @@ -4705,7 +4705,9 @@ static int io_timeout_remove_prep(struct io_kiocb *req, { if (unlikely(req->ctx->flags & IORING_SETUP_IOPOLL)) return -EINVAL; - if (sqe->flags || sqe->ioprio || sqe->buf_index || sqe->len) + if (unlikely(req->flags & (REQ_F_FIXED_FILE | REQ_F_BUFFER_SELECT))) + return -EINVAL; + if (sqe->ioprio || sqe->buf_index || sqe->len) return -EINVAL;
req->timeout.addr = READ_ONCE(sqe->addr); @@ -4883,8 +4885,9 @@ static int io_async_cancel_prep(struct io_kiocb *req, { if (unlikely(req->ctx->flags & IORING_SETUP_IOPOLL)) return -EINVAL; - if (sqe->flags || sqe->ioprio || sqe->off || sqe->len || - sqe->cancel_flags) + if (unlikely(req->flags & (REQ_F_FIXED_FILE | REQ_F_BUFFER_SELECT))) + return -EINVAL; + if (sqe->ioprio || sqe->off || sqe->len || sqe->cancel_flags) return -EINVAL;
req->cancel.addr = READ_ONCE(sqe->addr); @@ -4902,7 +4905,9 @@ static int io_async_cancel(struct io_kiocb *req) static int io_files_update_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe) { - if (sqe->flags || sqe->ioprio || sqe->rw_flags) + if (unlikely(req->flags & (REQ_F_FIXED_FILE | REQ_F_BUFFER_SELECT))) + return -EINVAL; + if (sqe->ioprio || sqe->rw_flags) return -EINVAL;
req->files_update.offset = READ_ONCE(sqe->off);
From: Pavel Begunkov asml.silence@gmail.com
mainline inclusion from mainline-5.8-rc7 commit 3e863ea3bb1a2203ae648eb272db0ce6a1a2072c category: feature bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27 CVE: NA ---------------------------
The IOSQE_ASYNC branch of io_queue_sqe() is another place where an uninitialised req->work can be accessed (i.e. prior to io_req_init_async()). Nothing really bad happens, though; it just loses the IO_WQ_WORK_CONCURRENT flag.
Signed-off-by: Pavel Begunkov asml.silence@gmail.com Signed-off-by: Jens Axboe axboe@kernel.dk Signed-off-by: yangerkun yangerkun@huawei.com Reviewed-by: zhangyi (F) yi.zhang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- fs/io_uring.c | 1 + 1 file changed, 1 insertion(+)
diff --git a/fs/io_uring.c b/fs/io_uring.c index 3c86c4acbf86..ef4bde014013 100644 --- a/fs/io_uring.c +++ b/fs/io_uring.c @@ -5685,6 +5685,7 @@ static void io_queue_sqe(struct io_kiocb *req, const struct io_uring_sqe *sqe) * Never try inline submit of IOSQE_ASYNC is set, go straight * to async execution. */ + io_req_init_async(req); req->work.flags |= IO_WQ_WORK_CONCURRENT; io_queue_async_work(req); } else {
From: Pavel Begunkov asml.silence@gmail.com
mainline inclusion from mainline-5.8 commit d5e16d8e23825304c6a9945116cc6b6f8d51f28c category: feature bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27 CVE: NA ---------------------------
req->work might be already initialised by the time it gets into __io_arm_poll_handler(), which will corrupt it by using fields that are in an union with req->work. Luckily, the only side effect is missing put_creds(). Clean req->work before going there.
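A toy, compilable illustration of the failure mode; the field names here are made up, but the mechanism is the same: in io_kiocb, ->work shares a union with ->hash_node and friends, so initialising work while the request sits on the poll hash clobbers the list linkage.

#include <assert.h>
#include <stdio.h>

struct toy_req {
	union {			/* two members, one storage location */
		struct { void *next, *pprev; } hash_node;
		struct { unsigned long flags; void *creds; } work;
	};
};

int main(void)
{
	struct toy_req req = { .hash_node = { &req, &req } };

	req.work.flags = 0;			/* clobbers hash_node.next */
	assert(req.hash_node.next != &req);	/* the linkage is gone */
	puts("writing one union member corrupted the other");
	return 0;
}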
Suggested-by: Jens Axboe axboe@kernel.dk Signed-off-by: Pavel Begunkov asml.silence@gmail.com Signed-off-by: Jens Axboe axboe@kernel.dk Signed-off-by: yangerkun yangerkun@huawei.com Reviewed-by: zhangyi (F) yi.zhang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- fs/io_uring.c | 4 ++++ 1 file changed, 4 insertions(+)
diff --git a/fs/io_uring.c b/fs/io_uring.c index ef4bde014013..3734323fcfa9 100644 --- a/fs/io_uring.c +++ b/fs/io_uring.c @@ -4626,6 +4626,10 @@ static int io_poll_add(struct io_kiocb *req) struct io_poll_table ipt; __poll_t mask;
+ /* ->work is in union with hash_node and others */ + io_req_work_drop_env(req); + req->flags &= ~REQ_F_WORK_INITIALIZED; + INIT_HLIST_NODE(&req->hash_node); INIT_LIST_HEAD(&req->list); ipt.pt._qproc = io_poll_queue_proc;
From: Pavel Begunkov asml.silence@gmail.com
mainline inclusion from mainline-5.8 commit 4ae6dbd683860b9edc254ea8acf5e04b5ae242e5 category: feature bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27 CVE: NA ---------------------------
io_fail_links() doesn't consider REQ_F_COMP_LOCKED, leading to nested spin_lock(completion_lock) and lockup.
[ 197.680409] rcu: INFO: rcu_preempt detected expedited stalls on CPUs/tasks: { 6-... } 18239 jiffies s: 1421 root: 0x40/. [ 197.680411] rcu: blocking rcu_node structures: [ 197.680412] Task dump for CPU 6: [ 197.680413] link-timeout R running task 0 1669 1 0x8000008a [ 197.680414] Call Trace: [ 197.680420] ? io_req_find_next+0xa0/0x200 [ 197.680422] ? io_put_req_find_next+0x2a/0x50 [ 197.680423] ? io_poll_task_func+0xcf/0x140 [ 197.680425] ? task_work_run+0x67/0xa0 [ 197.680426] ? do_exit+0x35d/0xb70 [ 197.680429] ? syscall_trace_enter+0x187/0x2c0 [ 197.680430] ? do_group_exit+0x43/0xa0 [ 197.680448] ? __x64_sys_exit_group+0x18/0x20 [ 197.680450] ? do_syscall_64+0x52/0xa0 [ 197.680452] ? entry_SYSCALL_64_after_hwframe+0x44/0xa9
Signed-off-by: Pavel Begunkov asml.silence@gmail.com Signed-off-by: Jens Axboe axboe@kernel.dk Signed-off-by: yangerkun yangerkun@huawei.com Reviewed-by: zhangyi (F) yi.zhang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- fs/io_uring.c | 3 +-- 1 file changed, 1 insertion(+), 2 deletions(-)
diff --git a/fs/io_uring.c b/fs/io_uring.c index 3734323fcfa9..42d399fc01dc 100644 --- a/fs/io_uring.c +++ b/fs/io_uring.c @@ -4159,10 +4159,9 @@ static void io_poll_task_handler(struct io_kiocb *req, struct io_kiocb **nxt)
hash_del(&req->hash_node); io_poll_complete(req, req->result, 0); - req->flags |= REQ_F_COMP_LOCKED; - io_put_req_find_next(req, nxt); spin_unlock_irq(&ctx->completion_lock);
+ io_put_req_find_next(req, nxt); io_cqring_ev_posted(ctx); }
From: Dmitry Vyukov dvyukov@google.com
mainline inclusion from mainline-5.9-rc1 commit b36200f543ff07a1cb346aa582349141df2c8068 category: feature bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27 CVE: NA ---------------------------
rings_size() sets sq_offset to the total size of the rings (the returned value, which is used for memory allocation). This is wrong: the sq array should be located within the rings, not after them. Set sq_offset to where it should be.
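A simplified, compilable sketch of the corrected layout math (overflow checks dropped; the cqe_size and hdr_size parameters are stand-ins for the real struct sizes): the sq array's offset must be recorded before its size is added to the running total, so the array lands inside the allocation.

#include <stdint.h>
#include <stdio.h>

static size_t rings_size_sketch(unsigned sq_entries, unsigned cq_entries,
				size_t cqe_size, size_t hdr_size,
				size_t *sq_offset)
{
	size_t off = hdr_size + (size_t)cq_entries * cqe_size;

	if (sq_offset)
		*sq_offset = off;	/* the fix: record the offset here... */

	off += (size_t)sq_entries * sizeof(uint32_t);	/* ...then grow */
	return off;
}

int main(void)
{
	size_t sq_off;
	size_t total = rings_size_sketch(8, 16, 16, 64, &sq_off);

	printf("total=%zu, sq array at %zu (inside the rings)\n",
	       total, sq_off);
	return 0;
}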
Fixes: 75b28affdd6a ("io_uring: allocate the two rings together") Signed-off-by: Dmitry Vyukov dvyukov@google.com Acked-by: Hristo Venev hristo@venev.name Cc: io-uring@vger.kernel.org Signed-off-by: Jens Axboe axboe@kernel.dk Signed-off-by: yangerkun yangerkun@huawei.com Reviewed-by: zhangyi (F) yi.zhang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- fs/io_uring.c | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-)
diff --git a/fs/io_uring.c b/fs/io_uring.c index 42d399fc01dc..adf67d940f55 100644 --- a/fs/io_uring.c +++ b/fs/io_uring.c @@ -7082,6 +7082,9 @@ static unsigned long rings_size(unsigned sq_entries, unsigned cq_entries, return SIZE_MAX; #endif
+ if (sq_offset) + *sq_offset = off; + sq_array_size = array_size(sizeof(u32), sq_entries); if (sq_array_size == SIZE_MAX) return SIZE_MAX; @@ -7089,9 +7092,6 @@ static unsigned long rings_size(unsigned sq_entries, unsigned cq_entries, if (check_add_overflow(off, sq_array_size, &off)) return SIZE_MAX;
- if (sq_offset) - *sq_offset = off; - return off; }
From: Pavel Begunkov asml.silence@gmail.com
mainline inclusion from mainline-5.9-rc1 commit 270a5940700bb6cf9abf36ea10cf1fa0d453aa7a category: feature bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27 CVE: NA ---------------------------
Every second field in send/recv is called msg; make it a bit more understandable by renaming ->msg, which is a user-provided pointer, to ->umsg.
Signed-off-by: Pavel Begunkov asml.silence@gmail.com Signed-off-by: Jens Axboe axboe@kernel.dk Signed-off-by: yangerkun yangerkun@huawei.com Reviewed-by: zhangyi (F) yi.zhang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- fs/io_uring.c | 18 +++++++++--------- 1 file changed, 9 insertions(+), 9 deletions(-)
diff --git a/fs/io_uring.c b/fs/io_uring.c index adf67d940f55..c4305ed8d2c8 100644 --- a/fs/io_uring.c +++ b/fs/io_uring.c @@ -414,7 +414,7 @@ struct io_connect { struct io_sr_msg { struct file *file; union { - struct user_msghdr __user *msg; + struct user_msghdr __user *umsg; void __user *buf; }; int msg_flags; @@ -3501,7 +3501,7 @@ static int io_sendmsg_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe) return -EINVAL;
sr->msg_flags = READ_ONCE(sqe->msg_flags); - sr->msg = u64_to_user_ptr(READ_ONCE(sqe->addr)); + sr->umsg = u64_to_user_ptr(READ_ONCE(sqe->addr)); sr->len = READ_ONCE(sqe->len);
#ifdef CONFIG_COMPAT @@ -3517,7 +3517,7 @@ static int io_sendmsg_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe)
io->msg.msg.msg_name = &io->msg.addr; io->msg.iov = io->msg.fast_iov; - ret = sendmsg_copy_msghdr(&io->msg.msg, sr->msg, sr->msg_flags, + ret = sendmsg_copy_msghdr(&io->msg.msg, sr->umsg, sr->msg_flags, &io->msg.iov); if (!ret) req->flags |= REQ_F_NEED_CLEANUP; @@ -3549,7 +3549,7 @@ static int io_sendmsg(struct io_kiocb *req, bool force_nonblock) kmsg->msg.msg_name = &io.msg.addr;
io.msg.iov = io.msg.fast_iov; - ret = sendmsg_copy_msghdr(&io.msg.msg, sr->msg, + ret = sendmsg_copy_msghdr(&io.msg.msg, sr->umsg, sr->msg_flags, &io.msg.iov); if (ret) return ret; @@ -3628,8 +3628,8 @@ static int __io_recvmsg_copy_hdr(struct io_kiocb *req, struct io_async_ctx *io) size_t iov_len; int ret;
- ret = __copy_msghdr_from_user(&io->msg.msg, sr->msg, &io->msg.uaddr, - &uiov, &iov_len); + ret = __copy_msghdr_from_user(&io->msg.msg, sr->umsg, + &io->msg.uaddr, &uiov, &iov_len); if (ret) return ret;
@@ -3663,7 +3663,7 @@ static int __io_compat_recvmsg_copy_hdr(struct io_kiocb *req, compat_size_t len; int ret;
- msg_compat = (struct compat_msghdr __user *) sr->msg; + msg_compat = (struct compat_msghdr __user *) sr->umsg; ret = __get_compat_msghdr(&io->msg.msg, msg_compat, &io->msg.uaddr, &ptr, &len); if (ret) @@ -3740,7 +3740,7 @@ static int io_recvmsg_prep(struct io_kiocb *req, return -EINVAL;
sr->msg_flags = READ_ONCE(sqe->msg_flags); - sr->msg = u64_to_user_ptr(READ_ONCE(sqe->addr)); + sr->umsg = u64_to_user_ptr(READ_ONCE(sqe->addr)); sr->len = READ_ONCE(sqe->len); sr->bgid = READ_ONCE(sqe->buf_group);
@@ -3804,7 +3804,7 @@ static int io_recvmsg(struct io_kiocb *req, bool force_nonblock) else if (force_nonblock) flags |= MSG_DONTWAIT;
- ret = __sys_recvmsg_sock(sock, &kmsg->msg, req->sr_msg.msg, + ret = __sys_recvmsg_sock(sock, &kmsg->msg, req->sr_msg.umsg, kmsg->uaddr, flags); if (force_nonblock && ret == -EAGAIN) { ret = io_setup_async_msg(req, kmsg);
From: Pavel Begunkov asml.silence@gmail.com
mainline inclusion from mainline-5.9-rc1 commit 1400e69705baf98d1c9cb73b592a3a68aab1d852 category: feature bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27 CVE: NA ---------------------------
send/recv msghdr initialisation works with struct io_async_msghdr, but pulls in the whole struct io_async_ctx for no reason. That complicates it with composite accessing, e.g. io->msg.
Use and pass the most specific type, which is struct io_async_msghdr. It is the largest field in union io_async_ctx, so this doesn't save stack space, but it looks clearer. Most of the changes replace "io->msg." with "iomsg->".
Signed-off-by: Pavel Begunkov asml.silence@gmail.com Signed-off-by: Jens Axboe axboe@kernel.dk Signed-off-by: yangerkun yangerkun@huawei.com Reviewed-by: zhangyi (F) yi.zhang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- fs/io_uring.c | 63 +++++++++++++++++++++++++-------------------------- 1 file changed, 31 insertions(+), 32 deletions(-)
diff --git a/fs/io_uring.c b/fs/io_uring.c index c4305ed8d2c8..03e4fef26567 100644 --- a/fs/io_uring.c +++ b/fs/io_uring.c @@ -3532,7 +3532,7 @@ static int io_sendmsg(struct io_kiocb *req, bool force_nonblock)
sock = sock_from_file(req->file, &ret); if (sock) { - struct io_async_ctx io; + struct io_async_msghdr iomsg; unsigned flags;
if (req->io) { @@ -3545,14 +3545,13 @@ static int io_sendmsg(struct io_kiocb *req, bool force_nonblock) } else { struct io_sr_msg *sr = &req->sr_msg;
- kmsg = &io.msg; - kmsg->msg.msg_name = &io.msg.addr; - - io.msg.iov = io.msg.fast_iov; - ret = sendmsg_copy_msghdr(&io.msg.msg, sr->umsg, - sr->msg_flags, &io.msg.iov); + iomsg.msg.msg_name = &iomsg.addr; + iomsg.iov = iomsg.fast_iov; + ret = sendmsg_copy_msghdr(&iomsg.msg, sr->umsg, + sr->msg_flags, &iomsg.iov); if (ret) return ret; + kmsg = &iomsg; }
flags = req->sr_msg.msg_flags; @@ -3621,30 +3620,31 @@ static int io_send(struct io_kiocb *req, bool force_nonblock) return 0; }
-static int __io_recvmsg_copy_hdr(struct io_kiocb *req, struct io_async_ctx *io) +static int __io_recvmsg_copy_hdr(struct io_kiocb *req, + struct io_async_msghdr *iomsg) { struct io_sr_msg *sr = &req->sr_msg; struct iovec __user *uiov; size_t iov_len; int ret;
- ret = __copy_msghdr_from_user(&io->msg.msg, sr->umsg, - &io->msg.uaddr, &uiov, &iov_len); + ret = __copy_msghdr_from_user(&iomsg->msg, sr->umsg, + &iomsg->uaddr, &uiov, &iov_len); if (ret) return ret;
if (req->flags & REQ_F_BUFFER_SELECT) { if (iov_len > 1) return -EINVAL; - if (copy_from_user(io->msg.iov, uiov, sizeof(*uiov))) + if (copy_from_user(iomsg->iov, uiov, sizeof(*uiov))) return -EFAULT; - sr->len = io->msg.iov[0].iov_len; - iov_iter_init(&io->msg.msg.msg_iter, READ, io->msg.iov, 1, + sr->len = iomsg->iov[0].iov_len; + iov_iter_init(&iomsg->msg.msg_iter, READ, iomsg->iov, 1, sr->len); - io->msg.iov = NULL; + iomsg->iov = NULL; } else { ret = import_iovec(READ, uiov, iov_len, UIO_FASTIOV, - &io->msg.iov, &io->msg.msg.msg_iter); + &iomsg->iov, &iomsg->msg.msg_iter); if (ret > 0) ret = 0; } @@ -3654,7 +3654,7 @@ static int __io_recvmsg_copy_hdr(struct io_kiocb *req, struct io_async_ctx *io)
#ifdef CONFIG_COMPAT static int __io_compat_recvmsg_copy_hdr(struct io_kiocb *req, - struct io_async_ctx *io) + struct io_async_msghdr *iomsg) { struct compat_msghdr __user *msg_compat; struct io_sr_msg *sr = &req->sr_msg; @@ -3664,7 +3664,7 @@ static int __io_compat_recvmsg_copy_hdr(struct io_kiocb *req, int ret;
msg_compat = (struct compat_msghdr __user *) sr->umsg; - ret = __get_compat_msghdr(&io->msg.msg, msg_compat, &io->msg.uaddr, + ret = __get_compat_msghdr(&iomsg->msg, msg_compat, &iomsg->uaddr, &ptr, &len); if (ret) return ret; @@ -3681,12 +3681,12 @@ static int __io_compat_recvmsg_copy_hdr(struct io_kiocb *req, return -EFAULT; if (clen < 0) return -EINVAL; - sr->len = io->msg.iov[0].iov_len; - io->msg.iov = NULL; + sr->len = iomsg->iov[0].iov_len; + iomsg->iov = NULL; } else { ret = compat_import_iovec(READ, uiov, len, UIO_FASTIOV, - &io->msg.iov, - &io->msg.msg.msg_iter); + &iomsg->iov, + &iomsg->msg.msg_iter); if (ret < 0) return ret; } @@ -3695,17 +3695,18 @@ static int __io_compat_recvmsg_copy_hdr(struct io_kiocb *req, } #endif
-static int io_recvmsg_copy_hdr(struct io_kiocb *req, struct io_async_ctx *io) +static int io_recvmsg_copy_hdr(struct io_kiocb *req, + struct io_async_msghdr *iomsg) { - io->msg.msg.msg_name = &io->msg.addr; - io->msg.iov = io->msg.fast_iov; + iomsg->msg.msg_name = &iomsg->addr; + iomsg->iov = iomsg->fast_iov;
#ifdef CONFIG_COMPAT if (req->ctx->compat) - return __io_compat_recvmsg_copy_hdr(req, io); + return __io_compat_recvmsg_copy_hdr(req, iomsg); #endif
- return __io_recvmsg_copy_hdr(req, io); + return __io_recvmsg_copy_hdr(req, iomsg); }
static struct io_buffer *io_recv_buffer_select(struct io_kiocb *req, @@ -3755,7 +3756,7 @@ static int io_recvmsg_prep(struct io_kiocb *req, if (req->flags & REQ_F_NEED_CLEANUP) return 0;
- ret = io_recvmsg_copy_hdr(req, io); + ret = io_recvmsg_copy_hdr(req, &io->msg); if (!ret) req->flags |= REQ_F_NEED_CLEANUP; return ret; @@ -3770,7 +3771,7 @@ static int io_recvmsg(struct io_kiocb *req, bool force_nonblock) sock = sock_from_file(req->file, &ret); if (sock) { struct io_buffer *kbuf; - struct io_async_ctx io; + struct io_async_msghdr iomsg; unsigned flags;
if (req->io) { @@ -3781,12 +3782,10 @@ static int io_recvmsg(struct io_kiocb *req, bool force_nonblock) kmsg->iov = kmsg->fast_iov; kmsg->msg.msg_iter.iov = kmsg->iov; } else { - kmsg = &io.msg; - kmsg->msg.msg_name = &io.msg.addr; - - ret = io_recvmsg_copy_hdr(req, &io); + ret = io_recvmsg_copy_hdr(req, &iomsg); if (ret) return ret; + kmsg = &iomsg; }
kbuf = io_recv_buffer_select(req, &cflags, !force_nonblock);
From: Pavel Begunkov asml.silence@gmail.com
mainline inclusion from mainline-5.9-rc1 commit 2ae523ed07f14391d685651f671a7858fe8c368a category: feature bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27 CVE: NA ---------------------------
Don't repeat the sendmsg initialisation code; it's error prone. Extract and use a helper function.
Signed-off-by: Pavel Begunkov asml.silence@gmail.com Signed-off-by: Jens Axboe axboe@kernel.dk Signed-off-by: yangerkun yangerkun@huawei.com Reviewed-by: zhangyi (F) yi.zhang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- fs/io_uring.c | 21 +++++++++++---------- 1 file changed, 11 insertions(+), 10 deletions(-)
diff --git a/fs/io_uring.c b/fs/io_uring.c index 03e4fef26567..fe61ce18c6c2 100644 --- a/fs/io_uring.c +++ b/fs/io_uring.c @@ -3491,6 +3491,15 @@ static int io_setup_async_msg(struct io_kiocb *req, return -EAGAIN; }
+static int io_sendmsg_copy_hdr(struct io_kiocb *req, + struct io_async_msghdr *iomsg) +{ + iomsg->iov = iomsg->fast_iov; + iomsg->msg.msg_name = &iomsg->addr; + return sendmsg_copy_msghdr(&iomsg->msg, req->sr_msg.umsg, + req->sr_msg.msg_flags, &iomsg->iov); +} + static int io_sendmsg_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe) { struct io_sr_msg *sr = &req->sr_msg; @@ -3515,10 +3524,7 @@ static int io_sendmsg_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe) if (req->flags & REQ_F_NEED_CLEANUP) return 0;
- io->msg.msg.msg_name = &io->msg.addr; - io->msg.iov = io->msg.fast_iov; - ret = sendmsg_copy_msghdr(&io->msg.msg, sr->umsg, sr->msg_flags, - &io->msg.iov); + ret = io_sendmsg_copy_hdr(req, &io->msg); if (!ret) req->flags |= REQ_F_NEED_CLEANUP; return ret; @@ -3543,12 +3549,7 @@ static int io_sendmsg(struct io_kiocb *req, bool force_nonblock) kmsg->iov = kmsg->fast_iov; kmsg->msg.msg_iter.iov = kmsg->iov; } else { - struct io_sr_msg *sr = &req->sr_msg; - - iomsg.msg.msg_name = &iomsg.addr; - iomsg.iov = iomsg.fast_iov; - ret = sendmsg_copy_msghdr(&iomsg.msg, sr->umsg, - sr->msg_flags, &iomsg.iov); + ret = io_sendmsg_copy_hdr(req, &iomsg); if (ret) return ret; kmsg = &iomsg;
From: Pavel Begunkov asml.silence@gmail.com
mainline inclusion from mainline-5.9-rc1 commit b64e3444d4e1c71fe148a4f4535395b1fdd73200 category: feature bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27 CVE: NA ---------------------------
Don't deref req->io->rw every time; put it in a local variable instead. This looks prettier, generates fewer instructions, and doesn't break alias analysis.
Signed-off-by: Pavel Begunkov asml.silence@gmail.com Signed-off-by: Jens Axboe axboe@kernel.dk Signed-off-by: yangerkun yangerkun@huawei.com Reviewed-by: zhangyi (F) yi.zhang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- fs/io_uring.c | 16 +++++++++------- 1 file changed, 9 insertions(+), 7 deletions(-)
diff --git a/fs/io_uring.c b/fs/io_uring.c index fe61ce18c6c2..b70a41bad00a 100644 --- a/fs/io_uring.c +++ b/fs/io_uring.c @@ -2548,15 +2548,17 @@ static void io_req_map_rw(struct io_kiocb *req, ssize_t io_size, struct iovec *iovec, struct iovec *fast_iov, struct iov_iter *iter) { - req->io->rw.nr_segs = iter->nr_segs; - req->io->rw.size = io_size; - req->io->rw.iov = iovec; - if (!req->io->rw.iov) { - req->io->rw.iov = req->io->rw.fast_iov; - if (req->io->rw.iov != fast_iov) - memcpy(req->io->rw.iov, fast_iov, + struct io_async_rw *rw = &req->io->rw; + + rw->nr_segs = iter->nr_segs; + rw->size = io_size; + if (!iovec) { + rw->iov = rw->fast_iov; + if (rw->iov != fast_iov) + memcpy(rw->iov, fast_iov, sizeof(struct iovec) * iter->nr_segs); } else { + rw->iov = iovec; req->flags |= REQ_F_NEED_CLEANUP; } }
From: Pavel Begunkov asml.silence@gmail.com
mainline inclusion from mainline-5.9-rc1 commit c3e330a493740a2a8312dcb7b1cffceaec7f619a category: feature bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27 CVE: NA ---------------------------
Preparing reads/writes for async is a bit tricky. Extract a helper so the logic isn't repeated twice.
Signed-off-by: Pavel Begunkov asml.silence@gmail.com Signed-off-by: Jens Axboe axboe@kernel.dk Signed-off-by: yangerkun yangerkun@huawei.com Reviewed-by: zhangyi (F) yi.zhang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- fs/io_uring.c | 46 ++++++++++++++++++++-------------------------- 1 file changed, 20 insertions(+), 26 deletions(-)
diff --git a/fs/io_uring.c b/fs/io_uring.c index b70a41bad00a..70adbafb37bf 100644 --- a/fs/io_uring.c +++ b/fs/io_uring.c @@ -2592,11 +2592,27 @@ static int io_setup_async_rw(struct io_kiocb *req, ssize_t io_size, return 0; }
+static inline int io_rw_prep_async(struct io_kiocb *req, int rw, + bool force_nonblock) +{ + struct io_async_ctx *io = req->io; + struct iov_iter iter; + ssize_t ret; + + io->rw.iov = io->rw.fast_iov; + req->io = NULL; + ret = io_import_iovec(rw, req, &io->rw.iov, &iter, !force_nonblock); + req->io = io; + if (unlikely(ret < 0)) + return ret; + + io_req_map_rw(req, ret, io->rw.iov, io->rw.fast_iov, &iter); + return 0; +} + static int io_read_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe, bool force_nonblock) { - struct io_async_ctx *io; - struct iov_iter iter; ssize_t ret;
ret = io_prep_rw(req, sqe, force_nonblock); @@ -2609,17 +2625,7 @@ static int io_read_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe, /* either don't need iovec imported or already have it */ if (!req->io || req->flags & REQ_F_NEED_CLEANUP) return 0; - - io = req->io; - io->rw.iov = io->rw.fast_iov; - req->io = NULL; - ret = io_import_iovec(READ, req, &io->rw.iov, &iter, !force_nonblock); - req->io = io; - if (ret < 0) - return ret; - - io_req_map_rw(req, ret, io->rw.iov, io->rw.fast_iov, &iter); - return 0; + return io_rw_prep_async(req, READ, force_nonblock); }
static int io_read(struct io_kiocb *req, bool force_nonblock) @@ -2685,8 +2691,6 @@ static int io_read(struct io_kiocb *req, bool force_nonblock) static int io_write_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe, bool force_nonblock) { - struct io_async_ctx *io; - struct iov_iter iter; ssize_t ret;
ret = io_prep_rw(req, sqe, force_nonblock); @@ -2701,17 +2705,7 @@ static int io_write_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe, /* either don't need iovec imported or already have it */ if (!req->io || req->flags & REQ_F_NEED_CLEANUP) return 0; - - io = req->io; - io->rw.iov = io->rw.fast_iov; - req->io = NULL; - ret = io_import_iovec(WRITE, req, &io->rw.iov, &iter, !force_nonblock); - req->io = io; - if (ret < 0) - return ret; - - io_req_map_rw(req, ret, io->rw.iov, io->rw.fast_iov, &iter); - return 0; + return io_rw_prep_async(req, WRITE, force_nonblock); }
static int io_write(struct io_kiocb *req, bool force_nonblock)
From: Xiaoguang Wang xiaoguang.wang@linux.alibaba.com
mainline inclusion from mainline-5.9-rc1 commit 23b3628e45924419399da48c2b3a522b05557c91 category: feature bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27 CVE: NA ---------------------------
In io_sq_thread(), if there is task_work to handle, the current code skips schedule() and goes on polling the SQ again, but forgets to clear the IORING_SQ_NEED_WAKEUP flag; fix this issue. Also add two helpers to set and clear the IORING_SQ_NEED_WAKEUP flag.
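As a rough userspace sketch (illustrative, not part of the patch) of the protocol this flag implements: a submitter only needs to call io_uring_enter() when the SQPOLL thread has advertised that it went to sleep, which is exactly why a stale IORING_SQ_NEED_WAKEUP causes spurious wakeup syscalls. The sq_flags pointer and the helper name below are assumptions of this sketch.

#include <linux/io_uring.h>
#include <unistd.h>
#include <sys/syscall.h>

/* sq_flags is assumed to point at the flags word of the mmap'ed SQ ring */
static void sqpoll_submit(int ring_fd, const unsigned *sq_flags,
                          unsigned to_submit)
{
        if (*(volatile const unsigned *)sq_flags & IORING_SQ_NEED_WAKEUP)
                /* 426 is the upstream io_uring_enter syscall number */
                syscall(426, ring_fd, to_submit, 0 /* min_complete */,
                        IORING_ENTER_SQ_WAKEUP, NULL, 0);
        /* otherwise the kernel thread is awake and consumes SQEs itself */
}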
Signed-off-by: Xiaoguang Wang xiaoguang.wang@linux.alibaba.com Signed-off-by: Jens Axboe axboe@kernel.dk Signed-off-by: yangerkun yangerkun@huawei.com Reviewed-by: zhangyi (F) yi.zhang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- fs/io_uring.c | 29 +++++++++++++++++++---------- 1 file changed, 19 insertions(+), 10 deletions(-)
diff --git a/fs/io_uring.c b/fs/io_uring.c index 70adbafb37bf..a50e598336fb 100644 --- a/fs/io_uring.c +++ b/fs/io_uring.c @@ -5981,6 +5981,21 @@ static int io_submit_sqes(struct io_ring_ctx *ctx, unsigned int nr, return submitted; }
+static inline void io_ring_set_wakeup_flag(struct io_ring_ctx *ctx) +{ + /* Tell userspace we may need a wakeup call */ + spin_lock_irq(&ctx->completion_lock); + ctx->rings->sq_flags |= IORING_SQ_NEED_WAKEUP; + spin_unlock_irq(&ctx->completion_lock); +} + +static inline void io_ring_clear_wakeup_flag(struct io_ring_ctx *ctx) +{ + spin_lock_irq(&ctx->completion_lock); + ctx->rings->sq_flags &= ~IORING_SQ_NEED_WAKEUP; + spin_unlock_irq(&ctx->completion_lock); +} + static int io_sq_thread(void *data) { struct io_ring_ctx *ctx = data; @@ -6058,10 +6073,7 @@ static int io_sq_thread(void *data) continue; }
- /* Tell userspace we may need a wakeup call */ - spin_lock_irq(&ctx->completion_lock); - ctx->rings->sq_flags |= IORING_SQ_NEED_WAKEUP; - spin_unlock_irq(&ctx->completion_lock); + io_ring_set_wakeup_flag(ctx);
to_submit = io_sqring_entries(ctx); if (!to_submit || ret == -EBUSY) { @@ -6072,6 +6084,7 @@ static int io_sq_thread(void *data) if (current->task_works) { task_work_run(); finish_wait(&ctx->sqo_wait, &wait); + io_ring_clear_wakeup_flag(ctx); continue; } if (signal_pending(current)) @@ -6079,17 +6092,13 @@ static int io_sq_thread(void *data) schedule(); finish_wait(&ctx->sqo_wait, &wait);
- spin_lock_irq(&ctx->completion_lock); - ctx->rings->sq_flags &= ~IORING_SQ_NEED_WAKEUP; - spin_unlock_irq(&ctx->completion_lock); + io_ring_clear_wakeup_flag(ctx); ret = 0; continue; } finish_wait(&ctx->sqo_wait, &wait);
- spin_lock_irq(&ctx->completion_lock); - ctx->rings->sq_flags &= ~IORING_SQ_NEED_WAKEUP; - spin_unlock_irq(&ctx->completion_lock); + io_ring_clear_wakeup_flag(ctx); }
mutex_lock(&ctx->uring_lock);
From: Guoyu Huang hgy5945@gmail.com
mainline inclusion from mainline-5.9-rc1 commit 2dd2111d0d383df104b144e0d1f6b5a00cb7cd88 category: feature bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27 CVE: NA ---------------------------
loop_rw_iter() does not check whether the file has a read or write function. This can lead to a NULL pointer dereference when the user passes in a file descriptor that does not have a read or write function.
The crash log looks like this:
[ 99.834071] BUG: kernel NULL pointer dereference, address: 0000000000000000 [ 99.835364] #PF: supervisor instruction fetch in kernel mode [ 99.836522] #PF: error_code(0x0010) - not-present page [ 99.837771] PGD 8000000079d62067 P4D 8000000079d62067 PUD 79d8c067 PMD 0 [ 99.839649] Oops: 0010 [#2] SMP PTI [ 99.840591] CPU: 1 PID: 333 Comm: io_wqe_worker-0 Tainted: G D 5.8.0 #2 [ 99.842622] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.13.0-1ubuntu1 04/01/2014 [ 99.845140] RIP: 0010:0x0 [ 99.845840] Code: Bad RIP value. [ 99.846672] RSP: 0018:ffffa1c7c01ebc08 EFLAGS: 00010202 [ 99.848018] RAX: 0000000000000000 RBX: ffff92363bd67300 RCX: ffff92363d461208 [ 99.849854] RDX: 0000000000000010 RSI: 00007ffdbf696bb0 RDI: ffff92363bd67300 [ 99.851743] RBP: ffffa1c7c01ebc40 R08: 0000000000000000 R09: 0000000000000000 [ 99.853394] R10: ffffffff9ec692a0 R11: 0000000000000000 R12: 0000000000000010 [ 99.855148] R13: 0000000000000000 R14: ffff92363d461208 R15: ffffa1c7c01ebc68 [ 99.856914] FS: 0000000000000000(0000) GS:ffff92363dd00000(0000) knlGS:0000000000000000 [ 99.858651] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 99.860032] CR2: ffffffffffffffd6 CR3: 000000007ac66000 CR4: 00000000000006e0 [ 99.861979] Call Trace: [ 99.862617] loop_rw_iter.part.0+0xad/0x110 [ 99.863838] io_write+0x2ae/0x380 [ 99.864644] ? kvm_sched_clock_read+0x11/0x20 [ 99.865595] ? sched_clock+0x9/0x10 [ 99.866453] ? sched_clock_cpu+0x11/0xb0 [ 99.867326] ? newidle_balance+0x1d4/0x3c0 [ 99.868283] io_issue_sqe+0xd8f/0x1340 [ 99.869216] ? __switch_to+0x7f/0x450 [ 99.870280] ? __switch_to_asm+0x42/0x70 [ 99.871254] ? __switch_to_asm+0x36/0x70 [ 99.872133] ? lock_timer_base+0x72/0xa0 [ 99.873155] ? switch_mm_irqs_off+0x1bf/0x420 [ 99.874152] io_wq_submit_work+0x64/0x180 [ 99.875192] ? kthread_use_mm+0x71/0x100 [ 99.876132] io_worker_handle_work+0x267/0x440 [ 99.877233] io_wqe_worker+0x297/0x350 [ 99.878145] kthread+0x112/0x150 [ 99.878849] ? __io_worker_unuse+0x100/0x100 [ 99.879935] ? kthread_park+0x90/0x90 [ 99.880874] ret_from_fork+0x22/0x30 [ 99.881679] Modules linked in: [ 99.882493] CR2: 0000000000000000 [ 99.883324] ---[ end trace 4453745f4673190b ]--- [ 99.884289] RIP: 0010:0x0 [ 99.884837] Code: Bad RIP value. [ 99.885492] RSP: 0018:ffffa1c7c01ebc08 EFLAGS: 00010202 [ 99.886851] RAX: 0000000000000000 RBX: ffff92363acd7f00 RCX: ffff92363d461608 [ 99.888561] RDX: 0000000000000010 RSI: 00007ffe040d9e10 RDI: ffff92363acd7f00 [ 99.890203] RBP: ffffa1c7c01ebc40 R08: 0000000000000000 R09: 0000000000000000 [ 99.891907] R10: ffffffff9ec692a0 R11: 0000000000000000 R12: 0000000000000010 [ 99.894106] R13: 0000000000000000 R14: ffff92363d461608 R15: ffffa1c7c01ebc68 [ 99.896079] FS: 0000000000000000(0000) GS:ffff92363dd00000(0000) knlGS:0000000000000000 [ 99.898017] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 99.899197] CR2: ffffffffffffffd6 CR3: 000000007ac66000 CR4: 00000000000006e0
Fixes: 32960613b7c3 ("io_uring: correctly handle non ->{read,write}_iter() file_operations") Cc: stable@vger.kernel.org Signed-off-by: Guoyu Huang hgy5945@gmail.com Signed-off-by: Jens Axboe axboe@kernel.dk
Conflicts: fs/io_uring.c [commit bcf5a06304d6("io_uring: support true async buffered reads, if file provides it") is not merged] Signed-off-by: yangerkun yangerkun@huawei.com Reviewed-by: zhangyi (F) yi.zhang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- fs/io_uring.c | 8 ++++++-- 1 file changed, 6 insertions(+), 2 deletions(-)
diff --git a/fs/io_uring.c b/fs/io_uring.c index a50e598336fb..1a4d3408dd04 100644 --- a/fs/io_uring.c +++ b/fs/io_uring.c @@ -2663,8 +2663,10 @@ static int io_read(struct io_kiocb *req, bool force_nonblock)
if (req->file->f_op->read_iter) ret2 = call_read_iter(req->file, kiocb, &iter); - else + else if (req->file->f_op->read) ret2 = loop_rw_iter(READ, req->file, kiocb, &iter); + else + ret2 = -EINVAL;
/* Catch -EAGAIN return for forced non-blocking submission */ if (!force_nonblock || ret2 != -EAGAIN) { @@ -2766,8 +2768,10 @@ static int io_write(struct io_kiocb *req, bool force_nonblock)
if (req->file->f_op->write_iter) ret2 = call_write_iter(req->file, kiocb, &iter); - else + else if (req->file->f_op->write) ret2 = loop_rw_iter(WRITE, req->file, kiocb, &iter); + else + ret2 = -EINVAL;
if (!force_nonblock) current->signal->rlim[RLIMIT_FSIZE].rlim_cur = RLIM_INFINITY;
From: Jens Axboe axboe@kernel.dk
mainline inclusion from mainline-5.9-rc1 commit 0ba9c9edcd152158a0e321a4c13ac1dfc571ff3d category: feature bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27 CVE: NA ---------------------------
An earlier commit:
b7db41c9e03b ("io_uring: fix regression with always ignoring signals in io_cqring_wait()")
ensured that we didn't get stuck waiting for eventfd reads when it's registered with the io_uring ring for event notification, but we still have cases where the task can be waiting on other events in the kernel and needs a bigger nudge to make forward progress. Or the task could be in the kernel and running, but on its way to blocking.
This means that TWA_RESUME cannot reliably be used to ensure we make progress. Use TWA_SIGNAL unconditionally.
Cc: stable@vger.kernel.org # v5.7+ Reported-by: Josef josef.grieb@gmail.com Signed-off-by: Jens Axboe axboe@kernel.dk
Conflicts: fs/io_uring.c [commit c2c4c83c58cb("io_uring: use new io_req_task_work_add() helper throughout") is not merged] Signed-off-by: yangerkun yangerkun@huawei.com Reviewed-by: zhangyi (F) yi.zhang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- fs/io_uring.c | 16 ++++++++-------- 1 file changed, 8 insertions(+), 8 deletions(-)
diff --git a/fs/io_uring.c b/fs/io_uring.c index 1a4d3408dd04..3a94a7abba87 100644 --- a/fs/io_uring.c +++ b/fs/io_uring.c @@ -4048,22 +4048,22 @@ static int io_req_task_work_add(struct io_kiocb *req, struct callback_head *cb) { struct task_struct *tsk = req->task; struct io_ring_ctx *ctx = req->ctx; - int ret, notify = TWA_RESUME; + int ret, notify;
/* - * SQPOLL kernel thread doesn't need notification, just a wakeup. - * If we're not using an eventfd, then TWA_RESUME is always fine, - * as we won't have dependencies between request completions for - * other kernel wait conditions. + * SQPOLL kernel thread doesn't need notification, just a wakeup. For + * all other cases, use TWA_SIGNAL unconditionally to ensure we're + * processing task_work. There's no reliable way to tell if TWA_RESUME + * will do the job. */ - if (ctx->flags & IORING_SETUP_SQPOLL) - notify = 0; - else if (ctx->cq_ev_fd) + notify = 0; + if (!(ctx->flags & IORING_SETUP_SQPOLL)) notify = TWA_SIGNAL;
ret = task_work_add(tsk, cb, notify); if (!ret) wake_up_process(tsk); + return ret; }
From: Jens Axboe axboe@kernel.dk
mainline inclusion from mainline-5.9-rc1 commit 6d816e088c359866f9867057e04f244c608c42fe category: feature bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27 CVE: NA ---------------------------
We're holding the request reference, but we need to go one higher to ensure that the ctx remains valid after the request has finished. If the ring is closed with pending task_work inflight, and the given io_kiocb finishes sync during issue, then we need a reference to the ring itself around the task_work execution cycle.
Cc: stable@vger.kernel.org # v5.7+ Reported-by: syzbot+9b260fc33297966f5a8e@syzkaller.appspotmail.com Signed-off-by: Jens Axboe axboe@kernel.dk
Conflicts: fs/io_uring.c [commit c40f63790ec9("io_uring: use task_work for links if possible") is not merged] Signed-off-by: yangerkun yangerkun@huawei.com Reviewed-by: zhangyi (F) yi.zhang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- fs/io_uring.c | 6 ++++++ 1 file changed, 6 insertions(+)
diff --git a/fs/io_uring.c b/fs/io_uring.c index 3a94a7abba87..8078e36ca9f2 100644 --- a/fs/io_uring.c +++ b/fs/io_uring.c @@ -4084,6 +4084,8 @@ static int __io_async_wake(struct io_kiocb *req, struct io_poll_iocb *poll, tsk = req->task; req->result = mask; init_task_work(&req->task_work, func); + percpu_ref_get(&req->ctx->refs); + /* * If this fails, then the task is exiting. When a task exits, the * work gets canceled, so just cancel this request as well instead @@ -4168,6 +4170,7 @@ static void io_poll_task_handler(struct io_kiocb *req, struct io_kiocb **nxt) static void io_poll_task_func(struct callback_head *cb) { struct io_kiocb *req = container_of(cb, struct io_kiocb, task_work); + struct io_ring_ctx *ctx = req->ctx; struct io_kiocb *nxt = NULL;
io_poll_task_handler(req, &nxt); @@ -4178,6 +4181,7 @@ static void io_poll_task_func(struct callback_head *cb) __io_queue_sqe(nxt, NULL); mutex_unlock(&ctx->uring_lock); } + percpu_ref_put(&ctx->refs); }
static int io_poll_double_wake(struct wait_queue_entry *wait, unsigned mode, @@ -4296,6 +4300,7 @@ static void io_async_task_func(struct callback_head *cb)
if (io_poll_rewait(req, &apoll->poll)) { spin_unlock_irq(&ctx->completion_lock); + percpu_ref_put(&ctx->refs); return; }
@@ -4316,6 +4321,7 @@ static void io_async_task_func(struct callback_head *cb) /* restore ->work in case we need to retry again */ if (req->flags & REQ_F_WORK_INITIALIZED) memcpy(&req->work, &apoll->work, sizeof(req->work)); + percpu_ref_put(&ctx->refs); kfree(apoll->double_poll); kfree(apoll);
From: Jens Axboe axboe@kernel.dk
mainline inclusion from mainline-5.9-rc3 commit 56450c20fe10d4d93f58019109aa4e06fc0b9206 category: feature bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27 CVE: NA ---------------------------
Make sure we clear req->result, which was set to -EAGAIN for retry purposes, when moving it to the reissue list. Otherwise we can end up retrying a request more than once, which leads to weird results in the io-wq handling (and other spots).
Cc: stable@vger.kernel.org Reported-by: Andres Freund andres@anarazel.de Signed-off-by: Jens Axboe axboe@kernel.dk Signed-off-by: yangerkun yangerkun@huawei.com Reviewed-by: zhangyi (F) yi.zhang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- fs/io_uring.c | 1 + 1 file changed, 1 insertion(+)
diff --git a/fs/io_uring.c b/fs/io_uring.c index 8078e36ca9f2..a1dc12105450 100644 --- a/fs/io_uring.c +++ b/fs/io_uring.c @@ -1780,6 +1780,7 @@ static void io_iopoll_complete(struct io_ring_ctx *ctx, unsigned int *nr_events,
req = list_first_entry(done, struct io_kiocb, list); if (READ_ONCE(req->result) == -EAGAIN) { + req->result = 0; req->iopoll_completed = 0; list_move_tail(&req->list, &again); continue;
From: Jens Axboe axboe@kernel.dk
mainline inclusion from mainline-5.9-rc3 commit eefdf30f3dcb5c1d47bee2b3afdb9d4d05343ff3 category: feature bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27 CVE: NA ---------------------------
This normally isn't hit, as polling is mostly done on NVMe with deep queue depths. But if we do run into request starvation, we need to ensure that retries are properly serialized.
Reported-by: Andres Freund andres@anarazel.de Signed-off-by: Jens Axboe axboe@kernel.dk
Conflicts: fs/io_uring.c [note IOPOLL retry in io_read for commit 4503b7676a2e("io_uring: catch -EIO from buffered issue request failure")] Signed-off-by: yangerkun yangerkun@huawei.com Reviewed-by: zhangyi (F) yi.zhang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- fs/io_uring.c | 14 +++++++++++--- 1 file changed, 11 insertions(+), 3 deletions(-)
diff --git a/fs/io_uring.c b/fs/io_uring.c index a1dc12105450..6e4170e2c60f 100644 --- a/fs/io_uring.c +++ b/fs/io_uring.c @@ -1103,7 +1103,7 @@ static inline void io_prep_async_work(struct io_kiocb *req, io_req_init_async(req);
if (req->flags & REQ_F_ISREG) { - if (def->hash_reg_file) + if (def->hash_reg_file || (req->ctx->flags & IORING_SETUP_IOPOLL)) io_wq_hash_work(&req->work, file_inode(req->file)); } else { if (def->unbound_nonreg_file) @@ -2640,6 +2640,7 @@ static int io_read(struct io_kiocb *req, bool force_nonblock) ret = io_import_iovec(READ, req, &iovec, &iter, !force_nonblock); if (ret < 0) return ret; + iov_count = iov_iter_count(&iter);
/* Ensure we clear previously set non-block flag */ if (!force_nonblock) @@ -2657,7 +2658,6 @@ static int io_read(struct io_kiocb *req, bool force_nonblock) if (force_nonblock && !io_file_supports_async(req->file, READ)) goto copy_iov;
- iov_count = iov_iter_count(&iter); ret = rw_verify_area(READ, req->file, &kiocb->ki_pos, iov_count); if (!ret) { ssize_t ret2; @@ -2671,6 +2671,10 @@ static int io_read(struct io_kiocb *req, bool force_nonblock)
/* Catch -EAGAIN return for forced non-blocking submission */ if (!force_nonblock || ret2 != -EAGAIN) { + /* IOPOLL retry should happen for io-wq threads */ + if ((req->ctx->flags & IORING_SETUP_IOPOLL) && + ret2 == -EAGAIN) + goto copy_iov; kiocb_done(kiocb, ret2); } else { copy_iov: @@ -2722,6 +2726,7 @@ static int io_write(struct io_kiocb *req, bool force_nonblock) ret = io_import_iovec(WRITE, req, &iovec, &iter, !force_nonblock); if (ret < 0) return ret; + iov_count = iov_iter_count(&iter);
/* Ensure we clear previously set non-block flag */ if (!force_nonblock) @@ -2744,7 +2749,6 @@ static int io_write(struct io_kiocb *req, bool force_nonblock) (req->flags & REQ_F_ISREG)) goto copy_iov;
- iov_count = iov_iter_count(&iter); ret = rw_verify_area(WRITE, req->file, &kiocb->ki_pos, iov_count); if (!ret) { ssize_t ret2; @@ -2784,6 +2788,10 @@ static int io_write(struct io_kiocb *req, bool force_nonblock) if (ret2 == -EOPNOTSUPP && (kiocb->ki_flags & IOCB_NOWAIT)) ret2 = -EAGAIN; if (!force_nonblock || ret2 != -EAGAIN) { + /* IOPOLL retry should happen for io-wq threads */ + if ((req->ctx->flags & IORING_SETUP_IOPOLL) && + ret2 == -EAGAIN) + goto copy_iov; kiocb_done(kiocb, ret2); } else { copy_iov:
From: Jiufei Xue jiufei.xue@linux.alibaba.com
mainline inclusion from mainline-5.9-rc4 commit 98dfd5024a2e9e170b85c07078e2d89f20a5dfbd category: feature bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27 CVE: NA ---------------------------
The index here is already the position of the file in the fixed_file_table, so we should not use io_file_from_index() again to look it up. Otherwise the wrong file, which may still be in use, could be released unexpectedly.
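For context, a minimal sketch of the two-level fixed-file table involved here (the shift/mask constants mirror fs/io_uring.c of this era; the lookup helper is hypothetical): i is a full-range index, while index is already the slot within one table page, so only table->files[index] resolves to the right file.

#define IORING_FILE_TABLE_SHIFT 9
#define IORING_FILE_TABLE_MASK  ((1U << IORING_FILE_TABLE_SHIFT) - 1)

struct file;

struct fixed_file_table {
        struct file **files;
};

/* correct full-range lookup: split i into page and in-page slot */
static struct file *fixed_file_at(struct fixed_file_table *tables,
                                  unsigned int i)
{
        struct fixed_file_table *table =
                &tables[i >> IORING_FILE_TABLE_SHIFT];

        return table->files[i & IORING_FILE_TABLE_MASK];
}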
Cc: stable@vger.kernel.org # v5.6 Fixes: 05f3fb3c5397 ("io_uring: avoid ring quiesce for fixed file set unregister and update") Signed-off-by: Jiufei Xue jiufei.xue@linux.alibaba.com Signed-off-by: Jens Axboe axboe@kernel.dk Signed-off-by: yangerkun yangerkun@huawei.com Reviewed-by: zhangyi (F) yi.zhang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- fs/io_uring.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/fs/io_uring.c b/fs/io_uring.c index 6e4170e2c60f..9e87587b1f3e 100644 --- a/fs/io_uring.c +++ b/fs/io_uring.c @@ -6830,7 +6830,7 @@ static int __io_sqe_files_update(struct io_ring_ctx *ctx, table = &ctx->file_data->table[i >> IORING_FILE_TABLE_SHIFT]; index = i & IORING_FILE_TABLE_MASK; if (table->files[index]) { - file = io_file_from_index(ctx, index); + file = table->files[index]; err = io_queue_file_removal(data, file); if (err) break;
From: Jiufei Xue jiufei.xue@linux.alibaba.com
mainline inclusion from mainline-v5.9-rc4 commit 95d1c8e5f801e959a89181a2548a3efa60a1a6ce category: feature bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27 CVE: NA ---------------------------
If io_sqe_file_register() fails in __io_sqe_files_update(), table->files[i] still points to the original file, which may be freed soon, and that will trigger use-after-free problems.
Cc: stable@vger.kernel.org Fixes: f3bd9dae3708 ("io_uring: fix memleak in __io_sqe_files_update()") Signed-off-by: Jiufei Xue jiufei.xue@linux.alibaba.com Signed-off-by: Jens Axboe axboe@kernel.dk Signed-off-by: yangerkun yangerkun@huawei.com Reviewed-by: zhangyi (F) yi.zhang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- fs/io_uring.c | 1 + 1 file changed, 1 insertion(+)
diff --git a/fs/io_uring.c b/fs/io_uring.c index 9e87587b1f3e..6bdc6dcde6a2 100644 --- a/fs/io_uring.c +++ b/fs/io_uring.c @@ -6859,6 +6859,7 @@ static int __io_sqe_files_update(struct io_ring_ctx *ctx, table->files[index] = file; err = io_sqe_file_register(ctx, file, i); if (err) { + table->files[index] = NULL; fput(file); break; }
From: Jens Axboe axboe@kernel.dk
mainline inclusion from mainline-5.9-rc1 commit f74441e6311a28f0ee89b9c8e296a33730f812fc category: feature bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27 CVE: NA ---------------------------
The teardown paths will always unaccount the memory, so ensure that we have accounted it before hitting any of them.
Reported-by: Tomáš Chaloupka chalucha@gmail.com Reviewed-by: Stefano Garzarella sgarzare@redhat.com Signed-off-by: Jens Axboe axboe@kernel.dk Signed-off-by: yangerkun yangerkun@huawei.com Reviewed-by: zhangyi (F) yi.zhang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- fs/io_uring.c | 18 ++++++++++-------- 1 file changed, 10 insertions(+), 8 deletions(-)
diff --git a/fs/io_uring.c b/fs/io_uring.c index fa1ce29d5f67..a8218ff4df42 100644 --- a/fs/io_uring.c +++ b/fs/io_uring.c @@ -7990,6 +7990,16 @@ static int io_uring_create(unsigned entries, struct io_uring_params *p, ctx->user = user; ctx->creds = get_current_cred();
+ /* + * Account memory _before_ installing the file descriptor. Once + * the descriptor is installed, it can get closed at any time. Also + * do this before hitting the general error path, as ring freeing + * will un-account as well. + */ + io_account_mem(ctx, ring_pages(p->sq_entries, p->cq_entries), + ACCT_LOCKED); + ctx->limit_mem = limit_mem; + ret = io_allocate_scq_urings(ctx, p); if (ret) goto err; @@ -8026,14 +8036,6 @@ static int io_uring_create(unsigned entries, struct io_uring_params *p, goto err; }
- /* - * Account memory _before_ installing the file descriptor. Once - * the descriptor is installed, it can get closed at any time. - */ - io_account_mem(ctx, ring_pages(p->sq_entries, p->cq_entries), - ACCT_LOCKED); - ctx->limit_mem = limit_mem; - /* * Install ring fd as the very last thing, so we don't risk someone * having closed it before we finish setup
From: Pavel Begunkov asml.silence@gmail.com
mainline inclusion from mainline-5.9-rc1 commit d3cac64c498c4fb2df46b97ee6f4c7d6d75f5e3d category: feature bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27 CVE: NA ---------------------------
__io_queue_sqe() tries to handle all requests of a link, so it's not enough to grab the mm in io_sq_thread_acquire_mm() based just on the head.
Don't check req->needs_mm and do it always.
Signed-off-by: Pavel Begunkov asml.silence@gmail.com
Conflicts: fs/io_uring.c
Signed-off-by: yangerkun yangerkun@huawei.com Reviewed-by: zhangyi (F) yi.zhang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- fs/io_uring.c | 15 +++++++++++---- 1 file changed, 11 insertions(+), 4 deletions(-)
diff --git a/fs/io_uring.c b/fs/io_uring.c index a8218ff4df42..a3de263431df 100644 --- a/fs/io_uring.c +++ b/fs/io_uring.c @@ -4294,10 +4294,9 @@ static void io_sq_thread_drop_mm(void) } }
-static int io_sq_thread_acquire_mm(struct io_ring_ctx *ctx, - struct io_kiocb *req) +static int __io_sq_thread_acquire_mm(struct io_ring_ctx *ctx) { - if (io_op_defs[req->opcode].needs_mm && !current->mm) { + if (!current->mm) { if (unlikely(!mmget_not_zero(ctx->sqo_mm))) return -EFAULT; use_mm(ctx->sqo_mm); @@ -4306,6 +4305,14 @@ static int io_sq_thread_acquire_mm(struct io_ring_ctx *ctx, return 0; }
+static int io_sq_thread_acquire_mm(struct io_ring_ctx *ctx, + struct io_kiocb *req) +{ + if (!io_op_defs[req->opcode].needs_mm) + return 0; + return __io_sq_thread_acquire_mm(ctx); +} + static void io_async_task_func(struct callback_head *cb) { struct io_kiocb *req = container_of(cb, struct io_kiocb, task_work); @@ -4344,7 +4351,7 @@ static void io_async_task_func(struct callback_head *cb)
if (!canceled) { __set_current_state(TASK_RUNNING); - if (io_sq_thread_acquire_mm(ctx, req)) { + if (__io_sq_thread_acquire_mm(ctx)) { io_cqring_add_event(req, -EFAULT); goto end_req; }
From: Pavel Begunkov asml.silence@gmail.com
mainline inclusion from mainline-5.9-rc1 commit 8eb06d7e8dd853d70668617dda57de4f6cebe651 category: feature bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27 CVE: NA ---------------------------
There is a fancy bug where an exiting user task may not have ->mm, which makes task_work try to do kthread_use_mm(ctx->sqo_mm).
Don't do that if sqo_mm is NULL.
[ 290.460558] WARNING: CPU: 6 PID: 150933 at kernel/kthread.c:1238 kthread_use_mm+0xf3/0x110 [ 290.460579] CPU: 6 PID: 150933 Comm: read-write2 Tainted: G I E 5.8.0-rc2-00066-g9b21720607cf #531 [ 290.460580] RIP: 0010:kthread_use_mm+0xf3/0x110 ... [ 290.460584] Call Trace: [ 290.460584] __io_sq_thread_acquire_mm.isra.0.part.0+0x25/0x30 [ 290.460584] __io_req_task_submit+0x64/0x80 [ 290.460584] io_req_task_submit+0x15/0x20 [ 290.460585] task_work_run+0x67/0xa0 [ 290.460585] do_exit+0x35d/0xb70 [ 290.460585] do_group_exit+0x43/0xa0 [ 290.460585] get_signal+0x140/0x900 [ 290.460586] do_signal+0x37/0x780 [ 290.460586] __prepare_exit_to_usermode+0x126/0x1c0 [ 290.460586] __syscall_return_slowpath+0x3b/0x1c0 [ 290.460587] do_syscall_64+0x5f/0xa0 [ 290.460587] entry_SYSCALL_64_after_hwframe+0x44/0xa9
following with faults.
Signed-off-by: Pavel Begunkov asml.silence@gmail.com Signed-off-by: Jens Axboe axboe@kernel.dk
Conflicts: fs/io_uring.c
Signed-off-by: yangerkun yangerkun@huawei.com Reviewed-by: zhangyi (F) yi.zhang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- fs/io_uring.c | 14 ++++++++------ 1 file changed, 8 insertions(+), 6 deletions(-)
diff --git a/fs/io_uring.c b/fs/io_uring.c index a3de263431df..0b5a28d0c7ba 100644 --- a/fs/io_uring.c +++ b/fs/io_uring.c @@ -4297,7 +4297,7 @@ static void io_sq_thread_drop_mm(void) static int __io_sq_thread_acquire_mm(struct io_ring_ctx *ctx) { if (!current->mm) { - if (unlikely(!mmget_not_zero(ctx->sqo_mm))) + if (unlikely(!ctx->sqo_mm || !mmget_not_zero(ctx->sqo_mm))) return -EFAULT; use_mm(ctx->sqo_mm); } @@ -6955,10 +6955,10 @@ static int io_sq_offload_start(struct io_ring_ctx *ctx, { int ret;
- mmgrab(current->mm); - ctx->sqo_mm = current->mm; - if (ctx->flags & IORING_SETUP_SQPOLL) { + mmgrab(current->mm); + ctx->sqo_mm = current->mm; + ret = -EPERM; if (!capable(CAP_SYS_ADMIN)) goto err; @@ -7002,8 +7002,10 @@ static int io_sq_offload_start(struct io_ring_ctx *ctx, return 0; err: io_finish_async(ctx); - mmdrop(ctx->sqo_mm); - ctx->sqo_mm = NULL; + if (ctx->sqo_mm) { + mmdrop(ctx->sqo_mm); + ctx->sqo_mm = NULL; + } return ret; }
From: Pavel Begunkov asml.silence@gmail.com
mainline inclusion from mainline-5.9-rc1 commit b2edc0a77fac19bbdef63cedb2ea34aec1a9a499 category: feature bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27 CVE: NA ---------------------------
First of all, don't spin in io_ring_ctx_wait_and_kill() on iopoll. Requests won't complete faster because of that; it only lengthens io_uring_release().
The same goes for the offloaded cleanup in io_ring_exit_work() -- it already has a waiting loop, so don't do blocking active spinning there.
For that, pass min=0 into io_iopoll_[try_]reap_events() so it won't actively spin, and leave the function when io_do_iopoll() can't complete a request, so we sleep in io_ring_exit_work() instead.
Signed-off-by: Pavel Begunkov asml.silence@gmail.com Signed-off-by: Jens Axboe axboe@kernel.dk
Conflicts: fs/io_uring.c
Signed-off-by: yangerkun yangerkun@huawei.com Reviewed-by: zhangyi (F) yi.zhang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- fs/io_uring.c | 23 +++++++++++------------ 1 file changed, 11 insertions(+), 12 deletions(-)
diff --git a/fs/io_uring.c b/fs/io_uring.c index b68605e4eb48..19db01bc88bd 100644 --- a/fs/io_uring.c +++ b/fs/io_uring.c @@ -1878,7 +1878,7 @@ static int io_iopoll_getevents(struct io_ring_ctx *ctx, unsigned int *nr_events, * We can't just wait for polled events to come to us, we have to actively * find and complete them. */ -static void io_iopoll_reap_events(struct io_ring_ctx *ctx) +static void io_iopoll_try_reap_events(struct io_ring_ctx *ctx) { if (!(ctx->flags & IORING_SETUP_IOPOLL)) return; @@ -1887,8 +1887,11 @@ static void io_iopoll_reap_events(struct io_ring_ctx *ctx) while (!list_empty(&ctx->poll_list)) { unsigned int nr_events = 0;
- io_do_iopoll(ctx, &nr_events, 1); + io_do_iopoll(ctx, &nr_events, 0);
+ /* let it sleep and repeat later if can't complete a request */ + if (nr_events == 0) + break; /* * Ensure we allow local-to-the-cpu processing to take place, * in this case we need to ensure that we reap all events. @@ -7366,7 +7369,6 @@ static void io_ring_ctx_free(struct io_ring_ctx *ctx) ctx->sqo_mm = NULL; }
- io_iopoll_reap_events(ctx); io_sqe_buffer_unregister(ctx); io_sqe_files_unregister(ctx); io_eventfd_unregister(ctx); @@ -7431,11 +7433,8 @@ static int io_remove_personalities(int id, void *p, void *data)
static void io_ring_exit_work(struct work_struct *work) { - struct io_ring_ctx *ctx; - - ctx = container_of(work, struct io_ring_ctx, exit_work); - if (ctx->rings) - io_cqring_overflow_flush(ctx, true); + struct io_ring_ctx *ctx = container_of(work, struct io_ring_ctx, + exit_work);
/* * If we're doing polled IO and end up having requests being @@ -7443,11 +7442,11 @@ static void io_ring_exit_work(struct work_struct *work) * we're waiting for refs to drop. We need to reap these manually, * as nobody else will be looking for them. */ - while (!wait_for_completion_timeout(&ctx->ref_comp, HZ/20)) { - io_iopoll_reap_events(ctx); + do { if (ctx->rings) io_cqring_overflow_flush(ctx, true); - } + io_iopoll_try_reap_events(ctx); + } while (!wait_for_completion_timeout(&ctx->ref_comp, HZ/20)); io_ring_ctx_free(ctx); }
@@ -7463,10 +7462,10 @@ static void io_ring_ctx_wait_and_kill(struct io_ring_ctx *ctx) if (ctx->io_wq) io_wq_cancel_all(ctx->io_wq);
- io_iopoll_reap_events(ctx); /* if we failed setting up the ctx, we might not have any rings */ if (ctx->rings) io_cqring_overflow_flush(ctx, true); + io_iopoll_try_reap_events(ctx); idr_for_each(&ctx->personality_idr, io_remove_personalities, ctx);
/*
From: Pavel Begunkov asml.silence@gmail.com
mainline inclusion from mainline-5.9-rc1 commit 5dbcad51f78434e782d0470b8b5fc4380700c35f category: feature bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27 CVE: NA ---------------------------
io_sqe_buffer_unregister() uses ctx->sqo_mm for memory accounting, but io_ring_ctx_free() drops ->sqo_mm first, leaving pinned_vm over-accounted. Postpone the mm cleanup until it's no longer needed.
Fixes: 309758254ea62 ("io_uring: report pinned memory usage") Signed-off-by: Pavel Begunkov asml.silence@gmail.com Signed-off-by: Jens Axboe axboe@kernel.dk Signed-off-by: yangerkun yangerkun@huawei.com Reviewed-by: zhangyi (F) yi.zhang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- fs/io_uring.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/fs/io_uring.c b/fs/io_uring.c index 19db01bc88bd..d168cb1b051e 100644 --- a/fs/io_uring.c +++ b/fs/io_uring.c @@ -7364,12 +7364,12 @@ static void io_destroy_buffers(struct io_ring_ctx *ctx) static void io_ring_ctx_free(struct io_ring_ctx *ctx) { io_finish_async(ctx); + io_sqe_buffer_unregister(ctx); if (ctx->sqo_mm) { mmdrop(ctx->sqo_mm); ctx->sqo_mm = NULL; }
- io_sqe_buffer_unregister(ctx); io_sqe_files_unregister(ctx); io_eventfd_unregister(ctx); io_destroy_buffers(ctx);
From: Pavel Begunkov asml.silence@gmail.com
mainline inclusion from mainline-v5.12-rc5 commit d81269fecb8ce16eb07efafc9ff5520b2a31c486 category: feature bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27 CVE: NA ---------------------------
io_provide_buffers_prep()'s "p->len * p->nbufs" is subject to sign extension problems. Not a huge problem, as it's only used for access_ok() and increases the checked length, but better to keep the typing right.
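A small standalone demonstration (not from the patch) of why the product must be widened before the multiplication happens: computed in 32 bits, it wraps before it ever reaches the unsigned long length parameter of access_ok().

#include <stdio.h>

int main(void)
{
        unsigned int len = 0x80000000u; /* 2 GiB per buffer */
        unsigned int nbufs = 2;

        unsigned long narrow = len * nbufs;              /* 32-bit multiply wraps to 0 */
        unsigned long wide = (unsigned long)len * nbufs; /* 4 GiB, as intended */

        /* on a 64-bit box: narrow=0 wide=4294967296 */
        printf("narrow=%lu wide=%lu\n", narrow, wide);
        return 0;
}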
Reported-by: Colin Ian King colin.king@canonical.com Fixes: efe68c1ca8f49 ("io_uring: validate the full range of provided buffers for access") Signed-off-by: Pavel Begunkov asml.silence@gmail.com Reviewed-by: Colin Ian King colin.king@canonical.com Link: https://lore.kernel.org/r/562376a39509e260d8532186a06226e56eb1f594.161614923... Signed-off-by: Jens Axboe axboe@kernel.dk Signed-off-by: yangerkun yangerkun@huawei.com Reviewed-by: zhangyi (F) yi.zhang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- fs/io_uring.c | 4 +++- 1 file changed, 3 insertions(+), 1 deletion(-)
diff --git a/fs/io_uring.c b/fs/io_uring.c index 509ccdacab70..be440687fdb9 100644 --- a/fs/io_uring.c +++ b/fs/io_uring.c @@ -3159,6 +3159,7 @@ static int io_remove_buffers(struct io_kiocb *req, bool force_nonblock) static int io_provide_buffers_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe) { + unsigned long size; struct io_provide_buf *p = &req->pbuf; u64 tmp;
@@ -3172,7 +3173,8 @@ static int io_provide_buffers_prep(struct io_kiocb *req, p->addr = READ_ONCE(sqe->addr); p->len = READ_ONCE(sqe->len);
- if (!access_ok(u64_to_user_ptr(p->addr), (p->len * p->nbufs))) + size = (unsigned long)p->len * p->nbufs; + if (!access_ok(u64_to_user_ptr(p->addr), size)) return -EFAULT;
p->bgid = READ_ONCE(sqe->buf_group);
From: yangerkun yangerkun@huawei.com
hulk inclusion category: feature bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27 CVE: NA ---------------------------
Backporting io_uring extends the syscall number range, which would change KABI (e.g. bpf_trace_run1). Fix it by dispatching the new syscall numbers by hand in do_syscall_64().
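From userspace the hack is invisible: the three io_uring calls keep their upstream syscall numbers, they are merely dispatched outside sys_call_table. A hedged sketch of a raw wrapper (the function name is made up for illustration):

#include <linux/io_uring.h>
#include <unistd.h>
#include <sys/syscall.h>

static int raw_io_uring_setup(unsigned entries, struct io_uring_params *p)
{
        /* 425 matches the upstream __NR_io_uring_setup value */
        return (int)syscall(425, entries, p);
}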
Signed-off-by: yangerkun yangerkun@huawei.com Reviewed-by: zhangyi (F) yi.zhang@huawei.com Reviewed-by: Chen Zhou chenzhou10@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- arch/x86/entry/common.c | 7 +++++++ arch/x86/entry/syscalls/syscall_32.tbl | 3 --- arch/x86/entry/syscalls/syscall_64.tbl | 3 --- arch/x86/include/asm/syscall_wrapper.h | 3 +++ 4 files changed, 10 insertions(+), 6 deletions(-)
diff --git a/arch/x86/entry/common.c b/arch/x86/entry/common.c index 8353348ddeaf..0723098a3961 100644 --- a/arch/x86/entry/common.c +++ b/arch/x86/entry/common.c @@ -291,6 +291,13 @@ __visible void do_syscall_64(unsigned long nr, struct pt_regs *regs) if (likely(nr < NR_syscalls)) { nr = array_index_nospec(nr, NR_syscalls); regs->ax = sys_call_table[nr](regs); + } else { + if (nr == 425) + regs->ax = __x64_sys_io_uring_setup(regs); + else if (likely(nr == 426)) + regs->ax = __x64_sys_io_uring_enter(regs); + else if (nr == 427) + regs->ax = __x64_sys_io_uring_register(regs); }
syscall_return_slowpath(regs); diff --git a/arch/x86/entry/syscalls/syscall_32.tbl b/arch/x86/entry/syscalls/syscall_32.tbl index 2eefd2a7c1ce..3cf7b533b3d1 100644 --- a/arch/x86/entry/syscalls/syscall_32.tbl +++ b/arch/x86/entry/syscalls/syscall_32.tbl @@ -398,6 +398,3 @@ 384 i386 arch_prctl sys_arch_prctl __ia32_compat_sys_arch_prctl 385 i386 io_pgetevents sys_io_pgetevents __ia32_compat_sys_io_pgetevents 386 i386 rseq sys_rseq __ia32_sys_rseq -425 i386 io_uring_setup sys_io_uring_setup __ia32_sys_io_uring_setup -426 i386 io_uring_enter sys_io_uring_enter __ia32_sys_io_uring_enter -427 i386 io_uring_register sys_io_uring_register __ia32_sys_io_uring_register diff --git a/arch/x86/entry/syscalls/syscall_64.tbl b/arch/x86/entry/syscalls/syscall_64.tbl index 65c026185e61..f0b1709a5ffb 100644 --- a/arch/x86/entry/syscalls/syscall_64.tbl +++ b/arch/x86/entry/syscalls/syscall_64.tbl @@ -343,9 +343,6 @@ 332 common statx __x64_sys_statx 333 common io_pgetevents __x64_sys_io_pgetevents 334 common rseq __x64_sys_rseq -425 common io_uring_setup __x64_sys_io_uring_setup -426 common io_uring_enter __x64_sys_io_uring_enter -427 common io_uring_register __x64_sys_io_uring_register
# # x32-specific system call numbers start at 512 to avoid cache impact diff --git a/arch/x86/include/asm/syscall_wrapper.h b/arch/x86/include/asm/syscall_wrapper.h index 90eb70df0b18..46e125b2d08a 100644 --- a/arch/x86/include/asm/syscall_wrapper.h +++ b/arch/x86/include/asm/syscall_wrapper.h @@ -206,5 +206,8 @@ struct pt_regs; asmlinkage long __x64_sys_getcpu(const struct pt_regs *regs); asmlinkage long __x64_sys_gettimeofday(const struct pt_regs *regs); asmlinkage long __x64_sys_time(const struct pt_regs *regs); +asmlinkage long __x64_sys_io_uring_setup(const struct pt_regs *regs); +asmlinkage long __x64_sys_io_uring_enter(const struct pt_regs *regs); +asmlinkage long __x64_sys_io_uring_register(const struct pt_regs *regs);
#endif /* _ASM_X86_SYSCALL_WRAPPER_H */
From: yangerkun yangerkun@huawei.com
hulk inclusion category: feature bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27 CVE: NA ---------------------------
Do the same as on x86, this time for arm64.
Signed-off-by: yangerkun yangerkun@huawei.com Reviewed-by: zhangyi (F) yi.zhang@huawei.com Reviewed-by: Chen Zhou chenzhou10@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- arch/arm64/include/asm/syscall_wrapper.h | 5 +++++ arch/arm64/kernel/syscall.c | 9 ++++++++- include/uapi/asm-generic/unistd.h | 8 +------- 3 files changed, 14 insertions(+), 8 deletions(-)
diff --git a/arch/arm64/include/asm/syscall_wrapper.h b/arch/arm64/include/asm/syscall_wrapper.h index 507d0ee6bc69..8523ac1281f9 100644 --- a/arch/arm64/include/asm/syscall_wrapper.h +++ b/arch/arm64/include/asm/syscall_wrapper.h @@ -77,4 +77,9 @@ #define SYS_NI(name) SYSCALL_ALIAS(__arm64_sys_##name, sys_ni_posix_timers); #endif
+struct pt_regs; +asmlinkage long __arm64_sys_io_uring_setup(const struct pt_regs *regs); +asmlinkage long __arm64_sys_io_uring_enter(const struct pt_regs *regs); +asmlinkage long __arm64_sys_io_uring_register(const struct pt_regs *regs); + #endif /* __ASM_SYSCALL_WRAPPER_H */ diff --git a/arch/arm64/kernel/syscall.c b/arch/arm64/kernel/syscall.c index cee2933bd6c1..2bf13fef1678 100644 --- a/arch/arm64/kernel/syscall.c +++ b/arch/arm64/kernel/syscall.c @@ -47,7 +47,14 @@ static void invoke_syscall(struct pt_regs *regs, unsigned int scno, syscall_fn = syscall_table[array_index_nospec(scno, sc_nr)]; ret = __invoke_syscall(regs, syscall_fn); } else { - ret = do_ni_syscall(regs, scno); + if (scno == 425) + ret = __arm64_sys_io_uring_setup(regs); + else if (likely(scno == 426)) + ret = __arm64_sys_io_uring_enter(regs); + else if (scno == 427) + ret = __arm64_sys_io_uring_register(regs); + else + ret = do_ni_syscall(regs, scno); }
regs->regs[0] = ret; diff --git a/include/uapi/asm-generic/unistd.h b/include/uapi/asm-generic/unistd.h index 4c1ba6d0dac8..b538ed1be4eb 100644 --- a/include/uapi/asm-generic/unistd.h +++ b/include/uapi/asm-generic/unistd.h @@ -740,15 +740,9 @@ __SYSCALL(__NR_statx, sys_statx) __SC_COMP(__NR_io_pgetevents, sys_io_pgetevents, compat_sys_io_pgetevents) #define __NR_rseq 293 __SYSCALL(__NR_rseq, sys_rseq) -#define __NR_io_uring_setup 425 -__SYSCALL(__NR_io_uring_setup, sys_io_uring_setup) -#define __NR_io_uring_enter 426 -__SYSCALL(__NR_io_uring_enter, sys_io_uring_enter) -#define __NR_io_uring_register 427 -__SYSCALL(__NR_io_uring_register, sys_io_uring_register)
#undef __NR_syscalls -#define __NR_syscalls 428 +#define __NR_syscalls 294
/* * 32 bit systems traditionally used different
From: yangerkun yangerkun@huawei.com
hulk inclusion category: feature bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27 CVE: NA ---------------------------
Note that this io_uring backport does not support openat2. Just add the IORING_OP_OPENAT2 enum entry for userspace compatibility, so that the opcodes following it keep their upstream values even though the op itself is not supported.
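A sketch of why the placeholder matters: IORING_OP_* values are positional ABI constants, so dropping OPENAT2 would shift every later opcode down by one and break binaries built against the upstream uapi header. The expected value 28 below is taken from the upstream 5.6 header and should be treated as an assumption of this sketch.

#include <linux/io_uring.h>

/* compile-time guard that the backported numbering matches upstream */
_Static_assert(IORING_OP_OPENAT2 == 28,
               "opcode numbering must match the upstream uapi");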
Signed-off-by: yangerkun yangerkun@huawei.com Reviewed-by: zhangyi (F) yi.zhang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- include/uapi/linux/io_uring.h | 1 + 1 file changed, 1 insertion(+)
diff --git a/include/uapi/linux/io_uring.h b/include/uapi/linux/io_uring.h index 9808a8181e8d..022d299a78ac 100644 --- a/include/uapi/linux/io_uring.h +++ b/include/uapi/linux/io_uring.h @@ -125,6 +125,7 @@ enum { IORING_OP_MADVISE, IORING_OP_SEND, IORING_OP_RECV, + IORING_OP_OPENAT2, IORING_OP_EPOLL_CTL, IORING_OP_SPLICE, IORING_OP_PROVIDE_BUFFERS,
From: Pavel Begunkov asml.silence@gmail.com
mainline inclusion from mainline-5.9-rc1 commit dd9dfcdf5a603680458f5e7b0d2273c66e5417db category: feature bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27 CVE: NA ---------------------------
Always do io_commit_cqring() after completing a request, even if it was accounted as overflowed on the CQ side. Failing to do that may lead to not pushing deferred requests when needed, and so stalling the whole ring.
Signed-off-by: Pavel Begunkov asml.silence@gmail.com Signed-off-by: Jens Axboe axboe@kernel.dk Signed-off-by: yangerkun yangerkun@huawei.com Reviewed-by: zhangyi (F) yi.zhang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- fs/io_uring.c | 1 + 1 file changed, 1 insertion(+)
diff --git a/fs/io_uring.c b/fs/io_uring.c index be440687fdb9..b65669cc9850 100644 --- a/fs/io_uring.c +++ b/fs/io_uring.c @@ -7648,6 +7648,7 @@ static void io_uring_cancel_files(struct io_ring_ctx *ctx, } WRITE_ONCE(ctx->rings->cq_overflow, atomic_inc_return(&ctx->cached_cq_overflow)); + io_commit_cqring(ctx); spin_unlock_irq(&ctx->completion_lock);
/*
From: Pavel Begunkov asml.silence@gmail.com
mainline inclusion from mainline-5.9-rc1 commit 9b0d911acce00b67f7e7336f838b732de7d917d6 category: feature bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27 CVE: NA ---------------------------
After pulling nxt from a request, it's no longer a link head, so clear REQ_F_LINK_HEAD. Absence of this flag also indicates that there are no linked requests, so it replaces REQ_F_LINK_NEXT, which can be killed.
Linked timeouts also behave leaving the flag intact when necessary.
Signed-off-by: Pavel Begunkov asml.silence@gmail.com Signed-off-by: Jens Axboe axboe@kernel.dk Signed-off-by: yangerkun yangerkun@huawei.com Reviewed-by: zhangyi (F) yi.zhang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- fs/io_uring.c | 9 +-------- 1 file changed, 1 insertion(+), 8 deletions(-)
diff --git a/fs/io_uring.c b/fs/io_uring.c index b65669cc9850..36ea726a5c38 100644 --- a/fs/io_uring.c +++ b/fs/io_uring.c @@ -536,7 +536,6 @@ enum { REQ_F_BUFFER_SELECT_BIT = IOSQE_BUFFER_SELECT_BIT,
REQ_F_LINK_HEAD_BIT, - REQ_F_LINK_NEXT_BIT, REQ_F_FAIL_LINK_BIT, REQ_F_INFLIGHT_BIT, REQ_F_CUR_POS_BIT, @@ -576,8 +575,6 @@ enum {
/* head of a link */ REQ_F_LINK_HEAD = BIT(REQ_F_LINK_HEAD_BIT), - /* already grabbed next link */ - REQ_F_LINK_NEXT = BIT(REQ_F_LINK_NEXT_BIT), /* fail rest of links */ REQ_F_FAIL_LINK = BIT(REQ_F_FAIL_LINK_BIT), /* on inflight list */ @@ -1526,10 +1523,6 @@ static void io_req_link_next(struct io_kiocb *req, struct io_kiocb **nxtptr) struct io_ring_ctx *ctx = req->ctx; bool wake_ev = false;
- /* Already got next link */ - if (req->flags & REQ_F_LINK_NEXT) - return; - /* * The list should never be empty when we are called here. But could * potentially happen if the chain is messed up, check to be on the @@ -1554,7 +1547,6 @@ static void io_req_link_next(struct io_kiocb *req, struct io_kiocb **nxtptr) break; }
- req->flags |= REQ_F_LINK_NEXT; if (wake_ev) io_cqring_ev_posted(ctx); } @@ -1595,6 +1587,7 @@ static void io_req_find_next(struct io_kiocb *req, struct io_kiocb **nxt) { if (likely(!(req->flags & REQ_F_LINK_HEAD))) return; + req->flags &= ~REQ_F_LINK_HEAD;
/* * If LINK is set, we have dependent requests in this chain. If we
From: Pavel Begunkov asml.silence@gmail.com
mainline inclusion from mainline-5.9-rc1 commit 7c86ffeeed303187f266ed17bd87a9b375955709 category: feature bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27 CVE: NA ---------------------------
Linked timeout cancellation code is repeated in io_req_link_next() and io_fail_links(), and they differ in details even though they shouldn't. Based on the fact that there is at most one armed linked timeout in a link, and that it immediately follows the head, extract a function that will check for it and defuse it.
Justification: - DRY and cleaner - better inlining for io_req_link_next() (just 1 call site now) - isolates linked_timeouts from common path - reduces time under spinlock for failed links - actually less code
Signed-off-by: Pavel Begunkov asml.silence@gmail.com [axboe: fold in locking fix for io_fail_links()] Signed-off-by: Jens Axboe axboe@kernel.dk
Conflicts: fs/io_uring.c
Signed-off-by: yangerkun yangerkun@huawei.com Reviewed-by: zhangyi (F) yi.zhang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- fs/io_uring.c | 107 +++++++++++++++++++++++++++----------------------- 1 file changed, 58 insertions(+), 49 deletions(-)
diff --git a/fs/io_uring.c b/fs/io_uring.c index 36ea726a5c38..21aac43c59ce 100644 --- a/fs/io_uring.c +++ b/fs/io_uring.c @@ -1518,48 +1518,57 @@ static bool io_link_cancel_timeout(struct io_kiocb *req) return false; }
-static void io_req_link_next(struct io_kiocb *req, struct io_kiocb **nxtptr) +static void io_kill_linked_timeout(struct io_kiocb *req) { struct io_ring_ctx *ctx = req->ctx; + struct io_kiocb *link; bool wake_ev = false; + unsigned long flags = 0; /* false positive warning */ + + if (!(req->flags & REQ_F_COMP_LOCKED)) + spin_lock_irqsave(&ctx->completion_lock, flags); + + if (list_empty(&req->link_list)) + goto out; + link = list_first_entry(&req->link_list, struct io_kiocb, link_list); + if (link->opcode != IORING_OP_LINK_TIMEOUT) + goto out; + + list_del_init(&link->link_list); + wake_ev = io_link_cancel_timeout(link); + req->flags &= ~REQ_F_LINK_TIMEOUT; +out: + if (!(req->flags & REQ_F_COMP_LOCKED)) + spin_unlock_irqrestore(&ctx->completion_lock, flags); + if (wake_ev) + io_cqring_ev_posted(ctx); +} + +static void io_req_link_next(struct io_kiocb *req, struct io_kiocb **nxtptr) +{ + struct io_kiocb *nxt;
/* * The list should never be empty when we are called here. But could * potentially happen if the chain is messed up, check to be on the * safe side. */ - while (!list_empty(&req->link_list)) { - struct io_kiocb *nxt = list_first_entry(&req->link_list, - struct io_kiocb, link_list); - - if (unlikely((req->flags & REQ_F_LINK_TIMEOUT) && - (nxt->flags & REQ_F_TIMEOUT))) { - list_del_init(&nxt->link_list); - wake_ev |= io_link_cancel_timeout(nxt); - req->flags &= ~REQ_F_LINK_TIMEOUT; - continue; - } - - list_del_init(&req->link_list); - if (!list_empty(&nxt->link_list)) - nxt->flags |= REQ_F_LINK_HEAD; - *nxtptr = nxt; - break; - } + if (unlikely(list_empty(&req->link_list))) + return;
- if (wake_ev) - io_cqring_ev_posted(ctx); + nxt = list_first_entry(&req->link_list, struct io_kiocb, link_list); + list_del_init(&req->link_list); + if (!list_empty(&nxt->link_list)) + nxt->flags |= REQ_F_LINK_HEAD; + *nxtptr = nxt; }
/* * Called if REQ_F_LINK_HEAD is set, and we fail the head request */ -static void io_fail_links(struct io_kiocb *req) +static void __io_fail_links(struct io_kiocb *req) { struct io_ring_ctx *ctx = req->ctx; - unsigned long flags; - - spin_lock_irqsave(&ctx->completion_lock, flags);
while (!list_empty(&req->link_list)) { struct io_kiocb *link = list_first_entry(&req->link_list, @@ -1568,18 +1577,29 @@ static void io_fail_links(struct io_kiocb *req) list_del_init(&link->link_list); trace_io_uring_fail_link(req, link);
- if ((req->flags & REQ_F_LINK_TIMEOUT) && - link->opcode == IORING_OP_LINK_TIMEOUT) { - io_link_cancel_timeout(link); - } else { - io_cqring_fill_event(link, -ECANCELED); - __io_double_put_req(link); - } + io_cqring_fill_event(link, -ECANCELED); + __io_double_put_req(link); req->flags &= ~REQ_F_LINK_TIMEOUT; }
io_commit_cqring(ctx); - spin_unlock_irqrestore(&ctx->completion_lock, flags); + io_cqring_ev_posted(ctx); +} + +static void io_fail_links(struct io_kiocb *req) +{ + struct io_ring_ctx *ctx = req->ctx; + + if (!(req->flags & REQ_F_COMP_LOCKED)) { + unsigned long flags; + + spin_lock_irqsave(&ctx->completion_lock, flags); + __io_fail_links(req); + spin_unlock_irqrestore(&ctx->completion_lock, flags); + } else { + __io_fail_links(req); + } + io_cqring_ev_posted(ctx); }
@@ -1589,30 +1609,19 @@ static void io_req_find_next(struct io_kiocb *req, struct io_kiocb **nxt) return; req->flags &= ~REQ_F_LINK_HEAD;
+ if (req->flags & REQ_F_LINK_TIMEOUT) + io_kill_linked_timeout(req); + /* * If LINK is set, we have dependent requests in this chain. If we * didn't fail this request, queue the first one up, moving any other * dependencies to the next request. In case of failure, fail the rest * of the chain. */ - if (req->flags & REQ_F_FAIL_LINK) { + if (req->flags & REQ_F_FAIL_LINK) io_fail_links(req); - } else if ((req->flags & (REQ_F_LINK_TIMEOUT | REQ_F_COMP_LOCKED)) == - REQ_F_LINK_TIMEOUT) { - struct io_ring_ctx *ctx = req->ctx; - unsigned long flags; - - /* - * If this is a timeout link, we could be racing with the - * timeout timer. Grab the completion lock for this case to - * protect against that. - */ - spin_lock_irqsave(&ctx->completion_lock, flags); - io_req_link_next(req, nxt); - spin_unlock_irqrestore(&ctx->completion_lock, flags); - } else { + else io_req_link_next(req, nxt); - } }
static void io_free_req(struct io_kiocb *req)
From: Jens Axboe axboe@kernel.dk
mainline inclusion from mainline-5.9-rc1 commit ab0b6451db2a8ed630b89ef3826b8ea994149444 category: feature bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27 CVE: NA ---------------------------
Avoid jumping through hoops to silence unused variable warnings, and also fix sparse rightfully complaining about the locking context:
fs/io_uring.c:1593:39: warning: context imbalance in 'io_kill_linked_timeout' - unexpected unlock
Provide the functional helper as __io_kill_linked_timeout(), and keep the locking separate from it.
Signed-off-by: Jens Axboe axboe@kernel.dk Signed-off-by: yangerkun yangerkun@huawei.com Reviewed-by: zhangyi (F) yi.zhang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- fs/io_uring.c | 32 +++++++++++++++++++++----------- 1 file changed, 21 insertions(+), 11 deletions(-)
diff --git a/fs/io_uring.c b/fs/io_uring.c index 21aac43c59ce..b885f3feed03 100644 --- a/fs/io_uring.c +++ b/fs/io_uring.c @@ -1518,28 +1518,38 @@ static bool io_link_cancel_timeout(struct io_kiocb *req) return false; }
-static void io_kill_linked_timeout(struct io_kiocb *req) +static bool __io_kill_linked_timeout(struct io_kiocb *req) { - struct io_ring_ctx *ctx = req->ctx; struct io_kiocb *link; - bool wake_ev = false; - unsigned long flags = 0; /* false positive warning */ - - if (!(req->flags & REQ_F_COMP_LOCKED)) - spin_lock_irqsave(&ctx->completion_lock, flags); + bool wake_ev;
if (list_empty(&req->link_list)) - goto out; + return false; link = list_first_entry(&req->link_list, struct io_kiocb, link_list); if (link->opcode != IORING_OP_LINK_TIMEOUT) - goto out; + return false;
list_del_init(&link->link_list); wake_ev = io_link_cancel_timeout(link); req->flags &= ~REQ_F_LINK_TIMEOUT; -out: - if (!(req->flags & REQ_F_COMP_LOCKED)) + return wake_ev; +} + +static void io_kill_linked_timeout(struct io_kiocb *req) +{ + struct io_ring_ctx *ctx = req->ctx; + bool wake_ev; + + if (!(req->flags & REQ_F_COMP_LOCKED)) { + unsigned long flags; + + spin_lock_irqsave(&ctx->completion_lock, flags); + wake_ev = __io_kill_linked_timeout(req); spin_unlock_irqrestore(&ctx->completion_lock, flags); + } else { + wake_ev = __io_kill_linked_timeout(req); + } + if (wake_ev) io_cqring_ev_posted(ctx); }
From: Jens Axboe axboe@kernel.dk
mainline inclusion from mainline-5.9-rc1 commit 9b7adba9eaec28e0e4343c96d0dbeb9578802f5f category: feature bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27 CVE: NA ---------------------------
When we traverse into failing links or timeouts, we need to ensure we propagate the REQ_F_COMP_LOCKED flag to ensure that we correctly signal to the completion side that we already hold the completion lock.
Signed-off-by: Jens Axboe axboe@kernel.dk Signed-off-by: yangerkun yangerkun@huawei.com Reviewed-by: zhangyi (F) yi.zhang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- fs/io_uring.c | 3 +++ 1 file changed, 3 insertions(+)
diff --git a/fs/io_uring.c b/fs/io_uring.c index b885f3feed03..5e6b601ed398 100644 --- a/fs/io_uring.c +++ b/fs/io_uring.c @@ -1530,6 +1530,7 @@ static bool __io_kill_linked_timeout(struct io_kiocb *req) return false;
list_del_init(&link->link_list); + link->flags |= REQ_F_COMP_LOCKED; wake_ev = io_link_cancel_timeout(link); req->flags &= ~REQ_F_LINK_TIMEOUT; return wake_ev; @@ -1588,6 +1589,7 @@ static void __io_fail_links(struct io_kiocb *req) trace_io_uring_fail_link(req, link);
io_cqring_fill_event(link, -ECANCELED); + link->flags |= REQ_F_COMP_LOCKED; __io_double_put_req(link); req->flags &= ~REQ_F_LINK_TIMEOUT; } @@ -4834,6 +4836,7 @@ static int io_timeout_cancel(struct io_ring_ctx *ctx, __u64 user_data) return -EALREADY;
req_set_fail_links(req); + req->flags |= REQ_F_COMP_LOCKED; io_cqring_fill_event(req, -ECANCELED); io_put_req(req); return 0;
From: Jens Axboe axboe@kernel.dk
mainline inclusion from mainline-5.9-rc1 commit e1e16097e265daac918ce355bf1a0d1677adf0c7 category: feature bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27 CVE: NA ---------------------------
We have lots of callers of:
io_cqring_add_event(req, result);
io_put_req(req);
Provide a helper that does this for us. It helps clean up the code, and also provides a more convenient location for us to change the completion handling.
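In sketch form (stand-in names; the real io_req_complete() variants are in the diff below):

    struct req { int refs; };

    static void post_cqe(struct req *req, long res) { (void)req; (void)res; }
    static void put_ref(struct req *req) { req->refs--; }

    /* one helper owns the post+put pairing, so a future change to
     * completion handling touches a single spot */
    static void req_complete(struct req *req, long res)
    {
        post_cqe(req, res);
        put_ref(req);
    }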
Signed-off-by: Jens Axboe axboe@kernel.dk
Conflicts: fs/io_uring.c
Signed-off-by: yangerkun yangerkun@huawei.com Reviewed-by: zhangyi (F) yi.zhang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- fs/io_uring.c | 103 ++++++++++++++++++++------------------------------ 1 file changed, 42 insertions(+), 61 deletions(-)
diff --git a/fs/io_uring.c b/fs/io_uring.c index 5e6b601ed398..0bde05e735a2 100644 --- a/fs/io_uring.c +++ b/fs/io_uring.c @@ -1338,7 +1338,7 @@ static void io_cqring_fill_event(struct io_kiocb *req, long res) __io_cqring_fill_event(req, res, 0); }
-static void __io_cqring_add_event(struct io_kiocb *req, long res, long cflags) +static void io_cqring_add_event(struct io_kiocb *req, long res, long cflags) { struct io_ring_ctx *ctx = req->ctx; unsigned long flags; @@ -1351,9 +1351,15 @@ static void __io_cqring_add_event(struct io_kiocb *req, long res, long cflags) io_cqring_ev_posted(ctx); }
-static void io_cqring_add_event(struct io_kiocb *req, long res) +static void __io_req_complete(struct io_kiocb *req, long res, unsigned cflags) { - __io_cqring_add_event(req, res, 0); + io_cqring_add_event(req, res, cflags); + io_put_req(req); +} + +static void io_req_complete(struct io_kiocb *req, long res) +{ + __io_req_complete(req, res, 0); }
static inline bool io_is_fallback_req(struct io_kiocb *req) @@ -2006,7 +2012,7 @@ static void io_complete_rw_common(struct kiocb *kiocb, long res) req_set_fail_links(req); if (req->flags & REQ_F_BUFFER_SELECTED) cflags = io_put_kbuf(req); - __io_cqring_add_event(req, res, cflags); + io_cqring_add_event(req, res, cflags); }
static void io_complete_rw(struct kiocb *kiocb, long res, long res2) @@ -2900,10 +2906,9 @@ static int io_tee(struct io_kiocb *req, bool force_nonblock) io_put_file(req, in, (sp->flags & SPLICE_F_FD_IN_FIXED)); req->flags &= ~REQ_F_NEED_CLEANUP;
- io_cqring_add_event(req, ret); if (ret != sp->len) req_set_fail_links(req); - io_put_req(req); + io_req_complete(req, ret); return 0; }
@@ -2937,10 +2942,9 @@ static int io_splice(struct io_kiocb *req, bool force_nonblock) io_put_file(req, in, (sp->flags & SPLICE_F_FD_IN_FIXED)); req->flags &= ~REQ_F_NEED_CLEANUP;
- io_cqring_add_event(req, ret); if (ret != sp->len) req_set_fail_links(req); - io_put_req(req); + io_req_complete(req, ret); return 0; }
@@ -2954,8 +2958,7 @@ static int io_nop(struct io_kiocb *req) if (unlikely(ctx->flags & IORING_SETUP_IOPOLL)) return -EINVAL;
- io_cqring_add_event(req, 0); - io_put_req(req); + io_req_complete(req, 0); return 0; }
@@ -2994,8 +2997,7 @@ static int io_fsync(struct io_kiocb *req, bool force_nonblock) req->sync.flags & IORING_FSYNC_DATASYNC); if (ret < 0) req_set_fail_links(req); - io_cqring_add_event(req, ret); - io_put_req(req); + io_req_complete(req, ret); return 0; }
@@ -3028,8 +3030,7 @@ static int io_fallocate(struct io_kiocb *req, bool force_nonblock) current->signal->rlim[RLIMIT_FSIZE].rlim_cur = RLIM_INFINITY; if (ret < 0) req_set_fail_links(req); - io_cqring_add_event(req, ret); - io_put_req(req); + io_req_complete(req, ret); return 0; }
@@ -3096,8 +3097,7 @@ static int io_openat(struct io_kiocb *req, bool force_nonblock) req->flags &= ~REQ_F_NEED_CLEANUP; if (ret < 0) req_set_fail_links(req); - io_cqring_add_event(req, ret); - io_put_req(req); + io_req_complete(req, ret); return 0; }
@@ -3165,8 +3165,7 @@ static int io_remove_buffers(struct io_kiocb *req, bool force_nonblock) io_ring_submit_lock(ctx, !force_nonblock); if (ret < 0) req_set_fail_links(req); - io_cqring_add_event(req, ret); - io_put_req(req); + io_req_complete(req, ret); return 0; }
@@ -3255,8 +3254,7 @@ static int io_provide_buffers(struct io_kiocb *req, bool force_nonblock) io_ring_submit_unlock(ctx, !force_nonblock); if (ret < 0) req_set_fail_links(req); - io_cqring_add_event(req, ret); - io_put_req(req); + io_req_complete(req, ret); return 0; }
@@ -3299,8 +3297,7 @@ static int io_epoll_ctl(struct io_kiocb *req, bool force_nonblock)
if (ret < 0) req_set_fail_links(req); - io_cqring_add_event(req, ret); - io_put_req(req); + io_req_complete(req, ret); return 0; #else return -EOPNOTSUPP; @@ -3336,8 +3333,7 @@ static int io_madvise(struct io_kiocb *req, bool force_nonblock) ret = do_madvise(ma->addr, ma->len, ma->advice); if (ret < 0) req_set_fail_links(req); - io_cqring_add_event(req, ret); - io_put_req(req); + io_req_complete(req, ret); return 0; #else return -EOPNOTSUPP; @@ -3376,8 +3372,7 @@ static int io_fadvise(struct io_kiocb *req, bool force_nonblock) ret = vfs_fadvise(req->file, fa->offset, fa->len, fa->advice); if (ret < 0) req_set_fail_links(req); - io_cqring_add_event(req, ret); - io_put_req(req); + io_req_complete(req, ret); return 0; }
@@ -3416,8 +3411,7 @@ static int io_statx(struct io_kiocb *req, bool force_nonblock)
if (ret < 0) req_set_fail_links(req); - io_cqring_add_event(req, ret); - io_put_req(req); + io_req_complete(req, ret); return 0; }
@@ -3472,10 +3466,9 @@ static int io_close(struct io_kiocb *req, bool force_nonblock) ret = filp_close(close->put_file, req->work.files); if (ret < 0) req_set_fail_links(req); - io_cqring_add_event(req, ret); fput(close->put_file); close->put_file = NULL; - io_put_req(req); + io_req_complete(req, ret); return 0; }
@@ -3509,8 +3502,7 @@ static int io_sync_file_range(struct io_kiocb *req, bool force_nonblock) req->sync.flags); if (ret < 0) req_set_fail_links(req); - io_cqring_add_event(req, ret); - io_put_req(req); + io_req_complete(req, ret); return 0; }
@@ -3610,10 +3602,9 @@ static int io_sendmsg(struct io_kiocb *req, bool force_nonblock) if (kmsg && kmsg->iov != kmsg->fast_iov) kfree(kmsg->iov); req->flags &= ~REQ_F_NEED_CLEANUP; - io_cqring_add_event(req, ret); if (ret < 0) req_set_fail_links(req); - io_put_req(req); + io_req_complete(req, ret); return 0; }
@@ -3653,10 +3644,9 @@ static int io_send(struct io_kiocb *req, bool force_nonblock) ret = -EINTR; }
- io_cqring_add_event(req, ret); if (ret < 0) req_set_fail_links(req); - io_put_req(req); + io_req_complete(req, ret); return 0; }
@@ -3861,10 +3851,9 @@ static int io_recvmsg(struct io_kiocb *req, bool force_nonblock) if (kmsg && kmsg->iov != kmsg->fast_iov) kfree(kmsg->iov); req->flags &= ~REQ_F_NEED_CLEANUP; - __io_cqring_add_event(req, ret, cflags); if (ret < 0) req_set_fail_links(req); - io_put_req(req); + __io_req_complete(req, ret, cflags); return 0; }
@@ -3918,10 +3907,9 @@ static int io_recv(struct io_kiocb *req, bool force_nonblock)
kfree(kbuf); req->flags &= ~REQ_F_NEED_CLEANUP; - __io_cqring_add_event(req, ret, cflags); if (ret < 0) req_set_fail_links(req); - io_put_req(req); + __io_req_complete(req, ret, cflags); return 0; }
@@ -3960,8 +3948,7 @@ static int io_accept(struct io_kiocb *req, bool force_nonblock) ret = -EINTR; req_set_fail_links(req); } - io_cqring_add_event(req, ret); - io_put_req(req); + io_req_complete(req, ret); return 0; }
@@ -4021,8 +4008,7 @@ static int io_connect(struct io_kiocb *req, bool force_nonblock) out: if (ret < 0) req_set_fail_links(req); - io_cqring_add_event(req, ret); - io_put_req(req); + io_req_complete(req, ret); return 0; } #else /* !CONFIG_NET */ @@ -4460,7 +4446,7 @@ static void io_async_task_func(struct callback_head *cb) if (!canceled) { __set_current_state(TASK_RUNNING); if (__io_sq_thread_acquire_mm(ctx)) { - io_cqring_add_event(req, -EFAULT); + io_cqring_add_event(req, -EFAULT, 0); goto end_req; } __io_sq_thread_acquire_files(ctx); @@ -4709,10 +4695,9 @@ static int io_poll_remove(struct io_kiocb *req) ret = io_poll_cancel(ctx, addr); spin_unlock_irq(&ctx->completion_lock);
- io_cqring_add_event(req, ret); if (ret < 0) req_set_fail_links(req); - io_put_req(req); + io_req_complete(req, ret); return 0; }
@@ -5078,8 +5063,7 @@ static int io_files_update(struct io_kiocb *req, bool force_nonblock)
if (ret < 0) req_set_fail_links(req); - io_cqring_add_event(req, ret); - io_put_req(req); + io_req_complete(req, ret); return 0; }
@@ -5558,8 +5542,7 @@ static struct io_wq_work *io_wq_submit_work(struct io_wq_work *work)
if (ret) { req_set_fail_links(req); - io_cqring_add_event(req, ret); - io_put_req(req); + io_req_complete(req, ret); }
return io_steal_work(req); @@ -5670,8 +5653,7 @@ static enum hrtimer_restart io_link_timeout_fn(struct hrtimer *timer) io_async_find_and_cancel(ctx, req, prev->user_data, -ETIME); io_put_req(prev); } else { - io_cqring_add_event(req, -ETIME); - io_put_req(req); + io_req_complete(req, -ETIME); } return HRTIMER_NORESTART; } @@ -5781,9 +5763,8 @@ static void __io_queue_sqe(struct io_kiocb *req, const struct io_uring_sqe *sqe)
/* and drop final reference, if we failed */ if (ret) { - io_cqring_add_event(req, ret); req_set_fail_links(req); - io_put_req(req); + io_req_complete(req, ret); } if (nxt) { req = nxt; @@ -5805,9 +5786,9 @@ static void io_queue_sqe(struct io_kiocb *req, const struct io_uring_sqe *sqe) if (ret) { if (ret != -EIOCBQUEUED) { fail_req: - io_cqring_add_event(req, ret); req_set_fail_links(req); - io_double_put_req(req); + io_put_req(req); + io_req_complete(req, ret); } } else if (req->flags & REQ_F_FORCE_ASYNC) { if (!req->io) { @@ -5834,8 +5815,8 @@ static void io_queue_sqe(struct io_kiocb *req, const struct io_uring_sqe *sqe) static inline void io_queue_link_head(struct io_kiocb *req) { if (unlikely(req->flags & REQ_F_FAIL_LINK)) { - io_cqring_add_event(req, -ECANCELED); - io_double_put_req(req); + io_put_req(req); + io_req_complete(req, -ECANCELED); } else io_queue_sqe(req, NULL); } @@ -6092,8 +6073,8 @@ static int io_submit_sqes(struct io_ring_ctx *ctx, unsigned int nr,
if (unlikely(err)) { fail_req: - io_cqring_add_event(req, err); - io_double_put_req(req); + io_put_req(req); + io_req_complete(req, err); break; }
From: Pavel Begunkov asml.silence@gmail.com
mainline inclusion from mainline-5.9-rc1 commit 0be0b0e33b0bfd08264b108512e44b3907fe987b category: feature bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27 CVE: NA ---------------------------
Greatly simplify io_async_task_func(), removing the duplicated functionality of __io_req_task_submit(). This does one extra spin lock/unlock for the cancelled poll case, but that shouldn't happen often.
Signed-off-by: Pavel Begunkov asml.silence@gmail.com Signed-off-by: Jens Axboe axboe@kernel.dk
Conflicts: fs/io_uring.c [807abcb08834 ("io_uring: ensure double poll additions work with both request types") and 28cea78af449 ("io_uring: allow non-fixed files with SQPOLL") include first]
Signed-off-by: yangerkun yangerkun@huawei.com Reviewed-by: zhangyi (F) yi.zhang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- fs/io_uring.c | 31 ++++++------------------------- 1 file changed, 6 insertions(+), 25 deletions(-)
diff --git a/fs/io_uring.c b/fs/io_uring.c index af1a4dc6c9c8..14dcba1da6be 100644 --- a/fs/io_uring.c +++ b/fs/io_uring.c @@ -1767,6 +1767,7 @@ static void __io_req_task_submit(struct io_kiocb *req)
__set_current_state(TASK_RUNNING); if (!__io_sq_thread_acquire_mm(ctx)) { + __io_sq_thread_acquire_files(ctx); mutex_lock(&ctx->uring_lock); __io_queue_sqe(req, NULL, NULL); mutex_unlock(&ctx->uring_lock); @@ -4517,7 +4518,6 @@ static void io_async_task_func(struct callback_head *cb) struct io_kiocb *req = container_of(cb, struct io_kiocb, task_work); struct async_poll *apoll = req->apoll; struct io_ring_ctx *ctx = req->ctx; - bool canceled = false;
trace_io_uring_task_run(req->ctx, req->opcode, req->user_data);
@@ -4528,15 +4528,8 @@ static void io_async_task_func(struct callback_head *cb) }
/* If req is still hashed, it cannot have been canceled. Don't check. */ - if (hash_hashed(&req->hash_node)) { + if (hash_hashed(&req->hash_node)) hash_del(&req->hash_node); - } else { - canceled = READ_ONCE(apoll->poll.canceled); - if (canceled) { - io_cqring_fill_event(req, -ECANCELED); - io_commit_cqring(ctx); - } - }
io_poll_remove_double(req); spin_unlock_irq(&ctx->completion_lock); @@ -4548,22 +4541,10 @@ static void io_async_task_func(struct callback_head *cb) kfree(apoll->double_poll); kfree(apoll);
- if (!canceled) { - __set_current_state(TASK_RUNNING); - if (__io_sq_thread_acquire_mm(ctx)) { - io_cqring_add_event(req, -EFAULT, 0); - goto end_req; - } - __io_sq_thread_acquire_files(ctx); - mutex_lock(&ctx->uring_lock); - __io_queue_sqe(req, NULL, NULL); - mutex_unlock(&ctx->uring_lock); - } else { - io_cqring_ev_posted(ctx); -end_req: - req_set_fail_links(req); - io_double_put_req(req); - } + if (!READ_ONCE(apoll->poll.canceled)) + __io_req_task_submit(req); + else + __io_req_task_cancel(req, -ECANCELED); }
static int io_async_wake(struct wait_queue_entry *wait, unsigned mode, int sync,
From: Pavel Begunkov asml.silence@gmail.com
mainline inclusion from mainline-5.9-rc1 commit 1bcb8c5d65a845e0ecb9e82237c399b29b8d15ea category: feature bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27 CVE: NA ---------------------------
io_steal_work() can't be sure that @nxt has req->work properly set, so we can't pass it to io-wq as is.
A quick and dirty fix -- drag it through io_req_task_queue(), and always return NULL from io_steal_work().
e.g.
[ 50.770161] BUG: kernel NULL pointer dereference, address: 00000000
[ 50.770164] #PF: supervisor write access in kernel mode
[ 50.770164] #PF: error_code(0x0002) - not-present page
[ 50.770168] CPU: 1 PID: 1448 Comm: io_wqe_worker-0 Tainted: G I 5.8.0-rc2-00035-g2237d76530eb-dirty #494
[ 50.770172] RIP: 0010:override_creds+0x19/0x30
...
[ 50.770183] io_worker_handle_work+0x25c/0x430
[ 50.770185] io_wqe_worker+0x2a0/0x350
[ 50.770190] kthread+0x136/0x180
[ 50.770194] ret_from_fork+0x22/0x30
Signed-off-by: Pavel Begunkov asml.silence@gmail.com Signed-off-by: Jens Axboe axboe@kernel.dk Signed-off-by: yangerkun yangerkun@huawei.com Reviewed-by: zhangyi (F) yi.zhang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- fs/io_uring.c | 15 ++++++++++----- 1 file changed, 10 insertions(+), 5 deletions(-)
diff --git a/fs/io_uring.c b/fs/io_uring.c index e38a48c71a72..2535ac88d97a 100644 --- a/fs/io_uring.c +++ b/fs/io_uring.c @@ -1874,7 +1874,7 @@ static void io_put_req(struct io_kiocb *req)
static struct io_wq_work *io_steal_work(struct io_kiocb *req) { - struct io_kiocb *link, *nxt = NULL; + struct io_kiocb *nxt = NULL;
/* * A ref is owned by io-wq in which context we're. So, if that's the @@ -1891,10 +1891,15 @@ static struct io_wq_work *io_steal_work(struct io_kiocb *req) if ((nxt->flags & REQ_F_ISREG) && io_op_defs[nxt->opcode].hash_reg_file) io_wq_hash_work(&nxt->work, file_inode(nxt->file));
- link = io_prep_linked_timeout(nxt); - if (link) - nxt->flags |= REQ_F_QUEUE_TIMEOUT; - return &nxt->work; + io_req_task_queue(nxt); + /* + * If we're going to return actual work, here should be timeout prep: + * + * link = io_prep_linked_timeout(nxt); + * if (link) + * nxt->flags |= REQ_F_QUEUE_TIMEOUT; + */ + return NULL; }
/*
From: Pavel Begunkov asml.silence@gmail.com
mainline inclusion from mainline-5.9-rc1 commit 710c2bfb66474a186b0196e3342d43db0e6c04e1 category: feature bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27 CVE: NA ---------------------------
We won't have a valid ring_fd or ring_file in task work, so grab the files early.
Signed-off-by: Pavel Begunkov asml.silence@gmail.com Signed-off-by: Jens Axboe axboe@kernel.dk Signed-off-by: yangerkun yangerkun@huawei.com Reviewed-by: zhangyi (F) yi.zhang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- fs/io_uring.c | 14 +++++++------- 1 file changed, 7 insertions(+), 7 deletions(-)
diff --git a/fs/io_uring.c b/fs/io_uring.c index c7cda1284fe0..c7cd6a2539f1 100644 --- a/fs/io_uring.c +++ b/fs/io_uring.c @@ -5163,15 +5163,15 @@ static int io_req_defer_prep(struct io_kiocb *req, if (!sqe) return 0;
- if (for_async || (req->flags & REQ_F_WORK_INITIALIZED)) { + if (io_op_defs[req->opcode].file_table) { io_req_init_async(req); + ret = io_grab_files(req); + if (unlikely(ret)) + return ret; + }
- if (io_op_defs[req->opcode].file_table) { - ret = io_grab_files(req); - if (unlikely(ret)) - return ret; - } - + if (for_async || (req->flags & REQ_F_WORK_INITIALIZED)) { + io_req_init_async(req); io_req_work_grab_env(req, &io_op_defs[req->opcode]); }
From: Pavel Begunkov asml.silence@gmail.com
mainline inclusion from mainline-5.9-rc1 commit 9b5f7bd93272689ec8dc2cfd40a812265c23414e category: feature bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27 CVE: NA ---------------------------
Generally, it's better to return a value directly than to use an out parameter. It's cleaner and protects against some kinds of ugly bugs. It may also be faster.
Return the next request from io_req_find_next() and friends directly instead of passing an out parameter.
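A minimal sketch of the two styles (illustrative types, not the io_uring structures):

    #include <stddef.h>

    struct req { struct req *next; };

    /* out-parameter style: the caller must pre-initialise *nxtptr
     * and remember to read it back afterwards */
    static void find_next_out(struct req *req, struct req **nxtptr)
    {
        if (req->next)
            *nxtptr = req->next;
    }

    /* return-value style: nothing to pre-initialise, and the result
     * composes directly with the next call */
    static struct req *find_next(struct req *req)
    {
        return req->next;   /* may be NULL */
    }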
Signed-off-by: Pavel Begunkov asml.silence@gmail.com Signed-off-by: Jens Axboe axboe@kernel.dk
Conflicts: fs/io_uring.c [4ae6dbd683860 ("io_uring: fix lockup in io_fail_links()") and 6a0af224c213 ("io_uring: don't put a poll req under spinlock") appear to be the same; we have already adopted 4ae6dbd683860, so there is no need to include 6a0af224c213.]
Signed-off-by: yangerkun yangerkun@huawei.com Reviewed-by: zhangyi (F) yi.zhang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- fs/io_uring.c | 37 ++++++++++++++++++------------------- 1 file changed, 18 insertions(+), 19 deletions(-)
diff --git a/fs/io_uring.c b/fs/io_uring.c index c7cd6a2539f1..289423385746 100644 --- a/fs/io_uring.c +++ b/fs/io_uring.c @@ -1659,7 +1659,7 @@ static void io_kill_linked_timeout(struct io_kiocb *req) io_cqring_ev_posted(ctx); }
-static void io_req_link_next(struct io_kiocb *req, struct io_kiocb **nxtptr) +static struct io_kiocb *io_req_link_next(struct io_kiocb *req) { struct io_kiocb *nxt;
@@ -1669,13 +1669,13 @@ static void io_req_link_next(struct io_kiocb *req, struct io_kiocb **nxtptr) * safe side. */ if (unlikely(list_empty(&req->link_list))) - return; + return NULL;
nxt = list_first_entry(&req->link_list, struct io_kiocb, link_list); list_del_init(&req->link_list); if (!list_empty(&nxt->link_list)) nxt->flags |= REQ_F_LINK_HEAD; - *nxtptr = nxt; + return nxt; }
/* @@ -1719,10 +1719,10 @@ static void io_fail_links(struct io_kiocb *req) io_cqring_ev_posted(ctx); }
-static void io_req_find_next(struct io_kiocb *req, struct io_kiocb **nxt) +static struct io_kiocb *io_req_find_next(struct io_kiocb *req) { if (likely(!(req->flags & REQ_F_LINK_HEAD))) - return; + return NULL; req->flags &= ~REQ_F_LINK_HEAD;
if (req->flags & REQ_F_LINK_TIMEOUT) @@ -1734,10 +1734,10 @@ static void io_req_find_next(struct io_kiocb *req, struct io_kiocb **nxt) * dependencies to the next request. In case of failure, fail the rest * of the chain. */ - if (req->flags & REQ_F_FAIL_LINK) - io_fail_links(req); - else - io_req_link_next(req, nxt); + if (likely(!(req->flags & REQ_F_FAIL_LINK))) + return io_req_link_next(req); + io_fail_links(req); + return NULL; }
static void __io_req_task_cancel(struct io_kiocb *req, int error) @@ -1804,9 +1804,7 @@ static void io_req_task_queue(struct io_kiocb *req)
static void io_queue_next(struct io_kiocb *req) { - struct io_kiocb *nxt = NULL; - - io_req_find_next(req, &nxt); + struct io_kiocb *nxt = io_req_find_next(req);
if (nxt) io_req_task_queue(nxt); @@ -1857,13 +1855,15 @@ static void io_req_free_batch(struct req_batch *rb, struct io_kiocb *req) * Drop reference to request, return next in chain (if there is one) if this * was the last reference to this request. */ -__attribute__((nonnull)) -static void io_put_req_find_next(struct io_kiocb *req, struct io_kiocb **nxtptr) +static struct io_kiocb *io_put_req_find_next(struct io_kiocb *req) { + struct io_kiocb *nxt = NULL; + if (refcount_dec_and_test(&req->refs)) { - io_req_find_next(req, nxtptr); + nxt = io_req_find_next(req); __io_free_req(req); } + return nxt; }
static void io_put_req(struct io_kiocb *req) @@ -1884,7 +1884,7 @@ static struct io_wq_work *io_steal_work(struct io_kiocb *req) if (refcount_read(&req->refs) != 1) return NULL;
- io_req_find_next(req, &nxt); + nxt = io_req_find_next(req); if (!nxt) return NULL;
@@ -4403,7 +4403,7 @@ static void io_poll_task_handler(struct io_kiocb *req, struct io_kiocb **nxt) io_poll_complete(req, req->result, 0); spin_unlock_irq(&ctx->completion_lock);
- io_put_req_find_next(req, nxt); + *nxt = io_put_req_find_next(req); io_cqring_ev_posted(ctx); }
@@ -5841,9 +5841,8 @@ static void __io_queue_sqe(struct io_kiocb *req, const struct io_uring_sqe *sqe, }
err: - nxt = NULL; /* drop submission reference */ - io_put_req_find_next(req, &nxt); + nxt = io_put_req_find_next(req);
if (linked_timeout) { if (!ret)
From: Pavel Begunkov asml.silence@gmail.com
mainline inclusion from mainline-5.9-rc1 commit a1a4661691c5f1a3af4c04f56ad68e2d1dbee3af category: feature bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27 CVE: NA ---------------------------
REQ_F_TIMEOUT is now set but never used; kill it.
Signed-off-by: Pavel Begunkov asml.silence@gmail.com Signed-off-by: Jens Axboe axboe@kernel.dk Signed-off-by: yangerkun yangerkun@huawei.com Reviewed-by: zhangyi (F) yi.zhang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- fs/io_uring.c | 4 ---- 1 file changed, 4 deletions(-)
diff --git a/fs/io_uring.c b/fs/io_uring.c index 289423385746..c744c45088cd 100644 --- a/fs/io_uring.c +++ b/fs/io_uring.c @@ -541,7 +541,6 @@ enum { REQ_F_CUR_POS_BIT, REQ_F_NOWAIT_BIT, REQ_F_LINK_TIMEOUT_BIT, - REQ_F_TIMEOUT_BIT, REQ_F_ISREG_BIT, REQ_F_MUST_PUNT_BIT, REQ_F_TIMEOUT_NOSEQ_BIT, @@ -585,8 +584,6 @@ enum { REQ_F_NOWAIT = BIT(REQ_F_NOWAIT_BIT), /* has linked timeout */ REQ_F_LINK_TIMEOUT = BIT(REQ_F_LINK_TIMEOUT_BIT), - /* timeout request */ - REQ_F_TIMEOUT = BIT(REQ_F_TIMEOUT_BIT), /* regular file */ REQ_F_ISREG = BIT(REQ_F_ISREG_BIT), /* must be punted even for NONBLOCK */ @@ -4977,7 +4974,6 @@ static int io_timeout_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe,
data = &req->io->timeout; data->req = req; - req->flags |= REQ_F_TIMEOUT;
if (get_timespec64(&data->ts, u64_to_user_ptr(sqe->addr))) return -EFAULT;
From: Pavel Begunkov asml.silence@gmail.com
mainline inclusion from mainline-5.9-rc1 commit 8eb7e2d00763367f345ef0b2a2eb4f8001ae40ce category: feature bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27 CVE: NA ---------------------------
There are too many useless flags; kill REQ_F_TIMEOUT_NOSEQ, which can easily be inferred from req.timeout itself.
Signed-off-by: Pavel Begunkov asml.silence@gmail.com Signed-off-by: Jens Axboe axboe@kernel.dk Signed-off-by: yangerkun yangerkun@huawei.com Reviewed-by: zhangyi (F) yi.zhang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- fs/io_uring.c | 15 ++++++++------- 1 file changed, 8 insertions(+), 7 deletions(-)
diff --git a/fs/io_uring.c b/fs/io_uring.c index c744c45088cd..254343e64aba 100644 --- a/fs/io_uring.c +++ b/fs/io_uring.c @@ -543,7 +543,6 @@ enum { REQ_F_LINK_TIMEOUT_BIT, REQ_F_ISREG_BIT, REQ_F_MUST_PUNT_BIT, - REQ_F_TIMEOUT_NOSEQ_BIT, REQ_F_COMP_LOCKED_BIT, REQ_F_NEED_CLEANUP_BIT, REQ_F_OVERFLOW_BIT, @@ -588,8 +587,6 @@ enum { REQ_F_ISREG = BIT(REQ_F_ISREG_BIT), /* must be punted even for NONBLOCK */ REQ_F_MUST_PUNT = BIT(REQ_F_MUST_PUNT_BIT), - /* no timeout sequence */ - REQ_F_TIMEOUT_NOSEQ = BIT(REQ_F_TIMEOUT_NOSEQ_BIT), /* completion under lock */ REQ_F_COMP_LOCKED = BIT(REQ_F_COMP_LOCKED_BIT), /* needs cleanup */ @@ -1072,6 +1069,11 @@ static void io_ring_ctx_ref_free(struct percpu_ref *ref) complete(&ctx->ref_comp); }
+static inline bool io_is_timeout_noseq(struct io_kiocb *req) +{ + return !req->timeout.off; +} + static struct io_ring_ctx *io_ring_ctx_alloc(struct io_uring_params *p) { struct io_ring_ctx *ctx; @@ -1284,7 +1286,7 @@ static void io_flush_timeouts(struct io_ring_ctx *ctx) struct io_kiocb *req = list_first_entry(&ctx->timeout_list, struct io_kiocb, list);
- if (req->flags & REQ_F_TIMEOUT_NOSEQ) + if (io_is_timeout_noseq(req)) break; if (req->timeout.target_seq != ctx->cached_cq_tail - atomic_read(&ctx->cq_timeouts)) @@ -5001,8 +5003,7 @@ static int io_timeout(struct io_kiocb *req) * timeout event to be satisfied. If it isn't set, then this is * a pure timeout request, sequence isn't used. */ - if (!off) { - req->flags |= REQ_F_TIMEOUT_NOSEQ; + if (io_is_timeout_noseq(req)) { entry = ctx->timeout_list.prev; goto add; } @@ -5017,7 +5018,7 @@ static int io_timeout(struct io_kiocb *req) list_for_each_prev(entry, &ctx->timeout_list) { struct io_kiocb *nxt = list_entry(entry, struct io_kiocb, list);
- if (nxt->flags & REQ_F_TIMEOUT_NOSEQ) + if (io_is_timeout_noseq(nxt)) continue; /* nxt.seq is behind @tail, otherwise would've been completed */ if (off >= nxt->timeout.target_seq - tail)
From: Pavel Begunkov asml.silence@gmail.com
mainline inclusion from mainline-5.9-rc1 commit 3fa5e0f331280237af918ab2e7a160f5a68d3e7d category: feature bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27 CVE: NA ---------------------------
gcc 9.2.0 compiles io_req_find_next() as a separate function, leaving the initial REQ_F_LINK_HEAD fast check not inlined. Help it by splitting the check out of the function.
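The pattern, sketched with stand-in names:

    #include <stddef.h>

    #define F_LINK_HEAD 1u

    struct req { unsigned flags; struct req *link; };

    /* slow path: deliberately out of line */
    static struct req *__find_next(struct req *req)
    {
        return req->link;   /* the rare link-walking work lives here */
    }

    /* fast path: small enough that the common no-link check is
     * inlined into every caller even if the compiler refuses to
     * inline the full function */
    static inline struct req *find_next(struct req *req)
    {
        if (!(req->flags & F_LINK_HEAD))
            return NULL;
        return __find_next(req);
    }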
Signed-off-by: Pavel Begunkov asml.silence@gmail.com Signed-off-by: Jens Axboe axboe@kernel.dk Signed-off-by: yangerkun yangerkun@huawei.com Reviewed-by: zhangyi (F) yi.zhang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- fs/io_uring.c | 12 ++++++++---- 1 file changed, 8 insertions(+), 4 deletions(-)
diff --git a/fs/io_uring.c b/fs/io_uring.c index 254343e64aba..fed53e47d6da 100644 --- a/fs/io_uring.c +++ b/fs/io_uring.c @@ -1718,12 +1718,9 @@ static void io_fail_links(struct io_kiocb *req) io_cqring_ev_posted(ctx); }
-static struct io_kiocb *io_req_find_next(struct io_kiocb *req) +static struct io_kiocb *__io_req_find_next(struct io_kiocb *req) { - if (likely(!(req->flags & REQ_F_LINK_HEAD))) - return NULL; req->flags &= ~REQ_F_LINK_HEAD; - if (req->flags & REQ_F_LINK_TIMEOUT) io_kill_linked_timeout(req);
@@ -1739,6 +1736,13 @@ static struct io_kiocb *io_req_find_next(struct io_kiocb *req) return NULL; }
+static struct io_kiocb *io_req_find_next(struct io_kiocb *req) +{ + if (likely(!(req->flags & REQ_F_LINK_HEAD))) + return NULL; + return __io_req_find_next(req); +} + static void __io_req_task_cancel(struct io_kiocb *req, int error) { struct io_ring_ctx *ctx = req->ctx;
From: Pavel Begunkov asml.silence@gmail.com
mainline inclusion from mainline-5.9-rc1 commit 62ef73165091476d31f31e33d9d0d48b088c129d category: feature bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27 CVE: NA ---------------------------
io_{read,write}() {
        ...
copy_iov: // prep async
        if (!(flags & REQ_F_NOWAIT) && !file_can_poll(file))
                flags |= REQ_F_MUST_PUNT;
}
REQ_F_MUST_PUNT there is pointless, because if it happens then REQ_F_NOWAIT is known to be _not_ set, and the request will take the async path in __io_queue_sqe() anyway. The file_can_poll() check is also repeated in arm_poll*(), so it isn't needed here.
Remove the mentioned REQ_F_MUST_PUNT assignment in preparation for killing the flag.
Signed-off-by: Pavel Begunkov asml.silence@gmail.com Signed-off-by: Jens Axboe axboe@kernel.dk
Conflicts: fs/io_uring.c [bcf5a06304d6 ("io_uring: support true async buffered reads, if file provides it") not merge]
Signed-off-by: yangerkun yangerkun@huawei.com Reviewed-by: zhangyi (F) yi.zhang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- fs/io_uring.c | 8 -------- 1 file changed, 8 deletions(-)
diff --git a/fs/io_uring.c b/fs/io_uring.c index fed53e47d6da..293d5030ebbb 100644 --- a/fs/io_uring.c +++ b/fs/io_uring.c @@ -2908,10 +2908,6 @@ static int io_read(struct io_kiocb *req, bool force_nonblock, inline_vecs, &iter); if (ret) goto out_free; - /* any defer here is final, must blocking retry */ - if (!(req->flags & REQ_F_NOWAIT) && - !file_can_poll(req->file)) - req->flags |= REQ_F_MUST_PUNT; return -EAGAIN; } } @@ -3024,10 +3020,6 @@ static int io_write(struct io_kiocb *req, bool force_nonblock, inline_vecs, &iter); if (ret) goto out_free; - /* any defer here is final, must blocking retry */ - if (!(req->flags & REQ_F_NOWAIT) && - !file_can_poll(req->file)) - req->flags |= REQ_F_MUST_PUNT; return -EAGAIN; } }
From: Pavel Begunkov asml.silence@gmail.com
mainline inclusion from mainline-5.9-rc1 commit 24c74678634b3cbdb325b3b7706366c83811b311 category: feature bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27 CVE: NA ---------------------------
REQ_F_MUST_PUNT may look good and clear, but it's the same as not having REQ_F_NOWAIT set, which rather creates more confusion. Moreover, it doesn't even affect any behaviour (e.g. see the patch removing it from io_{read,write}).
Kill the flag and update the already outdated comments.
Signed-off-by: Pavel Begunkov asml.silence@gmail.com Signed-off-by: Jens Axboe axboe@kernel.dk
Conflicts: fs/io_uring.c [8eb7e2d00763 ("io_uring: kill REQ_F_TIMEOUT_NOSEQ") and a1a4661691c5 ("io_uring: kill REQ_F_TIMEOUT") and 6795c5aba247 ("io_uring: clean up req->result setting by rw") and 607ec89ed18f ("io_uring: fix SQPOLL IORING_OP_CLOSE cancelation state") include first]
Signed-off-by: yangerkun yangerkun@huawei.com Reviewed-by: zhangyi (F) yi.zhang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- fs/io_uring.c | 22 +++++++--------------- 1 file changed, 7 insertions(+), 15 deletions(-)
diff --git a/fs/io_uring.c b/fs/io_uring.c index 293d5030ebbb..8a1d04120501 100644 --- a/fs/io_uring.c +++ b/fs/io_uring.c @@ -542,7 +542,6 @@ enum { REQ_F_NOWAIT_BIT, REQ_F_LINK_TIMEOUT_BIT, REQ_F_ISREG_BIT, - REQ_F_MUST_PUNT_BIT, REQ_F_COMP_LOCKED_BIT, REQ_F_NEED_CLEANUP_BIT, REQ_F_OVERFLOW_BIT, @@ -585,8 +584,6 @@ enum { REQ_F_LINK_TIMEOUT = BIT(REQ_F_LINK_TIMEOUT_BIT), /* regular file */ REQ_F_ISREG = BIT(REQ_F_ISREG_BIT), - /* must be punted even for NONBLOCK */ - REQ_F_MUST_PUNT = BIT(REQ_F_MUST_PUNT_BIT), /* completion under lock */ REQ_F_COMP_LOCKED = BIT(REQ_F_COMP_LOCKED_BIT), /* needs cleanup */ @@ -2877,10 +2874,7 @@ static int io_read(struct io_kiocb *req, bool force_nonblock, io_size = ret; req->result = io_size;
- /* - * If the file doesn't support async, mark it as REQ_F_MUST_PUNT so - * we know to async punt it even if it was opened O_NONBLOCK - */ + /* If the file doesn't support async, just async punt */ if (force_nonblock && !io_file_supports_async(req->file, READ)) goto copy_iov;
@@ -2958,10 +2952,7 @@ static int io_write(struct io_kiocb *req, bool force_nonblock, io_size = ret; req->result = io_size;
- /* - * If the file doesn't support async, mark it as REQ_F_MUST_PUNT so - * we know to async punt it even if it was opened O_NONBLOCK - */ + /* If the file doesn't support async, just async punt */ if (force_nonblock && !io_file_supports_async(req->file, WRITE)) goto copy_iov;
@@ -3645,8 +3636,10 @@ static int io_close(struct io_kiocb *req, bool force_nonblock, if (close->put_file->f_op->flush && force_nonblock) { /* not safe to cancel at this point */ req->work.flags |= IO_WQ_WORK_NO_CANCEL; + /* was never set, but play safe */ + req->flags &= ~REQ_F_NOWAIT; /* avoid grabbing files - we don't need the files */ - req->flags |= REQ_F_NO_FILE_TABLE | REQ_F_MUST_PUNT; + req->flags |= REQ_F_NO_FILE_TABLE; return -EAGAIN; }
@@ -4614,7 +4607,7 @@ static bool io_arm_poll_handler(struct io_kiocb *req)
if (!req->file || !file_can_poll(req->file)) return false; - if (req->flags & (REQ_F_MUST_PUNT | REQ_F_POLLED)) + if (req->flags & REQ_F_POLLED) return false; if (!def->pollin && !def->pollout) return false; @@ -5809,8 +5802,7 @@ static void __io_queue_sqe(struct io_kiocb *req, const struct io_uring_sqe *sqe, * We async punt it if the file wasn't marked NOWAIT, or if the file * doesn't support non-blocking read/write attempts */ - if (ret == -EAGAIN && (!(req->flags & REQ_F_NOWAIT) || - (req->flags & REQ_F_MUST_PUNT))) { + if (ret == -EAGAIN && !(req->flags & REQ_F_NOWAIT)) { if (io_arm_poll_handler(req)) { if (linked_timeout) io_queue_linked_timeout(linked_timeout);
From: Pavel Begunkov asml.silence@gmail.com
mainline inclusion from mainline-5.9-rc1 commit b90cd197f9315f968d5ee4e6ee9f4e3067f2c883 category: feature bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27 CVE: NA ---------------------------
It's good practice to modify a struct's fields after, not before, it has been initialised. Even though io_init_poll_iocb() doesn't touch poll->file, call it first.
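Why the ordering matters, in a self-contained sketch (hypothetical helper; the hazard is an init routine growing a memset() later):

    #include <string.h>

    struct file;
    struct poll { struct file *file; int mask; };

    static void init_poll(struct poll *p, int mask)
    {
        memset(p, 0, sizeof(*p));   /* if the helper ever grows this, */
        p->mask = mask;             /* fields set beforehand are wiped */
    }

    static void arm_poll(struct poll *p, struct file *f, int mask)
    {
        init_poll(p, mask);         /* initialise first ... */
        p->file = f;                /* ... then fill in the extras */
    }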
Signed-off-by: Pavel Begunkov asml.silence@gmail.com Signed-off-by: Jens Axboe axboe@kernel.dk Signed-off-by: yangerkun yangerkun@huawei.com Reviewed-by: zhangyi (F) yi.zhang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- fs/io_uring.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/fs/io_uring.c b/fs/io_uring.c index 8a1d04120501..f7323de614a6 100644 --- a/fs/io_uring.c +++ b/fs/io_uring.c @@ -4566,8 +4566,8 @@ static __poll_t __io_arm_poll_handler(struct io_kiocb *req, struct io_ring_ctx *ctx = req->ctx; bool cancel = false;
- poll->file = req->file; io_init_poll_iocb(poll, mask, wake_func); + poll->file = req->file; poll->wait.private = req;
ipt->pt._key = mask;
From: Randy Dunlap rdunlap@infradead.org
mainline inclusion from mainline-5.9-rc1 commit 1e16c2f917a59d27fb6b540c44d66978c8ad29ef category: feature bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27 CVE: NA ---------------------------
Fix build errors when CONFIG_NET is not set/enabled:
../fs/io_uring.c:5472:10: error: too many arguments to function ‘io_sendmsg’
../fs/io_uring.c:5474:10: error: too many arguments to function ‘io_send’
../fs/io_uring.c:5484:10: error: too many arguments to function ‘io_recvmsg’
../fs/io_uring.c:5486:10: error: too many arguments to function ‘io_recv’
../fs/io_uring.c:5510:9: error: too many arguments to function ‘io_accept’
../fs/io_uring.c:5518:9: error: too many arguments to function ‘io_connect’
Signed-off-by: Randy Dunlap rdunlap@infradead.org Cc: Jens Axboe axboe@kernel.dk Cc: io-uring@vger.kernel.org Signed-off-by: Jens Axboe axboe@kernel.dk Signed-off-by: yangerkun yangerkun@huawei.com Reviewed-by: zhangyi (F) yi.zhang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- fs/io_uring.c | 18 ++++++++++++------ 1 file changed, 12 insertions(+), 6 deletions(-)
diff --git a/fs/io_uring.c b/fs/io_uring.c index f7323de614a6..e02f975fab65 100644 --- a/fs/io_uring.c +++ b/fs/io_uring.c @@ -4204,12 +4204,14 @@ static int io_sendmsg_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe) return -EOPNOTSUPP; }
-static int io_sendmsg(struct io_kiocb *req, bool force_nonblock) +static int io_sendmsg(struct io_kiocb *req, bool force_nonblock, + struct io_comp_state *cs) { return -EOPNOTSUPP; }
-static int io_send(struct io_kiocb *req, bool force_nonblock) +static int io_send(struct io_kiocb *req, bool force_nonblock, + struct io_comp_state *cs) { return -EOPNOTSUPP; } @@ -4220,12 +4222,14 @@ static int io_recvmsg_prep(struct io_kiocb *req, return -EOPNOTSUPP; }
-static int io_recvmsg(struct io_kiocb *req, bool force_nonblock) +static int io_recvmsg(struct io_kiocb *req, bool force_nonblock, + struct io_comp_state *cs) { return -EOPNOTSUPP; }
-static int io_recv(struct io_kiocb *req, bool force_nonblock) +static int io_recv(struct io_kiocb *req, bool force_nonblock, + struct io_comp_state *cs) { return -EOPNOTSUPP; } @@ -4235,7 +4239,8 @@ static int io_accept_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe) return -EOPNOTSUPP; }
-static int io_accept(struct io_kiocb *req, bool force_nonblock) +static int io_accept(struct io_kiocb *req, bool force_nonblock, + struct io_comp_state *cs) { return -EOPNOTSUPP; } @@ -4245,7 +4250,8 @@ static int io_connect_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe) return -EOPNOTSUPP; }
-static int io_connect(struct io_kiocb *req, bool force_nonblock) +static int io_connect(struct io_kiocb *req, bool force_nonblock, + struct io_comp_state *cs) { return -EOPNOTSUPP; }
From: Pavel Begunkov asml.silence@gmail.com
mainline inclusion from mainline-5.9-rc1 commit 351fd53595a3ceb88756a005e3b864f7c8cb86e4 category: feature bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27 CVE: NA ---------------------------
Remove the struct io_op_def *def parameter from io_req_work_grab_env(); it's trivially and cheaply deducible from req->opcode. The API is cleaner this way, and it also helps the compiler understand that it's a real constant that can be register-cached.
Signed-off-by: Pavel Begunkov asml.silence@gmail.com Signed-off-by: Jens Axboe axboe@kernel.dk
Conflicts: fs/io_uring.c [16d598030a37 ("io_uring: fix not initialised work->flags") include first]
Signed-off-by: yangerkun yangerkun@huawei.com Reviewed-by: zhangyi (F) yi.zhang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- fs/io_uring.c | 9 +++++---- 1 file changed, 5 insertions(+), 4 deletions(-)
diff --git a/fs/io_uring.c b/fs/io_uring.c index e02f975fab65..7624b5388a75 100644 --- a/fs/io_uring.c +++ b/fs/io_uring.c @@ -1159,9 +1159,10 @@ static void __io_commit_cqring(struct io_ring_ctx *ctx) } }
-static inline void io_req_work_grab_env(struct io_kiocb *req, - const struct io_op_def *def) +static inline void io_req_work_grab_env(struct io_kiocb *req) { + const struct io_op_def *def = &io_op_defs[req->opcode]; + if (!req->work.mm && def->needs_mm) { mmgrab(current->mm); req->work.mm = current->mm; @@ -1220,7 +1221,7 @@ static inline void io_prep_async_work(struct io_kiocb *req, req->work.flags |= IO_WQ_WORK_UNBOUND; }
- io_req_work_grab_env(req, def); + io_req_work_grab_env(req);
*link = io_prep_linked_timeout(req); } @@ -5164,7 +5165,7 @@ static int io_req_defer_prep(struct io_kiocb *req,
if (for_async || (req->flags & REQ_F_WORK_INITIALIZED)) { io_req_init_async(req); - io_req_work_grab_env(req, &io_op_defs[req->opcode]); + io_req_work_grab_env(req); }
switch (req->opcode) {
From: Pavel Begunkov asml.silence@gmail.com
mainline inclusion from mainline-5.9-rc1 commit edcdfcc149a8d0c11d4dd2b23b5338af22e31a5f category: feature bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27 CVE: NA ---------------------------
Place io_req_init_async() in io_req_work_grab_env() so it won't be forgotten.
Signed-off-by: Pavel Begunkov asml.silence@gmail.com Signed-off-by: Jens Axboe axboe@kernel.dk
Conflicts: fs/io_uring.c [edcdfcc149a8 ("io_uring: do init work in grab_env()")]
Signed-off-by: yangerkun yangerkun@huawei.com Reviewed-by: zhangyi (F) yi.zhang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- fs/io_uring.c | 7 +++---- 1 file changed, 3 insertions(+), 4 deletions(-)
diff --git a/fs/io_uring.c b/fs/io_uring.c index 7624b5388a75..cd8b83c95b34 100644 --- a/fs/io_uring.c +++ b/fs/io_uring.c @@ -1163,6 +1163,8 @@ static inline void io_req_work_grab_env(struct io_kiocb *req) { const struct io_op_def *def = &io_op_defs[req->opcode];
+ io_req_init_async(req); + if (!req->work.mm && def->needs_mm) { mmgrab(current->mm); req->work.mm = current->mm; @@ -1222,7 +1224,6 @@ static inline void io_prep_async_work(struct io_kiocb *req, }
io_req_work_grab_env(req); - *link = io_prep_linked_timeout(req); }
@@ -5163,10 +5164,8 @@ static int io_req_defer_prep(struct io_kiocb *req, return ret; }
- if (for_async || (req->flags & REQ_F_WORK_INITIALIZED)) { - io_req_init_async(req); + if (for_async || (req->flags & REQ_F_WORK_INITIALIZED)) io_req_work_grab_env(req); - }
switch (req->opcode) { case IORING_OP_NOP:
From: Pavel Begunkov asml.silence@gmail.com
mainline inclusion from mainline-5.9-rc1 commit debb85f496c9cc70663eac31d3ad9153839c844c category: feature bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27 CVE: NA ---------------------------
Remove the io_req_work_grab_env() call from io_req_defer_prep(); just call it when necessary.
Signed-off-by: Pavel Begunkov asml.silence@gmail.com Signed-off-by: Jens Axboe axboe@kernel.dk Signed-off-by: yangerkun yangerkun@huawei.com Reviewed-by: zhangyi (F) yi.zhang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- fs/io_uring.c | 15 +++++++-------- 1 file changed, 7 insertions(+), 8 deletions(-)
diff --git a/fs/io_uring.c b/fs/io_uring.c index cd8b83c95b34..5cc6491d09c0 100644 --- a/fs/io_uring.c +++ b/fs/io_uring.c @@ -5150,7 +5150,7 @@ static int io_files_update(struct io_kiocb *req, bool force_nonblock, }
static int io_req_defer_prep(struct io_kiocb *req, - const struct io_uring_sqe *sqe, bool for_async) + const struct io_uring_sqe *sqe) { ssize_t ret = 0;
@@ -5164,9 +5164,6 @@ static int io_req_defer_prep(struct io_kiocb *req, return ret; }
- if (for_async || (req->flags & REQ_F_WORK_INITIALIZED)) - io_req_work_grab_env(req); - switch (req->opcode) { case IORING_OP_NOP: break; @@ -5276,9 +5273,10 @@ static int io_req_defer(struct io_kiocb *req, const struct io_uring_sqe *sqe) if (!req->io) { if (io_alloc_async_ctx(req)) return -EAGAIN; - ret = io_req_defer_prep(req, sqe, true); + ret = io_req_defer_prep(req, sqe); if (ret < 0) return ret; + io_req_work_grab_env(req); }
spin_lock_irq(&ctx->completion_lock); @@ -5877,9 +5875,10 @@ static void io_queue_sqe(struct io_kiocb *req, const struct io_uring_sqe *sqe, ret = -EAGAIN; if (io_alloc_async_ctx(req)) goto fail_req; - ret = io_req_defer_prep(req, sqe, true); + ret = io_req_defer_prep(req, sqe); if (unlikely(ret < 0)) goto fail_req; + io_req_work_grab_env(req); }
/* @@ -5934,7 +5933,7 @@ static int io_submit_sqe(struct io_kiocb *req, const struct io_uring_sqe *sqe, if (io_alloc_async_ctx(req)) return -EAGAIN;
- ret = io_req_defer_prep(req, sqe, false); + ret = io_req_defer_prep(req, sqe); if (ret) { /* fail even hard links since we don't submit */ head->flags |= REQ_F_FAIL_LINK; @@ -5961,7 +5960,7 @@ static int io_submit_sqe(struct io_kiocb *req, const struct io_uring_sqe *sqe, if (io_alloc_async_ctx(req)) return -EAGAIN;
- ret = io_req_defer_prep(req, sqe, false); + ret = io_req_defer_prep(req, sqe); if (ret) req->flags |= REQ_F_FAIL_LINK; *link = req;
From: Pavel Begunkov asml.silence@gmail.com
mainline inclusion from mainline-5.9-rc1 commit cbdcb4357c000861b77369c34e110fa893d23607 category: feature bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27 CVE: NA ---------------------------
Currently io_steal_work() is disabled, and every linked request should go through task_work for initialisation. Do io_req_work_grab_env() just before io-wq punting and for the whole link, so any request reachable by io_steal_work() is prepared.
This is also interesting for another reason -- it localises io_req_work_grab_env() into one place just before io-wq punting, helping to better manage the req->work lifetime and to add some neat cleanups/optimisations later.
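The link-wide prep, sketched with a plain next pointer standing in for the kernel's list_head (illustrative names):

    #include <stddef.h>

    #define F_LINK_HEAD 1u

    struct req { unsigned flags; struct req *link; };

    static void prep_async_work(struct req *req)
    {
        (void)req;  /* grab mm, creds, fs, ... for io-wq */
    }

    /* prep the head and, if it leads a chain, every request behind it,
     * so anything io-wq might later pick up is already prepared */
    static void prep_async_link(struct req *req)
    {
        struct req *cur;

        prep_async_work(req);
        if (req->flags & F_LINK_HEAD)
            for (cur = req->link; cur != NULL; cur = cur->link)
                prep_async_work(cur);
    }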
Signed-off-by: Pavel Begunkov asml.silence@gmail.com Signed-off-by: Jens Axboe axboe@kernel.dk Signed-off-by: yangerkun yangerkun@huawei.com Reviewed-by: zhangyi (F) yi.zhang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- fs/io_uring.c | 53 ++++++++++++++++++++++++++++----------------------- 1 file changed, 29 insertions(+), 24 deletions(-)
diff --git a/fs/io_uring.c b/fs/io_uring.c index 5cc6491d09c0..e2ddfdc48bc7 100644 --- a/fs/io_uring.c +++ b/fs/io_uring.c @@ -1159,7 +1159,7 @@ static void __io_commit_cqring(struct io_ring_ctx *ctx) } }
-static inline void io_req_work_grab_env(struct io_kiocb *req) +static void io_req_work_grab_env(struct io_kiocb *req) { const struct io_op_def *def = &io_op_defs[req->opcode];
@@ -1208,8 +1208,7 @@ static inline void io_req_work_drop_env(struct io_kiocb *req) } }
-static inline void io_prep_async_work(struct io_kiocb *req, - struct io_kiocb **link) +static void io_prep_async_work(struct io_kiocb *req) { const struct io_op_def *def = &io_op_defs[req->opcode];
@@ -1224,15 +1223,22 @@ static inline void io_prep_async_work(struct io_kiocb *req, }
io_req_work_grab_env(req); - *link = io_prep_linked_timeout(req); }
-static inline void io_queue_async_work(struct io_kiocb *req) +static void io_prep_async_link(struct io_kiocb *req) { - struct io_ring_ctx *ctx = req->ctx; - struct io_kiocb *link; + struct io_kiocb *cur;
- io_prep_async_work(req, &link); + io_prep_async_work(req); + if (req->flags & REQ_F_LINK_HEAD) + list_for_each_entry(cur, &req->link_list, link_list) + io_prep_async_work(cur); +} + +static void __io_queue_async_work(struct io_kiocb *req) +{ + struct io_ring_ctx *ctx = req->ctx; + struct io_kiocb *link = io_prep_linked_timeout(req);
trace_io_uring_queue_async_work(ctx, io_wq_is_hashed(&req->work), req, &req->work, req->flags); @@ -1242,6 +1248,13 @@ static inline void io_queue_async_work(struct io_kiocb *req) io_queue_linked_timeout(link); }
+static void io_queue_async_work(struct io_kiocb *req) +{ + /* init ->work of the whole link before punting */ + io_prep_async_link(req); + __io_queue_async_work(req); +} + static void io_kill_timeout(struct io_kiocb *req) { int ret; @@ -1275,7 +1288,8 @@ static void __io_queue_deferred(struct io_ring_ctx *ctx) if (req_need_defer(req)) break; list_del_init(&req->list); - io_queue_async_work(req); + /* punt-init is done before queueing for defer */ + __io_queue_async_work(req); } while (!list_empty(&ctx->defer_list)); }
@@ -1876,7 +1890,7 @@ static void io_put_req(struct io_kiocb *req)
static struct io_wq_work *io_steal_work(struct io_kiocb *req) { - struct io_kiocb *nxt = NULL; + struct io_kiocb *timeout, *nxt = NULL;
/* * A ref is owned by io-wq in which context we're. So, if that's the @@ -1890,18 +1904,10 @@ static struct io_wq_work *io_steal_work(struct io_kiocb *req) if (!nxt) return NULL;
- if ((nxt->flags & REQ_F_ISREG) && io_op_defs[nxt->opcode].hash_reg_file) - io_wq_hash_work(&nxt->work, file_inode(nxt->file)); - - io_req_task_queue(nxt); - /* - * If we're going to return actual work, here should be timeout prep: - * - * link = io_prep_linked_timeout(nxt); - * if (link) - * nxt->flags |= REQ_F_QUEUE_TIMEOUT; - */ - return NULL; + timeout = io_prep_linked_timeout(nxt); + if (timeout) + nxt->flags |= REQ_F_QUEUE_TIMEOUT; + return &nxt->work; }
/* @@ -5276,8 +5282,8 @@ static int io_req_defer(struct io_kiocb *req, const struct io_uring_sqe *sqe) ret = io_req_defer_prep(req, sqe); if (ret < 0) return ret; - io_req_work_grab_env(req); } + io_prep_async_link(req);
spin_lock_irq(&ctx->completion_lock); if (!req_need_defer(req) && list_empty(&ctx->defer_list)) { @@ -5878,7 +5884,6 @@ static void io_queue_sqe(struct io_kiocb *req, const struct io_uring_sqe *sqe, ret = io_req_defer_prep(req, sqe); if (unlikely(ret < 0)) goto fail_req; - io_req_work_grab_env(req); }
/*
From: Jens Axboe axboe@kernel.dk
mainline inclusion from mainline-5.9-rc1 commit 4c6e277c4cc4a6b3b2b9c66a7b014787ae757cc1 category: feature bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27 CVE: NA ---------------------------
Provide a helper to run task_work instead of checking and running it manually in a bunch of different spots. While doing so, also move the setting of the task run state to where we run the task work; then we can move it out of the callback helpers. This also helps ensure we only do this once per task_work list run, not per task_work item.
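The helper's shape, as a userspace sketch (globals stand in for current->task_works and task_work_run(); not the kernel code):

    #include <stdbool.h>

    static bool task_work_pending;  /* stand-in for current->task_works */

    static void set_current_running(void) { /* __set_current_state(TASK_RUNNING) */ }
    static void run_pending_task_work(void) { task_work_pending = false; }

    /* replaces the repeated check-and-run; the run state is set once
     * per list run, not once per task_work item */
    static inline bool run_task_work(void)
    {
        if (!task_work_pending)
            return false;
        set_current_running();
        run_pending_task_work();
        return true;
    }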
Suggested-by: Oleg Nesterov oleg@redhat.com Signed-off-by: Jens Axboe axboe@kernel.dk
Conflicts: fs/io_uring.c [28cea78af449 ("io_uring: allow non-fixed files with SQPOLL") include first, b63534c41e20 ("io_uring: re-issue block requests that failed because of resources") not include, 23b3628e4592 ("io_uring: clear IORING_SQ_NEED_WAKEUP after executing task works") include first, 37c54f9bd486 ("kernel: set USER_DS in kthread_use_mm") not include]
Signed-off-by: yangerkun yangerkun@huawei.com Reviewed-by: zhangyi (F) yi.zhang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- fs/io_uring.c | 34 +++++++++++++++++++--------------- 1 file changed, 19 insertions(+), 15 deletions(-)
diff --git a/fs/io_uring.c b/fs/io_uring.c index e2ddfdc48bc7..affde6571d6c 100644 --- a/fs/io_uring.c +++ b/fs/io_uring.c @@ -1781,7 +1781,6 @@ static void __io_req_task_submit(struct io_kiocb *req) { struct io_ring_ctx *ctx = req->ctx;
- __set_current_state(TASK_RUNNING); if (!__io_sq_thread_acquire_mm(ctx)) { __io_sq_thread_acquire_files(ctx); mutex_lock(&ctx->uring_lock); @@ -1970,6 +1969,17 @@ static int io_put_kbuf(struct io_kiocb *req) return cflags; }
+static inline bool io_run_task_work(void) +{ + if (current->task_works) { + __set_current_state(TASK_RUNNING); + task_work_run(); + return true; + } + + return false; +} + static void io_iopoll_queue(struct list_head *again) { struct io_kiocb *req; @@ -2164,8 +2174,7 @@ static int io_iopoll_check(struct io_ring_ctx *ctx, unsigned *nr_events, */ if (!(++iters & 7)) { mutex_unlock(&ctx->uring_lock); - if (current->task_works) - task_work_run(); + io_run_task_work(); mutex_lock(&ctx->uring_lock); }
@@ -6271,8 +6280,7 @@ static int io_sq_thread(void *data) if (!list_empty(&ctx->poll_list) || need_resched() || (!time_after(jiffies, timeout) && ret != -EBUSY && !percpu_ref_is_dying(&ctx->refs))) { - if (current->task_works) - task_work_run(); + io_run_task_work(); cond_resched(); continue; } @@ -6301,8 +6309,7 @@ static int io_sq_thread(void *data) finish_wait(&ctx->sqo_wait, &wait); break; } - if (current->task_works) { - task_work_run(); + if (io_run_task_work()) { finish_wait(&ctx->sqo_wait, &wait); io_ring_clear_wakeup_flag(ctx); continue; @@ -6328,8 +6335,7 @@ static int io_sq_thread(void *data) timeout = jiffies + ctx->sq_thread_idle; }
- if (current->task_works) - task_work_run(); + io_run_task_work();
set_fs(old_fs); io_sq_thread_drop_mm_files(); @@ -6400,9 +6406,8 @@ static int io_cqring_wait(struct io_ring_ctx *ctx, int min_events, do { if (io_cqring_events(ctx, false) >= min_events) return 0; - if (!current->task_works) + if (!io_run_task_work()) break; - task_work_run(); } while (1);
if (sig) { @@ -6424,8 +6429,8 @@ static int io_cqring_wait(struct io_ring_ctx *ctx, int min_events, prepare_to_wait_exclusive(&ctx->wait, &iowq.wq, TASK_INTERRUPTIBLE); /* make sure we run task_work before checking for signals */ - if (current->task_works) - task_work_run(); + if (io_run_task_work()) + continue; if (signal_pending(current)) { if (current->jobctl & JOBCTL_TASK_WORK) { spin_lock_irq(¤t->sighand->siglock); @@ -7861,8 +7866,7 @@ SYSCALL_DEFINE6(io_uring_enter, unsigned int, fd, u32, to_submit, int submitted = 0; struct fd f;
- if (current->task_works) - task_work_run(); + io_run_task_work();
if (flags & ~(IORING_ENTER_GETEVENTS | IORING_ENTER_SQ_WAKEUP)) return -EINVAL;
From: Jens Axboe axboe@kernel.dk
mainline inclusion from mainline-5.9-rc1 commit c2c4c83c58cbca23527fee93b49738a5a84272a1 category: feature bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27 CVE: NA ---------------------------
Since we now have that in the 5.9 branch, convert the existing users of task_work_add() to use this new helper.
Signed-off-by: Jens Axboe axboe@kernel.dk
Conflicts: fs/io_uring.c [0ba9c9edcd15 ("io_uring: use TWA_SIGNAL for task_work uncondtionally") include first, 6d816e088c35 ("io_uring: hold 'ctx' reference around task_work queue + execute") include first]
Signed-off-by: yangerkun yangerkun@huawei.com Reviewed-by: zhangyi (F) yi.zhang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- fs/io_uring.c | 59 ++++++++++++++++++++++++++------------------------- 1 file changed, 30 insertions(+), 29 deletions(-)
diff --git a/fs/io_uring.c b/fs/io_uring.c index affde6571d6c..a629e16d1122 100644 --- a/fs/io_uring.c +++ b/fs/io_uring.c @@ -1756,6 +1756,29 @@ static struct io_kiocb *io_req_find_next(struct io_kiocb *req) return __io_req_find_next(req); }
+static int io_req_task_work_add(struct io_kiocb *req, struct callback_head *cb) +{ + struct task_struct *tsk = req->task; + struct io_ring_ctx *ctx = req->ctx; + int ret, notify; + + /* + * SQPOLL kernel thread doesn't need notification, just a wakeup. For + * all other cases, use TWA_SIGNAL unconditionally to ensure we're + * processing task_work. There's no reliable way to tell if TWA_RESUME + * will do the job. + */ + notify = 0; + if (!(ctx->flags & IORING_SETUP_SQPOLL)) + notify = TWA_SIGNAL; + + ret = task_work_add(tsk, cb, notify); + if (!ret) + wake_up_process(tsk); + + return ret; +} + static void __io_req_task_cancel(struct io_kiocb *req, int error) { struct io_ring_ctx *ctx = req->ctx; @@ -1802,19 +1825,20 @@ static void io_req_task_submit(struct callback_head *cb)
static void io_req_task_queue(struct io_kiocb *req) { - struct task_struct *tsk = req->task; int ret;
init_task_work(&req->task_work, io_req_task_submit); percpu_ref_get(&req->ctx->refs);
- ret = task_work_add(tsk, &req->task_work, true); + ret = io_req_task_work_add(req, &req->task_work); if (unlikely(ret)) { + struct task_struct *tsk; + init_task_work(&req->task_work, io_req_task_cancel); tsk = io_wq_get_task(req->ctx->io_wq); - task_work_add(tsk, &req->task_work, true); + task_work_add(tsk, &req->task_work, 0); + wake_up_process(tsk); } - wake_up_process(tsk); }
static void io_queue_next(struct io_kiocb *req) @@ -4280,33 +4304,9 @@ struct io_poll_table { int error; };
-static int io_req_task_work_add(struct io_kiocb *req, struct callback_head *cb) -{ - struct task_struct *tsk = req->task; - struct io_ring_ctx *ctx = req->ctx; - int ret, notify; - - /* - * SQPOLL kernel thread doesn't need notification, just a wakeup. For - * all other cases, use TWA_SIGNAL unconditionally to ensure we're - * processing task_work. There's no reliable way to tell if TWA_RESUME - * will do the job. - */ - notify = 0; - if (!(ctx->flags & IORING_SETUP_SQPOLL)) - notify = TWA_SIGNAL; - - ret = task_work_add(tsk, cb, notify); - if (!ret) - wake_up_process(tsk); - - return ret; -} - static int __io_async_wake(struct io_kiocb *req, struct io_poll_iocb *poll, __poll_t mask, task_work_func_t func) { - struct task_struct *tsk; int ret;
/* for instances that support it check for an event match first: */ @@ -4317,7 +4317,6 @@ static int __io_async_wake(struct io_kiocb *req, struct io_poll_iocb *poll,
list_del_init(&poll->wait.entry);
- tsk = req->task; req->result = mask; init_task_work(&req->task_work, func); percpu_ref_get(&req->ctx->refs); @@ -4330,6 +4329,8 @@ static int __io_async_wake(struct io_kiocb *req, struct io_poll_iocb *poll, */ ret = io_req_task_work_add(req, &req->task_work); if (unlikely(ret)) { + struct task_struct *tsk; + WRITE_ONCE(poll->canceled, true); tsk = io_wq_get_task(req->ctx->io_wq); task_work_add(tsk, &req->task_work, 0);
From: Pavel Begunkov asml.silence@gmail.com
mainline inclusion from mainline-5.9-rc1 commit 6df1db6b542436c6d429caa66e1045862fa36155 category: feature bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27 CVE: NA ---------------------------
io_prep_linked_timeout() sets REQ_F_LINK_TIMEOUT, altering the refcounting of the following linked request. After that, someone should call io_queue_linked_timeout(), otherwise the submission reference of the linked timeout will never be dropped.
That's what happens in io_steal_work() if io-wq decides to postpone the linked request with io_wqe_enqueue(). io_queue_linked_timeout() can also potentially be called twice without synchronisation during re-submission, e.g. in io_rw_resubmit().
The rule is: whoever does io_prep_linked_timeout() must also call io_queue_linked_timeout(). To avoid doing it twice, io_prep_linked_timeout() returns non-NULL only for the first call, which is controlled by the REQ_F_LINK_TIMEOUT flag.
Also kill REQ_F_QUEUE_TIMEOUT.
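A minimal standalone sketch of that arm-once contract (hypothetical sk_* names mirroring the patch; an illustration, not the kernel implementation):

/*
 * Illustrative only: hands out the linked timeout exactly once,
 * gated by a flag, so whichever path queues the work afterwards
 * can blindly queue whatever prep returned.
 */
struct sk_req {
        unsigned int flags;
        struct sk_req *link;            /* first linked request, if any */
};

#define SK_F_LINK_TIMEOUT       (1U << 0)

static struct sk_req *sk_prep_linked_timeout(struct sk_req *req)
{
        if (req->flags & SK_F_LINK_TIMEOUT)
                return NULL;            /* a previous call owns queueing it */
        if (!req->link)
                return NULL;
        req->flags |= SK_F_LINK_TIMEOUT;
        return req->link;
}

Every caller then follows one pattern, timeout = sk_prep_linked_timeout(req); if (timeout) queue it, which stays safe even if a re-submission path runs it a second time.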
Signed-off-by: Pavel Begunkov asml.silence@gmail.com Signed-off-by: Jens Axboe axboe@kernel.dk Signed-off-by: yangerkun yangerkun@huawei.com Reviewed-by: zhangyi (F) yi.zhang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- fs/io_uring.c | 33 +++++++-------------------------- 1 file changed, 7 insertions(+), 26 deletions(-)
diff --git a/fs/io_uring.c b/fs/io_uring.c index a629e16d1122..69e37b53477b 100644 --- a/fs/io_uring.c +++ b/fs/io_uring.c @@ -548,7 +548,6 @@ enum { REQ_F_POLLED_BIT, REQ_F_BUFFER_SELECTED_BIT, REQ_F_NO_FILE_TABLE_BIT, - REQ_F_QUEUE_TIMEOUT_BIT, REQ_F_WORK_INITIALIZED_BIT, REQ_F_TASK_PINNED_BIT,
@@ -596,8 +595,6 @@ enum { REQ_F_BUFFER_SELECTED = BIT(REQ_F_BUFFER_SELECTED_BIT), /* doesn't need file table for this request */ REQ_F_NO_FILE_TABLE = BIT(REQ_F_NO_FILE_TABLE_BIT), - /* needs to queue linked timeout */ - REQ_F_QUEUE_TIMEOUT = BIT(REQ_F_QUEUE_TIMEOUT_BIT), /* io_wq_work is initialized */ REQ_F_WORK_INITIALIZED = BIT(REQ_F_WORK_INITIALIZED_BIT), /* req->task is refcounted */ @@ -1913,7 +1910,7 @@ static void io_put_req(struct io_kiocb *req)
static struct io_wq_work *io_steal_work(struct io_kiocb *req) { - struct io_kiocb *timeout, *nxt = NULL; + struct io_kiocb *nxt;
/* * A ref is owned by io-wq in which context we're. So, if that's the @@ -1924,13 +1921,7 @@ static struct io_wq_work *io_steal_work(struct io_kiocb *req) return NULL;
nxt = io_req_find_next(req); - if (!nxt) - return NULL; - - timeout = io_prep_linked_timeout(nxt); - if (timeout) - nxt->flags |= REQ_F_QUEUE_TIMEOUT; - return &nxt->work; + return nxt ? &nxt->work : NULL; }
/* @@ -5597,24 +5588,15 @@ static int io_issue_sqe(struct io_kiocb *req, const struct io_uring_sqe *sqe, return 0; }
-static void io_arm_async_linked_timeout(struct io_kiocb *req) -{ - struct io_kiocb *link; - - /* link head's timeout is queued in io_queue_async_work() */ - if (!(req->flags & REQ_F_QUEUE_TIMEOUT)) - return; - - link = list_first_entry(&req->link_list, struct io_kiocb, link_list); - io_queue_linked_timeout(link); -} - static struct io_wq_work *io_wq_submit_work(struct io_wq_work *work) { struct io_kiocb *req = container_of(work, struct io_kiocb, work); + struct io_kiocb *timeout; int ret = 0;
- io_arm_async_linked_timeout(req); + timeout = io_prep_linked_timeout(req); + if (timeout) + io_queue_linked_timeout(timeout);
/* if NO_CANCEL is set, we must still run the work */ if ((work->flags & (IO_WQ_WORK_CANCEL|IO_WQ_WORK_NO_CANCEL)) == @@ -5782,8 +5764,7 @@ static struct io_kiocb *io_prep_linked_timeout(struct io_kiocb *req)
if (!(req->flags & REQ_F_LINK_HEAD)) return NULL; - /* for polled retry, if flag is set, we already went through here */ - if (req->flags & REQ_F_POLLED) + if (req->flags & REQ_F_LINK_TIMEOUT) return NULL;
nxt = list_first_entry_or_null(&req->link_list, struct io_kiocb,
From: Pavel Begunkov asml.silence@gmail.com
mainline inclusion from mainline-5.9-rc1 commit 652532ad459524d32c6bf1522e0b88d83b084d1a category: feature bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27 CVE: NA ---------------------------
A preparation patch that extracts the error path into a separate block. It looks saner than calling req_set_fail_links() after io_put_req_find_next(), even though it has been working well.
Signed-off-by: Pavel Begunkov asml.silence@gmail.com Signed-off-by: Jens Axboe axboe@kernel.dk Signed-off-by: yangerkun yangerkun@huawei.com Reviewed-by: zhangyi (F) yi.zhang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- fs/io_uring.c | 23 +++++++++++------------ 1 file changed, 11 insertions(+), 12 deletions(-)
diff --git a/fs/io_uring.c b/fs/io_uring.c index 69e37b53477b..c0a6f6cda554 100644 --- a/fs/io_uring.c +++ b/fs/io_uring.c @@ -5826,22 +5826,21 @@ static void __io_queue_sqe(struct io_kiocb *req, const struct io_uring_sqe *sqe, goto exit; }
+ if (unlikely(ret)) { err: - /* drop submission reference */ - nxt = io_put_req_find_next(req); - - if (linked_timeout) { - if (!ret) - io_queue_linked_timeout(linked_timeout); - else - io_put_req(linked_timeout); - } - - /* and drop final reference, if we failed */ - if (ret) { + /* un-prep timeout, so it'll be killed as any other linked */ + req->flags &= ~REQ_F_LINK_TIMEOUT; req_set_fail_links(req); + io_put_req(req); io_req_complete(req, ret); + goto exit; } + + /* drop submission reference */ + nxt = io_put_req_find_next(req); + if (linked_timeout) + io_queue_linked_timeout(linked_timeout); + if (nxt) { req = nxt;
From: Pavel Begunkov asml.silence@gmail.com
mainline inclusion from mainline-5.9-rc1 commit 8b3656af2a37dc538d21e144a5a94bacae05e9f1 category: feature bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27 CVE: NA ---------------------------
Don't forget to fill cqe->flags properly in io_submit_flush_completions()
Fixes: a1d7c393c4711 ("io_uring: enable READ/WRITE to use deferred completions") Signed-off-by: Pavel Begunkov asml.silence@gmail.com Signed-off-by: Jens Axboe axboe@kernel.dk Signed-off-by: yangerkun yangerkun@huawei.com Reviewed-by: zhangyi (F) yi.zhang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- fs/io_uring.c | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-)
diff --git a/fs/io_uring.c b/fs/io_uring.c index c0a6f6cda554..19460cda134e 100644 --- a/fs/io_uring.c +++ b/fs/io_uring.c @@ -1478,7 +1478,7 @@ static void io_submit_flush_completions(struct io_comp_state *cs)
req = list_first_entry(&cs->list, struct io_kiocb, list); list_del(&req->list); - io_cqring_fill_event(req, req->result); + __io_cqring_fill_event(req, req->result, req->cflags); if (!(req->flags & REQ_F_LINK_HEAD)) { req->flags |= REQ_F_COMP_LOCKED; io_put_req(req); @@ -1503,6 +1503,7 @@ static void __io_req_complete(struct io_kiocb *req, long res, unsigned cflags, io_put_req(req); } else { req->result = res; + req->cflags = cflags; list_add_tail(&req->list, &cs->list); if (++cs->nr >= 32) io_submit_flush_completions(cs);
From: Pavel Begunkov asml.silence@gmail.com
mainline inclusion from mainline-5.9-rc1 commit 3aadc23e6054353ca056bf14e87250c79efbd7ed category: feature bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27 CVE: NA ---------------------------
->iopoll() may have completed the current request, but instead of reaping it, io_do_iopoll() just continues with the next request in the list. As a result it can leave a just-polled and completed request in the list until the next syscall. Even the outer loop in io_iopoll_getevents() doesn't help the situation.
E.g. with poll_list: req0 -> req1, if req0->iopoll() completed both requests and @min <= 1, then req0 will be left behind.
Check whether a req was completed after ->iopoll().
Signed-off-by: Pavel Begunkov asml.silence@gmail.com Signed-off-by: Jens Axboe axboe@kernel.dk Signed-off-by: yangerkun yangerkun@huawei.com Reviewed-by: zhangyi (F) yi.zhang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- fs/io_uring.c | 4 ++++ 1 file changed, 4 insertions(+)
diff --git a/fs/io_uring.c b/fs/io_uring.c index 19460cda134e..ab77682c6742 100644 --- a/fs/io_uring.c +++ b/fs/io_uring.c @@ -2094,6 +2094,10 @@ static int io_do_iopoll(struct io_ring_ctx *ctx, unsigned int *nr_events, if (ret < 0) break;
+ /* iopoll may have completed current req */ + if (READ_ONCE(req->iopoll_completed)) + list_move_tail(&req->list, &done); + if (ret && spin) spin = false; ret = 0;
From: Pavel Begunkov asml.silence@gmail.com
mainline inclusion from mainline-5.9-rc1 commit 7668b92a69b8201e2dd16a47a08efb93e909f419 category: feature bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27 CVE: NA ---------------------------
Nobody checks io_iopoll_check()'s output parameter @nr_events. Remove the parameter and declare it further down the stack.
Signed-off-by: Pavel Begunkov asml.silence@gmail.com Signed-off-by: Jens Axboe axboe@kernel.dk Signed-off-by: yangerkun yangerkun@huawei.com Reviewed-by: zhangyi (F) yi.zhang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- fs/io_uring.c | 12 +++++------- 1 file changed, 5 insertions(+), 7 deletions(-)
diff --git a/fs/io_uring.c b/fs/io_uring.c index ab77682c6742..46dde280cb0f 100644 --- a/fs/io_uring.c +++ b/fs/io_uring.c @@ -2162,9 +2162,9 @@ static void io_iopoll_try_reap_events(struct io_ring_ctx *ctx) mutex_unlock(&ctx->uring_lock); }
-static int io_iopoll_check(struct io_ring_ctx *ctx, unsigned *nr_events, - long min) +static int io_iopoll_check(struct io_ring_ctx *ctx, long min) { + unsigned int nr_events = 0; int iters = 0, ret = 0;
/* @@ -2198,11 +2198,11 @@ static int io_iopoll_check(struct io_ring_ctx *ctx, unsigned *nr_events, mutex_lock(&ctx->uring_lock); }
- ret = io_iopoll_getevents(ctx, nr_events, min); + ret = io_iopoll_getevents(ctx, &nr_events, min); if (ret <= 0) break; ret = 0; - } while (min && !*nr_events && !need_resched()); + } while (min && !nr_events && !need_resched());
mutex_unlock(&ctx->uring_lock); return ret; @@ -7891,8 +7891,6 @@ SYSCALL_DEFINE6(io_uring_enter, unsigned int, fd, u32, to_submit, goto out; } if (flags & IORING_ENTER_GETEVENTS) { - unsigned nr_events = 0; - min_complete = min(min_complete, ctx->cq_entries);
/* @@ -7903,7 +7901,7 @@ SYSCALL_DEFINE6(io_uring_enter, unsigned int, fd, u32, to_submit, */ if (ctx->flags & IORING_SETUP_IOPOLL && !(ctx->flags & IORING_SETUP_SQPOLL)) { - ret = io_iopoll_check(ctx, &nr_events, min_complete); + ret = io_iopoll_check(ctx, min_complete); } else { ret = io_cqring_wait(ctx, min_complete, sig, sigsz); }
From: Dan Carpenter dan.carpenter@oracle.com
mainline inclusion from mainline-5.9-rc1 commit aa340845ae6f019e0a12321a1741c14679bb0664 category: feature bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27 CVE: NA ---------------------------
The "apoll" variable is freed and then used on the next line. We need to move the free down a few lines.
Fixes: 0be0b0e33b0b ("io_uring: simplify io_async_task_func()") Signed-off-by: Dan Carpenter dan.carpenter@oracle.com Signed-off-by: Jens Axboe axboe@kernel.dk
Conflicts: fs/io_uring.c [760618f7a8e3 ("Merge branch 'io_uring-5.8' into for-5.9/io_uring"), this merge has some changes, we need to backport them too]
Signed-off-by: yangerkun yangerkun@huawei.com Reviewed-by: zhangyi (F) yi.zhang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- fs/io_uring.c | 7 ++++--- 1 file changed, 4 insertions(+), 3 deletions(-)
diff --git a/fs/io_uring.c b/fs/io_uring.c index 46dde280cb0f..94e5795fc41c 100644 --- a/fs/io_uring.c +++ b/fs/io_uring.c @@ -4546,14 +4546,15 @@ static void io_async_task_func(struct callback_head *cb) /* restore ->work in case we need to retry again */ if (req->flags & REQ_F_WORK_INITIALIZED) memcpy(&req->work, &apoll->work, sizeof(req->work)); - percpu_ref_put(&ctx->refs); - kfree(apoll->double_poll); - kfree(apoll);
if (!READ_ONCE(apoll->poll.canceled)) __io_req_task_submit(req); else __io_req_task_cancel(req, -ECANCELED); + + percpu_ref_put(&ctx->refs); + kfree(apoll->double_poll); + kfree(apoll); }
static int io_async_wake(struct wait_queue_entry *wait, unsigned mode, int sync,
From: Jens Axboe axboe@kernel.dk
mainline inclusion from mainline-5.9-rc1 commit 5acbbc8ed3a9aef71c6eb5f19ba24f7321200220 category: feature bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27 CVE: NA ---------------------------
It's safe to call kfree() with a NULL pointer, but it's also pointless. Most of the time we don't have any data to free, and at millions of requests per second, the redundant function call adds noticeable overhead (about 1.3% of the runtime).
Signed-off-by: Jens Axboe axboe@kernel.dk Signed-off-by: yangerkun yangerkun@huawei.com Reviewed-by: zhangyi (F) yi.zhang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- fs/io_uring.c | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-)
diff --git a/fs/io_uring.c b/fs/io_uring.c index 94e5795fc41c..ed547051b61c 100644 --- a/fs/io_uring.c +++ b/fs/io_uring.c @@ -1585,7 +1585,8 @@ static void io_dismantle_req(struct io_kiocb *req) if (req->flags & REQ_F_NEED_CLEANUP) io_cleanup_req(req);
- kfree(req->io); + if (req->io) + kfree(req->io); if (req->file) io_put_file(req, req->file, (req->flags & REQ_F_FIXED_FILE)); __io_put_req_task(req);
From: Jens Axboe axboe@kernel.dk
mainline inclusion from mainline-5.9-rc1 commit 2bc9930e78fe0cb3e7b7e3169de0a40baee38d29 category: feature bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27 CVE: NA ---------------------------
We have just one caller of this, req_need_defer(), so inline the code there instead.
Signed-off-by: Jens Axboe axboe@kernel.dk Signed-off-by: yangerkun yangerkun@huawei.com Reviewed-by: zhangyi (F) yi.zhang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- fs/io_uring.c | 16 ++++++---------- 1 file changed, 6 insertions(+), 10 deletions(-)
diff --git a/fs/io_uring.c b/fs/io_uring.c index ed547051b61c..ba73c1a0c650 100644 --- a/fs/io_uring.c +++ b/fs/io_uring.c @@ -1127,18 +1127,14 @@ static struct io_ring_ctx *io_ring_ctx_alloc(struct io_uring_params *p) return NULL; }
-static inline bool __req_need_defer(struct io_kiocb *req) -{ - struct io_ring_ctx *ctx = req->ctx; - - return req->sequence != ctx->cached_cq_tail - + atomic_read(&ctx->cached_cq_overflow); -} - static inline bool req_need_defer(struct io_kiocb *req) { - if (unlikely(req->flags & REQ_F_IO_DRAIN)) - return __req_need_defer(req); + if (unlikely(req->flags & REQ_F_IO_DRAIN)) { + struct io_ring_ctx *ctx = req->ctx; + + return req->sequence != ctx->cached_cq_tail + + atomic_read(&ctx->cached_cq_overflow); + }
return false; }
From: Pavel Begunkov asml.silence@gmail.com
mainline inclusion from mainline-5.9-rc1 commit 3ca405ebfc1c3445b049dd25ca3338cbc99837d1 category: feature bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27 CVE: NA ---------------------------
Calling io_req_complete(req) means that the request is done, and there is nothing left but to clean it up. That also means that per-op data should not be used after that point, so we're free to reuse it in the completion path, e.g. to store the overflow list as done in this patch.
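As a hedged, self-contained sketch of that idea (hypothetical sk_* types, not the kernel structs): once per-op cleanup has run, the completion path may legally overlay its own bookkeeping on the same union storage.

#include <stddef.h>

struct sk_list_head { struct sk_list_head *next, *prev; };

struct sk_rw_op {                       /* valid while the op is running */
        void *buf;
        size_t len;
};

struct sk_completion {                  /* valid only after per-op cleanup */
        struct sk_list_head list;
        int cflags;
};

struct sk_request {
        union {
                struct sk_rw_op rw;
                struct sk_completion compl;
        };
};

The rule the patch's comment encodes: writing req->compl before the per-op cleanup would corrupt in-flight op state, hence io_clean_op() must run first.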
Signed-off-by: Pavel Begunkov asml.silence@gmail.com Signed-off-by: Jens Axboe axboe@kernel.dk Signed-off-by: yangerkun yangerkun@huawei.com Reviewed-by: zhangyi (F) yi.zhang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- fs/io_uring.c | 27 ++++++++++++++++++++------- 1 file changed, 20 insertions(+), 7 deletions(-)
diff --git a/fs/io_uring.c b/fs/io_uring.c index ba73c1a0c650..28fedc96b17d 100644 --- a/fs/io_uring.c +++ b/fs/io_uring.c @@ -499,6 +499,11 @@ struct io_statx { struct statx __user *buffer; };
+struct io_completion { + struct file *file; + struct list_head list; +}; + struct io_async_connect { struct sockaddr_storage address; }; @@ -633,6 +638,8 @@ struct io_kiocb { struct io_splice splice; struct io_provide_buf pbuf; struct io_statx statx; + /* use only after cleaning per-op data, see io_clean_op() */ + struct io_completion compl; };
struct io_async_ctx *io; @@ -903,7 +910,7 @@ static int __io_sqe_files_update(struct io_ring_ctx *ctx, static int io_grab_files(struct io_kiocb *req); static void io_complete_rw_common(struct kiocb *kiocb, long res, struct io_comp_state *cs); -static void io_cleanup_req(struct io_kiocb *req); +static void __io_clean_op(struct io_kiocb *req); static int io_file_get(struct io_submit_state *state, struct io_kiocb *req, int fd, struct file **out_file, bool fixed); static void __io_queue_sqe(struct io_kiocb *req, @@ -935,6 +942,12 @@ static void io_get_req_task(struct io_kiocb *req) req->flags |= REQ_F_TASK_PINNED; }
+static inline void io_clean_op(struct io_kiocb *req) +{ + if (req->flags & REQ_F_NEED_CLEANUP) + __io_clean_op(req); +} + /* not idempotent -- it doesn't clear REQ_F_TASK_PINNED */ static void __io_put_req_task(struct io_kiocb *req) { @@ -1472,8 +1485,8 @@ static void io_submit_flush_completions(struct io_comp_state *cs) while (!list_empty(&cs->list)) { struct io_kiocb *req;
- req = list_first_entry(&cs->list, struct io_kiocb, list); - list_del(&req->list); + req = list_first_entry(&cs->list, struct io_kiocb, compl.list); + list_del(&req->compl.list); __io_cqring_fill_event(req, req->result, req->cflags); if (!(req->flags & REQ_F_LINK_HEAD)) { req->flags |= REQ_F_COMP_LOCKED; @@ -1498,9 +1511,10 @@ static void __io_req_complete(struct io_kiocb *req, long res, unsigned cflags, io_cqring_add_event(req, res, cflags); io_put_req(req); } else { + io_clean_op(req); req->result = res; req->cflags = cflags; - list_add_tail(&req->list, &cs->list); + list_add_tail(&req->compl.list, &cs->list); if (++cs->nr >= 32) io_submit_flush_completions(cs); } @@ -1578,8 +1592,7 @@ static inline void io_put_file(struct io_kiocb *req, struct file *file,
static void io_dismantle_req(struct io_kiocb *req) { - if (req->flags & REQ_F_NEED_CLEANUP) - io_cleanup_req(req); + io_clean_op(req);
if (req->io) kfree(req->io); @@ -5301,7 +5314,7 @@ static int io_req_defer(struct io_kiocb *req, const struct io_uring_sqe *sqe) return -EIOCBQUEUED; }
-static void io_cleanup_req(struct io_kiocb *req) +static void __io_clean_op(struct io_kiocb *req) { struct io_async_ctx *io = req->io;
From: Pavel Begunkov asml.silence@gmail.com
mainline inclusion from mainline-5.9-rc1 commit 540e32a0855e700affa29b1112bf2dbb1fa7702a category: feature bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27 CVE: NA ---------------------------
io_uring supports both polling and I/O polling. Rename ctx->poll_list to clearly show that it's used only in the I/O poll case.
Signed-off-by: Pavel Begunkov asml.silence@gmail.com Signed-off-by: Jens Axboe axboe@kernel.dk Signed-off-by: yangerkun yangerkun@huawei.com Reviewed-by: zhangyi (F) yi.zhang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- fs/io_uring.c | 36 ++++++++++++++++++------------------ 1 file changed, 18 insertions(+), 18 deletions(-)
diff --git a/fs/io_uring.c b/fs/io_uring.c index 28fedc96b17d..b38da6025c97 100644 --- a/fs/io_uring.c +++ b/fs/io_uring.c @@ -329,12 +329,12 @@ struct io_ring_ctx { spinlock_t completion_lock;
/* - * ->poll_list is protected by the ctx->uring_lock for + * ->iopoll_list is protected by the ctx->uring_lock for * io_uring instances that don't use IORING_SETUP_SQPOLL. * For SQPOLL, only the single threaded io_sq_thread() will * manipulate the list, hence no extra locking is needed there. */ - struct list_head poll_list; + struct list_head iopoll_list; struct hlist_head *cancel_hash; unsigned cancel_hash_bits; bool poll_multi_file; @@ -1123,7 +1123,7 @@ static struct io_ring_ctx *io_ring_ctx_alloc(struct io_uring_params *p) mutex_init(&ctx->uring_lock); init_waitqueue_head(&ctx->wait); spin_lock_init(&ctx->completion_lock); - INIT_LIST_HEAD(&ctx->poll_list); + INIT_LIST_HEAD(&ctx->iopoll_list); INIT_LIST_HEAD(&ctx->defer_list); INIT_LIST_HEAD(&ctx->timeout_list); init_waitqueue_head(&ctx->inflight_wait); @@ -2085,7 +2085,7 @@ static int io_do_iopoll(struct io_ring_ctx *ctx, unsigned int *nr_events, spin = !ctx->poll_multi_file && *nr_events < min;
ret = 0; - list_for_each_entry_safe(req, tmp, &ctx->poll_list, list) { + list_for_each_entry_safe(req, tmp, &ctx->iopoll_list, list) { struct kiocb *kiocb = &req->rw.kiocb;
/* @@ -2127,7 +2127,7 @@ static int io_do_iopoll(struct io_ring_ctx *ctx, unsigned int *nr_events, static int io_iopoll_getevents(struct io_ring_ctx *ctx, unsigned int *nr_events, long min) { - while (!list_empty(&ctx->poll_list) && !need_resched()) { + while (!list_empty(&ctx->iopoll_list) && !need_resched()) { int ret;
ret = io_do_iopoll(ctx, nr_events, min); @@ -2150,7 +2150,7 @@ static void io_iopoll_try_reap_events(struct io_ring_ctx *ctx) return;
mutex_lock(&ctx->uring_lock); - while (!list_empty(&ctx->poll_list)) { + while (!list_empty(&ctx->iopoll_list)) { unsigned int nr_events = 0;
io_do_iopoll(ctx, &nr_events, 0); @@ -2292,12 +2292,12 @@ static void io_iopoll_req_issued(struct io_kiocb *req) * how we do polling eventually, not spinning if we're on potentially * different devices. */ - if (list_empty(&ctx->poll_list)) { + if (list_empty(&ctx->iopoll_list)) { ctx->poll_multi_file = false; } else if (!ctx->poll_multi_file) { struct io_kiocb *list_req;
- list_req = list_first_entry(&ctx->poll_list, struct io_kiocb, + list_req = list_first_entry(&ctx->iopoll_list, struct io_kiocb, list); if (list_req->file != req->file) ctx->poll_multi_file = true; @@ -2308,9 +2308,9 @@ static void io_iopoll_req_issued(struct io_kiocb *req) * it to the front so we find it first. */ if (READ_ONCE(req->iopoll_completed)) - list_add(&req->list, &ctx->poll_list); + list_add(&req->list, &ctx->iopoll_list); else - list_add_tail(&req->list, &ctx->poll_list); + list_add_tail(&req->list, &ctx->iopoll_list);
if ((ctx->flags & IORING_SETUP_SQPOLL) && wq_has_sleeper(&ctx->sqo_wait)) @@ -6241,11 +6241,11 @@ static int io_sq_thread(void *data) while (!kthread_should_park()) { unsigned int to_submit;
- if (!list_empty(&ctx->poll_list)) { + if (!list_empty(&ctx->iopoll_list)) { unsigned nr_events = 0;
mutex_lock(&ctx->uring_lock); - if (!list_empty(&ctx->poll_list) && !need_resched()) + if (!list_empty(&ctx->iopoll_list) && !need_resched()) io_do_iopoll(ctx, &nr_events, 0); else timeout = jiffies + ctx->sq_thread_idle; @@ -6274,7 +6274,7 @@ static int io_sq_thread(void *data) * more IO, we should wait for the application to * reap events and wake us up. */ - if (!list_empty(&ctx->poll_list) || need_resched() || + if (!list_empty(&ctx->iopoll_list) || need_resched() || (!time_after(jiffies, timeout) && ret != -EBUSY && !percpu_ref_is_dying(&ctx->refs))) { io_run_task_work(); @@ -6287,13 +6287,13 @@ static int io_sq_thread(void *data)
/* * While doing polled IO, before going to sleep, we need - * to check if there are new reqs added to poll_list, it - * is because reqs may have been punted to io worker and - * will be added to poll_list later, hence check the - * poll_list again. + * to check if there are new reqs added to iopoll_list, + * it is because reqs may have been punted to io worker + * and will be added to iopoll_list later, hence check + * the iopoll_list again. */ if ((ctx->flags & IORING_SETUP_IOPOLL) && - !list_empty_careful(&ctx->poll_list)) { + !list_empty_careful(&ctx->iopoll_list)) { finish_wait(&ctx->sqo_wait, &wait); continue; }
From: Pavel Begunkov asml.silence@gmail.com
mainline inclusion from mainline-5.9-rc1 commit d21ffe7eca82d47b489760899912f81e30456e2e category: feature bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27 CVE: NA ---------------------------
req->inflight_entry is used to track requests that grabbed files_struct. Let's share it with the iopoll list, because the only iopoll'ed ops are reads and writes, which don't need a file table.
Signed-off-by: Pavel Begunkov asml.silence@gmail.com Signed-off-by: Jens Axboe axboe@kernel.dk
Conflicts: fs/io_uring.c [b63534c41e20 ("io_uring: re-issue block requests that failed because of resources") not included, 56450c20fe10 ("io_uring: clear req->result on IOPOLL re-issue") included first]
Signed-off-by: yangerkun yangerkun@huawei.com Reviewed-by: zhangyi (F) yi.zhang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- fs/io_uring.c | 26 +++++++++++++++----------- 1 file changed, 15 insertions(+), 11 deletions(-)
diff --git a/fs/io_uring.c b/fs/io_uring.c index b38da6025c97..3c0633fe4c8a 100644 --- a/fs/io_uring.c +++ b/fs/io_uring.c @@ -662,6 +662,10 @@ struct io_kiocb {
struct list_head link_list;
+ /* + * 1. used with ctx->iopoll_list with reads/writes + * 2. to track reqs with ->files (see io_op_def::file_table) + */ struct list_head inflight_entry;
struct percpu_ref *fixed_file_refs; @@ -2011,8 +2015,8 @@ static void io_iopoll_queue(struct list_head *again) struct io_kiocb *req;
do { - req = list_first_entry(again, struct io_kiocb, list); - list_del(&req->list); + req = list_first_entry(again, struct io_kiocb, inflight_entry); + list_del(&req->inflight_entry);
/* shouldn't happen unless io_uring is dying, cancel reqs */ if (unlikely(!current->mm)) { @@ -2042,14 +2046,14 @@ static void io_iopoll_complete(struct io_ring_ctx *ctx, unsigned int *nr_events, while (!list_empty(done)) { int cflags = 0;
- req = list_first_entry(done, struct io_kiocb, list); + req = list_first_entry(done, struct io_kiocb, inflight_entry); if (READ_ONCE(req->result) == -EAGAIN) { req->result = 0; req->iopoll_completed = 0; - list_move_tail(&req->list, &again); + list_move_tail(&req->inflight_entry, &again); continue; } - list_del(&req->list); + list_del(&req->inflight_entry);
if (req->flags & REQ_F_BUFFER_SELECTED) cflags = io_put_kbuf(req); @@ -2085,7 +2089,7 @@ static int io_do_iopoll(struct io_ring_ctx *ctx, unsigned int *nr_events, spin = !ctx->poll_multi_file && *nr_events < min;
ret = 0; - list_for_each_entry_safe(req, tmp, &ctx->iopoll_list, list) { + list_for_each_entry_safe(req, tmp, &ctx->iopoll_list, inflight_entry) { struct kiocb *kiocb = &req->rw.kiocb;
/* @@ -2094,7 +2098,7 @@ static int io_do_iopoll(struct io_ring_ctx *ctx, unsigned int *nr_events, * and complete those lists first, if we have entries there. */ if (READ_ONCE(req->iopoll_completed)) { - list_move_tail(&req->list, &done); + list_move_tail(&req->inflight_entry, &done); continue; } if (!list_empty(&done)) @@ -2106,7 +2110,7 @@ static int io_do_iopoll(struct io_ring_ctx *ctx, unsigned int *nr_events,
/* iopoll may have completed current req */ if (READ_ONCE(req->iopoll_completed)) - list_move_tail(&req->list, &done); + list_move_tail(&req->inflight_entry, &done);
if (ret && spin) spin = false; @@ -2298,7 +2302,7 @@ static void io_iopoll_req_issued(struct io_kiocb *req) struct io_kiocb *list_req;
list_req = list_first_entry(&ctx->iopoll_list, struct io_kiocb, - list); + inflight_entry); if (list_req->file != req->file) ctx->poll_multi_file = true; } @@ -2308,9 +2312,9 @@ static void io_iopoll_req_issued(struct io_kiocb *req) * it to the front so we find it first. */ if (READ_ONCE(req->iopoll_completed)) - list_add(&req->list, &ctx->iopoll_list); + list_add(&req->inflight_entry, &ctx->iopoll_list); else - list_add_tail(&req->list, &ctx->iopoll_list); + list_add_tail(&req->inflight_entry, &ctx->iopoll_list);
if ((ctx->flags & IORING_SETUP_SQPOLL) && wq_has_sleeper(&ctx->sqo_wait))
From: Pavel Begunkov asml.silence@gmail.com
mainline inclusion from mainline-5.9-rc1 commit 40d8ddd4facb80760d5a0c61a7cf026d5ff73ff0 category: feature bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27 CVE: NA ---------------------------
As with the completion path, also use compl.list for overflowed requests. If cleaned up properly, nobody needs per-op data there anymore.
Signed-off-by: Pavel Begunkov asml.silence@gmail.com Signed-off-by: Jens Axboe axboe@kernel.dk Signed-off-by: yangerkun yangerkun@huawei.com Reviewed-by: zhangyi (F) yi.zhang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- fs/io_uring.c | 15 ++++++++------- 1 file changed, 8 insertions(+), 7 deletions(-)
diff --git a/fs/io_uring.c b/fs/io_uring.c index 3c0633fe4c8a..6ce523412878 100644 --- a/fs/io_uring.c +++ b/fs/io_uring.c @@ -1398,8 +1398,8 @@ static bool io_cqring_overflow_flush(struct io_ring_ctx *ctx, bool force) break;
req = list_first_entry(&ctx->cq_overflow_list, struct io_kiocb, - list); - list_move(&req->list, &list); + compl.list); + list_move(&req->compl.list, &list); req->flags &= ~REQ_F_OVERFLOW; if (cqe) { WRITE_ONCE(cqe->user_data, req->user_data); @@ -1421,8 +1421,8 @@ static bool io_cqring_overflow_flush(struct io_ring_ctx *ctx, bool force) io_cqring_ev_posted(ctx);
while (!list_empty(&list)) { - req = list_first_entry(&list, struct io_kiocb, list); - list_del(&req->list); + req = list_first_entry(&list, struct io_kiocb, compl.list); + list_del(&req->compl.list); io_put_req(req); }
@@ -1455,11 +1455,12 @@ static void __io_cqring_fill_event(struct io_kiocb *req, long res, long cflags) set_bit(0, &ctx->cq_check_overflow); ctx->rings->sq_flags |= IORING_SQ_CQ_OVERFLOW; } + io_clean_op(req); req->flags |= REQ_F_OVERFLOW; - refcount_inc(&req->refs); req->result = res; req->cflags = cflags; - list_add_tail(&req->list, &ctx->cq_overflow_list); + refcount_inc(&req->refs); + list_add_tail(&req->compl.list, &ctx->cq_overflow_list); } }
@@ -7734,7 +7735,7 @@ static void io_uring_cancel_files(struct io_ring_ctx *ctx,
if (cancel_req->flags & REQ_F_OVERFLOW) { spin_lock_irq(&ctx->completion_lock); - list_del(&cancel_req->list); + list_del(&cancel_req->compl.list); cancel_req->flags &= ~REQ_F_OVERFLOW; if (list_empty(&ctx->cq_overflow_list)) { clear_bit(0, &ctx->sq_check_overflow);
From: Pavel Begunkov asml.silence@gmail.com
mainline inclusion from mainline-5.9-rc1 commit 135fcde8496b03d31648171dbc038990112e41d5 category: feature bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27 CVE: NA ---------------------------
Instead of using the shared req->list, hang timeouts on their own list entry. struct io_timeout has enough extra space for it, but if that ever becomes a problem, ->inflight_entry can be reused for it.
Signed-off-by: Pavel Begunkov asml.silence@gmail.com Signed-off-by: Jens Axboe axboe@kernel.dk Signed-off-by: yangerkun yangerkun@huawei.com Reviewed-by: zhangyi (F) yi.zhang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- fs/io_uring.c | 22 ++++++++++++---------- 1 file changed, 12 insertions(+), 10 deletions(-)
diff --git a/fs/io_uring.c b/fs/io_uring.c index 6ce523412878..9a50a0de2395 100644 --- a/fs/io_uring.c +++ b/fs/io_uring.c @@ -405,6 +405,7 @@ struct io_timeout { int flags; u32 off; u32 target_seq; + struct list_head list; };
struct io_rw { @@ -1272,7 +1273,7 @@ static void io_kill_timeout(struct io_kiocb *req) ret = hrtimer_try_to_cancel(&req->io->timeout.timer); if (ret != -1) { atomic_inc(&req->ctx->cq_timeouts); - list_del_init(&req->list); + list_del_init(&req->timeout.list); req->flags |= REQ_F_COMP_LOCKED; io_cqring_fill_event(req, 0); io_put_req(req); @@ -1284,7 +1285,7 @@ static void io_kill_timeouts(struct io_ring_ctx *ctx) struct io_kiocb *req, *tmp;
spin_lock_irq(&ctx->completion_lock); - list_for_each_entry_safe(req, tmp, &ctx->timeout_list, list) + list_for_each_entry_safe(req, tmp, &ctx->timeout_list, timeout.list) io_kill_timeout(req); spin_unlock_irq(&ctx->completion_lock); } @@ -1307,7 +1308,7 @@ static void io_flush_timeouts(struct io_ring_ctx *ctx) { while (!list_empty(&ctx->timeout_list)) { struct io_kiocb *req = list_first_entry(&ctx->timeout_list, - struct io_kiocb, list); + struct io_kiocb, timeout.list);
if (io_is_timeout_noseq(req)) break; @@ -1315,7 +1316,7 @@ static void io_flush_timeouts(struct io_ring_ctx *ctx) - atomic_read(&ctx->cq_timeouts)) break;
- list_del_init(&req->list); + list_del_init(&req->timeout.list); io_kill_timeout(req); } } @@ -4898,8 +4899,8 @@ static enum hrtimer_restart io_timeout_fn(struct hrtimer *timer) * We could be racing with timeout deletion. If the list is empty, * then timeout lookup already found it and will be handling it. */ - if (!list_empty(&req->list)) - list_del_init(&req->list); + if (!list_empty(&req->timeout.list)) + list_del_init(&req->timeout.list);
io_cqring_fill_event(req, -ETIME); io_commit_cqring(ctx); @@ -4916,9 +4917,9 @@ static int io_timeout_cancel(struct io_ring_ctx *ctx, __u64 user_data) struct io_kiocb *req; int ret = -ENOENT;
- list_for_each_entry(req, &ctx->timeout_list, list) { + list_for_each_entry(req, &ctx->timeout_list, timeout.list) { if (user_data == req->user_data) { - list_del_init(&req->list); + list_del_init(&req->timeout.list); ret = 0; break; } @@ -5041,7 +5042,8 @@ static int io_timeout(struct io_kiocb *req) * the one we need first. */ list_for_each_prev(entry, &ctx->timeout_list) { - struct io_kiocb *nxt = list_entry(entry, struct io_kiocb, list); + struct io_kiocb *nxt = list_entry(entry, struct io_kiocb, + timeout.list);
if (io_is_timeout_noseq(nxt)) continue; @@ -5050,7 +5052,7 @@ static int io_timeout(struct io_kiocb *req) break; } add: - list_add(&req->list, entry); + list_add(&req->timeout.list, entry); data->timer.function = io_timeout_fn; hrtimer_start(&data->timer, timespec64_to_ktime(data->ts), data->mode); spin_unlock_irq(&ctx->completion_lock);
From: Pavel Begunkov asml.silence@gmail.com
mainline inclusion from mainline-5.9-rc1 commit 7d6ddea6beaf6639cf3a2b291dcdac6fe1edc584 category: feature bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27 CVE: NA ---------------------------
poll*() doesn't use req->list, so don't initialise it.
Signed-off-by: Pavel Begunkov asml.silence@gmail.com Signed-off-by: Jens Axboe axboe@kernel.dk Signed-off-by: yangerkun yangerkun@huawei.com Reviewed-by: zhangyi (F) yi.zhang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- fs/io_uring.c | 1 - 1 file changed, 1 deletion(-)
diff --git a/fs/io_uring.c b/fs/io_uring.c index 9a50a0de2395..119b7ab91718 100644 --- a/fs/io_uring.c +++ b/fs/io_uring.c @@ -4865,7 +4865,6 @@ static int io_poll_add(struct io_kiocb *req) req->flags &= ~REQ_F_WORK_INITIALIZED;
INIT_HLIST_NODE(&req->hash_node); - INIT_LIST_HEAD(&req->list); ipt.pt._qproc = io_poll_queue_proc;
mask = __io_arm_poll_handler(req, &req->poll, &ipt, poll->events,
From: Pavel Begunkov asml.silence@gmail.com
mainline inclusion from mainline-5.9-rc1 commit 27dc8338e5fb0e0ed5b272e792f4ffad7f3bc03e category: feature bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27 CVE: NA ---------------------------
The only remaining user of req->list is DRAIN, so instead of keeping a separate per-request list for it, use old-fashioned non-intrusive list nodes allocated on demand. That's a really slow path, so that's OK.
This removes req->list and so sheds 16 bytes from io_kiocb.
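For contrast, a generic sketch of the two list styles (plain C with hypothetical sk_* names): an intrusive node is paid for by every request whether or not it ever drains, while the non-intrusive wrapper is allocated only on the rare DRAIN path.

#include <stdlib.h>

struct sk_list_head { struct sk_list_head *next, *prev; };

/* Intrusive: every object carries the node (16 bytes on 64-bit). */
struct sk_req_intrusive {
        int data;
        struct sk_list_head list;
};

/* Non-intrusive: only deferred objects pay, via an on-demand wrapper. */
struct sk_req { int data; };

struct sk_defer_entry {
        struct sk_list_head list;
        struct sk_req *req;
};

static struct sk_defer_entry *sk_defer(struct sk_req *req)
{
        struct sk_defer_entry *de = malloc(sizeof(*de));

        if (de)
                de->req = req;          /* caller links de->list in under its lock */
        return de;
}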
Signed-off-by: Pavel Begunkov asml.silence@gmail.com Signed-off-by: Jens Axboe axboe@kernel.dk
Conflicts: fs/io_uring.c [ac8691c415e0 ("io_uring: always plug for any number of IOs") not included]
Signed-off-by: yangerkun yangerkun@huawei.com Reviewed-by: zhangyi (F) yi.zhang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- fs/io_uring.c | 25 ++++++++++++++++++------- 1 file changed, 18 insertions(+), 7 deletions(-)
diff --git a/fs/io_uring.c b/fs/io_uring.c index 119b7ab91718..4fa5633c8661 100644 --- a/fs/io_uring.c +++ b/fs/io_uring.c @@ -652,7 +652,6 @@ struct io_kiocb { u16 buf_index;
struct io_ring_ctx *ctx; - struct list_head list; unsigned int flags; refcount_t refs; struct task_struct *task; @@ -687,6 +686,11 @@ struct io_kiocb { struct callback_head task_work; };
+struct io_defer_entry { + struct list_head list; + struct io_kiocb *req; +}; + #define IO_PLUG_THRESHOLD 2 #define IO_IOPOLL_BATCH 8
@@ -1293,14 +1297,15 @@ static void io_kill_timeouts(struct io_ring_ctx *ctx) static void __io_queue_deferred(struct io_ring_ctx *ctx) { do { - struct io_kiocb *req = list_first_entry(&ctx->defer_list, - struct io_kiocb, list); + struct io_defer_entry *de = list_first_entry(&ctx->defer_list, + struct io_defer_entry, list);
- if (req_need_defer(req)) + if (req_need_defer(de->req)) break; - list_del_init(&req->list); + list_del_init(&de->list); /* punt-init is done before queueing for defer */ - __io_queue_async_work(req); + __io_queue_async_work(de->req); + kfree(de); } while (!list_empty(&ctx->defer_list)); }
@@ -5293,6 +5298,7 @@ static int io_req_defer_prep(struct io_kiocb *req, static int io_req_defer(struct io_kiocb *req, const struct io_uring_sqe *sqe) { struct io_ring_ctx *ctx = req->ctx; + struct io_defer_entry *de; int ret;
/* Still need defer if there is pending req in defer list. */ @@ -5307,15 +5313,20 @@ static int io_req_defer(struct io_kiocb *req, const struct io_uring_sqe *sqe) return ret; } io_prep_async_link(req); + de = kmalloc(sizeof(*de), GFP_KERNEL); + if (!de) + return -ENOMEM;
spin_lock_irq(&ctx->completion_lock); if (!req_need_defer(req) && list_empty(&ctx->defer_list)) { spin_unlock_irq(&ctx->completion_lock); + kfree(de); return 0; }
trace_io_uring_defer(ctx, req, req->user_data); - list_add_tail(&req->list, &ctx->defer_list); + de->req = req; + list_add_tail(&de->list, &ctx->defer_list); spin_unlock_irq(&ctx->completion_lock); return -EIOCBQUEUED; }
From: Pavel Begunkov asml.silence@gmail.com
mainline inclusion from mainline-5.9-rc1 commit 9cf7c104deaef52d6fd7c103a716e31d9815ede8 category: feature bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27 CVE: NA ---------------------------
req->sequence is used only for deferred (i.e. DRAIN) requests, but initialised for every request. Remove req->sequence from io_kiocb together with its initialisation in io_init_req().
Replace it with a new field in struct io_defer_entry, that will be calculated only when needed in io_req_defer(), which is a slow path.
Signed-off-by: Pavel Begunkov asml.silence@gmail.com Signed-off-by: Jens Axboe axboe@kernel.dk
Conflicts: fs/io_uring.c [ac8691c415e0 ("io_uring: always plug for any number of IOs") not included]
Signed-off-by: yangerkun yangerkun@huawei.com Reviewed-by: zhangyi (F) yi.zhang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- fs/io_uring.c | 44 ++++++++++++++++++++++++++++++-------------- 1 file changed, 30 insertions(+), 14 deletions(-)
diff --git a/fs/io_uring.c b/fs/io_uring.c index 4fa5633c8661..95b11f0fc1f5 100644 --- a/fs/io_uring.c +++ b/fs/io_uring.c @@ -650,6 +650,7 @@ struct io_kiocb { u8 iopoll_completed;
u16 buf_index; + u32 result;
struct io_ring_ctx *ctx; unsigned int flags; @@ -657,8 +658,6 @@ struct io_kiocb { struct task_struct *task; unsigned long fsize; u64 user_data; - u32 result; - u32 sequence;
struct list_head link_list;
@@ -689,6 +688,7 @@ struct io_kiocb { struct io_defer_entry { struct list_head list; struct io_kiocb *req; + u32 seq; };
#define IO_PLUG_THRESHOLD 2 @@ -1149,13 +1149,13 @@ static struct io_ring_ctx *io_ring_ctx_alloc(struct io_uring_params *p) return NULL; }
-static inline bool req_need_defer(struct io_kiocb *req) +static bool req_need_defer(struct io_kiocb *req, u32 seq) { if (unlikely(req->flags & REQ_F_IO_DRAIN)) { struct io_ring_ctx *ctx = req->ctx;
- return req->sequence != ctx->cached_cq_tail - + atomic_read(&ctx->cached_cq_overflow); + return seq != ctx->cached_cq_tail + + atomic_read(&ctx->cached_cq_overflow); }
return false; @@ -1300,7 +1300,7 @@ static void __io_queue_deferred(struct io_ring_ctx *ctx) struct io_defer_entry *de = list_first_entry(&ctx->defer_list, struct io_defer_entry, list);
- if (req_need_defer(de->req)) + if (req_need_defer(de->req, de->seq)) break; list_del_init(&de->list); /* punt-init is done before queueing for defer */ @@ -5295,14 +5295,35 @@ static int io_req_defer_prep(struct io_kiocb *req, return ret; }
+static u32 io_get_sequence(struct io_kiocb *req) +{ + struct io_kiocb *pos; + struct io_ring_ctx *ctx = req->ctx; + u32 total_submitted, nr_reqs = 1; + + if (req->flags & REQ_F_LINK_HEAD) + list_for_each_entry(pos, &req->link_list, link_list) + nr_reqs++; + + total_submitted = ctx->cached_sq_head - ctx->cached_sq_dropped; + return total_submitted - nr_reqs; +} + static int io_req_defer(struct io_kiocb *req, const struct io_uring_sqe *sqe) { struct io_ring_ctx *ctx = req->ctx; struct io_defer_entry *de; int ret; + u32 seq;
/* Still need defer if there is pending req in defer list. */ - if (!req_need_defer(req) && list_empty_careful(&ctx->defer_list)) + if (likely(list_empty_careful(&ctx->defer_list) && + !(req->flags & REQ_F_IO_DRAIN))) + return 0; + + seq = io_get_sequence(req); + /* Still a chance to pass the sequence check */ + if (!req_need_defer(req, seq) && list_empty_careful(&ctx->defer_list)) return 0;
if (!req->io) { @@ -5318,7 +5339,7 @@ static int io_req_defer(struct io_kiocb *req, const struct io_uring_sqe *sqe) return -ENOMEM;
spin_lock_irq(&ctx->completion_lock); - if (!req_need_defer(req) && list_empty(&ctx->defer_list)) { + if (!req_need_defer(req, seq) && list_empty(&ctx->defer_list)) { spin_unlock_irq(&ctx->completion_lock); kfree(de); return 0; @@ -5326,6 +5347,7 @@ static int io_req_defer(struct io_kiocb *req, const struct io_uring_sqe *sqe)
trace_io_uring_defer(ctx, req, req->user_data); de->req = req; + de->seq = seq; list_add_tail(&de->list, &ctx->defer_list); spin_unlock_irq(&ctx->completion_lock); return -EIOCBQUEUED; @@ -6087,12 +6109,6 @@ static int io_init_req(struct io_ring_ctx *ctx, struct io_kiocb *req, unsigned int sqe_flags; int id;
- /* - * All io need record the previous position, if LINK vs DARIN, - * it can be used to mark the position of the first IO in the - * link list. - */ - req->sequence = ctx->cached_sq_head - ctx->cached_sq_dropped; req->opcode = READ_ONCE(sqe->opcode); req->user_data = READ_ONCE(sqe->user_data); req->io = NULL;
From: Pavel Begunkov asml.silence@gmail.com
mainline inclusion from mainline-5.9-rc1 commit 0f7e466b393abab86be96ffcf00af383afddc0d1 category: feature bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27 CVE: NA ---------------------------
req->cflags is used only in the defer-completion path, so just use completion data to store it. With the 4 bytes from the ->sequence patch and compacting io_kiocb, this frees 8 bytes.
Signed-off-by: Pavel Begunkov asml.silence@gmail.com Signed-off-by: Jens Axboe axboe@kernel.dk Signed-off-by: yangerkun yangerkun@huawei.com Reviewed-by: zhangyi (F) yi.zhang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- fs/io_uring.c | 10 +++++----- 1 file changed, 5 insertions(+), 5 deletions(-)
diff --git a/fs/io_uring.c b/fs/io_uring.c index 95b11f0fc1f5..069907a467be 100644 --- a/fs/io_uring.c +++ b/fs/io_uring.c @@ -503,6 +503,7 @@ struct io_statx { struct io_completion { struct file *file; struct list_head list; + int cflags; };
struct io_async_connect { @@ -644,7 +645,6 @@ struct io_kiocb { };
struct io_async_ctx *io; - int cflags; u8 opcode; /* polled IO has completed */ u8 iopoll_completed; @@ -1410,7 +1410,7 @@ static bool io_cqring_overflow_flush(struct io_ring_ctx *ctx, bool force) if (cqe) { WRITE_ONCE(cqe->user_data, req->user_data); WRITE_ONCE(cqe->res, req->result); - WRITE_ONCE(cqe->flags, req->cflags); + WRITE_ONCE(cqe->flags, req->compl.cflags); } else { WRITE_ONCE(ctx->rings->cq_overflow, atomic_inc_return(&ctx->cached_cq_overflow)); @@ -1464,7 +1464,7 @@ static void __io_cqring_fill_event(struct io_kiocb *req, long res, long cflags) io_clean_op(req); req->flags |= REQ_F_OVERFLOW; req->result = res; - req->cflags = cflags; + req->compl.cflags = cflags; refcount_inc(&req->refs); list_add_tail(&req->compl.list, &ctx->cq_overflow_list); } @@ -1498,7 +1498,7 @@ static void io_submit_flush_completions(struct io_comp_state *cs)
req = list_first_entry(&cs->list, struct io_kiocb, compl.list); list_del(&req->compl.list); - __io_cqring_fill_event(req, req->result, req->cflags); + __io_cqring_fill_event(req, req->result, req->compl.cflags); if (!(req->flags & REQ_F_LINK_HEAD)) { req->flags |= REQ_F_COMP_LOCKED; io_put_req(req); @@ -1524,7 +1524,7 @@ static void __io_req_complete(struct io_kiocb *req, long res, unsigned cflags, } else { io_clean_op(req); req->result = res; - req->cflags = cflags; + req->compl.cflags = cflags; list_add_tail(&req->compl.list, &cs->list); if (++cs->nr >= 32) io_submit_flush_completions(cs);
From: Jens Axboe axboe@kernel.dk
mainline inclusion from mainline-5.9-rc1 commit f254ac04c8744cf7bfed012717eac34eacc65dfb category: feature bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27 CVE: NA ---------------------------
When a process exits, we cancel whatever requests it has pending that are referencing the file table. However, if a link is holding a reference, then we cannot find it by simply looking at the inflight list.
Enable checking of the poll and timeout list to find the link, and cancel it appropriately.
Cc: stable@vger.kernel.org Reported-by: Josef josef.grieb@gmail.com Signed-off-by: Jens Axboe axboe@kernel.dk Signed-off-by: yangerkun yangerkun@huawei.com Reviewed-by: zhangyi (F) yi.zhang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- fs/io_uring.c | 97 +++++++++++++++++++++++++++++++++++++++++++++------ 1 file changed, 87 insertions(+), 10 deletions(-)
diff --git a/fs/io_uring.c b/fs/io_uring.c index 069907a467be..1ce7395c8939 100644 --- a/fs/io_uring.c +++ b/fs/io_uring.c @@ -4741,6 +4741,7 @@ static bool io_poll_remove_one(struct io_kiocb *req) io_cqring_fill_event(req, -ECANCELED); io_commit_cqring(req->ctx); req->flags |= REQ_F_COMP_LOCKED; + req_set_fail_links(req); io_put_req(req); }
@@ -4916,6 +4917,23 @@ static enum hrtimer_restart io_timeout_fn(struct hrtimer *timer) return HRTIMER_NORESTART; }
+static int __io_timeout_cancel(struct io_kiocb *req) +{ + int ret; + + list_del_init(&req->timeout.list); + + ret = hrtimer_try_to_cancel(&req->io->timeout.timer); + if (ret == -1) + return -EALREADY; + + req_set_fail_links(req); + req->flags |= REQ_F_COMP_LOCKED; + io_cqring_fill_event(req, -ECANCELED); + io_put_req(req); + return 0; +} + static int io_timeout_cancel(struct io_ring_ctx *ctx, __u64 user_data) { struct io_kiocb *req; @@ -4923,7 +4941,6 @@ static int io_timeout_cancel(struct io_ring_ctx *ctx, __u64 user_data)
list_for_each_entry(req, &ctx->timeout_list, timeout.list) { if (user_data == req->user_data) { - list_del_init(&req->timeout.list); ret = 0; break; } @@ -4932,15 +4949,7 @@ static int io_timeout_cancel(struct io_ring_ctx *ctx, __u64 user_data) if (ret == -ENOENT) return ret;
- ret = hrtimer_try_to_cancel(&req->io->timeout.timer); - if (ret == -1) - return -EALREADY; - - req_set_fail_links(req); - req->flags |= REQ_F_COMP_LOCKED; - io_cqring_fill_event(req, -ECANCELED); - io_put_req(req); - return 0; + return __io_timeout_cancel(req); }
static int io_timeout_remove_prep(struct io_kiocb *req, @@ -7729,6 +7738,71 @@ static bool io_wq_files_match(struct io_wq_work *work, void *data) return work->files == files; }
+/* + * Returns true if 'preq' is the link parent of 'req' + */ +static bool io_match_link(struct io_kiocb *preq, struct io_kiocb *req) +{ + struct io_kiocb *link; + + if (!(preq->flags & REQ_F_LINK_HEAD)) + return false; + + list_for_each_entry(link, &preq->link_list, link_list) { + if (link == req) + return true; + } + + return false; +} + +/* + * We're looking to cancel 'req' because it's holding on to our files, but + * 'req' could be a link to another request. See if it is, and cancel that + * parent request if so. + */ +static bool io_poll_remove_link(struct io_ring_ctx *ctx, struct io_kiocb *req) +{ + struct hlist_node *tmp; + struct io_kiocb *preq; + bool found = false; + int i; + + spin_lock_irq(&ctx->completion_lock); + for (i = 0; i < (1U << ctx->cancel_hash_bits); i++) { + struct hlist_head *list; + + list = &ctx->cancel_hash[i]; + hlist_for_each_entry_safe(preq, tmp, list, hash_node) { + found = io_match_link(preq, req); + if (found) { + io_poll_remove_one(preq); + break; + } + } + } + spin_unlock_irq(&ctx->completion_lock); + return found; +} + +static bool io_timeout_remove_link(struct io_ring_ctx *ctx, + struct io_kiocb *req) +{ + struct io_kiocb *preq; + bool found = false; + + spin_lock_irq(&ctx->completion_lock); + list_for_each_entry(preq, &ctx->timeout_list, timeout.list) { + found = io_match_link(preq, req); + if (found) { + __io_timeout_cancel(preq); + break; + } + } + spin_unlock_irq(&ctx->completion_lock); + return found; +} + static void io_uring_cancel_files(struct io_ring_ctx *ctx, struct files_struct *files) { @@ -7786,6 +7860,9 @@ static void io_uring_cancel_files(struct io_ring_ctx *ctx, } } else { io_wq_cancel_work(ctx->io_wq, &cancel_req->work); + /* could be a link, check and remove if it is */ + if (!io_poll_remove_link(ctx, cancel_req)) + io_timeout_remove_link(ctx, cancel_req); io_put_req(cancel_req); }
From: Jens Axboe axboe@kernel.dk
mainline inclusion from mainline-5.9-rc1 commit 7271ef3a93a832180068c7aade3f130b7f39b17e category: feature bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27 CVE: NA ---------------------------
syzbot reports a scenario where we recurse on the completion lock when flushing an overflow:
1 lock held by syz-executor287/6816: #0: ffff888093cdb4d8 (&ctx->completion_lock){....}-{2:2}, at: io_cqring_overflow_flush+0xc6/0xab0 fs/io_uring.c:1333
stack backtrace: CPU: 1 PID: 6816 Comm: syz-executor287 Not tainted 5.8.0-syzkaller #0 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011 Call Trace: __dump_stack lib/dump_stack.c:77 [inline] dump_stack+0x1f0/0x31e lib/dump_stack.c:118 print_deadlock_bug kernel/locking/lockdep.c:2391 [inline] check_deadlock kernel/locking/lockdep.c:2432 [inline] validate_chain+0x69a4/0x88a0 kernel/locking/lockdep.c:3202 __lock_acquire+0x1161/0x2ab0 kernel/locking/lockdep.c:4426 lock_acquire+0x160/0x730 kernel/locking/lockdep.c:5005 __raw_spin_lock_irq include/linux/spinlock_api_smp.h:128 [inline] _raw_spin_lock_irq+0x67/0x80 kernel/locking/spinlock.c:167 spin_lock_irq include/linux/spinlock.h:379 [inline] io_queue_linked_timeout fs/io_uring.c:5928 [inline] __io_queue_async_work fs/io_uring.c:1192 [inline] __io_queue_deferred+0x36a/0x790 fs/io_uring.c:1237 io_cqring_overflow_flush+0x774/0xab0 fs/io_uring.c:1359 io_ring_ctx_wait_and_kill+0x2a1/0x570 fs/io_uring.c:7808 io_uring_release+0x59/0x70 fs/io_uring.c:7829 __fput+0x34f/0x7b0 fs/file_table.c:281 task_work_run+0x137/0x1c0 kernel/task_work.c:135 exit_task_work include/linux/task_work.h:25 [inline] do_exit+0x5f3/0x1f20 kernel/exit.c:806 do_group_exit+0x161/0x2d0 kernel/exit.c:903 __do_sys_exit_group+0x13/0x20 kernel/exit.c:914 __se_sys_exit_group+0x10/0x10 kernel/exit.c:912 __x64_sys_exit_group+0x37/0x40 kernel/exit.c:912 do_syscall_64+0x31/0x70 arch/x86/entry/common.c:46 entry_SYSCALL_64_after_hwframe+0x44/0xa9
Fix this by passing back the link from __io_queue_async_work(), and then let the caller handle the queueing of the link. Take care to also punt the submission reference put to the caller, as we're holding the completion lock for the __io_queue_deferred() case. Hence we need to mark the io_kiocb appropriately for that case.
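Sketched standalone (hypothetical sk_* names, not the kernel functions), the shape of the fix: the enqueue helper stops taking the completion lock itself and instead returns the linked timeout, so each caller arms it under whatever locking context it already owns.

struct sk_item { struct sk_item *linked_timeout; };

static void sk_lock(void)   { /* take the completion lock */ }
static void sk_unlock(void) { /* drop the completion lock */ }
static void sk_queue_timeout_locked(struct sk_item *t) { (void)t; /* arm timer */ }

/* Enqueue to the worker pool; hand any linked timeout back to the caller. */
static struct sk_item *sk_enqueue(struct sk_item *it)
{
        return it->linked_timeout;      /* may be NULL */
}

/* Caller that does not hold the completion lock: */
static void sk_queue_async(struct sk_item *it)
{
        struct sk_item *t = sk_enqueue(it);

        if (t) {
                sk_lock();
                sk_queue_timeout_locked(t);
                sk_unlock();
        }
}

/* Caller already under the lock (the deferred-flush path): */
static void sk_queue_deferred_locked(struct sk_item *it)
{
        struct sk_item *t = sk_enqueue(it);

        if (t)
                sk_queue_timeout_locked(t);     /* no second lock, no recursion */
}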
Reported-by: syzbot+996f91b6ec3812c48042@syzkaller.appspotmail.com Signed-off-by: Jens Axboe axboe@kernel.dk Signed-off-by: yangerkun yangerkun@huawei.com Reviewed-by: zhangyi (F) yi.zhang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- fs/io_uring.c | 36 ++++++++++++++++++++++++++---------- 1 file changed, 26 insertions(+), 10 deletions(-)
diff --git a/fs/io_uring.c b/fs/io_uring.c index 1ce7395c8939..a7e0bae86df9 100644 --- a/fs/io_uring.c +++ b/fs/io_uring.c @@ -912,6 +912,7 @@ static void io_put_req(struct io_kiocb *req); static void io_double_put_req(struct io_kiocb *req); static void __io_double_put_req(struct io_kiocb *req); static struct io_kiocb *io_prep_linked_timeout(struct io_kiocb *req); +static void __io_queue_linked_timeout(struct io_kiocb *req); static void io_queue_linked_timeout(struct io_kiocb *req); static int __io_sqe_files_update(struct io_ring_ctx *ctx, struct io_uring_files_update *ip, @@ -1250,7 +1251,7 @@ static void io_prep_async_link(struct io_kiocb *req) io_prep_async_work(cur); }
-static void __io_queue_async_work(struct io_kiocb *req) +static struct io_kiocb *__io_queue_async_work(struct io_kiocb *req) { struct io_ring_ctx *ctx = req->ctx; struct io_kiocb *link = io_prep_linked_timeout(req); @@ -1258,16 +1259,19 @@ static void __io_queue_async_work(struct io_kiocb *req) trace_io_uring_queue_async_work(ctx, io_wq_is_hashed(&req->work), req, &req->work, req->flags); io_wq_enqueue(ctx->io_wq, &req->work); - - if (link) - io_queue_linked_timeout(link); + return link; }
static void io_queue_async_work(struct io_kiocb *req) { + struct io_kiocb *link; + /* init ->work of the whole link before punting */ io_prep_async_link(req); - __io_queue_async_work(req); + link = __io_queue_async_work(req); + + if (link) + io_queue_linked_timeout(link); }
static void io_kill_timeout(struct io_kiocb *req) @@ -1299,12 +1303,19 @@ static void __io_queue_deferred(struct io_ring_ctx *ctx) do { struct io_defer_entry *de = list_first_entry(&ctx->defer_list, struct io_defer_entry, list); + struct io_kiocb *link;
if (req_need_defer(de->req, de->seq)) break; list_del_init(&de->list); /* punt-init is done before queueing for defer */ - __io_queue_async_work(de->req); + link = __io_queue_async_work(de->req); + if (link) { + __io_queue_linked_timeout(link); + /* drop submission reference */ + link->flags |= REQ_F_COMP_LOCKED; + io_put_req(link); + } kfree(de); } while (!list_empty(&ctx->defer_list)); } @@ -5800,15 +5811,12 @@ static enum hrtimer_restart io_link_timeout_fn(struct hrtimer *timer) return HRTIMER_NORESTART; }
-static void io_queue_linked_timeout(struct io_kiocb *req) +static void __io_queue_linked_timeout(struct io_kiocb *req) { - struct io_ring_ctx *ctx = req->ctx; - /* * If the list is now empty, then our linked request finished before * we got a chance to setup the timer */ - spin_lock_irq(&ctx->completion_lock); if (!list_empty(&req->link_list)) { struct io_timeout_data *data = &req->io->timeout;
@@ -5816,6 +5824,14 @@ static void io_queue_linked_timeout(struct io_kiocb *req) hrtimer_start(&data->timer, timespec64_to_ktime(data->ts), data->mode); } +} + +static void io_queue_linked_timeout(struct io_kiocb *req) +{ + struct io_ring_ctx *ctx = req->ctx; + + spin_lock_irq(&ctx->completion_lock); + __io_queue_linked_timeout(req); spin_unlock_irq(&ctx->completion_lock);
/* drop submission reference */
From: Jens Axboe axboe@kernel.dk
mainline inclusion from mainline-5.9-rc1 commit ac8691c415e0ce0b8734cb6d9df2df18608eebed category: feature bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27 CVE: NA ---------------------------
Currently we only plug if we're doing more than two requests. We're going to be relying on always having the plug there to pass down information, so plug unconditionally.
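For reference, a minimal sketch of the standard block-layer plug pairing that the submit state is assumed to wrap here (blk_start_plug()/blk_finish_plug() are the stock kernel API; the wrapper function is hypothetical):

#include <linux/blkdev.h>

static void sk_submit_batch_plugged(void)
{
        struct blk_plug plug;

        blk_start_plug(&plug);
        /* submit any number of requests; they are batched per task and
         * flushed to the driver when the plug is finished (or on schedule()) */
        blk_finish_plug(&plug);
}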
Signed-off-by: Jens Axboe axboe@kernel.dk
Conflicts: fs/io_uring.c [We need this to transfer arg]
Signed-off-by: yangerkun yangerkun@huawei.com Reviewed-by: zhangyi (F) yi.zhang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- fs/io_uring.c | 15 +++++---------- 1 file changed, 5 insertions(+), 10 deletions(-)
diff --git a/fs/io_uring.c b/fs/io_uring.c index a7e0bae86df9..c455d9ed5795 100644 --- a/fs/io_uring.c +++ b/fs/io_uring.c @@ -691,7 +691,6 @@ struct io_defer_entry { u32 seq; };
-#define IO_PLUG_THRESHOLD 2 #define IO_IOPOLL_BATCH 8
struct io_comp_state { @@ -6181,7 +6180,7 @@ static int io_init_req(struct io_ring_ctx *ctx, struct io_kiocb *req, static int io_submit_sqes(struct io_ring_ctx *ctx, unsigned int nr, struct file *ring_file, int ring_fd) { - struct io_submit_state state, *statep = NULL; + struct io_submit_state state; struct io_kiocb *link = NULL; int i, submitted = 0;
@@ -6198,10 +6197,7 @@ static int io_submit_sqes(struct io_ring_ctx *ctx, unsigned int nr, if (!percpu_ref_tryget_many(&ctx->refs, nr)) return -EAGAIN;
- if (nr > IO_PLUG_THRESHOLD) { - io_submit_state_start(&state, ctx, nr); - statep = &state; - } + io_submit_state_start(&state, ctx, nr);
ctx->ring_fd = ring_fd; ctx->ring_file = ring_file; @@ -6216,14 +6212,14 @@ static int io_submit_sqes(struct io_ring_ctx *ctx, unsigned int nr, io_consume_sqe(ctx); break; } - req = io_alloc_req(ctx, statep); + req = io_alloc_req(ctx, &state); if (unlikely(!req)) { if (!submitted) submitted = -EAGAIN; break; }
- err = io_init_req(ctx, req, sqe, statep); + err = io_init_req(ctx, req, sqe, &state); io_consume_sqe(ctx); /* will complete beyond this point, count as submitted */ submitted++; @@ -6249,8 +6245,7 @@ static int io_submit_sqes(struct io_ring_ctx *ctx, unsigned int nr, } if (link) io_queue_link_head(link, &state.comp); - if (statep) - io_submit_state_end(&state); + io_submit_state_end(&state);
/* Commit SQ ring head once we've consumed and submitted all SQEs */ io_commit_sqring(ctx);
From: Jens Axboe axboe@kernel.dk
mainline inclusion from mainline-5.9-rc2 commit b711d4eaf0c408a811311ee3e94d6e9e5a230a9a category: feature bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27 CVE: NA ---------------------------
Commit f254ac04c874 ("io_uring: enable lookup of links holding inflight files") only handled two of the three head-link cases we have; we also need to look up and cancel work that is blocked in io-wq if that work has a link holding a reference to the files structure.
Put the "cancel head links that hold this request pending" logic into io_attempt_cancel(), which will go through the motions of finding and canceling head links that hold the current inflight, files-referencing request pending.
Cc: stable@vger.kernel.org Reported-by: Pavel Begunkov asml.silence@gmail.com Signed-off-by: Jens Axboe axboe@kernel.dk Signed-off-by: yangerkun yangerkun@huawei.com Reviewed-by: zhangyi (F) yi.zhang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- fs/io_uring.c | 33 +++++++++++++++++++++++++++++---- 1 file changed, 29 insertions(+), 4 deletions(-)
diff --git a/fs/io_uring.c b/fs/io_uring.c index c455d9ed5795..22d778c7a45e 100644 --- a/fs/io_uring.c +++ b/fs/io_uring.c @@ -7814,6 +7814,33 @@ static bool io_timeout_remove_link(struct io_ring_ctx *ctx, return found; }
+static bool io_cancel_link_cb(struct io_wq_work *work, void *data) +{ + return io_match_link(container_of(work, struct io_kiocb, work), data); +} + +static void io_attempt_cancel(struct io_ring_ctx *ctx, struct io_kiocb *req) +{ + enum io_wq_cancel cret; + + /* cancel this particular work, if it's running */ + cret = io_wq_cancel_work(ctx->io_wq, &req->work); + if (cret != IO_WQ_CANCEL_NOTFOUND) + return; + + /* find links that hold this pending, cancel those */ + cret = io_wq_cancel_cb(ctx->io_wq, io_cancel_link_cb, req, true); + if (cret != IO_WQ_CANCEL_NOTFOUND) + return; + + /* if we have a poll link holding this pending, cancel that */ + if (io_poll_remove_link(ctx, req)) + return; + + /* final option, timeout link is holding this req pending */ + io_timeout_remove_link(ctx, req); +} + static void io_uring_cancel_files(struct io_ring_ctx *ctx, struct files_struct *files) { @@ -7870,10 +7897,8 @@ static void io_uring_cancel_files(struct io_ring_ctx *ctx, continue; } } else { - io_wq_cancel_work(ctx->io_wq, &cancel_req->work); - /* could be a link, check and remove if it is */ - if (!io_poll_remove_link(ctx, cancel_req)) - io_timeout_remove_link(ctx, cancel_req); + /* cancel this request, or head link requests */ + io_attempt_cancel(ctx, cancel_req); io_put_req(cancel_req); }
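The escalation order in io_attempt_cancel() above can be read as "try the cheapest cancellation first, stop at the first hit"; a toy userspace rendering of that control flow (the four step functions merely stand in for the kernel helpers):

#include <stdbool.h>
#include <stdio.h>

typedef bool (*cancel_fn)(void);

static bool cancel_running(void)   { return false; } /* io_wq_cancel_work() */
static bool cancel_wq_links(void)  { return false; } /* io_wq_cancel_cb() */
static bool cancel_poll_link(void) { return true;  } /* io_poll_remove_link() */
static bool cancel_tmo_link(void)  { return false; } /* io_timeout_remove_link() */

int main(void)
{
        cancel_fn steps[] = { cancel_running, cancel_wq_links,
                              cancel_poll_link, cancel_tmo_link };
        for (unsigned i = 0; i < sizeof(steps) / sizeof(steps[0]); i++)
                if (steps[i]()) {               /* stop at first success */
                        printf("cancelled at step %u\n", i);
                        break;
                }
        return 0;
}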
From: Marcelo Diop-Gonzalez marcelo827@gmail.com
mainline inclusion from mainline-5.11-rc4 commit f010505b78a4fa8d5b6480752566e7313fb5ca6e category: feature bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27 CVE: NA ---------------------------
Right now io_flush_timeouts() checks if the current number of events is equal to ->timeout.target_seq, but this will miss some timeouts if there has been more than one event added since the last time they were flushed (possible in io_submit_flush_completions(), for example). Fix it by recording the last sequence at which timeouts were flushed, so that the number of events seen can be compared to the number of events needed without overflow.
Signed-off-by: Marcelo Diop-Gonzalez marcelo827@gmail.com Reviewed-by: Pavel Begunkov asml.silence@gmail.com Signed-off-by: Jens Axboe axboe@kernel.dk Signed-off-by: yangerkun yangerkun@huawei.com Reviewed-by: zhangyi (F) yi.zhang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- fs/io_uring.c | 34 ++++++++++++++++++++++++++++++---- 1 file changed, 30 insertions(+), 4 deletions(-)
diff --git a/fs/io_uring.c b/fs/io_uring.c index 22d778c7a45e..7163271d14c3 100644 --- a/fs/io_uring.c +++ b/fs/io_uring.c @@ -314,6 +314,7 @@ struct io_ring_ctx { unsigned cq_entries; unsigned cq_mask; atomic_t cq_timeouts; + unsigned cq_last_tm_flush; unsigned long cq_check_overflow; struct wait_queue_head cq_wait; struct fasync_struct *cq_fasync; @@ -1321,19 +1322,38 @@ static void __io_queue_deferred(struct io_ring_ctx *ctx)
static void io_flush_timeouts(struct io_ring_ctx *ctx) { - while (!list_empty(&ctx->timeout_list)) { + u32 seq; + + if (list_empty(&ctx->timeout_list)) + return; + + seq = ctx->cached_cq_tail - atomic_read(&ctx->cq_timeouts); + + do { + u32 events_needed, events_got; struct io_kiocb *req = list_first_entry(&ctx->timeout_list, struct io_kiocb, timeout.list);
if (io_is_timeout_noseq(req)) break; - if (req->timeout.target_seq != ctx->cached_cq_tail - - atomic_read(&ctx->cq_timeouts)) + + /* + * Since seq can easily wrap around over time, subtract + * the last seq at which timeouts were flushed before comparing. + * Assuming not more than 2^31-1 events have happened since, + * these subtractions won't have wrapped, so we can check if + * target is in [last_seq, current_seq] by comparing the two. + */ + events_needed = req->timeout.target_seq - ctx->cq_last_tm_flush; + events_got = seq - ctx->cq_last_tm_flush; + if (events_got < events_needed) break;
list_del_init(&req->timeout.list); io_kill_timeout(req); - } + } while (!list_empty(&ctx->timeout_list)); + + ctx->cq_last_tm_flush = seq; }
static void io_commit_cqring(struct io_ring_ctx *ctx) @@ -5060,6 +5080,12 @@ static int io_timeout(struct io_kiocb *req) tail = ctx->cached_cq_tail - atomic_read(&ctx->cq_timeouts); req->timeout.target_seq = tail + off;
+ /* Update the last seq here in case io_flush_timeouts() hasn't. + * This is safe because ->completion_lock is held, and submissions + * and completions are never mixed in the same ->completion_lock section. + */ + ctx->cq_last_tm_flush = tail; + /* * Insertion sort, ensuring the first entry in the list is always * the one we need first.
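The wraparound comparison above can be checked in isolation; a minimal userspace sketch of the same arithmetic (timeout_expired() is an invented name):

#include <assert.h>
#include <stdint.h>

/* With u32 sequence numbers that may wrap, "target lies within
 * (last_flush, current]" is tested by subtracting last_flush from
 * both sides first, exactly as in io_flush_timeouts() above. */
static int timeout_expired(uint32_t target, uint32_t last_flush, uint32_t cur)
{
        uint32_t events_needed = target - last_flush;
        uint32_t events_got = cur - last_flush;

        return events_got >= events_needed;
}

int main(void)
{
        /* no wrap: flushed at 100, target 105, currently 110 */
        assert(timeout_expired(105, 100, 110));
        /* wrap: flushed at 0xfffffffe, target wrapped around to 3 */
        assert(timeout_expired(3, 0xfffffffeu, 5));
        /* not yet: target 5 events away, only 2 seen */
        assert(!timeout_expired(105, 100, 102));
        return 0;
}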
From: Pavel Begunkov asml.silence@gmail.com
mainline inclusion from mainline-5.9-rc4 commit b7ddce3cbf010edbfac6c6d8cc708560a7bcd7a4 category: feature bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27 CVE: NA ---------------------------
While trying to cancel requests with ->files, we should also look for requests in ->defer_list; otherwise we might end up hanging a thread.
Cancel all requests in ->defer_list up to the last request there with matching ->files; that's needed to follow drain ordering semantics.
Signed-off-by: Pavel Begunkov asml.silence@gmail.com Signed-off-by: Jens Axboe axboe@kernel.dk Signed-off-by: yangerkun yangerkun@huawei.com Reviewed-by: zhangyi (F) yi.zhang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- fs/io_uring.c | 27 +++++++++++++++++++++++++++ 1 file changed, 27 insertions(+)
diff --git a/fs/io_uring.c b/fs/io_uring.c index 7163271d14c3..36536ed5659e 100644 --- a/fs/io_uring.c +++ b/fs/io_uring.c @@ -7867,12 +7867,39 @@ static void io_attempt_cancel(struct io_ring_ctx *ctx, struct io_kiocb *req) io_timeout_remove_link(ctx, req); }
+static void io_cancel_defer_files(struct io_ring_ctx *ctx, + struct files_struct *files) +{ + struct io_defer_entry *de = NULL; + LIST_HEAD(list); + + spin_lock_irq(&ctx->completion_lock); + list_for_each_entry_reverse(de, &ctx->defer_list, list) { + if ((de->req->flags & REQ_F_WORK_INITIALIZED) + && de->req->work.files == files) { + list_cut_position(&list, &ctx->defer_list, &de->list); + break; + } + } + spin_unlock_irq(&ctx->completion_lock); + + while (!list_empty(&list)) { + de = list_first_entry(&list, struct io_defer_entry, list); + list_del_init(&de->list); + req_set_fail_links(de->req); + io_put_req(de->req); + io_req_complete(de->req, -ECANCELED); + kfree(de); + } +} + static void io_uring_cancel_files(struct io_ring_ctx *ctx, struct files_struct *files) { if (list_empty_careful(&ctx->inflight_list)) return;
+ io_cancel_defer_files(ctx, files); /* cancel all at once, should be faster than doing it one by one*/ io_wq_cancel_cb(ctx->io_wq, io_wq_files_match, files, true);
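A toy model of the drain-ordering rule above: the reverse scan finds the *last* entry owned by the dying files, and everything up to and including it is cancelled, matching or not (plain userspace C, names invented):

#include <stdio.h>

int main(void)
{
        int owner[] = { 1, 2, 1, 3, 3 };        /* defer_list, oldest first */
        int dying = 1, i, cut = -1;

        for (i = (int)(sizeof(owner) / sizeof(owner[0])) - 1; i >= 0; i--)
                if (owner[i] == dying) {        /* reverse scan: last match */
                        cut = i;
                        break;
                }
        for (i = 0; i <= cut; i++)              /* cancel the whole prefix */
                printf("cancel deferred req %d (owner %d)\n", i, owner[i]);
        return 0;
}

Note that req 1 (owner 2) is cancelled too, even though its files don't match; list_cut_position() in the patch cuts that same prefix in one operation.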
From: Pavel Begunkov asml.silence@gmail.com
mainline inclusion from mainline-5.9-rc4 commit c127a2a1b7baa5eb40a7e2de4b7f0c51ccbbb2ef category: feature bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27 CVE: NA ---------------------------
While looking for ->files in ->defer_list, consider that requests there may actually be links.
Signed-off-by: Pavel Begunkov asml.silence@gmail.com Signed-off-by: Jens Axboe axboe@kernel.dk Signed-off-by: yangerkun yangerkun@huawei.com Reviewed-by: zhangyi (F) yi.zhang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- fs/io_uring.c | 25 +++++++++++++++++++++++-- 1 file changed, 23 insertions(+), 2 deletions(-)
diff --git a/fs/io_uring.c b/fs/io_uring.c index 36536ed5659e..b4d684321724 100644 --- a/fs/io_uring.c +++ b/fs/io_uring.c @@ -7793,6 +7793,28 @@ static bool io_match_link(struct io_kiocb *preq, struct io_kiocb *req) return false; }
+static inline bool io_match_files(struct io_kiocb *req, + struct files_struct *files) +{ + return (req->flags & REQ_F_WORK_INITIALIZED) && req->work.files == files; +} + +static bool io_match_link_files(struct io_kiocb *req, + struct files_struct *files) +{ + struct io_kiocb *link; + + if (io_match_files(req, files)) + return true; + if (req->flags & REQ_F_LINK_HEAD) { + list_for_each_entry(link, &req->link_list, link_list) { + if (io_match_files(link, files)) + return true; + } + } + return false; +} + /* * We're looking to cancel 'req' because it's holding on to our files, but * 'req' could be a link to another request. See if it is, and cancel that @@ -7875,8 +7897,7 @@ static void io_cancel_defer_files(struct io_ring_ctx *ctx,
spin_lock_irq(&ctx->completion_lock); list_for_each_entry_reverse(de, &ctx->defer_list, list) { - if ((de->req->flags & REQ_F_WORK_INITIALIZED) - && de->req->work.files == files) { + if (io_match_link_files(de->req, files)) { list_cut_position(&list, &ctx->defer_list, &de->list); break; }
From: Pavel Begunkov asml.silence@gmail.com
mainline inclusion from mainline-5.9-rc1 commit 4693014340808e7f099e302c1dc40e9d79ff7667 category: feature bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27 CVE: NA ---------------------------
Add a helper to mark ctx->{cq,sq}_check_overflow to get rid of duplicates, and it's clearer to check cq_overflow_list directly anyway.
Signed-off-by: Pavel Begunkov asml.silence@gmail.com Signed-off-by: Jens Axboe axboe@kernel.dk Signed-off-by: yangerkun yangerkun@huawei.com Reviewed-by: zhangyi (F) yi.zhang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- fs/io_uring.c | 23 +++++++++++++---------- 1 file changed, 13 insertions(+), 10 deletions(-)
diff --git a/fs/io_uring.c b/fs/io_uring.c index 32a7d285262c..cf6ecea9e5c3 100644 --- a/fs/io_uring.c +++ b/fs/io_uring.c @@ -1393,6 +1393,15 @@ static void io_cqring_ev_posted(struct io_ring_ctx *ctx) eventfd_signal(ctx->cq_ev_fd, 1); }
+static void io_cqring_mark_overflow(struct io_ring_ctx *ctx) +{ + if (list_empty(&ctx->cq_overflow_list)) { + clear_bit(0, &ctx->sq_check_overflow); + clear_bit(0, &ctx->cq_check_overflow); + ctx->rings->sq_flags &= ~IORING_SQ_CQ_OVERFLOW; + } +} + /* Returns true if there are no backlogged entries after the flush */ static bool io_cqring_overflow_flush(struct io_ring_ctx *ctx, bool force) { @@ -1437,11 +1446,8 @@ static bool io_cqring_overflow_flush(struct io_ring_ctx *ctx, bool force) }
io_commit_cqring(ctx); - if (cqe) { - clear_bit(0, &ctx->sq_check_overflow); - clear_bit(0, &ctx->cq_check_overflow); - ctx->rings->sq_flags &= ~IORING_SQ_CQ_OVERFLOW; - } + io_cqring_mark_overflow(ctx); + spin_unlock_irqrestore(&ctx->completion_lock, flags); io_cqring_ev_posted(ctx);
@@ -7943,11 +7949,8 @@ static void io_uring_cancel_files(struct io_ring_ctx *ctx, spin_lock_irq(&ctx->completion_lock); list_del(&cancel_req->compl.list); cancel_req->flags &= ~REQ_F_OVERFLOW; - if (list_empty(&ctx->cq_overflow_list)) { - clear_bit(0, &ctx->sq_check_overflow); - clear_bit(0, &ctx->cq_check_overflow); - ctx->rings->sq_flags &= ~IORING_SQ_CQ_OVERFLOW; - } + + io_cqring_mark_overflow(ctx); WRITE_ONCE(ctx->rings->cq_overflow, atomic_inc_return(&ctx->cached_cq_overflow)); io_commit_cqring(ctx);
From: Pavel Begunkov asml.silence@gmail.com
mainline inclusion from mainline-5.9-rc1 commit 01cec8c18f5ad9c27eee9f21439072832181039e category: feature bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27 CVE: NA ---------------------------
If ->cq_timeouts modifications are done under ->completion_lock, we don't really need any fetch-and-add or other complex atomics. Replace it with a non-atomic FAA, which saves an implicit full memory barrier.
Signed-off-by: Pavel Begunkov asml.silence@gmail.com Signed-off-by: Jens Axboe axboe@kernel.dk Signed-off-by: yangerkun yangerkun@huawei.com Reviewed-by: zhangyi (F) yi.zhang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- fs/io_uring.c | 8 +++++--- 1 file changed, 5 insertions(+), 3 deletions(-)
diff --git a/fs/io_uring.c b/fs/io_uring.c index cf6ecea9e5c3..1a35314a1045 100644 --- a/fs/io_uring.c +++ b/fs/io_uring.c @@ -1269,7 +1269,8 @@ static void io_kill_timeout(struct io_kiocb *req)
ret = hrtimer_try_to_cancel(&req->io->timeout.timer); if (ret != -1) { - atomic_inc(&req->ctx->cq_timeouts); + atomic_set(&req->ctx->cq_timeouts, + atomic_read(&req->ctx->cq_timeouts) + 1); list_del_init(&req->timeout.list); req->flags |= REQ_F_COMP_LOCKED; io_cqring_fill_event(req, 0); @@ -4922,9 +4923,10 @@ static enum hrtimer_restart io_timeout_fn(struct hrtimer *timer) struct io_ring_ctx *ctx = req->ctx; unsigned long flags;
- atomic_inc(&ctx->cq_timeouts); - spin_lock_irqsave(&ctx->completion_lock, flags); + atomic_set(&req->ctx->cq_timeouts, + atomic_read(&req->ctx->cq_timeouts) + 1); + /* * We could be racing with timeout deletion. If the list is empty, * then timeout lookup already found it and will be handling it.
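The claim above, that an atomic RMW is unnecessary once every writer holds the same lock, can be demonstrated in userspace; a minimal sketch with C11 atomics and a pthread mutex (all names invented):

#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>

static pthread_mutex_t completion_lock = PTHREAD_MUTEX_INITIALIZER;
static atomic_uint cq_timeouts;

/* Every writer holds completion_lock, so a plain load+store replaces
 * atomic_fetch_add(); lock-free readers still see consistent values. */
static void bump_cq_timeouts(void)
{
        pthread_mutex_lock(&completion_lock);
        atomic_store_explicit(&cq_timeouts,
                atomic_load_explicit(&cq_timeouts, memory_order_relaxed) + 1,
                memory_order_relaxed);
        pthread_mutex_unlock(&completion_lock);
}

int main(void)
{
        bump_cq_timeouts();
        bump_cq_timeouts();
        assert(atomic_load(&cq_timeouts) == 2);
        return 0;
}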
From: Pavel Begunkov asml.silence@gmail.com
mainline inclusion from mainline-5.9-rc1 commit b089ed390b5c9bc248a32168709cfa01099caf9d category: feature bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27 CVE: NA ---------------------------
Linked requests are hashed; remove a comment stating otherwise. Also move the hash bits into the loop to emphasise that the value isn't carried across loop iterations but is set every time.
Signed-off-by: Pavel Begunkov asml.silence@gmail.com Signed-off-by: Jens Axboe axboe@kernel.dk Signed-off-by: yangerkun yangerkun@huawei.com Reviewed-by: zhangyi (F) yi.zhang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- fs/io-wq.c | 5 +---- 1 file changed, 1 insertion(+), 4 deletions(-)
diff --git a/fs/io-wq.c b/fs/io-wq.c index d089fb9a83b9..9c223dfb4b3f 100644 --- a/fs/io-wq.c +++ b/fs/io-wq.c @@ -491,7 +491,6 @@ static void io_worker_handle_work(struct io_worker *worker)
do { struct io_wq_work *work; - unsigned int hash; get_next: /* * If we got some work, mark us as busy. If we didn't, but @@ -514,6 +513,7 @@ static void io_worker_handle_work(struct io_worker *worker) /* handle a whole dependent link */ do { struct io_wq_work *old_work, *next_hashed, *linked; + unsigned int hash = io_get_work_hash(work);
next_hashed = wq_next_work(work); io_impersonate_work(worker, work); @@ -524,7 +524,6 @@ static void io_worker_handle_work(struct io_worker *worker) if (test_bit(IO_WQ_BIT_CANCEL, &wq->state)) work->flags |= IO_WQ_WORK_CANCEL;
- hash = io_get_work_hash(work); old_work = work; linked = wq->do_work(work);
@@ -543,8 +542,6 @@ static void io_worker_handle_work(struct io_worker *worker) spin_lock_irq(&wqe->lock); wqe->hash_map &= ~BIT_ULL(hash); wqe->flags &= ~IO_WQE_FLAG_STALLED; - /* dependent work is not hashed */ - hash = -1U; /* skip unnecessary unlock-lock wqe->lock */ if (!work) goto get_next;
From: Pavel Begunkov asml.silence@gmail.com
mainline inclusion from mainline-5.9-rc1 commit 7a7cacba8b4560403615b04d57bdcd1f93f90f10 category: feature bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27 CVE: NA ---------------------------
Flip over "if (sock)" condition with return on error, the upper layer will take care. That change will be handy later, but already removes an extra jump from hot path.
Signed-off-by: Pavel Begunkov asml.silence@gmail.com Signed-off-by: Jens Axboe axboe@kernel.dk Signed-off-by: yangerkun yangerkun@huawei.com Reviewed-by: zhangyi (F) yi.zhang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- fs/io_uring.c | 265 +++++++++++++++++++++++++------------------------- 1 file changed, 131 insertions(+), 134 deletions(-)
diff --git a/fs/io_uring.c b/fs/io_uring.c index 1a35314a1045..c913100897fa 100644 --- a/fs/io_uring.c +++ b/fs/io_uring.c @@ -3833,42 +3833,41 @@ static int io_sendmsg_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe) static int io_sendmsg(struct io_kiocb *req, bool force_nonblock, struct io_comp_state *cs) { - struct io_async_msghdr *kmsg = NULL; + struct io_async_msghdr iomsg, *kmsg = NULL; struct socket *sock; + unsigned flags; int ret;
sock = sock_from_file(req->file, &ret); - if (sock) { - struct io_async_msghdr iomsg; - unsigned flags; - - if (req->io) { - kmsg = &req->io->msg; - kmsg->msg.msg_name = &req->io->msg.addr; - /* if iov is set, it's allocated already */ - if (!kmsg->iov) - kmsg->iov = kmsg->fast_iov; - kmsg->msg.msg_iter.iov = kmsg->iov; - } else { - ret = io_sendmsg_copy_hdr(req, &iomsg); - if (ret) - return ret; - kmsg = &iomsg; - } - - flags = req->sr_msg.msg_flags; - if (flags & MSG_DONTWAIT) - req->flags |= REQ_F_NOWAIT; - else if (force_nonblock) - flags |= MSG_DONTWAIT; + if (unlikely(!sock)) + return ret;
- ret = __sys_sendmsg_sock(sock, &kmsg->msg, flags); - if (force_nonblock && ret == -EAGAIN) - return io_setup_async_msg(req, kmsg); - if (ret == -ERESTARTSYS) - ret = -EINTR; + if (req->io) { + kmsg = &req->io->msg; + kmsg->msg.msg_name = &req->io->msg.addr; + /* if iov is set, it's allocated already */ + if (!kmsg->iov) + kmsg->iov = kmsg->fast_iov; + kmsg->msg.msg_iter.iov = kmsg->iov; + } else { + ret = io_sendmsg_copy_hdr(req, &iomsg); + if (ret) + return ret; + kmsg = &iomsg; }
+ flags = req->sr_msg.msg_flags; + if (flags & MSG_DONTWAIT) + req->flags |= REQ_F_NOWAIT; + else if (force_nonblock) + flags |= MSG_DONTWAIT; + + ret = __sys_sendmsg_sock(sock, &kmsg->msg, flags); + if (force_nonblock && ret == -EAGAIN) + return io_setup_async_msg(req, kmsg); + if (ret == -ERESTARTSYS) + ret = -EINTR; + if (kmsg && kmsg->iov != kmsg->fast_iov) kfree(kmsg->iov); req->flags &= ~REQ_F_NEED_CLEANUP; @@ -3881,39 +3880,38 @@ static int io_sendmsg(struct io_kiocb *req, bool force_nonblock, static int io_send(struct io_kiocb *req, bool force_nonblock, struct io_comp_state *cs) { + struct io_sr_msg *sr = &req->sr_msg; + struct msghdr msg; + struct iovec iov; struct socket *sock; + unsigned flags; int ret;
sock = sock_from_file(req->file, &ret); - if (sock) { - struct io_sr_msg *sr = &req->sr_msg; - struct msghdr msg; - struct iovec iov; - unsigned flags; + if (unlikely(!sock)) + return ret;
- ret = import_single_range(WRITE, sr->buf, sr->len, &iov, - &msg.msg_iter); - if (ret) - return ret; + ret = import_single_range(WRITE, sr->buf, sr->len, &iov, &msg.msg_iter); + if (unlikely(ret)) + return ret;
- msg.msg_name = NULL; - msg.msg_control = NULL; - msg.msg_controllen = 0; - msg.msg_namelen = 0; + msg.msg_name = NULL; + msg.msg_control = NULL; + msg.msg_controllen = 0; + msg.msg_namelen = 0;
- flags = req->sr_msg.msg_flags; - if (flags & MSG_DONTWAIT) - req->flags |= REQ_F_NOWAIT; - else if (force_nonblock) - flags |= MSG_DONTWAIT; + flags = req->sr_msg.msg_flags; + if (flags & MSG_DONTWAIT) + req->flags |= REQ_F_NOWAIT; + else if (force_nonblock) + flags |= MSG_DONTWAIT;
- msg.msg_flags = flags; - ret = sock_sendmsg(sock, &msg); - if (force_nonblock && ret == -EAGAIN) - return -EAGAIN; - if (ret == -ERESTARTSYS) - ret = -EINTR; - } + msg.msg_flags = flags; + ret = sock_sendmsg(sock, &msg); + if (force_nonblock && ret == -EAGAIN) + return -EAGAIN; + if (ret == -ERESTARTSYS) + ret = -EINTR;
if (ret < 0) req_set_fail_links(req); @@ -4067,62 +4065,62 @@ static int io_recvmsg_prep(struct io_kiocb *req, static int io_recvmsg(struct io_kiocb *req, bool force_nonblock, struct io_comp_state *cs) { - struct io_async_msghdr *kmsg = NULL; + struct io_async_msghdr iomsg, *kmsg = NULL; struct socket *sock; + struct io_buffer *kbuf; + unsigned flags; int ret, cflags = 0;
sock = sock_from_file(req->file, &ret); - if (sock) { - struct io_buffer *kbuf; - struct io_async_msghdr iomsg; - unsigned flags; - - if (req->io) { - kmsg = &req->io->msg; - kmsg->msg.msg_name = &req->io->msg.addr; - /* if iov is set, it's allocated already */ - if (!kmsg->iov) - kmsg->iov = kmsg->fast_iov; - kmsg->msg.msg_iter.iov = kmsg->iov; - } else { - ret = io_recvmsg_copy_hdr(req, &iomsg); - if (ret) - return ret; - kmsg = &iomsg; - } + if (unlikely(!sock)) + return ret;
- kbuf = io_recv_buffer_select(req, &cflags, !force_nonblock); - if (IS_ERR(kbuf)) { - return PTR_ERR(kbuf); - } else if (kbuf) { - kmsg->fast_iov[0].iov_base = u64_to_user_ptr(kbuf->addr); - iov_iter_init(&kmsg->msg.msg_iter, READ, kmsg->iov, - 1, req->sr_msg.len); - } + if (req->io) { + kmsg = &req->io->msg; + kmsg->msg.msg_name = &req->io->msg.addr; + /* if iov is set, it's allocated already */ + if (!kmsg->iov) + kmsg->iov = kmsg->fast_iov; + kmsg->msg.msg_iter.iov = kmsg->iov; + } else { + ret = io_recvmsg_copy_hdr(req, &iomsg); + if (ret) + return ret; + kmsg = &iomsg; + }
- flags = req->sr_msg.msg_flags; - if (flags & MSG_DONTWAIT) - req->flags |= REQ_F_NOWAIT; - else if (force_nonblock) - flags |= MSG_DONTWAIT; + kbuf = io_recv_buffer_select(req, &cflags, !force_nonblock); + if (IS_ERR(kbuf)) { + return PTR_ERR(kbuf); + } else if (kbuf) { + kmsg->fast_iov[0].iov_base = u64_to_user_ptr(kbuf->addr); + iov_iter_init(&kmsg->msg.msg_iter, READ, kmsg->iov, + 1, req->sr_msg.len); + }
- ret = __sys_recvmsg_sock(sock, &kmsg->msg, req->sr_msg.umsg, - kmsg->uaddr, flags); - if (force_nonblock && ret == -EAGAIN) { - ret = io_setup_async_msg(req, kmsg); - if (ret != -EAGAIN) - kfree(kbuf); - return ret; - } - if (ret == -ERESTARTSYS) - ret = -EINTR; - if (kbuf) + flags = req->sr_msg.msg_flags; + if (flags & MSG_DONTWAIT) + req->flags |= REQ_F_NOWAIT; + else if (force_nonblock) + flags |= MSG_DONTWAIT; + + ret = __sys_recvmsg_sock(sock, &kmsg->msg, req->sr_msg.umsg, + kmsg->uaddr, flags); + if (force_nonblock && ret == -EAGAIN) { + ret = io_setup_async_msg(req, kmsg); + if (ret != -EAGAIN) kfree(kbuf); + return ret; } + if (ret == -ERESTARTSYS) + ret = -EINTR; + if (kbuf) + kfree(kbuf);
if (kmsg && kmsg->iov != kmsg->fast_iov) kfree(kmsg->iov); req->flags &= ~REQ_F_NEED_CLEANUP; + if (ret < 0) req_set_fail_links(req); __io_req_complete(req, ret, cflags, cs); @@ -4133,51 +4131,50 @@ static int io_recv(struct io_kiocb *req, bool force_nonblock, struct io_comp_state *cs) { struct io_buffer *kbuf = NULL; + struct io_sr_msg *sr = &req->sr_msg; + struct msghdr msg; + void __user *buf = sr->buf; struct socket *sock; + struct iovec iov; + unsigned flags; int ret, cflags = 0;
sock = sock_from_file(req->file, &ret); - if (sock) { - struct io_sr_msg *sr = &req->sr_msg; - void __user *buf = sr->buf; - struct msghdr msg; - struct iovec iov; - unsigned flags; - - kbuf = io_recv_buffer_select(req, &cflags, !force_nonblock); - if (IS_ERR(kbuf)) - return PTR_ERR(kbuf); - else if (kbuf) - buf = u64_to_user_ptr(kbuf->addr); + if (unlikely(!sock)) + return ret;
- ret = import_single_range(READ, buf, sr->len, &iov, - &msg.msg_iter); - if (ret) { - kfree(kbuf); - return ret; - } + kbuf = io_recv_buffer_select(req, &cflags, !force_nonblock); + if (IS_ERR(kbuf)) + return PTR_ERR(kbuf); + else if (kbuf) + buf = u64_to_user_ptr(kbuf->addr);
- req->flags |= REQ_F_NEED_CLEANUP; - msg.msg_name = NULL; - msg.msg_control = NULL; - msg.msg_controllen = 0; - msg.msg_namelen = 0; - msg.msg_iocb = NULL; - msg.msg_flags = 0; - - flags = req->sr_msg.msg_flags; - if (flags & MSG_DONTWAIT) - req->flags |= REQ_F_NOWAIT; - else if (force_nonblock) - flags |= MSG_DONTWAIT; - - ret = sock_recvmsg(sock, &msg, flags); - if (force_nonblock && ret == -EAGAIN) - return -EAGAIN; - if (ret == -ERESTARTSYS) - ret = -EINTR; + ret = import_single_range(READ, buf, sr->len, &iov, &msg.msg_iter); + if (unlikely(ret)) { + kfree(kbuf); + return ret; }
+ req->flags |= REQ_F_NEED_CLEANUP; + msg.msg_name = NULL; + msg.msg_control = NULL; + msg.msg_controllen = 0; + msg.msg_namelen = 0; + msg.msg_iocb = NULL; + msg.msg_flags = 0; + + flags = req->sr_msg.msg_flags; + if (flags & MSG_DONTWAIT) + req->flags |= REQ_F_NOWAIT; + else if (force_nonblock) + flags |= MSG_DONTWAIT; + + ret = sock_recvmsg(sock, &msg, flags); + if (force_nonblock && ret == -EAGAIN) + return -EAGAIN; + if (ret == -ERESTARTSYS) + ret = -EINTR; + kfree(kbuf); req->flags &= ~REQ_F_NEED_CLEANUP; if (ret < 0)
From: Pavel Begunkov asml.silence@gmail.com
mainline inclusion from mainline-5.9-rc1 commit 6b754c8b912a164fbb15b7b839d51709c3d9ee6f category: feature bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27 CVE: NA ---------------------------
With the early return on a bad socket, kmsg is always non-NULL by the end of the function; prune the now-redundant checks and initialisations.
Signed-off-by: Pavel Begunkov asml.silence@gmail.com Signed-off-by: Jens Axboe axboe@kernel.dk Signed-off-by: yangerkun yangerkun@huawei.com Reviewed-by: zhangyi (F) yi.zhang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- fs/io_uring.c | 10 +++++----- 1 file changed, 5 insertions(+), 5 deletions(-)
diff --git a/fs/io_uring.c b/fs/io_uring.c index c913100897fa..9a0ae47989b9 100644 --- a/fs/io_uring.c +++ b/fs/io_uring.c @@ -3833,7 +3833,7 @@ static int io_sendmsg_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe) static int io_sendmsg(struct io_kiocb *req, bool force_nonblock, struct io_comp_state *cs) { - struct io_async_msghdr iomsg, *kmsg = NULL; + struct io_async_msghdr iomsg, *kmsg; struct socket *sock; unsigned flags; int ret; @@ -3868,7 +3868,7 @@ static int io_sendmsg(struct io_kiocb *req, bool force_nonblock, if (ret == -ERESTARTSYS) ret = -EINTR;
- if (kmsg && kmsg->iov != kmsg->fast_iov) + if (kmsg->iov != kmsg->fast_iov) kfree(kmsg->iov); req->flags &= ~REQ_F_NEED_CLEANUP; if (ret < 0) @@ -4065,7 +4065,7 @@ static int io_recvmsg_prep(struct io_kiocb *req, static int io_recvmsg(struct io_kiocb *req, bool force_nonblock, struct io_comp_state *cs) { - struct io_async_msghdr iomsg, *kmsg = NULL; + struct io_async_msghdr iomsg, *kmsg; struct socket *sock; struct io_buffer *kbuf; unsigned flags; @@ -4117,7 +4117,7 @@ static int io_recvmsg(struct io_kiocb *req, bool force_nonblock, if (kbuf) kfree(kbuf);
- if (kmsg && kmsg->iov != kmsg->fast_iov) + if (kmsg->iov != kmsg->fast_iov) kfree(kmsg->iov); req->flags &= ~REQ_F_NEED_CLEANUP;
@@ -4130,7 +4130,7 @@ static int io_recvmsg(struct io_kiocb *req, bool force_nonblock, static int io_recv(struct io_kiocb *req, bool force_nonblock, struct io_comp_state *cs) { - struct io_buffer *kbuf = NULL; + struct io_buffer *kbuf; struct io_sr_msg *sr = &req->sr_msg; struct msghdr msg; void __user *buf = sr->buf;
From: Pavel Begunkov asml.silence@gmail.com
mainline inclusion from mainline-5.9-rc1 commit 14c32eee9286621dd437b53460e44bd11e5bc08d category: feature bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27 CVE: NA ---------------------------
Instead of returning an error from io_recv(), go through the generic cleanup path, because it'll retain cflags for userspace. Do the same for io_send() for consistency.
Signed-off-by: Pavel Begunkov asml.silence@gmail.com Signed-off-by: Jens Axboe axboe@kernel.dk Signed-off-by: yangerkun yangerkun@huawei.com Reviewed-by: zhangyi (F) yi.zhang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- fs/io_uring.c | 10 ++++------ 1 file changed, 4 insertions(+), 6 deletions(-)
diff --git a/fs/io_uring.c b/fs/io_uring.c index 9a0ae47989b9..c99384e15376 100644 --- a/fs/io_uring.c +++ b/fs/io_uring.c @@ -3893,7 +3893,7 @@ static int io_send(struct io_kiocb *req, bool force_nonblock,
ret = import_single_range(WRITE, sr->buf, sr->len, &iov, &msg.msg_iter); if (unlikely(ret)) - return ret; + return ret;;
msg.msg_name = NULL; msg.msg_control = NULL; @@ -4150,10 +4150,8 @@ static int io_recv(struct io_kiocb *req, bool force_nonblock, buf = u64_to_user_ptr(kbuf->addr);
ret = import_single_range(READ, buf, sr->len, &iov, &msg.msg_iter); - if (unlikely(ret)) { - kfree(kbuf); - return ret; - } + if (unlikely(ret)) + goto out_free;
req->flags |= REQ_F_NEED_CLEANUP; msg.msg_name = NULL; @@ -4174,7 +4172,7 @@ static int io_recv(struct io_kiocb *req, bool force_nonblock, return -EAGAIN; if (ret == -ERESTARTSYS) ret = -EINTR; - +out_free: kfree(kbuf); req->flags &= ~REQ_F_NEED_CLEANUP; if (ret < 0)
From: Pavel Begunkov asml.silence@gmail.com
mainline inclusion from mainline-5.9-rc1 commit 0e1b6fe3d1e5f1b79c5bec37881c98febfba7718 category: feature bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27 CVE: NA ---------------------------
io_clean_op() may be skipped even if there is a selected io_buffer; that's because the *select_buffer() functions never set REQ_F_NEED_CLEANUP.
Trigger io_clean_op() when REQ_F_BUFFER_SELECTED is set as well, and clear the flag once the buffer has been freed there.
Signed-off-by: Pavel Begunkov asml.silence@gmail.com Signed-off-by: Jens Axboe axboe@kernel.dk
Conflicts: fs/io_uring.c [f3cd4850504f ("io_uring: ensure open/openat2 name is cleaned on cancelation") merge first and do not support openat2]
Signed-off-by: yangerkun yangerkun@huawei.com Reviewed-by: zhangyi (F) yi.zhang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- fs/io_uring.c | 93 ++++++++++++++++++++++++++------------------------- 1 file changed, 48 insertions(+), 45 deletions(-)
diff --git a/fs/io_uring.c b/fs/io_uring.c index c99384e15376..a4f2cdea02fc 100644 --- a/fs/io_uring.c +++ b/fs/io_uring.c @@ -947,7 +947,7 @@ static void io_get_req_task(struct io_kiocb *req)
static inline void io_clean_op(struct io_kiocb *req) { - if (req->flags & REQ_F_NEED_CLEANUP) + if (req->flags & (REQ_F_NEED_CLEANUP | REQ_F_BUFFER_SELECTED)) __io_clean_op(req); }
@@ -2029,6 +2029,7 @@ static int io_put_kbuf(struct io_kiocb *req) cflags = kbuf->bid << IORING_CQE_BUFFER_SHIFT; cflags |= IORING_CQE_F_BUFFER; req->rw.addr = 0; + req->flags &= ~REQ_F_BUFFER_SELECTED; kfree(kbuf); return cflags; } @@ -4106,20 +4107,16 @@ static int io_recvmsg(struct io_kiocb *req, bool force_nonblock,
ret = __sys_recvmsg_sock(sock, &kmsg->msg, req->sr_msg.umsg, kmsg->uaddr, flags); - if (force_nonblock && ret == -EAGAIN) { - ret = io_setup_async_msg(req, kmsg); - if (ret != -EAGAIN) - kfree(kbuf); - return ret; - } + if (force_nonblock && ret == -EAGAIN) + return io_setup_async_msg(req, kmsg); if (ret == -ERESTARTSYS) ret = -EINTR; + if (kbuf) kfree(kbuf); - if (kmsg->iov != kmsg->fast_iov) kfree(kmsg->iov); - req->flags &= ~REQ_F_NEED_CLEANUP; + req->flags &= ~(REQ_F_NEED_CLEANUP | REQ_F_BUFFER_SELECTED);
if (ret < 0) req_set_fail_links(req); @@ -4153,7 +4150,6 @@ static int io_recv(struct io_kiocb *req, bool force_nonblock, if (unlikely(ret)) goto out_free;
- req->flags |= REQ_F_NEED_CLEANUP; msg.msg_name = NULL; msg.msg_control = NULL; msg.msg_controllen = 0; @@ -4173,7 +4169,8 @@ static int io_recv(struct io_kiocb *req, bool force_nonblock, if (ret == -ERESTARTSYS) ret = -EINTR; out_free: - kfree(kbuf); + if (kbuf) + kfree(kbuf); req->flags &= ~REQ_F_NEED_CLEANUP; if (ret < 0) req_set_fail_links(req); @@ -5395,43 +5392,49 @@ static void __io_clean_op(struct io_kiocb *req) { struct io_async_ctx *io = req->io;
- switch (req->opcode) { - case IORING_OP_READV: - case IORING_OP_READ_FIXED: - case IORING_OP_READ: - if (req->flags & REQ_F_BUFFER_SELECTED) + if (req->flags & REQ_F_BUFFER_SELECTED) { + switch (req->opcode) { + case IORING_OP_READV: + case IORING_OP_READ_FIXED: + case IORING_OP_READ: kfree((void *)(unsigned long)req->rw.addr); - /* fallthrough */ - case IORING_OP_WRITEV: - case IORING_OP_WRITE_FIXED: - case IORING_OP_WRITE: - if (io->rw.iov != io->rw.fast_iov) - kfree(io->rw.iov); - break; - case IORING_OP_RECVMSG: - if (req->flags & REQ_F_BUFFER_SELECTED) - kfree(req->sr_msg.kbuf); - /* fallthrough */ - case IORING_OP_SENDMSG: - if (io->msg.iov != io->msg.fast_iov) - kfree(io->msg.iov); - break; - case IORING_OP_RECV: - if (req->flags & REQ_F_BUFFER_SELECTED) + break; + case IORING_OP_RECVMSG: + case IORING_OP_RECV: kfree(req->sr_msg.kbuf); - break; - case IORING_OP_OPENAT: - if (req->open.filename) - putname(req->open.filename); - break; - case IORING_OP_SPLICE: - case IORING_OP_TEE: - io_put_file(req, req->splice.file_in, - (req->splice.flags & SPLICE_F_FD_IN_FIXED)); - break; + break; + } + req->flags &= ~REQ_F_BUFFER_SELECTED; + } + + if (req->flags & REQ_F_NEED_CLEANUP) { + switch (req->opcode) { + case IORING_OP_READV: + case IORING_OP_READ_FIXED: + case IORING_OP_READ: + case IORING_OP_WRITEV: + case IORING_OP_WRITE_FIXED: + case IORING_OP_WRITE: + if (io->rw.iov != io->rw.fast_iov) + kfree(io->rw.iov); + break; + case IORING_OP_RECVMSG: + case IORING_OP_SENDMSG: + if (io->msg.iov != io->msg.fast_iov) + kfree(io->msg.iov); + break; + case IORING_OP_SPLICE: + case IORING_OP_TEE: + io_put_file(req, req->splice.file_in, + (req->splice.flags & SPLICE_F_FD_IN_FIXED)); + break; + case IORING_OP_OPENAT: + if (req->open.filename) + putname(req->open.filename); + break; + } + req->flags &= ~REQ_F_NEED_CLEANUP; } - - req->flags &= ~REQ_F_NEED_CLEANUP; }
static int io_issue_sqe(struct io_kiocb *req, const struct io_uring_sqe *sqe,
From: Pavel Begunkov asml.silence@gmail.com
mainline inclusion from mainline-5.9-rc1 commit bc02ef3325e3ef524ef29b65681ca4207b781224 category: feature bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27 CVE: NA ---------------------------
Move the REQ_F_BUFFER_SELECT flag check out of io_recv_buffer_select() and into its call sites. That saves us double error checking and possibly an extra function call.
Signed-off-by: Pavel Begunkov asml.silence@gmail.com Signed-off-by: Jens Axboe axboe@kernel.dk Signed-off-by: yangerkun yangerkun@huawei.com Reviewed-by: zhangyi (F) yi.zhang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- fs/io_uring.c | 22 ++++++++++------------ 1 file changed, 10 insertions(+), 12 deletions(-)
diff --git a/fs/io_uring.c b/fs/io_uring.c index a4f2cdea02fc..659ce52b53d6 100644 --- a/fs/io_uring.c +++ b/fs/io_uring.c @@ -4016,9 +4016,6 @@ static struct io_buffer *io_recv_buffer_select(struct io_kiocb *req, struct io_sr_msg *sr = &req->sr_msg; struct io_buffer *kbuf;
- if (!(req->flags & REQ_F_BUFFER_SELECT)) - return NULL; - kbuf = io_buffer_select(req, &sr->len, sr->bgid, sr->kbuf, needs_lock); if (IS_ERR(kbuf)) return kbuf; @@ -4068,7 +4065,7 @@ static int io_recvmsg(struct io_kiocb *req, bool force_nonblock, { struct io_async_msghdr iomsg, *kmsg; struct socket *sock; - struct io_buffer *kbuf; + struct io_buffer *kbuf = NULL; unsigned flags; int ret, cflags = 0;
@@ -4090,10 +4087,10 @@ static int io_recvmsg(struct io_kiocb *req, bool force_nonblock, kmsg = &iomsg; }
- kbuf = io_recv_buffer_select(req, &cflags, !force_nonblock); - if (IS_ERR(kbuf)) { - return PTR_ERR(kbuf); - } else if (kbuf) { + if (req->flags & REQ_F_BUFFER_SELECT) { + kbuf = io_recv_buffer_select(req, &cflags, !force_nonblock); + if (IS_ERR(kbuf)) + return PTR_ERR(kbuf); kmsg->fast_iov[0].iov_base = u64_to_user_ptr(kbuf->addr); iov_iter_init(&kmsg->msg.msg_iter, READ, kmsg->iov, 1, req->sr_msg.len); @@ -4140,11 +4137,12 @@ static int io_recv(struct io_kiocb *req, bool force_nonblock, if (unlikely(!sock)) return ret;
- kbuf = io_recv_buffer_select(req, &cflags, !force_nonblock); - if (IS_ERR(kbuf)) - return PTR_ERR(kbuf); - else if (kbuf) + if (req->flags & REQ_F_BUFFER_SELECT) { + kbuf = io_recv_buffer_select(req, &cflags, !force_nonblock); + if (IS_ERR(kbuf)) + return PTR_ERR(kbuf); buf = u64_to_user_ptr(kbuf->addr); + }
ret = import_single_range(READ, buf, sr->len, &iov, &msg.msg_iter); if (unlikely(ret))
From: Pavel Begunkov asml.silence@gmail.com
mainline inclusion from mainline-5.9-rc1 commit 06ef3608b0eed673fcbc62cf74c8d3ad0007a337 category: feature bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27 CVE: NA ---------------------------
Currently, file refs in struct io_submit_state are tracked with two variables: @has_refs -- how many refs were initially taken, and @used_refs -- the number of refs used.
Replace them with a single variable counting how many refs are left at the current moment.
Signed-off-by: Pavel Begunkov asml.silence@gmail.com Signed-off-by: Jens Axboe axboe@kernel.dk Signed-off-by: yangerkun yangerkun@huawei.com Reviewed-by: zhangyi (F) yi.zhang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- fs/io_uring.c | 12 ++++-------- 1 file changed, 4 insertions(+), 8 deletions(-)
diff --git a/fs/io_uring.c b/fs/io_uring.c index 659ce52b53d6..59013930ad03 100644 --- a/fs/io_uring.c +++ b/fs/io_uring.c @@ -708,7 +708,6 @@ struct io_submit_state { struct file *file; unsigned int fd; unsigned int has_refs; - unsigned int used_refs; unsigned int ios_left; };
@@ -2358,10 +2357,8 @@ static void io_iopoll_req_issued(struct io_kiocb *req)
static void __io_state_file_put(struct io_submit_state *state) { - int diff = state->has_refs - state->used_refs; - - if (diff) - fput_many(state->file, diff); + if (state->has_refs) + fput_many(state->file, state->has_refs); state->file = NULL; }
@@ -2383,7 +2380,7 @@ static struct file *__io_file_get(struct io_submit_state *state, int fd)
if (state->file) { if (state->fd == fd) { - state->used_refs++; + state->has_refs--; state->ios_left--; return state->file; } @@ -2394,9 +2391,8 @@ static struct file *__io_file_get(struct io_submit_state *state, int fd) return NULL;
state->fd = fd; - state->has_refs = state->ios_left; - state->used_refs = 1; state->ios_left--; + state->has_refs = state->ios_left; return state->file; }
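To see why one counter suffices, here is a toy userspace model of the new scheme: take ios_left refs on the first get of an fd, consume one per request, and drop whatever is left at flush time (all names below are invented; refs_dropped stands in for fput_many()):

#include <assert.h>

struct state { int fd; int has_refs; };

static int refs_dropped;        /* stands in for fput_many() at flush */

static void state_file_put(struct state *s)
{
        if (s->has_refs)
                refs_dropped += s->has_refs;
        s->has_refs = 0;
}

static void state_file_get(struct state *s, int fd, int ios_left)
{
        if (s->has_refs && s->fd == fd) {
                s->has_refs--;          /* reuse cached file, consume a ref */
                return;
        }
        state_file_put(s);              /* different fd: return spares */
        s->fd = fd;
        s->has_refs = ios_left - 1;     /* took ios_left refs, one used now */
}

int main(void)
{
        struct state st = { -1, 0 };

        state_file_get(&st, 3, 4);      /* first get: 3 spare refs cached */
        state_file_get(&st, 3, 3);
        state_file_get(&st, 3, 2);
        state_file_put(&st);            /* flush returns the one spare */
        assert(refs_dropped == 1);
        return 0;
}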
From: Pavel Begunkov asml.silence@gmail.com
mainline inclusion from mainline-5.9-rc1 commit 8ff069bf2efd7b7aeb90b56ea8edc165c93d8940 category: feature bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27 CVE: NA ---------------------------
Extract a common helper for cleaning up a selected buffer; this will be used shortly. Along the way, correct the cflags type to unsigned and, as kbufs are tracked by a flag anyway, remove the useless zeroing of req->rw.addr.
Signed-off-by: Pavel Begunkov asml.silence@gmail.com Signed-off-by: Jens Axboe axboe@kernel.dk Signed-off-by: yangerkun yangerkun@huawei.com Reviewed-by: zhangyi (F) yi.zhang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- fs/io_uring.c | 19 ++++++++++++------- 1 file changed, 12 insertions(+), 7 deletions(-)
diff --git a/fs/io_uring.c b/fs/io_uring.c index 59013930ad03..f63de32dde96 100644 --- a/fs/io_uring.c +++ b/fs/io_uring.c @@ -2019,20 +2019,25 @@ static inline unsigned int io_sqring_entries(struct io_ring_ctx *ctx) return smp_load_acquire(&rings->sq.tail) - ctx->cached_sq_head; }
-static int io_put_kbuf(struct io_kiocb *req) +static unsigned int io_put_kbuf(struct io_kiocb *req, struct io_buffer *kbuf) { - struct io_buffer *kbuf; - int cflags; + unsigned int cflags;
- kbuf = (struct io_buffer *) (unsigned long) req->rw.addr; cflags = kbuf->bid << IORING_CQE_BUFFER_SHIFT; cflags |= IORING_CQE_F_BUFFER; - req->rw.addr = 0; req->flags &= ~REQ_F_BUFFER_SELECTED; kfree(kbuf); return cflags; }
+static inline unsigned int io_put_rw_kbuf(struct io_kiocb *req) +{ + struct io_buffer *kbuf; + + kbuf = (struct io_buffer *) (unsigned long) req->rw.addr; + return io_put_kbuf(req, kbuf); +} + static inline bool io_run_task_work(void) { if (current->task_works) { @@ -2090,7 +2095,7 @@ static void io_iopoll_complete(struct io_ring_ctx *ctx, unsigned int *nr_events, list_del(&req->inflight_entry);
if (req->flags & REQ_F_BUFFER_SELECTED) - cflags = io_put_kbuf(req); + cflags = io_put_rw_kbuf(req);
__io_cqring_fill_event(req, req->result, cflags); (*nr_events)++; @@ -2282,7 +2287,7 @@ static void io_complete_rw_common(struct kiocb *kiocb, long res, if (res != req->result) req_set_fail_links(req); if (req->flags & REQ_F_BUFFER_SELECTED) - cflags = io_put_kbuf(req); + cflags = io_put_rw_kbuf(req); __io_req_complete(req, res, cflags, cs); }
From: Pavel Begunkov asml.silence@gmail.com
mainline inclusion from mainline-5.9-rc1 commit 7fbb1b541f4286cc337b9bca1e5bad0ce4ee978c category: feature bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27 CVE: NA ---------------------------
Don't implement the fast path of kbuf freeing and management inline in io_recv{,msg}(); that's error prone and duplicates handling. Replace it with a helper, io_put_recv_kbuf(), which mimics io_put_rw_kbuf() in io_read/write().
This also keeps cflags calculation in one place, removing duplication between rw and recv/send.
Signed-off-by: Pavel Begunkov asml.silence@gmail.com Signed-off-by: Jens Axboe axboe@kernel.dk Signed-off-by: yangerkun yangerkun@huawei.com Reviewed-by: zhangyi (F) yi.zhang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- fs/io_uring.c | 28 ++++++++++++++-------------- 1 file changed, 14 insertions(+), 14 deletions(-)
diff --git a/fs/io_uring.c b/fs/io_uring.c index f63de32dde96..a90cd061ede3 100644 --- a/fs/io_uring.c +++ b/fs/io_uring.c @@ -4012,7 +4012,7 @@ static int io_recvmsg_copy_hdr(struct io_kiocb *req, }
static struct io_buffer *io_recv_buffer_select(struct io_kiocb *req, - int *cflags, bool needs_lock) + bool needs_lock) { struct io_sr_msg *sr = &req->sr_msg; struct io_buffer *kbuf; @@ -4023,12 +4023,14 @@ static struct io_buffer *io_recv_buffer_select(struct io_kiocb *req,
sr->kbuf = kbuf; req->flags |= REQ_F_BUFFER_SELECTED; - - *cflags = kbuf->bid << IORING_CQE_BUFFER_SHIFT; - *cflags |= IORING_CQE_F_BUFFER; return kbuf; }
+static inline unsigned int io_put_recv_kbuf(struct io_kiocb *req) +{ + return io_put_kbuf(req, req->sr_msg.kbuf); +} + static int io_recvmsg_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe) { @@ -4066,7 +4068,7 @@ static int io_recvmsg(struct io_kiocb *req, bool force_nonblock, { struct io_async_msghdr iomsg, *kmsg; struct socket *sock; - struct io_buffer *kbuf = NULL; + struct io_buffer *kbuf; unsigned flags; int ret, cflags = 0;
@@ -4089,7 +4091,7 @@ static int io_recvmsg(struct io_kiocb *req, bool force_nonblock, }
if (req->flags & REQ_F_BUFFER_SELECT) { - kbuf = io_recv_buffer_select(req, &cflags, !force_nonblock); + kbuf = io_recv_buffer_select(req, !force_nonblock); if (IS_ERR(kbuf)) return PTR_ERR(kbuf); kmsg->fast_iov[0].iov_base = u64_to_user_ptr(kbuf->addr); @@ -4110,12 +4112,11 @@ static int io_recvmsg(struct io_kiocb *req, bool force_nonblock, if (ret == -ERESTARTSYS) ret = -EINTR;
- if (kbuf) - kfree(kbuf); + if (req->flags & REQ_F_BUFFER_SELECTED) + cflags = io_put_recv_kbuf(req); if (kmsg->iov != kmsg->fast_iov) kfree(kmsg->iov); - req->flags &= ~(REQ_F_NEED_CLEANUP | REQ_F_BUFFER_SELECTED); - + req->flags &= ~REQ_F_NEED_CLEANUP; if (ret < 0) req_set_fail_links(req); __io_req_complete(req, ret, cflags, cs); @@ -4139,7 +4140,7 @@ static int io_recv(struct io_kiocb *req, bool force_nonblock, return ret;
if (req->flags & REQ_F_BUFFER_SELECT) { - kbuf = io_recv_buffer_select(req, &cflags, !force_nonblock); + kbuf = io_recv_buffer_select(req, !force_nonblock); if (IS_ERR(kbuf)) return PTR_ERR(kbuf); buf = u64_to_user_ptr(kbuf->addr); @@ -4168,9 +4169,8 @@ static int io_recv(struct io_kiocb *req, bool force_nonblock, if (ret == -ERESTARTSYS) ret = -EINTR; out_free: - if (kbuf) - kfree(kbuf); - req->flags &= ~REQ_F_NEED_CLEANUP; + if (req->flags & REQ_F_BUFFER_SELECTED) + cflags = io_put_recv_kbuf(req); if (ret < 0) req_set_fail_links(req); __io_req_complete(req, ret, cflags, cs);
From: Pavel Begunkov asml.silence@gmail.com
mainline inclusion from mainline-5.9-rc1 commit ae34817bd93e373a03203a4c6892735c430a14e1 category: feature bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27 CVE: NA ---------------------------
Calling into opcode prep handlers may be dangerous, as they re-read the SQE but might not re-initialise requests completely. If io_req_defer() passed the fast checks and is done with preparations, punt it async.
As all other cases are covered with nulling @sqe, this guarantees that io_[opcode]_prep() are visited only once per request.
Signed-off-by: Pavel Begunkov asml.silence@gmail.com Signed-off-by: Jens Axboe axboe@kernel.dk Signed-off-by: yangerkun yangerkun@huawei.com Reviewed-by: zhangyi (F) yi.zhang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- fs/io_uring.c | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-)
diff --git a/fs/io_uring.c b/fs/io_uring.c index a90cd061ede3..582eb9cdc728 100644 --- a/fs/io_uring.c +++ b/fs/io_uring.c @@ -5376,7 +5376,8 @@ static int io_req_defer(struct io_kiocb *req, const struct io_uring_sqe *sqe) if (!req_need_defer(req, seq) && list_empty(&ctx->defer_list)) { spin_unlock_irq(&ctx->completion_lock); kfree(de); - return 0; + io_queue_async_work(req); + return -EIOCBQUEUED; }
trace_io_uring_defer(ctx, req, req->user_data);
From: Pavel Begunkov asml.silence@gmail.com
mainline inclusion from mainline-5.9-rc1 commit f56040b81999871973d21f334b4657957422c90e category: feature bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27 CVE: NA ---------------------------
Move io_req_init_async() into io_grab_files(); it's safer this way. Note that io_queue_async_work() does *init_async(), so it's valid to move it out of the __io_queue_sqe() punt path. Also, add a helper around io_grab_files().
Signed-off-by: Pavel Begunkov asml.silence@gmail.com Signed-off-by: Jens Axboe axboe@kernel.dk Signed-off-by: yangerkun yangerkun@huawei.com Reviewed-by: zhangyi (F) yi.zhang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- fs/io_uring.c | 32 ++++++++++++++++---------------- 1 file changed, 16 insertions(+), 16 deletions(-)
diff --git a/fs/io_uring.c b/fs/io_uring.c index 582eb9cdc728..59ce988d5144 100644 --- a/fs/io_uring.c +++ b/fs/io_uring.c @@ -909,7 +909,7 @@ static void io_queue_linked_timeout(struct io_kiocb *req); static int __io_sqe_files_update(struct io_ring_ctx *ctx, struct io_uring_files_update *ip, unsigned nr_args); -static int io_grab_files(struct io_kiocb *req); +static int io_prep_work_files(struct io_kiocb *req); static void io_complete_rw_common(struct kiocb *kiocb, long res, struct io_comp_state *cs); static void __io_clean_op(struct io_kiocb *req); @@ -5226,13 +5226,9 @@ static int io_req_defer_prep(struct io_kiocb *req,
if (io_alloc_async_ctx(req)) return -EAGAIN; - - if (io_op_defs[req->opcode].file_table) { - io_req_init_async(req); - ret = io_grab_files(req); - if (unlikely(ret)) - return ret; - } + ret = io_prep_work_files(req); + if (unlikely(ret)) + return ret;
switch (req->opcode) { case IORING_OP_NOP: @@ -5781,6 +5777,8 @@ static int io_grab_files(struct io_kiocb *req) int ret = -EBADF; struct io_ring_ctx *ctx = req->ctx;
+ io_req_init_async(req); + if (req->work.files || (req->flags & REQ_F_NO_FILE_TABLE)) return 0; if (!ctx->ring_file) @@ -5806,6 +5804,13 @@ static int io_grab_files(struct io_kiocb *req) return ret; }
+static inline int io_prep_work_files(struct io_kiocb *req) +{ + if (!io_op_defs[req->opcode].file_table) + return 0; + return io_grab_files(req); +} + static enum hrtimer_restart io_link_timeout_fn(struct hrtimer *timer) { struct io_timeout_data *data = container_of(timer, @@ -5922,14 +5927,9 @@ static void __io_queue_sqe(struct io_kiocb *req, const struct io_uring_sqe *sqe, goto exit; } punt: - io_req_init_async(req); - - if (io_op_defs[req->opcode].file_table) { - ret = io_grab_files(req); - if (ret) - goto err; - } - + ret = io_prep_work_files(req); + if (unlikely(ret)) + goto err; /* * Queued up for async execution, worker will release * submit reference when the iocb is actually submitted.
From: Pavel Begunkov asml.silence@gmail.com
mainline inclusion from mainline-5.9-rc1 commit f063c5477eb392c315aa25ad538b4920b367ea05 category: feature bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27 CVE: NA ---------------------------
Whoever called io_prep_linked_timeout() should also do io_queue_linked_timeout(). __io_queue_sqe() doesn't follow that for the punting path, leaving linked timeouts prepared but never queued.
Fixes: 6df1db6b54243 ("io_uring: fix mis-refcounting linked timeouts") Signed-off-by: Pavel Begunkov asml.silence@gmail.com Signed-off-by: Jens Axboe axboe@kernel.dk Signed-off-by: yangerkun yangerkun@huawei.com Reviewed-by: zhangyi (F) yi.zhang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- fs/io_uring.c | 26 +++++++++++++------------- 1 file changed, 13 insertions(+), 13 deletions(-)
diff --git a/fs/io_uring.c b/fs/io_uring.c index 59ce988d5144..fd544ad594b6 100644 --- a/fs/io_uring.c +++ b/fs/io_uring.c @@ -5921,20 +5921,20 @@ static void __io_queue_sqe(struct io_kiocb *req, const struct io_uring_sqe *sqe, * doesn't support non-blocking read/write attempts */ if (ret == -EAGAIN && !(req->flags & REQ_F_NOWAIT)) { - if (io_arm_poll_handler(req)) { - if (linked_timeout) - io_queue_linked_timeout(linked_timeout); - goto exit; - } + if (!io_arm_poll_handler(req)) { punt: - ret = io_prep_work_files(req); - if (unlikely(ret)) - goto err; - /* - * Queued up for async execution, worker will release - * submit reference when the iocb is actually submitted. - */ - io_queue_async_work(req); + ret = io_prep_work_files(req); + if (unlikely(ret)) + goto err; + /* + * Queued up for async execution, worker will release + * submit reference when the iocb is actually submitted. + */ + io_queue_async_work(req); + } + + if (linked_timeout) + io_queue_linked_timeout(linked_timeout); goto exit; }
From: Pavel Begunkov asml.silence@gmail.com
mainline inclusion from mainline-5.9-rc1 commit dd6f843a9fca8f225c86fee5f50da429c369c045 category: feature bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27 CVE: NA ---------------------------
put_task_struct_many() is like put_task_struct() but puts several references at once. Useful for batching.
Signed-off-by: Pavel Begunkov asml.silence@gmail.com Signed-off-by: Jens Axboe axboe@kernel.dk
Conflicts: include/linux/sched/task.h [ec1d281923cf ("sched/core: Convert task_struct.usage to refcount_t)" not merge]
Signed-off-by: yangerkun yangerkun@huawei.com Reviewed-by: zhangyi (F) yi.zhang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- include/linux/sched/task.h | 6 ++++++ 1 file changed, 6 insertions(+)
diff --git a/include/linux/sched/task.h b/include/linux/sched/task.h index 44c6f15800ff..d744b385108e 100644 --- a/include/linux/sched/task.h +++ b/include/linux/sched/task.h @@ -98,6 +98,12 @@ static inline void put_task_struct(struct task_struct *t) __put_task_struct(t); }
+static inline void put_task_struct_many(struct task_struct *t, int nr) +{ + if (atomic_sub_and_test(nr, &t->usage)) + __put_task_struct(t); +} + struct task_struct *task_rcu_dereference(struct task_struct **ptask);
#ifdef CONFIG_ARCH_WANTS_DYNAMIC_TASK_STRUCT
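As a usage illustration, the batched put amounts to one atomic subtraction covering many references; a toy userspace model of the same semantics (struct obj and freed are stand-ins, not kernel code):

#include <assert.h>
#include <stdatomic.h>

struct obj { atomic_int usage; };

static int freed;

/* Mirrors put_task_struct_many(): one subtraction drops nr refs;
 * the old value equalling nr means the count just hit zero. */
static void put_many(struct obj *o, int nr)
{
        if (atomic_fetch_sub(&o->usage, nr) == nr)
                freed = 1;      /* __put_task_struct() equivalent */
}

int main(void)
{
        struct obj o = { 5 };

        put_many(&o, 3);        /* batch of refs from completed requests */
        assert(!freed);
        put_many(&o, 2);        /* remaining refs, frees the object */
        assert(freed);
        return 0;
}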
From: Pavel Begunkov asml.silence@gmail.com
mainline inclusion from mainline-5.9-rc1 commit 5af1d13e8f0d8839db04a71ec786f369b0e67234 category: feature bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27 CVE: NA ---------------------------
As every iopoll request has a task ref, it becomes expensive to put them one by one; instead we can put several at once, integrating that into io_req_free_batch().
Signed-off-by: Pavel Begunkov asml.silence@gmail.com Signed-off-by: Jens Axboe axboe@kernel.dk
Conflicts: fs/io_uring.c [adopt patch ecfc51777487 ("io_uring: fix potential use after free on fallback request free") lead this conflict]
Signed-off-by: yangerkun yangerkun@huawei.com Reviewed-by: zhangyi (F) yi.zhang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- fs/io_uring.c | 29 +++++++++++++++++++++++++++-- 1 file changed, 27 insertions(+), 2 deletions(-)
diff --git a/fs/io_uring.c b/fs/io_uring.c index fd544ad594b6..417662c3ab6a 100644 --- a/fs/io_uring.c +++ b/fs/io_uring.c @@ -1634,7 +1634,6 @@ static void io_dismantle_req(struct io_kiocb *req) kfree(req->io); if (req->file) io_put_file(req, req->file, (req->flags & REQ_F_FIXED_FILE)); - __io_put_req_task(req); io_req_clean_work(req);
if (req->flags & REQ_F_INFLIGHT) { @@ -1654,6 +1653,7 @@ static void __io_free_req(struct io_kiocb *req) struct io_ring_ctx *ctx = req->ctx;
io_dismantle_req(req); + __io_put_req_task(req); if (likely(!io_is_fallback_req(req))) kmem_cache_free(req_cachep, req); else @@ -1903,8 +1903,18 @@ static void io_free_req(struct io_kiocb *req) struct req_batch { void *reqs[IO_IOPOLL_BATCH]; int to_free; + + struct task_struct *task; + int task_refs; };
+static inline void io_init_req_batch(struct req_batch *rb) +{ + rb->to_free = 0; + rb->task_refs = 0; + rb->task = NULL; +} + static void __io_req_free_batch_flush(struct io_ring_ctx *ctx, struct req_batch *rb) { @@ -1918,6 +1928,10 @@ static void io_req_free_batch_finish(struct io_ring_ctx *ctx, { if (rb->to_free) __io_req_free_batch_flush(ctx, rb); + if (rb->task) { + put_task_struct_many(rb->task, rb->task_refs); + rb->task = NULL; + } }
static void io_req_free_batch(struct req_batch *rb, struct io_kiocb *req) @@ -1929,6 +1943,17 @@ static void io_req_free_batch(struct req_batch *rb, struct io_kiocb *req) if (req->flags & REQ_F_LINK_HEAD) io_queue_next(req);
+ if (req->flags & REQ_F_TASK_PINNED) { + if (req->task != rb->task) { + if (rb->task) + put_task_struct_many(rb->task, rb->task_refs); + rb->task = req->task; + rb->task_refs = 0; + } + rb->task_refs++; + req->flags &= ~REQ_F_TASK_PINNED; + } + io_dismantle_req(req); rb->reqs[rb->to_free++] = req; if (unlikely(rb->to_free == ARRAY_SIZE(rb->reqs))) @@ -2081,7 +2106,7 @@ static void io_iopoll_complete(struct io_ring_ctx *ctx, unsigned int *nr_events, /* order with ->result store in io_complete_rw_iopoll() */ smp_rmb();
- rb.to_free = 0; + io_init_req_batch(&rb); while (!list_empty(done)) { int cflags = 0;
From: Jens Axboe axboe@kernel.dk
mainline inclusion from mainline-5.9-rc1 commit 51a4cc112c7a42b62a91bcccdfac42e7c4561729 category: feature bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27 CVE: NA ---------------------------
If we're in the error path failing links and we have a link that has grabbed a reference to the fs_struct, then we cannot safely drop our reference to it if we already hold the completion lock. This adds a hardirq dependency to the fs_struct->lock, which it currently doesn't have.
Defer the final cleanup and free of such requests to avoid adding this dependency.
Reported-by: syzbot+ef4b654b49ed7ff049bf@syzkaller.appspotmail.com Signed-off-by: Jens Axboe axboe@kernel.dk
Conflicts: fs/io_uring.c [adopt patch ecfc51777487 ("io_uring: fix potential use after free on fallback request free") lead this conflict]
Signed-off-by: yangerkun yangerkun@huawei.com Reviewed-by: zhangyi (F) yi.zhang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- fs/io_uring.c | 59 ++++++++++++++++++++++++++++++++++++++++++++------- 1 file changed, 51 insertions(+), 8 deletions(-)
diff --git a/fs/io_uring.c b/fs/io_uring.c index 417662c3ab6a..0e3ec6bbcae7 100644 --- a/fs/io_uring.c +++ b/fs/io_uring.c @@ -1167,10 +1167,16 @@ static void __io_commit_cqring(struct io_ring_ctx *ctx) } }
-static void io_req_clean_work(struct io_kiocb *req) +/* + * Returns true if we need to defer file table putting. This can only happen + * from the error path with REQ_F_COMP_LOCKED set. + */ +static bool io_req_clean_work(struct io_kiocb *req) { if (!(req->flags & REQ_F_WORK_INITIALIZED)) - return; + return false; + + req->flags &= ~REQ_F_WORK_INITIALIZED;
if (req->work.mm) { mmdrop(req->work.mm); @@ -1183,6 +1189,9 @@ static void io_req_clean_work(struct io_kiocb *req) if (req->work.fs) { struct fs_struct *fs = req->work.fs;
+ if (req->flags & REQ_F_COMP_LOCKED) + return true; + spin_lock(&req->work.fs->lock); if (--fs->users) fs = NULL; @@ -1191,7 +1200,8 @@ static void io_req_clean_work(struct io_kiocb *req) free_fs_struct(fs); req->work.fs = NULL; } - req->flags &= ~REQ_F_WORK_INITIALIZED; + + return false; }
static void io_prep_async_work(struct io_kiocb *req) @@ -1626,7 +1636,7 @@ static inline void io_put_file(struct io_kiocb *req, struct file *file, fput(file); }
-static void io_dismantle_req(struct io_kiocb *req) +static bool io_dismantle_req(struct io_kiocb *req) { io_clean_op(req);
@@ -1634,7 +1644,6 @@ static void io_dismantle_req(struct io_kiocb *req) kfree(req->io); if (req->file) io_put_file(req, req->file, (req->flags & REQ_F_FIXED_FILE)); - io_req_clean_work(req);
if (req->flags & REQ_F_INFLIGHT) { struct io_ring_ctx *ctx = req->ctx; @@ -1646,13 +1655,14 @@ static void io_dismantle_req(struct io_kiocb *req) wake_up(&ctx->inflight_wait); spin_unlock_irqrestore(&ctx->inflight_lock, flags); } + + return io_req_clean_work(req); }
-static void __io_free_req(struct io_kiocb *req) +static void __io_free_req_finish(struct io_kiocb *req) { struct io_ring_ctx *ctx = req->ctx;
- io_dismantle_req(req); __io_put_req_task(req); if (likely(!io_is_fallback_req(req))) kmem_cache_free(req_cachep, req); @@ -1661,6 +1671,39 @@ static void __io_free_req(struct io_kiocb *req) percpu_ref_put(&ctx->refs); }
+static void io_req_task_file_table_put(struct callback_head *cb) +{ + struct io_kiocb *req = container_of(cb, struct io_kiocb, task_work); + struct fs_struct *fs = req->work.fs; + + spin_lock(&req->work.fs->lock); + if (--fs->users) + fs = NULL; + spin_unlock(&req->work.fs->lock); + if (fs) + free_fs_struct(fs); + req->work.fs = NULL; + __io_free_req_finish(req); +} + +static void __io_free_req(struct io_kiocb *req) +{ + if (!io_dismantle_req(req)) { + __io_free_req_finish(req); + } else { + int ret; + + init_task_work(&req->task_work, io_req_task_file_table_put); + ret = task_work_add(req->task, &req->task_work, TWA_RESUME); + if (unlikely(ret)) { + struct task_struct *tsk; + + tsk = io_wq_get_task(req->ctx->io_wq); + task_work_add(tsk, &req->task_work, 0); + } + } +} + static bool io_link_cancel_timeout(struct io_kiocb *req) { struct io_ring_ctx *ctx = req->ctx; @@ -1954,7 +1997,7 @@ static void io_req_free_batch(struct req_batch *rb, struct io_kiocb *req) req->flags &= ~REQ_F_TASK_PINNED; }
- io_dismantle_req(req); + WARN_ON_ONCE(io_dismantle_req(req)); rb->reqs[rb->to_free++] = req; if (unlikely(rb->to_free == ARRAY_SIZE(rb->reqs))) __io_req_free_batch_flush(req->ctx, rb);
From: Pavel Begunkov <asml.silence@gmail.com>

mainline inclusion
from mainline-5.9-rc2
commit bb175342aa64e6c6f1d04f5235502121d6ff0247
category: feature
bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27
CVE: NA

---------------------------
Setting and clearing REQ_F_OVERFLOW in io_uring_cancel_files() and io_cqring_overflow_flush() are racy, because they might be called asynchronously.
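The danger is the classic lost-update race: req->flags is modified with plain read-modify-write operations, so two contexts updating different bits can overwrite each other's store. A hypothetical userspace demo of this (names and structure are illustrative only, not taken from the kernel):

/*
 * Demo: unsynchronised |= and &= on a shared flags word lose updates.
 * Build with: cc -O2 -pthread race.c
 */
#include <pthread.h>
#include <stdio.h>

#define FLAG_A (1u << 0)	/* set by one thread */
#define FLAG_B (1u << 1)	/* cleared by the other */

static unsigned int flags = FLAG_B;	/* shared, intentionally unlocked */

static void *set_a(void *arg)
{
	(void)arg;
	for (int i = 0; i < 1000000; i++)
		flags |= FLAG_A;	/* load, OR, store: not atomic */
	return NULL;
}

static void *clear_b(void *arg)
{
	(void)arg;
	for (int i = 0; i < 1000000; i++)
		flags &= ~FLAG_B;	/* may store a value missing FLAG_A */
	return NULL;
}

int main(void)
{
	pthread_t t1, t2;

	pthread_create(&t1, NULL, set_a, NULL);
	pthread_create(&t2, NULL, clear_b, NULL);
	pthread_join(t1, NULL);
	pthread_join(t2, NULL);
	/* expected 0x1; occasionally prints 0x0 when an update is lost */
	printf("flags = %#x\n", flags);
	return 0;
}

Run repeatedly, it will sometimes print 0x0: clear_b() loaded flags before set_a()'s final store and wrote the stale value back last. Removing the contended flag, as this patch does, sidesteps the problem entirely.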
The REQ_F_OVERFLOW flag is only needed for files cancellation, so if it can be guaranteed that requests _currently_ marked inflight can't be overflown, the problem can be solved by removing the flag altogether.
That's how the patch works: it removes the inflight status of a request in io_cqring_fill_event() whenever the request is about to be thrown into the CQ-overflow list. That's OK to do, because no opcode-specific handling can happen after io_cqring_fill_event(), the same assumption as with the "struct io_completion" patches. And there is already a good place for such cleanups, namely io_clean_op(). A nice side effect of this is removing the inflight check from the hot path.
Note on synchronisation: __io_cqring_fill_event() may now take two spinlocks simultaneously, completion_lock and inflight_lock. That's fine, because we never take them in the reverse order, and CQ-overflow of inflight requests shouldn't happen often.
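The "never in the reverse order" clause is doing the real work there: nesting two locks is deadlock-free only under a single global acquisition order. A minimal userspace sketch of the invariant (hypothetical names; the real code uses irq-safe spinlocks, not mutexes):

/* Sketch: a consistent lock order prevents ABBA deadlock. */
#include <pthread.h>

static pthread_mutex_t completion_lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_mutex_t inflight_lock = PTHREAD_MUTEX_INITIALIZER;

/* models overflowing an inflight request in the fill-event path */
static void fill_event_overflow(void)
{
	pthread_mutex_lock(&completion_lock);	/* always outer */
	pthread_mutex_lock(&inflight_lock);	/* always inner */
	/* ... move request to overflow list, drop inflight status ... */
	pthread_mutex_unlock(&inflight_lock);
	pthread_mutex_unlock(&completion_lock);
}

/*
 * Any other path needing both locks must also take completion_lock
 * first; a path taking inflight_lock first could hold it while waiting
 * for completion_lock, deadlocking against fill_event_overflow().
 */
int main(void)
{
	fill_event_overflow();
	return 0;
}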
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: yangerkun <yangerkun@huawei.com>
Reviewed-by: zhangyi (F) <yi.zhang@huawei.com>
Signed-off-by: Cheng Jian <cj.chengjian@huawei.com>
---
 fs/io_uring.c | 61 ++++++++++++++------------------------------------
 1 file changed, 17 insertions(+), 44 deletions(-)
diff --git a/fs/io_uring.c b/fs/io_uring.c
index 0e3ec6bbcae7..a8f6c5798bae 100644
--- a/fs/io_uring.c
+++ b/fs/io_uring.c
@@ -552,7 +552,6 @@ enum {
 	REQ_F_ISREG_BIT,
 	REQ_F_COMP_LOCKED_BIT,
 	REQ_F_NEED_CLEANUP_BIT,
-	REQ_F_OVERFLOW_BIT,
 	REQ_F_POLLED_BIT,
 	REQ_F_BUFFER_SELECTED_BIT,
 	REQ_F_NO_FILE_TABLE_BIT,
@@ -595,8 +594,6 @@ enum {
 	REQ_F_COMP_LOCKED	= BIT(REQ_F_COMP_LOCKED_BIT),
 	/* needs cleanup */
 	REQ_F_NEED_CLEANUP	= BIT(REQ_F_NEED_CLEANUP_BIT),
-	/* in overflow list */
-	REQ_F_OVERFLOW		= BIT(REQ_F_OVERFLOW_BIT),
 	/* already went through poll handler */
 	REQ_F_POLLED		= BIT(REQ_F_POLLED_BIT),
 	/* buffer already selected */
@@ -946,7 +943,8 @@ static void io_get_req_task(struct io_kiocb *req)
 
 static inline void io_clean_op(struct io_kiocb *req)
 {
-	if (req->flags & (REQ_F_NEED_CLEANUP | REQ_F_BUFFER_SELECTED))
+	if (req->flags & (REQ_F_NEED_CLEANUP | REQ_F_BUFFER_SELECTED |
+			  REQ_F_INFLIGHT))
 		__io_clean_op(req);
 }
 
@@ -1444,7 +1442,6 @@ static bool io_cqring_overflow_flush(struct io_ring_ctx *ctx, bool force)
 		req = list_first_entry(&ctx->cq_overflow_list, struct io_kiocb,
 						compl.list);
 		list_move(&req->compl.list, &list);
-		req->flags &= ~REQ_F_OVERFLOW;
 		if (cqe) {
 			WRITE_ONCE(cqe->user_data, req->user_data);
 			WRITE_ONCE(cqe->res, req->result);
@@ -1497,7 +1494,6 @@ static void __io_cqring_fill_event(struct io_kiocb *req, long res, long cflags)
 			ctx->rings->sq_flags |= IORING_SQ_CQ_OVERFLOW;
 		}
 		io_clean_op(req);
-		req->flags |= REQ_F_OVERFLOW;
 		req->result = res;
 		req->compl.cflags = cflags;
 		refcount_inc(&req->refs);
@@ -1645,17 +1641,6 @@ static bool io_dismantle_req(struct io_kiocb *req)
 	if (req->file)
 		io_put_file(req, req->file, (req->flags & REQ_F_FIXED_FILE));
 
-	if (req->flags & REQ_F_INFLIGHT) {
-		struct io_ring_ctx *ctx = req->ctx;
-		unsigned long flags;
-
-		spin_lock_irqsave(&ctx->inflight_lock, flags);
-		list_del(&req->inflight_entry);
-		if (waitqueue_active(&ctx->inflight_wait))
-			wake_up(&ctx->inflight_wait);
-		spin_unlock_irqrestore(&ctx->inflight_lock, flags);
-	}
-
 	return io_req_clean_work(req);
 }
 
@@ -5499,6 +5484,18 @@ static void __io_clean_op(struct io_kiocb *req)
 		}
 		req->flags &= ~REQ_F_NEED_CLEANUP;
 	}
+
+	if (req->flags & REQ_F_INFLIGHT) {
+		struct io_ring_ctx *ctx = req->ctx;
+		unsigned long flags;
+
+		spin_lock_irqsave(&ctx->inflight_lock, flags);
+		list_del(&req->inflight_entry);
+		if (waitqueue_active(&ctx->inflight_wait))
+			wake_up(&ctx->inflight_wait);
+		spin_unlock_irqrestore(&ctx->inflight_lock, flags);
+		req->flags &= ~REQ_F_INFLIGHT;
+	}
 }
 
 static int io_issue_sqe(struct io_kiocb *req, const struct io_uring_sqe *sqe,
@@ -8012,33 +8009,9 @@ static void io_uring_cancel_files(struct io_ring_ctx *ctx,
 		/* We need to keep going until we don't find a matching req */
 		if (!cancel_req)
 			break;
-
-		if (cancel_req->flags & REQ_F_OVERFLOW) {
-			spin_lock_irq(&ctx->completion_lock);
-			list_del(&cancel_req->compl.list);
-			cancel_req->flags &= ~REQ_F_OVERFLOW;
-
-			io_cqring_mark_overflow(ctx);
-			WRITE_ONCE(ctx->rings->cq_overflow,
-				   atomic_inc_return(&ctx->cached_cq_overflow));
-			io_commit_cqring(ctx);
-			spin_unlock_irq(&ctx->completion_lock);
-
-			/*
-			 * Put inflight ref and overflow ref. If that's
-			 * all we had, then we're done with this request.
-			 */
-			if (refcount_sub_and_test(2, &cancel_req->refs)) {
-				io_free_req(cancel_req);
-				finish_wait(&ctx->inflight_wait, &wait);
-				continue;
-			}
-		} else {
-			/* cancel this request, or head link requests */
-			io_attempt_cancel(ctx, cancel_req);
-			io_put_req(cancel_req);
-		}
-
+		/* cancel this request, or head link requests */
+		io_attempt_cancel(ctx, cancel_req);
+		io_put_req(cancel_req);
 		schedule();
 		finish_wait(&ctx->inflight_wait, &wait);
 	}