[PATCH OLK-6.6 00/28] ext4 hardware atomic write support
This patch set adds support for ext4 hardware atomic writes.

Alan Adamson (1):
  nvme: Atomic write support

John Garry (10):
  block: Pass blk_queue_get_max_sectors() a request pointer
  block: Generalize chunk_sectors support as boundary support
  block: Add core atomic write support
  block: Call blkdev_dio_unaligned() from blkdev_direct_IO()
  block: Add fops atomic write support
  block/fs: Pass an iocb to generic_atomic_write_valid()
  fs/block: Check for IOCB_DIRECT in generic_atomic_write_valid()
  block: Add bdev atomic write limits helpers
  fs: Export generic_atomic_write_valid()
  fs: iomap: Atomic write support

Long Li (3):
  block: fix kabi broken in struct queue_limits
  fs: fix kabi broken in struct kstat
  block: fix kabi broken in enum req_flag_bits

Prasad Singamsetty (2):
  fs: Initial atomic write support
  fs: Add initial atomic write support info to statx

Ritesh Harjani (IBM) (12):
  ext4: Add statx support for atomic writes
  ext4: Check for atomic writes support in write iter
  ext4: Support setting FMODE_CAN_ATOMIC_WRITE
  ext4: Do not fallback to buffered-io for DIO atomic write
  ext4: Document an edge case for overwrites
  ext4: Check if inode uses extents in ext4_inode_can_atomic_write()
  ext4: Make ext4_meta_trans_blocks() non-static for later use
  ext4: Add support for EXT4_GET_BLOCKS_QUERY_LEAF_BLOCKS
  ext4: Add multi-fsblock atomic write support with bigalloc
  ext4: Enable support for ext4 multi-fsblock atomic write using bigalloc
  ext4: Add atomic block write documentation
  iomap: Lift blocksize restriction on atomic writes

 Documentation/ABI/stable/sysfs-block           |  53 +++
 .../filesystems/ext4/atomic_writes.rst         | 225 ++++++++++++
 Documentation/filesystems/ext4/overview.rst    |   1 +
 block/blk-core.c                               |  19 +
 block/blk-merge.c                              |  67 +++-
 block/blk-mq.c                                 |   2 +-
 block/blk-settings.c                           |  87 +++++
 block/blk-sysfs.c                              |  33 ++
 block/blk.h                                    |   9 +-
 block/fops.c                                   |  51 ++-
 drivers/md/dm.c                                |   2 +-
 drivers/nvme/host/core.c                       |  54 ++-
 fs/aio.c                                       |   8 +-
 fs/btrfs/ioctl.c                               |   2 +-
 fs/ext4/ext4.h                                 |  35 +-
 fs/ext4/extents.c                              |  99 +++++
 fs/ext4/file.c                                 |  31 +-
 fs/ext4/inode.c                                | 346 ++++++++++++++++--
 fs/ext4/super.c                                |  34 ++
 fs/iomap/direct-io.c                           |  38 +-
 fs/iomap/trace.h                               |   3 +-
 fs/read_write.c                                |  22 +-
 fs/stat.c                                      |  34 ++
 include/linux/blk_types.h                      |   8 +-
 include/linux/blkdev.h                         |  84 ++++-
 include/linux/fs.h                             |  18 +-
 include/linux/iomap.h                          |   1 +
 include/linux/stat.h                           |   5 +-
 include/uapi/linux/fs.h                        |   5 +-
 include/uapi/linux/stat.h                      |  11 +-
 io_uring/rw.c                                  |   8 +-
 31 files changed, 1308 insertions(+), 87 deletions(-)
 create mode 100644 Documentation/filesystems/ext4/atomic_writes.rst

-- 
2.39.2
From: John Garry <john.g.garry@oracle.com>

mainline inclusion
from mainline-v6.10-rc3
commit 8d1dfd51c84e202df05a999ce82cb27554f7d152
category: feature
bugzilla: https://atomgit.com/openeuler/kernel/issues/8862
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i...

--------------------------------

Currently blk_queue_get_max_sectors() is passed an enum req_op. In future
the value returned from blk_queue_get_max_sectors() may depend on certain
request flags, so pass a request pointer.

Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Keith Busch <kbusch@kernel.org>
Reviewed-by: Luis Chamberlain <mcgrof@kernel.org>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: John Garry <john.g.garry@oracle.com>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Acked-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Link: https://lore.kernel.org/r/20240620125359.2684798-2-john.g.garry@oracle.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Long Li <leo.lilong@huawei.com>
---
 block/blk-merge.c | 3 ++-
 block/blk-mq.c    | 2 +-
 block/blk.h       | 6 ++++--
 3 files changed, 7 insertions(+), 4 deletions(-)

diff --git a/block/blk-merge.c b/block/blk-merge.c
index 5e753ba8ef5d..d8f64e4b0ebc 100644
--- a/block/blk-merge.c
+++ b/block/blk-merge.c
@@ -604,7 +604,8 @@ static inline unsigned int blk_rq_get_max_sectors(struct request *rq,
 	if (blk_rq_is_passthrough(rq))
 		return q->limits.max_hw_sectors;
 
-	max_sectors = blk_queue_get_max_sectors(q, req_op(rq));
+	max_sectors = blk_queue_get_max_sectors(rq);
+
 	if (!q->limits.chunk_sectors ||
 	    req_op(rq) == REQ_OP_DISCARD ||
 	    req_op(rq) == REQ_OP_SECURE_ERASE)
diff --git a/block/blk-mq.c b/block/blk-mq.c
index 8b52aa4f371f..ed2ddd17841b 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -3151,7 +3151,7 @@ void blk_mq_submit_bio(struct bio *bio)
 blk_status_t blk_insert_cloned_request(struct request *rq)
 {
 	struct request_queue *q = rq->q;
-	unsigned int max_sectors = blk_queue_get_max_sectors(q, req_op(rq));
+	unsigned int max_sectors = blk_queue_get_max_sectors(rq);
 	unsigned int max_segments = blk_rq_get_max_segments(rq);
 	blk_status_t ret;
 
diff --git a/block/blk.h b/block/blk.h
index 0c469a84c9e0..09ae7a31d081 100644
--- a/block/blk.h
+++ b/block/blk.h
@@ -168,9 +168,11 @@ static inline unsigned int blk_rq_get_max_segments(struct request *rq)
 	return queue_max_segments(rq->q);
 }
 
-static inline unsigned int blk_queue_get_max_sectors(struct request_queue *q,
-						     enum req_op op)
+static inline unsigned int blk_queue_get_max_sectors(struct request *rq)
 {
+	struct request_queue *q = rq->q;
+	enum req_op op = req_op(rq);
+
 	if (unlikely(op == REQ_OP_DISCARD ||
 		     op == REQ_OP_SECURE_ERASE))
 		return min(q->limits.max_discard_sectors,
 			   UINT_MAX >> SECTOR_SHIFT);
-- 
2.39.2
From: John Garry <john.g.garry@oracle.com>

mainline inclusion
from mainline-v6.10-rc3
commit f70167a7a6e7e8a6911f3a216dc044cbfe7c1983
category: feature
bugzilla: https://atomgit.com/openeuler/kernel/issues/8862
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i...

--------------------------------

The purpose of the chunk_sectors limit is to ensure that a mergeable
request fits within the boundary of the chunk_sectors value. Such a
feature will be useful for other request_queue boundary limits, so
generalize the chunk_sectors merge code.

This idea was proposed by Hannes Reinecke.

Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: John Garry <john.g.garry@oracle.com>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Acked-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Link: https://lore.kernel.org/r/20240620125359.2684798-3-john.g.garry@oracle.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Conflicts:
	include/linux/blkdev.h
Signed-off-by: Long Li <leo.lilong@huawei.com>
---
 block/blk-merge.c      | 20 ++++++++++++++------
 drivers/md/dm.c        |  2 +-
 include/linux/blkdev.h | 13 +++++++------
 3 files changed, 22 insertions(+), 13 deletions(-)

diff --git a/block/blk-merge.c b/block/blk-merge.c
index d8f64e4b0ebc..373d7410fdad 100644
--- a/block/blk-merge.c
+++ b/block/blk-merge.c
@@ -158,6 +158,11 @@ static struct bio *bio_split_write_zeroes(struct bio *bio,
 	return bio_split(bio, lim->max_write_zeroes_sectors, GFP_NOIO, bs);
 }
 
+static inline unsigned int blk_boundary_sectors(const struct queue_limits *lim)
+{
+	return lim->chunk_sectors;
+}
+
 /*
  * Return the maximum number of sectors from the start of a bio that may be
  * submitted as a single request to a block device. If enough sectors remain,
@@ -171,12 +176,13 @@ static inline unsigned get_max_io_size(struct bio *bio,
 {
 	unsigned pbs = lim->physical_block_size >> SECTOR_SHIFT;
 	unsigned lbs = lim->logical_block_size >> SECTOR_SHIFT;
+	unsigned boundary_sectors = blk_boundary_sectors(lim);
 	unsigned max_sectors = lim->max_sectors, start, end;
 
-	if (lim->chunk_sectors) {
+	if (boundary_sectors) {
 		max_sectors = min(max_sectors,
-			blk_chunk_sectors_left(bio->bi_iter.bi_sector,
-					lim->chunk_sectors));
+			blk_boundary_sectors_left(bio->bi_iter.bi_sector,
+					boundary_sectors));
 	}
 
 	start = bio->bi_iter.bi_sector & (pbs - 1);
@@ -599,19 +605,21 @@ static inline unsigned int blk_rq_get_max_sectors(struct request *rq,
 						  sector_t offset)
 {
 	struct request_queue *q = rq->q;
-	unsigned int max_sectors;
+	struct queue_limits *lim = &q->limits;
+	unsigned int max_sectors, boundary_sectors;
 
 	if (blk_rq_is_passthrough(rq))
 		return q->limits.max_hw_sectors;
 
+	boundary_sectors = blk_boundary_sectors(lim);
 	max_sectors = blk_queue_get_max_sectors(rq);
 
-	if (!q->limits.chunk_sectors ||
+	if (!boundary_sectors ||
 	    req_op(rq) == REQ_OP_DISCARD ||
 	    req_op(rq) == REQ_OP_SECURE_ERASE)
 		return max_sectors;
 
 	return min(max_sectors,
-		   blk_chunk_sectors_left(offset, q->limits.chunk_sectors));
+		   blk_boundary_sectors_left(offset, boundary_sectors));
 }
 
 static inline int ll_new_hw_segment(struct request *req, struct bio *bio,
diff --git a/drivers/md/dm.c b/drivers/md/dm.c
index 6b16f64be9bd..3862af2f64ad 100644
--- a/drivers/md/dm.c
+++ b/drivers/md/dm.c
@@ -1177,7 +1177,7 @@ static sector_t __max_io_len(struct dm_target *ti, sector_t sector,
 		return len;
 	return min_t(sector_t, len,
 		min(max_sectors ? : queue_max_sectors(ti->table->md->queue),
-		    blk_chunk_sectors_left(target_offset, max_granularity)));
+		    blk_boundary_sectors_left(target_offset, max_granularity)));
 }
 
 static inline sector_t max_io_len(struct dm_target *ti, sector_t sector)
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index 32f9f1cc7d3c..5e8eeb2d299b 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -941,14 +941,15 @@ static inline unsigned int bio_zone_is_seq(struct bio *bio)
 }
 
 /*
- * Return how much of the chunk is left to be used for I/O at a given offset.
+ * Return how much within the boundary is left to be used for I/O at a given
+ * offset.
  */
-static inline unsigned int blk_chunk_sectors_left(sector_t offset,
-		unsigned int chunk_sectors)
+static inline unsigned int blk_boundary_sectors_left(sector_t offset,
+		unsigned int boundary_sectors)
 {
-	if (unlikely(!is_power_of_2(chunk_sectors)))
-		return chunk_sectors - sector_div(offset, chunk_sectors);
-	return chunk_sectors - (offset & (chunk_sectors - 1));
+	if (unlikely(!is_power_of_2(boundary_sectors)))
+		return boundary_sectors - sector_div(offset, boundary_sectors);
+	return boundary_sectors - (offset & (boundary_sectors - 1));
 }
 
 /*
-- 
2.39.2
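The arithmetic in blk_boundary_sectors_left() above is easy to sanity-check in isolation. The following is a minimal userspace re-statement of that computation, not kernel code; it substitutes the % operator for the in-kernel sector_div() helper:

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

/* Userspace re-statement of blk_boundary_sectors_left() for illustration. */
static bool is_power_of_2(uint64_t n)
{
	return n != 0 && (n & (n - 1)) == 0;
}

static uint64_t boundary_sectors_left(uint64_t offset, uint64_t boundary_sectors)
{
	if (!is_power_of_2(boundary_sectors))
		return boundary_sectors - (offset % boundary_sectors);
	/* Power-of-2 fast path: mask off the low bits instead of dividing. */
	return boundary_sectors - (offset & (boundary_sectors - 1));
}

int main(void)
{
	/* 256-sector boundary, I/O at sector 1000: 232 sectors used, 24 left. */
	printf("%llu\n", (unsigned long long)boundary_sectors_left(1000, 256));
	return 0;
}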
From: Prasad Singamsetty <prasad.singamsetty@oracle.com>

mainline inclusion
from mainline-v6.10-rc3
commit c34fc6f26ab86d03a2d47446f42b6cd492dfdc56
category: feature
bugzilla: https://atomgit.com/openeuler/kernel/issues/8862
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i...

--------------------------------

An atomic write is a write issued with torn-write protection, meaning that
for a power failure or any other hardware failure, all or none of the data
from the write will be stored, but never a mix of old and new data.

Userspace may add flag RWF_ATOMIC to pwritev2() to indicate that the write
is to be issued with torn-write prevention, according to special alignment
and length rules.

For any syscall interface utilizing struct iocb, add IOCB_ATOMIC to the
iocb->ki_flags field to indicate the same.

A call to statx will give the relevant atomic write info for a file:
- atomic_write_unit_min
- atomic_write_unit_max
- atomic_write_segments_max

Both min and max values must be a power-of-2.

Applications can avail of the atomic write feature by ensuring that the
total length of a write is a power-of-2 in size and also sized between
atomic_write_unit_min and atomic_write_unit_max, inclusive. Applications
must ensure that the write is at a naturally-aligned offset in the file
wrt the total write length. The value in atomic_write_segments_max
indicates the upper limit for IOV_ITER iovcnt.

Add file mode flag FMODE_CAN_ATOMIC_WRITE, so files which do not have the
flag set will have RWF_ATOMIC rejected and not just ignored.

Add a type argument to kiocb_set_rw_flags() to allow reads which have
RWF_ATOMIC set to be rejected.

Helper function generic_atomic_write_valid() can be used by FSes to verify
compliant writes. There we check that the iov_iter type is ubuf, which
implies iovcnt==1 for pwritev2(), which is an initial restriction for
atomic_write_segments_max.

Initially the only user will be the bdev file operations write handler. We
will rely on the block BIO submission path to ensure write sizes are
compliant for the bdev, so we don't need to check atomic write sizes yet.

Signed-off-by: Prasad Singamsetty <prasad.singamsetty@oracle.com>
jpg: merge into single patch and much rewrite
Acked-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: John Garry <john.g.garry@oracle.com>
Reviewed-by: Darrick J.
Wong <djwong@kernel.org> Link: https://lore.kernel.org/r/20240620125359.2684798-4-john.g.garry@oracle.com Signed-off-by: Jens Axboe <axboe@kernel.dk> Conflicts: fs/read_write.c include/linux/fs.h include/uapi/linux/fs.h io_uring/rw.c Signed-off-by: Long Li <leo.lilong@huawei.com> --- fs/aio.c | 8 ++++---- fs/btrfs/ioctl.c | 2 +- fs/read_write.c | 18 +++++++++++++++++- include/linux/fs.h | 15 ++++++++++++++- include/uapi/linux/fs.h | 5 ++++- io_uring/rw.c | 8 ++++---- 6 files changed, 44 insertions(+), 12 deletions(-) diff --git a/fs/aio.c b/fs/aio.c index c3614193d749..3b5d6f65a758 100644 --- a/fs/aio.c +++ b/fs/aio.c @@ -1467,7 +1467,7 @@ static void aio_complete_rw(struct kiocb *kiocb, long res) iocb_put(iocb); } -static int aio_prep_rw(struct kiocb *req, const struct iocb *iocb) +static int aio_prep_rw(struct kiocb *req, const struct iocb *iocb, int rw_type) { int ret; @@ -1493,7 +1493,7 @@ static int aio_prep_rw(struct kiocb *req, const struct iocb *iocb) } else req->ki_ioprio = get_current_ioprio(); - ret = kiocb_set_rw_flags(req, iocb->aio_rw_flags); + ret = kiocb_set_rw_flags(req, iocb->aio_rw_flags, rw_type); if (unlikely(ret)) return ret; @@ -1545,7 +1545,7 @@ static int aio_read(struct kiocb *req, const struct iocb *iocb, struct file *file; int ret; - ret = aio_prep_rw(req, iocb); + ret = aio_prep_rw(req, iocb, READ); if (ret) return ret; file = req->ki_filp; @@ -1572,7 +1572,7 @@ static int aio_write(struct kiocb *req, const struct iocb *iocb, struct file *file; int ret; - ret = aio_prep_rw(req, iocb); + ret = aio_prep_rw(req, iocb, WRITE); if (ret) return ret; file = req->ki_filp; diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c index 3d1cb018f3c4..5487b35bc78e 100644 --- a/fs/btrfs/ioctl.c +++ b/fs/btrfs/ioctl.c @@ -4569,7 +4569,7 @@ static int btrfs_ioctl_encoded_write(struct file *file, void __user *argp, bool goto out_iov; init_sync_kiocb(&kiocb, file); - ret = kiocb_set_rw_flags(&kiocb, 0); + ret = kiocb_set_rw_flags(&kiocb, 0, WRITE); if (ret) goto out_iov; kiocb.ki_pos = pos; diff --git a/fs/read_write.c b/fs/read_write.c index 27a7729b82f3..c3d0c8d7dd24 100644 --- a/fs/read_write.c +++ b/fs/read_write.c @@ -726,7 +726,7 @@ static ssize_t do_iter_readv_writev(struct file *filp, struct iov_iter *iter, ssize_t ret; init_sync_kiocb(&kiocb, filp); - ret = kiocb_set_rw_flags(&kiocb, flags); + ret = kiocb_set_rw_flags(&kiocb, flags, type); if (ret) return ret; kiocb.ki_pos = (ppos ? 
*ppos : 0); @@ -1766,6 +1766,22 @@ int generic_file_rw_checks(struct file *file_in, struct file *file_out) return 0; } +bool generic_atomic_write_valid(struct iov_iter *iter, loff_t pos) +{ + size_t len = iov_iter_count(iter); + + if (!iter_is_ubuf(iter)) + return false; + + if (!is_power_of_2(len)) + return false; + + if (!IS_ALIGNED(pos, len)) + return false; + + return true; +} + #ifdef CONFIG_BPF_READAHEAD static void fs_file_read_ctx_init(struct fs_file_read_ctx *ctx, struct file *filp, loff_t pos) diff --git a/include/linux/fs.h b/include/linux/fs.h index c154746ca38e..10abdc255459 100644 --- a/include/linux/fs.h +++ b/include/linux/fs.h @@ -127,6 +127,8 @@ typedef int (dio_iodone_t)(struct kiocb *iocb, loff_t offset, #define FMODE_PWRITE ((__force fmode_t)0x10) /* File is opened for execution with sys_execve / sys_uselib */ #define FMODE_EXEC ((__force fmode_t)0x20) +/* File supports atomic writes */ +#define FMODE_CAN_ATOMIC_WRITE ((__force fmode_t)(0x80)) /* 32bit hashes as llseek() offset (for directories) */ #define FMODE_32BITHASH ((__force fmode_t)0x200) /* 64bit hashes as llseek() offset (for directories) */ @@ -347,6 +349,7 @@ enum rw_hint { #define IOCB_SYNC (__force int) RWF_SYNC #define IOCB_NOWAIT (__force int) RWF_NOWAIT #define IOCB_APPEND (__force int) RWF_APPEND +#define IOCB_ATOMIC (__force int) RWF_ATOMIC /* non-RWF related bits - start at 16 */ #define IOCB_EVENTFD (1 << 16) @@ -381,6 +384,7 @@ enum rw_hint { { IOCB_SYNC, "SYNC" }, \ { IOCB_NOWAIT, "NOWAIT" }, \ { IOCB_APPEND, "APPEND" }, \ + { IOCB_ATOMIC, "ATOMIC"}, \ { IOCB_EVENTFD, "EVENTFD"}, \ { IOCB_DIRECT, "DIRECT" }, \ { IOCB_WRITE, "WRITE" }, \ @@ -3409,7 +3413,8 @@ static inline int iocb_flags(struct file *file) return res; } -static inline int kiocb_set_rw_flags(struct kiocb *ki, rwf_t flags) +static inline int kiocb_set_rw_flags(struct kiocb *ki, rwf_t flags, + int rw_type) { int kiocb_flags = 0; @@ -3426,6 +3431,12 @@ static inline int kiocb_set_rw_flags(struct kiocb *ki, rwf_t flags) return -EOPNOTSUPP; kiocb_flags |= IOCB_NOIO; } + if (flags & RWF_ATOMIC) { + if (rw_type != WRITE) + return -EOPNOTSUPP; + if (!(ki->ki_filp->f_mode & FMODE_CAN_ATOMIC_WRITE)) + return -EOPNOTSUPP; + } kiocb_flags |= (__force int) (flags & RWF_SUPPORTED); if (flags & RWF_SYNC) kiocb_flags |= IOCB_DSYNC; @@ -3658,4 +3669,6 @@ static inline bool cachefiles_ondemand_is_enabled(void) } #endif +bool generic_atomic_write_valid(struct iov_iter *iter, loff_t pos); + #endif /* _LINUX_FS_H */ diff --git a/include/uapi/linux/fs.h b/include/uapi/linux/fs.h index b7b56871029c..d88bb0c3a048 100644 --- a/include/uapi/linux/fs.h +++ b/include/uapi/linux/fs.h @@ -301,8 +301,11 @@ typedef int __bitwise __kernel_rwf_t; /* per-IO O_APPEND */ #define RWF_APPEND ((__force __kernel_rwf_t)0x00000010) +/* Atomic Write */ +#define RWF_ATOMIC ((__force __kernel_rwf_t)0x00000040) + /* mask of flags supported by the kernel */ #define RWF_SUPPORTED (RWF_HIPRI | RWF_DSYNC | RWF_SYNC | RWF_NOWAIT |\ - RWF_APPEND) + RWF_APPEND | RWF_ATOMIC) #endif /* _UAPI_LINUX_FS_H */ diff --git a/io_uring/rw.c b/io_uring/rw.c index 4cac3db0ef29..5eca187c263c 100644 --- a/io_uring/rw.c +++ b/io_uring/rw.c @@ -668,7 +668,7 @@ static bool need_complete_io(struct io_kiocb *req) S_ISBLK(file_inode(req->file)->i_mode); } -static int io_rw_init_file(struct io_kiocb *req, fmode_t mode) +static int io_rw_init_file(struct io_kiocb *req, fmode_t mode, int rw_type) { struct io_rw *rw = io_kiocb_to_cmd(req, struct io_rw); struct kiocb *kiocb = &rw->kiocb; @@ -683,7 +683,7 @@ 
static int io_rw_init_file(struct io_kiocb *req, fmode_t mode) req->flags |= io_file_get_flags(file); kiocb->ki_flags = file->f_iocb_flags; - ret = kiocb_set_rw_flags(kiocb, rw->flags); + ret = kiocb_set_rw_flags(kiocb, rw->flags, rw_type); if (unlikely(ret)) return ret; kiocb->ki_flags |= IOCB_ALLOC_CACHE; @@ -751,7 +751,7 @@ static int __io_read(struct io_kiocb *req, unsigned int issue_flags) iov_iter_restore(&s->iter, &s->iter_state); iovec = NULL; } - ret = io_rw_init_file(req, FMODE_READ); + ret = io_rw_init_file(req, FMODE_READ, READ); if (unlikely(ret)) { kfree(iovec); return ret; @@ -926,7 +926,7 @@ int io_write(struct io_kiocb *req, unsigned int issue_flags) iov_iter_restore(&s->iter, &s->iter_state); iovec = NULL; } - ret = io_rw_init_file(req, FMODE_WRITE); + ret = io_rw_init_file(req, FMODE_WRITE, WRITE); if (unlikely(ret)) { kfree(iovec); return ret; -- 2.39.2
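From userspace, the new RWF_ATOMIC flag is passed to pwritev2(). Below is a minimal sketch of a compliant caller, assuming a kernel and device that support atomic writes once the rest of this series is applied; the device path, the 4 KiB write unit, and offset 0 are placeholder values, and a real application should take the unit sizes from statx:

#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/uio.h>
#include <unistd.h>

#ifndef RWF_ATOMIC
#define RWF_ATOMIC 0x00000040	/* value from the uapi hunk above */
#endif

int main(void)
{
	int fd = open("/dev/nvme0n1", O_WRONLY | O_DIRECT);	/* placeholder path */
	size_t len = 4096;	/* power-of-2 total length, placeholder unit */
	struct iovec iov;
	void *buf;
	ssize_t ret;

	if (fd < 0) {
		perror("open");
		return 1;
	}
	if (posix_memalign(&buf, 4096, len)) {
		close(fd);
		return 1;
	}
	memset(buf, 0xab, len);
	iov.iov_base = buf;
	iov.iov_len = len;

	/*
	 * iovcnt == 1 honours the initial atomic_write_segments_max
	 * restriction; offset 0 is naturally aligned for any power-of-2 length.
	 */
	ret = pwritev2(fd, &iov, 1, 0, RWF_ATOMIC);
	if (ret < 0)
		perror("pwritev2(RWF_ATOMIC)");	/* EOPNOTSUPP without FMODE_CAN_ATOMIC_WRITE */

	free(buf);
	close(fd);
	return ret == (ssize_t)len ? 0 : 1;
}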
From: John Garry <john.g.garry@oracle.com>

mainline inclusion
from mainline-v6.10-rc3
commit 9da3d1e912f3953196e66991d75208cde3e845e1
category: feature
bugzilla: https://atomgit.com/openeuler/kernel/issues/8862
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i...

--------------------------------

Add atomic write support, as follows:
- add helper functions to get request_queue atomic write limits
- report request_queue atomic write support limits to sysfs and update Doc
- support to safely merge atomic writes
- deal with splitting atomic writes
- misc helper functions
- add a per-request atomic write flag

New request_queue limits are added, as follows:
- atomic_write_hw_max is set by the block driver and is the maximum length
  of an atomic write which the device may support. It is not necessarily a
  power-of-2.
- atomic_write_max_sectors is derived from atomic_write_hw_max_sectors and
  max_hw_sectors. It is always a power-of-2. Atomic writes may be merged,
  and atomic_write_max_sectors would be the limit on a merged atomic write
  request size. This value is not capped at max_sectors, as the value in
  max_sectors can be controlled from userspace, and it would only cause
  trouble if userspace could limit atomic_write_unit_max_bytes and the
  other atomic write limits.
- atomic_write_hw_unit_{min,max} are set by the block driver and are the
  min/max length of an atomic write unit which the device may support.
  They both must be a power-of-2. Typically atomic_write_hw_unit_max will
  hold the same value as atomic_write_hw_max.
- atomic_write_unit_{min,max} are derived from
  atomic_write_hw_unit_{min,max}, max_hw_sectors, and block core limits.
  Both min and max values must be a power-of-2.
- atomic_write_hw_boundary is set by the block driver. If non-zero, it
  indicates an LBA space boundary; an atomic write which straddles it is
  no longer executed atomically by the disk. The value must be a
  power-of-2. Note that it would be acceptable to enforce a rule that
  atomic_write_hw_boundary_sectors is a multiple of
  atomic_write_hw_unit_max, but the resultant code would be more
  complicated.

All atomic write limits are set to 0 by default to indicate no atomic
write support. Even though it is assumed by Linux that a logical block can
always be atomically written, we ignore this as it is not of particular
interest. Stacked devices are just not supported either for now.

An atomic write must always be submitted to the block driver as part of a
single request. As such, only a single BIO must be submitted to the block
layer for an atomic write. When a single atomic write BIO is submitted, it
cannot be split. As such, atomic_write_unit_{max,min}_bytes are limited by
the maximum guaranteed BIO size which will not be required to be split.
This max size is calculated by request_queue max segments and the number
of bvecs a BIO can fit, BIO_MAX_VECS. Currently we rely on userspace
issuing a write with iovcnt=1 for pwritev2() - as such, we can rely on
each segment containing PAGE_SIZE of data, apart from the first+last,
which can each hold a logical block size of data. The first+last will be
LBS length/aligned as we rely on direct IO alignment rules also.
New sysfs files are added to report the following atomic write limits: - atomic_write_unit_max_bytes - same as atomic_write_unit_max_sectors in bytes - atomic_write_unit_min_bytes - same as atomic_write_unit_min_sectors in bytes - atomic_write_boundary_bytes - same as atomic_write_hw_boundary_sectors in bytes - atomic_write_max_bytes - same as atomic_write_max_sectors in bytes Atomic writes may only be merged with other atomic writes and only under the following conditions: - total resultant request length <= atomic_write_max_bytes - the merged write does not straddle a boundary Helper function bdev_can_atomic_write() is added to indicate whether atomic writes may be issued to a bdev. If a bdev is a partition, the partition start must be aligned with both atomic_write_unit_min_sectors and atomic_write_hw_boundary_sectors. FSes will rely on the block layer to validate that an atomic write BIO submitted will be of valid size, so add blk_validate_atomic_write_op_size() for this purpose. Userspace expects an atomic write which is of invalid size to be rejected with -EINVAL, so add BLK_STS_INVAL for this. Also use BLK_STS_INVAL for when a BIO needs to be split, as this should mean an invalid size BIO. Flag REQ_ATOMIC is used for indicating an atomic write. Co-developed-by: Himanshu Madhani <himanshu.madhani@oracle.com> Signed-off-by: Himanshu Madhani <himanshu.madhani@oracle.com> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Signed-off-by: John Garry <john.g.garry@oracle.com> Reviewed-by: Keith Busch <kbusch@kernel.org> Link: https://lore.kernel.org/r/20240620125359.2684798-6-john.g.garry@oracle.com Signed-off-by: Jens Axboe <axboe@kernel.dk> Conflicts: block/blk-core.c block/blk-merge.c block/blk-settings.c Signed-off-by: Long Li <leo.lilong@huawei.com> --- Documentation/ABI/stable/sysfs-block | 53 +++++++++++++++++ block/blk-core.c | 19 ++++++ block/blk-merge.c | 50 ++++++++++++++-- block/blk-settings.c | 87 ++++++++++++++++++++++++++++ block/blk-sysfs.c | 33 +++++++++++ block/blk.h | 3 + include/linux/blk_types.h | 8 ++- include/linux/blkdev.h | 55 ++++++++++++++++++ 8 files changed, 303 insertions(+), 5 deletions(-) diff --git a/Documentation/ABI/stable/sysfs-block b/Documentation/ABI/stable/sysfs-block index b59ca4c8fbdf..97fc8034a271 100644 --- a/Documentation/ABI/stable/sysfs-block +++ b/Documentation/ABI/stable/sysfs-block @@ -21,6 +21,59 @@ Description: device is offset from the internal allocation unit's natural alignment. +What: /sys/block/<disk>/atomic_write_max_bytes +Date: February 2024 +Contact: Himanshu Madhani <himanshu.madhani@oracle.com> +Description: + [RO] This parameter specifies the maximum atomic write + size reported by the device. This parameter is relevant + for merging of writes, where a merged atomic write + operation must not exceed this number of bytes. + This parameter may be greater than the value in + atomic_write_unit_max_bytes as + atomic_write_unit_max_bytes will be rounded down to a + power-of-two and atomic_write_unit_max_bytes may also be + limited by some other queue limits, such as max_segments. + This parameter - along with atomic_write_unit_min_bytes + and atomic_write_unit_max_bytes - will not be larger than + max_hw_sectors_kb, but may be larger than max_sectors_kb. + + +What: /sys/block/<disk>/atomic_write_unit_min_bytes +Date: February 2024 +Contact: Himanshu Madhani <himanshu.madhani@oracle.com> +Description: + [RO] This parameter specifies the smallest block which can + be written atomically with an atomic write operation. 
All + atomic write operations must begin at a + atomic_write_unit_min boundary and must be multiples of + atomic_write_unit_min. This value must be a power-of-two. + + +What: /sys/block/<disk>/atomic_write_unit_max_bytes +Date: February 2024 +Contact: Himanshu Madhani <himanshu.madhani@oracle.com> +Description: + [RO] This parameter defines the largest block which can be + written atomically with an atomic write operation. This + value must be a multiple of atomic_write_unit_min and must + be a power-of-two. This value will not be larger than + atomic_write_max_bytes. + + +What: /sys/block/<disk>/atomic_write_boundary_bytes +Date: February 2024 +Contact: Himanshu Madhani <himanshu.madhani@oracle.com> +Description: + [RO] A device may need to internally split an atomic write I/O + which straddles a given logical block address boundary. This + parameter specifies the size in bytes of the atomic boundary if + one is reported by the device. This value must be a + power-of-two and at least the size as in + atomic_write_unit_max_bytes. + Any attempt to merge atomic write I/Os must not result in a + merged I/O which crosses this boundary (if any). + What: /sys/block/<disk>/diskseq Date: February 2021 diff --git a/block/blk-core.c b/block/blk-core.c index aedf1591c155..9ae49ab2b8d3 100644 --- a/block/blk-core.c +++ b/block/blk-core.c @@ -201,6 +201,8 @@ static const struct { /* Command duration limit device-side timeout */ [BLK_STS_DURATION_LIMIT] = { -ETIME, "duration limit exceeded" }, + [BLK_STS_INVAL] = { -EINVAL, "invalid" }, + /* everything else not covered above: */ [BLK_STS_IOERR] = { -EIO, "I/O" }, }; @@ -749,6 +751,18 @@ void submit_bio_noacct_nocheck(struct bio *bio) __submit_bio_noacct(bio); } +static blk_status_t blk_validate_atomic_write_op_size(struct request_queue *q, + struct bio *bio) +{ + if (bio->bi_iter.bi_size > queue_atomic_write_unit_max_bytes(q)) + return BLK_STS_INVAL; + + if (bio->bi_iter.bi_size % queue_atomic_write_unit_min_bytes(q)) + return BLK_STS_INVAL; + + return BLK_STS_OK; +} + /** * submit_bio_noacct - re-submit a bio to the block device layer for I/O * @bio: The bio describing the location in memory and on the device. @@ -806,6 +820,11 @@ void submit_bio_noacct(struct bio *bio) switch (bio_op(bio)) { case REQ_OP_READ: case REQ_OP_WRITE: + if (bio->bi_opf & REQ_ATOMIC) { + status = blk_validate_atomic_write_op_size(q, bio); + if (status != BLK_STS_OK) + goto end_io; + } break; case REQ_OP_FLUSH: /* diff --git a/block/blk-merge.c b/block/blk-merge.c index 373d7410fdad..c9bd534cfe3f 100644 --- a/block/blk-merge.c +++ b/block/blk-merge.c @@ -158,8 +158,16 @@ static struct bio *bio_split_write_zeroes(struct bio *bio, return bio_split(bio, lim->max_write_zeroes_sectors, GFP_NOIO, bs); } -static inline unsigned int blk_boundary_sectors(const struct queue_limits *lim) +static inline unsigned int blk_boundary_sectors(const struct queue_limits *lim, + bool is_atomic) { + /* + * chunk_sectors must be a multiple of atomic_write_boundary_sectors if + * both non-zero. 
+ */ + if (is_atomic && lim->atomic_write_boundary_sectors) + return lim->atomic_write_boundary_sectors; + return lim->chunk_sectors; } @@ -176,8 +184,18 @@ static inline unsigned get_max_io_size(struct bio *bio, { unsigned pbs = lim->physical_block_size >> SECTOR_SHIFT; unsigned lbs = lim->logical_block_size >> SECTOR_SHIFT; - unsigned boundary_sectors = blk_boundary_sectors(lim); - unsigned max_sectors = lim->max_sectors, start, end; + bool is_atomic = bio->bi_opf & REQ_ATOMIC; + unsigned boundary_sectors = blk_boundary_sectors(lim, is_atomic); + unsigned max_sectors, start, end; + + /* + * We ignore lim->max_sectors for atomic writes because it may less + * than the actual bio size, which we cannot tolerate. + */ + if (is_atomic) + max_sectors = lim->atomic_write_max_sectors; + else + max_sectors = lim->max_sectors; if (boundary_sectors) { max_sectors = min(max_sectors, @@ -323,6 +341,11 @@ struct bio *bio_split_rw(struct bio *bio, const struct queue_limits *lim, *segs = nsegs; return NULL; split: + if (bio->bi_opf & REQ_ATOMIC) { + bio->bi_status = BLK_STS_INVAL; + bio_endio(bio); + return ERR_PTR(-EINVAL); + } /* * We can't sanely support splitting for a REQ_NOWAIT bio. End it * with EAGAIN if splitting is required and return an error pointer. @@ -607,11 +630,12 @@ static inline unsigned int blk_rq_get_max_sectors(struct request *rq, struct request_queue *q = rq->q; struct queue_limits *lim = &q->limits; unsigned int max_sectors, boundary_sectors; + bool is_atomic = rq->cmd_flags & REQ_ATOMIC; if (blk_rq_is_passthrough(rq)) return q->limits.max_hw_sectors; - boundary_sectors = blk_boundary_sectors(lim); + boundary_sectors = blk_boundary_sectors(lim, is_atomic); max_sectors = blk_queue_get_max_sectors(rq); if (!boundary_sectors || @@ -818,6 +842,18 @@ static enum elv_merge blk_try_req_merge(struct request *req, return ELEVATOR_NO_MERGE; } +static bool blk_atomic_write_mergeable_rq_bio(struct request *rq, + struct bio *bio) +{ + return (rq->cmd_flags & REQ_ATOMIC) == (bio->bi_opf & REQ_ATOMIC); +} + +static bool blk_atomic_write_mergeable_rqs(struct request *rq, + struct request *next) +{ + return (rq->cmd_flags & REQ_ATOMIC) == (next->cmd_flags & REQ_ATOMIC); +} + /* * For non-mq, this has to be called with the request spinlock acquired. * For mq with scheduling, the appropriate queue wide lock should be held. @@ -837,6 +873,9 @@ static struct request *attempt_merge(struct request_queue *q, if (req->ioprio != next->ioprio) return NULL; + if (!blk_atomic_write_mergeable_rqs(req, next)) + return NULL; + /* * If we are allowed to merge, then append bio list * from next to rq and release next. merge_requests_fn @@ -965,6 +1004,9 @@ bool blk_rq_merge_ok(struct request *rq, struct bio *bio) if (rq->ioprio != bio_prio(bio)) return false; + if (blk_atomic_write_mergeable_rq_bio(rq, bio) == false) + return false; + return true; } diff --git a/block/blk-settings.c b/block/blk-settings.c index a891f27ff834..39739f5a285e 100644 --- a/block/blk-settings.c +++ b/block/blk-settings.c @@ -25,6 +25,92 @@ void blk_queue_rq_timeout(struct request_queue *q, unsigned int timeout) } EXPORT_SYMBOL_GPL(blk_queue_rq_timeout); +/* + * Returns max guaranteed bytes which we can fit in a bio. + * + * We request that an atomic_write is ITER_UBUF iov_iter (so a single vector), + * so we assume that we can fit in at least PAGE_SIZE in a segment, apart from + * the first and last segments. 
+ */ +static +unsigned int blk_queue_max_guaranteed_bio(struct queue_limits *lim) +{ + unsigned int max_segments = min(BIO_MAX_VECS, lim->max_segments); + unsigned int length; + + length = min(max_segments, 2) * lim->logical_block_size; + if (max_segments > 2) + length += (max_segments - 2) * PAGE_SIZE; + + return length; +} + +static void blk_atomic_writes_update_limits(struct queue_limits *lim) +{ + unsigned int unit_limit = min(lim->max_hw_sectors << SECTOR_SHIFT, + blk_queue_max_guaranteed_bio(lim)); + + unit_limit = rounddown_pow_of_two(unit_limit); + + lim->atomic_write_max_sectors = + min(lim->atomic_write_hw_max >> SECTOR_SHIFT, + lim->max_hw_sectors); + lim->atomic_write_unit_min = + min(lim->atomic_write_hw_unit_min, unit_limit); + lim->atomic_write_unit_max = + min(lim->atomic_write_hw_unit_max, unit_limit); + lim->atomic_write_boundary_sectors = + lim->atomic_write_hw_boundary >> SECTOR_SHIFT; +} + +static void blk_validate_atomic_write_limits(struct queue_limits *lim) +{ + unsigned int chunk_sectors = lim->chunk_sectors; + unsigned int boundary_sectors; + + if (!lim->atomic_write_hw_max) + goto unsupported; + + boundary_sectors = lim->atomic_write_hw_boundary >> SECTOR_SHIFT; + + if (boundary_sectors) { + /* + * A feature of boundary support is that it disallows bios to + * be merged which would result in a merged request which + * crosses either a chunk sector or atomic write HW boundary, + * even though chunk sectors may be just set for performance. + * For simplicity, disallow atomic writes for a chunk sector + * which is non-zero and smaller than atomic write HW boundary. + * Furthermore, chunk sectors must be a multiple of atomic + * write HW boundary. Otherwise boundary support becomes + * complicated. + * Devices which do not conform to these rules can be dealt + * with if and when they show up. + */ + if (WARN_ON_ONCE(do_div(chunk_sectors, boundary_sectors))) + goto unsupported; + + /* + * The boundary size just needs to be a multiple of unit_max + * (and not necessarily a power-of-2), so this following check + * could be relaxed in future. + * Furthermore, if needed, unit_max could even be reduced so + * that it is compliant with a !power-of-2 boundary. 
+ */ + if (!is_power_of_2(boundary_sectors)) + goto unsupported; + } + + blk_atomic_writes_update_limits(lim); + return; + +unsupported: + lim->atomic_write_max_sectors = 0; + lim->atomic_write_boundary_sectors = 0; + lim->atomic_write_unit_min = 0; + lim->atomic_write_unit_max = 0; +} + /** * blk_set_default_limits - reset limits to default values * @lim: the queue_limits structure to reset @@ -145,6 +231,7 @@ void blk_queue_max_hw_sectors(struct request_queue *q, unsigned int max_hw_secto max_sectors = round_down(max_sectors, limits->logical_block_size >> SECTOR_SHIFT); limits->max_sectors = max_sectors; + blk_validate_atomic_write_limits(limits); if (!q->disk) return; diff --git a/block/blk-sysfs.c b/block/blk-sysfs.c index c56181999232..89a142955c28 100644 --- a/block/blk-sysfs.c +++ b/block/blk-sysfs.c @@ -119,6 +119,30 @@ static ssize_t queue_max_discard_segments_show(struct request_queue *q, return queue_var_show(queue_max_discard_segments(q), page); } +static ssize_t queue_atomic_write_max_bytes_show(struct request_queue *q, + char *page) +{ + return queue_var_show(queue_atomic_write_max_bytes(q), page); +} + +static ssize_t queue_atomic_write_boundary_show(struct request_queue *q, + char *page) +{ + return queue_var_show(queue_atomic_write_boundary_bytes(q), page); +} + +static ssize_t queue_atomic_write_unit_min_show(struct request_queue *q, + char *page) +{ + return queue_var_show(queue_atomic_write_unit_min_bytes(q), page); +} + +static ssize_t queue_atomic_write_unit_max_show(struct request_queue *q, + char *page) +{ + return queue_var_show(queue_atomic_write_unit_max_bytes(q), page); +} + static ssize_t queue_max_integrity_segments_show(struct request_queue *q, char *page) { return queue_var_show(q->limits.max_integrity_segments, page); @@ -547,6 +571,11 @@ QUEUE_RO_ENTRY(queue_discard_max_hw, "discard_max_hw_bytes"); QUEUE_RW_ENTRY(queue_discard_max, "discard_max_bytes"); QUEUE_RO_ENTRY(queue_discard_zeroes_data, "discard_zeroes_data"); +QUEUE_RO_ENTRY(queue_atomic_write_max_bytes, "atomic_write_max_bytes"); +QUEUE_RO_ENTRY(queue_atomic_write_boundary, "atomic_write_boundary_bytes"); +QUEUE_RO_ENTRY(queue_atomic_write_unit_max, "atomic_write_unit_max_bytes"); +QUEUE_RO_ENTRY(queue_atomic_write_unit_min, "atomic_write_unit_min_bytes"); + QUEUE_RO_ENTRY(queue_write_same_max, "write_same_max_bytes"); QUEUE_RO_ENTRY(queue_write_zeroes_max, "write_zeroes_max_bytes"); QUEUE_RO_ENTRY(queue_zone_append_max, "zone_append_max_bytes"); @@ -672,6 +701,10 @@ static struct attribute *queue_attrs[] = { &queue_discard_max_entry.attr, &queue_discard_max_hw_entry.attr, &queue_discard_zeroes_data_entry.attr, + &queue_atomic_write_max_bytes_entry.attr, + &queue_atomic_write_boundary_entry.attr, + &queue_atomic_write_unit_min_entry.attr, + &queue_atomic_write_unit_max_entry.attr, &queue_write_same_max_entry.attr, &queue_write_zeroes_max_entry.attr, &queue_zone_append_max_entry.attr, diff --git a/block/blk.h b/block/blk.h index 09ae7a31d081..1e157e11a187 100644 --- a/block/blk.h +++ b/block/blk.h @@ -180,6 +180,9 @@ static inline unsigned int blk_queue_get_max_sectors(struct request *rq) if (unlikely(op == REQ_OP_WRITE_ZEROES)) return q->limits.max_write_zeroes_sectors; + if (rq->cmd_flags & REQ_ATOMIC) + return q->limits.atomic_write_max_sectors; + return q->limits.max_sectors; } diff --git a/include/linux/blk_types.h b/include/linux/blk_types.h index 076474806b6b..ef5b12d47235 100644 --- a/include/linux/blk_types.h +++ b/include/linux/blk_types.h @@ -195,6 +195,11 @@ typedef u16 blk_short_t; */ 
#define BLK_STS_DURATION_LIMIT ((__force blk_status_t)18) +/* + * Invalid size or alignment. + */ +#define BLK_STS_INVAL ((__force blk_status_t)19) + /** * blk_path_error - returns true if error may be path related * @error: status the request was completed with @@ -426,7 +431,7 @@ enum req_flag_bits { __REQ_SWAP, /* swap I/O */ __REQ_DRV, /* for driver use */ __REQ_FS_PRIVATE, /* for file system (submitter) use */ - + __REQ_ATOMIC, /* for atomic write operations */ /* * Command specific flags, keep last: */ @@ -458,6 +463,7 @@ enum req_flag_bits { #define REQ_SWAP (__force blk_opf_t)(1ULL << __REQ_SWAP) #define REQ_DRV (__force blk_opf_t)(1ULL << __REQ_DRV) #define REQ_FS_PRIVATE (__force blk_opf_t)(1ULL << __REQ_FS_PRIVATE) +#define REQ_ATOMIC (__force blk_opf_t)(1ULL << __REQ_ATOMIC) #define REQ_NOUNMAP (__force blk_opf_t)(1ULL << __REQ_NOUNMAP) diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h index 5e8eeb2d299b..98edf438c3ac 100644 --- a/include/linux/blkdev.h +++ b/include/linux/blkdev.h @@ -343,6 +343,16 @@ struct queue_limits { unsigned int discard_alignment; unsigned int zone_write_granularity; + /* atomic write limits */ + unsigned int atomic_write_hw_max; + unsigned int atomic_write_max_sectors; + unsigned int atomic_write_hw_boundary; + unsigned int atomic_write_boundary_sectors; + unsigned int atomic_write_hw_unit_min; + unsigned int atomic_write_unit_min; + unsigned int atomic_write_hw_unit_max; + unsigned int atomic_write_unit_max; + unsigned short max_segments; unsigned short max_integrity_segments; unsigned short max_discard_segments; @@ -1400,6 +1410,30 @@ static inline int queue_dma_alignment(const struct request_queue *q) return q ? q->limits.dma_alignment : 511; } +static inline unsigned int +queue_atomic_write_unit_max_bytes(const struct request_queue *q) +{ + return q->limits.atomic_write_unit_max; +} + +static inline unsigned int +queue_atomic_write_unit_min_bytes(const struct request_queue *q) +{ + return q->limits.atomic_write_unit_min; +} + +static inline unsigned int +queue_atomic_write_boundary_bytes(const struct request_queue *q) +{ + return q->limits.atomic_write_boundary_sectors << SECTOR_SHIFT; +} + +static inline unsigned int +queue_atomic_write_max_bytes(const struct request_queue *q) +{ + return q->limits.atomic_write_max_sectors << SECTOR_SHIFT; +} + static inline unsigned int bdev_dma_alignment(struct block_device *bdev) { return queue_dma_alignment(bdev_get_queue(bdev)); @@ -1650,6 +1684,27 @@ struct io_comp_batch { KABI_RESERVE(1) }; +static inline bool bdev_can_atomic_write(struct block_device *bdev) +{ + struct request_queue *bd_queue = bdev->bd_queue; + struct queue_limits *limits = &bd_queue->limits; + + if (!limits->atomic_write_unit_min) + return false; + + if (bdev_is_partition(bdev)) { + sector_t bd_start_sect = bdev->bd_start_sect; + unsigned int alignment = + max(limits->atomic_write_unit_min, + limits->atomic_write_hw_boundary); + + if (!IS_ALIGNED(bd_start_sect, alignment >> SECTOR_SHIFT)) + return false; + } + + return true; +} + #define DEFINE_IO_COMP_BATCH(name) struct io_comp_batch name = { } #endif /* _LINUX_BLKDEV_H */ -- 2.39.2
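The new limits can be read back from sysfs. Since the attributes are registered via QUEUE_RO_ENTRY, they appear under the disk's queue/ directory; a small userspace sketch follows, with "nvme0n1" as a placeholder disk name:

#include <stdio.h>

static const char *attrs[] = {
	"atomic_write_max_bytes",
	"atomic_write_unit_min_bytes",
	"atomic_write_unit_max_bytes",
	"atomic_write_boundary_bytes",
};

int main(void)
{
	char path[128];
	unsigned long long val;
	unsigned int i;
	FILE *f;

	for (i = 0; i < sizeof(attrs) / sizeof(attrs[0]); i++) {
		snprintf(path, sizeof(path),
			 "/sys/block/nvme0n1/queue/%s", attrs[i]);
		f = fopen(path, "r");
		val = 0;	/* 0 also means "no atomic write support" */
		if (f) {
			if (fscanf(f, "%llu", &val) != 1)
				val = 0;
			fclose(f);
		}
		printf("%-28s %llu\n", attrs[i], val);
	}
	return 0;
}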
hulk inclusion category: bugfix bugzilla: https://atomgit.com/openeuler/kernel/issues/8862 -------------------------------- Fix kabi broken in struct queue_limits. Fixes: 9da3d1e912f3 ("block: Add core atomic write support") Signed-off-by: Long Li <leo.lilong@huawei.com> --- include/linux/blkdev.h | 20 ++++++-------------- 1 file changed, 6 insertions(+), 14 deletions(-) diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h index 98edf438c3ac..8ff5cb3ad89d 100644 --- a/include/linux/blkdev.h +++ b/include/linux/blkdev.h @@ -343,16 +343,6 @@ struct queue_limits { unsigned int discard_alignment; unsigned int zone_write_granularity; - /* atomic write limits */ - unsigned int atomic_write_hw_max; - unsigned int atomic_write_max_sectors; - unsigned int atomic_write_hw_boundary; - unsigned int atomic_write_boundary_sectors; - unsigned int atomic_write_hw_unit_min; - unsigned int atomic_write_unit_min; - unsigned int atomic_write_hw_unit_max; - unsigned int atomic_write_unit_max; - unsigned short max_segments; unsigned short max_integrity_segments; unsigned short max_discard_segments; @@ -369,10 +359,12 @@ struct queue_limits { */ unsigned int dma_alignment; - KABI_RESERVE(1) - KABI_RESERVE(2) - KABI_RESERVE(3) - KABI_RESERVE(4) + /* atomic write limits */ + KABI_USE2(1, unsigned int atomic_write_hw_max, unsigned int atomic_write_max_sectors) + KABI_USE2(2, unsigned int atomic_write_hw_boundary, unsigned int atomic_write_boundary_sectors) + KABI_USE2(3, unsigned int atomic_write_hw_unit_min, unsigned int atomic_write_unit_min) + KABI_USE2(4, unsigned int atomic_write_hw_unit_max, unsigned int atomic_write_unit_max) + KABI_RESERVE(5) KABI_RESERVE(6) KABI_RESERVE(7) -- 2.39.2
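For readers unfamiliar with the openEuler KABI machinery: KABI_USE2 lets one reserved slot carry two new 32-bit fields without changing the structure layout that existing consumers were compiled against. A simplified, self-contained stand-in for the idea follows (hypothetical types; the real KABI_RESERVE/KABI_USE2 macros are defined in the openEuler headers and differ in detail):

#include <stdio.h>

/*
 * Sketch of the KABI slot-reuse idea: a 64-bit placeholder freezes the
 * structure size and field offsets, and a union later lets two new 32-bit
 * fields occupy that slot.
 */
struct example_limits {
	unsigned int max_sectors;	/* pre-existing field */
	union {
		unsigned long long kabi_reserved1;	/* original placeholder */
		struct {
			unsigned int atomic_write_hw_max;
			unsigned int atomic_write_max_sectors;
		};
	};
};

int main(void)
{
	struct example_limits lim = { 0 };

	/* The new fields are usable... */
	lim.atomic_write_hw_max = 64 * 1024;
	lim.atomic_write_max_sectors = 128;
	/* ...while the layout old binaries were built against is unchanged. */
	printf("sizeof(struct example_limits) = %zu\n",
	       sizeof(struct example_limits));
	return 0;
}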
From: John Garry <john.g.garry@oracle.com> mainline inclusion from mainline-v6.9-rc2 commit de4c7bef9d330e4a59c78181bd596c7569d7208e category: feature bugzilla: https://atomgit.com/openeuler/kernel/issues/8862 Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i... -------------------------------- blkdev_dio_unaligned() is called from __blkdev_direct_IO(), __blkdev_direct_IO_simple(), and __blkdev_direct_IO_async(), and all these are only called from blkdev_direct_IO(). Move the blkdev_dio_unaligned() call to the common callsite, blkdev_direct_IO(). Pass those functions the bdev pointer from blkdev_direct_IO(), as it is non-trivial to look up. Reviewed-by: Keith Busch <kbusch@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Luis Chamberlain <mcgrof@kernel.org> Signed-off-by: John Garry <john.g.garry@oracle.com> Link: https://lore.kernel.org/r/20240415122020.1541594-1-john.g.garry@oracle.com Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Long Li <leo.lilong@huawei.com> --- block/fops.c | 29 ++++++++++++----------------- 1 file changed, 12 insertions(+), 17 deletions(-) diff --git a/block/fops.c b/block/fops.c index 478fb45717be..7c3a565ea9f9 100644 --- a/block/fops.c +++ b/block/fops.c @@ -44,18 +44,15 @@ static bool blkdev_dio_unaligned(struct block_device *bdev, loff_t pos, #define DIO_INLINE_BIO_VECS 4 static ssize_t __blkdev_direct_IO_simple(struct kiocb *iocb, - struct iov_iter *iter, unsigned int nr_pages) + struct iov_iter *iter, struct block_device *bdev, + unsigned int nr_pages) { - struct block_device *bdev = I_BDEV(iocb->ki_filp->f_mapping->host); struct bio_vec inline_vecs[DIO_INLINE_BIO_VECS], *vecs; loff_t pos = iocb->ki_pos; bool should_dirty = false; struct bio bio; ssize_t ret; - if (blkdev_dio_unaligned(bdev, pos, iter)) - return -EINVAL; - if (nr_pages <= DIO_INLINE_BIO_VECS) vecs = inline_vecs; else { @@ -160,9 +157,8 @@ static void blkdev_bio_end_io(struct bio *bio) } static ssize_t __blkdev_direct_IO(struct kiocb *iocb, struct iov_iter *iter, - unsigned int nr_pages) + struct block_device *bdev, unsigned int nr_pages) { - struct block_device *bdev = I_BDEV(iocb->ki_filp->f_mapping->host); struct blk_plug plug; struct blkdev_dio *dio; struct bio *bio; @@ -171,9 +167,6 @@ static ssize_t __blkdev_direct_IO(struct kiocb *iocb, struct iov_iter *iter, loff_t pos = iocb->ki_pos; int ret = 0; - if (blkdev_dio_unaligned(bdev, pos, iter)) - return -EINVAL; - if (iocb->ki_flags & IOCB_ALLOC_CACHE) opf |= REQ_ALLOC_CACHE; bio = bio_alloc_bioset(bdev, nr_pages, opf, GFP_KERNEL, @@ -300,9 +293,9 @@ static void blkdev_bio_end_io_async(struct bio *bio) static ssize_t __blkdev_direct_IO_async(struct kiocb *iocb, struct iov_iter *iter, + struct block_device *bdev, unsigned int nr_pages) { - struct block_device *bdev = I_BDEV(iocb->ki_filp->f_mapping->host); bool is_read = iov_iter_rw(iter) == READ; blk_opf_t opf = is_read ? 
REQ_OP_READ : dio_bio_write_op(iocb); struct blkdev_dio *dio; @@ -310,9 +303,6 @@ static ssize_t __blkdev_direct_IO_async(struct kiocb *iocb, loff_t pos = iocb->ki_pos; int ret = 0; - if (blkdev_dio_unaligned(bdev, pos, iter)) - return -EINVAL; - if (iocb->ki_flags & IOCB_ALLOC_CACHE) opf |= REQ_ALLOC_CACHE; bio = bio_alloc_bioset(bdev, nr_pages, opf, GFP_KERNEL, @@ -365,18 +355,23 @@ static ssize_t __blkdev_direct_IO_async(struct kiocb *iocb, static ssize_t blkdev_direct_IO(struct kiocb *iocb, struct iov_iter *iter) { + struct block_device *bdev = I_BDEV(iocb->ki_filp->f_mapping->host); unsigned int nr_pages; if (!iov_iter_count(iter)) return 0; + if (blkdev_dio_unaligned(bdev, iocb->ki_pos, iter)) + return -EINVAL; + nr_pages = bio_iov_vecs_to_alloc(iter, BIO_MAX_VECS + 1); if (likely(nr_pages <= BIO_MAX_VECS)) { if (is_sync_kiocb(iocb)) - return __blkdev_direct_IO_simple(iocb, iter, nr_pages); - return __blkdev_direct_IO_async(iocb, iter, nr_pages); + return __blkdev_direct_IO_simple(iocb, iter, bdev, + nr_pages); + return __blkdev_direct_IO_async(iocb, iter, bdev, nr_pages); } - return __blkdev_direct_IO(iocb, iter, bio_max_segs(nr_pages)); + return __blkdev_direct_IO(iocb, iter, bdev, bio_max_segs(nr_pages)); } static int blkdev_iomap_begin(struct inode *inode, loff_t offset, loff_t length, -- 2.39.2
From: John Garry <john.g.garry@oracle.com> mainline inclusion from mainline-v6.10-rc3 commit caf336f81b3a3ca744e335972e86ec7244512d4a category: feature bugzilla: https://atomgit.com/openeuler/kernel/issues/8862 Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i... -------------------------------- Support atomic writes by submitting a single BIO with the REQ_ATOMIC set. It must be ensured that the atomic write adheres to its rules, like naturally aligned offset, so call blkdev_dio_invalid() -> blkdev_atomic_write_valid() [with renaming blkdev_dio_unaligned() to blkdev_dio_invalid()] for this purpose. The BIO submission path currently checks for atomic writes which are too large, so no need to check here. In blkdev_direct_IO(), if the nr_pages exceeds BIO_MAX_VECS, then we cannot produce a single BIO, so error in this case. Finally set FMODE_CAN_ATOMIC_WRITE when the bdev can support atomic writes and the associated file flag is for O_DIRECT. Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Signed-off-by: John Garry <john.g.garry@oracle.com> Reviewed-by: Keith Busch <kbusch@kernel.org> Acked-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Link: https://lore.kernel.org/r/20240620125359.2684798-8-john.g.garry@oracle.com Signed-off-by: Jens Axboe <axboe@kernel.dk> Conflicts: block/fops.c Signed-off-by: Long Li <leo.lilong@huawei.com> --- block/fops.c | 20 +++++++++++++++++--- 1 file changed, 17 insertions(+), 3 deletions(-) diff --git a/block/fops.c b/block/fops.c index 7c3a565ea9f9..f53b0fd4d577 100644 --- a/block/fops.c +++ b/block/fops.c @@ -34,9 +34,12 @@ static blk_opf_t dio_bio_write_op(struct kiocb *iocb) return opf; } -static bool blkdev_dio_unaligned(struct block_device *bdev, loff_t pos, - struct iov_iter *iter) +static bool blkdev_dio_invalid(struct block_device *bdev, loff_t pos, + struct iov_iter *iter, bool is_atomic) { + if (is_atomic && !generic_atomic_write_valid(iter, pos)) + return true; + return pos & (bdev_logical_block_size(bdev) - 1) || !bdev_iter_is_aligned(bdev, iter); } @@ -71,6 +74,8 @@ static ssize_t __blkdev_direct_IO_simple(struct kiocb *iocb, } bio.bi_iter.bi_sector = pos >> SECTOR_SHIFT; bio.bi_ioprio = iocb->ki_ioprio; + if (iocb->ki_flags & IOCB_ATOMIC) + bio.bi_opf |= REQ_ATOMIC; ret = bio_iov_iter_get_pages(&bio, iter); if (unlikely(ret)) @@ -340,6 +345,9 @@ static ssize_t __blkdev_direct_IO_async(struct kiocb *iocb, task_io_account_write(bio->bi_iter.bi_size); } + if (iocb->ki_flags & IOCB_ATOMIC) + bio->bi_opf |= REQ_ATOMIC; + if (iocb->ki_flags & IOCB_NOWAIT) bio->bi_opf |= REQ_NOWAIT; @@ -356,12 +364,13 @@ static ssize_t __blkdev_direct_IO_async(struct kiocb *iocb, static ssize_t blkdev_direct_IO(struct kiocb *iocb, struct iov_iter *iter) { struct block_device *bdev = I_BDEV(iocb->ki_filp->f_mapping->host); + bool is_atomic = iocb->ki_flags & IOCB_ATOMIC; unsigned int nr_pages; if (!iov_iter_count(iter)) return 0; - if (blkdev_dio_unaligned(bdev, iocb->ki_pos, iter)) + if (blkdev_dio_invalid(bdev, iocb->ki_pos, iter, is_atomic)) return -EINVAL; nr_pages = bio_iov_vecs_to_alloc(iter, BIO_MAX_VECS + 1); @@ -370,6 +379,8 @@ static ssize_t blkdev_direct_IO(struct kiocb *iocb, struct iov_iter *iter) return __blkdev_direct_IO_simple(iocb, iter, bdev, nr_pages); return __blkdev_direct_IO_async(iocb, iter, bdev, nr_pages); + } else if (is_atomic) { + return -EINVAL; } return __blkdev_direct_IO(iocb, iter, bdev, bio_max_segs(nr_pages)); } @@ -597,6 +608,9 @@ static int 
blkdev_open(struct inode *inode, struct file *filp) if (IS_ERR(handle)) return PTR_ERR(handle); + if (bdev_can_atomic_write(handle->bdev) && filp->f_flags & O_DIRECT) + filp->f_mode |= FMODE_CAN_ATOMIC_WRITE; + if (bdev_nowait(handle->bdev)) filp->f_mode |= FMODE_NOWAIT; -- 2.39.2
From: John Garry <john.g.garry@oracle.com> mainline inclusion from mainline-v6.12-rc1 commit 9a8dbdadae509e5717ff6e5aa572ca0974d2101d category: feature bugzilla: https://atomgit.com/openeuler/kernel/issues/8862 Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i... -------------------------------- Darrick and Hannes both thought it better that generic_atomic_write_valid() should be passed a struct iocb, and not just the member of that struct which is referenced; see [0] and [1]. I think that makes a more generic and clean API, so make that change. [0] https://lore.kernel.org/linux-block/680ce641-729b-4150-b875-531a98657682@sus... [1] https://lore.kernel.org/linux-xfs/20240620212401.GA3058325@frogsfrogsfrogs/ Fixes: c34fc6f26ab8 ("fs: Initial atomic write support") Suggested-by: Darrick J. Wong <djwong@kernel.org> Suggested-by: Hannes Reinecke <hare@suse.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: John Garry <john.g.garry@oracle.com> Link: https://lore.kernel.org/r/20241019125113.369994-2-john.g.garry@oracle.com Signed-off-by: Jens Axboe <axboe@kernel.dk> Conflicts: include/linux/fs.h Signed-off-by: Long Li <leo.lilong@huawei.com> --- block/fops.c | 8 ++++---- fs/read_write.c | 4 ++-- include/linux/fs.h | 2 +- 3 files changed, 7 insertions(+), 7 deletions(-) diff --git a/block/fops.c b/block/fops.c index f53b0fd4d577..018cdb2aeeec 100644 --- a/block/fops.c +++ b/block/fops.c @@ -34,13 +34,13 @@ static blk_opf_t dio_bio_write_op(struct kiocb *iocb) return opf; } -static bool blkdev_dio_invalid(struct block_device *bdev, loff_t pos, +static bool blkdev_dio_invalid(struct block_device *bdev, struct kiocb *iocb, struct iov_iter *iter, bool is_atomic) { - if (is_atomic && !generic_atomic_write_valid(iter, pos)) + if (is_atomic && !generic_atomic_write_valid(iocb, iter)) return true; - return pos & (bdev_logical_block_size(bdev) - 1) || + return iocb->ki_pos & (bdev_logical_block_size(bdev) - 1) || !bdev_iter_is_aligned(bdev, iter); } @@ -370,7 +370,7 @@ static ssize_t blkdev_direct_IO(struct kiocb *iocb, struct iov_iter *iter) if (!iov_iter_count(iter)) return 0; - if (blkdev_dio_invalid(bdev, iocb->ki_pos, iter, is_atomic)) + if (blkdev_dio_invalid(bdev, iocb, iter, is_atomic)) return -EINVAL; nr_pages = bio_iov_vecs_to_alloc(iter, BIO_MAX_VECS + 1); diff --git a/fs/read_write.c b/fs/read_write.c index c3d0c8d7dd24..fae9356293ed 100644 --- a/fs/read_write.c +++ b/fs/read_write.c @@ -1766,7 +1766,7 @@ int generic_file_rw_checks(struct file *file_in, struct file *file_out) return 0; } -bool generic_atomic_write_valid(struct iov_iter *iter, loff_t pos) +bool generic_atomic_write_valid(struct kiocb *iocb, struct iov_iter *iter) { size_t len = iov_iter_count(iter); @@ -1776,7 +1776,7 @@ bool generic_atomic_write_valid(struct iov_iter *iter, loff_t pos) if (!is_power_of_2(len)) return false; - if (!IS_ALIGNED(pos, len)) + if (!IS_ALIGNED(iocb->ki_pos, len)) return false; return true; diff --git a/include/linux/fs.h b/include/linux/fs.h index 10abdc255459..78c059433d69 100644 --- a/include/linux/fs.h +++ b/include/linux/fs.h @@ -3669,6 +3669,6 @@ static inline bool cachefiles_ondemand_is_enabled(void) } #endif -bool generic_atomic_write_valid(struct iov_iter *iter, loff_t pos); +bool generic_atomic_write_valid(struct kiocb *iocb, struct iov_iter *iter); #endif /* _LINUX_FS_H */ -- 2.39.2
From: John Garry <john.g.garry@oracle.com> mainline inclusion from mainline-v6.12-rc1 commit c3be7ebbbce5201e151f17e28a6c807602f369c9 category: feature bugzilla: https://atomgit.com/openeuler/kernel/issues/8862 Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i... -------------------------------- Currently FMODE_CAN_ATOMIC_WRITE is set if the bdev can atomic write and the file is open for direct IO. This does not work if the file is not opened for direct IO, yet fcntl(O_DIRECT) is used on the fd later. Change to check for direct IO on a per-IO basis in generic_atomic_write_valid(). Since we want to report -EOPNOTSUPP for non-direct IO for an atomic write, change to return an error code. Relocate the block fops atomic write checks to the common write path, as to catch non-direct IO. Fixes: c34fc6f26ab8 ("fs: Initial atomic write support") Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: John Garry <john.g.garry@oracle.com> Link: https://lore.kernel.org/r/20241019125113.369994-3-john.g.garry@oracle.com Signed-off-by: Jens Axboe <axboe@kernel.dk> Conflicts: block/fops.c include/linux/fs.h Signed-off-by: Long Li <leo.lilong@huawei.com> --- block/fops.c | 18 ++++++++++-------- fs/read_write.c | 13 ++++++++----- include/linux/fs.h | 2 +- 3 files changed, 19 insertions(+), 14 deletions(-) diff --git a/block/fops.c b/block/fops.c index 018cdb2aeeec..fda001b04360 100644 --- a/block/fops.c +++ b/block/fops.c @@ -35,11 +35,8 @@ static blk_opf_t dio_bio_write_op(struct kiocb *iocb) } static bool blkdev_dio_invalid(struct block_device *bdev, struct kiocb *iocb, - struct iov_iter *iter, bool is_atomic) + struct iov_iter *iter) { - if (is_atomic && !generic_atomic_write_valid(iocb, iter)) - return true; - return iocb->ki_pos & (bdev_logical_block_size(bdev) - 1) || !bdev_iter_is_aligned(bdev, iter); } @@ -364,13 +361,12 @@ static ssize_t __blkdev_direct_IO_async(struct kiocb *iocb, static ssize_t blkdev_direct_IO(struct kiocb *iocb, struct iov_iter *iter) { struct block_device *bdev = I_BDEV(iocb->ki_filp->f_mapping->host); - bool is_atomic = iocb->ki_flags & IOCB_ATOMIC; unsigned int nr_pages; if (!iov_iter_count(iter)) return 0; - if (blkdev_dio_invalid(bdev, iocb, iter, is_atomic)) + if (blkdev_dio_invalid(bdev, iocb, iter)) return -EINVAL; nr_pages = bio_iov_vecs_to_alloc(iter, BIO_MAX_VECS + 1); @@ -379,7 +375,7 @@ static ssize_t blkdev_direct_IO(struct kiocb *iocb, struct iov_iter *iter) return __blkdev_direct_IO_simple(iocb, iter, bdev, nr_pages); return __blkdev_direct_IO_async(iocb, iter, bdev, nr_pages); - } else if (is_atomic) { + } else if (iocb->ki_flags & IOCB_ATOMIC) { return -EINVAL; } return __blkdev_direct_IO(iocb, iter, bdev, bio_max_segs(nr_pages)); @@ -608,7 +604,7 @@ static int blkdev_open(struct inode *inode, struct file *filp) if (IS_ERR(handle)) return PTR_ERR(handle); - if (bdev_can_atomic_write(handle->bdev) && filp->f_flags & O_DIRECT) + if (bdev_can_atomic_write(handle->bdev)) filp->f_mode |= FMODE_CAN_ATOMIC_WRITE; if (bdev_nowait(handle->bdev)) @@ -686,6 +682,12 @@ static ssize_t blkdev_write_iter(struct kiocb *iocb, struct iov_iter *from) if ((iocb->ki_flags & (IOCB_NOWAIT | IOCB_DIRECT)) == IOCB_NOWAIT) return -EOPNOTSUPP; + if (iocb->ki_flags & IOCB_ATOMIC) { + ret = generic_atomic_write_valid(iocb, from); + if (ret) + return ret; + } + size -= iocb->ki_pos; if (iov_iter_count(from) > size) { shorted = iov_iter_count(from) - size; diff --git a/fs/read_write.c b/fs/read_write.c 
index fae9356293ed..587488c6a452 100644 --- a/fs/read_write.c +++ b/fs/read_write.c @@ -1766,20 +1766,23 @@ int generic_file_rw_checks(struct file *file_in, struct file *file_out) return 0; } -bool generic_atomic_write_valid(struct kiocb *iocb, struct iov_iter *iter) +int generic_atomic_write_valid(struct kiocb *iocb, struct iov_iter *iter) { size_t len = iov_iter_count(iter); if (!iter_is_ubuf(iter)) - return false; + return -EINVAL; if (!is_power_of_2(len)) - return false; + return -EINVAL; if (!IS_ALIGNED(iocb->ki_pos, len)) - return false; + return -EINVAL; - return true; + if (!(iocb->ki_flags & IOCB_DIRECT)) + return -EOPNOTSUPP; + + return 0; } #ifdef CONFIG_BPF_READAHEAD diff --git a/include/linux/fs.h b/include/linux/fs.h index 78c059433d69..b5c4e8e75ff1 100644 --- a/include/linux/fs.h +++ b/include/linux/fs.h @@ -3669,6 +3669,6 @@ static inline bool cachefiles_ondemand_is_enabled(void) } #endif -bool generic_atomic_write_valid(struct kiocb *iocb, struct iov_iter *iter); +int generic_atomic_write_valid(struct kiocb *iocb, struct iov_iter *iter); #endif /* _LINUX_FS_H */ -- 2.39.2
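The net effect of the checks above is visible from userspace: atomic writes are requested with pwritev2(2) and RWF_ATOMIC, and the direct-IO requirement is now enforced per I/O rather than at open time. Below is a hypothetical test program (not part of this series) illustrating that behaviour; it assumes a target with a 4096-byte atomic write unit, and the RWF_ATOMIC value mirrors include/uapi/linux/fs.h.

#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/uio.h>
#include <unistd.h>

#ifndef RWF_ATOMIC
#define RWF_ATOMIC	0x00000040	/* from include/uapi/linux/fs.h */
#endif

int main(int argc, char **argv)
{
	struct iovec iov;
	void *buf;
	ssize_t ret;
	int fd;

	if (argc < 2)
		return 1;
	/* Drop O_DIRECT here and pwritev2() below returns -EOPNOTSUPP */
	fd = open(argv[1], O_WRONLY | O_DIRECT);
	if (fd < 0 || posix_memalign(&buf, 4096, 4096))
		return 1;
	memset(buf, 0xab, 4096);
	iov.iov_base = buf;
	iov.iov_len = 4096;	/* single iovec, power-of-2 length */
	ret = pwritev2(fd, &iov, 1, 0, RWF_ATOMIC); /* pos aligned to len */
	if (ret < 0)
		perror("pwritev2");
	close(fd);
	return ret == 4096 ? 0 : 1;
}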
From: John Garry <john.g.garry@oracle.com> mainline inclusion from mainline-v6.12-rc1 commit 1eadb157947163ca72ba8963b915fdc099ce6cca category: feature bugzilla: https://atomgit.com/openeuler/kernel/issues/8862 Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i... -------------------------------- Add helpers to get atomic write limits for a bdev, so that we don't access request_queue helpers outside the block layer. The helpers themselves check whether the bdev can actually atomic write, so that callers cannot forget this check. Suggested-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: John Garry <john.g.garry@oracle.com> Link: https://lore.kernel.org/r/20241019125113.369994-4-john.g.garry@oracle.com Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Long Li <leo.lilong@huawei.com> --- include/linux/blkdev.h | 16 ++++++++++++++++ 1 file changed, 16 insertions(+) diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h index 8ff5cb3ad89d..442c5c82b36b 100644 --- a/include/linux/blkdev.h +++ b/include/linux/blkdev.h @@ -1697,6 +1697,22 @@ static inline bool bdev_can_atomic_write(struct block_device *bdev) return true; } +static inline unsigned int +bdev_atomic_write_unit_min_bytes(struct block_device *bdev) +{ + if (!bdev_can_atomic_write(bdev)) + return 0; + return queue_atomic_write_unit_min_bytes(bdev_get_queue(bdev)); +} + +static inline unsigned int +bdev_atomic_write_unit_max_bytes(struct block_device *bdev) +{ + if (!bdev_can_atomic_write(bdev)) + return 0; + return queue_atomic_write_unit_max_bytes(bdev_get_queue(bdev)); +} + #define DEFINE_IO_COMP_BATCH(name) struct io_comp_batch name = { } #endif /* _LINUX_BLKDEV_H */ -- 2.39.2
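With these helpers a filesystem does not need to test bdev_can_atomic_write() itself, since a return value of 0 already signals no support. A minimal mount-time sketch follows; my_sb_info and my_atomic_write_init are illustrative names, not from this series (the ext4 patches later in the set do essentially this):

/* Sketch only: cache the device atomic write limits at mount time. */
static void my_atomic_write_init(struct super_block *sb)
{
	struct my_sb_info *sbi = sb->s_fs_info;	/* hypothetical */
	struct block_device *bdev = sb->s_bdev;

	/* Both helpers return 0 if the bdev cannot atomic write */
	sbi->awu_min = bdev_atomic_write_unit_min_bytes(bdev);
	sbi->awu_max = bdev_atomic_write_unit_max_bytes(bdev);
}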
From: Alan Adamson <alan.adamson@oracle.com> mainline inclusion from mainline-v6.10-rc3 commit 5f9bbea02f06110ec5cf95a3327019b3194b2d80 category: feature bugzilla: https://atomgit.com/openeuler/kernel/issues/8862 Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i... -------------------------------- Add support to set block layer request_queue atomic write limits. The limits will be derived from either the namespace or controller atomic parameters. NVMe atomic-related parameters are grouped into "normal" and "power-fail" (or PF) classes of parameter. For atomic write support, only PF parameters are of interest. The "normal" parameters are concerned with racing reads and writes (which also applies to PF). See NVM Command Set Specification Revision 1.0d section 2.1.4 for reference. Whether to use per namespace or controller atomic parameters is decided by NSFEAT bit 1 - see Figure 97: Identify – Identify Namespace Data Structure, NVM Command Set. NVMe namespaces may define an atomic boundary, whereby no atomic guarantees are provided for a write which straddles this per-lba space boundary. The block layer merging policy is such that no merges may occur in which the resultant request would straddle such a boundary. Unlike SCSI, NVMe specifies no granularity or alignment rules, apart from the atomic boundary rule. In addition, again unlike SCSI, there is no dedicated atomic write command - a write which adheres to the atomic size limit and boundary is implicitly atomic. If NSFEAT bit 1 is set, the following parameters are of interest: - NAWUPF (Namespace Atomic Write Unit Power Fail) - NABSPF (Namespace Atomic Boundary Size Power Fail) - NABO (Namespace Atomic Boundary Offset) and we set request_queue limits as follows: - atomic_write_unit_max = rounddown_pow_of_two(NAWUPF) - atomic_write_max_bytes = NAWUPF - atomic_write_boundary = NABSPF In the unlikely scenario that NABO is non-zero, atomic writes will not be supported at all, as dealing with this adds extra complexity. This policy may change in future. In all cases, atomic_write_unit_min is set to the logical block size. If NSFEAT bit 1 is unset, the following parameter is of interest: - AWUPF (Atomic Write Unit Power Fail) and we set request_queue limits as follows: - atomic_write_unit_max = rounddown_pow_of_two(AWUPF) - atomic_write_max_bytes = AWUPF - atomic_write_boundary = 0 A new function, nvme_valid_atomic_write(), is also called from the submission path to verify that a request submitted to the driver will actually be executed atomically. As mentioned, there is no dedicated NVMe atomic write command which could have errored for a command exceeding the controller atomic write limits. Note on NABSPF: There seems to be some vagueness in the spec as to whether NABSPF applies when NSFEAT bit 1 is unset. Figure 97 does not explicitly mention NABSPF and how it is affected by bit 1. However, Figure 4 does say to check Figure 97 for info about per-namespace parameters, which NABSPF is, so it is implied. In any case, nvme_update_disk_info() currently checks the namespace parameter NABO regardless of this bit. Signed-off-by: Alan Adamson <alan.adamson@oracle.com> Reviewed-by: Keith Busch <kbusch@kernel.org> Reviewed-by: Martin K.
Petersen <martin.petersen@oracle.com> jpg: total rewrite Signed-off-by: John Garry <john.g.garry@oracle.com> Link: https://lore.kernel.org/r/20240620125359.2684798-11-john.g.garry@oracle.com Signed-off-by: Jens Axboe <axboe@kernel.dk> Conflicts: drivers/nvme/host/core.c [Context conflicts] Signed-off-by: Long Li <leo.lilong@huawei.com> --- drivers/nvme/host/core.c | 54 +++++++++++++++++++++++++++++++++++++++- 1 file changed, 53 insertions(+), 1 deletion(-) diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c index 50de9e76d07d..c1c6ffa32b17 100644 --- a/drivers/nvme/host/core.c +++ b/drivers/nvme/host/core.c @@ -871,6 +871,36 @@ static inline blk_status_t nvme_setup_write_zeroes(struct nvme_ns *ns, return BLK_STS_OK; } +/* + * NVMe does not support a dedicated command to issue an atomic write. A write + * which does adhere to the device atomic limits will silently be executed + * non-atomically. The request issuer should ensure that the write is within + * the queue atomic writes limits, but just validate this in case it is not. + */ +static bool nvme_valid_atomic_write(struct request *req) +{ + struct request_queue *q = req->q; + u32 boundary_bytes = queue_atomic_write_boundary_bytes(q); + + if (blk_rq_bytes(req) > queue_atomic_write_unit_max_bytes(q)) + return false; + + if (boundary_bytes) { + u64 mask = boundary_bytes - 1, imask = ~mask; + u64 start = blk_rq_pos(req) << SECTOR_SHIFT; + u64 end = start + blk_rq_bytes(req) - 1; + + /* If greater then must be crossing a boundary */ + if (blk_rq_bytes(req) > boundary_bytes) + return false; + + if ((start & imask) != (end & imask)) + return false; + } + + return true; +} + static inline blk_status_t nvme_setup_rw(struct nvme_ns *ns, struct request *req, struct nvme_command *cmnd, enum nvme_opcode op) @@ -886,6 +916,9 @@ static inline blk_status_t nvme_setup_rw(struct nvme_ns *ns, if (req->cmd_flags & REQ_RAHEAD) dsmgmt |= NVME_RW_DSM_FREQ_PREFETCH; + if (req->cmd_flags & REQ_ATOMIC && !nvme_valid_atomic_write(req)) + return BLK_STS_INVAL; + cmnd->rw.opcode = op; cmnd->rw.flags = 0; cmnd->rw.nsid = cpu_to_le32(ns->head->ns_id); @@ -1853,6 +1886,22 @@ static int nvme_configure_metadata(struct nvme_ns *ns, struct nvme_id_ns *id) return 0; } +static void nvme_update_atomic_write_disk_info(struct nvme_ns *ns, + struct nvme_id_ns *id, struct queue_limits *lim, + u32 bs, u32 atomic_bs) +{ + unsigned int boundary = 0; + + if (id->nsfeat & NVME_NS_FEAT_ATOMICS && id->nawupf) { + if (le16_to_cpu(id->nabspf)) + boundary = (le16_to_cpu(id->nabspf) + 1) * bs; + } + lim->atomic_write_hw_max = atomic_bs; + lim->atomic_write_hw_boundary = boundary; + lim->atomic_write_hw_unit_min = bs; + lim->atomic_write_hw_unit_max = rounddown_pow_of_two(atomic_bs); +} + static void nvme_set_queue_limits(struct nvme_ctrl *ctrl, struct request_queue *q) { @@ -1901,6 +1950,9 @@ static void nvme_update_disk_info(struct gendisk *disk, atomic_bs = (1 + le16_to_cpu(id->nawupf)) * bs; else atomic_bs = (1 + ns->ctrl->subsys->awupf) * bs; + + nvme_update_atomic_write_disk_info(ns, id, &disk->queue->limits, + bs, atomic_bs); } if (id->nsfeat & NVME_NS_FEAT_IO_OPT) { @@ -2030,7 +2082,6 @@ static int nvme_update_ns_info_block(struct nvme_ns *ns, blk_mq_freeze_queue(ns->disk->queue); lbaf = nvme_lbaf_index(id->flbas); ns->lba_shift = id->lbaf[lbaf].ds; - nvme_set_queue_limits(ns->ctrl, ns->queue); ret = nvme_configure_metadata(ns, id); if (ret < 0) { @@ -2039,6 +2090,7 @@ static int nvme_update_ns_info_block(struct nvme_ns *ns, } nvme_set_chunk_sectors(ns, id); 
nvme_update_disk_info(ns->disk, ns, id); + nvme_set_queue_limits(ns->ctrl, ns->queue); if (ns->head->ids.csi == NVME_CSI_ZNS) { ret = nvme_update_zone_info(ns, lbaf); -- 2.39.2
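As a worked example of the derivation above (numbers assumed, not from the patch): on a namespace with 4096-byte LBAs and NSFEAT bit 1 set, NAWUPF = 7 and NABSPF = 15 give atomic_bs = (1 + 7) * 4096 = 32768 bytes, so atomic_write_hw_unit_min = 4096, atomic_write_hw_unit_max = rounddown_pow_of_two(32768) = 32768, atomic_write_hw_max = 32768, and atomic_write_hw_boundary = (15 + 1) * 4096 = 65536 bytes. A 32 KiB REQ_ATOMIC write starting at byte offset 96 KiB then passes nvme_valid_atomic_write(): it is within the unit max, and [96 KiB, 128 KiB) does not cross a 64 KiB boundary since start and end share the same boundary-aligned prefix.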
From: John Garry <john.g.garry@oracle.com> mainline inclusion from mainline-v6.12-rc1 commit a570bad16b9f5252db2f342622bd71febb39a19c category: feature bugzilla: https://atomgit.com/openeuler/kernel/issues/8862 Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i... -------------------------------- The XFS code will need this. Reviewed-by: "Darrick J. Wong" <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: John Garry <john.g.garry@oracle.com> Signed-off-by: Darrick J. Wong <djwong@kernel.org> Conflicts: fs/read_write.c Signed-off-by: Long Li <leo.lilong@huawei.com> --- fs/read_write.c | 1 + 1 file changed, 1 insertion(+) diff --git a/fs/read_write.c b/fs/read_write.c index 587488c6a452..6d0cc2090c71 100644 --- a/fs/read_write.c +++ b/fs/read_write.c @@ -1784,6 +1784,7 @@ int generic_atomic_write_valid(struct kiocb *iocb, struct iov_iter *iter) return 0; } +EXPORT_SYMBOL_GPL(generic_atomic_write_valid); #ifdef CONFIG_BPF_READAHEAD static void fs_file_read_ctx_init(struct fs_file_read_ctx *ctx, -- 2.39.2
From: John Garry <john.g.garry@oracle.com> mainline inclusion from mainline-v6.12-rc1 commit 9e0933c21c128d6d8ac4d8aae0babaf9a43100b8 category: feature bugzilla: https://atomgit.com/openeuler/kernel/issues/8862 Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i... -------------------------------- Support direct I/O atomic writes by producing a single bio with REQ_ATOMIC flag set. Initially FSes (XFS) should only support writing a single FS block atomically. As with any atomic write, we should produce a single bio which covers the complete write length. Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: "Darrick J. Wong" <djwong@kernel.org> Signed-off-by: John Garry <john.g.garry@oracle.com> Reviewed-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com> [djwong: clarify a couple of things in the docs] Signed-off-by: Darrick J. Wong <djwong@kernel.org> Conflicts: Documentation/filesystems/iomap/operations.rst Signed-off-by: Long Li <leo.lilong@huawei.com> --- fs/iomap/direct-io.c | 38 ++++++++++++++++++++++++++++++++++---- fs/iomap/trace.h | 3 ++- include/linux/iomap.h | 1 + 3 files changed, 37 insertions(+), 5 deletions(-) diff --git a/fs/iomap/direct-io.c b/fs/iomap/direct-io.c index bcd3f8cf5ea4..2cc95448b97a 100644 --- a/fs/iomap/direct-io.c +++ b/fs/iomap/direct-io.c @@ -256,7 +256,7 @@ static void iomap_dio_zero(const struct iomap_iter *iter, struct iomap_dio *dio, * clearing the WRITE_THROUGH flag in the dio request. */ static inline blk_opf_t iomap_dio_bio_opflags(struct iomap_dio *dio, - const struct iomap *iomap, bool use_fua) + const struct iomap *iomap, bool use_fua, bool atomic) { blk_opf_t opflags = REQ_SYNC | REQ_IDLE; @@ -268,6 +268,8 @@ static inline blk_opf_t iomap_dio_bio_opflags(struct iomap_dio *dio, opflags |= REQ_FUA; else dio->flags &= ~IOMAP_DIO_WRITE_THROUGH; + if (atomic) + opflags |= REQ_ATOMIC; return opflags; } @@ -278,7 +280,8 @@ static loff_t iomap_dio_bio_iter(const struct iomap_iter *iter, const struct iomap *iomap = &iter->iomap; struct inode *inode = iter->inode; unsigned int fs_block_size = i_blocksize(inode), pad; - loff_t length = iomap_length(iter); + const loff_t length = iomap_length(iter); + bool atomic = iter->flags & IOMAP_ATOMIC; loff_t pos = iter->pos; blk_opf_t bio_opf; struct bio *bio; @@ -288,6 +291,9 @@ static loff_t iomap_dio_bio_iter(const struct iomap_iter *iter, size_t copied = 0; size_t orig_count; + if (atomic && length != fs_block_size) + return -EINVAL; + if ((pos | length) & (bdev_logical_block_size(iomap->bdev) - 1) || !bdev_iter_is_aligned(iomap->bdev, dio->submit.iter)) return -EINVAL; @@ -365,7 +371,7 @@ static loff_t iomap_dio_bio_iter(const struct iomap_iter *iter, * can set up the page vector appropriately for a ZONE_APPEND * operation. */ - bio_opf = iomap_dio_bio_opflags(dio, iomap, use_fua); + bio_opf = iomap_dio_bio_opflags(dio, iomap, use_fua, atomic); nr_pages = bio_iov_vecs_to_alloc(dio->submit.iter, BIO_MAX_VECS); do { @@ -397,6 +403,17 @@ static loff_t iomap_dio_bio_iter(const struct iomap_iter *iter, } n = bio->bi_iter.bi_size; + if (WARN_ON_ONCE(atomic && n != length)) { + /* + * This bio should have covered the complete length, + * which it doesn't, so error. We may need to zero out + * the tail (complete FS block), similar to when + * bio_iov_iter_get_pages() returns an error, above. 
+ */ + ret = -EINVAL; + bio_put(bio); + goto zero_tail; + } if (dio->flags & IOMAP_DIO_WRITE) { task_io_account_write(n); } else { @@ -579,6 +596,9 @@ __iomap_dio_rw(struct kiocb *iocb, struct iov_iter *iter, if (iocb->ki_flags & IOCB_NOWAIT) iomi.flags |= IOMAP_NOWAIT; + if (iocb->ki_flags & IOCB_ATOMIC) + iomi.flags |= IOMAP_ATOMIC; + if (iov_iter_rw(iter) == READ) { /* reads can always complete inline */ dio->flags |= IOMAP_DIO_INLINE_COMP; @@ -640,7 +660,17 @@ __iomap_dio_rw(struct kiocb *iocb, struct iov_iter *iter, if (ret != -EAGAIN) { trace_iomap_dio_invalidate_fail(inode, iomi.pos, iomi.len); - ret = -ENOTBLK; + if (iocb->ki_flags & IOCB_ATOMIC) { + /* + * folio invalidation failed, maybe + * this is transient, unlock and see if + * the caller tries again. + */ + ret = -EAGAIN; + } else { + /* fall back to buffered write */ + ret = -ENOTBLK; + } } goto out_free_dio; } diff --git a/fs/iomap/trace.h b/fs/iomap/trace.h index 3ef694f9489f..65d7a58465f0 100644 --- a/fs/iomap/trace.h +++ b/fs/iomap/trace.h @@ -98,7 +98,8 @@ DEFINE_RANGE_EVENT(iomap_dio_rw_queued); { IOMAP_REPORT, "REPORT" }, \ { IOMAP_FAULT, "FAULT" }, \ { IOMAP_DIRECT, "DIRECT" }, \ - { IOMAP_NOWAIT, "NOWAIT" } + { IOMAP_NOWAIT, "NOWAIT" }, \ + { IOMAP_ATOMIC, "ATOMIC" } #define IOMAP_F_FLAGS_STRINGS \ { IOMAP_F_NEW, "NEW" }, \ diff --git a/include/linux/iomap.h b/include/linux/iomap.h index 6bd7ed98cf1f..85ada1f1b131 100644 --- a/include/linux/iomap.h +++ b/include/linux/iomap.h @@ -178,6 +178,7 @@ struct iomap_folio_ops { #else #define IOMAP_DAX 0 #endif /* CONFIG_FS_DAX */ +#define IOMAP_ATOMIC (1 << 9) struct iomap_ops { /* -- 2.39.2
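For a filesystem, opting into this support is mostly a matter of letting IOCB_ATOMIC reach iomap_dio_rw(); the iomap core then translates it to IOMAP_ATOMIC and tags the single bio with REQ_ATOMIC. A minimal sketch, with my_iomap_ops and my_dio_write_iter as stand-ins for the filesystem's own ops (not names from this series):

/* Sketch: a direct-IO write path that passes IOCB_ATOMIC through iomap. */
static ssize_t my_dio_write_iter(struct kiocb *iocb, struct iov_iter *from)
{
	/*
	 * Nothing atomic-specific is needed here: iomap_dio_bio_iter()
	 * requires an atomic write to be exactly one fs block and fails
	 * with -EINVAL if it cannot be issued as a single bio.
	 */
	return iomap_dio_rw(iocb, from, &my_iomap_ops, NULL, 0, NULL, 0);
}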
From: Prasad Singamsetty <prasad.singamsetty@oracle.com> mainline inclusion from mainline-v6.10-rc3 commit 0f9ca80fa4f9670ba09721e4e36b8baf086a500c category: feature bugzilla: https://atomgit.com/openeuler/kernel/issues/8862 Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i... -------------------------------- Extend the statx system call to return additional info on atomic write support for a file. Helper function generic_fill_statx_atomic_writes() can be used by FSes to fill in the relevant statx fields. For now atomic_write_segments_max will always be 1, otherwise some rules would need to be imposed on iovec length and alignment, which we don't want for now. Signed-off-by: Prasad Singamsetty <prasad.singamsetty@oracle.com> jpg: relocate bdev support to another patch Reviewed-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Signed-off-by: John Garry <john.g.garry@oracle.com> Acked-by: Darrick J. Wong <djwong@kernel.org> Link: https://lore.kernel.org/r/20240620125359.2684798-5-john.g.garry@oracle.com Signed-off-by: Jens Axboe <axboe@kernel.dk> Conflicts: include/linux/stat.h include/uapi/linux/stat.h fs/stat.c Signed-off-by: Long Li <leo.lilong@huawei.com> --- fs/stat.c | 34 ++++++++++++++++++++++++++++++++++ include/linux/fs.h | 3 +++ include/linux/stat.h | 3 +++ include/uapi/linux/stat.h | 11 ++++++++++- 4 files changed, 50 insertions(+), 1 deletion(-) diff --git a/fs/stat.c b/fs/stat.c index 5375be5f97cc..5ec1ac0c76d6 100644 --- a/fs/stat.c +++ b/fs/stat.c @@ -89,6 +89,37 @@ void generic_fill_statx_attr(struct inode *inode, struct kstat *stat) } EXPORT_SYMBOL(generic_fill_statx_attr); +/** + * generic_fill_statx_atomic_writes - Fill in atomic writes statx attributes + * @stat: Where to fill in the attribute flags + * @unit_min: Minimum supported atomic write length in bytes + * @unit_max: Maximum supported atomic write length in bytes + * + * Fill in the STATX{_ATTR}_WRITE_ATOMIC flags in the kstat structure from + * atomic write unit_min and unit_max values. + */ +void generic_fill_statx_atomic_writes(struct kstat *stat, + unsigned int unit_min, + unsigned int unit_max) +{ + /* Confirm that the request type is known */ + stat->result_mask |= STATX_WRITE_ATOMIC; + + /* Confirm that the file attribute type is known */ + stat->attributes_mask |= STATX_ATTR_WRITE_ATOMIC; + + if (unit_min) { + stat->atomic_write_unit_min = unit_min; + stat->atomic_write_unit_max = unit_max; + /* Initially only allow 1x segment */ + stat->atomic_write_segments_max = 1; + + /* Confirm atomic writes are actually supported */ + stat->attributes |= STATX_ATTR_WRITE_ATOMIC; + } +} +EXPORT_SYMBOL_GPL(generic_fill_statx_atomic_writes); + /** * vfs_getattr_nosec - getattr without security checks * @path: file to get attributes from @@ -653,6 +684,9 @@ cp_statx(const struct kstat *stat, struct statx __user *buffer) tmp.stx_mnt_id = stat->mnt_id; tmp.stx_dio_mem_align = stat->dio_mem_align; tmp.stx_dio_offset_align = stat->dio_offset_align; + tmp.stx_atomic_write_unit_min = stat->atomic_write_unit_min; + tmp.stx_atomic_write_unit_max = stat->atomic_write_unit_max; + tmp.stx_atomic_write_segments_max = stat->atomic_write_segments_max; return copy_to_user(buffer, &tmp, sizeof(tmp)) ?
-EFAULT : 0; } diff --git a/include/linux/fs.h b/include/linux/fs.h index b5c4e8e75ff1..9c349a5fcc94 100644 --- a/include/linux/fs.h +++ b/include/linux/fs.h @@ -3232,6 +3232,9 @@ extern const struct inode_operations page_symlink_inode_operations; extern void kfree_link(void *); void generic_fillattr(struct mnt_idmap *, u32, struct inode *, struct kstat *); void generic_fill_statx_attr(struct inode *inode, struct kstat *stat); +void generic_fill_statx_atomic_writes(struct kstat *stat, + unsigned int unit_min, + unsigned int unit_max); extern int vfs_getattr_nosec(const struct path *, struct kstat *, u32, unsigned int); extern int vfs_getattr(const struct path *, struct kstat *, u32, unsigned int); void __inode_add_bytes(struct inode *inode, loff_t bytes); diff --git a/include/linux/stat.h b/include/linux/stat.h index 045064370cd8..bea9a48049c8 100644 --- a/include/linux/stat.h +++ b/include/linux/stat.h @@ -54,6 +54,9 @@ struct kstat { u32 dio_mem_align; u32 dio_offset_align; u64 change_cookie; + u32 atomic_write_unit_min; + u32 atomic_write_unit_max; + u32 atomic_write_segments_max; KABI_RESERVE(1) KABI_RESERVE(2) diff --git a/include/uapi/linux/stat.h b/include/uapi/linux/stat.h index 7cab2c65d3d7..9ca13620c947 100644 --- a/include/uapi/linux/stat.h +++ b/include/uapi/linux/stat.h @@ -127,7 +127,14 @@ struct statx { __u32 stx_dio_mem_align; /* Memory buffer alignment for direct I/O */ __u32 stx_dio_offset_align; /* File offset alignment for direct I/O */ /* 0xa0 */ - __u64 __spare3[12]; /* Spare space for future expansion */ + __u64 stx_subvol; /* Subvolume identifier */ + __u32 stx_atomic_write_unit_min; /* Min atomic write unit in bytes */ + __u32 stx_atomic_write_unit_max; /* Max atomic write unit in bytes */ + /* 0xb0 */ + __u32 stx_atomic_write_segments_max; /* Max atomic write segment count */ + __u32 __spare1[1]; + /* 0xb8 */ + __u64 __spare3[9]; /* Spare space for future expansion */ /* 0x100 */ }; @@ -154,6 +161,7 @@ struct statx { #define STATX_BTIME 0x00000800U /* Want/got stx_btime */ #define STATX_MNT_ID 0x00001000U /* Got stx_mnt_id */ #define STATX_DIOALIGN 0x00002000U /* Want/got direct I/O alignment info */ +#define STATX_WRITE_ATOMIC 0x00010000U /* Want/got atomic_write_* fields */ #define STATX__RESERVED 0x80000000U /* Reserved for future struct statx expansion */ @@ -189,6 +197,7 @@ struct statx { #define STATX_ATTR_MOUNT_ROOT 0x00002000 /* Root of a mount */ #define STATX_ATTR_VERITY 0x00100000 /* [I] Verity protected file */ #define STATX_ATTR_DAX 0x00200000 /* File is currently in DAX state */ +#define STATX_ATTR_WRITE_ATOMIC 0x00400000 /* File supports atomic write operations */ #endif /* _UAPI_LINUX_STAT_H */ -- 2.39.2
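Userspace can then discover the limits with statx(2). A hypothetical probe follows (not part of this series); it assumes libc/uapi headers that already carry the new stx_atomic_write_* fields, and the flag values mirror include/uapi/linux/stat.h:

#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <sys/stat.h>

#ifndef STATX_WRITE_ATOMIC
#define STATX_WRITE_ATOMIC	0x00010000U
#define STATX_ATTR_WRITE_ATOMIC	0x00400000
#endif

int main(int argc, char **argv)
{
	struct statx stx;

	if (argc < 2 || statx(AT_FDCWD, argv[1], 0, STATX_WRITE_ATOMIC, &stx))
		return 1;
	if (stx.stx_attributes & STATX_ATTR_WRITE_ATOMIC)
		printf("awu min/max: %u/%u, max segments: %u\n",
		       stx.stx_atomic_write_unit_min,
		       stx.stx_atomic_write_unit_max,
		       stx.stx_atomic_write_segments_max);
	else
		printf("atomic writes not supported\n");
	return 0;
}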
hulk inclusion category: bugfix bugzilla: https://atomgit.com/openeuler/kernel/issues/8862 -------------------------------- Fix kabi broken in struct kstat. Fixes: 0f9ca80fa4f9 ("fs: Add initial atomic write support info to statx") Signed-off-by: Long Li <leo.lilong@huawei.com> --- include/linux/stat.h | 8 +++----- 1 file changed, 3 insertions(+), 5 deletions(-) diff --git a/include/linux/stat.h b/include/linux/stat.h index bea9a48049c8..3a4917d199d7 100644 --- a/include/linux/stat.h +++ b/include/linux/stat.h @@ -54,12 +54,10 @@ struct kstat { u32 dio_mem_align; u32 dio_offset_align; u64 change_cookie; - u32 atomic_write_unit_min; - u32 atomic_write_unit_max; - u32 atomic_write_segments_max; - KABI_RESERVE(1) - KABI_RESERVE(2) + KABI_USE2(1, u32 atomic_write_unit_min, u32 atomic_write_unit_max) + KABI_USE(2, u32 atomic_write_segments_max) + KABI_RESERVE(3) KABI_RESERVE(4) }; -- 2.39.2
From: "Ritesh Harjani (IBM)" <ritesh.list@gmail.com> mainline inclusion from mainline-v6.12-rc1 commit 6dfc1c1d597f8b6ebffe25f51f013494994f9b84 category: feature bugzilla: https://atomgit.com/src-openeuler/kernel/issues/8862 Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i... -------------------------------- This patch adds base support for atomic writes via statx getattr. On bs < ps systems, we can create FS with say bs of 16k. That means both atomic write min and max unit can be set to 16k for supporting atomic writes. Co-developed-by: Ojaswin Mujoo <ojaswin@linux.ibm.com> Signed-off-by: Ojaswin Mujoo <ojaswin@linux.ibm.com> Signed-off-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Jan Kara <jack@suse.cz> Conflicts: fs/ext4/ext4.h [Context conflicts] Signed-off-by: Long Li <leo.lilong@huawei.com> --- fs/ext4/ext4.h | 10 ++++++++++ fs/ext4/inode.c | 12 ++++++++++++ fs/ext4/super.c | 31 +++++++++++++++++++++++++++++++ 3 files changed, 53 insertions(+) diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h index 45a90910ecba..61e30005ed68 100644 --- a/fs/ext4/ext4.h +++ b/fs/ext4/ext4.h @@ -1726,6 +1726,10 @@ struct ext4_sb_info { */ struct work_struct s_sb_upd_work; + /* Atomic write unit values in bytes */ + unsigned int s_awu_min; + unsigned int s_awu_max; + /* Ext4 fast commit sub transaction ID */ atomic_t s_fc_subtid; @@ -3866,6 +3870,12 @@ static inline int ext4_buffer_uptodate(struct buffer_head *bh) return buffer_uptodate(bh); } +static inline bool ext4_inode_can_atomic_write(struct inode *inode) +{ + + return S_ISREG(inode->i_mode) && EXT4_SB(inode->i_sb)->s_awu_min > 0; +} + #endif /* __KERNEL__ */ #define EFSBADCRC EBADMSG /* Bad CRC detected */ diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c index a78245878298..3d00564e7069 100644 --- a/fs/ext4/inode.c +++ b/fs/ext4/inode.c @@ -6421,6 +6421,18 @@ int ext4_getattr(struct mnt_idmap *idmap, const struct path *path, } } + if ((request_mask & STATX_WRITE_ATOMIC) && S_ISREG(inode->i_mode)) { + struct ext4_sb_info *sbi = EXT4_SB(inode->i_sb); + unsigned int awu_min = 0, awu_max = 0; + + if (ext4_inode_can_atomic_write(inode)) { + awu_min = sbi->s_awu_min; + awu_max = sbi->s_awu_max; + } + + generic_fill_statx_atomic_writes(stat, awu_min, awu_max); + } + flags = ei->i_flags & EXT4_FL_USER_VISIBLE; if (flags & EXT4_APPEND_FL) stat->attributes |= STATX_ATTR_APPEND; diff --git a/fs/ext4/super.c b/fs/ext4/super.c index 711c8fc2b3a4..adbbc26c239b 100644 --- a/fs/ext4/super.c +++ b/fs/ext4/super.c @@ -4575,6 +4575,36 @@ static int ext4_handle_clustersize(struct super_block *sb) return 0; } +/* + * ext4_atomic_write_init: Initializes filesystem min & max atomic write units. 
+ * @sb: super block + * TODO: Later add support for bigalloc + */ +static void ext4_atomic_write_init(struct super_block *sb) +{ + struct ext4_sb_info *sbi = EXT4_SB(sb); + struct block_device *bdev = sb->s_bdev; + + if (!bdev_can_atomic_write(bdev)) + return; + + if (!ext4_has_feature_extents(sb)) + return; + + sbi->s_awu_min = max(sb->s_blocksize, + bdev_atomic_write_unit_min_bytes(bdev)); + sbi->s_awu_max = min(sb->s_blocksize, + bdev_atomic_write_unit_max_bytes(bdev)); + if (sbi->s_awu_min && sbi->s_awu_max && + sbi->s_awu_min <= sbi->s_awu_max) { + ext4_msg(sb, KERN_NOTICE, "Supports (experimental) DIO atomic writes awu_min: %u, awu_max: %u", + sbi->s_awu_min, sbi->s_awu_max); + } else { + sbi->s_awu_min = 0; + sbi->s_awu_max = 0; + } +} + static void ext4_fast_commit_init(struct super_block *sb) { struct ext4_sb_info *sbi = EXT4_SB(sb); @@ -5479,6 +5509,7 @@ static int __ext4_fill_super(struct fs_context *fc, struct super_block *sb) spin_lock_init(&sbi->s_bdev_wb_lock); + ext4_atomic_write_init(sb); ext4_fast_commit_init(sb); sb->s_root = NULL; -- 2.39.2
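A worked example of the clamping above (numbers assumed): a 4096-byte block size filesystem on a device advertising unit min/max of 4096/65536 bytes gets s_awu_min = max(4096, 4096) = 4096 and s_awu_max = min(4096, 65536) = 4096, i.e. without bigalloc ext4 only promises single-fsblock atomic writes, which is what the TODO about bigalloc refers to. A device with unit min 8192 on the same filesystem would leave s_awu_min (8192) greater than s_awu_max (4096), so both are reset to 0 and atomic writes stay disabled.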
From: "Ritesh Harjani (IBM)" <ritesh.list@gmail.com> mainline inclusion from mainline-v6.12-rc1 commit 43c696f9d094061e958e31be7f1dae66bc25d389 category: feature bugzilla: https://atomgit.com/src-openeuler/kernel/issues/8862 Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i... -------------------------------- Let's validate the given constraints for atomic write request. Otherwise it will fail with -EINVAL. Currently atomic write is only supported on DIO, so for buffered-io it will return -EOPNOTSUPP. Reviewed-by: John Garry <john.g.garry@oracle.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com> Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Long Li <leo.lilong@huawei.com> --- fs/ext4/file.c | 14 ++++++++++++++ 1 file changed, 14 insertions(+) diff --git a/fs/ext4/file.c b/fs/ext4/file.c index e3cbb520f8bc..f22530345254 100644 --- a/fs/ext4/file.c +++ b/fs/ext4/file.c @@ -732,6 +732,20 @@ ext4_file_write_iter(struct kiocb *iocb, struct iov_iter *from) if (IS_DAX(inode)) return ext4_dax_write_iter(iocb, from); #endif + + if (iocb->ki_flags & IOCB_ATOMIC) { + size_t len = iov_iter_count(from); + int ret; + + if (len < EXT4_SB(inode->i_sb)->s_awu_min || + len > EXT4_SB(inode->i_sb)->s_awu_max) + return -EINVAL; + + ret = generic_atomic_write_valid(iocb, from); + if (ret) + return ret; + } + if (iocb->ki_flags & IOCB_DIRECT) return ext4_dio_write_iter(iocb, from); else -- 2.39.2
From: "Ritesh Harjani (IBM)" <ritesh.list@gmail.com> mainline inclusion from mainline-v6.12-rc1 commit b7987a7d69a4a17b7f334f5c100d6729ebcc2ccb category: feature bugzilla: https://atomgit.com/src-openeuler/kernel/issues/8862 Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i... -------------------------------- FS needs to add the fmode capability in order to support atomic writes during file open (refer kiocb_set_rw_flags()). Set this capability on a regular file if ext4 can do atomic write. Reviewed-by: John Garry <john.g.garry@oracle.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com> Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Jan Kara <jack@suse.cz> Conflicts: fs/ext4/file.c [Context Conflicts] Signed-off-by: Long Li <leo.lilong@huawei.com> --- fs/ext4/file.c | 3 +++ 1 file changed, 3 insertions(+) diff --git a/fs/ext4/file.c b/fs/ext4/file.c index f22530345254..3896883f7659 100644 --- a/fs/ext4/file.c +++ b/fs/ext4/file.c @@ -939,6 +939,9 @@ static int ext4_file_open(struct inode *inode, struct file *filp) return ret; } + if (ext4_inode_can_atomic_write(inode)) + filp->f_mode |= FMODE_CAN_ATOMIC_WRITE; + filp->f_mode |= FMODE_NOWAIT | FMODE_BUF_RASYNC | FMODE_DIO_PARALLEL_WRITE; return dquot_file_open(inode, filp); -- 2.39.2
From: "Ritesh Harjani (IBM)" <ritesh.list@gmail.com> mainline inclusion from mainline-v6.12-rc1 commit 299537e9dfac2ecd08e7dae87a6437b92612568a category: feature bugzilla: https://atomgit.com/src-openeuler/kernel/issues/8862 Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i... -------------------------------- atomic writes is currently only supported for single fsblock and only for direct-io. We should not return -ENOTBLK for atomic writes since we want the atomic write request to either complete fully or fail otherwise. Hence, we should never fallback to buffered-io in case of DIO atomic write requests. Let's also catch if this ever happens by adding some WARN_ON_ONCE before buffered-io handling for direct-io atomic writes. More details of the discussion [1]. While at it let's add an inline helper ext4_want_directio_fallback() which simplifies the logic checks and inherently fixes condition on when to return -ENOTBLK which otherwise was always returning true for any write or directio in ext4_iomap_end(). It was ok since ext4 only supports direct-io via iomap. [1]: https://lore.kernel.org/linux-xfs/cover.1729825985.git.ritesh.list@gmail.com... Suggested-by: Darrick J. Wong <djwong@kernel.org> # inline helper Signed-off-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Long Li <leo.lilong@huawei.com> --- fs/ext4/file.c | 7 +++++++ fs/ext4/inode.c | 27 ++++++++++++++++++++++----- 2 files changed, 29 insertions(+), 5 deletions(-) diff --git a/fs/ext4/file.c b/fs/ext4/file.c index 3896883f7659..128a9dfe111e 100644 --- a/fs/ext4/file.c +++ b/fs/ext4/file.c @@ -639,6 +639,13 @@ static ssize_t ext4_dio_write_iter(struct kiocb *iocb, struct iov_iter *from) ssize_t err; loff_t endbyte; + /* + * There is no support for atomic writes on buffered-io yet, + * we should never fallback to buffered-io for DIO atomic + * writes. + */ + WARN_ON_ONCE(iocb->ki_flags & IOCB_ATOMIC); + offset = iocb->ki_pos; err = ext4_buffered_write_iter(iocb, from); if (err < 0) diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c index 3d00564e7069..92478b0aea04 100644 --- a/fs/ext4/inode.c +++ b/fs/ext4/inode.c @@ -3515,17 +3515,34 @@ static int ext4_iomap_overwrite_begin(struct inode *inode, loff_t offset, return ret; } +static inline bool ext4_want_directio_fallback(unsigned flags, ssize_t written) +{ + /* must be a directio to fall back to buffered */ + if ((flags & (IOMAP_WRITE | IOMAP_DIRECT)) != + (IOMAP_WRITE | IOMAP_DIRECT)) + return false; + + /* atomic writes are all-or-nothing */ + if (flags & IOMAP_ATOMIC) + return false; + + /* can only try again if we wrote nothing */ + return written == 0; +} + static int ext4_iomap_end(struct inode *inode, loff_t offset, loff_t length, ssize_t written, unsigned flags, struct iomap *iomap) { /* * Check to see whether an error occurred while writing out the data to - * the allocated blocks. If so, return the magic error code so that we - * fallback to buffered I/O and attempt to complete the remainder of - * the I/O. Any blocks that may have been allocated in preparation for - * the direct I/O will be reused during buffered I/O. + * the allocated blocks. If so, return the magic error code for + * non-atomic write so that we fallback to buffered I/O and attempt to + * complete the remainder of the I/O. 
+ * For non-atomic writes, any blocks that may have been + * allocated in preparation for the direct I/O will be reused during + * buffered I/O. For atomic write, we never fallback to buffered-io. */ - if (flags & (IOMAP_WRITE | IOMAP_DIRECT) && written == 0) + if (ext4_want_directio_fallback(flags, written)) return -ENOTBLK; return 0; -- 2.39.2
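Concretely: if a DIO atomic write completes having written 0 bytes due to an error, ext4_iomap_end() no longer returns the -ENOTBLK magic code for it, so the iomap layer reports the failure to the caller instead of ext4_dio_write_iter() retrying the range through the page cache; the WARN_ON_ONCE added before the buffered path would catch any atomic write that still got there.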
From: "Ritesh Harjani (IBM)" <ritesh.list@gmail.com> mainline inclusion from mainline-v6.15-rc4 commit 9fa6121684dad974c8c2b2aceb0df2b27f0627fe category: feature bugzilla: https://atomgit.com/src-openeuler/kernel/issues/8862 Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i... -------------------------------- ext4_iomap_overwrite_begin() clears the flag for IOMAP_WRITE before calling ext4_iomap_begin(). Document this above ext4_map_blocks() call as it is easy to miss it when focusing on write paths alone. Reviewed-by: Ojaswin Mujoo <ojaswin@linux.ibm.com> Acked-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com> Link: https://patch.msgid.link/fd50ba05440042dff77d555e463a620a79f8d0e9.1747337952... Signed-off-by: Theodore Ts'o <tytso@mit.edu> Signed-off-by: Long Li <leo.lilong@huawei.com> --- fs/ext4/inode.c | 4 ++++ 1 file changed, 4 insertions(+) diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c index 92478b0aea04..e06d7a4ea9e0 100644 --- a/fs/ext4/inode.c +++ b/fs/ext4/inode.c @@ -3481,6 +3481,10 @@ static int ext4_iomap_begin(struct inode *inode, loff_t offset, loff_t length, } ret = ext4_iomap_alloc(inode, &map, flags); } else { + /* + * This can be called for overwrites path from + * ext4_iomap_overwrite_begin(). + */ ret = ext4_map_blocks(NULL, inode, &map, 0); } -- 2.39.2
From: "Ritesh Harjani (IBM)" <ritesh.list@gmail.com> mainline inclusion from mainline-v6.15-rc4 commit 1c972b1d13dde34c9897d991283e2c54209b44e9 category: feature bugzilla: https://atomgit.com/src-openeuler/kernel/issues/8862 Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i... -------------------------------- EXT4 only supports doing atomic write on inodes which uses extents, so add a check in ext4_inode_can_atomic_write() which gets called during open. Reviewed-by: Ojaswin Mujoo <ojaswin@linux.ibm.com> Acked-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com> Link: https://patch.msgid.link/86bb502c979398a736ab371d8f35f6866a477f6c.1747337952... Signed-off-by: Theodore Ts'o <tytso@mit.edu> Signed-off-by: Long Li <leo.lilong@huawei.com> --- fs/ext4/ext4.h | 4 +++- 1 file changed, 3 insertions(+), 1 deletion(-) diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h index 61e30005ed68..01ee238ce20c 100644 --- a/fs/ext4/ext4.h +++ b/fs/ext4/ext4.h @@ -3873,7 +3873,9 @@ static inline int ext4_buffer_uptodate(struct buffer_head *bh) static inline bool ext4_inode_can_atomic_write(struct inode *inode) { - return S_ISREG(inode->i_mode) && EXT4_SB(inode->i_sb)->s_awu_min > 0; + return S_ISREG(inode->i_mode) && + ext4_test_inode_flag(inode, EXT4_INODE_EXTENTS) && + EXT4_SB(inode->i_sb)->s_awu_min > 0; } #endif /* __KERNEL__ */ -- 2.39.2
From: "Ritesh Harjani (IBM)" <ritesh.list@gmail.com> mainline inclusion from mainline-v6.15-rc4 commit 255e7bc2127cbd3a718a55d2da5b2d3f015adcd7 category: feature bugzilla: https://atomgit.com/src-openeuler/kernel/issues/8862 Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i... -------------------------------- Let's make ext4_meta_trans_blocks() non-static for use in later functions during ->end_io conversion for atomic writes. We will need this function to estimate journal credits for a special case. Instead of adding another wrapper around it, let's make this non-static. Reviewed-by: Ojaswin Mujoo <ojaswin@linux.ibm.com> Acked-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com> Link: https://patch.msgid.link/23ce80d4286f792831ce99d13558182ee228fedb.1747337952... Signed-off-by: Theodore Ts'o <tytso@mit.edu> Signed-off-by: Long Li <leo.lilong@huawei.com> --- fs/ext4/ext4.h | 2 ++ fs/ext4/inode.c | 6 +----- 2 files changed, 3 insertions(+), 5 deletions(-) diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h index 01ee238ce20c..4663d44c103d 100644 --- a/fs/ext4/ext4.h +++ b/fs/ext4/ext4.h @@ -3029,6 +3029,8 @@ extern int ext4_writepage_trans_blocks(struct inode *); extern int ext4_normal_submit_inode_data_buffers(struct jbd2_inode *jinode); extern int ext4_chunk_trans_blocks(struct inode *, int nrblocks); extern int ext4_block_zero_eof(struct inode *inode, loff_t from, loff_t end); +extern int ext4_meta_trans_blocks(struct inode *inode, int lblocks, + int pextents); extern int ext4_zero_partial_blocks(struct inode *inode, loff_t lstart, loff_t length, bool *did_zero); extern vm_fault_t ext4_page_mkwrite(struct vm_fault *vmf); diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c index e06d7a4ea9e0..a6c44ba9cd65 100644 --- a/fs/ext4/inode.c +++ b/fs/ext4/inode.c @@ -138,9 +138,6 @@ static inline int ext4_begin_ordered_truncate(struct inode *inode, new_size); } -static int ext4_meta_trans_blocks(struct inode *inode, int lblocks, - int pextents); - /* * Test whether an inode is a fast symlink. * A fast symlink has its symlink data stored in ext4_inode_info->i_data. @@ -6532,8 +6529,7 @@ static int ext4_index_trans_blocks(struct inode *inode, int lblocks, * * Also account for superblock, inode, quota and xattr blocks */ -static int ext4_meta_trans_blocks(struct inode *inode, int lblocks, - int pextents) +int ext4_meta_trans_blocks(struct inode *inode, int lblocks, int pextents) { ext4_group_t groups, ngroups = ext4_get_groups_count(inode->i_sb); int gdpblocks; -- 2.39.2
From: "Ritesh Harjani (IBM)" <ritesh.list@gmail.com> mainline inclusion from mainline-v6.15-rc4 commit 5bb12b1837c0bf7ddc84e27812f1693a126fe27a category: feature bugzilla: https://atomgit.com/src-openeuler/kernel/issues/8862 Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i... -------------------------------- There can be a case where there are contiguous extents on the adjacent leaf nodes of on-disk extent trees. So when someone tries to write to this contiguous range, ext4_map_blocks() call will split by returning 1 extent at a time if this is not already cached in extent_status tree cache (where if these extents when cached can get merged since they are contiguous). This is fine for a normal write however in case of atomic writes, it can't afford to break the write into two. Now this is also something that will only happen in the slow write case where we call ext4_map_blocks() for each of these extents spread across different leaf nodes. However, there is no guarantee that these extent status cache cannot be reclaimed before the last call to ext4_map_blocks() in ext4_map_blocks_atomic_write_slow(). Hence this patch adds support of EXT4_GET_BLOCKS_QUERY_LEAF_BLOCKS. This flag checks if the requested range can be fully found in extent status cache and return. If not, it looks up in on-disk extent tree via ext4_map_query_blocks(). If the found extent is the last entry in the leaf node, then it goes and queries the next lblk to see if there is an adjacent contiguous extent in the adjacent leaf node of the on-disk extent tree. Even though there can be a case where there are multiple adjacent extent entries spread across multiple leaf nodes. But we only read an adjacent leaf block i.e. in total of 2 extent entries spread across 2 leaf nodes. The reason for this is that we are mostly only going to support atomic writes with upto 64KB or maybe max upto 1MB of atomic write support. Acked-by: Darrick J. Wong <djwong@kernel.org> Co-developed-by: Ojaswin Mujoo <ojaswin@linux.ibm.com> Signed-off-by: Ojaswin Mujoo <ojaswin@linux.ibm.com> Signed-off-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com> Link: https://patch.msgid.link/6bb563e661f5fbd80e266a9e6ce6e29178f555f6.1747337952... Signed-off-by: Theodore Ts'o <tytso@mit.edu> Conflicts: fs/ext4/ext4.h fs/ext4/inode.c Signed-off-by: Long Li <leo.lilong@huawei.com> --- fs/ext4/ext4.h | 19 +++++++++- fs/ext4/extents.c | 12 +++++++ fs/ext4/inode.c | 89 +++++++++++++++++++++++++++++++++++++++++++---- 3 files changed, 112 insertions(+), 8 deletions(-) diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h index 4663d44c103d..134ffa05b673 100644 --- a/fs/ext4/ext4.h +++ b/fs/ext4/ext4.h @@ -252,9 +252,19 @@ struct ext4_allocation_request { #define EXT4_MAP_UNWRITTEN BIT(BH_Unwritten) #define EXT4_MAP_BOUNDARY BIT(BH_Boundary) #define EXT4_MAP_DELAYED BIT(BH_Delay) +/* + * This is for use in ext4_map_query_blocks() for a special case where we can + * have a physically and logically contiguous blocks split across two leaf + * nodes instead of a single extent. This is required in case of atomic writes + * to know whether the returned extent is last in leaf. If yes, then lookup for + * next in leaf block in ext4_map_query_blocks_next_in_leaf(). + * - This is never going to be added to any buffer head state. + * - We use the next available bit after BH_BITMAP_UPTODATE. 
+ */ +#define EXT4_MAP_QUERY_LAST_IN_LEAF BIT(BH_BITMAP_UPTODATE + 1) #define EXT4_MAP_FLAGS (EXT4_MAP_NEW | EXT4_MAP_MAPPED |\ EXT4_MAP_UNWRITTEN | EXT4_MAP_BOUNDARY |\ - EXT4_MAP_DELAYED) + EXT4_MAP_DELAYED | EXT4_MAP_QUERY_LAST_IN_LEAF) struct ext4_map_blocks { ext4_fsblk_t m_pblk; @@ -718,6 +728,13 @@ enum { /* Caller is in the atomic contex, find extent if it has been cached */ #define EXT4_GET_BLOCKS_CACHED_NOWAIT 0x0800 +/* + * Atomic write caller needs this to query in the slow path of mixed mapping + * case, when a contiguous extent can be split across two adjacent leaf nodes. + * Look EXT4_MAP_QUERY_LAST_IN_LEAF. + */ +#define EXT4_GET_BLOCKS_QUERY_LAST_IN_LEAF 0x1000 + /* * The bit position of these flags must not overlap with any of the * EXT4_GET_BLOCKS_*. They are used by ext4_find_extent(), diff --git a/fs/ext4/extents.c b/fs/ext4/extents.c index dbc2154f7d4e..e54b408493d8 100644 --- a/fs/ext4/extents.c +++ b/fs/ext4/extents.c @@ -4451,6 +4451,18 @@ int ext4_ext_map_blocks(handle_t *handle, struct inode *inode, allocated = map->m_len; ext4_ext_show_leaf(inode, path); out: + /* + * We never use EXT4_GET_BLOCKS_QUERY_LAST_IN_LEAF with CREATE flag. + * So we know that the depth used here is correct, since there was no + * block allocation done if EXT4_GET_BLOCKS_QUERY_LAST_IN_LEAF is set. + * If tomorrow we start using this QUERY flag with CREATE, then we will + * need to re-calculate the depth as it might have changed due to block + * allocation. + */ + if (flags & EXT4_GET_BLOCKS_QUERY_LAST_IN_LEAF) + if (!err && ex && (ex == EXT_LAST_EXTENT(path[depth].p_hdr))) + map->m_flags |= EXT4_MAP_QUERY_LAST_IN_LEAF; + ext4_free_ext_path(path); trace_ext4_ext_map_blocks_exit(inode, flags, map, diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c index a6c44ba9cd65..9b873113a25f 100644 --- a/fs/ext4/inode.c +++ b/fs/ext4/inode.c @@ -453,14 +453,71 @@ static void ext4_map_blocks_es_recheck(handle_t *handle, } #endif /* ES_AGGRESSIVE_TEST */ +static int ext4_map_query_blocks_next_in_leaf(handle_t *handle, + struct inode *inode, struct ext4_map_blocks *map, + unsigned int orig_mlen) +{ + struct ext4_map_blocks map2; + unsigned int status, status2; + int retval; + + status = map->m_flags & EXT4_MAP_UNWRITTEN ? + EXTENT_STATUS_UNWRITTEN : EXTENT_STATUS_WRITTEN; + + WARN_ON_ONCE(!(map->m_flags & EXT4_MAP_QUERY_LAST_IN_LEAF)); + WARN_ON_ONCE(orig_mlen <= map->m_len); + + /* Prepare map2 for lookup in next leaf block */ + map2.m_lblk = map->m_lblk + map->m_len; + map2.m_len = orig_mlen - map->m_len; + map2.m_flags = 0; + retval = ext4_ext_map_blocks(handle, inode, &map2, 0); + + if (retval <= 0) { + ext4_es_insert_extent(inode, map->m_lblk, map->m_len, + map->m_pblk, status); + return map->m_len; + } + + if (unlikely(retval != map2.m_len)) { + ext4_warning(inode->i_sb, + "ES len assertion failed for inode " + "%lu: retval %d != map->m_len %d", + inode->i_ino, retval, map2.m_len); + WARN_ON(1); + } + + status2 = map2.m_flags & EXT4_MAP_UNWRITTEN ? + EXTENT_STATUS_UNWRITTEN : EXTENT_STATUS_WRITTEN; + + /* + * If map2 is contiguous with map, then let's insert it as a single + * extent in es cache and return the combined length of both the maps. 
+ */ + if (map->m_pblk + map->m_len == map2.m_pblk && + status == status2) { + ext4_es_insert_extent(inode, map->m_lblk, + map->m_len + map2.m_len, map->m_pblk, + status); + map->m_len += map2.m_len; + } else { + ext4_es_insert_extent(inode, map->m_lblk, map->m_len, + map->m_pblk, status); + } + + return map->m_len; +} + static int ext4_map_query_blocks(handle_t *handle, struct inode *inode, struct ext4_map_blocks *map) { unsigned int status; int retval; + unsigned int orig_mlen = map->m_len; if (ext4_test_inode_flag(inode, EXT4_INODE_EXTENTS)) - retval = ext4_ext_map_blocks(handle, inode, map, 0); + retval = ext4_ext_map_blocks(handle, inode, map, + EXT4_GET_BLOCKS_QUERY_LAST_IN_LEAF); else retval = ext4_ind_map_blocks(handle, inode, map, 0); @@ -475,11 +532,23 @@ static int ext4_map_query_blocks(handle_t *handle, struct inode *inode, WARN_ON(1); } - status = map->m_flags & EXT4_MAP_UNWRITTEN ? - EXTENT_STATUS_UNWRITTEN : EXTENT_STATUS_WRITTEN; - ext4_es_insert_extent(inode, map->m_lblk, map->m_len, - map->m_pblk, status); - return retval; + /* + * No need to query next in leaf: + * - if returned extent is not last in leaf or + * - if the last in leaf is the full requested range + */ + if (!(map->m_flags & EXT4_MAP_QUERY_LAST_IN_LEAF) || + ((map->m_flags & EXT4_MAP_QUERY_LAST_IN_LEAF) && + (map->m_len == orig_mlen))) { + status = map->m_flags & EXT4_MAP_UNWRITTEN ? + EXTENT_STATUS_UNWRITTEN : EXTENT_STATUS_WRITTEN; + ext4_es_insert_extent(inode, map->m_lblk, map->m_len, + map->m_pblk, status); + return retval; + } + + return ext4_map_query_blocks_next_in_leaf(handle, inode, map, + orig_mlen); } static int ext4_map_create_blocks(handle_t *handle, struct inode *inode, @@ -590,6 +659,7 @@ int ext4_map_blocks(handle_t *handle, struct inode *inode, struct extent_status es; int retval; int ret = 0; + unsigned int orig_mlen = map->m_len; #ifdef ES_AGGRESSIVE_TEST struct ext4_map_blocks orig_map; @@ -641,7 +711,12 @@ int ext4_map_blocks(handle_t *handle, struct inode *inode, ext4_map_blocks_es_recheck(handle, inode, map, &orig_map, flags); #endif - goto found; + if (!(flags & EXT4_GET_BLOCKS_QUERY_LAST_IN_LEAF) || + orig_mlen == map->m_len) + goto found; + + if (flags & EXT4_GET_BLOCKS_QUERY_LAST_IN_LEAF) + map->m_len = orig_mlen; } /* * In the query cache no-wait mode, nothing we can do more if we -- 2.39.2
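To make the split-leaf case concrete (block numbers assumed): suppose an atomic write covers logical blocks [0, 32), stored on disk as extent A = [0, 16) in the last slot of one leaf node and extent B = [16, 32) at the start of the next leaf, physically contiguous with A and with the same written/unwritten status. A plain query returns only A, so map->m_len (16) is less than orig_mlen (32) and EXT4_MAP_QUERY_LAST_IN_LEAF is set; ext4_map_query_blocks_next_in_leaf() then looks up [16, 32) as map2, finds it contiguous with matching status, inserts a single [0, 32) entry into the extent status cache and returns a combined m_len of 32, letting the atomic write path treat the range as one extent.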
From: "Ritesh Harjani (IBM)" <ritesh.list@gmail.com> mainline inclusion from mainline-v6.15-rc4 commit b86629c2b2998338b4a715058b402e47d0b36206 category: feature bugzilla: https://atomgit.com/src-openeuler/kernel/issues/8862 Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i... -------------------------------- EXT4 supports bigalloc feature which allows the FS to work in size of clusters (group of blocks) rather than individual blocks. This patch adds atomic write support for bigalloc so that systems with bs = ps can also create FS using - mkfs.ext4 -F -O bigalloc -b 4096 -C 16384 <dev> With bigalloc ext4 can support multi-fsblock atomic writes. We will have to adjust ext4's atomic write unit max value to cluster size. This can then support atomic write of size anywhere between [blocksize, clustersize]. This patch adds the required changes to enable multi-fsblock atomic write support using bigalloc in the next patch. In this patch for block allocation: we first query the underlying region of the requested range by calling ext4_map_blocks() call. Here are the various cases which we then handle depending upon the underlying mapping type: 1. If the underlying region for the entire requested range is a mapped extent, then we don't call ext4_map_blocks() to allocate anything. We don't need to even start the jbd2 txn in this case. 2. For an append write case, we create a mapped extent. 3. If the underlying region is entirely a hole, then we create an unwritten extent for the requested range. 4. If the underlying region is a large unwritten extent, then we split the extent into 2 unwritten extent of required size. 5. If the underlying region has any type of mixed mapping, then we call ext4_map_blocks() in a loop to zero out the unwritten and the hole regions within the requested range. This then provide a single mapped extent type mapping for the requested range. Note: We invoke ext4_map_blocks() in a loop with the EXT4_GET_BLOCKS_ZERO flag only when the underlying extent mapping of the requested range is not entirely a hole, an unwritten extent, or a fully mapped extent. That is, if the underlying region contains a mix of hole(s), unwritten extent(s), and mapped extent(s), we use this loop to ensure that all the short mappings are zeroed out. This guarantees that the entire requested range becomes a single, uniformly mapped extent. It is ok to do so because we know this is being done on a bigalloc enabled filesystem where the block bitmap represents the entire cluster unit. Note having a single contiguous underlying region of type mapped, unwrittn or hole is not a problem. But the reason to avoid writing on top of mixed mapping region is because, atomic writes requires all or nothing should get written for the userspace pwritev2 request. So if at any point in time during the write if a crash or a sudden poweroff occurs, the region undergoing atomic write should read either complete old data or complete new data. But it should never have a mix of both old and new data. So, we first convert any mixed mapping region to a single contiguous mapped extent before any data gets written to it. This is because normally FS will only convert unwritten extents to written at the end of the write in ->end_io() call. And if we allow the writes over a mixed mapping and if a sudden power off happens in between, we will end up reading mix of new data (over mapped extents) and old data (over unwritten extents), because unwritten to written conversion never went through. 
So to avoid this and to avoid writes getting torn due to mixed mapping, we first allocate a single contiguous block mapping and then do the write. Acked-by: Darrick J. Wong <djwong@kernel.org> Co-developed-by: Ojaswin Mujoo <ojaswin@linux.ibm.com> Signed-off-by: Ojaswin Mujoo <ojaswin@linux.ibm.com> Signed-off-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com> Link: https://patch.msgid.link/c4965ac3407cbc773f0bc954d0966d9696f5038a.1747337952... Signed-off-by: Theodore Ts'o <tytso@mit.edu> Signed-off-by: Long Li <leo.lilong@huawei.com> --- fs/ext4/ext4.h | 2 + fs/ext4/extents.c | 87 +++++++++++++++++++ fs/ext4/file.c | 7 +- fs/ext4/inode.c | 208 +++++++++++++++++++++++++++++++++++++++++++++- 4 files changed, 299 insertions(+), 5 deletions(-) diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h index 134ffa05b673..cbe16f1b68c4 100644 --- a/fs/ext4/ext4.h +++ b/fs/ext4/ext4.h @@ -3737,6 +3737,8 @@ extern long ext4_fallocate(struct file *file, int mode, loff_t offset, loff_t len); extern int ext4_convert_unwritten_extents(handle_t *handle, struct inode *inode, loff_t offset, ssize_t len); +extern int ext4_convert_unwritten_extents_atomic(handle_t *handle, + struct inode *inode, loff_t offset, ssize_t len); extern int ext4_convert_unwritten_io_end_vec(handle_t *handle, ext4_io_end_t *io_end); extern int ext4_map_blocks(handle_t *handle, struct inode *inode, diff --git a/fs/ext4/extents.c b/fs/ext4/extents.c index e54b408493d8..25406919141b 100644 --- a/fs/ext4/extents.c +++ b/fs/ext4/extents.c @@ -4877,6 +4877,93 @@ long ext4_fallocate(struct file *file, int mode, loff_t offset, loff_t len) return ret; } +/* + * This function converts a range of blocks to written extents. The caller of + * this function will pass the start offset and the size. All unwritten extents + * within this range will be converted to written extents. + * + * This function is called from the direct IO end io call back function for + * atomic writes, to convert the unwritten extents after IO is completed. + * + * Note that the requirement for atomic writes is that all conversion should + * happen atomically in a single fs journal transaction. We mainly only allocate + * unwritten extents either on a hole or on a pre-existing unwritten extent range in + * ext4_map_blocks_atomic_write(). The only case where we can have multiple + * unwritten extents in a range [offset, offset+len) is when there is a split + * unwritten extent between two leaf nodes which was cached in extent status + * cache during ext4_iomap_alloc() time. That will allow + * ext4_map_blocks_atomic_write() to return the unwritten extent range w/o going + * into the slow path. That means we might need a loop for conversion of this + * unwritten extent split across leaf block within a single journal transaction. + * Split extents across leaf nodes are a rare case, but let's still handle that + * to meet the requirements of multi-fsblock atomic writes. + * + * Returns 0 on success. + */ +int ext4_convert_unwritten_extents_atomic(handle_t *handle, struct inode *inode, + loff_t offset, ssize_t len) +{ + unsigned int max_blocks; + int ret = 0, ret2 = 0, ret3 = 0; + struct ext4_map_blocks map; + unsigned int blkbits = inode->i_blkbits; + unsigned int credits = 0; + int flags = EXT4_GET_BLOCKS_IO_CONVERT_EXT; + + map.m_lblk = offset >> blkbits; + max_blocks = EXT4_MAX_BLOCKS(len, offset, blkbits); + + if (!handle) { + /* + * TODO: An optimization can be added later by having an extent + * status flag e.g. EXTENT_STATUS_SPLIT_LEAF.
If we query that + * it can tell if the extent in the cache is a split extent. + * But for now let's assume pextents as 2 always. + */ + credits = ext4_meta_trans_blocks(inode, max_blocks, 2); + } + + if (credits) { + handle = ext4_journal_start(inode, EXT4_HT_MAP_BLOCKS, credits); + if (IS_ERR(handle)) { + ret = PTR_ERR(handle); + return ret; + } + } + + while (ret >= 0 && ret < max_blocks) { + map.m_lblk += ret; + map.m_len = (max_blocks -= ret); + ret = ext4_map_blocks(handle, inode, &map, flags); + if (ret != max_blocks) + ext4_msg(inode->i_sb, KERN_INFO, + "inode #%lu: block %u: len %u: " + "split block mapping found for atomic write, " + "ret = %d", + inode->i_ino, map.m_lblk, + map.m_len, ret); + if (ret <= 0) + break; + } + + ret2 = ext4_mark_inode_dirty(handle, inode); + + if (credits) { + ret3 = ext4_journal_stop(handle); + if (unlikely(ret3)) + ret2 = ret3; + } + + if (ret <= 0 || ret2) + ext4_warning(inode->i_sb, + "inode #%lu: block %u: len %u: " + "returned %d or %d", + inode->i_ino, map.m_lblk, + map.m_len, ret, ret2); + + return ret > 0 ? ret2 : ret; +} + /* * This function convert a range of blocks to written extents * The caller of this function will pass the start offset and the size. diff --git a/fs/ext4/file.c b/fs/ext4/file.c index 128a9dfe111e..4c8ddf7adc07 100644 --- a/fs/ext4/file.c +++ b/fs/ext4/file.c @@ -416,7 +416,12 @@ static int ext4_dio_write_end_io(struct kiocb *iocb, ssize_t size, loff_t pos = iocb->ki_pos; struct inode *inode = file_inode(iocb->ki_filp); - if (!error && size && flags & IOMAP_DIO_UNWRITTEN) + + if (!error && size && (flags & IOMAP_DIO_UNWRITTEN) && + (iocb->ki_flags & IOCB_ATOMIC)) + error = ext4_convert_unwritten_extents_atomic(NULL, inode, pos, + size); + else if (!error && size && flags & IOMAP_DIO_UNWRITTEN) error = ext4_convert_unwritten_extents(NULL, inode, pos, size); if (error) return error; diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c index 9b873113a25f..5a50095e5a45 100644 --- a/fs/ext4/inode.c +++ b/fs/ext4/inode.c @@ -3457,12 +3457,149 @@ static void ext4_set_iomap(struct inode *inode, struct iomap *iomap, } } +static int ext4_map_blocks_atomic_write_slow(handle_t *handle, + struct inode *inode, struct ext4_map_blocks *map) +{ + ext4_lblk_t m_lblk = map->m_lblk; + unsigned int m_len = map->m_len; + unsigned int mapped_len = 0, m_flags = 0; + ext4_fsblk_t next_pblk; + bool check_next_pblk = false; + int ret = 0; + + WARN_ON_ONCE(!ext4_has_feature_bigalloc(inode->i_sb)); + + /* + * This is a slow path in case of mixed mapping. We use + * EXT4_GET_BLOCKS_CREATE_ZERO flag here to make sure we get a single + * contiguous mapped mapping. This will ensure any unwritten or hole + * regions within the requested range is zeroed out and we return + * a single contiguous mapped extent. + */ + m_flags = EXT4_GET_BLOCKS_CREATE_ZERO; + + do { + ret = ext4_map_blocks(handle, inode, map, m_flags); + if (ret < 0 && ret != -ENOSPC) + goto out_err; + /* + * This should never happen, but let's return an error code to + * avoid an infinite loop in here. + */ + if (ret == 0) { + ret = -EFSCORRUPTED; + ext4_warning_inode(inode, + "ext4_map_blocks() couldn't allocate blocks m_flags: 0x%x, ret:%d", + m_flags, ret); + goto out_err; + } + /* + * With bigalloc we should never get ENOSPC nor discontiguous + * physical extents. 
+ */ + if ((check_next_pblk && next_pblk != map->m_pblk) || + ret == -ENOSPC) { + ext4_warning_inode(inode, + "Non-contiguous allocation detected: expected %llu, got %llu, " + "or ext4_map_blocks() returned out of space ret: %d", + next_pblk, map->m_pblk, ret); + ret = -EFSCORRUPTED; + goto out_err; + } + next_pblk = map->m_pblk + map->m_len; + check_next_pblk = true; + + mapped_len += map->m_len; + map->m_lblk += map->m_len; + map->m_len = m_len - mapped_len; + } while (mapped_len < m_len); + + /* + * We might have done some work in above loop, so we need to query the + * start of the physical extent, based on the origin m_lblk and m_len. + * Let's also ensure we were able to allocate the required range for + * mixed mapping case. + */ + map->m_lblk = m_lblk; + map->m_len = m_len; + map->m_flags = 0; + + ret = ext4_map_blocks(handle, inode, map, + EXT4_GET_BLOCKS_QUERY_LAST_IN_LEAF); + if (ret != m_len) { + ext4_warning_inode(inode, + "allocation failed for atomic write request m_lblk:%u, m_len:%u, ret:%d\n", + m_lblk, m_len, ret); + ret = -EINVAL; + } + return ret; + +out_err: + /* reset map before returning an error */ + map->m_lblk = m_lblk; + map->m_len = m_len; + map->m_flags = 0; + return ret; +} + +/* + * ext4_map_blocks_atomic: Helper routine to ensure the entire requested + * range in @map [lblk, lblk + len) is one single contiguous extent with no + * mixed mappings. + * + * We first use m_flags passed to us by our caller (ext4_iomap_alloc()). + * We only call EXT4_GET_BLOCKS_ZERO in the slow path, when the underlying + * physical extent for the requested range does not have a single contiguous + * mapping type i.e. (Hole, Mapped, or Unwritten) throughout. + * In that case we will loop over the requested range to allocate and zero out + * the unwritten / holes in between, to get a single mapped extent from + * [m_lblk, m_lblk + m_len). Note that this is only possible because we know + * this can be called only with bigalloc enabled filesystem where the underlying + * cluster is already allocated. This avoids allocating discontiguous extents + * in the slow path due to multiple calls to ext4_map_blocks(). + * The slow path is mostly non-performance critical path, so it should be ok to + * loop using ext4_map_blocks() with appropriate flags to allocate & zero the + * underlying short holes/unwritten extents within the requested range. + */ +static int ext4_map_blocks_atomic_write(handle_t *handle, struct inode *inode, + struct ext4_map_blocks *map, int m_flags, + bool *force_commit) +{ + ext4_lblk_t m_lblk = map->m_lblk; + unsigned int m_len = map->m_len; + int ret = 0; + + WARN_ON_ONCE(m_len > 1 && !ext4_has_feature_bigalloc(inode->i_sb)); + + ret = ext4_map_blocks(handle, inode, map, m_flags); + if (ret < 0 || ret == m_len) + goto out; + /* + * This is a mixed mapping case where we were not able to allocate + * a single contiguous extent. In that case let's reset requested + * mapping and call the slow path. + */ + map->m_lblk = m_lblk; + map->m_len = m_len; + map->m_flags = 0; + + /* + * slow path means we have mixed mapping, that means we will need + * to force txn commit. 
+ */ + *force_commit = true; + return ext4_map_blocks_atomic_write_slow(handle, inode, map); +out: + return ret; +} + static int ext4_iomap_alloc(struct inode *inode, struct ext4_map_blocks *map, unsigned int flags) { handle_t *handle; u8 blkbits = inode->i_blkbits; int ret, dio_credits, m_flags = 0, retries = 0; + bool force_commit = false; /* * Trim the mapping request to the maximum value that we can map at @@ -3470,7 +3607,30 @@ static int ext4_iomap_alloc(struct inode *inode, struct ext4_map_blocks *map, */ if (map->m_len > DIO_MAX_BLOCKS) map->m_len = DIO_MAX_BLOCKS; - dio_credits = ext4_chunk_trans_blocks(inode, map->m_len); + + /* + * journal credits estimation for atomic writes. We call + * ext4_map_blocks(), to find if there could be a mixed mapping. If yes, + * then let's assume the no. of pextents required can be m_len i.e. + * every alternate block can be unwritten and hole. + */ + if (flags & IOMAP_ATOMIC) { + unsigned int orig_mlen = map->m_len; + + ret = ext4_map_blocks(NULL, inode, map, 0); + if (ret < 0) + return ret; + if (map->m_len < orig_mlen) { + map->m_len = orig_mlen; + dio_credits = ext4_meta_trans_blocks(inode, orig_mlen, + map->m_len); + } else { + dio_credits = ext4_chunk_trans_blocks(inode, + map->m_len); + } + } else { + dio_credits = ext4_chunk_trans_blocks(inode, map->m_len); + } retry: /* @@ -3501,7 +3661,11 @@ static int ext4_iomap_alloc(struct inode *inode, struct ext4_map_blocks *map, else if (ext4_test_inode_flag(inode, EXT4_INODE_EXTENTS)) m_flags = EXT4_GET_BLOCKS_IO_CREATE_EXT; - ret = ext4_map_blocks(handle, inode, map, m_flags); + if (flags & IOMAP_ATOMIC) + ret = ext4_map_blocks_atomic_write(handle, inode, map, m_flags, + &force_commit); + else + ret = ext4_map_blocks(handle, inode, map, m_flags); /* * We cannot fill holes in indirect tree based inodes as that could @@ -3515,6 +3679,22 @@ static int ext4_iomap_alloc(struct inode *inode, struct ext4_map_blocks *map, if (ret == -ENOSPC && ext4_should_retry_alloc(inode->i_sb, &retries)) goto retry; + /* + * Force commit the current transaction if the allocation spans a mixed + * mapping range. This ensures any pending metadata updates (like + * unwritten to written extents conversion) in this range are in + * consistent state with the file data blocks, before performing the + * actual write I/O. If the commit fails, the whole I/O must be aborted + * to prevent any possible torn writes. + */ + if (ret > 0 && force_commit) { + int ret2; + + ret2 = ext4_force_commit(inode->i_sb); + if (ret2) + return ret2; + } + return ret; } @@ -3525,6 +3705,7 @@ static int ext4_iomap_begin(struct inode *inode, loff_t offset, loff_t length, int ret; struct ext4_map_blocks map; u8 blkbits = inode->i_blkbits; + unsigned int orig_mlen; if ((offset >> blkbits) > EXT4_MAX_LOGICAL_BLOCK) return -EINVAL; @@ -3538,6 +3719,7 @@ static int ext4_iomap_begin(struct inode *inode, loff_t offset, loff_t length, map.m_lblk = offset >> blkbits; map.m_len = min_t(loff_t, (offset + length - 1) >> blkbits, EXT4_MAX_LOGICAL_BLOCK) - map.m_lblk + 1; + orig_mlen = map.m_len; if (flags & IOMAP_WRITE) { /* @@ -3548,8 +3730,16 @@ static int ext4_iomap_begin(struct inode *inode, loff_t offset, loff_t length, */ if (offset + length <= i_size_read(inode)) { ret = ext4_map_blocks(NULL, inode, &map, 0); - if (ret > 0 && (map.m_flags & EXT4_MAP_MAPPED)) - goto out; + /* + * For atomic writes the entire requested length should + * be mapped. 
+ */ + if (map.m_flags & EXT4_MAP_MAPPED) { + if ((!(flags & IOMAP_ATOMIC) && ret > 0) || + (flags & IOMAP_ATOMIC && ret >= orig_mlen)) + goto out; + } + map.m_len = orig_mlen; } ret = ext4_iomap_alloc(inode, &map, flags); } else { @@ -3570,6 +3760,16 @@ static int ext4_iomap_begin(struct inode *inode, loff_t offset, loff_t length, */ map.m_len = fscrypt_limit_io_blocks(inode, map.m_lblk, map.m_len); + /* + * Before returning to iomap, let's ensure the allocated mapping + * covers the entire requested length for atomic writes. + */ + if (flags & IOMAP_ATOMIC) { + if (map.m_len < (length >> blkbits)) { + WARN_ON_ONCE(1); + return -EINVAL; + } + } ext4_set_iomap(inode, iomap, &map, offset, length, flags); return 0; -- 2.39.2
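For context on how this allocation path gets exercised: userspace drives it
with pwritev2() and RWF_ATOMIC on an O_DIRECT file descriptor. Below is a
minimal sketch, not part of the series; the file path and the 16KB write size
are assumptions for illustration, and the RWF_ATOMIC fallback define mirrors
the value in include/uapi/linux/fs.h for older userspace headers.

/*
 * Issue one 16KB untorn write. Requires a filesystem and device whose
 * atomic write window covers 16KB (e.g. bigalloc with a >= 16KB cluster).
 */
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/uio.h>
#include <unistd.h>

#ifndef RWF_ATOMIC
#define RWF_ATOMIC	0x00000040	/* matches include/uapi/linux/fs.h */
#endif

int main(void)
{
	size_t len = 16 * 1024;
	void *buf = NULL;
	/* hypothetical path on an ext4 bigalloc filesystem */
	int fd = open("/mnt/scratch/testfile", O_RDWR | O_DIRECT);

	if (fd < 0 || posix_memalign(&buf, len, len))
		return 1;
	memset(buf, 0xab, len);

	struct iovec iov = { .iov_base = buf, .iov_len = len };
	/* offset and length must be naturally aligned to the write size */
	ssize_t ret = pwritev2(fd, &iov, 1, 0, RWF_ATOMIC);

	printf("pwritev2(RWF_ATOMIC) returned %zd\n", ret);
	free(buf);
	close(fd);
	return ret == (ssize_t)len ? 0 : 1;
}

Either the whole 16KB lands on disk or none of it does; with the patch above,
that holds even when the target range initially had a mixed
(hole/unwritten/mapped) layout.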
From: "Ritesh Harjani (IBM)" <ritesh.list@gmail.com>

mainline inclusion
from mainline-v6.15-rc4
commit 642e0dc73c5def8270ff7c55d750ff36a6ea5d10
category: feature
bugzilla: https://atomgit.com/src-openeuler/kernel/issues/8862
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i...

--------------------------------

The last couple of patches added the needed support for multi-fsblock atomic
writes using bigalloc. This patch ensures that the filesystem advertises the
needed atomic write unit min and max values for enabling multi-fsblock atomic
write support with bigalloc.

Acked-by: Darrick J. Wong <djwong@kernel.org>
Co-developed-by: Ojaswin Mujoo <ojaswin@linux.ibm.com>
Signed-off-by: Ojaswin Mujoo <ojaswin@linux.ibm.com>
Signed-off-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com>
Link: https://patch.msgid.link/5e45d7ed24499024b9079436ba6698dae5298e29.1747337952...
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Signed-off-by: Long Li <leo.lilong@huawei.com>
---
 fs/ext4/super.c | 7 +++++--
 1 file changed, 5 insertions(+), 2 deletions(-)

diff --git a/fs/ext4/super.c b/fs/ext4/super.c
index adbbc26c239b..46dbdd82cdb7 100644
--- a/fs/ext4/super.c
+++ b/fs/ext4/super.c
@@ -4577,13 +4577,16 @@ static int ext4_handle_clustersize(struct super_block *sb)
 
 /*
  * ext4_atomic_write_init: Initializes filesystem min & max atomic write units.
+ * With non-bigalloc filesystem awu will be based upon filesystem blocksize
+ * & bdev awu units.
+ * With bigalloc it will be based upon bigalloc cluster size & bdev awu units.
  * @sb: super block
- * TODO: Later add support for bigalloc
 */
static void ext4_atomic_write_init(struct super_block *sb)
{
	struct ext4_sb_info *sbi = EXT4_SB(sb);
	struct block_device *bdev = sb->s_bdev;
+	unsigned int clustersize = EXT4_CLUSTER_SIZE(sb);

	if (!bdev_can_atomic_write(bdev))
		return;
@@ -4593,7 +4596,7 @@ static void ext4_atomic_write_init(struct super_block *sb)
 
 	sbi->s_awu_min = max(sb->s_blocksize,
 			     bdev_atomic_write_unit_min_bytes(bdev));
-	sbi->s_awu_max = min(sb->s_blocksize,
+	sbi->s_awu_max = min(clustersize,
 			     bdev_atomic_write_unit_max_bytes(bdev));
 	if (sbi->s_awu_min && sbi->s_awu_max &&
 	    sbi->s_awu_min <= sbi->s_awu_max) {
-- 
2.39.2
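The advertised limits are a clamp of the device's atomic write window by the
filesystem geometry: awu_min = max(blocksize, device minimum) and
awu_max = min(clustersize, device maximum), with the feature enabled only when
the resulting window is non-empty. A standalone sketch of that arithmetic,
using assumed sample values:

#include <stdio.h>

static unsigned int max_u(unsigned int a, unsigned int b) { return a > b ? a : b; }
static unsigned int min_u(unsigned int a, unsigned int b) { return a < b ? a : b; }

int main(void)
{
	/* assumed geometry: mkfs.ext4 -O bigalloc -b 4096 -C 65536 */
	unsigned int blocksize = 4096, clustersize = 65536;
	/* assumed device window from atomic_write_unit_{min,max} sysfs */
	unsigned int bdev_awu_min = 4096, bdev_awu_max = 65536;

	unsigned int awu_min = max_u(blocksize, bdev_awu_min);
	unsigned int awu_max = min_u(clustersize, bdev_awu_max);

	if (awu_min && awu_max && awu_min <= awu_max)
		printf("atomic writes enabled: awu_min=%u awu_max=%u\n",
		       awu_min, awu_max);	/* prints 4096 and 65536 */
	else
		printf("atomic writes disabled\n");
	return 0;
}

Note how a non-bigalloc filesystem (clustersize == blocksize) degenerates to
awu_min == awu_max == blocksize whenever the device window covers the
blocksize, which is the earlier single-fsblock behavior.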
From: "Ritesh Harjani (IBM)" <ritesh.list@gmail.com>

mainline inclusion
from mainline-v6.15-rc4
commit 0bf1f51e34c4e260095fc7306c564c7fe01503e3
category: feature
bugzilla: https://atomgit.com/src-openeuler/kernel/issues/8862
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i...

--------------------------------

Add initial documentation for atomic write support in ext4.

Acked-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com>
Link: https://patch.msgid.link/d3893b9f5ad70317abae72046e81e4c180af91bf.1747337952...
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Signed-off-by: Long Li <leo.lilong@huawei.com>
---
 .../filesystems/ext4/atomic_writes.rst       | 225 ++++++++++++++++++
 Documentation/filesystems/ext4/overview.rst  |   1 +
 2 files changed, 226 insertions(+)
 create mode 100644 Documentation/filesystems/ext4/atomic_writes.rst

diff --git a/Documentation/filesystems/ext4/atomic_writes.rst b/Documentation/filesystems/ext4/atomic_writes.rst
new file mode 100644
index 000000000000..f65767df3620
--- /dev/null
+++ b/Documentation/filesystems/ext4/atomic_writes.rst
@@ -0,0 +1,225 @@
+.. SPDX-License-Identifier: GPL-2.0
+.. _atomic_writes:
+
+Atomic Block Writes
+-------------------------
+
+Introduction
+~~~~~~~~~~~~
+
+Atomic (untorn) block writes ensure that either the entire write is committed
+to disk or none of it is. This prevents "torn writes" during power loss or
+system crashes. The ext4 filesystem supports atomic writes (only with Direct
+I/O) on regular files with extents, provided the underlying storage device
+supports hardware atomic writes. This is supported in the following two ways:
+
+1. **Single-fsblock Atomic Writes**:
+   ext4 has supported atomic write operations on a single filesystem block
+   since v6.13. Here the atomic write unit minimum and maximum sizes are both
+   set to the filesystem blocksize.
+   e.g. an atomic write of 16KB with a 16KB filesystem blocksize on a 64KB
+   pagesize system is possible.
+
+2. **Multi-fsblock Atomic Writes with Bigalloc**:
+   ext4 now also supports atomic writes spanning multiple filesystem blocks
+   using a feature known as bigalloc. The atomic write unit's minimum and
+   maximum sizes are determined by the filesystem block size and cluster size,
+   based on the underlying device's supported atomic write unit limits.
+
+Requirements
+~~~~~~~~~~~~
+
+Basic requirements for atomic writes in ext4:
+
+ 1. The extents feature must be enabled (default for ext4)
+ 2. The underlying block device must support atomic writes
+ 3. For single-fsblock atomic writes:
+
+    1. A filesystem with an appropriate block size (up to the page size)
+ 4. For multi-fsblock atomic writes:
+
+    1. The bigalloc feature must be enabled
+    2. The cluster size must be appropriately configured
+
+NOTE: ext4 does not support software or CoW based atomic writes, which means
+atomic writes on ext4 are only supported if the underlying storage device
+supports them.
+
+Multi-fsblock Implementation Details
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+The bigalloc feature changes ext4 to allocate in units of multiple filesystem
+blocks, also known as clusters. With bigalloc each bit within the block bitmap
+represents a cluster (a power-of-2 number of blocks) rather than an individual
+filesystem block.
+ext4 supports multi-fsblock atomic writes with bigalloc, subject to the
+following constraints.
The minimum atomic write size is the larger of the fs block size and the
+minimum hardware atomic write unit; the maximum atomic write size is the
+smaller of the bigalloc cluster size and the maximum hardware atomic write
+unit. Bigalloc ensures that all allocations are aligned to the cluster size,
+which satisfies the LBA alignment requirements of the hardware device if the
+start of the partition/logical volume is itself aligned correctly.
+
+Here is the block allocation strategy in bigalloc for atomic writes:
+
+ * For regions with fully mapped extents, no additional work is needed
+ * For append writes, a new mapped extent is allocated
+ * For regions that are entirely holes, an unwritten extent is created
+ * For large unwritten extents, the extent gets split into two unwritten
+   extents of the appropriate requested size
+ * For mixed mapping regions (combinations of holes, unwritten extents, or
+   mapped extents), ext4_map_blocks() is called in a loop with the
+   EXT4_GET_BLOCKS_ZERO flag to convert the region into a single contiguous
+   mapped extent by writing zeroes to it and converting any unwritten extents
+   to written, if found within the range.
+
+Note: Writing on a single contiguous underlying extent, whether mapped or
+unwritten, is not inherently problematic. However, writing to a mixed mapping
+region (i.e. one containing a combination of mapped and unwritten extents)
+must be avoided when performing atomic writes.
+
+The reason is that atomic writes, when issued via pwritev2() with the
+RWF_ATOMIC flag, require that either all data is written or none at all. In
+the event of a system crash or unexpected power loss during the write
+operation, the affected region (when later read) must reflect either the
+complete old data or the complete new data, but never a mix of both.
+
+To enforce this guarantee, we ensure that the write target is backed by a
+single, contiguous extent before any data is written. This is critical because
+ext4 defers the conversion of unwritten extents to written extents until the
+I/O completion path (typically in ->end_io()). If a write is allowed to
+proceed over a mixed mapping region (with mapped and unwritten extents) and a
+failure occurs mid-write, the system could observe partially updated regions
+after reboot, i.e. new data over mapped areas, and stale (old) data over
+unwritten extents that were never marked written. This violates the atomicity
+and torn-write prevention guarantee.
+
+To prevent such torn writes, ext4 proactively allocates a single contiguous
+extent for the entire requested region in ``ext4_iomap_alloc`` via
+``ext4_map_blocks_atomic()``. ext4 also force-commits the current journalling
+transaction if the allocation is done over a mixed mapping. This ensures any
+pending metadata updates (like unwritten to written extent conversion) in this
+range are in a consistent state with the file data blocks before performing
+the actual write I/O. If the commit fails, the whole I/O must be aborted to
+prevent any possible torn writes.
+Only after this step is the actual data write performed by iomap.
+
+Handling Split Extents Across Leaf Blocks
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+There can be a special edge case where we have logically and physically
+contiguous extents stored in separate leaf nodes of the on-disk extent tree.
+
This occurs because on-disk extent tree merges only happen within the leaf
+blocks, except for the case of a 2-level tree which can get merged and
+collapsed entirely into the inode.
+If such a layout exists and, in the worst case, the extent status cache
+entries are reclaimed due to memory pressure, ``ext4_map_blocks()`` may never
+return a single contiguous extent for these split leaf extents.
+
+To address this edge case, a new get block flag
+``EXT4_GET_BLOCKS_QUERY_LEAF_BLOCKS`` is added to enhance the
+``ext4_map_query_blocks()`` lookup behavior.
+
+This new get block flag allows ``ext4_map_blocks()`` to first check if there
+is an entry in the extent status cache for the full range.
+If not present, it consults the on-disk extent tree using
+``ext4_map_query_blocks()``.
+If the located extent is at the end of a leaf node, it probes the next logical
+block (lblk) to detect a contiguous extent in the adjacent leaf.
+
+For now only one additional leaf block is queried to maintain efficiency, as
+atomic writes are typically constrained to small sizes
+(e.g. [blocksize, clustersize]).
+
+
+Handling Journal Transactions
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+To support multi-fsblock atomic writes, we ensure enough journal credits are
+reserved during:
+
+ 1. Block allocation time in ``ext4_iomap_alloc()``. We first query whether
+    there could be a mixed mapping for the underlying requested range. If yes,
+    we reserve credits of up to ``m_len``, assuming every alternate block can
+    be an unwritten extent followed by a hole.
+
+ 2. The ``->end_io()`` call, where we make sure a single transaction is
+    started for doing the unwritten-to-written conversion. The loop for
+    conversion is mainly only required to handle a split extent across leaf
+    blocks.
+
+How to
+------
+
+Creating Filesystems with Atomic Write Support
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+First check the atomic write units supported by the block device.
+See :ref:`atomic_write_bdev_support` for more details.
+
+For single-fsblock atomic writes with a larger block size
+(on systems with block size < page size):
+
+.. code-block:: bash
+
+    # Create an ext4 filesystem with a 16KB block size
+    # (requires page size >= 16KB)
+    mkfs.ext4 -b 16384 /dev/device
+
+For multi-fsblock atomic writes with bigalloc:
+
+.. code-block:: bash
+
+    # Create an ext4 filesystem with bigalloc and a 64KB cluster size
+    mkfs.ext4 -F -O bigalloc -b 4096 -C 65536 /dev/device
+
+Where ``-b`` specifies the block size, ``-C`` specifies the cluster size in
+bytes, and ``-O bigalloc`` enables the bigalloc feature.
+
+Application Interface
+~~~~~~~~~~~~~~~~~~~~~
+
+Applications can use the ``pwritev2()`` system call with the ``RWF_ATOMIC``
+flag to perform atomic writes:
+
+.. code-block:: c
+
+    pwritev2(fd, iov, iovcnt, offset, RWF_ATOMIC);
+
+The write must be aligned to the filesystem's block size and must not exceed
+the filesystem's maximum atomic write unit size.
+See ``generic_atomic_write_valid()`` for more details.
+
+The ``statx()`` system call with the ``STATX_WRITE_ATOMIC`` flag can provide
+the following details:
+
+ * ``stx_atomic_write_unit_min``: Minimum size of an atomic write request.
+ * ``stx_atomic_write_unit_max``: Maximum size of an atomic write request.
+ * ``stx_atomic_write_segments_max``: Upper limit for segments. The number of
+   separate memory buffers that can be gathered into a write operation
+   (e.g., the iovcnt parameter for IOV_ITER). Currently, this is always set
+   to one.
+ +The STATX_ATTR_WRITE_ATOMIC flag in ``statx->attributes`` is set if atomic +writes are supported. + +.. _atomic_write_bdev_support: + +Hardware Support +---------------- + +The underlying storage device must support atomic write operations. +Modern NVMe and SCSI devices often provide this capability. +The Linux kernel exposes this information through sysfs: + +* ``/sys/block/<device>/queue/atomic_write_unit_min`` - Minimum atomic write size +* ``/sys/block/<device>/queue/atomic_write_unit_max`` - Maximum atomic write size + +Nonzero values for these attributes indicate that the device supports +atomic writes. + +See Also +-------- + +* :doc:`bigalloc` - Documentation on the bigalloc feature +* :doc:`allocators` - Documentation on block allocation in ext4 +* Support for atomic block writes in 6.13: + https://lwn.net/Articles/1009298/ diff --git a/Documentation/filesystems/ext4/overview.rst b/Documentation/filesystems/ext4/overview.rst index 0fad6eda6e15..9d4054c17ecb 100644 --- a/Documentation/filesystems/ext4/overview.rst +++ b/Documentation/filesystems/ext4/overview.rst @@ -25,3 +25,4 @@ order. .. include:: inlinedata.rst .. include:: eainode.rst .. include:: verity.rst +.. include:: atomic_writes.rst -- 2.39.2
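To complement the documentation above, here is a minimal userspace sketch that
queries the advertised limits via statx(). It assumes glibc and kernel uapi
headers new enough to define STATX_WRITE_ATOMIC and the stx_atomic_write_*
fields (v6.11+); the path is hypothetical.

#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <sys/stat.h>

int main(void)
{
	struct statx stx;

	/* ask specifically for the atomic write fields */
	if (statx(AT_FDCWD, "/mnt/scratch/testfile", 0,
		  STATX_WRITE_ATOMIC, &stx) != 0) {
		perror("statx");
		return 1;
	}

	if (stx.stx_attributes & STATX_ATTR_WRITE_ATOMIC)
		printf("awu_min=%u awu_max=%u segments_max=%u\n",
		       stx.stx_atomic_write_unit_min,
		       stx.stx_atomic_write_unit_max,
		       stx.stx_atomic_write_segments_max);
	else
		printf("atomic writes not supported on this file\n");
	return 0;
}

A write issued with RWF_ATOMIC should then use a power-of-2 size within
[awu_min, awu_max], with the file offset naturally aligned to the write size.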
From: "Ritesh Harjani (IBM)" <ritesh.list@gmail.com>

mainline inclusion
from mainline-v6.14-rc1
commit 786e3080cbe916f6a4c0987124d4f930c1814a97
category: feature
bugzilla: https://atomgit.com/src-openeuler/kernel/issues/8862
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i...

--------------------------------

Filesystems like ext4 can submit writes in multiples of the blocksize,
but we still can't allow those writes to be split. Hence, check whether
iomap_length() is the same as iter->len.

It is the role of the FS to ensure that a single mapping may be created
for an atomic write. The FS will also continue to check size and
alignment legality.

Signed-off-by: "Ritesh Harjani (IBM)" <ritesh.list@gmail.com>
jpg: Tweak commit message
Reviewed-by: "Darrick J. Wong" <djwong@kernel.org>
Signed-off-by: John Garry <john.g.garry@oracle.com>
Link: https://lore.kernel.org/r/20250303171120.2837067-7-john.g.garry@oracle.com
Signed-off-by: Christian Brauner <brauner@kernel.org>
Conflicts:
	fs/iomap/direct-io.c
[Context conflicts]
Signed-off-by: Long Li <leo.lilong@huawei.com>
---
 fs/iomap/direct-io.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/fs/iomap/direct-io.c b/fs/iomap/direct-io.c
index 2cc95448b97a..f44a135d5c42 100644
--- a/fs/iomap/direct-io.c
+++ b/fs/iomap/direct-io.c
@@ -291,7 +291,7 @@ static loff_t iomap_dio_bio_iter(const struct iomap_iter *iter,
 	size_t copied = 0;
 	size_t orig_count;
 
-	if (atomic && length != fs_block_size)
+	if (atomic && length != iter->len)
 		return -EINVAL;
 
 	if ((pos | length) & (bdev_logical_block_size(iomap->bdev) - 1) ||
-- 
2.39.2
hulk inclusion category: bugfix bugzilla: https://atomgit.com/openeuler/kernel/issues/8862 -------------------------------- Fix kabi broken in enum req_flag_bits. Add __REQ_ATOMIC to the end of the enum to prevent the value of __REQ_NOUNMAP from being changed. Signed-off-by: Long Li <leo.lilong@huawei.com> --- include/linux/blk_types.h | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/include/linux/blk_types.h b/include/linux/blk_types.h index ef5b12d47235..b0c15876579b 100644 --- a/include/linux/blk_types.h +++ b/include/linux/blk_types.h @@ -431,12 +431,12 @@ enum req_flag_bits { __REQ_SWAP, /* swap I/O */ __REQ_DRV, /* for driver use */ __REQ_FS_PRIVATE, /* for file system (submitter) use */ - __REQ_ATOMIC, /* for atomic write operations */ /* * Command specific flags, keep last: */ /* for REQ_OP_WRITE_ZEROES: */ __REQ_NOUNMAP, /* do not free blocks when zeroing */ + __REQ_ATOMIC, /* for atomic write operations */ __REQ_NR_BITS, /* stops here */ }; -- 2.39.2
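A standalone illustration of the kABI rule this fix applies: appending an
enumerator preserves every earlier value, while inserting one mid-enum shifts
all later enumerators. The enum and constant names below are illustrative, not
the kernel's.

#include <assert.h>

/* original layout: NOUNMAP is the last flag bit */
enum layout_before   { X_SWAP, X_NOUNMAP };
/* mainline inserted ATOMIC before NOUNMAP, shifting its value: kABI break */
enum layout_inserted { Y_SWAP, Y_ATOMIC, Y_NOUNMAP };
/* the fix appends ATOMIC after NOUNMAP, so existing values are unchanged */
enum layout_appended { Z_SWAP, Z_NOUNMAP, Z_ATOMIC };

int main(void)
{
	static_assert(X_NOUNMAP == 1, "original value");
	static_assert(Y_NOUNMAP == 2, "shifted by the mid-enum insertion");
	static_assert(Z_NOUNMAP == 1, "preserved by appending");
	return 0;
}

The trade-off is that the appended __REQ_ATOMIC no longer sits with the
general-purpose flags in the backport, but modules compiled against the old
headers keep seeing the same REQ_NOUNMAP bit.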
Feedback: The patch(es) you have sent to the kernel@openeuler.org mailing list have been converted to a pull request successfully!
Pull request link: https://atomgit.com/openeuler/kernel/merge_requests/21763
Mailing list address: https://mailweb.openeuler.org/archives/list/kernel@openeuler.org/message/4XK...
On 2026/4/15 9:17, Long Li wrote:
LGTM