[PATCH OLK-6.6 00/14] ext4: fix generic/363 failure in iomap buffered I/O mode
Brian Foster (2):
  iomap: fix handling of dirty folios over unwritten extents
  iomap: make zero range flush conditional on unwritten mappings

Zhang Yi (12):
  ext4: rename and extend ext4_block_truncate_page()
  ext4: factor out journalled block zeroing range
  ext4: rename ext4_block_zero_page_range() to ext4_block_zero_range()
  ext4: move ordered data handling out of ext4_block_do_zero_range()
  ext4: remove handle parameters from zero partial block functions
  ext4: pass allocate range as loff_t to ext4_alloc_file_blocks()
  ext4: move zero partial block range functions out of active handle
  ext4: ensure zeroed partial blocks are persisted in SYNC mode
  ext4: unify SYNC mode checks in fallocate paths
  ext4: remove ctime/mtime update from ext4_alloc_file_blocks()
  ext4: move pagecache_isize_extended() out of active handle
  ext4: zero post EOF partial block before iomap appending write

 fs/ext4/ext4.h         |   5 +-
 fs/ext4/extents.c      | 137 ++++++++++++-----------
 fs/ext4/file.c         |  17 +++
 fs/ext4/inode.c        | 244 +++++++++++++++++++++++------------------
 fs/iomap/buffered-io.c |  63 ++++++++++-
 fs/xfs/xfs_iops.c      |  10 --
 6 files changed, 291 insertions(+), 185 deletions(-)

-- 
2.39.2
From: Brian Foster <bfoster@redhat.com>

mainline inclusion
from mainline-v6.12-rc1
commit c5c810b94cfd818fc2f58c96feee58a9e5ead96d
category: bugfix
bugzilla: https://atomgit.com/openeuler/kernel/issues/8477
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i...

----------------------------------------

The iomap zero range implementation doesn't properly handle dirty
pagecache over unwritten mappings. It skips such mappings as if they
were pre-zeroed. If some part of an unwritten mapping is dirty in
pagecache from a previous write, the data in cache should be zeroed as
well. Instead, the data is left in cache and creates a stale data
exposure problem if writeback occurs sometime after the zero range.

Most callers are unaffected by this because the higher level filesystem
contexts that call zero range typically perform a filemap flush of the
target range for other reasons. A couple of contexts that don't
otherwise need to flush are write file size extension and truncate in
XFS. The former path is currently susceptible to the stale data exposure
problem and the latter performs a flush specifically to work around it.

This is clearly inconsistent and incomplete. As a first step toward
correcting behavior, lift the XFS workaround to iomap_zero_range() and
unconditionally flush the range before the zero range operation
proceeds. While this appears to be a bit of a big hammer, almost all
users already do this from calling context, save for the couple of
exceptions noted above. Future patches will optimize or elide this flush
while maintaining functional correctness.

Fixes: ae259a9c8593 ("fs: introduce iomap infrastructure")
Signed-off-by: Brian Foster <bfoster@redhat.com>
Link: https://lore.kernel.org/r/20240830145634.138439-2-bfoster@redhat.com
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Christian Brauner <brauner@kernel.org>
Signed-off-by: Yongjian Sun <sunyongjian1@huawei.com>
---
 fs/iomap/buffered-io.c | 10 ++++++++++
 fs/xfs/xfs_iops.c      | 10 ----------
 2 files changed, 10 insertions(+), 10 deletions(-)

diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
index fb617ec56f0b..3966c562a2b9 100644
--- a/fs/iomap/buffered-io.c
+++ b/fs/iomap/buffered-io.c
@@ -1536,6 +1536,16 @@ iomap_zero_range(struct inode *inode, loff_t pos, loff_t len, bool *did_zero,
 	};
 	int ret;
 
+	/*
+	 * Zero range wants to skip pre-zeroed (i.e. unwritten) mappings, but
+	 * pagecache must be flushed to ensure stale data from previous
+	 * buffered writes is not exposed.
+	 */
+	ret = filemap_write_and_wait_range(inode->i_mapping,
+			pos, pos + len - 1);
+	if (ret)
+		return ret;
+
 	while ((ret = iomap_iter(&iter, ops)) > 0)
 		iter.processed = iomap_zero_iter(&iter, did_zero);
 	return ret;
diff --git a/fs/xfs/xfs_iops.c b/fs/xfs/xfs_iops.c
index 9b7f49ad3e52..aaea4da92fed 100644
--- a/fs/xfs/xfs_iops.c
+++ b/fs/xfs/xfs_iops.c
@@ -850,16 +850,6 @@ xfs_setattr_size(
 		error = xfs_zero_range(ip, oldsize, newsize - oldsize,
 				&did_zeroing);
 	} else {
-		/*
-		 * iomap won't detect a dirty page over an unwritten block (or a
-		 * cow block over a hole) and subsequently skips zeroing the
-		 * newly post-EOF portion of the page. Flush the new EOF to
-		 * convert the block before the pagecache truncate.
-		 */
-		error = filemap_write_and_wait_range(inode->i_mapping, newsize,
-				newsize);
-		if (error)
-			return error;
 		error = xfs_truncate_page(ip, newsize, &did_zeroing);
 	}
 
-- 
2.39.2
From: Brian Foster <bfoster@redhat.com>

mainline inclusion
from mainline-v6.12-rc1
commit 7d9b474ee4cc375393d13e69d60a30f9b7b46a3a
category: bugfix
bugzilla: https://atomgit.com/openeuler/kernel/issues/8477
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i...

--------------------------------

iomap_zero_range() flushes pagecache to mitigate consistency problems
with dirty pagecache and unwritten mappings. The flush is unconditional
over the entire range because checking pagecache state after mapping
lookup is racy with writeback and reclaim. There are ways around this
using iomap's mapping revalidation mechanism, but it is not supported by
all iomap based filesystems and so is not a generic solution.

There is another way around this limitation that is good enough to
filter the flush for most cases in practice. If we check for dirty
pagecache over the target range (instead of flushing unconditionally),
we can keep track of whether the range was dirty before lookup and defer
the flush until/unless we see a combination of dirty cache backed by an
unwritten mapping. We don't necessarily know whether the dirty cache was
backed by the unwritten mapping or some other (written) part of the
range, but the implication of a false positive here is a spurious flush
and thus relatively harmless.

Note that we also flush for hole mappings because iomap_zero_range() is
used for partial folio zeroing in some cases. For example, if a folio
straddles EOF on a sub-page FSB size fs, the post-EOF portion is
hole-backed and dirtied/written via mapped write, and then i_size
increases before writeback can occur (which otherwise zeroes the
post-EOF portion of the EOF folio), then the folio becomes inconsistent
with disk until reclaimed. A flush in this case executes partial zeroing
from writeback, and iomap knows that there is otherwise no I/O to submit
for hole-backed mappings.

Signed-off-by: Brian Foster <bfoster@redhat.com>
Link: https://lore.kernel.org/r/20240830145634.138439-3-bfoster@redhat.com
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Christian Brauner <brauner@kernel.org>
Signed-off-by: Yongjian Sun <sunyongjian1@huawei.com>
---
 fs/iomap/buffered-io.c | 63 ++++++++++++++++++++++++++++++++++++------
 1 file changed, 54 insertions(+), 9 deletions(-)

diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
index 3966c562a2b9..fc9cce442a09 100644
--- a/fs/iomap/buffered-io.c
+++ b/fs/iomap/buffered-io.c
@@ -1478,16 +1478,53 @@ iomap_file_unshare(struct inode *inode, loff_t pos, loff_t len,
 }
 EXPORT_SYMBOL_GPL(iomap_file_unshare);
 
-static loff_t iomap_zero_iter(struct iomap_iter *iter, bool *did_zero)
+/*
+ * Flush the remaining range of the iter and mark the current mapping stale.
+ * This is used when zero range sees an unwritten mapping that may have had
+ * dirty pagecache over it.
+ */
+static inline int iomap_zero_iter_flush_and_stale(struct iomap_iter *i)
+{
+	struct address_space *mapping = i->inode->i_mapping;
+	loff_t end = i->pos + i->len - 1;
+
+	i->iomap.flags |= IOMAP_F_STALE;
+	return filemap_write_and_wait_range(mapping, i->pos, end);
+}
+
+static loff_t iomap_zero_iter(struct iomap_iter *iter, bool *did_zero,
+		bool *range_dirty)
 {
 	const struct iomap *srcmap = iomap_iter_srcmap(iter);
 	loff_t pos = iter->pos;
 	loff_t length = iomap_length(iter);
 	loff_t written = 0;
 
-	/* already zeroed? we're done. */
-	if (srcmap->type == IOMAP_HOLE || srcmap->type == IOMAP_UNWRITTEN)
+	/*
+	 * We must zero subranges of unwritten mappings that might be dirty in
+	 * pagecache from previous writes. We only know whether the entire range
+	 * was clean or not, however, and dirty folios may have been written
+	 * back or reclaimed at any point after mapping lookup.
+	 *
+	 * The easiest way to deal with this is to flush pagecache to trigger
+	 * any pending unwritten conversions and then grab the updated extents
+	 * from the fs. The flush may change the current mapping, so mark it
+	 * stale for the iterator to remap it for the next pass to handle
+	 * properly.
+	 *
+	 * Note that holes are treated the same as unwritten because zero range
+	 * is (ab)used for partial folio zeroing in some cases. Hole backed
+	 * post-eof ranges can be dirtied via mapped write and the flush
+	 * triggers writeback time post-eof zeroing.
+	 */
+	if (srcmap->type == IOMAP_HOLE || srcmap->type == IOMAP_UNWRITTEN) {
+		if (*range_dirty) {
+			*range_dirty = false;
+			return iomap_zero_iter_flush_and_stale(iter);
+		}
+		/* range is clean and already zeroed, nothing to do */
 		return length;
+	}
 
 	do {
 		struct folio *folio;
@@ -1535,19 +1572,27 @@ iomap_zero_range(struct inode *inode, loff_t pos, loff_t len, bool *did_zero,
 		.flags		= IOMAP_ZERO,
 	};
 	int ret;
+	bool range_dirty;
 
 	/*
 	 * Zero range wants to skip pre-zeroed (i.e. unwritten) mappings, but
 	 * pagecache must be flushed to ensure stale data from previous
-	 * buffered writes is not exposed.
+	 * buffered writes is not exposed. A flush is only required for certain
+	 * types of mappings, but checking pagecache after mapping lookup is
+	 * racy with writeback and reclaim.
+	 *
+	 * Therefore, check the entire range first and pass along whether any
+	 * part of it is dirty. If so and an underlying mapping warrants it,
+	 * flush the cache at that point. This trades off the occasional false
+	 * positive (and spurious flush, if the dirty data and mapping don't
+	 * happen to overlap) for simplicity in handling a relatively uncommon
+	 * situation.
 	 */
-	ret = filemap_write_and_wait_range(inode->i_mapping,
-			pos, pos + len - 1);
-	if (ret)
-		return ret;
+	range_dirty = filemap_range_needs_writeback(inode->i_mapping,
+			pos, pos + len - 1);
 
 	while ((ret = iomap_iter(&iter, ops)) > 0)
-		iter.processed = iomap_zero_iter(&iter, did_zero);
+		iter.processed = iomap_zero_iter(&iter, did_zero, &range_dirty);
 	return ret;
 }
 EXPORT_SYMBOL_GPL(iomap_zero_range);
-- 
2.39.2
From: Zhang Yi <yi.zhang@huawei.com>

hulk inclusion
category: bugfix
bugzilla: https://atomgit.com/openeuler/kernel/issues/8477

--------------------------------

Rename ext4_block_truncate_page() to ext4_block_zero_eof() and extend
its signature to accept an explicit 'end' offset instead of calculating
the block boundary. This helper can now handle all cases that require
zeroing the partial EOF block, including the append buffered write paths
in ext4_*_write_end().

Fixes: 5721968224e0 ("ext4: implement zero_range iomap path")
Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
Signed-off-by: Yongjian Sun <sunyongjian1@huawei.com>
---
 fs/ext4/ext4.h    |  2 ++
 fs/ext4/extents.c |  4 ++--
 fs/ext4/inode.c   | 60 ++++++++++++++++++-----------------------------
 3 files changed, 27 insertions(+), 39 deletions(-)

diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
index 16ce937c4dbc..49e4663ca916 100644
--- a/fs/ext4/ext4.h
+++ b/fs/ext4/ext4.h
@@ -3024,6 +3024,8 @@ extern void ext4_set_aops(struct inode *inode);
 extern int ext4_writepage_trans_blocks(struct inode *);
 extern int ext4_normal_submit_inode_data_buffers(struct jbd2_inode *jinode);
 extern int ext4_chunk_trans_blocks(struct inode *, int nrblocks);
+extern int ext4_block_zero_eof(handle_t *handle, struct inode *inode,
+			       loff_t from, loff_t end);
 extern int ext4_zero_partial_blocks(handle_t *handle, struct inode *inode,
 				    loff_t lstart, loff_t lend);
 extern vm_fault_t ext4_page_mkwrite(struct vm_fault *vmf);
diff --git a/fs/ext4/extents.c b/fs/ext4/extents.c
index 83b2e5d1e00a..c1b96c05e842 100644
--- a/fs/ext4/extents.c
+++ b/fs/ext4/extents.c
@@ -4561,8 +4561,8 @@ static int ext4_alloc_file_blocks(struct file *file, ext4_lblk_t offset,
 				inode_get_ctime(inode));
 		if (epos > old_size) {
 			pagecache_isize_extended(inode, old_size, epos);
-			ext4_zero_partial_blocks(handle, inode,
-						 old_size, epos - old_size);
+			ext4_block_zero_eof(handle, inode, old_size,
+					    epos);
 		}
 	}
 	ret2 = ext4_mark_inode_dirty(handle, inode);
diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index b654c07012fb..15709b81f5a0 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -1347,7 +1347,7 @@ static int ext4_write_end(struct file *file,
 
 	if (old_size < pos && !verity) {
 		pagecache_isize_extended(inode, old_size, pos);
-		ext4_zero_partial_blocks(handle, inode, old_size, pos - old_size);
+		ext4_block_zero_eof(handle, inode, old_size, pos);
 	}
 	/*
 	 * Don't mark the inode dirty under folio lock. First, it unnecessarily
@@ -1466,7 +1466,7 @@ static int ext4_journalled_write_end(struct file *file,
 
 	if (old_size < pos && !verity) {
 		pagecache_isize_extended(inode, old_size, pos);
-		ext4_zero_partial_blocks(handle, inode, old_size, pos - old_size);
+		ext4_block_zero_eof(handle, inode, old_size, pos);
 	}
 
 	if (size_changed) {
@@ -3092,7 +3092,7 @@ static int ext4_da_do_write_end(struct address_space *mapping,
 	if (IS_ERR(handle))
 		return PTR_ERR(handle);
 	if (zero_len)
-		ext4_zero_partial_blocks(handle, inode, old_size, zero_len);
+		ext4_block_zero_eof(handle, inode, old_size, pos);
 	ext4_mark_inode_dirty(handle, inode);
 	ext4_journal_stop(handle);
 
@@ -4392,50 +4392,37 @@ static int ext4_block_zero_page_range(handle_t *handle,
 }
 
 /*
- * ext4_block_truncate_page() zeroes out a mapping from file offset `from'
- * up to the end of the block which corresponds to `from'.
- * This required during truncate. We need to physically zero the tail end
- * of that block so it doesn't yield old data if the file is later grown.
+ * Zero out a mapping from file offset 'from' up to the end of the block
+ * which corresponds to 'from', or to the given 'end' inside this block.
+ * This is required during truncate up and when performing append writes.
+ * We need to physically zero the tail end of that block so it doesn't
+ * yield old data if the file is grown. Return the zeroed length on success.
 */
-static loff_t ext4_block_truncate_page(struct address_space *mapping,
-				       loff_t from)
+int ext4_block_zero_eof(handle_t *handle, struct inode *inode,
+			loff_t from, loff_t end)
 {
-	unsigned offset = from & (PAGE_SIZE-1);
-	unsigned length;
-	unsigned blocksize;
-	struct inode *inode = mapping->host;
+	unsigned int blocksize = i_blocksize(inode);
+	unsigned int offset;
+	loff_t length = end - from;
 	bool did_zero = false;
 	int err;
 
+	offset = from & (blocksize - 1);
+	if (!offset || from >= end)
+		return 0;
 	/* If we are processing an encrypted inode during orphan list handling */
 	if (IS_ENCRYPTED(inode) && !fscrypt_has_encryption_key(inode))
 		return 0;
 
-	blocksize = inode->i_sb->s_blocksize;
-	length = blocksize - (offset & (blocksize - 1));
+	if (length > blocksize - offset)
+		length = blocksize - offset;
 
-	err = ext4_block_zero_page_range(NULL, mapping, from, length,
+	err = ext4_block_zero_page_range(handle, inode->i_mapping, from, length,
 					 &did_zero);
 	if (err)
 		return err;
 
-	/*
-	 * inode with an iomap buffered I/O path does not order data,
-	 * so it is necessary to write out zeroed data before the
-	 * updating i_disksize transaction is committed. Otherwise,
-	 * stale data may remain in the last block, which could be
-	 * exposed during the next expand truncate operation.
-	 */
-	if (length && ext4_test_inode_state(inode, EXT4_STATE_BUFFERED_IOMAP)) {
-		loff_t zero_end = inode->i_size + length;
-
-		err = filemap_write_and_wait_range(mapping,
-				inode->i_size, zero_end - 1);
-		if (err)
-			return err;
-	}
-
-	return length;
+	return did_zero ? length : 0;
 }
 
 int ext4_zero_partial_blocks(handle_t *handle, struct inode *inode,
@@ -4826,7 +4813,6 @@ int ext4_truncate(struct inode *inode)
 	unsigned int credits;
 	int err = 0, err2;
 	handle_t *handle;
-	struct address_space *mapping = inode->i_mapping;
 	loff_t zero_len = 0;
 
 	/*
@@ -4858,7 +4844,7 @@ int ext4_truncate(struct inode *inode)
 	if (err)
 		goto out_trace;
 
-	zero_len = ext4_block_truncate_page(mapping, inode->i_size);
+	zero_len = ext4_block_zero_eof(NULL, inode, inode->i_size, LLONG_MAX);
 	if (zero_len < 0) {
 		err = zero_len;
 		goto out_trace;
@@ -6231,8 +6217,8 @@ int ext4_setattr(struct mnt_idmap *idmap, struct dentry *dentry,
 	if (!shrink && oldsize & (inode->i_sb->s_blocksize - 1)) {
 		loff_t zero_len;
 
-		zero_len = ext4_block_truncate_page(
-				inode->i_mapping, oldsize);
+		zero_len = ext4_block_zero_eof(NULL, inode,
+					       oldsize, LLONG_MAX);
 		if (zero_len < 0) {
 			error = zero_len;
 			goto out_mmap_sem;
-- 
2.39.2
From: Zhang Yi <yi.zhang@huawei.com>

hulk inclusion
category: bugfix
bugzilla: https://atomgit.com/openeuler/kernel/issues/8477

--------------------------------

Refactor __ext4_block_zero_page_range() by separating the block zeroing
operations for ordered data mode and journal data mode into two distinct
functions:

- ext4_block_do_zero_range(): handles non-journal data mode with
  ordered data support
- ext4_block_journalled_zero_range(): handles journal data mode

Also extract a common helper, ext4_block_get_zero_range(), to handle
buffer head and folio retrieval, along with the associated error
handling. This prepares for converting the partial block zero range to
the iomap infrastructure.

Fixes: 5721968224e0 ("ext4: implement zero_range iomap path")
Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
Signed-off-by: Yongjian Sun <sunyongjian1@huawei.com>
---
 fs/ext4/inode.c | 109 ++++++++++++++++++++++++++++++------------------
 1 file changed, 69 insertions(+), 40 deletions(-)

diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index 15709b81f5a0..45b9a2bb4793 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -4236,35 +4236,23 @@ void ext4_set_aops(struct inode *inode)
 		inode->i_mapping->a_ops = &ext4_aops;
 }
 
-static int __ext4_block_zero_page_range(handle_t *handle,
-					struct address_space *mapping,
-					loff_t from, loff_t length,
-					bool *did_zero)
+static struct buffer_head *ext4_block_get_zero_range(struct inode *inode,
+						     loff_t from, loff_t length)
 {
 	ext4_fsblk_t index = from >> PAGE_SHIFT;
 	unsigned offset = from & (PAGE_SIZE-1);
 	unsigned blocksize, pos;
 	ext4_lblk_t iblock;
-	struct inode *inode = mapping->host;
+	struct address_space *mapping = inode->i_mapping;
 	struct buffer_head *bh;
 	struct folio *folio;
 	int err = 0;
-	bool orig_handle_valid = true;
-
-	if (ext4_should_journal_data(inode) && handle == NULL) {
-		handle = ext4_journal_start(inode, EXT4_HT_MISC, 1);
-		if (IS_ERR(handle))
-			return PTR_ERR(handle);
-		orig_handle_valid = false;
-	}
 
 	folio = __filemap_get_folio(mapping, from >> PAGE_SHIFT,
 				    FGP_LOCK | FGP_ACCESSED | FGP_CREAT,
 				    mapping_gfp_constraint(mapping, ~__GFP_FS));
-	if (IS_ERR(folio)) {
-		err = PTR_ERR(folio);
-		goto out;
-	}
+	if (IS_ERR(folio))
+		return ERR_CAST(folio);
 
 	blocksize = inode->i_sb->s_blocksize;
 
@@ -4317,36 +4305,75 @@ static int __ext4_block_zero_page_range(handle_t *handle,
 			}
 		}
 	}
+	return bh;
 
-	if (ext4_should_journal_data(inode)) {
-		BUFFER_TRACE(bh, "get write access");
-		err = ext4_journal_get_write_access(handle, inode->i_sb, bh,
-						    EXT4_JTR_NONE);
-		if (err)
-			goto unlock;
-		folio_zero_range(folio, offset, length);
-		BUFFER_TRACE(bh, "zeroed end of block");
+unlock:
+	folio_unlock(folio);
+	folio_put(folio);
+	return err ? ERR_PTR(err) : NULL;
+}
 
-		err = ext4_dirty_journalled_data(handle, bh);
-		if (err)
-			goto unlock;
-	} else {
-		err = 0;
-		folio_zero_range(folio, offset, length);
-		BUFFER_TRACE(bh, "zeroed end of block");
+static int ext4_block_do_zero_range(handle_t *handle, struct inode *inode,
+				    loff_t from, loff_t length, bool *did_zero)
+{
+	struct buffer_head *bh;
+	struct folio *folio;
+	int err = 0;
 
-		mark_buffer_dirty(bh);
-	}
+	bh = ext4_block_get_zero_range(inode, from, length);
+	if (IS_ERR_OR_NULL(bh))
+		return PTR_ERR_OR_ZERO(bh);
 
-	if (did_zero)
+	folio = bh->b_folio;
+	folio_zero_range(folio, offset_in_folio(folio, from), length);
+	BUFFER_TRACE(bh, "zeroed end of block");
+
+	mark_buffer_dirty(bh);
+	/*
+	 * Only the written block requires ordered data to prevent exposing
+	 * stale data.
+	 */
+	if (ext4_should_order_data(inode) &&
+	    !buffer_unwritten(bh) && !buffer_delay(bh))
+		err = ext4_jbd2_inode_add_write(handle, inode, from, length);
+	if (!err && did_zero)
 		*did_zero = true;
 
-unlock:
 	folio_unlock(folio);
 	folio_put(folio);
+	return err;
+}
+
+static int ext4_block_journalled_zero_range(handle_t *handle,
+		struct inode *inode, loff_t from, loff_t length, bool *did_zero)
+{
+	struct buffer_head *bh;
+	struct folio *folio;
+	int err;
+
+	bh = ext4_block_get_zero_range(inode, from, length);
+	if (IS_ERR_OR_NULL(bh))
+		return PTR_ERR_OR_ZERO(bh);
+	folio = bh->b_folio;
+
+	BUFFER_TRACE(bh, "get write access");
+	err = ext4_journal_get_write_access(handle, inode->i_sb, bh,
+					    EXT4_JTR_NONE);
+	if (err)
+		goto out;
+
+	folio_zero_range(folio, offset_in_folio(folio, from), length);
+	BUFFER_TRACE(bh, "zeroed end of block");
+
+	err = ext4_dirty_journalled_data(handle, bh);
+	if (err)
+		goto out;
+
+	if (did_zero)
+		*did_zero = true;
 out:
-	if (ext4_should_journal_data(inode) && orig_handle_valid == false)
-		ext4_journal_stop(handle);
+	folio_unlock(folio);
+	folio_put(folio);
 	return err;
 }
 
@@ -4384,11 +4411,13 @@ static int ext4_block_zero_page_range(handle_t *handle,
 	if (IS_DAX(inode)) {
 		return dax_zero_range(inode, from, length, NULL,
 				      &ext4_iomap_ops);
+	} else if (ext4_should_journal_data(inode)) {
+		return ext4_block_journalled_zero_range(handle, inode, from,
+							length, did_zero);
 	} else if (ext4_test_inode_state(inode, EXT4_STATE_BUFFERED_IOMAP)) {
 		return ext4_iomap_zero_range(inode, from, length, did_zero);
 	}
-	return __ext4_block_zero_page_range(handle, mapping, from, length,
-					    did_zero);
+	return ext4_block_do_zero_range(handle, inode, from, length, did_zero);
 }
 
 /*
-- 
2.39.2
From: Zhang Yi <yi.zhang@huawei.com>

hulk inclusion
category: bugfix
bugzilla: https://atomgit.com/openeuler/kernel/issues/8477

--------------------------------

Rename ext4_block_zero_page_range() to ext4_block_zero_range(), since
the "page" naming is no longer appropriate in the current context. Also
change its signature to take an inode pointer instead of an
address_space. This aligns it with the callers ext4_block_zero_eof() and
ext4_zero_partial_blocks().

Fixes: 5721968224e0 ("ext4: implement zero_range iomap path")
Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
Signed-off-by: Yongjian Sun <sunyongjian1@huawei.com>
---
 fs/ext4/inode.c | 26 ++++++++++----------------
 1 file changed, 10 insertions(+), 16 deletions(-)

diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index 45b9a2bb4793..bff53d83532b 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -4391,12 +4391,9 @@ static int ext4_iomap_zero_range(struct inode *inode, loff_t from,
  * the end of the block it will be shortened to end of the block
  * that corresponds to 'from'
  */
-static int ext4_block_zero_page_range(handle_t *handle,
-				      struct address_space *mapping,
-				      loff_t from, loff_t length,
-				      bool *did_zero)
+static int ext4_block_zero_range(handle_t *handle, struct inode *inode,
+				 loff_t from, loff_t length, bool *did_zero)
 {
-	struct inode *inode = mapping->host;
 	unsigned offset = from & (PAGE_SIZE-1);
 	unsigned blocksize = inode->i_sb->s_blocksize;
 	unsigned max = blocksize - (offset & (blocksize - 1));
@@ -4446,8 +4443,7 @@ int ext4_block_zero_eof(handle_t *handle, struct inode *inode,
 	if (length > blocksize - offset)
 		length = blocksize - offset;
 
-	err = ext4_block_zero_page_range(handle, inode->i_mapping, from, length,
-					 &did_zero);
+	err = ext4_block_zero_range(handle, inode, from, length, &did_zero);
 	if (err)
 		return err;
 
@@ -4458,7 +4454,6 @@ int ext4_zero_partial_blocks(handle_t *handle, struct inode *inode,
 			     loff_t lstart, loff_t length)
 {
 	struct super_block *sb = inode->i_sb;
-	struct address_space *mapping = inode->i_mapping;
 	unsigned partial_start, partial_end;
 	ext4_fsblk_t start, end;
 	loff_t byte_end = (lstart + length - 1);
@@ -4473,23 +4468,22 @@ int ext4_zero_partial_blocks(handle_t *handle, struct inode *inode,
 	/* Handle partial zero within the single block */
 	if (start == end &&
 	    (partial_start || (partial_end != sb->s_blocksize - 1))) {
-		err = ext4_block_zero_page_range(handle, mapping,
-						 lstart, length, NULL);
+		err = ext4_block_zero_range(handle, inode, lstart,
+					    length, NULL);
 		return err;
 	}
 	/* Handle partial zero out on the start of the range */
 	if (partial_start) {
-		err = ext4_block_zero_page_range(handle, mapping,
-						 lstart, sb->s_blocksize,
-						 NULL);
+		err = ext4_block_zero_range(handle, inode, lstart,
+					    sb->s_blocksize, NULL);
 		if (err)
 			return err;
 	}
 	/* Handle partial zero out on the end of the range */
 	if (partial_end != sb->s_blocksize - 1)
-		err = ext4_block_zero_page_range(handle, mapping,
-						 byte_end - partial_end,
-						 partial_end + 1, NULL);
+		err = ext4_block_zero_range(handle, inode,
+					    byte_end - partial_end,
+					    partial_end + 1, NULL);
 
 	return err;
 }
-- 
2.39.2
From: Zhang Yi <yi.zhang@huawei.com>

hulk inclusion
category: bugfix
bugzilla: https://atomgit.com/openeuler/kernel/issues/8477

--------------------------------

Remove the handle parameter from ext4_block_do_zero_range() and move the
ordered data handling to ext4_block_zero_eof(). This is necessary for
truncating up and for append writes across a range extending beyond EOF.
The ordered data must be committed before updating i_disksize to prevent
exposing stale on-disk data, which may result from concurrent post-EOF
mmap writes during a previous folio writeback or from a system crash
during append writes.

This is unnecessary for partial block hole punching, because the entire
punch operation provides no atomicity guarantees and can already expose
intermediate results in case of a crash. Since ordered data handling is
no longer performed inside ext4_zero_partial_blocks(), ext4_punch_hole()
no longer needs to attach a jinode.

This prepares for the conversion to the iomap infrastructure, which does
not use ordered data mode while zeroing post-EOF partial blocks.

Fixes: 5721968224e0 ("ext4: implement zero_range iomap path")
Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
Signed-off-by: Yongjian Sun <sunyongjian1@huawei.com>
---
 fs/ext4/inode.c | 59 ++++++++++++++++++++++++-------------------------
 1 file changed, 29 insertions(+), 30 deletions(-)

diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index bff53d83532b..b88f255d6a21 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -4313,12 +4313,12 @@ static struct buffer_head *ext4_block_get_zero_range(struct inode *inode,
 	return err ? ERR_PTR(err) : NULL;
 }
 
-static int ext4_block_do_zero_range(handle_t *handle, struct inode *inode,
-				    loff_t from, loff_t length, bool *did_zero)
+static int ext4_block_do_zero_range(struct inode *inode, loff_t from,
+				    loff_t length, bool *did_zero,
+				    bool *zero_written)
 {
 	struct buffer_head *bh;
 	struct folio *folio;
-	int err = 0;
 
 	bh = ext4_block_get_zero_range(inode, from, length);
 	if (IS_ERR_OR_NULL(bh))
@@ -4329,19 +4329,14 @@ static int ext4_block_do_zero_range(handle_t *handle, struct inode *inode,
 	BUFFER_TRACE(bh, "zeroed end of block");
 
 	mark_buffer_dirty(bh);
-	/*
-	 * Only the written block requires ordered data to prevent exposing
-	 * stale data.
-	 */
-	if (ext4_should_order_data(inode) &&
-	    !buffer_unwritten(bh) && !buffer_delay(bh))
-		err = ext4_jbd2_inode_add_write(handle, inode, from, length);
-	if (!err && did_zero)
+	if (did_zero)
 		*did_zero = true;
+	if (zero_written && !buffer_unwritten(bh) && !buffer_delay(bh))
+		*zero_written = true;
 
 	folio_unlock(folio);
 	folio_put(folio);
-	return err;
+	return 0;
 }
 
 static int ext4_block_journalled_zero_range(handle_t *handle,
@@ -4392,7 +4387,8 @@ static int ext4_iomap_zero_range(struct inode *inode, loff_t from,
  * that corresponds to 'from'
  */
 static int ext4_block_zero_range(handle_t *handle, struct inode *inode,
-				 loff_t from, loff_t length, bool *did_zero)
+				 loff_t from, loff_t length, bool *did_zero,
+				 bool *zero_written)
 {
 	unsigned offset = from & (PAGE_SIZE-1);
 	unsigned blocksize = inode->i_sb->s_blocksize;
@@ -4414,7 +4410,8 @@ static int ext4_block_zero_range(handle_t *handle, struct inode *inode,
 	} else if (ext4_test_inode_state(inode, EXT4_STATE_BUFFERED_IOMAP)) {
 		return ext4_iomap_zero_range(inode, from, length, did_zero);
 	}
-	return ext4_block_do_zero_range(handle, inode, from, length, did_zero);
+	return ext4_block_do_zero_range(inode, from, length, did_zero,
+					zero_written);
 }
 
 /*
@@ -4431,6 +4428,7 @@ int ext4_block_zero_eof(handle_t *handle, struct inode *inode,
 	unsigned int offset;
 	loff_t length = end - from;
 	bool did_zero = false;
+	bool zero_written = false;
 	int err;
 
 	offset = from & (blocksize - 1);
@@ -4443,9 +4441,22 @@ int ext4_block_zero_eof(handle_t *handle, struct inode *inode,
 	if (length > blocksize - offset)
 		length = blocksize - offset;
 
-	err = ext4_block_zero_range(handle, inode, from, length, &did_zero);
+	err = ext4_block_zero_range(handle, inode, from, length,
+				    &did_zero, &zero_written);
 	if (err)
 		return err;
 
+	/*
+	 * It's necessary to order the zeroed data before updating i_disksize
+	 * when truncating up or performing an append write, because stale
+	 * on-disk data might otherwise be exposed by a concurrent post-EOF
+	 * mmap write during folio writeback.
+	 */
+	if (ext4_should_order_data(inode) &&
+	    did_zero && zero_written && !IS_DAX(inode)) {
+		err = ext4_jbd2_inode_add_write(handle, inode, from, length);
+		if (err)
+			return err;
+	}
 	return did_zero ? length : 0;
 }
 
@@ -4469,13 +4480,13 @@ int ext4_zero_partial_blocks(handle_t *handle, struct inode *inode,
 	if (start == end &&
 	    (partial_start || (partial_end != sb->s_blocksize - 1))) {
 		err = ext4_block_zero_range(handle, inode, lstart,
-					    length, NULL);
+					    length, NULL, NULL);
 		return err;
 	}
 	/* Handle partial zero out on the start of the range */
 	if (partial_start) {
 		err = ext4_block_zero_range(handle, inode, lstart,
-					    sb->s_blocksize, NULL);
+					    sb->s_blocksize, NULL, NULL);
 		if (err)
 			return err;
 	}
@@ -4483,7 +4494,7 @@ int ext4_zero_partial_blocks(handle_t *handle, struct inode *inode,
 	if (partial_end != sb->s_blocksize - 1)
 		err = ext4_block_zero_range(handle, inode,
 					    byte_end - partial_end,
-					    partial_end + 1, NULL);
+					    partial_end + 1, NULL, NULL);
 
 	return err;
 }
@@ -4675,18 +4686,6 @@ int ext4_punch_hole(struct file *file, loff_t offset, loff_t length)
 	if (offset + length > max_length)
 		length = max_length - offset;
 
-	if (offset & (sb->s_blocksize - 1) ||
-	    (offset + length) & (sb->s_blocksize - 1)) {
-		/*
-		 * Attach jinode to inode for jbd2 if we do any zeroing of
-		 * partial block
-		 */
-		ret = ext4_inode_attach_jinode(inode);
-		if (ret < 0)
-			goto out_mutex;
-
-	}
-
 	/* Wait all existing dio workers, newcomers will block on i_rwsem */
 	inode_dio_wait(inode);
 
-- 
2.39.2
From: Zhang Yi <yi.zhang@huawei.com>

hulk inclusion
category: bugfix
bugzilla: https://atomgit.com/openeuler/kernel/issues/8477

--------------------------------

Only journal data mode requires an active journal handle when zeroing
partial blocks. Stop passing handle_t *handle to
ext4_zero_partial_blocks() and related functions, and make
ext4_block_journalled_zero_range() start a handle independently.

This change has no practical impact now, because all callers invoke
these functions within the context of an active handle. It prepares for
moving ext4_block_zero_eof() out of an active handle in the next patch,
which is a prerequisite for converting the block zero range operations
to the iomap infrastructure.

Fixes: 5721968224e0 ("ext4: implement zero_range iomap path")
Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
Signed-off-by: Yongjian Sun <sunyongjian1@huawei.com>
---
 fs/ext4/ext4.h    |  7 +++---
 fs/ext4/extents.c |  5 ++--
 fs/ext4/inode.c   | 63 ++++++++++++++++++++++++++++-------------------
 3 files changed, 42 insertions(+), 33 deletions(-)

diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
index 49e4663ca916..f33676aac29c 100644
--- a/fs/ext4/ext4.h
+++ b/fs/ext4/ext4.h
@@ -3024,10 +3024,9 @@ extern void ext4_set_aops(struct inode *inode);
 extern int ext4_writepage_trans_blocks(struct inode *);
 extern int ext4_normal_submit_inode_data_buffers(struct jbd2_inode *jinode);
 extern int ext4_chunk_trans_blocks(struct inode *, int nrblocks);
-extern int ext4_block_zero_eof(handle_t *handle, struct inode *inode,
-			       loff_t from, loff_t end);
-extern int ext4_zero_partial_blocks(handle_t *handle, struct inode *inode,
-				    loff_t lstart, loff_t lend);
+extern int ext4_block_zero_eof(struct inode *inode, loff_t from, loff_t end);
+extern int ext4_zero_partial_blocks(struct inode *inode, loff_t lstart,
+				    loff_t length);
 extern vm_fault_t ext4_page_mkwrite(struct vm_fault *vmf);
 extern qsize_t *ext4_get_reserved_space(struct inode *inode);
 extern int ext4_get_projid(struct inode *inode, kprojid_t *projid);
diff --git a/fs/ext4/extents.c b/fs/ext4/extents.c
index c1b96c05e842..43899a3728fe 100644
--- a/fs/ext4/extents.c
+++ b/fs/ext4/extents.c
@@ -4561,8 +4561,7 @@ static int ext4_alloc_file_blocks(struct file *file, ext4_lblk_t offset,
 				inode_get_ctime(inode));
 		if (epos > old_size) {
 			pagecache_isize_extended(inode, old_size, epos);
-			ext4_block_zero_eof(handle, inode, old_size,
-					    epos);
+			ext4_block_zero_eof(inode, old_size, epos);
 		}
 	}
 	ret2 = ext4_mark_inode_dirty(handle, inode);
@@ -4733,7 +4732,7 @@ static long ext4_zero_range(struct file *file, loff_t offset,
 	if (unlikely(ret))
 		goto out_handle;
 	/* Zero out partial block at the edges of the range */
-	ret = ext4_zero_partial_blocks(handle, inode, offset, len);
+	ret = ext4_zero_partial_blocks(inode, offset, len);
 	if (ret >= 0)
 		ext4_update_inode_fsync_trans(handle, inode, 1);
 
diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index b88f255d6a21..271a64050b08 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -1347,7 +1347,7 @@ static int ext4_write_end(struct file *file,
 
 	if (old_size < pos && !verity) {
 		pagecache_isize_extended(inode, old_size, pos);
-		ext4_block_zero_eof(handle, inode, old_size, pos);
+		ext4_block_zero_eof(inode, old_size, pos);
 	}
 	/*
 	 * Don't mark the inode dirty under folio lock. First, it unnecessarily
@@ -1466,7 +1466,7 @@ static int ext4_journalled_write_end(struct file *file,
 
 	if (old_size < pos && !verity) {
 		pagecache_isize_extended(inode, old_size, pos);
-		ext4_block_zero_eof(handle, inode, old_size, pos);
+		ext4_block_zero_eof(inode, old_size, pos);
 	}
 
 	if (size_changed) {
@@ -3092,7 +3092,7 @@ static int ext4_da_do_write_end(struct address_space *mapping,
 	if (IS_ERR(handle))
 		return PTR_ERR(handle);
 	if (zero_len)
-		ext4_block_zero_eof(handle, inode, old_size, pos);
+		ext4_block_zero_eof(inode, old_size, pos);
 	ext4_mark_inode_dirty(handle, inode);
 	ext4_journal_stop(handle);
 
@@ -4339,16 +4339,23 @@ static int ext4_block_do_zero_range(struct inode *inode, loff_t from,
 	return 0;
 }
 
-static int ext4_block_journalled_zero_range(handle_t *handle,
-		struct inode *inode, loff_t from, loff_t length, bool *did_zero)
+static int ext4_block_journalled_zero_range(struct inode *inode, loff_t from,
+					    loff_t length, bool *did_zero)
 {
 	struct buffer_head *bh;
 	struct folio *folio;
+	handle_t *handle;
 	int err;
 
+	handle = ext4_journal_start(inode, EXT4_HT_MISC, 1);
+	if (IS_ERR(handle))
+		return PTR_ERR(handle);
+
 	bh = ext4_block_get_zero_range(inode, from, length);
-	if (IS_ERR_OR_NULL(bh))
-		return PTR_ERR_OR_ZERO(bh);
+	if (IS_ERR_OR_NULL(bh)) {
+		err = PTR_ERR_OR_ZERO(bh);
+		goto out_handle;
+	}
 	folio = bh->b_folio;
 
 	BUFFER_TRACE(bh, "get write access");
@@ -4369,6 +4376,8 @@ static int ext4_block_journalled_zero_range(handle_t *handle,
 out:
 	folio_unlock(folio);
 	folio_put(folio);
+out_handle:
+	ext4_journal_stop(handle);
 	return err;
 }
 
@@ -4386,7 +4395,7 @@ static int ext4_iomap_zero_range(struct inode *inode, loff_t from,
  * the end of the block it will be shortened to end of the block
  * that corresponds to 'from'
  */
-static int ext4_block_zero_range(handle_t *handle, struct inode *inode,
+static int ext4_block_zero_range(struct inode *inode,
 				 loff_t from, loff_t length, bool *did_zero,
 				 bool *zero_written)
 {
@@ -4405,8 +4414,8 @@ static int
ext4_block_zero_range(handle_t *handle, struct inode *inode, return dax_zero_range(inode, from, length, NULL, &ext4_iomap_ops); } else if (ext4_should_journal_data(inode)) { - return ext4_block_journalled_zero_range(handle, inode, from, - length, did_zero); + return ext4_block_journalled_zero_range(inode, from, length, + did_zero); } else if (ext4_test_inode_state(inode, EXT4_STATE_BUFFERED_IOMAP)) { return ext4_iomap_zero_range(inode, from, length, did_zero); } @@ -4421,8 +4430,7 @@ static int ext4_block_zero_range(handle_t *handle, struct inode *inode, * to physically zero the tail end of that block so it doesn't yield old * data if the file is grown. Return the zeroed length on success. */ -int ext4_block_zero_eof(handle_t *handle, struct inode *inode, - loff_t from, loff_t end) +int ext4_block_zero_eof(struct inode *inode, loff_t from, loff_t end) { unsigned int blocksize = i_blocksize(inode); unsigned int offset; @@ -4441,7 +4449,7 @@ int ext4_block_zero_eof(handle_t *handle, struct inode *inode, if (length > blocksize - offset) length = blocksize - offset; - err = ext4_block_zero_range(handle, inode, from, length, + err = ext4_block_zero_range(inode, from, length, &did_zero, &zero_written); if (err) return err; @@ -4453,7 +4461,14 @@ int ext4_block_zero_eof(handle_t *handle, struct inode *inode, */ if (ext4_should_order_data(inode) && did_zero && zero_written && !IS_DAX(inode)) { + handle_t *handle; + + handle = ext4_journal_start(inode, EXT4_HT_MISC, 1); + if (IS_ERR(handle)) + return PTR_ERR(handle); + err = ext4_jbd2_inode_add_write(handle, inode, from, length); + ext4_journal_stop(handle); if (err) return err; } @@ -4461,8 +4476,7 @@ int ext4_block_zero_eof(handle_t *handle, struct inode *inode, return did_zero ? 
length : 0; } -int ext4_zero_partial_blocks(handle_t *handle, struct inode *inode, - loff_t lstart, loff_t length) +int ext4_zero_partial_blocks(struct inode *inode, loff_t lstart, loff_t length) { struct super_block *sb = inode->i_sb; unsigned partial_start, partial_end; @@ -4479,21 +4493,19 @@ int ext4_zero_partial_blocks(handle_t *handle, struct inode *inode, /* Handle partial zero within the single block */ if (start == end && (partial_start || (partial_end != sb->s_blocksize - 1))) { - err = ext4_block_zero_range(handle, inode, lstart, - length, NULL, NULL); + err = ext4_block_zero_range(inode, lstart, length, NULL, NULL); return err; } /* Handle partial zero out on the start of the range */ if (partial_start) { - err = ext4_block_zero_range(handle, inode, lstart, - sb->s_blocksize, NULL, NULL); + err = ext4_block_zero_range(inode, lstart, sb->s_blocksize, + NULL, NULL); if (err) return err; } /* Handle partial zero out on the end of the range */ if (partial_end != sb->s_blocksize - 1) - err = ext4_block_zero_range(handle, inode, - byte_end - partial_end, + err = ext4_block_zero_range(inode, byte_end - partial_end, partial_end + 1, NULL, NULL); return err; } @@ -4728,8 +4740,7 @@ int ext4_punch_hole(struct file *file, loff_t offset, loff_t length) goto out_dio; } - ret = ext4_zero_partial_blocks(handle, inode, offset, - length); + ret = ext4_zero_partial_blocks(inode, offset, length); if (ret) goto out_stop; @@ -4866,7 +4877,7 @@ int ext4_truncate(struct inode *inode) if (err) goto out_trace; - zero_len = ext4_block_zero_eof(NULL, inode, inode->i_size, LLONG_MAX); + zero_len = ext4_block_zero_eof(inode, inode->i_size, LLONG_MAX); if (zero_len < 0) { err = zero_len; goto out_trace; @@ -6239,8 +6250,8 @@ int ext4_setattr(struct mnt_idmap *idmap, struct dentry *dentry, if (!shrink && oldsize & (inode->i_sb->s_blocksize - 1)) { loff_t zero_len; - zero_len = ext4_block_zero_eof(NULL, inode, - oldsize, LLONG_MAX); + zero_len = ext4_block_zero_eof(inode, oldsize, + 
LLONG_MAX); if (zero_len < 0) { error = zero_len; goto out_mmap_sem; -- 2.39.2
From: Zhang Yi <yi.zhang@huawei.com> hulk inclusion category: bugfix bugzilla: https://atomgit.com/openeuler/kernel/issues/8477 -------------------------------- Change ext4_alloc_file_blocks() to accept offset and len in byte granularity instead of block granularity. This allows callers to pass byte offsets and lengths directly, and this prepares for moving the ext4_zero_partial_blocks() call from the while(len) loop for unaligned append writes, where it only needs to be invoked once before doing block allocation. Fixes: 5721968224e0 ("ext4: implement zero_range iomap path") Signed-off-by: Zhang Yi <yi.zhang@huawei.com> Signed-off-by: Yongjian Sun <sunyongjian1@huawei.com> --- fs/ext4/extents.c | 40 +++++++++++++++++----------------------- 1 file changed, 17 insertions(+), 23 deletions(-) diff --git a/fs/ext4/extents.c b/fs/ext4/extents.c index 43899a3728fe..b9dbe6072392 100644 --- a/fs/ext4/extents.c +++ b/fs/ext4/extents.c @@ -4489,43 +4489,44 @@ int ext4_ext_truncate(handle_t *handle, struct inode *inode) return err; } -static int ext4_alloc_file_blocks(struct file *file, ext4_lblk_t offset, - ext4_lblk_t len, loff_t new_size, - int flags) +static int ext4_alloc_file_blocks(struct file *file, loff_t offset, loff_t len, + loff_t new_size, int flags) { struct inode *inode = file_inode(file); handle_t *handle; int ret = 0, ret2 = 0, ret3 = 0; int retries = 0; int depth = 0; + ext4_lblk_t len_lblk; struct ext4_map_blocks map; unsigned int credits; loff_t epos, old_size = i_size_read(inode); + unsigned int blkbits = inode->i_blkbits; BUG_ON(!ext4_test_inode_flag(inode, EXT4_INODE_EXTENTS)); - map.m_lblk = offset; - map.m_len = len; + map.m_lblk = offset >> blkbits; + map.m_len = len_lblk = EXT4_MAX_BLOCKS(len, offset, blkbits); /* * Don't normalize the request if it can fit in one extent so * that it doesn't get unnecessarily split into multiple * extents. 
*/ - if (len <= EXT_UNWRITTEN_MAX_LEN) + if (len_lblk <= EXT_UNWRITTEN_MAX_LEN) flags |= EXT4_GET_BLOCKS_NO_NORMALIZE; /* * credits to insert 1 extent into extent tree */ - credits = ext4_chunk_trans_blocks(inode, len); + credits = ext4_chunk_trans_blocks(inode, len_lblk); depth = ext_depth(inode); retry: - while (len) { + while (len_lblk) { /* * Recalculate credits when extent tree depth changes. */ if (depth != ext_depth(inode)) { - credits = ext4_chunk_trans_blocks(inode, len); + credits = ext4_chunk_trans_blocks(inode, len_lblk); depth = ext_depth(inode); } @@ -4550,7 +4551,7 @@ static int ext4_alloc_file_blocks(struct file *file, ext4_lblk_t offset, */ retries = 0; map.m_lblk += ret; - map.m_len = len = len - ret; + map.m_len = len_lblk = len_lblk - ret; epos = (loff_t)map.m_lblk << inode->i_blkbits; inode_set_ctime_current(inode); if (new_size) { @@ -4651,11 +4652,8 @@ static long ext4_zero_range(struct file *file, loff_t offset, /* Preallocate the range including the unaligned edges */ if (partial_begin || partial_end) { - ret = ext4_alloc_file_blocks(file, - round_down(offset, 1 << blkbits) >> blkbits, - (round_up((offset + len), 1 << blkbits) - - round_down(offset, 1 << blkbits)) >> blkbits, - new_size, flags); + ret = ext4_alloc_file_blocks(file, offset, len, new_size, + flags); if (ret) goto out_mutex; @@ -4702,7 +4700,7 @@ static long ext4_zero_range(struct file *file, loff_t offset, inode_set_mtime_to_ts(inode, inode_set_ctime_current(inode)); - ret = ext4_alloc_file_blocks(file, lblk, max_blocks, new_size, + ret = ext4_alloc_file_blocks(file, start, end - start, new_size, flags); filemap_invalidate_unlock(mapping); if (ret) @@ -4757,11 +4755,8 @@ long ext4_fallocate(struct file *file, int mode, loff_t offset, loff_t len) { struct inode *inode = file_inode(file); loff_t new_size = 0; - unsigned int max_blocks; int ret = 0; int flags; - ext4_lblk_t lblk; - unsigned int blkbits = inode->i_blkbits; /* * Encrypted inodes can't handle collapse range or 
insert @@ -4805,9 +4800,7 @@ long ext4_fallocate(struct file *file, int mode, loff_t offset, loff_t len) goto exit; } trace_ext4_fallocate_enter(inode, offset, len, mode); - lblk = offset >> blkbits; - max_blocks = EXT4_MAX_BLOCKS(len, offset, blkbits); flags = EXT4_GET_BLOCKS_CREATE_UNWRIT_EXT; inode_lock(inode); @@ -4836,7 +4829,7 @@ long ext4_fallocate(struct file *file, int mode, loff_t offset, loff_t len) if (ret) goto out; - ret = ext4_alloc_file_blocks(file, lblk, max_blocks, new_size, flags); + ret = ext4_alloc_file_blocks(file, offset, len, new_size, flags); if (ret) goto out; @@ -4846,7 +4839,8 @@ long ext4_fallocate(struct file *file, int mode, loff_t offset, loff_t len) } out: inode_unlock(inode); - trace_ext4_fallocate_exit(inode, offset, max_blocks, ret); + trace_ext4_fallocate_exit(inode, offset, + EXT4_MAX_BLOCKS(len, offset, inode->i_blkbits), ret); exit: return ret; } -- 2.39.2
From: Zhang Yi <yi.zhang@huawei.com> hulk inclusion category: bugfix bugzilla: https://atomgit.com/openeuler/kernel/issues/8477 -------------------------------- Move ext4_block_zero_eof() and ext4_zero_partial_blocks() calls out of the active handle context, making them independent operations. This is safe because it still ensures data is updated before metadata for data=ordered mode and data=journal mode because we still zero data and ordering data before modifying the metadata. This change is required for iomap infrastructure conversion because the iomap buffered I/O path does not use the same journal infrastructure for partial block zeroing. The lock ordering of folio lock and starting transactions is "folio lock -> transaction start", which is opposite of the current path. Therefore, zeroing partial blocks cannot be performed under the active handle. Fixes: 5721968224e0 ("ext4: implement zero_range iomap path") Signed-off-by: Zhang Yi <yi.zhang@huawei.com> Signed-off-by: Yongjian Sun <sunyongjian1@huawei.com> --- fs/ext4/extents.c | 31 +++++++++++++++---------------- fs/ext4/inode.c | 10 +++++----- 2 files changed, 20 insertions(+), 21 deletions(-) diff --git a/fs/ext4/extents.c b/fs/ext4/extents.c index b9dbe6072392..a35c39cad502 100644 --- a/fs/ext4/extents.c +++ b/fs/ext4/extents.c @@ -4520,6 +4520,13 @@ static int ext4_alloc_file_blocks(struct file *file, loff_t offset, loff_t len, credits = ext4_chunk_trans_blocks(inode, len_lblk); depth = ext_depth(inode); + /* Zero to the end of the block containing i_size */ + if (new_size > old_size) { + ret = ext4_block_zero_eof(inode, old_size, LLONG_MAX); + if (ret < 0) + return ret; + } + retry: while (len_lblk) { /* @@ -4560,10 +4567,8 @@ static int ext4_alloc_file_blocks(struct file *file, loff_t offset, loff_t len, if (ext4_update_inode_size(inode, epos) & 0x1) inode_set_mtime_to_ts(inode, inode_get_ctime(inode)); - if (epos > old_size) { + if (epos > old_size) pagecache_isize_extended(inode, old_size, epos); - 
ext4_block_zero_eof(inode, old_size, epos); - } } ret2 = ext4_mark_inode_dirty(handle, inode); ext4_update_inode_fsync_trans(handle, inode, 1); @@ -4592,7 +4597,6 @@ static long ext4_zero_range(struct file *file, loff_t offset, loff_t new_size = 0; int ret = 0; int flags; - int credits; int partial_begin, partial_end; loff_t start, end; ext4_lblk_t lblk; @@ -4709,14 +4713,12 @@ static long ext4_zero_range(struct file *file, loff_t offset, if (!partial_begin && !partial_end) goto out_mutex; - /* - * In worst case we have to writeout two nonadjacent unwritten - * blocks and update the inode - */ - credits = (2 * ext4_ext_index_trans_blocks(inode, 2)) + 1; - if (ext4_should_journal_data(inode)) - credits += 2; - handle = ext4_journal_start(inode, EXT4_HT_MISC, credits); + /* Zero out partial block at the edges of the range */ + ret = ext4_zero_partial_blocks(inode, offset, len); + if (ret) + goto out_mutex; + + handle = ext4_journal_start(inode, EXT4_HT_MISC, 1); if (IS_ERR(handle)) { ret = PTR_ERR(handle); ext4_std_error(inode->i_sb, ret); @@ -4729,11 +4731,8 @@ static long ext4_zero_range(struct file *file, loff_t offset, ret = ext4_mark_inode_dirty(handle, inode); if (unlikely(ret)) goto out_handle; - /* Zero out partial block at the edges of the range */ - ret = ext4_zero_partial_blocks(inode, offset, len); - if (ret >= 0) - ext4_update_inode_fsync_trans(handle, inode, 1); + ext4_update_inode_fsync_trans(handle, inode, 1); if (file->f_flags & O_SYNC) ext4_handle_sync(handle); diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c index 271a64050b08..94a003d9b77c 100644 --- a/fs/ext4/inode.c +++ b/fs/ext4/inode.c @@ -4729,6 +4729,10 @@ int ext4_punch_hole(struct file *file, loff_t offset, loff_t length) if (ret) goto out_dio; + ret = ext4_zero_partial_blocks(inode, offset, length); + if (ret) + goto out_dio; + if (ext4_test_inode_flag(inode, EXT4_INODE_EXTENTS)) credits = ext4_writepage_trans_blocks(inode); else @@ -4740,10 +4744,6 @@ int ext4_punch_hole(struct file *file, 
loff_t offset, loff_t length) goto out_dio; } - ret = ext4_zero_partial_blocks(inode, offset, length); - if (ret) - goto out_stop; - first_block = (offset + sb->s_blocksize - 1) >> EXT4_BLOCK_SIZE_BITS(sb); stop_block = (offset + length) >> EXT4_BLOCK_SIZE_BITS(sb); @@ -4778,7 +4778,7 @@ int ext4_punch_hole(struct file *file, loff_t offset, loff_t length) ret = ret2; if (ret >= 0) ext4_update_inode_fsync_trans(handle, inode, 1); -out_stop: + ext4_journal_stop(handle); out_dio: filemap_invalidate_unlock(mapping); -- 2.39.2
From: Zhang Yi <yi.zhang@huawei.com> hulk inclusion category: bugfix bugzilla: https://atomgit.com/openeuler/kernel/issues/8477 -------------------------------- In ext4_zero_range() and ext4_punch_hole(), when operating in SYNC mode and zeroing a partial block, only data=journal modes guarantee that the zeroed data is synchronously persisted after the operation completes. For data=ordered/writeback mode and non-journal modes, this guarantee is missing. Introduce a partial_zero parameter to explicitly trigger writeback for all scenarios where a partial block is zeroed, ensuring the zeroed data is durably persisted. Fixes: 5721968224e0 ("ext4: implement zero_range iomap path") Signed-off-by: Zhang Yi <yi.zhang@huawei.com> Signed-off-by: Yongjian Sun <sunyongjian1@huawei.com> --- fs/ext4/ext4.h | 2 +- fs/ext4/extents.c | 9 ++++++++- fs/ext4/inode.c | 19 ++++++++++++++----- 3 files changed, 23 insertions(+), 7 deletions(-) diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h index f33676aac29c..45a90910ecba 100644 --- a/fs/ext4/ext4.h +++ b/fs/ext4/ext4.h @@ -3026,7 +3026,7 @@ extern int ext4_normal_submit_inode_data_buffers(struct jbd2_inode *jinode); extern int ext4_chunk_trans_blocks(struct inode *, int nrblocks); extern int ext4_block_zero_eof(struct inode *inode, loff_t from, loff_t end); extern int ext4_zero_partial_blocks(struct inode *inode, loff_t lstart, - loff_t length); + loff_t length, bool *did_zero); extern vm_fault_t ext4_page_mkwrite(struct vm_fault *vmf); extern qsize_t *ext4_get_reserved_space(struct inode *inode); extern int ext4_get_projid(struct inode *inode, kprojid_t *projid); diff --git a/fs/ext4/extents.c b/fs/ext4/extents.c index a35c39cad502..09477549ba3d 100644 --- a/fs/ext4/extents.c +++ b/fs/ext4/extents.c @@ -4601,6 +4601,7 @@ static long ext4_zero_range(struct file *file, loff_t offset, loff_t start, end; ext4_lblk_t lblk; unsigned int blkbits = inode->i_blkbits; + bool partial_zeroed = false; trace_ext4_zero_range(inode, offset, len, mode); @@ 
-4714,9 +4715,15 @@ static long ext4_zero_range(struct file *file, loff_t offset, goto out_mutex; /* Zero out partial block at the edges of the range */ - ret = ext4_zero_partial_blocks(inode, offset, len); + ret = ext4_zero_partial_blocks(inode, offset, len, &partial_zeroed); if (ret) goto out_mutex; + if (((file->f_flags & O_SYNC) || IS_SYNC(inode)) && partial_zeroed) { + ret = filemap_write_and_wait_range(inode->i_mapping, offset, + offset + len - 1); + if (ret) + goto out_mutex; + } handle = ext4_journal_start(inode, EXT4_HT_MISC, 1); if (IS_ERR(handle)) { diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c index 94a003d9b77c..d4d705d30b7e 100644 --- a/fs/ext4/inode.c +++ b/fs/ext4/inode.c @@ -4476,7 +4476,8 @@ int ext4_block_zero_eof(struct inode *inode, loff_t from, loff_t end) return did_zero ? length : 0; } -int ext4_zero_partial_blocks(struct inode *inode, loff_t lstart, loff_t length) +int ext4_zero_partial_blocks(struct inode *inode, loff_t lstart, loff_t length, + bool *did_zero) { struct super_block *sb = inode->i_sb; unsigned partial_start, partial_end; @@ -4493,20 +4494,21 @@ int ext4_zero_partial_blocks(struct inode *inode, loff_t lstart, loff_t length) /* Handle partial zero within the single block */ if (start == end && (partial_start || (partial_end != sb->s_blocksize - 1))) { - err = ext4_block_zero_range(inode, lstart, length, NULL, NULL); + err = ext4_block_zero_range(inode, lstart, length, did_zero, + NULL); return err; } /* Handle partial zero out on the start of the range */ if (partial_start) { err = ext4_block_zero_range(inode, lstart, sb->s_blocksize, - NULL, NULL); + did_zero, NULL); if (err) return err; } /* Handle partial zero out on the end of the range */ if (partial_end != sb->s_blocksize - 1) err = ext4_block_zero_range(inode, byte_end - partial_end, - partial_end + 1, NULL, NULL); + partial_end + 1, did_zero, NULL); return err; } @@ -4671,6 +4673,7 @@ int ext4_punch_hole(struct file *file, loff_t offset, loff_t length) handle_t 
*handle; unsigned int credits; int ret = 0, ret2 = 0; + bool partial_zeroed = false; trace_ext4_punch_hole(inode, offset, length, 0); @@ -4729,9 +4732,15 @@ int ext4_punch_hole(struct file *file, loff_t offset, loff_t length) if (ret) goto out_dio; - ret = ext4_zero_partial_blocks(inode, offset, length); + ret = ext4_zero_partial_blocks(inode, offset, length, &partial_zeroed); if (ret) goto out_dio; + if (((file->f_flags & O_SYNC) || IS_SYNC(inode)) && partial_zeroed) { + ret = filemap_write_and_wait_range(inode->i_mapping, offset, + offset + length - 1); + if (ret) + goto out_dio; + } if (ext4_test_inode_flag(inode, EXT4_INODE_EXTENTS)) credits = ext4_writepage_trans_blocks(inode); -- 2.39.2
From: Zhang Yi <yi.zhang@huawei.com> hulk inclusion category: bugfix bugzilla: https://atomgit.com/openeuler/kernel/issues/8477 -------------------------------- In the ext4 fallocate call chain, SYNC mode handling is inconsistent: some places check the inode state, while others check the open file descriptor state. Unify these checks by evaluating both conditions to ensure consistent behavior across all fallocate operations. Fixes: 5721968224e0 ("ext4: implement zero_range iomap path") Signed-off-by: Zhang Yi <yi.zhang@huawei.com> Signed-off-by: Yongjian Sun <sunyongjian1@huawei.com> --- fs/ext4/extents.c | 9 +++++---- fs/ext4/inode.c | 2 +- 2 files changed, 6 insertions(+), 5 deletions(-) diff --git a/fs/ext4/extents.c b/fs/ext4/extents.c index 09477549ba3d..becaea5a327e 100644 --- a/fs/ext4/extents.c +++ b/fs/ext4/extents.c @@ -4740,7 +4740,7 @@ static long ext4_zero_range(struct file *file, loff_t offset, goto out_handle; ext4_update_inode_fsync_trans(handle, inode, 1); - if (file->f_flags & O_SYNC) + if ((file->f_flags & O_SYNC) || IS_SYNC(inode)) ext4_handle_sync(handle); out_handle: @@ -4839,7 +4839,8 @@ long ext4_fallocate(struct file *file, int mode, loff_t offset, loff_t len) if (ret) goto out; - if (file->f_flags & O_SYNC && EXT4_SB(inode->i_sb)->s_journal) { + if (((file->f_flags & O_SYNC) || IS_SYNC(inode)) && + EXT4_SB(inode->i_sb)->s_journal) { ret = ext4_fc_commit(EXT4_SB(inode->i_sb)->s_journal, EXT4_I(inode)->i_sync_tid); } @@ -5455,7 +5456,7 @@ static int ext4_collapse_range(struct file *file, loff_t offset, loff_t len) EXT4_I(inode)->i_disksize = new_size; up_write(&EXT4_I(inode)->i_data_sem); - if (IS_SYNC(inode)) + if ((file->f_flags & O_SYNC) || IS_SYNC(inode)) ext4_handle_sync(handle); inode_set_mtime_to_ts(inode, inode_set_ctime_current(inode)); ret = ext4_mark_inode_dirty(handle, inode); @@ -5623,7 +5624,7 @@ static int ext4_insert_range(struct file *file, loff_t offset, loff_t len) max(ee_start_lblk, offset_lblk), len_lblk, SHIFT_RIGHT); 
up_write(&EXT4_I(inode)->i_data_sem); - if (IS_SYNC(inode)) + if ((file->f_flags & O_SYNC) || IS_SYNC(inode)) ext4_handle_sync(handle); if (ret >= 0) ext4_update_inode_fsync_trans(handle, inode, 1); diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c index d4d705d30b7e..da5de8561a6d 100644 --- a/fs/ext4/inode.c +++ b/fs/ext4/inode.c @@ -4778,7 +4778,7 @@ int ext4_punch_hole(struct file *file, loff_t offset, loff_t length) up_write(&EXT4_I(inode)->i_data_sem); } ext4_fc_track_range(handle, inode, first_block, stop_block); - if (IS_SYNC(inode)) + if ((file->f_flags & O_SYNC) || IS_SYNC(inode)) ext4_handle_sync(handle); inode_set_mtime_to_ts(inode, inode_set_ctime_current(inode)); -- 2.39.2
From: Zhang Yi <yi.zhang@huawei.com> hulk inclusion category: bugfix bugzilla: https://atomgit.com/openeuler/kernel/issues/8477 -------------------------------- The ctime and mtime update is already handled by file_modified() in ext4_fallocate(), the caller of ext4_alloc_file_blocks(). So remove the redundant calls to inode_set_ctime_current() and inode_set_mtime_to_ts() in ext4_alloc_file_blocks(). Fixes: 5721968224e0 ("ext4: implement zero_range iomap path") Signed-off-by: Zhang Yi <yi.zhang@huawei.com> Signed-off-by: Yongjian Sun <sunyongjian1@huawei.com> --- fs/ext4/extents.c | 5 +---- 1 file changed, 1 insertion(+), 4 deletions(-) diff --git a/fs/ext4/extents.c b/fs/ext4/extents.c index becaea5a327e..c992399b5106 100644 --- a/fs/ext4/extents.c +++ b/fs/ext4/extents.c @@ -4560,13 +4560,10 @@ static int ext4_alloc_file_blocks(struct file *file, loff_t offset, loff_t len, map.m_lblk += ret; map.m_len = len_lblk = len_lblk - ret; epos = (loff_t)map.m_lblk << inode->i_blkbits; - inode_set_ctime_current(inode); if (new_size) { if (epos > new_size) epos = new_size; - if (ext4_update_inode_size(inode, epos) & 0x1) - inode_set_mtime_to_ts(inode, - inode_get_ctime(inode)); + ext4_update_inode_size(inode, epos); if (epos > old_size) pagecache_isize_extended(inode, old_size, epos); } -- 2.39.2
From: Zhang Yi <yi.zhang@huawei.com> hulk inclusion category: bugfix bugzilla: https://atomgit.com/openeuler/kernel/issues/8477 -------------------------------- In ext4_alloc_file_blocks(), pagecache_isize_extended() is called under an active handle and may also hold folio lock if the block size is smaller than the folio size. This also breaks the "folio lock -> transaction start" lock ordering for the upcoming iomap buffered I/O path. Therefore, move pagecache_isize_extended() outside of an active handle. Additionally, it is unnecessary to update the file length during each iteration of the allocation loop. Instead, update the file length only to the position where the allocation is successful. Postpone updating the inode size until after the allocation loop completes or is interrupted due to an error. Fixes: 5721968224e0 ("ext4: implement zero_range iomap path") Signed-off-by: Zhang Yi <yi.zhang@huawei.com> Signed-off-by: Yongjian Sun <sunyongjian1@huawei.com> --- fs/ext4/extents.c | 50 +++++++++++++++++++++++++++++++---------------- 1 file changed, 33 insertions(+), 17 deletions(-) diff --git a/fs/ext4/extents.c b/fs/ext4/extents.c index c992399b5106..dbc2154f7d4e 100644 --- a/fs/ext4/extents.c +++ b/fs/ext4/extents.c @@ -4500,7 +4500,7 @@ static int ext4_alloc_file_blocks(struct file *file, loff_t offset, loff_t len, ext4_lblk_t len_lblk; struct ext4_map_blocks map; unsigned int credits; - loff_t epos, old_size = i_size_read(inode); + loff_t epos = 0, old_size = i_size_read(inode); unsigned int blkbits = inode->i_blkbits; BUG_ON(!ext4_test_inode_flag(inode, EXT4_INODE_EXTENTS)); @@ -4553,31 +4553,47 @@ static int ext4_alloc_file_blocks(struct file *file, loff_t offset, loff_t len, ext4_journal_stop(handle); break; } + ext4_update_inode_fsync_trans(handle, inode, 1); + ret = ext4_journal_stop(handle); + if (unlikely(ret)) + break; + /* * allow a full retry cycle for any remaining allocations */ retries = 0; - map.m_lblk += ret; - map.m_len = len_lblk = len_lblk 
- ret; + map.m_lblk += map.m_len; + map.m_len = len_lblk = len_lblk - map.m_len; epos = (loff_t)map.m_lblk << inode->i_blkbits; - if (new_size) { - if (epos > new_size) - epos = new_size; - ext4_update_inode_size(inode, epos); - if (epos > old_size) - pagecache_isize_extended(inode, old_size, epos); - } - ret2 = ext4_mark_inode_dirty(handle, inode); - ext4_update_inode_fsync_trans(handle, inode, 1); - ret3 = ext4_journal_stop(handle); - ret2 = ret3 ? ret3 : ret2; - if (unlikely(ret2)) - break; } + if (ret == -ENOSPC && ext4_should_retry_alloc(inode->i_sb, &retries)) goto retry; - return ret > 0 ? ret2 : ret; + if (!epos || !new_size) + return ret; + + /* + * Allocate blocks, update the file size to match the size of the + * already successfully allocated blocks. + */ + if (epos > new_size) + epos = new_size; + + handle = ext4_journal_start(inode, EXT4_HT_MISC, 1); + if (IS_ERR(handle)) + return ret ? ret : PTR_ERR(handle); + + ext4_update_inode_size(inode, epos); + ret2 = ext4_mark_inode_dirty(handle, inode); + ext4_update_inode_fsync_trans(handle, inode, 1); + ret3 = ext4_journal_stop(handle); + ret2 = ret3 ? ret3 : ret2; + + if (epos > old_size) + pagecache_isize_extended(inode, old_size, epos); + + return ret ? ret : ret2; } static int ext4_collapse_range(struct file *file, loff_t offset, loff_t len); -- 2.39.2
From: Zhang Yi <yi.zhang@huawei.com>

hulk inclusion
category: bugfix
bugzilla: https://atomgit.com/openeuler/kernel/issues/8477

--------------------------------

For appending writes beyond the end of file (EOF), ext4_block_zero_eof()
should be called within ext4_iomap_buffered_write() and
ext4_iomap_write_end() to zero out the partial block beyond the EOF.
This prevents exposing stale data that might have been written through
mmap.

Fixes: 5721968224e0 ("ext4: implement zero_range iomap path")
Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
Signed-off-by: Yongjian Sun <sunyongjian1@huawei.com>
---
 fs/ext4/file.c  | 17 +++++++++++++++++
 fs/ext4/inode.c |  4 +++-
 2 files changed, 20 insertions(+), 1 deletion(-)

diff --git a/fs/ext4/file.c b/fs/ext4/file.c
index b8440444d997..e3cbb520f8bc 100644
--- a/fs/ext4/file.c
+++ b/fs/ext4/file.c
@@ -291,6 +291,23 @@ static ssize_t ext4_iomap_buffered_write(struct kiocb *iocb,
 {
 	struct inode *inode = file_inode(iocb->ki_filp);
 	const struct iomap_ops *iomap_ops;
+	unsigned int blocksize = i_blocksize(inode);
+	loff_t old_size = i_size_read(inode);
+	int ret;
+
+	/*
+	 * If the position is beyond the EOF, it is necessary to zero out
+	 * the partial block beyond the existing EOF, as it may contain
+	 * stale data written through mmap.
+	 */
+	if (iocb->ki_pos > old_size && (old_size & (blocksize - 1))) {
+		loff_t end = round_up(old_size, blocksize);
+		if (iocb->ki_pos < end)
+			end = iocb->ki_pos;
+		ret = ext4_block_zero_eof(inode, old_size, end);
+		if (ret < 0)
+			return ret;
+	}
 
 	if (test_opt(inode->i_sb, DELALLOC) && !ext4_nonda_switch(inode->i_sb))
 		iomap_ops = &ext4_iomap_buffered_da_write_ops;
diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index da5de8561a6d..a78245878298 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -4085,8 +4085,10 @@ static int ext4_iomap_write_end(struct file *file,
 	folio_unlock(folio);
 	folio_put(folio);
 
-	if (old_size < pos)
+	if (old_size < pos) {
 		pagecache_isize_extended(inode, old_size, pos);
+		ext4_block_zero_eof(inode, old_size, pos);
+	}
 
 	/*
 	 * For delalloc, if we have pre-allocated more blocks and copied
-- 
2.39.2
FeedBack: The patch(es) you sent to the kernel@openeuler.org mailing list have been converted to a pull request successfully! Pull request link: https://atomgit.com/openeuler/kernel/merge_requests/21441 Mailing list address: https://mailweb.openeuler.org/archives/list/kernel@openeuler.org/message/PSY...