hulk inclusion
category: bugfix
bugzilla: https://gitee.com/openeuler/kernel/issues/IB3O3K
CVE: NA
--------------------------------
During concurrent append writes to an XFS filesystem, zero padding data may appear in the file after a power failure. This happens because the disk size update performed when handling write completion is imprecise.
Consider this scenario with concurrent append writes to the same file:
Thread 1:                  Thread 2:
------------               -----------
write [A, A+B]
update inode size to A+B
submit I/O [A, A+BS]
                           write [A+B, A+B+C]
                           update inode size to A+B+C
<I/O completes, updates disk size to min(A+B+C, A+BS)>
<power failure>
After reboot:

1) with A+B+C < A+BS, the file has zero padding in range [A+B, A+B+C]

   |<       Block Size (BS)        >|
   |DDDDDDDDDDDDDDDD0000000000000000|
   ^               ^      ^
   A              A+B   A+B+C
                        (EOF)

2) with A+B+C > A+BS, the file has zero padding in range [A+B, A+BS]

   |<       Block Size (BS)        >|<       Block Size (BS)        >|
   |DDDDDDDDDDDDDDDD0000000000000000|00000000000000000000000000000000|
   ^               ^                ^                                ^
   A              A+B              A+BS                          A+B+C
                                                                  (EOF)

D = Valid Data
0 = Zero Padding
The issue stems from the disk size being set to min(io_offset + io_size, inode->i_size) at I/O completion. Since io_offset + io_size is block-size granular, it may exceed the size of the valid file data. With concurrent append writes, inode->i_size may also be larger than the range of valid file data actually written to disk, so the disk size update can be inaccurate.
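For clarity, the pre-patch completion-time update boils down to the min() sketched below. This is a minimal standalone model of the behaviour described above, not the actual XFS completion code (the helper name and the userspace loff_t typedef are illustrative only):

#include <stdint.h>

typedef int64_t loff_t;

/*
 * Pre-patch behaviour: the new on-disk size is min(io_offset + io_size,
 * i_size).  io_offset + io_size is block-granular, and i_size may already
 * cover a racing append whose data never reached the disk, so the result
 * can overshoot the range of valid on-disk data.
 */
static loff_t disk_size_after_completion(loff_t io_offset, loff_t io_size,
					  loff_t i_size)
{
	loff_t io_end = io_offset + io_size;

	return io_end < i_size ? io_end : i_size;
}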
This patch modifies the meaning of io_size to represent the size of valid data within EOF in an ioend. If the ioend spans beyond i_size, io_size will be trimmed to provide the file with more accurate size information. This is particularly useful for on-disk size updates at completion time.
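As a worked example (the numbers BS, A, B and C below are illustrative assumptions, not taken from the patch), the clamp makes the completion-time update record exactly the amount of data that was actually in the submitted I/O:

#include <stdio.h>

/*
 * Illustrative numbers: a block-sized ioend [A, A+BS) is built while
 * i_size is A+B, then a racing append pushes i_size to A+B+C before
 * the I/O completes.
 */
int main(void)
{
	long BS = 4096, A = 0, B = 2048, C = 1024;

	long io_offset = A, io_size = BS;	/* block-granular ioend */
	long isize = A + B;			/* in-core EOF at submission time */

	/* The patch's clamp: trim io_size when the ioend spans i_size. */
	if (io_offset < isize && io_offset + io_size > isize)
		io_size = isize - io_offset;

	/* Completion: on-disk size = min(io_offset + io_size, current i_size). */
	long isize_now = A + B + C;
	long io_end = io_offset + io_size;
	long disk_size = io_end < isize_now ? io_end : isize_now;

	/*
	 * Prints 2048, the exact amount of valid data submitted, instead of
	 * 3072 (min(A+BS, A+B+C)) as before the change.
	 */
	printf("on-disk size: %ld\n", disk_size);
	return 0;
}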
After this change, ioends that span i_size will not grow or merge with other ioends in concurrent scenarios. However, the cases that would benefit from growth/merging rarely occur, and no noticeable performance impact has been observed. Although rounding up io_size could re-enable ioend growth/merging in these scenarios, we decided to keep the code simple after discussion [1].
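For reference, growth and merging are gated by contiguity checks along the lines of the simplified sketch below (ioend_can_grow/ioend_can_merge are illustrative names; the real checks live in iomap_can_add_to_ioend() and iomap_ioend_can_merge() and also compare flags, type and bio status). Once io_size is trimmed to i_size, io_offset + io_size no longer lands on the next block boundary, so an EOF-spanning ioend fails both checks:

#include <linux/iomap.h>

/* Simplified: can the block starting at @offset join the current ioend? */
static bool ioend_can_grow(const struct iomap_ioend *ioend, loff_t offset)
{
	return offset == ioend->io_offset + ioend->io_size;
}

/* Simplified: can two completed ioends be merged for completion? */
static bool ioend_can_merge(const struct iomap_ioend *ioend,
			    const struct iomap_ioend *next)
{
	return ioend->io_offset + ioend->io_size == next->io_offset;
}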
Another benefit is that it makes the xfs_ioend_is_append() check more accurate, which can avoid unnecessary xfs_end_bio() end-I/O callbacks in certain scenarios, such as repeated writes at the file tail that do not extend the file size.
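The check in question is essentially the comparison below, as found in fs/xfs/xfs_aops.c (a sketch, not self-contained; the on-disk size field is i_d.di_size in older kernels and i_disk_size in newer ones, so the exact spelling depends on the target branch):

/*
 * An ioend needs completion-time size-update work only if it reaches
 * beyond the current on-disk size.  With io_size trimmed to the in-core
 * EOF, rewriting already-written blocks at the file tail no longer looks
 * like an append, so such ioends need not be routed through xfs_end_bio()
 * and the XFS completion workqueue.
 */
static inline bool xfs_ioend_is_append(struct iomap_ioend *ioend)
{
	return ioend->io_offset + ioend->io_size >
			XFS_I(ioend->io_inode)->i_d.di_size;
}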
Fixes: ae259a9c8593 ("fs: introduce iomap infrastructure") # goes further back than this
Link [1]: https://patchwork.kernel.org/project/xfs/patch/20241113091907.56937-1-leo.li...
Signed-off-by: Long Li <leo.lilong@huawei.com>
---
 fs/iomap/buffered-io.c | 47 ++++++++++++++++++++++++++++++++++++++++++
 include/linux/iomap.h  |  2 +-
 2 files changed, 48 insertions(+), 1 deletion(-)
diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
index 0bb3257cba42..95e787f9e694 100644
--- a/fs/iomap/buffered-io.c
+++ b/fs/iomap/buffered-io.c
@@ -1409,6 +1409,7 @@ iomap_add_to_ioend(struct inode *inode, loff_t offset, struct page *page,
 	unsigned len = i_blocksize(inode);
 	unsigned poff = offset & (PAGE_SIZE - 1);
 	bool merged, same_page = false;
+	loff_t isize = i_size_read(inode);
 	if (!wpc->ioend || !iomap_can_add_to_ioend(wpc, offset, sector)) {
 		if (wpc->ioend)
@@ -1429,7 +1430,53 @@ iomap_add_to_ioend(struct inode *inode, loff_t offset, struct page *page,
 		bio_add_page(wpc->ioend->io_bio, page, len, poff);
 	}
+	/*
+	 * Clamp io_offset and io_size to the incore EOF so that ondisk
+	 * file size updates in the ioend completion are byte-accurate.
+	 * This avoids recovering files with zeroed tail regions when
+	 * writeback races with appending writes:
+	 *
+	 * Thread 1:                  Thread 2:
+	 * ------------               -----------
+	 * write [A, A+B]
+	 * update inode size to A+B
+	 * submit I/O [A, A+BS]
+	 *                            write [A+B, A+B+C]
+	 *                            update inode size to A+B+C
+	 * <I/O completes, updates disk size to min(A+B+C, A+BS)>
+	 * <power failure>
+	 *
+	 * After reboot:
+	 * 1) with A+B+C < A+BS, the file has zero padding in range
+	 *    [A+B, A+B+C]
+	 *
+	 *    |<    Block Size (BS)    >|
+	 *    |DDDDDDDDDDDD0000000000000|
+	 *    ^           ^      ^
+	 *    A          A+B   A+B+C
+	 *                     (EOF)
+	 *
+	 * 2) with A+B+C > A+BS, the file has zero padding in range
+	 *    [A+B, A+BS]
+	 *
+	 *    |<    Block Size (BS)    >|<    Block Size (BS)     >|
+	 *    |DDDDDDDDDDDD0000000000000|00000000000000000000000000|
+	 *    ^           ^            ^                           ^
+	 *    A          A+B         A+BS                      A+B+C
+	 *                                                      (EOF)
+	 *
+	 * D = Valid Data
+	 * 0 = Zero Padding
+	 *
+	 * Note that this defeats the ability to chain the ioends of
+	 * appending writes. Writeback beyond the EOF block may occur in
+	 * concurrent scenarios (e.g. racing with truncate) and io_size
+	 * should not be trimmed in such cases.
+	 */
 	wpc->ioend->io_size += len;
+	if (offset < isize && offset + len > isize)
+		wpc->ioend->io_size = isize - wpc->ioend->io_offset;
+
 	wbc_account_cgroup_owner(wbc, page, len);
 }
diff --git a/include/linux/iomap.h b/include/linux/iomap.h
index 1b6e22741d43..ff3473c134b3 100644
--- a/include/linux/iomap.h
+++ b/include/linux/iomap.h
@@ -229,7 +229,7 @@ struct iomap_ioend {
 	u16		io_flags;	/* IOMAP_F_* */
 	u32		io_folios;	/* folios added to ioend */
 	struct inode	*io_inode;	/* file being written to */
-	size_t		io_size;	/* size of the extent */
+	size_t		io_size;	/* size of data within eof */
 	loff_t		io_offset;	/* offset in the file */
 	void		*io_private;	/* file system private data */
 	sector_t	io_sector;	/* start sector of ioend */