[PATCH OLK-6.6 v2 00/36] fs: LBS and uncached buffer I/O support

1. Patches 1~10 add large block size (LBS) support for xfs.
2. Patches 11~36 add uncached buffer I/O support for xfs/ext4.

Christian Brauner (1):
  fs: claw back a few FMODE_* bits

Christoph Hellwig (1):
  xfs: drop fop_flags for directories

Dave Chinner (1):
  xfs: use kvmalloc for xattr buffers

Jens Axboe (19):
  mm/filemap: change filemap_create_folio() to take a struct kiocb
  mm/filemap: use page_cache_sync_ra() to kick off read-ahead
  mm/readahead: add folio allocation helper
  mm: add PG_dropbehind folio flag
  mm/readahead: add readahead_control->dropbehind member
  mm/truncate: add folio_unmap_invalidate() helper
  fs: add RWF_DONTCACHE iocb and FOP_DONTCACHE file_operations flag
  mm/filemap: add read support for RWF_DONTCACHE
  mm/filemap: drop streaming/uncached pages when writeback completes
  mm/filemap: add filemap_fdatawrite_range_kick() helper
  mm: call filemap_fdatawrite_range_kick() after IOCB_DONTCACHE issue
  mm: add FGP_DONTCACHE folio creation flag
  mm/filemap: gate dropbehind invalidate on folio !dirty && !writeback
  mm/filemap: use filemap_end_dropbehind() for read invalidation
  mm/filemap: unify read/write dropbehind naming
  mm/filemap: unify dropbehind flag testing and clearing
  iomap: make buffered writes work with RWF_DONTCACHE
  iomap: don't lose folio dropbehind state for overwrites
  xfs: flag as supporting FOP_DONTCACHE

Jingbo Xu (2):
  mm/truncate: don't skip dirty page in folio_unmap_invalidate()
  mm/filemap: fix miscalculated file range for filemap_fdatawrite_range_kick()

Long Li (3):
  fs: fix kabi breakage in struct file_operations
  mm/readahead: fix kabi breakage in struct readahead_control
  ext4: flag as supporting FOP_DONTCACHE

Luis Chamberlain (1):
  mm: split a folio in minimum folio order chunks

Pankaj Raghav (8):
  filemap: allocate mapping_min_order folios in the page cache
  readahead: allocate folios with mapping_min_order in readahead
  filemap: cap PTE range to be created to allowed zero fill in folio_map_range()
  iomap: fix iomap_dio_zero() for fs bs > system page size
  xfs: expose block size in stat
  xfs: make the calculation generic in xfs_sb_validate_fsb_count()
  mm: don't set readahead flag on a folio when lookahead_size > nr_to_read
  xfs: enable block size larger than page size support

 block/fops.c                   |   3 +-
 drivers/dax/device.c           |   2 +-
 fs/btrfs/file.c                |   4 +-
 fs/ext4/file.c                 |  14 +++-
 fs/ext4/inode.c                |   3 +-
 fs/f2fs/file.c                 |   3 +-
 fs/iomap/buffered-io.c         |  11 ++-
 fs/iomap/direct-io.c           |  45 +++++++++--
 fs/read_write.c                |   2 +-
 fs/xfs/libxfs/xfs_attr_leaf.c  |  16 ++--
 fs/xfs/libxfs/xfs_ialloc.c     |   5 ++
 fs/xfs/libxfs/xfs_shared.h     |   3 +
 fs/xfs/xfs_aops.c              |   3 +-
 fs/xfs/xfs_file.c              |   6 +-
 fs/xfs/xfs_icache.c            |   6 +-
 fs/xfs/xfs_iops.c              |   2 +-
 fs/xfs/xfs_mount.c             |   8 +-
 fs/xfs/xfs_super.c             |  28 +++++--
 include/linux/fs.h             |  41 +++++++---
 include/linux/huge_mm.h        |  28 ++++++-
 include/linux/iomap.h          |   2 +
 include/linux/page-flags.h     |   5 ++
 include/linux/pagemap.h        |  37 ++++++++-
 include/trace/events/mmflags.h |   3 +-
 include/uapi/linux/fs.h        |   5 +-
 io_uring/io_uring.c            |   2 +-
 io_uring/rw.c                  |   9 ++-
 mm/filemap.c                   | 134 ++++++++++++++++++++++++++++-----
 mm/huge_memory.c               |  66 +++++++++++++++-
 mm/internal.h                  |   2 +
 mm/mmap.c                      |   4 +-
 mm/readahead.c                 |  96 ++++++++++++++++++-----
 mm/swap.c                      |   2 +
 mm/truncate.c                  |  53 ++++++-------

 34 files changed, 519 insertions(+), 134 deletions(-)

-- 
2.39.2

From: Pankaj Raghav <p.raghav@samsung.com> mainline inclusion from mainline-v6.12-rc1 commit ab95d23bab220ef845c0d422f49452a475330eaf category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/ICKJ63 CVE: NA Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i... -------------------------------- filemap_create_folio() and do_read_cache_folio() were always allocating folio of order 0. __filemap_get_folio was trying to allocate higher order folios when fgp_flags had higher order hint set but it will default to order 0 folio if higher order memory allocation fails. Supporting mapping_min_order implies that we guarantee each folio in the page cache has at least an order of mapping_min_order. When adding new folios to the page cache we must also ensure the index used is aligned to the mapping_min_order as the page cache requires the index to be aligned to the order of the folio. Co-developed-by: Luis Chamberlain <mcgrof@kernel.org> Signed-off-by: Luis Chamberlain <mcgrof@kernel.org> Signed-off-by: Pankaj Raghav <p.raghav@samsung.com> Link: https://lore.kernel.org/r/20240822135018.1931258-3-kernel@pankajraghav.com Tested-by: David Howells <dhowells@redhat.com> Reviewed-by: Hannes Reinecke <hare@suse.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: Daniel Gomez <da.gomez@samsung.com> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Christian Brauner <brauner@kernel.org> Conflicts: include/linux/pagemap.h [Context conflicts.] Signed-off-by: Jiacheng Yu <yujiacheng3@huawei.com> Signed-off-by: Long Li <leo.lilong@huawei.com> --- include/linux/pagemap.h | 20 ++++++++++++++++++++ mm/filemap.c | 24 ++++++++++++++++-------- 2 files changed, 36 insertions(+), 8 deletions(-) diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h index 624712273947..21b6dc122249 100644 --- a/include/linux/pagemap.h +++ b/include/linux/pagemap.h @@ -431,6 +431,26 @@ mapping_min_folio_order(const struct address_space *mapping) return (mapping->flags & AS_FOLIO_ORDER_MIN_MASK) >> AS_FOLIO_ORDER_MIN; } +static inline unsigned long +mapping_min_folio_nrpages(struct address_space *mapping) +{ + return 1UL << mapping_min_folio_order(mapping); +} + +/** + * mapping_align_index() - Align index for this mapping. + * @mapping: The address_space. + * + * The index of a folio must be naturally aligned. If you are adding a + * new folio to the page cache and need to know what index to give it, + * call this function. + */ +static inline pgoff_t mapping_align_index(struct address_space *mapping, + pgoff_t index) +{ + return round_down(index, mapping_min_folio_nrpages(mapping)); +} + /** * mapping_clear_large_folios() - The file disable supports large folios. * @mapping: The file. 
diff --git a/mm/filemap.c b/mm/filemap.c index 905ebca8670e..d5cb0fe7d83a 100644 --- a/mm/filemap.c +++ b/mm/filemap.c @@ -858,6 +858,8 @@ noinline int __filemap_add_folio(struct address_space *mapping, VM_BUG_ON_FOLIO(!folio_test_locked(folio), folio); VM_BUG_ON_FOLIO(folio_test_swapbacked(folio), folio); + VM_BUG_ON_FOLIO(folio_order(folio) < mapping_min_folio_order(mapping), + folio); mapping_set_update(&xas, mapping); VM_BUG_ON_FOLIO(index & (folio_nr_pages(folio) - 1), folio); @@ -1909,8 +1911,10 @@ struct folio *__filemap_get_folio(struct address_space *mapping, pgoff_t index, folio_wait_stable(folio); no_page: if (!folio && (fgp_flags & FGP_CREAT)) { - unsigned order = FGF_GET_ORDER(fgp_flags); + unsigned int min_order = mapping_min_folio_order(mapping); + unsigned int order = max(min_order, FGF_GET_ORDER(fgp_flags)); int err; + index = mapping_align_index(mapping, index); if ((fgp_flags & FGP_WRITE) && mapping_can_writeback(mapping)) gfp |= __GFP_WRITE; @@ -1939,7 +1943,7 @@ struct folio *__filemap_get_folio(struct address_space *mapping, pgoff_t index, gfp_t alloc_gfp = gfp; err = -ENOMEM; - if (order > 0) + if (order > min_order) alloc_gfp |= __GFP_NORETRY | __GFP_NOWARN; folio = filemap_alloc_folio(alloc_gfp, order); if (!folio) @@ -1954,7 +1958,7 @@ struct folio *__filemap_get_folio(struct address_space *mapping, pgoff_t index, break; folio_put(folio); folio = NULL; - } while (order-- > 0); + } while (order-- > min_order); if (err == -EEXIST) goto repeat; @@ -2486,13 +2490,15 @@ static int filemap_update_page(struct kiocb *iocb, } static int filemap_create_folio(struct file *file, - struct address_space *mapping, pgoff_t index, + struct address_space *mapping, loff_t pos, struct folio_batch *fbatch) { struct folio *folio; int error; + unsigned int min_order = mapping_min_folio_order(mapping); + pgoff_t index; - folio = filemap_alloc_folio(mapping_gfp_mask(mapping), 0); + folio = filemap_alloc_folio(mapping_gfp_mask(mapping), min_order); if (!folio) return -ENOMEM; @@ -2510,6 +2516,7 @@ static int filemap_create_folio(struct file *file, * well to keep locking rules simple. */ filemap_invalidate_lock_shared(mapping); + index = (pos >> (PAGE_SHIFT + min_order)) << min_order; error = filemap_add_folio(mapping, folio, index, mapping_gfp_constraint(mapping, GFP_KERNEL)); if (error == -EEXIST) @@ -2570,8 +2577,7 @@ static int filemap_get_pages(struct kiocb *iocb, size_t count, if (!folio_batch_count(fbatch)) { if (iocb->ki_flags & (IOCB_NOWAIT | IOCB_WAITQ)) return -EAGAIN; - err = filemap_create_folio(filp, mapping, - iocb->ki_pos >> PAGE_SHIFT, fbatch); + err = filemap_create_folio(filp, mapping, iocb->ki_pos, fbatch); if (err == AOP_TRUNCATED_PAGE) goto retry; return err; @@ -3818,9 +3824,11 @@ static struct folio *do_read_cache_folio(struct address_space *mapping, repeat: folio = filemap_get_folio(mapping, index); if (IS_ERR(folio)) { - folio = filemap_alloc_folio(gfp, 0); + folio = filemap_alloc_folio(gfp, + mapping_min_folio_order(mapping)); if (!folio) return ERR_PTR(-ENOMEM); + index = mapping_align_index(mapping, index); err = filemap_add_folio(mapping, folio, index, gfp); if (unlikely(err)) { folio_put(folio); -- 2.39.2
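
As a quick illustration of the alignment rule this patch adds, here is a stand-alone user-space sketch; the order and index values are made up, and round_down_ul() stands in for the kernel's round_down():

#include <stdio.h>

static unsigned long round_down_ul(unsigned long x, unsigned long a)
{
	return x & ~(a - 1);		/* a must be a power of two */
}

int main(void)
{
	unsigned int min_order = 2;			/* e.g. 16k blocks on 4k pages */
	unsigned long min_nrpages = 1UL << min_order;	/* 4 pages per folio */
	unsigned long index = 7;			/* page index being looked up */

	/* mapping_align_index() rounds the index down to a folio boundary */
	printf("index %lu -> aligned index %lu\n",
	       index, round_down_ul(index, min_nrpages));	/* 7 -> 4 */
	return 0;
}

Both __filemap_get_folio() and do_read_cache_folio() apply this rounding before filemap_add_folio(), so the min_order folio added at the aligned index still covers the index that was originally requested.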

From: Pankaj Raghav <p.raghav@samsung.com> mainline inclusion from mainline-v6.12-rc1 commit 26cfdb395eefdb2d34e51184d8466ee04a1618d5 category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/ICKJ63 CVE: NA Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i... -------------------------------- page_cache_ra_unbounded() was allocating single pages (0 order folios) if there was no folio found in an index. Allocate mapping_min_order folios as we need to guarantee the minimum order if it is set. page_cache_ra_order() tries to allocate folio to the page cache with a higher order if the index aligns with that order. Modify it so that the order does not go below the mapping_min_order requirement of the page cache. This function will do the right thing even if the new_order passed is less than the mapping_min_order. When adding new folios to the page cache we must also ensure the index used is aligned to the mapping_min_order as the page cache requires the index to be aligned to the order of the folio. readahead_expand() is called from readahead aops to extend the range of the readahead so this function can assume ractl->_index to be aligned with min_order. Signed-off-by: Pankaj Raghav <p.raghav@samsung.com> Co-developed-by: Hannes Reinecke <hare@suse.de> Signed-off-by: Hannes Reinecke <hare@suse.de> Link: https://lore.kernel.org/r/20240822135018.1931258-4-kernel@pankajraghav.com Tested-by: David Howells <dhowells@redhat.com> Acked-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Daniel Gomez <da.gomez@samsung.com> Reviewed-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Hannes Reinecke <hare@suse.de> Signed-off-by: Christian Brauner <brauner@kernel.org> Conflicts: mm/readahead.c [Context conflicts.] Signed-off-by: Jiacheng Yu <yujiacheng3@huawei.com> Signed-off-by: Long Li <leo.lilong@huawei.com> --- mm/readahead.c | 79 ++++++++++++++++++++++++++++++++++++++------------ 1 file changed, 61 insertions(+), 18 deletions(-) diff --git a/mm/readahead.c b/mm/readahead.c index 69757a624982..b7779f0064df 100644 --- a/mm/readahead.c +++ b/mm/readahead.c @@ -202,9 +202,10 @@ void page_cache_ra_unbounded(struct readahead_control *ractl, unsigned long nr_to_read, unsigned long lookahead_size) { struct address_space *mapping = ractl->mapping; - unsigned long index = readahead_index(ractl); + unsigned long ra_folio_index, index = readahead_index(ractl); gfp_t gfp_mask = readahead_gfp_mask(mapping); - unsigned long i; + unsigned long mark, i = 0; + unsigned int min_nrpages = mapping_min_folio_nrpages(mapping); /* * Partway through the readahead operation, we will have added @@ -219,10 +220,24 @@ void page_cache_ra_unbounded(struct readahead_control *ractl, unsigned int nofs = memalloc_nofs_save(); filemap_invalidate_lock_shared(mapping); + index = mapping_align_index(mapping, index); + + /* + * As iterator `i` is aligned to min_nrpages, round_up the + * difference between nr_to_read and lookahead_size to mark the + * index that only has lookahead or "async_region" to set the + * readahead flag. + */ + ra_folio_index = round_up(readahead_index(ractl) + nr_to_read - lookahead_size, + min_nrpages); + mark = ra_folio_index - index; + nr_to_read += readahead_index(ractl) - index; + ractl->_index = index; + /* * Preallocate as many pages as we will need. 
*/ - for (i = 0; i < nr_to_read; i++) { + while (i < nr_to_read) { struct folio *folio = xa_load(&mapping->i_pages, index + i); int ret; @@ -236,12 +251,13 @@ void page_cache_ra_unbounded(struct readahead_control *ractl, * not worth getting one just for that. */ read_pages(ractl); - ractl->_index++; - i = ractl->_index + ractl->_nr_pages - index - 1; + ractl->_index += min_nrpages; + i = ractl->_index + ractl->_nr_pages - index; continue; } - folio = filemap_alloc_folio(gfp_mask, 0); + folio = filemap_alloc_folio(gfp_mask, + mapping_min_folio_order(mapping)); if (!folio) break; @@ -251,14 +267,15 @@ void page_cache_ra_unbounded(struct readahead_control *ractl, if (ret == -ENOMEM) break; read_pages(ractl); - ractl->_index++; - i = ractl->_index + ractl->_nr_pages - index - 1; + ractl->_index += min_nrpages; + i = ractl->_index + ractl->_nr_pages - index; continue; } - if (i == nr_to_read - lookahead_size) + if (i == mark) folio_set_readahead(folio); ractl->_workingset |= folio_test_workingset(folio); - ractl->_nr_pages++; + ractl->_nr_pages += min_nrpages; + i += min_nrpages; } /* @@ -489,13 +506,19 @@ void page_cache_ra_order(struct readahead_control *ractl, struct address_space *mapping = ractl->mapping; pgoff_t start = readahead_index(ractl); pgoff_t index = start; + unsigned int min_order = mapping_min_folio_order(mapping); pgoff_t limit = (i_size_read(mapping->host) - 1) >> PAGE_SHIFT; pgoff_t mark = index + ra->size - ra->async_size; unsigned int nofs; int err = 0; gfp_t gfp = readahead_gfp_mask(mapping); + unsigned int min_ra_size = max(4, mapping_min_folio_nrpages(mapping)); - if (!mapping_large_folio_support(mapping) || ra->size < 4) + /* + * Fallback when size < min_nrpages as each folio should be + * at least min_nrpages anyway. + */ + if (!mapping_large_folio_support(mapping) || ra->size < min_ra_size) goto fallback; if (!file_mthp_enabled()) goto fallback; @@ -509,6 +532,7 @@ void page_cache_ra_order(struct readahead_control *ractl, new_order = min(mapping_max_folio_order(mapping), new_order); new_order = min_t(unsigned int, new_order, ilog2(ra->size)); + new_order = max(new_order, min_order); /* See comment in page_cache_ra_unbounded() */ nofs = memalloc_nofs_save(); @@ -520,6 +544,14 @@ void page_cache_ra_order(struct readahead_control *ractl, goto fallback; } + /* + * If the new_order is greater than min_order and index is + * already aligned to new_order, then this will be noop as index + * aligned to new_order should also be aligned to min_order. + */ + ractl->_index = mapping_align_index(mapping, index); + index = readahead_index(ractl); + while (index <= limit) { unsigned int order = new_order; @@ -527,7 +559,7 @@ void page_cache_ra_order(struct readahead_control *ractl, if (index & ((1UL << order) - 1)) order = __ffs(index); /* Don't allocate pages past EOF */ - while (index + (1UL << order) - 1 > limit) + while (order > min_order && index + (1UL << order) - 1 > limit) order--; err = ra_alloc_folio(ractl, index, mark, order, gfp); if (err) @@ -838,8 +870,15 @@ void readahead_expand(struct readahead_control *ractl, struct file_ra_state *ra = ractl->ra; pgoff_t new_index, new_nr_pages; gfp_t gfp_mask = readahead_gfp_mask(mapping); + unsigned long min_nrpages = mapping_min_folio_nrpages(mapping); + unsigned int min_order = mapping_min_folio_order(mapping); new_index = new_start / PAGE_SIZE; + /* + * Readahead code should have aligned the ractl->_index to + * min_nrpages before calling readahead aops. 
+ */ + VM_BUG_ON(!IS_ALIGNED(ractl->_index, min_nrpages)); /* Expand the leading edge downwards */ while (ractl->_index > new_index) { @@ -849,9 +888,11 @@ void readahead_expand(struct readahead_control *ractl, if (folio && !xa_is_value(folio)) return; /* Folio apparently present */ - folio = filemap_alloc_folio(gfp_mask, 0); + folio = filemap_alloc_folio(gfp_mask, min_order); if (!folio) return; + + index = mapping_align_index(mapping, index); if (filemap_add_folio(mapping, folio, index, gfp_mask) < 0) { folio_put(folio); return; @@ -862,7 +903,7 @@ void readahead_expand(struct readahead_control *ractl, ractl->_pflags = 0; psi_memstall_enter(&ractl->_pflags); } - ractl->_nr_pages++; + ractl->_nr_pages += min_nrpages; ractl->_index = folio->index; } @@ -877,9 +918,11 @@ void readahead_expand(struct readahead_control *ractl, if (folio && !xa_is_value(folio)) return; /* Folio apparently present */ - folio = filemap_alloc_folio(gfp_mask, 0); + folio = filemap_alloc_folio(gfp_mask, min_order); if (!folio) return; + + index = mapping_align_index(mapping, index); if (filemap_add_folio(mapping, folio, index, gfp_mask) < 0) { folio_put(folio); return; @@ -890,10 +933,10 @@ void readahead_expand(struct readahead_control *ractl, ractl->_pflags = 0; psi_memstall_enter(&ractl->_pflags); } - ractl->_nr_pages++; + ractl->_nr_pages += min_nrpages; if (ra) { - ra->size++; - ra->async_size++; + ra->size += min_nrpages; + ra->async_size += min_nrpages; } } } -- 2.39.2
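
A rough user-space model of how the reworked preallocation loop now walks the window in min_nrpages steps and where the PG_readahead mark lands; the window sizes below are invented and round_up_ul() stands in for the kernel's round_up():

#include <stdio.h>

static unsigned long round_up_ul(unsigned long x, unsigned long a)
{
	return (x + a - 1) & ~(a - 1);		/* a must be a power of two */
}

int main(void)
{
	unsigned long index = 0, nr_to_read = 32, lookahead_size = 8;
	unsigned long min_nrpages = 4;		/* minimum folio order 2 */
	unsigned long mark, i;

	/* offset, within the window, of the folio that gets PG_readahead */
	mark = round_up_ul(index + nr_to_read - lookahead_size, min_nrpages) - index;

	for (i = 0; i < nr_to_read; i += min_nrpages)
		printf("folio at index %lu%s\n", index + i,
		       i == mark ? "  <- PG_readahead" : "");
	return 0;
}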

From: Luis Chamberlain <mcgrof@kernel.org> mainline inclusion from mainline-v6.12-rc1 commit e220917fa50774fedb27c075df2261fd664e8ca3 category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/ICKJ63 CVE: NA Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i... -------------------------------- split_folio() and split_folio_to_list() assume order 0, to support minorder for non-anonymous folios, we must expand these to check the folio mapping order and use that. Set new_order to be at least minimum folio order if it is set in split_huge_page_to_list() so that we can maintain minimum folio order requirement in the page cache. Update the debugfs write files used for testing to ensure the order is respected as well. We simply enforce the min order when a file mapping is used. Signed-off-by: Luis Chamberlain <mcgrof@kernel.org> Signed-off-by: Pankaj Raghav <p.raghav@samsung.com> Link: https://lore.kernel.org/r/20240902124931.506061-2-kernel@pankajraghav.com # folded fix Link: https://lore.kernel.org/r/20240822135018.1931258-5-kernel@pankajraghav.com Tested-by: David Howells <dhowells@redhat.com> Reviewed-by: Hannes Reinecke <hare@suse.de> Reviewed-by: Zi Yan <ziy@nvidia.com> Signed-off-by: Christian Brauner <brauner@kernel.org> Conflicts: mm/huge_memory.c [context conflicts] Signed-off-by: Jiacheng Yu <yujiacheng3@huawei.com> Signed-off-by: Long Li <leo.lilong@huawei.com> --- include/linux/huge_mm.h | 28 ++++++++++++++--- mm/huge_memory.c | 66 ++++++++++++++++++++++++++++++++++++++--- 2 files changed, 86 insertions(+), 8 deletions(-) diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h index cfe42c43b55b..0633cd978321 100644 --- a/include/linux/huge_mm.h +++ b/include/linux/huge_mm.h @@ -102,6 +102,8 @@ extern struct kobj_attribute thpsize_shmem_enabled_attr; #define thp_vma_allowable_order(vma, vm_flags, tva_flags, order) \ (!!thp_vma_allowable_orders(vma, vm_flags, tva_flags, BIT(order))) +#define split_folio(f) split_folio_to_list(f, NULL) + #ifdef CONFIG_PGTABLE_HAS_HUGE_LEAVES #define HPAGE_PMD_SHIFT PMD_SHIFT #define HPAGE_PUD_SHIFT PUD_SHIFT @@ -358,9 +360,24 @@ void folio_prep_large_rmappable(struct folio *folio); bool can_split_folio(struct folio *folio, int *pextra_pins); int split_huge_page_to_list_to_order(struct page *page, struct list_head *list, unsigned int new_order); +int min_order_for_split(struct folio *folio); +int split_folio_to_list(struct folio *folio, struct list_head *list); static inline int split_huge_page(struct page *page) { - return split_huge_page_to_list_to_order(page, NULL, 0); + struct folio *folio = page_folio(page); + int ret = min_order_for_split(folio); + + if (ret < 0) + return ret; + + /* + * split_huge_page() locks the page before splitting and + * expects the same page that has been split to be locked when + * returned. split_folio(page_folio(page)) cannot be used here + * because it converts the page to folio and passes the head + * page to be split. 
+ */ + return split_huge_page_to_list_to_order(page, NULL, ret); } void deferred_split_folio(struct folio *folio); @@ -523,6 +540,12 @@ static inline int split_huge_page(struct page *page) { return 0; } + +static inline int split_folio_to_list(struct folio *folio, struct list_head *list) +{ + return 0; +} + static inline void deferred_split_folio(struct folio *folio) {} #define split_huge_pmd(__vma, __pmd, __address) \ do { } while (0) @@ -643,7 +666,4 @@ static inline int split_folio_to_order(struct folio *folio, int new_order) return split_folio_to_list_to_order(folio, NULL, new_order); } -#define split_folio_to_list(f, l) split_folio_to_list_to_order(f, l, 0) -#define split_folio(f) split_folio_to_order(f, 0) - #endif /* _LINUX_HUGE_MM_H */ diff --git a/mm/huge_memory.c b/mm/huge_memory.c index 260d8f3ec934..7d257245bd9f 100644 --- a/mm/huge_memory.c +++ b/mm/huge_memory.c @@ -3438,6 +3438,10 @@ bool can_split_folio(struct folio *folio, int *pextra_pins) * * Returns -EBUSY if @page's folio is pinned, or if the anon_vma disappeared * from under us. + * + * Callers should ensure that the order respects the address space mapping + * min-order if one is set for non-anonymous folios. + * */ int split_huge_page_to_list_to_order(struct page *page, struct list_head *list, unsigned int new_order) @@ -3518,6 +3522,7 @@ int split_huge_page_to_list_to_order(struct page *page, struct list_head *list, mapping = NULL; anon_vma_lock_write(anon_vma); } else { + unsigned int min_order; gfp_t gfp; mapping = folio->mapping; @@ -3528,6 +3533,14 @@ int split_huge_page_to_list_to_order(struct page *page, struct list_head *list, goto out; } + min_order = mapping_min_folio_order(folio->mapping); + if (new_order < min_order) { + VM_WARN_ONCE(1, "Cannot split mapped folio below min-order: %u", + min_order); + ret = -EINVAL; + goto out; + } + gfp = current_gfp_context(mapping_gfp_mask(mapping) & GFP_RECLAIM_MASK); @@ -3644,6 +3657,30 @@ int split_huge_page_to_list_to_order(struct page *page, struct list_head *list, return ret; } +int min_order_for_split(struct folio *folio) +{ + if (folio_test_anon(folio)) + return 0; + + if (!folio->mapping) { + if (folio_test_pmd_mappable(folio)) + count_vm_event(THP_SPLIT_PAGE_FAILED); + return -EBUSY; + } + + return mapping_min_folio_order(folio->mapping); +} + +int split_folio_to_list(struct folio *folio, struct list_head *list) +{ + int ret = min_order_for_split(folio); + + if (ret < 0) + return ret; + + return split_huge_page_to_list_to_order(&folio->page, list, ret); +} + /* * __folio_unqueue_deferred_split() is not to be called directly: * the folio_unqueue_deferred_split() inline wrapper in mm/internal.h @@ -3899,6 +3936,8 @@ static int split_huge_pages_pid(int pid, unsigned long vaddr_start, struct vm_area_struct *vma = vma_lookup(mm, addr); struct page *page; struct folio *folio; + struct address_space *mapping; + unsigned int target_order = new_order; if (!vma) break; @@ -3919,7 +3958,13 @@ static int split_huge_pages_pid(int pid, unsigned long vaddr_start, if (!is_transparent_hugepage(folio)) goto next; - if (new_order >= folio_order(folio)) + if (!folio_test_anon(folio)) { + mapping = folio->mapping; + target_order = max(new_order, + mapping_min_folio_order(mapping)); + } + + if (target_order >= folio_order(folio)) goto next; total++; @@ -3935,9 +3980,14 @@ static int split_huge_pages_pid(int pid, unsigned long vaddr_start, if (!folio_trylock(folio)) goto next; - if (!split_folio_to_order(folio, new_order)) + if (!folio_test_anon(folio) && folio->mapping != mapping) 
+ goto unlock; + + if (!split_folio_to_order(folio, target_order)) split++; +unlock: + folio_unlock(folio); next: folio_put(folio); @@ -3962,6 +4012,8 @@ static int split_huge_pages_in_file(const char *file_path, pgoff_t off_start, pgoff_t index; int nr_pages = 1; unsigned long total = 0, split = 0; + unsigned int min_order; + unsigned int target_order; file = getname_kernel(file_path); if (IS_ERR(file)) @@ -3975,6 +4027,8 @@ static int split_huge_pages_in_file(const char *file_path, pgoff_t off_start, file_path, off_start, off_end); mapping = candidate->f_mapping; + min_order = mapping_min_folio_order(mapping); + target_order = max(new_order, min_order); for (index = off_start; index < off_end; index += nr_pages) { struct folio *folio = filemap_get_folio(mapping, index); @@ -3989,15 +4043,19 @@ static int split_huge_pages_in_file(const char *file_path, pgoff_t off_start, total++; nr_pages = folio_nr_pages(folio); - if (new_order >= folio_order(folio)) + if (target_order >= folio_order(folio)) goto next; if (!folio_trylock(folio)) goto next; - if (!split_folio_to_order(folio, new_order)) + if (folio->mapping != mapping) + goto unlock; + + if (!split_folio_to_order(folio, target_order)) split++; +unlock: folio_unlock(folio); next: folio_put(folio); -- 2.39.2
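
A back-of-the-envelope sketch, with example numbers only, of what clamping the split order to the mapping's minimum order changes in practice:

#include <stdio.h>

int main(void)
{
	unsigned int folio_order = 6;	/* folio of 64 pages */
	unsigned int min_order = 4;	/* from mapping_min_folio_order() */
	unsigned int new_order = 0;	/* what split_folio() used to request */

	if (new_order < min_order)
		new_order = min_order;	/* what min_order_for_split() enforces */

	printf("split an order-%u folio into %u order-%u folios\n",
	       folio_order, 1U << (folio_order - new_order), new_order);
	return 0;
}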

From: Pankaj Raghav <p.raghav@samsung.com> mainline inclusion from mainline-v6.12-rc1 commit 743a2753a02e805347969f6f89f38b736850d808 category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/ICKJ63 CVE: NA Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i... -------------------------------- Usually the page cache does not extend beyond the size of the inode, therefore, no PTEs are created for folios that extend beyond the size. But with LBS support, we might extend page cache beyond the size of the inode as we need to guarantee folios of minimum order. While doing a read, do_fault_around() can create PTEs for pages that lie beyond the EOF leading to incorrect error return when accessing a page beyond the mapped file. Cap the PTE range to be created for the page cache up to the end of file(EOF) in filemap_map_pages() so that return error codes are consistent with POSIX[1] for LBS configurations. generic/749 has been created to trigger this edge case. This also fixes generic/749 for tmpfs with huge=always on systems with 4k base page size. [1](from mmap(2)) SIGBUS Attempted access to a page of the buffer that lies beyond the end of the mapped file. For an explanation of the treatment of the bytes in the page that corresponds to the end of a mapped file that is not a multiple of the page size, see NOTES. Signed-off-by: Luis Chamberlain <mcgrof@kernel.org> Signed-off-by: Pankaj Raghav <p.raghav@samsung.com> Link: https://lore.kernel.org/r/20240822135018.1931258-6-kernel@pankajraghav.com Tested-by: David Howells <dhowells@redhat.com> Acked-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Hannes Reinecke <hare@suse.de> Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Daniel Gomez <da.gomez@samsung.com> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Christian Brauner <brauner@kernel.org> Signed-off-by: Jiacheng Yu <yujiacheng3@huawei.com> Signed-off-by: Long Li <leo.lilong@huawei.com> --- mm/filemap.c | 6 +++++- 1 file changed, 5 insertions(+), 1 deletion(-) diff --git a/mm/filemap.c b/mm/filemap.c index d5cb0fe7d83a..e6f3bf18444e 100644 --- a/mm/filemap.c +++ b/mm/filemap.c @@ -3678,7 +3678,7 @@ vm_fault_t filemap_map_pages(struct vm_fault *vmf, struct vm_area_struct *vma = vmf->vma; struct file *file = vma->vm_file; struct address_space *mapping = file->f_mapping; - pgoff_t last_pgoff = start_pgoff; + pgoff_t file_end, last_pgoff = start_pgoff; unsigned long addr; XA_STATE(xas, &mapping->i_pages, start_pgoff); struct folio *folio; @@ -3704,6 +3704,10 @@ vm_fault_t filemap_map_pages(struct vm_fault *vmf, goto out; } + file_end = DIV_ROUND_UP(i_size_read(mapping->host), PAGE_SIZE) - 1; + if (end_pgoff > file_end) + end_pgoff = file_end; + folio_type = mm_counter_file(folio); do { unsigned long end; -- 2.39.2
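
The EOF clamp added to filemap_map_pages() reduces to a little arithmetic; a stand-alone sketch with made-up sizes:

#include <stdio.h>

#define PAGE_SIZE		4096UL
#define DIV_ROUND_UP(n, d)	(((n) + (d) - 1) / (d))

int main(void)
{
	unsigned long long i_size = 10000;		/* file is ~2.5 pages long */
	unsigned long start_pgoff = 0, end_pgoff = 15;	/* fault-around window */
	unsigned long file_end = DIV_ROUND_UP(i_size, PAGE_SIZE) - 1;

	/* never create PTEs for page cache that lies beyond EOF */
	if (end_pgoff > file_end)
		end_pgoff = file_end;

	printf("map PTEs for pgoff %lu..%lu\n", start_pgoff, end_pgoff);	/* 0..2 */
	return 0;
}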

From: Pankaj Raghav <p.raghav@samsung.com> mainline inclusion from mainline-v6.12-rc1 commit 10553a91652d995274da63fc317470f703765081 category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/ICKJ63 CVE: NA Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i... -------------------------------- iomap_dio_zero() will pad a fs block with zeroes if the direct IO size < fs block size. iomap_dio_zero() has an implicit assumption that fs block size < page_size. This is true for most filesystems at the moment. If the block size > page size, this will send the contents of the page next to zero page(as len > PAGE_SIZE) to the underlying block device, causing FS corruption. iomap is a generic infrastructure and it should not make any assumptions about the fs block size and the page size of the system. Signed-off-by: Pankaj Raghav <p.raghav@samsung.com> Link: https://lore.kernel.org/r/20240822135018.1931258-7-kernel@pankajraghav.com Reviewed-by: Hannes Reinecke <hare@suse.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Daniel Gomez <da.gomez@samsung.com> Signed-off-by: Christian Brauner <brauner@kernel.org> Signed-off-by: Jiacheng Yu <yujiacheng3@huawei.com> Signed-off-by: Long Li <leo.lilong@huawei.com> --- fs/iomap/buffered-io.c | 4 ++-- fs/iomap/direct-io.c | 45 ++++++++++++++++++++++++++++++++++++------ 2 files changed, 41 insertions(+), 8 deletions(-) diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c index c994b2f058c3..ad442f71c000 100644 --- a/fs/iomap/buffered-io.c +++ b/fs/iomap/buffered-io.c @@ -2147,10 +2147,10 @@ iomap_writepages(struct address_space *mapping, struct writeback_control *wbc, } EXPORT_SYMBOL_GPL(iomap_writepages); -static int __init iomap_init(void) +static int __init iomap_buffered_init(void) { return bioset_init(&iomap_ioend_bioset, 4 * (PAGE_SIZE / SECTOR_SIZE), offsetof(struct iomap_ioend, io_bio), BIOSET_NEED_BVECS); } -fs_initcall(iomap_init); +fs_initcall(iomap_buffered_init); diff --git a/fs/iomap/direct-io.c b/fs/iomap/direct-io.c index bcd3f8cf5ea4..409a21144a55 100644 --- a/fs/iomap/direct-io.c +++ b/fs/iomap/direct-io.c @@ -11,6 +11,7 @@ #include <linux/iomap.h> #include <linux/backing-dev.h> #include <linux/uio.h> +#include <linux/set_memory.h> #include <linux/task_io_accounting_ops.h> #include "trace.h" @@ -27,6 +28,13 @@ #define IOMAP_DIO_WRITE (1U << 30) #define IOMAP_DIO_DIRTY (1U << 31) +/* + * Used for sub block zeroing in iomap_dio_zero() + */ +#define IOMAP_ZERO_PAGE_SIZE (SZ_64K) +#define IOMAP_ZERO_PAGE_ORDER (get_order(IOMAP_ZERO_PAGE_SIZE)) +static struct page *zero_page; + struct iomap_dio { struct kiocb *iocb; const struct iomap_dio_ops *dops; @@ -232,13 +240,20 @@ void iomap_dio_bio_end_io(struct bio *bio) } EXPORT_SYMBOL_GPL(iomap_dio_bio_end_io); -static void iomap_dio_zero(const struct iomap_iter *iter, struct iomap_dio *dio, +static int iomap_dio_zero(const struct iomap_iter *iter, struct iomap_dio *dio, loff_t pos, unsigned len) { struct inode *inode = file_inode(dio->iocb->ki_filp); - struct page *page = ZERO_PAGE(0); struct bio *bio; + if (!len) + return 0; + /* + * Max block size supported is 64k + */ + if (WARN_ON_ONCE(len > IOMAP_ZERO_PAGE_SIZE)) + return -EINVAL; + bio = iomap_dio_alloc_bio(iter, dio, 1, REQ_OP_WRITE | REQ_SYNC | REQ_IDLE); fscrypt_set_bio_crypt_ctx(bio, inode, pos >> inode->i_blkbits, GFP_KERNEL); @@ -246,8 +261,9 @@ static void iomap_dio_zero(const struct iomap_iter *iter, struct iomap_dio *dio, 
bio->bi_private = dio; bio->bi_end_io = iomap_dio_bio_end_io; - __bio_add_page(bio, page, len, 0); + __bio_add_page(bio, zero_page, len, 0); iomap_dio_submit_bio(iter, dio, bio, pos); + return 0; } /* @@ -356,8 +372,10 @@ static loff_t iomap_dio_bio_iter(const struct iomap_iter *iter, if (need_zeroout) { /* zero out from the start of the block to the write offset */ pad = pos & (fs_block_size - 1); - if (pad) - iomap_dio_zero(iter, dio, pos - pad, pad); + + ret = iomap_dio_zero(iter, dio, pos - pad, pad); + if (ret) + goto out; } /* @@ -430,7 +448,8 @@ static loff_t iomap_dio_bio_iter(const struct iomap_iter *iter, /* zero out from the end of the write to the end of the block */ pad = pos & (fs_block_size - 1); if (pad) - iomap_dio_zero(iter, dio, pos, fs_block_size - pad); + ret = iomap_dio_zero(iter, dio, pos, + fs_block_size - pad); } out: /* Undo iter limitation to current extent */ @@ -752,3 +771,17 @@ iomap_dio_rw(struct kiocb *iocb, struct iov_iter *iter, return iomap_dio_complete(dio); } EXPORT_SYMBOL_GPL(iomap_dio_rw); + +static int __init iomap_dio_init(void) +{ + zero_page = alloc_pages(GFP_KERNEL | __GFP_ZERO, + IOMAP_ZERO_PAGE_ORDER); + + if (!zero_page) + return -ENOMEM; + + set_memory_ro((unsigned long)page_address(zero_page), + 1U << IOMAP_ZERO_PAGE_ORDER); + return 0; +} +fs_initcall(iomap_dio_init); -- 2.39.2
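
The sub-block zeroing that iomap_dio_zero() is asked to perform follows from simple masking; a user-space sketch with a hypothetical block size and write range:

#include <stdio.h>

int main(void)
{
	unsigned long fs_block_size = 16384;		/* 16k blocks, > 4k pages */
	unsigned long long pos = 4096, end = 9000;	/* unaligned direct write */
	unsigned long head_pad = pos & (fs_block_size - 1);
	unsigned long tail_pad = end & (fs_block_size - 1);

	if (head_pad)
		printf("zero %lu bytes before the write\n", head_pad);		/* 4096 */
	if (tail_pad)
		printf("zero %lu bytes after the write\n",
		       fs_block_size - tail_pad);				/* 7384 */
	return 0;
}

With a 4k ZERO_PAGE(0) and a pad larger than PAGE_SIZE, the old code would have added bytes past the zero page into the bio, which is exactly the corruption the commit message describes; the preallocated 64k zero buffer avoids that.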

From: Dave Chinner <dchinner@redhat.com> mainline inclusion from mainline-v6.12-rc1 commit de631e1a8b71017b8a12b57d07db82e4052555af category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/ICKJ63 CVE: NA Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i... -------------------------------- Pankaj Raghav reported that when filesystem block size is larger than page size, the xattr code can use kmalloc() for high order allocations. This triggers a useless warning in the allocator as it is a __GFP_NOFAIL allocation here: static inline struct page *rmqueue(struct zone *preferred_zone, struct zone *zone, unsigned int order, gfp_t gfp_flags, unsigned int alloc_flags, int migratetype) { struct page *page; /* * We most definitely don't want callers attempting to * allocate greater than order-1 page units with __GFP_NOFAIL. */
WARN_ON_ONCE((gfp_flags & __GFP_NOFAIL) && (order > 1)); ...
Fix this by changing all these call sites to use kvmalloc(), which will strip the NOFAIL from the kmalloc attempt and if that fails will do a __GFP_NOFAIL vmalloc(). This is not an issue that productions systems will see as filesystems with block size > page size cannot be mounted by the kernel; Pankaj is developing this functionality right now. Reported-by: Pankaj Raghav <kernel@pankajraghav.com> Fixes: f078d4ea8276 ("xfs: convert kmem_alloc() to kmalloc()") Signed-off-by: Dave Chinner <dchinner@redhat.com> Link: https://lore.kernel.org/r/20240822135018.1931258-8-kernel@pankajraghav.com Reviewed-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Pankaj Raghav <p.raghav@samsung.com> Reviewed-by: Hannes Reinecke <hare@suse.de> Signed-off-by: Christoph Hellwig <hch@lst.de> Acked-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Daniel Gomez <da.gomez@samsung.com> Reviewed-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org> Signed-off-by: Christian Brauner <brauner@kernel.org> Conflicts: fs/xfs/libxfs/xfs_attr_leaf.c [Upstream dependency patch not yet merged] Signed-off-by: Jiacheng Yu <yujiacheng3@huawei.com> Signed-off-by: Long Li <leo.lilong@huawei.com> --- fs/xfs/libxfs/xfs_attr_leaf.c | 16 +++++++--------- 1 file changed, 7 insertions(+), 9 deletions(-) diff --git a/fs/xfs/libxfs/xfs_attr_leaf.c b/fs/xfs/libxfs/xfs_attr_leaf.c index 51ff44068675..0e4e763ec283 100644 --- a/fs/xfs/libxfs/xfs_attr_leaf.c +++ b/fs/xfs/libxfs/xfs_attr_leaf.c @@ -1141,10 +1141,7 @@ xfs_attr3_leaf_to_shortform( trace_xfs_attr_leaf_to_sf(args); - tmpbuffer = kmem_alloc(args->geo->blksize, 0); - if (!tmpbuffer) - return -ENOMEM; - + tmpbuffer = kvmalloc(args->geo->blksize, GFP_KERNEL | __GFP_NOFAIL); memcpy(tmpbuffer, bp->b_addr, args->geo->blksize); leaf = (xfs_attr_leafblock_t *)tmpbuffer; @@ -1207,7 +1204,7 @@ xfs_attr3_leaf_to_shortform( error = 0; out: - kmem_free(tmpbuffer); + kvfree(tmpbuffer); return error; } @@ -1619,7 +1616,7 @@ xfs_attr3_leaf_compact( trace_xfs_attr_leaf_compact(args); - tmpbuffer = kmem_alloc(args->geo->blksize, 0); + tmpbuffer = kvmalloc(args->geo->blksize, GFP_KERNEL | __GFP_NOFAIL); memcpy(tmpbuffer, bp->b_addr, args->geo->blksize); memset(bp->b_addr, 0, args->geo->blksize); leaf_src = (xfs_attr_leafblock_t *)tmpbuffer; @@ -1657,7 +1654,7 @@ xfs_attr3_leaf_compact( */ xfs_trans_log_buf(trans, bp, 0, args->geo->blksize - 1); - kmem_free(tmpbuffer); + kvfree(tmpbuffer); } /* @@ -2336,7 +2333,8 @@ xfs_attr3_leaf_unbalance( struct xfs_attr_leafblock *tmp_leaf; struct xfs_attr3_icleaf_hdr tmphdr; - tmp_leaf = kmem_zalloc(state->args->geo->blksize, 0); + tmp_leaf = kvzalloc(state->args->geo->blksize, + GFP_KERNEL | __GFP_NOFAIL); /* * Copy the header into the temp leaf so that all the stuff @@ -2376,7 +2374,7 @@ xfs_attr3_leaf_unbalance( } memcpy(save_leaf, tmp_leaf, state->args->geo->blksize); savehdr = tmphdr; /* struct copy */ - kmem_free(tmp_leaf); + kvfree(tmp_leaf); } xfs_attr3_leaf_hdr_to_disk(state->args->geo, save_leaf, &savehdr); -- 2.39.2

From: Pankaj Raghav <p.raghav@samsung.com> mainline inclusion from mainline-v6.12-rc1 commit 79012cfa00b50ca80fb9f399f3c54b2185d728be category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/ICKJ63 CVE: NA Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i... -------------------------------- For block size larger than page size, the unit of efficient IO is the block size, not the page size. Leaving stat() to report PAGE_SIZE as the block size causes test programs like fsx to issue illegal ranges for operations that require block size alignment (e.g. fallocate() insert range). Hence update the preferred IO size to reflect the block size in this case. This change is based on a patch originally from Dave Chinner.[1] [1] https://lwn.net/ml/linux-fsdevel/20181107063127.3902-16-david@fromorbit.com/ Signed-off-by: Pankaj Raghav <p.raghav@samsung.com> Signed-off-by: Luis Chamberlain <mcgrof@kernel.org> Link: https://lore.kernel.org/r/20240822135018.1931258-9-kernel@pankajraghav.com Acked-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Hannes Reinecke <hare@suse.de> Reviewed-by: Daniel Gomez <da.gomez@samsung.com> Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org> Signed-off-by: Christian Brauner <brauner@kernel.org> Signed-off-by: Jiacheng Yu <yujiacheng3@huawei.com> Signed-off-by: Long Li <leo.lilong@huawei.com> --- fs/xfs/xfs_iops.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/fs/xfs/xfs_iops.c b/fs/xfs/xfs_iops.c index 9b7f49ad3e52..8b126f3810b6 100644 --- a/fs/xfs/xfs_iops.c +++ b/fs/xfs/xfs_iops.c @@ -546,7 +546,7 @@ xfs_stat_blksize( return 1U << mp->m_allocsize_log; } - return PAGE_SIZE; + return max_t(uint32_t, PAGE_SIZE, mp->m_sb.sb_blocksize); } STATIC int -- 2.39.2
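
From user space the change is visible through stat(2); a minimal example (the mount path below is hypothetical):

#include <stdio.h>
#include <sys/stat.h>

int main(void)
{
	struct stat st;

	if (stat("/mnt/xfs/testfile", &st) != 0) {	/* hypothetical LBS mount */
		perror("stat");
		return 1;
	}
	/* On an LBS filesystem this now reports the fs block size
	 * (e.g. 16384) instead of the 4096-byte page size. */
	printf("preferred I/O size: %ld bytes\n", (long)st.st_blksize);
	return 0;
}

Tools like fsx can then derive block-size-aligned ranges for operations such as fallocate() insert range from st_blksize, as the commit message intends.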

From: Pankaj Raghav <p.raghav@samsung.com> mainline inclusion from mainline-v6.12-rc1 commit cebf9dacd5c3cec2813215a081509647f777ecc3 category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/ICKJ63 CVE: NA Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i... -------------------------------- Instead of assuming that PAGE_SHIFT is always higher than the blocklog, make the calculation generic so that page cache count can be calculated correctly for LBS. Signed-off-by: Pankaj Raghav <p.raghav@samsung.com> Link: https://lore.kernel.org/r/20240822135018.1931258-10-kernel@pankajraghav.com Acked-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Daniel Gomez <da.gomez@samsung.com> Reviewed-by: Hannes Reinecke <hare@suse.de> Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org> Signed-off-by: Christian Brauner <brauner@kernel.org> Signed-off-by: Jiacheng Yu <yujiacheng3@huawei.com> Signed-off-by: Long Li <leo.lilong@huawei.com> --- fs/xfs/xfs_mount.c | 7 ++++++- 1 file changed, 6 insertions(+), 1 deletion(-) diff --git a/fs/xfs/xfs_mount.c b/fs/xfs/xfs_mount.c index 8481106de5f8..4b2801faf75c 100644 --- a/fs/xfs/xfs_mount.c +++ b/fs/xfs/xfs_mount.c @@ -131,11 +131,16 @@ xfs_sb_validate_fsb_count( xfs_sb_t *sbp, uint64_t nblocks) { + uint64_t max_bytes; + ASSERT(PAGE_SHIFT >= sbp->sb_blocklog); ASSERT(sbp->sb_blocklog >= BBSHIFT); + if (check_shl_overflow(nblocks, sbp->sb_blocklog, &max_bytes)) + return -EFBIG; + /* Limited by ULONG_MAX of page cache index */ - if (nblocks >> (PAGE_SHIFT - sbp->sb_blocklog) > ULONG_MAX) + if (max_bytes >> PAGE_SHIFT > ULONG_MAX) return -EFBIG; return 0; } -- 2.39.2
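
A user-space model of the new overflow-aware check (constants are examples; __builtin_mul_overflow() stands in for the kernel's check_shl_overflow()):

#include <stdio.h>
#include <limits.h>

#define PAGE_SHIFT 12

int main(void)
{
	unsigned long long nblocks = 1ULL << 40;	/* hypothetical fs size */
	unsigned int blocklog = 16;			/* 64k blocks > 4k pages */
	unsigned long long max_bytes;

	/* compute the byte count explicitly instead of assuming
	 * PAGE_SHIFT >= blocklog, then bound the page cache index */
	if (__builtin_mul_overflow(nblocks, 1ULL << blocklog, &max_bytes) ||
	    (max_bytes >> PAGE_SHIFT) > ULONG_MAX) {	/* fires on 32-bit */
		puts("-EFBIG: filesystem does not fit in the page cache");
		return 1;
	}
	printf("page cache needs %llu indexes\n", max_bytes >> PAGE_SHIFT);
	return 0;
}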

From: Pankaj Raghav <p.raghav@samsung.com> mainline inclusion from mainline-v6.10-rc2 commit 0938b1614648d5fbd832449a5a8a1b51d985323d category: bugfix bugzilla: https://gitee.com/openeuler/kernel/issues/ICKJ63 CVE: NA Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i... -------------------------------- The readahead flag is set on a folio based on the lookahead_size and nr_to_read. For example, when the readahead happens from index to index + nr_to_read, then the readahead `mark` offset from index is set at nr_to_read - lookahead_size. There are some scenarios where the lookahead_size > nr_to_read. For example, readahead window was created, but the file was truncated before the readahead starts. do_page_cache_ra() will clamp the nr_to_read if the readahead window extends beyond EOF after truncation. If this happens, readahead flag should not be set on any folio on the current readahead window. The current calculation for `mark` with mapping_min_order > 0 gives incorrect results when lookahead_size > nr_to_read due to rounding up operation: index = 128 nr_to_read = 16 lookahead_size = 28 mapping_min_order = 4 (16 pages) ra_folio_index = round_up(128 + 16 - 28, 16) = 128; mark = 128 - 128 = 0; # offset from index to set RA flag In the above example, the lookahead_size is actually lying outside the current readahead window. Without this patch, RA flag will be set incorrectly on the folio at index 128. This can lead to marking the readahead flag on the wrong folio, therefore, triggering a readahead when it is not necessary. Explicitly initialize `mark` to be ULONG_MAX and only calculate it when lookahead_size is within the readahead window. Link: https://lkml.kernel.org/r/20241017062342.478973-1-kernel@pankajraghav.com Fixes: 26cfdb395eef ("readahead: allocate folios with mapping_min_order in readahead") Signed-off-by: Pankaj Raghav <p.raghav@samsung.com> Cc: Luis Chamberlain <mcgrof@kernel.org> Cc: Matthew Wilcox <willy@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Long Li <leo.lilong@huawei.com> --- mm/readahead.c | 15 ++++++++++----- 1 file changed, 10 insertions(+), 5 deletions(-) diff --git a/mm/readahead.c b/mm/readahead.c index b7779f0064df..3395eaf4b4c1 100644 --- a/mm/readahead.c +++ b/mm/readahead.c @@ -202,9 +202,9 @@ void page_cache_ra_unbounded(struct readahead_control *ractl, unsigned long nr_to_read, unsigned long lookahead_size) { struct address_space *mapping = ractl->mapping; - unsigned long ra_folio_index, index = readahead_index(ractl); + unsigned long index = readahead_index(ractl); gfp_t gfp_mask = readahead_gfp_mask(mapping); - unsigned long mark, i = 0; + unsigned long mark = ULONG_MAX, i = 0; unsigned int min_nrpages = mapping_min_folio_nrpages(mapping); /* @@ -228,9 +228,14 @@ void page_cache_ra_unbounded(struct readahead_control *ractl, * index that only has lookahead or "async_region" to set the * readahead flag. */ - ra_folio_index = round_up(readahead_index(ractl) + nr_to_read - lookahead_size, - min_nrpages); - mark = ra_folio_index - index; + if (lookahead_size <= nr_to_read) { + unsigned long ra_folio_index; + + ra_folio_index = round_up(readahead_index(ractl) + + nr_to_read - lookahead_size, + min_nrpages); + mark = ra_folio_index - index; + } nr_to_read += readahead_index(ractl) - index; ractl->_index = index; -- 2.39.2
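
Plugging the commit message's numbers into the fixed logic (stand-alone sketch; round_up_ul() stands in for round_up()):

#include <stdio.h>
#include <limits.h>

static unsigned long round_up_ul(unsigned long x, unsigned long a)
{
	return (x + a - 1) & ~(a - 1);
}

int main(void)
{
	unsigned long index = 128, nr_to_read = 16, lookahead_size = 28;
	unsigned long min_nrpages = 16;		/* mapping_min_order = 4 */
	unsigned long mark = ULONG_MAX;		/* "no readahead mark" */

	/* only place a mark when the lookahead lies inside the window */
	if (lookahead_size <= nr_to_read)
		mark = round_up_ul(index + nr_to_read - lookahead_size,
				   min_nrpages) - index;

	if (mark == ULONG_MAX)
		puts("lookahead outside window: no folio gets PG_readahead");
	else
		printf("mark PG_readahead at offset %lu\n", mark);
	return 0;
}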

From: Pankaj Raghav <p.raghav@samsung.com> mainline inclusion from mainline-v6.12-rc1 commit 7df7c204c678e24cd32d33360538670b7b90e330 category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/ICKJ63 CVE: NA Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i... -------------------------------- Page cache now has the ability to have a minimum order when allocating a folio which is a prerequisite to add support for block size > page size. Signed-off-by: Pankaj Raghav <p.raghav@samsung.com> Signed-off-by: Luis Chamberlain <mcgrof@kernel.org> Link: https://lore.kernel.org/r/20240827-xfs-fix-wformat-bs-gt-ps-v1-1-aec6717609e... # fix folded Link: https://lore.kernel.org/r/20240822135018.1931258-11-kernel@pankajraghav.com Reviewed-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Christian Brauner <brauner@kernel.org> Conflicts: fs/xfs/libxfs/xfs_ialloc.c [context conflicts] Signed-off-by: Jiacheng Yu <yujiacheng3@huawei.com> Signed-off-by: Long Li <leo.lilong@huawei.com> --- fs/xfs/libxfs/xfs_ialloc.c | 5 +++++ fs/xfs/libxfs/xfs_shared.h | 3 +++ fs/xfs/xfs_icache.c | 6 ++++-- fs/xfs/xfs_mount.c | 1 - fs/xfs/xfs_super.c | 28 ++++++++++++++++++++-------- include/linux/pagemap.h | 13 +++++++++++++ 6 files changed, 45 insertions(+), 11 deletions(-) diff --git a/fs/xfs/libxfs/xfs_ialloc.c b/fs/xfs/libxfs/xfs_ialloc.c index dd90f30cd144..cf8efced92fb 100644 --- a/fs/xfs/libxfs/xfs_ialloc.c +++ b/fs/xfs/libxfs/xfs_ialloc.c @@ -2893,6 +2893,11 @@ xfs_ialloc_setup_geometry( igeo->ialloc_align = mp->m_dalign; else igeo->ialloc_align = 0; + + if (mp->m_sb.sb_blocksize > PAGE_SIZE) + igeo->min_folio_order = mp->m_sb.sb_blocklog - PAGE_SHIFT; + else + igeo->min_folio_order = 0; } /* Compute the location of the root directory inode that is laid out by mkfs. */ diff --git a/fs/xfs/libxfs/xfs_shared.h b/fs/xfs/libxfs/xfs_shared.h index c4381388c0c1..2d535131fb68 100644 --- a/fs/xfs/libxfs/xfs_shared.h +++ b/fs/xfs/libxfs/xfs_shared.h @@ -188,6 +188,9 @@ struct xfs_ino_geometry { /* precomputed value for di_flags2 */ uint64_t new_diflags2; + /* minimum folio order of a page cache allocation */ + unsigned int min_folio_order; + }; #endif /* __XFS_SHARED_H__ */ diff --git a/fs/xfs/xfs_icache.c b/fs/xfs/xfs_icache.c index 744d819ca2c3..f2fa80f4a1a7 100644 --- a/fs/xfs/xfs_icache.c +++ b/fs/xfs/xfs_icache.c @@ -88,7 +88,8 @@ xfs_inode_alloc( /* VFS doesn't initialise i_mode or i_state! 
*/ VFS_I(ip)->i_mode = 0; VFS_I(ip)->i_state = 0; - mapping_set_large_folios(VFS_I(ip)->i_mapping); + mapping_set_folio_min_order(VFS_I(ip)->i_mapping, + M_IGEO(mp)->min_folio_order); XFS_STATS_INC(mp, vn_active); ASSERT(atomic_read(&ip->i_pincount) == 0); @@ -323,7 +324,8 @@ xfs_reinit_inode( inode->i_rdev = dev; inode->i_uid = uid; inode->i_gid = gid; - mapping_set_large_folios(inode->i_mapping); + mapping_set_folio_min_order(inode->i_mapping, + M_IGEO(mp)->min_folio_order); return error; } diff --git a/fs/xfs/xfs_mount.c b/fs/xfs/xfs_mount.c index 4b2801faf75c..454cbfcfa996 100644 --- a/fs/xfs/xfs_mount.c +++ b/fs/xfs/xfs_mount.c @@ -133,7 +133,6 @@ xfs_sb_validate_fsb_count( { uint64_t max_bytes; - ASSERT(PAGE_SHIFT >= sbp->sb_blocklog); ASSERT(sbp->sb_blocklog >= BBSHIFT); if (check_shl_overflow(nblocks, sbp->sb_blocklog, &max_bytes)) diff --git a/fs/xfs/xfs_super.c b/fs/xfs/xfs_super.c index ca82472d2558..b036889d4db8 100644 --- a/fs/xfs/xfs_super.c +++ b/fs/xfs/xfs_super.c @@ -1639,16 +1639,28 @@ xfs_fs_fill_super( goto out_free_sb; } - /* - * Until this is fixed only page-sized or smaller data blocks work. - */ if (mp->m_sb.sb_blocksize > PAGE_SIZE) { - xfs_warn(mp, - "File system with blocksize %d bytes. " - "Only pagesize (%ld) or less will currently work.", + size_t max_folio_size = mapping_max_folio_size_supported(); + + if (!xfs_has_crc(mp)) { + xfs_warn(mp, +"V4 Filesystem with blocksize %d bytes. Only pagesize (%ld) or less is supported.", mp->m_sb.sb_blocksize, PAGE_SIZE); - error = -ENOSYS; - goto out_free_sb; + error = -ENOSYS; + goto out_free_sb; + } + + if (mp->m_sb.sb_blocksize > max_folio_size) { + xfs_warn(mp, +"block size (%u bytes) not supported; Only block size (%zu) or less is supported", + mp->m_sb.sb_blocksize, max_folio_size); + error = -ENOSYS; + goto out_free_sb; + } + + xfs_warn(mp, +"EXPERIMENTAL: V5 Filesystem with Large Block Size (%d bytes) enabled.", + mp->m_sb.sb_blocksize); } /* Ensure this filesystem fits in the page cache limits */ diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h index 21b6dc122249..250fe7b78e1a 100644 --- a/include/linux/pagemap.h +++ b/include/linux/pagemap.h @@ -357,6 +357,19 @@ static inline void mapping_set_gfp_mask(struct address_space *m, gfp_t mask) #define MAX_XAS_ORDER (XA_CHUNK_SHIFT * 2 - 1) #define MAX_PAGECACHE_ORDER min(MAX_XAS_ORDER, PREFERRED_MAX_PAGECACHE_ORDER) +/* + * mapping_max_folio_size_supported() - Check the max folio size supported + * + * The filesystem should call this function at mount time if there is a + * requirement on the folio mapping size in the page cache. + */ +static inline size_t mapping_max_folio_size_supported(void) +{ + if (IS_ENABLED(CONFIG_TRANSPARENT_HUGEPAGE)) + return 1U << (PAGE_SHIFT + MAX_PAGECACHE_ORDER); + return PAGE_SIZE; +} + /* * mapping_set_folio_order_range() - Set the orders supported by a file. * @mapping: The address space of the file. -- 2.39.2
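
The mount-time geometry decisions in this patch come down to a couple of shifts; a stand-alone sketch with example constants (MAX_PAGECACHE_ORDER is hard-coded here purely for illustration):

#include <stdio.h>

#define PAGE_SHIFT		12
#define PAGE_SIZE		(1UL << PAGE_SHIFT)
#define MAX_PAGECACHE_ORDER	8	/* example value, see pagemap.h */

int main(void)
{
	unsigned int sb_blocklog = 16;			/* 64k filesystem blocks */
	unsigned long sb_blocksize = 1UL << sb_blocklog;
	unsigned long max_folio_size = 1UL << (PAGE_SHIFT + MAX_PAGECACHE_ORDER);
	unsigned int min_folio_order = 0;

	/* mirror of the mapping_max_folio_size_supported() mount check */
	if (sb_blocksize > max_folio_size) {
		puts("-ENOSYS: block size not supported");
		return 1;
	}
	/* mirror of the xfs_ialloc_setup_geometry() calculation */
	if (sb_blocksize > PAGE_SIZE)
		min_folio_order = sb_blocklog - PAGE_SHIFT;

	printf("min_folio_order = %u\n", min_folio_order);	/* 4 */
	return 0;
}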

From: Christian Brauner <brauner@kernel.org> mainline inclusion from mainline-v6.9-rc1 commit 210a03c9d51aa0e6e6f06980116e3256da8d4c48 category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/ICPOEB CVE: NA Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i... -------------------------------- There's a bunch of flags that are purely based on what the file operations support while also never being conditionally set or unset. IOW, they're not subject to change for individual files. Imho, such flags don't need to live in f_mode they might as well live in the fops structs itself. And the fops struct already has that lonely mmap_supported_flags member. We might as well turn that into a generic fop_flags member and move a few flags from FMODE_* space into FOP_* space. That gets us four FMODE_* bits back and the ability for new static flags that are about file ops to not have to live in FMODE_* space but in their own FOP_* space. It's not the most beautiful thing ever but it gets the job done. Yes, there'll be an additional pointer chase but hopefully that won't matter for these flags. I suspect there's a few more we can move into there and that we can also redirect a bunch of new flag suggestions that follow this pattern into the fop_flags field instead of f_mode. Link: https://lore.kernel.org/r/20240328-gewendet-spargel-aa60a030ef74@brauner Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Jan Kara <jack@suse.cz> Reviewed-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Christian Brauner <brauner@kernel.org> Conflicts: block/bdev.c fs/f2fs/file.c include/linux/fs.h io_uring/rw.c Signed-off-by: Long Li <leo.lilong@huawei.com> --- block/fops.c | 3 ++- drivers/dax/device.c | 2 +- fs/btrfs/file.c | 4 ++-- fs/ext4/file.c | 6 +++--- fs/f2fs/file.c | 3 ++- fs/read_write.c | 2 +- fs/xfs/xfs_file.c | 8 +++++--- include/linux/fs.h | 22 ++++++++++++---------- io_uring/io_uring.c | 2 +- io_uring/rw.c | 9 +++++---- mm/mmap.c | 4 +++- 11 files changed, 37 insertions(+), 28 deletions(-) diff --git a/block/fops.c b/block/fops.c index cda9978ebf67..f307e92d7fa9 100644 --- a/block/fops.c +++ b/block/fops.c @@ -594,7 +594,7 @@ static int blkdev_open(struct inode *inode, struct file *filp) * during an unstable branch. 
*/ filp->f_flags |= O_LARGEFILE; - filp->f_mode |= FMODE_BUF_RASYNC | FMODE_CAN_ODIRECT; + filp->f_mode |= FMODE_CAN_ODIRECT; mode = file_to_blk_mode(filp); handle = bdev_open_by_dev(inode->i_rdev, mode, @@ -852,6 +852,7 @@ const struct file_operations def_blk_fops = { .splice_read = filemap_splice_read, .splice_write = iter_file_splice_write, .fallocate = blkdev_fallocate, + .fop_flags = FOP_BUFFER_RASYNC, }; static __init int blkdev_init(void) diff --git a/drivers/dax/device.c b/drivers/dax/device.c index 01e89b7ac637..cfb122b3fee3 100644 --- a/drivers/dax/device.c +++ b/drivers/dax/device.c @@ -377,7 +377,7 @@ static const struct file_operations dax_fops = { .release = dax_release, .get_unmapped_area = dax_get_unmapped_area, .mmap = dax_mmap, - .mmap_supported_flags = MAP_SYNC, + .fop_flags = FOP_MMAP_SYNC, }; static void dev_dax_cdev_del(void *cdev) diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c index 68092b64e29e..bd6f711b759a 100644 --- a/fs/btrfs/file.c +++ b/fs/btrfs/file.c @@ -3757,8 +3757,7 @@ static int btrfs_file_open(struct inode *inode, struct file *filp) { int ret; - filp->f_mode |= FMODE_NOWAIT | FMODE_BUF_RASYNC | FMODE_BUF_WASYNC | - FMODE_CAN_ODIRECT; + filp->f_mode |= FMODE_NOWAIT | FMODE_CAN_ODIRECT; ret = fsverity_file_open(inode, filp); if (ret) @@ -3888,6 +3887,7 @@ const struct file_operations btrfs_file_operations = { .compat_ioctl = btrfs_compat_ioctl, #endif .remap_file_range = btrfs_remap_file_range, + .fop_flags = FOP_BUFFER_RASYNC | FOP_BUFFER_WASYNC, }; int btrfs_fdatawrite_range(struct inode *inode, loff_t start, loff_t end) diff --git a/fs/ext4/file.c b/fs/ext4/file.c index ca57a5efd2ec..afa60bd4ae63 100644 --- a/fs/ext4/file.c +++ b/fs/ext4/file.c @@ -907,8 +907,7 @@ static int ext4_file_open(struct inode *inode, struct file *filp) return ret; } - filp->f_mode |= FMODE_NOWAIT | FMODE_BUF_RASYNC | - FMODE_DIO_PARALLEL_WRITE; + filp->f_mode |= FMODE_NOWAIT; return dquot_file_open(inode, filp); } @@ -960,7 +959,6 @@ const struct file_operations ext4_file_operations = { .compat_ioctl = ext4_compat_ioctl, #endif .mmap = ext4_file_mmap, - .mmap_supported_flags = MAP_SYNC, .open = ext4_file_open, .release = ext4_release_file, .fsync = ext4_sync_file, @@ -968,6 +966,8 @@ const struct file_operations ext4_file_operations = { .splice_read = ext4_file_splice_read, .splice_write = iter_file_splice_write, .fallocate = ext4_fallocate, + .fop_flags = FOP_MMAP_SYNC | FOP_BUFFER_RASYNC | + FOP_DIO_PARALLEL_WRITE, }; const struct inode_operations ext4_file_inode_operations = { diff --git a/fs/f2fs/file.c b/fs/f2fs/file.c index ae129044c52f..0defd5de6f0b 100644 --- a/fs/f2fs/file.c +++ b/fs/f2fs/file.c @@ -587,7 +587,7 @@ static int f2fs_file_open(struct inode *inode, struct file *filp) if (err) return err; - filp->f_mode |= FMODE_NOWAIT | FMODE_BUF_RASYNC; + filp->f_mode |= FMODE_NOWAIT; filp->f_mode |= FMODE_CAN_ODIRECT; err = dquot_file_open(inode, filp); @@ -5158,4 +5158,5 @@ const struct file_operations f2fs_file_operations = { .splice_read = f2fs_file_splice_read, .splice_write = iter_file_splice_write, .fadvise = f2fs_file_fadvise, + .fop_flags = FOP_BUFFER_RASYNC, }; diff --git a/fs/read_write.c b/fs/read_write.c index 27a7729b82f3..c8c305e16c85 100644 --- a/fs/read_write.c +++ b/fs/read_write.c @@ -1715,7 +1715,7 @@ int generic_write_checks_count(struct kiocb *iocb, loff_t *count) if ((iocb->ki_flags & IOCB_NOWAIT) && !((iocb->ki_flags & IOCB_DIRECT) || - (file->f_mode & FMODE_BUF_WASYNC))) + (file->f_op->fop_flags & FOP_BUFFER_WASYNC))) return -EINVAL; return 
generic_write_check_limits(iocb->ki_filp, iocb->ki_pos, count); diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c index e0cef2cb5c53..02dc03060fa0 100644 --- a/fs/xfs/xfs_file.c +++ b/fs/xfs/xfs_file.c @@ -1233,8 +1233,7 @@ xfs_file_open( { if (xfs_is_shutdown(XFS_M(inode->i_sb))) return -EIO; - file->f_mode |= FMODE_NOWAIT | FMODE_BUF_RASYNC | FMODE_BUF_WASYNC | - FMODE_DIO_PARALLEL_WRITE | FMODE_CAN_ODIRECT; + file->f_mode |= FMODE_NOWAIT | FMODE_CAN_ODIRECT; return generic_file_open(inode, file); } @@ -1494,7 +1493,6 @@ const struct file_operations xfs_file_operations = { .compat_ioctl = xfs_file_compat_ioctl, #endif .mmap = xfs_file_mmap, - .mmap_supported_flags = MAP_SYNC, .open = xfs_file_open, .release = xfs_file_release, .fsync = xfs_file_fsync, @@ -1502,6 +1500,8 @@ const struct file_operations xfs_file_operations = { .fallocate = xfs_file_fallocate, .fadvise = xfs_file_fadvise, .remap_file_range = xfs_file_remap_range, + .fop_flags = FOP_MMAP_SYNC | FOP_BUFFER_RASYNC | FOP_BUFFER_WASYNC | + FOP_DIO_PARALLEL_WRITE, }; const struct file_operations xfs_dir_file_operations = { @@ -1514,4 +1514,6 @@ const struct file_operations xfs_dir_file_operations = { .compat_ioctl = xfs_file_compat_ioctl, #endif .fsync = xfs_dir_fsync, + .fop_flags = FOP_MMAP_SYNC | FOP_BUFFER_RASYNC | FOP_BUFFER_WASYNC | + FOP_DIO_PARALLEL_WRITE, }; diff --git a/include/linux/fs.h b/include/linux/fs.h index b6ccdceb4f94..1647ddb5ea16 100644 --- a/include/linux/fs.h +++ b/include/linux/fs.h @@ -169,9 +169,6 @@ typedef int (dio_iodone_t)(struct kiocb *iocb, loff_t offset, #define FMODE_NOREUSE ((__force fmode_t)0x800000) -/* File supports non-exclusive O_DIRECT writes from multiple threads */ -#define FMODE_DIO_PARALLEL_WRITE ((__force fmode_t)0x1000000) - /* File is embedded in backing_file object */ #define FMODE_BACKING ((__force fmode_t)0x2000000) @@ -187,12 +184,6 @@ typedef int (dio_iodone_t)(struct kiocb *iocb, loff_t offset, /* File does not contribute to nr_files count */ #define FMODE_NOACCOUNT ((__force fmode_t)0x20000000) -/* File supports async buffered reads */ -#define FMODE_BUF_RASYNC ((__force fmode_t)0x40000000) - -/* File supports async nowait buffered writes */ -#define FMODE_BUF_WASYNC ((__force fmode_t)0x80000000) - #ifdef CONFIG_BPF_READAHEAD /* File mode control flag, expect random access pattern */ #define FMODE_CTL_RANDOM ((__force fmode_t)0x1000) @@ -2021,8 +2012,11 @@ struct iov_iter; struct io_uring_cmd; struct offset_ctx; +typedef unsigned int __bitwise fop_flags_t; + struct file_operations { struct module *owner; + fop_flags_t fop_flags; loff_t (*llseek) (struct file *, loff_t, int); ssize_t (*read) (struct file *, char __user *, size_t, loff_t *); ssize_t (*write) (struct file *, const char __user *, size_t, loff_t *); @@ -2035,7 +2029,6 @@ struct file_operations { long (*unlocked_ioctl) (struct file *, unsigned int, unsigned long); long (*compat_ioctl) (struct file *, unsigned int, unsigned long); int (*mmap) (struct file *, struct vm_area_struct *); - unsigned long mmap_supported_flags; int (*open) (struct inode *, struct file *); int (*flush) (struct file *, fl_owner_t id); int (*release) (struct inode *, struct file *); @@ -2074,6 +2067,15 @@ struct file_operations { KABI_RESERVE(7) } __randomize_layout; +/* Supports async buffered reads */ +#define FOP_BUFFER_RASYNC ((__force fop_flags_t)(1 << 0)) +/* Supports async buffered writes */ +#define FOP_BUFFER_WASYNC ((__force fop_flags_t)(1 << 1)) +/* Supports synchronous page faults for mappings */ +#define FOP_MMAP_SYNC ((__force 
fop_flags_t)(1 << 2)) +/* Supports non-exclusive O_DIRECT writes from multiple threads */ +#define FOP_DIO_PARALLEL_WRITE ((__force fop_flags_t)(1 << 3)) + /* Wrap a directory iterator that needs exclusive inode access */ int wrap_directory_iterator(struct file *, struct dir_context *, int (*) (struct file *, struct dir_context *)); diff --git a/io_uring/io_uring.c b/io_uring/io_uring.c index 8b15e9dc340f..0784a853b91e 100644 --- a/io_uring/io_uring.c +++ b/io_uring/io_uring.c @@ -465,7 +465,7 @@ static void io_prep_async_work(struct io_kiocb *req) /* don't serialize this request if the fs doesn't need it */ if (should_hash && (req->file->f_flags & O_DIRECT) && - (req->file->f_mode & FMODE_DIO_PARALLEL_WRITE)) + (req->file->f_op->fop_flags & FOP_DIO_PARALLEL_WRITE)) should_hash = false; if (should_hash || (ctx->flags & IORING_SETUP_IOPOLL)) io_wq_hash_work(&req->work, file_inode(req->file)); diff --git a/io_uring/rw.c b/io_uring/rw.c index 75b001febb4d..79522f2784fd 100644 --- a/io_uring/rw.c +++ b/io_uring/rw.c @@ -629,7 +629,8 @@ static bool io_rw_should_retry(struct io_kiocb *req) * just use poll if we can, and don't attempt if the fs doesn't * support callback based unlocks */ - if (file_can_poll(req->file) || !(req->file->f_mode & FMODE_BUF_RASYNC)) + if (file_can_poll(req->file) || + !(req->file->f_op->fop_flags & FOP_BUFFER_RASYNC)) return false; wait->wait.func = io_async_buf_func; @@ -930,10 +931,10 @@ int io_write(struct io_kiocb *req, unsigned int issue_flags) if (unlikely(!io_file_supports_nowait(req))) goto copy_iov; - /* File path supports NOWAIT for non-direct_IO only for block devices. */ + /* Check if we can support NOWAIT. */ if (!(kiocb->ki_flags & IOCB_DIRECT) && - !(kiocb->ki_filp->f_mode & FMODE_BUF_WASYNC) && - (req->flags & REQ_F_ISREG)) + !(req->file->f_op->fop_flags & FOP_BUFFER_WASYNC) && + (req->flags & REQ_F_ISREG)) goto copy_iov; kiocb->ki_flags |= IOCB_NOWAIT; diff --git a/mm/mmap.c b/mm/mmap.c index fb54df419ea2..64ad0e9c7560 100644 --- a/mm/mmap.c +++ b/mm/mmap.c @@ -1352,7 +1352,9 @@ unsigned long __do_mmap_mm(struct mm_struct *mm, struct file *file, unsigned lon if (!file_mmap_ok(file, inode, pgoff, len)) return -EOVERFLOW; - flags_mask = LEGACY_MAP_MASK | file->f_op->mmap_supported_flags; + flags_mask = LEGACY_MAP_MASK; + if (file->f_op->fop_flags & FOP_MMAP_SYNC) + flags_mask |= MAP_SYNC; switch (flags & MAP_TYPE) { case MAP_SHARED: -- 2.39.2

hulk inclusion category: bugfix bugzilla: https://gitee.com/openeuler/kernel/issues/ICPOEB CVE: NA ---------------------------------------------------------------------- Fix kabi breakage in struct file_operations. Fixes: 210a03c9d51a ("fs: claw back a few FMODE_* bits") Signed-off-by: Long Li <leo.lilong@huawei.com> --- include/linux/fs.h | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/include/linux/fs.h b/include/linux/fs.h index 1647ddb5ea16..2ddaf927c8c1 100644 --- a/include/linux/fs.h +++ b/include/linux/fs.h @@ -2016,7 +2016,6 @@ typedef unsigned int __bitwise fop_flags_t; struct file_operations { struct module *owner; - fop_flags_t fop_flags; loff_t (*llseek) (struct file *, loff_t, int); ssize_t (*read) (struct file *, char __user *, size_t, loff_t *); ssize_t (*write) (struct file *, const char __user *, size_t, loff_t *); @@ -2029,6 +2028,7 @@ struct file_operations { long (*unlocked_ioctl) (struct file *, unsigned int, unsigned long); long (*compat_ioctl) (struct file *, unsigned int, unsigned long); int (*mmap) (struct file *, struct vm_area_struct *); + KABI_REPLACE(unsigned long mmap_supported_flags, fop_flags_t fop_flags) int (*open) (struct inode *, struct file *); int (*flush) (struct file *, fl_owner_t id); int (*release) (struct inode *, struct file *); -- 2.39.2

From: Christoph Hellwig <hch@lst.de> mainline inclusion from mainline-v6.9-rc1 commit f50805713a6e8bd58bb69386c60f5c922b882016 category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/ICPOEB CVE: NA Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i... -------------------------------- Directories have none of these capabilities, so drop the flags. Note that the current state is harmless as no one actually checks for the flags either. Signed-off-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20240423124608.537794-3-hch@lst.de Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Christian Brauner <brauner@kernel.org> Conflicts: fs/xfs/xfs_file.c [Context conflicts] Signed-off-by: Long Li <leo.lilong@huawei.com> --- fs/xfs/xfs_file.c | 2 -- 1 file changed, 2 deletions(-) diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c index 02dc03060fa0..87c82f98929f 100644 --- a/fs/xfs/xfs_file.c +++ b/fs/xfs/xfs_file.c @@ -1514,6 +1514,4 @@ const struct file_operations xfs_dir_file_operations = { .compat_ioctl = xfs_file_compat_ioctl, #endif .fsync = xfs_dir_fsync, - .fop_flags = FOP_MMAP_SYNC | FOP_BUFFER_RASYNC | FOP_BUFFER_WASYNC | - FOP_DIO_PARALLEL_WRITE, }; -- 2.39.2

From: Jens Axboe <axboe@kernel.dk> mainline inclusion from mainline-v6.10-rc2 commit 9ad6344568cc31ede9741795b3e3c41c21e3156f category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/ICPOEB CVE: NA Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i... -------------------------------- Patch series "Uncached buffered IO", v8. 5 years ago I posted patches adding support for RWF_UNCACHED, as a way to do buffered IO that isn't page cache persistent. The approach back then was to have private pages for IO, and then get rid of them once IO was done. But that then runs into all the issues that O_DIRECT has, in terms of synchronizing with the page cache. So here's a new approach to the same concept, but using the page cache as synchronization. Due to excessive bike shedding on the naming, this is now named RWF_DONTCACHE, and is less special in that it's just page cache IO, except it prunes the ranges once IO is completed. Why do this, you may ask? The tldr is that device speeds are only getting faster, while reclaim is not. Doing normal buffered IO can be very unpredictable, and suck up a lot of resources on the reclaim side. This leads people to use O_DIRECT as a work-around, which has its own set of restrictions in terms of size, offset, and length of IO. It's also inherently synchronous, and now you need async IO as well. While the latter isn't necessarily a big problem as we have good options available there, it also should not be a requirement when all you want to do is read or write some data without caching. Even on desktop type systems, a normal NVMe device can fill the entire page cache in seconds. On the big system I used for testing, there's a lot more RAM, but also a lot more devices. As can be seen in some of the results in the following patches, you can still fill RAM in seconds even when there's 1TB of it. Hence this problem isn't solely a "big hyperscaler system" issue, it's common across the board. Common for both reads and writes with RWF_DONTCACHE is that they use the page cache for IO. Reads work just like a normal buffered read would, with the only exception being that the touched ranges will get pruned after data has been copied. For writes, the ranges will get writeback kicked off before the syscall returns, and then writeback completion will prune the range. Hence writes aren't synchronous, and it's easy to pipeline writes using RWF_DONTCACHE. Folios that aren't instantiated by RWF_DONTCACHE IO are left untouched. This means that uncached IO will take advantage of the page cache for uptodate data, but not leave anything it instantiated/created in cache. File systems need to support this. This patchset adds support for the generic read path, which covers file systems like ext4. Patches exist to add support for iomap/XFS and btrfs as well, which sit on top of this series. If RWF_DONTCACHE IO is attempted on a file system that doesn't support it, -EOPNOTSUPP is returned. Hence the user can rely on it either working as designed, or flagging an error if that's not the case. The intent here is to give the application a sensible fallback path - eg, it may fall back to O_DIRECT if appropriate, or just live with the fact that uncached IO isn't available and do normal buffered IO.
Adding "support" to other file systems should be trivial, most of the time just a one-liner adding FOP_DONTCACHE to the fop_flags in the file_operations struct, if the file system is using either iomap or the generic filemap helpers for reading and writing. Performance results are in patch 8 for reads, and you can find the write side results in the XFS patch adding support for DONTCACHE writes for XFS: https://git.kernel.dk/cgit/linux/commit/?h=buffered-uncached-fs.10&id=257e92... with the tldr being that I see about a 65% improvement in performance for both, with fully predictable IO times. CPU reduction is substantial as well, with no kswapd activity at all for reclaim when using uncached IO. Using it from applications is trivial - just set RWF_DONTCACHE for the read or write, using pwritev2(2) or preadv2(2). For io_uring, same thing, just set RWF_DONTCACHE in sqe->rw_flags for a buffered read/write operation. And that's it. Patches 1..7 are just prep patches, and should have no functional changes at all. Patch 8 adds support for the filemap path for RWF_DONTCACHE reads, and patches 9..12 are just prep patches for supporting the write side of uncached writes. In the below mentioned branch, there are then patches to adopt uncached reads and writes for xfs, btrfs, and ext4. The latter currently relies on bit of a hack for passing whether this is an uncached write or not through ->write_begin(), which can hopefully go away once ext4 adopts iomap for buffered writes. I say this is a hack as it's not the prettiest way to do it, however it is fully solid and will work just fine. Passes full xfstests and fsx overnight runs, no issues observed. That includes the vm running the testing also using RWF_DONTCACHE on the host. I'll post fsstress and fsx patches for RWF_DONTCACHE separately. As far as I'm concerned, no further work needs doing here. And git tree for the patches is here: https://git.kernel.dk/cgit/linux/log/?h=buffered-uncached.10 with the file system patches on top adding support for xfs/btrfs/ext4 here: https://git.kernel.dk/cgit/linux/log/?h=buffered-uncached-fs.10 This patch (of 12): Rather than pass in both the file and position directly from the kiocb, just take a struct kiocb instead. With the kiocb being passed in, skip passing in the address_space separately as well. While doing so, move the ki_flags checking into filemap_create_folio() as well. In preparation for actually needing the kiocb in the function. No functional changes in this patch. Link: https://lkml.kernel.org/r/20241220154831.1086649-1-axboe@kernel.dk Link: https://lkml.kernel.org/r/20241220154831.1086649-2-axboe@kernel.dk Signed-off-by: Jens Axboe <axboe@kernel.dk> Reviewed-by: Kirill A. 
Shutemov <kirill.shutemov@linux.intel.com> Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Brian Foster <bfoster@redhat.com> Cc: Chris Mason <clm@meta.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Christoph Hellwig <hch@lst.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Conflicts: mm/filemap.c [Context conflicts] Signed-off-by: Long Li <leo.lilong@huawei.com> --- mm/filemap.c | 17 +++++++++-------- 1 file changed, 9 insertions(+), 8 deletions(-) diff --git a/mm/filemap.c b/mm/filemap.c index e6f3bf18444e..1103d942835b 100644 --- a/mm/filemap.c +++ b/mm/filemap.c @@ -2489,15 +2489,17 @@ static int filemap_update_page(struct kiocb *iocb, return error; } -static int filemap_create_folio(struct file *file, - struct address_space *mapping, loff_t pos, - struct folio_batch *fbatch) +static int filemap_create_folio(struct kiocb *iocb, struct folio_batch *fbatch) { + struct address_space *mapping = iocb->ki_filp->f_mapping; struct folio *folio; int error; unsigned int min_order = mapping_min_folio_order(mapping); pgoff_t index; + if (iocb->ki_flags & (IOCB_NOWAIT | IOCB_WAITQ)) + return -EAGAIN; + folio = filemap_alloc_folio(mapping_gfp_mask(mapping), min_order); if (!folio) return -ENOMEM; @@ -2516,7 +2518,7 @@ static int filemap_create_folio(struct file *file, * well to keep locking rules simple. */ filemap_invalidate_lock_shared(mapping); - index = (pos >> (PAGE_SHIFT + min_order)) << min_order; + index = (iocb->ki_pos >> (PAGE_SHIFT + min_order)) << min_order; error = filemap_add_folio(mapping, folio, index, mapping_gfp_constraint(mapping, GFP_KERNEL)); if (error == -EEXIST) @@ -2524,7 +2526,8 @@ static int filemap_create_folio(struct file *file, if (error) goto error; - error = filemap_read_folio(file, mapping->a_ops->read_folio, folio); + error = filemap_read_folio(iocb->ki_filp, mapping->a_ops->read_folio, + folio); if (error) goto error; @@ -2575,9 +2578,7 @@ static int filemap_get_pages(struct kiocb *iocb, size_t count, filemap_get_read_batch(mapping, index, last_index - 1, fbatch); } if (!folio_batch_count(fbatch)) { - if (iocb->ki_flags & (IOCB_NOWAIT | IOCB_WAITQ)) - return -EAGAIN; - err = filemap_create_folio(filp, mapping, iocb->ki_pos, fbatch); + err = filemap_create_folio(iocb, fbatch); if (err == AOP_TRUNCATED_PAGE) goto retry; return err; -- 2.39.2
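As the cover letter above notes, applications opt in per IO by passing RWF_DONTCACHE to preadv2(2) or pwritev2(2). Below is a minimal userspace sketch of an uncached read; it is illustrative only and not part of the series. It assumes a glibc that provides preadv2() and defines RWF_DONTCACHE locally (with the value the uapi patch later in this series adds) in case the installed headers predate it.

#define _GNU_SOURCE
#include <sys/uio.h>
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

#ifndef RWF_DONTCACHE
#define RWF_DONTCACHE	0x00000080	/* value added by the uapi patch later in this series */
#endif

int main(int argc, char **argv)
{
	char buf[65536];
	struct iovec iov = { .iov_base = buf, .iov_len = sizeof(buf) };
	ssize_t ret;
	int fd;

	if (argc < 2)
		return 1;
	fd = open(argv[1], O_RDONLY);
	if (fd < 0) {
		perror("open");
		return 1;
	}

	/* ordinary buffered read, but folios it instantiates get pruned */
	ret = preadv2(fd, &iov, 1, 0, RWF_DONTCACHE);
	if (ret < 0)
		perror("preadv2");	/* EOPNOTSUPP if the fs lacks FOP_DONTCACHE */
	else
		printf("read %zd bytes uncached\n", ret);

	close(fd);
	return 0;
}

On a file system that has not set FOP_DONTCACHE the call fails with EOPNOTSUPP, so the application can fall back to plain buffered IO or O_DIRECT, as the cover letter suggests.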

From: Jens Axboe <axboe@kernel.dk> mainline inclusion from mainline-v6.10-rc2 commit f598cdaafc370a797ae883d370a7c18c1ffc43ef category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/ICPOEB CVE: NA Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i... -------------------------------- Rather than use the page_cache_sync_readahead() helper, define our own ractl and use page_cache_sync_ra() directly. In preparation for needing to modify ractl inside filemap_get_pages(). No functional changes in this patch. Link: https://lkml.kernel.org/r/20241220154831.1086649-3-axboe@kernel.dk Signed-off-by: Jens Axboe <axboe@kernel.dk> Reviewed-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Brian Foster <bfoster@redhat.com> Cc: Chris Mason <clm@meta.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Conflicts: mm/filemap.c [Context conflicts] Signed-off-by: Long Li <leo.lilong@huawei.com> --- mm/filemap.c | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) diff --git a/mm/filemap.c b/mm/filemap.c index 1103d942835b..84d24d944040 100644 --- a/mm/filemap.c +++ b/mm/filemap.c @@ -2557,7 +2557,6 @@ static int filemap_get_pages(struct kiocb *iocb, size_t count, { struct file *filp = iocb->ki_filp; struct address_space *mapping = filp->f_mapping; - struct file_ra_state *ra = &filp->f_ra; pgoff_t index = iocb->ki_pos >> PAGE_SHIFT; pgoff_t last_index; struct folio *folio; @@ -2571,10 +2570,11 @@ static int filemap_get_pages(struct kiocb *iocb, size_t count, filemap_get_read_batch(mapping, index, last_index - 1, fbatch); if (!folio_batch_count(fbatch)) { + DEFINE_READAHEAD(ractl, filp, &filp->f_ra, mapping, index); + if (iocb->ki_flags & IOCB_NOIO) return -EAGAIN; - page_cache_sync_readahead(mapping, ra, filp, index, - last_index - index); + page_cache_sync_ra(&ractl, last_index - index); filemap_get_read_batch(mapping, index, last_index - 1, fbatch); } if (!folio_batch_count(fbatch)) { -- 2.39.2

From: Jens Axboe <axboe@kernel.dk> mainline inclusion from mainline-v6.10-rc2 commit 1963de79d3a3bc12b7a17a922d508b733ca8fa9e category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/ICPOEB CVE: NA Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i... -------------------------------- Just a wrapper around filemap_alloc_folio() for now, but add it in preparation for modifying the folio based on the 'ractl' being passed in. No functional changes in this patch. Link: https://lkml.kernel.org/r/20241220154831.1086649-4-axboe@kernel.dk Signed-off-by: Jens Axboe <axboe@kernel.dk> Reviewed-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Brian Foster <bfoster@redhat.com> Cc: Chris Mason <clm@meta.com> Cc: Christoph Hellwig <hch@lst.de> Cc: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Long Li <leo.lilong@huawei.com> --- mm/readahead.c | 16 +++++++++++----- 1 file changed, 11 insertions(+), 5 deletions(-) diff --git a/mm/readahead.c b/mm/readahead.c index 3395eaf4b4c1..84120f293219 100644 --- a/mm/readahead.c +++ b/mm/readahead.c @@ -184,6 +184,12 @@ static void read_pages(struct readahead_control *rac) BUG_ON(readahead_count(rac)); } +static struct folio *ractl_alloc_folio(struct readahead_control *ractl, + gfp_t gfp_mask, unsigned int order) +{ + return filemap_alloc_folio(gfp_mask, order); +} + /** * page_cache_ra_unbounded - Start unchecked readahead. * @ractl: Readahead control. @@ -261,8 +267,8 @@ void page_cache_ra_unbounded(struct readahead_control *ractl, continue; } - folio = filemap_alloc_folio(gfp_mask, - mapping_min_folio_order(mapping)); + folio = ractl_alloc_folio(ractl, gfp_mask, + mapping_min_folio_order(mapping)); if (!folio) break; @@ -487,7 +493,7 @@ static inline int ra_alloc_folio(struct readahead_control *ractl, pgoff_t index, pgoff_t mark, unsigned int order, gfp_t gfp) { int err; - struct folio *folio = filemap_alloc_folio(gfp, order); + struct folio *folio = ractl_alloc_folio(ractl, gfp, order); if (!folio) return -ENOMEM; @@ -893,7 +899,7 @@ void readahead_expand(struct readahead_control *ractl, if (folio && !xa_is_value(folio)) return; /* Folio apparently present */ - folio = filemap_alloc_folio(gfp_mask, min_order); + folio = ractl_alloc_folio(ractl, gfp_mask, min_order); if (!folio) return; @@ -923,7 +929,7 @@ void readahead_expand(struct readahead_control *ractl, if (folio && !xa_is_value(folio)) return; /* Folio apparently present */ - folio = filemap_alloc_folio(gfp_mask, min_order); + folio = ractl_alloc_folio(ractl, gfp_mask, min_order); if (!folio) return; -- 2.39.2

From: Jens Axboe <axboe@kernel.dk> mainline inclusion from mainline-v6.10-rc2 commit cceba6f7e46c48deca433030d80fc34599fb9fd8 category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/ICPOEB CVE: NA Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i... -------------------------------- Add a folio flag that file IO can use to indicate that the cached IO being done should be dropped from the page cache upon completion. Link: https://lkml.kernel.org/r/20241220154831.1086649-5-axboe@kernel.dk Signed-off-by: Jens Axboe <axboe@kernel.dk> Reviewed-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Brian Foster <bfoster@redhat.com> Cc: Chris Mason <clm@meta.com> Cc: Christoph Hellwig <hch@lst.de> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Conflicts: include/linux/page-flags.h [Not merge dfbac6dc68ba ("mm: separate out FOLIO_FLAGS from PAGEFLAGS")] Signed-off-by: Long Li <leo.lilong@huawei.com> --- include/linux/page-flags.h | 5 +++++ include/trace/events/mmflags.h | 3 ++- 2 files changed, 7 insertions(+), 1 deletion(-) diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h index 7a67d997eece..141df2d8ad70 100644 --- a/include/linux/page-flags.h +++ b/include/linux/page-flags.h @@ -119,6 +119,7 @@ enum pageflags { PG_reclaim, /* To be reclaimed asap */ PG_swapbacked, /* Page is backed by RAM/swap */ PG_unevictable, /* Page is "unevictable" */ + PG_dropbehind, /* drop pages on IO completion */ #ifdef CONFIG_MMU PG_mlocked, /* Page is vma mlocked */ #endif @@ -543,6 +544,10 @@ PAGEFLAG(Reclaim, reclaim, PF_NO_TAIL) PAGEFLAG(Readahead, readahead, PF_NO_COMPOUND) TESTCLEARFLAG(Readahead, readahead, PF_NO_COMPOUND) +PAGEFLAG(Dropbehind, dropbehind, PF_HEAD) + TESTCLEARFLAG(Dropbehind, dropbehind, PF_HEAD) + __SETPAGEFLAG(Dropbehind, dropbehind, PF_HEAD) + #ifdef CONFIG_HIGHMEM /* * Must use a macro here due to header dependency issues. page_zone() is not diff --git a/include/trace/events/mmflags.h b/include/trace/events/mmflags.h index 6104fa2b6e47..7b8b7577b6bc 100644 --- a/include/trace/events/mmflags.h +++ b/include/trace/events/mmflags.h @@ -125,7 +125,8 @@ DEF_PAGEFLAG_NAME(mappedtodisk), \ DEF_PAGEFLAG_NAME(reclaim), \ DEF_PAGEFLAG_NAME(swapbacked), \ - DEF_PAGEFLAG_NAME(unevictable) \ + DEF_PAGEFLAG_NAME(unevictable), \ + DEF_PAGEFLAG_NAME(dropbehind) \ IF_HAVE_PG_MLOCK(mlocked) \ IF_HAVE_PG_UNCACHED(uncached) \ IF_HAVE_PG_HWPOISON(hwpoison) \ -- 2.39.2

From: Jens Axboe <axboe@kernel.dk> mainline inclusion from mainline-v6.10-rc2 commit 77d075221ae777296e2b18a0a4f5fea6f75daf2c category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/ICPOEB CVE: NA Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i... -------------------------------- If ractl->dropbehind is set to true, then folios created are marked as dropbehind as well. Link: https://lkml.kernel.org/r/20241220154831.1086649-6-axboe@kernel.dk Signed-off-by: Jens Axboe <axboe@kernel.dk> Reviewed-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Brian Foster <bfoster@redhat.com> Cc: Chris Mason <clm@meta.com> Cc: Christoph Hellwig <hch@lst.de> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Long Li <leo.lilong@huawei.com> --- include/linux/pagemap.h | 1 + mm/readahead.c | 8 +++++++- 2 files changed, 8 insertions(+), 1 deletion(-) diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h index 250fe7b78e1a..f3b0a19ae3af 100644 --- a/include/linux/pagemap.h +++ b/include/linux/pagemap.h @@ -1382,6 +1382,7 @@ struct readahead_control { pgoff_t _index; unsigned int _nr_pages; unsigned int _batch_count; + bool dropbehind; bool _workingset; unsigned long _pflags; }; diff --git a/mm/readahead.c b/mm/readahead.c index 84120f293219..a36819397653 100644 --- a/mm/readahead.c +++ b/mm/readahead.c @@ -187,7 +187,13 @@ static void read_pages(struct readahead_control *rac) static struct folio *ractl_alloc_folio(struct readahead_control *ractl, gfp_t gfp_mask, unsigned int order) { - return filemap_alloc_folio(gfp_mask, order); + struct folio *folio; + + folio = filemap_alloc_folio(gfp_mask, order); + if (folio && ractl->dropbehind) + __folio_set_dropbehind(folio); + + return folio; } /** -- 2.39.2

hulk inclusion category: bugfix bugzilla: https://gitee.com/openeuler/kernel/issues/ICPOEB CVE: NA ---------------------------------------------------------------------- Fix kabi breakage in struct readahead_control. Fixes: 77d075221ae7 ("mm/readahead: add readahead_control->dropbehind member") Signed-off-by: Long Li <leo.lilong@huawei.com> --- include/linux/pagemap.h | 3 +-- 1 file changed, 1 insertion(+), 2 deletions(-) diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h index f3b0a19ae3af..c4b363e9d9f4 100644 --- a/include/linux/pagemap.h +++ b/include/linux/pagemap.h @@ -1375,14 +1375,13 @@ struct readahead_control { struct file *file; struct address_space *mapping; struct file_ra_state *ra; - KABI_RESERVE(1) + KABI_USE(1, bool dropbehind) KABI_RESERVE(2) KABI_RESERVE(3) /* private: use the readahead_* accessors instead */ pgoff_t _index; unsigned int _nr_pages; unsigned int _batch_count; - bool dropbehind; bool _workingset; unsigned long _pflags; }; -- 2.39.2

From: Jens Axboe <axboe@kernel.dk> mainline inclusion from mainline-v6.10-rc2 commit 4a9e23159fd37677efc0c2c53e3b45a5d260a90a category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/ICPOEB CVE: NA Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i... -------------------------------- Add a folio_unmap_invalidate() helper, which unmaps and invalidates a given folio. The caller must already have locked the folio. Embed the old invalidate_complete_folio2() helper in there as well, as nobody else calls it. Use this new helper in invalidate_inode_pages2_range(), rather than duplicate the code there. In preparation for using this elsewhere as well, have it take a gfp_t mask rather than assume GFP_KERNEL is the right choice. This bubbles back to invalidate_complete_folio2() as well. Link: https://lkml.kernel.org/r/20241220154831.1086649-7-axboe@kernel.dk Signed-off-by: Jens Axboe <axboe@kernel.dk> Cc: Brian Foster <bfoster@redhat.com> Cc: Chris Mason <clm@meta.com> Cc: Christoph Hellwig <hch@lst.de> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Conflicts: mm/truncate.c [Context conflicts] Signed-off-by: Long Li <leo.lilong@huawei.com> --- mm/internal.h | 2 ++ mm/truncate.c | 53 +++++++++++++++++++++++++++------------------------ 2 files changed, 30 insertions(+), 25 deletions(-) diff --git a/mm/internal.h b/mm/internal.h index ba2346a41447..0726f6e835b1 100644 --- a/mm/internal.h +++ b/mm/internal.h @@ -342,6 +342,8 @@ void unmap_page_range(struct mmu_gather *tlb, struct vm_area_struct *vma, unsigned long addr, unsigned long end, struct zap_details *details); +int folio_unmap_invalidate(struct address_space *mapping, struct folio *folio, + gfp_t gfp); void zap_page_range_single_batched(struct mmu_gather *tlb, struct vm_area_struct *vma, unsigned long addr, unsigned long size, struct zap_details *details); diff --git a/mm/truncate.c b/mm/truncate.c index 1557a0503f8e..559285b8103d 100644 --- a/mm/truncate.c +++ b/mm/truncate.c @@ -551,6 +551,15 @@ unsigned long invalidate_mapping_pages(struct address_space *mapping, } EXPORT_SYMBOL(invalidate_mapping_pages); +static int folio_launder(struct address_space *mapping, struct folio *folio) +{ + if (!folio_test_dirty(folio)) + return 0; + if (folio->mapping != mapping || mapping->a_ops->launder_folio == NULL) + return 0; + return mapping->a_ops->launder_folio(folio); +} + /* * This is like mapping_evict_folio(), except it ignores the folio's * refcount. We do this because invalidate_inode_pages2() needs stronger @@ -558,14 +567,26 @@ EXPORT_SYMBOL(invalidate_mapping_pages); * shrink_page_list() has a temp ref on them, or because they're transiently * sitting in the folio_add_lru() caches. 
*/ -static int invalidate_complete_folio2(struct address_space *mapping, - struct folio *folio) +int folio_unmap_invalidate(struct address_space *mapping, struct folio *folio, + gfp_t gfp) { - if (folio->mapping != mapping) - return 0; + int ret; + + VM_BUG_ON_FOLIO(!folio_test_locked(folio), folio); - if (!filemap_release_folio(folio, GFP_KERNEL)) + if (folio_test_dirty(folio)) return 0; + if (folio_mapped(folio)) + unmap_mapping_folio(folio); + BUG_ON(folio_mapped(folio)); + + ret = folio_launder(mapping, folio); + if (ret) + return ret; + if (folio->mapping != mapping) + return -EBUSY; + if (!filemap_release_folio(folio, gfp)) + return -EBUSY; spin_lock(&mapping->host->i_lock); xa_lock_irq(&mapping->i_pages); @@ -584,16 +605,7 @@ static int invalidate_complete_folio2(struct address_space *mapping, failed: xa_unlock_irq(&mapping->i_pages); spin_unlock(&mapping->host->i_lock); - return 0; -} - -static int folio_launder(struct address_space *mapping, struct folio *folio) -{ - if (!folio_test_dirty(folio)) - return 0; - if (folio->mapping != mapping || mapping->a_ops->launder_folio == NULL) - return 0; - return mapping->a_ops->launder_folio(folio); + return -EBUSY; } /** @@ -653,16 +665,7 @@ int invalidate_inode_pages2_range(struct address_space *mapping, } VM_BUG_ON_FOLIO(!folio_contains(folio, indices[i]), folio); folio_wait_writeback(folio); - - if (folio_mapped(folio)) - unmap_mapping_folio(folio); - BUG_ON(folio_mapped(folio)); - - ret2 = folio_launder(mapping, folio); - if (ret2 == 0) { - if (!invalidate_complete_folio2(mapping, folio)) - ret2 = -EBUSY; - } + ret2 = folio_unmap_invalidate(mapping, folio, GFP_KERNEL); if (ret2 < 0) ret = ret2; folio_unlock(folio); -- 2.39.2

From: Jens Axboe <axboe@kernel.dk> mainline inclusion from mainline-v6.10-rc2 commit b9f958d4f146bd11be33a5f2bc3ced50f86d6b23 category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/ICPOEB CVE: NA Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i... -------------------------------- If a file system supports uncached buffered IO, it may set FOP_DONTCACHE and enable support for RWF_DONTCACHE. If RWF_DONTCACHE is attempted without the file system supporting it, it'll get errored with -EOPNOTSUPP. Link: https://lkml.kernel.org/r/20241220154831.1086649-8-axboe@kernel.dk Signed-off-by: Jens Axboe <axboe@kernel.dk> Cc: Brian Foster <bfoster@redhat.com> Cc: Chris Mason <clm@meta.com> Cc: Christoph Hellwig <hch@lst.de> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Conflicts: include/linux/fs.h include/uapi/linux/fs.h [Context conflicts] Signed-off-by: Long Li <leo.lilong@huawei.com> --- include/linux/fs.h | 12 ++++++++++++ include/uapi/linux/fs.h | 5 ++++- 2 files changed, 16 insertions(+), 1 deletion(-) diff --git a/include/linux/fs.h b/include/linux/fs.h index 2ddaf927c8c1..868f583663c1 100644 --- a/include/linux/fs.h +++ b/include/linux/fs.h @@ -338,6 +338,7 @@ enum rw_hint { #define IOCB_SYNC (__force int) RWF_SYNC #define IOCB_NOWAIT (__force int) RWF_NOWAIT #define IOCB_APPEND (__force int) RWF_APPEND +#define IOCB_DONTCACHE (__force int) RWF_DONTCACHE /* non-RWF related bits - start at 16 */ #define IOCB_EVENTFD (1 << 16) @@ -372,6 +373,7 @@ enum rw_hint { { IOCB_SYNC, "SYNC" }, \ { IOCB_NOWAIT, "NOWAIT" }, \ { IOCB_APPEND, "APPEND" }, \ + { IOCB_DONTCACHE, "DONTCACHE" }, \ { IOCB_EVENTFD, "EVENTFD"}, \ { IOCB_DIRECT, "DIRECT" }, \ { IOCB_WRITE, "WRITE" }, \ @@ -2075,6 +2077,8 @@ struct file_operations { #define FOP_MMAP_SYNC ((__force fop_flags_t)(1 << 2)) /* Supports non-exclusive O_DIRECT writes from multiple threads */ #define FOP_DIO_PARALLEL_WRITE ((__force fop_flags_t)(1 << 3)) +/* File system supports uncached read/write buffered IO */ +#define FOP_DONTCACHE ((__force fop_flags_t)(1 << 7)) /* Wrap a directory iterator that needs exclusive inode access */ int wrap_directory_iterator(struct file *, struct dir_context *, @@ -3426,6 +3430,14 @@ static inline int kiocb_set_rw_flags(struct kiocb *ki, rwf_t flags) return -EOPNOTSUPP; kiocb_flags |= IOCB_NOIO; } + if (flags & RWF_DONTCACHE) { + /* file system must support it */ + if (!(ki->ki_filp->f_op->fop_flags & FOP_DONTCACHE)) + return -EOPNOTSUPP; + /* DAX mappings not supported */ + if (IS_DAX(ki->ki_filp->f_mapping->host)) + return -EOPNOTSUPP; + } kiocb_flags |= (__force int) (flags & RWF_SUPPORTED); if (flags & RWF_SYNC) kiocb_flags |= IOCB_DSYNC; diff --git a/include/uapi/linux/fs.h b/include/uapi/linux/fs.h index b7b56871029c..d44cc11ced72 100644 --- a/include/uapi/linux/fs.h +++ b/include/uapi/linux/fs.h @@ -301,8 +301,11 @@ typedef int __bitwise __kernel_rwf_t; /* per-IO O_APPEND */ #define RWF_APPEND ((__force __kernel_rwf_t)0x00000010) +/* buffered IO that drops the cache after reading or writing data */ +#define RWF_DONTCACHE ((__force __kernel_rwf_t)0x00000080) + /* mask of flags supported by the kernel */ #define RWF_SUPPORTED (RWF_HIPRI | RWF_DSYNC | RWF_SYNC | RWF_NOWAIT |\ - RWF_APPEND) + RWF_APPEND | RWF_DONTCACHE) #endif /* _UAPI_LINUX_FS_H */ -- 2.39.2
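With FOP_DONTCACHE defined above and kiocb_set_rw_flags() rejecting RWF_DONTCACHE on file systems that have not opted in, enabling support is usually the one-liner the cover letter describes. The sketch below is illustrative only: "examplefs" and its reliance on the generic filemap helpers are assumptions for the example, not code from this series.

static const struct file_operations examplefs_file_operations = {
	.llseek		= generic_file_llseek,
	.read_iter	= generic_file_read_iter,
	.write_iter	= generic_file_write_iter,
	.mmap		= generic_file_mmap,
	.open		= generic_file_open,
	.splice_read	= filemap_splice_read,
	.fsync		= noop_fsync,
	/* the opt-in: allow RWF_DONTCACHE reads/writes on this fs */
	.fop_flags	= FOP_DONTCACHE,
};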

From: Jens Axboe <axboe@kernel.dk> mainline inclusion from mainline-v6.10-rc2 commit 8026e49bff9b151609da4cae20e9da7f1833dde6 category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/ICPOEB CVE: NA Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i... -------------------------------- Add RWF_DONTCACHE as a read operation flag, which means that any data read wil be removed from the page cache upon completion. Uses the page cache to synchronize, and simply prunes folios that were instantiated when the operation completes. While it would be possible to use private pages for this, using the page cache as synchronization is handy for a variety of reasons: 1) No special truncate magic is needed 2) Async buffered reads need some place to serialize, using the page cache is a lot easier than writing extra code for this 3) The pruning cost is pretty reasonable and the code to support this is much simpler as a result. You can think of uncached buffered IO as being the much more attractive cousin of O_DIRECT - it has none of the restrictions of O_DIRECT. Yes, it will copy the data, but unlike regular buffered IO, it doesn't run into the unpredictability of the page cache in terms of reclaim. As an example, on a test box with 32 drives, reading them with buffered IO looks as follows: Reading bs 65536, uncached 0 1s: 145945MB/sec 2s: 158067MB/sec 3s: 157007MB/sec 4s: 148622MB/sec 5s: 118824MB/sec 6s: 70494MB/sec 7s: 41754MB/sec 8s: 90811MB/sec 9s: 92204MB/sec 10s: 95178MB/sec 11s: 95488MB/sec 12s: 95552MB/sec 13s: 96275MB/sec where it's quite easy to see where the page cache filled up, and performance went from good to erratic, and finally settles at a much lower rate. Looking at top while this is ongoing, we see: PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND 7535 root 20 0 267004 0 0 S 3199 0.0 8:40.65 uncached 3326 root 20 0 0 0 0 R 100.0 0.0 0:16.40 kswapd4 3327 root 20 0 0 0 0 R 100.0 0.0 0:17.22 kswapd5 3328 root 20 0 0 0 0 R 100.0 0.0 0:13.29 kswapd6 3332 root 20 0 0 0 0 R 100.0 0.0 0:11.11 kswapd10 3339 root 20 0 0 0 0 R 100.0 0.0 0:16.25 kswapd17 3348 root 20 0 0 0 0 R 100.0 0.0 0:16.40 kswapd26 3343 root 20 0 0 0 0 R 100.0 0.0 0:16.30 kswapd21 3344 root 20 0 0 0 0 R 100.0 0.0 0:11.92 kswapd22 3349 root 20 0 0 0 0 R 100.0 0.0 0:16.28 kswapd27 3352 root 20 0 0 0 0 R 99.7 0.0 0:11.89 kswapd30 3353 root 20 0 0 0 0 R 96.7 0.0 0:16.04 kswapd31 3329 root 20 0 0 0 0 R 96.4 0.0 0:11.41 kswapd7 3345 root 20 0 0 0 0 R 96.4 0.0 0:13.40 kswapd23 3330 root 20 0 0 0 0 S 91.1 0.0 0:08.28 kswapd8 3350 root 20 0 0 0 0 S 86.8 0.0 0:11.13 kswapd28 3325 root 20 0 0 0 0 S 76.3 0.0 0:07.43 kswapd3 3341 root 20 0 0 0 0 S 74.7 0.0 0:08.85 kswapd19 3334 root 20 0 0 0 0 S 71.7 0.0 0:10.04 kswapd12 3351 root 20 0 0 0 0 R 60.5 0.0 0:09.59 kswapd29 3323 root 20 0 0 0 0 R 57.6 0.0 0:11.50 kswapd1 [...] which is just showing a partial list of the 32 kswapd threads that are running mostly full tilt, burning ~28 full CPU cores. If the same test case is run with RWF_DONTCACHE set for the buffered read, the output looks as follows: Reading bs 65536, uncached 0 1s: 153144MB/sec 2s: 156760MB/sec 3s: 158110MB/sec 4s: 158009MB/sec 5s: 158043MB/sec 6s: 157638MB/sec 7s: 157999MB/sec 8s: 158024MB/sec 9s: 157764MB/sec 10s: 157477MB/sec 11s: 157417MB/sec 12s: 157455MB/sec 13s: 157233MB/sec 14s: 156692MB/sec which is just chugging along at ~155GB/sec of read performance. 
Looking at top, we see: PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND 7961 root 20 0 267004 0 0 S 3180 0.0 5:37.95 uncached 8024 axboe 20 0 14292 4096 0 R 1.0 0.0 0:00.13 top where just the test app is using CPU, no reclaim is taking place outside of the main thread. Not only is performance 65% better, it's also using half the CPU to do it. Link: https://lkml.kernel.org/r/20241220154831.1086649-9-axboe@kernel.dk Signed-off-by: Jens Axboe <axboe@kernel.dk> Cc: Brian Foster <bfoster@redhat.com> Cc: Chris Mason <clm@meta.com> Cc: Christoph Hellwig <hch@lst.de> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Conflicts: mm/filemap.c mm/swap.c [Context conflicts] Signed-off-by: Long Li <leo.lilong@huawei.com> --- mm/filemap.c | 28 ++++++++++++++++++++++++++-- mm/swap.c | 2 ++ 2 files changed, 28 insertions(+), 2 deletions(-) diff --git a/mm/filemap.c b/mm/filemap.c index 84d24d944040..b729176c7abe 100644 --- a/mm/filemap.c +++ b/mm/filemap.c @@ -2503,6 +2503,8 @@ static int filemap_create_folio(struct kiocb *iocb, struct folio_batch *fbatch) folio = filemap_alloc_folio(mapping_gfp_mask(mapping), min_order); if (!folio) return -ENOMEM; + if (iocb->ki_flags & IOCB_DONTCACHE) + __folio_set_dropbehind(folio); /* * Protect against truncate / hole punch. Grabbing invalidate_lock @@ -2548,6 +2550,8 @@ static int filemap_readahead(struct kiocb *iocb, struct file *file, if (iocb->ki_flags & IOCB_NOIO) return -EAGAIN; + if (iocb->ki_flags & IOCB_DONTCACHE) + ractl.dropbehind = 1; page_cache_async_ra(&ractl, folio, last_index - folio->index); return 0; } @@ -2574,6 +2578,8 @@ static int filemap_get_pages(struct kiocb *iocb, size_t count, if (iocb->ki_flags & IOCB_NOIO) return -EAGAIN; + if (iocb->ki_flags & IOCB_DONTCACHE) + ractl.dropbehind = 1; page_cache_sync_ra(&ractl, last_index - index); filemap_get_read_batch(mapping, index, last_index - 1, fbatch); } @@ -2618,6 +2624,20 @@ static inline bool pos_same_folio(loff_t pos1, loff_t pos2, struct folio *folio) return (pos1 >> shift == pos2 >> shift); } +static void filemap_end_dropbehind_read(struct address_space *mapping, + struct folio *folio) +{ + if (!folio_test_dropbehind(folio)) + return; + if (folio_test_writeback(folio) || folio_test_dirty(folio)) + return; + if (folio_trylock(folio)) { + if (folio_test_clear_dropbehind(folio)) + folio_unmap_invalidate(mapping, folio, 0); + folio_unlock(folio); + } +} + /** * filemap_read - Read data from the page cache. * @iocb: The iocb to read. @@ -2738,8 +2758,12 @@ ssize_t filemap_read(struct kiocb *iocb, struct iov_iter *iter, } } put_folios: - for (i = 0; i < folio_batch_count(&fbatch); i++) - folio_put(fbatch.folios[i]); + for (i = 0; i < folio_batch_count(&fbatch); i++) { + struct folio *folio = fbatch.folios[i]; + + filemap_end_dropbehind_read(mapping, folio); + folio_put(folio); + } folio_batch_init(&fbatch); } while (iov_iter_count(iter) && iocb->ki_pos < isize && !error); diff --git a/mm/swap.c b/mm/swap.c index 3c0e14640951..053bee23f237 100644 --- a/mm/swap.c +++ b/mm/swap.c @@ -449,6 +449,8 @@ static void folio_inc_refs(struct folio *folio) */ void folio_mark_accessed(struct folio *folio) { + if (folio_test_dropbehind(folio)) + return; if (lru_gen_enabled()) { folio_inc_refs(folio); return; -- 2.39.2
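The cover letter also notes that io_uring gets the same behaviour by setting RWF_DONTCACHE in sqe->rw_flags for a buffered read or write. A small sketch using liburing follows; availability of liburing is an assumption, and RWF_DONTCACHE is defined locally in case the headers are older (the value matches the uapi patch earlier in this series).

#include <liburing.h>
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

#ifndef RWF_DONTCACHE
#define RWF_DONTCACHE	0x00000080	/* value from the uapi patch earlier in this series */
#endif

int main(int argc, char **argv)
{
	struct io_uring_sqe *sqe;
	struct io_uring_cqe *cqe;
	struct io_uring ring;
	char buf[65536];
	int fd, ret;

	if (argc < 2)
		return 1;
	fd = open(argv[1], O_RDONLY);
	if (fd < 0 || io_uring_queue_init(8, &ring, 0) < 0)
		return 1;

	sqe = io_uring_get_sqe(&ring);
	io_uring_prep_read(sqe, fd, buf, sizeof(buf), 0);
	sqe->rw_flags = RWF_DONTCACHE;	/* still a buffered read, just pruned on completion */

	io_uring_submit(&ring);
	ret = io_uring_wait_cqe(&ring, &cqe);
	if (!ret) {
		printf("read returned %d\n", cqe->res);	/* -EOPNOTSUPP without FOP_DONTCACHE */
		io_uring_cqe_seen(&ring, cqe);
	}

	io_uring_queue_exit(&ring);
	close(fd);
	return 0;
}

Since this is still page cache IO, none of the O_DIRECT alignment restrictions apply; only the post-completion pruning differs from a regular buffered read.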

From: Jens Axboe <axboe@kernel.dk> mainline inclusion from mainline-v6.10-rc2 commit fb7d3bc4149395c1ae99029c852eab6c28fc3c88 category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/ICPOEB CVE: NA Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i... -------------------------------- If the folio is marked as streaming, drop pages when writeback completes. Intended to be used with RWF_DONTCACHE, to avoid needing sync writes for uncached IO. Link: https://lkml.kernel.org/r/20241220154831.1086649-10-axboe@kernel.dk Signed-off-by: Jens Axboe <axboe@kernel.dk> Cc: Brian Foster <bfoster@redhat.com> Cc: Chris Mason <clm@meta.com> Cc: Christoph Hellwig <hch@lst.de> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Conflicts: mm/filemap.c [Context conflicts] Signed-off-by: Long Li <leo.lilong@huawei.com> --- mm/filemap.c | 28 ++++++++++++++++++++++++++++ 1 file changed, 28 insertions(+) diff --git a/mm/filemap.c b/mm/filemap.c index b729176c7abe..8ab7ff56cd7d 100644 --- a/mm/filemap.c +++ b/mm/filemap.c @@ -1593,12 +1593,35 @@ int folio_wait_private_2_killable(struct folio *folio) } EXPORT_SYMBOL(folio_wait_private_2_killable); +/* + * If folio was marked as dropbehind, then pages should be dropped when writeback + * completes. Do that now. If we fail, it's likely because of a big folio - + * just reset dropbehind for that case and latter completions should invalidate. + */ +static void folio_end_dropbehind_write(struct folio *folio) +{ + /* + * Hitting !in_task() should not happen off RWF_DONTCACHE writeback, + * but can happen if normal writeback just happens to find dirty folios + * that were created as part of uncached writeback, and that writeback + * would otherwise not need non-IRQ handling. Just skip the + * invalidation in that case. + */ + if (in_task() && folio_trylock(folio)) { + if (folio->mapping) + folio_unmap_invalidate(folio->mapping, folio, 0); + folio_unlock(folio); + } +} + /** * folio_end_writeback - End writeback against a folio. * @folio: The folio. */ void folio_end_writeback(struct folio *folio) { + bool folio_dropbehind = false; + /* * folio_test_clear_reclaim() could be used here but it is an * atomic operation and overkill in this particular case. Failing @@ -1618,12 +1641,17 @@ void folio_end_writeback(struct folio *folio) * reused before the folio_wake(). */ folio_get(folio); + if (!folio_test_dirty(folio)) + folio_dropbehind = folio_test_clear_dropbehind(folio); if (!__folio_end_writeback(folio)) BUG(); smp_mb__after_atomic(); folio_wake(folio, PG_writeback); acct_reclaim_writeback(folio); + + if (folio_dropbehind) + folio_end_dropbehind_write(folio); folio_put(folio); } EXPORT_SYMBOL(folio_end_writeback); -- 2.39.2

From: Jens Axboe <axboe@kernel.dk> mainline inclusion from mainline-v6.10-rc2 commit dddc559f2e7cff9c6525150cd29ef3a4f6692b26 category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/ICPOEB CVE: NA Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i... -------------------------------- Works like filemap_fdatawrite_range(), except it's a non-integrity data writeback and hence only starts writeback on the specified range. Will help facilitate generically starting uncached writeback from generic_write_sync(), as header dependencies preclude doing this inline from fs.h. Link: https://lkml.kernel.org/r/20241220154831.1086649-11-axboe@kernel.dk Signed-off-by: Jens Axboe <axboe@kernel.dk> Cc: Brian Foster <bfoster@redhat.com> Cc: Chris Mason <clm@meta.com> Cc: Christoph Hellwig <hch@lst.de> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Long Li <leo.lilong@huawei.com> --- include/linux/fs.h | 2 ++ mm/filemap.c | 18 ++++++++++++++++++ 2 files changed, 20 insertions(+) diff --git a/include/linux/fs.h b/include/linux/fs.h index 868f583663c1..d04faa8bfd4c 100644 --- a/include/linux/fs.h +++ b/include/linux/fs.h @@ -2760,6 +2760,8 @@ extern int __must_check file_fdatawait_range(struct file *file, loff_t lstart, extern int __must_check file_check_and_advance_wb_err(struct file *file); extern int __must_check file_write_and_wait_range(struct file *file, loff_t start, loff_t end); +int filemap_fdatawrite_range_kick(struct address_space *mapping, loff_t start, + loff_t end); static inline int file_write_and_wait(struct file *file) { diff --git a/mm/filemap.c b/mm/filemap.c index 8ab7ff56cd7d..e34c22b36e9e 100644 --- a/mm/filemap.c +++ b/mm/filemap.c @@ -445,6 +445,24 @@ int filemap_fdatawrite_range(struct address_space *mapping, loff_t start, } EXPORT_SYMBOL(filemap_fdatawrite_range); +/** + * filemap_fdatawrite_range_kick - start writeback on a range + * @mapping: target address_space + * @start: index to start writeback on + * @end: last (non-inclusive) index for writeback + * + * This is a non-integrity writeback helper, to start writing back folios + * for the indicated range. + * + * Return: %0 on success, negative error code otherwise. + */ +int filemap_fdatawrite_range_kick(struct address_space *mapping, loff_t start, + loff_t end) +{ + return __filemap_fdatawrite_range(mapping, start, end, WB_SYNC_NONE); +} +EXPORT_SYMBOL_GPL(filemap_fdatawrite_range_kick); + /** * filemap_flush - mostly a non-blocking flush * @mapping: target address_space -- 2.39.2

From: Jens Axboe <axboe@kernel.dk> mainline inclusion from mainline-v6.10-rc2 commit 1d4457576570627e1702614bc060b55d95b85e39 category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/ICPOEB CVE: NA Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i... -------------------------------- When a buffered write with IOCB_DONTCACHE has been successfully submitted, call filemap_fdatawrite_range_kick() to kick off the IO. File systems call generic_write_sync() for any successful buffered write submission, hence add the logic here rather than needing to modify the file system. Link: https://lkml.kernel.org/r/20241220154831.1086649-12-axboe@kernel.dk Signed-off-by: Jens Axboe <axboe@kernel.dk> Cc: Brian Foster <bfoster@redhat.com> Cc: Chris Mason <clm@meta.com> Cc: Christoph Hellwig <hch@lst.de> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Long Li <leo.lilong@huawei.com> --- include/linux/fs.h | 5 +++++ 1 file changed, 5 insertions(+) diff --git a/include/linux/fs.h b/include/linux/fs.h index d04faa8bfd4c..b0bd871653f2 100644 --- a/include/linux/fs.h +++ b/include/linux/fs.h @@ -2794,6 +2794,11 @@ static inline ssize_t generic_write_sync(struct kiocb *iocb, ssize_t count) (iocb->ki_flags & IOCB_SYNC) ? 0 : 1); if (ret) return ret; + } else if (iocb->ki_flags & IOCB_DONTCACHE) { + struct address_space *mapping = iocb->ki_filp->f_mapping; + + filemap_fdatawrite_range_kick(mapping, iocb->ki_pos, + iocb->ki_pos + count); } return count; -- 2.39.2

From: Jens Axboe <axboe@kernel.dk> mainline inclusion from mainline-v6.10-rc2 commit d94d23fdd7529f1f3218235d1e0a69e9856907b7 category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/ICPOEB CVE: NA Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i... -------------------------------- Callers can pass this in for uncached folio creation, in which case if a folio is newly created it gets marked as uncached. If a folio exists for this index and lookup succeeds, then it will not get marked as uncached. If an !uncached lookup finds a cached folio, clear the flag. For that case, there are competing uncached and cached users of the folio, and it should not get pruned. Link: https://lkml.kernel.org/r/20241220154831.1086649-13-axboe@kernel.dk Signed-off-by: Jens Axboe <axboe@kernel.dk> Cc: Brian Foster <bfoster@redhat.com> Cc: Chris Mason <clm@meta.com> Cc: Christoph Hellwig <hch@lst.de> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Long Li <leo.lilong@huawei.com> --- include/linux/pagemap.h | 2 ++ mm/filemap.c | 5 +++++ 2 files changed, 7 insertions(+) diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h index c4b363e9d9f4..083d34d9742e 100644 --- a/include/linux/pagemap.h +++ b/include/linux/pagemap.h @@ -707,6 +707,7 @@ pgoff_t page_cache_prev_miss(struct address_space *mapping, * * %FGP_NOFS - __GFP_FS will get cleared in gfp. * * %FGP_NOWAIT - Don't block on the folio lock. * * %FGP_STABLE - Wait for the folio to be stable (finished writeback) + * * %FGP_DONTCACHE - Uncached buffered IO * * %FGP_WRITEBEGIN - The flags to use in a filesystem write_begin() * implementation. */ @@ -720,6 +721,7 @@ typedef unsigned int __bitwise fgf_t; #define FGP_NOWAIT ((__force fgf_t)0x00000020) #define FGP_FOR_MMAP ((__force fgf_t)0x00000040) #define FGP_STABLE ((__force fgf_t)0x00000080) +#define FGP_DONTCACHE ((__force fgf_t)0x00000100) #define FGF_GET_ORDER(fgf) (((__force unsigned)fgf) >> 26) /* top 6 bits */ #define FGP_WRITEBEGIN (FGP_LOCK | FGP_WRITE | FGP_CREAT | FGP_STABLE) diff --git a/mm/filemap.c b/mm/filemap.c index e34c22b36e9e..577b292e6c6d 100644 --- a/mm/filemap.c +++ b/mm/filemap.c @@ -1998,6 +1998,8 @@ struct folio *__filemap_get_folio(struct address_space *mapping, pgoff_t index, /* Init accessed so avoid atomic mark_page_accessed later */ if (fgp_flags & FGP_ACCESSED) __folio_set_referenced(folio); + if (fgp_flags & FGP_DONTCACHE) + __folio_set_dropbehind(folio); err = filemap_add_folio(mapping, folio, index, gfp); if (!err) @@ -2020,6 +2022,9 @@ struct folio *__filemap_get_folio(struct address_space *mapping, pgoff_t index, if (!folio) return ERR_PTR(-ENOENT); + /* not an uncached lookup, clear uncached if set */ + if (folio_test_dropbehind(folio) && !(fgp_flags & FGP_DONTCACHE)) + folio_clear_dropbehind(folio); return folio; } EXPORT_SYMBOL(__filemap_get_folio); -- 2.39.2

From: Jingbo Xu <jefflexu@linux.alibaba.com> mainline inclusion from mainline-v6.10-rc2 commit 927289988068a65ccc168eda881ce60f8712707b category: bugfix bugzilla: https://gitee.com/openeuler/kernel/issues/ICPOEB CVE: NA Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i... -------------------------------- ... otherwise this is a behavior change for the previous callers of invalidate_complete_folio2(), e.g. the page invalidation routine. Fixes: 4a9e23159fd3 ("mm/truncate: add folio_unmap_invalidate() helper") Signed-off-by: Jingbo Xu <jefflexu@linux.alibaba.com> Link: https://lore.kernel.org/r/20250218120209.88093-3-jefflexu@linux.alibaba.com Signed-off-by: Christian Brauner <brauner@kernel.org> Signed-off-by: Long Li <leo.lilong@huawei.com> --- mm/truncate.c | 2 -- 1 file changed, 2 deletions(-) diff --git a/mm/truncate.c b/mm/truncate.c index 559285b8103d..d4b44be38253 100644 --- a/mm/truncate.c +++ b/mm/truncate.c @@ -574,8 +574,6 @@ int folio_unmap_invalidate(struct address_space *mapping, struct folio *folio, VM_BUG_ON_FOLIO(!folio_test_locked(folio), folio); - if (folio_test_dirty(folio)) - return 0; if (folio_mapped(folio)) unmap_mapping_folio(folio); BUG_ON(folio_mapped(folio)); -- 2.39.2

From: Jens Axboe <axboe@kernel.dk> mainline inclusion from mainline-v6.15 commit 095f627add86a6ddda2c2cfd563b0ee05d0172b2 category: bugfix bugzilla: https://gitee.com/openeuler/kernel/issues/ICPOEB CVE: NA Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i... -------------------------------- It's possible for the folio to either get marked for writeback or redirtied. Add a helper, filemap_end_dropbehind(), which guards the folio_unmap_invalidate() call behind a check for the folio being both non-dirty and not under writeback AFTER the folio lock has been acquired. Use this helper in folio_end_dropbehind_write(). Cc: stable@vger.kernel.org Reported-by: Al Viro <viro@zeniv.linux.org.uk> Fixes: fb7d3bc41493 ("mm/filemap: drop streaming/uncached pages when writeback completes") Link: https://lore.kernel.org/linux-fsdevel/20250525083209.GS2023217@ZenIV/ Signed-off-by: Jens Axboe <axboe@kernel.dk> Link: https://lore.kernel.org/20250527133255.452431-2-axboe@kernel.dk Signed-off-by: Christian Brauner <brauner@kernel.org> Signed-off-by: Long Li <leo.lilong@huawei.com> --- mm/filemap.c | 13 +++++++++++-- 1 file changed, 11 insertions(+), 2 deletions(-) diff --git a/mm/filemap.c b/mm/filemap.c index 577b292e6c6d..11210f610b9b 100644 --- a/mm/filemap.c +++ b/mm/filemap.c @@ -1611,6 +1611,16 @@ int folio_wait_private_2_killable(struct folio *folio) } EXPORT_SYMBOL(folio_wait_private_2_killable); +static void filemap_end_dropbehind(struct folio *folio) +{ + struct address_space *mapping = folio->mapping; + + VM_BUG_ON_FOLIO(!folio_test_locked(folio), folio); + + if (mapping && !folio_test_writeback(folio) && !folio_test_dirty(folio)) + folio_unmap_invalidate(mapping, folio, 0); +} + /* * If folio was marked as dropbehind, then pages should be dropped when writeback * completes. Do that now. If we fail, it's likely because of a big folio - @@ -1626,8 +1636,7 @@ static void folio_end_dropbehind_write(struct folio *folio) * invalidation in that case. */ if (in_task() && folio_trylock(folio)) { - if (folio->mapping) - folio_unmap_invalidate(folio->mapping, folio, 0); + filemap_end_dropbehind(folio); folio_unlock(folio); } } -- 2.39.2

From: Jens Axboe <axboe@kernel.dk> mainline inclusion from mainline-v6.15 commit 25b065a744ff0c1099bb357be1c40030b5a14c07 category: bugfix bugzilla: https://gitee.com/openeuler/kernel/issues/ICPOEB CVE: NA Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i... -------------------------------- Use the filemap_end_dropbehind() helper rather than calling folio_unmap_invalidate() directly, as we need to check if the folio has been redirtied or marked for writeback once the folio lock has been re-acquired. Cc: stable@vger.kernel.org Reported-by: Trond Myklebust <trondmy@hammerspace.com> Fixes: 8026e49bff9b ("mm/filemap: add read support for RWF_DONTCACHE") Link: https://lore.kernel.org/linux-fsdevel/ba8a9805331ce258a622feaca266b163db681a... Signed-off-by: Jens Axboe <axboe@kernel.dk> Link: https://lore.kernel.org/20250527133255.452431-3-axboe@kernel.dk Signed-off-by: Christian Brauner <brauner@kernel.org> Signed-off-by: Long Li <leo.lilong@huawei.com> --- mm/filemap.c | 7 +++---- 1 file changed, 3 insertions(+), 4 deletions(-) diff --git a/mm/filemap.c b/mm/filemap.c index 11210f610b9b..176f15606aa0 100644 --- a/mm/filemap.c +++ b/mm/filemap.c @@ -2684,8 +2684,7 @@ static inline bool pos_same_folio(loff_t pos1, loff_t pos2, struct folio *folio) return (pos1 >> shift == pos2 >> shift); } -static void filemap_end_dropbehind_read(struct address_space *mapping, - struct folio *folio) +static void filemap_end_dropbehind_read(struct folio *folio) { if (!folio_test_dropbehind(folio)) return; @@ -2693,7 +2692,7 @@ static void filemap_end_dropbehind_read(struct address_space *mapping, return; if (folio_trylock(folio)) { if (folio_test_clear_dropbehind(folio)) - folio_unmap_invalidate(mapping, folio, 0); + filemap_end_dropbehind(folio); folio_unlock(folio); } } @@ -2821,7 +2820,7 @@ ssize_t filemap_read(struct kiocb *iocb, struct iov_iter *iter, for (i = 0; i < folio_batch_count(&fbatch); i++) { struct folio *folio = fbatch.folios[i]; - filemap_end_dropbehind_read(mapping, folio); + filemap_end_dropbehind_read(folio); folio_put(folio); } folio_batch_init(&fbatch); -- 2.39.2

From: Jens Axboe <axboe@kernel.dk> mainline inclusion from mainline-v6.15 commit 1da7a06d9ce4edea3370945af8bfcc71b7744788 category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/ICPOEB CVE: NA Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i... -------------------------------- The read side is filemap_end_dropbehind_read(), while the write side used folio_ as the prefix rather than filemap_. The read side makes more sense, unify the naming such that the write side follows that. Signed-off-by: Jens Axboe <axboe@kernel.dk> Link: https://lore.kernel.org/20250527133255.452431-5-axboe@kernel.dk Signed-off-by: Christian Brauner <brauner@kernel.org> Signed-off-by: Long Li <leo.lilong@huawei.com> --- mm/filemap.c | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/mm/filemap.c b/mm/filemap.c index 176f15606aa0..3626330d9d94 100644 --- a/mm/filemap.c +++ b/mm/filemap.c @@ -1626,7 +1626,7 @@ static void filemap_end_dropbehind(struct folio *folio) * completes. Do that now. If we fail, it's likely because of a big folio - * just reset dropbehind for that case and latter completions should invalidate. */ -static void folio_end_dropbehind_write(struct folio *folio) +static void filemap_end_dropbehind_write(struct folio *folio) { /* * Hitting !in_task() should not happen off RWF_DONTCACHE writeback, @@ -1678,7 +1678,7 @@ void folio_end_writeback(struct folio *folio) acct_reclaim_writeback(folio); if (folio_dropbehind) - folio_end_dropbehind_write(folio); + filemap_end_dropbehind_write(folio); folio_put(folio); } EXPORT_SYMBOL(folio_end_writeback); -- 2.39.2

From: Jens Axboe <axboe@kernel.dk> mainline inclusion from mainline-v6.15 commit a1d98e4ffb972ab007f5de850ef53c2a46cacf15 category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/ICPOEB CVE: NA Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i... -------------------------------- The read and write side does this a bit differently, unify it such that the _{read,write} helpers check the bit before locking, and the generic handler is in charge of clearing the bit and invalidating, once under the folio lock. Signed-off-by: Jens Axboe <axboe@kernel.dk> Link: https://lore.kernel.org/20250527133255.452431-6-axboe@kernel.dk Signed-off-by: Christian Brauner <brauner@kernel.org> Conflicts: mm/filemap.c [Context conflicts] Signed-off-by: Long Li <leo.lilong@huawei.com> --- mm/filemap.c | 20 ++++++++++---------- 1 file changed, 10 insertions(+), 10 deletions(-) diff --git a/mm/filemap.c b/mm/filemap.c index 3626330d9d94..cca581bcd302 100644 --- a/mm/filemap.c +++ b/mm/filemap.c @@ -1617,7 +1617,11 @@ static void filemap_end_dropbehind(struct folio *folio) VM_BUG_ON_FOLIO(!folio_test_locked(folio), folio); - if (mapping && !folio_test_writeback(folio) && !folio_test_dirty(folio)) + if (folio_test_writeback(folio) || folio_test_dirty(folio)) + return; + if (!folio_test_clear_dropbehind(folio)) + return; + if (mapping) folio_unmap_invalidate(mapping, folio, 0); } @@ -1628,6 +1632,9 @@ static void filemap_end_dropbehind(struct folio *folio) */ static void filemap_end_dropbehind_write(struct folio *folio) { + if (!folio_test_dropbehind(folio)) + return; + /* * Hitting !in_task() should not happen off RWF_DONTCACHE writeback, * but can happen if normal writeback just happens to find dirty folios @@ -1647,8 +1654,6 @@ static void filemap_end_dropbehind_write(struct folio *folio) */ void folio_end_writeback(struct folio *folio) { - bool folio_dropbehind = false; - /* * folio_test_clear_reclaim() could be used here but it is an * atomic operation and overkill in this particular case. Failing @@ -1668,17 +1673,13 @@ void folio_end_writeback(struct folio *folio) * reused before the folio_wake(). */ folio_get(folio); - if (!folio_test_dirty(folio)) - folio_dropbehind = folio_test_clear_dropbehind(folio); if (!__folio_end_writeback(folio)) BUG(); smp_mb__after_atomic(); folio_wake(folio, PG_writeback); + filemap_end_dropbehind_write(folio); acct_reclaim_writeback(folio); - - if (folio_dropbehind) - filemap_end_dropbehind_write(folio); folio_put(folio); } EXPORT_SYMBOL(folio_end_writeback); @@ -2691,8 +2692,7 @@ static void filemap_end_dropbehind_read(struct folio *folio) if (folio_test_writeback(folio) || folio_test_dirty(folio)) return; if (folio_trylock(folio)) { - if (folio_test_clear_dropbehind(folio)) - filemap_end_dropbehind(folio); + filemap_end_dropbehind(folio); folio_unlock(folio); } } -- 2.39.2

From: Jingbo Xu <jefflexu@linux.alibaba.com> mainline inclusion from mainline-v6.10-rc2 commit 8510edf191d2df0822ea22d6226e4eef87562271 category: bugfix bugzilla: https://gitee.com/openeuler/kernel/issues/ICPOEB CVE: NA Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i... -------------------------------- iocb->ki_pos has been updated with the number of written bytes since generic_perform_write(). Besides __filemap_fdatawrite_range() accepts the inclusive end of the data range. Fixes: 1d4457576570 ("mm: call filemap_fdatawrite_range_kick() after IOCB_DONTCACHE issue") Signed-off-by: Jingbo Xu <jefflexu@linux.alibaba.com> Link: https://lore.kernel.org/r/20250218120209.88093-2-jefflexu@linux.alibaba.com Reviewed-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Christian Brauner <brauner@kernel.org> Signed-off-by: Long Li <leo.lilong@huawei.com> --- include/linux/fs.h | 4 ++-- mm/filemap.c | 2 +- 2 files changed, 3 insertions(+), 3 deletions(-) diff --git a/include/linux/fs.h b/include/linux/fs.h index b0bd871653f2..a84b6ad34009 100644 --- a/include/linux/fs.h +++ b/include/linux/fs.h @@ -2797,8 +2797,8 @@ static inline ssize_t generic_write_sync(struct kiocb *iocb, ssize_t count) } else if (iocb->ki_flags & IOCB_DONTCACHE) { struct address_space *mapping = iocb->ki_filp->f_mapping; - filemap_fdatawrite_range_kick(mapping, iocb->ki_pos, - iocb->ki_pos + count); + filemap_fdatawrite_range_kick(mapping, iocb->ki_pos - count, + iocb->ki_pos - 1); } return count; diff --git a/mm/filemap.c b/mm/filemap.c index cca581bcd302..729490f5e50b 100644 --- a/mm/filemap.c +++ b/mm/filemap.c @@ -449,7 +449,7 @@ EXPORT_SYMBOL(filemap_fdatawrite_range); * filemap_fdatawrite_range_kick - start writeback on a range * @mapping: target address_space * @start: index to start writeback on - * @end: last (non-inclusive) index for writeback + * @end: last (inclusive) index for writeback * * This is a non-integrity writeback helper, to start writing back folios * for the indicated range. -- 2.39.2
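To make the off-by-count concrete with example numbers (hypothetical values, not taken from the patch): consider a 4096-byte buffered write at file offset 0. After generic_perform_write() has advanced the position, iocb->ki_pos is 4096, so the pre-fix call kicked writeback on bytes starting at 4096 and missed the freshly written data entirely, while the fixed call covers exactly the bytes that were written:

	/* Hypothetical example: count = 4096 bytes written at offset 0, so    */
	/* iocb->ki_pos == 4096 once generic_perform_write() returns.          */
	/* Before the fix: start = ki_pos = 4096, end = ki_pos + count = 8192  */
	/* After the fix:  start = ki_pos - count = 0, end = ki_pos - 1 = 4095 */
	filemap_fdatawrite_range_kick(mapping, iocb->ki_pos - count,
				      iocb->ki_pos - 1);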

From: Jens Axboe <axboe@kernel.dk> mainline inclusion from mainline-v6.10-rc2 commit b2cd5ae693a3dc5b70a0f75fba96452c591a2047 category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/ICPOEB CVE: NA Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i... -------------------------------- Add iomap buffered write support for RWF_DONTCACHE. If RWF_DONTCACHE is set for a write, mark the folios being written as uncached. Then writeback completion will drop the pages. The write_iter handler simply kicks off writeback for the pages, and writeback completion will take care of the rest. Signed-off-by: "Darrick J. Wong" <djwong@kernel.org> Signed-off-by: Jens Axboe <axboe@kernel.dk> Link: https://lore.kernel.org/r/20250204184047.356762-2-axboe@kernel.dk Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Christian Brauner <brauner@kernel.org> Conflicts: Documentation/filesystems/iomap/design.rst Documentation/filesystems/iomap/operations.rst include/linux/iomap.h Signed-off-by: Long Li <leo.lilong@huawei.com> --- fs/iomap/buffered-io.c | 4 ++++ include/linux/iomap.h | 1 + 2 files changed, 5 insertions(+) diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c index ad442f71c000..3a3f82966fbf 100644 --- a/fs/iomap/buffered-io.c +++ b/fs/iomap/buffered-io.c @@ -654,6 +654,8 @@ struct folio *iomap_get_folio(struct iomap_iter *iter, loff_t pos, size_t len) if (iter->flags & IOMAP_NOWAIT) fgp |= FGP_NOWAIT; + if (iter->flags & IOMAP_DONTCACHE) + fgp |= FGP_DONTCACHE; fgp |= fgf_set_order(len); return __filemap_get_folio(iter->inode->i_mapping, pos >> PAGE_SHIFT, @@ -1101,6 +1103,8 @@ iomap_file_buffered_write(struct kiocb *iocb, struct iov_iter *i, if (iocb->ki_flags & IOCB_NOWAIT) iter.flags |= IOMAP_NOWAIT; + if (iocb->ki_flags & IOCB_DONTCACHE) + iter.flags |= IOMAP_DONTCACHE; while ((ret = iomap_iter(&iter, ops)) > 0) iter.processed = iomap_write_iter(&iter, i); diff --git a/include/linux/iomap.h b/include/linux/iomap.h index 6bd7ed98cf1f..05d68152edb3 100644 --- a/include/linux/iomap.h +++ b/include/linux/iomap.h @@ -178,6 +178,7 @@ struct iomap_folio_ops { #else #define IOMAP_DAX 0 #endif /* CONFIG_FS_DAX */ +#define IOMAP_DONTCACHE (1 << 10) struct iomap_ops { /* -- 2.39.2
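For context, the user-visible interface this plumbs through is the RWF_DONTCACHE flag to pwritev2()/preadv2(). The following is a minimal userspace sketch, not part of the patch; the file path is arbitrary and the 0x80 fallback define is only assumed to mirror include/uapi/linux/fs.h from this series for builds against older headers:

#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/uio.h>
#include <unistd.h>

#ifndef RWF_DONTCACHE
#define RWF_DONTCACHE 0x00000080	/* assumed to match include/uapi/linux/fs.h */
#endif

int main(void)
{
	static char buf[65536];
	struct iovec iov = { .iov_base = buf, .iov_len = sizeof(buf) };
	int fd = open("/tmp/dontcache-test", O_WRONLY | O_CREAT | O_TRUNC, 0644);
	ssize_t ret;

	if (fd < 0) {
		perror("open");
		return 1;
	}
	memset(buf, 'a', sizeof(buf));
	/* Buffered write whose folios are dropped once writeback completes.
	 * Filesystems that do not advertise FOP_DONTCACHE reject the flag
	 * with EOPNOTSUPP. */
	ret = pwritev2(fd, &iov, 1, 0, RWF_DONTCACHE);
	if (ret < 0)
		perror("pwritev2(RWF_DONTCACHE)");
	close(fd);
	return ret == (ssize_t)sizeof(buf) ? 0 : 1;
}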

From: Jens Axboe <axboe@kernel.dk> mainline inclusion from mainline-v6.15 commit 34ecde3c56066ba79e5ec3d93c5b14ea83e3603e category: bugfix bugzilla: https://gitee.com/openeuler/kernel/issues/ICPOEB CVE: NA Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i... -------------------------------- DONTCACHE I/O must have the completion punted to a workqueue, just like what is done for unwritten extents, as the completion needs task context to perform the invalidation of the folio(s). However, if writeback is started off filemap_fdatawrite_range() off generic_sync() and it's an overwrite, then the DONTCACHE marking gets lost as iomap_add_to_ioend() doesn't look at the folio being added and no further state is passed down to help it know that this is a dropbehind/DONTCACHE write. Check if the folio being added is marked as dropbehind, and set IOMAP_IOEND_DONTCACHE if that is the case. Then XFS can factor this into the decision making of completion context in xfs_submit_ioend(). Additionally include this ioend flag in the NOMERGE flags, to avoid mixing it with unrelated IO. Since this is the 3rd flag that will cause XFS to punt the completion to a workqueue, add a helper so that each one of them can get appropriately commented. This fixes extra page cache being instantiated when the write performed is an overwrite, rather than newly instantiated blocks. Fixes: b2cd5ae693a3 ("iomap: make buffered writes work with RWF_DONTCACHE") Signed-off-by: Jens Axboe <axboe@kernel.dk> Link: https://lore.kernel.org/5153f6e8-274d-4546-bf55-30a5018e0d03@kernel.dk Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Christian Brauner <brauner@kernel.org> Conflicts: fs/ext4/inode.c fs/iomap/buffered-io.c fs/xfs/xfs_aops.c include/linux/iomap.h [Context conflicts] Signed-off-by: Long Li <leo.lilong@huawei.com> --- fs/ext4/inode.c | 3 ++- fs/iomap/buffered-io.c | 3 +++ fs/xfs/xfs_aops.c | 3 ++- include/linux/iomap.h | 1 + 4 files changed, 8 insertions(+), 2 deletions(-) diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c index 045a7213d6b6..3d57bf1b7df4 100644 --- a/fs/ext4/inode.c +++ b/fs/ext4/inode.c @@ -3932,7 +3932,8 @@ static int ext4_iomap_prepare_ioend(struct iomap_ioend *ioend, int status) /* Need to convert unwritten extents when I/Os are completed. */ if (ioend->io_type == IOMAP_UNWRITTEN || - ioend->io_offset + ioend->io_size > READ_ONCE(ei->i_disksize)) + ioend->io_offset + ioend->io_size > READ_ONCE(ei->i_disksize) || + ioend->io_flags & IOMAP_F_DONTCACHE) ioend->io_bio.bi_end_io = ext4_iomap_end_bio; return status; diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c index 3a3f82966fbf..3da42ca9a365 100644 --- a/fs/iomap/buffered-io.c +++ b/fs/iomap/buffered-io.c @@ -1871,6 +1871,9 @@ static int iomap_add_to_ioend(struct iomap_writepage_ctx *wpc, if (!bio_add_folio(&wpc->ioend->io_bio, folio, len, poff)) goto new_ioend; + if (folio_test_dropbehind(folio)) + wpc->ioend->io_flags |= IOMAP_F_DONTCACHE; + if (ifs) atomic_add(len, &ifs->write_bytes_pending); diff --git a/fs/xfs/xfs_aops.c b/fs/xfs/xfs_aops.c index 82f18f28c1c1..3fb35bad75ec 100644 --- a/fs/xfs/xfs_aops.c +++ b/fs/xfs/xfs_aops.c @@ -418,7 +418,8 @@ xfs_prepare_ioend( /* send ioends that might require a transaction to the completion wq */ if (xfs_ioend_is_append(ioend) || ioend->io_type == IOMAP_UNWRITTEN || - (ioend->io_flags & IOMAP_F_SHARED)) + (ioend->io_flags & IOMAP_F_SHARED) || + ioend->io_flags & IOMAP_F_DONTCACHE) ioend->io_bio.bi_end_io = xfs_end_bio; return status; } diff --git a/include/linux/iomap.h b/include/linux/iomap.h index 05d68152edb3..5d9782909b1f 100644 --- a/include/linux/iomap.h +++ b/include/linux/iomap.h @@ -64,6 +64,7 @@ struct vm_fault; #define IOMAP_F_BUFFER_HEAD 0 #endif /* CONFIG_BUFFER_HEAD */ #define IOMAP_F_XATTR (1U << 5) +#define IOMAP_F_DONTCACHE (1U << 6) /* * Flags set by the core iomap code during operations: -- 2.39.2
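The upstream commit message above mentions factoring the three (now four, counting append) punt conditions into a commented helper; this backport keeps the open-coded condition in xfs_prepare_ioend(). The intent can be sketched as follows; the helper name is hypothetical and not part of this backport:

/*
 * Illustrative sketch only: ioend completions that need task context are
 * punted to the completion workqueue instead of running in bio end_io.
 */
static inline bool xfs_ioend_needs_wq_completion(struct iomap_ioend *ioend)
{
	/* Appending writes need an on-disk size update transaction. */
	if (xfs_ioend_is_append(ioend))
		return true;
	/* Unwritten extents need to be converted in a transaction. */
	if (ioend->io_type == IOMAP_UNWRITTEN)
		return true;
	/* Shared (COW) extents need remapping after the write. */
	if (ioend->io_flags & IOMAP_F_SHARED)
		return true;
	/* DONTCACHE writeback invalidates folios, which needs task context. */
	if (ioend->io_flags & IOMAP_F_DONTCACHE)
		return true;
	return false;
}

xfs_prepare_ioend() could then set ioend->io_bio.bi_end_io = xfs_end_bio only when this helper returns true, which is equivalent to the open-coded condition in the hunk above.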

From: Jens Axboe <axboe@kernel.dk> mainline inclusion from mainline-v6.10-rc2 commit 974c5e6139db30fae668e44c381d13bcc63b65fa category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/ICPOEB CVE: NA Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i... -------------------------------- Read side was already fully supported, and with the write side appropriately punted to the worker queue, all that's needed now is setting FOP_DONTCACHE in the file_operations structure to enable full support for read and write uncached IO. This provides similar benefits to using RWF_DONTCACHE with reads. Testing buffered writes on 32 files:

writing bs 65536, uncached 0
 1s: 196035MB/sec
 2s: 132308MB/sec
 3s: 132438MB/sec
 4s: 116528MB/sec
 5s: 103898MB/sec
 6s: 108893MB/sec
 7s: 99678MB/sec
 8s: 106545MB/sec
 9s: 106826MB/sec
10s: 101544MB/sec
11s: 111044MB/sec
12s: 124257MB/sec
13s: 116031MB/sec
14s: 114540MB/sec
15s: 115011MB/sec
16s: 115260MB/sec
17s: 116068MB/sec
18s: 116096MB/sec

where it's quite obvious where the page cache filled, and performance dropped to about half of where it started, settling in at around 115GB/sec. Meanwhile, 32 kswapds were running full steam trying to reclaim pages. Running the same test with uncached buffered writes:

writing bs 65536, uncached 1
 1s: 198974MB/sec
 2s: 189618MB/sec
 3s: 193601MB/sec
 4s: 188582MB/sec
 5s: 193487MB/sec
 6s: 188341MB/sec
 7s: 194325MB/sec
 8s: 188114MB/sec
 9s: 192740MB/sec
10s: 189206MB/sec
11s: 193442MB/sec
12s: 189659MB/sec
13s: 191732MB/sec
14s: 190701MB/sec
15s: 191789MB/sec
16s: 191259MB/sec
17s: 190613MB/sec
18s: 191951MB/sec

and the behavior is fully predictable, performing the same throughout even after the page cache would otherwise have fully filled with dirty data. It's also about 65% faster, and using half the CPU of the system compared to the normal buffered write. Signed-off-by: Jens Axboe <axboe@kernel.dk> Link: https://lore.kernel.org/r/20250204184047.356762-3-axboe@kernel.dk Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: "Darrick J. Wong" <djwong@kernel.org> Signed-off-by: Christian Brauner <brauner@kernel.org> Conflicts: fs/xfs/xfs_file.c [Context conflicts] Signed-off-by: Long Li <leo.lilong@huawei.com> --- fs/xfs/xfs_file.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c index 87c82f98929f..cbcde9a9ef2d 100644 --- a/fs/xfs/xfs_file.c +++ b/fs/xfs/xfs_file.c @@ -1501,7 +1501,7 @@ const struct file_operations xfs_file_operations = { .fadvise = xfs_file_fadvise, .remap_file_range = xfs_file_remap_range, .fop_flags = FOP_MMAP_SYNC | FOP_BUFFER_RASYNC | FOP_BUFFER_WASYNC | - FOP_DIO_PARALLEL_WRITE, + FOP_DIO_PARALLEL_WRITE | FOP_DONTCACHE, }; const struct file_operations xfs_dir_file_operations = { -- 2.39.2

hulk inclusion category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/ICPOEB CVE: NA ---------------------------------------------------------------------- Currently, uncached buffered I/O is fully supported on both the read and write paths only for ext4 inodes that use the iomap-based buffered I/O path (EXT4_STATE_BUFFERED_IOMAP); support is incomplete for inodes still on the legacy, non-iomap buffered I/O path. Set the FOP_DONTCACHE flag in ext4's file_operations so that uncached reads and writes are accepted, and return -EOPNOTSUPP for inodes that do not use the iomap buffered I/O path. Signed-off-by: Long Li <leo.lilong@huawei.com> --- fs/ext4/file.c | 10 +++++++++- 1 file changed, 9 insertions(+), 1 deletion(-) diff --git a/fs/ext4/file.c b/fs/ext4/file.c index afa60bd4ae63..1d626aac05c6 100644 --- a/fs/ext4/file.c +++ b/fs/ext4/file.c @@ -135,6 +135,10 @@ static ssize_t ext4_file_read_iter(struct kiocb *iocb, struct iov_iter *to) if (unlikely(ext4_forced_shutdown(inode->i_sb))) return -EIO; + if ((iocb->ki_flags & IOCB_DONTCACHE) && + !ext4_test_inode_state(inode, EXT4_STATE_BUFFERED_IOMAP)) + return -EOPNOTSUPP; + if (!iov_iter_count(to)) return 0; /* skip atime */ @@ -710,6 +714,10 @@ ext4_file_write_iter(struct kiocb *iocb, struct iov_iter *from) if (unlikely(ext4_forced_shutdown(inode->i_sb))) return -EIO; + if ((iocb->ki_flags & IOCB_DONTCACHE) && + !ext4_test_inode_state(inode, EXT4_STATE_BUFFERED_IOMAP)) + return -EOPNOTSUPP; + #ifdef CONFIG_FS_DAX if (IS_DAX(inode)) return ext4_dax_write_iter(iocb, from); @@ -967,7 +975,7 @@ const struct file_operations ext4_file_operations = { .splice_write = iter_file_splice_write, .fallocate = ext4_fallocate, .fop_flags = FOP_MMAP_SYNC | FOP_BUFFER_RASYNC | - FOP_DIO_PARALLEL_WRITE, + FOP_DIO_PARALLEL_WRITE | FOP_DONTCACHE, }; const struct inode_operations ext4_file_inode_operations = { -- 2.39.2
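Because this backport returns -EOPNOTSUPP for ext4 inodes outside the iomap buffered path, userspace that wants to use uncached reads opportunistically can fall back to a normal cached read. The sketch below is illustrative only; the function name and fallback policy are not part of the patch, and the 0x80 fallback define is assumed to mirror include/uapi/linux/fs.h from this series:

#define _GNU_SOURCE
#include <errno.h>
#include <sys/types.h>
#include <sys/uio.h>

#ifndef RWF_DONTCACHE
#define RWF_DONTCACHE 0x00000080	/* assumed to match include/uapi/linux/fs.h */
#endif

/* Try an uncached read first; fall back to a regular cached read when the
 * filesystem or inode (e.g. an ext4 inode not using the iomap buffered
 * path) does not support RWF_DONTCACHE. */
static ssize_t read_dontcache(int fd, void *buf, size_t len, off_t off)
{
	struct iovec iov = { .iov_base = buf, .iov_len = len };
	ssize_t ret = preadv2(fd, &iov, 1, off, RWF_DONTCACHE);

	if (ret < 0 && errno == EOPNOTSUPP)
		ret = preadv2(fd, &iov, 1, off, 0);
	return ret;
}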

Feedback: The patch(es) you sent to the kernel@openeuler.org mailing list have been converted to a pull request successfully! Pull request link: https://gitee.com/openeuler/kernel/pulls/17476 Mailing list address: https://mailweb.openeuler.org/archives/list/kernel@openeuler.org/message/2EK...
participants (2):
- Long Li
- patchwork bot