This patch set fix some bug for forcealign:
patch 1 ~ 5 : fix tail alignment issue while approach to no space. patch 6 : fix forcealign not compatible with reflink and realtime. patch 7 : only datafork need bunmap algin for focealign patch 8 ~ 11 : fix truncate for forcealign
Dave Chinner (3): xfs: only allow minlen allocations when near ENOSPC xfs: always tail align maxlen allocations xfs: align args->minlen for forced allocation alignment
John Garry (1): xfs: Don't revert allocated offset for forcealign
Long Li (5): xfs: don't attempting non-aligned fallbacks alloc for forcealign xfs: simplify extent allocation alignment xfs: forcealign not compatible with reflink and realtime device xfs: only bunmap align in datafork for forcealign xfs: correct the truncate blocksize of forcealign
Zhang Yi (3): math64: add rem_u64() to just return the remainder iomap: pass blocksize to iomap_truncate_page() xfs: refactor the truncating order
fs/iomap/buffered-io.c | 8 +-- fs/xfs/libxfs/xfs_alloc.c | 31 +++++---- fs/xfs/libxfs/xfs_bmap.c | 140 ++++++++++++++++++++------------------ fs/xfs/xfs_iops.c | 124 +++++++++++++++++---------------- fs/xfs/xfs_super.c | 19 +++++- include/linux/iomap.h | 4 +- include/linux/math64.h | 24 +++++++ 7 files changed, 208 insertions(+), 142 deletions(-)
From: Dave Chinner dchinner@redhat.com
maillist inclusion category: bugfix bugzilla: https://gitee.com/openeuler/kernel/issues/I9VTE3 CVE: NA
Reference: https://lore.kernel.org/linux-xfs/20240705162450.3481169-1-john.g.garry@orac...
--------------------------------
When we are near ENOSPC and don't have enough free space for an args->maxlen allocation, xfs_alloc_space_available() will trim args->maxlen to equal the available space. However, this function has only checked that there is enough contiguous free space for an aligned args->minlen allocation to succeed. Hence there is no guarantee that an args->maxlen allocation will succeed, nor that the available space will allow for correct alignment of an args->maxlen allocation.
Further, by trimming args->maxlen arbitrarily, it breaks an assumption made in xfs_alloc_fix_len() that if the caller wants aligned allocation, then args->maxlen will be set to an aligned value. It then skips the tail alignment and so we end up with extents that aren't aligned to extent size hint boundaries as we approach ENOSPC.
To avoid this problem, don't reduce args->maxlen by some random, arbitrary amount. If args->maxlen is too large for the available space, reduce the allocation to a minlen allocation as we know we have contiguous free space available for this to succeed and always be correctly aligned.
Signed-off-by: Dave Chinner dchinner@redhat.com Signed-off-by: John Garry john.g.garry@oracle.com Signed-off-by: Long Li leo.lilong@huawei.com --- fs/xfs/libxfs/xfs_alloc.c | 19 ++++++++++++++----- 1 file changed, 14 insertions(+), 5 deletions(-)
diff --git a/fs/xfs/libxfs/xfs_alloc.c b/fs/xfs/libxfs/xfs_alloc.c index 95bfe89651ad..19d89bf19b48 100644 --- a/fs/xfs/libxfs/xfs_alloc.c +++ b/fs/xfs/libxfs/xfs_alloc.c @@ -2385,14 +2385,23 @@ xfs_alloc_space_available( if (available < (int)max(args->total, alloc_len)) return false;
+ if (flags & XFS_ALLOC_FLAG_CHECK) + return true; + /* - * Clamp maxlen to the amount of free space available for the actual - * extent allocation. + * If we can't do a maxlen allocation, then we must reduce the size of + * the allocation to match the available free space. We know how big + * the largest contiguous free space we can allocate is, so that's our + * upper bound. However, we don't exaclty know what alignment/size + * constraints have been placed on the allocation, so we can't + * arbitrarily select some new max size. Hence make this a minlen + * allocation as we know that will definitely succeed and match the + * callers alignment constraints. */ - if (available < (int)args->maxlen && !(flags & XFS_ALLOC_FLAG_CHECK)) { - args->maxlen = available; + alloc_len = args->maxlen + (args->alignment - 1) + args->minalignslop; + if (longest < alloc_len) { + args->maxlen = args->minlen; ASSERT(args->maxlen > 0); - ASSERT(args->maxlen >= args->minlen); }
return true;
From: Dave Chinner dchinner@redhat.com
maillist inclusion category: bugfix bugzilla: https://gitee.com/openeuler/kernel/issues/I9VTE3 CVE: NA
Reference: https://lore.kernel.org/linux-xfs/20240705162450.3481169-1-john.g.garry@orac...
--------------------------------
When we do a large allocation, the core free space allocation code assumes that args->maxlen is aligned to args->prod/args->mod. hence if we get a maximum sized extent allocated, it does not do tail alignment of the extent.
However, this assumes that nothing modifies args->maxlen between the original allocation context setup and trimming the selected free space extent to size. This assumption has recently been found to be invalid - xfs_alloc_space_available() modifies args->maxlen in low space situations - and there may be more situations we haven't yet found like this.
Force aligned allocation introduces the requirement that extents are correctly tail aligned, resulting in this occasional latent alignment failure to be reclassified from an unimportant curiousity to a must-fix bug.
Removing the assumption about args->maxlen allocations always being tail aligned is trivial, and should not impact anything because args->maxlen for inodes with extent size hints configured are already aligned. Hence all this change does it avoid weird corner cases that would have resulted in unaligned extent sizes by always trimming the extent down to an aligned size.
Signed-off-by: Dave Chinner dchinner@redhat.com Reviewed-by: "Darrick J. Wong" djwong@kernel.org [provisional on v1 series comment] Signed-off-by: John Garry john.g.garry@oracle.com Signed-off-by: Long Li leo.lilong@huawei.com --- fs/xfs/libxfs/xfs_alloc.c | 12 +++++------- 1 file changed, 5 insertions(+), 7 deletions(-)
diff --git a/fs/xfs/libxfs/xfs_alloc.c b/fs/xfs/libxfs/xfs_alloc.c index 19d89bf19b48..23c0e666d2f4 100644 --- a/fs/xfs/libxfs/xfs_alloc.c +++ b/fs/xfs/libxfs/xfs_alloc.c @@ -408,20 +408,18 @@ xfs_alloc_compute_diff( * Fix up the length, based on mod and prod. * len should be k * prod + mod for some k. * If len is too small it is returned unchanged. - * If len hits maxlen it is left alone. */ -STATIC void +static void xfs_alloc_fix_len( - xfs_alloc_arg_t *args) /* allocation argument structure */ + struct xfs_alloc_arg *args) { - xfs_extlen_t k; - xfs_extlen_t rlen; + xfs_extlen_t k; + xfs_extlen_t rlen = args->len;
ASSERT(args->mod < args->prod); - rlen = args->len; ASSERT(rlen >= args->minlen); ASSERT(rlen <= args->maxlen); - if (args->prod <= 1 || rlen < args->mod || rlen == args->maxlen || + if (args->prod <= 1 || rlen < args->mod || (args->mod == 0 && rlen < args->prod)) return; k = rlen % args->prod;
From: Dave Chinner dchinner@redhat.com
maillist inclusion category: bugfix bugzilla: https://gitee.com/openeuler/kernel/issues/I9VTE3 CVE: NA
Reference: https://lore.kernel.org/linux-xfs/20240705162450.3481169-1-john.g.garry@orac...
--------------------------------
If args->minlen is not aligned to the constraints of forced alignment, we may do minlen allocations that are not aligned when we approach ENOSPC. Avoid this by always aligning args->minlen appropriately. If alignment of minlen results in a value smaller than the alignment constraint, fail the allocation immediately.
Signed-off-by: Dave Chinner dchinner@redhat.com Reviewed-by: "Darrick J. Wong" djwong@kernel.org Signed-off-by: John Garry john.g.garry@oracle.com Conflicts: fs/xfs/libxfs/xfs_bmap.c [Conflicts in xfs_bmap_select_minlen()] Signed-off-by: Long Li leo.lilong@huawei.com --- fs/xfs/libxfs/xfs_bmap.c | 31 ++++++++++++++++++++++++------- 1 file changed, 24 insertions(+), 7 deletions(-)
diff --git a/fs/xfs/libxfs/xfs_bmap.c b/fs/xfs/libxfs/xfs_bmap.c index 7682dfe2f701..284cc73b8bef 100644 --- a/fs/xfs/libxfs/xfs_bmap.c +++ b/fs/xfs/libxfs/xfs_bmap.c @@ -3253,32 +3253,47 @@ xfs_bmap_longest_free_extent( return error; }
-static void +static int xfs_bmap_select_minlen( struct xfs_bmalloca *ap, struct xfs_alloc_arg *args, xfs_extlen_t *blen, int notinit) { + xfs_extlen_t nlen = 0; + if (notinit || *blen < ap->minlen) { /* * Since we did a BUF_TRYLOCK above, it is possible that * there is space for this request. */ - args->minlen = ap->minlen; + nlen = ap->minlen; } else if (*blen < args->maxlen) { /* * If the best seen length is less than the request length, * use the best as the minimum. */ - args->minlen = *blen; + + nlen = *blen; } else { /* * Otherwise we've seen an extent as big as maxlen, use that * as the minimum. */ - args->minlen = args->maxlen; + nlen = args->maxlen; } + + if (args->alignment > 1) { + nlen = rounddown(nlen, args->alignment); + if (nlen < ap->minlen) { + if (xfs_inode_forcealign(ap->ip) && + (ap->datatype & XFS_ALLOC_USERDATA)) + return -ENOSPC; + nlen = ap->minlen; + } + } + args->minlen = nlen; + return 0; }
STATIC int @@ -3311,8 +3326,8 @@ xfs_bmap_btalloc_nullfb( break; }
- xfs_bmap_select_minlen(ap, args, blen, notinit); - return 0; + error = xfs_bmap_select_minlen(ap, args, blen, notinit); + return error; }
STATIC int @@ -3349,7 +3364,9 @@ xfs_bmap_btalloc_filestreams(
}
- xfs_bmap_select_minlen(ap, args, blen, notinit); + error = xfs_bmap_select_minlen(ap, args, blen, notinit); + if (error) + return error;
/* * Set the failure fallback case to look in the selected AG as stream
From: John Garry john.g.garry@oracle.com
maillist inclusion category: bugfix bugzilla: https://gitee.com/openeuler/kernel/issues/I9VTE3 CVE: NA
Reference: https://lore.kernel.org/linux-xfs/20240705162450.3481169-1-john.g.garry@orac...
--------------------------------
In xfs_bmap_process_allocated_extent(), for when we found that we could not provide the requested length completely, the mapping is moved so that we can provide as much as possible for the original request.
For forcealign, this would mean ignoring alignment guaranteed, so don't do this.
Reviewed-by: "Darrick J. Wong" djwong@kernel.org Signed-off-by: John Garry john.g.garry@oracle.com Signed-off-by: Long Li leo.lilong@huawei.com --- fs/xfs/libxfs/xfs_bmap.c | 10 ++++++---- 1 file changed, 6 insertions(+), 4 deletions(-)
diff --git a/fs/xfs/libxfs/xfs_bmap.c b/fs/xfs/libxfs/xfs_bmap.c index 284cc73b8bef..35728779b0d6 100644 --- a/fs/xfs/libxfs/xfs_bmap.c +++ b/fs/xfs/libxfs/xfs_bmap.c @@ -3686,10 +3686,12 @@ xfs_bmap_btalloc( * very fragmented so we're unlikely to be able to satisfy the * hints anyway. */ - if (ap->length <= orig_length) - ap->offset = orig_offset; - else if (ap->offset + ap->length < orig_offset + orig_length) - ap->offset = orig_offset + orig_length - ap->length; + if (!(xfs_inode_forcealign(ap->ip) && align)) { + if (ap->length <= orig_length) + ap->offset = orig_offset; + else if (ap->offset + ap->length < orig_offset + orig_length) + ap->offset = orig_offset + orig_length - ap->length; + } xfs_bmap_btalloc_accounting(ap, &args); } else { ap->blkno = NULLFSBLOCK;
hulk inclusion category: bugfix bugzilla: https://gitee.com/openeuler/kernel/issues/I9VTE3 CVE: NA
--------------------------------
When forced allocation alignment is specified, the extent will be aligned to the extent size hint size rather than stripe alignment. If aligned allocation cannot be done, then the allocation is failed rather than attempting non-aligned fallbacks.
Signed-off-by: Long Li leo.lilong@huawei.com --- fs/xfs/libxfs/xfs_bmap.c | 25 +++++++++++++++++++------ 1 file changed, 19 insertions(+), 6 deletions(-)
diff --git a/fs/xfs/libxfs/xfs_bmap.c b/fs/xfs/libxfs/xfs_bmap.c index 35728779b0d6..89f843a63842 100644 --- a/fs/xfs/libxfs/xfs_bmap.c +++ b/fs/xfs/libxfs/xfs_bmap.c @@ -3565,7 +3565,13 @@ xfs_bmap_btalloc( * is only set if the allocation length is >= the stripe unit and the * allocation offset is at the end of file. */ - if (!(ap->tp->t_flags & XFS_TRANS_LOWMODE) && ap->aeof) { + args.minalignslop = 0; + if (ap->tp->t_flags & XFS_TRANS_LOWMODE) { + if (args.alignment > 1 && xfs_inode_forcealign(ap->ip)) { + args.fsbno = NULLFSBLOCK; + goto alloc_out; + } + } else if (ap->aeof) { if (!ap->offset) { args.alignment = stripe_align; atype = args.type; @@ -3577,7 +3583,6 @@ xfs_bmap_btalloc( if (blen > args.alignment && blen <= args.maxlen + args.alignment) args.minlen = blen - args.alignment; - args.minalignslop = 0; } else { /* * First try an exact bno allocation. @@ -3604,8 +3609,6 @@ xfs_bmap_btalloc( else args.minalignslop = 0; } - } else { - args.minalignslop = 0; } args.postallocs = 1; args.minleft = ap->minleft; @@ -3632,8 +3635,16 @@ xfs_bmap_btalloc( return error; }
- if (isaligned && args.fsbno == NULLFSBLOCK && - (args.alignment <= 1 || !xfs_inode_forcealign(ap->ip))) { + if (args.fsbno == NULLFSBLOCK && args.alignment > 1 && + xfs_inode_forcealign(ap->ip)) { + /* + * Don't attempting non-aligned fallbacks alloc + * for forcealign + */ + goto alloc_out; + } + + if (isaligned && args.fsbno == NULLFSBLOCK) { /* * allocation failed, so turn off alignment and * try again. @@ -3660,6 +3671,8 @@ xfs_bmap_btalloc( return error; ap->tp->t_flags |= XFS_TRANS_LOWMODE; } + +alloc_out: if (args.fsbno != NULLFSBLOCK) { /* * check the allocation happened at the same or higher AG than
hulk inclusion category: bugfix bugzilla: https://gitee.com/openeuler/kernel/issues/I9VTE3 CVE: NA
--------------------------------
We currently align extent allocation to stripe unit or stripe width. That is specified by an external parameter to the allocation code, which then manipulates the xfs_alloc_args alignment configuration in interesting ways.
The args->alignment field specifies extent start alignment, but because we may be attempting non-aligned allocation first there are also slop variables that allow for those allocation attempts to account for aligned allocation if they fail.
This gets much more complex as we introduce forced allocation alignment, where extent size hints are used to generate the extent start alignment. extent size hints currently only affect extent lengths (via args->prod and args->mod) and so with this change we will have two different start alignment conditions.
Avoid this complexity by always using args->alignment to indicate extent start alignment, and always using args->prod/mod to indicate extent length adjustment.
Signed-off-by: Long Li leo.lilong@huawei.com --- fs/xfs/libxfs/xfs_bmap.c | 78 ++++++++++++++-------------------------- 1 file changed, 27 insertions(+), 51 deletions(-)
diff --git a/fs/xfs/libxfs/xfs_bmap.c b/fs/xfs/libxfs/xfs_bmap.c index 89f843a63842..9e6a609450fe 100644 --- a/fs/xfs/libxfs/xfs_bmap.c +++ b/fs/xfs/libxfs/xfs_bmap.c @@ -3262,6 +3262,10 @@ xfs_bmap_select_minlen( { xfs_extlen_t nlen = 0;
+ /* Adjust best length for extent start alignment. */ + if (*blen > args->alignment) + *blen -= args->alignment; + if (notinit || *blen < ap->minlen) { /* * Since we did a BUF_TRYLOCK above, it is possible that @@ -3436,9 +3440,8 @@ xfs_bmap_btalloc( xfs_fileoff_t orig_offset; xfs_extlen_t orig_length; xfs_extlen_t blen; - xfs_extlen_t nextminlen = 0; + xfs_extlen_t alignment; int nullfb; /* true if ap->firstblock isn't set */ - int isaligned; int tryagain; int error; int stripe_align; @@ -3497,7 +3500,7 @@ xfs_bmap_btalloc( /* * Normal allocation, done through xfs_alloc_vextent. */ - tryagain = isaligned = 0; + tryagain = 0; memset(&args, 0, sizeof(args)); args.tp = ap->tp; args.mp = mp; @@ -3508,13 +3511,12 @@ xfs_bmap_btalloc( * xfs_get_cowextsz_hint() returns extsz_hint for when forcealign is * set as forcealign and cowextsz_hint are mutually exclusive */ - if (xfs_inode_forcealign(ap->ip) && align) { + if (xfs_inode_forcealign(ap->ip)) args.alignment = align; - if (stripe_align == 0 || stripe_align % align) - stripe_align = align; - } else { + else if (stripe_align) + args.alignment = stripe_align; + else args.alignment = 1; - }
/* Trim the allocation back to the maximum an AG can fit. */ args.maxlen = min(ap->length, mp->m_ag_max_usable); @@ -3571,44 +3573,21 @@ xfs_bmap_btalloc( args.fsbno = NULLFSBLOCK; goto alloc_out; } - } else if (ap->aeof) { - if (!ap->offset) { - args.alignment = stripe_align; - atype = args.type; - isaligned = 1; - /* - * Adjust minlen to try and preserve alignment if we - * can't guarantee an aligned maxlen extent. - */ - if (blen > args.alignment && - blen <= args.maxlen + args.alignment) - args.minlen = blen - args.alignment; - } else { - /* - * First try an exact bno allocation. - * If it fails then do a near or start bno - * allocation with alignment turned on. - */ - atype = args.type; - tryagain = 1; - args.type = XFS_ALLOCTYPE_THIS_BNO; - /* - * Compute the minlen+alignment for the - * next case. Set slop so that the value - * of minlen+alignment+slop doesn't go up - * between the calls. - */ - if (blen > stripe_align && blen <= args.maxlen) - nextminlen = blen - stripe_align; - else - nextminlen = args.minlen; - if (nextminlen + stripe_align > args.minlen + 1) - args.minalignslop = - nextminlen + stripe_align - - args.minlen - 1; - else - args.minalignslop = 0; - } + args.alignment = 1; + } else if (ap->aeof && ap->offset) { + /* + * First try an exact bno allocation. + * If it fails then do a near or start bno + * allocation with alignment turned on. + */ + alignment = args.alignment; + atype = args.type; + tryagain = 1; + args.type = XFS_ALLOCTYPE_THIS_BNO; + args.fsbno = ap->blkno; + + args.alignment = 1; + args.minalignslop = alignment - args.alignment; } args.postallocs = 1; args.minleft = ap->minleft; @@ -3627,10 +3606,8 @@ xfs_bmap_btalloc( */ args.type = atype; args.fsbno = ap->blkno; - args.alignment = stripe_align; - args.minlen = nextminlen; + args.alignment = alignment; args.minalignslop = 0; - isaligned = 1; if ((error = xfs_alloc_vextent(&args))) return error; } @@ -3644,12 +3621,11 @@ xfs_bmap_btalloc( goto alloc_out; }
- if (isaligned && args.fsbno == NULLFSBLOCK) { + if (args.alignment > 1 && args.fsbno == NULLFSBLOCK) { /* * allocation failed, so turn off alignment and * try again. */ - args.type = atype; args.fsbno = ap->blkno; args.alignment = 0; if ((error = xfs_alloc_vextent(&args)))
hulk inclusion category: bugfix bugzilla: https://gitee.com/openeuler/kernel/issues/I9VTE3 CVE: NA
--------------------------------
Reflink will not be supported for forcealign yet, so disallow a mount under this condition. This is because we have the limitation of pageache writeback not knowing how to writeback an entire allocation unut, after covert extent form cowfork to datafork, force alignment constraints may be break, so reject a mount with relink.
RT vol will not be supported for forcealign yet, so disallow a mount under this condition. It will be possible to support RT vol and forcealign in future. For this, the inode extsize must be a multiple of rtextsize - this is enforced already in xfs_ioctl_setattr_check_extsize() and xfs_inode_validate_extsize().
Signed-off-by: Long Li leo.lilong@huawei.com --- fs/xfs/xfs_super.c | 19 ++++++++++++++++++- 1 file changed, 18 insertions(+), 1 deletion(-)
diff --git a/fs/xfs/xfs_super.c b/fs/xfs/xfs_super.c index d43f76a4b99a..f2ff547e760c 100644 --- a/fs/xfs/xfs_super.c +++ b/fs/xfs/xfs_super.c @@ -1658,10 +1658,19 @@ xfs_fc_fill_super( } }
- if (xfs_has_forcealign(mp)) + if (xfs_has_forcealign(mp)) { xfs_warn(mp, "EXPERIMENTAL forced data extent alignment feature in use. Use at your own risk!");
+ if (xfs_has_realtime(mp)) { + xfs_alert(mp, + "forcealign not supported for realtime device!"); + error = -EINVAL; + goto out_filestream_unmount; + } + + } + if (xfs_has_atomicwrites(mp)) xfs_warn(mp, "EXPERIMENTAL atomicwrites feature in use. Use at your own risk!"); @@ -1674,6 +1683,14 @@ xfs_fc_fill_super( goto out_filestream_unmount; }
+ if (xfs_has_forcealign(mp)) { + xfs_alert(mp, + "reflink not compatible with forcealign!"); + error = -EINVAL; + goto out_filestream_unmount; + } + + if (xfs_globals.always_cow) { xfs_info(mp, "using DEBUG-only always_cow mode."); mp->m_always_cow = true;
hulk inclusion category: bugfix bugzilla: https://gitee.com/openeuler/kernel/issues/I9VTE3 CVE: NA
--------------------------------
We only need bunmap align in datafork for forcealign, so fix it add extra datafork check.
Signed-off-by: Long Li leo.lilong@huawei.com --- fs/xfs/libxfs/xfs_bmap.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/fs/xfs/libxfs/xfs_bmap.c b/fs/xfs/libxfs/xfs_bmap.c index 9e6a609450fe..1323259192d6 100644 --- a/fs/xfs/libxfs/xfs_bmap.c +++ b/fs/xfs/libxfs/xfs_bmap.c @@ -5297,7 +5297,7 @@ __xfs_bunmapi( isrt = (whichfork == XFS_DATA_FORK) && XFS_IS_REALTIME_INODE(ip); end = start + len; if (xfs_inode_forcealign(ip) && ip->i_d.di_extsize > 1 - && S_ISREG(VFS_I(ip)->i_mode)) { + && S_ISREG(VFS_I(ip)->i_mode) && whichfork == XFS_DATA_FORK) { start = roundup_64(start, ip->i_d.di_extsize); end = rounddown_64(end, ip->i_d.di_extsize); len = end - start;
From: Zhang Yi yi.zhang@huawei.com
maillist inclusion category: bugfix bugzilla: https://gitee.com/openeuler/kernel/issues/I9VTE3 CVE: NA
Reference: https://lore.kernel.org/linux-fsdevel/20240529095206.2568162-1-yi.zhang@huaw...
--------------------------------
Add a new helper rem_u64() to only get the remainder of unsigned 64bit divide with 32bit divisor.
Signed-off-by: Zhang Yi yi.zhang@huawei.com Signed-off-by: Long Li leo.lilong@huawei.com --- include/linux/math64.h | 24 ++++++++++++++++++++++++ 1 file changed, 24 insertions(+)
diff --git a/include/linux/math64.h b/include/linux/math64.h index 66deb1fdc2ef..b5c4d1df08e5 100644 --- a/include/linux/math64.h +++ b/include/linux/math64.h @@ -3,6 +3,7 @@ #define _LINUX_MATH64_H
#include <linux/types.h> +#include <linux/log2.h> #include <vdso/math64.h> #include <asm/div64.h>
@@ -11,6 +12,20 @@ #define div64_long(x, y) div64_s64((x), (y)) #define div64_ul(x, y) div64_u64((x), (y))
+/** + * rem_u64 - remainder of unsigned 64bit divide with 32bit divisor + * @dividend: unsigned 64bit dividend + * @divisor: unsigned 32bit divisor + * + * Return: dividend % divisor + */ +static inline u32 rem_u64(u64 dividend, u32 divisor) +{ + if (is_power_of_2(divisor)) + return dividend & (divisor - 1); + return dividend % divisor; +} + /** * div_u64_rem - unsigned 64bit divide with 32bit divisor with remainder * @dividend: unsigned 64bit dividend @@ -85,6 +100,15 @@ static inline s64 div64_s64(s64 dividend, s64 divisor) #define div64_long(x, y) div_s64((x), (y)) #define div64_ul(x, y) div_u64((x), (y))
+#ifndef rem_u64 +static inline u32 rem_u64(u64 dividend, u32 divisor) +{ + if (is_power_of_2(divisor)) + return dividend & (divisor - 1); + return do_div(dividend, divisor); +} +#endif + #ifndef div_u64_rem static inline u64 div_u64_rem(u64 dividend, u32 divisor, u32 *remainder) {
From: Zhang Yi yi.zhang@huawei.com
maillist inclusion category: bugfix bugzilla: https://gitee.com/openeuler/kernel/issues/I9VTE3 CVE: NA
Reference: https://lore.kernel.org/linux-fsdevel/20240529095206.2568162-1-yi.zhang@huaw...
--------------------------------
iomap_truncate_page() always assumes the block size of the truncating inode is i_blocksize(), this is not always true for some filesystems, e.g. XFS does extent size alignment for realtime inodes. Drop this assumption and pass the block size for zeroing into iomap_truncate_page(), allow filesystems to indicate the correct block size.
Suggested-by: Dave Chinner david@fromorbit.com Signed-off-by: Zhang Yi yi.zhang@huawei.com Signed-off-by: Long Li leo.lilong@huawei.com --- fs/iomap/buffered-io.c | 8 ++++---- fs/xfs/xfs_iops.c | 5 +++-- include/linux/iomap.h | 4 ++-- 3 files changed, 9 insertions(+), 8 deletions(-)
diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c index 45471ee7e919..0bb3257cba42 100644 --- a/fs/iomap/buffered-io.c +++ b/fs/iomap/buffered-io.c @@ -17,6 +17,7 @@ #include <linux/bio.h> #include <linux/sched/signal.h> #include <linux/migrate.h> +#include <linux/math64.h> #include "trace.h"
#include "../internal.h" @@ -1044,11 +1045,10 @@ iomap_zero_range(struct inode *inode, loff_t pos, loff_t len, bool *did_zero, EXPORT_SYMBOL_GPL(iomap_zero_range);
int -iomap_truncate_page(struct inode *inode, loff_t pos, bool *did_zero, - const struct iomap_ops *ops) +iomap_truncate_page(struct inode *inode, loff_t pos, unsigned int blocksize, + bool *did_zero, const struct iomap_ops *ops) { - unsigned int blocksize = i_blocksize(inode); - unsigned int off = pos & (blocksize - 1); + unsigned int off = rem_u64(pos, blocksize);
/* Block boundary? Nothing to do */ if (!off) diff --git a/fs/xfs/xfs_iops.c b/fs/xfs/xfs_iops.c index a527a544a684..092fb02d1a13 100644 --- a/fs/xfs/xfs_iops.c +++ b/fs/xfs/xfs_iops.c @@ -769,6 +769,7 @@ xfs_setattr_size( int error; uint lock_flags = 0; bool did_zeroing = false; + unsigned int blocksize = i_blocksize(inode);
ASSERT(xfs_isilocked(ip, XFS_IOLOCK_EXCL)); ASSERT(xfs_isilocked(ip, XFS_MMAPLOCK_EXCL)); @@ -830,8 +831,8 @@ xfs_setattr_size( newsize); if (error) return error; - error = iomap_truncate_page(inode, newsize, &did_zeroing, - &xfs_buffered_write_iomap_ops); + error = iomap_truncate_page(inode, newsize, blocksize, + &did_zeroing, &xfs_buffered_write_iomap_ops); }
if (error) diff --git a/include/linux/iomap.h b/include/linux/iomap.h index d14a729d40ce..1b6e22741d43 100644 --- a/include/linux/iomap.h +++ b/include/linux/iomap.h @@ -207,8 +207,8 @@ int iomap_file_unshare(struct inode *inode, loff_t pos, loff_t len, const struct iomap_ops *ops); int iomap_zero_range(struct inode *inode, loff_t pos, loff_t len, bool *did_zero, const struct iomap_ops *ops); -int iomap_truncate_page(struct inode *inode, loff_t pos, bool *did_zero, - const struct iomap_ops *ops); +int iomap_truncate_page(struct inode *inode, loff_t pos, unsigned int blocksize, + bool *did_zero, const struct iomap_ops *ops); vm_fault_t iomap_page_mkwrite(struct vm_fault *vmf, const struct iomap_ops *ops); int iomap_fiemap(struct inode *inode, struct fiemap_extent_info *fieinfo,
From: Zhang Yi yi.zhang@huawei.com
maillist inclusion category: bugfix bugzilla: https://gitee.com/openeuler/kernel/issues/I9VTE3 CVE: NA
Reference: https://lore.kernel.org/linux-fsdevel/20240529095206.2568162-1-yi.zhang@huaw...
--------------------------------
When truncating down an inode, we call xfs_truncate_page() to zero out the tail partial block that beyond new EOF, which prevents exposing stale data. But xfs_truncate_page() always assumes the blocksize is i_blocksize(inode), it's not always true if we have a large allocation unit for a file and we should aligned to this unitsize, e.g. realtime inode should aligned to the rtextsize.
Current xfs_setattr_size() can't support zeroing out a large alignment size on trucate down since the process order is wrong. We first do zero out through xfs_truncate_page(), and then update inode size through truncate_setsize() immediately. If the zeroed range is larger than a folio, the write back path would not write back zeroed pagecache beyond the EOF folio, so it doesn't write zeroes to the entire tail extent and could expose stale data after an appending write into the next aligned extent.
We need to adjust the order to zero out tail aligned blocks, write back zeroed or cached data, update i_size and drop cache beyond aligned EOF block, preparing for the fix of realtime inode and supporting the upcoming forced alignment feature.
Signed-off-by: Zhang Yi yi.zhang@huawei.com Signed-off-by: Long Li leo.lilong@huawei.com --- fs/xfs/xfs_iops.c | 117 ++++++++++++++++++++++++---------------------- 1 file changed, 61 insertions(+), 56 deletions(-)
diff --git a/fs/xfs/xfs_iops.c b/fs/xfs/xfs_iops.c index 092fb02d1a13..d8caa2f495ff 100644 --- a/fs/xfs/xfs_iops.c +++ b/fs/xfs/xfs_iops.c @@ -769,7 +769,7 @@ xfs_setattr_size( int error; uint lock_flags = 0; bool did_zeroing = false; - unsigned int blocksize = i_blocksize(inode); + bool write_back = false;
ASSERT(xfs_isilocked(ip, XFS_IOLOCK_EXCL)); ASSERT(xfs_isilocked(ip, XFS_MMAPLOCK_EXCL)); @@ -806,21 +806,10 @@ xfs_setattr_size( */ inode_dio_wait(inode);
- /* - * File data changes must be complete before we start the transaction to - * modify the inode. This needs to be done before joining the inode to - * the transaction because the inode cannot be unlocked once it is a - * part of the transaction. - * - * Start with zeroing any data beyond EOF that we may expose on file - * extension, or zeroing out the rest of the block on a downward - * truncate. - */ - if (newsize > oldsize) { - trace_xfs_zero_eof(ip, oldsize, newsize - oldsize); - error = iomap_zero_range(inode, oldsize, newsize - oldsize, - &did_zeroing, &xfs_buffered_write_iomap_ops); - } else { + write_back = newsize > ip->i_d.di_size && oldsize != ip->i_d.di_size; + if (newsize < oldsize) { + unsigned int blocksize = i_blocksize(inode); + /* * iomap won't detect a dirty page over an unwritten block (or a * cow block over a hole) and subsequently skips zeroing the @@ -828,53 +817,69 @@ xfs_setattr_size( * convert the block before the pagecache truncate. */ error = filemap_write_and_wait_range(inode->i_mapping, newsize, - newsize); + roundup_64(newsize, blocksize) - 1); if (error) return error; + error = iomap_truncate_page(inode, newsize, blocksize, &did_zeroing, &xfs_buffered_write_iomap_ops); - } + if (error) + return error; + /* + * We are going to log the inode size change in this transaction + * so any previous writes that are beyond the on disk EOF and + * the new EOF that have not been written out need to be written + * here. If we do not write the data out, we expose ourselves + * to the null files problem. Note that this includes any block + * zeroing we did above; otherwise those blocks may not be + * zeroed after a crash. + */ + if (did_zeroing || write_back) { + error = filemap_write_and_wait_range(inode->i_mapping, + min_t(loff_t, ip->i_d.di_size, newsize), + roundup_64(newsize, blocksize) - 1); + if (error) + return error; + }
- if (error) - return error; + /* + * Updating i_size after writing back to make sure the zeroed + * blocks could been written out, and drop all the page cache + * range that beyond blocksize aligned new EOF block. + * + * We've already locked out new page faults, so now we can + * safely remove pages from the page cache knowing they won't + * get refaulted until we drop the XFS_MMAP_EXCL lock after the + * extent manipulations are complete. + */ + i_size_write(inode, newsize); + truncate_pagecache(inode, roundup_64(newsize, blocksize)); + } else { + /* + * Start with zeroing any data beyond EOF that we may expose on + * file extension. + */ + if (newsize > oldsize) { + trace_xfs_zero_eof(ip, oldsize, newsize - oldsize); + error = iomap_zero_range(inode, oldsize, newsize - oldsize, + &did_zeroing, &xfs_buffered_write_iomap_ops); + if (error) + return error; + }
- /* - * We've already locked out new page faults, so now we can safely remove - * pages from the page cache knowing they won't get refaulted until we - * drop the XFS_MMAP_EXCL lock after the extent manipulations are - * complete. The truncate_setsize() call also cleans partial EOF page - * PTEs on extending truncates and hence ensures sub-page block size - * filesystems are correctly handled, too. - * - * We have to do all the page cache truncate work outside the - * transaction context as the "lock" order is page lock->log space - * reservation as defined by extent allocation in the writeback path. - * Hence a truncate can fail with ENOMEM from xfs_trans_alloc(), but - * having already truncated the in-memory version of the file (i.e. made - * user visible changes). There's not much we can do about this, except - * to hope that the caller sees ENOMEM and retries the truncate - * operation. - * - * And we update in-core i_size and truncate page cache beyond newsize - * before writeback the [di_size, newsize] range, so we're guaranteed - * not to write stale data past the new EOF on truncate down. - */ - truncate_setsize(inode, newsize); + /* + * The truncate_setsize() call also cleans partial EOF page + * PTEs on extending truncates and hence ensures sub-page block + * size filesystems are correctly handled, too. + */ + truncate_setsize(inode, newsize);
- /* - * We are going to log the inode size change in this transaction so - * any previous writes that are beyond the on disk EOF and the new - * EOF that have not been written out need to be written here. If we - * do not write the data out, we expose ourselves to the null files - * problem. Note that this includes any block zeroing we did above; - * otherwise those blocks may not be zeroed after a crash. - */ - if (did_zeroing || - (newsize > ip->i_d.di_size && oldsize != ip->i_d.di_size)) { - error = filemap_write_and_wait_range(VFS_I(ip)->i_mapping, - ip->i_d.di_size, newsize - 1); - if (error) - return error; + if (did_zeroing || write_back) { + error = filemap_write_and_wait_range(inode->i_mapping, + ip->i_d.di_size, newsize - 1); + if (error) + return error; + } }
error = xfs_trans_alloc(mp, &M_RES(mp)->tr_itruncate, 0, 0, 0, &tp);
hulk inclusion category: bugfix bugzilla: https://gitee.com/openeuler/kernel/issues/I9VTE3 CVE: NA
--------------------------------
Now xfs_itruncate_extents() unmap extsize aligned extents for forcealign, when unaligned truncating down a forceflign file which extsize is bigger than one block, xfs_truncate_page() only zeros out the tail EOF block, this could expose stale data.
If we truncate file that contains a large enough written extent:
|< ext >|< ext >| ...WWWWWWWWWWWWWWWWWWWWWzzzzzzzzzzzz ^ (new EOF) ^ old EOF
Since we only zeros out the tail of the EOF block, and xfs_itruncate_extents() unmap the whole ailgned extents, it becomes this state:
|< ext >| ...WWWzWWWWWWWWWWWWW ^ new EOF
Then if we do an extending write like this, the blocks in the previous tail extent becomes stale:
|< ext >| |< ext >| ...WWWzSSSSSSSSSSSSS......WWWWWWWWWWWzzzzzz ^ old EOF ^ append start ^ new EOF
Fix this by zeroing out the tail allocation uint for forcealign.
Signed-off-by: Long Li leo.lilong@huawei.com --- fs/xfs/xfs_iops.c | 8 ++++++-- 1 file changed, 6 insertions(+), 2 deletions(-)
diff --git a/fs/xfs/xfs_iops.c b/fs/xfs/xfs_iops.c index d8caa2f495ff..30dc960951ca 100644 --- a/fs/xfs/xfs_iops.c +++ b/fs/xfs/xfs_iops.c @@ -770,6 +770,7 @@ xfs_setattr_size( uint lock_flags = 0; bool did_zeroing = false; bool write_back = false; + unsigned int blocksize = 0;
ASSERT(xfs_isilocked(ip, XFS_IOLOCK_EXCL)); ASSERT(xfs_isilocked(ip, XFS_MMAPLOCK_EXCL)); @@ -777,6 +778,11 @@ xfs_setattr_size( ASSERT((iattr->ia_valid & (ATTR_UID|ATTR_GID|ATTR_ATIME|ATTR_ATIME_SET| ATTR_MTIME_SET|ATTR_TIMES_SET)) == 0);
+ if (xfs_inode_forcealign(ip) && ip->i_d.di_extsize > 1) + blocksize = ip->i_d.di_extsize << i_blocksize(inode); + else + blocksize = i_blocksize(inode); + oldsize = inode->i_size; newsize = iattr->ia_size;
@@ -808,8 +814,6 @@ xfs_setattr_size(
write_back = newsize > ip->i_d.di_size && oldsize != ip->i_d.di_size; if (newsize < oldsize) { - unsigned int blocksize = i_blocksize(inode); - /* * iomap won't detect a dirty page over an unwritten block (or a * cow block over a hole) and subsequently skips zeroing the
反馈: 您发送到kernel@openeuler.org的补丁/补丁集,已成功转换为PR! PR链接地址: https://gitee.com/openeuler/kernel/pulls/12187 邮件列表地址:https://mailweb.openeuler.org/hyperkitty/list/kernel@openeuler.org/message/5...
FeedBack: The patch(es) which you have sent to kernel@openeuler.org mailing list has been converted to a pull request successfully! Pull request link: https://gitee.com/openeuler/kernel/pulls/12187 Mailing list address: https://mailweb.openeuler.org/hyperkitty/list/kernel@openeuler.org/message/5...