*** BLURB HERE *** This patch set fix block space problems
Gao Xiang (1): xfs: account extra freespace btree splits for multiple allocations
Guo Xuenan (1): xfs: set minleft correctly for randomly sparse inode allocations
yangerkun (2): xfs: fix xfs shutdown since we reserve more blocks in agfl fixup xfs: longest free extent no need consider postalloc
fs/xfs/libxfs/xfs_alloc.c | 49 ++++++++++++++++++++++++++++++++------ fs/xfs/libxfs/xfs_alloc.h | 1 + fs/xfs/libxfs/xfs_bmap.c | 1 + fs/xfs/libxfs/xfs_ialloc.c | 3 +++ fs/xfs/xfs_mount.c | 9 +++++++ 5 files changed, 56 insertions(+), 7 deletions(-)
From: Gao Xiang hsiangkao@linux.alibaba.com
hulk inclusion category: bugfix bugzilla: https://gitee.com/openeuler/kernel/issues/I8TRWW CVE: NA
--------------------------------
There is a long standing issue which could cause fs shutdown due to inode extent-to-btree conversion failure right after an extent allocation in the same AG, which is absolutely unexpected due to the proper minleft reservation in the previous allocation. Brian once addressed one of the root cause [1], however, such symptom can still occur after the commit is merged as reported [2], and our cloud environment is also suffering from this issue.
From the description of the commit [1], I found that Zirong has an
in-house stress test reproducer for this issue, therefore I asked him to reproduce again and he confirmed that such issue can still be reproduced on RHEL 9 in several days.
Thanks to him, after adding some debugging code to dump the current transaction log items, I think the root cause is as below:
1. xfs_bmapi_allocate() with the following condition: freeblks: 18304 pagf_flcount: 6 reservation: 18276 need (min_free): 6 args->minleft: 1 available = freeblks + agflcount - reservation - need - minleft = 18304 + min(6, 6) - 18276 - 6 - 1 = 27 The first allocation check itself is ok, and args->maxlen = 27 here
At this time, AG 3 also has the following state: 1st:64 last:69 cnt:6 longest:6395
AGFL has the following state: 64:547 65:167 66:1651 67:2040807 68:783 69:604
2. Tried to get 27 blocks from this AG, but in order to finish such allocation, it had to need a new btree block for cntbt (so take another free block from agfl). It can be seen with a new AGF recorded in the transaction: blkno 62914177, len 1, map_size 1 00000000: 58 41 47 46 00 00 00 01 00 00 00 03 00 27 ff f0 XAGF.........'.. 00000010: 00 00 00 09 00 00 00 07 00 00 00 00 00 00 00 02 ................ 00000020: 00 00 00 02 00 00 00 00 00 00 00 41 00 00 00 45 ...........A...E 00000030: 00 00 00 05 00 00 47 65 00 00 18 fb 00 00 00 09 ......Ge........ 00000040: 75 dc c1 b5 1a 45 40 2a 80 50 72 f0 59 6e 62 66 u....E@*.Pr.Ynbf
It can be parsed as: agf 3 flfirst: 65 (0x41) fllast: 69 (0x45) cnt: 5 freeblks 18277
3. agfl 64 (agbno 547, daddr 62918552) was then written as a cntbt block, which can also be seen in a log item as below: type#011= 0x123c flags#011= 0x8 blkno 62918552, len 8, map_size 1 00000000: 41 42 33 43 00 00 00 fd 00 1f 23 e4 ff ff ff ff AB3C......#..... 00000010: 00 00 00 00 03 c0 0f 98 00 00 00 00 00 00 00 00 ................ 00000020: 75 dc c1 b5 1a 45 40 2a 80 50 72 f0 59 6e 62 66 u....E@*.Pr.Ynbf ...
4. Finally, the following inode extent to btree allocation fails as below: kernel: ------------[ cut here ]------------ WARNING: CPU: 15 PID: 49290 at fs/xfs/libxfs/xfs_bmap.c:717 xfs_bmap_extents_to_btree+0xc51/0x1050 [xfs] ... XFS (sda2): agno 3 agflcount 5 freeblks 18277 reservation 18276 6
since freeblks = 18304 - 27 = 18277, but with another agfl block allocated (pagf_flcount from 6 to 5), the inequality will not be satisfied:
available = freeblks + agflcount - reservation - need - minleft = 18277 + min(5, 6) - 18276 - 6 - 0 = 0 < 1
Full current transaction log item dump can be fetched from [3].
As a short-term solution, the following allocations (e.g. allocation for inode extent-to-btree conversion) can be recorded in order to count more blocks to reserve for safely freespace btree splits so that it will shorten available and args->maxlen to available = freeblks + agflcount - reservation - need - minleft = 18304 + min(6, 6) - 18276 - 6*2 - 1 = 21 args->maxlen = 21 in the first allocation, and the following conversion should then succeed. At least, it's easy to be backported and do hotfix.
In the long term, args->total and args->minleft have be revisited although it could cause more refactoring.
[1] commit 1ca89fbc48e1 ("xfs: don't account extra agfl blocks as available") https://lore.kernel.org/r/20190327145000.10756-1-bfoster@redhat.com [2] https://lore.kernel.org/r/20220105071052.GD20464@templeofstupid.com [3] https://lore.kernel.org/linux-xfs/Y2RevDyoeJZSpiat@B-P7TQMD6M-0146.local/2-d... Reported-by: Zirong Lang zlang@redhat.com Signed-off-by: Gao Xiang hsiangkao@linux.alibaba.com Signed-off-by: Guo Xuenan guoxuenan@huawei.com Signed-off-by: Long Li leo.lilong@huawei.com --- fs/xfs/libxfs/xfs_alloc.c | 9 +++++++-- fs/xfs/libxfs/xfs_alloc.h | 1 + fs/xfs/libxfs/xfs_bmap.c | 1 + fs/xfs/libxfs/xfs_ialloc.c | 1 + 4 files changed, 10 insertions(+), 2 deletions(-)
diff --git a/fs/xfs/libxfs/xfs_alloc.c b/fs/xfs/libxfs/xfs_alloc.c index 0246dc3e8bab..392d787ceb22 100644 --- a/fs/xfs/libxfs/xfs_alloc.c +++ b/fs/xfs/libxfs/xfs_alloc.c @@ -2645,7 +2645,12 @@ xfs_alloc_fix_freelist( goto out_agbp_relse; }
- need = xfs_alloc_min_freelist(mp, pag); + /* + * Also need to fulfill freespace btree splits by reservaing more + * blocks to perform multiple allocations from a single AG and + * transaction if needed. + */ + need = xfs_alloc_min_freelist(mp, pag) * (1 + args->postallocs); if (!xfs_alloc_space_available(args, need, alloc_flags | XFS_ALLOC_FLAG_CHECK)) goto out_agbp_relse; @@ -2669,7 +2674,7 @@ xfs_alloc_fix_freelist( xfs_agfl_reset(tp, agbp, pag);
/* If there isn't enough total space or single-extent, reject it. */ - need = xfs_alloc_min_freelist(mp, pag); + need = xfs_alloc_min_freelist(mp, pag) * (1 + args->postallocs); if (!xfs_alloc_space_available(args, need, alloc_flags)) goto out_agbp_relse;
diff --git a/fs/xfs/libxfs/xfs_alloc.h b/fs/xfs/libxfs/xfs_alloc.h index 6bb8d295c321..0bce6e1383c0 100644 --- a/fs/xfs/libxfs/xfs_alloc.h +++ b/fs/xfs/libxfs/xfs_alloc.h @@ -53,6 +53,7 @@ typedef struct xfs_alloc_arg { int datatype; /* mask defining data type treatment */ char wasdel; /* set if allocation was prev delayed */ char wasfromfl; /* set if allocation is from freelist */ + bool postallocs; /* number of post-allocations */ struct xfs_owner_info oinfo; /* owner of blocks being allocated */ enum xfs_ag_resv_type resv; /* block reservation to use */ #ifdef DEBUG diff --git a/fs/xfs/libxfs/xfs_bmap.c b/fs/xfs/libxfs/xfs_bmap.c index 30c931b38853..b332c72551a8 100644 --- a/fs/xfs/libxfs/xfs_bmap.c +++ b/fs/xfs/libxfs/xfs_bmap.c @@ -3692,6 +3692,7 @@ xfs_bmap_btalloc( .datatype = ap->datatype, .alignment = 1, .minalignslop = 0, + .postallocs = 1, }; xfs_fileoff_t orig_offset; xfs_extlen_t orig_length; diff --git a/fs/xfs/libxfs/xfs_ialloc.c b/fs/xfs/libxfs/xfs_ialloc.c index a51e2de3eb5a..ea2c9cf78235 100644 --- a/fs/xfs/libxfs/xfs_ialloc.c +++ b/fs/xfs/libxfs/xfs_ialloc.c @@ -652,6 +652,7 @@ xfs_ialloc_ag_alloc( int do_sparse = 0;
memset(&args, 0, sizeof(args)); + args.postallocs = 1; args.tp = tp; args.mp = tp->t_mountp; args.fsbno = NULLFSBLOCK;
From: Guo Xuenan guoxuenan@huawei.com
hulk inclusion category: bugfix bugzilla: https://gitee.com/openeuler/kernel/issues/I8TRWW CVE: NA
--------------------------------
In DEBUG mode may do sparse inode allocations randomly, but forget to set the remaining space correctly for the inode btree to split. It's OK for most cases, only under DEBUG mode and AG space is running out may bring something bad.
Fixes: 1cdadee11f8d ("xfs: randomly do sparse inode allocations in DEBUG mode") Signed-off-by: Guo Xuenan guoxuenan@huawei.com Signed-off-by: yangerkun yangerkun@huawei.com Signed-off-by: Long Li leo.lilong@huawei.com --- fs/xfs/libxfs/xfs_ialloc.c | 2 ++ 1 file changed, 2 insertions(+)
diff --git a/fs/xfs/libxfs/xfs_ialloc.c b/fs/xfs/libxfs/xfs_ialloc.c index ea2c9cf78235..dd90f30cd144 100644 --- a/fs/xfs/libxfs/xfs_ialloc.c +++ b/fs/xfs/libxfs/xfs_ialloc.c @@ -783,6 +783,8 @@ xfs_ialloc_ag_alloc( args.alignment = args.mp->m_sb.sb_spino_align; args.prod = 1;
+ /* Allow space for the inode btree to split */ + args.minleft = igeo->inobt_maxlevels; args.minlen = igeo->ialloc_min_blks; args.maxlen = args.minlen;
From: yangerkun yangerkun@huawei.com
hulk inclusion category: bugfix bugzilla: https://gitee.com/openeuler/kernel/issues/I8TRWW CVE: NA
--------------------------------
Twice fixup for the same ag may happen within exact one tp, and the consume of agfl after first fixup may trigger failure of second fixup, which is a unintended behavior and then xfs shutdown[1][2].
Gao Xiang describe one solution that we can reserve more blocks when first fixup, but there is some logical error:
- we may first see postallocs as 1 and second as 0, this can trigger pointless agfl filling or shortening - upper case(postallocs first equals to 1, second equals to 0) give us examples that we need shorten the agfl, but xfs_alloc_fix_freelist can only free agfl after success freespace check. Besides, the filling or shortening of agfl won't change fdblocks, so we can fall into that we can see fdblocks(or resblocks) but ag fixup will reject us, and then xfs can shutdown too - once postallocs equals to 1, it can also change the logical of xfs_alloc_ag_max_usable, which will change the block allocation logical(found this problem by check each ag's freeblocks after we fallocate a huge file) - once postallocs equals to 1, we reserve 2 * xfs_alloc_min_freelist(), but sometimes it seems not enough once bnt/cnt grow and the second fixup need more reserve...
This patch fix all bug above by using m_ag_maxlevels to reserve more blocks, and adapt xfs_alloc_set_aside/xfs_alloc_ag_max_usable to match this more reserve. Besides, we just reserve more, won't fill or shorten agfl according to that reserve.
[1] https://www.spinics.net/lists/linux-xfs/msg66440.html [2] https://lore.kernel.org/linux-xfs/20221228133204.4021519-1-guoxuenan@huawei....
Fixes: 53f85096f93e ("xfs: account extra freespace btree splits for multiple allocations") Signed-off-by: yangerkun yangerkun@huawei.com Signed-off-by: Long Li leo.lilong@huawei.com --- fs/xfs/libxfs/xfs_alloc.c | 39 ++++++++++++++++++++++++++++++++++----- fs/xfs/xfs_mount.c | 9 +++++++++ 2 files changed, 43 insertions(+), 5 deletions(-)
diff --git a/fs/xfs/libxfs/xfs_alloc.c b/fs/xfs/libxfs/xfs_alloc.c index 392d787ceb22..2931486dcbd7 100644 --- a/fs/xfs/libxfs/xfs_alloc.c +++ b/fs/xfs/libxfs/xfs_alloc.c @@ -92,6 +92,25 @@ xfs_prealloc_blocks( */ #define XFS_ALLOCBT_AGFL_RESERVE 4
+/* + * Twice fixup for the same ag may happen within exact one tp, and the consume + * of agfl after first fixup may trigger second fixup's failure, then xfs will + * shutdown. To avoid that, we reserve blocks which can satisfy the second + * fixup. + */ +xfs_extlen_t +xfs_ag_fixup_aside( + struct xfs_mount *mp) +{ + xfs_extlen_t ret; + + ret = 2 * mp->m_alloc_maxlevels; + if (xfs_has_rmapbt(mp)) + ret += mp->m_rmap_maxlevels; + + return ret; +} + /* * Compute the number of blocks that we set aside to guarantee the ability to * refill the AGFL and handle a full bmap btree split. @@ -114,7 +133,8 @@ unsigned int xfs_alloc_set_aside( struct xfs_mount *mp) { - return mp->m_sb.sb_agcount * (XFS_ALLOCBT_AGFL_RESERVE + 4); + return mp->m_sb.sb_agcount * (XFS_ALLOCBT_AGFL_RESERVE + + 4 + xfs_ag_fixup_aside(mp)); }
/* @@ -147,6 +167,8 @@ xfs_alloc_ag_max_usable( if (xfs_has_reflink(mp)) blocks++; /* refcount root block */
+ blocks += xfs_ag_fixup_aside(mp); + return mp->m_sb.sb_agblocks - blocks; }
@@ -2618,6 +2640,7 @@ xfs_alloc_fix_freelist( struct xfs_alloc_arg targs; /* local allocation arguments */ xfs_agblock_t bno; /* freelist block */ xfs_extlen_t need; /* total blocks needed in freelist */ + xfs_extlen_t minfree; int error = 0;
/* deferred ops (AGFL block frees) require permanent transactions */ @@ -2650,8 +2673,11 @@ xfs_alloc_fix_freelist( * blocks to perform multiple allocations from a single AG and * transaction if needed. */ - need = xfs_alloc_min_freelist(mp, pag) * (1 + args->postallocs); - if (!xfs_alloc_space_available(args, need, alloc_flags | + minfree = need = xfs_alloc_min_freelist(mp, pag); + if (args->postallocs) + minfree += xfs_ag_fixup_aside(mp); + + if (!xfs_alloc_space_available(args, minfree, alloc_flags | XFS_ALLOC_FLAG_CHECK)) goto out_agbp_relse;
@@ -2674,8 +2700,11 @@ xfs_alloc_fix_freelist( xfs_agfl_reset(tp, agbp, pag);
/* If there isn't enough total space or single-extent, reject it. */ - need = xfs_alloc_min_freelist(mp, pag) * (1 + args->postallocs); - if (!xfs_alloc_space_available(args, need, alloc_flags)) + minfree = need = xfs_alloc_min_freelist(mp, pag); + if (args->postallocs) + minfree += xfs_ag_fixup_aside(mp); + + if (!xfs_alloc_space_available(args, minfree, alloc_flags)) goto out_agbp_relse;
#ifdef DEBUG diff --git a/fs/xfs/xfs_mount.c b/fs/xfs/xfs_mount.c index 0a0fd19573d8..78c72d0aa0a6 100644 --- a/fs/xfs/xfs_mount.c +++ b/fs/xfs/xfs_mount.c @@ -693,6 +693,15 @@ xfs_mountfs(
xfs_agbtree_compute_maxlevels(mp);
+ /* + * We now need m_ag_maxlevels/m_rmap_maxlevels to initialize + * m_alloc_set_aside/m_ag_max_usable. And when we first do the + * init in xfs_sb_mount_common, m_alloc_set_aside/m_ag_max_usable + * still equals to 0. Redo it now. + */ + mp->m_alloc_set_aside = xfs_alloc_set_aside(mp); + mp->m_ag_max_usable = xfs_alloc_ag_max_usable(mp); + /* * Check if sb_agblocks is aligned at stripe boundary. If sb_agblocks * is NOT aligned turn off m_dalign since allocator alignment is within
From: yangerkun yangerkun@huawei.com
hulk inclusion category: bugfix bugzilla: https://gitee.com/openeuler/kernel/issues/I8TRWW CVE: NA
--------------------------------
xfs_alloc_space_available need to make sure the longest free extent can satisfy extra blocks needed for agfl fixup and also the really needed block for the latter allocation. Or the latter agfl filling may consume some blocks within the longest free extent, which can lead to that all the free extent can not satisfy the really block allocation.
The arg 'min_free' show that actually blocks we need fill agfl up to. But after commit aa1ab9a77d89 ("xfs: fix xfs shutdown since we reserve more blocks in agfl fixup"), the 'min_free' will add the reserve blocks for second fixup, but leave agfl blocks does not include this.
Actually, what we want was just reserve more blocks to satisfy second fixup, we will not fill agfl with that so many blocks, so the args for xfs_alloc_longest_free_extent should not consider postalloc. And fix agflcount in xfs_alloc_space_available too.
Fixes: aa1ab9a77d89 ("xfs: fix xfs shutdown since we reserve more blocks in agfl fixup") Signed-off-by: yangerkun yangerkun@huawei.com Signed-off-by: Long Li leo.lilong@huawei.com --- fs/xfs/libxfs/xfs_alloc.c | 9 +++++---- 1 file changed, 5 insertions(+), 4 deletions(-)
diff --git a/fs/xfs/libxfs/xfs_alloc.c b/fs/xfs/libxfs/xfs_alloc.c index 2931486dcbd7..9e54cfe7252c 100644 --- a/fs/xfs/libxfs/xfs_alloc.c +++ b/fs/xfs/libxfs/xfs_alloc.c @@ -2320,6 +2320,7 @@ xfs_alloc_min_freelist( static bool xfs_alloc_space_available( struct xfs_alloc_arg *args, + xfs_extlen_t need, xfs_extlen_t min_free, int flags) { @@ -2336,7 +2337,7 @@ xfs_alloc_space_available(
/* do we have enough contiguous free space for the allocation? */ alloc_len = args->minlen + (args->alignment - 1) + args->minalignslop; - longest = xfs_alloc_longest_free_extent(pag, min_free, reservation); + longest = xfs_alloc_longest_free_extent(pag, need, reservation); if (longest < alloc_len) return false;
@@ -2345,7 +2346,7 @@ xfs_alloc_space_available( * account extra agfl blocks because we are about to defer free them, * making them unavailable until the current transaction commits. */ - agflcount = min_t(xfs_extlen_t, pag->pagf_flcount, min_free); + agflcount = min_t(xfs_extlen_t, pag->pagf_flcount, need); available = (int)(pag->pagf_freeblks + agflcount - reservation - min_free - args->minleft); if (available < (int)max(args->total, alloc_len)) @@ -2677,7 +2678,7 @@ xfs_alloc_fix_freelist( if (args->postallocs) minfree += xfs_ag_fixup_aside(mp);
- if (!xfs_alloc_space_available(args, minfree, alloc_flags | + if (!xfs_alloc_space_available(args, need, minfree, alloc_flags | XFS_ALLOC_FLAG_CHECK)) goto out_agbp_relse;
@@ -2704,7 +2705,7 @@ xfs_alloc_fix_freelist( if (args->postallocs) minfree += xfs_ag_fixup_aside(mp);
- if (!xfs_alloc_space_available(args, minfree, alloc_flags)) + if (!xfs_alloc_space_available(args, need, minfree, alloc_flags)) goto out_agbp_relse;
#ifdef DEBUG
反馈: 您发送到kernel@openeuler.org的补丁/补丁集,已成功转换为PR! PR链接地址: https://gitee.com/openeuler/kernel/pulls/3833 邮件列表地址:https://mailweb.openeuler.org/hyperkitty/list/kernel@openeuler.org/message/4...
FeedBack: The patch(es) which you have sent to kernel@openeuler.org mailing list has been converted to a pull request successfully! Pull request link: https://gitee.com/openeuler/kernel/pulls/3833 Mailing list address: https://mailweb.openeuler.org/hyperkitty/list/kernel@openeuler.org/message/4...