[PATCH OLK-5.10 0/2] xfs: fix delay extent reserve issue
This patch set fixes a delalloc extent reservation issue.

Christoph Hellwig (1):
  xfs: fix xfs_bmap_add_extent_delay_real for partial conversions

Ye Bin (1):
  xfs: fix possible bugon in xfs_trans_unreserve_and_mod_sb()

 fs/xfs/libxfs/xfs_bmap.c | 24 ++++++++++++------------
 1 file changed, 12 insertions(+), 12 deletions(-)

--
2.52.0
From: Christoph Hellwig <hch@lst.de>

mainline inclusion
from mainline-v6.9-rc4
commit d69bee6a35d3c5e4873b9e164dd1a9711351a97c
category: bugfix
bugzilla: https://atomgit.com/openeuler/kernel/issues/9229

Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i...

--------------------------------

xfs_bmap_add_extent_delay_real takes parts or all of a delalloc extent
and converts them to a real extent. It is written to deal with any
potential overlap of the to-be-converted range with the delalloc extent,
but it turns out that currently only converting the entire extent, or a
part starting at the beginning, is actually exercised, as the only caller
always tries to convert the entire delalloc extent, and either succeeds
or at least progresses partially from the start.

If it only converts a tiny part of a delalloc extent, the indirect block
calculation for the new delalloc extent (da_new) might be equivalent to
that of the existing delalloc extent (da_old). If this extent conversion
now requires allocating an indirect block, that block gets accounted into
da_new, which triggers the assert that da_new must be smaller than or
equal to da_old unless we split the extent.

Except for the assert, that case is actually handled by just trying to
allocate more space, as that is already handled for the split case (which
currently can't be reached at all), so just reusing it should be fine.
Except that without dipping into the reserved block pool that would make
it a bit too easy to trigger a fs shutdown due to ENOSPC. So in addition
to adjusting the assert, also dip into the reserved block pool.

Note that I could only reproduce the assert with a change to only convert
the actually asked range instead of the full delalloc extent from
xfs_bmapi_write.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: "Darrick J. Wong" <djwong@kernel.org>
Signed-off-by: Chandan Babu R <chandanbabu@kernel.org>

Conflicts:
	fs/xfs/libxfs/xfs_bmap.c
[Context conflicts]
Signed-off-by: Long Li <leo.lilong@huawei.com>
---
 fs/xfs/libxfs/xfs_bmap.c | 10 ++++++++--
 1 file changed, 8 insertions(+), 2 deletions(-)

diff --git a/fs/xfs/libxfs/xfs_bmap.c b/fs/xfs/libxfs/xfs_bmap.c
index 1323259192d6..b8d16ffa0f7c 100644
--- a/fs/xfs/libxfs/xfs_bmap.c
+++ b/fs/xfs/libxfs/xfs_bmap.c
@@ -1572,6 +1572,7 @@ xfs_bmap_add_extent_delay_real(
 			if (error)
 				goto done;
 		}
+		ASSERT(da_new <= da_old);
 		break;
 
 	case BMAP_LEFT_FILLING | BMAP_RIGHT_FILLING | BMAP_LEFT_CONTIG:
@@ -1601,6 +1602,7 @@ xfs_bmap_add_extent_delay_real(
 			if (error)
 				goto done;
 		}
+		ASSERT(da_new <= da_old);
 		break;
 
 	case BMAP_LEFT_FILLING | BMAP_RIGHT_FILLING | BMAP_RIGHT_CONTIG:
@@ -1634,6 +1636,7 @@ xfs_bmap_add_extent_delay_real(
 			if (error)
 				goto done;
 		}
+		ASSERT(da_new <= da_old);
 		break;
 
 	case BMAP_LEFT_FILLING | BMAP_RIGHT_FILLING:
@@ -1666,6 +1669,7 @@ xfs_bmap_add_extent_delay_real(
 				goto done;
 			}
 		}
+		ASSERT(da_new <= da_old);
 		break;
 
 	case BMAP_LEFT_FILLING | BMAP_LEFT_CONTIG:
@@ -1703,6 +1707,7 @@ xfs_bmap_add_extent_delay_real(
 			if (error)
 				goto done;
 		}
+		ASSERT(da_new <= da_old);
 		break;
 
 	case BMAP_LEFT_FILLING:
@@ -1790,6 +1795,7 @@ xfs_bmap_add_extent_delay_real(
 		xfs_iext_update_extent(bma->ip, state, &bma->icur, &PREV);
 		xfs_iext_next(ifp, &bma->icur);
 		xfs_iext_update_extent(bma->ip, state, &bma->icur, &RIGHT);
+		ASSERT(da_new <= da_old);
 		break;
 
 	case BMAP_RIGHT_FILLING:
@@ -1837,6 +1843,7 @@ xfs_bmap_add_extent_delay_real(
 		PREV.br_blockcount = temp;
 		xfs_iext_insert(bma->ip, &bma->icur, &PREV, state);
 		xfs_iext_next(ifp, &bma->icur);
+		ASSERT(da_new <= da_old);
 		break;
 
 	case 0:
@@ -1958,9 +1965,8 @@ xfs_bmap_add_extent_delay_real(
 
 	/* adjust for changes in reserved delayed indirect blocks */
 	if (da_new != da_old) {
-		ASSERT(state == 0 || da_new < da_old);
 		error = xfs_mod_fdblocks(mp, (int64_t)(da_old - da_new),
-				false);
+				true);
 	}
 
 	xfs_bmap_check_leaf_extents(bma->cur, bma->ip, whichfork);
--
2.52.0
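To illustrate the last hunk of the patch above, here is a minimal
userspace sketch (not kernel code) of the reservation adjustment when
da_new ends up larger than da_old. struct toy_mount, toy_mod_fdblocks()
and toy_adjust_indlen() are hypothetical names invented for this
example, and the reserve-pool policy is an assumption that is much
simpler than the real xfs_mod_fdblocks().

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the mount's free-space counters. */
struct toy_mount {
	int64_t fdblocks;	/* generally available free blocks */
	int64_t resblks_avail;	/* reserved pool, usable only when rsvd is true */
};

/*
 * Assumed policy: a positive delta gives blocks back to the free pool;
 * a negative delta takes blocks from the free pool and falls back to
 * the reserved pool only when rsvd is set.
 */
static int toy_mod_fdblocks(struct toy_mount *mp, int64_t delta, bool rsvd)
{
	assert(mp != NULL);

	if (delta >= 0) {
		mp->fdblocks += delta;
		return 0;
	}
	if (mp->fdblocks + delta >= 0) {
		mp->fdblocks += delta;
		return 0;
	}
	if (rsvd && mp->resblks_avail + delta >= 0) {
		mp->resblks_avail += delta;
		return 0;
	}
	return -ENOSPC;
}

/*
 * Mirrors the shape of the adjusted code: no assert on the sign of the
 * difference, and the caller decides whether the reserved pool may be used.
 */
static int toy_adjust_indlen(struct toy_mount *mp, uint64_t da_old,
			     uint64_t da_new, bool rsvd)
{
	if (da_new == da_old)
		return 0;
	return toy_mod_fdblocks(mp, (int64_t)(da_old - da_new), rsvd);
}
```

With an empty free pool, growing the reservation by one block fails with
-ENOSPC when rsvd is false and succeeds from the reserved pool when rsvd
is true, which is the behaviour the switch from false to true is meant to
allow.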
From: Ye Bin <yebin10@huawei.com>

hulk inclusion
category: bugfix
bugzilla: https://atomgit.com/openeuler/kernel/issues/9229
CVE: NA

--------------------------------

Recently, I encountered a problem where a BUG was triggered in the
write-back process. The detailed problem information is as follows:

xfs_bmap_extents_to_btree: ip=0xffff888148ecad00 wasdel=0
sde: writeback error on inode 68, offset 61440, sector 1400
XFS (sde): Corruption of in-memory data (0x8) detected at xfs_trans_mod_sb+0xaa6/0xc60 (fs/xfs/xfs_trans.c:351). Shutting.
XFS (sde): Please unmount the filesystem and rectify the problem(s)
XFS: Assertion failed: tp->t_blk_res || tp->t_fdblocks_delta >= 0, file: fs/xfs/xfs_trans.c, line: 610
------------[ cut here ]------------
kernel BUG at fs/xfs/xfs_message.c:102!
Oops: invalid opcode: 0000 [#1] SMP KASAN PTI
CPU: 5 UID: 0 PID: 13 Comm: kworker/u32:1 Not tainted 7.0.0-rc6-next-20260402-00028-g56f243e5f8ea-dirty #360 P
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.15.0-1 04/01/2014
Workqueue: writeback wb_workfn (flush-8:64)
RIP: 0010:assfail+0x9f/0xb0
Code: fe 84 db 75 20 e8 41 2e 33 fe 0f 0b 5b 5d 41 5c 41 5d e9 94 34 a5 06 48 c7 c7 58 ae 2b 8d e8 f8 72 a0
RSP: 0018:ffffc900000dedd0 EFLAGS: 00010293
RAX: 0000000000000000 RBX: 0000000000000001 RCX: ffffffff838c91b6
RDX: ffff88810425d880 RSI: ffffffff838c91df RDI: 0000000000000001
RBP: 0000000000000000 R08: 0000000000000001 R09: ffffed10e3b14901
R10: 0000000000000001 R11: 0000000000000001 R12: ffffffff8a956520
R13: 0000000000000262 R14: 0000000000000000 R15: ffffffffffffffff
FS:  0000000000000000(0000) GS:ffff88878bdc5000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007fccb3c788f0 CR3: 000000000c98a000 CR4: 00000000000006f0
Call Trace:
 <TASK>
 xfs_trans_unreserve_and_mod_sb+0xb86/0xd00
 __xfs_trans_commit+0x38b/0xe00
 xfs_trans_commit+0xeb/0x1a0
 xfs_bmapi_convert_one_delalloc+0xba9/0x12d0
 xfs_bmapi_convert_delalloc+0x101/0x350
 xfs_writeback_range+0x76c/0x12d0
 iomap_writeback_folio+0x9ed/0x2100
 iomap_writepages+0x13c/0x2a0
 xfs_vm_writepages+0x278/0x330
 do_writepages+0x247/0x5c0
 __writeback_single_inode+0x123/0x1370
 writeback_sb_inodes+0x71e/0x1b90
 __writeback_inodes_wb+0xc3/0x280
 wb_writeback+0x730/0xb80
 wb_workfn+0x8b0/0xbc0
 process_one_work+0xa08/0x1d00
 worker_thread+0x698/0xeb0
 kthread+0x408/0x540
 ret_from_fork+0xa4d/0xdd0
 ret_from_fork_asm+0x1a/0x30
 </TASK>

After analyzing the above issue, the possible triggering process is as
follows:

xfs_bmapi_convert_delalloc
 xfs_bmapi_convert_one_delalloc
  xfs_bmapi_allocate
   xfs_bmap_add_extent_delay_real
    da_old = startblockval(PREV.br_startblock); // da_old = 5
    case BMAP_LEFT_FILLING:
     ifp->if_nextents++; // 21 + 1 = 22
     if (xfs_bmap_needs_btree(bma->ip, whichfork)) // 22 > 21
      xfs_bmap_extents_to_btree // convert to btree
       cur->bc_ino.allocated++;
     // da_new = 5 - 1 = 4
     da_new = XFS_FILBLKS_MIN(xfs_bmap_worst_indlen(bma->ip, temp),
                startblockval(PREV.br_startblock) -
                (bma->cur ? bma->cur->bc_ino.allocated : 0))
     //xfs_bmapi_convert_one_delalloc() return
     PREV.br_startblock = nullstartblock(da_new);
xfs_bmap_del_extent_real
 case BMAP_LEFT_FILLING | BMAP_RIGHT_FILLING:
  ifp->if_nextents--; // 22 - 1 = 21
  if (xfs_bmap_needs_btree(ip, whichfork))
   xfs_bmap_extents_to_btree();
  else // convert to extents
   xfs_bmap_btree_to_extents();
...
// Alternate a few times in the middle.
da_old = 4
da_old = 3
da_old = 2
da_old = 1
...
xfs_bmapi_convert_delalloc
 xfs_bmapi_convert_one_delalloc
  // Both blocks and rtextents are 0
  error = xfs_trans_alloc(mp, &M_RES(mp)->tr_write, 0, 0,
                          XFS_TRANS_RESERVE, &tp);
   tp = kmem_cache_zalloc(xfs_trans_cache, GFP_KERNEL | __GFP_NOFAIL);
   error = xfs_trans_reserve(tp, resp, blocks, rtextents);
    if (blocks > 0)
     error = xfs_mod_fdblocks(mp, -((int64_t)blocks), rsvd);
     // The value of blocks is 0, so the value of tp->t_blk_res is 0
     tp->t_blk_res += blocks;
  xfs_bmapi_allocate
   xfs_bmap_add_extent_delay_real
    da_old = startblockval(PREV.br_startblock); // da_old = 0
    // The current delay extent is just exhausted.
    case BMAP_LEFT_FILLING | BMAP_RIGHT_FILLING
     ifp->if_nextents++; // 21 + 1 = 22
     if (xfs_bmap_needs_btree(bma->ip, whichfork)) // 22 > 21
      // Converted to btree. da_old > 0 is false.
      error = xfs_bmap_extents_to_btree(bma->tp, bma->ip, &bma->cur,
                                        da_old > 0, &tmp_logflags, whichfork);
       args.wasdel = wasdel; // wasdel is false
       error = xfs_alloc_vextent_start_ag(&args, XFS_INO_TO_FSB(mp, ip->i_ino));
        xfs_alloc_vextent_finish(args, minimum_agno, error, true);
         xfs_ag_resv_alloc_extent(args->pag, args->resv, args);
          case XFS_AG_RESV_NONE:
           field = args->wasdel ? XFS_TRANS_SB_RES_FDBLOCKS :
                                  XFS_TRANS_SB_FDBLOCKS; //args->wasdel == false
           xfs_trans_mod_sb(args->tp, field, -(int64_t)args->len);
            case XFS_TRANS_SB_FDBLOCKS:
             if (delta < 0)
              tp->t_blk_res_used += (uint)-delta;
              if (tp->t_blk_res_used > tp->t_blk_res)
               // ***tp->t_blk_res is 0, thus triggering xfs_force_shutdown()***
               xfs_force_shutdown(mp, SHUTDOWN_CORRUPT_INCORE);

I constructed the trigger sequence above deliberately to make the problem
easy to reproduce. Besides the scenario where the fork is converted back
and forth between BTREE and EXTENTS formats, the same problem can also
arise from btree splitting.
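The last step of the trace can be reduced to a small userspace sketch.
This is an illustration only: struct toy_trans, enum toy_field,
toy_pick_field() and toy_trans_mod_sb() are hypothetical names, and the
model keeps nothing but the check quoted in the trace (blocks used
against blocks reserved in the transaction).

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, heavily reduced model of a transaction's block reservation. */
struct toy_trans {
	uint32_t blk_res;	/* blocks reserved when the transaction was allocated */
	uint32_t blk_res_used;	/* blocks charged against that reservation so far */
	bool shut_down;		/* set where the trace calls xfs_force_shutdown() */
};

enum toy_field { TOY_SB_FDBLOCKS, TOY_SB_RES_FDBLOCKS };

/* As in the trace: allocations for a delalloc extent use the "reserved" field. */
static enum toy_field toy_pick_field(bool wasdel)
{
	return wasdel ? TOY_SB_RES_FDBLOCKS : TOY_SB_FDBLOCKS;
}

/*
 * Charge an allocation (negative delta) to the transaction. Only the
 * TOY_SB_FDBLOCKS case is modelled; the toy ignores the other field.
 */
static void toy_trans_mod_sb(struct toy_trans *tp, enum toy_field field,
			     int64_t delta)
{
	assert(tp != NULL);

	if (field != TOY_SB_FDBLOCKS || delta >= 0)
		return;

	tp->blk_res_used += (uint32_t)-delta;
	if (tp->blk_res_used > tp->blk_res)
		tp->shut_down = true;
}
```

In this model, a transaction allocated with a block reservation of 0 is
flagged as soon as a single block is charged with wasdel false, which
corresponds to the state reached once da_old has dropped to 0.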
The core reason for the issue is that in xfs_bmapi_convert_delalloc(),
the call to xfs_bmap_worst_indlen() calculates the worst-case number of
reserved blocks, which is the number of additional blocks required after
a complete conversion of the entire delalloc extent. It assumes that the
entire conversion process is atomic. However, the current process cannot
guarantee such atomicity. On a fragmented filesystem, the most extreme
scenario is that every block conversion triggers a full btree split, in
which case the reserved blocks are far from sufficient. The filesystem in
the environment where this issue was triggered was indeed severely
fragmented.

Further analysis of this abnormal model shows that because the reserved
blocks are continuously consumed, consumption may eventually exceed the
reserved amount. When the space is nearly exhausted,
xfs_bmap_extents_to_btree() may fail to allocate blocks, triggering a
warning. This failure to allocate additional blocks can lead to issues
with normal block allocation.

Since the conversion of a single delalloc extent cannot be guaranteed to
complete in one pass, the 'indlen' of the delalloc extent should be kept
at the value calculated by xfs_bmap_worst_indlen().

Commit d69bee6a35d3 ("xfs: fix xfs_bmap_add_extent_delay_real for partial
conversions") addressed the issue of potentially not reserving enough
space in emergency situations. Building on that change, we can
recalculate the worst-case 'indlen' required for the remaining delalloc
extent after the conversion in xfs_bmap_add_extent_delay_real(), instead
of using the remaining value.
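The difference between capping da_new at what is left of the old
reservation and recomputing it from the remaining length can be seen in a
small simulation. This is a sketch under stated assumptions, not kernel
code: TOY_MAXRECS and TOY_MAXLEVELS are made-up constants,
toy_worst_indlen() is a simplified hypothetical estimate rather than the
real xfs_bmap_worst_indlen(), and each step pessimistically converts one
block while consuming one indirect block.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define TOY_MAXRECS	10	/* assumed records per btree block */
#define TOY_MAXLEVELS	5	/* assumed maximum btree height */

/* Simplified, hypothetical worst-case indirect block estimate for len blocks. */
static uint64_t toy_worst_indlen(uint64_t len)
{
	uint64_t rval = 0;

	assert(len > 0);
	for (unsigned int level = 0; level < TOY_MAXLEVELS; level++) {
		len = (len + TOY_MAXRECS - 1) / TOY_MAXRECS;
		rval += len;
		if (len == 1)
			return rval + TOY_MAXLEVELS - level - 1;
	}
	return rval;
}

/* Old policy: never hand out more than what is left of the old reservation. */
static uint64_t toy_capped(uint64_t da_old, uint64_t remaining,
			   uint64_t allocated)
{
	uint64_t avail = da_old > allocated ? da_old - allocated : 0;
	uint64_t worst = toy_worst_indlen(remaining);

	return worst < avail ? worst : avail;
}

/*
 * Convert one block per step, each step consuming one indirect block.
 * Returns the number of delalloc blocks still unconverted when the
 * reservation first reaches zero, or 0 if it never runs out early.
 */
static uint64_t toy_blocks_left_when_drained(uint64_t len, bool recompute)
{
	uint64_t da = toy_worst_indlen(len);

	while (len > 1) {
		len--;
		da = recompute ? toy_worst_indlen(len) : toy_capped(da, len, 1);
		if (da == 0)
			return len;
	}
	return 0;
}
```

Under these assumptions, a 100-block delalloc extent starts with a
reservation of 14 blocks. With the capped policy the reservation reaches
zero while 86 blocks are still unconverted; with the recomputed policy it
never reaches zero, at the cost of possibly having to take additional
blocks from the free pool along the way.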
Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2")
Signed-off-by: Ye Bin <yebin10@huawei.com>
Signed-off-by: Long Li <leo.lilong@huawei.com>
---
 fs/xfs/libxfs/xfs_bmap.c | 14 ++++----------
 1 file changed, 4 insertions(+), 10 deletions(-)

diff --git a/fs/xfs/libxfs/xfs_bmap.c b/fs/xfs/libxfs/xfs_bmap.c
index b8d16ffa0f7c..08342a4c2074 100644
--- a/fs/xfs/libxfs/xfs_bmap.c
+++ b/fs/xfs/libxfs/xfs_bmap.c
@@ -1679,8 +1679,7 @@ xfs_bmap_add_extent_delay_real(
 		 */
 		old = LEFT;
 		temp = PREV.br_blockcount - new->br_blockcount;
-		da_new = XFS_FILBLKS_MIN(xfs_bmap_worst_indlen(bma->ip, temp),
-				startblockval(PREV.br_startblock));
+		da_new = xfs_bmap_worst_indlen(bma->ip, temp);
 
 		LEFT.br_blockcount += new->br_blockcount;
 
@@ -1747,9 +1746,7 @@ xfs_bmap_add_extent_delay_real(
 		}
 
 		temp = PREV.br_blockcount - new->br_blockcount;
-		da_new = XFS_FILBLKS_MIN(xfs_bmap_worst_indlen(bma->ip, temp),
-			startblockval(PREV.br_startblock) -
-			(bma->cur ? bma->cur->bc_ino.allocated : 0));
+		da_new = xfs_bmap_worst_indlen(bma->ip, temp);
 
 		PREV.br_startoff = new_endoff;
 		PREV.br_blockcount = temp;
@@ -1786,8 +1783,7 @@ xfs_bmap_add_extent_delay_real(
 		}
 
 		temp = PREV.br_blockcount - new->br_blockcount;
-		da_new = XFS_FILBLKS_MIN(xfs_bmap_worst_indlen(bma->ip, temp),
-			startblockval(PREV.br_startblock));
+		da_new = xfs_bmap_worst_indlen(bma->ip, temp);
 
 		PREV.br_blockcount = temp;
 		PREV.br_startblock = nullstartblock(da_new);
@@ -1835,9 +1831,7 @@ xfs_bmap_add_extent_delay_real(
 		}
 
 		temp = PREV.br_blockcount - new->br_blockcount;
-		da_new = XFS_FILBLKS_MIN(xfs_bmap_worst_indlen(bma->ip, temp),
-			startblockval(PREV.br_startblock) -
-			(bma->cur ? bma->cur->bc_ino.allocated : 0));
+		da_new = xfs_bmap_worst_indlen(bma->ip, temp);
 
 		PREV.br_startblock = nullstartblock(da_new);
 		PREV.br_blockcount = temp;
--
2.52.0
Feedback:
The patch(es) that you sent to the kernel@openeuler.org mailing list have been successfully converted to a pull request.
Pull request link: https://atomgit.com/openeuler/kernel/merge_requests/23476
Mailing list address: https://mailweb.openeuler.org/archives/list/kernel@openeuler.org/message/KD7...
participants (2)
-
Long Li -
patchwork bot