xfs fsmap bugfix.
Darrick J. Wong (2):
  xfs: fix interval filtering in multi-step fsmap queries
  xfs: fix an agbno overflow in __xfs_getfsmap_datadev

 fs/xfs/xfs_fsmap.c | 80 ++++++++++++++++++++++++++++++++++------------
 1 file changed, 60 insertions(+), 20 deletions(-)
From: "Darrick J. Wong" <djwong@kernel.org>

mainline inclusion
from mainline-v6.5-rc1
commit 63ef7a35912dd743cabd65d5bb95891625c0dd46
category: bugfix
bugzilla: https://gitee.com/openeuler/kernel/issues/IA470G
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i...
--------------------------------
I noticed a bug in ranged GETFSMAP queries:
 EXT: DEV  BLOCK-RANGE   OWNER               FILE-OFFSET  AG  AG-OFFSET    TOTAL
   0: 8:80 [0..7]:       static fs metadata               0   (0..7)           8
<snip>
   9: 8:80 [192..223]:   137                 0..31        0   (192..223)      32
That's not right -- we asked what block maps block 208, and we should've received a mapping for inode 137 offset 16. Instead, we get nothing.
The root cause of this problem is a mis-interaction between the fsmap code and how btree ranged queries work. xfs_btree_query_range returns any btree record that overlaps with the query interval, even if the record starts before or ends after the interval. Similarly, GETFSMAP is supposed to return a recordset containing all records that overlap the range queried.
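As an illustration of the overlap semantics described above, here is a minimal sketch in plain C. The toy_rec type and the toy_overlaps helper are hypothetical and are not part of the XFS code; they model a record as a start block plus a length, using the numbers from the output above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical, simplified stand-in for an rmap record: [start, start + len). */
struct toy_rec {
	uint64_t start;
	uint64_t len;	/* must be nonzero */
};

/*
 * Return true if the record overlaps the inclusive query interval [lo, hi].
 * A record that starts before lo still matches as long as it ends at or
 * after lo, which is how a ranged btree query is described above.
 */
static bool toy_overlaps(const struct toy_rec *rec, uint64_t lo, uint64_t hi)
{
	assert(rec->len > 0);
	uint64_t last = rec->start + rec->len - 1;

	return rec->start <= hi && last >= lo;
}
```

With the record [192..223] (start 192, length 32), toy_overlaps(&rec, 208, 208) is true even though the record starts before 208, so a query for block 208 should report it.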
However, it's possible that the recordset is larger than the buffer that the caller provided to convey mappings to userspace. In /that/ case, userspace is supposed to copy the last record returned to fmh_keys[0] and call GETFSMAP again. In this case, we do not want to return mappings that we have already supplied to the caller. The call to xfs_btree_query_range is the same, but now we ignore any records that start before fmh_keys[0].
Unfortunately, we didn't implement the filtering predicate correctly. The predicate should only be called when we're calling back for more records. Accomplish this by setting info->low.rm_blockcount to a nonzero value and ensuring that it is cleared as necessary. As a result, we no longer want to adjust dkeys[0] in the main setup function because that's confusing.
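The continuation protocol and the corrected predicate can be sketched with a toy model in plain C. Everything here is an assumption made for illustration: toy_rec, toy_rec_before_start, toy_query and toy_query_all are hypothetical names, records carry only a start block and a length, and the bump of the low key by its length (done in the backend by the patch) is folded into the comparison. The real code also compares owner and offset, and userspace stops on a last-record flag rather than on an empty page.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified record: only a start block and a length. */
struct toy_rec {
	uint64_t start;
	uint64_t len;
};

/*
 * Toy version of the filtering predicate: only a continuation call
 * (low key with a nonzero length) filters out records that start before
 * the end of the previously returned record.
 */
static bool toy_rec_before_start(const struct toy_rec *low,
				 const struct toy_rec *rec)
{
	if (low->len)
		return rec->start < low->start + low->len;
	return false;
}

/*
 * Copy up to cap records that overlap [low->start, infinity) into out,
 * skipping those rejected by the predicate. recs must be sorted by start.
 * Returns the number of records copied.
 */
static size_t toy_query(const struct toy_rec *recs, size_t nrecs,
			const struct toy_rec *low,
			struct toy_rec *out, size_t cap)
{
	size_t n = 0;

	for (size_t i = 0; i < nrecs && n < cap; i++) {
		/* Overlap test: the record must end after the low key. */
		if (recs[i].start + recs[i].len <= low->start)
			continue;
		if (toy_rec_before_start(low, &recs[i]))
			continue;
		out[n++] = recs[i];
	}
	return n;
}

/*
 * Retrieve everything in pages of cap records, copying the last record
 * of each page into the low key before calling again. The caller must
 * size all[] to hold every matching record.
 */
static size_t toy_query_all(const struct toy_rec *recs, size_t nrecs,
			    struct toy_rec low, size_t cap,
			    struct toy_rec *all)
{
	struct toy_rec page[8];
	size_t total = 0, n;

	assert(cap > 0 && cap <= 8);
	while ((n = toy_query(recs, nrecs, &low, page, cap)) > 0) {
		for (size_t i = 0; i < n; i++)
			all[total++] = page[i];
		low = page[n - 1];	/* continuation key, len != 0 */
	}
	return total;
}
```

In this model, a first call with a low key of { 208, 0 } keeps the record that starts at 192, while a continuation call whose low key is that record skips it and resumes at the next one. An unconditional "rec->start < low->start" check would instead drop the record at 192 on the first call, which matches the empty result described above.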
This patch doesn't touch the logdev/rtbitmap backends because they have bigger problems that will be addressed by subsequent patches.
Found via xfs/556 with parent pointers enabled.
Fixes: e89c041338ed ("xfs: implement the GETFSMAP ioctl")
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Conflicts:
	fs/xfs/xfs_fsmap.c
[Because there are many conflicting patches that need to be adapted, the
in-place context adaptation is performed directly.]
Signed-off-by: Zizhi Wo <wozizhi@huawei.com>
---
 fs/xfs/xfs_fsmap.c | 67 +++++++++++++++++++++++++++++++++-------------
 1 file changed, 48 insertions(+), 19 deletions(-)
diff --git a/fs/xfs/xfs_fsmap.c b/fs/xfs/xfs_fsmap.c
index 595450a99ae2..deaed94011ad 100644
--- a/fs/xfs/xfs_fsmap.c
+++ b/fs/xfs/xfs_fsmap.c
@@ -161,7 +161,14 @@ struct xfs_getfsmap_info {
 	u64			missing_owner;	/* owner of holes */
 	u32			dev;		/* device id */
 	xfs_agnumber_t		agno;		/* AG number, if applicable */
-	struct xfs_rmap_irec	low;		/* low rmap key */
+	/*
+	 * Low rmap key for the query. If low.rm_blockcount is nonzero, this
+	 * is the second (or later) call to retrieve the recordset in pieces.
+	 * xfs_getfsmap_rec_before_start will compare all records retrieved
+	 * by the rmapbt query to filter out any records that start before
+	 * the last record.
+	 */
+	struct xfs_rmap_irec	low;
 	struct xfs_rmap_irec	high;		/* high rmap key */
 	bool			last;		/* last extent? */
 };
@@ -237,6 +244,17 @@ xfs_getfsmap_format(
 	xfs_fsmap_from_internal(rec, xfm);
 }

+static inline bool
+xfs_getfsmap_rec_before_start(
+	struct xfs_getfsmap_info	*info,
+	const struct xfs_rmap_irec	*rec,
+	xfs_daddr_t			rec_daddr)
+{
+	if (info->low.rm_blockcount)
+		return xfs_rmap_compare(rec, &info->low) < 0;
+	return false;
+}
+
 /*
  * Format a reverse mapping for getfsmap, having translated rm_startblock
  * into the appropriate daddr units.
@@ -260,7 +278,7 @@ xfs_getfsmap_helper(
 	 * Filter out records that start before our startpoint, if the
 	 * caller requested that.
 	 */
-	if (xfs_rmap_compare(rec, &info->low) < 0) {
+	if (xfs_getfsmap_rec_before_start(info, rec, rec_daddr)) {
 		rec_daddr += XFS_FSB_TO_BB(mp, rec->rm_blockcount);
 		if (info->next_daddr < rec_daddr)
 			info->next_daddr = rec_daddr;
@@ -604,9 +622,27 @@ __xfs_getfsmap_datadev(
 	error = xfs_fsmap_owner_to_rmap(&info->low, &keys[0]);
 	if (error)
 		return error;
-	info->low.rm_blockcount = 0;
+	info->low.rm_blockcount = XFS_BB_TO_FSBT(mp, keys[0].fmr_length);
 	xfs_getfsmap_set_irec_flags(&info->low, &keys[0]);

+	/* Adjust the low key if we are continuing from where we left off. */
+	if (info->low.rm_blockcount == 0) {
+		/* empty */
+	} else if (XFS_RMAP_NON_INODE_OWNER(info->low.rm_owner) ||
+		   (info->low.rm_flags & (XFS_RMAP_ATTR_FORK |
+					  XFS_RMAP_BMBT_BLOCK |
+					  XFS_RMAP_UNWRITTEN))) {
+		info->low.rm_startblock += info->low.rm_blockcount;
+		info->low.rm_owner = 0;
+		info->low.rm_offset = 0;
+
+		start_fsb += info->low.rm_blockcount;
+		if (XFS_FSB_TO_DADDR(mp, start_fsb) >= eofs)
+			return 0;
+	} else {
+		info->low.rm_offset += info->low.rm_blockcount;
+	}
+
 	info->high.rm_startblock = -1U;
 	info->high.rm_owner = ULLONG_MAX;
 	info->high.rm_offset = ULLONG_MAX;
@@ -657,12 +693,8 @@ __xfs_getfsmap_datadev(
 		 * Set the AG low key to the start of the AG prior to
 		 * moving on to the next AG.
 		 */
-		if (info->agno == start_ag) {
-			info->low.rm_startblock = 0;
-			info->low.rm_owner = 0;
-			info->low.rm_offset = 0;
-			info->low.rm_flags = 0;
-		}
+		if (info->agno == start_ag)
+			memset(&info->low, 0, sizeof(info->low));
 	}

 	/* Report any gap at the end of the AG */
@@ -886,21 +918,17 @@ xfs_getfsmap(
 	 * blocks could be mapped to several other files/offsets.
 	 * According to rmapbt record ordering, the minimal next
 	 * possible record for the block range is the next starting
-	 * offset in the same inode. Therefore, bump the file offset to
-	 * continue the search appropriately. For all other low key
-	 * mapping types (attr blocks, metadata), bump the physical
-	 * offset as there can be no other mapping for the same physical
-	 * block range.
+	 * offset in the same inode. Therefore, each fsmap backend bumps
+	 * the file offset to continue the search appropriately. For
+	 * all other low key mapping types (attr blocks, metadata), each
+	 * fsmap backend bumps the physical offset as there can be no
+	 * other mapping for the same physical block range.
 	 */
 	dkeys[0] = head->fmh_keys[0];
 	if (dkeys[0].fmr_flags & (FMR_OF_SPECIAL_OWNER | FMR_OF_EXTENT_MAP)) {
-		dkeys[0].fmr_physical += dkeys[0].fmr_length;
-		dkeys[0].fmr_owner = 0;
 		if (dkeys[0].fmr_offset)
 			return -EINVAL;
-	} else
-		dkeys[0].fmr_offset += dkeys[0].fmr_length;
-	dkeys[0].fmr_length = 0;
+	}
 	memset(&dkeys[1], 0xFF, sizeof(struct xfs_fsmap));

 	if (!xfs_getfsmap_check_keys(dkeys, &head->fmh_keys[1]))
@@ -948,6 +976,7 @@ xfs_getfsmap(
 		info.dev = handlers[i].dev;
 		info.last = false;
 		info.agno = NULLAGNUMBER;
+		info.low.rm_blockcount = 0;
 		error = handlers[i].fn(tp, dkeys, &info);
 		if (error)
 			break;
From: "Darrick J. Wong" <djwong@kernel.org>

mainline inclusion
from mainline-v6.6-rc3
commit cfa2df68b7ceb49ac9eb2d295ab0c5974dbf17e7
category: bugfix
bugzilla: https://gitee.com/openeuler/kernel/issues/IA470G
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i...
--------------------------------
Dave Chinner reported that xfs/273 fails if the AG size happens to be an exact power of two. I traced this to an agbno integer overflow when the current GETFSMAP call is a continuation of a previous GETFSMAP call, and the last record returned was non-shareable space at the end of an AG.
__xfs_getfsmap_datadev sets up a data device query by converting the incoming fmr_physical into an xfs_fsblock_t and cracking it into an agno and agbno pair. In the (failing) case where the fmr_length of the low key is nonzero and the record was for a non-shareable extent, it will add that length, converted to filesystem blocks, to start_fsb and info->low.rm_startblock.
If the low key was actually the last record for that AG, then this addition causes info->low.rm_startblock to point beyond EOAG. When the rmapbt range query starts, it'll return an empty set, and fsmap moves on to the next AG.
Or so I thought. Remember how we added to start_fsb?
If agsize < 1<<agblklog, start_fsb points to the same AG as the original fmr_physical from the low key. We run the rmapbt query, which returns nothing, so getfsmap zeroes info->low and moves on to the next AG.
If agsize == 1<<agblklog, start_fsb now points to the next AG. We run the rmapbt query on the next AG with the excessively large rm_startblock. If this next AG is actually the last AG, we'll set info->high to EOFS (which now has a lower rm_startblock than info->low), and the ranged btree query code will return -EINVAL. If it's not the last AG, we ignore all records for the intermediate AGs.
Oops.
Fix this by decoding start_fsb into agno and agbno only after making adjustments to start_fsb. This means that info->low.rm_startblock will always be set to a valid agbno, and we always start the rmapbt iteration in the correct AG.
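The arithmetic behind the overflow and the fix can be sketched in plain C. This is a simplified model under stated assumptions: toy_geom, toy_low and the toy_* helpers are hypothetical names, and a filesystem block number is assumed to be encoded as (agno << agblklog) | agbno. It only contrasts the two orders of operation described above and is not the kernel implementation.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical geometry: block number = (agno << agblklog) | agbno. */
struct toy_geom {
	uint32_t agblklog;	/* log2 of agsize, rounded up */
	uint32_t agsize;	/* blocks per AG, <= 1 << agblklog */
};

/* Hypothetical decoded low key: which AG to start in, and where. */
struct toy_low {
	uint32_t agno;
	uint32_t agbno;
};

static uint32_t toy_fsb_to_agno(const struct toy_geom *g, uint64_t fsb)
{
	return (uint32_t)(fsb >> g->agblklog);
}

static uint32_t toy_fsb_to_agbno(const struct toy_geom *g, uint64_t fsb)
{
	return (uint32_t)(fsb & ((1ULL << g->agblklog) - 1));
}

/* Old order: decode agbno first, then bump both values separately. */
static struct toy_low toy_low_key_old(const struct toy_geom *g,
				      uint64_t start_fsb, uint32_t blockcount)
{
	struct toy_low low;

	low.agbno = toy_fsb_to_agbno(g, start_fsb) + blockcount;
	start_fsb += blockcount;
	low.agno = toy_fsb_to_agno(g, start_fsb);
	return low;
}

/* New order: bump start_fsb, then decode agno and agbno from it. */
static struct toy_low toy_low_key_new(const struct toy_geom *g,
				      uint64_t start_fsb, uint32_t blockcount)
{
	struct toy_low low;

	start_fsb += blockcount;
	low.agno = toy_fsb_to_agno(g, start_fsb);
	low.agbno = toy_fsb_to_agbno(g, start_fsb);
	return low;
}
```

Take agblklog = 4 and agsize = 16, with the previous record covering the last four blocks of AG 0 (start_fsb = 12, blockcount = 4). The old order yields agno 1 with agbno 16, which is past the end of that AG; the new order yields agno 1 with agbno 0. With agsize = 12 and a record at start_fsb = 8, both orders stay in AG 0 with agbno 12, which is why the problem only appeared when agsize is an exact power of two.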
While we're at it, fix the predicate for determining if an fsmap record represents non-shareable space to include file data on pre-reflink filesystems.
Reported-by: Dave Chinner <david@fromorbit.com>
Fixes: 63ef7a35912dd ("xfs: fix interval filtering in multi-step fsmap queries")
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Zizhi Wo <wozizhi@huawei.com>
---
 fs/xfs/xfs_fsmap.c | 25 ++++++++++++++++++-------
 1 file changed, 18 insertions(+), 7 deletions(-)
diff --git a/fs/xfs/xfs_fsmap.c b/fs/xfs/xfs_fsmap.c
index deaed94011ad..ac67f80660b2 100644
--- a/fs/xfs/xfs_fsmap.c
+++ b/fs/xfs/xfs_fsmap.c
@@ -585,6 +585,19 @@ xfs_getfsmap_rtdev_rtbitmap(
 }
 #endif /* CONFIG_XFS_RT */

+static inline bool
+rmap_not_shareable(struct xfs_mount *mp, const struct xfs_rmap_irec *r)
+{
+	if (!xfs_has_reflink(mp))
+		return true;
+	if (XFS_RMAP_NON_INODE_OWNER(r->rm_owner))
+		return true;
+	if (r->rm_flags & (XFS_RMAP_ATTR_FORK | XFS_RMAP_BMBT_BLOCK |
+			   XFS_RMAP_UNWRITTEN))
+		return true;
+	return false;
+}
+
 /* Execute a getfsmap query against the regular data device. */
 STATIC int
 __xfs_getfsmap_datadev(
@@ -617,7 +630,6 @@ __xfs_getfsmap_datadev(
 	 * low to the fsmap low key and max out the high key to the end
 	 * of the AG.
 	 */
-	info->low.rm_startblock = XFS_FSB_TO_AGBNO(mp, start_fsb);
 	info->low.rm_offset = XFS_BB_TO_FSBT(mp, keys[0].fmr_offset);
 	error = xfs_fsmap_owner_to_rmap(&info->low, &keys[0]);
 	if (error)
@@ -627,12 +639,9 @@ __xfs_getfsmap_datadev(

 	/* Adjust the low key if we are continuing from where we left off. */
 	if (info->low.rm_blockcount == 0) {
-		/* empty */
-	} else if (XFS_RMAP_NON_INODE_OWNER(info->low.rm_owner) ||
-		   (info->low.rm_flags & (XFS_RMAP_ATTR_FORK |
-					  XFS_RMAP_BMBT_BLOCK |
-					  XFS_RMAP_UNWRITTEN))) {
-		info->low.rm_startblock += info->low.rm_blockcount;
+		/* No previous record from which to continue */
+	} else if (rmap_not_shareable(mp, &info->low)) {
+		/* Last record seen was an unshareable extent */
 		info->low.rm_owner = 0;
 		info->low.rm_offset = 0;

@@ -640,8 +649,10 @@ __xfs_getfsmap_datadev(
 		if (XFS_FSB_TO_DADDR(mp, start_fsb) >= eofs)
 			return 0;
 	} else {
+		/* Last record seen was a shareable file data extent */
 		info->low.rm_offset += info->low.rm_blockcount;
 	}
+	info->low.rm_startblock = XFS_FSB_TO_AGBNO(mp, start_fsb);

 	info->high.rm_startblock = -1U;
 	info->high.rm_owner = ULLONG_MAX;
Feedback:
The patch(es) you sent to the kernel@openeuler.org mailing list have been successfully converted to a pull request.
Pull request link: https://gitee.com/openeuler/kernel/pulls/11339
Mailing list address: https://mailweb.openeuler.org/hyperkitty/list/kernel@openeuler.org/message/V...