Long Li (5): xfs: fix a UAF when inode item push xfs: fix a UAF in xfs_iflush_abort_clean xfs: fix dir3 block read verify fail during log recover xfs: don't verify agf length when log recovery xfs: xfs_trans_cancel() path must check for log shutdown
fs/xfs/libxfs/xfs_alloc.c | 3 ++- fs/xfs/xfs_buf.c | 5 +++++ fs/xfs/xfs_buf_item_recover.c | 9 +++------ fs/xfs/xfs_inode_item.c | 8 +++++++- fs/xfs/xfs_trans.c | 2 +- 5 files changed, 18 insertions(+), 9 deletions(-)
hulk inclusion category: bugfix bugzilla: https://gitee.com/openeuler/kernel/issues/I8LHTR CVE: NA
--------------------------------
KASAN reported a UAF bug while fault injection test:
================================================================== BUG: KASAN: use-after-free in xfs_inode_item_push+0x2db/0x2f0 Read of size 8 at addr ffff888022f74788 by task xfsaild/sda/479
CPU: 0 PID: 479 Comm: xfsaild/sda Not tainted 6.2.0-rc7-00003-ga8a43e2eb5f6 #89 Call Trace: <TASK> dump_stack_lvl+0x51/0x6a print_report+0x171/0x4a6 kasan_report+0xb7/0x130 xfs_inode_item_push+0x2db/0x2f0 xfsaild+0x729/0x1f70 kthread+0x290/0x340 ret_from_fork+0x1f/0x30 </TASK>
Allocated by task 494: kasan_save_stack+0x22/0x40 kasan_set_track+0x25/0x30 __kasan_slab_alloc+0x58/0x70 kmem_cache_alloc+0x197/0x5d0 xfs_inode_item_init+0x62/0x170 xfs_trans_ijoin+0x15e/0x240 xfs_init_new_inode+0x573/0x1820 xfs_create+0x6a1/0x1020 xfs_generic_create+0x544/0x5d0 vfs_mkdir+0x5d0/0x980 do_mkdirat+0x14e/0x220 __x64_sys_mkdir+0x6a/0x80 do_syscall_64+0x39/0x80 entry_SYSCALL_64_after_hwframe+0x63/0xcd
Freed by task 14: kasan_save_stack+0x22/0x40 kasan_set_track+0x25/0x30 kasan_save_free_info+0x2e/0x40 __kasan_slab_free+0x114/0x1b0 kmem_cache_free+0xee/0x4e0 xfs_inode_free_callback+0x187/0x2a0 rcu_do_batch+0x317/0xce0 rcu_core+0x686/0xa90 __do_softirq+0x1b6/0x626
The buggy address belongs to the object at ffff888022f74758 which belongs to the cache xfs_ili of size 200 The buggy address is located 48 bytes inside of 200-byte region [ffff888022f74758, ffff888022f74820)
The buggy address belongs to the physical page: page:ffffea00008bdd00 refcount:1 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x22f74 head:ffffea00008bdd00 order:1 compound_mapcount:0 subpages_mapcount:0 compound_pincount:0 flags: 0x1fffff80010200(slab|head|node=0|zone=1|lastcpupid=0x1fffff) raw: 001fffff80010200 ffff888010ed4040 ffffea00008b2510 ffffea00008bde10 raw: 0000000000000000 00000000001a001a 00000001ffffffff 0000000000000000 page dumped because: kasan: bad access detected
Memory state around the buggy address: ffff888022f74680: 00 00 00 00 00 00 00 00 00 00 00 00 00 fc fc fc ffff888022f74700: fc fc fc fc fc fc fc fc fc fc fc fa fb fb fb fb
ffff888022f74780: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
^ ffff888022f74800: fb fb fb fb fc fc fc fc fc fc fc fc fc fc fc fc ffff888022f74880: fc fc 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ==================================================================
When push inode item in xfsaild, it will race with reclaim inodes task. Consider the following call graph, both tasks deal with the same inode. During flushing the cluster, it will enter xfs_iflush_abort() in shutdown conditions, inode's XFS_IFLUSHING flag will be cleared and lip->li_buf set to null. Concurrently, inode will be reclaimed in shutdown conditions, there is no need to wait xfs buf lock because of lip->li_buf is null at this time, inode will be freed via rcu callback if xfsaild task schedule out during flushing the cluster. so, it is unsafe to reference lip after flushing the cluster in xfs_inode_item_push().
<log item is in AIL> <filesystem shutdown> spin_lock(&ailp->ail_lock) xfs_inode_item_push(lip) xfs_buf_trylock(bp) spin_unlock(&lip->li_ailp->ail_lock) xfs_iflush_cluster(bp) if (xfs_is_shutdown()) xfs_iflush_abort(ip) xfs_trans_ail_delete(ip) spin_lock(&ailp->ail_lock) spin_unlock(&ailp->ail_lock) xfs_iflush_abort_clean(ip) error = -EIO <log item removed from AIL> <log item li_buf set to null> if (error) xfs_force_shutdown() xlog_shutdown_wait(mp->m_log) might_sleep() xfs_reclaim_inode(ip) if (shutdown) xfs_iflush_shutdown_abort(ip) if (!bp) xfs_iflush_abort(ip) return __xfs_inode_free(ip) call_rcu(ip, xfs_inode_free_callback) ...... <rcu grace period expires> <rcu free callbacks run somewhere> xfs_inode_free_callback(ip) kmem_cache_free(ip->i_itemp) ...... <starts running again> xfs_buf_ioend_fail(bp); xfs_buf_ioend(bp) xfs_buf_relse(bp); return error spin_lock(&lip->li_ailp->ail_lock) <UAF on log item>
Fix the uaf by add XFS_ILOCK_SHARED lock in xfs_inode_item_push(), this prevents race conditions between inode item push and inode reclaim.
Fixes: 90c60e164012 ("xfs: xfs_iflush() is no longer necessary") Signed-off-by: Long Li leo.lilong@huawei.com --- fs/xfs/xfs_inode_item.c | 8 +++++++- 1 file changed, 7 insertions(+), 1 deletion(-)
diff --git a/fs/xfs/xfs_inode_item.c b/fs/xfs/xfs_inode_item.c index 127b2410eb20..c3897c5417f2 100644 --- a/fs/xfs/xfs_inode_item.c +++ b/fs/xfs/xfs_inode_item.c @@ -711,9 +711,14 @@ xfs_inode_item_push( if (xfs_iflags_test(ip, XFS_IFLUSHING)) return XFS_ITEM_FLUSHING;
- if (!xfs_buf_trylock(bp)) + if (!xfs_ilock_nowait(ip, XFS_ILOCK_SHARED)) return XFS_ITEM_LOCKED;
+ if (!xfs_buf_trylock(bp)) { + xfs_iunlock(ip, XFS_ILOCK_SHARED); + return XFS_ITEM_LOCKED; + } + spin_unlock(&lip->li_ailp->ail_lock);
/* @@ -739,6 +744,7 @@ xfs_inode_item_push( }
spin_lock(&lip->li_ailp->ail_lock); + xfs_iunlock(ip, XFS_ILOCK_SHARED); return rval; }
hulk inclusion category: bugfix bugzilla: https://gitee.com/openeuler/kernel/issues/I8LHTR CVE: NA
--------------------------------
KASAN reported a UAF bug while fault injection test:
================================================================== BUG: KASAN: use-after-free in __list_del_entry_valid+0x2b1/0x2c0 Read of size 8 at addr ffff888023edb888 by task kworker/0:1/34
CPU: 0 PID: 34 Comm: kworker/0:1 Not tainted 5.10.0-07305-g4c00b418452b-dirty #369 Workqueue: xfs-reclaim/sda xfs_reclaim_worker Call Trace: dump_stack+0x115/0x16b print_address_description.constprop.0+0x2c/0x450 kasan_report.cold+0x5d/0xdb __asan_report_load8_noabort+0x20/0x30 __list_del_entry_valid+0x2b1/0x2c0 xfs_iflush_abort_clean+0x11c/0x290 xfs_iflush_abort+0xd2/0x2c0 xfs_iflush_shutdown_abort+0x2e3/0x580 xfs_icwalk_ag+0xe9d/0x1a00 xfs_reclaim_worker+0x29/0x50 process_one_work+0x71f/0x11d0 worker_thread+0x5cb/0x10a0 kthread+0x35b/0x490 ret_from_fork+0x1f/0x30
Allocated by task 642: kasan_save_stack+0x23/0x60 __kasan_kmalloc.constprop.0+0xd9/0x140 kasan_slab_alloc+0x12/0x20 kmem_cache_alloc+0x1c4/0xa50 _xfs_buf_alloc+0x72/0xd50 xfs_buf_get_map+0x156/0x7c0 xfs_trans_get_buf_map+0x41c/0x8c0 xfs_ialloc_inode_init+0x455/0xaf0 xfs_ialloc_ag_alloc+0x71f/0x1790 xfs_dialloc+0x3f9/0x8a0 xfs_ialloc+0x12e/0x1970 xfs_dir_ialloc+0x144/0x730 xfs_create+0x623/0xe80 xfs_generic_create+0x571/0x820 xfs_vn_create+0x31/0x40 path_openat+0x209d/0x3b10 do_filp_open+0x1c2/0x2e0 do_sys_openat2+0x4fc/0x900 do_sys_open+0xd8/0x150 __x64_sys_open+0x87/0xd0 do_syscall_64+0x45/0x70 entry_SYSCALL_64_after_hwframe+0x61/0xc6
The buggy address belongs to the object at ffff888023edb780 which belongs to the cache xfs_buf of size 392 The buggy address is located 264 bytes inside of 392-byte region [ffff888023edb780, ffff888023edb908) The buggy address belongs to the page: page:ffffea00008fb600 refcount:1 mapcount:0 mapping:0000000000000000 index:0xffff888023edb780 pfn:0x23ed8 head:ffffea00008fb600 order:2 compound_mapcount:0 compound_pincount:0 flags: 0x1fffff80010200(slab|head|node=0|zone=1|lastcpupid=0x1fffff) raw: 001fffff80010200 ffffea00008fb008 ffff888019016050 ffff888018ff4300 raw: ffff888023edb780 000000000013000d 00000001ffffffff 0000000000000000 page dumped because: kasan: bad access detected
Memory state around the buggy address: ffff888023edb780: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb ffff888023edb800: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
ffff888023edb880: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
^ ffff888023edb900: fb fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc ffff888023edb980: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc ==================================================================
This is a low probability problem, it tooks me a long time to find the process that the problem occurred:
1. When creating a new file, if there are no free inodes, we need to allocate a new chunk. The buf item and inode items associated with inode will be submitted to CIL independently. If all goes well, both the buf item and the inode item will be inserted into the AIL, and the buf item will be in front of the inode item.
2. At the first time, xfsaild only pushed buf item. If an error occurs while writing back the inode buffer, the inode item will be set XFS_LI_FAILED in xfs_buf_inode_io_fail() when buf io end, and the buf item will remain in the AIL.
3. At the second time, xfsaild only pushed buf item again, while writing back the inode buffer and the log has shut down, the inode buffer will be set XBF_STALE and the buf item is removed from AIL when buf io end. Because of inode is not flushed, ili_last_fields in xfs_inode is still 0, so inode item will left in AIL.
4. Concurrently, a new transaction log inode that in the same cluster as the previous inode, it will get the same inode buffer in xfs_buf_find(), _XBF_INODES flag will be cleared in xfs_buf_find() due to buffer is staled.
5. At the third time, xfsaild push the inode item that has marked XFS_LI_FAILED, AIL will resubmit the inode item in xfsaild_resubmit_item(). It will go to the wrong code path due to inode buffer missing _XBF_INODES flag, all inode items that in bp->b_li_list will be reduced the references to buffer, and inode item's li_buf set to null, but inode item still in bp->b_li_list. After all reference count decreasing the inode buffer will be freed.
6. When xfs reclaim inode, remove inode item from bp->b_li_list will cause a uaf xfs_iflush_abort_clean().
Fix it by add xfs shutdown condition check in xfs_buf_find(), if it has been shutdown, it is useless to get the buffer. While the inode item is still reference to the inode buffer, the _XBF_INODES flag will not be missing.
Signed-off-by: Long Li leo.lilong@huawei.com --- fs/xfs/xfs_buf.c | 5 +++++ 1 file changed, 5 insertions(+)
diff --git a/fs/xfs/xfs_buf.c b/fs/xfs/xfs_buf.c index c1ece4a08ff4..54f10d914b46 100644 --- a/fs/xfs/xfs_buf.c +++ b/fs/xfs/xfs_buf.c @@ -554,6 +554,11 @@ xfs_buf_find_lock( XFS_STATS_INC(bp->b_mount, xb_get_locked_waited); }
+ if (xlog_is_shutdown(bp->b_mount->m_log)) { + xfs_buf_relse(bp); + return -EIO; + } + /* * if the buffer is stale, clear all the external state associated with * it. We need to keep flags such as how we allocated the buffer memory
hulk inclusion category: bugfix bugzilla: https://gitee.com/openeuler/kernel/issues/I8LHTR CVE: NA
--------------------------------
Our growfs test trigger mount error show as below:
XFS (dm-3): Starting recovery (logdev: internal) XFS (dm-3): Internal error !ino_ok at line 201 of file fs/xfs/libxfs/xfs_dir2.c. Caller xfs_dir_ino_validate+0x54/0xc0 [xfs] CPU: 0 PID: 3719345 Comm: mount Kdump: loaded Not tainted 5.10.0-136.12.0.86.h1036.kasan.eulerosv2r12.aarch64 #1 Hardware name: QEMU KVM Virtual Machine, BIOS 0.0.0 02/06/2015 Call trace: dump_backtrace+0x0/0x3a4 show_stack+0x34/0x4c dump_stack+0x170/0x1dc xfs_corruption_error+0x104/0x11c [xfs] xfs_dir_ino_validate+0x8c/0xc0 [xfs] __xfs_dir3_data_check+0x5e4/0xb60 [xfs] xfs_dir3_block_verify+0x108/0x190 [xfs] xfs_dir3_block_read_verify+0x10c/0x184 [xfs] xlog_recover_buf_commit_pass2+0x388/0x8e0 [xfs] xlog_recover_items_pass2+0xc8/0x160 [xfs] xlog_recover_commit_trans+0x56c/0x58c [xfs] xlog_recovery_process_trans+0x174/0x180 [xfs] xlog_recover_process_ophdr+0x120/0x210 [xfs] xlog_recover_process_data+0xcc/0x1b4 [xfs] xlog_recover_process+0x124/0x25c [xfs] xlog_do_recovery_pass+0x534/0x864 [xfs] xlog_do_log_recovery+0x98/0xc4 [xfs] xlog_do_recover+0x64/0x2ec [xfs] xlog_recover+0x1c4/0x2f0 [xfs] xfs_log_mount+0x1b8/0x550 [xfs] xfs_mountfs+0x768/0xe40 [xfs] xfs_fc_fill_super+0xb54/0xeb0 [xfs] get_tree_bdev+0x240/0x3e0 xfs_fc_get_tree+0x30/0x40 [xfs] vfs_get_tree+0x5c/0x1a4 do_new_mount+0x1c8/0x220 path_mount+0x2a8/0x3f0 __arm64_sys_mount+0x1cc/0x220 el0_svc_common.constprop.0+0xc0/0x2c4 do_el0_svc+0xb4/0xec el0_svc+0x24/0x3c el0_sync_handler+0x160/0x164 el0_sync+0x160/0x180 XFS (dm-3): Corruption detected. Unmount and run xfs_repair XFS (dm-3): Invalid inode number 0xd00082 XFS (dm-3): Metadata corruption detected at __xfs_dir3_data_check+0xa08/0xb60 [xfs], xfs_dir3_block block 0x100070 XFS (dm-3): Unmount and run xfs_repair XFS (dm-3): First 128 bytes of corrupted metadata buffer: 00000000: 58 44 42 33 f3 c3 ae cf 00 00 00 00 00 10 00 70 XDB3...........p 00000010: 00 00 00 01 00 00 0c 3e 85 a6 68 6c 63 b3 42 30 .......>..hlc.B0 00000020: b0 6b d1 b5 9d eb 55 7d 00 00 00 00 00 10 00 90 .k....U}........ 00000030: 03 60 0b 78 01 40 00 40 00 b0 00 40 00 00 00 00 .`.x.@.@...@.... 00000040: 00 00 00 00 00 10 00 90 01 2e 02 00 00 00 00 40 ...............@ 00000050: 00 00 00 00 00 10 00 81 02 2e 2e 02 00 00 00 50 ...............P 00000060: 00 00 00 00 00 10 00 91 02 66 65 01 00 00 00 60 .........fe....` 00000070: 00 00 00 00 00 10 00 92 02 66 66 01 00 00 00 70 .........ff....p
Consider the following log format, dir3 block has 2 items in the log, and the inode number recorded ondisk in diri3 block exceeds the current file system boundary. When replaying log items, it will skipping replay of first dir3 buffer log item due to the log item LSN being behind the ondisk buffer, but verification is still required. Since the superblock hasn't been replayed yet, the inode number in dir3 block exceeds the file system boundary and causes log recovery to fail.
log record: +---------------+----------------+--------------+-------------------+ | dir3 buf item | growfs sb item | inode item | dir3 buf item .. | +---------------+----------------+--------------+-------------------+ lsn X lsn X+A lsn X+A+B lsn X+A+B+C
metadata block: +-----------+-----------+------------+-----------+-----------+ | sb block | ... | dir3 block | ... | inodes | +-----------+-----------+------------+-----------+-----------+ lsn < X lsn X+A+B+C
Remove buffer read verify during log recovry pass2, clear buffer's XBF_DONE flag, so it can be verified in the next buf read after log recover.
Fixes: d38a530e3710 ("xfs: verify buffer contents when we skip log replay") Signed-off-by: Long Li leo.lilong@huawei.com --- fs/xfs/xfs_buf_item_recover.c | 9 +++------ 1 file changed, 3 insertions(+), 6 deletions(-)
diff --git a/fs/xfs/xfs_buf_item_recover.c b/fs/xfs/xfs_buf_item_recover.c index 43167f543afc..36177164d141 100644 --- a/fs/xfs/xfs_buf_item_recover.c +++ b/fs/xfs/xfs_buf_item_recover.c @@ -946,13 +946,10 @@ xlog_recover_buf_commit_pass2(
/* * We're skipping replay of this buffer log item due to the log - * item LSN being behind the ondisk buffer. Verify the buffer - * contents since we aren't going to run the write verifier. + * item LSN being behind the ondisk buffer. clear XBF_DONE flag + * of the buffer to prevent buffer from being used without verify. */ - if (bp->b_ops) { - bp->b_ops->verify_read(bp); - error = bp->b_error; - } + bp->b_flags &= ~XBF_DONE; goto out_release; }
hulk inclusion category: bugfix bugzilla: https://gitee.com/openeuler/kernel/issues/I8LHTR CVE: NA
--------------------------------
The following corruption is reported when log recovery:
XFS (loop0): Metadata corruption detected at xfs_agf_verify+0x386/0xb70, xfs_agf block 0x64001 XFS (loop0): Unmount and run xfs_repair XFS (loop0): First 128 bytes of corrupted metadata buffer: 00000000: 58 41 47 46 00 00 00 01 00 00 00 02 00 00 4b 88 XAGF..........K. 00000010: 00 00 00 01 00 00 00 02 00 00 00 00 00 00 00 01 ................ 00000020: 00 00 00 01 00 00 00 00 00 00 00 01 00 00 00 04 ................ 00000030: 00 00 00 04 00 00 4b 7e 00 00 4b 7e 00 00 00 00 ......K~..K~.... 00000040: 3a 9d 97 6d b5 a0 42 13 a3 b3 7f 28 2a ac 3f e8 :..m..B....(*.?. 00000050: 00 00 00 00 00 00 00 01 00 00 00 05 00 00 00 01 ................ 00000060: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................ 00000070: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................ XFS (loop0): Corruption of in-memory data (0x8) detected at _xfs_buf_ioapply+0xbc9/0xe60 (fs/xfs/xfs_buf.c:1593). Shutting down filesystem. XFS (loop0): Please unmount the filesystem and rectify the problem(s) XFS (loop0): log mount/recovery failed: error -117 XFS (loop0): log mount failed
This problem occurs during agf write verify, where the AG's agf_length is smaller than sb_agblocks, and it is not the last AG. Consider the following situation, the file system has 3 AG, and the last AG is not full size, now the file system is growfs twice. In the first growfs, only AG2 is increased some blocks. In the second growfs, AG2 is extent to the full AG size and an AG3 is added.
pre growfs:
|----------|----------|-----|
First growfs:
|----------|----------|--------|
Second growfs:
|----------|----------|----------|-------| AG0 AG1 AG2 AG3
During each growfs, agf in AG2 and sb will be modified, if sb has already written to metadata and agf has not yet, then a shutdown occurs. The next time we mount the disk, the sb changes from the second growfs will have taken effect in memory, when we recovery the agf from the first growfs will report the above problem.
Fixes: edd8276dd702 ("xfs: AGF length has never been bounds checked") Signed-off-by: Long Li leo.lilong@huawei.com --- fs/xfs/libxfs/xfs_alloc.c | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-)
diff --git a/fs/xfs/libxfs/xfs_alloc.c b/fs/xfs/libxfs/xfs_alloc.c index 3069194527dd..0246dc3e8bab 100644 --- a/fs/xfs/libxfs/xfs_alloc.c +++ b/fs/xfs/libxfs/xfs_alloc.c @@ -2980,7 +2980,8 @@ xfs_validate_ag_length( * Only the last AG in the filesystem is allowed to be shorter * than the AG size recorded in the superblock. */ - if (length != mp->m_sb.sb_agblocks) { + if (length != mp->m_sb.sb_agblocks && + !(bp->b_flags & _XBF_LOGRECOVERY)) { /* * During growfs, the new last AG can get here before we * have updated the superblock. Give it a pass on the seqno
hulk inclusion category: bugfix bugzilla: https://gitee.com/openeuler/kernel/issues/I8LHTR CVE: NA
--------------------------------
The following error occurred when do IO fault injection test:
XFS: Assertion failed: xlog_is_shutdown(lip->li_log), file: fs/xfs/xfs_inode_item.c, line: 748
commit "3c4cb76bce43 xfs: xfs_trans_commit() path must check for log shutdown" fix a problem that dirty transaction was canceled before log shutdown, because of the log is still running, it result dirty and unlogged inode item that isn't in the AIL in memory that can be flushed to disk via writeback clustering.
xfs_trans_cancel() has the same problem, if a shut down races with xfs_trans_cancel() and we have shut down the filesystem but not the log, we will still cancel the transaction before log shutdown. So xfs_trans_cancel() needs to check log state for shutdown, not mount.
Signed-off-by: Long Li leo.lilong@huawei.com --- fs/xfs/xfs_trans.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/fs/xfs/xfs_trans.c b/fs/xfs/xfs_trans.c index 8c0bfc9a33b1..29f6ac670135 100644 --- a/fs/xfs/xfs_trans.c +++ b/fs/xfs/xfs_trans.c @@ -1100,7 +1100,7 @@ xfs_trans_cancel( * progress, so we only need to check against the mount shutdown state * here. */ - if (dirty && !xfs_is_shutdown(mp)) { + if (dirty && !(xfs_is_shutdown(mp) && xlog_is_shutdown(log))) { XFS_ERROR_REPORT("xfs_trans_cancel", XFS_ERRLEVEL_LOW, mp); xfs_force_shutdown(mp, SHUTDOWN_CORRUPT_INCORE); }