[PATCH OLK-6.6 00/26] ext4: better scalability for ext4 block allocation

Baokun Li (18):
  ext4: add ext4_try_lock_group() to skip busy groups
  Revert "ext4: move mb_last_[group|start] to ext4_inode_info"
  ext4: separate stream goal hits from s_bal_goals for better tracking
  ext4: remove unnecessary s_mb_last_start
  ext4: remove unnecessary s_md_lock on update s_mb_last_group
  ext4: utilize multiple global goals to reduce contention
  ext4: get rid of some obsolete EXT4_MB_HINT flags
  ext4: fix typo in CR_GOAL_LEN_SLOW comment
  ext4: convert sbi->s_mb_free_pending to atomic_t
  ext4: merge freed extent with existing extents before insertion
  ext4: fix zombie groups in average fragment size lists
  ext4: fix largest free orders lists corruption on mb_optimize_scan switch
  ext4: factor out __ext4_mb_scan_group()
  ext4: factor out ext4_mb_might_prefetch()
  ext4: factor out ext4_mb_scan_group()
  ext4: convert free groups order lists to xarrays
  ext4: refactor choose group to scan group
  ext4: implement linear-like traversal across order xarrays

Jinke Han (1):
  ext4: make running and commit transaction have their own freed_data_list

Kemeng Shi (5):
  ext4: remove unused ext4_allocation_context::ac_groups_considered
  ext4: keep "prefetch_grp" and "nr" consistent
  ext4: remove unused parameter ngroup in ext4_mb_choose_next_group_*()
  ext4: use correct criteria name instead stale integer number in comment
  ext4: open coding repeated check in next_linear_group

Ojaswin Mujoo (2):
  ext4: fallback to complex scan if aligned scan doesn't work
  ext4: convert EXT4_B2C(sbi->s_stripe) users to EXT4_NUM_B2C

 fs/ext4/balloc.c            |   2 +-
 fs/ext4/ext4.h              |  38 +-
 fs/ext4/mballoc.c           | 948 ++++++++++++++++++++----------------
 fs/ext4/mballoc.h           |  14 +-
 fs/ext4/super.c             |   2 -
 include/trace/events/ext4.h |   3 -
 6 files changed, 547 insertions(+), 460 deletions(-)

-- 2.46.1

mainline inclusion from mainline-v6.17-rc1 commit e9eec6f33971fbfcdd32fd1c7dd515ff4d2954c0 category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/ICN2MB Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i... -------------------------------- When ext4 allocates blocks, we used to just go through the block groups one by one to find a good one. But when there are tons of block groups (like hundreds of thousands or even millions) and not many have free space (meaning they're mostly full), it takes a really long time to check them all, and performance gets bad. So, we added the "mb_optimize_scan" mount option (which is on by default now). It keeps track of some group lists, so when we need a free block, we can just grab a likely group from the right list. This saves time and makes block allocation much faster. But when multiple processes or containers are doing similar things, like constantly allocating 8k blocks, they all try to use the same block group in the same list. Even just two processes doing this can cut the IOPS in half. For example, one container might do 300,000 IOPS, but if you run two at the same time, the total is only 150,000. Since we can already look at block groups in a non-linear way, the first and last groups in the same list are basically the same for finding a block right now. Therefore, add an ext4_try_lock_group() helper function to skip the current group when it is locked by another process, thereby avoiding contention with other processes. This helps ext4 make better use of having multiple block groups. Also, to make sure we don't skip all the groups that have free space when allocating blocks, we won't try to skip busy groups anymore when ac_criteria is CR_ANY_FREE. Performance test data follows: Test: Running will-it-scale/fallocate2 on CPU-bound containers. Observation: Average fallocate operations per container per second. |CPU: Kunpeng 920 | P80 | |Memory: 512GB |-------------------------| |960GB SSD (0.5GB/s)| base | patched | |-------------------|-------|-----------------| |mb_optimize_scan=0 | 2667 | 4821 (+80.7%) | |mb_optimize_scan=1 | 2643 | 4784 (+81.0%) | |CPU: AMD 9654 * 2 | P96 | |Memory: 1536GB |-------------------------| |960GB SSD (1GB/s) | base | patched | |-------------------|-------|-----------------| |mb_optimize_scan=0 | 3450 | 15371 (+345%) | |mb_optimize_scan=1 | 3209 | 6101 (+90.0%) | Signed-off-by: Baokun Li <libaokun1@huawei.com> Reviewed-by: Jan Kara <jack@suse.cz> Reviewed-by: Ojaswin Mujoo <ojaswin@linux.ibm.com> Reviewed-by: Zhang Yi <yi.zhang@huawei.com> Link: https://patch.msgid.link/20250714130327.1830534-2-libaokun1@huawei.com Signed-off-by: Theodore Ts'o <tytso@mit.edu> Conflicts: fs/ext4/mballoc.c [Because of already merged-in hulk commit da6eb19e6550 ("ext4: add ext4_try_lock_group() to skip busy groups").] 
Signed-off-by: Baokun Li <libaokun1@huawei.com> --- fs/ext4/mballoc.c | 13 +++++++++---- 1 file changed, 9 insertions(+), 4 deletions(-) diff --git a/fs/ext4/mballoc.c b/fs/ext4/mballoc.c index ec2808247f4f..d7a1fcbdcff6 100644 --- a/fs/ext4/mballoc.c +++ b/fs/ext4/mballoc.c @@ -899,8 +899,8 @@ static void ext4_mb_choose_next_group_p2_aligned(struct ext4_allocation_context bb_largest_free_order_node) { if (sbi->s_mb_stats) atomic64_inc(&sbi->s_bal_cX_groups_considered[CR_POWER2_ALIGNED]); - if (likely(ext4_mb_good_group(ac, iter->bb_group, CR_POWER2_ALIGNED)) && - !spin_is_locked(ext4_group_lock_ptr(ac->ac_sb, iter->bb_group))) { + if (!spin_is_locked(ext4_group_lock_ptr(ac->ac_sb, iter->bb_group)) && + likely(ext4_mb_good_group(ac, iter->bb_group, CR_POWER2_ALIGNED))) { *group = iter->bb_group; ac->ac_flags |= EXT4_MB_CR_POWER2_ALIGNED_OPTIMIZED; read_unlock(&sbi->s_mb_largest_free_orders_locks[i]); @@ -936,8 +936,8 @@ ext4_mb_find_good_group_avg_frag_lists(struct ext4_allocation_context *ac, int o list_for_each_entry(iter, frag_list, bb_avg_fragment_size_node) { if (sbi->s_mb_stats) atomic64_inc(&sbi->s_bal_cX_groups_considered[cr]); - if (likely(ext4_mb_good_group(ac, iter->bb_group, cr)) && - !spin_is_locked(ext4_group_lock_ptr(ac->ac_sb, iter->bb_group))) { + if (!spin_is_locked(ext4_group_lock_ptr(ac->ac_sb, iter->bb_group)) && + likely(ext4_mb_good_group(ac, iter->bb_group, cr))) { grp = iter; break; } @@ -2899,6 +2899,11 @@ ext4_mb_regular_allocator(struct ext4_allocation_context *ac) nr, &prefetch_ios); } + /* prevent unnecessary buddy loading. */ + if (cr < CR_ANY_FREE && + spin_is_locked(ext4_group_lock_ptr(sb, group))) + continue; + /* This now checks without needing the buddy page */ ret = ext4_mb_good_group_nolock(ac, group, cr); if (ret <= 0) { -- 2.46.1
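
[Editor's note] The core idea of this first patch can be shown outside the kernel. Below is a minimal userspace sketch, not ext4 code: sim_group and try_pick_group are invented names, and a POSIX spinlock stands in for the group lock. It demonstrates the try-and-skip pattern, with a blocking pass kept as the last resort, mirroring the CR_ANY_FREE exception.

#include <pthread.h>
#include <stdio.h>

#define NR_GROUPS 8

struct sim_group {
        pthread_spinlock_t lock;
        int free_blocks;
};

static struct sim_group groups[NR_GROUPS];

/*
 * Pick the first group that has enough space AND whose lock can be taken
 * without waiting; a contended group is simply skipped, in the spirit of
 * ext4_try_lock_group().
 */
static int try_pick_group(int needed)
{
        for (int i = 0; i < NR_GROUPS; i++) {
                if (pthread_spin_trylock(&groups[i].lock) != 0)
                        continue;       /* busy: another allocator owns it */
                if (groups[i].free_blocks >= needed) {
                        groups[i].free_blocks -= needed;
                        pthread_spin_unlock(&groups[i].lock);
                        return i;
                }
                pthread_spin_unlock(&groups[i].lock);
        }
        return -1;      /* caller would fall back to a blocking pass (CR_ANY_FREE) */
}

int main(void)
{
        for (int i = 0; i < NR_GROUPS; i++) {
                pthread_spin_init(&groups[i].lock, PTHREAD_PROCESS_PRIVATE);
                groups[i].free_blocks = 128;
        }
        printf("picked group %d\n", try_pick_group(16));
        return 0;
}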

hulk inclusion category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/ICN2MB -------------------------------- This reverts commit 0f5fea890fd315f60e80749436a90f9f7e5ce503. This commit reverts our custom-developed solution. This change is a preparatory step to align our codebase with the mainline, making it easier to backport the official mainline solution. Signed-off-by: Baokun Li <libaokun1@huawei.com> --- fs/ext4/ext4.h | 7 +++---- fs/ext4/mballoc.c | 20 +++++++++++--------- fs/ext4/super.c | 2 -- 3 files changed, 14 insertions(+), 15 deletions(-) diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h index a31069001add..885fe1d9a8e9 100644 --- a/fs/ext4/ext4.h +++ b/fs/ext4/ext4.h @@ -1179,10 +1179,6 @@ struct ext4_inode_info { __u32 i_csum_seed; kprojid_t i_projid; - - /* where last allocation was done - for stream allocation */ - ext4_group_t i_mb_last_group; - ext4_grpblk_t i_mb_last_start; }; /* @@ -1615,6 +1611,9 @@ struct ext4_sb_info { unsigned int s_mb_order2_reqs; unsigned int s_mb_group_prealloc; unsigned int s_max_dir_size_kb; + /* where last allocation was done - for stream allocation */ + unsigned long s_mb_last_group; + unsigned long s_mb_last_start; unsigned int s_mb_prefetch; unsigned int s_mb_prefetch_limit; unsigned int s_mb_best_avail_max_trim_order; diff --git a/fs/ext4/mballoc.c b/fs/ext4/mballoc.c index d7a1fcbdcff6..c33d40acdf16 100644 --- a/fs/ext4/mballoc.c +++ b/fs/ext4/mballoc.c @@ -2139,6 +2139,7 @@ static int mb_mark_used(struct ext4_buddy *e4b, struct ext4_free_extent *ex) static void ext4_mb_use_best_found(struct ext4_allocation_context *ac, struct ext4_buddy *e4b) { + struct ext4_sb_info *sbi = EXT4_SB(ac->ac_sb); int ret; BUG_ON(ac->ac_b_ex.fe_group != e4b->bd_group); @@ -2169,8 +2170,10 @@ static void ext4_mb_use_best_found(struct ext4_allocation_context *ac, get_page(ac->ac_buddy_page); /* store last allocated for subsequent stream allocation */ if (ac->ac_flags & EXT4_MB_STREAM_ALLOC) { - EXT4_I(ac->ac_inode)->i_mb_last_group = ac->ac_f_ex.fe_group; - EXT4_I(ac->ac_inode)->i_mb_last_start = ac->ac_f_ex.fe_start; + spin_lock(&sbi->s_md_lock); + sbi->s_mb_last_group = ac->ac_f_ex.fe_group; + sbi->s_mb_last_start = ac->ac_f_ex.fe_start; + spin_unlock(&sbi->s_md_lock); } /* * As we've just preallocated more space than @@ -2842,14 +2845,13 @@ ext4_mb_regular_allocator(struct ext4_allocation_context *ac) MB_NUM_ORDERS(sb)); } - /* if stream allocation is enabled, use last goal */ + /* if stream allocation is enabled, use global goal */ if (ac->ac_flags & EXT4_MB_STREAM_ALLOC) { - struct ext4_inode_info *ei = EXT4_I(ac->ac_inode); - - if (ei->i_mb_last_group || ei->i_mb_last_start) { - ac->ac_g_ex.fe_group = ei->i_mb_last_group; - ac->ac_g_ex.fe_start = ei->i_mb_last_start; - } + /* TBD: may be hot point */ + spin_lock(&sbi->s_md_lock); + ac->ac_g_ex.fe_group = sbi->s_mb_last_group; + ac->ac_g_ex.fe_start = sbi->s_mb_last_start; + spin_unlock(&sbi->s_md_lock); } /* diff --git a/fs/ext4/super.c b/fs/ext4/super.c index f7a301116d59..c45dfcf9ac62 100644 --- a/fs/ext4/super.c +++ b/fs/ext4/super.c @@ -1517,8 +1517,6 @@ static struct inode *ext4_alloc_inode(struct super_block *sb) INIT_WORK(&ei->i_iomap_ioend_work, ext4_iomap_end_io); ext4_fc_init_inode(&ei->vfs_inode); mutex_init(&ei->i_fc_lock); - ei->i_mb_last_group = 0; - ei->i_mb_last_start = 0; return &ei->vfs_inode; } -- 2.46.1

mainline inclusion from mainline-v6.17-rc1 commit 35bfd4b44ef04a10a091a4037a26296e7cd5273a category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/ICN2MB Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i... -------------------------------- In ext4_mb_regular_allocator(), after the call to ext4_mb_find_by_goal() fails to achieve the inode goal, allocation continues with the stream allocation global goal. Currently, hits for both are combined in sbi->s_bal_goals, hindering accurate optimization. This commit separates global goal hits into sbi->s_bal_stream_goals. Since stream allocation doesn't use ac->ac_g_ex.fe_start, set fe_start to -1. This prevents stream allocations from being counted in s_bal_goals. Also clear EXT4_MB_HINT_TRY_GOAL to avoid calling ext4_mb_find_by_goal again. After adding `stream_goal_hits`, `/proc/fs/ext4/sdx/mb_stats` will show: mballoc: reqs: 840347 success: 750992 groups_scanned: 1230506 cr_p2_aligned_stats: hits: 21531 groups_considered: 411664 extents_scanned: 21531 useless_loops: 0 bad_suggestions: 6 cr_goal_fast_stats: hits: 111222 groups_considered: 1806728 extents_scanned: 467908 useless_loops: 0 bad_suggestions: 13 cr_best_avail_stats: hits: 36267 groups_considered: 1817631 extents_scanned: 156143 useless_loops: 0 bad_suggestions: 204 cr_goal_slow_stats: hits: 106396 groups_considered: 5671710 extents_scanned: 22540056 useless_loops: 123747 cr_any_free_stats: hits: 138071 groups_considered: 724692 extents_scanned: 23615593 useless_loops: 585 extents_scanned: 46804261 goal_hits: 1307 stream_goal_hits: 236317 len_goal_hits: 155549 2^n_hits: 21531 breaks: 225096 lost: 35062 buddies_generated: 40/40 buddies_time_used: 48004 preallocated: 5962467 discarded: 4847560 Signed-off-by: Baokun Li <libaokun1@huawei.com> Reviewed-by: Zhang Yi <yi.zhang@huawei.com> Link: https://patch.msgid.link/20250714130327.1830534-3-libaokun1@huawei.com Signed-off-by: Theodore Ts'o <tytso@mit.edu> Signed-off-by: Baokun Li <libaokun1@huawei.com> --- fs/ext4/ext4.h | 1 + fs/ext4/mballoc.c | 11 +++++++++-- 2 files changed, 10 insertions(+), 2 deletions(-) diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h index 885fe1d9a8e9..936dd5d54600 100644 --- a/fs/ext4/ext4.h +++ b/fs/ext4/ext4.h @@ -1626,6 +1626,7 @@ struct ext4_sb_info { atomic_t s_bal_cX_ex_scanned[EXT4_MB_NUM_CRS]; /* total extents scanned */ atomic_t s_bal_groups_scanned; /* number of groups scanned */ atomic_t s_bal_goals; /* goal hits */ + atomic_t s_bal_stream_goals; /* stream allocation global goal hits */ atomic_t s_bal_len_goals; /* len goal hits */ atomic_t s_bal_breaks; /* too long searches */ atomic_t s_bal_2orders; /* 2^order hits */ diff --git a/fs/ext4/mballoc.c b/fs/ext4/mballoc.c index c33d40acdf16..488a3ef289cc 100644 --- a/fs/ext4/mballoc.c +++ b/fs/ext4/mballoc.c @@ -2850,8 +2850,9 @@ ext4_mb_regular_allocator(struct ext4_allocation_context *ac) /* TBD: may be hot point */ spin_lock(&sbi->s_md_lock); ac->ac_g_ex.fe_group = sbi->s_mb_last_group; - ac->ac_g_ex.fe_start = sbi->s_mb_last_start; spin_unlock(&sbi->s_md_lock); + ac->ac_g_ex.fe_start = -1; + ac->ac_flags &= ~EXT4_MB_HINT_TRY_GOAL; } /* @@ -2993,8 +2994,12 @@ ext4_mb_regular_allocator(struct ext4_allocation_context *ac) } } - if (sbi->s_mb_stats && ac->ac_status == AC_STATUS_FOUND) + if (sbi->s_mb_stats && ac->ac_status == AC_STATUS_FOUND) { atomic64_inc(&sbi->s_bal_cX_hits[ac->ac_criteria]); + if (ac->ac_flags & EXT4_MB_STREAM_ALLOC && + ac->ac_b_ex.fe_group == ac->ac_g_ex.fe_group) + 
atomic_inc(&sbi->s_bal_stream_goals); + } out: if (!err && ac->ac_status != AC_STATUS_FOUND && first_err) err = first_err; @@ -3190,6 +3195,8 @@ int ext4_seq_mb_stats_show(struct seq_file *seq, void *offset) seq_printf(seq, "\textents_scanned: %u\n", atomic_read(&sbi->s_bal_ex_scanned)); seq_printf(seq, "\t\tgoal_hits: %u\n", atomic_read(&sbi->s_bal_goals)); + seq_printf(seq, "\t\tstream_goal_hits: %u\n", + atomic_read(&sbi->s_bal_stream_goals)); seq_printf(seq, "\t\tlen_goal_hits: %u\n", atomic_read(&sbi->s_bal_len_goals)); seq_printf(seq, "\t\t2^n_hits: %u\n", atomic_read(&sbi->s_bal_2orders)); -- 2.46.1

mainline inclusion from mainline-v6.17-rc1 commit f0374d80711adf8628bdd442131c045c64a10951 category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/ICN2MB Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i... -------------------------------- Since stream allocation does not use ac->ac_f_ex.fe_start, it is set to -1 by default, so the no longer needed sbi->s_mb_last_start is removed. Signed-off-by: Baokun Li <libaokun1@huawei.com> Reviewed-by: Jan Kara <jack@suse.cz> Reviewed-by: Ojaswin Mujoo <ojaswin@linux.ibm.com> Reviewed-by: Zhang Yi <yi.zhang@huawei.com> Link: https://patch.msgid.link/20250714130327.1830534-4-libaokun1@huawei.com Signed-off-by: Theodore Ts'o <tytso@mit.edu> Signed-off-by: Baokun Li <libaokun1@huawei.com> --- fs/ext4/ext4.h | 1 - fs/ext4/mballoc.c | 1 - 2 files changed, 2 deletions(-) diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h index 936dd5d54600..081e672d8a95 100644 --- a/fs/ext4/ext4.h +++ b/fs/ext4/ext4.h @@ -1613,7 +1613,6 @@ struct ext4_sb_info { unsigned int s_max_dir_size_kb; /* where last allocation was done - for stream allocation */ unsigned long s_mb_last_group; - unsigned long s_mb_last_start; unsigned int s_mb_prefetch; unsigned int s_mb_prefetch_limit; unsigned int s_mb_best_avail_max_trim_order; diff --git a/fs/ext4/mballoc.c b/fs/ext4/mballoc.c index 488a3ef289cc..5d2b7863cbda 100644 --- a/fs/ext4/mballoc.c +++ b/fs/ext4/mballoc.c @@ -2172,7 +2172,6 @@ static void ext4_mb_use_best_found(struct ext4_allocation_context *ac, if (ac->ac_flags & EXT4_MB_STREAM_ALLOC) { spin_lock(&sbi->s_md_lock); sbi->s_mb_last_group = ac->ac_f_ex.fe_group; - sbi->s_mb_last_start = ac->ac_f_ex.fe_start; spin_unlock(&sbi->s_md_lock); } /* -- 2.46.1

mainline inclusion from mainline-v6.17-rc1 commit 4b41deb896e3d0417701759194f0765c06258b9c category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/ICN2MB Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i... -------------------------------- After we optimized the block group lock, we found another lock contention issue when running will-it-scale/fallocate2 with multiple processes. The fallocate's block allocation and the truncate's block release were fighting over the s_md_lock. The problem is, this lock protects totally different things in those two processes: the list of freed data blocks (s_freed_data_list) when releasing, and where to start looking for new blocks (mb_last_group) when allocating. Now we only need to track s_mb_last_group and no longer need to track s_mb_last_start, so we don't need the s_md_lock lock to ensure that the two are consistent. Since s_mb_last_group is merely a hint and doesn't require strong synchronization, READ_ONCE/WRITE_ONCE is sufficient. Besides, the s_mb_last_group data type only requires ext4_group_t (i.e., unsigned int), rendering unsigned long superfluous. Performance test data follows: Test: Running will-it-scale/fallocate2 on CPU-bound containers. Observation: Average fallocate operations per container per second. |CPU: Kunpeng 920 | P80 | P1 | |Memory: 512GB |------------------------|-------------------------| |960GB SSD (0.5GB/s)| base | patched | base | patched | |-------------------|-------|----------------|--------|----------------| |mb_optimize_scan=0 | 4821 | 9636 (+99.8%) | 314065 | 337597 (+7.4%) | |mb_optimize_scan=1 | 4784 | 4834 (+1.04%) | 316344 | 341440 (+7.9%) | |CPU: AMD 9654 * 2 | P96 | P1 | |Memory: 1536GB |------------------------|-------------------------| |960GB SSD (1GB/s) | base | patched | base | patched | |-------------------|-------|----------------|--------|----------------| |mb_optimize_scan=0 | 15371 | 22341 (+45.3%) | 205851 | 219707 (+6.7%) | |mb_optimize_scan=1 | 6101 | 9177 (+50.4%) | 207373 | 215732 (+4.0%) | Suggested-by: Jan Kara <jack@suse.cz> Signed-off-by: Baokun Li <libaokun1@huawei.com> Reviewed-by: Ojaswin Mujoo <ojaswin@linux.ibm.com> Reviewed-by: Zhang Yi <yi.zhang@huawei.com> Link: https://patch.msgid.link/20250714130327.1830534-5-libaokun1@huawei.com Signed-off-by: Theodore Ts'o <tytso@mit.edu> Signed-off-by: Baokun Li <libaokun1@huawei.com> --- fs/ext4/ext4.h | 2 +- fs/ext4/mballoc.c | 12 +++--------- 2 files changed, 4 insertions(+), 10 deletions(-) diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h index 081e672d8a95..d8382d116211 100644 --- a/fs/ext4/ext4.h +++ b/fs/ext4/ext4.h @@ -1612,7 +1612,7 @@ struct ext4_sb_info { unsigned int s_mb_group_prealloc; unsigned int s_max_dir_size_kb; /* where last allocation was done - for stream allocation */ - unsigned long s_mb_last_group; + ext4_group_t s_mb_last_group; unsigned int s_mb_prefetch; unsigned int s_mb_prefetch_limit; unsigned int s_mb_best_avail_max_trim_order; diff --git a/fs/ext4/mballoc.c b/fs/ext4/mballoc.c index 5d2b7863cbda..d6c76fb297e3 100644 --- a/fs/ext4/mballoc.c +++ b/fs/ext4/mballoc.c @@ -2169,11 +2169,8 @@ static void ext4_mb_use_best_found(struct ext4_allocation_context *ac, ac->ac_buddy_page = e4b->bd_buddy_page; get_page(ac->ac_buddy_page); /* store last allocated for subsequent stream allocation */ - if (ac->ac_flags & EXT4_MB_STREAM_ALLOC) { - spin_lock(&sbi->s_md_lock); - sbi->s_mb_last_group = ac->ac_f_ex.fe_group; - spin_unlock(&sbi->s_md_lock); - } + if (ac->ac_flags & 
EXT4_MB_STREAM_ALLOC) + WRITE_ONCE(sbi->s_mb_last_group, ac->ac_f_ex.fe_group); /* * As we've just preallocated more space than * user requested originally, we store allocated @@ -2846,10 +2843,7 @@ ext4_mb_regular_allocator(struct ext4_allocation_context *ac) /* if stream allocation is enabled, use global goal */ if (ac->ac_flags & EXT4_MB_STREAM_ALLOC) { - /* TBD: may be hot point */ - spin_lock(&sbi->s_md_lock); - ac->ac_g_ex.fe_group = sbi->s_mb_last_group; - spin_unlock(&sbi->s_md_lock); + ac->ac_g_ex.fe_group = READ_ONCE(sbi->s_mb_last_group); ac->ac_g_ex.fe_start = -1; ac->ac_flags &= ~EXT4_MB_HINT_TRY_GOAL; } -- 2.46.1
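
[Editor's note] The reasoning above — the goal is only a hint, so stale or racy values are harmless and only untorn loads/stores are needed — can be sketched in userspace with C11 relaxed atomics standing in for READ_ONCE()/WRITE_ONCE(). The names are illustrative, not kernel symbols.

#include <stdatomic.h>
#include <stdio.h>

/*
 * Userspace stand-in for READ_ONCE()/WRITE_ONCE(): a relaxed atomic gives a
 * single untorn load/store and nothing more, which is all a best-effort hint
 * such as s_mb_last_group needs; a stale value only costs a slightly worse
 * starting point for the scan.
 */
static _Atomic unsigned int last_group;

static void remember_goal(unsigned int group)
{
        atomic_store_explicit(&last_group, group, memory_order_relaxed);
}

static unsigned int read_goal(void)
{
        return atomic_load_explicit(&last_group, memory_order_relaxed);
}

int main(void)
{
        remember_goal(42);
        printf("start scanning from group %u\n", read_goal());
        return 0;
}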

mainline inclusion from mainline-v6.17-rc1 commit 8f2c3b74865cac9b3bd87fb15633475b46069ca8 category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/ICN2MB Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i... -------------------------------- When allocating data blocks, if the first try (goal allocation) fails and stream allocation is on, it tries a global goal starting from the last group we used (s_mb_last_group). This helps cluster large files together to reduce free space fragmentation, and the data block contiguity also accelerates write-back to disk. However, when multiple processes allocate blocks, having just one global goal means they all fight over the same group. This drastically lowers the chances of extents merging and leads to much worse file fragmentation. To mitigate this multi-process contention, we now employ multiple global goals, with the number of goals being the minimum between the number of possible CPUs and one-quarter of the filesystem's total block group count. To ensure a consistent goal for each inode, we select the corresponding goal by taking the inode number modulo the total number of goals. Performance test data follows: Test: Running will-it-scale/fallocate2 on CPU-bound containers. Observation: Average fallocate operations per container per second. |CPU: Kunpeng 920 | P80 | P1 | |Memory: 512GB |------------------------|-------------------------| |960GB SSD (0.5GB/s)| base | patched | base | patched | |-------------------|-------|----------------|--------|----------------| |mb_optimize_scan=0 | 9636 | 19628 (+103%) | 337597 | 320885 (-4.9%) | |mb_optimize_scan=1 | 4834 | 7129 (+47.4%) | 341440 | 321275 (-5.9%) | |CPU: AMD 9654 * 2 | P96 | P1 | |Memory: 1536GB |------------------------|-------------------------| |960GB SSD (1GB/s) | base | patched | base | patched | |-------------------|-------|----------------|--------|----------------| |mb_optimize_scan=0 | 22341 | 53760 (+140%) | 219707 | 213145 (-2.9%) | |mb_optimize_scan=1 | 9177 | 12716 (+38.5%) | 215732 | 215262 (+0.2%) | Suggested-by: Jan Kara <jack@suse.cz> Signed-off-by: Baokun Li <libaokun1@huawei.com> Reviewed-by: Zhang Yi <yi.zhang@huawei.com> Link: https://patch.msgid.link/20250714130327.1830534-6-libaokun1@huawei.com Signed-off-by: Theodore Ts'o <tytso@mit.edu> Conflicts: fs/ext4/mballoc.c [Context differences because there is no commit 908177175a2a ("ext4: remove unused return value of ext4_mb_release").] 
Signed-off-by: Baokun Li <libaokun1@huawei.com> --- fs/ext4/ext4.h | 6 ++++-- fs/ext4/mballoc.c | 27 +++++++++++++++++++++++---- 2 files changed, 27 insertions(+), 6 deletions(-) diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h index d8382d116211..4b6896198101 100644 --- a/fs/ext4/ext4.h +++ b/fs/ext4/ext4.h @@ -1611,12 +1611,14 @@ struct ext4_sb_info { unsigned int s_mb_order2_reqs; unsigned int s_mb_group_prealloc; unsigned int s_max_dir_size_kb; - /* where last allocation was done - for stream allocation */ - ext4_group_t s_mb_last_group; unsigned int s_mb_prefetch; unsigned int s_mb_prefetch_limit; unsigned int s_mb_best_avail_max_trim_order; + /* where last allocation was done - for stream allocation */ + ext4_group_t *s_mb_last_groups; + unsigned int s_mb_nr_global_goals; + /* stats for buddy allocator */ atomic_t s_bal_reqs; /* number of reqs with len > 1 */ atomic_t s_bal_success; /* we found long enough chunks */ diff --git a/fs/ext4/mballoc.c b/fs/ext4/mballoc.c index d6c76fb297e3..c39f23c21a5c 100644 --- a/fs/ext4/mballoc.c +++ b/fs/ext4/mballoc.c @@ -2169,8 +2169,12 @@ static void ext4_mb_use_best_found(struct ext4_allocation_context *ac, ac->ac_buddy_page = e4b->bd_buddy_page; get_page(ac->ac_buddy_page); /* store last allocated for subsequent stream allocation */ - if (ac->ac_flags & EXT4_MB_STREAM_ALLOC) - WRITE_ONCE(sbi->s_mb_last_group, ac->ac_f_ex.fe_group); + if (ac->ac_flags & EXT4_MB_STREAM_ALLOC) { + int hash = ac->ac_inode->i_ino % sbi->s_mb_nr_global_goals; + + WRITE_ONCE(sbi->s_mb_last_groups[hash], ac->ac_f_ex.fe_group); + } + /* * As we've just preallocated more space than * user requested originally, we store allocated @@ -2843,7 +2847,9 @@ ext4_mb_regular_allocator(struct ext4_allocation_context *ac) /* if stream allocation is enabled, use global goal */ if (ac->ac_flags & EXT4_MB_STREAM_ALLOC) { - ac->ac_g_ex.fe_group = READ_ONCE(sbi->s_mb_last_group); + int hash = ac->ac_inode->i_ino % sbi->s_mb_nr_global_goals; + + ac->ac_g_ex.fe_group = READ_ONCE(sbi->s_mb_last_groups[hash]); ac->ac_g_ex.fe_start = -1; ac->ac_flags &= ~EXT4_MB_HINT_TRY_GOAL; } @@ -3716,10 +3722,19 @@ int ext4_mb_init(struct super_block *sb) sbi->s_mb_group_prealloc, EXT4_B2C(sbi, sbi->s_stripe)); } + sbi->s_mb_nr_global_goals = umin(num_possible_cpus(), + DIV_ROUND_UP(sbi->s_groups_count, 4)); + sbi->s_mb_last_groups = kcalloc(sbi->s_mb_nr_global_goals, + sizeof(ext4_group_t), GFP_KERNEL); + if (sbi->s_mb_last_groups == NULL) { + ret = -ENOMEM; + goto out; + } + sbi->s_locality_groups = alloc_percpu(struct ext4_locality_group); if (sbi->s_locality_groups == NULL) { ret = -ENOMEM; - goto out; + goto out_free_last_groups; } for_each_possible_cpu(i) { struct ext4_locality_group *lg; @@ -3744,6 +3759,9 @@ int ext4_mb_init(struct super_block *sb) out_free_locality_groups: free_percpu(sbi->s_locality_groups); sbi->s_locality_groups = NULL; +out_free_last_groups: + kfree(sbi->s_mb_last_groups); + sbi->s_mb_last_groups = NULL; out: kfree(sbi->s_mb_avg_fragment_size); kfree(sbi->s_mb_avg_fragment_size_locks); @@ -3848,6 +3866,7 @@ int ext4_mb_release(struct super_block *sb) } free_percpu(sbi->s_locality_groups); + kfree(sbi->s_mb_last_groups); return 0; } -- 2.46.1
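
[Editor's note] The slot sizing and selection rules described above are plain arithmetic. Here is a small sketch of them; nr_goal_slots and goal_slot are made-up names, not ext4 symbols.

#include <stdio.h>

/*
 * Sizing rule from the patch: one goal slot per possible CPU, but never more
 * than a quarter of the block groups (rounded up).
 */
static unsigned int nr_goal_slots(unsigned int nr_cpus, unsigned int nr_groups)
{
        unsigned int quarter = (nr_groups + 3) / 4;     /* DIV_ROUND_UP(nr_groups, 4) */

        return nr_cpus < quarter ? nr_cpus : quarter;   /* umin() */
}

/*
 * Each inode hashes to the same slot every time, so one file keeps a stable
 * stream goal while different files spread across the slots.
 */
static unsigned int goal_slot(unsigned long ino, unsigned int nr_slots)
{
        return ino % nr_slots;
}

int main(void)
{
        unsigned int slots = nr_goal_slots(96, 7680);   /* e.g. 96 CPUs, 7680 groups */

        printf("%u slots, inode 12345 -> slot %u\n", slots, goal_slot(12345, slots));
        return 0;
}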

mainline inclusion from mainline-v6.17-rc1 commit 4d18a0b98259c2fa62f04ce5f94a7ec6e840f220 category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/ICN2MB Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i... -------------------------------- Since nobody has used these EXT4_MB_HINT flags for ages, let's remove them. Signed-off-by: Baokun Li <libaokun1@huawei.com> Reviewed-by: Ojaswin Mujoo <ojaswin@linux.ibm.com> Reviewed-by: Jan Kara <jack@suse.cz> Reviewed-by: Zhang Yi <yi.zhang@huawei.com> Link: https://patch.msgid.link/20250714130327.1830534-7-libaokun1@huawei.com Signed-off-by: Theodore Ts'o <tytso@mit.edu> Signed-off-by: Baokun Li <libaokun1@huawei.com> --- fs/ext4/ext4.h | 6 ------ include/trace/events/ext4.h | 3 --- 2 files changed, 9 deletions(-) diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h index 4b6896198101..c752147afc4e 100644 --- a/fs/ext4/ext4.h +++ b/fs/ext4/ext4.h @@ -196,14 +196,8 @@ enum criteria { /* prefer goal again. length */ #define EXT4_MB_HINT_MERGE 0x0001 -/* blocks already reserved */ -#define EXT4_MB_HINT_RESERVED 0x0002 -/* metadata is being allocated */ -#define EXT4_MB_HINT_METADATA 0x0004 /* first blocks in the file */ #define EXT4_MB_HINT_FIRST 0x0008 -/* search for the best chunk */ -#define EXT4_MB_HINT_BEST 0x0010 /* data is being allocated */ #define EXT4_MB_HINT_DATA 0x0020 /* don't preallocate (for tails) */ diff --git a/include/trace/events/ext4.h b/include/trace/events/ext4.h index d500568daeb1..2c9fce13d906 100644 --- a/include/trace/events/ext4.h +++ b/include/trace/events/ext4.h @@ -23,10 +23,7 @@ struct partial_cluster; #define show_mballoc_flags(flags) __print_flags(flags, "|", \ { EXT4_MB_HINT_MERGE, "HINT_MERGE" }, \ - { EXT4_MB_HINT_RESERVED, "HINT_RESV" }, \ - { EXT4_MB_HINT_METADATA, "HINT_MDATA" }, \ { EXT4_MB_HINT_FIRST, "HINT_FIRST" }, \ - { EXT4_MB_HINT_BEST, "HINT_BEST" }, \ { EXT4_MB_HINT_DATA, "HINT_DATA" }, \ { EXT4_MB_HINT_NOPREALLOC, "HINT_NOPREALLOC" }, \ { EXT4_MB_HINT_GROUP_ALLOC, "HINT_GRP_ALLOC" }, \ -- 2.46.1

mainline inclusion from mainline-v6.17-rc1 commit 9a0ed1698191a143588a9bfb46ed76a4ee094931 category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/ICN2MB Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i... -------------------------------- Remove the superfluous "find_". Signed-off-by: Baokun Li <libaokun1@huawei.com> Reviewed-by: Ojaswin Mujoo <ojaswin@linux.ibm.com> Reviewed-by: Jan Kara <jack@suse.cz> Reviewed-by: Zhang Yi <yi.zhang@huawei.com> Link: https://patch.msgid.link/20250714130327.1830534-8-libaokun1@huawei.com Signed-off-by: Theodore Ts'o <tytso@mit.edu> Signed-off-by: Baokun Li <libaokun1@huawei.com> --- fs/ext4/ext4.h | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h index c752147afc4e..79d0d55e3a61 100644 --- a/fs/ext4/ext4.h +++ b/fs/ext4/ext4.h @@ -168,7 +168,7 @@ enum criteria { /* * Reads each block group sequentially, performing disk IO if - * necessary, to find find_suitable block group. Tries to + * necessary, to find suitable block group. Tries to * allocate goal length but might trim the request if nothing * is found after enough tries. */ -- 2.46.1

From: Jinke Han <hanjinke.666@bytedance.com> mainline inclusion from mainline-v6.7-rc1 commit ce774e5365e46be73ed055302c6de123a03394ea category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/ICN2MB Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i... -------------------------------- When releasing space in jbd, we traverse s_freed_data_list to get the free range belonging to the current commit transaction. In extreme cases, the time spent may not be small, and we have observed cases exceeding 10ms. This patch makes running and commit transactions manage their own free_data_list respectively, eliminating unnecessary traversal. And in the callback phase of the commit transaction, no one will touch it except the jbd thread itself, so s_md_lock is no longer needed. Signed-off-by: Jinke Han <hanjinke.666@bytedance.com> Reviewed-by: Zhang Yi <yi.zhang@huawei.com> Link: https://lore.kernel.org/r/20230612124017.14115-1-hanjinke.666@bytedance.com Signed-off-by: Theodore Ts'o <tytso@mit.edu> Signed-off-by: Baokun Li <libaokun1@huawei.com> --- fs/ext4/ext4.h | 2 +- fs/ext4/mballoc.c | 18 +++++------------- 2 files changed, 6 insertions(+), 14 deletions(-) diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h index 79d0d55e3a61..ee57bbc9bca5 100644 --- a/fs/ext4/ext4.h +++ b/fs/ext4/ext4.h @@ -1585,7 +1585,7 @@ struct ext4_sb_info { unsigned int *s_mb_maxs; unsigned int s_group_info_size; unsigned int s_mb_free_pending; - struct list_head s_freed_data_list; /* List of blocks to be freed + struct list_head s_freed_data_list[2]; /* List of blocks to be freed after commit completed */ struct list_head s_discard_list; struct work_struct s_discard_work; diff --git a/fs/ext4/mballoc.c b/fs/ext4/mballoc.c index c39f23c21a5c..82b3d539be7f 100644 --- a/fs/ext4/mballoc.c +++ b/fs/ext4/mballoc.c @@ -3683,7 +3683,8 @@ int ext4_mb_init(struct super_block *sb) spin_lock_init(&sbi->s_md_lock); sbi->s_mb_free_pending = 0; - INIT_LIST_HEAD(&sbi->s_freed_data_list); + INIT_LIST_HEAD(&sbi->s_freed_data_list[0]); + INIT_LIST_HEAD(&sbi->s_freed_data_list[1]); INIT_LIST_HEAD(&sbi->s_discard_list); INIT_WORK(&sbi->s_discard_work, ext4_discard_work); atomic_set(&sbi->s_retry_alloc_pending, 0); @@ -3945,19 +3946,10 @@ void ext4_process_freed_data(struct super_block *sb, tid_t commit_tid) struct ext4_sb_info *sbi = EXT4_SB(sb); struct ext4_free_data *entry, *tmp; LIST_HEAD(freed_data_list); - struct list_head *cut_pos = NULL; + struct list_head *s_freed_head = &sbi->s_freed_data_list[commit_tid & 1]; bool wake; - spin_lock(&sbi->s_md_lock); - list_for_each_entry(entry, &sbi->s_freed_data_list, efd_list) { - if (entry->efd_tid != commit_tid) - break; - cut_pos = &entry->efd_list; - } - if (cut_pos) - list_cut_position(&freed_data_list, &sbi->s_freed_data_list, - cut_pos); - spin_unlock(&sbi->s_md_lock); + list_replace_init(s_freed_head, &freed_data_list); list_for_each_entry(entry, &freed_data_list, efd_list) ext4_free_data_in_buddy(sb, entry); @@ -6490,7 +6482,7 @@ ext4_mb_free_metadata(handle_t *handle, struct ext4_buddy *e4b, } spin_lock(&sbi->s_md_lock); - list_add_tail(&new_entry->efd_list, &sbi->s_freed_data_list); + list_add_tail(&new_entry->efd_list, &sbi->s_freed_data_list[new_entry->efd_tid & 1]); sbi->s_mb_free_pending += clusters; spin_unlock(&sbi->s_md_lock); } -- 2.46.1
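
[Editor's note] The parity trick works because at most two transactions — the running one and the committing one — can own freed extents at any time. A toy userspace sketch of the idea follows; record_free and process_commit are invented names, and the kernel version uses list_head with list_replace_init().

#include <stdio.h>
#include <stdlib.h>

/*
 * Freed ranges of transaction N are kept on freed_list[N & 1], so committing
 * transaction N can detach its whole list in one step while new frees keep
 * landing on the other list.
 */
struct freed_range {
        unsigned long tid, start, count;
        struct freed_range *next;
};

static struct freed_range *freed_list[2];

static void record_free(unsigned long tid, unsigned long start, unsigned long count)
{
        struct freed_range *fr = malloc(sizeof(*fr));

        fr->tid = tid;
        fr->start = start;
        fr->count = count;
        fr->next = freed_list[tid & 1];         /* done under s_md_lock in ext4 */
        freed_list[tid & 1] = fr;
}

static void process_commit(unsigned long commit_tid)
{
        struct freed_range *fr = freed_list[commit_tid & 1];

        freed_list[commit_tid & 1] = NULL;      /* detach the whole list at once */
        while (fr) {
                struct freed_range *next = fr->next;

                printf("tid %lu: release %lu+%lu\n", fr->tid, fr->start, fr->count);
                free(fr);
                fr = next;
        }
}

int main(void)
{
        record_free(10, 100, 8);
        record_free(11, 200, 4);        /* the next (running) transaction */
        process_commit(10);             /* only ever touches freed_list[0] */
        return 0;
}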

mainline inclusion from mainline-v6.17-rc1 commit 0a2326f6ae60e99f5e6e9ca900a19b5c14304a51 category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/ICN2MB Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i... -------------------------------- Previously, s_md_lock was used to protect s_mb_free_pending during modifications, while smp_mb() ensured fresh reads, so s_md_lock just guarantees the atomicity of s_mb_free_pending. Thus we optimized it by converting s_mb_free_pending into an atomic variable, thereby eliminating s_md_lock and minimizing lock contention. This also prepares for future lockless merging of free extents. Following this modification, s_md_lock is exclusively responsible for managing insertions and deletions within s_freed_data_list, along with operations involving list_splice. Performance test data follows: Test: Running will-it-scale/fallocate2 on CPU-bound containers. Observation: Average fallocate operations per container per second. |CPU: Kunpeng 920 | P80 | P1 | |Memory: 512GB |------------------------|-------------------------| |960GB SSD (0.5GB/s)| base | patched | base | patched | |-------------------|-------|----------------|--------|----------------| |mb_optimize_scan=0 | 19628 | 20043 (+2.1%) | 320885 | 314331 (-2.0%) | |mb_optimize_scan=1 | 7129 | 7290 (+2.2%) | 321275 | 324226 (+0.9%) | |CPU: AMD 9654 * 2 | P96 | P1 | |Memory: 1536GB |------------------------|-------------------------| |960GB SSD (1GB/s) | base | patched | base | patched | |-------------------|-------|----------------|--------|----------------| |mb_optimize_scan=0 | 53760 | 54999 (+2.3%) | 213145 | 214380 (+0.5%) | |mb_optimize_scan=1 | 12716 | 13497 (+6.1%) | 215262 | 216276 (+0.4%) | Signed-off-by: Baokun Li <libaokun1@huawei.com> Reviewed-by: Jan Kara <jack@suse.cz> Reviewed-by: Zhang Yi <yi.zhang@huawei.com> Link: https://patch.msgid.link/20250714130327.1830534-9-libaokun1@huawei.com Signed-off-by: Theodore Ts'o <tytso@mit.edu> Signed-off-by: Baokun Li <libaokun1@huawei.com> --- fs/ext4/balloc.c | 2 +- fs/ext4/ext4.h | 2 +- fs/ext4/mballoc.c | 9 +++------ 3 files changed, 5 insertions(+), 8 deletions(-) diff --git a/fs/ext4/balloc.c b/fs/ext4/balloc.c index 396474e9e2bf..7e70176e0eae 100644 --- a/fs/ext4/balloc.c +++ b/fs/ext4/balloc.c @@ -695,7 +695,7 @@ int ext4_should_retry_alloc(struct super_block *sb, int *retries) * possible we just missed a transaction commit that did so */ smp_mb(); - if (sbi->s_mb_free_pending == 0) { + if (atomic_read(&sbi->s_mb_free_pending) == 0) { if (test_opt(sb, DISCARD)) { atomic_inc(&sbi->s_retry_alloc_pending); flush_work(&sbi->s_discard_work); diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h index ee57bbc9bca5..07e240231fe0 100644 --- a/fs/ext4/ext4.h +++ b/fs/ext4/ext4.h @@ -1584,7 +1584,7 @@ struct ext4_sb_info { unsigned short *s_mb_offsets; unsigned int *s_mb_maxs; unsigned int s_group_info_size; - unsigned int s_mb_free_pending; + atomic_t s_mb_free_pending; struct list_head s_freed_data_list[2]; /* List of blocks to be freed after commit completed */ struct list_head s_discard_list; diff --git a/fs/ext4/mballoc.c b/fs/ext4/mballoc.c index 82b3d539be7f..ab05e047c553 100644 --- a/fs/ext4/mballoc.c +++ b/fs/ext4/mballoc.c @@ -3682,7 +3682,7 @@ int ext4_mb_init(struct super_block *sb) } spin_lock_init(&sbi->s_md_lock); - sbi->s_mb_free_pending = 0; + atomic_set(&sbi->s_mb_free_pending, 0); INIT_LIST_HEAD(&sbi->s_freed_data_list[0]); INIT_LIST_HEAD(&sbi->s_freed_data_list[1]); 
INIT_LIST_HEAD(&sbi->s_discard_list); @@ -3906,10 +3906,7 @@ static void ext4_free_data_in_buddy(struct super_block *sb, /* we expect to find existing buddy because it's pinned */ BUG_ON(err != 0); - spin_lock(&EXT4_SB(sb)->s_md_lock); - EXT4_SB(sb)->s_mb_free_pending -= entry->efd_count; - spin_unlock(&EXT4_SB(sb)->s_md_lock); - + atomic_sub(entry->efd_count, &EXT4_SB(sb)->s_mb_free_pending); db = e4b.bd_info; /* there are blocks to put in buddy to make them really free */ count += entry->efd_count; @@ -6483,7 +6480,7 @@ ext4_mb_free_metadata(handle_t *handle, struct ext4_buddy *e4b, spin_lock(&sbi->s_md_lock); list_add_tail(&new_entry->efd_list, &sbi->s_freed_data_list[new_entry->efd_tid & 1]); - sbi->s_mb_free_pending += clusters; + atomic_add(clusters, &sbi->s_mb_free_pending); spin_unlock(&sbi->s_md_lock); } -- 2.46.1
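
[Editor's note] A minimal userspace analogue of this conversion, using C11 atomics in place of atomic_t (sketch only; the function names are made up): the counter needs nothing more than atomic add, sub and load, so the spinlock around "+=" and "-=" can go away.

#include <stdatomic.h>
#include <stdio.h>

static atomic_uint free_pending;

static void on_extent_freed(unsigned int clusters)
{
        atomic_fetch_add_explicit(&free_pending, clusters, memory_order_relaxed);
}

static void on_extent_merged_into_buddy(unsigned int clusters)
{
        atomic_fetch_sub_explicit(&free_pending, clusters, memory_order_relaxed);
}

int main(void)
{
        on_extent_freed(32);
        on_extent_merged_into_buddy(8);
        printf("pending clusters: %u\n",
               atomic_load_explicit(&free_pending, memory_order_relaxed));
        return 0;
}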

mainline inclusion from mainline-v6.17-rc1 commit e7f101a8088770e8f3bb089f13652b9b0fd22b06 category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/ICN2MB Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i... -------------------------------- Attempt to merge ext4_free_data with already inserted free extents prior to adding new ones. This strategy drastically cuts down the number of times locks are held. For example, if prev, new, and next extents are all mergeable, the existing code (before this patch) requires acquiring the s_md_lock three times: prev merge into new and free prev // hold lock next merge into new and free next // hold lock insert new // hold lock After the patch, it only needs to be acquired once: new merge into next and free new // no lock next merge into prev and free next // hold lock Performance test data follows: Test: Running will-it-scale/fallocate2 on CPU-bound containers. Observation: Average fallocate operations per container per second. |CPU: Kunpeng 920 | P80 | P1 | |Memory: 512GB |------------------------|-------------------------| |960GB SSD (0.5GB/s)| base | patched | base | patched | |-------------------|-------|----------------|--------|----------------| |mb_optimize_scan=0 | 20043 | 20097 (+0.2%) | 314331 | 316141 (+0.5%) | |mb_optimize_scan=1 | 7290 | 13318 (+87.4%) | 324226 | 325273 (+0.3%) | |CPU: AMD 9654 * 2 | P96 | P1 | |Memory: 1536GB |------------------------|-------------------------| |960GB SSD (1GB/s) | base | patched | base | patched | |-------------------|-------|----------------|--------|----------------| |mb_optimize_scan=0 | 54999 | 53603 (-2.5%) | 214380 | 214243 (-0.06%)| |mb_optimize_scan=1 | 13497 | 20887 (+54.6%) | 216276 | 213632 (-1.2%) | Signed-off-by: Baokun Li <libaokun1@huawei.com> Reviewed-by: Jan Kara <jack@suse.cz> Reviewed-by: Zhang Yi <yi.zhang@huawei.com> Link: https://patch.msgid.link/20250714130327.1830534-10-libaokun1@huawei.com Signed-off-by: Theodore Ts'o <tytso@mit.edu> Signed-off-by: Baokun Li <libaokun1@huawei.com> --- fs/ext4/mballoc.c | 113 +++++++++++++++++++++++++++++++--------------- 1 file changed, 76 insertions(+), 37 deletions(-) diff --git a/fs/ext4/mballoc.c b/fs/ext4/mballoc.c index ab05e047c553..f516c7da7ea8 100644 --- a/fs/ext4/mballoc.c +++ b/fs/ext4/mballoc.c @@ -6389,28 +6389,63 @@ ext4_fsblk_t ext4_mb_new_blocks(handle_t *handle, * are contiguous, AND the extents were freed by the same transaction, * AND the blocks are associated with the same group. 
*/ -static void ext4_try_merge_freed_extent(struct ext4_sb_info *sbi, - struct ext4_free_data *entry, - struct ext4_free_data *new_entry, - struct rb_root *entry_rb_root) +static inline bool +ext4_freed_extents_can_be_merged(struct ext4_free_data *entry1, + struct ext4_free_data *entry2) { - if ((entry->efd_tid != new_entry->efd_tid) || - (entry->efd_group != new_entry->efd_group)) - return; - if (entry->efd_start_cluster + entry->efd_count == - new_entry->efd_start_cluster) { - new_entry->efd_start_cluster = entry->efd_start_cluster; - new_entry->efd_count += entry->efd_count; - } else if (new_entry->efd_start_cluster + new_entry->efd_count == - entry->efd_start_cluster) { - new_entry->efd_count += entry->efd_count; - } else - return; + if (entry1->efd_tid != entry2->efd_tid) + return false; + if (entry1->efd_start_cluster + entry1->efd_count != + entry2->efd_start_cluster) + return false; + if (WARN_ON_ONCE(entry1->efd_group != entry2->efd_group)) + return false; + return true; +} + +static inline void +ext4_merge_freed_extents(struct ext4_sb_info *sbi, struct rb_root *root, + struct ext4_free_data *entry1, + struct ext4_free_data *entry2) +{ + entry1->efd_count += entry2->efd_count; spin_lock(&sbi->s_md_lock); - list_del(&entry->efd_list); + list_del(&entry2->efd_list); spin_unlock(&sbi->s_md_lock); - rb_erase(&entry->efd_node, entry_rb_root); - kmem_cache_free(ext4_free_data_cachep, entry); + rb_erase(&entry2->efd_node, root); + kmem_cache_free(ext4_free_data_cachep, entry2); +} + +static inline void +ext4_try_merge_freed_extent_prev(struct ext4_sb_info *sbi, struct rb_root *root, + struct ext4_free_data *entry) +{ + struct ext4_free_data *prev; + struct rb_node *node; + + node = rb_prev(&entry->efd_node); + if (!node) + return; + + prev = rb_entry(node, struct ext4_free_data, efd_node); + if (ext4_freed_extents_can_be_merged(prev, entry)) + ext4_merge_freed_extents(sbi, root, prev, entry); +} + +static inline void +ext4_try_merge_freed_extent_next(struct ext4_sb_info *sbi, struct rb_root *root, + struct ext4_free_data *entry) +{ + struct ext4_free_data *next; + struct rb_node *node; + + node = rb_next(&entry->efd_node); + if (!node) + return; + + next = rb_entry(node, struct ext4_free_data, efd_node); + if (ext4_freed_extents_can_be_merged(entry, next)) + ext4_merge_freed_extents(sbi, root, entry, next); } static noinline_for_stack void @@ -6420,11 +6455,12 @@ ext4_mb_free_metadata(handle_t *handle, struct ext4_buddy *e4b, ext4_group_t group = e4b->bd_group; ext4_grpblk_t cluster; ext4_grpblk_t clusters = new_entry->efd_count; - struct ext4_free_data *entry; + struct ext4_free_data *entry = NULL; struct ext4_group_info *db = e4b->bd_info; struct super_block *sb = e4b->bd_sb; struct ext4_sb_info *sbi = EXT4_SB(sb); - struct rb_node **n = &db->bb_free_root.rb_node, *node; + struct rb_root *root = &db->bb_free_root; + struct rb_node **n = &root->rb_node; struct rb_node *parent = NULL, *new_node; BUG_ON(!ext4_handle_valid(handle)); @@ -6460,27 +6496,30 @@ ext4_mb_free_metadata(handle_t *handle, struct ext4_buddy *e4b, } } - rb_link_node(new_node, parent, n); - rb_insert_color(new_node, &db->bb_free_root); - - /* Now try to see the extent can be merged to left and right */ - node = rb_prev(new_node); - if (node) { - entry = rb_entry(node, struct ext4_free_data, efd_node); - ext4_try_merge_freed_extent(sbi, entry, new_entry, - &(db->bb_free_root)); + atomic_add(clusters, &sbi->s_mb_free_pending); + if (!entry) + goto insert; + + /* Now try to see the extent can be merged to prev and next */ 
+ if (ext4_freed_extents_can_be_merged(new_entry, entry)) { + entry->efd_start_cluster = cluster; + entry->efd_count += new_entry->efd_count; + kmem_cache_free(ext4_free_data_cachep, new_entry); + ext4_try_merge_freed_extent_prev(sbi, root, entry); + return; } - - node = rb_next(new_node); - if (node) { - entry = rb_entry(node, struct ext4_free_data, efd_node); - ext4_try_merge_freed_extent(sbi, entry, new_entry, - &(db->bb_free_root)); + if (ext4_freed_extents_can_be_merged(entry, new_entry)) { + entry->efd_count += new_entry->efd_count; + kmem_cache_free(ext4_free_data_cachep, new_entry); + ext4_try_merge_freed_extent_next(sbi, root, entry); + return; } +insert: + rb_link_node(new_node, parent, n); + rb_insert_color(new_node, root); spin_lock(&sbi->s_md_lock); list_add_tail(&new_entry->efd_list, &sbi->s_freed_data_list[new_entry->efd_tid & 1]); - atomic_add(clusters, &sbi->s_mb_free_pending); spin_unlock(&sbi->s_md_lock); } -- 2.46.1
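
[Editor's note] The merge rule itself is small enough to show in isolation. Below is a simplified, self-contained sketch of the predicate and of the "absorb new into next, then fold next into prev" order described above; it is not the ext4 code, which works on an rbtree and frees the absorbed ext4_free_data.

#include <stdbool.h>
#include <stdio.h>

/*
 * Two freed extents can be merged only if they were freed by the same
 * transaction, belong to the same group, and the first ends exactly where
 * the second starts.
 */
struct freed_extent {
        unsigned long tid;
        unsigned int group;
        unsigned int start, count;
};

static bool can_merge(const struct freed_extent *a, const struct freed_extent *b)
{
        return a->tid == b->tid && a->group == b->group &&
               a->start + a->count == b->start;
}

int main(void)
{
        struct freed_extent prev = { .tid = 7, .group = 3, .start = 0,  .count = 16 };
        struct freed_extent new  = { .tid = 7, .group = 3, .start = 16, .count = 8  };
        struct freed_extent next = { .tid = 7, .group = 3, .start = 24, .count = 32 };

        /*
         * Absorb "new" into "next" first, then fold "next" into "prev": only
         * the final removal from the list needs the lock, instead of taking
         * it for each of the three steps.
         */
        if (can_merge(&new, &next)) {
                next.start = new.start;
                next.count += new.count;
        }
        if (can_merge(&prev, &next))
                prev.count += next.count;

        printf("merged extent: group %u, start %u, count %u\n",
               prev.group, prev.start, prev.count);
        return 0;
}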

mainline inclusion from mainline-v6.17-rc1 commit 1c320d8e92925bb7615f83a7b6e3f402a5c2ca63 category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/ICN2MB Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i... -------------------------------- Groups with no free blocks shouldn't be in any average fragment size list. However, when all blocks in a group are allocated(i.e., bb_fragments or bb_free is 0), we currently skip updating the average fragment size, which means the group isn't removed from its previous s_mb_avg_fragment_size[old] list. This created "zombie" groups that were always skipped during traversal as they couldn't satisfy any block allocation requests, negatively impacting traversal efficiency. Therefore, when a group becomes completely full, bb_avg_fragment_size_order is now set to -1. If the old order was not -1, a removal operation is performed; if the new order is not -1, an insertion is performed. Fixes: 196e402adf2e ("ext4: improve cr 0 / cr 1 group scanning") CC: stable@vger.kernel.org Signed-off-by: Baokun Li <libaokun1@huawei.com> Reviewed-by: Jan Kara <jack@suse.cz> Reviewed-by: Zhang Yi <yi.zhang@huawei.com> Link: https://patch.msgid.link/20250714130327.1830534-11-libaokun1@huawei.com Signed-off-by: Theodore Ts'o <tytso@mit.edu> Signed-off-by: Baokun Li <libaokun1@huawei.com> --- fs/ext4/mballoc.c | 36 ++++++++++++++++++------------------ 1 file changed, 18 insertions(+), 18 deletions(-) diff --git a/fs/ext4/mballoc.c b/fs/ext4/mballoc.c index f516c7da7ea8..1075a658c6c5 100644 --- a/fs/ext4/mballoc.c +++ b/fs/ext4/mballoc.c @@ -844,30 +844,30 @@ static void mb_update_avg_fragment_size(struct super_block *sb, struct ext4_group_info *grp) { struct ext4_sb_info *sbi = EXT4_SB(sb); - int new_order; + int new, old; - if (!test_opt2(sb, MB_OPTIMIZE_SCAN) || grp->bb_fragments == 0) + if (!test_opt2(sb, MB_OPTIMIZE_SCAN)) return; - new_order = mb_avg_fragment_size_order(sb, - grp->bb_free / grp->bb_fragments); - if (new_order == grp->bb_avg_fragment_size_order) + old = grp->bb_avg_fragment_size_order; + new = grp->bb_fragments == 0 ? -1 : + mb_avg_fragment_size_order(sb, grp->bb_free / grp->bb_fragments); + if (new == old) return; - if (grp->bb_avg_fragment_size_order != -1) { - write_lock(&sbi->s_mb_avg_fragment_size_locks[ - grp->bb_avg_fragment_size_order]); + if (old >= 0) { + write_lock(&sbi->s_mb_avg_fragment_size_locks[old]); list_del(&grp->bb_avg_fragment_size_node); - write_unlock(&sbi->s_mb_avg_fragment_size_locks[ - grp->bb_avg_fragment_size_order]); - } - grp->bb_avg_fragment_size_order = new_order; - write_lock(&sbi->s_mb_avg_fragment_size_locks[ - grp->bb_avg_fragment_size_order]); - list_add_tail(&grp->bb_avg_fragment_size_node, - &sbi->s_mb_avg_fragment_size[grp->bb_avg_fragment_size_order]); - write_unlock(&sbi->s_mb_avg_fragment_size_locks[ - grp->bb_avg_fragment_size_order]); + write_unlock(&sbi->s_mb_avg_fragment_size_locks[old]); + } + + grp->bb_avg_fragment_size_order = new; + if (new >= 0) { + write_lock(&sbi->s_mb_avg_fragment_size_locks[new]); + list_add_tail(&grp->bb_avg_fragment_size_node, + &sbi->s_mb_avg_fragment_size[new]); + write_unlock(&sbi->s_mb_avg_fragment_size_locks[new]); + } } /* -- 2.46.1
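
[Editor's note] The fix boils down to a move discipline on indexed lists: compute the new order (or -1 when the group is full), leave the old list if the group was on one, and join a new list only when the new order is valid. A deliberately crude sketch of that discipline, with an array standing in for the lists and invented names:

#include <stdio.h>

#define FULL (-1)       /* group has no free space: it belongs on no list */

static int group_bucket[32];

static void update_avg_frag_order(int group, int new_order)
{
        int old_order = group_bucket[group];

        if (new_order == old_order)
                return;
        if (old_order != FULL)
                printf("group %d: remove from list[%d]\n", group, old_order);
        if (new_order != FULL)
                printf("group %d: add to list[%d]\n", group, new_order);
        group_bucket[group] = new_order;
}

int main(void)
{
        group_bucket[5] = 3;            /* say the group starts on list 3 */
        update_avg_frag_order(5, FULL); /* all blocks allocated: leave list 3 */
        update_avg_frag_order(5, 2);    /* blocks freed later: join list 2 */
        return 0;
}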

mainline inclusion from mainline-v6.17-rc1 commit 7d345aa1fac4c2ec9584fbd6f389f2c2368671d5 category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/ICN2MB Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i... -------------------------------- The grp->bb_largest_free_order is updated regardless of whether mb_optimize_scan is enabled. This can lead to inconsistencies between grp->bb_largest_free_order and the actual s_mb_largest_free_orders list index when mb_optimize_scan is repeatedly enabled and disabled via remount. For example, if mb_optimize_scan is initially enabled, largest free order is 3, and the group is in s_mb_largest_free_orders[3]. Then, mb_optimize_scan is disabled via remount, block allocations occur, updating largest free order to 2. Finally, mb_optimize_scan is re-enabled via remount, more block allocations update largest free order to 1. At this point, the group would be removed from s_mb_largest_free_orders[3] under the protection of s_mb_largest_free_orders_locks[2]. This lock mismatch can lead to list corruption. To fix this, whenever grp->bb_largest_free_order changes, we now always attempt to remove the group from its old order list. However, we only insert the group into the new order list if `mb_optimize_scan` is enabled. This approach helps prevent lock inconsistencies and ensures the data in the order lists remains reliable. Fixes: 196e402adf2e ("ext4: improve cr 0 / cr 1 group scanning") CC: stable@vger.kernel.org Suggested-by: Jan Kara <jack@suse.cz> Signed-off-by: Baokun Li <libaokun1@huawei.com> Reviewed-by: Zhang Yi <yi.zhang@huawei.com> Link: https://patch.msgid.link/20250714130327.1830534-12-libaokun1@huawei.com Signed-off-by: Theodore Ts'o <tytso@mit.edu> Signed-off-by: Baokun Li <libaokun1@huawei.com> --- fs/ext4/mballoc.c | 33 ++++++++++++++------------------- 1 file changed, 14 insertions(+), 19 deletions(-) diff --git a/fs/ext4/mballoc.c b/fs/ext4/mballoc.c index 1075a658c6c5..dcad2d88df6b 100644 --- a/fs/ext4/mballoc.c +++ b/fs/ext4/mballoc.c @@ -1155,33 +1155,28 @@ static void mb_set_largest_free_order(struct super_block *sb, struct ext4_group_info *grp) { struct ext4_sb_info *sbi = EXT4_SB(sb); - int i; + int new, old = grp->bb_largest_free_order; - for (i = MB_NUM_ORDERS(sb) - 1; i >= 0; i--) - if (grp->bb_counters[i] > 0) + for (new = MB_NUM_ORDERS(sb) - 1; new >= 0; new--) + if (grp->bb_counters[new] > 0) break; + /* No need to move between order lists? 
*/ - if (!test_opt2(sb, MB_OPTIMIZE_SCAN) || - i == grp->bb_largest_free_order) { - grp->bb_largest_free_order = i; + if (new == old) return; - } - if (grp->bb_largest_free_order >= 0) { - write_lock(&sbi->s_mb_largest_free_orders_locks[ - grp->bb_largest_free_order]); + if (old >= 0 && !list_empty(&grp->bb_largest_free_order_node)) { + write_lock(&sbi->s_mb_largest_free_orders_locks[old]); list_del_init(&grp->bb_largest_free_order_node); - write_unlock(&sbi->s_mb_largest_free_orders_locks[ - grp->bb_largest_free_order]); + write_unlock(&sbi->s_mb_largest_free_orders_locks[old]); } - grp->bb_largest_free_order = i; - if (grp->bb_largest_free_order >= 0 && grp->bb_free) { - write_lock(&sbi->s_mb_largest_free_orders_locks[ - grp->bb_largest_free_order]); + + grp->bb_largest_free_order = new; + if (test_opt2(sb, MB_OPTIMIZE_SCAN) && new >= 0 && grp->bb_free) { + write_lock(&sbi->s_mb_largest_free_orders_locks[new]); list_add_tail(&grp->bb_largest_free_order_node, - &sbi->s_mb_largest_free_orders[grp->bb_largest_free_order]); - write_unlock(&sbi->s_mb_largest_free_orders_locks[ - grp->bb_largest_free_order]); + &sbi->s_mb_largest_free_orders[new]); + write_unlock(&sbi->s_mb_largest_free_orders_locks[new]); } } -- 2.46.1

From: Ojaswin Mujoo <ojaswin@linux.ibm.com> mainline inclusion from mainline-v6.8-rc1 commit 1f6bc02f18489b9c9ea39b068d0695fb0e4567e9 category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/ICN2MB Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i... -------------------------------- Currently in case the goal length is a multiple of stripe size we use ext4_mb_scan_aligned() to find the stripe size aligned physical blocks. In case we are not able to find any, we again go back to calling ext4_mb_choose_next_group() to search for a different suitable block group. However, since the linear search always begins from the start, most of the times we end up with the same BG and the cycle continues. With large fliesystems, the CPU can be stuck in this loop for hours which can slow down the whole system. Hence, until we figure out a better way to continue the search (rather than starting from beginning) in ext4_mb_choose_next_group(), lets just fallback to ext4_mb_complex_scan_group() in case aligned scan fails, as it is much more likely to find the needed blocks. Signed-off-by: Ojaswin Mujoo <ojaswin@linux.ibm.com> Reviewed-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/ee033f6dfa0a7f2934437008a909c3788233950f.170245501... Signed-off-by: Theodore Ts'o <tytso@mit.edu> Signed-off-by: Baokun Li <libaokun1@huawei.com> --- fs/ext4/mballoc.c | 21 +++++++++++++-------- 1 file changed, 13 insertions(+), 8 deletions(-) diff --git a/fs/ext4/mballoc.c b/fs/ext4/mballoc.c index dcad2d88df6b..19c7a3cfae49 100644 --- a/fs/ext4/mballoc.c +++ b/fs/ext4/mballoc.c @@ -2935,14 +2935,19 @@ ext4_mb_regular_allocator(struct ext4_allocation_context *ac) ac->ac_groups_scanned++; if (cr == CR_POWER2_ALIGNED) ext4_mb_simple_scan_group(ac, &e4b); - else if ((cr == CR_GOAL_LEN_FAST || - cr == CR_BEST_AVAIL_LEN) && - sbi->s_stripe && - !(ac->ac_g_ex.fe_len % - EXT4_B2C(sbi, sbi->s_stripe))) - ext4_mb_scan_aligned(ac, &e4b); - else - ext4_mb_complex_scan_group(ac, &e4b); + else { + bool is_stripe_aligned = sbi->s_stripe && + !(ac->ac_g_ex.fe_len % + EXT4_B2C(sbi, sbi->s_stripe)); + + if ((cr == CR_GOAL_LEN_FAST || + cr == CR_BEST_AVAIL_LEN) && + is_stripe_aligned) + ext4_mb_scan_aligned(ac, &e4b); + + if (ac->ac_status == AC_STATUS_CONTINUE) + ext4_mb_complex_scan_group(ac, &e4b); + } ext4_unlock_group(sb, group); ext4_mb_unload_buddy(&e4b); -- 2.46.1
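
[Editor's note] A rough sketch of the new control flow, with hypothetical names (the real code works on an ext4_buddy and AC_STATUS_CONTINUE): try the stripe-aligned fast path first, and if it finds nothing, fall through to the exhaustive scan of the same group instead of returning to the group chooser.

#include <stdbool.h>
#include <stdio.h>

enum scan_status { SCAN_CONTINUE, SCAN_FOUND };

/*
 * Stand-ins for ext4_mb_scan_aligned()/ext4_mb_complex_scan_group(): the
 * aligned pass may find nothing even though the group has enough free space.
 */
static enum scan_status aligned_scan(int group)
{
        (void)group;
        return SCAN_CONTINUE;
}

static enum scan_status complex_scan(int group)
{
        (void)group;
        return SCAN_FOUND;
}

static enum scan_status scan_group(int group, bool stripe_aligned_request)
{
        enum scan_status status = SCAN_CONTINUE;

        if (stripe_aligned_request)
                status = aligned_scan(group);
        /*
         * The change above: instead of leaving the group and most likely
         * being sent straight back to it, fall back to the slower but more
         * thorough scan of the same group.
         */
        if (status == SCAN_CONTINUE)
                status = complex_scan(group);
        return status;
}

int main(void)
{
        printf("%s\n", scan_group(0, true) == SCAN_FOUND ? "found" : "continue");
        return 0;
}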

From: Ojaswin Mujoo <ojaswin@linux.ibm.com> mainline inclusion from mainline-v6.12-rc1 commit ff2beee206d23f49d022650122f81285849033e4 category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/ICN2MB Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i... -------------------------------- Although we have checks to make sure s_stripe is a multiple of cluster size, in case we accidentally end up with a scenario where this is not the case, use EXT4_NUM_B2C() so that we don't end up with unexpected cases where EXT4_B2C(stripe) becomes 0. Also make the is_stripe_aligned check in regular_allocator a bit more robust while we are at it. This should ideally have no functional change unless we have a bug somewhere causing (stripe % cluster_size != 0) Reviewed-by: Kemeng Shi <shikemeng@huaweicloud.com> Signed-off-by: Ojaswin Mujoo <ojaswin@linux.ibm.com> Reviewed-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com> Link: https://patch.msgid.link/e0c0a3b58a40935a1361f668851d041575861411.1725002410... Signed-off-by: Theodore Ts'o <tytso@mit.edu> Signed-off-by: Baokun Li <libaokun1@huawei.com> --- fs/ext4/mballoc.c | 12 +++++++----- 1 file changed, 7 insertions(+), 5 deletions(-) diff --git a/fs/ext4/mballoc.c b/fs/ext4/mballoc.c index 19c7a3cfae49..f53aa8814e60 100644 --- a/fs/ext4/mballoc.c +++ b/fs/ext4/mballoc.c @@ -2354,7 +2354,7 @@ int ext4_mb_find_by_goal(struct ext4_allocation_context *ac, ex.fe_logical = 0xDEADFA11; /* debug value */ if (max >= ac->ac_g_ex.fe_len && - ac->ac_g_ex.fe_len == EXT4_B2C(sbi, sbi->s_stripe)) { + ac->ac_g_ex.fe_len == EXT4_NUM_B2C(sbi, sbi->s_stripe)) { ext4_fsblk_t start; start = ext4_grp_offs_to_block(ac->ac_sb, &ex); @@ -2551,7 +2551,7 @@ void ext4_mb_scan_aligned(struct ext4_allocation_context *ac, do_div(a, sbi->s_stripe); i = (a * sbi->s_stripe) - first_group_block; - stripe = EXT4_B2C(sbi, sbi->s_stripe); + stripe = EXT4_NUM_B2C(sbi, sbi->s_stripe); i = EXT4_B2C(sbi, i); while (i < EXT4_CLUSTERS_PER_GROUP(sb)) { if (!mb_test_bit(i, bitmap)) { @@ -2936,9 +2936,11 @@ ext4_mb_regular_allocator(struct ext4_allocation_context *ac) if (cr == CR_POWER2_ALIGNED) ext4_mb_simple_scan_group(ac, &e4b); else { - bool is_stripe_aligned = sbi->s_stripe && + bool is_stripe_aligned = + (sbi->s_stripe >= + sbi->s_cluster_ratio) && !(ac->ac_g_ex.fe_len % - EXT4_B2C(sbi, sbi->s_stripe)); + EXT4_NUM_B2C(sbi, sbi->s_stripe)); if ((cr == CR_GOAL_LEN_FAST || cr == CR_BEST_AVAIL_LEN) && @@ -3720,7 +3722,7 @@ int ext4_mb_init(struct super_block *sb) */ if (sbi->s_stripe > 1) { sbi->s_mb_group_prealloc = roundup( - sbi->s_mb_group_prealloc, EXT4_B2C(sbi, sbi->s_stripe)); + sbi->s_mb_group_prealloc, EXT4_NUM_B2C(sbi, sbi->s_stripe)); } sbi->s_mb_nr_global_goals = umin(num_possible_cpus(), -- 2.46.1
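
[Editor's note] The difference between the two conversions is truncating versus round-up division. A small standalone sketch with simplified helpers (not the ext4 macros, which take an sbi): with a stripe of 3 blocks and a cluster ratio of 4, truncation yields 0 — a value the allocator would later divide by — while rounding up keeps it at 1.

#include <stdio.h>

/*
 * Simplified forms of the two conversions: b2c() truncates, num_b2c() rounds
 * a block count up to whole clusters.
 */
static unsigned int b2c(unsigned int blocks, unsigned int cluster_bits)
{
        return blocks >> cluster_bits;
}

static unsigned int num_b2c(unsigned int blocks, unsigned int cluster_bits)
{
        unsigned int ratio = 1u << cluster_bits;

        return (blocks + ratio - 1) >> cluster_bits;
}

int main(void)
{
        /* A (misconfigured) stripe of 3 blocks with a cluster ratio of 4. */
        printf("B2C(3)     = %u clusters\n", b2c(3, 2));
        printf("NUM_B2C(3) = %u clusters\n", num_b2c(3, 2));
        return 0;
}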

mainline inclusion from mainline-v6.17-rc1 commit 45704f92e55853fe287760e019feb45eeb9c988e category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/ICN2MB Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i... -------------------------------- Extract __ext4_mb_scan_group() to make the code clearer and to prepare for the later conversion of 'choose group' to 'scan groups'. No functional changes. Signed-off-by: Baokun Li <libaokun1@huawei.com> Reviewed-by: Zhang Yi <yi.zhang@huawei.com> Link: https://patch.msgid.link/20250714130327.1830534-13-libaokun1@huawei.com Signed-off-by: Theodore Ts'o <tytso@mit.edu> Conflicts: fs/ext4/mballoc.h [Context conflicts because there are no commits here: ccedf35b5daa ("ext4: convert ac_bitmap_page to ac_bitmap_folio") c84f1510fba9 ("ext4: convert ac_buddy_page to ac_buddy_folio").] Signed-off-by: Baokun Li <libaokun1@huawei.com> --- fs/ext4/mballoc.c | 45 +++++++++++++++++++++++++++------------------ fs/ext4/mballoc.h | 2 ++ 2 files changed, 29 insertions(+), 18 deletions(-) diff --git a/fs/ext4/mballoc.c b/fs/ext4/mballoc.c index f53aa8814e60..dc693afcc857 100644 --- a/fs/ext4/mballoc.c +++ b/fs/ext4/mballoc.c @@ -2569,6 +2569,30 @@ void ext4_mb_scan_aligned(struct ext4_allocation_context *ac, } } +static void __ext4_mb_scan_group(struct ext4_allocation_context *ac) +{ + bool is_stripe_aligned; + struct ext4_sb_info *sbi; + enum criteria cr = ac->ac_criteria; + + ac->ac_groups_scanned++; + if (cr == CR_POWER2_ALIGNED) + return ext4_mb_simple_scan_group(ac, ac->ac_e4b); + + sbi = EXT4_SB(ac->ac_sb); + is_stripe_aligned = false; + if ((sbi->s_stripe >= sbi->s_cluster_ratio) && + !(ac->ac_g_ex.fe_len % EXT4_NUM_B2C(sbi, sbi->s_stripe))) + is_stripe_aligned = true; + + if ((cr == CR_GOAL_LEN_FAST || cr == CR_BEST_AVAIL_LEN) && + is_stripe_aligned) + ext4_mb_scan_aligned(ac, ac->ac_e4b); + + if (ac->ac_status == AC_STATUS_CONTINUE) + ext4_mb_complex_scan_group(ac, ac->ac_e4b); +} + /* * This is also called BEFORE we load the buddy bitmap. * Returns either 1 or 0 indicating that the group is either suitable @@ -2856,6 +2880,8 @@ ext4_mb_regular_allocator(struct ext4_allocation_context *ac) */ if (ac->ac_2order) cr = CR_POWER2_ALIGNED; + + ac->ac_e4b = &e4b; repeat: for (; cr < EXT4_MB_NUM_CRS && ac->ac_status == AC_STATUS_CONTINUE; cr++) { ac->ac_criteria = cr; @@ -2932,24 +2958,7 @@ ext4_mb_regular_allocator(struct ext4_allocation_context *ac) continue; } - ac->ac_groups_scanned++; - if (cr == CR_POWER2_ALIGNED) - ext4_mb_simple_scan_group(ac, &e4b); - else { - bool is_stripe_aligned = - (sbi->s_stripe >= - sbi->s_cluster_ratio) && - !(ac->ac_g_ex.fe_len % - EXT4_NUM_B2C(sbi, sbi->s_stripe)); - - if ((cr == CR_GOAL_LEN_FAST || - cr == CR_BEST_AVAIL_LEN) && - is_stripe_aligned) - ext4_mb_scan_aligned(ac, &e4b); - - if (ac->ac_status == AC_STATUS_CONTINUE) - ext4_mb_complex_scan_group(ac, &e4b); - } + __ext4_mb_scan_group(ac); ext4_unlock_group(sb, group); ext4_mb_unload_buddy(&e4b); diff --git a/fs/ext4/mballoc.h b/fs/ext4/mballoc.h index dd16050022f5..99461f0cdc8a 100644 --- a/fs/ext4/mballoc.h +++ b/fs/ext4/mballoc.h @@ -205,6 +205,8 @@ struct ext4_allocation_context { __u8 ac_2order; /* if request is to allocate 2^N blocks and * N > 0, the field stores N, otherwise 0 */ __u8 ac_op; /* operation, for history only */ + + struct ext4_buddy *ac_e4b; struct page *ac_bitmap_page; struct page *ac_buddy_page; struct ext4_prealloc_space *ac_pa; -- 2.46.1

From: Kemeng Shi <shikemeng@huaweicloud.com> mainline inclusion from mainline-v6.8-rc3 commit 97c32dbffce12136c06d9ceb6919850233d3a1c2 category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/ICN2MB Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i... -------------------------------- Remove unused ext4_allocation_context::ac_groups_considered Signed-off-by: Kemeng Shi <shikemeng@huaweicloud.com> Reviewed-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20240105092102.496631-5-shikemeng@huaweicloud.com Signed-off-by: Theodore Ts'o <tytso@mit.edu> Signed-off-by: Baokun Li <libaokun1@huawei.com> --- fs/ext4/mballoc.h | 1 - 1 file changed, 1 deletion(-) diff --git a/fs/ext4/mballoc.h b/fs/ext4/mballoc.h index 99461f0cdc8a..587d4dd5746d 100644 --- a/fs/ext4/mballoc.h +++ b/fs/ext4/mballoc.h @@ -192,7 +192,6 @@ struct ext4_allocation_context { */ ext4_grpblk_t ac_orig_goal_len; - __u32 ac_groups_considered; __u32 ac_flags; /* allocation hints */ __u32 ac_groups_linear_remaining; __u16 ac_groups_scanned; -- 2.46.1

From: Kemeng Shi <shikemeng@huaweicloud.com> mainline inclusion from mainline-v6.10-rc1 commit 9c97c34a998a01bc0cf970b1685456bc2ea16b64 category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/ICN2MB Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i... -------------------------------- Keep "prefetch_grp" and "nr" consistent to avoid calling ext4_mb_prefetch_fini with non-prefetched groups. When we step into the next criteria, "prefetch_grp" is set to the prefetch start of the new criteria while "nr" still holds the number of groups prefetched in the previous criteria. If the previous and next criteria are both inexpensive (< CR_GOAL_LEN_SLOW) and prefetch_ios reaches sbi->s_mb_prefetch_limit in the previous criteria, "prefetch_grp" and "nr" will be inconsistent and may introduce the unexpected cost of running ext4_mb_init_group for non-prefetched groups. Reset "nr" to 0 when we reset "prefetch_grp" to the goal group to keep them consistent. Signed-off-by: Kemeng Shi <shikemeng@huaweicloud.com> Reviewed-by: Ojaswin Mujoo <ojaswin@linux.ibm.com> Reviewed-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20240424061904.987525-2-shikemeng@huaweicloud.com Signed-off-by: Theodore Ts'o <tytso@mit.edu> Signed-off-by: Baokun Li <libaokun1@huawei.com> --- fs/ext4/mballoc.c | 1 + 1 file changed, 1 insertion(+) diff --git a/fs/ext4/mballoc.c b/fs/ext4/mballoc.c index dc693afcc857..cd8a9be56cc7 100644 --- a/fs/ext4/mballoc.c +++ b/fs/ext4/mballoc.c @@ -2892,6 +2892,7 @@ ext4_mb_regular_allocator(struct ext4_allocation_context *ac) group = ac->ac_g_ex.fe_group; ac->ac_groups_linear_remaining = sbi->s_mb_max_linear_groups; prefetch_grp = group; + nr = 0; for (i = 0, new_cr = cr; i < ngroups; i++, ext4_mb_choose_next_group(ac, &new_cr, &group, ngroups)) { -- 2.46.1
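To make the inconsistency easier to see, below is an illustrative userspace sketch (the NGROUPS value and all numbers are hypothetical; prefetch_fini() only loosely mirrors the shape of ext4_mb_prefetch_fini()) of the invariant the one-line fix restores: the cleanup pass walks "nr" groups backwards from "prefetch_grp", so both must be reset together when a new criteria pass restarts from the goal group.

/* illustrative only: hypothetical group count and values */
#include <stdio.h>

#define NGROUPS 32U

/* loosely mirrors the shape of ext4_mb_prefetch_fini(): walk "nr" groups
 * backwards from "group", wrapping at group 0 */
static void prefetch_fini(unsigned int group, unsigned int nr)
{
	while (nr-- > 0) {
		if (group == 0)
			group = NGROUPS;
		group--;
		printf("  finalize buddy of group %u\n", group);
	}
}

int main(void)
{
	unsigned int prefetch_grp = 20, nr = 8; /* state left by the previous criteria */
	unsigned int goal = 5;

	/* next criteria pass restarts prefetching from the goal group */
	prefetch_grp = goal;
	nr = 0; /* the one-line fix: without this, the cleanup below would walk
		 * 8 groups before the goal that were never prefetched here */

	printf("cleanup with prefetch_grp=%u nr=%u\n", prefetch_grp, nr);
	prefetch_fini(prefetch_grp, nr);
	return 0;
}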

mainline inclusion from mainline-v6.17-rc1 commit 5abd85f667a19ef7d880ed00c201fc22de6fa707 category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/ICN2MB Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i... -------------------------------- Extract ext4_mb_might_prefetch() to make the code clearer and to prepare for the later conversion of 'choose group' to 'scan groups'. No functional changes. Signed-off-by: Baokun Li <libaokun1@huawei.com> Reviewed-by: Zhang Yi <yi.zhang@huawei.com> Link: https://patch.msgid.link/20250714130327.1830534-14-libaokun1@huawei.com Signed-off-by: Theodore Ts'o <tytso@mit.edu> Signed-off-by: Baokun Li <libaokun1@huawei.com> --- fs/ext4/mballoc.c | 62 +++++++++++++++++++++++++++++------------------ fs/ext4/mballoc.h | 4 +++ 2 files changed, 42 insertions(+), 24 deletions(-) diff --git a/fs/ext4/mballoc.c b/fs/ext4/mballoc.c index cd8a9be56cc7..305c22df3a80 100644 --- a/fs/ext4/mballoc.c +++ b/fs/ext4/mballoc.c @@ -2782,6 +2782,37 @@ ext4_group_t ext4_mb_prefetch(struct super_block *sb, ext4_group_t group, return group; } +/* + * Batch reads of the block allocation bitmaps to get + * multiple READs in flight; limit prefetching at inexpensive + * CR, otherwise mballoc can spend a lot of time loading + * imperfect groups + */ +static void ext4_mb_might_prefetch(struct ext4_allocation_context *ac, + ext4_group_t group) +{ + struct ext4_sb_info *sbi; + + if (ac->ac_prefetch_grp != group) + return; + + sbi = EXT4_SB(ac->ac_sb); + if (ext4_mb_cr_expensive(ac->ac_criteria) || + ac->ac_prefetch_ios < sbi->s_mb_prefetch_limit) { + unsigned int nr = sbi->s_mb_prefetch; + + if (ext4_has_feature_flex_bg(ac->ac_sb)) { + nr = 1 << sbi->s_log_groups_per_flex; + nr -= group & (nr - 1); + nr = umin(nr, sbi->s_mb_prefetch); + } + + ac->ac_prefetch_nr = nr; + ac->ac_prefetch_grp = ext4_mb_prefetch(ac->ac_sb, group, nr, + &ac->ac_prefetch_ios); + } +} + /* * Prefetching reads the block bitmap into the buffer cache; but we * need to make sure that the buddy bitmap in the page cache has been @@ -2818,10 +2849,9 @@ void ext4_mb_prefetch_fini(struct super_block *sb, ext4_group_t group, static noinline_for_stack int ext4_mb_regular_allocator(struct ext4_allocation_context *ac) { - ext4_group_t prefetch_grp = 0, ngroups, group, i; + ext4_group_t ngroups, group, i; enum criteria new_cr, cr = CR_GOAL_LEN_FAST; int err = 0, first_err = 0; - unsigned int nr = 0, prefetch_ios = 0; struct ext4_sb_info *sbi; struct super_block *sb; struct ext4_buddy e4b; @@ -2882,6 +2912,7 @@ ext4_mb_regular_allocator(struct ext4_allocation_context *ac) cr = CR_POWER2_ALIGNED; ac->ac_e4b = &e4b; + ac->ac_prefetch_ios = 0; repeat: for (; cr < EXT4_MB_NUM_CRS && ac->ac_status == AC_STATUS_CONTINUE; cr++) { ac->ac_criteria = cr; @@ -2891,8 +2922,8 @@ ext4_mb_regular_allocator(struct ext4_allocation_context *ac) */ group = ac->ac_g_ex.fe_group; ac->ac_groups_linear_remaining = sbi->s_mb_max_linear_groups; - prefetch_grp = group; - nr = 0; + ac->ac_prefetch_grp = group; + ac->ac_prefetch_nr = 0; for (i = 0, new_cr = cr; i < ngroups; i++, ext4_mb_choose_next_group(ac, &new_cr, &group, ngroups)) { @@ -2904,24 +2935,7 @@ ext4_mb_regular_allocator(struct ext4_allocation_context *ac) goto repeat; } - /* - * Batch reads of the block allocation bitmaps - * to get multiple READs in flight; limit - * prefetching at inexpensive CR, otherwise mballoc - * can spend a lot of time loading imperfect groups - */ - if ((prefetch_grp == group) && - (ext4_mb_cr_expensive(cr) || - 
prefetch_ios < sbi->s_mb_prefetch_limit)) { - nr = sbi->s_mb_prefetch; - if (ext4_has_feature_flex_bg(sb)) { - nr = 1 << sbi->s_log_groups_per_flex; - nr -= group & (nr - 1); - nr = min(nr, sbi->s_mb_prefetch); - } - prefetch_grp = ext4_mb_prefetch(sb, group, - nr, &prefetch_ios); - } + ext4_mb_might_prefetch(ac, group); /* prevent unnecessary buddy loading. */ if (cr < CR_ANY_FREE && @@ -3019,8 +3033,8 @@ ext4_mb_regular_allocator(struct ext4_allocation_context *ac) ac->ac_b_ex.fe_len, ac->ac_o_ex.fe_len, ac->ac_status, ac->ac_flags, cr, err); - if (nr) - ext4_mb_prefetch_fini(sb, prefetch_grp, nr); + if (ac->ac_prefetch_nr) + ext4_mb_prefetch_fini(sb, ac->ac_prefetch_grp, ac->ac_prefetch_nr); return err; } diff --git a/fs/ext4/mballoc.h b/fs/ext4/mballoc.h index 587d4dd5746d..b736ee7eb660 100644 --- a/fs/ext4/mballoc.h +++ b/fs/ext4/mballoc.h @@ -192,6 +192,10 @@ struct ext4_allocation_context { */ ext4_grpblk_t ac_orig_goal_len; + ext4_group_t ac_prefetch_grp; + unsigned int ac_prefetch_ios; + unsigned int ac_prefetch_nr; + __u32 ac_flags; /* allocation hints */ __u32 ac_groups_linear_remaining; __u16 ac_groups_scanned; -- 2.46.1
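As a side note on the batch-size arithmetic the new helper keeps, here is a small standalone sketch (prefetch_batch() is a made-up name and the flex_bg size and prefetch limit are assumed values) of how the prefetch count is clipped at the next flex-group boundary and bounded by the s_mb_prefetch tunable:

/* sketch of the prefetch batch-size arithmetic; values are assumptions */
#include <stdio.h>

static unsigned int prefetch_batch(unsigned int group,
				   unsigned int log_groups_per_flex,
				   unsigned int mb_prefetch)
{
	unsigned int nr = 1U << log_groups_per_flex; /* groups per flex_bg */

	nr -= group & (nr - 1);                      /* groups left until the boundary */
	return nr < mb_prefetch ? nr : mb_prefetch;  /* cap at the tunable */
}

int main(void)
{
	/* e.g. 16 groups per flex_bg, prefetch tunable of 32 */
	printf("%u\n", prefetch_batch(0, 4, 32));  /* 16: a full flex_bg fits    */
	printf("%u\n", prefetch_batch(13, 4, 32)); /* 3: stop at the next flex_bg */
	printf("%u\n", prefetch_batch(13, 6, 32)); /* 32: capped by the tunable   */
	return 0;
}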

mainline inclusion from mainline-v6.17-rc1 commit 9c08e42db9056d423dcef5e7998c73182180ff83 category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/ICN2MB Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i... -------------------------------- Extract ext4_mb_scan_group() to make the code clearer and to prepare for the later conversion of 'choose group' to 'scan groups'. No functional changes. Signed-off-by: Baokun Li <libaokun1@huawei.com> Reviewed-by: Zhang Yi <yi.zhang@huawei.com> Link: https://patch.msgid.link/20250714130327.1830534-15-libaokun1@huawei.com Signed-off-by: Theodore Ts'o <tytso@mit.edu> Signed-off-by: Baokun Li <libaokun1@huawei.com> --- fs/ext4/mballoc.c | 93 +++++++++++++++++++++++++---------------------- fs/ext4/mballoc.h | 2 + 2 files changed, 51 insertions(+), 44 deletions(-) diff --git a/fs/ext4/mballoc.c b/fs/ext4/mballoc.c index 305c22df3a80..8c427927dacc 100644 --- a/fs/ext4/mballoc.c +++ b/fs/ext4/mballoc.c @@ -2846,12 +2846,56 @@ void ext4_mb_prefetch_fini(struct super_block *sb, ext4_group_t group, } } +static int ext4_mb_scan_group(struct ext4_allocation_context *ac, + ext4_group_t group) +{ + int ret; + struct super_block *sb = ac->ac_sb; + enum criteria cr = ac->ac_criteria; + + ext4_mb_might_prefetch(ac, group); + + /* prevent unnecessary buddy loading. */ + if (cr < CR_ANY_FREE && spin_is_locked(ext4_group_lock_ptr(sb, group))) + return 0; + + /* This now checks without needing the buddy page */ + ret = ext4_mb_good_group_nolock(ac, group, cr); + if (ret <= 0) { + if (!ac->ac_first_err) + ac->ac_first_err = ret; + return 0; + } + + ret = ext4_mb_load_buddy(sb, group, ac->ac_e4b); + if (ret) + return ret; + + /* skip busy group */ + if (cr >= CR_ANY_FREE) + ext4_lock_group(sb, group); + else if (!ext4_try_lock_group(sb, group)) + goto out_unload; + + /* We need to check again after locking the block group. */ + if (unlikely(!ext4_mb_good_group(ac, group, cr))) + goto out_unlock; + + __ext4_mb_scan_group(ac); + +out_unlock: + ext4_unlock_group(sb, group); +out_unload: + ext4_mb_unload_buddy(ac->ac_e4b); + return ret; +} + static noinline_for_stack int ext4_mb_regular_allocator(struct ext4_allocation_context *ac) { ext4_group_t ngroups, group, i; enum criteria new_cr, cr = CR_GOAL_LEN_FAST; - int err = 0, first_err = 0; + int err = 0; struct ext4_sb_info *sbi; struct super_block *sb; struct ext4_buddy e4b; @@ -2913,6 +2957,7 @@ ext4_mb_regular_allocator(struct ext4_allocation_context *ac) ac->ac_e4b = &e4b; ac->ac_prefetch_ios = 0; + ac->ac_first_err = 0; repeat: for (; cr < EXT4_MB_NUM_CRS && ac->ac_status == AC_STATUS_CONTINUE; cr++) { ac->ac_criteria = cr; @@ -2927,7 +2972,6 @@ ext4_mb_regular_allocator(struct ext4_allocation_context *ac) for (i = 0, new_cr = cr; i < ngroups; i++, ext4_mb_choose_next_group(ac, &new_cr, &group, ngroups)) { - int ret = 0; cond_resched(); if (new_cr != cr) { @@ -2935,49 +2979,10 @@ ext4_mb_regular_allocator(struct ext4_allocation_context *ac) goto repeat; } - ext4_mb_might_prefetch(ac, group); - - /* prevent unnecessary buddy loading. 
*/ - if (cr < CR_ANY_FREE && - spin_is_locked(ext4_group_lock_ptr(sb, group))) - continue; - - /* This now checks without needing the buddy page */ - ret = ext4_mb_good_group_nolock(ac, group, cr); - if (ret <= 0) { - if (!first_err) - first_err = ret; - continue; - } - - err = ext4_mb_load_buddy(sb, group, &e4b); + err = ext4_mb_scan_group(ac, group); if (err) goto out; - /* skip busy group */ - if (cr >= CR_ANY_FREE) { - ext4_lock_group(sb, group); - } else if (!ext4_try_lock_group(sb, group)) { - ext4_mb_unload_buddy(&e4b); - continue; - } - - /* - * We need to check again after locking the - * block group - */ - ret = ext4_mb_good_group(ac, group, cr); - if (ret == 0) { - ext4_unlock_group(sb, group); - ext4_mb_unload_buddy(&e4b); - continue; - } - - __ext4_mb_scan_group(ac); - - ext4_unlock_group(sb, group); - ext4_mb_unload_buddy(&e4b); - if (ac->ac_status != AC_STATUS_CONTINUE) break; } @@ -3026,8 +3031,8 @@ ext4_mb_regular_allocator(struct ext4_allocation_context *ac) atomic_inc(&sbi->s_bal_stream_goals); } out: - if (!err && ac->ac_status != AC_STATUS_FOUND && first_err) - err = first_err; + if (!err && ac->ac_status != AC_STATUS_FOUND && ac->ac_first_err) + err = ac->ac_first_err; mb_debug(sb, "Best len %d, origin len %d, ac_status %u, ac_flags 0x%x, cr %d ret %d\n", ac->ac_b_ex.fe_len, ac->ac_o_ex.fe_len, ac->ac_status, diff --git a/fs/ext4/mballoc.h b/fs/ext4/mballoc.h index b736ee7eb660..5a0cbd512444 100644 --- a/fs/ext4/mballoc.h +++ b/fs/ext4/mballoc.h @@ -196,6 +196,8 @@ struct ext4_allocation_context { unsigned int ac_prefetch_ios; unsigned int ac_prefetch_nr; + int ac_first_err; + __u32 ac_flags; /* allocation hints */ __u32 ac_groups_linear_remaining; __u16 ac_groups_scanned; -- 2.46.1
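The locking policy that ext4_mb_scan_group() centralizes can be pictured with a short userspace sketch (pthread spinlocks and the group/scan_group names are stand-ins, not the kernel API): inexpensive criteria merely try-lock a group and skip it when busy, while the CR_ANY_FREE-style last resort blocks so that no group with free space is ever skipped.

/* userspace illustration of the try-lock-or-skip policy; not kernel code */
#include <pthread.h>
#include <stdbool.h>
#include <stdio.h>

struct group {
	pthread_spinlock_t lock;
	int free_clusters;
};

/* returns true if the group was scanned, false if it was skipped as busy */
static bool scan_group(struct group *g, bool last_resort)
{
	if (last_resort)
		pthread_spin_lock(&g->lock);  /* never skip on the last pass */
	else if (pthread_spin_trylock(&g->lock))
		return false;                 /* busy: let another group serve us */

	/* ... scan the buddy bitmap under the group lock ... */
	printf("scanned group with %d free clusters\n", g->free_clusters);

	pthread_spin_unlock(&g->lock);
	return true;
}

int main(void)
{
	struct group g = { .free_clusters = 128 };

	pthread_spin_init(&g.lock, PTHREAD_PROCESS_PRIVATE);
	scan_group(&g, false); /* cheap criteria: may skip if contended */
	scan_group(&g, true);  /* last-resort pass: always scans        */
	pthread_spin_destroy(&g.lock);
	return 0;
}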

mainline inclusion from mainline-v6.17-rc1 commit f7eaacbb4e54f8a6c6674c16eff54f703ea63d5e category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/ICN2MB Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i... -------------------------------- While traversing the list, holding a spin_lock prevents load_buddy, making direct use of ext4_try_lock_group impossible. This can lead to a bouncing scenario where spin_is_locked(grp_A) succeeds, but ext4_try_lock_group() fails, forcing the list traversal to repeatedly restart from grp_A. In contrast, linear traversal directly uses ext4_try_lock_group(), avoiding this bouncing. Therefore, we need a lockless, ordered traversal to achieve linear-like efficiency. Therefore, this commit converts both average fragment size lists and largest free order lists into ordered xarrays. In an xarray, the index represents the block group number and the value holds the block group information; a non-empty value indicates the block group's presence. While insertion and deletion complexity remain O(1), lookup complexity changes from O(1) to O(nlogn), which may slightly reduce single-threaded performance. Additionally, xarray insertions might fail, potentially due to memory allocation issues. However, since we have linear traversal as a fallback, this isn't a major problem. Therefore, we've only added a warning message for insertion failures here. A helper function ext4_mb_find_good_group_xarray() is added to find good groups in the specified xarray starting at the specified position start, and when it reaches ngroups-1, it wraps around to 0 and then to start-1. This ensures an ordered traversal within the xarray. Performance test results are as follows: Single-process operations on an empty disk show negligible impact, while multi-process workloads demonstrate a noticeable performance gain. |CPU: Kunpeng 920 | P80 | P1 | |Memory: 512GB |------------------------|-------------------------| |960GB SSD (0.5GB/s)| base | patched | base | patched | |-------------------|-------|----------------|--------|----------------| |mb_optimize_scan=0 | 20097 | 19555 (-2.6%) | 316141 | 315636 (-0.2%) | |mb_optimize_scan=1 | 13318 | 15496 (+16.3%) | 325273 | 323569 (-0.5%) | |CPU: AMD 9654 * 2 | P96 | P1 | |Memory: 1536GB |------------------------|-------------------------| |960GB SSD (1GB/s) | base | patched | base | patched | |-------------------|-------|----------------|--------|----------------| |mb_optimize_scan=0 | 53603 | 53192 (-0.7%) | 214243 | 212678 (-0.7%) | |mb_optimize_scan=1 | 20887 | 37636 (+80.1%) | 213632 | 214189 (+0.2%) | [ Applied spelling fixes per discussion on the ext4-list see thread referened in the Link tag. --tytso] Signed-off-by: Baokun Li <libaokun1@huawei.com> Reviewed-by: Zhang Yi <yi.zhang@huawei.com> Link: https://patch.msgid.link/20250714130327.1830534-16-libaokun1@huawei.com Signed-off-by: Theodore Ts'o <tytso@mit.edu> Conflicts: fs/ext4/mballoc-test.c [Context differences because there is no commit 2b81493f8eb6 ("ext4: Add unit test for ext4_mb_mark_diskspace_used").] 
Signed-off-by: Baokun Li <libaokun1@huawei.com> --- fs/ext4/ext4.h | 8 +- fs/ext4/mballoc.c | 254 +++++++++++++++++++++++++--------------------- 2 files changed, 140 insertions(+), 122 deletions(-) diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h index 07e240231fe0..1e30aaf39e0e 100644 --- a/fs/ext4/ext4.h +++ b/fs/ext4/ext4.h @@ -1590,10 +1590,8 @@ struct ext4_sb_info { struct list_head s_discard_list; struct work_struct s_discard_work; atomic_t s_retry_alloc_pending; - struct list_head *s_mb_avg_fragment_size; - rwlock_t *s_mb_avg_fragment_size_locks; - struct list_head *s_mb_largest_free_orders; - rwlock_t *s_mb_largest_free_orders_locks; + struct xarray *s_mb_avg_fragment_size; + struct xarray *s_mb_largest_free_orders; /* tunables */ unsigned long s_stripe; @@ -3431,8 +3429,6 @@ struct ext4_group_info { void *bb_bitmap; #endif struct rw_semaphore alloc_sem; - struct list_head bb_avg_fragment_size_node; - struct list_head bb_largest_free_order_node; ext4_grpblk_t bb_counters[]; /* Nr of free power-of-two-block * regions, index is order. * bb_counters[3] = 5 means diff --git a/fs/ext4/mballoc.c b/fs/ext4/mballoc.c index 8c427927dacc..c894aa014d2e 100644 --- a/fs/ext4/mballoc.c +++ b/fs/ext4/mballoc.c @@ -131,25 +131,30 @@ * If "mb_optimize_scan" mount option is set, we maintain in memory group info * structures in two data structures: * - * 1) Array of largest free order lists (sbi->s_mb_largest_free_orders) + * 1) Array of largest free order xarrays (sbi->s_mb_largest_free_orders) * - * Locking: sbi->s_mb_largest_free_orders_locks(array of rw locks) + * Locking: Writers use xa_lock, readers use rcu_read_lock. * - * This is an array of lists where the index in the array represents the + * This is an array of xarrays where the index in the array represents the * largest free order in the buddy bitmap of the participating group infos of - * that list. So, there are exactly MB_NUM_ORDERS(sb) (which means total - * number of buddy bitmap orders possible) number of lists. Group-infos are - * placed in appropriate lists. + * that xarray. So, there are exactly MB_NUM_ORDERS(sb) (which means total + * number of buddy bitmap orders possible) number of xarrays. Group-infos are + * placed in appropriate xarrays. * - * 2) Average fragment size lists (sbi->s_mb_avg_fragment_size) + * 2) Average fragment size xarrays (sbi->s_mb_avg_fragment_size) * - * Locking: sbi->s_mb_avg_fragment_size_locks(array of rw locks) + * Locking: Writers use xa_lock, readers use rcu_read_lock. * - * This is an array of lists where in the i-th list there are groups with + * This is an array of xarrays where in the i-th xarray there are groups with * average fragment size >= 2^i and < 2^(i+1). The average fragment size * is computed as ext4_group_info->bb_free / ext4_group_info->bb_fragments. - * Note that we don't bother with a special list for completely empty groups - * so we only have MB_NUM_ORDERS(sb) lists. + * Note that we don't bother with a special xarray for completely empty + * groups so we only have MB_NUM_ORDERS(sb) xarrays. Group-infos are placed + * in appropriate xarrays. + * + * In xarray, the index is the block group number, the value is the block group + * information, and a non-empty value indicates the block group is present in + * the current xarray. 
* * When "mb_optimize_scan" mount option is set, mballoc consults the above data * structures to decide the order in which groups are to be traversed for @@ -855,19 +860,73 @@ mb_update_avg_fragment_size(struct super_block *sb, struct ext4_group_info *grp) if (new == old) return; - if (old >= 0) { - write_lock(&sbi->s_mb_avg_fragment_size_locks[old]); - list_del(&grp->bb_avg_fragment_size_node); - write_unlock(&sbi->s_mb_avg_fragment_size_locks[old]); - } + if (old >= 0) + xa_erase(&sbi->s_mb_avg_fragment_size[old], grp->bb_group); grp->bb_avg_fragment_size_order = new; if (new >= 0) { - write_lock(&sbi->s_mb_avg_fragment_size_locks[new]); - list_add_tail(&grp->bb_avg_fragment_size_node, - &sbi->s_mb_avg_fragment_size[new]); - write_unlock(&sbi->s_mb_avg_fragment_size_locks[new]); + /* + * Cannot use __GFP_NOFAIL because we hold the group lock. + * Although allocation for insertion may fails, it's not fatal + * as we have linear traversal to fall back on. + */ + int err = xa_insert(&sbi->s_mb_avg_fragment_size[new], + grp->bb_group, grp, GFP_ATOMIC); + if (err) + mb_debug(sb, "insert group: %u to s_mb_avg_fragment_size[%d] failed, err %d", + grp->bb_group, new, err); + } +} + +static struct ext4_group_info * +ext4_mb_find_good_group_xarray(struct ext4_allocation_context *ac, + struct xarray *xa, ext4_group_t start) +{ + struct super_block *sb = ac->ac_sb; + struct ext4_sb_info *sbi = EXT4_SB(sb); + enum criteria cr = ac->ac_criteria; + ext4_group_t ngroups = ext4_get_groups_count(sb); + unsigned long group = start; + ext4_group_t end = ngroups; + struct ext4_group_info *grp; + + if (WARN_ON_ONCE(start >= end)) + return NULL; + +wrap_around: + xa_for_each_range(xa, group, grp, start, end - 1) { + if (sbi->s_mb_stats) + atomic64_inc(&sbi->s_bal_cX_groups_considered[cr]); + + if (!spin_is_locked(ext4_group_lock_ptr(sb, group)) && + likely(ext4_mb_good_group(ac, group, cr))) + return grp; + + cond_resched(); } + + if (start) { + end = start; + start = 0; + goto wrap_around; + } + + return NULL; +} + +/* + * Find a suitable group of given order from the largest free orders xarray. 
+ */ +static struct ext4_group_info * +ext4_mb_find_good_group_largest_free_order(struct ext4_allocation_context *ac, + int order, ext4_group_t start) +{ + struct xarray *xa = &EXT4_SB(ac->ac_sb)->s_mb_largest_free_orders[order]; + + if (xa_empty(xa)) + return NULL; + + return ext4_mb_find_good_group_xarray(ac, xa, start); } /* @@ -878,7 +937,7 @@ static void ext4_mb_choose_next_group_p2_aligned(struct ext4_allocation_context enum criteria *new_cr, ext4_group_t *group, ext4_group_t ngroups) { struct ext4_sb_info *sbi = EXT4_SB(ac->ac_sb); - struct ext4_group_info *iter; + struct ext4_group_info *grp; int i; if (ac->ac_status == AC_STATUS_FOUND) @@ -888,26 +947,12 @@ static void ext4_mb_choose_next_group_p2_aligned(struct ext4_allocation_context atomic_inc(&sbi->s_bal_p2_aligned_bad_suggestions); for (i = ac->ac_2order; i < MB_NUM_ORDERS(ac->ac_sb); i++) { - if (list_empty(&sbi->s_mb_largest_free_orders[i])) - continue; - read_lock(&sbi->s_mb_largest_free_orders_locks[i]); - if (list_empty(&sbi->s_mb_largest_free_orders[i])) { - read_unlock(&sbi->s_mb_largest_free_orders_locks[i]); - continue; - } - list_for_each_entry(iter, &sbi->s_mb_largest_free_orders[i], - bb_largest_free_order_node) { - if (sbi->s_mb_stats) - atomic64_inc(&sbi->s_bal_cX_groups_considered[CR_POWER2_ALIGNED]); - if (!spin_is_locked(ext4_group_lock_ptr(ac->ac_sb, iter->bb_group)) && - likely(ext4_mb_good_group(ac, iter->bb_group, CR_POWER2_ALIGNED))) { - *group = iter->bb_group; - ac->ac_flags |= EXT4_MB_CR_POWER2_ALIGNED_OPTIMIZED; - read_unlock(&sbi->s_mb_largest_free_orders_locks[i]); - return; - } + grp = ext4_mb_find_good_group_largest_free_order(ac, i, *group); + if (grp) { + *group = grp->bb_group; + ac->ac_flags |= EXT4_MB_CR_POWER2_ALIGNED_OPTIMIZED; + return; } - read_unlock(&sbi->s_mb_largest_free_orders_locks[i]); } /* Increment cr and search again if no group is found */ @@ -915,35 +960,18 @@ static void ext4_mb_choose_next_group_p2_aligned(struct ext4_allocation_context } /* - * Find a suitable group of given order from the average fragments list. + * Find a suitable group of given order from the average fragments xarray. 
*/ static struct ext4_group_info * -ext4_mb_find_good_group_avg_frag_lists(struct ext4_allocation_context *ac, int order) +ext4_mb_find_good_group_avg_frag_xarray(struct ext4_allocation_context *ac, + int order, ext4_group_t start) { - struct ext4_sb_info *sbi = EXT4_SB(ac->ac_sb); - struct list_head *frag_list = &sbi->s_mb_avg_fragment_size[order]; - rwlock_t *frag_list_lock = &sbi->s_mb_avg_fragment_size_locks[order]; - struct ext4_group_info *grp = NULL, *iter; - enum criteria cr = ac->ac_criteria; + struct xarray *xa = &EXT4_SB(ac->ac_sb)->s_mb_avg_fragment_size[order]; - if (list_empty(frag_list)) - return NULL; - read_lock(frag_list_lock); - if (list_empty(frag_list)) { - read_unlock(frag_list_lock); + if (xa_empty(xa)) return NULL; - } - list_for_each_entry(iter, frag_list, bb_avg_fragment_size_node) { - if (sbi->s_mb_stats) - atomic64_inc(&sbi->s_bal_cX_groups_considered[cr]); - if (!spin_is_locked(ext4_group_lock_ptr(ac->ac_sb, iter->bb_group)) && - likely(ext4_mb_good_group(ac, iter->bb_group, cr))) { - grp = iter; - break; - } - } - read_unlock(frag_list_lock); - return grp; + + return ext4_mb_find_good_group_xarray(ac, xa, start); } /* @@ -964,7 +992,7 @@ static void ext4_mb_choose_next_group_goal_fast(struct ext4_allocation_context * for (i = mb_avg_fragment_size_order(ac->ac_sb, ac->ac_g_ex.fe_len); i < MB_NUM_ORDERS(ac->ac_sb); i++) { - grp = ext4_mb_find_good_group_avg_frag_lists(ac, i); + grp = ext4_mb_find_good_group_avg_frag_xarray(ac, i, *group); if (grp) { *group = grp->bb_group; ac->ac_flags |= EXT4_MB_CR_GOAL_LEN_FAST_OPTIMIZED; @@ -1060,7 +1088,8 @@ static void ext4_mb_choose_next_group_best_avail(struct ext4_allocation_context frag_order = mb_avg_fragment_size_order(ac->ac_sb, ac->ac_g_ex.fe_len); - grp = ext4_mb_find_good_group_avg_frag_lists(ac, frag_order); + grp = ext4_mb_find_good_group_avg_frag_xarray(ac, frag_order, + *group); if (grp) { *group = grp->bb_group; ac->ac_flags |= EXT4_MB_CR_BEST_AVAIL_LEN_OPTIMIZED; @@ -1165,18 +1194,25 @@ mb_set_largest_free_order(struct super_block *sb, struct ext4_group_info *grp) if (new == old) return; - if (old >= 0 && !list_empty(&grp->bb_largest_free_order_node)) { - write_lock(&sbi->s_mb_largest_free_orders_locks[old]); - list_del_init(&grp->bb_largest_free_order_node); - write_unlock(&sbi->s_mb_largest_free_orders_locks[old]); + if (old >= 0) { + struct xarray *xa = &sbi->s_mb_largest_free_orders[old]; + + if (!xa_empty(xa) && xa_load(xa, grp->bb_group)) + xa_erase(xa, grp->bb_group); } grp->bb_largest_free_order = new; if (test_opt2(sb, MB_OPTIMIZE_SCAN) && new >= 0 && grp->bb_free) { - write_lock(&sbi->s_mb_largest_free_orders_locks[new]); - list_add_tail(&grp->bb_largest_free_order_node, - &sbi->s_mb_largest_free_orders[new]); - write_unlock(&sbi->s_mb_largest_free_orders_locks[new]); + /* + * Cannot use __GFP_NOFAIL because we hold the group lock. + * Although allocation for insertion may fails, it's not fatal + * as we have linear traversal to fall back on. 
+ */ + int err = xa_insert(&sbi->s_mb_largest_free_orders[new], + grp->bb_group, grp, GFP_ATOMIC); + if (err) + mb_debug(sb, "insert group: %u to s_mb_largest_free_orders[%d] failed, err %d", + grp->bb_group, new, err); } } @@ -3273,6 +3309,7 @@ static int ext4_mb_seq_structs_summary_show(struct seq_file *seq, void *v) unsigned long position = ((unsigned long) v); struct ext4_group_info *grp; unsigned int count; + unsigned long idx; position--; if (position >= MB_NUM_ORDERS(sb)) { @@ -3281,11 +3318,8 @@ static int ext4_mb_seq_structs_summary_show(struct seq_file *seq, void *v) seq_puts(seq, "avg_fragment_size_lists:\n"); count = 0; - read_lock(&sbi->s_mb_avg_fragment_size_locks[position]); - list_for_each_entry(grp, &sbi->s_mb_avg_fragment_size[position], - bb_avg_fragment_size_node) + xa_for_each(&sbi->s_mb_avg_fragment_size[position], idx, grp) count++; - read_unlock(&sbi->s_mb_avg_fragment_size_locks[position]); seq_printf(seq, "\tlist_order_%u_groups: %u\n", (unsigned int)position, count); return 0; @@ -3297,11 +3331,8 @@ static int ext4_mb_seq_structs_summary_show(struct seq_file *seq, void *v) seq_puts(seq, "max_free_order_lists:\n"); } count = 0; - read_lock(&sbi->s_mb_largest_free_orders_locks[position]); - list_for_each_entry(grp, &sbi->s_mb_largest_free_orders[position], - bb_largest_free_order_node) + xa_for_each(&sbi->s_mb_largest_free_orders[position], idx, grp) count++; - read_unlock(&sbi->s_mb_largest_free_orders_locks[position]); seq_printf(seq, "\tlist_order_%u_groups: %u\n", (unsigned int)position, count); @@ -3421,8 +3452,6 @@ int ext4_mb_add_groupinfo(struct super_block *sb, ext4_group_t group, INIT_LIST_HEAD(&meta_group_info[i]->bb_prealloc_list); init_rwsem(&meta_group_info[i]->alloc_sem); meta_group_info[i]->bb_free_root = RB_ROOT; - INIT_LIST_HEAD(&meta_group_info[i]->bb_largest_free_order_node); - INIT_LIST_HEAD(&meta_group_info[i]->bb_avg_fragment_size_node); meta_group_info[i]->bb_largest_free_order = -1; /* uninit */ meta_group_info[i]->bb_avg_fragment_size_order = -1; /* uninit */ meta_group_info[i]->bb_group = group; @@ -3631,6 +3660,20 @@ static void ext4_discard_work(struct work_struct *work) ext4_mb_unload_buddy(&e4b); } +static inline void ext4_mb_avg_fragment_size_destroy(struct ext4_sb_info *sbi) +{ + for (int i = 0; i < MB_NUM_ORDERS(sbi->s_sb); i++) + xa_destroy(&sbi->s_mb_avg_fragment_size[i]); + kfree(sbi->s_mb_avg_fragment_size); +} + +static inline void ext4_mb_largest_free_orders_destroy(struct ext4_sb_info *sbi) +{ + for (int i = 0; i < MB_NUM_ORDERS(sbi->s_sb); i++) + xa_destroy(&sbi->s_mb_largest_free_orders[i]); + kfree(sbi->s_mb_largest_free_orders); +} + int ext4_mb_init(struct super_block *sb) { struct ext4_sb_info *sbi = EXT4_SB(sb); @@ -3676,41 +3719,24 @@ int ext4_mb_init(struct super_block *sb) } while (i < MB_NUM_ORDERS(sb)); sbi->s_mb_avg_fragment_size = - kmalloc_array(MB_NUM_ORDERS(sb), sizeof(struct list_head), + kmalloc_array(MB_NUM_ORDERS(sb), sizeof(struct xarray), GFP_KERNEL); if (!sbi->s_mb_avg_fragment_size) { ret = -ENOMEM; goto out; } - sbi->s_mb_avg_fragment_size_locks = - kmalloc_array(MB_NUM_ORDERS(sb), sizeof(rwlock_t), - GFP_KERNEL); - if (!sbi->s_mb_avg_fragment_size_locks) { - ret = -ENOMEM; - goto out; - } - for (i = 0; i < MB_NUM_ORDERS(sb); i++) { - INIT_LIST_HEAD(&sbi->s_mb_avg_fragment_size[i]); - rwlock_init(&sbi->s_mb_avg_fragment_size_locks[i]); - } + for (i = 0; i < MB_NUM_ORDERS(sb); i++) + xa_init(&sbi->s_mb_avg_fragment_size[i]); + sbi->s_mb_largest_free_orders = - kmalloc_array(MB_NUM_ORDERS(sb), 
sizeof(struct list_head), + kmalloc_array(MB_NUM_ORDERS(sb), sizeof(struct xarray), GFP_KERNEL); if (!sbi->s_mb_largest_free_orders) { ret = -ENOMEM; goto out; } - sbi->s_mb_largest_free_orders_locks = - kmalloc_array(MB_NUM_ORDERS(sb), sizeof(rwlock_t), - GFP_KERNEL); - if (!sbi->s_mb_largest_free_orders_locks) { - ret = -ENOMEM; - goto out; - } - for (i = 0; i < MB_NUM_ORDERS(sb); i++) { - INIT_LIST_HEAD(&sbi->s_mb_largest_free_orders[i]); - rwlock_init(&sbi->s_mb_largest_free_orders_locks[i]); - } + for (i = 0; i < MB_NUM_ORDERS(sb); i++) + xa_init(&sbi->s_mb_largest_free_orders[i]); spin_lock_init(&sbi->s_md_lock); atomic_set(&sbi->s_mb_free_pending, 0); @@ -3795,10 +3821,8 @@ int ext4_mb_init(struct super_block *sb) kfree(sbi->s_mb_last_groups); sbi->s_mb_last_groups = NULL; out: - kfree(sbi->s_mb_avg_fragment_size); - kfree(sbi->s_mb_avg_fragment_size_locks); - kfree(sbi->s_mb_largest_free_orders); - kfree(sbi->s_mb_largest_free_orders_locks); + ext4_mb_avg_fragment_size_destroy(sbi); + ext4_mb_largest_free_orders_destroy(sbi); kfree(sbi->s_mb_offsets); sbi->s_mb_offsets = NULL; kfree(sbi->s_mb_maxs); @@ -3865,10 +3889,8 @@ int ext4_mb_release(struct super_block *sb) kvfree(group_info); rcu_read_unlock(); } - kfree(sbi->s_mb_avg_fragment_size); - kfree(sbi->s_mb_avg_fragment_size_locks); - kfree(sbi->s_mb_largest_free_orders); - kfree(sbi->s_mb_largest_free_orders_locks); + ext4_mb_avg_fragment_size_destroy(sbi); + ext4_mb_largest_free_orders_destroy(sbi); kfree(sbi->s_mb_offsets); kfree(sbi->s_mb_maxs); iput(sbi->s_buddy_cache); -- 2.46.1
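The search order implemented by ext4_mb_find_good_group_xarray() can be pictured with the following standalone sketch, where a plain array of pointers stands in for the xarray the kernel walks with xa_for_each_range(), and a NULL slot means the group is not on that order's list (all names and sizes here are illustrative):

/* wrap-around search sketch; a pointer array stands in for the xarray */
#include <stdio.h>
#include <stddef.h>

#define NGROUPS 8

struct group_info { int id; };

/* find the first present group in [start, NGROUPS), then in [0, start) */
static struct group_info *find_good_group(struct group_info **xa, int start)
{
	int i;

	for (i = start; i < NGROUPS; i++)
		if (xa[i])
			return xa[i];
	for (i = 0; i < start; i++)
		if (xa[i])
			return xa[i];
	return NULL;
}

int main(void)
{
	struct group_info g2 = { 2 }, g6 = { 6 };
	struct group_info *order_list[NGROUPS] = { [2] = &g2, [6] = &g6 };
	struct group_info *grp = find_good_group(order_list, 4);

	/* starting at group 4, group 6 is preferred over group 2 */
	printf("picked group %d\n", grp ? grp->id : -1);
	return 0;
}

Because the index is the block group number, the walk is naturally ordered by group, which is what makes the wrap-around equivalent to one full ordered pass over the list.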

From: Kemeng Shi <shikemeng@huaweicloud.com> mainline inclusion from mainline-v6.8-rc3 commit 438a35e72d09c01da7ffb49570ecd11ca632270b category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/ICN2MB Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i... -------------------------------- Remove unused parameter ngroup in ext4_mb_choose_next_group_*(). Signed-off-by: Kemeng Shi <shikemeng@huaweicloud.com> Reviewed-by: Jan Kara <jack@suse.cz> Reviewed-by: Ojaswin Mujoo <ojaswin@linux.ibm.com> Link: https://lore.kernel.org/r/20240105092102.496631-3-shikemeng@huaweicloud.com Signed-off-by: Theodore Ts'o <tytso@mit.edu> Signed-off-by: Baokun Li <libaokun1@huawei.com> --- fs/ext4/mballoc.c | 12 ++++++------ 1 file changed, 6 insertions(+), 6 deletions(-) diff --git a/fs/ext4/mballoc.c b/fs/ext4/mballoc.c index c894aa014d2e..ba3b2b4aad6a 100644 --- a/fs/ext4/mballoc.c +++ b/fs/ext4/mballoc.c @@ -934,7 +934,7 @@ ext4_mb_find_good_group_largest_free_order(struct ext4_allocation_context *ac, * cr level needs an update. */ static void ext4_mb_choose_next_group_p2_aligned(struct ext4_allocation_context *ac, - enum criteria *new_cr, ext4_group_t *group, ext4_group_t ngroups) + enum criteria *new_cr, ext4_group_t *group) { struct ext4_sb_info *sbi = EXT4_SB(ac->ac_sb); struct ext4_group_info *grp; @@ -979,7 +979,7 @@ ext4_mb_find_good_group_avg_frag_xarray(struct ext4_allocation_context *ac, * order. Updates *new_cr if cr level needs an update. */ static void ext4_mb_choose_next_group_goal_fast(struct ext4_allocation_context *ac, - enum criteria *new_cr, ext4_group_t *group, ext4_group_t ngroups) + enum criteria *new_cr, ext4_group_t *group) { struct ext4_sb_info *sbi = EXT4_SB(ac->ac_sb); struct ext4_group_info *grp = NULL; @@ -1024,7 +1024,7 @@ static void ext4_mb_choose_next_group_goal_fast(struct ext4_allocation_context * * much and fall to CR_GOAL_LEN_SLOW in that case. */ static void ext4_mb_choose_next_group_best_avail(struct ext4_allocation_context *ac, - enum criteria *new_cr, ext4_group_t *group, ext4_group_t ngroups) + enum criteria *new_cr, ext4_group_t *group) { struct ext4_sb_info *sbi = EXT4_SB(ac->ac_sb); struct ext4_group_info *grp = NULL; @@ -1162,11 +1162,11 @@ static void ext4_mb_choose_next_group(struct ext4_allocation_context *ac, } if (*new_cr == CR_POWER2_ALIGNED) { - ext4_mb_choose_next_group_p2_aligned(ac, new_cr, group, ngroups); + ext4_mb_choose_next_group_p2_aligned(ac, new_cr, group); } else if (*new_cr == CR_GOAL_LEN_FAST) { - ext4_mb_choose_next_group_goal_fast(ac, new_cr, group, ngroups); + ext4_mb_choose_next_group_goal_fast(ac, new_cr, group); } else if (*new_cr == CR_BEST_AVAIL_LEN) { - ext4_mb_choose_next_group_best_avail(ac, new_cr, group, ngroups); + ext4_mb_choose_next_group_best_avail(ac, new_cr, group); } else { /* * TODO: For CR=2, we can arrange groups in an rb tree sorted by -- 2.46.1

From: Kemeng Shi <shikemeng@huaweicloud.com> mainline inclusion from mainline-v6.10-rc1 commit 2caffb6a277bb0f2a482a2eb824d012d5f45f4d0 category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/ICN2MB Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i... -------------------------------- Use correct criteria name instead stale integer number in comment Signed-off-by: Kemeng Shi <shikemeng@huaweicloud.com> Reviewed-by: Ojaswin Mujoo <ojaswin@linux.ibm.com> Reviewed-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20240424061904.987525-5-shikemeng@huaweicloud.com Signed-off-by: Theodore Ts'o <tytso@mit.edu> Signed-off-by: Baokun Li <libaokun1@huawei.com> --- fs/ext4/ext4.h | 9 ++++++--- fs/ext4/mballoc.c | 16 +++++++++------- fs/ext4/mballoc.h | 4 ++-- 3 files changed, 17 insertions(+), 12 deletions(-) diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h index 1e30aaf39e0e..02cd6b7b5dce 100644 --- a/fs/ext4/ext4.h +++ b/fs/ext4/ext4.h @@ -218,11 +218,14 @@ enum criteria { #define EXT4_MB_USE_RESERVED 0x2000 /* Do strict check for free blocks while retrying block allocation */ #define EXT4_MB_STRICT_CHECK 0x4000 -/* Large fragment size list lookup succeeded at least once for cr = 0 */ +/* Large fragment size list lookup succeeded at least once for + * CR_POWER2_ALIGNED */ #define EXT4_MB_CR_POWER2_ALIGNED_OPTIMIZED 0x8000 -/* Avg fragment size rb tree lookup succeeded at least once for cr = 1 */ +/* Avg fragment size rb tree lookup succeeded at least once for + * CR_GOAL_LEN_FAST */ #define EXT4_MB_CR_GOAL_LEN_FAST_OPTIMIZED 0x00010000 -/* Avg fragment size rb tree lookup succeeded at least once for cr = 1.5 */ +/* Avg fragment size rb tree lookup succeeded at least once for + * CR_BEST_AVAIL_LEN */ #define EXT4_MB_CR_BEST_AVAIL_LEN_OPTIMIZED 0x00020000 struct ext4_allocation_request { diff --git a/fs/ext4/mballoc.c b/fs/ext4/mballoc.c index ba3b2b4aad6a..65cdb2c0dfc7 100644 --- a/fs/ext4/mballoc.c +++ b/fs/ext4/mballoc.c @@ -1169,8 +1169,9 @@ static void ext4_mb_choose_next_group(struct ext4_allocation_context *ac, ext4_mb_choose_next_group_best_avail(ac, new_cr, group); } else { /* - * TODO: For CR=2, we can arrange groups in an rb tree sorted by - * bb_free. But until that happens, we should never come here. + * TODO: For CR_GOAL_LEN_SLOW, we can arrange groups in an + * rb tree sorted by bb_free. But until that happens, we should + * never come here. */ WARN_ON(1); } @@ -2745,7 +2746,7 @@ static int ext4_mb_good_group_nolock(struct ext4_allocation_context *ac, int ret; /* - * cr=CR_POWER2_ALIGNED/CR_GOAL_LEN_FAST is a very optimistic + * CR_POWER2_ALIGNED/CR_GOAL_LEN_FAST is a very optimistic * search to find large good chunks almost for free. If buddy * data is not ready, then this optimization makes no sense. But * we never skip the first block group in a flex_bg, since this @@ -3526,10 +3527,11 @@ static int ext4_mb_init_backend(struct super_block *sb) } if (sbi->s_mb_prefetch > ext4_get_groups_count(sb)) sbi->s_mb_prefetch = ext4_get_groups_count(sb); - /* now many real IOs to prefetch within a single allocation at cr=0 - * given cr=0 is an CPU-related optimization we shouldn't try to - * load too many groups, at some point we should start to use what - * we've got in memory. + /* + * now many real IOs to prefetch within a single allocation at + * CR_POWER2_ALIGNED. Given CR_POWER2_ALIGNED is an CPU-related + * optimization we shouldn't try to load too many groups, at some point + * we should start to use what we've got in memory. 
* with an average random access time 5ms, it'd take a second to get * 200 groups (* N with flex_bg), so let's make this limit 4 */ diff --git a/fs/ext4/mballoc.h b/fs/ext4/mballoc.h index 5a0cbd512444..336cafd4187d 100644 --- a/fs/ext4/mballoc.h +++ b/fs/ext4/mballoc.h @@ -187,8 +187,8 @@ struct ext4_allocation_context { struct ext4_free_extent ac_f_ex; /* - * goal len can change in CR1.5, so save the original len. This is - * used while adjusting the PA window and for accounting. + * goal len can change in CR_BEST_AVAIL_LEN, so save the original len. + * This is used while adjusting the PA window and for accounting. */ ext4_grpblk_t ac_orig_goal_len; -- 2.46.1

From: Kemeng Shi <shikemeng@huaweicloud.com> mainline inclusion from mainline-v6.10-rc1 commit da5704eef7037a5bc84a56519729d93d10a0e0a0 category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/ICN2MB Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i... -------------------------------- Open coding repeated check in next_linear_group. Signed-off-by: Kemeng Shi <shikemeng@huaweicloud.com> Reviewed-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20240424061904.987525-6-shikemeng@huaweicloud.com Signed-off-by: Theodore Ts'o <tytso@mit.edu> Signed-off-by: Baokun Li <libaokun1@huawei.com> --- fs/ext4/mballoc.c | 31 +++++++++++++++---------------- 1 file changed, 15 insertions(+), 16 deletions(-) diff --git a/fs/ext4/mballoc.c b/fs/ext4/mballoc.c index 65cdb2c0dfc7..66191ba5d53c 100644 --- a/fs/ext4/mballoc.c +++ b/fs/ext4/mballoc.c @@ -1114,23 +1114,11 @@ static inline int should_optimize_scan(struct ext4_allocation_context *ac) } /* - * Return next linear group for allocation. If linear traversal should not be - * performed, this function just returns the same group + * Return next linear group for allocation. */ static ext4_group_t -next_linear_group(struct ext4_allocation_context *ac, ext4_group_t group, - ext4_group_t ngroups) +next_linear_group(ext4_group_t group, ext4_group_t ngroups) { - if (!should_optimize_scan(ac)) - goto inc_and_return; - - if (ac->ac_groups_linear_remaining) { - ac->ac_groups_linear_remaining--; - goto inc_and_return; - } - - return group; -inc_and_return: /* * Artificially restricted ngroups for non-extent * files makes group > ngroups possible on first loop. @@ -1156,8 +1144,19 @@ static void ext4_mb_choose_next_group(struct ext4_allocation_context *ac, { *new_cr = ac->ac_criteria; - if (!should_optimize_scan(ac) || ac->ac_groups_linear_remaining) { - *group = next_linear_group(ac, *group, ngroups); + if (!should_optimize_scan(ac)) { + *group = next_linear_group(*group, ngroups); + return; + } + + /* + * Optimized scanning can return non adjacent groups which can cause + * seek overhead for rotational disks. So try few linear groups before + * trying optimized scan. + */ + if (ac->ac_groups_linear_remaining) { + *group = next_linear_group(*group, ngroups); + ac->ac_groups_linear_remaining--; return; } -- 2.46.1

mainline inclusion from mainline-v6.17-rc1 commit 6347558764911f88acac06ab996e162f0c8a212d category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/ICN2MB Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i... -------------------------------- This commit converts the `choose group` logic to `scan group` using previously prepared helper functions. This allows us to leverage xarrays for ordered non-linear traversal, thereby mitigating the "bouncing" issue inherent in the `choose group` mechanism. This also decouples linear and non-linear traversals, leading to cleaner and more readable code. Key changes: * ext4_mb_choose_next_group() is refactored to ext4_mb_scan_groups(). * Replaced ext4_mb_good_group() with ext4_mb_scan_group() in non-linear traversals, and related functions now return error codes instead of group info. * Added ext4_mb_scan_groups_linear() for performing linear scans starting from a specific group for a set number of times. * Linear scans now execute up to sbi->s_mb_max_linear_groups times, so ac_groups_linear_remaining is removed as it's no longer used. * ac->ac_criteria is now used directly instead of passing cr around. Also, ac->ac_criteria is incremented directly after groups scan fails for the corresponding criteria. * Since we're now directly scanning groups instead of finding a good group then scanning, the following variables and flags are no longer needed, s_bal_cX_groups_considered is sufficient. s_bal_p2_aligned_bad_suggestions s_bal_goal_fast_bad_suggestions s_bal_best_avail_bad_suggestions EXT4_MB_CR_POWER2_ALIGNED_OPTIMIZED EXT4_MB_CR_GOAL_LEN_FAST_OPTIMIZED EXT4_MB_CR_BEST_AVAIL_LEN_OPTIMIZED Signed-off-by: Baokun Li <libaokun1@huawei.com> Reviewed-by: Zhang Yi <yi.zhang@huawei.com> Link: https://patch.msgid.link/20250714130327.1830534-17-libaokun1@huawei.com Signed-off-by: Theodore Ts'o <tytso@mit.edu> Signed-off-by: Baokun Li <libaokun1@huawei.com> --- fs/ext4/ext4.h | 12 -- fs/ext4/mballoc.c | 292 +++++++++++++++++++++------------------------- fs/ext4/mballoc.h | 1 - 3 files changed, 131 insertions(+), 174 deletions(-) diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h index 02cd6b7b5dce..a1fc8a94d803 100644 --- a/fs/ext4/ext4.h +++ b/fs/ext4/ext4.h @@ -218,15 +218,6 @@ enum criteria { #define EXT4_MB_USE_RESERVED 0x2000 /* Do strict check for free blocks while retrying block allocation */ #define EXT4_MB_STRICT_CHECK 0x4000 -/* Large fragment size list lookup succeeded at least once for - * CR_POWER2_ALIGNED */ -#define EXT4_MB_CR_POWER2_ALIGNED_OPTIMIZED 0x8000 -/* Avg fragment size rb tree lookup succeeded at least once for - * CR_GOAL_LEN_FAST */ -#define EXT4_MB_CR_GOAL_LEN_FAST_OPTIMIZED 0x00010000 -/* Avg fragment size rb tree lookup succeeded at least once for - * CR_BEST_AVAIL_LEN */ -#define EXT4_MB_CR_BEST_AVAIL_LEN_OPTIMIZED 0x00020000 struct ext4_allocation_request { /* target inode for block we're allocating */ @@ -1626,9 +1617,6 @@ struct ext4_sb_info { atomic_t s_bal_len_goals; /* len goal hits */ atomic_t s_bal_breaks; /* too long searches */ atomic_t s_bal_2orders; /* 2^order hits */ - atomic_t s_bal_p2_aligned_bad_suggestions; - atomic_t s_bal_goal_fast_bad_suggestions; - atomic_t s_bal_best_avail_bad_suggestions; atomic64_t s_bal_cX_groups_considered[EXT4_MB_NUM_CRS]; atomic64_t s_bal_cX_hits[EXT4_MB_NUM_CRS]; atomic64_t s_bal_cX_failed[EXT4_MB_NUM_CRS]; /* cX loop didn't find blocks */ diff --git a/fs/ext4/mballoc.c b/fs/ext4/mballoc.c index 66191ba5d53c..2ab1eb9d0c77 100644 --- 
a/fs/ext4/mballoc.c +++ b/fs/ext4/mballoc.c @@ -424,8 +424,8 @@ static void ext4_mb_generate_from_pa(struct super_block *sb, void *bitmap, ext4_group_t group); static void ext4_mb_new_preallocation(struct ext4_allocation_context *ac); -static bool ext4_mb_good_group(struct ext4_allocation_context *ac, - ext4_group_t group, enum criteria cr); +static int ext4_mb_scan_group(struct ext4_allocation_context *ac, + ext4_group_t group); static void ext4_mb_put_pa(struct ext4_allocation_context *ac, struct super_block *sb, struct ext4_prealloc_space *pa); @@ -878,9 +878,8 @@ mb_update_avg_fragment_size(struct super_block *sb, struct ext4_group_info *grp) } } -static struct ext4_group_info * -ext4_mb_find_good_group_xarray(struct ext4_allocation_context *ac, - struct xarray *xa, ext4_group_t start) +static int ext4_mb_scan_groups_xarray(struct ext4_allocation_context *ac, + struct xarray *xa, ext4_group_t start) { struct super_block *sb = ac->ac_sb; struct ext4_sb_info *sbi = EXT4_SB(sb); @@ -891,16 +890,18 @@ ext4_mb_find_good_group_xarray(struct ext4_allocation_context *ac, struct ext4_group_info *grp; if (WARN_ON_ONCE(start >= end)) - return NULL; + return 0; wrap_around: xa_for_each_range(xa, group, grp, start, end - 1) { + int err; + if (sbi->s_mb_stats) atomic64_inc(&sbi->s_bal_cX_groups_considered[cr]); - if (!spin_is_locked(ext4_group_lock_ptr(sb, group)) && - likely(ext4_mb_good_group(ac, group, cr))) - return grp; + err = ext4_mb_scan_group(ac, grp->bb_group); + if (err || ac->ac_status != AC_STATUS_CONTINUE) + return err; cond_resched(); } @@ -911,95 +912,82 @@ ext4_mb_find_good_group_xarray(struct ext4_allocation_context *ac, goto wrap_around; } - return NULL; + return 0; } /* * Find a suitable group of given order from the largest free orders xarray. */ -static struct ext4_group_info * -ext4_mb_find_good_group_largest_free_order(struct ext4_allocation_context *ac, - int order, ext4_group_t start) +static int +ext4_mb_scan_groups_largest_free_order(struct ext4_allocation_context *ac, + int order, ext4_group_t start) { struct xarray *xa = &EXT4_SB(ac->ac_sb)->s_mb_largest_free_orders[order]; if (xa_empty(xa)) - return NULL; + return 0; - return ext4_mb_find_good_group_xarray(ac, xa, start); + return ext4_mb_scan_groups_xarray(ac, xa, start); } /* * Choose next group by traversing largest_free_order lists. Updates *new_cr if * cr level needs an update. 
*/ -static void ext4_mb_choose_next_group_p2_aligned(struct ext4_allocation_context *ac, - enum criteria *new_cr, ext4_group_t *group) +static int ext4_mb_scan_groups_p2_aligned(struct ext4_allocation_context *ac, + ext4_group_t group) { struct ext4_sb_info *sbi = EXT4_SB(ac->ac_sb); - struct ext4_group_info *grp; int i; - - if (ac->ac_status == AC_STATUS_FOUND) - return; - - if (unlikely(sbi->s_mb_stats && ac->ac_flags & EXT4_MB_CR_POWER2_ALIGNED_OPTIMIZED)) - atomic_inc(&sbi->s_bal_p2_aligned_bad_suggestions); + int ret = 0; for (i = ac->ac_2order; i < MB_NUM_ORDERS(ac->ac_sb); i++) { - grp = ext4_mb_find_good_group_largest_free_order(ac, i, *group); - if (grp) { - *group = grp->bb_group; - ac->ac_flags |= EXT4_MB_CR_POWER2_ALIGNED_OPTIMIZED; - return; - } + ret = ext4_mb_scan_groups_largest_free_order(ac, i, group); + if (ret || ac->ac_status != AC_STATUS_CONTINUE) + return ret; } + if (sbi->s_mb_stats) + atomic64_inc(&sbi->s_bal_cX_failed[ac->ac_criteria]); + /* Increment cr and search again if no group is found */ - *new_cr = CR_GOAL_LEN_FAST; + ac->ac_criteria = CR_GOAL_LEN_FAST; + return ret; } /* * Find a suitable group of given order from the average fragments xarray. */ -static struct ext4_group_info * -ext4_mb_find_good_group_avg_frag_xarray(struct ext4_allocation_context *ac, - int order, ext4_group_t start) +static int ext4_mb_scan_groups_avg_frag_order(struct ext4_allocation_context *ac, + int order, ext4_group_t start) { struct xarray *xa = &EXT4_SB(ac->ac_sb)->s_mb_avg_fragment_size[order]; if (xa_empty(xa)) - return NULL; + return 0; - return ext4_mb_find_good_group_xarray(ac, xa, start); + return ext4_mb_scan_groups_xarray(ac, xa, start); } /* * Choose next group by traversing average fragment size list of suitable * order. Updates *new_cr if cr level needs an update. */ -static void ext4_mb_choose_next_group_goal_fast(struct ext4_allocation_context *ac, - enum criteria *new_cr, ext4_group_t *group) +static int ext4_mb_scan_groups_goal_fast(struct ext4_allocation_context *ac, + ext4_group_t group) { struct ext4_sb_info *sbi = EXT4_SB(ac->ac_sb); - struct ext4_group_info *grp = NULL; - int i; - - if (unlikely(ac->ac_flags & EXT4_MB_CR_GOAL_LEN_FAST_OPTIMIZED)) { - if (sbi->s_mb_stats) - atomic_inc(&sbi->s_bal_goal_fast_bad_suggestions); - } + int i, ret = 0; - for (i = mb_avg_fragment_size_order(ac->ac_sb, ac->ac_g_ex.fe_len); - i < MB_NUM_ORDERS(ac->ac_sb); i++) { - grp = ext4_mb_find_good_group_avg_frag_xarray(ac, i, *group); - if (grp) { - *group = grp->bb_group; - ac->ac_flags |= EXT4_MB_CR_GOAL_LEN_FAST_OPTIMIZED; - return; - } + i = mb_avg_fragment_size_order(ac->ac_sb, ac->ac_g_ex.fe_len); + for (; i < MB_NUM_ORDERS(ac->ac_sb); i++) { + ret = ext4_mb_scan_groups_avg_frag_order(ac, i, group); + if (ret || ac->ac_status != AC_STATUS_CONTINUE) + return ret; } + if (sbi->s_mb_stats) + atomic64_inc(&sbi->s_bal_cX_failed[ac->ac_criteria]); /* * CR_BEST_AVAIL_LEN works based on the concept that we have * a larger normalized goal len request which can be trimmed to @@ -1009,9 +997,11 @@ static void ext4_mb_choose_next_group_goal_fast(struct ext4_allocation_context * * See function ext4_mb_normalize_request() (EXT4_MB_HINT_DATA). */ if (ac->ac_flags & EXT4_MB_HINT_DATA) - *new_cr = CR_BEST_AVAIL_LEN; + ac->ac_criteria = CR_BEST_AVAIL_LEN; else - *new_cr = CR_GOAL_LEN_SLOW; + ac->ac_criteria = CR_GOAL_LEN_SLOW; + + return ret; } /* @@ -1023,19 +1013,14 @@ static void ext4_mb_choose_next_group_goal_fast(struct ext4_allocation_context * * preallocations. 
However, we make sure that we don't trim the request too * much and fall to CR_GOAL_LEN_SLOW in that case. */ -static void ext4_mb_choose_next_group_best_avail(struct ext4_allocation_context *ac, - enum criteria *new_cr, ext4_group_t *group) +static int ext4_mb_scan_groups_best_avail(struct ext4_allocation_context *ac, + ext4_group_t group) { + int ret = 0; struct ext4_sb_info *sbi = EXT4_SB(ac->ac_sb); - struct ext4_group_info *grp = NULL; int i, order, min_order; unsigned long num_stripe_clusters = 0; - if (unlikely(ac->ac_flags & EXT4_MB_CR_BEST_AVAIL_LEN_OPTIMIZED)) { - if (sbi->s_mb_stats) - atomic_inc(&sbi->s_bal_best_avail_bad_suggestions); - } - /* * mb_avg_fragment_size_order() returns order in a way that makes * retrieving back the length using (1 << order) inaccurate. Hence, use @@ -1088,18 +1073,18 @@ static void ext4_mb_choose_next_group_best_avail(struct ext4_allocation_context frag_order = mb_avg_fragment_size_order(ac->ac_sb, ac->ac_g_ex.fe_len); - grp = ext4_mb_find_good_group_avg_frag_xarray(ac, frag_order, - *group); - if (grp) { - *group = grp->bb_group; - ac->ac_flags |= EXT4_MB_CR_BEST_AVAIL_LEN_OPTIMIZED; - return; - } + ret = ext4_mb_scan_groups_avg_frag_order(ac, frag_order, group); + if (ret || ac->ac_status != AC_STATUS_CONTINUE) + return ret; } /* Reset goal length to original goal length before falling into CR_GOAL_LEN_SLOW */ ac->ac_g_ex.fe_len = ac->ac_orig_goal_len; - *new_cr = CR_GOAL_LEN_SLOW; + if (sbi->s_mb_stats) + atomic64_inc(&sbi->s_bal_cX_failed[ac->ac_criteria]); + ac->ac_criteria = CR_GOAL_LEN_SLOW; + + return ret; } static inline int should_optimize_scan(struct ext4_allocation_context *ac) @@ -1114,59 +1099,82 @@ static inline int should_optimize_scan(struct ext4_allocation_context *ac) } /* - * Return next linear group for allocation. + * next linear group for allocation. */ -static ext4_group_t -next_linear_group(ext4_group_t group, ext4_group_t ngroups) +static void next_linear_group(ext4_group_t *group, ext4_group_t ngroups) { /* * Artificially restricted ngroups for non-extent * files makes group > ngroups possible on first loop. */ - return group + 1 >= ngroups ? 0 : group + 1; + *group = *group + 1 >= ngroups ? 0 : *group + 1; } -/* - * ext4_mb_choose_next_group: choose next group for allocation. - * - * @ac Allocation Context - * @new_cr This is an output parameter. If the there is no good group - * available at current CR level, this field is updated to indicate - * the new cr level that should be used. - * @group This is an input / output parameter. As an input it indicates the - * next group that the allocator intends to use for allocation. As - * output, this field indicates the next group that should be used as - * determined by the optimization functions. 
- * @ngroups Total number of groups
- */
-static void ext4_mb_choose_next_group(struct ext4_allocation_context *ac,
-		enum criteria *new_cr, ext4_group_t *group, ext4_group_t ngroups)
+static int ext4_mb_scan_groups_linear(struct ext4_allocation_context *ac,
+		ext4_group_t ngroups, ext4_group_t *start, ext4_group_t count)
 {
-	*new_cr = ac->ac_criteria;
+	int ret, i;
+	enum criteria cr = ac->ac_criteria;
+	struct super_block *sb = ac->ac_sb;
+	struct ext4_sb_info *sbi = EXT4_SB(sb);
+	ext4_group_t group = *start;
 
-	if (!should_optimize_scan(ac)) {
-		*group = next_linear_group(*group, ngroups);
-		return;
+	for (i = 0; i < count; i++, next_linear_group(&group, ngroups)) {
+		ret = ext4_mb_scan_group(ac, group);
+		if (ret || ac->ac_status != AC_STATUS_CONTINUE)
+			return ret;
+		cond_resched();
 	}
+	*start = group;
 
+	if (count == ngroups)
+		ac->ac_criteria++;
+
+	/* Processed all groups and haven't found blocks */
+	if (sbi->s_mb_stats && i == ngroups)
+		atomic64_inc(&sbi->s_bal_cX_failed[cr]);
+
+	return 0;
+}
+
+static int ext4_mb_scan_groups(struct ext4_allocation_context *ac)
+{
+	int ret = 0;
+	ext4_group_t start;
+	struct ext4_sb_info *sbi = EXT4_SB(ac->ac_sb);
+	ext4_group_t ngroups = ext4_get_groups_count(ac->ac_sb);
+
+	/* non-extent files are limited to low blocks/groups */
+	if (!(ext4_test_inode_flag(ac->ac_inode, EXT4_INODE_EXTENTS)))
+		ngroups = sbi->s_blockfile_groups;
+
+	/* searching for the right group start from the goal value specified */
+	start = ac->ac_g_ex.fe_group;
+	ac->ac_prefetch_grp = start;
+	ac->ac_prefetch_nr = 0;
+
+	if (!should_optimize_scan(ac))
+		return ext4_mb_scan_groups_linear(ac, ngroups, &start, ngroups);
+
 	/*
 	 * Optimized scanning can return non adjacent groups which can cause
 	 * seek overhead for rotational disks. So try few linear groups before
 	 * trying optimized scan.
 	 */
-	if (ac->ac_groups_linear_remaining) {
-		*group = next_linear_group(*group, ngroups);
-		ac->ac_groups_linear_remaining--;
-		return;
-	}
+	if (sbi->s_mb_max_linear_groups)
+		ret = ext4_mb_scan_groups_linear(ac, ngroups, &start,
+						 sbi->s_mb_max_linear_groups);
+	if (ret || ac->ac_status != AC_STATUS_CONTINUE)
+		return ret;
 
-	if (*new_cr == CR_POWER2_ALIGNED) {
-		ext4_mb_choose_next_group_p2_aligned(ac, new_cr, group);
-	} else if (*new_cr == CR_GOAL_LEN_FAST) {
-		ext4_mb_choose_next_group_goal_fast(ac, new_cr, group);
-	} else if (*new_cr == CR_BEST_AVAIL_LEN) {
-		ext4_mb_choose_next_group_best_avail(ac, new_cr, group);
-	} else {
+	switch (ac->ac_criteria) {
+	case CR_POWER2_ALIGNED:
+		return ext4_mb_scan_groups_p2_aligned(ac, start);
+	case CR_GOAL_LEN_FAST:
+		return ext4_mb_scan_groups_goal_fast(ac, start);
+	case CR_BEST_AVAIL_LEN:
+		return ext4_mb_scan_groups_best_avail(ac, start);
+	default:
 		/*
 		 * TODO: For CR_GOAL_LEN_SLOW, we can arrange groups in an
 		 * rb tree sorted by bb_free. But until that happens, we should
@@ -1174,6 +1182,8 @@ static void ext4_mb_choose_next_group(struct ext4_allocation_context *ac,
 		 */
 		WARN_ON(1);
 	}
+
+	return 0;
 }
 
 /*
@@ -2929,20 +2939,11 @@ static int ext4_mb_scan_group(struct ext4_allocation_context *ac,
 static noinline_for_stack int
 ext4_mb_regular_allocator(struct ext4_allocation_context *ac)
 {
-	ext4_group_t ngroups, group, i;
-	enum criteria new_cr, cr = CR_GOAL_LEN_FAST;
+	ext4_group_t i;
 	int err = 0;
-	struct ext4_sb_info *sbi;
-	struct super_block *sb;
+	struct super_block *sb = ac->ac_sb;
+	struct ext4_sb_info *sbi = EXT4_SB(sb);
 	struct ext4_buddy e4b;
-	int lost;
-
-	sb = ac->ac_sb;
-	sbi = EXT4_SB(sb);
-	ngroups = ext4_get_groups_count(sb);
-	/* non-extent files are limited to low blocks/groups */
-	if (!(ext4_test_inode_flag(ac->ac_inode, EXT4_INODE_EXTENTS)))
-		ngroups = sbi->s_blockfile_groups;
 
 	BUG_ON(ac->ac_status == AC_STATUS_FOUND);
 
@@ -2988,48 +2989,21 @@ ext4_mb_regular_allocator(struct ext4_allocation_context *ac)
 	 * start with CR_GOAL_LEN_FAST, unless it is power of 2
 	 * aligned, in which case let's do that faster approach first.
 	 */
+	ac->ac_criteria = CR_GOAL_LEN_FAST;
 	if (ac->ac_2order)
-		cr = CR_POWER2_ALIGNED;
+		ac->ac_criteria = CR_POWER2_ALIGNED;
 
 	ac->ac_e4b = &e4b;
 	ac->ac_prefetch_ios = 0;
 	ac->ac_first_err = 0;
 repeat:
-	for (; cr < EXT4_MB_NUM_CRS && ac->ac_status == AC_STATUS_CONTINUE; cr++) {
-		ac->ac_criteria = cr;
-		/*
-		 * searching for the right group start
-		 * from the goal value specified
-		 */
-		group = ac->ac_g_ex.fe_group;
-		ac->ac_groups_linear_remaining = sbi->s_mb_max_linear_groups;
-		ac->ac_prefetch_grp = group;
-		ac->ac_prefetch_nr = 0;
-
-		for (i = 0, new_cr = cr; i < ngroups; i++,
-			     ext4_mb_choose_next_group(ac, &new_cr, &group, ngroups)) {
-
-			cond_resched();
-			if (new_cr != cr) {
-				cr = new_cr;
-				goto repeat;
-			}
-
-			err = ext4_mb_scan_group(ac, group);
-			if (err)
-				goto out;
-
-			if (ac->ac_status != AC_STATUS_CONTINUE)
-				break;
-		}
-		/* Processed all groups and haven't found blocks */
-		if (sbi->s_mb_stats && i == ngroups)
-			atomic64_inc(&sbi->s_bal_cX_failed[cr]);
+	while (ac->ac_criteria < EXT4_MB_NUM_CRS) {
+		err = ext4_mb_scan_groups(ac);
+		if (err)
+			goto out;
 
-		if (i == ngroups && ac->ac_criteria == CR_BEST_AVAIL_LEN)
-			/* Reset goal length to original goal length before
-			 * falling into CR_GOAL_LEN_SLOW */
-			ac->ac_g_ex.fe_len = ac->ac_orig_goal_len;
+		if (ac->ac_status != AC_STATUS_CONTINUE)
+			break;
 	}
 
 	if (ac->ac_b_ex.fe_len > 0 && ac->ac_status != AC_STATUS_FOUND &&
@@ -3040,6 +3014,8 @@ ext4_mb_regular_allocator(struct ext4_allocation_context *ac)
 		 */
 		ext4_mb_try_best_found(ac, &e4b);
 		if (ac->ac_status != AC_STATUS_FOUND) {
+			int lost;
+
 			/*
 			 * Someone more lucky has already allocated it.
 			 * The only thing we can do is just take first
@@ -3055,7 +3031,7 @@ ext4_mb_regular_allocator(struct ext4_allocation_context *ac)
 			ac->ac_b_ex.fe_len = 0;
 			ac->ac_status = AC_STATUS_CONTINUE;
 			ac->ac_flags |= EXT4_MB_HINT_FIRST;
-			cr = CR_ANY_FREE;
+			ac->ac_criteria = CR_ANY_FREE;
 			goto repeat;
 		}
 	}
@@ -3072,7 +3048,7 @@ ext4_mb_regular_allocator(struct ext4_allocation_context *ac)
 	mb_debug(sb, "Best len %d, origin len %d, ac_status %u, ac_flags 0x%x, cr %d ret %d\n",
 		 ac->ac_b_ex.fe_len, ac->ac_o_ex.fe_len, ac->ac_status,
-		 ac->ac_flags, cr, err);
+		 ac->ac_flags, ac->ac_criteria, err);
 
 	if (ac->ac_prefetch_nr)
 		ext4_mb_prefetch_fini(sb, ac->ac_prefetch_grp, ac->ac_prefetch_nr);
@@ -3201,8 +3177,6 @@ int ext4_seq_mb_stats_show(struct seq_file *seq, void *offset)
 		   atomic_read(&sbi->s_bal_cX_ex_scanned[CR_POWER2_ALIGNED]));
 	seq_printf(seq, "\t\tuseless_loops: %llu\n",
 		   atomic64_read(&sbi->s_bal_cX_failed[CR_POWER2_ALIGNED]));
-	seq_printf(seq, "\t\tbad_suggestions: %u\n",
-		   atomic_read(&sbi->s_bal_p2_aligned_bad_suggestions));
 
 	/* CR_GOAL_LEN_FAST stats */
 	seq_puts(seq, "\tcr_goal_fast_stats:\n");
@@ -3215,8 +3189,6 @@ int ext4_seq_mb_stats_show(struct seq_file *seq, void *offset)
 		   atomic_read(&sbi->s_bal_cX_ex_scanned[CR_GOAL_LEN_FAST]));
 	seq_printf(seq, "\t\tuseless_loops: %llu\n",
 		   atomic64_read(&sbi->s_bal_cX_failed[CR_GOAL_LEN_FAST]));
-	seq_printf(seq, "\t\tbad_suggestions: %u\n",
-		   atomic_read(&sbi->s_bal_goal_fast_bad_suggestions));
 
 	/* CR_BEST_AVAIL_LEN stats */
 	seq_puts(seq, "\tcr_best_avail_stats:\n");
@@ -3230,8 +3202,6 @@ int ext4_seq_mb_stats_show(struct seq_file *seq, void *offset)
 		   atomic_read(&sbi->s_bal_cX_ex_scanned[CR_BEST_AVAIL_LEN]));
 	seq_printf(seq, "\t\tuseless_loops: %llu\n",
 		   atomic64_read(&sbi->s_bal_cX_failed[CR_BEST_AVAIL_LEN]));
-	seq_printf(seq, "\t\tbad_suggestions: %u\n",
-		   atomic_read(&sbi->s_bal_best_avail_bad_suggestions));
 
 	/* CR_GOAL_LEN_SLOW stats */
 	seq_puts(seq, "\tcr_goal_slow_stats:\n");
diff --git a/fs/ext4/mballoc.h b/fs/ext4/mballoc.h
index 336cafd4187d..f7aec43503f8 100644
--- a/fs/ext4/mballoc.h
+++ b/fs/ext4/mballoc.h
@@ -199,7 +199,6 @@ struct ext4_allocation_context {
 	int ac_first_err;
 
 	__u32 ac_flags;		/* allocation hints */
-	__u32 ac_groups_linear_remaining;
 	__u16 ac_groups_scanned;
 	__u16 ac_found;
 	__u16 ac_cX_found[EXT4_MB_NUM_CRS];
-- 
2.46.1
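The hunks above invert the old control flow: instead of ext4_mb_choose_next_group() steering one big loop inside ext4_mb_regular_allocator(), the allocator now only walks the criteria and delegates each full pass over the groups to ext4_mb_scan_groups(). Below is a minimal stand-alone user-space sketch (not kernel code) of that loop shape; the simplified types, the scan_group() behaviour, and the "group 5 becomes usable at criteria 2" setup are invented purely for illustration.

/*
 * Stand-alone sketch of the refactored loop shape: the outer loop only
 * iterates criteria, and a whole failed pass bumps the criteria inside
 * scan_groups(). All names here are simplified stand-ins, not ext4 APIs.
 */
#include <stdio.h>

#define NUM_CRS		4
#define NGROUPS		8

enum status { CONTINUE, FOUND };

struct alloc_ctx {
	int criteria;
	enum status status;
};

/* Pretend only group 5 has suitable free space, and only at criteria >= 2. */
static int scan_group(struct alloc_ctx *ac, int group)
{
	if (ac->criteria >= 2 && group == 5)
		ac->status = FOUND;
	return 0;		/* 0 == no error */
}

/* Walk every group once for the current criteria; on failure move on. */
static int scan_groups(struct alloc_ctx *ac)
{
	for (int i = 0; i < NGROUPS; i++) {
		int err = scan_group(ac, i);

		if (err || ac->status != CONTINUE)
			return err;
	}
	ac->criteria++;		/* whole pass failed: try the next criteria */
	return 0;
}

int main(void)
{
	struct alloc_ctx ac = { .criteria = 0, .status = CONTINUE };

	while (ac.criteria < NUM_CRS) {
		if (scan_groups(&ac))
			break;
		if (ac.status != CONTINUE)
			break;
	}
	printf("status=%s at criteria=%d\n",
	       ac.status == FOUND ? "FOUND" : "CONTINUE", ac.criteria);
	return 0;
}

Running it prints "status=FOUND at criteria=2", mirroring how the real allocator only advances ac_criteria after a complete pass over the groups comes up empty.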

mainline inclusion
from mainline-v6.17-rc1
commit a3ce570a5d6a70df616ae9a78635a188e6b5fd2f
category: feature
bugzilla: https://gitee.com/openeuler/kernel/issues/ICN2MB
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i...

--------------------------------

Although we now perform ordered traversal within an xarray, this is
currently limited to a single xarray. However, we have multiple such
xarrays, which prevents us from guaranteeing a linear-like traversal
where all groups on the right are visited before all groups on the left.

For example, suppose we have 128 block groups, with a target group of 64,
a target length corresponding to an order of 1, and available free groups
of 16 (order 1) and 65 (order 8):

For linear traversal, when no suitable free block is found in group 64,
it will search in the next block group until group 127, then start
searching from 0 up to block group 63. This ensures continuous forward
traversal, which is consistent with the unidirectional rotation behavior
of HDD platters. Additionally, block group lock contention while freeing
blocks is unavoidable. The goal increasing from 0 to 64 indicates that
previously scanned groups (which had no suitable free space and are
likely to free blocks later) and skipped groups (which are currently in
use) have newly freed some used blocks. If we allocate blocks in these
groups, the probability of competing with other processes increases.

For non-linear traversal, we first traverse all groups in order_1. If
only group 16 has free space in this list, we first traverse [64, 128),
then traverse [0, 64) to find the available group 16, and then allocate
blocks in group 16. Therefore, it cannot guarantee continuous traversal
in one direction, thus increasing the probability of contention.

So refactor ext4_mb_scan_groups_xarray() to ext4_mb_scan_groups_xa_range()
to only traverse a fixed range of groups, and move the logic for handling
wrap around to the caller. The caller first iterates through all xarrays
in the range [start, ngroups) and then through the range [0, start). This
approach simulates a linear scan, which reduces contention between
freeing blocks and allocating blocks.

Assume we have the following groups, where "|" denotes the xarray
traversal start position:

order_1_groups: AB | CD
order_2_groups: EF | GH

Traversal order:
Before: C > D > A > B > G > H > E > F
After:  C > D > G > H > A > B > E > F

Performance test data follows:

|CPU: Kunpeng 920   |          P80           |            P1           |
|Memory: 512GB      |------------------------|-------------------------|
|960GB SSD (0.5GB/s)| base  |    patched     |  base  |    patched     |
|-------------------|-------|----------------|--------|----------------|
|mb_optimize_scan=0 | 19555 | 20049 (+2.5%)  | 315636 | 316724 (+0.3%) |
|mb_optimize_scan=1 | 15496 | 19342 (+24.8%) | 323569 | 328324 (+1.4%) |

|CPU: AMD 9654 * 2  |          P96           |            P1           |
|Memory: 1536GB     |------------------------|-------------------------|
|960GB SSD (1GB/s)  | base  |    patched     |  base  |    patched     |
|-------------------|-------|----------------|--------|----------------|
|mb_optimize_scan=0 | 53192 | 52125 (-2.0%)  | 212678 | 215136 (+1.1%) |
|mb_optimize_scan=1 | 37636 | 50331 (+33.7%) | 214189 | 209431 (-2.2%) |

Signed-off-by: Baokun Li <libaokun1@huawei.com>
Reviewed-by: Zhang Yi <yi.zhang@huawei.com>
Link: https://patch.msgid.link/20250714130327.1830534-18-libaokun1@huawei.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Signed-off-by: Baokun Li <libaokun1@huawei.com>
---
 fs/ext4/mballoc.c | 68 ++++++++++++++++++++++++++++++++---------------
 1 file changed, 47 insertions(+), 21 deletions(-)

diff --git a/fs/ext4/mballoc.c b/fs/ext4/mballoc.c
index 2ab1eb9d0c77..fd66bdea4d6a 100644
--- a/fs/ext4/mballoc.c
+++ b/fs/ext4/mballoc.c
@@ -878,21 +878,20 @@ mb_update_avg_fragment_size(struct super_block *sb, struct ext4_group_info *grp)
 	}
 }
 
-static int ext4_mb_scan_groups_xarray(struct ext4_allocation_context *ac,
-				      struct xarray *xa, ext4_group_t start)
+static int ext4_mb_scan_groups_xa_range(struct ext4_allocation_context *ac,
+					struct xarray *xa,
+					ext4_group_t start, ext4_group_t end)
 {
 	struct super_block *sb = ac->ac_sb;
 	struct ext4_sb_info *sbi = EXT4_SB(sb);
 	enum criteria cr = ac->ac_criteria;
 	ext4_group_t ngroups = ext4_get_groups_count(sb);
 	unsigned long group = start;
-	ext4_group_t end = ngroups;
 	struct ext4_group_info *grp;
 
-	if (WARN_ON_ONCE(start >= end))
+	if (WARN_ON_ONCE(end > ngroups || start >= end))
 		return 0;
 
-wrap_around:
 	xa_for_each_range(xa, group, grp, start, end - 1) {
 		int err;
 
@@ -906,28 +905,23 @@ static int ext4_mb_scan_groups_xarray(struct ext4_allocation_context *ac,
 		cond_resched();
 	}
 
-	if (start) {
-		end = start;
-		start = 0;
-		goto wrap_around;
-	}
-
 	return 0;
 }
 
 /*
  * Find a suitable group of given order from the largest free orders xarray.
  */
-static int
-ext4_mb_scan_groups_largest_free_order(struct ext4_allocation_context *ac,
-				       int order, ext4_group_t start)
+static inline int
+ext4_mb_scan_groups_largest_free_order_range(struct ext4_allocation_context *ac,
+					     int order, ext4_group_t start,
+					     ext4_group_t end)
 {
 	struct xarray *xa = &EXT4_SB(ac->ac_sb)->s_mb_largest_free_orders[order];
 
 	if (xa_empty(xa))
 		return 0;
 
-	return ext4_mb_scan_groups_xarray(ac, xa, start);
+	return ext4_mb_scan_groups_xa_range(ac, xa, start, end);
 }
 
 /*
@@ -940,12 +934,22 @@ static int ext4_mb_scan_groups_p2_aligned(struct ext4_allocation_context *ac,
 	struct ext4_sb_info *sbi = EXT4_SB(ac->ac_sb);
 	int i;
 	int ret = 0;
+	ext4_group_t start, end;
 
+	start = group;
+	end = ext4_get_groups_count(ac->ac_sb);
+wrap_around:
 	for (i = ac->ac_2order; i < MB_NUM_ORDERS(ac->ac_sb); i++) {
-		ret = ext4_mb_scan_groups_largest_free_order(ac, i, group);
+		ret = ext4_mb_scan_groups_largest_free_order_range(ac, i,
+								   start, end);
 		if (ret || ac->ac_status != AC_STATUS_CONTINUE)
 			return ret;
 	}
+	if (start) {
+		end = start;
+		start = 0;
+		goto wrap_around;
+	}
 
 	if (sbi->s_mb_stats)
 		atomic64_inc(&sbi->s_bal_cX_failed[ac->ac_criteria]);
@@ -958,15 +962,17 @@ static int ext4_mb_scan_groups_p2_aligned(struct ext4_allocation_context *ac,
 /*
  * Find a suitable group of given order from the average fragments xarray.
  */
-static int ext4_mb_scan_groups_avg_frag_order(struct ext4_allocation_context *ac,
-					      int order, ext4_group_t start)
+static int
+ext4_mb_scan_groups_avg_frag_order_range(struct ext4_allocation_context *ac,
+					 int order, ext4_group_t start,
+					 ext4_group_t end)
 {
 	struct xarray *xa = &EXT4_SB(ac->ac_sb)->s_mb_avg_fragment_size[order];
 
 	if (xa_empty(xa))
 		return 0;
 
-	return ext4_mb_scan_groups_xarray(ac, xa, start);
+	return ext4_mb_scan_groups_xa_range(ac, xa, start, end);
 }
 
 /*
@@ -978,13 +984,23 @@ static int ext4_mb_scan_groups_goal_fast(struct ext4_allocation_context *ac,
 {
 	struct ext4_sb_info *sbi = EXT4_SB(ac->ac_sb);
 	int i, ret = 0;
+	ext4_group_t start, end;
 
+	start = group;
+	end = ext4_get_groups_count(ac->ac_sb);
+wrap_around:
 	i = mb_avg_fragment_size_order(ac->ac_sb, ac->ac_g_ex.fe_len);
 	for (; i < MB_NUM_ORDERS(ac->ac_sb); i++) {
-		ret = ext4_mb_scan_groups_avg_frag_order(ac, i, group);
+		ret = ext4_mb_scan_groups_avg_frag_order_range(ac, i,
+							       start, end);
 		if (ret || ac->ac_status != AC_STATUS_CONTINUE)
 			return ret;
 	}
+	if (start) {
+		end = start;
+		start = 0;
+		goto wrap_around;
+	}
 
 	if (sbi->s_mb_stats)
 		atomic64_inc(&sbi->s_bal_cX_failed[ac->ac_criteria]);
@@ -1020,6 +1036,7 @@ static int ext4_mb_scan_groups_best_avail(struct ext4_allocation_context *ac,
 	struct ext4_sb_info *sbi = EXT4_SB(ac->ac_sb);
 	int i, order, min_order;
 	unsigned long num_stripe_clusters = 0;
+	ext4_group_t start, end;
 
 	/*
 	 * mb_avg_fragment_size_order() returns order in a way that makes
@@ -1051,6 +1068,9 @@ static int ext4_mb_scan_groups_best_avail(struct ext4_allocation_context *ac,
 	if (1 << min_order < ac->ac_o_ex.fe_len)
 		min_order = fls(ac->ac_o_ex.fe_len);
 
+	start = group;
+	end = ext4_get_groups_count(ac->ac_sb);
+wrap_around:
 	for (i = order; i >= min_order; i--) {
 		int frag_order;
 		/*
@@ -1073,10 +1093,16 @@ static int ext4_mb_scan_groups_best_avail(struct ext4_allocation_context *ac,
 			frag_order = mb_avg_fragment_size_order(ac->ac_sb,
 								ac->ac_g_ex.fe_len);
 
-		ret = ext4_mb_scan_groups_avg_frag_order(ac, frag_order, group);
+		ret = ext4_mb_scan_groups_avg_frag_order_range(ac, frag_order,
+							       start, end);
 		if (ret || ac->ac_status != AC_STATUS_CONTINUE)
 			return ret;
 	}
+	if (start) {
+		end = start;
+		start = 0;
+		goto wrap_around;
+	}
 
 	/* Reset goal length to original goal length before falling into CR_GOAL_LEN_SLOW */
 	ac->ac_g_ex.fe_len = ac->ac_orig_goal_len;
-- 
2.46.1
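To make the traversal-order example in the commit message above concrete, here is a small stand-alone user-space sketch (plain C, not kernel code) that prints the visit order before and after the change. The list names, the A-H group labels, and the start index come from the commit message's example; everything else is assumed for illustration only.

/*
 * Stand-alone simulation of the traversal orders described above.
 * Two per-order lists of four "groups" each; the start position "|"
 * sits at index 2, as in the commit message example.
 */
#include <stdio.h>

#define GROUPS_PER_LIST 4

static const char order_1[GROUPS_PER_LIST] = { 'A', 'B', 'C', 'D' };
static const char order_2[GROUPS_PER_LIST] = { 'E', 'F', 'G', 'H' };

/* Old behaviour: each list wraps around on its own before moving on. */
static void scan_per_list_wrap(const char *list, int start)
{
	for (int i = start; i < GROUPS_PER_LIST; i++)
		printf("%c ", list[i]);
	for (int i = 0; i < start; i++)
		printf("%c ", list[i]);
}

/* New behaviour: scan only [start, end) of a list; the caller wraps. */
static void scan_range(const char *list, int start, int end)
{
	for (int i = start; i < end; i++)
		printf("%c ", list[i]);
}

int main(void)
{
	int start = 2;	/* "|" position from the commit message example */

	printf("Before: ");
	scan_per_list_wrap(order_1, start);
	scan_per_list_wrap(order_2, start);
	printf("\n");

	printf("After:  ");
	scan_range(order_1, start, GROUPS_PER_LIST);
	scan_range(order_2, start, GROUPS_PER_LIST);
	scan_range(order_1, 0, start);
	scan_range(order_2, 0, start);
	printf("\n");
	return 0;
}

It prints "Before: C D A B G H E F" and "After:  C D G H A B E F", matching the orders shown above: the patched scheme finishes the [start, end) half of every order list before wrapping any of them back to [0, start).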

Feedback: The patch(es) which you have sent to the kernel@openeuler.org mailing list have been converted to a pull request successfully!
Pull request link: https://gitee.com/openeuler/kernel/pulls/17469
Mailing list address: https://mailweb.openeuler.org/archives/list/kernel@openeuler.org/message/JJW...