[PATCH OLK-6.6 0/2] ext4: better scalability for ext4 block allocation

Since servers have more and more CPUs and we are running more and more
containers on them, we have been using will-it-scale to test how well ext4
scales. The fallocate2 test (append 8KB to 1MB, truncate to 0, repeat), run
concurrently in 64 containers, revealed significant contention in the block
allocation/free paths, resulting in much lower aggregate fallocate OPS than
with a single container (see below).

   1   |   2   |   4   |   8   |  16   |  32   |  64
-------|-------|-------|-------|-------|-------|------
295287 | 70665 | 33865 | 19387 | 10104 |  5588 | 3588

The main bottleneck was ext4_lock_group(), which both block allocation and
block freeing contend on. While the block group used when freeing blocks is
fixed and cannot be changed, the block group used for allocation is
selectable. Therefore, an ext4_try_lock_group() helper was added so the
allocator can avoid contending on busy groups; see Patch 1 for details.

After the ext4_lock_group() bottleneck was removed, another one showed up:
s_md_lock. This lock protects different data on the allocation and free
paths. The s_md_lock usage in block allocation was removed by making stream
allocation work per inode instead of globally; see Patch 2 for details.

Performance test data follows:

CPU: HUAWEI Kunpeng 920
Memory: 480GB
Disk: 480GB SSD SATA 3.2
Test: Running will-it-scale/fallocate2 on 64 CPU-bound containers.
Observation: Average fallocate operations per container per second.

|--------|--------|--------|--------|--------|--------|--------|--------|
|   -    |   1    |   2    |   4    |   8    |   16   |   32   |   64   |
|--------|--------|--------|--------|--------|--------|--------|--------|
| base   | 295287 | 70665  | 33865  | 19387  | 10104  | 5588   | 3588   |
|--------|--------|--------|--------|--------|--------|--------|--------|
| linear | 286328 | 123102 | 119542 | 90653  | 60344  | 35302  | 23280  |
|        | -3.0%  | 74.20% | 252.9% | 367.5% | 497.2% | 531.6% | 548.7% |
|--------|--------|--------|--------|--------|--------|--------|--------|
|mb_optim| 292498 | 133305 | 103069 | 61727  | 29702  | 16845  | 10430  |
|ize_scan| -0.9%  | 88.64% | 204.3% | 218.3% | 193.9% | 201.4% | 190.6% |
|--------|--------|--------|--------|--------|--------|--------|--------|

Baokun Li (2):
  ext4: add ext4_try_lock_group() to skip busy groups
  ext4: move mb_last_[group|start] to ext4_inode_info

 fs/ext4/ext4.h    | 30 ++++++++++++++++++------------
 fs/ext4/mballoc.c | 34 ++++++++++++++++++++--------------
 fs/ext4/super.c   |  2 ++
 3 files changed, 40 insertions(+), 26 deletions(-)

-- 
2.46.1
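
For reference, a minimal user-space sketch (not the actual will-it-scale
harness) of the fallocate2-style loop described above; the file name,
iteration count, and error handling are illustrative assumptions:

/*
 * Rough approximation of the workload: append 8KB at a time up to 1MB
 * with fallocate(), truncate back to 0, and repeat. "testfile" and the
 * fixed iteration count are placeholders, not part of will-it-scale.
 */
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
        int fd = open("testfile", O_CREAT | O_RDWR, 0600);
        long ops = 0;

        if (fd < 0) {
                perror("open");
                return 1;
        }

        for (int iter = 0; iter < 100000; iter++) {
                /* Append 8KB extents until the file reaches 1MB. */
                for (off_t off = 0; off < 1024 * 1024; off += 8192) {
                        if (fallocate(fd, 0, off, 8192) < 0) {
                                perror("fallocate");
                                return 1;
                        }
                        ops++;
                }
                /* Release all blocks again, then start over. */
                if (ftruncate(fd, 0) < 0) {
                        perror("ftruncate");
                        return 1;
                }
        }
        printf("fallocate calls: %ld\n", ops);
        close(fd);
        return 0;
}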

[PATCH OLK-6.6 1/2] ext4: add ext4_try_lock_group() to skip busy groups

hulk inclusion
category: feature
bugzilla: https://gitee.com/openeuler/kernel/issues/ICAH6I

--------------------------------

When allocating blocks, ext4 used to scan the block groups linearly, one by
one, to find a suitable group. With a huge number of block groups (hundreds
of thousands or even millions) and only a few of them having free space
(i.e. the filesystem is nearly full), checking them all takes a long time
and performance suffers.

To address this, the "mb_optimize_scan" mount option was added (and is now
enabled by default). It maintains lists of groups ordered by their
free-space properties, so when a free block is needed, a promising group can
be taken directly from the appropriate list, which saves time and makes
block allocation much faster.

However, when multiple processes or containers run similar workloads, such
as repeatedly allocating 8k blocks, they all pick the same block group from
the same list. Even two such processes can halve the IOPS: a single
container may achieve 300,000 IOPS, but running two at the same time yields
only 150,000 in total.

Since block groups can already be scanned in a non-linear way, the first and
the last group in the same list are, for the purpose of finding a block at
that moment, essentially equivalent. Therefore, add an ext4_try_lock_group()
helper so the allocator can skip the current group when it is locked by
another process, avoiding contention with other processes. This makes better
use of having multiple block groups.

Also, to make sure no group with free space is skipped entirely during
allocation, busy groups are no longer skipped once ac_criteria reaches
CR_ANY_FREE.

Performance test data follows:

CPU: HUAWEI Kunpeng 920
Memory: 480GB
Disk: 480GB SSD SATA 3.2
Test: Running will-it-scale/fallocate2 on 64 CPU-bound containers.
Observation: Average fallocate operations per container per second.

                    base     patched
mb_optimize_scan=0  3588     6755 (+88.2%)
mb_optimize_scan=1  3588     4302 (+19.8%)

Signed-off-by: Baokun Li <libaokun1@huawei.com>
---
 fs/ext4/ext4.h    | 23 ++++++++++++++---------
 fs/ext4/mballoc.c | 14 +++++++++++---
 2 files changed, 25 insertions(+), 12 deletions(-)

diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
index 6fcbb2184d1b..885fe1d9a8e9 100644
--- a/fs/ext4/ext4.h
+++ b/fs/ext4/ext4.h
@@ -3487,23 +3487,28 @@ static inline int ext4_fs_is_busy(struct ext4_sb_info *sbi)
 	return (atomic_read(&sbi->s_lock_busy) > EXT4_CONTENTION_THRESHOLD);
 }
 
+static inline bool ext4_try_lock_group(struct super_block *sb, ext4_group_t group)
+{
+	if (!spin_trylock(ext4_group_lock_ptr(sb, group)))
+		return false;
+	/*
+	 * We're able to grab the lock right away, so drop the lock
+	 * contention counter.
+	 */
+	atomic_add_unless(&EXT4_SB(sb)->s_lock_busy, -1, 0);
+	return true;
+}
+
 static inline void ext4_lock_group(struct super_block *sb, ext4_group_t group)
 {
-	spinlock_t *lock = ext4_group_lock_ptr(sb, group);
-	if (spin_trylock(lock))
-		/*
-		 * We're able to grab the lock right away, so drop the
-		 * lock contention counter.
-		 */
-		atomic_add_unless(&EXT4_SB(sb)->s_lock_busy, -1, 0);
-	else {
+	if (!ext4_try_lock_group(sb, group)) {
 		/*
 		 * The lock is busy, so bump the contention counter,
 		 * and then wait on the spin lock.
 		 */
 		atomic_add_unless(&EXT4_SB(sb)->s_lock_busy, 1,
 				  EXT4_MAX_CONTENTION);
-		spin_lock(lock);
+		spin_lock(ext4_group_lock_ptr(sb, group));
 	}
 }
 
diff --git a/fs/ext4/mballoc.c b/fs/ext4/mballoc.c
index 7c8cb88b426e..7646ac5d3b6f 100644
--- a/fs/ext4/mballoc.c
+++ b/fs/ext4/mballoc.c
@@ -899,7 +899,8 @@ static void ext4_mb_choose_next_group_p2_aligned(struct ext4_allocation_context
 			    bb_largest_free_order_node) {
 		if (sbi->s_mb_stats)
 			atomic64_inc(&sbi->s_bal_cX_groups_considered[CR_POWER2_ALIGNED]);
-		if (likely(ext4_mb_good_group(ac, iter->bb_group, CR_POWER2_ALIGNED))) {
+		if (likely(ext4_mb_good_group(ac, iter->bb_group, CR_POWER2_ALIGNED)) &&
+		    !spin_is_locked(ext4_group_lock_ptr(ac->ac_sb, iter->bb_group))) {
 			*group = iter->bb_group;
 			ac->ac_flags |= EXT4_MB_CR_POWER2_ALIGNED_OPTIMIZED;
 			read_unlock(&sbi->s_mb_largest_free_orders_locks[i]);
@@ -935,7 +936,8 @@ ext4_mb_find_good_group_avg_frag_lists(struct ext4_allocation_context *ac, int order
 	list_for_each_entry(iter, frag_list, bb_avg_fragment_size_node) {
 		if (sbi->s_mb_stats)
 			atomic64_inc(&sbi->s_bal_cX_groups_considered[cr]);
-		if (likely(ext4_mb_good_group(ac, iter->bb_group, cr))) {
+		if (likely(ext4_mb_good_group(ac, iter->bb_group, cr)) &&
+		    !spin_is_locked(ext4_group_lock_ptr(ac->ac_sb, iter->bb_group))) {
 			grp = iter;
 			break;
 		}
@@ -2911,7 +2913,13 @@ ext4_mb_regular_allocator(struct ext4_allocation_context *ac)
 			if (err)
 				goto out;
 
-			ext4_lock_group(sb, group);
+			/* skip busy group */
+			if (cr >= CR_ANY_FREE) {
+				ext4_lock_group(sb, group);
+			} else if (!ext4_try_lock_group(sb, group)) {
+				ext4_mb_unload_buddy(&e4b);
+				continue;
+			}
 
 			/*
 			 * We need to check again after locking the
-- 
2.46.1
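
To illustrate the locking pattern above outside the kernel, here is a
self-contained user-space sketch using pthread spinlocks. It only mirrors
the control flow of the new ext4_mb_regular_allocator() logic; NGROUPS, the
CR_ANY_FREE value, and try_alloc_in_group() are invented for the sketch and
are not part of the patch:

/*
 * Skip-busy-groups sketch: at strict criteria levels a group whose lock
 * is already held by another allocator is simply skipped; only at the
 * last-resort level do we block, so no group with free space is missed.
 */
#include <pthread.h>
#include <stdbool.h>
#include <stdio.h>

#define NGROUPS         16
#define CR_ANY_FREE     3       /* stand-in for the last-resort criteria level */

static pthread_spinlock_t group_lock[NGROUPS];

/* Placeholder for the per-group allocation attempt done under the lock. */
static bool try_alloc_in_group(int group)
{
        return group == NGROUPS - 1;    /* pretend only the last group has space */
}

static int scan_groups(int cr)
{
        for (int group = 0; group < NGROUPS; group++) {
                if (cr >= CR_ANY_FREE) {
                        /* Last resort: wait for the lock so no group is skipped. */
                        pthread_spin_lock(&group_lock[group]);
                } else if (pthread_spin_trylock(&group_lock[group]) != 0) {
                        /* Busy group: another allocator is using it, move on. */
                        continue;
                }

                bool found = try_alloc_in_group(group);

                pthread_spin_unlock(&group_lock[group]);
                if (found)
                        return group;
        }
        return -1;
}

int main(void)
{
        for (int i = 0; i < NGROUPS; i++)
                pthread_spin_init(&group_lock[i], PTHREAD_PROCESS_PRIVATE);

        printf("allocated from group %d\n", scan_groups(0));
        return 0;
}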

[PATCH OLK-6.6 2/2] ext4: move mb_last_[group|start] to ext4_inode_info

hulk inclusion
category: feature
bugzilla: https://gitee.com/openeuler/kernel/issues/ICAH6I

--------------------------------

After optimizing the block group lock, we found another lock contention
issue when running will-it-scale/fallocate2 with multiple processes:
fallocate's block allocation and truncate's block release were contending on
s_md_lock. The two paths use this lock to protect completely different data:
the list of freed data blocks (s_freed_data_list) on the release side, and
the position to start looking for new blocks (mb_last_[group|start]) on the
allocation side.

Moreover, when allocating data blocks, if the first attempt (the goal
allocation) fails and stream allocation is enabled, the allocator falls back
to a global goal starting from the last group used (s_mb_last_group). This
can improve performance by placing writes close together on disk. But when
many processes allocate concurrently, they all contend on s_md_lock and may
even pick the same group, which makes extent merging harder and increases
file fragmentation. If processes allocate chunks of very different sizes,
free space can also become fragmented: a small allocation may fit into a
partially filled group that a large allocation had to skip, so the small IO
ends up in an otherwise empty group.

Therefore, change stream allocation to work per inode: first try the goal,
then the last group where this inode successfully allocated a block. This
keeps an inode's data closer together. In addition, after moving
mb_last_[group|start] into ext4_inode_info, s_md_lock is no longer needed
during block allocation because the write lock on i_data_sem is already
held. This removes the contention between block allocation and block
release, which gives fallocate2 a huge performance boost.

Performance test data follows:

CPU: HUAWEI Kunpeng 920
Memory: 480GB
Disk: 480GB SSD SATA 3.2
Test: Running will-it-scale/fallocate2 on 64 CPU-bound containers.
Observation: Average fallocate operations per container per second.

                    base     patched
mb_optimize_scan=0  6755     23280 (+244.6%)
mb_optimize_scan=1  4302     10430 (+142.4%)

Signed-off-by: Baokun Li <libaokun1@huawei.com>
---
 fs/ext4/ext4.h    |  7 ++++---
 fs/ext4/mballoc.c | 20 +++++++++-----------
 fs/ext4/super.c   |  2 ++
 3 files changed, 15 insertions(+), 14 deletions(-)

diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
index 885fe1d9a8e9..a31069001add 100644
--- a/fs/ext4/ext4.h
+++ b/fs/ext4/ext4.h
@@ -1179,6 +1179,10 @@ struct ext4_inode_info {
 
 	__u32 i_csum_seed;
 	kprojid_t i_projid;
+
+	/* where last allocation was done - for stream allocation */
+	ext4_group_t i_mb_last_group;
+	ext4_grpblk_t i_mb_last_start;
 };
 
 /*
@@ -1611,9 +1615,6 @@ struct ext4_sb_info {
 	unsigned int s_mb_order2_reqs;
 	unsigned int s_mb_group_prealloc;
 	unsigned int s_max_dir_size_kb;
-	/* where last allocation was done - for stream allocation */
-	unsigned long s_mb_last_group;
-	unsigned long s_mb_last_start;
 	unsigned int s_mb_prefetch;
 	unsigned int s_mb_prefetch_limit;
 	unsigned int s_mb_best_avail_max_trim_order;
diff --git a/fs/ext4/mballoc.c b/fs/ext4/mballoc.c
index 7646ac5d3b6f..ec2808247f4f 100644
--- a/fs/ext4/mballoc.c
+++ b/fs/ext4/mballoc.c
@@ -2139,7 +2139,6 @@ static int mb_mark_used(struct ext4_buddy *e4b, struct ext4_free_extent *ex)
 static void ext4_mb_use_best_found(struct ext4_allocation_context *ac,
 					struct ext4_buddy *e4b)
 {
-	struct ext4_sb_info *sbi = EXT4_SB(ac->ac_sb);
 	int ret;
 
 	BUG_ON(ac->ac_b_ex.fe_group != e4b->bd_group);
@@ -2170,10 +2169,8 @@ static void ext4_mb_use_best_found(struct ext4_allocation_context *ac,
 	get_page(ac->ac_buddy_page);
 	/* store last allocated for subsequent stream allocation */
 	if (ac->ac_flags & EXT4_MB_STREAM_ALLOC) {
-		spin_lock(&sbi->s_md_lock);
-		sbi->s_mb_last_group = ac->ac_f_ex.fe_group;
-		sbi->s_mb_last_start = ac->ac_f_ex.fe_start;
-		spin_unlock(&sbi->s_md_lock);
+		EXT4_I(ac->ac_inode)->i_mb_last_group = ac->ac_f_ex.fe_group;
+		EXT4_I(ac->ac_inode)->i_mb_last_start = ac->ac_f_ex.fe_start;
 	}
 	/*
 	 * As we've just preallocated more space than
@@ -2845,13 +2842,14 @@ ext4_mb_regular_allocator(struct ext4_allocation_context *ac)
 			MB_NUM_ORDERS(sb));
 	}
 
-	/* if stream allocation is enabled, use global goal */
+	/* if stream allocation is enabled, use last goal */
 	if (ac->ac_flags & EXT4_MB_STREAM_ALLOC) {
-		/* TBD: may be hot point */
-		spin_lock(&sbi->s_md_lock);
-		ac->ac_g_ex.fe_group = sbi->s_mb_last_group;
-		ac->ac_g_ex.fe_start = sbi->s_mb_last_start;
-		spin_unlock(&sbi->s_md_lock);
+		struct ext4_inode_info *ei = EXT4_I(ac->ac_inode);
+
+		if (ei->i_mb_last_group || ei->i_mb_last_start) {
+			ac->ac_g_ex.fe_group = ei->i_mb_last_group;
+			ac->ac_g_ex.fe_start = ei->i_mb_last_start;
+		}
 	}
 
 	/*
diff --git a/fs/ext4/super.c b/fs/ext4/super.c
index c45dfcf9ac62..f7a301116d59 100644
--- a/fs/ext4/super.c
+++ b/fs/ext4/super.c
@@ -1517,6 +1517,8 @@ static struct inode *ext4_alloc_inode(struct super_block *sb)
 	INIT_WORK(&ei->i_iomap_ioend_work, ext4_iomap_end_io);
 	ext4_fc_init_inode(&ei->vfs_inode);
 	mutex_init(&ei->i_fc_lock);
+	ei->i_mb_last_group = 0;
+	ei->i_mb_last_start = 0;
 	return &ei->vfs_inode;
 }
 
-- 
2.46.1
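
To illustrate why the per-inode hint needs no extra lock, here is a
simplified user-space analogue. All names (toy_inode, record_stream_goal,
load_stream_goal) are invented for the sketch; in the real patch the hint is
only touched while i_data_sem is held for writing:

/*
 * Toy model of the change: the "where did I allocate last" hint lives in
 * the per-file object instead of one fs-wide field behind a shared lock,
 * so writers to different files no longer serialize just to read or
 * update the hint.
 */
#include <stdint.h>
#include <stdio.h>

struct toy_inode {
        uint32_t last_group;    /* stands in for i_mb_last_group */
        uint32_t last_start;    /* stands in for i_mb_last_start */
};

/* Record where this file last allocated; the caller already serializes the file. */
static void record_stream_goal(struct toy_inode *ino, uint32_t group, uint32_t start)
{
        ino->last_group = group;
        ino->last_start = start;
}

/* Use the hint only if a previous allocation set it (0/0 means "unset"). */
static int load_stream_goal(const struct toy_inode *ino, uint32_t *group, uint32_t *start)
{
        if (!ino->last_group && !ino->last_start)
                return 0;
        *group = ino->last_group;
        *start = ino->last_start;
        return 1;
}

int main(void)
{
        struct toy_inode ino = { 0, 0 };
        uint32_t g, s;

        if (!load_stream_goal(&ino, &g, &s))
                printf("no hint yet, fall back to the goal group\n");

        record_stream_goal(&ino, 42, 128);
        if (load_stream_goal(&ino, &g, &s))
                printf("next stream allocation starts at group %u, block %u\n", g, s);
        return 0;
}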

FeedBack: The patch(es) which you have sent to kernel@openeuler.org mailing
list has been converted to a pull request successfully!
Pull request link:
https://gitee.com/openeuler/kernel/pulls/16441
Mailing list address:
https://mailweb.openeuler.org/archives/list/kernel@openeuler.org/message/DQU...