From: Max Reitz <mreitz@redhat.com>
mainline inclusion
from mainline-v5.4-rc3
commit e093c4be760ebf46c131ae0dd6138865a22f46fa
category: bugfix
bugzilla: NA
CVE: NA
---------------------------
To ensure that all blocks touched by the range [offset, offset + count) are allocated, we need to calculate the block count from the difference of the range end (rounded up) and the range start (rounded down).
Before this patch, we just round up the byte count, which may lead to unaligned ranges not being fully allocated:
$ touch test_file
$ block_size=$(stat -fc '%S' test_file)
$ fallocate -o $((block_size / 2)) -l $block_size test_file
$ xfs_bmap test_file
test_file:
        0: [0..7]: 1396264..1396271
        1: [8..15]: hole
There should not be a hole there. Instead, the first two blocks should be fully allocated.
With this patch applied, the result is something like this:
$ touch test_file
$ block_size=$(stat -fc '%S' test_file)
$ fallocate -o $((block_size / 2)) -l $block_size test_file
$ xfs_bmap test_file
test_file:
        0: [0..15]: 11024..11039
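As a worked example of the two calculations (hypothetical numbers, assuming a 4096-byte filesystem block), take offset = 2048 and count = 4096:

  before: allocatesize_fsb = XFS_B_TO_FSB(mp, 4096)            = 1 block
  after:  startoffset_fsb  = XFS_B_TO_FSBT(mp, 2048)           = block 0
          endoffset_fsb    = XFS_B_TO_FSB(mp, 2048 + 4096)     = block 2
          allocatesize_fsb = endoffset_fsb - startoffset_fsb   = 2 blocks

so the unaligned request now covers both touched blocks instead of only the first one.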
Signed-off-by: Max Reitz <mreitz@redhat.com>
Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
Reviewed-by: zhangyi (F) <yi.zhang@huawei.com>
Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
---
 fs/xfs/xfs_bmap_util.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)
diff --git a/fs/xfs/xfs_bmap_util.c b/fs/xfs/xfs_bmap_util.c
index 3e1dd66bd676..e54f5aed30a9 100644
--- a/fs/xfs/xfs_bmap_util.c
+++ b/fs/xfs/xfs_bmap_util.c
@@ -869,6 +869,7 @@ xfs_alloc_file_space(
 	xfs_filblks_t		allocatesize_fsb;
 	xfs_extlen_t		extsz, temp;
 	xfs_fileoff_t		startoffset_fsb;
+	xfs_fileoff_t		endoffset_fsb;
 	int			nimaps;
 	int			quota_flag;
 	int			rt;
@@ -896,7 +897,8 @@ xfs_alloc_file_space(
 	imapp = &imaps[0];
 	nimaps = 1;
 	startoffset_fsb = XFS_B_TO_FSBT(mp, offset);
-	allocatesize_fsb = XFS_B_TO_FSB(mp, count);
+	endoffset_fsb = XFS_B_TO_FSB(mp, offset + count);
+	allocatesize_fsb = endoffset_fsb - startoffset_fsb;
 
 	/*
 	 * Allocate file space until done or until there is an error
From: Luo Meng <luomeng12@huawei.com>
hulk inclusion
category: bugfix
bugzilla: 39268
CVE: NA
-------------------------------------------------
Since commit c03b42efb1c4 ("ext4: Fold ext4_data_block_valid_rcu() into the caller"), when validating an inode's blocks we record the last error block before it has been finally determined that the block is invalid, which diverges from the mainline logic.

A block should be treated as invalid only when it belongs to the system zone. The system zone is initialized at mount time, each entry's ino can only be 0 or the journal inode number, and it never changes during the entry's lifetime. Only a check against an inode with ino == 0 or the journal ino could record a wrong last error block, but those inodes never call ext4_inode_block_valid(), so in practice this causes no problem.

To keep the logic consistent with the mainline kernel and remove the confusion, add an explicit check that the block is invalid before recording it as the last error block.
Fixes: c03b42efb1c4 ("ext4: Fold ext4_data_block_valid_rcu() into the caller")
Signed-off-by: Luo Meng <luomeng12@huawei.com>
Reviewed-by: zhangyi (F) <yi.zhang@huawei.com>
Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
---
 fs/ext4/block_validity.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)
diff --git a/fs/ext4/block_validity.c b/fs/ext4/block_validity.c
index 817e3896462f..2471577d5c09 100644
--- a/fs/ext4/block_validity.c
+++ b/fs/ext4/block_validity.c
@@ -326,8 +326,9 @@ int ext4_inode_block_valid(struct inode *inode, ext4_fsblk_t start_blk,
 		else if (start_blk >= (entry->start_blk + entry->count))
 			n = n->rb_right;
 		else {
-			sbi->s_es->s_last_error_block = cpu_to_le64(start_blk);
 			ret = (entry->ino == inode->i_ino);
+			if (!ret)
+				sbi->s_es->s_last_error_block = cpu_to_le64(start_blk);
 			break;
 		}
 	}
From: Theodore Ts'o <tytso@mit.edu>
mainline inclusion
from mainline-5.6-rc1
commit 46f870d690fecc792a66730dcbbf0aa109f5f9ab
category: feature
bugzilla: NA
CVE: NA

---------------------------
This allows us to test various error handling code paths
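A rough usage sketch (the device directory and the chosen code are illustrative, not part of this patch): with CONFIG_EXT4_DEBUG enabled, writing one of the EXT4_SIM_* codes added below to the new sysfs attribute arms a one-shot failure, e.g.

  # echo 2 > /sys/fs/ext4/sdb1/simulate_fail

makes the next block bitmap validation report a bad checksum (code 2 is EXT4_SIM_BBITMAP_CRC). ext4_simulate_fail() clears s_simulate_fail on the first hit, so each write simulates at most one failure.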
Link: https://lore.kernel.org/r/20191209012317.59398-1-tytso@mit.edu
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Conflict: fs/ext4/ext4.h
Signed-off-by: zhangyi (F) <yi.zhang@huawei.com>
Reviewed-by: yangerkun <yangerkun@huawei.com>
Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
---
 fs/ext4/balloc.c |  4 +++-
 fs/ext4/ext4.h   | 38 +++++++++++++++++++++++++++++++++++++-
 fs/ext4/ialloc.c |  4 +++-
 fs/ext4/inode.c  |  6 +++++-
 fs/ext4/namei.c  | 11 ++++++++---
 fs/ext4/sysfs.c  | 23 +++++++++++++++++++++++
 6 files changed, 79 insertions(+), 7 deletions(-)
diff --git a/fs/ext4/balloc.c b/fs/ext4/balloc.c
index 904fe348084f..aa4d8702bac2 100644
--- a/fs/ext4/balloc.c
+++ b/fs/ext4/balloc.c
@@ -379,7 +379,8 @@ static int ext4_validate_block_bitmap(struct super_block *sb,
 	if (buffer_verified(bh))
 		goto verified;
 	if (unlikely(!ext4_block_bitmap_csum_verify(sb, block_group,
-						    desc, bh))) {
+						    desc, bh) ||
+		     ext4_simulate_fail(sb, EXT4_SIM_BBITMAP_CRC))) {
 		ext4_unlock_group(sb, block_group);
 		ext4_error(sb, "bg %u: bad block bitmap checksum", block_group);
 		ext4_mark_group_bitmap_corrupted(sb, block_group,
@@ -514,6 +515,7 @@ int ext4_wait_block_bitmap(struct super_block *sb, ext4_group_t block_group,
 	if (!desc)
 		return -EFSCORRUPTED;
 	wait_on_buffer(bh);
+	ext4_simulate_fail_bh(sb, bh, EXT4_SIM_BBITMAP_EIO);
 	if (!buffer_uptodate(bh)) {
 		ext4_error(sb, "Cannot read block bitmap - "
 			   "block_group = %u, block_bitmap = %llu",
diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
index 03cf8faf51b9..3a3cde616b26 100644
--- a/fs/ext4/ext4.h
+++ b/fs/ext4/ext4.h
@@ -1535,7 +1535,9 @@ struct ext4_sb_info {
 	 */
 	struct percpu_rw_semaphore s_writepages_rwsem;
 	struct dax_device *s_daxdev;
-
+#ifdef CONFIG_EXT4_DEBUG
+	unsigned long s_simulate_fail;
+#endif
 	/* Record the errseq of the backing block device */
 	errseq_t s_bdev_wb_err;
 	spinlock_t s_bdev_wb_lock;
@@ -1557,6 +1559,40 @@ static inline int ext4_valid_inum(struct super_block *sb, unsigned long ino)
 		 ino <= le32_to_cpu(EXT4_SB(sb)->s_es->s_inodes_count));
 }
 
+/*
+ * Simulate_fail codes
+ */
+#define EXT4_SIM_BBITMAP_EIO	1
+#define EXT4_SIM_BBITMAP_CRC	2
+#define EXT4_SIM_IBITMAP_EIO	3
+#define EXT4_SIM_IBITMAP_CRC	4
+#define EXT4_SIM_INODE_EIO	5
+#define EXT4_SIM_INODE_CRC	6
+#define EXT4_SIM_DIRBLOCK_EIO	7
+#define EXT4_SIM_DIRBLOCK_CRC	8
+
+static inline bool ext4_simulate_fail(struct super_block *sb,
+				      unsigned long code)
+{
+#ifdef CONFIG_EXT4_DEBUG
+	struct ext4_sb_info *sbi = EXT4_SB(sb);
+
+	if (unlikely(sbi->s_simulate_fail == code)) {
+		sbi->s_simulate_fail = 0;
+		return true;
+	}
+#endif
+	return false;
+}
+
+static inline void ext4_simulate_fail_bh(struct super_block *sb,
+					 struct buffer_head *bh,
+					 unsigned long code)
+{
+	if (!IS_ERR(bh) && ext4_simulate_fail(sb, code))
+		clear_buffer_uptodate(bh);
+}
+
 /*
  * Returns: sbi->field[index]
  * Used to access an array element from the following sbi fields which require
diff --git a/fs/ext4/ialloc.c b/fs/ext4/ialloc.c
index 9b04dac91248..f7989081ff54 100644
--- a/fs/ext4/ialloc.c
+++ b/fs/ext4/ialloc.c
@@ -94,7 +94,8 @@ static int ext4_validate_inode_bitmap(struct super_block *sb,
 		goto verified;
 	blk = ext4_inode_bitmap(sb, desc);
 	if (!ext4_inode_bitmap_csum_verify(sb, block_group, desc, bh,
-					   EXT4_INODES_PER_GROUP(sb) / 8)) {
+					   EXT4_INODES_PER_GROUP(sb) / 8) ||
+	    ext4_simulate_fail(sb, EXT4_SIM_IBITMAP_CRC)) {
 		ext4_unlock_group(sb, block_group);
 		ext4_error(sb, "Corrupt inode bitmap - block_group = %u, "
 			   "inode_bitmap = %llu", block_group, blk);
@@ -193,6 +194,7 @@ ext4_read_inode_bitmap(struct super_block *sb, ext4_group_t block_group)
 	get_bh(bh);
 	submit_bh(REQ_OP_READ, REQ_META | REQ_PRIO, bh);
 	wait_on_buffer(bh);
+	ext4_simulate_fail_bh(sb, bh, EXT4_SIM_IBITMAP_EIO);
 	if (!buffer_uptodate(bh)) {
 		put_bh(bh);
 		ext4_error(sb, "Cannot read inode bitmap - "
diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index 0bde4afe6b7f..ae8cd602c8ee 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -4627,6 +4627,8 @@ static int __ext4_get_inode_loc(struct inode *inode,
 	bh = sb_getblk(sb, block);
 	if (unlikely(!bh))
 		return -ENOMEM;
+	if (ext4_simulate_fail(sb, EXT4_SIM_INODE_EIO))
+		goto simulate_eio;
 	if (!buffer_uptodate(bh)) {
 		lock_buffer(bh);
 
@@ -4723,6 +4725,7 @@ static int __ext4_get_inode_loc(struct inode *inode,
 		submit_bh(REQ_OP_READ, REQ_META | REQ_PRIO, bh);
 		wait_on_buffer(bh);
 		if (!buffer_uptodate(bh)) {
+		simulate_eio:
 			EXT4_ERROR_INODE_BLOCK(inode, block,
 					       "unable to read itable block");
 			brelse(bh);
@@ -4930,7 +4933,8 @@ struct inode *__ext4_iget(struct super_block *sb, unsigned long ino,
 		       sizeof(gen));
 	}
 
-	if (!ext4_inode_csum_verify(inode, raw_inode, ei)) {
+	if (!ext4_inode_csum_verify(inode, raw_inode, ei) ||
+	    ext4_simulate_fail(sb, EXT4_SIM_INODE_CRC)) {
 		ext4_error_inode(inode, function, line, 0,
 				 "iget: checksum invalid");
 		ret = -EFSBADCRC;
diff --git a/fs/ext4/namei.c b/fs/ext4/namei.c
index b24864418524..09d9290430dc 100644
--- a/fs/ext4/namei.c
+++ b/fs/ext4/namei.c
@@ -108,7 +108,10 @@ static struct buffer_head *__ext4_read_dirblock(struct inode *inode,
 	struct ext4_dir_entry *dirent;
 	int is_dx_block = 0;
 
-	bh = ext4_bread(NULL, inode, block, 0);
+	if (ext4_simulate_fail(inode->i_sb, EXT4_SIM_DIRBLOCK_EIO))
+		bh = ERR_PTR(-EIO);
+	else
+		bh = ext4_bread(NULL, inode, block, 0);
 	if (IS_ERR(bh)) {
 		__ext4_warning(inode->i_sb, func, line,
 			       "inode #%lu: lblock %lu: comm %s: "
@@ -152,7 +155,8 @@ static struct buffer_head *__ext4_read_dirblock(struct inode *inode,
 	 * caller is sure it should be an index block.
 	 */
 	if (is_dx_block && type == INDEX) {
-		if (ext4_dx_csum_verify(inode, dirent))
+		if (ext4_dx_csum_verify(inode, dirent) &&
+		    !ext4_simulate_fail(inode->i_sb, EXT4_SIM_DIRBLOCK_CRC))
 			set_buffer_verified(bh);
 		else {
 			ext4_error_inode(inode, func, line, block,
@@ -162,7 +166,8 @@
 		}
 	}
 	if (!is_dx_block) {
-		if (ext4_dirent_csum_verify(inode, dirent))
+		if (ext4_dirent_csum_verify(inode, dirent) &&
+		    !ext4_simulate_fail(inode->i_sb, EXT4_SIM_DIRBLOCK_CRC))
 			set_buffer_verified(bh);
 		else {
 			ext4_error_inode(inode, func, line, block,
diff --git a/fs/ext4/sysfs.c b/fs/ext4/sysfs.c
index 9212a026a1f1..bac32096ed59 100644
--- a/fs/ext4/sysfs.c
+++ b/fs/ext4/sysfs.c
@@ -29,6 +29,7 @@ typedef enum {
 	attr_last_error_time,
 	attr_feature,
 	attr_pointer_ui,
+	attr_pointer_ul,
 	attr_pointer_atomic,
 } attr_id_t;
 
@@ -151,6 +152,9 @@ static struct ext4_attr ext4_attr_##_name = { \
 #define EXT4_RW_ATTR_SBI_UI(_name,_elname)	\
 	EXT4_ATTR_OFFSET(_name, 0644, pointer_ui, ext4_sb_info, _elname)
 
+#define EXT4_RW_ATTR_SBI_UL(_name,_elname)	\
+	EXT4_ATTR_OFFSET(_name, 0644, pointer_ul, ext4_sb_info, _elname)
+
 #define EXT4_ATTR_PTR(_name,_mode,_id,_ptr) \
 static struct ext4_attr ext4_attr_##_name = { \
 	.attr = {.name = __stringify(_name), .mode = _mode },	\
@@ -185,6 +189,9 @@ EXT4_RW_ATTR_SBI_UI(warning_ratelimit_interval_ms, s_warning_ratelimit_state.int
 EXT4_RW_ATTR_SBI_UI(warning_ratelimit_burst, s_warning_ratelimit_state.burst);
 EXT4_RW_ATTR_SBI_UI(msg_ratelimit_interval_ms, s_msg_ratelimit_state.interval);
 EXT4_RW_ATTR_SBI_UI(msg_ratelimit_burst, s_msg_ratelimit_state.burst);
+#ifdef CONFIG_EXT4_DEBUG
+EXT4_RW_ATTR_SBI_UL(simulate_fail, s_simulate_fail);
+#endif
 EXT4_RO_ATTR_ES_UI(errors_count, s_error_count);
 EXT4_ATTR(first_error_time, 0444, first_error_time);
 EXT4_ATTR(last_error_time, 0444, last_error_time);
@@ -217,6 +224,9 @@ static struct attribute *ext4_attrs[] = {
 	ATTR_LIST(errors_count),
 	ATTR_LIST(first_error_time),
 	ATTR_LIST(last_error_time),
+#ifdef CONFIG_EXT4_DEBUG
+	ATTR_LIST(simulate_fail),
+#endif
 	NULL,
 };
 
@@ -293,6 +303,11 @@ static ssize_t ext4_attr_show(struct kobject *kobj,
 		else
 			return snprintf(buf, PAGE_SIZE, "%u\n",
 					*((unsigned int *) ptr));
+	case attr_pointer_ul:
+		if (!ptr)
+			return 0;
+		return snprintf(buf, PAGE_SIZE, "%lu\n",
+				*((unsigned long *) ptr));
 	case attr_pointer_atomic:
 		if (!ptr)
 			return 0;
@@ -334,6 +349,14 @@ static ssize_t ext4_attr_store(struct kobject *kobj,
 		else
 			*((unsigned int *) ptr) = t;
 		return len;
+	case attr_pointer_ul:
+		if (!ptr)
+			return 0;
+		ret = kstrtoul(skip_spaces(buf), 0, &t);
+		if (ret)
+			return ret;
+		*((unsigned long *) ptr) = t;
+		return len;
 	case attr_inode_readahead:
 		return inode_readahead_blks_store(sbi, buf, len);
 	case attr_trigger_test_error:
From: Muchun Song <songmuchun@bytedance.com>
mainline inclusion
from mainline-v5.8-rc3
commit 3a98990ae2150277ed34d3b248c60e68bf2244b2
category: bugfix
bugzilla: 34611
CVE: NA
-------------------------------------------------
We should put the css reference when memory allocation failed.
Link: http://lkml.kernel.org/r/20200614122653.98829-1-songmuchun@bytedance.com
Fixes: f0a3a24b532d ("mm: memcg/slab: rework non-root kmem_cache lifecycle management")
Signed-off-by: Muchun Song <songmuchun@bytedance.com>
Acked-by: Roman Gushchin <guro@fb.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
Cc: Qian Cai <cai@lca.pw>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Liu Shixin <liushixin2@huawei.com>
Reviewed-by: Kefeng Wang <wangkefeng.wang@huawei.com>
Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
---
 mm/memcontrol.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 6e5c798c7c77..be50582ae13e 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -2654,8 +2654,10 @@ static void __memcg_schedule_kmem_cache_create(struct mem_cgroup *memcg,
 		return;
 
 	cw = kmalloc(sizeof(*cw), GFP_NOWAIT | __GFP_NOWARN);
-	if (!cw)
+	if (!cw) {
+		css_put(&memcg->css);
 		return;
+	}
 
 	cw->memcg = memcg;
 	cw->cachep = cachep;
From: Roman Gushchin <guro@fb.com>
mainline inclusion
from mainline-v5.3-rc5
commit ec9f02384f6053f2a5417e82b65078edc5364a8d
category: bugfix
bugzilla: 34611
CVE: NA
-------------------------------------------------
Memcg counters for shadow nodes are broken because the memcg pointer is obtained in a wrong way. The following approach is used: virt_to_page(xa_node)->mem_cgroup
Since commit 4d96ba353075 ("mm: memcg/slab: stop setting page->mem_cgroup pointer for slab pages"), the page->mem_cgroup pointer isn't set for slab pages, so memcg_from_slab_page() should be used instead.

Also I doubt that it ever worked correctly: virt_to_head_page() should be used instead of virt_to_page(). Otherwise objects residing on tail pages are not accounted, because only the head page contains a valid mem_cgroup pointer. That has been the case since these counters were introduced by commit 68d48e6a2df5 ("mm: workingset: add vmstat counter for shadow nodes").
Link: http://lkml.kernel.org/r/20190801233532.138743-1-guro@fb.com
Fixes: 4d96ba353075 ("mm: memcg/slab: stop setting page->mem_cgroup pointer for slab pages")
Signed-off-by: Roman Gushchin <guro@fb.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: Michal Hocko <mhocko@suse.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Liu Shixin <liushixin2@huawei.com>
Reviewed-by: Kefeng Wang <wangkefeng.wang@huawei.com>
Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
---
 include/linux/memcontrol.h | 19 +++++++++++++++++++
 mm/memcontrol.c            | 20 ++++++++++++++++++++
 mm/workingset.c            |  2 +-
 3 files changed, 40 insertions(+), 1 deletion(-)
diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 8c7b83cee2d7..5559d2ac442b 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -676,6 +676,7 @@ static inline unsigned long lruvec_page_state_local(struct lruvec *lruvec,
 
 void __mod_lruvec_state(struct lruvec *lruvec, enum node_stat_item idx,
 			int val);
+void __mod_lruvec_slab_state(void *p, enum node_stat_item idx, int val);
 
 static inline void mod_lruvec_state(struct lruvec *lruvec,
 				    enum node_stat_item idx, int val)
@@ -1077,6 +1078,14 @@ static inline void mod_lruvec_page_state(struct page *page,
 	mod_node_page_state(page_pgdat(page), idx, val);
 }
 
+static inline void __mod_lruvec_slab_state(void *p, enum node_stat_item idx,
+					   int val)
+{
+	struct page *page = virt_to_head_page(p);
+
+	__mod_node_page_state(page_pgdat(page), idx, val);
+}
+
 static inline
 unsigned long mem_cgroup_soft_limit_reclaim(pg_data_t *pgdat, int order,
 					    gfp_t gfp_mask,
@@ -1158,6 +1167,16 @@ static inline void __dec_lruvec_page_state(struct page *page,
 	__mod_lruvec_page_state(page, idx, -1);
 }
 
+static inline void __inc_lruvec_slab_state(void *p, enum node_stat_item idx)
+{
+	__mod_lruvec_slab_state(p, idx, 1);
+}
+
+static inline void __dec_lruvec_slab_state(void *p, enum node_stat_item idx)
+{
+	__mod_lruvec_slab_state(p, idx, -1);
+}
+
 /* idx can be of type enum memcg_stat_item or node_stat_item */
 static inline void inc_memcg_state(struct mem_cgroup *memcg,
 				   int idx)
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index be50582ae13e..74722218768a 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -776,6 +776,26 @@ void __mod_lruvec_state(struct lruvec *lruvec, enum node_stat_item idx,
 	__this_cpu_write(pn->lruvec_stat_cpu->count[idx], x);
 }
 
+void __mod_lruvec_slab_state(void *p, enum node_stat_item idx, int val)
+{
+	struct page *page = virt_to_head_page(p);
+	pg_data_t *pgdat = page_pgdat(page);
+	struct mem_cgroup *memcg;
+	struct lruvec *lruvec;
+
+	rcu_read_lock();
+	memcg = memcg_from_slab_page(page);
+
+	/* Untracked pages have no memcg, no lruvec. Update only the node */
+	if (!memcg || memcg == root_mem_cgroup) {
+		__mod_node_page_state(pgdat, idx, val);
+	} else {
+		lruvec = mem_cgroup_lruvec(pgdat, memcg);
+		__mod_lruvec_state(lruvec, idx, val);
+	}
+	rcu_read_unlock();
+}
+
 /**
  * __count_memcg_events - account VM events in a cgroup
  * @memcg: the memory cgroup
diff --git a/mm/workingset.c b/mm/workingset.c
index 15b361bf6da6..d422f396ab64 100644
--- a/mm/workingset.c
+++ b/mm/workingset.c
@@ -502,7 +502,7 @@ static enum lru_status shadow_lru_isolate(struct list_head *item,
 	}
 	if (WARN_ON_ONCE(node->exceptional))
 		goto out_invalid;
-	inc_lruvec_page_state(virt_to_page(node), WORKINGSET_NODERECLAIM);
+	__inc_lruvec_slab_state(node, WORKINGSET_NODERECLAIM);
 	__radix_tree_delete_node(&mapping->i_pages, node,
 				 workingset_lookup_update(mapping));
From: Roman Gushchin <guro@fb.com>
mainline inclusion
from mainline-v5.6
commit 8380ce479010f2f779587b462a9b4681934297c3
category: bugfix
bugzilla: 34611
CVE: NA
-------------------------------------------------
Depending on CONFIG_VMAP_STACK and the THREAD_SIZE / PAGE_SIZE ratio the space for task stacks can be allocated using __vmalloc_node_range(), alloc_pages_node() and kmem_cache_alloc_node().
In the first and the second cases page->mem_cgroup pointer is set, but in the third it's not: memcg membership of a slab page should be determined using the memcg_from_slab_page() function, which looks at page->slab_cache->memcg_params.memcg . In this case, using mod_memcg_page_state() (as in account_kernel_stack()) is incorrect: page->mem_cgroup pointer is NULL even for pages charged to a non-root memory cgroup.
It can lead to kernel_stack per-memcg counters permanently showing 0 on some architectures (depending on the configuration).
In order to fix it, let's introduce a mod_memcg_obj_state() helper, which takes a pointer to a kernel object as its first argument, uses mem_cgroup_from_obj() to get an RCU-protected memcg pointer, and calls mod_memcg_state(). This allows all possible configurations (CONFIG_VMAP_STACK and various THREAD_SIZE/PAGE_SIZE values) to be handled without spilling any memcg/kmem specifics into fork.c.
Note: This is a special version of the patch created for stable backports. It contains code from the following two patches: - mm: memcg/slab: introduce mem_cgroup_from_obj() - mm: fork: fix kernel_stack memcg stats for various stack implementations
[guro@fb.com: introduce mem_cgroup_from_obj()]
Link: http://lkml.kernel.org/r/20200324004221.GA36662@carbon.dhcp.thefacebook.com
Fixes: 4d96ba353075 ("mm: memcg/slab: stop setting page->mem_cgroup pointer for slab pages")
Signed-off-by: Roman Gushchin <guro@fb.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Shakeel Butt <shakeelb@google.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Bharata B Rao <bharata@linux.ibm.com>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: <stable@vger.kernel.org>
Link: http://lkml.kernel.org/r/20200303233550.251375-1-guro@fb.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Liu Shixin <liushixin2@huawei.com>
Reviewed-by: Kefeng Wang <wangkefeng.wang@huawei.com>
Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
---
 include/linux/memcontrol.h | 11 +++++++++++
 kernel/fork.c              |  4 ++--
 mm/memcontrol.c            | 38 ++++++++++++++++++++++++++++++++++++++
 3 files changed, 51 insertions(+), 2 deletions(-)
diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 5559d2ac442b..f6f51b14a332 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -677,6 +677,7 @@ static inline unsigned long lruvec_page_state_local(struct lruvec *lruvec,
 void __mod_lruvec_state(struct lruvec *lruvec, enum node_stat_item idx,
 			int val);
 void __mod_lruvec_slab_state(void *p, enum node_stat_item idx, int val);
+void mod_memcg_obj_state(void *p, int idx, int val);
 
 static inline void mod_lruvec_state(struct lruvec *lruvec,
 				    enum node_stat_item idx, int val)
@@ -1086,6 +1087,10 @@ static inline void __mod_lruvec_slab_state(void *p, enum node_stat_item idx,
 	__mod_node_page_state(page_pgdat(page), idx, val);
 }
 
+static inline void mod_memcg_obj_state(void *p, int idx, int val)
+{
+}
+
 static inline
 unsigned long mem_cgroup_soft_limit_reclaim(pg_data_t *pgdat, int order,
 					    gfp_t gfp_mask,
@@ -1354,6 +1359,7 @@ extern int memcg_expand_shrinker_maps(int new_id);
 
 extern void memcg_set_shrinker_bit(struct mem_cgroup *memcg,
 				   int nid, int shrinker_id);
+struct mem_cgroup *mem_cgroup_from_obj(void *p);
 #else
 
 static inline int memcg_kmem_charge(struct page *page, gfp_t gfp, int order)
@@ -1397,6 +1403,11 @@ static inline void memcg_put_cache_ids(void)
 
 static inline void memcg_set_shrinker_bit(struct mem_cgroup *memcg,
 					  int nid, int shrinker_id) { }
+static inline struct mem_cgroup *mem_cgroup_from_obj(void *p)
+{
+	return NULL;
+}
+
 #endif /* CONFIG_MEMCG_KMEM */
 
 #endif /* _LINUX_MEMCONTROL_H */
diff --git a/kernel/fork.c b/kernel/fork.c
index 9b8628451bd1..2c7ce8f48967 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -382,8 +382,8 @@ static void account_kernel_stack(struct task_struct *tsk, int account)
 		mod_zone_page_state(page_zone(first_page), NR_KERNEL_STACK_KB,
 				    THREAD_SIZE / 1024 * account);
 
-		mod_memcg_page_state(first_page, MEMCG_KERNEL_STACK_KB,
-				     account * (THREAD_SIZE / 1024));
+		mod_memcg_obj_state(stack, MEMCG_KERNEL_STACK_KB,
+				    account * (THREAD_SIZE / 1024));
 	}
 }
 
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 74722218768a..9c7d4aff03d1 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -796,6 +796,17 @@ void __mod_lruvec_slab_state(void *p, enum node_stat_item idx, int val)
 	rcu_read_unlock();
 }
 
+void mod_memcg_obj_state(void *p, int idx, int val)
+{
+	struct mem_cgroup *memcg;
+
+	rcu_read_lock();
+	memcg = mem_cgroup_from_obj(p);
+	if (memcg)
+		mod_memcg_state(memcg, idx, val);
+	rcu_read_unlock();
+}
+
 /**
  * __count_memcg_events - account VM events in a cgroup
  * @memcg: the memory cgroup
@@ -2598,6 +2609,33 @@ static void commit_charge(struct page *page, struct mem_cgroup *memcg,
 }
 
 #ifdef CONFIG_MEMCG_KMEM
+/*
+ * Returns a pointer to the memory cgroup to which the kernel object is charged.
+ *
+ * The caller must ensure the memcg lifetime, e.g. by taking rcu_read_lock(),
+ * cgroup_mutex, etc.
+ */
+struct mem_cgroup *mem_cgroup_from_obj(void *p)
+{
+	struct page *page;
+
+	if (mem_cgroup_disabled())
+		return NULL;
+
+	page = virt_to_head_page(p);
+
+	/*
+	 * Slab pages don't have page->mem_cgroup set because corresponding
+	 * kmem caches can be reparented during the lifetime. That's why
+	 * memcg_from_slab_page() should be used instead.
+	 */
+	if (PageSlab(page))
+		return memcg_from_slab_page(page);
+
+	/* All other pages use page->mem_cgroup */
+	return page->mem_cgroup;
+}
+
 static int memcg_alloc_cache_id(void)
 {
 	int id, size;
From: Yafang Shao <laoar.shao@gmail.com>
mainline inclusion
from mainline-v5.8-rc1
commit a6f5576bb195c3b7508e3e1c98d2dcf6691f96e8
category: bugfix
bugzilla: 34611
CVE: NA
-------------------------------------------------
There's a new workingset counter introduced in commit 1899ad18c607 ("mm: workingset: tell cache transitions from workingset thrashing"). With the help of this counter we can know whether the workingset is transitioning or thrashing. To bring the benefit of this counter to memcg, we should expose it in memory.stat; then we can better understand the workingset of the workload inside a memcg.
Below is the verification of this new counter in memory.stat. Read a file into memory and then read it again to make these pages active. The size of the file is 1G, and memory.max is greater than the file size. The counters in memory.stat will be:
inactive_file 0
active_file 1073639424

workingset_refault 0
workingset_activate 0
workingset_restore 0
workingset_nodereclaim 0
Trigger memcg reclaim by setting a lower value for memory.high; some pages will then be demoted to the inactive list, and some pages on the inactive list will be evicted to storage.
inactive_file 498094080
active_file 310063104

workingset_refault 0
workingset_activate 0
workingset_restore 0
workingset_nodereclaim 0
Then restore memory.high and read the file into memory again. As a result, the transition occurs. Below is the result of this transition:
inactive_file 498094080
active_file 575397888

workingset_refault 64746
workingset_activate 64746
workingset_restore 64746
workingset_nodereclaim 0
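The sequence above can be reproduced with something like the following (cgroup name, file path and the intermediate limit are illustrative, assuming cgroup v2 is mounted at /sys/fs/cgroup):

  # mkdir /sys/fs/cgroup/test
  # echo $$ > /sys/fs/cgroup/test/cgroup.procs
  # cat /mnt/file_1g > /dev/null                  # populate the page cache
  # cat /mnt/file_1g > /dev/null                  # second read activates the pages
  # echo 512M > /sys/fs/cgroup/test/memory.high   # trigger memcg reclaim
  # echo max > /sys/fs/cgroup/test/memory.high    # restore the limit
  # cat /mnt/file_1g > /dev/null                  # refault: workingset_restore grows
  # grep workingset /sys/fs/cgroup/test/memory.stat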
Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Michal Hocko <mhocko@suse.com>
Acked-by: Chris Down <chris@chrisdown.name>
Cc: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Shakeel Butt <shakeelb@google.com>
Link: http://lkml.kernel.org/r/20200504153522.11553-1-laoar.shao@gmail.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Liu Shixin <liushixin2@huawei.com>
Reviewed-by: Kefeng Wang <wangkefeng.wang@huawei.com>
Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
---
 Documentation/admin-guide/cgroup-v2.rst | 4 ++++
 mm/memcontrol.c                         | 2 ++
 2 files changed, 6 insertions(+)
diff --git a/Documentation/admin-guide/cgroup-v2.rst b/Documentation/admin-guide/cgroup-v2.rst
index 8305db50fa04..8c1fe548fee8 100644
--- a/Documentation/admin-guide/cgroup-v2.rst
+++ b/Documentation/admin-guide/cgroup-v2.rst
@@ -1214,6 +1214,10 @@ PAGE_SIZE multiple when read back.
 
 		Number of refaulted pages that were immediately activated
 
+	  workingset_restore
+		Number of restored pages which have been detected as an active
+		workingset before they got reclaimed.
+
 	  workingset_nodereclaim
 
 		Number of times a shadow node has been reclaimed
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 9c7d4aff03d1..336498908b2d 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -5991,6 +5991,8 @@ static int memory_stat_show(struct seq_file *m, void *v)
 		       memcg_page_state(memcg, WORKINGSET_REFAULT));
 	seq_printf(m, "workingset_activate %lu\n",
 		   memcg_page_state(memcg, WORKINGSET_ACTIVATE));
+	seq_printf(m, "workingset_restore %lu\n",
+		   memcg_page_state(memcg, WORKINGSET_RESTORE));
 	seq_printf(m, "workingset_nodereclaim %lu\n",
 		   memcg_page_state(memcg, WORKINGSET_NODERECLAIM));
From: Michal Hocko <mhocko@suse.com>
mainline inclusion
from mainline-v5.9-rc1
commit de3f32e1424ca5c751d4cd78cbed6d17ca261c24
category: bugfix
bugzilla: 34611
CVE: NA
-------------------------------------------------
There are at least two stale notes in the oom section. The 3% discount for root processes has been gone since d46078b28889 ("mm, oom: remove 3% bonus for CAP_SYS_ADMIN processes").

Likewise, children of the selected oom victim are no longer sacrificed, since bbbe48029720 ("mm, oom: remove 'prefer children over parent' heuristic").
Drop both of them.
Signed-off-by: Michal Hocko <mhocko@suse.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: David Rientjes <rientjes@google.com>
Cc: Yafang Shao <laoar.shao@gmail.com>
Link: http://lkml.kernel.org/r/20200709062603.18480-1-mhocko@kernel.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Liu Shixin <liushixin2@huawei.com>
Reviewed-by: Kefeng Wang <wangkefeng.wang@huawei.com>
Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
---
 Documentation/filesystems/proc.txt | 8 --------
 1 file changed, 8 deletions(-)
diff --git a/Documentation/filesystems/proc.txt b/Documentation/filesystems/proc.txt
index cd465304bec4..3f6e1b516315 100644
--- a/Documentation/filesystems/proc.txt
+++ b/Documentation/filesystems/proc.txt
@@ -1494,9 +1494,6 @@ may allocate from based on an estimation of its current memory and swap use.
 For example, if a task is using all allowed memory, its badness score will be
 1000. If it is using half of its allowed memory, its score will be 500.
 
-There is an additional factor included in the badness score: the current memory
-and swap usage is discounted by 3% for root processes.
-
 The amount of "allowed" memory depends on the context in which the oom killer
 was called. If it is due to the memory assigned to the allocating task's cpuset
 being exhausted, the allowed memory represents the set of mems assigned to that
@@ -1532,11 +1529,6 @@ The value of /proc/<pid>/oom_score_adj may be reduced no lower than the last
 value set by a CAP_SYS_RESOURCE process. To reduce the value any lower
 requires CAP_SYS_RESOURCE.
 
-Caveat: when a parent task is selected, the oom killer will sacrifice any first
-generation children with separate address spaces instead, if possible. This
-avoids servers and important system daemons from being killed and loses the
-minimal amount of work.
-
 
 3.2 /proc/<pid>/oom_score - Display current oom-killer score
 -------------------------------------------------------------
From: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
mainline inclusion
from mainline-v5.3-rc1
commit 2c207985f354dfb549e5a543102a3e084eea81f6
category: bugfix
bugzilla: 34611
CVE: NA
-------------------------------------------------
Since commit bbbe48029720 ("mm, oom: remove 'prefer children over parent' heuristic") removed the
"%s: Kill process %d (%s) score %u or sacrifice child\n"
line, oc->chosen_points is no longer used after select_bad_process().
Link: http://lkml.kernel.org/r/1560853435-15575-1-git-send-email-penguin-kernel@I-...
Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: Roman Gushchin <guro@fb.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: David Rientjes <rientjes@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Liu Shixin <liushixin2@huawei.com>
Reviewed-by: Kefeng Wang <wangkefeng.wang@huawei.com>
Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
---
 mm/oom_kill.c | 2 --
 1 file changed, 2 deletions(-)
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 55410ef265d6..dee2f9d50e36 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -380,8 +380,6 @@ static void select_bad_process(struct oom_control *oc)
 			break;
 		rcu_read_unlock();
 	}
-
-	oc->chosen_points = oc->chosen_points * 1000 / oc->totalpages;
 }
 
 /**
From: Joonsoo Kim <iamjoonsoo.kim@lge.com>
mainline inclusion
from mainline-v5.9-rc1
commit 8510e69c8efef82f2b37ea3e8ea19a27122c533e
category: bugfix
bugzilla: 34611
CVE: NA
-------------------------------------------------
Currently, the memalloc_nocma_{save/restore} API, which prevents allocation from the CMA area, is implemented by using current_gfp_context(). However, this implementation has two problems.
First, it doesn't work for the allocation fastpath. The fastpath uses the original gfp_mask, because current_gfp_context() was introduced to control reclaim and is only applied on the slowpath. So pages from the CMA area can be allocated through the fastpath even if the memalloc_nocma_{save/restore} APIs are used. Currently there is just one user of these APIs, and it has a fallback method to prevent an actual problem. Second, clearing __GFP_MOVABLE in current_gfp_context() has the side effect of also excluding memory on ZONE_MOVABLE as an allocation target.
To fix these problems, this patch changes how the CMA area is excluded from page allocation. The main point of this change is to use alloc_flags: alloc_flags is the mechanism used to control allocation, so it is the right place to exclude the CMA area.
Fixes: d7fefcc8de91 (mm/cma: add PF flag to force non cma alloc)
Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Roman Gushchin <guro@fb.com>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: "Aneesh Kumar K . V" <aneesh.kumar@linux.ibm.com>
Link: http://lkml.kernel.org/r/1595468942-29687-1-git-send-email-iamjoonsoo.kim@lg...
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Liu Shixin <liushixin2@huawei.com>
Reviewed-by: Kefeng Wang <wangkefeng.wang@huawei.com>
Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
---
 include/linux/sched/mm.h |  8 +-------
 mm/page_alloc.c          | 27 +++++++++++++++++++--------
 2 files changed, 20 insertions(+), 15 deletions(-)
diff --git a/include/linux/sched/mm.h b/include/linux/sched/mm.h
index c04db712c5d2..9457e8dd06c6 100644
--- a/include/linux/sched/mm.h
+++ b/include/linux/sched/mm.h
@@ -175,12 +175,10 @@ static inline bool in_vfork(struct task_struct *tsk)
  * Applies per-task gfp context to the given allocation flags.
  * PF_MEMALLOC_NOIO implies GFP_NOIO
  * PF_MEMALLOC_NOFS implies GFP_NOFS
- * PF_MEMALLOC_NOCMA implies no allocation from CMA region.
  */
 static inline gfp_t current_gfp_context(gfp_t flags)
 {
-	if (unlikely(current->flags &
-		     (PF_MEMALLOC_NOIO | PF_MEMALLOC_NOFS | PF_MEMALLOC_NOCMA))) {
+	if (unlikely(current->flags & (PF_MEMALLOC_NOIO | PF_MEMALLOC_NOFS))) {
 		/*
 		 * NOIO implies both NOIO and NOFS and it is a weaker context
 		 * so always make sure it makes precedence
@@ -189,10 +187,6 @@ static inline gfp_t current_gfp_context(gfp_t flags)
 			flags &= ~(__GFP_IO | __GFP_FS);
 		else if (current->flags & PF_MEMALLOC_NOFS)
 			flags &= ~__GFP_FS;
-#ifdef CONFIG_CMA
-		if (current->flags & PF_MEMALLOC_NOCMA)
-			flags &= ~__GFP_MOVABLE;
-#endif
 	}
 	return flags;
 }
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 1cc5dfa418d0..553875714352 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -3338,6 +3338,20 @@ static bool zone_allows_reclaim(struct zone *local_zone, struct zone *zone)
 }
 #endif	/* CONFIG_NUMA */
 
+static inline unsigned int current_alloc_flags(gfp_t gfp_mask,
+					unsigned int alloc_flags)
+{
+#ifdef CONFIG_CMA
+	unsigned int pflags = current->flags;
+
+	if (!(pflags & PF_MEMALLOC_NOCMA) &&
+		gfpflags_to_migratetype(gfp_mask) == MIGRATE_MOVABLE)
+		alloc_flags |= ALLOC_CMA;
+
+#endif
+	return alloc_flags;
+}
+
 /*
  * get_page_from_freelist goes through the zonelist trying to allocate
  * a page.
@@ -3974,10 +3988,8 @@ gfp_to_alloc_flags(gfp_t gfp_mask)
 	} else if (unlikely(rt_task(current)) && !in_interrupt())
 		alloc_flags |= ALLOC_HARDER;
 
-#ifdef CONFIG_CMA
-	if (gfpflags_to_migratetype(gfp_mask) == MIGRATE_MOVABLE)
-		alloc_flags |= ALLOC_CMA;
-#endif
+	alloc_flags = current_alloc_flags(gfp_mask, alloc_flags);
+
 	return alloc_flags;
 }
 
@@ -4268,7 +4280,7 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
 
 	reserve_flags = __gfp_pfmemalloc_flags(gfp_mask);
 	if (reserve_flags)
-		alloc_flags = reserve_flags;
+		alloc_flags = current_alloc_flags(gfp_mask, reserve_flags);
 
 	/*
 	 * Reset the nodemask and zonelist iterators if memory policies can be
@@ -4345,7 +4357,7 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
 
 	/* Avoid allocations with no watermarks from looping endlessly */
 	if (tsk_is_oom_victim(current) &&
-	    (alloc_flags == ALLOC_OOM ||
+	    (alloc_flags & ALLOC_OOM ||
 	     (gfp_mask & __GFP_NOMEMALLOC)))
 		goto nopage;
 
@@ -4441,8 +4453,7 @@ static inline bool prepare_alloc_pages(gfp_t gfp_mask, unsigned int order,
 	if (should_fail_alloc_page(gfp_mask, order))
 		return false;
 
-	if (IS_ENABLED(CONFIG_CMA) && ac->migratetype == MIGRATE_MOVABLE)
-		*alloc_flags |= ALLOC_CMA;
+	*alloc_flags = current_alloc_flags(gfp_mask, *alloc_flags);
 
 	return true;
 }
From: Joonsoo Kim <iamjoonsoo.kim@lge.com>
mainline inclusion
from mainline-v5.9-rc8
commit 1d91df85f399adbe4f318f3e74ac5a5d84c0ca7c
category: bugfix
bugzilla: 34611
CVE: NA
-------------------------------------------------
The memalloc_nocma_{save/restore} APIs can be used to skip page allocation from the CMA area, but there is a missing case, and a page on the CMA area can still be allocated even when the APIs are used. This patch handles that case to fix the potential issue.
For now, these APIs are used to prevent long-term pinning of CMA pages. When long-term pinning is requested on a CMA page, it is migrated to a non-CMA page before pinning. This non-CMA page is allocated by using the memalloc_nocma_{save/restore} APIs. If the APIs don't work as intended, a CMA page is allocated and pinned for a long time. Such a long-term pin on a CMA page causes cma_alloc() failures and can result in wrong behaviour in the device driver that uses cma_alloc().
The missing case is allocation from the pcplist. The MIGRATE_MOVABLE pcplist can contain pages from the CMA area, so we need to skip it if ALLOC_CMA isn't specified.
Fixes: 8510e69c8efe (mm/page_alloc: fix memalloc_nocma_{save/restore} APIs)
Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: "Aneesh Kumar K . V" <aneesh.kumar@linux.ibm.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Link: https://lkml.kernel.org/r/1601429472-12599-1-git-send-email-iamjoonsoo.kim@l...
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Liu Shixin <liushixin2@huawei.com>
Reviewed-by: Kefeng Wang <wangkefeng.wang@huawei.com>
Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
---
 mm/page_alloc.c | 21 +++++++++++++++++----
 1 file changed, 17 insertions(+), 4 deletions(-)
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 553875714352..3b2d73f1b233 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -3076,9 +3076,16 @@ struct page *rmqueue(struct zone *preferred_zone,
 	struct page *page;
 
 	if (likely(order == 0)) {
-		page = rmqueue_pcplist(preferred_zone, zone, order,
-				       gfp_flags, migratetype);
-		goto out;
+		/*
+		 * MIGRATE_MOVABLE pcplist could have the pages on CMA area and
+		 * we need to skip it when CMA area isn't allowed.
+		 */
+		if (!IS_ENABLED(CONFIG_CMA) || alloc_flags & ALLOC_CMA ||
+				migratetype != MIGRATE_MOVABLE) {
+			page = rmqueue_pcplist(preferred_zone, zone, order,
+					gfp_flags, migratetype);
+			goto out;
+		}
 	}
 
 	/*
@@ -3090,7 +3097,13 @@ struct page *rmqueue(struct zone *preferred_zone,
 
 	do {
 		page = NULL;
-		if (alloc_flags & ALLOC_HARDER) {
+		/*
+		 * order-0 request can reach here when the pcplist is skipped
+		 * due to non-CMA allocation context. HIGHATOMIC area is
+		 * reserved for high-order atomic allocation, so order-0
+		 * request should skip it.
+		 */
+		if (order > 0 && alloc_flags & ALLOC_HARDER) {
 			page = __rmqueue_smallest(zone, order, MIGRATE_HIGHATOMIC);
 			if (page)
 				trace_mm_page_alloc_zone_locked(page, order, migratetype);