From: Baisong Zhong zhongbaisong@huawei.com
mainline inclusion
from mainline-v6.2-rc1
commit 0ed554fd769a19ea8464bb83e9ac201002ef74ad
category: bugfix
bugzilla: https://gitee.com/src-openeuler/kernel/issues/I6NCQH
CVE: CVE-2023-28328
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i...
--------------------------------
Wei Chen reports a kernel bug as below:
general protection fault, probably for non-canonical address
KASAN: null-ptr-deref in range [0x0000000000000010-0x0000000000000017]
...
Call Trace:
 <TASK>
 __i2c_transfer+0x77e/0x1930 drivers/i2c/i2c-core-base.c:2109
 i2c_transfer+0x1d5/0x3d0 drivers/i2c/i2c-core-base.c:2170
 i2cdev_ioctl_rdwr+0x393/0x660 drivers/i2c/i2c-dev.c:297
 i2cdev_ioctl+0x75d/0x9f0 drivers/i2c/i2c-dev.c:458
 vfs_ioctl fs/ioctl.c:51 [inline]
 __do_sys_ioctl fs/ioctl.c:870 [inline]
 __se_sys_ioctl+0xfb/0x170 fs/ioctl.c:856
 do_syscall_x64 arch/x86/entry/common.c:50 [inline]
 do_syscall_64+0x3d/0x90 arch/x86/entry/common.c:80
 entry_SYSCALL_64_after_hwframe+0x63/0xcd
RIP: 0033:0x7fd834a8bded
In az6027_i2c_xfer(), if msg[i].addr is 0x99, a null-ptr-deref will be caused when accessing msg[i].buf, because msg[i].len is 0 and msg[i].buf is NULL.
Fix this by checking msg[i].len in az6027_i2c_xfer().
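The check can be illustrated outside the kernel with a minimal sketch. The struct demo_msg type and the demo_first_byte() helper are hypothetical stand-ins invented for this example (they are not the driver's code); only the order of operations, length check first and buffer access second, mirrors the fix.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for struct i2c_msg; only the fields used here. */
struct demo_msg {
	uint16_t addr;
	uint16_t len;
	uint8_t *buf;
};

/*
 * Return the first payload byte, or -EOPNOTSUPP when the message carries
 * no data. The length check must come before any access to buf, because
 * a zero-length message may have buf == NULL.
 */
static int demo_first_byte(const struct demo_msg *msg)
{
	assert(msg != NULL);

	if (msg->len < 1)
		return -EOPNOTSUPP;

	return msg->buf[0] & 0x00ff;
}
```

With a message of len 0 and buf NULL, the helper returns -EOPNOTSUPP without touching buf; with a one-byte payload it returns that byte.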
Link: https://lore.kernel.org/lkml/CAO4mrfcPHB5aQJO=mpqV+p8mPLNg-Fok0gw8gZ=zemAfMG...
Link: https://lore.kernel.org/linux-media/20221120065918.2160782-1-zhongbaisong@hu...
Fixes: 76f9a820c867 ("V4L/DVB: AZ6027: Initial import of the driver")
Reported-by: Wei Chen harperchen1110@gmail.com
Signed-off-by: Baisong Zhong zhongbaisong@huawei.com
Signed-off-by: Mauro Carvalho Chehab mchehab@kernel.org
Signed-off-by: ZhangPeng zhangpeng362@huawei.com
Reviewed-by: Kefeng Wang wangkefeng.wang@huawei.com
Reviewed-by: Nanyong Sun sunnanyong@huawei.com
Reviewed-by: Xiu Jianfeng xiujianfeng@huawei.com
Signed-off-by: Yongqiang Liu liuyongqiang13@huawei.com
---
 drivers/media/usb/dvb-usb/az6027.c | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/drivers/media/usb/dvb-usb/az6027.c b/drivers/media/usb/dvb-usb/az6027.c
index 6321b8e30261..555c8ac44881 100644
--- a/drivers/media/usb/dvb-usb/az6027.c
+++ b/drivers/media/usb/dvb-usb/az6027.c
@@ -977,6 +977,10 @@ static int az6027_i2c_xfer(struct i2c_adapter *adap, struct i2c_msg msg[], int n
 		if (msg[i].addr == 0x99) {
 			req = 0xBE;
 			index = 0;
+			if (msg[i].len < 1) {
+				i = -EOPNOTSUPP;
+				break;
+			}
 			value = msg[i].buf[0] & 0x00ff;
 			length = 1;
 			az6027_usb_out_op(d, req, value, index, data, length);
From: ZhaoLong Wang wangzhaolong1@huawei.com
mainline inclusion
from mainline-v6.2-rc8
commit aa5465aeca3c66fecdf7efcf554aed79b4c4b211
category: bugfix
bugzilla: 188381, https://gitee.com/openeuler/kernel/issues/I644ST
CVE: NA
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i...
------------------------------------------------------
When the network status is unstable, a use-after-free may occur when reading data from the server.
BUG: KASAN: use-after-free in readpages_fill_pages+0x14c/0x7e0
Call Trace:
 <TASK>
 dump_stack_lvl+0x38/0x4c
 print_report+0x16f/0x4a6
 kasan_report+0xb7/0x130
 readpages_fill_pages+0x14c/0x7e0
 cifs_readv_receive+0x46d/0xa40
 cifs_demultiplex_thread+0x121c/0x1490
 kthread+0x16b/0x1a0
 ret_from_fork+0x2c/0x50
 </TASK>

Allocated by task 2535:
 kasan_save_stack+0x22/0x50
 kasan_set_track+0x25/0x30
 __kasan_kmalloc+0x82/0x90
 cifs_readdata_direct_alloc+0x2c/0x110
 cifs_readdata_alloc+0x2d/0x60
 cifs_readahead+0x393/0xfe0
 read_pages+0x12f/0x470
 page_cache_ra_unbounded+0x1b1/0x240
 filemap_get_pages+0x1c8/0x9a0
 filemap_read+0x1c0/0x540
 cifs_strict_readv+0x21b/0x240
 vfs_read+0x395/0x4b0
 ksys_read+0xb8/0x150
 do_syscall_64+0x3f/0x90
 entry_SYSCALL_64_after_hwframe+0x72/0xdc

Freed by task 79:
 kasan_save_stack+0x22/0x50
 kasan_set_track+0x25/0x30
 kasan_save_free_info+0x2e/0x50
 __kasan_slab_free+0x10e/0x1a0
 __kmem_cache_free+0x7a/0x1a0
 cifs_readdata_release+0x49/0x60
 process_one_work+0x46c/0x760
 worker_thread+0x2a4/0x6f0
 kthread+0x16b/0x1a0
 ret_from_fork+0x2c/0x50

Last potentially related work creation:
 kasan_save_stack+0x22/0x50
 __kasan_record_aux_stack+0x95/0xb0
 insert_work+0x2b/0x130
 __queue_work+0x1fe/0x660
 queue_work_on+0x4b/0x60
 smb2_readv_callback+0x396/0x800
 cifs_abort_connection+0x474/0x6a0
 cifs_reconnect+0x5cb/0xa50
 cifs_readv_from_socket.cold+0x22/0x6c
 cifs_read_page_from_socket+0xc1/0x100
 readpages_fill_pages.cold+0x2f/0x46
 cifs_readv_receive+0x46d/0xa40
 cifs_demultiplex_thread+0x121c/0x1490
 kthread+0x16b/0x1a0
 ret_from_fork+0x2c/0x50
The following function calls will cause UAF of the rdata pointer.
readpages_fill_pages
 cifs_read_page_from_socket
  cifs_readv_from_socket
   cifs_reconnect
    __cifs_reconnect
     cifs_abort_connection
      mid->callback() --> smb2_readv_callback
       queue_work(&rdata->work)  # if the worker completes first,
                                 # the rdata is freed
          cifs_readv_complete
            kref_put
              cifs_readdata_release
                kfree(rdata)
 return rdata->...               # UAF in readpages_fill_pages()
Similarly, this problem also occurs in uncached_fill_pages().

Fix this by adjusting the order of the conditions in the return statement, so that result is tested before rdata is dereferenced.
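Why the order matters can be shown with a small user-space sketch. The struct demo_rdata type and both functions are hypothetical names made up for this illustration; the return expression has the same shape as the reordered one in the patch. Because && evaluates its left operand first and short-circuits, rdata is not touched at all when result is -ECONNABORTED.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical stand-in for struct cifs_readdata. */
struct demo_rdata {
	int got_bytes;
};

/*
 * Testing result first means that, on -ECONNABORTED, the && operator
 * short-circuits and rdata is never dereferenced.
 */
static int demo_fill_result(const struct demo_rdata *rdata, int result)
{
	return result != -ECONNABORTED && rdata->got_bytes > 0 ?
						rdata->got_bytes : result;
}

/* Usage: NULL models an rdata that must no longer be accessed. */
static void demo_usage(void)
{
	struct demo_rdata r = { .got_bytes = 512 };

	assert(demo_fill_result(&r, 0) == 512);
	assert(demo_fill_result(NULL, -ECONNABORTED) == -ECONNABORTED);
}
```

With the operands in the opposite order, the second call in demo_usage() would dereference the NULL pointer before looking at result.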
Signed-off-by: ZhaoLong Wang wangzhaolong1@huawei.com
Cc: stable@vger.kernel.org
Acked-by: Paulo Alcantara (SUSE) pc@cjr.nz
Signed-off-by: Steve French stfrench@microsoft.com
Reviewed-by: Zhang Xiaoxu zhangxiaoxu5@huawei.com
Signed-off-by: Yongqiang Liu liuyongqiang13@huawei.com
---
 fs/cifs/file.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/fs/cifs/file.c b/fs/cifs/file.c
index facd4315ef56..bc3d0d76c2c4 100644
--- a/fs/cifs/file.c
+++ b/fs/cifs/file.c
@@ -3161,7 +3161,7 @@ uncached_fill_pages(struct TCP_Server_Info *server,
 		rdata->got_bytes += result;
 	}

-	return rdata->got_bytes > 0 && result != -ECONNABORTED ?
+	return result != -ECONNABORTED && rdata->got_bytes > 0 ?
 						rdata->got_bytes : result;
 }

@@ -3747,7 +3747,7 @@ readpages_fill_pages(struct TCP_Server_Info *server,
 		rdata->got_bytes += result;
 	}

-	return rdata->got_bytes > 0 && result != -ECONNABORTED ?
+	return result != -ECONNABORTED && rdata->got_bytes > 0 ?
 						rdata->got_bytes : result;
 }
From: Dan Carpenter dan.carpenter@oracle.com
stable inclusion
from stable-v4.19.102
commit 732ecd4aad51d336b49b9be431219d173ac826c8
category: bugfix
bugzilla: https://gitee.com/openeuler/kernel/issues/I6L0EC
CVE: NA
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?id=...
--------------------------------
commit c7a91bc7c2e17e0a9c8b9745a2cb118891218fd1 upstream.
What we are trying to do is change the '=' character to a NUL terminator and then at the end of the function we restore it back to an '='. The problem is there are two error paths where we jump to the end of the function before we have replaced the '=' with NUL.
We end up putting the '=' in the wrong place (possibly one element before the start of the buffer).
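The split-and-restore pattern described above can be sketched in user space as follows. demo_parse() and the mode name it compares against are hypothetical and much simpler than mpol_parse_str(); the point is only that the '=' is replaced before the first error path, so the restore at the end always writes to the right byte.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical parser: returns 0 if the mode part (the text before '=')
 * is "interleave", 1 otherwise. str is modified while parsing and is
 * restored before returning.
 */
static int demo_parse(char *str)
{
	char *flags;
	int err = 1;

	assert(str != NULL);
	flags = strchr(str, '=');

	/* Split first, so every later "goto out" sees an advanced pointer. */
	if (flags)
		*flags++ = '\0';

	if (strcmp(str, "interleave") != 0)
		goto out;	/* early error path */

	err = 0;
out:
	/* flags points one past the NUL, so flags - 1 is the original '='. */
	if (flags)
		*--flags = '=';
	return err;
}
```

If the split were done after the early error path instead, the restore would write '=' one byte before the position of the original '=', which is the misplacement the commit message describes.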
Link: http://lkml.kernel.org/r/20200115055426.vdjwvry44nfug7yy@kili.mountain
Reported-by: syzbot+e64a13c5369a194d67df@syzkaller.appspotmail.com
Fixes: 095f1fc4ebf3 ("mempolicy: rework shmem mpol parsing and display")
Signed-off-by: Dan Carpenter dan.carpenter@oracle.com
Acked-by: Vlastimil Babka vbabka@suse.cz Dmitry Vyukov dvyukov@google.com
Cc: Michal Hocko mhocko@kernel.org
Cc: Dan Carpenter dan.carpenter@oracle.com
Cc: Lee Schermerhorn lee.schermerhorn@hp.com
Cc: Andrea Arcangeli aarcange@redhat.com
Cc: Hugh Dickins hughd@google.com
Cc: stable@vger.kernel.org
Signed-off-by: Andrew Morton akpm@linux-foundation.org
Signed-off-by: Linus Torvalds torvalds@linux-foundation.org
Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org
Signed-off-by: ZhangPeng zhangpeng362@huawei.com
Reviewed-by: Kefeng Wang wangkefeng.wang@huawei.com
Reviewed-by: tong tiangen tongtiangen@huawei.com
Signed-off-by: Yongqiang Liu liuyongqiang13@huawei.com
---
 mm/mempolicy.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index 4cac46d56f38..4769ed2ed7f3 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -2886,6 +2886,9 @@ int mpol_parse_str(char *str, struct mempolicy **mpol)
 	char *flags = strchr(str, '=');
 	int err = 1;

+	if (flags)
+		*flags++ = '\0';	/* terminate mode string */
+
 	if (nodelist) {
 		/* NUL-terminate mode or flags string */
 		*nodelist++ = '\0';
@@ -2896,9 +2899,6 @@ int mpol_parse_str(char *str, struct mempolicy **mpol)
 	} else
 		nodes_clear(nodes);

-	if (flags)
-		*flags++ = '\0';	/* terminate mode string */
-
 	for (mode = 0; mode < MPOL_MAX; mode++) {
 		if (!strcmp(str, policy_modes[mode])) {
 			break;
From: Ming Lei ming.lei@redhat.com
mainline inclusion
from mainline-v5.5-rc5
commit 85a8ce62c2eabe28b9d76ca4eecf37922402df93
category: bugfix
bugzilla: https://gitee.com/openeuler/kernel/issues/I6GTUI
CVE: NA
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i...
--------------------------------
Some filesystems, such as vfat, may send a bio that crosses the device boundary, and worse, an IO request starting within the device boundary can contain more than one segment past EOD.

Commit dce30ca9e3b6 ("fs: fix guard_bio_eod to check for real EOD errors") tries to fix this issue by returning -EIO for this situation. However, this way fs user code loses the chance to handle -EIO, and sync_inodes_sb() may then hang forever.

Also, the current truncation of the last segment is dangerous because it updates the last bvec: the bvec table is no longer immutable, and fs bio users may not be able to retrieve the truncated pages via bio_for_each_segment_all() in their .end_io callback.

Fix this issue by supporting multi-segment truncation. The approach is also simpler:
- just update bio size since block layer can make correct bvec with the updated bio size. Then bvec table becomes really immutable.
- zero all truncated segments for read bio
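The two steps above can be modelled in user space with a rough sketch. The struct demo_seg type and demo_truncate() are hypothetical stand-ins for a bio_vec table and for the truncation helper; they are not the kernel implementation. The sketch adds each segment's own starting offset when zeroing, which is an assumption of this model, and it leaves the segment table itself untouched, only returning the new size.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for a bio_vec: buffer, start offset, length. */
struct demo_seg {
	unsigned char *page;
	size_t offset;
	size_t len;
};

/*
 * Zero every byte that lies past new_size, walking all segments, and
 * return the new total size. The segment table is never modified.
 */
static size_t demo_truncate(const struct demo_seg *segs, size_t nsegs,
			    size_t total, size_t new_size)
{
	size_t done = 0;

	assert(segs != NULL);

	if (new_size >= total)
		return total;

	for (size_t i = 0; i < nsegs; i++) {
		if (done + segs[i].len > new_size) {
			/* Only the first affected segment keeps leading bytes. */
			size_t keep = done < new_size ? new_size - done : 0;

			memset(segs[i].page + segs[i].offset + keep, 0,
			       segs[i].len - keep);
		}
		done += segs[i].len;
	}
	return new_size;
}
```

For two 4-byte segments truncated to 3 bytes, the first segment keeps its first 3 bytes and has the last one zeroed, and the second segment is zeroed completely.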
Cc: Carlos Maiolino cmaiolino@redhat.com
Cc: linux-fsdevel@vger.kernel.org
Fixed-by: dce30ca9e3b6 ("fs: fix guard_bio_eod to check for real EOD errors")
Reported-by: syzbot+2b9e54155c8c25d8d165@syzkaller.appspotmail.com
Signed-off-by: Ming Lei ming.lei@redhat.com
Signed-off-by: Jens Axboe axboe@kernel.dk
Signed-off-by: Yu Kuai yukuai3@huawei.com
Reviewed-by: Hou Tao houtao1@huawei.com
Signed-off-by: Yongqiang Liu liuyongqiang13@huawei.com
---
 block/bio.c         | 39 +++++++++++++++++++++++++++++++++++++++
 fs/buffer.c         | 22 +---------------------
 include/linux/bio.h |  1 +
 3 files changed, 41 insertions(+), 21 deletions(-)
diff --git a/block/bio.c b/block/bio.c
index 21c56c177b25..acefd2be1cd9 100644
--- a/block/bio.c
+++ b/block/bio.c
@@ -547,6 +547,45 @@ void zero_fill_bio_iter(struct bio *bio, struct bvec_iter start)
 }
 EXPORT_SYMBOL(zero_fill_bio_iter);

+void bio_truncate(struct bio *bio, unsigned new_size)
+{
+	struct bio_vec bv;
+	struct bvec_iter iter;
+	unsigned int done = 0;
+	bool truncated = false;
+
+	if (new_size >= bio->bi_iter.bi_size)
+		return;
+
+	if (bio_data_dir(bio) != READ)
+		goto exit;
+
+	bio_for_each_segment(bv, bio, iter) {
+		if (done + bv.bv_len > new_size) {
+			unsigned offset;
+
+			if (!truncated)
+				offset = new_size - done;
+			else
+				offset = 0;
+			zero_user(bv.bv_page, offset, bv.bv_len - offset);
+			truncated = true;
+		}
+		done += bv.bv_len;
+	}
+
+ exit:
+	/*
+	 * Don't touch bvec table here and make it really immutable, since
+	 * fs bio user has to retrieve all pages via bio_for_each_segment_all
+	 * in its .end_bio() callback.
+	 *
+	 * It is enough to truncate bio by updating .bi_size since we can make
+	 * correct bvec with the updated .bi_size for drivers.
+	 */
+	bio->bi_iter.bi_size = new_size;
+}
+
 /**
  * bio_put - release a reference to a bio
  * @bio: bio to release reference to
diff --git a/fs/buffer.c b/fs/buffer.c
index ea8a7b6efdf5..6be882a23758 100644
--- a/fs/buffer.c
+++ b/fs/buffer.c
@@ -2993,8 +2993,6 @@ static void end_bio_bh_io_sync(struct bio *bio)
 void guard_bio_eod(int op, struct bio *bio)
 {
 	sector_t maxsector;
-	struct bio_vec *bvec = bio_last_bvec_all(bio);
-	unsigned truncated_bytes;
 	struct hd_struct *part;

 	rcu_read_lock();
@@ -3020,25 +3018,7 @@ void guard_bio_eod(int op, struct bio *bio)
 	if (likely((bio->bi_iter.bi_size >> 9) <= maxsector))
 		return;

-	/* Uhhuh. We've got a bio that straddles the device size! */
-	truncated_bytes = bio->bi_iter.bi_size - (maxsector << 9);
-
-	/*
-	 * The bio contains more than one segment which spans EOD, just return
-	 * and let IO layer turn it into an EIO
-	 */
-	if (truncated_bytes > bvec->bv_len)
-		return;
-
-	/* Truncate the bio.. */
-	bio->bi_iter.bi_size -= truncated_bytes;
-	bvec->bv_len -= truncated_bytes;
-
-	/* ..and clear the end of the buffer for reads */
-	if (op == REQ_OP_READ) {
-		zero_user(bvec->bv_page, bvec->bv_offset + bvec->bv_len,
-				truncated_bytes);
-	}
+	bio_truncate(bio, maxsector << 9);
 }

 static int submit_bh_wbc(int op, int op_flags, struct buffer_head *bh,
diff --git a/include/linux/bio.h b/include/linux/bio.h
index b2413f682aba..361b1bcd3deb 100644
--- a/include/linux/bio.h
+++ b/include/linux/bio.h
@@ -521,6 +521,7 @@ extern struct bio *bio_copy_user_iov(struct request_queue *,
 				     gfp_t);
 extern int bio_uncopy_user(struct bio *);
 void zero_fill_bio_iter(struct bio *bio, struct bvec_iter iter);
+void bio_truncate(struct bio *bio, unsigned new_size);

 static inline void zero_fill_bio(struct bio *bio)
 {
From: Ming Lei ming.lei@redhat.com
mainline inclusion
from mainline-v5.5-rc6
commit 83c9c547168e8b914ea6398430473a4de68c52cc
category: bugfix
bugzilla: https://gitee.com/openeuler/kernel/issues/I6GTUI
CVE: NA
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i...
--------------------------------
Commit 85a8ce62c2ea ("block: add bio_truncate to fix guard_bio_eod") adds bio_truncate() for handling bio EOD. However, bio_truncate() doesn't use the passed 'op' parameter from guard_bio_eod's callers.
So bio_truncate() may retrieve the wrong 'op', and zeroing pages may not be done for a READ bio.

Fix this issue by moving guard_bio_eod() after bio_set_op_attrs() in submit_bh_wbc() so that bio_truncate() can always retrieve the correct op info.

Meanwhile, remove the 'op' parameter from guard_bio_eod() because it isn't used any more.
Cc: Carlos Maiolino cmaiolino@redhat.com
Cc: linux-fsdevel@vger.kernel.org
Fixes: 85a8ce62c2ea ("block: add bio_truncate to fix guard_bio_eod")
Signed-off-by: Ming Lei ming.lei@redhat.com
Fold in kerneldoc and bio_op() change.
Signed-off-by: Jens Axboe axboe@kernel.dk
Conflicts:
	fs/buffer.c
	fs/internal.h
Signed-off-by: Yu Kuai yukuai3@huawei.com
Reviewed-by: Hou Tao houtao1@huawei.com
Signed-off-by: Yongqiang Liu liuyongqiang13@huawei.com
---
 block/bio.c   | 12 +++++++++++-
 fs/buffer.c   |  8 ++++----
 fs/internal.h |  2 +-
 fs/mpage.c    |  2 +-
 4 files changed, 17 insertions(+), 7 deletions(-)
diff --git a/block/bio.c b/block/bio.c
index acefd2be1cd9..a1c74f10d604 100644
--- a/block/bio.c
+++ b/block/bio.c
@@ -547,6 +547,16 @@ void zero_fill_bio_iter(struct bio *bio, struct bvec_iter start)
 }
 EXPORT_SYMBOL(zero_fill_bio_iter);

+/**
+ * bio_truncate - truncate the bio to small size of @new_size
+ * @bio:	the bio to be truncated
+ * @new_size:	new size for truncating the bio
+ *
+ * Description:
+ *   Truncate the bio to new size of @new_size. If bio_op(bio) is
+ *   REQ_OP_READ, zero the truncated part. This function should only
+ *   be used for handling corner cases, such as bio eod.
+ */
 void bio_truncate(struct bio *bio, unsigned new_size)
 {
 	struct bio_vec bv;
@@ -557,7 +567,7 @@ void bio_truncate(struct bio *bio, unsigned new_size)
 	if (new_size >= bio->bi_iter.bi_size)
 		return;

-	if (bio_data_dir(bio) != READ)
+	if (bio_op(bio) != REQ_OP_READ)
 		goto exit;

 	bio_for_each_segment(bv, bio, iter) {
diff --git a/fs/buffer.c b/fs/buffer.c
index 6be882a23758..af88734d1d38 100644
--- a/fs/buffer.c
+++ b/fs/buffer.c
@@ -2990,7 +2990,7 @@ static void end_bio_bh_io_sync(struct bio *bio)
  * errors, this only handles the "we need to be able to
  * do IO at the final sector" case.
  */
-void guard_bio_eod(int op, struct bio *bio)
+void guard_bio_eod(struct bio *bio)
 {
 	sector_t maxsector;
 	struct hd_struct *part;
@@ -3059,15 +3059,15 @@ static int submit_bh_wbc(int op, int op_flags, struct buffer_head *bh,
 	bio->bi_end_io = end_bio_bh_io_sync;
 	bio->bi_private = bh;

-	/* Take care of bh's that straddle the end of the device */
-	guard_bio_eod(op, bio);
-
 	if (buffer_meta(bh))
 		op_flags |= REQ_META;
 	if (buffer_prio(bh))
 		op_flags |= REQ_PRIO;
 	bio_set_op_attrs(bio, op, op_flags);

+	/* Take care of bh's that straddle the end of the device */
+	guard_bio_eod(bio);
+
 	submit_bio(bio);
 	return 0;
 }
diff --git a/fs/internal.h b/fs/internal.h
index 73e9829245f1..e63939e64439 100644
--- a/fs/internal.h
+++ b/fs/internal.h
@@ -40,7 +40,7 @@ static inline int __sync_blockdev(struct block_device *bdev, int wait)
 /*
  * buffer.c
  */
-extern void guard_bio_eod(int rw, struct bio *bio);
+extern void guard_bio_eod(struct bio *bio);
 extern int __block_write_begin_int(struct page *page, loff_t pos, unsigned len,
 		get_block_t *get_block, struct iomap *iomap);
 void __generic_write_end(struct inode *inode, loff_t pos, unsigned copied,
diff --git a/fs/mpage.c b/fs/mpage.c
index c820dc9bebab..fb2ff971c66b 100644
--- a/fs/mpage.c
+++ b/fs/mpage.c
@@ -62,7 +62,7 @@ static struct bio *mpage_bio_submit(int op, int op_flags, struct bio *bio)
 {
 	bio->bi_end_io = mpage_end_io;
 	bio_set_op_attrs(bio, op, op_flags);
-	guard_bio_eod(op, bio);
+	guard_bio_eod(bio);
 	submit_bio(bio);
 	return NULL;
 }
From: OGAWA Hirofumi hirofumi@mail.parknet.co.jp
mainline inclusion
from mainline-v5.17-rc1
commit 3ee859e384d453d6ac68bfd5971f630d9fa46ad3
category: bugfix
bugzilla: https://gitee.com/openeuler/kernel/issues/I6GTUI
CVE: NA
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i...
--------------------------------
bio_truncate() clears the buffer outside of the last block of the bdev; however, the current bio_truncate() uses the wrong page offset, so it can return uninitialized data.

This happens when both a truncated/corrupted FS and userspace (via bdev) are trying to read the end of the bdev.
Reported-by: syzbot+ac94ae5f68b84197f41c@syzkaller.appspotmail.com
Signed-off-by: OGAWA Hirofumi hirofumi@mail.parknet.co.jp
Reviewed-by: Ming Lei ming.lei@redhat.com
Link: https://lore.kernel.org/r/875yqt1c9g.fsf@mail.parknet.co.jp
Signed-off-by: Jens Axboe axboe@kernel.dk
Signed-off-by: Yu Kuai yukuai3@huawei.com
Reviewed-by: Hou Tao houtao1@huawei.com
Signed-off-by: Yongqiang Liu liuyongqiang13@huawei.com
---
 block/bio.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/block/bio.c b/block/bio.c
index a1c74f10d604..9d70ebe4122c 100644
--- a/block/bio.c
+++ b/block/bio.c
@@ -578,7 +578,8 @@ void bio_truncate(struct bio *bio, unsigned new_size)
 				offset = new_size - done;
 			else
 				offset = 0;
-			zero_user(bv.bv_page, offset, bv.bv_len - offset);
+			zero_user(bv.bv_page, bv.bv_offset + offset,
+				  bv.bv_len - offset);
 			truncated = true;
 		}
 		done += bv.bv_len;
From: Shakeel Butt shakeelb@google.com
mainline inclusion
from mainline-v5.3-rc1
commit 5eee7e1cdb97123bb55ac14ccd3af8b6edc31537
category: bugfix
bugzilla: https://gitee.com/openeuler/kernel/issues/I6E9D3
CVE: NA
--------------------------------
dump_tasks() traverses all the existing processes even for the memcg OOM context which is not only unnecessary but also wasteful. This imposes a long RCU critical section even from a contained context which can be quite disruptive.
Change dump_tasks() to be aligned with select_bad_process and use mem_cgroup_scan_tasks to selectively traverse only processes of the target memcg hierarchy during memcg OOM.
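The shape of this change, one per-task callback shared by a group-restricted walk and a full walk, can be sketched in user space. Every name below (struct demo_task, demo_scan_group(), demo_count(), demo_dump()) is hypothetical and invented for the illustration; demo_scan_group() only loosely plays the role of mem_cgroup_scan_tasks() and, unlike a real iterator, ignores the callback's return value.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical task record: an id and the group it belongs to. */
struct demo_task {
	int pid;
	int group;
};

typedef int (*demo_task_fn)(const struct demo_task *t, void *arg);

/* Visit only the tasks of one group. */
static void demo_scan_group(const struct demo_task *tasks, size_t n,
			    int group, demo_task_fn fn, void *arg)
{
	assert(fn != NULL);

	for (size_t i = 0; i < n; i++)
		if (tasks[i].group == group)
			fn(&tasks[i], arg);
}

/* Per-task callback shared by both paths; here it just counts visits. */
static int demo_count(const struct demo_task *t, void *arg)
{
	(void)t;
	++*(int *)arg;
	return 0;
}

/* group < 0 models a global OOM: every task is visited. */
static int demo_dump(const struct demo_task *tasks, size_t n, int group)
{
	int visited = 0;

	if (group >= 0)
		demo_scan_group(tasks, n, group, demo_count, &visited);
	else
		for (size_t i = 0; i < n; i++)
			demo_count(&tasks[i], &visited);

	return visited;
}
```

In this model a group-constrained dump touches only the members of that group, while the global case still walks the whole table.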
Link: http://lkml.kernel.org/r/20190617231207.160865-1-shakeelb@google.com
Signed-off-by: Shakeel Butt shakeelb@google.com
Acked-by: Michal Hocko mhocko@suse.com
Acked-by: Roman Gushchin guro@fb.com
Cc: Johannes Weiner hannes@cmpxchg.org
Cc: Tetsuo Handa penguin-kernel@I-love.SAKURA.ne.jp
Cc: Vladimir Davydov vdavydov.dev@gmail.com
Cc: David Rientjes rientjes@google.com
Cc: KOSAKI Motohiro kosaki.motohiro@jp.fujitsu.com
Cc: Paul Jackson pj@sgi.com
Cc: Nick Piggin npiggin@suse.de
Signed-off-by: Andrew Morton akpm@linux-foundation.org
Signed-off-by: Linus Torvalds torvalds@linux-foundation.org

Conflicts:
	mm/oom_kill.c
Signed-off-by: Ma Wupeng mawupeng1@huawei.com
Reviewed-by: Weilong Chen chenweilong@huawei.com
Reviewed-by: Nanyong Sun sunnanyong@huawei.com
Reviewed-by: Kefeng Wang wangkefeng.wang@huawei.com
Signed-off-by: Yongqiang Liu liuyongqiang13@huawei.com
---
 mm/oom_kill.c | 121 +++++++++++++++++++++++++++++---------------------
 1 file changed, 70 insertions(+), 51 deletions(-)
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 020fb6ac2497..41fcb8e45319 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -440,10 +440,60 @@ static void select_bad_process(struct oom_control *oc)
 	oc->chosen_points = oc->chosen_points * 1000 / oc->totalpages;
 }

+static int dump_task(struct task_struct *p, void *arg)
+{
+	struct oom_control *oc = arg;
+	struct task_struct *task;
+	struct sp_proc_stat *stat;
+
+	if (oom_unkillable_task(p, NULL, oc->nodemask))
+		return 0;
+
+	task = find_lock_task_mm(p);
+	if (!task) {
+		/*
+		 * This is a kthread or all of p's threads have already
+		 * detached their mm's. There's no need to report
+		 * them; they can't be oom killed anyway.
+		 */
+		return 0;
+	}
+
+	if (ascend_sp_oom_show()) {
+		stat = sp_get_proc_stat_ref(task->mm);
+
+		pr_cont("[%7d] %5d %5d %8lu %8lu ",
+			task->pid, from_kuid(&init_user_ns, task_uid(task)),
+			task->tgid, task->mm->total_vm, get_mm_rss(task->mm));
+		if (!stat)
+			pr_cont("%-9c %-9c ", '-', '-');
+		else {
+			pr_cont("%-9lld %-9lld ", /* byte to KB */
+				(long long)atomic64_read(&stat->alloc_size) >> 10,
+				(long long)atomic64_read(&stat->k2u_size) >> 10);
+			sp_proc_stat_drop(stat);
+		}
+		pr_cont("%8ld %8lu %5hd %s\n",
+			mm_pgtables_bytes(task->mm),
+			get_mm_counter(task->mm, MM_SWAPENTS),
+			task->signal->oom_score_adj, task->comm);
+	} else {
+		pr_info("[%7d] %5d %5d %8lu %8lu %8ld %8lu %5hd %s\n",
+			task->pid, from_kuid(&init_user_ns, task_uid(task)),
+			task->tgid, task->mm->total_vm, get_mm_rss(task->mm),
+			mm_pgtables_bytes(task->mm),
+			get_mm_counter(task->mm, MM_SWAPENTS),
+			task->signal->oom_score_adj, task->comm);
+	}
+
+	task_unlock(task);
+
+	return 0;
+}
+
 /**
  * dump_tasks - dump current memory state of all system tasks
- * @memcg: current's memory controller, if constrained
- * @nodemask: nodemask passed to page allocator for mempolicy ooms
+ * @oc: pointer to struct oom_control
  *
  * Dumps the current memory state of all eligible tasks. Tasks not in the same
  * memcg, not in the same cpuset, or bound to a disjoint set of mempolicy nodes
@@ -451,12 +501,8 @@ static void select_bad_process(struct oom_control *oc)
  * State information includes task's pid, uid, tgid, vm size, rss,
  * pgtables_bytes, swapents, oom_score_adj value, and name.
  */
-static void dump_tasks(struct mem_cgroup *memcg, const nodemask_t *nodemask)
+static void dump_tasks(struct oom_control *oc)
 {
-	struct task_struct *p;
-	struct task_struct *task;
-	struct sp_proc_stat *stat;
-
 	if (ascend_sp_oom_show()) {
 		pr_info("Tasks state (memory values in pages, share pool memory values in KB):\n");
 		pr_info("[ pid ] uid tgid total_vm rss sp_alloc sp_k2u pgtables_bytes swapents oom_score_adj name\n");
@@ -465,50 +511,16 @@ static void dump_tasks(struct mem_cgroup *memcg, const nodemask_t *nodemask)
 		pr_info("[ pid ] uid tgid total_vm rss pgtables_bytes swapents oom_score_adj name\n");
 	}

-	rcu_read_lock();
-	for_each_process(p) {
-		if (oom_unkillable_task(p, memcg, nodemask))
-			continue;
-
-		task = find_lock_task_mm(p);
-		if (!task) {
-			/*
-			 * This is a kthread or all of p's threads have already
-			 * detached their mm's. There's no need to report
-			 * them; they can't be oom killed anyway.
-			 */
-			continue;
-		}
+	if (is_memcg_oom(oc))
+		mem_cgroup_scan_tasks(oc->memcg, dump_task, oc);
+	else {
+		struct task_struct *p;

-		if (ascend_sp_oom_show()) {
-			stat = sp_get_proc_stat_ref(task->mm);
-
-			pr_cont("[%7d] %5d %5d %8lu %8lu ",
-				task->pid, from_kuid(&init_user_ns, task_uid(task)),
-				task->tgid, task->mm->total_vm, get_mm_rss(task->mm));
-			if (!stat)
-				pr_cont("%-9c %-9c ", '-', '-');
-			else {
-				pr_cont("%-9lld %-9lld ", /* byte to KB */
-					(long long)atomic64_read(&stat->alloc_size) >> 10,
-					(long long)atomic64_read(&stat->k2u_size) >> 10);
-				sp_proc_stat_drop(stat);
-			}
-			pr_cont("%8ld %8lu %5hd %s\n",
-				mm_pgtables_bytes(task->mm),
-				get_mm_counter(task->mm, MM_SWAPENTS),
-				task->signal->oom_score_adj, task->comm);
-		} else {
-			pr_info("[%7d] %5d %5d %8lu %8lu %8ld %8lu %5hd %s\n",
-				task->pid, from_kuid(&init_user_ns, task_uid(task)),
-				task->tgid, task->mm->total_vm, get_mm_rss(task->mm),
-				mm_pgtables_bytes(task->mm),
-				get_mm_counter(task->mm, MM_SWAPENTS),
-				task->signal->oom_score_adj, task->comm);
-		}
-		task_unlock(task);
+		rcu_read_lock();
+		for_each_process(p)
+			dump_task(p, oc);
+		rcu_read_unlock();
 	}
-	rcu_read_unlock();
 }

 static void dump_oom_summary(struct oom_control *oc, struct task_struct *victim)
@@ -540,7 +552,7 @@ static void dump_header(struct oom_control *oc, struct task_struct *p)
 			dump_unreclaimable_slab();
 	}
 	if (sysctl_oom_dump_tasks)
-		dump_tasks(oc->memcg, oc->nodemask);
+		dump_tasks(oc);
 	if (p)
 		dump_oom_summary(oc, p);
 }
@@ -1158,6 +1170,13 @@ int hisi_oom_notifier_call(unsigned long val, void *v)
 {
 	int ret;
 	unsigned long freed = 0;
+	struct oom_control oc = {
+		.zonelist = NULL,
+		.nodemask = NULL,
+		.memcg = NULL,
+		.gfp_mask = GFP_KERNEL,
+		.order = 0,
+	};

 	/* when enable oom killer, just return */
 	if (sysctl_enable_oom_killer == 1)
@@ -1171,7 +1190,7 @@ int hisi_oom_notifier_call(unsigned long val, void *v)
 		show_mem(SHOW_MEM_FILTER_NODES, NULL);
 		spg_overview_show(NULL);
 		spa_overview_show(NULL);
-		dump_tasks(NULL, 0);
+		dump_tasks(&oc);
 		last_jiffies = jiffies;
 	}
From: Shakeel Butt shakeelb@google.com
mainline inclusion
from mainline-v5.3-rc1
commit 6ba749ee78ef42ffdf4b95c042fc574a37d229d9
category: bugfix
bugzilla: https://gitee.com/openeuler/kernel/issues/I6E9D3
CVE: NA
--------------------------------
oom_unkillable_task() can be called from three different contexts, i.e. global OOM, memcg OOM and the oom_score procfs interface. At the moment oom_unkillable_task() does a task_in_mem_cgroup() check on the given process. Since there is no reason to perform the task_in_mem_cgroup() check for global OOM and the oom_score procfs interface, those contexts provide a NULL memcg and skip the task_in_mem_cgroup() check. However, for the memcg OOM context, oom_unkillable_task() is always called from mem_cgroup_scan_tasks() and thus the task_in_mem_cgroup() check becomes redundant and effectively dead code. So, just remove the task_in_mem_cgroup() check altogether.
Link: http://lkml.kernel.org/r/20190624212631.87212-2-shakeelb@google.com
Signed-off-by: Shakeel Butt shakeelb@google.com
Signed-off-by: Tetsuo Handa penguin-kernel@I-love.SAKURA.ne.jp
Acked-by: Roman Gushchin guro@fb.com
Acked-by: Michal Hocko mhocko@suse.com
Cc: David Rientjes rientjes@google.com
Cc: Johannes Weiner hannes@cmpxchg.org
Cc: KOSAKI Motohiro kosaki.motohiro@jp.fujitsu.com
Cc: Nick Piggin npiggin@suse.de
Cc: Paul Jackson pj@sgi.com
Cc: Vladimir Davydov vdavydov.dev@gmail.com
Signed-off-by: Andrew Morton akpm@linux-foundation.org
Signed-off-by: Linus Torvalds torvalds@linux-foundation.org

Conflicts:
	mm/page_alloc.c
Signed-off-by: Ma Wupeng mawupeng1@huawei.com
Reviewed-by: Nanyong Sun sunnanyong@huawei.com
Reviewed-by: Kefeng Wang wangkefeng.wang@huawei.com
Signed-off-by: Yongqiang Liu liuyongqiang13@huawei.com
---
 fs/proc/base.c             |  2 +-
 include/linux/memcontrol.h |  7 -------
 include/linux/oom.h        |  2 +-
 mm/memcontrol.c            | 26 --------------------------
 mm/oom_kill.c              | 23 +++++++++--------------
 5 files changed, 11 insertions(+), 49 deletions(-)
diff --git a/fs/proc/base.c b/fs/proc/base.c
index d7e94f7b5ad3..45ac75d16d71 100644
--- a/fs/proc/base.c
+++ b/fs/proc/base.c
@@ -552,7 +552,7 @@ static int proc_oom_score(struct seq_file *m, struct pid_namespace *ns,
 	unsigned long totalpages = totalram_pages + total_swap_pages;
 	unsigned long points = 0;

-	points = oom_badness(task, NULL, NULL, totalpages) *
+	points = oom_badness(task, NULL, totalpages) *
 					1000 / totalpages;
 	seq_printf(m, "%lu\n", points);

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 594925ea3076..23db8eec0755 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -436,7 +436,6 @@ static inline struct lruvec *mem_cgroup_lruvec(struct pglist_data *pgdat,

 struct lruvec *mem_cgroup_page_lruvec(struct page *, struct pglist_data *);

-bool task_in_mem_cgroup(struct task_struct *task, struct mem_cgroup *memcg);
 struct mem_cgroup *mem_cgroup_from_task(struct task_struct *p);

 struct mem_cgroup *get_mem_cgroup_from_mm(struct mm_struct *mm);
@@ -907,12 +906,6 @@ static inline bool mm_match_cgroup(struct mm_struct *mm,
 	return true;
 }

-static inline bool task_in_mem_cgroup(struct task_struct *task,
-				      const struct mem_cgroup *memcg)
-{
-	return true;
-}
-
 static inline struct mem_cgroup *get_mem_cgroup_from_mm(struct mm_struct *mm)
 {
 	return NULL;
diff --git a/include/linux/oom.h b/include/linux/oom.h
index 123538b89dc8..76a2086af472 100644
--- a/include/linux/oom.h
+++ b/include/linux/oom.h
@@ -109,7 +109,7 @@ static inline vm_fault_t check_stable_address_space(struct mm_struct *mm)
 bool __oom_reap_task_mm(struct mm_struct *mm);

 extern unsigned long oom_badness(struct task_struct *p,
-		struct mem_cgroup *memcg, const nodemask_t *nodemask,
+		const nodemask_t *nodemask,
 		unsigned long totalpages);

 extern bool out_of_memory(struct oom_control *oc);
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 886d6b0a4fce..fe854542fb77 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -1332,32 +1332,6 @@ void mem_cgroup_update_lru_size(struct lruvec *lruvec, enum lru_list lru,
 		*lru_size += nr_pages;
 }

-bool task_in_mem_cgroup(struct task_struct *task, struct mem_cgroup *memcg)
-{
-	struct mem_cgroup *task_memcg;
-	struct task_struct *p;
-	bool ret;
-
-	p = find_lock_task_mm(task);
-	if (p) {
-		task_memcg = get_mem_cgroup_from_mm(p->mm);
-		task_unlock(p);
-	} else {
-		/*
-		 * All threads may have already detached their mm's, but the oom
-		 * killer still needs to detect if they have already been oom
-		 * killed to prevent needlessly killing additional tasks.
-		 */
-		rcu_read_lock();
-		task_memcg = mem_cgroup_from_task(task);
-		css_get(&task_memcg->css);
-		rcu_read_unlock();
-	}
-	ret = mem_cgroup_is_descendant(task_memcg, memcg);
-	css_put(&task_memcg->css);
-	return ret;
-}
-
 /**
  * mem_cgroup_margin - calculate chargeable space of a memory cgroup
  * @memcg: the memory cgroup
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 41fcb8e45319..d2639ce2cec4 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -158,17 +158,13 @@ static inline bool is_memcg_oom(struct oom_control *oc)

 /* return true if the task is not adequate as candidate victim task. */
 static bool oom_unkillable_task(struct task_struct *p,
-		struct mem_cgroup *memcg, const nodemask_t *nodemask)
+		const nodemask_t *nodemask)
 {
 	if (is_global_init(p))
 		return true;
 	if (p->flags & PF_KTHREAD)
 		return true;

-	/* When mem_cgroup_out_of_memory() and p is not member of the group */
-	if (memcg && !task_in_mem_cgroup(p, memcg))
-		return true;
-
 	/* p may not have freeable memory in nodemask */
 	if (!has_intersects_mems_allowed(p, nodemask))
 		return true;
@@ -199,20 +195,19 @@ static bool is_dump_unreclaim_slabs(void)
  * oom_badness - heuristic function to determine which candidate task to kill
  * @p: task struct of which task we should calculate
  * @totalpages: total present RAM allowed for page allocation
- * @memcg: task's memory controller, if constrained
  * @nodemask: nodemask passed to page allocator for mempolicy ooms
  *
  * The heuristic for determining which task to kill is made to be as simple and
  * predictable as possible. The goal is to return the highest value for the
  * task consuming the most memory to avoid subsequent oom failures.
  */
-unsigned long oom_badness(struct task_struct *p, struct mem_cgroup *memcg,
+unsigned long oom_badness(struct task_struct *p,
 			  const nodemask_t *nodemask, unsigned long totalpages)
 {
 	long points;
 	long adj;

-	if (oom_unkillable_task(p, memcg, nodemask))
+	if (oom_unkillable_task(p, nodemask))
 		return 0;

 	p = find_lock_task_mm(p);
@@ -369,7 +364,7 @@ static int oom_evaluate_task(struct task_struct *task, void *arg)
 	struct oom_control *oc = arg;
 	unsigned long points;

-	if (oom_unkillable_task(task, NULL, oc->nodemask))
+	if (oom_unkillable_task(task, oc->nodemask))
 		goto next;

 	/*
@@ -393,7 +388,7 @@ static int oom_evaluate_task(struct task_struct *task, void *arg)
 		goto select;
 	}

-	points = oom_badness(task, NULL, oc->nodemask, oc->totalpages);
+	points = oom_badness(task, oc->nodemask, oc->totalpages);
 	if (oom_next_task(task, oc, points))
 		goto next;

@@ -446,7 +441,7 @@ static int dump_task(struct task_struct *p, void *arg)
 	struct task_struct *task;
 	struct sp_proc_stat *stat;

-	if (oom_unkillable_task(p, NULL, oc->nodemask))
+	if (oom_unkillable_task(p, oc->nodemask))
 		return 0;

 	task = find_lock_task_mm(p);
@@ -1086,8 +1081,8 @@ static void oom_kill_process(struct oom_control *oc, const char *message)
 			/*
 			 * oom_badness() returns 0 if the thread is unkillable
 			 */
-			child_points = oom_badness(child,
-				oc->memcg, oc->nodemask, oc->totalpages);
+			child_points = oom_badness(child, oc->nodemask,
+				oc->totalpages);
 			if (child_points > victim_points) {
 				put_task_struct(victim);
 				victim = child;
@@ -1276,7 +1271,7 @@ bool out_of_memory(struct oom_control *oc)
 	check_panic_on_oom(oc);

 	if (!is_memcg_oom(oc) && sysctl_oom_kill_allocating_task &&
-	    current->mm && !oom_unkillable_task(current, NULL, oc->nodemask) &&
+	    current->mm && !oom_unkillable_task(current, oc->nodemask) &&
 	    current->signal->oom_score_adj != OOM_SCORE_ADJ_MIN) {
 		get_task_struct(current);
 		oc->chosen = current;
From: Shakeel Butt shakeelb@google.com
mainline inclusion from mainline-v5.3-rc1 commit ac311a14c682dcd8a120a6244d0542ec654e3d93 category: bugfix bugzilla: https://gitee.com/openeuler/kernel/issues/I6E9D3 CVE: NA
--------------------------------
Commit ef08e3b4981a ("[PATCH] cpusets: confine oom_killer to mem_exclusive cpuset") introduced a heuristic where a potential oom-killer victim is skipped if the intersection of the cpusets of the potential victim and of current (the process that triggered the oom) is empty, on the reasoning that killing such a victim most probably will not help the current allocating process.
However, commit 7887a3da753e ("[PATCH] oom: cpuset hint") changed the heuristic to merely decrease the oom_badness score of such a potential victim, on the reasoning that the cpuset of such a process might have changed and it may previously have allocated memory on mems from which the current allocating process can allocate.
Unintentionally, 7887a3da753e ("[PATCH] oom: cpuset hint") introduced a side effect: because oom_badness is also exposed to user space through /proc/[pid]/oom_score, readers with different cpusets can read different oom_score values for the same process.
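The side effect described above can be illustrated with a small user-space model. This is only a sketch, not kernel code: nodes_t, struct model_task, and both scoring functions are hypothetical names, and the divide-by-8 penalty is an assumption chosen for illustration.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for nodemask_t: one bit per memory node. */
typedef uint64_t nodes_t;

/* Hypothetical, heavily reduced model of a task. */
struct model_task {
	unsigned long rss_pages;	/* memory the task is using */
	nodes_t mems_allowed;		/* nodes the task's cpuset allows */
};

/* True when the two node sets share at least one node. */
static bool mems_intersect(nodes_t a, nodes_t b)
{
	return (a & b) != 0;
}

/*
 * Reader-dependent score: the result changes with the cpuset of whoever
 * asks. The divide-by-8 penalty is an illustrative assumption.
 */
static unsigned long score_reader_dependent(const struct model_task *victim,
					    const struct model_task *reader)
{
	unsigned long points = victim->rss_pages;

	if (!mems_intersect(victim->mems_allowed, reader->mems_allowed))
		points /= 8;
	return points;
}

/* Reader-independent score: only the victim's own state matters. */
static unsigned long score_reader_independent(const struct model_task *victim)
{
	return victim->rss_pages;
}
```

In this model, two readers confined to different nodes get different values from score_reader_dependent() for the same victim, while score_reader_independent() gives both the same answer.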
Later, commit 6cf86ac6f36b ("oom: filter tasks not sharing the same cpuset") fixed the side effect introduced by 7887a3da753e by moving the cpuset intersection check back into oom-killer context only and out of oom_badness. However, the combination of ab290adbaf8f ("oom: make oom_unkillable_task() helper function") and 26ebc984913b ("oom: /proc/<pid>/oom_score treat kernel thread honestly") unintentionally brought the cpuset intersection check back into the oom_badness calculation function.
Besides the cpuset/mempolicy intersection being done from oom_badness, the memcg oom context also does the cpuset/mempolicy intersection, which is quite wrong and was caught by syzkaller with the following report:
kasan: CONFIG_KASAN_INLINE enabled kasan: GPF could be caused by NULL-ptr deref or user memory access general protection fault: 0000 [#1] PREEMPT SMP KASAN CPU: 0 PID: 28426 Comm: syz-executor.5 Not tainted 5.2.0-rc3-next-20190607 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011 RIP: 0010:__read_once_size include/linux/compiler.h:194 [inline] RIP: 0010:has_intersects_mems_allowed mm/oom_kill.c:84 [inline] RIP: 0010:oom_unkillable_task mm/oom_kill.c:168 [inline] RIP: 0010:oom_unkillable_task+0x180/0x400 mm/oom_kill.c:155 Code: c1 ea 03 80 3c 02 00 0f 85 80 02 00 00 4c 8b a3 10 07 00 00 48 b8 00 00 00 00 00 fc ff df 4d 8d 74 24 10 4c 89 f2 48 c1 ea 03 <80> 3c 02 00 0f 85 67 02 00 00 49 8b 44 24 10 4c 8d a0 68 fa ff ff RSP: 0018:ffff888000127490 EFLAGS: 00010a03 RAX: dffffc0000000000 RBX: ffff8880a4cd5438 RCX: ffffffff818dae9c RDX: 100000000c3cc602 RSI: ffffffff818dac8d RDI: 0000000000000001 RBP: ffff8880001274d0 R08: ffff888000086180 R09: ffffed1015d26be0 R10: ffffed1015d26bdf R11: ffff8880ae935efb R12: 8000000061e63007 R13: 0000000000000000 R14: 8000000061e63017 R15: 1ffff11000024ea6 FS: 00005555561f5940(0000) GS:ffff8880ae800000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 0000000000607304 CR3: 000000009237e000 CR4: 00000000001426f0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000600 Call Trace: oom_evaluate_task+0x49/0x520 mm/oom_kill.c:321 mem_cgroup_scan_tasks+0xcc/0x180 mm/memcontrol.c:1169 select_bad_process mm/oom_kill.c:374 [inline] out_of_memory mm/oom_kill.c:1088 [inline] out_of_memory+0x6b2/0x1280 mm/oom_kill.c:1035 mem_cgroup_out_of_memory+0x1ca/0x230 mm/memcontrol.c:1573 mem_cgroup_oom mm/memcontrol.c:1905 [inline] try_charge+0xfbe/0x1480 mm/memcontrol.c:2468 mem_cgroup_try_charge+0x24d/0x5e0 mm/memcontrol.c:6073 mem_cgroup_try_charge_delay+0x1f/0xa0 mm/memcontrol.c:6088 
do_huge_pmd_wp_page_fallback+0x24f/0x1680 mm/huge_memory.c:1201 do_huge_pmd_wp_page+0x7fc/0x2160 mm/huge_memory.c:1359 wp_huge_pmd mm/memory.c:3793 [inline] __handle_mm_fault+0x164c/0x3eb0 mm/memory.c:4006 handle_mm_fault+0x3b7/0xa90 mm/memory.c:4053 do_user_addr_fault arch/x86/mm/fault.c:1455 [inline] __do_page_fault+0x5ef/0xda0 arch/x86/mm/fault.c:1521 do_page_fault+0x71/0x57d arch/x86/mm/fault.c:1552 page_fault+0x1e/0x30 arch/x86/entry/entry_64.S:1156 RIP: 0033:0x400590 Code: 06 e9 49 01 00 00 48 8b 44 24 10 48 0b 44 24 28 75 1f 48 8b 14 24 48 8b 7c 24 20 be 04 00 00 00 e8 f5 56 00 00 48 8b 74 24 08 <89> 06 e9 1e 01 00 00 48 8b 44 24 08 48 8b 14 24 be 04 00 00 00 8b RSP: 002b:00007fff7bc49780 EFLAGS: 00010206 RAX: 0000000000000001 RBX: 0000000000760000 RCX: 0000000000000000 RDX: 0000000000000000 RSI: 000000002000cffc RDI: 0000000000000001 RBP: fffffffffffffffe R08: 0000000000000000 R09: 0000000000000000 R10: 0000000000000075 R11: 0000000000000246 R12: 0000000000760008 R13: 00000000004c55f2 R14: 0000000000000000 R15: 00007fff7bc499b0 Modules linked in: ---[ end trace a65689219582ffff ]--- RIP: 0010:__read_once_size include/linux/compiler.h:194 [inline] RIP: 0010:has_intersects_mems_allowed mm/oom_kill.c:84 [inline] RIP: 0010:oom_unkillable_task mm/oom_kill.c:168 [inline] RIP: 0010:oom_unkillable_task+0x180/0x400 mm/oom_kill.c:155 Code: c1 ea 03 80 3c 02 00 0f 85 80 02 00 00 4c 8b a3 10 07 00 00 48 b8 00 00 00 00 00 fc ff df 4d 8d 74 24 10 4c 89 f2 48 c1 ea 03 <80> 3c 02 00 0f 85 67 02 00 00 49 8b 44 24 10 4c 8d a0 68 fa ff ff RSP: 0018:ffff888000127490 EFLAGS: 00010a03 RAX: dffffc0000000000 RBX: ffff8880a4cd5438 RCX: ffffffff818dae9c RDX: 100000000c3cc602 RSI: ffffffff818dac8d RDI: 0000000000000001 RBP: ffff8880001274d0 R08: ffff888000086180 R09: ffffed1015d26be0 R10: ffffed1015d26bdf R11: ffff8880ae935efb R12: 8000000061e63007 R13: 0000000000000000 R14: 8000000061e63017 R15: 1ffff11000024ea6 FS: 00005555561f5940(0000) GS:ffff8880ae800000(0000) 
knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 0000001b2f823000 CR3: 000000009237e000 CR4: 00000000001426f0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000600
The fix is to decouple the cpuset/mempolicy intersection check from oom_unkillable_task() and to make sure the cpuset/mempolicy intersection check is done only in the global oom context.
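The shape of the fix can be sketched as a user-space model. All names here (model_task, model_oom_control, the model_* functions) are hypothetical, and the eligibility test is simplified to a plain node-set intersection, whereas the kernel helper walks the task's threads and compares against current's mempolicy or cpuset.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for nodemask_t: one bit per memory node. */
typedef uint64_t nodes_t;

struct model_task {
	bool is_init;		/* models is_global_init(p) */
	bool is_kthread;	/* models p->flags & PF_KTHREAD */
	nodes_t mems_allowed;	/* nodes the task may allocate from */
};

struct model_oom_control {
	const void *memcg;	/* non-NULL for a memcg oom */
	nodes_t nodemask;	/* nodes the failing allocation may use */
};

static bool model_is_memcg_oom(const struct model_oom_control *oc)
{
	return oc->memcg != NULL;
}

/* After the fix: depends only on the task, never on the oom context. */
static bool model_unkillable(const struct model_task *p)
{
	return p->is_init || p->is_kthread;
}

/* Simplified eligibility: always true for a memcg oom. */
static bool model_cpuset_eligible(const struct model_task *p,
				  const struct model_oom_control *oc)
{
	if (model_is_memcg_oom(oc))
		return true;
	return (p->mems_allowed & oc->nodemask) != 0;
}

/* Victim filter: the node check applies to the global oom context only. */
static bool model_is_candidate(const struct model_task *p,
			       const struct model_oom_control *oc)
{
	if (model_unkillable(p))
		return false;
	if (!model_is_memcg_oom(oc) && !model_cpuset_eligible(p, oc))
		return false;
	return true;
}
```

With this split, a task whose nodes do not overlap the allocation's nodemask is filtered out in a global oom but remains a candidate in a memcg oom, and the score function no longer needs a nodemask at all.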
[shakeelb@google.com: change function name and update comment] Link: http://lkml.kernel.org/r/20190628152421.198994-3-shakeelb@google.com Link: http://lkml.kernel.org/r/20190624212631.87212-3-shakeelb@google.com Signed-off-by: Shakeel Butt shakeelb@google.com Reported-by: syzbot+d0fc9d3c166bc5e4a94b@syzkaller.appspotmail.com Acked-by: Roman Gushchin guro@fb.com Acked-by: Michal Hocko mhocko@suse.com Cc: David Rientjes rientjes@google.com Cc: Johannes Weiner hannes@cmpxchg.org Cc: KOSAKI Motohiro kosaki.motohiro@jp.fujitsu.com Cc: Nick Piggin npiggin@suse.de Cc: Paul Jackson pj@sgi.com Cc: Tetsuo Handa penguin-kernel@I-love.SAKURA.ne.jp Cc: Vladimir Davydov vdavydov.dev@gmail.com Signed-off-by: Andrew Morton akpm@linux-foundation.org Signed-off-by: Linus Torvalds torvalds@linux-foundation.org
Conflicts: mm/oom_kill.c Signed-off-by: Ma Wupeng mawupeng1@huawei.com Reviewed-by: Nanyong Sun sunnanyong@huawei.com Reviewed-by: Kefeng Wang wangkefeng.wang@huawei.com Signed-off-by: Yongqiang Liu liuyongqiang13@huawei.com --- fs/proc/base.c | 3 +-- include/linux/oom.h | 1 - mm/oom_kill.c | 60 +++++++++++++++++++++++++-------------------- 3 files changed, 34 insertions(+), 30 deletions(-)
diff --git a/fs/proc/base.c b/fs/proc/base.c index 45ac75d16d71..dc9841826264 100644 --- a/fs/proc/base.c +++ b/fs/proc/base.c @@ -552,8 +552,7 @@ static int proc_oom_score(struct seq_file *m, struct pid_namespace *ns, unsigned long totalpages = totalram_pages + total_swap_pages; unsigned long points = 0;
- points = oom_badness(task, NULL, totalpages) * - 1000 / totalpages; + points = oom_badness(task, totalpages) * 1000 / totalpages; seq_printf(m, "%lu\n", points);
return 0; diff --git a/include/linux/oom.h b/include/linux/oom.h index 76a2086af472..1fd14b61c21b 100644 --- a/include/linux/oom.h +++ b/include/linux/oom.h @@ -109,7 +109,6 @@ static inline vm_fault_t check_stable_address_space(struct mm_struct *mm) bool __oom_reap_task_mm(struct mm_struct *mm);
extern unsigned long oom_badness(struct task_struct *p, - const nodemask_t *nodemask, unsigned long totalpages);
extern bool out_of_memory(struct oom_control *oc); diff --git a/mm/oom_kill.c b/mm/oom_kill.c index d2639ce2cec4..549f3e6ddb25 100644 --- a/mm/oom_kill.c +++ b/mm/oom_kill.c @@ -69,21 +69,33 @@ DEFINE_MUTEX(oom_lock); /* Serializes oom_score_adj and oom_score_adj_min updates */ DEFINE_MUTEX(oom_adj_mutex);
+static inline bool is_memcg_oom(struct oom_control *oc) +{ + return oc->memcg != NULL; +} + #ifdef CONFIG_NUMA /** - * has_intersects_mems_allowed() - check task eligiblity for kill + * oom_cpuset_eligible() - check task eligiblity for kill * @start: task struct of which task to consider * @mask: nodemask passed to page allocator for mempolicy ooms * * Task eligibility is determined by whether or not a candidate task, @tsk, * shares the same mempolicy nodes as current if it is bound by such a policy * and whether or not it has the same set of allowed cpuset nodes. + * + * This function is assuming oom-killer context and 'current' has triggered + * the oom-killer. */ -static bool has_intersects_mems_allowed(struct task_struct *start, - const nodemask_t *mask) +static bool oom_cpuset_eligible(struct task_struct *start, + struct oom_control *oc) { struct task_struct *tsk; bool ret = false; + const nodemask_t *mask = oc->nodemask; + + if (is_memcg_oom(oc)) + return true;
rcu_read_lock(); for_each_thread(start, tsk) { @@ -110,8 +122,7 @@ static bool has_intersects_mems_allowed(struct task_struct *start, return ret; } #else -static bool has_intersects_mems_allowed(struct task_struct *tsk, - const nodemask_t *mask) +static bool oom_cpuset_eligible(struct task_struct *tsk, struct oom_control *oc) { return true; } @@ -151,24 +162,13 @@ static inline bool is_sysrq_oom(struct oom_control *oc) return oc->order == -1; }
-static inline bool is_memcg_oom(struct oom_control *oc) -{ - return oc->memcg != NULL; -} - /* return true if the task is not adequate as candidate victim task. */ -static bool oom_unkillable_task(struct task_struct *p, - const nodemask_t *nodemask) +static bool oom_unkillable_task(struct task_struct *p) { if (is_global_init(p)) return true; if (p->flags & PF_KTHREAD) return true; - - /* p may not have freeable memory in nodemask */ - if (!has_intersects_mems_allowed(p, nodemask)) - return true; - return false; }
@@ -195,19 +195,17 @@ static bool is_dump_unreclaim_slabs(void) * oom_badness - heuristic function to determine which candidate task to kill * @p: task struct of which task we should calculate * @totalpages: total present RAM allowed for page allocation - * @nodemask: nodemask passed to page allocator for mempolicy ooms * * The heuristic for determining which task to kill is made to be as simple and * predictable as possible. The goal is to return the highest value for the * task consuming the most memory to avoid subsequent oom failures. */ -unsigned long oom_badness(struct task_struct *p, - const nodemask_t *nodemask, unsigned long totalpages) +unsigned long oom_badness(struct task_struct *p, unsigned long totalpages) { long points; long adj;
- if (oom_unkillable_task(p, nodemask)) + if (oom_unkillable_task(p)) return 0;
p = find_lock_task_mm(p); @@ -364,7 +362,11 @@ static int oom_evaluate_task(struct task_struct *task, void *arg) struct oom_control *oc = arg; unsigned long points;
- if (oom_unkillable_task(task, oc->nodemask)) + if (oom_unkillable_task(task)) + goto next; + + /* p may not have freeable memory in nodemask */ + if (!is_memcg_oom(oc) && !oom_cpuset_eligible(task, oc)) goto next;
/* @@ -388,7 +390,7 @@ static int oom_evaluate_task(struct task_struct *task, void *arg) goto select; }
- points = oom_badness(task, oc->nodemask, oc->totalpages); + points = oom_badness(task, oc->totalpages); if (oom_next_task(task, oc, points)) goto next;
@@ -441,7 +443,11 @@ static int dump_task(struct task_struct *p, void *arg) struct task_struct *task; struct sp_proc_stat *stat;
- if (oom_unkillable_task(p, oc->nodemask)) + if (oom_unkillable_task(p)) + return 0; + + /* p may not have freeable memory in nodemask */ + if (!is_memcg_oom(oc) && !oom_cpuset_eligible(p, oc)) return 0;
task = find_lock_task_mm(p); @@ -1081,8 +1087,7 @@ static void oom_kill_process(struct oom_control *oc, const char *message) /* * oom_badness() returns 0 if the thread is unkillable */ - child_points = oom_badness(child, oc->nodemask, - oc->totalpages); + child_points = oom_badness(child, oc->totalpages); if (child_points > victim_points) { put_task_struct(victim); victim = child; @@ -1271,7 +1276,8 @@ bool out_of_memory(struct oom_control *oc) check_panic_on_oom(oc);
if (!is_memcg_oom(oc) && sysctl_oom_kill_allocating_task && - current->mm && !oom_unkillable_task(current, oc->nodemask) && + current->mm && !oom_unkillable_task(current) && + oom_cpuset_eligible(current, oc) && current->signal->oom_score_adj != OOM_SCORE_ADJ_MIN) { get_task_struct(current); oc->chosen = current;
From: Yi Wang wang.yi59@zte.com.cn
mainline inclusion from mainline-v5.4-rc1 commit f364f06b34b55285df7b132b4e3752d820412ad4 category: bugfix bugzilla: https://gitee.com/openeuler/kernel/issues/I6E9D3 CVE: NA
--------------------------------
Commit ac311a14c682 ("oom: decouple mems_allowed from oom_unkillable_task") changed has_intersects_mems_allowed() to oom_cpuset_eligible(), but didn't change the comment.
Link: http://lkml.kernel.org/r/1566959929-10638-1-git-send-email-wang.yi59@zte.com... Signed-off-by: Yi Wang wang.yi59@zte.com.cn Acked-by: Michal Hocko mhocko@suse.com Signed-off-by: Andrew Morton akpm@linux-foundation.org Signed-off-by: Linus Torvalds torvalds@linux-foundation.org Signed-off-by: Ma Wupeng mawupeng1@huawei.com Reviewed-by: Weilong Chen chenweilong@huawei.com Reviewed-by: Nanyong Sun sunnanyong@huawei.com Reviewed-by: Kefeng Wang wangkefeng.wang@huawei.com Signed-off-by: Yongqiang Liu liuyongqiang13@huawei.com --- mm/oom_kill.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 549f3e6ddb25..bcd08df6d577 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -78,7 +78,7 @@ static inline bool is_memcg_oom(struct oom_control *oc)
 /**
  * oom_cpuset_eligible() - check task eligiblity for kill
  * @start: task struct of which task to consider
- * @mask: nodemask passed to page allocator for mempolicy ooms
+ * @oc: pointer to struct oom_control
  *
  * Task eligibility is determined by whether or not a candidate task, @tsk,
  * shares the same mempolicy nodes as current if it is bound by such a policy