Recently, when installing a package in a docker container that had almost reached its memory limit, the installer became unresponsive for more than 15 minutes. During this period, I/O stayed high (~1G/s) and affected the whole machine. I've constructed a reproducer as follows:
1. Create a docker container:

$ cat test.sh
#!/bin/bash

docker rm centos7 --force
docker create --name centos7 --memory 4G --memory-swap 6G centos:7 /usr/sbin/init
docker start centos7
sleep 1
docker cp ./alloc_page centos7:/
docker cp ./reproduce.sh centos7:/
docker exec -it centos7 /bin/bash
2. Try to reproduce the problem inside the container:

$ cat reproduce.sh
#!/bin/bash

while true; do
    flag=$(ps -ef | grep -v grep | grep alloc_page | wc -l)
    if [ "$flag" -eq 0 ]; then
        /alloc_page &
    fi

    sleep 30

    start_time=$(date +%s)
    yum install -y expect > /dev/null 2>&1
    end_time=$(date +%s)

    elapsed_time=$((end_time - start_time))
    echo "$elapsed_time seconds"
    yum remove -y expect > /dev/null 2>&1
done
$ cat alloc_page.c
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <string.h>

#define SIZE 1*1024*1024 //1M

int main()
{
	void *addr = NULL;
	int i;

	for (i = 0; i < 1024 * 6 - 50; i++) {
		addr = (void *)malloc(SIZE);
		if (!addr)
			return -1;

		memset(addr, 0, SIZE);
	}

	sleep(99999);
	return 0;
}
We found that this problem is caused by a large amount of pointless read-ahead. Since the container has almost reached its memory limit, pages are reclaimed immediately after being read ahead and are then read ahead again right away. The program runs slowly and wastes a lot of I/O bandwidth.

These patches aim to break out of read-ahead in the above scenario.
Liu Shixin (3):
  mm/readahead: break read-ahead loop if filemap_add_folio return -ENOMEM
  mm/filemap: don't decrease mmap_miss when folio has workingset flag
  mm/readahead: add sysctl to enable early break readahead

 Documentation/admin-guide/sysctl/vm.rst |  7 +++++++
 include/linux/pagemap.h                 |  2 ++
 kernel/sysctl.c                         |  9 +++++++++
 mm/filemap.c                            | 14 +++++++++++++-
 mm/readahead.c                          | 20 +++++++++++++++-----
 5 files changed, 46 insertions(+), 6 deletions(-)
hulk inclusion
category: bugfix
bugzilla: https://gitee.com/openeuler/kernel/issues/I8EXN6
CVE: NA
--------------------------------
When filemap_add_folio() returns -ENOMEM, break out of the read-ahead loop, as is already done when filemap_alloc_folio() fails.
Signed-off-by: Liu Shixin <liushixin2@huawei.com>
Signed-off-by: Jinjiang Tu <tujinjiang@huawei.com>
Reviewed-by: Jan Kara <jack@suse.cz>
---
 mm/readahead.c | 17 ++++++++++++-----
 1 file changed, 12 insertions(+), 5 deletions(-)
diff --git a/mm/readahead.c b/mm/readahead.c
index ed23d5dec123..22dd9c8fe808 100644
--- a/mm/readahead.c
+++ b/mm/readahead.c
@@ -220,11 +220,18 @@ void page_cache_ra_unbounded(struct readahead_control *ractl,
 		if (mapping->a_ops->readpages) {
 			page->index = index + i;
 			list_add(&page->lru, &page_pool);
-		} else if (add_to_page_cache_lru(page, mapping, index + i,
-					gfp_mask) < 0) {
-			put_page(page);
-			read_pages(ractl, &page_pool, true);
-			continue;
+		} else {
+			int ret;
+
+			ret = add_to_page_cache_lru(page, mapping, index + i,
+					gfp_mask);
+			if (ret < 0) {
+				put_page(page);
+				read_pages(ractl, &page_pool, true);
+				if (ret == -ENOMEM)
+					break;
+				continue;
+			}
 		}
 		if (i == nr_to_read - lookahead_size)
 			SetPageReadahead(page);
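The control-flow change in the hunk above can be illustrated with a small user-space model. This is only a sketch under stated assumptions: struct fake_cache, fake_add_to_cache() and the break_on_enomem parameter are hypothetical stand-ins invented for the illustration, not kernel interfaces, and the put_page()/read_pages() calls of the real loop are left out.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/*
 * Hypothetical stand-in for the page cache: insertion fails with fail_err
 * at index fail_at and succeeds everywhere else. Not a kernel structure.
 */
struct fake_cache {
	size_t fail_at;	/* index at which insertion fails */
	int fail_err;	/* negative errno returned there, e.g. -ENOMEM */
};

/* Hypothetical replacement for add_to_page_cache_lru(). */
static int fake_add_to_cache(const struct fake_cache *c, size_t index)
{
	return index == c->fail_at ? c->fail_err : 0;
}

/*
 * Walk nr_to_read indices and return how many insertions were attempted.
 * With break_on_enomem set (the patched behaviour), -ENOMEM ends the loop.
 * Any other error, such as -EEXIST, only skips the current index, which is
 * also what happens for every error when break_on_enomem is clear.
 */
static size_t readahead_attempts(const struct fake_cache *c,
				 size_t nr_to_read, int break_on_enomem)
{
	size_t attempts = 0;
	size_t i;

	assert(c != NULL);

	for (i = 0; i < nr_to_read; i++) {
		int ret = fake_add_to_cache(c, i);

		attempts++;
		if (ret < 0) {
			if (break_on_enomem && ret == -ENOMEM)
				break;
			continue;
		}
	}
	return attempts;
}
```

In this model, a 32-page window that hits -ENOMEM at index 3 stops after 4 attempts when break_on_enomem is set, and still makes all 32 attempts when it is clear or when the error is -EEXIST.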
hulk inclusion
category: bugfix
bugzilla: https://gitee.com/openeuler/kernel/issues/I8EXN6
CVE: NA
--------------------------------
If too many folios in a file have been recently evicted, they will probably continue to be evicted. In this situation, read-ahead on the file has no benefit and only wastes I/O.

mmap_miss is increased in do_sync_mmap_readahead() and decreased in both do_async_mmap_readahead() and filemap_map_pages(). To skip read-ahead in the above scenario, mmap_miss has to grow beyond MMAP_LOTSAMISS. This can be achieved by not decreasing mmap_miss when the folio has the workingset flag set. The async path does not need the same change because, in the above scenario, it is rarely reached.
Signed-off-by: Liu Shixin <liushixin2@huawei.com>
Reviewed-by: Jan Kara <jack@suse.cz>
---
 mm/filemap.c | 9 ++++++++-
 1 file changed, 8 insertions(+), 1 deletion(-)
diff --git a/mm/filemap.c b/mm/filemap.c
index 2eeb9978f39e..1cfcb82223fa 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -3168,7 +3168,14 @@ void filemap_map_pages(struct vm_fault *vmf,
 		if (xas.xa_index >= max_idx)
 			goto unlock;
 
-		if (mmap_miss > 0)
+		/*
+		 * If there are too many folios that are recently evicted
+		 * in a file, they will probably continue to be evicted.
+		 * In such situation, read-ahead is only a waste of IO.
+		 * Don't decrease mmap_miss in this scenario to make sure
+		 * we can stop read-ahead.
+		 */
+		if (mmap_miss > 0 && !PageWorkingset(page))
 			mmap_miss--;
 
 		vmf->address += (xas.xa_index - last_pgoff) << PAGE_SHIFT;
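To see why keeping mmap_miss from decreasing lets it cross MMAP_LOTSAMISS, the counter can be modelled in user space. The sketch below is an illustration only. It assumes MMAP_LOTSAMISS is 100 and that the sync path saturates the counter at ten times that value, as in mainline mm/filemap.c. struct ra_model and all function names are hypothetical, and every fault is simplified to one sync miss followed by a fixed number of mapped pages.

```c
#include <assert.h>
#include <stddef.h>

#define MMAP_LOTSAMISS 100	/* assumed value, taken from mainline mm/filemap.c */

/* Hypothetical model of the per-file read-ahead state. */
struct ra_model {
	unsigned int mmap_miss;
};

/*
 * Model of the sync fault path: count the miss, then report whether
 * read-ahead would still be issued (1) or is suppressed (0).
 */
static int sync_miss(struct ra_model *ra)
{
	assert(ra != NULL);

	if (ra->mmap_miss < MMAP_LOTSAMISS * 10)
		ra->mmap_miss++;
	return ra->mmap_miss <= MMAP_LOTSAMISS;
}

/*
 * Model of the per-page step in filemap_map_pages(). With skip_workingset
 * set (patched behaviour), pages carrying the workingset flag no longer
 * decrement the counter.
 */
static void map_page(struct ra_model *ra, int workingset, int skip_workingset)
{
	assert(ra != NULL);

	if (ra->mmap_miss > 0 && !(skip_workingset && workingset))
		ra->mmap_miss--;
}

/*
 * Return the number of the first fault at which read-ahead is suppressed,
 * or -1 if that never happens within max_faults faults.
 */
static int faults_until_suppressed(int pages_per_fault, int workingset,
				   int skip_workingset, int max_faults)
{
	struct ra_model ra = { 0 };
	int fault, page;

	for (fault = 1; fault <= max_faults; fault++) {
		if (!sync_miss(&ra))
			return fault;
		for (page = 0; page < pages_per_fault; page++)
			map_page(&ra, workingset, skip_workingset);
	}
	return -1;
}
```

In this model, with 16 workingset pages mapped per fault, read-ahead is suppressed at fault 101 when skip_workingset is set. Without it, each fault's single increment is undone by the decrements that follow, so the threshold is never reached.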
hulk inclusion
category: bugfix
bugzilla: https://gitee.com/openeuler/kernel/issues/I8EXN6
CVE: NA
--------------------------------
The previous two patches fix a corner case that occurs when a memory limit has almost been reached. Add a sysctl to enable or disable this readahead_early_break mode.
Signed-off-by: Liu Shixin <liushixin2@huawei.com>
---
 Documentation/admin-guide/sysctl/vm.rst | 7 +++++++
 include/linux/pagemap.h                 | 2 ++
 kernel/sysctl.c                         | 9 +++++++++
 mm/filemap.c                            | 2 +-
 mm/readahead.c                          | 5 ++++-
 5 files changed, 23 insertions(+), 2 deletions(-)
diff --git a/Documentation/admin-guide/sysctl/vm.rst b/Documentation/admin-guide/sysctl/vm.rst
index b508acfdde2e..0880f769f4dd 100644
--- a/Documentation/admin-guide/sysctl/vm.rst
+++ b/Documentation/admin-guide/sysctl/vm.rst
@@ -80,6 +80,7 @@ Currently, these files are in /proc/sys/vm:
 - cache_reclaim_weight
 - cache_reclaim_enable
 - cache_limit_mbytes
+- readahead_early_break
 
 
 admin_reserve_kbytes
@@ -1089,3 +1090,9 @@ cache_limit_mbytes
 
 This is used to set the upper limit of page cache in megabytes.
 Page cache will be reclaimed periodically if page cache is over limit.
+
+readahead_early_break
+=====================
+
+This is used to break readahead when reached memcg limit or there are too
+many folio that are recently evicted.
diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h
index 0bfa9cce6589..2568ff96c13a 100644
--- a/include/linux/pagemap.h
+++ b/include/linux/pagemap.h
@@ -817,6 +817,8 @@ struct readahead_control {
 
 #define VM_READAHEAD_PAGES	(SZ_128K / PAGE_SIZE)
 
+extern int vm_readahead_early_break;
+
 void page_cache_ra_unbounded(struct readahead_control *,
 		unsigned long nr_to_read, unsigned long lookahead_count);
 void page_cache_sync_ra(struct readahead_control *, struct file_ra_state *,
diff --git a/kernel/sysctl.c b/kernel/sysctl.c
index f3f43b2def7f..97dda5113657 100644
--- a/kernel/sysctl.c
+++ b/kernel/sysctl.c
@@ -3335,6 +3335,15 @@ static struct ctl_table vm_table[] = {
 		.extra2		= SYSCTL_ONE,
 	},
 #endif
+	{
+		.procname	= "readahead_early_break",
+		.data		= &vm_readahead_early_break,
+		.maxlen		= sizeof(vm_readahead_early_break),
+		.mode		= 0644,
+		.proc_handler	= proc_dointvec_minmax,
+		.extra1		= SYSCTL_ZERO,
+		.extra2		= SYSCTL_ONE,
+	},
 	{ }
 };
 
diff --git a/mm/filemap.c b/mm/filemap.c
index 1cfcb82223fa..dbf5379d74c7 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -3175,7 +3175,7 @@ void filemap_map_pages(struct vm_fault *vmf,
 		 * Don't decrease mmap_miss in this scenario to make sure
 		 * we can stop read-ahead.
 		 */
-		if (mmap_miss > 0 && !PageWorkingset(page))
+		if (mmap_miss > 0 && !(vm_readahead_early_break && PageWorkingset(page)))
 			mmap_miss--;
 
 		vmf->address += (xas.xa_index - last_pgoff) << PAGE_SHIFT;
diff --git a/mm/readahead.c b/mm/readahead.c
index 22dd9c8fe808..b2652bc20623 100644
--- a/mm/readahead.c
+++ b/mm/readahead.c
@@ -27,6 +27,9 @@
 #include "internal.h"
 
 #define READAHEAD_FIRST_SIZE	(2 * 1024 * 1024)
+
+int vm_readahead_early_break;
+
 /*
  * Initialise a struct file's readahead state.  Assumes that the caller has
  * memset *ra to zero.
@@ -228,7 +231,7 @@ void page_cache_ra_unbounded(struct readahead_control *ractl,
 			if (ret < 0) {
 				put_page(page);
 				read_pages(ractl, &page_pool, true);
-				if (vm_readahead_early_break && (ret == -ENOMEM))
+				if (vm_readahead_early_break && (ret == -ENOMEM))
 					break;
 				continue;
 			}
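Since the new entry lives in vm_table with mode 0644 and is limited to the range 0..1, it should appear as /proc/sys/vm/readahead_early_break on a kernel carrying this patch, writable by root. A minimal user-space sketch of toggling it is shown below. The helper name write_sysctl_int() is hypothetical, and the path is taken as a parameter so the helper can be exercised on an ordinary file.

```c
#include <assert.h>
#include <errno.h>
#include <stdio.h>

/*
 * Write a 0/1 integer to a sysctl-style file. Returns 0 on success or a
 * negative errno value on failure. Values outside 0..1 are rejected up
 * front, mirroring the SYSCTL_ZERO/SYSCTL_ONE bounds in the patch.
 */
static int write_sysctl_int(const char *path, int value)
{
	FILE *f;
	int err = 0;

	assert(path != NULL);

	if (value < 0 || value > 1)
		return -EINVAL;

	f = fopen(path, "w");
	if (!f)
		return -errno;

	if (fprintf(f, "%d\n", value) < 0)
		err = -EIO;

	/* For procfs files a rejected write typically shows up at flush time. */
	if (fclose(f) != 0 && err == 0)
		err = -errno;

	return err;
}
```

Calling write_sysctl_int("/proc/sys/vm/readahead_early_break", 1) as root would enable the early-break behaviour, and passing 0 would restore the previous behaviour. On a kernel without this patch the file does not exist and the call fails with -ENOENT.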
FeedBack:
The patch(es) which you have sent to kernel@openeuler.org mailing list has been converted to a pull request successfully!
Pull request link: https://gitee.com/openeuler/kernel/pulls/5549
Mailing list address: https://mailweb.openeuler.org/hyperkitty/list/kernel@openeuler.org/message/H...