From: Vasily Averin vvs@virtuozzo.com
mainline inclusion from mainline-v5.15-rc1 commit fab827dbee8c2e06ca4ba000fa6c48bcf9054aba category: bugfix bugzilla: https://gitee.com/openeuler/kernel/issues/I4A0WD CVE: NA
--------------------------------
Commit 5d097056c9a0 ("kmemcg: account certain kmem allocations to memcg") enabled memcg accounting for pids allocated from init_pid_ns.pid_cachep, but forgot to adjust the setting for nested pid namespaces. As a result, pid memory is not accounted exactly where it is really needed, inside memcg-limited containers with their own pid namespaces.
Pid was one the first kernel objects enabled for memcg accounting. init_pid_ns.pid_cachep marked by SLAB_ACCOUNT and we can expect that any new pids in the system are memcg-accounted.
Though recently I've noticed that it is wrong. nested pid namespaces creates own slab caches for pid objects, nested pids have increased size because contain id both for all parent and for own pid namespaces. The problem is that these slab caches are _NOT_ marked by SLAB_ACCOUNT, as a result any pids allocated in nested pid namespaces are not memcg-accounted.
Pid struct in nested pid namespace consumes up to 500 bytes memory, 100000 such objects gives us up to ~50Mb unaccounted memory, this allow container to exceed assigned memcg limits.
Link: https://lkml.kernel.org/r/8b6de616-fd1a-02c6-cbdb-976ecdcfa604@virtuozzo.com Fixes: 5d097056c9a0 ("kmemcg: account certain kmem allocations to memcg") Cc: stable@vger.kernel.org Signed-off-by: Vasily Averin vvs@virtuozzo.com Reviewed-by: Michal Koutný mkoutny@suse.com Reviewed-by: Shakeel Butt shakeelb@google.com Acked-by: Christian Brauner christian.brauner@ubuntu.com Acked-by: Roman Gushchin guro@fb.com Cc: Michal Hocko mhocko@suse.com Cc: Johannes Weiner hannes@cmpxchg.org Signed-off-by: Andrew Morton akpm@linux-foundation.org Signed-off-by: Linus Torvalds torvalds@linux-foundation.org Signed-off-by: Li Ming limingming.li@huawei.com
Signed-off-by: Lu Jialin lujialin4@huawei.com Reviewed-by: Xiu Jianfeng xiujianfeng@huawei.com Signed-off-by: Yang Yingliang yangyingliang@huawei.com --- kernel/pid_namespace.c | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-)
diff --git a/kernel/pid_namespace.c b/kernel/pid_namespace.c index dd7a0e41d4016..ad95ca43b51ee 100644 --- a/kernel/pid_namespace.c +++ b/kernel/pid_namespace.c @@ -52,7 +52,8 @@ static struct kmem_cache *create_pid_cachep(unsigned int level) mutex_lock(&pid_caches_mutex); /* Name collision forces to do allocation under mutex. */ if (!*pkc) - *pkc = kmem_cache_create(name, len, 0, SLAB_HWCACHE_ALIGN, 0); + *pkc = kmem_cache_create(name, len, 0, + SLAB_HWCACHE_ALIGN | SLAB_ACCOUNT, 0); mutex_unlock(&pid_caches_mutex); /* current can fail, but someone else can succeed. */ return READ_ONCE(*pkc);
From: Vasily Averin vvs@virtuozzo.com
mainline inclusion from mainline-v5.15-rc1 commit 79f6540ba88dfb383ecf057a3425e668105ca774 category: bugfix bugzilla: https://gitee.com/openeuler/kernel/issues/I4A0WD CVE: NA
--------------------------------
Patch series "memcg accounting from OpenVZ", v7.
OpenVZ uses memory accounting 20+ years since v2.2.x linux kernels. Initially we used our own accounting subsystem, then partially committed it to upstream, and a few years ago switched to cgroups v1. Now we're rebasing again, revising our old patches and trying to push them upstream.
We try to protect the host system from any misuse of kernel memory allocation triggered by untrusted users inside the containers.
Patch-set is addressed mostly to cgroups maintainers and cgroups@ mailing list, though I would be very grateful for any comments from maintainersi of affected subsystems or other people added in cc:
Compared to the upstream, we additionally account the following kernel objects: - network devices and its Tx/Rx queues - ipv4/v6 addresses and routing-related objects - inet_bind_bucket cache objects - VLAN group arrays - ipv6/sit: ip_tunnel_prl - scm_fp_list objects used by SCM_RIGHTS messages of Unix sockets - nsproxy and namespace objects itself - IPC objects: semaphores, message queues and share memory segments - mounts - pollfd and select bits arrays - signals and posix timers - file lock - fasync_struct used by the file lease code and driver's fasync queues - tty objects - per-mm LDT
We have an incorrect/incomplete/obsoleted accounting for few other kernel objects: sk_filter, af_packets, netlink and xt_counters for iptables. They require rework and probably will be dropped at all.
Also we're going to add an accounting for nft, however it is not ready yet.
We have not tested performance on upstream, however, our performance team compares our current RHEL7-based production kernel and reports that they are at least not worse as the according original RHEL7 kernel.
This patch (of 10):
The kernel allocates ~400 bytes of 'struct mount' for any new mount. Creating a new mount namespace clones most of the parent mounts, and this can be repeated many times. Additionally, each mount allocates up to PATH_MAX=4096 bytes for mnt->mnt_devname.
It makes sense to account for these allocations to restrict the host's memory consumption from inside the memcg-limited container.
Link: https://lkml.kernel.org/r/045db11f-4a45-7c9b-2664-5b32c2b44943@virtuozzo.com Signed-off-by: Vasily Averin vvs@virtuozzo.com Reviewed-by: Shakeel Butt shakeelb@google.com Acked-by: Christian Brauner christian.brauner@ubuntu.com Cc: Tejun Heo tj@kernel.org Cc: Michal Hocko mhocko@kernel.org Cc: Johannes Weiner hannes@cmpxchg.org Cc: Vladimir Davydov vdavydov.dev@gmail.com Cc: Roman Gushchin guro@fb.com Cc: Yutian Yang nglaive@gmail.com Cc: Alexander Viro viro@zeniv.linux.org.uk Cc: Alexey Dobriyan adobriyan@gmail.com Cc: Andrei Vagin avagin@gmail.com Cc: Borislav Petkov bp@alien8.de Cc: Dmitry Safonov 0x7f454c46@gmail.com Cc: "Eric W. Biederman" ebiederm@xmission.com Cc: Greg Kroah-Hartman gregkh@linuxfoundation.org Cc: "H. Peter Anvin" hpa@zytor.com Cc: Ingo Molnar mingo@redhat.com Cc: "J. Bruce Fields" bfields@fieldses.org Cc: Jeff Layton jlayton@kernel.org Cc: Jens Axboe axboe@kernel.dk Cc: Jiri Slaby jirislaby@kernel.org Cc: Kirill Tkhai ktkhai@virtuozzo.com Cc: Oleg Nesterov oleg@redhat.com Cc: Serge Hallyn serge@hallyn.com Cc: Thomas Gleixner tglx@linutronix.de Cc: Zefan Li lizefan.x@bytedance.com Cc: Borislav Petkov bp@suse.de Signed-off-by: Andrew Morton akpm@linux-foundation.org Signed-off-by: Linus Torvalds torvalds@linux-foundation.org Signed-off-by: Li Ming limingming.li@huawei.com
Signed-off-by: Lu Jialin lujialin4@huawei.com Reviewed-by: Xiu Jianfeng xiujianfeng@huawei.com Signed-off-by: Yang Yingliang yangyingliang@huawei.com --- fs/namespace.c | 5 +++-- 1 file changed, 3 insertions(+), 2 deletions(-)
diff --git a/fs/namespace.c b/fs/namespace.c index 237f12b4882ac..0683b67d83f91 100644 --- a/fs/namespace.c +++ b/fs/namespace.c @@ -185,7 +185,8 @@ static struct mount *alloc_vfsmnt(const char *name) goto out_free_cache;
if (name) { - mnt->mnt_devname = kstrdup_const(name, GFP_KERNEL); + mnt->mnt_devname = kstrdup_const(name, + GFP_KERNEL_ACCOUNT); if (!mnt->mnt_devname) goto out_free_id; } @@ -3289,7 +3290,7 @@ void __init mnt_init(void) int err;
mnt_cache = kmem_cache_create("mnt_cache", sizeof(struct mount), - 0, SLAB_HWCACHE_ALIGN | SLAB_PANIC, NULL); + 0, SLAB_HWCACHE_ALIGN|SLAB_PANIC|SLAB_ACCOUNT, NULL);
mount_hashtable = alloc_large_system_hash("Mount-cache", sizeof(struct hlist_head),
From: Vasily Averin vvs@virtuozzo.com
mainline inclusion from mainline-v5.15-rc1 commit 839d68206de869b8cb4272c5ea10da2ef7ec34cb category: bugfix bugzilla: https://gitee.com/openeuler/kernel/issues/I4A0WD CVE: NA
--------------------------------
fasync_struct is used by almost all character device drivers to set up the fasync queue, and for regular files by the file lease code. This structure is quite small but long-living and it can be assigned for any open file.
It makes sense to account for its allocations to restrict the host's memory consumption from inside the memcg-limited container.
Link: https://lkml.kernel.org/r/1b408625-d71c-0b26-b0b6-9baf00f93e69@virtuozzo.com Signed-off-by: Vasily Averin vvs@virtuozzo.com Reviewed-by: Shakeel Butt shakeelb@google.com Cc: Alexander Viro viro@zeniv.linux.org.uk Cc: Alexey Dobriyan adobriyan@gmail.com Cc: Andrei Vagin avagin@gmail.com Cc: Borislav Petkov bp@alien8.de Cc: Borislav Petkov bp@suse.de Cc: Christian Brauner christian.brauner@ubuntu.com Cc: Dmitry Safonov 0x7f454c46@gmail.com Cc: "Eric W. Biederman" ebiederm@xmission.com Cc: Greg Kroah-Hartman gregkh@linuxfoundation.org Cc: "H. Peter Anvin" hpa@zytor.com Cc: Ingo Molnar mingo@redhat.com Cc: "J. Bruce Fields" bfields@fieldses.org Cc: Jeff Layton jlayton@kernel.org Cc: Jens Axboe axboe@kernel.dk Cc: Jiri Slaby jirislaby@kernel.org Cc: Johannes Weiner hannes@cmpxchg.org Cc: Kirill Tkhai ktkhai@virtuozzo.com Cc: Michal Hocko mhocko@kernel.org Cc: Oleg Nesterov oleg@redhat.com Cc: Roman Gushchin guro@fb.com Cc: Serge Hallyn serge@hallyn.com Cc: Tejun Heo tj@kernel.org Cc: Thomas Gleixner tglx@linutronix.de Cc: Vladimir Davydov vdavydov.dev@gmail.com Cc: Yutian Yang nglaive@gmail.com Cc: Zefan Li lizefan.x@bytedance.com Signed-off-by: Andrew Morton akpm@linux-foundation.org Signed-off-by: Linus Torvalds torvalds@linux-foundation.org Signed-off-by: Li Ming limingming.li@huawei.com
Signed-off-by: Lu Jialin lujialin4@huawei.com Reviewed-by: Xiu Jianfeng xiujianfeng@huawei.com Signed-off-by: Yang Yingliang yangyingliang@huawei.com --- fs/fcntl.c | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-)
diff --git a/fs/fcntl.c b/fs/fcntl.c index 2e54fdc14e51f..9ed564c1bcffd 100644 --- a/fs/fcntl.c +++ b/fs/fcntl.c @@ -1059,7 +1059,8 @@ static int __init fcntl_init(void) __FMODE_EXEC | __FMODE_NONOTIFY));
fasync_cache = kmem_cache_create("fasync_cache", - sizeof(struct fasync_struct), 0, SLAB_PANIC, NULL); + sizeof(struct fasync_struct), 0, + SLAB_PANIC | SLAB_ACCOUNT, NULL); return 0; }
From: Vasily Averin vvs@virtuozzo.com
mainline inclusion from mainline-v5.15-rc1 commit 30acd0bdfb86548172168a0cc71d455944de0683 category: bugfix bugzilla: https://gitee.com/openeuler/kernel/issues/I4A0WD CVE: NA
--------------------------------
Container admin can create new namespaces and force kernel to allocate up to several pages of memory for the namespaces and its associated structures.
Net and uts namespaces have enabled accounting for such allocations. It makes sense to account for rest ones to restrict the host's memory consumption from inside the memcg-limited container.
Link: https://lkml.kernel.org/r/5525bcbf-533e-da27-79b7-158686c64e13@virtuozzo.com Signed-off-by: Vasily Averin vvs@virtuozzo.com Acked-by: Serge Hallyn serge@hallyn.com Acked-by: Christian Brauner christian.brauner@ubuntu.com Acked-by: Kirill Tkhai ktkhai@virtuozzo.com Reviewed-by: Shakeel Butt shakeelb@google.com Cc: Alexander Viro viro@zeniv.linux.org.uk Cc: Alexey Dobriyan adobriyan@gmail.com Cc: Andrei Vagin avagin@gmail.com Cc: Borislav Petkov bp@alien8.de Cc: Borislav Petkov bp@suse.de Cc: Dmitry Safonov 0x7f454c46@gmail.com Cc: "Eric W. Biederman" ebiederm@xmission.com Cc: Greg Kroah-Hartman gregkh@linuxfoundation.org Cc: "H. Peter Anvin" hpa@zytor.com Cc: Ingo Molnar mingo@redhat.com Cc: "J. Bruce Fields" bfields@fieldses.org Cc: Jeff Layton jlayton@kernel.org Cc: Jens Axboe axboe@kernel.dk Cc: Jiri Slaby jirislaby@kernel.org Cc: Johannes Weiner hannes@cmpxchg.org Cc: Michal Hocko mhocko@kernel.org Cc: Oleg Nesterov oleg@redhat.com Cc: Roman Gushchin guro@fb.com Cc: Tejun Heo tj@kernel.org Cc: Thomas Gleixner tglx@linutronix.de Cc: Vladimir Davydov vdavydov.dev@gmail.com Cc: Yutian Yang nglaive@gmail.com Cc: Zefan Li lizefan.x@bytedance.com Signed-off-by: Andrew Morton akpm@linux-foundation.org Signed-off-by: Linus Torvalds torvalds@linux-foundation.org
Conflicts: fs/namespace.c ipc/namespace.c Signed-off-by: Li Ming limingming.li@huawei.com
Signed-off-by: Lu Jialin lujialin4@huawei.com Reviewed-by: Xiu Jianfeng xiujianfeng@huawei.com Signed-off-by: Yang Yingliang yangyingliang@huawei.com --- fs/namespace.c | 2 +- ipc/namespace.c | 2 +- kernel/cgroup/namespace.c | 2 +- kernel/nsproxy.c | 2 +- kernel/pid_namespace.c | 2 +- kernel/user_namespace.c | 2 +- 6 files changed, 6 insertions(+), 6 deletions(-)
diff --git a/fs/namespace.c b/fs/namespace.c index 0683b67d83f91..a34a8782bbcca 100644 --- a/fs/namespace.c +++ b/fs/namespace.c @@ -2917,7 +2917,7 @@ static struct mnt_namespace *alloc_mnt_ns(struct user_namespace *user_ns) if (!ucounts) return ERR_PTR(-ENOSPC);
- new_ns = kmalloc(sizeof(struct mnt_namespace), GFP_KERNEL); + new_ns = kmalloc(sizeof(struct mnt_namespace), GFP_KERNEL_ACCOUNT); if (!new_ns) { dec_mnt_namespaces(ucounts); return ERR_PTR(-ENOMEM); diff --git a/ipc/namespace.c b/ipc/namespace.c index 21607791d62c8..f3038017df2d4 100644 --- a/ipc/namespace.c +++ b/ipc/namespace.c @@ -42,7 +42,7 @@ static struct ipc_namespace *create_ipc_ns(struct user_namespace *user_ns, goto fail;
err = -ENOMEM; - ns = kmalloc(sizeof(struct ipc_namespace), GFP_KERNEL); + ns = kmalloc(sizeof(struct ipc_namespace), GFP_KERNEL_ACCOUNT); if (ns == NULL) goto fail_dec;
diff --git a/kernel/cgroup/namespace.c b/kernel/cgroup/namespace.c index b05f1dd58a622..4b8432cc8040d 100644 --- a/kernel/cgroup/namespace.c +++ b/kernel/cgroup/namespace.c @@ -24,7 +24,7 @@ static struct cgroup_namespace *alloc_cgroup_ns(void) struct cgroup_namespace *new_ns; int ret;
- new_ns = kzalloc(sizeof(struct cgroup_namespace), GFP_KERNEL); + new_ns = kzalloc(sizeof(struct cgroup_namespace), GFP_KERNEL_ACCOUNT); if (!new_ns) return ERR_PTR(-ENOMEM); ret = ns_alloc_inum(&new_ns->ns); diff --git a/kernel/nsproxy.c b/kernel/nsproxy.c index f6c5d330059ac..ab0480892ab15 100644 --- a/kernel/nsproxy.c +++ b/kernel/nsproxy.c @@ -272,6 +272,6 @@ SYSCALL_DEFINE2(setns, int, fd, int, nstype)
int __init nsproxy_cache_init(void) { - nsproxy_cachep = KMEM_CACHE(nsproxy, SLAB_PANIC); + nsproxy_cachep = KMEM_CACHE(nsproxy, SLAB_PANIC|SLAB_ACCOUNT); return 0; } diff --git a/kernel/pid_namespace.c b/kernel/pid_namespace.c index ad95ca43b51ee..a64e0e7e36572 100644 --- a/kernel/pid_namespace.c +++ b/kernel/pid_namespace.c @@ -460,7 +460,7 @@ const struct proc_ns_operations pidns_for_children_operations = {
static __init int pid_namespaces_init(void) { - pid_ns_cachep = KMEM_CACHE(pid_namespace, SLAB_PANIC); + pid_ns_cachep = KMEM_CACHE(pid_namespace, SLAB_PANIC | SLAB_ACCOUNT);
#ifdef CONFIG_CHECKPOINT_RESTORE register_sysctl_paths(kern_path, pid_ns_ctl_table); diff --git a/kernel/user_namespace.c b/kernel/user_namespace.c index 923414a246e9e..9a7f57aecef05 100644 --- a/kernel/user_namespace.c +++ b/kernel/user_namespace.c @@ -1325,7 +1325,7 @@ const struct proc_ns_operations userns_operations = {
static __init int user_namespaces_init(void) { - user_ns_cachep = KMEM_CACHE(user_namespace, SLAB_PANIC); + user_ns_cachep = KMEM_CACHE(user_namespace, SLAB_PANIC | SLAB_ACCOUNT); return 0; } subsys_initcall(user_namespaces_init);
From: Vasily Averin vvs@virtuozzo.com
mainline inclusion from mainline-v5.15-rc1 commit 5f58c39819ff78ca5ddbba2b3cd8ff4779b19bb5 category: bugfix bugzilla: https://gitee.com/openeuler/kernel/issues/I4A0WD CVE: NA
--------------------------------
When a user send a signal to any another processes it forces the kernel to allocate memory for 'struct sigqueue' objects. The number of signals is limited by RLIMIT_SIGPENDING resource limit, but even the default settings allow each user to consume up to several megabytes of memory.
It makes sense to account for these allocations to restrict the host's memory consumption from inside the memcg-limited container.
Link: https://lkml.kernel.org/r/e34e958c-e785-712e-a62a-2c7b66c646c7@virtuozzo.com Signed-off-by: Vasily Averin vvs@virtuozzo.com Reviewed-by: Shakeel Butt shakeelb@google.com Cc: Alexander Viro viro@zeniv.linux.org.uk Cc: Alexey Dobriyan adobriyan@gmail.com Cc: Andrei Vagin avagin@gmail.com Cc: Borislav Petkov bp@alien8.de Cc: Borislav Petkov bp@suse.de Cc: Christian Brauner christian.brauner@ubuntu.com Cc: Dmitry Safonov 0x7f454c46@gmail.com Cc: "Eric W. Biederman" ebiederm@xmission.com Cc: Greg Kroah-Hartman gregkh@linuxfoundation.org Cc: "H. Peter Anvin" hpa@zytor.com Cc: Ingo Molnar mingo@redhat.com Cc: "J. Bruce Fields" bfields@fieldses.org Cc: Jeff Layton jlayton@kernel.org Cc: Jens Axboe axboe@kernel.dk Cc: Jiri Slaby jirislaby@kernel.org Cc: Johannes Weiner hannes@cmpxchg.org Cc: Kirill Tkhai ktkhai@virtuozzo.com Cc: Michal Hocko mhocko@kernel.org Cc: Oleg Nesterov oleg@redhat.com Cc: Roman Gushchin guro@fb.com Cc: Serge Hallyn serge@hallyn.com Cc: Tejun Heo tj@kernel.org Cc: Thomas Gleixner tglx@linutronix.de Cc: Vladimir Davydov vdavydov.dev@gmail.com Cc: Yutian Yang nglaive@gmail.com Cc: Zefan Li lizefan.x@bytedance.com Signed-off-by: Andrew Morton akpm@linux-foundation.org Signed-off-by: Linus Torvalds torvalds@linux-foundation.org
Conflicts: kernel/signal.c Signed-off-by: Li Ming limingming.li@huawei.com
Signed-off-by: Lu Jialin lujialin4@huawei.com Reviewed-by: Xiu Jianfeng xiujianfeng@huawei.com Signed-off-by: Yang Yingliang yangyingliang@huawei.com --- kernel/signal.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/kernel/signal.c b/kernel/signal.c index a39ca8841c8fa..89f1bd9c008a2 100644 --- a/kernel/signal.c +++ b/kernel/signal.c @@ -4051,7 +4051,7 @@ void __init signals_init(void) != offsetof(struct siginfo, _sifields._pad)); BUILD_BUG_ON(sizeof(struct siginfo) != SI_MAX_SIZE);
- sigqueue_cachep = KMEM_CACHE(sigqueue, SLAB_PANIC); + sigqueue_cachep = KMEM_CACHE(sigqueue, SLAB_PANIC | SLAB_ACCOUNT); }
#ifdef CONFIG_KGDB_KDB
From: Vasily Averin vvs@virtuozzo.com
mainline inclusion from mainline-v5.15-rc1 commit c509723ec27e925bb91a20682c448e95d4bc8c9f category: bugfix bugzilla: https://gitee.com/openeuler/kernel/issues/I4A0WD CVE: NA
--------------------------------
A program may create multiple interval timers using timer_create(). For each timer the kernel preallocates a "queued real-time signal", Consequently, the number of timers is limited by the RLIMIT_SIGPENDING resource limit. The allocated object is quite small, ~250 bytes, but even the default signal limits allow to consume up to 100 megabytes per user.
It makes sense to account for them to limit the host's memory consumption from inside the memcg-limited container.
Link: https://lkml.kernel.org/r/57795560-025c-267c-6b1a-dea852d95530@virtuozzo.com Signed-off-by: Vasily Averin vvs@virtuozzo.com Reviewed-by: Thomas Gleixner tglx@linutronix.de Reviewed-by: Shakeel Butt shakeelb@google.com Cc: Alexander Viro viro@zeniv.linux.org.uk Cc: Alexey Dobriyan adobriyan@gmail.com Cc: Andrei Vagin avagin@gmail.com Cc: Borislav Petkov bp@alien8.de Cc: Borislav Petkov bp@suse.de Cc: Christian Brauner christian.brauner@ubuntu.com Cc: Dmitry Safonov 0x7f454c46@gmail.com Cc: "Eric W. Biederman" ebiederm@xmission.com Cc: Greg Kroah-Hartman gregkh@linuxfoundation.org Cc: "H. Peter Anvin" hpa@zytor.com Cc: Ingo Molnar mingo@redhat.com Cc: "J. Bruce Fields" bfields@fieldses.org Cc: Jeff Layton jlayton@kernel.org Cc: Jens Axboe axboe@kernel.dk Cc: Jiri Slaby jirislaby@kernel.org Cc: Johannes Weiner hannes@cmpxchg.org Cc: Kirill Tkhai ktkhai@virtuozzo.com Cc: Michal Hocko mhocko@kernel.org Cc: Oleg Nesterov oleg@redhat.com Cc: Roman Gushchin guro@fb.com Cc: Serge Hallyn serge@hallyn.com Cc: Tejun Heo tj@kernel.org Cc: Vladimir Davydov vdavydov.dev@gmail.com Cc: Yutian Yang nglaive@gmail.com Cc: Zefan Li lizefan.x@bytedance.com Signed-off-by: Andrew Morton akpm@linux-foundation.org Signed-off-by: Linus Torvalds torvalds@linux-foundation.org Signed-off-by: Li Ming limingming.li@huawei.com
Signed-off-by: Lu Jialin lujialin4@huawei.com Reviewed-by: Xiu Jianfeng xiujianfeng@huawei.com Signed-off-by: Yang Yingliang yangyingliang@huawei.com --- kernel/time/posix-timers.c | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-)
diff --git a/kernel/time/posix-timers.c b/kernel/time/posix-timers.c index b0bda93c595d8..aebb612404006 100644 --- a/kernel/time/posix-timers.c +++ b/kernel/time/posix-timers.c @@ -268,8 +268,8 @@ static int posix_get_hrtimer_res(clockid_t which_clock, struct timespec64 *tp) static __init int init_posix_timers(void) { posix_timers_cache = kmem_cache_create("posix_timers_cache", - sizeof (struct k_itimer), 0, SLAB_PANIC, - NULL); + sizeof(struct k_itimer), 0, + SLAB_PANIC | SLAB_ACCOUNT, NULL); return 0; } __initcall(init_posix_timers);
From: Vasily Averin vvs@virtuozzo.com
mainline inclusion from mainline-v5.15-rc1 commit ec403e2ae0dfc85996aad6e944a98a16e6dfcc6d category: bugfix bugzilla: https://gitee.com/openeuler/kernel/issues/I4A0WD CVE: NA
--------------------------------
Each task can request own LDT and force the kernel to allocate up to 64Kb memory per-mm.
There are legitimate workloads with hundreds of processes and there can be hundreds of workloads running on large machines. The unaccounted memory can cause isolation issues between the workloads particularly on highly utilized machines.
It makes sense to account for this objects to restrict the host's memory consumption from inside the memcg-limited container.
Link: https://lkml.kernel.org/r/38010594-50fe-c06d-7cb0-d1f77ca422f3@virtuozzo.com Signed-off-by: Vasily Averin vvs@virtuozzo.com Acked-by: Borislav Petkov bp@suse.de Reviewed-by: Shakeel Butt shakeelb@google.com Cc: Alexander Viro viro@zeniv.linux.org.uk Cc: Alexey Dobriyan adobriyan@gmail.com Cc: Andrei Vagin avagin@gmail.com Cc: Borislav Petkov bp@alien8.de Cc: Christian Brauner christian.brauner@ubuntu.com Cc: Dmitry Safonov 0x7f454c46@gmail.com Cc: "Eric W. Biederman" ebiederm@xmission.com Cc: Greg Kroah-Hartman gregkh@linuxfoundation.org Cc: "H. Peter Anvin" hpa@zytor.com Cc: Ingo Molnar mingo@redhat.com Cc: "J. Bruce Fields" bfields@fieldses.org Cc: Jeff Layton jlayton@kernel.org Cc: Jens Axboe axboe@kernel.dk Cc: Jiri Slaby jirislaby@kernel.org Cc: Johannes Weiner hannes@cmpxchg.org Cc: Kirill Tkhai ktkhai@virtuozzo.com Cc: Michal Hocko mhocko@kernel.org Cc: Oleg Nesterov oleg@redhat.com Cc: Roman Gushchin guro@fb.com Cc: Serge Hallyn serge@hallyn.com Cc: Tejun Heo tj@kernel.org Cc: Thomas Gleixner tglx@linutronix.de Cc: Vladimir Davydov vdavydov.dev@gmail.com Cc: Yutian Yang nglaive@gmail.com Cc: Zefan Li lizefan.x@bytedance.com Signed-off-by: Andrew Morton akpm@linux-foundation.org Signed-off-by: Linus Torvalds torvalds@linux-foundation.org
Conflicts: arch/x86/kernel/ldt.c Signed-off-by: Li Ming limingming.li@huawei.com
Signed-off-by: Lu Jialin lujialin4@huawei.com Reviewed-by: Xiu Jianfeng xiujianfeng@huawei.com Signed-off-by: Yang Yingliang yangyingliang@huawei.com --- arch/x86/kernel/ldt.c | 7 ++++--- 1 file changed, 4 insertions(+), 3 deletions(-)
diff --git a/arch/x86/kernel/ldt.c b/arch/x86/kernel/ldt.c index 65590eee62893..f702b5b380b29 100644 --- a/arch/x86/kernel/ldt.c +++ b/arch/x86/kernel/ldt.c @@ -70,7 +70,7 @@ static struct ldt_struct *alloc_ldt_struct(unsigned int num_entries) if (num_entries > LDT_ENTRIES) return NULL;
- new_ldt = kmalloc(sizeof(struct ldt_struct), GFP_KERNEL); + new_ldt = kmalloc(sizeof(struct ldt_struct), GFP_KERNEL_ACCOUNT); if (!new_ldt) return NULL;
@@ -84,9 +84,10 @@ static struct ldt_struct *alloc_ldt_struct(unsigned int num_entries) * than PAGE_SIZE. */ if (alloc_size > PAGE_SIZE) - new_ldt->entries = vzalloc(alloc_size); + new_ldt->entries = __vmalloc(alloc_size, GFP_KERNEL_ACCOUNT | __GFP_ZERO, + PAGE_KERNEL); else - new_ldt->entries = (void *)get_zeroed_page(GFP_KERNEL); + new_ldt->entries = (void *)get_zeroed_page(GFP_KERNEL_ACCOUNT);
if (!new_ldt->entries) { kfree(new_ldt);