From: Vasily Averin vvs@virtuozzo.com
mainline inclusion
from mainline-v5.15-rc1
commit 79f6540ba88dfb383ecf057a3425e668105ca774
category: bugfix
bugzilla: https://gitee.com/openeuler/kernel/issues/I4A0WD
CVE: NA
--------------------------------
Patch series "memcg accounting from OpenVZ", v7.
OpenVZ has used memory accounting for 20+ years, since the v2.2.x Linux kernels. Initially we used our own accounting subsystem, then partially committed it to upstream, and a few years ago switched to cgroups v1. Now we're rebasing again, revising our old patches and trying to push them upstream.
We try to protect the host system from any misuse of kernel memory allocation triggered by untrusted users inside the containers.
The patch set is addressed mostly to the cgroups maintainers and the cgroups@ mailing list, though I would be very grateful for any comments from maintainers of affected subsystems or other people added in cc.
Compared to the upstream, we additionally account the following kernel objects:
- network devices and their Tx/Rx queues
- ipv4/v6 addresses and routing-related objects
- inet_bind_bucket cache objects
- VLAN group arrays
- ipv6/sit: ip_tunnel_prl
- scm_fp_list objects used by SCM_RIGHTS messages of Unix sockets
- nsproxy and namespace objects themselves
- IPC objects: semaphores, message queues and shared memory segments
- mounts
- pollfd and select bits arrays
- signals and posix timers
- file locks
- fasync_struct used by the file lease code and driver's fasync queues
- tty objects
- per-mm LDT
We have incorrect, incomplete or obsolete accounting for a few other kernel objects: sk_filter, af_packets, netlink and xt_counters for iptables. They require rework and will probably be dropped altogether.
We are also going to add accounting for nft; however, it is not ready yet.
We have not tested performance on upstream; however, our performance team compared our current RHEL7-based production kernel against the corresponding original RHEL7 kernel and reports that it is at least not worse.
This patch (of 10):
The kernel allocates ~400 bytes of 'struct mount' for any new mount. Creating a new mount namespace clones most of the parent mounts, and this can be repeated many times. Additionally, each mount allocates up to PATH_MAX=4096 bytes for mnt->mnt_devname.
It makes sense to account for these allocations to restrict the host's memory consumption from inside the memcg-limited container.
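To put the numbers above in perspective, here is a rough userspace sketch of the worst-case arithmetic. It is not part of the patch. The helper name mount_mem_upper_bound() and both constants are illustrative assumptions taken from the approximate figures quoted in this message; real usage is normally far lower, since device names are usually short.

```c
#include <assert.h>
#include <stdint.h>

/* Approximate figures quoted in the commit message (assumptions, not
 * values read from a running kernel). */
#define MOUNT_STRUCT_BYTES 400u   /* ~sizeof(struct mount) */
#define DEVNAME_MAX_BYTES  4096u  /* PATH_MAX bound for mnt->mnt_devname */

/*
 * Hypothetical helper: upper bound, in bytes, on the memory held by
 * 'namespaces' mount namespaces that each contain 'mounts' mounts,
 * assuming every mount carries a maximum-length device name.
 */
static uint64_t mount_mem_upper_bound(uint64_t namespaces, uint64_t mounts)
{
	const uint64_t per_mount = MOUNT_STRUCT_BYTES + DEVNAME_MAX_BYTES;

	/* Refuse inputs whose product would overflow 64 bits. */
	assert(namespaces == 0 ||
	       mounts <= UINT64_MAX / per_mount / namespaces);

	return namespaces * mounts * per_mount;
}
```

Under these assumptions, 1000 namespaces with 100 cloned mounts each could pin up to 449600000 bytes (about 428 MiB) that, without accounting, would not be charged to the container's memcg.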
Link: https://lkml.kernel.org/r/045db11f-4a45-7c9b-2664-5b32c2b44943@virtuozzo.com
Signed-off-by: Vasily Averin vvs@virtuozzo.com
Reviewed-by: Shakeel Butt shakeelb@google.com
Acked-by: Christian Brauner christian.brauner@ubuntu.com
Cc: Tejun Heo tj@kernel.org
Cc: Michal Hocko mhocko@kernel.org
Cc: Johannes Weiner hannes@cmpxchg.org
Cc: Vladimir Davydov vdavydov.dev@gmail.com
Cc: Roman Gushchin guro@fb.com
Cc: Yutian Yang nglaive@gmail.com
Cc: Alexander Viro viro@zeniv.linux.org.uk
Cc: Alexey Dobriyan adobriyan@gmail.com
Cc: Andrei Vagin avagin@gmail.com
Cc: Borislav Petkov bp@alien8.de
Cc: Dmitry Safonov 0x7f454c46@gmail.com
Cc: "Eric W. Biederman" ebiederm@xmission.com
Cc: Greg Kroah-Hartman gregkh@linuxfoundation.org
Cc: "H. Peter Anvin" hpa@zytor.com
Cc: Ingo Molnar mingo@redhat.com
Cc: "J. Bruce Fields" bfields@fieldses.org
Cc: Jeff Layton jlayton@kernel.org
Cc: Jens Axboe axboe@kernel.dk
Cc: Jiri Slaby jirislaby@kernel.org
Cc: Kirill Tkhai ktkhai@virtuozzo.com
Cc: Oleg Nesterov oleg@redhat.com
Cc: Serge Hallyn serge@hallyn.com
Cc: Thomas Gleixner tglx@linutronix.de
Cc: Zefan Li lizefan.x@bytedance.com
Cc: Borislav Petkov bp@suse.de
Signed-off-by: Andrew Morton akpm@linux-foundation.org
Signed-off-by: Linus Torvalds torvalds@linux-foundation.org
Signed-off-by: Li Ming limingming.li@huawei.com

Signed-off-by: Lu Jialin lujialin4@huawei.com
Reviewed-by: Xiu Jianfeng xiujianfeng@huawei.com
Signed-off-by: Yang Yingliang yangyingliang@huawei.com
---
 fs/namespace.c | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/fs/namespace.c b/fs/namespace.c
index 237f12b4882ac..0683b67d83f91 100644
--- a/fs/namespace.c
+++ b/fs/namespace.c
@@ -185,7 +185,8 @@ static struct mount *alloc_vfsmnt(const char *name)
 			goto out_free_cache;
 
 		if (name) {
-			mnt->mnt_devname = kstrdup_const(name, GFP_KERNEL);
+			mnt->mnt_devname = kstrdup_const(name,
+							 GFP_KERNEL_ACCOUNT);
 			if (!mnt->mnt_devname)
 				goto out_free_id;
 		}
@@ -3289,7 +3290,7 @@ void __init mnt_init(void)
 	int err;
 
 	mnt_cache = kmem_cache_create("mnt_cache", sizeof(struct mount),
-			0, SLAB_HWCACHE_ALIGN | SLAB_PANIC, NULL);
+			0, SLAB_HWCACHE_ALIGN|SLAB_PANIC|SLAB_ACCOUNT, NULL);
 
 	mount_hashtable = alloc_large_system_hash("Mount-cache",
 				sizeof(struct hlist_head),