[PATCH OLK-6.6 0/8] v3: ucounts: rlimit scalability issues

Chen Ridong (3):
  ucounts: free ucount only count and rlimit are zero
  ucounts: turn the atomic rlimit to percpu_counter
  ucounts: fix kabi

MengEn Sun (1):
  ucounts: move kfree() out of critical zone protected by ucounts_lock

Sebastian Andrzej Siewior (4):
  rcu: provide a static initializer for hlist_nulls_head
  ucount: replace get_ucounts_or_wrap() with atomic_inc_not_zero()
  ucount: use RCU for ucounts lookups
  ucount: use rcuref_t for reference counting

 include/linux/list_nulls.h     |   1 +
 include/linux/user_namespace.h |  38 ++++-
 init/main.c                    |   1 +
 ipc/mqueue.c                   |   6 +-
 kernel/signal.c                |  11 +-
 kernel/ucount.c                | 247 ++++++++++++++++++++-------------
 mm/mlock.c                     |   5 +-
 7 files changed, 191 insertions(+), 118 deletions(-)

-- 
2.34.1

From: Sebastian Andrzej Siewior <bigeasy@linutronix.de>

mainline inclusion
from mainline-v6.15-rc1
commit 8c6bbda879b62f16bb03321a84554b4f63415c55
category: bugfix
bugzilla: https://gitee.com/openeuler/kernel/issues/IC97W5
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i...

----------------------------------------------------------------------

Patch series "ucount: Simplify refcounting with rcuref_t".

I noticed that the atomic_dec_and_lock_irqsave() in put_ucounts()
sometimes loops even during boot. Something like 2-3 iterations, but
still. This series replaces the refcounting with rcuref_t and adds an
RCU lookup. This allows a lockless lookup in alloc_ucounts() if the
entry is available, and a cmpxchg()-less put of the item.

This patch (of 4):

Provide a static initializer for hlist_nulls_head so that it can be used
in statically defined data structures.

Link: https://lkml.kernel.org/r/20250203150525.456525-1-bigeasy@linutronix.de
Link: https://lkml.kernel.org/r/20250203150525.456525-2-bigeasy@linutronix.de
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Reviewed-by: Paul E. McKenney <paulmck@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Boqun Feng <boqun.feng@gmail.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Joel Fernandes <joel@joelfernandes.org>
Cc: Josh Triplett <josh@joshtriplett.org>
Cc: Lai jiangshan <jiangshanlai@gmail.com>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Mengen Sun <mengensun@tencent.com>
Cc: "Paul E . McKenney" <paulmck@kernel.org>
Cc: "Uladzislau Rezki (Sony)" <urezki@gmail.com>
Cc: YueHong Wu <yuehongwu@tencent.com>
Cc: Zqiang <qiang.zhang1211@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Chen Ridong <chenridong@huawei.com>
---
 include/linux/list_nulls.h | 1 +
 1 file changed, 1 insertion(+)

diff --git a/include/linux/list_nulls.h b/include/linux/list_nulls.h
index fa6e8471bd227..248db9b77ee24 100644
--- a/include/linux/list_nulls.h
+++ b/include/linux/list_nulls.h
@@ -28,6 +28,7 @@ struct hlist_nulls_node {
 #define NULLS_MARKER(value) (1UL | (((long)value) << 1))
 #define INIT_HLIST_NULLS_HEAD(ptr, nulls) \
 	((ptr)->first = (struct hlist_nulls_node *) NULLS_MARKER(nulls))
+#define HLIST_NULLS_HEAD_INIT(nulls) {.first = (struct hlist_nulls_node *)NULLS_MARKER(nulls)}
 
 #define hlist_nulls_entry(ptr, type, member) container_of(ptr,type,member)
-- 
2.34.1
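
For illustration, a minimal sketch of what the new initializer enables;
the table name, its size, and the nulls value here are placeholders, not
taken from the patch:

    #include <linux/list_nulls.h>

    /*
     * A statically initialized nulls-list hash table ("example_table" is
     * hypothetical). Empty buckets end in a "nulls" marker instead of
     * NULL, so an RCU reader that raced with an entry being moved can
     * detect that it ended up on the wrong chain and retry.
     */
    static struct hlist_nulls_head example_table[4] = {
            [0 ... 3] = HLIST_NULLS_HEAD_INIT(0)
    };

Before this patch only the runtime helper INIT_HLIST_NULLS_HEAD()
existed, which cannot appear in a static initializer.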

From: MengEn Sun <mengensun@tencent.com>

mainline inclusion
from mainline-v6.14-rc1
commit f49b42d415a32faee6bc08923821f432f64a4e90
category: bugfix
bugzilla: https://gitee.com/openeuler/kernel/issues/IC97W5
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i...

----------------------------------------------------------------------

Although kfree() does not sleep, it can probabilistically enter a long
chain of calls, so it is better to move the kfree() in alloc_ucounts()
out of the critical section protected by ucounts_lock.

Link: https://lkml.kernel.org/r/1733458427-11794-1-git-send-email-mengensun@tencen...
Signed-off-by: MengEn Sun <mengensun@tencent.com>
Reviewed-by: YueHong Wu <yuehongwu@tencent.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Andrei Vagin <avagin@google.com>
Cc: Joel Granados <joel.granados@kernel.org>
Cc: Thomas Weißschuh <linux@weissschuh.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Chen Ridong <chenridong@huawei.com>
---
 kernel/ucount.c | 8 ++++----
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/kernel/ucount.c b/kernel/ucount.c
index 584b73807c445..2c929c6c4784c 100644
--- a/kernel/ucount.c
+++ b/kernel/ucount.c
@@ -169,8 +169,8 @@ struct ucounts *get_ucounts(struct ucounts *ucounts)
 struct ucounts *alloc_ucounts(struct user_namespace *ns, kuid_t uid)
 {
 	struct hlist_head *hashent = ucounts_hashentry(ns, uid);
-	struct ucounts *ucounts, *new;
 	bool wrapped;
+	struct ucounts *ucounts, *new = NULL;
 
 	spin_lock_irq(&ucounts_lock);
 	ucounts = find_ucounts(ns, uid, hashent);
@@ -187,17 +187,17 @@ struct ucounts *alloc_ucounts(struct user_namespace *ns, kuid_t uid)
 		spin_lock_irq(&ucounts_lock);
 		ucounts = find_ucounts(ns, uid, hashent);
-		if (ucounts) {
-			kfree(new);
-		} else {
+		if (!ucounts) {
 			hlist_add_head(&new->node, hashent);
 			get_user_ns(new->ns);
 			spin_unlock_irq(&ucounts_lock);
 			return new;
 		}
 	}
+
 	wrapped = !get_ucounts_or_wrap(ucounts);
 	spin_unlock_irq(&ucounts_lock);
+	kfree(new);
 	if (wrapped) {
 		put_ucounts(ucounts);
 		return NULL;
-- 
2.34.1
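
The pattern behind this change, in a simplified standalone sketch
(struct obj, obj_lookup(), obj_insert() and example_lock are
placeholders; the real code retries through find_ucounts() instead of
allocating up front):

    static DEFINE_SPINLOCK(example_lock);

    struct obj { int key; };

    static struct obj *obj_lookup(int key);            /* placeholder */
    static void obj_insert(struct obj *obj, int key);  /* placeholder */

    static struct obj *obj_get_or_create(int key)
    {
            struct obj *new = kzalloc(sizeof(*new), GFP_KERNEL);
            struct obj *found;

            spin_lock_irq(&example_lock);
            found = obj_lookup(key);
            if (!found && new) {
                    obj_insert(new, key);
                    found = new;
                    new = NULL;  /* ownership transferred to the list */
            }
            spin_unlock_irq(&example_lock);

            kfree(new);  /* kfree(NULL) is a no-op; never runs under the lock */
            return found;
    }

Initializing new to NULL is what makes the single kfree() after the
unlock correct on every path, which is exactly what the hunk above sets
up for alloc_ucounts().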

From: Sebastian Andrzej Siewior <bigeasy@linutronix.de>

mainline inclusion
from mainline-v6.15-rc1
commit 328152e6774d9d801ad1d90af557b9113647b379
category: bugfix
bugzilla: https://gitee.com/openeuler/kernel/issues/IC97W5
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i...

----------------------------------------------------------------------

get_ucounts_or_wrap() increments the counter, and if the counter is
negative it decrements it again to undo the previous increment. This can
be replaced with atomic_inc_not_zero(), which only increments the
counter if it is not yet 0. This simplifies the get function because the
put (needed if the get failed) can be removed. atomic_inc_not_zero() is
implemented as a cmpxchg() loop, which can be repeated several times if
another get/put is performed in parallel. This will be optimized later.

Increment the reference counter only if it has not yet dropped to zero.

Link: https://lkml.kernel.org/r/20250203150525.456525-3-bigeasy@linutronix.de
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Reviewed-by: Paul E. McKenney <paulmck@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Boqun Feng <boqun.feng@gmail.com>
Cc: Joel Fernandes <joel@joelfernandes.org>
Cc: Josh Triplett <josh@joshtriplett.org>
Cc: Lai jiangshan <jiangshanlai@gmail.com>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Mengen Sun <mengensun@tencent.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: "Uladzislau Rezki (Sony)" <urezki@gmail.com>
Cc: YueHong Wu <yuehongwu@tencent.com>
Cc: Zqiang <qiang.zhang1211@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Chen Ridong <chenridong@huawei.com>
---
 kernel/ucount.c | 24 ++++++------------------
 1 file changed, 6 insertions(+), 18 deletions(-)

diff --git a/kernel/ucount.c b/kernel/ucount.c
index 2c929c6c4784c..797e6479dd914 100644
--- a/kernel/ucount.c
+++ b/kernel/ucount.c
@@ -151,25 +151,16 @@ static void hlist_add_ucounts(struct ucounts *ucounts)
 	spin_unlock_irq(&ucounts_lock);
 }
 
-static inline bool get_ucounts_or_wrap(struct ucounts *ucounts)
-{
-	/* Returns true on a successful get, false if the count wraps. */
-	return !atomic_add_negative(1, &ucounts->count);
-}
-
 struct ucounts *get_ucounts(struct ucounts *ucounts)
 {
-	if (!get_ucounts_or_wrap(ucounts)) {
-		put_ucounts(ucounts);
-		ucounts = NULL;
-	}
-	return ucounts;
+	if (atomic_inc_not_zero(&ucounts->count))
+		return ucounts;
+	return NULL;
 }
 
 struct ucounts *alloc_ucounts(struct user_namespace *ns, kuid_t uid)
 {
 	struct hlist_head *hashent = ucounts_hashentry(ns, uid);
-	bool wrapped;
 	struct ucounts *ucounts, *new = NULL;
 
 	spin_lock_irq(&ucounts_lock);
@@ -194,14 +185,11 @@ struct ucounts *alloc_ucounts(struct user_namespace *ns, kuid_t uid)
 			return new;
 		}
 	}
-
-	wrapped = !get_ucounts_or_wrap(ucounts);
+	if (!atomic_inc_not_zero(&ucounts->count))
+		ucounts = NULL;
 	spin_unlock_irq(&ucounts_lock);
 	kfree(new);
-	if (wrapped) {
-		put_ucounts(ucounts);
-		return NULL;
-	}
+
 	return ucounts;
 }
-- 
2.34.1
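
The two get idioms, side by side in a simplified sketch (a bare atomic_t
stands in for the ucounts count; the real rollback went through
put_ucounts(), which may also free the object):

    /* Old idiom: increment speculatively, roll back if the count wrapped. */
    static bool get_or_wrap(atomic_t *count)
    {
            if (atomic_add_negative(1, count)) {
                    atomic_dec(count);  /* undo the speculative increment */
                    return false;
            }
            return true;
    }

    /*
     * New idiom: increment only while the count is non-zero. Internally
     * this is a cmpxchg() loop, so it can retry under contention -- the
     * retry is what the final rcuref_t patch in this series removes.
     */
    static bool get_not_zero(atomic_t *count)
    {
            return atomic_inc_not_zero(count);
    }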

From: Sebastian Andrzej Siewior <bigeasy@linutronix.de>

mainline inclusion
from mainline-v6.15-rc1
commit 5f01a22c5b231dd590f61a2591b3090665733bcb
category: bugfix
bugzilla: https://gitee.com/openeuler/kernel/issues/IC97W5
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i...

----------------------------------------------------------------------

The ucounts element is looked up under ucounts_lock. This can be
optimized by using RCU for a lockless lookup, returning the element if
the reference can be obtained.

Replace hlist_head with hlist_nulls_head, which is RCU compatible. Let
find_ucounts() search for the required item within an RCU section and
return the item if a reference could be obtained. This means
alloc_ucounts() will always return an element (unless the memory
allocation failed). Let put_ucounts() RCU-free the element if the
reference counter dropped to zero.

Link: https://lkml.kernel.org/r/20250203150525.456525-4-bigeasy@linutronix.de
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Reviewed-by: Paul E. McKenney <paulmck@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Boqun Feng <boqun.feng@gmail.com>
Cc: Joel Fernandes <joel@joelfernandes.org>
Cc: Josh Triplett <josh@joshtriplett.org>
Cc: Lai jiangshan <jiangshanlai@gmail.com>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Mengen Sun <mengensun@tencent.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: "Uladzislau Rezki (Sony)" <urezki@gmail.com>
Cc: YueHong Wu <yuehongwu@tencent.com>
Cc: Zqiang <qiang.zhang1211@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Chen Ridong <chenridong@huawei.com>
---
 include/linux/user_namespace.h |  4 +-
 kernel/ucount.c                | 75 ++++++++++++++++++----------------
 2 files changed, 43 insertions(+), 36 deletions(-)

diff --git a/include/linux/user_namespace.h b/include/linux/user_namespace.h
index ccd5337671e43..6b69ec3d9e66d 100644
--- a/include/linux/user_namespace.h
+++ b/include/linux/user_namespace.h
@@ -5,6 +5,7 @@
 #include <linux/kref.h>
 #include <linux/nsproxy.h>
 #include <linux/ns_common.h>
+#include <linux/rculist_nulls.h>
 #include <linux/sched.h>
 #include <linux/workqueue.h>
 #include <linux/rwsem.h>
@@ -112,9 +113,10 @@ struct user_namespace {
 } __randomize_layout;
 
 struct ucounts {
-	struct hlist_node node;
+	struct hlist_nulls_node node;
 	struct user_namespace *ns;
 	kuid_t uid;
+	struct rcu_head rcu;
 	atomic_t count;
 	atomic_long_t ucount[UCOUNT_COUNTS];
 	atomic_long_t rlimit[UCOUNT_RLIMIT_COUNTS];
diff --git a/kernel/ucount.c b/kernel/ucount.c
index 797e6479dd914..5677eb6e57c9e 100644
--- a/kernel/ucount.c
+++ b/kernel/ucount.c
@@ -15,7 +15,10 @@ struct ucounts init_ucounts = {
 };
 
 #define UCOUNTS_HASHTABLE_BITS 10
-static struct hlist_head ucounts_hashtable[(1 << UCOUNTS_HASHTABLE_BITS)];
+#define UCOUNTS_HASHTABLE_ENTRIES (1 << UCOUNTS_HASHTABLE_BITS)
+static struct hlist_nulls_head ucounts_hashtable[UCOUNTS_HASHTABLE_ENTRIES] = {
+	[0 ... UCOUNTS_HASHTABLE_ENTRIES - 1] = HLIST_NULLS_HEAD_INIT(0)
+};
 static DEFINE_SPINLOCK(ucounts_lock);
 
 #define ucounts_hashfn(ns, uid) \
@@ -24,7 +27,6 @@ static DEFINE_SPINLOCK(ucounts_lock);
 #define ucounts_hashentry(ns, uid) \
 	(ucounts_hashtable + ucounts_hashfn(ns, uid))
 
-
 #ifdef CONFIG_SYSCTL
 static struct ctl_table_set *
 set_lookup(struct ctl_table_root *root)
@@ -132,22 +134,28 @@ void retire_userns_sysctls(struct user_namespace *ns)
 #endif
 }
 
-static struct ucounts *find_ucounts(struct user_namespace *ns, kuid_t uid, struct hlist_head *hashent)
+static struct ucounts *find_ucounts(struct user_namespace *ns, kuid_t uid,
+				    struct hlist_nulls_head *hashent)
 {
 	struct ucounts *ucounts;
+	struct hlist_nulls_node *pos;
 
-	hlist_for_each_entry(ucounts, hashent, node) {
-		if (uid_eq(ucounts->uid, uid) && (ucounts->ns == ns))
-			return ucounts;
+	guard(rcu)();
+	hlist_nulls_for_each_entry_rcu(ucounts, pos, hashent, node) {
+		if (uid_eq(ucounts->uid, uid) && (ucounts->ns == ns)) {
+			if (atomic_inc_not_zero(&ucounts->count))
+				return ucounts;
+		}
 	}
 	return NULL;
 }
 
 static void hlist_add_ucounts(struct ucounts *ucounts)
 {
-	struct hlist_head *hashent = ucounts_hashentry(ucounts->ns, ucounts->uid);
+	struct hlist_nulls_head *hashent = ucounts_hashentry(ucounts->ns, ucounts->uid);
+
 	spin_lock_irq(&ucounts_lock);
-	hlist_add_head(&ucounts->node, hashent);
+	hlist_nulls_add_head_rcu(&ucounts->node, hashent);
 	spin_unlock_irq(&ucounts_lock);
 }
 
@@ -160,37 +168,33 @@ struct ucounts *get_ucounts(struct ucounts *ucounts)
 
 struct ucounts *alloc_ucounts(struct user_namespace *ns, kuid_t uid)
 {
-	struct hlist_head *hashent = ucounts_hashentry(ns, uid);
-	struct ucounts *ucounts, *new = NULL;
+	struct hlist_nulls_head *hashent = ucounts_hashentry(ns, uid);
+	struct ucounts *ucounts, *new;
+
+	ucounts = find_ucounts(ns, uid, hashent);
+	if (ucounts)
+		return ucounts;
+
+	new = kzalloc(sizeof(*new), GFP_KERNEL);
+	if (!new)
+		return NULL;
+
+	new->ns = ns;
+	new->uid = uid;
+	atomic_set(&new->count, 1);
 
 	spin_lock_irq(&ucounts_lock);
 	ucounts = find_ucounts(ns, uid, hashent);
-	if (!ucounts) {
+	if (ucounts) {
 		spin_unlock_irq(&ucounts_lock);
-
-		new = kzalloc(sizeof(*new), GFP_KERNEL);
-		if (!new)
-			return NULL;
-
-		new->ns = ns;
-		new->uid = uid;
-		atomic_set(&new->count, 1);
-
-		spin_lock_irq(&ucounts_lock);
-		ucounts = find_ucounts(ns, uid, hashent);
-		if (!ucounts) {
-			hlist_add_head(&new->node, hashent);
-			get_user_ns(new->ns);
-			spin_unlock_irq(&ucounts_lock);
-			return new;
-		}
+		kfree(new);
+		return ucounts;
 	}
-	if (!atomic_inc_not_zero(&ucounts->count))
-		ucounts = NULL;
-	spin_unlock_irq(&ucounts_lock);
-	kfree(new);
-	return ucounts;
+
+	hlist_nulls_add_head_rcu(&new->node, hashent);
+	get_user_ns(new->ns);
+	spin_unlock_irq(&ucounts_lock);
+	return new;
 }
 
 void put_ucounts(struct ucounts *ucounts)
@@ -198,10 +202,11 @@ void put_ucounts(struct ucounts *ucounts)
 	unsigned long flags;
 
 	if (atomic_dec_and_lock_irqsave(&ucounts->count, &ucounts_lock, flags)) {
-		hlist_del_init(&ucounts->node);
+		hlist_nulls_del_rcu(&ucounts->node);
 		spin_unlock_irqrestore(&ucounts_lock, flags);
+
 		put_user_ns(ucounts->ns);
-		kfree(ucounts);
+		kfree_rcu(ucounts, rcu);
 	}
 }
-- 
2.34.1
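
The lookup now follows the standard RCU nulls-list pattern. A condensed
sketch with placeholder names (struct obj and the key comparison stand
in for ucounts and its uid/ns check); ucounts never moves entries
between buckets, so no nulls-value restart check is needed and the loop
can simply fall through to NULL:

    struct obj {
            struct hlist_nulls_node node;
            atomic_t count;
            int key;
    };

    static struct obj *obj_find(struct hlist_nulls_head *head, int key)
    {
            struct hlist_nulls_node *pos;
            struct obj *obj;

            rcu_read_lock();
            hlist_nulls_for_each_entry_rcu(obj, pos, head, node) {
                    if (obj->key == key &&
                        atomic_inc_not_zero(&obj->count)) {
                            rcu_read_unlock();
                            return obj;  /* caller now holds a reference */
                    }
            }
            rcu_read_unlock();
            return NULL;
    }

The patch itself uses guard(rcu)() for the same effect; the explicit
rcu_read_lock()/rcu_read_unlock() above is just the older spelling.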

From: Sebastian Andrzej Siewior <bigeasy@linutronix.de>

mainline inclusion
from mainline-v6.15-rc1
commit b4dc0bee2a749083028afba346910e198653f42a
category: bugfix
bugzilla: https://gitee.com/openeuler/kernel/issues/IC97W5
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i...

----------------------------------------------------------------------

Use rcuref_t for reference counting. This eliminates the cmpxchg() loop
in the get and put paths. It also eliminates the need to acquire the
lock in the put path, because once the final user returns the
reference, it can no longer be obtained.

Link: https://lkml.kernel.org/r/20250203150525.456525-5-bigeasy@linutronix.de
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Reviewed-by: Paul E. McKenney <paulmck@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Boqun Feng <boqun.feng@gmail.com>
Cc: Joel Fernandes <joel@joelfernandes.org>
Cc: Josh Triplett <josh@joshtriplett.org>
Cc: Lai jiangshan <jiangshanlai@gmail.com>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Mengen Sun <mengensun@tencent.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: "Uladzislau Rezki (Sony)" <urezki@gmail.com>
Cc: YueHong Wu <yuehongwu@tencent.com>
Cc: Zqiang <qiang.zhang1211@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Chen Ridong <chenridong@huawei.com>
---
 include/linux/user_namespace.h | 11 +++++++++--
 kernel/ucount.c                | 16 +++++-----------
 2 files changed, 14 insertions(+), 13 deletions(-)

diff --git a/include/linux/user_namespace.h b/include/linux/user_namespace.h
index 6b69ec3d9e66d..c3b4de67471c8 100644
--- a/include/linux/user_namespace.h
+++ b/include/linux/user_namespace.h
@@ -8,6 +8,7 @@
 #include <linux/rculist_nulls.h>
 #include <linux/sched.h>
 #include <linux/workqueue.h>
+#include <linux/rcuref.h>
 #include <linux/rwsem.h>
 #include <linux/sysctl.h>
 #include <linux/err.h>
@@ -117,7 +118,7 @@ struct ucounts {
 	struct user_namespace *ns;
 	kuid_t uid;
 	struct rcu_head rcu;
-	atomic_t count;
+	rcuref_t count;
 	atomic_long_t ucount[UCOUNT_COUNTS];
 	atomic_long_t rlimit[UCOUNT_RLIMIT_COUNTS];
 };
@@ -130,9 +131,15 @@ void retire_userns_sysctls(struct user_namespace *ns);
 struct ucounts *inc_ucount(struct user_namespace *ns, kuid_t uid, enum ucount_type type);
 void dec_ucount(struct ucounts *ucounts, enum ucount_type type);
 struct ucounts *alloc_ucounts(struct user_namespace *ns, kuid_t uid);
-struct ucounts * __must_check get_ucounts(struct ucounts *ucounts);
 void put_ucounts(struct ucounts *ucounts);
 
+static inline struct ucounts * __must_check get_ucounts(struct ucounts *ucounts)
+{
+	if (rcuref_get(&ucounts->count))
+		return ucounts;
+	return NULL;
+}
+
 static inline long get_rlimit_value(struct ucounts *ucounts, enum rlimit_type type)
 {
 	return atomic_long_read(&ucounts->rlimit[type]);
diff --git a/kernel/ucount.c b/kernel/ucount.c
index 5677eb6e57c9e..fd2ccffe08394 100644
--- a/kernel/ucount.c
+++ b/kernel/ucount.c
@@ -11,7 +11,7 @@
 struct ucounts init_ucounts = {
 	.ns    = &init_user_ns,
 	.uid   = GLOBAL_ROOT_UID,
-	.count = ATOMIC_INIT(1),
+	.count = RCUREF_INIT(1),
 };
 
 #define UCOUNTS_HASHTABLE_BITS 10
@@ -143,7 +143,7 @@ static struct ucounts *find_ucounts(struct user_namespace *ns, kuid_t uid,
 	guard(rcu)();
 	hlist_nulls_for_each_entry_rcu(ucounts, pos, hashent, node) {
 		if (uid_eq(ucounts->uid, uid) && (ucounts->ns == ns)) {
-			if (atomic_inc_not_zero(&ucounts->count))
+			if (rcuref_get(&ucounts->count))
 				return ucounts;
 		}
 	}
@@ -159,13 +159,6 @@ static void hlist_add_ucounts(struct ucounts *ucounts)
 	spin_unlock_irq(&ucounts_lock);
 }
 
-struct ucounts *get_ucounts(struct ucounts *ucounts)
-{
-	if (atomic_inc_not_zero(&ucounts->count))
-		return ucounts;
-	return NULL;
-}
-
 struct ucounts *alloc_ucounts(struct user_namespace *ns, kuid_t uid)
 {
 	struct hlist_nulls_head *hashent = ucounts_hashentry(ns, uid);
@@ -181,7 +174,7 @@ struct ucounts *alloc_ucounts(struct user_namespace *ns, kuid_t uid)
 
 	new->ns = ns;
 	new->uid = uid;
-	atomic_set(&new->count, 1);
+	rcuref_init(&new->count, 1);
 
 	spin_lock_irq(&ucounts_lock);
 	ucounts = find_ucounts(ns, uid, hashent);
@@ -201,7 +194,8 @@ void put_ucounts(struct ucounts *ucounts)
 {
 	unsigned long flags;
 
-	if (atomic_dec_and_lock_irqsave(&ucounts->count, &ucounts_lock, flags)) {
+	if (rcuref_put(&ucounts->count)) {
+		spin_lock_irqsave(&ucounts_lock, flags);
 		hlist_nulls_del_rcu(&ucounts->node);
 		spin_unlock_irqrestore(&ucounts_lock, flags);
-- 
2.34.1
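
A condensed sketch of the resulting get/put pattern with rcuref_t
(struct obj and its free path are placeholders; rcuref assumes objects
are only found via RCU-protected lookups, which the previous patch
established for ucounts):

    #include <linux/rcuref.h>

    struct obj {
            rcuref_t ref;
            struct rcu_head rcu;
    };

    /* The fast path is a plain atomic increment, no cmpxchg() loop;
     * it fails once the count has dropped to zero. */
    static struct obj *obj_get(struct obj *obj)
    {
            return rcuref_get(&obj->ref) ? obj : NULL;
    }

    /* rcuref_put() returns true only for the final put, so the lock is
     * needed only to unlink, and the memory is freed after a grace
     * period. */
    static void obj_put(struct obj *obj)
    {
            if (rcuref_put(&obj->ref))
                    kfree_rcu(obj, rcu);
    }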

hulk inclusion
category: feature
bugzilla: https://gitee.com/openeuler/kernel/issues/IC97W5

--------------------------------

After commit fda31c50292a ("signal: avoid double atomic counter
increments for user accounting") and commit 15bc01effefe ("ucounts: Fix
signal ucount refcounting"), the reference counting mechanism for
ucounts behaves as follows: the reference count is incremented when the
first pending signal pins the ucounts, and decremented when the last
pending signal is dequeued. This implies that as long as any pending
signal is pinned to the ucounts, the ucounts cannot be freed.

To address the scalability issue described in the next patch,
ucounts.rlimit will be converted to a percpu_counter. However, summing
up percpu counters is expensive. To overcome this, this patch modifies
the conditions for freeing ucounts: instead of complex checks on whether
a pending signal is the first or the last one, the ucounts can now be
freed only when both the refcount and the rlimits are zero. This change
not only simplifies the logic but also reduces the number of atomic
operations.

Signed-off-by: Chen Ridong <chenridong@huawei.com>
---
 include/linux/user_namespace.h |  1 +
 kernel/ucount.c                | 75 ++++++++++++++++++++++++++--------
 2 files changed, 59 insertions(+), 17 deletions(-)

diff --git a/include/linux/user_namespace.h b/include/linux/user_namespace.h
index c3b4de67471c8..d504d506a70f1 100644
--- a/include/linux/user_namespace.h
+++ b/include/linux/user_namespace.h
@@ -119,6 +119,7 @@ struct ucounts {
 	kuid_t uid;
 	struct rcu_head rcu;
 	rcuref_t count;
+	atomic_long_t freed;
 	atomic_long_t ucount[UCOUNT_COUNTS];
 	atomic_long_t rlimit[UCOUNT_RLIMIT_COUNTS];
 };
diff --git a/kernel/ucount.c b/kernel/ucount.c
index fd2ccffe08394..1e300184f5edb 100644
--- a/kernel/ucount.c
+++ b/kernel/ucount.c
@@ -190,18 +190,61 @@ struct ucounts *alloc_ucounts(struct user_namespace *ns, kuid_t uid)
 	return new;
 }
 
-void put_ucounts(struct ucounts *ucounts)
+/*
+ * Whether all the rlimits are zero.
+ * For now, only UCOUNT_RLIMIT_SIGPENDING is considered.
+ * Other rlimits can be added.
+ */
+static bool rlimits_are_zero(struct ucounts *ucounts)
+{
+	int rtypes[] = { UCOUNT_RLIMIT_SIGPENDING };
+	int rtype;
+
+	for (int i = 0; i < sizeof(rtypes) / sizeof(int); ++i) {
+		rtype = rtypes[i];
+		if (atomic_long_read(&ucounts->rlimit[rtype]) > 0)
+			return false;
+	}
+	return true;
+}
+
+/*
+ * Ucounts can be freed only when the ucount->count is released
+ * and the rlimits are zero.
+ * The caller should hold rcu_read_lock().
+ */
+static bool ucounts_can_be_freed(struct ucounts *ucounts)
+{
+	if (rcuref_read(&ucounts->count) > 0)
+		return false;
+	if (!rlimits_are_zero(ucounts))
+		return false;
+	/* Prevent double free */
+	return atomic_long_cmpxchg(&ucounts->freed, 0, 1) == 0;
+}
+
+static void free_ucounts(struct ucounts *ucounts)
 {
 	unsigned long flags;
 
-	if (rcuref_put(&ucounts->count)) {
-		spin_lock_irqsave(&ucounts_lock, flags);
-		hlist_nulls_del_rcu(&ucounts->node);
-		spin_unlock_irqrestore(&ucounts_lock, flags);
+	spin_lock_irqsave(&ucounts_lock, flags);
+	hlist_nulls_del_rcu(&ucounts->node);
+	spin_unlock_irqrestore(&ucounts_lock, flags);
+
+	put_user_ns(ucounts->ns);
+	kfree_rcu(ucounts, rcu);
+}
 
-		put_user_ns(ucounts->ns);
-		kfree_rcu(ucounts, rcu);
+void put_ucounts(struct ucounts *ucounts)
+{
+	rcu_read_lock();
+	if (rcuref_put(&ucounts->count) &&
+	    ucounts_can_be_freed(ucounts)) {
+		rcu_read_unlock();
+		free_ucounts(ucounts);
+		return;
 	}
+	rcu_read_unlock();
 }
 
 static inline bool atomic_long_inc_below(atomic_long_t *v, int u)
@@ -286,11 +329,17 @@ static void do_dec_rlimit_put_ucounts(struct ucounts *ucounts,
 {
 	struct ucounts *iter, *next;
 	for (iter = ucounts; iter != last; iter = next) {
+		bool to_free;
+
+		rcu_read_lock();
 		long dec = atomic_long_sub_return(1, &iter->rlimit[type]);
 		WARN_ON_ONCE(dec < 0);
 		next = iter->ns->ucounts;
-		if (dec == 0)
-			put_ucounts(iter);
+		to_free = ucounts_can_be_freed(iter);
+		rcu_read_unlock();
+		/* If ucounts->count is zero and the rlimits are zero, free ucounts */
+		if (to_free)
+			free_ucounts(iter);
 	}
 }
 
@@ -315,14 +364,6 @@ long inc_rlimit_get_ucounts(struct ucounts *ucounts, enum rlimit_type type,
 		ret = new;
 		if (!override_rlimit)
 			max = get_userns_rlimit_max(iter->ns, type);
-		/*
-		 * Grab an extra ucount reference for the caller when
-		 * the rlimit count was previously 0.
-		 */
-		if (new != 1)
-			continue;
-		if (!get_ucounts(iter))
-			goto dec_unwind;
 	}
 	return ret;
 dec_unwind:
-- 
2.34.1
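
The delicate part of this scheme is that two release paths (the final
refcount put and the final rlimit decrement) can both observe
"everything is zero" concurrently, so the actual free is serialized by a
one-shot flag. A minimal sketch of just that guard (struct obj and the
two predicate helpers are placeholders for rcuref_read() and
rlimits_are_zero() above):

    struct obj {
            atomic_long_t freed;  /* 0 = live, 1 = claimed for freeing */
    };

    static bool refcount_still_held(struct obj *obj);     /* placeholder */
    static bool resources_still_pinned(struct obj *obj);  /* placeholder */

    /* Returns true for exactly one caller; that caller must free. */
    static bool obj_claim_free(struct obj *obj)
    {
            if (refcount_still_held(obj) || resources_still_pinned(obj))
                    return false;
            /* Both racing releasers can reach this point; only the
             * cmpxchg() winner proceeds to free. */
            return atomic_long_cmpxchg(&obj->freed, 0, 1) == 0;
    }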

hulk inclusion
category: feature
bugzilla: https://gitee.com/openeuler/kernel/issues/IC97W5

----------------------------------------

We ran the will-it-scale test case signal1 [1], and the results reveal
that the signal-sending system call does not scale linearly. To
investigate further, we launched varying numbers of Docker containers
and monitored the throughput of each individual container. The detailed
test outcomes are as follows:

| Dockers    | 1      | 4      | 8      | 16     | 32     | 64     |
| Throughput | 380068 | 353204 | 308948 | 306453 | 180659 | 129152 |

The data shows a clear trend: as the number of containers increases, the
per-container throughput progressively declines. In-depth analysis
identified the root cause of this degradation: the ucounts module
accounts rlimits with a significant number of atomic operations. These
atomic operations, all acting on the same variable, trigger a
substantial number of cache misses or remote accesses, ultimately
resulting in a drop in performance.

To address this, this patch converts the atomic rlimit counters to a
percpu_counter. After the optimization, the performance data is shown
below, demonstrating that the throughput no longer declines as the
number of containers increases:

| Dockers    | 1      | 4      | 8      | 16     | 32     | 64     |
| Throughput | 374737 | 376377 | 374814 | 379284 | 374950 | 377509 |

[1] https://github.com/antonblanchard/will-it-scale/blob/master/tests/

Signed-off-by: Chen Ridong <chenridong@huawei.com>
---
 include/linux/user_namespace.h | 16 ++++--
 init/main.c                    |  1 +
 ipc/mqueue.c                   |  6 +--
 kernel/signal.c                | 11 ++---
 kernel/ucount.c                | 89 +++++++++++++++++++++-------------
 mm/mlock.c                     |  5 +-
 6 files changed, 75 insertions(+), 53 deletions(-)

diff --git a/include/linux/user_namespace.h b/include/linux/user_namespace.h
index d504d506a70f1..0f6cf35c831f7 100644
--- a/include/linux/user_namespace.h
+++ b/include/linux/user_namespace.h
@@ -13,6 +13,7 @@
 #include <linux/sysctl.h>
 #include <linux/err.h>
 #include <linux/kabi.h>
+#include <linux/percpu_counter.h>
 
 #define UID_GID_MAP_MAX_BASE_EXTENTS 5
 #define UID_GID_MAP_MAX_EXTENTS 340
@@ -121,7 +122,7 @@ struct ucounts {
 	rcuref_t count;
 	atomic_long_t freed;
 	atomic_long_t ucount[UCOUNT_COUNTS];
-	atomic_long_t rlimit[UCOUNT_RLIMIT_COUNTS];
+	struct percpu_counter rlimit[UCOUNT_RLIMIT_COUNTS];
 };
 
 extern struct user_namespace init_user_ns;
@@ -133,6 +134,7 @@ struct ucounts *inc_ucount(struct user_namespace *ns, kuid_t uid, enum ucount_ty
 void dec_ucount(struct ucounts *ucounts, enum ucount_type type);
 struct ucounts *alloc_ucounts(struct user_namespace *ns, kuid_t uid);
 void put_ucounts(struct ucounts *ucounts);
+void __init ucounts_init(void);
 
 static inline struct ucounts * __must_check get_ucounts(struct ucounts *ucounts)
 {
@@ -143,13 +145,17 @@ static inline struct ucounts * __must_check get_ucounts(struct ucounts *ucounts)
 
 static inline long get_rlimit_value(struct ucounts *ucounts, enum rlimit_type type)
 {
-	return atomic_long_read(&ucounts->rlimit[type]);
+	return percpu_counter_sum(&ucounts->rlimit[type]);
 }
 
-long inc_rlimit_ucounts(struct ucounts *ucounts, enum rlimit_type type, long v);
-bool dec_rlimit_ucounts(struct ucounts *ucounts, enum rlimit_type type, long v);
+bool inc_rlimit_ucounts_limit(struct ucounts *ucounts, enum rlimit_type type, long v, long limit);
+static inline bool inc_rlimit_ucounts(struct ucounts *ucounts, enum rlimit_type type, long v)
+{
+	return inc_rlimit_ucounts_limit(ucounts, type, v, LONG_MAX);
+}
+void dec_rlimit_ucounts(struct ucounts *ucounts, enum rlimit_type type, long v);
 long inc_rlimit_get_ucounts(struct ucounts *ucounts, enum rlimit_type type,
-			    bool override_rlimit);
+			    bool override_rlimit, long limit);
 void dec_rlimit_put_ucounts(struct ucounts *ucounts, enum rlimit_type type);
 bool is_rlimit_overlimit(struct ucounts *ucounts, enum rlimit_type type, unsigned long max);
diff --git a/init/main.c b/init/main.c
index 8fdfa69dba0fa..02a2c5d9be671 100644
--- a/init/main.c
+++ b/init/main.c
@@ -1050,6 +1050,7 @@ void start_kernel(void)
 	efi_enter_virtual_mode();
 #endif
 	thread_stack_cache_init();
+	ucounts_init();
 	cred_init();
 	fork_init();
 	proc_caches_init();
diff --git a/ipc/mqueue.c b/ipc/mqueue.c
index ba8215ed663a4..a910c93bea08a 100644
--- a/ipc/mqueue.c
+++ b/ipc/mqueue.c
@@ -371,11 +371,9 @@ static struct inode *mqueue_get_inode(struct super_block *sb,
 		mq_bytes += mq_treesize;
 		info->ucounts = get_ucounts(current_ucounts());
 		if (info->ucounts) {
-			long msgqueue;
-
 			spin_lock(&mq_lock);
-			msgqueue = inc_rlimit_ucounts(info->ucounts, UCOUNT_RLIMIT_MSGQUEUE, mq_bytes);
-			if (msgqueue == LONG_MAX || msgqueue > rlimit(RLIMIT_MSGQUEUE)) {
+			if (!inc_rlimit_ucounts_limit(info->ucounts, UCOUNT_RLIMIT_MSGQUEUE,
+						      mq_bytes, rlimit(RLIMIT_MSGQUEUE))) {
 				dec_rlimit_ucounts(info->ucounts, UCOUNT_RLIMIT_MSGQUEUE, mq_bytes);
 				spin_unlock(&mq_lock);
 				put_ucounts(info->ucounts);
diff --git a/kernel/signal.c b/kernel/signal.c
index c73873d67a63f..c75ef0e3f5264 100644
--- a/kernel/signal.c
+++ b/kernel/signal.c
@@ -429,17 +429,14 @@ __sigqueue_alloc(int sig, struct task_struct *t, gfp_t gfp_flags,
 	rcu_read_lock();
 	ucounts = task_ucounts(t);
 	sigpending = inc_rlimit_get_ucounts(ucounts, UCOUNT_RLIMIT_SIGPENDING,
-					    override_rlimit);
+					    override_rlimit, task_rlimit(t, RLIMIT_SIGPENDING));
 	rcu_read_unlock();
-	if (!sigpending)
-		return NULL;
-
-	if (override_rlimit || likely(sigpending <= task_rlimit(t, RLIMIT_SIGPENDING))) {
-		q = kmem_cache_alloc(sigqueue_cachep, gfp_flags);
-	} else {
+	if (!sigpending) {
 		print_dropped_signal(sig);
+		return NULL;
 	}
 
+	q = kmem_cache_alloc(sigqueue_cachep, gfp_flags);
 	if (unlikely(q == NULL)) {
 		dec_rlimit_put_ucounts(ucounts, UCOUNT_RLIMIT_SIGPENDING);
 	} else {
diff --git a/kernel/ucount.c b/kernel/ucount.c
index 1e300184f5edb..bdaa6261b7cae 100644
--- a/kernel/ucount.c
+++ b/kernel/ucount.c
@@ -175,11 +175,17 @@ struct ucounts *alloc_ucounts(struct user_namespace *ns, kuid_t uid)
 	new->ns = ns;
 	new->uid = uid;
 	rcuref_init(&new->count, 1);
+	if (percpu_counter_init_many(&new->rlimit[0], 0, GFP_KERNEL_ACCOUNT,
+				     UCOUNT_RLIMIT_COUNTS)) {
+		kfree(new);
+		return NULL;
+	}
 
 	spin_lock_irq(&ucounts_lock);
 	ucounts = find_ucounts(ns, uid, hashent);
 	if (ucounts) {
 		spin_unlock_irq(&ucounts_lock);
+		percpu_counter_destroy_many(&new->rlimit[0], UCOUNT_RLIMIT_COUNTS);
 		kfree(new);
 		return ucounts;
 	}
@@ -202,7 +208,7 @@ static bool rlimits_are_zero(struct ucounts *ucounts)
 
 	for (int i = 0; i < sizeof(rtypes) / sizeof(int); ++i) {
 		rtype = rtypes[i];
-		if (atomic_long_read(&ucounts->rlimit[rtype]) > 0)
+		if (get_rlimit_value(ucounts, rtype) > 0)
 			return false;
 	}
 	return true;
@@ -230,7 +236,7 @@ static void free_ucounts(struct ucounts *ucounts)
 	spin_lock_irqsave(&ucounts_lock, flags);
 	hlist_nulls_del_rcu(&ucounts->node);
 	spin_unlock_irqrestore(&ucounts_lock, flags);
-
+	percpu_counter_destroy_many(&ucounts->rlimit[0], UCOUNT_RLIMIT_COUNTS);
 	put_user_ns(ucounts->ns);
 	kfree_rcu(ucounts, rcu);
 }
@@ -294,36 +300,35 @@ void dec_ucount(struct ucounts *ucounts, enum ucount_type type)
 	put_ucounts(ucounts);
 }
 
-long inc_rlimit_ucounts(struct ucounts *ucounts, enum rlimit_type type, long v)
+bool inc_rlimit_ucounts_limit(struct ucounts *ucounts, enum rlimit_type type,
+			      long v, long limit)
 {
 	struct ucounts *iter;
 	long max = LONG_MAX;
-	long ret = 0;
+	bool good = true;
 
 	for (iter = ucounts; iter; iter = iter->ns->ucounts) {
-		long new = atomic_long_add_return(v, &iter->rlimit[type]);
-		if (new < 0 || new > max)
-			ret = LONG_MAX;
-		else if (iter == ucounts)
-			ret = new;
+		max = min(limit, max);
+		if (!percpu_counter_limited_add(&iter->rlimit[type], max, v))
+			good = false;
+
 		max = get_userns_rlimit_max(iter->ns, type);
 	}
-	return ret;
+	return good;
 }
 
-bool dec_rlimit_ucounts(struct ucounts *ucounts, enum rlimit_type type, long v)
+void dec_rlimit_ucounts(struct ucounts *ucounts, enum rlimit_type type, long v)
 {
 	struct ucounts *iter;
-	long new = -1; /* Silence compiler warning */
-	for (iter = ucounts; iter; iter = iter->ns->ucounts) {
-		long dec = atomic_long_sub_return(v, &iter->rlimit[type]);
-		WARN_ON_ONCE(dec < 0);
-		if (iter == ucounts)
-			new = dec;
-	}
-	return (new == 0);
+
+	for (iter = ucounts; iter; iter = iter->ns->ucounts)
+		percpu_counter_sub(&iter->rlimit[type], v);
 }
 
+/*
+ * inc_rlimit_get_ucounts() does not grab the refcount.
+ * The rlimit release should be called every time the rlimit is decremented.
+ */
 static void do_dec_rlimit_put_ucounts(struct ucounts *ucounts,
 				      struct ucounts *last, enum rlimit_type type)
 {
@@ -332,8 +337,7 @@ static void do_dec_rlimit_put_ucounts(struct ucounts *ucounts,
 		bool to_free;
 
 		rcu_read_lock();
-		long dec = atomic_long_sub_return(1, &iter->rlimit[type]);
-		WARN_ON_ONCE(dec < 0);
+		percpu_counter_sub(&iter->rlimit[type], 1);
 		next = iter->ns->ucounts;
 		to_free = ucounts_can_be_freed(iter);
 		rcu_read_unlock();
@@ -348,29 +352,37 @@ void dec_rlimit_put_ucounts(struct ucounts *ucounts, enum rlimit_type type)
 	do_dec_rlimit_put_ucounts(ucounts, NULL, type);
 }
 
+/*
+ * Though this function does not grab the refcount, it is promised that
+ * the ucounts will not be freed as long as any rlimit pins it.
+ * The caller must hold a reference to ucounts or be under rcu_read_lock().
+ *
+ * Return 1 if the increment succeeds, otherwise return 0.
+ */
 long inc_rlimit_get_ucounts(struct ucounts *ucounts, enum rlimit_type type,
-			    bool override_rlimit)
+			    bool override_rlimit, long limit)
 {
-	/* Caller must hold a reference to ucounts */
 	struct ucounts *iter;
 	long max = LONG_MAX;
-	long dec, ret = 0;
+	long ret = 0;
+	long in_limit = limit;
+
+	if (override_rlimit)
+		in_limit = LONG_MAX;
 
 	for (iter = ucounts; iter; iter = iter->ns->ucounts) {
-		long new = atomic_long_add_return(1, &iter->rlimit[type]);
-		if (new < 0 || new > max)
+		/* Cannot exceed the passed-in limit or the ns->rlimit_max */
+		max = min(in_limit, max);
+		if (!percpu_counter_limited_add(&iter->rlimit[type], max, 1))
 			goto dec_unwind;
-		if (iter == ucounts)
-			ret = new;
+
 		if (!override_rlimit)
 			max = get_userns_rlimit_max(iter->ns, type);
 	}
-	return ret;
+	return 1;
 dec_unwind:
-	dec = atomic_long_sub_return(1, &iter->rlimit[type]);
-	WARN_ON_ONCE(dec < 0);
 	do_dec_rlimit_put_ucounts(ucounts, iter, type);
-	return 0;
+	return ret;
 }
 
 bool is_rlimit_overlimit(struct ucounts *ucounts, enum rlimit_type type, unsigned long rlimit)
@@ -379,15 +391,24 @@ bool is_rlimit_overlimit(struct ucounts *ucounts, enum rlimit_type type, unsigne
 	long max = rlimit;
 	if (rlimit > LONG_MAX)
 		max = LONG_MAX;
+
 	for (iter = ucounts; iter; iter = iter->ns->ucounts) {
-		long val = get_rlimit_value(iter, type);
-		if (val < 0 || val > max)
+		/* iter->rlimit[type] > max returns 1 */
+		if (percpu_counter_compare(&iter->rlimit[type], max) > 0)
 			return true;
+
 		max = get_userns_rlimit_max(iter->ns, type);
 	}
 	return false;
 }
 
+void __init ucounts_init(void)
+{
+	if (percpu_counter_init_many(&init_ucounts.rlimit[0], 0, GFP_KERNEL,
+				     UCOUNT_RLIMIT_COUNTS))
+		panic("Cannot create init_ucounts rlimit counters");
+}
+
 static __init int user_namespace_sysctl_init(void)
 {
 #ifdef CONFIG_SYSCTL
diff --git a/mm/mlock.c b/mm/mlock.c
index cd0997d89c7c5..65e5c40c26795 100644
--- a/mm/mlock.c
+++ b/mm/mlock.c
@@ -825,7 +825,6 @@ static DEFINE_SPINLOCK(shmlock_user_lock);
 int user_shm_lock(size_t size, struct ucounts *ucounts)
 {
 	unsigned long lock_limit, locked;
-	long memlock;
 	int allowed = 0;
 
 	locked = (size + PAGE_SIZE - 1) >> PAGE_SHIFT;
@@ -833,9 +832,9 @@ int user_shm_lock(size_t size, struct ucounts *ucounts)
 	if (lock_limit != RLIM_INFINITY)
 		lock_limit >>= PAGE_SHIFT;
 	spin_lock(&shmlock_user_lock);
-	memlock = inc_rlimit_ucounts(ucounts, UCOUNT_RLIMIT_MEMLOCK, locked);
 
-	if ((memlock == LONG_MAX || memlock > lock_limit) && !capable(CAP_IPC_LOCK)) {
+	if (!inc_rlimit_ucounts_limit(ucounts, UCOUNT_RLIMIT_MEMLOCK, locked, lock_limit)
+	    && !capable(CAP_IPC_LOCK)) {
 		dec_rlimit_ucounts(ucounts, UCOUNT_RLIMIT_MEMLOCK, locked);
 		goto out;
 	}
-- 
2.34.1
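
The tradeoff behind the conversion: percpu_counter updates touch only a
CPU-local counter (no shared cacheline), while an exact read must sum
every CPU's delta. percpu_counter_limited_add() pays the summing cost
only when the running total gets close to the limit, which keeps the
rlimit check cheap in the common case. A minimal usage sketch (the
counter name and the limit of 1024 are illustrative):

    #include <linux/percpu_counter.h>

    static struct percpu_counter pending;

    static int example_init(void)
    {
            return percpu_counter_init(&pending, 0, GFP_KERNEL);
    }

    static bool example_charge(void)
    {
            /* Adds 1 only if the total stays within the limit of 1024;
             * cheap unless the counter is near the limit. */
            return percpu_counter_limited_add(&pending, 1024, 1);
    }

    static void example_uncharge(void)
    {
            percpu_counter_sub(&pending, 1);
    }

This is also why the previous patch relaxed the free condition: the
expensive percpu_counter_sum() in rlimits_are_zero() now only runs on
the comparatively rare free path, not on every signal.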

hulk inclusion
category: bugfix
bugzilla: https://gitee.com/openeuler/kernel/issues/IC97W5

----------------------------------------------------------------------

Fix the kABI breakage caused by the previous patches.

Fixes: 1280510fda96 ("ucount: use RCU for ucounts lookups")
Fixes: a4c834ba39c9 ("ucount: use rcuref_t for reference counting")
Fixes: d73a45212afd ("ucounts: turn the atomic rlimit to percpu_counter")
Signed-off-by: Chen Ridong <chenridong@huawei.com>
---
 include/linux/user_namespace.h | 12 ++++++++++++
 1 file changed, 12 insertions(+)

diff --git a/include/linux/user_namespace.h b/include/linux/user_namespace.h
index 0f6cf35c831f7..ac928e6e70427 100644
--- a/include/linux/user_namespace.h
+++ b/include/linux/user_namespace.h
@@ -115,14 +115,26 @@ struct user_namespace {
 } __randomize_layout;
 
 struct ucounts {
+#ifdef __GENKSYMS__
+	struct hlist_node node;
+#else
 	struct hlist_nulls_node node;
+#endif
 	struct user_namespace *ns;
 	kuid_t uid;
+#ifdef __GENKSYMS__
+	atomic_t count;
+#else
 	struct rcu_head rcu;
 	rcuref_t count;
 	atomic_long_t freed;
+#endif
 	atomic_long_t ucount[UCOUNT_COUNTS];
+#ifdef __GENKSYMS__
+	atomic_long_t rlimit[UCOUNT_RLIMIT_COUNTS];
+#else
 	struct percpu_counter rlimit[UCOUNT_RLIMIT_COUNTS];
+#endif
 };
 
 extern struct user_namespace init_user_ns;
-- 
2.34.1
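
__GENKSYMS__ is defined only while the genksyms tool computes the symbol
CRCs that make up the kABI signature, so the checker sees the old
structure layout while real builds compile the new one. Schematically
(an illustrative struct, not the full patch):

    struct example {
    #ifdef __GENKSYMS__
            atomic_t count;   /* layout the kABI CRCs were computed over */
    #else
            rcuref_t count;   /* what the kernel actually builds */
    #endif
    };

Note that this keeps the CRCs stable but does not by itself guarantee
binary compatibility; that holds only as long as out-of-tree modules do
not depend on the size of struct ucounts or on the offsets of the
members that changed.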

FeedBack: The patch(es) which you have sent to kernel@openeuler.org
mailing list has been converted to a pull request successfully!
Pull request link: https://gitee.com/openeuler/kernel/pulls/16430
Mailing list address:
https://mailweb.openeuler.org/archives/list/kernel@openeuler.org/message/USZ...