From: Vlastimil Babka <vbabka@suse.cz>
mainline inclusion
from mainline-v5.12-rc1
commit 7e1fa93deff44677a94dfc323ff629bbf5cf9360
category: bugfix
bugzilla: 175589
CVE: NA
-------------------------------------------------

Since commit 03afc0e25f7f ("slab: get_online_mems for kmem_cache_{create,destroy,shrink}") we are taking the memory hotplug lock for SLAB and SLUB when creating, destroying or shrinking a cache. It is quite a heavy lock and it's best to avoid it if possible, as we had several issues with lockdep complaining about ordering in the past, see e.g. e4f8e513c3d3 ("mm/slub: fix a deadlock in show_slab_objects()").
The problem scenario in 03afc0e25f7f (solved by the memory hotplug lock) can be summarized as follows: while slab_mutex synchronizes new kmem cache creation with SLUB's MEM_GOING_ONLINE callback slab_mem_going_online_callback(), we may still miss the creation of a kmem_cache_node for the hotplugged node in a new kmem cache. The hotplug callback doesn't yet see the new cache, and cache creation in init_kmem_cache_nodes() only initializes kmem_cache_node for nodes in the N_NORMAL_MEMORY nodemask, which may not yet include the new node, as that only happens later, after the MEM_GOING_ONLINE callback.
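To make the window concrete, SLUB's per-node setup at cache creation looks roughly like the simplified sketch below (error handling and the early-boot slab_state == DOWN path are trimmed); a node that enters N_NORMAL_MEMORY only after this loop has run ends up with no kmem_cache_node in the freshly created cache:

  static int init_kmem_cache_nodes(struct kmem_cache *s)
  {
          int node;

          /* A node going online in parallel may not be in the mask yet. */
          for_each_node_state(node, N_NORMAL_MEMORY) {
                  struct kmem_cache_node *n;

                  n = kmem_cache_alloc_node(kmem_cache_node, GFP_KERNEL, node);
                  if (!n)
                          return 0;       /* callers treat 0 as failure */
                  init_kmem_cache_node(n);
                  s->node[node] = n;
          }
          return 1;
  }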
Instead of using get/put_online_mems(), the problem can be solved by SLUB maintaining its own nodemask of nodes for which it has allocated the per-node kmem_cache_node structures. This nodemask would generally mirror the N_NORMAL_MEMORY nodemask, but would be updated only under SLUB's control, in its memory hotplug callbacks, with slab_mutex held. This patch adds such a nodemask and its handling.
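A minimal sketch of the idea, with the per-cache allocation details trimmed (the real hunks are in the diff below): the mask is only ever changed by SLUB's own hotplug callbacks, under slab_mutex, and only once the per-node structures actually exist:

  /* Nodes for which SLUB has allocated kmem_cache_node structures. */
  static nodemask_t slab_nodes;           /* protected by slab_mutex */

  static int slab_mem_going_online_callback(void *arg)
  {
          struct memory_notify *marg = arg;
          int nid = marg->status_change_nid_normal;
          struct kmem_cache *s;

          if (nid < 0)
                  return 0;

          mutex_lock(&slab_mutex);
          list_for_each_entry(s, &slab_caches, list) {
                  /* allocate and init s->node[nid] here (trimmed) */
          }
          /* Caches created from now on cover the new node themselves. */
          node_set(nid, slab_nodes);
          mutex_unlock(&slab_mutex);
          return 0;
  }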
Commit 03afc0e25f7f mentions "issues like [the one above]", but there don't appear to be any further such issues. All the paths (shared by SLAB and SLUB) that take the memory hotplug lock also take slab_mutex, except kmem_cache_shrink(), where 03afc0e25f7f replaced slab_mutex with get/put_online_mems().
We however cannot simply restore slab_mutex in kmem_cache_shrink(), as SLUB can enter the function from a write to the sysfs 'shrink' file, thus holding the kernfs lock, and in kmem_cache_create() the kernfs lock is nested within slab_mutex. But on closer inspection we don't actually need to protect kmem_cache_shrink() from hotplug callbacks: while SLUB's __kmem_cache_shrink() does for_each_kmem_cache_node(), missing a new node added by a parallel hotplug is not fatal, and a parallel hotremove no longer frees kmem_cache_node structures after the previous patch, so a use-after-free cannot happen. The per-node shrinking itself is protected by n->list_lock. The same is true for SLAB, and SLOB is a no-op.
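For reference, SLUB's per-node walk is roughly the following simplified sketch (the real __kmem_cache_shrink() also sorts the partial list by object count and frees empty slabs after dropping the lock; that part is trimmed):

  int __kmem_cache_shrink(struct kmem_cache *s)
  {
          struct kmem_cache_node *n;
          unsigned long flags;
          int node;

          flush_all(s);
          for_each_kmem_cache_node(s, node, n) {
                  /* A node hotplugged in parallel is simply not visited. */
                  spin_lock_irqsave(&n->list_lock, flags);
                  /* Rearrange/discard slabs on n->partial (trimmed). */
                  spin_unlock_irqrestore(&n->list_lock, flags);
          }
          return 0;
  }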
SLAB also doesn't need the memory hotplug locking, which it only gained through 03afc0e25f7f via the shared paths in slab_common.c. Its memory hotplug callbacks are likewise protected by slab_mutex against races with these paths. The problem of SLUB relying on N_NORMAL_MEMORY doesn't apply to SLAB, as its setup_kmem_cache_nodes() relies on N_ONLINE, and the new node is already set there during the MEM_GOING_ONLINE callback, so no special care is needed for SLAB.
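That is, roughly (a simplified sketch of the SLAB side; the actual per-node allocation helper is omitted):

  static int setup_kmem_cache_nodes(struct kmem_cache *cachep, gfp_t gfp)
  {
          int node;

          /*
           * Iterates N_ONLINE, and the new node is already set in N_ONLINE
           * during the MEM_GOING_ONLINE callback, so a cache created
           * concurrently cannot miss it.
           */
          for_each_online_node(node) {
                  /* allocate/update the per-node structures for 'node' (trimmed) */
          }
          return 0;
  }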
As such, this patch removes all get/put_online_mems() usage by the slab subsystem.
Link: https://lkml.kernel.org/r/20210113131634.3671-3-vbabka@suse.cz
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Christoph Lameter <cl@linux.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: Qian Cai <cai@redhat.com>
Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Chengyang Fan <cy.fan@huawei.com>
Reviewed-by: Kefeng Wang <wangkefeng.wang@huawei.com>
Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
---
 mm/slab_common.c | 10 ++--------
 mm/slub.c        | 28 +++++++++++++++++++++++++---
 2 files changed, 27 insertions(+), 11 deletions(-)
diff --git a/mm/slab_common.c b/mm/slab_common.c
index acc743315bb5c..5e3dc1e9eaf09 100644
--- a/mm/slab_common.c
+++ b/mm/slab_common.c
@@ -460,7 +460,6 @@ kmem_cache_create_usercopy(const char *name,
         int err;

         get_online_cpus();
-        get_online_mems();
         memcg_get_cache_ids();

         mutex_lock(&slab_mutex);
@@ -512,7 +511,6 @@ kmem_cache_create_usercopy(const char *name,
         mutex_unlock(&slab_mutex);

         memcg_put_cache_ids();
-        put_online_mems();
         put_online_cpus();

         if (err) {
@@ -917,7 +915,6 @@ void kmem_cache_destroy(struct kmem_cache *s)
                 return;

         get_online_cpus();
-        get_online_mems();

         mutex_lock(&slab_mutex);

@@ -930,13 +927,11 @@ void kmem_cache_destroy(struct kmem_cache *s)

         mutex_unlock(&slab_mutex);

-        put_online_mems();
         put_online_cpus();

         flush_memcg_workqueue(s);

         get_online_cpus();
-        get_online_mems();

         mutex_lock(&slab_mutex);

@@ -963,7 +958,6 @@ void kmem_cache_destroy(struct kmem_cache *s)
 out_unlock:
         mutex_unlock(&slab_mutex);

-        put_online_mems();
         put_online_cpus();
 }
 EXPORT_SYMBOL(kmem_cache_destroy);
@@ -980,10 +974,10 @@ int kmem_cache_shrink(struct kmem_cache *cachep)
         int ret;

         get_online_cpus();
-        get_online_mems();
+
         kasan_cache_shrink(cachep);
         ret = __kmem_cache_shrink(cachep);
-        put_online_mems();
+
         put_online_cpus();
         return ret;
 }
diff --git a/mm/slub.c b/mm/slub.c
index 983392cdccef9..3e698e306c345 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -237,6 +237,14 @@ static inline void stat(const struct kmem_cache *s, enum stat_item si)
 #endif
 }

+/*
+ * Tracks for which NUMA nodes we have kmem_cache_nodes allocated.
+ * Corresponds to node_state[N_NORMAL_MEMORY], but can temporarily
+ * differ during memory hotplug/hotremove operations.
+ * Protected by slab_mutex.
+ */
+static nodemask_t slab_nodes;
+
 /********************************************************************
  *                      Core slab cache functions
  *******************************************************************/
@@ -2525,7 +2533,7 @@ static void *___slab_alloc(struct kmem_cache *s, gfp_t gfpflags, int node,
                  * ignore the node constraint
                  */
                 if (unlikely(node != NUMA_NO_NODE &&
-                             !node_state(node, N_NORMAL_MEMORY)))
+                             !node_isset(node, slab_nodes)))
                         node = NUMA_NO_NODE;
                 goto new_slab;
         }
@@ -2536,7 +2544,7 @@ static void *___slab_alloc(struct kmem_cache *s, gfp_t gfpflags, int node,
                  * same as above but node_match() being false already
                  * implies node != NUMA_NO_NODE
                  */
-                if (!node_state(node, N_NORMAL_MEMORY)) {
+                if (!node_isset(node, slab_nodes)) {
                         node = NUMA_NO_NODE;
                         goto redo;
                 } else {
@@ -3404,7 +3412,7 @@ static int init_kmem_cache_nodes(struct kmem_cache *s)
 {
         int node;

-        for_each_node_state(node, N_NORMAL_MEMORY) {
+        for_each_node_mask(node, slab_nodes) {
                 struct kmem_cache_node *n;

                 if (slab_state == DOWN) {
@@ -4072,6 +4080,7 @@ static void slab_mem_offline_callback(void *arg)
                 return;

         mutex_lock(&slab_mutex);
+        node_clear(offline_node, slab_nodes);
         /*
          * We no longer free kmem_cache_node structures here, as it would be
          * racy with all get_node() users, and infeasible to protect them with
@@ -4121,6 +4130,11 @@ static int slab_mem_going_online_callback(void *arg)
                 init_kmem_cache_node(n);
                 s->node[nid] = n;
         }
+        /*
+         * Any cache created after this point will also have kmem_cache_node
+         * initialized for the new node.
+         */
+        node_set(nid, slab_nodes);
 out:
         mutex_unlock(&slab_mutex);
         return ret;
@@ -4203,6 +4217,7 @@ void __init kmem_cache_init(void)
 {
         static __initdata struct kmem_cache boot_kmem_cache,
                 boot_kmem_cache_node;
+        int node;

         if (debug_guardpage_minorder())
                 slub_max_order = 0;
@@ -4210,6 +4225,13 @@ void __init kmem_cache_init(void)
         kmem_cache_node = &boot_kmem_cache_node;
         kmem_cache = &boot_kmem_cache;

+        /*
+         * Initialize the nodemask for which we will allocate per node
+         * structures. Here we don't need taking slab_mutex yet.
+         */
+        for_each_node_state(node, N_NORMAL_MEMORY)
+                node_set(node, slab_nodes);
+
         create_boot_cache(kmem_cache_node, "kmem_cache_node",
                 sizeof(struct kmem_cache_node), SLAB_HWCACHE_ALIGN, 0, 0);