From: Ma Wupeng mawupeng1@huawei.com
handle uninitialized numa nodes gracefully.
Haifeng Xu (1):
  mm/memcontrol: do not tweak node in mem_cgroup_init()

Michal Hocko (5):
  mm, memory_hotplug: make arch_alloc_nodedata independent on CONFIG_MEMORY_HOTPLUG
  mm: handle uninitialized numa nodes gracefully
  mm, memory_hotplug: drop arch_free_nodedata
  mm, memory_hotplug: reorganize new pgdat initialization
  mm: make free_area_init_node aware of memory less nodes

Oscar Salvador (1):
  arch/x86/mm/numa: Do not initialize nodes twice

Srikar Dronamraju (1):
  powerpc/numa: Handle partially initialized numa nodes

Wei Yang (1):
  memcg: do not tweak node in alloc_mem_cgroup_per_node_info

 arch/ia64/mm/discontig.c       |  11 +--
 arch/powerpc/mm/numa.c         |   2 +-
 arch/x86/mm/numa.c             |  33 +++++----
 include/linux/memory_hotplug.h | 118 ++++++++++++++++-----------------
 include/linux/mm.h             |   1 -
 mm/internal.h                  |   2 +
 mm/memcontrol.c                |  17 +----
 mm/memory_hotplug.c            |  55 +++------------
 mm/page_alloc.c                |  78 +++++++++++++++++++---
 9 files changed, 163 insertions(+), 154 deletions(-)
From: Michal Hocko mhocko@suse.com
mainline inclusion
from mainline-v5.18-rc1
commit e930d999715073a70d306fb59a394ea8b84d0b45
category: bugfix
bugzilla: https://gitee.com/openeuler/kernel/issues/I8EQ3K
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i...
--------------------------------
Patch series "mm, memory_hotplug: handle uninitialized numa node gracefully".
The core of the fix is patch 2 which also links existing bug reports. The high level goal is to have all possible numa nodes have their pgdat allocated and initialized so
	for_each_possible_node(nid)
		NODE_DATA(nid)
will never return garbage. This has proven to be a problem in several places where an offline numa node is used for an allocation, only to find that node_data and therefore the allocation fallback zonelists are not initialized, and such an allocation request blows up.
There were attempts to address that by checking node_online in several places, including the page allocator. This patchset approaches the problem from a different perspective: instead of special casing, which just adds a runtime overhead, it allocates pglist_data for each possible node. This can add some memory overhead for platforms with a high number of possible nodes if they do not contain any memory. That should be a rather rare configuration though.
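To make the intended invariant concrete, here is a minimal, hypothetical illustration (not part of the series; the helper name is made up) of the kind of node-affine allocation that used to blow up on a possible-but-uninitialized node and becomes safe once every possible node carries a pgdat and zonelists:

	/*
	 * Hypothetical sketch: walk all *possible* nodes and perform a
	 * node-affine allocation on each.  Without a preallocated pgdat a
	 * possible-but-offline nid has no zonelists and such a request can
	 * crash; with the series applied it simply falls back to a node
	 * that does have memory.
	 */
	static void touch_all_possible_nodes(void)
	{
		int nid;

		for_each_node(nid) {
			struct page *page = alloc_pages_node(nid, GFP_KERNEL, 0);

			if (page)
				__free_pages(page, 0);
		}
	}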
How to test this? David has provided an excellent howto: http://lkml.kernel.org/r/6e5ebc19-890c-b6dd-1924-9f25c441010d@redhat.com
Patches 1 and 3-6 are mostly cleanups. The patchset has been reviewed by Rafael (thanks!) and the core fix tested by Rafael and Alexey (thanks to both). David has tested as per instructions above and hasn't found any fallouts in the memory hotplug scenarios.
This patch (of 6):
This is a preparatory patch and it doesn't introduce any functional change. It merely pulls out arch_alloc_nodedata (and co) outside of CONFIG_MEMORY_HOTPLUG because the following patch will need to call this from the generic MM code.
Link: https://lkml.kernel.org/r/20220127085305.20890-1-mhocko@kernel.org Link: https://lkml.kernel.org/r/20220127085305.20890-2-mhocko@kernel.org Signed-off-by: Michal Hocko mhocko@suse.com Acked-by: Rafael Aquini raquini@redhat.com Acked-by: David Hildenbrand david@redhat.com Acked-by: Mike Rapoport rppt@linux.ibm.com Reviewed-by: Oscar Salvador osalvador@suse.de Reviewed-by: Wei Yang richard.weiyang@gmail.com Cc: Alexey Makhalov amakhalov@vmware.com Cc: Christoph Lameter cl@linux.com Cc: Dennis Zhou dennis@kernel.org Cc: Eric Dumazet eric.dumazet@gmail.com Cc: Nico Pache npache@redhat.com Cc: Tejun Heo tj@kernel.org Signed-off-by: Andrew Morton akpm@linux-foundation.org Signed-off-by: Linus Torvalds torvalds@linux-foundation.org Signed-off-by: Ma Wupeng mawupeng1@huawei.com --- arch/ia64/mm/discontig.c | 2 - include/linux/memory_hotplug.h | 119 ++++++++++++++++----------------- 2 files changed, 59 insertions(+), 62 deletions(-)
diff --git a/arch/ia64/mm/discontig.c b/arch/ia64/mm/discontig.c index 4d0813419013..1975f599d914 100644 --- a/arch/ia64/mm/discontig.c +++ b/arch/ia64/mm/discontig.c @@ -630,7 +630,6 @@ void __init paging_init(void) zero_page_memmap_ptr = virt_to_page(ia64_imva(empty_zero_page)); }
-#ifdef CONFIG_MEMORY_HOTPLUG pg_data_t *arch_alloc_nodedata(int nid) { unsigned long size = compute_pernodesize(nid); @@ -648,7 +647,6 @@ void arch_refresh_nodedata(int update_node, pg_data_t *update_pgdat) pgdat_list[update_node] = update_pgdat; scatter_node_data(); } -#endif
#ifdef CONFIG_SPARSEMEM_VMEMMAP int __meminit vmemmap_populate(unsigned long start, unsigned long end, int node, diff --git a/include/linux/memory_hotplug.h b/include/linux/memory_hotplug.h index 3debe1f7a6c1..a026220742cc 100644 --- a/include/linux/memory_hotplug.h +++ b/include/linux/memory_hotplug.h @@ -15,6 +15,65 @@ struct memory_block; struct resource; struct vmem_altmap;
+#ifdef CONFIG_HAVE_ARCH_NODEDATA_EXTENSION +/* + * For supporting node-hotadd, we have to allocate a new pgdat. + * + * If an arch has generic style NODE_DATA(), + * node_data[nid] = kzalloc() works well. But it depends on the architecture. + * + * In general, generic_alloc_nodedata() is used. + * Now, arch_free_nodedata() is just defined for error path of node_hot_add. + * + */ +extern pg_data_t *arch_alloc_nodedata(int nid); +extern void arch_free_nodedata(pg_data_t *pgdat); +extern void arch_refresh_nodedata(int nid, pg_data_t *pgdat); + +#else /* CONFIG_HAVE_ARCH_NODEDATA_EXTENSION */ + +#define arch_alloc_nodedata(nid) generic_alloc_nodedata(nid) +#define arch_free_nodedata(pgdat) generic_free_nodedata(pgdat) + +#ifdef CONFIG_NUMA +/* + * XXX: node aware allocation can't work well to get new node's memory at this time. + * Because, pgdat for the new node is not allocated/initialized yet itself. + * To use new node's memory, more consideration will be necessary. + */ +#define generic_alloc_nodedata(nid) \ +({ \ + kzalloc(sizeof(pg_data_t), GFP_KERNEL); \ +}) +/* + * This definition is just for error path in node hotadd. + * For node hotremove, we have to replace this. + */ +#define generic_free_nodedata(pgdat) kfree(pgdat) + +extern pg_data_t *node_data[]; +static inline void arch_refresh_nodedata(int nid, pg_data_t *pgdat) +{ + node_data[nid] = pgdat; +} + +#else /* !CONFIG_NUMA */ + +/* never called */ +static inline pg_data_t *generic_alloc_nodedata(int nid) +{ + BUG(); + return NULL; +} +static inline void generic_free_nodedata(pg_data_t *pgdat) +{ +} +static inline void arch_refresh_nodedata(int nid, pg_data_t *pgdat) +{ +} +#endif /* CONFIG_NUMA */ +#endif /* CONFIG_HAVE_ARCH_NODEDATA_EXTENSION */ + #ifdef CONFIG_MEMORY_HOTPLUG /* * Return page for the valid pfn only if the page is online. All pfn @@ -150,66 +209,6 @@ int add_pages(int nid, unsigned long start_pfn, unsigned long nr_pages, struct mhp_params *params); #endif /* ARCH_HAS_ADD_PAGES */
-#ifdef CONFIG_HAVE_ARCH_NODEDATA_EXTENSION -/* - * For supporting node-hotadd, we have to allocate a new pgdat. - * - * If an arch has generic style NODE_DATA(), - * node_data[nid] = kzalloc() works well. But it depends on the architecture. - * - * In general, generic_alloc_nodedata() is used. - * Now, arch_free_nodedata() is just defined for error path of node_hot_add. - * - */ -extern pg_data_t *arch_alloc_nodedata(int nid); -extern void arch_free_nodedata(pg_data_t *pgdat); -extern void arch_refresh_nodedata(int nid, pg_data_t *pgdat); - -#else /* CONFIG_HAVE_ARCH_NODEDATA_EXTENSION */ - -#define arch_alloc_nodedata(nid) generic_alloc_nodedata(nid) -#define arch_free_nodedata(pgdat) generic_free_nodedata(pgdat) - -#ifdef CONFIG_NUMA -/* - * If ARCH_HAS_NODEDATA_EXTENSION=n, this func is used to allocate pgdat. - * XXX: kmalloc_node() can't work well to get new node's memory at this time. - * Because, pgdat for the new node is not allocated/initialized yet itself. - * To use new node's memory, more consideration will be necessary. - */ -#define generic_alloc_nodedata(nid) \ -({ \ - kzalloc(sizeof(pg_data_t), GFP_KERNEL); \ -}) -/* - * This definition is just for error path in node hotadd. - * For node hotremove, we have to replace this. - */ -#define generic_free_nodedata(pgdat) kfree(pgdat) - -extern pg_data_t *node_data[]; -static inline void arch_refresh_nodedata(int nid, pg_data_t *pgdat) -{ - node_data[nid] = pgdat; -} - -#else /* !CONFIG_NUMA */ - -/* never called */ -static inline pg_data_t *generic_alloc_nodedata(int nid) -{ - BUG(); - return NULL; -} -static inline void generic_free_nodedata(pg_data_t *pgdat) -{ -} -static inline void arch_refresh_nodedata(int nid, pg_data_t *pgdat) -{ -} -#endif /* CONFIG_NUMA */ -#endif /* CONFIG_HAVE_ARCH_NODEDATA_EXTENSION */ - void get_online_mems(void); void put_online_mems(void);
From: Michal Hocko mhocko@suse.com
mainline inclusion
from mainline-v5.18-rc1
commit 09f49dca570a917a8c6bccd7e8c61f5141534e3a
category: bugfix
bugzilla: https://gitee.com/openeuler/kernel/issues/I8EQ3K
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i...
--------------------------------
We have had several reports [1][2][3] that the page allocator blows up when an allocation from a possible node is requested. The underlying reason is that NODE_DATA for that specific node is not allocated.
NUMA specific initialization is arch specific and it can vary a lot. E.g. x86 tries to initialize all nodes that have some cpu affinity (see init_cpu_to_node) but this can be insufficient because the node might be cpuless for example.
One way to address this problem would be to check for !node_online nodes when trying to get a zonelist and silently fall back to another node. That is unfortunately adding a branch into allocator hot path and it doesn't handle any other potential NODE_DATA users.
This patch takes a different approach (following the lead of [3]) and preallocates pgdat for all possible nodes in arch-independent code - free_area_init. All uninitialized nodes are treated as memoryless nodes. node_state of the node is not changed because that would lead to other side effects - e.g. sysfs representation of such a node - and from past discussions [4] it is known that some tools might have problems digesting that.
Newly allocated pgdat only gets a minimal initialization and the rest of the work is expected to be done by the memory hotplug - hotadd_new_pgdat (renamed to hotadd_init_pgdat).
generic_alloc_nodedata is changed to use the memblock allocator because neither page nor slab allocators are available at the stage when all pgdats are allocated. Hotplug doesn't allocate pgdat anymore so we can use the early boot allocator. The only arch specific implementation is ia64 and that is changed to use the early allocator as well.
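As a rough illustration (hypothetical function name, assuming the early boot context where slab is not available yet), the memblock-based allocation the patch switches to looks conceptually like this:

	/*
	 * Hypothetical sketch: allocate a zeroed pgdat during early boot.
	 * kzalloc()/GFP_KERNEL cannot be used because free_area_init() runs
	 * before the slab allocator is up; memblock_alloc() returns zeroed
	 * memory, so it is a drop-in replacement for kzalloc() here.
	 */
	static pg_data_t * __init example_early_pgdat_alloc(void)
	{
		return memblock_alloc(sizeof(pg_data_t), SMP_CACHE_BYTES);
	}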
[1] http://lkml.kernel.org/r/20211101201312.11589-1-amakhalov@vmware.com [2] http://lkml.kernel.org/r/20211207224013.880775-1-npache@redhat.com [3] http://lkml.kernel.org/r/20190114082416.30939-1-mhocko@kernel.org [4] http://lkml.kernel.org/r/20200428093836.27190-1-srikar@linux.vnet.ibm.com
[akpm@linux-foundation.org: replace comment, per Mike]
Link: https://lkml.kernel.org/r/Yfe7RBeLCijnWBON@dhcp22.suse.cz Reported-by: Alexey Makhalov amakhalov@vmware.com Tested-by: Alexey Makhalov amakhalov@vmware.com Reported-by: Nico Pache npache@redhat.com Acked-by: Rafael Aquini raquini@redhat.com Tested-by: Rafael Aquini raquini@redhat.com Acked-by: David Hildenbrand david@redhat.com Reviewed-by: Oscar Salvador osalvador@suse.de Acked-by: Mike Rapoport rppt@linux.ibm.com Signed-off-by: Michal Hocko mhocko@suse.com Cc: Christoph Lameter cl@linux.com Cc: Dennis Zhou dennis@kernel.org Cc: Eric Dumazet eric.dumazet@gmail.com Cc: Tejun Heo tj@kernel.org Cc: Wei Yang richard.weiyang@gmail.com Signed-off-by: Andrew Morton akpm@linux-foundation.org Signed-off-by: Linus Torvalds torvalds@linux-foundation.org
Conflicts: mm/internal.h mm/page_alloc.c Signed-off-by: Ma Wupeng mawupeng1@huawei.com --- arch/ia64/mm/discontig.c | 4 ++-- include/linux/memory_hotplug.h | 2 +- mm/internal.h | 2 ++ mm/memory_hotplug.c | 21 ++++++++---------- mm/page_alloc.c | 40 ++++++++++++++++++++++++++++++---- 5 files changed, 50 insertions(+), 19 deletions(-)
diff --git a/arch/ia64/mm/discontig.c b/arch/ia64/mm/discontig.c index 1975f599d914..d72ad280c932 100644 --- a/arch/ia64/mm/discontig.c +++ b/arch/ia64/mm/discontig.c @@ -630,11 +630,11 @@ void __init paging_init(void) zero_page_memmap_ptr = virt_to_page(ia64_imva(empty_zero_page)); }
-pg_data_t *arch_alloc_nodedata(int nid) +pg_data_t * __init arch_alloc_nodedata(int nid) { unsigned long size = compute_pernodesize(nid);
- return kzalloc(size, GFP_KERNEL); + return memblock_alloc(size, SMP_CACHE_BYTES); }
void arch_free_nodedata(pg_data_t *pgdat) diff --git a/include/linux/memory_hotplug.h b/include/linux/memory_hotplug.h index a026220742cc..aaacbd41940d 100644 --- a/include/linux/memory_hotplug.h +++ b/include/linux/memory_hotplug.h @@ -43,7 +43,7 @@ extern void arch_refresh_nodedata(int nid, pg_data_t *pgdat); */ #define generic_alloc_nodedata(nid) \ ({ \ - kzalloc(sizeof(pg_data_t), GFP_KERNEL); \ + memblock_alloc(sizeof(*pgdat), SMP_CACHE_BYTES); \ }) /* * This definition is just for error path in node hotadd. diff --git a/mm/internal.h b/mm/internal.h index 13f78c5cd961..a9fc47f3677f 100644 --- a/mm/internal.h +++ b/mm/internal.h @@ -671,4 +671,6 @@ struct migration_target_control { gfp_t gfp_mask; };
+DECLARE_PER_CPU(struct per_cpu_nodestat, boot_nodestats); + #endif /* __MM_INTERNAL_H */ diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c index 654ab82447c8..12698242b03d 100644 --- a/mm/memory_hotplug.c +++ b/mm/memory_hotplug.c @@ -764,19 +764,21 @@ static void reset_node_present_pages(pg_data_t *pgdat) }
/* we are OK calling __meminit stuff here - we have CONFIG_MEMORY_HOTPLUG */ -static pg_data_t __ref *hotadd_new_pgdat(int nid) +static pg_data_t __ref *hotadd_init_pgdat(int nid) { struct pglist_data *pgdat;
pgdat = NODE_DATA(nid); - if (!pgdat) { - pgdat = arch_alloc_nodedata(nid); - if (!pgdat) - return NULL;
+ /* + * NODE_DATA is preallocated (free_area_init) but its internal + * state is not allocated completely. Add missing pieces. + * Completely offline nodes stay around and they just need + * reintialization. + */ + if (pgdat->per_cpu_nodestats == &boot_nodestats) { pgdat->per_cpu_nodestats = alloc_percpu(struct per_cpu_nodestat); - arch_refresh_nodedata(nid, pgdat); } else { int cpu; /* @@ -795,8 +797,6 @@ static pg_data_t __ref *hotadd_new_pgdat(int nid) } }
- /* we can use NODE_DATA(nid) from here */ - pgdat->node_id = nid; pgdat->node_start_pfn = 0;
/* init node's zones as empty zones, we don't have any present pages.*/ @@ -848,7 +848,7 @@ static int __try_online_node(int nid, bool set_node_online) if (node_online(nid)) return 0;
- pgdat = hotadd_new_pgdat(nid); + pgdat = hotadd_init_pgdat(nid); if (!pgdat) { pr_err("Cannot online node %d due to NULL pgdat\n", nid); ret = -ENOMEM; @@ -978,9 +978,6 @@ int __ref add_memory_resource(int nid, struct resource *res, mhp_t mhp_flags)
return ret; error: - /* rollback pgdat allocation and others */ - if (new_node) - rollback_node_hotadd(nid); if (IS_ENABLED(CONFIG_ARCH_KEEP_MEMBLOCK)) memblock_remove(start, size); mem_hotplug_done(); diff --git a/mm/page_alloc.c b/mm/page_alloc.c index b511a3c17769..b29a71546ba4 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -6393,7 +6393,7 @@ static void pageset_init(struct per_cpu_pageset *p); #define BOOT_PAGESET_HIGH 0 #define BOOT_PAGESET_BATCH 1 static DEFINE_PER_CPU(struct per_cpu_pageset, boot_pageset); -static DEFINE_PER_CPU(struct per_cpu_nodestat, boot_nodestats); +DEFINE_PER_CPU(struct per_cpu_nodestat, boot_nodestats);
static void __build_all_zonelists(void *data) { @@ -6414,7 +6414,11 @@ static void __build_all_zonelists(void *data) if (self && !node_online(self->node_id)) { build_zonelists(self); } else { - for_each_online_node(nid) { + /* + * All possible nodes have pgdat preallocated + * in free_area_init + */ + for_each_node(nid) { pg_data_t *pgdat = NODE_DATA(nid);
build_zonelists(pgdat); @@ -8046,8 +8050,36 @@ void __init free_area_init(unsigned long *max_zone_pfn) /* Initialise every node */ mminit_verify_pageflags_layout(); setup_nr_node_ids(); - for_each_online_node(nid) { - pg_data_t *pgdat = NODE_DATA(nid); + for_each_node(nid) { + pg_data_t *pgdat; + + if (!node_online(nid)) { + pr_info("Initializing node %d as memoryless\n", nid); + + /* Allocator not initialized yet */ + pgdat = arch_alloc_nodedata(nid); + if (!pgdat) { + pr_err("Cannot allocate %zuB for node %d.\n", + sizeof(*pgdat), nid); + continue; + } + arch_refresh_nodedata(nid, pgdat); + free_area_init_memoryless_node(nid); + + /* + * We do not want to confuse userspace by sysfs + * files/directories for node without any memory + * attached to it, so this node is not marked as + * N_MEMORY and not marked online so that no sysfs + * hierarchy will be created via register_one_node for + * it. The pgdat will get fully initialized by + * hotadd_init_pgdat() when memory is hotplugged into + * this node. + */ + continue; + } + + pgdat = NODE_DATA(nid); free_area_init_node(nid);
/* Any memory on that node */
From: Michal Hocko mhocko@suse.com
mainline inclusion
from mainline-v5.18-rc1
commit 390511e1476eb1cc41d420a7661b33f4d8584c3f
category: bugfix
bugzilla: https://gitee.com/openeuler/kernel/issues/I8EQ3K
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i...
--------------------------------
Prior to "mm: handle uninitialized numa nodes gracefully" memory hotplug used to allocate pgdat when memory has been added to a node (hotadd_init_pgdat) arch_free_nodedata has been only used in the failure path because once the pgdat is exported (to be visible by NODA_DATA(nid)) it cannot really be freed because there is no synchronization available for that.
pgdat is allocated for each possible nodes now so the memory hotplug doesn't need to do the ever use arch_free_nodedata so drop it.
This patch doesn't introduce any functional change.
Link: https://lkml.kernel.org/r/20220127085305.20890-4-mhocko@kernel.org Signed-off-by: Michal Hocko mhocko@suse.com Acked-by: Rafael Aquini raquini@redhat.com Acked-by: David Hildenbrand david@redhat.com Acked-by: Mike Rapoport rppt@linux.ibm.com Reviewed-by: Oscar Salvador osalvador@suse.de Cc: Alexey Makhalov amakhalov@vmware.com Cc: Christoph Lameter cl@linux.com Cc: Dennis Zhou dennis@kernel.org Cc: Eric Dumazet eric.dumazet@gmail.com Cc: Nico Pache npache@redhat.com Cc: Tejun Heo tj@kernel.org Cc: Wei Yang richard.weiyang@gmail.com Signed-off-by: Andrew Morton akpm@linux-foundation.org Signed-off-by: Linus Torvalds torvalds@linux-foundation.org
Conflicts: mm/memory_hotplug.c Signed-off-by: Ma Wupeng mawupeng1@huawei.com --- arch/ia64/mm/discontig.c | 5 ----- include/linux/memory_hotplug.h | 3 --- mm/memory_hotplug.c | 10 ---------- 3 files changed, 18 deletions(-)
diff --git a/arch/ia64/mm/discontig.c b/arch/ia64/mm/discontig.c index d72ad280c932..224d590ca0d4 100644 --- a/arch/ia64/mm/discontig.c +++ b/arch/ia64/mm/discontig.c @@ -637,11 +637,6 @@ pg_data_t * __init arch_alloc_nodedata(int nid) return memblock_alloc(size, SMP_CACHE_BYTES); }
-void arch_free_nodedata(pg_data_t *pgdat) -{ - kfree(pgdat); -} - void arch_refresh_nodedata(int update_node, pg_data_t *update_pgdat) { pgdat_list[update_node] = update_pgdat; diff --git a/include/linux/memory_hotplug.h b/include/linux/memory_hotplug.h index aaacbd41940d..c6ca309fc45e 100644 --- a/include/linux/memory_hotplug.h +++ b/include/linux/memory_hotplug.h @@ -23,17 +23,14 @@ struct vmem_altmap; * node_data[nid] = kzalloc() works well. But it depends on the architecture. * * In general, generic_alloc_nodedata() is used. - * Now, arch_free_nodedata() is just defined for error path of node_hot_add. * */ extern pg_data_t *arch_alloc_nodedata(int nid); -extern void arch_free_nodedata(pg_data_t *pgdat); extern void arch_refresh_nodedata(int nid, pg_data_t *pgdat);
#else /* CONFIG_HAVE_ARCH_NODEDATA_EXTENSION */
#define arch_alloc_nodedata(nid) generic_alloc_nodedata(nid) -#define arch_free_nodedata(pgdat) generic_free_nodedata(pgdat)
#ifdef CONFIG_NUMA /* diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c index 12698242b03d..cbd6cd96e98b 100644 --- a/mm/memory_hotplug.c +++ b/mm/memory_hotplug.c @@ -819,16 +819,6 @@ static pg_data_t __ref *hotadd_init_pgdat(int nid) return pgdat; }
-static void rollback_node_hotadd(int nid) -{ - pg_data_t *pgdat = NODE_DATA(nid); - - arch_refresh_nodedata(nid, NULL); - free_percpu(pgdat->per_cpu_nodestats); - arch_free_nodedata(pgdat); -} - - /** * try_online_node - online a node if offlined * @nid: the node ID
From: Michal Hocko mhocko@suse.com
mainline inclusion
from mainline-v5.18-rc1
commit 70b5b46a754245d383811b4d2f2c76c34bb7e145
category: bugfix
bugzilla: https://gitee.com/openeuler/kernel/issues/I8EQ3K
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i...
--------------------------------
When a !node_online node is brought up it needs hotplug-specific initialization, because the node could either be not initialized yet or it could have been recycled after a previous hotremove. hotadd_init_pgdat is responsible for that.
Internal pgdat state is currently initialized in two places:
 - hotadd_init_pgdat
 - free_area_init_core_hotplug
There is no real clear cut of what should go where, but this patch chooses to move the whole internal state initialization into free_area_init_core_hotplug. hotadd_init_pgdat is still responsible for pulling all the parts together - most notably for initializing zonelists, because those depend on the overall topology.
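In rough outline (a simplified sketch with a made-up name, omitting the comments and details quoted in the hunks below), hotadd_init_pgdat() after this patch boils down to:

	/* Simplified sketch of the reorganized flow, not the exact upstream body. */
	static pg_data_t * __ref hotadd_init_pgdat_sketch(int nid)
	{
		pg_data_t *pgdat = NODE_DATA(nid);	/* preallocated by free_area_init() */

		/* all internal state: per-cpu stats, kswapd fields, empty zones */
		free_area_init_core_hotplug(pgdat);

		/* topology-dependent pieces stay here */
		build_all_zonelists(pgdat);

		reset_node_managed_pages(pgdat);
		reset_node_present_pages(pgdat);

		return pgdat;
	}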
This patch doesn't introduce any functional change.
Link: https://lkml.kernel.org/r/20220127085305.20890-5-mhocko@kernel.org Signed-off-by: Michal Hocko mhocko@suse.com Acked-by: Rafael Aquini raquini@redhat.com Acked-by: David Hildenbrand david@redhat.com Reviewed-by: Oscar Salvador osalvador@suse.de Cc: Alexey Makhalov amakhalov@vmware.com Cc: Christoph Lameter cl@linux.com Cc: Dennis Zhou dennis@kernel.org Cc: Eric Dumazet eric.dumazet@gmail.com Cc: Mike Rapoport rppt@linux.ibm.com Cc: Nico Pache npache@redhat.com Cc: Tejun Heo tj@kernel.org Cc: Wei Yang richard.weiyang@gmail.com Signed-off-by: Andrew Morton akpm@linux-foundation.org Signed-off-by: Linus Torvalds torvalds@linux-foundation.org
Conflicts: include/linux/memory_hotplug.h Signed-off-by: Ma Wupeng mawupeng1@huawei.com --- include/linux/memory_hotplug.h | 2 +- mm/memory_hotplug.c | 28 +++------------------------- mm/page_alloc.c | 25 +++++++++++++++++++++++-- 3 files changed, 27 insertions(+), 28 deletions(-)
diff --git a/include/linux/memory_hotplug.h b/include/linux/memory_hotplug.h index c6ca309fc45e..c7e5dca72f69 100644 --- a/include/linux/memory_hotplug.h +++ b/include/linux/memory_hotplug.h @@ -314,7 +314,7 @@ extern void clear_zone_contiguous(struct zone *zone); #ifdef CONFIG_MEMORY_HOTPLUG extern struct acpi_device *hotplug_mdev[MAX_NUMNODES];
-extern void __ref free_area_init_core_hotplug(int nid); +extern void __ref free_area_init_core_hotplug(struct pglist_data *pgdat); extern int __add_memory(int nid, u64 start, u64 size, mhp_t mhp_flags); extern int add_memory(int nid, u64 start, u64 size, mhp_t mhp_flags); extern int add_memory_resource(int nid, struct resource *resource, diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c index cbd6cd96e98b..3930b3817b17 100644 --- a/mm/memory_hotplug.c +++ b/mm/memory_hotplug.c @@ -768,39 +768,16 @@ static pg_data_t __ref *hotadd_init_pgdat(int nid) { struct pglist_data *pgdat;
- pgdat = NODE_DATA(nid); - /* * NODE_DATA is preallocated (free_area_init) but its internal * state is not allocated completely. Add missing pieces. * Completely offline nodes stay around and they just need * reintialization. */ - if (pgdat->per_cpu_nodestats == &boot_nodestats) { - pgdat->per_cpu_nodestats = - alloc_percpu(struct per_cpu_nodestat); - } else { - int cpu; - /* - * Reset the nr_zones, order and highest_zoneidx before reuse. - * Note that kswapd will init kswapd_highest_zoneidx properly - * when it starts in the near future. - */ - pgdat->nr_zones = 0; - pgdat->kswapd_order = 0; - pgdat->kswapd_highest_zoneidx = 0; - for_each_online_cpu(cpu) { - struct per_cpu_nodestat *p; - - p = per_cpu_ptr(pgdat->per_cpu_nodestats, cpu); - memset(p, 0, sizeof(*p)); - } - } - - pgdat->node_start_pfn = 0; + pgdat = NODE_DATA(nid);
/* init node's zones as empty zones, we don't have any present pages.*/ - free_area_init_core_hotplug(nid); + free_area_init_core_hotplug(pgdat);
/* * The node we allocated has no zone fallback lists. For avoiding @@ -812,6 +789,7 @@ static pg_data_t __ref *hotadd_init_pgdat(int nid) * When memory is hot-added, all the memory is in offline state. So * clear all zones' present_pages because they will be updated in * online_pages() and offline_pages(). + * TODO: should be in free_area_init_core_hotplug? */ reset_node_managed_pages(pgdat); reset_node_present_pages(pgdat); diff --git a/mm/page_alloc.c b/mm/page_alloc.c index b29a71546ba4..ee863d43600b 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -7430,12 +7430,33 @@ static void __meminit zone_init_internals(struct zone *zone, enum zone_type idx, * NOTE: this function is only called during memory hotplug */ #ifdef CONFIG_MEMORY_HOTPLUG -void __ref free_area_init_core_hotplug(int nid) +void __ref free_area_init_core_hotplug(struct pglist_data *pgdat) { + int nid = pgdat->node_id; enum zone_type z; - pg_data_t *pgdat = NODE_DATA(nid); + int cpu;
pgdat_init_internals(pgdat); + + if (pgdat->per_cpu_nodestats == &boot_nodestats) + pgdat->per_cpu_nodestats = alloc_percpu(struct per_cpu_nodestat); + + /* + * Reset the nr_zones, order and highest_zoneidx before reuse. + * Note that kswapd will init kswapd_highest_zoneidx properly + * when it starts in the near future. + */ + pgdat->nr_zones = 0; + pgdat->kswapd_order = 0; + pgdat->kswapd_highest_zoneidx = 0; + pgdat->node_start_pfn = 0; + for_each_online_cpu(cpu) { + struct per_cpu_nodestat *p; + + p = per_cpu_ptr(pgdat->per_cpu_nodestats, cpu); + memset(p, 0, sizeof(*p)); + } + for (z = 0; z < MAX_NR_ZONES; z++) zone_init_internals(&pgdat->node_zones[z], z, nid, 0); }
From: Michal Hocko mhocko@suse.com
mainline inclusion
from mainline-v5.18-rc1
commit 7c30daac20698cb035255089c896f230982b085e
category: bugfix
bugzilla: https://gitee.com/openeuler/kernel/issues/I8EQ3K
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i...
--------------------------------
free_area_init_node is also called from the memoryless node initialization path (free_area_init_memoryless_node). It doesn't really make much sense to display the physical memory range for those nodes:

	Initmem setup node XX [mem 0x0000000000000000-0x0000000000000000]

Instead, be explicit that the node is memoryless:

	Initmem setup node XX as memoryless
Link: https://lkml.kernel.org/r/20220127085305.20890-6-mhocko@kernel.org Signed-off-by: Michal Hocko mhocko@suse.com Acked-by: Rafael Aquini raquini@redhat.com Acked-by: David Hildenbrand david@redhat.com Reviewed-by: Mike Rapoport rppt@linux.ibm.com Reviewed-by: Oscar Salvador osalvador@suse.de Cc: Alexey Makhalov amakhalov@vmware.com Cc: Christoph Lameter cl@linux.com Cc: Dennis Zhou dennis@kernel.org Cc: Eric Dumazet eric.dumazet@gmail.com Cc: Nico Pache npache@redhat.com Cc: Tejun Heo tj@kernel.org Cc: Wei Yang richard.weiyang@gmail.com Signed-off-by: Andrew Morton akpm@linux-foundation.org Signed-off-by: Linus Torvalds torvalds@linux-foundation.org Signed-off-by: Ma Wupeng mawupeng1@huawei.com --- mm/page_alloc.c | 11 ++++++++--- 1 file changed, 8 insertions(+), 3 deletions(-)
diff --git a/mm/page_alloc.c b/mm/page_alloc.c index ee863d43600b..d586c6e6faee 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -7610,9 +7610,14 @@ static void __init free_area_init_node(int nid) pgdat->node_start_pfn = start_pfn; pgdat->per_cpu_nodestats = NULL;
- pr_info("Initmem setup node %d [mem %#018Lx-%#018Lx]\n", nid, - (u64)start_pfn << PAGE_SHIFT, - end_pfn ? ((u64)end_pfn << PAGE_SHIFT) - 1 : 0); + if (start_pfn != end_pfn) { + pr_info("Initmem setup node %d [mem %#018Lx-%#018Lx]\n", nid, + (u64)start_pfn << PAGE_SHIFT, + end_pfn ? ((u64)end_pfn << PAGE_SHIFT) - 1 : 0); + } else { + pr_info("Initmem setup node %d as memoryless\n", nid); + } + calculate_node_totalpages(pgdat, start_pfn, end_pfn);
alloc_node_mem_map(pgdat);
From: Wei Yang richard.weiyang@gmail.com
mainline inclusion
from mainline-v5.18-rc1
commit 8c9bb39816f01a309d30243da0ca91bd7e7bd1c2
category: bugfix
bugzilla: https://gitee.com/openeuler/kernel/issues/I8EQ3K
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i...
--------------------------------
alloc_mem_cgroup_per_node_info() allocates for each possible node, and this used to be a problem because !node_online nodes didn't have the appropriate data structures allocated. This has changed with "mm: handle uninitialized numa nodes gracefully", so we can drop the special casing here.
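For illustration only (hypothetical helper, not part of the patch), the point is simply that a node-affine slab allocation can now pass any possible node id directly:

	/*
	 * Hypothetical example: with a pgdat preallocated for every possible
	 * node, kzalloc_node() no longer needs the caller to redirect offline
	 * nodes to NUMA_NO_NODE; the request falls back through the node's
	 * zonelists instead.
	 */
	static struct mem_cgroup_per_node *example_alloc_pn(int node)
	{
		return kzalloc_node(sizeof(struct mem_cgroup_per_node), GFP_KERNEL, node);
	}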
Link: https://lkml.kernel.org/r/20220127085305.20890-7-mhocko@kernel.org Signed-off-by: Wei Yang richard.weiyang@gmail.com Signed-off-by: Michal Hocko mhocko@suse.com Cc: David Hildenbrand david@redhat.com Cc: Alexey Makhalov amakhalov@vmware.com Cc: Dennis Zhou dennis@kernel.org Cc: Eric Dumazet eric.dumazet@gmail.com Cc: Tejun Heo tj@kernel.org Cc: Christoph Lameter cl@linux.com Cc: Nico Pache npache@redhat.com Cc: Wei Yang richard.weiyang@gmail.com Cc: Mike Rapoport rppt@linux.ibm.com Cc: Oscar Salvador osalvador@suse.de Cc: Rafael Aquini raquini@redhat.com Signed-off-by: Andrew Morton akpm@linux-foundation.org Signed-off-by: Linus Torvalds torvalds@linux-foundation.org Signed-off-by: Ma Wupeng mawupeng1@huawei.com --- mm/memcontrol.c | 14 ++------------ 1 file changed, 2 insertions(+), 12 deletions(-)
diff --git a/mm/memcontrol.c b/mm/memcontrol.c index 9603b6dff9aa..b70c820e54bd 100644 --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -6199,18 +6199,8 @@ struct mem_cgroup *mem_cgroup_from_id(unsigned short id) static int alloc_mem_cgroup_per_node_info(struct mem_cgroup *memcg, int node) { struct mem_cgroup_per_node *pn; - int tmp = node; - /* - * This routine is called against possible nodes. - * But it's BUG to call kmalloc() against offline node. - * - * TODO: this routine can waste much memory for nodes which will - * never be onlined. It's better to use memory hotplug callback - * function. - */ - if (!node_state(node, N_NORMAL_MEMORY)) - tmp = -1; - pn = kzalloc_node(sizeof(*pn), GFP_KERNEL, tmp); + + pn = kzalloc_node(sizeof(*pn), GFP_KERNEL, node); if (!pn) return 1;
From: Haifeng Xu haifeng.xu@shopee.com
mainline inclusion
from mainline-v6.5-rc1
commit 91f0dccef141483f8399299c39ce9114d19bb147
category: bugfix
bugzilla: https://gitee.com/openeuler/kernel/issues/I8EQ3K
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i...
--------------------------------
mem_cgroup_init() requests allocations from each possible node, and this used to be a problem because NODE_DATA was not allocated for offline nodes. Things have already changed since commit 09f49dca570a9 ("mm: handle uninitialized numa nodes gracefully"), so it's unnecessary to check for !node_online nodes here.
How to test?
qemu-system-x86_64 \
	-kernel vmlinux \
	-initrd full.rootfs.cpio.gz \
	-append "console=ttyS0,115200 root=/dev/ram0 nokaslr earlyprintk=serial oops=panic panic_on_warn" \
	-drive format=qcow2,file=vm_disk.qcow2,media=disk,if=ide \
	-enable-kvm \
	-cpu host \
	-m 8G,slots=2,maxmem=16G \
	-smp cores=4,threads=1,sockets=2 \
	-object memory-backend-ram,id=mem0,size=4G \
	-object memory-backend-ram,id=mem1,size=4G \
	-numa node,memdev=mem0,cpus=0-3,nodeid=0 \
	-numa node,memdev=mem1,cpus=4-7,nodeid=1 \
	-numa node,nodeid=2 \
	-net nic,model=virtio,macaddr=52:54:00:12:34:58 \
	-net user \
	-nographic \
	-rtc base=localtime \
	-gdb tcp::6000
Guest state when booting:
[    0.048881] NUMA: Node 0 [mem 0x00000000-0x0009ffff] + [mem 0x00100000-0xbfffffff] -> [mem 0x00000000-0xbfffffff]
[    0.050489] NUMA: Node 0 [mem 0x00000000-0xbfffffff] + [mem 0x100000000-0x13fffffff] -> [mem 0x00000000-0x13fffffff]
[    0.052173] NODE_DATA(0) allocated [mem 0x13fffc000-0x13fffffff]
[    0.053164] NODE_DATA(1) allocated [mem 0x23fffa000-0x23fffdfff]
[    0.054187] Zone ranges:
[    0.054587]   DMA      [mem 0x0000000000001000-0x0000000000ffffff]
[    0.055551]   DMA32    [mem 0x0000000001000000-0x00000000ffffffff]
[    0.056515]   Normal   [mem 0x0000000100000000-0x000000023fffffff]
[    0.057484] Movable zone start for each node
[    0.058149] Early memory node ranges
[    0.058705]   node   0: [mem 0x0000000000001000-0x000000000009efff]
[    0.059679]   node   0: [mem 0x0000000000100000-0x00000000bffdffff]
[    0.060659]   node   0: [mem 0x0000000100000000-0x000000013fffffff]
[    0.061649]   node   1: [mem 0x0000000140000000-0x000000023fffffff]
[    0.062638] Initmem setup node 0 [mem 0x0000000000001000-0x000000013fffffff]
[    0.063745] Initmem setup node 1 [mem 0x0000000140000000-0x000000023fffffff]
[    0.064855] DMA zone: 158 reserved pages exceeds freesize 0
[    0.065746] Initializing node 2 as memoryless
[    0.066437] Initmem setup node 2 as memoryless
[    0.067132] DMA zone: 158 reserved pages exceeds freesize 0
[    0.068037] On node 0, zone DMA: 1 pages in unavailable ranges
[    0.068265] On node 0, zone DMA: 97 pages in unavailable ranges
[    0.124755] On node 0, zone Normal: 32 pages in unavailable ranges
cat /sys/devices/system/node/online
0-1
cat /sys/devices/system/node/possible
0-2
Link: https://lkml.kernel.org/r/20230619130442.2487-1-haifeng.xu@shopee.com Signed-off-by: Haifeng Xu haifeng.xu@shopee.com Acked-by: Michal Hocko mhocko@suse.com Cc: Johannes Weiner hannes@cmpxchg.org Cc: Roman Gushchin roman.gushchin@linux.dev Cc: Shakeel Butt shakeelb@google.com Signed-off-by: Andrew Morton akpm@linux-foundation.org Signed-off-by: Ma Wupeng mawupeng1@huawei.com --- mm/memcontrol.c | 3 +-- 1 file changed, 1 insertion(+), 2 deletions(-)
diff --git a/mm/memcontrol.c b/mm/memcontrol.c index b70c820e54bd..de683b28f2ab 100644 --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -8171,8 +8171,7 @@ static int __init mem_cgroup_init(void) for_each_node(node) { struct mem_cgroup_tree_per_node *rtpn;
- rtpn = kzalloc_node(sizeof(*rtpn), GFP_KERNEL, - node_online(node) ? node : NUMA_NO_NODE); + rtpn = kzalloc_node(sizeof(*rtpn), GFP_KERNEL, node);
rtpn->rb_root = RB_ROOT; rtpn->rb_rightmost = NULL;
From: Srikar Dronamraju srikar@linux.vnet.ibm.com
mainline inclusion
from mainline-v5.18-rc2
commit e4ff77598a109bd36789ad5e80aba66fc53d0ffb
category: bugfix
bugzilla: https://gitee.com/openeuler/kernel/issues/I8EQ3K
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i...
--------------------------------
With commit 09f49dca570a ("mm: handle uninitialized numa nodes gracefully") NODE_DATA for even a memoryless/cpuless node is partially initialized at boot time.
Before onlining the node, the current powerpc code checks whether NODE_DATA is NULL. However, since NODE_DATA is now partially initialized, this check will always be false.
This causes hotplugging a CPU to a memoryless/cpuless node to fail.
Before adding CPUs:

$ numactl -H
available: 1 nodes (4)
node 4 cpus: 0 1 2 3 4 5 6 7
node 4 size: 97372 MB
node 4 free: 95545 MB
node distances:
node   4
  4:  10
$ lparstat
System Configuration
type=Dedicated mode=Capped smt=8 lcpu=1 mem=99709440 kB cpus=0 ent=1.00

%user  %sys  %wait  %idle  physc  %entc  lbusy    app   vcsw  phint
-----  -----  -----  -----  -----  -----  -----  -----  -----  -----
 2.66   2.67   0.16  94.51   0.00   0.00   5.33   0.00  67749      0
After hotplugging 32 cores:

$ numactl -H
node 4 cpus: 0 1 2 3 4 5 6 7 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175
node 4 size: 97372 MB
node 4 free: 93636 MB
node distances:
node   4
  4:  10
$ lparstat
System Configuration
type=Dedicated mode=Capped smt=8 lcpu=33 mem=99709440 kB cpus=0 ent=33.00

%user  %sys  %wait  %idle  physc  %entc  lbusy    app     vcsw  phint
-----  -----  -----  -----  -----  -----  -----  -----  -------  -----
 0.04   0.02   0.00  99.94   0.00   0.00   0.06   0.00  1128751      3
As we can see numactl is listing only 8 cores while lparstat is showing 33 cores.
Also dmesg is showing messages like:

[ 2261.318350 ] BUG: arch topology borken
[ 2261.318357 ] the DIE domain not a subset of the NODE domain
Fixes: 09f49dca570a ("mm: handle uninitialized numa nodes gracefully") Reported-by: Geetika Moolchandani Geetika.Moolchandani1@ibm.com Signed-off-by: Srikar Dronamraju srikar@linux.vnet.ibm.com Acked-by: Michal Hocko mhocko@suse.com Acked-by: Oscar Salvador osalvador@suse.de Signed-off-by: Michael Ellerman mpe@ellerman.id.au Link: https://lore.kernel.org/r/20220330135123.1868197-1-srikar@linux.vnet.ibm.com Signed-off-by: Ma Wupeng mawupeng1@huawei.com --- arch/powerpc/mm/numa.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/arch/powerpc/mm/numa.c b/arch/powerpc/mm/numa.c index b76aa2aa2ea9..a62fd266d508 100644 --- a/arch/powerpc/mm/numa.c +++ b/arch/powerpc/mm/numa.c @@ -1452,7 +1452,7 @@ int find_and_online_cpu_nid(int cpu) if (new_nid < 0 || !node_possible(new_nid)) new_nid = first_online_node;
- if (NODE_DATA(new_nid) == NULL) { + if (!node_online(new_nid)) { #ifdef CONFIG_MEMORY_HOTPLUG /* * Need to ensure that NODE_DATA is initialized for a node from
From: Oscar Salvador osalvador@suse.de
mainline inclusion
from mainline-v5.18-rc1
commit 1ca75fa7f19d694c58af681fa023295072b03120
category: bugfix
bugzilla: https://gitee.com/openeuler/kernel/issues/I8EQ3K
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i...
--------------------------------
On x86, prior to ("mm: handle uninitialized numa nodes gracefully"), NUMA nodes could be allocated in three different places:
 - numa_register_memblks
 - init_cpu_to_node
 - init_gi_nodes
All these calls happen at setup_arch, and have the following order:
setup_arch
 ...
 x86_numa_init
  numa_init
   numa_register_memblks
 ...
 init_cpu_to_node
  init_memory_less_node
   alloc_node_data
   free_area_init_memoryless_node
 init_gi_nodes
  init_memory_less_node
   alloc_node_data
   free_area_init_memoryless_node
numa_register_memblks() is only interested in those nodes which have memory, so it skips over any memoryless node it finds. Later on, when we have read ACPI's SRAT table, we call init_cpu_to_node() and init_gi_nodes(), which initialize any memoryless node we might have that has either CPU or Initiator affinity, meaning we allocate a pg_data_t struct for them and mark them as ONLINE.
So far so good, but the thing is that after ("mm: handle uninitialized numa nodes gracefully"), we allocate all possible NUMA nodes in free_area_init(), meaning we have a picture like the following:
setup_arch
 x86_numa_init
  numa_init
   numa_register_memblks           <-- allocate non-memoryless node
 x86_init.paging.pagetable_init
  ...
   free_area_init
    free_area_init_memoryless      <-- allocate memoryless node
 init_cpu_to_node
  alloc_node_data                  <-- allocate memoryless node with CPU
  free_area_init_memoryless_node
 init_gi_nodes
  alloc_node_data                  <-- allocate memoryless node with Initiator
  free_area_init_memoryless_node
free_area_init() already allocates all possible NUMA nodes, but init_cpu_to_node() and init_gi_nodes() are clueless about that, so they go ahead and allocate a new pg_data_t struct without checking anything, meaning we end up allocating twice.
It should be made clear that this only happens in the case where a memoryless NUMA node happens to have a CPU/Initiator affinity.
So get rid of init_memory_less_node() and just set the node online.
Note that setting the node online is needed, otherwise we choke down the chain when bringup_nonboot_cpus() ends up calling __try_online_node()->register_one_node()->... and we blow up in bus_add_device(). As can be seen here:
BUG: kernel NULL pointer dereference, address: 0000000000000060
#PF: supervisor read access in kernel mode
#PF: error_code(0x0000) - not-present page
PGD 0 P4D 0
Oops: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC PTI
CPU: 0 PID: 1 Comm: swapper/0 Not tainted 5.17.0-rc4-1-default+ #45
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.0.0-prebuilt.qemu-project.org 04/4
RIP: 0010:bus_add_device+0x5a/0x140
Code: 8b 74 24 20 48 89 df e8 84 96 ff ff 85 c0 89 c5 75 38 48 8b 53 50 48 85 d2 0f 84 bb 00 004
RSP: 0000:ffffc9000022bd10 EFLAGS: 00010246
RAX: 0000000000000000 RBX: ffff888100987400 RCX: ffff8881003e4e19
RDX: ffff8881009a5e00 RSI: ffff888100987400 RDI: ffff888100987400
RBP: 0000000000000000 R08: ffff8881003e4e18 R09: ffff8881003e4c98
R10: 0000000000000000 R11: ffff888100402bc0 R12: ffffffff822ceba0
R13: 0000000000000000 R14: ffff888100987400 R15: 0000000000000000
FS:  0000000000000000(0000) GS:ffff88853fc00000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000000000000060 CR3: 000000000200a001 CR4: 00000000001706b0
Call Trace:
 device_add+0x4c0/0x910
 __register_one_node+0x97/0x2d0
 __try_online_node+0x85/0xc0
 try_online_node+0x25/0x40
 cpu_up+0x4f/0x100
 bringup_nonboot_cpus+0x4f/0x60
 smp_init+0x26/0x79
 kernel_init_freeable+0x130/0x2f1
 kernel_init+0x17/0x150
 ret_from_fork+0x22/0x30
The reason is simple, by the time bringup_nonboot_cpus() gets called, we did not register the node_subsys bus yet, so we crash when bus_add_device() tries to dereference bus()->p.
The following shows the order of the calls:
kernel_init_freeable
 smp_init
  bringup_nonboot_cpus
   ...
    bus_add_device()               <- we did not register node_subsys yet
 do_basic_setup
  do_initcalls
   postcore_initcall(register_node_type);
    register_node_type
     subsys_system_register
      subsys_register
       bus_register                <- register node_subsys bus
Why does setting the node online save us, then? Well, simply because __try_online_node() backs off when the node is online, meaning we do not end up calling register_one_node() in the first place.
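The guard in question is roughly the following (simplified from __try_online_node(); not the exact upstream body):

	/*
	 * Simplified sketch: an already-online node short-circuits before any
	 * node_subsys/sysfs registration can be attempted.
	 */
	static int try_online_node_sketch(int nid)
	{
		if (node_online(nid))
			return 0;	/* register_one_node() is never reached */

		/* ... otherwise allocate, initialize and register the node ... */
		return 0;
	}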
This is subtle, broken and deserves a deep analysis and thought about how to put this into shape, but for now let us have this easy fix for the leaking memory issue.
[osalvador@suse.de: add comments] Link: https://lkml.kernel.org/r/20220221142649.3457-1-osalvador@suse.de
Link: https://lkml.kernel.org/r/20220218224302.5282-2-osalvador@suse.de Fixes: da4490c958ad ("mm: handle uninitialized numa nodes gracefully") Signed-off-by: Oscar Salvador osalvador@suse.de Acked-by: Michal Hocko mhocko@suse.com Cc: David Hildenbrand david@redhat.com Cc: Rafael Aquini raquini@redhat.com Cc: Dave Hansen dave.hansen@linux.intel.com Cc: Wei Yang richard.weiyang@gmail.com Cc: Dennis Zhou dennis@kernel.org Cc: Alexey Makhalov amakhalov@vmware.com Signed-off-by: Andrew Morton akpm@linux-foundation.org Signed-off-by: Linus Torvalds torvalds@linux-foundation.org Signed-off-by: Ma Wupeng mawupeng1@huawei.com --- arch/x86/mm/numa.c | 33 ++++++++++++++++++++------------- include/linux/mm.h | 1 - mm/page_alloc.c | 2 +- 3 files changed, 21 insertions(+), 15 deletions(-)
diff --git a/arch/x86/mm/numa.c b/arch/x86/mm/numa.c index 9dc31996c7ed..cfb326e41dc6 100644 --- a/arch/x86/mm/numa.c +++ b/arch/x86/mm/numa.c @@ -739,17 +739,6 @@ void __init x86_numa_init(void) numa_init(dummy_numa_init); }
-static void __init init_memory_less_node(int nid) -{ - /* Allocate and initialize node data. Memory-less node is now online.*/ - alloc_node_data(nid); - free_area_init_memoryless_node(nid); - - /* - * All zonelists will be built later in start_kernel() after per cpu - * areas are initialized. - */ -}
/* * A node may exist which has one or more Generic Initiators but no CPUs and no @@ -767,9 +756,18 @@ void __init init_gi_nodes(void) { int nid;
+ /* + * Exclude this node from + * bringup_nonboot_cpus + * cpu_up + * __try_online_node + * register_one_node + * because node_subsys is not initialized yet. + * TODO remove dependency on node_online + */ for_each_node_state(nid, N_GENERIC_INITIATOR) if (!node_online(nid)) - init_memory_less_node(nid); + node_set_online(nid); }
/* @@ -799,8 +797,17 @@ void __init init_cpu_to_node(void) if (node == NUMA_NO_NODE) continue;
+ /* + * Exclude this node from + * bringup_nonboot_cpus + * cpu_up + * __try_online_node + * register_one_node + * because node_subsys is not initialized yet. + * TODO remove dependency on node_online + */ if (!node_online(node)) - init_memory_less_node(node); + node_set_online(node);
numa_set_node(cpu, node); } diff --git a/include/linux/mm.h b/include/linux/mm.h index a5316ffd4e1f..2ad509486cad 100644 --- a/include/linux/mm.h +++ b/include/linux/mm.h @@ -2402,7 +2402,6 @@ static inline spinlock_t *pud_lock(struct mm_struct *mm, pud_t *pud) }
extern void __init pagecache_init(void); -extern void __init free_area_init_memoryless_node(int nid); extern void free_initmem(void);
/* diff --git a/mm/page_alloc.c b/mm/page_alloc.c index d586c6e6faee..f21365c92a98 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -7626,7 +7626,7 @@ static void __init free_area_init_node(int nid) free_area_init_core(pgdat); }
-void __init free_area_init_memoryless_node(int nid) +static void __init free_area_init_memoryless_node(int nid) { free_area_init_node(nid); }
FeedBack: The patch(es) which you have sent to kernel@openeuler.org mailing list has been converted to a pull request successfully! Pull request link: https://gitee.com/openeuler/kernel/pulls/2798 Mailing list address: https://mailweb.openeuler.org/hyperkitty/list/kernel@openeuler.org/message/B...