From: Ma Wupeng <mawupeng1@huawei.com>
handle uninitialized numa nodes gracefully.
Changelog since v1:
- add bugfix patch #3.

Michal Hocko (2):
  mm, memory_hotplug: make arch_alloc_nodedata independent on CONFIG_MEMORY_HOTPLUG
  mm: handle uninitialized numa nodes gracefully

Oscar Salvador (1):
  arch/x86/mm/numa: Do not initialize nodes twice

 arch/ia64/mm/discontig.c       |   6 +-
 arch/x86/mm/numa.c             |  26 +++----
 include/linux/memory_hotplug.h | 120 ++++++++++++++++-----------------
 mm/internal.h                  |   2 +
 mm/memory_hotplug.c            |  24 +++----
 mm/page_alloc.c                |  45 +++++++++++--
 6 files changed, 123 insertions(+), 100 deletions(-)
From: Michal Hocko <mhocko@suse.com>

mainline inclusion
from mainline-v5.18-rc1
commit e930d999715073a70d306fb59a394ea8b84d0b45
category: bugfix
bugzilla: 189331
CVE: NA
--------------------------------
Patch series "mm, memory_hotplug: handle unitialized numa node gracefully".
The core of the fix is patch 2 which also links existing bug reports. The high level goal is to have all possible numa nodes have their pgdat allocated and initialized so
for_each_possible_node(nid) NODE_DATA(nid)
will never return garbage. This has proven to be a problem in several places where an offline numa node is used for an allocation, only to discover that node_data, and therefore the allocation fallback zonelists, are not initialized, so the allocation request blows up.
There were attempts to address that by checking node_online in several places including the page allocator. This patchset approaches the problem from a different perspective: instead of special casing, which just adds runtime overhead, it allocates pglist_data for each possible node. This can add some memory overhead for platforms with a high number of possible nodes if they do not contain any memory. This should be a rather rare configuration though.
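As a rough illustration of the invariant described above, here is a small userspace model in C. It is only a sketch under stated assumptions: pgdat_model, node_data_model, NR_POSSIBLE and the helper names are hypothetical stand-ins, not kernel symbols, and the page counts are made up.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>

#define NR_POSSIBLE 4	/* hypothetical number of possible nodes */

/* Stand-in for pg_data_t; the real structure is far larger. */
struct pgdat_model {
	int node_id;
	unsigned long present_pages;
};

/* Stand-in for the table that NODE_DATA(nid) looks up. */
static struct pgdat_model *node_data_model[NR_POSSIBLE];

/* Allocate a zeroed descriptor for nid unless one already exists. */
static struct pgdat_model *alloc_node_model(int nid, unsigned long pages)
{
	struct pgdat_model *pgdat = node_data_model[nid];

	if (!pgdat) {
		pgdat = calloc(1, sizeof(*pgdat));
		assert(pgdat);
		pgdat->node_id = nid;
		node_data_model[nid] = pgdat;
	}
	pgdat->present_pages = pages;
	return pgdat;
}

/* Old scheme: only nodes marked online get a descriptor. */
static void init_online_only(const bool *online)
{
	for (int nid = 0; nid < NR_POSSIBLE; nid++)
		if (online[nid])
			alloc_node_model(nid, 1024);
}

/* Scheme of this series: offline nodes get a memoryless descriptor. */
static void init_all_possible(const bool *online)
{
	for (int nid = 0; nid < NR_POSSIBLE; nid++)
		alloc_node_model(nid, online[nid] ? 1024 : 0);
}

/* Count possible nodes whose lookup would still return NULL. */
static int count_missing(void)
{
	int missing = 0;

	for (int nid = 0; nid < NR_POSSIBLE; nid++)
		if (!node_data_model[nid])
			missing++;
	return missing;
}
```

With nodes 1 and 3 offline, init_online_only() leaves two lookups returning NULL, while init_all_possible() leaves none, at the cost of one small descriptor per empty node.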
How to test this? David has provided an excellent howto:
http://lkml.kernel.org/r/6e5ebc19-890c-b6dd-1924-9f25c441010d@redhat.com
Patches 1 and 3-6 are mostly cleanups. The patchset has been reviewed by Rafael (thanks!) and the core fix tested by Rafael and Alexey (thanks to both). David has tested as per instructions above and hasn't found any fallouts in the memory hotplug scenarios.
This patch (of 6):
This is a preparatory patch and it doesn't introduce any functional change. It merely pulls out arch_alloc_nodedata (and co) outside of CONFIG_MEMORY_HOTPLUG because the following patch will need to call this from the generic MM code.
Link: https://lkml.kernel.org/r/20220127085305.20890-1-mhocko@kernel.org
Link: https://lkml.kernel.org/r/20220127085305.20890-2-mhocko@kernel.org
Signed-off-by: Michal Hocko <mhocko@suse.com>
Acked-by: Rafael Aquini <raquini@redhat.com>
Acked-by: David Hildenbrand <david@redhat.com>
Acked-by: Mike Rapoport <rppt@linux.ibm.com>
Reviewed-by: Oscar Salvador <osalvador@suse.de>
Reviewed-by: Wei Yang <richard.weiyang@gmail.com>
Cc: Alexey Makhalov <amakhalov@vmware.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Dennis Zhou <dennis@kernel.org>
Cc: Eric Dumazet <eric.dumazet@gmail.com>
Cc: Nico Pache <npache@redhat.com>
Cc: Tejun Heo <tj@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

Conflicts:
	include/linux/memory_hotplug.h

Signed-off-by: Ma Wupeng <mawupeng1@huawei.com>
---
 arch/ia64/mm/discontig.c       |   2 -
 include/linux/memory_hotplug.h | 120 ++++++++++++++++-----------------
 2 files changed, 60 insertions(+), 62 deletions(-)
diff --git a/arch/ia64/mm/discontig.c b/arch/ia64/mm/discontig.c
index 1928d5719e41..2711e0861d8f 100644
--- a/arch/ia64/mm/discontig.c
+++ b/arch/ia64/mm/discontig.c
@@ -631,7 +631,6 @@ void __init paging_init(void)
 	zero_page_memmap_ptr = virt_to_page(ia64_imva(empty_zero_page));
 }

-#ifdef CONFIG_MEMORY_HOTPLUG
 pg_data_t *arch_alloc_nodedata(int nid)
 {
 	unsigned long size = compute_pernodesize(nid);
@@ -649,7 +648,6 @@ void arch_refresh_nodedata(int update_node, pg_data_t *update_pgdat)
 	pgdat_list[update_node] = update_pgdat;
 	scatter_node_data();
 }
-#endif

 #ifdef CONFIG_SPARSEMEM_VMEMMAP
 int __meminit vmemmap_populate(unsigned long start, unsigned long end, int node,
diff --git a/include/linux/memory_hotplug.h b/include/linux/memory_hotplug.h
index bc433d459c86..403ba6dc3c2e 100644
--- a/include/linux/memory_hotplug.h
+++ b/include/linux/memory_hotplug.h
@@ -15,6 +15,66 @@ struct memory_block;
 struct resource;
 struct vmem_altmap;
+#ifdef CONFIG_HAVE_ARCH_NODEDATA_EXTENSION
+/*
+ * For supporting node-hotadd, we have to allocate a new pgdat.
+ *
+ * If an arch has generic style NODE_DATA(),
+ * node_data[nid] = kzalloc() works well. But it depends on the architecture.
+ *
+ * In general, generic_alloc_nodedata() is used.
+ * Now, arch_free_nodedata() is just defined for error path of node_hot_add.
+ *
+ */
+extern pg_data_t *arch_alloc_nodedata(int nid);
+extern void arch_free_nodedata(pg_data_t *pgdat);
+extern void arch_refresh_nodedata(int nid, pg_data_t *pgdat);
+
+#else /* CONFIG_HAVE_ARCH_NODEDATA_EXTENSION */
+
+#define arch_alloc_nodedata(nid)	generic_alloc_nodedata(nid)
+#define arch_free_nodedata(pgdat)	generic_free_nodedata(pgdat)
+
+#ifdef CONFIG_NUMA
+/*
+ * XXX: node aware allocation can't work well to get new node's memory at this time.
+ *	Because, pgdat for the new node is not allocated/initialized yet itself.
+ *	To use new node's memory, more consideration will be necessary.
+ */
+#define generic_alloc_nodedata(nid)				\
+({								\
+	kzalloc(sizeof(pg_data_t), GFP_KERNEL);			\
+})
+/*
+ * This definition is just for error path in node hotadd.
+ * For node hotremove, we have to replace this.
+ */
+#define generic_free_nodedata(pgdat)	kfree(pgdat)
+
+extern pg_data_t *node_data[];
+static inline void arch_refresh_nodedata(int nid, pg_data_t *pgdat)
+{
+	node_data[nid] = pgdat;
+}
+
+#else /* !CONFIG_NUMA */
+
+/* never called */
+static inline pg_data_t *generic_alloc_nodedata(int nid)
+{
+	BUG();
+	return NULL;
+}
+static inline void generic_free_nodedata(pg_data_t *pgdat)
+{
+}
+static inline void arch_refresh_nodedata(int nid, pg_data_t *pgdat)
+{
+}
+#endif /* CONFIG_NUMA */
+#endif /* CONFIG_HAVE_ARCH_NODEDATA_EXTENSION */
+
+
 #ifdef CONFIG_MEMORY_HOTPLUG
 /*
  * Return page for the valid pfn only if the page is online. All pfn
@@ -146,66 +206,6 @@ static inline int memory_add_physaddr_to_nid(u64 start)
 }
 #endif

-#ifdef CONFIG_HAVE_ARCH_NODEDATA_EXTENSION
-/*
- * For supporting node-hotadd, we have to allocate a new pgdat.
- *
- * If an arch has generic style NODE_DATA(),
- * node_data[nid] = kzalloc() works well. But it depends on the architecture.
- *
- * In general, generic_alloc_nodedata() is used.
- * Now, arch_free_nodedata() is just defined for error path of node_hot_add.
- *
- */
-extern pg_data_t *arch_alloc_nodedata(int nid);
-extern void arch_free_nodedata(pg_data_t *pgdat);
-extern void arch_refresh_nodedata(int nid, pg_data_t *pgdat);
-
-#else /* CONFIG_HAVE_ARCH_NODEDATA_EXTENSION */
-
-#define arch_alloc_nodedata(nid)	generic_alloc_nodedata(nid)
-#define arch_free_nodedata(pgdat)	generic_free_nodedata(pgdat)
-
-#ifdef CONFIG_NUMA
-/*
- * If ARCH_HAS_NODEDATA_EXTENSION=n, this func is used to allocate pgdat.
- * XXX: kmalloc_node() can't work well to get new node's memory at this time.
- *	Because, pgdat for the new node is not allocated/initialized yet itself.
- *	To use new node's memory, more consideration will be necessary.
- */
-#define generic_alloc_nodedata(nid)				\
-({								\
-	kzalloc(sizeof(pg_data_t), GFP_KERNEL);			\
-})
-/*
- * This definition is just for error path in node hotadd.
- * For node hotremove, we have to replace this.
- */
-#define generic_free_nodedata(pgdat)	kfree(pgdat)
-
-extern pg_data_t *node_data[];
-static inline void arch_refresh_nodedata(int nid, pg_data_t *pgdat)
-{
-	node_data[nid] = pgdat;
-}
-
-#else /* !CONFIG_NUMA */
-
-/* never called */
-static inline pg_data_t *generic_alloc_nodedata(int nid)
-{
-	BUG();
-	return NULL;
-}
-static inline void generic_free_nodedata(pg_data_t *pgdat)
-{
-}
-static inline void arch_refresh_nodedata(int nid, pg_data_t *pgdat)
-{
-}
-#endif /* CONFIG_NUMA */
-#endif /* CONFIG_HAVE_ARCH_NODEDATA_EXTENSION */
-
 #ifdef CONFIG_HAVE_BOOTMEM_INFO_NODE
 extern void __init register_page_bootmem_info_node(struct pglist_data *pgdat);
 #else
From: Michal Hocko <mhocko@suse.com>

mainline inclusion
from mainline-v5.18-rc1
commit 09f49dca570a917a8c6bccd7e8c61f5141534e3a
category: bugfix
bugzilla: 189331
CVE: NA
--------------------------------
We have had several reports [1][2][3] that the page allocator blows up when an allocation from a possible node is requested. The underlying reason is that NODE_DATA for the specific node is not allocated.
NUMA specific initialization is arch specific and it can vary a lot. E.g. x86 tries to initialize all nodes that have some cpu affinity (see init_cpu_to_node) but this can be insufficient because the node might be cpuless for example.
One way to address this problem would be to check for !node_online nodes when trying to get a zonelist and silently fall back to another node. That is unfortunately adding a branch into allocator hot path and it doesn't handle any other potential NODE_DATA users.
This patch takes a different approach (following a lead of [3]) and it preallocates pgdat for all possible nodes in arch-independent code, free_area_init. All uninitialized nodes are treated as memoryless nodes. The node_state of the node is not changed because that would lead to other side effects, e.g. a sysfs representation of such a node, and from past discussions [4] it is known that some tools might have problems digesting that.
Newly allocated pgdat only gets a minimal initialization and the rest of the work is expected to be done by the memory hotplug - hotadd_new_pgdat (renamed to hotadd_init_pgdat).
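The two-stage initialization described above can be sketched in userspace C. This is an illustrative model only, loosely following the `pgdat->per_cpu_nodestats == &boot_nodestats` check in the diff below; boot_stats_model, pgdat_model, minimal_init() and hotadd_complete() are hypothetical names, not kernel functions.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

struct nodestat_model {
	long counters[4];
};

/* Shared placeholder used by every node that is only minimally set up. */
static struct nodestat_model boot_stats_model;

struct pgdat_model {
	int nr_zones;
	struct nodestat_model *stats;
};

/* Boot-time step: cheap setup, no private allocation yet. */
static void minimal_init(struct pgdat_model *pgdat)
{
	pgdat->nr_zones = 0;
	pgdat->stats = &boot_stats_model;
}

/*
 * Hot-add step: a node still pointing at the placeholder gets its own
 * stats; a node that was fully set up before only has its state reset.
 * Returns false if the allocation fails.
 */
static bool hotadd_complete(struct pgdat_model *pgdat)
{
	if (pgdat->stats == &boot_stats_model) {
		struct nodestat_model *stats = calloc(1, sizeof(*stats));

		if (!stats)
			return false;
		pgdat->stats = stats;
	} else {
		pgdat->nr_zones = 0;
	}
	return true;
}
```

Comparing against the address of the shared placeholder is what lets the hot-add path tell a never-used node from a node that was onlined and offlined before, without an extra flag in the descriptor.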
generic_alloc_nodedata is changed to use the memblock allocator because neither page nor slab allocators are available at the stage when all pgdats are allocated. Hotplug doesn't allocate pgdat anymore so we can use the early boot allocator. The only arch specific implementation is ia64 and that is changed to use the early allocator as well.
[1] http://lkml.kernel.org/r/20211101201312.11589-1-amakhalov@vmware.com
[2] http://lkml.kernel.org/r/20211207224013.880775-1-npache@redhat.com
[3] http://lkml.kernel.org/r/20190114082416.30939-1-mhocko@kernel.org
[4] http://lkml.kernel.org/r/20200428093836.27190-1-srikar@linux.vnet.ibm.com
[akpm@linux-foundation.org: replace comment, per Mike]
Link: https://lkml.kernel.org/r/Yfe7RBeLCijnWBON@dhcp22.suse.cz
Reported-by: Alexey Makhalov <amakhalov@vmware.com>
Tested-by: Alexey Makhalov <amakhalov@vmware.com>
Reported-by: Nico Pache <npache@redhat.com>
Acked-by: Rafael Aquini <raquini@redhat.com>
Tested-by: Rafael Aquini <raquini@redhat.com>
Acked-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Oscar Salvador <osalvador@suse.de>
Acked-by: Mike Rapoport <rppt@linux.ibm.com>
Signed-off-by: Michal Hocko <mhocko@suse.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Dennis Zhou <dennis@kernel.org>
Cc: Eric Dumazet <eric.dumazet@gmail.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Wei Yang <richard.weiyang@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

Conflicts:
	mm/internal.h
	mm/memory_hotplug.c
	mm/page_alloc.c

Signed-off-by: Ma Wupeng <mawupeng1@huawei.com>
---
 arch/ia64/mm/discontig.c       |  4 +--
 include/linux/memory_hotplug.h |  4 +--
 mm/internal.h                  |  2 ++
 mm/memory_hotplug.c            | 24 ++++++++----------
 mm/page_alloc.c                | 45 +++++++++++++++++++++++++++++-----
 5 files changed, 55 insertions(+), 24 deletions(-)
diff --git a/arch/ia64/mm/discontig.c b/arch/ia64/mm/discontig.c
index 2711e0861d8f..32c16a68815d 100644
--- a/arch/ia64/mm/discontig.c
+++ b/arch/ia64/mm/discontig.c
@@ -631,11 +631,11 @@ void __init paging_init(void)
 	zero_page_memmap_ptr = virt_to_page(ia64_imva(empty_zero_page));
 }

-pg_data_t *arch_alloc_nodedata(int nid)
+pg_data_t * __init arch_alloc_nodedata(int nid)
 {
 	unsigned long size = compute_pernodesize(nid);
-	return kzalloc(size, GFP_KERNEL);
+	return __va(memblock_alloc(size, SMP_CACHE_BYTES));
 }
 void arch_free_nodedata(pg_data_t *pgdat)
diff --git a/include/linux/memory_hotplug.h b/include/linux/memory_hotplug.h
index 403ba6dc3c2e..4ad765a6e734 100644
--- a/include/linux/memory_hotplug.h
+++ b/include/linux/memory_hotplug.h
@@ -32,7 +32,7 @@ extern void arch_refresh_nodedata(int nid, pg_data_t *pgdat);

 #else /* CONFIG_HAVE_ARCH_NODEDATA_EXTENSION */

-#define arch_alloc_nodedata(nid)	generic_alloc_nodedata(nid)
+#define hotadd_add_pgdatarch_alloc_nodedata(nid)	generic_alloc_nodedata(nid)
 #define arch_free_nodedata(pgdat)	generic_free_nodedata(pgdat)

 #ifdef CONFIG_NUMA
@@ -43,7 +43,7 @@ extern void arch_refresh_nodedata(int nid, pg_data_t *pgdat);
  */
 #define generic_alloc_nodedata(nid)				\
 ({								\
-	kzalloc(sizeof(pg_data_t), GFP_KERNEL);			\
+	__va(memblock_alloc(sizeof(*pgdat), SMP_CACHE_BYTES));	\
 })
 /*
  * This definition is just for error path in node hotadd.
diff --git a/mm/internal.h b/mm/internal.h
index deffd247b010..72f77379b58f 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -614,4 +614,6 @@ extern int is_pagecache_reading_kernel_recovery_enable(void);
 extern int is_get_user_kernel_recovery_enable(void);
 #endif
+DECLARE_PER_CPU(struct per_cpu_nodestat, boot_nodestats);
+
 #endif	/* __MM_INTERNAL_H */
diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index 3df9285b73b1..3b961b0fffcc 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -920,18 +920,20 @@ static void reset_node_present_pages(pg_data_t *pgdat)
 }

 /* we are OK calling __meminit stuff here - we have CONFIG_MEMORY_HOTPLUG */
-static pg_data_t __ref *hotadd_new_pgdat(int nid, u64 start)
+static pg_data_t __ref *hotadd_init_pgdat(int nid, u64 start)
 {
 	struct pglist_data *pgdat;
 	unsigned long start_pfn = PFN_DOWN(start);

 	pgdat = NODE_DATA(nid);
-	if (!pgdat) {
-		pgdat = arch_alloc_nodedata(nid);
-		if (!pgdat)
-			return NULL;

-		arch_refresh_nodedata(nid, pgdat);
+	/*
+	 * NODE_DATA is preallocated (free_area_init) but its internal
+	 * state is not allocated completely. Add missing pieces.
+	 * Completely offline nodes stay around and they just need
+	 * reintialization.
+	 */
+	if (pgdat->per_cpu_nodestats == &boot_nodestats) {
 	} else {
 		/*
 		 * Reset the nr_zones, order and classzone_idx before reuse.
@@ -943,10 +945,7 @@ static pg_data_t __ref *hotadd_new_pgdat(int nid, u64 start)
 		pgdat->kswapd_classzone_idx = 0;
 	}

-	/* we can use NODE_DATA(nid) from here */
-
-	pgdat->node_id = nid;
-	pgdat->node_start_pfn = start_pfn;
+	pgdat->node_start_pfn = 0;

 	/* init node's zones as empty zones, we don't have any present pages.*/
 	free_area_init_core_hotplug(nid);
@@ -1000,7 +999,7 @@ static int __try_online_node(int nid, u64 start, bool set_node_online)
 	if (node_online(nid))
 		return 0;

-	pgdat = hotadd_new_pgdat(nid, start);
+	pgdat = hotadd_init_pgdat(nid, start);
 	if (!pgdat) {
 		pr_err("Cannot online node %d due to NULL pgdat\n", nid);
 		ret = -ENOMEM;
@@ -1123,9 +1122,6 @@ int __ref add_memory_resource(int nid, struct resource *res)

 	return ret;
 error:
-	/* rollback pgdat allocation and others */
-	if (new_node)
-		rollback_node_hotadd(nid);
 	memblock_remove(start, size);
 	mem_hotplug_done();
 	return ret;
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index fc8be4b00125..1d946043ee63 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -5819,7 +5819,7 @@ static void build_zonelists(pg_data_t *pgdat)
  */
 static void setup_pageset(struct per_cpu_pageset *p, unsigned long batch);
 static DEFINE_PER_CPU(struct per_cpu_pageset, boot_pageset);
-static DEFINE_PER_CPU(struct per_cpu_nodestat, boot_nodestats);
+DEFINE_PER_CPU(struct per_cpu_nodestat, boot_nodestats);
 static void __build_all_zonelists(void *data)
 {
@@ -5854,7 +5854,11 @@ static void __build_all_zonelists(void *data)
 	if (self && !node_online(self->node_id)) {
 		build_zonelists(self);
 	} else {
-		for_each_online_node(nid) {
+		/*
+		 * All possible nodes have pgdat preallocated
+		 * in free_area_init
+		 */
+		for_each_node(nid) {
 			pg_data_t *pgdat = NODE_DATA(nid);

 			build_zonelists(pgdat);
@@ -7441,10 +7445,39 @@ void __init free_area_init_nodes(unsigned long *max_zone_pfn)
 	mminit_verify_pageflags_layout();
 	setup_nr_node_ids();
 	zero_resv_unavail();
-	for_each_online_node(nid) {
-		pg_data_t *pgdat = NODE_DATA(nid);
-		free_area_init_node(nid, NULL,
-				find_min_pfn_for_node(nid), NULL);
+	for_each_node(nid) {
+		pg_data_t *pgdat;
+
+		if (!node_online(nid)) {
+			pr_info("Initializing node %d as memoryless\n", nid);
+
+			/* Allocator not initialized yet */
+			pgdat = arch_alloc_nodedata(nid);
+			if (!pgdat) {
+				pr_err("Cannot allocate %zuB for node %d.\n",
+						sizeof(*pgdat), nid);
+				continue;
+			}
+			memset(pgdat, 0, sizeof(*pgdat));
+			arch_refresh_nodedata(nid, pgdat);
+			free_area_init_node(nid, NULL, 0, NULL);
+
+			/*
+			 * We do not want to confuse userspace by sysfs
+			 * files/directories for node without any memory
+			 * attached to it, so this node is not marked as
+			 * N_MEMORY and not marked online so that no sysfs
+			 * hierarchy will be created via register_one_node for
+			 * it. The pgdat will get fully initialized by
+			 * hotadd_init_pgdat() when memory is hotplugged into
+			 * this node.
+			 */
+			continue;
+		}
+
+		pgdat = NODE_DATA(nid);
+		free_area_init_node(nid, NULL, find_min_pfn_for_node(nid),
+				    NULL);

 		/* Any memory on that node */
 		if (pgdat->node_present_pages) {
From: Oscar Salvador <osalvador@suse.de>

mainline inclusion
from mainline-v5.18-rc1
commit 1ca75fa7f19d694c58af681fa023295072b03120
category: bugfix
bugzilla: 189331
CVE: NA
--------------------------------
On x86, prior to ("mm: handle uninitialized numa nodes gracefully"), NUMA nodes could be allocated at three different places:

 - numa_register_memblks
 - init_cpu_to_node
 - init_gi_nodes
All these calls happen at setup_arch, and have the following order:
setup_arch
  ...
  x86_numa_init
   numa_init
    numa_register_memblks
  ...
  init_cpu_to_node
   init_memory_less_node
    alloc_node_data
    free_area_init_memoryless_node
  init_gi_nodes
   init_memory_less_node
    alloc_node_data
    free_area_init_memoryless_node

numa_register_memblks() is only interested in those nodes which have memory, so it skips over any memoryless node it finds. Later on, when we have read ACPI's SRAT table, we call init_cpu_to_node() and init_gi_nodes(), which initialize any memoryless node that has either CPU or Initiator affinity, meaning we allocate a pg_data_t struct for it and mark it as ONLINE.
So far so good, but the thing is that after ("mm: handle uninitialized numa nodes gracefully"), we allocate all possible NUMA nodes in free_area_init(), meaning we have a picture like the following:
setup_arch
  x86_numa_init
   numa_init
    numa_register_memblks  <-- allocate non-memoryless node
  x86_init.paging.pagetable_init
   ...
    free_area_init
     free_area_init_memoryless <-- allocate memoryless node
  init_cpu_to_node
   alloc_node_data             <-- allocate memoryless node with CPU
   free_area_init_memoryless_node
  init_gi_nodes
   alloc_node_data             <-- allocate memoryless node with Initiator
   free_area_init_memoryless_node
free_area_init() already allocates all possible NUMA nodes, but init_cpu_to_node() and init_gi_nodes() are clueless about that, so they go ahead and allocate a new pg_data_t struct without checking anything, meaning we end up allocating twice.
It should be made clear that this only happens in the case where a memoryless NUMA node happens to have a CPU/Initiator affinity.
So get rid of init_memory_less_node() and just set the node online.
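The double allocation and the fix can be modeled in a few lines of userspace C. This is a sketch under stated assumptions: the `_model`, `_old` and `_new` names are hypothetical, nr_allocs is an illustrative counter, and the real code paths do considerably more work.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

#define NR_POSSIBLE 2	/* hypothetical node count */

struct pgdat_model {
	int node_id;
};

static struct pgdat_model *node_data_model[NR_POSSIBLE];
static bool node_online_model[NR_POSSIBLE];
static int nr_allocs;	/* every descriptor ever allocated */

/* Allocate a descriptor and publish it, overwriting any previous pointer. */
static void alloc_node_data_model(int nid)
{
	struct pgdat_model *pgdat = calloc(1, sizeof(*pgdat));

	assert(pgdat);
	pgdat->node_id = nid;
	node_data_model[nid] = pgdat;	/* an older descriptor becomes unreachable */
	nr_allocs++;
}

/* Models free_area_init(): preallocate every node that is not online. */
static void prealloc_offline_nodes(void)
{
	for (int nid = 0; nid < NR_POSSIBLE; nid++)
		if (!node_online_model[nid])
			alloc_node_data_model(nid);
}

/* Before the fix: a memoryless node with CPU affinity is allocated again. */
static void cpu_to_node_old(int nid)
{
	if (!node_online_model[nid]) {
		alloc_node_data_model(nid);
		node_online_model[nid] = true;
	}
}

/* After the fix: the existing descriptor is kept, the node is only set online. */
static void cpu_to_node_new(int nid)
{
	if (!node_online_model[nid])
		node_online_model[nid] = true;
}
```

In this model cpu_to_node_old() bumps nr_allocs and replaces the published pointer, so the preallocated descriptor is leaked; cpu_to_node_new() leaves both untouched.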
Note that setting the node online is needed, otherwise we choke down the chain when bringup_nonboot_cpus() ends up calling __try_online_node()->register_one_node()->... and we blow up in bus_add_device(), as can be seen here:

  BUG: kernel NULL pointer dereference, address: 0000000000000060
  #PF: supervisor read access in kernel mode
  #PF: error_code(0x0000) - not-present page
  PGD 0 P4D 0
  Oops: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC PTI
  CPU: 0 PID: 1 Comm: swapper/0 Not tainted 5.17.0-rc4-1-default+ #45
  Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.0.0-prebuilt.qemu-project.org 04/4
  RIP: 0010:bus_add_device+0x5a/0x140
  Code: 8b 74 24 20 48 89 df e8 84 96 ff ff 85 c0 89 c5 75 38 48 8b 53 50 48 85 d2 0f 84 bb 00 004
  RSP: 0000:ffffc9000022bd10 EFLAGS: 00010246
  RAX: 0000000000000000 RBX: ffff888100987400 RCX: ffff8881003e4e19
  RDX: ffff8881009a5e00 RSI: ffff888100987400 RDI: ffff888100987400
  RBP: 0000000000000000 R08: ffff8881003e4e18 R09: ffff8881003e4c98
  R10: 0000000000000000 R11: ffff888100402bc0 R12: ffffffff822ceba0
  R13: 0000000000000000 R14: ffff888100987400 R15: 0000000000000000
  FS:  0000000000000000(0000) GS:ffff88853fc00000(0000) knlGS:0000000000000000
  CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
  CR2: 0000000000000060 CR3: 000000000200a001 CR4: 00000000001706b0
  Call Trace:
   device_add+0x4c0/0x910
   __register_one_node+0x97/0x2d0
   __try_online_node+0x85/0xc0
   try_online_node+0x25/0x40
   cpu_up+0x4f/0x100
   bringup_nonboot_cpus+0x4f/0x60
   smp_init+0x26/0x79
   kernel_init_freeable+0x130/0x2f1
   kernel_init+0x17/0x150
   ret_from_fork+0x22/0x30

The reason is simple: by the time bringup_nonboot_cpus() gets called, we have not registered the node_subsys bus yet, so we crash when bus_add_device() tries to dereference bus()->p.
The following shows the order of the calls:
kernel_init_freeable
 smp_init
  bringup_nonboot_cpus
   ...
     bus_add_device()      <- we did not register node_subsys yet
 do_basic_setup
  do_initcalls
   postcore_initcall(register_node_type);
    register_node_type
     subsys_system_register
      subsys_register
       bus_register         <- register node_subsys bus

Why does setting the node online save us then? Well, simply because __try_online_node() backs off when the node is online, meaning we do not end up calling register_one_node() in the first place.
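The ordering problem can be shown with a small userspace model in C. It is a sketch only: bus_model, register_bus_model(), register_node_model() and try_online_node_model() are hypothetical names, and where the kernel would dereference a NULL bus->p and oops, the model returns -1 instead.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define NR_POSSIBLE 2	/* hypothetical node count */

struct bus_model {
	void *p;	/* stays NULL until the bus is registered */
};

static struct bus_model node_bus_model;
static int bus_private_model;
static bool node_online_model[NR_POSSIBLE];

/* Models bus_register() running later, from an initcall. */
static void register_bus_model(void)
{
	node_bus_model.p = &bus_private_model;
}

/* Models bus_add_device(): fails if the bus is not registered yet. */
static int register_node_model(int nid)
{
	(void)nid;
	return node_bus_model.p ? 0 : -1;
}

/* Models __try_online_node(): an already online node is left alone. */
static int try_online_node_model(int nid)
{
	if (node_online_model[nid])
		return 0;
	if (register_node_model(nid) < 0)
		return -1;
	node_online_model[nid] = true;
	return 0;
}
```

Before register_bus_model() runs, try_online_node_model() fails for an offline node but returns 0 for a node that was already marked online, which is the behaviour the fix relies on.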
This is subtle, broken and deserves a deep analysis and thought about how to put this into shape, but for now let us have this easy fix for the leaking memory issue.
[osalvador@suse.de: add comments]
  Link: https://lkml.kernel.org/r/20220221142649.3457-1-osalvador@suse.de

Link: https://lkml.kernel.org/r/20220218224302.5282-2-osalvador@suse.de
Fixes: da4490c958ad ("mm: handle uninitialized numa nodes gracefully")
Signed-off-by: Oscar Salvador <osalvador@suse.de>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Rafael Aquini <raquini@redhat.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Wei Yang <richard.weiyang@gmail.com>
Cc: Dennis Zhou <dennis@kernel.org>
Cc: Alexey Makhalov <amakhalov@vmware.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

Conflicts:
	arch/x86/mm/numa.c
	include/linux/mm.h
	mm/page_alloc.c

Signed-off-by: Ma Wupeng <mawupeng1@huawei.com>
---
 arch/x86/mm/numa.c | 26 ++++++++++----------------
 1 file changed, 10 insertions(+), 16 deletions(-)
diff --git a/arch/x86/mm/numa.c b/arch/x86/mm/numa.c
index fa150855647c..2284f279a1af 100644
--- a/arch/x86/mm/numa.c
+++ b/arch/x86/mm/numa.c
@@ -722,21 +722,6 @@ void __init x86_numa_init(void)
 	numa_init(dummy_numa_init);
 }

-static void __init init_memory_less_node(int nid)
-{
-	unsigned long zones_size[MAX_NR_ZONES] = {0};
-	unsigned long zholes_size[MAX_NR_ZONES] = {0};
-
-	/* Allocate and initialize node data. Memory-less node is now online.*/
-	alloc_node_data(nid);
-	free_area_init_node(nid, zones_size, 0, zholes_size);
-
-	/*
-	 * All zonelists will be built later in start_kernel() after per cpu
-	 * areas are initialized.
-	 */
-}
-
 /*
  * Setup early cpu_to_node.
  *
@@ -764,8 +749,17 @@ void __init init_cpu_to_node(void)
 		if (node == NUMA_NO_NODE)
 			continue;

+		/*
+		 * Exclude this node from
+		 * bringup_nonboot_cpus
+		 *  cpu_up
+		 *   __try_online_node
+		 *    register_one_node
+		 * because node_subsys is not initialized yet.
+		 * TODO remove dependency on node_online
+		 */
 		if (!node_online(node))
-			init_memory_less_node(node);
+			node_set_online(node);

 		numa_set_node(cpu, node);
 	}
FeedBack: The patch(es) which you have sent to the kernel@openeuler.org mailing list have been converted to a pull request successfully!
Pull request link: https://gitee.com/openeuler/kernel/pulls/2778
Mailing list address: https://mailweb.openeuler.org/hyperkitty/list/kernel@openeuler.org/message/X...