[PATCH OLK-6.6 v2] Backport some memory policy feature and bugfix in mm from mainline

List overview All Threads
Download

newer

older

[openeuler:OLK-5.10] BUILD SUCCESS...

[PATCH OLK-6.6 v2 0/2] mm: prepare...

Ze Zuo

11 May 2024 11 May '24

9:49 p.m.

Backport feature(mm/mempolicy: weighted interleave mempolicy and sysfs extension) and some mempolicy bugfix and cleanup.

v1 -> v2

- fix conlict in patch [1].

Gregory Price (3): mm/mempolicy: refactor a read-once mechanism into a function for re-use mm/mempolicy: introduce MPOL_WEIGHTED_INTERLEAVE for weighted interleaving mm/mempolicy: protect task interleave functions with tsk->mems_allowed_seq

Hugh Dickins (10): mempolicy: fix migrate_pages(2) syscall return nr_failed mempolicy trivia: delete those ancient pr_debug()s mempolicy trivia: slightly more consistent naming mempolicy trivia: use pgoff_t in shared mempolicy tree mempolicy: mpol_shared_policy_init() without pseudo-vma mempolicy: remove confusing MPOL_MF_LAZY dead code kernfs: drop shared NUMA mempolicy hooks mempolicy: alloc_pages_mpol() for NUMA policy without vma mempolicy: mmap_lock is not needed while migrating folios mempolicy: migration attempt to match interleave nodes

Rakie Kim (1): mm/mempolicy: implement the sysfs-based weighted_interleave interface

-- 2.33.0

Show replies by date

Ze Zuo

11 May 11 May

9:49 p.m.

New subject: [PATCH OLK-6.6 v2] mempolicy: fix migrate_pages(2) syscall return nr_failed

From: Hugh Dickins hughd@google.com

mainline inclusion from mainline-v6.7-rc1 commit 1cb5d11a370f661c5d0d888bb0cfc2cdc5791382 category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I9OHHN CVE: NA

Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i...

--------------------------------

"man 2 migrate_pages" says "On success migrate_pages() returns the number of pages that could not be moved". Although 5.3 and 5.4 commits fixed mbind(MPOL_MF_STRICT|MPOL_MF_MOVE*) to fail with EIO when not all pages could be moved (because some could not be isolated for migration), migrate_pages(2) was left still reporting only those pages failing at the migration stage, forgetting those failing at the earlier isolation stage.

Fix that by accumulating a long nr_failed count in struct queue_pages, returned by queue_pages_range() when it's not returning an error, for adding on to the nr_failed count from migrate_pages() in mm/migrate.c. A count of pages? It's more a count of folios, but changing it to pages would entail more work (also in mm/migrate.c): does not seem justified.

queue_pages_range() itself should only return -EIO in the "strictly unmovable" case (STRICT without any MOVEs): in that case it's best to break out as soon as nr_failed gets set; but otherwise it should continue to isolate pages for MOVing even when nr_failed - as the mbind(2) manpage promises.

There's a case when nr_failed should be incremented when it was missed: queue_folios_pte_range() and queue_folios_hugetlb() count the transient migration entries, like queue_folios_pmd() already did. And there's a case when nr_failed should not be incremented when it would have been: in meeting later PTEs of the same large folio, which can only be isolated once: fixed by recording the current large folio in struct queue_pages.

Clean up the affected functions, fixing or updating many comments. Bool migrate_folio_add(), without -EIO: true if adding, or if skipping shared (but its arguable folio_estimated_sharers() heuristic left unchanged). Use MPOL_MF_WRLOCK flag to queue_pages_range(), instead of bool lock_vma. Use explicit STRICT|MOVE* flags where queue_pages_test_walk() checks for skipping, instead of hiding them behind MPOL_MF_VALID.

Link: https://lkml.kernel.org/r/9a6b0b9-3bb-dbef-8adf-efab4397b8d@google.com Signed-off-by: Hugh Dickins hughd@google.com Reviewed-by: Matthew Wilcox (Oracle) willy@infradead.org Reviewed-by: "Huang, Ying" ying.huang@intel.com Cc: Andi Kleen ak@linux.intel.com Cc: Christoph Lameter cl@linux.com Cc: David Hildenbrand david@redhat.com Cc: Greg Kroah-Hartman gregkh@linuxfoundation.org Cc: Kefeng Wang wangkefeng.wang@huawei.com Cc: Mel Gorman mgorman@techsingularity.net Cc: Michal Hocko mhocko@suse.com Cc: Mike Kravetz mike.kravetz@oracle.com Cc: Nhat Pham nphamcs@gmail.com Cc: Sidhartha Kumar sidhartha.kumar@oracle.com Cc: Suren Baghdasaryan surenb@google.com Cc: Tejun heo tj@kernel.org Cc: Vishal Moola (Oracle) vishal.moola@gmail.com Cc: Yang Shi shy828301@gmail.com Cc: Yosry Ahmed yosryahmed@google.com Signed-off-by: Andrew Morton akpm@linux-foundation.org

Conflicts: mm/mempolicy.c [ context conlict with commit b8eba2d7e4a3 ] Signed-off-by: Ze Zuo zuoze1@huawei.com --- mm/mempolicy.c | 339 +++++++++++++++++++++++-------------------------- 1 file changed, 160 insertions(+), 179 deletions(-)

diff --git a/mm/mempolicy.c b/mm/mempolicy.c index a80f99751904..3f4fd547f148 100644 --- a/mm/mempolicy.c +++ b/mm/mempolicy.c @@ -112,7 +112,8 @@

/* Internal flags */ #define MPOL_MF_DISCONTIG_OK (MPOL_MF_INTERNAL << 0) /* Skip checks for continuous vmas */ -#define MPOL_MF_INVERT (MPOL_MF_INTERNAL << 1) /* Invert check for nodemask */ +#define MPOL_MF_INVERT (MPOL_MF_INTERNAL << 1) /* Invert check for nodemask */ +#define MPOL_MF_WRLOCK (MPOL_MF_INTERNAL << 2) /* Write-lock walked vmas */

static struct kmem_cache *policy_cache; static struct kmem_cache *sn_cache; @@ -421,9 +422,19 @@ static const struct mempolicy_operations mpol_ops[MPOL_MAX] = { }, };

-static int migrate_folio_add(struct folio *folio, struct list_head *foliolist, +static bool migrate_folio_add(struct folio *folio, struct list_head *foliolist, unsigned long flags);

+static bool strictly_unmovable(unsigned long flags) +{ + /* + * STRICT without MOVE flags lets do_mbind() fail immediately with -EIO + * if any misplaced page is found. + */ + return (flags & (MPOL_MF_STRICT | MPOL_MF_MOVE | MPOL_MF_MOVE_ALL)) == + MPOL_MF_STRICT; +} + struct queue_pages { struct list_head *pagelist; unsigned long flags; @@ -431,7 +442,8 @@ struct queue_pages { unsigned long start; unsigned long end; struct vm_area_struct *first; - bool has_unmovable; + struct folio *large; /* note last large folio encountered */ + long nr_failed; /* could not be isolated at this time */ };

/* @@ -449,61 +461,37 @@ static inline bool queue_folio_required(struct folio *folio, return node_isset(nid, *qp->nmask) == !(flags & MPOL_MF_INVERT); }

-/* - * queue_folios_pmd() has three possible return values: - * 0 - folios are placed on the right node or queued successfully, or - * special page is met, i.e. zero page, or unmovable page is found - * but continue walking (indicated by queue_pages.has_unmovable). - * -EIO - is migration entry or only MPOL_MF_STRICT was specified and an - * existing folio was already on a node that does not follow the - * policy. - */ -static int queue_folios_pmd(pmd_t *pmd, spinlock_t *ptl, unsigned long addr, - unsigned long end, struct mm_walk *walk) - __releases(ptl) +static void queue_folios_pmd(pmd_t *pmd, struct mm_walk *walk) { - int ret = 0; struct folio *folio; struct queue_pages *qp = walk->private; - unsigned long flags;

if (unlikely(is_pmd_migration_entry(*pmd))) { - ret = -EIO; - goto unlock; + qp->nr_failed++; + return; } folio = pfn_folio(pmd_pfn(*pmd)); if (is_huge_zero_page(&folio->page)) { walk->action = ACTION_CONTINUE; - goto unlock; + return; } if (!queue_folio_required(folio, qp)) - goto unlock; - - flags = qp->flags; - /* go to folio migration */ - if (flags & (MPOL_MF_MOVE | MPOL_MF_MOVE_ALL)) { - if (!vma_migratable(walk->vma) || - migrate_folio_add(folio, qp->pagelist, flags)) { - qp->has_unmovable = true; - goto unlock; - } - } else - ret = -EIO; -unlock: - spin_unlock(ptl); - return ret; + return; + if (!(qp->flags & (MPOL_MF_MOVE | MPOL_MF_MOVE_ALL)) || + !vma_migratable(walk->vma) || + !migrate_folio_add(folio, qp->pagelist, qp->flags)) + qp->nr_failed++; }

/* - * Scan through pages checking if pages follow certain conditions, - * and move them to the pagelist if they do. + * Scan through folios, checking if they satisfy the required conditions, + * moving them from LRU to local pagelist for migration if they do (or not). * - * queue_folios_pte_range() has three possible return values: - * 0 - folios are placed on the right node or queued successfully, or - * special page is met, i.e. zero page, or unmovable page is found - * but continue walking (indicated by queue_pages.has_unmovable). - * -EIO - only MPOL_MF_STRICT was specified and an existing folio was already - * on a node that does not follow the policy. + * queue_folios_pte_range() has two possible return values: + * 0 - continue walking to scan for more, even if an existing folio on the + * wrong node could not be isolated and queued for migration. + * -EIO - only MPOL_MF_STRICT was specified, without MPOL_MF_MOVE or ..._ALL, + * and an existing folio was on a node that does not follow the policy. */ static int queue_folios_pte_range(pmd_t *pmd, unsigned long addr, unsigned long end, struct mm_walk *walk) @@ -517,8 +505,11 @@ static int queue_folios_pte_range(pmd_t *pmd, unsigned long addr, spinlock_t *ptl;

ptl = pmd_trans_huge_lock(pmd, vma); - if (ptl) - return queue_folios_pmd(pmd, ptl, addr, end, walk); + if (ptl) { + queue_folios_pmd(pmd, walk); + spin_unlock(ptl); + goto out; + }

mapped_pte = pte = pte_offset_map_lock(walk->mm, pmd, addr, &ptl); if (!pte) { @@ -527,8 +518,13 @@ static int queue_folios_pte_range(pmd_t *pmd, unsigned long addr, } for (; addr != end; pte++, addr += PAGE_SIZE) { ptent = ptep_get(pte); - if (!pte_present(ptent)) + if (pte_none(ptent)) + continue; + if (!pte_present(ptent)) { + if (is_migration_entry(pte_to_swp_entry(ptent))) + qp->nr_failed++; continue; + } folio = vm_normal_folio(vma, addr, ptent); if (!folio || folio_is_zone_device(folio)) continue; @@ -540,94 +536,86 @@ static int queue_folios_pte_range(pmd_t *pmd, unsigned long addr, continue; if (!queue_folio_required(folio, qp)) continue; - if (flags & (MPOL_MF_MOVE | MPOL_MF_MOVE_ALL)) { - /* - * MPOL_MF_STRICT must be specified if we get here. - * Continue walking vmas due to MPOL_MF_MOVE* flags. - */ - if (!vma_migratable(vma)) - qp->has_unmovable = true; - + if (folio_test_large(folio)) { /* - * Do not abort immediately since there may be - * temporary off LRU pages in the range. Still - * need migrate other LRU pages. + * A large folio can only be isolated from LRU once, + * but may be mapped by many PTEs (and Copy-On-Write may + * intersperse PTEs of other, order 0, folios). This is + * a common case, so don't mistake it for failure (but + * there can be other cases of multi-mapped pages which + * this quick check does not help to filter out - and a + * search of the pagelist might grow to be prohibitive). + * + * migrate_pages(&pagelist) returns nr_failed folios, so + * check "large" now so that queue_pages_range() returns + * a comparable nr_failed folios. This does imply that + * if folio could not be isolated for some racy reason + * at its first PTE, later PTEs will not give it another + * chance of isolation; but keeps the accounting simple. */ - if (migrate_folio_add(folio, qp->pagelist, flags)) - qp->has_unmovable = true; - } else - break; + if (folio == qp->large) + continue; + qp->large = folio; + } + if (!(flags & (MPOL_MF_MOVE | MPOL_MF_MOVE_ALL)) || + !vma_migratable(vma) || + !migrate_folio_add(folio, qp->pagelist, flags)) { + qp->nr_failed++; + if (strictly_unmovable(flags)) + break; + } } pte_unmap_unlock(mapped_pte, ptl); cond_resched(); - - return addr != end ? -EIO : 0; +out: + if (qp->nr_failed && strictly_unmovable(flags)) + return -EIO; + return 0; }

static int queue_folios_hugetlb(pte_t *pte, unsigned long hmask, unsigned long addr, unsigned long end, struct mm_walk *walk) { - int ret = 0; #ifdef CONFIG_HUGETLB_PAGE struct queue_pages *qp = walk->private; - unsigned long flags = (qp->flags & MPOL_MF_VALID); + unsigned long flags = qp->flags; struct folio *folio; spinlock_t *ptl; pte_t entry;

ptl = huge_pte_lock(hstate_vma(walk->vma), walk->mm, pte); entry = huge_ptep_get(pte); - if (!pte_present(entry)) + if (!pte_present(entry)) { + if (unlikely(is_hugetlb_entry_migration(entry))) + qp->nr_failed++; goto unlock; + } folio = pfn_folio(pte_pfn(entry)); if (!queue_folio_required(folio, qp)) goto unlock; - - if (flags == MPOL_MF_STRICT) { - /* - * STRICT alone means only detecting misplaced folio and no - * need to further check other vma. - */ - ret = -EIO; + if (!(flags & (MPOL_MF_MOVE | MPOL_MF_MOVE_ALL)) || + !vma_migratable(walk->vma)) { + qp->nr_failed++; goto unlock; } - - if (!vma_migratable(walk->vma)) { - /* - * Must be STRICT with MOVE*, otherwise .test_walk() have - * stopped walking current vma. - * Detecting misplaced folio but allow migrating folios which - * have been queued. - */ - qp->has_unmovable = true; - goto unlock; - } - /* - * With MPOL_MF_MOVE, we try to migrate only unshared folios. If it - * is shared it is likely not worth migrating. + * Unless MPOL_MF_MOVE_ALL, we try to avoid migrating a shared folio. + * Choosing not to migrate a shared folio is not counted as a failure. * * See folio_likely_mapped_shared() on possible imprecision when we * cannot easily detect if a folio is shared. */ - if (flags & (MPOL_MF_MOVE_ALL) || - (flags & MPOL_MF_MOVE && !folio_likely_mapped_shared(folio) && - !hugetlb_pmd_shared(pte))) { - if (!isolate_hugetlb(folio, qp->pagelist) && - (flags & MPOL_MF_STRICT)) - /* - * Failed to isolate folio but allow migrating pages - * which have been queued. - */ - qp->has_unmovable = true; - } + if ((flags & MPOL_MF_MOVE_ALL) || + (!folio_likely_mapped_shared(folio) && !hugetlb_pmd_shared(pte))) + if (!isolate_hugetlb(folio, qp->pagelist)) + qp->nr_failed++; unlock: spin_unlock(ptl); -#else - BUG(); + if (qp->nr_failed && strictly_unmovable(flags)) + return -EIO; #endif - return ret; + return 0; }

#ifdef CONFIG_NUMA_BALANCING @@ -708,8 +696,11 @@ static int queue_pages_test_walk(unsigned long start, unsigned long end, return 1; }

- /* queue pages from current vma */ - if (flags & MPOL_MF_VALID) + /* + * Check page nodes, and queue pages to move, in the current vma. + * But if no moving, and no strict checking, the scan can be skipped. + */ + if (flags & (MPOL_MF_STRICT | MPOL_MF_MOVE | MPOL_MF_MOVE_ALL)) return 0; return 1; } @@ -731,22 +722,21 @@ static const struct mm_walk_ops queue_pages_lock_vma_walk_ops = { /* * Walk through page tables and collect pages to be migrated. * - * If pages found in a given range are on a set of nodes (determined by - * @nodes and @flags,) it's isolated and queued to the pagelist which is - * passed via @private. + * If pages found in a given range are not on the required set of @nodes, + * and migration is allowed, they are isolated and queued to @pagelist. * - * queue_pages_range() has three possible return values: - * 1 - there is unmovable page, but MPOL_MF_MOVE* & MPOL_MF_STRICT were - * specified. - * 0 - queue pages successfully or no misplaced page. - * errno - i.e. misplaced pages with MPOL_MF_STRICT specified (-EIO) or - * memory range specified by nodemask and maxnode points outside - * your accessible address space (-EFAULT) + * queue_pages_range() may return: + * 0 - all pages already on the right node, or successfully queued for moving + * (or neither strict checking nor moving requested: only range checking). + * >0 - this number of misplaced folios could not be queued for moving + * (a hugetlbfs page or a transparent huge page being counted as 1). + * -EIO - a misplaced page found, when MPOL_MF_STRICT specified without MOVEs. + * -EFAULT - a hole in the memory range, when MPOL_MF_DISCONTIG_OK unspecified. */ -static int +static long queue_pages_range(struct mm_struct *mm, unsigned long start, unsigned long end, nodemask_t *nodes, unsigned long flags, - struct list_head *pagelist, bool lock_vma) + struct list_head *pagelist) { int err; struct queue_pages qp = { @@ -756,20 +746,17 @@ queue_pages_range(struct mm_struct *mm, unsigned long start, unsigned long end, .start = start, .end = end, .first = NULL, - .has_unmovable = false, }; - const struct mm_walk_ops *ops = lock_vma ? + const struct mm_walk_ops *ops = (flags & MPOL_MF_WRLOCK) ? &queue_pages_lock_vma_walk_ops : &queue_pages_walk_ops;

err = walk_page_range(mm, start, end, ops, &qp);

- if (qp.has_unmovable) - err = 1; if (!qp.first) /* whole range in hole */ err = -EFAULT;

- return err; + return err ? : qp.nr_failed; }

/* @@ -1032,15 +1019,16 @@ static long do_get_mempolicy(int *policy, nodemask_t *nmask, }

#ifdef CONFIG_MIGRATION -static int migrate_folio_add(struct folio *folio, struct list_head *foliolist, +static bool migrate_folio_add(struct folio *folio, struct list_head *foliolist, unsigned long flags) { /* - * We try to migrate only unshared folios. If it is shared it - * is likely not worth migrating. + * Unless MPOL_MF_MOVE_ALL, we try to avoid migrating a shared folio. + * Choosing not to migrate a shared folio is not counted as a failure. * - * See folio_likely_mapped_shared() on possible imprecision when we - * cannot easily detect if a folio is shared. + * To check if the folio is shared, ideally we want to make sure + * every page is mapped to the same process. Doing that is very + * expensive, so check the estimated mapcount of the folio instead. */ if ((flags & MPOL_MF_MOVE_ALL) || !folio_likely_mapped_shared(folio)) { if (folio_isolate_lru(folio)) { @@ -1048,32 +1036,31 @@ static int migrate_folio_add(struct folio *folio, struct list_head *foliolist, node_stat_mod_folio(folio, NR_ISOLATED_ANON + folio_is_file_lru(folio), folio_nr_pages(folio)); - } else if (flags & MPOL_MF_STRICT) { + } else { /* * Non-movable folio may reach here. And, there may be * temporary off LRU folios or non-LRU movable folios. * Treat them as unmovable folios since they can't be - * isolated, so they can't be moved at the moment. It - * should return -EIO for this case too. + * isolated, so they can't be moved at the moment. */ - return -EIO; + return false; } } - - return 0; + return true; }

/* * Migrate pages from one node to a target node. * Returns error or the number of pages not migrated. */ -static int migrate_to_node(struct mm_struct *mm, int source, int dest, - int flags) +static long migrate_to_node(struct mm_struct *mm, int source, int dest, + int flags) { nodemask_t nmask; struct vm_area_struct *vma; LIST_HEAD(pagelist); - int err = 0; + long nr_failed; + long err = 0; struct migration_target_control mtc = { .nid = dest, .gfp_mask = GFP_HIGHUSER_MOVABLE | __GFP_THISNODE, @@ -1082,23 +1069,27 @@ static int migrate_to_node(struct mm_struct *mm, int source, int dest, nodes_clear(nmask); node_set(source, nmask);

+ VM_BUG_ON(!(flags & (MPOL_MF_MOVE | MPOL_MF_MOVE_ALL))); + vma = find_vma(mm, 0); + /* - * This does not "check" the range but isolates all pages that + * This does not migrate the range, but isolates all pages that * need migration. Between passing in the full user address - * space range and MPOL_MF_DISCONTIG_OK, this call can not fail. + * space range and MPOL_MF_DISCONTIG_OK, this call cannot fail, + * but passes back the count of pages which could not be isolated. */ - vma = find_vma(mm, 0); - VM_BUG_ON(!(flags & (MPOL_MF_MOVE | MPOL_MF_MOVE_ALL))); - queue_pages_range(mm, vma->vm_start, mm->task_size, &nmask, - flags | MPOL_MF_DISCONTIG_OK, &pagelist, false); + nr_failed = queue_pages_range(mm, vma->vm_start, mm->task_size, &nmask, + flags | MPOL_MF_DISCONTIG_OK, &pagelist);

if (!list_empty(&pagelist)) { err = migrate_pages(&pagelist, alloc_migration_target, NULL, - (unsigned long)&mtc, MIGRATE_SYNC, MR_SYSCALL, NULL); + (unsigned long)&mtc, MIGRATE_SYNC, MR_SYSCALL, NULL); if (err) putback_movable_pages(&pagelist); }

+ if (err >= 0) + err += nr_failed; return err; }

@@ -1111,8 +1102,8 @@ static int migrate_to_node(struct mm_struct *mm, int source, int dest, int do_migrate_pages(struct mm_struct *mm, const nodemask_t *from, const nodemask_t *to, int flags) { - int busy = 0; - int err = 0; + long nr_failed = 0; + long err = 0; nodemask_t tmp;

lru_cache_disable(); @@ -1194,7 +1185,7 @@ int do_migrate_pages(struct mm_struct *mm, const nodemask_t *from, node_clear(source, tmp); err = migrate_to_node(mm, source, dest, flags); if (err > 0) - busy += err; + nr_failed += err; if (err < 0) break; } @@ -1203,8 +1194,7 @@ int do_migrate_pages(struct mm_struct *mm, const nodemask_t *from, lru_cache_enable(); if (err < 0) return err; - return busy; - + return (nr_failed < INT_MAX) ? nr_failed : INT_MAX; }

/* @@ -1243,10 +1233,10 @@ static struct folio *new_folio(struct folio *src, unsigned long start) } #else

-static int migrate_folio_add(struct folio *folio, struct list_head *foliolist, +static bool migrate_folio_add(struct folio *folio, struct list_head *foliolist, unsigned long flags) { - return -EIO; + return false; }

int do_migrate_pages(struct mm_struct *mm, const nodemask_t *from, @@ -1269,8 +1259,8 @@ long __do_mbind(unsigned long start, unsigned long len, struct vma_iterator vmi; struct mempolicy *new; unsigned long end; - int err; - int ret; + long err; + long nr_failed; LIST_HEAD(pagelist);

if (flags & ~(unsigned long)MPOL_MF_VALID) @@ -1310,10 +1300,8 @@ long __do_mbind(unsigned long start, unsigned long len, start, start + len, mode, mode_flags, nmask ? nodes_addr(*nmask)[0] : NUMA_NO_NODE);

- if (flags & (MPOL_MF_MOVE | MPOL_MF_MOVE_ALL)) { - + if (flags & (MPOL_MF_MOVE | MPOL_MF_MOVE_ALL)) lru_cache_disable(); - } { NODEMASK_SCRATCH(scratch); if (scratch) { @@ -1329,44 +1317,37 @@ long __do_mbind(unsigned long start, unsigned long len, goto mpol_out;

/* - * Lock the VMAs before scanning for pages to migrate, to ensure we don't - * miss a concurrently inserted page. + * Lock the VMAs before scanning for pages to migrate, + * to ensure we don't miss a concurrently inserted page. */ - ret = queue_pages_range(mm, start, end, nmask, - flags | MPOL_MF_INVERT, &pagelist, true); - - if (ret < 0) { - err = ret; - goto up_out; - } + nr_failed = queue_pages_range(mm, start, end, nmask, + flags | MPOL_MF_INVERT | MPOL_MF_WRLOCK, &pagelist);

- vma_iter_init(&vmi, mm, start); - prev = vma_prev(&vmi); - for_each_vma_range(vmi, vma, end) { - err = mbind_range(&vmi, vma, &prev, start, end, new); - if (err) - break; + if (nr_failed < 0) { + err = nr_failed; + } else { + vma_iter_init(&vmi, mm, start); + prev = vma_prev(&vmi); + for_each_vma_range(vmi, vma, end) { + err = mbind_range(&vmi, vma, &prev, start, end, new); + if (err) + break; + } }

if (!err) { - int nr_failed = 0; - if (!list_empty(&pagelist)) { WARN_ON_ONCE(flags & MPOL_MF_LAZY); - nr_failed = migrate_pages(&pagelist, new_folio, NULL, + nr_failed |= migrate_pages(&pagelist, new_folio, NULL, start, MIGRATE_SYNC, MR_MEMPOLICY_MBIND, NULL); - if (nr_failed) - putback_movable_pages(&pagelist); } - - if (((ret > 0) || nr_failed) && (flags & MPOL_MF_STRICT)) + if (nr_failed && (flags & MPOL_MF_STRICT)) err = -EIO; - } else { -up_out: - if (!list_empty(&pagelist)) - putback_movable_pages(&pagelist); }

+ if (!list_empty(&pagelist)) + putback_movable_pages(&pagelist); + mmap_write_unlock(mm); mpol_out: mpol_put(new);

-- 2.33.0

patchwork bot

10:08 p.m.

New subject: [PATCH OLK-6.6 v2] mempolicy: fix migrate_pages(2) syscall return nr_failed

反馈: 您发送到kernel@openeuler.org的补丁/补丁集，转换为PR失败！邮件列表地址：https://mailweb.openeuler.org/hyperkitty/list/kernel@openeuler.org/message/N... 失败原因：应用补丁/补丁集失败，Patch failed at 0001 mempolicy: fix migrate_pages(2) syscall return nr_failed 建议解决方法：请查看失败原因，确认补丁是否可以应用在当前期望分支的最新代码上

FeedBack: The patch(es) which you have sent to kernel@openeuler.org has been converted to PR failed! Mailing list address: https://mailweb.openeuler.org/hyperkitty/list/kernel@openeuler.org/message/N... Failed Reason: apply patch(es) failed, Patch failed at 0001 mempolicy: fix migrate_pages(2) syscall return nr_failed Suggest Solution: please checkout if the failed patch(es) can work on the newest codes in expected branch

Ze Zuo

9:49 p.m.

New subject: [PATCH OLK-6.6 v2] mempolicy trivia: delete those ancient pr_debug()s

From: Hugh Dickins hughd@google.com

mainline inclusion from mainline-v6.7-rc1 commit 7f1ee4e2070883d18a431c761db8bb30a958b654 category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I9OHHN CVE: NA

Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i...

--------------------------------

Delete those ancient pr_debug()s - PDprintk()s in Andi Kleen's original submission of core NUMA API, and useful when debugging shared mempolicy lifetime back then, but not used recently.

Link: https://lkml.kernel.org/r/f25135-ffb2-40d8-9577-720772b333@google.com Signed-off-by: Hugh Dickins hughd@google.com Reviewed-by: Matthew Wilcox (Oracle) willy@infradead.org Cc: Andi Kleen ak@linux.intel.com Cc: Christoph Lameter cl@linux.com Cc: David Hildenbrand david@redhat.com Cc: Greg Kroah-Hartman gregkh@linuxfoundation.org Cc: "Huang, Ying" ying.huang@intel.com Cc: Kefeng Wang wangkefeng.wang@huawei.com Cc: Mel Gorman mgorman@techsingularity.net Cc: Michal Hocko mhocko@suse.com Cc: Mike Kravetz mike.kravetz@oracle.com Cc: Nhat Pham nphamcs@gmail.com Cc: Sidhartha Kumar sidhartha.kumar@oracle.com Cc: Suren Baghdasaryan surenb@google.com Cc: Tejun heo tj@kernel.org Cc: Vishal Moola (Oracle) vishal.moola@gmail.com Cc: Yang Shi shy828301@gmail.com Cc: Yosry Ahmed yosryahmed@google.com Signed-off-by: Andrew Morton akpm@linux-foundation.org Signed-off-by: Ze Zuo zuoze1@huawei.com --- mm/mempolicy.c | 21 --------------------- 1 file changed, 21 deletions(-)

diff --git a/mm/mempolicy.c b/mm/mempolicy.c index 3f4fd547f148..15f1347677c2 100644 --- a/mm/mempolicy.c +++ b/mm/mempolicy.c @@ -269,9 +269,6 @@ static struct mempolicy *mpol_new(unsigned short mode, unsigned short flags, { struct mempolicy *policy;

- pr_debug("setting mode %d flags %d nodes[0] %lx\n", - mode, flags, nodes ? nodes_addr(*nodes)[0] : NUMA_NO_NODE); - if (mode == MPOL_DEFAULT) { if (nodes && !nodes_empty(*nodes)) return ERR_PTR(-EINVAL); @@ -772,11 +769,6 @@ static int vma_replace_policy(struct vm_area_struct *vma,

vma_assert_write_locked(vma);

- pr_debug("vma %lx-%lx/%lx vm_ops %p vm_file %p set_policy %p\n", - vma->vm_start, vma->vm_end, vma->vm_pgoff, - vma->vm_ops, vma->vm_file, - vma->vm_ops ? vma->vm_ops->set_policy : NULL); - new = mpol_dup(pol); if (IS_ERR(new)) return PTR_ERR(new); @@ -1296,10 +1288,6 @@ long __do_mbind(unsigned long start, unsigned long len, if (!new) flags |= MPOL_MF_DISCONTIG_OK;

- pr_debug("mbind %lx-%lx mode:%d flags:%d nodes:%lx\n", - start, start + len, mode, mode_flags, - nmask ? nodes_addr(*nmask)[0] : NUMA_NO_NODE); - if (flags & (MPOL_MF_MOVE | MPOL_MF_MOVE_ALL)) lru_cache_disable(); { @@ -2521,8 +2509,6 @@ static void sp_insert(struct shared_policy *sp, struct sp_node *new) } rb_link_node(&new->nd, parent, p); rb_insert_color(&new->nd, &sp->root); - pr_debug("inserting %lx-%lx: %d\n", new->start, new->end, - new->policy ? new->policy->mode : 0); }

/* Find shared policy intersecting idx */ @@ -2661,7 +2647,6 @@ void mpol_put_task_policy(struct task_struct *task)

static void sp_delete(struct shared_policy *sp, struct sp_node *n) { - pr_debug("deleting %lx-l%lx\n", n->start, n->end); rb_erase(&n->nd, &sp->root); sp_free(n); } @@ -2818,12 +2803,6 @@ int mpol_set_shared_policy(struct shared_policy *info, struct sp_node *new = NULL; unsigned long sz = vma_pages(vma);

- pr_debug("set_shared_policy %lx sz %lu %d %d %lx\n", - vma->vm_pgoff, - sz, npol ? npol->mode : -1, - npol ? npol->flags : -1, - npol ? nodes_addr(npol->nodes)[0] : NUMA_NO_NODE); - if (npol) { new = sp_alloc(vma->vm_pgoff, vma->vm_pgoff + sz, npol); if (!new)

-- 2.33.0

patchwork bot

10:08 p.m.

New subject: [PATCH OLK-6.6 v2] mempolicy trivia: delete those ancient pr_debug()s

反馈: 您发送到kernel@openeuler.org的补丁/补丁集，转换为PR失败！邮件列表地址：https://mailweb.openeuler.org/hyperkitty/list/kernel@openeuler.org/message/O... 失败原因：应用补丁/补丁集失败，Patch failed at 0001 mempolicy trivia: delete those ancient pr_debug()s 建议解决方法：请查看失败原因，确认补丁是否可以应用在当前期望分支的最新代码上

FeedBack: The patch(es) which you have sent to kernel@openeuler.org has been converted to PR failed! Mailing list address: https://mailweb.openeuler.org/hyperkitty/list/kernel@openeuler.org/message/O... Failed Reason: apply patch(es) failed, Patch failed at 0001 mempolicy trivia: delete those ancient pr_debug()s Suggest Solution: please checkout if the failed patch(es) can work on the newest codes in expected branch

Ze Zuo

9:49 p.m.

New subject: [PATCH OLK-6.6 v2] mempolicy trivia: slightly more consistent naming

From: Hugh Dickins hughd@google.com

mainline inclusion from mainline-v6.7-rc1 commit c36f6e6dff4d32ec8b6da8f553933727a57a7a4a category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I9OHHN CVE: NA

Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i...

--------------------------------

Before getting down to work, do a little cleanup, mainly of inconsistent variable naming. I gave up trying to rationalize mpol versus pol versus policy, and node versus nid, but let's avoid p and nd. Remove a few superfluous blank lines, but add one; and here prefer vma->vm_policy to vma_policy(vma) - the latter being appropriate in other sources, which have to allow for !CONFIG_NUMA. That intriguing line about KERNEL_DS? should have gone in v2.6.15, when numa_policy_init() stopped using set_mempolicy(2)'s system call handler.

Link: https://lkml.kernel.org/r/68287974-b6ae-7df-4ba-d19ddd69cbf@google.com Signed-off-by: Hugh Dickins hughd@google.com Reviewed-by: Matthew Wilcox (Oracle) willy@infradead.org Cc: Andi Kleen ak@linux.intel.com Cc: Christoph Lameter cl@linux.com Cc: David Hildenbrand david@redhat.com Cc: Greg Kroah-Hartman gregkh@linuxfoundation.org Cc: "Huang, Ying" ying.huang@intel.com Cc: Kefeng Wang wangkefeng.wang@huawei.com Cc: Mel Gorman mgorman@techsingularity.net Cc: Michal Hocko mhocko@suse.com Cc: Mike Kravetz mike.kravetz@oracle.com Cc: Nhat Pham nphamcs@gmail.com Cc: Sidhartha Kumar sidhartha.kumar@oracle.com Cc: Suren Baghdasaryan surenb@google.com Cc: Tejun heo tj@kernel.org Cc: Vishal Moola (Oracle) vishal.moola@gmail.com Cc: Yang Shi shy828301@gmail.com Cc: Yosry Ahmed yosryahmed@google.com Signed-off-by: Andrew Morton akpm@linux-foundation.org Signed-off-by: Ze Zuo zuoze1@huawei.com --- include/linux/mempolicy.h | 11 +++--- mm/mempolicy.c | 73 ++++++++++++++++++--------------------- 2 files changed, 38 insertions(+), 46 deletions(-)

diff --git a/include/linux/mempolicy.h b/include/linux/mempolicy.h index 2e81ac87e6f6..aaf2255efcff 100644 --- a/include/linux/mempolicy.h +++ b/include/linux/mempolicy.h @@ -129,10 +129,9 @@ struct shared_policy {

int vma_dup_policy(struct vm_area_struct *src, struct vm_area_struct *dst); void mpol_shared_policy_init(struct shared_policy *sp, struct mempolicy *mpol); -int mpol_set_shared_policy(struct shared_policy *info, - struct vm_area_struct *vma, - struct mempolicy *new); -void mpol_free_shared_policy(struct shared_policy *p); +int mpol_set_shared_policy(struct shared_policy *sp, + struct vm_area_struct *vma, struct mempolicy *mpol); +void mpol_free_shared_policy(struct shared_policy *sp); struct mempolicy *mpol_shared_policy_lookup(struct shared_policy *sp, unsigned long idx);

@@ -199,7 +198,7 @@ static inline bool mpol_equal(struct mempolicy *a, struct mempolicy *b) return true; }

-static inline void mpol_put(struct mempolicy *p) +static inline void mpol_put(struct mempolicy *pol) { }

@@ -218,7 +217,7 @@ static inline void mpol_shared_policy_init(struct shared_policy *sp, { }

-static inline void mpol_free_shared_policy(struct shared_policy *p) +static inline void mpol_free_shared_policy(struct shared_policy *sp) { }

diff --git a/mm/mempolicy.c b/mm/mempolicy.c index 15f1347677c2..137f7ed52b8f 100644 --- a/mm/mempolicy.c +++ b/mm/mempolicy.c @@ -25,7 +25,7 @@ * to the last. It would be better if bind would truly restrict * the allocation to memory nodes instead * - * preferred Try a specific node first before normal fallback. + * preferred Try a specific node first before normal fallback. * As a special case NUMA_NO_NODE here means do the allocation * on the local CPU. This is normally identical to default, * but useful to set in a VMA when you have a non default @@ -52,7 +52,7 @@ * on systems with highmem kernel lowmem allocation don't get policied. * Same with GFP_DMA allocations. * - * For shmfs/tmpfs/hugetlbfs shared memory the policy is shared between + * For shmem/tmpfs shared memory the policy is shared between * all users and remembered even when nobody has memory mapped. */

@@ -296,6 +296,7 @@ static struct mempolicy *mpol_new(unsigned short mode, unsigned short flags, return ERR_PTR(-EINVAL); } else if (nodes_empty(*nodes)) return ERR_PTR(-EINVAL); + policy = kmem_cache_alloc(policy_cache, GFP_KERNEL); if (!policy) return ERR_PTR(-ENOMEM); @@ -308,11 +309,11 @@ static struct mempolicy *mpol_new(unsigned short mode, unsigned short flags, }

/* Slow path of a mpol destructor. */ -void __mpol_put(struct mempolicy *p) +void __mpol_put(struct mempolicy *pol) { - if (!atomic_dec_and_test(&p->refcnt)) + if (!atomic_dec_and_test(&pol->refcnt)) return; - kmem_cache_free(policy_cache, p); + kmem_cache_free(policy_cache, pol); }

static void mpol_rebind_default(struct mempolicy *pol, const nodemask_t *nodes) @@ -369,7 +370,6 @@ static void mpol_rebind_policy(struct mempolicy *pol, const nodemask_t *newmask) * * Called with task's alloc_lock held. */ - void mpol_rebind_task(struct task_struct *tsk, const nodemask_t *new) { mpol_rebind_policy(tsk->mempolicy, new); @@ -380,7 +380,6 @@ void mpol_rebind_task(struct task_struct *tsk, const nodemask_t *new) * * Call holding a reference to mm. Takes mm->mmap_lock during call. */ - void mpol_rebind_mm(struct mm_struct *mm, nodemask_t *new) { struct vm_area_struct *vma; @@ -761,7 +760,7 @@ queue_pages_range(struct mm_struct *mm, unsigned long start, unsigned long end, * This must be called with the mmap_lock held for writing. */ static int vma_replace_policy(struct vm_area_struct *vma, - struct mempolicy *pol) + struct mempolicy *pol) { int err; struct mempolicy *old; @@ -807,7 +806,7 @@ static int mbind_range(struct vma_iterator *vmi, struct vm_area_struct *vma, vmstart = vma->vm_start; }

- if (mpol_equal(vma_policy(vma), new_pol)) { + if (mpol_equal(vma->vm_policy, new_pol)) { *prev = vma; return 0; } @@ -879,18 +878,18 @@ static long do_set_mempolicy(unsigned short mode, unsigned short flags, * * Called with task's alloc_lock held */ -static void get_policy_nodemask(struct mempolicy *p, nodemask_t *nodes) +static void get_policy_nodemask(struct mempolicy *pol, nodemask_t *nodes) { nodes_clear(*nodes); - if (p == &default_policy) + if (pol == &default_policy) return;

- switch (p->mode) { + switch (pol->mode) { case MPOL_BIND: case MPOL_INTERLEAVE: case MPOL_PREFERRED: case MPOL_PREFERRED_MANY: - *nodes = p->nodes; + *nodes = pol->nodes; break; case MPOL_LOCAL: /* return empty node mask for local allocation */ @@ -1664,7 +1663,6 @@ static int kernel_migrate_pages(pid_t pid, unsigned long maxnode, out_put: put_task_struct(task); goto out; - }

SYSCALL_DEFINE4(migrate_pages, pid_t, pid, unsigned long, maxnode, @@ -1674,7 +1672,6 @@ SYSCALL_DEFINE4(migrate_pages, pid_t, pid, unsigned long, maxnode, return kernel_migrate_pages(pid, maxnode, old_nodes, new_nodes); }

- /* Retrieve NUMA policy */ static int kernel_get_mempolicy(int __user *policy, unsigned long __user *nmask, @@ -1857,10 +1854,10 @@ nodemask_t *policy_nodemask(gfp_t gfp, struct mempolicy *policy) * policy_node() is always coupled with policy_nodemask(), which * secures the nodemask limit for 'bind' and 'prefer-many' policy. */ -static int policy_node(gfp_t gfp, struct mempolicy *policy, int nd) +static int policy_node(gfp_t gfp, struct mempolicy *policy, int nid) { if (policy->mode == MPOL_PREFERRED) { - nd = first_node(policy->nodes); + nid = first_node(policy->nodes); } else { /* * __GFP_THISNODE shouldn't even be used with the bind policy @@ -1875,19 +1872,18 @@ static int policy_node(gfp_t gfp, struct mempolicy *policy, int nd) policy->home_node != NUMA_NO_NODE) return policy->home_node;

- return nd; + return nid; }

/* Do dynamic interleaving for a process */ -static unsigned interleave_nodes(struct mempolicy *policy) +static unsigned int interleave_nodes(struct mempolicy *policy) { - unsigned next; - struct task_struct *me = current; + unsigned int nid;

- next = next_node_in(me->il_prev, policy->nodes); - if (next < MAX_NUMNODES) - me->il_prev = next; - return next; + nid = next_node_in(current->il_prev, policy->nodes); + if (nid < MAX_NUMNODES) + current->il_prev = nid; + return nid; }

/* @@ -2372,7 +2368,7 @@ unsigned long alloc_pages_bulk_array_mempolicy(gfp_t gfp,

int vma_dup_policy(struct vm_area_struct *src, struct vm_area_struct *dst) { - struct mempolicy *pol = mpol_dup(vma_policy(src)); + struct mempolicy *pol = mpol_dup(src->vm_policy);

if (IS_ERR(pol)) return PTR_ERR(pol); @@ -2796,40 +2792,40 @@ void mpol_shared_policy_init(struct shared_policy *sp, struct mempolicy *mpol) } }

-int mpol_set_shared_policy(struct shared_policy *info, - struct vm_area_struct *vma, struct mempolicy *npol) +int mpol_set_shared_policy(struct shared_policy *sp, + struct vm_area_struct *vma, struct mempolicy *pol) { int err; struct sp_node *new = NULL; unsigned long sz = vma_pages(vma);

- if (npol) { - new = sp_alloc(vma->vm_pgoff, vma->vm_pgoff + sz, npol); + if (pol) { + new = sp_alloc(vma->vm_pgoff, vma->vm_pgoff + sz, pol); if (!new) return -ENOMEM; } - err = shared_policy_replace(info, vma->vm_pgoff, vma->vm_pgoff+sz, new); + err = shared_policy_replace(sp, vma->vm_pgoff, vma->vm_pgoff + sz, new); if (err && new) sp_free(new); return err; }

/* Free a backing policy store on inode delete. */ -void mpol_free_shared_policy(struct shared_policy *p) +void mpol_free_shared_policy(struct shared_policy *sp) { struct sp_node *n; struct rb_node *next;

- if (!p->root.rb_node) + if (!sp->root.rb_node) return; - write_lock(&p->lock); - next = rb_first(&p->root); + write_lock(&sp->lock); + next = rb_first(&sp->root); while (next) { n = rb_entry(next, struct sp_node, nd); next = rb_next(&n->nd); - sp_delete(p, n); + sp_delete(sp, n); } - write_unlock(&p->lock); + write_unlock(&sp->lock); }

#ifdef CONFIG_NUMA_BALANCING @@ -2879,7 +2875,6 @@ static inline void __init check_numabalancing_enable(void) } #endif /* CONFIG_NUMA_BALANCING */

-/* assumes fs == KERNEL_DS */ void __init numa_policy_init(void) { nodemask_t interleave_nodes; @@ -2942,7 +2937,6 @@ void numa_default_policy(void) /* * Parse and format mempolicy from/to strings */ - static const char * const policy_modes[] = { [MPOL_DEFAULT] = "default", @@ -2953,7 +2947,6 @@ static const char * const policy_modes[] = [MPOL_PREFERRED_MANY] = "prefer (many)", };

- #ifdef CONFIG_TMPFS /** * mpol_parse_str - parse string to mempolicy, for tmpfs mpol mount option.

-- 2.33.0

patchwork bot

10:06 p.m.

New subject: [PATCH OLK-6.6 v2] mempolicy trivia: slightly more consistent naming

反馈: 您发送到kernel@openeuler.org的补丁/补丁集，转换为PR失败！邮件列表地址：https://mailweb.openeuler.org/hyperkitty/list/kernel@openeuler.org/message/D... 失败原因：应用补丁/补丁集失败，Patch failed at 0001 mempolicy trivia: slightly more consistent naming 建议解决方法：请查看失败原因，确认补丁是否可以应用在当前期望分支的最新代码上

FeedBack: The patch(es) which you have sent to kernel@openeuler.org has been converted to PR failed! Mailing list address: https://mailweb.openeuler.org/hyperkitty/list/kernel@openeuler.org/message/D... Failed Reason: apply patch(es) failed, Patch failed at 0001 mempolicy trivia: slightly more consistent naming Suggest Solution: please checkout if the failed patch(es) can work on the newest codes in expected branch

Ze Zuo

9:49 p.m.

New subject: [PATCH OLK-6.6 v2] mempolicy trivia: use pgoff_t in shared mempolicy tree

From: Hugh Dickins hughd@google.com

mainline inclusion from mainline-v6.7-rc1 commit 93397c3b7684555b7cec726cd13eef6742d191fe category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I9OHHN CVE: NA

Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i...

--------------------------------

Prefer the more explicit "pgoff_t" to "unsigned long" when dealing with a shared mempolicy tree. Delete confusing comment about pseudo mm vmas.

Link: https://lkml.kernel.org/r/5451157-3818-4af5-fd2c-5d26a5d1dc53@google.com Signed-off-by: Hugh Dickins hughd@google.com Cc: Andi Kleen ak@linux.intel.com Cc: Christoph Lameter cl@linux.com Cc: David Hildenbrand david@redhat.com Cc: Greg Kroah-Hartman gregkh@linuxfoundation.org Cc: "Huang, Ying" ying.huang@intel.com Cc: Kefeng Wang wangkefeng.wang@huawei.com Cc: Matthew Wilcox (Oracle) willy@infradead.org Cc: Mel Gorman mgorman@techsingularity.net Cc: Michal Hocko mhocko@suse.com Cc: Mike Kravetz mike.kravetz@oracle.com Cc: Nhat Pham nphamcs@gmail.com Cc: Sidhartha Kumar sidhartha.kumar@oracle.com Cc: Suren Baghdasaryan surenb@google.com Cc: Tejun heo tj@kernel.org Cc: Vishal Moola (Oracle) vishal.moola@gmail.com Cc: Yang Shi shy828301@gmail.com Cc: Yosry Ahmed yosryahmed@google.com Signed-off-by: Andrew Morton akpm@linux-foundation.org Signed-off-by: Ze Zuo zuoze1@huawei.com --- include/linux/mempolicy.h | 20 +++++++------------- mm/mempolicy.c | 12 ++++++------ 2 files changed, 13 insertions(+), 19 deletions(-)

diff --git a/include/linux/mempolicy.h b/include/linux/mempolicy.h index aaf2255efcff..684d28f36efc 100644 --- a/include/linux/mempolicy.h +++ b/include/linux/mempolicy.h @@ -110,22 +110,16 @@ static inline bool mpol_equal(struct mempolicy *a, struct mempolicy *b)

/* * Tree of shared policies for a shared memory region. - * Maintain the policies in a pseudo mm that contains vmas. The vmas - * carry the policy. As a special twist the pseudo mm is indexed in pages, not - * bytes, so that we can work with shared memory segments bigger than - * unsigned long. */ - -struct sp_node { - struct rb_node nd; - unsigned long start, end; - struct mempolicy *policy; -}; - struct shared_policy { struct rb_root root; rwlock_t lock; }; +struct sp_node { + struct rb_node nd; + pgoff_t start, end; + struct mempolicy *policy; +};

int vma_dup_policy(struct vm_area_struct *src, struct vm_area_struct *dst); void mpol_shared_policy_init(struct shared_policy *sp, struct mempolicy *mpol); @@ -133,7 +127,7 @@ int mpol_set_shared_policy(struct shared_policy *sp, struct vm_area_struct *vma, struct mempolicy *mpol); void mpol_free_shared_policy(struct shared_policy *sp); struct mempolicy *mpol_shared_policy_lookup(struct shared_policy *sp, - unsigned long idx); + pgoff_t idx);

struct mempolicy *get_task_policy(struct task_struct *p); struct mempolicy *__get_vma_policy(struct vm_area_struct *vma, @@ -222,7 +216,7 @@ static inline void mpol_free_shared_policy(struct shared_policy *sp) }

static inline struct mempolicy * -mpol_shared_policy_lookup(struct shared_policy *sp, unsigned long idx) +mpol_shared_policy_lookup(struct shared_policy *sp, pgoff_t idx) { return NULL; } diff --git a/mm/mempolicy.c b/mm/mempolicy.c index 137f7ed52b8f..a7192a2196d2 100644 --- a/mm/mempolicy.c +++ b/mm/mempolicy.c @@ -2453,8 +2453,8 @@ bool __mpol_equal(struct mempolicy *a, struct mempolicy *b) * lookup first element intersecting start-end. Caller holds sp->lock for * reading or for writing */ -static struct sp_node * -sp_lookup(struct shared_policy *sp, unsigned long start, unsigned long end) +static struct sp_node *sp_lookup(struct shared_policy *sp, + pgoff_t start, pgoff_t end) { struct rb_node *n = sp->root.rb_node;

@@ -2508,8 +2508,8 @@ static void sp_insert(struct shared_policy *sp, struct sp_node *new) }

/* Find shared policy intersecting idx */ -struct mempolicy * -mpol_shared_policy_lookup(struct shared_policy *sp, unsigned long idx) +struct mempolicy *mpol_shared_policy_lookup(struct shared_policy *sp, + pgoff_t idx) { struct mempolicy *pol = NULL; struct sp_node *sn; @@ -2677,8 +2677,8 @@ static struct sp_node *sp_alloc(unsigned long start, unsigned long end, }

/* Replace a policy range. */ -static int shared_policy_replace(struct shared_policy *sp, unsigned long start, - unsigned long end, struct sp_node *new) +static int shared_policy_replace(struct shared_policy *sp, pgoff_t start, + pgoff_t end, struct sp_node *new) { struct sp_node *n; struct sp_node *n_new = NULL;

-- 2.33.0

patchwork bot

10:06 p.m.

New subject: [PATCH OLK-6.6 v2] mempolicy trivia: use pgoff_t in shared mempolicy tree

反馈: 您发送到kernel@openeuler.org的补丁/补丁集，转换为PR失败！邮件列表地址：https://mailweb.openeuler.org/hyperkitty/list/kernel@openeuler.org/message/6... 失败原因：应用补丁/补丁集失败，Patch failed at 0001 mempolicy trivia: use pgoff_t in shared mempolicy tree 建议解决方法：请查看失败原因，确认补丁是否可以应用在当前期望分支的最新代码上

FeedBack: The patch(es) which you have sent to kernel@openeuler.org has been converted to PR failed! Mailing list address: https://mailweb.openeuler.org/hyperkitty/list/kernel@openeuler.org/message/6... Failed Reason: apply patch(es) failed, Patch failed at 0001 mempolicy trivia: use pgoff_t in shared mempolicy tree Suggest Solution: please checkout if the failed patch(es) can work on the newest codes in expected branch

Ze Zuo

9:49 p.m.

New subject: [PATCH OLK-6.6 v2] mempolicy: mpol_shared_policy_init() without pseudo-vma

From: Hugh Dickins hughd@google.com

mainline inclusion from mainline-v6.7-rc1 commit 35ec8fa0207b3c7f7c3c22337c9a507d7b291626 category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I9OHHN CVE: NA

Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i...

--------------------------------

mpol_shared_policy_init() does not need to use a pseudo-vma: it can use sp_alloc() and sp_insert() directly, since the object's shared policy tree is empty and inaccessible (needing no lock) at get_inode() time.

Link: https://lkml.kernel.org/r/3bef62d8-ae78-4c2-533-56a44ae425c@google.com Signed-off-by: Hugh Dickins hughd@google.com Cc: Andi Kleen ak@linux.intel.com Cc: Christoph Lameter cl@linux.com Cc: David Hildenbrand david@redhat.com Cc: Greg Kroah-Hartman gregkh@linuxfoundation.org Cc: "Huang, Ying" ying.huang@intel.com Cc: Kefeng Wang wangkefeng.wang@huawei.com Cc: Matthew Wilcox (Oracle) willy@infradead.org Cc: Mel Gorman mgorman@techsingularity.net Cc: Michal Hocko mhocko@suse.com Cc: Mike Kravetz mike.kravetz@oracle.com Cc: Nhat Pham nphamcs@gmail.com Cc: Sidhartha Kumar sidhartha.kumar@oracle.com Cc: Suren Baghdasaryan surenb@google.com Cc: Tejun heo tj@kernel.org Cc: Vishal Moola (Oracle) vishal.moola@gmail.com Cc: Yang Shi shy828301@gmail.com Cc: Yosry Ahmed yosryahmed@google.com Signed-off-by: Andrew Morton akpm@linux-foundation.org Signed-off-by: Ze Zuo zuoze1@huawei.com --- mm/mempolicy.c | 30 +++++++++++++++--------------- 1 file changed, 15 insertions(+), 15 deletions(-)

diff --git a/mm/mempolicy.c b/mm/mempolicy.c index a7192a2196d2..4d8dbfd72225 100644 --- a/mm/mempolicy.c +++ b/mm/mempolicy.c @@ -2761,30 +2761,30 @@ void mpol_shared_policy_init(struct shared_policy *sp, struct mempolicy *mpol) rwlock_init(&sp->lock);

if (mpol) { - struct vm_area_struct pvma; - struct mempolicy *new; + struct sp_node *sn; + struct mempolicy *npol; NODEMASK_SCRATCH(scratch);

if (!scratch) goto put_mpol; - /* contextualize the tmpfs mount point mempolicy */ - new = mpol_new(mpol->mode, mpol->flags, &mpol->w.user_nodemask); - if (IS_ERR(new)) + + /* contextualize the tmpfs mount point mempolicy to this file */ + npol = mpol_new(mpol->mode, mpol->flags, &mpol->w.user_nodemask); + if (IS_ERR(npol)) goto free_scratch; /* no valid nodemask intersection */

task_lock(current); - ret = mpol_set_nodemask(new, &mpol->w.user_nodemask, scratch); + ret = mpol_set_nodemask(npol, &mpol->w.user_nodemask, scratch); task_unlock(current); if (ret) - goto put_new; - - /* Create pseudo-vma that contains just the policy */ - vma_init(&pvma, NULL); - pvma.vm_end = TASK_SIZE; /* policy covers entire file */ - mpol_set_shared_policy(sp, &pvma, new); /* adds ref */ - -put_new: - mpol_put(new); /* drop initial ref */ + goto put_npol; + + /* alloc node covering entire file; adds ref to file's npol */ + sn = sp_alloc(0, MAX_LFS_FILESIZE >> PAGE_SHIFT, npol); + if (sn) + sp_insert(sp, sn); +put_npol: + mpol_put(npol); /* drop initial ref on file's npol */ free_scratch: NODEMASK_SCRATCH_FREE(scratch); put_mpol:

-- 2.33.0

patchwork bot

10:07 p.m.

New subject: [PATCH OLK-6.6 v2] mempolicy: mpol_shared_policy_init() without pseudo-vma

反馈：您发送到kernel@openeuler.org的补丁/补丁集，已成功转换为PR！ PR链接地址： https://gitee.com/openeuler/kernel/pulls/7199 邮件列表地址：https://mailweb.openeuler.org/hyperkitty/list/kernel@openeuler.org/message/N...

FeedBack: The patch(es) which you have sent to kernel@openeuler.org mailing list has been converted to a pull request successfully! Pull request link: https://gitee.com/openeuler/kernel/pulls/7199 Mailing list address: https://mailweb.openeuler.org/hyperkitty/list/kernel@openeuler.org/message/N...

Ze Zuo

9:49 p.m.

New subject: [PATCH OLK-6.6 v2] mempolicy: remove confusing MPOL_MF_LAZY dead code

From: Hugh Dickins hughd@google.com

mainline inclusion from mainline-v6.7-rc1 commit 2cafb582173f3870240af90de3f31d18b0728882 category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I9OHHN CVE: NA

Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i...

--------------------------------

v3.8 commit b24f53a0bea3 ("mm: mempolicy: Add MPOL_MF_LAZY") introduced MPOL_MF_LAZY, and included it in the MPOL_MF_VALID flags; but a720094ded8 ("mm: mempolicy: Hide MPOL_NOOP and MPOL_MF_LAZY from userspace for now") immediately removed it from MPOL_MF_VALID flags, pending further review. "This will need to be revisited", but it has not been reinstated.

The present state is confusing: there is dead code in mm/mempolicy.c to handle MPOL_MF_LAZY cases which can never occur. Remove that: it can be resurrected later if necessary. But keep the definition of MPOL_MF_LAZY, which must remain in the UAPI, even though it always fails with EINVAL.

https://lore.kernel.org/linux-mm/1553041659-46787-1-git-send-email-yang.shi@... links to a previous request to remove MPOL_MF_LAZY.

Link: https://lkml.kernel.org/r/80c9665c-1c3f-17ba-21a3-f6115cebf7d@google.com Signed-off-by: Hugh Dickins hughd@google.com Reviewed-by: Matthew Wilcox (Oracle) willy@infradead.org Reviewed-by: Yang Shi shy828301@gmail.com Cc: Andi Kleen ak@linux.intel.com Cc: Christoph Lameter cl@linux.com Cc: David Hildenbrand david@redhat.com Cc: Greg Kroah-Hartman gregkh@linuxfoundation.org Cc: "Huang, Ying" ying.huang@intel.com Cc: Kefeng Wang wangkefeng.wang@huawei.com Cc: Mel Gorman mgorman@techsingularity.net Cc: Michal Hocko mhocko@suse.com Cc: Mike Kravetz mike.kravetz@oracle.com Cc: Nhat Pham nphamcs@gmail.com Cc: Sidhartha Kumar sidhartha.kumar@oracle.com Cc: Suren Baghdasaryan surenb@google.com Cc: Tejun heo tj@kernel.org Cc: Vishal Moola (Oracle) vishal.moola@gmail.com Cc: Yosry Ahmed yosryahmed@google.com Signed-off-by: Andrew Morton akpm@linux-foundation.org Signed-off-by: Ze Zuo zuoze1@huawei.com --- include/uapi/linux/mempolicy.h | 2 +- mm/mempolicy.c | 18 ------------------ 2 files changed, 1 insertion(+), 19 deletions(-)

diff --git a/include/uapi/linux/mempolicy.h b/include/uapi/linux/mempolicy.h index 046d0ccba4cd..a8963f7ef4c2 100644 --- a/include/uapi/linux/mempolicy.h +++ b/include/uapi/linux/mempolicy.h @@ -48,7 +48,7 @@ enum { #define MPOL_MF_MOVE (1<<1) /* Move pages owned by this process to conform to policy */ #define MPOL_MF_MOVE_ALL (1<<2) /* Move every page to conform to policy */ -#define MPOL_MF_LAZY (1<<3) /* Modifies '_MOVE: lazy migrate on fault */ +#define MPOL_MF_LAZY (1<<3) /* UNSUPPORTED FLAG: Lazy migrate on fault */ #define MPOL_MF_INTERNAL (1<<4) /* Internal flags start here */

#define MPOL_MF_VALID (MPOL_MF_STRICT | \ diff --git a/mm/mempolicy.c b/mm/mempolicy.c index 4d8dbfd72225..2ef6e8f09595 100644 --- a/mm/mempolicy.c +++ b/mm/mempolicy.c @@ -640,12 +640,6 @@ unsigned long change_prot_numa(struct vm_area_struct *vma,

return nr_updated; } -#else -static unsigned long change_prot_numa(struct vm_area_struct *vma, - unsigned long addr, unsigned long end) -{ - return 0; -} #endif /* CONFIG_NUMA_BALANCING */

static int queue_pages_test_walk(unsigned long start, unsigned long end, @@ -684,14 +678,6 @@ static int queue_pages_test_walk(unsigned long start, unsigned long end, if (endvma > end) endvma = end;

- if (flags & MPOL_MF_LAZY) { - /* Similar to task_numa_work, skip inaccessible VMAs */ - if (!is_vm_hugetlb_page(vma) && vma_is_accessible(vma) && - !(vma->vm_flags & VM_MIXEDMAP)) - change_prot_numa(vma, start, endvma); - return 1; - } - /* * Check page nodes, and queue pages to move, in the current vma. * But if no moving, and no strict checking, the scan can be skipped. @@ -1277,9 +1263,6 @@ long __do_mbind(unsigned long start, unsigned long len, if (IS_ERR(new)) return PTR_ERR(new);

- if (flags & MPOL_MF_LAZY) - new->flags |= MPOL_F_MOF; - /* * If we are using the default policy then operation * on discontinuous address spaces is okay after all @@ -1324,7 +1307,6 @@ long __do_mbind(unsigned long start, unsigned long len,

if (!err) { if (!list_empty(&pagelist)) { - WARN_ON_ONCE(flags & MPOL_MF_LAZY); nr_failed |= migrate_pages(&pagelist, new_folio, NULL, start, MIGRATE_SYNC, MR_MEMPOLICY_MBIND, NULL); }

-- 2.33.0

patchwork bot

10:09 p.m.

New subject: [PATCH OLK-6.6 v2] mempolicy: remove confusing MPOL_MF_LAZY dead code

反馈: 您发送到kernel@openeuler.org的补丁/补丁集，转换为PR失败！邮件列表地址：https://mailweb.openeuler.org/hyperkitty/list/kernel@openeuler.org/message/4... 失败原因：应用补丁/补丁集失败，Patch failed at 0001 mempolicy: remove confusing MPOL_MF_LAZY dead code 建议解决方法：请查看失败原因，确认补丁是否可以应用在当前期望分支的最新代码上

FeedBack: The patch(es) which you have sent to kernel@openeuler.org has been converted to PR failed! Mailing list address: https://mailweb.openeuler.org/hyperkitty/list/kernel@openeuler.org/message/4... Failed Reason: apply patch(es) failed, Patch failed at 0001 mempolicy: remove confusing MPOL_MF_LAZY dead code Suggest Solution: please checkout if the failed patch(es) can work on the newest codes in expected branch

Ze Zuo

9:49 p.m.

New subject: [PATCH OLK-6.6 v2] kernfs: drop shared NUMA mempolicy hooks

From: Hugh Dickins hughd@google.com

mainline inclusion from mainline-v6.7-rc1 commit 4b981bc1aa73c204c2aa7f99b5f4f74d03b0e381 category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I9OHHN CVE: NA

Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i...

--------------------------------

It seems strange that kernfs should be an outlier with a set_policy and get_policy in its kernfs_vm_ops. Ah, it dates back to v2.6.30's commit 095160aee954 ("sysfs: fix some bin_vm_ops errors"), when I had crashed on powerpc's pci_mmap_legacy_page_range() fallback to shmem_zero_setup().

Well, that was commendably thorough, to give sysfs-bin a set_policy and get_policy, just to avoid the way it was coded resulting in EINVAL from mmap when CONFIG_NUMA; but somehow feels a bit over-the-top to me now.

It's easier to say that nobody should expect to manage a shmem object's shared NUMA mempolicy via some kernfs backdoor to that object: delete that code (and there's no longer an EINVAL from mmap in the NUMA case).

This then leaves set_policy/get_policy as implemented only by shmem - though importantly also by SysV SHM, which has to interface with shmem which implements them, and with SHM_HUGETLB which does not.

Link: https://lkml.kernel.org/r/302164-a760-4a9e-879b-6870c9b4013@google.com Signed-off-by: Hugh Dickins hughd@google.com Reviewed-by: Matthew Wilcox (Oracle) willy@infradead.org Cc: Andi Kleen ak@linux.intel.com Cc: Christoph Lameter cl@linux.com Cc: David Hildenbrand david@redhat.com Cc: Greg Kroah-Hartman gregkh@linuxfoundation.org Cc: "Huang, Ying" ying.huang@intel.com Cc: Kefeng Wang wangkefeng.wang@huawei.com Cc: Mel Gorman mgorman@techsingularity.net Cc: Michal Hocko mhocko@suse.com Cc: Mike Kravetz mike.kravetz@oracle.com Cc: Nhat Pham nphamcs@gmail.com Cc: Sidhartha Kumar sidhartha.kumar@oracle.com Cc: Suren Baghdasaryan surenb@google.com Cc: Tejun heo tj@kernel.org Cc: Vishal Moola (Oracle) vishal.moola@gmail.com Cc: Yang Shi shy828301@gmail.com Cc: Yosry Ahmed yosryahmed@google.com Signed-off-by: Andrew Morton akpm@linux-foundation.org Signed-off-by: Ze Zuo zuoze1@huawei.com --- fs/kernfs/file.c | 49 ------------------------------------------------ 1 file changed, 49 deletions(-)

diff --git a/fs/kernfs/file.c b/fs/kernfs/file.c index 1cbf9a44422e..b1511c228852 100644 --- a/fs/kernfs/file.c +++ b/fs/kernfs/file.c @@ -433,60 +433,11 @@ static int kernfs_vma_access(struct vm_area_struct *vma, unsigned long addr, return ret; }

-#ifdef CONFIG_NUMA -static int kernfs_vma_set_policy(struct vm_area_struct *vma, - struct mempolicy *new) -{ - struct file *file = vma->vm_file; - struct kernfs_open_file *of = kernfs_of(file); - int ret; - - if (!of->vm_ops) - return 0; - - if (!kernfs_get_active(of->kn)) - return -EINVAL; - - ret = 0; - if (of->vm_ops->set_policy) - ret = of->vm_ops->set_policy(vma, new); - - kernfs_put_active(of->kn); - return ret; -} - -static struct mempolicy *kernfs_vma_get_policy(struct vm_area_struct *vma, - unsigned long addr) -{ - struct file *file = vma->vm_file; - struct kernfs_open_file *of = kernfs_of(file); - struct mempolicy *pol; - - if (!of->vm_ops) - return vma->vm_policy; - - if (!kernfs_get_active(of->kn)) - return vma->vm_policy; - - pol = vma->vm_policy; - if (of->vm_ops->get_policy) - pol = of->vm_ops->get_policy(vma, addr); - - kernfs_put_active(of->kn); - return pol; -} - -#endif - static const struct vm_operations_struct kernfs_vm_ops = { .open = kernfs_vma_open, .fault = kernfs_vma_fault, .page_mkwrite = kernfs_vma_page_mkwrite, .access = kernfs_vma_access, -#ifdef CONFIG_NUMA - .set_policy = kernfs_vma_set_policy, - .get_policy = kernfs_vma_get_policy, -#endif };

static int kernfs_fop_mmap(struct file *file, struct vm_area_struct *vma)

-- 2.33.0

patchwork bot

10:13 p.m.

New subject: [PATCH OLK-6.6 v2] kernfs: drop shared NUMA mempolicy hooks

反馈：您发送到kernel@openeuler.org的补丁/补丁集，已成功转换为PR！ PR链接地址： https://gitee.com/openeuler/kernel/pulls/7201 邮件列表地址：https://mailweb.openeuler.org/hyperkitty/list/kernel@openeuler.org/message/I...

FeedBack: The patch(es) which you have sent to kernel@openeuler.org mailing list has been converted to a pull request successfully! Pull request link: https://gitee.com/openeuler/kernel/pulls/7201 Mailing list address: https://mailweb.openeuler.org/hyperkitty/list/kernel@openeuler.org/message/I...

Ze Zuo

9:49 p.m.

New subject: [PATCH OLK-6.6 v2] mempolicy: alloc_pages_mpol() for NUMA policy without vma

From: Hugh Dickins hughd@google.com

mainline inclusion from mainline-v6.7-rc1 commit ddc1a5cbc05dc62743a2f409b96faa5cf95ba064 category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I9OHHN CVE: NA

Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i...

--------------------------------

Shrink shmem's stack usage by eliminating the pseudo-vma from its folio allocation. alloc_pages_mpol(gfp, order, pol, ilx, nid) becomes the principal actor for passing mempolicy choice down to __alloc_pages(), rather than vma_alloc_folio(gfp, order, vma, addr, hugepage).

vma_alloc_folio() and alloc_pages() remain, but as wrappers around alloc_pages_mpol(). alloc_pages_bulk_*() untouched, except to provide the additional args to policy_nodemask(), which subsumes policy_node(). Cleanup throughout, cutting out some unhelpful "helpers".

It would all be much simpler without MPOL_INTERLEAVE, but that adds a dynamic to the constant mpol: complicated by v3.6 commit 09c231cb8bfd ("tmpfs: distribute interleave better across nodes"), which added ino bias to the interleave, hidden from mm/mempolicy.c until this commit.

Hence "ilx" throughout, the "interleave index". Originally I thought it could be done just with nid, but that's wrong: the nodemask may come from the shared policy layer below a shmem vma, or it may come from the task layer above a shmem vma; and without the final nodemask then nodeid cannot be decided. And how ilx is applied depends also on page order.

The interleave index is almost always irrelevant unless MPOL_INTERLEAVE: with one exception in alloc_pages_mpol(), where the NO_INTERLEAVE_INDEX passed down from vma-less alloc_pages() is also used as hint not to use THP-style hugepage allocation - to avoid the overhead of a hugepage arg (though I don't understand why we never just added a GFP bit for THP - if it actually needs a different allocation strategy from other pages of the same order). vma_alloc_folio() still carries its hugepage arg here, but it is not used, and should be removed when agreed.

get_vma_policy() no longer allows a NULL vma: over time I believe we've eradicated all the places which used to need it e.g. swapoff and madvise used to pass NULL vma to read_swap_cache_async(), but now know the vma.

[hughd@google.com: handle NULL mpol being passed to __read_swap_cache_async()] Link: https://lkml.kernel.org/r/ea419956-4751-0102-21f7-9c93cb957892@google.com Link: https://lkml.kernel.org/r/74e34633-6060-f5e3-aee-7040d43f2e93@google.com Link: https://lkml.kernel.org/r/1738368e-bac0-fd11-ed7f-b87142a939fe@google.com Signed-off-by: Hugh Dickins hughd@google.com Cc: Andi Kleen ak@linux.intel.com Cc: Christoph Lameter cl@linux.com Cc: David Hildenbrand david@redhat.com Cc: Greg Kroah-Hartman gregkh@linuxfoundation.org Cc: Huang Ying ying.huang@intel.com Cc: Kefeng Wang wangkefeng.wang@huawei.com Cc: Matthew Wilcox (Oracle) willy@infradead.org Cc: Mel Gorman mgorman@techsingularity.net Cc: Michal Hocko mhocko@suse.com Cc: Mike Kravetz mike.kravetz@oracle.com Cc: Nhat Pham nphamcs@gmail.com Cc: Sidhartha Kumar sidhartha.kumar@oracle.com Cc: Suren Baghdasaryan surenb@google.com Cc: Tejun heo tj@kernel.org Cc: Vishal Moola (Oracle) vishal.moola@gmail.com Cc: Yang Shi shy828301@gmail.com Cc: Yosry Ahmed yosryahmed@google.com Cc: Domenico Cerasuolo mimmocerasuolo@gmail.com Cc: Johannes Weiner hannes@cmpxchg.org Signed-off-by: Andrew Morton akpm@linux-foundation.org

Conflicts: include/linux/mempolicy.h [ Code line count context conflict] mm/mempolicy.c [ Context conflict is due to a significant dependency conflict with commit 3022fd7af960. This conflict is too large and risky to be incorporated into the codebase. ] mm/shmem.c [ Context conflict with commit 1a553561230a. The refactoring of the vma_alloc_folio function has caused conflicts with the QOS-aware code.] Signed-off-by: Ze Zuo zuoze1@huawei.com --- fs/proc/task_mmu.c | 5 +- include/linux/gfp.h | 10 +- include/linux/mempolicy.h | 20 +- include/linux/mm.h | 2 +- ipc/shm.c | 21 +- mm/mempolicy.c | 393 ++++++++++++++++---------------------- mm/shmem.c | 95 ++++----- mm/swap.h | 9 +- mm/swap_state.c | 86 +++++---- mm/zswap.c | 7 +- 10 files changed, 316 insertions(+), 332 deletions(-)

diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c index 639eb70634f4..ab4ffd3a06f7 100644 --- a/fs/proc/task_mmu.c +++ b/fs/proc/task_mmu.c @@ -1950,8 +1950,9 @@ static int show_numa_map(struct seq_file *m, void *v) struct numa_maps *md = &numa_priv->md; struct file *file = vma->vm_file; struct mm_struct *mm = vma->vm_mm; - struct mempolicy *pol; char buffer[64]; + struct mempolicy *pol; + pgoff_t ilx; int nid;

if (!mm) @@ -1960,7 +1961,7 @@ static int show_numa_map(struct seq_file *m, void *v) /* Ensure we start with an empty set of numa_maps statistics. */ memset(md, 0, sizeof(*md));

- pol = __get_vma_policy(vma, vma->vm_start); + pol = __get_vma_policy(vma, vma->vm_start, &ilx); if (pol) { mpol_to_str(buffer, sizeof(buffer), pol); mpol_cond_put(pol); diff --git a/include/linux/gfp.h b/include/linux/gfp.h index b18b7e3758be..ad30b46d8a3d 100644 --- a/include/linux/gfp.h +++ b/include/linux/gfp.h @@ -8,6 +8,7 @@ #include <linux/topology.h>

struct vm_area_struct; +struct mempolicy;

/* Convert GFP flags to their corresponding migrate type */ #define GFP_MOVABLE_MASK (__GFP_RECLAIMABLE|__GFP_MOVABLE) @@ -276,7 +277,9 @@ static inline struct page *alloc_pages_node(int nid, gfp_t gfp_mask,

#ifdef CONFIG_NUMA struct page *alloc_pages(gfp_t gfp, unsigned int order); -struct folio *folio_alloc(gfp_t gfp, unsigned order); +struct page *alloc_pages_mpol(gfp_t gfp, unsigned int order, + struct mempolicy *mpol, pgoff_t ilx, int nid); +struct folio *folio_alloc(gfp_t gfp, unsigned int order); struct folio *vma_alloc_folio(gfp_t gfp, int order, struct vm_area_struct *vma, unsigned long addr, bool hugepage); #else @@ -284,6 +287,11 @@ static inline struct page *alloc_pages(gfp_t gfp_mask, unsigned int order) { return alloc_pages_node(numa_node_id(), gfp_mask, order); } +static inline struct page *alloc_pages_mpol(gfp_t gfp, unsigned int order, + struct mempolicy *mpol, pgoff_t ilx, int nid) +{ + return alloc_pages(gfp, order); +} static inline struct folio *folio_alloc(gfp_t gfp, unsigned int order) { return __folio_alloc_node(gfp, order, numa_node_id()); diff --git a/include/linux/mempolicy.h b/include/linux/mempolicy.h index 684d28f36efc..7a45b565fdcc 100644 --- a/include/linux/mempolicy.h +++ b/include/linux/mempolicy.h @@ -17,6 +17,8 @@

struct mm_struct;

+#define NO_INTERLEAVE_INDEX (-1UL) /* use task il_prev for interleaving */ + #ifdef CONFIG_NUMA

/* @@ -131,7 +133,9 @@ struct mempolicy *mpol_shared_policy_lookup(struct shared_policy *sp,

struct mempolicy *get_task_policy(struct task_struct *p); struct mempolicy *__get_vma_policy(struct vm_area_struct *vma, - unsigned long addr); + unsigned long addr, pgoff_t *ilx); +struct mempolicy *get_vma_policy(struct vm_area_struct *vma, + unsigned long addr, int order, pgoff_t *ilx); bool vma_policy_mof(struct vm_area_struct *vma);

extern void numa_default_policy(void); @@ -145,8 +149,6 @@ extern int huge_node(struct vm_area_struct *vma, extern bool init_nodemask_of_mempolicy(nodemask_t *mask); extern bool mempolicy_in_oom_domain(struct task_struct *tsk, const nodemask_t *mask); -extern nodemask_t *policy_nodemask(gfp_t gfp, struct mempolicy *policy); - extern unsigned int mempolicy_slab_node(void);

extern enum zone_type policy_zone; @@ -187,6 +189,11 @@ extern long __do_mbind(unsigned long start, unsigned long len,

struct mempolicy {};

+static inline struct mempolicy *get_task_policy(struct task_struct *p) +{ + return NULL; +} + static inline bool mpol_equal(struct mempolicy *a, struct mempolicy *b) { return true; @@ -223,6 +230,13 @@ mpol_shared_policy_lookup(struct shared_policy *sp, pgoff_t idx)

#define vma_policy(vma) NULL

+static inline struct mempolicy *get_vma_policy(struct vm_area_struct *vma, + unsigned long addr, int order, pgoff_t *ilx) +{ + *ilx = 0; + return NULL; +} + static inline int vma_dup_policy(struct vm_area_struct *src, struct vm_area_struct *dst) { diff --git a/include/linux/mm.h b/include/linux/mm.h index f86fd573a4a1..49f4fac2dcf7 100644 --- a/include/linux/mm.h +++ b/include/linux/mm.h @@ -640,7 +640,7 @@ struct vm_operations_struct { * policy. */ struct mempolicy *(*get_policy)(struct vm_area_struct *vma, - unsigned long addr); + unsigned long addr, pgoff_t *ilx); #endif /* * Called by vm_normal_page() for special PTEs to find the diff --git a/ipc/shm.c b/ipc/shm.c index 576a543b7cff..222aaf035afb 100644 --- a/ipc/shm.c +++ b/ipc/shm.c @@ -562,30 +562,25 @@ static unsigned long shm_pagesize(struct vm_area_struct *vma) }

#ifdef CONFIG_NUMA -static int shm_set_policy(struct vm_area_struct *vma, struct mempolicy *new) +static int shm_set_policy(struct vm_area_struct *vma, struct mempolicy *mpol) { - struct file *file = vma->vm_file; - struct shm_file_data *sfd = shm_file_data(file); + struct shm_file_data *sfd = shm_file_data(vma->vm_file); int err = 0;

if (sfd->vm_ops->set_policy) - err = sfd->vm_ops->set_policy(vma, new); + err = sfd->vm_ops->set_policy(vma, mpol); return err; }

static struct mempolicy *shm_get_policy(struct vm_area_struct *vma, - unsigned long addr) + unsigned long addr, pgoff_t *ilx) { - struct file *file = vma->vm_file; - struct shm_file_data *sfd = shm_file_data(file); - struct mempolicy *pol = NULL; + struct shm_file_data *sfd = shm_file_data(vma->vm_file); + struct mempolicy *mpol = vma->vm_policy;

if (sfd->vm_ops->get_policy) - pol = sfd->vm_ops->get_policy(vma, addr); - else if (vma->vm_policy) - pol = vma->vm_policy; - - return pol; + mpol = sfd->vm_ops->get_policy(vma, addr, ilx); + return mpol; } #endif

diff --git a/mm/mempolicy.c b/mm/mempolicy.c index 2ef6e8f09595..0bd3a89b9c3b 100644 --- a/mm/mempolicy.c +++ b/mm/mempolicy.c @@ -922,6 +922,7 @@ static long do_get_mempolicy(int *policy, nodemask_t *nmask, }

if (flags & MPOL_F_ADDR) { + pgoff_t ilx; /* ignored here */ /* * Do NOT fall back to task policy if the * vma/shared policy at addr is NULL. We @@ -933,10 +934,7 @@ static long do_get_mempolicy(int *policy, nodemask_t *nmask, mmap_read_unlock(mm); return -EFAULT; } - if (vma->vm_ops && vma->vm_ops->get_policy) - pol = vma->vm_ops->get_policy(vma, addr); - else - pol = vma->vm_policy; + pol = __get_vma_policy(vma, addr, &ilx); } else if (addr) return -EINVAL;

@@ -1194,6 +1192,15 @@ static struct folio *new_folio(struct folio *src, unsigned long start) break; }

+ /* + * __get_vma_policy() now expects a genuine non-NULL vma. Return NULL + * when the page can no longer be located in a vma: that is not ideal + * (migrate_pages() will give up early, presuming ENOMEM), but good + * enough to avoid a crash by syzkaller or concurrent holepunch. + */ + if (!vma) + return NULL; + if (folio_test_hugetlb(src)) { return alloc_hugetlb_folio_vma(folio_hstate(src), vma, address); @@ -1202,9 +1209,6 @@ static struct folio *new_folio(struct folio *src, unsigned long start) if (folio_test_large(src)) gfp = GFP_TRANSHUGE;

- /* - * if !vma, vma_alloc_folio() will use task or system default policy - */ return vma_alloc_folio(gfp, folio_order(src), vma, address, folio_test_large(src)); } @@ -1720,34 +1724,19 @@ bool vma_migratable(struct vm_area_struct *vma) }

struct mempolicy *__get_vma_policy(struct vm_area_struct *vma, - unsigned long addr) + unsigned long addr, pgoff_t *ilx) { - struct mempolicy *pol = NULL; - - if (vma) { - if (vma->vm_ops && vma->vm_ops->get_policy) { - pol = vma->vm_ops->get_policy(vma, addr); - } else if (vma->vm_policy) { - pol = vma->vm_policy; - - /* - * shmem_alloc_page() passes MPOL_F_SHARED policy with - * a pseudo vma whose vma->vm_ops=NULL. Take a reference - * count on these policies which will be dropped by - * mpol_cond_put() later - */ - if (mpol_needs_cond_ref(pol)) - mpol_get(pol); - } - } - - return pol; + *ilx = 0; + return (vma->vm_ops && vma->vm_ops->get_policy) ? + vma->vm_ops->get_policy(vma, addr, ilx) : vma->vm_policy; }

/* - * get_vma_policy(@vma, @addr) + * get_vma_policy(@vma, @addr, @order, @ilx) * @vma: virtual memory area whose policy is sought * @addr: address in @vma for shared policy lookup + * @order: 0, or appropriate huge_page_order for interleaving + * @ilx: interleave index (output), for use only when MPOL_INTERLEAVE * * Returns effective policy for a VMA at specified address. * Falls back to current->mempolicy or system default policy, as necessary. @@ -1756,14 +1745,18 @@ struct mempolicy *__get_vma_policy(struct vm_area_struct *vma, * freeing by another task. It is the caller's responsibility to free the * extra reference for shared policies. */ -static struct mempolicy *get_vma_policy(struct vm_area_struct *vma, - unsigned long addr) +struct mempolicy *get_vma_policy(struct vm_area_struct *vma, + unsigned long addr, int order, pgoff_t *ilx) { - struct mempolicy *pol = __get_vma_policy(vma, addr); + struct mempolicy *pol;

+ pol = __get_vma_policy(vma, addr, ilx); if (!pol) pol = get_task_policy(current); - + if (pol->mode == MPOL_INTERLEAVE) { + *ilx += vma->vm_pgoff >> order; + *ilx += (addr - vma->vm_start) >> (PAGE_SHIFT + order); + } return pol; }

@@ -1773,8 +1766,9 @@ bool vma_policy_mof(struct vm_area_struct *vma)

if (vma->vm_ops && vma->vm_ops->get_policy) { bool ret = false; + pgoff_t ilx; /* ignored here */

- pol = vma->vm_ops->get_policy(vma, vma->vm_start); + pol = vma->vm_ops->get_policy(vma, vma->vm_start, &ilx); if (pol && (pol->flags & MPOL_F_MOF)) ret = true; mpol_cond_put(pol); @@ -1809,54 +1803,6 @@ bool apply_policy_zone(struct mempolicy *policy, enum zone_type zone) return zone >= dynamic_policy_zone; }

-/* - * Return a nodemask representing a mempolicy for filtering nodes for - * page allocation - */ -nodemask_t *policy_nodemask(gfp_t gfp, struct mempolicy *policy) -{ - int mode = policy->mode; - - /* Lower zones don't get a nodemask applied for MPOL_BIND */ - if (unlikely(mode == MPOL_BIND) && - apply_policy_zone(policy, gfp_zone(gfp)) && - cpuset_nodemask_valid_mems_allowed(&policy->nodes)) - return &policy->nodes; - - if (mode == MPOL_PREFERRED_MANY) - return &policy->nodes; - - return NULL; -} - -/* - * Return the preferred node id for 'prefer' mempolicy, and return - * the given id for all other policies. - * - * policy_node() is always coupled with policy_nodemask(), which - * secures the nodemask limit for 'bind' and 'prefer-many' policy. - */ -static int policy_node(gfp_t gfp, struct mempolicy *policy, int nid) -{ - if (policy->mode == MPOL_PREFERRED) { - nid = first_node(policy->nodes); - } else { - /* - * __GFP_THISNODE shouldn't even be used with the bind policy - * because we might easily break the expectation to stay on the - * requested node and not break the policy. - */ - WARN_ON_ONCE(policy->mode == MPOL_BIND && (gfp & __GFP_THISNODE)); - } - - if ((policy->mode == MPOL_BIND || - policy->mode == MPOL_PREFERRED_MANY) && - policy->home_node != NUMA_NO_NODE) - return policy->home_node; - - return nid; -} - /* Do dynamic interleaving for a process */ static unsigned int interleave_nodes(struct mempolicy *policy) { @@ -1916,11 +1862,11 @@ unsigned int mempolicy_slab_node(void) }

/* - * Do static interleaving for a VMA with known offset @n. Returns the n'th - * node in pol->nodes (starting from n=0), wrapping around if n exceeds the - * number of present nodes. + * Do static interleaving for interleave index @ilx. Returns the ilx'th + * node in pol->nodes (starting from ilx=0), wrapping around if ilx + * exceeds the number of present nodes. */ -static unsigned offset_il_node(struct mempolicy *pol, unsigned long n) +static unsigned int interleave_nid(struct mempolicy *pol, pgoff_t ilx) { nodemask_t nodemask = pol->nodes; unsigned int target, nnodes; @@ -1938,33 +1884,59 @@ static unsigned offset_il_node(struct mempolicy *pol, unsigned long n) nnodes = nodes_weight(nodemask); if (!nnodes) return numa_node_id(); - target = (unsigned int)n % nnodes; + target = ilx % nnodes; nid = first_node(nodemask); for (i = 0; i < target; i++) nid = next_node(nid, nodemask); return nid; }

-/* Determine a node number for interleave */ -static inline unsigned interleave_nid(struct mempolicy *pol, - struct vm_area_struct *vma, unsigned long addr, int shift) +/* + * Return a nodemask representing a mempolicy for filtering nodes for + * page allocation, together with preferred node id (or the input node id). + */ +static nodemask_t *policy_nodemask(gfp_t gfp, struct mempolicy *pol, + pgoff_t ilx, int *nid) { - if (vma) { - unsigned long off; + nodemask_t *nodemask = NULL; + int tnid = NUMA_NO_NODE;

+ switch (pol->mode) { + case MPOL_PREFERRED: + /* Override input node id */ + *nid = first_node(pol->nodes); + break; + case MPOL_PREFERRED_MANY: + nodemask = &pol->nodes; + if (pol->home_node != NUMA_NO_NODE) + *nid = pol->home_node; + break; + case MPOL_BIND: + /* Restrict to nodemask (but not on lower zones) */ + if (apply_policy_zone(pol, gfp_zone(gfp)) && + cpuset_nodemask_valid_mems_allowed(&pol->nodes)) + nodemask = &pol->nodes; + if (pol->home_node != NUMA_NO_NODE) + *nid = pol->home_node; /* - * for small pages, there is no difference between - * shift and PAGE_SHIFT, so the bit-shift is safe. - * for huge pages, since vm_pgoff is in units of small - * pages, we need to shift off the always 0 bits to get - * a useful offset. + * __GFP_THISNODE shouldn't even be used with the bind policy + * because we might easily break the expectation to stay on the + * requested node and not break the policy. */ - BUG_ON(shift < PAGE_SHIFT); - off = vma->vm_pgoff >> (shift - PAGE_SHIFT); - off += (addr - vma->vm_start) >> shift; - return offset_il_node(pol, off); - } else - return interleave_nodes(pol); + WARN_ON_ONCE(gfp & __GFP_THISNODE); + break; + case MPOL_INTERLEAVE: + if (smart_grid_used()) + tnid = sched_grid_preferred_interleave_nid(pol); + if (tnid == NUMA_NO_NODE) + /* Override input node id */ + tnid = (ilx == NO_INTERLEAVE_INDEX) ? + interleave_nodes(pol) : interleave_nid(pol, ilx); + *nid = tnid; + break; + } + + return nodemask; }

#ifdef CONFIG_HUGETLBFS @@ -1980,27 +1952,16 @@ static inline unsigned interleave_nid(struct mempolicy *pol, * to the struct mempolicy for conditional unref after allocation. * If the effective policy is 'bind' or 'prefer-many', returns a pointer * to the mempolicy's @nodemask for filtering the zonelist. - * - * Must be protected by read_mems_allowed_begin() */ int huge_node(struct vm_area_struct *vma, unsigned long addr, gfp_t gfp_flags, - struct mempolicy **mpol, nodemask_t **nodemask) + struct mempolicy **mpol, nodemask_t **nodemask) { + pgoff_t ilx; int nid; - int mode; - - *mpol = get_vma_policy(vma, addr); - *nodemask = NULL; - mode = (*mpol)->mode;

- if (unlikely(mode == MPOL_INTERLEAVE)) { - nid = interleave_nid(*mpol, vma, addr, - huge_page_shift(hstate_vma(vma))); - } else { - nid = policy_node(gfp_flags, *mpol, numa_node_id()); - if (mode == MPOL_BIND || mode == MPOL_PREFERRED_MANY) - *nodemask = &(*mpol)->nodes; - } + nid = numa_node_id(); + *mpol = get_vma_policy(vma, addr, hstate_vma(vma)->order, &ilx); + *nodemask = policy_nodemask(gfp_flags, *mpol, ilx, &nid); return nid; }

@@ -2078,27 +2039,8 @@ bool mempolicy_in_oom_domain(struct task_struct *tsk, return ret; }

-/* Allocate a page in interleaved policy. - Own path because it needs to do special accounting. */ -static struct page *alloc_page_interleave(gfp_t gfp, unsigned order, - unsigned nid) -{ - struct page *page; - - page = __alloc_pages(gfp, order, nid, NULL); - /* skip NUMA_INTERLEAVE_HIT counter update if numa stats is disabled */ - if (!static_branch_likely(&vm_numa_stat_key)) - return page; - if (page && page_to_nid(page) == nid) { - preempt_disable(); - __count_numa_event(page_zone(page), NUMA_INTERLEAVE_HIT); - preempt_enable(); - } - return page; -} - static struct page *alloc_pages_preferred_many(gfp_t gfp, unsigned int order, - int nid, struct mempolicy *pol) + int nid, nodemask_t *nodemask) { struct page *page; gfp_t preferred_gfp; @@ -2111,7 +2053,7 @@ static struct page *alloc_pages_preferred_many(gfp_t gfp, unsigned int order, */ preferred_gfp = gfp | __GFP_NOWARN; preferred_gfp &= ~(__GFP_DIRECT_RECLAIM | __GFP_NOFAIL); - page = __alloc_pages(preferred_gfp, order, nid, &pol->nodes); + page = __alloc_pages(preferred_gfp, order, nid, nodemask); if (!page) page = __alloc_pages(gfp, order, nid, NULL);

@@ -2119,59 +2061,29 @@ static struct page *alloc_pages_preferred_many(gfp_t gfp, unsigned int order, }

/** - * vma_alloc_folio - Allocate a folio for a VMA. + * alloc_pages_mpol - Allocate pages according to NUMA mempolicy. * @gfp: GFP flags. - * @order: Order of the folio. - * @vma: Pointer to VMA or NULL if not available. - * @addr: Virtual address of the allocation. Must be inside @vma. - * @hugepage: For hugepages try only the preferred node if possible. - * - * Allocate a folio for a specific address in @vma, using the appropriate - * NUMA policy. When @vma is not NULL the caller must hold the mmap_lock - * of the mm_struct of the VMA to prevent it from going away. Should be - * used for all allocations for folios that will be mapped into user space. + * @order: Order of the page allocation. + * @pol: Pointer to the NUMA mempolicy. + * @ilx: Index for interleave mempolicy (also distinguishes alloc_pages()). + * @nid: Preferred node (usually numa_node_id() but @mpol may override it). * - * Return: The folio on success or NULL if allocation fails. + * Return: The page on success or NULL if allocation fails. */ -struct folio *vma_alloc_folio(gfp_t gfp, int order, struct vm_area_struct *vma, - unsigned long addr, bool hugepage) +struct page *alloc_pages_mpol(gfp_t gfp, unsigned int order, + struct mempolicy *pol, pgoff_t ilx, int nid) { - struct mempolicy *pol; - int node = numa_node_id(); - struct folio *folio; - int preferred_nid; - nodemask_t *nmask; - - pol = get_vma_policy(vma, addr); - - if (pol->mode == MPOL_INTERLEAVE) { - struct page *page; - int nid = NUMA_NO_NODE; - - if (smart_grid_used()) - nid = sched_grid_preferred_interleave_nid(pol); - if (nid == NUMA_NO_NODE) - nid = interleave_nid(pol, vma, addr, PAGE_SHIFT + order); - - mpol_cond_put(pol); - gfp |= __GFP_COMP; - page = alloc_page_interleave(gfp, order, nid); - return page_rmappable_folio(page); - } - - if (pol->mode == MPOL_PREFERRED_MANY) { - struct page *page; + nodemask_t *nodemask; + struct page *page;

- node = policy_node(gfp, pol, node); - gfp |= __GFP_COMP; - page = alloc_pages_preferred_many(gfp, order, node, pol); - mpol_cond_put(pol); - return page_rmappable_folio(page); - } + nodemask = policy_nodemask(gfp, pol, ilx, &nid);

- if (unlikely(IS_ENABLED(CONFIG_TRANSPARENT_HUGEPAGE) && hugepage)) { - int hpage_node = node; + if (pol->mode == MPOL_PREFERRED_MANY) + return alloc_pages_preferred_many(gfp, order, nid, nodemask);

+ if (IS_ENABLED(CONFIG_TRANSPARENT_HUGEPAGE) && + /* filter "hugepage" allocation, unless from alloc_pages() */ + order == HPAGE_PMD_ORDER && ilx != NO_INTERLEAVE_INDEX) { /* * For hugepage allocation and non-interleave policy which * allows the current node (or other explicitly preferred @@ -2182,41 +2094,69 @@ struct folio *vma_alloc_folio(gfp_t gfp, int order, struct vm_area_struct *vma, * If the policy is interleave or does not allow the current * node in its nodemask, we allocate the standard way. */ - if (pol->mode == MPOL_PREFERRED) - hpage_node = first_node(pol->nodes); - - nmask = policy_nodemask(gfp, pol); - if (!nmask || node_isset(hpage_node, *nmask)) { - mpol_cond_put(pol); + if (pol->mode != MPOL_INTERLEAVE && + (!nodemask || node_isset(nid, *nodemask))) { /* * First, try to allocate THP only on local node, but * don't reclaim unnecessarily, just compact. */ - folio = __folio_alloc_node(gfp | __GFP_THISNODE | - __GFP_NORETRY, order, hpage_node); - + page = __alloc_pages_node(nid, + gfp | __GFP_THISNODE | __GFP_NORETRY, order); + if (page || !(gfp & __GFP_DIRECT_RECLAIM)) + return page; /* * If hugepage allocations are configured to always * synchronous compact or the vma has been madvised * to prefer hugepage backing, retry allowing remote * memory with both reclaim and compact as well. */ - if (!folio && (gfp & __GFP_DIRECT_RECLAIM)) - folio = __folio_alloc(gfp, order, hpage_node, - nmask); - - goto out; } } + if (pol->mode != MPOL_INTERLEAVE && smart_grid_used()) + nid = sched_grid_preferred_nid(nid, nodemask); + page = __alloc_pages(gfp, order, nid, nodemask); + + if (unlikely(pol->mode == MPOL_INTERLEAVE) && page) { + /* skip NUMA_INTERLEAVE_HIT update if numa stats is disabled */ + if (static_branch_likely(&vm_numa_stat_key) && + page_to_nid(page) == nid) { + preempt_disable(); + __count_numa_event(page_zone(page), NUMA_INTERLEAVE_HIT); + preempt_enable(); + } + } + + return page; +} + +/** + * vma_alloc_folio - Allocate a folio for a VMA. + * @gfp: GFP flags. + * @order: Order of the folio. + * @vma: Pointer to VMA. + * @addr: Virtual address of the allocation. Must be inside @vma. + * @hugepage: Unused (was: For hugepages try only preferred node if possible). + * + * Allocate a folio for a specific address in @vma, using the appropriate + * NUMA policy. The caller must hold the mmap_lock of the mm_struct of the + * VMA to prevent it from going away. Should be used for all allocations + * for folios that will be mapped into user space, excepting hugetlbfs, and + * excepting where direct use of alloc_pages_mpol() is more appropriate. + * + * Return: The folio on success or NULL if allocation fails. + */ +struct folio *vma_alloc_folio(gfp_t gfp, int order, struct vm_area_struct *vma, + unsigned long addr, bool hugepage) +{ + struct mempolicy *pol; + pgoff_t ilx; + struct page *page;

- nmask = policy_nodemask(gfp, pol); - preferred_nid = policy_node(gfp, pol, node); - if (smart_grid_used()) - preferred_nid = sched_grid_preferred_nid(preferred_nid, nmask); - folio = __folio_alloc(gfp, order, preferred_nid, nmask); + pol = get_vma_policy(vma, addr, order, &ilx); + page = alloc_pages_mpol(gfp | __GFP_COMP, order, + pol, ilx, numa_node_id()); mpol_cond_put(pol); -out: - return folio; + return page_rmappable_folio(page); } EXPORT_SYMBOL(vma_alloc_folio);

@@ -2234,33 +2174,23 @@ EXPORT_SYMBOL(vma_alloc_folio); * flags are used. * Return: The page on success or NULL if allocation fails. */ -struct page *alloc_pages(gfp_t gfp, unsigned order) +struct page *alloc_pages(gfp_t gfp, unsigned int order) { struct mempolicy *pol = &default_policy; - struct page *page; - - if (!in_interrupt() && !(gfp & __GFP_THISNODE)) - pol = get_task_policy(current);

/* * No reference counting needed for current->mempolicy * nor system default_policy */ - if (pol->mode == MPOL_INTERLEAVE) - page = alloc_page_interleave(gfp, order, interleave_nodes(pol)); - else if (pol->mode == MPOL_PREFERRED_MANY) - page = alloc_pages_preferred_many(gfp, order, - policy_node(gfp, pol, numa_node_id()), pol); - else - page = __alloc_pages(gfp, order, - policy_node(gfp, pol, numa_node_id()), - policy_nodemask(gfp, pol)); + if (!in_interrupt() && !(gfp & __GFP_THISNODE)) + pol = get_task_policy(current);

- return page; + return alloc_pages_mpol(gfp, order, + pol, NO_INTERLEAVE_INDEX, numa_node_id()); } EXPORT_SYMBOL(alloc_pages);

-struct folio *folio_alloc(gfp_t gfp, unsigned order) +struct folio *folio_alloc(gfp_t gfp, unsigned int order) { return page_rmappable_folio(alloc_pages(gfp | __GFP_COMP, order)); } @@ -2331,6 +2261,8 @@ unsigned long alloc_pages_bulk_array_mempolicy(gfp_t gfp, unsigned long nr_pages, struct page **page_array) { struct mempolicy *pol = &default_policy; + nodemask_t *nodemask; + int nid;

if (!in_interrupt() && !(gfp & __GFP_THISNODE)) pol = get_task_policy(current); @@ -2343,9 +2275,10 @@ unsigned long alloc_pages_bulk_array_mempolicy(gfp_t gfp, return alloc_pages_bulk_array_preferred_many(gfp, numa_node_id(), pol, nr_pages, page_array);

- return __alloc_pages_bulk(gfp, policy_node(gfp, pol, numa_node_id()), - policy_nodemask(gfp, pol), nr_pages, NULL, - page_array); + nid = numa_node_id(); + nodemask = policy_nodemask(gfp, pol, NO_INTERLEAVE_INDEX, &nid); + return __alloc_pages_bulk(gfp, nid, nodemask, + nr_pages, NULL, page_array); }

int vma_dup_policy(struct vm_area_struct *src, struct vm_area_struct *dst) @@ -2532,23 +2465,21 @@ int mpol_misplaced(struct folio *folio, struct vm_area_struct *vma, unsigned long addr) { struct mempolicy *pol; + pgoff_t ilx; struct zoneref *z; int curnid = folio_nid(folio); - unsigned long pgoff; int thiscpu = raw_smp_processor_id(); int thisnid = cpu_to_node(thiscpu); int polnid = NUMA_NO_NODE; int ret = NUMA_NO_NODE;

- pol = get_vma_policy(vma, addr); + pol = get_vma_policy(vma, addr, folio_order(folio), &ilx); if (!(pol->flags & MPOL_F_MOF)) goto out;

switch (pol->mode) { case MPOL_INTERLEAVE: - pgoff = vma->vm_pgoff; - pgoff += (addr - vma->vm_start) >> PAGE_SHIFT; - polnid = offset_il_node(pol, pgoff); + polnid = interleave_nid(pol, ilx); break;

case MPOL_PREFERRED: diff --git a/mm/shmem.c b/mm/shmem.c index a7550982a13d..4a989eeaa616 100644 --- a/mm/shmem.c +++ b/mm/shmem.c @@ -1579,38 +1579,20 @@ static inline struct mempolicy *shmem_get_sbmpol(struct shmem_sb_info *sbinfo) return NULL; } #endif /* CONFIG_NUMA && CONFIG_TMPFS */ -#ifndef CONFIG_NUMA -#define vm_policy vm_private_data -#endif

-static void shmem_pseudo_vma_init(struct vm_area_struct *vma, - struct shmem_inode_info *info, pgoff_t index) -{ - /* Create a pseudo vma that just contains the policy */ - vma_init(vma, NULL); - /* Bias interleave by inode number to distribute better across nodes */ - vma->vm_pgoff = index + info->vfs_inode.i_ino; - vma->vm_policy = mpol_shared_policy_lookup(&info->policy, index); -} +static struct mempolicy *shmem_get_pgoff_policy(struct shmem_inode_info *info, + pgoff_t index, unsigned int order, pgoff_t *ilx);

-static void shmem_pseudo_vma_destroy(struct vm_area_struct *vma) -{ - /* Drop reference taken by mpol_shared_policy_lookup() */ - mpol_cond_put(vma->vm_policy); -} - -static struct folio *shmem_swapin(swp_entry_t swap, gfp_t gfp, +static struct folio *shmem_swapin_cluster(swp_entry_t swap, gfp_t gfp, struct shmem_inode_info *info, pgoff_t index) { - struct vm_area_struct pvma; + struct mempolicy *mpol; + pgoff_t ilx; struct page *page; - struct vm_fault vmf = { - .vma = &pvma, - };

- shmem_pseudo_vma_init(&pvma, info, index); - page = swap_cluster_readahead(swap, gfp, &vmf); - shmem_pseudo_vma_destroy(&pvma); + mpol = shmem_get_pgoff_policy(info, index, 0, &ilx); + page = swap_cluster_readahead(swap, gfp, mpol, ilx); + mpol_cond_put(mpol);

if (!page) return NULL; @@ -1644,35 +1626,36 @@ static gfp_t limit_gfp_mask(gfp_t huge_gfp, gfp_t limit_gfp) static struct folio *shmem_alloc_hugefolio(gfp_t gfp, struct shmem_inode_info *info, pgoff_t index) { - struct vm_area_struct pvma; + struct mempolicy *mpol; struct address_space *mapping = info->vfs_inode.i_mapping; - pgoff_t hindex; - struct folio *folio; + pgoff_t hindex, ilx; + struct page *page;

hindex = round_down(index, HPAGE_PMD_NR); if (xa_find(&mapping->i_pages, &hindex, hindex + HPAGE_PMD_NR - 1, XA_PRESENT)) return NULL;

- shmem_pseudo_vma_init(&pvma, info, hindex); - folio = vma_alloc_folio(gfp, HPAGE_PMD_ORDER, &pvma, 0, true); - shmem_pseudo_vma_destroy(&pvma); - if (!folio) + mpol = shmem_get_pgoff_policy(info, index, HPAGE_PMD_ORDER, &ilx); + page = alloc_pages_mpol(gfp, HPAGE_PMD_ORDER, mpol, ilx, numa_node_id()); + mpol_cond_put(mpol); + if (!page) count_vm_event(THP_FILE_FALLBACK); - return folio; + return page_rmappable_folio(page); }

static struct folio *shmem_alloc_folio(gfp_t gfp, struct shmem_inode_info *info, pgoff_t index) { - struct vm_area_struct pvma; - struct folio *folio; + struct mempolicy *mpol; + pgoff_t ilx; + struct page *page;

- shmem_pseudo_vma_init(&pvma, info, index); - folio = vma_alloc_folio(gfp, 0, &pvma, 0, false); - shmem_pseudo_vma_destroy(&pvma); + mpol = shmem_get_pgoff_policy(info, index, 0, &ilx); + page = alloc_pages_mpol(gfp, 0, mpol, ilx, numa_node_id()); + mpol_cond_put(mpol);

- return folio; + return (struct folio *)page; }

static struct folio *shmem_alloc_and_acct_folio(gfp_t gfp, struct inode *inode, @@ -1868,7 +1851,7 @@ static int shmem_swapin_folio(struct inode *inode, pgoff_t index, count_memcg_event_mm(charge_mm, PGMAJFAULT); } /* Here we actually start the io */ - folio = shmem_swapin(swap, gfp, info, index); + folio = shmem_swapin_cluster(swap, gfp, info, index); if (!folio) { error = -ENOMEM; goto failed; @@ -2352,15 +2335,41 @@ static int shmem_set_policy(struct vm_area_struct *vma, struct mempolicy *mpol) }

static struct mempolicy *shmem_get_policy(struct vm_area_struct *vma, - unsigned long addr) + unsigned long addr, pgoff_t *ilx) { struct inode *inode = file_inode(vma->vm_file); pgoff_t index;

+ /* + * Bias interleave by inode number to distribute better across nodes; + * but this interface is independent of which page order is used, so + * supplies only that bias, letting caller apply the offset (adjusted + * by page order, as in shmem_get_pgoff_policy() and get_vma_policy()). + */ + *ilx = inode->i_ino; index = ((addr - vma->vm_start) >> PAGE_SHIFT) + vma->vm_pgoff; return mpol_shared_policy_lookup(&SHMEM_I(inode)->policy, index); } -#endif + +static struct mempolicy *shmem_get_pgoff_policy(struct shmem_inode_info *info, + pgoff_t index, unsigned int order, pgoff_t *ilx) +{ + struct mempolicy *mpol; + + /* Bias interleave by inode number to distribute better across nodes */ + *ilx = info->vfs_inode.i_ino + (index >> order); + + mpol = mpol_shared_policy_lookup(&info->policy, index); + return mpol ? mpol : get_task_policy(current); +} +#else +static struct mempolicy *shmem_get_pgoff_policy(struct shmem_inode_info *info, + pgoff_t index, unsigned int order, pgoff_t *ilx) +{ + *ilx = 0; + return NULL; +} +#endif /* CONFIG_NUMA */

int shmem_lock(struct file *file, int lock, struct ucounts *ucounts) { diff --git a/mm/swap.h b/mm/swap.h index 693d1b281559..3501fdf5f1a6 100644 --- a/mm/swap.h +++ b/mm/swap.h @@ -2,6 +2,8 @@ #ifndef _MM_SWAP_H #define _MM_SWAP_H

+struct mempolicy; + #ifdef CONFIG_SWAP #include <linux/blk_types.h> /* for bio_end_io_t */

@@ -49,11 +51,10 @@ struct page *read_swap_cache_async(swp_entry_t entry, gfp_t gfp_mask, unsigned long addr, struct swap_iocb **plug); struct page *__read_swap_cache_async(swp_entry_t entry, gfp_t gfp_mask, - struct vm_area_struct *vma, - unsigned long addr, + struct mempolicy *mpol, pgoff_t ilx, bool *new_page_allocated); struct page *swap_cluster_readahead(swp_entry_t entry, gfp_t flag, - struct vm_fault *vmf); + struct mempolicy *mpol, pgoff_t ilx); struct page *swapin_readahead(swp_entry_t entry, gfp_t flag, struct vm_fault *vmf);

@@ -81,7 +82,7 @@ static inline void show_swap_cache_info(void) }

static inline struct page *swap_cluster_readahead(swp_entry_t entry, - gfp_t gfp_mask, struct vm_fault *vmf) + gfp_t gfp_mask, struct mempolicy *mpol, pgoff_t ilx) { return NULL; } diff --git a/mm/swap_state.c b/mm/swap_state.c index 40b84dc47974..412416c6093a 100644 --- a/mm/swap_state.c +++ b/mm/swap_state.c @@ -10,6 +10,7 @@ #include <linux/mm.h> #include <linux/gfp.h> #include <linux/kernel_stat.h> +#include <linux/mempolicy.h> #include <linux/swap.h> #include <linux/swapops.h> #include <linux/init.h> @@ -422,8 +423,8 @@ struct folio *filemap_get_incore_folio(struct address_space *mapping, }

struct page *__read_swap_cache_async(swp_entry_t entry, gfp_t gfp_mask, - struct vm_area_struct *vma, unsigned long addr, - bool *new_page_allocated) + struct mempolicy *mpol, pgoff_t ilx, + bool *new_page_allocated) { struct swap_info_struct *si; struct folio *folio; @@ -465,7 +466,8 @@ struct page *__read_swap_cache_async(swp_entry_t entry, gfp_t gfp_mask, * before marking swap_map SWAP_HAS_CACHE, when -EEXIST will * cause any racers to loop around until we add it to cache. */ - folio = vma_alloc_folio(gfp_mask, 0, vma, addr, false); + folio = (struct folio *)alloc_pages_mpol(gfp_mask, 0, + mpol, ilx, numa_node_id()); if (!folio) goto fail_put_swap;

@@ -540,14 +542,19 @@ struct page *read_swap_cache_async(swp_entry_t entry, gfp_t gfp_mask, struct vm_area_struct *vma, unsigned long addr, struct swap_iocb **plug) { - bool page_was_allocated; - struct page *retpage = __read_swap_cache_async(entry, gfp_mask, - vma, addr, &page_was_allocated); + bool page_allocated; + struct mempolicy *mpol; + pgoff_t ilx; + struct page *page;

- if (page_was_allocated) - swap_readpage(retpage, false, plug); + mpol = get_vma_policy(vma, addr, 0, &ilx); + page = __read_swap_cache_async(entry, gfp_mask, mpol, ilx, + &page_allocated); + mpol_cond_put(mpol);

- return retpage; + if (page_allocated) + swap_readpage(page, false, plug); + return page; }

static unsigned int __swapin_nr_pages(unsigned long prev_offset, @@ -615,7 +622,8 @@ static unsigned long swapin_nr_pages(unsigned long offset) * swap_cluster_readahead - swap in pages in hope we need them soon * @entry: swap entry of this memory * @gfp_mask: memory allocation flags - * @vmf: fault information + * @mpol: NUMA memory allocation policy to be applied + * @ilx: NUMA interleave index, for use only when MPOL_INTERLEAVE * * Returns the struct page for entry and addr, after queueing swapin. * @@ -624,13 +632,12 @@ static unsigned long swapin_nr_pages(unsigned long offset) * because it doesn't cost us any seek time. We also make sure to queue * the 'original' request together with the readahead ones... * - * This has been extended to use the NUMA policies from the mm triggering - * the readahead. - * - * Caller must hold read mmap_lock if vmf->vma is not NULL. + * Note: it is intentional that the same NUMA policy and interleave index + * are used for every page of the readahead: neighbouring pages on swap + * are fairly likely to have been swapped out from the same node. */ struct page *swap_cluster_readahead(swp_entry_t entry, gfp_t gfp_mask, - struct vm_fault *vmf) + struct mempolicy *mpol, pgoff_t ilx) { struct page *page; unsigned long entry_offset = swp_offset(entry); @@ -641,8 +648,6 @@ struct page *swap_cluster_readahead(swp_entry_t entry, gfp_t gfp_mask, struct blk_plug plug; struct swap_iocb *splug = NULL; bool page_allocated; - struct vm_area_struct *vma = vmf->vma; - unsigned long addr = vmf->address;

mask = swapin_nr_pages(offset) - 1; if (!mask) @@ -660,8 +665,8 @@ struct page *swap_cluster_readahead(swp_entry_t entry, gfp_t gfp_mask, for (offset = start_offset; offset <= end_offset ; offset++) { /* Ok, do the async read-ahead now */ page = __read_swap_cache_async( - swp_entry(swp_type(entry), offset), - gfp_mask, vma, addr, &page_allocated); + swp_entry(swp_type(entry), offset), + gfp_mask, mpol, ilx, &page_allocated); if (!page) continue; if (page_allocated) { @@ -675,11 +680,14 @@ struct page *swap_cluster_readahead(swp_entry_t entry, gfp_t gfp_mask, } blk_finish_plug(&plug); swap_read_unplug(splug); - lru_add_drain(); /* Push any new pages onto the LRU now */ skip: /* The page was likely read above, so no need for plugging here */ - return read_swap_cache_async(entry, gfp_mask, vma, addr, NULL); + page = __read_swap_cache_async(entry, gfp_mask, mpol, ilx, + &page_allocated); + if (unlikely(page_allocated)) + swap_readpage(page, false, NULL); + return page; }

int init_swap_address_space(unsigned int type, unsigned long nr_pages) @@ -777,8 +785,10 @@ static void swap_ra_info(struct vm_fault *vmf,

/** * swap_vma_readahead - swap in pages in hope we need them soon - * @fentry: swap entry of this memory + * @targ_entry: swap entry of the targeted memory * @gfp_mask: memory allocation flags + * @mpol: NUMA memory allocation policy to be applied + * @targ_ilx: NUMA interleave index, for use only when MPOL_INTERLEAVE * @vmf: fault information * * Returns the struct page for entry and addr, after queueing swapin. @@ -789,16 +799,17 @@ static void swap_ra_info(struct vm_fault *vmf, * Caller must hold read mmap_lock if vmf->vma is not NULL. * */ -static struct page *swap_vma_readahead(swp_entry_t fentry, gfp_t gfp_mask, +static struct page *swap_vma_readahead(swp_entry_t targ_entry, gfp_t gfp_mask, + struct mempolicy *mpol, pgoff_t targ_ilx, struct vm_fault *vmf) { struct blk_plug plug; struct swap_iocb *splug = NULL; - struct vm_area_struct *vma = vmf->vma; struct page *page; pte_t *pte = NULL, pentry; unsigned long addr; swp_entry_t entry; + pgoff_t ilx; unsigned int i; bool page_allocated; struct vma_swap_readahead ra_info = { @@ -810,9 +821,10 @@ static struct page *swap_vma_readahead(swp_entry_t fentry, gfp_t gfp_mask, goto skip;

addr = vmf->address - (ra_info.offset * PAGE_SIZE); + ilx = targ_ilx - ra_info.offset;

blk_start_plug(&plug); - for (i = 0; i < ra_info.nr_pte; i++, addr += PAGE_SIZE) { + for (i = 0; i < ra_info.nr_pte; i++, ilx++, addr += PAGE_SIZE) { if (!pte++) { pte = pte_offset_map(vmf->pmd, addr); if (!pte) @@ -826,8 +838,8 @@ static struct page *swap_vma_readahead(swp_entry_t fentry, gfp_t gfp_mask, continue; pte_unmap(pte); pte = NULL; - page = __read_swap_cache_async(entry, gfp_mask, vma, - addr, &page_allocated); + page = __read_swap_cache_async(entry, gfp_mask, mpol, ilx, + &page_allocated); if (!page) continue; if (page_allocated) { @@ -846,8 +858,11 @@ static struct page *swap_vma_readahead(swp_entry_t fentry, gfp_t gfp_mask, lru_add_drain(); skip: /* The page was likely read above, so no need for plugging here */ - return read_swap_cache_async(fentry, gfp_mask, vma, vmf->address, - NULL); + page = __read_swap_cache_async(targ_entry, gfp_mask, mpol, targ_ilx, + &page_allocated); + if (unlikely(page_allocated)) + swap_readpage(page, false, NULL); + return page; }

/** @@ -865,9 +880,16 @@ static struct page *swap_vma_readahead(swp_entry_t fentry, gfp_t gfp_mask, struct page *swapin_readahead(swp_entry_t entry, gfp_t gfp_mask, struct vm_fault *vmf) { - return swap_use_vma_readahead() ? - swap_vma_readahead(entry, gfp_mask, vmf) : - swap_cluster_readahead(entry, gfp_mask, vmf); + struct mempolicy *mpol; + pgoff_t ilx; + struct page *page; + + mpol = get_vma_policy(vmf->vma, vmf->address, 0, &ilx); + page = swap_use_vma_readahead() ? + swap_vma_readahead(entry, gfp_mask, mpol, ilx, vmf) : + swap_cluster_readahead(entry, gfp_mask, mpol, ilx); + mpol_cond_put(mpol); + return page; }

#ifdef CONFIG_SYSFS diff --git a/mm/zswap.c b/mm/zswap.c index 69681b9173fd..c1da41aef0d7 100644 --- a/mm/zswap.c +++ b/mm/zswap.c @@ -24,6 +24,7 @@ #include <linux/swap.h> #include <linux/crypto.h> #include <linux/scatterlist.h> +#include <linux/mempolicy.h> #include <linux/mempool.h> #include <linux/zpool.h> #include <crypto/acompress.h> @@ -1057,6 +1058,7 @@ static int zswap_writeback_entry(struct zswap_entry *entry, { swp_entry_t swpentry = entry->swpentry; struct page *page; + struct mempolicy *mpol; struct scatterlist input, output; struct crypto_acomp_ctx *acomp_ctx; struct zpool *pool = zswap_find_zpool(entry); @@ -1075,8 +1077,9 @@ static int zswap_writeback_entry(struct zswap_entry *entry, }

/* try to allocate swap cache page */ - page = __read_swap_cache_async(swpentry, GFP_KERNEL, NULL, 0, - &page_was_allocated); + mpol = get_task_policy(current); + page = __read_swap_cache_async(swpentry, GFP_KERNEL, mpol, + NO_INTERLEAVE_INDEX, &page_was_allocated); if (!page) { ret = -ENOMEM; goto fail;

-- 2.33.0

patchwork bot

10:10 p.m.

New subject: [PATCH OLK-6.6 v2] mempolicy: alloc_pages_mpol() for NUMA policy without vma

反馈: 您发送到kernel@openeuler.org的补丁/补丁集，转换为PR失败！邮件列表地址：https://mailweb.openeuler.org/hyperkitty/list/kernel@openeuler.org/message/U... 失败原因：应用补丁/补丁集失败，Patch failed at 0001 mempolicy: alloc_pages_mpol() for NUMA policy without vma 建议解决方法：请查看失败原因，确认补丁是否可以应用在当前期望分支的最新代码上

FeedBack: The patch(es) which you have sent to kernel@openeuler.org has been converted to PR failed! Mailing list address: https://mailweb.openeuler.org/hyperkitty/list/kernel@openeuler.org/message/U... Failed Reason: apply patch(es) failed, Patch failed at 0001 mempolicy: alloc_pages_mpol() for NUMA policy without vma Suggest Solution: please checkout if the failed patch(es) can work on the newest codes in expected branch

Ze Zuo

9:49 p.m.

New subject: [PATCH OLK-6.6 v2] mempolicy: mmap_lock is not needed while migrating folios

From: Hugh Dickins hughd@google.com

mainline inclusion from mainline-v6.7-rc1 commit 72e315f7a750281b4410ac30d8930f735459e72d category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I9OHHN CVE: NA

Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i...

--------------------------------

mbind(2) holds down_write of current task's mmap_lock throughout (exclusive because it needs to set the new mempolicy on the vmas); migrate_pages(2) holds down_read of pid's mmap_lock throughout.

They both hold mmap_lock across the internal migrate_pages(), under which all new page allocations (huge or small) are made. I'm nervous about it; and migrate_pages() certainly does not need mmap_lock itself. It's done this way for mbind(2), because its page allocator is vma_alloc_folio() or alloc_hugetlb_folio_vma(), both of which depend on vma and address.

Now that we have alloc_pages_mpol(), depending on (refcounted) memory policy and interleave index, mbind(2) can be modified to use that or alloc_hugetlb_folio_nodemask(), and then not need mmap_lock across the internal migrate_pages() at all: add alloc_migration_target_by_mpol() to replace mbind's new_page().

(After that change, alloc_hugetlb_folio_vma() is used by nothing but a userfaultfd function: move it out of hugetlb.h and into the #ifdef.)

migrate_pages(2) has chosen its target node before migrating, so can continue to use the standard alloc_migration_target(); but let it take and drop mmap_lock just around migrate_to_node()'s queue_pages_range(): neither the node-to-node calculations nor the page migrations need it.

It seems unlikely, but it is conceivable that some userspace depends on the kernel's mmap_lock exclusion here, instead of doing its own locking: more likely in a testsuite than in real life. It is also possible, of course, that some pages on the list will be munmapped by another thread before they are migrated, or a newer memory policy applied to the range by that time: but such races could happen before, as soon as mmap_lock was dropped, so it does not appear to be a concern.

Link: https://lkml.kernel.org/r/21e564e8-269f-6a89-7ee2-fd612831c289@google.com Signed-off-by: Hugh Dickins hughd@google.com Cc: Andi Kleen ak@linux.intel.com Cc: Christoph Lameter cl@linux.com Cc: David Hildenbrand david@redhat.com Cc: Greg Kroah-Hartman gregkh@linuxfoundation.org Cc: "Huang, Ying" ying.huang@intel.com Cc: Kefeng Wang wangkefeng.wang@huawei.com Cc: Matthew Wilcox (Oracle) willy@infradead.org Cc: Mel Gorman mgorman@techsingularity.net Cc: Michal Hocko mhocko@suse.com Cc: Mike Kravetz mike.kravetz@oracle.com Cc: Nhat Pham nphamcs@gmail.com Cc: Sidhartha Kumar sidhartha.kumar@oracle.com Cc: Suren Baghdasaryan surenb@google.com Cc: Tejun heo tj@kernel.org Cc: Vishal Moola (Oracle) vishal.moola@gmail.com Cc: Yang Shi shy828301@gmail.com Cc: Yosry Ahmed yosryahmed@google.com Signed-off-by: Andrew Morton akpm@linux-foundation.org Signed-off-by: Ze Zuo zuoze1@huawei.com --- include/linux/hugetlb.h | 9 ----- mm/hugetlb.c | 38 ++++++++++--------- mm/mempolicy.c | 83 +++++++++++++++++++++-------------------- 3 files changed, 63 insertions(+), 67 deletions(-)

diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h index b9a78a93859a..616eab7d2d9e 100644 --- a/include/linux/hugetlb.h +++ b/include/linux/hugetlb.h @@ -800,8 +800,6 @@ struct folio *alloc_hugetlb_folio(struct vm_area_struct *vma, unsigned long addr, int avoid_reserve); struct folio *alloc_hugetlb_folio_nodemask(struct hstate *h, int preferred_nid, nodemask_t *nmask, gfp_t gfp_mask); -struct folio *alloc_hugetlb_folio_vma(struct hstate *h, struct vm_area_struct *vma, - unsigned long address); int hugetlb_add_to_page_cache(struct folio *folio, struct address_space *mapping, pgoff_t idx); void restore_reserve_on_error(struct hstate *h, struct vm_area_struct *vma, @@ -1129,13 +1127,6 @@ alloc_hugetlb_folio_nodemask(struct hstate *h, int preferred_nid, return NULL; }

-static inline struct folio *alloc_hugetlb_folio_vma(struct hstate *h, - struct vm_area_struct *vma, - unsigned long address) -{ - return NULL; -} - static inline int __alloc_bootmem_huge_page(struct hstate *h) { return 0; diff --git a/mm/hugetlb.c b/mm/hugetlb.c index 6f90d0845c43..1c492f3718c7 100644 --- a/mm/hugetlb.c +++ b/mm/hugetlb.c @@ -2574,24 +2574,6 @@ struct folio *alloc_hugetlb_folio_nodemask(struct hstate *h, int preferred_nid, return alloc_migrate_hugetlb_folio(h, gfp_mask, preferred_nid, nmask); }

-/* mempolicy aware migration callback */ -struct folio *alloc_hugetlb_folio_vma(struct hstate *h, struct vm_area_struct *vma, - unsigned long address) -{ - struct mempolicy *mpol; - nodemask_t *nodemask; - struct folio *folio; - gfp_t gfp_mask; - int node; - - gfp_mask = htlb_alloc_mask(h); - node = huge_node(vma, address, gfp_mask, &mpol, &nodemask); - folio = alloc_hugetlb_folio_nodemask(h, node, nodemask, gfp_mask); - mpol_cond_put(mpol); - - return folio; -} - /* * Increase the hugetlb pool such that it can accommodate a reservation * of size 'delta'. @@ -6459,6 +6441,26 @@ vm_fault_t hugetlb_fault(struct mm_struct *mm, struct vm_area_struct *vma, }

#ifdef CONFIG_USERFAULTFD +/* + * Can probably be eliminated, but still used by hugetlb_mfill_atomic_pte(). + */ +static struct folio *alloc_hugetlb_folio_vma(struct hstate *h, + struct vm_area_struct *vma, unsigned long address) +{ + struct mempolicy *mpol; + nodemask_t *nodemask; + struct folio *folio; + gfp_t gfp_mask; + int node; + + gfp_mask = htlb_alloc_mask(h); + node = huge_node(vma, address, gfp_mask, &mpol, &nodemask); + folio = alloc_hugetlb_folio_nodemask(h, node, nodemask, gfp_mask); + mpol_cond_put(mpol); + + return folio; +} + /* * Used by userfaultfd UFFDIO_* ioctls. Based on userfaultfd's mfill_atomic_pte * with modifications for hugetlb pages. diff --git a/mm/mempolicy.c b/mm/mempolicy.c index 0bd3a89b9c3b..bc45ab810f6b 100644 --- a/mm/mempolicy.c +++ b/mm/mempolicy.c @@ -420,6 +420,8 @@ static const struct mempolicy_operations mpol_ops[MPOL_MAX] = {

static bool migrate_folio_add(struct folio *folio, struct list_head *foliolist, unsigned long flags); +static nodemask_t *policy_nodemask(gfp_t gfp, struct mempolicy *pol, + pgoff_t ilx, int *nid);

static bool strictly_unmovable(unsigned long flags) { @@ -1045,6 +1047,8 @@ static long migrate_to_node(struct mm_struct *mm, int source, int dest, node_set(source, nmask);

VM_BUG_ON(!(flags & (MPOL_MF_MOVE | MPOL_MF_MOVE_ALL))); + + mmap_read_lock(mm); vma = find_vma(mm, 0);

/* @@ -1055,6 +1059,7 @@ static long migrate_to_node(struct mm_struct *mm, int source, int dest, */ nr_failed = queue_pages_range(mm, vma->vm_start, mm->task_size, &nmask, flags | MPOL_MF_DISCONTIG_OK, &pagelist); + mmap_read_unlock(mm);

if (!list_empty(&pagelist)) { err = migrate_pages(&pagelist, alloc_migration_target, NULL, @@ -1083,8 +1088,6 @@ int do_migrate_pages(struct mm_struct *mm, const nodemask_t *from,

lru_cache_disable();

- mmap_read_lock(mm); - /* * Find a 'source' bit set in 'tmp' whose corresponding 'dest' * bit in 'to' is not also set in 'tmp'. Clear the found 'source' @@ -1164,7 +1167,6 @@ int do_migrate_pages(struct mm_struct *mm, const nodemask_t *from, if (err < 0) break; } - mmap_read_unlock(mm);

lru_cache_enable(); if (err < 0) @@ -1173,44 +1175,38 @@ int do_migrate_pages(struct mm_struct *mm, const nodemask_t *from, }

/* - * Allocate a new page for page migration based on vma policy. - * Start by assuming the page is mapped by the same vma as contains @start. - * Search forward from there, if not. N.B., this assumes that the - * list of pages handed to migrate_pages()--which is how we get here-- - * is in virtual address order. + * Allocate a new folio for page migration, according to NUMA mempolicy. */ -static struct folio *new_folio(struct folio *src, unsigned long start) +static struct folio *alloc_migration_target_by_mpol(struct folio *src, + unsigned long private) { - struct vm_area_struct *vma; - unsigned long address; - VMA_ITERATOR(vmi, current->mm, start); - gfp_t gfp = GFP_HIGHUSER_MOVABLE | __GFP_RETRY_MAYFAIL; - - for_each_vma(vmi, vma) { - address = page_address_in_vma(&src->page, vma); - if (address != -EFAULT) - break; - } + struct mempolicy *pol = (struct mempolicy *)private; + pgoff_t ilx = 0; /* improve on this later */ + struct page *page; + unsigned int order; + int nid = numa_node_id(); + gfp_t gfp;

- /* - * __get_vma_policy() now expects a genuine non-NULL vma. Return NULL - * when the page can no longer be located in a vma: that is not ideal - * (migrate_pages() will give up early, presuming ENOMEM), but good - * enough to avoid a crash by syzkaller or concurrent holepunch. - */ - if (!vma) - return NULL; + order = folio_order(src); + ilx += src->index >> order;

if (folio_test_hugetlb(src)) { - return alloc_hugetlb_folio_vma(folio_hstate(src), - vma, address); + nodemask_t *nodemask; + struct hstate *h; + + h = folio_hstate(src); + gfp = htlb_alloc_mask(h); + nodemask = policy_nodemask(gfp, pol, ilx, &nid); + return alloc_hugetlb_folio_nodemask(h, nid, nodemask, gfp); }

if (folio_test_large(src)) gfp = GFP_TRANSHUGE; + else + gfp = GFP_HIGHUSER_MOVABLE | __GFP_RETRY_MAYFAIL | __GFP_COMP;

- return vma_alloc_folio(gfp, folio_order(src), vma, address, - folio_test_large(src)); + page = alloc_pages_mpol(gfp, order, pol, ilx, nid); + return page_rmappable_folio(page); } #else

@@ -1226,7 +1222,8 @@ int do_migrate_pages(struct mm_struct *mm, const nodemask_t *from, return -ENOSYS; }

-static struct folio *new_folio(struct folio *src, unsigned long start) +static struct folio *alloc_migration_target_by_mpol(struct folio *src, + unsigned long private) { return NULL; } @@ -1299,6 +1296,7 @@ long __do_mbind(unsigned long start, unsigned long len,

if (nr_failed < 0) { err = nr_failed; + nr_failed = 0; } else { vma_iter_init(&vmi, mm, start); prev = vma_prev(&vmi); @@ -1309,19 +1307,24 @@ long __do_mbind(unsigned long start, unsigned long len, } }

- if (!err) { - if (!list_empty(&pagelist)) { - nr_failed |= migrate_pages(&pagelist, new_folio, NULL, - start, MIGRATE_SYNC, MR_MEMPOLICY_MBIND, NULL); + mmap_write_unlock(mm); + + if (!err && !list_empty(&pagelist)) { + /* Convert MPOL_DEFAULT's NULL to task or default policy */ + if (!new) { + new = get_task_policy(current); + mpol_get(new); } - if (nr_failed && (flags & MPOL_MF_STRICT)) - err = -EIO; + nr_failed |= migrate_pages(&pagelist, + alloc_migration_target_by_mpol, NULL, + (unsigned long)new, MIGRATE_SYNC, + MR_MEMPOLICY_MBIND, NULL); }

+ if (nr_failed && (flags & MPOL_MF_STRICT)) + err = -EIO; if (!list_empty(&pagelist)) putback_movable_pages(&pagelist); - - mmap_write_unlock(mm); mpol_out: mpol_put(new); if (flags & (MPOL_MF_MOVE | MPOL_MF_MOVE_ALL))

-- 2.33.0

patchwork bot

10:10 p.m.

New subject: [PATCH OLK-6.6 v2] mempolicy: mmap_lock is not needed while migrating folios

反馈: 您发送到kernel@openeuler.org的补丁/补丁集，转换为PR失败！邮件列表地址：https://mailweb.openeuler.org/hyperkitty/list/kernel@openeuler.org/message/O... 失败原因：应用补丁/补丁集失败，Patch failed at 0001 mempolicy: mmap_lock is not needed while migrating folios 建议解决方法：请查看失败原因，确认补丁是否可以应用在当前期望分支的最新代码上

FeedBack: The patch(es) which you have sent to kernel@openeuler.org has been converted to PR failed! Mailing list address: https://mailweb.openeuler.org/hyperkitty/list/kernel@openeuler.org/message/O... Failed Reason: apply patch(es) failed, Patch failed at 0001 mempolicy: mmap_lock is not needed while migrating folios Suggest Solution: please checkout if the failed patch(es) can work on the newest codes in expected branch

Ze Zuo

9:49 p.m.

New subject: [PATCH OLK-6.6 v2] mempolicy: migration attempt to match interleave nodes

From: Hugh Dickins hughd@google.com

mainline inclusion from mainline-v6.7-rc1 commit 88c91dc58582f1d0c6f5f501051b0262d06ec4ed category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I9OHHN CVE: NA

Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i...

--------------------------------

Improve alloc_migration_target_by_mpol()'s treatment of MPOL_INTERLEAVE.

Make an effort in do_mbind(), to identify the correct interleave index for the first page to be migrated, so that it and all subsequent pages from the same vma will be targeted to precisely their intended nodes. Pages from following vmas will still be interleaved from the requested nodemask, but perhaps starting from a different base.

Whether this is worth doing at all, or worth improving further, is arguable: queue_folio_required() is right not to care about the precise placement on interleaved nodes; but this little effort seems appropriate.

[hughd@google.com: do vma_iter search under mmap_write_unlock()] Link: https://lkml.kernel.org/r/3311d544-fb05-a7f1-1b74-16aa0f6cd4fe@google.com Link: https://lkml.kernel.org/r/77954a5-9c9b-1c11-7d5c-3262c01b895f@google.com Signed-off-by: Hugh Dickins hughd@google.com Cc: Andi Kleen ak@linux.intel.com Cc: Christoph Lameter cl@linux.com Cc: David Hildenbrand david@redhat.com Cc: Greg Kroah-Hartman gregkh@linuxfoundation.org Cc: "Huang, Ying" ying.huang@intel.com Cc: Kefeng Wang wangkefeng.wang@huawei.com Cc: Matthew Wilcox (Oracle) willy@infradead.org Cc: Mel Gorman mgorman@techsingularity.net Cc: Michal Hocko mhocko@suse.com Cc: Mike Kravetz mike.kravetz@oracle.com Cc: Nhat Pham nphamcs@gmail.com Cc: Sidhartha Kumar sidhartha.kumar@oracle.com Cc: Suren Baghdasaryan surenb@google.com Cc: Tejun heo tj@kernel.org Cc: Vishal Moola (Oracle) vishal.moola@gmail.com Cc: Yang Shi shy828301@gmail.com Cc: Yosry Ahmed yosryahmed@google.com Signed-off-by: Andrew Morton akpm@linux-foundation.org Signed-off-by: Ze Zuo zuoze1@huawei.com --- mm/mempolicy.c | 55 +++++++++++++++++++++++++++++++++++++++++++++----- 1 file changed, 50 insertions(+), 5 deletions(-)

diff --git a/mm/mempolicy.c b/mm/mempolicy.c index bc45ab810f6b..7afc2ad0904a 100644 --- a/mm/mempolicy.c +++ b/mm/mempolicy.c @@ -433,6 +433,11 @@ static bool strictly_unmovable(unsigned long flags) MPOL_MF_STRICT; }

+struct migration_mpol { /* for alloc_migration_target_by_mpol() */ + struct mempolicy *pol; + pgoff_t ilx; +}; + struct queue_pages { struct list_head *pagelist; unsigned long flags; @@ -1180,8 +1185,9 @@ int do_migrate_pages(struct mm_struct *mm, const nodemask_t *from, static struct folio *alloc_migration_target_by_mpol(struct folio *src, unsigned long private) { - struct mempolicy *pol = (struct mempolicy *)private; - pgoff_t ilx = 0; /* improve on this later */ + struct migration_mpol *mmpol = (struct migration_mpol *)private; + struct mempolicy *pol = mmpol->pol; + pgoff_t ilx = mmpol->ilx; struct page *page; unsigned int order; int nid = numa_node_id(); @@ -1235,6 +1241,7 @@ long __do_mbind(unsigned long start, unsigned long len, { struct vm_area_struct *vma, *prev; struct vma_iterator vmi; + struct migration_mpol mmpol; struct mempolicy *new; unsigned long end; long err; @@ -1307,17 +1314,55 @@ long __do_mbind(unsigned long start, unsigned long len, } }

- mmap_write_unlock(mm); - if (!err && !list_empty(&pagelist)) { /* Convert MPOL_DEFAULT's NULL to task or default policy */ if (!new) { new = get_task_policy(current); mpol_get(new); } + mmpol.pol = new; + mmpol.ilx = 0; + + /* + * In the interleaved case, attempt to allocate on exactly the + * targeted nodes, for the first VMA to be migrated; for later + * VMAs, the nodes will still be interleaved from the targeted + * nodemask, but one by one may be selected differently. + */ + if (new->mode == MPOL_INTERLEAVE) { + struct page *page; + unsigned int order; + unsigned long addr = -EFAULT; + + list_for_each_entry(page, &pagelist, lru) { + if (!PageKsm(page)) + break; + } + if (!list_entry_is_head(page, &pagelist, lru)) { + vma_iter_init(&vmi, mm, start); + for_each_vma_range(vmi, vma, end) { + addr = page_address_in_vma(page, vma); + if (addr != -EFAULT) + break; + } + } + if (addr != -EFAULT) { + order = compound_order(page); + /* We already know the pol, but not the ilx */ + mpol_cond_put(get_vma_policy(vma, addr, order, + &mmpol.ilx)); + /* Set base from which to increment by index */ + mmpol.ilx -= page->index >> order; + } + } + } + + mmap_write_unlock(mm); + + if (!err && !list_empty(&pagelist)) { nr_failed |= migrate_pages(&pagelist, alloc_migration_target_by_mpol, NULL, - (unsigned long)new, MIGRATE_SYNC, + (unsigned long)&mmpol, MIGRATE_SYNC, MR_MEMPOLICY_MBIND, NULL); }

-- 2.33.0

patchwork bot

10:11 p.m.

New subject: [PATCH OLK-6.6 v2] mempolicy: migration attempt to match interleave nodes

反馈: 您发送到kernel@openeuler.org的补丁/补丁集，转换为PR失败！邮件列表地址：https://mailweb.openeuler.org/hyperkitty/list/kernel@openeuler.org/message/Q... 失败原因：应用补丁/补丁集失败，Patch failed at 0001 mempolicy: migration attempt to match interleave nodes 建议解决方法：请查看失败原因，确认补丁是否可以应用在当前期望分支的最新代码上

FeedBack: The patch(es) which you have sent to kernel@openeuler.org has been converted to PR failed! Mailing list address: https://mailweb.openeuler.org/hyperkitty/list/kernel@openeuler.org/message/Q... Failed Reason: apply patch(es) failed, Patch failed at 0001 mempolicy: migration attempt to match interleave nodes Suggest Solution: please checkout if the failed patch(es) can work on the newest codes in expected branch

Ze Zuo

9:49 p.m.

New subject: [PATCH OLK-6.6 v2] mm/mempolicy: implement the sysfs-based weighted_interleave interface

From: Rakie Kim rakie.kim@sk.com

mainline inclusion from mainline-v6.9-rc1 commit dce41f5ae2539d1c20ae8de4e039630aec3c3f3c category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I9OHHN CVE: NA

Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i...

--------------------------------

Patch series "mm/mempolicy: weighted interleave mempolicy and sysfs extension", v5.

Weighted interleave is a new interleave policy intended to make use of heterogeneous memory environments appearing with CXL.

The existing interleave mechanism does an even round-robin distribution of memory across all nodes in a nodemask, while weighted interleave distributes memory across nodes according to a provided weight. (Weight = # of page allocations per round)

Weighted interleave is intended to reduce average latency when bandwidth is pressured - therefore increasing total throughput.

In other words: It allows greater use of the total available bandwidth in a heterogeneous hardware environment (different hardware provides different bandwidth capacity).

As bandwidth is pressured, latency increases - first linearly and then exponentially. By keeping bandwidth usage distributed according to available bandwidth, we therefore can reduce the average latency of a cacheline fetch.

A good explanation of the bandwidth vs latency response curve: https://mahmoudhatem.wordpress.com/2017/11/07/memory-bandwidth-vs-latency-re...

From the article: ``` Constant region: The latency response is fairly constant for the first 40% of the sustained bandwidth. Linear region: In between 40% to 80% of the sustained bandwidth, the latency response increases almost linearly with the bandwidth demand of the system due to contention overhead by numerous memory requests. Exponential region: Between 80% to 100% of the sustained bandwidth, the memory latency is dominated by the contention latency which can be as much as twice the idle latency or more. Maximum sustained bandwidth : Is 65% to 75% of the theoretical maximum bandwidth. ```

As a general rule of thumb: * If bandwidth usage is low, latency does not increase. It is optimal to place data in the nearest (lowest latency) device. * If bandwidth usage is high, latency increases. It is optimal to place data such that bandwidth use is optimized per-device.

This is the top line goal: Provide a user a mechanism to target using the "maximum sustained bandwidth" of each hardware component in a heterogenous memory system.

For example, the stream benchmark demonstrates that 1:1 (default) interleave is actively harmful, while weighted interleave can be beneficial. Default interleave distributes data such that too much pressure is placed on devices with lower available bandwidth.

Stream Benchmark (vs DRAM, 1 Socket + 1 CXL Device) Default interleave : -78% (slower than DRAM) Global weighting : -6% to +4% (workload dependant) Targeted weights : +2.5% to +4% (consistently better than DRAM)

Global means the task-policy was set (set_mempolicy), while targeted means VMA policies were set (mbind2). We see weighted interleave is not always beneficial when applied globally, but is always beneficial when applied to bandwidth-driving memory regions.

There are 4 patches in this set: 1) Implement system-global interleave weights as sysfs extension in mm/mempolicy.c. These weights are RCU protected, and a default weight set is provided (all weights are 1 by default).

In future work, we intend to expose an interface for HMAT/CDAT code to set reasonable default values based on the memory configuration of the system discovered at boot/hotplug.

2) A mild refactor of some interleave-logic for re-use in the new weighted interleave logic.

3) MPOL_WEIGHTED_INTERLEAVE extension for set_mempolicy/mbind

4) Protect interleave logic (weighted and normal) with the mems_allowed seq cookie. If the nodemask changes while accessing it during a rebind, just retry the access.

Included below are some performance and LTP test information, and a sample numactl branch which can be used for testing.

= Performance summary = (tests may have different configurations, see extended info below) 1) MLC (W2) : +38% over DRAM. +264% over default interleave. MLC (W5) : +40% over DRAM. +226% over default interleave. 2) Stream : -6% to +4% over DRAM, +430% over default interleave. 3) XSBench : +19% over DRAM. +47% over default interleave.

= LTP Testing Summary = existing mempolicy & mbind tests: pass mempolicy & mbind + weighted interleave (global weights): pass

= version history v5: - style fixes - mems_allowed cookie protection to detect rebind issues, prevents spurious allocation failures and/or mis-allocations - sparse warning fixes related to __rcu on local variables

===================================================================== Performance tests - MLC From - Ravi Jonnalagadda ravis.opensrc@micron.com

Hardware: Single-socket, multiple CXL memory expanders.

Workload: W2 Data Signature: 2:1 read:write DRAM only bandwidth (GBps): 298.8 DRAM + CXL (default interleave) (GBps): 113.04 DRAM + CXL (weighted interleave)(GBps): 412.5 Gain over DRAM only: 1.38x Gain over default interleave: 2.64x

Workload: W5 Data Signature: 1:1 read:write DRAM only bandwidth (GBps): 273.2 DRAM + CXL (default interleave) (GBps): 117.23 DRAM + CXL (weighted interleave)(GBps): 382.7 Gain over DRAM only: 1.4x Gain over default interleave: 2.26x

===================================================================== Performance test - Stream From - Gregory Price gregory.price@memverge.com

Hardware: Single socket, single CXL expander numactl extension: https://github.com/gmprice/numactl/tree/weighted_interleave_master

Summary: 64 threads, ~18GB workload, 3GB per array, executed 100 times Default interleave : -78% (slower than DRAM) Global weighting : -6% to +4% (workload dependant) mbind2 weights : +2.5% to +4% (consistently better than DRAM)

dram only: numactl --cpunodebind=1 --membind=1 ./stream_c.exe --ntimes 100 --array-size 400M --malloc Function Direction BestRateMBs AvgTime MinTime MaxTime Copy: 0->0 200923.2 0.032662 0.031853 0.033301 Scale: 0->0 202123.0 0.032526 0.031664 0.032970 Add: 0->0 208873.2 0.047322 0.045961 0.047884 Triad: 0->0 208523.8 0.047262 0.046038 0.048414

CXL-only: numactl --cpunodebind=1 -w --membind=2 ./stream_c.exe --ntimes 100 --array-size 400M --malloc Copy: 0->0 22209.7 0.288661 0.288162 0.289342 Scale: 0->0 22288.2 0.287549 0.287147 0.288291 Add: 0->0 24419.1 0.393372 0.393135 0.393735 Triad: 0->0 24484.6 0.392337 0.392083 0.394331

Based on the above, the optimal weights are ~9:1 echo 9 > /sys/kernel/mm/mempolicy/weighted_interleave/node1 echo 1 > /sys/kernel/mm/mempolicy/weighted_interleave/node2

default interleave: numactl --cpunodebind=1 --interleave=1,2 ./stream_c.exe --ntimes 100 --array-size 400M --malloc Copy: 0->0 44666.2 0.143671 0.143285 0.144174 Scale: 0->0 44781.6 0.143256 0.142916 0.143713 Add: 0->0 48600.7 0.197719 0.197528 0.197858 Triad: 0->0 48727.5 0.197204 0.197014 0.197439

global weighted interleave: numactl --cpunodebind=1 -w --interleave=1,2 ./stream_c.exe --ntimes 100 --array-size 400M --malloc Copy: 0->0 190085.9 0.034289 0.033669 0.034645 Scale: 0->0 207677.4 0.031909 0.030817 0.033061 Add: 0->0 202036.8 0.048737 0.047516 0.053409 Triad: 0->0 217671.5 0.045819 0.044103 0.046755

targted regions w/ global weights (modified stream to mbind2 malloc'd regions)) numactl --cpunodebind=1 --membind=1 ./stream_c.exe -b --ntimes 100 --array-size 400M --malloc Copy: 0->0 205827.0 0.031445 0.031094 0.031984 Scale: 0->0 208171.8 0.031320 0.030744 0.032505 Add: 0->0 217352.0 0.045087 0.044168 0.046515 Triad: 0->0 216884.8 0.045062 0.044263 0.046982

===================================================================== Performance tests - XSBench From - Hyeongtak Ji hyeongtak.ji@sk.com

Hardware: Single socket, Single CXL memory Expander

NUMA node 0: 56 logical cores, 128 GB memory NUMA node 2: 96 GB CXL memory Threads: 56 Lookups: 170,000,000

Summary: +19% over DRAM. +47% over default interleave.

Performance tests - XSBench 1. dram only $ numactl -m 0 ./XSBench -s XL –p 5000000 Runtime: 36.235 seconds Lookups/s: 4,691,618

2. default interleave $ numactl –i 0,2 ./XSBench –s XL –p 5000000 Runtime: 55.243 seconds Lookups/s: 3,077,293

3. weighted interleave numactl –w –i 0,2 ./XSBench –s XL –p 5000000 Runtime: 29.262 seconds Lookups/s: 5,809,513

===================================================================== LTP Tests: https://github.com/gmprice/ltp/tree/mempolicy2

= Existing tests set_mempolicy, get_mempolicy, mbind

MPOL_WEIGHTED_INTERLEAVE added manually to test basic functionality but did not adjust tests for weighting. Basically the weights were set to 1, which is the default, and it should behave the same as MPOL_INTERLEAVE if logic is correct.

== set_mempolicy01 : passed 18, failed 0 == set_mempolicy02 : passed 10, failed 0 == set_mempolicy03 : passed 64, failed 0 == set_mempolicy04 : passed 32, failed 0 == set_mempolicy05 - n/a on non-x86 == set_mempolicy06 : passed 10, failed 0 this is set_mempolicy02 + MPOL_WEIGHTED_INTERLEAVE == set_mempolicy07 : passed 32, failed 0 set_mempolicy04 + MPOL_WEIGHTED_INTERLEAVE == get_mempolicy01 : passed 12, failed 0 change: added MPOL_WEIGHTED_INTERLEAVE == get_mempolicy02 : passed 2, failed 0 == mbind01 : passed 15, failed 0 added MPOL_WEIGHTED_INTERLEAVE == mbind02 : passed 4, failed 0 added MPOL_WEIGHTED_INTERLEAVE == mbind03 : passed 16, failed 0 added MPOL_WEIGHTED_INTERLEAVE == mbind04 : passed 48, failed 0 added MPOL_WEIGHTED_INTERLEAVE

===================================================================== numactl (set_mempolicy) w/ global weighting test numactl fork: https://github.com/gmprice/numactl/tree/weighted_interleave_master

command: numactl -w --interleave=0,1 ./eatmem

result (weights 1:1): 0176a000 weighted interleave:0-1 heap anon=65793 dirty=65793 active=0 N0=32897 N1=32896 kernelpagesize_kB=4 7fceeb9ff000 weighted interleave:0-1 anon=65537 dirty=65537 active=0 N0=32768 N1=32769 kernelpagesize_kB=4 50% distribution is correct

result (weights 5:1): 01b14000 weighted interleave:0-1 heap anon=65793 dirty=65793 active=0 N0=54828 N1=10965 kernelpagesize_kB=4 7f47a1dff000 weighted interleave:0-1 anon=65537 dirty=65537 active=0 N0=54614 N1=10923 kernelpagesize_kB=4 16.666% distribution is correct

result (weights 1:5): 01f07000 weighted interleave:0-1 heap anon=65793 dirty=65793 active=0 N0=10966 N1=54827 kernelpagesize_kB=4 7f17b1dff000 weighted interleave:0-1 anon=65537 dirty=65537 active=0 N0=10923 N1=54614 kernelpagesize_kB=4 16.666% distribution is correct

#include <stdio.h> #include <stdlib.h> #include <string.h> int main (void) { char* mem = malloc(1024*1024*256); memset(mem, 1, 1024*1024*256); for (int i = 0; i < ((1024*1024*256)/4096); i++) { mem = malloc(4096); mem[0] = 1; } printf("done\n"); getchar(); return 0; }

This patch (of 4):

This patch provides a way to set interleave weight information under sysfs at /sys/kernel/mm/mempolicy/weighted_interleave/nodeN

The sysfs structure is designed as follows.

$ tree /sys/kernel/mm/mempolicy/ /sys/kernel/mm/mempolicy/ [1] └── weighted_interleave [2] ├── node0 [3] └── node1

Each file above can be explained as follows.

[1] mm/mempolicy: configuration interface for mempolicy subsystem

[2] weighted_interleave/: config interface for weighted interleave policy

[3] weighted_interleave/nodeN: weight for nodeN

If a node value is set to `0`, the system-default value will be used. As of this patch, the system-default for all nodes is always 1.

Link: https://lkml.kernel.org/r/20240202170238.90004-1-gregory.price@memverge.com Link: https://lkml.kernel.org/r/20240202170238.90004-2-gregory.price@memverge.com Suggested-by: "Huang, Ying" ying.huang@intel.com Signed-off-by: Rakie Kim rakie.kim@sk.com Signed-off-by: Honggyu Kim honggyu.kim@sk.com Co-developed-by: Gregory Price gregory.price@memverge.com Signed-off-by: Gregory Price gregory.price@memverge.com Co-developed-by: Hyeongtak Ji hyeongtak.ji@sk.com Signed-off-by: Hyeongtak Ji hyeongtak.ji@sk.com Reviewed-by: "Huang, Ying" ying.huang@intel.com Cc: Dan Williams dan.j.williams@intel.com Cc: Gregory Price gourry.memverge@gmail.com Cc: Hasan Al Maruf Hasan.Maruf@amd.com Cc: Johannes Weiner hannes@cmpxchg.org Cc: Jonathan Corbet corbet@lwn.net Cc: Michal Hocko mhocko@kernel.org Cc: Srinivasulu Thanneeru sthanneeru.opensrc@micron.com Signed-off-by: Andrew Morton akpm@linux-foundation.org Signed-off-by: Ze Zuo zuoze1@huawei.com --- .../ABI/testing/sysfs-kernel-mm-mempolicy | 4 + ...fs-kernel-mm-mempolicy-weighted-interleave | 25 ++ mm/mempolicy.c | 223 ++++++++++++++++++ 3 files changed, 252 insertions(+) create mode 100644 Documentation/ABI/testing/sysfs-kernel-mm-mempolicy create mode 100644 Documentation/ABI/testing/sysfs-kernel-mm-mempolicy-weighted-interleave

diff --git a/Documentation/ABI/testing/sysfs-kernel-mm-mempolicy b/Documentation/ABI/testing/sysfs-kernel-mm-mempolicy new file mode 100644 index 000000000000..8ac327fd7fb6 --- /dev/null +++ b/Documentation/ABI/testing/sysfs-kernel-mm-mempolicy @@ -0,0 +1,4 @@ +What: /sys/kernel/mm/mempolicy/ +Date: January 2024 +Contact: Linux memory management mailing list linux-mm@kvack.org +Description: Interface for Mempolicy diff --git a/Documentation/ABI/testing/sysfs-kernel-mm-mempolicy-weighted-interleave b/Documentation/ABI/testing/sysfs-kernel-mm-mempolicy-weighted-interleave new file mode 100644 index 000000000000..0b7972de04e9 --- /dev/null +++ b/Documentation/ABI/testing/sysfs-kernel-mm-mempolicy-weighted-interleave @@ -0,0 +1,25 @@ +What: /sys/kernel/mm/mempolicy/weighted_interleave/ +Date: January 2024 +Contact: Linux memory management mailing list linux-mm@kvack.org +Description: Configuration Interface for the Weighted Interleave policy + +What: /sys/kernel/mm/mempolicy/weighted_interleave/nodeN +Date: January 2024 +Contact: Linux memory management mailing list linux-mm@kvack.org +Description: Weight configuration interface for nodeN + + The interleave weight for a memory node (N). These weights are + utilized by tasks which have set their mempolicy to + MPOL_WEIGHTED_INTERLEAVE. + + These weights only affect new allocations, and changes at runtime + will not cause migrations on already allocated pages. + + The minimum weight for a node is always 1. + + Minimum weight: 1 + Maximum weight: 255 + + Writing an empty string or `0` will reset the weight to the + system default. The system default may be set by the kernel + or drivers at boot or during hotplug events. diff --git a/mm/mempolicy.c b/mm/mempolicy.c index 7afc2ad0904a..76be93de7dab 100644 --- a/mm/mempolicy.c +++ b/mm/mempolicy.c @@ -132,6 +132,32 @@ static struct mempolicy default_policy = {

static struct mempolicy preferred_node_policy[MAX_NUMNODES];

+/* + * iw_table is the sysfs-set interleave weight table, a value of 0 denotes + * system-default value should be used. A NULL iw_table also denotes that + * system-default values should be used. Until the system-default table + * is implemented, the system-default is always 1. + * + * iw_table is RCU protected + */ +static u8 __rcu *iw_table; +static DEFINE_MUTEX(iw_table_lock); + +static u8 get_il_weight(int node) +{ + u8 *table; + u8 weight; + + rcu_read_lock(); + table = rcu_dereference(iw_table); + /* if no iw_table, use system default */ + weight = table ? table[node] : 1; + /* if value in iw_table is 0, use system default */ + weight = weight ? weight : 1; + rcu_read_unlock(); + return weight; +} + /** * numa_nearest_node - Find nearest node by state * @node: Node id to start the search @@ -3099,3 +3125,200 @@ void mpol_to_str(char *buffer, int maxlen, struct mempolicy *pol) p += scnprintf(p, buffer + maxlen - p, ":%*pbl", nodemask_pr_args(&nodes)); } + +#ifdef CONFIG_SYSFS +struct iw_node_attr { + struct kobj_attribute kobj_attr; + int nid; +}; + +static ssize_t node_show(struct kobject *kobj, struct kobj_attribute *attr, + char *buf) +{ + struct iw_node_attr *node_attr; + u8 weight; + + node_attr = container_of(attr, struct iw_node_attr, kobj_attr); + weight = get_il_weight(node_attr->nid); + return sysfs_emit(buf, "%d\n", weight); +} + +static ssize_t node_store(struct kobject *kobj, struct kobj_attribute *attr, + const char *buf, size_t count) +{ + struct iw_node_attr *node_attr; + u8 *new; + u8 *old; + u8 weight = 0; + + node_attr = container_of(attr, struct iw_node_attr, kobj_attr); + if (count == 0 || sysfs_streq(buf, "")) + weight = 0; + else if (kstrtou8(buf, 0, &weight)) + return -EINVAL; + + new = kzalloc(nr_node_ids, GFP_KERNEL); + if (!new) + return -ENOMEM; + + mutex_lock(&iw_table_lock); + old = rcu_dereference_protected(iw_table, + lockdep_is_held(&iw_table_lock)); + if (old) + memcpy(new, old, nr_node_ids); + new[node_attr->nid] = weight; + rcu_assign_pointer(iw_table, new); + mutex_unlock(&iw_table_lock); + synchronize_rcu(); + kfree(old); + return count; +} + +static struct iw_node_attr **node_attrs; + +static void sysfs_wi_node_release(struct iw_node_attr *node_attr, + struct kobject *parent) +{ + if (!node_attr) + return; + sysfs_remove_file(parent, &node_attr->kobj_attr.attr); + kfree(node_attr->kobj_attr.attr.name); + kfree(node_attr); +} + +static void sysfs_wi_release(struct kobject *wi_kobj) +{ + int i; + + for (i = 0; i < nr_node_ids; i++) + sysfs_wi_node_release(node_attrs[i], wi_kobj); + kobject_put(wi_kobj); +} + +static const struct kobj_type wi_ktype = { + .sysfs_ops = &kobj_sysfs_ops, + .release = sysfs_wi_release, +}; + +static int add_weight_node(int nid, struct kobject *wi_kobj) +{ + struct iw_node_attr *node_attr; + char *name; + + node_attr = kzalloc(sizeof(*node_attr), GFP_KERNEL); + if (!node_attr) + return -ENOMEM; + + name = kasprintf(GFP_KERNEL, "node%d", nid); + if (!name) { + kfree(node_attr); + return -ENOMEM; + } + + sysfs_attr_init(&node_attr->kobj_attr.attr); + node_attr->kobj_attr.attr.name = name; + node_attr->kobj_attr.attr.mode = 0644; + node_attr->kobj_attr.show = node_show; + node_attr->kobj_attr.store = node_store; + node_attr->nid = nid; + + if (sysfs_create_file(wi_kobj, &node_attr->kobj_attr.attr)) { + kfree(node_attr->kobj_attr.attr.name); + kfree(node_attr); + pr_err("failed to add attribute to weighted_interleave\n"); + return -ENOMEM; + } + + node_attrs[nid] = node_attr; + return 0; +} + +static int add_weighted_interleave_group(struct kobject *root_kobj) +{ + struct kobject *wi_kobj; + int nid, err; + + wi_kobj = kzalloc(sizeof(struct kobject), GFP_KERNEL); + if (!wi_kobj) + return -ENOMEM; + + err = kobject_init_and_add(wi_kobj, &wi_ktype, root_kobj, + "weighted_interleave"); + if (err) { + kfree(wi_kobj); + return err; + } + + for_each_node_state(nid, N_POSSIBLE) { + err = add_weight_node(nid, wi_kobj); + if (err) { + pr_err("failed to add sysfs [node%d]\n", nid); + break; + } + } + if (err) + kobject_put(wi_kobj); + return 0; +} + +static void mempolicy_kobj_release(struct kobject *kobj) +{ + u8 *old; + + mutex_lock(&iw_table_lock); + old = rcu_dereference_protected(iw_table, + lockdep_is_held(&iw_table_lock)); + rcu_assign_pointer(iw_table, NULL); + mutex_unlock(&iw_table_lock); + synchronize_rcu(); + kfree(old); + kfree(node_attrs); + kfree(kobj); +} + +static const struct kobj_type mempolicy_ktype = { + .release = mempolicy_kobj_release +}; + +static int __init mempolicy_sysfs_init(void) +{ + int err; + static struct kobject *mempolicy_kobj; + + mempolicy_kobj = kzalloc(sizeof(*mempolicy_kobj), GFP_KERNEL); + if (!mempolicy_kobj) { + err = -ENOMEM; + goto err_out; + } + + node_attrs = kcalloc(nr_node_ids, sizeof(struct iw_node_attr *), + GFP_KERNEL); + if (!node_attrs) { + err = -ENOMEM; + goto mempol_out; + } + + err = kobject_init_and_add(mempolicy_kobj, &mempolicy_ktype, mm_kobj, + "mempolicy"); + if (err) + goto node_out; + + err = add_weighted_interleave_group(mempolicy_kobj); + if (err) { + pr_err("mempolicy sysfs structure failed to initialize\n"); + kobject_put(mempolicy_kobj); + return err; + } + + return err; +node_out: + kfree(node_attrs); +mempol_out: + kfree(mempolicy_kobj); +err_out: + pr_err("failed to add mempolicy kobject to the system\n"); + return err; +} + +late_initcall(mempolicy_sysfs_init); +#endif /* CONFIG_SYSFS */

-- 2.33.0

patchwork bot

10:12 p.m.

New subject: [PATCH OLK-6.6 v2] mm/mempolicy: implement the sysfs-based weighted_interleave interface

反馈：您发送到kernel@openeuler.org的补丁/补丁集，已成功转换为PR！ PR链接地址： https://gitee.com/openeuler/kernel/pulls/7200 邮件列表地址：https://mailweb.openeuler.org/hyperkitty/list/kernel@openeuler.org/message/5...

FeedBack: The patch(es) which you have sent to kernel@openeuler.org mailing list has been converted to a pull request successfully! Pull request link: https://gitee.com/openeuler/kernel/pulls/7200 Mailing list address: https://mailweb.openeuler.org/hyperkitty/list/kernel@openeuler.org/message/5...

Ze Zuo

9:49 p.m.

New subject: [PATCH OLK-6.6 v2] mm/mempolicy: refactor a read-once mechanism into a function for re-use

From: Gregory Price gourry.memverge@gmail.com

mainline inclusion from mainline-v6.9-rc1 commit 9685e6e30d116d72fb013b0bce261a676b7575c1 category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I9OHHN CVE: NA

Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i...

--------------------------------

Move the use of barrier() to force policy->nodemask onto the stack into a function `read_once_policy_nodemask` so that it may be re-used.

Link: https://lkml.kernel.org/r/20240202170238.90004-3-gregory.price@memverge.com Signed-off-by: Gregory Price gregory.price@memverge.com Suggested-by: "Huang, Ying" ying.huang@intel.com Reviewed-by: "Huang, Ying" ying.huang@intel.com Cc: Dan Williams dan.j.williams@intel.com Cc: Hasan Al Maruf Hasan.Maruf@amd.com Cc: Honggyu Kim honggyu.kim@sk.com Cc: Hyeongtak Ji hyeongtak.ji@sk.com Cc: Johannes Weiner hannes@cmpxchg.org Cc: Jonathan Corbet corbet@lwn.net Cc: Michal Hocko mhocko@kernel.org Cc: Rakie Kim rakie.kim@sk.com Cc: Ravi Jonnalagadda ravis.opensrc@micron.com Cc: Srinivasulu Thanneeru sthanneeru.opensrc@micron.com Signed-off-by: Andrew Morton akpm@linux-foundation.org Signed-off-by: Ze Zuo zuoze1@huawei.com --- mm/mempolicy.c | 26 ++++++++++++++++---------- 1 file changed, 16 insertions(+), 10 deletions(-)

diff --git a/mm/mempolicy.c b/mm/mempolicy.c index 76be93de7dab..6764eba6890b 100644 --- a/mm/mempolicy.c +++ b/mm/mempolicy.c @@ -1935,6 +1935,20 @@ unsigned int mempolicy_slab_node(void) } }

+static unsigned int read_once_policy_nodemask(struct mempolicy *pol, + nodemask_t *mask) +{ + /* + * barrier stabilizes the nodemask locally so that it can be iterated + * over safely without concern for changes. Allocators validate node + * selection does not violate mems_allowed, so this is safe. + */ + barrier(); + memcpy(mask, &pol->nodes, sizeof(nodemask_t)); + barrier(); + return nodes_weight(*mask); +} + /* * Do static interleaving for interleave index @ilx. Returns the ilx'th * node in pol->nodes (starting from ilx=0), wrapping around if ilx @@ -1942,20 +1956,12 @@ unsigned int mempolicy_slab_node(void) */ static unsigned int interleave_nid(struct mempolicy *pol, pgoff_t ilx) { - nodemask_t nodemask = pol->nodes; + nodemask_t nodemask; unsigned int target, nnodes; int i; int nid; - /* - * The barrier will stabilize the nodemask in a register or on - * the stack so that it will stop changing under the code. - * - * Between first_node() and next_node(), pol->nodes could be changed - * by other threads. So we put pol->nodes in a local stack. - */ - barrier();

- nnodes = nodes_weight(nodemask); + nnodes = read_once_policy_nodemask(pol, &nodemask); if (!nnodes) return numa_node_id(); target = ilx % nnodes;

-- 2.33.0

patchwork bot

10:11 p.m.

New subject: [PATCH OLK-6.6 v2] mm/mempolicy: refactor a read-once mechanism into a function for re-use

反馈: 您发送到kernel@openeuler.org的补丁/补丁集，转换为PR失败！邮件列表地址：https://mailweb.openeuler.org/hyperkitty/list/kernel@openeuler.org/message/W... 失败原因：应用补丁/补丁集失败，Patch failed at 0001 mm/mempolicy: refactor a read-once mechanism into a function for re-use 建议解决方法：请查看失败原因，确认补丁是否可以应用在当前期望分支的最新代码上

FeedBack: The patch(es) which you have sent to kernel@openeuler.org has been converted to PR failed! Mailing list address: https://mailweb.openeuler.org/hyperkitty/list/kernel@openeuler.org/message/W... Failed Reason: apply patch(es) failed, Patch failed at 0001 mm/mempolicy: refactor a read-once mechanism into a function for re-use Suggest Solution: please checkout if the failed patch(es) can work on the newest codes in expected branch

Ze Zuo

9:49 p.m.

New subject: [PATCH OLK-6.6 v2] mm/mempolicy: introduce MPOL_WEIGHTED_INTERLEAVE for weighted interleaving

From: Gregory Price gourry.memverge@gmail.com

mainline inclusion from mainline-v6.9-rc1 commit fa3bea4e1f8202d787709b7e3654eb0a99aed758 category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I9OHHN CVE: NA

Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i...

--------------------------------

When a system has multiple NUMA nodes and it becomes bandwidth hungry, using the current MPOL_INTERLEAVE could be an wise option.

However, if those NUMA nodes consist of different types of memory such as socket-attached DRAM and CXL/PCIe attached DRAM, the round-robin based interleave policy does not optimally distribute data to make use of their different bandwidth characteristics.

Instead, interleave is more effective when the allocation policy follows each NUMA nodes' bandwidth weight rather than a simple 1:1 distribution.

This patch introduces a new memory policy, MPOL_WEIGHTED_INTERLEAVE, enabling weighted interleave between NUMA nodes. Weighted interleave allows for proportional distribution of memory across multiple numa nodes, preferably apportioned to match the bandwidth of each node.

For example, if a system has 1 CPU node (0), and 2 memory nodes (0,1), with bandwidth of (100GB/s, 50GB/s) respectively, the appropriate weight distribution is (2:1).

Weights for each node can be assigned via the new sysfs extension: /sys/kernel/mm/mempolicy/weighted_interleave/

For now, the default value of all nodes will be `1`, which matches the behavior of standard 1:1 round-robin interleave. An extension will be added in the future to allow default values to be registered at kernel and device bringup time.

The policy allocates a number of pages equal to the set weights. For example, if the weights are (2,1), then 2 pages will be allocated on node0 for every 1 page allocated on node1.

The new flag MPOL_WEIGHTED_INTERLEAVE can be used in set_mempolicy(2) and mbind(2).

Some high level notes about the pieces of weighted interleave:

current->il_prev: Tracks the node previously allocated from.

current->il_weight: The active weight of the current node (current->il_prev) When this reaches 0, current->il_prev is set to the next node and current->il_weight is set to the next weight.

weighted_interleave_nodes: Counts the number of allocations as they occur, and applies the weight for the current node. When the weight reaches 0, switch to the next node. Operates only on task->mempolicy.

weighted_interleave_nid: Gets the total weight of the nodemask as well as each individual node weight, then calculates the node based on the given index. Operates on VMA policies.

bulk_array_weighted_interleave: Gets the total weight of the nodemask as well as each individual node weight, then calculates the number of "interleave rounds" as well as any delta ("partial round"). Calculates the number of pages for each node and allocates them.

If a node was scheduled for interleave via interleave_nodes, the current weight will be allocated first.

Operates only on the task->mempolicy.

One piece of complexity is the interaction between a recent refactor which split the logic to acquire the "ilx" (interleave index) of an allocation and the actually application of the interleave. If a call to alloc_pages_mpol() were made with a weighted-interleave policy and ilx set to NO_INTERLEAVE_INDEX, weighted_interleave_nodes() would operate on a VMA policy - violating the description above.

An inspection of all callers of alloc_pages_mpol() shows that all external callers set ilx to `0`, an index value, or will call get_vma_policy() to acquire the ilx.

For example, mm/shmem.c may call into alloc_pages_mpol. The call stacks all set (pgoff_t ilx) or end up in `get_vma_policy()`. This enforces the `weighted_interleave_nodes()` and `weighted_interleave_nid()` policy requirements (task/vma respectively).

Link: https://lkml.kernel.org/r/20240202170238.90004-4-gregory.price@memverge.com Suggested-by: Hasan Al Maruf Hasan.Maruf@amd.com Signed-off-by: Gregory Price gregory.price@memverge.com Co-developed-by: Rakie Kim rakie.kim@sk.com Signed-off-by: Rakie Kim rakie.kim@sk.com Co-developed-by: Honggyu Kim honggyu.kim@sk.com Signed-off-by: Honggyu Kim honggyu.kim@sk.com Co-developed-by: Hyeongtak Ji hyeongtak.ji@sk.com Signed-off-by: Hyeongtak Ji hyeongtak.ji@sk.com Co-developed-by: Srinivasulu Thanneeru sthanneeru.opensrc@micron.com Signed-off-by: Srinivasulu Thanneeru sthanneeru.opensrc@micron.com Co-developed-by: Ravi Jonnalagadda ravis.opensrc@micron.com Signed-off-by: Ravi Jonnalagadda ravis.opensrc@micron.com Reviewed-by: "Huang, Ying" ying.huang@intel.com Cc: Dan Williams dan.j.williams@intel.com Cc: Johannes Weiner hannes@cmpxchg.org Cc: Jonathan Corbet corbet@lwn.net Cc: Michal Hocko mhocko@kernel.org Signed-off-by: Andrew Morton akpm@linux-foundation.org Signed-off-by: Ze Zuo zuoze1@huawei.com --- .../admin-guide/mm/numa_memory_policy.rst | 9 + include/linux/sched.h | 1 + include/uapi/linux/mempolicy.h | 1 + mm/mempolicy.c | 218 +++++++++++++++++- 4 files changed, 225 insertions(+), 4 deletions(-)

diff --git a/Documentation/admin-guide/mm/numa_memory_policy.rst b/Documentation/admin-guide/mm/numa_memory_policy.rst index eca38fa81e0f..a70f20ce1ffb 100644 --- a/Documentation/admin-guide/mm/numa_memory_policy.rst +++ b/Documentation/admin-guide/mm/numa_memory_policy.rst @@ -250,6 +250,15 @@ MPOL_PREFERRED_MANY can fall back to all existing numa nodes. This is effectively MPOL_PREFERRED allowed for a mask rather than a single node.

+MPOL_WEIGHTED_INTERLEAVE + This mode operates the same as MPOL_INTERLEAVE, except that + interleaving behavior is executed based on weights set in + /sys/kernel/mm/mempolicy/weighted_interleave/ + + Weighted interleave allocates pages on nodes according to a + weight. For example if nodes [0,1] are weighted [5,2], 5 pages + will be allocated on node0 for every 2 pages allocated on node1. + NUMA memory policy supports the following optional mode flags:

MPOL_F_STATIC_NODES diff --git a/include/linux/sched.h b/include/linux/sched.h index b65d74c5e765..f40411aa7b70 100644 --- a/include/linux/sched.h +++ b/include/linux/sched.h @@ -1285,6 +1285,7 @@ struct task_struct { /* Protected by alloc_lock: */ struct mempolicy *mempolicy; short il_prev; + u8 il_weight; short pref_node_fork; #endif #ifdef CONFIG_NUMA_BALANCING diff --git a/include/uapi/linux/mempolicy.h b/include/uapi/linux/mempolicy.h index a8963f7ef4c2..1f9bb10d1a47 100644 --- a/include/uapi/linux/mempolicy.h +++ b/include/uapi/linux/mempolicy.h @@ -23,6 +23,7 @@ enum { MPOL_INTERLEAVE, MPOL_LOCAL, MPOL_PREFERRED_MANY, + MPOL_WEIGHTED_INTERLEAVE, MPOL_MAX, /* always last member of enum */ };

diff --git a/mm/mempolicy.c b/mm/mempolicy.c index 6764eba6890b..2f41cf1856a3 100644 --- a/mm/mempolicy.c +++ b/mm/mempolicy.c @@ -19,6 +19,13 @@ * for anonymous memory. For process policy an process counter * is used. * + * weighted interleave + * Allocate memory interleaved over a set of nodes based on + * a set of weights (per-node), with normal fallback if it + * fails. Otherwise operates the same as interleave. + * Example: nodeset(0,1) & weights (2,1) - 2 pages allocated + * on node 0 for every 1 page allocated on node 1. + * * bind Only allocate memory on a specific set of nodes, * no fallback. * FIXME: memory is allocated starting with the first node @@ -442,6 +449,10 @@ static const struct mempolicy_operations mpol_ops[MPOL_MAX] = { .create = mpol_new_nodemask, .rebind = mpol_rebind_preferred, }, + [MPOL_WEIGHTED_INTERLEAVE] = { + .create = mpol_new_nodemask, + .rebind = mpol_rebind_nodemask, + }, };

static bool migrate_folio_add(struct folio *folio, struct list_head *foliolist, @@ -882,8 +893,11 @@ static long do_set_mempolicy(unsigned short mode, unsigned short flags,

old = current->mempolicy; current->mempolicy = new; - if (new && new->mode == MPOL_INTERLEAVE) + if (new && (new->mode == MPOL_INTERLEAVE || + new->mode == MPOL_WEIGHTED_INTERLEAVE)) { current->il_prev = MAX_NUMNODES-1; + current->il_weight = 0; + } task_unlock(current); mpol_put(old); ret = 0; @@ -908,6 +922,7 @@ static void get_policy_nodemask(struct mempolicy *pol, nodemask_t *nodes) case MPOL_INTERLEAVE: case MPOL_PREFERRED: case MPOL_PREFERRED_MANY: + case MPOL_WEIGHTED_INTERLEAVE: *nodes = pol->nodes; break; case MPOL_LOCAL: @@ -992,6 +1007,13 @@ static long do_get_mempolicy(int *policy, nodemask_t *nmask, } else if (pol == current->mempolicy && pol->mode == MPOL_INTERLEAVE) { *policy = next_node_in(current->il_prev, pol->nodes); + } else if (pol == current->mempolicy && + pol->mode == MPOL_WEIGHTED_INTERLEAVE) { + if (current->il_weight) + *policy = current->il_prev; + else + *policy = next_node_in(current->il_prev, + pol->nodes); } else { err = -EINVAL; goto out; @@ -1355,7 +1377,8 @@ long __do_mbind(unsigned long start, unsigned long len, * VMAs, the nodes will still be interleaved from the targeted * nodemask, but one by one may be selected differently. */ - if (new->mode == MPOL_INTERLEAVE) { + if (new->mode == MPOL_INTERLEAVE || + new->mode == MPOL_WEIGHTED_INTERLEAVE) { struct page *page; unsigned int order; unsigned long addr = -EFAULT; @@ -1810,7 +1833,8 @@ struct mempolicy *__get_vma_policy(struct vm_area_struct *vma, * @vma: virtual memory area whose policy is sought * @addr: address in @vma for shared policy lookup * @order: 0, or appropriate huge_page_order for interleaving - * @ilx: interleave index (output), for use only when MPOL_INTERLEAVE + * @ilx: interleave index (output), for use only when MPOL_INTERLEAVE or + * MPOL_WEIGHTED_INTERLEAVE * * Returns effective policy for a VMA at specified address. * Falls back to current->mempolicy or system default policy, as necessary. @@ -1827,7 +1851,8 @@ struct mempolicy *get_vma_policy(struct vm_area_struct *vma, pol = __get_vma_policy(vma, addr, ilx); if (!pol) pol = get_task_policy(current); - if (pol->mode == MPOL_INTERLEAVE) { + if (pol->mode == MPOL_INTERLEAVE || + pol->mode == MPOL_WEIGHTED_INTERLEAVE) { *ilx += vma->vm_pgoff >> order; *ilx += (addr - vma->vm_start) >> (PAGE_SHIFT + order); } @@ -1877,6 +1902,22 @@ bool apply_policy_zone(struct mempolicy *policy, enum zone_type zone) return zone >= dynamic_policy_zone; }

+static unsigned int weighted_interleave_nodes(struct mempolicy *policy) +{ + unsigned int node = current->il_prev; + + if (!current->il_weight || !node_isset(node, policy->nodes)) { + node = next_node_in(node, policy->nodes); + /* can only happen if nodemask is being rebound */ + if (node == MAX_NUMNODES) + return node; + current->il_prev = node; + current->il_weight = get_il_weight(node); + } + current->il_weight--; + return node; +} + /* Do dynamic interleaving for a process */ static unsigned int interleave_nodes(struct mempolicy *policy) { @@ -1911,6 +1952,9 @@ unsigned int mempolicy_slab_node(void) case MPOL_INTERLEAVE: return interleave_nodes(policy);

+ case MPOL_WEIGHTED_INTERLEAVE: + return weighted_interleave_nodes(policy); + case MPOL_BIND: case MPOL_PREFERRED_MANY: { @@ -1949,6 +1993,45 @@ static unsigned int read_once_policy_nodemask(struct mempolicy *pol, return nodes_weight(*mask); }

+static unsigned int weighted_interleave_nid(struct mempolicy *pol, pgoff_t ilx) +{ + nodemask_t nodemask; + unsigned int target, nr_nodes; + u8 *table; + unsigned int weight_total = 0; + u8 weight; + int nid; + + nr_nodes = read_once_policy_nodemask(pol, &nodemask); + if (!nr_nodes) + return numa_node_id(); + + rcu_read_lock(); + table = rcu_dereference(iw_table); + /* calculate the total weight */ + for_each_node_mask(nid, nodemask) { + /* detect system default usage */ + weight = table ? table[nid] : 1; + weight = weight ? weight : 1; + weight_total += weight; + } + + /* Calculate the node offset based on totals */ + target = ilx % weight_total; + nid = first_node(nodemask); + while (target) { + /* detect system default usage */ + weight = table ? table[nid] : 1; + weight = weight ? weight : 1; + if (target < weight) + break; + target -= weight; + nid = next_node_in(nid, nodemask); + } + rcu_read_unlock(); + return nid; +} + /* * Do static interleaving for interleave index @ilx. Returns the ilx'th * node in pol->nodes (starting from ilx=0), wrapping around if ilx @@ -2014,6 +2097,11 @@ static nodemask_t *policy_nodemask(gfp_t gfp, struct mempolicy *pol, interleave_nodes(pol) : interleave_nid(pol, ilx); *nid = tnid; break; + case MPOL_WEIGHTED_INTERLEAVE: + *nid = (ilx == NO_INTERLEAVE_INDEX) ? + weighted_interleave_nodes(pol) : + weighted_interleave_nid(pol, ilx); + break; }

return nodemask; @@ -2075,6 +2163,7 @@ bool init_nodemask_of_mempolicy(nodemask_t *mask) case MPOL_PREFERRED_MANY: case MPOL_BIND: case MPOL_INTERLEAVE: + case MPOL_WEIGHTED_INTERLEAVE: *mask = mempolicy->nodes; break;

@@ -2175,6 +2264,7 @@ struct page *alloc_pages_mpol(gfp_t gfp, unsigned int order, * node in its nodemask, we allocate the standard way. */ if (pol->mode != MPOL_INTERLEAVE && + pol->mode != MPOL_WEIGHTED_INTERLEAVE && (!nodemask || node_isset(nid, *nodemask))) { /* * First, try to allocate THP only on local node, but @@ -2311,6 +2401,114 @@ static unsigned long alloc_pages_bulk_array_interleave(gfp_t gfp, return total_allocated; }

+static unsigned long alloc_pages_bulk_array_weighted_interleave(gfp_t gfp, + struct mempolicy *pol, unsigned long nr_pages, + struct page **page_array) +{ + struct task_struct *me = current; + unsigned long total_allocated = 0; + unsigned long nr_allocated = 0; + unsigned long rounds; + unsigned long node_pages, delta; + u8 *table, *weights, weight; + unsigned int weight_total = 0; + unsigned long rem_pages = nr_pages; + nodemask_t nodes; + int nnodes, node; + int resume_node = MAX_NUMNODES - 1; + u8 resume_weight = 0; + int prev_node; + int i; + + if (!nr_pages) + return 0; + + nnodes = read_once_policy_nodemask(pol, &nodes); + if (!nnodes) + return 0; + + /* Continue allocating from most recent node and adjust the nr_pages */ + node = me->il_prev; + weight = me->il_weight; + if (weight && node_isset(node, nodes)) { + node_pages = min(rem_pages, weight); + nr_allocated = __alloc_pages_bulk(gfp, node, NULL, node_pages, + NULL, page_array); + page_array += nr_allocated; + total_allocated += nr_allocated; + /* if that's all the pages, no need to interleave */ + if (rem_pages <= weight) { + me->il_weight -= rem_pages; + return total_allocated; + } + /* Otherwise we adjust remaining pages, continue from there */ + rem_pages -= weight; + } + /* clear active weight in case of an allocation failure */ + me->il_weight = 0; + prev_node = node; + + /* create a local copy of node weights to operate on outside rcu */ + weights = kzalloc(nr_node_ids, GFP_KERNEL); + if (!weights) + return total_allocated; + + rcu_read_lock(); + table = rcu_dereference(iw_table); + if (table) + memcpy(weights, table, nr_node_ids); + rcu_read_unlock(); + + /* calculate total, detect system default usage */ + for_each_node_mask(node, nodes) { + if (!weights[node]) + weights[node] = 1; + weight_total += weights[node]; + } + + /* + * Calculate rounds/partial rounds to minimize __alloc_pages_bulk calls. + * Track which node weighted interleave should resume from. + * + * if (rounds > 0) and (delta == 0), resume_node will always be + * the node following prev_node and its weight. + */ + rounds = rem_pages / weight_total; + delta = rem_pages % weight_total; + resume_node = next_node_in(prev_node, nodes); + resume_weight = weights[resume_node]; + for (i = 0; i < nnodes; i++) { + node = next_node_in(prev_node, nodes); + weight = weights[node]; + node_pages = weight * rounds; + /* If a delta exists, add this node's portion of the delta */ + if (delta > weight) { + node_pages += weight; + delta -= weight; + } else if (delta) { + /* when delta is depleted, resume from that node */ + node_pages += delta; + resume_node = node; + resume_weight = weight - delta; + delta = 0; + } + /* node_pages can be 0 if an allocation fails and rounds == 0 */ + if (!node_pages) + break; + nr_allocated = __alloc_pages_bulk(gfp, node, NULL, node_pages, + NULL, page_array); + page_array += nr_allocated; + total_allocated += nr_allocated; + if (total_allocated == nr_pages) + break; + prev_node = node; + } + me->il_prev = resume_node; + me->il_weight = resume_weight; + kfree(weights); + return total_allocated; +} + static unsigned long alloc_pages_bulk_array_preferred_many(gfp_t gfp, int nid, struct mempolicy *pol, unsigned long nr_pages, struct page **page_array) @@ -2351,6 +2549,10 @@ unsigned long alloc_pages_bulk_array_mempolicy(gfp_t gfp, return alloc_pages_bulk_array_interleave(gfp, pol, nr_pages, page_array);

+ if (pol->mode == MPOL_WEIGHTED_INTERLEAVE) + return alloc_pages_bulk_array_weighted_interleave( + gfp, pol, nr_pages, page_array); + if (pol->mode == MPOL_PREFERRED_MANY) return alloc_pages_bulk_array_preferred_many(gfp, numa_node_id(), pol, nr_pages, page_array); @@ -2426,6 +2628,7 @@ bool __mpol_equal(struct mempolicy *a, struct mempolicy *b) case MPOL_INTERLEAVE: case MPOL_PREFERRED: case MPOL_PREFERRED_MANY: + case MPOL_WEIGHTED_INTERLEAVE: return !!nodes_equal(a->nodes, b->nodes); case MPOL_LOCAL: return true; @@ -2562,6 +2765,10 @@ int mpol_misplaced(struct folio *folio, struct vm_area_struct *vma, polnid = interleave_nid(pol, ilx); break;

+ case MPOL_WEIGHTED_INTERLEAVE: + polnid = weighted_interleave_nid(pol, ilx); + break; + case MPOL_PREFERRED: if (node_isset(curnid, pol->nodes)) goto out; @@ -2936,6 +3143,7 @@ static const char * const policy_modes[] = [MPOL_PREFERRED] = "prefer", [MPOL_BIND] = "bind", [MPOL_INTERLEAVE] = "interleave", + [MPOL_WEIGHTED_INTERLEAVE] = "weighted interleave", [MPOL_LOCAL] = "local", [MPOL_PREFERRED_MANY] = "prefer (many)", }; @@ -2995,6 +3203,7 @@ int mpol_parse_str(char *str, struct mempolicy **mpol) } break; case MPOL_INTERLEAVE: + case MPOL_WEIGHTED_INTERLEAVE: /* * Default to online nodes with memory if no nodelist */ @@ -3105,6 +3314,7 @@ void mpol_to_str(char *buffer, int maxlen, struct mempolicy *pol) case MPOL_PREFERRED_MANY: case MPOL_BIND: case MPOL_INTERLEAVE: + case MPOL_WEIGHTED_INTERLEAVE: nodes = pol->nodes; break; default:

-- 2.33.0

patchwork bot

10:14 p.m.

New subject: [PATCH OLK-6.6 v2] mm/mempolicy: introduce MPOL_WEIGHTED_INTERLEAVE for weighted interleaving

反馈: 您发送到kernel@openeuler.org的补丁/补丁集，转换为PR失败！邮件列表地址：https://mailweb.openeuler.org/hyperkitty/list/kernel@openeuler.org/message/K... 失败原因：应用补丁/补丁集失败，Patch failed at 0001 mm/mempolicy: introduce MPOL_WEIGHTED_INTERLEAVE for weighted interleaving 建议解决方法：请查看失败原因，确认补丁是否可以应用在当前期望分支的最新代码上

FeedBack: The patch(es) which you have sent to kernel@openeuler.org has been converted to PR failed! Mailing list address: https://mailweb.openeuler.org/hyperkitty/list/kernel@openeuler.org/message/K... Failed Reason: apply patch(es) failed, Patch failed at 0001 mm/mempolicy: introduce MPOL_WEIGHTED_INTERLEAVE for weighted interleaving Suggest Solution: please checkout if the failed patch(es) can work on the newest codes in expected branch

Ze Zuo

9:49 p.m.

New subject: [PATCH OLK-6.6 v2] mm/mempolicy: protect task interleave functions with tsk->mems_allowed_seq

From: Gregory Price gourry.memverge@gmail.com

mainline inclusion from mainline-v6.9-rc1 commit 274519ed414bd2b9a77c5db78ee51778d37ceacf category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I9OHHN CVE: NA

Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i...

--------------------------------

In the event of rebind, pol->nodemask can change at the same time as an allocation occurs. We can detect this with tsk->mems_allowed_seq and prevent a miscount or an allocation failure from occurring.

The same thing happens in the allocators to detect failure, but this can prevent spurious failures in a much smaller critical section.

[gourry.memverge@gmail.com: weighted interleave checks wrong parameter] Link: https://lkml.kernel.org/r/20240206192853.3589-1-gregory.price@memverge.com Link: https://lkml.kernel.org/r/20240202170238.90004-5-gregory.price@memverge.com Signed-off-by: Gregory Price gregory.price@memverge.com Suggested-by: "Huang, Ying" ying.huang@intel.com Cc: Dan Williams dan.j.williams@intel.com Cc: Hasan Al Maruf Hasan.Maruf@amd.com Cc: Honggyu Kim honggyu.kim@sk.com Cc: Hyeongtak Ji hyeongtak.ji@sk.com Cc: Johannes Weiner hannes@cmpxchg.org Cc: Jonathan Corbet corbet@lwn.net Cc: Michal Hocko mhocko@kernel.org Cc: Rakie Kim rakie.kim@sk.com Cc: Ravi Jonnalagadda ravis.opensrc@micron.com Cc: Srinivasulu Thanneeru sthanneeru.opensrc@micron.com Signed-off-by: Andrew Morton akpm@linux-foundation.org Signed-off-by: Ze Zuo zuoze1@huawei.com --- mm/mempolicy.c | 27 +++++++++++++++++++++++---- 1 file changed, 23 insertions(+), 4 deletions(-)

diff --git a/mm/mempolicy.c b/mm/mempolicy.c index 2f41cf1856a3..64ef9b7a06e7 100644 --- a/mm/mempolicy.c +++ b/mm/mempolicy.c @@ -1904,11 +1904,17 @@ bool apply_policy_zone(struct mempolicy *policy, enum zone_type zone)

static unsigned int weighted_interleave_nodes(struct mempolicy *policy) { - unsigned int node = current->il_prev; + unsigned int node; + unsigned int cpuset_mems_cookie;

+retry: + /* to prevent miscount use tsk->mems_allowed_seq to detect rebind */ + cpuset_mems_cookie = read_mems_allowed_begin(); + node = current->il_prev; if (!current->il_weight || !node_isset(node, policy->nodes)) { node = next_node_in(node, policy->nodes); - /* can only happen if nodemask is being rebound */ + if (read_mems_allowed_retry(cpuset_mems_cookie)) + goto retry; if (node == MAX_NUMNODES) return node; current->il_prev = node; @@ -1922,8 +1928,14 @@ static unsigned int weighted_interleave_nodes(struct mempolicy *policy) static unsigned int interleave_nodes(struct mempolicy *policy) { unsigned int nid; + unsigned int cpuset_mems_cookie; + + /* to prevent miscount, use tsk->mems_allowed_seq to detect rebind */ + do { + cpuset_mems_cookie = read_mems_allowed_begin(); + nid = next_node_in(current->il_prev, policy->nodes); + } while (read_mems_allowed_retry(cpuset_mems_cookie));

- nid = next_node_in(current->il_prev, policy->nodes); if (nid < MAX_NUMNODES) current->il_prev = nid; return nid; @@ -2406,6 +2418,7 @@ static unsigned long alloc_pages_bulk_array_weighted_interleave(gfp_t gfp, struct page **page_array) { struct task_struct *me = current; + unsigned int cpuset_mems_cookie; unsigned long total_allocated = 0; unsigned long nr_allocated = 0; unsigned long rounds; @@ -2423,7 +2436,13 @@ static unsigned long alloc_pages_bulk_array_weighted_interleave(gfp_t gfp, if (!nr_pages) return 0;

- nnodes = read_once_policy_nodemask(pol, &nodes); + /* read the nodes onto the stack, retry if done during rebind */ + do { + cpuset_mems_cookie = read_mems_allowed_begin(); + nnodes = read_once_policy_nodemask(pol, &nodes); + } while (read_mems_allowed_retry(cpuset_mems_cookie)); + + /* if the nodemask has become invalid, we cannot do anything */ if (!nnodes) return 0;

-- 2.33.0

patchwork bot

10:14 p.m.

New subject: [PATCH OLK-6.6 v2] mm/mempolicy: protect task interleave functions with tsk->mems_allowed_seq

反馈: 您发送到kernel@openeuler.org的补丁/补丁集，转换为PR失败！邮件列表地址：https://mailweb.openeuler.org/hyperkitty/list/kernel@openeuler.org/message/P... 失败原因：应用补丁/补丁集失败，Patch failed at 0001 mm/mempolicy: protect task interleave functions with tsk->mems_allowed_seq 建议解决方法：请查看失败原因，确认补丁是否可以应用在当前期望分支的最新代码上

FeedBack: The patch(es) which you have sent to kernel@openeuler.org has been converted to PR failed! Mailing list address: https://mailweb.openeuler.org/hyperkitty/list/kernel@openeuler.org/message/P... Failed Reason: apply patch(es) failed, Patch failed at 0001 mm/mempolicy: protect task interleave functions with tsk->mems_allowed_seq Suggest Solution: please checkout if the failed patch(es) can work on the newest codes in expected branch

245

Age (days ago)

245

Last active (days ago)

kernel@openeuler.org

28 comments

2 participants

tags (0)

participants (2)

patchwork bot
Ze Zuo