Chen Ridong (1):
  cgroup/cpuset: fix panic caused by partcmd_update

Waiman Long (3):
  cgroup/cpuset: Optimize isolated partition only generate_sched_domains() calls
  cgroup/cpuset: Fix remote root partition creation problem
  cgroup/cpuset: Clear effective_xcpus on cpus_allowed clearing only if cpus.exclusive not set
 kernel/cgroup/cpuset.c | 68 +++++++++++++++++++++++++++++++++---------
 1 file changed, 54 insertions(+), 14 deletions(-)
From: Waiman Long <longman@redhat.com>
mainline inclusion
from mainline-v6.11-rc1
commit 1805c1729f52edaa021288473b09f9c7f74fb1ca
category: bugfix
bugzilla: https://gitee.com/openeuler/kernel/issues/IAOPW2
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?id=...
----------------------------------------------------------------------
If only isolated partitions are being created underneath the cgroup root, there will only be one sched domain, covering top_cpuset.effective_cpus. We can skip the unnecessary sched-domain scanning code and save some cycles.
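The shortcut itself is tiny; the following user-space sketch models the decision only (ndoms_for() is a made-up helper here — the real code jumps back to the existing single-domain path via the new single_root_domain label):

#include <stdbool.h>
#include <stdio.h>

/*
 * If the scan found only the root cpuset (csn == 1) and the root is load
 * balancing, one sched domain covering top_cpuset.effective_cpus is all
 * that is needed.
 */
static int ndoms_for(bool root_load_balance, int csn)
{
	if (root_load_balance && csn == 1)
		return 1;		/* take the single-domain shortcut */
	return csn;			/* fall through to the full scan */
}

int main(void)
{
	printf("%d\n", ndoms_for(true, 1));	/* 1: shortcut taken */
	printf("%d\n", ndoms_for(true, 3));	/* 3: full scan needed */
	return 0;
}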
Signed-off-by: Waiman Long <longman@redhat.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Chen Ridong <chenridong@huawei.com>
---
 kernel/cgroup/cpuset.c | 8 ++++++++
 1 file changed, 8 insertions(+)
diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
index b681587b84ca..a5bbbfbb30a8 100644
--- a/kernel/cgroup/cpuset.c
+++ b/kernel/cgroup/cpuset.c
@@ -1121,6 +1121,7 @@ static int generate_sched_domains(cpumask_var_t **domains,

 	/* Special case for the 99% of systems with one, full, sched domain */
 	if (root_load_balance && !top_cpuset.nr_subparts) {
+single_root_domain:
 		ndoms = 1;
 		doms = alloc_sched_domains(ndoms);
 		if (!doms)
@@ -1179,6 +1180,13 @@ static int generate_sched_domains(cpumask_var_t **domains,
 	}
 	rcu_read_unlock();

+	/*
+	 * If there are only isolated partitions underneath the cgroup root,
+	 * we can optimize out unneeded sched domains scanning.
+	 */
+	if (root_load_balance && (csn == 1))
+		goto single_root_domain;
+
 	for (i = 0; i < csn; i++)
 		csa[i]->pn = i;
 	ndoms = csn;
From: Waiman Long <longman@redhat.com>
mainline inclusion
from mainline-v6.11-rc1
commit ccac8e8de99cbcf5e7f53251ebce917bf7bcc29c
category: bugfix
bugzilla: https://gitee.com/openeuler/kernel/issues/IAOPW2
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?id=...
----------------------------------------------------------------------
Since commit 181c8e091aae ("cgroup/cpuset: Introduce remote partition"), a remote partition can be created underneath a non-partition root cpuset as long as its exclusive_cpus are set to distribute exclusive CPUs down to its children. The generate_sched_domains() function, however, doesn't take into account this new behavior and hence will fail to create the sched domain needed for a remote root (non-isolated) partition.
There are two issues related to remote partition support. First of all, generate_sched_domains() has a fast path that is activated if root_load_balance is true and top_cpuset.nr_subparts is non-zero. The latter condition isn't quite correct for remote partitions, as nr_subparts only counts the local child partitions directly underneath it. There can be no local child partition under top_cpuset even when there are remote partitions further down the hierarchy. Fix that by checking subpartitions_cpus instead, which contains the exclusive CPUs allocated to both local and remote partitions.

Secondly, the valid-partition check used to skip subtrees in the csa[] generation loop isn't sufficient, as a remote partition does not need to have a partition root parent. Fix this by breaking the csa[] generation loop of generate_sched_domains() into v1- and v2-specific parts and checking a cpuset's exclusive_cpus before skipping its subtree in the v2 case.

Also simplify generate_sched_domains() for cgroup v2, as only non-isolating partition roots should be included in building the cpuset array; none of the v1 scheduling attributes are supported there, other than a different way of creating an isolated partition. A sketch of the resulting v2 logic follows.
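The v2 half of the rewritten loop boils down to two predicates. A minimal stand-alone sketch, with boolean fields standing in for the kernel's cpumask tests and a simplified struct cs in place of struct cpuset (illustrative only, not the kernel code):

#include <stdbool.h>
#include <stdio.h>

enum prs { PRS_MEMBER, PRS_ROOT, PRS_ISOLATED };

struct cs {
	enum prs partition_root_state;
	bool effective_cpus_empty;
	bool exclusive_cpus_empty;
	bool partition_valid;
};

/* v2: record @cp as a sched-domain root? */
static bool v2_record(const struct cs *cp)
{
	return cp->partition_root_state == PRS_ROOT &&
	       !cp->effective_cpus_empty;
}

/* v2: may the walk prune @cp's whole subtree? */
static bool v2_skip_subtree(const struct cs *cp)
{
	/*
	 * A remote partition may sit further down even when @cp itself is
	 * not a partition root, as long as @cp still has exclusive CPUs
	 * to hand down -- so only prune when neither holds.
	 */
	return !cp->partition_valid && cp->exclusive_cpus_empty;
}

int main(void)
{
	struct cs remote_parent = {
		.partition_root_state = PRS_MEMBER,
		.effective_cpus_empty = false,
		.exclusive_cpus_empty = false,	/* distributes exclusive CPUs */
		.partition_valid = false,
	};

	/* not a partition root itself, but its subtree must NOT be pruned */
	printf("record=%d skip=%d\n",
	       v2_record(&remote_parent), v2_skip_subtree(&remote_parent));
	return 0;
}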
Fixes: 027bac95c74c ("cgroup/cpuset: Introduce remote partition")
Signed-off-by: Waiman Long <longman@redhat.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Chen Ridong <chenridong@huawei.com>
---
 kernel/cgroup/cpuset.c | 55 ++++++++++++++++++++++++++++++++----------
 1 file changed, 42 insertions(+), 13 deletions(-)
diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
index a5bbbfbb30a8..cf3410fd2fe2 100644
--- a/kernel/cgroup/cpuset.c
+++ b/kernel/cgroup/cpuset.c
@@ -180,7 +180,7 @@ struct cpuset {
 	/* for custom sched domain */
 	int relax_domain_level;

-	/* number of valid sub-partitions */
+	/* number of valid local child partitions */
 	int nr_subparts;

 	/* partition root state */
@@ -1114,13 +1114,14 @@ static int generate_sched_domains(cpumask_var_t **domains,
 	int nslot;		/* next empty doms[] struct cpumask slot */
 	struct cgroup_subsys_state *pos_css;
 	bool root_load_balance = is_sched_load_balance(&top_cpuset);
+	bool cgrpv2 = cgroup_subsys_on_dfl(cpuset_cgrp_subsys);

 	doms = NULL;
 	dattr = NULL;
 	csa = NULL;

 	/* Special case for the 99% of systems with one, full, sched domain */
-	if (root_load_balance && !top_cpuset.nr_subparts) {
+	if (root_load_balance && cpumask_empty(subpartitions_cpus)) {
 single_root_domain:
 		ndoms = 1;
 		doms = alloc_sched_domains(ndoms);
@@ -1149,16 +1150,18 @@ static int generate_sched_domains(cpumask_var_t **domains,
 	cpuset_for_each_descendant_pre(cp, pos_css, &top_cpuset) {
 		if (cp == &top_cpuset)
 			continue;
+
+		if (cgrpv2)
+			goto v2;
+
 		/*
+		 * v1:
 		 * Continue traversing beyond @cp iff @cp has some CPUs and
 		 * isn't load balancing.  The former is obvious.  The
 		 * latter: All child cpusets contain a subset of the
 		 * parent's cpus, so just skip them, and then we call
 		 * update_domain_attr_tree() to calc relax_domain_level of
 		 * the corresponding sched domain.
-		 *
-		 * If root is load-balancing, we can skip @cp if it
-		 * is a subset of the root's effective_cpus.
 		 */
 		if (!cpumask_empty(cp->cpus_allowed) &&
 		    !(is_sched_load_balance(cp) &&
@@ -1166,16 +1169,28 @@ static int generate_sched_domains(cpumask_var_t **domains,
 		      housekeeping_cpumask(HK_TYPE_DOMAIN))))
 			continue;

-		if (root_load_balance &&
-		    cpumask_subset(cp->cpus_allowed, top_cpuset.effective_cpus))
-			continue;
-
 		if (is_sched_load_balance(cp) &&
 		    !cpumask_empty(cp->effective_cpus))
 			csa[csn++] = cp;

-		/* skip @cp's subtree if not a partition root */
-		if (!is_partition_valid(cp))
+		/* skip @cp's subtree */
+		pos_css = css_rightmost_descendant(pos_css);
+		continue;
+
+v2:
+		/*
+		 * Only valid partition roots that are not isolated and with
+		 * non-empty effective_cpus will be saved into csn[].
+		 */
+		if ((cp->partition_root_state == PRS_ROOT) &&
+		    !cpumask_empty(cp->effective_cpus))
+			csa[csn++] = cp;
+
+		/*
+		 * Skip @cp's subtree if not a partition root and has no
+		 * exclusive CPUs to be granted to child cpusets.
+		 */
+		if (!is_partition_valid(cp) && cpumask_empty(cp->exclusive_cpus))
 			pos_css = css_rightmost_descendant(pos_css);
 	}
 	rcu_read_unlock();
@@ -1229,6 +1244,20 @@ static int generate_sched_domains(cpumask_var_t **domains,
 		dattr = kmalloc_array(ndoms, sizeof(struct sched_domain_attr),
 				      GFP_KERNEL);

+	/*
+	 * Cgroup v2 doesn't support domain attributes, just set all of them
+	 * to SD_ATTR_INIT. Also non-isolating partition root CPUs are a
+	 * subset of HK_TYPE_DOMAIN housekeeping CPUs.
+	 */
+	if (cgrpv2) {
+		for (i = 0; i < ndoms; i++) {
+			cpumask_copy(doms[i], csa[i]->effective_cpus);
+			if (dattr)
+				dattr[i] = SD_ATTR_INIT;
+		}
+		goto done;
+	}
+
 	for (nslot = 0, i = 0; i < csn; i++) {
 		struct cpuset *a = csa[i];
 		struct cpumask *dp;
@@ -1388,7 +1417,7 @@ static void rebuild_sched_domains_locked(void)
 	 * root should be only a subset of the active CPUs.  Since a CPU in any
 	 * partition root could be offlined, all must be checked.
 	 */
-	if (top_cpuset.nr_subparts) {
+	if (!cpumask_empty(subpartitions_cpus)) {
 		rcu_read_lock();
 		cpuset_for_each_descendant_pre(cs, pos_css, &top_cpuset) {
 			if (!is_partition_valid(cs)) {
@@ -4679,7 +4708,7 @@ static void cpuset_handle_hotplug(void)
 	 * In the rare case that hotplug removes all the cpus in
 	 * subpartitions_cpus, we assumed that cpus are updated.
 	 */
-	if (!cpus_updated && top_cpuset.nr_subparts)
+	if (!cpus_updated && !cpumask_empty(subpartitions_cpus))
 		cpus_updated = true;

 	/* For v1, synchronize cpus_allowed to cpu_active_mask */
mainline inclusion
from mainline-v6.11-rc5
commit 959ab6350add903e352890af53e86663739fcb9a
category: bugfix
bugzilla: https://gitee.com/openeuler/kernel/issues/IAOPW2
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?id=...
----------------------------------------------------------------------
We find a bug as below:

BUG: unable to handle page fault for address: 00000003
PGD 0 P4D 0
Oops: 0000 [#1] PREEMPT SMP NOPTI
CPU: 3 PID: 358 Comm: bash Tainted: G W I 6.6.0-10893-g60d6
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.15.0-1 04/4
RIP: 0010:partition_sched_domains_locked+0x483/0x600
Code: 01 48 85 d2 74 0d 48 83 05 29 3f f8 03 01 f3 48 0f bc c2 89 c0 48 9
RSP: 0018:ffffc90000fdbc58 EFLAGS: 00000202
RAX: 0000000100000003 RBX: ffff888100b3dfa0 RCX: 0000000000000000
RDX: 0000000000000000 RSI: 0000000000000000 RDI: 000000000002fe80
RBP: ffff888100b3dfb0 R08: 0000000000000001 R09: 0000000000000000
R10: ffffc90000fdbcb0 R11: 0000000000000004 R12: 0000000000000002
R13: ffff888100a92b48 R14: 0000000000000000 R15: 0000000000000000
FS:  00007f44a5425740(0000) GS:ffff888237d80000(0000) knlGS:0000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000000100030973 CR3: 000000010722c000 CR4: 00000000000006e0
Call Trace:
 <TASK>
 ? show_regs+0x8c/0xa0
 ? __die_body+0x23/0xa0
 ? __die+0x3a/0x50
 ? page_fault_oops+0x1d2/0x5c0
 ? partition_sched_domains_locked+0x483/0x600
 ? search_module_extables+0x2a/0xb0
 ? search_exception_tables+0x67/0x90
 ? kernelmode_fixup_or_oops+0x144/0x1b0
 ? __bad_area_nosemaphore+0x211/0x360
 ? up_read+0x3b/0x50
 ? bad_area_nosemaphore+0x1a/0x30
 ? exc_page_fault+0x890/0xd90
 ? __lock_acquire.constprop.0+0x24f/0x8d0
 ? __lock_acquire.constprop.0+0x24f/0x8d0
 ? asm_exc_page_fault+0x26/0x30
 ? partition_sched_domains_locked+0x483/0x600
 ? partition_sched_domains_locked+0xf0/0x600
 rebuild_sched_domains_locked+0x806/0xdc0
 update_partition_sd_lb+0x118/0x130
 cpuset_write_resmask+0xffc/0x1420
 cgroup_file_write+0xb2/0x290
 kernfs_fop_write_iter+0x194/0x290
 new_sync_write+0xeb/0x160
 vfs_write+0x16f/0x1d0
 ksys_write+0x81/0x180
 __x64_sys_write+0x21/0x30
 x64_sys_call+0x2f25/0x4630
 do_syscall_64+0x44/0xb0
 entry_SYSCALL_64_after_hwframe+0x78/0xe2
RIP: 0033:0x7f44a553c887
It can be reproduced with the following commands:

	cd /sys/fs/cgroup/
	mkdir test
	cd test/
	echo +cpuset > ../cgroup.subtree_control
	echo root > cpuset.cpus.partition
	cat /sys/fs/cgroup/cpuset.cpus.effective
	0-3
	echo 0-3 > cpuset.cpus		// taking away all cpus from root
This issue is caused by incorrect rebuilding of the scheduling domains. In this scenario, test/cpuset.cpus.partition should become an invalid root and should not trigger a rebuild of the scheduling domains. When update_parent_effective_cpumask() is called with partcmd_update and a non-NULL newmask, it must recheck newmask to verify that there are still CPUs available for a parent/cs that has tasks.
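The essence of the recheck, as a stand-alone sketch (plain unsigned long bitmasks and a simplified struct cs stand in for the kernel's cpumasks and struct cpuset; the real tasks_nocpu_error() runs under cpuset_mutex):

#include <stdbool.h>
#include <stdio.h>

struct cs {
	int nr_tasks;		/* tasks in this cpuset */
	unsigned long cpus;	/* stand-in bitmask of usable CPUs */
};

/*
 * Would granting @newmask to the child leave either cpuset with tasks
 * but no CPU left to run them?  Illustrative stand-in for the kernel's
 * tasks_nocpu_error().
 */
static bool tasks_nocpu_error(const struct cs *parent, const struct cs *cs,
			      unsigned long newmask)
{
	return (parent->nr_tasks && !(parent->cpus & ~newmask)) ||
	       (cs->nr_tasks && !newmask);
}

int main(void)
{
	/* reproducer: root has CPUs 0-3 and tasks; child asks for all of them */
	struct cs root = { .nr_tasks = 1, .cpus = 0xf };
	struct cs test = { .nr_tasks = 0, .cpus = 0x0 };

	printf("nocpu=%d\n", tasks_nocpu_error(&root, &test, 0xf)); /* 1 */
	return 0;
}

With the recheck in place, the populated root being left with no CPU makes nocpu true, so the partition turns invalid instead of triggering the faulty domain rebuild.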
Fixes: 48a64642c526 ("cgroup/cpuset: Add cpuset.cpus.exclusive.effective for v2")
Cc: stable@vger.kernel.org # v6.7+
Signed-off-by: Chen Ridong <chenridong@huawei.com>
Signed-off-by: Waiman Long <longman@redhat.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Chen Ridong <chenridong@huawei.com>
---
 kernel/cgroup/cpuset.c | 2 ++
 1 file changed, 2 insertions(+)
diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
index cf3410fd2fe2..9b08eac75f08 100644
--- a/kernel/cgroup/cpuset.c
+++ b/kernel/cgroup/cpuset.c
@@ -1991,6 +1991,8 @@ static int update_parent_effective_cpumask(struct cpuset *cs, int cmd,
 			part_error = PERR_CPUSEMPTY;
 			goto write_error;
 		}
+		/* Check newmask again, whether cpus are available for parent/cs */
+		nocpu |= tasks_nocpu_error(parent, cs, newmask);

 		/*
 		 * partcmd_update with newmask:
From: Waiman Long <longman@redhat.com>
mainline inclusion
from mainline-v6.11-rc5
commit 311a1bdc44a8e06024df4fd3392be0dfc8298655
category: bugfix
bugzilla: https://gitee.com/openeuler/kernel/issues/IAOPW2
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?id=...
----------------------------------------------------------------------
Commit e2ffe502ba45 ("cgroup/cpuset: Add cpuset.cpus.exclusive for v2") added a user-writable cpuset.cpus.exclusive file for setting the exclusive CPUs to be used for the creation of partitions. Since then, effective_xcpus has depended on both the cpuset.cpus and cpuset.cpus.exclusive settings: if cpuset.cpus.exclusive is set, effective_xcpus depends only on cpuset.cpus.exclusive; when it is not set, effective_xcpus is derived from the cpuset.cpus value when the cpuset becomes a valid partition root.
When cpuset.cpus is being cleared by the user, effective_xcpus should only be cleared when cpuset.cpus.exclusive is not set. However, that is not currently the case.
	# cd /sys/fs/cgroup/
	# mkdir test
	# echo +cpuset > cgroup.subtree_control
	# cd test
	# echo 3 > cpuset.cpus.exclusive
	# cat cpuset.cpus.exclusive.effective
	3
	# echo > cpuset.cpus
	# cat cpuset.cpus.exclusive.effective		// was cleared
Fix it by clearing effective_xcpus only if cpuset.cpus.exclusive is not set.
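The corrected rule, modeled as a stand-alone sketch (plain unsigned long bitmasks and a simplified struct cs stand in for the kernel's cpumasks and struct cpuset; clear_cpus() is a made-up helper for the "echo > cpuset.cpus" path in update_cpumask()):

#include <stdio.h>

struct cs {
	unsigned long cpus_allowed;	/* cpuset.cpus */
	unsigned long exclusive_cpus;	/* cpuset.cpus.exclusive, as written */
	unsigned long effective_xcpus;
};

static void clear_cpus(struct cs *trialcs)
{
	trialcs->cpus_allowed = 0;
	/*
	 * effective_xcpus only tracks cpuset.cpus while cpuset.cpus.exclusive
	 * is unset; an explicitly written exclusive set must survive
	 * "echo > cpuset.cpus".
	 */
	if (!trialcs->exclusive_cpus)
		trialcs->effective_xcpus = 0;
}

int main(void)
{
	struct cs t = { .cpus_allowed = 0xf, .exclusive_cpus = 0x8,
			.effective_xcpus = 0x8 };

	clear_cpus(&t);
	printf("effective_xcpus = %#lx\n", t.effective_xcpus); /* 0x8, kept */
	return 0;
}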
Fixes: 1e0b07acf656 ("cgroup/cpuset: Add cpuset.cpus.exclusive for v2")
Cc: stable@vger.kernel.org # v6.7+
Reported-by: Chen Ridong <chenridong@huawei.com>
Signed-off-by: Waiman Long <longman@redhat.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Chen Ridong <chenridong@huawei.com>
---
 kernel/cgroup/cpuset.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)
diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
index 9b08eac75f08..2c9e50f09fc1 100644
--- a/kernel/cgroup/cpuset.c
+++ b/kernel/cgroup/cpuset.c
@@ -2535,7 +2535,8 @@ static int update_cpumask(struct cpuset *cs, struct cpuset *trialcs,
 	 */
 	if (!*buf) {
 		cpumask_clear(trialcs->cpus_allowed);
-		cpumask_clear(trialcs->effective_xcpus);
+		if (cpumask_empty(trialcs->exclusive_cpus))
+			cpumask_clear(trialcs->effective_xcpus);
 	} else {
 		retval = cpulist_parse(buf, trialcs->cpus_allowed);
 		if (retval < 0)
Feedback: the patch(es) you sent to kernel@openeuler.org could not be converted to a PR!
Mailing list address: https://mailweb.openeuler.org/hyperkitty/list/kernel@openeuler.org/message/3...
Failure reason: syncing the origin kernel's code to the fork repository failed
Suggested solution: please wait, the bot will retry in the next interval
Feedback: the patch(es) you sent to the kernel@openeuler.org mailing list have been successfully converted to a pull request!
Pull request link: https://gitee.com/openeuler/kernel/pulls/11402
Mailing list address: https://mailweb.openeuler.org/hyperkitty/list/kernel@openeuler.org/message/3...