From: Daniel Sneddon <daniel.sneddon@linux.intel.com>

stable inclusion
from stable-v5.10.136
commit 509c2c9fe75ea7493eebbb6bb2f711f37530ae19
category: bugfix
bugzilla: https://gitee.com/src-openeuler/kernel/issues/I5N1SO
CVE: CVE-2022-26373
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?id=...
--------------------------------
commit 2b1299322016731d56807aa49254a5ea3080b6b3 upstream.
tl;dr: The Enhanced IBRS mitigation for Spectre v2 does not work as documented for RET instructions after VM exits. Mitigate it with a new one-entry RSB stuffing mechanism and a new LFENCE.
== Background ==
Indirect Branch Restricted Speculation (IBRS) was designed to help mitigate Branch Target Injection and Speculative Store Bypass, i.e. Spectre, attacks. IBRS prevents software run in less privileged modes from affecting branch prediction in more privileged modes. IBRS requires the MSR to be written on every privilege level change.
To overcome some of the performance issues of IBRS, Enhanced IBRS was introduced. eIBRS is an "always on" IBRS, in other words, just turn it on once instead of writing the MSR on every privilege level change. When eIBRS is enabled, more privileged modes should be protected from less privileged modes, including protecting VMMs from guests.
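For illustration, both flavors are armed through the same SPEC_CTRL MSR bit; what differs is how often it must be written. A minimal sketch of that difference (the function names are illustrative, not the kernel's actual entry-path code):

  #include <asm/msr.h>        /* wrmsrl() */
  #include <asm/msr-index.h>  /* MSR_IA32_SPEC_CTRL, SPEC_CTRL_IBRS */

  /* Legacy IBRS: the bit must be rewritten on every privilege level
   * change, e.g. on each kernel entry, which is what hurts performance. */
  static void legacy_ibrs_on_kernel_entry(void)
  {
          wrmsrl(MSR_IA32_SPEC_CTRL, SPEC_CTRL_IBRS);
  }

  /* eIBRS: a single write at boot suffices; the protection then stays
   * in effect across all later privilege level changes. */
  static void eibrs_enable_once_at_boot(void)
  {
          wrmsrl(MSR_IA32_SPEC_CTRL, SPEC_CTRL_IBRS);
  }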
== Problem ==
Here's a simplification of how guests are run on Linux' KVM:
  void run_kvm_guest(void)
  {
          // Prepare to run guest
          VMRESUME();
          // Clean up after guest runs
  }
The execution flow for that would look something like this to the processor:
  1. Host-side: call run_kvm_guest()
  2. Host-side: VMRESUME
  3. Guest runs, does "CALL guest_function"
  4. VM exit, host runs again
  5. Host might make some "cleanup" function calls
  6. Host-side: RET from run_kvm_guest()
Now, when back on the host, there are a couple of possible scenarios of post-guest activity the host needs to do before executing host code:
* on pre-eIBRS hardware (legacy IBRS, or nothing at all), the RSB is not touched and Linux has to do a 32-entry stuffing.
* on eIBRS hardware, VM exit with IBRS enabled, or restoring the host IBRS=1 shortly after VM exit, has a documented side effect of flushing the RSB except in this PBRSB situation where the software needs to stuff the last RSB entry "by hand".
IOW, with eIBRS supported, host RET instructions should no longer be influenced by guest behavior after the host retires a single CALL instruction.
However, if the RET instructions are "unbalanced" with CALLs after a VM exit as is the RET in #6, it might speculatively use the address for the instruction after the CALL in #3 as an RSB prediction. This is a problem since the (untrusted) guest controls this address.
Balanced CALL/RET instruction pairs such as in step #5 are not affected.
== Solution ==
The PBRSB issue affects a wide variety of Intel processors which support eIBRS. But not all of them need mitigation. Today, X86_FEATURE_RSB_VMEXIT triggers an RSB filling sequence that mitigates PBRSB. Systems setting RSB_VMEXIT need no further mitigation - i.e., eIBRS systems which enable legacy IBRS explicitly.
However, such systems (X86_FEATURE_IBRS_ENHANCED) do not set RSB_VMEXIT and most of them need a new mitigation.
Therefore, introduce a new feature flag X86_FEATURE_RSB_VMEXIT_LITE which triggers a lighter-weight PBRSB mitigation versus RSB_VMEXIT.
The lighter-weight mitigation performs a CALL instruction which is immediately followed by a speculative execution barrier (INT3). This steers speculative execution to the barrier -- just like a retpoline -- which ensures that speculation can never reach an unbalanced RET. Then, ensure this CALL is retired before continuing execution with an LFENCE.
In other words, the window of exposure is opened at VM exit where RET behavior is troublesome. While the window is open, force RSB predictions sampling for RET targets to a dead end at the INT3. Close the window with the LFENCE.
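Spelled out, the ISSUE_UNBALANCED_RET_GUARD sequence added below reduces to the following (GAS syntax, 64-bit case where $(BITS_PER_LONG/8) is 8):

  	call	1f	/* push one benign entry onto the RSB */
  	int3		/* speculation trap: a RET that consumes the
  			 * stuffed RSB entry lands here and goes nowhere */
  1:
  	add	$8, %rsp	/* undo the CALL's stack push; the RSB entry
  				 * pointing at the int3 remains */
  	lfence		/* wait for the CALL to retire before any
  			 * subsequent (possibly unbalanced) RET runs */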
There is a subset of eIBRS systems which are not vulnerable to PBRSB. Add these systems to the cpu_vuln_whitelist[] as NO_EIBRS_PBRSB. Future systems that aren't vulnerable will set ARCH_CAP_PBRSB_NO.
[ bp: Massage, incorporate review comments from Andy Cooper. ]
Signed-off-by: Daniel Sneddon <daniel.sneddon@linux.intel.com>
Co-developed-by: Pawan Gupta <pawan.kumar.gupta@linux.intel.com>
Signed-off-by: Pawan Gupta <pawan.kumar.gupta@linux.intel.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Conflict: arch/x86/include/asm/cpufeatures.h
Signed-off-by: Chen Jiahao <chenjiahao16@huawei.com>
Reviewed-by: Zhang Jianhua <chris.zjh@huawei.com>
Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>
---
 Documentation/admin-guide/hw-vuln/spectre.rst |  8 ++
 arch/x86/include/asm/cpufeatures.h            |  2 +
 arch/x86/include/asm/msr-index.h              |  4 +
 arch/x86/include/asm/nospec-branch.h          | 17 +++-
 arch/x86/kernel/cpu/bugs.c                    | 86 ++++++++++++++-----
 arch/x86/kernel/cpu/common.c                  | 12 ++-
 arch/x86/kvm/vmx/vmenter.S                    |  8 +-
 tools/arch/x86/include/asm/cpufeatures.h      |  1 +
 tools/arch/x86/include/asm/msr-index.h        |  4 +
 9 files changed, 113 insertions(+), 29 deletions(-)
diff --git a/Documentation/admin-guide/hw-vuln/spectre.rst b/Documentation/admin-guide/hw-vuln/spectre.rst index 6bd97cd50d62..7e061ed449aa 100644 --- a/Documentation/admin-guide/hw-vuln/spectre.rst +++ b/Documentation/admin-guide/hw-vuln/spectre.rst @@ -422,6 +422,14 @@ The possible values in this file are: 'RSB filling' Protection of RSB on context switch enabled ============= ===========================================
+ - EIBRS Post-barrier Return Stack Buffer (PBRSB) protection status: + + =========================== ======================================================= + 'PBRSB-eIBRS: SW sequence' CPU is affected and protection of RSB on VMEXIT enabled + 'PBRSB-eIBRS: Vulnerable' CPU is vulnerable + 'PBRSB-eIBRS: Not affected' CPU is not affected by PBRSB + =========================== ======================================================= + Full mitigation might require a microcode update from the CPU vendor. When the necessary microcode is not available, the kernel will report vulnerability. diff --git a/arch/x86/include/asm/cpufeatures.h b/arch/x86/include/asm/cpufeatures.h index bf531e883e82..b5852d41d9b9 100644 --- a/arch/x86/include/asm/cpufeatures.h +++ b/arch/x86/include/asm/cpufeatures.h @@ -297,6 +297,7 @@ #define X86_FEATURE_RETPOLINE_LFENCE (11*32+13) /* "" Use LFENCE for Spectre variant 2 */ #define X86_FEATURE_RETHUNK (11*32+14) /* "" Use REturn THUNK */ #define X86_FEATURE_UNRET (11*32+15) /* "" AMD BTB untrain return */ +#define X86_FEATURE_RSB_VMEXIT_LITE (11*32+17) /* "" Fill RSB on VM exit when EIBRS is enabled */
/* Intel-defined CPU features, CPUID level 0x00000007:1 (EAX), word 12 */ #define X86_FEATURE_AVX512_BF16 (12*32+ 5) /* AVX512 BFLOAT16 instructions */ @@ -428,5 +429,6 @@ #define X86_BUG_SRBDS X86_BUG(24) /* CPU may leak RNG bits if not mitigated */ #define X86_BUG_MMIO_STALE_DATA X86_BUG(25) /* CPU is affected by Processor MMIO Stale Data vulnerabilities */ #define X86_BUG_RETBLEED X86_BUG(26) /* CPU is affected by RETBleed */ +#define X86_BUG_EIBRS_PBRSB X86_BUG(27) /* EIBRS is vulnerable to Post Barrier RSB Predictions */
#endif /* _ASM_X86_CPUFEATURES_H */ diff --git a/arch/x86/include/asm/msr-index.h b/arch/x86/include/asm/msr-index.h index ffc09df04d41..d9c352e76850 100644 --- a/arch/x86/include/asm/msr-index.h +++ b/arch/x86/include/asm/msr-index.h @@ -148,6 +148,10 @@ * are restricted to targets in * kernel. */ +#define ARCH_CAP_PBRSB_NO BIT(24) /* + * Not susceptible to Post-Barrier + * Return Stack Buffer Predictions. + */
#define MSR_IA32_FLUSH_CMD 0x0000010b #define L1D_FLUSH BIT(0) /* diff --git a/arch/x86/include/asm/nospec-branch.h b/arch/x86/include/asm/nospec-branch.h index 32e25cac72b2..3c615488b740 100644 --- a/arch/x86/include/asm/nospec-branch.h +++ b/arch/x86/include/asm/nospec-branch.h @@ -118,13 +118,28 @@ #endif .endm
+.macro ISSUE_UNBALANCED_RET_GUARD + ANNOTATE_INTRA_FUNCTION_CALL + call .Lunbalanced_ret_guard_@ + int3 +.Lunbalanced_ret_guard_@: + add $(BITS_PER_LONG/8), %_ASM_SP + lfence +.endm + /* * A simpler FILL_RETURN_BUFFER macro. Don't make people use the CPP * monstrosity above, manually. */ -.macro FILL_RETURN_BUFFER reg:req nr:req ftr:req +.macro FILL_RETURN_BUFFER reg:req nr:req ftr:req ftr2 +.ifb \ftr2 ALTERNATIVE "jmp .Lskip_rsb_@", "", \ftr +.else + ALTERNATIVE_2 "jmp .Lskip_rsb_@", "", \ftr, "jmp .Lunbalanced_@", \ftr2 +.endif __FILL_RETURN_BUFFER(\reg,\nr,%_ASM_SP) +.Lunbalanced_@: + ISSUE_UNBALANCED_RET_GUARD .Lskip_rsb_@: .endm
diff --git a/arch/x86/kernel/cpu/bugs.c b/arch/x86/kernel/cpu/bugs.c index bc6382f5ec27..d4cb9ff639aa 100644 --- a/arch/x86/kernel/cpu/bugs.c +++ b/arch/x86/kernel/cpu/bugs.c @@ -1290,6 +1290,53 @@ static void __init spec_ctrl_disable_kernel_rrsba(void) } }
+static void __init spectre_v2_determine_rsb_fill_type_at_vmexit(enum spectre_v2_mitigation mode) +{ + /* + * Similar to context switches, there are two types of RSB attacks + * after VM exit: + * + * 1) RSB underflow + * + * 2) Poisoned RSB entry + * + * When retpoline is enabled, both are mitigated by filling/clearing + * the RSB. + * + * When IBRS is enabled, while #1 would be mitigated by the IBRS branch + * prediction isolation protections, RSB still needs to be cleared + * because of #2. Note that SMEP provides no protection here, unlike + * user-space-poisoned RSB entries. + * + * eIBRS should protect against RSB poisoning, but if the EIBRS_PBRSB + * bug is present then a LITE version of RSB protection is required, + * just a single call needs to retire before a RET is executed. + */ + switch (mode) { + case SPECTRE_V2_NONE: + return; + + case SPECTRE_V2_EIBRS_LFENCE: + case SPECTRE_V2_EIBRS: + if (boot_cpu_has_bug(X86_BUG_EIBRS_PBRSB)) { + setup_force_cpu_cap(X86_FEATURE_RSB_VMEXIT_LITE); + pr_info("Spectre v2 / PBRSB-eIBRS: Retire a single CALL on VMEXIT\n"); + } + return; + + case SPECTRE_V2_EIBRS_RETPOLINE: + case SPECTRE_V2_RETPOLINE: + case SPECTRE_V2_LFENCE: + case SPECTRE_V2_IBRS: + setup_force_cpu_cap(X86_FEATURE_RSB_VMEXIT); + pr_info("Spectre v2 / SpectreRSB : Filling RSB on VMEXIT\n"); + return; + } + + pr_warn_once("Unknown Spectre v2 mode, disabling RSB mitigation at VM exit"); + dump_stack(); +} + static void __init spectre_v2_select_mitigation(void) { enum spectre_v2_mitigation_cmd cmd = spectre_v2_parse_cmdline(); @@ -1438,28 +1485,7 @@ static void __init spectre_v2_select_mitigation(void) setup_force_cpu_cap(X86_FEATURE_RSB_CTXSW); pr_info("Spectre v2 / SpectreRSB mitigation: Filling RSB on context switch\n");
- /* - * Similar to context switches, there are two types of RSB attacks - * after vmexit: - * - * 1) RSB underflow - * - * 2) Poisoned RSB entry - * - * When retpoline is enabled, both are mitigated by filling/clearing - * the RSB. - * - * When IBRS is enabled, while #1 would be mitigated by the IBRS branch - * prediction isolation protections, RSB still needs to be cleared - * because of #2. Note that SMEP provides no protection here, unlike - * user-space-poisoned RSB entries. - * - * eIBRS, on the other hand, has RSB-poisoning protections, so it - * doesn't need RSB clearing after vmexit. - */ - if (boot_cpu_has(X86_FEATURE_RETPOLINE) || - boot_cpu_has(X86_FEATURE_KERNEL_IBRS)) - setup_force_cpu_cap(X86_FEATURE_RSB_VMEXIT); + spectre_v2_determine_rsb_fill_type_at_vmexit(mode);
/* * Retpoline protects the kernel, but doesn't protect firmware. IBRS @@ -2202,6 +2228,19 @@ static char *ibpb_state(void) return ""; }
+static char *pbrsb_eibrs_state(void) +{ + if (boot_cpu_has_bug(X86_BUG_EIBRS_PBRSB)) { + if (boot_cpu_has(X86_FEATURE_RSB_VMEXIT_LITE) || + boot_cpu_has(X86_FEATURE_RSB_VMEXIT)) + return ", PBRSB-eIBRS: SW sequence"; + else + return ", PBRSB-eIBRS: Vulnerable"; + } else { + return ", PBRSB-eIBRS: Not affected"; + } +} + static ssize_t spectre_v2_show_state(char *buf) { if (spectre_v2_enabled == SPECTRE_V2_LFENCE) @@ -2214,12 +2253,13 @@ static ssize_t spectre_v2_show_state(char *buf) spectre_v2_enabled == SPECTRE_V2_EIBRS_LFENCE) return sprintf(buf, "Vulnerable: eIBRS+LFENCE with unprivileged eBPF and SMT\n");
- return sprintf(buf, "%s%s%s%s%s%s\n", + return sprintf(buf, "%s%s%s%s%s%s%s\n", spectre_v2_strings[spectre_v2_enabled], ibpb_state(), boot_cpu_has(X86_FEATURE_USE_IBRS_FW) ? ", IBRS_FW" : "", stibp_state(), boot_cpu_has(X86_FEATURE_RSB_CTXSW) ? ", RSB filling" : "", + pbrsb_eibrs_state(), spectre_v2_module_string()); }
diff --git a/arch/x86/kernel/cpu/common.c b/arch/x86/kernel/cpu/common.c index 901352bd3b42..9fc91482e85e 100644 --- a/arch/x86/kernel/cpu/common.c +++ b/arch/x86/kernel/cpu/common.c @@ -1024,6 +1024,7 @@ static void identify_cpu_without_cpuid(struct cpuinfo_x86 *c) #define NO_SWAPGS BIT(6) #define NO_ITLB_MULTIHIT BIT(7) #define NO_SPECTRE_V2 BIT(8) +#define NO_EIBRS_PBRSB BIT(9)
#define VULNWL(vendor, family, model, whitelist) \ X86_MATCH_VENDOR_FAM_MODEL(vendor, family, model, whitelist) @@ -1064,7 +1065,7 @@ static const __initconst struct x86_cpu_id cpu_vuln_whitelist[] = {
VULNWL_INTEL(ATOM_GOLDMONT, NO_MDS | NO_L1TF | NO_SWAPGS | NO_ITLB_MULTIHIT), VULNWL_INTEL(ATOM_GOLDMONT_D, NO_MDS | NO_L1TF | NO_SWAPGS | NO_ITLB_MULTIHIT), - VULNWL_INTEL(ATOM_GOLDMONT_PLUS, NO_MDS | NO_L1TF | NO_SWAPGS | NO_ITLB_MULTIHIT), + VULNWL_INTEL(ATOM_GOLDMONT_PLUS, NO_MDS | NO_L1TF | NO_SWAPGS | NO_ITLB_MULTIHIT | NO_EIBRS_PBRSB),
/* * Technically, swapgs isn't serializing on AMD (despite it previously @@ -1074,7 +1075,9 @@ static const __initconst struct x86_cpu_id cpu_vuln_whitelist[] = { * good enough for our purposes. */
- VULNWL_INTEL(ATOM_TREMONT_D, NO_ITLB_MULTIHIT), + VULNWL_INTEL(ATOM_TREMONT, NO_EIBRS_PBRSB), + VULNWL_INTEL(ATOM_TREMONT_L, NO_EIBRS_PBRSB), + VULNWL_INTEL(ATOM_TREMONT_D, NO_ITLB_MULTIHIT | NO_EIBRS_PBRSB),
/* AMD Family 0xf - 0x12 */ VULNWL_AMD(0x0f, NO_MELTDOWN | NO_SSB | NO_L1TF | NO_MDS | NO_SWAPGS | NO_ITLB_MULTIHIT), @@ -1252,6 +1255,11 @@ static void __init cpu_set_bug_bits(struct cpuinfo_x86 *c) setup_force_cpu_bug(X86_BUG_RETBLEED); }
+ if (cpu_has(c, X86_FEATURE_IBRS_ENHANCED) && + !cpu_matches(cpu_vuln_whitelist, NO_EIBRS_PBRSB) && + !(ia32_cap & ARCH_CAP_PBRSB_NO)) + setup_force_cpu_bug(X86_BUG_EIBRS_PBRSB); + if (cpu_matches(cpu_vuln_whitelist, NO_MELTDOWN)) return;
diff --git a/arch/x86/kvm/vmx/vmenter.S b/arch/x86/kvm/vmx/vmenter.S index 857fa0fc49fa..982138bebb70 100644 --- a/arch/x86/kvm/vmx/vmenter.S +++ b/arch/x86/kvm/vmx/vmenter.S @@ -197,11 +197,13 @@ SYM_INNER_LABEL(vmx_vmexit, SYM_L_GLOBAL) * entries and (in some cases) RSB underflow. * * eIBRS has its own protection against poisoned RSB, so it doesn't - * need the RSB filling sequence. But it does need to be enabled - * before the first unbalanced RET. + * need the RSB filling sequence. But it does need to be enabled, and a + * single call to retire, before the first unbalanced RET. */
- FILL_RETURN_BUFFER %_ASM_CX, RSB_CLEAR_LOOPS, X86_FEATURE_RSB_VMEXIT + FILL_RETURN_BUFFER %_ASM_CX, RSB_CLEAR_LOOPS, X86_FEATURE_RSB_VMEXIT,\ + X86_FEATURE_RSB_VMEXIT_LITE +
pop %_ASM_ARG2 /* @flags */ pop %_ASM_ARG1 /* @vmx */ diff --git a/tools/arch/x86/include/asm/cpufeatures.h b/tools/arch/x86/include/asm/cpufeatures.h index 54ba20492ad1..ec53f52a06a5 100644 --- a/tools/arch/x86/include/asm/cpufeatures.h +++ b/tools/arch/x86/include/asm/cpufeatures.h @@ -296,6 +296,7 @@ #define X86_FEATURE_RETPOLINE_LFENCE (11*32+13) /* "" Use LFENCE for Spectre variant 2 */ #define X86_FEATURE_RETHUNK (11*32+14) /* "" Use REturn THUNK */ #define X86_FEATURE_UNRET (11*32+15) /* "" AMD BTB untrain return */ +#define X86_FEATURE_RSB_VMEXIT_LITE (11*32+17) /* "" Fill RSB on VM-Exit when EIBRS is enabled */
/* Intel-defined CPU features, CPUID level 0x00000007:1 (EAX), word 12 */ #define X86_FEATURE_AVX512_BF16 (12*32+ 5) /* AVX512 BFLOAT16 instructions */ diff --git a/tools/arch/x86/include/asm/msr-index.h b/tools/arch/x86/include/asm/msr-index.h index 53373ca3b487..b8954262d767 100644 --- a/tools/arch/x86/include/asm/msr-index.h +++ b/tools/arch/x86/include/asm/msr-index.h @@ -148,6 +148,10 @@ * are restricted to targets in * kernel. */ +#define ARCH_CAP_PBRSB_NO BIT(24) /* + * Not susceptible to Post-Barrier + * Return Stack Buffer Predictions. + */
#define MSR_IA32_FLUSH_CMD 0x0000010b #define L1D_FLUSH BIT(0) /*
From: Pawan Gupta <pawan.kumar.gupta@linux.intel.com>

stable inclusion
from stable-v5.10.136
commit 1bea03b44ea2267988cce064f5887b01d421b28c
category: bugfix
bugzilla: https://gitee.com/src-openeuler/kernel/issues/I5N1SO
CVE: CVE-2022-26373
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?id=...
--------------------------------
commit ba6e31af2be96c4d0536f2152ed6f7b6c11bca47 upstream.
The RSB fill sequence has no protection against misprediction of the conditional branch at the end of the sequence. The CPU can speculatively execute code immediately after the sequence, while RSB filling hasn't completed yet.
  #define __FILL_RETURN_BUFFER(reg, nr, sp)	\
  	mov	$(nr/2), reg;			\
  771:						\
  	ANNOTATE_INTRA_FUNCTION_CALL;		\
  	call	772f;				\
  773:	/* speculation trap */			\
  	UNWIND_HINT_EMPTY;			\
  	pause;					\
  	lfence;					\
  	jmp	773b;				\
  772:						\
  	ANNOTATE_INTRA_FUNCTION_CALL;		\
  	call	774f;				\
  775:	/* speculation trap */			\
  	UNWIND_HINT_EMPTY;			\
  	pause;					\
  	lfence;					\
  	jmp	775b;				\
  774:						\
  	add	$(BITS_PER_LONG/8) * 2, sp;	\
  	dec	reg;				\
  	jnz	771b;	<----- CPU can mispredict here.
Before RSB is filled, RETs that come in program order after this macro can be executed speculatively, making them vulnerable to RSB-based attacks.
Mitigate it by adding an LFENCE after the conditional branch to prevent speculation while RSB is being filled.
Suggested-by: Andrew Cooper <andrew.cooper3@citrix.com>
Signed-off-by: Pawan Gupta <pawan.kumar.gupta@linux.intel.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Signed-off-by: Daniel Sneddon <daniel.sneddon@linux.intel.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Chen Jiahao <chenjiahao16@huawei.com>
Reviewed-by: Zhang Jianhua <chris.zjh@huawei.com>
Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>
---
 arch/x86/include/asm/nospec-branch.h | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)
diff --git a/arch/x86/include/asm/nospec-branch.h b/arch/x86/include/asm/nospec-branch.h
index 3c615488b740..92771640706c 100644
--- a/arch/x86/include/asm/nospec-branch.h
+++ b/arch/x86/include/asm/nospec-branch.h
@@ -60,7 +60,9 @@
 774:						\
 	add	$(BITS_PER_LONG/8) * 2, sp;	\
 	dec	reg;				\
-	jnz	771b;
+	jnz	771b;				\
+	/* barrier for jnz misprediction */	\
+	lfence;
#ifdef __ASSEMBLY__
From: Yu Kuai <yukuai3@huawei.com>

hulk inclusion
category: bugfix
bugzilla: 186851, https://gitee.com/openeuler/kernel/issues/I5L01G
CVE: NA
--------------------------------
Currently, in virtio_scsi, if 'bd->last' is not set to true while dispatching a request, that I/O will stay in the driver's queue, and the driver will wait for the block layer to dispatch more requests. However, if the block layer then fails to dispatch more requests, it should call commit_rqs to inform the driver.
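For context, this is the contract as seen from a driver's hooks; a minimal sketch with a hypothetical demo driver, not the virtio_scsi code itself:

  #include <linux/blk-mq.h>

  static void demo_kick_hardware(struct blk_mq_hw_ctx *hctx)
  {
          /* ring the device doorbell for everything queued so far */
  }

  static blk_status_t demo_queue_rq(struct blk_mq_hw_ctx *hctx,
                                    const struct blk_mq_queue_data *bd)
  {
          /* ... queue bd->rq internally ... */
          if (bd->last)
                  demo_kick_hardware(hctx);
          /* else: rely on a later rq with last == true, or on a
           * ->commit_rqs() call, to actually start the I/O */
          return BLK_STS_OK;
  }

  /* called when the block layer signalled "more requests coming" via
   * bd->last == false but then failed to deliver them */
  static void demo_commit_rqs(struct blk_mq_hw_ctx *hctx)
  {
          demo_kick_hardware(hctx);
  }

  static const struct blk_mq_ops demo_mq_ops = {
          .queue_rq   = demo_queue_rq,
          .commit_rqs = demo_commit_rqs,
  };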
There is a path in blk_mq_try_issue_list_directly() where commit_rqs won't be called:
 // assume that queue_depth is set to 1, list contains two rq
 blk_mq_try_issue_list_directly
  blk_mq_request_issue_directly
  // dispatch first rq
  // last is false
   __blk_mq_try_issue_directly
    blk_mq_get_dispatch_budget
    // succeed to get first budget
    __blk_mq_issue_directly
     scsi_queue_rq
      cmd->flags |= SCMD_LAST
       virtscsi_queuecommand
        kick = (sc->flags & SCMD_LAST) != 0
        // kick is false, first rq won't issue to disk
  queued++

  blk_mq_request_issue_directly
  // dispatch second rq
   __blk_mq_try_issue_directly
    blk_mq_get_dispatch_budget
    // failed to get second budget
   ret == BLK_STS_RESOURCE
    blk_mq_request_bypass_insert
  // errors is still 0

  if (!list_empty(list) || errors && ...)
  // won't pass, commit_rqs won't be called
In this situation, the first rq relies on the second rq being dispatched, while the second rq relies on the first rq completing, so they will both hang.
Fix the problem in blk_mq_try_issue_list_directly() by also treating 'BLK_STS_*RESOURCE' as 'errors', since it means the request was not queued successfully.
The same problem exists in blk_mq_dispatch_rq_list(), where 'BLK_STS_*RESOURCE' can't be treated as 'errors'; fix it there by calling commit_rqs if queue_rq returns 'BLK_STS_*RESOURCE'.
Fixes: d666ba98f849 ("blk-mq: add mq_ops->commit_rqs()")
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Link: https://lore.kernel.org/all/20220726122224.1790882-1-yukuai1@huaweicloud.com...
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Jason Yan <yanaijie@huawei.com>
Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>
---
 block/blk-mq.c | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)
diff --git a/block/blk-mq.c b/block/blk-mq.c
index 1941ffc4db85..b9827b3d3f63 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -1484,7 +1484,8 @@ bool blk_mq_dispatch_rq_list(struct blk_mq_hw_ctx *hctx, struct list_head *list,
 	/* If we didn't flush the entire list, we could have told the driver
 	 * there was more coming, but that turned out to be a lie.
 	 */
-	if ((!list_empty(list) || errors) && q->mq_ops->commit_rqs && queued)
+	if ((!list_empty(list) || errors || needs_resource ||
+	     ret == BLK_STS_DEV_RESOURCE) && q->mq_ops->commit_rqs && queued)
 		q->mq_ops->commit_rqs(hctx);
 	/*
 	 * Any items that need requeuing? Stuff them into hctx->dispatch,
@@ -2224,6 +2225,7 @@ void blk_mq_try_issue_list_directly(struct blk_mq_hw_ctx *hctx,
 		list_del_init(&rq->queuelist);
 		ret = blk_mq_request_issue_directly(rq, list_empty(list));
 		if (ret != BLK_STS_OK) {
+			errors++;
 			if (ret == BLK_STS_RESOURCE ||
 			    ret == BLK_STS_DEV_RESOURCE) {
 				blk_mq_request_bypass_insert(rq, false,
@@ -2231,7 +2233,6 @@ void blk_mq_try_issue_list_directly(struct blk_mq_hw_ctx *hctx,
 				break;
 			}
 			blk_mq_end_request(rq, ret);
-			errors++;
 		} else
 			queued++;
 	}
From: Wenchao Hao <haowenchao@huawei.com>

mainline inclusion
from mainline-v5.18-rc1
commit ad515cada7dac3cdf5e1ad77a0ed696f5f34e0ab
category: bugfix
bugzilla: 187381, https://gitee.com/openeuler/kernel/issues/I5LBBP
CVE: NA
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i...
--------------------------------
- iscsi_alloc_conn(): Allocate and initialize iscsi_cls_conn
- iscsi_add_conn(): Expose iscsi_cls_conn to userspace via sysfs
- iscsi_remove_conn(): Remove iscsi_cls_conn from sysfs
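Taken together, a driver would use the helpers roughly like this (a sketch only; the 'put_conn' error label is hypothetical):

  struct iscsi_cls_conn *conn;

  conn = iscsi_alloc_conn(cls_session, dd_size, cid);
  if (!conn)
          return NULL;

  /* initialize conn->dd_data and everything the sysfs attributes may
   * touch *before* the connection becomes visible to userspace */

  if (iscsi_add_conn(conn))
          goto put_conn;

  ...

  /* teardown path */
  iscsi_remove_conn(conn);        /* hide from sysfs, wait for readers */
  iscsi_put_conn(conn);           /* drop the final reference */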
Link: https://lore.kernel.org/r/20220310015759.3296841-2-haowenchao@huawei.com
Reviewed-by: Mike Christie <michael.christie@oracle.com>
Signed-off-by: Wenchao Hao <haowenchao@huawei.com>
Signed-off-by: Wu Bo <wubo40@huawei.com>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
Conflict: commit 1a709181c1a0 ("[Huawei] scsi: iscsi: fix kabi broken in
struct iscsi_transport") introduces some conflicts.
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Reviewed-by: Jason Yan <yanaijie@huawei.com>
Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>
---
 drivers/scsi/scsi_transport_iscsi.c | 101 ++++++++++++++++++++++++++++
 include/scsi/scsi_transport_iscsi.h |   4 ++
 2 files changed, 105 insertions(+)
diff --git a/drivers/scsi/scsi_transport_iscsi.c b/drivers/scsi/scsi_transport_iscsi.c index e23cb62ab216..b5a8d274dec8 100644 --- a/drivers/scsi/scsi_transport_iscsi.c +++ b/drivers/scsi/scsi_transport_iscsi.c @@ -2384,6 +2384,107 @@ void iscsi_free_session(struct iscsi_cls_session *session) } EXPORT_SYMBOL_GPL(iscsi_free_session);
+/** + * iscsi_alloc_conn - alloc iscsi class connection + * @session: iscsi cls session + * @dd_size: private driver data size + * @cid: connection id + */ +struct iscsi_cls_conn * +iscsi_alloc_conn(struct iscsi_cls_session *session, int dd_size, uint32_t cid) +{ + struct iscsi_transport *transport = session->transport; + struct iscsi_cls_conn *conn; + struct iscsi_cls_conn_wrapper *conn_wrapper; + + conn_wrapper = kzalloc(sizeof(*conn_wrapper) + dd_size, GFP_KERNEL); + if (!conn_wrapper) + return NULL; + + conn = &conn_wrapper->conn; + if (dd_size) + conn->dd_data = &conn_wrapper[1]; + + mutex_init(&conn->ep_mutex); + spin_lock_init(&conn_wrapper->lock); + INIT_LIST_HEAD(&conn->conn_list); + INIT_WORK(&conn_wrapper->cleanup_work, iscsi_cleanup_conn_work_fn); + conn->transport = transport; + conn->cid = cid; + WRITE_ONCE(conn->state, ISCSI_CONN_DOWN); + + /* this is released in the dev's release function */ + if (!get_device(&session->dev)) + goto free_conn; + + dev_set_name(&conn->dev, "connection%d:%u", session->sid, cid); + device_initialize(&conn->dev); + conn->dev.parent = &session->dev; + conn->dev.release = iscsi_conn_release; + + return conn; + +free_conn: + kfree(conn); + return NULL; +} +EXPORT_SYMBOL_GPL(iscsi_alloc_conn); + +/** + * iscsi_add_conn - add iscsi class connection + * @conn: iscsi cls connection + * + * This will expose iscsi_cls_conn to sysfs so make sure the related + * resources for sysfs attributes are initialized before calling this. + */ +int iscsi_add_conn(struct iscsi_cls_conn *conn) +{ + int err; + unsigned long flags; + struct iscsi_cls_session *session = iscsi_dev_to_session(conn->dev.parent); + + err = device_add(&conn->dev); + if (err) { + iscsi_cls_session_printk(KERN_ERR, session, "could not " + "register connection's dev\n"); + return err; + } + err = transport_register_device(&conn->dev); + if (err) { + iscsi_cls_session_printk(KERN_ERR, session, "could not " + "register transport's dev\n"); + device_del(&conn->dev); + return err; + } + + spin_lock_irqsave(&connlock, flags); + list_add(&conn->conn_list, &connlist); + spin_unlock_irqrestore(&connlock, flags); + + return 0; +} +EXPORT_SYMBOL_GPL(iscsi_add_conn); + +/** + * iscsi_remove_conn - remove iscsi class connection from sysfs + * @conn: iscsi cls connection + * + * Remove iscsi_cls_conn from sysfs, and wait for previous + * read/write of iscsi_cls_conn's attributes in sysfs to finish. 
+ */ +void iscsi_remove_conn(struct iscsi_cls_conn *conn) +{ + unsigned long flags; + + spin_lock_irqsave(&connlock, flags); + list_del(&conn->conn_list); + spin_unlock_irqrestore(&connlock, flags); + + transport_unregister_device(&conn->dev); + device_del(&conn->dev); +} +EXPORT_SYMBOL_GPL(iscsi_remove_conn); + /** * iscsi_create_conn - create iscsi class connection * @session: iscsi cls session diff --git a/include/scsi/scsi_transport_iscsi.h b/include/scsi/scsi_transport_iscsi.h index f9297176dcb8..1f7574d89822 100644 --- a/include/scsi/scsi_transport_iscsi.h +++ b/include/scsi/scsi_transport_iscsi.h @@ -468,6 +468,10 @@ extern struct iscsi_cls_session *iscsi_create_session(struct Scsi_Host *shost, unsigned int target_id); extern void iscsi_remove_session(struct iscsi_cls_session *session); extern void iscsi_free_session(struct iscsi_cls_session *session); +extern struct iscsi_cls_conn *iscsi_alloc_conn(struct iscsi_cls_session *sess, + int dd_size, uint32_t cid); +extern int iscsi_add_conn(struct iscsi_cls_conn *conn); +extern void iscsi_remove_conn(struct iscsi_cls_conn *conn); extern struct iscsi_cls_conn *iscsi_create_conn(struct iscsi_cls_session *sess, int dd_size, uint32_t cid); extern void iscsi_put_conn(struct iscsi_cls_conn *conn);
From: Wenchao Hao <haowenchao@huawei.com>

mainline inclusion
from mainline-v5.18-rc1
commit 7dae459f5e56a89ab01413ae055595c982713349
category: bugfix
bugzilla: 187381, https://gitee.com/openeuler/kernel/issues/I5LBBP
CVE: NA
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i...
--------------------------------
iscsi_create_conn() exposed iscsi_cls_conn to sysfs prior to initialization of iscsi_conn's dd_data. When userspace tried to access an attribute such as the connect address, a NULL pointer dereference was observed.
Do not add iscsi_cls_conn to sysfs until it has been initialized. Remove iscsi_create_conn() since it is no longer used.
Link: https://lore.kernel.org/r/20220310015759.3296841-3-haowenchao@huawei.com
Reviewed-by: Mike Christie <michael.christie@oracle.com>
Signed-off-by: Wenchao Hao <haowenchao@huawei.com>
Signed-off-by: Wu Bo <wubo40@huawei.com>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
Conflict: iscsi_create_conn() is not removed.
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Reviewed-by: Jason Yan <yanaijie@huawei.com>
Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>
---
 drivers/scsi/libiscsi.c | 13 +++++++++++--
 1 file changed, 11 insertions(+), 2 deletions(-)
diff --git a/drivers/scsi/libiscsi.c b/drivers/scsi/libiscsi.c index e361856509d5..bbf2ca613dae 100644 --- a/drivers/scsi/libiscsi.c +++ b/drivers/scsi/libiscsi.c @@ -3032,8 +3032,9 @@ iscsi_conn_setup(struct iscsi_cls_session *cls_session, int dd_size, struct iscsi_conn *conn; struct iscsi_cls_conn *cls_conn; char *data; + int err;
- cls_conn = iscsi_create_conn(cls_session, sizeof(*conn) + dd_size, + cls_conn = iscsi_alloc_conn(cls_session, sizeof(*conn) + dd_size, conn_idx); if (!cls_conn) return NULL; @@ -3073,13 +3074,21 @@ iscsi_conn_setup(struct iscsi_cls_session *cls_session, int dd_size,
init_waitqueue_head(&session->ehwait);
+ err = iscsi_add_conn(cls_conn); + if (err) + goto login_task_add_dev_fail; + return cls_conn;
+login_task_add_dev_fail: + free_pages((unsigned long) conn->data, + get_order(ISCSI_DEF_MAX_RECV_SEG_LEN)); + login_task_data_alloc_fail: kfifo_in(&session->cmdpool.queue, (void*)&conn->login_task, sizeof(void*)); login_task_alloc_fail: - iscsi_destroy_conn(cls_conn); + iscsi_put_conn(cls_conn); return NULL; } EXPORT_SYMBOL_GPL(iscsi_conn_setup);
From: Wenchao Hao <haowenchao@huawei.com>

mainline inclusion
from mainline-v5.18-rc1
commit 8709c323091be019f76a49cf783052a5636aca85
category: bugfix
bugzilla: 187381, https://gitee.com/openeuler/kernel/issues/I5LBBP
CVE: NA
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i...
--------------------------------
Commit 1b8d0300a3e9 ("scsi: libiscsi: Fix UAF in iscsi_conn_get_param()/iscsi_conn_teardown()") fixed a UAF in iscsi_conn_get_param() and introduced two tmp_xxx variables.
We can fix this UAF more gracefully with the help of device_del(). Calling iscsi_remove_conn() at the beginning of iscsi_conn_teardown() makes the iscsi_cls_conn invisible to userspace, so the memory can then be freed safely.
Remove iscsi_destroy_conn() since it is no longer used.
Link: https://lore.kernel.org/r/20220310015759.3296841-4-haowenchao@huawei.com
Reviewed-by: Mike Christie <michael.christie@oracle.com>
Signed-off-by: Wenchao Hao <haowenchao@huawei.com>
Signed-off-by: Wu Bo <wubo40@huawei.com>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
Conflict: iscsi_destroy_conn() is not removed.
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Reviewed-by: Jason Yan <yanaijie@huawei.com>
Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>
---
 drivers/scsi/libiscsi.c             | 10 +++++-----
 drivers/scsi/scsi_transport_iscsi.c |  6 +++++-
 2 files changed, 10 insertions(+), 6 deletions(-)
diff --git a/drivers/scsi/libiscsi.c b/drivers/scsi/libiscsi.c index bbf2ca613dae..176842a869f1 100644 --- a/drivers/scsi/libiscsi.c +++ b/drivers/scsi/libiscsi.c @@ -3104,8 +3104,8 @@ void iscsi_conn_teardown(struct iscsi_cls_conn *cls_conn) { struct iscsi_conn *conn = cls_conn->dd_data; struct iscsi_session *session = conn->session; - char *tmp_persistent_address = conn->persistent_address; - char *tmp_local_ipaddr = conn->local_ipaddr; + + iscsi_remove_conn(cls_conn);
del_timer_sync(&conn->transport_timer);
@@ -3127,6 +3127,8 @@ void iscsi_conn_teardown(struct iscsi_cls_conn *cls_conn) spin_lock_bh(&session->frwd_lock); free_pages((unsigned long) conn->data, get_order(ISCSI_DEF_MAX_RECV_SEG_LEN)); + kfree(conn->persistent_address); + kfree(conn->local_ipaddr); /* regular RX path uses back_lock */ spin_lock_bh(&session->back_lock); kfifo_in(&session->cmdpool.queue, (void*)&conn->login_task, @@ -3137,9 +3139,7 @@ void iscsi_conn_teardown(struct iscsi_cls_conn *cls_conn) spin_unlock_bh(&session->frwd_lock); mutex_unlock(&session->eh_mutex);
- iscsi_destroy_conn(cls_conn); - kfree(tmp_persistent_address); - kfree(tmp_local_ipaddr); + iscsi_put_conn(cls_conn); } EXPORT_SYMBOL_GPL(iscsi_conn_teardown);
diff --git a/drivers/scsi/scsi_transport_iscsi.c b/drivers/scsi/scsi_transport_iscsi.c index b5a8d274dec8..a213362524c9 100644 --- a/drivers/scsi/scsi_transport_iscsi.c +++ b/drivers/scsi/scsi_transport_iscsi.c @@ -2168,7 +2168,11 @@ static int iscsi_iter_destroy_conn_fn(struct device *dev, void *data) { if (!iscsi_is_conn_dev(dev)) return 0; - return iscsi_destroy_conn(iscsi_dev_to_conn(dev)); + + iscsi_remove_conn(iscsi_dev_to_conn(dev)); + iscsi_put_conn(iscsi_dev_to_conn(dev)); + + return 0; }
void iscsi_remove_session(struct iscsi_cls_session *session)
From: Yixing Liu <liuyixing1@huawei.com>

mainline inclusion
from mainline-v5.18-rc1
commit 70f92521584f1d1e8268311ee84413307b0fdea8
category: feature
bugzilla: https://gitee.com/openeuler/kernel/issues/I5IZO5
CVE: NA
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma.git/commit/?id=70f...
----------------------------------------------------------------------
Before the MPT is destroyed, the reserved loopback QPs send loopback IOs (one write operation per SL). Completion of these loopback IOs guarantees that there are no outstanding requests left in the MPT, at which point it is safe to destroy the MPT.
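Condensed from the free_mr_send_cmd_to_hw() logic added below, the drain boils down to (a sketch with the variables as in the diff; error handling and the timeout omitted):

  /* one reserved QP per SL; each posts a single loopback RDMA write */
  for (i = 0; i < ARRAY_SIZE(free_mr->rsv_qp); i++)
          free_mr_post_send_lp_wqe(to_hr_qp(free_mr->rsv_qp[i]));

  /* once every loopback write has produced a completion, no request
   * can still be outstanding in the MPT, so destroying it is safe */
  cqe_cnt = ARRAY_SIZE(free_mr->rsv_qp);
  while (cqe_cnt > 0)
          cqe_cnt -= hns_roce_v2_poll_cq(free_mr->rsv_cq, cqe_cnt, wc);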
Link: https://lore.kernel.org/r/20220310042835.38634-1-liangwenpeng@huawei.com
Signed-off-by: Yixing Liu <liuyixing1@huawei.com>
Signed-off-by: Wenpeng Liang <liangwenpeng@huawei.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Signed-off-by: Zhengfeng Luo <luozhengfeng@h-partners.com>
Reviewed-by: Yangyang Li <liyangyang20@huawei.com>
Reviewed-by: Wei Yongjun <weiyongjun1@huawei.com>
Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>
---
 drivers/infiniband/hw/hns/hns_roce_device.h |   2 +
 drivers/infiniband/hw/hns/hns_roce_hw_v2.c  | 311 +++++++++++++++++++-
 drivers/infiniband/hw/hns/hns_roce_hw_v2.h  |  20 ++
 drivers/infiniband/hw/hns/hns_roce_mr.c     |   6 +-
 4 files changed, 335 insertions(+), 4 deletions(-)
diff --git a/drivers/infiniband/hw/hns/hns_roce_device.h b/drivers/infiniband/hw/hns/hns_roce_device.h index f600475dd1fc..2d29221055be 100644 --- a/drivers/infiniband/hw/hns/hns_roce_device.h +++ b/drivers/infiniband/hw/hns/hns_roce_device.h @@ -627,6 +627,7 @@ struct hns_roce_qp { u32 next_sge; enum ib_mtu path_mtu; u32 max_inline_data; + u8 free_mr_en;
/* 0: flush needed, 1: unneeded */ unsigned long flush_flag; @@ -894,6 +895,7 @@ struct hns_roce_hw { enum ib_qp_state new_state); int (*qp_flow_control_init)(struct hns_roce_dev *hr_dev, struct hns_roce_qp *hr_qp); + void (*dereg_mr)(struct hns_roce_dev *hr_dev); int (*init_eq)(struct hns_roce_dev *hr_dev); void (*cleanup_eq)(struct hns_roce_dev *hr_dev); int (*write_srqc)(struct hns_roce_srq *srq, void *mb_buf); diff --git a/drivers/infiniband/hw/hns/hns_roce_hw_v2.c b/drivers/infiniband/hw/hns/hns_roce_hw_v2.c index 045dde54974d..7eda0f7a12cd 100644 --- a/drivers/infiniband/hw/hns/hns_roce_hw_v2.c +++ b/drivers/infiniband/hw/hns/hns_roce_hw_v2.c @@ -2616,6 +2616,194 @@ static void free_dip_list(struct hns_roce_dev *hr_dev) spin_unlock_irqrestore(&hr_dev->dip_list_lock, flags); }
+static void free_mr_exit(struct hns_roce_dev *hr_dev) +{ + struct hns_roce_v2_priv *priv = hr_dev->priv; + struct hns_roce_v2_free_mr *free_mr = &priv->free_mr; + int ret; + int i; + + for (i = 0; i < ARRAY_SIZE(free_mr->rsv_qp); i++) { + if (free_mr->rsv_qp[i]) { + ret = ib_destroy_qp(free_mr->rsv_qp[i]); + if (ret) + ibdev_err(&hr_dev->ib_dev, + "failed to destroy qp in free mr.\n"); + + free_mr->rsv_qp[i] = NULL; + } + } + + if (free_mr->rsv_cq) { + ib_destroy_cq(free_mr->rsv_cq); + free_mr->rsv_cq = NULL; + } + + if (free_mr->rsv_pd) { + ib_dealloc_pd(free_mr->rsv_pd); + free_mr->rsv_pd = NULL; + } +} + +static int free_mr_alloc_res(struct hns_roce_dev *hr_dev) +{ + struct hns_roce_v2_priv *priv = hr_dev->priv; + struct hns_roce_v2_free_mr *free_mr = &priv->free_mr; + struct ib_device *ibdev = &hr_dev->ib_dev; + struct ib_cq_init_attr cq_init_attr = {}; + struct ib_qp_init_attr qp_init_attr = {}; + struct ib_pd *pd; + struct ib_cq *cq; + struct ib_qp *qp; + int ret; + int i; + + pd = ib_alloc_pd(ibdev, 0); + if (IS_ERR(pd)) { + ibdev_err(ibdev, "failed to create pd for free mr.\n"); + return PTR_ERR(pd); + } + free_mr->rsv_pd = pd; + + cq_init_attr.cqe = HNS_ROCE_FREE_MR_USED_CQE_NUM; + cq = ib_create_cq(ibdev, NULL, NULL, NULL, &cq_init_attr); + if (IS_ERR(cq)) { + ibdev_err(ibdev, "failed to create cq for free mr.\n"); + ret = PTR_ERR(cq); + goto create_failed; + } + free_mr->rsv_cq = cq; + + qp_init_attr.qp_type = IB_QPT_RC; + qp_init_attr.sq_sig_type = IB_SIGNAL_ALL_WR; + qp_init_attr.send_cq = free_mr->rsv_cq; + qp_init_attr.recv_cq = free_mr->rsv_cq; + for (i = 0; i < ARRAY_SIZE(free_mr->rsv_qp); i++) { + qp_init_attr.cap.max_send_wr = HNS_ROCE_FREE_MR_USED_SQWQE_NUM; + qp_init_attr.cap.max_send_sge = HNS_ROCE_FREE_MR_USED_SQSGE_NUM; + qp_init_attr.cap.max_recv_wr = HNS_ROCE_FREE_MR_USED_RQWQE_NUM; + qp_init_attr.cap.max_recv_sge = HNS_ROCE_FREE_MR_USED_RQSGE_NUM; + + qp = ib_create_qp(free_mr->rsv_pd, &qp_init_attr); + if (IS_ERR(qp)) { + ibdev_err(ibdev, "failed to create qp for free mr.\n"); + ret = PTR_ERR(qp); + goto create_failed; + } + + free_mr->rsv_qp[i] = qp; + } + + return 0; + +create_failed: + free_mr_exit(hr_dev); + + return ret; +} + +static int free_mr_modify_rsv_qp(struct hns_roce_dev *hr_dev, + struct ib_qp_attr *attr, int sl_num) +{ + struct hns_roce_v2_priv *priv = hr_dev->priv; + struct hns_roce_v2_free_mr *free_mr = &priv->free_mr; + struct ib_device *ibdev = &hr_dev->ib_dev; + struct hns_roce_qp *hr_qp; + int loopback; + int mask; + int ret; + + hr_qp = to_hr_qp(free_mr->rsv_qp[sl_num]); + hr_qp->free_mr_en = 1; + + mask = IB_QP_STATE | IB_QP_PKEY_INDEX | IB_QP_PORT | IB_QP_ACCESS_FLAGS; + attr->qp_state = IB_QPS_INIT; + attr->port_num = 1; + attr->qp_access_flags = IB_ACCESS_REMOTE_WRITE; + ret = ib_modify_qp(&hr_qp->ibqp, attr, mask); + if (ret) { + ibdev_err(ibdev, "failed to modify qp to init, ret = %d.\n", + ret); + return ret; + } + + loopback = hr_dev->loop_idc; + /* Set qpc lbi = 1 incidate loopback IO */ + hr_dev->loop_idc = 1; + + mask = IB_QP_STATE | IB_QP_AV | IB_QP_PATH_MTU | IB_QP_DEST_QPN | + IB_QP_RQ_PSN | IB_QP_MAX_DEST_RD_ATOMIC | IB_QP_MIN_RNR_TIMER; + attr->qp_state = IB_QPS_RTR; + attr->ah_attr.type = RDMA_AH_ATTR_TYPE_ROCE; + attr->path_mtu = IB_MTU_256; + attr->dest_qp_num = hr_qp->qpn; + attr->rq_psn = HNS_ROCE_FREE_MR_USED_PSN; + + rdma_ah_set_sl(&attr->ah_attr, (u8)sl_num); + + ret = ib_modify_qp(&hr_qp->ibqp, attr, mask); + hr_dev->loop_idc = loopback; + if (ret) { + ibdev_err(ibdev, "failed to modify qp to rtr, ret = %d.\n", + 
ret); + return ret; + } + + mask = IB_QP_STATE | IB_QP_SQ_PSN | IB_QP_RETRY_CNT | IB_QP_TIMEOUT | + IB_QP_RNR_RETRY | IB_QP_MAX_QP_RD_ATOMIC; + attr->qp_state = IB_QPS_RTS; + attr->sq_psn = HNS_ROCE_FREE_MR_USED_PSN; + attr->retry_cnt = HNS_ROCE_FREE_MR_USED_QP_RETRY_CNT; + attr->timeout = HNS_ROCE_FREE_MR_USED_QP_TIMEOUT; + ret = ib_modify_qp(&hr_qp->ibqp, attr, mask); + if (ret) + ibdev_err(ibdev, "failed to modify qp to rts, ret = %d.\n", + ret); + + return ret; +} + +static int free_mr_modify_qp(struct hns_roce_dev *hr_dev) +{ + struct hns_roce_v2_priv *priv = hr_dev->priv; + struct hns_roce_v2_free_mr *free_mr = &priv->free_mr; + struct ib_qp_attr attr = {}; + int ret; + int i; + + rdma_ah_set_grh(&attr.ah_attr, NULL, 0, 0, 1, 0); + rdma_ah_set_static_rate(&attr.ah_attr, 3); + rdma_ah_set_port_num(&attr.ah_attr, 1); + + for (i = 0; i < ARRAY_SIZE(free_mr->rsv_qp); i++) { + ret = free_mr_modify_rsv_qp(hr_dev, &attr, i); + if (ret) + return ret; + } + + return 0; +} + +static int free_mr_init(struct hns_roce_dev *hr_dev) +{ + int ret; + + ret = free_mr_alloc_res(hr_dev); + if (ret) + return ret; + + ret = free_mr_modify_qp(hr_dev); + if (ret) + goto err_modify_qp; + + return 0; + +err_modify_qp: + free_mr_exit(hr_dev); + + return ret; +} + static int get_hem_table(struct hns_roce_dev *hr_dev) { unsigned int qpc_count; @@ -3160,6 +3348,98 @@ static int hns_roce_v2_mw_write_mtpt(void *mb_buf, struct hns_roce_mw *mw) return 0; }
+static int free_mr_post_send_lp_wqe(struct hns_roce_qp *hr_qp) +{ + struct hns_roce_dev *hr_dev = to_hr_dev(hr_qp->ibqp.device); + struct ib_device *ibdev = &hr_dev->ib_dev; + const struct ib_send_wr *bad_wr; + struct ib_rdma_wr rdma_wr = {}; + struct ib_send_wr *send_wr; + int ret; + + send_wr = &rdma_wr.wr; + send_wr->opcode = IB_WR_RDMA_WRITE; + + ret = hns_roce_v2_post_send(&hr_qp->ibqp, send_wr, &bad_wr); + if (ret) { + ibdev_err(ibdev, "failed to post wqe for free mr, ret = %d.\n", + ret); + return ret; + } + + return 0; +} + +static int hns_roce_v2_poll_cq(struct ib_cq *ibcq, int num_entries, + struct ib_wc *wc); + +static void free_mr_send_cmd_to_hw(struct hns_roce_dev *hr_dev) +{ + struct hns_roce_v2_priv *priv = hr_dev->priv; + struct hns_roce_v2_free_mr *free_mr = &priv->free_mr; + struct ib_wc wc[ARRAY_SIZE(free_mr->rsv_qp)]; + struct ib_device *ibdev = &hr_dev->ib_dev; + struct hns_roce_qp *hr_qp; + unsigned long end; + int cqe_cnt = 0; + int npolled; + int ret; + int i; + + /* + * If the device initialization is not complete or in the uninstall + * process, then there is no need to execute free mr. + */ + if (priv->handle->rinfo.reset_state == HNS_ROCE_STATE_RST_INIT || + priv->handle->rinfo.instance_state == HNS_ROCE_STATE_INIT || + hr_dev->state == HNS_ROCE_DEVICE_STATE_UNINIT) + return; + + mutex_lock(&free_mr->mutex); + + for (i = 0; i < ARRAY_SIZE(free_mr->rsv_qp); i++) { + hr_qp = to_hr_qp(free_mr->rsv_qp[i]); + + ret = free_mr_post_send_lp_wqe(hr_qp); + if (ret) { + ibdev_err(ibdev, + "failed to send wqe (qp:0x%lx) for free mr, ret = %d.\n", + hr_qp->qpn, ret); + break; + } + + cqe_cnt++; + } + + end = msecs_to_jiffies(HNS_ROCE_V2_FREE_MR_TIMEOUT) + jiffies; + while (cqe_cnt) { + npolled = hns_roce_v2_poll_cq(free_mr->rsv_cq, cqe_cnt, wc); + if (npolled < 0) { + ibdev_err(ibdev, + "failed to poll cqe for free mr, remain %d cqe.\n", + cqe_cnt); + goto out; + } + + if (time_after(jiffies, end)) { + ibdev_err(ibdev, + "failed to poll cqe for free mr and timeout, remain %d cqe.\n", + cqe_cnt); + goto out; + } + cqe_cnt -= npolled; + } + +out: + mutex_unlock(&free_mr->mutex); +} + +static void hns_roce_v2_dereg_mr(struct hns_roce_dev *hr_dev) +{ + if (hr_dev->pci_dev->revision == PCI_REVISION_ID_HIP08) + free_mr_send_cmd_to_hw(hr_dev); +} + static void *get_cqe_v2(struct hns_roce_cq *hr_cq, int n) { return hns_roce_buf_offset(hr_cq->mtr.kmem, n * hr_cq->cqe_size); @@ -4579,6 +4859,18 @@ static int hns_roce_v2_set_path(struct ib_qp *ibqp, u8 hr_port; int ret;
+ /* + * If free_mr_en of qp is set, it means that this qp comes from + * free mr. This qp will perform the loopback operation. + * In the loopback scenario, only sl needs to be set. + */ + if (hr_qp->free_mr_en) { + hr_reg_write(context, QPC_SL, rdma_ah_get_sl(&attr->ah_attr)); + hr_reg_clear(qpc_mask, QPC_SL); + hr_qp->sl = rdma_ah_get_sl(&attr->ah_attr); + return 0; + } + ib_port = (attr_mask & IB_QP_PORT) ? attr->port_num : hr_qp->port + 1; hr_port = ib_port - 1; is_roce_protocol = rdma_cap_eth_ah(&hr_dev->ib_dev, ib_port) && @@ -6251,6 +6543,7 @@ static const struct hns_roce_hw hns_roce_hw_v2 = { .set_hem = hns_roce_v2_set_hem, .clear_hem = hns_roce_v2_clear_hem, .modify_qp = hns_roce_v2_modify_qp, + .dereg_mr = hns_roce_v2_dereg_mr, .qp_flow_control_init = hns_roce_v2_qp_flow_control_init, .init_eq = hns_roce_v2_init_eq_table, .cleanup_eq = hns_roce_v2_cleanup_eq_table, @@ -6332,14 +6625,25 @@ static int __hns_roce_hw_v2_init_instance(struct hnae3_handle *handle) ret = hns_roce_init(hr_dev); if (ret) { dev_err(hr_dev->dev, "RoCE Engine init failed!\n"); - goto error_failed_get_cfg; + goto error_failed_cfg; + } + + if (hr_dev->pci_dev->revision == PCI_REVISION_ID_HIP08) { + ret = free_mr_init(hr_dev); + if (ret) { + dev_err(hr_dev->dev, "failed to init free mr!\n"); + goto error_failed_roce_init; + } }
handle->priv = hr_dev;
return 0;
-error_failed_get_cfg: +error_failed_roce_init: + hns_roce_exit(hr_dev); + +error_failed_cfg: kfree(hr_dev->priv);
error_failed_kzalloc: @@ -6361,6 +6665,9 @@ static void __hns_roce_hw_v2_uninit_instance(struct hnae3_handle *handle, hr_dev->state = HNS_ROCE_DEVICE_STATE_UNINIT; hns_roce_handle_device_err(hr_dev);
+ if (hr_dev->pci_dev->revision == PCI_REVISION_ID_HIP08) + free_mr_exit(hr_dev); + hns_roce_exit(hr_dev); kfree(hr_dev->priv); ib_dealloc_device(&hr_dev->ib_dev); diff --git a/drivers/infiniband/hw/hns/hns_roce_hw_v2.h b/drivers/infiniband/hw/hns/hns_roce_hw_v2.h index e5df163f56e0..339b1b82f906 100644 --- a/drivers/infiniband/hw/hns/hns_roce_hw_v2.h +++ b/drivers/infiniband/hw/hns/hns_roce_hw_v2.h @@ -139,6 +139,18 @@ enum { #define CMD_CSQ_DESC_NUM 1024 #define CMD_CRQ_DESC_NUM 1024
+/* Free mr used parameters */ +#define HNS_ROCE_FREE_MR_USED_CQE_NUM 128 +#define HNS_ROCE_FREE_MR_USED_QP_NUM 0x8 +#define HNS_ROCE_FREE_MR_USED_PSN 0x0808 +#define HNS_ROCE_FREE_MR_USED_QP_RETRY_CNT 0x7 +#define HNS_ROCE_FREE_MR_USED_QP_TIMEOUT 0x12 +#define HNS_ROCE_FREE_MR_USED_SQWQE_NUM 128 +#define HNS_ROCE_FREE_MR_USED_SQSGE_NUM 0x2 +#define HNS_ROCE_FREE_MR_USED_RQWQE_NUM 128 +#define HNS_ROCE_FREE_MR_USED_RQSGE_NUM 0x2 +#define HNS_ROCE_V2_FREE_MR_TIMEOUT 4500 + enum { NO_ARMED = 0x0, REG_NXT_CEQE = 0x2, @@ -1317,10 +1329,18 @@ struct hns_roce_link_table { #define HNS_ROCE_EXT_LLM_ENTRY(addr, id) (((id) << (64 - 12)) | ((addr) >> 12)) #define HNS_ROCE_EXT_LLM_MIN_PAGES(que_num) ((que_num) * 4 + 2)
+struct hns_roce_v2_free_mr { + struct ib_qp *rsv_qp[HNS_ROCE_FREE_MR_USED_QP_NUM]; + struct ib_cq *rsv_cq; + struct ib_pd *rsv_pd; + struct mutex mutex; +}; + struct hns_roce_v2_priv { struct hnae3_handle *handle; struct hns_roce_v2_cmq cmq; struct hns_roce_link_table ext_llm; + struct hns_roce_v2_free_mr free_mr; };
struct hns_roce_dip { diff --git a/drivers/infiniband/hw/hns/hns_roce_mr.c b/drivers/infiniband/hw/hns/hns_roce_mr.c index 1e36ac383ea3..d1a9200a312d 100644 --- a/drivers/infiniband/hw/hns/hns_roce_mr.c +++ b/drivers/infiniband/hw/hns/hns_roce_mr.c @@ -119,8 +119,7 @@ static void free_mr_pbl(struct hns_roce_dev *hr_dev, struct hns_roce_mr *mr) hns_roce_mtr_destroy(hr_dev, &mr->pbl_mtr); }
-static void hns_roce_mr_free(struct hns_roce_dev *hr_dev, - struct hns_roce_mr *mr) +static void hns_roce_mr_free(struct hns_roce_dev *hr_dev, struct hns_roce_mr *mr) { struct ib_device *ibdev = &hr_dev->ib_dev; int ret; @@ -338,6 +337,9 @@ int hns_roce_dereg_mr(struct ib_mr *ibmr, struct ib_udata *udata) struct hns_roce_dev *hr_dev = to_hr_dev(ibmr->device); struct hns_roce_mr *mr = to_hr_mr(ibmr);
+ if (hr_dev->hw->dereg_mr) + hr_dev->hw->dereg_mr(hr_dev); + hns_roce_mr_free(hr_dev, mr); kfree(mr);
From: zhengfeng luo <luozhengfeng@h-partners.com>

driver inclusion
category: bugfix
bugzilla: https://gitee.com/openeuler/kernel/issues/I5MQLB
----------------------------------------------------------------------
If the GID or MTU is changed after the driver is initialized, errors like the following will occur:

[   29.612240] __ib_cache_gid_add: unable to add gid fe80:0000:0000:0000:4600:4dff:fe22:abb5 error=-28
[61807.380991] hns3 0000:7d:00.0 hns_0: attr path_mtu(1)invalid while modify qp
Fixes: 70f92521584f ("RDMA/hns: Use the reserved loopback QPs to free MR before destroying MPT")
Signed-off-by: Yixing Liu <liuyixing1@huawei.com>
Signed-off-by: Haoyue Xu <xuhaoyue1@hisilicon.com>
Signed-off-by: zhengfeng luo <luozhengfeng@h-partners.com>
Reviewed-by: Yangyang Li <liyangyang20@huawei.com>
Reviewed-by: Wei Yongjun <weiyongjun1@huawei.com>
Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>
---
 drivers/infiniband/hw/hns/hns_roce_hw_v2.c | 193 +++++++++++++++------
 drivers/infiniband/hw/hns/hns_roce_hw_v2.h |  12 +-
 2 files changed, 150 insertions(+), 55 deletions(-)
diff --git a/drivers/infiniband/hw/hns/hns_roce_hw_v2.c b/drivers/infiniband/hw/hns/hns_roce_hw_v2.c index 7eda0f7a12cd..de7873551ffc 100644 --- a/drivers/infiniband/hw/hns/hns_roce_hw_v2.c +++ b/drivers/infiniband/hw/hns/hns_roce_hw_v2.c @@ -2616,16 +2616,112 @@ static void free_dip_list(struct hns_roce_dev *hr_dev) spin_unlock_irqrestore(&hr_dev->dip_list_lock, flags); }
+static struct ib_pd *free_mr_init_pd(struct hns_roce_dev *hr_dev) +{ + struct hns_roce_v2_priv *priv = hr_dev->priv; + struct hns_roce_v2_free_mr *free_mr = &priv->free_mr; + struct ib_device *ibdev = &hr_dev->ib_dev; + struct hns_roce_pd *hr_pd; + struct ib_pd *pd; + int ret; + + hr_pd = kzalloc(sizeof(*hr_pd), GFP_KERNEL); + if (ZERO_OR_NULL_PTR(hr_pd)) + return NULL; + + pd = &hr_pd->ibpd; + pd->device = ibdev; + + ret = hns_roce_alloc_pd(pd, NULL); + if (ret) { + ibdev_err(ibdev, "failed to create pd for free mr.\n"); + kfree(hr_pd); + return NULL; + } + + free_mr->rsv_pd = to_hr_pd(pd); + free_mr->rsv_pd->ibpd.device = &hr_dev->ib_dev; + free_mr->rsv_pd->ibpd.uobject = NULL; + free_mr->rsv_pd->ibpd.__internal_mr = NULL; + atomic_set(&free_mr->rsv_pd->ibpd.usecnt, 0); + + return pd; +} + +static struct ib_cq *free_mr_init_cq(struct hns_roce_dev *hr_dev) +{ + struct hns_roce_v2_priv *priv = hr_dev->priv; + struct hns_roce_v2_free_mr *free_mr = &priv->free_mr; + struct ib_device *ibdev = &hr_dev->ib_dev; + struct ib_cq_init_attr cq_init_attr = {}; + struct hns_roce_cq *hr_cq; + struct ib_cq *cq; + int ret; + + cq_init_attr.cqe = HNS_ROCE_FREE_MR_USED_CQE_NUM; + + hr_cq = kzalloc(sizeof(*hr_cq), GFP_KERNEL); + if (ZERO_OR_NULL_PTR(hr_cq)) + return NULL; + + cq = &hr_cq->ib_cq; + cq->device = ibdev; + + ret = hns_roce_create_cq(cq, &cq_init_attr, NULL); + if (ret) { + ibdev_err(ibdev, "failed to create cq for free mr.\n"); + kfree(hr_cq); + return NULL; + } + + free_mr->rsv_cq = to_hr_cq(cq); + free_mr->rsv_cq->ib_cq.device = &hr_dev->ib_dev; + free_mr->rsv_cq->ib_cq.uobject = NULL; + free_mr->rsv_cq->ib_cq.comp_handler = NULL; + free_mr->rsv_cq->ib_cq.event_handler = NULL; + free_mr->rsv_cq->ib_cq.cq_context = NULL; + atomic_set(&free_mr->rsv_cq->ib_cq.usecnt, 0); + + return cq; +} + +static struct hns_roce_qp *create_free_mr_qp(struct hns_roce_dev *hr_dev, + struct ib_pd *pd, struct ib_cq *cq) +{ + struct ib_device *ibdev = &hr_dev->ib_dev; + struct ib_qp_init_attr init_attr = {}; + struct ib_qp *qp; + + init_attr.qp_type = IB_QPT_RC; + init_attr.sq_sig_type = IB_SIGNAL_ALL_WR; + init_attr.send_cq = cq; + init_attr.recv_cq = cq; + init_attr.cap.max_send_wr = HNS_ROCE_FREE_MR_USED_SQWQE_NUM; + init_attr.cap.max_send_sge = HNS_ROCE_FREE_MR_USED_SQSGE_NUM; + init_attr.cap.max_recv_wr = HNS_ROCE_FREE_MR_USED_RQWQE_NUM; + init_attr.cap.max_recv_sge = HNS_ROCE_FREE_MR_USED_RQSGE_NUM; + + qp = hns_roce_create_qp(pd, &init_attr, NULL); + if (IS_ERR_OR_NULL(qp)) { + ibdev_err(ibdev, "failed to create qp for free mr.\n"); + return NULL; + } + + return to_hr_qp(qp); +} + static void free_mr_exit(struct hns_roce_dev *hr_dev) { struct hns_roce_v2_priv *priv = hr_dev->priv; struct hns_roce_v2_free_mr *free_mr = &priv->free_mr; + struct hns_roce_qp *hr_qp; int ret; int i;
for (i = 0; i < ARRAY_SIZE(free_mr->rsv_qp); i++) { if (free_mr->rsv_qp[i]) { - ret = ib_destroy_qp(free_mr->rsv_qp[i]); + hr_qp = to_hr_qp(&free_mr->rsv_qp[i]->ibqp); + ret = hns_roce_v2_destroy_qp_common(hr_dev, hr_qp, NULL); if (ret) ibdev_err(&hr_dev->ib_dev, "failed to destroy qp in free mr.\n"); @@ -2635,13 +2731,14 @@ static void free_mr_exit(struct hns_roce_dev *hr_dev) }
if (free_mr->rsv_cq) { - ib_destroy_cq(free_mr->rsv_cq); - free_mr->rsv_cq = NULL; + hns_roce_destroy_cq(&free_mr->rsv_cq->ib_cq, NULL); + kfree(free_mr->rsv_cq); }
if (free_mr->rsv_pd) { - ib_dealloc_pd(free_mr->rsv_pd); + hns_roce_dealloc_pd(&free_mr->rsv_pd->ibpd, NULL); free_mr->rsv_pd = NULL; + kfree(free_mr->rsv_pd); } }
@@ -2649,55 +2746,40 @@ static int free_mr_alloc_res(struct hns_roce_dev *hr_dev) { struct hns_roce_v2_priv *priv = hr_dev->priv; struct hns_roce_v2_free_mr *free_mr = &priv->free_mr; - struct ib_device *ibdev = &hr_dev->ib_dev; - struct ib_cq_init_attr cq_init_attr = {}; - struct ib_qp_init_attr qp_init_attr = {}; struct ib_pd *pd; struct ib_cq *cq; - struct ib_qp *qp; int ret; int i;
- pd = ib_alloc_pd(ibdev, 0); - if (IS_ERR(pd)) { - ibdev_err(ibdev, "failed to create pd for free mr.\n"); - return PTR_ERR(pd); - } - free_mr->rsv_pd = pd; + pd = free_mr_init_pd(hr_dev); + if (!pd) + return -ENOMEM;
- cq_init_attr.cqe = HNS_ROCE_FREE_MR_USED_CQE_NUM; - cq = ib_create_cq(ibdev, NULL, NULL, NULL, &cq_init_attr); - if (IS_ERR(cq)) { - ibdev_err(ibdev, "failed to create cq for free mr.\n"); - ret = PTR_ERR(cq); - goto create_failed; + cq = free_mr_init_cq(hr_dev); + if (!cq) { + ret = -ENOMEM; + goto create_failed_cq; } - free_mr->rsv_cq = cq;
- qp_init_attr.qp_type = IB_QPT_RC; - qp_init_attr.sq_sig_type = IB_SIGNAL_ALL_WR; - qp_init_attr.send_cq = free_mr->rsv_cq; - qp_init_attr.recv_cq = free_mr->rsv_cq; for (i = 0; i < ARRAY_SIZE(free_mr->rsv_qp); i++) { - qp_init_attr.cap.max_send_wr = HNS_ROCE_FREE_MR_USED_SQWQE_NUM; - qp_init_attr.cap.max_send_sge = HNS_ROCE_FREE_MR_USED_SQSGE_NUM; - qp_init_attr.cap.max_recv_wr = HNS_ROCE_FREE_MR_USED_RQWQE_NUM; - qp_init_attr.cap.max_recv_sge = HNS_ROCE_FREE_MR_USED_RQSGE_NUM; - - qp = ib_create_qp(free_mr->rsv_pd, &qp_init_attr); - if (IS_ERR(qp)) { - ibdev_err(ibdev, "failed to create qp for free mr.\n"); - ret = PTR_ERR(qp); - goto create_failed; + free_mr->rsv_qp[i] = create_free_mr_qp(hr_dev, pd, cq); + if (!free_mr->rsv_qp[i]) { + ret = -ENOMEM; + goto create_failed_qp; } - - free_mr->rsv_qp[i] = qp; + free_mr->rsv_qp[i]->ibqp.recv_cq = cq; + free_mr->rsv_qp[i]->ibqp.send_cq = cq; }
return 0;
-create_failed: - free_mr_exit(hr_dev); +create_failed_qp: + hns_roce_destroy_cq(cq, NULL); + kfree(cq); + +create_failed_cq: + hns_roce_dealloc_pd(pd, NULL); + kfree(pd);
return ret; } @@ -2709,18 +2791,19 @@ static int free_mr_modify_rsv_qp(struct hns_roce_dev *hr_dev, struct hns_roce_v2_free_mr *free_mr = &priv->free_mr; struct ib_device *ibdev = &hr_dev->ib_dev; struct hns_roce_qp *hr_qp; - int loopback; - int mask; - int ret; + int loopback, mask, ret;
- hr_qp = to_hr_qp(free_mr->rsv_qp[sl_num]); + hr_qp = to_hr_qp(&free_mr->rsv_qp[sl_num]->ibqp); hr_qp->free_mr_en = 1; + hr_qp->ibqp.device = ibdev; + hr_qp->ibqp.qp_type = IB_QPT_RC;
mask = IB_QP_STATE | IB_QP_PKEY_INDEX | IB_QP_PORT | IB_QP_ACCESS_FLAGS; attr->qp_state = IB_QPS_INIT; attr->port_num = 1; attr->qp_access_flags = IB_ACCESS_REMOTE_WRITE; - ret = ib_modify_qp(&hr_qp->ibqp, attr, mask); + ret = hr_dev->hw->modify_qp(&hr_qp->ibqp, attr, mask, IB_QPS_INIT, + IB_QPS_INIT); if (ret) { ibdev_err(ibdev, "failed to modify qp to init, ret = %d.\n", ret); @@ -2741,7 +2824,8 @@ static int free_mr_modify_rsv_qp(struct hns_roce_dev *hr_dev,
rdma_ah_set_sl(&attr->ah_attr, (u8)sl_num);
- ret = ib_modify_qp(&hr_qp->ibqp, attr, mask); + ret = hr_dev->hw->modify_qp(&hr_qp->ibqp, attr, mask, IB_QPS_INIT, + IB_QPS_RTR); hr_dev->loop_idc = loopback; if (ret) { ibdev_err(ibdev, "failed to modify qp to rtr, ret = %d.\n", @@ -2755,7 +2839,8 @@ static int free_mr_modify_rsv_qp(struct hns_roce_dev *hr_dev, attr->sq_psn = HNS_ROCE_FREE_MR_USED_PSN; attr->retry_cnt = HNS_ROCE_FREE_MR_USED_QP_RETRY_CNT; attr->timeout = HNS_ROCE_FREE_MR_USED_QP_TIMEOUT; - ret = ib_modify_qp(&hr_qp->ibqp, attr, mask); + ret = hr_dev->hw->modify_qp(&hr_qp->ibqp, attr, mask, IB_QPS_RTR, + IB_QPS_RTS); if (ret) ibdev_err(ibdev, "failed to modify qp to rts, ret = %d.\n", ret); @@ -2786,8 +2871,12 @@ static int free_mr_modify_qp(struct hns_roce_dev *hr_dev)
static int free_mr_init(struct hns_roce_dev *hr_dev) { + struct hns_roce_v2_priv *priv = hr_dev->priv; + struct hns_roce_v2_free_mr *free_mr = &priv->free_mr; int ret;
+ mutex_init(&free_mr->mutex); + ret = free_mr_alloc_res(hr_dev); if (ret) return ret; @@ -3398,7 +3487,7 @@ static void free_mr_send_cmd_to_hw(struct hns_roce_dev *hr_dev) mutex_lock(&free_mr->mutex);
for (i = 0; i < ARRAY_SIZE(free_mr->rsv_qp); i++) { - hr_qp = to_hr_qp(free_mr->rsv_qp[i]); + hr_qp = free_mr->rsv_qp[i];
ret = free_mr_post_send_lp_wqe(hr_qp); if (ret) { @@ -3413,7 +3502,7 @@ static void free_mr_send_cmd_to_hw(struct hns_roce_dev *hr_dev)
end = msecs_to_jiffies(HNS_ROCE_V2_FREE_MR_TIMEOUT) + jiffies; while (cqe_cnt) { - npolled = hns_roce_v2_poll_cq(free_mr->rsv_cq, cqe_cnt, wc); + npolled = hns_roce_v2_poll_cq(&free_mr->rsv_cq->ib_cq, cqe_cnt, wc); if (npolled < 0) { ibdev_err(ibdev, "failed to poll cqe for free mr, remain %d cqe.\n", @@ -5388,9 +5477,9 @@ static inline int modify_qp_is_ok(struct hns_roce_qp *hr_qp) hr_qp->state != IB_QPS_RESET); }
-static int hns_roce_v2_destroy_qp_common(struct hns_roce_dev *hr_dev, - struct hns_roce_qp *hr_qp, - struct ib_udata *udata) +int hns_roce_v2_destroy_qp_common(struct hns_roce_dev *hr_dev, + struct hns_roce_qp *hr_qp, + struct ib_udata *udata) { struct ib_device *ibdev = &hr_dev->ib_dev; struct hns_roce_cq *send_cq, *recv_cq; @@ -5432,7 +5521,7 @@ static int hns_roce_v2_destroy_qp_common(struct hns_roce_dev *hr_dev, return ret; }
-static int hns_roce_v2_destroy_qp(struct ib_qp *ibqp, struct ib_udata *udata) +int hns_roce_v2_destroy_qp(struct ib_qp *ibqp, struct ib_udata *udata) { struct hns_roce_dev *hr_dev = to_hr_dev(ibqp->device); struct hns_roce_qp *hr_qp = to_hr_qp(ibqp); diff --git a/drivers/infiniband/hw/hns/hns_roce_hw_v2.h b/drivers/infiniband/hw/hns/hns_roce_hw_v2.h index 339b1b82f906..8aa56c24e677 100644 --- a/drivers/infiniband/hw/hns/hns_roce_hw_v2.h +++ b/drivers/infiniband/hw/hns/hns_roce_hw_v2.h @@ -1330,9 +1330,9 @@ struct hns_roce_link_table { #define HNS_ROCE_EXT_LLM_MIN_PAGES(que_num) ((que_num) * 4 + 2)
struct hns_roce_v2_free_mr { - struct ib_qp *rsv_qp[HNS_ROCE_FREE_MR_USED_QP_NUM]; - struct ib_cq *rsv_cq; - struct ib_pd *rsv_pd; + struct hns_roce_qp *rsv_qp[HNS_ROCE_FREE_MR_USED_QP_NUM]; + struct hns_roce_cq *rsv_cq; + struct hns_roce_pd *rsv_pd; struct mutex mutex; };
@@ -1458,6 +1458,12 @@ struct hns_roce_sccc_clr_done { int hns_roce_v2_query_cqc_info(struct hns_roce_dev *hr_dev, u32 cqn, int *buffer);
+int hns_roce_v2_destroy_qp(struct ib_qp *ibqp, struct ib_udata *udata); + +int hns_roce_v2_destroy_qp_common(struct hns_roce_dev *hr_dev, + struct hns_roce_qp *hr_qp, + struct ib_udata *udata); + static inline void hns_roce_write64(struct hns_roce_dev *hr_dev, __le32 val[2], void __iomem *dest) {
From: Zhihao Cheng chengzhihao1@huawei.com
hulk inclusion category: bugfix bugzilla: https://gitee.com/openeuler/kernel/issues/I5QH0X CVE: NA
--------------------------------
The following process triggers the problem: Init: v2_read_file_info: <3> dqi_free_blk 0 dqi_free_entry 5 dqi_blks 6
Step 1. chown bin f_a -> dquot_acquire -> v2_write_dquot: qtree_write_dquot do_insert_tree find_free_dqentry get_free_dqblk write_blk(info->dqi_blocks) // info->dqi_blocks = 6, failure. The content in physical block (corresponding to blk 6) is random.
Step 2. chown root f_a -> dquot_transfer -> dqput_all -> dqput -> ext4_release_dquot -> v2_release_dquot -> qtree_delete_dquot: dquot_release remove_tree free_dqentry put_free_dqblk(6) info->dqi_free_blk = blk // info->dqi_free_blk = 6
Step 3. drop cache (buffer head for block 6 is released)
Step 4. chown bin f_b -> dquot_acquire -> commit_dqblk -> v2_write_dquot: qtree_write_dquot do_insert_tree find_free_dqentry get_free_dqblk dh = (struct qt_disk_dqdbheader *)buf blk = info->dqi_free_blk // 6 ret = read_blk(info, blk, buf) // The content of buf is random info->dqi_free_blk = le32_to_cpu(dh->dqdh_next_free) // random blk
Step 5. chown bin f_c -> notify_change -> ext4_setattr -> dquot_transfer: dquot = dqget -> acquire_dquot -> ext4_acquire_dquot -> dquot_acquire -> commit_dqblk -> v2_write_dquot -> dq_insert_tree: do_insert_tree find_free_dqentry get_free_dqblk blk = info->dqi_free_blk // If blk < 0 and blk is not an error code, it will be returned as dquot
transfer_to[USRQUOTA] = dquot // A random negative value __dquot_transfer(transfer_to) dquot_add_inodes(transfer_to[cnt]) spin_lock(&dquot->dq_dqb_lock) // page fault
This leads to a kernel page fault: Quota error (device sda): qtree_write_dquot: Error -8000 occurred while creating quota BUG: unable to handle page fault for address: ffffffffffffe120 #PF: supervisor write access in kernel mode #PF: error_code(0x0002) - not-present page Oops: 0002 [#1] PREEMPT SMP CPU: 0 PID: 5974 Comm: chown Not tainted 6.0.0-rc1-00004 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996) RIP: 0010:_raw_spin_lock+0x3a/0x90 Call Trace: dquot_add_inodes+0x28/0x270 __dquot_transfer+0x377/0x840 dquot_transfer+0xde/0x540 ext4_setattr+0x405/0x14d0 notify_change+0x68e/0x9f0 chown_common+0x300/0x430 __x64_sys_fchownat+0x29/0x40
To avoid accessing an invalid quota memory address, this patch adds block number checks for the next/prev free block numbers read from the quota file.
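For reference, the fields being validated live in the free-list header at the start of each quota data block, as defined in fs/quota/quota_tree.h:

struct qt_disk_dqdbheader {
	__le32 dqdh_next_free;	/* Number of next block with free entry */
	__le32 dqdh_prev_free;	/* Number of previous block with free entry */
	__le16 dqdh_entries;	/* Number of valid entries in block */
	__le16 dqdh_pad1;
	__le32 dqdh_pad2;
};

A stale or random header (as in Step 4 above) yields dqdh_next_free/dqdh_prev_free values outside [0, dqi_blocks), which is exactly what the new checks reject.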
A reproducer can be fetched from [Link].
Link: https://bugzilla.kernel.org/show_bug.cgi?id=216372 Fixes: 1da177e4c3f4152 ("Linux-2.6.12-rc2") Signed-off-by: Zhihao Cheng chengzhihao1@huawei.com Signed-off-by: Li Lingfeng lilingfeng3@huawei.com Reviewed-by: Zhihao Cheng chengzhihao1@huawei.com Reviewed-by: Zhang Yi yi.zhang@huawei.com Signed-off-by: Zheng Zengkai zhengzengkai@huawei.com --- fs/quota/quota_tree.c | 38 ++++++++++++++++++++++++++++++++++++++ 1 file changed, 38 insertions(+)
diff --git a/fs/quota/quota_tree.c b/fs/quota/quota_tree.c index 1a188fbdf34e..15b8e1a39015 100644 --- a/fs/quota/quota_tree.c +++ b/fs/quota/quota_tree.c @@ -80,6 +80,35 @@ static ssize_t write_blk(struct qtree_mem_dqinfo *info, uint blk, char *buf) return ret; }
+static inline int do_check_range(struct super_block *sb, uint val, uint max_val) +{ + if (val >= max_val) { + quota_error(sb, "Getting block too big (%u >= %u)", + val, max_val); + return -EUCLEAN; + } + + return 0; +} + +static int check_free_block(struct qtree_mem_dqinfo *info, + struct qt_disk_dqdbheader *dh) +{ + int err = 0; + uint nextblk, prevblk; + + nextblk = le32_to_cpu(dh->dqdh_next_free); + err = do_check_range(info->dqi_sb, nextblk, info->dqi_blocks); + if (err) + return err; + prevblk = le32_to_cpu(dh->dqdh_prev_free); + err = do_check_range(info->dqi_sb, prevblk, info->dqi_blocks); + if (err) + return err; + + return err; +} + /* Remove empty block from list and return it */ static int get_free_dqblk(struct qtree_mem_dqinfo *info) { @@ -94,6 +123,9 @@ static int get_free_dqblk(struct qtree_mem_dqinfo *info) ret = read_blk(info, blk, buf); if (ret < 0) goto out_buf; + ret = check_free_block(info, dh); + if (ret) + goto out_buf; info->dqi_free_blk = le32_to_cpu(dh->dqdh_next_free); } else { @@ -241,6 +273,9 @@ static uint find_free_dqentry(struct qtree_mem_dqinfo *info, *err = read_blk(info, blk, buf); if (*err < 0) goto out_buf; + *err = check_free_block(info, dh); + if (*err) + goto out_buf; } else { blk = get_free_dqblk(info); if ((int)blk < 0) { @@ -433,6 +468,9 @@ static int free_dqentry(struct qtree_mem_dqinfo *info, struct dquot *dquot, goto out_buf; } dh = (struct qt_disk_dqdbheader *)buf; + ret = check_free_block(info, dh); + if (ret) + goto out_buf; le16_add_cpu(&dh->dqdh_entries, -1); if (!le16_to_cpu(dh->dqdh_entries)) { /* Block got free? */ ret = remove_free_dqentry(info, buf, blk);
From: Zhihao Cheng chengzhihao1@huawei.com
hulk inclusion category: bugfix bugzilla: https://gitee.com/openeuler/kernel/issues/I5QH0X CVE: NA
--------------------------------
Clean up all block-checking sites by replacing them with the helper function do_check_range().
Signed-off-by: Zhihao Cheng chengzhihao1@huawei.com Signed-off-by: Li Lingfeng lilingfeng3@huawei.com Reviewed-by: Zhihao Cheng chengzhihao1@huawei.com Reviewed-by: Zhang Yi yi.zhang@huawei.com Signed-off-by: Zheng Zengkai zhengzengkai@huawei.com --- fs/quota/quota_tree.c | 28 ++++++++++++---------------- 1 file changed, 12 insertions(+), 16 deletions(-)
diff --git a/fs/quota/quota_tree.c b/fs/quota/quota_tree.c index 15b8e1a39015..c200161554db 100644 --- a/fs/quota/quota_tree.c +++ b/fs/quota/quota_tree.c @@ -80,11 +80,12 @@ static ssize_t write_blk(struct qtree_mem_dqinfo *info, uint blk, char *buf) return ret; }
-static inline int do_check_range(struct super_block *sb, uint val, uint max_val) +static inline int do_check_range(struct super_block *sb, uint val, + uint min_val, uint max_val) { - if (val >= max_val) { - quota_error(sb, "Getting block too big (%u >= %u)", - val, max_val); + if (val < min_val || val >= max_val) { + quota_error(sb, "Getting block %u out of range %u-%u", + val, min_val, max_val); return -EUCLEAN; }
@@ -98,11 +99,11 @@ static int check_free_block(struct qtree_mem_dqinfo *info, uint nextblk, prevblk;
nextblk = le32_to_cpu(dh->dqdh_next_free); - err = do_check_range(info->dqi_sb, nextblk, info->dqi_blocks); + err = do_check_range(info->dqi_sb, nextblk, 0, info->dqi_blocks); if (err) return err; prevblk = le32_to_cpu(dh->dqdh_prev_free); - err = do_check_range(info->dqi_sb, prevblk, info->dqi_blocks); + err = do_check_range(info->dqi_sb, prevblk, 0, info->dqi_blocks); if (err) return err;
@@ -527,12 +528,10 @@ static int remove_tree(struct qtree_mem_dqinfo *info, struct dquot *dquot, goto out_buf; } newblk = le32_to_cpu(ref[get_index(info, dquot->dq_id, depth)]); - if (newblk < QT_TREEOFF || newblk >= info->dqi_blocks) { - quota_error(dquot->dq_sb, "Getting block too big (%u >= %u)", - newblk, info->dqi_blocks); - ret = -EUCLEAN; + ret = do_check_range(dquot->dq_sb, newblk, QT_TREEOFF, + info->dqi_blocks); + if (ret) goto out_buf; - }
if (depth == info->dqi_qtree_depth - 1) { ret = free_dqentry(info, dquot, newblk); @@ -633,12 +632,9 @@ static loff_t find_tree_dqentry(struct qtree_mem_dqinfo *info, blk = le32_to_cpu(ref[get_index(info, dquot->dq_id, depth)]); if (!blk) /* No reference? */ goto out_buf; - if (blk < QT_TREEOFF || blk >= info->dqi_blocks) { - quota_error(dquot->dq_sb, "Getting block too big (%u >= %u)", - blk, info->dqi_blocks); - ret = -EUCLEAN; + ret = do_check_range(dquot->dq_sb, blk, QT_TREEOFF, info->dqi_blocks); + if (ret) goto out_buf; - }
if (depth < info->dqi_qtree_depth - 1) ret = find_tree_dqentry(info, dquot, blk, depth+1);
From: Zhihao Cheng chengzhihao1@huawei.com
hulk inclusion category: bugfix bugzilla: https://gitee.com/openeuler/kernel/issues/I5QH0X CVE: NA
--------------------------------
It would be better to do more sanity checking (e.g. dqdh_entries, block number) on the content read from the quota file, which can prevent corruption of the quota file.
Signed-off-by: Zhihao Cheng chengzhihao1@huawei.com Signed-off-by: Li Lingfeng lilingfeng3@huawei.com Reviewed-by: Zhihao Cheng chengzhihao1@huawei.com Reviewed-by: Zhang Yi yi.zhang@huawei.com Signed-off-by: Zheng Zengkai zhengzengkai@huawei.com --- fs/quota/quota_tree.c | 43 +++++++++++++++++++++++++++++++++---------- 1 file changed, 33 insertions(+), 10 deletions(-)
diff --git a/fs/quota/quota_tree.c b/fs/quota/quota_tree.c index c200161554db..06e3ec528b2c 100644 --- a/fs/quota/quota_tree.c +++ b/fs/quota/quota_tree.c @@ -80,12 +80,12 @@ static ssize_t write_blk(struct qtree_mem_dqinfo *info, uint blk, char *buf) return ret; }
-static inline int do_check_range(struct super_block *sb, uint val, - uint min_val, uint max_val) +static inline int do_check_range(struct super_block *sb, const char *val_name, + uint val, uint min_val, uint max_val) { if (val < min_val || val >= max_val) { - quota_error(sb, "Getting block %u out of range %u-%u", - val, min_val, max_val); + quota_error(sb, "Getting %s %u out of range %u-%u", + val_name, val, min_val, max_val); return -EUCLEAN; }
@@ -99,11 +99,13 @@ static int check_free_block(struct qtree_mem_dqinfo *info, uint nextblk, prevblk;
nextblk = le32_to_cpu(dh->dqdh_next_free); - err = do_check_range(info->dqi_sb, nextblk, 0, info->dqi_blocks); + err = do_check_range(info->dqi_sb, "dqdh_next_free", nextblk, 0, + info->dqi_blocks); if (err) return err; prevblk = le32_to_cpu(dh->dqdh_prev_free); - err = do_check_range(info->dqi_sb, prevblk, 0, info->dqi_blocks); + err = do_check_range(info->dqi_sb, "dqdh_prev_free", prevblk, 0, + info->dqi_blocks); if (err) return err;
@@ -277,6 +279,11 @@ static uint find_free_dqentry(struct qtree_mem_dqinfo *info, *err = check_free_block(info, dh); if (*err) goto out_buf; + *err = do_check_range(info->dqi_sb, "dqdh_entries", + le16_to_cpu(dh->dqdh_entries), 0, + qtree_dqstr_in_blk(info)); + if (*err) + goto out_buf; } else { blk = get_free_dqblk(info); if ((int)blk < 0) { @@ -358,6 +365,10 @@ static int do_insert_tree(struct qtree_mem_dqinfo *info, struct dquot *dquot, } ref = (__le32 *)buf; newblk = le32_to_cpu(ref[get_index(info, dquot->dq_id, depth)]); + ret = do_check_range(dquot->dq_sb, "block", newblk, 0, + info->dqi_blocks); + if (ret) + goto out_buf; if (!newblk) newson = 1; if (depth == info->dqi_qtree_depth - 1) { @@ -470,6 +481,11 @@ static int free_dqentry(struct qtree_mem_dqinfo *info, struct dquot *dquot, } dh = (struct qt_disk_dqdbheader *)buf; ret = check_free_block(info, dh); + if (ret) + goto out_buf; + ret = do_check_range(info->dqi_sb, "dqdh_entries", + le16_to_cpu(dh->dqdh_entries), 1, + qtree_dqstr_in_blk(info) + 1); if (ret) goto out_buf; le16_add_cpu(&dh->dqdh_entries, -1); @@ -528,7 +544,7 @@ static int remove_tree(struct qtree_mem_dqinfo *info, struct dquot *dquot, goto out_buf; } newblk = le32_to_cpu(ref[get_index(info, dquot->dq_id, depth)]); - ret = do_check_range(dquot->dq_sb, newblk, QT_TREEOFF, + ret = do_check_range(dquot->dq_sb, "block", newblk, QT_TREEOFF, info->dqi_blocks); if (ret) goto out_buf; @@ -632,7 +648,8 @@ static loff_t find_tree_dqentry(struct qtree_mem_dqinfo *info, blk = le32_to_cpu(ref[get_index(info, dquot->dq_id, depth)]); if (!blk) /* No reference? */ goto out_buf; - ret = do_check_range(dquot->dq_sb, blk, QT_TREEOFF, info->dqi_blocks); + ret = do_check_range(dquot->dq_sb, "block", blk, QT_TREEOFF, + info->dqi_blocks); if (ret) goto out_buf;
@@ -748,7 +765,13 @@ static int find_next_id(struct qtree_mem_dqinfo *info, qid_t *id, goto out_buf; } for (i = __get_index(info, *id, depth); i < epb; i++) { - if (ref[i] == cpu_to_le32(0)) { + uint blk_no = le32_to_cpu(ref[i]); + + ret = do_check_range(info->dqi_sb, "block", blk_no, 0, + info->dqi_blocks); + if (ret) + goto out_buf; + if (blk_no == 0) { *id += level_inc; continue; } @@ -756,7 +779,7 @@ static int find_next_id(struct qtree_mem_dqinfo *info, qid_t *id, ret = 0; goto out_buf; } - ret = find_next_id(info, id, le32_to_cpu(ref[i]), depth + 1); + ret = find_next_id(info, id, blk_no, depth + 1); if (ret != -ENOENT) break; }
From: liubo liubo254@huawei.com
euleros inclusion category: feature feature: etmem bugzilla: https://gitee.com/openeuler/kernel/issues/I5DC4A
-------------------------------------------------
etmem, the memory vertical expansion technology, uses DRAM together with new high-performance storage media to form multi-level memory storage.
The etmem feature was introduced in a previous commit (aa7f1d222cdab88f12e6d889437fed6571dec824), but only the config options for the etmem_swap and etmem_scan modules were added; the config option for the etmem feature itself was missing. This commit adds the CONFIG_ETMEM option for the etmem feature.
Signed-off-by: liubo liubo254@huawei.com Reviewed-by: Miaohe Lin linmiaohe@huawei.com Reviewed-by: Kefeng Wang wangkefeng.wang@huawei.com Signed-off-by: Zheng Zengkai zhengzengkai@huawei.com --- mm/Kconfig | 16 +++++++++++++--- 1 file changed, 13 insertions(+), 3 deletions(-)
diff --git a/mm/Kconfig b/mm/Kconfig index 9e66dfb15c52..8ff588380c55 100644 --- a/mm/Kconfig +++ b/mm/Kconfig @@ -514,18 +514,28 @@ config MEMCG_QOS
config ETMEM_SCAN tristate "module: etmem page scan for etmem support" - depends on MMU - depends on X86 || ARM64 + depends on ETMEM help etmem page scan feature used to scan the virtual address of the target process
config ETMEM_SWAP tristate "module: etmem page swap for etmem support" + depends on ETMEM + help + etmem page swap feature + +config ETMEM + bool "Enable etmem feature" depends on MMU depends on X86 || ARM64 + default n help - etmem page swap feature + etmem is a tiered memory extension technology that uses DRAM and memory + compression/high-performance storage media to form tiered memory storage. + Memory data is tiered, and cold data is migrated from memory media to + high-performance storage media to release memory space and reduce + memory costs.
config USERSWAP bool "Enable User Swap"
From: liubo liubo254@huawei.com
euleros inclusion category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I5DC4A CVE: NA
-------------------------------- enable CONFIG_ETMEM by default.
Signed-off-by: liubo liubo254@huawei.com Reviewed-by: Miaohe Lin linmiaohe@huawei.com Reviewed-by: Kefeng Wang wangkefeng.wang@huawei.com Reviewed-by: Chao Liu liuchao173@huawei.com Signed-off-by: Zheng Zengkai zhengzengkai@huawei.com --- arch/arm64/configs/openeuler_defconfig | 1 + arch/x86/configs/openeuler_defconfig | 1 + 2 files changed, 2 insertions(+)
diff --git a/arch/arm64/configs/openeuler_defconfig b/arch/arm64/configs/openeuler_defconfig index d246fd508ef6..43316a09314d 100644 --- a/arch/arm64/configs/openeuler_defconfig +++ b/arch/arm64/configs/openeuler_defconfig @@ -1077,6 +1077,7 @@ CONFIG_SHRINK_PAGECACHE=y CONFIG_MEMCG_QOS=y CONFIG_ETMEM_SCAN=m CONFIG_ETMEM_SWAP=m +CONFIG_ETMEM=y CONFIG_USERSWAP=y CONFIG_CMA=y # CONFIG_CMA_DEBUG is not set diff --git a/arch/x86/configs/openeuler_defconfig b/arch/x86/configs/openeuler_defconfig index 3eac70518e6f..584a9361aa70 100644 --- a/arch/x86/configs/openeuler_defconfig +++ b/arch/x86/configs/openeuler_defconfig @@ -1020,6 +1020,7 @@ CONFIG_SHRINK_PAGECACHE=y CONFIG_MEMCG_QOS=y CONFIG_ETMEM_SCAN=m CONFIG_ETMEM_SWAP=m +CONFIG_ETMEM=y CONFIG_USERSWAP=y # CONFIG_CMA is not set CONFIG_MEM_SOFT_DIRTY=y
From: liubo liubo254@huawei.com
euleros inclusion category: feature feature: etmem bugzilla: https://gitee.com/openeuler/kernel/issues/I5DC4A
------------------------------------------------- add CONFIG_ETMEM macro definition for etmem feature.
Signed-off-by: liubo liubo254@huawei.com Reviewed-by: Miaohe Lin linmiaohe@huawei.com Reviewed-by: Kefeng Wang wangkefeng.wang@huawei.com Signed-off-by: Zheng Zengkai zhengzengkai@huawei.com --- fs/proc/base.c | 4 ++++ fs/proc/internal.h | 2 ++ fs/proc/task_mmu.c | 2 ++ include/linux/swap.h | 2 ++ mm/vmscan.c | 2 ++ 5 files changed, 12 insertions(+)
diff --git a/fs/proc/base.c b/fs/proc/base.c index b9052be86e8d..9b4666e757f0 100644 --- a/fs/proc/base.c +++ b/fs/proc/base.c @@ -3338,6 +3338,8 @@ static const struct pid_entry tgid_base_stuff[] = { REG("smaps", S_IRUGO, proc_pid_smaps_operations), REG("smaps_rollup", S_IRUGO, proc_pid_smaps_rollup_operations), REG("pagemap", S_IRUSR, proc_pagemap_operations), +#endif +#ifdef CONFIG_ETMEM REG("idle_pages", S_IRUSR|S_IWUSR, proc_mm_idle_operations), REG("swap_pages", S_IWUSR, proc_mm_swap_operations), #endif @@ -3689,6 +3691,8 @@ static const struct pid_entry tid_base_stuff[] = { REG("smaps", S_IRUGO, proc_pid_smaps_operations), REG("smaps_rollup", S_IRUGO, proc_pid_smaps_rollup_operations), REG("pagemap", S_IRUSR, proc_pagemap_operations), +#endif +#ifdef CONFIG_ETMEM REG("idle_pages", S_IRUSR|S_IWUSR, proc_mm_idle_operations), REG("swap_pages", S_IWUSR, proc_mm_swap_operations), #endif diff --git a/fs/proc/internal.h b/fs/proc/internal.h index d1fdb722f0ca..104945bbeb9f 100644 --- a/fs/proc/internal.h +++ b/fs/proc/internal.h @@ -304,8 +304,10 @@ extern const struct file_operations proc_pid_smaps_operations; extern const struct file_operations proc_pid_smaps_rollup_operations; extern const struct file_operations proc_clear_refs_operations; extern const struct file_operations proc_pagemap_operations; +#ifdef CONFIG_ETMEM extern const struct file_operations proc_mm_idle_operations; extern const struct file_operations proc_mm_swap_operations; +#endif
extern unsigned long task_vsize(struct mm_struct *); extern unsigned long task_statm(struct mm_struct *, diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c index 7b8a513d9f69..aee74a4074d4 100644 --- a/fs/proc/task_mmu.c +++ b/fs/proc/task_mmu.c @@ -1855,6 +1855,7 @@ const struct file_operations proc_pagemap_operations = { .release = pagemap_release, };
+#ifdef CONFIG_ETMEM static DEFINE_SPINLOCK(scan_lock);
static int page_scan_lock(struct file *file, int is_lock, struct file_lock *flock) @@ -2037,6 +2038,7 @@ const struct file_operations proc_mm_swap_operations = { .open = mm_swap_open, .release = mm_swap_release, }; +#endif #endif /* CONFIG_PROC_PAGE_MONITOR */
#ifdef CONFIG_NUMA diff --git a/include/linux/swap.h b/include/linux/swap.h index f2aa72ec0e57..adb294970605 100644 --- a/include/linux/swap.h +++ b/include/linux/swap.h @@ -383,9 +383,11 @@ extern int vm_swappiness; extern int remove_mapping(struct address_space *mapping, struct page *page);
extern unsigned long reclaim_pages(struct list_head *page_list); +#ifdef CONFIG_ETMEM extern int add_page_for_swap(struct page *page, struct list_head *pagelist); extern struct page *get_page_from_vaddr(struct mm_struct *mm, unsigned long vaddr); +#endif #ifdef CONFIG_NUMA extern int node_reclaim_mode; extern int sysctl_min_unmapped_ratio; diff --git a/mm/vmscan.c b/mm/vmscan.c index c504e530287b..a8f7804d15d2 100644 --- a/mm/vmscan.c +++ b/mm/vmscan.c @@ -4582,6 +4582,7 @@ void check_move_unevictable_pages(struct pagevec *pvec) } EXPORT_SYMBOL_GPL(check_move_unevictable_pages);
+#ifdef CONFIG_ETMEM int add_page_for_swap(struct page *page, struct list_head *pagelist) { int err = -EBUSY; @@ -4636,3 +4637,4 @@ struct page *get_page_from_vaddr(struct mm_struct *mm, unsigned long vaddr) return page; } EXPORT_SYMBOL_GPL(get_page_from_vaddr); +#endif
From: liubo liubo254@huawei.com
euleros inclusion category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I5DC4A CVE: NA
-------------------------------------------------
etmem, the memory vertical expansion technology, uses DRAM together with new high-performance storage media to form multi-level memory storage. By grading the stored data, etmem migrates cold data from the memory media to the high-performance storage media, expanding effective memory capacity and reducing memory cost.
When the memory expansion feature etmem is running, the kernel's native swap function needs to be disabled in certain scenarios to avoid interference from kernel swap.

This patch provides a switch for that purpose.
The /sys/kernel/mm/swap/ directory provides the kernel_swap_enable sysfs interface to enable or disable the kernel's native swap function.
The default value of /sys/kernel/mm/swap/kernel_swap_enable is true, that is, kernel swap is enabled by default.
Turn on kernel swap: echo true > /sys/kernel/mm/swap/kernel_swap_enable
Turn off kernel swap: echo false > /sys/kernel/mm/swap/kernel_swap_enable
Signed-off-by: liubo liubo254@huawei.com Reviewed-by: Miaohe Lin linmiaohe@huawei.com Reviewed-by: Kefeng Wang wangkefeng.wang@huawei.com Signed-off-by: Zheng Zengkai zhengzengkai@huawei.com --- include/linux/swap.h | 5 ++++- mm/swap_state.c | 37 +++++++++++++++++++++++++++++++++++++ mm/vmscan.c | 27 +++++++++++++++++++++++++++ 3 files changed, 68 insertions(+), 1 deletion(-)
diff --git a/include/linux/swap.h b/include/linux/swap.h index adb294970605..fbac6b2236a9 100644 --- a/include/linux/swap.h +++ b/include/linux/swap.h @@ -453,7 +453,6 @@ extern struct page *swap_cluster_readahead(swp_entry_t entry, gfp_t flag, struct vm_fault *vmf); extern struct page *swapin_readahead(swp_entry_t entry, gfp_t flag, struct vm_fault *vmf); - /* linux/mm/swapfile.c */ extern atomic_long_t nr_swap_pages; extern long total_swap_pages; @@ -730,5 +729,9 @@ static inline bool mem_cgroup_swap_full(struct page *page) } #endif
+#ifdef CONFIG_ETMEM +extern bool kernel_swap_enabled(void); +#endif + #endif /* __KERNEL__*/ #endif /* _LINUX_SWAP_H */ diff --git a/mm/swap_state.c b/mm/swap_state.c index 149f46781061..d58cbf4fe27f 100644 --- a/mm/swap_state.c +++ b/mm/swap_state.c @@ -39,6 +39,9 @@ static const struct address_space_operations swap_aops = { struct address_space *swapper_spaces[MAX_SWAPFILES] __read_mostly; static unsigned int nr_swapper_spaces[MAX_SWAPFILES] __read_mostly; static bool enable_vma_readahead __read_mostly = true; +#ifdef CONFIG_ETMEM +static bool enable_kernel_swap __read_mostly = true; +#endif
#define SWAP_RA_WIN_SHIFT (PAGE_SHIFT / 2) #define SWAP_RA_HITS_MASK ((1UL << SWAP_RA_WIN_SHIFT) - 1) @@ -349,6 +352,13 @@ static inline bool swap_use_vma_readahead(void) return READ_ONCE(enable_vma_readahead) && !atomic_read(&nr_rotate_swap); }
+#ifdef CONFIG_ETMEM +bool kernel_swap_enabled(void) +{ + return READ_ONCE(enable_kernel_swap); +} +#endif + /* * Lookup a swap entry in the swap cache. A found page will be returned * unlocked and with its refcount incremented - we rely on the kernel @@ -909,8 +919,35 @@ static struct kobj_attribute vma_ra_enabled_attr = __ATTR(vma_ra_enabled, 0644, vma_ra_enabled_show, vma_ra_enabled_store);
+#ifdef CONFIG_ETMEM +static ssize_t kernel_swap_enable_show(struct kobject *kobj, + struct kobj_attribute *attr, char *buf) +{ + return sprintf(buf, "%s\n", enable_kernel_swap ? "true" : "false"); +} +static ssize_t kernel_swap_enable_store(struct kobject *kobj, + struct kobj_attribute *attr, + const char *buf, size_t count) +{ + if (!strncmp(buf, "true", 4) || !strncmp(buf, "1", 1)) + WRITE_ONCE(enable_kernel_swap, true); + else if (!strncmp(buf, "false", 5) || !strncmp(buf, "0", 1)) + WRITE_ONCE(enable_kernel_swap, false); + else + return -EINVAL; + + return count; +} +static struct kobj_attribute kernel_swap_enable_attr = + __ATTR(kernel_swap_enable, 0644, kernel_swap_enable_show, + kernel_swap_enable_store); +#endif + static struct attribute *swap_attrs[] = { &vma_ra_enabled_attr.attr, +#ifdef CONFIG_ETMEM + &kernel_swap_enable_attr.attr, +#endif NULL, };
diff --git a/mm/vmscan.c b/mm/vmscan.c index a8f7804d15d2..1913dbf31187 100644 --- a/mm/vmscan.c +++ b/mm/vmscan.c @@ -3500,6 +3500,18 @@ static bool throttle_direct_reclaim(gfp_t gfp_mask, struct zonelist *zonelist, return false; }
+#ifdef CONFIG_ETMEM +/* + * Check if original kernel swap is enabled + * turn off kernel swap,but leave page cache reclaim on + */ +static inline void kernel_swap_check(struct scan_control *sc) +{ + if (sc != NULL && !kernel_swap_enabled()) + sc->may_swap = 0; +} +#endif + unsigned long try_to_free_pages(struct zonelist *zonelist, int order, gfp_t gfp_mask, nodemask_t *nodemask) { @@ -3516,6 +3528,9 @@ unsigned long try_to_free_pages(struct zonelist *zonelist, int order, .may_swap = 1, };
+#ifdef CONFIG_ETMEM + kernel_swap_check(&sc); +#endif /* * scan_control uses s8 fields for order, priority, and reclaim_idx. * Confirm they are large enough for max values. @@ -3912,6 +3927,10 @@ static int balance_pgdat(pg_data_t *pgdat, int order, int highest_zoneidx) sc.may_writepage = !laptop_mode && !nr_boost_reclaim; sc.may_swap = !nr_boost_reclaim;
+#ifdef CONFIG_ETMEM + kernel_swap_check(&sc); +#endif + /* * Do some background aging of the anon list, to give * pages a chance to be referenced before reclaiming. All @@ -4288,6 +4307,10 @@ unsigned long shrink_all_memory(unsigned long nr_to_reclaim) noreclaim_flag = memalloc_noreclaim_save(); set_task_reclaim_state(current, &sc.reclaim_state);
+#ifdef CONFIG_ETMEM + kernel_swap_check(&sc); +#endif + nr_reclaimed = do_try_to_free_pages(zonelist, &sc);
set_task_reclaim_state(current, NULL); @@ -4452,6 +4475,10 @@ static int __node_reclaim(struct pglist_data *pgdat, gfp_t gfp_mask, unsigned in cond_resched(); psi_memstall_enter(&pflags); fs_reclaim_acquire(sc.gfp_mask); + +#ifdef CONFIG_ETMEM + kernel_swap_check(&sc); +#endif /* * We need to be able to allocate from the reserves for RECLAIM_UNMAP * and we also need to be able to write out pages for RECLAIM_WRITE
From: liubo liubo254@huawei.com
euleros inclusion category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I5DC4A CVE: NA
-------------------------------------------------
In the current etmem (memory vertical expansion) process, memory page swapping is implemented by invoking shrink_page_list. When this interface is invoked for the first time, pages are added to the swap cache and written to disk. A swap cache page is only reclaimed when the interface is invoked a second time and no process has accessed the page in the meantime. However, in the etmem process, user mode scans for pages that have been accessed, and no migration is delivered for pages that no process accesses, so the swap cache may stay occupied indefinitely.

To solve this problem, add logic for actively reclaiming the swap cache: when the swap cache occupies a large amount of memory, the system proactively scans the LRU lists and reclaims swap cache pages to keep the usage within the specified range. A user-space sketch of the new ioctl interface follows below.
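As illustration, here is a minimal user-space sketch of driving the new ioctl interface; the pid, the watermark percentages, and the error handling are assumptions, while the ioctl numbers are copied from the definitions added to fs/proc/etmem_swap.c below.

#include <fcntl.h>
#include <stdio.h>
#include <sys/ioctl.h>
#include <unistd.h>

/* ioctl numbers as added in fs/proc/etmem_swap.c */
#define RECLAIM_SWAPCACHE_MAGIC	0x77
#define RECLAIM_SWAPCACHE_OFF	_IOW(RECLAIM_SWAPCACHE_MAGIC, 0x00, unsigned int)
#define RECLAIM_SWAPCACHE_ON	_IOW(RECLAIM_SWAPCACHE_MAGIC, 0x01, unsigned int)
#define SET_SWAPCACHE_WMARK	_IOW(RECLAIM_SWAPCACHE_MAGIC, 0x02, unsigned int)

int main(void)
{
	/* pid 1234 is illustrative; the watermarks and the reclaim
	 * kthread are global, not per process */
	int fd = open("/proc/1234/swap_pages", O_WRONLY);
	/* low watermark in bits 0-7, high in bits 8-15, as a percentage
	 * of total RAM: here low = 10%, high = 20% */
	unsigned int ratio = (20 << 8) | 10;

	if (fd < 0) {
		perror("open swap_pages");
		return 1;
	}

	if (ioctl(fd, SET_SWAPCACHE_WMARK, &ratio) < 0)
		perror("SET_SWAPCACHE_WMARK");

	/* wake the kthread; it reclaims swap cache down toward the
	 * low watermark, then goes back to sleep */
	if (ioctl(fd, RECLAIM_SWAPCACHE_ON, 0) < 0)
		perror("RECLAIM_SWAPCACHE_ON");

	close(fd);
	return 0;
}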
Signed-off-by: liubo liubo254@huawei.com Reviewed-by: Miaohe Lin linmiaohe@huawei.com Reviewed-by: Kefeng Wang wangkefeng.wang@huawei.com Signed-off-by: Zheng Zengkai zhengzengkai@huawei.com --- fs/proc/etmem_swap.c | 175 +++++++++++++++++++++++- fs/proc/task_mmu.c | 8 ++ include/linux/list.h | 17 +++ include/linux/swap.h | 35 ++++- mm/swap_state.c | 1 + mm/vmscan.c | 312 ++++++++++++++++++++++++++++++++++++++++++- 6 files changed, 542 insertions(+), 6 deletions(-)
diff --git a/fs/proc/etmem_swap.c b/fs/proc/etmem_swap.c index f9f796cfaf97..0e0a5225e301 100644 --- a/fs/proc/etmem_swap.c +++ b/fs/proc/etmem_swap.c @@ -10,6 +10,24 @@ #include <linux/mempolicy.h> #include <linux/uaccess.h> #include <linux/delay.h> +#include <linux/numa.h> +#include <linux/freezer.h> +#include <linux/kthread.h> +#include <linux/mm_inline.h> + +#define RECLAIM_SWAPCACHE_MAGIC 0X77 +#define SET_SWAPCACHE_WMARK _IOW(RECLAIM_SWAPCACHE_MAGIC, 0x02, unsigned int) +#define RECLAIM_SWAPCACHE_ON _IOW(RECLAIM_SWAPCACHE_MAGIC, 0x01, unsigned int) +#define RECLAIM_SWAPCACHE_OFF _IOW(RECLAIM_SWAPCACHE_MAGIC, 0x00, unsigned int) + +#define WATERMARK_MAX 100 +#define SWAP_SCAN_NUM_MAX 32 + +static struct task_struct *reclaim_swapcache_tk; +static bool enable_swapcache_reclaim; +static unsigned long swapcache_watermark[ETMEM_SWAPCACHE_NR_WMARK]; + +static DECLARE_WAIT_QUEUE_HEAD(reclaim_queue);
static ssize_t swap_pages_write(struct file *file, const char __user *buf, size_t count, loff_t *ppos) @@ -45,7 +63,7 @@ static ssize_t swap_pages_write(struct file *file, const char __user *buf, ret = kstrtoul(p, 16, &vaddr); if (ret != 0) continue; - /*If get page struct failed, ignore it, get next page*/ + /* If get page struct failed, ignore it, get next page */ page = get_page_from_vaddr(mm, vaddr); if (!page) continue; @@ -78,9 +96,153 @@ static int swap_pages_release(struct inode *inode, struct file *file) return 0; }
+/* check if swapcache meet requirements */ +static bool swapcache_balanced(void) +{ + return total_swapcache_pages() < swapcache_watermark[ETMEM_SWAPCACHE_WMARK_HIGH]; +} + +/* the flag present if swapcache reclaim is started */ +static bool swapcache_reclaim_enabled(void) +{ + return READ_ONCE(enable_swapcache_reclaim); +} + +static void start_swapcache_reclaim(void) +{ + if (swapcache_balanced()) + return; + /* RECLAIM_SWAPCACHE_ON trigger the thread to start running. */ + if (!waitqueue_active(&reclaim_queue)) + return; + + WRITE_ONCE(enable_swapcache_reclaim, true); + wake_up_interruptible(&reclaim_queue); +} + +static void stop_swapcache_reclaim(void) +{ + WRITE_ONCE(enable_swapcache_reclaim, false); +} + +static bool should_goto_sleep(void) +{ + if (swapcache_balanced()) + stop_swapcache_reclaim(); + + if (swapcache_reclaim_enabled()) + return false; + + return true; +} + +static int get_swapcache_watermark(unsigned int ratio) +{ + unsigned int low_watermark; + unsigned int high_watermark; + + low_watermark = ratio & 0xFF; + high_watermark = (ratio >> 8) & 0xFF; + if (low_watermark > WATERMARK_MAX || + high_watermark > WATERMARK_MAX || + low_watermark > high_watermark) + return -EPERM; + + swapcache_watermark[ETMEM_SWAPCACHE_WMARK_LOW] = totalram_pages() * + low_watermark / WATERMARK_MAX; + swapcache_watermark[ETMEM_SWAPCACHE_WMARK_HIGH] = totalram_pages() * + high_watermark / WATERMARK_MAX; + + return 0; +}
extern struct file_operations proc_swap_pages_operations;
+static void reclaim_swapcache_try_to_sleep(void) +{ + DEFINE_WAIT(wait); + + if (freezing(current) || kthread_should_stop()) + return; + + prepare_to_wait(&reclaim_queue, &wait, TASK_INTERRUPTIBLE); + if (should_goto_sleep()) { + if (!kthread_should_stop()) + schedule(); + } + finish_wait(&reclaim_queue, &wait); +} + +static void etmem_reclaim_swapcache(void) +{ + do_swapcache_reclaim(swapcache_watermark, + ARRAY_SIZE(swapcache_watermark)); + stop_swapcache_reclaim(); +} + +static int reclaim_swapcache_proactive(void *para) +{ + set_freezable(); + + while (1) { + bool ret; + + reclaim_swapcache_try_to_sleep(); + ret = try_to_freeze(); + if (kthread_should_stop()) + break; + + if (ret) + continue; + + etmem_reclaim_swapcache(); + } + + return 0; +} + +static int reclaim_swapcache_run(void) +{ + int ret = 0; + + reclaim_swapcache_tk = kthread_run(reclaim_swapcache_proactive, NULL, + "etmem_recalim_swapcache"); + if (IS_ERR(reclaim_swapcache_tk)) { + ret = PTR_ERR(reclaim_swapcache_tk); + reclaim_swapcache_tk = NULL; + } + return ret; +} + +static long swap_page_ioctl(struct file *filp, unsigned int cmd, + unsigned long arg) +{ + void __user *argp = (void __user *)arg; + unsigned int ratio; + + switch (cmd) { + case RECLAIM_SWAPCACHE_ON: + if (swapcache_reclaim_enabled()) + return 0; + start_swapcache_reclaim(); + break; + case RECLAIM_SWAPCACHE_OFF: + stop_swapcache_reclaim(); + break; + case SET_SWAPCACHE_WMARK: + if (get_user(ratio, (unsigned int __user *)argp)) + return -EFAULT; + + if (get_swapcache_watermark(ratio) != 0) + return -EFAULT; + break; + default: + return -EPERM; + } + + return 0; +} + static int swap_pages_entry(void) { proc_swap_pages_operations.flock(NULL, 1, NULL); @@ -88,8 +250,12 @@ static int swap_pages_entry(void) proc_swap_pages_operations.write = swap_pages_write; proc_swap_pages_operations.open = swap_pages_open; proc_swap_pages_operations.release = swap_pages_release; + proc_swap_pages_operations.unlocked_ioctl = swap_page_ioctl; proc_swap_pages_operations.flock(NULL, 0, NULL);
+ enable_swapcache_reclaim = false; + reclaim_swapcache_run(); + return 0; }
@@ -100,7 +266,14 @@ static void swap_pages_exit(void) proc_swap_pages_operations.write = NULL; proc_swap_pages_operations.open = NULL; proc_swap_pages_operations.release = NULL; + proc_swap_pages_operations.unlocked_ioctl = NULL; proc_swap_pages_operations.flock(NULL, 0, NULL); + + if (!IS_ERR(reclaim_swapcache_tk)) { + kthread_stop(reclaim_swapcache_tk); + reclaim_swapcache_tk = NULL; + } + return; }
MODULE_LICENSE("GPL"); diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c index aee74a4074d4..c61a3fbbfd71 100644 --- a/fs/proc/task_mmu.c +++ b/fs/proc/task_mmu.c @@ -2032,11 +2032,19 @@ static int mm_swap_release(struct inode *inode, struct file *file) return ret; }
+static long mm_swap_ioctl(struct file *filp, unsigned int cmd, unsigned long arg) +{ + if (proc_swap_pages_operations.unlocked_ioctl) + return proc_swap_pages_operations.unlocked_ioctl(filp, cmd, arg); + return 0; +} + const struct file_operations proc_mm_swap_operations = { .llseek = mem_lseek, .write = mm_swap_write, .open = mm_swap_open, .release = mm_swap_release, + .unlocked_ioctl = mm_swap_ioctl, }; #endif #endif /* CONFIG_PROC_PAGE_MONITOR */ diff --git a/include/linux/list.h b/include/linux/list.h index a18c87b63376..fa9f691f2553 100644 --- a/include/linux/list.h +++ b/include/linux/list.h @@ -764,6 +764,23 @@ static inline void list_splice_tail_init(struct list_head *list, !list_entry_is_head(pos, head, member); \ pos = n, n = list_prev_entry(n, member))
+/** + * list_for_each_entry_safe_reverse_from - iterate backwards over list from + * current point safe against removal + * @pos: the type * to use as a loop cursor. + * @n: another type * to use as temporary storage + * @head: the head for your list. + * @member: the name of the list_head within the struct. + * + * Iterate backwards over list of given type from current point, safe against + * removal of list entry. + */ +#define list_for_each_entry_safe_reverse_from(pos, n, head, member) \ + for (n = list_prev_entry(pos, member); \ + !list_entry_is_head(pos, head, member); \ + pos = n, n = list_prev_entry(n, member)) + + /** * list_safe_reset_next - reset a stale list_for_each_entry_safe loop * @pos: the loop cursor used in the list_for_each_entry_safe loop diff --git a/include/linux/swap.h b/include/linux/swap.h index fbac6b2236a9..cebabb6db07c 100644 --- a/include/linux/swap.h +++ b/include/linux/swap.h @@ -384,9 +384,40 @@ extern int remove_mapping(struct address_space *mapping, struct page *page);
extern unsigned long reclaim_pages(struct list_head *page_list); #ifdef CONFIG_ETMEM +enum etmem_swapcache_watermark_en { + ETMEM_SWAPCACHE_WMARK_LOW, + ETMEM_SWAPCACHE_WMARK_HIGH, + ETMEM_SWAPCACHE_NR_WMARK +}; + extern int add_page_for_swap(struct page *page, struct list_head *pagelist); extern struct page *get_page_from_vaddr(struct mm_struct *mm, unsigned long vaddr); +extern int do_swapcache_reclaim(unsigned long *swapcache_watermark, + unsigned int watermark_nr); +extern bool kernel_swap_enabled(void); +#else +static inline int add_page_for_swap(struct page *page, struct list_head *pagelist) +{ + return 0; +} + +static inline struct page *get_page_from_vaddr(struct mm_struct *mm, + unsigned long vaddr) +{ + return NULL; +} + +static inline int do_swapcache_reclaim(unsigned long *swapcache_watermark, + unsigned int watermark_nr) +{ + return 0; +} + +static inline bool kernel_swap_enabled(void) +{ + return true; +} #endif #ifdef CONFIG_NUMA extern int node_reclaim_mode; @@ -729,9 +760,5 @@ static inline bool mem_cgroup_swap_full(struct page *page) } #endif
-#ifdef CONFIG_ETMEM -extern bool kernel_swap_enabled(void); -#endif - #endif /* __KERNEL__*/ #endif /* _LINUX_SWAP_H */ diff --git a/mm/swap_state.c b/mm/swap_state.c index d58cbf4fe27f..69d71c4be7b8 100644 --- a/mm/swap_state.c +++ b/mm/swap_state.c @@ -96,6 +96,7 @@ unsigned long total_swapcache_pages(void) } return ret; } +EXPORT_SYMBOL_GPL(total_swapcache_pages);
static atomic_t swapin_readahead_hits = ATOMIC_INIT(4);
diff --git a/mm/vmscan.c b/mm/vmscan.c index 1913dbf31187..1b333a4d247a 100644 --- a/mm/vmscan.c +++ b/mm/vmscan.c @@ -4615,7 +4615,7 @@ int add_page_for_swap(struct page *page, struct list_head *pagelist) int err = -EBUSY; struct page *head;
- /*If the page is mapped by more than one process, do not swap it */ + /* If the page is mapped by more than one process, do not swap it */ if (page_mapcount(page) > 1) return -EACCES;
@@ -4664,4 +4664,314 @@ struct page *get_page_from_vaddr(struct mm_struct *mm, unsigned long vaddr) return page; } EXPORT_SYMBOL_GPL(get_page_from_vaddr); + +static int add_page_for_reclaim_swapcache(struct page *page, + struct list_head *pagelist, struct lruvec *lruvec, enum lru_list lru) +{ + struct page *head; + + /* If the page is mapped by more than one process, do not swap it */ + if (page_mapcount(page) > 1) + return -EACCES; + + if (PageHuge(page)) + return -EACCES; + + head = compound_head(page); + + switch (__isolate_lru_page_prepare(head, 0)) { + case 0: + if (unlikely(!get_page_unless_zero(page))) + return -1; + + if (!TestClearPageLRU(page)) { + /* + * This page may in other isolation path, + * but we still hold lru_lock. + */ + put_page(page); + return -1; + } + + list_move(&head->lru, pagelist); + update_lru_size(lruvec, lru, page_zonenum(head), -thp_nr_pages(head)); + break; + + case -EBUSY: + return -1; + default: + break; + } + + return 0; +} + +static unsigned long reclaim_swapcache_pages_from_list(int nid, + struct list_head *page_list, unsigned long reclaim_num, bool putback_flag) +{ + struct scan_control sc = { + .may_unmap = 1, + .may_swap = 1, + .may_writepage = 1, + .gfp_mask = GFP_KERNEL, + }; + unsigned long nr_reclaimed = 0; + unsigned long nr_moved = 0; + struct page *page, *next; + LIST_HEAD(swap_pages); + struct pglist_data *pgdat = NULL; + struct reclaim_stat stat; + + pgdat = NODE_DATA(nid); + + if (putback_flag) + goto putback_list; + + if (reclaim_num == 0) + return 0; + + list_for_each_entry_safe(page, next, page_list, lru) { + if (!page_is_file_lru(page) && !__PageMovable(page) + && PageSwapCache(page)) { + ClearPageActive(page); + list_move(&page->lru, &swap_pages); + nr_moved++; + } + + if (nr_moved >= reclaim_num) + break; + } + + /* swap the pages */ + if (pgdat) + nr_reclaimed = shrink_page_list(&swap_pages, + pgdat, + &sc, + &stat, true); + + while (!list_empty(&swap_pages)) { + page = lru_to_page(&swap_pages); + list_del(&page->lru); + putback_lru_page(page); + } + + return nr_reclaimed; + +putback_list: + while (!list_empty(page_list)) { + page = lru_to_page(page_list); + list_del(&page->lru); + putback_lru_page(page); + } + + return nr_reclaimed; +} + +#define SWAP_SCAN_NUM_MAX 32 + +static bool swapcache_below_watermark(unsigned long *swapcache_watermark) +{ + return total_swapcache_pages() < swapcache_watermark[ETMEM_SWAPCACHE_WMARK_LOW]; +} + +static unsigned long get_swapcache_reclaim_num(unsigned long *swapcache_watermark) +{ + return total_swapcache_pages() > + swapcache_watermark[ETMEM_SWAPCACHE_WMARK_LOW] ? + (total_swapcache_pages() - swapcache_watermark[ETMEM_SWAPCACHE_WMARK_LOW]) : 0; +} + +/* + * The main function to reclaim swapcache, the whole reclaim process is + * divided into 3 steps. + * 1. get the total_swapcache_pages num to reclaim. + * 2. scan the LRU linked list of each memory node to obtain the + * swapcache pages that can be reclaimd. + * 3. reclaim the swapcache page until the requirements are meet. 
+ */ +int do_swapcache_reclaim(unsigned long *swapcache_watermark, + unsigned int watermark_nr) +{ + int err = -EINVAL; + unsigned long swapcache_to_reclaim = 0; + unsigned long nr_reclaimed = 0; + unsigned long swapcache_total_reclaimable = 0; + unsigned long reclaim_page_count = 0; + + unsigned long *nr = NULL; + unsigned long *nr_to_reclaim = NULL; + struct list_head *swapcache_list = NULL; + + int nid = 0; + struct lruvec *lruvec = NULL; + struct list_head *src = NULL; + struct page *page = NULL; + struct page *next = NULL; + struct page *pos = NULL; + + struct mem_cgroup *memcg = NULL; + struct mem_cgroup *target_memcg = NULL; + + pg_data_t *pgdat = NULL; + unsigned int scan_count = 0; + int nid_num = 0; + + if (swapcache_watermark == NULL || + watermark_nr < ETMEM_SWAPCACHE_NR_WMARK) + return err; + + /* get the total_swapcache_pages num to reclaim. */ + swapcache_to_reclaim = get_swapcache_reclaim_num(swapcache_watermark); + if (swapcache_to_reclaim <= 0) + return err; + + nr = kcalloc(MAX_NUMNODES, sizeof(unsigned long), GFP_KERNEL); + if (nr == NULL) + return -ENOMEM; + + nr_to_reclaim = kcalloc(MAX_NUMNODES, sizeof(unsigned long), GFP_KERNEL); + if (nr_to_reclaim == NULL) { + kfree(nr); + return -ENOMEM; + } + + swapcache_list = kcalloc(MAX_NUMNODES, sizeof(struct list_head), GFP_KERNEL); + if (swapcache_list == NULL) { + kfree(nr); + kfree(nr_to_reclaim); + return -ENOMEM; + } + + /* + * scan the LRU linked list of each memory node to obtain the + * swapcache pages that can be reclaimd. + */ + for_each_node_state(nid, N_MEMORY) { + INIT_LIST_HEAD(&swapcache_list[nid_num]); + cond_resched(); + + pgdat = NODE_DATA(nid); + + memcg = mem_cgroup_iter(target_memcg, NULL, NULL); + do { + cond_resched(); + pos = NULL; + lruvec = mem_cgroup_lruvec(memcg, pgdat); + src = &(lruvec->lists[LRU_INACTIVE_ANON]); + spin_lock_irq(&lruvec->lru_lock); + scan_count = 0; + + /* + * Scan the swapcache pages that are not mapped from + * the end of the LRU linked list, scan SWAP_SCAN_NUM_MAX + * pages each time, and record the scan end point page. + */ + + pos = list_last_entry(src, struct page, lru); + spin_unlock_irq(&lruvec->lru_lock); +do_scan: + cond_resched(); + scan_count = 0; + spin_lock_irq(&lruvec->lru_lock); + + /* + * check if pos page is been released or not in LRU list, if true, + * cancel the subsequent page scanning of the current node. + */ + if (!pos || list_entry_is_head(pos, src, lru)) { + spin_unlock_irq(&lruvec->lru_lock); + continue; + } + + if (!PageLRU(pos) || page_lru(pos) != LRU_INACTIVE_ANON) { + spin_unlock_irq(&lruvec->lru_lock); + continue; + } + + page = pos; + pos = NULL; + /* Continue to scan down from the last scan breakpoint */ + list_for_each_entry_safe_reverse_from(page, next, src, lru) { + scan_count++; + pos = next; + if (scan_count >= SWAP_SCAN_NUM_MAX) + break; + + if (!PageSwapCache(page)) + continue; + + if (page_mapped(page)) + continue; + + if (add_page_for_reclaim_swapcache(page, + &swapcache_list[nid_num], + lruvec, LRU_INACTIVE_ANON) != 0) + continue; + + nr[nid_num]++; + swapcache_total_reclaimable++; + } + spin_unlock_irq(&lruvec->lru_lock); + + /* + * Check whether the scanned pages meet + * the reclaim requirements. + */ + if (swapcache_total_reclaimable <= swapcache_to_reclaim || + scan_count >= SWAP_SCAN_NUM_MAX) + goto do_scan; + + } while ((memcg = mem_cgroup_iter(target_memcg, memcg, NULL))); + + /* Start reclaiming the next memory node. */ + nid_num++; + } + + /* reclaim the swapcache page until the requirements are meet. 
*/ + do { + nid_num = 0; + reclaim_page_count = 0; + + /* start swapcache page reclaim for each node. */ + for_each_node_state(nid, N_MEMORY) { + cond_resched(); + + nr_to_reclaim[nid_num] = (swapcache_to_reclaim / + (swapcache_total_reclaimable / nr[nid_num])); + reclaim_page_count += reclaim_swapcache_pages_from_list(nid, + &swapcache_list[nid_num], + nr_to_reclaim[nid_num], false); + nid_num++; + } + + nr_reclaimed += reclaim_page_count; + + /* + * Check whether the swapcache page reaches the reclaim requirement or + * the number of the swapcache page reclaimd is 0. Stop reclaim. + */ + if (nr_reclaimed >= swapcache_to_reclaim || reclaim_page_count == 0) + goto exit; + } while (!swapcache_below_watermark(swapcache_watermark) || + nr_reclaimed < swapcache_to_reclaim); +exit: + nid_num = 0; + /* + * Repopulate the swapcache pages that are not reclaimd back + * to the LRU linked list. + */ + for_each_node_state(nid, N_MEMORY) { + cond_resched(); + reclaim_swapcache_pages_from_list(nid, + &swapcache_list[nid_num], 0, true); + nid_num++; + } + + kfree(nr); + kfree(nr_to_reclaim); + kfree(swapcache_list); + + return 0; +} +EXPORT_SYMBOL_GPL(do_swapcache_reclaim); #endif
From: liubo liubo254@huawei.com
euleros inclusion category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I5DC4A CVE: NA
-------------------------------------------------
By default, the existing memory expansion tool etmem (the memory vertical expansion technology) swaps out every page of the process that can be swapped out, unless the page is marked with the lock flag.

This patch adds the ability to swap out only specified pages: the process marks the VMAs to be swapped out with the VM_SWAPFLAG flag, and etmem adds a filter to the scanning module so that only those pages are swapped out. A usage sketch follows below.
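A rough usage sketch (the mapping size and the use of anonymous memory are arbitrary; MADV_SWAPFLAG/MADV_SWAPFLAG_REMOVE are the values added by this patch and are not in libc headers, so the call fails with EINVAL on an unpatched kernel):

#include <stdio.h>
#include <string.h>
#include <sys/mman.h>

/* advice values as added in include/uapi/asm-generic/mman-common.h */
#define MADV_SWAPFLAG		203	/* mark VMA for etmem swap-out */
#define MADV_SWAPFLAG_REMOVE	204

int main(void)
{
	size_t len = 16 * 4096;
	char *buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
			 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

	if (buf == MAP_FAILED) {
		perror("mmap");
		return 1;
	}
	memset(buf, 0x5a, len);

	/* tag this VMA with VM_SWAPFLAG; an etmem scan opened with
	 * VMA_SCAN_ADD_FLAGS will then consider only tagged VMAs */
	if (madvise(buf, len, MADV_SWAPFLAG) < 0)
		perror("madvise(MADV_SWAPFLAG)");

	/* later, drop the tag again */
	if (madvise(buf, len, MADV_SWAPFLAG_REMOVE) < 0)
		perror("madvise(MADV_SWAPFLAG_REMOVE)");

	munmap(buf, len);
	return 0;
}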
Signed-off-by: liubo liubo254@huawei.com Reviewed-by: Miaohe Lin linmiaohe@huawei.com Reviewed-by: Kefeng Wang wangkefeng.wang@huawei.com Signed-off-by: Zheng Zengkai zhengzengkai@huawei.com --- fs/proc/etmem_scan.c | 11 +++++++++++ fs/proc/etmem_scan.h | 5 ++++- fs/proc/etmem_swap.c | 1 + include/linux/mm.h | 4 ++++ include/uapi/asm-generic/mman-common.h | 3 +++ mm/madvise.c | 17 ++++++++++++++++- 6 files changed, 39 insertions(+), 2 deletions(-)
diff --git a/fs/proc/etmem_scan.c b/fs/proc/etmem_scan.c index 8bcb8d3af7c5..e9cad598189a 100644 --- a/fs/proc/etmem_scan.c +++ b/fs/proc/etmem_scan.c @@ -1187,6 +1187,11 @@ static int mm_idle_test_walk(unsigned long start, unsigned long end, struct mm_walk *walk) { struct vm_area_struct *vma = walk->vma; + struct page_idle_ctrl *pic = walk->private; + + /* If the specified page swapout is set, the untagged vma is skipped. */ + if ((pic->flags & VMA_SCAN_FLAG) && !(vma->vm_flags & VM_SWAPFLAG)) + return 1;
if (vma->vm_file) { if (is_vm_hugetlb_page(vma)) @@ -1325,6 +1330,12 @@ static long page_scan_ioctl(struct file *filp, unsigned int cmd, unsigned long a case IDLE_SCAN_REMOVE_FLAGS: filp->f_flags &= ~flags; break; + case VMA_SCAN_ADD_FLAGS: + filp->f_flags |= flags; + break; + case VMA_SCAN_REMOVE_FLAGS: + filp->f_flags &= ~flags; + break; default: return -EOPNOTSUPP; } diff --git a/fs/proc/etmem_scan.h b/fs/proc/etmem_scan.h index 93a6e33f2025..e109f7f350e1 100644 --- a/fs/proc/etmem_scan.h +++ b/fs/proc/etmem_scan.h @@ -12,13 +12,16 @@ #define SCAN_AS_HUGE 0100000000 /* treat normal page as hugepage in vm */ #define SCAN_IGN_HOST 0200000000 /* ignore host access when scan vm */ #define VM_SCAN_HOST 0400000000 /* scan and add host page for vm hole(internal) */ +#define VMA_SCAN_FLAG 0x1000 /* scan the specifics vma with flag */
#define ALL_SCAN_FLAGS (SCAN_HUGE_PAGE | SCAN_SKIM_IDLE | SCAN_DIRTY_PAGE | \ - SCAN_AS_HUGE | SCAN_IGN_HOST | VM_SCAN_HOST) + SCAN_AS_HUGE | SCAN_IGN_HOST | VM_SCAN_HOST | VMA_SCAN_FLAG)
#define IDLE_SCAN_MAGIC 0x66 #define IDLE_SCAN_ADD_FLAGS _IOW(IDLE_SCAN_MAGIC, 0x0, unsigned int) #define IDLE_SCAN_REMOVE_FLAGS _IOW(IDLE_SCAN_MAGIC, 0x1, unsigned int) +#define VMA_SCAN_ADD_FLAGS _IOW(IDLE_SCAN_MAGIC, 0x2, unsigned int) +#define VMA_SCAN_REMOVE_FLAGS _IOW(IDLE_SCAN_MAGIC, 0x3, unsigned int)
enum ProcIdlePageType { PTE_ACCESSED, /* 4k page */ diff --git a/fs/proc/etmem_swap.c b/fs/proc/etmem_swap.c index 0e0a5225e301..86f5cf8c90a1 100644 --- a/fs/proc/etmem_swap.c +++ b/fs/proc/etmem_swap.c @@ -63,6 +63,7 @@ static ssize_t swap_pages_write(struct file *file, const char __user *buf, ret = kstrtoul(p, 16, &vaddr); if (ret != 0) continue; + /* If get page struct failed, ignore it, get next page */ page = get_page_from_vaddr(mm, vaddr); if (!page) diff --git a/include/linux/mm.h b/include/linux/mm.h index eb27c2eacb4a..0331a8571cfd 100644 --- a/include/linux/mm.h +++ b/include/linux/mm.h @@ -325,6 +325,10 @@ extern unsigned int kobjsize(const void *objp);
#define VM_CHECKNODE 0x200000000
+#ifdef CONFIG_ETMEM +#define VM_SWAPFLAG 0x400000000000000 /* memory swap out flag in vma */ +#endif + #ifdef CONFIG_USERSWAP /* bit[32:36] is the protection key of intel, so use a large value for VM_USWAP */ #define VM_USWAP 0x2000000000000000 diff --git a/include/uapi/asm-generic/mman-common.h b/include/uapi/asm-generic/mman-common.h index e75b65364dce..898ea134b2f3 100644 --- a/include/uapi/asm-generic/mman-common.h +++ b/include/uapi/asm-generic/mman-common.h @@ -74,6 +74,9 @@ #define MADV_COLD 20 /* deactivate these pages */ #define MADV_PAGEOUT 21 /* reclaim these pages */
+#define MADV_SWAPFLAG 203 /* for memory to be swap out */ +#define MADV_SWAPFLAG_REMOVE 204 + /* compatibility flags */ #define MAP_FILE 0
diff --git a/mm/madvise.c b/mm/madvise.c index 77e1dc2d4e18..bd851f2c687f 100644 --- a/mm/madvise.c +++ b/mm/madvise.c @@ -126,6 +126,14 @@ static long madvise_behavior(struct vm_area_struct *vma, if (error) goto out_convert_errno; break; +#ifdef CONFIG_ETMEM + case MADV_SWAPFLAG: + new_flags |= VM_SWAPFLAG; + break; + case MADV_SWAPFLAG_REMOVE: + new_flags &= ~VM_SWAPFLAG; + break; +#endif }
if (new_flags == vma->vm_flags) { @@ -974,9 +982,12 @@ madvise_behavior_valid(int behavior) #ifdef CONFIG_MEMORY_FAILURE case MADV_SOFT_OFFLINE: case MADV_HWPOISON: +#endif +#ifdef CONFIG_ETMEM + case MADV_SWAPFLAG: + case MADV_SWAPFLAG_REMOVE: #endif return true; - default: return false; } @@ -1046,6 +1057,10 @@ process_madvise_behavior_valid(int behavior) * easily if memory pressure hanppens. * MADV_PAGEOUT - the application is not expected to use this memory soon, * page out the pages in this range immediately. + * MADV_SWAPFLAG - Used in the etmem memory extension feature, the process + * specifies the memory swap area by adding a flag to a specific + * vma address. + * MADV_SWAPFLAG_REMOVE - remove the specific vma flag * * return values: * zero - success
From: Hyunwoo Kim imv4bel@gmail.com
mainline inclusion from mainline-v6.0-rc5 commit 9cb636b5f6a8cc6d1b50809ec8f8d33ae0c84c95 category: bugfix bugzilla: https://gitee.com/src-openeuler/kernel/issues/I5QI0W CVE: CVE-2022-40307
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/fs...
---------------------------
A race condition may occur if the user calls close() from another thread during a write() operation on the efi capsule device node.
The race occurs between the efi_capsule_write() and efi_capsule_flush() functions of efi_capsule_fops and ultimately results in a use-after-free (UAF).
So, the page freeing is moved from efi_capsule_flush() to efi_capsule_release(), which only runs once the last reference to the file is dropped. The sketch below illustrates the racy call pattern.
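A minimal two-thread sketch of the pattern described above (purely illustrative and hypothetical; it does not deterministically trigger the bug):

#include <fcntl.h>
#include <pthread.h>
#include <string.h>
#include <unistd.h>

static int fd;

static void *closer(void *arg)
{
	/* close() runs efi_capsule_flush() and, before this fix, could
	 * free the upload pages while write() was still using them */
	close(fd);
	return NULL;
}

int main(void)
{
	char buf[4096];
	pthread_t t;

	memset(buf, 0, sizeof(buf));
	fd = open("/dev/efi_capsule_loader", O_WRONLY);
	if (fd < 0)
		return 1;

	pthread_create(&t, NULL, closer, NULL);
	write(fd, buf, sizeof(buf));	/* races with close() in the other thread */
	pthread_join(t, NULL);
	return 0;
}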
Cc: stable@vger.kernel.org # v4.9+ Signed-off-by: Hyunwoo Kim imv4bel@gmail.com Link: https://lore.kernel.org/all/20220907102920.GA88602@ubuntu/ Signed-off-by: Ard Biesheuvel ardb@kernel.org Signed-off-by: Xia Longlong xialonglong1@huawei.com Reviewed-by: Kefeng Wang wangkefeng.wang@huawei.com Signed-off-by: Zheng Zengkai zhengzengkai@huawei.com --- drivers/firmware/efi/capsule-loader.c | 31 ++++++--------------------- 1 file changed, 7 insertions(+), 24 deletions(-)
diff --git a/drivers/firmware/efi/capsule-loader.c b/drivers/firmware/efi/capsule-loader.c index 4dde8edd53b6..3e8d4b51a814 100644 --- a/drivers/firmware/efi/capsule-loader.c +++ b/drivers/firmware/efi/capsule-loader.c @@ -242,29 +242,6 @@ static ssize_t efi_capsule_write(struct file *file, const char __user *buff, return ret; }
-/** - * efi_capsule_flush - called by file close or file flush - * @file: file pointer - * @id: not used - * - * If a capsule is being partially uploaded then calling this function - * will be treated as upload termination and will free those completed - * buffer pages and -ECANCELED will be returned. - **/ -static int efi_capsule_flush(struct file *file, fl_owner_t id) -{ - int ret = 0; - struct capsule_info *cap_info = file->private_data; - - if (cap_info->index > 0) { - pr_err("capsule upload not complete\n"); - efi_free_all_buff_pages(cap_info); - ret = -ECANCELED; - } - - return ret; -} - /** * efi_capsule_release - called by file close * @inode: not used @@ -277,6 +254,13 @@ static int efi_capsule_release(struct inode *inode, struct file *file) { struct capsule_info *cap_info = file->private_data;
+ if (cap_info->index > 0 && + (cap_info->header.headersize == 0 || + cap_info->count < cap_info->total_size)) { + pr_err("capsule upload not complete\n"); + efi_free_all_buff_pages(cap_info); + } + kfree(cap_info->pages); kfree(cap_info->phys); kfree(file->private_data); @@ -324,7 +308,6 @@ static const struct file_operations efi_capsule_fops = { .owner = THIS_MODULE, .open = efi_capsule_open, .write = efi_capsule_write, - .flush = efi_capsule_flush, .release = efi_capsule_release, .llseek = no_llseek, };