Per the Downstream Port Containment Related Enhancements ECN [1],
Table 4-6 ("Interpretation of _OSC Control Field Returned Value"),
for bit 7 of the _OSC control return value:
"Firmware sets this bit to 1 to grant the OS control over PCI Express
Downstream Port Containment configuration."
"If control of this feature was requested and denied,
or was not requested, the firmware returns this bit set to 0."
We store bit 7 of the _OSC control return value in host->native_dpc and
check it before enabling the DPC service, since the firmware may not
grant the control.
[1] Downstream Port Containment Related Enhancements ECN,
Jan 28, 2019, affecting PCI Firmware Specification, Rev. 3.2
https://members.pcisig.com/wg/PCI-SIG/document/12888
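For illustration, here is a minimal stand-alone sketch of the bit-7 check
(the OSC_PCI_EXPRESS_DPC_CONTROL name and value are assumptions made for
this sketch, not quoted from the spec or the kernel headers):

#include <stdbool.h>
#include <stdint.h>

/* Bit 7 of the _OSC control return value (assumed mask for this sketch) */
#define OSC_PCI_EXPRESS_DPC_CONTROL	(1U << 7)

static bool host_native_dpc(uint32_t osc_control_ret)
{
	/*
	 * Firmware sets bit 7 only when OS control over DPC was requested
	 * and granted; if denied or never requested, the bit stays 0.
	 */
	return osc_control_ret & OSC_PCI_EXPRESS_DPC_CONTROL;
}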
Signed-off-by: Yicong Yang <yangyicong(a)hisilicon.com>
---
Changes since v1:
- use correct reference for _OSC control return value
drivers/pci/pcie/portdrv_core.c | 3 ++-
1 file changed, 2 insertions(+), 1 deletion(-)
diff --git a/drivers/pci/pcie/portdrv_core.c b/drivers/pci/pcie/portdrv_core.c
index e1fed664..7445d03 100644
--- a/drivers/pci/pcie/portdrv_core.c
+++ b/drivers/pci/pcie/portdrv_core.c
@@ -253,7 +253,8 @@ static int get_port_device_capability(struct pci_dev *dev)
*/
if (pci_find_ext_capability(dev, PCI_EXT_CAP_ID_DPC) &&
pci_aer_available() &&
- (pcie_ports_dpc_native || (services & PCIE_PORT_SERVICE_AER)))
+ (pcie_ports_dpc_native ||
+ ((services & PCIE_PORT_SERVICE_AER) && host->native_dpc)))
services |= PCIE_PORT_SERVICE_DPC;
if (pci_pcie_type(dev) == PCI_EXP_TYPE_DOWNSTREAM ||
--
2.8.1
On an ARM64 system with an SMMUv3 implementation that fully supports the
Broadcast TLB Maintenance (BTM) feature as part of the Distributed
Virtual Memory (DVM) protocol, the CPU TLB invalidate instructions are
received by the SMMUv3. This is very useful when the SMMUv3 shares the
page tables with the CPU (e.g. the guest SVA use case). For this to work,
the SMMU must use the same VMID that is allocated by KVM to configure
the stage 2 translations. At present, KVM VMID allocations are recycled
on rollover and may change as a result. This creates issues if we
have to share the KVM VMID with the SMMU.
Please see the discussion here,
https://lore.kernel.org/linux-iommu/20200522101755.GA3453945@myrica/
This series proposes a way to share the VMID between KVM and the IOMMU
driver by:
1. Splitting the KVM VMID space into two equal halves based on the
   command line option "kvm-arm.pinned_vmid_enable".
2. The first half of the VMID space follows the normal recycle-on-rollover
   policy.
3. The second half of the VMID space doesn't roll over and is used to
   allocate pinned VMIDs (see the sketch after this list).
4. Providing a helper function to retrieve the KVM instance associated
   with a device (if it is part of a vfio group).
5. Introducing generic interfaces to get/put pinned KVM VMIDs.
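For illustration, a minimal stand-alone sketch of the split (the 16-bit
VMID width, the constants and the helper names are assumptions for this
sketch, not the proposed kernel interfaces):

#include <stdbool.h>
#include <stdint.h>

#define VMID_BITS		16
#define VMID_FIRST_PINNED	(1U << (VMID_BITS - 1))	/* upper half starts here */

static uint16_t next_rollover_vmid;		/* lower half, wraps on rollover */
static bool pinned_in_use[VMID_FIRST_PINNED];	/* upper half, never recycled */

static uint16_t alloc_rollover_vmid(void)
{
	/* Lower half: values may be recycled when the generation rolls over */
	next_rollover_vmid = (next_rollover_vmid + 1) % VMID_FIRST_PINNED;
	return next_rollover_vmid;
}

static int alloc_pinned_vmid(uint16_t *vmid)
{
	/* Upper half: once handed out, a VMID stays stable until released */
	for (uint32_t i = 0; i < VMID_FIRST_PINNED; i++) {
		if (!pinned_in_use[i]) {
			pinned_in_use[i] = true;
			*vmid = VMID_FIRST_PINNED + i;
			return 0;
		}
	}
	return -1;	/* pinned space exhausted */
}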
Open Items:
1. I couldn't figure out a way to determine whether a platform actually
   fully supports DVM/BTM or not. I'm not sure we can decide based on the
   SMMUv3 BTM feature bit alone. Perhaps we can get it from firmware
   via IORT?
2. The current splitting of the VMID space is only one way to do this and
   probably not the best. Maybe we can follow the pinned-ASID method used
   in the SVA code. Suggestions are welcome here.
3. The detach_pasid_table() interface is not very clear to me, as the
   current QEMU prototype is not using it. This requires fixing on my side.
This is based on Jean-Philippe's SVA series [1] and Eric's SMMUv3 dual-stage
support series [2].
The branch with the whole vSVA + BTM solution is here:
https://github.com/hisilicon/kernel-dev/tree/5.10-rc4-2stage-v13-vsva-btm-r…
This is lightly tested on a HiSilicon D06 platform with the uacce/zip dev
test tool:
./zip_sva_per -k tlb
Thanks,
Shameer
1. https://github.com/Linaro/linux-kernel-uadk/commits/uacce-devel-5.10
2. https://lore.kernel.org/linux-iommu/20201118112151.25412-1-eric.auger@redha…
Shameer Kolothum (5):
vfio: Add a helper to retrieve kvm instance from a dev
KVM: Add generic infrastructure to support pinned VMIDs
KVM: ARM64: Add support for pinned VMIDs
iommu/arm-smmu-v3: Use pinned VMID for NESTED stage with BTM
KVM: arm64: Make sure pinned vmid is released on VM exit
arch/arm64/include/asm/kvm_host.h | 2 +
arch/arm64/kvm/Kconfig | 1 +
arch/arm64/kvm/arm.c | 116 +++++++++++++++++++-
drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c | 49 ++++++++-
drivers/vfio/vfio.c | 12 ++
include/linux/kvm_host.h | 17 +++
include/linux/vfio.h | 1 +
virt/kvm/Kconfig | 2 +
virt/kvm/kvm_main.c | 25 +++++
9 files changed, 220 insertions(+), 5 deletions(-)
--
2.17.1
The ARM64 server chip Kunpeng 920 has 6 clusters in each NUMA node, and each
cluster has 4 CPUs. All clusters share the L3 cache data while each cluster
has its own local L3 tag. In addition, the CPUs within a cluster share some
internal system bus. This means the cache is much more affine inside one
cluster than across clusters.
+-----------------------------+    +--------+
| cluster 0: CPU0 CPU1        |----| L3 tag |--+
|            CPU2 CPU3        |    +--------+  |
+-----------------------------+                |
+-----------------------------+    +--------+  |
| cluster 1: CPU4 CPU5        |----| L3 tag |--+    +---------+
|            CPU6 CPU7        |    +--------+  +----| L3 data |
+-----------------------------+                |    +---------+
              ...                     ...      |
+-----------------------------+    +--------+  |
| cluster 5: CPU20 CPU21      |----| L3 tag |--+
|            CPU22 CPU23      |    +--------+
+-----------------------------+
The following small program shows the performance impact of running it
within one cluster versus across two clusters:
#include <pthread.h>

struct foo {
	int x;
	int y;
} f;

void *thread1_fun(void *param)
{
	int s = 0;

	/* Repeatedly read f.x, which shares a cache line with f.y */
	for (int i = 0; i < 0xfffffff; i++)
		s += f.x;

	return NULL;
}

void *thread2_fun(void *param)
{
	/* Repeatedly write f.y, bouncing the shared cache line */
	for (int i = 0; i < 0xfffffff; i++)
		f.y++;

	return NULL;
}

int main(int argc, char **argv)
{
	pthread_t tid1, tid2;

	pthread_create(&tid1, NULL, thread1_fun, NULL);
	pthread_create(&tid2, NULL, thread2_fun, NULL);
	pthread_join(tid1, NULL);
	pthread_join(tid2, NULL);

	return 0;
}
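(The program was built with something like "gcc -O0 -pthread", with
optimization disabled so that the loads and stores are not optimized
away; the exact build flags are an assumption here.)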
While running this program in one cluster, it takes:
$ time taskset -c 0,1 ./a.out
real 0m0.832s
user 0m1.649s
sys 0m0.004s
In contrast, it takes much more time if we run the same program
across two clusters:
$ time taskset -c 0,4 ./a.out
real 0m1.133s
user 0m1.960s
sys 0m0.000s
0.832/1.133 = 73%: running within one cluster takes only 73% of the time
needed across clusters, which is a significant difference.
Hackbench running on 4 CPUs within a single cluster versus 4 CPUs across
different clusters also shows a large contrast:
* inside a cluster:
root@ubuntu:~# taskset -c 0,1,2,3 hackbench -p -T -l 20000 -g 1
Running in threaded mode with 1 groups using 40 file descriptors each
(== 40 tasks)
Each sender will pass 20000 messages of 100 bytes
Time: 4.285
* across clusters:
root@ubuntu:~# taskset -c 0,4,8,12 hackbench -p -T -l 20000 -g 1
Running in threaded mode with 1 groups using 40 file descriptors each
(== 40 tasks)
Each sender will pass 20000 messages of 100 bytes
Time: 5.524
The scores are 4.285 vs. 5.524; a shorter time means better performance.
All this testing implies that we should let the Linux scheduler use
this topology to make better load-balancing and WAKE_AFFINE decisions.
However, the current scheduler has no notion of clusters.
This patchset first exposes the cluster topology, then adds a sched
domain for clusters. While it is named "cluster", architectures and
machines can define the exact meaning of a cluster as long as they have
some resources shared below the LLC and can leverage the affinity of
those resources to achieve better scheduling performance.
-v3:
- rebased against 5.11-rc2
- addressed comments from Valentin Schneider, Peter Zijlstra,
  Vincent Guittot, Mel Gorman and others:
  * moved the scheduler changes from arm64 to a common place for all
    architectures.
  * added the SD_SHARE_CLS_RESOURCES sd_flag specifying the sched_domain
    from which select_idle_cpu() should begin to scan.
  * removed the redundant select_idle_cluster() function since all the
    code is in select_idle_cpu() now; this also avoids scanning cluster
    CPUs twice, as the v2 code did.
  * redid the hackbench tests within one NUMA node after the above changes.
Valentin suggested that select_idle_cpu() could begin to scan from the
domain with SD_SHARE_PKG_RESOURCES. Changing it like that might be too
aggressive and limit the spreading of tasks. Thus, this patchset lets
architectures and machines decide where to start by adding a new
SD_SHARE_CLS_RESOURCES flag, as sketched below.
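For illustration, a small self-contained sketch of the idea (the struct,
the flag values and the helper name are assumptions for this sketch; in
the real code the walk is over the per-CPU sched_domain hierarchy):

#include <stddef.h>

#define SD_SHARE_PKG_RESOURCES	0x1	/* CPUs share the LLC */
#define SD_SHARE_CLS_RESOURCES	0x2	/* CPUs share cluster resources */

struct sched_domain_lvl {
	unsigned int flags;
	struct sched_domain_lvl *parent;	/* next wider topology level */
};

/*
 * Return the level an idle-CPU scan should start from: the cluster
 * level when one exists, otherwise fall back to the LLC level.
 */
static struct sched_domain_lvl *
scan_start_level(struct sched_domain_lvl *lowest)
{
	struct sched_domain_lvl *lvl, *fallback = NULL;

	for (lvl = lowest; lvl; lvl = lvl->parent) {
		if (lvl->flags & SD_SHARE_CLS_RESOURCES)
			return lvl;
		if (!fallback && (lvl->flags & SD_SHARE_PKG_RESOURCES))
			fallback = lvl;
	}
	return fallback;
}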
Barry Song (1):
scheduler: add scheduler level for clusters
Jonathan Cameron (1):
topology: Represent clusters of CPUs within a die.
Documentation/admin-guide/cputopology.rst | 26 +++++++++++---
arch/arm64/Kconfig | 7 ++++
arch/arm64/kernel/topology.c | 2 ++
drivers/acpi/pptt.c | 60 +++++++++++++++++++++++++++++++
drivers/base/arch_topology.c | 14 ++++++++
drivers/base/topology.c | 10 ++++++
include/linux/acpi.h | 5 +++
include/linux/arch_topology.h | 5 +++
include/linux/sched/sd_flags.h | 9 +++++
include/linux/sched/topology.h | 7 ++++
include/linux/topology.h | 13 +++++++
kernel/sched/fair.c | 27 ++++++++++----
kernel/sched/topology.c | 6 ++++
13 files changed, 181 insertions(+), 10 deletions(-)
--
2.7.4
CPU cache corrected errors are occasionally detected on a few of our
ARM64 hardware boards. Though this is rare, the possibility of CPU
cache errors occurring frequently cannot be ruled out. Detecting
failures early, by monitoring the corrected cache errors for frequent
occurrences and taking preventive action, could prevent more serious
hardware faults.
On Intel architectures, cache corrected errors are reported and the
affected cores are offlined in an architecture-specific way:
http://www.mcelog.org/cache.html
However, for firmware-first error reporting, specifically on the ARM64
architecture, there is no provision for reporting the corrected cache
error count to user space and taking preventive action such as
offlining the affected cores.
For this purpose, it was suggested to create a CPU EDAC device for the
CPU caches that reports the cache error count for firmware-first error
reporting.
A user-space application could monitor the recorded corrected error
count for early hardware failure detection and take preventive action,
such as offlining the corresponding CPU core(s), as sketched below.
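For illustration, a minimal user-space sketch of such a monitor (the EDAC
sysfs path for the corrected error count and the threshold are assumptions
for this sketch; only the CPU online attribute is a standard path):

#include <stdio.h>

#define CE_THRESHOLD	100	/* example threshold, not a recommended value */

static long read_long(const char *path)
{
	long val = -1;
	FILE *f = fopen(path, "r");

	if (f) {
		if (fscanf(f, "%ld", &val) != 1)
			val = -1;
		fclose(f);
	}
	return val;
}

int main(void)
{
	int cpu = 0;	/* monitor CPU0 as an example */
	char cnt_path[128], online_path[128];

	/* Hypothetical sysfs attribute exposing the per-CPU cache CE count */
	snprintf(cnt_path, sizeof(cnt_path),
		 "/sys/devices/system/edac/cpu/cpu%d/ce_count", cpu);
	snprintf(online_path, sizeof(online_path),
		 "/sys/devices/system/cpu/cpu%d/online", cpu);

	if (read_long(cnt_path) > CE_THRESHOLD) {
		FILE *f = fopen(online_path, "w");

		if (f) {
			fputs("0", f);	/* offline the affected core */
			fclose(f);
		}
	}
	return 0;
}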
Changes:
RFC V1 -> RFC V2:
1. Addressed feedback from Boris:
   1.1 Added the rationale for this patch.
   1.2 Changed "CPU errors" to "CPU cache errors" in drivers/edac/Kconfig.
   1.3 Changed the EDAC cache list to percpu variables.
   1.4 Made the configuration depend on ARM64.
   1.5 Moved the discovery of cacheinfo to ghes_scan_system().
2. Changes in the descriptions.
Shiju Jose (2):
EDAC/ghes: Add EDAC device for reporting the CPU cache errors
ACPI / APEI: Add reporting ARM64 CPU cache corrected error count
Documentation/ABI/testing/sysfs-devices-edac | 15 ++
drivers/acpi/apei/ghes.c | 76 +++++++-
drivers/edac/Kconfig | 12 ++
drivers/edac/ghes_edac.c | 186 +++++++++++++++++++
include/acpi/ghes.h | 27 +++
include/linux/cper.h | 4 +
6 files changed, 316 insertions(+), 4 deletions(-)
--
2.17.1
BATCHED_UNMAP_TLB_FLUSH is used on x86 to do batched TLB shootdown by
sending one IPI to flush all TLB entries after unmapping pages, rather
than sending an IPI to flush each individual entry.
On arm64, TLB shootdown is done by hardware. Flush instructions are
inner-shareable. Local flushes are limited to boot time (one per CPU)
and to when a task gets a new ASID.
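For illustration, a simplified sketch of what such a hardware-broadcast
invalidate looks like (this is not the kernel's flush_tlb_* code; the
operand encoding and barriers are reduced to the essentials and it only
builds for arm64):

static inline void tlbi_va_inner_shareable(unsigned long addr)
{
	/*
	 * The "is" (inner-shareable) TLBI variants are broadcast by the
	 * hardware to all CPUs in the inner-shareable domain, so no IPI
	 * is needed to invalidate remote TLBs.
	 */
	asm volatile("dsb ishst\n"
		     "tlbi vaae1is, %0\n"
		     "dsb ish\n"
		     "isb"
		     : : "r" (addr >> 12) : "memory");
}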
So marking this feature as "TODO" is not appropriate, and ".." isn't
right either. This patch therefore adds an "N/A" entry for features
which are not needed on some architectures.
Cc: Mel Gorman <mgorman(a)suse.de>
Cc: Andy Lutomirski <luto(a)kernel.org>
Cc: Catalin Marinas <catalin.marinas(a)arm.com>
Cc: Will Deacon <will(a)kernel.org>
Signed-off-by: Barry Song <song.bao.hua(a)hisilicon.com>
---
Documentation/features/arch-support.txt | 1 +
Documentation/features/vm/TLB/arch-support.txt | 2 +-
2 files changed, 2 insertions(+), 1 deletion(-)
diff --git a/Documentation/features/arch-support.txt b/Documentation/features/arch-support.txt
index d22a1095e661..118ae031840b 100644
--- a/Documentation/features/arch-support.txt
+++ b/Documentation/features/arch-support.txt
@@ -8,4 +8,5 @@ The meaning of entries in the tables is:
| ok | # feature supported by the architecture
|TODO| # feature not yet supported by the architecture
| .. | # feature cannot be supported by the hardware
+ | N/A| # feature doesn't apply to the architecture
diff --git a/Documentation/features/vm/TLB/arch-support.txt b/Documentation/features/vm/TLB/arch-support.txt
index 30f75a79ce01..0d070f9f98d8 100644
--- a/Documentation/features/vm/TLB/arch-support.txt
+++ b/Documentation/features/vm/TLB/arch-support.txt
@@ -9,7 +9,7 @@
| alpha: | TODO |
| arc: | TODO |
| arm: | TODO |
- | arm64: | TODO |
+ | arm64: | N/A |
| c6x: | .. |
| csky: | TODO |
| h8300: | .. |
--
2.25.1
On 21/12/2020 13:04, Jiahui Cen wrote:
>> On 21/12/2020 03:24, Jiahui Cen wrote:
>>> Hi John,
>>>
>>> On 2020/12/18 18:40, John Garry wrote:
>>>> On 18/12/2020 06:23, Jiahui Cen wrote:
>>>>> Since the [start, end) is a half-open interval, a range with the end equal
>>>>> to the start of another range should not be considered as overlapped.
>>>>>
>>>>> Signed-off-by: Jiahui Cen <cenjiahui(a)huawei.com>
>>>>> ---
>>>>> lib/logic_pio.c | 2 +-
>>>>> 1 file changed, 1 insertion(+), 1 deletion(-)
>>>>>
>>>>> diff --git a/lib/logic_pio.c b/lib/logic_pio.c
>>>>> index f32fe481b492..445d611f1dc1 100644
>>>>> --- a/lib/logic_pio.c
>>>>> +++ b/lib/logic_pio.c
>>>>> @@ -57,7 +57,7 @@ int logic_pio_register_range(struct logic_pio_hwaddr *new_range)
>>>>> new_range->flags == LOGIC_PIO_CPU_MMIO) {
>>>>> /* for MMIO ranges we need to check for overlap */
>>>>> if (start >= range->hw_start + range->size ||
>>>>> - end < range->hw_start) {
>>>>> + end <= range->hw_start) {
>>>> It looks like your change is correct, but should not really have an impact in practice since:
>>>> a. BIOSes generally list ascending IO port CPU addresses
>>>> b. there is space between IO port CPU address regions
>>>>
>>>> Have you seen a problem here?
>>>>
>>> No serious problem. I found it while working on adding support for the
>>> PCI expander bridge for Arm in QEMU. I found that the IO window of some
>>> extended root buses could not be registered when I inserted the extended
>>> buses' _CRS info into the DSDT table in the x86 way, which does not sort
>>> the buses.
>>>
>>> Though root buses should be sorted in QEMU, would it be better to accept
>>> those non-ascending IO windows?
>>>
>> ok, so it seems that you have seen a real problem, and this issue is not just detected by code analysis.
>>
>>> BTW, for b, it seems there is no space between IO windows of different
>>> root buses generated by EDK2. Or maybe I missed something obvious.
>> I don't know about that. Anyway, your change looks ok.
>>
>> Reviewed-by: John Garry <john.garry(a)huawei.com>
>>
>> BTW, for your virt env, will there be a requirement to unregister PCI MMIO ranges? Currently we don't see that in the non-virt world.
>>
> Thanks for your review.
>
> And currently there is no such requirement in my virt env.
>
I am not sure what happened to this patch, but I plan on sending some
patches in this area soon - do you want me to include this one?
Thanks,
John
Hi Thomas Monjalon, Ferruh Yigit and others,
I'm analyzing multi-process support in EAL, and I have some questions I'd
like to ask you.
Firstly, after rte_eal_init() is executed, the master and slave processes
start successfully, and traffic is continuously sent using the tester. If
you run kill -9 to stop the slave process, then restart that process and
start receiving and sending packets, how can we ensure that the EAL
resources of the killed slave process are cleaned up?
Second, how should the remove function be invoked to clear the probe
resources of the slave process after the slave process exits?
Finally, I found that the rte_eal_cleanup() call did not unregister the
mp action after the process exited.
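To make the scenario concrete, here is a minimal sketch of the slave
process flow in question (only the standard rte_eal_init() and
rte_eal_cleanup() calls are shown; the packet path is omitted):

#include <rte_eal.h>

int main(int argc, char **argv)
{
	/* The slave (secondary) process attaches to the shared EAL state */
	if (rte_eal_init(argc, argv) < 0)
		return -1;

	/* ... receive and send packets ... */

	/*
	 * This only runs on a clean exit; a "kill -9" never reaches this
	 * point, so the mp action and other per-process EAL resources are
	 * left behind.
	 */
	rte_eal_cleanup();
	return 0;
}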
I look forward to your response.
Thanks
Lijun Ou