[PATCH v2] PCI/DPC: Check host->native_dpc before enabling DPC service
by Yicong Yang
Per the Downstream Port Containment Related Enhancements ECN [1],
Table 4-6 (Interpretation of _OSC Control Field Returned Value),
bit 7 of the _OSC control return value is defined as:
"Firmware sets this bit to 1 to grant the OS control over PCI Express
Downstream Port Containment configuration."
"If control of this feature was requested and denied,
or was not requested, the firmware returns this bit set to 0."
We store bit 7 of the _OSC control return value in host->native_dpc and
check it before enabling the DPC service, since the firmware may not
have granted the OS control of it.
[1] Downstream Port Containment Related Enhancements ECN,
Jan 28, 2019, affecting PCI Firmware Specification, Rev. 3.2
https://members.pcisig.com/wg/PCI-SIG/document/12888
Signed-off-by: Yicong Yang <yangyicong(a)hisilicon.com>
---
Change since v1:
- use correct reference for _OSC control return value
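For reviewers, a minimal sketch of the resulting check; it assumes the
host bridge is looked up via pci_find_host_bridge(dev->bus) earlier in
get_port_device_capability(), which is not visible in the hunk below:

	struct pci_host_bridge *host = pci_find_host_bridge(dev->bus);

	if (pci_find_ext_capability(dev, PCI_EXT_CAP_ID_DPC) &&
	    pci_aer_available() &&
	    (pcie_ports_dpc_native ||
	     ((services & PCIE_PORT_SERVICE_AER) && host->native_dpc)))
		services |= PCIE_PORT_SERVICE_DPC;	/* only with firmware consent */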
drivers/pci/pcie/portdrv_core.c | 3 ++-
1 file changed, 2 insertions(+), 1 deletion(-)
diff --git a/drivers/pci/pcie/portdrv_core.c b/drivers/pci/pcie/portdrv_core.c
index e1fed664..7445d03 100644
--- a/drivers/pci/pcie/portdrv_core.c
+++ b/drivers/pci/pcie/portdrv_core.c
@@ -253,7 +253,8 @@ static int get_port_device_capability(struct pci_dev *dev)
*/
if (pci_find_ext_capability(dev, PCI_EXT_CAP_ID_DPC) &&
pci_aer_available() &&
- (pcie_ports_dpc_native || (services & PCIE_PORT_SERVICE_AER)))
+ (pcie_ports_dpc_native ||
+ ((services & PCIE_PORT_SERVICE_AER) && host->native_dpc)))
services |= PCIE_PORT_SERVICE_DPC;
if (pci_pcie_type(dev) == PCI_EXP_TYPE_DOWNSTREAM ||
--
2.8.1
[RFC PATCH 0/5] KVM/ARM64 Add support for pinned VMIDs
by Shameer Kolothum
On an ARM64 system with an SMMUv3 implementation that fully supports
the Broadcast TLB Maintenance (BTM) feature as part of the Distributed
Virtual Memory (DVM) protocol, the CPU TLB invalidate instructions are
also received by the SMMUv3. This is very useful when the SMMUv3 shares
the page tables with the CPU (e.g. the guest SVA use case). For this to
work, the SMMU must use the same VMID that is allocated by KVM to
configure the stage 2 translations. At present, KVM VMID allocations are
recycled on rollover and may change as a result. This creates issues if
we have to share the KVM VMID with the SMMU.
Please see the discussion here,
https://lore.kernel.org/linux-iommu/20200522101755.GA3453945@myrica/
This series proposes a way to share the VMID between KVM and the IOMMU
driver by:
1. Splitting the KVM VMID space into two equal halves based on the
command line option "kvm-arm.pinned_vmid_enable".
2. The first half of the VMID space follows the normal recycle-on-rollover
policy.
3. The second half of the VMID space does not roll over and is used to
allocate pinned VMIDs.
4. Providing a helper function to retrieve the KVM instance associated
with a device (if it is part of a vfio group).
5. Introducing generic interfaces to get/put pinned KVM VMIDs (a rough
sketch follows this list).
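A rough sketch of the generic get/put interfaces this series has in mind;
the names and signatures below are illustrative only, see the individual
patches for the actual definitions:

	/* Pin a VMID for this VM so that it survives VMID rollover; the
	 * pinned value can then be programmed into the SMMUv3 stage 2
	 * configuration. (Illustrative prototype.)
	 */
	int kvm_pinned_vmid_get(struct kvm *kvm);

	/* Drop the pin once the IOMMU driver no longer needs the VMID.
	 * (Illustrative prototype.)
	 */
	void kvm_pinned_vmid_put(struct kvm *kvm);

	/* Hypothetical helper name: look up the KVM instance associated
	 * with a device that is part of a vfio group (patch 1).
	 */
	struct kvm *vfio_kvm_get_from_dev(struct device *dev);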
Open Items:
1. I couldn't figure out a way to determine whether a platform actually
fully supports DVM/BTM or not. I'm not sure we can make that call based on
the SMMUv3 BTM feature bit alone. Perhaps we can get it from firmware
via IORT?
2. The current splitting of VMID space is only one way to do this and
probably not the best. Maybe we can follow the pinned ASID method used
in SVA code. Suggestions welcome here.
3. The detach_pasid_table() interface is not very clear to me as the current
Qemu prototype is not using it. This requires fixing on my side.
This is based on Jean-Philippe's SVA series[1] and Eric's SMMUv3 dual-stage
support series[2].
The branch with the whole vSVA + BTM solution is here,
https://github.com/hisilicon/kernel-dev/tree/5.10-rc4-2stage-v13-vsva-btm...
This is lightly tested on a HiSilicon D06 platform with the uacce/zip dev test tool:
./zip_sva_per -k tlb
Thanks,
Shameer
1. https://github.com/Linaro/linux-kernel-uadk/commits/uacce-devel-5.10
2. https://lore.kernel.org/linux-iommu/20201118112151.25412-1-eric.auger@red...
Shameer Kolothum (5):
vfio: Add a helper to retrieve kvm instance from a dev
KVM: Add generic infrastructure to support pinned VMIDs
KVM: ARM64: Add support for pinned VMIDs
iommu/arm-smmu-v3: Use pinned VMID for NESTED stage with BTM
KVM: arm64: Make sure pinned vmid is released on VM exit
arch/arm64/include/asm/kvm_host.h | 2 +
arch/arm64/kvm/Kconfig | 1 +
arch/arm64/kvm/arm.c | 116 +++++++++++++++++++-
drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c | 49 ++++++++-
drivers/vfio/vfio.c | 12 ++
include/linux/kvm_host.h | 17 +++
include/linux/vfio.h | 1 +
virt/kvm/Kconfig | 2 +
virt/kvm/kvm_main.c | 25 +++++
9 files changed, 220 insertions(+), 5 deletions(-)
--
2.17.1
[PATCH net-next 0/9] net: hns3: refactor and new features for flow director
by Huazhong Tan
This patchset refactors some functions and adds new features for the
flow director.
patch 1~3: refactor large functions
patch 4, 7: add traffic class and user-def field support for ethtool (example below)
patch 5: use asynchronous configuration
patch 6: clean up hns3_del_all_fd_entries()
patch 8, 9: add support for queue bonding mode
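For reference, the user-def field is normally exercised through ethtool's
ntuple interface; the device name and values below are purely illustrative:

	# match UDP packets to port 4789 carrying user-def data 0x1 and
	# steer them to queue 3
	ethtool -N eth0 flow-type udp4 dst-port 4789 user-def 0x1 action 3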
Jian Shen (9):
net: hns3: refactor out hclge_add_fd_entry()
net: hns3: refactor out hclge_fd_get_tuple()
net: hns3: refactor for function hclge_fd_convert_tuple
net: hns3: add support for traffic class tuple support for flow
director by ethtool
net: hns3: refactor flow director configuration
net: hns3: refine for hns3_del_all_fd_entries()
net: hns3: add support for user-def data of flow director
net: hns3: add support for queue bonding mode of flow director
net: hns3: add queue bonding mode support for VF
drivers/net/ethernet/hisilicon/hns3/hclge_mbx.h | 8 +
drivers/net/ethernet/hisilicon/hns3/hnae3.h | 9 +-
drivers/net/ethernet/hisilicon/hns3/hns3_debugfs.c | 7 +-
drivers/net/ethernet/hisilicon/hns3/hns3_enet.c | 91 +-
drivers/net/ethernet/hisilicon/hns3/hns3_enet.h | 14 +-
drivers/net/ethernet/hisilicon/hns3/hns3_ethtool.c | 13 +-
.../net/ethernet/hisilicon/hns3/hns3pf/hclge_cmd.c | 2 +
.../net/ethernet/hisilicon/hns3/hns3pf/hclge_cmd.h | 21 +
.../ethernet/hisilicon/hns3/hns3pf/hclge_main.c | 1570 ++++++++++++++------
.../ethernet/hisilicon/hns3/hns3pf/hclge_main.h | 63 +
.../net/ethernet/hisilicon/hns3/hns3pf/hclge_mbx.c | 33 +
.../ethernet/hisilicon/hns3/hns3vf/hclgevf_cmd.c | 2 +
.../ethernet/hisilicon/hns3/hns3vf/hclgevf_main.c | 74 +
.../ethernet/hisilicon/hns3/hns3vf/hclgevf_main.h | 7 +
.../ethernet/hisilicon/hns3/hns3vf/hclgevf_mbx.c | 17 +
15 files changed, 1450 insertions(+), 481 deletions(-)
--
2.7.4
[RFC PATCH v5 0/4] scheduler: expose the topology of clusters and add cluster scheduler
by Barry Song
ARM64 server chip Kunpeng 920 has 6 or 8 clusters in each NUMA node, and each
cluster has 4 cpus. All clusters share L3 cache data while each cluster has
local L3 tag. In addition, each cluster shares some internal system
bus. This means cache is much more affine inside one cluster than across
clusters.
+-----------------------------------+ +---------+
| +------+ +------+ +---------------------------+ |
| | CPU0 | | cpu1 | | +-----------+ | |
| +------+ +------+ | | | | |
| +----+ L3 | | |
| +------+ +------+ cluster | | tag | | |
| | CPU2 | | CPU3 | | | | | |
| +------+ +------+ | +-----------+ | |
| | | |
+-----------------------------------+ | |
+-----------------------------------+ | |
| +------+ +------+ +--------------------------+ |
| | | | | | +-----------+ | |
| +------+ +------+ | | | | |
| | | L3 | | |
| +------+ +------+ +----+ tag | | |
| | | | | | | | | |
| +------+ +------+ | +-----------+ | |
| | | |
+-----------------------------------+ | L3 |
| data |
+-----------------------------------+ | |
| +------+ +------+ | +-----------+ | |
| | | | | | | | | |
| +------+ +------+ +----+ L3 | | |
| | | tag | | |
| +------+ +------+ | | | | |
| | | | | ++ +-----------+ | |
| +------+ +------+ |---------------------------+ |
+-----------------------------------| | |
+-----------------------------------| | |
| +------+ +------+ +---------------------------+ |
| | | | | | +-----------+ | |
| +------+ +------+ | | | | |
| +----+ L3 | | |
| +------+ +------+ | | tag | | |
| | | | | | | | | |
| +------+ +------+ | +-----------+ | |
| | | |
+-----------------------------------+ | |
+-----------------------------------+ | |
| +------+ +------+ +--------------------------+ |
| | | | | | +-----------+ | |
| +------+ +------+ | | | | |
There is a similar need for clustering on x86. Some x86 cores share L2 caches
in a way that is similar to the cluster in Kunpeng 920 (e.g. on Jacobsville there are 6 clusters
of 4 Atom cores, each cluster sharing a separate L2, and 24 cores sharing L3).
Having a sched_domain for clusters will bring two aspects of improvement:
1. Spreading unrelated tasks among clusters, which decreases resource contention
and improves throughput.
unrelated tasks might be put randomly without cluster sched_domain:
+-------------------+ +-----------------+
| +----+ +----+ | | |
| |task| |task| | | |
| |1 | |2 | | | |
| +----+ +----+ | | |
| | | |
| cluster1 | | cluster2 |
+-------------------+ +-----------------+
but with a cluster sched_domain, they are likely to be spread apart by load balancing (LB):
+-------------------+ +-----------------+
| +----+ | | +----+ |
| |task| | | |task| |
| |1 | | | |2 | |
| +----+ | | +----+ |
| | | |
| cluster1 | | cluster2 |
+-------------------+ +-----------------+
2. Gathering related tasks within a cluster, which improves the cache affinity of tasks
talking to each other.
Without a cluster sched_domain, related tasks might be placed randomly. Suppose tasks 1-8
have the following wakeup relationships:
Task1 wakes up task4
Task2 wakes up task5
Task3 wakes up task6
Task4 wakes up task7
With select_idle_cpu() tuned to scan the local cluster first (a rough sketch of this
scanning order follows the diagrams), those tasks might get a chance to be gathered like:
+---------------------------+ +----------------------+
| +----+ +-----+ | | +----+ +-----+ |
| |task| |task | | | |task| |task | |
| |1 | | 4 | | | |2 | |5 | |
| +----+ +-----+ | | +----+ +-----+ |
| | | |
| cluster1 | | cluster2 |
| | | |
| | | |
| +-----+ +------+ | | +-----+ +------+ |
| |task | | task | | | |task | |task | |
| |3 | | 6 | | | |4 | |8 | |
| +-----+ +------+ | | +-----+ +------+ |
+---------------------------+ +----------------------+
Otherwise, the result might be:
+---------------------------+ +----------------------+
| +----+ +-----+ | | +----+ +-----+ |
| |task| |task | | | |task| |task | |
| |1 | | 2 | | | |5 | |6 | |
| +----+ +-----+ | | +----+ +-----+ |
| | | |
| cluster1 | | cluster2 |
| | | |
| | | |
| +-----+ +------+ | | +-----+ +------+ |
| |task | | task | | | |task | |task | |
| |3 | | 4 | | | |7 | |8 | |
| +-----+ +------+ | | +-----+ +------+ |
+---------------------------+ +----------------------+
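A rough, illustrative sketch of the cluster-first scanning order; this is not
the patch itself, and cpu_clustergroup_mask() is assumed here to be the
cluster-sibling mask helper introduced by the topology patch of this series:

	/* Try the target's cluster siblings before widening the idle-CPU
	 * search to the rest of the LLC.
	 */
	static int scan_cluster_first(struct task_struct *p, int target)
	{
		const struct cpumask *cluster = cpu_clustergroup_mask(target);
		int cpu;

		for_each_cpu_wrap(cpu, cluster, target + 1) {
			if (!cpumask_test_cpu(cpu, p->cpus_ptr))
				continue;
			if (available_idle_cpu(cpu))
				return cpu;
		}

		/* Nothing idle in the cluster: fall back to the LLC-wide scan. */
		return -1;
	}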
-v5:
* split "add scheduler level for clusters" into two patches to evaluate the
impact of spreading and gathering separately;
* add a tracepoint of select_idle_cpu for debug purpose; add bcc script in
commit log;
* add cluster_id = -1 in reset_cpu_topology()
* rebased to tip/sched/core
-v4:
* rebased to tip/sched/core with the latest unified code of select_idle_cpu
* added Tim's patch for x86 Jacobsville
* also added benchmark data of spreading unrelated tasks
* avoided the iteration of sched_domain by moving to static_key (addressing
Vincent's comment)
* used acpi_cpu_id for acpi_find_processor_node() (addressing Masa's comment)
Barry Song (2):
scheduler: add scheduler level for clusters
scheduler: scan idle cpu in cluster before scanning the whole llc
Jonathan Cameron (1):
topology: Represent clusters of CPUs within a die
Tim Chen (1):
scheduler: Add cluster scheduler level for x86
Documentation/admin-guide/cputopology.rst | 26 +++++++++++--
arch/arm64/Kconfig | 7 ++++
arch/arm64/kernel/topology.c | 2 +
arch/x86/Kconfig | 8 ++++
arch/x86/include/asm/smp.h | 7 ++++
arch/x86/include/asm/topology.h | 1 +
arch/x86/kernel/cpu/cacheinfo.c | 1 +
arch/x86/kernel/cpu/common.c | 3 ++
arch/x86/kernel/smpboot.c | 43 ++++++++++++++++++++-
drivers/acpi/pptt.c | 63 +++++++++++++++++++++++++++++++
drivers/base/arch_topology.c | 15 ++++++++
drivers/base/topology.c | 10 +++++
include/linux/acpi.h | 5 +++
include/linux/arch_topology.h | 5 +++
include/linux/sched/cluster.h | 19 ++++++++++
include/linux/sched/topology.h | 7 ++++
include/linux/topology.h | 13 +++++++
include/trace/events/sched.h | 22 +++++++++++
kernel/sched/core.c | 20 ++++++++++
kernel/sched/fair.c | 36 +++++++++++++++++-
kernel/sched/sched.h | 1 +
kernel/sched/topology.c | 5 +++
22 files changed, 313 insertions(+), 6 deletions(-)
create mode 100644 include/linux/sched/cluster.h
--
1.8.3.1
[PATCH net v3] net: sched: fix packet stuck problem for lockless qdisc
by Yunsheng Lin
Lockless qdisc has the following concurrency problem:
cpu0 cpu1
. .
q->enqueue .
. .
qdisc_run_begin() .
. .
dequeue_skb() .
. .
sch_direct_xmit() .
. .
. q->enqueue
. qdisc_run_begin()
. return and do nothing
. .
qdisc_run_end() .
cpu1 enqueues a skb without calling __qdisc_run() because cpu0
has not released the lock yet and spin_trylock() returns false
for cpu1 in qdisc_run_begin(), and cpu0 does not see the skb
enqueued by cpu1 when calling dequeue_skb() because cpu1 may
enqueue the skb after cpu0 calls dequeue_skb() and before
cpu0 calls qdisc_run_end().
Lockless qdisc has another concurrency problem when
tx_action is involved:
cpu0(serving tx_action) cpu1 cpu2
. . .
. q->enqueue .
. qdisc_run_begin() .
. dequeue_skb() .
. . q->enqueue
. . .
. sch_direct_xmit() .
. . qdisc_run_begin()
. . return and do nothing
. . .
clear __QDISC_STATE_SCHED . .
qdisc_run_begin() . .
return and do nothing . .
. . .
. qdisc_run_end() .
This patch fixes the above data race by:
1. Getting the flag before doing spin_trylock().
2. If the first spin_trylock() returns false and the flag was not
set before the first spin_trylock(), setting the flag and retrying
another spin_trylock(), in case the other CPU may not see the new
flag after it releases the lock.
3. Rescheduling if the flag is set after the lock is released
at the end of qdisc_run_end().
For the tx_action case, the flag is also set when cpu1 reaches the
end of qdisc_run_end(), so tx_action will be rescheduled
again to dequeue the skb enqueued by cpu2.
The flag is only cleared before retrying a dequeue when the dequeue
returns NULL, in order to reduce the overhead of the above double
spin_trylock() and __netif_schedule() calls.
The performance impact of this patch, tested using pktgen and
dummy netdev with pfifo_fast qdisc attached:
threads without+this_patch with+this_patch delta
1 2.61Mpps 2.60Mpps -0.3%
2 3.97Mpps 3.82Mpps -3.7%
4 5.62Mpps 5.59Mpps -0.5%
8 2.78Mpps 2.77Mpps -0.3%
16 2.22Mpps 2.22Mpps -0.0%
Fixes: 6b3ba9146fe6 ("net: sched: allow qdiscs to handle locking")
Signed-off-by: Yunsheng Lin <linyunsheng(a)huawei.com>
---
V3: fix a compile error and a few comment typos, remove the
__QDISC_STATE_DEACTIVATED checking, and update the
performance data.
V2: Avoid the overhead of fixing the data race as much as
possible.
---
include/net/sch_generic.h | 38 +++++++++++++++++++++++++++++++++++++-
net/sched/sch_generic.c | 12 ++++++++++++
2 files changed, 49 insertions(+), 1 deletion(-)
diff --git a/include/net/sch_generic.h b/include/net/sch_generic.h
index f7a6e14..e3f46eb 100644
--- a/include/net/sch_generic.h
+++ b/include/net/sch_generic.h
@@ -36,6 +36,7 @@ struct qdisc_rate_table {
enum qdisc_state_t {
__QDISC_STATE_SCHED,
__QDISC_STATE_DEACTIVATED,
+ __QDISC_STATE_NEED_RESCHEDULE,
};
struct qdisc_size_table {
@@ -159,8 +160,38 @@ static inline bool qdisc_is_empty(const struct Qdisc *qdisc)
static inline bool qdisc_run_begin(struct Qdisc *qdisc)
{
if (qdisc->flags & TCQ_F_NOLOCK) {
+ bool dont_retry = test_bit(__QDISC_STATE_NEED_RESCHEDULE,
+ &qdisc->state);
+
+ if (spin_trylock(&qdisc->seqlock))
+ goto nolock_empty;
+
+ /* If the flag is set before doing the spin_trylock() and
+ * the above spin_trylock() returns false, it means the other cpu
+ * holding the lock will do the dequeuing for us, or it will see
+ * the flag set after releasing the lock and reschedule the
+ * net_tx_action() to do the dequeuing.
+ */
+ if (dont_retry)
+ return false;
+
+ /* We could do the set_bit() before the first spin_trylock()
+ * and avoid the second spin_trylock() completely, but then
+ * multiple cpus could be doing the set_bit(). Here we use
+ * dont_retry to avoid the set_bit() and the second
+ * spin_trylock() when possible, which is about 5% faster than
+ * doing the set_bit() before the first spin_trylock().
+ */
+ set_bit(__QDISC_STATE_NEED_RESCHEDULE,
+ &qdisc->state);
+
+ /* Retry again in case other CPU may not see the new flag
+ * after it releases the lock at the end of qdisc_run_end().
+ */
if (!spin_trylock(&qdisc->seqlock))
return false;
+
+nolock_empty:
WRITE_ONCE(qdisc->empty, false);
} else if (qdisc_is_running(qdisc)) {
return false;
@@ -176,8 +207,13 @@ static inline bool qdisc_run_begin(struct Qdisc *qdisc)
static inline void qdisc_run_end(struct Qdisc *qdisc)
{
write_seqcount_end(&qdisc->running);
- if (qdisc->flags & TCQ_F_NOLOCK)
+ if (qdisc->flags & TCQ_F_NOLOCK) {
spin_unlock(&qdisc->seqlock);
+
+ if (unlikely(test_bit(__QDISC_STATE_NEED_RESCHEDULE,
+ &qdisc->state)))
+ __netif_schedule(qdisc);
+ }
}
static inline bool qdisc_may_bulk(const struct Qdisc *qdisc)
diff --git a/net/sched/sch_generic.c b/net/sched/sch_generic.c
index 44991ea..4953430 100644
--- a/net/sched/sch_generic.c
+++ b/net/sched/sch_generic.c
@@ -640,8 +640,10 @@ static struct sk_buff *pfifo_fast_dequeue(struct Qdisc *qdisc)
{
struct pfifo_fast_priv *priv = qdisc_priv(qdisc);
struct sk_buff *skb = NULL;
+ bool need_retry = true;
int band;
+retry:
for (band = 0; band < PFIFO_FAST_BANDS && !skb; band++) {
struct skb_array *q = band2list(priv, band);
@@ -652,6 +654,16 @@ static struct sk_buff *pfifo_fast_dequeue(struct Qdisc *qdisc)
}
if (likely(skb)) {
qdisc_update_stats_at_dequeue(qdisc, skb);
+ } else if (need_retry &&
+ test_and_clear_bit(__QDISC_STATE_NEED_RESCHEDULE,
+ &qdisc->state)) {
+ /* do another dequeuing after clearing the flag to
+ * avoid calling __netif_schedule().
+ */
+ smp_mb__after_atomic();
+ need_retry = false;
+
+ goto retry;
} else {
WRITE_ONCE(qdisc->empty, true);
}
--
2.7.4
[PATCH] app/testpmd: support Tx mbuf free on demand cmd
by Lijun Ou
From: Chengwen Feng <fengchengwen(a)huawei.com>
This patch supports the tx_done_cleanup command:
tx_done_cleanup port (port_id) (queue_id) (free_cnt)
The user must make sure there is no concurrent access to the same Tx
queue (e.g. rte_eth_tx_burst, rte_eth_dev_tx_queue_stop and so on)
when this command is executed.
Signed-off-by: Chengwen Feng <fengchengwen(a)huawei.com>
Signed-off-by: Lijun Ou <oulijun(a)huawei.com>
---
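For reference, an example invocation inside testpmd; the port, queue and
free_cnt values below are illustrative (free_cnt 0 conventionally requests
freeing as many used mbufs as possible):

	testpmd> tx_done_cleanup port 0 0 0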
app/test-pmd/cmdline.c | 91 +++++++++++++++++++++++++++++
doc/guides/rel_notes/release_21_05.rst | 2 +
doc/guides/testpmd_app_ug/testpmd_funcs.rst | 7 +++
3 files changed, 100 insertions(+)
diff --git a/app/test-pmd/cmdline.c b/app/test-pmd/cmdline.c
index 14110eb..832ae70 100644
--- a/app/test-pmd/cmdline.c
+++ b/app/test-pmd/cmdline.c
@@ -36,6 +36,7 @@
#include <rte_pci.h>
#include <rte_ether.h>
#include <rte_ethdev.h>
+#include <rte_ethdev_driver.h>
#include <rte_string_fns.h>
#include <rte_devargs.h>
#include <rte_flow.h>
@@ -675,6 +676,9 @@ static void cmd_help_long_parsed(void *parsed_result,
"set port (port_id) ptype_mask (ptype_mask)\n"
" set packet types classification for a specific port\n\n"
+ "tx_done_cleanup (port_id) (queue_id) (free_cnt)\n"
+ " Cleanup a tx queue's mbuf on a port\n\n"
+
"set port (port_id) queue-region region_id (value) "
"queue_start_index (value) queue_num (value)\n"
" Set a queue region on a port\n\n"
@@ -16910,6 +16914,92 @@ cmdline_parse_inst_t cmd_showport_macs = {
},
};
+/* *** tx_done_cleanup *** */
+struct cmd_tx_done_cleanup_result {
+ cmdline_fixed_string_t clean;
+ cmdline_fixed_string_t port;
+ uint16_t port_id;
+ uint16_t queue_id;
+ uint32_t free_cnt;
+};
+
+static void
+cmd_tx_done_cleanup_parsed(void *parsed_result,
+ __rte_unused struct cmdline *cl,
+ __rte_unused void *data)
+{
+ struct cmd_tx_done_cleanup_result *res = parsed_result;
+ struct rte_eth_dev *dev;
+ uint16_t port_id = res->port_id;
+ uint16_t queue_id = res->queue_id;
+ uint32_t free_cnt = res->free_cnt;
+ int ret;
+
+ if (!rte_eth_dev_is_valid_port(port_id)) {
+ printf("Invalid port_id %u\n", port_id);
+ return;
+ }
+
+ dev = &rte_eth_devices[port_id];
+ if (queue_id >= dev->data->nb_tx_queues) {
+ printf("Invalid TX queue_id %u\n", queue_id);
+ return;
+ }
+
+ if (dev->data->tx_queue_state[queue_id] !=
+ RTE_ETH_QUEUE_STATE_STARTED) {
+ printf("TX queue_id %u not started!\n", queue_id);
+ return;
+ }
+
+ /*
+ * rte_eth_tx_done_cleanup() is a dataplane API; the user must make sure
+ * there is no concurrent access to the same Tx queue (like
+ * rte_eth_tx_burst, rte_eth_dev_tx_queue_stop and so on) when this API
+ * is called.
+ */
+ ret = rte_eth_tx_done_cleanup(port_id, queue_id, free_cnt);
+ if (ret < 0) {
+ printf("Failed to cleanup mbuf for port %u TX queue %u "
+ "error desc: %s(%d)\n",
+ port_id, queue_id, strerror(-ret), ret);
+ return;
+ }
+
+ printf("Cleanup port %u TX queue %u mbuf nums: %u\n",
+ port_id, queue_id, ret);
+}
+
+cmdline_parse_token_string_t cmd_tx_done_cleanup_clean =
+ TOKEN_STRING_INITIALIZER(struct cmd_tx_done_cleanup_result, clean,
+ "tx_done_cleanup");
+cmdline_parse_token_string_t cmd_tx_done_cleanup_port =
+ TOKEN_STRING_INITIALIZER(struct cmd_tx_done_cleanup_result, port,
+ "port");
+cmdline_parse_token_num_t cmd_tx_done_cleanup_port_id =
+ TOKEN_NUM_INITIALIZER(struct cmd_tx_done_cleanup_result, port_id,
+ UINT16);
+cmdline_parse_token_num_t cmd_tx_done_cleanup_queue_id =
+ TOKEN_NUM_INITIALIZER(struct cmd_tx_done_cleanup_result, queue_id,
+ UINT16);
+cmdline_parse_token_num_t cmd_tx_done_cleanup_free_cnt =
+ TOKEN_NUM_INITIALIZER(struct cmd_tx_done_cleanup_result, free_cnt,
+ UINT32);
+
+cmdline_parse_inst_t cmd_tx_done_cleanup = {
+ .f = cmd_tx_done_cleanup_parsed,
+ .data = NULL,
+ .help_str = "tx_done_cleanup port <port_id> <queue_id> <free_cnt>",
+ .tokens = {
+ (void *)&cmd_tx_done_cleanup_clean,
+ (void *)&cmd_tx_done_cleanup_port,
+ (void *)&cmd_tx_done_cleanup_port_id,
+ (void *)&cmd_tx_done_cleanup_queue_id,
+ (void *)&cmd_tx_done_cleanup_free_cnt,
+ NULL,
+ },
+};
+
/* ******************************************************************************** */
/* list of instructions */
@@ -17035,6 +17125,7 @@ cmdline_parse_ctx_t main_ctx[] = {
(cmdline_parse_inst_t *)&cmd_config_rss_reta,
(cmdline_parse_inst_t *)&cmd_showport_reta,
(cmdline_parse_inst_t *)&cmd_showport_macs,
+ (cmdline_parse_inst_t *)&cmd_tx_done_cleanup,
(cmdline_parse_inst_t *)&cmd_config_burst,
(cmdline_parse_inst_t *)&cmd_config_thresh,
(cmdline_parse_inst_t *)&cmd_config_threshold,
diff --git a/doc/guides/rel_notes/release_21_05.rst b/doc/guides/rel_notes/release_21_05.rst
index 23f7f0b..8077573 100644
--- a/doc/guides/rel_notes/release_21_05.rst
+++ b/doc/guides/rel_notes/release_21_05.rst
@@ -69,6 +69,8 @@ New Features
* Added command to display Rx queue used descriptor count.
``show port (port_id) rxq (queue_id) desc used count``
+ * Added command to cleanup a Tx queue's mbuf on a port.
+ ``tx_done_cleanup port <port_id> <queue_id> <free_cnt>``
Removed Items
diff --git a/doc/guides/testpmd_app_ug/testpmd_funcs.rst b/doc/guides/testpmd_app_ug/testpmd_funcs.rst
index f59eb8a..39281f5 100644
--- a/doc/guides/testpmd_app_ug/testpmd_funcs.rst
+++ b/doc/guides/testpmd_app_ug/testpmd_funcs.rst
@@ -272,6 +272,13 @@ and ready to be processed by the driver on a given RX queue::
testpmd> show port (port_id) rxq (queue_id) desc used count
+cleanup txq mbufs
+~~~~~~~~~~~~~~~~~~~~~~~~
+
+Request the driver to free mbufs currently cached by the driver for a given port's
+Tx queue::
+ testpmd> tx_done_cleanup port (port_id) (queue_id) (free_cnt)
+
show config
~~~~~~~~~~~
--
2.7.4
[PATCH] ethdev: add queue state when retrieve queue information
by Lijun Ou
Currently, an upper-layer application can get the queue state only
through pointers such as dev->data->tx_queue_state[queue_id], which
is not the recommended way to access it. So this patch adds the queue
state to the information returned by the rte_eth_rx_queue_info_get and
rte_eth_tx_queue_info_get APIs.
Note: The hairpin queue is not supported by the above
rte_eth_*x_queue_info_get calls, so the queue state returned can only be
RTE_ETH_QUEUE_STATE_STARTED or RTE_ETH_QUEUE_STATE_STOPPED.
Note: After adding the queue_state field, the 'struct rte_eth_rxq_info'
size remains 128B and the 'struct rte_eth_txq_info' size remains 64B,
so the change is ABI compatible.
Signed-off-by: Chengwen Feng <fengchengwen(a)huawei.com>
Signed-off-by: Lijun Ou <oulijun(a)huawei.com>
---
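A short sketch of how an application could read the new field once this
change is in place; the port/queue ids and the minimal error handling are
illustrative only:

	struct rte_eth_rxq_info rx_qinfo;

	if (rte_eth_rx_queue_info_get(port_id, queue_id, &rx_qinfo) == 0)
		printf("rxq %u state: %s\n", queue_id,
		       rx_qinfo.queue_state == RTE_ETH_QUEUE_STATE_STARTED ?
		       "started" : "stopped");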
doc/guides/rel_notes/release_21_05.rst | 6 ++++++
lib/librte_ethdev/rte_ethdev.c | 3 +++
lib/librte_ethdev/rte_ethdev.h | 4 ++++
3 files changed, 13 insertions(+)
diff --git a/doc/guides/rel_notes/release_21_05.rst b/doc/guides/rel_notes/release_21_05.rst
index 43063e3..165b5f7 100644
--- a/doc/guides/rel_notes/release_21_05.rst
+++ b/doc/guides/rel_notes/release_21_05.rst
@@ -156,6 +156,12 @@ ABI Changes
* No ABI change that would break compatibility with 20.11.
+* Added new field ``queue_state`` to ``rte_eth_rxq_info`` structure
+ to provide the Rx queue state.
+
+* Added new field ``queue_state`` to ``rte_eth_txq_info`` structure
+ to provide the Tx queue state.
+
Known Issues
------------
diff --git a/lib/librte_ethdev/rte_ethdev.c b/lib/librte_ethdev/rte_ethdev.c
index 3059aa5..fbd10b2 100644
--- a/lib/librte_ethdev/rte_ethdev.c
+++ b/lib/librte_ethdev/rte_ethdev.c
@@ -5042,6 +5042,8 @@ rte_eth_rx_queue_info_get(uint16_t port_id, uint16_t queue_id,
memset(qinfo, 0, sizeof(*qinfo));
dev->dev_ops->rxq_info_get(dev, queue_id, qinfo);
+ qinfo->queue_state = dev->data->rx_queue_state[queue_id];
+
return 0;
}
@@ -5082,6 +5084,7 @@ rte_eth_tx_queue_info_get(uint16_t port_id, uint16_t queue_id,
memset(qinfo, 0, sizeof(*qinfo));
dev->dev_ops->txq_info_get(dev, queue_id, qinfo);
+ qinfo->queue_state = dev->data->tx_queue_state[queue_id];
return 0;
}
diff --git a/lib/librte_ethdev/rte_ethdev.h b/lib/librte_ethdev/rte_ethdev.h
index efda313..3b83c5a 100644
--- a/lib/librte_ethdev/rte_ethdev.h
+++ b/lib/librte_ethdev/rte_ethdev.h
@@ -1591,6 +1591,8 @@ struct rte_eth_rxq_info {
uint8_t scattered_rx; /**< scattered packets RX supported. */
uint16_t nb_desc; /**< configured number of RXDs. */
uint16_t rx_buf_size; /**< hardware receive buffer size. */
+ /** Queue state: STARTED(1) / STOPPED(0). */
+ uint8_t queue_state;
} __rte_cache_min_aligned;
/**
@@ -1600,6 +1602,8 @@ struct rte_eth_rxq_info {
struct rte_eth_txq_info {
struct rte_eth_txconf conf; /**< queue config parameters. */
uint16_t nb_desc; /**< configured number of TXDs. */
+ /** Queue state: STARTED(1) / STOPPED(0). */
+ uint8_t queue_state;
} __rte_cache_min_aligned;
/* Generic Burst mode flag definition, values can be ORed. */
--
2.7.4
[RFC PATCH v3 0/2] scheduler: expose the topology of clusters and add cluster scheduler
by Barry Song
ARM64 server chip Kunpeng 920 has 6 clusters in each NUMA node, and each
cluster has 4 cpus. All clusters share L3 cache data while each cluster
has local L3 tag. In addition, each cluster shares some
internal system bus. This means cache is much more affine inside one cluster
than across clusters.
+-----------------------------------+ +---------+
| +------+ +------+ +---------------------------+ |
| | CPU0 | | cpu1 | | +-----------+ | |
| +------+ +------+ | | | | |
| +----+ L3 | | |
| +------+ +------+ cluster | | tag | | |
| | CPU2 | | CPU3 | | | | | |
| +------+ +------+ | +-----------+ | |
| | | |
+-----------------------------------+ | |
+-----------------------------------+ | |
| +------+ +------+ +--------------------------+ |
| | | | | | +-----------+ | |
| +------+ +------+ | | | | |
| | | L3 | | |
| +------+ +------+ +----+ tag | | |
| | | | | | | | | |
| +------+ +------+ | +-----------+ | |
| | | |
+-----------------------------------+ | L3 |
| data |
+-----------------------------------+ | |
| +------+ +------+ | +-----------+ | |
| | | | | | | | | |
| +------+ +------+ +----+ L3 | | |
| | | tag | | |
| +------+ +------+ | | | | |
| | | | | ++ +-----------+ | |
| +------+ +------+ |---------------------------+ |
+-----------------------------------| | |
+-----------------------------------| | |
| +------+ +------+ +---------------------------+ |
| | | | | | +-----------+ | |
| +------+ +------+ | | | | |
| +----+ L3 | | |
| +------+ +------+ | | tag | | |
| | | | | | | | | |
| +------+ +------+ | +-----------+ | |
| | | |
+-----------------------------------+ | |
+-----------------------------------+ | |
| +------+ +------+ +--------------------------+ |
| | | | | | +-----------+ | |
| +------+ +------+ | | | | |
Through the following small program, you can see the performance impact of
running it in one cluster and across two clusters:
#include <pthread.h>

/* x and y live in the same small struct, so the two threads below
 * typically contend on the same cache line.
 */
struct foo {
	int x;
	int y;
} f;

void *thread1_fun(void *param)
{
	int s = 0;
	for (int i = 0; i < 0xfffffff; i++)
		s += f.x;
	return NULL;
}

void *thread2_fun(void *param)
{
	int s = 0;
	for (int i = 0; i < 0xfffffff; i++)
		f.y++;
	return NULL;
}

int main(int argc, char **argv)
{
	pthread_t tid1, tid2;

	pthread_create(&tid1, NULL, thread1_fun, NULL);
	pthread_create(&tid2, NULL, thread2_fun, NULL);
	pthread_join(tid1, NULL);
	pthread_join(tid2, NULL);
	return 0;
}
While running this program in one cluster, it takes:
$ time taskset -c 0,1 ./a.out
real 0m0.832s
user 0m1.649s
sys 0m0.004s
As a contrast, it takes much more time if we run the same program
in two clusters:
$ time taskset -c 0,4 ./a.out
real 0m1.133s
user 0m1.960s
sys 0m0.000s
0.832 / 1.133 = 73%; it is a huge difference.
Also, hackbench running on 4 cpus within a single cluster versus 4 cpus spread
across different clusters shows a large contrast:
* inside a cluster:
root@ubuntu:~# taskset -c 0,1,2,3 hackbench -p -T -l 20000 -g 1
Running in threaded mode with 1 groups using 40 file descriptors each
(== 40 tasks)
Each sender will pass 20000 messages of 100 bytes
Time: 4.285
* across clusters:
root@ubuntu:~# taskset -c 0,4,8,12 hackbench -p -T -l 20000 -g 1
Running in threaded mode with 1 groups using 40 file descriptors each
(== 40 tasks)
Each sender will pass 20000 messages of 100 bytes
Time: 5.524
The score is 4.285 vs 5.524; a shorter time means better performance.
All this testing implies that we should let the Linux scheduler use
this topology to make better load balancing and WAKE_AFFINE decisions.
However, the current scheduler has no idea of clusters at all.
This patchset first exposes the cluster topology, then adds a sched
domain for clusters. While it is named "cluster", architectures and
machines can define the exact meaning of cluster as long as they have
some resource shared below the LLC and can leverage the affinity
of this resource to achieve better scheduling performance.
-v3:
- rebased against 5.11-rc2
- with respect to the comments of Valentin Schneider, Peter Zijlstra,
Vincent Guittot and Mel Gorman etc.
* moved the scheduler changes from arm64 to the common place for all
architectures.
* added SD_SHARE_CLS_RESOURCES sd_flags specifying the sched_domain
where select_idle_cpu() should begin to scan from
* removed the redundant select_idle_cluster() function since all code is
in select_idle_cpu() now. It also avoids scanning cluster cpus
twice, as the v2 code did;
* redid the hackbench runs within one NUMA node after the above changes
Valentin suggested that select_idle_cpu() could begin to scan from the
domain with SD_SHARE_PKG_RESOURCES. Changing it like this might be too
aggressive and limit the spreading of tasks. Thus, this patch lets
architectures and machines decide where to start by adding
a new SD_SHARE_CLS_RESOURCES flag.
Barry Song (1):
scheduler: add scheduler level for clusters
Jonathan Cameron (1):
topology: Represent clusters of CPUs within a die.
Documentation/admin-guide/cputopology.rst | 26 +++++++++++---
arch/arm64/Kconfig | 7 ++++
arch/arm64/kernel/topology.c | 2 ++
drivers/acpi/pptt.c | 60 +++++++++++++++++++++++++++++++
drivers/base/arch_topology.c | 14 ++++++++
drivers/base/topology.c | 10 ++++++
include/linux/acpi.h | 5 +++
include/linux/arch_topology.h | 5 +++
include/linux/sched/sd_flags.h | 9 +++++
include/linux/sched/topology.h | 7 ++++
include/linux/topology.h | 13 +++++++
kernel/sched/fair.c | 27 ++++++++++----
kernel/sched/topology.c | 6 ++++
13 files changed, 181 insertions(+), 10 deletions(-)
--
2.7.4
[PATCH v1 0/2] scsi: libsas: few clean up patches
by Luo Jiaxing
Two types of errors are detected by checkpatch:
1. Misaligned switch and case statements
2. Improper use of whitespace
Here are the cleanup patches.
Luo Jiaxing (2):
scsi: libsas: make switch and case at the same indent in
sas_to_ata_err()
scsi: libsas: clean up for white spaces
drivers/scsi/libsas/sas_ata.c | 74 ++++++++++++++++++--------------------
drivers/scsi/libsas/sas_discover.c | 2 +-
drivers/scsi/libsas/sas_expander.c | 15 ++++----
3 files changed, 43 insertions(+), 48 deletions(-)
--
2.7.4
[PATCH 0/3] Fixes for testpmd
by Lijun Ou
This series adds two bug fixes and one print-style cleanup for testpmd.
Hongbo Zheng (1):
app/testpmd: use of Rx/Tx in testpmd
Huisong Li (2):
app/testpmd: fix forwarding configuration when DCB test
app/testpmd: remove forwarding config from parsing Rx and Tx
app/test-pmd/cmdline.c | 106 ++++++++++++++++----------------
app/test-pmd/config.c | 147 +++++++++++++++++++++++++--------------------
app/test-pmd/csumonly.c | 22 +++----
app/test-pmd/icmpecho.c | 2 +-
app/test-pmd/ieee1588fwd.c | 18 +++---
app/test-pmd/parameters.c | 50 +++++++--------
app/test-pmd/testpmd.c | 132 ++++++++++++++++++++--------------------
app/test-pmd/testpmd.h | 28 ++++-----
app/test-pmd/txonly.c | 2 +-
9 files changed, 263 insertions(+), 244 deletions(-)
--
2.7.4