Kernel
backport psi feature from upstream 5.4
bugzilla: https://gitee.com/openeuler/kernel/issues/I47QS2
Baruch Siach (1):
psi: fix reference to kernel commandline enable
Dan Schatzberg (1):
kernel/sched/psi.c: expose pressure metrics on root cgroup
Johannes Weiner (11):
sched: loadavg: consolidate LOAD_INT, LOAD_FRAC, CALC_LOAD
sched: loadavg: make calc_load_n() public
sched: sched.h: make rq locking and clock functions available in
stats.h
sched: introduce this_rq_lock_irq()
psi: pressure stall information for CPU, memory, and IO
psi: cgroup support
psi: make disabling/enabling easier for vendor kernels
psi: fix aggregation idle shut-off
psi: avoid divide-by-zero crash inside virtual machines
fs: kernfs: add poll file operation
sched/psi: Fix sampling error and rare div0 crashes with cgroups and
high uptime
Josef Bacik (1):
blk-iolatency: use a percentile approache for ssd's
Liu Xinpeng (2):
psi:enable psi in config
psi:avoid kabi change
Olof Johansson (1):
kernel/sched/psi.c: simplify cgroup_move_task()
Suren Baghdasaryan (6):
psi: introduce state_mask to represent stalled psi states
psi: make psi_enable static
psi: rename psi fields in preparation for psi trigger addition
psi: split update_stats into parts
psi: track changed states
include/: refactor headers to allow kthread.h inclusion in psi_types.h
Documentation/accounting/psi.txt | 73 +++
Documentation/admin-guide/cgroup-v2.rst | 18 +
Documentation/admin-guide/kernel-parameters.txt | 4 +
arch/arm64/configs/openeuler_defconfig | 2 +
arch/powerpc/platforms/cell/cpufreq_spudemand.c | 2 +-
arch/powerpc/platforms/cell/spufs/sched.c | 9 +-
arch/s390/appldata/appldata_os.c | 4 -
arch/x86/configs/openeuler_defconfig | 2 +
block/blk-iolatency.c | 183 +++++-
drivers/cpuidle/governors/menu.c | 4 -
drivers/spi/spi-rockchip.c | 1 +
fs/kernfs/file.c | 31 +-
fs/proc/loadavg.c | 3 -
include/linux/cgroup-defs.h | 12 +
include/linux/cgroup.h | 17 +
include/linux/kernfs.h | 8 +
include/linux/kthread.h | 4 +
include/linux/psi.h | 55 ++
include/linux/psi_types.h | 95 +++
include/linux/sched.h | 13 +
include/linux/sched/loadavg.h | 24 +-
init/Kconfig | 28 +
kernel/cgroup/cgroup.c | 55 +-
kernel/debug/kdb/kdb_main.c | 7 +-
kernel/fork.c | 4 +
kernel/kthread.c | 3 +
kernel/sched/Makefile | 1 +
kernel/sched/core.c | 16 +-
kernel/sched/loadavg.c | 139 ++--
kernel/sched/psi.c | 823 ++++++++++++++++++++++++
kernel/sched/sched.h | 178 ++---
kernel/sched/stats.h | 86 +++
kernel/workqueue.c | 23 +
kernel/workqueue_internal.h | 6 +-
mm/compaction.c | 5 +
mm/filemap.c | 11 +
mm/page_alloc.c | 9 +
mm/vmscan.c | 9 +
38 files changed, 1726 insertions(+), 241 deletions(-)
create mode 100644 Documentation/accounting/psi.txt
create mode 100644 include/linux/psi.h
create mode 100644 include/linux/psi_types.h
create mode 100644 kernel/sched/psi.c
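For context, once this backport is applied and psi is enabled (e.g. psi=1 on the kernel command line when CONFIG_PSI_DEFAULT_DISABLED is set), the aggregated pressure metrics can be read from /proc/pressure/cpu, /proc/pressure/memory and /proc/pressure/io. A minimal user-space reader, shown only as an illustration and not part of this series:

#include <stdio.h>

int main(void)
{
	char line[256];
	FILE *f = fopen("/proc/pressure/memory", "r");

	if (!f) {
		perror("fopen");	/* psi disabled or not built in */
		return 1;
	}
	/* Expect "some ..." and "full ..." lines with avg10/avg60/avg300/total */
	while (fgets(line, sizeof(line), f))
		fputs(line, stdout);
	fclose(f);
	return 0;
}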
--
1.8.3.1
From: zhangguijiang <zhangguijiang(a)huawei.com>
ascend inclusion
category: feature
feature: Ascend emmc adaption
bugzilla: https://gitee.com/openeuler/kernel/issues/I4F4LL
CVE: NA
--------------------
To identify the Ascend HiSilicon eMMC chip, we add a customized property
to the dts. This patch adds an interface to read that property. We also
provide a switch, CONFIG_ASCEND_HISI_MMC, which lets you disable our
modifications entirely.
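For context, a minimal sketch of how a host controller driver could use the new interface; the helper name hi_mci_probe_setup() and its placement are hypothetical and not part of this patch:

#include <linux/mmc/host.h>
#include <linux/of.h>

static void hi_mci_probe_setup(struct mmc_host *mmc)
{
	/*
	 * mmc_is_ascend_customized() returns non-zero only when the dts
	 * node of the parent device carries the "customized" property
	 * (checked via of_find_property() in the hunk below).
	 */
	if (!mmc_is_ascend_customized(mmc->parent))
		return;		/* stock behaviour on other platforms */

	/* Ascend-specific initialisation would go here. */
}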
Signed-off-by: zhangguijiang <zhangguijiang(a)huawei.com>
Reviewed-by: Ding Tianhong <dingtianhong(a)huawei.com>
Signed-off-by: Yang Yingliang <yangyingliang(a)huawei.com>
---
drivers/mmc/Kconfig | 10 ++++
drivers/mmc/core/host.c | 43 ++++++++++++---
include/linux/mmc/host.h | 115 +++++++++++++++++++++++++++++++++++++++
include/linux/mmc/pm.h | 1 +
4 files changed, 162 insertions(+), 7 deletions(-)
diff --git a/drivers/mmc/Kconfig b/drivers/mmc/Kconfig
index ec21388311db2..8b29ecadd1862 100644
--- a/drivers/mmc/Kconfig
+++ b/drivers/mmc/Kconfig
@@ -12,6 +12,16 @@ menuconfig MMC
If you want MMC/SD/SDIO support, you should say Y here and
also to your specific host controller driver.
+config ASCEND_HISI_MMC
+ bool "Ascend HiSilicon MMC card support"
+ depends on MMC
+ default n
+ help
+ This selects for Hisilicon SoC specific extensions to the
+ Synopsys DesignWare Memory Card Interface driver.
+ You should select this option if you want mmc support on
+ Ascend platform.
+
if MMC
source "drivers/mmc/core/Kconfig"
diff --git a/drivers/mmc/core/host.c b/drivers/mmc/core/host.c
index dd1c14d8f6863..b29ee31e7865e 100644
--- a/drivers/mmc/core/host.c
+++ b/drivers/mmc/core/host.c
@@ -348,6 +348,11 @@ int mmc_of_parse(struct mmc_host *host)
EXPORT_SYMBOL(mmc_of_parse);
+static inline int mmc_is_ascend_hi_mci_1(struct device *dev)
+{
+ return !strncmp(dev_name(dev), "hi_mci.1", strlen("hi_mci.1"));
+}
+
/**
* mmc_alloc_host - initialise the per-host structure.
* @extra: sizeof private data structure
@@ -374,7 +379,10 @@ struct mmc_host *mmc_alloc_host(int extra, struct device *dev)
}
host->index = err;
-
+ if (mmc_is_ascend_customized(dev)) {
+ if (mmc_is_ascend_hi_mci_1(dev))
+ host->index = 1;
+ }
dev_set_name(&host->class_dev, "mmc%d", host->index);
host->parent = dev;
@@ -383,10 +391,11 @@ struct mmc_host *mmc_alloc_host(int extra, struct device *dev)
device_initialize(&host->class_dev);
device_enable_async_suspend(&host->class_dev);
- if (mmc_gpio_alloc(host)) {
- put_device(&host->class_dev);
- return NULL;
- }
+ if (!mmc_is_ascend_customized(host->parent))
+ if (mmc_gpio_alloc(host)) {
+ put_device(&host->class_dev);
+ return NULL;
+ }
spin_lock_init(&host->lock);
init_waitqueue_head(&host->wq);
@@ -439,7 +448,9 @@ int mmc_add_host(struct mmc_host *host)
#endif
mmc_start_host(host);
- mmc_register_pm_notifier(host);
+ if (!mmc_is_ascend_customized(host->parent) ||
+ !(host->pm_flags & MMC_PM_IGNORE_PM_NOTIFY))
+ mmc_register_pm_notifier(host);
return 0;
}
@@ -456,7 +467,9 @@ EXPORT_SYMBOL(mmc_add_host);
*/
void mmc_remove_host(struct mmc_host *host)
{
- mmc_unregister_pm_notifier(host);
+ if (!mmc_is_ascend_customized(host->parent) ||
+ !(host->pm_flags & MMC_PM_IGNORE_PM_NOTIFY))
+ mmc_unregister_pm_notifier(host);
mmc_stop_host(host);
#ifdef CONFIG_DEBUG_FS
@@ -483,3 +496,19 @@ void mmc_free_host(struct mmc_host *host)
}
EXPORT_SYMBOL(mmc_free_host);
+
+
+int mmc_is_ascend_customized(struct device *dev)
+{
+#ifdef CONFIG_ASCEND_HISI_MMC
+ static int is_ascend_customized = -1;
+
+ if (is_ascend_customized == -1)
+ is_ascend_customized = ((dev == NULL) ? 0 :
+ of_find_property(dev->of_node, "customized", NULL) != NULL);
+ return is_ascend_customized;
+#else
+ return 0;
+#endif
+}
+EXPORT_SYMBOL(mmc_is_ascend_customized);
diff --git a/include/linux/mmc/host.h b/include/linux/mmc/host.h
index 7e8e5b20e82b0..2cd5a73ab12a2 100644
--- a/include/linux/mmc/host.h
+++ b/include/linux/mmc/host.h
@@ -19,6 +19,9 @@
#include <linux/mmc/pm.h>
#include <linux/dma-direction.h>
+#include <linux/jiffies.h>
+#include <linux/version.h>
+
struct mmc_ios {
unsigned int clock; /* clock rate */
unsigned short vdd;
@@ -63,6 +66,7 @@ struct mmc_ios {
#define MMC_TIMING_MMC_DDR52 8
#define MMC_TIMING_MMC_HS200 9
#define MMC_TIMING_MMC_HS400 10
+#define MMC_TIMING_NEW_SD MMC_TIMING_UHS_SDR12
unsigned char signal_voltage; /* signalling voltage (1.8V or 3.3V) */
@@ -78,7 +82,25 @@ struct mmc_ios {
#define MMC_SET_DRIVER_TYPE_D 3
bool enhanced_strobe; /* hs400es selection */
+#ifdef CONFIG_ASCEND_HISI_MMC
+ unsigned int clock_store; /*store the clock before power off*/
+#endif
+};
+
+#ifdef CONFIG_ASCEND_HISI_MMC
+struct mmc_cmdq_host_ops {
+ int (*enable)(struct mmc_host *mmc);
+ int (*disable)(struct mmc_host *mmc, bool soft);
+ int (*restore_irqs)(struct mmc_host *mmc);
+ int (*request)(struct mmc_host *mmc, struct mmc_request *mrq);
+ int (*halt)(struct mmc_host *mmc, bool halt);
+ void (*post_req)(struct mmc_host *mmc, struct mmc_request *mrq,
+ int err);
+ void (*disable_immediately)(struct mmc_host *mmc);
+ int (*clear_and_halt)(struct mmc_host *mmc);
};
+#endif
+
struct mmc_host;
@@ -168,6 +190,12 @@ struct mmc_host_ops {
*/
int (*multi_io_quirk)(struct mmc_card *card,
unsigned int direction, int blk_size);
+#ifdef CONFIG_ASCEND_HISI_MMC
+ /* Slow down clk for ascend chip SD cards */
+ void (*slowdown_clk)(struct mmc_host *host, int timing);
+ int (*enable_enhanced_strobe)(struct mmc_host *host);
+ int (*send_cmd_direct)(struct mmc_host *host, struct mmc_request *mrq);
+#endif
};
struct mmc_cqe_ops {
@@ -255,6 +283,30 @@ struct mmc_context_info {
wait_queue_head_t wait;
};
+#ifdef CONFIG_ASCEND_HISI_MMC
+/**
+ * mmc_cmdq_context_info - describes the contexts of cmdq
+ * @active_reqs requests being processed
+ * @active_dcmd dcmd in progress, don't issue any
+ * more dcmd requests
+ * @rpmb_in_wait do not pull any more reqs till rpmb is handled
+ * @cmdq_state state of cmdq engine
+ * @req_starved completion should invoke the request_fn since
+ * no tags were available
+ * @cmdq_ctx_lock acquire this before accessing this structure
+ */
+struct mmc_cmdq_context_info {
+ unsigned long active_reqs; /* in-flight requests */
+ bool active_dcmd;
+ bool rpmb_in_wait;
+ unsigned long curr_state;
+
+ /* no free tag available */
+ unsigned long req_starved;
+ spinlock_t cmdq_ctx_lock;
+};
+#endif
+
struct regulator;
struct mmc_pwrseq;
@@ -328,6 +380,9 @@ struct mmc_host {
#define MMC_CAP_UHS_SDR50 (1 << 18) /* Host supports UHS SDR50 mode */
#define MMC_CAP_UHS_SDR104 (1 << 19) /* Host supports UHS SDR104 mode */
#define MMC_CAP_UHS_DDR50 (1 << 20) /* Host supports UHS DDR50 mode */
+#ifdef CONFIG_ASCEND_HISI_MMC
+#define MMC_CAP_RUNTIME_RESUME (1 << 20) /* Resume at runtime_resume. */
+#endif
#define MMC_CAP_UHS (MMC_CAP_UHS_SDR12 | MMC_CAP_UHS_SDR25 | \
MMC_CAP_UHS_SDR50 | MMC_CAP_UHS_SDR104 | \
MMC_CAP_UHS_DDR50)
@@ -368,6 +423,34 @@ struct mmc_host {
#define MMC_CAP2_CQE (1 << 23) /* Has eMMC command queue engine */
#define MMC_CAP2_CQE_DCMD (1 << 24) /* CQE can issue a direct command */
#define MMC_CAP2_AVOID_3_3V (1 << 25) /* Host must negotiate down from 3.3V */
+#ifdef CONFIG_ASCEND_HISI_MMC
+#define MMC_CAP2_CACHE_CTRL (1 << 1) /* Allow cache control */
+#define MMC_CAP2_NO_MULTI_READ (1 << 3) /* Multiblock read don't work */
+#define MMC_CAP2_NO_SLEEP_CMD (1 << 4) /* Don't allow sleep command */
+#define MMC_CAP2_BROKEN_VOLTAGE (1 << 7) /* Use the broken voltage */
+#define MMC_CAP2_DETECT_ON_ERR (1 << 8) /* I/O err check card removal */
+#define MMC_CAP2_HC_ERASE_SZ (1 << 9) /* High-capacity erase size */
+#define MMC_CAP2_PACKED_RD (1 << 12) /* Allow packed read */
+#define MMC_CAP2_PACKED_WR (1 << 13) /* Allow packed write */
+#define MMC_CAP2_PACKED_CMD (MMC_CAP2_PACKED_RD | \
+ MMC_CAP2_PACKED_WR)
+#define MMC_CAP2_CMD_QUEUE (1 << 18) /* support eMMC command queue */
+#define MMC_CAP2_ENHANCED_STROBE (1 << 19)
+#define MMC_CAP2_CACHE_FLUSH_BARRIER (1 << 20)
+/* Allow background operations auto enable control */
+#define MMC_CAP2_BKOPS_AUTO_CTRL (1 << 21)
+/* Allow background operations manual enable control */
+#define MMC_CAP2_BKOPS_MANUAL_CTRL (1 << 22)
+
+/* host is connected by via modem through sdio */
+#define MMC_CAP2_SUPPORT_VIA_MODEM (1 << 26)
+/* host is connected by wifi through sdio */
+#define MMC_CAP2_SUPPORT_WIFI (1 << 27)
+/* host is connected to 1102 wifi */
+#define MMC_CAP2_SUPPORT_WIFI_CMD11 (1 << 28)
+/* host do not support low power for wifi*/
+#define MMC_CAP2_WIFI_NO_LOWPWR (1 << 29)
+#endif
int fixed_drv_type; /* fixed driver type for non-removable media */
@@ -461,6 +544,12 @@ struct mmc_host {
bool cqe_on;
unsigned long private[0] ____cacheline_aligned;
+#ifdef CONFIG_ASCEND_HISI_MMC
+ const struct mmc_cmdq_host_ops *cmdq_ops;
+ int sdio_present;
+ unsigned int cmdq_slots;
+ struct mmc_cmdq_context_info cmdq_ctx;
+#endif
};
struct device_node;
@@ -588,4 +677,30 @@ static inline enum dma_data_direction mmc_get_dma_dir(struct mmc_data *data)
int mmc_send_tuning(struct mmc_host *host, u32 opcode, int *cmd_error);
int mmc_abort_tuning(struct mmc_host *host, u32 opcode);
+#ifdef CONFIG_ASCEND_HISI_MMC
+int mmc_cache_ctrl(struct mmc_host *host, u8 enable);
+int mmc_card_awake(struct mmc_host *host);
+int mmc_card_sleep(struct mmc_host *host);
+int mmc_card_can_sleep(struct mmc_host *host);
+#else
+static inline int mmc_cache_ctrl(struct mmc_host *host, u8 enable)
+{
+ return 0;
+}
+static inline int mmc_card_awake(struct mmc_host *host)
+{
+ return 0;
+}
+static inline int mmc_card_sleep(struct mmc_host *host)
+{
+ return 0;
+}
+static inline int mmc_card_can_sleep(struct mmc_host *host)
+{
+ return 0;
+}
+#endif
+
+int mmc_is_ascend_customized(struct device *dev);
+
#endif /* LINUX_MMC_HOST_H */
diff --git a/include/linux/mmc/pm.h b/include/linux/mmc/pm.h
index 4a139204c20c0..6e2d6a135c7e0 100644
--- a/include/linux/mmc/pm.h
+++ b/include/linux/mmc/pm.h
@@ -26,5 +26,6 @@ typedef unsigned int mmc_pm_flag_t;
#define MMC_PM_KEEP_POWER (1 << 0) /* preserve card power during suspend */
#define MMC_PM_WAKE_SDIO_IRQ (1 << 1) /* wake up host system on SDIO IRQ assertion */
+#define MMC_PM_IGNORE_PM_NOTIFY (1 << 2) /* ignore mmc pm notify */
#endif /* LINUX_MMC_PM_H */
--
2.25.1
From: zhangguijiang <zhangguijiang(a)huawei.com>
ascend inclusion
category: feature
feature: Ascend emmc adaption
bugzilla: https://gitee.com/openeuler/kernel/issues/I4F4LL
CVE: NA
--------------------
To identify the Ascend HiSilicon eMMC chip, we add a customized property
to the dts. This patch adds an interface to read that property. We also
provide a switch, CONFIG_ASCEND_HISI_MMC, which lets you disable our
modifications entirely.
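For context, a minimal sketch of how a host driver on a customized Ascend platform could opt out of the MMC PM notifier via the new MMC_PM_IGNORE_PM_NOTIFY flag; the helper name hi_mci_setup_pm() is hypothetical and not part of this patch:

#include <linux/mmc/host.h>
#include <linux/mmc/pm.h>

/*
 * With MMC_PM_IGNORE_PM_NOTIFY set, the gated calls in mmc_add_host()
 * and mmc_remove_host() in the hunks below skip the PM notifier.
 */
static void hi_mci_setup_pm(struct mmc_host *mmc)
{
	if (mmc_is_ascend_customized(mmc->parent))
		mmc->pm_flags |= MMC_PM_IGNORE_PM_NOTIFY;
}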
Signed-off-by: zhangguijiang <zhangguijiang(a)huawei.com>
Reviewed-by: Ding Tianhong <dingtianhong(a)huawei.com>
Signed-off-by: Yang Yingliang <yangyingliang(a)huawei.com>
---
drivers/mmc/Kconfig | 10 ++++
drivers/mmc/core/host.c | 47 +++++++++++++---
include/linux/mmc/host.h | 115 +++++++++++++++++++++++++++++++++++++++
include/linux/mmc/pm.h | 1 +
4 files changed, 164 insertions(+), 9 deletions(-)
diff --git a/drivers/mmc/Kconfig b/drivers/mmc/Kconfig
index ec21388311db2..8b29ecadd1862 100644
--- a/drivers/mmc/Kconfig
+++ b/drivers/mmc/Kconfig
@@ -12,6 +12,16 @@ menuconfig MMC
If you want MMC/SD/SDIO support, you should say Y here and
also to your specific host controller driver.
+config ASCEND_HISI_MMC
+ bool "Ascend HiSilicon MMC card support"
+ depends on MMC
+ default n
+ help
+ This selects for Hisilicon SoC specific extensions to the
+ Synopsys DesignWare Memory Card Interface driver.
+ You should select this option if you want mmc support on
+ Ascend platform.
+
if MMC
source "drivers/mmc/core/Kconfig"
diff --git a/drivers/mmc/core/host.c b/drivers/mmc/core/host.c
index f57f5de542064..69cc778706855 100644
--- a/drivers/mmc/core/host.c
+++ b/drivers/mmc/core/host.c
@@ -348,6 +348,11 @@ int mmc_of_parse(struct mmc_host *host)
EXPORT_SYMBOL(mmc_of_parse);
+static inline int mmc_is_ascend_hi_mci_1(struct device *dev)
+{
+ return !strncmp(dev_name(dev), "hi_mci.1", strlen("hi_mci.1"));
+}
+
/**
* mmc_alloc_host - initialise the per-host structure.
* @extra: sizeof private data structure
@@ -374,7 +379,10 @@ struct mmc_host *mmc_alloc_host(int extra, struct device *dev)
}
host->index = err;
-
+ if (mmc_is_ascend_customized(dev)) {
+ if (mmc_is_ascend_hi_mci_1(dev))
+ host->index = 1;
+ }
dev_set_name(&host->class_dev, "mmc%d", host->index);
host->parent = dev;
@@ -383,12 +391,13 @@ struct mmc_host *mmc_alloc_host(int extra, struct device *dev)
device_initialize(&host->class_dev);
device_enable_async_suspend(&host->class_dev);
- if (mmc_gpio_alloc(host)) {
- put_device(&host->class_dev);
- ida_simple_remove(&mmc_host_ida, host->index);
- kfree(host);
- return NULL;
- }
+ if (!mmc_is_ascend_customized(host->parent))
+ if (mmc_gpio_alloc(host)) {
+ put_device(&host->class_dev);
+ ida_simple_remove(&mmc_host_ida, host->index);
+ kfree(host);
+ return NULL;
+ }
spin_lock_init(&host->lock);
init_waitqueue_head(&host->wq);
@@ -441,7 +450,9 @@ int mmc_add_host(struct mmc_host *host)
#endif
mmc_start_host(host);
- mmc_register_pm_notifier(host);
+ if (!mmc_is_ascend_customized(host->parent) ||
+ !(host->pm_flags & MMC_PM_IGNORE_PM_NOTIFY))
+ mmc_register_pm_notifier(host);
return 0;
}
@@ -458,7 +469,9 @@ EXPORT_SYMBOL(mmc_add_host);
*/
void mmc_remove_host(struct mmc_host *host)
{
- mmc_unregister_pm_notifier(host);
+ if (!mmc_is_ascend_customized(host->parent) ||
+ !(host->pm_flags & MMC_PM_IGNORE_PM_NOTIFY))
+ mmc_unregister_pm_notifier(host);
mmc_stop_host(host);
#ifdef CONFIG_DEBUG_FS
@@ -485,3 +498,19 @@ void mmc_free_host(struct mmc_host *host)
}
EXPORT_SYMBOL(mmc_free_host);
+
+
+int mmc_is_ascend_customized(struct device *dev)
+{
+#ifdef CONFIG_ASCEND_HISI_MMC
+ static int is_ascend_customized = -1;
+
+ if (is_ascend_customized == -1)
+ is_ascend_customized = ((dev == NULL) ? 0 :
+ of_find_property(dev->of_node, "customized", NULL) != NULL);
+ return is_ascend_customized;
+#else
+ return 0;
+#endif
+}
+EXPORT_SYMBOL(mmc_is_ascend_customized);
diff --git a/include/linux/mmc/host.h b/include/linux/mmc/host.h
index 840462ed1ec7e..78b4d0a813b71 100644
--- a/include/linux/mmc/host.h
+++ b/include/linux/mmc/host.h
@@ -19,6 +19,9 @@
#include <linux/mmc/pm.h>
#include <linux/dma-direction.h>
+#include <linux/jiffies.h>
+#include <linux/version.h>
+
struct mmc_ios {
unsigned int clock; /* clock rate */
unsigned short vdd;
@@ -63,6 +66,7 @@ struct mmc_ios {
#define MMC_TIMING_MMC_DDR52 8
#define MMC_TIMING_MMC_HS200 9
#define MMC_TIMING_MMC_HS400 10
+#define MMC_TIMING_NEW_SD MMC_TIMING_UHS_SDR12
unsigned char signal_voltage; /* signalling voltage (1.8V or 3.3V) */
@@ -78,7 +82,25 @@ struct mmc_ios {
#define MMC_SET_DRIVER_TYPE_D 3
bool enhanced_strobe; /* hs400es selection */
+#ifdef CONFIG_ASCEND_HISI_MMC
+ unsigned int clock_store; /*store the clock before power off*/
+#endif
+};
+
+#ifdef CONFIG_ASCEND_HISI_MMC
+struct mmc_cmdq_host_ops {
+ int (*enable)(struct mmc_host *mmc);
+ int (*disable)(struct mmc_host *mmc, bool soft);
+ int (*restore_irqs)(struct mmc_host *mmc);
+ int (*request)(struct mmc_host *mmc, struct mmc_request *mrq);
+ int (*halt)(struct mmc_host *mmc, bool halt);
+ void (*post_req)(struct mmc_host *mmc, struct mmc_request *mrq,
+ int err);
+ void (*disable_immediately)(struct mmc_host *mmc);
+ int (*clear_and_halt)(struct mmc_host *mmc);
};
+#endif
+
struct mmc_host;
@@ -168,6 +190,12 @@ struct mmc_host_ops {
*/
int (*multi_io_quirk)(struct mmc_card *card,
unsigned int direction, int blk_size);
+#ifdef CONFIG_ASCEND_HISI_MMC
+ /* Slow down clk for ascend chip SD cards */
+ void (*slowdown_clk)(struct mmc_host *host, int timing);
+ int (*enable_enhanced_strobe)(struct mmc_host *host);
+ int (*send_cmd_direct)(struct mmc_host *host, struct mmc_request *mrq);
+#endif
};
struct mmc_cqe_ops {
@@ -255,6 +283,30 @@ struct mmc_context_info {
wait_queue_head_t wait;
};
+#ifdef CONFIG_ASCEND_HISI_MMC
+/**
+ * mmc_cmdq_context_info - describes the contexts of cmdq
+ * @active_reqs requests being processed
+ * @active_dcmd dcmd in progress, don't issue any
+ * more dcmd requests
+ * @rpmb_in_wait do not pull any more reqs till rpmb is handled
+ * @cmdq_state state of cmdq engine
+ * @req_starved completion should invoke the request_fn since
+ * no tags were available
+ * @cmdq_ctx_lock acquire this before accessing this structure
+ */
+struct mmc_cmdq_context_info {
+ unsigned long active_reqs; /* in-flight requests */
+ bool active_dcmd;
+ bool rpmb_in_wait;
+ unsigned long curr_state;
+
+ /* no free tag available */
+ unsigned long req_starved;
+ spinlock_t cmdq_ctx_lock;
+};
+#endif
+
struct regulator;
struct mmc_pwrseq;
@@ -328,6 +380,9 @@ struct mmc_host {
#define MMC_CAP_UHS_SDR50 (1 << 18) /* Host supports UHS SDR50 mode */
#define MMC_CAP_UHS_SDR104 (1 << 19) /* Host supports UHS SDR104 mode */
#define MMC_CAP_UHS_DDR50 (1 << 20) /* Host supports UHS DDR50 mode */
+#ifdef CONFIG_ASCEND_HISI_MMC
+#define MMC_CAP_RUNTIME_RESUME (1 << 20) /* Resume at runtime_resume. */
+#endif
#define MMC_CAP_UHS (MMC_CAP_UHS_SDR12 | MMC_CAP_UHS_SDR25 | \
MMC_CAP_UHS_SDR50 | MMC_CAP_UHS_SDR104 | \
MMC_CAP_UHS_DDR50)
@@ -367,6 +422,34 @@ struct mmc_host {
#define MMC_CAP2_CQE (1 << 23) /* Has eMMC command queue engine */
#define MMC_CAP2_CQE_DCMD (1 << 24) /* CQE can issue a direct command */
#define MMC_CAP2_AVOID_3_3V (1 << 25) /* Host must negotiate down from 3.3V */
+#ifdef CONFIG_ASCEND_HISI_MMC
+#define MMC_CAP2_CACHE_CTRL (1 << 1) /* Allow cache control */
+#define MMC_CAP2_NO_MULTI_READ (1 << 3) /* Multiblock read don't work */
+#define MMC_CAP2_NO_SLEEP_CMD (1 << 4) /* Don't allow sleep command */
+#define MMC_CAP2_BROKEN_VOLTAGE (1 << 7) /* Use the broken voltage */
+#define MMC_CAP2_DETECT_ON_ERR (1 << 8) /* I/O err check card removal */
+#define MMC_CAP2_HC_ERASE_SZ (1 << 9) /* High-capacity erase size */
+#define MMC_CAP2_PACKED_RD (1 << 12) /* Allow packed read */
+#define MMC_CAP2_PACKED_WR (1 << 13) /* Allow packed write */
+#define MMC_CAP2_PACKED_CMD (MMC_CAP2_PACKED_RD | \
+ MMC_CAP2_PACKED_WR)
+#define MMC_CAP2_CMD_QUEUE (1 << 18) /* support eMMC command queue */
+#define MMC_CAP2_ENHANCED_STROBE (1 << 19)
+#define MMC_CAP2_CACHE_FLUSH_BARRIER (1 << 20)
+/* Allow background operations auto enable control */
+#define MMC_CAP2_BKOPS_AUTO_CTRL (1 << 21)
+/* Allow background operations manual enable control */
+#define MMC_CAP2_BKOPS_MANUAL_CTRL (1 << 22)
+
+/* host is connected by via modem through sdio */
+#define MMC_CAP2_SUPPORT_VIA_MODEM (1 << 26)
+/* host is connected by wifi through sdio */
+#define MMC_CAP2_SUPPORT_WIFI (1 << 27)
+/* host is connected to 1102 wifi */
+#define MMC_CAP2_SUPPORT_WIFI_CMD11 (1 << 28)
+/* host do not support low power for wifi*/
+#define MMC_CAP2_WIFI_NO_LOWPWR (1 << 29)
+#endif
int fixed_drv_type; /* fixed driver type for non-removable media */
@@ -460,6 +543,12 @@ struct mmc_host {
bool cqe_on;
unsigned long private[0] ____cacheline_aligned;
+#ifdef CONFIG_ASCEND_HISI_MMC
+ const struct mmc_cmdq_host_ops *cmdq_ops;
+ int sdio_present;
+ unsigned int cmdq_slots;
+ struct mmc_cmdq_context_info cmdq_ctx;
+#endif
};
struct device_node;
@@ -587,4 +676,30 @@ static inline enum dma_data_direction mmc_get_dma_dir(struct mmc_data *data)
int mmc_send_tuning(struct mmc_host *host, u32 opcode, int *cmd_error);
int mmc_abort_tuning(struct mmc_host *host, u32 opcode);
+#ifdef CONFIG_ASCEND_HISI_MMC
+int mmc_cache_ctrl(struct mmc_host *host, u8 enable);
+int mmc_card_awake(struct mmc_host *host);
+int mmc_card_sleep(struct mmc_host *host);
+int mmc_card_can_sleep(struct mmc_host *host);
+#else
+static inline int mmc_cache_ctrl(struct mmc_host *host, u8 enable)
+{
+ return 0;
+}
+static inline int mmc_card_awake(struct mmc_host *host)
+{
+ return 0;
+}
+static inline int mmc_card_sleep(struct mmc_host *host)
+{
+ return 0;
+}
+static inline int mmc_card_can_sleep(struct mmc_host *host)
+{
+ return 0;
+}
+#endif
+
+int mmc_is_ascend_customized(struct device *dev);
+
#endif /* LINUX_MMC_HOST_H */
diff --git a/include/linux/mmc/pm.h b/include/linux/mmc/pm.h
index 4a139204c20c0..6e2d6a135c7e0 100644
--- a/include/linux/mmc/pm.h
+++ b/include/linux/mmc/pm.h
@@ -26,5 +26,6 @@ typedef unsigned int mmc_pm_flag_t;
#define MMC_PM_KEEP_POWER (1 << 0) /* preserve card power during suspend */
#define MMC_PM_WAKE_SDIO_IRQ (1 << 1) /* wake up host system on SDIO IRQ assertion */
+#define MMC_PM_IGNORE_PM_NOTIFY (1 << 2) /* ignore mmc pm notify */
#endif /* LINUX_MMC_PM_H */
--
2.25.1
First of all, thank you very much for participating in the openEuler community and submitting patches to the openEuler kernel open-source project.
The openEuler-21.03 innovation branch of the openEuler kernel project welcomes, with an open mind, the ideas and suggestions of enterprises, universities, and everyone who is interested in the Linux kernel. We hope to explore together the prospects and potential of low-level software across cloud and computing, 5G, devices, and other scenarios, to jointly advance system software in the IoT and intelligent-computing era, to provide universities with more teaching material, and to make a modest contribution to basic research and to industry-education integration.
- If you have questions about how to participate in the openEuler kernel project, you can email bobo.shaobowang(a)huawei.com or refer to this document: https://mp.weixin.qq.com/s/a42a5VfayFeJgWitqbI8Qw
- You can also file an issue on the openEuler kernel project page: https://gitee.com/openeuler/kernel
The following patches have passed community maintainer review and openEuler community verification testing, and will be merged into the openEuler-21.03 branch as version 5.10.0-4.25.0.
0ee74f5aa533 (HEAD -> openEuler-21.03, tag: 5.10.0-4.25.0) RDS tcp loopback connection can hang
fc7ec5aebb45 usb: gadget: f_fs: Ensure io_completion_wq is idle during unbind
dae6a368dafc ALSA: seq: Fix race of snd_seq_timer_open()
3a00695cb8e2 RDMA/mlx4: Do not map the core_clock page to user space unless enabled
ed4fd7c42adc Revert "ACPI: sleep: Put the FACS table after using it"
37c85837cf42 ASoC: Intel: bytcr_rt5640: Add quirk for the Glavey TM800A550L tablet
94fccf25dd49 nvme-tcp: remove incorrect Kconfig dep in BLK_DEV_NVME
47f090fbcbb9 regulator: fan53880: Fix missing n_voltages setting
0b9b74807478 net/nfc/rawsock.c: fix a permission check bug
789459f344e7 scsi: core: Only put parent device if host state differs from SHOST_CREATED
08f8e0fb4b59 usb: typec: ucsi: Clear PPM capability data in ucsi_init() error path
0e2bd1220f8a phy: cadence: Sierra: Fix error return code in cdns_sierra_phy_probe()
47f3671cfd67 usb: pd: Set PD_T_SINK_WAIT_CAP to 310ms
559b80a5925d scsi: core: Fix failure handling of scsi_add_host_with_dma()
e897c103ecde ALSA: hda/realtek: headphone and mic don't work on an Acer laptop
d39b22f602f5 isdn: mISDN: netjet: Fix crash in nj_probe:
1013d6a98975 nvmet: fix false keep-alive timeout when a controller is torn down
9ffec7fff577 cgroup: disable controllers at parse time
e2c4bbd88218 RDMA/ipoib: Fix warning caused by destroying non-initial netns
0eb3e33d9814 gpio: wcd934x: Fix shift-out-of-bounds error
af64e02cb927 NFSv4: Fix deadlock between nfs4_evict_inode() and nfs4_opendata_get_inode()
34a0d49e311d usb: dwc3: ep0: fix NULL pointer exception
ca5ed7b6d2ac spi: bcm2835: Fix out-of-bounds access with more than 4 slaves
420a6301307e NFSv4: nfs4_proc_set_acl needs to restore NFS_CAP_UIDGID_NOMAP on error.
762e6acf28f1 regulator: core: resolve supply for boot-on/always-on regulators
d84eb5070d03 net: macb: ensure the device is available before accessing GEMGXL control registers
b9a3b65556e9 sched/fair: Make sure to update tg contrib for blocked load
074babe38e68 KVM: x86: Ensure liveliness of nested VM-Enter fail tracepoint message
447f10de04c8 ALSA: firewire-lib: fix the context to call snd_pcm_stop_xrun()
262986c9f618 dm verity: fix require_signatures module_param permissions
bfa96859a312 usb: chipidea: udc: assign interrupt number to USB gadget structure
177f5f81e9fc regulator: max77620: Use device_set_of_node_from_dev()
818f49a8aa09 USB: serial: omninet: add device id for Zyxel Omni 56K Plus
1790bfdad278 spi: Cleanup on failure of initial setup
642b2258a1f7 drm/msm/a6xx: avoid shadow NULL reference in failure path
be1c43cba161 USB: f_ncm: ncm_bitrate (speed) is unsigned
e41037151205 nvme-fabrics: decode host pathing error for connect
We look forward to further cooperation.
Alexandre GRIVEAUX (1):
USB: serial: omninet: add device id for Zyxel Omni 56K Plus
Axel Lin (1):
regulator: fan53880: Fix missing n_voltages setting
Dai Ngo (1):
NFSv4: nfs4_proc_set_acl needs to restore NFS_CAP_UIDGID_NOMAP on
error.
Dmitry Baryshkov (1):
regulator: core: resolve supply for boot-on/always-on regulators
Dmitry Osipenko (1):
regulator: max77620: Use device_set_of_node_from_dev()
Hannes Reinecke (1):
nvme-fabrics: decode host pathing error for connect
Hans de Goede (1):
ASoC: Intel: bytcr_rt5640: Add quirk for the Glavey TM800A550L tablet
Hui Wang (1):
ALSA: hda/realtek: headphone and mic don't work on an Acer laptop
Jeimon (1):
net/nfc/rawsock.c: fix a permission check bug
John Keeping (1):
dm verity: fix require_signatures module_param permissions
Jonathan Marek (1):
drm/msm/a6xx: avoid shadow NULL reference in failure path
Kamal Heib (1):
RDMA/ipoib: Fix warning caused by destroying non-initial netns
Kyle Tso (1):
usb: pd: Set PD_T_SINK_WAIT_CAP to 310ms
Li Jun (1):
usb: chipidea: udc: assign interrupt number to USB gadget structure
Lukas Wunner (2):
spi: Cleanup on failure of initial setup
spi: bcm2835: Fix out-of-bounds access with more than 4 slaves
Maciej Żenczykowski (1):
USB: f_ncm: ncm_bitrate (speed) is unsigned
Marian-Cristian Rotariu (1):
usb: dwc3: ep0: fix NULL pointer exception
Mayank Rana (1):
usb: typec: ucsi: Clear PPM capability data in ucsi_init() error path
Ming Lei (2):
scsi: core: Fix failure handling of scsi_add_host_with_dma()
scsi: core: Only put parent device if host state differs from
SHOST_CREATED
Rao Shoaib (1):
RDS tcp loopback connection can hang
Sagi Grimberg (2):
nvmet: fix false keep-alive timeout when a controller is torn down
nvme-tcp: remove incorrect Kconfig dep in BLK_DEV_NVME
Sean Christopherson (1):
KVM: x86: Ensure liveliness of nested VM-Enter fail tracepoint message
Shakeel Butt (1):
cgroup: disable controllers at parse time
Shay Drory (1):
RDMA/mlx4: Do not map the core_clock page to user space unless enabled
Srinivas Kandagatla (1):
gpio: wcd934x: Fix shift-out-of-bounds error
Takashi Iwai (1):
ALSA: seq: Fix race of snd_seq_timer_open()
Takashi Sakamoto (1):
ALSA: firewire-lib: fix the context to call snd_pcm_stop_xrun()
Trond Myklebust (1):
NFSv4: Fix deadlock between nfs4_evict_inode() and
nfs4_opendata_get_inode()
Vincent Guittot (1):
sched/fair: Make sure to update tg contrib for blocked load
Wang Wensheng (1):
phy: cadence: Sierra: Fix error return code in cdns_sierra_phy_probe()
Wesley Cheng (1):
usb: gadget: f_fs: Ensure io_completion_wq is idle during unbind
Zhang Rui (1):
Revert "ACPI: sleep: Put the FACS table after using it"
Zheyu Ma (1):
isdn: mISDN: netjet: Fix crash in nj_probe:
Zong Li (1):
net: macb: ensure the device is available before accessing GEMGXL
control registers
arch/x86/kvm/trace.h | 6 ++--
drivers/acpi/sleep.c | 4 +--
drivers/gpio/gpio-wcd934x.c | 2 +-
drivers/gpu/drm/msm/adreno/a6xx_gpu.c | 2 +-
drivers/infiniband/hw/mlx4/main.c | 5 +--
drivers/infiniband/ulp/ipoib/ipoib_netlink.c | 1 +
drivers/isdn/hardware/mISDN/netjet.c | 1 -
drivers/md/dm-verity-verify-sig.c | 2 +-
drivers/net/ethernet/cadence/macb_main.c | 3 ++
drivers/net/ethernet/mellanox/mlx4/fw.c | 3 ++
drivers/net/ethernet/mellanox/mlx4/fw.h | 1 +
drivers/net/ethernet/mellanox/mlx4/main.c | 6 ++++
drivers/nvme/host/Kconfig | 3 +-
drivers/nvme/host/fabrics.c | 5 +++
drivers/nvme/target/core.c | 15 ++++++---
drivers/nvme/target/nvmet.h | 2 +-
drivers/phy/cadence/phy-cadence-sierra.c | 1 +
drivers/regulator/core.c | 6 ++++
drivers/regulator/fan53880.c | 3 ++
drivers/regulator/max77620-regulator.c | 7 +++++
drivers/scsi/hosts.c | 16 +++++-----
drivers/spi/spi-bcm2835.c | 10 ++++--
drivers/spi/spi-bitbang.c | 18 ++++++++---
drivers/spi/spi-fsl-spi.c | 4 +++
drivers/spi/spi-omap-uwire.c | 9 +++++-
drivers/spi/spi-omap2-mcspi.c | 33 ++++++++++++--------
drivers/spi/spi-pxa2xx.c | 9 +++++-
drivers/usb/chipidea/udc.c | 1 +
drivers/usb/dwc3/ep0.c | 3 ++
drivers/usb/gadget/function/f_fs.c | 3 ++
drivers/usb/gadget/function/f_ncm.c | 2 +-
drivers/usb/serial/omninet.c | 2 ++
drivers/usb/typec/ucsi/ucsi.c | 1 +
fs/nfs/nfs4_fs.h | 1 +
fs/nfs/nfs4proc.c | 20 +++++++++++-
include/linux/mlx4/device.h | 1 +
include/linux/usb/pd.h | 2 +-
kernel/cgroup/cgroup.c | 13 ++++----
kernel/sched/fair.c | 2 +-
net/nfc/rawsock.c | 2 +-
net/rds/connection.c | 23 ++++++++++----
net/rds/tcp.c | 4 +--
net/rds/tcp.h | 3 +-
net/rds/tcp_listen.c | 6 ++++
sound/core/seq/seq_timer.c | 10 +++++-
sound/firewire/amdtp-stream.c | 2 +-
sound/pci/hda/patch_realtek.c | 12 +++++++
sound/soc/intel/boards/bytcr_rt5640.c | 11 +++++++
48 files changed, 228 insertions(+), 73 deletions(-)
--
2.25.1
kylin inclusion
category: feature
bugfix: https://gitee.com/openeuler-competition/summer-2021/issues/I3EIMT?from=proj…
CVE: NA
--------------------------------------------------
In atomic contexts such as interrupt handling, sleeping is not allowed. Memory allocations made there therefore never enter the direct reclaim path and may not even wake up the kswapd process. In the softirq handler that receives network packets, for example, the page cache may occupy so much memory that too little remains free, the allocation for an incoming packet fails, and the packet is simply dropped.
This is the problem the page cache limit is meant to solve.
The page cache limit works by checking, whenever a page is added to the page cache (that is, when add_to_page_cache_lru() is called), whether the page cache exceeds the upper limit configured through /proc/sys/vm/pagecache_limit_ratio.
Three /proc interfaces are provided:
echo x > /proc/sys/vm/pagecache_limit_ratio (0 < x < 100): enables the page cache limit; x is the maximum percentage of total system memory the page cache may occupy.
/proc/sys/vm/pagecache_limit_ignore_dirty: whether to ignore dirty pages when calculating the memory occupied by the page cache. The default is 1 (ignore), because reclaiming dirty pages is time-consuming.
/proc/sys/vm/pagecache_limit_async: 1 selects asynchronous reclaim, 0 selects synchronous reclaim.
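As a worked example (all numbers are illustrative, not from the patch): on a machine with 4,194,304 total pages (16 GiB with 4 KiB pages), writing 30 to pagecache_limit_ratio caps the page cache at about 1,258,291 pages, and because ADDITIONAL_RECLAIM_RATIO is 2 the reclaim target becomes 32%, about 1,342,177 pages. A small user-space sketch of the same arithmetic as setup_pagecache_limit():

#include <stdio.h>

#define ADDITIONAL_RECLAIM_RATIO 2	/* same constant the patch adds to swap.h */

int main(void)
{
	unsigned long total_pages = 4194304UL;	/* assumed: 16 GiB / 4 KiB pages */
	unsigned int limit_ratio = 30;		/* echo 30 > pagecache_limit_ratio */
	unsigned int reclaim_ratio = limit_ratio + ADDITIONAL_RECLAIM_RATIO;

	if (reclaim_ratio > 100)
		reclaim_ratio = 100;

	printf("pagecache limit: %lu pages\n", limit_ratio * total_pages / 100);
	printf("reclaim target:  %lu pages\n", reclaim_ratio * total_pages / 100);
	return 0;
}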
Signed-off-by: wen zhiwei <wenzhiwei(a)kylinos.cn>
Signed-off-by: wenzhiwei <wenzhiwei(a)kylinos.cn>
---
include/linux/memcontrol.h | 7 +-
include/linux/mmzone.h | 7 +
include/linux/swap.h | 15 +
include/trace/events/vmscan.h | 28 +-
kernel/sysctl.c | 139 ++++++++
mm/filemap.c | 2 +
mm/page_alloc.c | 52 +++
mm/vmscan.c | 650 ++++++++++++++++++++++++++++++++--
mm/workingset.c | 1 +
9 files changed, 862 insertions(+), 39 deletions(-)
diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 71a5b589bddb..731a2cd2ea86 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -50,6 +50,7 @@ enum memcg_memory_event {
struct mem_cgroup_reclaim_cookie {
pg_data_t *pgdat;
+ int priority;
unsigned int generation;
};
@@ -492,8 +493,7 @@ mem_cgroup_nodeinfo(struct mem_cgroup *memcg, int nid)
* @node combination. This can be the node lruvec, if the memory
* controller is disabled.
*/
-static inline struct lruvec *mem_cgroup_lruvec(struct mem_cgroup *memcg,
- struct pglist_data *pgdat)
+static inline struct lruvec *mem_cgroup_lruvec(struct mem_cgroup *memcg, struct pglist_data *pgdat)
{
struct mem_cgroup_per_node *mz;
struct lruvec *lruvec;
@@ -1066,8 +1066,7 @@ static inline void mem_cgroup_migrate(struct page *old, struct page *new)
{
}
-static inline struct lruvec *mem_cgroup_lruvec(struct mem_cgroup *memcg,
- struct pglist_data *pgdat)
+static inline struct lruvec *mem_cgroup_lruvec(struct mem_cgroup *memcg, struct pglist_data *pgdat)
{
return &pgdat->__lruvec;
}
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 82fceef88448..d3c5258e5d0d 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -445,6 +445,13 @@ struct zone {
* changes.
*/
long lowmem_reserve[MAX_NR_ZONES];
+ /*
+ * This atomic counter is set when there is pagecache limit
+ * reclaim going on on this particular zone. Other potential
+ * reclaiers should back off to prevent from heavy lru_lock
+ * bouncing.
+ */
+ atomic_t pagecache_reclaim;
#ifdef CONFIG_NEED_MULTIPLE_NODES
int node;
diff --git a/include/linux/swap.h b/include/linux/swap.h
index 9b708c0288bc..b9329e575836 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -377,6 +377,21 @@ extern unsigned long mem_cgroup_shrink_node(struct mem_cgroup *mem,
unsigned long *nr_scanned);
extern unsigned long shrink_all_memory(unsigned long nr_pages);
extern int vm_swappiness;
+
+#define ADDITIONAL_RECLAIM_RATIO 2
+extern unsigned long pagecache_over_limit(void);
+extern void shrink_page_cache(gfp_t mask, struct page *page);
+extern unsigned long vm_pagecache_limit_pages;
+extern unsigned long vm_pagecache_limit_reclaim_pages;
+extern int unsigned vm_pagecache_limit_ratio;
+extern int vm_pagecache_limit_reclaim_ratio;
+extern unsigned int vm_pagecache_ignore_dirty;
+extern unsigned long pagecache_over_limit(void);
+extern unsigned int vm_pagecache_limit_async;
+extern int kpagecache_limitd_run(void);
+extern void kpagecache_limitd_stop(void);
+extern unsigned int vm_pagecache_ignore_slab;
+
extern int remove_mapping(struct address_space *mapping, struct page *page);
extern unsigned long reclaim_pages(struct list_head *page_list);
diff --git a/include/trace/events/vmscan.h b/include/trace/events/vmscan.h
index 2070df64958e..3bfe47a85f6f 100644
--- a/include/trace/events/vmscan.h
+++ b/include/trace/events/vmscan.h
@@ -183,48 +183,48 @@ DEFINE_EVENT(mm_vmscan_direct_reclaim_end_template, mm_vmscan_memcg_softlimit_re
#endif /* CONFIG_MEMCG */
TRACE_EVENT(mm_shrink_slab_start,
- TP_PROTO(struct shrinker *shr, struct shrink_control *sc,
- long nr_objects_to_shrink, unsigned long cache_items,
- unsigned long long delta, unsigned long total_scan,
- int priority),
-
- TP_ARGS(shr, sc, nr_objects_to_shrink, cache_items, delta, total_scan,
- priority),
+ TP_PROTO(struct shrinker *shr, struct shrink_control *sc,
+ long nr_objects_to_shrink,unsigned long pgs_scanned,
+ unsigned long lru_pgs, unsigned long cache_items,
+ unsigned long long delta, unsigned long total_scan),
+ TP_ARGS(shr, sc, nr_objects_to_shrink,pgs_scanned, lru_pgs, cache_items, delta, total_scan),
TP_STRUCT__entry(
__field(struct shrinker *, shr)
__field(void *, shrink)
__field(int, nid)
__field(long, nr_objects_to_shrink)
__field(gfp_t, gfp_flags)
+ __field(unsigned long, pgs_scanned)
+ __field(unsigned long, lru_pgs)
__field(unsigned long, cache_items)
__field(unsigned long long, delta)
__field(unsigned long, total_scan)
- __field(int, priority)
),
TP_fast_assign(
- __entry->shr = shr;
+ __entry->shr = shr;
__entry->shrink = shr->scan_objects;
__entry->nid = sc->nid;
__entry->nr_objects_to_shrink = nr_objects_to_shrink;
__entry->gfp_flags = sc->gfp_mask;
+ __entry->pgs_scanned = pgs_scanned;
+ __entry->lru_pgs = lru_pgs;
__entry->cache_items = cache_items;
__entry->delta = delta;
__entry->total_scan = total_scan;
- __entry->priority = priority;
),
-
- TP_printk("%pS %p: nid: %d objects to shrink %ld gfp_flags %s cache items %ld delta %lld total_scan %ld priority %d",
+TP_printk("%pF %p: nid: %d objects to shrink %ld gfp_flags %s pgs_scanned %ld lru_pgs %ld cache items %ld delta %lld total_scan %ld",
__entry->shrink,
__entry->shr,
__entry->nid,
__entry->nr_objects_to_shrink,
show_gfp_flags(__entry->gfp_flags),
+ __entry->pgs_scanned,
+ __entry->lru_pgs,
__entry->cache_items,
__entry->delta,
- __entry->total_scan,
- __entry->priority)
+ __entry->total_scan)
);
TRACE_EVENT(mm_shrink_slab_end,
diff --git a/kernel/sysctl.c b/kernel/sysctl.c
index c7ca58de3b1b..4ef436cdfdad 100644
--- a/kernel/sysctl.c
+++ b/kernel/sysctl.c
@@ -111,6 +111,7 @@
static int sixty = 60;
#endif
+static int zero;
static int __maybe_unused neg_one = -1;
static int __maybe_unused two = 2;
static int __maybe_unused four = 4;
@@ -648,6 +649,68 @@ static int do_proc_dointvec(struct ctl_table *table, int write,
return __do_proc_dointvec(table->data, table, write,
buffer, lenp, ppos, conv, data);
}
+int setup_pagecache_limit(void)
+{
+ /* reclaim $ADDITIONAL_RECLAIM_PAGES more than limit. */
+ vm_pagecache_limit_reclaim_ratio = vm_pagecache_limit_ratio + ADDITIONAL_RECLAIM_RATIO;
+
+ if (vm_pagecache_limit_reclaim_ratio > 100)
+ vm_pagecache_limit_reclaim_ratio = 100;
+ if (vm_pagecache_limit_ratio == 0)
+ vm_pagecache_limit_reclaim_ratio = 0;
+
+ vm_pagecache_limit_pages = vm_pagecache_limit_ratio * totalram_pages() / 100;
+ vm_pagecache_limit_reclaim_pages = vm_pagecache_limit_reclaim_ratio * totalram_pages() / 100;
+ return 0;
+}
+
+static int pc_limit_proc_dointvec(struct ctl_table *table, int write,
+ void __user *buffer, size_t *lenp, loff_t *ppos)
+{
+ int ret = proc_dointvec_minmax(table, write, buffer, lenp, ppos);
+ if (write && !ret)
+ ret = setup_pagecache_limit();
+ return ret;
+}
+static int pc_reclaim_limit_proc_dointvec(struct ctl_table *table, int write,
+ void __user *buffer, size_t *lenp, loff_t *ppos)
+{
+ int pre_reclaim_ratio = vm_pagecache_limit_reclaim_ratio;
+ int ret = proc_dointvec_minmax(table, write, buffer, lenp, ppos);
+
+ if (write && vm_pagecache_limit_ratio == 0)
+ return -EINVAL;
+
+ if (write && !ret) {
+ if (vm_pagecache_limit_reclaim_ratio - vm_pagecache_limit_ratio < ADDITIONAL_RECLAIM_RATIO) {
+ vm_pagecache_limit_reclaim_ratio = pre_reclaim_ratio;
+ return -EINVAL;
+ }
+ vm_pagecache_limit_reclaim_pages = vm_pagecache_limit_reclaim_ratio * totalram_pages() / 100;
+ }
+ return ret;
+}
+static int pc_limit_async_handler(struct ctl_table *table, int write,
+ void __user *buffer, size_t *lenp, loff_t *ppos)
+{
+ int ret = proc_dointvec_minmax(table, write, buffer, lenp, ppos);
+
+ if (write && vm_pagecache_limit_ratio == 0)
+ return -EINVAL;
+
+ if (write && !ret) {
+ if (vm_pagecache_limit_async > 0) {
+ if (kpagecache_limitd_run()) {
+ vm_pagecache_limit_async = 0;
+ return -EINVAL;
+ }
+ }
+ else {
+ kpagecache_limitd_stop();
+ }
+ }
+ return ret;
+}
static int do_proc_douintvec_w(unsigned int *tbl_data,
struct ctl_table *table,
@@ -2711,6 +2774,14 @@ static struct ctl_table kern_table[] = {
},
{ }
};
+static int pc_limit_proc_dointvec(struct ctl_table *table, int write,
+ void __user *buffer, size_t *lenp, loff_t *ppos);
+
+static int pc_reclaim_limit_proc_dointvec(struct ctl_table *table, int write,
+ void __user *buffer, size_t *lenp, loff_t *ppos);
+
+static int pc_limit_async_handler(struct ctl_table *table, int write,
+ void __user *buffer, size_t *lenp, loff_t *ppos);
static struct ctl_table vm_table[] = {
{
@@ -2833,6 +2904,74 @@ static struct ctl_table vm_table[] = {
.extra1 = SYSCTL_ZERO,
.extra2 = &two_hundred,
},
+ {
+ .procname = "pagecache_limit_ratio",
+ .data = &vm_pagecache_limit_ratio,
+ .maxlen = sizeof(vm_pagecache_limit_ratio),
+ .mode = 0644,
+ .proc_handler = &pc_limit_proc_dointvec,
+ .extra1 = &zero,
+ .extra2 = &one_hundred,
+ },
+ {
+ .procname = "pagecache_limit_reclaim_ratio",
+ .data = &vm_pagecache_limit_reclaim_ratio,
+ .maxlen = sizeof(vm_pagecache_limit_reclaim_ratio),
+ .mode = 0644,
+ .proc_handler = &pc_reclaim_limit_proc_dointvec,
+ .extra1 = &zero,
+ .extra2 = &one_hundred,
+ },
+ {
+ .procname = "pagecache_limit_ignore_dirty",
+ .data = &vm_pagecache_ignore_dirty,
+ .maxlen = sizeof(vm_pagecache_ignore_dirty),
+ .mode = 0644,
+ .proc_handler = &proc_dointvec,
+ },
+#ifdef CONFIG_SHRINK_PAGECACHE
+ {
+ .procname = "cache_reclaim_s",
+ .data = &vm_cache_reclaim_s,
+ .maxlen = sizeof(vm_cache_reclaim_s),
+ .mode = 0644,
+ .proc_handler = cache_reclaim_sysctl_handler,
+ .extra1 = &vm_cache_reclaim_s_min,
+ .extra2 = &vm_cache_reclaim_s_max,
+ },
+ {
+ .procname = "cache_reclaim_weight",
+ .data = &vm_cache_reclaim_weight,
+ .maxlen = sizeof(vm_cache_reclaim_weight),
+ .mode = 0644,
+ .proc_handler = proc_dointvec_minmax,
+ .extra1 = &vm_cache_reclaim_weight_min,
+ .extra2 = &vm_cache_reclaim_weight_max,
+ },
+ {
+ .procname = "cache_reclaim_enable",
+ .data = &vm_cache_reclaim_enable,
+ .maxlen = sizeof(vm_cache_reclaim_enable),
+ .mode = 0644,
+ .proc_handler = cache_reclaim_enable_handler,
+ .extra1 = &zero,
+ .extra2 = &one,
+ },
+ {
+ .procname = "pagecache_limit_async",
+ .data = &vm_pagecache_limit_async,
+ .maxlen = sizeof(vm_pagecache_limit_async),
+ .mode = 0644,
+ .proc_handler = &pc_limit_async_handler,
+ },
+ {
+ .procname = "pagecache_limit_ignore_slab",
+ .data = &vm_pagecache_ignore_slab,
+ .maxlen = sizeof(vm_pagecache_ignore_slab),
+ .mode = 0644,
+ .proc_handler = &proc_dointvec,
+ },
+#endif
#ifdef CONFIG_HUGETLB_PAGE
{
.procname = "nr_hugepages",
diff --git a/mm/filemap.c b/mm/filemap.c
index ef611eb34aa7..808d4f02b5a5 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -922,6 +922,8 @@ int add_to_page_cache_lru(struct page *page, struct address_space *mapping,
{
void *shadow = NULL;
int ret;
+ if (unlikely(vm_pagecache_limit_pages) && pagecache_over_limit() > 0)
+ shrink_page_cache(gfp_mask, page);
__SetPageLocked(page);
ret = __add_to_page_cache_locked(page, mapping, offset,
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 71afec177233..08feba42d3d7 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -8933,6 +8933,58 @@ void zone_pcp_reset(struct zone *zone)
local_irq_restore(flags);
}
+/* Returns a number that's positive if the pagecache is above
+ * the set limit*/
+unsigned long pagecache_over_limit()
+{
+ unsigned long should_reclaim_pages = 0;
+ unsigned long overlimit_pages = 0;
+ unsigned long delta_pages = 0;
+ unsigned long pgcache_lru_pages = 0;
+ /* We only want to limit unmapped and non-shmem page cache pages;
+ * normally all shmem pages are mapped as well*/
+ unsigned long pgcache_pages = global_node_page_state(NR_FILE_PAGES)
+ - max_t(unsigned long,
+ global_node_page_state(NR_FILE_MAPPED),
+ global_node_page_state(NR_SHMEM));
+ /* We certainly can't free more than what's on the LRU lists
+ * minus the dirty ones*/
+ if (vm_pagecache_ignore_slab)
+ pgcache_lru_pages = global_node_page_state(NR_ACTIVE_FILE)
+ + global_node_page_state(NR_INACTIVE_FILE);
+ else
+ pgcache_lru_pages = global_node_page_state(NR_ACTIVE_FILE)
+ + global_node_page_state(NR_INACTIVE_FILE)
+ + global_node_page_state(NR_SLAB_RECLAIMABLE_B)
+ + global_node_page_state(NR_SLAB_UNRECLAIMABLE_B);
+
+ if (vm_pagecache_ignore_dirty != 0)
+ pgcache_lru_pages -= global_node_page_state(NR_FILE_DIRTY) / vm_pagecache_ignore_dirty;
+ /* Paranoia */
+ if (unlikely(pgcache_lru_pages > LONG_MAX))
+ return 0;
+
+ /* Limit it to 94% of LRU (not all there might be unmapped) */
+ pgcache_lru_pages -= pgcache_lru_pages/16;
+ if (vm_pagecache_ignore_slab)
+ pgcache_pages = min_t(unsigned long, pgcache_pages, pgcache_lru_pages);
+ else
+ pgcache_pages = pgcache_lru_pages;
+
+ /*
+ *delta_pages: we should reclaim at least 2% more pages than overlimit_page, values get from
+ * /proc/vm/pagecache_limit_reclaim_pages
+ *should_reclaim_pages: the real pages we will reclaim, but it should less than pgcache_pages;
+ */
+ if (pgcache_pages > vm_pagecache_limit_pages) {
+ overlimit_pages = pgcache_pages - vm_pagecache_limit_pages;
+ delta_pages = vm_pagecache_limit_reclaim_pages - vm_pagecache_limit_pages;
+ should_reclaim_pages = min_t(unsigned long, delta_pages, vm_pagecache_limit_pages) + overlimit_pages;
+ return should_reclaim_pages;
+ }
+ return 0;
+}
+
#ifdef CONFIG_MEMORY_HOTREMOVE
/*
* All pages in the range must be in a single zone, must not contain holes,
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 23f8a5242de7..1fe2c74a1c10 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -175,6 +175,39 @@ struct scan_control {
*/
int vm_swappiness = 60;
+/*
+ * The total number of pages which are beyond the high watermark within all
+ * zones.
+ */
+unsigned long vm_pagecache_limit_pages __read_mostly = 0;
+unsigned long vm_pagecache_limit_reclaim_pages = 0;
+unsigned int vm_pagecache_limit_ratio __read_mostly = 0;
+int vm_pagecache_limit_reclaim_ratio __read_mostly = 0;
+unsigned int vm_pagecache_ignore_dirty __read_mostly = 1;
+
+unsigned long vm_total_pages;
+static struct task_struct *kpclimitd = NULL;
+unsigned int vm_pagecache_ignore_slab __read_mostly = 1;
+unsigned int vm_pagecache_limit_async __read_mostly = 0;
+
+#ifdef CONFIG_SHRINK_PAGECACHE
+unsigned long vm_cache_limit_ratio;
+unsigned long vm_cache_limit_ratio_min;
+unsigned long vm_cache_limit_ratio_max;
+unsigned long vm_cache_limit_mbytes __read_mostly;
+unsigned long vm_cache_limit_mbytes_min;
+unsigned long vm_cache_limit_mbytes_max;
+static bool kpclimitd_context = false;
+int vm_cache_reclaim_s __read_mostly;
+int vm_cache_reclaim_s_min;
+int vm_cache_reclaim_s_max;
+int vm_cache_reclaim_weight __read_mostly;
+int vm_cache_reclaim_weight_min;
+int vm_cache_reclaim_weight_max;
+int vm_cache_reclaim_enable;
+static DEFINE_PER_CPU(struct delayed_work, vmscan_work);
+#endif
+
static void set_task_reclaim_state(struct task_struct *task,
struct reclaim_state *rs)
{
@@ -187,10 +220,12 @@ static void set_task_reclaim_state(struct task_struct *task,
task->reclaim_state = rs;
}
+static bool kpclimitd_context = false;
static LIST_HEAD(shrinker_list);
static DECLARE_RWSEM(shrinker_rwsem);
#ifdef CONFIG_MEMCG
+static DEFINE_IDR(shrinker_idr);
static int shrinker_nr_max;
/* The shrinker_info is expanded in a batch of BITS_PER_LONG */
@@ -346,7 +381,6 @@ void set_shrinker_bit(struct mem_cgroup *memcg, int nid, int shrinker_id)
}
}
-static DEFINE_IDR(shrinker_idr);
static int prealloc_memcg_shrinker(struct shrinker *shrinker)
{
@@ -646,7 +680,9 @@ EXPORT_SYMBOL(unregister_shrinker);
#define SHRINK_BATCH 128
static unsigned long do_shrink_slab(struct shrink_control *shrinkctl,
- struct shrinker *shrinker, int priority)
+ struct shrinker *shrinker,
+ unsigned long nr_scanned,
+ unsigned long nr_eligible)
{
unsigned long freed = 0;
unsigned long long delta;
@@ -670,8 +706,10 @@ static unsigned long do_shrink_slab(struct shrink_control *shrinkctl,
nr = xchg_nr_deferred(shrinker, shrinkctl);
if (shrinker->seeks) {
- delta = freeable >> priority;
- delta *= 4;
+ //delta = freeable >> priority;
+ //delta *= 4;
+ delta = (4 * nr_scanned) / shrinker->seeks;
+ delta *= freeable;
do_div(delta, shrinker->seeks);
} else {
/*
@@ -682,12 +720,12 @@ static unsigned long do_shrink_slab(struct shrink_control *shrinkctl,
delta = freeable / 2;
}
- total_scan = nr >> priority;
+ total_scan = nr;
total_scan += delta;
total_scan = min(total_scan, (2 * freeable));
trace_mm_shrink_slab_start(shrinker, shrinkctl, nr,
- freeable, delta, total_scan, priority);
+ freeable, delta, total_scan, nr_scanned,nr_eligible);
/*
* Normally, we should not scan less than batch_size objects in one
@@ -744,7 +782,7 @@ static unsigned long do_shrink_slab(struct shrink_control *shrinkctl,
#ifdef CONFIG_MEMCG
static unsigned long shrink_slab_memcg(gfp_t gfp_mask, int nid,
- struct mem_cgroup *memcg, int priority)
+ struct mem_cgroup *memcg, unsigned long nr_scanned, unsigned long nr_eligible)
{
struct shrinker_info *info;
unsigned long ret, freed = 0;
@@ -780,7 +818,7 @@ static unsigned long shrink_slab_memcg(gfp_t gfp_mask, int nid,
!(shrinker->flags & SHRINKER_NONSLAB))
continue;
- ret = do_shrink_slab(&sc, shrinker, priority);
+ ret = do_shrink_slab(&sc, shrinker, nr_scanned, nr_eligible);
if (ret == SHRINK_EMPTY) {
clear_bit(i, info->map);
/*
@@ -799,7 +837,7 @@ static unsigned long shrink_slab_memcg(gfp_t gfp_mask, int nid,
* set_bit() do_shrink_slab()
*/
smp_mb__after_atomic();
- ret = do_shrink_slab(&sc, shrinker, priority);
+ ret = do_shrink_slab(&sc, shrinker, nr_scanned, nr_eligible);
if (ret == SHRINK_EMPTY)
ret = 0;
else
@@ -846,7 +884,8 @@ static unsigned long shrink_slab_memcg(gfp_t gfp_mask, int nid,
*/
static unsigned long shrink_slab(gfp_t gfp_mask, int nid,
struct mem_cgroup *memcg,
- int priority)
+ unsigned long nr_scanned,
+ unsigned long nr_eligible)
{
unsigned long ret, freed = 0;
struct shrinker *shrinker;
@@ -859,7 +898,8 @@ static unsigned long shrink_slab(gfp_t gfp_mask, int nid,
* oom.
*/
if (!mem_cgroup_disabled() && !mem_cgroup_is_root(memcg))
- return shrink_slab_memcg(gfp_mask, nid, memcg, priority);
+ return 0;
+ // return shrink_slab_memcg(gfp_mask, nid, memcg, priority);
if (!down_read_trylock(&shrinker_rwsem))
goto out;
@@ -871,7 +911,14 @@ static unsigned long shrink_slab(gfp_t gfp_mask, int nid,
.memcg = memcg,
};
- ret = do_shrink_slab(&sc, shrinker, priority);
+ if (memcg_kmem_enabled() &&
+ !!memcg != !!(shrinker->flags & SHRINKER_MEMCG_AWARE))
+ continue;
+
+ if (!(shrinker->flags & SHRINKER_NUMA_AWARE))
+ sc.nid = 0;
+
+ ret = do_shrink_slab(&sc, shrinker, nr_scanned, nr_eligible);
if (ret == SHRINK_EMPTY)
ret = 0;
freed += ret;
@@ -905,7 +952,7 @@ void drop_slab_node(int nid)
freed = 0;
memcg = mem_cgroup_iter(NULL, NULL, NULL);
do {
- freed += shrink_slab(GFP_KERNEL, nid, memcg, 0);
+ freed += shrink_slab(GFP_KERNEL, nid, memcg, 1000,1000);
} while ((memcg = mem_cgroup_iter(NULL, memcg, NULL)) != NULL);
} while (freed > 10);
}
@@ -2369,7 +2416,7 @@ unsigned long reclaim_pages(struct list_head *page_list)
EXPORT_SYMBOL_GPL(reclaim_pages);
static unsigned long shrink_list(enum lru_list lru, unsigned long nr_to_scan,
- struct lruvec *lruvec, struct scan_control *sc)
+ struct lruvec *lruvec, struct mem_cgroup *memcg, struct scan_control *sc)
{
if (is_active_lru(lru)) {
if (sc->may_deactivate & (1 << is_file_lru(lru)))
@@ -2683,7 +2730,7 @@ static void shrink_lruvec(struct lruvec *lruvec, struct scan_control *sc)
nr[lru] -= nr_to_scan;
nr_reclaimed += shrink_list(lru, nr_to_scan,
- lruvec, sc);
+ lruvec, NULL, sc);
}
}
@@ -2836,7 +2883,7 @@ static void shrink_node_memcgs(pg_data_t *pgdat, struct scan_control *sc)
struct lruvec *lruvec = mem_cgroup_lruvec(memcg, pgdat);
unsigned long reclaimed;
unsigned long scanned;
-
+ unsigned long lru_pages;
/*
* This loop can become CPU-bound when target memcgs
* aren't eligible for reclaim - either because they
@@ -2873,7 +2920,8 @@ static void shrink_node_memcgs(pg_data_t *pgdat, struct scan_control *sc)
shrink_lruvec(lruvec, sc);
shrink_slab(sc->gfp_mask, pgdat->node_id, memcg,
- sc->priority);
+ sc->nr_scanned - scanned,
+ lru_pages);
/* Record the group's reclaim efficiency */
vmpressure(sc->gfp_mask, memcg, false,
@@ -3202,6 +3250,7 @@ static void shrink_zones(struct zonelist *zonelist, struct scan_control *sc)
static void snapshot_refaults(struct mem_cgroup *target_memcg, pg_data_t *pgdat)
{
+ struct mem_cgroup *memcg;
struct lruvec *target_lruvec;
unsigned long refaults;
@@ -3273,8 +3322,7 @@ static unsigned long do_try_to_free_pages(struct zonelist *zonelist,
if (cgroup_reclaim(sc)) {
struct lruvec *lruvec;
- lruvec = mem_cgroup_lruvec(sc->target_mem_cgroup,
- zone->zone_pgdat);
+ lruvec = mem_cgroup_lruvec(sc->target_mem_cgroup, zone->zone_pgdat);
clear_bit(LRUVEC_CONGESTED, &lruvec->flags);
}
}
@@ -3745,6 +3793,8 @@ static bool kswapd_shrink_node(pg_data_t *pgdat,
return sc->nr_scanned >= sc->nr_to_reclaim;
}
+static void __shrink_page_cache(gfp_t mask);
+
/*
* For kswapd, balance_pgdat() will reclaim pages across a node from zones
* that are eligible for use by the caller until at least one zone is
@@ -4208,6 +4258,27 @@ void wakeup_kswapd(struct zone *zone, gfp_t gfp_flags, int order,
wake_up_interruptible(&pgdat->kswapd_wait);
}
+/*
+ * The reclaimable count would be mostly accurate.
+ * The less reclaimable pages may be
+ * - mlocked pages, which will be moved to unevictable list when encountered
+ * - mapped pages, which may require several travels to be reclaimed
+ * - dirty pages, which is not "instantly" reclaimable
+ */
+
+static unsigned long global_reclaimable_pages(void)
+{
+ int nr;
+
+ nr = global_node_page_state(NR_ACTIVE_FILE) +
+ global_node_page_state(NR_INACTIVE_FILE);
+
+ if (get_nr_swap_pages() > 0)
+ nr += global_node_page_state(NR_ACTIVE_ANON) +
+ global_node_page_state(NR_INACTIVE_ANON);
+ return nr;
+}
+
#ifdef CONFIG_HIBERNATION
/*
* Try to free `nr_to_reclaim' of memory, system-wide, and return the number of
@@ -4246,6 +4317,498 @@ unsigned long shrink_all_memory(unsigned long nr_to_reclaim)
return nr_reclaimed;
}
#endif /* CONFIG_HIBERNATION */
+/*
+ * Returns non-zero if the lock has been acquired, false if somebody
+ * else is holding the lock.
+ */
+static int pagecache_reclaim_lock_zone(struct zone *zone)
+{
+ return atomic_add_unless(&zone->pagecache_reclaim, 1, 1);
+}
+
+static void pagecache_reclaim_unlock_zone(struct zone *zone)
+{
+ BUG_ON(atomic_dec_return(&zone->pagecache_reclaim));
+}
+
+/*
+ * Potential page cache reclaimers who are not able to take
+ * reclaim lock on any zone are sleeping on this waitqueue.
+ * So this is basically a congestion wait queue for them.
+ */
+DECLARE_WAIT_QUEUE_HEAD(pagecache_reclaim_wq);
+DECLARE_WAIT_QUEUE_HEAD(kpagecache_limitd_wq);
+
+/*
+ * Similar to shrink_zone but it has a different consumer - pagecache limit
+ * so we cannot reuse the original function - and we do not want to clobber
+ * that code path so we have to live with this code duplication.
+ *
+ * In short this simply scans through the given lru for all cgroups for the
+ * give zone.
+ *
+ * returns true if we managed to cumulatively reclaim (via nr_reclaimed)
+ * the given nr_to_reclaim pages, false otherwise. The caller knows that
+ * it doesn't have to touch other zones if the target was hit already.
+ *
+ * DO NOT USE OUTSIDE of shrink_all_zones unless you have a really really
+ * really good reason.
+ */
+
+static bool shrink_zone_per_memcg(struct zone *zone, enum lru_list lru,
+ unsigned long nr_to_scan, unsigned long nr_to_reclaim,
+ unsigned long *nr_reclaimed, struct scan_control *sc)
+{
+ struct mem_cgroup *root = sc->target_mem_cgroup;
+ struct mem_cgroup *memcg;
+ struct mem_cgroup_reclaim_cookie reclaim = {
+ .pgdat = zone->zone_pgdat,
+ .priority = sc->priority,
+ };
+
+ memcg = mem_cgroup_iter(root, NULL, &reclaim);
+ do {
+ struct lruvec *lruvec;
+
+ lruvec = mem_cgroup_lruvec(memcg, zone->zone_pgdat);
+ *nr_reclaimed += shrink_list(lru, nr_to_scan, lruvec, memcg, sc);
+ if (*nr_reclaimed >= nr_to_reclaim) {
+ mem_cgroup_iter_break(root, memcg);
+ return true;
+ }
+ memcg = mem_cgroup_iter(root, memcg, &reclaim);
+ } while (memcg);
+
+ return false;
+}
+/*
+ * Tries to reclaim 'nr_pages' pages from LRU lists system-wide, for given
+ * pass.
+ *
+ * For pass >= 3 we also try to shrink the LRU lists that contain only a few pages
+ *
+ * Returns the number of scanned zones.
+ */
+static int shrink_all_zones(unsigned long nr_pages, int pass,
+ struct scan_control *sc)
+{
+ struct zone *zone;
+ unsigned long nr_reclaimed = 0;
+ unsigned int nr_locked_zones = 0;
+ DEFINE_WAIT(wait);
+
+ prepare_to_wait(&pagecache_reclaim_wq, &wait, TASK_INTERRUPTIBLE);
+
+ for_each_populated_zone(zone) {
+ enum lru_list lru;
+
+ /*
+ * Back off if somebody is already reclaiming this zone
+ * for the pagecache reclaim.
+ */
+ if (!pagecache_reclaim_lock_zone(zone))
+ continue;
+
+
+ /*
+ * This reclaimer might scan a zone so it will never
+ * sleep on pagecache_reclaim_wq
+ */
+ finish_wait(&pagecache_reclaim_wq, &wait);
+ nr_locked_zones++;
+
+ for_each_evictable_lru(lru) {
+ enum zone_stat_item ls = NR_ZONE_LRU_BASE + lru;
+ unsigned long lru_pages = zone_page_state(zone, ls);
+
+ /* For pass = 0, we don't shrink the active list */
+ if (pass == 0 && (lru == LRU_ACTIVE_ANON ||
+ lru == LRU_ACTIVE_FILE))
+ continue;
+
+ /* The original code relied on nr_saved_scan, which is no
+ * longer present, so we only consider the LRU pages here.
+ * This means a zone needs a fairly large LRU list to be
+ * scanned at default priority with the minimum nr_pages
+ * (8*SWAP_CLUSTER_MAX), so we tend to reclaim more from
+ * large zones than from small ones. That should be fine
+ * because shrink_page_cache is called when memory is
+ * getting short, and in that case LRUs tend to be large.
+ */
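+ /*
+ * Worked example (assuming SWAP_CLUSTER_MAX == 32 and 4KB pages):
+ * with DEF_PRIORITY (12) and the minimum nr_pages of
+ * 8*SWAP_CLUSTER_MAX (256), a zone needs about 256 << 12, i.e.
+ * roughly a million LRU pages (~4GB), before it is scanned at
+ * default priority.
+ */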
+ if (((lru_pages >> sc->priority) + 1) >= nr_pages || pass >= 3) {
+ unsigned long nr_to_scan;
+
+ nr_to_scan = min(nr_pages, lru_pages);
+
+ /*
+ * A bit of a hack but the code has always been
+ * updating sc->nr_reclaimed once per shrink_all_zones
+ * rather than accumulating it for all calls to shrink
+ * lru. This costs us an additional argument to
+ * shrink_zone_per_memcg but well...
+ *
+ * Let's stick with this for bug-to-bug compatibility
+ */
+ while (nr_to_scan > 0) {
+ /* shrink_list takes lru_lock with IRQ off so we
+ * should be careful about really huge nr_to_scan
+ */
+ unsigned long batch = min_t(unsigned long, nr_to_scan, SWAP_CLUSTER_MAX);
+
+ if (shrink_zone_per_memcg(zone, lru,
+ batch, nr_pages, &nr_reclaimed, sc)) {
+ pagecache_reclaim_unlock_zone(zone);
+ goto out_wakeup;
+ }
+ nr_to_scan -= batch;
+ }
+ }
+ }
+ pagecache_reclaim_unlock_zone(zone);
+ }
+ /*
+ * We have to go to sleep because all the zones are already being reclaimed.
+ * One of the reclaimers will wake us up, or __shrink_page_cache will
+ * do it if there is nothing left to be done.
+ */
+ if (!nr_locked_zones) {
+ if (!kpclimitd_context)
+ schedule();
+ finish_wait(&pagecache_reclaim_wq, &wait);
+ goto out;
+ }
+
+out_wakeup:
+ wake_up_interruptible(&pagecache_reclaim_wq);
+ sc->nr_reclaimed += nr_reclaimed;
+out:
+ return nr_locked_zones;
+}
+
+/*
+ * Function to shrink the page cache
+ *
+ * This function calculates the number of pages (nr_pages) the page
+ * cache is over its limit and shrinks the page cache accordingly.
+ *
+ * The maximum number of pages the page cache shrinks in one call of
+ * this function is limited to SWAP_CLUSTER_MAX pages. Therefore it may
+ * require a number of calls to actually get back under vm_pagecache_limit_kb.
+ *
+ * This function is similar to shrink_all_memory, except that it may never
+ * swap out mapped pages and only does four passes.
+ */
+static void __shrink_page_cache(gfp_t mask)
+{
+ unsigned long ret = 0;
+ int pass = 0;
+ struct reclaim_state reclaim_state;
+ struct scan_control sc = {
+ .gfp_mask = mask,
+ .may_swap = 0,
+ .may_unmap = 0,
+ .may_writepage = 0,
+ .target_mem_cgroup = NULL,
+ .reclaim_idx = MAX_NR_ZONES,
+ };
+ struct reclaim_state *old_rs = current->reclaim_state;
+ long nr_pages;
+
+ /* We might sleep during direct reclaim, so calling this from atomic
+ * context is certainly a bug.
+ */
+ BUG_ON(!(mask & __GFP_RECLAIM));
+
+retry:
+ /* How many pages are we over the limit? */
+ nr_pages = pagecache_over_limit();
+
+ /*
+ * Return early if there's no work to do.
+ * Wake up reclaimers that couldn't scan any zone due to congestion.
+ * There is apparently nothing to do so they do not have to sleep.
+ * This makes sure that no sleeping reclaimer will stay behind.
+ * Allow breaching the limit if the task is on the way out.
+ */
+ if (nr_pages <= 0 || fatal_signal_pending(current)) {
+ wake_up_interruptible(&pagecache_reclaim_wq);
+ goto out;
+ }
+
+ /* But do a few at least */
+ nr_pages = max_t(unsigned long, nr_pages, 8*SWAP_CLUSTER_MAX);
+
+ current->reclaim_state = &reclaim_state;
+
+ /*
+ * Shrink the LRU in 4 passes:
+ * 0 = Reclaim from inactive_list only (fast)
+ * 1 = Reclaim from active list but don't reclaim mapped or dirtied pages (not that fast)
+ * 2 = Reclaim from active list but don't reclaim mapped pages;
+ * it may reclaim dirtied pages if vm_pagecache_ignore_dirty = 0
+ * 3 = same as pass 2, but it will also shrink LRU lists with only a few pages; see shrink_all_zones
+ */
+ for (; pass <= 3; pass++) {
+ for (sc.priority = DEF_PRIORITY; sc.priority >= 0; sc.priority--) {
+ unsigned long nr_to_scan = nr_pages - ret;
+ struct mem_cgroup *memcg = NULL;
+ int nid;
+
+ sc.nr_scanned = 0;
+
+ /*
+ * No zone was scanned because too many reclaimers hold the zones.
+ * Retry to check whether there is still something to do.
+ */
+ if (!shrink_all_zones(nr_to_scan, pass, &sc))
+ goto retry;
+
+ ret += sc.nr_reclaimed;
+ if (ret >= nr_pages)
+ goto out;
+
+ reclaim_state.reclaimed_slab = 0;
+ for_each_online_node(nid) {
+ do {
+ shrink_slab(mask, nid, memcg, sc.nr_scanned,
+ global_reclaimable_pages());
+ } while ((memcg = mem_cgroup_iter(NULL, memcg, NULL)) != NULL);
+ }
+ ret += reclaim_state.reclaimed_slab;
+
+ if (ret >= nr_pages)
+ goto out;
+
+ }
+ if (pass == 1) {
+ if (vm_pagecache_ignore_dirty == 1 ||
+ (mask & (__GFP_IO | __GFP_FS)) != (__GFP_IO | __GFP_FS) )
+ break;
+ else
+ sc.may_writepage = 1;
+ }
+ }
+
+out:
+ current->reclaim_state = old_rs;
+}
+
+#ifdef CONFIG_SHRINK_PAGECACHE
+static unsigned long __shrink_page_cache(gfp_t mask)
+{
+ struct scan_control sc = {
+ .gfp_mask = current_gfp_context(mask),
+ .reclaim_idx = gfp_zone(mask),
+ .may_writepage = !laptop_mode,
+ .nr_to_reclaim = SWAP_CLUSTER_MAX *
+ (unsigned long)vm_cache_reclaim_weight,
+ .may_unmap = 1,
+ .may_swap = 1,
+ .order = 0,
+ .priority = DEF_PRIORITY,
+ .target_mem_cgroup = NULL,
+ .nodemask = NULL,
+ };
+
+ struct zonelist *zonelist = node_zonelist(numa_node_id(), mask);
+
+ return do_try_to_free_pages(zonelist, &sc);
+}
+
+
+static void shrink_page_cache_work(struct work_struct *w);
+static void shrink_shepherd(struct work_struct *w);
+static DECLARE_DEFERRABLE_WORK(shepherd, shrink_shepherd);
+
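+/*
+ * The shepherd queues the per-CPU page cache reclaim work on every online
+ * CPU and, while vm_cache_reclaim_enable is set, re-arms itself every
+ * vm_cache_reclaim_s seconds (or every 120s when the interval is 0).
+ */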
+static void shrink_shepherd(struct work_struct *w)
+{
+ int cpu;
+
+ get_online_cpus();
+
+ for_each_online_cpu(cpu) {
+ struct delayed_work *work = &per_cpu(vmscan_work, cpu);
+
+ if (!delayed_work_pending(work) && vm_cache_reclaim_enable)
+ queue_delayed_work_on(cpu, system_wq, work, 0);
+ }
+
+ put_online_cpus();
+
+ /* we want all kernel thread to stop */
+ if (vm_cache_reclaim_enable) {
+ if (vm_cache_reclaim_s == 0)
+ schedule_delayed_work(&shepherd,
+ round_jiffies_relative(120 * HZ));
+ else
+ schedule_delayed_work(&shepherd,
+ round_jiffies_relative((unsigned long)
+ vm_cache_reclaim_s * HZ));
+ }
+}
+static void shrink_shepherd_timer(void)
+{
+ int cpu;
+
+ for_each_possible_cpu(cpu) {
+ struct delayed_work *work = &per_cpu(vmscan_work, cpu);
+
+ INIT_DEFERRABLE_WORK(work, shrink_page_cache_work);
+ }
+
+ schedule_delayed_work(&shepherd,
+ round_jiffies_relative((unsigned long)vm_cache_reclaim_s * HZ));
+}
+
+unsigned long shrink_page_cache(gfp_t mask)
+{
+ unsigned long nr_pages;
+
+ /* We reclaim the highmem zone too, it is useful for 32bit arch */
+ nr_pages = __shrink_page_cache(mask | __GFP_HIGHMEM);
+
+ return nr_pages;
+}
+static void shrink_page_cache_work(struct work_struct *w)
+{
+ struct delayed_work *work = to_delayed_work(w);
+ unsigned long nr_pages;
+
+ /*
+ * If vm_cache_reclaim_enable or vm_cache_reclaim_s is zero,
+ * do not shrink the page cache again.
+ */
+ if (vm_cache_reclaim_s == 0 || !vm_cache_reclaim_enable)
+ return;
+
+ /* Wait longer before the next run if we reclaimed hardly any page cache */
+ nr_pages = shrink_page_cache(GFP_KERNEL);
+ if ((nr_pages < SWAP_CLUSTER_MAX) && vm_cache_reclaim_enable)
+ queue_delayed_work_on(smp_processor_id(), system_wq, work,
+ round_jiffies_relative(120 * HZ));
+}
+
+static void shrink_page_cache_init(void)
+{
+ vm_cache_limit_ratio = 0;
+ vm_cache_limit_ratio_min = 0;
+ vm_cache_limit_ratio_max = 100;
+ vm_cache_limit_mbytes = 0;
+ vm_cache_limit_mbytes_min = 0;
+ vm_cache_limit_mbytes_max = totalram_pages >> (20 - PAGE_SHIFT);
+ vm_cache_reclaim_s = 0;
+ vm_cache_reclaim_s_min = 0;
+ vm_cache_reclaim_s_max = 43200;
+ vm_cache_reclaim_weight = 1;
+ vm_cache_reclaim_weight_min = 1;
+ vm_cache_reclaim_weight_max = 100;
+ vm_cache_reclaim_enable = 1;
+
+ shrink_shepherd_timer();
+}
+
+static int kswapd_cpu_down_prep(unsigned int cpu)
+{
+ cancel_delayed_work_sync(&per_cpu(vmscan_work, cpu));
+
+ return 0;
+}
+int cache_reclaim_enable_handler(struct ctl_table *table, int write,
+ void __user *buffer, size_t *length, loff_t *ppos)
+{
+ int ret;
+
+ ret = proc_dointvec_minmax(table, write, buffer, length, ppos);
+ if (ret)
+ return ret;
+
+ if (write)
+ schedule_delayed_work(&shepherd, round_jiffies_relative((unsigned long)vm_cache_reclaim_s * HZ));
+
+ return 0;
+}
+
+int cache_reclaim_sysctl_handler(struct ctl_table *table, int write,
+ void __user *buffer, size_t *length, loff_t *ppos)
+{
+ int ret;
+
+ ret = proc_dointvec_minmax(table, write, buffer, length, ppos);
+ if (ret)
+ return ret;
+
+ if (write)
+ mod_delayed_work(system_wq, &shepherd,
+ round_jiffies_relative(
+ (unsigned long)vm_cache_reclaim_s * HZ));
+
+ return ret;
+}
+#endif
+
+static int kpagecache_limitd(void *data)
+{
+ DEFINE_WAIT(wait);
+ kpclimitd_context = true;
+
+ /*
+ * make sure all waiting threads are woken up when switching to async mode
+ */
+ if (waitqueue_active(&pagecache_reclaim_wq))
+ wake_up_interruptible(&pagecache_reclaim_wq);
+
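+ /* Reclaim once per wakeup; sleep until woken by wakeup_kpclimitd() or asked to stop via kthread_stop(). */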
+ for ( ; ; ) {
+ __shrink_page_cache(GFP_KERNEL);
+ prepare_to_wait(&kpagecache_limitd_wq, &wait, TASK_INTERRUPTIBLE);
+
+ if (!kthread_should_stop())
+ schedule();
+ else {
+ finish_wait(&kpagecache_limitd_wq, &wait);
+ break;
+ }
+ finish_wait(&kpagecache_limitd_wq, &wait);
+ }
+ kpclimitd_context = false;
+ return 0;
+}
+
+static void wakeup_kpclimitd(gfp_t mask)
+{
+ if (!waitqueue_active(&kpagecache_limitd_wq))
+ return;
+ wake_up_interruptible(&kpagecache_limitd_wq);
+}
+
+void shrink_page_cache(gfp_t mask, struct page *page)
+{
+ if (0 == vm_pagecache_limit_async)
+ __shrink_page_cache(mask);
+ else
+ wakeup_kpclimitd(mask);
+}
+
+/* It's optimal to keep kswapds on the same CPUs as their memory, but
+ not required for correctness. So if the last cpu in a node goes
+ away, we get changed to run anywhere: as the first one comes back,
+ restore their cpu bindings. */
+static int kswapd_cpu_online(unsigned int cpu)
+{
+ int nid;
+
+ for_each_node_state(nid, N_MEMORY) {
+ pg_data_t *pgdat = NODE_DATA(nid);
+ const struct cpumask *mask;
+
+ mask = cpumask_of_node(pgdat->node_id);
+
+ if (cpumask_any_and(cpu_online_mask, mask) < nr_cpu_ids)
+ /* One of our CPUs online: restore mask */
+ set_cpus_allowed_ptr(pgdat->kswapd, mask);
+ }
+ return 0;
+}
/*
* This kswapd start function will be called by init and node-hot-add.
@@ -4286,16 +4849,61 @@ void kswapd_stop(int nid)
static int __init kswapd_init(void)
{
- int nid;
+ /*int nid;
swap_setup();
for_each_node_state(nid, N_MEMORY)
kswapd_run(nid);
- return 0;
+ return 0;*/
+ int nid, ret;
+
+ swap_setup();
+ for_each_node_state(nid, N_MEMORY)
+ kswapd_run(nid);
+#ifdef CONFIG_SHRINK_PAGECACHE
+ ret = cpuhp_setup_state_nocalls(CPUHP_AP_ONLINE_DYN,
+ "mm/vmscan:online", kswapd_cpu_online,
+ kswapd_cpu_down_prep);
+#else
+ ret = cpuhp_setup_state_nocalls(CPUHP_AP_ONLINE_DYN,
+ "mm/vmscan:online", kswapd_cpu_online,
+ NULL);
+#endif
+ WARN_ON(ret < 0);
+#ifdef CONFIG_SHRINK_PAGECACHE
+ shrink_page_cache_init();
+#endif
+ return 0;
+
}
module_init(kswapd_init)
+int kpagecache_limitd_run(void)
+{
+ int ret = 0;
+
+ if (kpclimitd)
+ return 0;
+
+ kpclimitd = kthread_run(kpagecache_limitd, NULL, "kpclimitd");
+ if (IS_ERR(kpclimitd)) {
+ pr_err("Failed to start kpagecache_limitd thread\n");
+ ret = PTR_ERR(kpclimitd);
+ kpclimitd = NULL;
+ }
+ return ret;
+
+}
+
+void kpagecache_limitd_stop(void)
+{
+ if (kpclimitd) {
+ kthread_stop(kpclimitd);
+ kpclimitd = NULL;
+ }
+}
+
#ifdef CONFIG_NUMA
/*
* Node reclaim mode
diff --git a/mm/workingset.c b/mm/workingset.c
index bba4380405b4..9a5ad145b9bd 100644
--- a/mm/workingset.c
+++ b/mm/workingset.c
@@ -253,6 +253,7 @@ void workingset_age_nonresident(struct lruvec *lruvec, unsigned long nr_pages)
void *workingset_eviction(struct page *page, struct mem_cgroup *target_memcg)
{
struct pglist_data *pgdat = page_pgdat(page);
+ struct mem_cgroup *memcg = page_memcg(page);
unsigned long eviction;
struct lruvec *lruvec;
int memcgid;
--
2.30.0

[PATCH kernel-4.19] drivers/txgbe: fix buffer not null terminated by strncpy in txgbe_ethtool.c
by shenzijun 26 Oct '21
From: 沈子俊 <shenzijun(a)kylinos.cn>
kylin inclusion
category: bugfix
bugzilla: https://gitee.com/openeuler/kernel/issues/I4AG3E?from=project-issue
CVE: NA
---------------------------------------------------
Change the copy size passed to strncpy() so that the destination buffer is always NUL-terminated.
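For context, a minimal user-space sketch of the bug class (the strings and sizes below are made up; this is not the driver code): strncpy() bounded by the full destination size never writes a terminating NUL when the source is at least that long, while a sizeof(dst) - 1 bound leaves the already-zeroed last byte as the terminator.
	#include <stdio.h>
	#include <string.h>
	int main(void)
	{
		char src[] = "a string longer than the destination buffer";
		char dst[16] = { 0 };
		/* Bounded by the full size: 16 bytes are copied and no NUL is
		 * written, so dst is not a valid C string here. */
		strncpy(dst, src, sizeof(dst));
		/* Bounded by size - 1: at most 15 bytes are copied and the
		 * zero-initialised last byte keeps dst terminated. */
		memset(dst, 0, sizeof(dst));
		strncpy(dst, src, sizeof(dst) - 1);
		printf("%s\n", dst);
		return 0;
	}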
Signed-off-by: 沈子俊 <shenzijun(a)kylinos.cn>
---
drivers/net/ethernet/netswift/txgbe/txgbe_ethtool.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/drivers/net/ethernet/netswift/txgbe/txgbe_ethtool.c b/drivers/net/ethernet/netswift/txgbe/txgbe_ethtool.c
index 5cb8ef61e04b..9af9f19fb491 100644
--- a/drivers/net/ethernet/netswift/txgbe/txgbe_ethtool.c
+++ b/drivers/net/ethernet/netswift/txgbe/txgbe_ethtool.c
@@ -1040,7 +1040,7 @@ static void txgbe_get_drvinfo(struct net_device *netdev,
strncpy(drvinfo->version, txgbe_driver_version,
sizeof(drvinfo->version) - 1);
strncpy(drvinfo->fw_version, adapter->eeprom_id,
- sizeof(drvinfo->fw_version));
+ sizeof(drvinfo->fw_version) - 1);
strncpy(drvinfo->bus_info, pci_name(adapter->pdev),
sizeof(drvinfo->bus_info) - 1);
if (adapter->num_tx_queues <= TXGBE_NUM_RX_QUEUES) {
--
2.30.0

[PATCH openEuler-1.0-LTS] blk-mq: complete req in softirq context in case of single queue
by Yang Yingliang 26 Oct '21
From: Ming Lei <ming.lei(a)redhat.com>
mainline inclusion
from mainline-4.20-rc1
commit 36e765392e48e0322222347c4d21078c0b94758c
category: bugfix
bugzilla: 175585
CVE: NA
-------------------------------------------------
Lots of controllers may have only one irq vector for completing IO
requests, and the affinity of that single vector usually covers all
possible CPUs; however, on most architectures the interrupt ends up
being handled on one specific CPU.
So if all IOs are completed in hardirq context, IO performance
inevitably degrades because of the increased irq latency.
This patch addresses the issue by allowing requests to be completed
in softirq context, like the legacy IO path.
IOPS improves by ~13% in the following randread test on raid0 over
virtio-scsi.
mdadm --create --verbose /dev/md0 --level=0 --chunk=1024 --raid-devices=8 /dev/sdb /dev/sdc /dev/sdd /dev/sde /dev/sdf /dev/sdg /dev/sdh /dev/sdi
fio --time_based --name=benchmark --runtime=30 --filename=/dev/md0 --nrfiles=1 --ioengine=libaio --iodepth=32 --direct=1 --invalidate=1 --verify=0 --verify_fatal=0 --numjobs=32 --rw=randread --blocksize=4k
Cc: Dongli Zhang <dongli.zhang(a)oracle.com>
Cc: Zach Marano <zmarano(a)google.com>
Cc: Christoph Hellwig <hch(a)lst.de>
Cc: Bart Van Assche <bvanassche(a)acm.org>
Cc: Jianchao Wang <jianchao.w.wang(a)oracle.com>
Signed-off-by: Ming Lei <ming.lei(a)redhat.com>
Signed-off-by: Jens Axboe <axboe(a)kernel.dk>
Signed-off-by: Lihong Kou <koulihong(a)huawei.com>
Reviewed-by: Hou Tao <houtao1(a)huawei.com>
Signed-off-by: Yang Yingliang <yangyingliang(a)huawei.com>
---
block/blk-mq.c | 14 ++++++++++++++
block/blk-softirq.c | 5 ++---
2 files changed, 16 insertions(+), 3 deletions(-)
diff --git a/block/blk-mq.c b/block/blk-mq.c
index 7106c94ea58fe..55c81dcafbdc2 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -600,6 +600,20 @@ void blk_mq_force_complete_rq(struct request *rq)
if (rq->internal_tag != -1)
blk_mq_sched_completed_request(rq);
+ /*
+ * Most of single queue controllers, there is only one irq vector
+ * for handling IO completion, and the only irq's affinity is set
+ * as all possible CPUs. On most of ARCHs, this affinity means the
+ * irq is handled on one specific CPU.
+ *
+ * So complete IO request in softirq context in case of single queue
+ * for not degrading IO performance by irqsoff latency.
+ */
+ if (rq->q->nr_hw_queues == 1) {
+ __blk_complete_request(rq);
+ return;
+ }
+
if (!test_bit(QUEUE_FLAG_SAME_COMP, &rq->q->queue_flags)) {
rq->q->softirq_done_fn(rq);
return;
diff --git a/block/blk-softirq.c b/block/blk-softirq.c
index 15c1f5e12eb89..e47a2f751884d 100644
--- a/block/blk-softirq.c
+++ b/block/blk-softirq.c
@@ -97,8 +97,8 @@ static int blk_softirq_cpu_dead(unsigned int cpu)
void __blk_complete_request(struct request *req)
{
- int ccpu, cpu;
struct request_queue *q = req->q;
+ int cpu, ccpu = q->mq_ops ? req->mq_ctx->cpu : req->cpu;
unsigned long flags;
bool shared = false;
@@ -110,8 +110,7 @@ void __blk_complete_request(struct request *req)
/*
* Select completion CPU
*/
- if (req->cpu != -1) {
- ccpu = req->cpu;
+ if (test_bit(QUEUE_FLAG_SAME_COMP, &q->queue_flags) && ccpu != -1) {
if (!test_bit(QUEUE_FLAG_SAME_FORCE, &q->queue_flags))
shared = cpus_share_cache(cpu, ccpu);
} else
--
2.25.1

[PATCH kernel-4.19] blk-mq: complete req in softirq context in case of single queue
by Yang Yingliang 26 Oct '21
From: Ming Lei <ming.lei(a)redhat.com>
mainline inclusion
from mainline-4.20-rc1
commit 36e765392e48e0322222347c4d21078c0b94758c
category: bugfix
bugzilla: 175585
CVE: NA
-------------------------------------------------
Lots of controllers may have only one irq vector for completing IO
requests, and the affinity of that single vector usually covers all
possible CPUs; however, on most architectures the interrupt ends up
being handled on one specific CPU.
So if all IOs are completed in hardirq context, IO performance
inevitably degrades because of the increased irq latency.
This patch addresses the issue by allowing requests to be completed
in softirq context, like the legacy IO path.
IOPS improves by ~13% in the following randread test on raid0 over
virtio-scsi.
mdadm --create --verbose /dev/md0 --level=0 --chunk=1024 --raid-devices=8 /dev/sdb /dev/sdc /dev/sdd /dev/sde /dev/sdf /dev/sdg /dev/sdh /dev/sdi
fio --time_based --name=benchmark --runtime=30 --filename=/dev/md0 --nrfiles=1 --ioengine=libaio --iodepth=32 --direct=1 --invalidate=1 --verify=0 --verify_fatal=0 --numjobs=32 --rw=randread --blocksize=4k
Cc: Dongli Zhang <dongli.zhang(a)oracle.com>
Cc: Zach Marano <zmarano(a)google.com>
Cc: Christoph Hellwig <hch(a)lst.de>
Cc: Bart Van Assche <bvanassche(a)acm.org>
Cc: Jianchao Wang <jianchao.w.wang(a)oracle.com>
Signed-off-by: Ming Lei <ming.lei(a)redhat.com>
Signed-off-by: Jens Axboe <axboe(a)kernel.dk>
Signed-off-by: Lihong Kou <koulihong(a)huawei.com>
Reviewed-by: Tao Hou <houtao1(a)huawei.com>
Signed-off-by: Yang Yingliang <yangyingliang(a)huawei.com>
---
block/blk-mq.c | 14 ++++++++++++++
block/blk-softirq.c | 5 ++---
2 files changed, 16 insertions(+), 3 deletions(-)
diff --git a/block/blk-mq.c b/block/blk-mq.c
index be1e2ad4631aa..52a04f6ffeea2 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -603,6 +603,20 @@ void blk_mq_force_complete_rq(struct request *rq)
if (rq->internal_tag != -1)
blk_mq_sched_completed_request(rq);
+ /*
+ * Most of single queue controllers, there is only one irq vector
+ * for handling IO completion, and the only irq's affinity is set
+ * as all possible CPUs. On most of ARCHs, this affinity means the
+ * irq is handled on one specific CPU.
+ *
+ * So complete IO request in softirq context in case of single queue
+ * for not degrading IO performance by irqsoff latency.
+ */
+ if (rq->q->nr_hw_queues == 1) {
+ __blk_complete_request(rq);
+ return;
+ }
+
if (!test_bit(QUEUE_FLAG_SAME_COMP, &rq->q->queue_flags)) {
rq->q->softirq_done_fn(rq);
return;
diff --git a/block/blk-softirq.c b/block/blk-softirq.c
index 15c1f5e12eb89..e47a2f751884d 100644
--- a/block/blk-softirq.c
+++ b/block/blk-softirq.c
@@ -97,8 +97,8 @@ static int blk_softirq_cpu_dead(unsigned int cpu)
void __blk_complete_request(struct request *req)
{
- int ccpu, cpu;
struct request_queue *q = req->q;
+ int cpu, ccpu = q->mq_ops ? req->mq_ctx->cpu : req->cpu;
unsigned long flags;
bool shared = false;
@@ -110,8 +110,7 @@ void __blk_complete_request(struct request *req)
/*
* Select completion CPU
*/
- if (req->cpu != -1) {
- ccpu = req->cpu;
+ if (test_bit(QUEUE_FLAG_SAME_COMP, &q->queue_flags) && ccpu != -1) {
if (!test_bit(QUEUE_FLAG_SAME_FORCE, &q->queue_flags))
shared = cpus_share_cache(cpu, ccpu);
} else
--
2.25.1
1
0

[PATCH kernel-4.19 1/8] ovl: simplify setting of origin for index lookup
by Yang Yingliang 26 Oct '21
From: Vivek Goyal <vgoyal(a)redhat.com>
mainline inclusion
from mainline-v5.8-rc1
commit 59fb20138a9b5249a4176d5bbc5c670a97343061
category: bugfix
bugzilla: NA
CVE: NA
-------------------------------------------------
overlayfs can keep an index of copied-up files and directories, and it seems
to serve two primary purposes. For regular files, it avoids breaking lower
hardlinks over copy up. For directories it seems to be used for various
error checks.
During ovl_lookup(), we look up the index using the lower dentry in many
cases. That lower dentry is called "origin", and the following is a summary
of the current logic.
If there is no upperdentry, always look up the index using the lower dentry.
For regular files this helps avoid breaking hard links over copyup, and for
directories it seems to be used just for error checks.
If there is an upperdentry, then there are 3 possible cases.
- For directories, lower dentry is found using two ways. One is regular
path based lookup in lower layers and second is using ORIGIN xattr on
upper dentry. First verify that path based lookup lower dentry matches
the one pointed by upper ORIGIN xattr. If yes, use this verified origin
for index lookup.
- For regular files (non-metacopy), there is no path based lookup in lower
layers as lookup stops once we find upper dentry. So there is no origin
verification. If there is ORIGIN xattr present on upper, use that to
lookup index otherwise don't.
- For regular metacopy files, again lower dentry is found using path based
lookup as well as ORIGIN xattr on upper. Path based lookup is continued
in this case to find lower data dentry for metacopy upper. So like
directories we only use verified origin. If ORIGIN xattr is not present
(Either because lower did not support file handles or because this is
hardlink copied up with index=off), then don't use path lookup based
lower dentry as origin. This is same as regular non-metacopy file case.
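To make the three cases above easier to follow, here is a rough sketch of the origin selection in plain C, using simplified, hypothetical names; the real logic lives in ovl_lookup() and the hunk below.
	#include <stdbool.h>
	#include <stddef.h>
	struct dentry;	/* opaque stand-in for the kernel's struct dentry */
	/*
	 * 'lower' is the lower dentry found so far: from the path-based lookup
	 * for the non-upper and directory/metacopy cases, or decoded from the
	 * upper's ORIGIN xattr for a plain regular file.
	 */
	static struct dentry *index_origin(struct dentry *upper, struct dentry *lower,
					   bool lower_verified, bool is_dir,
					   bool metacopy)
	{
		if (!upper)
			return lower;	/* non-upper: always use the lower dentry */
		if (is_dir || metacopy)	/* trust only a verified origin */
			return lower_verified ? lower : NULL;
		return lower;		/* plain regular upper: unverified ORIGIN xattr, may be NULL */
	}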
Suggested-by: Amir Goldstein <amir73il(a)gmail.com>
Signed-off-by: Vivek Goyal <vgoyal(a)redhat.com>
Reviewed-by: Amir Goldstein <amir73il(a)gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi(a)redhat.com>
Signed-off-by: Zheng Liang <zhengliang6(a)huawei.com>
Reviewed-by: Zhang Yi <yi.zhang(a)huawei.com>
Signed-off-by: Yang Yingliang <yangyingliang(a)huawei.com>
---
fs/overlayfs/namei.c | 29 +++++++++++++++++------------
1 file changed, 17 insertions(+), 12 deletions(-)
diff --git a/fs/overlayfs/namei.c b/fs/overlayfs/namei.c
index 145bfdde53feb..968ad757c578e 100644
--- a/fs/overlayfs/namei.c
+++ b/fs/overlayfs/namei.c
@@ -1014,25 +1014,30 @@ struct dentry *ovl_lookup(struct inode *dir, struct dentry *dentry,
}
stack = origin_path;
ctr = 1;
+ origin = origin_path->dentry;
origin_path = NULL;
}
/*
- * Lookup index by lower inode and verify it matches upper inode.
- * We only trust dir index if we verified that lower dir matches
- * origin, otherwise dir index entries may be inconsistent and we
- * ignore them.
+ * Always lookup index if there is no-upperdentry.
*
- * For non-dir upper metacopy dentry, we already set "origin" if we
- * verified that lower matched upper origin. If upper origin was
- * not present (because lower layer did not support fh encode/decode),
- * or indexing is not enabled, do not set "origin" and skip looking up
- * index. This case should be handled in same way as a non-dir upper
- * without ORIGIN is handled.
+ * For the case of upperdentry, we have set origin by now if it
+ * needed to be set. There are basically three cases.
+ *
+ * For directories, lookup index by lower inode and verify it matches
+ * upper inode. We only trust dir index if we verified that lower dir
+ * matches origin, otherwise dir index entries may be inconsistent
+ * and we ignore them.
+ *
+ * For regular upper, we already set origin if upper had ORIGIN
+ * xattr. There is no verification though as there is no path
+ * based dentry lookup in lower in this case.
+ *
+ * For metacopy upper, we set a verified origin already if index
+ * is enabled and if upper had an ORIGIN xattr.
*
- * Always lookup index of non-dir non-metacopy and non-upper.
*/
- if (ctr && (!upperdentry || (!d.is_dir && !metacopy)))
+ if (!upperdentry && ctr)
origin = stack[0].dentry;
if (origin && ovl_indexdir(dentry->d_sb) &&
--
2.25.1

26 Oct '21
Reviewed-by: Cheng Jian <cj.chengjian(a)huawei.com>
On 2021/10/23 17:03, xjx00 wrote:
> From: Maciej Żenczykowski <maze(a)google.com>
>
> stable inclusion
> from stable-v5.10.44
> commit 0f5a20b1fd9da3ac9f7c6edcad522712ca694d5c
> bugzilla:https://bugzilla.openeuler.org/show_bug.cgi?id=358
> CVE: NA
>
> -------------------------------------------------
>
> commit 3370139745853f7826895293e8ac3aec1430508e upstream.
>
> [ 190.544755] configfs-gadget gadget: notify speed -44967296
>
> This is because 4250000000 - 2**32 is -44967296.
>
> Fixes: 9f6ce4240a2b ("usb: gadget: f_ncm.c added")
> Cc: Brooke Basile <brookebasile(a)gmail.com>
> Cc: Bryan O'Donoghue <bryan.odonoghue(a)linaro.org>
> Cc: Felipe Balbi <balbi(a)kernel.org>
> Cc: Lorenzo Colitti <lorenzo(a)google.com>
> Cc: Yauheni Kaliuta <yauheni.kaliuta(a)nokia.com>
> Cc: Linux USB Mailing List <linux-usb(a)vger.kernel.org>
> Acked-By: Lorenzo Colitti <lorenzo(a)google.com>
> Signed-off-by: Maciej Żenczykowski <maze(a)google.com>
> Cc: stable <stable(a)vger.kernel.org>
> Link: https://lore.kernel.org/r/20210608005344.3762668-1-zenczykowski@gmail.com
> Signed-off-by: Greg Kroah-Hartman <gregkh(a)linuxfoundation.org>
> Signed-off-by: xjx00 <xjxyklwx(a)126.com>
> ---
> drivers/usb/gadget/function/f_ncm.c | 2 +-
> 1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/drivers/usb/gadget/function/f_ncm.c b/drivers/usb/gadget/function/f_ncm.c
> index 019bea8e09cc..0d23c6c11a13 100644
> --- a/drivers/usb/gadget/function/f_ncm.c
> +++ b/drivers/usb/gadget/function/f_ncm.c
> @@ -583,7 +583,7 @@ static void ncm_do_notify(struct f_ncm *ncm)
> data[0] = cpu_to_le32(ncm_bitrate(cdev->gadget));
> data[1] = data[0];
>
> - DBG(cdev, "notify speed %d\n", ncm_bitrate(cdev->gadget));
> + DBG(cdev, "notify speed %u\n", ncm_bitrate(cdev->gadget));
> ncm->notify_state = NCM_NOTIFY_CONNECT;
> break;
> }
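For reference, the wrap-around in the quoted log can be reproduced in isolation; a minimal user-space sketch (the value is taken from the message above, and the signed reinterpretation assumes the usual two's-complement behaviour):
	#include <inttypes.h>
	#include <stdio.h>
	int main(void)
	{
		uint32_t speed = 4250000000u;	/* 4.25 Gbps, above INT32_MAX */
		/* %u prints the intended value. */
		printf("as unsigned: %" PRIu32 "\n", speed);
		/* Interpreting the same 32-bit pattern as signed wraps around:
		 * 4250000000 - 2^32 = -44967296, the value seen in the log. */
		printf("as signed:   %lld\n", (long long)speed - (1LL << 32));
		return 0;
	}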