From: Alexey Gladkov gladkov.alexey@gmail.com
mainline inclusion from mainline-v5.2-rc1 commit 898490c010b5d2e499e03b7e815fc214209ac583 category: bugfix bugzilla: https://gitee.com/openeuler/kernel/issues/I390TB CVE: NA
This patch is backported to fix a failure seen when the 5.10 kernel is built on an openEuler 20.03 system.
On openEuler 20.03, some modules were built into the kernel, but in the 5.10 kernel the corresponding modules are built as KOs. For built-in modules, kmod 2.7+ fetches the modinfo from modules.builtin.modinfo, which is only supported in kernel 5.2+.
During rpmbuild, the kernel spec calls kmod to query the modinfo of the KO images, and the query fails with 'file missing'.
With the mainline commit below backported, kmod can fetch any module's information from the corresponding module image or from modules.builtin.modinfo.
---------------------------
Problem:
When a kernel module is compiled as a separate module, some important information about the kernel module is available via .modinfo section of the module. In contrast, when the kernel module is compiled into the kernel, that information is not available.
Information about built-in modules is necessary in the following cases:
1. When it is necessary to find out what additional parameters can be passed to the kernel at boot time.
2. When you need to know which module names and their aliases are in the kernel. This is very useful for creating an initrd image.
Proposal:
The proposed patch does not remove the .modinfo section with module information from vmlinux at build time and saves it into a separate file after kernel linking. So, the kernel does not increase in size and no additional information remains in it. Information is stored in the same format as in separate modules (null-terminated string array). Because the .modinfo section is already exported with separate modules, we are not creating a new API.
It can be easily read in the userspace:
  $ tr '\0' '\n' < modules.builtin.modinfo
  ext4.softdep=pre: crc32c
  ext4.license=GPL
  ext4.description=Fourth Extended Filesystem
  ext4.author=Remy Card, Stephen Tweedie, Andrew Morton, Andreas Dilger, Theodore Ts'o and others
  ext4.alias=fs-ext4
  ext4.alias=ext3
  ext4.alias=fs-ext3
  ext4.alias=ext2
  ext4.alias=fs-ext2
  md_mod.alias=block-major-9-*
  md_mod.alias=md
  md_mod.description=MD RAID framework
  md_mod.license=GPL
  md_mod.parmtype=create_on_open:bool
  md_mod.parmtype=start_dirty_degraded:int
  ...
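The same file can also be read programmatically. The following is a minimal userspace sketch, assuming only the format described above (NUL-terminated strings of the form module.key=value). The helper name modinfo_find() is hypothetical and is not part of kmod or the kernel.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Return the value of the first "<module>.<key>=" entry in a buffer of
 * NUL-terminated strings, or NULL if no such entry exists.
 * modinfo_find() is a hypothetical helper written for this sketch.
 */
static const char *modinfo_find(const char *buf, size_t len,
				const char *module, const char *key)
{
	size_t mlen = strlen(module);
	size_t klen = strlen(key);
	size_t off = 0;

	assert(buf != NULL);

	while (off < len) {
		const char *entry = buf + off;
		const char *nul = memchr(entry, '\0', len - off);
		size_t elen;

		if (!nul)		/* trailing entry without a terminator */
			break;
		elen = (size_t)(nul - entry);

		/* Match "<module>" "." "<key>" "=" at the start of the entry. */
		if (elen > mlen + klen + 1 &&
		    strncmp(entry, module, mlen) == 0 &&
		    entry[mlen] == '.' &&
		    strncmp(entry + mlen + 1, key, klen) == 0 &&
		    entry[mlen + 1 + klen] == '=')
			return entry + mlen + klen + 2;

		off += elen + 1;	/* skip the entry and its terminator */
	}
	return NULL;
}
```

A caller would typically read or mmap modules.builtin.modinfo into a buffer and pass the buffer and its size. Only the first matching entry is returned, so keys that repeat (such as alias) need a loop built on the same scan.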
Co-Developed-by: Gleb Fotengauer-Malinovskiy glebfm@altlinux.org Signed-off-by: Gleb Fotengauer-Malinovskiy glebfm@altlinux.org Signed-off-by: Alexey Gladkov gladkov.alexey@gmail.com Acked-by: Jessica Yu jeyu@kernel.org Signed-off-by: Masahiro Yamada yamada.masahiro@socionext.com Signed-off-by: Zhichang Yuan erik.yuan@arm.com Reviewed-by: Xie XiuQi xiexiuqi@huawei.com Signed-off-by: Yang Yingliang yangyingliang@huawei.com --- .gitignore | 1 + Documentation/dontdiff | 1 + Documentation/kbuild/kbuild.txt | 5 +++++ Makefile | 2 ++ include/asm-generic/vmlinux.lds.h | 1 + include/linux/module.h | 1 + include/linux/moduleparam.h | 12 +++++------- scripts/link-vmlinux.sh | 3 +++ 8 files changed, 19 insertions(+), 7 deletions(-)
diff --git a/.gitignore b/.gitignore index 2d498af502ff1..546cef8d9b8fe 100644 --- a/.gitignore +++ b/.gitignore @@ -57,6 +57,7 @@ modules.builtin /vmlinuz /System.map /Module.markers +/modules.builtin.modinfo
# # RPM spec file (make rpm-pkg) diff --git a/Documentation/dontdiff b/Documentation/dontdiff index 2228fcc8e29f4..3d4d5a402b8be 100644 --- a/Documentation/dontdiff +++ b/Documentation/dontdiff @@ -179,6 +179,7 @@ mktables mktree modpost modules.builtin +modules.builtin.modinfo modules.order modversions.h* nconf diff --git a/Documentation/kbuild/kbuild.txt b/Documentation/kbuild/kbuild.txt index 8390c360d4b35..7f48e48f3fd27 100644 --- a/Documentation/kbuild/kbuild.txt +++ b/Documentation/kbuild/kbuild.txt @@ -11,6 +11,11 @@ modules.builtin This file lists all modules that are built into the kernel. This is used by modprobe to not fail when trying to load something builtin.
+modules.builtin.modinfo +-------------------------------------------------- +This file contains modinfo from all modules that are built into the kernel. +Unlike modinfo of a separate module, all fields are prefixed with module name. +
Environment variables
diff --git a/Makefile b/Makefile index 040b3cd699b01..e01e33b35daaf 100644 --- a/Makefile +++ b/Makefile @@ -1288,6 +1288,7 @@ _modinst_: fi @cp -f $(objtree)/modules.order $(MODLIB)/ @cp -f $(objtree)/modules.builtin $(MODLIB)/ + @cp -f $(objtree)/modules.builtin.modinfo $(MODLIB)/ $(Q)$(MAKE) -f $(srctree)/scripts/Makefile.modinst
# This depmod is only for convenience to give the initial @@ -1328,6 +1329,7 @@ endif # CONFIG_MODULES
# Directories & files removed with 'make clean' CLEAN_DIRS += $(MODVERDIR) include/ksym +CLEAN_FILES += modules.builtin.modinfo
# Directories & files removed with 'make mrproper' MRPROPER_DIRS += include/config usr/include include/generated \ diff --git a/include/asm-generic/vmlinux.lds.h b/include/asm-generic/vmlinux.lds.h index 2d632a74cc5e9..0276b6950ae1d 100644 --- a/include/asm-generic/vmlinux.lds.h +++ b/include/asm-generic/vmlinux.lds.h @@ -855,6 +855,7 @@ EXIT_CALL \ *(.discard) \ *(.discard.*) \ + *(.modinfo) \ }
/** diff --git a/include/linux/module.h b/include/linux/module.h index 49942432f0101..5056a346f69e9 100644 --- a/include/linux/module.h +++ b/include/linux/module.h @@ -239,6 +239,7 @@ extern typeof(name) __mod_##type##__##name##_device_table \ #define MODULE_VERSION(_version) MODULE_INFO(version, _version) #else #define MODULE_VERSION(_version) \ + MODULE_INFO(version, _version); \ static struct module_version_attribute ___modver_attr = { \ .mattr = { \ .attr = { \ diff --git a/include/linux/moduleparam.h b/include/linux/moduleparam.h index ba36506db4fb7..5ba250d9172ac 100644 --- a/include/linux/moduleparam.h +++ b/include/linux/moduleparam.h @@ -10,23 +10,21 @@ module name. */ #ifdef MODULE #define MODULE_PARAM_PREFIX /* empty */ +#define __MODULE_INFO_PREFIX /* empty */ #else #define MODULE_PARAM_PREFIX KBUILD_MODNAME "." +/* We cannot use MODULE_PARAM_PREFIX because some modules override it. */ +#define __MODULE_INFO_PREFIX KBUILD_MODNAME "." #endif
/* Chosen so that structs with an unsigned long line up. */ #define MAX_PARAM_PREFIX_LEN (64 - sizeof(unsigned long))
-#ifdef MODULE #define __MODULE_INFO(tag, name, info) \ static const char __UNIQUE_ID(name)[] \ __used __attribute__((section(".modinfo"), unused, aligned(1))) \ - = __stringify(tag) "=" info -#else /* !MODULE */ -/* This struct is here for syntactic coherency, it is not used */ -#define __MODULE_INFO(tag, name, info) \ - struct __UNIQUE_ID(name) {} -#endif + = __MODULE_INFO_PREFIX __stringify(tag) "=" info + #define __MODULE_PARM_TYPE(name, _type) \ __MODULE_INFO(parmtype, name##type, #name ":" _type)
diff --git a/scripts/link-vmlinux.sh b/scripts/link-vmlinux.sh index c8cf45362bd6f..c09e87e9c2b9f 100755 --- a/scripts/link-vmlinux.sh +++ b/scripts/link-vmlinux.sh @@ -226,6 +226,9 @@ modpost_link vmlinux.o # modpost vmlinux.o to check for section mismatches ${MAKE} -f "${srctree}/scripts/Makefile.modpost" vmlinux.o
+info MODINFO modules.builtin.modinfo +${OBJCOPY} -j .modinfo -O binary vmlinux.o modules.builtin.modinfo + kallsymso="" kallsyms_vmlinux="" if [ -n "${CONFIG_KALLSYMS}" ]; then
From: Miklos Szeredi mszeredi@redhat.com
mainline inclusion from mainline-v5.8-rc1 commit 130fdbc3d1f9966dd4230709c30f3768bccd3065 category: bugfix bugzilla: NA CVE: CVE-2020-16120
--------------------------------
The three instances of ovl_path_open() in overlayfs/readdir.c do three different things:
 - pass f_flags from overlay file
 - pass O_RDONLY | O_DIRECTORY
 - pass just O_RDONLY
The value of f_flags can be (other than O_RDONLY):
 O_WRONLY    - not possible for a directory
 O_RDWR      - not possible for a directory
 O_CREAT     - masked out by dentry_open()
 O_EXCL      - masked out by dentry_open()
 O_NOCTTY    - masked out by dentry_open()
 O_TRUNC     - masked out by dentry_open()
 O_APPEND    - no effect on directory ops
 O_NDELAY    - no effect on directory ops
 O_NONBLOCK  - no effect on directory ops
 __O_SYNC    - no effect on directory ops
 O_DSYNC     - no effect on directory ops
 FASYNC      - no effect on directory ops
 O_DIRECT    - no effect on directory ops
 O_LARGEFILE - ?
 O_DIRECTORY - only affects lookup
 O_NOFOLLOW  - only affects lookup
 O_NOATIME   - overlay sets this unconditionally in ovl_path_open()
 O_CLOEXEC   - only affects fd allocation
 O_PATH      - no effect on directory ops
 __O_TMPFILE - not possible for a directory
For non-merge directories we use the underlying filesystem's iterate; in this case honor O_LARGEFILE from the original file to make sure that open doesn't get rejected.
For merge directories it's safe to pass O_LARGEFILE unconditionally since userspace will only see the artificial offsets created by overlayfs.
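The resulting flag selection can be sketched in userspace as follows. This is an illustration only: the DEMO_* values are made up for the sketch and are not the kernel's flag values, and the two helper names are hypothetical.

```c
#include <assert.h>

/* Illustrative flag values chosen for this sketch, not kernel values. */
#define DEMO_O_RDONLY     0x0000
#define DEMO_O_DIRECTORY  0x0100
#define DEMO_O_LARGEFILE  0x0200
#define DEMO_O_APPEND     0x0400

/*
 * Non-merge directory: read-only, plus O_LARGEFILE only if the overlay
 * file was opened with it. Every other bit of f_flags is dropped.
 */
static int demo_nonmerge_open_flags(int f_flags)
{
	return DEMO_O_RDONLY | (f_flags & DEMO_O_LARGEFILE);
}

/* Merge directory: O_LARGEFILE is passed unconditionally. */
static int demo_merge_open_flags(void)
{
	return DEMO_O_RDONLY | DEMO_O_LARGEFILE;
}
```

Masking with a single allowed bit, rather than clearing the unwanted ones, means any flag not listed in the table above is dropped by default.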
Signed-off-by: Miklos Szeredi mszeredi@redhat.com Signed-off-by: Yang Yingliang yangyingliang@huawei.com Reviewed-by: zhangyi (F) yi.zhang@huawei.com Signed-off-by: Yang Yingliang yangyingliang@huawei.com --- fs/overlayfs/readdir.c | 12 +++++++++--- 1 file changed, 9 insertions(+), 3 deletions(-)
diff --git a/fs/overlayfs/readdir.c b/fs/overlayfs/readdir.c index ae99b90a8b98b..b98df843ac966 100644 --- a/fs/overlayfs/readdir.c +++ b/fs/overlayfs/readdir.c @@ -300,7 +300,7 @@ static inline int ovl_dir_read(struct path *realpath, struct file *realfile; int err;
- realfile = ovl_path_open(realpath, O_RDONLY | O_DIRECTORY); + realfile = ovl_path_open(realpath, O_RDONLY | O_LARGEFILE); if (IS_ERR(realfile)) return PTR_ERR(realfile);
@@ -823,6 +823,12 @@ static loff_t ovl_dir_llseek(struct file *file, loff_t offset, int origin) return res; }
+static struct file *ovl_dir_open_realfile(struct file *file, + struct path *realpath) +{ + return ovl_path_open(realpath, O_RDONLY | (file->f_flags & O_LARGEFILE)); +} + static int ovl_dir_fsync(struct file *file, loff_t start, loff_t end, int datasync) { @@ -845,7 +851,7 @@ static int ovl_dir_fsync(struct file *file, loff_t start, loff_t end, struct path upperpath;
ovl_path_upper(dentry, &upperpath); - realfile = ovl_path_open(&upperpath, O_RDONLY); + realfile = ovl_dir_open_realfile(file, &upperpath);
inode_lock(inode); if (!od->upperfile) { @@ -896,7 +902,7 @@ static int ovl_dir_open(struct inode *inode, struct file *file) return -ENOMEM;
type = ovl_path_real(file->f_path.dentry, &realpath); - realfile = ovl_path_open(&realpath, file->f_flags); + realfile = ovl_dir_open_realfile(file, &realpath); if (IS_ERR(realfile)) { kfree(od); return PTR_ERR(realfile);
From: Miklos Szeredi mszeredi@redhat.com
mainline inclusion from mainline-v5.8-rc1 commit 48bd024b8a40d73ad6b086de2615738da0c7004f category: bugfix bugzilla: NA CVE: CVE-2020-16120
--------------------------------
In preparation for more permission checking, override credentials for directory operations on the underlying filesystems.
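The control-flow change this requires (every return path must undo the override) can be sketched in userspace as below. The demo_* names and the demo_current variable are hypothetical stand-ins; they only mimic the save/restore shape of ovl_override_creds() and revert_creds() and do not reproduce their implementation.

```c
#include <assert.h>
#include <stddef.h>

struct demo_cred { int uid; };

/* Stands in for the credentials of the current task. */
static const struct demo_cred *demo_current;

/* Install new credentials and hand back the previous ones. */
static const struct demo_cred *demo_override(const struct demo_cred *new_cred)
{
	const struct demo_cred *old = demo_current;

	demo_current = new_cred;
	return old;
}

static void demo_revert(const struct demo_cred *old)
{
	demo_current = old;
}

/*
 * Every exit goes through "out", so the override is undone on success
 * and on each error path. fail_step selects a simulated failure.
 */
static int demo_iterate(const struct demo_cred *mounter, int fail_step)
{
	const struct demo_cred *old;
	int err;

	assert(mounter != NULL);
	old = demo_override(mounter);

	if (fail_step == 1) {
		err = -5;	/* simulated failure in the first step */
		goto out;
	}
	if (fail_step == 2) {
		err = -12;	/* simulated failure in the second step */
		goto out;
	}
	err = 0;
out:
	demo_revert(old);
	return err;
}
```

This is why the diff converts the early "return" statements in ovl_iterate() into "goto out".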
Signed-off-by: Miklos Szeredi mszeredi@redhat.com Signed-off-by: Yang Yingliang yangyingliang@huawei.com Reviewed-by: zhangyi (F) yi.zhang@huawei.com Signed-off-by: Yang Yingliang yangyingliang@huawei.com --- fs/overlayfs/readdir.c | 27 +++++++++++++++++++++------ 1 file changed, 21 insertions(+), 6 deletions(-)
diff --git a/fs/overlayfs/readdir.c b/fs/overlayfs/readdir.c index b98df843ac966..75a9a04eb56aa 100644 --- a/fs/overlayfs/readdir.c +++ b/fs/overlayfs/readdir.c @@ -735,8 +735,10 @@ static int ovl_iterate(struct file *file, struct dir_context *ctx) struct ovl_dir_file *od = file->private_data; struct dentry *dentry = file->f_path.dentry; struct ovl_cache_entry *p; + const struct cred *old_cred; int err;
+ old_cred = ovl_override_creds(dentry->d_sb); if (!ctx->pos) ovl_dir_reset(file);
@@ -750,17 +752,20 @@ static int ovl_iterate(struct file *file, struct dir_context *ctx) (ovl_same_fs(dentry->d_sb) && (ovl_is_impure_dir(file) || OVL_TYPE_MERGE(ovl_path_type(dentry->d_parent))))) { - return ovl_iterate_real(file, ctx); + err = ovl_iterate_real(file, ctx); + } else { + err = iterate_dir(od->realfile, ctx); } - return iterate_dir(od->realfile, ctx); + goto out; }
if (!od->cache) { struct ovl_dir_cache *cache;
cache = ovl_cache_get(dentry); + err = PTR_ERR(cache); if (IS_ERR(cache)) - return PTR_ERR(cache); + goto out;
od->cache = cache; ovl_seek_cursor(od, ctx->pos); @@ -772,7 +777,7 @@ static int ovl_iterate(struct file *file, struct dir_context *ctx) if (!p->ino) { err = ovl_cache_update_ino(&file->f_path, p); if (err) - return err; + goto out; } if (!dir_emit(ctx, p->name, p->len, p->ino, p->type)) break; @@ -780,7 +785,10 @@ static int ovl_iterate(struct file *file, struct dir_context *ctx) od->cursor = p->l_node.next; ctx->pos++; } - return 0; + err = 0; +out: + revert_creds(old_cred); + return err; }
static loff_t ovl_dir_llseek(struct file *file, loff_t offset, int origin) @@ -826,7 +834,14 @@ static loff_t ovl_dir_llseek(struct file *file, loff_t offset, int origin) static struct file *ovl_dir_open_realfile(struct file *file, struct path *realpath) { - return ovl_path_open(realpath, O_RDONLY | (file->f_flags & O_LARGEFILE)); + struct file *res; + const struct cred *old_cred; + + old_cred = ovl_override_creds(file_inode(file)->i_sb); + res = ovl_path_open(realpath, O_RDONLY | (file->f_flags & O_LARGEFILE)); + revert_creds(old_cred); + + return res; }
static int ovl_dir_fsync(struct file *file, loff_t start, loff_t end,
From: Miklos Szeredi mszeredi@redhat.com
mainline inclusion from mainline-v5.8-rc1 commit 56230d956739b9cb1cbde439d76227d77979a04d category: bugfix bugzilla: NA CVE: CVE-2020-16120
--------------------------------
Check permission before opening a real file.
ovl_path_open() is used by readdir and copy-up routines.
ovl_permission() theoretically already checked copy up permissions, but it doesn't hurt to re-do these checks during the actual copy-up.
For directory reading ovl_permission() only checks access to topmost underlying layer. Readdir on a merged directory accesses layers below the topmost one as well. Permission wasn't checked for these layers.
Note: modifying ovl_permission() to perform this check would be far more complex and hence more bug prone. The result is less precise permissions returned in access(2). If this turns out to be an issue, we can revisit this bug.
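The flag validation and access-mode mapping done before the open can be sketched in userspace as follows. The DEMO_* constants are illustrative values for this sketch, and demo_open_mask() is a hypothetical helper. Where the patch calls BUG(), the sketch returns -EINVAL so that it can run outside the kernel.

```c
#include <assert.h>
#include <errno.h>

/* Illustrative values chosen for this sketch. */
#define DEMO_O_RDONLY     0x0
#define DEMO_O_WRONLY     0x1
#define DEMO_O_RDWR       0x2
#define DEMO_O_ACCMODE    0x3
#define DEMO_O_LARGEFILE  0x100
#define DEMO_O_NOATIME    0x200

#define DEMO_MAY_WRITE    0x02
#define DEMO_MAY_READ     0x04
#define DEMO_MAY_OPEN     0x20

/*
 * Return the permission mask to check for an open with these flags,
 * or -EINVAL for flag combinations the caller is never expected to pass.
 */
static int demo_open_mask(int flags)
{
	/* Only an access mode and O_LARGEFILE are accepted. */
	if (flags & ~(DEMO_O_ACCMODE | DEMO_O_LARGEFILE))
		return -EINVAL;

	switch (flags & DEMO_O_ACCMODE) {
	case DEMO_O_RDONLY:
		return DEMO_MAY_READ | DEMO_MAY_OPEN;
	case DEMO_O_WRONLY:
		return DEMO_MAY_WRITE | DEMO_MAY_OPEN;
	default:
		return -EINVAL;
	}
}
```

In the patch the resulting mask is passed to inode_permission(), and O_NOATIME is added afterwards only when inode_owner_or_capable() allows it.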
Signed-off-by: Miklos Szeredi mszeredi@redhat.com Signed-off-by: Yang Yingliang yangyingliang@huawei.com Reviewed-by: zhangyi (F) yi.zhang@huawei.com Signed-off-by: Yang Yingliang yangyingliang@huawei.com --- fs/overlayfs/util.c | 27 ++++++++++++++++++++++++++- 1 file changed, 26 insertions(+), 1 deletion(-)
diff --git a/fs/overlayfs/util.c b/fs/overlayfs/util.c index d0570ac2788b9..eb9411461b695 100644 --- a/fs/overlayfs/util.c +++ b/fs/overlayfs/util.c @@ -466,7 +466,32 @@ bool ovl_is_whiteout(struct dentry *dentry)
struct file *ovl_path_open(struct path *path, int flags) { - return dentry_open(path, flags | O_NOATIME, current_cred()); + struct inode *inode = d_inode(path->dentry); + int err, acc_mode; + + if (flags & ~(O_ACCMODE | O_LARGEFILE)) + BUG(); + + switch (flags & O_ACCMODE) { + case O_RDONLY: + acc_mode = MAY_READ; + break; + case O_WRONLY: + acc_mode = MAY_WRITE; + break; + default: + BUG(); + } + + err = inode_permission(inode, acc_mode | MAY_OPEN); + if (err) + return ERR_PTR(err); + + /* O_NOATIME is an optimization, don't fail if not permitted */ + if (inode_owner_or_capable(inode)) + flags |= O_NOATIME; + + return dentry_open(path, flags, current_cred()); }
/* Caller should hold ovl_inode->lock */
From: Miklos Szeredi mszeredi@redhat.com
mainline inclusion from mainline-v5.8-rc1 commit 292f902a40c11f043a5ca1305a114da0e523eaa3 category: bugfix bugzilla: NA CVE: CVE-2020-16120
--------------------------------
Verify LSM permissions for underlying file, since vfs_ioctl() doesn't do it.
[Stephen Rothwell] export security_file_ioctl
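The ordering introduced here (run the security hook first, perform the ioctl only if the hook returns 0) can be sketched in userspace as below. All demo_* names are hypothetical; the hook functions stand in for security_file_ioctl() and demo_do_ioctl() stands in for vfs_ioctl().

```c
#include <assert.h>
#include <errno.h>

typedef int (*demo_hook_t)(unsigned int cmd);

/* Counts how many times the operation itself actually ran. */
static int demo_calls;

static int demo_do_ioctl(unsigned int cmd)
{
	(void)cmd;
	demo_calls++;
	return 0;
}

/* Two example policies: one permits everything, one denies everything. */
static int demo_allow(unsigned int cmd)
{
	(void)cmd;
	return 0;
}

static int demo_deny(unsigned int cmd)
{
	(void)cmd;
	return -EACCES;
}

/* Ask the hook first; a non-zero result is returned without doing the ioctl. */
static int demo_checked_ioctl(demo_hook_t hook, unsigned int cmd)
{
	int ret = hook(cmd);

	if (!ret)
		ret = demo_do_ioctl(cmd);
	return ret;
}
```

A denied request never reaches the underlying operation, which is the property the patch adds for ioctls forwarded to the real file.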
Signed-off-by: Miklos Szeredi mszeredi@redhat.com Conflicts: fs/overlayfs/file.c [yyl: adjust context] Signed-off-by: Yang Yingliang yangyingliang@huawei.com Reviewed-by: zhangyi (F) yi.zhang@huawei.com Signed-off-by: Yang Yingliang yangyingliang@huawei.com --- fs/overlayfs/file.c | 5 ++++- security/security.c | 1 + 2 files changed, 5 insertions(+), 1 deletion(-)
diff --git a/fs/overlayfs/file.c b/fs/overlayfs/file.c index 73c3e2c21edb0..b4ab06c3f3c67 100644 --- a/fs/overlayfs/file.c +++ b/fs/overlayfs/file.c @@ -12,6 +12,7 @@ #include <linux/xattr.h> #include <linux/uio.h> #include <linux/uaccess.h> +#include <linux/security.h> #include "overlayfs.h"
static char ovl_whatisit(struct inode *inode, struct inode *realinode) @@ -403,7 +404,9 @@ static long ovl_real_ioctl(struct file *file, unsigned int cmd, return ret;
old_cred = ovl_override_creds(file_inode(file)->i_sb); - ret = vfs_ioctl(real.file, cmd, arg); + ret = security_file_ioctl(real.file, cmd, arg); + if (!ret) + ret = vfs_ioctl(real.file, cmd, arg); revert_creds(old_cred);
fdput(real); diff --git a/security/security.c b/security/security.c index 17e2bed11bf71..b4f8c09568824 100644 --- a/security/security.c +++ b/security/security.c @@ -877,6 +877,7 @@ int security_file_ioctl(struct file *file, unsigned int cmd, unsigned long arg) { return call_int_hook(file_ioctl, 0, file, cmd, arg); } +EXPORT_SYMBOL_GPL(security_file_ioctl);
static inline unsigned long mmap_prot(struct file *file, unsigned long prot) {
From: Miklos Szeredi mszeredi@redhat.com
mainline inclusion from mainline-v5.8-rc1 commit 05acefb4872dae89e772729efb194af754c877e8 category: bugfix bugzilla: NA CVE: CVE-2020-16120
--------------------------------
Call inode_permission() on real inode before opening regular file on one of the underlying layers.
In some cases ovl_permission() already checks access to an underlying file, but it misses the metacopy case, and possibly other ones as well.
Removing the redundant permission check from ovl_permission() should be considered later.
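How the permission mask for the real inode is derived from the open flags can be sketched in userspace as follows. The DEMO_* constants are illustrative values for this sketch, and demo_realfile_mask() is a hypothetical helper whose lookup table plays the role of ACC_MODE() in the diff.

```c
#include <assert.h>

/* Illustrative values chosen for this sketch. */
#define DEMO_O_RDONLY    0x0
#define DEMO_O_WRONLY    0x1
#define DEMO_O_RDWR      0x2
#define DEMO_O_ACCMODE   0x3
#define DEMO_O_APPEND    0x400

#define DEMO_MAY_WRITE   0x02
#define DEMO_MAY_READ    0x04
#define DEMO_MAY_APPEND  0x08
#define DEMO_MAY_OPEN    0x20

/* Build the mask that is checked against the real inode before opening. */
static int demo_realfile_mask(int flags)
{
	/* Indexed by access mode; index 3 is treated as read and write. */
	static const int acc_mode[4] = {
		DEMO_MAY_READ,
		DEMO_MAY_WRITE,
		DEMO_MAY_READ | DEMO_MAY_WRITE,
		DEMO_MAY_READ | DEMO_MAY_WRITE,
	};
	int mask = acc_mode[flags & DEMO_O_ACCMODE];

	if (flags & DEMO_O_APPEND)
		mask |= DEMO_MAY_APPEND;

	return mask | DEMO_MAY_OPEN;
}
```

Including the append bit matters for append-only inodes, where a plain write check and an append check can give different answers.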
Signed-off-by: Miklos Szeredi mszeredi@redhat.com Conflicts: fs/overlayfs/file.c [yyl: adjust context] Signed-off-by: Yang Yingliang yangyingliang@huawei.com Reviewed-by: zhangyi (F) yi.zhang@huawei.com Signed-off-by: Yang Yingliang yangyingliang@huawei.com --- fs/overlayfs/file.c | 16 ++++++++++++++-- 1 file changed, 14 insertions(+), 2 deletions(-)
diff --git a/fs/overlayfs/file.c b/fs/overlayfs/file.c index b4ab06c3f3c67..f8e733c29cc47 100644 --- a/fs/overlayfs/file.c +++ b/fs/overlayfs/file.c @@ -35,10 +35,22 @@ static struct file *ovl_open_realfile(const struct file *file, struct file *realfile; const struct cred *old_cred; int flags = file->f_flags | OVL_OPEN_FLAGS; + int acc_mode = ACC_MODE(flags); + int err; + + if (flags & O_APPEND) + acc_mode |= MAY_APPEND;
old_cred = ovl_override_creds(inode->i_sb); - realfile = open_with_fake_path(&file->f_path, flags, realinode, - current_cred()); + err = inode_permission(realinode, MAY_OPEN | acc_mode); + if (err) { + realfile = ERR_PTR(err); + } else if (!inode_owner_or_capable(realinode)) { + realfile = ERR_PTR(-EPERM); + } else { + realfile = open_with_fake_path(&file->f_path, flags, realinode, + current_cred()); + } revert_creds(old_cred);
pr_debug("open(%p[%pD2/%c], 0%o) -> (%p, 0%o)\n",
From: Miklos Szeredi mszeredi@redhat.com
mainline inclusion from mainline-v5.11-rc1 commit b6650dab404c701d7fe08a108b746542a934da84 category: bugfix bugzilla: NA CVE: CVE-2020-16120
--------------------------------
In case the file cannot be opened with O_NOATIME because of lack of capabilities, then clear O_NOATIME instead of failing.
Remove WARN_ON(), since it would now trigger if O_NOATIME was cleared. Noticed by Amir Goldstein.
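The behaviour change can be sketched in userspace as below: the flag is treated as an optimization and is dropped when the caller is not allowed to use it, and the open proceeds. DEMO_O_NOATIME and DEMO_O_LARGEFILE are illustrative values, and owner_or_capable is a plain parameter standing in for the result of inode_owner_or_capable().

```c
#include <assert.h>

/* Illustrative values chosen for this sketch. */
#define DEMO_O_LARGEFILE  0x100
#define DEMO_O_NOATIME    0x200

/*
 * Return the flags actually used for the open. Without ownership or
 * capability the O_NOATIME bit is cleared and nothing else changes.
 */
static int demo_effective_flags(int flags, int owner_or_capable)
{
	if (!owner_or_capable)
		flags &= ~DEMO_O_NOATIME;
	return flags;
}
```

Because the effective flags can now differ from the requested ones in this single bit, a consistency check that compares the two would fire spuriously, which is the reason the WARN_ON() is removed.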
Signed-off-by: Miklos Szeredi mszeredi@redhat.com Signed-off-by: Yang Yingliang yangyingliang@huawei.com Reviewed-by: zhangyi (F) yi.zhang@huawei.com Signed-off-by: Yang Yingliang yangyingliang@huawei.com --- fs/overlayfs/file.c | 11 +++-------- 1 file changed, 3 insertions(+), 8 deletions(-)
diff --git a/fs/overlayfs/file.c b/fs/overlayfs/file.c index f8e733c29cc47..470ea215bebca 100644 --- a/fs/overlayfs/file.c +++ b/fs/overlayfs/file.c @@ -45,9 +45,10 @@ static struct file *ovl_open_realfile(const struct file *file, err = inode_permission(realinode, MAY_OPEN | acc_mode); if (err) { realfile = ERR_PTR(err); - } else if (!inode_owner_or_capable(realinode)) { - realfile = ERR_PTR(-EPERM); } else { + if (!inode_owner_or_capable(realinode)) + flags &= ~O_NOATIME; + realfile = open_with_fake_path(&file->f_path, flags, realinode, current_cred()); } @@ -67,12 +68,6 @@ static int ovl_change_flags(struct file *file, unsigned int flags) struct inode *inode = file_inode(file); int err;
- flags |= OVL_OPEN_FLAGS; - - /* If some flag changed that cannot be changed then something's amiss */ - if (WARN_ON((file->f_flags ^ flags) & ~OVL_SETFL_MASK)) - return -EIO; - flags &= OVL_SETFL_MASK;
if (((flags ^ file->f_flags) & O_APPEND) && IS_APPEND(inode))
From: liubo liubo254@huawei.com
euleros inclusion category: feature feature: etmem bugzilla: 49889
-------------------------------------------------
etmem is a vertical memory expansion technology that combines DRAM with new high-performance storage media to form multi-level memory storage. It grades the data held in memory and migrates the data classified as cold from the memory medium to the high-performance storage medium, which expands memory capacity and reduces memory cost.
The etmem feature is mainly composed of two parts: etmem_scan and etmem_swap.
This patch is mainly used to generate etmem_scan.ko. etmem_scan.ko is used to scan the virtual address of the target process and return the address access information to the user mode for grading cold and hot pages.
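The byte stream returned to user mode can be sketched in userspace as follows. The byte order matches u64_to_u8() in the diff (most significant byte first). The nibble layout of one entry (page type in the high four bits, run length in the low four bits) is an assumption, because etmem_scan.h is not shown here; the DEMO_PIP_* macros and demo_* functions are hypothetical names modelled on PIP_TYPE, PIP_SIZE and PIP_COMPOSE.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed entry layout: high nibble = page type, low nibble = run length. */
#define DEMO_PIP_TYPE(b)        (((b) >> 4) & 0xf)
#define DEMO_PIP_SIZE(b)        ((b) & 0xf)
#define DEMO_PIP_COMPOSE(t, n)  ((uint8_t)(((t) << 4) | (n)))

/* Store n into p[0..7], most significant byte first. */
static void demo_u64_to_u8(uint64_t n, uint8_t *p)
{
	int i;

	for (i = 7; i >= 0; i--) {
		p[i] = (uint8_t)n;
		n >>= 8;
	}
}

/*
 * Try to account one more page in the previous entry. Returns 1 if the
 * entry was extended, 0 if the caller has to start a new entry because
 * the type differs or the 4-bit run length is already at its maximum.
 */
static int demo_extend_run(uint8_t *last, unsigned int type)
{
	assert(last != 0);

	if (DEMO_PIP_TYPE(*last) == type && DEMO_PIP_SIZE(*last) < 0xf) {
		(*last)++;
		return 1;
	}
	return 0;
}
```

Under this assumed layout, a run of consecutive pages with the same type costs one byte per 15 pages, and an 8-byte address is only emitted when the scan position jumps.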
Signed-off-by: Fengguang Wu fengguang.wu@intel.com Signed-off-by: yanxiaodan yanxiaodan@huawei.com Signed-off-by: Feilong Lin linfeilong@huawei.com Signed-off-by: geruijun geruijun@huawei.com Signed-off-by: liubo liubo254@huawei.com Acked-by: Xie XiuQi xiexiuqi@huawei.com Reviewed-by: Jing Xiangfeng jingxiangfeng@huawei.com Signed-off-by: Yang Yingliang yangyingliang@huawei.com --- fs/proc/Makefile | 1 + fs/proc/base.c | 2 + fs/proc/etmem_scan.c | 1046 ++++++++++++++++++++++++++++++++++++++ fs/proc/etmem_scan.h | 132 +++++ fs/proc/internal.h | 1 + fs/proc/task_mmu.c | 66 +++ include/linux/mm_types.h | 18 + lib/Kconfig | 6 + mm/pagewalk.c | 1 + virt/kvm/kvm_main.c | 6 + 10 files changed, 1279 insertions(+) create mode 100644 fs/proc/etmem_scan.c create mode 100644 fs/proc/etmem_scan.h
diff --git a/fs/proc/Makefile b/fs/proc/Makefile index ead487e805108..c1ebd017a83bc 100644 --- a/fs/proc/Makefile +++ b/fs/proc/Makefile @@ -33,3 +33,4 @@ proc-$(CONFIG_PROC_KCORE) += kcore.o proc-$(CONFIG_PROC_VMCORE) += vmcore.o proc-$(CONFIG_PRINTK) += kmsg.o proc-$(CONFIG_PROC_PAGE_MONITOR) += page.o +obj-$(CONFIG_ETMEM_SCAN) += etmem_scan.o diff --git a/fs/proc/base.c b/fs/proc/base.c index b78875fa78f4e..9ea434f9da4ee 100644 --- a/fs/proc/base.c +++ b/fs/proc/base.c @@ -2989,6 +2989,7 @@ static const struct pid_entry tgid_base_stuff[] = { REG("smaps", S_IRUGO, proc_pid_smaps_operations), REG("smaps_rollup", S_IRUGO, proc_pid_smaps_rollup_operations), REG("pagemap", S_IRUSR, proc_pagemap_operations), + REG("idle_pages", S_IRUSR|S_IWUSR, proc_mm_idle_operations), #endif #ifdef CONFIG_SECURITY DIR("attr", S_IRUGO|S_IXUGO, proc_attr_dir_inode_operations, proc_attr_dir_operations), @@ -3377,6 +3378,7 @@ static const struct pid_entry tid_base_stuff[] = { REG("smaps", S_IRUGO, proc_pid_smaps_operations), REG("smaps_rollup", S_IRUGO, proc_pid_smaps_rollup_operations), REG("pagemap", S_IRUSR, proc_pagemap_operations), + REG("idle_pages", S_IRUSR|S_IWUSR, proc_mm_idle_operations), #endif #ifdef CONFIG_SECURITY DIR("attr", S_IRUGO|S_IXUGO, proc_attr_dir_inode_operations, proc_attr_dir_operations), diff --git a/fs/proc/etmem_scan.c b/fs/proc/etmem_scan.c new file mode 100644 index 0000000000000..94bf125de6072 --- /dev/null +++ b/fs/proc/etmem_scan.c @@ -0,0 +1,1046 @@ +// SPDX-License-Identifier: GPL-2.0 +#include <linux/pagemap.h> +#include <linux/mm.h> +#include <linux/hugetlb.h> +#include <linux/kernel.h> +#include <linux/sched.h> +#include <linux/proc_fs.h> +#include <linux/uaccess.h> +#include <linux/kvm.h> +#include <linux/kvm_host.h> +#include <linux/bitmap.h> +#include <linux/sched/mm.h> +#include <linux/version.h> +#include <linux/module.h> +#include <linux/io.h> +#include <linux/uaccess.h> +#include <asm/cacheflush.h> +#include <asm/page.h> +#include 
<asm/pgalloc.h> +#include <asm/tlb.h> +#include <asm/pgtable.h> +#ifdef CONFIG_ARM64 +#include <asm/pgtable-types.h> +#include <asm/memory.h> +#include <asm/kvm_mmu.h> +#include <asm/kvm_arm.h> +#include <asm/stage2_pgtable.h> +#endif +#include "etmem_scan.h" + +#ifdef CONFIG_X86_64 +/* + * Fallback to false for kernel doens't support KVM_INVALID_SPTE + * ept_idle can sitll work in this situation but the scan accuracy may drop, + * depends on the access frequences of the workload. + */ +#ifdef KVM_INVALID_SPTE +#define KVM_CHECK_INVALID_SPTE(val) ((val) == KVM_INVALID_SPTE) +#else +#define KVM_CHECK_INVALID_SPTE(val) (0) +#endif + +# define kvm_arch_mmu_pointer(vcpu) (&(vcpu->arch.mmu)) +# define kvm_mmu_ad_disabled(mmu) (mmu->base_role.ad_disabled) +#endif /*CONFIG_X86_64*/ + +#ifdef CONFIG_ARM64 +#define if_pmd_thp_or_huge(pmd) (if_pmd_huge(pmd) || pmd_trans_huge(pmd)) +#endif /* CONFIG_ARM64 */ + +#ifdef DEBUG + +#define debug_printk trace_printk + +#define set_restart_gpa(val, note) ({ \ + unsigned long old_val = pic->restart_gpa; \ + pic->restart_gpa = (val); \ + trace_printk("restart_gpa=%lx %luK %s %s %d\n", \ + (val), (pic->restart_gpa - old_val) >> 10, \ + note, __func__, __LINE__); \ +}) + +#define set_next_hva(val, note) ({ \ + unsigned long old_val = pic->next_hva; \ + pic->next_hva = (val); \ + trace_printk(" next_hva=%lx %luK %s %s %d\n", \ + (val), (pic->next_hva - old_val) >> 10, \ + note, __func__, __LINE__); \ +}) + +#else + +#define debug_printk(...) 
+ +#define set_restart_gpa(val, note) ({ \ + pic->restart_gpa = (val); \ +}) + +#define set_next_hva(val, note) ({ \ + pic->next_hva = (val); \ +}) + +#endif + +static unsigned long pagetype_size[16] = { + [PTE_ACCESSED] = PAGE_SIZE, /* 4k page */ + [PMD_ACCESSED] = PMD_SIZE, /* 2M page */ + [PUD_PRESENT] = PUD_SIZE, /* 1G page */ + + [PTE_DIRTY_M] = PAGE_SIZE, + [PMD_DIRTY_M] = PMD_SIZE, + + [PTE_IDLE] = PAGE_SIZE, + [PMD_IDLE] = PMD_SIZE, + [PMD_IDLE_PTES] = PMD_SIZE, + + [PTE_HOLE] = PAGE_SIZE, + [PMD_HOLE] = PMD_SIZE, +}; + +static void u64_to_u8(uint64_t n, uint8_t *p) +{ + p += sizeof(uint64_t) - 1; + + *p-- = n; n >>= 8; + *p-- = n; n >>= 8; + *p-- = n; n >>= 8; + *p-- = n; n >>= 8; + + *p-- = n; n >>= 8; + *p-- = n; n >>= 8; + *p-- = n; n >>= 8; + *p = n; +} + +static void dump_pic(struct page_idle_ctrl *pic) +{ + debug_printk("page_idle_ctrl: pie_read=%d pie_read_max=%d", + pic->pie_read, + pic->pie_read_max); + debug_printk(" buf_size=%d bytes_copied=%d next_hva=%pK", + pic->buf_size, + pic->bytes_copied, + pic->next_hva); + debug_printk(" restart_gpa=%pK pa_to_hva=%pK\n", + pic->restart_gpa, + pic->gpa_to_hva); +} + +#ifdef CONFIG_ARM64 +static int if_pmd_huge(pmd_t pmd) +{ + return pmd_val(pmd) && !(pmd_val(pmd) & PMD_TABLE_BIT); +} + +static int if_pud_huge(pud_t pud) +{ +#ifndef __PAGETABLE_PMD_FOLDED + return pud_val(pud) && !(pud_val(pud) & PUD_TABLE_BIT); +#else + return 0; +#endif +} + +static inline bool if_stage2_pud_huge(struct kvm *kvm, pud_t pud) +{ + if (kvm_stage2_has_pmd(kvm)) + return if_pud_huge(pud); + else + return 0; +} +#endif + +static void pic_report_addr(struct page_idle_ctrl *pic, unsigned long addr) +{ + unsigned long hva; + + pic->kpie[pic->pie_read++] = PIP_CMD_SET_HVA; + hva = addr; + u64_to_u8(hva, &pic->kpie[pic->pie_read]); + pic->pie_read += sizeof(uint64_t); + dump_pic(pic); +} + +static int pic_add_page(struct page_idle_ctrl *pic, + unsigned long addr, + unsigned long next, + enum ProcIdlePageType page_type) +{ + 
unsigned long page_size = pagetype_size[page_type]; + + dump_pic(pic); + + /* align kernel/user vision of cursor position */ + next = round_up(next, page_size); + + if (!pic->pie_read || + addr + pic->gpa_to_hva != pic->next_hva) { + /* merge hole */ + if (page_type == PTE_HOLE || + page_type == PMD_HOLE) { + set_restart_gpa(next, "PTE_HOLE|PMD_HOLE"); + return 0; + } + + if (addr + pic->gpa_to_hva < pic->next_hva) { + debug_printk("page_idle: addr moves backwards\n"); + WARN_ONCE(1, "page_idle: addr moves backwards"); + } + + if (pic->pie_read + sizeof(uint64_t) + 2 >= pic->pie_read_max) { + set_restart_gpa(addr, "PAGE_IDLE_KBUF_FULL"); + return PAGE_IDLE_KBUF_FULL; + } + + pic_report_addr(pic, round_down(addr, page_size) + + pic->gpa_to_hva); + } else { + if (PIP_TYPE(pic->kpie[pic->pie_read - 1]) == page_type && + PIP_SIZE(pic->kpie[pic->pie_read - 1]) < 0xF) { + set_next_hva(next + pic->gpa_to_hva, "IN-PLACE INC"); + set_restart_gpa(next, "IN-PLACE INC"); + pic->kpie[pic->pie_read - 1]++; + WARN_ONCE(page_size < next-addr, "next-addr too large"); + return 0; + } + if (pic->pie_read >= pic->pie_read_max) { + set_restart_gpa(addr, "PAGE_IDLE_KBUF_FULL"); + return PAGE_IDLE_KBUF_FULL; + } + } + + set_next_hva(next + pic->gpa_to_hva, "NEW-ITEM"); + set_restart_gpa(next, "NEW-ITEM"); + pic->kpie[pic->pie_read] = PIP_COMPOSE(page_type, 1); + pic->pie_read++; + + return 0; +} + +static int init_page_idle_ctrl_buffer(struct page_idle_ctrl *pic) +{ + pic->pie_read = 0; + pic->pie_read_max = min(PAGE_IDLE_KBUF_SIZE, + pic->buf_size - pic->bytes_copied); + /* reserve space for PIP_CMD_SET_HVA in the end */ + pic->pie_read_max -= sizeof(uint64_t) + 1; + + /* + * Align with PAGE_IDLE_KBUF_FULL + * logic in pic_add_page(), to avoid pic->pie_read = 0 when + * PAGE_IDLE_KBUF_FULL happened. 
+ */ + if (pic->pie_read_max <= sizeof(uint64_t) + 2) + return PAGE_IDLE_KBUF_FULL; + + memset(pic->kpie, 0, sizeof(pic->kpie)); + return 0; +} + +static void setup_page_idle_ctrl(struct page_idle_ctrl *pic, void *buf, + int buf_size, unsigned int flags) +{ + pic->buf = buf; + pic->buf_size = buf_size; + pic->bytes_copied = 0; + pic->next_hva = 0; + pic->gpa_to_hva = 0; + pic->restart_gpa = 0; + pic->last_va = 0; + pic->flags = flags; +} + +static int page_idle_copy_user(struct page_idle_ctrl *pic, + unsigned long start, unsigned long end) +{ + int bytes_read; + int lc = 0; /* last copy? */ + int ret; + + dump_pic(pic); + + /* Break out of loop on no more progress. */ + if (!pic->pie_read) { + lc = 1; + if (start < end) + start = end; + } + + if (start >= end && start > pic->next_hva) { + set_next_hva(start, "TAIL-HOLE"); + pic_report_addr(pic, start); + } + + bytes_read = pic->pie_read; + if (!bytes_read) + return 1; + + ret = copy_to_user(pic->buf, pic->kpie, bytes_read); + if (ret) + return -EFAULT; + + pic->buf += bytes_read; + pic->bytes_copied += bytes_read; + if (pic->bytes_copied >= pic->buf_size) + return PAGE_IDLE_BUF_FULL; + if (lc) + return lc; + + ret = init_page_idle_ctrl_buffer(pic); + if (ret) + return ret; + + cond_resched(); + return 0; +} + +#ifdef CONFIG_X86_64 +static int ept_pte_range(struct page_idle_ctrl *pic, + pmd_t *pmd, unsigned long addr, unsigned long end) +{ + pte_t *pte; + enum ProcIdlePageType page_type; + int err = 0; + + pte = pte_offset_kernel(pmd, addr); + do { + if (KVM_CHECK_INVALID_SPTE(pte->pte)) { + page_type = PTE_IDLE; + } else if (!ept_pte_present(*pte)) + page_type = PTE_HOLE; + else if (!test_and_clear_bit(_PAGE_BIT_EPT_ACCESSED, + (unsigned long *) &pte->pte)) + page_type = PTE_IDLE; + else { + page_type = PTE_ACCESSED; + if (pic->flags & SCAN_DIRTY_PAGE) { + if (test_and_clear_bit(_PAGE_BIT_EPT_DIRTY, + (unsigned long *) &pte->pte)) + page_type = PTE_DIRTY_M; + } + } + + err = pic_add_page(pic, addr, addr + 
PAGE_SIZE, page_type); + if (err) + break; + } while (pte++, addr += PAGE_SIZE, addr != end); + + return err; +} + + +static int ept_pmd_range(struct page_idle_ctrl *pic, + pud_t *pud, unsigned long addr, unsigned long end) +{ + pmd_t *pmd; + unsigned long next; + enum ProcIdlePageType page_type; + enum ProcIdlePageType pte_page_type; + int err = 0; + + if (pic->flags & SCAN_HUGE_PAGE) + pte_page_type = PMD_IDLE_PTES; + else + pte_page_type = IDLE_PAGE_TYPE_MAX; + + pmd = pmd_offset(pud, addr); + do { + next = pmd_addr_end(addr, end); + if (KVM_CHECK_INVALID_SPTE(pmd->pmd)) + page_type = PMD_IDLE; + else if (!ept_pmd_present(*pmd)) + page_type = PMD_HOLE; /* likely won't hit here */ + else if (!pmd_large(*pmd)) + page_type = pte_page_type; + else if (!test_and_clear_bit(_PAGE_BIT_EPT_ACCESSED, + (unsigned long *)pmd)) + page_type = PMD_IDLE; + else { + page_type = PMD_ACCESSED; + if ((pic->flags & SCAN_DIRTY_PAGE) && + test_and_clear_bit(_PAGE_BIT_EPT_DIRTY, + (unsigned long *) pmd)) + page_type = PMD_DIRTY_M; + } + + if (page_type != IDLE_PAGE_TYPE_MAX) + err = pic_add_page(pic, addr, next, page_type); + else + err = ept_pte_range(pic, pmd, addr, next); + if (err) + break; + } while (pmd++, addr = next, addr != end); + + return err; +} + + +static int ept_pud_range(struct page_idle_ctrl *pic, + p4d_t *p4d, unsigned long addr, unsigned long end) +{ + pud_t *pud; + unsigned long next; + int err = 0; + + pud = pud_offset(p4d, addr); + do { + next = pud_addr_end(addr, end); + + if (!ept_pud_present(*pud)) { + set_restart_gpa(next, "PUD_HOLE"); + continue; + } + + if (pud_large(*pud)) + err = pic_add_page(pic, addr, next, PUD_PRESENT); + else + err = ept_pmd_range(pic, pud, addr, next); + + if (err) + break; + } while (pud++, addr = next, addr != end); + + return err; +} + +static int ept_p4d_range(struct page_idle_ctrl *pic, + pgd_t *pgd, unsigned long addr, unsigned long end) +{ + p4d_t *p4d; + unsigned long next; + int err = 0; + + p4d = p4d_offset(pgd, addr); + do 
{ + next = p4d_addr_end(addr, end); + if (!ept_p4d_present(*p4d)) { + set_restart_gpa(next, "P4D_HOLE"); + continue; + } + + err = ept_pud_range(pic, p4d, addr, next); + if (err) + break; + } while (p4d++, addr = next, addr != end); + + return err; +} + + +static int ept_page_range(struct page_idle_ctrl *pic, + unsigned long addr, + unsigned long end) +{ + struct kvm_vcpu *vcpu; + struct kvm_mmu *mmu; + pgd_t *ept_root; + pgd_t *pgd; + unsigned long next; + int err = 0; + + WARN_ON(addr >= end); + + spin_lock(&pic->kvm->mmu_lock); + + vcpu = kvm_get_vcpu(pic->kvm, 0); + if (!vcpu) { + spin_unlock(&pic->kvm->mmu_lock); + return -EINVAL; + } + + mmu = kvm_arch_mmu_pointer(vcpu); + if (!VALID_PAGE(mmu->root_hpa)) { + spin_unlock(&pic->kvm->mmu_lock); + return -EINVAL; + } + + ept_root = __va(mmu->root_hpa); + + spin_unlock(&pic->kvm->mmu_lock); + local_irq_disable(); + pgd = pgd_offset_pgd(ept_root, addr); + do { + next = pgd_addr_end(addr, end); + if (!ept_pgd_present(*pgd)) { + set_restart_gpa(next, "PGD_HOLE"); + continue; + } + + err = ept_p4d_range(pic, pgd, addr, next); + if (err) + break; + } while (pgd++, addr = next, addr != end); + local_irq_enable(); + return err; +} + +static int ept_idle_supports_cpu(struct kvm *kvm) +{ + struct kvm_vcpu *vcpu; + struct kvm_mmu *mmu; + int ret; + + vcpu = kvm_get_vcpu(kvm, 0); + if (!vcpu) + return -EINVAL; + + spin_lock(&kvm->mmu_lock); + mmu = kvm_arch_mmu_pointer(vcpu); + if (kvm_mmu_ad_disabled(mmu)) { + printk(KERN_NOTICE "CPU does not support EPT A/D bits tracking\n"); + ret = -EINVAL; + } else if (mmu->shadow_root_level != 4 + (!!pgtable_l5_enabled())) { + printk(KERN_NOTICE "Unsupported EPT level %d\n", mmu->shadow_root_level); + ret = -EINVAL; + } else + ret = 0; + spin_unlock(&kvm->mmu_lock); + + return ret; +} + +#else +static int arm_pte_range(struct page_idle_ctrl *pic, + pmd_t *pmd, unsigned long addr, unsigned long end) +{ + pte_t *pte; + enum ProcIdlePageType page_type; + int err = 0; + + pte = 
pte_offset_kernel(pmd, addr); + do { + if (!pte_present(*pte)) + page_type = PTE_HOLE; + else if (!test_and_clear_bit(_PAGE_MM_BIT_ACCESSED, + (unsigned long *) &pte->pte)) + page_type = PTE_IDLE; + else + page_type = PTE_ACCESSED; + + err = pic_add_page(pic, addr, addr + PAGE_SIZE, page_type); + if (err) + break; + } while (pte++, addr += PAGE_SIZE, addr != end); + + return err; +} + +static int arm_pmd_range(struct page_idle_ctrl *pic, + pud_t *pud, unsigned long addr, unsigned long end) +{ + pmd_t *pmd; + unsigned long next; + struct kvm *kvm = pic->kvm; + enum ProcIdlePageType page_type; + enum ProcIdlePageType pte_page_type; + int err = 0; + + if (pic->flags & SCAN_HUGE_PAGE) + pte_page_type = PMD_IDLE_PTES; + else + pte_page_type = IDLE_PAGE_TYPE_MAX; + + pmd = stage2_pmd_offset(kvm, pud, addr); + do { + next = stage2_pmd_addr_end(kvm, addr, end); + if (!pmd_present(*pmd)) + page_type = PMD_HOLE; + else if (!if_pmd_thp_or_huge(*pmd)) + page_type = pte_page_type; + else if (!test_and_clear_bit(_PAGE_MM_BIT_ACCESSED, + (unsigned long *)pmd)) + page_type = PMD_IDLE; + else + page_type = PMD_ACCESSED; + + if (page_type != IDLE_PAGE_TYPE_MAX) + err = pic_add_page(pic, addr, next, page_type); + else + err = arm_pte_range(pic, pmd, addr, next); + if (err) + break; + } while (pmd++, addr = next, addr != end); + + return err; +} + +static int arm_pud_range(struct page_idle_ctrl *pic, + pgd_t *pgd, unsigned long addr, unsigned long end) +{ + pud_t *pud; + unsigned long next; + struct kvm *kvm = pic->kvm; + int err = 0; + + pud = stage2_pud_offset(kvm, pgd, addr); + do { + next = stage2_pud_addr_end(kvm, addr, end); + if (!stage2_pud_present(kvm, *pud)) { + set_restart_gpa(next, "PUD_HOLE"); + continue; + } + + if (if_stage2_pud_huge(kvm, *pud)) + err = pic_add_page(pic, addr, next, PUD_PRESENT); + else + err = arm_pmd_range(pic, pud, addr, next); + if (err) + break; + } while (pud++, addr = next, addr != end); + + return err; +} + +static int arm_page_range(struct 
page_idle_ctrl *pic, + unsigned long addr, + unsigned long end) +{ + pgd_t *pgd; + unsigned long next; + struct kvm *kvm = pic->kvm; + int err = 0; + + WARN_ON(addr >= end); + + spin_lock(&pic->kvm->mmu_lock); + pgd = kvm->arch.pgd + stage2_pgd_index(kvm, addr); + spin_unlock(&pic->kvm->mmu_lock); + + local_irq_disable(); + do { + next = stage2_pgd_addr_end(kvm, addr, end); + if (!stage2_pgd_present(kvm, *pgd)) { + set_restart_gpa(next, "PGD_HOLE"); + continue; + } + + err = arm_pud_range(pic, pgd, addr, next); + if (err) + break; + } while (pgd++, addr = next, addr != end); + + local_irq_enable(); + return err; +} +#endif + +/* + * Depending on whether hva falls in a memslot: + * + * 1) found => return gpa and remaining memslot size in *addr_range + * + * |<----- addr_range --------->| + * [ mem slot ] + * ^hva + * + * 2) not found => return hole size in *addr_range + * + * |<----- addr_range --------->| + * [first mem slot above hva ] + * ^hva + * + * If hva is above all mem slots, *addr_range will be ~0UL. + * We can finish read(2). 
+ */ +static unsigned long vm_idle_find_gpa(struct page_idle_ctrl *pic, + unsigned long hva, + unsigned long *addr_range) +{ + struct kvm *kvm = pic->kvm; + struct kvm_memslots *slots; + struct kvm_memory_slot *memslot; + unsigned long hva_end; + gfn_t gfn; + + *addr_range = ~0UL; + mutex_lock(&kvm->slots_lock); + slots = kvm_memslots(pic->kvm); + kvm_for_each_memslot(memslot, slots) { + hva_end = memslot->userspace_addr + + (memslot->npages << PAGE_SHIFT); + + if (hva >= memslot->userspace_addr && hva < hva_end) { + gpa_t gpa; + + gfn = hva_to_gfn_memslot(hva, memslot); + *addr_range = hva_end - hva; + gpa = gfn_to_gpa(gfn); + mutex_unlock(&kvm->slots_lock); + return gpa; + } + + if (memslot->userspace_addr > hva) + *addr_range = min(*addr_range, + memslot->userspace_addr - hva); + } + mutex_unlock(&kvm->slots_lock); + return INVALID_PAGE; +} + +static int vm_idle_walk_hva_range(struct page_idle_ctrl *pic, + unsigned long start, unsigned long end) +{ + unsigned long gpa_addr; + unsigned long addr_range; + unsigned long va_end; + int ret; + +#ifdef CONFIG_X86_64 + ret = ept_idle_supports_cpu(pic->kvm); + if (ret) + return ret; +#endif + + ret = init_page_idle_ctrl_buffer(pic); + if (ret) + return ret; + + for (; start < end;) { + gpa_addr = vm_idle_find_gpa(pic, start, &addr_range); + + if (gpa_addr == INVALID_PAGE) { + pic->gpa_to_hva = 0; + if (addr_range == ~0UL) { + set_restart_gpa(TASK_SIZE, "EOF"); + va_end = end; + } else { + start += addr_range; + set_restart_gpa(start, "OUT-OF-SLOT"); + va_end = start; + } + } else { + pic->gpa_to_hva = start - gpa_addr; +#ifdef CONFIG_ARM64 + arm_page_range(pic, gpa_addr, gpa_addr + addr_range); +#else + ept_page_range(pic, gpa_addr, gpa_addr + addr_range); +#endif + va_end = pic->gpa_to_hva + gpa_addr + addr_range; + } + + start = pic->restart_gpa + pic->gpa_to_hva; + ret = page_idle_copy_user(pic, start, va_end); + if (ret) + break; + } + + if (pic->bytes_copied) + ret = 0; + return ret; +} + +static ssize_t 
vm_idle_read(struct file *file, char *buf, + size_t count, loff_t *ppos) +{ + struct mm_struct *mm = file->private_data; + struct page_idle_ctrl *pic; + unsigned long hva_start = *ppos; + unsigned long hva_end = hva_start + (count << (3 + PAGE_SHIFT)); + int ret; + + pic = kzalloc(sizeof(*pic), GFP_KERNEL); + if (!pic) + return -ENOMEM; + + setup_page_idle_ctrl(pic, buf, count, file->f_flags); + pic->kvm = mm_kvm(mm); + + ret = vm_idle_walk_hva_range(pic, hva_start, hva_end); + if (ret) + goto out_kvm; + + ret = pic->bytes_copied; + *ppos = pic->next_hva; +out_kvm: + return ret; + +} + +static ssize_t mm_idle_read(struct file *file, char *buf, + size_t count, loff_t *ppos); + +static ssize_t page_scan_read(struct file *file, char *buf, + size_t count, loff_t *ppos) +{ + struct mm_struct *mm = file->private_data; + unsigned long hva_start = *ppos; + unsigned long hva_end = hva_start + (count << (3 + PAGE_SHIFT)); + + if ((hva_start >= TASK_SIZE) || (hva_end >= TASK_SIZE)) { + debug_printk("page_idle_read past TASK_SIZE: %pK %pK %lx\n", + hva_start, hva_end, TASK_SIZE); + return 0; + } + if (hva_end <= hva_start) { + debug_printk("page_idle_read past EOF: %pK %pK\n", + hva_start, hva_end); + return 0; + } + if (*ppos & (PAGE_SIZE - 1)) { + debug_printk("page_idle_read unaligned ppos: %pK\n", + hva_start); + return -EINVAL; + } + if (count < PAGE_IDLE_BUF_MIN) { + debug_printk("page_idle_read small count: %lx\n", + (unsigned long)count); + return -EINVAL; + } + + if (!mm_kvm(mm)) + return mm_idle_read(file, buf, count, ppos); + + return vm_idle_read(file, buf, count, ppos); +} + +static int page_scan_open(struct inode *inode, struct file *file) +{ + if (!try_module_get(THIS_MODULE)) + return -EBUSY; + + return 0; +} + +static int page_scan_release(struct inode *inode, struct file *file) +{ + struct mm_struct *mm = file->private_data; + struct kvm *kvm; + int ret = 0; + + if (!mm) { + ret = -EBADF; + goto out; + } + + kvm = mm_kvm(mm); + if (!kvm) { + ret = -EINVAL; + 
goto out; + } +#ifdef CONFIG_X86_64 + spin_lock(&kvm->mmu_lock); + kvm_flush_remote_tlbs(kvm); + spin_unlock(&kvm->mmu_lock); +#endif + +out: + module_put(THIS_MODULE); + return ret; +} + +static int mm_idle_pmd_large(pmd_t pmd) +{ +#ifdef CONFIG_ARM64 + return if_pmd_thp_or_huge(pmd); +#else + return pmd_large(pmd); +#endif +} + +static int mm_idle_pte_range(struct page_idle_ctrl *pic, pmd_t *pmd, + unsigned long addr, unsigned long next) +{ + enum ProcIdlePageType page_type; + pte_t *pte; + int err = 0; + + pte = pte_offset_kernel(pmd, addr); + do { + if (!pte_present(*pte)) + page_type = PTE_HOLE; + else if (!test_and_clear_bit(_PAGE_MM_BIT_ACCESSED, + (unsigned long *) &pte->pte)) + page_type = PTE_IDLE; + else { + page_type = PTE_ACCESSED; + } + + err = pic_add_page(pic, addr, addr + PAGE_SIZE, page_type); + if (err) + break; + } while (pte++, addr += PAGE_SIZE, addr != next); + + return err; +} + +static int mm_idle_pmd_entry(pmd_t *pmd, unsigned long addr, + unsigned long next, struct mm_walk *walk) +{ + struct page_idle_ctrl *pic = walk->private; + enum ProcIdlePageType page_type; + enum ProcIdlePageType pte_page_type; + int err; + + /* + * Skip duplicate PMD_IDLE_PTES: when the PMD crosses VMA boundary, + * walk_page_range() can call on the same PMD twice. 
+ */ + if ((addr & PMD_MASK) == (pic->last_va & PMD_MASK)) { + debug_printk("ignore duplicate addr %pK %pK\n", + addr, pic->last_va); + return 0; + } + pic->last_va = addr; + + if (pic->flags & SCAN_HUGE_PAGE) + pte_page_type = PMD_IDLE_PTES; + else + pte_page_type = IDLE_PAGE_TYPE_MAX; + + if (!pmd_present(*pmd)) + page_type = PMD_HOLE; + else if (!mm_idle_pmd_large(*pmd)) + page_type = pte_page_type; + else if (!test_and_clear_bit(_PAGE_MM_BIT_ACCESSED, + (unsigned long *)pmd)) + page_type = PMD_IDLE; + else + page_type = PMD_ACCESSED; + + if (page_type != IDLE_PAGE_TYPE_MAX) + err = pic_add_page(pic, addr, next, page_type); + else + err = mm_idle_pte_range(pic, pmd, addr, next); + + return err; +} + +static int mm_idle_pud_entry(pud_t *pud, unsigned long addr, + unsigned long next, struct mm_walk *walk) +{ + struct page_idle_ctrl *pic = walk->private; + + if ((addr & PUD_MASK) != (pic->last_va & PUD_MASK)) { + pic_add_page(pic, addr, next, PUD_PRESENT); + pic->last_va = addr; + } + return 1; +} + +static int mm_idle_test_walk(unsigned long start, unsigned long end, + struct mm_walk *walk) +{ + struct vm_area_struct *vma = walk->vma; + + if (vma->vm_file) { + if ((vma->vm_flags & (VM_WRITE|VM_MAYSHARE)) == VM_WRITE) + return 0; + return 1; + } + + return 0; +} + +static int mm_idle_walk_range(struct page_idle_ctrl *pic, + unsigned long start, + unsigned long end, + struct mm_walk *walk) +{ + struct vm_area_struct *vma; + int ret = 0; + + ret = init_page_idle_ctrl_buffer(pic); + if (ret) + return ret; + + for (; start < end;) { + down_read(&walk->mm->mmap_sem); + vma = find_vma(walk->mm, start); + if (vma) { + if (end > vma->vm_start) { + local_irq_disable(); + ret = walk_page_range(start, end, walk); + local_irq_enable(); + } else + set_restart_gpa(vma->vm_start, "VMA-HOLE"); + } else + set_restart_gpa(TASK_SIZE, "EOF"); + up_read(&walk->mm->mmap_sem); + + WARN_ONCE(pic->gpa_to_hva, "non-zero gpa_to_hva"); + start = pic->restart_gpa; + ret = 
page_idle_copy_user(pic, start, end); + if (ret) + break; + } + + if (pic->bytes_copied) { + if (ret != PAGE_IDLE_BUF_FULL && pic->next_hva < end) + debug_printk("partial scan: next_hva=%pK end=%pK\n", + pic->next_hva, end); + ret = 0; + } else + WARN_ONCE(1, "nothing read"); + return ret; +} + +static ssize_t mm_idle_read(struct file *file, char *buf, + size_t count, loff_t *ppos) +{ + struct mm_struct *mm = file->private_data; + struct mm_walk mm_walk = {}; + struct page_idle_ctrl *pic; + unsigned long va_start = *ppos; + unsigned long va_end = va_start + (count << (3 + PAGE_SHIFT)); + int ret; + + if (va_end <= va_start) { + debug_printk("%s past EOF: %pK %pK\n", + __func__, va_start, va_end); + return 0; + } + if (*ppos & (PAGE_SIZE - 1)) { + debug_printk("%s unaligned ppos: %pK\n", + __func__, va_start); + return -EINVAL; + } + if (count < PAGE_IDLE_BUF_MIN) { + debug_printk("%s small count: %lx\n", + __func__, (unsigned long)count); + return -EINVAL; + } + + pic = kzalloc(sizeof(*pic), GFP_KERNEL); + if (!pic) + return -ENOMEM; + + setup_page_idle_ctrl(pic, buf, count, file->f_flags); + + mm_walk.mm = mm; + mm_walk.pmd_entry = mm_idle_pmd_entry; + mm_walk.pud_entry = mm_idle_pud_entry; + mm_walk.test_walk = mm_idle_test_walk; + mm_walk.private = pic; + + ret = mm_idle_walk_range(pic, va_start, va_end, &mm_walk); + if (ret) + goto out_free; + + ret = pic->bytes_copied; + *ppos = pic->next_hva; +out_free: + kfree(pic); + return ret; +} + +extern struct file_operations proc_page_scan_operations; + +static int page_scan_entry(void) +{ + proc_page_scan_operations.owner = THIS_MODULE; + proc_page_scan_operations.read = page_scan_read; + proc_page_scan_operations.open = page_scan_open; + proc_page_scan_operations.release = page_scan_release; + return 0; +} + +static void page_scan_exit(void) +{ + memset(&proc_page_scan_operations, 0, + sizeof(proc_page_scan_operations)); +} + +MODULE_LICENSE("GPL"); +module_init(page_scan_entry); +module_exit(page_scan_exit); diff 
--git a/fs/proc/etmem_scan.h b/fs/proc/etmem_scan.h new file mode 100644 index 0000000000000..305739f92eef2 --- /dev/null +++ b/fs/proc/etmem_scan.h @@ -0,0 +1,132 @@ +/* SPDX-License-Identifier: GPL-2.0 */ +#ifndef _PAGE_IDLE_H +#define _PAGE_IDLE_H + +#define SCAN_HUGE_PAGE O_NONBLOCK /* only huge page */ +#define SCAN_SKIM_IDLE O_NOFOLLOW /* stop on PMD_IDLE_PTES */ +#define SCAN_DIRTY_PAGE O_NOATIME /* report pte/pmd dirty bit */ + +enum ProcIdlePageType { + PTE_ACCESSED, /* 4k page */ + PMD_ACCESSED, /* 2M page */ + PUD_PRESENT, /* 1G page */ + + PTE_DIRTY_M, + PMD_DIRTY_M, + + PTE_IDLE, + PMD_IDLE, + PMD_IDLE_PTES, /* all PTE idle */ + + PTE_HOLE, + PMD_HOLE, + + PIP_CMD, + + IDLE_PAGE_TYPE_MAX +}; + +#define PIP_TYPE(a) (0xf & (a >> 4)) +#define PIP_SIZE(a) (0xf & a) +#define PIP_COMPOSE(type, nr) ((type << 4) | nr) + +#define PIP_CMD_SET_HVA PIP_COMPOSE(PIP_CMD, 0) + +#ifndef INVALID_PAGE +#define INVALID_PAGE ~0UL +#endif + +#ifdef CONFIG_ARM64 +#define _PAGE_MM_BIT_ACCESSED 10 +#else +#define _PAGE_MM_BIT_ACCESSED _PAGE_BIT_ACCESSED +#endif + +#ifdef CONFIG_X86_64 +#define _PAGE_BIT_EPT_ACCESSED 8 +#define _PAGE_BIT_EPT_DIRTY 9 +#define _PAGE_EPT_ACCESSED (_AT(pteval_t, 1) << _PAGE_BIT_EPT_ACCESSED) +#define _PAGE_EPT_DIRTY (_AT(pteval_t, 1) << _PAGE_BIT_EPT_DIRTY) + +#define _PAGE_EPT_PRESENT (_AT(pteval_t, 7)) + +static inline int ept_pte_present(pte_t a) +{ + return pte_flags(a) & _PAGE_EPT_PRESENT; +} + +static inline int ept_pmd_present(pmd_t a) +{ + return pmd_flags(a) & _PAGE_EPT_PRESENT; +} + +static inline int ept_pud_present(pud_t a) +{ + return pud_flags(a) & _PAGE_EPT_PRESENT; +} + +static inline int ept_p4d_present(p4d_t a) +{ + return p4d_flags(a) & _PAGE_EPT_PRESENT; +} + +static inline int ept_pgd_present(pgd_t a) +{ + return pgd_flags(a) & _PAGE_EPT_PRESENT; +} + +static inline int ept_pte_accessed(pte_t a) +{ + return pte_flags(a) & _PAGE_EPT_ACCESSED; +} + +static inline int ept_pmd_accessed(pmd_t a) +{ + return pmd_flags(a) & 
_PAGE_EPT_ACCESSED; +} + +static inline int ept_pud_accessed(pud_t a) +{ + return pud_flags(a) & _PAGE_EPT_ACCESSED; +} + +static inline int ept_p4d_accessed(p4d_t a) +{ + return p4d_flags(a) & _PAGE_EPT_ACCESSED; +} + +static inline int ept_pgd_accessed(pgd_t a) +{ + return pgd_flags(a) & _PAGE_EPT_ACCESSED; +} +#endif + +extern struct file_operations proc_page_scan_operations; + +#define PAGE_IDLE_KBUF_FULL 1 +#define PAGE_IDLE_BUF_FULL 2 +#define PAGE_IDLE_BUF_MIN (sizeof(uint64_t) * 2 + 3) + +#define PAGE_IDLE_KBUF_SIZE 8000 + +struct page_idle_ctrl { + struct mm_struct *mm; + struct kvm *kvm; + + uint8_t kpie[PAGE_IDLE_KBUF_SIZE]; + int pie_read; + int pie_read_max; + + void __user *buf; + int buf_size; + int bytes_copied; + + unsigned long next_hva; /* GPA for EPT; VA for PT */ + unsigned long gpa_to_hva; + unsigned long restart_gpa; + unsigned long last_va; + + unsigned int flags; +}; + +#endif diff --git a/fs/proc/internal.h b/fs/proc/internal.h index 4f14906ef16b5..55b4a9b716a99 100644 --- a/fs/proc/internal.h +++ b/fs/proc/internal.h @@ -299,6 +299,7 @@ extern const struct file_operations proc_pid_smaps_operations; extern const struct file_operations proc_pid_smaps_rollup_operations; extern const struct file_operations proc_clear_refs_operations; extern const struct file_operations proc_pagemap_operations; +extern const struct file_operations proc_mm_idle_operations;
extern unsigned long task_vsize(struct mm_struct *); extern unsigned long task_statm(struct mm_struct *, diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c index 657b159229394..ee258adb631ba 100644 --- a/fs/proc/task_mmu.c +++ b/fs/proc/task_mmu.c @@ -1598,6 +1598,72 @@ const struct file_operations proc_pagemap_operations = { .open = pagemap_open, .release = pagemap_release, }; + +/* will be filled when kvm_ept_idle module loads */ +struct file_operations proc_page_scan_operations = { +}; +EXPORT_SYMBOL_GPL(proc_page_scan_operations); + +static ssize_t mm_idle_read(struct file *file, char __user *buf, + size_t count, loff_t *ppos) +{ + struct mm_struct *mm = file->private_data; + int ret = 0; + + if (!mm || !mmget_not_zero(mm)) { + ret = -ESRCH; + return ret; + } + if (proc_page_scan_operations.read) + ret = proc_page_scan_operations.read(file, buf, count, ppos); + + mmput(mm); + return ret; +} + +static int mm_idle_open(struct inode *inode, struct file *file) +{ + struct mm_struct *mm = NULL; + + if (!file_ns_capable(file, &init_user_ns, CAP_SYS_ADMIN)) + return -EPERM; + + mm = proc_mem_open(inode, PTRACE_MODE_READ); + if (IS_ERR(mm)) + return PTR_ERR(mm); + + file->private_data = mm; + + if (proc_page_scan_operations.open) + return proc_page_scan_operations.open(inode, file); + + return 0; +} + +static int mm_idle_release(struct inode *inode, struct file *file) +{ + struct mm_struct *mm = file->private_data; + + if (mm) { + if (!mm_kvm(mm)) + flush_tlb_mm(mm); + mmdrop(mm); + } + + if (proc_page_scan_operations.release) + return proc_page_scan_operations.release(inode, file); + + return 0; +} + +const struct file_operations proc_mm_idle_operations = { + .llseek = mem_lseek, /* borrow this */ + .read = mm_idle_read, + .open = mm_idle_open, + .release = mm_idle_release, +}; + + #endif /* CONFIG_PROC_PAGE_MONITOR */
#ifdef CONFIG_NUMA diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h index 4c811612a3a1b..51a85ba5ac915 100644 --- a/include/linux/mm_types.h +++ b/include/linux/mm_types.h @@ -28,6 +28,7 @@ typedef int vm_fault_t; struct address_space; struct mem_cgroup; struct hmm; +struct kvm;
/* * Each physical page in the system has a struct page associated with @@ -518,7 +519,12 @@ struct mm_struct { #endif } __randomize_layout;
+#if IS_ENABLED(CONFIG_KVM) && !defined(__GENKSYMS__) + struct kvm *kvm; +#else KABI_RESERVE(1) +#endif + KABI_RESERVE(2) KABI_RESERVE(3) KABI_RESERVE(4) @@ -536,6 +542,18 @@ struct mm_struct {
extern struct mm_struct init_mm;
+#if IS_ENABLED(CONFIG_KVM) +static inline struct kvm *mm_kvm(struct mm_struct *mm) +{ + return mm->kvm; +} +#else +static inline struct kvm *mm_kvm(struct mm_struct *mm) +{ + return NULL; +} +#endif + /* Pointer magic because the dynamic array size confuses some compilers. */ static inline void mm_init_cpumask(struct mm_struct *mm) { diff --git a/lib/Kconfig b/lib/Kconfig index a3928d4438b50..f332b0a05db26 100644 --- a/lib/Kconfig +++ b/lib/Kconfig @@ -599,6 +599,12 @@ config PARMAN config PRIME_NUMBERS tristate
+config ETMEM_SCAN + tristate "module: etmem page scan for etmem support" + help + etmem page scan feature + used to scan the virtual address of the target process + config STRING_SELFTEST tristate "Test string functions"
diff --git a/mm/pagewalk.c b/mm/pagewalk.c index 9dd747151f031..0c0aeb878d426 100644 --- a/mm/pagewalk.c +++ b/mm/pagewalk.c @@ -339,6 +339,7 @@ int walk_page_range(unsigned long start, unsigned long end, } while (start = next, start < end); return err; } +EXPORT_SYMBOL_GPL(walk_page_range);
int walk_page_vma(struct vm_area_struct *vma, struct mm_walk *walk) { diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c index 012627990b437..bd0147186c25d 100644 --- a/virt/kvm/kvm_main.c +++ b/virt/kvm/kvm_main.c @@ -792,6 +792,9 @@ static void kvm_destroy_vm(struct kvm *kvm) struct mm_struct *mm = kvm->mm;
kvm_uevent_notify_change(KVM_EVENT_DESTROY_VM, kvm); +#if IS_ENABLED(CONFIG_KVM) + mm->kvm = NULL; +#endif kvm_destroy_vm_debugfs(kvm); kvm_arch_sync_events(kvm); mutex_lock(&kvm_lock); @@ -3602,6 +3605,9 @@ static int kvm_dev_ioctl_create_vm(unsigned long type) fput(file); return -ENOMEM; } +#if IS_ENABLED(CONFIG_KVM) + kvm->mm->kvm = kvm; +#endif kvm_uevent_notify_change(KVM_EVENT_CREATE_VM, kvm);
fd_install(r, file);
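The PIP_* macros in etmem_scan.h above pack a page type and a count into one byte. The following is a minimal user-space sketch of that nibble layout, not part of the patch: the enum and macros mirror the header (with macro arguments parenthesized for safety), while pip_type_name() and pip_describe() are hypothetical helpers introduced only for illustration.

```c
#include <stdint.h>
#include <stdio.h>

/* Mirrors enum ProcIdlePageType from etmem_scan.h in this patch. */
enum ProcIdlePageType {
	PTE_ACCESSED, PMD_ACCESSED, PUD_PRESENT,
	PTE_DIRTY_M, PMD_DIRTY_M,
	PTE_IDLE, PMD_IDLE, PMD_IDLE_PTES,
	PTE_HOLE, PMD_HOLE,
	PIP_CMD,
	IDLE_PAGE_TYPE_MAX
};

/* Same layout as the header: high nibble = type, low nibble = count. */
#define PIP_TYPE(a)		(0xf & ((a) >> 4))
#define PIP_SIZE(a)		(0xf & (a))
#define PIP_COMPOSE(type, nr)	(((type) << 4) | (nr))

/* Hypothetical helper: map a type value to its enum name. */
const char *pip_type_name(unsigned int type)
{
	static const char *const names[IDLE_PAGE_TYPE_MAX] = {
		"PTE_ACCESSED", "PMD_ACCESSED", "PUD_PRESENT",
		"PTE_DIRTY_M", "PMD_DIRTY_M",
		"PTE_IDLE", "PMD_IDLE", "PMD_IDLE_PTES",
		"PTE_HOLE", "PMD_HOLE",
		"PIP_CMD",
	};

	return type < IDLE_PAGE_TYPE_MAX ? names[type] : "UNKNOWN";
}

/* Hypothetical helper: render one encoded byte as "<TYPE> x<count>". */
void pip_describe(uint8_t b, char *out, size_t len)
{
	snprintf(out, len, "%s x%u",
		 pip_type_name(PIP_TYPE(b)), (unsigned int)PIP_SIZE(b));
}
```

Because the count occupies only the low four bits, a single byte can carry a count of at most 15.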
From: liubo liubo254@huawei.com
euleros inclusion
category: feature
feature: etmem
bugzilla: 49889
-------------------------------------------------
To achieve memory expansion, cold pages need to be migrated to the swap partition; etmem_swap.ko serves this purpose.
This patch adds etmem_swap.ko, which takes the virtual addresses passed in from user space and reclaims the corresponding pages to swap.
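As an illustration of the user-space side, the sketch below builds the request that swap_pages_write() in this patch parses: one hexadecimal virtual address per line, written to the swap_pages entry of the target process. It is a sketch under stated assumptions, not part of the patch; format_swap_request() and swap_out_pages() are hypothetical names, and the caller is assumed to hold CAP_SYS_ADMIN, which mm_swap_open() requires.

```c
#include <fcntl.h>
#include <stdio.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper: format n addresses as newline-separated hex
 * strings, the form the kernel side splits with strsep() and parses
 * with kstrtoul(p, 16, ...).
 * Returns the number of bytes written, or -1 if buf is too small.
 */
int format_swap_request(const unsigned long *vaddrs, size_t n,
			char *buf, size_t len)
{
	size_t off = 0;

	for (size_t i = 0; i < n; i++) {
		int w = snprintf(buf + off, len - off, "%lx\n", vaddrs[i]);

		if (w < 0 || (size_t)w >= len - off)
			return -1;
		off += (size_t)w;
	}
	return (int)off;
}

/*
 * Hypothetical helper: send the request to /proc/<pid>/swap_pages.
 * Returns 0 if the whole request was accepted, -1 otherwise.
 */
int swap_out_pages(pid_t pid, const unsigned long *vaddrs, size_t n)
{
	char path[64];
	char req[4096];
	ssize_t ret;
	int len, fd;

	len = format_swap_request(vaddrs, n, req, sizeof(req));
	if (len < 0)
		return -1;

	snprintf(path, sizeof(path), "/proc/%d/swap_pages", (int)pid);
	fd = open(path, O_WRONLY);
	if (fd < 0)
		return -1;

	ret = write(fd, req, (size_t)len);
	close(fd);
	return ret == len ? 0 : -1;
}
```

Note that the kernel side silently skips addresses it cannot parse or resolve to a page, so a successful write does not imply that every listed page was swapped out.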
Signed-off-by: yanxiaodan yanxiaodan@huawei.com
Signed-off-by: linmiaohe linmiaohe@huawei.com
Signed-off-by: louhongxiang louhongxiang@huawei.com
Signed-off-by: liubo liubo254@huawei.com
Signed-off-by: geruijun geruijun@huawei.com
Signed-off-by: liangchenshu liangchenshu@huawei.com
Acked-by: Xie XiuQi xiexiuqi@huawei.com
Reviewed-by: Jing Xiangfeng jingxiangfeng@huawei.com
Signed-off-by: Yang Yingliang yangyingliang@huawei.com
---
 fs/proc/Makefile     |   1 +
 fs/proc/base.c       |   2 +
 fs/proc/etmem_swap.c | 102 +++++++++++++++++++++++++++++++++++++++
 fs/proc/internal.h   |   1 +
 fs/proc/task_mmu.c   |  51 ++++++++++++++++++++
 include/linux/swap.h |   5 ++
 lib/Kconfig          |   5 ++
 mm/vmscan.c          | 112 +++++++++++++++++++++++++++++++++++++++++++
 8 files changed, 279 insertions(+)
 create mode 100644 fs/proc/etmem_swap.c
diff --git a/fs/proc/Makefile b/fs/proc/Makefile index c1ebd017a83bc..b9c9f59aba456 100644 --- a/fs/proc/Makefile +++ b/fs/proc/Makefile @@ -34,3 +34,4 @@ proc-$(CONFIG_PROC_VMCORE) += vmcore.o proc-$(CONFIG_PRINTK) += kmsg.o proc-$(CONFIG_PROC_PAGE_MONITOR) += page.o obj-$(CONFIG_ETMEM_SCAN) += etmem_scan.o +obj-$(CONFIG_ETMEM_SWAP) += etmem_swap.o diff --git a/fs/proc/base.c b/fs/proc/base.c index 9ea434f9da4ee..c66f5ffadb58d 100644 --- a/fs/proc/base.c +++ b/fs/proc/base.c @@ -2990,6 +2990,7 @@ static const struct pid_entry tgid_base_stuff[] = { REG("smaps_rollup", S_IRUGO, proc_pid_smaps_rollup_operations), REG("pagemap", S_IRUSR, proc_pagemap_operations), REG("idle_pages", S_IRUSR|S_IWUSR, proc_mm_idle_operations), + REG("swap_pages", S_IWUSR, proc_mm_swap_operations), #endif #ifdef CONFIG_SECURITY DIR("attr", S_IRUGO|S_IXUGO, proc_attr_dir_inode_operations, proc_attr_dir_operations), @@ -3379,6 +3380,7 @@ static const struct pid_entry tid_base_stuff[] = { REG("smaps_rollup", S_IRUGO, proc_pid_smaps_rollup_operations), REG("pagemap", S_IRUSR, proc_pagemap_operations), REG("idle_pages", S_IRUSR|S_IWUSR, proc_mm_idle_operations), + REG("swap_pages", S_IWUSR, proc_mm_swap_operations), #endif #ifdef CONFIG_SECURITY DIR("attr", S_IRUGO|S_IXUGO, proc_attr_dir_inode_operations, proc_attr_dir_operations), diff --git a/fs/proc/etmem_swap.c b/fs/proc/etmem_swap.c new file mode 100644 index 0000000000000..b24c706c3b2a3 --- /dev/null +++ b/fs/proc/etmem_swap.c @@ -0,0 +1,102 @@ +// SPDX-License-Identifier: GPL-2.0 +#include <linux/init.h> +#include <linux/kernel.h> +#include <linux/module.h> +#include <linux/string.h> +#include <linux/proc_fs.h> +#include <linux/sched/mm.h> +#include <linux/mm.h> +#include <linux/swap.h> +#include <linux/mempolicy.h> +#include <linux/uaccess.h> +#include <linux/delay.h> + +static ssize_t swap_pages_write(struct file *file, const char __user *buf, + size_t count, loff_t *ppos) +{ + char *p, *data, *data_ptr_res; + unsigned long vaddr; + 
struct mm_struct *mm = file->private_data; + struct page *page; + LIST_HEAD(pagelist); + int ret = 0; + + if (!mm || !mmget_not_zero(mm)) { + ret = -ESRCH; + goto out; + } + + if (count < 0) { + ret = -EOPNOTSUPP; + goto out_mm; + } + + data = memdup_user_nul(buf, count); + if (IS_ERR(data)) { + ret = PTR_ERR(data); + goto out_mm; + } + + data_ptr_res = data; + while ((p = strsep(&data, "\n")) != NULL) { + if (!*p) + continue; + + ret = kstrtoul(p, 16, &vaddr); + if (ret != 0) + continue; + /*If get page struct failed, ignore it, get next page*/ + page = get_page_from_vaddr(mm, vaddr); + if (!page) + continue; + + add_page_for_swap(page, &pagelist); + } + + if (!list_empty(&pagelist)) + reclaim_pages(&pagelist); + + ret = count; + kfree(data_ptr_res); +out_mm: + mmput(mm); +out: + return ret; +} + +static int swap_pages_open(struct inode *inode, struct file *file) +{ + if (!try_module_get(THIS_MODULE)) + return -EBUSY; + + return 0; +} + +static int swap_pages_release(struct inode *inode, struct file *file) +{ + module_put(THIS_MODULE); + return 0; +} + + +extern struct file_operations proc_swap_pages_operations; + +static int swap_pages_entry(void) +{ + proc_swap_pages_operations.owner = THIS_MODULE; + proc_swap_pages_operations.write = swap_pages_write; + proc_swap_pages_operations.open = swap_pages_open; + proc_swap_pages_operations.release = swap_pages_release; + + return 0; +} + +static void swap_pages_exit(void) +{ + memset(&proc_swap_pages_operations, 0, + sizeof(proc_swap_pages_operations)); +} + +MODULE_LICENSE("GPL"); +module_init(swap_pages_entry); +module_exit(swap_pages_exit); diff --git a/fs/proc/internal.h b/fs/proc/internal.h index 55b4a9b716a99..2df432cc38af3 100644 --- a/fs/proc/internal.h +++ b/fs/proc/internal.h @@ -300,6 +300,7 @@ extern const struct file_operations proc_pid_smaps_rollup_operations; extern const struct file_operations proc_clear_refs_operations; extern const struct file_operations proc_pagemap_operations; extern const struct 
file_operations proc_mm_idle_operations; +extern const struct file_operations proc_mm_swap_operations;
extern unsigned long task_vsize(struct mm_struct *); extern unsigned long task_statm(struct mm_struct *, diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c index ee258adb631ba..ac7f57badcfdf 100644 --- a/fs/proc/task_mmu.c +++ b/fs/proc/task_mmu.c @@ -1663,7 +1663,58 @@ const struct file_operations proc_mm_idle_operations = { .release = mm_idle_release, };
+/*swap pages*/ +struct file_operations proc_swap_pages_operations = { +}; +EXPORT_SYMBOL_GPL(proc_swap_pages_operations); + +static ssize_t mm_swap_write(struct file *file, const char __user *buf, + size_t count, loff_t *ppos) +{ + if (proc_swap_pages_operations.write) + return proc_swap_pages_operations.write(file, buf, count, ppos); + + return -1; +} + +static int mm_swap_open(struct inode *inode, struct file *file) +{ + struct mm_struct *mm = NULL; + + if (!file_ns_capable(file, &init_user_ns, CAP_SYS_ADMIN)) + return -EPERM; + + mm = proc_mem_open(inode, PTRACE_MODE_READ); + if (IS_ERR(mm)) + return PTR_ERR(mm); + + file->private_data = mm; + + if (proc_swap_pages_operations.open) + return proc_swap_pages_operations.open(inode, file); + + return 0; +} + +static int mm_swap_release(struct inode *inode, struct file *file) +{ + struct mm_struct *mm = file->private_data;
+ if (mm) + mmdrop(mm); + + if (proc_swap_pages_operations.release) + return proc_swap_pages_operations.release(inode, file); + + return 0; +} + +const struct file_operations proc_mm_swap_operations = { + .llseek = mem_lseek, + .write = mm_swap_write, + .open = mm_swap_open, + .release = mm_swap_release, +}; #endif /* CONFIG_PROC_PAGE_MONITOR */
#ifdef CONFIG_NUMA diff --git a/include/linux/swap.h b/include/linux/swap.h index aecda6766417a..e7fa800d52ea3 100644 --- a/include/linux/swap.h +++ b/include/linux/swap.h @@ -374,6 +374,11 @@ extern int vm_swappiness; extern int remove_mapping(struct address_space *mapping, struct page *page); extern unsigned long vm_total_pages;
+extern unsigned long reclaim_pages(struct list_head *page_list); +extern int add_page_for_swap(struct page *page, struct list_head *pagelist); +extern struct page *get_page_from_vaddr(struct mm_struct *mm, + unsigned long vaddr); + #ifdef CONFIG_SHRINK_PAGECACHE extern unsigned long vm_cache_limit_ratio; extern unsigned long vm_cache_limit_ratio_min; diff --git a/lib/Kconfig b/lib/Kconfig index f332b0a05db26..edb7d40d1f608 100644 --- a/lib/Kconfig +++ b/lib/Kconfig @@ -605,6 +605,11 @@ config ETMEM_SCAN etmem page scan feature used to scan the virtual address of the target process
+config ETMEM_SWAP + tristate "module: etmem page swap for etmem support" + help + etmem page swap feature + config STRING_SELFTEST tristate "Test string functions"
diff --git a/mm/vmscan.c b/mm/vmscan.c index 7cfa9561c2568..92be608b467b6 100644 --- a/mm/vmscan.c +++ b/mm/vmscan.c @@ -36,6 +36,7 @@ #include <linux/topology.h> #include <linux/cpu.h> #include <linux/cpuset.h> +#include <linux/mempolicy.h> #include <linux/compaction.h> #include <linux/notifier.h> #include <linux/rwsem.h> @@ -4403,3 +4404,114 @@ void check_move_unevictable_pages(struct page **pages, int nr_pages) } } #endif /* CONFIG_SHMEM */ + +unsigned long reclaim_pages(struct list_head *page_list) +{ + int nid = NUMA_NO_NODE; + unsigned long nr_reclaimed = 0; + LIST_HEAD(node_page_list); + struct reclaim_stat dummy_stat; + struct page *page; + struct scan_control sc = { + .gfp_mask = GFP_KERNEL, + .priority = DEF_PRIORITY, + .may_writepage = 1, + .may_unmap = 1, + .may_swap = 1, + }; + + while (!list_empty(page_list)) { + page = lru_to_page(page_list); + if (nid == NUMA_NO_NODE) { + nid = page_to_nid(page); + INIT_LIST_HEAD(&node_page_list); + } + + if (nid == page_to_nid(page)) { + ClearPageActive(page); + list_move(&page->lru, &node_page_list); + continue; + } + + nr_reclaimed += shrink_page_list(&node_page_list, + NODE_DATA(nid), + &sc, 0, + &dummy_stat, false); + while (!list_empty(&node_page_list)) { + page = lru_to_page(&node_page_list); + list_del(&page->lru); + putback_lru_page(page); + } + + nid = NUMA_NO_NODE; + } + + if (!list_empty(&node_page_list)) { + nr_reclaimed += shrink_page_list(&node_page_list, + NODE_DATA(nid), + &sc, 0, + &dummy_stat, false); + while (!list_empty(&node_page_list)) { + page = lru_to_page(&node_page_list); + list_del(&page->lru); + putback_lru_page(page); + } + } + + return nr_reclaimed; +} +EXPORT_SYMBOL_GPL(reclaim_pages); + +int add_page_for_swap(struct page *page, struct list_head *pagelist) +{ + int err = -EBUSY; + struct page *head; + + /*If the page is mapped by more than one process, do not swap it */ + if (page_mapcount(page) > 1) + return -EACCES; + + if (PageHuge(page)) + return -EACCES; + + head = 
compound_head(page); + err = isolate_lru_page(head); + if (err) { + put_page(page); + return err; + } + put_page(page); + if (PageUnevictable(page)) + putback_lru_page(page); + else + list_add_tail(&head->lru, pagelist); + + err = 0; + return err; +} +EXPORT_SYMBOL_GPL(add_page_for_swap); +struct page *get_page_from_vaddr(struct mm_struct *mm, unsigned long vaddr) +{ + struct page *page; + struct vm_area_struct *vma; + unsigned int follflags; + + down_read(&mm->mmap_sem); + + vma = find_vma(mm, vaddr); + if (!vma || vaddr < vma->vm_start || vma->vm_flags & VM_LOCKED) { + up_read(&mm->mmap_sem); + return NULL; + } + + follflags = FOLL_GET | FOLL_DUMP; + page = follow_page(vma, vaddr, follflags); + if (IS_ERR(page) || !page) { + up_read(&mm->mmap_sem); + return NULL; + } + + up_read(&mm->mmap_sem); + return page; +} +EXPORT_SYMBOL_GPL(get_page_from_vaddr);
From: liubo liubo254@huawei.com
euleros inclusion
category: feature
feature: etmem
bugzilla: 49889
-------------------------------------------------
Enable the etmem feature config options by setting the default values of CONFIG_ETMEM_SCAN and CONFIG_ETMEM_SWAP.
Before using the etmem feature, etmem_scan.ko and etmem_swap.ko must be loaded first.
Signed-off-by: liubo liubo254@huawei.com
Acked-by: Xie XiuQi xiexiuqi@huawei.com
Reviewed-by: Jing Xiangfeng jingxiangfeng@huawei.com
Signed-off-by: Yang Yingliang yangyingliang@huawei.com
---
 arch/arm64/configs/hulk_defconfig      | 2 ++
 arch/arm64/configs/openeuler_defconfig | 2 ++
 arch/x86/configs/hulk_defconfig        | 2 ++
 arch/x86/configs/openeuler_defconfig   | 2 ++
 4 files changed, 8 insertions(+)
diff --git a/arch/arm64/configs/hulk_defconfig b/arch/arm64/configs/hulk_defconfig
index f8f7890254641..39bf10f0e3ad7 100644
--- a/arch/arm64/configs/hulk_defconfig
+++ b/arch/arm64/configs/hulk_defconfig
@@ -5700,3 +5700,5 @@ CONFIG_IO_STRICT_DEVMEM=y
 # CONFIG_DEBUG_EFI is not set
 # CONFIG_ARM64_RELOC_TEST is not set
 # CONFIG_CORESIGHT is not set
+CONFIG_ETMEM_SCAN=m
+CONFIG_ETMEM_SWAP=m
diff --git a/arch/arm64/configs/openeuler_defconfig b/arch/arm64/configs/openeuler_defconfig
index ad860594978a3..a99848f198007 100644
--- a/arch/arm64/configs/openeuler_defconfig
+++ b/arch/arm64/configs/openeuler_defconfig
@@ -6044,3 +6044,5 @@ CONFIG_IO_STRICT_DEVMEM=y
 # CONFIG_ARM64_RELOC_TEST is not set
 # CONFIG_CORESIGHT is not set
 CONFIG_SMMU_BYPASS_DEV=y
+CONFIG_ETMEM_SCAN=m
+CONFIG_ETMEM_SWAP=m
diff --git a/arch/x86/configs/hulk_defconfig b/arch/x86/configs/hulk_defconfig
index eabaf356664d2..b41ec0e4f6dca 100644
--- a/arch/x86/configs/hulk_defconfig
+++ b/arch/x86/configs/hulk_defconfig
@@ -7530,3 +7530,5 @@ CONFIG_OPTIMIZE_INLINING=y
 # CONFIG_PUNIT_ATOM_DEBUG is not set
 CONFIG_UNWINDER_ORC=y
 # CONFIG_UNWINDER_FRAME_POINTER is not set
+CONFIG_ETMEM_SCAN=m
+CONFIG_ETMEM_SWAP=m
diff --git a/arch/x86/configs/openeuler_defconfig b/arch/x86/configs/openeuler_defconfig
index 4b7502fb79e70..540dc77a9f8d6 100644
--- a/arch/x86/configs/openeuler_defconfig
+++ b/arch/x86/configs/openeuler_defconfig
@@ -7538,3 +7538,5 @@ CONFIG_OPTIMIZE_INLINING=y
 # CONFIG_PUNIT_ATOM_DEBUG is not set
 CONFIG_UNWINDER_ORC=y
 # CONFIG_UNWINDER_FRAME_POINTER is not set
+CONFIG_ETMEM_SCAN=m
+CONFIG_ETMEM_SWAP=m
From: Lu Jialin lujialin4@huawei.com
hulk inclusion
category: feature/cgroups
bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=7
CVE: NA
--------
Signed-off-by: Lu Jialin lujialin4@huawei.com
Reviewed-by: Xiu Jianfeng xiujianfeng@huawei.com
Signed-off-by: Yang Yingliang yangyingliang@huawei.com
---
 arch/x86/configs/hulk_defconfig      | 2 +-
 arch/x86/configs/openeuler_defconfig | 2 +-
 2 files changed, 2 insertions(+), 2 deletions(-)
diff --git a/arch/x86/configs/hulk_defconfig b/arch/x86/configs/hulk_defconfig
index b41ec0e4f6dca..bf7860e16038f 100644
--- a/arch/x86/configs/hulk_defconfig
+++ b/arch/x86/configs/hulk_defconfig
@@ -152,7 +152,7 @@ CONFIG_CGROUP_PERF=y
 CONFIG_CGROUP_BPF=y
 # CONFIG_CGROUP_DEBUG is not set
 CONFIG_SOCK_CGROUP_DATA=y
-# CONFIG_CGROUP_FILES is not set
+CONFIG_CGROUP_FILES=y
 CONFIG_NAMESPACES=y
 CONFIG_UTS_NS=y
 CONFIG_IPC_NS=y
diff --git a/arch/x86/configs/openeuler_defconfig b/arch/x86/configs/openeuler_defconfig
index 540dc77a9f8d6..54184d30bb51a 100644
--- a/arch/x86/configs/openeuler_defconfig
+++ b/arch/x86/configs/openeuler_defconfig
@@ -152,7 +152,7 @@ CONFIG_CGROUP_PERF=y
 CONFIG_CGROUP_BPF=y
 # CONFIG_CGROUP_DEBUG is not set
 CONFIG_SOCK_CGROUP_DATA=y
-# CONFIG_CGROUP_FILES is not set
+CONFIG_CGROUP_FILES=y
 CONFIG_NAMESPACES=y
 CONFIG_UTS_NS=y
 CONFIG_IPC_NS=y
From: Roberto Sassu roberto.sassu@huawei.com
hulk inclusion
category: feature
feature: IMA digest lists
bugzilla: https://gitee.com/openEuler/kernel/issues/I3916O
CVE: NA
------------------------------------------------
This patch includes pubring.gpg in system_certificates.o only if it is found in the certs directory of the source tree.
Signed-off-by: Roberto Sassu <roberto.sassu(a)huawei.com>
Reviewed-by: Xiongfeng Wang wangxiongfeng2@huawei.com
Signed-off-by: Yang Yingliang yangyingliang@huawei.com
---
 certs/Makefile              | 13 +++++++------
 certs/system_certificates.S |  2 +-
 2 files changed, 8 insertions(+), 7 deletions(-)
diff --git a/certs/Makefile b/certs/Makefile index 5053e3c86c971..766c5d003093f 100644 --- a/certs/Makefile +++ b/certs/Makefile @@ -4,12 +4,6 @@ #
obj-$(CONFIG_SYSTEM_TRUSTED_KEYRING) += system_keyring.o system_certificates.o -ifdef CONFIG_PGP_PRELOAD_PUBLIC_KEYS -ifneq ($(shell ls certs/pubring.gpg 2> /dev/null), certs/pubring.gpg) -$(shell touch certs/pubring.gpg) -endif -$(obj)/system_certificates.o: certs/pubring.gpg -endif obj-$(CONFIG_SYSTEM_BLACKLIST_KEYRING) += blacklist.o ifneq ($(CONFIG_SYSTEM_BLACKLIST_HASH_LIST),"") obj-$(CONFIG_SYSTEM_BLACKLIST_KEYRING) += blacklist_hashes.o @@ -27,6 +21,13 @@ $(obj)/system_certificates.o: $(obj)/x509_certificate_list # Cope with signing_key.x509 existing in $(srctree) not $(objtree) AFLAGS_system_certificates.o := -I$(srctree)
+ifdef CONFIG_PGP_PRELOAD_PUBLIC_KEYS +ifeq ($(shell ls $(srctree)/certs/pubring.gpg 2> /dev/null), $(srctree)/certs/pubring.gpg) +AFLAGS_system_certificates.o += -DHAVE_PUBRING_GPG +$(obj)/system_certificates.o: $(srctree)/certs/pubring.gpg +endif +endif + quiet_cmd_extract_certs = EXTRACT_CERTS $(patsubst "%",%,$(2)) cmd_extract_certs = scripts/extract-cert $(2) $@ || ( rm $@; exit 1)
diff --git a/certs/system_certificates.S b/certs/system_certificates.S index bcb7c4b4cc366..e5f58711c38c6 100644 --- a/certs/system_certificates.S +++ b/certs/system_certificates.S @@ -40,7 +40,7 @@ system_certificate_list_size: .globl pgp_public_keys pgp_public_keys: __pgp_key_list_start: -#ifdef CONFIG_PGP_PRELOAD_PUBLIC_KEYS +#ifdef HAVE_PUBRING_GPG .incbin "certs/pubring.gpg" #endif __pgp_key_list_end:
From: Ondrej Jirman <megous(a)megous.com>
mainline inclusion
from mainline-v5.2-rc1
commit e3062e05e1cfe378bb9b3fa0bef46711372bcf13
category: bugfix
bugzilla: https://gitee.com/openeuler/kernel/issues/I3AUFW
CVE: NA
------------------------------------------------
SDIO based brcm43456 is currently misdetected as brcm43455 and the wrong firmware name is used. Correct the detection and load the correct firmware file. Chiprev for brcm43456 is "9".
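As a rough illustration of why the two table entries in this patch separate the chips, the sketch below models the firmware lookup as a per-revision bitmask test. It assumes the driver selects the first entry whose mask has bit N set for chip revision N; the `fw_mapping` struct and `fw_for_chiprev()` are hypothetical stand-ins, and only the masks and firmware names come from the patch.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the driver's firmware mapping entries. */
struct fw_mapping {
	uint32_t revmask;	/* bit N set => entry applies to chip revision N */
	const char *fw_base;
};

/* Same masks and order as the two BRCM_CC_4345_CHIP_ID entries in the patch. */
static const struct fw_mapping map_4345[] = {
	{ 0x00000200, "brcmfmac43456-sdio" },	/* revision 9 only */
	{ 0xFFFFFDC0, "brcmfmac43455-sdio" },	/* revisions 6-8 and 10-31 */
};

/* Return the firmware base name for a chip revision, or NULL if none match. */
static const char *fw_for_chiprev(unsigned int chiprev)
{
	size_t i;

	if (chiprev >= 32)
		return NULL;
	for (i = 0; i < sizeof(map_4345) / sizeof(map_4345[0]); i++)
		if (map_4345[i].revmask & (UINT32_C(1) << chiprev))
			return map_4345[i].fw_base;
	return NULL;
}
```

With the previous single mask 0xFFFFFFC0, revision 9 fell into the 43455 entry; clearing bit 9 there and adding a dedicated 0x00000200 entry routes it to the 43456 firmware instead.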
Signed-off-by: Ondrej Jirman <megous(a)megous.com>
Signed-off-by: Kalle Valo <kvalo(a)codeaurora.org>
Signed-off-by: Fang Yafen <yafen(a)iscas.ac.cn>
Reviewed-by: Xiongfeng Wang wangxiongfeng2@huawei.com
Signed-off-by: Yang Yingliang yangyingliang@huawei.com
---
 drivers/net/wireless/broadcom/brcm80211/brcmfmac/sdio.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)
diff --git a/drivers/net/wireless/broadcom/brcm80211/brcmfmac/sdio.c b/drivers/net/wireless/broadcom/brcm80211/brcmfmac/sdio.c
index a5195bdb4d9bd..0cab0f914c81e 100644
--- a/drivers/net/wireless/broadcom/brcm80211/brcmfmac/sdio.c
+++ b/drivers/net/wireless/broadcom/brcm80211/brcmfmac/sdio.c
@@ -621,6 +621,7 @@ BRCMF_FW_DEF(43430A0, "brcmfmac43430a0-sdio");
 /* Note the names are not postfixed with a1 for backward compatibility */
 BRCMF_FW_DEF(43430A1, "brcmfmac43430-sdio");
 BRCMF_FW_DEF(43455, "brcmfmac43455-sdio");
+BRCMF_FW_DEF(43456, "brcmfmac43456-sdio");
 BRCMF_FW_DEF(4354, "brcmfmac4354-sdio");
 BRCMF_FW_DEF(4356, "brcmfmac4356-sdio");
 BRCMF_FW_DEF(4373, "brcmfmac4373-sdio");
@@ -640,7 +641,8 @@ static const struct brcmf_firmware_mapping brcmf_sdio_fwnames[] = {
 	BRCMF_FW_ENTRY(BRCM_CC_4339_CHIP_ID, 0xFFFFFFFF, 4339),
 	BRCMF_FW_ENTRY(BRCM_CC_43430_CHIP_ID, 0x00000001, 43430A0),
 	BRCMF_FW_ENTRY(BRCM_CC_43430_CHIP_ID, 0xFFFFFFFE, 43430A1),
-	BRCMF_FW_ENTRY(BRCM_CC_4345_CHIP_ID, 0xFFFFFFC0, 43455),
+	BRCMF_FW_ENTRY(BRCM_CC_4345_CHIP_ID, 0x00000200, 43456),
+	BRCMF_FW_ENTRY(BRCM_CC_4345_CHIP_ID, 0xFFFFFDC0, 43455),
 	BRCMF_FW_ENTRY(BRCM_CC_4354_CHIP_ID, 0xFFFFFFFF, 4354),
 	BRCMF_FW_ENTRY(BRCM_CC_4356_CHIP_ID, 0xFFFFFFFF, 4356),
 	BRCMF_FW_ENTRY(CY_CC_4373_CHIP_ID, 0xFFFFFFFF, 4373)
From: Weilong Chen chenweilong@huawei.com
ascend inclusion
category: feature
bugzilla: 46922
CVE: NA
-------------------------------------
Add the MIDR encodings for HiSilicon Taishan v200 CPUs, which are used in Kunpeng ARM64 server SoCs. TSV200 is the abbreviation of Taishan v200. There are two variants of TSV200, variant 0 and variant 1.
Signed-off-by: Weilong Chen chenweilong@huawei.com
Reviewed-by: Kefeng Wang wangkefeng.wang@huawei.com
Reviewed-by: Ding Tianhong dingtianhong@huawei.com
Signed-off-by: Yang Yingliang yangyingliang@huawei.com
---
 arch/arm64/include/asm/cputype.h | 2 ++
 1 file changed, 2 insertions(+)
diff --git a/arch/arm64/include/asm/cputype.h b/arch/arm64/include/asm/cputype.h
index ac266b64a75cc..f33947af79d9a 100644
--- a/arch/arm64/include/asm/cputype.h
+++ b/arch/arm64/include/asm/cputype.h
@@ -100,6 +100,7 @@
 #define NVIDIA_CPU_PART_CARMEL		0x004

 #define HISI_CPU_PART_TSV110		0xD01
+#define HISI_CPU_PART_TSV200		0xD02

 #define MIDR_CORTEX_A53 MIDR_CPU_MODEL(ARM_CPU_IMP_ARM, ARM_CPU_PART_CORTEX_A53)
 #define MIDR_CORTEX_A57 MIDR_CPU_MODEL(ARM_CPU_IMP_ARM, ARM_CPU_PART_CORTEX_A57)
@@ -121,6 +122,7 @@
 #define MIDR_NVIDIA_DENVER MIDR_CPU_MODEL(ARM_CPU_IMP_NVIDIA, NVIDIA_CPU_PART_DENVER)
 #define MIDR_NVIDIA_CARMEL MIDR_CPU_MODEL(ARM_CPU_IMP_NVIDIA, NVIDIA_CPU_PART_CARMEL)
 #define MIDR_HISI_TSV110 MIDR_CPU_MODEL(ARM_CPU_IMP_HISI, HISI_CPU_PART_TSV110)
+#define MIDR_HISI_TSV200 MIDR_CPU_MODEL(ARM_CPU_IMP_HISI, HISI_CPU_PART_TSV200)

 #ifndef __ASSEMBLY__
From: Weilong Chen chenweilong@huawei.com
ascend inclusion
category: feature
bugzilla: 46922
CVE: NA
-------------------------------------
Taishan's L1/L2 cache is inclusive and its data is kept consistent. A change in L1 does not require a DC operation to flush the cache line from L1 to L2, so it is safe to skip cleaning the data cache by address to the point of unification.
Without the IDC feature, the kernel needs to flush the icache as well as the dcache, which causes performance degradation.
The flaw affects TSV110 and TSV200 variant 1.
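The CPU match used by this workaround can be sketched in plain C as below. This is only an illustration of the MIDR comparison, not the kernel code: the field layout follows the Arm MIDR_EL1 definition, the implementer code 0x48 for HiSilicon is an assumption taken from the kernel's cputype.h, and `idc_erratum_applies()` is a hypothetical name. It mirrors the patch's match list: every TSV110, and TSV200 variant 1 revision 0.

```c
#include <stdbool.h>
#include <stdint.h>

/* MIDR_EL1 field layout per the Arm architecture. */
#define MIDR_REVISION(m)	((m) & 0xfu)
#define MIDR_PARTNUM(m)		(((m) >> 4) & 0xfffu)
#define MIDR_VARIANT(m)		(((m) >> 20) & 0xfu)
#define MIDR_IMPLEMENTER(m)	(((m) >> 24) & 0xffu)

#define IMP_HISI	0x48u	/* assumed HiSilicon implementer code */
#define PART_TSV110	0xD01u	/* from HISI_CPU_PART_TSV110 */
#define PART_TSV200	0xD02u	/* from HISI_CPU_PART_TSV200 */

/* True for all TSV110 parts and for TSV200 variant 1, revision 0. */
static bool idc_erratum_applies(uint32_t midr)
{
	if (MIDR_IMPLEMENTER(midr) != IMP_HISI)
		return false;
	if (MIDR_PARTNUM(midr) == PART_TSV110)
		return true;
	return MIDR_PARTNUM(midr) == PART_TSV200 &&
	       MIDR_VARIANT(midr) == 1 && MIDR_REVISION(midr) == 0;
}
```

A TSV200 variant 0 part is deliberately not matched, which is why the TSV200 entry is an exact variant/revision match while TSV110 matches every version.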
Signed-off-by: Weilong Chen chenweilong@huawei.com
Reviewed-by: Kefeng Wang wangkefeng.wang@huawei.com
Reviewed-by: Ding Tianhong dingtianhong@huawei.com
Signed-off-by: Yang Yingliang yangyingliang@huawei.com
---
 Documentation/arm64/silicon-errata.txt |  1 +
 arch/arm64/Kconfig                     |  9 ++++++++
 arch/arm64/include/asm/cpucaps.h       |  3 ++-
 arch/arm64/kernel/cpu_errata.c         | 32 ++++++++++++++++++++++++++
 4 files changed, 44 insertions(+), 1 deletion(-)
diff --git a/Documentation/arm64/silicon-errata.txt b/Documentation/arm64/silicon-errata.txt index 667ea906266ed..82501eb655440 100644 --- a/Documentation/arm64/silicon-errata.txt +++ b/Documentation/arm64/silicon-errata.txt @@ -76,6 +76,7 @@ stable kernels. | Hisilicon | Hip0{5,6,7} | #161010101 | HISILICON_ERRATUM_161010101 | | Hisilicon | Hip0{6,7} | #161010701 | N/A | | Hisilicon | Hip07 | #161600802 | HISILICON_ERRATUM_161600802 | +| Hisilicon | TSV{110,200} | #1980005 | HISILICON_ERRATUM_1980005 | | | | | | | Qualcomm Tech. | Kryo/Falkor v1 | E1003 | QCOM_FALKOR_ERRATUM_1003 | | Qualcomm Tech. | Falkor v1 | E1009 | QCOM_FALKOR_ERRATUM_1009 | diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig index 3bce08c9aabe5..b7caf370a14b7 100644 --- a/arch/arm64/Kconfig +++ b/arch/arm64/Kconfig @@ -642,6 +642,15 @@ config HISILICON_ERRATUM_161600802
If unsure, say Y.
+config HISILICON_ERRATUM_1980005 + bool "Hisilicon erratum IDC support" + default y + help + The HiSilicon TSV100/200 SoC support idc but report wrong value to + kernel. + + If unsure, say Y. + config QCOM_FALKOR_ERRATUM_E1041 bool "Falkor E1041: Speculative instruction fetches might cause errant memory access" default y diff --git a/arch/arm64/include/asm/cpucaps.h b/arch/arm64/include/asm/cpucaps.h index dd02f2e8d0f5a..103dddddf3f20 100644 --- a/arch/arm64/include/asm/cpucaps.h +++ b/arch/arm64/include/asm/cpucaps.h @@ -71,7 +71,8 @@ #define ARM64_HAS_TLB_RANGE 50 #define ARM64_HAS_RNG 51 #define ARM64_HAS_E0PD 52 +#define ARM64_WORKAROUND_HISILICON_1980005 53
-#define ARM64_NCAPS 53 +#define ARM64_NCAPS 54
#endif /* __ASM_CPUCAPS_H */ diff --git a/arch/arm64/kernel/cpu_errata.c b/arch/arm64/kernel/cpu_errata.c index ec7bb83fabfba..46a2b5849f949 100644 --- a/arch/arm64/kernel/cpu_errata.c +++ b/arch/arm64/kernel/cpu_errata.c @@ -66,6 +66,29 @@ is_kryo_midr(const struct arm64_cpu_capabilities *entry, int scope) return model == entry->midr_range.model; }
+#ifdef CONFIG_HISILICON_ERRATUM_1980005 +static bool +hisilicon_1980005_match(const struct arm64_cpu_capabilities *entry, + int scope) +{ + static const struct midr_range idc_support_list[] = { + MIDR_ALL_VERSIONS(MIDR_HISI_TSV110), + MIDR_REV(MIDR_HISI_TSV200, 1, 0), + { /* sentinel */ } + }; + + return is_midr_in_range_list(read_cpuid_id(), idc_support_list); +} + +static void +hisilicon_1980005_enable(const struct arm64_cpu_capabilities *__unused) +{ + cpus_set_cap(ARM64_HAS_CACHE_IDC); + arm64_ftr_reg_ctrel0.sys_val |= BIT(CTR_IDC_SHIFT); + sysreg_clear_set(sctlr_el1, SCTLR_EL1_UCT, 0); +} +#endif + static bool has_mismatched_cache_type(const struct arm64_cpu_capabilities *entry, int scope) @@ -776,6 +799,15 @@ const struct arm64_cpu_capabilities arm64_errata[] = { .type = ARM64_CPUCAP_LOCAL_CPU_ERRATUM, .cpu_enable = cpu_enable_trap_ctr_access, }, +#ifdef CONFIG_HISILICON_ERRATUM_1980005 + { + .desc = "Taishan IDC coherence workaround", + .capability = ARM64_WORKAROUND_HISILICON_1980005, + .matches = hisilicon_1980005_match, + .type = ARM64_CPUCAP_SYSTEM_FEATURE, + .cpu_enable = hisilicon_1980005_enable, + }, +#endif #ifdef CONFIG_QCOM_FALKOR_ERRATUM_1003 { .desc = "Qualcomm Technologies Falkor/Kryo erratum 1003",
From: Weilong Chen chenweilong@huawei.com
ascend inclusion
category: feature
bugzilla: 46922
CVE: NA
-------------------------------------------------
Set the default value of CONFIG_HISILICON_ERRATUM_1980005.
Signed-off-by: Weilong Chen chenweilong@huawei.com
Reviewed-by: Kefeng Wang wangkefeng.wang@huawei.com
Reviewed-by: Ding Tianhong dingtianhong@huawei.com
Signed-off-by: Yang Yingliang yangyingliang@huawei.com
---
 arch/arm64/configs/hulk_defconfig | 1 +
 1 file changed, 1 insertion(+)
diff --git a/arch/arm64/configs/hulk_defconfig b/arch/arm64/configs/hulk_defconfig
index 39bf10f0e3ad7..947dc54339b7d 100644
--- a/arch/arm64/configs/hulk_defconfig
+++ b/arch/arm64/configs/hulk_defconfig
@@ -399,6 +399,7 @@ CONFIG_ARM64_ERRATUM_845719=y
 # CONFIG_QCOM_QDF2400_ERRATUM_0065 is not set
 # CONFIG_SOCIONEXT_SYNQUACER_PREITS is not set
 CONFIG_HISILICON_ERRATUM_161600802=y
+CONFIG_HISILICON_ERRATUM_1980005=y
 # CONFIG_QCOM_FALKOR_ERRATUM_E1041 is not set
 CONFIG_ARM64_4K_PAGES=y
 # CONFIG_ARM64_16K_PAGES is not set
From: Aichun Li liaichun@huawei.com
mainline inclusion
from mainline-v5.11-rc5
commit f318903c0bf42448b4c884732df2bbb0ef7a2284
category: bugfix
bugzilla: 47241
CVE: NA
-------------------
In Cilium we're mainly using BPF cgroup hooks today in order to implement kube-proxy free Kubernetes service translation for ClusterIP, NodePort (*), ExternalIP, and LoadBalancer as well as HostPort mapping [0] for all traffic between Cilium managed nodes. While this works in its current shape and avoids packet-level NAT for inter Cilium managed node traffic, there is one major limitation we're facing today, that is, lack of netns awareness.
In Kubernetes, the concept of Pods (which hold one or multiple containers) has been built around network namespaces, so while we can use the global scope of attaching to root BPF cgroup hooks also to our advantage (e.g. for exposing NodePort ports on loopback addresses), we also have the need to differentiate between the initial network namespace and non-initial ones. For example, ExternalIP services mandate that non-local service IPs are not to be translated from the host (initial) network namespace. Right now, we have an ugly work-around in place where non-local service IPs for ExternalIP services are not xlated from connect() and friends BPF hooks but instead via less efficient packet-level NAT on the veth tc ingress hook for Pod traffic.
On top of determining whether we're in initial or non-initial network namespace we also have a need for a socket-cookie like mechanism for network namespaces scope. Socket cookies have the nice property that they can be combined as part of the key structure e.g. for BPF LRU maps without having to worry that the cookie could be recycled. We are planning to use this for our sessionAffinity implementation for services. Therefore, add a new bpf_get_netns_cookie() helper which would resolve both use cases at once: bpf_get_netns_cookie(NULL) would provide the cookie for the initial network namespace while passing the context instead of NULL would provide the cookie from the application's network namespace. We're using a hole, so no size increase; the assignment happens only once. Therefore this allows for a comparison on initial namespace as well as regular cookie usage as we have today with socket cookies. We could later on enable this helper for other program types as well as we would see need.
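The "written once" cookie assignment described above can be sketched in userspace with C11 atomics. This is only a model of the pattern in the patch's net_gen_cookie(): `struct fake_net` and `fake_net_gen_cookie()` are hypothetical names, and the kernel's atomic64_t operations are replaced by their <stdatomic.h> counterparts.

```c
#include <stdatomic.h>
#include <stdint.h>

/* Hypothetical userspace stand-in for struct net. */
struct fake_net {
	_Atomic uint64_t cookie;	/* 0 means "not assigned yet" */
};

/* Global generator shared by all namespaces; starts at 0. */
static _Atomic uint64_t cookie_gen;

/*
 * Return the namespace's cookie, assigning one on first use. If two
 * callers race, only one compare-exchange succeeds; the loser re-reads
 * the winner's value on the next loop iteration.
 */
static uint64_t fake_net_gen_cookie(struct fake_net *net)
{
	for (;;) {
		uint64_t res = atomic_load(&net->cookie);
		uint64_t expected = 0;

		if (res)
			return res;
		res = atomic_fetch_add(&cookie_gen, 1) + 1;
		atomic_compare_exchange_strong(&net->cookie, &expected, res);
	}
}
```

Because a cookie is never 0 and never rewritten once set, a BPF program can compare the value it gets for its context against the value returned for NULL to tell whether it runs in the initial namespace.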
(*) Both externalTrafficPolicy={Local|Cluster} types
[0] https://github.com/cilium/cilium/blob/master/bpf/bpf_sock.c
Signed-off-by: Daniel Borkmann daniel@iogearbox.net
Signed-off-by: Alexei Starovoitov ast@kernel.org
Link: https://lore.kernel.org/bpf/c47d2346982693a9cf9da0e12690453aded4c788.1585323...

Signed-off-by: Aichun Li liaichun@huawei.com
Reviewed-by: Yue Haibing yuehaibing@huawei.com
Reviewed-by: Di Zhu zhudi21@huawei.com
Signed-off-by: Yang Yingliang yangyingliang@huawei.com
---
 include/linux/bpf.h            |  1 +
 include/net/net_namespace.h    | 10 +++++++++
 include/uapi/linux/bpf.h       | 15 +++++++++++++-
 kernel/bpf/verifier.c          | 16 +++++++++------
 net/core/filter.c              | 37 ++++++++++++++++++++++++++++++++++
 net/core/net_namespace.c       | 15 ++++++++++++++
 tools/include/uapi/linux/bpf.h | 15 +++++++++++++-
 7 files changed, 101 insertions(+), 8 deletions(-)
diff --git a/include/linux/bpf.h b/include/linux/bpf.h index 6bdc7157c232d..0cc25af3457ff 100644 --- a/include/linux/bpf.h +++ b/include/linux/bpf.h @@ -154,6 +154,7 @@ enum bpf_arg_type { ARG_CONST_SIZE_OR_ZERO, /* number of bytes accessed from memory or 0 */
ARG_PTR_TO_CTX, /* pointer to context */ + ARG_PTR_TO_CTX_OR_NULL, /* pointer to context or NULL */ ARG_ANYTHING, /* any (initialized) argument is ok */ };
diff --git a/include/net/net_namespace.h b/include/net/net_namespace.h index 5007eaba207d5..b2f080b10819e 100644 --- a/include/net/net_namespace.h +++ b/include/net/net_namespace.h @@ -150,6 +150,9 @@ struct net { #ifdef CONFIG_XFRM struct netns_xfrm xfrm; #endif + + atomic64_t net_cookie; /* written once */ + #if IS_ENABLED(CONFIG_IP_VS) struct netns_ipvs *ipvs; #endif @@ -250,6 +253,8 @@ static inline int check_net(const struct net *net)
void net_drop_ns(void *);
+u64 net_gen_cookie(struct net *net); + #else
static inline struct net *get_net(struct net *net) @@ -277,6 +282,11 @@ static inline int check_net(const struct net *net) return 1; }
+static inline u64 net_gen_cookie(struct net *net) +{ + return 0; +} + #define net_drop_ns NULL #endif
diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h index a2b606060588f..858e4c9d57492 100644 --- a/include/uapi/linux/bpf.h +++ b/include/uapi/linux/bpf.h @@ -2144,6 +2144,18 @@ union bpf_attr { * request in the skb. * Return * 0 on success, or a negative error in case of failure. + * u64 bpf_get_netns_cookie(void *ctx) + * Description + * Retrieve the cookie (generated by the kernel) of the network + * namespace the input *ctx* is associated with. The network + * namespace cookie remains stable for its lifetime and provides + * a global identifier that can be assumed unique. If *ctx* is + * NULL, then the helper returns the cookie for the initial + * network namespace. The cookie itself is very similar to that + * of bpf_get_socket_cookie() helper, but for network namespaces + * instead of sockets. + * Return + * A 8-byte long opaque number. */ #define __BPF_FUNC_MAPPER(FN) \ FN(unspec), \ @@ -2229,7 +2241,8 @@ union bpf_attr { FN(get_current_cgroup_id), \ FN(get_local_storage), \ FN(sk_select_reuseport), \ - FN(skb_ancestor_cgroup_id), + FN(skb_ancestor_cgroup_id), \ + FN(get_netns_cookie),
/* integer value in 'imm' field of BPF_CALL instruction selects which helper * function eBPF program intends to call diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c index 240bdbde8a153..24488efeaead0 100644 --- a/kernel/bpf/verifier.c +++ b/kernel/bpf/verifier.c @@ -1924,13 +1924,17 @@ static int check_func_arg(struct bpf_verifier_env *env, u32 regno, expected_type = CONST_PTR_TO_MAP; if (type != expected_type) goto err_type; - } else if (arg_type == ARG_PTR_TO_CTX) { + } else if (arg_type == ARG_PTR_TO_CTX || + arg_type == ARG_PTR_TO_CTX_OR_NULL) { expected_type = PTR_TO_CTX; - if (type != expected_type) - goto err_type; - err = check_ctx_reg(env, reg, regno); - if (err < 0) - return err; + if (!(register_is_null(reg) && + arg_type == ARG_PTR_TO_CTX_OR_NULL)) { + if (type != expected_type) + goto err_type; + err = check_ctx_reg(env, reg, regno); + if (err < 0) + return err; + } } else if (arg_type_is_mem_ptr(arg_type)) { expected_type = PTR_TO_STACK; /* One exception here. In case function allows for NULL to be diff --git a/net/core/filter.c b/net/core/filter.c index 257bc9276fbb6..6fbad720649dd 100644 --- a/net/core/filter.c +++ b/net/core/filter.c @@ -3872,6 +3872,39 @@ static const struct bpf_func_proto bpf_get_socket_cookie_sock_ops_proto = { .arg1_type = ARG_PTR_TO_CTX, };
+static u64 __bpf_get_netns_cookie(struct sock *sk) +{ +#ifdef CONFIG_NET_NS + return net_gen_cookie(sk ? sk->sk_net.net : &init_net); +#else + return 0; +#endif +} + +BPF_CALL_1(bpf_get_netns_cookie_sock, struct sock *, ctx) +{ + return __bpf_get_netns_cookie(ctx); +} + +static const struct bpf_func_proto bpf_get_netns_cookie_sock_proto = { + .func = bpf_get_netns_cookie_sock, + .gpl_only = false, + .ret_type = RET_INTEGER, + .arg1_type = ARG_PTR_TO_CTX_OR_NULL, +}; + +BPF_CALL_1(bpf_get_netns_cookie_sock_addr, struct bpf_sock_addr_kern *, ctx) +{ + return __bpf_get_netns_cookie(ctx ? ctx->sk : NULL); +} + +static const struct bpf_func_proto bpf_get_netns_cookie_sock_addr_proto = { + .func = bpf_get_netns_cookie_sock_addr, + .gpl_only = false, + .ret_type = RET_INTEGER, + .arg1_type = ARG_PTR_TO_CTX_OR_NULL, +}; + BPF_CALL_1(bpf_get_socket_uid, struct sk_buff *, skb) { struct sock *sk = sk_to_full_sk(skb->sk); @@ -4876,6 +4909,8 @@ sock_filter_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog) return &bpf_get_current_uid_gid_proto; case BPF_FUNC_get_local_storage: return &bpf_get_local_storage_proto; + case BPF_FUNC_get_netns_cookie: + return &bpf_get_netns_cookie_sock_proto; default: return bpf_base_func_proto(func_id); } @@ -4900,6 +4935,8 @@ sock_addr_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog) } case BPF_FUNC_get_socket_cookie: return &bpf_get_socket_cookie_sock_addr_proto; + case BPF_FUNC_get_netns_cookie: + return &bpf_get_netns_cookie_sock_addr_proto; case BPF_FUNC_get_local_storage: return &bpf_get_local_storage_proto; default: diff --git a/net/core/net_namespace.c b/net/core/net_namespace.c index c60123dff8039..973d00a7fb591 100644 --- a/net/core/net_namespace.c +++ b/net/core/net_namespace.c @@ -61,6 +61,20 @@ EXPORT_SYMBOL_GPL(pernet_ops_rwsem);
static unsigned int max_gen_ptrs = INITIAL_NET_GEN_PTRS;
+static atomic64_t cookie_gen; + +u64 net_gen_cookie(struct net *net) +{ + while (1) { + u64 res = atomic64_read(&net->net_cookie); + + if (res) + return res; + res = atomic64_inc_return(&cookie_gen); + atomic64_cmpxchg(&net->net_cookie, 0, res); + } +} + static struct net_generic *net_alloc_generic(void) { struct net_generic *ng; @@ -905,6 +919,7 @@ static int __init net_ns_init(void) panic("Could not allocate generic netns");
rcu_assign_pointer(init_net.gen, ng); + net_gen_cookie(&init_net);
down_write(&pernet_ops_rwsem); if (setup_net(&init_net, &init_user_ns)) diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h index 13944978ada5b..9b587d01e6f14 100644 --- a/tools/include/uapi/linux/bpf.h +++ b/tools/include/uapi/linux/bpf.h @@ -2141,6 +2141,18 @@ union bpf_attr { * request in the skb. * Return * 0 on success, or a negative error in case of failure. + * u64 bpf_get_netns_cookie(void *ctx) + * Description + * Retrieve the cookie (generated by the kernel) of the network + * namespace the input *ctx* is associated with. The network + * namespace cookie remains stable for its lifetime and provides + * a global identifier that can be assumed unique. If *ctx* is + * NULL, then the helper returns the cookie for the initial + * network namespace. The cookie itself is very similar to that + * of bpf_get_socket_cookie() helper, but for network namespaces + * instead of sockets. + * Return + * A 8-byte long opaque number. */ #define __BPF_FUNC_MAPPER(FN) \ FN(unspec), \ @@ -2226,7 +2238,8 @@ union bpf_attr { FN(get_current_cgroup_id), \ FN(get_local_storage), \ FN(sk_select_reuseport), \ - FN(skb_ancestor_cgroup_id), + FN(skb_ancestor_cgroup_id), \ + FN(get_netns_cookie),
/* integer value in 'imm' field of BPF_CALL instruction selects which helper * function eBPF program intends to call
From: Aichun Li liaichun@huawei.com
mainline inclusion
from mainline-v5.11-rc5
commit 1b66d253610c7f8f257103808a9460223a087469
category: bugfix
bugzilla: 47241
CVE: NA
-------------------
As stated in 983695fa6765 ("bpf: fix unconnected udp hooks"), the objective for the existing cgroup connect/sendmsg/recvmsg/bind BPF hooks is to be transparent to applications. In Cilium we make use of these hooks [0] in order to enable E-W load balancing for existing Kubernetes service types for all Cilium managed nodes in the cluster. Those backends can be local or remote. The main advantage of this approach is that it operates as close as possible to the socket, and therefore allows to avoid packet-based NAT given in connect/sendmsg/recvmsg hooks we only need to xlate sock addresses.
This also allows to expose NodePort services on loopback addresses in the host namespace, for example. As another advantage, this also efficiently blocks bind requests for applications in the host namespace for exposed ports. However, one missing item is that we also need to perform reverse xlation for inet{,6}_getname() hooks such that we can return the service IP/port tuple back to the application instead of the remote peer address.
The vast majority of applications do not care about getpeername(), but on a few occasions we've seen breakage when validating the peer's address since it unexpectedly returns the backend tuple instead of the service one. Therefore, this trivial patch adds getpeername() and getsockname() BPF cgroup hooks for both IPv4 and IPv6 so that the returned address can be customised to address this situation.
Simple example:
  # ./cilium/cilium service list
  ID   Frontend     Service Type   Backend
  1    1.2.3.4:80   ClusterIP      1 => 10.0.0.10:80
Before; curl's verbose output example, no getpeername() reverse xlation:
  # curl --verbose 1.2.3.4
  * Rebuilt URL to: 1.2.3.4/
  *   Trying 1.2.3.4...
  * TCP_NODELAY set
  * Connected to 1.2.3.4 (10.0.0.10) port 80 (#0)
  > GET / HTTP/1.1
  > Host: 1.2.3.4
  > User-Agent: curl/7.58.0
  > Accept: */*
[...]
After; with getpeername() reverse xlation:
  # curl --verbose 1.2.3.4
  * Rebuilt URL to: 1.2.3.4/
  *   Trying 1.2.3.4...
  * TCP_NODELAY set
  * Connected to 1.2.3.4 (1.2.3.4) port 80 (#0)
  > GET / HTTP/1.1
  > Host: 1.2.3.4
  > User-Agent: curl/7.58.0
  > Accept: */*
[...]
Originally, I had both under a BPF_CGROUP_INET{4,6}_GETNAME type and exposed peer to the context similar as in inet{,6}_getname() fashion, but API-wise this is suboptimal as it always enforces programs having to test for ctx->peer which can easily be missed, hence BPF_CGROUP_INET{4,6}_GET{PEER,SOCK}NAME split. Similarly, the checked return code is on tnum_range(1, 1), but if a use case comes up in future, it can easily be changed to return an error code instead. Helper and ctx member access is the same as with connect/sendmsg/etc hooks.
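The kind of application-side check that breaks without reverse translation can be sketched as follows. This is a minimal userspace example, not part of the patch: it connects to a loopback listener and compares what getpeername() reports with the address passed to connect(). The function name `peer_matches_connect_addr()` is hypothetical, and the example assumes a loopback interface is available.

```c
#include <arpa/inet.h>
#include <netinet/in.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

/*
 * Connect to a loopback listener and compare getpeername() with the
 * address passed to connect(). Returns 1 on match, 0 on mismatch,
 * -1 on a socket error.
 */
static int peer_matches_connect_addr(void)
{
	struct sockaddr_in srv, peer;
	socklen_t len = sizeof(srv);
	int lfd, cfd, ret = -1;

	lfd = socket(AF_INET, SOCK_STREAM, 0);
	if (lfd < 0)
		return -1;
	cfd = socket(AF_INET, SOCK_STREAM, 0);
	if (cfd < 0)
		goto out_listener;

	memset(&srv, 0, sizeof(srv));
	srv.sin_family = AF_INET;
	srv.sin_addr.s_addr = htonl(INADDR_LOOPBACK);
	srv.sin_port = 0;	/* let the kernel pick a free port */

	/* Bind, listen, then read back the port the kernel assigned. */
	if (bind(lfd, (struct sockaddr *)&srv, sizeof(srv)) < 0 ||
	    listen(lfd, 1) < 0 ||
	    getsockname(lfd, (struct sockaddr *)&srv, &len) < 0)
		goto out;

	if (connect(cfd, (struct sockaddr *)&srv, sizeof(srv)) < 0)
		goto out;

	len = sizeof(peer);
	if (getpeername(cfd, (struct sockaddr *)&peer, &len) < 0)
		goto out;

	ret = peer.sin_addr.s_addr == srv.sin_addr.s_addr &&
	      peer.sin_port == srv.sin_port;
out:
	close(cfd);
out_listener:
	close(lfd);
	return ret;
}
```

On a plain loopback connection the two addresses agree. When a connect() hook rewrites the destination to a backend, the same comparison fails unless a getpeername() hook translates the address back to the service tuple.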
[0] https://github.com/cilium/cilium/blob/master/bpf/bpf_sock.c
Signed-off-by: Daniel Borkmann daniel@iogearbox.net
Signed-off-by: Alexei Starovoitov ast@kernel.org
Acked-by: Andrii Nakryiko andriin@fb.com
Acked-by: Andrey Ignatov rdna@fb.com
Link: https://lore.kernel.org/bpf/61a479d759b2482ae3efb45546490bacd796a220.1589841...

Signed-off-by: Aichun Li liaichun@huawei.com
Reviewed-by: Yue Haibing yuehaibing@huawei.com
Reviewed-by: Di Zhu zhudi21@huawei.com
Signed-off-by: Yang Yingliang yangyingliang@huawei.com
---
 include/linux/bpf-cgroup.h     |  1 +
 include/uapi/linux/bpf.h       |  4 ++++
 kernel/bpf/syscall.c           | 12 ++++++++++++
 kernel/bpf/verifier.c          |  6 +++++-
 net/core/filter.c              |  4 ++++
 net/ipv4/af_inet.c             |  7 ++++++-
 net/ipv6/af_inet6.c            |  7 ++++++-
 tools/include/uapi/linux/bpf.h |  4 ++++
 8 files changed, 42 insertions(+), 3 deletions(-)
diff --git a/include/linux/bpf-cgroup.h b/include/linux/bpf-cgroup.h index 30c9d0247e7f0..c0e359fbffb00 100644 --- a/include/linux/bpf-cgroup.h +++ b/include/linux/bpf-cgroup.h @@ -311,6 +311,7 @@ static inline int bpf_percpu_cgroup_storage_update(struct bpf_map *map, }
#define cgroup_bpf_enabled (0) +#define BPF_CGROUP_RUN_SA_PROG_LOCK(sk, uaddr, type, t_ctx) ({ 0; }) #define BPF_CGROUP_PRE_CONNECT_ENABLED(sk) (0) #define BPF_CGROUP_RUN_PROG_INET_INGRESS(sk,skb) ({ 0; }) #define BPF_CGROUP_RUN_PROG_INET_EGRESS(sk,skb) ({ 0; }) diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h index 858e4c9d57492..0ecca3abc2a92 100644 --- a/include/uapi/linux/bpf.h +++ b/include/uapi/linux/bpf.h @@ -175,6 +175,10 @@ enum bpf_attach_type { BPF_LIRC_MODE2, BPF_CGROUP_UDP4_RECVMSG = 19, BPF_CGROUP_UDP6_RECVMSG, + BPF_CGROUP_INET4_GETPEERNAME, + BPF_CGROUP_INET6_GETPEERNAME, + BPF_CGROUP_INET4_GETSOCKNAME, + BPF_CGROUP_INET6_GETSOCKNAME, __MAX_BPF_ATTACH_TYPE };
diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c index 9bdbe025d8747..80bffe7943f9d 100644 --- a/kernel/bpf/syscall.c +++ b/kernel/bpf/syscall.c @@ -1356,6 +1356,10 @@ bpf_prog_load_check_attach_type(enum bpf_prog_type prog_type, case BPF_CGROUP_INET6_BIND: case BPF_CGROUP_INET4_CONNECT: case BPF_CGROUP_INET6_CONNECT: + case BPF_CGROUP_INET4_GETPEERNAME: + case BPF_CGROUP_INET6_GETPEERNAME: + case BPF_CGROUP_INET4_GETSOCKNAME: + case BPF_CGROUP_INET6_GETSOCKNAME: case BPF_CGROUP_UDP4_SENDMSG: case BPF_CGROUP_UDP6_SENDMSG: case BPF_CGROUP_UDP4_RECVMSG: @@ -1649,6 +1653,10 @@ static int bpf_prog_attach(const union bpf_attr *attr) case BPF_CGROUP_INET6_BIND: case BPF_CGROUP_INET4_CONNECT: case BPF_CGROUP_INET6_CONNECT: + case BPF_CGROUP_INET4_GETPEERNAME: + case BPF_CGROUP_INET6_GETPEERNAME: + case BPF_CGROUP_INET4_GETSOCKNAME: + case BPF_CGROUP_INET6_GETSOCKNAME: case BPF_CGROUP_UDP4_SENDMSG: case BPF_CGROUP_UDP6_SENDMSG: case BPF_CGROUP_UDP4_RECVMSG: @@ -1775,6 +1783,10 @@ static int bpf_prog_query(const union bpf_attr *attr, case BPF_CGROUP_INET6_POST_BIND: case BPF_CGROUP_INET4_CONNECT: case BPF_CGROUP_INET6_CONNECT: + case BPF_CGROUP_INET4_GETPEERNAME: + case BPF_CGROUP_INET6_GETPEERNAME: + case BPF_CGROUP_INET4_GETSOCKNAME: + case BPF_CGROUP_INET6_GETSOCKNAME: case BPF_CGROUP_UDP4_SENDMSG: case BPF_CGROUP_UDP6_SENDMSG: case BPF_CGROUP_UDP4_RECVMSG: diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c index 24488efeaead0..91b6e91a9765d 100644 --- a/kernel/bpf/verifier.c +++ b/kernel/bpf/verifier.c @@ -4300,7 +4300,11 @@ static int check_return_code(struct bpf_verifier_env *env) switch (env->prog->type) { case BPF_PROG_TYPE_CGROUP_SOCK_ADDR: if (env->prog->expected_attach_type == BPF_CGROUP_UDP4_RECVMSG || - env->prog->expected_attach_type == BPF_CGROUP_UDP6_RECVMSG) + env->prog->expected_attach_type == BPF_CGROUP_UDP6_RECVMSG || + env->prog->expected_attach_type == BPF_CGROUP_INET4_GETPEERNAME || + env->prog->expected_attach_type == 
BPF_CGROUP_INET6_GETPEERNAME || + env->prog->expected_attach_type == BPF_CGROUP_INET4_GETSOCKNAME || + env->prog->expected_attach_type == BPF_CGROUP_INET6_GETSOCKNAME) range = tnum_range(1, 1); case BPF_PROG_TYPE_CGROUP_SKB: case BPF_PROG_TYPE_CGROUP_SOCK: diff --git a/net/core/filter.c b/net/core/filter.c index 6fbad720649dd..a1077e879aa42 100644 --- a/net/core/filter.c +++ b/net/core/filter.c @@ -5606,6 +5606,8 @@ static bool sock_addr_is_valid_access(int off, int size, switch (prog->expected_attach_type) { case BPF_CGROUP_INET4_BIND: case BPF_CGROUP_INET4_CONNECT: + case BPF_CGROUP_INET4_GETPEERNAME: + case BPF_CGROUP_INET4_GETSOCKNAME: case BPF_CGROUP_UDP4_SENDMSG: case BPF_CGROUP_UDP4_RECVMSG: break; @@ -5617,6 +5619,8 @@ static bool sock_addr_is_valid_access(int off, int size, switch (prog->expected_attach_type) { case BPF_CGROUP_INET6_BIND: case BPF_CGROUP_INET6_CONNECT: + case BPF_CGROUP_INET6_GETPEERNAME: + case BPF_CGROUP_INET6_GETSOCKNAME: case BPF_CGROUP_UDP6_SENDMSG: case BPF_CGROUP_UDP6_RECVMSG: break; diff --git a/net/ipv4/af_inet.c b/net/ipv4/af_inet.c index 14e10214cf876..8037f88157d2e 100644 --- a/net/ipv4/af_inet.c +++ b/net/ipv4/af_inet.c @@ -758,7 +758,7 @@ EXPORT_SYMBOL(inet_accept); * This does both peername and sockname. */ int inet_getname(struct socket *sock, struct sockaddr *uaddr, - int peer) + int peer) { struct sock *sk = sock->sk; struct inet_sock *inet = inet_sk(sk); @@ -779,6 +779,11 @@ int inet_getname(struct socket *sock, struct sockaddr *uaddr, sin->sin_port = inet->inet_sport; sin->sin_addr.s_addr = addr; } + if (cgroup_bpf_enabled) + BPF_CGROUP_RUN_SA_PROG_LOCK(sk, (struct sockaddr *)sin, + peer ? 
BPF_CGROUP_INET4_GETPEERNAME : + BPF_CGROUP_INET4_GETSOCKNAME, + NULL); memset(sin->sin_zero, 0, sizeof(sin->sin_zero)); return sizeof(*sin); } diff --git a/net/ipv6/af_inet6.c b/net/ipv6/af_inet6.c index 5c2351deedc8f..907ec00dc3349 100644 --- a/net/ipv6/af_inet6.c +++ b/net/ipv6/af_inet6.c @@ -507,7 +507,7 @@ EXPORT_SYMBOL_GPL(inet6_destroy_sock); */
int inet6_getname(struct socket *sock, struct sockaddr *uaddr, - int peer) + int peer) { struct sockaddr_in6 *sin = (struct sockaddr_in6 *)uaddr; struct sock *sk = sock->sk; @@ -535,6 +535,11 @@ int inet6_getname(struct socket *sock, struct sockaddr *uaddr,
sin->sin6_port = inet->inet_sport; } + if (cgroup_bpf_enabled) + BPF_CGROUP_RUN_SA_PROG_LOCK(sk, (struct sockaddr *)sin, + peer ? BPF_CGROUP_INET6_GETPEERNAME : + BPF_CGROUP_INET6_GETSOCKNAME, + NULL); sin->sin6_scope_id = ipv6_iface_scope_id(&sin->sin6_addr, sk->sk_bound_dev_if); return sizeof(*sin); diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h index 9b587d01e6f14..2df44d8ebb892 100644 --- a/tools/include/uapi/linux/bpf.h +++ b/tools/include/uapi/linux/bpf.h @@ -172,6 +172,10 @@ enum bpf_attach_type { BPF_CGROUP_UDP4_SENDMSG, BPF_CGROUP_UDP6_SENDMSG, BPF_LIRC_MODE2, + BPF_CGROUP_INET4_GETPEERNAME, + BPF_CGROUP_INET6_GETPEERNAME, + BPF_CGROUP_INET4_GETSOCKNAME, + BPF_CGROUP_INET6_GETSOCKNAME, __MAX_BPF_ATTACH_TYPE };
From: Aichun Li liaichun@huawei.com
mainline inclusion from mainline-v5.11-rc5 commit cd48bdda4fb82c2fe569d97af4217c530168c99c category: bugfix bugzilla: 47241 CVE: NA
-------------------
Generating and retrieving socket cookies is a useful feature that is exposed to BPF for various program types through the bpf_get_socket_cookie() helper.
The fact that the cookie counter is per netns is quite a limitation for BPF in practice, in particular for programs in the host namespace that use socket cookies as part of a map lookup key: they cause socket cookie collisions, e.g. when attached to BPF cgroup hooks or to cls_bpf on tc egress in the host namespace while handling container traffic from veth or ipvlan devices whose peer is in a different netns. Change the counter to be global instead.
Socket cookie consumers must treat the value as opaque in any case. Not every socket must have a cookie generated, and knowledge of the counter value itself does not provide much value either way, hence the conversion to a global counter is fine.
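The scheme described above can be illustrated outside the kernel. This is a minimal userspace sketch, assuming C11 atomics in place of atomic64_t; struct toy_sock, toy_cookie_gen and toy_gen_cookie() are hypothetical names used only for illustration, not kernel symbols.

```c
#include <stdatomic.h>
#include <stdint.h>

/* Hypothetical stand-in for struct sock; only the cookie field matters here. */
struct toy_sock {
	_Atomic uint64_t cookie;	/* 0 means "no cookie assigned yet" */
};

/* One counter shared by every socket, regardless of network namespace. */
static _Atomic uint64_t toy_cookie_gen;

/* Return the socket's cookie, assigning one on first use. */
static uint64_t toy_gen_cookie(struct toy_sock *sk)
{
	for (;;) {
		uint64_t res = atomic_load(&sk->cookie);
		uint64_t expected = 0;

		if (res)
			return res;
		/* atomic_fetch_add() returns the old value, so add 1 for the new one. */
		res = atomic_fetch_add(&toy_cookie_gen, 1) + 1;
		/* Only the first writer wins; a loser reloads the winner's value. */
		atomic_compare_exchange_strong(&sk->cookie, &expected, res);
	}
}
```

Because the counter is a single global, two sockets can never receive the same non-zero value, whichever namespace they live in. Callers should still treat the result as an opaque identifier.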
Signed-off-by: Daniel Borkmann daniel@iogearbox.net Cc: Eric Dumazet edumazet@google.com Cc: Alexei Starovoitov ast@kernel.org Cc: Willem de Bruijn willemb@google.com Cc: Martynas Pumputis m@lambda.lt Signed-off-by: David S. Miller davem@davemloft.net
Signed-off-by: Aichun Li liaichun@huawei.com Reviewed-by: Yue Haibing yuehaibing@huawei.com Reviewed-by: Di Zhu zhudi21@huawei.com Signed-off-by: Yang Yingliang yangyingliang@huawei.com --- include/net/net_namespace.h | 1 - include/uapi/linux/bpf.h | 4 ++-- net/core/sock_diag.c | 3 ++- 3 files changed, 4 insertions(+), 4 deletions(-)
diff --git a/include/net/net_namespace.h b/include/net/net_namespace.h index b2f080b10819e..7fb3d0fb5ec29 100644 --- a/include/net/net_namespace.h +++ b/include/net/net_namespace.h @@ -58,7 +58,6 @@ struct net { spinlock_t rules_mod_lock;
u32 hash_mix; - atomic64_t cookie_gen;
struct list_head list; /* list of network namespaces */ struct list_head exit_list; /* To linked to call pernet exit diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h index 0ecca3abc2a92..8bbf0a79fe08e 100644 --- a/include/uapi/linux/bpf.h +++ b/include/uapi/linux/bpf.h @@ -1380,8 +1380,8 @@ union bpf_attr { * If no cookie has been set yet, generate a new cookie. Once * generated, the socket cookie remains stable for the life of the * socket. This helper can be useful for monitoring per socket - * networking traffic statistics as it provides a unique socket - * identifier per namespace. + * networking traffic statistics as it provides a global socket + * identifier that can be assumed unique. * Return * A 8-byte long non-decreasing number on success, or 0 if the * socket field is missing inside *skb*. diff --git a/net/core/sock_diag.c b/net/core/sock_diag.c index 3312a5849a974..c13ffbd33d8d6 100644 --- a/net/core/sock_diag.c +++ b/net/core/sock_diag.c @@ -19,6 +19,7 @@ static const struct sock_diag_handler *sock_diag_handlers[AF_MAX]; static int (*inet_rcv_compat)(struct sk_buff *skb, struct nlmsghdr *nlh); static DEFINE_MUTEX(sock_diag_table_mutex); static struct workqueue_struct *broadcast_wq; +static atomic64_t cookie_gen;
u64 sock_gen_cookie(struct sock *sk) { @@ -27,7 +28,7 @@ u64 sock_gen_cookie(struct sock *sk)
if (res) return res; - res = atomic64_inc_return(&sock_net(sk)->cookie_gen); + res = atomic64_inc_return(&cookie_gen); atomic64_cmpxchg(&sk->sk_cookie, 0, res); } }
From: Aichun Li liaichun@huawei.com
mainline inclusion from mainline-v5.11-rc5 commit 3dbc6adac1f3b83fd4c39899c747da7b417e3ffc category: bugfix bugzilla: 47241 CVE: NA
-------------------
Sync BPF uapi header in order to pull in BPF_CGROUP_UDP{4,6}_RECVMSG attach types. This is done and preferred as an extra patch in order to ease sync of libbpf.
Signed-off-by: Daniel Borkmann daniel@iogearbox.net Acked-by: Andrey Ignatov rdna@fb.com Acked-by: Martin KaFai Lau kafai@fb.com Signed-off-by: Alexei Starovoitov ast@kernel.org
Signed-off-by: Aichun Li liaichun@huawei.com Reviewed-by: Yue Haibing yuehaibing@huawei.com Reviewed-by: Di Zhu zhudi21@huawei.com Signed-off-by: Yang Yingliang yangyingliang@huawei.com --- tools/include/uapi/linux/bpf.h | 2 ++ 1 file changed, 2 insertions(+)
diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h index 2df44d8ebb892..722126e295122 100644 --- a/tools/include/uapi/linux/bpf.h +++ b/tools/include/uapi/linux/bpf.h @@ -176,6 +176,8 @@ enum bpf_attach_type { BPF_CGROUP_INET6_GETPEERNAME, BPF_CGROUP_INET4_GETSOCKNAME, BPF_CGROUP_INET6_GETSOCKNAME, + BPF_CGROUP_UDP4_RECVMSG, + BPF_CGROUP_UDP6_RECVMSG, __MAX_BPF_ATTACH_TYPE };
From: Aichun Li liaichun@huawei.com
mainline inclusion from mainline-v5.11-rc5 commit 9bb59ac1f6c362f14b58187bc56e737780c52c19 category: bugfix bugzilla: 47241 CVE: NA
-------------------
Another trivial patch to libbpf in order to enable identifying and attaching programs to BPF_CGROUP_UDP{4,6}_RECVMSG by section name.
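Identification by section name amounts to a table lookup. The sketch below is an assumption-laden illustration, not libbpf code: the enum values, the toy_section_names table and toy_attach_type_by_section() are hypothetical, and matching by prefix is assumed rather than taken from this patch.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical attach-type values; the real ones come from enum bpf_attach_type. */
enum toy_attach_type {
	TOY_UDP4_SENDMSG,
	TOY_UDP6_SENDMSG,
	TOY_UDP4_RECVMSG,
	TOY_UDP6_RECVMSG,
};

/* Section-name table, mirroring the shape of the entries added by this patch. */
static const struct {
	const char *sec;
	enum toy_attach_type type;
} toy_section_names[] = {
	{ "cgroup/sendmsg4", TOY_UDP4_SENDMSG },
	{ "cgroup/sendmsg6", TOY_UDP6_SENDMSG },
	{ "cgroup/recvmsg4", TOY_UDP4_RECVMSG },
	{ "cgroup/recvmsg6", TOY_UDP6_RECVMSG },
};

/* Return 0 and fill *type when sec starts with a known name, -1 otherwise. */
static int toy_attach_type_by_section(const char *sec, enum toy_attach_type *type)
{
	size_t n = sizeof(toy_section_names) / sizeof(toy_section_names[0]);
	size_t i;

	for (i = 0; i < n; i++) {
		size_t len = strlen(toy_section_names[i].sec);

		if (strncmp(sec, toy_section_names[i].sec, len) == 0) {
			*type = toy_section_names[i].type;
			return 0;
		}
	}
	return -1;
}
```

Without the two recvmsg rows, a program placed in a "cgroup/recvmsg4" section would simply not be recognised by such a lookup.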
Signed-off-by: Daniel Borkmann daniel@iogearbox.net Signed-off-by: Alexei Starovoitov ast@kernel.org
Signed-off-by: Aichun Li liaichun@huawei.com Reviewed-by: Yue Haibing yuehaibing@huawei.com Reviewed-by: Di Zhu zhudi21@huawei.com Signed-off-by: Yang Yingliang yangyingliang@huawei.com --- tools/lib/bpf/libbpf.c | 2 ++ 1 file changed, 2 insertions(+)
diff --git a/tools/lib/bpf/libbpf.c b/tools/lib/bpf/libbpf.c index 249fa8d7376e3..0ca56681f9cf5 100644 --- a/tools/lib/bpf/libbpf.c +++ b/tools/lib/bpf/libbpf.c @@ -2135,6 +2135,8 @@ static const struct { BPF_SA_PROG_SEC("cgroup/connect6", BPF_CGROUP_INET6_CONNECT), BPF_SA_PROG_SEC("cgroup/sendmsg4", BPF_CGROUP_UDP4_SENDMSG), BPF_SA_PROG_SEC("cgroup/sendmsg6", BPF_CGROUP_UDP6_SENDMSG), + BPF_SA_PROG_SEC("cgroup/recvmsg4", BPF_CGROUP_UDP4_RECVMSG), + BPF_SA_PROG_SEC("cgroup/recvmsg6", BPF_CGROUP_UDP6_RECVMSG), BPF_S_PROG_SEC("cgroup/post_bind4", BPF_CGROUP_INET4_POST_BIND), BPF_S_PROG_SEC("cgroup/post_bind6", BPF_CGROUP_INET6_POST_BIND), };
From: Aichun Li liaichun@huawei.com
mainline inclusion from mainline-v5.11-rc5 commit 000aa1250d572171807b47fb9cd3fadfbcc36ad0 category: bugfix bugzilla: 47241 CVE: NA
-------------------
Trivial patch to bpftool in order to complete enabling attaching programs to BPF_CGROUP_UDP{4,6}_RECVMSG.
Signed-off-by: Daniel Borkmann daniel@iogearbox.net Acked-by: Andrey Ignatov rdna@fb.com Acked-by: Martin KaFai Lau kafai@fb.com Signed-off-by: Alexei Starovoitov ast@kernel.org
Signed-off-by: Aichun Li liaichun@huawei.com Reviewed-by: Yue Haibing yuehaibing@huawei.com Reviewed-by: Di Zhu zhudi21@huawei.com Signed-off-by: Yang Yingliang yangyingliang@huawei.com --- tools/bpf/bpftool/Documentation/bpftool-cgroup.rst | 6 +++++- tools/bpf/bpftool/Documentation/bpftool-prog.rst | 3 ++- tools/bpf/bpftool/bash-completion/bpftool | 6 +++--- tools/bpf/bpftool/cgroup.c | 5 ++++- tools/bpf/bpftool/prog.c | 3 ++- 5 files changed, 16 insertions(+), 7 deletions(-)
diff --git a/tools/bpf/bpftool/Documentation/bpftool-cgroup.rst b/tools/bpf/bpftool/Documentation/bpftool-cgroup.rst index edbe81534c6d2..6ab451aed998c 100644 --- a/tools/bpf/bpftool/Documentation/bpftool-cgroup.rst +++ b/tools/bpf/bpftool/Documentation/bpftool-cgroup.rst @@ -29,7 +29,7 @@ MAP COMMANDS | *PROG* := { **id** *PROG_ID* | **pinned** *FILE* | **tag** *PROG_TAG* } | *ATTACH_TYPE* := { **ingress** | **egress** | **sock_create** | **sock_ops** | **device** | | **bind4** | **bind6** | **post_bind4** | **post_bind6** | **connect4** | **connect6** | -| **sendmsg4** | **sendmsg6** } +| **sendmsg4** | **sendmsg6** | **recvmsg4** | **recvmsg6** } | *ATTACH_FLAGS* := { **multi** | **override** }
DESCRIPTION @@ -86,6 +86,10 @@ DESCRIPTION unconnected udp4 socket (since 4.18); **sendmsg6** call to sendto(2), sendmsg(2), sendmmsg(2) for an unconnected udp6 socket (since 4.18). + **recvmsg4** call to recvfrom(2), recvmsg(2), recvmmsg(2) for + an unconnected udp4 socket (since 5.2); + **recvmsg6** call to recvfrom(2), recvmsg(2), recvmmsg(2) for + an unconnected udp6 socket (since 5.2);
**bpftool cgroup detach** *CGROUP* *ATTACH_TYPE* *PROG* Detach *PROG* from the cgroup *CGROUP* and attach type diff --git a/tools/bpf/bpftool/Documentation/bpftool-prog.rst b/tools/bpf/bpftool/Documentation/bpftool-prog.rst index 64156a16d5300..72ed624cf173a 100644 --- a/tools/bpf/bpftool/Documentation/bpftool-prog.rst +++ b/tools/bpf/bpftool/Documentation/bpftool-prog.rst @@ -35,7 +35,8 @@ MAP COMMANDS | **cgroup/sock** | **cgroup/dev** | **lwt_in** | **lwt_out** | **lwt_xmit** | | **lwt_seg6local** | **sockops** | **sk_skb** | **sk_msg** | **lirc_mode2** | | **cgroup/bind4** | **cgroup/bind6** | **cgroup/post_bind4** | **cgroup/post_bind6** | -| **cgroup/connect4** | **cgroup/connect6** | **cgroup/sendmsg4** | **cgroup/sendmsg6** +| **cgroup/connect4** | **cgroup/connect6** | **cgroup/sendmsg4** | **cgroup/sendmsg6** | +| **cgroup/recvmsg4** | **cgroup/recvmsg6** | }
diff --git a/tools/bpf/bpftool/bash-completion/bpftool b/tools/bpf/bpftool/bash-completion/bpftool index c2b6b2176f3b7..3f5dfcd9266b6 100644 --- a/tools/bpf/bpftool/bash-completion/bpftool +++ b/tools/bpf/bpftool/bash-completion/bpftool @@ -321,7 +321,7 @@ _bpftool()
case $prev in type) - COMPREPLY=( $( compgen -W "socket kprobe kretprobe classifier action tracepoint raw_tracepoint xdp perf_event cgroup/skb cgroup/sock cgroup/dev lwt_in lwt_out lwt_xmit lwt_seg6local sockops sk_skb sk_msg lirc_mode2 cgroup/bind4 cgroup/bind6 cgroup/connect4 cgroup/connect6 cgroup/sendmsg4 cgroup/sendmsg6 cgroup/post_bind4 cgroup/post_bind6" -- \ + COMPREPLY=( $( compgen -W "socket kprobe kretprobe classifier action tracepoint raw_tracepoint xdp perf_event cgroup/skb cgroup/sock cgroup/dev lwt_in lwt_out lwt_xmit lwt_seg6local sockops sk_skb sk_msg lirc_mode2 cgroup/bind4 cgroup/bind6 cgroup/connect4 cgroup/connect6 cgroup/sendmsg4 cgroup/sendmsg6 cgroup/recvmsg4 cgroup/recvmsg6 cgroup/post_bind4 cgroup/post_bind6" -- \ "$cur" ) ) return 0 ;; @@ -501,7 +501,7 @@ _bpftool() attach|detach) local ATTACH_TYPES='ingress egress sock_create sock_ops \ device bind4 bind6 post_bind4 post_bind6 connect4 \ - connect6 sendmsg4 sendmsg6' + connect6 sendmsg4 sendmsg6 recvmsg4 recvmsg6' local ATTACH_FLAGS='multi override' local PROG_TYPE='id pinned tag' case $prev in @@ -511,7 +511,7 @@ _bpftool() ;; ingress|egress|sock_create|sock_ops|device|bind4|bind6|\ post_bind4|post_bind6|connect4|connect6|sendmsg4|\ - sendmsg6) + sendmsg6|recvmsg4|recvmsg6) COMPREPLY=( $( compgen -W "$PROG_TYPE" -- \ "$cur" ) ) return 0 diff --git a/tools/bpf/bpftool/cgroup.c b/tools/bpf/bpftool/cgroup.c index adbcd84818f74..9ad4cc3ab0b4b 100644 --- a/tools/bpf/bpftool/cgroup.c +++ b/tools/bpf/bpftool/cgroup.c @@ -25,7 +25,8 @@ " ATTACH_TYPE := { ingress | egress | sock_create |\n" \ " sock_ops | device | bind4 | bind6 |\n" \ " post_bind4 | post_bind6 | connect4 |\n" \ - " connect6 | sendmsg4 | sendmsg6 }" + " connect6 | sendmsg4 | sendmsg6 |\n" \ + " recvmsg4 | recvmsg6 }"
static const char * const attach_type_strings[] = { [BPF_CGROUP_INET_INGRESS] = "ingress", @@ -41,6 +42,8 @@ static const char * const attach_type_strings[] = { [BPF_CGROUP_INET6_POST_BIND] = "post_bind6", [BPF_CGROUP_UDP4_SENDMSG] = "sendmsg4", [BPF_CGROUP_UDP6_SENDMSG] = "sendmsg6", + [BPF_CGROUP_UDP4_RECVMSG] = "recvmsg4", + [BPF_CGROUP_UDP6_RECVMSG] = "recvmsg6", [__MAX_BPF_ATTACH_TYPE] = NULL, };
diff --git a/tools/bpf/bpftool/prog.c b/tools/bpf/bpftool/prog.c index 4f9611af46422..2b5476312cba6 100644 --- a/tools/bpf/bpftool/prog.c +++ b/tools/bpf/bpftool/prog.c @@ -958,7 +958,8 @@ static int do_help(int argc, char **argv) " lwt_seg6local | sockops | sk_skb | sk_msg | lirc_mode2 |\n" " cgroup/bind4 | cgroup/bind6 | cgroup/post_bind4 |\n" " cgroup/post_bind6 | cgroup/connect4 | cgroup/connect6 |\n" - " cgroup/sendmsg4 | cgroup/sendmsg6 }\n" + " cgroup/sendmsg4 | cgroup/sendmsg6 | cgroup/recvmsg4 |\n" + " cgroup/recvmsg6 }\n" " " HELP_SPEC_OPTIONS "\n" "", bin_name, argv[-2], bin_name, argv[-2], bin_name, argv[-2],
From: Aichun Li liaichun@huawei.com
mainline inclusion from mainline-v5.11-rc5 commit c04c0d2b968ac45d6ef020316808ef6c82325a82 category: bugfix bugzilla: 47241 CVE: NA
-------------------
Large verifier speed improvements allow the verifier complexity limit to be increased. Regardless of program composition and size, the verifier now takes little time to hit the insn_processed limit. On a typical x86 machine a non-debug kernel processes 1M instructions in 1/10 of a second (before these speed improvements, specially crafted programs could hit multi-second verification times). A full KASAN kernel with debug takes ~1 second for the same 1M insns. Hence bump the BPF_COMPLEXITY_LIMIT_INSNS limit to 1M.
Also increase the number of instructions per program from 4k to the internal BPF_COMPLEXITY_LIMIT_INSNS limit. The 4k limit was confusing to users, since small programs with hundreds of insns could hit the BPF_COMPLEXITY_LIMIT_INSNS limit. Sometimes adding more insns and bpf_trace_printk debug statements would make the verifier accept the program, while removing code would make the verifier reject it.
Some user space applications started to add #define MAX_FOO to their programs and do:
  MAX_FOO=100;
again:
  compile with MAX_FOO;
  try to load;
  if (fails_to_load) { reduce MAX_FOO; goto again; }
to be able to fit the maximum amount of processing into a single program. Other users artificially split their single program into a set of programs and used all 32 iterations of tail_calls to increase compute limits. And the most advanced folks used an unlimited tc-bpf filter list to execute many bpf programs. Essentially, users managed to work around the 4k insn limit.
This patch removes the limit for root programs from uapi. BPF_COMPLEXITY_LIMIT_INSNS is the kernel-internal limit, and whether a program loads no longer depends on program size, but only on the 'smartness' of the verifier. The verifier will continue to get smarter with every kernel release.
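The resulting size check can be sketched in plain C. This is an illustration only: toy_insn_cnt_ok() and the TOY_* macros are hypothetical names, the 4096 value is the legacy 4k cap mentioned above, and the boolean argument stands in for the CAP_SYS_ADMIN test.

```c
#include <stdbool.h>
#include <stdint.h>

#define TOY_MAXINSNS		4096U		/* legacy 4k per-program cap */
#define TOY_COMPLEXITY_LIMIT	1000000U	/* new 1M limit from this patch */

/* Return true when a program of insn_cnt instructions passes the size check. */
static bool toy_insn_cnt_ok(uint32_t insn_cnt, bool privileged)
{
	/* Privileged loaders get the 1M limit; everyone else keeps the 4k cap. */
	uint32_t limit = privileged ? TOY_COMPLEXITY_LIMIT : TOY_MAXINSNS;

	return insn_cnt != 0 && insn_cnt <= limit;
}
```

In this model a 5000-instruction program is rejected for an unprivileged loader but accepted for a privileged one; whether it then passes verification is a separate question.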
Signed-off-by: Alexei Starovoitov ast@kernel.org Signed-off-by: Daniel Borkmann daniel@iogearbox.net
Signed-off-by: Aichun Li liaichun@huawei.com Reviewed-by: Yue Haibing yuehaibing@huawei.com Reviewed-by: Di Zhu zhudi21@huawei.com Signed-off-by: Yang Yingliang yangyingliang@huawei.com --- include/linux/bpf.h | 1 + kernel/bpf/syscall.c | 3 ++- kernel/bpf/verifier.c | 1 - 3 files changed, 3 insertions(+), 2 deletions(-)
diff --git a/include/linux/bpf.h b/include/linux/bpf.h index 0cc25af3457ff..432c4cf2a418d 100644 --- a/include/linux/bpf.h +++ b/include/linux/bpf.h @@ -328,6 +328,7 @@ struct bpf_array { }; };
+#define BPF_COMPLEXITY_LIMIT_INSNS 1000000 /* yes. 1M insns */ #define MAX_TAIL_CALL_CNT 32
struct bpf_event_entry { diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c index 80bffe7943f9d..fc0628d133166 100644 --- a/kernel/bpf/syscall.c +++ b/kernel/bpf/syscall.c @@ -1399,7 +1399,8 @@ static int bpf_prog_load(union bpf_attr *attr) /* eBPF programs must be GPL compatible to use GPL-ed functions */ is_gpl = license_is_gpl_compatible(license);
- if (attr->insn_cnt == 0 || attr->insn_cnt > BPF_MAXINSNS) + if (attr->insn_cnt == 0 || + attr->insn_cnt > (capable(CAP_SYS_ADMIN) ? BPF_COMPLEXITY_LIMIT_INSNS : BPF_MAXINSNS)) return -E2BIG;
if (type == BPF_PROG_TYPE_KPROBE && diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c index 91b6e91a9765d..f6b7920917f92 100644 --- a/kernel/bpf/verifier.c +++ b/kernel/bpf/verifier.c @@ -154,7 +154,6 @@ struct bpf_verifier_stack_elem { struct bpf_verifier_stack_elem *next; };
-#define BPF_COMPLEXITY_LIMIT_INSNS 131072 #define BPF_COMPLEXITY_LIMIT_STACK 1024 #define BPF_COMPLEXITY_LIMIT_STATES 64
From: Aichun Li liaichun@huawei.com
mainline inclusion from mainline-v5.11-rc5 commit 4519efa6f8ea343e43ade21b0189b0b295439202 category: bugfix bugzilla: 47241 CVE: NA
-------------------
On kernels <= 5.1, BPF_PROG_LOAD rejects the verifier log buffer when:
log->len_total > UINT_MAX >> 8 /* (16 * 1024 * 1024) - 1 */
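The arithmetic behind that bound can be checked with a short sketch. TOY_LOG_BUF_MAX and toy_log_len_ok() are hypothetical names, and only the upper bound quoted above is modelled here, not any other validation the kernel performs on the log buffer.

```c
#include <stdbool.h>
#include <stdint.h>

/* Largest log buffer length that passes the upper-bound check quoted above. */
#define TOY_LOG_BUF_MAX (UINT32_MAX >> 8)

/* Return true when len_total does not exceed the upper bound. */
static bool toy_log_len_ok(uint32_t len_total)
{
	return len_total <= TOY_LOG_BUF_MAX;
}
```

UINT32_MAX >> 8 is 0xFFFFFF, one less than 16 * 1024 * 1024, so a buffer of exactly 16 MiB is one byte too large.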
Signed-off-by: McCabe, Robert J robert.mccabe@rockwellcollins.com Signed-off-by: Alexei Starovoitov ast@kernel.org
Signed-off-by: Aichun Li liaichun@huawei.com Reviewed-by: Yue Haibing yuehaibing@huawei.com Reviewed-by: Di Zhu zhudi21@huawei.com Signed-off-by: Yang Yingliang yangyingliang@huawei.com --- tools/lib/bpf/bpf.h | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/tools/lib/bpf/bpf.h b/tools/lib/bpf/bpf.h index c3145ab3bdcac..4e603c5d5d319 100644 --- a/tools/lib/bpf/bpf.h +++ b/tools/lib/bpf/bpf.h @@ -71,7 +71,7 @@ struct bpf_load_program_attr { };
/* Recommend log buffer size */ -#define BPF_LOG_BUF_SIZE (256 * 1024) +#define BPF_LOG_BUF_SIZE (UINT32_MAX >> 8) /* verifier maximum in kernels <= 5.1 */ int bpf_load_program_xattr(const struct bpf_load_program_attr *load_attr, char *log_buf, size_t log_buf_sz); int bpf_load_program(enum bpf_prog_type type, const struct bpf_insn *insns,
From: Aichun Li liaichun@huawei.com
mainline inclusion from mainline-v5.11-rc5 commit 5a95cbb80ef8d8f2db29ab10777cd4742e6fc8ec category: bugfix bugzilla: 47241 CVE: NA
-------------------
Fix a redefinition of 'net_gen_cookie' error that was overlooked when net ns is not configured.
Fixes: f318903c0bf4 ("bpf: Add netns cookie and enable it for bpf cgroup hooks") Reported-by: kbuild test robot lkp@intel.com Signed-off-by: Daniel Borkmann daniel@iogearbox.net
Signed-off-by: Aichun Li liaichun@huawei.com Reviewed-by: Yue Haibing yuehaibing@huawei.com Reviewed-by: Di Zhu zhudi21@huawei.com Signed-off-by: Yang Yingliang yangyingliang@huawei.com --- include/net/net_namespace.h | 9 ++------- 1 file changed, 2 insertions(+), 7 deletions(-)
diff --git a/include/net/net_namespace.h b/include/net/net_namespace.h index 7fb3d0fb5ec29..dbfc56676eeef 100644 --- a/include/net/net_namespace.h +++ b/include/net/net_namespace.h @@ -204,6 +204,8 @@ extern struct list_head net_namespace_list; struct net *get_net_ns_by_pid(pid_t pid); struct net *get_net_ns_by_fd(int fd);
+u64 net_gen_cookie(struct net *net); + #ifdef CONFIG_SYSCTL void ipx_register_sysctl(void); void ipx_unregister_sysctl(void); @@ -252,8 +254,6 @@ static inline int check_net(const struct net *net)
void net_drop_ns(void *);
-u64 net_gen_cookie(struct net *net); - #else
static inline struct net *get_net(struct net *net) @@ -281,11 +281,6 @@ static inline int check_net(const struct net *net) return 1; }
-static inline u64 net_gen_cookie(struct net *net) -{ - return 0; -} - #define net_drop_ns NULL #endif
From: Aichun Li liaichun@huawei.com
mainline inclusion from mainline-v5.11-rc5 commit bcf3a2953d36bbfb9bd44ccb3db0897d935cc485 category: bugfix bugzilla: 47241 CVE: NA
-------------------
The kernel may fail to boot or devices may fail to come up when initializing iscsi_tcp devices starting with Linux 5.8.
Commit a79af8a64d39 ("[SCSI] iscsi_tcp: use iscsi_conn_get_addr_param libiscsi function") introduced getpeername() within the session spinlock.
Commit 1b66d253610c ("bpf: Add get{peer, sock}name attach types for sock_addr") introduced BPF_CGROUP_RUN_SA_PROG_LOCK() within getpeername(), which acquires a mutex and when used from iscsi_tcp devices can now lead to "BUG: scheduling while atomic:" and subsequent damage.
Ensure that the spinlock is released before calling getpeername() or getsockname(). sock_hold() and sock_put() are used to ensure that the socket reference is preserved until getpeername() or getsockname() completes.
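The ordering matters more than the specific calls: take a reference while the lock is held, drop the lock, and only then make the call that may sleep. A minimal userspace sketch of that pattern follows, assuming C11 atomics; every toy_* name is hypothetical, the atomic_flag stands in for the session spinlock, and the reference count is only incremented and decremented (nothing is freed).

```c
#include <stdatomic.h>
#include <stddef.h>

struct toy_sock {
	atomic_int refcnt;	/* models the socket reference count */
	int port;
};

struct toy_conn {
	atomic_flag lock;	/* stands in for the session spinlock */
	struct toy_sock *sock;	/* NULL when not connected */
};

static void toy_lock(struct toy_conn *c)
{
	while (atomic_flag_test_and_set(&c->lock))
		;	/* spin until the flag is released */
}

static void toy_unlock(struct toy_conn *c)
{
	atomic_flag_clear(&c->lock);
}

/* Stand-in for a call that may sleep, such as kernel_getsockname(). */
static int toy_getname(struct toy_sock *sk)
{
	return sk->port;
}

/* Return the port, or -1 when there is no socket (the patch returns -ENOTCONN). */
static int toy_get_port(struct toy_conn *c)
{
	struct toy_sock *sk;
	int rc;

	toy_lock(c);
	sk = c->sock;
	if (!sk) {
		toy_unlock(c);
		return -1;
	}
	atomic_fetch_add(&sk->refcnt, 1);	/* like sock_hold() */
	toy_unlock(c);				/* unlock before the call that may sleep */

	rc = toy_getname(sk);
	atomic_fetch_sub(&sk->refcnt, 1);	/* like sock_put() */
	return rc;
}
```

The extra reference is what makes it safe to keep using the socket pointer after the lock has been dropped.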
Link: https://bugzilla.redhat.com/show_bug.cgi?id=1877345 Link: https://lkml.org/lkml/2020/7/28/1085 Link: https://lkml.org/lkml/2020/8/31/459 Link: https://lore.kernel.org/r/20200928043329.606781-1-mark.mielke@gmail.com Fixes: a79af8a64d39 ("[SCSI] iscsi_tcp: use iscsi_conn_get_addr_param libiscsi function") Fixes: 1b66d253610c ("bpf: Add get{peer, sock}name attach types for sock_addr") Cc: stable@vger.kernel.org Reported-by: Marc Dionne marc.c.dionne@gmail.com Tested-by: Marc Dionne marc.c.dionne@gmail.com Reviewed-by: Mike Christie michael.christie@oracle.com Signed-off-by: Mark Mielke mark.mielke@gmail.com Signed-off-by: Martin K. Petersen martin.petersen@oracle.com Signed-off-by: Aichun Li liaichun@huawei.com Reviewed-by: Yue Haibing yuehaibing@huawei.com Reviewed-by: Di Zhu zhudi21@huawei.com Signed-off-by: Yang Yingliang yangyingliang@huawei.com --- drivers/scsi/iscsi_tcp.c | 22 +++++++++++++++------- 1 file changed, 15 insertions(+), 7 deletions(-)
diff --git a/drivers/scsi/iscsi_tcp.c b/drivers/scsi/iscsi_tcp.c index 93ce990198081..2543d227200fd 100644 --- a/drivers/scsi/iscsi_tcp.c +++ b/drivers/scsi/iscsi_tcp.c @@ -749,6 +749,7 @@ static int iscsi_sw_tcp_conn_get_param(struct iscsi_cls_conn *cls_conn, struct iscsi_tcp_conn *tcp_conn = conn->dd_data; struct iscsi_sw_tcp_conn *tcp_sw_conn = tcp_conn->dd_data; struct sockaddr_in6 addr; + struct socket *sock; int rc;
switch(param) { @@ -760,13 +761,17 @@ static int iscsi_sw_tcp_conn_get_param(struct iscsi_cls_conn *cls_conn, spin_unlock_bh(&conn->session->frwd_lock); return -ENOTCONN; } + sock = tcp_sw_conn->sock; + sock_hold(sock->sk); + spin_unlock_bh(&conn->session->frwd_lock); + if (param == ISCSI_PARAM_LOCAL_PORT) - rc = kernel_getsockname(tcp_sw_conn->sock, + rc = kernel_getsockname(sock, (struct sockaddr *)&addr); else - rc = kernel_getpeername(tcp_sw_conn->sock, + rc = kernel_getpeername(sock, (struct sockaddr *)&addr); - spin_unlock_bh(&conn->session->frwd_lock); + sock_put(sock->sk); if (rc < 0) return rc;
@@ -788,6 +793,7 @@ static int iscsi_sw_tcp_host_get_param(struct Scsi_Host *shost, struct iscsi_tcp_conn *tcp_conn; struct iscsi_sw_tcp_conn *tcp_sw_conn; struct sockaddr_in6 addr; + struct socket *sock; int rc;
switch (param) { @@ -802,16 +808,18 @@ static int iscsi_sw_tcp_host_get_param(struct Scsi_Host *shost, return -ENOTCONN; } tcp_conn = conn->dd_data; - tcp_sw_conn = tcp_conn->dd_data; - if (!tcp_sw_conn->sock) { + sock = tcp_sw_conn->sock; + if (!sock) { spin_unlock_bh(&session->frwd_lock); return -ENOTCONN; } + sock_hold(sock->sk); + spin_unlock_bh(&session->frwd_lock);
- rc = kernel_getsockname(tcp_sw_conn->sock, + rc = kernel_getsockname(sock, (struct sockaddr *)&addr); - spin_unlock_bh(&session->frwd_lock); + sock_put(sock->sk); if (rc < 0) return rc;
From: Ye Bin yebin10@huawei.com
hulk inclusion category: bugfix bugzilla: 46758 CVE: NA
-----------------------------------------------
This reverts commit eed1f8e19630ff89b2d877b660cda03bef92e85b.
Signed-off-by: Ye Bin yebin10@huawei.com Reviewed-by: zhangyi (F) yi.zhang@huawei.com Signed-off-by: Yang Yingliang yangyingliang@huawei.com --- fs/ext4/ext4_jbd2.c | 1 + fs/ext4/file.c | 1 - fs/ext4/inode.c | 1 - fs/ext4/namei.c | 6 ------ fs/ext4/resize.c | 4 ---- fs/ext4/xattr.c | 1 - 6 files changed, 1 insertion(+), 13 deletions(-)
diff --git a/fs/ext4/ext4_jbd2.c b/fs/ext4/ext4_jbd2.c index f9ac7dfd93bf0..a589b7f795582 100644 --- a/fs/ext4/ext4_jbd2.c +++ b/fs/ext4/ext4_jbd2.c @@ -361,6 +361,7 @@ int __ext4_handle_dirty_super(const char *where, unsigned int line, struct buffer_head *bh = EXT4_SB(sb)->s_sbh; int err = 0;
+ ext4_superblock_csum_set(sb); if (ext4_handle_valid(handle)) { err = jbd2_journal_dirty_metadata(handle, bh); if (err) diff --git a/fs/ext4/file.c b/fs/ext4/file.c index 1703871fa2d0f..52d155b4e7334 100644 --- a/fs/ext4/file.c +++ b/fs/ext4/file.c @@ -434,7 +434,6 @@ static int ext4_sample_last_mounted(struct super_block *sb, goto out_journal; strlcpy(sbi->s_es->s_last_mounted, cp, sizeof(sbi->s_es->s_last_mounted)); - ext4_superblock_csum_set(sb); ext4_handle_dirty_super(handle, sb); out_journal: ext4_journal_stop(handle); diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c index 8b70f35c54d05..49ac78cafc781 100644 --- a/fs/ext4/inode.c +++ b/fs/ext4/inode.c @@ -5407,7 +5407,6 @@ static int ext4_do_update_inode(handle_t *handle, if (err) goto out_brelse; ext4_set_feature_large_file(sb); - ext4_superblock_csum_set(sb); ext4_handle_sync(handle); err = ext4_handle_dirty_super(handle, sb); } diff --git a/fs/ext4/namei.c b/fs/ext4/namei.c index 722f4506058de..f68d441803214 100644 --- a/fs/ext4/namei.c +++ b/fs/ext4/namei.c @@ -2891,10 +2891,7 @@ int ext4_orphan_add(handle_t *handle, struct inode *inode) (le32_to_cpu(sbi->s_es->s_inodes_count))) { /* Insert this inode at the head of the on-disk orphan list */ NEXT_ORPHAN(inode) = le32_to_cpu(sbi->s_es->s_last_orphan); - lock_buffer(sbi->s_sbh); sbi->s_es->s_last_orphan = cpu_to_le32(inode->i_ino); - ext4_superblock_csum_set(sb); - unlock_buffer(sbi->s_sbh); dirty = true; } list_add(&EXT4_I(inode)->i_orphan, &sbi->s_orphan); @@ -2977,10 +2974,7 @@ int ext4_orphan_del(handle_t *handle, struct inode *inode) mutex_unlock(&sbi->s_orphan_lock); goto out_brelse; } - lock_buffer(sbi->s_sbh); sbi->s_es->s_last_orphan = cpu_to_le32(ino_next); - ext4_superblock_csum_set(inode->i_sb); - unlock_buffer(sbi->s_sbh); mutex_unlock(&sbi->s_orphan_lock); err = ext4_handle_dirty_super(handle, inode->i_sb); } else { diff --git a/fs/ext4/resize.c b/fs/ext4/resize.c index c2e007d836e47..6a0c5c880354a 100644 --- a/fs/ext4/resize.c +++ 
b/fs/ext4/resize.c @@ -901,7 +901,6 @@ static int add_new_gdb(handle_t *handle, struct inode *inode, ext4_kvfree_array_rcu(o_group_desc);
le16_add_cpu(&es->s_reserved_gdt_blocks, -1); - ext4_superblock_csum_set(sb); err = ext4_handle_dirty_super(handle, sb); if (err) ext4_std_error(sb, err); @@ -1424,7 +1423,6 @@ static void ext4_update_super(struct super_block *sb, * active. */ ext4_r_blocks_count_set(es, ext4_r_blocks_count(es) + reserved_blocks); - ext4_superblock_csum_set(sb);
/* Update the free space counts */ percpu_counter_add(&sbi->s_freeclusters_counter, @@ -1723,7 +1721,6 @@ static int ext4_group_extend_no_check(struct super_block *sb,
ext4_blocks_count_set(es, o_blocks_count + add); ext4_free_blocks_count_set(es, ext4_free_blocks_count(es) + add); - ext4_superblock_csum_set(sb); ext4_debug("freeing blocks %llu through %llu\n", o_blocks_count, o_blocks_count + add); /* We add the blocks to the bitmap and set the group need init bit */ @@ -1885,7 +1882,6 @@ static int ext4_convert_meta_bg(struct super_block *sb, struct inode *inode) ext4_set_feature_meta_bg(sb); sbi->s_es->s_first_meta_bg = cpu_to_le32(num_desc_blocks(sb, sbi->s_groups_count)); - ext4_superblock_csum_set(sb);
err = ext4_handle_dirty_super(handle, sb); if (err) { diff --git a/fs/ext4/xattr.c b/fs/ext4/xattr.c index ae029dccebc1c..24cf730ba6b02 100644 --- a/fs/ext4/xattr.c +++ b/fs/ext4/xattr.c @@ -791,7 +791,6 @@ static void ext4_xattr_update_super_block(handle_t *handle, BUFFER_TRACE(EXT4_SB(sb)->s_sbh, "get_write_access"); if (ext4_journal_get_write_access(handle, EXT4_SB(sb)->s_sbh) == 0) { ext4_set_feature_xattr(sb); - ext4_superblock_csum_set(sb); ext4_handle_dirty_super(handle, sb); } }
From: Jan Kara jack@suse.cz
mainline inclusion from mainline-v5.11-rc1 commit 81414b4dd48f596bf33e1b32c2e43e2047150ca6 category: bugfix bugzilla: 46758 CVE: NA
-----------------------------------------------
Superblock is written out either through ext4_commit_super() or through ext4_handle_dirty_super(). In both cases we recompute the checksum so it is not necessary to recompute it after updating superblock free inodes & blocks counters.
Signed-off-by: Jan Kara jack@suse.cz Reviewed-by: Andreas Dilger adilger@dilger.ca Link: https://lore.kernel.org/r/20201127113405.26867-3-jack@suse.cz Signed-off-by: Theodore Ts'o tytso@mit.edu Signed-off-by: Ye Bin yebin10@huawei.com Reviewed-by: zhangyi (F) yi.zhang@huawei.com Signed-off-by: Yang Yingliang yangyingliang@huawei.com --- fs/ext4/super.c | 2 -- 1 file changed, 2 deletions(-)
diff --git a/fs/ext4/super.c b/fs/ext4/super.c index 18870ae874ab6..a254d7cf869ad 100644 --- a/fs/ext4/super.c +++ b/fs/ext4/super.c @@ -4549,13 +4549,11 @@ static int ext4_fill_super(struct super_block *sb, void *data, int silent) block = ext4_count_free_clusters(sb); ext4_free_blocks_count_set(sbi->s_es, EXT4_C2B(sbi, block)); - ext4_superblock_csum_set(sb); err = percpu_counter_init(&sbi->s_freeclusters_counter, block, GFP_KERNEL); if (!err) { unsigned long freei = ext4_count_free_inodes(sb); sbi->s_es->s_free_inodes_count = cpu_to_le32(freei); - ext4_superblock_csum_set(sb); err = percpu_counter_init(&sbi->s_freeinodes_counter, freei, GFP_KERNEL); }
From: Theodore Ts'o tytso@mit.edu
mainline inclusion from mainline-v5.6-rc1 commit 878520ac45f9f698432d4276db3d9144b83931b6 category: bugfix bugzilla: 46758 CVE: NA
-----------------------------------------------
This allows the cause of an ext4_error() report to be categorized based on whether it was triggered due to an I/O error, or a memory allocation error, or other possible causes. Most errors are caused by a detected file system inconsistency, so the default code stored in the superblock will be EXT4_ERR_EFSCORRUPTED.
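The translation from errno to an architecture-independent code can be pictured as a switch statement. The following is a hedged sketch rather than the patch's ext4_set_errno(): toy_errno_to_code() is a hypothetical name, only a few of the codes defined in fs/ext4/ext4.h are covered, and both the handling of negative values and the EXT4_ERR_UNKNOWN fallback are assumptions.

```c
#include <errno.h>

/* Subset of the EXT4_ERR_* codes this patch adds to fs/ext4/ext4.h. */
#define EXT4_ERR_UNKNOWN	1
#define EXT4_ERR_EIO		2
#define EXT4_ERR_ENOMEM		3
#define EXT4_ERR_ENOSPC		6
#define EXT4_ERR_EROFS		8

/* Translate an errno (positive or negative) into a one-byte on-disk code. */
static unsigned char toy_errno_to_code(int err)
{
	if (err < 0)
		err = -err;	/* assumption: accept kernel-style negative errnos too */

	switch (err) {
	case EIO:
		return EXT4_ERR_EIO;
	case ENOMEM:
		return EXT4_ERR_ENOMEM;
	case ENOSPC:
		return EXT4_ERR_ENOSPC;
	case EROFS:
		return EXT4_ERR_EROFS;
	default:
		return EXT4_ERR_UNKNOWN;	/* assumption: fallback for unlisted errnos */
	}
}
```

Storing a small fixed code instead of the raw errno keeps the superblock field meaningful when the disk is read on a different architecture.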
Link: https://lore.kernel.org/r/20191204032335.7683-1-tytso@mit.edu Signed-off-by: Theodore Ts'o tytso@mit.edu
conflicts: fs/ext4/ext4.h fs/ext4/inode.c fs/ext4/namei.c
Signed-off-by: Ye Bin yebin10@huawei.com Reviewed-by: zhangyi (F) yi.zhang@huawei.com Reviewed-by: zhangyi (F) yi.zhang@huawei.com Signed-off-by: Yang Yingliang yangyingliang@huawei.com --- fs/ext4/balloc.c | 1 + fs/ext4/ext4.h | 30 +++++++++++++++++++- fs/ext4/ext4_jbd2.c | 3 ++ fs/ext4/extents.c | 1 + fs/ext4/ialloc.c | 2 ++ fs/ext4/inline.c | 2 ++ fs/ext4/inode.c | 8 +++++- fs/ext4/mballoc.c | 4 +++ fs/ext4/mmp.c | 6 +++- fs/ext4/namei.c | 4 +++ fs/ext4/super.c | 68 ++++++++++++++++++++++++++++++++++++++++++++- fs/ext4/xattr.c | 4 ++- 12 files changed, 128 insertions(+), 5 deletions(-)
diff --git a/fs/ext4/balloc.c b/fs/ext4/balloc.c index aa4d8702bac21..244087a0d329c 100644 --- a/fs/ext4/balloc.c +++ b/fs/ext4/balloc.c @@ -517,6 +517,7 @@ int ext4_wait_block_bitmap(struct super_block *sb, ext4_group_t block_group, wait_on_buffer(bh); ext4_simulate_fail_bh(sb, bh, EXT4_SIM_BBITMAP_EIO); if (!buffer_uptodate(bh)) { + ext4_set_errno(sb, EIO); ext4_error(sb, "Cannot read block bitmap - " "block_group = %u, block_bitmap = %llu", block_group, (unsigned long long) bh->b_blocknr); diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h index 8797c05f27c0f..0f34b445461dd 100644 --- a/fs/ext4/ext4.h +++ b/fs/ext4/ext4.h @@ -1338,7 +1338,8 @@ struct ext4_super_block { __u8 s_lastcheck_hi; __u8 s_first_error_time_hi; __u8 s_last_error_time_hi; - __u8 s_pad[2]; + __u8 s_first_error_errcode; + __u8 s_last_error_errcode; __le32 s_reserved[96]; /* Padding to the end of the block */ __le32 s_checksum; /* crc32c(superblock) */ }; @@ -1562,6 +1563,32 @@ static inline int ext4_valid_inum(struct super_block *sb, unsigned long ino) ino <= le32_to_cpu(EXT4_SB(sb)->s_es->s_inodes_count)); }
+/* + * Error number codes for s_{first,last}_error_errno + * + * Linux errno numbers are architecture specific, so we need to translate + * them into something which is architecture independent. We don't define + * codes for all errno's; just the ones which are most likely to be the cause + * of an ext4_error() call. + */ +#define EXT4_ERR_UNKNOWN 1 +#define EXT4_ERR_EIO 2 +#define EXT4_ERR_ENOMEM 3 +#define EXT4_ERR_EFSBADCRC 4 +#define EXT4_ERR_EFSCORRUPTED 5 +#define EXT4_ERR_ENOSPC 6 +#define EXT4_ERR_ENOKEY 7 +#define EXT4_ERR_EROFS 8 +#define EXT4_ERR_EFBIG 9 +#define EXT4_ERR_EEXIST 10 +#define EXT4_ERR_ERANGE 11 +#define EXT4_ERR_EOVERFLOW 12 +#define EXT4_ERR_EBUSY 13 +#define EXT4_ERR_ENOTDIR 14 +#define EXT4_ERR_ENOTEMPTY 15 +#define EXT4_ERR_ESHUTDOWN 16 +#define EXT4_ERR_EFAULT 17 + /* * Simulate_fail codes */ @@ -2693,6 +2720,7 @@ extern const char *ext4_decode_error(struct super_block *sb, int errno, extern void ext4_mark_group_bitmap_corrupted(struct super_block *sb, ext4_group_t block_group, unsigned int flags); +extern void ext4_set_errno(struct super_block *sb, int err);
extern __printf(4, 5) void __ext4_error(struct super_block *, const char *, unsigned int, diff --git a/fs/ext4/ext4_jbd2.c b/fs/ext4/ext4_jbd2.c index a589b7f795582..c43632cf98862 100644 --- a/fs/ext4/ext4_jbd2.c +++ b/fs/ext4/ext4_jbd2.c @@ -58,6 +58,7 @@ static int ext4_journal_check_start(struct super_block *sb) * take the FS itself readonly cleanly. */ if (journal && is_journal_aborted(journal)) { + ext4_set_errno(sb, -journal->j_errno); ext4_abort(sb, "Detected aborted journal"); return -EROFS; } @@ -273,6 +274,7 @@ int __ext4_forget(const char *where, unsigned int line, handle_t *handle, if (err) { ext4_journal_abort_handle(where, line, __func__, bh, handle, err); + ext4_set_errno(inode->i_sb, -err); __ext4_abort(inode->i_sb, where, line, "error %d when attempting revoke", err); } @@ -345,6 +347,7 @@ int __ext4_handle_dirty_metadata(const char *where, unsigned int line, es = EXT4_SB(inode->i_sb)->s_es; es->s_last_error_block = cpu_to_le64(bh->b_blocknr); + ext4_set_errno(inode->i_sb, EIO); ext4_error_inode(inode, where, line, bh->b_blocknr, "IO error syncing itable block"); diff --git a/fs/ext4/extents.c b/fs/ext4/extents.c index 232ba564c7f71..f8cb7d75ae7d4 100644 --- a/fs/ext4/extents.c +++ b/fs/ext4/extents.c @@ -492,6 +492,7 @@ static int __ext4_ext_check(const char *function, unsigned int line, return 0;
corrupted: + ext4_set_errno(inode->i_sb, -err); ext4_error_inode(inode, function, line, 0, "pblk %llu bad header/extent: %s - magic %x, " "entries %u, max %u(%u), depth %u(%u)", diff --git a/fs/ext4/ialloc.c b/fs/ext4/ialloc.c index f7989081ff540..770d023faa2ea 100644 --- a/fs/ext4/ialloc.c +++ b/fs/ext4/ialloc.c @@ -197,6 +197,7 @@ ext4_read_inode_bitmap(struct super_block *sb, ext4_group_t block_group) ext4_simulate_fail_bh(sb, bh, EXT4_SIM_IBITMAP_EIO); if (!buffer_uptodate(bh)) { put_bh(bh); + ext4_set_errno(sb, EIO); ext4_error(sb, "Cannot read inode bitmap - " "block_group = %u, inode_bitmap = %llu", block_group, bitmap_blk); @@ -1236,6 +1237,7 @@ struct inode *ext4_orphan_get(struct super_block *sb, unsigned long ino) inode = ext4_iget(sb, ino, EXT4_IGET_NORMAL); if (IS_ERR(inode)) { err = PTR_ERR(inode); + ext4_set_errno(sb, -err); ext4_error(sb, "couldn't read orphan inode %lu (err %d)", ino, err); return inode; diff --git a/fs/ext4/inline.c b/fs/ext4/inline.c index c952461876595..c8ddb8f99c22a 100644 --- a/fs/ext4/inline.c +++ b/fs/ext4/inline.c @@ -98,6 +98,7 @@ int ext4_get_max_inline_size(struct inode *inode)
error = ext4_get_inode_loc(inode, &iloc); if (error) { + ext4_set_errno(inode->i_sb, -error); ext4_error_inode(inode, __func__, __LINE__, 0, "can't get inode location %lu", inode->i_ino); @@ -1764,6 +1765,7 @@ bool empty_inline_dir(struct inode *dir, int *has_inline_data)
err = ext4_get_inode_loc(dir, &iloc); if (err) { + ext4_set_errno(dir->i_sb, -err); EXT4_ERROR_INODE(dir, "error %d getting inode %lu block", err, dir->i_ino); return true; diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c index 49ac78cafc781..548374c49dc59 100644 --- a/fs/ext4/inode.c +++ b/fs/ext4/inode.c @@ -288,6 +288,7 @@ void ext4_evict_inode(struct inode *inode) if (inode->i_blocks) { err = ext4_truncate(inode); if (err) { + ext4_set_errno(inode->i_sb, -err); ext4_error(inode->i_sb, "couldn't truncate inode %lu (err %d)", inode->i_ino, err); @@ -2596,10 +2597,12 @@ static int mpage_map_and_submit_extent(handle_t *handle, EXT4_I(inode)->i_disksize = disksize; up_write(&EXT4_I(inode)->i_data_sem); err2 = ext4_mark_inode_dirty(handle, inode); - if (err2) + if (err2) { + ext4_set_errno(inode->i_sb, -err2); ext4_error(inode->i_sb, "Failed to mark inode %lu dirty", inode->i_ino); + } if (!err) err = err2; } @@ -4735,6 +4738,7 @@ static int __ext4_get_inode_loc(struct inode *inode, wait_on_buffer(bh); if (!buffer_uptodate(bh)) { simulate_eio: + ext4_set_errno(inode->i_sb, EIO); EXT4_ERROR_INODE_BLOCK(inode, block, "unable to read itable block"); brelse(bh); @@ -4944,6 +4948,7 @@ struct inode *__ext4_iget(struct super_block *sb, unsigned long ino,
if (!ext4_inode_csum_verify(inode, raw_inode, ei) || ext4_simulate_fail(sb, EXT4_SIM_INODE_CRC)) { + ext4_set_errno(inode->i_sb, EFSBADCRC); ext4_error_inode(inode, function, line, 0, "iget: checksum invalid"); ret = -EFSBADCRC; @@ -5492,6 +5497,7 @@ int ext4_write_inode(struct inode *inode, struct writeback_control *wbc) if (wbc->sync_mode == WB_SYNC_ALL && !wbc->for_sync) sync_dirty_buffer(iloc.bh); if (buffer_req(iloc.bh) && !buffer_uptodate(iloc.bh)) { + ext4_set_errno(inode->i_sb, EIO); EXT4_ERROR_INODE_BLOCK(inode, iloc.bh->b_blocknr, "IO error syncing inode"); err = -EIO; diff --git a/fs/ext4/mballoc.c b/fs/ext4/mballoc.c index 23e94193c8b4b..69de2abbdff5f 100644 --- a/fs/ext4/mballoc.c +++ b/fs/ext4/mballoc.c @@ -3923,6 +3923,7 @@ ext4_mb_discard_group_preallocations(struct super_block *sb, bitmap_bh = ext4_read_block_bitmap(sb, group); if (IS_ERR(bitmap_bh)) { err = PTR_ERR(bitmap_bh); + ext4_set_errno(sb, -err); ext4_error(sb, "Error %d reading block bitmap for %u", err, group); return 0; @@ -4091,6 +4092,7 @@ void ext4_discard_preallocations(struct inode *inode) err = ext4_mb_load_buddy_gfp(sb, group, &e4b, GFP_NOFS|__GFP_NOFAIL); if (err) { + ext4_set_errno(sb, -err); ext4_error(sb, "Error %d loading buddy information for %u", err, group); continue; @@ -4099,6 +4101,7 @@ void ext4_discard_preallocations(struct inode *inode) bitmap_bh = ext4_read_block_bitmap(sb, group); if (IS_ERR(bitmap_bh)) { err = PTR_ERR(bitmap_bh); + ext4_set_errno(sb, -err); ext4_error(sb, "Error %d reading block bitmap for %u", err, group); ext4_mb_unload_buddy(&e4b); @@ -4353,6 +4356,7 @@ ext4_mb_discard_lg_preallocations(struct super_block *sb, err = ext4_mb_load_buddy_gfp(sb, group, &e4b, GFP_NOFS|__GFP_NOFAIL); if (err) { + ext4_set_errno(sb, -err); ext4_error(sb, "Error %d loading buddy information for %u", err, group); continue; diff --git a/fs/ext4/mmp.c b/fs/ext4/mmp.c index 9d00e0dd2ba99..87f7551c5132e 100644 --- a/fs/ext4/mmp.c +++ b/fs/ext4/mmp.c @@ -174,8 +174,10 @@ 
static int kmmpd(void *data) * (s_mmp_update_interval * 60) seconds. */ if (retval) { - if ((failed_writes % 60) == 0) + if ((failed_writes % 60) == 0) { + ext4_set_errno(sb, -retval); ext4_error(sb, "Error writing to MMP block"); + } failed_writes++; }
@@ -206,6 +208,7 @@ static int kmmpd(void *data)
retval = read_mmp_block(sb, &bh_check, mmp_block); if (retval) { + ext4_set_errno(sb, -retval); ext4_error(sb, "error reading MMP data: %d", retval); goto exit_thread; @@ -219,6 +222,7 @@ static int kmmpd(void *data) "Error while updating MMP info. " "The filesystem seems to have been" " multiply mounted."); + ext4_set_errno(sb, EBUSY); ext4_error(sb, "abort"); put_bh(bh_check); retval = -EBUSY; diff --git a/fs/ext4/namei.c b/fs/ext4/namei.c index f68d441803214..d1012089222f1 100644 --- a/fs/ext4/namei.c +++ b/fs/ext4/namei.c @@ -159,6 +159,7 @@ static struct buffer_head *__ext4_read_dirblock(struct inode *inode, !ext4_simulate_fail(inode->i_sb, EXT4_SIM_DIRBLOCK_CRC)) set_buffer_verified(bh); else { + ext4_set_errno(inode->i_sb, EFSBADCRC); ext4_error_inode(inode, func, line, block, "Directory index failed checksum"); brelse(bh); @@ -170,6 +171,7 @@ static struct buffer_head *__ext4_read_dirblock(struct inode *inode, !ext4_simulate_fail(inode->i_sb, EXT4_SIM_DIRBLOCK_CRC)) set_buffer_verified(bh); else { + ext4_set_errno(inode->i_sb, EFSBADCRC); ext4_error_inode(inode, func, line, block, "Directory block failed checksum"); brelse(bh); @@ -1450,6 +1452,7 @@ static struct buffer_head *__ext4_find_entry(struct inode *dir, goto next; wait_on_buffer(bh); if (!buffer_uptodate(bh)) { + ext4_set_errno(sb, EIO); EXT4_ERROR_INODE(dir, "reading directory lblock %lu", (unsigned long) block); brelse(bh); @@ -1461,6 +1464,7 @@ static struct buffer_head *__ext4_find_entry(struct inode *dir, (struct ext4_dir_entry *)bh->b_data) && !ext4_dirent_csum_verify(dir, (struct ext4_dir_entry *)bh->b_data)) { + ext4_set_errno(sb, EFSBADCRC); EXT4_ERROR_INODE(dir, "checksumming directory " "block %lu", (unsigned long)block); brelse(bh); diff --git a/fs/ext4/super.c b/fs/ext4/super.c index a254d7cf869ad..8f643d3149232 100644 --- a/fs/ext4/super.c +++ b/fs/ext4/super.c @@ -375,6 +375,8 @@ static void __save_error_info(struct super_block *sb, const char *func, ext4_update_tstamp(es, 
s_last_error_time); strncpy(es->s_last_error_func, func, sizeof(es->s_last_error_func)); es->s_last_error_line = cpu_to_le32(line); + if (es->s_last_error_errcode == 0) + es->s_last_error_errcode = EXT4_ERR_EFSCORRUPTED; if (!es->s_first_error_time) { es->s_first_error_time = es->s_last_error_time; es->s_first_error_time_hi = es->s_last_error_time_hi; @@ -383,6 +385,7 @@ static void __save_error_info(struct super_block *sb, const char *func, es->s_first_error_line = cpu_to_le32(line); es->s_first_error_ino = es->s_last_error_ino; es->s_first_error_block = es->s_last_error_block; + es->s_first_error_errcode = es->s_last_error_errcode; } /* * Start the daily error reporting function if it hasn't been @@ -682,6 +685,66 @@ const char *ext4_decode_error(struct super_block *sb, int errno, return errstr; }
+void ext4_set_errno(struct super_block *sb, int err)
+{
+	if (err < 0)
+		err = -err;
+
+	switch (err) {
+	case EIO:
+		err = EXT4_ERR_EIO;
+		break;
+	case ENOMEM:
+		err = EXT4_ERR_ENOMEM;
+		break;
+	case EFSBADCRC:
+		err = EXT4_ERR_EFSBADCRC;
+		break;
+	case EFSCORRUPTED:
+		err = EXT4_ERR_EFSCORRUPTED;
+		break;
+	case ENOSPC:
+		err = EXT4_ERR_ENOSPC;
+		break;
+	case ENOKEY:
+		err = EXT4_ERR_ENOKEY;
+		break;
+	case EROFS:
+		err = EXT4_ERR_EROFS;
+		break;
+	case EFBIG:
+		err = EXT4_ERR_EFBIG;
+		break;
+	case EEXIST:
+		err = EXT4_ERR_EEXIST;
+		break;
+	case ERANGE:
+		err = EXT4_ERR_ERANGE;
+		break;
+	case EOVERFLOW:
+		err = EXT4_ERR_EOVERFLOW;
+		break;
+	case EBUSY:
+		err = EXT4_ERR_EBUSY;
+		break;
+	case ENOTDIR:
+		err = EXT4_ERR_ENOTDIR;
+		break;
+	case ENOTEMPTY:
+		err = EXT4_ERR_ENOTEMPTY;
+		break;
+	case ESHUTDOWN:
+		err = EXT4_ERR_ESHUTDOWN;
+		break;
+	case EFAULT:
+		err = EXT4_ERR_EFAULT;
+		break;
+	default:
+		err = EXT4_ERR_UNKNOWN;
+	}
+	EXT4_SB(sb)->s_es->s_last_error_errcode = err;
+}
+
 /* __ext4_std_error decodes expected errors from journaling functions
  * automatically and invokes the appropriate error response. */

@@ -706,6 +769,7 @@ void __ext4_std_error(struct super_block *sb, const char *function,
 		       sb->s_id, function, line, errstr);
 	}

+	ext4_set_errno(sb, -errno);
 	save_error_info(sb, function, line);
 	ext4_handle_error(sb);
 }
@@ -1033,8 +1097,10 @@ static void ext4_put_super(struct super_block *sb)
 		aborted = is_journal_aborted(sbi->s_journal);
 		err = jbd2_journal_destroy(sbi->s_journal);
 		sbi->s_journal = NULL;
-		if ((err < 0) && !aborted)
+		if ((err < 0) && !aborted) {
+			ext4_set_errno(sb, -err);
 			ext4_abort(sb, "Couldn't clean up the journal");
+		}
 	}

 	ext4_unregister_sysfs(sb);
diff --git a/fs/ext4/xattr.c b/fs/ext4/xattr.c
index 24cf730ba6b02..7781e34c8ce24 100644
--- a/fs/ext4/xattr.c
+++ b/fs/ext4/xattr.c
@@ -2886,9 +2886,11 @@ int ext4_xattr_delete_inode(handle_t *handle, struct inode *inode,
 		bh = ext4_sb_bread(inode->i_sb, EXT4_I(inode)->i_file_acl, REQ_PRIO);
 		if (IS_ERR(bh)) {
 			error = PTR_ERR(bh);
-			if (error == -EIO)
+			if (error == -EIO) {
+				ext4_set_errno(inode->i_sb, EIO);
 				EXT4_ERROR_INODE(inode, "block %llu read error",
 						 EXT4_I(inode)->i_file_acl);
+			}
 			bh = NULL;
 			goto cleanup;
 		}
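Note (illustration only, not part of the patch): the errno translation that ext4_set_errno() performs above can be sketched in userspace as a table lookup. The numeric codes are copied from the EXT4_ERR_* defines in this patch (a subset); the names errcode_map, errcode_table, errno_to_errcode and errcode_demo are hypothetical, and the table form is an assumption made for brevity where the patch uses a switch statement.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Architecture-independent codes, values taken from the EXT4_ERR_* defines. */
enum {
	ERRCODE_UNKNOWN = 1,
	ERRCODE_EIO     = 2,
	ERRCODE_ENOMEM  = 3,
	ERRCODE_ENOSPC  = 6,
	ERRCODE_EROFS   = 8,
	ERRCODE_EBUSY   = 13,
};

/* Hypothetical pairing of a host errno with its stored code. */
struct errcode_map {
	int host_errno;
	unsigned char disk_code;
};

static const struct errcode_map errcode_table[] = {
	{ EIO,    ERRCODE_EIO },
	{ ENOMEM, ERRCODE_ENOMEM },
	{ ENOSPC, ERRCODE_ENOSPC },
	{ EROFS,  ERRCODE_EROFS },
	{ EBUSY,  ERRCODE_EBUSY },
};

/* Accepts either sign, as ext4_set_errno() does; unlisted values map to UNKNOWN. */
static unsigned char errno_to_errcode(int err)
{
	size_t i;

	if (err < 0)
		err = -err;
	for (i = 0; i < sizeof(errcode_table) / sizeof(errcode_table[0]); i++)
		if (errcode_table[i].host_errno == err)
			return errcode_table[i].disk_code;
	return ERRCODE_UNKNOWN;
}

/* Small self-check exercising both signs and the fallback. */
void errcode_demo(void)
{
	assert(errno_to_errcode(EIO) == ERRCODE_EIO);
	assert(errno_to_errcode(-EIO) == ERRCODE_EIO);
	assert(errno_to_errcode(EPERM) == ERRCODE_UNKNOWN);
}
```

The stored value is a single byte, which is why the patch can carve s_first_error_errcode and s_last_error_errcode out of the former two-byte s_pad field.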
From: Theodore Ts'o tytso@mit.edu
mainline inclusion
from mainline-v5.7-rc1
commit 54d3adbc29f0c7c53890da1683e629cd220d7201
category: bugfix
bugzilla: 46758
CVE: NA
-----------------------------------------------
Using a separate function, ext4_set_errno(), to set the errno is problematic because it doesn't do the right thing once s_last_error_errcode is non-zero. It's also less racy to set all of the error information at once. (Also, as a bonus, it shrinks code size slightly.)
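Note (illustration only, not part of the commit): a minimal userspace sketch of the problem described above. It assumes a toy superblock with only two fields; toy_es, old_set_errno, old_save_error_info, new_save_error_info and stale_errcode_demo are hypothetical names. The old path mirrors the "default only when the code is zero" check that the previous patch added to __save_error_info(); the new path mirrors passing the error together with the rest of the error information.

```c
#include <assert.h>

enum { CODE_EIO = 2, CODE_EFSCORRUPTED = 5 };

/* Toy stand-in for the two superblock fields involved. */
struct toy_es {
	unsigned char last_errcode;
	unsigned int last_line;
};

/* Old scheme, step 1: a separate setter records the code. */
static void old_set_errno(struct toy_es *es, unsigned char code)
{
	es->last_errcode = code;
}

/* Old scheme, step 2: the default is applied only while the field is zero. */
static void old_save_error_info(struct toy_es *es, unsigned int line)
{
	es->last_line = line;
	if (es->last_errcode == 0)
		es->last_errcode = CODE_EFSCORRUPTED;
}

/* New scheme: the code travels with the report; 0 means "corrupted". */
static void new_save_error_info(struct toy_es *es, unsigned char code,
				unsigned int line)
{
	es->last_line = line;
	es->last_errcode = code ? code : CODE_EFSCORRUPTED;
}

void stale_errcode_demo(void)
{
	struct toy_es old_es = {0};
	struct toy_es new_es = {0};

	/* First an I/O error, then a corruption report with no setter call. */
	old_set_errno(&old_es, CODE_EIO);
	old_save_error_info(&old_es, 100);
	old_save_error_info(&old_es, 200);
	assert(old_es.last_line == 200);
	assert(old_es.last_errcode == CODE_EIO);	/* stale code survives */

	new_save_error_info(&new_es, CODE_EIO, 100);
	new_save_error_info(&new_es, 0, 200);
	assert(new_es.last_errcode == CODE_EFSCORRUPTED);
}
```

In the old path the second report keeps the EIO code from the first one, so the line number and the error code stored together no longer describe the same event.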
Link: https://lore.kernel.org/r/20200329020404.686965-1-tytso@mit.edu
Fixes: 878520ac45f9 ("ext4: save the error code which triggered...")
Signed-off-by: Theodore Ts'o tytso@mit.edu

conflicts:
fs/ext4/balloc.c
fs/ext4/block_validity.c
fs/ext4/ialloc.c
fs/ext4/inode.c
fs/ext4/namei.c
fs/ext4/super.c

Signed-off-by: Ye Bin yebin10@huawei.com
Reviewed-by: zhangyi (F) yi.zhang@huawei.com
Signed-off-by: Yang Yingliang yangyingliang@huawei.com
---
 fs/ext4/balloc.c         |   7 +-
 fs/ext4/block_validity.c |  13 ++-
 fs/ext4/ext4.h           |  54 ++++++++-----
 fs/ext4/ext4_jbd2.c      |  13 +--
 fs/ext4/extents.c        |  27 +++----
 fs/ext4/ialloc.c         |  13 ++-
 fs/ext4/indirect.c       |   2 +-
 fs/ext4/inline.c         |  13 ++-
 fs/ext4/inode.c          |  29 +++----
 fs/ext4/mballoc.c        |  21 +++--
 fs/ext4/mmp.c            |  13 ++-
 fs/ext4/move_extent.c    |   4 +-
 fs/ext4/namei.c          |  24 +++---
 fs/ext4/super.c          | 166 ++++++++++++++++++---------------------
 fs/ext4/xattr.c          |  10 +--
 15 files changed, 197 insertions(+), 212 deletions(-)
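
Note (illustration only, not part of the patch): the ext4.h hunks below change __printf(4, 5) to __printf(6, 7) on __ext4_error() because two parameters (the error code and the block number) are inserted ahead of the format string. The kernel's __printf(a, b) expands to the compiler's format(printf, a, b) attribute, whose indices are 1-based parameter positions. A minimal userspace sketch, assuming GCC or Clang; toy_sb, toy_error and toy_error_demo are hypothetical names:

```c
#include <assert.h>
#include <stdarg.h>
#include <stdio.h>
#include <string.h>

/* Toy stand-in for the state the error path records. */
struct toy_sb {
	char msg[128];
	int error;
	unsigned long long block;
};

/*
 * The format string is parameter 6 and the variadic arguments start at
 * parameter 7, matching the shape of the new __ext4_error() prototype.
 */
__attribute__((format(printf, 6, 7)))
void toy_error(struct toy_sb *sb, const char *func, unsigned int line,
	       int error, unsigned long long block, const char *fmt, ...);

void toy_error(struct toy_sb *sb, const char *func, unsigned int line,
	       int error, unsigned long long block, const char *fmt, ...)
{
	va_list args;
	int n;

	/* Error code, block and message are recorded in one call. */
	sb->error = error;
	sb->block = block;
	n = snprintf(sb->msg, sizeof(sb->msg), "%s:%u: ", func, line);
	if (n < 0 || (size_t)n >= sizeof(sb->msg))
		return;
	va_start(args, fmt);
	vsnprintf(sb->msg + n, sizeof(sb->msg) - (size_t)n, fmt, args);
	va_end(args);
}

void toy_error_demo(void)
{
	struct toy_sb sb = {0};

	toy_error(&sb, "caller", 42, 5, 7, "bad block %llu", 7ULL);
	assert(sb.error == 5);
	assert(sb.block == 7);
	assert(strcmp(sb.msg, "caller:42: bad block 7") == 0);
}
```

If the attribute indices were left at the old positions, the compiler would check the wrong parameter as the format string, so the index change has to accompany the signature change.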
diff --git a/fs/ext4/balloc.c b/fs/ext4/balloc.c index 244087a0d329c..7c92728276951 100644 --- a/fs/ext4/balloc.c +++ b/fs/ext4/balloc.c @@ -517,10 +517,9 @@ int ext4_wait_block_bitmap(struct super_block *sb, ext4_group_t block_group, wait_on_buffer(bh); ext4_simulate_fail_bh(sb, bh, EXT4_SIM_BBITMAP_EIO); if (!buffer_uptodate(bh)) { - ext4_set_errno(sb, EIO); - ext4_error(sb, "Cannot read block bitmap - " - "block_group = %u, block_bitmap = %llu", - block_group, (unsigned long long) bh->b_blocknr); + ext4_error_err(sb, EIO, "Cannot read block bitmap - " + "block_group = %u, block_bitmap = %llu", + block_group, (unsigned long long) bh->b_blocknr); ext4_mark_group_bitmap_corrupted(sb, block_group, EXT4_GROUP_INFO_BBITMAP_CORRUPT); return -EIO; diff --git a/fs/ext4/block_validity.c b/fs/ext4/block_validity.c index 2471577d5c09e..868c386282022 100644 --- a/fs/ext4/block_validity.c +++ b/fs/ext4/block_validity.c @@ -172,9 +172,11 @@ static int ext4_protect_reserved_inode(struct super_block *sb, err = add_system_zone(system_blks, map.m_pblk, n, ino); if (err < 0) { if (err == -EFSCORRUPTED) { - ext4_error(sb, "blocks %llu-%llu from inode %u " - "overlap system zone", map.m_pblk, - map.m_pblk + map.m_len - 1, ino); + __ext4_error(sb, __func__, __LINE__, -err, + map.m_pblk, "blocks %llu-%llu " + "from inode %u overlap system zone", + map.m_pblk, + map.m_pblk + map.m_len - 1, ino); } break; } @@ -304,7 +306,6 @@ int ext4_inode_block_valid(struct inode *inode, ext4_fsblk_t start_blk, if ((start_blk <= le32_to_cpu(sbi->s_es->s_first_data_block)) || (start_blk + count < start_blk) || (start_blk + count > ext4_blocks_count(sbi->s_es))) { - sbi->s_es->s_last_error_block = cpu_to_le64(start_blk); return 0; }
@@ -327,8 +328,6 @@ int ext4_inode_block_valid(struct inode *inode, ext4_fsblk_t start_blk, n = n->rb_right; else { ret = (entry->ino == inode->i_ino); - if (!ret) - sbi->s_es->s_last_error_block = cpu_to_le64(start_blk); break; } } @@ -340,7 +339,6 @@ int ext4_inode_block_valid(struct inode *inode, ext4_fsblk_t start_blk, int ext4_check_blockref(const char *function, unsigned int line, struct inode *inode, __le32 *p, unsigned int max) { - struct ext4_super_block *es = EXT4_SB(inode->i_sb)->s_es; __le32 *bref = p; unsigned int blk;
@@ -353,7 +351,6 @@ int ext4_check_blockref(const char *function, unsigned int line, blk = le32_to_cpu(*bref++); if (blk && unlikely(!ext4_inode_block_valid(inode, blk, 1))) { - es->s_last_error_block = cpu_to_le64(blk); ext4_error_inode(inode, function, line, blk, "invalid block"); return -EFSCORRUPTED; diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h index 0f34b445461dd..052eb06815fa7 100644 --- a/fs/ext4/ext4.h +++ b/fs/ext4/ext4.h @@ -2720,21 +2720,20 @@ extern const char *ext4_decode_error(struct super_block *sb, int errno, extern void ext4_mark_group_bitmap_corrupted(struct super_block *sb, ext4_group_t block_group, unsigned int flags); -extern void ext4_set_errno(struct super_block *sb, int err);
-extern __printf(4, 5)
-void __ext4_error(struct super_block *, const char *, unsigned int,
+extern __printf(6, 7)
+void __ext4_error(struct super_block *, const char *, unsigned int, int, __u64,
 		  const char *, ...);
-extern __printf(5, 6)
-void __ext4_error_inode(struct inode *, const char *, unsigned int, ext4_fsblk_t,
-		      const char *, ...);
+extern __printf(6, 7)
+void __ext4_error_inode(struct inode *, const char *, unsigned int,
+			ext4_fsblk_t, int, const char *, ...);
 extern __printf(5, 6)
 void __ext4_error_file(struct file *, const char *, unsigned int, ext4_fsblk_t,
 		     const char *, ...);
 extern void __ext4_std_error(struct super_block *, const char *,
 			     unsigned int, int);
-extern __printf(4, 5)
-void __ext4_abort(struct super_block *, const char *, unsigned int,
+extern __printf(5, 6)
+void __ext4_abort(struct super_block *, const char *, unsigned int, int,
 		  const char *, ...);
 extern __printf(4, 5)
 void __ext4_warning(struct super_block *, const char *, unsigned int,
@@ -2755,8 +2754,12 @@ void __ext4_grp_locked_error(const char *, unsigned int,
 #define EXT4_ERROR_INODE(inode, fmt, a...) \
 	ext4_error_inode((inode), __func__, __LINE__, 0, (fmt), ## a)

-#define EXT4_ERROR_INODE_BLOCK(inode, block, fmt, a...)			\
-	ext4_error_inode((inode), __func__, __LINE__, (block), (fmt), ## a)
+#define EXT4_ERROR_INODE_ERR(inode, err, fmt, a...)			\
+	__ext4_error_inode((inode), __func__, __LINE__, 0, (err), (fmt), ## a)
+
+#define ext4_error_inode_block(inode, block, err, fmt, a...)		\
+	__ext4_error_inode((inode), __func__, __LINE__, (block), (err),	\
+			   (fmt), ## a)

 #define EXT4_ERROR_FILE(file, block, fmt, a...)				\
 	ext4_error_file((file), __func__, __LINE__, (block), (fmt), ## a)
@@ -2764,13 +2767,18 @@ void __ext4_grp_locked_error(const char *, unsigned int,
 #ifdef CONFIG_PRINTK

 #define ext4_error_inode(inode, func, line, block, fmt, ...)		\
-	__ext4_error_inode(inode, func, line, block, fmt, ##__VA_ARGS__)
+	__ext4_error_inode(inode, func, line, block, 0, fmt, ##__VA_ARGS__)
+#define ext4_error_inode_err(inode, func, line, block, err, fmt, ...)	\
+	__ext4_error_inode((inode), (func), (line), (block),		\
+			   (err), (fmt), ##__VA_ARGS__)
 #define ext4_error_file(file, func, line, block, fmt, ...)		\
 	__ext4_error_file(file, func, line, block, fmt, ##__VA_ARGS__)
 #define ext4_error(sb, fmt, ...)					\
-	__ext4_error(sb, __func__, __LINE__, fmt, ##__VA_ARGS__)
-#define ext4_abort(sb, fmt, ...)					\
-	__ext4_abort(sb, __func__, __LINE__, fmt, ##__VA_ARGS__)
+	__ext4_error((sb), __func__, __LINE__, 0, 0, (fmt), ##__VA_ARGS__)
+#define ext4_error_err(sb, err, fmt, ...)				\
+	__ext4_error((sb), __func__, __LINE__, (err), 0, (fmt), ##__VA_ARGS__)
+#define ext4_abort(sb, err, fmt, ...)					\
+	__ext4_abort((sb), __func__, __LINE__, (err), (fmt), ##__VA_ARGS__)
 #define ext4_warning(sb, fmt, ...)					\
 	__ext4_warning(sb, __func__, __LINE__, fmt, ##__VA_ARGS__)
 #define ext4_warning_inode(inode, fmt, ...)				\
@@ -2788,7 +2796,12 @@ void __ext4_grp_locked_error(const char *, unsigned int,
 #define ext4_error_inode(inode, func, line, block, fmt, ...)		\
 do {									\
 	no_printk(fmt, ##__VA_ARGS__);					\
-	__ext4_error_inode(inode, "", 0, block, " ");			\
+	__ext4_error_inode(inode, "", 0, block, 0, " ");		\
+} while (0)
+#define ext4_error_inode_err(inode, func, line, block, err, fmt, ...)	\
+do {									\
+	no_printk(fmt, ##__VA_ARGS__);					\
+	__ext4_error_inode(inode, "", 0, block, err, " ");		\
 } while (0)
 #define ext4_error_file(file, func, line, block, fmt, ...)		\
 do {									\
@@ -2798,12 +2811,17 @@ do {									\
 #define ext4_error(sb, fmt, ...)					\
 do {									\
 	no_printk(fmt, ##__VA_ARGS__);					\
-	__ext4_error(sb, "", 0, " ");					\
+	__ext4_error(sb, "", 0, 0, 0, " ");				\
+} while (0)
+#define ext4_error_err(sb, err, fmt, ...)				\
+do {									\
+	no_printk(fmt, ##__VA_ARGS__);					\
+	__ext4_error(sb, "", 0, err, 0, " ");				\
 } while (0)
-#define ext4_abort(sb, fmt, ...)					\
+#define ext4_abort(sb, err, fmt, ...)					\
 do {									\
 	no_printk(fmt, ##__VA_ARGS__);					\
-	__ext4_abort(sb, "", 0, " ");					\
+	__ext4_abort(sb, "", 0, err, " ");				\
 } while (0)
 #define ext4_warning(sb, fmt, ...)					\
 do {									\
diff --git a/fs/ext4/ext4_jbd2.c b/fs/ext4/ext4_jbd2.c
index c43632cf98862..35ce16e690d37 100644
--- a/fs/ext4/ext4_jbd2.c
+++ b/fs/ext4/ext4_jbd2.c
@@ -58,8 +58,7 @@ static int ext4_journal_check_start(struct super_block *sb)
 	 * take the FS itself readonly cleanly.
 	 */
 	if (journal && is_journal_aborted(journal)) {
-		ext4_set_errno(sb, -journal->j_errno);
-		ext4_abort(sb, "Detected aborted journal");
+		ext4_abort(sb, -journal->j_errno, "Detected aborted journal");
 		return -EROFS;
 	}
 	return 0;
@@ -274,8 +273,7 @@ int __ext4_forget(const char *where, unsigned int line, handle_t *handle,
 	if (err) {
 		ext4_journal_abort_handle(where, line, __func__,
 					  bh, handle, err);
-		ext4_set_errno(inode->i_sb, -err);
-		__ext4_abort(inode->i_sb, where, line,
+		__ext4_abort(inode->i_sb, where, line, -err,
 			   "error %d when attempting revoke", err);
 	}
 	BUFFER_TRACE(bh, "exit");
@@ -345,11 +343,8 @@ int __ext4_handle_dirty_metadata(const char *where, unsigned int line,
 				struct ext4_super_block *es;
es = EXT4_SB(inode->i_sb)->s_es; - es->s_last_error_block = - cpu_to_le64(bh->b_blocknr); - ext4_set_errno(inode->i_sb, EIO); - ext4_error_inode(inode, where, line, - bh->b_blocknr, + ext4_error_inode_err(inode, where, line, + bh->b_blocknr, EIO, "IO error syncing itable block"); err = -EIO; } diff --git a/fs/ext4/extents.c b/fs/ext4/extents.c index f8cb7d75ae7d4..ebf024258e3c2 100644 --- a/fs/ext4/extents.c +++ b/fs/ext4/extents.c @@ -401,8 +401,8 @@ static int ext4_valid_extent_idx(struct inode *inode, }
static int ext4_valid_extent_entries(struct inode *inode, - struct ext4_extent_header *eh, - int depth) + struct ext4_extent_header *eh, + ext4_fsblk_t *pblk, int depth) { unsigned short entries; if (eh->eh_entries == 0) @@ -413,8 +413,6 @@ static int ext4_valid_extent_entries(struct inode *inode, if (depth == 0) { /* leaf entries */ struct ext4_extent *ext = EXT_FIRST_EXTENT(eh); - struct ext4_super_block *es = EXT4_SB(inode->i_sb)->s_es; - ext4_fsblk_t pblock = 0; ext4_lblk_t lblock = 0; ext4_lblk_t prev = 0; int len = 0; @@ -426,8 +424,7 @@ static int ext4_valid_extent_entries(struct inode *inode, lblock = le32_to_cpu(ext->ee_block); len = ext4_ext_get_actual_len(ext); if ((lblock <= prev) && prev) { - pblock = ext4_ext_pblock(ext); - es->s_last_error_block = cpu_to_le64(pblock); + *pblk = ext4_ext_pblock(ext); return 0; } ext++; @@ -474,7 +471,7 @@ static int __ext4_ext_check(const char *function, unsigned int line, error_msg = "invalid eh_entries"; goto corrupted; } - if (!ext4_valid_extent_entries(inode, eh, depth)) { + if (!ext4_valid_extent_entries(inode, eh, &pblk, depth)) { error_msg = "invalid extent entries"; goto corrupted; } @@ -492,14 +489,14 @@ static int __ext4_ext_check(const char *function, unsigned int line, return 0;
corrupted: - ext4_set_errno(inode->i_sb, -err); - ext4_error_inode(inode, function, line, 0, - "pblk %llu bad header/extent: %s - magic %x, " - "entries %u, max %u(%u), depth %u(%u)", - (unsigned long long) pblk, error_msg, - le16_to_cpu(eh->eh_magic), - le16_to_cpu(eh->eh_entries), le16_to_cpu(eh->eh_max), - max, le16_to_cpu(eh->eh_depth), depth); + ext4_error_inode_err(inode, function, line, 0, -err, + "pblk %llu bad header/extent: %s - magic %x, " + "entries %u, max %u(%u), depth %u(%u)", + (unsigned long long) pblk, error_msg, + le16_to_cpu(eh->eh_magic), + le16_to_cpu(eh->eh_entries), + le16_to_cpu(eh->eh_max), + max, le16_to_cpu(eh->eh_depth), depth); return err; }
diff --git a/fs/ext4/ialloc.c b/fs/ext4/ialloc.c index 770d023faa2ea..141bfa55ed682 100644 --- a/fs/ext4/ialloc.c +++ b/fs/ext4/ialloc.c @@ -197,10 +197,9 @@ ext4_read_inode_bitmap(struct super_block *sb, ext4_group_t block_group) ext4_simulate_fail_bh(sb, bh, EXT4_SIM_IBITMAP_EIO); if (!buffer_uptodate(bh)) { put_bh(bh); - ext4_set_errno(sb, EIO); - ext4_error(sb, "Cannot read inode bitmap - " - "block_group = %u, inode_bitmap = %llu", - block_group, bitmap_blk); + ext4_error_err(sb, EIO, "Cannot read inode bitmap - " + "block_group = %u, inode_bitmap = %llu", + block_group, bitmap_blk); ext4_mark_group_bitmap_corrupted(sb, block_group, EXT4_GROUP_INFO_IBITMAP_CORRUPT); return ERR_PTR(-EIO); @@ -1237,9 +1236,9 @@ struct inode *ext4_orphan_get(struct super_block *sb, unsigned long ino) inode = ext4_iget(sb, ino, EXT4_IGET_NORMAL); if (IS_ERR(inode)) { err = PTR_ERR(inode); - ext4_set_errno(sb, -err); - ext4_error(sb, "couldn't read orphan inode %lu (err %d)", - ino, err); + ext4_error_err(sb, -err, + "couldn't read orphan inode %lu (err %d)", + ino, err); return inode; }
diff --git a/fs/ext4/indirect.c b/fs/ext4/indirect.c index 0385e94a2120c..de42b31728550 100644 --- a/fs/ext4/indirect.c +++ b/fs/ext4/indirect.c @@ -1049,7 +1049,7 @@ static void ext4_free_branches(handle_t *handle, struct inode *inode, * (should be rare). */ if (!bh) { - EXT4_ERROR_INODE_BLOCK(inode, nr, + ext4_error_inode_block(inode, nr, EIO, "Read failure"); continue; } diff --git a/fs/ext4/inline.c b/fs/ext4/inline.c index c8ddb8f99c22a..7eeed4d50deac 100644 --- a/fs/ext4/inline.c +++ b/fs/ext4/inline.c @@ -98,10 +98,9 @@ int ext4_get_max_inline_size(struct inode *inode)
error = ext4_get_inode_loc(inode, &iloc); if (error) { - ext4_set_errno(inode->i_sb, -error); - ext4_error_inode(inode, __func__, __LINE__, 0, - "can't get inode location %lu", - inode->i_ino); + ext4_error_inode_err(inode, __func__, __LINE__, 0, -error, + "can't get inode location %lu", + inode->i_ino); return 0; }
@@ -1765,9 +1764,9 @@ bool empty_inline_dir(struct inode *dir, int *has_inline_data)
err = ext4_get_inode_loc(dir, &iloc); if (err) { - ext4_set_errno(dir->i_sb, -err); - EXT4_ERROR_INODE(dir, "error %d getting inode %lu block", - err, dir->i_ino); + EXT4_ERROR_INODE_ERR(dir, -err, + "error %d getting inode %lu block", + err, dir->i_ino); return true; }
diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c index 548374c49dc59..b5f0e5b552668 100644 --- a/fs/ext4/inode.c +++ b/fs/ext4/inode.c @@ -288,10 +288,9 @@ void ext4_evict_inode(struct inode *inode) if (inode->i_blocks) { err = ext4_truncate(inode); if (err) { - ext4_set_errno(inode->i_sb, -err); - ext4_error(inode->i_sb, - "couldn't truncate inode %lu (err %d)", - inode->i_ino, err); + ext4_error_err(inode->i_sb, -err, + "couldn't truncate inode %lu (err %d)", + inode->i_ino, err); goto stop_handle; } } @@ -2598,10 +2597,9 @@ static int mpage_map_and_submit_extent(handle_t *handle, up_write(&EXT4_I(inode)->i_data_sem); err2 = ext4_mark_inode_dirty(handle, inode); if (err2) { - ext4_set_errno(inode->i_sb, -err2); - ext4_error(inode->i_sb, - "Failed to mark inode %lu dirty", - inode->i_ino); + ext4_error_err(inode->i_sb, -err2, + "Failed to mark inode %lu dirty", + inode->i_ino); } if (!err) err = err2; @@ -4738,8 +4736,7 @@ static int __ext4_get_inode_loc(struct inode *inode, wait_on_buffer(bh); if (!buffer_uptodate(bh)) { simulate_eio: - ext4_set_errno(inode->i_sb, EIO); - EXT4_ERROR_INODE_BLOCK(inode, block, + ext4_error_inode_block(inode, block, EIO, "unable to read itable block"); brelse(bh); return -EIO; @@ -4885,7 +4882,7 @@ struct inode *__ext4_iget(struct super_block *sb, unsigned long ino, (ino > le32_to_cpu(EXT4_SB(sb)->s_es->s_inodes_count))) { if (flags & EXT4_IGET_HANDLE) return ERR_PTR(-ESTALE); - __ext4_error(sb, function, line, + __ext4_error(sb, function, line, EFSCORRUPTED, 0, "inode #%lu: comm %s: iget: illegal inode #", ino, current->comm); return ERR_PTR(-EFSCORRUPTED); @@ -4948,9 +4945,8 @@ struct inode *__ext4_iget(struct super_block *sb, unsigned long ino,
if (!ext4_inode_csum_verify(inode, raw_inode, ei) || ext4_simulate_fail(sb, EXT4_SIM_INODE_CRC)) { - ext4_set_errno(inode->i_sb, EFSBADCRC); - ext4_error_inode(inode, function, line, 0, - "iget: checksum invalid"); + ext4_error_inode_err(inode, function, line, 0, EFSBADCRC, + "iget: checksum invalid"); ret = -EFSBADCRC; goto bad_inode; } @@ -5497,9 +5493,8 @@ int ext4_write_inode(struct inode *inode, struct writeback_control *wbc) if (wbc->sync_mode == WB_SYNC_ALL && !wbc->for_sync) sync_dirty_buffer(iloc.bh); if (buffer_req(iloc.bh) && !buffer_uptodate(iloc.bh)) { - ext4_set_errno(inode->i_sb, EIO); - EXT4_ERROR_INODE_BLOCK(inode, iloc.bh->b_blocknr, - "IO error syncing inode"); + ext4_error_inode_block(inode, iloc.bh->b_blocknr, EIO, + "IO error syncing inode"); err = -EIO; } brelse(iloc.bh); diff --git a/fs/ext4/mballoc.c b/fs/ext4/mballoc.c index 69de2abbdff5f..4a4641ac35841 100644 --- a/fs/ext4/mballoc.c +++ b/fs/ext4/mballoc.c @@ -3923,9 +3923,9 @@ ext4_mb_discard_group_preallocations(struct super_block *sb, bitmap_bh = ext4_read_block_bitmap(sb, group); if (IS_ERR(bitmap_bh)) { err = PTR_ERR(bitmap_bh); - ext4_set_errno(sb, -err); - ext4_error(sb, "Error %d reading block bitmap for %u", - err, group); + ext4_error_err(sb, -err, + "Error %d reading block bitmap for %u", + err, group); return 0; }
@@ -4092,18 +4092,16 @@ void ext4_discard_preallocations(struct inode *inode) err = ext4_mb_load_buddy_gfp(sb, group, &e4b, GFP_NOFS|__GFP_NOFAIL); if (err) { - ext4_set_errno(sb, -err); - ext4_error(sb, "Error %d loading buddy information for %u", - err, group); + ext4_error_err(sb, -err, "Error %d loading buddy information for %u", + err, group); continue; }
bitmap_bh = ext4_read_block_bitmap(sb, group); if (IS_ERR(bitmap_bh)) { err = PTR_ERR(bitmap_bh); - ext4_set_errno(sb, -err); - ext4_error(sb, "Error %d reading block bitmap for %u", - err, group); + ext4_error_err(sb, -err, "Error %d reading block bitmap for %u", + err, group); ext4_mb_unload_buddy(&e4b); continue; } @@ -4356,9 +4354,8 @@ ext4_mb_discard_lg_preallocations(struct super_block *sb, err = ext4_mb_load_buddy_gfp(sb, group, &e4b, GFP_NOFS|__GFP_NOFAIL); if (err) { - ext4_set_errno(sb, -err); - ext4_error(sb, "Error %d loading buddy information for %u", - err, group); + ext4_error_err(sb, -err, "Error %d loading buddy information for %u", + err, group); continue; } ext4_lock_group(sb, group); diff --git a/fs/ext4/mmp.c b/fs/ext4/mmp.c index 87f7551c5132e..d34cb8c466556 100644 --- a/fs/ext4/mmp.c +++ b/fs/ext4/mmp.c @@ -175,8 +175,8 @@ static int kmmpd(void *data) */ if (retval) { if ((failed_writes % 60) == 0) { - ext4_set_errno(sb, -retval); - ext4_error(sb, "Error writing to MMP block"); + ext4_error_err(sb, -retval, + "Error writing to MMP block"); } failed_writes++; } @@ -208,9 +208,9 @@ static int kmmpd(void *data)
retval = read_mmp_block(sb, &bh_check, mmp_block); if (retval) { - ext4_set_errno(sb, -retval); - ext4_error(sb, "error reading MMP data: %d", - retval); + ext4_error_err(sb, -retval, + "error reading MMP data: %d", + retval); goto exit_thread; }
@@ -222,8 +222,7 @@ static int kmmpd(void *data) "Error while updating MMP info. " "The filesystem seems to have been" " multiply mounted."); - ext4_set_errno(sb, EBUSY); - ext4_error(sb, "abort"); + ext4_error_err(sb, EBUSY, "abort"); put_bh(bh_check); retval = -EBUSY; goto exit_thread; diff --git a/fs/ext4/move_extent.c b/fs/ext4/move_extent.c index 30628c4ae777e..c2b288cd78839 100644 --- a/fs/ext4/move_extent.c +++ b/fs/ext4/move_extent.c @@ -422,8 +422,8 @@ move_extent_per_page(struct file *o_filp, struct inode *donor_inode, block_len_in_page, 0, &err2); ext4_double_up_write_data_sem(orig_inode, donor_inode); if (replaced_count != block_len_in_page) { - EXT4_ERROR_INODE_BLOCK(orig_inode, (sector_t)(orig_blk_offset), - "Unable to copy data block," + ext4_error_inode_block(orig_inode, (sector_t)(orig_blk_offset), + EIO, "Unable to copy data block," " data will be lost."); *err = -EIO; } diff --git a/fs/ext4/namei.c b/fs/ext4/namei.c index d1012089222f1..cb51b7aaacb42 100644 --- a/fs/ext4/namei.c +++ b/fs/ext4/namei.c @@ -159,9 +159,9 @@ static struct buffer_head *__ext4_read_dirblock(struct inode *inode, !ext4_simulate_fail(inode->i_sb, EXT4_SIM_DIRBLOCK_CRC)) set_buffer_verified(bh); else { - ext4_set_errno(inode->i_sb, EFSBADCRC); - ext4_error_inode(inode, func, line, block, - "Directory index failed checksum"); + ext4_error_inode_err(inode, func, line, block, + EFSBADCRC, + "Directory index failed checksum"); brelse(bh); return ERR_PTR(-EFSBADCRC); } @@ -171,9 +171,9 @@ static struct buffer_head *__ext4_read_dirblock(struct inode *inode, !ext4_simulate_fail(inode->i_sb, EXT4_SIM_DIRBLOCK_CRC)) set_buffer_verified(bh); else { - ext4_set_errno(inode->i_sb, EFSBADCRC); - ext4_error_inode(inode, func, line, block, - "Directory block failed checksum"); + ext4_error_inode_err(inode, func, line, block, + EFSBADCRC, + "Directory block failed checksum"); brelse(bh); return ERR_PTR(-EFSBADCRC); } @@ -1452,9 +1452,9 @@ static struct buffer_head *__ext4_find_entry(struct 
inode *dir, goto next; wait_on_buffer(bh); if (!buffer_uptodate(bh)) { - ext4_set_errno(sb, EIO); - EXT4_ERROR_INODE(dir, "reading directory lblock %lu", - (unsigned long) block); + EXT4_ERROR_INODE_ERR(dir, EIO, + "reading directory lblock %lu", + (unsigned long) block); brelse(bh); ret = ERR_PTR(-EIO); goto cleanup_and_exit; @@ -1464,9 +1464,9 @@ static struct buffer_head *__ext4_find_entry(struct inode *dir, (struct ext4_dir_entry *)bh->b_data) && !ext4_dirent_csum_verify(dir, (struct ext4_dir_entry *)bh->b_data)) { - ext4_set_errno(sb, EFSBADCRC); - EXT4_ERROR_INODE(dir, "checksumming directory " - "block %lu", (unsigned long)block); + EXT4_ERROR_INODE_ERR(dir, EFSBADCRC, + "checksumming directory " + "block %lu", (unsigned long)block); brelse(bh); ret = ERR_PTR(-EFSBADCRC); goto cleanup_and_exit; diff --git a/fs/ext4/super.c b/fs/ext4/super.c index 8f643d3149232..ee35d55a823a8 100644 --- a/fs/ext4/super.c +++ b/fs/ext4/super.c @@ -363,10 +363,12 @@ static time64_t __ext4_get_tstamp(__le32 *lo, __u8 *hi) #define ext4_get_tstamp(es, tstamp) \ __ext4_get_tstamp(&(es)->tstamp, &(es)->tstamp ## _hi)
-static void __save_error_info(struct super_block *sb, const char *func, - unsigned int line) +static void __save_error_info(struct super_block *sb, int error, + __u32 ino, __u64 block, + const char *func, unsigned int line) { struct ext4_super_block *es = EXT4_SB(sb)->s_es; + int err;
EXT4_SB(sb)->s_mount_state |= EXT4_ERROR_FS; if (bdev_read_only(sb->s_bdev)) @@ -375,8 +377,62 @@ static void __save_error_info(struct super_block *sb, const char *func, ext4_update_tstamp(es, s_last_error_time); strncpy(es->s_last_error_func, func, sizeof(es->s_last_error_func)); es->s_last_error_line = cpu_to_le32(line); - if (es->s_last_error_errcode == 0) - es->s_last_error_errcode = EXT4_ERR_EFSCORRUPTED; + es->s_last_error_ino = cpu_to_le32(ino); + es->s_last_error_block = cpu_to_le64(block); + switch (error) { + case EIO: + err = EXT4_ERR_EIO; + break; + case ENOMEM: + err = EXT4_ERR_ENOMEM; + break; + case EFSBADCRC: + err = EXT4_ERR_EFSBADCRC; + break; + case 0: + case EFSCORRUPTED: + err = EXT4_ERR_EFSCORRUPTED; + break; + case ENOSPC: + err = EXT4_ERR_ENOSPC; + break; + case ENOKEY: + err = EXT4_ERR_ENOKEY; + break; + case EROFS: + err = EXT4_ERR_EROFS; + break; + case EFBIG: + err = EXT4_ERR_EFBIG; + break; + case EEXIST: + err = EXT4_ERR_EEXIST; + break; + case ERANGE: + err = EXT4_ERR_ERANGE; + break; + case EOVERFLOW: + err = EXT4_ERR_EOVERFLOW; + break; + case EBUSY: + err = EXT4_ERR_EBUSY; + break; + case ENOTDIR: + err = EXT4_ERR_ENOTDIR; + break; + case ENOTEMPTY: + err = EXT4_ERR_ENOTEMPTY; + break; + case ESHUTDOWN: + err = EXT4_ERR_ESHUTDOWN; + break; + case EFAULT: + err = EXT4_ERR_EFAULT; + break; + default: + err = EXT4_ERR_UNKNOWN; + } + es->s_last_error_errcode = err; if (!es->s_first_error_time) { es->s_first_error_time = es->s_last_error_time; es->s_first_error_time_hi = es->s_last_error_time_hi; @@ -396,10 +452,11 @@ static void __save_error_info(struct super_block *sb, const char *func, le32_add_cpu(&es->s_error_count, 1); }
-static void save_error_info(struct super_block *sb, const char *func, - unsigned int line) +static void save_error_info(struct super_block *sb, int error, + __u32 ino, __u64 block, + const char *func, unsigned int line) { - __save_error_info(sb, func, line); + __save_error_info(sb, error, ino, block, func, line); if (!bdev_read_only(sb->s_bdev)) ext4_commit_super(sb, 1); } @@ -548,7 +605,8 @@ static void ext4_handle_error(struct super_block *sb) "EXT4-fs error")
void __ext4_error(struct super_block *sb, const char *function, - unsigned int line, const char *fmt, ...) + unsigned int line, int error, __u64 block, + const char *fmt, ...) { struct va_format vaf; va_list args; @@ -566,24 +624,21 @@ void __ext4_error(struct super_block *sb, const char *function, sb->s_id, function, line, current->comm, &vaf); va_end(args); } - save_error_info(sb, function, line); + save_error_info(sb, error, 0, block, function, line); ext4_handle_error(sb); }
void __ext4_error_inode(struct inode *inode, const char *function, - unsigned int line, ext4_fsblk_t block, + unsigned int line, ext4_fsblk_t block, int error, const char *fmt, ...) { va_list args; struct va_format vaf; - struct ext4_super_block *es = EXT4_SB(inode->i_sb)->s_es;
if (unlikely(ext4_forced_shutdown(EXT4_SB(inode->i_sb)))) return;
trace_ext4_error(inode->i_sb, function, line); - es->s_last_error_ino = cpu_to_le32(inode->i_ino); - es->s_last_error_block = cpu_to_le64(block); if (ext4_error_ratelimit(inode->i_sb)) { va_start(args, fmt); vaf.fmt = fmt; @@ -600,7 +655,8 @@ void __ext4_error_inode(struct inode *inode, const char *function, current->comm, &vaf); va_end(args); } - save_error_info(inode->i_sb, function, line); + save_error_info(inode->i_sb, error, inode->i_ino, block, + function, line); ext4_handle_error(inode->i_sb); }
@@ -619,7 +675,6 @@ void __ext4_error_file(struct file *file, const char *function,
trace_ext4_error(inode->i_sb, function, line); es = EXT4_SB(inode->i_sb)->s_es; - es->s_last_error_ino = cpu_to_le32(inode->i_ino); if (ext4_error_ratelimit(inode->i_sb)) { path = file_path(file, pathname, sizeof(pathname)); if (IS_ERR(path)) @@ -641,7 +696,8 @@ void __ext4_error_file(struct file *file, const char *function, current->comm, path, &vaf); va_end(args); } - save_error_info(inode->i_sb, function, line); + save_error_info(inode->i_sb, EFSCORRUPTED, inode->i_ino, block, + function, line); ext4_handle_error(inode->i_sb); }
@@ -685,66 +741,6 @@ const char *ext4_decode_error(struct super_block *sb, int errno, return errstr; }
-void ext4_set_errno(struct super_block *sb, int err) -{ - if (err < 0) - err = -err; - - switch (err) { - case EIO: - err = EXT4_ERR_EIO; - break; - case ENOMEM: - err = EXT4_ERR_ENOMEM; - break; - case EFSBADCRC: - err = EXT4_ERR_EFSBADCRC; - break; - case EFSCORRUPTED: - err = EXT4_ERR_EFSCORRUPTED; - break; - case ENOSPC: - err = EXT4_ERR_ENOSPC; - break; - case ENOKEY: - err = EXT4_ERR_ENOKEY; - break; - case EROFS: - err = EXT4_ERR_EROFS; - break; - case EFBIG: - err = EXT4_ERR_EFBIG; - break; - case EEXIST: - err = EXT4_ERR_EEXIST; - break; - case ERANGE: - err = EXT4_ERR_ERANGE; - break; - case EOVERFLOW: - err = EXT4_ERR_EOVERFLOW; - break; - case EBUSY: - err = EXT4_ERR_EBUSY; - break; - case ENOTDIR: - err = EXT4_ERR_ENOTDIR; - break; - case ENOTEMPTY: - err = EXT4_ERR_ENOTEMPTY; - break; - case ESHUTDOWN: - err = EXT4_ERR_ESHUTDOWN; - break; - case EFAULT: - err = EXT4_ERR_EFAULT; - break; - default: - err = EXT4_ERR_UNKNOWN; - } - EXT4_SB(sb)->s_es->s_last_error_errcode = err; -} - /* __ext4_std_error decodes expected errors from journaling functions * automatically and invokes the appropriate error response. */
@@ -769,8 +765,7 @@ void __ext4_std_error(struct super_block *sb, const char *function, sb->s_id, function, line, errstr); }
- ext4_set_errno(sb, -errno); - save_error_info(sb, function, line); + save_error_info(sb, -errno, 0, 0, function, line); ext4_handle_error(sb); }
@@ -785,7 +780,7 @@ void __ext4_std_error(struct super_block *sb, const char *function, */
void __ext4_abort(struct super_block *sb, const char *function, - unsigned int line, const char *fmt, ...) + unsigned int line, int error, const char *fmt, ...) { struct va_format vaf; va_list args; @@ -793,7 +788,7 @@ void __ext4_abort(struct super_block *sb, const char *function, if (unlikely(ext4_forced_shutdown(EXT4_SB(sb)))) return;
- save_error_info(sb, function, line); + save_error_info(sb, error, 0, 0, function, line); va_start(args, fmt); vaf.fmt = fmt; vaf.va = &args; @@ -813,7 +808,6 @@ void __ext4_abort(struct super_block *sb, const char *function, */ smp_wmb(); sb->s_flags |= SB_RDONLY; - save_error_info(sb, function, line); ext4_netlink_send_info(sb, 2); } if (test_opt(sb, ERRORS_PANIC) && !system_going_down()) @@ -884,15 +878,12 @@ __acquires(bitlock) { struct va_format vaf; va_list args; - struct ext4_super_block *es = EXT4_SB(sb)->s_es;
if (unlikely(ext4_forced_shutdown(EXT4_SB(sb)))) return;
trace_ext4_error(sb, function, line); - es->s_last_error_ino = cpu_to_le32(ino); - es->s_last_error_block = cpu_to_le64(block); - __save_error_info(sb, function, line); + __save_error_info(sb, EFSCORRUPTED, ino, block, function, line);
if (ext4_error_ratelimit(sb)) { va_start(args, fmt); @@ -1098,8 +1089,7 @@ static void ext4_put_super(struct super_block *sb) err = jbd2_journal_destroy(sbi->s_journal); sbi->s_journal = NULL; if ((err < 0) && !aborted) { - ext4_set_errno(sb, -err); - ext4_abort(sb, "Couldn't clean up the journal"); + ext4_abort(sb, -err, "Couldn't clean up the journal"); } }
@@ -5465,7 +5455,7 @@ static int ext4_remount(struct super_block *sb, int *flags, char *data) }
if (sbi->s_mount_flags & EXT4_MF_FS_ABORTED) - ext4_abort(sb, "Abort forced by user"); + ext4_abort(sb, EXT4_ERR_ESHUTDOWN, "Abort forced by user");
sb->s_flags = (sb->s_flags & ~SB_POSIXACL) | (test_opt(sb, POSIX_ACL) ? SB_POSIXACL : 0); diff --git a/fs/ext4/xattr.c b/fs/ext4/xattr.c index 7781e34c8ce24..1a8416f522313 100644 --- a/fs/ext4/xattr.c +++ b/fs/ext4/xattr.c @@ -245,7 +245,7 @@ __ext4_xattr_check_block(struct inode *inode, struct buffer_head *bh, bh->b_data); errout: if (error) - __ext4_error_inode(inode, function, line, 0, + __ext4_error_inode(inode, function, line, 0, -error, "corrupted xattr block %llu", (unsigned long long) bh->b_blocknr); else @@ -269,7 +269,7 @@ __xattr_check_inode(struct inode *inode, struct ext4_xattr_ibody_header *header, error = ext4_xattr_check_entries(IFIRST(header), end, IFIRST(header)); errout: if (error) - __ext4_error_inode(inode, function, line, 0, + __ext4_error_inode(inode, function, line, 0, -error, "corrupted in-inode xattr"); return error; } @@ -2887,9 +2887,9 @@ int ext4_xattr_delete_inode(handle_t *handle, struct inode *inode, if (IS_ERR(bh)) { error = PTR_ERR(bh); if (error == -EIO) { - ext4_set_errno(inode->i_sb, EIO); - EXT4_ERROR_INODE(inode, "block %llu read error", - EXT4_I(inode)->i_file_acl); + EXT4_ERROR_INODE_ERR(inode, EIO, + "block %llu read error", + EXT4_I(inode)->i_file_acl); } bh = NULL; goto cleanup;
From: Jan Kara jack@suse.cz
mainline inclusion
from mainline-v5.11-rc1
commit 93c20bc3eafba52c134cf5183f18833b9bd22bf8
category: bugfix
bugzilla: 46758
CVE: NA
-----------------------------------------------
We use __ext4_error() when ext4_protect_reserved_inode() finds filesystem corruption. However, EXT4_ERROR_INODE_ERR() is perfectly capable of reporting all the needed information, so just use that.
Signed-off-by: Jan Kara jack@suse.cz
Reviewed-by: Andreas Dilger adilger@dilger.ca
Link: https://lore.kernel.org/r/20201127113405.26867-4-jack@suse.cz
Signed-off-by: Theodore Ts'o tytso@mit.edu
conflicts: fs/ext4/block_validity.c
Signed-off-by: Ye Bin yebin10@huawei.com
Reviewed-by: zhangyi (F) yi.zhang@huawei.com
Signed-off-by: Yang Yingliang yangyingliang@huawei.com
---
 fs/ext4/block_validity.c | 7 +++----
 1 file changed, 3 insertions(+), 4 deletions(-)

diff --git a/fs/ext4/block_validity.c b/fs/ext4/block_validity.c
index 868c386282022..fc2a2159421ea 100644
--- a/fs/ext4/block_validity.c
+++ b/fs/ext4/block_validity.c
@@ -172,11 +172,10 @@ static int ext4_protect_reserved_inode(struct super_block *sb,
 			err = add_system_zone(system_blks, map.m_pblk, n, ino);
 			if (err < 0) {
 				if (err == -EFSCORRUPTED) {
-					__ext4_error(sb, __func__, __LINE__, -err,
-						     map.m_pblk, "blocks %llu-%llu "
-						     "from inode %u overlap system zone",
+					EXT4_ERROR_INODE_ERR(inode, -err,
+						"blocks %llu-%llu from inode overlap system zone",
 						     map.m_pblk,
-						     map.m_pblk + map.m_len - 1, ino);
+						     map.m_pblk + map.m_len - 1);
 				}
 				break;
 			}
From: Jan Kara jack@suse.cz
mainline inclusion
from mainline-v5.11-rc1
commit 014c9caa29d3a44e0de695c99ef18bec3e887d52
category: bugfix
bugzilla: 46758
CVE: NA
-----------------------------------------------
The only difference between __ext4_abort() and __ext4_error() is that the former ignores the errors=continue mount option. Unify the code to reduce duplication.
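The unified policy can be sketched as a small userspace model. This is only an illustration under stated assumptions: the enum names and decide_error_action() are hypothetical and do not exist in the kernel, and the sketch leaves out the early return for an already read-only filesystem as well as the journal abort.

```c
#include <stdbool.h>

/* Hypothetical stand-ins for the errors= mount options. */
enum errors_opt { ERRORS_CONT, ERRORS_RO, ERRORS_PANIC };

/* Outcome of handling an error in this simplified model. */
enum error_action { ACT_CONTINUE, ACT_REMOUNT_RO, ACT_PANIC };

/*
 * Model of the unified policy: force_ro == true corresponds to the old
 * ext4_abort() behaviour, force_ro == false to plain ext4_error().
 */
static enum error_action decide_error_action(enum errors_opt opt,
					     bool force_ro,
					     bool system_going_down)
{
	/* errors=continue is honoured only when the caller did not abort. */
	if (!force_ro && opt == ERRORS_CONT)
		return ACT_CONTINUE;
	/* Panicking during shutdown is avoided; fall through to read-only. */
	if (opt == ERRORS_PANIC && !system_going_down)
		return ACT_PANIC;
	return ACT_REMOUNT_RO;
}
```

In this model an abort on an errors=continue filesystem yields ACT_REMOUNT_RO, while a plain error on the same filesystem yields ACT_CONTINUE, which is the single behavioural difference the commit message describes.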
Signed-off-by: Jan Kara jack@suse.cz
Reviewed-by: Andreas Dilger adilger@dilger.ca
Link: https://lore.kernel.org/r/20201127113405.26867-5-jack@suse.cz
Signed-off-by: Theodore Ts'o tytso@mit.edu
conflicts: fs/ext4/super.c
Signed-off-by: Ye Bin yebin10@huawei.com
Reviewed-by: zhangyi (F) yi.zhang@huawei.com
Signed-off-by: Yang Yingliang yangyingliang@huawei.com
---
 fs/ext4/ext4.h      | 29 +++++++---------
 fs/ext4/ext4_jbd2.c |  4 +--
 fs/ext4/inode.c     |  2 +-
 fs/ext4/super.c     | 84 ++++++++++++---------------------------------
 4 files changed, 37 insertions(+), 82 deletions(-)
diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h index 052eb06815fa7..73cd56bfd9307 100644 --- a/fs/ext4/ext4.h +++ b/fs/ext4/ext4.h @@ -2721,9 +2721,9 @@ extern void ext4_mark_group_bitmap_corrupted(struct super_block *sb, ext4_group_t block_group, unsigned int flags);
-extern __printf(6, 7) -void __ext4_error(struct super_block *, const char *, unsigned int, int, __u64, - const char *, ...); +extern __printf(7, 8) +void __ext4_error(struct super_block *, const char *, unsigned int, bool, + int, __u64, const char *, ...); extern __printf(6, 7) void __ext4_error_inode(struct inode *, const char *, unsigned int, ext4_fsblk_t, int, const char *, ...); @@ -2732,9 +2732,6 @@ void __ext4_error_file(struct file *, const char *, unsigned int, ext4_fsblk_t, const char *, ...); extern void __ext4_std_error(struct super_block *, const char *, unsigned int, int); -extern __printf(5, 6) -void __ext4_abort(struct super_block *, const char *, unsigned int, int, - const char *, ...); extern __printf(4, 5) void __ext4_warning(struct super_block *, const char *, unsigned int, const char *, ...); @@ -2764,6 +2761,9 @@ void __ext4_grp_locked_error(const char *, unsigned int, #define EXT4_ERROR_FILE(file, block, fmt, a...) \ ext4_error_file((file), __func__, __LINE__, (block), (fmt), ## a)
+#define ext4_abort(sb, err, fmt, a...) \ + __ext4_error((sb), __func__, __LINE__, true, (err), 0, (fmt), ## a) + #ifdef CONFIG_PRINTK
#define ext4_error_inode(inode, func, line, block, fmt, ...) \ @@ -2774,11 +2774,11 @@ void __ext4_grp_locked_error(const char *, unsigned int, #define ext4_error_file(file, func, line, block, fmt, ...) \ __ext4_error_file(file, func, line, block, fmt, ##__VA_ARGS__) #define ext4_error(sb, fmt, ...) \ - __ext4_error((sb), __func__, __LINE__, 0, 0, (fmt), ##__VA_ARGS__) + __ext4_error((sb), __func__, __LINE__, false, 0, 0, (fmt), \ + ##__VA_ARGS__) #define ext4_error_err(sb, err, fmt, ...) \ - __ext4_error((sb), __func__, __LINE__, (err), 0, (fmt), ##__VA_ARGS__) -#define ext4_abort(sb, err, fmt, ...) \ - __ext4_abort((sb), __func__, __LINE__, (err), (fmt), ##__VA_ARGS__) + __ext4_error((sb), __func__, __LINE__, false, (err), 0, (fmt), \ + ##__VA_ARGS__) #define ext4_warning(sb, fmt, ...) \ __ext4_warning(sb, __func__, __LINE__, fmt, ##__VA_ARGS__) #define ext4_warning_inode(inode, fmt, ...) \ @@ -2811,17 +2811,12 @@ do { \ #define ext4_error(sb, fmt, ...) \ do { \ no_printk(fmt, ##__VA_ARGS__); \ - __ext4_error(sb, "", 0, 0, 0, " "); \ + __ext4_error(sb, "", 0, false, 0, 0, " "); \ } while (0) #define ext4_error_err(sb, err, fmt, ...) \ do { \ no_printk(fmt, ##__VA_ARGS__); \ - __ext4_error(sb, "", 0, err, 0, " "); \ -} while (0) -#define ext4_abort(sb, err, fmt, ...) \ -do { \ - no_printk(fmt, ##__VA_ARGS__); \ - __ext4_abort(sb, "", 0, err, " "); \ + __ext4_error(sb, "", 0, false, err, 0, " "); \ } while (0) #define ext4_warning(sb, fmt, ...) 
\ do { \ diff --git a/fs/ext4/ext4_jbd2.c b/fs/ext4/ext4_jbd2.c index 35ce16e690d37..af28089958587 100644 --- a/fs/ext4/ext4_jbd2.c +++ b/fs/ext4/ext4_jbd2.c @@ -273,8 +273,8 @@ int __ext4_forget(const char *where, unsigned int line, handle_t *handle, if (err) { ext4_journal_abort_handle(where, line, __func__, bh, handle, err); - __ext4_abort(inode->i_sb, where, line, -err, - "error %d when attempting revoke", err); + __ext4_error(inode->i_sb, where, line, true, -err, 0, + "error %d when attempting revoke", err); } BUFFER_TRACE(bh, "exit"); return err; diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c index b5f0e5b552668..c9ebed13d2601 100644 --- a/fs/ext4/inode.c +++ b/fs/ext4/inode.c @@ -4882,7 +4882,7 @@ struct inode *__ext4_iget(struct super_block *sb, unsigned long ino, (ino > le32_to_cpu(EXT4_SB(sb)->s_es->s_inodes_count))) { if (flags & EXT4_IGET_HANDLE) return ERR_PTR(-ESTALE); - __ext4_error(sb, function, line, EFSCORRUPTED, 0, + __ext4_error(sb, function, line, false, EFSCORRUPTED, 0, "inode #%lu: comm %s: iget: illegal inode #", ino, current->comm); return ERR_PTR(-EFSCORRUPTED); diff --git a/fs/ext4/super.c b/fs/ext4/super.c index ee35d55a823a8..73164ee74e9ba 100644 --- a/fs/ext4/super.c +++ b/fs/ext4/super.c @@ -560,9 +560,15 @@ static void ext4_netlink_send_info(struct super_block *sb, int ext4_errno) * We'll just use the jbd2_journal_abort() error code to record an error in * the journal instead. On recovery, the journal will complain about * that error until we've noted it down and cleared it. + * + * If force_ro is set, we unconditionally force the filesystem into an + * ABORT|READONLY state, unless the error response on the fs has been set to + * panic in which case we take the easy way out and panic immediately. This is + * used to deal with unrecoverable failures such as journal IO errors or ENOMEM + * at a critical moment in log management. */
-static void ext4_handle_error(struct super_block *sb) +static void ext4_handle_error(struct super_block *sb, bool force_ro) { journal_t *journal = EXT4_SB(sb)->s_journal;
@@ -572,7 +578,7 @@ static void ext4_handle_error(struct super_block *sb) if (sb_rdonly(sb)) return;
- if (test_opt(sb, ERRORS_CONT)) + if (!force_ro && test_opt(sb, ERRORS_CONT)) goto out;
EXT4_SB(sb)->s_mount_flags |= EXT4_MF_FS_ABORTED; @@ -583,19 +589,18 @@ static void ext4_handle_error(struct super_block *sb) * could panic during 'reboot -f' as the underlying device got already * disabled. */ - if (test_opt(sb, ERRORS_RO) || system_going_down()) { - ext4_msg(sb, KERN_CRIT, "Remounting filesystem read-only"); - /* - * Make sure updated value of ->s_mount_flags will be visible - * before ->s_flags update - */ - smp_wmb(); - sb->s_flags |= SB_RDONLY; - } else if (test_opt(sb, ERRORS_PANIC)) { + if (test_opt(sb, ERRORS_PANIC) && !system_going_down()) { panic("EXT4-fs (device %s): panic forced after error\n", sb->s_id); }
+ ext4_msg(sb, KERN_CRIT, "Remounting filesystem read-only"); + /* + * Make sure updated value of ->s_mount_flags will be visible before + * ->s_flags update + */ + smp_wmb(); + sb->s_flags |= SB_RDONLY; out: ext4_netlink_send_info(sb, 1); } @@ -605,7 +610,7 @@ static void ext4_handle_error(struct super_block *sb) "EXT4-fs error")
void __ext4_error(struct super_block *sb, const char *function, - unsigned int line, int error, __u64 block, + unsigned int line, bool force_ro, int error, __u64 block, const char *fmt, ...) { struct va_format vaf; @@ -625,7 +630,7 @@ void __ext4_error(struct super_block *sb, const char *function, va_end(args); } save_error_info(sb, error, 0, block, function, line); - ext4_handle_error(sb); + ext4_handle_error(sb, force_ro); }
void __ext4_error_inode(struct inode *inode, const char *function, @@ -657,7 +662,7 @@ void __ext4_error_inode(struct inode *inode, const char *function, } save_error_info(inode->i_sb, error, inode->i_ino, block, function, line); - ext4_handle_error(inode->i_sb); + ext4_handle_error(inode->i_sb, false); }
void __ext4_error_file(struct file *file, const char *function, @@ -698,7 +703,7 @@ void __ext4_error_file(struct file *file, const char *function, } save_error_info(inode->i_sb, EFSCORRUPTED, inode->i_ino, block, function, line); - ext4_handle_error(inode->i_sb); + ext4_handle_error(inode->i_sb, false); }
const char *ext4_decode_error(struct super_block *sb, int errno, @@ -766,52 +771,7 @@ void __ext4_std_error(struct super_block *sb, const char *function, }
save_error_info(sb, -errno, 0, 0, function, line); - ext4_handle_error(sb); -} - -/* - * ext4_abort is a much stronger failure handler than ext4_error. The - * abort function may be used to deal with unrecoverable failures such - * as journal IO errors or ENOMEM at a critical moment in log management. - * - * We unconditionally force the filesystem into an ABORT|READONLY state, - * unless the error response on the fs has been set to panic in which - * case we take the easy way out and panic immediately. - */ - -void __ext4_abort(struct super_block *sb, const char *function, - unsigned int line, int error, const char *fmt, ...) -{ - struct va_format vaf; - va_list args; - - if (unlikely(ext4_forced_shutdown(EXT4_SB(sb)))) - return; - - save_error_info(sb, error, 0, 0, function, line); - va_start(args, fmt); - vaf.fmt = fmt; - vaf.va = &args; - printk(KERN_CRIT "EXT4-fs error (device %s): %s:%d: %pV\n", - sb->s_id, function, line, &vaf); - va_end(args); - - if (sb_rdonly(sb) == 0) { - EXT4_SB(sb)->s_mount_flags |= EXT4_MF_FS_ABORTED; - if (EXT4_SB(sb)->s_journal) - jbd2_journal_abort(EXT4_SB(sb)->s_journal, -EIO); - - ext4_msg(sb, KERN_CRIT, "Remounting filesystem read-only"); - /* - * Make sure updated value of ->s_mount_flags will be visible - * before ->s_flags update - */ - smp_wmb(); - sb->s_flags |= SB_RDONLY; - ext4_netlink_send_info(sb, 2); - } - if (test_opt(sb, ERRORS_PANIC) && !system_going_down()) - panic("EXT4-fs panic from previous error\n"); + ext4_handle_error(sb, false); }
void __ext4_msg(struct super_block *sb, @@ -910,7 +870,7 @@ __acquires(bitlock)
ext4_unlock_group(sb, grp); ext4_commit_super(sb, 1); - ext4_handle_error(sb); + ext4_handle_error(sb, false); /* * We only get here in the ERRORS_RO case; relocking the group * may be dangerous, but nothing bad will happen since the
From: Jan Kara jack@suse.cz
mainline inclusion
from mainline-v5.11-rc1
commit 4067662388f97d0f360e568820d9d5bac6a3c9fa
category: bugfix
bugzilla: 46785
CVE: NA
-----------------------------------------------
Just move the error-info-related functions in super.c close to ext4_handle_error(). We'll want to combine save_error_info() with ext4_handle_error(); this makes that change more obvious and saves a forward declaration as well. No functional change.
Signed-off-by: Jan Kara jack@suse.cz
Reviewed-by: Andreas Dilger adilger@dilger.ca
Link: https://lore.kernel.org/r/20201127113405.26867-6-jack@suse.cz
Signed-off-by: Theodore Ts'o tytso@mit.edu
conflicts: fs/ext4/super.c
Signed-off-by: Ye Bin yebin10@huawei.com
Reviewed-by: zhangyi (F) yi.zhang@huawei.com
Signed-off-by: Yang Yingliang yangyingliang@huawei.com
---
 fs/ext4/super.c | 172 ++++++++++++++++++++++++------------------------
 1 file changed, 86 insertions(+), 86 deletions(-)
diff --git a/fs/ext4/super.c b/fs/ext4/super.c index 73164ee74e9ba..4a74e631f7782 100644 --- a/fs/ext4/super.c +++ b/fs/ext4/super.c @@ -363,6 +363,92 @@ static time64_t __ext4_get_tstamp(__le32 *lo, __u8 *hi) #define ext4_get_tstamp(es, tstamp) \ __ext4_get_tstamp(&(es)->tstamp, &(es)->tstamp ## _hi)
+/* + * The del_gendisk() function uninitializes the disk-specific data + * structures, including the bdi structure, without telling anyone + * else. Once this happens, any attempt to call mark_buffer_dirty() + * (for example, by ext4_commit_super), will cause a kernel OOPS. + * This is a kludge to prevent these oops until we can put in a proper + * hook in del_gendisk() to inform the VFS and file system layers. + */ +static int block_device_ejected(struct super_block *sb) +{ + struct inode *bd_inode = sb->s_bdev->bd_inode; + struct backing_dev_info *bdi = inode_to_bdi(bd_inode); + + return bdi->dev == NULL; +} + +static void ext4_journal_commit_callback(journal_t *journal, transaction_t *txn) +{ + struct super_block *sb = journal->j_private; + struct ext4_sb_info *sbi = EXT4_SB(sb); + int error = is_journal_aborted(journal); + struct ext4_journal_cb_entry *jce; + + BUG_ON(txn->t_state == T_FINISHED); + + ext4_process_freed_data(sb, txn->t_tid); + + spin_lock(&sbi->s_md_lock); + while (!list_empty(&txn->t_private_list)) { + jce = list_entry(txn->t_private_list.next, + struct ext4_journal_cb_entry, jce_list); + list_del_init(&jce->jce_list); + spin_unlock(&sbi->s_md_lock); + jce->jce_func(sb, jce, error); + spin_lock(&sbi->s_md_lock); + } + spin_unlock(&sbi->s_md_lock); +} + +static bool system_going_down(void) +{ + return system_state == SYSTEM_HALT || system_state == SYSTEM_POWER_OFF + || system_state == SYSTEM_RESTART; +} + +static void ext4_netlink_send_info(struct super_block *sb, int ext4_errno) +{ + int size; + sk_buff_data_t old_tail; + struct sk_buff *skb; + struct nlmsghdr *nlh; + struct ext4_err_msg *msg; + + if (ext4nl) { + if (IS_EXT2_SB(sb)) + return; + size = NLMSG_SPACE(sizeof(struct ext4_err_msg)); + skb = alloc_skb(size, GFP_ATOMIC); + if (!skb) { + printk(KERN_ERR "Cannot alloc skb!"); + return; + } + old_tail = skb->tail; + nlh = nlmsg_put(skb, 0, 0, NLMSG_ERROR, size - sizeof(*nlh), 0); + if (!nlh) + goto nlmsg_failure; + msg = (struct 
ext4_err_msg *)NLMSG_DATA(nlh); + if (IS_EXT3_SB(sb)) + msg->magic = EXT3_ERROR_MAGIC; + else + msg->magic = EXT4_ERROR_MAGIC; + memcpy(msg->s_id, sb->s_id, sizeof(sb->s_id)); + msg->s_flags = sb->s_flags; + msg->ext4_errno = ext4_errno; + nlh->nlmsg_len = skb->tail - old_tail; + NETLINK_CB(skb).portid = 0; + NETLINK_CB(skb).dst_group = NL_EXT4_ERROR_GROUP; + netlink_broadcast(ext4nl, skb, 0, NL_EXT4_ERROR_GROUP, + GFP_ATOMIC); + return; +nlmsg_failure: + if (skb) + kfree_skb(skb); + } +} + static void __save_error_info(struct super_block *sb, int error, __u32 ino, __u64 block, const char *func, unsigned int line) @@ -461,92 +547,6 @@ static void save_error_info(struct super_block *sb, int error, ext4_commit_super(sb, 1); }
-/* - * The del_gendisk() function uninitializes the disk-specific data - * structures, including the bdi structure, without telling anyone - * else. Once this happens, any attempt to call mark_buffer_dirty() - * (for example, by ext4_commit_super), will cause a kernel OOPS. - * This is a kludge to prevent these oops until we can put in a proper - * hook in del_gendisk() to inform the VFS and file system layers. - */ -static int block_device_ejected(struct super_block *sb) -{ - struct inode *bd_inode = sb->s_bdev->bd_inode; - struct backing_dev_info *bdi = inode_to_bdi(bd_inode); - - return bdi->dev == NULL; -} - -static void ext4_journal_commit_callback(journal_t *journal, transaction_t *txn) -{ - struct super_block *sb = journal->j_private; - struct ext4_sb_info *sbi = EXT4_SB(sb); - int error = is_journal_aborted(journal); - struct ext4_journal_cb_entry *jce; - - BUG_ON(txn->t_state == T_FINISHED); - - ext4_process_freed_data(sb, txn->t_tid); - - spin_lock(&sbi->s_md_lock); - while (!list_empty(&txn->t_private_list)) { - jce = list_entry(txn->t_private_list.next, - struct ext4_journal_cb_entry, jce_list); - list_del_init(&jce->jce_list); - spin_unlock(&sbi->s_md_lock); - jce->jce_func(sb, jce, error); - spin_lock(&sbi->s_md_lock); - } - spin_unlock(&sbi->s_md_lock); -} - -static bool system_going_down(void) -{ - return system_state == SYSTEM_HALT || system_state == SYSTEM_POWER_OFF - || system_state == SYSTEM_RESTART; -} - -static void ext4_netlink_send_info(struct super_block *sb, int ext4_errno) -{ - int size; - sk_buff_data_t old_tail; - struct sk_buff *skb; - struct nlmsghdr *nlh; - struct ext4_err_msg *msg; - - if (ext4nl) { - if (IS_EXT2_SB(sb)) - return; - size = NLMSG_SPACE(sizeof(struct ext4_err_msg)); - skb = alloc_skb(size, GFP_ATOMIC); - if (!skb) { - printk(KERN_ERR "Cannot alloc skb!"); - return; - } - old_tail = skb->tail; - nlh = nlmsg_put(skb, 0, 0, NLMSG_ERROR, size - sizeof(*nlh), 0); - if (!nlh) - goto nlmsg_failure; - msg = (struct 
ext4_err_msg *)NLMSG_DATA(nlh); - if (IS_EXT3_SB(sb)) - msg->magic = EXT3_ERROR_MAGIC; - else - msg->magic = EXT4_ERROR_MAGIC; - memcpy(msg->s_id, sb->s_id, sizeof(sb->s_id)); - msg->s_flags = sb->s_flags; - msg->ext4_errno = ext4_errno; - nlh->nlmsg_len = skb->tail - old_tail; - NETLINK_CB(skb).portid = 0; - NETLINK_CB(skb).dst_group = NL_EXT4_ERROR_GROUP; - netlink_broadcast(ext4nl, skb, 0, NL_EXT4_ERROR_GROUP, - GFP_ATOMIC); - return; -nlmsg_failure: - if (skb) - kfree_skb(skb); - } -} - /* Deal with the reporting of failure conditions on a filesystem such as * inconsistencies detected or read IO failures. *
From: Jan Kara jack@suse.cz
mainline inclusion
from mainline-v5.11-rc1
commit 02a7780e4d2fcf438ac6773bc469e7ada2af56be
category: bugfix
bugzilla: 46758
CVE: NA
-----------------------------------------------
We convert errno's to ext4 on-disk format error codes in save_error_info(). Add a function and a bit of macro magic to make this simpler.
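The table-plus-macro pattern can be tried outside the kernel with a minimal sketch. The DEMO_* names and numeric codes below are illustrative assumptions, not the real EXT4_ERR_* values from ext4.h, and the struct field is called sys_errno because errno is itself a macro in userspace C.

```c
#include <errno.h>
#include <stddef.h>

/* Illustrative on-disk codes; the real EXT4_ERR_* values live in ext4.h. */
enum {
	DEMO_ERR_UNKNOWN = 1,
	DEMO_ERR_EIO,
	DEMO_ERR_ENOMEM,
	DEMO_ERR_ENOSPC,
};

struct demo_err_translation {
	int code;	/* value that would be stored in the superblock */
	int sys_errno;	/* positive errno value */
};

/* ## pastes the unexpanded name; the plain use expands to the number. */
#define DEMO_ERR_TRANSLATE(err) { .code = DEMO_ERR_##err, .sys_errno = err }

static const struct demo_err_translation demo_translation[] = {
	DEMO_ERR_TRANSLATE(EIO),
	DEMO_ERR_TRANSLATE(ENOMEM),
	DEMO_ERR_TRANSLATE(ENOSPC),
};

/* Linear lookup; anything not in the table maps to the "unknown" code. */
static int demo_errno_to_code(int err)
{
	size_t i;

	for (i = 0; i < sizeof(demo_translation) / sizeof(demo_translation[0]); i++)
		if (demo_translation[i].sys_errno == err)
			return demo_translation[i].code;
	return DEMO_ERR_UNKNOWN;
}
```

Adding a new error then costs one table line instead of a three-line case in a switch. Note that in this sketch 0 maps to the unknown code; in the patch, the caller substitutes EFSCORRUPTED for 0 before the lookup.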
Signed-off-by: Jan Kara jack@suse.cz
Reviewed-by: Andreas Dilger adilger@dilger.ca
Link: https://lore.kernel.org/r/20201127113405.26867-7-jack@suse.cz
Signed-off-by: Theodore Ts'o tytso@mit.edu
conflicts: fs/ext4/super.c
Signed-off-by: Ye Bin yebin10@huawei.com
Reviewed-by: zhangyi (F) yi.zhang@huawei.com
Signed-off-by: Yang Yingliang yangyingliang@huawei.com
---
 fs/ext4/super.c | 95 +++++++++++++++++++++----------------------------
 1 file changed, 40 insertions(+), 55 deletions(-)
diff --git a/fs/ext4/super.c b/fs/ext4/super.c index 4a74e631f7782..d6e604ca9f0e5 100644 --- a/fs/ext4/super.c +++ b/fs/ext4/super.c @@ -449,76 +449,61 @@ static void ext4_netlink_send_info(struct super_block *sb, int ext4_errno) } }
+struct ext4_err_translation { + int code; + int errno; +}; + +#define EXT4_ERR_TRANSLATE(err) { .code = EXT4_ERR_##err, .errno = err } + +static struct ext4_err_translation err_translation[] = { + EXT4_ERR_TRANSLATE(EIO), + EXT4_ERR_TRANSLATE(ENOMEM), + EXT4_ERR_TRANSLATE(EFSBADCRC), + EXT4_ERR_TRANSLATE(EFSCORRUPTED), + EXT4_ERR_TRANSLATE(ENOSPC), + EXT4_ERR_TRANSLATE(ENOKEY), + EXT4_ERR_TRANSLATE(EROFS), + EXT4_ERR_TRANSLATE(EFBIG), + EXT4_ERR_TRANSLATE(EEXIST), + EXT4_ERR_TRANSLATE(ERANGE), + EXT4_ERR_TRANSLATE(EOVERFLOW), + EXT4_ERR_TRANSLATE(EBUSY), + EXT4_ERR_TRANSLATE(ENOTDIR), + EXT4_ERR_TRANSLATE(ENOTEMPTY), + EXT4_ERR_TRANSLATE(ESHUTDOWN), + EXT4_ERR_TRANSLATE(EFAULT), +}; + +static int ext4_errno_to_code(int errno) +{ + int i; + + for (i = 0; i < ARRAY_SIZE(err_translation); i++) + if (err_translation[i].errno == errno) + return err_translation[i].code; + return EXT4_ERR_UNKNOWN; +} + static void __save_error_info(struct super_block *sb, int error, __u32 ino, __u64 block, const char *func, unsigned int line) { struct ext4_super_block *es = EXT4_SB(sb)->s_es; - int err;
EXT4_SB(sb)->s_mount_state |= EXT4_ERROR_FS; if (bdev_read_only(sb->s_bdev)) return; + /* We default to EFSCORRUPTED error... */ + if (error == 0) + error = EFSCORRUPTED; es->s_state |= cpu_to_le16(EXT4_ERROR_FS); ext4_update_tstamp(es, s_last_error_time); strncpy(es->s_last_error_func, func, sizeof(es->s_last_error_func)); es->s_last_error_line = cpu_to_le32(line); es->s_last_error_ino = cpu_to_le32(ino); es->s_last_error_block = cpu_to_le64(block); - switch (error) { - case EIO: - err = EXT4_ERR_EIO; - break; - case ENOMEM: - err = EXT4_ERR_ENOMEM; - break; - case EFSBADCRC: - err = EXT4_ERR_EFSBADCRC; - break; - case 0: - case EFSCORRUPTED: - err = EXT4_ERR_EFSCORRUPTED; - break; - case ENOSPC: - err = EXT4_ERR_ENOSPC; - break; - case ENOKEY: - err = EXT4_ERR_ENOKEY; - break; - case EROFS: - err = EXT4_ERR_EROFS; - break; - case EFBIG: - err = EXT4_ERR_EFBIG; - break; - case EEXIST: - err = EXT4_ERR_EEXIST; - break; - case ERANGE: - err = EXT4_ERR_ERANGE; - break; - case EOVERFLOW: - err = EXT4_ERR_EOVERFLOW; - break; - case EBUSY: - err = EXT4_ERR_EBUSY; - break; - case ENOTDIR: - err = EXT4_ERR_ENOTDIR; - break; - case ENOTEMPTY: - err = EXT4_ERR_ENOTEMPTY; - break; - case ESHUTDOWN: - err = EXT4_ERR_ESHUTDOWN; - break; - case EFAULT: - err = EXT4_ERR_EFAULT; - break; - default: - err = EXT4_ERR_UNKNOWN; - } - es->s_last_error_errcode = err; + es->s_last_error_errcode = ext4_errno_to_code(error); if (!es->s_first_error_time) { es->s_first_error_time = es->s_last_error_time; es->s_first_error_time_hi = es->s_last_error_time_hi;
From: Jan Kara jack@suse.cz
mainline inclusion from mainline-v5.11-rc1 commit c92dc856848f32781e37b88c1b7f875e274f5efb category: bugfix bugzilla: 46758 CVE: NA
-----------------------------------------------
When a filesystem inconsistency is detected with the group locked, we currently try to modify the superblock to store the error there without blocking. However, this can cause superblock checksum failures (or DIF/DIX failures) when the superblock is just being written out.
Make the error handling code just store error information in the ext4_sb_info structure and copy it to the on-disk superblock only in ext4_commit_super(). If an error happens with the group locked, we just postpone the superblock flushing to a workqueue.
[ Added fixup so that s_first_error_* does not get updated after the file system is remounted. Also added fix for syzbot failure. - Ted ]
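The stash-then-flush scheme described above can be sketched in userspace C. This is a minimal illustration and not the kernel code: `struct mem_errors`, `struct disk_sb`, `stash_error()` and `flush_errors()` are hypothetical names, only a subset of the fields is modelled, and the locking that the patch does with s_error_lock is left out.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for the error fields added to ext4_sb_info. */
struct mem_errors {
	int      add_error_count;	/* errors seen since the last flush */
	int      first_code, last_code;
	uint32_t first_line, last_line;
	int64_t  first_time, last_time;
};

/* Hypothetical stand-in for the error fields of the on-disk superblock. */
struct disk_sb {
	uint32_t error_count;
	int      first_code, last_code;
	uint32_t first_line, last_line;
	int64_t  first_time, last_time;
};

/* Record an error in memory only; the on-disk image is not touched. */
static void stash_error(struct mem_errors *m, int code, uint32_t line,
			int64_t now)
{
	assert(now != 0);	/* a time of 0 means "no error recorded" */
	m->add_error_count++;
	m->last_code = code;
	m->last_line = line;
	m->last_time = now;
	if (!m->first_time) {
		m->first_code = code;
		m->first_line = line;
		m->first_time = now;
	}
}

/* Copy stashed errors into the on-disk image; meant for the commit path. */
static void flush_errors(struct mem_errors *m, struct disk_sb *d)
{
	if (m->add_error_count <= 0)
		return;
	if (!d->first_time) {	/* keep a first error that is already on disk */
		d->first_code = m->first_code;
		d->first_line = m->first_line;
		d->first_time = m->first_time;
	}
	d->last_code = m->last_code;
	d->last_line = m->last_line;
	d->last_time = m->last_time;
	d->error_count += (uint32_t)m->add_error_count;
	m->add_error_count = 0;
}
```

In this sketch any number of stash_error() calls can happen between two flushes; the on-disk counter is advanced by the accumulated count, and the first-error fields are written only once.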
Signed-off-by: Jan Kara jack@suse.cz Link: https://lore.kernel.org/r/20201127113405.26867-8-jack@suse.cz Signed-off-by: Theodore Ts'o tytso@mit.edu Cc: Hillf Danton hdanton@sina.com Reported-by: syzbot+9043030c040ce1849a60@syzkaller.appspotmail.com
conflicts: fs/ext4/ext4.h fs/ext4/super.c
Signed-off-by: Ye Bin yebin10@huawei.com Reviewed-by: zhangyi (F) yi.zhang@huawei.com Signed-off-by: Yang Yingliang yangyingliang@huawei.com --- fs/ext4/ext4.h | 21 +++++++++ fs/ext4/super.c | 120 +++++++++++++++++++++++++++++++++--------------- 2 files changed, 104 insertions(+), 37 deletions(-)
diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h index 73cd56bfd9307..2a63c69bd19e2 100644 --- a/fs/ext4/ext4.h +++ b/fs/ext4/ext4.h @@ -1545,6 +1545,27 @@ struct ext4_sb_info { /* Record the errseq of the backing block device */ errseq_t s_bdev_wb_err; spinlock_t s_bdev_wb_lock; + + /* Information about errors that happened during this mount */ + spinlock_t s_error_lock; + int s_add_error_count; + int s_first_error_code; + __u32 s_first_error_line; + __u32 s_first_error_ino; + __u64 s_first_error_block; + const char *s_first_error_func; + time64_t s_first_error_time; + int s_last_error_code; + __u32 s_last_error_line; + __u32 s_last_error_ino; + __u64 s_last_error_block; + const char *s_last_error_func; + time64_t s_last_error_time; + /* + * If we are in a context where we cannot update error information in + * the on-disk superblock, we queue this work to do it. + */ + struct work_struct s_error_work; };
static inline struct ext4_sb_info *EXT4_SB(struct super_block *sb) diff --git a/fs/ext4/super.c b/fs/ext4/super.c index d6e604ca9f0e5..9ae5a84c80c91 100644 --- a/fs/ext4/super.c +++ b/fs/ext4/super.c @@ -344,10 +344,8 @@ void ext4_itable_unused_set(struct super_block *sb, bg->bg_itable_unused_hi = cpu_to_le16(count >> 16); }
-static void __ext4_update_tstamp(__le32 *lo, __u8 *hi) +static void __ext4_update_tstamp(__le32 *lo, __u8 *hi, time64_t now) { - time64_t now = ktime_get_real_seconds(); - now = clamp_val(now, 0, (1ull << 40) - 1);
*lo = cpu_to_le32(lower_32_bits(now)); @@ -359,7 +357,8 @@ static time64_t __ext4_get_tstamp(__le32 *lo, __u8 *hi) return ((time64_t)(*hi) << 32) + le32_to_cpu(*lo); } #define ext4_update_tstamp(es, tstamp) \ - __ext4_update_tstamp(&(es)->tstamp, &(es)->tstamp ## _hi) + __ext4_update_tstamp(&(es)->tstamp, &(es)->tstamp ## _hi, \ + ktime_get_real_seconds()) #define ext4_get_tstamp(es, tstamp) \ __ext4_get_tstamp(&(es)->tstamp, &(es)->tstamp ## _hi)
@@ -489,7 +488,7 @@ static void __save_error_info(struct super_block *sb, int error, __u32 ino, __u64 block, const char *func, unsigned int line) { - struct ext4_super_block *es = EXT4_SB(sb)->s_es; + struct ext4_sb_info *sbi = EXT4_SB(sb);
EXT4_SB(sb)->s_mount_state |= EXT4_ERROR_FS; if (bdev_read_only(sb->s_bdev)) @@ -497,30 +496,24 @@ static void __save_error_info(struct super_block *sb, int error, /* We default to EFSCORRUPTED error... */ if (error == 0) error = EFSCORRUPTED; - es->s_state |= cpu_to_le16(EXT4_ERROR_FS); - ext4_update_tstamp(es, s_last_error_time); - strncpy(es->s_last_error_func, func, sizeof(es->s_last_error_func)); - es->s_last_error_line = cpu_to_le32(line); - es->s_last_error_ino = cpu_to_le32(ino); - es->s_last_error_block = cpu_to_le64(block); - es->s_last_error_errcode = ext4_errno_to_code(error); - if (!es->s_first_error_time) { - es->s_first_error_time = es->s_last_error_time; - es->s_first_error_time_hi = es->s_last_error_time_hi; - strncpy(es->s_first_error_func, func, - sizeof(es->s_first_error_func)); - es->s_first_error_line = cpu_to_le32(line); - es->s_first_error_ino = es->s_last_error_ino; - es->s_first_error_block = es->s_last_error_block; - es->s_first_error_errcode = es->s_last_error_errcode; - } - /* - * Start the daily error reporting function if it hasn't been - * started already - */ - if (!es->s_error_count) - mod_timer(&EXT4_SB(sb)->s_err_report, jiffies + 24*60*60*HZ); - le32_add_cpu(&es->s_error_count, 1); + + spin_lock(&sbi->s_error_lock); + sbi->s_add_error_count++; + sbi->s_last_error_code = error; + sbi->s_last_error_line = line; + sbi->s_last_error_ino = ino; + sbi->s_last_error_block = block; + sbi->s_last_error_func = func; + sbi->s_last_error_time = ktime_get_real_seconds(); + if (!sbi->s_first_error_time) { + sbi->s_first_error_code = error; + sbi->s_first_error_line = line; + sbi->s_first_error_ino = ino; + sbi->s_first_error_block = block; + sbi->s_first_error_func = func; + sbi->s_first_error_time = sbi->s_last_error_time; + } + spin_unlock(&sbi->s_error_lock); }
static void save_error_info(struct super_block *sb, int error, @@ -590,6 +583,14 @@ static void ext4_handle_error(struct super_block *sb, bool force_ro) ext4_netlink_send_info(sb, 1); }
+static void flush_stashed_error_work(struct work_struct *work) +{ + struct ext4_sb_info *sbi = container_of(work, struct ext4_sb_info, + s_error_work); + + ext4_commit_super(sbi->s_sb, 1); +} + #define ext4_error_ratelimit(sb) \ ___ratelimit(&(EXT4_SB(sb)->s_err_ratelimit_state), \ "EXT4-fs error") @@ -828,8 +829,6 @@ __acquires(bitlock) return;
trace_ext4_error(sb, function, line); - __save_error_info(sb, EFSCORRUPTED, ino, block, function, line); - if (ext4_error_ratelimit(sb)) { va_start(args, fmt); vaf.fmt = fmt; @@ -845,16 +844,15 @@ __acquires(bitlock) va_end(args); }
- if (test_opt(sb, WARN_ON_ERROR)) - WARN_ON_ONCE(1); - if (test_opt(sb, ERRORS_CONT)) { - ext4_commit_super(sb, 0); + if (test_opt(sb, WARN_ON_ERROR)) + WARN_ON_ONCE(1); + __save_error_info(sb, EFSCORRUPTED, ino, block, function, line); + schedule_work(&EXT4_SB(sb)->s_error_work); return; } - ext4_unlock_group(sb, grp); - ext4_commit_super(sb, 1); + save_error_info(sb, EFSCORRUPTED, ino, block, function, line); ext4_handle_error(sb, false); /* * We only get here in the ERRORS_RO case; relocking the group @@ -1027,6 +1025,7 @@ static void ext4_put_super(struct super_block *sb) ext4_unregister_li_request(sb); ext4_quota_off_umount(sb);
+ flush_work(&sbi->s_error_work); destroy_workqueue(sbi->rsv_conversion_wq);
if (sbi->s_journal) { @@ -4303,6 +4302,8 @@ static int ext4_fill_super(struct super_block *sb, void *data, int silent) }
timer_setup(&sbi->s_err_report, print_daily_error_info, 0); + spin_lock_init(&sbi->s_error_lock); + INIT_WORK(&sbi->s_error_work, flush_stashed_error_work);
/* Register extent status tree shrinker */ if (ext4_es_register_shrinker(sbi)) @@ -4702,6 +4703,7 @@ static int ext4_fill_super(struct super_block *sb, void *data, int silent) ext4_es_unregister_shrinker(sbi); failed_mount3: del_timer_sync(&sbi->s_err_report); + flush_work(&sbi->s_error_work); if (sbi->s_mmp_tsk) kthread_stop(sbi->s_mmp_tsk); failed_mount2: @@ -5023,6 +5025,7 @@ static int ext4_load_journal(struct super_block *sb,
static int ext4_commit_super(struct super_block *sb, int sync) { + struct ext4_sb_info *sbi = EXT4_SB(sb); struct ext4_super_block *es = EXT4_SB(sb)->s_es; struct buffer_head *sbh = EXT4_SB(sb)->s_sbh; int error = 0; @@ -5059,6 +5062,46 @@ static int ext4_commit_super(struct super_block *sb, int sync) es->s_free_inodes_count = cpu_to_le32(percpu_counter_sum_positive( &EXT4_SB(sb)->s_freeinodes_counter)); + /* Copy error information to the on-disk superblock */ + spin_lock(&sbi->s_error_lock); + if (sbi->s_add_error_count > 0) { + es->s_state |= cpu_to_le16(EXT4_ERROR_FS); + if (!es->s_first_error_time && !es->s_first_error_time_hi) { + __ext4_update_tstamp(&es->s_first_error_time, + &es->s_first_error_time_hi, + sbi->s_first_error_time); + strncpy(es->s_first_error_func, sbi->s_first_error_func, + sizeof(es->s_first_error_func)); + es->s_first_error_line = + cpu_to_le32(sbi->s_first_error_line); + es->s_first_error_ino = + cpu_to_le32(sbi->s_first_error_ino); + es->s_first_error_block = + cpu_to_le64(sbi->s_first_error_block); + es->s_first_error_errcode = + ext4_errno_to_code(sbi->s_first_error_code); + } + __ext4_update_tstamp(&es->s_last_error_time, + &es->s_last_error_time_hi, + sbi->s_last_error_time); + strncpy(es->s_last_error_func, sbi->s_last_error_func, + sizeof(es->s_last_error_func)); + es->s_last_error_line = cpu_to_le32(sbi->s_last_error_line); + es->s_last_error_ino = cpu_to_le32(sbi->s_last_error_ino); + es->s_last_error_block = cpu_to_le64(sbi->s_last_error_block); + es->s_last_error_errcode = + ext4_errno_to_code(sbi->s_last_error_code); + /* + * Start the daily error reporting function if it hasn't been + * started already + */ + if (!es->s_error_count) + mod_timer(&sbi->s_err_report, jiffies + 24*60*60*HZ); + le32_add_cpu(&es->s_error_count, sbi->s_add_error_count); + sbi->s_add_error_count = 0; + } + spin_unlock(&sbi->s_error_lock); + BUFFER_TRACE(sbh, "marking dirty"); ext4_superblock_csum_set(sb); if (sync) @@ -5412,6 +5455,9 @@ static int 
ext4_remount(struct super_block *sb, int *flags, char *data) set_task_ioprio(sbi->s_journal->j_task, journal_ioprio); }
+ /* Flush outstanding errors before changing fs state */ + flush_work(&sbi->s_error_work); + if ((bool)(*flags & SB_RDONLY) != sb_rdonly(sb)) { if (sbi->s_mount_flags & EXT4_MF_FS_ABORTED) { err = -EROFS;
From: Jan Kara jack@suse.cz
mainline inclusion from mainline-v5.11-rc4 commit e789ca0cc1d51296832b8424fa4008ce6e9d1703 category: bugfix bugzilla: 46758 CVE: NA
-----------------------------------------------
save_error_info() is always called together with ext4_handle_error(). Combine them into a single call and move unconditional bits out of save_error_info() into ext4_handle_error().
Signed-off-by: Jan Kara jack@suse.cz Link: https://lore.kernel.org/r/20201216101844.22917-2-jack@suse.cz Signed-off-by: Theodore Ts'o tytso@mit.edu
conflicts: fs/ext4/super.c
Signed-off-by: Ye Bin yebin10@huawei.com Reviewed-by: zhangyi (F) yi.zhang@huawei.com Signed-off-by: Yang Yingliang yangyingliang@huawei.com --- fs/ext4/super.c | 34 +++++++++++++++++----------------- 1 file changed, 17 insertions(+), 17 deletions(-)
diff --git a/fs/ext4/super.c b/fs/ext4/super.c index 9ae5a84c80c91..f0986a1ee303c 100644 --- a/fs/ext4/super.c +++ b/fs/ext4/super.c @@ -490,9 +490,6 @@ static void __save_error_info(struct super_block *sb, int error, { struct ext4_sb_info *sbi = EXT4_SB(sb);
- EXT4_SB(sb)->s_mount_state |= EXT4_ERROR_FS; - if (bdev_read_only(sb->s_bdev)) - return; /* We default to EFSCORRUPTED error... */ if (error == 0) error = EFSCORRUPTED; @@ -546,13 +543,19 @@ static void save_error_info(struct super_block *sb, int error, * at a critical moment in log management. */
-static void ext4_handle_error(struct super_block *sb, bool force_ro) +static void ext4_handle_error(struct super_block *sb, bool force_ro, int error, + __u32 ino, __u64 block, + const char *func, unsigned int line) { journal_t *journal = EXT4_SB(sb)->s_journal;
+ EXT4_SB(sb)->s_mount_state |= EXT4_ERROR_FS; if (test_opt(sb, WARN_ON_ERROR)) WARN_ON_ONCE(1);
+ if (!bdev_read_only(sb->s_bdev)) + save_error_info(sb, error, ino, block, func, line); + if (sb_rdonly(sb)) return;
@@ -615,8 +618,7 @@ void __ext4_error(struct super_block *sb, const char *function, sb->s_id, function, line, current->comm, &vaf); va_end(args); } - save_error_info(sb, error, 0, block, function, line); - ext4_handle_error(sb, force_ro); + ext4_handle_error(sb, force_ro, error, 0, block, function, line); }
void __ext4_error_inode(struct inode *inode, const char *function, @@ -646,9 +648,8 @@ void __ext4_error_inode(struct inode *inode, const char *function, current->comm, &vaf); va_end(args); } - save_error_info(inode->i_sb, error, inode->i_ino, block, - function, line); - ext4_handle_error(inode->i_sb, false); + ext4_handle_error(inode->i_sb, false, error, inode->i_ino, block, + function, line); }
void __ext4_error_file(struct file *file, const char *function, @@ -687,9 +688,8 @@ void __ext4_error_file(struct file *file, const char *function, current->comm, path, &vaf); va_end(args); } - save_error_info(inode->i_sb, EFSCORRUPTED, inode->i_ino, block, - function, line); - ext4_handle_error(inode->i_sb, false); + ext4_handle_error(inode->i_sb, false, EFSCORRUPTED, inode->i_ino, block, + function, line); }
const char *ext4_decode_error(struct super_block *sb, int errno, @@ -756,8 +756,7 @@ void __ext4_std_error(struct super_block *sb, const char *function, sb->s_id, function, line, errstr); }
- save_error_info(sb, -errno, 0, 0, function, line); - ext4_handle_error(sb, false); + ext4_handle_error(sb, false, -errno, 0, 0, function, line); }
void __ext4_msg(struct super_block *sb, @@ -847,13 +846,14 @@ __acquires(bitlock) if (test_opt(sb, ERRORS_CONT)) { if (test_opt(sb, WARN_ON_ERROR)) WARN_ON_ONCE(1); + EXT4_SB(sb)->s_mount_state |= EXT4_ERROR_FS; __save_error_info(sb, EFSCORRUPTED, ino, block, function, line); - schedule_work(&EXT4_SB(sb)->s_error_work); + if (!bdev_read_only(sb->s_bdev)) + schedule_work(&EXT4_SB(sb)->s_error_work); return; } ext4_unlock_group(sb, grp); - save_error_info(sb, EFSCORRUPTED, ino, block, function, line); - ext4_handle_error(sb, false); + ext4_handle_error(sb, false, EFSCORRUPTED, ino, block, function, line); /* * We only get here in the ERRORS_RO case; relocking the group * may be dangerous, but nothing bad will happen since the
From: Jan Kara jack@suse.cz
mainline inclusion from mainline-v5.11-rc4 commit 4392fbc4bab57db3760f0fb61258cb7089b37665 category: bugfix bugzilla: 46758 CVE: NA
-----------------------------------------------
Everybody passes 1 as sync argument of ext4_commit_super(). Just drop it.
Reviewed-by: Harshad Shirwadkar harshadshirwadkar@gmail.com Signed-off-by: Jan Kara jack@suse.cz Link: https://lore.kernel.org/r/20201216101844.22917-3-jack@suse.cz Signed-off-by: Theodore Ts'o tytso@mit.edu Signed-off-by: Ye Bin yebin10@huawei.com Reviewed-by: zhangyi (F) yi.zhang@huawei.com Signed-off-by: Yang Yingliang yangyingliang@huawei.com --- fs/ext4/super.c | 47 ++++++++++++++++++++++------------------------- 1 file changed, 22 insertions(+), 25 deletions(-)
diff --git a/fs/ext4/super.c b/fs/ext4/super.c index f0986a1ee303c..371ba6dda9de7 100644 --- a/fs/ext4/super.c +++ b/fs/ext4/super.c @@ -68,7 +68,7 @@ static struct ratelimit_state ext4_mount_msg_ratelimit; static int ext4_load_journal(struct super_block *, struct ext4_super_block *, unsigned long journal_devnum); static int ext4_show_options(struct seq_file *seq, struct dentry *root); -static int ext4_commit_super(struct super_block *sb, int sync); +static int ext4_commit_super(struct super_block *sb); static int ext4_mark_recovery_complete(struct super_block *sb, struct ext4_super_block *es); static int ext4_clear_journal_err(struct super_block *sb, @@ -519,7 +519,7 @@ static void save_error_info(struct super_block *sb, int error, { __save_error_info(sb, error, ino, block, func, line); if (!bdev_read_only(sb->s_bdev)) - ext4_commit_super(sb, 1); + ext4_commit_super(sb); }
/* Deal with the reporting of failure conditions on a filesystem such as @@ -591,7 +591,7 @@ static void flush_stashed_error_work(struct work_struct *work) struct ext4_sb_info *sbi = container_of(work, struct ext4_sb_info, s_error_work);
- ext4_commit_super(sbi->s_sb, 1); + ext4_commit_super(sbi->s_sb); }
#define ext4_error_ratelimit(sb) \ @@ -1049,7 +1049,7 @@ static void ext4_put_super(struct super_block *sb) es->s_state = cpu_to_le16(sbi->s_mount_state); } if (!sb_rdonly(sb)) - ext4_commit_super(sb, 1); + ext4_commit_super(sb);
rcu_read_lock(); group_desc = rcu_dereference(sbi->s_group_desc); @@ -2347,7 +2347,7 @@ static int ext4_setup_super(struct super_block *sb, struct ext4_super_block *es, if (sbi->s_journal) ext4_set_feature_journal_needs_recovery(sb);
- err = ext4_commit_super(sb, 1); + err = ext4_commit_super(sb); done: if (test_opt(sb, DEBUG)) printk(KERN_INFO "[EXT4 FS bs=%lu, gc=%u, " @@ -4471,7 +4471,7 @@ static int ext4_fill_super(struct super_block *sb, void *data, int silent) if (DUMMY_ENCRYPTION_ENABLED(sbi) && !sb_rdonly(sb) && !ext4_has_feature_encrypt(sb)) { ext4_set_feature_encrypt(sb); - ext4_commit_super(sb, 1); + ext4_commit_super(sb); }
/* @@ -5013,7 +5013,7 @@ static int ext4_load_journal(struct super_block *sb, es->s_journal_dev = cpu_to_le32(journal_devnum);
/* Make sure we flush the recovery flag to disk. */ - ext4_commit_super(sb, 1); + ext4_commit_super(sb); }
return 0; @@ -5023,7 +5023,7 @@ static int ext4_load_journal(struct super_block *sb, return err; }
-static int ext4_commit_super(struct super_block *sb, int sync) +static int ext4_commit_super(struct super_block *sb) { struct ext4_sb_info *sbi = EXT4_SB(sb); struct ext4_super_block *es = EXT4_SB(sb)->s_es; @@ -5104,8 +5104,7 @@ static int ext4_commit_super(struct super_block *sb, int sync)
BUFFER_TRACE(sbh, "marking dirty"); ext4_superblock_csum_set(sb); - if (sync) - lock_buffer(sbh); + lock_buffer(sbh); if (buffer_write_io_error(sbh) || !buffer_uptodate(sbh)) { /* * Oh, dear. A previous attempt to write the @@ -5121,16 +5120,14 @@ static int ext4_commit_super(struct super_block *sb, int sync) set_buffer_uptodate(sbh); } mark_buffer_dirty(sbh); - if (sync) { - unlock_buffer(sbh); - error = __sync_dirty_buffer(sbh, - REQ_SYNC | (test_opt(sb, BARRIER) ? REQ_FUA : 0)); - if (buffer_write_io_error(sbh)) { - ext4_msg(sb, KERN_ERR, "I/O error while writing " - "superblock"); - clear_buffer_write_io_error(sbh); - set_buffer_uptodate(sbh); - } + unlock_buffer(sbh); + error = __sync_dirty_buffer(sbh, + REQ_SYNC | (test_opt(sb, BARRIER) ? REQ_FUA : 0)); + if (buffer_write_io_error(sbh)) { + ext4_msg(sb, KERN_ERR, "I/O error while writing " + "superblock"); + clear_buffer_write_io_error(sbh); + set_buffer_uptodate(sbh); } return error; } @@ -5161,7 +5158,7 @@ static int ext4_mark_recovery_complete(struct super_block *sb,
if (ext4_has_feature_journal_needs_recovery(sb) && sb_rdonly(sb)) { ext4_clear_feature_journal_needs_recovery(sb); - ext4_commit_super(sb, 1); + ext4_commit_super(sb); } out: jbd2_journal_unlock_updates(journal); @@ -5203,7 +5200,7 @@ static int ext4_clear_journal_err(struct super_block *sb,
EXT4_SB(sb)->s_mount_state |= EXT4_ERROR_FS; es->s_state |= cpu_to_le16(EXT4_ERROR_FS); - ext4_commit_super(sb, 1); + ext4_commit_super(sb);
jbd2_journal_clear_err(journal); jbd2_journal_update_sb_errno(journal); @@ -5305,7 +5302,7 @@ static int ext4_freeze(struct super_block *sb) ext4_clear_feature_journal_needs_recovery(sb); }
- error = ext4_commit_super(sb, 1); + error = ext4_commit_super(sb); out: if (journal) /* we rely on upper layer to stop further updates */ @@ -5327,7 +5324,7 @@ static int ext4_unfreeze(struct super_block *sb) ext4_set_feature_journal_needs_recovery(sb); }
- ext4_commit_super(sb, 1); + ext4_commit_super(sb); return 0; }
@@ -5595,7 +5592,7 @@ static int ext4_remount(struct super_block *sb, int *flags, char *data) }
if (sbi->s_journal == NULL && !(old_sb_flags & SB_RDONLY)) { - err = ext4_commit_super(sb, 1); + err = ext4_commit_super(sb); if (err) goto restore_opts; }
From: Jan Kara jack@suse.cz
mainline inclusion from mainline-v5.11-rc4 commit 05c2c00f3769abb9e323fcaca70d2de0b48af7ba category: bugfix bugzilla: 46758 CVE: NA
-----------------------------------------------
Protect all superblock modifications (including checksum computation) with a superblock buffer lock. That way we are sure computed checksum matches current superblock contents (a mismatch could cause checksum failures in nojournal mode or if an unjournalled superblock update races with a journalled one). Also we avoid modifying superblock contents while it is being written out (which can cause DIF/DIX failures if we are running in nojournal mode).
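The rule stated above (modify the superblock and recompute its checksum inside one buffer-lock critical section) can be sketched as follows. This is a userspace illustration under stated assumptions: `struct sb_buf` and the `sb_*` helpers are hypothetical, the lock is a plain flag rather than a real buffer lock, and the checksum is a toy function, not crc32c.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical miniature superblock buffer: two fields plus a checksum. */
struct sb_buf {
	bool     locked;	/* stands in for the buffer lock */
	uint32_t last_orphan;
	uint32_t blocks_count;
	uint32_t checksum;	/* covers the two fields above */
};

/* Toy checksum (not crc32c): enough to detect a stale value. */
static uint32_t sb_csum(const struct sb_buf *b)
{
	return (b->last_orphan * 31u) ^ (b->blocks_count + 0x9e3779b9u);
}

static void sb_lock(struct sb_buf *b)
{
	assert(!b->locked);
	b->locked = true;
}

static void sb_unlock(struct sb_buf *b)
{
	assert(b->locked);
	b->locked = false;
}

/* Modify a field and refresh the checksum in one critical section. */
static void sb_set_last_orphan(struct sb_buf *b, uint32_t ino)
{
	sb_lock(b);
	b->last_orphan = ino;
	b->checksum = sb_csum(b);
	sb_unlock(b);
}

/* True when the image is unlocked and its checksum matches its contents. */
static bool sb_consistent(const struct sb_buf *b)
{
	return !b->locked && b->checksum == sb_csum(b);
}
```

A writer that takes the same lock before submitting the buffer would only ever see images for which sb_consistent() holds; a field change made outside the critical section leaves a stale checksum behind.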
Signed-off-by: Jan Kara jack@suse.cz Link: https://lore.kernel.org/r/20201216101844.22917-4-jack@suse.cz Signed-off-by: Theodore Ts'o tytso@mit.edu Signed-off-by: Ye Bin yebin10@huawei.com Reviewed-by: zhangyi (F) yi.zhang@huawei.com Signed-off-by: Yang Yingliang yangyingliang@huawei.com --- fs/ext4/ext4_jbd2.c | 1 - fs/ext4/file.c | 3 +++ fs/ext4/inode.c | 3 +++ fs/ext4/namei.c | 6 ++++++ fs/ext4/resize.c | 12 ++++++++++++ fs/ext4/super.c | 2 +- fs/ext4/xattr.c | 3 +++ 7 files changed, 28 insertions(+), 2 deletions(-)
diff --git a/fs/ext4/ext4_jbd2.c b/fs/ext4/ext4_jbd2.c index af28089958587..f720d8ceeff44 100644 --- a/fs/ext4/ext4_jbd2.c +++ b/fs/ext4/ext4_jbd2.c @@ -359,7 +359,6 @@ int __ext4_handle_dirty_super(const char *where, unsigned int line, struct buffer_head *bh = EXT4_SB(sb)->s_sbh; int err = 0;
- ext4_superblock_csum_set(sb); if (ext4_handle_valid(handle)) { err = jbd2_journal_dirty_metadata(handle, bh); if (err) diff --git a/fs/ext4/file.c b/fs/ext4/file.c index 52d155b4e7334..76dadab4638d9 100644 --- a/fs/ext4/file.c +++ b/fs/ext4/file.c @@ -432,8 +432,11 @@ static int ext4_sample_last_mounted(struct super_block *sb, err = ext4_journal_get_write_access(handle, sbi->s_sbh); if (err) goto out_journal; + lock_buffer(sbi->s_sbh); strlcpy(sbi->s_es->s_last_mounted, cp, sizeof(sbi->s_es->s_last_mounted)); + ext4_superblock_csum_set(sb); + unlock_buffer(sbi->s_sbh); ext4_handle_dirty_super(handle, sb); out_journal: ext4_journal_stop(handle); diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c index c9ebed13d2601..c2b3e87e55056 100644 --- a/fs/ext4/inode.c +++ b/fs/ext4/inode.c @@ -5407,7 +5407,10 @@ static int ext4_do_update_inode(handle_t *handle, err = ext4_journal_get_write_access(handle, EXT4_SB(sb)->s_sbh); if (err) goto out_brelse; + lock_buffer(EXT4_SB(sb)->s_sbh); ext4_set_feature_large_file(sb); + ext4_superblock_csum_set(sb); + unlock_buffer(EXT4_SB(sb)->s_sbh); ext4_handle_sync(handle); err = ext4_handle_dirty_super(handle, sb); } diff --git a/fs/ext4/namei.c b/fs/ext4/namei.c index cb51b7aaacb42..54b78d43a6e54 100644 --- a/fs/ext4/namei.c +++ b/fs/ext4/namei.c @@ -2895,7 +2895,10 @@ int ext4_orphan_add(handle_t *handle, struct inode *inode) (le32_to_cpu(sbi->s_es->s_inodes_count))) { /* Insert this inode at the head of the on-disk orphan list */ NEXT_ORPHAN(inode) = le32_to_cpu(sbi->s_es->s_last_orphan); + lock_buffer(sbi->s_sbh); sbi->s_es->s_last_orphan = cpu_to_le32(inode->i_ino); + ext4_superblock_csum_set(sb); + unlock_buffer(sbi->s_sbh); dirty = true; } list_add(&EXT4_I(inode)->i_orphan, &sbi->s_orphan); @@ -2978,7 +2981,10 @@ int ext4_orphan_del(handle_t *handle, struct inode *inode) mutex_unlock(&sbi->s_orphan_lock); goto out_brelse; } + lock_buffer(sbi->s_sbh); sbi->s_es->s_last_orphan = cpu_to_le32(ino_next); + 
ext4_superblock_csum_set(inode->i_sb); + unlock_buffer(sbi->s_sbh); mutex_unlock(&sbi->s_orphan_lock); err = ext4_handle_dirty_super(handle, inode->i_sb); } else { diff --git a/fs/ext4/resize.c b/fs/ext4/resize.c index 6a0c5c880354a..48c71de2e461b 100644 --- a/fs/ext4/resize.c +++ b/fs/ext4/resize.c @@ -900,7 +900,10 @@ static int add_new_gdb(handle_t *handle, struct inode *inode, EXT4_SB(sb)->s_gdb_count++; ext4_kvfree_array_rcu(o_group_desc);
+ lock_buffer(EXT4_SB(sb)->s_sbh); le16_add_cpu(&es->s_reserved_gdt_blocks, -1); + ext4_superblock_csum_set(sb); + unlock_buffer(EXT4_SB(sb)->s_sbh); err = ext4_handle_dirty_super(handle, sb); if (err) ext4_std_error(sb, err); @@ -1386,6 +1389,7 @@ static void ext4_update_super(struct super_block *sb, reserved_blocks *= blocks_count; do_div(reserved_blocks, 100);
+ lock_buffer(sbi->s_sbh); ext4_blocks_count_set(es, ext4_blocks_count(es) + blocks_count); ext4_free_blocks_count_set(es, ext4_free_blocks_count(es) + free_blocks); le32_add_cpu(&es->s_inodes_count, EXT4_INODES_PER_GROUP(sb) * @@ -1423,6 +1427,8 @@ static void ext4_update_super(struct super_block *sb, * active. */ ext4_r_blocks_count_set(es, ext4_r_blocks_count(es) + reserved_blocks); + ext4_superblock_csum_set(sb); + unlock_buffer(sbi->s_sbh);
/* Update the free space counts */ percpu_counter_add(&sbi->s_freeclusters_counter, @@ -1719,8 +1725,11 @@ static int ext4_group_extend_no_check(struct super_block *sb, goto errout; }
+ lock_buffer(EXT4_SB(sb)->s_sbh); ext4_blocks_count_set(es, o_blocks_count + add); ext4_free_blocks_count_set(es, ext4_free_blocks_count(es) + add); + ext4_superblock_csum_set(sb); + unlock_buffer(EXT4_SB(sb)->s_sbh); ext4_debug("freeing blocks %llu through %llu\n", o_blocks_count, o_blocks_count + add); /* We add the blocks to the bitmap and set the group need init bit */ @@ -1878,10 +1887,13 @@ static int ext4_convert_meta_bg(struct super_block *sb, struct inode *inode) if (err) goto errout;
+ lock_buffer(sbi->s_sbh); ext4_clear_feature_resize_inode(sb); ext4_set_feature_meta_bg(sb); sbi->s_es->s_first_meta_bg = cpu_to_le32(num_desc_blocks(sb, sbi->s_groups_count)); + ext4_superblock_csum_set(sb); + unlock_buffer(sbi->s_sbh);
err = ext4_handle_dirty_super(handle, sb); if (err) { diff --git a/fs/ext4/super.c b/fs/ext4/super.c index 371ba6dda9de7..8859803da1661 100644 --- a/fs/ext4/super.c +++ b/fs/ext4/super.c @@ -5033,6 +5033,7 @@ static int ext4_commit_super(struct super_block *sb) if (!sbh || block_device_ejected(sb)) return error;
+ lock_buffer(sbh); /* * If the file system is mounted read-only, don't update the * superblock write time. This avoids updating the superblock @@ -5104,7 +5105,6 @@ static int ext4_commit_super(struct super_block *sb)
BUFFER_TRACE(sbh, "marking dirty"); ext4_superblock_csum_set(sb); - lock_buffer(sbh); if (buffer_write_io_error(sbh) || !buffer_uptodate(sbh)) { /* * Oh, dear. A previous attempt to write the diff --git a/fs/ext4/xattr.c b/fs/ext4/xattr.c index 1a8416f522313..4c2d1afd005fd 100644 --- a/fs/ext4/xattr.c +++ b/fs/ext4/xattr.c @@ -790,7 +790,10 @@ static void ext4_xattr_update_super_block(handle_t *handle,
BUFFER_TRACE(EXT4_SB(sb)->s_sbh, "get_write_access"); if (ext4_journal_get_write_access(handle, EXT4_SB(sb)->s_sbh) == 0) { + lock_buffer(EXT4_SB(sb)->s_sbh); ext4_set_feature_xattr(sb); + ext4_superblock_csum_set(sb); + unlock_buffer(EXT4_SB(sb)->s_sbh); ext4_handle_dirty_super(handle, sb); } }
From: Jan Kara jack@suse.cz
mainline inclusion from mainline-v5.11-rc4 commit 2d01ddc86606564fb08c56e3bc93a0693895f710 category: bugfix bugzilla: 46758 CVE: NA
-----------------------------------------------
If journalling is still working when we get to writing error information to the superblock, we cannot write directly to the superblock: such a write could race with a journalled update of the superblock and cause journal checksum failures, inconsistent information being written to the journal, or other problems. We also cannot journal the superblock directly from the error handling functions, as we are running in an uncertain context and could deadlock, so just punt the journalled superblock update to a workqueue.
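The decision made by the deferred worker (try the journalled path first, fall back to a direct write if any journal step fails) can be sketched as a small pure function. The names here are hypothetical: `struct fs_state` models the outcome of each journal step as a boolean, and `flush_error_info()` only reports which path would be taken instead of doing I/O.

```c
#include <assert.h>
#include <stdbool.h>

enum sb_write_path { SB_WRITE_JOURNALLED, SB_WRITE_DIRECT };

/* Hypothetical description of the state seen by the deferred worker. */
struct fs_state {
	bool read_only;
	bool has_journal;
	bool journal_start_ok;	/* starting a handle succeeds */
	bool write_access_ok;	/* getting write access succeeds */
	bool dirty_metadata_ok;	/* marking the buffer dirty succeeds */
};

/* Report which path the superblock update would take. */
static enum sb_write_path flush_error_info(const struct fs_state *fs)
{
	if (!fs->read_only && fs->has_journal) {
		if (!fs->journal_start_ok)
			goto write_directly;
		if (!fs->write_access_ok)
			goto write_directly;
		/* the superblock image would be updated here */
		if (!fs->dirty_metadata_ok)
			goto write_directly;
		return SB_WRITE_JOURNALLED;
	}
write_directly:
	/* Journalled write not possible: write the superblock directly. */
	return SB_WRITE_DIRECT;
}
```

The direct write is the last resort in every failure case, so the error information still has a chance of reaching the disk even when the journal cannot be used.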
Signed-off-by: Jan Kara jack@suse.cz Link: https://lore.kernel.org/r/20201216101844.22917-5-jack@suse.cz Signed-off-by: Theodore Ts'o tytso@mit.edu
conflicts: fs/ext4/super.c
Signed-off-by: Ye Bin yebin10@huawei.com Reviewed-by: zhangyi (F) yi.zhang@huawei.com Signed-off-by: Yang Yingliang yangyingliang@huawei.com --- fs/ext4/super.c | 101 +++++++++++++++++++++++++++++++++++------------- 1 file changed, 75 insertions(+), 26 deletions(-)
diff --git a/fs/ext4/super.c b/fs/ext4/super.c index 8859803da1661..7a6a2c73caf66 100644 --- a/fs/ext4/super.c +++ b/fs/ext4/super.c @@ -68,6 +68,7 @@ static struct ratelimit_state ext4_mount_msg_ratelimit; static int ext4_load_journal(struct super_block *, struct ext4_super_block *, unsigned long journal_devnum); static int ext4_show_options(struct seq_file *seq, struct dentry *root); +static void ext4_update_super(struct super_block *sb); static int ext4_commit_super(struct super_block *sb); static int ext4_mark_recovery_complete(struct super_block *sb, struct ext4_super_block *es); @@ -484,9 +485,9 @@ static int ext4_errno_to_code(int errno) return EXT4_ERR_UNKNOWN; }
-static void __save_error_info(struct super_block *sb, int error, - __u32 ino, __u64 block, - const char *func, unsigned int line) +static void save_error_info(struct super_block *sb, int error, + __u32 ino, __u64 block, + const char *func, unsigned int line) { struct ext4_sb_info *sbi = EXT4_SB(sb);
@@ -513,15 +514,6 @@ static void __save_error_info(struct super_block *sb, int error, spin_unlock(&sbi->s_error_lock); }
-static void save_error_info(struct super_block *sb, int error, - __u32 ino, __u64 block, - const char *func, unsigned int line) -{ - __save_error_info(sb, error, ino, block, func, line); - if (!bdev_read_only(sb->s_bdev)) - ext4_commit_super(sb); -} - /* Deal with the reporting of failure conditions on a filesystem such as * inconsistencies detected or read IO failures. * @@ -548,23 +540,38 @@ static void ext4_handle_error(struct super_block *sb, bool force_ro, int error, const char *func, unsigned int line) { journal_t *journal = EXT4_SB(sb)->s_journal; + bool continue_fs = !force_ro && test_opt(sb, ERRORS_CONT);
EXT4_SB(sb)->s_mount_state |= EXT4_ERROR_FS; if (test_opt(sb, WARN_ON_ERROR)) WARN_ON_ONCE(1);
- if (!bdev_read_only(sb->s_bdev)) + if (!continue_fs && !sb_rdonly(sb)) { + EXT4_SB(sb)->s_mount_flags |= EXT4_MF_FS_ABORTED; + if (journal) + jbd2_journal_abort(journal, -EIO); + } + + if (!bdev_read_only(sb->s_bdev)) { save_error_info(sb, error, ino, block, func, line); + /* + * In case the fs should keep running, we need to writeout + * superblock through the journal. Due to lock ordering + * constraints, it may not be safe to do it right here so we + * defer superblock flushing to a workqueue. + */ + if (continue_fs) + schedule_work(&EXT4_SB(sb)->s_error_work); + else + ext4_commit_super(sb); + }
if (sb_rdonly(sb)) return;
- if (!force_ro && test_opt(sb, ERRORS_CONT)) + if (continue_fs) goto out;
- EXT4_SB(sb)->s_mount_flags |= EXT4_MF_FS_ABORTED; - if (journal) - jbd2_journal_abort(journal, -EIO); /* * We force ERRORS_RO behavior when system is rebooting. Otherwise we * could panic during 'reboot -f' as the underlying device got already @@ -590,7 +597,38 @@ static void flush_stashed_error_work(struct work_struct *work) { struct ext4_sb_info *sbi = container_of(work, struct ext4_sb_info, s_error_work); + journal_t *journal = sbi->s_journal; + handle_t *handle;
+ /* + * If the journal is still running, we have to write out superblock + * through the journal to avoid collisions of other journalled sb + * updates. + * + * We use directly jbd2 functions here to avoid recursing back into + * ext4 error handling code during handling of previous errors. + */ + if (!sb_rdonly(sbi->s_sb) && journal) { + handle = jbd2_journal_start(journal, 1); + if (IS_ERR(handle)) + goto write_directly; + if (jbd2_journal_get_write_access(handle, sbi->s_sbh)) { + jbd2_journal_stop(handle); + goto write_directly; + } + ext4_update_super(sbi->s_sb); + if (jbd2_journal_dirty_metadata(handle, sbi->s_sbh)) { + jbd2_journal_stop(handle); + goto write_directly; + } + jbd2_journal_stop(handle); + return; + } +write_directly: + /* + * Write through journal failed. Write sb directly to get error info + * out and hope for the best. + */ ext4_commit_super(sbi->s_sb); }
@@ -847,9 +885,11 @@ __acquires(bitlock) if (test_opt(sb, WARN_ON_ERROR)) WARN_ON_ONCE(1); EXT4_SB(sb)->s_mount_state |= EXT4_ERROR_FS; - __save_error_info(sb, EFSCORRUPTED, ino, block, function, line); - if (!bdev_read_only(sb->s_bdev)) + if (!bdev_read_only(sb->s_bdev)) { + save_error_info(sb, EFSCORRUPTED, ino, block, function, + line); schedule_work(&EXT4_SB(sb)->s_error_work); + } return; } ext4_unlock_group(sb, grp); @@ -5023,15 +5063,12 @@ static int ext4_load_journal(struct super_block *sb, return err; }
-static int ext4_commit_super(struct super_block *sb) +/* Copy state of EXT4_SB(sb) into buffer for on-disk superblock */ +static void ext4_update_super(struct super_block *sb) { struct ext4_sb_info *sbi = EXT4_SB(sb); struct ext4_super_block *es = EXT4_SB(sb)->s_es; struct buffer_head *sbh = EXT4_SB(sb)->s_sbh; - int error = 0; - - if (!sbh || block_device_ejected(sb)) - return error;
lock_buffer(sbh); /* @@ -5103,8 +5140,20 @@ static int ext4_commit_super(struct super_block *sb) } spin_unlock(&sbi->s_error_lock);
- BUFFER_TRACE(sbh, "marking dirty"); ext4_superblock_csum_set(sb); + unlock_buffer(sbh); +} + +static int ext4_commit_super(struct super_block *sb) +{ + struct buffer_head *sbh = EXT4_SB(sb)->s_sbh; + int error = 0; + + if (!sbh || block_device_ejected(sb)) + return error; + + ext4_update_super(sb); + if (buffer_write_io_error(sbh) || !buffer_uptodate(sbh)) { /* * Oh, dear. A previous attempt to write the @@ -5119,8 +5168,8 @@ static int ext4_commit_super(struct super_block *sb) clear_buffer_write_io_error(sbh); set_buffer_uptodate(sbh); } + BUFFER_TRACE(sbh, "marking dirty"); mark_buffer_dirty(sbh); - unlock_buffer(sbh); error = __sync_dirty_buffer(sbh, REQ_SYNC | (test_opt(sb, BARRIER) ? REQ_FUA : 0)); if (buffer_write_io_error(sbh)) {
From: Jan Kara jack@suse.cz
mainline inclusion from mainline-v5.11-rc4 commit e92ad03fa53498f12b3f5ecb8822adc3bf815b28 category: bugfix bugzilla: 46758 CVE: NA
-----------------------------------------------
No behavioral change.
Signed-off-by: Jan Kara jack@suse.cz Link: https://lore.kernel.org/r/20201216101844.22917-6-jack@suse.cz Signed-off-by: Theodore Ts'o tytso@mit.edu Signed-off-by: Ye Bin yebin10@huawei.com Reviewed-by: zhangyi (F) yi.zhang@huawei.com Signed-off-by: Yang Yingliang yangyingliang@huawei.com --- fs/ext4/super.c | 21 ++++++++++----------- 1 file changed, 10 insertions(+), 11 deletions(-)
diff --git a/fs/ext4/super.c b/fs/ext4/super.c index 7a6a2c73caf66..6b93f83564bdd 100644 --- a/fs/ext4/super.c +++ b/fs/ext4/super.c @@ -5067,8 +5067,8 @@ static int ext4_load_journal(struct super_block *sb, static void ext4_update_super(struct super_block *sb) { struct ext4_sb_info *sbi = EXT4_SB(sb); - struct ext4_super_block *es = EXT4_SB(sb)->s_es; - struct buffer_head *sbh = EXT4_SB(sb)->s_sbh; + struct ext4_super_block *es = sbi->s_es; + struct buffer_head *sbh = sbi->s_sbh;
lock_buffer(sbh); /* @@ -5085,21 +5085,20 @@ static void ext4_update_super(struct super_block *sb) ext4_update_tstamp(es, s_wtime); if (sb->s_bdev->bd_part) es->s_kbytes_written = - cpu_to_le64(EXT4_SB(sb)->s_kbytes_written + + cpu_to_le64(sbi->s_kbytes_written + ((part_stat_read(sb->s_bdev->bd_part, sectors[STAT_WRITE]) - - EXT4_SB(sb)->s_sectors_written_start) >> 1)); + sbi->s_sectors_written_start) >> 1)); else - es->s_kbytes_written = - cpu_to_le64(EXT4_SB(sb)->s_kbytes_written); - if (percpu_counter_initialized(&EXT4_SB(sb)->s_freeclusters_counter)) + es->s_kbytes_written = cpu_to_le64(sbi->s_kbytes_written); + if (percpu_counter_initialized(&sbi->s_freeclusters_counter)) ext4_free_blocks_count_set(es, - EXT4_C2B(EXT4_SB(sb), percpu_counter_sum_positive( - &EXT4_SB(sb)->s_freeclusters_counter))); - if (percpu_counter_initialized(&EXT4_SB(sb)->s_freeinodes_counter)) + EXT4_C2B(sbi, percpu_counter_sum_positive( + &sbi->s_freeclusters_counter))); + if (percpu_counter_initialized(&sbi->s_freeinodes_counter)) es->s_free_inodes_count = cpu_to_le32(percpu_counter_sum_positive( - &EXT4_SB(sb)->s_freeinodes_counter)); + &sbi->s_freeinodes_counter)); /* Copy error information to the on-disk superblock */ spin_lock(&sbi->s_error_lock); if (sbi->s_add_error_count > 0) {
From: Jan Kara jack@suse.cz
mainline inclusion from mainline-v5.11-rc4 commit a3f5cf14ff917d46a4d491cf86210fd639d1ff38 category: bugfix bugzilla: 46758 CVE: NA
-----------------------------------------------
The wrapper is now useless since it does what ext4_handle_dirty_metadata() does. Just remove it.
Signed-off-by: Jan Kara jack@suse.cz Link: https://lore.kernel.org/r/20201216101844.22917-9-jack@suse.cz Signed-off-by: Theodore Ts'o tytso@mit.edu Signed-off-by: Ye Bin yebin10@huawei.com Reviewed-by: zhangyi (F) yi.zhang@huawei.com Signed-off-by: Yang Yingliang yangyingliang@huawei.com --- fs/ext4/ext4_jbd2.c | 16 ---------------- fs/ext4/ext4_jbd2.h | 5 ----- fs/ext4/file.c | 2 +- fs/ext4/inode.c | 3 ++- fs/ext4/namei.c | 4 ++-- fs/ext4/resize.c | 8 ++++---- fs/ext4/xattr.c | 2 +- 7 files changed, 10 insertions(+), 30 deletions(-)
diff --git a/fs/ext4/ext4_jbd2.c b/fs/ext4/ext4_jbd2.c index f720d8ceeff44..74a7bd566646c 100644 --- a/fs/ext4/ext4_jbd2.c +++ b/fs/ext4/ext4_jbd2.c @@ -352,19 +352,3 @@ int __ext4_handle_dirty_metadata(const char *where, unsigned int line, } return err; } - -int __ext4_handle_dirty_super(const char *where, unsigned int line, - handle_t *handle, struct super_block *sb) -{ - struct buffer_head *bh = EXT4_SB(sb)->s_sbh; - int err = 0; - - if (ext4_handle_valid(handle)) { - err = jbd2_journal_dirty_metadata(handle, bh); - if (err) - ext4_journal_abort_handle(where, line, __func__, - bh, handle, err); - } else - mark_buffer_dirty(bh); - return err; -} diff --git a/fs/ext4/ext4_jbd2.h b/fs/ext4/ext4_jbd2.h index 25396e51138a3..d11c073a8d2d1 100644 --- a/fs/ext4/ext4_jbd2.h +++ b/fs/ext4/ext4_jbd2.h @@ -244,9 +244,6 @@ int __ext4_handle_dirty_metadata(const char *where, unsigned int line, handle_t *handle, struct inode *inode, struct buffer_head *bh);
-int __ext4_handle_dirty_super(const char *where, unsigned int line, - handle_t *handle, struct super_block *sb); - #define ext4_journal_get_write_access(handle, bh) \ __ext4_journal_get_write_access(__func__, __LINE__, (handle), (bh)) #define ext4_forget(handle, is_metadata, inode, bh, block_nr) \ @@ -257,8 +254,6 @@ int __ext4_handle_dirty_super(const char *where, unsigned int line, #define ext4_handle_dirty_metadata(handle, inode, bh) \ __ext4_handle_dirty_metadata(__func__, __LINE__, (handle), (inode), \ (bh)) -#define ext4_handle_dirty_super(handle, sb) \ - __ext4_handle_dirty_super(__func__, __LINE__, (handle), (sb))
handle_t *__ext4_journal_start_sb(struct super_block *sb, unsigned int line, int type, int blocks, int rsv_blocks, diff --git a/fs/ext4/file.c b/fs/ext4/file.c index 76dadab4638d9..4e791056a860b 100644 --- a/fs/ext4/file.c +++ b/fs/ext4/file.c @@ -437,7 +437,7 @@ static int ext4_sample_last_mounted(struct super_block *sb, sizeof(sbi->s_es->s_last_mounted)); ext4_superblock_csum_set(sb); unlock_buffer(sbi->s_sbh); - ext4_handle_dirty_super(handle, sb); + ext4_handle_dirty_metadata(handle, NULL, sbi->s_sbh); out_journal: ext4_journal_stop(handle); out: diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c index c2b3e87e55056..4827c35c6deb4 100644 --- a/fs/ext4/inode.c +++ b/fs/ext4/inode.c @@ -5412,7 +5412,8 @@ static int ext4_do_update_inode(handle_t *handle, ext4_superblock_csum_set(sb); unlock_buffer(EXT4_SB(sb)->s_sbh); ext4_handle_sync(handle); - err = ext4_handle_dirty_super(handle, sb); + err = ext4_handle_dirty_metadata(handle, NULL, + EXT4_SB(sb)->s_sbh); } ext4_update_inode_fsync_trans(handle, inode, need_datasync); out_brelse: diff --git a/fs/ext4/namei.c b/fs/ext4/namei.c index 54b78d43a6e54..300e5e17f6e73 100644 --- a/fs/ext4/namei.c +++ b/fs/ext4/namei.c @@ -2905,7 +2905,7 @@ int ext4_orphan_add(handle_t *handle, struct inode *inode) mutex_unlock(&sbi->s_orphan_lock);
if (dirty) { - err = ext4_handle_dirty_super(handle, sb); + err = ext4_handle_dirty_metadata(handle, NULL, sbi->s_sbh); rc = ext4_mark_iloc_dirty(handle, inode, &iloc); if (!err) err = rc; @@ -2986,7 +2986,7 @@ int ext4_orphan_del(handle_t *handle, struct inode *inode) ext4_superblock_csum_set(inode->i_sb); unlock_buffer(sbi->s_sbh); mutex_unlock(&sbi->s_orphan_lock); - err = ext4_handle_dirty_super(handle, inode->i_sb); + err = ext4_handle_dirty_metadata(handle, NULL, sbi->s_sbh); } else { struct ext4_iloc iloc2; struct inode *i_prev = diff --git a/fs/ext4/resize.c b/fs/ext4/resize.c index 48c71de2e461b..cb89381ac5dde 100644 --- a/fs/ext4/resize.c +++ b/fs/ext4/resize.c @@ -904,7 +904,7 @@ static int add_new_gdb(handle_t *handle, struct inode *inode, le16_add_cpu(&es->s_reserved_gdt_blocks, -1); ext4_superblock_csum_set(sb); unlock_buffer(EXT4_SB(sb)->s_sbh); - err = ext4_handle_dirty_super(handle, sb); + err = ext4_handle_dirty_metadata(handle, NULL, EXT4_SB(sb)->s_sbh); if (err) ext4_std_error(sb, err); return err; @@ -1523,7 +1523,7 @@ static int ext4_flex_group_add(struct super_block *sb,
ext4_update_super(sb, flex_gd);
- err = ext4_handle_dirty_super(handle, sb); + err = ext4_handle_dirty_metadata(handle, NULL, sbi->s_sbh);
exit_journal: err2 = ext4_journal_stop(handle); @@ -1736,7 +1736,7 @@ static int ext4_group_extend_no_check(struct super_block *sb, err = ext4_group_add_blocks(handle, sb, o_blocks_count, add); if (err) goto errout; - ext4_handle_dirty_super(handle, sb); + ext4_handle_dirty_metadata(handle, NULL, EXT4_SB(sb)->s_sbh); ext4_debug("freed blocks %llu through %llu\n", o_blocks_count, o_blocks_count + add); errout: @@ -1895,7 +1895,7 @@ static int ext4_convert_meta_bg(struct super_block *sb, struct inode *inode) ext4_superblock_csum_set(sb); unlock_buffer(sbi->s_sbh);
- err = ext4_handle_dirty_super(handle, sb); + err = ext4_handle_dirty_metadata(handle, NULL, sbi->s_sbh); if (err) { ext4_std_error(sb, err); goto errout; diff --git a/fs/ext4/xattr.c b/fs/ext4/xattr.c index 4c2d1afd005fd..0654b00bbdc1d 100644 --- a/fs/ext4/xattr.c +++ b/fs/ext4/xattr.c @@ -794,7 +794,7 @@ static void ext4_xattr_update_super_block(handle_t *handle, ext4_set_feature_xattr(sb); ext4_superblock_csum_set(sb); unlock_buffer(EXT4_SB(sb)->s_sbh); - ext4_handle_dirty_super(handle, sb); + ext4_handle_dirty_metadata(handle, NULL, EXT4_SB(sb)->s_sbh); } }
From: Theodore Ts'o tytso@mit.edu
mainline inclusion from mainline-v5.12-rc1 commit 027f14f5357279655c3ebc6d14daff8368d4f53f category: bugfix bugzilla: 46758 CVE: NA
-----------------------------------------------
If we try to make any changes via the journal after the journal is initialized but before the multi-block allocator is initialized, we will end up dereferencing a NULL pointer when the journal commit callback function calls ext4_process_freed_data().
The proximate cause of this failure was commit 2d01ddc86606 ("ext4: save error info to sb through journal if available"), since file system corruption problems detected before the call to ext4_mb_init() would result in a journal commit before we aborted the mount of the file system, and we would then trigger the NULL pointer dereference.
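The ordering constraint can be illustrated with a minimal user-space sketch. Every name here (`toy_fs`, `toy_journal`, `toy_allocator`, `toy_mount()`) is a hypothetical stand-in and not a kernel API; the sketch only assumes that the commit callback dereferences allocator state, as the message above describes.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for the multi-block allocator state. */
struct toy_allocator {
    int freed_extents_processed;
};

struct toy_fs;

/* Hypothetical stand-in for a journal with an optional commit callback. */
struct toy_journal {
    void (*commit_callback)(struct toy_fs *fs);
};

struct toy_fs {
    struct toy_journal journal;
    struct toy_allocator *alloc; /* NULL until the allocator is set up */
};

/* The callback uses fs->alloc, so it must never run before allocator init. */
static void toy_commit_callback(struct toy_fs *fs)
{
    assert(fs->alloc != NULL);
    fs->alloc->freed_extents_processed++;
}

/* A commit invokes the callback only if one has been installed. */
static void toy_journal_commit(struct toy_fs *fs)
{
    if (fs->journal.commit_callback)
        fs->journal.commit_callback(fs);
}

/* Mirrors the fixed ordering: allocator first, callback second. */
static void toy_mount(struct toy_fs *fs, struct toy_allocator *alloc)
{
    fs->journal.commit_callback = NULL;
    fs->alloc = NULL;

    /* An early commit (e.g. to record an error) is harmless here. */
    toy_journal_commit(fs);

    fs->alloc = alloc;                                 /* allocator ready */
    fs->journal.commit_callback = toy_commit_callback; /* now safe */
}
```

With this ordering, a commit issued during `toy_mount()` finds no callback and returns, while a commit issued afterwards finds both the callback and the allocator in place.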
Link: https://lore.kernel.org/r/YAm8qH/0oo2ofSMR@mit.edu Reported-by: Murphy Zhou jencce.kernel@gmail.com Reviewed-by: Jan Kara jack@suse.cz Signed-off-by: Theodore Ts'o tytso@mit.edu
conflicts: fs/ext4/super.c
Signed-off-by: Ye Bin yebin10@huawei.com Reviewed-by: zhangyi (F) yi.zhang@huawei.com Signed-off-by: Yang Yingliang yangyingliang@huawei.com --- fs/ext4/super.c | 10 ++++++++-- 1 file changed, 8 insertions(+), 2 deletions(-)
diff --git a/fs/ext4/super.c b/fs/ext4/super.c index 6b93f83564bdd..1e040e0bc4879 100644 --- a/fs/ext4/super.c +++ b/fs/ext4/super.c @@ -4480,8 +4480,6 @@ static int ext4_fill_super(struct super_block *sb, void *data, int silent)
set_task_ioprio(sbi->s_journal->j_task, journal_ioprio);
- sbi->s_journal->j_commit_callback = ext4_journal_commit_callback; - no_journal: if (!test_opt(sb, NO_MBCACHE)) { sbi->s_ea_block_cache = ext4_xattr_create_cache(); @@ -4588,6 +4586,14 @@ static int ext4_fill_super(struct super_block *sb, void *data, int silent) goto failed_mount5; }
+ /* + * We can only set up the journal commit callback once + * mballoc is initialized + */ + if (sbi->s_journal) + sbi->s_journal->j_commit_callback = + ext4_journal_commit_callback; + block = ext4_count_free_clusters(sb); ext4_free_blocks_count_set(sbi->s_es, EXT4_C2B(sbi, block));
From: Jason Yan yanaijie@huawei.com
mainline inclusion from mainline-v5.7-rc2 commit 05ca87c149ae8078fb2a23adc6329eed5bb078fb category: bugfix bugzilla: 46758 CVE: NA
-----------------------------------------------
Fix the following gcc warning:
fs/ext4/super.c:599:27: warning: variable 'es' set but not used [-Wunused-but-set-variable] struct ext4_super_block *es; ^~ Fixes: 2ea2fc775321 ("ext4: save all error info in save_error_info() and drop ext4_set_errno()") Reported-by: Hulk Robot hulkci@huawei.com Signed-off-by: Jason Yan yanaijie@huawei.com Link: https://lore.kernel.org/r/20200402033939.25303-1-yanaijie@huawei.com Signed-off-by: Theodore Ts'o tytso@mit.edu Signed-off-by: Ye Bin yebin10@huawei.com Reviewed-by: zhangyi (F) yi.zhang@huawei.com Signed-off-by: Yang Yingliang yangyingliang@huawei.com --- fs/ext4/super.c | 2 -- 1 file changed, 2 deletions(-)
diff --git a/fs/ext4/super.c b/fs/ext4/super.c index 1e040e0bc4879..f92b5cc115e2e 100644 --- a/fs/ext4/super.c +++ b/fs/ext4/super.c @@ -696,7 +696,6 @@ void __ext4_error_file(struct file *file, const char *function, { va_list args; struct va_format vaf; - struct ext4_super_block *es; struct inode *inode = file_inode(file); char pathname[80], *path;
@@ -704,7 +703,6 @@ void __ext4_error_file(struct file *file, const char *function, return;
trace_ext4_error(inode->i_sb, function, line); - es = EXT4_SB(inode->i_sb)->s_es; if (ext4_error_ratelimit(inode->i_sb)) { path = file_path(file, pathname, sizeof(pathname)); if (IS_ERR(path))
From: Jason Yan yanaijie@huawei.com
mainline inclusion from mainline-v5.7-rc2 commit 648814111af26485762a22da0f4b3159f3f9632c category: bugfix bugzilla: 46758 CVE: NA
-----------------------------------------------
Fix the following gcc warning:
fs/ext4/ext4_jbd2.c:341:30: warning: variable 'es' set but not used [-Wunused-but-set-variable] struct ext4_super_block *es; ^~
Fixes: 2ea2fc775321 ("ext4: save all error info in save_error_info() and drop ext4_set_errno()") Reported-by: Hulk Robot hulkci@huawei.com Signed-off-by: Jason Yan yanaijie@huawei.com Link: https://lore.kernel.org/r/20200402034759.29957-1-yanaijie@huawei.com Signed-off-by: Theodore Ts'o tytso@mit.edu Signed-off-by: Ye Bin yebin10@huawei.com Reviewed-by: zhangyi (F) yi.zhang@huawei.com Signed-off-by: Yang Yingliang yangyingliang@huawei.com --- fs/ext4/ext4_jbd2.c | 3 --- 1 file changed, 3 deletions(-)
diff --git a/fs/ext4/ext4_jbd2.c b/fs/ext4/ext4_jbd2.c index 74a7bd566646c..fd7c41da1f8f9 100644 --- a/fs/ext4/ext4_jbd2.c +++ b/fs/ext4/ext4_jbd2.c @@ -340,9 +340,6 @@ int __ext4_handle_dirty_metadata(const char *where, unsigned int line, if (inode && inode_needs_sync(inode)) { sync_dirty_buffer(bh); if (buffer_req(bh) && !buffer_uptodate(bh)) { - struct ext4_super_block *es; - - es = EXT4_SB(inode->i_sb)->s_es; ext4_error_inode_err(inode, where, line, bh->b_blocknr, EIO, "IO error syncing itable block");
From: Yu Kuai yukuai3@huawei.com
hulk inclusion category: bugfix bugzilla: 50526 CVE: NA ---------------------------
Inode atime/mtime is 64-bit, but the xfs on-disk atime/mtime is 32-bit (the supported range is Dec 13 20:45:52 UTC 1901 to Jan 19 03:14:07 UTC 2038). Thus, if the in-memory atime/mtime overflows that range, atime/mtime will be wrong after an umount and mount.
To fix this, truncate atime/ctime/mtime in xfs_setattr_time(), on the xfs_vn_setattr() path.
This problem was fixed in commit 22b139691f9e ("fs: Fill in max and min timestamps in superblock") from mainline, which relied on commit 50e17c000c46 ("vfs: Add timestamp_truncate() api") and commit 188d20bcd1eb ("vfs: Add file timestamp range support"). However, backporting these patches would break kABI, so we make a local adaptation for xfs instead.
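The clamping idea can be tried out in user space with a small sketch. The names `toy_timespec64`, `toy_timestamp_truncate()` and `toy_store_unclamped()` are hypothetical; zeroing the nanoseconds at either limit is an assumption modelled on the generic timestamp_truncate() helper named above, and the wrap-around result assumes the modulo conversion that GCC and Clang implement.

```c
#include <assert.h>
#include <stdint.h>

/* Limits of a signed 32-bit on-disk seconds field. */
#define TOY_TIME_MIN ((int64_t)INT32_MIN)
#define TOY_TIME_MAX ((int64_t)INT32_MAX)

/* Hypothetical stand-in for the kernel's struct timespec64. */
struct toy_timespec64 {
    int64_t tv_sec;
    long    tv_nsec;
};

/* Clamp the seconds into the 32-bit range before they reach the disk. */
static struct toy_timespec64 toy_timestamp_truncate(struct toy_timespec64 t)
{
    if (t.tv_sec < TOY_TIME_MIN)
        t.tv_sec = TOY_TIME_MIN;
    else if (t.tv_sec > TOY_TIME_MAX)
        t.tv_sec = TOY_TIME_MAX;

    /* At either limit, drop the sub-second part (assumed behaviour). */
    if (t.tv_sec == TOY_TIME_MIN || t.tv_sec == TOY_TIME_MAX)
        t.tv_nsec = 0;

    return t;
}

/* What a 32-bit field ends up holding without clamping: it wraps around. */
static int32_t toy_store_unclamped(int64_t sec)
{
    return (int32_t)(uint32_t)sec;
}
```

Storing one second past the maximum without clamping yields the minimum, which is how a date shortly after 2038 can read back as 1901 after a remount; the clamped variant saturates at the limit instead.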
Signed-off-by: Yu Kuai yukuai3@huawei.com Reviewed-by: zhangyi (F) yi.zhang@huawei.com Signed-off-by: Yang Yingliang yangyingliang@huawei.com --- fs/xfs/libxfs/xfs_format.h | 12 ++++++++++++ fs/xfs/xfs_iops.c | 17 ++++++++++++++--- 2 files changed, 26 insertions(+), 3 deletions(-)
diff --git a/fs/xfs/libxfs/xfs_format.h b/fs/xfs/libxfs/xfs_format.h index afbe336600e16..c63c27248d5e9 100644 --- a/fs/xfs/libxfs/xfs_format.h +++ b/fs/xfs/libxfs/xfs_format.h @@ -832,6 +832,18 @@ typedef struct xfs_timestamp { __be32 t_nsec; /* timestamp nanoseconds */ } xfs_timestamp_t;
+/* + * Smallest possible ondisk seconds value with traditional timestamps. This + * corresponds exactly with the incore timestamp Dec 13 20:45:52 UTC 1901. + */ +#define XFS_LEGACY_TIME_MIN ((int64_t)S32_MIN) + +/* + * Largest possible ondisk seconds value with traditional timestamps. This + * corresponds exactly with the incore timestamp Jan 19 03:14:07 UTC 2038. + */ +#define XFS_LEGACY_TIME_MAX ((int64_t)S32_MAX) + /* * On-disk inode structure. * diff --git a/fs/xfs/xfs_iops.c b/fs/xfs/xfs_iops.c index 6011086b51deb..0ac63cafb32a1 100644 --- a/fs/xfs/xfs_iops.c +++ b/fs/xfs/xfs_iops.c @@ -584,6 +584,17 @@ xfs_setattr_mode( inode->i_mode |= mode & ~S_IFMT; }
+static inline struct timespec64 xfs_timestamp_truncate(struct timespec64 t) +{ + t.tv_sec = clamp(t.tv_sec, XFS_LEGACY_TIME_MIN, XFS_LEGACY_TIME_MAX); + + if (unlikely(t.tv_sec == XFS_LEGACY_TIME_MIN || + t.tv_sec == XFS_LEGACY_TIME_MAX)) + t.tv_nsec = 0; + + return t; +} + void xfs_setattr_time( struct xfs_inode *ip, @@ -594,11 +605,11 @@ xfs_setattr_time( ASSERT(xfs_isilocked(ip, XFS_ILOCK_EXCL));
if (iattr->ia_valid & ATTR_ATIME) - inode->i_atime = iattr->ia_atime; + inode->i_atime = xfs_timestamp_truncate(iattr->ia_atime); if (iattr->ia_valid & ATTR_CTIME) - inode->i_ctime = iattr->ia_ctime; + inode->i_ctime = xfs_timestamp_truncate(iattr->ia_ctime); if (iattr->ia_valid & ATTR_MTIME) - inode->i_mtime = iattr->ia_mtime; + inode->i_mtime = xfs_timestamp_truncate(iattr->ia_mtime); }
static int
From: Li Huafei lihuafei1@huawei.com
hulk inclusion category: bugfix bugzilla: 50618 CVE: NA
-------------------------------------------------
We got a use-after-free report when doing kernel fuzz tests with KASAN turned on:
[ 1367.884099] BUG: KASAN: use-after-free in ftrace_ops_list_func+0xf7/0x220 [ 1367.885153] Read of size 8 at addr ffff8884f81a47d0 by tasksyz-executor/99086 [ 1367.886517] CPU: 2 PID: 99086 Comm: syz-executor Kdump: loaded Tainted: G --------- -t -4.18.0-147.5.1.2.h379.kasan.eulerosv2r9.x86_64 #1 [ 1367.886522] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS ?-20190727_073836-buildvm-ppc64le-16.ppc.fedoraproject.org-3.fc31 04/01/2014 [ 1367.886525] Call Trace: [ 1367.886534] dump_stack+0xc2/0x12e [ 1367.886542] ? orc_sort_cmp+0xb0/0xb0 [ 1367.886551] print_address_description+0x70/0x360 [ 1367.886558] ? orc_sort_cmp+0xb0/0xb0 [ 1367.886566] ? perf_trace_buf_alloc+0x190/0x190 [ 1367.886571] kasan_report+0x1b2/0x330 [ 1367.886578] ? ftrace_ops_list_func+0xf7/0x220 [ 1367.886585] ? orc_find+0x560/0x5a0 [ 1367.886597] ? ftrace_ops_list_func+0xf7/0x220 [ 1367.886603] ftrace_ops_list_func+0xf7/0x220 [ 1367.886609] ? __save_stack_trace+0x92/0x100 [ 1367.886616] ftrace_call+0x5/0x34 [ 1367.886623] ? do_syscall_64+0x98/0x2c0 [ 1367.886629] ? do_syscall_64+0x98/0x2c0 [ 1367.886635] ? deref_stack_reg+0xd0/0xd0 [ 1367.886644] ? unwind_get_return_address+0x5/0x50 [ 1367.886651] unwind_get_return_address+0x5/0x50 [ 1367.886656] __save_stack_trace+0x92/0x100 [ 1367.886665] ? do_syscall_64+0x98/0x2c0 [ 1367.886673] save_stack+0x47/0xd0 [ 1367.886680] ? __kasan_slab_free+0x130/0x180 [ 1367.886685] ? kfree+0xa5/0x1e0 [ 1367.886692] ? cgroup_show_path+0x1fd/0x250 [ 1367.886699] ? kernfs_sop_show_path+0xad/0xf0 [ 1367.886705] ? show_mountinfo+0x169/0x4c0 [ 1367.886712] ? seq_read+0x716/0x950 [ 1367.886718] ? __vfs_read+0x55/0xb0 [ 1367.886723] ? vfs_read+0xe7/0x210 [ 1367.886729] ? ksys_pread64+0x95/0xd0 [ 1367.886734] ? objects_show+0x10/0x10 [ 1367.886740] ? ftrace_ops_test+0xba/0x120 [ 1367.886746] ? ftrace_find_tramp_ops_next+0x90/0x90 [ 1367.886753] ? ftrace_find_tramp_ops_next+0x90/0x90 [ 1367.886760] ? ftrace_find_tramp_ops_next+0x90/0x90 [ 1367.886766] ? 
objects_show+0x10/0x10 [ 1367.886772] ? ftrace_ops_list_func+0x147/0x220 [ 1367.886778] ? __kasan_slab_free+0xac/0x180 [ 1367.886784] ? cgroup_show_path+0x1fd/0x250 [ 1367.886790] ? ftrace_call+0x5/0x34 [ 1367.886796] ? cgroup_show_path+0x1fd/0x250 [ 1367.886802] ? cgroup_show_path+0x1fd/0x250 [ 1367.886811] ? fixup_red_left+0x5/0x30 [ 1367.886817] ? cgroup_show_path+0x1fd/0x250 [ 1367.886824] __kasan_slab_free+0x130/0x180 [ 1367.886831] ? cgroup_show_path+0x1fd/0x250 [ 1367.886835] kfree+0xa5/0x1e0 [ 1367.886842] cgroup_show_path+0x1fd/0x250 [ 1367.886850] ? init_and_link_css+0x370/0x370 [ 1367.886856] kernfs_sop_show_path+0xad/0xf0 [ 1367.886863] show_mountinfo+0x169/0x4c0 [ 1367.886869] ? kernfs_test_super+0x80/0x80 [ 1367.886875] ? show_vfsmnt+0x270/0x270 [ 1367.886880] ? m_next+0x32/0x80 [ 1367.886886] ? show_vfsmnt+0x270/0x270 [ 1367.886891] ? m_show+0x31/0x50 [ 1367.886900] seq_read+0x716/0x950 [ 1367.886911] ? seq_lseek+0x1e0/0x1e0 [ 1367.886916] ? ftrace_call+0x5/0x34 [ 1367.886922] ? ftrace_call+0x5/0x34 [ 1367.886931] ? 
seq_lseek+0x1e0/0x1e0 [ 1367.886938] __vfs_read+0x55/0xb0 [ 1367.886945] vfs_read+0xe7/0x210 [ 1367.886954] ksys_pread64+0x95/0xd0 [ 1367.886961] do_syscall_64+0x98/0x2c0 [ 1367.886971] entry_SYSCALL_64_after_hwframe+0x65/0xca [ 1367.886976] RIP: 0033:0x46436d [ 1367.886983] Code: 02 b8 ff ff ff ff c3 66 0f 1f 44 00 00 f3 0f 1e fa 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 c7 c1 b4 ff ff ff f7 d8 64 89 01 48 [ 1367.886987] RSP: 002b:00007f83ffff4c28 EFLAGS: 00000246 ORIG_RAX: 0000000000000011 [ 1367.886999] RAX: ffffffffffffffda RBX: 000000000057cfa0 RCX: 000000000046436d [ 1367.887002] RDX: 0000000000001000 RSI: 0000000020000140 RDI: 0000000000000003 [ 1367.887006] RBP: 000000000057cfa0 R08: 0000000000000000 R09: 0000000000000000 [ 1367.887009] R10: 0000000000000000 R11: 0000000000000246 R12: 000000000057cfac [ 1367.887013] R13: 00007f83ffff5700 R14: 00000000004d1e47 R15: 0000000000000fff
[ 1367.887275] Allocated by task 99101: [ 1367.887848] kasan_kmalloc+0xa0/0xd0 [ 1367.887853] kmem_cache_alloc_trace+0xfc/0x220 [ 1367.887860] perf_event_alloc.part.19+0x50/0x14d0 [ 1367.887865] perf_event_alloc+0x67/0x90 [ 1367.887871] __do_sys_perf_event_open+0x20e/0x14c0 [ 1367.887876] do_syscall_64+0x98/0x2c0 [ 1367.887882] entry_SYSCALL_64_after_hwframe+0x65/0xca
[ 1367.888133] Freed by task 99101: [ 1367.888651] __kasan_slab_free+0x130/0x180 [ 1367.888655] kfree+0xa5/0x1e0 [ 1367.888661] perf_event_alloc.part.19+0xca4/0x14d0 [ 1367.888666] perf_event_alloc+0x67/0x90 [ 1367.888672] __do_sys_perf_event_open+0x20e/0x14c0 [ 1367.888677] do_syscall_64+0x98/0x2c0 [ 1367.888683] entry_SYSCALL_64_after_hwframe+0x65/0xca
[ 1367.888935] The buggy address belongs to the object at ffff8884f81a4400 which belongs to the cache kmalloc-2k of size 2048 [ 1367.890854] The buggy address is located 976 bytes inside of 2048-byte region [ffff8884f81a4400, ffff8884f81a4c00) [ 1367.892661] The buggy address belongs to the page: [ 1367.893404] page:ffffea0013e06800 count:1 mapcount:0 mapping:ffff888107c0cf00 index:0x0 compound_mapcount: 0 [ 1367.894915] flags: 0x17ffffc0008100(slab|head) [ 1367.895613] raw: 0017ffffc0008100 ffffea0014bda208 ffffea00140c4208 ffff888107c0cf00 [ 1367.896808] raw: 0000000000000000 00000000000f000f 00000001ffffffff 0000000000000000 [ 1367.898000] page dumped because: kasan: bad access detected
[ 1367.899107] Memory state around the buggy address: [ 1367.899880] ffff8884f81a4680: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb [ 1367.900995] ffff8884f81a4700: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb [ 1367.902106] >ffff8884f81a4780: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb [ 1367.903218]
[ 1367.904122] ffff8884f81a4800: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb [ 1367.905234] ffff8884f81a4880: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
There is a race between perf_event_alloc() and __ftrace_ops_list_func() on 'event'. When a perf event that uses the trace framework is added, it must register its ftrace_ops, which is a structure member of the perf event, with ftrace. If perf_event_alloc() fails, it frees the event directly. But if the ftrace_ops has already been registered successfully and the corresponding trace point is triggered, __ftrace_ops_list_func() will still reference the ftrace_ops that perf just registered, even though it was released when the event was freed, so a use-after-free happens.
__ftrace_ops_list_func() uses RCU synchronization to access ftrace_ops, so in perf_event_alloc() we call synchronize_rcu() before freeing 'event' to make sure all references to 'event' have completed.
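The "wait for readers, then free" pattern can be sketched in user space with C11 atomics. This is a deliberately simplified model and not how kernel RCU works: a shared reader counter stands in for the grace period, and all names (`toy_event`, `toy_read_lock()`, `toy_synchronize()`, and so on) are hypothetical.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdlib.h>

/* Hypothetical stand-in for an object with an embedded, registered callback. */
struct toy_event {
    int hits;
};

static _Atomic(struct toy_event *) toy_registered; /* what the "tracer" sees */
static atomic_int toy_readers;                     /* readers in flight */

/* Reader side: announce ourselves, then look up the registered object. */
static struct toy_event *toy_read_lock(void)
{
    atomic_fetch_add(&toy_readers, 1);
    return atomic_load(&toy_registered);
}

static void toy_read_unlock(void)
{
    atomic_fetch_sub(&toy_readers, 1);
}

/* Simplified grace period: spin until every in-flight reader has left. */
static void toy_synchronize(void)
{
    while (atomic_load(&toy_readers) != 0)
        ;
}

/* Models the tracer invoking the registered callback once. */
static int toy_trace_hit(void)
{
    struct toy_event *ev = toy_read_lock();
    int seen = 0;

    if (ev) {
        ev->hits++;
        seen = 1;
    }
    toy_read_unlock();
    return seen;
}

static struct toy_event *toy_event_create(void)
{
    struct toy_event *ev = calloc(1, sizeof(*ev));

    if (ev)
        atomic_store(&toy_registered, ev);
    return ev;
}

/* Error path: unpublish, wait for readers, and only then free. */
static void toy_event_destroy(struct toy_event *ev)
{
    atomic_store(&toy_registered, NULL);
    toy_synchronize();
    free(ev);
}
```

Because the pointer is unpublished before the wait, a reader either announced itself early enough to be waited for or observes NULL. Skipping `toy_synchronize()` would reopen the window in which a reader still holds the pointer while it is freed.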
Signed-off-by: Yang JiHong yangjihong1@huawei.com Signed-off-by: Li Huafei lihuafei1@huawei.com Reviewed-by: Kuohai Xu xukuohai@huawei.com Signed-off-by: Yang Yingliang yangyingliang@huawei.com --- kernel/events/core.c | 2 ++ 1 file changed, 2 insertions(+)
diff --git a/kernel/events/core.c b/kernel/events/core.c index 42da44e6a5e09..e41d6a5221277 100644 --- a/kernel/events/core.c +++ b/kernel/events/core.c @@ -10251,6 +10251,8 @@ perf_event_alloc(struct perf_event_attr *attr, int cpu, put_pid_ns(event->ns); if (event->hw.target) put_task_struct(event->hw.target); + + synchronize_rcu(); kfree(event);
return ERR_PTR(err);
From: Paolo Valente paolo.valente@linaro.org
mainline inclusion from mainline-5.6-rc1 commit ecedd3d7e19911ab8fe42f17b77c0a30fe7f4db3 category: bugfix bugzilla: 50775 CVE: NA
---------------------------
In bfq_bfqq_move(), the bfq_queue, say Q, to be moved to a new group may happen to be deactivated in the scheduling data structures of the source group (and then activated in the destination group). If Q is referred only by the data structures in the source group when the deactivation happens, then Q is freed upon the deactivation.
This commit addresses this issue by getting an extra reference before the possible deactivation, and releasing this extra reference after Q has been moved.
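A minimal sketch of why the extra reference matters, using a hypothetical `toy_queue` whose `freed` flag replaces a real free() so the outcome can be inspected. It assumes, as the message describes, that deactivation drops the source group's reference and activation in the destination group takes a new one.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for a reference-counted queue. */
struct toy_queue {
    int  ref;
    bool freed;  /* set instead of calling free(), so callers can inspect it */
    int  group;
};

static void toy_put(struct toy_queue *q)
{
    assert(q->ref > 0);
    if (--q->ref == 0)
        q->freed = true;
}

/* Deactivation drops the reference held by the source group's structures. */
static void toy_deactivate(struct toy_queue *q)
{
    toy_put(q);
}

/* Activation in the destination group takes a new reference. */
static void toy_activate(struct toy_queue *q, int group)
{
    q->group = group;
    q->ref++;
}

/* Unsafe: if the group's reference was the last one, q dies mid-move. */
static bool toy_move_unpinned(struct toy_queue *q, int group)
{
    toy_deactivate(q);
    if (q->freed)
        return false; /* real code would now be touching freed memory */
    toy_activate(q, group);
    return true;
}

/* Safe: pin q across the whole move, then release the pin at the end. */
static bool toy_move_pinned(struct toy_queue *q, int group)
{
    q->ref++;
    toy_deactivate(q);
    toy_activate(q, group);
    toy_put(q);
    return !q->freed;
}
```

Starting from a queue whose only reference belongs to the source group, the unpinned move loses the queue during deactivation, whereas the pinned move ends with the queue alive and a single reference held by the destination group.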
Tested-by: Chris Evich cevich@redhat.com Tested-by: Oleksandr Natalenko oleksandr@natalenko.name Signed-off-by: Paolo Valente paolo.valente@linaro.org Signed-off-by: Jens Axboe axboe@kernel.dk Signed-off-by: Yu Kuai yukuai3@huawei.com Reviewed-by: Yufen Yu yuyufen@huawei.com Signed-off-by: Yang Yingliang yangyingliang@huawei.com --- block/bfq-cgroup.c | 8 ++++++++ 1 file changed, 8 insertions(+)
diff --git a/block/bfq-cgroup.c b/block/bfq-cgroup.c index 9c3957d37069f..2073ff4e8001a 100644 --- a/block/bfq-cgroup.c +++ b/block/bfq-cgroup.c @@ -568,6 +568,12 @@ void bfq_bfqq_move(struct bfq_data *bfqd, struct bfq_queue *bfqq, bfq_bfqq_expire(bfqd, bfqd->in_service_queue, false, BFQQE_PREEMPTED);
+ /* + * get extra reference to prevent bfqq from being freed in + * next possible deactivate + */ + bfqq->ref++; + if (bfq_bfqq_busy(bfqq)) bfq_deactivate_bfqq(bfqd, bfqq, false, false); else if (entity->on_st) @@ -586,6 +592,8 @@ void bfq_bfqq_move(struct bfq_data *bfqd, struct bfq_queue *bfqq,
if (!bfqd->in_service_queue && !bfqd->rq_in_driver) bfq_schedule_dispatch(bfqd); + /* release extra ref taken above */ + bfq_put_queue(bfqq); }
/**
From: Paolo Valente paolo.valente@linaro.org
mainline inclusion from mainline-5.7-rc1 commit fd1bb3ae54a9a2e0c42709de861c69aa146b8955 category: bugfix bugzilla: 50775 CVE: NA
---------------------------
Commit ecedd3d7e199 ("block, bfq: get extra ref to prevent a queue from being freed during a group move") gets an extra reference to a bfq_queue before possibly deactivating it (temporarily), in bfq_bfqq_move(). This prevents the bfq_queue from disappearing before being reactivated in its new group.
Yet, the bfq_queue may also be expired (i.e., its service may be stopped) before the bfq_queue is deactivated, and an expiration may also lead to premature freeing. This commit fixes this issue by simply moving forward the getting of the extra reference already introduced by commit ecedd3d7e199 ("block, bfq: get extra ref to prevent a queue from being freed during a group move").
Reported-by: cki-project@redhat.com Tested-by: cki-project@redhat.com Signed-off-by: Paolo Valente paolo.valente@linaro.org Signed-off-by: Jens Axboe axboe@kernel.dk Signed-off-by: Yu Kuai yukuai3@huawei.com Reviewed-by: Yufen Yu yuyufen@huawei.com Signed-off-by: Yang Yingliang yangyingliang@huawei.com --- block/bfq-cgroup.c | 14 +++++++------- 1 file changed, 7 insertions(+), 7 deletions(-)
diff --git a/block/bfq-cgroup.c b/block/bfq-cgroup.c index 2073ff4e8001a..34cc877cc9bc1 100644 --- a/block/bfq-cgroup.c +++ b/block/bfq-cgroup.c @@ -558,6 +558,12 @@ void bfq_bfqq_move(struct bfq_data *bfqd, struct bfq_queue *bfqq, { struct bfq_entity *entity = &bfqq->entity;
+ /* + * Get extra reference to prevent bfqq from being freed in + * next possible expire or deactivate. + */ + bfqq->ref++; + /* If bfqq is empty, then bfq_bfqq_expire also invokes * bfq_del_bfqq_busy, thereby removing bfqq and its entity * from data structures related to current group. Otherwise we @@ -568,12 +574,6 @@ void bfq_bfqq_move(struct bfq_data *bfqd, struct bfq_queue *bfqq, bfq_bfqq_expire(bfqd, bfqd->in_service_queue, false, BFQQE_PREEMPTED);
- /* - * get extra reference to prevent bfqq from being freed in - * next possible deactivate - */ - bfqq->ref++; - if (bfq_bfqq_busy(bfqq)) bfq_deactivate_bfqq(bfqd, bfqq, false, false); else if (entity->on_st) @@ -592,7 +592,7 @@ void bfq_bfqq_move(struct bfq_data *bfqd, struct bfq_queue *bfqq,
if (!bfqd->in_service_queue && !bfqd->rq_in_driver) bfq_schedule_dispatch(bfqd); - /* release extra ref taken above */ + /* release extra ref taken above, bfqq may happen to be freed now */ bfq_put_queue(bfqq); }
From: Paolo Valente paolo.valente@linaro.org
mainline inclusion from mainline-5.7-rc1 commit c8997736650060594845e42c5d01d3118aec8d25 category: bugfix bugzilla: 50775 CVE: NA
bfq_release_process_ref() was introduced by commit 478de3380c1c ("block, bfq: deschedule empty bfq_queues not referred by any process"). However, that patch is unrelated to this issue and depends on other patches, so we define the function here instead of backporting it.
---------------------------
A bfq_put_queue() may be invoked in __bfq_bic_change_cgroup(). The goal of this put is to release a process reference to a bfq_queue. But process-reference releases may also trigger some extra operations and, for this reason, are handled through bfq_release_process_ref(). So, turn the invocation of bfq_put_queue() into an invocation of bfq_release_process_ref().
Tested-by: cki-project@redhat.com Signed-off-by: Paolo Valente paolo.valente@linaro.org Signed-off-by: Jens Axboe axboe@kernel.dk Signed-off-by: Yu Kuai yukuai3@huawei.com Reviewed-by: Yufen Yu yuyufen@huawei.com Signed-off-by: Yang Yingliang yangyingliang@huawei.com --- block/bfq-cgroup.c | 26 ++++++++++++++++++++++---- 1 file changed, 22 insertions(+), 4 deletions(-)
diff --git a/block/bfq-cgroup.c b/block/bfq-cgroup.c index 34cc877cc9bc1..ab66b2bdf869f 100644 --- a/block/bfq-cgroup.c +++ b/block/bfq-cgroup.c @@ -596,6 +596,27 @@ void bfq_bfqq_move(struct bfq_data *bfqd, struct bfq_queue *bfqq, bfq_put_queue(bfqq); }
+static +void bfq_release_process_ref(struct bfq_data *bfqd, struct bfq_queue *bfqq) +{ + /* + * To prevent bfqq's service guarantees from being violated, + * bfqq may be left busy, i.e., queued for service, even if + * empty (see comments in __bfq_bfqq_expire() for + * details). But, if no process will send requests to bfqq any + * longer, then there is no point in keeping bfqq queued for + * service. In addition, keeping bfqq queued for service, but + * with no process ref any longer, may have caused bfqq to be + * freed when dequeued from service. But this is assumed to + * never happen. + */ + if (bfq_bfqq_busy(bfqq) && RB_EMPTY_ROOT(&bfqq->sort_list) && + bfqq != bfqd->in_service_queue) + bfq_del_bfqq_busy(bfqd, bfqq, false); + + bfq_put_queue(bfqq); +} + /** * __bfq_bic_change_cgroup - move @bic to @cgroup. * @bfqd: the queue descriptor. @@ -629,10 +650,7 @@ static struct bfq_group *__bfq_bic_change_cgroup(struct bfq_data *bfqd,
if (entity->sched_data != &bfqg->sched_data) { bic_set_bfqq(bic, NULL, 0); - bfq_log_bfqq(bfqd, async_bfqq, - "bic_change_group: %p %d", - async_bfqq, async_bfqq->ref); - bfq_put_queue(async_bfqq); + bfq_release_process_ref(bfqd, async_bfqq); } }
From: Paolo Valente paolo.valente@linaro.org
mainline inclusion
from mainline-5.7-rc1
commit 576682fa52cbd95deb3773449566274f206acc58
category: bugfix
bugzilla: 50775
CVE: NA
---------------------------
bfq_reparent_leaf_entity() reparents the input leaf entity (a leaf entity represents just a bfq_queue in an entity tree). Yet, the input entity is guaranteed to always be a leaf entity only in two-level entity trees. In this respect, because of the error fixed by commit 14afc5936197 ("block, bfq: fix overwrite of bfq_group pointer in bfq_find_set_group()"), all (wrongly collapsed) entity trees happened to actually have only two levels. After the latter commit, this does not hold any longer.
This commit fixes this problem by modifying bfq_reparent_leaf_entity() so that it searches for an active leaf entity down the path that stems from the input entity. Such a leaf entity is guaranteed to exist when bfq_reparent_leaf_entity() is invoked.
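The descent to a leaf can be sketched in isolation with a simplified userspace model. The types below are hypothetical stand-ins: `first_active[]` replaces the per-class rb-tree lookup done with rb_first() in the kernel, and `NR_CLASSES` is an assumed constant mirroring the number of I/O priority classes.

```c
#include <assert.h>
#include <stddef.h>

#define NR_CLASSES 3 /* assumption: one service tree per I/O priority class */

struct entity;

/* Hypothetical stand-in for struct bfq_sched_data. */
struct sched_data {
	struct entity *first_active[NR_CLASSES]; /* models rb_first() of each active tree */
	struct entity *in_service;
};

/* Hypothetical stand-in for struct bfq_entity. */
struct entity {
	struct sched_data *my_sched_data; /* NULL for a leaf, i.e. a queue */
	int id;
};

/*
 * Walk down from @e until a leaf is reached. At each group level, prefer
 * the first active child of the given class and fall back to the
 * in-service child. The caller must guarantee that such a leaf exists.
 */
static struct entity *find_active_leaf(struct entity *e, int ioprio_class)
{
	while (e->my_sched_data) { /* leaf not reached yet */
		struct sched_data *sd = e->my_sched_data;
		struct entity *child = sd->first_active[ioprio_class];

		if (!child)
			child = sd->in_service;
		e = child;
	}
	return e;
}
```

When the input is already a leaf the loop body never runs, so the two-level case behaves as before; with deeper trees the walk continues until a queue is found.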
Tested-by: cki-project@redhat.com
Signed-off-by: Paolo Valente paolo.valente@linaro.org
Signed-off-by: Jens Axboe axboe@kernel.dk
Signed-off-by: Yu Kuai yukuai3@huawei.com
Reviewed-by: Yufen Yu yuyufen@huawei.com
Signed-off-by: Yang Yingliang yangyingliang@huawei.com
---
 block/bfq-cgroup.c | 48 ++++++++++++++++++++++++++++++----------------
 1 file changed, 31 insertions(+), 17 deletions(-)
diff --git a/block/bfq-cgroup.c b/block/bfq-cgroup.c
index ab66b2bdf869f..59f19116a9cf7 100644
--- a/block/bfq-cgroup.c
+++ b/block/bfq-cgroup.c
@@ -751,39 +751,53 @@ static void bfq_flush_idle_tree(struct bfq_service_tree *st)
 /**
  * bfq_reparent_leaf_entity - move leaf entity to the root_group.
  * @bfqd: the device data structure with the root group.
- * @entity: the entity to move.
+ * @entity: the entity to move, if entity is a leaf; or the parent entity
+ *          of an active leaf entity to move, if entity is not a leaf.
  */
 static void bfq_reparent_leaf_entity(struct bfq_data *bfqd,
-				     struct bfq_entity *entity)
+				     struct bfq_entity *entity,
+				     int ioprio_class)
 {
-	struct bfq_queue *bfqq = bfq_entity_to_bfqq(entity);
+	struct bfq_queue *bfqq;
+	struct bfq_entity *child_entity = entity;
+
+	while (child_entity->my_sched_data) { /* leaf not reached yet */
+		struct bfq_sched_data *child_sd = child_entity->my_sched_data;
+		struct bfq_service_tree *child_st = child_sd->service_tree +
+			ioprio_class;
+		struct rb_root *child_active = &child_st->active;
 
+		child_entity = bfq_entity_of(rb_first(child_active));
+
+		if (!child_entity)
+			child_entity = child_sd->in_service_entity;
+	}
+
+	bfqq = bfq_entity_to_bfqq(child_entity);
 	bfq_bfqq_move(bfqd, bfqq, bfqd->root_group);
 }
 
 /**
- * bfq_reparent_active_entities - move to the root group all active
- *                                entities.
+ * bfq_reparent_active_queues - move to the root group all active queues.
  * @bfqd: the device data structure with the root group.
  * @bfqg: the group to move from.
- * @st: the service tree with the entities.
+ * @st: the service tree to start the search from.
  */
-static void bfq_reparent_active_entities(struct bfq_data *bfqd,
-					 struct bfq_group *bfqg,
-					 struct bfq_service_tree *st)
+static void bfq_reparent_active_queues(struct bfq_data *bfqd,
+				       struct bfq_group *bfqg,
+				       struct bfq_service_tree *st,
+				       int ioprio_class)
 {
 	struct rb_root *active = &st->active;
-	struct bfq_entity *entity = NULL;
-
-	if (!RB_EMPTY_ROOT(&st->active))
-		entity = bfq_entity_of(rb_first(active));
+	struct bfq_entity *entity;
 
-	for (; entity ; entity = bfq_entity_of(rb_first(active)))
-		bfq_reparent_leaf_entity(bfqd, entity);
+	while ((entity = bfq_entity_of(rb_first(active))))
+		bfq_reparent_leaf_entity(bfqd, entity, ioprio_class);
 
 	if (bfqg->sched_data.in_service_entity)
 		bfq_reparent_leaf_entity(bfqd,
-			bfqg->sched_data.in_service_entity);
+					 bfqg->sched_data.in_service_entity,
+					 ioprio_class);
 }
 
 /**
@@ -834,7 +848,7 @@ static void bfq_pd_offline(struct blkg_policy_data *pd)
 		 * There is no need to put the sync queues, as the
 		 * scheduler has taken no reference.
 		 */
-		bfq_reparent_active_entities(bfqd, bfqg, st);
+		bfq_reparent_active_queues(bfqd, bfqg, st, i);
 	}
 
 	__bfq_deactivate_entity(entity, false);
From: Paolo Valente paolo.valente@linaro.org
mainline inclusion
from mainline-5.7-rc1
commit 4d38a87fbb77fb9ff2ff4e914162a8ae6453eff5
category: bugfix
bugzilla: 50775
CVE: NA
---------------------------
In bfq_pd_offline(), the function bfq_flush_idle_tree() is invoked to flush the rb tree that contains all idle entities belonging to the pd (cgroup) being destroyed. In particular, bfq_flush_idle_tree() is invoked before bfq_reparent_active_queues(). Yet the latter may happen to add some entities to the idle tree. It happens if, in some of the calls to bfq_bfqq_move() performed by bfq_reparent_active_queues(), the queue to move is empty and gets expired.
This commit simply reverses the invocation order between bfq_flush_idle_tree() and bfq_reparent_active_queues().
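Why the order matters can be seen in a toy model that only counts entities. Everything here is a hypothetical illustration: `struct tree` and the helper names are invented, and the model simply assumes that every empty active queue ends up on the idle tree when it is moved.

```c
#include <assert.h>

/* Hypothetical model of one service tree of the group being destroyed. */
struct tree {
	int active_nonempty; /* active queues that still hold requests */
	int active_empty;    /* active queues with no requests left */
	int idle;            /* entities on the idle tree */
};

/* Models bfq_flush_idle_tree(): drop everything on the idle tree. */
static void flush_idle(struct tree *t)
{
	t->idle = 0;
}

/*
 * Models bfq_reparent_active_queues(): every active queue leaves the
 * active tree. In this model, an empty queue gets expired during the
 * move and lands on the idle tree.
 */
static void reparent_active(struct tree *t)
{
	t->idle += t->active_empty;
	t->active_empty = 0;
	t->active_nonempty = 0;
}

/* Order before the fix: flush first, then reparent. */
static int leftover_old_order(struct tree t)
{
	flush_idle(&t);
	reparent_active(&t);
	return t.idle;
}

/* Order after the fix: reparent first, then flush. */
static int leftover_new_order(struct tree t)
{
	reparent_active(&t);
	flush_idle(&t);
	return t.idle;
}
```

With one empty active queue, the old order leaves one entity behind on the idle tree, while the new order leaves none.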
Tested-by: cki-project@redhat.com
Signed-off-by: Paolo Valente paolo.valente@linaro.org
Signed-off-by: Jens Axboe axboe@kernel.dk
Signed-off-by: Yu Kuai yukuai3@huawei.com
Reviewed-by: Yufen Yu yuyufen@huawei.com
Signed-off-by: Yang Yingliang yangyingliang@huawei.com
---
 block/bfq-cgroup.c | 20 +++++++++++++-------
 1 file changed, 13 insertions(+), 7 deletions(-)
diff --git a/block/bfq-cgroup.c b/block/bfq-cgroup.c
index 59f19116a9cf7..73afec832fbf2 100644
--- a/block/bfq-cgroup.c
+++ b/block/bfq-cgroup.c
@@ -829,13 +829,6 @@ static void bfq_pd_offline(struct blkg_policy_data *pd)
 	for (i = 0; i < BFQ_IOPRIO_CLASSES; i++) {
 		st = bfqg->sched_data.service_tree + i;
 
-		/*
-		 * The idle tree may still contain bfq_queues belonging
-		 * to exited task because they never migrated to a different
-		 * cgroup from the one being destroyed now.
-		 */
-		bfq_flush_idle_tree(st);
-
 		/*
 		 * It may happen that some queues are still active
 		 * (busy) upon group destruction (if the corresponding
@@ -849,6 +842,19 @@ static void bfq_pd_offline(struct blkg_policy_data *pd)
 		 * scheduler has taken no reference.
 		 */
 		bfq_reparent_active_queues(bfqd, bfqg, st, i);
+
+		/*
+		 * The idle tree may still contain bfq_queues
+		 * belonging to exited task because they never
+		 * migrated to a different cgroup from the one being
+		 * destroyed now. In addition, even
+		 * bfq_reparent_active_queues() may happen to add some
+		 * entities to the idle tree. It happens if, in some
+		 * of the calls to bfq_bfqq_move() performed by
+		 * bfq_reparent_active_queues(), the queue to move is
+		 * empty and gets expired.
+		 */
+		bfq_flush_idle_tree(st);
 	}
 
 	__bfq_deactivate_entity(entity, false);
From: Guoqing Jiang guoqing.jiang@cloud.ionos.com
mainline inclusion
from mainline-5.8-rc1
commit 21e0958ec9684e76e32f822c5e611a7d7ea0a5ba
category: bugfix
bugzilla: 35792
CVE: NA
---------------------------
Coly reported a possible circular locking dependency with LOCKDEP enabled. The information below is quoted from the detailed report [1].
[ 1607.673903] Chain exists of:
[ 1607.673903]   kn->count#256 --> (wq_completion)md_misc --> (work_completion)(&rdev->del_work)
[ 1607.673903]
[ 1607.827946]  Possible unsafe locking scenario:
[ 1607.827946]
[ 1607.898780]        CPU0                    CPU1
[ 1607.952980]        ----                    ----
[ 1608.007173]   lock((work_completion)(&rdev->del_work));
[ 1608.069690]                                lock((wq_completion)md_misc);
[ 1608.149887]                                lock((work_completion)(&rdev->del_work));
[ 1608.242563]   lock(kn->count#256);
[ 1608.283238]
[ 1608.283238]  *** DEADLOCK ***
[ 1608.283238]
[ 1608.354078] 2 locks held by kworker/5:0/843:
[ 1608.405152]  #0: ffff8889eecc9948 ((wq_completion)md_misc){+.+.}, at: process_one_work+0x42b/0xb30
[ 1608.512399]  #1: ffff888a1d3b7e10 ((work_completion)(&rdev->del_work)){+.+.}, at: process_one_work+0x42b/0xb30
[ 1608.632130]
The works rdev->del_work and mddev->del_work are both queued on md_misc_wq, so the workqueue's lockdep_map lock is held while either of them is running, and both of them then try to take the kernfs lock by calling kobject_del(). Conversely, when new_dev_store or array_state_store is triggered by a write to the related sysfs node, the write operation takes the kernfs lock first and then needs the same lockdep_map lock, because both of them eventually call flush_workqueue(md_misc_wq).
To suppress the lockdep warning, we should flush the workqueue only if the related work is pending. Several works are attached to md_misc_wq, so we need to decide which work to check in each case:
1. For __md_stop_writes, the purpose of flushing the workqueue is to ensure that the sync thread is started if it was starting, so check whether mddev->del_work is pending, since md_start_sync is attached to mddev->del_work.
2. __md_stop flushes md_misc_wq to ensure that event_work is done, so checking event_work is enough. This assumes that raid_{ctr,dtr} -> md_stop -> __md_stop doesn't need the kernfs lock.
3. Both new_dev_store (which holds the kernfs lock) and the ADD_NEW_DISK ioctl (which holds bdev->bd_mutex) call flush_workqueue to ensure that md_delayed_delete has completed. This case will be handled in the next patch.
4. md_open flushes the workqueue to ensure that the previous md has disappeared, but it holds bdev->bd_mutex when it tries to flush the workqueue, so it is better to check mddev->del_work as well to avoid a potential lock issue. This will be done in another patch.
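The "flush only when the related work is pending" idea from case 1 can be modelled in userspace as follows. This is a sketch under stated assumptions: `struct work`, `struct workqueue` and the `model_*` functions are hypothetical, and a plain boolean stands in for what work_pending() reports in the kernel.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model of a work item. */
struct work {
	bool pending; /* queued but not yet run */
};

/* Hypothetical model of the workqueue the work item runs on. */
struct workqueue {
	int flushes; /* counts how often the flush path was taken */
};

/* Models flush_workqueue(): in the kernel this is where the lockdep map is taken. */
static void model_flush(struct workqueue *wq, struct work *w)
{
	wq->flushes++;
	w->pending = false; /* queued work has run once the flush returns */
}

/* Models the patched __md_stop_writes(): flush only when del_work is pending. */
static void model_stop_writes(struct workqueue *wq, struct work *del_work)
{
	if (del_work->pending)
		model_flush(wq, del_work);
}
```

When nothing is pending, the flush path is skipped entirely, which in the kernel means the workqueue's lockdep map is not acquired on that path.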
[1]: https://marc.info/?l=linux-raid&m=158518958031584&w=2
Cc: Coly Li colyli@suse.de
Reported-by: Coly Li colyli@suse.de
Signed-off-by: Guoqing Jiang guoqing.jiang@cloud.ionos.com
Signed-off-by: Song Liu songliubraving@fb.com
Signed-off-by: Zhihao Cheng chengzhihao1@huawei.com
Reviewed-by: Jason Yan yanaijie@huawei.com
Signed-off-by: Yang Yingliang yangyingliang@huawei.com
---
 drivers/md/md.c | 6 ++++--
 1 file changed, 4 insertions(+), 2 deletions(-)
diff --git a/drivers/md/md.c b/drivers/md/md.c
index 941dbfa7c06d5..fb97c0b1d510c 100644
--- a/drivers/md/md.c
+++ b/drivers/md/md.c
@@ -5886,7 +5886,8 @@ static void md_clean(struct mddev *mddev)
 static void __md_stop_writes(struct mddev *mddev)
 {
 	set_bit(MD_RECOVERY_FROZEN, &mddev->recovery);
-	flush_workqueue(md_misc_wq);
+	if (work_pending(&mddev->del_work))
+		flush_workqueue(md_misc_wq);
 	if (mddev->sync_thread) {
 		set_bit(MD_RECOVERY_INTR, &mddev->recovery);
 		md_reap_sync_thread(mddev);
@@ -5936,7 +5937,8 @@ static void __md_stop(struct mddev *mddev)
 	md_bitmap_destroy(mddev);
 	mddev_detach(mddev);
 	/* Ensure ->event_work is done */
-	flush_workqueue(md_misc_wq);
+	if (mddev->event_work.func)
+		flush_workqueue(md_misc_wq);
 	spin_lock(&mddev->lock);
 	mddev->pers = NULL;
 	spin_unlock(&mddev->lock);