[PATCH OLK-5.10 4/6] exec: Add a new AT_CHECK flag to execveat(2)

28 Nov 2024

From: Mickaël Salaün <mic@digikod.net>

maillist inclusion
category: feature
bugzilla: https://gitee.com/openeuler/kernel/issues/IAZ996
CVE: NA

Reference: https://lore.kernel.org/all/20241011184422.977903-2-mic@digikod.net/

--------------------------------

Add a new AT_CHECK flag to execveat(2) to check if a file would be
allowed for execution.  The main use case is for script interpreters and
dynamic linkers to check execution permission according to the kernel's
security policy. Another use case is to add context to access logs e.g.,
which script (instead of interpreter) accessed a file.  As any
executable code, scripts could also use this check [1].

This is different from faccessat(2) + X_OK which only checks a subset of
access rights (i.e. inode permission and mount options for regular
files), but not the full context (e.g. all LSM access checks).  The main
use case for access(2) is for SUID processes to (partially) check access
on behalf of their caller.  The main use case for execveat(2) + AT_CHECK
is to check if a script execution would be allowed, according to all the
different restrictions in place.  Because the use of AT_CHECK follows
the exact kernel semantic as for a real execution, user space gets the
same error codes.

An interesting point of using execveat(2) instead of openat2(2) is that
it decouples the check from the enforcement.  Indeed, the security check
can be logged (e.g. with audit) without blocking an execution
environment not yet ready to enforce a strict security policy.

LSMs can control or log execution requests with
security_bprm_creds_for_exec().  However, to enforce a consistent and
complete access control (e.g. on binary's dependencies) LSMs should
restrict file executability, or mesure executed files, with
security_file_open() by checking file->f_flags & __FMODE_EXEC.

Because AT_CHECK is dedicated to user space interpreters, it doesn't
make sense for the kernel to parse the checked files, look for
interpreters known to the kernel (e.g. ELF, shebang), and return ENOEXEC
if the format is unknown.  Because of that, security_bprm_check() is
never called when AT_CHECK is used.

It should be noted that script interpreters cannot directly use
execveat(2) (without this new AT_CHECK flag) because this could lead to
unexpected behaviors e.g., `python script.sh` could lead to Bash being
executed to interpret the script.  Unlike the kernel, script
interpreters may just interpret the shebang as a simple comment, which
should not change for backward compatibility reasons.

Because scripts or libraries files might not currently have the
executable permission set, or because we might want specific users to be
allowed to run arbitrary scripts, the following patch provides a dynamic
configuration mechanism with the SECBIT_EXEC_RESTRICT_FILE and
SECBIT_EXEC_DENY_INTERACTIVE securebits.

This is a redesign of the CLIP OS 4's O_MAYEXEC:
https://github.com/clipos-archive/src_platform_clip-patches/blob/f5cb330d6b6...
This patch has been used for more than a decade with customized script
interpreters.  Some examples can be found here:
https://github.com/clipos-archive/clipos4_portage-overlay/search?q=O_MAYEXEC

Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Christian Brauner <brauner@kernel.org>
Cc: Kees Cook <keescook@chromium.org>
Cc: Paul Moore <paul@paul-moore.com>
Cc: Serge Hallyn <serge@hallyn.com>
Link: https://docs.python.org/3/library/io.html#io.open_code [1]
Signed-off-by: Mickaël Salaün <mic@digikod.net>
Link: https://lore.kernel.org/r/20241011184422.977903-2-mic@digikod.net
Conflicts:
	include/linux/binfmts.h
	include/uapi/linux/fcntl.h
	kernel/audit.h
	security/security.c
[Context conflicts]
Signed-off-by: Gu Bowen <gubowen5@huawei.com>
---
 fs/exec.c                  | 18 ++++++++++++++++--
 include/linux/binfmts.h    |  7 ++++++-
 include/uapi/linux/fcntl.h | 31 +++++++++++++++++++++++++++++++
 kernel/audit.h             |  1 +
 kernel/auditsc.c           |  1 +
 security/security.c        | 12 ++++++++++++
 6 files changed, 67 insertions(+), 3 deletions(-)

diff --git a/fs/exec.c b/fs/exec.c
index a2f6fffcda57..e328d435be64 100644
--- a/fs/exec.c
+++ b/fs/exec.c
@@ -913,7 +913,7 @@ static struct file *do_open_execat(int fd, struct filename *name, int flags)
 		.lookup_flags = LOOKUP_FOLLOW,
 	};
 
-	if ((flags & ~(AT_SYMLINK_NOFOLLOW | AT_EMPTY_PATH)) != 0)
+	if ((flags & ~(AT_SYMLINK_NOFOLLOW | AT_EMPTY_PATH | AT_CHECK)) != 0)
 		return ERR_PTR(-EINVAL);
 	if (flags & AT_SYMLINK_NOFOLLOW)
 		open_exec_flags.lookup_flags &= ~LOOKUP_FOLLOW;
@@ -1542,6 +1542,20 @@ static struct linux_binprm *alloc_bprm(int fd, struct filename *filename, int fl
 	}
 	bprm->interp = bprm->filename;
 
+	/*
+	 * At this point, security_file_open() has already been called (with
+	 * __FMODE_EXEC) and access control checks for AT_CHECK will stop just
+	 * after the security_bprm_creds_for_exec() call in bprm_execve().
+	 * Indeed, the kernel should not try to parse the content of the file
+	 * with exec_binprm() nor change the calling thread, which means that
+	 * the following security functions will be not called:
+	 * - security_bprm_check()
+	 * - security_bprm_creds_from_file()
+	 * - security_bprm_committing_creds()
+	 * - security_bprm_committed_creds()
+	 */
+	bprm->is_check = !!(flags & AT_CHECK);
+
 	retval = bprm_mm_init(bprm);
 	if (!retval)
 		return bprm;
@@ -1835,7 +1849,7 @@ static int bprm_execve(struct linux_binprm *bprm)
 
 	/* Set the unchanging part of bprm->cred */
 	retval = security_bprm_creds_for_exec(bprm);
-	if (retval)
+	if (retval || bprm->is_check)
 		goto out;
 
 	retval = exec_binprm(bprm);
diff --git a/include/linux/binfmts.h b/include/linux/binfmts.h
index 1716a07a9809..c30a797ce359 100644
--- a/include/linux/binfmts.h
+++ b/include/linux/binfmts.h
@@ -41,7 +41,12 @@ struct linux_binprm {
 		 * Set when errors can no longer be returned to the
 		 * original userspace.
 		 */
-		point_of_no_return:1;
+		point_of_no_return:1,
+		/*
+		 * Set by user space to check executability according to the
+		 * caller's environment.
+		 */
+		is_check:1;
 #ifdef __alpha__
 	unsigned int taso:1;
 #endif
diff --git a/include/uapi/linux/fcntl.h b/include/uapi/linux/fcntl.h
index 2f86b2ad6d7e..4a4d5a176d5f 100644
--- a/include/uapi/linux/fcntl.h
+++ b/include/uapi/linux/fcntl.h
@@ -111,4 +111,35 @@
 
 #define AT_RECURSIVE		0x8000	/* Apply to the entire subtree */
 
+/*
+ * AT_CHECK only performs a check on a regular file and returns 0 if execution
+ * of this file would be allowed, ignoring the file format and then the related
+ * interpreter dependencies (e.g. ELF libraries, script's shebang).
+ *
+ * Programs should always perform this check to apply kernel-level checks
+ * against files that are not directly executed by the kernel but passed to a
+ * user space interpreter instead.  All files that contain executable code,
+ * from the point of view of the interpreter, should be checked.  However the
+ * result of this check should only be enforced according to
+ * SECBIT_EXEC_RESTRICT_FILE or SECBIT_EXEC_DENY_INTERACTIVE.  See securebits.h
+ * documentation and the samples/check-exec/inc.c example.
+ *
+ * The main purpose of this flag is to improve the security and consistency of
+ * an execution environment to ensure that direct file execution (e.g.
+ * `./script.sh`) and indirect file execution (e.g. `sh script.sh`) lead to the
+ * same result.  For instance, this can be used to check if a file is
+ * trustworthy according to the caller's environment.
+ *
+ * In a secure environment, libraries and any executable dependencies should
+ * also be checked.  For instance, dynamic linking should make sure that all
+ * libraries are allowed for execution to avoid trivial bypass (e.g. using
+ * LD_PRELOAD).  For such secure execution environment to make sense, only
+ * trusted code should be executable, which also requires integrity guarantees.
+ *
+ * To avoid race conditions leading to time-of-check to time-of-use issues,
+ * AT_CHECK should be used with AT_EMPTY_PATH to check against a file
+ * descriptor instead of a path.
+ */
+#define AT_CHECK		0x10000
+
 #endif /* _UAPI_LINUX_FCNTL_H */
diff --git a/kernel/audit.h b/kernel/audit.h
index 1918019e6aaf..e4642918d723 100644
--- a/kernel/audit.h
+++ b/kernel/audit.h
@@ -187,6 +187,7 @@ struct audit_context {
 		} mmap;
 		struct {
 			int			argc;
+			bool                    is_check;
 		} execve;
 		struct {
 			char			*name;
diff --git a/kernel/auditsc.c b/kernel/auditsc.c
index 57b982b44732..f1da45d8afcb 100644
--- a/kernel/auditsc.c
+++ b/kernel/auditsc.c
@@ -2403,6 +2403,7 @@ void __audit_bprm(struct linux_binprm *bprm)
 
 	context->type = AUDIT_EXECVE;
 	context->execve.argc = bprm->argc;
+	context->execve.is_check = bprm->is_check;
 }
 
 
diff --git a/security/security.c b/security/security.c
index 7a0126882466..e6a6482ae3cb 100644
--- a/security/security.c
+++ b/security/security.c
@@ -850,6 +850,13 @@ int security_vm_enough_memory_mm(struct mm_struct *mm, long pages)
 	return __vm_enough_memory(mm, pages, cap_sys_admin);
 }
 
+/*
+ * If execveat(2) is called with the AT_CHECK flag, bprm->is_check is set.  The
+ * result must be the same as without this flag even if the execution will
+ * never really happen and @bprm will always be dropped.
+ *
+ * This hook must not change current->cred, only @bprm->cred.
+ */
 int security_bprm_creds_for_exec(struct linux_binprm *bprm)
 {
 	return call_int_hook(bprm_creds_for_exec, 0, bprm);
@@ -1643,6 +1650,11 @@ int security_file_receive(struct file *file)
 	return call_int_hook(file_receive, 0, file);
 }
 
+/*
+ * We can check if a file is opened for execution (e.g. execve(2) call), either
+ * directly or indirectly (e.g. ELF's ld.so) by checking file->f_flags &
+ * __FMODE_EXEC .
+ */
 int security_file_open(struct file *file)
 {
 	int ret;
-- 
2.25.1