Daniel Borkmann (1): bpf: Fix __reg_bound_offset 64->32 var_off subreg propagation
Krister Johansen (1): bpf: ensure main program has an extable
Kumar Kartikeya Dwivedi (1): bpf: Clobber stack slot when writing over spilled PTR_TO_BTF_ID
Martin KaFai Lau (1): bpf: Check the other end of slot_type for STACK_SPILL
Stanislav Fomichev (2): bpf: Don't EFAULT for getsockopt with optval=NULL bpf: Don't EFAULT for {g,s}setsockopt with wrong optlen
Wang Yufen (1): bpf: Fix memory leaks in __check_func_call
kernel/bpf/cgroup.c | 24 +++++++- kernel/bpf/verifier.c | 61 ++++++++++++------- .../testing/selftests/bpf/prog_tests/align.c | 4 +- 3 files changed, 61 insertions(+), 28 deletions(-)
From: Wang Yufen wangyufen@huawei.com
mainline inclusion from mainline-v6.1-rc6 commit eb86559a691cea5fa63e57a03ec3dc9c31e97955 category: bugfix bugzilla: https://gitee.com/openeuler/kernel/issues/IAAAW9
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i...
--------------------------------
kmemleak reports this issue:
unreferenced object 0xffff88817139d000 (size 2048): comm "test_progs", pid 33246, jiffies 4307381979 (age 45851.820s) hex dump (first 32 bytes): 01 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................ 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................ backtrace: [<0000000045f075f0>] kmalloc_trace+0x27/0xa0 [<0000000098b7c90a>] __check_func_call+0x316/0x1230 [<00000000b4c3c403>] check_helper_call+0x172e/0x4700 [<00000000aa3875b7>] do_check+0x21d8/0x45e0 [<000000001147357b>] do_check_common+0x767/0xaf0 [<00000000b5a595b4>] bpf_check+0x43e3/0x5bc0 [<0000000011e391b1>] bpf_prog_load+0xf26/0x1940 [<0000000007f765c0>] __sys_bpf+0xd2c/0x3650 [<00000000839815d6>] __x64_sys_bpf+0x75/0xc0 [<00000000946ee250>] do_syscall_64+0x3b/0x90 [<0000000000506b7f>] entry_SYSCALL_64_after_hwframe+0x63/0xcd
The root case here is: In function prepare_func_exit(), the callee is not released in the abnormal scenario after "state->curframe--;". To fix, move "state->curframe--;" to the very bottom of the function, right when we free callee and reset frame[] pointer to NULL, as Andrii suggested.
In addition, function __check_func_call() has a similar problem. In the abnormal scenario before "state->curframe++;", the callee also should be released by free_func_state().
Fixes: 69c087ba6225 ("bpf: Add bpf_for_each_map_elem() helper") Fixes: fd978bf7fd31 ("bpf: Add reference tracking to verifier") Signed-off-by: Wang Yufen wangyufen@huawei.com Link: https://lore.kernel.org/r/1667884291-15666-1-git-send-email-wangyufen@huawei... Signed-off-by: Martin KaFai Lau martin.lau@kernel.org Conflicts: kernel/bpf/verifier.c [Resolve conflicts dut to __check_func_call() is not exists before refactor] Signed-off-by: Zheng Yejian zhengyejian1@huawei.com --- kernel/bpf/verifier.c | 11 +++++++---- 1 file changed, 7 insertions(+), 4 deletions(-)
diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c index b999b801da8c..c0366edc066d 100644 --- a/kernel/bpf/verifier.c +++ b/kernel/bpf/verifier.c @@ -5213,7 +5213,7 @@ static int check_func_call(struct bpf_verifier_env *env, struct bpf_insn *insn, /* Transfer references to the callee */ err = copy_reference_state(callee, caller); if (err) - return err; + goto err_out;
/* copy r1 - r5 args that callee can access. The copy includes parent * pointers, which connects us up to the liveness chain @@ -5236,6 +5236,10 @@ static int check_func_call(struct bpf_verifier_env *env, struct bpf_insn *insn, print_verifier_state(env, callee); } return 0; +err_out: + free_func_state(callee); + state->frame[state->curframe + 1] = NULL; + return err; }
static int prepare_func_exit(struct bpf_verifier_env *env, int *insn_idx) @@ -5258,8 +5262,7 @@ static int prepare_func_exit(struct bpf_verifier_env *env, int *insn_idx) return -EINVAL; }
- state->curframe--; - caller = state->frame[state->curframe]; + caller = state->frame[state->curframe - 1]; /* return to the caller whatever r0 had in the callee */ caller->regs[BPF_REG_0] = *r0;
@@ -5277,7 +5280,7 @@ static int prepare_func_exit(struct bpf_verifier_env *env, int *insn_idx) } /* clear everything in the callee */ free_func_state(callee); - state->frame[state->curframe + 1] = NULL; + state->frame[state->curframe--] = NULL; return 0; }
From: Martin KaFai Lau kafai@fb.com
stable inclusion from stable-v5.10.163 commit cdd73a5ed0840da88a3b9ad353f568e6f156d417 category: bugfix bugzilla: https://gitee.com/openeuler/kernel/issues/I7PJ9N
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?id=...
----------------------------------------------------
[ Upstream commit 27113c59b6d0a587b29ae72d4ff3f832f58b0651 ]
Every 8 bytes of the stack is tracked by a bpf_stack_state. Within each bpf_stack_state, there is a 'u8 slot_type[8]' to track the type of each byte. Verifier tests slot_type[0] == STACK_SPILL to decide if the spilled reg state is saved. Verifier currently only saves the reg state if the whole 8 bytes are spilled to the stack, so checking the slot_type[7] is the same as checking slot_type[0].
The later patch will allow verifier to save the bounded scalar reg also for <8 bytes spill. There is a llvm patch [1] to ensure the <8 bytes spill will be 8-byte aligned, so checking slot_type[7] instead of slot_type[0] is required.
While at it, this patch refactors the slot_type[0] == STACK_SPILL test into a new function is_spilled_reg() and change the slot_type[0] check to slot_type[7] check in there also.
[1] https://reviews.llvm.org/D109073
Signed-off-by: Martin KaFai Lau kafai@fb.com Signed-off-by: Alexei Starovoitov ast@kernel.org Link: https://lore.kernel.org/bpf/20210922004934.624194-1-kafai@fb.com Stable-dep-of: 529409ea92d5 ("bpf: propagate precision across all frames, not just the last one") Signed-off-by: Sasha Levin sashal@kernel.org Signed-off-by: zhaoxiaoqiang11 zhaoxiaoqiang11@jd.com Signed-off-by: Zheng Yejian zhengyejian1@huawei.com --- kernel/bpf/verifier.c | 30 +++++++++++++++++++----------- 1 file changed, 19 insertions(+), 11 deletions(-)
diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c index c0366edc066d..013fa412afd9 100644 --- a/kernel/bpf/verifier.c +++ b/kernel/bpf/verifier.c @@ -566,6 +566,14 @@ const char *kernel_type_name(u32 id) btf_type_by_id(btf_vmlinux, id)->name_off); }
+/* The reg state of a pointer or a bounded scalar was saved when + * it was spilled to the stack. + */ +static bool is_spilled_reg(const struct bpf_stack_state *stack) +{ + return stack->slot_type[BPF_REG_SIZE - 1] == STACK_SPILL; +} + static void print_verifier_state(struct bpf_verifier_env *env, const struct bpf_func_state *state) { @@ -668,7 +676,7 @@ static void print_verifier_state(struct bpf_verifier_env *env, continue; verbose(env, " fp%d", (-i - 1) * BPF_REG_SIZE); print_liveness(env, state->stack[i].spilled_ptr.live); - if (state->stack[i].slot_type[0] == STACK_SPILL) { + if (is_spilled_reg(&state->stack[i])) { reg = &state->stack[i].spilled_ptr; t = reg->type; verbose(env, "=%s", reg_type_str(env, t)); @@ -2049,7 +2057,7 @@ static void mark_all_scalars_precise(struct bpf_verifier_env *env, reg->precise = true; } for (j = 0; j < func->allocated_stack / BPF_REG_SIZE; j++) { - if (func->stack[j].slot_type[0] != STACK_SPILL) + if (!is_spilled_reg(&func->stack[j])) continue; reg = &func->stack[j].spilled_ptr; if (reg->type != SCALAR_VALUE) @@ -2091,7 +2099,7 @@ static int __mark_chain_precision(struct bpf_verifier_env *env, int regno, }
while (spi >= 0) { - if (func->stack[spi].slot_type[0] != STACK_SPILL) { + if (!is_spilled_reg(&func->stack[spi])) { stack_mask = 0; break; } @@ -2190,7 +2198,7 @@ static int __mark_chain_precision(struct bpf_verifier_env *env, int regno, return 0; }
- if (func->stack[i].slot_type[0] != STACK_SPILL) { + if (!is_spilled_reg(&func->stack[i])) { stack_mask &= ~(1ull << i); continue; } @@ -2375,7 +2383,7 @@ static int check_stack_write_fixed_off(struct bpf_verifier_env *env, /* regular write of data into stack destroys any spilled ptr */ state->stack[spi].spilled_ptr.type = NOT_INIT; /* Mark slots as STACK_MISC if they belonged to spilled ptr. */ - if (state->stack[spi].slot_type[0] == STACK_SPILL) + if (is_spilled_reg(&state->stack[spi])) for (i = 0; i < BPF_REG_SIZE; i++) state->stack[spi].slot_type[i] = STACK_MISC;
@@ -2580,7 +2588,7 @@ static int check_stack_read_fixed_off(struct bpf_verifier_env *env, stype = reg_state->stack[spi].slot_type; reg = ®_state->stack[spi].spilled_ptr;
- if (stype[0] == STACK_SPILL) { + if (is_spilled_reg(®_state->stack[spi])) { if (size != BPF_REG_SIZE) { if (reg->type != SCALAR_VALUE) { verbose_linfo(env, env->insn_idx, "; "); @@ -4113,11 +4121,11 @@ static int check_stack_range_initialized( goto mark; }
- if (state->stack[spi].slot_type[0] == STACK_SPILL && + if (is_spilled_reg(&state->stack[spi]) && state->stack[spi].spilled_ptr.type == PTR_TO_BTF_ID) goto mark;
- if (state->stack[spi].slot_type[0] == STACK_SPILL && + if (is_spilled_reg(&state->stack[spi]) && (state->stack[spi].spilled_ptr.type == SCALAR_VALUE || env->allow_ptr_leaks)) { if (clobber) { @@ -9341,9 +9349,9 @@ static bool stacksafe(struct bpf_verifier_env *env, struct bpf_func_state *old, * return false to continue verification of this path */ return false; - if (i % BPF_REG_SIZE) + if (i % BPF_REG_SIZE != BPF_REG_SIZE - 1) continue; - if (old->stack[spi].slot_type[0] != STACK_SPILL) + if (!is_spilled_reg(&old->stack[spi])) continue; if (!regsafe(env, &old->stack[spi].spilled_ptr, &cur->stack[spi].spilled_ptr, idmap)) @@ -9550,7 +9558,7 @@ static int propagate_precision(struct bpf_verifier_env *env, }
for (i = 0; i < state->allocated_stack / BPF_REG_SIZE; i++) { - if (state->stack[i].slot_type[0] != STACK_SPILL) + if (!is_spilled_reg(&state->stack[i])) continue; state_reg = &state->stack[i].spilled_ptr; if (state_reg->type != SCALAR_VALUE ||
From: Kumar Kartikeya Dwivedi memxor@gmail.com
mainline inclusion from mainline-v6.2-rc1 commit 261f4664caffdeb9dff4e83ee3c0334b1c3a552f category: bugfix bugzilla: https://gitee.com/openeuler/kernel/issues/IAAAW9
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i...
--------------------------------
When support was added for spilled PTR_TO_BTF_ID to be accessed by helper memory access, the stack slot was not overwritten to STACK_MISC (and that too is only safe when env->allow_ptr_leaks is true).
This means that helpers who take ARG_PTR_TO_MEM and write to it may essentially overwrite the value while the verifier continues to track the slot for spilled register.
This can cause issues when PTR_TO_BTF_ID is spilled to stack, and then overwritten by helper write access, which can then be passed to BPF helpers or kfuncs.
Handle this by falling back to the case introduced in a later commit, which will also handle PTR_TO_BTF_ID along with other pointer types, i.e. cd17d38f8b28 ("bpf: Permits pointers on stack for helper calls").
Finally, include a comment on why REG_LIVE_WRITTEN is not being set when clobber is set to true. In short, the reason is that while when clobber is unset, we know that we won't be writing, when it is true, we *may* write to any of the stack slots in that range. It may be a partial or complete write, to just one or many stack slots.
We cannot be sure, hence to be conservative, we leave things as is and never set REG_LIVE_WRITTEN for any stack slot. However, clobber still needs to reset them to STACK_MISC assuming writes happened. However read marks still need to be propagated upwards from liveness point of view, as parent stack slot's contents may still continue to matter to child states.
Cc: Yonghong Song yhs@meta.com Fixes: 1d68f22b3d53 ("bpf: Handle spilled PTR_TO_BTF_ID properly when checking stack_boundary") Signed-off-by: Kumar Kartikeya Dwivedi memxor@gmail.com Link: https://lore.kernel.org/r/20221103191013.1236066-4-memxor@gmail.com Signed-off-by: Alexei Starovoitov ast@kernel.org Conflicts: kernel/bpf/verifier.c [Resolve conflicts due to lack the refactor patch 5844101a1be9 ("bpf: Reject programs that try to load __percpu memory.")] Signed-off-by: Zheng Yejian zhengyejian1@huawei.com --- kernel/bpf/verifier.c | 9 +++++---- 1 file changed, 5 insertions(+), 4 deletions(-)
diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c index 013fa412afd9..ad25b1f38a04 100644 --- a/kernel/bpf/verifier.c +++ b/kernel/bpf/verifier.c @@ -4121,10 +4121,6 @@ static int check_stack_range_initialized( goto mark; }
- if (is_spilled_reg(&state->stack[spi]) && - state->stack[spi].spilled_ptr.type == PTR_TO_BTF_ID) - goto mark; - if (is_spilled_reg(&state->stack[spi]) && (state->stack[spi].spilled_ptr.type == SCALAR_VALUE || env->allow_ptr_leaks)) { @@ -4154,6 +4150,11 @@ static int check_stack_range_initialized( mark_reg_read(env, &state->stack[spi].spilled_ptr, state->stack[spi].spilled_ptr.parent, REG_LIVE_READ64); + /* We do not set REG_LIVE_WRITTEN for stack slot, as we can not + * be sure that whether stack slot is written to or not. Hence, + * we must still conservatively propagate reads upwards even if + * helper may write to the entire memory range. + */ } return 0; }
From: Daniel Borkmann daniel@iogearbox.net
mainline inclusion from mainline-v6.4-rc1 commit 7be14c1c9030f73cc18b4ff23b78a0a081f16188 category: bugfix bugzilla: https://gitee.com/openeuler/kernel/issues/IAAAW9
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i...
--------------------------------
Xu reports that after commit 3f50f132d840 ("bpf: Verifier, do explicit ALU32 bounds tracking"), the following BPF program is rejected by the verifier:
0: (61) r2 = *(u32 *)(r1 +0) ; R2_w=pkt(off=0,r=0,imm=0) 1: (61) r3 = *(u32 *)(r1 +4) ; R3_w=pkt_end(off=0,imm=0) 2: (bf) r1 = r2 3: (07) r1 += 1 4: (2d) if r1 > r3 goto pc+8 5: (71) r1 = *(u8 *)(r2 +0) ; R1_w=scalar(umax=255,var_off=(0x0; 0xff)) 6: (18) r0 = 0x7fffffffffffff10 8: (0f) r1 += r0 ; R1_w=scalar(umin=0x7fffffffffffff10,umax=0x800000000000000f) 9: (18) r0 = 0x8000000000000000 11: (07) r0 += 1 12: (ad) if r0 < r1 goto pc-2 13: (b7) r0 = 0 14: (95) exit
And the verifier log says:
func#0 @0 0: R1=ctx(off=0,imm=0) R10=fp0 0: (61) r2 = *(u32 *)(r1 +0) ; R1=ctx(off=0,imm=0) R2_w=pkt(off=0,r=0,imm=0) 1: (61) r3 = *(u32 *)(r1 +4) ; R1=ctx(off=0,imm=0) R3_w=pkt_end(off=0,imm=0) 2: (bf) r1 = r2 ; R1_w=pkt(off=0,r=0,imm=0) R2_w=pkt(off=0,r=0,imm=0) 3: (07) r1 += 1 ; R1_w=pkt(off=1,r=0,imm=0) 4: (2d) if r1 > r3 goto pc+8 ; R1_w=pkt(off=1,r=1,imm=0) R3_w=pkt_end(off=0,imm=0) 5: (71) r1 = *(u8 *)(r2 +0) ; R1_w=scalar(umax=255,var_off=(0x0; 0xff)) R2_w=pkt(off=0,r=1,imm=0) 6: (18) r0 = 0x7fffffffffffff10 ; R0_w=9223372036854775568 8: (0f) r1 += r0 ; R0_w=9223372036854775568 R1_w=scalar(umin=9223372036854775568,umax=9223372036854775823,s32_min=-240,s32_max=15) 9: (18) r0 = 0x8000000000000000 ; R0_w=-9223372036854775808 11: (07) r0 += 1 ; R0_w=-9223372036854775807 12: (ad) if r0 < r1 goto pc-2 ; R0_w=-9223372036854775807 R1_w=scalar(umin=9223372036854775568,umax=9223372036854775809) 13: (b7) r0 = 0 ; R0_w=0 14: (95) exit
from 12 to 11: R0_w=-9223372036854775807 R1_w=scalar(umin=9223372036854775810,umax=9223372036854775823,var_off=(0x8000000000000000; 0xffffffff)) R2_w=pkt(off=0,r=1,imm=0) R3_w=pkt_end(off=0,imm=0) R10=fp0 11: (07) r0 += 1 ; R0_w=-9223372036854775806 12: (ad) if r0 < r1 goto pc-2 ; R0_w=-9223372036854775806 R1_w=scalar(umin=9223372036854775810,umax=9223372036854775810,var_off=(0x8000000000000000; 0xffffffff)) 13: safe
[...]
from 12 to 11: R0_w=-9223372036854775795 R1=scalar(umin=9223372036854775822,umax=9223372036854775823,var_off=(0x8000000000000000; 0xffffffff)) R2=pkt(off=0,r=1,imm=0) R3=pkt_end(off=0,imm=0) R10=fp0 11: (07) r0 += 1 ; R0_w=-9223372036854775794 12: (ad) if r0 < r1 goto pc-2 ; R0_w=-9223372036854775794 R1=scalar(umin=9223372036854775822,umax=9223372036854775822,var_off=(0x8000000000000000; 0xffffffff)) 13: safe
from 12 to 11: R0_w=-9223372036854775794 R1=scalar(umin=9223372036854775823,umax=9223372036854775823,var_off=(0x8000000000000000; 0xffffffff)) R2=pkt(off=0,r=1,imm=0) R3=pkt_end(off=0,imm=0) R10=fp0 11: (07) r0 += 1 ; R0_w=-9223372036854775793 12: (ad) if r0 < r1 goto pc-2 ; R0_w=-9223372036854775793 R1=scalar(umin=9223372036854775823,umax=9223372036854775823,var_off=(0x8000000000000000; 0xffffffff)) 13: safe
from 12 to 11: R0_w=-9223372036854775793 R1=scalar(umin=9223372036854775824,umax=9223372036854775823,var_off=(0x8000000000000000; 0xffffffff)) R2=pkt(off=0,r=1,imm=0) R3=pkt_end(off=0,imm=0) R10=fp0 11: (07) r0 += 1 ; R0_w=-9223372036854775792 12: (ad) if r0 < r1 goto pc-2 ; R0_w=-9223372036854775792 R1=scalar(umin=9223372036854775824,umax=9223372036854775823,var_off=(0x8000000000000000; 0xffffffff)) 13: safe
[...]
The 64bit umin=9223372036854775810 bound continuously bumps by +1 while umax=9223372036854775823 stays as-is until the verifier complexity limit is reached and the program gets finally rejected. During this simulation, the umin also eventually surpasses umax. Looking at the first 'from 12 to 11' output line from the loop, R1 has the following state:
R1_w=scalar(umin=0x8000000000000002 (9223372036854775810), umax=0x800000000000000f (9223372036854775823), var_off=(0x8000000000000000; 0xffffffff))
The var_off has technically not an inconsistent state but it's very imprecise and far off surpassing 64bit umax bounds whereas the expected output with refined known bits in var_off should have been like:
R1_w=scalar(umin=0x8000000000000002 (9223372036854775810), umax=0x800000000000000f (9223372036854775823), var_off=(0x8000000000000000; 0xf))
In the above log, var_off stays as var_off=(0x8000000000000000; 0xffffffff) and does not converge into a narrower mask where more bits become known, eventually transforming R1 into a constant upon umin=9223372036854775823, umax=9223372036854775823 case where the verifier would have terminated and let the program pass.
The __reg_combine_64_into_32() marks the subregister unknown and propagates 64bit {s,u}min/{s,u}max bounds to their 32bit equivalents iff they are within the 32bit universe. The question came up whether __reg_combine_64_into_32() should special case the situation that when 64bit {s,u}min bounds have the same value as 64bit {s,u}max bounds to then assign the latter as well to the 32bit reg->{s,u}32_{min,max}_value. As can be seen from the above example however, that is just /one/ special case and not a /generic/ solution given above example would still not be addressed this way and remain at an imprecise var_off=(0x8000000000000000; 0xffffffff).
The improvement is needed in __reg_bound_offset() to refine var32_off with the updated var64_off instead of the prior reg->var_off. The reg_bounds_sync() code first refines information about the register's min/max bounds via __update_reg_bounds() from the current var_off, then in __reg_deduce_bounds() from sign bit and with the potentially learned bits from bounds it'll update the var_off tnum in __reg_bound_offset(). For example, intersecting with the old var_off might have improved bounds slightly, e.g. if umax was 0x7f...f and var_off was (0; 0xf...fc), then new var_off will then result in (0; 0x7f...fc). The intersected var64_off holds then the universe which is a superset of var32_off. The point for the latter is not to broaden, but to further refine known bits based on the intersection of var_off with 32 bit bounds, so that we later construct the final var_off from upper and lower 32 bits. The final __update_reg_bounds() can then potentially still slightly refine bounds if more bits became known from the new var_off.
After the improvement, we can see R1 converging successively:
func#0 @0 0: R1=ctx(off=0,imm=0) R10=fp0 0: (61) r2 = *(u32 *)(r1 +0) ; R1=ctx(off=0,imm=0) R2_w=pkt(off=0,r=0,imm=0) 1: (61) r3 = *(u32 *)(r1 +4) ; R1=ctx(off=0,imm=0) R3_w=pkt_end(off=0,imm=0) 2: (bf) r1 = r2 ; R1_w=pkt(off=0,r=0,imm=0) R2_w=pkt(off=0,r=0,imm=0) 3: (07) r1 += 1 ; R1_w=pkt(off=1,r=0,imm=0) 4: (2d) if r1 > r3 goto pc+8 ; R1_w=pkt(off=1,r=1,imm=0) R3_w=pkt_end(off=0,imm=0) 5: (71) r1 = *(u8 *)(r2 +0) ; R1_w=scalar(umax=255,var_off=(0x0; 0xff)) R2_w=pkt(off=0,r=1,imm=0) 6: (18) r0 = 0x7fffffffffffff10 ; R0_w=9223372036854775568 8: (0f) r1 += r0 ; R0_w=9223372036854775568 R1_w=scalar(umin=9223372036854775568,umax=9223372036854775823,s32_min=-240,s32_max=15) 9: (18) r0 = 0x8000000000000000 ; R0_w=-9223372036854775808 11: (07) r0 += 1 ; R0_w=-9223372036854775807 12: (ad) if r0 < r1 goto pc-2 ; R0_w=-9223372036854775807 R1_w=scalar(umin=9223372036854775568,umax=9223372036854775809) 13: (b7) r0 = 0 ; R0_w=0 14: (95) exit
from 12 to 11: R0_w=-9223372036854775807 R1_w=scalar(umin=9223372036854775810,umax=9223372036854775823,var_off=(0x8000000000000000; 0xf),s32_min=0,s32_max=15,u32_max=15) R2_w=pkt(off=0,r=1,imm=0) R3_w=pkt_end(off=0,imm=0) R10=fp0 11: (07) r0 += 1 ; R0_w=-9223372036854775806 12: (ad) if r0 < r1 goto pc-2 ; R0_w=-9223372036854775806 R1_w=-9223372036854775806 13: safe
from 12 to 11: R0_w=-9223372036854775806 R1_w=scalar(umin=9223372036854775811,umax=9223372036854775823,var_off=(0x8000000000000000; 0xf),s32_min=0,s32_max=15,u32_max=15) R2_w=pkt(off=0,r=1,imm=0) R3_w=pkt_end(off=0,imm=0) R10=fp0 11: (07) r0 += 1 ; R0_w=-9223372036854775805 12: (ad) if r0 < r1 goto pc-2 ; R0_w=-9223372036854775805 R1_w=-9223372036854775805 13: safe
[...]
from 12 to 11: R0_w=-9223372036854775798 R1=scalar(umin=9223372036854775819,umax=9223372036854775823,var_off=(0x8000000000000008; 0x7),s32_min=8,s32_max=15,u32_min=8,u32_max=15) R2=pkt(off=0,r=1,imm=0) R3=pkt_end(off=0,imm=0) R10=fp0 11: (07) r0 += 1 ; R0_w=-9223372036854775797 12: (ad) if r0 < r1 goto pc-2 ; R0_w=-9223372036854775797 R1=-9223372036854775797 13: safe
from 12 to 11: R0_w=-9223372036854775797 R1=scalar(umin=9223372036854775820,umax=9223372036854775823,var_off=(0x800000000000000c; 0x3),s32_min=12,s32_max=15,u32_min=12,u32_max=15) R2=pkt(off=0,r=1,imm=0) R3=pkt_end(off=0,imm=0) R10=fp0 11: (07) r0 += 1 ; R0_w=-9223372036854775796 12: (ad) if r0 < r1 goto pc-2 ; R0_w=-9223372036854775796 R1=-9223372036854775796 13: safe
from 12 to 11: R0_w=-9223372036854775796 R1=scalar(umin=9223372036854775821,umax=9223372036854775823,var_off=(0x800000000000000c; 0x3),s32_min=12,s32_max=15,u32_min=12,u32_max=15) R2=pkt(off=0,r=1,imm=0) R3=pkt_end(off=0,imm=0) R10=fp0 11: (07) r0 += 1 ; R0_w=-9223372036854775795 12: (ad) if r0 < r1 goto pc-2 ; R0_w=-9223372036854775795 R1=-9223372036854775795 13: safe
from 12 to 11: R0_w=-9223372036854775795 R1=scalar(umin=9223372036854775822,umax=9223372036854775823,var_off=(0x800000000000000e; 0x1),s32_min=14,s32_max=15,u32_min=14,u32_max=15) R2=pkt(off=0,r=1,imm=0) R3=pkt_end(off=0,imm=0) R10=fp0 11: (07) r0 += 1 ; R0_w=-9223372036854775794 12: (ad) if r0 < r1 goto pc-2 ; R0_w=-9223372036854775794 R1=-9223372036854775794 13: safe
from 12 to 11: R0_w=-9223372036854775794 R1=-9223372036854775793 R2=pkt(off=0,r=1,imm=0) R3=pkt_end(off=0,imm=0) R10=fp0 11: (07) r0 += 1 ; R0_w=-9223372036854775793 12: (ad) if r0 < r1 goto pc-2 last_idx 12 first_idx 12 parent didn't have regs=1 stack=0 marks: R0_rw=P-9223372036854775801 R1_r=scalar(umin=9223372036854775815,umax=9223372036854775823,var_off=(0x8000000000000000; 0xf),s32_min=0,s32_max=15,u32_max=15) R2=pkt(off=0,r=1,imm=0) R3=pkt_end(off=0,imm=0) R10=fp0 last_idx 11 first_idx 11 regs=1 stack=0 before 11: (07) r0 += 1 parent didn't have regs=1 stack=0 marks: R0_rw=P-9223372036854775805 R1_rw=scalar(umin=9223372036854775812,umax=9223372036854775823,var_off=(0x8000000000000000; 0xf),s32_min=0,s32_max=15,u32_max=15) R2_w=pkt(off=0,r=1,imm=0) R3_w=pkt_end(off=0,imm=0) R10=fp0 last_idx 12 first_idx 0 regs=1 stack=0 before 12: (ad) if r0 < r1 goto pc-2 regs=1 stack=0 before 11: (07) r0 += 1 regs=1 stack=0 before 12: (ad) if r0 < r1 goto pc-2 regs=1 stack=0 before 11: (07) r0 += 1 regs=1 stack=0 before 12: (ad) if r0 < r1 goto pc-2 regs=1 stack=0 before 11: (07) r0 += 1 regs=1 stack=0 before 9: (18) r0 = 0x8000000000000000 last_idx 12 first_idx 12 parent didn't have regs=2 stack=0 marks: R0_rw=P-9223372036854775801 R1_r=Pscalar(umin=9223372036854775815,umax=9223372036854775823,var_off=(0x8000000000000000; 0xf),s32_min=0,s32_max=15,u32_max=15) R2=pkt(off=0,r=1,imm=0) R3=pkt_end(off=0,imm=0) R10=fp0 last_idx 11 first_idx 11 regs=2 stack=0 before 11: (07) r0 += 1 parent didn't have regs=2 stack=0 marks: R0_rw=P-9223372036854775805 R1_rw=Pscalar(umin=9223372036854775812,umax=9223372036854775823,var_off=(0x8000000000000000; 0xf),s32_min=0,s32_max=15,u32_max=15) R2_w=pkt(off=0,r=1,imm=0) R3_w=pkt_end(off=0,imm=0) R10=fp0 last_idx 12 first_idx 0 regs=2 stack=0 before 12: (ad) if r0 < r1 goto pc-2 regs=2 stack=0 before 11: (07) r0 += 1 regs=2 stack=0 before 12: (ad) if r0 < r1 goto pc-2 regs=2 stack=0 before 11: (07) r0 += 1 regs=2 stack=0 before 12: (ad) if r0 < r1 goto pc-2 regs=2 stack=0 before 11: (07) r0 += 1 regs=2 stack=0 before 9: (18) r0 = 0x8000000000000000 regs=2 stack=0 before 8: (0f) r1 += r0 regs=3 stack=0 before 6: (18) r0 = 0x7fffffffffffff10 regs=2 stack=0 before 5: (71) r1 = *(u8 *)(r2 +0) 13: safe
from 4 to 13: safe verification time 322 usec stack depth 0 processed 56 insns (limit 1000000) max_states_per_insn 1 total_states 3 peak_states 3 mark_read 1
This also fixes up a test case along with this improvement where we match on the verifier log. The updated log now has a refined var_off, too.
Fixes: 3f50f132d840 ("bpf: Verifier, do explicit ALU32 bounds tracking") Reported-by: Xu Kuohai xukuohai@huaweicloud.com Conflicts: tools/testing/selftests/bpf/prog_tests/align.c [Resolve conflicts] Signed-off-by: Daniel Borkmann daniel@iogearbox.net Signed-off-by: Andrii Nakryiko andrii@kernel.org Reviewed-by: John Fastabend john.fastabend@gmail.com Link: https://lore.kernel.org/bpf/20230314203424.4015351-2-xukuohai@huaweicloud.co... Link: https://lore.kernel.org/bpf/20230322213056.2470-1-daniel@iogearbox.net Signed-off-by: Zheng Yejian zhengyejian1@huawei.com --- kernel/bpf/verifier.c | 6 +++--- tools/testing/selftests/bpf/prog_tests/align.c | 4 ++-- 2 files changed, 5 insertions(+), 5 deletions(-)
diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c index ad25b1f38a04..37bbb2f6d86c 100644 --- a/kernel/bpf/verifier.c +++ b/kernel/bpf/verifier.c @@ -1275,9 +1275,9 @@ static void __reg_bound_offset(struct bpf_reg_state *reg) struct tnum var64_off = tnum_intersect(reg->var_off, tnum_range(reg->umin_value, reg->umax_value)); - struct tnum var32_off = tnum_intersect(tnum_subreg(reg->var_off), - tnum_range(reg->u32_min_value, - reg->u32_max_value)); + struct tnum var32_off = tnum_intersect(tnum_subreg(var64_off), + tnum_range(reg->u32_min_value, + reg->u32_max_value));
reg->var_off = tnum_or(tnum_clear_subreg(var64_off), var32_off); } diff --git a/tools/testing/selftests/bpf/prog_tests/align.c b/tools/testing/selftests/bpf/prog_tests/align.c index 5861446d0777..41a3668dfd39 100644 --- a/tools/testing/selftests/bpf/prog_tests/align.c +++ b/tools/testing/selftests/bpf/prog_tests/align.c @@ -565,14 +565,14 @@ static struct bpf_align_test tests[] = { /* New unknown value in R7 is (4n), >= 76 */ {15, "R7_w=inv(id=0,umin_value=76,umax_value=1096,var_off=(0x0; 0x7fc))"}, /* Adding it to packet pointer gives nice bounds again */ - {16, "R5_w=pkt(id=3,off=0,r=0,umin_value=2,umax_value=1082,var_off=(0x2; 0xfffffffc)"}, + {16, "R5_w=pkt(id=3,off=0,r=0,umin_value=2,umax_value=1082,var_off=(0x2; 0x7fc)"}, /* At the time the word size load is performed from R5, * its total fixed offset is NET_IP_ALIGN + reg->off (0) * which is 2. Then the variable offset is (4n+2), so * the total offset is 4-byte aligned and meets the * load's requirements. */ - {20, "R5=pkt(id=3,off=0,r=4,umin_value=2,umax_value=1082,var_off=(0x2; 0xfffffffc)"}, + {20, "R5=pkt(id=3,off=0,r=4,umin_value=2,umax_value=1082,var_off=(0x2; 0x7fc)"}, }, }, };
From: Stanislav Fomichev sdf@google.com
stable inclusion from stable-v5.10.180 commit f4fc43fde12a1b87b0433a9f8e6ad6fb5fe9cd8a category: bugfix bugzilla: https://gitee.com/openeuler/kernel/issues/I8DDFN
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?id=...
--------------------------------
[ Upstream commit 00e74ae0863827d944e36e56a4ce1e77e50edb91 ]
Some socket options do getsockopt with optval=NULL to estimate the size of the final buffer (which is returned via optlen). This breaks BPF getsockopt assumptions about permitted optval buffer size. Let's enforce these assumptions only when non-NULL optval is provided.
Fixes: 0d01da6afc54 ("bpf: implement getsockopt and setsockopt hooks") Reported-by: Martin KaFai Lau martin.lau@kernel.org Signed-off-by: Stanislav Fomichev sdf@google.com Signed-off-by: Daniel Borkmann daniel@iogearbox.net Link: https://lore.kernel.org/bpf/ZD7Js4fj5YyI2oLd@google.com/T/#mb68daf700f87a924... Link: https://lore.kernel.org/bpf/20230418225343.553806-2-sdf@google.com Signed-off-by: Sasha Levin sashal@kernel.org Signed-off-by: sanglipeng sanglipeng1@jd.com Signed-off-by: Zheng Yejian zhengyejian1@huawei.com --- kernel/bpf/cgroup.c | 9 ++++++--- 1 file changed, 6 insertions(+), 3 deletions(-)
diff --git a/kernel/bpf/cgroup.c b/kernel/bpf/cgroup.c index ab72dd3da57e..60e7b0604a65 100644 --- a/kernel/bpf/cgroup.c +++ b/kernel/bpf/cgroup.c @@ -1582,7 +1582,7 @@ int __cgroup_bpf_run_filter_getsockopt(struct sock *sk, int level, goto out; }
- if (ctx.optlen > max_optlen || ctx.optlen < 0) { + if (optval && (ctx.optlen > max_optlen || ctx.optlen < 0)) { ret = -EFAULT; goto out; } @@ -1596,8 +1596,11 @@ int __cgroup_bpf_run_filter_getsockopt(struct sock *sk, int level, }
if (ctx.optlen != 0) { - if (copy_to_user(optval, ctx.optval, ctx.optlen) || - put_user(ctx.optlen, optlen)) { + if (optval && copy_to_user(optval, ctx.optval, ctx.optlen)) { + ret = -EFAULT; + goto out; + } + if (put_user(ctx.optlen, optlen)) { ret = -EFAULT; goto out; }
From: Stanislav Fomichev sdf@google.com
mainline inclusion from mainline-v6.5-rc1 commit 29ebbba7d46136cba324264e513a1e964ca16c0a category: bugfix bugzilla: https://gitee.com/openeuler/kernel/issues/IAAAW9
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i...
--------------------------------
With the way the hooks implemented right now, we have a special condition: optval larger than PAGE_SIZE will expose only first 4k into BPF; any modifications to the optval are ignored. If the BPF program doesn't handle this condition by resetting optlen to 0, the userspace will get EFAULT.
The intention of the EFAULT was to make it apparent to the developers that the program is doing something wrong. However, this inadvertently might affect production workloads with the BPF programs that are not too careful (i.e., returning EFAULT for perfectly valid setsockopt/getsockopt calls).
Let's try to minimize the chance of BPF program screwing up userspace by ignoring the output of those BPF programs (instead of returning EFAULT to the userspace). pr_info_once those cases to the dmesg to help with figuring out what's going wrong.
Fixes: 0d01da6afc54 ("bpf: implement getsockopt and setsockopt hooks") Suggested-by: Martin KaFai Lau martin.lau@kernel.org Conflicts: kernel/bpf/cgroup.c [Resolve conflicts] Signed-off-by: Stanislav Fomichev sdf@google.com Link: https://lore.kernel.org/r/20230511170456.1759459-2-sdf@google.com Signed-off-by: Martin KaFai Lau martin.lau@kernel.org Signed-off-by: Zheng Yejian zhengyejian1@huawei.com --- kernel/bpf/cgroup.c | 15 +++++++++++++++ 1 file changed, 15 insertions(+)
diff --git a/kernel/bpf/cgroup.c b/kernel/bpf/cgroup.c index 60e7b0604a65..3328201d92b1 100644 --- a/kernel/bpf/cgroup.c +++ b/kernel/bpf/cgroup.c @@ -1477,6 +1477,12 @@ int __cgroup_bpf_run_filter_setsockopt(struct sock *sk, int *level, ret = 1; } else if (ctx.optlen > max_optlen || ctx.optlen < -1) { /* optlen is out of bounds */ + if (*optlen > PAGE_SIZE && ctx.optlen >= 0) { + pr_info_once("bpf setsockopt: ignoring program buffer with optlen=%d (max_optlen=%d)\n", + ctx.optlen, max_optlen); + ret = 0; + goto out; + } ret = -EFAULT; } else { /* optlen within bounds, run kernel handler */ @@ -1532,6 +1538,7 @@ int __cgroup_bpf_run_filter_getsockopt(struct sock *sk, int level, .optname = optname, .retval = retval, }; + int orig_optlen; int ret;
/* Opportunistic check to see whether we have any BPF program @@ -1541,6 +1548,7 @@ int __cgroup_bpf_run_filter_getsockopt(struct sock *sk, int level, if (__cgroup_bpf_prog_array_is_empty(cgrp, CGROUP_GETSOCKOPT)) return retval;
+ orig_optlen = max_optlen; ctx.optlen = max_optlen;
max_optlen = sockopt_alloc_buf(&ctx, max_optlen, &buf); @@ -1564,6 +1572,7 @@ int __cgroup_bpf_run_filter_getsockopt(struct sock *sk, int level, ret = -EFAULT; goto out; } + orig_optlen = ctx.optlen;
if (copy_from_user(ctx.optval, optval, min(ctx.optlen, max_optlen)) != 0) { @@ -1583,6 +1592,12 @@ int __cgroup_bpf_run_filter_getsockopt(struct sock *sk, int level, }
if (optval && (ctx.optlen > max_optlen || ctx.optlen < 0)) { + if (orig_optlen > PAGE_SIZE && ctx.optlen >= 0) { + pr_info_once("bpf getsockopt: ignoring program buffer with optlen=%d (max_optlen=%d)\n", + ctx.optlen, max_optlen); + ret = retval; + goto out; + } ret = -EFAULT; goto out; }
From: Krister Johansen kjlx@templeofstupid.com
mainline inclusion from mainline-v6.4 commit 0108a4e9f3584a7a2c026d1601b0682ff7335d95 category: bugfix bugzilla: https://gitee.com/openeuler/kernel/issues/IAAAW9
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i...
--------------------------------
When subprograms are in use, the main program is not jit'd after the subprograms because jit_subprogs sets a value for prog->bpf_func upon success. Subsequent calls to the JIT are bypassed when this value is non-NULL. This leads to a situation where the main program and its func[0] counterpart are both in the bpf kallsyms tree, but only func[0] has an extable. Extables are only created during JIT. Now there are two nearly identical program ksym entries in the tree, but only one has an extable. Depending upon how the entries are placed, there's a chance that a fault will call search_extable on the aux with the NULL entry.
Since jit_subprogs already copies state from func[0] to the main program, include the extable pointer in this state duplication. Additionally, ensure that the copy of the main program in func[0] is not added to the bpf_prog_kallsyms table. Instead, let the main program get added later in bpf_prog_load(). This ensures there is only a single copy of the main program in the kallsyms table, and that its tag matches the tag observed by tooling like bpftool.
Cc: stable@vger.kernel.org Fixes: 1c2a088a6626 ("bpf: x64: add JIT support for multi-function programs") Conflicts: kernel/bpf/verifier.c [Resolve conflicts] Signed-off-by: Krister Johansen kjlx@templeofstupid.com Acked-by: Yonghong Song yhs@fb.com Acked-by: Ilya Leoshkevich iii@linux.ibm.com Tested-by: Ilya Leoshkevich iii@linux.ibm.com Link: https://lore.kernel.org/r/6de9b2f4b4724ef56efbb0339daaa66c8b68b1e7.168661666... Signed-off-by: Alexei Starovoitov ast@kernel.org Signed-off-by: Zheng Yejian zhengyejian1@huawei.com --- kernel/bpf/verifier.c | 7 +++++-- 1 file changed, 5 insertions(+), 2 deletions(-)
diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c index 37bbb2f6d86c..ed8896b64825 100644 --- a/kernel/bpf/verifier.c +++ b/kernel/bpf/verifier.c @@ -11310,9 +11310,10 @@ static int jit_subprogs(struct bpf_verifier_env *env) }
/* finally lock prog and jit images for all functions and - * populate kallsysm + * populate kallsysm. Begin at the first subprogram, since + * bpf_prog_load will add the kallsyms for the main program. */ - for (i = 0; i < env->subprog_cnt; i++) { + for (i = 1; i < env->subprog_cnt; i++) { bpf_prog_lock_ro(func[i]); bpf_prog_kallsyms_add(func[i]); } @@ -11332,6 +11333,8 @@ static int jit_subprogs(struct bpf_verifier_env *env)
prog->jited = 1; prog->bpf_func = func[0]->bpf_func; + prog->aux->extable = func[0]->aux->extable; + prog->aux->num_exentries = func[0]->aux->num_exentries; prog->aux->func = func; prog->aux->func_cnt = env->subprog_cnt; bpf_prog_free_unused_jited_linfo(prog);
反馈: 您发送到kernel@openeuler.org的补丁/补丁集,已成功转换为PR! PR链接地址: https://gitee.com/openeuler/kernel/pulls/9771 邮件列表地址:https://mailweb.openeuler.org/hyperkitty/list/kernel@openeuler.org/message/E...
FeedBack: The patch(es) which you have sent to kernel@openeuler.org mailing list has been converted to a pull request successfully! Pull request link: https://gitee.com/openeuler/kernel/pulls/9771 Mailing list address: https://mailweb.openeuler.org/hyperkitty/list/kernel@openeuler.org/message/E...