From: Bing-Jhong Billy Jheng <billy@starlabs.sg>

stable inclusion
from stable-v5.10.171
commit 08681391b84da27133deefaaddefd0acfa90c2be
category: bugfix
bugzilla: https://gitee.com/src-openeuler/kernel/issues/I6V7V1
CVE: CVE-2023-1872
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?id=...
--------------------------------
io_get_file_fixed will access io_uring's context. Lock it if it is invoked unlocked (e.g. via io-wq) to avoid a race condition with fixed files getting unregistered.
No single upstream patch exists for this issue, it was fixed as part of the file assignment changes that went into the 5.18 cycle.
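For illustration, here is a minimal, hypothetical userspace sketch of the locking pattern the fix applies (it is not the kernel code): the callee takes the lock only when the caller's flags say the lock is not already held, so an unlocked caller such as io-wq cannot race with files being unregistered, and a locked caller is not deadlocked by double-locking.

#include <pthread.h>
#include <stdio.h>

/* Hypothetical flag; it plays the role IO_URING_F_NONBLOCK plays in the
 * patch below, where the submission path already holds the ring lock. */
#define F_CALLER_HOLDS_LOCK 0x1

static pthread_mutex_t table_lock = PTHREAD_MUTEX_INITIALIZER;
static const char *file_table[16];

static const char *lookup_fixed(int fd, unsigned int flags)
{
        const char *f = NULL;

        if (!(flags & F_CALLER_HOLDS_LOCK))
                pthread_mutex_lock(&table_lock);
        if (fd >= 0 && fd < 16)
                f = file_table[fd];     /* safe: table cannot change here */
        if (!(flags & F_CALLER_HOLDS_LOCK))
                pthread_mutex_unlock(&table_lock);
        return f;
}

int main(void)
{
        file_table[3] = "fixed file 3";
        printf("%s\n", lookup_fixed(3, 0));  /* io-wq style: takes the lock */
        return 0;
}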
Signed-off-by: Jheng, Bing-Jhong Billy <billy@starlabs.sg>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: ZhaoLong Wang <wangzhaolong1@huawei.com>
Reviewed-by: Zhang Yi <yi.zhang@huawei.com>
Reviewed-by: Xiu Jianfeng <xiujianfeng@huawei.com>
Signed-off-by: Jialin Zhang <zhangjialin11@huawei.com>
---
 io_uring/io_uring.c | 25 ++++++++++++++++---------
 1 file changed, 16 insertions(+), 9 deletions(-)
diff --git a/io_uring/io_uring.c b/io_uring/io_uring.c
index a200750b536c..98cf1c413691 100644
--- a/io_uring/io_uring.c
+++ b/io_uring/io_uring.c
@@ -1090,7 +1090,8 @@ static int __io_register_rsrc_update(struct io_ring_ctx *ctx, unsigned type,
 			 unsigned nr_args);
 static void io_clean_op(struct io_kiocb *req);
 static struct file *io_file_get(struct io_ring_ctx *ctx,
-				struct io_kiocb *req, int fd, bool fixed);
+				struct io_kiocb *req, int fd, bool fixed,
+				unsigned int issue_flags);
 static void __io_queue_sqe(struct io_kiocb *req);
 static void io_rsrc_put_work(struct work_struct *work);
@@ -3914,7 +3915,7 @@ static int io_tee(struct io_kiocb *req, unsigned int issue_flags)
 		return -EAGAIN;
 
 	in = io_file_get(req->ctx, req, sp->splice_fd_in,
-			 (sp->flags & SPLICE_F_FD_IN_FIXED));
+			 (sp->flags & SPLICE_F_FD_IN_FIXED), issue_flags);
 	if (!in) {
 		ret = -EBADF;
 		goto done;
@@ -3954,7 +3955,7 @@ static int io_splice(struct io_kiocb *req, unsigned int issue_flags)
 		return -EAGAIN;
 
 	in = io_file_get(req->ctx, req, sp->splice_fd_in,
-			 (sp->flags & SPLICE_F_FD_IN_FIXED));
+			 (sp->flags & SPLICE_F_FD_IN_FIXED), issue_flags);
 	if (!in) {
 		ret = -EBADF;
 		goto done;
@@ -6742,13 +6743,16 @@ static void io_fixed_file_set(struct io_fixed_file *file_slot, struct file *file
 }
 
 static inline struct file *io_file_get_fixed(struct io_ring_ctx *ctx,
-					     struct io_kiocb *req, int fd)
+					     struct io_kiocb *req, int fd,
+					     unsigned int issue_flags)
 {
-	struct file *file;
+	struct file *file = NULL;
 	unsigned long file_ptr;
 
+	io_ring_submit_lock(ctx, !(issue_flags & IO_URING_F_NONBLOCK));
+
 	if (unlikely((unsigned int)fd >= ctx->nr_user_files))
-		return NULL;
+		goto out;
 	fd = array_index_nospec(fd, ctx->nr_user_files);
 	file_ptr = io_fixed_file_slot(&ctx->file_table, fd)->file_ptr;
 	file = (struct file *) (file_ptr & FFS_MASK);
@@ -6756,6 +6760,8 @@ static inline struct file *io_file_get_fixed(struct io_ring_ctx *ctx,
 	/* mask in overlapping REQ_F and FFS bits */
 	req->flags |= (file_ptr << REQ_F_NOWAIT_READ_BIT);
 	io_req_set_rsrc_node(req);
+out:
+	io_ring_submit_unlock(ctx, !(issue_flags & IO_URING_F_NONBLOCK));
 	return file;
 }
@@ -6773,10 +6779,11 @@ static struct file *io_file_get_normal(struct io_ring_ctx *ctx,
 }
 
 static inline struct file *io_file_get(struct io_ring_ctx *ctx,
-				       struct io_kiocb *req, int fd, bool fixed)
+				       struct io_kiocb *req, int fd, bool fixed,
+				       unsigned int issue_flags)
 {
 	if (fixed)
-		return io_file_get_fixed(ctx, req, fd);
+		return io_file_get_fixed(ctx, req, fd, issue_flags);
 	else
 		return io_file_get_normal(ctx, req, fd);
 }
@@ -6998,7 +7005,7 @@ static int io_init_req(struct io_ring_ctx *ctx, struct io_kiocb *req,
 
 	if (io_op_defs[req->opcode].needs_file) {
 		req->file = io_file_get(ctx, req, READ_ONCE(sqe->fd),
-					(sqe_flags & IOSQE_FIXED_FILE));
+					(sqe_flags & IOSQE_FIXED_FILE), 0);
 		if (unlikely(!req->file))
 			ret = -EBADF;
 	}
From: Jens Axboe <axboe@kernel.dk>

stable inclusion
from stable-v5.10.172
commit da24142b1ef9fd5d36b76e36bab328a5b27523e8
category: bugfix
bugzilla: https://gitee.com/src-openeuler/kernel/issues/I6V7V1
CVE: CVE-2023-1872
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?id=...
--------------------------------
We can't use 0 here, as io_init_req() is always invoked with the ctx uring_lock held. Newer kernels have IO_URING_F_UNLOCKED for this, but previously we used IO_URING_F_NONBLOCK to indicate this as well.
Fixes: 08681391b84d ("io_uring: add missing lock in io_get_file_fixed")
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: ZhaoLong Wang <wangzhaolong1@huawei.com>
Reviewed-by: Zhang Yi <yi.zhang@huawei.com>
Reviewed-by: Xiu Jianfeng <xiujianfeng@huawei.com>
Signed-off-by: Jialin Zhang <zhangjialin11@huawei.com>
---
 io_uring/io_uring.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)
diff --git a/io_uring/io_uring.c b/io_uring/io_uring.c
index 98cf1c413691..d9ba27f4c116 100644
--- a/io_uring/io_uring.c
+++ b/io_uring/io_uring.c
@@ -7005,7 +7005,8 @@ static int io_init_req(struct io_ring_ctx *ctx, struct io_kiocb *req,
 
 	if (io_op_defs[req->opcode].needs_file) {
 		req->file = io_file_get(ctx, req, READ_ONCE(sqe->fd),
-					(sqe_flags & IOSQE_FIXED_FILE), 0);
+					(sqe_flags & IOSQE_FIXED_FILE),
+					IO_URING_F_NONBLOCK);
 		if (unlikely(!req->file))
 			ret = -EBADF;
 	}
From: Dave Chinner <dchinner@redhat.com>

mainline inclusion
from mainline-v5.14-rc1
commit b2ae3a9ef91152931b99620c431cf3805daa1429
category: bugfix
bugzilla: 187526,https://gitee.com/openeuler/kernel/issues/I6WKVJ
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i...
--------------------------------
Because I cannot tell if the NEED_FLUSH flag is being set correctly by the log force and CIL push machinery without it.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Guo Xuenan <guoxuenan@huawei.com>
Reviewed-by: Yang Erkun <yangerkun@huawei.com>
Signed-off-by: Jialin Zhang <zhangjialin11@huawei.com>
---
 fs/xfs/xfs_log_priv.h | 13 ++++++++++---
 fs/xfs/xfs_trace.h    |  5 ++++-
 2 files changed, 14 insertions(+), 4 deletions(-)
diff --git a/fs/xfs/xfs_log_priv.h b/fs/xfs/xfs_log_priv.h
index ca19a3f04bee..210b3f16e973 100644
--- a/fs/xfs/xfs_log_priv.h
+++ b/fs/xfs/xfs_log_priv.h
@@ -48,6 +48,16 @@ enum xlog_iclog_state {
 	{ XLOG_STATE_CALLBACK,	"XLOG_STATE_CALLBACK" }, \
 	{ XLOG_STATE_DIRTY,	"XLOG_STATE_DIRTY" }
 
+/*
+ * In core log flags
+ */
+#define XLOG_ICL_NEED_FLUSH	(1 << 0)	/* iclog needs REQ_PREFLUSH */
+#define XLOG_ICL_NEED_FUA	(1 << 1)	/* iclog needs REQ_FUA */
+
+#define XLOG_ICL_STRINGS \
+	{ XLOG_ICL_NEED_FLUSH,	"XLOG_ICL_NEED_FLUSH" }, \
+	{ XLOG_ICL_NEED_FUA,	"XLOG_ICL_NEED_FUA" }
+
 /*
  * Log ticket flags
@@ -132,9 +142,6 @@ enum xlog_iclog_state {
 
 #define XLOG_COVER_OPS		5
 
-#define XLOG_ICL_NEED_FLUSH	(1 << 0)	/* iclog needs REQ_PREFLUSH */
-#define XLOG_ICL_NEED_FUA	(1 << 1)	/* iclog needs REQ_FUA */
-
 /* Ticket reservation region accounting */
 #define XLOG_TIC_LEN_MAX	15
diff --git a/fs/xfs/xfs_trace.h b/fs/xfs/xfs_trace.h
index ae96aaa3bb31..2e90041fed8d 100644
--- a/fs/xfs/xfs_trace.h
+++ b/fs/xfs/xfs_trace.h
@@ -4010,6 +4010,7 @@ DECLARE_EVENT_CLASS(xlog_iclog_class,
 		__field(uint32_t, state)
 		__field(int32_t, refcount)
 		__field(uint32_t, offset)
+		__field(uint32_t, flags)
 		__field(unsigned long long, lsn)
 		__field(unsigned long, caller_ip)
 	),
@@ -4018,15 +4019,17 @@ DECLARE_EVENT_CLASS(xlog_iclog_class,
 		__entry->state = iclog->ic_state;
 		__entry->refcount = atomic_read(&iclog->ic_refcnt);
 		__entry->offset = iclog->ic_offset;
+		__entry->flags = iclog->ic_flags;
 		__entry->lsn = be64_to_cpu(iclog->ic_header.h_lsn);
 		__entry->caller_ip = caller_ip;
 	),
-	TP_printk("dev %d:%d state %s refcnt %d offset %u lsn 0x%llx caller %pS",
+	TP_printk("dev %d:%d state %s refcnt %d offset %u lsn 0x%llx flags %s caller %pS",
 		  MAJOR(__entry->dev), MINOR(__entry->dev),
 		  __print_symbolic(__entry->state, XLOG_STATE_STRINGS),
 		  __entry->refcount,
 		  __entry->offset,
 		  __entry->lsn,
+		  __print_flags(__entry->flags, "|", XLOG_ICL_STRINGS),
 		  (char *)__entry->caller_ip)
 );
From: Dave Chinner <dchinner@redhat.com>

mainline inclusion
from mainline-v5.14-rc1
commit 9d110014205cb1129fa570d8de83d486fa199354
category: bugfix
bugzilla: 187526,https://gitee.com/openeuler/kernel/issues/I6WKVJ
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i...
--------------------------------
From the department of "generic/482 keeps on giving", we bring you another tail update race condition:
iclog:
    S1                      C1
    +-----------------------+-----------------------+
                             S2                  EOIC
Two checkpoints in a single iclog. One is complete, the other just contains the start record and overruns into a new iclog.
Timeline:
Before S1:      Cache flush, log tail = X
At S1:          Metadata stable, write start record and checkpoint
At C1:          Write commit record, set NEED_FUA
                Single iclog checkpoint, so no need for NEED_FLUSH
                Log tail still = X, so no need for NEED_FLUSH

After C1,
Before S2:      Cache flush, log tail = X
At S2:          Metadata stable, write start record and checkpoint
After S2:       Log tail moves to X+1
At EOIC:        End of iclog, more journal data to write
                Releases iclog
                Not a commit iclog, so no need for NEED_FLUSH
                Writes log tail X+1 into iclog.
At this point, the iclog has tail X+1 and NEED_FUA set. There has been no cache flush for the metadata between X and X+1, and the iclog writes the new tail permanently to the log. This is sufficient to violate on-disk metadata/journal ordering.

We have two options here. The first is to detect this case in some manner and ensure that the partial checkpoint write sets NEED_FLUSH when the iclog is already marked NEED_FUA and the log tail changes. This seems somewhat fragile and quite complex to get right, and reading the code would not make it obvious what underlying problem it is addressing.

The second option seems much cleaner to me, because it is derived directly from the requirements of the C1 commit record in the iclog. That is, when we write this commit record to the iclog, we've guaranteed that the metadata/data ordering is correct for tail update purposes. Hence if we only write the log tail into the iclog for the *first* commit record rather than the log tail at the last release, we guarantee that the log tail does not move past where the first commit record in the log expects it to be.
IOWs, taking the first option means that replay of C1 becomes dependent on future operations doing the right thing, not just the C1 checkpoint itself doing the right thing. This makes log recovery almost impossible to reason about because now we have to take into account what might or might not have happened in the future when looking at checkpoints in the log rather than just having to reconstruct the past...
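The chosen rule reduces to a first-write-wins capture of the tail. A minimal, hypothetical standalone sketch (not the XFS code) of that rule:

#include <stdint.h>
#include <stdio.h>

struct logbuf {
        uint64_t tail_lsn;      /* 0 means "not yet captured" */
};

static void capture_tail(struct logbuf *lb, uint64_t current_tail)
{
        if (!lb->tail_lsn)      /* only the first caller records the tail */
                lb->tail_lsn = current_tail;
}

int main(void)
{
        struct logbuf lb = { 0 };

        capture_tail(&lb, 100); /* first commit record: tail = X */
        capture_tail(&lb, 101); /* later release: tail moved to X+1, ignored */
        printf("tail written to log: %llu\n",
               (unsigned long long)lb.tail_lsn);        /* prints 100 */
        return 0;
}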
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Guo Xuenan <guoxuenan@huawei.com>
Reviewed-by: Yang Erkun <yangerkun@huawei.com>
Signed-off-by: Jialin Zhang <zhangjialin11@huawei.com>
---
 fs/xfs/xfs_log.c | 50 +++++++++++++++++++++++++++++++++++-------------
 1 file changed, 37 insertions(+), 13 deletions(-)
diff --git a/fs/xfs/xfs_log.c b/fs/xfs/xfs_log.c
index 7bf94b1d3dbd..18167294dd62 100644
--- a/fs/xfs/xfs_log.c
+++ b/fs/xfs/xfs_log.c
@@ -75,13 +75,12 @@ xlog_verify_iclog(
 STATIC void
 xlog_verify_tail_lsn(
 	struct xlog		*log,
-	struct xlog_in_core	*iclog,
-	xfs_lsn_t		tail_lsn);
+	struct xlog_in_core	*iclog);
 #else
 #define xlog_verify_dest_ptr(a,b)
 #define xlog_verify_grant_tail(a)
 #define xlog_verify_iclog(a,b,c)
-#define xlog_verify_tail_lsn(a,b,c)
+#define xlog_verify_tail_lsn(a,b)
 #endif
 
 STATIC int
@@ -523,12 +522,31 @@ xlog_state_shutdown_callbacks(
 
 /*
  * Flush iclog to disk if this is the last reference to the given iclog and the
- * it is in the WANT_SYNC state. If the caller passes in a non-zero
- * @old_tail_lsn and the current log tail does not match, there may be metadata
- * on disk that must be persisted before this iclog is written. To satisfy that
- * requirement, set the XLOG_ICL_NEED_FLUSH flag as a condition for writing this
- * iclog with the new log tail value.
+ * it is in the WANT_SYNC state.
+ *
+ * If the caller passes in a non-zero @old_tail_lsn and the current log tail
+ * does not match, there may be metadata on disk that must be persisted before
+ * this iclog is written. To satisfy that requirement, set the
+ * XLOG_ICL_NEED_FLUSH flag as a condition for writing this iclog with the new
+ * log tail value.
+ *
+ * If XLOG_ICL_NEED_FUA is already set on the iclog, we need to ensure that the
+ * log tail is updated correctly. NEED_FUA indicates that the iclog will be
+ * written to stable storage, and implies that a commit record is contained
+ * within the iclog. We need to ensure that the log tail does not move beyond
+ * the tail that the first commit record in the iclog ordered against, otherwise
+ * correct recovery of that checkpoint becomes dependent on future operations
+ * performed on this iclog.
+ *
+ * Hence if NEED_FUA is set and the current iclog tail lsn is empty, write the
+ * current tail into iclog. Once the iclog tail is set, future operations must
+ * not modify it, otherwise they potentially violate ordering constraints for
+ * the checkpoint commit that wrote the initial tail lsn value. The tail lsn in
+ * the iclog will get zeroed on activation of the iclog after sync, so we
+ * always capture the tail lsn on the iclog on the first NEED_FUA release
+ * regardless of the number of active reference counts on this iclog.
  */
+
 int
 xlog_state_release_iclog(
 	struct xlog		*log,
@@ -552,6 +570,10 @@ xlog_state_release_iclog(
 
 		if (old_tail_lsn && tail_lsn != old_tail_lsn)
 			iclog->ic_flags |= XLOG_ICL_NEED_FLUSH;
+
+		if ((iclog->ic_flags & XLOG_ICL_NEED_FUA) &&
+		    !iclog->ic_header.h_tail_lsn)
+			iclog->ic_header.h_tail_lsn = cpu_to_be64(tail_lsn);
 	}
 
 	last_ref = atomic_dec_and_test(&iclog->ic_refcnt);
@@ -576,8 +598,9 @@ xlog_state_release_iclog(
 	}
 
 	iclog->ic_state = XLOG_STATE_SYNCING;
-	iclog->ic_header.h_tail_lsn = cpu_to_be64(tail_lsn);
-	xlog_verify_tail_lsn(log, iclog, tail_lsn);
+	if (!iclog->ic_header.h_tail_lsn)
+		iclog->ic_header.h_tail_lsn = cpu_to_be64(tail_lsn);
+	xlog_verify_tail_lsn(log, iclog);
 	trace_xlog_iclog_syncing(iclog, _RET_IP_);
 
 	spin_unlock(&log->l_icloglock);
@@ -2553,6 +2576,7 @@ xlog_state_activate_iclog(
 	memset(iclog->ic_header.h_cycle_data, 0,
 	       sizeof(iclog->ic_header.h_cycle_data));
 	iclog->ic_header.h_lsn = 0;
+	iclog->ic_header.h_tail_lsn = 0;
 }
 
 /*
@@ -3581,10 +3605,10 @@ xlog_verify_grant_tail(
 STATIC void
 xlog_verify_tail_lsn(
 	struct xlog		*log,
-	struct xlog_in_core	*iclog,
-	xfs_lsn_t		tail_lsn)
+	struct xlog_in_core	*iclog)
 {
-	int	blocks;
+	xfs_lsn_t	tail_lsn = be64_to_cpu(iclog->ic_header.h_tail_lsn);
+	int		blocks;
 
 	if (CYCLE_LSN(tail_lsn) == log->l_prev_cycle) {
 		blocks =
From: Dave Chinner <dchinner@redhat.com>

mainline inclusion
from mainline-v5.17-rc6
commit 919edbadebe17a67193533f531c2920c03e40fa4
category: bugfix
bugzilla: 187526,https://gitee.com/openeuler/kernel/issues/I6WKVJ
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i...
--------------------------------
Jan Kara reported a performance regression in dbench that he bisected down to commit bad77c375e8d ("xfs: CIL checkpoint flushes caches unconditionally").
Whilst developing the journal flush/fua optimisations this cache flush was part of, it appeared to make a significant difference to performance. However, now that this patchset has settled and all the correctness issues have been fixed, there does not appear to be any significant performance benefit to asynchronous cache flushes.
In fact, the opposite is true on some storage types and workloads, where additional cache flushes that can occur from fsync heavy workloads have measurable and significant impact on overall throughput.
Local dbench testing shows little difference on dbench runs with sync vs async cache flushes on either fast or slow SSD storage, and no difference in streaming concurrent async transaction workloads like fs-mark.
Fast NVME storage.
From `dbench -t 30`, CIL scale:
clients         async                   sync
                BW      Latency         BW      Latency
1                935.18   0.855          915.64   0.903
8               2404.51   6.873         2341.77   6.511
16              3003.42   6.460         2931.57   6.529
32              3697.23   7.939         3596.28   7.894
128             7237.43  15.495         7217.74  11.588
512             5079.24  90.587         5167.08  95.822
fsmark, 32 threads, create w/ 64 byte xattr w/32k logbsize
        create          chown           unlink
async   1m41s           1m16s           2m03s
sync    1m40s           1m19s           1m54s
Slower SATA SSD storage:
From `dbench -t 30`, CIL scale:
clients         async                   sync
                BW      Latency         BW      Latency
1                 78.59  15.792           83.78  10.729
8                367.88  92.067          404.63  59.943
16               564.51  72.524          602.71  76.089
32               831.66 105.984          870.26 110.482
128             1659.76 102.969         1624.73  91.356
512             2135.91 223.054         2603.07 161.160
fsmark, 16 threads, create w/32k logbsize
        create          unlink
async   5m06s           4m15s
sync    5m00s           4m22s
And on Jan's test machine:
                 5.18-rc8-vanilla       5.18-rc8-patched
Amean     1       71.22 (  0.00%)        64.94 *  8.81%*
Amean     2       93.03 (  0.00%)        84.80 *  8.85%*
Amean     4      150.54 (  0.00%)       137.51 *  8.66%*
Amean     8      252.53 (  0.00%)       242.24 *  4.08%*
Amean    16      454.13 (  0.00%)       439.08 *  3.31%*
Amean    32      835.24 (  0.00%)       829.74 *  0.66%*
Amean    64     1740.59 (  0.00%)      1686.73 *  3.09%*
Performance and cache flush behaviour is restored to pre-regression levels.
As such, we can now consider the async cache flush mechanism an unnecessary exercise in premature optimisation and hence we can now remove it and the infrastructure it requires completely.
Fixes: bad77c375e8d ("xfs: CIL checkpoint flushes caches unconditionally")
Reported-and-tested-by: Jan Kara <jack@suse.cz>
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Guo Xuenan <guoxuenan@huawei.com>
Reviewed-by: Yang Erkun <yangerkun@huawei.com>
Signed-off-by: Jialin Zhang <zhangjialin11@huawei.com>
---
 fs/xfs/xfs_bio_io.c   | 35 -----------------------------------
 fs/xfs/xfs_linux.h    |  2 --
 fs/xfs/xfs_log.c      | 36 +++++++++++-------------------------
 fs/xfs/xfs_log_cil.c  | 42 +++++++++++++-----------------------------
 fs/xfs/xfs_log_priv.h |  3 +--
 5 files changed, 25 insertions(+), 93 deletions(-)
diff --git a/fs/xfs/xfs_bio_io.c b/fs/xfs/xfs_bio_io.c
index b20a9b639acb..e2148f2d5d6b 100644
--- a/fs/xfs/xfs_bio_io.c
+++ b/fs/xfs/xfs_bio_io.c
@@ -9,41 +9,6 @@ static inline unsigned int bio_max_vecs(unsigned int count)
 	return min_t(unsigned, howmany(count, PAGE_SIZE), BIO_MAX_PAGES);
 }
 
-static void
-xfs_flush_bdev_async_endio(
-	struct bio		*bio)
-{
-	complete(bio->bi_private);
-}
-
-/*
- * Submit a request for an async cache flush to run. If the request queue does
- * not require flush operations, just skip it altogether. If the caller needs
- * to wait for the flush completion at a later point in time, they must supply a
- * valid completion. This will be signalled when the flush completes. The
- * caller never sees the bio that is issued here.
- */
-void
-xfs_flush_bdev_async(
-	struct bio		*bio,
-	struct block_device	*bdev,
-	struct completion	*done)
-{
-	struct request_queue	*q = bdev->bd_disk->queue;
-
-	if (!test_bit(QUEUE_FLAG_WC, &q->queue_flags)) {
-		complete(done);
-		return;
-	}
-
-	bio_init(bio, NULL, 0);
-	bio_set_dev(bio, bdev);
-	bio->bi_opf = REQ_OP_WRITE | REQ_PREFLUSH | REQ_SYNC;
-	bio->bi_private = done;
-	bio->bi_end_io = xfs_flush_bdev_async_endio;
-
-	submit_bio(bio);
-}
 int
 xfs_rw_bdev(
 	struct block_device	*bdev,
diff --git a/fs/xfs/xfs_linux.h b/fs/xfs/xfs_linux.h
index 953d98bc4832..af6be9b9ccdf 100644
--- a/fs/xfs/xfs_linux.h
+++ b/fs/xfs/xfs_linux.h
@@ -196,8 +196,6 @@ static inline uint64_t howmany_64(uint64_t x, uint32_t y)
 
 int xfs_rw_bdev(struct block_device *bdev, sector_t sector, unsigned int count,
 		char *data, unsigned int op);
-void xfs_flush_bdev_async(struct bio *bio, struct block_device *bdev,
-		struct completion *done);
 
 #define ASSERT_ALWAYS(expr)	\
 	(likely(expr) ? (void)0 : assfail(NULL, #expr, __FILE__, __LINE__))
diff --git a/fs/xfs/xfs_log.c b/fs/xfs/xfs_log.c
index 18167294dd62..6e70143176fb 100644
--- a/fs/xfs/xfs_log.c
+++ b/fs/xfs/xfs_log.c
@@ -524,12 +524,6 @@ xlog_state_shutdown_callbacks(
  * Flush iclog to disk if this is the last reference to the given iclog and the
  * it is in the WANT_SYNC state.
  *
- * If the caller passes in a non-zero @old_tail_lsn and the current log tail
- * does not match, there may be metadata on disk that must be persisted before
- * this iclog is written. To satisfy that requirement, set the
- * XLOG_ICL_NEED_FLUSH flag as a condition for writing this iclog with the new
- * log tail value.
- *
  * If XLOG_ICL_NEED_FUA is already set on the iclog, we need to ensure that the
  * log tail is updated correctly. NEED_FUA indicates that the iclog will be
  * written to stable storage, and implies that a commit record is contained
@@ -546,12 +540,10 @@ xlog_state_shutdown_callbacks(
  * always capture the tail lsn on the iclog on the first NEED_FUA release
  * regardless of the number of active reference counts on this iclog.
  */
-
 int
 xlog_state_release_iclog(
 	struct xlog		*log,
-	struct xlog_in_core	*iclog,
-	xfs_lsn_t		old_tail_lsn)
+	struct xlog_in_core	*iclog)
 {
 	xfs_lsn_t		tail_lsn;
 	bool			last_ref;
@@ -562,18 +554,14 @@ xlog_state_release_iclog(
 	/*
 	 * Grabbing the current log tail needs to be atomic w.r.t. the writing
 	 * of the tail LSN into the iclog so we guarantee that the log tail does
-	 * not move between deciding if a cache flush is required and writing
-	 * the LSN into the iclog below.
+	 * not move between the first time we know that the iclog needs to be
+	 * made stable and when we eventually submit it.
 	 */
-	if (old_tail_lsn || iclog->ic_state == XLOG_STATE_WANT_SYNC) {
+	if ((iclog->ic_state == XLOG_STATE_WANT_SYNC ||
+	     (iclog->ic_flags & XLOG_ICL_NEED_FUA)) &&
+	    !iclog->ic_header.h_tail_lsn) {
 		tail_lsn = xlog_assign_tail_lsn(log->l_mp);
-
-		if (old_tail_lsn && tail_lsn != old_tail_lsn)
-			iclog->ic_flags |= XLOG_ICL_NEED_FLUSH;
-
-		if ((iclog->ic_flags & XLOG_ICL_NEED_FUA) &&
-		    !iclog->ic_header.h_tail_lsn)
-			iclog->ic_header.h_tail_lsn = cpu_to_be64(tail_lsn);
+		iclog->ic_header.h_tail_lsn = cpu_to_be64(tail_lsn);
 	}
 
 	last_ref = atomic_dec_and_test(&iclog->ic_refcnt);
@@ -598,8 +586,6 @@ xlog_state_release_iclog(
 	}
 
 	iclog->ic_state = XLOG_STATE_SYNCING;
-	if (!iclog->ic_header.h_tail_lsn)
-		iclog->ic_header.h_tail_lsn = cpu_to_be64(tail_lsn);
 	xlog_verify_tail_lsn(log, iclog);
 	trace_xlog_iclog_syncing(iclog, _RET_IP_);
@@ -868,7 +854,7 @@ xlog_force_iclog(
 	iclog->ic_flags |= XLOG_ICL_NEED_FLUSH | XLOG_ICL_NEED_FUA;
 	if (iclog->ic_state == XLOG_STATE_ACTIVE)
 		xlog_state_switch_iclogs(iclog->ic_log, iclog, 0);
-	return xlog_state_release_iclog(iclog->ic_log, iclog, 0);
+	return xlog_state_release_iclog(iclog->ic_log, iclog);
 }
 
 /*
@@ -2322,7 +2308,7 @@ xlog_write_copy_finish(
 		ASSERT(iclog->ic_state == XLOG_STATE_WANT_SYNC ||
 			xlog_is_shutdown(log));
 release_iclog:
-	error = xlog_state_release_iclog(log, iclog, 0);
+	error = xlog_state_release_iclog(log, iclog);
 	spin_unlock(&log->l_icloglock);
 	return error;
 }
@@ -2539,7 +2525,7 @@ xlog_write(
 
 	spin_lock(&log->l_icloglock);
 	xlog_state_finish_copy(log, iclog, record_cnt, data_cnt);
-	error = xlog_state_release_iclog(log, iclog, 0);
+	error = xlog_state_release_iclog(log, iclog);
 	spin_unlock(&log->l_icloglock);
 
 	return error;
@@ -2963,7 +2949,7 @@ xlog_state_get_iclog_space(
 	 * reference to the iclog.
 	 */
 	if (!atomic_add_unless(&iclog->ic_refcnt, -1, 1))
-		error = xlog_state_release_iclog(log, iclog, 0);
+		error = xlog_state_release_iclog(log, iclog);
 	spin_unlock(&log->l_icloglock);
 	if (error)
 		return error;
diff --git a/fs/xfs/xfs_log_cil.c b/fs/xfs/xfs_log_cil.c
index c2fa07909ea6..1022b4bde0da 100644
--- a/fs/xfs/xfs_log_cil.c
+++ b/fs/xfs/xfs_log_cil.c
@@ -705,11 +705,21 @@ xlog_cil_set_ctx_write_state(
 		 * The LSN we need to pass to the log items on transaction
 		 * commit is the LSN reported by the first log vector write, not
 		 * the commit lsn. If we use the commit record lsn then we can
-		 * move the tail beyond the grant write head.
+		 * move the grant write head beyond the tail LSN and overwrite
+		 * it.
 		 */
 		ctx->start_lsn = lsn;
 		wake_up_all(&cil->xc_start_wait);
 		spin_unlock(&cil->xc_push_lock);
+
+		/*
+		 * Make sure the metadata we are about to overwrite in the log
+		 * has been flushed to stable storage before this iclog is
+		 * issued.
+		 */
+		spin_lock(&cil->xc_log->l_icloglock);
+		iclog->ic_flags |= XLOG_ICL_NEED_FLUSH;
+		spin_unlock(&cil->xc_log->l_icloglock);
 		return;
 	}
@@ -888,10 +898,7 @@ xlog_cil_push_work(
 	struct xfs_trans_header thdr;
 	struct xfs_log_iovec	lhdr;
 	struct xfs_log_vec	lvhdr = { NULL };
-	xfs_lsn_t		preflush_tail_lsn;
 	xfs_csn_t		push_seq;
-	struct bio		bio;
-	DECLARE_COMPLETION_ONSTACK(bdev_flush);
 	bool			push_commit_stable;
 
 	new_ctx = xlog_cil_ctx_alloc();
@@ -961,23 +968,6 @@ xlog_cil_push_work(
 	list_add(&ctx->committing, &cil->xc_committing);
 	spin_unlock(&cil->xc_push_lock);
 
-	/*
-	 * The CIL is stable at this point - nothing new will be added to it
-	 * because we hold the flush lock exclusively. Hence we can now issue
-	 * a cache flush to ensure all the completed metadata in the journal we
-	 * are about to overwrite is on stable storage.
-	 *
-	 * Because we are issuing this cache flush before we've written the
-	 * tail lsn to the iclog, we can have metadata IO completions move the
-	 * tail forwards between the completion of this flush and the iclog
-	 * being written. In this case, we need to re-issue the cache flush
-	 * before the iclog write. To detect whether the log tail moves, sample
-	 * the tail LSN *before* we issue the flush.
-	 */
-	preflush_tail_lsn = atomic64_read(&log->l_tail_lsn);
-	xfs_flush_bdev_async(&bio, log->l_mp->m_ddev_targp->bt_bdev,
-				&bdev_flush);
-
 	/*
 	 * Pull all the log vectors off the items in the CIL, and remove the
 	 * items from the CIL. We don't need the CIL lock here because it's only
@@ -1054,12 +1044,6 @@ xlog_cil_push_work(
 	lvhdr.lv_iovecp = &lhdr;
 	lvhdr.lv_next = ctx->lv_chain;
 
-	/*
-	 * Before we format and submit the first iclog, we have to ensure that
-	 * the metadata writeback ordering cache flush is complete.
-	 */
-	wait_for_completion(&bdev_flush);
-
 	error = xlog_cil_write_chain(ctx, &lvhdr);
 	if (error)
 		goto out_abort_free_ticket;
@@ -1118,7 +1102,7 @@ xlog_cil_push_work(
 	if (push_commit_stable &&
 	    ctx->commit_iclog->ic_state == XLOG_STATE_ACTIVE)
 		xlog_state_switch_iclogs(log, ctx->commit_iclog, 0);
-	xlog_state_release_iclog(log, ctx->commit_iclog, preflush_tail_lsn);
+	xlog_state_release_iclog(log, ctx->commit_iclog);
 
 	/* Not safe to reference ctx now! */
 
@@ -1139,7 +1123,7 @@ xlog_cil_push_work(
 		return;
 	}
 	spin_lock(&log->l_icloglock);
-	xlog_state_release_iclog(log, ctx->commit_iclog, 0);
+	xlog_state_release_iclog(log, ctx->commit_iclog);
 	/* Not safe to reference ctx now! */
 	spin_unlock(&log->l_icloglock);
 }
diff --git a/fs/xfs/xfs_log_priv.h b/fs/xfs/xfs_log_priv.h
index 210b3f16e973..eee173bdc8f4 100644
--- a/fs/xfs/xfs_log_priv.h
+++ b/fs/xfs/xfs_log_priv.h
@@ -518,8 +518,7 @@ void	xfs_log_ticket_regrant(struct xlog *log, struct xlog_ticket *ticket);
 
 void	xlog_state_switch_iclogs(struct xlog *log, struct xlog_in_core *iclog,
 		int eventual_size);
-int	xlog_state_release_iclog(struct xlog *log, struct xlog_in_core *iclog,
-		xfs_lsn_t log_tail_lsn);
+int	xlog_state_release_iclog(struct xlog *log, struct xlog_in_core *iclog);
 
 /*
  * When we crack an atomic LSN, we sample it first so that the value will not
From: Brian Foster <bfoster@redhat.com>

mainline inclusion
from mainline-v5.16-rc5
commit 6191cf3ad59fda5901160633fef8e41b064a5246
category: bugfix
bugzilla: 187526,https://gitee.com/openeuler/kernel/issues/I6WKVJ
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i...
--------------------------------
The xfs_inodegc_stop() helper performs a high level flush of pending work on the percpu queues and then runs a cancel_work_sync() on each of the percpu work tasks to ensure all work has completed before returning. While cancel_work_sync() waits for wq tasks to complete, it does not guarantee work tasks have started. This means that the _stop() helper can queue and instantly cancel a wq task without having completed the associated work. This can be observed by tracepoint inspection of a simple "rm -f <file>; fsfreeze -f <mnt>" test:
xfs_destroy_inode: ... ino 0x83 ...
xfs_inode_set_need_inactive: ... ino 0x83 ...
xfs_inodegc_stop: ...
...
xfs_inodegc_start: ...
xfs_inodegc_worker: ...
xfs_inode_inactivating: ... ino 0x83 ...
The first few lines show that the inode is removed and need inactive state set, but the inactivation work has not completed before the inodegc mechanism stops. The inactivation doesn't actually occur until the fs is unfrozen and the gc mechanism starts back up. Note that this test requires fsfreeze to reproduce because xfs_freeze indirectly invokes xfs_fs_statfs(), which calls xfs_inodegc_flush().
When this occurs, the workqueue try_to_grab_pending() logic first tries to steal the pending bit, which does not succeed because the bit has been set by queue_work_on(). Subsequently, it checks for association of a pool workqueue from the work item under the pool lock. This association is set at the point a work item is queued and cleared when dequeued for processing. If the association exists, the work item is removed from the queue and cancel_work_sync() returns true. If the pwq association is cleared, the remove attempt assumes the task is busy and retries (eventually returning false to the caller after waiting for the work task to complete).
To avoid this race, we can flush each work item explicitly before cancel. However, since the _queue_all() already schedules each underlying work item, the workqueue level helpers are sufficient to achieve the same ordering effect. E.g., the inodegc enabled flag prevents scheduling any further work in the _stop() case. Use the drain_workqueue() helper in this particular case to make the intent a bit more self explanatory.
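A minimal, hypothetical kernel-module sketch (not part of this patch) of the distinction relied on here: drain_workqueue() waits for the queue to empty, so an already-queued item is guaranteed to have executed, whereas cancel_work_sync() may cancel a queued-but-not-started item without ever running it.

#include <linux/module.h>
#include <linux/workqueue.h>

static struct workqueue_struct *demo_wq;
static struct work_struct demo_work;

static void demo_work_fn(struct work_struct *work)
{
        pr_info("demo: work executed\n");
}

static int __init demo_init(void)
{
        demo_wq = alloc_workqueue("demo_wq", 0, 0);
        if (!demo_wq)
                return -ENOMEM;

        INIT_WORK(&demo_work, demo_work_fn);
        queue_work(demo_wq, &demo_work);

        /* Guarantees demo_work_fn has run before we proceed. */
        drain_workqueue(demo_wq);

        /* cancel_work_sync(&demo_work) here could instead have removed the
         * queued item before it ever started - the race described above. */
        return 0;
}

static void __exit demo_exit(void)
{
        destroy_workqueue(demo_wq);
}

module_init(demo_init);
module_exit(demo_exit);
MODULE_LICENSE("GPL");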
Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Guo Xuenan <guoxuenan@huawei.com>
Reviewed-by: Yang Erkun <yangerkun@huawei.com>
Signed-off-by: Jialin Zhang <zhangjialin11@huawei.com>
---
 fs/xfs/xfs_icache.c | 22 ++++------------------
 1 file changed, 4 insertions(+), 18 deletions(-)
diff --git a/fs/xfs/xfs_icache.c b/fs/xfs/xfs_icache.c
index 4f09e6991e4e..3e6fc0c90d20 100644
--- a/fs/xfs/xfs_icache.c
+++ b/fs/xfs/xfs_icache.c
@@ -1905,28 +1905,20 @@ xfs_inodegc_worker(
 }
 
 /*
- * Force all currently queued inode inactivation work to run immediately, and
- * wait for the work to finish. Two pass - queue all the work first pass, wait
- * for it in a second pass.
+ * Force all currently queued inode inactivation work to run immediately and
+ * wait for the work to finish.
  */
 void
 xfs_inodegc_flush(
 	struct xfs_mount	*mp)
 {
-	struct xfs_inodegc	*gc;
-	int			cpu;
-
 	if (!xfs_is_inodegc_enabled(mp))
 		return;
 
 	trace_xfs_inodegc_flush(mp, __return_address);
 
 	xfs_inodegc_queue_all(mp);
-
-	for_each_online_cpu(cpu) {
-		gc = per_cpu_ptr(mp->m_inodegc, cpu);
-		flush_work(&gc->work);
-	}
+	flush_workqueue(mp->m_inodegc_wq);
 }
 
 /*
@@ -1937,18 +1929,12 @@ void
 xfs_inodegc_stop(
 	struct xfs_mount	*mp)
 {
-	struct xfs_inodegc	*gc;
-	int			cpu;
-
 	if (!xfs_clear_inodegc_enabled(mp))
 		return;
 
 	xfs_inodegc_queue_all(mp);
+	drain_workqueue(mp->m_inodegc_wq);
 
-	for_each_online_cpu(cpu) {
-		gc = per_cpu_ptr(mp->m_inodegc, cpu);
-		cancel_work_sync(&gc->work);
-	}
 	trace_xfs_inodegc_stop(mp, __return_address);
 }
From: Dave Chinner <dchinner@redhat.com>

mainline inclusion
from mainline-v5.19-rc2
commit 5e672cd69f0a534a445df4372141fd0d1d00901d
category: bugfix
bugzilla: 187526,https://gitee.com/openeuler/kernel/issues/I6WKVJ
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i...
--------------------------------
The current blocking mechanism for pushing the inodegc queue out to disk can result in systems becoming unusable when there is a long running inodegc operation. This is because the statfs() implementation currently issues a blocking flush of the inodegc queue and a significant number of common system utilities will call statfs() to discover something about the underlying filesystem.
This can result in userspace operations getting stuck on inodegc progress, and when trying to remove a heavily reflinked file on slow storage with a full journal, this can result in delays measuring in hours.
Avoid this problem by adding a "push" function that expedites the flushing of the inodegc queue, but doesn't wait for it to complete.
Convert xfs_fs_statfs() and xfs_qm_scall_getquota() to use this mechanism so they don't block but still ensure that queued operations are expedited.
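A minimal, hypothetical userspace model of the push/flush split (the names and threading model are illustrative, not the XFS implementation): push only kicks the worker and returns; flush is push plus a wait for completion.

#include <pthread.h>
#include <stdio.h>

static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t cond = PTHREAD_COND_INITIALIZER;
static int pending, done;

static void *worker(void *arg)
{
        pthread_mutex_lock(&lock);
        while (!pending)
                pthread_cond_wait(&cond, &lock);
        pending = 0;
        /* ... the actual inactivation work would happen here ... */
        done = 1;
        pthread_cond_broadcast(&cond);
        pthread_mutex_unlock(&lock);
        return NULL;
}

static void gc_push(void)               /* expedite, never blocks */
{
        pthread_mutex_lock(&lock);
        pending = 1;
        pthread_cond_broadcast(&cond);
        pthread_mutex_unlock(&lock);
}

static void gc_flush(void)              /* push + wait for completion */
{
        gc_push();
        pthread_mutex_lock(&lock);
        while (!done)
                pthread_cond_wait(&cond, &lock);
        pthread_mutex_unlock(&lock);
}

int main(void)
{
        pthread_t t;

        pthread_create(&t, NULL, worker, NULL);
        gc_push();      /* statfs()-style caller: returns immediately */
        gc_flush();     /* quiesce-style caller: waits for the work */
        pthread_join(t, NULL);
        puts("flushed");
        return 0;
}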
Fixes: ab23a7768739 ("xfs: per-cpu deferred inode inactivation queues")
Reported-by: Chris Dunlop <chris@onthe.net.au>
Signed-off-by: Dave Chinner <dchinner@redhat.com>
[djwong: fix _getquota_next to use _inodegc_push too]
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Guo Xuenan <guoxuenan@huawei.com>
Reviewed-by: Yang Erkun <yangerkun@huawei.com>
Signed-off-by: Jialin Zhang <zhangjialin11@huawei.com>
---
 fs/xfs/xfs_icache.c      | 20 +++++++++++++++-----
 fs/xfs/xfs_icache.h      |  1 +
 fs/xfs/xfs_qm_syscalls.c |  9 ++++++---
 fs/xfs/xfs_super.c       |  7 +++++--
 fs/xfs/xfs_trace.h       |  1 +
 5 files changed, 28 insertions(+), 10 deletions(-)
diff --git a/fs/xfs/xfs_icache.c b/fs/xfs/xfs_icache.c
index 3e6fc0c90d20..43c93b4facbc 100644
--- a/fs/xfs/xfs_icache.c
+++ b/fs/xfs/xfs_icache.c
@@ -1905,19 +1905,29 @@ xfs_inodegc_worker(
 }
 
 /*
- * Force all currently queued inode inactivation work to run immediately and
- * wait for the work to finish.
+ * Expedite all pending inodegc work to run immediately. This does not wait for
+ * completion of the work.
  */
 void
-xfs_inodegc_flush(
+xfs_inodegc_push(
 	struct xfs_mount	*mp)
 {
 	if (!xfs_is_inodegc_enabled(mp))
 		return;
+	trace_xfs_inodegc_push(mp, __return_address);
+	xfs_inodegc_queue_all(mp);
+}
 
+/*
+ * Force all currently queued inode inactivation work to run immediately and
+ * wait for the work to finish.
+ */
+void
+xfs_inodegc_flush(
+	struct xfs_mount	*mp)
+{
+	xfs_inodegc_push(mp);
 	trace_xfs_inodegc_flush(mp, __return_address);
-
-	xfs_inodegc_queue_all(mp);
 	flush_workqueue(mp->m_inodegc_wq);
 }
diff --git a/fs/xfs/xfs_icache.h b/fs/xfs/xfs_icache.h
index 2e4cfddf8b8e..6cd180721659 100644
--- a/fs/xfs/xfs_icache.h
+++ b/fs/xfs/xfs_icache.h
@@ -76,6 +76,7 @@ void xfs_blockgc_stop(struct xfs_mount *mp);
 void xfs_blockgc_start(struct xfs_mount *mp);
 
 void xfs_inodegc_worker(struct work_struct *work);
+void xfs_inodegc_push(struct xfs_mount *mp);
 void xfs_inodegc_flush(struct xfs_mount *mp);
 void xfs_inodegc_stop(struct xfs_mount *mp);
 void xfs_inodegc_start(struct xfs_mount *mp);
diff --git a/fs/xfs/xfs_qm_syscalls.c b/fs/xfs/xfs_qm_syscalls.c
index b541dc3fac17..e429284f78c1 100644
--- a/fs/xfs/xfs_qm_syscalls.c
+++ b/fs/xfs/xfs_qm_syscalls.c
@@ -481,9 +481,12 @@ xfs_qm_scall_getquota(
 	struct xfs_dquot	*dqp;
 	int			error;
 
-	/* Flush inodegc work at the start of a quota reporting scan. */
+	/*
+	 * Expedite pending inodegc work at the start of a quota reporting
+	 * scan but don't block waiting for it to complete.
+	 */
 	if (id == 0)
-		xfs_inodegc_flush(mp);
+		xfs_inodegc_push(mp);
 
 	/*
 	 * Try to get the dquot. We don't want it allocated on disk, so don't
@@ -525,7 +528,7 @@ xfs_qm_scall_getquota_next(
 
 	/* Flush inodegc work at the start of a quota reporting scan. */
 	if (*id == 0)
-		xfs_inodegc_flush(mp);
+		xfs_inodegc_push(mp);
 
 	error = xfs_qm_dqget_next(mp, *id, type, &dqp);
 	if (error)
diff --git a/fs/xfs/xfs_super.c b/fs/xfs/xfs_super.c
index 04110183f3ed..8ba96ddedf3f 100644
--- a/fs/xfs/xfs_super.c
+++ b/fs/xfs/xfs_super.c
@@ -789,8 +789,11 @@ xfs_fs_statfs(
 	xfs_extlen_t		lsize;
 	int64_t			ffree;
 
-	/* Wait for whatever inactivations are in progress. */
-	xfs_inodegc_flush(mp);
+	/*
+	 * Expedite background inodegc but don't wait. We do not want to block
+	 * here waiting hours for a billion extent file to be truncated.
+	 */
+	xfs_inodegc_push(mp);
 
 	statp->f_type = XFS_SUPER_MAGIC;
 	statp->f_namelen = MAXNAMELEN - 1;
diff --git a/fs/xfs/xfs_trace.h b/fs/xfs/xfs_trace.h
index 2e90041fed8d..cdf6e23ec517 100644
--- a/fs/xfs/xfs_trace.h
+++ b/fs/xfs/xfs_trace.h
@@ -187,6 +187,7 @@ DEFINE_EVENT(xfs_fs_class, name,	\
 	TP_PROTO(struct xfs_mount *mp, void *caller_ip), \
 	TP_ARGS(mp, caller_ip))
 DEFINE_FS_EVENT(xfs_inodegc_flush);
+DEFINE_FS_EVENT(xfs_inodegc_push);
 DEFINE_FS_EVENT(xfs_inodegc_start);
 DEFINE_FS_EVENT(xfs_inodegc_stop);
 DEFINE_FS_EVENT(xfs_inodegc_queue);
From: Dave Chinner <dchinner@redhat.com>

mainline inclusion
from mainline-v5.19-rc2
commit 7cf2b0f9611b9971d663e1fc3206eeda3b902922
category: bugfix
bugzilla: 187526,https://gitee.com/openeuler/kernel/issues/I6WKVJ
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i...
--------------------------------
Currently inodegc work can sit queued on the per-cpu queue until the workqueue is either flushed or the queue reaches a depth that triggers work queuing (and later throttling). This means that we could queue work that waits for a long time for some other event to trigger flushing.

Hence instead of just queueing work at a specific depth, use a delayed work that queues the work at a bound time. We can still schedule the work immediately at a given depth, but we no longer need to worry about leaving a number of items on the list that won't get processed until external events prevail.
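A minimal, hypothetical sketch of this queueing rule using the delayed-work API (the threshold value and names are illustrative, not the XFS code):

#include <linux/module.h>
#include <linux/workqueue.h>

static struct workqueue_struct *gc_wq;
static struct delayed_work gc_work;
static int gc_items;

static void gc_work_fn(struct work_struct *work)
{
        pr_info("gc: processed %d items\n", gc_items);
        gc_items = 0;
}

static void gc_queue(void)
{
        unsigned long delay = 1;        /* bound: run within one jiffy */

        if (++gc_items >= 32)           /* hypothetical depth threshold */
                delay = 0;              /* deep queue: run immediately */
        mod_delayed_work(gc_wq, &gc_work, delay);
}

static int __init gc_init(void)
{
        gc_wq = alloc_workqueue("gc_wq", 0, 0);
        if (!gc_wq)
                return -ENOMEM;
        INIT_DELAYED_WORK(&gc_work, gc_work_fn);
        gc_queue();
        return 0;
}

static void __exit gc_exit(void)
{
        cancel_delayed_work_sync(&gc_work);
        destroy_workqueue(gc_wq);
}

module_init(gc_init);
module_exit(gc_exit);
MODULE_LICENSE("GPL");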
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Guo Xuenan <guoxuenan@huawei.com>
Reviewed-by: Yang Erkun <yangerkun@huawei.com>
Signed-off-by: Jialin Zhang <zhangjialin11@huawei.com>
---
 fs/xfs/xfs_icache.c | 36 ++++++++++++++++++++++--------------
 fs/xfs/xfs_mount.h  |  2 +-
 fs/xfs/xfs_super.c  |  2 +-
 3 files changed, 24 insertions(+), 16 deletions(-)
diff --git a/fs/xfs/xfs_icache.c b/fs/xfs/xfs_icache.c
index 43c93b4facbc..318330432297 100644
--- a/fs/xfs/xfs_icache.c
+++ b/fs/xfs/xfs_icache.c
@@ -447,7 +447,7 @@ xfs_inodegc_queue_all(
 	for_each_online_cpu(cpu) {
 		gc = per_cpu_ptr(mp->m_inodegc, cpu);
 		if (!llist_empty(&gc->list))
-			queue_work_on(cpu, mp->m_inodegc_wq, &gc->work);
+			mod_delayed_work_on(cpu, mp->m_inodegc_wq, &gc->work, 0);
 	}
 }
 
@@ -1874,8 +1874,8 @@ void
 xfs_inodegc_worker(
 	struct work_struct	*work)
 {
-	struct xfs_inodegc	*gc = container_of(work, struct xfs_inodegc,
-							work);
+	struct xfs_inodegc	*gc = container_of(to_delayed_work(work),
+						struct xfs_inodegc, work);
 	struct llist_node	*node = llist_del_all(&gc->list);
 	struct xfs_inode	*ip, *n;
 	unsigned int		nofs_flag;
@@ -2064,6 +2064,7 @@ xfs_inodegc_queue(
 	struct xfs_inodegc	*gc;
 	int			items;
 	unsigned int		shrinker_hits;
+	unsigned long		queue_delay = 1;
 
 	trace_xfs_inode_set_need_inactive(ip);
 	spin_lock(&ip->i_flags_lock);
@@ -2075,19 +2076,26 @@ xfs_inodegc_queue(
 	items = READ_ONCE(gc->items);
 	WRITE_ONCE(gc->items, items + 1);
 	shrinker_hits = READ_ONCE(gc->shrinker_hits);
-	put_cpu_ptr(gc);
 
-	if (!xfs_is_inodegc_enabled(mp))
+	/*
+	 * We queue the work while holding the current CPU so that the work
+	 * is scheduled to run on this CPU.
+	 */
+	if (!xfs_is_inodegc_enabled(mp)) {
+		put_cpu_ptr(gc);
 		return;
-
-	if (xfs_inodegc_want_queue_work(ip, items)) {
-		trace_xfs_inodegc_queue(mp, __return_address);
-		queue_work(mp->m_inodegc_wq, &gc->work);
 	}
 
+	if (xfs_inodegc_want_queue_work(ip, items))
+		queue_delay = 0;
+
+	trace_xfs_inodegc_queue(mp, __return_address);
+	mod_delayed_work(mp->m_inodegc_wq, &gc->work, queue_delay);
+	put_cpu_ptr(gc);
+
 	if (xfs_inodegc_want_flush_work(ip, items, shrinker_hits)) {
 		trace_xfs_inodegc_throttle(mp, __return_address);
-		flush_work(&gc->work);
+		flush_delayed_work(&gc->work);
 	}
 }
 
@@ -2104,7 +2112,7 @@ xfs_inodegc_cpu_dead(
 	unsigned int		count = 0;
 
 	dead_gc = per_cpu_ptr(mp->m_inodegc, dead_cpu);
-	cancel_work_sync(&dead_gc->work);
+	cancel_delayed_work_sync(&dead_gc->work);
 
 	if (llist_empty(&dead_gc->list))
 		return;
@@ -2123,12 +2131,12 @@ xfs_inodegc_cpu_dead(
 	llist_add_batch(first, last, &gc->list);
 	count += READ_ONCE(gc->items);
 	WRITE_ONCE(gc->items, count);
-	put_cpu_ptr(gc);
 
 	if (xfs_is_inodegc_enabled(mp)) {
 		trace_xfs_inodegc_queue(mp, __return_address);
-		queue_work(mp->m_inodegc_wq, &gc->work);
+		mod_delayed_work(mp->m_inodegc_wq, &gc->work, 0);
 	}
+	put_cpu_ptr(gc);
 }
 
 /*
@@ -2223,7 +2231,7 @@ xfs_inodegc_shrinker_scan(
 			unsigned int	h = READ_ONCE(gc->shrinker_hits);
 
 			WRITE_ONCE(gc->shrinker_hits, h + 1);
-			queue_work_on(cpu, mp->m_inodegc_wq, &gc->work);
+			mod_delayed_work_on(cpu, mp->m_inodegc_wq, &gc->work, 0);
 			no_items = false;
 		}
 	}
diff --git a/fs/xfs/xfs_mount.h b/fs/xfs/xfs_mount.h
index 3dc54ed4315a..4410d0fad492 100644
--- a/fs/xfs/xfs_mount.h
+++ b/fs/xfs/xfs_mount.h
@@ -60,7 +60,7 @@ struct xfs_error_cfg {
  */
 struct xfs_inodegc {
 	struct llist_head	list;
-	struct work_struct	work;
+	struct delayed_work	work;
 
 	/* approximate count of inodes in the list */
 	unsigned int		items;
diff --git a/fs/xfs/xfs_super.c b/fs/xfs/xfs_super.c
index 8ba96ddedf3f..6d8352ba121a 100644
--- a/fs/xfs/xfs_super.c
+++ b/fs/xfs/xfs_super.c
@@ -1104,7 +1104,7 @@ xfs_inodegc_init_percpu(
 		gc = per_cpu_ptr(mp->m_inodegc, cpu);
 		init_llist_head(&gc->list);
 		gc->items = 0;
-		INIT_WORK(&gc->work, xfs_inodegc_worker);
+		INIT_DELAYED_WORK(&gc->work, xfs_inodegc_worker);
 	}
 	return 0;
 }
From: "Darrick J. Wong" darrick.wong@oracle.com
mainline inclusion from mainline-v5.10-rc5 commit 3945ae03d822aa47584dd502ac024ae1e1eb9e2d category: bugfix bugzilla: 187526,https://gitee.com/openeuler/kernel/issues/I6WKVJ
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i...
--------------------------------
A couple of the superblock validation checks apply only to the kernel, so move them to xfs_fc_fill_super before we add the needsrepair "feature", which will prevent the kernel (but not xfsprogs) from mounting the filesystem. This also reduces the diff between kernel and userspace libxfs.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: Guo Xuenan <guoxuenan@huawei.com>
Reviewed-by: Yang Erkun <yangerkun@huawei.com>
Signed-off-by: Jialin Zhang <zhangjialin11@huawei.com>
---
 fs/xfs/libxfs/xfs_sb.c | 27 ---------------------------
 fs/xfs/xfs_super.c     | 32 ++++++++++++++++++++++++++++++++
 2 files changed, 32 insertions(+), 27 deletions(-)
diff --git a/fs/xfs/libxfs/xfs_sb.c b/fs/xfs/libxfs/xfs_sb.c
index 1296cff23a3b..19d957baa783 100644
--- a/fs/xfs/libxfs/xfs_sb.c
+++ b/fs/xfs/libxfs/xfs_sb.c
@@ -382,17 +382,6 @@ xfs_validate_sb_common(
 		return -EFSCORRUPTED;
 	}
 
-	/*
-	 * Until this is fixed only page-sized or smaller data blocks work.
-	 */
-	if (unlikely(sbp->sb_blocksize > PAGE_SIZE)) {
-		xfs_warn(mp,
-		"File system with blocksize %d bytes. "
-		"Only pagesize (%ld) or less will currently work.",
-				sbp->sb_blocksize, PAGE_SIZE);
-		return -ENOSYS;
-	}
-
 	/*
 	 * Currently only very few inode sizes are supported.
 	 */
@@ -408,22 +397,6 @@ xfs_validate_sb_common(
 		return -ENOSYS;
 	}
 
-	if (xfs_sb_validate_fsb_count(sbp, sbp->sb_dblocks) ||
-	    xfs_sb_validate_fsb_count(sbp, sbp->sb_rblocks)) {
-		xfs_warn(mp,
-		"file system too large to be mounted on this system.");
-		return -EFBIG;
-	}
-
-	/*
-	 * Don't touch the filesystem if a user tool thinks it owns the primary
-	 * superblock. mkfs doesn't clear the flag from secondary supers, so
-	 * we don't check them at all.
-	 */
-	if (XFS_BUF_ADDR(bp) == XFS_SB_DADDR && sbp->sb_inprogress) {
-		xfs_warn(mp, "Offline file system operation in progress!");
-		return -EFSCORRUPTED;
-	}
-
 	return 0;
 }
diff --git a/fs/xfs/xfs_super.c b/fs/xfs/xfs_super.c
index 6d8352ba121a..a79deea63a26 100644
--- a/fs/xfs/xfs_super.c
+++ b/fs/xfs/xfs_super.c
@@ -1550,6 +1550,38 @@ xfs_fc_fill_super(
 #endif
 	}
 
+	/*
+	 * Don't touch the filesystem if a user tool thinks it owns the primary
+	 * superblock. mkfs doesn't clear the flag from secondary supers, so
+	 * we don't check them at all.
+	 */
+	if (mp->m_sb.sb_inprogress) {
+		xfs_warn(mp, "Offline file system operation in progress!");
+		error = -EFSCORRUPTED;
+		goto out_free_sb;
+	}
+
+	/*
+	 * Until this is fixed only page-sized or smaller data blocks work.
+	 */
+	if (mp->m_sb.sb_blocksize > PAGE_SIZE) {
+		xfs_warn(mp,
+		"File system with blocksize %d bytes. "
+		"Only pagesize (%ld) or less will currently work.",
+				mp->m_sb.sb_blocksize, PAGE_SIZE);
+		error = -ENOSYS;
+		goto out_free_sb;
+	}
+
+	/* Ensure this filesystem fits in the page cache limits */
+	if (xfs_sb_validate_fsb_count(&mp->m_sb, mp->m_sb.sb_dblocks) ||
+	    xfs_sb_validate_fsb_count(&mp->m_sb, mp->m_sb.sb_rblocks)) {
+		xfs_warn(mp,
+		"file system too large to be mounted on this system.");
+		error = -EFBIG;
+		goto out_free_sb;
+	}
+
 	/*
 	 * XFS block mappings use 54 bits to store the logical block offset.
 	 * This should suffice to handle the maximum file size that the VFS
From: Dave Chinner <dchinner@redhat.com>

mainline inclusion
from mainline-v5.14-rc4
commit 04fcad80cd068731a779fb442f78234732683755
category: bugfix
bugzilla: 187526,https://gitee.com/openeuler/kernel/issues/I6WKVJ
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i...
--------------------------------
Introduce a helper function xfs_buf_daddr() to extract the disk address of the buffer from the struct xfs_buf. This will replace direct accesses to bp->b_bn and bp->b_maps[0].bm_bn, as well as the XFS_BUF_ADDR() macro.
This patch introduces the helper function and replaces all uses of XFS_BUF_ADDR() as this is just a simple sed replacement.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Guo Xuenan <guoxuenan@huawei.com>
Reviewed-by: Yang Erkun <yangerkun@huawei.com>
Signed-off-by: Jialin Zhang <zhangjialin11@huawei.com>
---
 fs/xfs/libxfs/xfs_alloc_btree.c    |  2 +-
 fs/xfs/libxfs/xfs_attr.c           |  4 ++--
 fs/xfs/libxfs/xfs_bmap.c           |  4 ++--
 fs/xfs/libxfs/xfs_bmap_btree.c     |  2 +-
 fs/xfs/libxfs/xfs_btree.c          | 10 +++++-----
 fs/xfs/libxfs/xfs_ialloc_btree.c   |  2 +-
 fs/xfs/libxfs/xfs_inode_buf.c      |  2 +-
 fs/xfs/libxfs/xfs_refcount_btree.c |  2 +-
 fs/xfs/libxfs/xfs_rmap_btree.c     |  2 +-
 fs/xfs/libxfs/xfs_sb.c             |  2 +-
 fs/xfs/scrub/btree.c               |  4 ++--
 fs/xfs/xfs_attr_inactive.c         |  2 +-
 fs/xfs/xfs_buf.c                   |  2 +-
 fs/xfs/xfs_buf.h                   |  6 +++++-
 fs/xfs/xfs_trans_buf.c             |  2 +-
 15 files changed, 26 insertions(+), 22 deletions(-)
diff --git a/fs/xfs/libxfs/xfs_alloc_btree.c b/fs/xfs/libxfs/xfs_alloc_btree.c
index a43e4c50e69b..ef2b8ee87b8d 100644
--- a/fs/xfs/libxfs/xfs_alloc_btree.c
+++ b/fs/xfs/libxfs/xfs_alloc_btree.c
@@ -90,7 +90,7 @@ xfs_allocbt_free_block(
 	xfs_agblock_t		bno;
 	int			error;
 
-	bno = xfs_daddr_to_agbno(cur->bc_mp, XFS_BUF_ADDR(bp));
+	bno = xfs_daddr_to_agbno(cur->bc_mp, xfs_buf_daddr(bp));
 	error = xfs_alloc_put_freelist(cur->bc_tp, agbp, NULL, bno, 1);
 	if (error)
 		return error;
diff --git a/fs/xfs/libxfs/xfs_attr.c b/fs/xfs/libxfs/xfs_attr.c
index cc016776d21e..6da91abb129b 100644
--- a/fs/xfs/libxfs/xfs_attr.c
+++ b/fs/xfs/libxfs/xfs_attr.c
@@ -1313,7 +1313,7 @@ xfs_attr_fillstate(xfs_da_state_t *state)
 	ASSERT((path->active >= 0) && (path->active < XFS_DA_NODE_MAXDEPTH));
 	for (blk = path->blk, level = 0; level < path->active; blk++, level++) {
 		if (blk->bp) {
-			blk->disk_blkno = XFS_BUF_ADDR(blk->bp);
+			blk->disk_blkno = xfs_buf_daddr(blk->bp);
 			blk->bp = NULL;
 		} else {
 			blk->disk_blkno = 0;
@@ -1328,7 +1328,7 @@ xfs_attr_fillstate(xfs_da_state_t *state)
 	ASSERT((path->active >= 0) && (path->active < XFS_DA_NODE_MAXDEPTH));
 	for (blk = path->blk, level = 0; level < path->active; blk++, level++) {
 		if (blk->bp) {
-			blk->disk_blkno = XFS_BUF_ADDR(blk->bp);
+			blk->disk_blkno = xfs_buf_daddr(blk->bp);
 			blk->bp = NULL;
 		} else {
 			blk->disk_blkno = 0;
diff --git a/fs/xfs/libxfs/xfs_bmap.c b/fs/xfs/libxfs/xfs_bmap.c
index ea5a9a93948c..243099a4f647 100644
--- a/fs/xfs/libxfs/xfs_bmap.c
+++ b/fs/xfs/libxfs/xfs_bmap.c
@@ -239,7 +239,7 @@ xfs_bmap_get_bp(
 	for (i = 0; i < XFS_BTREE_MAXLEVELS; i++) {
 		if (!cur->bc_bufs[i])
 			break;
-		if (XFS_BUF_ADDR(cur->bc_bufs[i]) == bno)
+		if (xfs_buf_daddr(cur->bc_bufs[i]) == bno)
 			return cur->bc_bufs[i];
 	}
 
@@ -248,7 +248,7 @@ xfs_bmap_get_bp(
 		struct xfs_buf_log_item	*bip = (struct xfs_buf_log_item *)lip;
 
 		if (bip->bli_item.li_type == XFS_LI_BUF &&
-		    XFS_BUF_ADDR(bip->bli_buf) == bno)
+		    xfs_buf_daddr(bip->bli_buf) == bno)
 			return bip->bli_buf;
 	}
 
diff --git a/fs/xfs/libxfs/xfs_bmap_btree.c b/fs/xfs/libxfs/xfs_bmap_btree.c
index c71741e2857c..b48b38c78b55 100644
--- a/fs/xfs/libxfs/xfs_bmap_btree.c
+++ b/fs/xfs/libxfs/xfs_bmap_btree.c
@@ -282,7 +282,7 @@ xfs_bmbt_free_block(
 	struct xfs_mount	*mp = cur->bc_mp;
 	struct xfs_inode	*ip = cur->bc_ino.ip;
 	struct xfs_trans	*tp = cur->bc_tp;
-	xfs_fsblock_t		fsbno = XFS_DADDR_TO_FSB(mp, XFS_BUF_ADDR(bp));
+	xfs_fsblock_t		fsbno = XFS_DADDR_TO_FSB(mp, xfs_buf_daddr(bp));
 	struct xfs_owner_info	oinfo;
 
 	xfs_rmap_ino_bmbt_owner(&oinfo, ip->i_ino, cur->bc_ino.whichfork);
diff --git a/fs/xfs/libxfs/xfs_btree.c b/fs/xfs/libxfs/xfs_btree.c
index 7c4d3fd47950..4102c7245b2c 100644
--- a/fs/xfs/libxfs/xfs_btree.c
+++ b/fs/xfs/libxfs/xfs_btree.c
@@ -423,7 +423,7 @@ xfs_btree_dup_cursor(
 		bp = cur->bc_bufs[i];
 		if (bp) {
 			error = xfs_trans_read_buf(mp, tp, mp->m_ddev_targp,
-						   XFS_BUF_ADDR(bp), mp->m_bsize,
+						   xfs_buf_daddr(bp), mp->m_bsize,
 						   0, &bp,
 						   cur->bc_ops->buf_ops);
 			if (error) {
@@ -1195,10 +1195,10 @@ xfs_btree_buf_to_ptr(
 {
 	if (cur->bc_flags & XFS_BTREE_LONG_PTRS)
 		ptr->l = cpu_to_be64(XFS_DADDR_TO_FSB(cur->bc_mp,
-					XFS_BUF_ADDR(bp)));
+					xfs_buf_daddr(bp)));
 	else {
 		ptr->s = cpu_to_be32(xfs_daddr_to_agbno(cur->bc_mp,
-					XFS_BUF_ADDR(bp)));
+					xfs_buf_daddr(bp)));
 	}
 }
 
@@ -1742,7 +1742,7 @@ xfs_btree_lookup_get_block(
 	error = xfs_btree_ptr_to_daddr(cur, pp, &daddr);
 	if (error)
 		return error;
-	if (bp && XFS_BUF_ADDR(bp) == daddr) {
+	if (bp && xfs_buf_daddr(bp) == daddr) {
 		*blkp = XFS_BUF_TO_BLOCK(bp);
 		return 0;
 	}
@@ -4516,7 +4516,7 @@ xfs_btree_sblock_verify(
 		return __this_address;
 
 	/* sibling pointer verification */
-	agno = xfs_daddr_to_agno(mp, XFS_BUF_ADDR(bp));
+	agno = xfs_daddr_to_agno(mp, xfs_buf_daddr(bp));
 	if (block->bb_u.s.bb_leftsib != cpu_to_be32(NULLAGBLOCK) &&
 	    !xfs_verify_agbno(mp, agno, be32_to_cpu(block->bb_u.s.bb_leftsib)))
 		return __this_address;
diff --git a/fs/xfs/libxfs/xfs_ialloc_btree.c b/fs/xfs/libxfs/xfs_ialloc_btree.c
index cc919a2ee870..5a0e4d70f0b4 100644
--- a/fs/xfs/libxfs/xfs_ialloc_btree.c
+++ b/fs/xfs/libxfs/xfs_ialloc_btree.c
@@ -156,7 +156,7 @@ __xfs_inobt_free_block(
 {
 	xfs_inobt_mod_blockcount(cur, -1);
 	return xfs_free_extent(cur->bc_tp,
-			XFS_DADDR_TO_FSB(cur->bc_mp, XFS_BUF_ADDR(bp)), 1,
+			XFS_DADDR_TO_FSB(cur->bc_mp, xfs_buf_daddr(bp)), 1,
 			&XFS_RMAP_OINFO_INOBT, resv);
 }
 
diff --git a/fs/xfs/libxfs/xfs_inode_buf.c b/fs/xfs/libxfs/xfs_inode_buf.c
index 3e0f5741d2d3..4b247577fdc6 100644
--- a/fs/xfs/libxfs/xfs_inode_buf.c
+++ b/fs/xfs/libxfs/xfs_inode_buf.c
@@ -48,7 +48,7 @@ xfs_inode_buf_verify(
 	/*
 	 * Validate the magic number and version of every inode in the buffer
 	 */
-	agno = xfs_daddr_to_agno(mp, XFS_BUF_ADDR(bp));
+	agno = xfs_daddr_to_agno(mp, xfs_buf_daddr(bp));
 	ni = XFS_BB_TO_FSB(mp, bp->b_length) * mp->m_sb.sb_inopblock;
 	for (i = 0; i < ni; i++) {
 		int		di_ok;
diff --git a/fs/xfs/libxfs/xfs_refcount_btree.c b/fs/xfs/libxfs/xfs_refcount_btree.c
index a6ac60ae9421..b2e0407c0068 100644
--- a/fs/xfs/libxfs/xfs_refcount_btree.c
+++ b/fs/xfs/libxfs/xfs_refcount_btree.c
@@ -102,7 +102,7 @@ xfs_refcountbt_free_block(
 	struct xfs_mount	*mp = cur->bc_mp;
 	struct xfs_buf		*agbp = cur->bc_ag.agbp;
 	struct xfs_agf		*agf = agbp->b_addr;
-	xfs_fsblock_t		fsbno = XFS_DADDR_TO_FSB(mp, XFS_BUF_ADDR(bp));
+	xfs_fsblock_t		fsbno = XFS_DADDR_TO_FSB(mp, xfs_buf_daddr(bp));
 	int			error;
 
 	trace_xfs_refcountbt_free_block(cur->bc_mp, cur->bc_ag.agno,
diff --git a/fs/xfs/libxfs/xfs_rmap_btree.c b/fs/xfs/libxfs/xfs_rmap_btree.c
index 9f5bcbd834c3..ab06af29d9db 100644
--- a/fs/xfs/libxfs/xfs_rmap_btree.c
+++ b/fs/xfs/libxfs/xfs_rmap_btree.c
@@ -124,7 +124,7 @@ xfs_rmapbt_free_block(
 	xfs_agblock_t		bno;
 	int			error;
 
-	bno = xfs_daddr_to_agbno(cur->bc_mp, XFS_BUF_ADDR(bp));
+	bno = xfs_daddr_to_agbno(cur->bc_mp, xfs_buf_daddr(bp));
 	trace_xfs_rmapbt_free_block(cur->bc_mp, cur->bc_ag.agno,
 			bno, 1);
 	be32_add_cpu(&agf->agf_rmap_blocks, -1);
diff --git a/fs/xfs/libxfs/xfs_sb.c b/fs/xfs/libxfs/xfs_sb.c
index 19d957baa783..836bbcdc88bd 100644
--- a/fs/xfs/libxfs/xfs_sb.c
+++ b/fs/xfs/libxfs/xfs_sb.c
@@ -156,7 +156,7 @@ xfs_validate_sb_write(
 	 * secondary superblocks, so allow this usage to continue because
 	 * we never read counters from such superblocks.
 	 */
-	if (XFS_BUF_ADDR(bp) == XFS_SB_DADDR && !sbp->sb_inprogress &&
+	if (xfs_buf_daddr(bp) == XFS_SB_DADDR && !sbp->sb_inprogress &&
 	    (sbp->sb_fdblocks > sbp->sb_dblocks ||
 	     !xfs_verify_icount(mp, sbp->sb_icount) ||
 	     sbp->sb_ifree > sbp->sb_icount)) {
diff --git a/fs/xfs/scrub/btree.c b/fs/xfs/scrub/btree.c
index 043984245a6d..5c962b9d80e0 100644
--- a/fs/xfs/scrub/btree.c
+++ b/fs/xfs/scrub/btree.c
@@ -435,12 +435,12 @@ xchk_btree_check_owner(
 		if (!co)
 			return -ENOMEM;
 		co->level = level;
-		co->daddr = XFS_BUF_ADDR(bp);
+		co->daddr = xfs_buf_daddr(bp);
 		list_add_tail(&co->list, &bs->to_check);
 		return 0;
 	}
 
-	return xchk_btree_check_block_owner(bs, level, XFS_BUF_ADDR(bp));
+	return xchk_btree_check_block_owner(bs, level, xfs_buf_daddr(bp));
 }
 
 /*
diff --git a/fs/xfs/xfs_attr_inactive.c b/fs/xfs/xfs_attr_inactive.c
index 0be9a3567c95..d9c7ff1469c8 100644
--- a/fs/xfs/xfs_attr_inactive.c
+++ b/fs/xfs/xfs_attr_inactive.c
@@ -178,7 +178,7 @@ xfs_attr3_node_inactive(
 		return error;
 
 	/* save for re-read later */
-	child_blkno = XFS_BUF_ADDR(child_bp);
+	child_blkno = xfs_buf_daddr(child_bp);
 
 	/*
	 * Invalidate the subtree, however we have to.
diff --git a/fs/xfs/xfs_buf.c b/fs/xfs/xfs_buf.c
index 4cf07ffd9edf..0bdf1811db15 100644
--- a/fs/xfs/xfs_buf.c
+++ b/fs/xfs/xfs_buf.c
@@ -1402,7 +1402,7 @@ xfs_buf_ioerror_alert(
 {
 	xfs_buf_alert_ratelimited(bp, "XFS: metadata IO error",
 		"metadata I/O error in \"%pS\" at daddr 0x%llx len %d error %d",
-			func, (uint64_t)XFS_BUF_ADDR(bp),
+			func, (uint64_t)xfs_buf_daddr(bp),
 			bp->b_length, -bp->b_error);
 }
 
diff --git a/fs/xfs/xfs_buf.h b/fs/xfs/xfs_buf.h
index 6ab044cea75d..200793275c36 100644
--- a/fs/xfs/xfs_buf.h
+++ b/fs/xfs/xfs_buf.h
@@ -305,9 +305,13 @@ extern void xfs_buf_terminate(void);
  * In future, uncached buffers will pass the block number directly to the io
  * request function and hence these macros will go away at that point.
  */
-#define XFS_BUF_ADDR(bp)		((bp)->b_maps[0].bm_bn)
 #define XFS_BUF_SET_ADDR(bp, bno)	((bp)->b_maps[0].bm_bn = (xfs_daddr_t)(bno))
 
+static inline xfs_daddr_t xfs_buf_daddr(struct xfs_buf *bp)
+{
+	return bp->b_maps[0].bm_bn;
+}
+
 void xfs_buf_set_ref(struct xfs_buf *bp, int lru_ref);
 
 /*
diff --git a/fs/xfs/xfs_trans_buf.c b/fs/xfs/xfs_trans_buf.c
index 342817e15ada..ae645a80ded0 100644
--- a/fs/xfs/xfs_trans_buf.c
+++ b/fs/xfs/xfs_trans_buf.c
@@ -38,7 +38,7 @@ xfs_trans_buf_item_match(
 		blip = (struct xfs_buf_log_item *)lip;
 		if (blip->bli_item.li_type == XFS_LI_BUF &&
 		    blip->bli_buf->b_target == target &&
-		    XFS_BUF_ADDR(blip->bli_buf) == map[0].bm_bn &&
+		    xfs_buf_daddr(blip->bli_buf) == map[0].bm_bn &&
 		    blip->bli_buf->b_length == len) {
 			ASSERT(blip->bli_buf->b_map_count == nmaps);
 			return blip->bli_buf;
From: Dave Chinner <dchinner@redhat.com>

mainline inclusion
from mainline-v5.18-rc2
commit dc04db2aa7c9307e740d6d0e173085301c173b1a
category: bugfix
bugzilla: 187526,https://gitee.com/openeuler/kernel/issues/I6WKVJ
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i...
--------------------------------
To catch the obvious graph cycle problem and hence potential endless looping.
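The core of the check can be illustrated with a small, hypothetical standalone sketch (not the XFS code): a sibling pointer that refers back to the block itself forms a trivial cycle, so a walker following siblings would loop forever.

#include <stdint.h>
#include <stdio.h>

struct block {
        uint64_t addr;          /* this block's own address */
        uint64_t leftsib;
        uint64_t rightsib;
};

#define NULLBLOCK ((uint64_t)-1)

static int check_siblings(const struct block *b)
{
        if ((b->leftsib != NULLBLOCK && b->leftsib == b->addr) ||
            (b->rightsib != NULLBLOCK && b->rightsib == b->addr))
                return -1;      /* self-referencing sibling: corrupt */
        return 0;
}

int main(void)
{
        struct block bad = { .addr = 7, .leftsib = NULLBLOCK, .rightsib = 7 };

        printf("check: %d\n", check_siblings(&bad));    /* prints -1 */
        return 0;
}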
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Dave Chinner <david@fromorbit.com>
Signed-off-by: Guo Xuenan <guoxuenan@huawei.com>
Reviewed-by: Yang Erkun <yangerkun@huawei.com>
Signed-off-by: Jialin Zhang <zhangjialin11@huawei.com>
---
 fs/xfs/libxfs/xfs_btree.c | 140 ++++++++++++++++++++++++++++----------
 1 file changed, 105 insertions(+), 35 deletions(-)
diff --git a/fs/xfs/libxfs/xfs_btree.c b/fs/xfs/libxfs/xfs_btree.c
index 4102c7245b2c..ab29c4135148 100644
--- a/fs/xfs/libxfs/xfs_btree.c
+++ b/fs/xfs/libxfs/xfs_btree.c
@@ -50,6 +50,52 @@ xfs_btree_magic(
 	return magic;
 }
 
+static xfs_failaddr_t
+xfs_btree_check_lblock_siblings(
+	struct xfs_mount	*mp,
+	struct xfs_btree_cur	*cur,
+	int			level,
+	xfs_fsblock_t		fsb,
+	xfs_fsblock_t		sibling)
+{
+	if (sibling == NULLFSBLOCK)
+		return NULL;
+	if (sibling == fsb)
+		return __this_address;
+	if (level >= 0) {
+		if (!xfs_btree_check_lptr(cur, sibling, level + 1))
+			return __this_address;
+	} else {
+		if (!xfs_verify_fsbno(mp, sibling))
+			return __this_address;
+	}
+
+	return NULL;
+}
+
+static xfs_failaddr_t
+xfs_btree_check_sblock_siblings(
+	struct xfs_mount	*mp,
+	struct xfs_btree_cur	*cur,
+	int			level,
+	xfs_agnumber_t		agno,
+	xfs_agblock_t		agbno,
+	xfs_agblock_t		sibling)
+{
+	if (sibling == NULLAGBLOCK)
+		return NULL;
+	if (sibling == agbno)
+		return __this_address;
+	if (level >= 0) {
+		if (!xfs_btree_check_sptr(cur, sibling, level + 1))
+			return __this_address;
+	} else {
+		if (!xfs_verify_agbno(mp, agno, sibling))
+			return __this_address;
+	}
+	return NULL;
+}
+
 /*
  * Check a long btree block header. Return the address of the failing check,
  * or NULL if everything is ok.
@@ -64,6 +110,8 @@ __xfs_btree_check_lblock(
 	struct xfs_mount	*mp = cur->bc_mp;
 	xfs_btnum_t		btnum = cur->bc_btnum;
 	int			crc = xfs_sb_version_hascrc(&mp->m_sb);
+	xfs_failaddr_t		fa;
+	xfs_fsblock_t		fsb = NULLFSBLOCK;
 
 	if (crc) {
 		if (!uuid_equal(&block->bb_u.l.bb_uuid, &mp->m_sb.sb_meta_uuid))
@@ -82,16 +130,16 @@ __xfs_btree_check_lblock(
 	if (be16_to_cpu(block->bb_numrecs) >
 	    cur->bc_ops->get_maxrecs(cur, level))
 		return __this_address;
-	if (block->bb_u.l.bb_leftsib != cpu_to_be64(NULLFSBLOCK) &&
-	    !xfs_btree_check_lptr(cur, be64_to_cpu(block->bb_u.l.bb_leftsib),
-			level + 1))
-		return __this_address;
-	if (block->bb_u.l.bb_rightsib != cpu_to_be64(NULLFSBLOCK) &&
-	    !xfs_btree_check_lptr(cur, be64_to_cpu(block->bb_u.l.bb_rightsib),
-			level + 1))
-		return __this_address;
 
-	return NULL;
+	if (bp)
+		fsb = XFS_DADDR_TO_FSB(mp, xfs_buf_daddr(bp));
+
+	fa = xfs_btree_check_lblock_siblings(mp, cur, level, fsb,
+			be64_to_cpu(block->bb_u.l.bb_leftsib));
+	if (!fa)
+		fa = xfs_btree_check_lblock_siblings(mp, cur, level, fsb,
+				be64_to_cpu(block->bb_u.l.bb_rightsib));
+	return fa;
 }
 
 /* Check a long btree block header. */
@@ -129,6 +177,9 @@ __xfs_btree_check_sblock(
 	struct xfs_mount	*mp = cur->bc_mp;
 	xfs_btnum_t		btnum = cur->bc_btnum;
 	int			crc = xfs_sb_version_hascrc(&mp->m_sb);
+	xfs_failaddr_t		fa;
+	xfs_agblock_t		agbno = NULLAGBLOCK;
+	xfs_agnumber_t		agno = NULLAGNUMBER;
 
 	if (crc) {
 		if (!uuid_equal(&block->bb_u.s.bb_uuid, &mp->m_sb.sb_meta_uuid))
@@ -145,16 +196,18 @@ __xfs_btree_check_sblock(
 	if (be16_to_cpu(block->bb_numrecs) >
 	    cur->bc_ops->get_maxrecs(cur, level))
 		return __this_address;
-	if (block->bb_u.s.bb_leftsib != cpu_to_be32(NULLAGBLOCK) &&
-	    !xfs_btree_check_sptr(cur, be32_to_cpu(block->bb_u.s.bb_leftsib),
-			level + 1))
-		return __this_address;
-	if (block->bb_u.s.bb_rightsib != cpu_to_be32(NULLAGBLOCK) &&
-	    !xfs_btree_check_sptr(cur, be32_to_cpu(block->bb_u.s.bb_rightsib),
-			level + 1))
-		return __this_address;
 
-	return NULL;
+	if (bp) {
+		agbno = xfs_daddr_to_agbno(mp, xfs_buf_daddr(bp));
+		agno = xfs_daddr_to_agno(mp, xfs_buf_daddr(bp));
+	}
+
+	fa = xfs_btree_check_sblock_siblings(mp, cur, level, agno, agbno,
+			be32_to_cpu(block->bb_u.s.bb_leftsib));
+	if (!fa)
+		fa = xfs_btree_check_sblock_siblings(mp, cur, level, agno,
+				agbno, be32_to_cpu(block->bb_u.s.bb_rightsib));
+	return fa;
 }
 
 /* Check a short btree block header. */
@@ -4281,6 +4334,21 @@ xfs_btree_visit_block(
 	if (xfs_btree_ptr_is_null(cur, &rptr))
 		return -ENOENT;
 
+	/*
+	 * We only visit blocks once in this walk, so we have to avoid the
+	 * internal xfs_btree_lookup_get_block() optimisation where it will
+	 * return the same block without checking if the right sibling points
+	 * back to us and creates a cyclic reference in the btree.
+	 */
+	if (cur->bc_flags & XFS_BTREE_LONG_PTRS) {
+		if (be64_to_cpu(rptr.l) == XFS_DADDR_TO_FSB(cur->bc_mp,
+							xfs_buf_daddr(bp)))
+			return -EFSCORRUPTED;
+	} else {
+		if (be32_to_cpu(rptr.s) == xfs_daddr_to_agbno(cur->bc_mp,
+							xfs_buf_daddr(bp)))
+			return -EFSCORRUPTED;
+	}
 	return xfs_btree_lookup_get_block(cur, level, &rptr, &block);
 }
@@ -4455,20 +4523,21 @@ xfs_btree_lblock_verify( { struct xfs_mount *mp = bp->b_mount; struct xfs_btree_block *block = XFS_BUF_TO_BLOCK(bp); + xfs_fsblock_t fsb; + xfs_failaddr_t fa;
/* numrecs verification */ if (be16_to_cpu(block->bb_numrecs) > max_recs) return __this_address;
/* sibling pointer verification */ - if (block->bb_u.l.bb_leftsib != cpu_to_be64(NULLFSBLOCK) && - !xfs_verify_fsbno(mp, be64_to_cpu(block->bb_u.l.bb_leftsib))) - return __this_address; - if (block->bb_u.l.bb_rightsib != cpu_to_be64(NULLFSBLOCK) && - !xfs_verify_fsbno(mp, be64_to_cpu(block->bb_u.l.bb_rightsib))) - return __this_address; - - return NULL; + fsb = XFS_DADDR_TO_FSB(mp, xfs_buf_daddr(bp)); + fa = xfs_btree_check_lblock_siblings(mp, NULL, -1, fsb, + be64_to_cpu(block->bb_u.l.bb_leftsib)); + if (!fa) + fa = xfs_btree_check_lblock_siblings(mp, NULL, -1, fsb, + be64_to_cpu(block->bb_u.l.bb_rightsib)); + return fa; }
/** @@ -4509,7 +4578,9 @@ xfs_btree_sblock_verify( { struct xfs_mount *mp = bp->b_mount; struct xfs_btree_block *block = XFS_BUF_TO_BLOCK(bp); - xfs_agblock_t agno; + xfs_agnumber_t agno; + xfs_agblock_t agbno; + xfs_failaddr_t fa;
/* numrecs verification */ if (be16_to_cpu(block->bb_numrecs) > max_recs) @@ -4517,14 +4588,13 @@ xfs_btree_sblock_verify(
/* sibling pointer verification */ agno = xfs_daddr_to_agno(mp, xfs_buf_daddr(bp)); - if (block->bb_u.s.bb_leftsib != cpu_to_be32(NULLAGBLOCK) && - !xfs_verify_agbno(mp, agno, be32_to_cpu(block->bb_u.s.bb_leftsib))) - return __this_address; - if (block->bb_u.s.bb_rightsib != cpu_to_be32(NULLAGBLOCK) && - !xfs_verify_agbno(mp, agno, be32_to_cpu(block->bb_u.s.bb_rightsib))) - return __this_address; - - return NULL; + agbno = xfs_daddr_to_agbno(mp, xfs_buf_daddr(bp)); + fa = xfs_btree_check_sblock_siblings(mp, NULL, -1, agno, agbno, + be32_to_cpu(block->bb_u.s.bb_leftsib)); + if (!fa) + fa = xfs_btree_check_sblock_siblings(mp, NULL, -1, agno, agbno, + be32_to_cpu(block->bb_u.s.bb_rightsib)); + return fa; }
/*
From: Dave Chinner dchinner@redhat.com
mainline inclusion from mainline-v5.18-rc2 commit 5672225e8f2a872a22b0cecedba7a6644af1fb84 category: bugfix bugzilla: 187526,https://gitee.com/openeuler/kernel/issues/I6WKVJ
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i...
--------------------------------
Commit dc04db2aa7c9 has caused a small aim7 regression, showing a small increase in CPU usage in __xfs_btree_check_sblock() as a result of the extra checking.
This is likely due to the endian conversion of the sibling pointers being unconditional instead of relying on the compiler to endian convert the NULL pointer at compile time and avoiding the runtime conversion for this common case.
Rework the checks so that endian conversion of the sibling pointers is only done if they are not null as the original code did.
.... and these need to be "inline" because the compiler completely fails to inline them automatically like it should be doing.
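Taken together, the reworked long-pointer check looks roughly like this; a condensed sketch of the lblock variant from the diff below, not a verbatim copy:

static inline xfs_failaddr_t
check_lblock_sibling_sketch(xfs_fsblock_t fsb, __be64 dsibling)
{
	/* common case: NULL sibling, compared without a runtime byte swap
	 * because cpu_to_be64(NULLFSBLOCK) folds to a constant */
	if (dsibling == cpu_to_be64(NULLFSBLOCK))
		return NULL;
	/* self-referencing sibling: the cycle the previous patch detects */
	if (be64_to_cpu(dsibling) == fsb)
		return __this_address;
	return NULL;
}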
$ size fs/xfs/libxfs/xfs_btree.o*
   text    data     bss     dec     hex filename
  51874     240       0   52114    cb92 fs/xfs/libxfs/xfs_btree.o.orig
  51562     240       0   51802    ca5a fs/xfs/libxfs/xfs_btree.o.inline
Just when you think the tools have advanced sufficiently we don't have to care about stuff like this anymore, along comes a reminder that *our tools still suck*.
Fixes: dc04db2aa7c9 ("xfs: detect self referencing btree sibling pointers") Reported-by: kernel test robot oliver.sang@intel.com Signed-off-by: Dave Chinner dchinner@redhat.com Reviewed-by: Darrick J. Wong djwong@kernel.org Reviewed-by: Christoph Hellwig hch@lst.de Signed-off-by: Dave Chinner david@fromorbit.com Signed-off-by: Guo Xuenan guoxuenan@huawei.com Reviewed-by: Yang Erkun yangerkun@huawei.com Signed-off-by: Jialin Zhang zhangjialin11@huawei.com --- fs/xfs/libxfs/xfs_btree.c | 47 +++++++++++++++++++++++++++------------ 1 file changed, 33 insertions(+), 14 deletions(-)
diff --git a/fs/xfs/libxfs/xfs_btree.c b/fs/xfs/libxfs/xfs_btree.c index ab29c4135148..7879a2e09c0a 100644 --- a/fs/xfs/libxfs/xfs_btree.c +++ b/fs/xfs/libxfs/xfs_btree.c @@ -50,16 +50,31 @@ xfs_btree_magic( return magic; }
-static xfs_failaddr_t +/* + * These sibling pointer checks are optimised for null sibling pointers. This + * happens a lot, and we don't need to byte swap at runtime if the sibling + * pointer is NULL. + * + * These are explicitly marked at inline because the cost of calling them as + * functions instead of inlining them is about 36 bytes extra code per call site + * on x86-64. Yes, gcc-11 fails to inline them, and explicit inlining of these + * two sibling check functions reduces the compiled code size by over 300 + * bytes. + */ +static inline xfs_failaddr_t xfs_btree_check_lblock_siblings( struct xfs_mount *mp, struct xfs_btree_cur *cur, int level, xfs_fsblock_t fsb, - xfs_fsblock_t sibling) + __be64 dsibling) { - if (sibling == NULLFSBLOCK) + xfs_fsblock_t sibling; + + if (dsibling == cpu_to_be64(NULLFSBLOCK)) return NULL; + + sibling = be64_to_cpu(dsibling); if (sibling == fsb) return __this_address; if (level >= 0) { @@ -73,17 +88,21 @@ xfs_btree_check_lblock_siblings( return NULL; }
-static xfs_failaddr_t +static inline xfs_failaddr_t xfs_btree_check_sblock_siblings( struct xfs_mount *mp, struct xfs_btree_cur *cur, int level, xfs_agnumber_t agno, xfs_agblock_t agbno, - xfs_agblock_t sibling) + __be32 dsibling) { - if (sibling == NULLAGBLOCK) + xfs_agblock_t sibling; + + if (dsibling == cpu_to_be32(NULLAGBLOCK)) return NULL; + + sibling = be32_to_cpu(dsibling); if (sibling == agbno) return __this_address; if (level >= 0) { @@ -135,10 +154,10 @@ __xfs_btree_check_lblock( fsb = XFS_DADDR_TO_FSB(mp, xfs_buf_daddr(bp));
fa = xfs_btree_check_lblock_siblings(mp, cur, level, fsb, - be64_to_cpu(block->bb_u.l.bb_leftsib)); + block->bb_u.l.bb_leftsib); if (!fa) fa = xfs_btree_check_lblock_siblings(mp, cur, level, fsb, - be64_to_cpu(block->bb_u.l.bb_rightsib)); + block->bb_u.l.bb_rightsib); return fa; }
@@ -203,10 +222,10 @@ __xfs_btree_check_sblock( }
fa = xfs_btree_check_sblock_siblings(mp, cur, level, agno, agbno, - be32_to_cpu(block->bb_u.s.bb_leftsib)); + block->bb_u.s.bb_leftsib); if (!fa) fa = xfs_btree_check_sblock_siblings(mp, cur, level, agno, - agbno, be32_to_cpu(block->bb_u.s.bb_rightsib)); + agbno, block->bb_u.s.bb_rightsib); return fa; }
@@ -4533,10 +4552,10 @@ xfs_btree_lblock_verify( /* sibling pointer verification */ fsb = XFS_DADDR_TO_FSB(mp, xfs_buf_daddr(bp)); fa = xfs_btree_check_lblock_siblings(mp, NULL, -1, fsb, - be64_to_cpu(block->bb_u.l.bb_leftsib)); + block->bb_u.l.bb_leftsib); if (!fa) fa = xfs_btree_check_lblock_siblings(mp, NULL, -1, fsb, - be64_to_cpu(block->bb_u.l.bb_rightsib)); + block->bb_u.l.bb_rightsib); return fa; }
@@ -4590,10 +4609,10 @@ xfs_btree_sblock_verify( agno = xfs_daddr_to_agno(mp, xfs_buf_daddr(bp)); agbno = xfs_daddr_to_agbno(mp, xfs_buf_daddr(bp)); fa = xfs_btree_check_sblock_siblings(mp, NULL, -1, agno, agbno, - be32_to_cpu(block->bb_u.s.bb_leftsib)); + block->bb_u.s.bb_leftsib); if (!fa) fa = xfs_btree_check_sblock_siblings(mp, NULL, -1, agno, agbno, - be32_to_cpu(block->bb_u.s.bb_rightsib)); + block->bb_u.s.bb_rightsib); return fa; }
From: "Darrick J. Wong" djwong@kernel.org
mainline inclusion from mainline-v5.18-rc2 commit a54f78def73d847cb060b18c4e4a3d1d26c9ca6d category: bugfix bugzilla: 187526,https://gitee.com/openeuler/kernel/issues/I6WKVJ
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i...
--------------------------------
The recent patch to improve btree cycle checking caused a regression when I rebased the in-memory btree branch atop the 5.19 for-next branch, because in-memory short-pointer btrees do not have AG numbers. This produced the following complaint from kmemleak:
unreferenced object 0xffff88803d47dde8 (size 264):
  comm "xfs_io", pid 4889, jiffies 4294906764 (age 24.072s)
  hex dump (first 32 bytes):
    90 4d 0b 0f 80 88 ff ff 00 a0 bd 05 80 88 ff ff  .M..............
    e0 44 3a a0 ff ff ff ff 00 df 08 06 80 88 ff ff  .D:.............
  backtrace:
    [<ffffffffa0388059>] xfbtree_dup_cursor+0x49/0xc0 [xfs]
    [<ffffffffa029887b>] xfs_btree_dup_cursor+0x3b/0x200 [xfs]
    [<ffffffffa029af5d>] __xfs_btree_split+0x6ad/0x820 [xfs]
    [<ffffffffa029b130>] xfs_btree_split+0x60/0x110 [xfs]
    [<ffffffffa029f6da>] xfs_btree_make_block_unfull+0x19a/0x1f0 [xfs]
    [<ffffffffa029fada>] xfs_btree_insrec+0x3aa/0x810 [xfs]
    [<ffffffffa029fff3>] xfs_btree_insert+0xb3/0x240 [xfs]
    [<ffffffffa02cb729>] xfs_rmap_insert+0x99/0x200 [xfs]
    [<ffffffffa02cf142>] xfs_rmap_map_shared+0x192/0x5f0 [xfs]
    [<ffffffffa02cf60b>] xfs_rmap_map_raw+0x6b/0x90 [xfs]
    [<ffffffffa0384a85>] xrep_rmap_stash+0xd5/0x1d0 [xfs]
    [<ffffffffa0384dc0>] xrep_rmap_visit_bmbt+0xa0/0xf0 [xfs]
    [<ffffffffa0384fb6>] xrep_rmap_scan_iext+0x56/0xa0 [xfs]
    [<ffffffffa03850d8>] xrep_rmap_scan_ifork+0xd8/0x160 [xfs]
    [<ffffffffa0385195>] xrep_rmap_scan_inode+0x35/0x80 [xfs]
    [<ffffffffa03852ee>] xrep_rmap_find_rmaps+0x10e/0x270 [xfs]
I noticed that xfs_btree_insrec has a bunch of debug code that returns out of the function immediately, without freeing the "new" btree cursor that can be returned when _make_block_unfull calls xfs_btree_split. Fix the error returns in this function to free the btree cursor.
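The shape of the fix is the usual kernel error-unwind pattern; a minimal sketch, with the two step_* helpers as hypothetical stand-ins for the real insert and split logic:

static int
insrec_sketch(struct xfs_btree_cur *cur)
{
	struct xfs_btree_cur *ncur = NULL;	/* may be set by a split */
	int error;

	error = step_that_may_split(cur, &ncur);	/* hypothetical */
	if (error)
		goto error0;

	error = step_that_may_fail(cur);		/* hypothetical */
	if (error)
		goto error0;

	return 0;

error0:
	/* every failure path now frees the split-created cursor */
	if (ncur)
		xfs_btree_del_cursor(ncur, error);
	return error;
}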
Signed-off-by: Darrick J. Wong djwong@kernel.org Reviewed-by: Christoph Hellwig hch@lst.de Reviewed-by: Dave Chinner dchinner@redhat.com Signed-off-by: Dave Chinner david@fromorbit.com Signed-off-by: Guo Xuenan guoxuenan@huawei.com Reviewed-by: Yang Erkun yangerkun@huawei.com Signed-off-by: Jialin Zhang zhangjialin11@huawei.com --- fs/xfs/libxfs/xfs_btree.c | 8 +++++--- 1 file changed, 5 insertions(+), 3 deletions(-)
diff --git a/fs/xfs/libxfs/xfs_btree.c b/fs/xfs/libxfs/xfs_btree.c index 7879a2e09c0a..748a21f52dec 100644 --- a/fs/xfs/libxfs/xfs_btree.c +++ b/fs/xfs/libxfs/xfs_btree.c @@ -3276,7 +3276,7 @@ xfs_btree_insrec( struct xfs_btree_block *block; /* btree block */ struct xfs_buf *bp; /* buffer for block */ union xfs_btree_ptr nptr; /* new block ptr */ - struct xfs_btree_cur *ncur; /* new btree cursor */ + struct xfs_btree_cur *ncur = NULL; /* new btree cursor */ union xfs_btree_key nkey; /* new block key */ union xfs_btree_key *lkey; int optr; /* old key/record index */ @@ -3356,7 +3356,7 @@ xfs_btree_insrec( #ifdef DEBUG error = xfs_btree_check_block(cur, block, level, bp); if (error) - return error; + goto error0; #endif
/* @@ -3376,7 +3376,7 @@ xfs_btree_insrec( for (i = numrecs - ptr; i >= 0; i--) { error = xfs_btree_debug_check_ptr(cur, pp, i, level); if (error) - return error; + goto error0; }
xfs_btree_shift_keys(cur, kp, 1, numrecs - ptr + 1); @@ -3461,6 +3461,8 @@ xfs_btree_insrec( return 0;
error0: + if (ncur) + xfs_btree_del_cursor(ncur, error); return error; }
From: Frederic Weisbecker frederic@kernel.org
mainline inclusion from mainline-v5.16-rc4 commit 53e87e3cdc155f20c3417b689df8d2ac88d79576 category: bugfix bugzilla: https://gitee.com/openeuler/kernel/issues/I6WCC1 CVE: NA
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i...
--------------------------------
When at least one CPU runs in nohz_full mode, a dedicated timekeeper CPU is guaranteed to stay online and to never stop its tick.
Meanwhile, in some rare cases, the dedicated timekeeper may be running with interrupts disabled for a while, such as in stop_machine.
If jiffies stop being updated, a nohz_full CPU may end up endlessly programming the next tick in the past, taking the last jiffies update monotonic timestamp as a stale base, resulting in a tick storm.
Here is a scenario where it matters:
0) CPU 0 is the timekeeper and CPU 1 a nohz_full CPU.
1) A stop machine callback is queued to execute somewhere.
2) CPU 0 reaches MULTI_STOP_DISABLE_IRQ while CPU 1 is still in MULTI_STOP_PREPARE. Hence CPU 0 can't do its timekeeping duty. CPU 1 can still take IRQs.
3) CPU 1 receives an IRQ which queues a timer callback one jiffy forward.
4) On IRQ exit, CPU 1 schedules the tick one jiffy forward, taking last_jiffies_update as a base. But last_jiffies_update hasn't been updated for 2 jiffies since the timekeeper has interrupts disabled.
5) clockevents_program_event(), which relies on ktime_get(), observes that the expiration is in the past and therefore programs the min delta event on the clock.
6) The tick fires immediately, goto 3)
7) Tick storm, the nohz_full CPU is drowned and takes ages to reach MULTI_STOP_DISABLE_IRQ, which is the only way out of this situation.
Solve this by unconditionally updating jiffies if the value is stale on nohz_full IRQ entry. IRQs and other disturbances are expected to be rare enough on nohz_full for the unconditional call to ktime_get() not to matter in practice.
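A sketch of the idea, simplified from the tick-sched internals: last_jiffies_update and tick_do_update_jiffies64() are file-local names in kernel/time/tick-sched.c, and the staleness check here is illustrative rather than the exact mainline code:

static void
nohz_full_irq_enter_sketch(void)
{
	ktime_t now = ktime_get();	/* unconditional, but IRQs are rare here */

	/* if the timekeeper has not advanced jiffies for a full tick,
	 * update it ourselves so the next tick is not programmed
	 * against a stale base */
	if (ktime_sub(now, last_jiffies_update) >= TICK_NSEC)
		tick_do_update_jiffies64(now);
}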
Reported-by: Paul E. McKenney paulmck@kernel.org Signed-off-by: Frederic Weisbecker frederic@kernel.org Signed-off-by: Thomas Gleixner tglx@linutronix.de Tested-by: Paul E. McKenney paulmck@kernel.org Link: https://lore.kernel.org/r/20211026141055.57358-2-frederic@kernel.org
Conflicts: kernel/softirq.c
Signed-off-by: Yu Liao liaoyu15@huawei.com Reviewed-by: Xiongfeng Wang wangxiongfeng2@huawei.com Signed-off-by: Jialin Zhang zhangjialin11@huawei.com --- kernel/softirq.c | 3 ++- kernel/time/tick-sched.c | 7 +++++++ 2 files changed, 9 insertions(+), 1 deletion(-)
diff --git a/kernel/softirq.c b/kernel/softirq.c index 19668d614f47..4196b9f84690 100644 --- a/kernel/softirq.c +++ b/kernel/softirq.c @@ -350,7 +350,8 @@ asmlinkage __visible void do_softirq(void) */ void irq_enter_rcu(void) { - if (is_idle_task(current) && !in_interrupt()) { + if (tick_nohz_full_cpu(smp_processor_id()) || + (is_idle_task(current) && !in_interrupt())) { /* * Prevent raise_softirq from needlessly waking up ksoftirqd * here, as softirq will be serviced on return from interrupt. diff --git a/kernel/time/tick-sched.c b/kernel/time/tick-sched.c index 33750db5b564..aed5d6b6ca71 100644 --- a/kernel/time/tick-sched.c +++ b/kernel/time/tick-sched.c @@ -1347,6 +1347,13 @@ static inline void tick_nohz_irq_enter(void) now = ktime_get(); if (ts->idle_active) tick_nohz_stop_idle(ts, now); + /* + * If all CPUs are idle. We may need to update a stale jiffies value. + * Note nohz_full is a special case: a timekeeper is guaranteed to stay + * alive but it might be busy looping with interrupts disabled in some + * rare case (typically stop machine). So we must make sure we have a + * last resort. + */ if (ts->tick_stopped) tick_nohz_update_jiffies(now); }
From: Zheng Wang zyytlz.wz@163.com
stable inclusion from stable-v5.10.177 commit 75e2144291e847009fbc0350e10ec588ff96e05a category: bugfix bugzilla: https://gitee.com/src-openeuler/kernel/issues/I6W80A CVE: CVE-2023-30772
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?id=...
--------------------------------
[ Upstream commit 06615d11cc78162dfd5116efb71f29eb29502d37 ]
In da9150_charger_probe, &charger->otg_work is bound with da9150_charger_otg_work. da9150_charger_otg_ncb may be called to start the work.
If we remove the module, da9150_charger_remove() performs the cleanup, but there may still be unfinished work. The possible sequence is as follows:

CPU0                          CPU1

                              |da9150_charger_otg_work
da9150_charger_remove         |
  power_supply_unregister     |
    device_unregister         |
      power_supply_dev_release|
        kfree(psy)            |
                              |
                              | power_supply_changed(charger->usb);
                              |   //use

Fix it by canceling the work before cleanup in da9150_charger_remove().
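The rule the fix enforces, in miniature: quiesce anything that can still reference a resource before tearing that resource down. A hedged sketch of the remove path after the change (abridged, not the full function):

static int
da9150_charger_remove_sketch(struct platform_device *pdev)
{
	struct da9150_charger *charger = platform_get_drvdata(pdev);

	/* wait out any in-flight otg_work before its targets go away */
	cancel_work_sync(&charger->otg_work);

	power_supply_unregister(charger->battery);
	power_supply_unregister(charger->usb);	/* now safe: no user left */
	return 0;
}

The xgene-hwmon fix later in this series applies the same ordering rule: cancel the asynchronous work first, free the data it uses second.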
Fixes: c1a281e34dae ("power: Add support for DA9150 Charger") Signed-off-by: Zheng Wang zyytlz.wz@163.com Signed-off-by: Sebastian Reichel sebastian.reichel@collabora.com Signed-off-by: Sasha Levin sashal@kernel.org Signed-off-by: Guo Mengqi guomengqi3@huawei.com Reviewed-by: Wang Weiyang wangweiyang2@huawei.com Reviewed-by: Weilong Chen chenweilong@huawei.com Signed-off-by: Jialin Zhang zhangjialin11@huawei.com --- drivers/power/supply/da9150-charger.c | 1 + 1 file changed, 1 insertion(+)
diff --git a/drivers/power/supply/da9150-charger.c b/drivers/power/supply/da9150-charger.c index f9314cc0cd75..6b987da58655 100644 --- a/drivers/power/supply/da9150-charger.c +++ b/drivers/power/supply/da9150-charger.c @@ -662,6 +662,7 @@ static int da9150_charger_remove(struct platform_device *pdev)
if (!IS_ERR_OR_NULL(charger->usb_phy)) usb_unregister_notifier(charger->usb_phy, &charger->otg_nb); + cancel_work_sync(&charger->otg_work);
power_supply_unregister(charger->battery); power_supply_unregister(charger->usb);
From: David Howells dhowells@redhat.com
stable inclusion from stable-v5.10.157 commit 3535c632e6d16c98f76e615da8dc0cb2750c66cc category: bugfix bugzilla: https://gitee.com/src-openeuler/kernel/issues/I6VK2H CVE: CVE-2023-2006
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?id=...
--------------------------------
[ Upstream commit 3bcd6c7eaa53b56c3f584da46a1f7652e759d0e5 ]
After rxrpc_unbundle_conn() has removed a connection from a bundle, it checks to see if there are any conns with available channels and, if not, removes and attempts to destroy the bundle.
Whilst it does check, after grabbing client_bundles_lock, that there are no connections attached, this races with rxrpc_look_up_bundle() retrieving the bundle without yet attaching a connection, since the connection is only attached later.
There is therefore a window in which the bundle can get destroyed before we manage to attach a new connection to it.
Fix this by adding an "active" counter to struct rxrpc_bundle:
(1) rxrpc_connect_call() obtains an active count by prepping/looking up a bundle and ditches it before returning.
(2) If, during rxrpc_connect_call(), a connection is added to the bundle, this obtains an active count, which is held until the connection is discarded.
(3) rxrpc_deactivate_bundle() is created to drop an active count on a bundle and destroy it when the active count reaches 0. The active count is checked under client_bundles_lock to prevent a race with rxrpc_look_up_bundle() (a condensed sketch follows this list).
(4) rxrpc_unbundle_conn() then calls rxrpc_deactivate_bundle().
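A condensed sketch of the pattern in (3); simplified, since the real function also skips the rbtree erase for exclusive bundles:

static void
rxrpc_deactivate_bundle_sketch(struct rxrpc_bundle *bundle)
{
	struct rxrpc_local *local = bundle->params.local;

	/* the lock is taken only when the count hits zero, so a concurrent
	 * rxrpc_look_up_bundle() bumping ->active under the same lock can
	 * never see a bundle that is about to be erased */
	if (atomic_dec_and_lock(&bundle->active, &local->client_bundles_lock)) {
		rb_erase(&bundle->local_node, &local->client_bundles);
		spin_unlock(&local->client_bundles_lock);
		rxrpc_put_bundle(bundle);
	}
}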
Fixes: 245500d853e9 ("rxrpc: Rewrite the client connection manager") Reported-by: zdi-disclosures@trendmicro.com # ZDI-CAN-15975 Signed-off-by: David Howells dhowells@redhat.com Tested-by: zdi-disclosures@trendmicro.com cc: Marc Dionne marc.dionne@auristor.com cc: linux-afs@lists.infradead.org Signed-off-by: David S. Miller davem@davemloft.net
Conflicts: net/rxrpc/ar-internal.h net/rxrpc/conn_client.c
Signed-off-by: Wang Yufen wangyufen@huawei.com Reviewed-by: Yue Haibing yuehaibing@huawei.com Reviewed-by: Wang Weiyang wangweiyang2@huawei.com Signed-off-by: Jialin Zhang zhangjialin11@huawei.com --- net/rxrpc/ar-internal.h | 1 + net/rxrpc/conn_client.c | 36 ++++++++++++++++++++++-------------- 2 files changed, 23 insertions(+), 14 deletions(-)
diff --git a/net/rxrpc/ar-internal.h b/net/rxrpc/ar-internal.h index ccb65412b670..90ab12559092 100644 --- a/net/rxrpc/ar-internal.h +++ b/net/rxrpc/ar-internal.h @@ -392,6 +392,7 @@ enum rxrpc_conn_proto_state { struct rxrpc_bundle { struct rxrpc_conn_parameters params; atomic_t usage; + atomic_t active; /* Number of active users */ unsigned int debug_id; bool try_upgrade; /* True if the bundle is attempting upgrade */ bool alloc_conn; /* True if someone's getting a conn */ diff --git a/net/rxrpc/conn_client.c b/net/rxrpc/conn_client.c index f5fb223aba82..34430cc9a0a3 100644 --- a/net/rxrpc/conn_client.c +++ b/net/rxrpc/conn_client.c @@ -40,6 +40,8 @@ __read_mostly unsigned long rxrpc_conn_idle_client_fast_expiry = 2 * HZ; DEFINE_IDR(rxrpc_client_conn_ids); static DEFINE_SPINLOCK(rxrpc_conn_id_lock);
+static void rxrpc_deactivate_bundle(struct rxrpc_bundle *bundle); + /* * Get a connection ID and epoch for a client connection from the global pool. * The connection struct pointer is then recorded in the idr radix tree. The @@ -123,6 +125,7 @@ static struct rxrpc_bundle *rxrpc_alloc_bundle(struct rxrpc_conn_parameters *cp, bundle->params = *cp; rxrpc_get_peer(bundle->params.peer); atomic_set(&bundle->usage, 1); + atomic_set(&bundle->active, 1); spin_lock_init(&bundle->channel_lock); INIT_LIST_HEAD(&bundle->waiting_calls); } @@ -341,6 +344,7 @@ static struct rxrpc_bundle *rxrpc_look_up_bundle(struct rxrpc_conn_parameters *c rxrpc_free_bundle(candidate); found_bundle: rxrpc_get_bundle(bundle); + atomic_inc(&bundle->active); spin_unlock(&local->client_bundles_lock); _leave(" = %u [found]", bundle->debug_id); return bundle; @@ -438,6 +442,7 @@ static void rxrpc_add_conn_to_bundle(struct rxrpc_bundle *bundle, gfp_t gfp) if (old) trace_rxrpc_client(old, -1, rxrpc_client_replace); candidate->bundle_shift = shift; + atomic_inc(&bundle->active); bundle->conns[i] = candidate; for (j = 0; j < RXRPC_MAXCALLS; j++) set_bit(shift + j, &bundle->avail_chans); @@ -728,6 +733,7 @@ int rxrpc_connect_call(struct rxrpc_sock *rx, smp_rmb();
out_put_bundle: + rxrpc_deactivate_bundle(bundle); rxrpc_put_bundle(bundle); out: _leave(" = %d", ret); @@ -903,9 +909,8 @@ void rxrpc_disconnect_client_call(struct rxrpc_bundle *bundle, struct rxrpc_call static void rxrpc_unbundle_conn(struct rxrpc_connection *conn) { struct rxrpc_bundle *bundle = conn->bundle; - struct rxrpc_local *local = bundle->params.local; unsigned int bindex; - bool need_drop = false, need_put = false; + bool need_drop = false; int i;
_enter("C=%x", conn->debug_id); @@ -924,15 +929,22 @@ static void rxrpc_unbundle_conn(struct rxrpc_connection *conn) } spin_unlock(&bundle->channel_lock);
- /* If there are no more connections, remove the bundle */ - if (!bundle->avail_chans) { - _debug("maybe unbundle"); - spin_lock(&local->client_bundles_lock); + if (need_drop) { + rxrpc_deactivate_bundle(bundle); + rxrpc_put_connection(conn); + } +}
- for (i = 0; i < ARRAY_SIZE(bundle->conns); i++) - if (bundle->conns[i]) - break; - if (i == ARRAY_SIZE(bundle->conns) && !bundle->params.exclusive) { +/* + * Drop the active count on a bundle. + */ +static void rxrpc_deactivate_bundle(struct rxrpc_bundle *bundle) +{ + struct rxrpc_local *local = bundle->params.local; + bool need_put = false; + + if (atomic_dec_and_lock(&bundle->active, &local->client_bundles_lock)) { + if (!bundle->params.exclusive) { _debug("erase bundle"); rb_erase(&bundle->local_node, &local->client_bundles); need_put = true; @@ -942,10 +954,6 @@ static void rxrpc_unbundle_conn(struct rxrpc_connection *conn) if (need_put) rxrpc_put_bundle(bundle); } - - if (need_drop) - rxrpc_put_connection(conn); - _leave(""); }
/*
From: Zheng Wang zyytlz.wz@163.com
mainline inclusion from mainline-v6.3-rc3 commit cb090e64cf25602b9adaf32d5dfc9c8bec493cd1 category: bugfix bugzilla: 188657, https://gitee.com/src-openeuler/kernel/issues/I6T36A CVE: CVE-2023-1855
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i...
--------------------------------
In xgene_hwmon_probe, &ctx->workq is bound with xgene_hwmon_evt_work. Then it will be started.
If we remove the driver, xgene_hwmon_remove() performs the cleanup, but there may be unfinished work. The possible sequence is as follows:

CPU0                                  CPU1

                                      |xgene_hwmon_evt_work
xgene_hwmon_remove                    |
  kfifo_free(&ctx->async_msg_fifo);   |
                                      |
                                      |kfifo_out_spinlocked
                                      |  //use &ctx->async_msg_fifo

Fix it by finishing the work before cleanup in xgene_hwmon_remove().

Fixes: 2ca492e22cb7 ("hwmon: (xgene) Fix crash when alarm occurs before driver probe") Signed-off-by: Zheng Wang zyytlz.wz@163.com Link: https://lore.kernel.org/r/20230310084007.1403388-1-zyytlz.wz@163.com Signed-off-by: Guenter Roeck linux@roeck-us.net Signed-off-by: Zhao Wenhui zhaowenhui8@huawei.com Reviewed-by: songping yu yusongping@huawei.com Reviewed-by: Xiu Jianfeng xiujianfeng@huawei.com Reviewed-by: Zhang Qiao zhangqiao22@huawei.com Reviewed-by: Chen Hui judy.chenhui@huawei.com Signed-off-by: Jialin Zhang zhangjialin11@huawei.com --- drivers/hwmon/xgene-hwmon.c | 1 + 1 file changed, 1 insertion(+)
diff --git a/drivers/hwmon/xgene-hwmon.c b/drivers/hwmon/xgene-hwmon.c index f2a5af239c95..f5d3cf86753f 100644 --- a/drivers/hwmon/xgene-hwmon.c +++ b/drivers/hwmon/xgene-hwmon.c @@ -768,6 +768,7 @@ static int xgene_hwmon_remove(struct platform_device *pdev) { struct xgene_hwmon_dev *ctx = platform_get_drvdata(pdev);
+ cancel_work_sync(&ctx->workq); hwmon_device_unregister(ctx->hwmon_dev); kfifo_free(&ctx->async_msg_fifo); if (acpi_disabled)
From: Nikolay Aleksandrov razor@blackwall.org
mainline inclusion from mainline-v6.3-rc3 commit 9ec7eb60dcbcb6c41076defbc5df7bbd95ceaba5 category: bugfix bugzilla: https://gitee.com/openeuler/kernel/issues/I6WNGK CVE: NA
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i...
---------------------------
Add a bond_ether_setup helper which is used to fix the ether_setup() calls in the bonding driver. It takes care of both the IFF_MASTER and IFF_SLAVE flags: the former is always restored, the latter only if it was set. If the bond enslaves a non-ARPHRD_ETHER device (changing its type), then releases it and enslaves an ARPHRD_ETHER device (changing back), we use ether_setup() to restore the bond device type, but ether_setup() also resets the device's flags and removes IFF_MASTER and IFF_SLAVE [1]. Use the bond_ether_setup helper to restore both after such a transition.
[1] reproduce (nlmon is non-ARPHRD_ETHER):
$ ip l add nlmon0 type nlmon
$ ip l add bond2 type bond mode active-backup
$ ip l set nlmon0 master bond2
$ ip l set nlmon0 nomaster
$ ip l add bond1 type bond
(we use bond1 as ARPHRD_ETHER device to restore bond2's mode)
$ ip l set bond1 master bond2
$ ip l sh dev bond2
37: bond2: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN mode DEFAULT group default qlen 1000
    link/ether be:d7:c5:40:5b:cc brd ff:ff:ff:ff:ff:ff promiscuity 0 minmtu 68 maxmtu 1500
(notice bond2's IFF_MASTER is missing)
Fixes: e36b9d16c6a6 ("bonding: clean muticast addresses when device changes type") Signed-off-by: Nikolay Aleksandrov razor@blackwall.org Signed-off-by: David S. Miller davem@davemloft.net Conflicts: drivers/net/bonding/bond_main.c Signed-off-by: Ziyang Xuan william.xuanziyang@huawei.com Reviewed-by: Yue Haibing yuehaibing@huawei.com Signed-off-by: Jialin Zhang zhangjialin11@huawei.com --- drivers/net/bonding/bond_main.c | 19 +++++++++++++++---- 1 file changed, 15 insertions(+), 4 deletions(-)
diff --git a/drivers/net/bonding/bond_main.c b/drivers/net/bonding/bond_main.c index 5b0a640eccf6..a5ac22cbcb2d 100644 --- a/drivers/net/bonding/bond_main.c +++ b/drivers/net/bonding/bond_main.c @@ -1690,6 +1690,19 @@ void bond_lower_state_changed(struct slave *slave) netdev_lower_state_changed(slave->dev, &info); }
+/* The bonding driver uses ether_setup() to convert a master bond device + * to ARPHRD_ETHER, that resets the target netdevice's flags so we always + * have to restore the IFF_MASTER flag, and only restore IFF_SLAVE if it was set + */ +static void bond_ether_setup(struct net_device *bond_dev) +{ + unsigned int slave_flag = bond_dev->flags & IFF_SLAVE; + + ether_setup(bond_dev); + bond_dev->flags |= IFF_MASTER | slave_flag; + bond_dev->priv_flags &= ~IFF_TX_SKB_SHARING; +} + /* enslave device <slave> to bond device <master> */ int bond_enslave(struct net_device *bond_dev, struct net_device *slave_dev, struct netlink_ext_ack *extack) @@ -1784,10 +1797,8 @@ int bond_enslave(struct net_device *bond_dev, struct net_device *slave_dev,
if (slave_dev->type != ARPHRD_ETHER) bond_setup_by_slave(bond_dev, slave_dev); - else { - ether_setup(bond_dev); - bond_dev->priv_flags &= ~IFF_TX_SKB_SHARING; - } + else + bond_ether_setup(bond_dev);
call_netdevice_notifiers(NETDEV_POST_TYPE_CHANGE, bond_dev);
From: Nikolay Aleksandrov razor@blackwall.org
mainline inclusion from mainline-v6.3-rc3 commit e667d469098671261d558be0cd93dca4d285ce1e category: bugfix bugzilla: https://gitee.com/openeuler/kernel/issues/I6WNGK CVE: NA
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i...
---------------------------
syzbot reported a warning [1] where the bond device itself is a slave and we try to enslave a non-ethernet device as the first slave, which fails; then, in the error path, when ether_setup() restores the bond device it also clears all flags. In my previous fix [2] I restored the IFF_MASTER flag, but I didn't consider the case that the bond device itself might also be a slave with IFF_SLAVE set, so we need to restore that flag as well. Use the bond_ether_setup helper, which does the right thing and restores the bond's flags properly.
Steps to reproduce using a nlmon dev:
$ ip l add nlmon0 type nlmon
$ ip l add bond1 type bond
$ ip l add bond2 type bond
$ ip l set bond1 master bond2
$ ip l set dev nlmon0 master bond1
$ ip -d l sh dev bond1
22: bond1: <BROADCAST,MULTICAST,MASTER> mtu 1500 qdisc noqueue master bond2 state DOWN mode DEFAULT group default qlen 1000
(now bond1's IFF_SLAVE flag is gone and we'll hit a warning[3] if we try to delete it)
[1] https://syzkaller.appspot.com/bug?id=391c7b1f6522182899efba27d891f1743e8eb3e...
[2] commit 7d5cd2ce5292 ("bonding: correctly handle bonding type change on enslave failure")
[3] example warning:
[ 27.008664] bond1: (slave nlmon0): The slave device specified does not support setting the MAC address
[ 27.008692] bond1: (slave nlmon0): Error -95 calling set_mac_address
[ 32.464639] bond1 (unregistering): Released all slaves
[ 32.464685] ------------[ cut here ]------------
[ 32.464686] WARNING: CPU: 1 PID: 2004 at net/core/dev.c:10829 unregister_netdevice_many+0x72a/0x780
[ 32.464694] Modules linked in: br_netfilter bridge bonding virtio_net
[ 32.464699] CPU: 1 PID: 2004 Comm: ip Kdump: loaded Not tainted 5.18.0-rc3+ #47
[ 32.464703] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.1-2.fc37 04/01/2014
[ 32.464704] RIP: 0010:unregister_netdevice_many+0x72a/0x780
[ 32.464707] Code: 99 fd ff ff ba 90 1a 00 00 48 c7 c6 f4 02 66 96 48 c7 c7 20 4d 35 96 c6 05 fa c7 2b 02 01 e8 be 6f 4a 00 0f 0b e9 73 fd ff ff <0f> 0b e9 5f fd ff ff 80 3d e3 c7 2b 02 00 0f 85 3b fd ff ff ba 59
[ 32.464710] RSP: 0018:ffffa006422d7820 EFLAGS: 00010206
[ 32.464712] RAX: ffff8f6e077140a0 RBX: ffffa006422d7888 RCX: 0000000000000000
[ 32.464714] RDX: ffff8f6e12edbe58 RSI: 0000000000000296 RDI: ffffffff96d4a520
[ 32.464716] RBP: ffff8f6e07714000 R08: ffffffff96d63600 R09: ffffa006422d7728
[ 32.464717] R10: 0000000000000ec0 R11: ffffffff9698c988 R12: ffff8f6e12edb140
[ 32.464719] R13: dead000000000122 R14: dead000000000100 R15: ffff8f6e12edb140
[ 32.464723] FS: 00007f297c2f1740(0000) GS:ffff8f6e5d900000(0000) knlGS:0000000000000000
[ 32.464725] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 32.464726] CR2: 00007f297bf1c800 CR3: 00000000115e8000 CR4: 0000000000350ee0
[ 32.464730] Call Trace:
[ 32.464763] <TASK>
[ 32.464767] rtnl_dellink+0x13e/0x380
[ 32.464776] ? cred_has_capability.isra.0+0x68/0x100
[ 32.464780] ? __rtnl_unlock+0x33/0x60
[ 32.464783] ? bpf_lsm_capset+0x10/0x10
[ 32.464786] ? security_capable+0x36/0x50
[ 32.464790] rtnetlink_rcv_msg+0x14e/0x3b0
[ 32.464792] ? _copy_to_iter+0xb1/0x790
[ 32.464796] ? post_alloc_hook+0xa0/0x160
[ 32.464799] ? rtnl_calcit.isra.0+0x110/0x110
[ 32.464802] netlink_rcv_skb+0x50/0xf0
[ 32.464806] netlink_unicast+0x216/0x340
[ 32.464809] netlink_sendmsg+0x23f/0x480
[ 32.464812] sock_sendmsg+0x5e/0x60
[ 32.464815] ____sys_sendmsg+0x22c/0x270
[ 32.464818] ? import_iovec+0x17/0x20
[ 32.464821] ? sendmsg_copy_msghdr+0x59/0x90
[ 32.464823] ? do_set_pte+0xa0/0xe0
[ 32.464828] ___sys_sendmsg+0x81/0xc0
[ 32.464832] ? mod_objcg_state+0xc6/0x300
[ 32.464835] ? refill_obj_stock+0xa9/0x160
[ 32.464838] ? memcg_slab_free_hook+0x1a5/0x1f0
[ 32.464842] __sys_sendmsg+0x49/0x80
[ 32.464847] do_syscall_64+0x3b/0x90
[ 32.464851] entry_SYSCALL_64_after_hwframe+0x44/0xae
[ 32.464865] RIP: 0033:0x7f297bf2e5e7
[ 32.464868] Code: 64 89 02 48 c7 c0 ff ff ff ff eb bb 0f 1f 80 00 00 00 00 f3 0f 1e fa 64 8b 04 25 18 00 00 00 85 c0 75 10 b8 2e 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 51 c3 48 83 ec 28 89 54 24 1c 48 89 74 24 10
[ 32.464869] RSP: 002b:00007ffd96c824c8 EFLAGS: 00000246 ORIG_RAX: 000000000000002e
[ 32.464872] RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007f297bf2e5e7
[ 32.464874] RDX: 0000000000000000 RSI: 00007ffd96c82540 RDI: 0000000000000003
[ 32.464875] RBP: 00000000640f19de R08: 0000000000000001 R09: 000000000000007c
[ 32.464876] R10: 00007f297bffabe0 R11: 0000000000000246 R12: 0000000000000001
[ 32.464877] R13: 00007ffd96c82d20 R14: 00007ffd96c82610 R15: 000055bfe38a7020
[ 32.464881] </TASK>
[ 32.464882] ---[ end trace 0000000000000000 ]---
Fixes: 7d5cd2ce5292 ("bonding: correctly handle bonding type change on enslave failure") Reported-by: syzbot+9dfc3f3348729cc82277@syzkaller.appspotmail.com Link: https://syzkaller.appspot.com/bug?id=391c7b1f6522182899efba27d891f1743e8eb3e... Signed-off-by: Nikolay Aleksandrov razor@blackwall.org Reviewed-by: Michal Kubiak michal.kubiak@intel.com Acked-by: Jonathan Toppins jtoppins@redhat.com Acked-by: Jay Vosburgh jay.vosburgh@canonical.com Signed-off-by: David S. Miller davem@davemloft.net Signed-off-by: Ziyang Xuan william.xuanziyang@huawei.com Reviewed-by: Yue Haibing yuehaibing@huawei.com Signed-off-by: Jialin Zhang zhangjialin11@huawei.com --- drivers/net/bonding/bond_main.c | 4 +--- 1 file changed, 1 insertion(+), 3 deletions(-)
diff --git a/drivers/net/bonding/bond_main.c b/drivers/net/bonding/bond_main.c index a5ac22cbcb2d..a30d24ad9110 100644 --- a/drivers/net/bonding/bond_main.c +++ b/drivers/net/bonding/bond_main.c @@ -2185,9 +2185,7 @@ int bond_enslave(struct net_device *bond_dev, struct net_device *slave_dev, eth_hw_addr_random(bond_dev); if (bond_dev->type != ARPHRD_ETHER) { dev_close(bond_dev); - ether_setup(bond_dev); - bond_dev->flags |= IFF_MASTER; - bond_dev->priv_flags &= ~IFF_TX_SKB_SHARING; + bond_ether_setup(bond_dev); } }
From: Ido Schimmel idosch@nvidia.com
mainline inclusion from mainline-v6.3 commit c484fcc058bada604d7e4e5228d4affb646ddbc2 category: bugfix bugzilla: https://gitee.com/openeuler/kernel/issues/I6WNGK CVE: NA
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i...
---------------------------
When a net device is put administratively up, its 'IFF_UP' flag is set (if not set already) and a 'NETDEV_UP' notification is emitted, which causes the 8021q driver to add VLAN ID 0 on the device. The reverse happens when a net device is put administratively down.
When changing the type of a bond to Ethernet, its 'IFF_UP' flag is incorrectly cleared, resulting in the kernel skipping the above process and VLAN ID 0 being leaked [1].
Fix by restoring the flag when changing the type to Ethernet, in a similar fashion to the restoration of the 'IFF_SLAVE' flag.
The issue can be reproduced using the script in [2], with example output before and after the fix in [3].
[1]
unreferenced object 0xffff888103479900 (size 256):
  comm "ip", pid 329, jiffies 4294775225 (age 28.561s)
  hex dump (first 32 bytes):
    00 a0 0c 15 81 88 ff ff 00 00 00 00 00 00 00 00  ................
    00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
  backtrace:
    [<ffffffff81a6051a>] kmalloc_trace+0x2a/0xe0
    [<ffffffff8406426c>] vlan_vid_add+0x30c/0x790
    [<ffffffff84068e21>] vlan_device_event+0x1491/0x21a0
    [<ffffffff81440c8e>] notifier_call_chain+0xbe/0x1f0
    [<ffffffff8372383a>] call_netdevice_notifiers_info+0xba/0x150
    [<ffffffff837590f2>] __dev_notify_flags+0x132/0x2e0
    [<ffffffff8375ad9f>] dev_change_flags+0x11f/0x180
    [<ffffffff8379af36>] do_setlink+0xb96/0x4060
    [<ffffffff837adf6a>] __rtnl_newlink+0xc0a/0x18a0
    [<ffffffff837aec6c>] rtnl_newlink+0x6c/0xa0
    [<ffffffff837ac64e>] rtnetlink_rcv_msg+0x43e/0xe00
    [<ffffffff839a99e0>] netlink_rcv_skb+0x170/0x440
    [<ffffffff839a738f>] netlink_unicast+0x53f/0x810
    [<ffffffff839a7fcb>] netlink_sendmsg+0x96b/0xe90
    [<ffffffff8369d12f>] ____sys_sendmsg+0x30f/0xa70
    [<ffffffff836a6d7a>] ___sys_sendmsg+0x13a/0x1e0
unreferenced object 0xffff88810f6a83e0 (size 32):
  comm "ip", pid 329, jiffies 4294775225 (age 28.561s)
  hex dump (first 32 bytes):
    a0 99 47 03 81 88 ff ff a0 99 47 03 81 88 ff ff  ..G.......G.....
    81 00 00 00 01 00 00 00 cc cc cc cc cc cc cc cc  ................
  backtrace:
    [<ffffffff81a6051a>] kmalloc_trace+0x2a/0xe0
    [<ffffffff84064369>] vlan_vid_add+0x409/0x790
    [<ffffffff84068e21>] vlan_device_event+0x1491/0x21a0
    [<ffffffff81440c8e>] notifier_call_chain+0xbe/0x1f0
    [<ffffffff8372383a>] call_netdevice_notifiers_info+0xba/0x150
    [<ffffffff837590f2>] __dev_notify_flags+0x132/0x2e0
    [<ffffffff8375ad9f>] dev_change_flags+0x11f/0x180
    [<ffffffff8379af36>] do_setlink+0xb96/0x4060
    [<ffffffff837adf6a>] __rtnl_newlink+0xc0a/0x18a0
    [<ffffffff837aec6c>] rtnl_newlink+0x6c/0xa0
    [<ffffffff837ac64e>] rtnetlink_rcv_msg+0x43e/0xe00
    [<ffffffff839a99e0>] netlink_rcv_skb+0x170/0x440
    [<ffffffff839a738f>] netlink_unicast+0x53f/0x810
    [<ffffffff839a7fcb>] netlink_sendmsg+0x96b/0xe90
    [<ffffffff8369d12f>] ____sys_sendmsg+0x30f/0xa70
    [<ffffffff836a6d7a>] ___sys_sendmsg+0x13a/0x1e0
[2]
ip link add name t-nlmon type nlmon
ip link add name t-dummy type dummy
ip link add name t-bond type bond mode active-backup

ip link set dev t-bond up
ip link set dev t-nlmon master t-bond
ip link set dev t-nlmon nomaster
ip link show dev t-bond
ip link set dev t-dummy master t-bond
ip link show dev t-bond

ip link del dev t-bond
ip link del dev t-dummy
ip link del dev t-nlmon
[3] Before:
12: t-bond: <NO-CARRIER,BROADCAST,MULTICAST,MASTER,UP> mtu 1500 qdisc noqueue state DOWN mode DEFAULT group default qlen 1000
    link/netlink

12: t-bond: <BROADCAST,MULTICAST,MASTER,LOWER_UP> mtu 1500 qdisc noqueue state UP mode DEFAULT group default qlen 1000
    link/ether 46:57:39:a4:46:a2 brd ff:ff:ff:ff:ff:ff
After:
12: t-bond: <NO-CARRIER,BROADCAST,MULTICAST,MASTER,UP> mtu 1500 qdisc noqueue state DOWN mode DEFAULT group default qlen 1000
    link/netlink

12: t-bond: <BROADCAST,MULTICAST,MASTER,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP mode DEFAULT group default qlen 1000
    link/ether 66:48:7b:74:b6:8a brd ff:ff:ff:ff:ff:ff
Fixes: e36b9d16c6a6 ("bonding: clean muticast addresses when device changes type") Fixes: 75c78500ddad ("bonding: remap muticast addresses without using dev_close() and dev_open()") Fixes: 9ec7eb60dcbc ("bonding: restore IFF_MASTER/SLAVE flags on bond enslave ether type change") Reported-by: Mirsad Goran Todorovac mirsad.todorovac@alu.unizg.hr Link: https://lore.kernel.org/netdev/78a8a03b-6070-3e6b-5042-f848dab16fb8@alu.uniz... Tested-by: Mirsad Goran Todorovac mirsad.todorovac@alu.unizg.hr Signed-off-by: Ido Schimmel idosch@nvidia.com Acked-by: Jay Vosburgh jay.vosburgh@canonical.com Signed-off-by: David S. Miller davem@davemloft.net Signed-off-by: Ziyang Xuan william.xuanziyang@huawei.com Reviewed-by: Yue Haibing yuehaibing@huawei.com Signed-off-by: Jialin Zhang zhangjialin11@huawei.com --- drivers/net/bonding/bond_main.c | 7 ++++--- 1 file changed, 4 insertions(+), 3 deletions(-)
diff --git a/drivers/net/bonding/bond_main.c b/drivers/net/bonding/bond_main.c index a30d24ad9110..67b7ef45eb83 100644 --- a/drivers/net/bonding/bond_main.c +++ b/drivers/net/bonding/bond_main.c @@ -1692,14 +1692,15 @@ void bond_lower_state_changed(struct slave *slave)
/* The bonding driver uses ether_setup() to convert a master bond device * to ARPHRD_ETHER, that resets the target netdevice's flags so we always - * have to restore the IFF_MASTER flag, and only restore IFF_SLAVE if it was set + * have to restore the IFF_MASTER flag, and only restore IFF_SLAVE and IFF_UP + * if they were set */ static void bond_ether_setup(struct net_device *bond_dev) { - unsigned int slave_flag = bond_dev->flags & IFF_SLAVE; + unsigned int flags = bond_dev->flags & (IFF_SLAVE | IFF_UP);
ether_setup(bond_dev); - bond_dev->flags |= IFF_MASTER | slave_flag; + bond_dev->flags |= IFF_MASTER | flags; bond_dev->priv_flags &= ~IFF_TX_SKB_SHARING; }