[PATCH OLK-6.6 v5 00/13] xfs: support log grant track bytes

Support log grant track bytes, instead of lsn. Darrick J. Wong (1): xfs: revert AIL TASK_KILLABLE threshold Dave Chinner (10): xfs: move and rename xfs_trans_committed_bulk xfs: AIL doesn't need manual pushing xfs: background AIL push should target physical space xfs: ensure log tail is always up to date xfs: l_last_sync_lsn is really AIL state xfs: collapse xlog_state_set_callback in caller xfs: track log space pinned by the AIL xfs: pass the full grant head to accounting functions xfs: grant heads track byte counts, not LSNs xfs: skip flushing log items during push Long Li (2): Revert "xfs: update the last_sync_lsn with ctx start lsn" xfs: fix incorrect tail lsn tracking when AIL is empty Documentation/ABI/testing/sysfs-fs-xfs | 18 +- fs/xfs/libxfs/xfs_defer.c | 4 +- fs/xfs/xfs_inode.c | 1 + fs/xfs/xfs_inode_item.c | 6 +- fs/xfs/xfs_log.c | 514 ++++++------------------- fs/xfs/xfs_log.h | 1 - fs/xfs/xfs_log_cil.c | 179 ++++++++- fs/xfs/xfs_log_priv.h | 61 ++- fs/xfs/xfs_log_recover.c | 24 +- fs/xfs/xfs_sysfs.c | 29 +- fs/xfs/xfs_trace.c | 1 + fs/xfs/xfs_trace.h | 42 +- fs/xfs/xfs_trans.c | 129 ------- fs/xfs/xfs_trans.h | 4 +- fs/xfs/xfs_trans_ail.c | 247 ++++++------ fs/xfs/xfs_trans_priv.h | 45 ++- 16 files changed, 556 insertions(+), 749 deletions(-) -- 2.39.2

hulk inclusion category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/IC8HLD -------------------------------- This reverts commit 84be6cefea7b9cffc28ed4b6f59dccd131a167f8. Signed-off-by: Long Li <leo.lilong@huawei.com> --- fs/xfs/xfs_log.c | 19 ++++++++----------- 1 file changed, 8 insertions(+), 11 deletions(-) diff --git a/fs/xfs/xfs_log.c b/fs/xfs/xfs_log.c index 1fad7cf850a7..ac085fbdbc86 100644 --- a/fs/xfs/xfs_log.c +++ b/fs/xfs/xfs_log.c @@ -2706,22 +2706,19 @@ xlog_get_lowest_lsn( static void xlog_state_set_callback( struct xlog *log, - struct xlog_in_core *iclog) + struct xlog_in_core *iclog, + xfs_lsn_t header_lsn) { - struct xfs_cil_ctx *ctx; - trace_xlog_iclog_callback(iclog, _RET_IP_); iclog->ic_state = XLOG_STATE_CALLBACK; - ctx = list_first_entry_or_null(&iclog->ic_callbacks, - struct xfs_cil_ctx, iclog_entry); - if (!ctx) - return; - ASSERT(XFS_LSN_CMP(atomic64_read(&log->l_last_sync_lsn), - ctx->start_lsn) <= 0); + header_lsn) <= 0); + + if (list_empty_careful(&iclog->ic_callbacks)) + return; - atomic64_set(&log->l_last_sync_lsn, ctx->start_lsn); + atomic64_set(&log->l_last_sync_lsn, header_lsn); xlog_grant_push_ail(log, 0); } @@ -2756,7 +2753,7 @@ xlog_state_iodone_process_iclog( lowest_lsn = xlog_get_lowest_lsn(log); if (lowest_lsn && XFS_LSN_CMP(lowest_lsn, header_lsn) < 0) return false; - xlog_state_set_callback(log, iclog); + xlog_state_set_callback(log, iclog, header_lsn); return false; default: /* -- 2.39.2

From: Dave Chinner <dchinner@redhat.com> mainline inclusion from mainline-v6.10-rc2 commit 613e2fdbbc7b4bd0cd324e3a025b3061eb8c947d category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/IC8HLD CVE: NA Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i... -------------------------------- Ever since the CIL and delayed logging was introduced, xfs_trans_committed_bulk() has been a purely CIL checkpoint completion function and not a transaction commit completion function. Now that we are adding log specific updates to this function, it really does not have anything to do with the transaction subsystem - it is really log and log item level functionality. This should be part of the CIL code as it is the callback that moves log items from the CIL checkpoint to the AIL. Move it and rename it to xlog_cil_ail_insert(). Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Chandan Babu R <chandanbabu@kernel.org> Signed-off-by: Long Li <leo.lilong@huawei.com> --- fs/xfs/xfs_log_cil.c | 132 +++++++++++++++++++++++++++++++++++++++- fs/xfs/xfs_trans.c | 129 --------------------------------------- fs/xfs/xfs_trans_priv.h | 3 - 3 files changed, 131 insertions(+), 133 deletions(-) diff --git a/fs/xfs/xfs_log_cil.c b/fs/xfs/xfs_log_cil.c index 67a99d94701e..e2e921484b6a 100644 --- a/fs/xfs/xfs_log_cil.c +++ b/fs/xfs/xfs_log_cil.c @@ -694,6 +694,136 @@ xlog_cil_insert_items( } } +static inline void +xlog_cil_ail_insert_batch( + struct xfs_ail *ailp, + struct xfs_ail_cursor *cur, + struct xfs_log_item **log_items, + int nr_items, + xfs_lsn_t commit_lsn) +{ + int i; + + spin_lock(&ailp->ail_lock); + /* xfs_trans_ail_update_bulk drops ailp->ail_lock */ + xfs_trans_ail_update_bulk(ailp, cur, log_items, nr_items, commit_lsn); + + for (i = 0; i < nr_items; i++) { + struct xfs_log_item *lip = log_items[i]; + + if (lip->li_ops->iop_unpin) + lip->li_ops->iop_unpin(lip, 0); + } +} + +/* + * Take the checkpoint's log vector chain of items and insert the attached log + * items into the AIL. This uses bulk insertion techniques to minimise AIL lock + * traffic. + * + * If we are called with the aborted flag set, it is because a log write during + * a CIL checkpoint commit has failed. In this case, all the items in the + * checkpoint have already gone through iop_committed and iop_committing, which + * means that checkpoint commit abort handling is treated exactly the same as an + * iclog write error even though we haven't started any IO yet. Hence in this + * case all we need to do is iop_committed processing, followed by an + * iop_unpin(aborted) call. + * + * The AIL cursor is used to optimise the insert process. If commit_lsn is not + * at the end of the AIL, the insert cursor avoids the need to walk the AIL to + * find the insertion point on every xfs_log_item_batch_insert() call. This + * saves a lot of needless list walking and is a net win, even though it + * slightly increases that amount of AIL lock traffic to set it up and tear it + * down. + */ +static void +xlog_cil_ail_insert( + struct xlog *log, + struct list_head *lv_chain, + xfs_lsn_t commit_lsn, + bool aborted) +{ +#define LOG_ITEM_BATCH_SIZE 32 + struct xfs_ail *ailp = log->l_ailp; + struct xfs_log_item *log_items[LOG_ITEM_BATCH_SIZE]; + struct xfs_log_vec *lv; + struct xfs_ail_cursor cur; + int i = 0; + + spin_lock(&ailp->ail_lock); + xfs_trans_ail_cursor_last(ailp, &cur, commit_lsn); + spin_unlock(&ailp->ail_lock); + + /* unpin all the log items */ + list_for_each_entry(lv, lv_chain, lv_list) { + struct xfs_log_item *lip = lv->lv_item; + xfs_lsn_t item_lsn; + + if (aborted) + set_bit(XFS_LI_ABORTED, &lip->li_flags); + + if (lip->li_ops->flags & XFS_ITEM_RELEASE_WHEN_COMMITTED) { + lip->li_ops->iop_release(lip); + continue; + } + + if (lip->li_ops->iop_committed) + item_lsn = lip->li_ops->iop_committed(lip, commit_lsn); + else + item_lsn = commit_lsn; + + /* item_lsn of -1 means the item needs no further processing */ + if (XFS_LSN_CMP(item_lsn, (xfs_lsn_t)-1) == 0) + continue; + + /* + * if we are aborting the operation, no point in inserting the + * object into the AIL as we are in a shutdown situation. + */ + if (aborted) { + ASSERT(xlog_is_shutdown(ailp->ail_log)); + if (lip->li_ops->iop_unpin) + lip->li_ops->iop_unpin(lip, 1); + continue; + } + + if (item_lsn != commit_lsn) { + + /* + * Not a bulk update option due to unusual item_lsn. + * Push into AIL immediately, rechecking the lsn once + * we have the ail lock. Then unpin the item. This does + * not affect the AIL cursor the bulk insert path is + * using. + */ + spin_lock(&ailp->ail_lock); + if (XFS_LSN_CMP(item_lsn, lip->li_lsn) > 0) + xfs_trans_ail_update(ailp, lip, item_lsn); + else + spin_unlock(&ailp->ail_lock); + if (lip->li_ops->iop_unpin) + lip->li_ops->iop_unpin(lip, 0); + continue; + } + + /* Item is a candidate for bulk AIL insert. */ + log_items[i++] = lv->lv_item; + if (i >= LOG_ITEM_BATCH_SIZE) { + xlog_cil_ail_insert_batch(ailp, &cur, log_items, + LOG_ITEM_BATCH_SIZE, commit_lsn); + i = 0; + } + } + + /* make sure we insert the remainder! */ + if (i) + xlog_cil_ail_insert_batch(ailp, &cur, log_items, i, commit_lsn); + + spin_lock(&ailp->ail_lock); + xfs_trans_ail_cursor_done(&cur); + spin_unlock(&ailp->ail_lock); +} + static void xlog_cil_free_logvec( struct list_head *lv_chain) @@ -733,7 +863,7 @@ xlog_cil_committed( spin_unlock(&ctx->cil->xc_push_lock); } - xfs_trans_committed_bulk(ctx->cil->xc_log->l_ailp, &ctx->lv_chain, + xlog_cil_ail_insert(ctx->cil->xc_log, &ctx->lv_chain, ctx->start_lsn, abort); xfs_extent_busy_sort(&ctx->busy_extents.extent_list); diff --git a/fs/xfs/xfs_trans.c b/fs/xfs/xfs_trans.c index ccfaa13e17f1..56049dd86370 100644 --- a/fs/xfs/xfs_trans.c +++ b/fs/xfs/xfs_trans.c @@ -725,135 +725,6 @@ xfs_trans_free_items( } } -static inline void -xfs_log_item_batch_insert( - struct xfs_ail *ailp, - struct xfs_ail_cursor *cur, - struct xfs_log_item **log_items, - int nr_items, - xfs_lsn_t commit_lsn) -{ - int i; - - spin_lock(&ailp->ail_lock); - /* xfs_trans_ail_update_bulk drops ailp->ail_lock */ - xfs_trans_ail_update_bulk(ailp, cur, log_items, nr_items, commit_lsn); - - for (i = 0; i < nr_items; i++) { - struct xfs_log_item *lip = log_items[i]; - - if (lip->li_ops->iop_unpin) - lip->li_ops->iop_unpin(lip, 0); - } -} - -/* - * Bulk operation version of xfs_trans_committed that takes a log vector of - * items to insert into the AIL. This uses bulk AIL insertion techniques to - * minimise lock traffic. - * - * If we are called with the aborted flag set, it is because a log write during - * a CIL checkpoint commit has failed. In this case, all the items in the - * checkpoint have already gone through iop_committed and iop_committing, which - * means that checkpoint commit abort handling is treated exactly the same - * as an iclog write error even though we haven't started any IO yet. Hence in - * this case all we need to do is iop_committed processing, followed by an - * iop_unpin(aborted) call. - * - * The AIL cursor is used to optimise the insert process. If commit_lsn is not - * at the end of the AIL, the insert cursor avoids the need to walk - * the AIL to find the insertion point on every xfs_log_item_batch_insert() - * call. This saves a lot of needless list walking and is a net win, even - * though it slightly increases that amount of AIL lock traffic to set it up - * and tear it down. - */ -void -xfs_trans_committed_bulk( - struct xfs_ail *ailp, - struct list_head *lv_chain, - xfs_lsn_t commit_lsn, - bool aborted) -{ -#define LOG_ITEM_BATCH_SIZE 32 - struct xfs_log_item *log_items[LOG_ITEM_BATCH_SIZE]; - struct xfs_log_vec *lv; - struct xfs_ail_cursor cur; - int i = 0; - - spin_lock(&ailp->ail_lock); - xfs_trans_ail_cursor_last(ailp, &cur, commit_lsn); - spin_unlock(&ailp->ail_lock); - - /* unpin all the log items */ - list_for_each_entry(lv, lv_chain, lv_list) { - struct xfs_log_item *lip = lv->lv_item; - xfs_lsn_t item_lsn; - - if (aborted) - set_bit(XFS_LI_ABORTED, &lip->li_flags); - - if (lip->li_ops->flags & XFS_ITEM_RELEASE_WHEN_COMMITTED) { - lip->li_ops->iop_release(lip); - continue; - } - - if (lip->li_ops->iop_committed) - item_lsn = lip->li_ops->iop_committed(lip, commit_lsn); - else - item_lsn = commit_lsn; - - /* item_lsn of -1 means the item needs no further processing */ - if (XFS_LSN_CMP(item_lsn, (xfs_lsn_t)-1) == 0) - continue; - - /* - * if we are aborting the operation, no point in inserting the - * object into the AIL as we are in a shutdown situation. - */ - if (aborted) { - ASSERT(xlog_is_shutdown(ailp->ail_log)); - if (lip->li_ops->iop_unpin) - lip->li_ops->iop_unpin(lip, 1); - continue; - } - - if (item_lsn != commit_lsn) { - - /* - * Not a bulk update option due to unusual item_lsn. - * Push into AIL immediately, rechecking the lsn once - * we have the ail lock. Then unpin the item. This does - * not affect the AIL cursor the bulk insert path is - * using. - */ - spin_lock(&ailp->ail_lock); - if (XFS_LSN_CMP(item_lsn, lip->li_lsn) > 0) - xfs_trans_ail_update(ailp, lip, item_lsn); - else - spin_unlock(&ailp->ail_lock); - if (lip->li_ops->iop_unpin) - lip->li_ops->iop_unpin(lip, 0); - continue; - } - - /* Item is a candidate for bulk AIL insert. */ - log_items[i++] = lv->lv_item; - if (i >= LOG_ITEM_BATCH_SIZE) { - xfs_log_item_batch_insert(ailp, &cur, log_items, - LOG_ITEM_BATCH_SIZE, commit_lsn); - i = 0; - } - } - - /* make sure we insert the remainder! */ - if (i) - xfs_log_item_batch_insert(ailp, &cur, log_items, i, commit_lsn); - - spin_lock(&ailp->ail_lock); - xfs_trans_ail_cursor_done(&cur); - spin_unlock(&ailp->ail_lock); -} - /* * Sort transaction items prior to running precommit operations. This will * attempt to order the items such that they will always be locked in the same diff --git a/fs/xfs/xfs_trans_priv.h b/fs/xfs/xfs_trans_priv.h index d5400150358e..52a45f0a5ef1 100644 --- a/fs/xfs/xfs_trans_priv.h +++ b/fs/xfs/xfs_trans_priv.h @@ -19,9 +19,6 @@ void xfs_trans_add_item(struct xfs_trans *, struct xfs_log_item *); void xfs_trans_del_item(struct xfs_log_item *); void xfs_trans_unreserve_and_mod_sb(struct xfs_trans *tp); -void xfs_trans_committed_bulk(struct xfs_ail *ailp, - struct list_head *lv_chain, - xfs_lsn_t commit_lsn, bool aborted); /* * AIL traversal cursor. * -- 2.39.2

From: Dave Chinner <dchinner@redhat.com> mainline inclusion from mainline-v6.10-rc2 commit 9adf40249e6cfd7231c2973bb305f6c20902bfd9 category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/IC8HLD CVE: NA Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i... -------------------------------- We have a mechanism that checks the amount of log space remaining available every time we make a transaction reservation. If the amount of space is below a threshold (25% free) we push on the AIL to tell it to do more work. To do this, we end up calculating the LSN that the AIL needs to push to on every reservation and updating the push target for the AIL with that new target LSN. This is silly and expensive. The AIL is perfectly capable of calculating the push target itself, and it will always be running when the AIL contains objects. What the target does is determine if the AIL needs to do any work before it goes back to sleep. If we haven't run out of reservation space or memory (or some other push all trigger), it will simply go back to sleep for a while if there is more than 25% of the journal space free without doing anything. If there are items in the AIL at a lower LSN than the target, it will try to push up to the target or to the point of getting stuck before going back to sleep and trying again soon after.` Hence we can modify the AIL to calculate it's own 25% push target before it starts a push using the same reserve grant head based calculation as is currently used, and remove all the places where we ask the AIL to push to a new 25% free target. We can also drop the minimum free space size of 256BBs from the calculation because the 25% of a minimum sized log is *always going to be larger than 256BBs. This does still require a manual push in certain circumstances. These circumstances arise when the AIL is not full, but the reservation grants consume the entire of the free space in the log. In this case, we still need to push on the AIL to free up space, so when we hit this condition (i.e. reservation going to sleep to wait on log space) we do a single push to tell the AIL it should empty itself. This will keep the AIL moving as new reservations come in and want more space, rather than keep queuing them and having to push the AIL repeatedly. The reason for using the "push all" when grant space runs out is that we can run out of grant space when there is more than 25% of the log free. Small logs are notorious for this, and we have a hack in the log callback code (xlog_state_set_callback()) where we push the AIL because the *head* moved) to ensure that we kick the AIL when we consume space in it because that can push us over the "less than 25% available" available that starts tail pushing back up again. Hence when we run out of grant space and are going to sleep, we have to consider that the grant space may be consuming almost all the log space and there is almost nothing in the AIL. In this situation, the AIL pins the tail and moving the tail forwards is the only way the grant space will come available, so we have to force the AIL to push everything to guarantee grant space will eventually be returned. Hence triggering a "push all" just before sleeping removes all the nasty corner cases we have in other parts of the code that work around the "we didn't ask the AIL to push enough to free grant space" condition that leads to log space hangs... Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Chandan Babu R <chandanbabu@kernel.org> Conflicts: fs/xfs/xfs_log.h [not merge mainline commit 5302a5c8beb2 ("xfs: only clear log incompat flags at clean unmount")] Signed-off-by: Long Li <leo.lilong@huawei.com> --- fs/xfs/libxfs/xfs_defer.c | 4 +- fs/xfs/xfs_log.c | 135 ++----------------------------- fs/xfs/xfs_log.h | 1 - fs/xfs/xfs_log_priv.h | 2 + fs/xfs/xfs_trans_ail.c | 162 +++++++++++++++++--------------------- fs/xfs/xfs_trans_priv.h | 33 ++++++-- 6 files changed, 108 insertions(+), 229 deletions(-) diff --git a/fs/xfs/libxfs/xfs_defer.c b/fs/xfs/libxfs/xfs_defer.c index 363da37a8e7f..a78ec2438e0e 100644 --- a/fs/xfs/libxfs/xfs_defer.c +++ b/fs/xfs/libxfs/xfs_defer.c @@ -12,12 +12,14 @@ #include "xfs_mount.h" #include "xfs_defer.h" #include "xfs_trans.h" +#include "xfs_trans_priv.h" #include "xfs_buf_item.h" #include "xfs_inode.h" #include "xfs_inode_item.h" #include "xfs_trace.h" #include "xfs_icache.h" #include "xfs_log.h" +#include "xfs_log_priv.h" #include "xfs_rmap.h" #include "xfs_refcount.h" #include "xfs_bmap.h" @@ -463,7 +465,7 @@ xfs_defer_relog( * the log threshold once per call. */ if (threshold_lsn == NULLCOMMITLSN) { - threshold_lsn = xlog_grant_push_threshold(log, 0); + threshold_lsn = xfs_ail_push_target(log->l_ailp); if (threshold_lsn == NULLCOMMITLSN) break; } diff --git a/fs/xfs/xfs_log.c b/fs/xfs/xfs_log.c index ac085fbdbc86..e45efa6e1a8e 100644 --- a/fs/xfs/xfs_log.c +++ b/fs/xfs/xfs_log.c @@ -30,10 +30,6 @@ xlog_alloc_log( struct xfs_buftarg *log_target, xfs_daddr_t blk_offset, int num_bblks); -STATIC int -xlog_space_left( - struct xlog *log, - atomic64_t *head); STATIC void xlog_dealloc_log( struct xlog *log); @@ -51,10 +47,6 @@ xlog_state_get_iclog_space( struct xlog_ticket *ticket, int *logoffsetp); STATIC void -xlog_grant_push_ail( - struct xlog *log, - int need_bytes); -STATIC void xlog_sync( struct xlog *log, struct xlog_in_core *iclog, @@ -242,42 +234,15 @@ xlog_grant_head_wake( { struct xlog_ticket *tic; int need_bytes; - bool woken_task = false; list_for_each_entry(tic, &head->waiters, t_queue) { - - /* - * There is a chance that the size of the CIL checkpoints in - * progress at the last AIL push target calculation resulted in - * limiting the target to the log head (l_last_sync_lsn) at the - * time. This may not reflect where the log head is now as the - * CIL checkpoints may have completed. - * - * Hence when we are woken here, it may be that the head of the - * log that has moved rather than the tail. As the tail didn't - * move, there still won't be space available for the - * reservation we require. However, if the AIL has already - * pushed to the target defined by the old log head location, we - * will hang here waiting for something else to update the AIL - * push target. - * - * Therefore, if there isn't space to wake the first waiter on - * the grant head, we need to push the AIL again to ensure the - * target reflects both the current log tail and log head - * position before we wait for the tail to move again. - */ - need_bytes = xlog_ticket_reservation(log, head, tic); - if (*free_bytes < need_bytes) { - if (!woken_task) - xlog_grant_push_ail(log, need_bytes); + if (*free_bytes < need_bytes) return false; - } *free_bytes -= need_bytes; trace_xfs_log_grant_wake_up(log, tic); wake_up_process(tic->t_task); - woken_task = true; } return true; @@ -296,13 +261,15 @@ xlog_grant_head_wait( do { if (xlog_is_shutdown(log)) goto shutdown; - xlog_grant_push_ail(log, need_bytes); __set_current_state(TASK_UNINTERRUPTIBLE); spin_unlock(&head->lock); XFS_STATS_INC(log->l_mp, xs_sleep_logspace); + /* Push on the AIL to free up all the log space. */ + xfs_ail_push_all(log->l_ailp); + trace_xfs_log_grant_sleep(log, tic); schedule(); trace_xfs_log_grant_wake(log, tic); @@ -418,9 +385,6 @@ xfs_log_regrant( * of rolling transactions in the log easily. */ tic->t_tid++; - - xlog_grant_push_ail(log, tic->t_unit_res); - tic->t_curr_res = tic->t_unit_res; if (tic->t_cnt > 0) return 0; @@ -477,12 +441,7 @@ xfs_log_reserve( ASSERT(*ticp == NULL); tic = xlog_ticket_alloc(log, unit_bytes, cnt, permanent); *ticp = tic; - - xlog_grant_push_ail(log, tic->t_cnt ? tic->t_unit_res * tic->t_cnt - : tic->t_unit_res); - trace_xfs_log_reserve(log, tic); - error = xlog_grant_head_check(log, &log->l_reserve_head, tic, &need_bytes); if (error) @@ -1332,7 +1291,7 @@ xlog_assign_tail_lsn( * shortcut invalidity asserts in this case so that we don't trigger them * falsely. */ -STATIC int +int xlog_space_left( struct xlog *log, atomic64_t *head) @@ -1674,89 +1633,6 @@ xlog_alloc_log( return ERR_PTR(error); } /* xlog_alloc_log */ -/* - * Compute the LSN that we'd need to push the log tail towards in order to have - * (a) enough on-disk log space to log the number of bytes specified, (b) at - * least 25% of the log space free, and (c) at least 256 blocks free. If the - * log free space already meets all three thresholds, this function returns - * NULLCOMMITLSN. - */ -xfs_lsn_t -xlog_grant_push_threshold( - struct xlog *log, - int need_bytes) -{ - xfs_lsn_t threshold_lsn = 0; - xfs_lsn_t last_sync_lsn; - int free_blocks; - int free_bytes; - int threshold_block; - int threshold_cycle; - int free_threshold; - - ASSERT(BTOBB(need_bytes) < log->l_logBBsize); - - free_bytes = xlog_space_left(log, &log->l_reserve_head.grant); - free_blocks = BTOBBT(free_bytes); - - /* - * Set the threshold for the minimum number of free blocks in the - * log to the maximum of what the caller needs, one quarter of the - * log, and 256 blocks. - */ - free_threshold = BTOBB(need_bytes); - free_threshold = max(free_threshold, (log->l_logBBsize >> 2)); - free_threshold = max(free_threshold, 256); - if (free_blocks >= free_threshold) - return NULLCOMMITLSN; - - xlog_crack_atomic_lsn(&log->l_tail_lsn, &threshold_cycle, - &threshold_block); - threshold_block += free_threshold; - if (threshold_block >= log->l_logBBsize) { - threshold_block -= log->l_logBBsize; - threshold_cycle += 1; - } - threshold_lsn = xlog_assign_lsn(threshold_cycle, - threshold_block); - /* - * Don't pass in an lsn greater than the lsn of the last - * log record known to be on disk. Use a snapshot of the last sync lsn - * so that it doesn't change between the compare and the set. - */ - last_sync_lsn = atomic64_read(&log->l_last_sync_lsn); - if (XFS_LSN_CMP(threshold_lsn, last_sync_lsn) > 0) - threshold_lsn = last_sync_lsn; - - return threshold_lsn; -} - -/* - * Push the tail of the log if we need to do so to maintain the free log space - * thresholds set out by xlog_grant_push_threshold. We may need to adopt a - * policy which pushes on an lsn which is further along in the log once we - * reach the high water mark. In this manner, we would be creating a low water - * mark. - */ -STATIC void -xlog_grant_push_ail( - struct xlog *log, - int need_bytes) -{ - xfs_lsn_t threshold_lsn; - - threshold_lsn = xlog_grant_push_threshold(log, need_bytes); - if (threshold_lsn == NULLCOMMITLSN || xlog_is_shutdown(log)) - return; - - /* - * Get the transaction layer to kick the dirty buffers out to - * disk asynchronously. No point in trying to do this if - * the filesystem is shutting down. - */ - xfs_ail_push(log->l_ailp, threshold_lsn); -} - /* * Stamp cycle number in every block */ @@ -2719,7 +2595,6 @@ xlog_state_set_callback( return; atomic64_set(&log->l_last_sync_lsn, header_lsn); - xlog_grant_push_ail(log, 0); } /* diff --git a/fs/xfs/xfs_log.h b/fs/xfs/xfs_log.h index 2728886c2963..6b6ee35b3885 100644 --- a/fs/xfs/xfs_log.h +++ b/fs/xfs/xfs_log.h @@ -156,7 +156,6 @@ int xfs_log_quiesce(struct xfs_mount *mp); void xfs_log_clean(struct xfs_mount *mp); bool xfs_log_check_lsn(struct xfs_mount *, xfs_lsn_t); -xfs_lsn_t xlog_grant_push_threshold(struct xlog *log, int need_bytes); bool xlog_force_shutdown(struct xlog *log, uint32_t shutdown_flags); void xlog_use_incompat_feat(struct xlog *log); diff --git a/fs/xfs/xfs_log_priv.h b/fs/xfs/xfs_log_priv.h index e30c06ec20e3..f3a68b4a351f 100644 --- a/fs/xfs/xfs_log_priv.h +++ b/fs/xfs/xfs_log_priv.h @@ -578,6 +578,8 @@ xlog_assign_grant_head(atomic64_t *head, int cycle, int space) atomic64_set(head, xlog_assign_grant_head_val(cycle, space)); } +int xlog_space_left(struct xlog *log, atomic64_t *head); + /* * Committed Item List interfaces */ diff --git a/fs/xfs/xfs_trans_ail.c b/fs/xfs/xfs_trans_ail.c index 1098452e7f95..31a4e5e5d899 100644 --- a/fs/xfs/xfs_trans_ail.c +++ b/fs/xfs/xfs_trans_ail.c @@ -134,25 +134,6 @@ xfs_ail_min_lsn( return lsn; } -/* - * Return the maximum lsn held in the AIL, or zero if the AIL is empty. - */ -static xfs_lsn_t -xfs_ail_max_lsn( - struct xfs_ail *ailp) -{ - xfs_lsn_t lsn = 0; - struct xfs_log_item *lip; - - spin_lock(&ailp->ail_lock); - lip = xfs_ail_max(ailp); - if (lip) - lsn = lip->li_lsn; - spin_unlock(&ailp->ail_lock); - - return lsn; -} - /* * The cursor keeps track of where our current traversal is up to by tracking * the next item in the list for us. However, for this to be safe, removing an @@ -414,6 +395,56 @@ xfsaild_push_item( return lip->li_ops->iop_push(lip, &ailp->ail_buf_list); } +/* + * Compute the LSN that we'd need to push the log tail towards in order to have + * at least 25% of the log space free. If the log free space already meets this + * threshold, this function returns NULLCOMMITLSN. + */ +xfs_lsn_t +__xfs_ail_push_target( + struct xfs_ail *ailp) +{ + struct xlog *log = ailp->ail_log; + xfs_lsn_t threshold_lsn = 0; + xfs_lsn_t last_sync_lsn; + int free_blocks; + int free_bytes; + int threshold_block; + int threshold_cycle; + int free_threshold; + + free_bytes = xlog_space_left(log, &log->l_reserve_head.grant); + free_blocks = BTOBBT(free_bytes); + + /* + * The threshold for the minimum number of free blocks is one quarter of + * the entire log space. + */ + free_threshold = log->l_logBBsize >> 2; + if (free_blocks >= free_threshold) + return NULLCOMMITLSN; + + xlog_crack_atomic_lsn(&log->l_tail_lsn, &threshold_cycle, + &threshold_block); + threshold_block += free_threshold; + if (threshold_block >= log->l_logBBsize) { + threshold_block -= log->l_logBBsize; + threshold_cycle += 1; + } + threshold_lsn = xlog_assign_lsn(threshold_cycle, + threshold_block); + /* + * Don't pass in an lsn greater than the lsn of the last + * log record known to be on disk. Use a snapshot of the last sync lsn + * so that it doesn't change between the compare and the set. + */ + last_sync_lsn = atomic64_read(&log->l_last_sync_lsn); + if (XFS_LSN_CMP(threshold_lsn, last_sync_lsn) > 0) + threshold_lsn = last_sync_lsn; + + return threshold_lsn; +} + static long xfsaild_push( struct xfs_ail *ailp) @@ -454,21 +485,24 @@ xfsaild_push( * capture updates that occur after the sync push waiter has gone to * sleep. */ - if (waitqueue_active(&ailp->ail_empty)) { + if (test_bit(XFS_AIL_OPSTATE_PUSH_ALL, &ailp->ail_opstate) || + waitqueue_active(&ailp->ail_empty)) { lip = xfs_ail_max(ailp); if (lip) target = lip->li_lsn; + else + clear_bit(XFS_AIL_OPSTATE_PUSH_ALL, &ailp->ail_opstate); } else { - /* barrier matches the ail_target update in xfs_ail_push() */ - smp_rmb(); - target = ailp->ail_target; - ailp->ail_target_prev = target; + target = __xfs_ail_push_target(ailp); } + if (target == NULLCOMMITLSN) + goto out_done; + /* we're done if the AIL is empty or our push has reached the end */ lip = xfs_trans_ail_cursor_first(ailp, &cur, ailp->ail_last_pushed_lsn); if (!lip) - goto out_done; + goto out_done_cursor; XFS_STATS_INC(mp, xs_push_ail); @@ -553,8 +587,9 @@ xfsaild_push( lsn = lip->li_lsn; } -out_done: +out_done_cursor: xfs_trans_ail_cursor_done(&cur); +out_done: spin_unlock(&ailp->ail_lock); if (xfs_buf_delwri_submit_nowait(&ailp->ail_buf_list)) @@ -603,7 +638,7 @@ xfsaild( set_freezable(); while (1) { - if (tout && tout <= 20) + if (tout) set_current_state(TASK_KILLABLE|TASK_FREEZABLE); else set_current_state(TASK_INTERRUPTIBLE|TASK_FREEZABLE); @@ -639,21 +674,9 @@ xfsaild( break; } + /* Idle if the AIL is empty. */ spin_lock(&ailp->ail_lock); - - /* - * Idle if the AIL is empty and we are not racing with a target - * update. We check the AIL after we set the task to a sleep - * state to guarantee that we either catch an ail_target update - * or that a wake_up resets the state to TASK_RUNNING. - * Otherwise, we run the risk of sleeping indefinitely. - * - * The barrier matches the ail_target update in xfs_ail_push(). - */ - smp_rmb(); - if (!xfs_ail_min(ailp) && - ailp->ail_target == ailp->ail_target_prev && - list_empty(&ailp->ail_buf_list)) { + if (!xfs_ail_min(ailp) && list_empty(&ailp->ail_buf_list)) { spin_unlock(&ailp->ail_lock); schedule(); tout = 0; @@ -675,56 +698,6 @@ xfsaild( return 0; } -/* - * This routine is called to move the tail of the AIL forward. It does this by - * trying to flush items in the AIL whose lsns are below the given - * threshold_lsn. - * - * The push is run asynchronously in a workqueue, which means the caller needs - * to handle waiting on the async flush for space to become available. - * We don't want to interrupt any push that is in progress, hence we only queue - * work if we set the pushing bit appropriately. - * - * We do this unlocked - we only need to know whether there is anything in the - * AIL at the time we are called. We don't need to access the contents of - * any of the objects, so the lock is not needed. - */ -void -xfs_ail_push( - struct xfs_ail *ailp, - xfs_lsn_t threshold_lsn) -{ - struct xfs_log_item *lip; - - lip = xfs_ail_min(ailp); - if (!lip || xlog_is_shutdown(ailp->ail_log) || - XFS_LSN_CMP(threshold_lsn, ailp->ail_target) <= 0) - return; - - /* - * Ensure that the new target is noticed in push code before it clears - * the XFS_AIL_PUSHING_BIT. - */ - smp_wmb(); - xfs_trans_ail_copy_lsn(ailp, &ailp->ail_target, &threshold_lsn); - smp_wmb(); - - wake_up_process(ailp->ail_task); -} - -/* - * Push out all items in the AIL immediately - */ -void -xfs_ail_push_all( - struct xfs_ail *ailp) -{ - xfs_lsn_t threshold_lsn = xfs_ail_max_lsn(ailp); - - if (threshold_lsn) - xfs_ail_push(ailp, threshold_lsn); -} - /* * Push out all items in the AIL immediately and wait until the AIL is empty. */ @@ -829,6 +802,13 @@ xfs_trans_ail_update_bulk( if (!list_empty(&tmp)) xfs_ail_splice(ailp, cur, &tmp, lsn); + /* + * If this is the first insert, wake up the push daemon so it can + * actively scan for items to push. + */ + if (!mlip) + wake_up_process(ailp->ail_task); + xfs_ail_update_finish(ailp, tail_lsn); } diff --git a/fs/xfs/xfs_trans_priv.h b/fs/xfs/xfs_trans_priv.h index 52a45f0a5ef1..9a131e7fae94 100644 --- a/fs/xfs/xfs_trans_priv.h +++ b/fs/xfs/xfs_trans_priv.h @@ -52,16 +52,18 @@ struct xfs_ail { struct xlog *ail_log; struct task_struct *ail_task; struct list_head ail_head; - xfs_lsn_t ail_target; - xfs_lsn_t ail_target_prev; struct list_head ail_cursors; spinlock_t ail_lock; xfs_lsn_t ail_last_pushed_lsn; int ail_log_flush; + unsigned long ail_opstate; struct list_head ail_buf_list; wait_queue_head_t ail_empty; }; +/* Push all items out of the AIL immediately. */ +#define XFS_AIL_OPSTATE_PUSH_ALL 0u + /* * From xfs_trans_ail.c */ @@ -98,10 +100,29 @@ void xfs_ail_update_finish(struct xfs_ail *ailp, xfs_lsn_t old_lsn) __releases(ailp->ail_lock); void xfs_trans_ail_delete(struct xfs_log_item *lip, int shutdown_type); -void xfs_ail_push(struct xfs_ail *, xfs_lsn_t); -void xfs_ail_push_all(struct xfs_ail *); -void xfs_ail_push_all_sync(struct xfs_ail *); -struct xfs_log_item *xfs_ail_min(struct xfs_ail *ailp); +static inline void xfs_ail_push(struct xfs_ail *ailp) +{ + wake_up_process(ailp->ail_task); +} + +static inline void xfs_ail_push_all(struct xfs_ail *ailp) +{ + if (!test_and_set_bit(XFS_AIL_OPSTATE_PUSH_ALL, &ailp->ail_opstate)) + xfs_ail_push(ailp); +} + +xfs_lsn_t __xfs_ail_push_target(struct xfs_ail *ailp); +static inline xfs_lsn_t xfs_ail_push_target(struct xfs_ail *ailp) +{ + xfs_lsn_t lsn; + + spin_lock(&ailp->ail_lock); + lsn = __xfs_ail_push_target(ailp); + spin_unlock(&ailp->ail_lock); + return lsn; +} + +void xfs_ail_push_all_sync(struct xfs_ail *ailp); xfs_lsn_t xfs_ail_min_lsn(struct xfs_ail *ailp); struct xfs_log_item * xfs_trans_ail_cursor_first(struct xfs_ail *ailp, -- 2.39.2

From: Dave Chinner <dchinner@redhat.com> mainline inclusion from mainline-v6.10-rc2 commit b50b4c49d8d79af05ac3bb3587f58589713139cc category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/IC8HLD CVE: NA Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i... -------------------------------- Currently the AIL attempts to keep 25% of the "log space" free, where the current used space is tracked by the reserve grant head. That is, it tracks both physical space used plus the amount reserved by transactions in progress. When we start tail pushing, we are trying to make space for new reservations by writing back older metadata and the log is generally physically full of dirty metadata, and reservations for modifications in flight take up whatever space the AIL can physically free up. Hence we don't really need to take into account the reservation space that has been used - we just need to keep the log tail moving as fast as we can to free up space for more reservations to be made. We know exactly how much physical space the journal is consuming in the AIL (i.e. max LSN - min LSN) so we can base push thresholds directly on this state rather than have to look at grant head reservations to determine how much to physically push out of the log. This also allows code that needs to know if log items in the current transaction need to be pushed or re-logged to simply sample the current target - they don't need to calculate the current target themselves. This avoids the need for any locking when doing such checks. Further, moving to a physical target means we don't need "push all until empty semantics" like were introduced in the previous patch. We can now test and clear the "push all" as a one-shot command to set the target to the current head of the AIL. This allows the xfsaild to maximise the use of log space right up to the point where conditions indicate that the xfsaild is not keeping up with load and it needs to work harder, and as soon as those constraints go away (i.e. external code no longer needs everything pushed) the xfsaild will return to maintaining the normal 25% free space thresholds. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Chandan Babu R <chandanbabu@kernel.org> Conflicts: fs/xfs/xfs_log_priv.h [not merge mainline commit 2c03d9560ece ("xfs: fix CIL sparse lock context warnings")] Signed-off-by: Long Li <leo.lilong@huawei.com> --- fs/xfs/libxfs/xfs_defer.c | 2 +- fs/xfs/xfs_log_priv.h | 18 ++++++ fs/xfs/xfs_trans_ail.c | 116 +++++++++++++++++++------------------- fs/xfs/xfs_trans_priv.h | 11 +--- 4 files changed, 80 insertions(+), 67 deletions(-) diff --git a/fs/xfs/libxfs/xfs_defer.c b/fs/xfs/libxfs/xfs_defer.c index a78ec2438e0e..38be292f6144 100644 --- a/fs/xfs/libxfs/xfs_defer.c +++ b/fs/xfs/libxfs/xfs_defer.c @@ -465,7 +465,7 @@ xfs_defer_relog( * the log threshold once per call. */ if (threshold_lsn == NULLCOMMITLSN) { - threshold_lsn = xfs_ail_push_target(log->l_ailp); + threshold_lsn = xfs_ail_get_push_target(log->l_ailp); if (threshold_lsn == NULLCOMMITLSN) break; } diff --git a/fs/xfs/xfs_log_priv.h b/fs/xfs/xfs_log_priv.h index f3a68b4a351f..7cec0b9647bf 100644 --- a/fs/xfs/xfs_log_priv.h +++ b/fs/xfs/xfs_log_priv.h @@ -627,6 +627,24 @@ xlog_wait( int xlog_wait_on_iclog(struct xlog_in_core *iclog); +/* Calculate the distance between two LSNs in bytes */ +static inline uint64_t +xlog_lsn_sub( + struct xlog *log, + xfs_lsn_t high, + xfs_lsn_t low) +{ + uint32_t hi_cycle = CYCLE_LSN(high); + uint32_t hi_block = BLOCK_LSN(high); + uint32_t lo_cycle = CYCLE_LSN(low); + uint32_t lo_block = BLOCK_LSN(low); + + if (hi_cycle == lo_cycle) + return BBTOB(hi_block - lo_block); + ASSERT((hi_cycle == lo_cycle + 1) || xlog_is_shutdown(log)); + return (uint64_t)log->l_logsize - BBTOB(lo_block - hi_block); +} + /* * The LSN is valid so long as it is behind the current LSN. If it isn't, this * means that the next log record that includes this metadata could have a diff --git a/fs/xfs/xfs_trans_ail.c b/fs/xfs/xfs_trans_ail.c index 31a4e5e5d899..fddc9dcf834f 100644 --- a/fs/xfs/xfs_trans_ail.c +++ b/fs/xfs/xfs_trans_ail.c @@ -398,51 +398,69 @@ xfsaild_push_item( /* * Compute the LSN that we'd need to push the log tail towards in order to have * at least 25% of the log space free. If the log free space already meets this - * threshold, this function returns NULLCOMMITLSN. + * threshold, this function returns the lowest LSN in the AIL to slowly keep + * writeback ticking over and the tail of the log moving forward. */ -xfs_lsn_t -__xfs_ail_push_target( +static xfs_lsn_t +xfs_ail_calc_push_target( struct xfs_ail *ailp) { - struct xlog *log = ailp->ail_log; - xfs_lsn_t threshold_lsn = 0; - xfs_lsn_t last_sync_lsn; - int free_blocks; - int free_bytes; - int threshold_block; - int threshold_cycle; - int free_threshold; - - free_bytes = xlog_space_left(log, &log->l_reserve_head.grant); - free_blocks = BTOBBT(free_bytes); + struct xlog *log = ailp->ail_log; + struct xfs_log_item *lip; + xfs_lsn_t target_lsn; + xfs_lsn_t max_lsn; + xfs_lsn_t min_lsn; + int32_t free_bytes; + uint32_t target_block; + uint32_t target_cycle; + + lockdep_assert_held(&ailp->ail_lock); + + lip = xfs_ail_max(ailp); + if (!lip) + return NULLCOMMITLSN; + + max_lsn = lip->li_lsn; + min_lsn = __xfs_ail_min_lsn(ailp); /* - * The threshold for the minimum number of free blocks is one quarter of - * the entire log space. + * If we are supposed to push all the items in the AIL, we want to push + * to the current head. We then clear the push flag so that we don't + * keep pushing newly queued items beyond where the push all command was + * run. If the push waiter wants to empty the ail, it should queue + * itself on the ail_empty wait queue. */ - free_threshold = log->l_logBBsize >> 2; - if (free_blocks >= free_threshold) - return NULLCOMMITLSN; + if (test_and_clear_bit(XFS_AIL_OPSTATE_PUSH_ALL, &ailp->ail_opstate)) + return max_lsn; + + /* If someone wants the AIL empty, keep pushing everything we have. */ + if (waitqueue_active(&ailp->ail_empty)) + return max_lsn; - xlog_crack_atomic_lsn(&log->l_tail_lsn, &threshold_cycle, - &threshold_block); - threshold_block += free_threshold; - if (threshold_block >= log->l_logBBsize) { - threshold_block -= log->l_logBBsize; - threshold_cycle += 1; - } - threshold_lsn = xlog_assign_lsn(threshold_cycle, - threshold_block); /* - * Don't pass in an lsn greater than the lsn of the last - * log record known to be on disk. Use a snapshot of the last sync lsn - * so that it doesn't change between the compare and the set. + * Background pushing - attempt to keep 25% of the log free and if we + * have that much free retain the existing target. */ - last_sync_lsn = atomic64_read(&log->l_last_sync_lsn); - if (XFS_LSN_CMP(threshold_lsn, last_sync_lsn) > 0) - threshold_lsn = last_sync_lsn; + free_bytes = log->l_logsize - xlog_lsn_sub(log, max_lsn, min_lsn); + if (free_bytes >= log->l_logsize >> 2) + return ailp->ail_target; + + target_cycle = CYCLE_LSN(min_lsn); + target_block = BLOCK_LSN(min_lsn) + (log->l_logBBsize >> 2); + if (target_block >= log->l_logBBsize) { + target_block -= log->l_logBBsize; + target_cycle += 1; + } + target_lsn = xlog_assign_lsn(target_cycle, target_block); + + /* Cap the target to the highest LSN known to be in the AIL. */ + if (XFS_LSN_CMP(target_lsn, max_lsn) > 0) + return max_lsn; - return threshold_lsn; + /* If the existing target is higher than the new target, keep it. */ + if (XFS_LSN_CMP(ailp->ail_target, target_lsn) >= 0) + return ailp->ail_target; + return target_lsn; } static long @@ -453,7 +471,6 @@ xfsaild_push( struct xfs_ail_cursor cur; struct xfs_log_item *lip; xfs_lsn_t lsn; - xfs_lsn_t target = NULLCOMMITLSN; long tout; int stuck = 0; int flushing = 0; @@ -478,25 +495,8 @@ xfsaild_push( } spin_lock(&ailp->ail_lock); - - /* - * If we have a sync push waiter, we always have to push till the AIL is - * empty. Update the target to point to the end of the AIL so that - * capture updates that occur after the sync push waiter has gone to - * sleep. - */ - if (test_bit(XFS_AIL_OPSTATE_PUSH_ALL, &ailp->ail_opstate) || - waitqueue_active(&ailp->ail_empty)) { - lip = xfs_ail_max(ailp); - if (lip) - target = lip->li_lsn; - else - clear_bit(XFS_AIL_OPSTATE_PUSH_ALL, &ailp->ail_opstate); - } else { - target = __xfs_ail_push_target(ailp); - } - - if (target == NULLCOMMITLSN) + WRITE_ONCE(ailp->ail_target, xfs_ail_calc_push_target(ailp)); + if (ailp->ail_target == NULLCOMMITLSN) goto out_done; /* we're done if the AIL is empty or our push has reached the end */ @@ -506,10 +506,10 @@ xfsaild_push( XFS_STATS_INC(mp, xs_push_ail); - ASSERT(target != NULLCOMMITLSN); + ASSERT(ailp->ail_target != NULLCOMMITLSN); lsn = lip->li_lsn; - while ((XFS_LSN_CMP(lip->li_lsn, target) <= 0)) { + while ((XFS_LSN_CMP(lip->li_lsn, ailp->ail_target) <= 0)) { int lock_result; /* @@ -595,7 +595,7 @@ xfsaild_push( if (xfs_buf_delwri_submit_nowait(&ailp->ail_buf_list)) ailp->ail_log_flush++; - if (!count || XFS_LSN_CMP(lsn, target) >= 0) { + if (!count || XFS_LSN_CMP(lsn, ailp->ail_target) >= 0) { /* * We reached the target or the AIL is empty, so wait a bit * longer for I/O to complete and remove pushed items from the diff --git a/fs/xfs/xfs_trans_priv.h b/fs/xfs/xfs_trans_priv.h index 9a131e7fae94..60b4707c3a65 100644 --- a/fs/xfs/xfs_trans_priv.h +++ b/fs/xfs/xfs_trans_priv.h @@ -59,6 +59,7 @@ struct xfs_ail { unsigned long ail_opstate; struct list_head ail_buf_list; wait_queue_head_t ail_empty; + xfs_lsn_t ail_target; }; /* Push all items out of the AIL immediately. */ @@ -111,15 +112,9 @@ static inline void xfs_ail_push_all(struct xfs_ail *ailp) xfs_ail_push(ailp); } -xfs_lsn_t __xfs_ail_push_target(struct xfs_ail *ailp); -static inline xfs_lsn_t xfs_ail_push_target(struct xfs_ail *ailp) +static inline xfs_lsn_t xfs_ail_get_push_target(struct xfs_ail *ailp) { - xfs_lsn_t lsn; - - spin_lock(&ailp->ail_lock); - lsn = __xfs_ail_push_target(ailp); - spin_unlock(&ailp->ail_lock); - return lsn; + return READ_ONCE(ailp->ail_target); } void xfs_ail_push_all_sync(struct xfs_ail *ailp); -- 2.39.2

From: Dave Chinner <dchinner@redhat.com> mainline inclusion from mainline-v6.10-rc2 commit a07776ab814d432190a902c2c3fac867c4e76934 category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/IC8HLD CVE: NA Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i... -------------------------------- Whenever we write an iclog, we call xlog_assign_tail_lsn() to update the current tail before we write it into the iclog header. This means we have to take the AIL lock on every iclog write just to check if the tail of the log has moved. This doesn't avoid races with log tail updates - the log tail could move immediately after we assign the tail to the iclog header and hence by the time the iclog reaches stable storage the tail LSN has moved forward in memory. Hence the log tail LSN in the iclog header is really just a point in time snapshot of the current state of the AIL. With this in mind, if we simply update the in memory log->l_tail_lsn every time it changes in the AIL, there is no need to update the in memory value when we are writing it into an iclog - it will already be up-to-date in memory and checking the AIL again will not change this. Hence xlog_state_release_iclog() does not need to check the AIL to update the tail lsn and can just sample it directly without needing to take the AIL lock. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Chandan Babu R <chandanbabu@kernel.org> Signed-off-by: Long Li <leo.lilong@huawei.com> --- fs/xfs/xfs_log.c | 5 ++--- fs/xfs/xfs_trans_ail.c | 17 +++++++++++++++-- 2 files changed, 17 insertions(+), 5 deletions(-) diff --git a/fs/xfs/xfs_log.c b/fs/xfs/xfs_log.c index e45efa6e1a8e..2fd98fc3c63a 100644 --- a/fs/xfs/xfs_log.c +++ b/fs/xfs/xfs_log.c @@ -530,7 +530,6 @@ xlog_state_release_iclog( struct xlog_in_core *iclog, struct xlog_ticket *ticket) { - xfs_lsn_t tail_lsn; bool last_ref; lockdep_assert_held(&log->l_icloglock); @@ -545,8 +544,8 @@ xlog_state_release_iclog( if ((iclog->ic_state == XLOG_STATE_WANT_SYNC || (iclog->ic_flags & XLOG_ICL_NEED_FUA)) && !iclog->ic_header.h_tail_lsn) { - tail_lsn = xlog_assign_tail_lsn(log->l_mp); - iclog->ic_header.h_tail_lsn = cpu_to_be64(tail_lsn); + iclog->ic_header.h_tail_lsn = + cpu_to_be64(atomic64_read(&log->l_tail_lsn)); } last_ref = atomic_dec_and_test(&iclog->ic_refcnt); diff --git a/fs/xfs/xfs_trans_ail.c b/fs/xfs/xfs_trans_ail.c index fddc9dcf834f..e75ba6845e2b 100644 --- a/fs/xfs/xfs_trans_ail.c +++ b/fs/xfs/xfs_trans_ail.c @@ -720,6 +720,13 @@ xfs_ail_push_all_sync( finish_wait(&ailp->ail_empty, &wait); } +/* + * Callers should pass the original tail lsn so that we can detect if the tail + * has moved as a result of the operation that was performed. If the caller + * needs to force a tail LSN update, it should pass NULLCOMMITLSN to bypass the + * "did the tail LSN change?" checks. If the caller wants to avoid a tail update + * (e.g. it knows the tail did not change) it should pass an @old_lsn of 0. + */ void xfs_ail_update_finish( struct xfs_ail *ailp, @@ -804,10 +811,16 @@ xfs_trans_ail_update_bulk( /* * If this is the first insert, wake up the push daemon so it can - * actively scan for items to push. + * actively scan for items to push. We also need to do a log tail + * LSN update to ensure that it is correctly tracked by the log, so + * set the tail_lsn to NULLCOMMITLSN so that xfs_ail_update_finish() + * will see that the tail lsn has changed and will update the tail + * appropriately. */ - if (!mlip) + if (!mlip) { wake_up_process(ailp->ail_task); + tail_lsn = NULLCOMMITLSN; + } xfs_ail_update_finish(ailp, tail_lsn); } -- 2.39.2

From: Dave Chinner <dchinner@redhat.com> mainline inclusion from mainline-v6.10-rc2 commit 0dcd5a10d9878dc5a3fd17c0331646a69ebb5da6 category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/IC8HLD CVE: NA Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i... -------------------------------- The current implementation of xlog_assign_tail_lsn() assumes that when the AIL is empty, the log tail matches the LSN of the last written commit record. This is recorded in xlog_state_set_callback() as log->l_last_sync_lsn when the iclog state changes to XLOG_STATE_CALLBACK. This change is then immediately followed by running the callbacks on the iclog which then insert the log items into the AIL at the "commit lsn" of that checkpoint. The AIL tracks log items via the start record LSN of the checkpoint, not the commit record LSN. This is because we can pipeline multiple checkpoints, and so the start record of checkpoint N+1 can be written before the commit record of checkpoint N. i.e: start N commit N +-------------+------------+----------------+ start N+1 commit N+1 The tail of the log cannot be moved to the LSN of commit N when all the items of that checkpoint are written back, because then the start record for N+1 is no longer in the active portion of the log and recovery will fail/corrupt the filesystem. Hence when all the log items in checkpoint N are written back, the tail of the log most now only move as far forwards as the start LSN of checkpoint N+1. Hence we cannot use the maximum start record LSN the AIL sees as a replacement the pointer to the current head of the on-disk log records. However, we currently only use the l_last_sync_lsn when the AIL is empty - when there is no start LSN remaining, the tail of the log moves to the LSN of the last commit record as this is where recovery needs to start searching for recoverable records. THe next checkpoint will have a start record LSN that is higher than l_last_sync_lsn, and so everything still works correctly when new checkpoints are written to an otherwise empty log. l_last_sync_lsn is an atomic variable because it is currently updated when an iclog with callbacks attached moves to the CALLBACK state. While we hold the icloglock at this point, we don't hold the AIL lock. When we assign the log tail, we hold the AIL lock, not the icloglock because we have to look up the AIL. Hence it is an atomic variable so it's not bound to a specific lock context. However, the iclog callbacks are only used for CIL checkpoints. We don't use callbacks with unmount record writes, so the l_last_sync_lsn variable only gets updated when we are processing CIL checkpoint callbacks. And those callbacks run under AIL lock contexts, not icloglock context. The CIL checkpoint already knows what the LSN of the iclog the commit record was written to (obtained when written into the iclog before submission) and so we can update the l_last_sync_lsn under the AIL lock in this callback. No other iclog callbacks will run until the currently executing one completes, and hence we can update the l_last_sync_lsn under the AIL lock safely. This means l_last_sync_lsn can move to the AIL as the "ail_head_lsn" and it can be used to replace the atomic l_last_sync_lsn in the iclog code. This makes tracking the log tail belong entirely to the AIL, rather than being smeared across log, iclog and AIL state and locking. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Chandan Babu R <chandanbabu@kernel.org> Signed-off-by: Long Li <leo.lilong@huawei.com> --- fs/xfs/xfs_log.c | 81 +++++----------------------------------- fs/xfs/xfs_log_cil.c | 54 ++++++++++++++++++++------- fs/xfs/xfs_log_priv.h | 9 ++--- fs/xfs/xfs_log_recover.c | 19 +++++----- fs/xfs/xfs_trace.c | 1 + fs/xfs/xfs_trace.h | 8 ++-- fs/xfs/xfs_trans_ail.c | 26 +++++++++++-- fs/xfs/xfs_trans_priv.h | 13 +++++++ 8 files changed, 102 insertions(+), 109 deletions(-) diff --git a/fs/xfs/xfs_log.c b/fs/xfs/xfs_log.c index 2fd98fc3c63a..8a4dceb320dd 100644 --- a/fs/xfs/xfs_log.c +++ b/fs/xfs/xfs_log.c @@ -1232,47 +1232,6 @@ xfs_log_cover( return error; } -/* - * We may be holding the log iclog lock upon entering this routine. - */ -xfs_lsn_t -xlog_assign_tail_lsn_locked( - struct xfs_mount *mp) -{ - struct xlog *log = mp->m_log; - struct xfs_log_item *lip; - xfs_lsn_t tail_lsn; - - assert_spin_locked(&mp->m_ail->ail_lock); - - /* - * To make sure we always have a valid LSN for the log tail we keep - * track of the last LSN which was committed in log->l_last_sync_lsn, - * and use that when the AIL was empty. - */ - lip = xfs_ail_min(mp->m_ail); - if (lip) - tail_lsn = lip->li_lsn; - else - tail_lsn = atomic64_read(&log->l_last_sync_lsn); - trace_xfs_log_assign_tail_lsn(log, tail_lsn); - atomic64_set(&log->l_tail_lsn, tail_lsn); - return tail_lsn; -} - -xfs_lsn_t -xlog_assign_tail_lsn( - struct xfs_mount *mp) -{ - xfs_lsn_t tail_lsn; - - spin_lock(&mp->m_ail->ail_lock); - tail_lsn = xlog_assign_tail_lsn_locked(mp); - spin_unlock(&mp->m_ail->ail_lock); - - return tail_lsn; -} - /* * Return the space in the log between the tail and the head. The head * is passed in the cycle/bytes formal parms. In the special case where @@ -1507,7 +1466,6 @@ xlog_alloc_log( log->l_prev_block = -1; /* log->l_tail_lsn = 0x100000000LL; cycle = 1; current block = 0 */ xlog_assign_atomic_lsn(&log->l_tail_lsn, 1, 0); - xlog_assign_atomic_lsn(&log->l_last_sync_lsn, 1, 0); log->l_curr_cycle = 1; /* 0 is bad since this is initial value */ if (xfs_has_logv2(mp) && mp->m_sb.sb_logsunit > 1) @@ -2556,44 +2514,23 @@ xlog_get_lowest_lsn( return lowest_lsn; } -/* - * Completion of a iclog IO does not imply that a transaction has completed, as - * transactions can be large enough to span many iclogs. We cannot change the - * tail of the log half way through a transaction as this may be the only - * transaction in the log and moving the tail to point to the middle of it - * will prevent recovery from finding the start of the transaction. Hence we - * should only update the last_sync_lsn if this iclog contains transaction - * completion callbacks on it. - * - * We have to do this before we drop the icloglock to ensure we are the only one - * that can update it. - * - * If we are moving the last_sync_lsn forwards, we also need to ensure we kick - * the reservation grant head pushing. This is due to the fact that the push - * target is bound by the current last_sync_lsn value. Hence if we have a large - * amount of log space bound up in this committing transaction then the - * last_sync_lsn value may be the limiting factor preventing tail pushing from - * freeing space in the log. Hence once we've updated the last_sync_lsn we - * should push the AIL to ensure the push target (and hence the grant head) is - * no longer bound by the old log head location and can move forwards and make - * progress again. - */ static void xlog_state_set_callback( struct xlog *log, struct xlog_in_core *iclog, xfs_lsn_t header_lsn) { + /* + * If there are no callbacks on this iclog, we can mark it clean + * immediately and return. Otherwise we need to run the + * callbacks. + */ + if (list_empty(&iclog->ic_callbacks)) { + xlog_state_clean_iclog(log, iclog); + return; + } trace_xlog_iclog_callback(iclog, _RET_IP_); iclog->ic_state = XLOG_STATE_CALLBACK; - - ASSERT(XFS_LSN_CMP(atomic64_read(&log->l_last_sync_lsn), - header_lsn) <= 0); - - if (list_empty_careful(&iclog->ic_callbacks)) - return; - - atomic64_set(&log->l_last_sync_lsn, header_lsn); } /* diff --git a/fs/xfs/xfs_log_cil.c b/fs/xfs/xfs_log_cil.c index e2e921484b6a..f998989f8217 100644 --- a/fs/xfs/xfs_log_cil.c +++ b/fs/xfs/xfs_log_cil.c @@ -721,6 +721,24 @@ xlog_cil_ail_insert_batch( * items into the AIL. This uses bulk insertion techniques to minimise AIL lock * traffic. * + * The AIL tracks log items via the start record LSN of the checkpoint, + * not the commit record LSN. This is because we can pipeline multiple + * checkpoints, and so the start record of checkpoint N+1 can be + * written before the commit record of checkpoint N. i.e: + * + * start N commit N + * +-------------+------------+----------------+ + * start N+1 commit N+1 + * + * The tail of the log cannot be moved to the LSN of commit N when all + * the items of that checkpoint are written back, because then the + * start record for N+1 is no longer in the active portion of the log + * and recovery will fail/corrupt the filesystem. + * + * Hence when all the log items in checkpoint N are written back, the + * tail of the log most now only move as far forwards as the start LSN + * of checkpoint N+1. + * * If we are called with the aborted flag set, it is because a log write during * a CIL checkpoint commit has failed. In this case, all the items in the * checkpoint have already gone through iop_committed and iop_committing, which @@ -738,24 +756,33 @@ xlog_cil_ail_insert_batch( */ static void xlog_cil_ail_insert( - struct xlog *log, - struct list_head *lv_chain, - xfs_lsn_t commit_lsn, + struct xfs_cil_ctx *ctx, bool aborted) { #define LOG_ITEM_BATCH_SIZE 32 - struct xfs_ail *ailp = log->l_ailp; + struct xfs_ail *ailp = ctx->cil->xc_log->l_ailp; struct xfs_log_item *log_items[LOG_ITEM_BATCH_SIZE]; struct xfs_log_vec *lv; struct xfs_ail_cursor cur; int i = 0; + /* + * Update the AIL head LSN with the commit record LSN of this + * checkpoint. As iclogs are always completed in order, this should + * always be the same (as iclogs can contain multiple commit records) or + * higher LSN than the current head. We do this before insertion of the + * items so that log space checks during insertion will reflect the + * space that this checkpoint has already consumed. + */ + ASSERT(XFS_LSN_CMP(ctx->commit_lsn, ailp->ail_head_lsn) >= 0 || + aborted); spin_lock(&ailp->ail_lock); - xfs_trans_ail_cursor_last(ailp, &cur, commit_lsn); + ailp->ail_head_lsn = ctx->commit_lsn; + xfs_trans_ail_cursor_last(ailp, &cur, ctx->start_lsn); spin_unlock(&ailp->ail_lock); /* unpin all the log items */ - list_for_each_entry(lv, lv_chain, lv_list) { + list_for_each_entry(lv, &ctx->lv_chain, lv_list) { struct xfs_log_item *lip = lv->lv_item; xfs_lsn_t item_lsn; @@ -768,9 +795,10 @@ xlog_cil_ail_insert( } if (lip->li_ops->iop_committed) - item_lsn = lip->li_ops->iop_committed(lip, commit_lsn); + item_lsn = lip->li_ops->iop_committed(lip, + ctx->start_lsn); else - item_lsn = commit_lsn; + item_lsn = ctx->start_lsn; /* item_lsn of -1 means the item needs no further processing */ if (XFS_LSN_CMP(item_lsn, (xfs_lsn_t)-1) == 0) @@ -787,7 +815,7 @@ xlog_cil_ail_insert( continue; } - if (item_lsn != commit_lsn) { + if (item_lsn != ctx->start_lsn) { /* * Not a bulk update option due to unusual item_lsn. @@ -810,14 +838,15 @@ xlog_cil_ail_insert( log_items[i++] = lv->lv_item; if (i >= LOG_ITEM_BATCH_SIZE) { xlog_cil_ail_insert_batch(ailp, &cur, log_items, - LOG_ITEM_BATCH_SIZE, commit_lsn); + LOG_ITEM_BATCH_SIZE, ctx->start_lsn); i = 0; } } /* make sure we insert the remainder! */ if (i) - xlog_cil_ail_insert_batch(ailp, &cur, log_items, i, commit_lsn); + xlog_cil_ail_insert_batch(ailp, &cur, log_items, i, + ctx->start_lsn); spin_lock(&ailp->ail_lock); xfs_trans_ail_cursor_done(&cur); @@ -863,8 +892,7 @@ xlog_cil_committed( spin_unlock(&ctx->cil->xc_push_lock); } - xlog_cil_ail_insert(ctx->cil->xc_log, &ctx->lv_chain, - ctx->start_lsn, abort); + xlog_cil_ail_insert(ctx, abort); xfs_extent_busy_sort(&ctx->busy_extents.extent_list); xfs_extent_busy_clear(mp, &ctx->busy_extents.extent_list, diff --git a/fs/xfs/xfs_log_priv.h b/fs/xfs/xfs_log_priv.h index 7cec0b9647bf..2312acffe8b4 100644 --- a/fs/xfs/xfs_log_priv.h +++ b/fs/xfs/xfs_log_priv.h @@ -431,13 +431,10 @@ struct xlog { int l_prev_block; /* previous logical log block */ /* - * l_last_sync_lsn and l_tail_lsn are atomics so they can be set and - * read without needing to hold specific locks. To avoid operations - * contending with other hot objects, place each of them on a separate - * cacheline. + * l_tail_lsn is atomic so it can be set and read without needing to + * hold specific locks. To avoid operations contending with other hot + * objects, it on a separate cacheline. */ - /* lsn of last LR on disk */ - atomic64_t l_last_sync_lsn ____cacheline_aligned_in_smp; /* lsn of 1st LR with unflushed * buffers */ atomic64_t l_tail_lsn ____cacheline_aligned_in_smp; diff --git a/fs/xfs/xfs_log_recover.c b/fs/xfs/xfs_log_recover.c index 08821241f19d..f88efeb5de51 100644 --- a/fs/xfs/xfs_log_recover.c +++ b/fs/xfs/xfs_log_recover.c @@ -1177,8 +1177,8 @@ xlog_check_unmount_rec( */ xlog_assign_atomic_lsn(&log->l_tail_lsn, log->l_curr_cycle, after_umount_blk); - xlog_assign_atomic_lsn(&log->l_last_sync_lsn, - log->l_curr_cycle, after_umount_blk); + log->l_ailp->ail_head_lsn = + atomic64_read(&log->l_tail_lsn); *tail_blk = after_umount_blk; *clean = true; @@ -1212,7 +1212,7 @@ xlog_set_state( if (bump_cycle) log->l_curr_cycle++; atomic64_set(&log->l_tail_lsn, be64_to_cpu(rhead->h_tail_lsn)); - atomic64_set(&log->l_last_sync_lsn, be64_to_cpu(rhead->h_lsn)); + log->l_ailp->ail_head_lsn = be64_to_cpu(rhead->h_lsn); xlog_assign_grant_head(&log->l_reserve_head.grant, log->l_curr_cycle, BBTOB(log->l_curr_block)); xlog_assign_grant_head(&log->l_write_head.grant, log->l_curr_cycle, @@ -3340,14 +3340,13 @@ xlog_do_recover( /* * We now update the tail_lsn since much of the recovery has completed - * and there may be space available to use. If there were no extent - * or iunlinks, we can free up the entire log and set the tail_lsn to - * be the last_sync_lsn. This was set in xlog_find_tail to be the - * lsn of the last known good LR on disk. If there are extent frees - * or iunlinks they will have some entries in the AIL; so we look at - * the AIL to determine how to set the tail_lsn. + * and there may be space available to use. If there were no extent or + * iunlinks, we can free up the entire log. This was set in + * xlog_find_tail to be the lsn of the last known good LR on disk. If + * there are extent frees or iunlinks they will have some entries in the + * AIL; so we look at the AIL to determine how to set the tail_lsn. */ - xlog_assign_tail_lsn(mp); + xfs_ail_assign_tail_lsn(log->l_ailp); /* * Now that we've finished replaying all buffer and inode updates, diff --git a/fs/xfs/xfs_trace.c b/fs/xfs/xfs_trace.c index 8a5dc1538aa8..2bf489b445ee 100644 --- a/fs/xfs/xfs_trace.c +++ b/fs/xfs/xfs_trace.c @@ -22,6 +22,7 @@ #include "xfs_trans.h" #include "xfs_log.h" #include "xfs_log_priv.h" +#include "xfs_trans_priv.h" #include "xfs_buf_item.h" #include "xfs_quota.h" #include "xfs_dquot_item.h" diff --git a/fs/xfs/xfs_trace.h b/fs/xfs/xfs_trace.h index d562028281f5..b11f46890098 100644 --- a/fs/xfs/xfs_trace.h +++ b/fs/xfs/xfs_trace.h @@ -1383,19 +1383,19 @@ TRACE_EVENT(xfs_log_assign_tail_lsn, __field(dev_t, dev) __field(xfs_lsn_t, new_lsn) __field(xfs_lsn_t, old_lsn) - __field(xfs_lsn_t, last_sync_lsn) + __field(xfs_lsn_t, head_lsn) ), TP_fast_assign( __entry->dev = log->l_mp->m_super->s_dev; __entry->new_lsn = new_lsn; __entry->old_lsn = atomic64_read(&log->l_tail_lsn); - __entry->last_sync_lsn = atomic64_read(&log->l_last_sync_lsn); + __entry->head_lsn = log->l_ailp->ail_head_lsn; ), - TP_printk("dev %d:%d new tail lsn %d/%d, old lsn %d/%d, last sync %d/%d", + TP_printk("dev %d:%d new tail lsn %d/%d, old lsn %d/%d, head lsn %d/%d", MAJOR(__entry->dev), MINOR(__entry->dev), CYCLE_LSN(__entry->new_lsn), BLOCK_LSN(__entry->new_lsn), CYCLE_LSN(__entry->old_lsn), BLOCK_LSN(__entry->old_lsn), - CYCLE_LSN(__entry->last_sync_lsn), BLOCK_LSN(__entry->last_sync_lsn)) + CYCLE_LSN(__entry->head_lsn), BLOCK_LSN(__entry->head_lsn)) ) DECLARE_EVENT_CLASS(xfs_file_class, diff --git a/fs/xfs/xfs_trans_ail.c b/fs/xfs/xfs_trans_ail.c index e75ba6845e2b..9a3e121c1e94 100644 --- a/fs/xfs/xfs_trans_ail.c +++ b/fs/xfs/xfs_trans_ail.c @@ -720,6 +720,26 @@ xfs_ail_push_all_sync( finish_wait(&ailp->ail_empty, &wait); } +void +__xfs_ail_assign_tail_lsn( + struct xfs_ail *ailp) +{ + struct xlog *log = ailp->ail_log; + xfs_lsn_t tail_lsn; + + assert_spin_locked(&ailp->ail_lock); + + if (xlog_is_shutdown(log)) + return; + + tail_lsn = __xfs_ail_min_lsn(ailp); + if (!tail_lsn) + tail_lsn = ailp->ail_head_lsn; + + trace_xfs_log_assign_tail_lsn(log, tail_lsn); + atomic64_set(&log->l_tail_lsn, tail_lsn); +} + /* * Callers should pass the original tail lsn so that we can detect if the tail * has moved as a result of the operation that was performed. If the caller @@ -734,15 +754,13 @@ xfs_ail_update_finish( { struct xlog *log = ailp->ail_log; - /* if the tail lsn hasn't changed, don't do updates or wakeups. */ + /* If the tail lsn hasn't changed, don't do updates or wakeups. */ if (!old_lsn || old_lsn == __xfs_ail_min_lsn(ailp)) { spin_unlock(&ailp->ail_lock); return; } - if (!xlog_is_shutdown(log)) - xlog_assign_tail_lsn_locked(log->l_mp); - + __xfs_ail_assign_tail_lsn(ailp); if (list_empty(&ailp->ail_head)) wake_up_all(&ailp->ail_empty); spin_unlock(&ailp->ail_lock); diff --git a/fs/xfs/xfs_trans_priv.h b/fs/xfs/xfs_trans_priv.h index 60b4707c3a65..bd841df93021 100644 --- a/fs/xfs/xfs_trans_priv.h +++ b/fs/xfs/xfs_trans_priv.h @@ -55,6 +55,7 @@ struct xfs_ail { struct list_head ail_cursors; spinlock_t ail_lock; xfs_lsn_t ail_last_pushed_lsn; + xfs_lsn_t ail_head_lsn; int ail_log_flush; unsigned long ail_opstate; struct list_head ail_buf_list; @@ -130,6 +131,18 @@ struct xfs_log_item * xfs_trans_ail_cursor_next(struct xfs_ail *ailp, struct xfs_ail_cursor *cur); void xfs_trans_ail_cursor_done(struct xfs_ail_cursor *cur); +void __xfs_ail_assign_tail_lsn(struct xfs_ail *ailp); + +static inline void +xfs_ail_assign_tail_lsn( + struct xfs_ail *ailp) +{ + + spin_lock(&ailp->ail_lock); + __xfs_ail_assign_tail_lsn(ailp); + spin_unlock(&ailp->ail_lock); +} + #if BITS_PER_LONG != 64 static inline void xfs_trans_ail_copy_lsn( -- 2.39.2

From: Dave Chinner <dchinner@redhat.com> mainline inclusion from mainline-v6.10-rc2 commit be5abd323bf4bee137d80d605ff30a7a66dad96d category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/IC8HLD CVE: NA Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i... -------------------------------- The function is called from a single place, and it isn't just setting the iclog state to XLOG_STATE_CALLBACK - it can mark iclogs clean, which moves them to states after CALLBACK. Hence the function is now badly named, and should just be folded into the caller where the iclog completion logic makes a whole lot more sense. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Chandan Babu R <chandanbabu@kernel.org> Signed-off-by: Long Li <leo.lilong@huawei.com> --- fs/xfs/xfs_log.c | 31 +++++++++++-------------------- 1 file changed, 11 insertions(+), 20 deletions(-) diff --git a/fs/xfs/xfs_log.c b/fs/xfs/xfs_log.c index 8a4dceb320dd..7a7c57116b36 100644 --- a/fs/xfs/xfs_log.c +++ b/fs/xfs/xfs_log.c @@ -2514,25 +2514,6 @@ xlog_get_lowest_lsn( return lowest_lsn; } -static void -xlog_state_set_callback( - struct xlog *log, - struct xlog_in_core *iclog, - xfs_lsn_t header_lsn) -{ - /* - * If there are no callbacks on this iclog, we can mark it clean - * immediately and return. Otherwise we need to run the - * callbacks. - */ - if (list_empty(&iclog->ic_callbacks)) { - xlog_state_clean_iclog(log, iclog); - return; - } - trace_xlog_iclog_callback(iclog, _RET_IP_); - iclog->ic_state = XLOG_STATE_CALLBACK; -} - /* * Return true if we need to stop processing, false to continue to the next * iclog. The caller will need to run callbacks if the iclog is returned in the @@ -2564,7 +2545,17 @@ xlog_state_iodone_process_iclog( lowest_lsn = xlog_get_lowest_lsn(log); if (lowest_lsn && XFS_LSN_CMP(lowest_lsn, header_lsn) < 0) return false; - xlog_state_set_callback(log, iclog, header_lsn); + /* + * If there are no callbacks on this iclog, we can mark it clean + * immediately and return. Otherwise we need to run the + * callbacks. + */ + if (list_empty(&iclog->ic_callbacks)) { + xlog_state_clean_iclog(log, iclog); + return false; + } + trace_xlog_iclog_callback(iclog, _RET_IP_); + iclog->ic_state = XLOG_STATE_CALLBACK; return false; default: /* -- 2.39.2

From: Dave Chinner <dchinner@redhat.com> mainline inclusion from mainline-v6.10-rc2 commit 551bf13ba8b24a9b938e85061c9e03bd452ea28d category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/IC8HLD CVE: NA Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i... -------------------------------- Currently we track space used in the log by grant heads. These store the reserved space as a physical log location and combine both space reserved for future use with space already used in the log in a single variable. The amount of space consumed in the log is then calculated as the distance between the log tail and the grant head. The problem with tracking the grant head as a physical location comes from the fact that it tracks both log cycle count and offset into the log in bytes in a single 64 bit variable. because the cycle count on disk is a 32 bit number, this also limits the offset into the log to 32 bits. ANd because that is in bytes, we are limited to being able to track only 2GB of log space in the grant head. Hence to support larger physical logs, we need to track used space differently in the grant head. We no longer use the grant head for guiding AIL pushing, so the only thing it is now used for is determining if we've run out of reservation space via the calculation in xlog_space_left(). What we really need to do is move the grant heads away from tracking physical space in the log. The issue here is that space consumed in the log is not directly tracked by the current mechanism - the space consumed in the log by grant head reservations gets returned to the free pool by the tail of the log moving forward. i.e. the space isn't directly tracked or calculated, but the used grant space gets "freed" as the physical limits of the log are updated without actually needing to update the grant heads. Hence to move away from implicit, zero-update log space tracking we need to explicitly track the amount of physical space the log actually consumes separately to the in-memory reservations for operations that will be committed to the journal. Luckily, we already track the information we need to calculate this in the AIL itself. That is, the space currently consumed by the journal is the maximum LSN that the AIL has seen minus the current log tail. As we update both of these items dynamically as the head and tail of the log moves, we always know exactly how much space the journal consumes. This means that we also know exactly how much space the currently active reservations require, and exactly how much free space we have remaining for new reservations to be made. Most importantly, we know what these spaces are indepedently of the physical locations of the head and tail of the log. Hence by separating out the physical space consumed by the journal, we can now track reservations in the grant heads purely as a byte count, and the log can be considered full when the tail space + reservation space exceeds the size of the log. This means we can use the full 64 bits of grant head space for reservation space, completely removing the 32 bit byte count limitation on log size that they impose. Hence the first step in this conversion is to track and update the "log tail space" every time the AIL tail or maximum seen LSN changes. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Chandan Babu R <chandanbabu@kernel.org> Signed-off-by: Long Li <leo.lilong@huawei.com> --- fs/xfs/xfs_log_cil.c | 9 ++++++--- fs/xfs/xfs_log_priv.h | 1 + fs/xfs/xfs_trans_ail.c | 9 ++++++--- 3 files changed, 13 insertions(+), 6 deletions(-) diff --git a/fs/xfs/xfs_log_cil.c b/fs/xfs/xfs_log_cil.c index f998989f8217..ca5c6d119f53 100644 --- a/fs/xfs/xfs_log_cil.c +++ b/fs/xfs/xfs_log_cil.c @@ -772,14 +772,17 @@ xlog_cil_ail_insert( * always be the same (as iclogs can contain multiple commit records) or * higher LSN than the current head. We do this before insertion of the * items so that log space checks during insertion will reflect the - * space that this checkpoint has already consumed. + * space that this checkpoint has already consumed. We call + * xfs_ail_update_finish() so that tail space and space-based wakeups + * will be recalculated appropriately. */ ASSERT(XFS_LSN_CMP(ctx->commit_lsn, ailp->ail_head_lsn) >= 0 || aborted); spin_lock(&ailp->ail_lock); - ailp->ail_head_lsn = ctx->commit_lsn; xfs_trans_ail_cursor_last(ailp, &cur, ctx->start_lsn); - spin_unlock(&ailp->ail_lock); + ailp->ail_head_lsn = ctx->commit_lsn; + /* xfs_ail_update_finish() drops the ail_lock */ + xfs_ail_update_finish(ailp, NULLCOMMITLSN); /* unpin all the log items */ list_for_each_entry(lv, &ctx->lv_chain, lv_list) { diff --git a/fs/xfs/xfs_log_priv.h b/fs/xfs/xfs_log_priv.h index 2312acffe8b4..b300a6f2592a 100644 --- a/fs/xfs/xfs_log_priv.h +++ b/fs/xfs/xfs_log_priv.h @@ -440,6 +440,7 @@ struct xlog { struct xlog_grant_head l_reserve_head; struct xlog_grant_head l_write_head; + uint64_t l_tail_space; struct xfs_kobj l_kobj; diff --git a/fs/xfs/xfs_trans_ail.c b/fs/xfs/xfs_trans_ail.c index 9a3e121c1e94..1e935b38f176 100644 --- a/fs/xfs/xfs_trans_ail.c +++ b/fs/xfs/xfs_trans_ail.c @@ -736,6 +736,8 @@ __xfs_ail_assign_tail_lsn( if (!tail_lsn) tail_lsn = ailp->ail_head_lsn; + WRITE_ONCE(log->l_tail_space, + xlog_lsn_sub(log, ailp->ail_head_lsn, tail_lsn)); trace_xfs_log_assign_tail_lsn(log, tail_lsn); atomic64_set(&log->l_tail_lsn, tail_lsn); } @@ -743,9 +745,10 @@ __xfs_ail_assign_tail_lsn( /* * Callers should pass the original tail lsn so that we can detect if the tail * has moved as a result of the operation that was performed. If the caller - * needs to force a tail LSN update, it should pass NULLCOMMITLSN to bypass the - * "did the tail LSN change?" checks. If the caller wants to avoid a tail update - * (e.g. it knows the tail did not change) it should pass an @old_lsn of 0. + * needs to force a tail space update, it should pass NULLCOMMITLSN to bypass + * the "did the tail LSN change?" checks. If the caller wants to avoid a tail + * update (e.g. it knows the tail did not change) it should pass an @old_lsn of + * 0. */ void xfs_ail_update_finish( -- 2.39.2

From: Dave Chinner <dchinner@redhat.com> mainline inclusion from mainline-v6.10-rc2 commit de302cea1e3b812e89a15b4eb349d063b2ab3aa1 category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/IC8HLD CVE: NA Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i... -------------------------------- Because we are going to need them soon. API change only, no logic changes. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Chandan Babu R <chandanbabu@kernel.org> Signed-off-by: Long Li <leo.lilong@huawei.com> --- fs/xfs/xfs_log.c | 157 +++++++++++++++++++++--------------------- fs/xfs/xfs_log_priv.h | 2 - 2 files changed, 77 insertions(+), 82 deletions(-) diff --git a/fs/xfs/xfs_log.c b/fs/xfs/xfs_log.c index 7a7c57116b36..7564de260499 100644 --- a/fs/xfs/xfs_log.c +++ b/fs/xfs/xfs_log.c @@ -136,10 +136,10 @@ xlog_prepare_iovec( static void xlog_grant_sub_space( struct xlog *log, - atomic64_t *head, + struct xlog_grant_head *head, int bytes) { - int64_t head_val = atomic64_read(head); + int64_t head_val = atomic64_read(&head->grant); int64_t new, old; do { @@ -155,17 +155,17 @@ xlog_grant_sub_space( old = head_val; new = xlog_assign_grant_head_val(cycle, space); - head_val = atomic64_cmpxchg(head, old, new); + head_val = atomic64_cmpxchg(&head->grant, old, new); } while (head_val != old); } static void xlog_grant_add_space( struct xlog *log, - atomic64_t *head, + struct xlog_grant_head *head, int bytes) { - int64_t head_val = atomic64_read(head); + int64_t head_val = atomic64_read(&head->grant); int64_t new, old; do { @@ -184,7 +184,7 @@ xlog_grant_add_space( old = head_val; new = xlog_assign_grant_head_val(cycle, space); - head_val = atomic64_cmpxchg(head, old, new); + head_val = atomic64_cmpxchg(&head->grant, old, new); } while (head_val != old); } @@ -197,6 +197,63 @@ xlog_grant_head_init( spin_lock_init(&head->lock); } +/* + * Return the space in the log between the tail and the head. The head + * is passed in the cycle/bytes formal parms. In the special case where + * the reserve head has wrapped passed the tail, this calculation is no + * longer valid. In this case, just return 0 which means there is no space + * in the log. This works for all places where this function is called + * with the reserve head. Of course, if the write head were to ever + * wrap the tail, we should blow up. Rather than catch this case here, + * we depend on other ASSERTions in other parts of the code. XXXmiken + * + * If reservation head is behind the tail, we have a problem. Warn about it, + * but then treat it as if the log is empty. + * + * If the log is shut down, the head and tail may be invalid or out of whack, so + * shortcut invalidity asserts in this case so that we don't trigger them + * falsely. + */ +static int +xlog_grant_space_left( + struct xlog *log, + struct xlog_grant_head *head) +{ + int tail_bytes; + int tail_cycle; + int head_cycle; + int head_bytes; + + xlog_crack_grant_head(&head->grant, &head_cycle, &head_bytes); + xlog_crack_atomic_lsn(&log->l_tail_lsn, &tail_cycle, &tail_bytes); + tail_bytes = BBTOB(tail_bytes); + if (tail_cycle == head_cycle && head_bytes >= tail_bytes) + return log->l_logsize - (head_bytes - tail_bytes); + if (tail_cycle + 1 < head_cycle) + return 0; + + /* Ignore potential inconsistency when shutdown. */ + if (xlog_is_shutdown(log)) + return log->l_logsize; + + if (tail_cycle < head_cycle) { + ASSERT(tail_cycle == (head_cycle - 1)); + return tail_bytes - head_bytes; + } + + /* + * The reservation head is behind the tail. In this case we just want to + * return the size of the log as the amount of space left. + */ + xfs_alert(log->l_mp, "xlog_grant_space_left: head behind tail"); + xfs_alert(log->l_mp, " tail_cycle = %d, tail_bytes = %d", + tail_cycle, tail_bytes); + xfs_alert(log->l_mp, " GH cycle = %d, GH bytes = %d", + head_cycle, head_bytes); + ASSERT(0); + return log->l_logsize; +} + STATIC void xlog_grant_head_wake_all( struct xlog_grant_head *head) @@ -277,7 +334,7 @@ xlog_grant_head_wait( spin_lock(&head->lock); if (xlog_is_shutdown(log)) goto shutdown; - } while (xlog_space_left(log, &head->grant) < need_bytes); + } while (xlog_grant_space_left(log, head) < need_bytes); list_del_init(&tic->t_queue); return 0; @@ -322,7 +379,7 @@ xlog_grant_head_check( * otherwise try to get some space for this transaction. */ *need_bytes = xlog_ticket_reservation(log, head, tic); - free_bytes = xlog_space_left(log, &head->grant); + free_bytes = xlog_grant_space_left(log, head); if (!list_empty_careful(&head->waiters)) { spin_lock(&head->lock); if (!xlog_grant_head_wake(log, head, &free_bytes) || @@ -396,7 +453,7 @@ xfs_log_regrant( if (error) goto out_error; - xlog_grant_add_space(log, &log->l_write_head.grant, need_bytes); + xlog_grant_add_space(log, &log->l_write_head, need_bytes); trace_xfs_log_regrant_exit(log, tic); xlog_verify_grant_tail(log); return 0; @@ -447,8 +504,8 @@ xfs_log_reserve( if (error) goto out_error; - xlog_grant_add_space(log, &log->l_reserve_head.grant, need_bytes); - xlog_grant_add_space(log, &log->l_write_head.grant, need_bytes); + xlog_grant_add_space(log, &log->l_reserve_head, need_bytes); + xlog_grant_add_space(log, &log->l_write_head, need_bytes); trace_xfs_log_reserve_exit(log, tic); xlog_verify_grant_tail(log); return 0; @@ -1109,7 +1166,7 @@ xfs_log_space_wake( ASSERT(!xlog_in_recovery(log)); spin_lock(&log->l_write_head.lock); - free_bytes = xlog_space_left(log, &log->l_write_head.grant); + free_bytes = xlog_grant_space_left(log, &log->l_write_head); xlog_grant_head_wake(log, &log->l_write_head, &free_bytes); spin_unlock(&log->l_write_head.lock); } @@ -1118,7 +1175,7 @@ xfs_log_space_wake( ASSERT(!xlog_in_recovery(log)); spin_lock(&log->l_reserve_head.lock); - free_bytes = xlog_space_left(log, &log->l_reserve_head.grant); + free_bytes = xlog_grant_space_left(log, &log->l_reserve_head); xlog_grant_head_wake(log, &log->l_reserve_head, &free_bytes); spin_unlock(&log->l_reserve_head.lock); } @@ -1232,64 +1289,6 @@ xfs_log_cover( return error; } -/* - * Return the space in the log between the tail and the head. The head - * is passed in the cycle/bytes formal parms. In the special case where - * the reserve head has wrapped passed the tail, this calculation is no - * longer valid. In this case, just return 0 which means there is no space - * in the log. This works for all places where this function is called - * with the reserve head. Of course, if the write head were to ever - * wrap the tail, we should blow up. Rather than catch this case here, - * we depend on other ASSERTions in other parts of the code. XXXmiken - * - * If reservation head is behind the tail, we have a problem. Warn about it, - * but then treat it as if the log is empty. - * - * If the log is shut down, the head and tail may be invalid or out of whack, so - * shortcut invalidity asserts in this case so that we don't trigger them - * falsely. - */ -int -xlog_space_left( - struct xlog *log, - atomic64_t *head) -{ - int tail_bytes; - int tail_cycle; - int head_cycle; - int head_bytes; - - xlog_crack_grant_head(head, &head_cycle, &head_bytes); - xlog_crack_atomic_lsn(&log->l_tail_lsn, &tail_cycle, &tail_bytes); - tail_bytes = BBTOB(tail_bytes); - if (tail_cycle == head_cycle && head_bytes >= tail_bytes) - return log->l_logsize - (head_bytes - tail_bytes); - if (tail_cycle + 1 < head_cycle) - return 0; - - /* Ignore potential inconsistency when shutdown. */ - if (xlog_is_shutdown(log)) - return log->l_logsize; - - if (tail_cycle < head_cycle) { - ASSERT(tail_cycle == (head_cycle - 1)); - return tail_bytes - head_bytes; - } - - /* - * The reservation head is behind the tail. In this case we just want to - * return the size of the log as the amount of space left. - */ - xfs_alert(log->l_mp, "xlog_space_left: head behind tail"); - xfs_alert(log->l_mp, " tail_cycle = %d, tail_bytes = %d", - tail_cycle, tail_bytes); - xfs_alert(log->l_mp, " GH cycle = %d, GH bytes = %d", - head_cycle, head_bytes); - ASSERT(0); - return log->l_logsize; -} - - static void xlog_ioend_work( struct work_struct *work) @@ -1888,8 +1887,8 @@ xlog_sync( if (ticket) { ticket->t_curr_res -= roundoff; } else { - xlog_grant_add_space(log, &log->l_reserve_head.grant, roundoff); - xlog_grant_add_space(log, &log->l_write_head.grant, roundoff); + xlog_grant_add_space(log, &log->l_reserve_head, roundoff); + xlog_grant_add_space(log, &log->l_write_head, roundoff); } /* put cycle number in every block */ @@ -2809,17 +2808,15 @@ xfs_log_ticket_regrant( if (ticket->t_cnt > 0) ticket->t_cnt--; - xlog_grant_sub_space(log, &log->l_reserve_head.grant, - ticket->t_curr_res); - xlog_grant_sub_space(log, &log->l_write_head.grant, - ticket->t_curr_res); + xlog_grant_sub_space(log, &log->l_reserve_head, ticket->t_curr_res); + xlog_grant_sub_space(log, &log->l_write_head, ticket->t_curr_res); ticket->t_curr_res = ticket->t_unit_res; trace_xfs_log_ticket_regrant_sub(log, ticket); /* just return if we still have some of the pre-reserved space */ if (!ticket->t_cnt) { - xlog_grant_add_space(log, &log->l_reserve_head.grant, + xlog_grant_add_space(log, &log->l_reserve_head, ticket->t_unit_res); trace_xfs_log_ticket_regrant_exit(log, ticket); @@ -2867,8 +2864,8 @@ xfs_log_ticket_ungrant( bytes += ticket->t_unit_res*ticket->t_cnt; } - xlog_grant_sub_space(log, &log->l_reserve_head.grant, bytes); - xlog_grant_sub_space(log, &log->l_write_head.grant, bytes); + xlog_grant_sub_space(log, &log->l_reserve_head, bytes); + xlog_grant_sub_space(log, &log->l_write_head, bytes); trace_xfs_log_ticket_ungrant_exit(log, ticket); diff --git a/fs/xfs/xfs_log_priv.h b/fs/xfs/xfs_log_priv.h index b300a6f2592a..cf7b231f2816 100644 --- a/fs/xfs/xfs_log_priv.h +++ b/fs/xfs/xfs_log_priv.h @@ -576,8 +576,6 @@ xlog_assign_grant_head(atomic64_t *head, int cycle, int space) atomic64_set(head, xlog_assign_grant_head_val(cycle, space)); } -int xlog_space_left(struct xlog *log, atomic64_t *head); - /* * Committed Item List interfaces */ -- 2.39.2

From: Dave Chinner <dchinner@redhat.com> mainline inclusion from mainline-v6.10-rc2 commit c1220522ef405a9ebf19447330c9e9de5dfc649c category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/IC8HLD CVE: NA Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i... -------------------------------- The grant heads in the log track the space reserved in the log for running transactions. They do this by tracking how far ahead of the tail that the reservation has reached, and the units for doing this are {cycle,bytes} for the reserve head rather than {cycle,blocks} which are normal used by LSNs. This is annoyingly complex because we have to split, crack and combined these tuples for any calculation we do to determine log space and targets. This is computationally expensive as well as difficult to do atomically and locklessly, as well as limiting the size of the log to 2^32 bytes. Really, though, all the grant heads are tracking is how much space is currently available for use in the log. We can track this as a simply byte count - we just don't care what the actual physical location in the log the head and tail are at, just how much space we have remaining before the head and tail overlap. So, convert the grant heads to track the byte reservations that are active rather than the current (cycle, offset) tuples. This means an empty log has zero bytes consumed, and a full log is when the reservations reach the size of the log minus the space consumed by the AIL. This greatly simplifies the accounting and checks for whether there is space available. We no longer need to crack or combine LSNs to determine how much space the log has left, nor do we need to look at the head or tail of the log to determine how close to full we are. There is, however, a complexity that needs to be handled. We know how much space is being tracked in the AIL now via log->l_tail_space and the log tickets track active reservations and return the unused portions to the grant heads when ungranted. Unfortunately, we don't track the used portion of the grant, so when we transfer log items from the CIL to the AIL, the space accounted to the grant heads is transferred to the log tail space. Hence when we move the AIL head forwards on item insert, we have to remove that space from the grant heads. We also remove the xlog_verify_grant_tail() debug function as it is no longer useful. The check it performs has been racy since delayed logging was introduced, but now it is clearly only detecting false positives so remove it. The result of this substantially simpler accounting algorithm is an increase in sustained transaction rate from ~1.3 million transactions/s to ~1.9 million transactions/s with no increase in CPU usage. We also remove the 32 bit space limitation on the grant heads, which will allow us to increase the journal size beyond 2GB in future. Note that this renames the sysfs files exposing the log grant space now that the values are exported in bytes. This allows xfstests to auto-detect the old or new ABI. [hch: move xlog_grant_sub_space out of line, update the xlog_grant_{add,sub}_space prototypes, rename the sysfs files to allow auto-detection in xfstests] Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Chandan Babu R <chandanbabu@kernel.org> Signed-off-by: Long Li <leo.lilong@huawei.com> --- Documentation/ABI/testing/sysfs-fs-xfs | 18 +- fs/xfs/xfs_log.c | 246 +++++++++---------------- fs/xfs/xfs_log_cil.c | 12 ++ fs/xfs/xfs_log_priv.h | 33 +--- fs/xfs/xfs_log_recover.c | 4 - fs/xfs/xfs_sysfs.c | 29 +-- fs/xfs/xfs_trace.h | 34 ++-- 7 files changed, 138 insertions(+), 238 deletions(-) diff --git a/Documentation/ABI/testing/sysfs-fs-xfs b/Documentation/ABI/testing/sysfs-fs-xfs index 82d8e2f79834..7da4de948b46 100644 --- a/Documentation/ABI/testing/sysfs-fs-xfs +++ b/Documentation/ABI/testing/sysfs-fs-xfs @@ -15,25 +15,23 @@ Description: The log sequence number (LSN) of the current tail of the log. The LSN is exported in "cycle:basic block" format. -What: /sys/fs/xfs/<disk>/log/reserve_grant_head -Date: July 2014 -KernelVersion: 3.17 +What: /sys/fs/xfs/<disk>/log/reserve_grant_head_bytes +Date: June 2024 +KernelVersion: 6.11 Contact: linux-xfs@vger.kernel.org Description: The current state of the log reserve grant head. It represents the total log reservation of all currently - outstanding transactions. The grant head is exported in - "cycle:bytes" format. + outstanding transactions in bytes. Users: xfstests -What: /sys/fs/xfs/<disk>/log/write_grant_head -Date: July 2014 -KernelVersion: 3.17 +What: /sys/fs/xfs/<disk>/log/write_grant_head_bytes +Date: June 2024 +KernelVersion: 6.11 Contact: linux-xfs@vger.kernel.org Description: The current state of the log write grant head. It represents the total log reservation of all currently outstanding transactions, including regrants due to - rolling transactions. The grant head is exported in - "cycle:bytes" format. + rolling transactions in bytes. Users: xfstests diff --git a/fs/xfs/xfs_log.c b/fs/xfs/xfs_log.c index 7564de260499..5c295edd1a0b 100644 --- a/fs/xfs/xfs_log.c +++ b/fs/xfs/xfs_log.c @@ -53,9 +53,6 @@ xlog_sync( struct xlog_ticket *ticket); #if defined(DEBUG) STATIC void -xlog_verify_grant_tail( - struct xlog *log); -STATIC void xlog_verify_iclog( struct xlog *log, struct xlog_in_core *iclog, @@ -65,7 +62,6 @@ xlog_verify_tail_lsn( struct xlog *log, struct xlog_in_core *iclog); #else -#define xlog_verify_grant_tail(a) #define xlog_verify_iclog(a,b,c) #define xlog_verify_tail_lsn(a,b) #endif @@ -133,125 +129,64 @@ xlog_prepare_iovec( return buf; } -static void +static inline void xlog_grant_sub_space( - struct xlog *log, struct xlog_grant_head *head, - int bytes) + int64_t bytes) { - int64_t head_val = atomic64_read(&head->grant); - int64_t new, old; - - do { - int cycle, space; - - xlog_crack_grant_head_val(head_val, &cycle, &space); - - space -= bytes; - if (space < 0) { - space += log->l_logsize; - cycle--; - } - - old = head_val; - new = xlog_assign_grant_head_val(cycle, space); - head_val = atomic64_cmpxchg(&head->grant, old, new); - } while (head_val != old); + atomic64_sub(bytes, &head->grant); } -static void +static inline void xlog_grant_add_space( - struct xlog *log, struct xlog_grant_head *head, - int bytes) + int64_t bytes) { - int64_t head_val = atomic64_read(&head->grant); - int64_t new, old; - - do { - int tmp; - int cycle, space; - - xlog_crack_grant_head_val(head_val, &cycle, &space); - - tmp = log->l_logsize - space; - if (tmp > bytes) - space += bytes; - else { - space = bytes - tmp; - cycle++; - } - - old = head_val; - new = xlog_assign_grant_head_val(cycle, space); - head_val = atomic64_cmpxchg(&head->grant, old, new); - } while (head_val != old); + atomic64_add(bytes, &head->grant); } -STATIC void +static void xlog_grant_head_init( struct xlog_grant_head *head) { - xlog_assign_grant_head(&head->grant, 1, 0); + atomic64_set(&head->grant, 0); INIT_LIST_HEAD(&head->waiters); spin_lock_init(&head->lock); } +void +xlog_grant_return_space( + struct xlog *log, + xfs_lsn_t old_head, + xfs_lsn_t new_head) +{ + int64_t diff = xlog_lsn_sub(log, new_head, old_head); + + xlog_grant_sub_space(&log->l_reserve_head, diff); + xlog_grant_sub_space(&log->l_write_head, diff); +} + /* - * Return the space in the log between the tail and the head. The head - * is passed in the cycle/bytes formal parms. In the special case where - * the reserve head has wrapped passed the tail, this calculation is no - * longer valid. In this case, just return 0 which means there is no space - * in the log. This works for all places where this function is called - * with the reserve head. Of course, if the write head were to ever - * wrap the tail, we should blow up. Rather than catch this case here, - * we depend on other ASSERTions in other parts of the code. XXXmiken - * - * If reservation head is behind the tail, we have a problem. Warn about it, - * but then treat it as if the log is empty. - * - * If the log is shut down, the head and tail may be invalid or out of whack, so - * shortcut invalidity asserts in this case so that we don't trigger them - * falsely. + * Return the space in the log between the tail and the head. In the case where + * we have overrun available reservation space, return 0. The memory barrier + * pairs with the smp_wmb() in xlog_cil_ail_insert() to ensure that grant head + * vs tail space updates are seen in the correct order and hence avoid + * transients as space is transferred from the grant heads to the AIL on commit + * completion. */ -static int +static uint64_t xlog_grant_space_left( struct xlog *log, struct xlog_grant_head *head) { - int tail_bytes; - int tail_cycle; - int head_cycle; - int head_bytes; - - xlog_crack_grant_head(&head->grant, &head_cycle, &head_bytes); - xlog_crack_atomic_lsn(&log->l_tail_lsn, &tail_cycle, &tail_bytes); - tail_bytes = BBTOB(tail_bytes); - if (tail_cycle == head_cycle && head_bytes >= tail_bytes) - return log->l_logsize - (head_bytes - tail_bytes); - if (tail_cycle + 1 < head_cycle) - return 0; - - /* Ignore potential inconsistency when shutdown. */ - if (xlog_is_shutdown(log)) - return log->l_logsize; - - if (tail_cycle < head_cycle) { - ASSERT(tail_cycle == (head_cycle - 1)); - return tail_bytes - head_bytes; - } + int64_t free_bytes; - /* - * The reservation head is behind the tail. In this case we just want to - * return the size of the log as the amount of space left. - */ - xfs_alert(log->l_mp, "xlog_grant_space_left: head behind tail"); - xfs_alert(log->l_mp, " tail_cycle = %d, tail_bytes = %d", - tail_cycle, tail_bytes); - xfs_alert(log->l_mp, " GH cycle = %d, GH bytes = %d", - head_cycle, head_bytes); - ASSERT(0); - return log->l_logsize; + smp_rmb(); /* paired with smp_wmb in xlog_cil_ail_insert() */ + free_bytes = log->l_logsize - READ_ONCE(log->l_tail_space) - + atomic64_read(&head->grant); + if (free_bytes > 0) + return free_bytes; + return 0; } STATIC void @@ -453,9 +388,8 @@ xfs_log_regrant( if (error) goto out_error; - xlog_grant_add_space(log, &log->l_write_head, need_bytes); + xlog_grant_add_space(&log->l_write_head, need_bytes); trace_xfs_log_regrant_exit(log, tic); - xlog_verify_grant_tail(log); return 0; out_error: @@ -504,10 +438,9 @@ xfs_log_reserve( if (error) goto out_error; - xlog_grant_add_space(log, &log->l_reserve_head, need_bytes); - xlog_grant_add_space(log, &log->l_write_head, need_bytes); + xlog_grant_add_space(&log->l_reserve_head, need_bytes); + xlog_grant_add_space(&log->l_write_head, need_bytes); trace_xfs_log_reserve_exit(log, tic); - xlog_verify_grant_tail(log); return 0; out_error: @@ -1887,8 +1820,8 @@ xlog_sync( if (ticket) { ticket->t_curr_res -= roundoff; } else { - xlog_grant_add_space(log, &log->l_reserve_head, roundoff); - xlog_grant_add_space(log, &log->l_write_head, roundoff); + xlog_grant_add_space(&log->l_reserve_head, roundoff); + xlog_grant_add_space(&log->l_write_head, roundoff); } /* put cycle number in every block */ @@ -2808,16 +2741,15 @@ xfs_log_ticket_regrant( if (ticket->t_cnt > 0) ticket->t_cnt--; - xlog_grant_sub_space(log, &log->l_reserve_head, ticket->t_curr_res); - xlog_grant_sub_space(log, &log->l_write_head, ticket->t_curr_res); + xlog_grant_sub_space(&log->l_reserve_head, ticket->t_curr_res); + xlog_grant_sub_space(&log->l_write_head, ticket->t_curr_res); ticket->t_curr_res = ticket->t_unit_res; trace_xfs_log_ticket_regrant_sub(log, ticket); /* just return if we still have some of the pre-reserved space */ if (!ticket->t_cnt) { - xlog_grant_add_space(log, &log->l_reserve_head, - ticket->t_unit_res); + xlog_grant_add_space(&log->l_reserve_head, ticket->t_unit_res); trace_xfs_log_ticket_regrant_exit(log, ticket); ticket->t_curr_res = ticket->t_unit_res; @@ -2864,8 +2796,8 @@ xfs_log_ticket_ungrant( bytes += ticket->t_unit_res*ticket->t_cnt; } - xlog_grant_sub_space(log, &log->l_reserve_head, bytes); - xlog_grant_sub_space(log, &log->l_write_head, bytes); + xlog_grant_sub_space(&log->l_reserve_head, bytes); + xlog_grant_sub_space(&log->l_write_head, bytes); trace_xfs_log_ticket_ungrant_exit(log, ticket); @@ -3337,42 +3269,27 @@ xlog_ticket_alloc( } #if defined(DEBUG) -/* - * Check to make sure the grant write head didn't just over lap the tail. If - * the cycles are the same, we can't be overlapping. Otherwise, make sure that - * the cycles differ by exactly one and check the byte count. - * - * This check is run unlocked, so can give false positives. Rather than assert - * on failures, use a warn-once flag and a panic tag to allow the admin to - * determine if they want to panic the machine when such an error occurs. For - * debug kernels this will have the same effect as using an assert but, unlinke - * an assert, it can be turned off at runtime. - */ -STATIC void -xlog_verify_grant_tail( - struct xlog *log) +static void +xlog_verify_dump_tail( + struct xlog *log, + struct xlog_in_core *iclog) { - int tail_cycle, tail_blocks; - int cycle, space; - - xlog_crack_grant_head(&log->l_write_head.grant, &cycle, &space); - xlog_crack_atomic_lsn(&log->l_tail_lsn, &tail_cycle, &tail_blocks); - if (tail_cycle != cycle) { - if (cycle - 1 != tail_cycle && - !test_and_set_bit(XLOG_TAIL_WARN, &log->l_opstate)) { - xfs_alert_tag(log->l_mp, XFS_PTAG_LOGRES, - "%s: cycle - 1 != tail_cycle", __func__); - } - - if (space > BBTOB(tail_blocks) && - !test_and_set_bit(XLOG_TAIL_WARN, &log->l_opstate)) { - xfs_alert_tag(log->l_mp, XFS_PTAG_LOGRES, - "%s: space > BBTOB(tail_blocks)", __func__); - } - } -} - -/* check if it will fit */ + xfs_alert(log->l_mp, +"ran out of log space tail 0x%llx/0x%llx, head lsn 0x%llx, head 0x%x/0x%x, prev head 0x%x/0x%x", + iclog ? be64_to_cpu(iclog->ic_header.h_tail_lsn) : -1, + atomic64_read(&log->l_tail_lsn), + log->l_ailp->ail_head_lsn, + log->l_curr_cycle, log->l_curr_block, + log->l_prev_cycle, log->l_prev_block); + xfs_alert(log->l_mp, +"write grant 0x%llx, reserve grant 0x%llx, tail_space 0x%llx, size 0x%x, iclog flags 0x%x", + atomic64_read(&log->l_write_head.grant), + atomic64_read(&log->l_reserve_head.grant), + log->l_tail_space, log->l_logsize, + iclog ? iclog->ic_flags : -1); +} + +/* Check if the new iclog will fit in the log. */ STATIC void xlog_verify_tail_lsn( struct xlog *log, @@ -3381,21 +3298,34 @@ xlog_verify_tail_lsn( xfs_lsn_t tail_lsn = be64_to_cpu(iclog->ic_header.h_tail_lsn); int blocks; - if (CYCLE_LSN(tail_lsn) == log->l_prev_cycle) { - blocks = - log->l_logBBsize - (log->l_prev_block - BLOCK_LSN(tail_lsn)); - if (blocks < BTOBB(iclog->ic_offset)+BTOBB(log->l_iclog_hsize)) - xfs_emerg(log->l_mp, "%s: ran out of log space", __func__); - } else { - ASSERT(CYCLE_LSN(tail_lsn)+1 == log->l_prev_cycle); + if (CYCLE_LSN(tail_lsn) == log->l_prev_cycle) { + blocks = log->l_logBBsize - + (log->l_prev_block - BLOCK_LSN(tail_lsn)); + if (blocks < BTOBB(iclog->ic_offset) + + BTOBB(log->l_iclog_hsize)) { + xfs_emerg(log->l_mp, + "%s: ran out of log space", __func__); + xlog_verify_dump_tail(log, iclog); + } + return; + } - if (BLOCK_LSN(tail_lsn) == log->l_prev_block) + if (CYCLE_LSN(tail_lsn) + 1 != log->l_prev_cycle) { + xfs_emerg(log->l_mp, "%s: head has wrapped tail.", __func__); + xlog_verify_dump_tail(log, iclog); + return; + } + if (BLOCK_LSN(tail_lsn) == log->l_prev_block) { xfs_emerg(log->l_mp, "%s: tail wrapped", __func__); + xlog_verify_dump_tail(log, iclog); + return; + } blocks = BLOCK_LSN(tail_lsn) - log->l_prev_block; - if (blocks < BTOBB(iclog->ic_offset) + 1) - xfs_emerg(log->l_mp, "%s: ran out of log space", __func__); - } + if (blocks < BTOBB(iclog->ic_offset) + 1) { + xfs_emerg(log->l_mp, "%s: ran out of iclog space", __func__); + xlog_verify_dump_tail(log, iclog); + } } /* diff --git a/fs/xfs/xfs_log_cil.c b/fs/xfs/xfs_log_cil.c index ca5c6d119f53..4b8079af8901 100644 --- a/fs/xfs/xfs_log_cil.c +++ b/fs/xfs/xfs_log_cil.c @@ -764,6 +764,7 @@ xlog_cil_ail_insert( struct xfs_log_item *log_items[LOG_ITEM_BATCH_SIZE]; struct xfs_log_vec *lv; struct xfs_ail_cursor cur; + xfs_lsn_t old_head; int i = 0; /* @@ -780,10 +781,21 @@ xlog_cil_ail_insert( aborted); spin_lock(&ailp->ail_lock); xfs_trans_ail_cursor_last(ailp, &cur, ctx->start_lsn); + old_head = ailp->ail_head_lsn; ailp->ail_head_lsn = ctx->commit_lsn; /* xfs_ail_update_finish() drops the ail_lock */ xfs_ail_update_finish(ailp, NULLCOMMITLSN); + /* + * We move the AIL head forwards to account for the space used in the + * log before we remove that space from the grant heads. This prevents a + * transient condition where reservation space appears to become + * available on return, only for it to disappear again immediately as + * the AIL head update accounts in the log tail space. + */ + smp_wmb(); /* paired with smp_rmb in xlog_grant_space_left */ + xlog_grant_return_space(ailp->ail_log, old_head, ailp->ail_head_lsn); + /* unpin all the log items */ list_for_each_entry(lv, &ctx->lv_chain, lv_list) { struct xfs_log_item *lip = lv->lv_item; diff --git a/fs/xfs/xfs_log_priv.h b/fs/xfs/xfs_log_priv.h index cf7b231f2816..d5ca8619bb62 100644 --- a/fs/xfs/xfs_log_priv.h +++ b/fs/xfs/xfs_log_priv.h @@ -546,36 +546,6 @@ xlog_assign_atomic_lsn(atomic64_t *lsn, uint cycle, uint block) atomic64_set(lsn, xlog_assign_lsn(cycle, block)); } -/* - * When we crack the grant head, we sample it first so that the value will not - * change while we are cracking it into the component values. This means we - * will always get consistent component values to work from. - */ -static inline void -xlog_crack_grant_head_val(int64_t val, int *cycle, int *space) -{ - *cycle = val >> 32; - *space = val & 0xffffffff; -} - -static inline void -xlog_crack_grant_head(atomic64_t *head, int *cycle, int *space) -{ - xlog_crack_grant_head_val(atomic64_read(head), cycle, space); -} - -static inline int64_t -xlog_assign_grant_head_val(int cycle, int space) -{ - return ((int64_t)cycle << 32) | space; -} - -static inline void -xlog_assign_grant_head(atomic64_t *head, int cycle, int space) -{ - atomic64_set(head, xlog_assign_grant_head_val(cycle, space)); -} - /* * Committed Item List interfaces */ @@ -641,6 +611,9 @@ xlog_lsn_sub( return (uint64_t)log->l_logsize - BBTOB(lo_block - hi_block); } +void xlog_grant_return_space(struct xlog *log, xfs_lsn_t old_head, + xfs_lsn_t new_head); + /* * The LSN is valid so long as it is behind the current LSN. If it isn't, this * means that the next log record that includes this metadata could have a diff --git a/fs/xfs/xfs_log_recover.c b/fs/xfs/xfs_log_recover.c index f88efeb5de51..e6a5fa03a530 100644 --- a/fs/xfs/xfs_log_recover.c +++ b/fs/xfs/xfs_log_recover.c @@ -1213,10 +1213,6 @@ xlog_set_state( log->l_curr_cycle++; atomic64_set(&log->l_tail_lsn, be64_to_cpu(rhead->h_tail_lsn)); log->l_ailp->ail_head_lsn = be64_to_cpu(rhead->h_lsn); - xlog_assign_grant_head(&log->l_reserve_head.grant, log->l_curr_cycle, - BBTOB(log->l_curr_block)); - xlog_assign_grant_head(&log->l_write_head.grant, log->l_curr_cycle, - BBTOB(log->l_curr_block)); } /* diff --git a/fs/xfs/xfs_sysfs.c b/fs/xfs/xfs_sysfs.c index a3c6b1548723..ec62ccea9cfb 100644 --- a/fs/xfs/xfs_sysfs.c +++ b/fs/xfs/xfs_sysfs.c @@ -373,39 +373,30 @@ log_tail_lsn_show( XFS_SYSFS_ATTR_RO(log_tail_lsn); STATIC ssize_t -reserve_grant_head_show( +reserve_grant_head_bytes_show( struct kobject *kobject, char *buf) - { - int cycle; - int bytes; - struct xlog *log = to_xlog(kobject); - - xlog_crack_grant_head(&log->l_reserve_head.grant, &cycle, &bytes); - return sysfs_emit(buf, "%d:%d\n", cycle, bytes); + return sysfs_emit(buf, "%lld\n", + atomic64_read(&to_xlog(kobject)->l_reserve_head.grant)); } -XFS_SYSFS_ATTR_RO(reserve_grant_head); +XFS_SYSFS_ATTR_RO(reserve_grant_head_bytes); STATIC ssize_t -write_grant_head_show( +write_grant_head_bytes_show( struct kobject *kobject, char *buf) { - int cycle; - int bytes; - struct xlog *log = to_xlog(kobject); - - xlog_crack_grant_head(&log->l_write_head.grant, &cycle, &bytes); - return sysfs_emit(buf, "%d:%d\n", cycle, bytes); + return sysfs_emit(buf, "%lld\n", + atomic64_read(&to_xlog(kobject)->l_write_head.grant)); } -XFS_SYSFS_ATTR_RO(write_grant_head); +XFS_SYSFS_ATTR_RO(write_grant_head_bytes); static struct attribute *xfs_log_attrs[] = { ATTR_LIST(log_head_lsn), ATTR_LIST(log_tail_lsn), - ATTR_LIST(reserve_grant_head), - ATTR_LIST(write_grant_head), + ATTR_LIST(reserve_grant_head_bytes), + ATTR_LIST(write_grant_head_bytes), NULL, }; ATTRIBUTE_GROUPS(xfs_log); diff --git a/fs/xfs/xfs_trace.h b/fs/xfs/xfs_trace.h index b11f46890098..3e85acda8cfc 100644 --- a/fs/xfs/xfs_trace.h +++ b/fs/xfs/xfs_trace.h @@ -1206,6 +1206,7 @@ DECLARE_EVENT_CLASS(xfs_loggrant_class, TP_ARGS(log, tic), TP_STRUCT__entry( __field(dev_t, dev) + __field(unsigned long, tic) __field(char, ocnt) __field(char, cnt) __field(int, curr_res) @@ -1213,16 +1214,16 @@ DECLARE_EVENT_CLASS(xfs_loggrant_class, __field(unsigned int, flags) __field(int, reserveq) __field(int, writeq) - __field(int, grant_reserve_cycle) - __field(int, grant_reserve_bytes) - __field(int, grant_write_cycle) - __field(int, grant_write_bytes) + __field(uint64_t, grant_reserve_bytes) + __field(uint64_t, grant_write_bytes) + __field(uint64_t, tail_space) __field(int, curr_cycle) __field(int, curr_block) __field(xfs_lsn_t, tail_lsn) ), TP_fast_assign( __entry->dev = log->l_mp->m_super->s_dev; + __entry->tic = (unsigned long)tic; __entry->ocnt = tic->t_ocnt; __entry->cnt = tic->t_cnt; __entry->curr_res = tic->t_curr_res; @@ -1230,23 +1231,22 @@ DECLARE_EVENT_CLASS(xfs_loggrant_class, __entry->flags = tic->t_flags; __entry->reserveq = list_empty(&log->l_reserve_head.waiters); __entry->writeq = list_empty(&log->l_write_head.waiters); - xlog_crack_grant_head(&log->l_reserve_head.grant, - &__entry->grant_reserve_cycle, - &__entry->grant_reserve_bytes); - xlog_crack_grant_head(&log->l_write_head.grant, - &__entry->grant_write_cycle, - &__entry->grant_write_bytes); + __entry->tail_space = READ_ONCE(log->l_tail_space); + __entry->grant_reserve_bytes = __entry->tail_space + + atomic64_read(&log->l_reserve_head.grant); + __entry->grant_write_bytes = __entry->tail_space + + atomic64_read(&log->l_write_head.grant); __entry->curr_cycle = log->l_curr_cycle; __entry->curr_block = log->l_curr_block; __entry->tail_lsn = atomic64_read(&log->l_tail_lsn); ), - TP_printk("dev %d:%d t_ocnt %u t_cnt %u t_curr_res %u " - "t_unit_res %u t_flags %s reserveq %s " - "writeq %s grant_reserve_cycle %d " - "grant_reserve_bytes %d grant_write_cycle %d " - "grant_write_bytes %d curr_cycle %d curr_block %d " + TP_printk("dev %d:%d tic 0x%lx t_ocnt %u t_cnt %u t_curr_res %u " + "t_unit_res %u t_flags %s reserveq %s writeq %s " + "tail space %llu grant_reserve_bytes %llu " + "grant_write_bytes %llu curr_cycle %d curr_block %d " "tail_cycle %d tail_block %d", MAJOR(__entry->dev), MINOR(__entry->dev), + __entry->tic, __entry->ocnt, __entry->cnt, __entry->curr_res, @@ -1254,9 +1254,8 @@ DECLARE_EVENT_CLASS(xfs_loggrant_class, __print_flags(__entry->flags, "|", XLOG_TIC_FLAGS), __entry->reserveq ? "empty" : "active", __entry->writeq ? "empty" : "active", - __entry->grant_reserve_cycle, + __entry->tail_space, __entry->grant_reserve_bytes, - __entry->grant_write_cycle, __entry->grant_write_bytes, __entry->curr_cycle, __entry->curr_block, @@ -1284,6 +1283,7 @@ DEFINE_LOGGRANT_EVENT(xfs_log_ticket_ungrant); DEFINE_LOGGRANT_EVENT(xfs_log_ticket_ungrant_sub); DEFINE_LOGGRANT_EVENT(xfs_log_ticket_ungrant_exit); DEFINE_LOGGRANT_EVENT(xfs_log_cil_wait); +DEFINE_LOGGRANT_EVENT(xfs_log_cil_return); DECLARE_EVENT_CLASS(xfs_log_item_class, TP_PROTO(struct xfs_log_item *lip), -- 2.39.2

From: Dave Chinner <dchinner@redhat.com> mainline inclusion from mainline-v6.10-rc2 commit f3f7ae68a4ea23aa9c49530733f1faaa6996b03a category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/IC8HLD CVE: NA Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i... -------------------------------- The AIL pushing code spends a huge amount of time skipping over items that are already marked as flushing. It is not uncommon to see hundreds of thousands of items skipped every second due to inode clustering marking all the inodes in a cluster as flushing when the first one is flushed. However, to discover an item is already flushing and should be skipped we have to call the iop_push() method for it to try to flush the item. For inodes (where this matters most), we have to first check that inode is flushable first. We can optimise this overhead away by tracking whether the log item is flushing internally. This allows xfsaild_push() to check the log item directly for flushing state and immediately skip the log item. Whilst this doesn't remove the CPU cache misses for loading the log item, it does avoid the overhead of an indirect function call and the cache misses involved in accessing inode and backing cluster buffer structures to determine flushing state. When trying to flush hundreds of thousands of inodes each second, this CPU overhead saving adds up quickly. It's so noticeable that the biggest issue with pushing on the AIL on fast storage becomes the 10ms back-off wait when we hit enough pinned buffers to break out of the push loop but not enough for the AIL pushing to be considered stuck. This limits the xfsaild to about 70% total CPU usage, and on fast storage this isn't enough to keep the storage 100% busy. The xfsaild will block on IO submission on slow storage and so is self throttling - it does not need a backoff in the case where we are really just breaking out of the walk to submit the IO we have gathered. Further with no backoff we don't need to gather huge delwri lists to mitigate the impact of backoffs, so we can submit IO more frequently and reduce the time log items spend in flushing state by breaking out of the item push loop once we've gathered enough IO to batch submission effectively. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Chandan Babu R <chandanbabu@kernel.org> Conflicts: fs/xfs/xfs_trans.h [not merge commit db7ccc0bac2a ("xfs: move ->iop_recover to xfs_defer_op_type")] Signed-off-by: Long Li <leo.lilong@huawei.com> --- fs/xfs/xfs_inode.c | 1 + fs/xfs/xfs_inode_item.c | 6 +++++- fs/xfs/xfs_trans.h | 4 +++- fs/xfs/xfs_trans_ail.c | 8 +++++++- 4 files changed, 16 insertions(+), 3 deletions(-) diff --git a/fs/xfs/xfs_inode.c b/fs/xfs/xfs_inode.c index bf32340cd769..277df19fee21 100644 --- a/fs/xfs/xfs_inode.c +++ b/fs/xfs/xfs_inode.c @@ -3340,6 +3340,7 @@ xfs_iflush( iip->ili_last_fields = iip->ili_fields; iip->ili_fields = 0; iip->ili_fsync_fields = 0; + set_bit(XFS_LI_FLUSHING, &iip->ili_item.li_flags); spin_unlock(&iip->ili_lock); /* diff --git a/fs/xfs/xfs_inode_item.c b/fs/xfs/xfs_inode_item.c index 0a1c07179dcf..455d27b158cf 100644 --- a/fs/xfs/xfs_inode_item.c +++ b/fs/xfs/xfs_inode_item.c @@ -973,6 +973,7 @@ xfs_iflush_finish( } iip->ili_last_fields = 0; iip->ili_flush_lsn = 0; + clear_bit(XFS_LI_FLUSHING, &lip->li_flags); spin_unlock(&iip->ili_lock); xfs_iflags_clear(iip->ili_inode, XFS_IFLUSHING); if (drop_buffer) @@ -1031,8 +1032,10 @@ xfs_buf_inode_io_fail( { struct xfs_log_item *lip; - list_for_each_entry(lip, &bp->b_li_list, li_bio_list) + list_for_each_entry(lip, &bp->b_li_list, li_bio_list) { set_bit(XFS_LI_FAILED, &lip->li_flags); + clear_bit(XFS_LI_FLUSHING, &lip->li_flags); + } } /* @@ -1051,6 +1054,7 @@ xfs_iflush_abort_clean( iip->ili_flush_lsn = 0; iip->ili_item.li_buf = NULL; list_del_init(&iip->ili_item.li_bio_list); + clear_bit(XFS_LI_FLUSHING, &iip->ili_item.li_flags); } /* diff --git a/fs/xfs/xfs_trans.h b/fs/xfs/xfs_trans.h index ead65f5f8dc3..7da343aa7214 100644 --- a/fs/xfs/xfs_trans.h +++ b/fs/xfs/xfs_trans.h @@ -58,13 +58,15 @@ struct xfs_log_item { #define XFS_LI_FAILED 2 #define XFS_LI_DIRTY 3 #define XFS_LI_WHITEOUT 4 +#define XFS_LI_FLUSHING 5 #define XFS_LI_FLAGS \ { (1u << XFS_LI_IN_AIL), "IN_AIL" }, \ { (1u << XFS_LI_ABORTED), "ABORTED" }, \ { (1u << XFS_LI_FAILED), "FAILED" }, \ { (1u << XFS_LI_DIRTY), "DIRTY" }, \ - { (1u << XFS_LI_WHITEOUT), "WHITEOUT" } + { (1u << XFS_LI_WHITEOUT), "WHITEOUT" }, \ + { (1u << XFS_LI_FLUSHING), "FLUSHING" } struct xfs_defer_pending; diff --git a/fs/xfs/xfs_trans_ail.c b/fs/xfs/xfs_trans_ail.c index 1e935b38f176..a802353dcb2a 100644 --- a/fs/xfs/xfs_trans_ail.c +++ b/fs/xfs/xfs_trans_ail.c @@ -512,6 +512,9 @@ xfsaild_push( while ((XFS_LSN_CMP(lip->li_lsn, ailp->ail_target) <= 0)) { int lock_result; + if (test_bit(XFS_LI_FLUSHING, &lip->li_flags)) + goto next_item; + /* * Note that iop_push may unlock and reacquire the AIL lock. We * rely on the AIL cursor implementation to be able to deal with @@ -581,9 +584,12 @@ xfsaild_push( if (stuck > 100) break; +next_item: lip = xfs_trans_ail_cursor_next(ailp, &cur); if (lip == NULL) break; + if (lip->li_lsn != lsn && count > 1000) + break; lsn = lip->li_lsn; } @@ -620,7 +626,7 @@ xfsaild_push( /* * Assume we have more work to do in a short while. */ - tout = 10; + tout = 0; } return tout; -- 2.39.2

From: "Darrick J. Wong" <djwong@kernel.org> mainline inclusion from mainline-v6.10-rc2 commit 04d6dbb55301a29aeb9a197e6b0012cdc265f1e4 category: bugfix bugzilla: https://gitee.com/openeuler/kernel/issues/IC8HLD CVE: NA Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i... -------------------------------- In commit 9adf40249e6c, we changed the behavior of the AIL thread to set its own task state to KILLABLE whenever the timeout value is nonzero. Unfortunately, this missed the fact that xfsaild_push will return 50ms (aka a longish sleep) when we reach the push target or the AIL becomes empty, so xfsaild goes to sleep for a long period of time in uninterruptible D state. This results in artificially high load averages because KILLABLE processes are UNINTERRUPTIBLE, which contributes to load average even though the AIL is asleep waiting for someone to interrupt it. It's not blocked on IOs or anything, but people scrap ps for processes that look like they're stuck in D state, so restore the previous threshold. Fixes: 9adf40249e6c ("xfs: AIL doesn't need manual pushing") Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Chandan Babu R <chandanbabu@kernel.org> Signed-off-by: Long Li <leo.lilong@huawei.com> --- fs/xfs/xfs_trans_ail.c | 7 ++++++- 1 file changed, 6 insertions(+), 1 deletion(-) diff --git a/fs/xfs/xfs_trans_ail.c b/fs/xfs/xfs_trans_ail.c index a802353dcb2a..9ce88d96daa5 100644 --- a/fs/xfs/xfs_trans_ail.c +++ b/fs/xfs/xfs_trans_ail.c @@ -644,7 +644,12 @@ xfsaild( set_freezable(); while (1) { - if (tout) + /* + * Long waits of 50ms or more occur when we've run out of items + * to push, so we only want uninterruptible state if we're + * actually blocked on something. + */ + if (tout && tout <= 20) set_current_state(TASK_KILLABLE|TASK_FREEZABLE); else set_current_state(TASK_INTERRUPTIBLE|TASK_FREEZABLE); -- 2.39.2

maillist inclusion category: bugfix bugzilla: https://gitee.com/openeuler/kernel/issues/IC8HLD CVE: NA Reference: https://lore.kernel.org/linux-xfs/20250523063046.564170-1-leo.lilong@huawei.... -------------------------------- When the AIL is empty, we track ail_head_lsn as the tail LSN, where ail_head_lsn records the commit LSN of Checkpoint N, the last checkpoint inserted into the AIL. There are two possible scenarios when the AIL is empty: 1. Items from Checkpoint N were previously inserted into the AIL, have been written back, and subsequently removed from the AIL. 2. Items from Checkpoint N have not yet been inserted into the AIL, but the preparation for insertion is complete, and ail_head_lsn has already been updated to Checkpoint N's commit LSN. For scenario 1, the items in Checkpoint N have already been written to the metadata area. Even in the event of a crash, Checkpoint N does not require recovery, so forwarding the tail LSN to Checkpoint N's commit LSN is reasonable. For scenario 2, the items in Checkpoint N have not been written to the metadata area. If new logs (ie., Checkpoint N+1) are flushed to disk with the tail LSN recorded as Checkpoint N's commit LSN, a crash would make it impossible to recover Checkpoint N. Checkpoint N start N commit N +-------------+------------+--------------+ Checkpoint N+1 start N+1 commit N+1 Scenario 2 is possible. I encountered this issue in the Linux 6.6, where l_last_sync_lsn was used to track the tail LSN when the AIL is empty. Although the code has been refactored and I have not reproduced this issue in the latest mainline kernel, I believe the problem still exists. In the function xlog_cil_ail_insert(), which inserts items from the ctx into the AIL, the update of ail_head_lsn and the actual insertion of items into the AIL are not atomic. This process is not protected by `ail_lock` or `log->l_icloglock`, leaving a window that could result in the iclog being filled with an incorrect tail LSN. When the AIL is empty, the tail LSN should be set to Checkpoint N's start LSN before the items from Checkpoint N are inserted into the AIL. This ensures that Checkpoint N can be recovered in case of a crash. After the items from Checkpoint N are inserted into the AIL, the tail LSN tracked for an empty AIL can then be updated to Checkpoint N's commit LSN. This cannot be achieved with ail_head_lsn alone, so a new variable, ail_tail_lsn, is introduced specifically to track the tail LSN when the AIL is empty. Fixes: 14e15f1bcd73 ("xfs: push the grant head when the log head moves forward") # further than this Signed-off-by: Long Li <leo.lilong@huawei.com> --- fs/xfs/xfs_log_cil.c | 2 ++ fs/xfs/xfs_log_recover.c | 5 +++-- fs/xfs/xfs_trans_ail.c | 2 +- fs/xfs/xfs_trans_priv.h | 1 + 4 files changed, 7 insertions(+), 3 deletions(-) diff --git a/fs/xfs/xfs_log_cil.c b/fs/xfs/xfs_log_cil.c index 4b8079af8901..3d36066215be 100644 --- a/fs/xfs/xfs_log_cil.c +++ b/fs/xfs/xfs_log_cil.c @@ -783,6 +783,7 @@ xlog_cil_ail_insert( xfs_trans_ail_cursor_last(ailp, &cur, ctx->start_lsn); old_head = ailp->ail_head_lsn; ailp->ail_head_lsn = ctx->commit_lsn; + ailp->ail_tail_lsn = ctx->start_lsn; /* xfs_ail_update_finish() drops the ail_lock */ xfs_ail_update_finish(ailp, NULLCOMMITLSN); @@ -865,6 +866,7 @@ xlog_cil_ail_insert( spin_lock(&ailp->ail_lock); xfs_trans_ail_cursor_done(&cur); + ailp->ail_tail_lsn = ctx->commit_lsn; spin_unlock(&ailp->ail_lock); } diff --git a/fs/xfs/xfs_log_recover.c b/fs/xfs/xfs_log_recover.c index e6a5fa03a530..5cadabf97a9f 100644 --- a/fs/xfs/xfs_log_recover.c +++ b/fs/xfs/xfs_log_recover.c @@ -1177,7 +1177,7 @@ xlog_check_unmount_rec( */ xlog_assign_atomic_lsn(&log->l_tail_lsn, log->l_curr_cycle, after_umount_blk); - log->l_ailp->ail_head_lsn = + log->l_ailp->ail_head_lsn = log->l_ailp->ail_tail_lsn = atomic64_read(&log->l_tail_lsn); *tail_blk = after_umount_blk; @@ -1212,7 +1212,8 @@ xlog_set_state( if (bump_cycle) log->l_curr_cycle++; atomic64_set(&log->l_tail_lsn, be64_to_cpu(rhead->h_tail_lsn)); - log->l_ailp->ail_head_lsn = be64_to_cpu(rhead->h_lsn); + log->l_ailp->ail_head_lsn = log->l_ailp->ail_tail_lsn = + be64_to_cpu(rhead->h_lsn); } /* diff --git a/fs/xfs/xfs_trans_ail.c b/fs/xfs/xfs_trans_ail.c index 9ce88d96daa5..2542c5e2c1cd 100644 --- a/fs/xfs/xfs_trans_ail.c +++ b/fs/xfs/xfs_trans_ail.c @@ -745,7 +745,7 @@ __xfs_ail_assign_tail_lsn( tail_lsn = __xfs_ail_min_lsn(ailp); if (!tail_lsn) - tail_lsn = ailp->ail_head_lsn; + tail_lsn = ailp->ail_tail_lsn; WRITE_ONCE(log->l_tail_space, xlog_lsn_sub(log, ailp->ail_head_lsn, tail_lsn)); diff --git a/fs/xfs/xfs_trans_priv.h b/fs/xfs/xfs_trans_priv.h index bd841df93021..89ad29036ccb 100644 --- a/fs/xfs/xfs_trans_priv.h +++ b/fs/xfs/xfs_trans_priv.h @@ -56,6 +56,7 @@ struct xfs_ail { spinlock_t ail_lock; xfs_lsn_t ail_last_pushed_lsn; xfs_lsn_t ail_head_lsn; + xfs_lsn_t ail_tail_lsn; int ail_log_flush; unsigned long ail_opstate; struct list_head ail_buf_list; -- 2.39.2

反馈: 您发送到kernel@openeuler.org的补丁/补丁集,已成功转换为PR! PR链接地址: https://gitee.com/openeuler/kernel/pulls/16456 邮件列表地址:https://mailweb.openeuler.org/archives/list/kernel@openeuler.org/message/RS4... FeedBack: The patch(es) which you have sent to kernel@openeuler.org mailing list has been converted to a pull request successfully! Pull request link: https://gitee.com/openeuler/kernel/pulls/16456 Mailing list address: https://mailweb.openeuler.org/archives/list/kernel@openeuler.org/message/RS4...
participants (2)
-
Long Li
-
patchwork bot