From: "Darrick J. Wong" darrick.wong@oracle.com
mainline inclusion from mainline-v5.11-rc4 commit 6da1b4b1ab36d80a3994fd4811c8381de10af604 category: bugfix bugzilla: 185867 https://gitee.com/openeuler/kernel/issues/I4KIAO
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i...
--------------------------------
When overlayfs is running on top of xfs and the user unlinks a file in the overlay, overlayfs will create a whiteout inode and ask xfs to "rename" the whiteout file atop the one being unlinked. If the file being unlinked loses its one nlink, we then have to put the inode on the unlinked list.
This requires us to grab the AGI buffer of the whiteout inode to take it off the unlinked list (which is where whiteouts are created) and to grab the AGI buffer of the file being deleted. If the whiteout was created in a higher numbered AG than the file being deleted, we'll lock the AGIs in the wrong order and deadlock.
Therefore, grab all the AGI locks we think we'll need ahead of time, and in order of increasing AG number per the locking rules.
Reported-by: wenli xie wlxie7296@gmail.com Fixes: 93597ae8dac0 ("xfs: Fix deadlock between AGI and AGF when target_ip exists in xfs_rename()") Signed-off-by: Darrick J. Wong darrick.wong@oracle.com Reviewed-by: Brian Foster bfoster@redhat.com Signed-off-by: Guo Xuenan guoxuenan@huawei.com Reviewed-by: Lihong Kou koulihong@huawei.com Signed-off-by: Zheng Zengkai zhengzengkai@huawei.com --- fs/xfs/libxfs/xfs_dir2.h | 2 -- fs/xfs/libxfs/xfs_dir2_sf.c | 2 +- fs/xfs/xfs_inode.c | 42 ++++++++++++++++++++++--------------- 3 files changed, 26 insertions(+), 20 deletions(-)
diff --git a/fs/xfs/libxfs/xfs_dir2.h b/fs/xfs/libxfs/xfs_dir2.h index e55378640b05..d03e6098ded9 100644 --- a/fs/xfs/libxfs/xfs_dir2.h +++ b/fs/xfs/libxfs/xfs_dir2.h @@ -47,8 +47,6 @@ extern int xfs_dir_lookup(struct xfs_trans *tp, struct xfs_inode *dp, extern int xfs_dir_removename(struct xfs_trans *tp, struct xfs_inode *dp, struct xfs_name *name, xfs_ino_t ino, xfs_extlen_t tot); -extern bool xfs_dir2_sf_replace_needblock(struct xfs_inode *dp, - xfs_ino_t inum); extern int xfs_dir_replace(struct xfs_trans *tp, struct xfs_inode *dp, struct xfs_name *name, xfs_ino_t inum, xfs_extlen_t tot); diff --git a/fs/xfs/libxfs/xfs_dir2_sf.c b/fs/xfs/libxfs/xfs_dir2_sf.c index 2463b5d73447..8c4f76bba88b 100644 --- a/fs/xfs/libxfs/xfs_dir2_sf.c +++ b/fs/xfs/libxfs/xfs_dir2_sf.c @@ -1018,7 +1018,7 @@ xfs_dir2_sf_removename( /* * Check whether the sf dir replace operation need more blocks. */ -bool +static bool xfs_dir2_sf_replace_needblock( struct xfs_inode *dp, xfs_ino_t inum) diff --git a/fs/xfs/xfs_inode.c b/fs/xfs/xfs_inode.c index af54224ebbe7..b72dd3f67ca7 100644 --- a/fs/xfs/xfs_inode.c +++ b/fs/xfs/xfs_inode.c @@ -3212,7 +3212,7 @@ xfs_rename( struct xfs_trans *tp; struct xfs_inode *wip = NULL; /* whiteout inode */ struct xfs_inode *inodes[__XFS_SORT_INODES]; - struct xfs_buf *agibp; + int i; int num_inodes = __XFS_SORT_INODES; bool new_parent = (src_dp != target_dp); bool src_is_directory = S_ISDIR(VFS_I(src_ip)->i_mode); @@ -3325,6 +3325,30 @@ xfs_rename( } }
+ /* + * Lock the AGI buffers we need to handle bumping the nlink of the + * whiteout inode off the unlinked list and to handle dropping the + * nlink of the target inode. Per locking order rules, do this in + * increasing AG order and before directory block allocation tries to + * grab AGFs because we grab AGIs before AGFs. + * + * The (vfs) caller must ensure that if src is a directory then + * target_ip is either null or an empty directory. + */ + for (i = 0; i < num_inodes && inodes[i] != NULL; i++) { + if (inodes[i] == wip || + (inodes[i] == target_ip && + (VFS_I(target_ip)->i_nlink == 1 || src_is_directory))) { + struct xfs_buf *bp; + xfs_agnumber_t agno; + + agno = XFS_INO_TO_AGNO(mp, inodes[i]->i_ino); + error = xfs_read_agi(mp, tp, agno, &bp); + if (error) + goto out_trans_cancel; + } + } + /* * Directory entry creation below may acquire the AGF. Remove * the whiteout from the unlinked list first to preserve correct @@ -3377,22 +3401,6 @@ xfs_rename( * In case there is already an entry with the same * name at the destination directory, remove it first. */ - - /* - * Check whether the replace operation will need to allocate - * blocks. This happens when the shortform directory lacks - * space and we have to convert it to a block format directory. - * When more blocks are necessary, we must lock the AGI first - * to preserve locking order (AGI -> AGF). - */ - if (xfs_dir2_sf_replace_needblock(target_dp, src_ip->i_ino)) { - error = xfs_read_agi(mp, tp, - XFS_INO_TO_AGNO(mp, target_ip->i_ino), - &agibp); - if (error) - goto out_trans_cancel; - } - error = xfs_dir_replace(tp, target_dp, target_name, src_ip->i_ino, spaceres); if (error)
From: Brian Foster bfoster@redhat.com
mainline inclusion from mainline-v5.13-rc4 commit 84d8949e770745b16a7e8a68dcb1d0f3687bdee9 category: bugfix bugzilla: 185862 https://gitee.com/openeuler/kernel/issues/I4KIAO
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i...
--------------------------------
The special processing used to simulate a buffer I/O failure on fs shutdown has a difficult to reproduce race that can result in a use after free of the associated buffer. Consider a buffer that has been committed to the on-disk log and thus is AIL resident. The buffer lands on the writeback delwri queue, but is subsequently locked, committed and pinned by another transaction before submitted for I/O. At this point, the buffer is stuck on the delwri queue as it cannot be submitted for I/O until it is unpinned. A log checkpoint I/O failure occurs sometime later, which aborts the bli. The unpin handler is called with the aborted log item, drops the bli reference count, the pin count, and falls into the I/O failure simulation path.
The potential problem here is that once the pin count falls to zero in ->iop_unpin(), xfsaild is free to retry delwri submission of the buffer at any time, before the unpin handler even completes. If delwri queue submission wins the race to the buffer lock, it observes the shutdown state and simulates the I/O failure itself. This releases both the bli and delwri queue holds and frees the buffer while xfs_buf_item_unpin() sits on xfs_buf_lock() waiting to run through the same failure sequence. This problem is rare and requires many iterations of fstest generic/019 (which simulates disk I/O failures) to reproduce.
To avoid this problem, grab a hold on the buffer before the log item is unpinned if the associated item has been aborted and will require a simulated I/O failure. The hold is already required for the simulated I/O failure, so the ordering simply guarantees the unpin handler access to the buffer before it is unpinned and thus processed by the AIL. This particular ordering is required so long as the AIL does not acquire a reference on the bli, which is the long term solution to this problem.
Signed-off-by: Brian Foster bfoster@redhat.com Reviewed-by: Darrick J. Wong djwong@kernel.org Signed-off-by: Darrick J. Wong djwong@kernel.org Signed-off-by: Guo Xuenan guoxuenan@huawei.com Reviewed-by: Lihong Kou koulihong@huawei.com Signed-off-by: Zheng Zengkai zhengzengkai@huawei.com --- fs/xfs/xfs_buf_item.c | 37 +++++++++++++++++++++---------------- 1 file changed, 21 insertions(+), 16 deletions(-)
diff --git a/fs/xfs/xfs_buf_item.c b/fs/xfs/xfs_buf_item.c index 5d6535370f87..452af57731ed 100644 --- a/fs/xfs/xfs_buf_item.c +++ b/fs/xfs/xfs_buf_item.c @@ -393,17 +393,8 @@ xfs_buf_item_pin( }
/* - * This is called to unpin the buffer associated with the buf log - * item which was previously pinned with a call to xfs_buf_item_pin(). - * - * Also drop the reference to the buf item for the current transaction. - * If the XFS_BLI_STALE flag is set and we are the last reference, - * then free up the buf log item and unlock the buffer. - * - * If the remove flag is set we are called from uncommit in the - * forced-shutdown path. If that is true and the reference count on - * the log item is going to drop to zero we need to free the item's - * descriptor in the transaction. + * This is called to unpin the buffer associated with the buf log item which + * was previously pinned with a call to xfs_buf_item_pin(). */ STATIC void xfs_buf_item_unpin( @@ -420,12 +411,26 @@ xfs_buf_item_unpin(
trace_xfs_buf_item_unpin(bip);
+ /* + * Drop the bli ref associated with the pin and grab the hold required + * for the I/O simulation failure in the abort case. We have to do this + * before the pin count drops because the AIL doesn't acquire a bli + * reference. Therefore if the refcount drops to zero, the bli could + * still be AIL resident and the buffer submitted for I/O (and freed on + * completion) at any point before we return. This can be removed once + * the AIL properly holds a reference on the bli. + */ freed = atomic_dec_and_test(&bip->bli_refcount); - + if (freed && !stale && remove) + xfs_buf_hold(bp); if (atomic_dec_and_test(&bp->b_pin_count)) wake_up_all(&bp->b_waiters);
- if (freed && stale) { + /* nothing to do but drop the pin count if the bli is active */ + if (!freed) + return; + + if (stale) { ASSERT(bip->bli_flags & XFS_BLI_STALE); ASSERT(xfs_buf_islocked(bp)); ASSERT(bp->b_flags & XBF_STALE); @@ -468,13 +473,13 @@ xfs_buf_item_unpin( ASSERT(bp->b_log_item == NULL); } xfs_buf_relse(bp); - } else if (freed && remove) { + } else if (remove) { /* * The buffer must be locked and held by the caller to simulate - * an async I/O failure. + * an async I/O failure. We acquired the hold for this case + * before the buffer was unpinned. */ xfs_buf_lock(bp); - xfs_buf_hold(bp); bp->b_flags |= XBF_ASYNC; xfs_buf_ioend_fail(bp); }
From: Brian Foster bfoster@redhat.com
mainline inclusion from mainline-v5.13-rc4 commit e53d3aa0b605c49d780e1b2fd0b49dba4154f32b category: bugfix bugzilla: 185862 https://gitee.com/openeuler/kernel/issues/I4KIAO
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i...
--------------------------------
This code goes back to a time when transaction commits wrote directly to iclogs. The associated log items were pinned, written to the log, and then "uncommitted" if some part of the log write had failed. This uncommit sequence called an ->iop_unpin_remove() handler that was eventually folded into ->iop_unpin() via the remove parameter. The log subsystem has since changed significantly in that transactions commit to the CIL instead of direct to iclogs, though log items must still be aborted in the event of an eventual log I/O error. However, the context for a log item abort is now asynchronous from transaction commit, which means the committing transaction has been freed by this point in time and the transaction uncommit sequence of events is no longer relevant.
Further, since stale buffers remain locked at transaction commit through unpin, we can be certain that the buffer is not associated with any transaction when the unpin callback executes. Remove this unused hunk of code and replace it with an assertion that the buffer is disassociated from transaction context.
Signed-off-by: Brian Foster bfoster@redhat.com Reviewed-by: Darrick J. Wong djwong@kernel.org Signed-off-by: Darrick J. Wong djwong@kernel.org Signed-off-by: Guo Xuenan guoxuenan@huawei.com Reviewed-by: Lihong Kou koulihong@huawei.com Signed-off-by: Zheng Zengkai zhengzengkai@huawei.com --- fs/xfs/xfs_buf_item.c | 21 ++------------------- 1 file changed, 2 insertions(+), 19 deletions(-)
diff --git a/fs/xfs/xfs_buf_item.c b/fs/xfs/xfs_buf_item.c index 452af57731ed..a3d5ecccfc2c 100644 --- a/fs/xfs/xfs_buf_item.c +++ b/fs/xfs/xfs_buf_item.c @@ -435,28 +435,11 @@ xfs_buf_item_unpin( ASSERT(xfs_buf_islocked(bp)); ASSERT(bp->b_flags & XBF_STALE); ASSERT(bip->__bli_format.blf_flags & XFS_BLF_CANCEL); + ASSERT(list_empty(&lip->li_trans)); + ASSERT(!bp->b_transp);
trace_xfs_buf_item_unpin_stale(bip);
- if (remove) { - /* - * If we are in a transaction context, we have to - * remove the log item from the transaction as we are - * about to release our reference to the buffer. If we - * don't, the unlock that occurs later in - * xfs_trans_uncommit() will try to reference the - * buffer which we no longer have a hold on. - */ - if (!list_empty(&lip->li_trans)) - xfs_trans_del_item(lip); - - /* - * Since the transaction no longer refers to the buffer, - * the buffer should no longer refer to the transaction. - */ - bp->b_transp = NULL; - } - /* * If we get called here because of an IO error, we may or may * not have the item on the AIL. xfs_trans_ail_delete() will
From: Vivek Goyal vgoyal@redhat.com
mainline inclusion from mainline-v5.11-rc1 commit 63f9909ff602082597849f684655e93336c50b11 category: perf bugzilla: https://gitee.com/openeuler/kernel/issues/I4SIR8
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i...
--------------------------------
We already have FUSE_HANDLE_KILLPRIV flag that says that file server will remove suid/sgid/caps on truncate/chown/write. But that's little different from what Linux VFS implements.
To be consistent with Linux VFS behavior what we want is.
- caps are always cleared on chown/write/truncate - suid is always cleared on chown, while for truncate/write it is cleared only if caller does not have CAP_FSETID. - sgid is always cleared on chown, while for truncate/write it is cleared only if caller does not have CAP_FSETID as well as file has group execute permission.
As previous flag did not provide above semantics. Implement a V2 of the protocol with above said constraints.
Server does not know if caller has CAP_FSETID or not. So for the case of write()/truncate(), client will send information in special flag to indicate whether to kill priviliges or not. These changes are in subsequent patches.
FUSE_HANDLE_KILLPRIV_V2 relies on WRITE being sent to server to clear suid/sgid/security.capability. But with ->writeback_cache, WRITES are cached in guest. So it is not recommended to use FUSE_HANDLE_KILLPRIV_V2 and writeback_cache together. Though it probably might be good enough for lot of use cases.
Signed-off-by: Vivek Goyal vgoyal@redhat.com Signed-off-by: Miklos Szeredi mszeredi@redhat.com Signed-off-by: Baokun Li libaokun1@huawei.com Reviewed-by: Hou Tao houtao1@huawei.com Signed-off-by: Zheng Zengkai zhengzengkai@huawei.com --- fs/fuse/fuse_i.h | 8 ++++++++ fs/fuse/inode.c | 5 ++++- include/uapi/linux/fuse.h | 11 ++++++++++- 3 files changed, 22 insertions(+), 2 deletions(-)
diff --git a/fs/fuse/fuse_i.h b/fs/fuse/fuse_i.h index ed71ef6fbc9d..d31fc48c6afa 100644 --- a/fs/fuse/fuse_i.h +++ b/fs/fuse/fuse_i.h @@ -642,6 +642,14 @@ struct fuse_conn { /* show legacy mount options */ unsigned int legacy_opts_show:1;
+ /* + * fs kills suid/sgid/cap on write/chown/trunc. suid is killed on + * write/trunc only if caller did not have CAP_FSETID. sgid is killed + * on write/truncate only if caller did not have CAP_FSETID as well as + * file has group execute permission. + */ + unsigned handle_killpriv_v2:1; + /* * The following bitfields are only for optimization purposes * and hence races in setting them will not cause malfunction diff --git a/fs/fuse/inode.c b/fs/fuse/inode.c index 5e484676343e..59765df54fa7 100644 --- a/fs/fuse/inode.c +++ b/fs/fuse/inode.c @@ -1058,6 +1058,8 @@ static void process_init_reply(struct fuse_mount *fm, struct fuse_args *args, !fuse_dax_check_alignment(fc, arg->map_alignment)) { ok = false; } + if (arg->flags & FUSE_HANDLE_KILLPRIV_V2) + fc->handle_killpriv_v2 = 1; } else { ra_pages = fc->max_read / PAGE_SIZE; fc->no_lock = 1; @@ -1100,7 +1102,8 @@ void fuse_send_init(struct fuse_mount *fm) FUSE_WRITEBACK_CACHE | FUSE_NO_OPEN_SUPPORT | FUSE_PARALLEL_DIROPS | FUSE_HANDLE_KILLPRIV | FUSE_POSIX_ACL | FUSE_ABORT_ERROR | FUSE_MAX_PAGES | FUSE_CACHE_SYMLINKS | - FUSE_NO_OPENDIR_SUPPORT | FUSE_EXPLICIT_INVAL_DATA; + FUSE_NO_OPENDIR_SUPPORT | FUSE_EXPLICIT_INVAL_DATA | + FUSE_HANDLE_KILLPRIV_V2; #ifdef CONFIG_FUSE_DAX if (fm->fc->dax) ia->in.flags |= FUSE_MAP_ALIGNMENT; diff --git a/include/uapi/linux/fuse.h b/include/uapi/linux/fuse.h index 7233502ea991..29bd2e007947 100644 --- a/include/uapi/linux/fuse.h +++ b/include/uapi/linux/fuse.h @@ -175,6 +175,9 @@ * * 7.32 * - add flags to fuse_attr, add FUSE_ATTR_SUBMOUNT, add FUSE_SUBMOUNTS + * + * 7.33 + * - add FUSE_HANDLE_KILLPRIV_V2 */
#ifndef _LINUX_FUSE_H @@ -210,7 +213,7 @@ #define FUSE_KERNEL_VERSION 7
/** Minor version number of this interface */ -#define FUSE_KERNEL_MINOR_VERSION 32 +#define FUSE_KERNEL_MINOR_VERSION 33
/** The node ID of the root inode */ #define FUSE_ROOT_ID 1 @@ -320,6 +323,11 @@ struct fuse_file_lock { * foffset and moffset fields in struct * fuse_setupmapping_out and fuse_removemapping_one. * FUSE_SUBMOUNTS: kernel supports auto-mounting directory submounts + * FUSE_HANDLE_KILLPRIV_V2: fs kills suid/sgid/cap on write/chown/trunc. + * Upon write/truncate suid/sgid is only killed if caller + * does not have CAP_FSETID. Additionally upon + * write/truncate sgid is killed only if file has group + * execute permission. (Same as Linux VFS behavior). */ #define FUSE_ASYNC_READ (1 << 0) #define FUSE_POSIX_LOCKS (1 << 1) @@ -349,6 +357,7 @@ struct fuse_file_lock { #define FUSE_EXPLICIT_INVAL_DATA (1 << 25) #define FUSE_MAP_ALIGNMENT (1 << 26) #define FUSE_SUBMOUNTS (1 << 27) +#define FUSE_HANDLE_KILLPRIV_V2 (1 << 28)
/** * CUSE INIT request/reply flags
From: Miklos Szeredi mszeredi@redhat.com
mainline inclusion from mainline-v5.11-rc1 commit 10c52c84e3f4872689a64ac7666b34d67e630691 category: perf bugzilla: https://gitee.com/openeuler/kernel/issues/I4SIR8
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i...
--------------------------------
Kernel has: ATTR_KILL_PRIV -> clear "security.capability" ATTR_KILL_SUID -> clear S_ISUID ATTR_KILL_SGID -> clear S_ISGID if executable
Fuse has: FUSE_WRITE_KILL_PRIV -> clear S_ISUID and S_ISGID if executable
So FUSE_WRITE_KILL_PRIV implies the complement of ATTR_KILL_PRIV, which is somewhat confusing. Also PRIV implies all privileges, including "security.capability".
Change the name to FUSE_WRITE_KILL_SUIDGID and make FUSE_WRITE_KILL_PRIV an alias to perserve API compatibility
Signed-off-by: Miklos Szeredi mszeredi@redhat.com Signed-off-by: Baokun Li libaokun1@huawei.com Reviewed-by: Hou Tao houtao1@huawei.com Signed-off-by: Zheng Zengkai zhengzengkai@huawei.com --- fs/fuse/file.c | 2 +- include/uapi/linux/fuse.h | 9 ++++++--- 2 files changed, 7 insertions(+), 4 deletions(-)
diff --git a/fs/fuse/file.c b/fs/fuse/file.c index 4dd70b53df81..74295dfe16ca 100644 --- a/fs/fuse/file.c +++ b/fs/fuse/file.c @@ -1474,7 +1474,7 @@ ssize_t fuse_direct_io(struct fuse_io_priv *io, struct iov_iter *iter,
if (write) { if (!capable(CAP_FSETID)) - ia->write.in.write_flags |= FUSE_WRITE_KILL_PRIV; + ia->write.in.write_flags |= FUSE_WRITE_KILL_SUIDGID;
nres = fuse_send_write(ia, pos, nbytes, owner); } else { diff --git a/include/uapi/linux/fuse.h b/include/uapi/linux/fuse.h index 29bd2e007947..2623c75b94a5 100644 --- a/include/uapi/linux/fuse.h +++ b/include/uapi/linux/fuse.h @@ -177,7 +177,7 @@ * - add flags to fuse_attr, add FUSE_ATTR_SUBMOUNT, add FUSE_SUBMOUNTS * * 7.33 - * - add FUSE_HANDLE_KILLPRIV_V2 + * - add FUSE_HANDLE_KILLPRIV_V2, FUSE_WRITE_KILL_SUIDGID */
#ifndef _LINUX_FUSE_H @@ -387,11 +387,14 @@ struct fuse_file_lock { * * FUSE_WRITE_CACHE: delayed write from page cache, file handle is guessed * FUSE_WRITE_LOCKOWNER: lock_owner field is valid - * FUSE_WRITE_KILL_PRIV: kill suid and sgid bits + * FUSE_WRITE_KILL_SUIDGID: kill suid and sgid bits */ #define FUSE_WRITE_CACHE (1 << 0) #define FUSE_WRITE_LOCKOWNER (1 << 1) -#define FUSE_WRITE_KILL_PRIV (1 << 2) +#define FUSE_WRITE_KILL_SUIDGID (1 << 2) + +/* Obsolete alias; this flag implies killing suid/sgid only. */ +#define FUSE_WRITE_KILL_PRIV FUSE_WRITE_KILL_SUIDGID
/** * Read flags
From: Vivek Goyal vgoyal@redhat.com
mainline inclusion from mainline-v5.11-rc1 commit b866739596ae3c3c60c43f1cf04a516c5aa20fd1 category: perf bugzilla: https://gitee.com/openeuler/kernel/issues/I4SIR8
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i...
--------------------------------
With HANDLE_KILLPRIV_V2, server will need to kill suid/sgid if caller does not have CAP_FSETID. We already have a flag FUSE_WRITE_KILL_SUIDGID in WRITE request and we already set it in direct I/O path.
To make it work in cached write path also, start setting FUSE_WRITE_KILL_SUIDGID in this path too.
Set it only if fc->handle_killpriv_v2 is set. Otherwise client is responsible for kill suid/sgid.
In case of direct I/O we set FUSE_WRITE_KILL_SUIDGID unconditionally because we don't call file_remove_privs() in that path (with cache=none option).
Signed-off-by: Vivek Goyal vgoyal@redhat.com Signed-off-by: Miklos Szeredi mszeredi@redhat.com Signed-off-by: Baokun Li libaokun1@huawei.com Reviewed-by: Hou Tao houtao1@huawei.com Signed-off-by: Zheng Zengkai zhengzengkai@huawei.com --- fs/fuse/file.c | 2 ++ 1 file changed, 2 insertions(+)
diff --git a/fs/fuse/file.c b/fs/fuse/file.c index 74295dfe16ca..3f696df47b5a 100644 --- a/fs/fuse/file.c +++ b/fs/fuse/file.c @@ -1104,6 +1104,8 @@ static ssize_t fuse_send_write_pages(struct fuse_io_args *ia,
fuse_write_args_fill(ia, ff, pos, count); ia->write.in.flags = fuse_write_flags(iocb); + if (fm->fc->handle_killpriv_v2 && !capable(CAP_FSETID)) + ia->write.in.write_flags |= FUSE_WRITE_KILL_SUIDGID;
err = fuse_simple_request(fm, &ap->args); if (!err && ia->write.out.size > count)
From: Vivek Goyal vgoyal@redhat.com
mainline inclusion from mainline-v5.11-rc1 commit 3179216135ec09825d7c7875580951a6e69dc5df category: perf bugzilla: https://gitee.com/openeuler/kernel/issues/I4SIR8
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i...
--------------------------------
If fc->handle_killpriv_v2 is enabled, we expect file server to clear suid/sgid/security.capbility upon chown/truncate/write as appropriate.
Upon truncate (ATTR_SIZE), suid/sgid are cleared only if caller does not have CAP_FSETID. File server does not know whether caller has CAP_FSETID or not. Hence set FATTR_KILL_SUIDGID upon truncate to let file server know that caller does not have CAP_FSETID and it should kill suid/sgid as appropriate.
On chown (ATTR_UID/ATTR_GID) suid/sgid need to be cleared irrespective of capabilities of calling process, so set FATTR_KILL_SUIDGID unconditionally in that case.
Signed-off-by: Vivek Goyal vgoyal@redhat.com Signed-off-by: Miklos Szeredi mszeredi@redhat.com Signed-off-by: Baokun Li libaokun1@huawei.com Reviewed-by: Hou Tao houtao1@huawei.com Signed-off-by: Zheng Zengkai zhengzengkai@huawei.com --- fs/fuse/dir.c | 10 ++++++++++ include/uapi/linux/fuse.h | 3 ++- 2 files changed, 12 insertions(+), 1 deletion(-)
diff --git a/fs/fuse/dir.c b/fs/fuse/dir.c index 8e95a75a4559..e582342bfc16 100644 --- a/fs/fuse/dir.c +++ b/fs/fuse/dir.c @@ -1693,10 +1693,20 @@ int fuse_do_setattr(struct dentry *dentry, struct iattr *attr, inarg.valid |= FATTR_FH; inarg.fh = ff->fh; } + + /* Kill suid/sgid for non-directory chown unconditionally */ + if (fc->handle_killpriv_v2 && !S_ISDIR(inode->i_mode) && + attr->ia_valid & (ATTR_UID | ATTR_GID)) + inarg.valid |= FATTR_KILL_SUIDGID; + if (attr->ia_valid & ATTR_SIZE) { /* For mandatory locking in truncate */ inarg.valid |= FATTR_LOCKOWNER; inarg.lock_owner = fuse_lock_owner_id(fc, current->files); + + /* Kill suid/sgid for truncate only if no CAP_FSETID */ + if (fc->handle_killpriv_v2 && !capable(CAP_FSETID)) + inarg.valid |= FATTR_KILL_SUIDGID; } fuse_setattr_fill(fc, &args, inode, &inarg, &outarg); err = fuse_simple_request(fm, &args); diff --git a/include/uapi/linux/fuse.h b/include/uapi/linux/fuse.h index 2623c75b94a5..9eb96e0564be 100644 --- a/include/uapi/linux/fuse.h +++ b/include/uapi/linux/fuse.h @@ -177,7 +177,7 @@ * - add flags to fuse_attr, add FUSE_ATTR_SUBMOUNT, add FUSE_SUBMOUNTS * * 7.33 - * - add FUSE_HANDLE_KILLPRIV_V2, FUSE_WRITE_KILL_SUIDGID + * - add FUSE_HANDLE_KILLPRIV_V2, FUSE_WRITE_KILL_SUIDGID, FATTR_KILL_SUIDGID */
#ifndef _LINUX_FUSE_H @@ -274,6 +274,7 @@ struct fuse_file_lock { #define FATTR_MTIME_NOW (1 << 8) #define FATTR_LOCKOWNER (1 << 9) #define FATTR_CTIME (1 << 10) +#define FATTR_KILL_SUIDGID (1 << 11)
/** * Flags returned by the OPEN request
From: Vivek Goyal vgoyal@redhat.com
mainline inclusion from mainline-v5.11-rc1 commit 8981bdfda7445af5d5a8c277c923bf91873a0c98 category: perf bugzilla: https://gitee.com/openeuler/kernel/issues/I4SIR8
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i...
--------------------------------
If client does a write() on a suid/sgid file, VFS will first call fuse_setattr() with ATTR_KILL_S[UG]ID set. This requires sending setattr to file server with ATTR_MODE set to kill suid/sgid. But to do that client needs to know latest mode otherwise it is racy.
To reduce the race window, current code first call fuse_do_getattr() to get latest ->i_mode and then resets suid/sgid bits and sends rest to server with setattr(ATTR_MODE). This does not reduce the race completely but narrows race window significantly.
With fc->handle_killpriv_v2 enabled, it should be possible to remove this race completely. Do not kill suid/sgid with ATTR_MODE at all. It will be killed by server when WRITE request is sent to server soon. This is similar to fc->handle_killpriv logic. V2 is just more refined version of protocol. Hence this patch does not send ATTR_MODE to kill suid/sgid if fc->handle_killpriv_v2 is enabled.
This creates an issue if fc->writeback_cache is enabled. In that case WRITE can be cached in guest and server might not see WRITE request and hence will not kill suid/sgid. Miklos suggested that in such cases, we should fallback to a writethrough WRITE instead and that will generate WRITE request and kill suid/sgid. This patch implements that too.
But this relies on client seeing the suid/sgid set. If another client sets suid/sgid and this client does not see it immideately, then we will not fallback to writethrough WRITE. So this is one limitation with both fc->handle_killpriv_v2 and fc->writeback_cache enabled. Both the options are not fully compatible. But might be good enough for many use cases.
Note: This patch is not checking whether security.capability is set or not when falling back to writethrough path. If suid/sgid is not set and only security.capability is set, that will be taken care of by file_remove_privs() call in ->writeback_cache path.
Signed-off-by: Vivek Goyal vgoyal@redhat.com Signed-off-by: Miklos Szeredi mszeredi@redhat.com Signed-off-by: Baokun Li libaokun1@huawei.com Reviewed-by: Hou Tao houtao1@huawei.com Signed-off-by: Zheng Zengkai zhengzengkai@huawei.com --- fs/fuse/dir.c | 2 +- fs/fuse/file.c | 9 ++++++++- 2 files changed, 9 insertions(+), 2 deletions(-)
diff --git a/fs/fuse/dir.c b/fs/fuse/dir.c index e582342bfc16..f7e02f8151a7 100644 --- a/fs/fuse/dir.c +++ b/fs/fuse/dir.c @@ -1797,7 +1797,7 @@ static int fuse_setattr(struct dentry *entry, struct iattr *attr) * * This should be done on write(), truncate() and chown(). */ - if (!fc->handle_killpriv) { + if (!fc->handle_killpriv && !fc->handle_killpriv_v2) { /* * ia_mode calculation may have used stale i_mode. * Refresh and recalculate. diff --git a/fs/fuse/file.c b/fs/fuse/file.c index 3f696df47b5a..386c40478eee 100644 --- a/fs/fuse/file.c +++ b/fs/fuse/file.c @@ -1285,17 +1285,24 @@ static ssize_t fuse_cache_write_iter(struct kiocb *iocb, struct iov_iter *from) ssize_t written_buffered = 0; struct inode *inode = mapping->host; ssize_t err; + struct fuse_conn *fc = get_fuse_conn(inode); loff_t endbyte = 0;
- if (get_fuse_conn(inode)->writeback_cache) { + if (fc->writeback_cache) { /* Update size (EOF optimization) and mode (SUID clearing) */ err = fuse_update_attributes(mapping->host, file); if (err) return err;
+ if (fc->handle_killpriv_v2 && + should_remove_suid(file_dentry(file))) { + goto writethrough; + } + return generic_file_write_iter(iocb, from); }
+writethrough: inode_lock(inode);
/* We can write back this queue in page reclaim */
From: Vivek Goyal vgoyal@redhat.com
mainline inclusion from mainline-v5.11-rc1 commit 643a666a89c358ef588d2b3ef9f2dc1efc421e61 category: perf bugzilla: https://gitee.com/openeuler/kernel/issues/I4SIR8
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i...
--------------------------------
With FUSE_HANDLE_KILLPRIV_V2 support, server will need to kill suid/sgid/ security.capability on open(O_TRUNC), if server supports FUSE_ATOMIC_O_TRUNC.
But server needs to kill suid/sgid only if caller does not have CAP_FSETID. Given server does not have this information, client needs to send this info to server.
So add a flag FUSE_OPEN_KILL_SUIDGID to fuse_open_in request which tells server to kill suid/sgid (only if group execute is set).
This flag is added to the FUSE_OPEN request, as well as the FUSE_CREATE request if the create was non-exclusive, since that might result in an existing file being opened/truncated.
Signed-off-by: Vivek Goyal vgoyal@redhat.com Signed-off-by: Miklos Szeredi mszeredi@redhat.com Signed-off-by: Baokun Li libaokun1@huawei.com Reviewed-by: Hou Tao houtao1@huawei.com Signed-off-by: Zheng Zengkai zhengzengkai@huawei.com --- fs/fuse/dir.c | 6 ++++++ fs/fuse/file.c | 6 ++++++ include/uapi/linux/fuse.h | 11 +++++++++-- 3 files changed, 21 insertions(+), 2 deletions(-)
diff --git a/fs/fuse/dir.c b/fs/fuse/dir.c index f7e02f8151a7..6ab3798da079 100644 --- a/fs/fuse/dir.c +++ b/fs/fuse/dir.c @@ -560,6 +560,12 @@ static int fuse_create_open(struct inode *dir, struct dentry *entry, inarg.flags = flags; inarg.mode = mode; inarg.umask = current_umask(); + + if (fm->fc->handle_killpriv_v2 && (flags & O_TRUNC) && + !(flags & O_EXCL) && !capable(CAP_FSETID)) { + inarg.open_flags |= FUSE_OPEN_KILL_SUIDGID; + } + args.opcode = FUSE_CREATE; args.nodeid = get_node_id(dir); args.in_numargs = 2; diff --git a/fs/fuse/file.c b/fs/fuse/file.c index 386c40478eee..e63ce8443c96 100644 --- a/fs/fuse/file.c +++ b/fs/fuse/file.c @@ -42,6 +42,12 @@ static int fuse_send_open(struct fuse_mount *fm, u64 nodeid, struct file *file, inarg.flags = file->f_flags & ~(O_CREAT | O_EXCL | O_NOCTTY); if (!fm->fc->atomic_o_trunc) inarg.flags &= ~O_TRUNC; + + if (fm->fc->handle_killpriv_v2 && + (inarg.flags & O_TRUNC) && !capable(CAP_FSETID)) { + inarg.open_flags |= FUSE_OPEN_KILL_SUIDGID; + } + args.opcode = opcode; args.nodeid = nodeid; args.in_numargs = 1; diff --git a/include/uapi/linux/fuse.h b/include/uapi/linux/fuse.h index 9eb96e0564be..98ca64d1beb6 100644 --- a/include/uapi/linux/fuse.h +++ b/include/uapi/linux/fuse.h @@ -178,6 +178,7 @@ * * 7.33 * - add FUSE_HANDLE_KILLPRIV_V2, FUSE_WRITE_KILL_SUIDGID, FATTR_KILL_SUIDGID + * - add FUSE_OPEN_KILL_SUIDGID */
#ifndef _LINUX_FUSE_H @@ -444,6 +445,12 @@ struct fuse_file_lock { */ #define FUSE_ATTR_SUBMOUNT (1 << 0)
+/** + * Open flags + * FUSE_OPEN_KILL_SUIDGID: Kill suid and sgid if executable + */ +#define FUSE_OPEN_KILL_SUIDGID (1 << 0) + enum fuse_opcode { FUSE_LOOKUP = 1, FUSE_FORGET = 2, /* no reply */ @@ -605,14 +612,14 @@ struct fuse_setattr_in {
struct fuse_open_in { uint32_t flags; - uint32_t unused; + uint32_t open_flags; /* FUSE_OPEN_... */ };
struct fuse_create_in { uint32_t flags; uint32_t mode; uint32_t umask; - uint32_t padding; + uint32_t open_flags; /* FUSE_OPEN_... */ };
struct fuse_open_out {
From: Vivek Goyal vgoyal@redhat.com
mainline inclusion from mainline-v5.11-rc1 commit 9d769e6aa2524e1762e3b8681e0ed78f8acf6cad category: perf bugzilla: https://gitee.com/openeuler/kernel/issues/I4SIR8
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i...
--------------------------------
Virtiofs can be slow with small writes if xattr are enabled and we are doing cached writes (No direct I/O). Ganesh Mahalingam noticed this.
Some debugging showed that file_remove_privs() is called in cached write path on every write. And everytime it calls security_inode_need_killpriv() which results in call to __vfs_getxattr(XATTR_NAME_CAPS). And this goes to file server to fetch xattr. This extra round trip for every write slows down writes tremendously.
Normally to avoid paying this penalty on every write, vfs has the notion of caching this information in inode (S_NOSEC). So vfs sets S_NOSEC, if filesystem opted for it using super block flag SB_NOSEC. And S_NOSEC is cleared when setuid/setgid bit is set or when security xattr is set on inode so that next time a write happens, we check inode again for clearing setuid/setgid bits as well clear any security.capability xattr.
This seems to work well for local file systems but for remote file systems it is possible that VFS does not have full picture and a different client sets setuid/setgid bit or security.capability xattr on file and that means VFS information about S_NOSEC on another client will be stale. So for remote filesystems SB_NOSEC was disabled by default.
Commit 9e1f1de02c22 ("more conservative S_NOSEC handling") mentioned that these filesystems can still make use of SB_NOSEC as long as they clear S_NOSEC when they are refreshing inode attriutes from server.
So this patch tries to enable SB_NOSEC on fuse (regular fuse as well as virtiofs). And clear SB_NOSEC when we are refreshing inode attributes.
This is enabled only if server supports FUSE_HANDLE_KILLPRIV_V2. This says that server will clear setuid/setgid/security.capability on chown/truncate/write as apporpriate.
This should provide tighter coherency because now suid/sgid/ security.capability will be cleared even if fuse client cache has not seen these attrs.
Basic idea is that fuse client will trigger suid/sgid/security.capability clearing based on its attr cache. But even if cache has gone stale, it is fine because FUSE_HANDLE_KILLPRIV_V2 will make sure WRITE clear suid/sgid/security.capability.
We make this change only if server supports FUSE_HANDLE_KILLPRIV_V2. This should make sure that existing filesystems which might be relying on seucurity.capability always being queried from server are not impacted.
This tighter coherency relies on WRITE showing up on server (and not being cached in guest). So writeback_cache mode will not provide that tight coherency and it is not recommended to use two together. Having said that it might work reasonably well for lot of use cases.
This change improves random write performance very significantly. Running virtiofsd with cache=auto and following fio command:
fio --ioengine=libaio --direct=1 --name=test --filename=/mnt/virtiofs/random_read_write.fio --bs=4k --iodepth=64 --size=4G --readwrite=randwrite
Bandwidth increases from around 50MB/s to around 250MB/s as a result of applying this patch. So improvement is very significant.
Link: https://github.com/kata-containers/runtime/issues/2815 Reported-by: "Mahalingam, Ganesh" ganesh.mahalingam@intel.com Signed-off-by: Vivek Goyal vgoyal@redhat.com Signed-off-by: Miklos Szeredi mszeredi@redhat.com Signed-off-by: Baokun Li libaokun1@huawei.com Reviewed-by: Hou Tao houtao1@huawei.com Signed-off-by: Zheng Zengkai zhengzengkai@huawei.com --- fs/fuse/inode.c | 14 +++++++++++++- 1 file changed, 13 insertions(+), 1 deletion(-)
diff --git a/fs/fuse/inode.c b/fs/fuse/inode.c index 59765df54fa7..5e7bb8826f3f 100644 --- a/fs/fuse/inode.c +++ b/fs/fuse/inode.c @@ -207,6 +207,16 @@ void fuse_change_attributes_common(struct inode *inode, struct fuse_attr *attr, inode->i_mode &= ~S_ISVTX;
fi->orig_ino = attr->ino; + + /* + * We are refreshing inode data and it is possible that another + * client set suid/sgid or security.capability xattr. So clear + * S_NOSEC. Ideally, we could have cleared it only if suid/sgid + * was set or if security.capability xattr was set. But we don't + * know if security.capability has been set or not. So clear it + * anyway. Its less efficient but should be safe. + */ + inode->i_flags &= ~S_NOSEC; }
void fuse_change_attributes(struct inode *inode, struct fuse_attr *attr, @@ -1058,8 +1068,10 @@ static void process_init_reply(struct fuse_mount *fm, struct fuse_args *args, !fuse_dax_check_alignment(fc, arg->map_alignment)) { ok = false; } - if (arg->flags & FUSE_HANDLE_KILLPRIV_V2) + if (arg->flags & FUSE_HANDLE_KILLPRIV_V2) { fc->handle_killpriv_v2 = 1; + fm->sb->s_flags |= SB_NOSEC; + } } else { ra_pages = fc->max_read / PAGE_SIZE; fc->no_lock = 1;
From: ChenXiaoSong chenxiaosong2@huawei.com
hulk inclusion category: bugfix bugzilla: https://gitee.com/openeuler/kernel/issues/I4T2WV CVE: NA
--------------------------------
This reverts commit 67dd23f9e6fbaf163431912ef5599c5e0693476c.
filemap_sample_wb_err() will return 0 if nobody has seen the error yet, then filemap_check_wb_err() will return the unchanged writeback error.
Reproducer: nfs server | nfs client --------------------------------|---------------------------------------------- # No space left on server | fallocate -l 100G /server/nospc | | | mount -t nfs $nfs_server_ip:/ /mnt | | # Expected error: No space left on device | dd if=/dev/zero of=/mnt/file count=1 ibs=1K | | # Release space on mountpoint | rm /mnt/nospc | | # Unexpected error: No space left on device | dd if=/dev/zero of=/mnt/file count=1 ibs=1K
Signed-off-by: ChenXiaoSong chenxiaosong2@huawei.com Reviewed-by: Zhang Yi yi.zhang@huawei.com Signed-off-by: Zheng Zengkai zhengzengkai@huawei.com --- fs/nfs/file.c | 5 +---- fs/nfs/nfs4file.c | 5 +---- 2 files changed, 2 insertions(+), 8 deletions(-)
diff --git a/fs/nfs/file.c b/fs/nfs/file.c index 63940a7a70be..4556e75d4591 100644 --- a/fs/nfs/file.c +++ b/fs/nfs/file.c @@ -140,7 +140,6 @@ static int nfs_file_flush(struct file *file, fl_owner_t id) { struct inode *inode = file_inode(file); - errseq_t since;
dprintk("NFS: flush(%pD2)\n", file);
@@ -149,9 +148,7 @@ nfs_file_flush(struct file *file, fl_owner_t id) return 0;
/* Flush writes to the server and return any errors */ - since = filemap_sample_wb_err(file->f_mapping); - nfs_wb_all(inode); - return filemap_check_wb_err(file->f_mapping, since); + return nfs_wb_all(inode); }
ssize_t diff --git a/fs/nfs/nfs4file.c b/fs/nfs/nfs4file.c index a1e5c6b85ded..079ec1947c24 100644 --- a/fs/nfs/nfs4file.c +++ b/fs/nfs/nfs4file.c @@ -111,7 +111,6 @@ static int nfs4_file_flush(struct file *file, fl_owner_t id) { struct inode *inode = file_inode(file); - errseq_t since;
dprintk("NFS: flush(%pD2)\n", file);
@@ -127,9 +126,7 @@ nfs4_file_flush(struct file *file, fl_owner_t id) return filemap_fdatawrite(file->f_mapping);
/* Flush writes to the server and return any errors */ - since = filemap_sample_wb_err(file->f_mapping); - nfs_wb_all(inode); - return filemap_check_wb_err(file->f_mapping, since); + return nfs_wb_all(inode); }
#ifdef CONFIG_NFS_V4_2
From: Dave Chinner dchinner@redhat.com
mainline-inclusion from mainline-v5.14-rc4 commit 2039a272300b949c05888428877317b834c0b1fb category: bugfix bugzilla: https://gitee.com/openeuler/kernel/issues/I4V7IK CVE: NA
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i...
-------------------------------------------------
Make it less shouty and a static inline before adding more calls through the log code.
Also convert internal log code that uses XFS_FORCED_SHUTDOWN(mount) to use xlog_is_shutdown(log) as well.
Signed-off-by: Dave Chinner dchinner@redhat.com Reviewed-by: Darrick J. Wong djwong@kernel.org Reviewed-by: Christoph Hellwig hch@lst.de Signed-off-by: Darrick J. Wong djwong@kernel.org Signed-off-by: Lihong Kou koulihong@huawei.com Reviewed-by: guoxuenan guoxuenan@huawei.com Reviewed-by: Zhang Yi yi.zhang@huawei.com Signed-off-by: Zheng Zengkai zhengzengkai@huawei.com --- fs/xfs/xfs_log.c | 28 ++++++++++++++-------------- fs/xfs/xfs_log_cil.c | 10 +++++----- fs/xfs/xfs_log_priv.h | 7 +++++-- fs/xfs/xfs_log_recover.c | 9 +++------ fs/xfs/xfs_trans.c | 2 +- 5 files changed, 28 insertions(+), 28 deletions(-)
diff --git a/fs/xfs/xfs_log.c b/fs/xfs/xfs_log.c index b9f2ad4e8345..9a14cc07ef9d 100644 --- a/fs/xfs/xfs_log.c +++ b/fs/xfs/xfs_log.c @@ -239,7 +239,7 @@ xlog_grant_head_wait( list_add_tail(&tic->t_queue, &head->waiters);
do { - if (XLOG_FORCED_SHUTDOWN(log)) + if (xlog_is_shutdown(log)) goto shutdown; xlog_grant_push_ail(log, need_bytes);
@@ -253,7 +253,7 @@ xlog_grant_head_wait( trace_xfs_log_grant_wake(log, tic);
spin_lock(&head->lock); - if (XLOG_FORCED_SHUTDOWN(log)) + if (xlog_is_shutdown(log)) goto shutdown; } while (xlog_space_left(log, &head->grant) < need_bytes);
@@ -354,7 +354,7 @@ xfs_log_regrant( int need_bytes; int error = 0;
- if (XLOG_FORCED_SHUTDOWN(log)) + if (xlog_is_shutdown(log)) return -EIO;
XFS_STATS_INC(mp, xs_try_logspace); @@ -422,7 +422,7 @@ xfs_log_reserve(
ASSERT(client == XFS_TRANSACTION || client == XFS_LOG);
- if (XLOG_FORCED_SHUTDOWN(log)) + if (xlog_is_shutdown(log)) return -EIO;
XFS_STATS_INC(mp, xs_try_logspace); @@ -778,7 +778,7 @@ xlog_wait_on_iclog( struct xlog *log = iclog->ic_log;
trace_xlog_iclog_wait_on(iclog, _RET_IP_); - if (!XLOG_FORCED_SHUTDOWN(log) && + if (!xlog_is_shutdown(log) && iclog->ic_state != XLOG_STATE_ACTIVE && iclog->ic_state != XLOG_STATE_DIRTY) { XFS_STATS_INC(log->l_mp, xs_log_force_sleep); @@ -787,7 +787,7 @@ xlog_wait_on_iclog( spin_unlock(&log->l_icloglock); }
- if (XLOG_FORCED_SHUTDOWN(log)) + if (xlog_is_shutdown(log)) return -EIO; return 0; } @@ -895,7 +895,7 @@ xfs_log_unmount_write(
xfs_log_force(mp, XFS_LOG_SYNC);
- if (XLOG_FORCED_SHUTDOWN(log)) + if (xlog_is_shutdown(log)) return;
/* @@ -995,7 +995,7 @@ xfs_log_space_wake( struct xlog *log = mp->m_log; int free_bytes;
- if (XLOG_FORCED_SHUTDOWN(log)) + if (xlog_is_shutdown(log)) return;
if (!list_empty_careful(&log->l_write_head.waiters)) { @@ -1465,7 +1465,7 @@ xlog_commit_record( }; int error;
- if (XLOG_FORCED_SHUTDOWN(log)) + if (xlog_is_shutdown(log)) return -EIO;
error = xlog_write(log, &vec, ticket, lsn, iclog, XLOG_COMMIT_TRANS); @@ -1546,7 +1546,7 @@ xlog_grant_push_ail( xfs_lsn_t threshold_lsn;
threshold_lsn = xlog_grant_push_threshold(log, need_bytes); - if (threshold_lsn == NULLCOMMITLSN || XLOG_FORCED_SHUTDOWN(log)) + if (threshold_lsn == NULLCOMMITLSN || xlog_is_shutdown(log)) return;
/* @@ -2737,7 +2737,7 @@ xlog_state_do_callback( cycled_icloglock = true;
spin_lock(&log->l_icloglock); - if (XLOG_FORCED_SHUTDOWN(log)) + if (xlog_is_shutdown(log)) wake_up_all(&iclog->ic_force_wait); else xlog_state_clean_iclog(log, iclog); @@ -2789,7 +2789,7 @@ xlog_state_done_syncing( * split log writes, on the second, we shut down the file system and * no iclogs should ever be attempted to be written to disk again. */ - if (!XLOG_FORCED_SHUTDOWN(log)) { + if (!xlog_is_shutdown(log)) { ASSERT(iclog->ic_state == XLOG_STATE_SYNCING); iclog->ic_state = XLOG_STATE_DONE_SYNC; } @@ -2837,7 +2837,7 @@ xlog_state_get_iclog_space(
restart: spin_lock(&log->l_icloglock); - if (XLOG_FORCED_SHUTDOWN(log)) { + if (xlog_is_shutdown(log)) { spin_unlock(&log->l_icloglock); return -EIO; } @@ -3737,7 +3737,7 @@ xfs_log_force_umount( * No need to get locks for this. */ if (logerror && log->l_iclog->ic_state == XLOG_STATE_IOERROR) { - ASSERT(XLOG_FORCED_SHUTDOWN(log)); + ASSERT(xlog_is_shutdown(log)); return 1; }
diff --git a/fs/xfs/xfs_log_cil.c b/fs/xfs/xfs_log_cil.c index d17e8b7e7fc6..ec6e302aaab4 100644 --- a/fs/xfs/xfs_log_cil.c +++ b/fs/xfs/xfs_log_cil.c @@ -612,7 +612,7 @@ xlog_cil_committed( struct xfs_cil_ctx *ctx) { struct xfs_mount *mp = ctx->cil->xc_log->l_mp; - bool abort = XLOG_FORCED_SHUTDOWN(ctx->cil->xc_log); + bool abort = xlog_is_shutdown(ctx->cil->xc_log);
/* * If the I/O failed, we're aborting the commit and already shutdown. @@ -880,7 +880,7 @@ xlog_cil_push_work( * shutdown, but then went back to sleep once already in the * shutdown state. */ - if (XLOG_FORCED_SHUTDOWN(log)) { + if (xlog_is_shutdown(log)) { spin_unlock(&cil->xc_push_lock); goto out_abort_free_ticket; } @@ -1000,7 +1000,7 @@ xlog_cil_push_work( out_abort_free_ticket: xfs_log_ticket_ungrant(log, tic); out_abort: - ASSERT(XLOG_FORCED_SHUTDOWN(log)); + ASSERT(xlog_is_shutdown(log)); xlog_cil_committed(ctx); }
@@ -1168,7 +1168,7 @@ xlog_cil_commit(
xlog_cil_insert_items(log, tp);
- if (regrant && !XLOG_FORCED_SHUTDOWN(log)) + if (regrant && !xlog_is_shutdown(log)) xfs_log_ticket_regrant(log, tp->t_ticket); else xfs_log_ticket_ungrant(log, tp->t_ticket); @@ -1260,7 +1260,7 @@ xlog_cil_force_seq( * shutdown, but then went back to sleep once already in the * shutdown state. */ - if (XLOG_FORCED_SHUTDOWN(log)) + if (xlog_is_shutdown(log)) goto out_shutdown; if (ctx->sequence > sequence) continue; diff --git a/fs/xfs/xfs_log_priv.h b/fs/xfs/xfs_log_priv.h index ef96bd5e7e56..d6c6030fac07 100644 --- a/fs/xfs/xfs_log_priv.h +++ b/fs/xfs/xfs_log_priv.h @@ -456,8 +456,11 @@ struct xlog { #define XLOG_BUF_CANCEL_BUCKET(log, blkno) \ ((log)->l_buf_cancel_table + ((uint64_t)blkno % XLOG_BC_TABLE_SIZE))
-#define XLOG_FORCED_SHUTDOWN(log) \ - (unlikely((log)->l_flags & XLOG_IO_ERROR)) +static inline bool +xlog_is_shutdown(struct xlog *log) +{ + return (log->l_flags & XLOG_IO_ERROR); +}
/* common routines */ extern int diff --git a/fs/xfs/xfs_log_recover.c b/fs/xfs/xfs_log_recover.c index 36f01c8eb57e..7caeaf5e4fa8 100644 --- a/fs/xfs/xfs_log_recover.c +++ b/fs/xfs/xfs_log_recover.c @@ -143,7 +143,7 @@ xlog_do_io(
error = xfs_rw_bdev(log->l_targ->bt_bdev, log->l_logBBstart + blk_no, BBTOB(nbblks), data, op); - if (error && !XFS_FORCED_SHUTDOWN(log->l_mp)) { + if (error && !xlog_is_shutdown(log)) { xfs_alert(log->l_mp, "log recovery %s I/O error at daddr 0x%llx len %d error %d", op == REQ_OP_WRITE ? "write" : "read", @@ -3292,10 +3292,7 @@ xlog_do_recover( if (error) return error;
- /* - * If IO errors happened during recovery, bail out. - */ - if (XFS_FORCED_SHUTDOWN(mp)) + if (xlog_is_shutdown(log)) return -EIO;
/* @@ -3317,7 +3314,7 @@ xlog_do_recover( xfs_buf_hold(bp); error = _xfs_buf_read(bp, XBF_READ); if (error) { - if (!XFS_FORCED_SHUTDOWN(mp)) { + if (!xlog_is_shutdown(log)) { xfs_buf_ioerror_alert(bp, __this_address); ASSERT(0); } diff --git a/fs/xfs/xfs_trans.c b/fs/xfs/xfs_trans.c index 3d1c4a4cb16e..e8a9967e7194 100644 --- a/fs/xfs/xfs_trans.c +++ b/fs/xfs/xfs_trans.c @@ -919,7 +919,7 @@ __xfs_trans_commit( */ xfs_trans_unreserve_and_mod_dquots(tp); if (tp->t_ticket) { - if (regrant && !XLOG_FORCED_SHUTDOWN(mp->m_log)) + if (regrant && !xlog_is_shutdown(mp->m_log)) xfs_log_ticket_regrant(mp->m_log, tp->t_ticket); else xfs_log_ticket_ungrant(mp->m_log, tp->t_ticket);
From: Dave Chinner dchinner@redhat.com
mainline-inclusion from mainline-v5.14-rc4 commit 5112e2067bd94bd56aace4c7e4d45ff13b9152f8 category: bugfix bugzilla: https://gitee.com/openeuler/kernel/issues/I4V7IK CVE: NA
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i...
-------------------------------------------------
We don't need an iclog state field to tell us the log has been shut down. We can just check the xlog_is_shutdown() instead. The avoids the need to have shutdown overwrite the current iclog state while being active used by the log code and so having to ensure that every iclog state check handles XLOG_STATE_IOERROR appropriately.
Signed-off-by: Dave Chinner dchinner@redhat.com Reviewed-by: Christoph Hellwig hch@lst.de Reviewed-by: Darrick J. Wong djwong@kernel.org Signed-off-by: Darrick J. Wong djwong@kernel.org Signed-off-by: Lihong Kou koulihong@huawei.com Reviewed-by: guoxuenan guoxuenan@huawei.com Reviewed-by: Zhang Yi yi.zhang@huawei.com Signed-off-by: Zheng Zengkai zhengzengkai@huawei.com --- fs/xfs/xfs_log.c | 108 ++++++++++++------------------------------ fs/xfs/xfs_log_cil.c | 2 +- fs/xfs/xfs_log_priv.h | 4 +- fs/xfs/xfs_trace.h | 1 - 4 files changed, 32 insertions(+), 83 deletions(-)
diff --git a/fs/xfs/xfs_log.c b/fs/xfs/xfs_log.c index 9a14cc07ef9d..68807977be68 100644 --- a/fs/xfs/xfs_log.c +++ b/fs/xfs/xfs_log.c @@ -476,7 +476,7 @@ xlog_state_release_iclog( lockdep_assert_held(&log->l_icloglock);
trace_xlog_iclog_release(iclog, _RET_IP_); - if (iclog->ic_state == XLOG_STATE_IOERROR) + if (xlog_is_shutdown(log)) return -EIO;
/* @@ -841,7 +841,7 @@ xlog_unmount_write( error = xlog_write_unmount_record(log, tic); /* * At this point, we're umounting anyway, so there's no point in - * transitioning log state to IOERROR. Just continue... + * transitioning log state to shutdown. Just continue... */ out_err: if (error) @@ -1688,7 +1688,7 @@ xlog_write_iclog( * across the log IO to archieve that. */ down(&iclog->ic_sema); - if (unlikely(iclog->ic_state == XLOG_STATE_IOERROR)) { + if (xlog_is_shutdown(log)) { /* * It would seem logical to return EIO here, but we rely on * the log state machine to propagate I/O errors instead of @@ -2228,7 +2228,7 @@ xlog_write_copy_finish( xlog_state_switch_iclogs(log, iclog, 0); else ASSERT(iclog->ic_state == XLOG_STATE_WANT_SYNC || - iclog->ic_state == XLOG_STATE_IOERROR); + xlog_is_shutdown(log)); if (!commit_iclog) goto release_iclog; spin_unlock(&log->l_icloglock); @@ -2644,8 +2644,7 @@ xlog_state_set_callback( static bool xlog_state_iodone_process_iclog( struct xlog *log, - struct xlog_in_core *iclog, - bool *ioerror) + struct xlog_in_core *iclog) { xfs_lsn_t lowest_lsn; xfs_lsn_t header_lsn; @@ -2657,15 +2656,6 @@ xlog_state_iodone_process_iclog( * Skip all iclogs in the ACTIVE & DIRTY states: */ return false; - case XLOG_STATE_IOERROR: - /* - * Between marking a filesystem SHUTDOWN and stopping the log, - * we do flush all iclogs to disk (if there wasn't a log I/O - * error). So, we do want things to go smoothly in case of just - * a SHUTDOWN w/o a LOG_IO_ERROR. - */ - *ioerror = true; - return false; case XLOG_STATE_DONE_SYNC: /* * Now that we have an iclog that is in the DONE_SYNC state, do @@ -2696,7 +2686,6 @@ xlog_state_do_callback( struct xlog_in_core *iclog; struct xlog_in_core *first_iclog; bool cycled_icloglock; - bool ioerror; int flushcnt = 0; int repeats = 0;
@@ -2710,23 +2699,20 @@ xlog_state_do_callback( * Keep looping through iclogs until one full pass is made * without running any callbacks. */ - first_iclog = log->l_iclog; - iclog = log->l_iclog; cycled_icloglock = false; - ioerror = false; - repeats++; + first_iclog = log->l_iclog; + iclog = first_iclog;
do { LIST_HEAD(cb_list);
- if (xlog_state_iodone_process_iclog(log, iclog, - &ioerror)) - break; - - if (iclog->ic_state != XLOG_STATE_CALLBACK && - iclog->ic_state != XLOG_STATE_IOERROR) { - iclog = iclog->ic_next; - continue; + if (!xlog_is_shutdown(log)) { + if (xlog_state_iodone_process_iclog(log, iclog)) + break; + if (iclog->ic_state != XLOG_STATE_CALLBACK) { + iclog = iclog->ic_next; + continue; + } } list_splice_init(&iclog->ic_callbacks, &cb_list); spin_unlock(&log->l_icloglock); @@ -2742,19 +2728,19 @@ xlog_state_do_callback( else xlog_state_clean_iclog(log, iclog); iclog = iclog->ic_next; - } while (first_iclog != iclog); + } while (iclog != first_iclog);
- if (repeats > 5000) { + if (++repeats > 5000) { flushcnt += repeats; repeats = 0; xfs_warn(log->l_mp, "%s: possible infinite loop (%d iterations)", __func__, flushcnt); } - } while (!ioerror && cycled_icloglock); + } while (!xlog_is_shutdown(log) && cycled_icloglock);
if (log->l_iclog->ic_state == XLOG_STATE_ACTIVE || - log->l_iclog->ic_state == XLOG_STATE_IOERROR) + xlog_is_shutdown(log)) wake_up_all(&log->l_flush_wait);
spin_unlock(&log->l_icloglock); @@ -2764,13 +2750,6 @@ xlog_state_do_callback( /* * Finish transitioning this iclog to the dirty state. * - * Make sure that we completely execute this routine only when this is - * the last call to the iclog. There is a good chance that iclog flushes, - * when we reach the end of the physical log, get turned into 2 separate - * calls to bwrite. Hence, one iclog flush could generate two calls to this - * routine. By using the reference count bwritecnt, we guarantee that only - * the second completion goes through. - * * Callbacks could take time, so they are done outside the scope of the * global state machine log lock. */ @@ -3130,10 +3109,10 @@ xfs_log_force( xlog_cil_force(log);
spin_lock(&log->l_icloglock); - iclog = log->l_iclog; - if (iclog->ic_state == XLOG_STATE_IOERROR) + if (xlog_is_shutdown(log)) goto out_error;
+ iclog = log->l_iclog; trace_xlog_iclog_force(iclog, _RET_IP_);
if (iclog->ic_state == XLOG_STATE_DIRTY || @@ -3213,10 +3192,10 @@ xlog_force_lsn( bool completed;
spin_lock(&log->l_icloglock); - iclog = log->l_iclog; - if (iclog->ic_state == XLOG_STATE_IOERROR) + if (xlog_is_shutdown(log)) goto out_error;
+ iclog = log->l_iclog; while (be64_to_cpu(iclog->ic_header.h_lsn) != lsn) { trace_xlog_iclog_force_lsn(iclog, _RET_IP_); iclog = iclog->ic_next; @@ -3666,34 +3645,6 @@ xlog_verify_iclog( } #endif
-/* - * Mark all iclogs IOERROR. l_icloglock is held by the caller. - */ -STATIC int -xlog_state_ioerror( - struct xlog *log) -{ - xlog_in_core_t *iclog, *ic; - - iclog = log->l_iclog; - if (iclog->ic_state != XLOG_STATE_IOERROR) { - /* - * Mark all the incore logs IOERROR. - * From now on, no log flushes will result. - */ - ic = iclog; - do { - ic->ic_state = XLOG_STATE_IOERROR; - ic = ic->ic_next; - } while (ic != iclog); - return 0; - } - /* - * Return non-zero, if state transition has already happened. - */ - return 1; -} - /* * This is called from xfs_force_shutdown, when we're forcibly * shutting down the filesystem, typically because of an IO error. @@ -3709,6 +3660,8 @@ xlog_state_ioerror( * Note: for the !logerror case we need to flush the regions held in memory out * to disk first. This needs to be done before the log is marked as shutdown, * otherwise the iclog writes will fail. + * + * Return non-zero if log shutdown transition had already happened. */ int xfs_log_force_umount( @@ -3716,7 +3669,7 @@ xfs_log_force_umount( int logerror) { struct xlog *log; - int retval; + int retval = 0;
log = mp->m_log;
@@ -3736,10 +3689,8 @@ xfs_log_force_umount( * Somebody could've already done the hard work for us. * No need to get locks for this. */ - if (logerror && log->l_iclog->ic_state == XLOG_STATE_IOERROR) { - ASSERT(xlog_is_shutdown(log)); + if (logerror && xlog_is_shutdown(log)) return 1; - }
/* * Flush all the completed transactions to disk before marking the log @@ -3764,8 +3715,10 @@ xfs_log_force_umount( * Mark the log and the iclogs with IO error flags to prevent any * further log IO from being issued or completed. */ - log->l_flags |= XLOG_IO_ERROR; - retval = xlog_state_ioerror(log); + if (!(log->l_flags & XLOG_IO_ERROR)) { + log->l_flags |= XLOG_IO_ERROR; + retval = 1; + } spin_unlock(&log->l_icloglock);
/* @@ -3789,7 +3742,6 @@ xfs_log_force_umount( spin_unlock(&log->l_cilp->xc_push_lock); xlog_state_do_callback(log);
- /* return non-zero if log IOERROR transition had already happened */ return retval; }
diff --git a/fs/xfs/xfs_log_cil.c b/fs/xfs/xfs_log_cil.c index ec6e302aaab4..1259cd752b45 100644 --- a/fs/xfs/xfs_log_cil.c +++ b/fs/xfs/xfs_log_cil.c @@ -916,7 +916,7 @@ xlog_cil_push_work( * callbacks and dropped the icloglock. */ spin_lock(&log->l_icloglock); - if (commit_iclog->ic_state == XLOG_STATE_IOERROR) { + if (xlog_is_shutdown(log)) { spin_unlock(&log->l_icloglock); goto out_abort; } diff --git a/fs/xfs/xfs_log_priv.h b/fs/xfs/xfs_log_priv.h index d6c6030fac07..88c40ba7e55c 100644 --- a/fs/xfs/xfs_log_priv.h +++ b/fs/xfs/xfs_log_priv.h @@ -47,7 +47,6 @@ enum xlog_iclog_state { XLOG_STATE_DONE_SYNC, /* Done syncing to disk */ XLOG_STATE_CALLBACK, /* Callback functions now */ XLOG_STATE_DIRTY, /* Dirty IC log, not ready for ACTIVE status */ - XLOG_STATE_IOERROR, /* IO error happened in sync'ing log */ };
#define XLOG_STATE_STRINGS \ @@ -56,8 +55,7 @@ enum xlog_iclog_state { { XLOG_STATE_SYNCING, "XLOG_STATE_SYNCING" }, \ { XLOG_STATE_DONE_SYNC, "XLOG_STATE_DONE_SYNC" }, \ { XLOG_STATE_CALLBACK, "XLOG_STATE_CALLBACK" }, \ - { XLOG_STATE_DIRTY, "XLOG_STATE_DIRTY" }, \ - { XLOG_STATE_IOERROR, "XLOG_STATE_IOERROR" } + { XLOG_STATE_DIRTY, "XLOG_STATE_DIRTY" }
/* diff --git a/fs/xfs/xfs_trace.h b/fs/xfs/xfs_trace.h index 5201111d24ff..3f4cc7e3915c 100644 --- a/fs/xfs/xfs_trace.h +++ b/fs/xfs/xfs_trace.h @@ -4001,7 +4001,6 @@ TRACE_DEFINE_ENUM(XLOG_STATE_SYNCING); TRACE_DEFINE_ENUM(XLOG_STATE_DONE_SYNC); TRACE_DEFINE_ENUM(XLOG_STATE_CALLBACK); TRACE_DEFINE_ENUM(XLOG_STATE_DIRTY); -TRACE_DEFINE_ENUM(XLOG_STATE_IOERROR);
DECLARE_EVENT_CLASS(xlog_iclog_class, TP_PROTO(struct xlog_in_core *iclog, unsigned long caller_ip),
From: Dave Chinner dchinner@redhat.com
mainline-inclusion from mainline-v5.14-rc4 commit fd67d8a07208ab06560287b7b9334c2d50b7d6d7 category: bugfix bugzilla: https://gitee.com/openeuler/kernel/issues/I4V7IK CVE: NA
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i...
-------------------------------------------------
xfs_log_mount_finish() needs to know if recovery is needed or not to make decisions on whether to flush the log and AIL. Move the handling of the NEED_RECOVERY state out to this function rather than needing a temporary variable to store this state over the call to xlog_recover_finish().
Signed-off-by: Dave Chinner dchinner@redhat.com Reviewed-by: Christoph Hellwig hch@lst.de Reviewed-by: Darrick J. Wong djwong@kernel.org Signed-off-by: Darrick J. Wong djwong@kernel.org Signed-off-by: Lihong Kou koulihong@huawei.com Reviewed-by: guoxuenan guoxuenan@huawei.com Reviewed-by: Zhang Yi yi.zhang@huawei.com Signed-off-by: Zheng Zengkai zhengzengkai@huawei.com --- fs/xfs/xfs_log.c | 24 ++++++++----- fs/xfs/xfs_log_recover.c | 73 +++++++++++++++------------------------- 2 files changed, 43 insertions(+), 54 deletions(-)
diff --git a/fs/xfs/xfs_log.c b/fs/xfs/xfs_log.c index 68807977be68..96e841bda226 100644 --- a/fs/xfs/xfs_log.c +++ b/fs/xfs/xfs_log.c @@ -674,9 +674,9 @@ int xfs_log_mount_finish( struct xfs_mount *mp) { - int error = 0; - bool readonly = (mp->m_flags & XFS_MOUNT_RDONLY); - bool recovered = mp->m_log->l_flags & XLOG_RECOVERY_NEEDED; + struct xlog *log = mp->m_log; + bool readonly = (mp->m_flags & XFS_MOUNT_RDONLY); + int error = 0;
if (mp->m_flags & XFS_MOUNT_NORECOVERY) { ASSERT(mp->m_flags & XFS_MOUNT_RDONLY); @@ -707,7 +707,8 @@ xfs_log_mount_finish( * mount failure occurs. */ mp->m_super->s_flags |= SB_ACTIVE; - error = xlog_recover_finish(mp->m_log); + if (log->l_flags & XLOG_RECOVERY_NEEDED) + error = xlog_recover_finish(log); if (!error) xfs_log_work_queue(mp); mp->m_super->s_flags &= ~SB_ACTIVE; @@ -722,17 +723,24 @@ xfs_log_mount_finish( * Don't push in the error case because the AIL may have pending intents * that aren't removed until recovery is cancelled. */ - if (!error && recovered) { - xfs_log_force(mp, XFS_LOG_SYNC); - xfs_ail_push_all_sync(mp->m_ail); + if (log->l_flags & XLOG_RECOVERY_NEEDED) { + if (!error) { + xfs_log_force(mp, XFS_LOG_SYNC); + xfs_ail_push_all_sync(mp->m_ail); + } + xfs_notice(mp, "Ending recovery (logdev: %s)", + mp->m_logname ? mp->m_logname : "internal"); + } else { + xfs_info(mp, "Ending clean mount"); } xfs_wait_buftarg(mp->m_ddev_targp);
+ log->l_flags &= ~XLOG_RECOVERY_NEEDED; if (readonly) mp->m_flags |= XFS_MOUNT_RDONLY;
/* Make sure the log is dead if we're returning failure. */ - ASSERT(!error || (mp->m_log->l_flags & XLOG_IO_ERROR)); + ASSERT(!error || xlog_is_shutdown(log));
return error; } diff --git a/fs/xfs/xfs_log_recover.c b/fs/xfs/xfs_log_recover.c index 7caeaf5e4fa8..d47b302387b6 100644 --- a/fs/xfs/xfs_log_recover.c +++ b/fs/xfs/xfs_log_recover.c @@ -3428,62 +3428,43 @@ xlog_recover( }
/* - * In the first part of recovery we replay inodes and buffers and build - * up the list of extent free items which need to be processed. Here - * we process the extent free items and clean up the on disk unlinked - * inode lists. This is separated from the first part of recovery so - * that the root and real-time bitmap inodes can be read in from disk in - * between the two stages. This is necessary so that we can free space - * in the real-time portion of the file system. + * In the first part of recovery we replay inodes and buffers and build up the + * list of intents which need to be processed. Here we process the intents and + * clean up the on disk unlinked inode lists. This is separated from the first + * part of recovery so that the root and real-time bitmap inodes can be read in + * from disk in between the two stages. This is necessary so that we can free + * space in the real-time portion of the file system. */ int xlog_recover_finish( struct xlog *log) { - /* - * Now we're ready to do the transactions needed for the - * rest of recovery. Start with completing all the extent - * free intent records and then process the unlinked inode - * lists. At this point, we essentially run in normal mode - * except that we're still performing recovery actions - * rather than accepting new requests. - */ - if (log->l_flags & XLOG_RECOVERY_NEEDED) { - int error; - error = xlog_recover_process_intents(log); - if (error) { - /* - * Cancel all the unprocessed intent items now so that - * we don't leave them pinned in the AIL. This can - * cause the AIL to livelock on the pinned item if - * anyone tries to push the AIL (inode reclaim does - * this) before we get around to xfs_log_mount_cancel. - */ - xlog_recover_cancel_intents(log); - xfs_force_shutdown(log->l_mp, SHUTDOWN_LOG_IO_ERROR); - xfs_alert(log->l_mp, "Failed to recover intents"); - return error; - } + int error;
+ error = xlog_recover_process_intents(log); + if (error) { /* - * Sync the log to get all the intents out of the AIL. - * This isn't absolutely necessary, but it helps in - * case the unlink transactions would have problems - * pushing the intents out of the way. + * Cancel all the unprocessed intent items now so that we don't + * leave them pinned in the AIL. This can cause the AIL to + * livelock on the pinned item if anyone tries to push the AIL + * (inode reclaim does this) before we get around to + * xfs_log_mount_cancel. */ - xfs_log_force(log->l_mp, XFS_LOG_SYNC); - - xlog_recover_process_iunlinks(log); + xlog_recover_cancel_intents(log); + xfs_alert(log->l_mp, "Failed to recover intents"); + xfs_force_shutdown(log->l_mp, SHUTDOWN_LOG_IO_ERROR); + return error; + }
- xlog_recover_check_summary(log); + /* + * Sync the log to get all the intents out of the AIL. This isn't + * absolutely necessary, but it helps in case the unlink transactions + * would have problems pushing the intents out of the way. + */ + xfs_log_force(log->l_mp, XFS_LOG_SYNC);
- xfs_notice(log->l_mp, "Ending recovery (logdev: %s)", - log->l_mp->m_logname ? log->l_mp->m_logname - : "internal"); - log->l_flags &= ~XLOG_RECOVERY_NEEDED; - } else { - xfs_info(log->l_mp, "Ending clean mount"); - } + xlog_recover_process_iunlinks(log); + xlog_recover_check_summary(log); return 0; }
From: Dave Chinner dchinner@redhat.com
mainline-inclusion from mainline-v5.14-rc4 commit e1d06e5f668a403f48538f0d6b163edfd4342adf category: bugfix bugzilla: https://gitee.com/openeuler/kernel/issues/I4V7IK CVE: NA
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i...
-------------------------------------------------
log->l_flags doesn't actually contain "flags" as such, it contains operational state information that can change at runtime. For the shutdown state, this at least should be an atomic bit because it is read without holding locks in many places and so using atomic bitops for the state field modifications makes sense.
This allows us to use things like test_and_set_bit() on state changes (e.g. setting XLOG_TAIL_WARN) to avoid races in setting the state when we aren't holding locks.
Signed-off-by: Dave Chinner dchinner@redhat.com Reviewed-by: Christoph Hellwig hch@lst.de Reviewed-by: Darrick J. Wong djwong@kernel.org Signed-off-by: Darrick J. Wong djwong@kernel.org Signed-off-by: Lihong Kou koulihong@huawei.com Reviewed-by: guoxuenan guoxuenan@huawei.com Reviewed-by: Zhang Yi yi.zhang@huawei.com Signed-off-by: Zheng Zengkai zhengzengkai@huawei.com --- fs/xfs/xfs_log.c | 58 ++++++++++++++++------------------------ fs/xfs/xfs_log.h | 1 - fs/xfs/xfs_log_priv.h | 34 +++++++++++++++-------- fs/xfs/xfs_log_recover.c | 6 ++--- fs/xfs/xfs_super.c | 2 +- 5 files changed, 50 insertions(+), 51 deletions(-)
diff --git a/fs/xfs/xfs_log.c b/fs/xfs/xfs_log.c index 96e841bda226..ac0da4544dd6 100644 --- a/fs/xfs/xfs_log.c +++ b/fs/xfs/xfs_log.c @@ -291,7 +291,7 @@ xlog_grant_head_check( int free_bytes; int error = 0;
- ASSERT(!(log->l_flags & XLOG_ACTIVE_RECOVERY)); + ASSERT(!xlog_in_recovery(log));
/* * If there are other waiters on the queue then give them a chance at @@ -528,6 +528,7 @@ xfs_log_mount( xfs_daddr_t blk_offset, int num_bblks) { + struct xlog *log; bool fatal = xfs_sb_version_hascrc(&mp->m_sb); int error = 0; int min_logfsbs; @@ -542,11 +543,12 @@ xfs_log_mount( ASSERT(mp->m_flags & XFS_MOUNT_RDONLY); }
- mp->m_log = xlog_alloc_log(mp, log_target, blk_offset, num_bblks); - if (IS_ERR(mp->m_log)) { - error = PTR_ERR(mp->m_log); + log = xlog_alloc_log(mp, log_target, blk_offset, num_bblks); + if (IS_ERR(log)) { + error = PTR_ERR(log); goto out; } + mp->m_log = log;
/* * Validate the given log space and drop a critical message via syslog @@ -611,7 +613,7 @@ xfs_log_mount( xfs_warn(mp, "AIL initialisation failed: error %d", error); goto out_free_log; } - mp->m_log->l_ailp = mp->m_ail; + log->l_ailp = mp->m_ail;
/* * skip log recovery on a norecovery mount. pretend it all @@ -623,39 +625,39 @@ xfs_log_mount( if (readonly) mp->m_flags &= ~XFS_MOUNT_RDONLY;
- error = xlog_recover(mp->m_log); + error = xlog_recover(log);
if (readonly) mp->m_flags |= XFS_MOUNT_RDONLY; if (error) { xfs_warn(mp, "log mount/recovery failed: error %d", error); - xlog_recover_cancel(mp->m_log); + xlog_recover_cancel(log); goto out_destroy_ail; } }
- error = xfs_sysfs_init(&mp->m_log->l_kobj, &xfs_log_ktype, &mp->m_kobj, + error = xfs_sysfs_init(&log->l_kobj, &xfs_log_ktype, &mp->m_kobj, "log"); if (error) goto out_destroy_ail;
/* Normal transactions can now occur */ - mp->m_log->l_flags &= ~XLOG_ACTIVE_RECOVERY; + clear_bit(XLOG_ACTIVE_RECOVERY, &log->l_opstate);
/* * Now the log has been fully initialised and we know were our * space grant counters are, we can initialise the permanent ticket * needed for delayed logging to work. */ - xlog_cil_init_post_recovery(mp->m_log); + xlog_cil_init_post_recovery(log);
return 0;
out_destroy_ail: xfs_trans_ail_destroy(mp); out_free_log: - xlog_dealloc_log(mp->m_log); + xlog_dealloc_log(log); out: return error; } @@ -707,7 +709,7 @@ xfs_log_mount_finish( * mount failure occurs. */ mp->m_super->s_flags |= SB_ACTIVE; - if (log->l_flags & XLOG_RECOVERY_NEEDED) + if (xlog_recovery_needed(log)) error = xlog_recover_finish(log); if (!error) xfs_log_work_queue(mp); @@ -723,7 +725,7 @@ xfs_log_mount_finish( * Don't push in the error case because the AIL may have pending intents * that aren't removed until recovery is cancelled. */ - if (log->l_flags & XLOG_RECOVERY_NEEDED) { + if (xlog_recovery_needed(log)) { if (!error) { xfs_log_force(mp, XFS_LOG_SYNC); xfs_ail_push_all_sync(mp->m_ail); @@ -735,7 +737,7 @@ xfs_log_mount_finish( } xfs_wait_buftarg(mp->m_ddev_targp);
- log->l_flags &= ~XLOG_RECOVERY_NEEDED; + clear_bit(XLOG_RECOVERY_NEEDED, &log->l_opstate); if (readonly) mp->m_flags |= XFS_MOUNT_RDONLY;
@@ -1007,7 +1009,7 @@ xfs_log_space_wake( return;
if (!list_empty_careful(&log->l_write_head.waiters)) { - ASSERT(!(log->l_flags & XLOG_ACTIVE_RECOVERY)); + ASSERT(!xlog_in_recovery(log));
spin_lock(&log->l_write_head.lock); free_bytes = xlog_space_left(log, &log->l_write_head.grant); @@ -1016,7 +1018,7 @@ xfs_log_space_wake( }
if (!list_empty_careful(&log->l_reserve_head.waiters)) { - ASSERT(!(log->l_flags & XLOG_ACTIVE_RECOVERY)); + ASSERT(!xlog_in_recovery(log));
spin_lock(&log->l_reserve_head.lock); free_bytes = xlog_space_left(log, &log->l_reserve_head.grant); @@ -1319,7 +1321,7 @@ xlog_alloc_log( log->l_logBBstart = blk_offset; log->l_logBBsize = num_bblks; log->l_covered_state = XLOG_STATE_COVER_IDLE; - log->l_flags |= XLOG_ACTIVE_RECOVERY; + set_bit(XLOG_ACTIVE_RECOVERY, &log->l_opstate); INIT_DELAYED_WORK(&log->l_work, xfs_log_worker);
log->l_prev_block = -1; @@ -3508,17 +3510,15 @@ xlog_verify_grant_tail( xlog_crack_atomic_lsn(&log->l_tail_lsn, &tail_cycle, &tail_blocks); if (tail_cycle != cycle) { if (cycle - 1 != tail_cycle && - !(log->l_flags & XLOG_TAIL_WARN)) { + !test_and_set_bit(XLOG_TAIL_WARN, &log->l_opstate)) { xfs_alert_tag(log->l_mp, XFS_PTAG_LOGRES, "%s: cycle - 1 != tail_cycle", __func__); - log->l_flags |= XLOG_TAIL_WARN; }
if (space > BBTOB(tail_blocks) && - !(log->l_flags & XLOG_TAIL_WARN)) { + !test_and_set_bit(XLOG_TAIL_WARN, &log->l_opstate)) { xfs_alert_tag(log->l_mp, XFS_PTAG_LOGRES, "%s: space > BBTOB(tail_blocks)", __func__); - log->l_flags |= XLOG_TAIL_WARN; } } } @@ -3685,8 +3685,7 @@ xfs_log_force_umount( * If this happens during log recovery, don't worry about * locking; the log isn't open for business yet. */ - if (!log || - log->l_flags & XLOG_ACTIVE_RECOVERY) { + if (!log || xlog_in_recovery(log)) { mp->m_flags |= XFS_MOUNT_FS_SHUTDOWN; if (mp->m_sb_bp) mp->m_sb_bp->b_flags |= XBF_DONE; @@ -3723,10 +3722,8 @@ xfs_log_force_umount( * Mark the log and the iclogs with IO error flags to prevent any * further log IO from being issued or completed. */ - if (!(log->l_flags & XLOG_IO_ERROR)) { - log->l_flags |= XLOG_IO_ERROR; + if (!test_and_set_bit(XLOG_IO_ERROR, &log->l_opstate)) retval = 1; - } spin_unlock(&log->l_icloglock);
/* @@ -3813,12 +3810,3 @@ xfs_log_check_lsn(
return valid; } - -bool -xfs_log_in_recovery( - struct xfs_mount *mp) -{ - struct xlog *log = mp->m_log; - - return log->l_flags & XLOG_ACTIVE_RECOVERY; -} diff --git a/fs/xfs/xfs_log.h b/fs/xfs/xfs_log.h index 66bd673344a8..d2dca196634c 100644 --- a/fs/xfs/xfs_log.h +++ b/fs/xfs/xfs_log.h @@ -137,7 +137,6 @@ bool xfs_log_item_in_current_chkpt(struct xfs_log_item *lip); void xfs_log_work_queue(struct xfs_mount *mp); void xfs_log_quiesce(struct xfs_mount *mp); bool xfs_log_check_lsn(struct xfs_mount *, xfs_lsn_t); -bool xfs_log_in_recovery(struct xfs_mount *);
xfs_lsn_t xlog_grant_push_threshold(struct xlog *log, int need_bytes);
diff --git a/fs/xfs/xfs_log_priv.h b/fs/xfs/xfs_log_priv.h index 88c40ba7e55c..b8387925d0f3 100644 --- a/fs/xfs/xfs_log_priv.h +++ b/fs/xfs/xfs_log_priv.h @@ -11,15 +11,6 @@ struct xlog; struct xlog_ticket; struct xfs_mount;
-/* - * Flags for log structure - */ -#define XLOG_ACTIVE_RECOVERY 0x2 /* in the middle of recovery */ -#define XLOG_RECOVERY_NEEDED 0x4 /* log was recovered */ -#define XLOG_IO_ERROR 0x8 /* log hit an I/O error, and being - shutdown */ -#define XLOG_TAIL_WARN 0x10 /* log tail verify warning issued */ - /* * get client id from packed copy. * @@ -400,7 +391,7 @@ struct xlog { struct xfs_buftarg *l_targ; /* buftarg of log */ struct workqueue_struct *l_ioend_workqueue; /* for I/O completions */ struct delayed_work l_work; /* background flush work */ - uint l_flags; + long l_opstate; /* operational state */ uint l_quotaoffs_flag; /* XFS_DQ_*, for QUOTAOFFs */ struct list_head *l_buf_cancel_table; int l_iclog_hsize; /* size of iclog header */ @@ -454,10 +445,31 @@ struct xlog { #define XLOG_BUF_CANCEL_BUCKET(log, blkno) \ ((log)->l_buf_cancel_table + ((uint64_t)blkno % XLOG_BC_TABLE_SIZE))
+/* + * Bits for operational state + */ +#define XLOG_ACTIVE_RECOVERY 0 /* in the middle of recovery */ +#define XLOG_RECOVERY_NEEDED 1 /* log was recovered */ +#define XLOG_IO_ERROR 2 /* log hit an I/O error, and being + shutdown */ +#define XLOG_TAIL_WARN 3 /* log tail verify warning issued */ + +static inline bool +xlog_recovery_needed(struct xlog *log) +{ + return test_bit(XLOG_RECOVERY_NEEDED, &log->l_opstate); +} + +static inline bool +xlog_in_recovery(struct xlog *log) +{ + return test_bit(XLOG_ACTIVE_RECOVERY, &log->l_opstate); +} + static inline bool xlog_is_shutdown(struct xlog *log) { - return (log->l_flags & XLOG_IO_ERROR); + return test_bit(XLOG_IO_ERROR, &log->l_opstate); }
/* common routines */ diff --git a/fs/xfs/xfs_log_recover.c b/fs/xfs/xfs_log_recover.c index d47b302387b6..f3e7016823e8 100644 --- a/fs/xfs/xfs_log_recover.c +++ b/fs/xfs/xfs_log_recover.c @@ -3338,7 +3338,7 @@ xlog_do_recover( xlog_recover_check_summary(log);
/* Normal transactions can now occur */ - log->l_flags &= ~XLOG_ACTIVE_RECOVERY; + clear_bit(XLOG_ACTIVE_RECOVERY, &log->l_opstate); return 0; }
@@ -3422,7 +3422,7 @@ xlog_recover( : "internal");
error = xlog_do_recover(log, head_blk, tail_blk); - log->l_flags |= XLOG_RECOVERY_NEEDED; + set_bit(XLOG_RECOVERY_NEEDED, &log->l_opstate); } return error; } @@ -3472,7 +3472,7 @@ void xlog_recover_cancel( struct xlog *log) { - if (log->l_flags & XLOG_RECOVERY_NEEDED) + if (xlog_recovery_needed(log)) xlog_recover_cancel_intents(log); }
diff --git a/fs/xfs/xfs_super.c b/fs/xfs/xfs_super.c index 475bb2ffd4f4..8533571421e6 100644 --- a/fs/xfs/xfs_super.c +++ b/fs/xfs/xfs_super.c @@ -707,7 +707,7 @@ xfs_fs_drop_inode( * that. See the comment for this inode flag. */ if (ip->i_flags & XFS_IRECOVERY) { - ASSERT(ip->i_mount->m_log->l_flags & XLOG_RECOVERY_NEEDED); + ASSERT(xlog_recovery_needed(ip->i_mount->m_log)); return 0; }
From: Dave Chinner dchinner@redhat.com
mainline-inclusion from mainline-v5.14-rc4 commit b36d4651e1650082d27fa477318183c4a7210e30 category: bugfix bugzilla: https://gitee.com/openeuler/kernel/issues/I4V7IK CVE: NA
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i...
-------------------------------------------------
The running of a forced shutdown is a bit of a mess. It does racy checks for XFS_MOUNT_SHUTDOWN in xfs_do_force_shutdown(), then does more racy checks in xfs_log_force_unmount() before finally setting XFS_MOUNT_SHUTDOWN and XLOG_IO_ERROR under the log->icloglock.
Move the checking and setting of XFS_MOUNT_SHUTDOWN into xfs_do_force_shutdown() so we only process a shutdown once and once only. Serialise this with the mp->m_sb_lock spinlock so that the state change is atomic and won't race. Move all the mount specific shutdown state changes from xfs_log_force_unmount() to xfs_do_force_shutdown() so they are done atomically with setting XFS_MOUNT_SHUTDOWN.
Then get rid of the racy xlog_is_shutdown() check from xlog_force_shutdown(), and gate the log shutdown on the test_and_set_bit(XLOG_IO_ERROR) test under the icloglock. This means that the log is shutdown once and once only, and code that needs to prevent races with shutdown can do so by holding the icloglock and checking the return value of xlog_is_shutdown().
This results in a predictable shutdown execution process - we set the shutdown flags once and process the shutdown once rather than the current "as many concurrent shutdowns as can race to the flag setting" situation we have now.
Also, now that shutdown is atomic, alway emit a stack trace when the error level for the filesystem is high enough. This means that we always get a stack trace when trying to diagnose the cause of shutdowns in the field, rather than just for SHUTDOWN_CORRUPT_INCORE cases.
Signed-off-by: Dave Chinner dchinner@redhat.com Reviewed-by: Darrick J. Wong djwong@kernel.org Reviewed-by: Christoph Hellwig hch@lst.de Signed-off-by: Darrick J. Wong djwong@kernel.org Signed-off-by: Lihong Kou koulihong@huawei.com Reviewed-by: Zhang Yi yi.zhang@huawei.com Signed-off-by: Zheng Zengkai zhengzengkai@huawei.com --- fs/xfs/xfs_fsops.c | 63 ++++++++++++++-------------- fs/xfs/xfs_log.c | 100 ++++++++++++++++++++------------------------- fs/xfs/xfs_log.h | 2 +- 3 files changed, 76 insertions(+), 89 deletions(-)
diff --git a/fs/xfs/xfs_fsops.c b/fs/xfs/xfs_fsops.c index 4a3a4f38531a..2741dbd22704 100644 --- a/fs/xfs/xfs_fsops.c +++ b/fs/xfs/xfs_fsops.c @@ -459,6 +459,11 @@ xfs_fs_goingdown( * consistent. We don't do an unmount here; just shutdown the shop, make sure * that absolutely nothing persistent happens to this filesystem after this * point. + * + * The shutdown state change is atomic, resulting in the first and only the + * first shutdown call processing the shutdown. This means we only shutdown the + * log once as it requires, and we don't spam the logs when multiple concurrent + * shutdowns race to set the shutdown flags. */ void xfs_do_force_shutdown( @@ -467,48 +472,40 @@ xfs_do_force_shutdown( char *fname, int lnnum) { - bool logerror = flags & SHUTDOWN_LOG_IO_ERROR; + int tag; + const char *why;
- /* - * No need to duplicate efforts. - */ - if (XFS_FORCED_SHUTDOWN(mp) && !logerror) - return; - - /* - * This flags XFS_MOUNT_FS_SHUTDOWN, makes sure that we don't - * queue up anybody new on the log reservations, and wakes up - * everybody who's sleeping on log reservations to tell them - * the bad news. - */ - if (xfs_log_force_umount(mp, logerror)) - return; - - if (flags & SHUTDOWN_FORCE_UMOUNT) { - xfs_alert(mp, -"User initiated shutdown received. Shutting down filesystem"); + spin_lock(&mp->m_sb_lock); + if (XFS_FORCED_SHUTDOWN(mp)) { + spin_unlock(&mp->m_sb_lock); return; } + mp->m_flags |= XFS_MOUNT_FS_SHUTDOWN; + if (mp->m_sb_bp) + mp->m_sb_bp->b_flags |= XBF_DONE; + spin_unlock(&mp->m_sb_lock); + + if (flags & SHUTDOWN_FORCE_UMOUNT) + xfs_alert(mp, "User initiated shutdown received.");
- xfs_notice(mp, -"%s(0x%x) called from line %d of file %s. Return address = "PTR_FMT, - __func__, flags, lnnum, fname, __return_address); - - if (flags & SHUTDOWN_CORRUPT_INCORE) { - xfs_alert_tag(mp, XFS_PTAG_SHUTDOWN_CORRUPT, -"Corruption of in-memory data detected. Shutting down filesystem"); - if (XFS_ERRLEVEL_HIGH <= xfs_error_level) - xfs_stack_trace(); - } else if (logerror) { - xfs_alert_tag(mp, XFS_PTAG_SHUTDOWN_LOGERROR, - "Log I/O Error Detected. Shutting down filesystem"); + if (xlog_force_shutdown(mp->m_log, flags)) { + tag = XFS_PTAG_SHUTDOWN_LOGERROR; + why = "Log I/O Error"; + } else if (flags & SHUTDOWN_CORRUPT_INCORE) { + tag = XFS_PTAG_SHUTDOWN_CORRUPT; + why = "Corruption of in-memory data"; } else { - xfs_alert_tag(mp, XFS_PTAG_SHUTDOWN_IOERROR, - "I/O Error Detected. Shutting down filesystem"); + tag = XFS_PTAG_SHUTDOWN_IOERROR; + why = "Metadata I/O Error"; }
+ xfs_alert_tag(mp, tag, +"%s (0x%x) detected at %pS (%s:%d). Shutting down filesystem.", + why, flags, __return_address, fname, lnnum); xfs_alert(mp, "Please unmount the filesystem and rectify the problem(s)"); + if (xfs_error_level >= XFS_ERRLEVEL_HIGH) + xfs_stack_trace(); }
/* diff --git a/fs/xfs/xfs_log.c b/fs/xfs/xfs_log.c index ac0da4544dd6..f96390df66fc 100644 --- a/fs/xfs/xfs_log.c +++ b/fs/xfs/xfs_log.c @@ -3654,76 +3654,66 @@ xlog_verify_iclog( #endif
/* - * This is called from xfs_force_shutdown, when we're forcibly - * shutting down the filesystem, typically because of an IO error. - * Our main objectives here are to make sure that: - * a. if !logerror, flush the logs to disk. Anything modified - * after this is ignored. - * b. the filesystem gets marked 'SHUTDOWN' for all interested - * parties to find out, 'atomically'. - * c. those who're sleeping on log reservations, pinned objects and - * other resources get woken up, and be told the bad news. - * d. nothing new gets queued up after (b) and (c) are done. + * Perform a forced shutdown on the log. This should be called once and once + * only by the high level filesystem shutdown code to shut the log subsystem + * down cleanly. * - * Note: for the !logerror case we need to flush the regions held in memory out - * to disk first. This needs to be done before the log is marked as shutdown, - * otherwise the iclog writes will fail. + * Our main objectives here are to make sure that: + * a. if the shutdown was not due to a log IO error, flush the logs to + * disk. Anything modified after this is ignored. + * b. the log gets atomically marked 'XLOG_IO_ERROR' for all interested + * parties to find out. Nothing new gets queued after this is done. + * c. Tasks sleeping on log reservations, pinned objects and + * other resources get woken up. * - * Return non-zero if log shutdown transition had already happened. + * Return true if the shutdown cause was a log IO error and we actually shut the + * log down. */ -int -xfs_log_force_umount( - struct xfs_mount *mp, - int logerror) +bool +xlog_force_shutdown( + struct xlog *log, + int shutdown_flags) { - struct xlog *log; - int retval = 0; - - log = mp->m_log; + bool log_error = (shutdown_flags & SHUTDOWN_LOG_IO_ERROR);
/* - * If this happens during log recovery, don't worry about - * locking; the log isn't open for business yet. + * If this happens during log recovery then we aren't using the runtime + * log mechanisms yet so there's nothing to shut down. */ - if (!log || xlog_in_recovery(log)) { - mp->m_flags |= XFS_MOUNT_FS_SHUTDOWN; - if (mp->m_sb_bp) - mp->m_sb_bp->b_flags |= XBF_DONE; - return 0; - } + if (!log || xlog_in_recovery(log)) + return false;
- /* - * Somebody could've already done the hard work for us. - * No need to get locks for this. - */ - if (logerror && xlog_is_shutdown(log)) - return 1; + ASSERT(!xlog_is_shutdown(log));
/* * Flush all the completed transactions to disk before marking the log - * being shut down. We need to do it in this order to ensure that - * completed operations are safely on disk before we shut down, and that - * we don't have to issue any buffer IO after the shutdown flags are set - * to guarantee this. + * being shut down. We need to do this first as shutting down the log + * before the force will prevent the log force from flushing the iclogs + * to disk. + * + * Re-entry due to a log IO error shutdown during the log force is + * prevented by the atomicity of higher level shutdown code. */ - if (!logerror) - xfs_log_force(mp, XFS_LOG_SYNC); + if (!log_error) + xfs_log_force(log->l_mp, XFS_LOG_SYNC);
/* - * mark the filesystem and the as in a shutdown state and wake - * everybody up to tell them the bad news. + * Atomically set the shutdown state. If the shutdown state is already + * set, there someone else is performing the shutdown and so we are done + * here. This should never happen because we should only ever get called + * once by the first shutdown caller. + * + * Much of the log state machine transitions assume that shutdown state + * cannot change once they hold the log->l_icloglock. Hence we need to + * hold that lock here, even though we use the atomic test_and_set_bit() + * operation to set the shutdown state. */ spin_lock(&log->l_icloglock); - mp->m_flags |= XFS_MOUNT_FS_SHUTDOWN; - if (mp->m_sb_bp) - mp->m_sb_bp->b_flags |= XBF_DONE; - - /* - * Mark the log and the iclogs with IO error flags to prevent any - * further log IO from being issued or completed. - */ - if (!test_and_set_bit(XLOG_IO_ERROR, &log->l_opstate)) - retval = 1; + if (test_and_set_bit(XLOG_IO_ERROR, &log->l_opstate)) { + spin_unlock(&log->l_icloglock); + ASSERT(0); + return false; + } spin_unlock(&log->l_icloglock);
/* @@ -3747,7 +3737,7 @@ xfs_log_force_umount( spin_unlock(&log->l_cilp->xc_push_lock); xlog_state_do_callback(log);
- return retval; + return log_error; }
STATIC int diff --git a/fs/xfs/xfs_log.h b/fs/xfs/xfs_log.h index d2dca196634c..764e054464b8 100644 --- a/fs/xfs/xfs_log.h +++ b/fs/xfs/xfs_log.h @@ -126,7 +126,6 @@ int xfs_log_reserve(struct xfs_mount *mp, bool permanent); int xfs_log_regrant(struct xfs_mount *mp, struct xlog_ticket *tic); void xfs_log_unmount(struct xfs_mount *mp); -int xfs_log_force_umount(struct xfs_mount *mp, int logerror);
struct xlog_ticket *xfs_log_ticket_get(struct xlog_ticket *ticket); void xfs_log_ticket_put(struct xlog_ticket *ticket); @@ -139,5 +138,6 @@ void xfs_log_quiesce(struct xfs_mount *mp); bool xfs_log_check_lsn(struct xfs_mount *, xfs_lsn_t);
xfs_lsn_t xlog_grant_push_threshold(struct xlog *log, int need_bytes); +bool xlog_force_shutdown(struct xlog *log, int shutdown_flags);
#endif /* __XFS_LOG_H__ */
From: Dave Chinner dchinner@redhat.com
mainline-inclusion from mainline-v5.14-rc4 commit 8bb92005b0e4682a6e5dad131c5f3636c7d56dc1 category: bugfix bugzilla: https://gitee.com/openeuler/kernel/issues/I4V7IK CVE: NA
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i...
-------------------------------------------------
Clean it up a bit by factoring and rearranging some of the code.
Signed-off-by: Dave Chinner dchinner@redhat.com Reviewed-by: Christoph Hellwig hch@lst.de Reviewed-by: Darrick J. Wong djwong@kernel.org Signed-off-by: Darrick J. Wong djwong@kernel.org Signed-off-by: Lihong Kou koulihong@huawei.com Reviewed-by: guoxuenan guoxuenan@huawei.com Reviewed-by: Zhang Yi yi.zhang@huawei.com Signed-off-by: Zheng Zengkai zhengzengkai@huawei.com --- fs/xfs/xfs_log.c | 96 ++++++++++++++++++++++++++---------------------- 1 file changed, 53 insertions(+), 43 deletions(-)
diff --git a/fs/xfs/xfs_log.c b/fs/xfs/xfs_log.c index f96390df66fc..8872798fa4ba 100644 --- a/fs/xfs/xfs_log.c +++ b/fs/xfs/xfs_log.c @@ -2689,56 +2689,66 @@ xlog_state_iodone_process_iclog( } }
+/* + * Loop over all the iclogs, running attached callbacks on them. Return true if + * we ran any callbacks, indicating that we dropped the icloglock. + */ +static bool +xlog_state_do_iclog_callbacks( + struct xlog *log) + __releases(&log->l_icloglock) + __acquires(&log->l_icloglock) +{ + struct xlog_in_core *first_iclog = log->l_iclog; + struct xlog_in_core *iclog = first_iclog; + bool ran_callback = false; + + do { + LIST_HEAD(cb_list); + + if (!xlog_is_shutdown(log)) { + if (xlog_state_iodone_process_iclog(log, iclog)) + break; + if (iclog->ic_state != XLOG_STATE_CALLBACK) { + iclog = iclog->ic_next; + continue; + } + } + list_splice_init(&iclog->ic_callbacks, &cb_list); + spin_unlock(&log->l_icloglock); + + trace_xlog_iclog_callbacks_start(iclog, _RET_IP_); + xlog_cil_process_committed(&cb_list); + trace_xlog_iclog_callbacks_done(iclog, _RET_IP_); + ran_callback = true; + + spin_lock(&log->l_icloglock); + if (xlog_is_shutdown(log)) + wake_up_all(&iclog->ic_force_wait); + else + xlog_state_clean_iclog(log, iclog); + iclog = iclog->ic_next; + } while (iclog != first_iclog); + + return ran_callback; +} + + +/* + * Loop running iclog completion callbacks until there are no more iclogs in a + * state that can run callbacks. + */ STATIC void xlog_state_do_callback( struct xlog *log) { - struct xlog_in_core *iclog; - struct xlog_in_core *first_iclog; - bool cycled_icloglock; int flushcnt = 0; int repeats = 0;
spin_lock(&log->l_icloglock); - do { - /* - * Scan all iclogs starting with the one pointed to by the - * log. Reset this starting point each time the log is - * unlocked (during callbacks). - * - * Keep looping through iclogs until one full pass is made - * without running any callbacks. - */ - cycled_icloglock = false; - first_iclog = log->l_iclog; - iclog = first_iclog; - - do { - LIST_HEAD(cb_list); - - if (!xlog_is_shutdown(log)) { - if (xlog_state_iodone_process_iclog(log, iclog)) - break; - if (iclog->ic_state != XLOG_STATE_CALLBACK) { - iclog = iclog->ic_next; - continue; - } - } - list_splice_init(&iclog->ic_callbacks, &cb_list); - spin_unlock(&log->l_icloglock); - - trace_xlog_iclog_callbacks_start(iclog, _RET_IP_); - xlog_cil_process_committed(&cb_list); - trace_xlog_iclog_callbacks_done(iclog, _RET_IP_); - cycled_icloglock = true; - - spin_lock(&log->l_icloglock); - if (xlog_is_shutdown(log)) - wake_up_all(&iclog->ic_force_wait); - else - xlog_state_clean_iclog(log, iclog); - iclog = iclog->ic_next; - } while (iclog != first_iclog); + while (xlog_state_do_iclog_callbacks(log)) { + if (xlog_is_shutdown(log)) + break;
if (++repeats > 5000) { flushcnt += repeats; @@ -2747,7 +2757,7 @@ xlog_state_do_callback( "%s: possible infinite loop (%d iterations)", __func__, flushcnt); } - } while (!xlog_is_shutdown(log) && cycled_icloglock); + }
if (log->l_iclog->ic_state == XLOG_STATE_ACTIVE || xlog_is_shutdown(log))
From: Dave Chinner dchinner@redhat.com
mainline-inclusion from mainline-v5.14-rc4 commit aad7272a920869b950d937b87562e494af72523c category: bugfix bugzilla: https://gitee.com/openeuler/kernel/issues/I4V7IK CVE: NA
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i...
-------------------------------------------------
The iclog callback processing done during a forced log shutdown has different logic to normal runtime IO completion callback processing. Separate out the shutdown callbacks into their own function and call that from the shutdown code instead.
We don't need this shutdown specific logic in the normal runtime completion code - we'll always run the shutdown version on shutdown, and it will do what shutdown needs regardless of whether there are racing IO completion callbacks scheduled or in progress. Hence we can also simplify the normal IO completion callpath and only abort if shutdown occurred while we actively were processing callbacks.
Further, separating out the IO completion logic from the shutdown logic avoids callback race conditions from being triggered by log IO completion after a shutdown. IO completion will now only run callbacks on iclogs that are in the correct state for a callback to be run, avoiding the possibility of running callbacks on a referenced iclog that hasn't yet been submitted for IO.
Signed-off-by: Dave Chinner dchinner@redhat.com Reviewed-by: Christoph Hellwig hch@lst.de Reviewed-by: Darrick J. Wong djwong@kernel.org Signed-off-by: Darrick J. Wong djwong@kernel.org Signed-off-by: Lihong Kou koulihong@huawei.com Reviewed-by: guoxuenan guoxuenan@huawei.com Reviewed-by: Zhang Yi yi.zhang@huawei.com Signed-off-by: Zheng Zengkai zhengzengkai@huawei.com --- fs/xfs/xfs_log.c | 53 ++++++++++++++++++++++++++++++++++-------------- 1 file changed, 38 insertions(+), 15 deletions(-)
diff --git a/fs/xfs/xfs_log.c b/fs/xfs/xfs_log.c index 8872798fa4ba..cf6f50c56150 100644 --- a/fs/xfs/xfs_log.c +++ b/fs/xfs/xfs_log.c @@ -458,6 +458,32 @@ xfs_log_reserve( return error; }
+/* + * Run all the pending iclog callbacks and wake log force waiters and iclog + * space waiters so they can process the newly set shutdown state. We really + * don't care what order we process callbacks here because the log is shut down + * and so state cannot change on disk anymore. + */ +static void +xlog_state_shutdown_callbacks( + struct xlog *log) +{ + struct xlog_in_core *iclog; + LIST_HEAD(cb_list); + + spin_lock(&log->l_icloglock); + iclog = log->l_iclog; + do { + list_splice_init(&iclog->ic_callbacks, &cb_list); + wake_up_all(&iclog->ic_force_wait); + } while ((iclog = iclog->ic_next) != log->l_iclog); + + wake_up_all(&log->l_flush_wait); + spin_unlock(&log->l_icloglock); + + xlog_cil_process_committed(&cb_list); +} + /* * Flush iclog to disk if this is the last reference to the given iclog and the * it is in the WANT_SYNC state. If the caller passes in a non-zero @@ -2691,7 +2717,10 @@ xlog_state_iodone_process_iclog(
/* * Loop over all the iclogs, running attached callbacks on them. Return true if - * we ran any callbacks, indicating that we dropped the icloglock. + * we ran any callbacks, indicating that we dropped the icloglock. We don't need + * to handle transient shutdown state here at all because + * xlog_state_shutdown_callbacks() will be run to do the necessary shutdown + * cleanup of the callbacks. */ static bool xlog_state_do_iclog_callbacks( @@ -2706,13 +2735,11 @@ xlog_state_do_iclog_callbacks( do { LIST_HEAD(cb_list);
- if (!xlog_is_shutdown(log)) { - if (xlog_state_iodone_process_iclog(log, iclog)) - break; - if (iclog->ic_state != XLOG_STATE_CALLBACK) { - iclog = iclog->ic_next; - continue; - } + if (xlog_state_iodone_process_iclog(log, iclog)) + break; + if (iclog->ic_state != XLOG_STATE_CALLBACK) { + iclog = iclog->ic_next; + continue; } list_splice_init(&iclog->ic_callbacks, &cb_list); spin_unlock(&log->l_icloglock); @@ -2723,10 +2750,7 @@ xlog_state_do_iclog_callbacks( ran_callback = true;
spin_lock(&log->l_icloglock); - if (xlog_is_shutdown(log)) - wake_up_all(&iclog->ic_force_wait); - else - xlog_state_clean_iclog(log, iclog); + xlog_state_clean_iclog(log, iclog); iclog = iclog->ic_next; } while (iclog != first_iclog);
@@ -2759,8 +2783,7 @@ xlog_state_do_callback( } }
- if (log->l_iclog->ic_state == XLOG_STATE_ACTIVE || - xlog_is_shutdown(log)) + if (log->l_iclog->ic_state == XLOG_STATE_ACTIVE) wake_up_all(&log->l_flush_wait);
spin_unlock(&log->l_icloglock); @@ -3745,7 +3768,7 @@ xlog_force_shutdown( spin_lock(&log->l_cilp->xc_push_lock); wake_up_all(&log->l_cilp->xc_commit_wait); spin_unlock(&log->l_cilp->xc_push_lock); - xlog_state_do_callback(log); + xlog_state_shutdown_callbacks(log);
return log_error; }
From: Dave Chinner dchinner@redhat.com
mainline-inclusion from mainline-v5.14-rc4 commit 502a01fac0983406e7c312764d7a03e06b3d0748 category: bugfix bugzilla: https://gitee.com/openeuler/kernel/issues/I4V7IK CVE: NA
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i...
-------------------------------------------------
When the log is shutdown, it currently walks all the iclogs and runs callbacks that are attached to the iclogs, regardless of whether the iclog is queued for IO completion or not. This creates a problem for contexts attaching callbacks to iclogs in that a racing shutdown can run the callbacks even before the attaching context has finished processing the iclog and releasing it for IO submission.
If the callback processing of the iclog frees the structure that is attached to the iclog, then this leads to an UAF scenario that can only be protected against by holding the icloglock from the point callbacks are attached through to the release of the iclog. While we currently do this, it is not practical or sustainable.
Hence we need to make shutdown processing the responsibility of the context that holds active references to the iclog. We know that the contexts attaching callbacks to the iclog must have active references to the iclog, and that means they must be in either ACTIVE or WANT_SYNC states. xlog_state_do_callback() will skip over iclogs in these states -except- when the log is shut down.
xlog_state_do_callback() checks the state of the iclogs while holding the icloglock, therefore the reference count/state change that occurs in xlog_state_release_iclog() after the callbacks are atomic w.r.t. shutdown processing.
We can't push the responsibility of callback cleanup onto the CIL context because we can have ACTIVE iclogs that have callbacks attached that have already been released. Hence we really need to internalise the cleanup of callbacks into xlog_state_release_iclog() processing.
Indeed, we already have that internalisation via:
xlog_state_release_iclog drop last reference ->SYNCING xlog_sync xlog_write_iclog if (log_is_shutdown) xlog_state_done_syncing() xlog_state_do_callback() <process shutdown on iclog that is now in SYNCING state>
The problem is that xlog_state_release_iclog() aborts before doing anything if the log is already shut down. It assumes that the callbacks have already been cleaned up, and it doesn't need to do any cleanup.
Hence the fix is to remove the xlog_is_shutdown() check from xlog_state_release_iclog() so that reference counts are correctly released from the iclogs, and when the reference count is zero we always transition to SYNCING if the log is shut down. Hence we'll always enter the xlog_sync() path in a shutdown and eventually end up erroring out the iclog IO and running xlog_state_do_callback() to process the callbacks attached to the iclog.
This allows us to stop processing referenced ACTIVE/WANT_SYNC iclogs directly in the shutdown code, and in doing so gets rid of the UAF vector that currently exists. This then decouples the adding of callbacks to the iclogs from xlog_state_release_iclog() as we guarantee that xlog_state_release_iclog() will process the callbacks if the log has been shut down before xlog_state_release_iclog() has been called.
Signed-off-by: Dave Chinner dchinner@redhat.com Reviewed-by: Christoph Hellwig hch@lst.de Reviewed-by: Darrick J. Wong djwong@kernel.org Signed-off-by: Darrick J. Wong djwong@kernel.org Signed-off-by: Lihong Kou koulihong@huawei.com Reviewed-by: guoxuenan guoxuenan@huawei.com Reviewed-by: Zhang Yi yi.zhang@huawei.com Signed-off-by: Zheng Zengkai zhengzengkai@huawei.com --- fs/xfs/xfs_log.c | 35 +++++++++++++++++++++++++++++++---- fs/xfs/xfs_log_cil.c | 12 ++++-------- 2 files changed, 35 insertions(+), 12 deletions(-)
diff --git a/fs/xfs/xfs_log.c b/fs/xfs/xfs_log.c index cf6f50c56150..2c6c380a15ec 100644 --- a/fs/xfs/xfs_log.c +++ b/fs/xfs/xfs_log.c @@ -41,6 +41,8 @@ xlog_dealloc_log( /* local state machine functions */ STATIC void xlog_state_done_syncing( struct xlog_in_core *iclog); +STATIC void xlog_state_do_callback( + struct xlog *log); STATIC int xlog_state_get_iclog_space( struct xlog *log, @@ -463,6 +465,11 @@ xfs_log_reserve( * space waiters so they can process the newly set shutdown state. We really * don't care what order we process callbacks here because the log is shut down * and so state cannot change on disk anymore. + * + * We avoid processing actively referenced iclogs so that we don't run callbacks + * while the iclog owner might still be preparing the iclog for IO submssion. + * These will be caught by xlog_state_iclog_release() and call this function + * again to process any callbacks that may have been added to that iclog. */ static void xlog_state_shutdown_callbacks( @@ -474,7 +481,12 @@ xlog_state_shutdown_callbacks( spin_lock(&log->l_icloglock); iclog = log->l_iclog; do { + if (atomic_read(&iclog->ic_refcnt)) { + /* Reference holder will re-run iclog callbacks. */ + continue; + } list_splice_init(&iclog->ic_callbacks, &cb_list); + wake_up_all(&iclog->ic_write_wait); wake_up_all(&iclog->ic_force_wait); } while ((iclog = iclog->ic_next) != log->l_iclog);
@@ -499,12 +511,11 @@ xlog_state_release_iclog( xfs_lsn_t old_tail_lsn) { xfs_lsn_t tail_lsn; + bool last_ref; + lockdep_assert_held(&log->l_icloglock);
trace_xlog_iclog_release(iclog, _RET_IP_); - if (xlog_is_shutdown(log)) - return -EIO; - /* * Grabbing the current log tail needs to be atomic w.r.t. the writing * of the tail LSN into the iclog so we guarantee that the log tail does @@ -518,7 +529,23 @@ xlog_state_release_iclog( iclog->ic_flags |= XLOG_ICL_NEED_FLUSH; }
- if (!atomic_dec_and_test(&iclog->ic_refcnt)) + last_ref = atomic_dec_and_test(&iclog->ic_refcnt); + + if (xlog_is_shutdown(log)) { + /* + * If there are no more references to this iclog, process the + * pending iclog callbacks that were waiting on the release of + * this iclog. + */ + if (last_ref) { + spin_unlock(&log->l_icloglock); + xlog_state_shutdown_callbacks(log); + spin_lock(&log->l_icloglock); + } + return -EIO; + } + + if (!last_ref) return 0;
if (iclog->ic_state != XLOG_STATE_WANT_SYNC) { diff --git a/fs/xfs/xfs_log_cil.c b/fs/xfs/xfs_log_cil.c index 1259cd752b45..28884ae690c1 100644 --- a/fs/xfs/xfs_log_cil.c +++ b/fs/xfs/xfs_log_cil.c @@ -909,11 +909,10 @@ xlog_cil_push_work( xfs_log_ticket_ungrant(log, tic);
/* - * Once we attach the ctx to the iclog, a shutdown can process the - * iclog, run the callbacks and free the ctx. The only thing preventing - * this potential UAF situation here is that we are holding the - * icloglock. Hence we cannot access the ctx once we have attached the - * callbacks and dropped the icloglock. + * Once we attach the ctx to the iclog, it is effectively owned by the + * iclog and we can only use it while we still have an active reference + * to the iclog. i.e. once we call xlog_state_release_iclog() we can no + * longer safely reference the ctx. */ spin_lock(&log->l_icloglock); if (xlog_is_shutdown(log)) { @@ -945,9 +944,6 @@ xlog_cil_push_work( * wakeup until this commit_iclog is written to disk. Hence we use the * iclog header lsn and compare it to the commit lsn to determine if we * need to wait on iclogs or not. - * - * NOTE: It is not safe to reference the ctx after this check as we drop - * the icloglock if we have to wait for completion of other iclogs. */ if (ctx->start_lsn != commit_lsn) { xfs_lsn_t plsn;
From: Dave Chinner dchinner@redhat.com
mainline-inclusion from mainline-v5.14-rc4 commit 2562c322404d81ee5fa82f3cf601a2e27393ab57 category: bugfix bugzilla: https://gitee.com/openeuler/kernel/issues/I4V7IK CVE: NA
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i...
-------------------------------------------------
I'm seeing assert failures from xlog_space_left() after a shutdown has begun that look like:
XFS (dm-0): log I/O error -5 XFS (dm-0): xfs_do_force_shutdown(0x2) called from line 1338 of file fs/xfs/xfs_log.c. Return address = xlog_ioend_work+0x64/0xc0 XFS (dm-0): Log I/O Error Detected. XFS (dm-0): Shutting down filesystem. Please unmount the filesystem and rectify the problem(s) XFS (dm-0): xlog_space_left: head behind tail XFS (dm-0): tail_cycle = 6, tail_bytes = 2706944 XFS (dm-0): GH cycle = 6, GH bytes = 1633867 XFS: Assertion failed: 0, file: fs/xfs/xfs_log.c, line: 1310 ------------[ cut here ]------------ Call Trace: xlog_space_left+0xc3/0x110 xlog_grant_push_threshold+0x3f/0xf0 xlog_grant_push_ail+0x12/0x40 xfs_log_reserve+0xd2/0x270 ? __might_sleep+0x4b/0x80 xfs_trans_reserve+0x18b/0x260 .....
There are two things here. Firstly, after a shutdown, the log head and tail can be out of whack as things abort and release (or don't release) resources, so checking them for sanity doesn't make much sense. Secondly, xfs_log_reserve() can race with shutdown and so it can still fail like this even though it has already checked for a log shutdown before calling xlog_grant_push_ail().
So, before ASSERT failing in xlog_space_left(), make sure we haven't already shut down....
Signed-off-by: Dave Chinner dchinner@redhat.com Reviewed-by: Christoph Hellwig hch@lst.de Reviewed-by: Darrick J. Wong djwong@kernel.org Signed-off-by: Darrick J. Wong djwong@kernel.org Signed-off-by: Lihong Kou koulihong@huawei.com Reviewed-by: guoxuenan guoxuenan@huawei.com Reviewed-by: Zhang Yi yi.zhang@huawei.com Signed-off-by: Zheng Zengkai zhengzengkai@huawei.com --- fs/xfs/xfs_log.c | 51 +++++++++++++++++++++++++----------------------- 1 file changed, 27 insertions(+), 24 deletions(-)
diff --git a/fs/xfs/xfs_log.c b/fs/xfs/xfs_log.c index 2c6c380a15ec..06444102bc32 100644 --- a/fs/xfs/xfs_log.c +++ b/fs/xfs/xfs_log.c @@ -1187,16 +1187,18 @@ xlog_assign_tail_lsn( * wrap the tail, we should blow up. Rather than catch this case here, * we depend on other ASSERTions in other parts of the code. XXXmiken * - * This code also handles the case where the reservation head is behind - * the tail. The details of this case are described below, but the end - * result is that we return the size of the log as the amount of space left. + * If reservation head is behind the tail, we have a problem. Warn about it, + * but then treat it as if the log is empty. + * + * If the log is shut down, the head and tail may be invalid or out of whack, so + * shortcut invalidity asserts in this case so that we don't trigger them + * falsely. */ STATIC int xlog_space_left( struct xlog *log, atomic64_t *head) { - int free_bytes; int tail_bytes; int tail_cycle; int head_cycle; @@ -1206,29 +1208,30 @@ xlog_space_left( xlog_crack_atomic_lsn(&log->l_tail_lsn, &tail_cycle, &tail_bytes); tail_bytes = BBTOB(tail_bytes); if (tail_cycle == head_cycle && head_bytes >= tail_bytes) - free_bytes = log->l_logsize - (head_bytes - tail_bytes); - else if (tail_cycle + 1 < head_cycle) + return log->l_logsize - (head_bytes - tail_bytes); + if (tail_cycle + 1 < head_cycle) return 0; - else if (tail_cycle < head_cycle) { + + /* Ignore potential inconsistency when shutdown. */ + if (xlog_is_shutdown(log)) + return log->l_logsize; + + if (tail_cycle < head_cycle) { ASSERT(tail_cycle == (head_cycle - 1)); - free_bytes = tail_bytes - head_bytes; - } else { - /* - * The reservation head is behind the tail. - * In this case we just want to return the size of the - * log as the amount of space left. - */ - xfs_alert(log->l_mp, "xlog_space_left: head behind tail"); - xfs_alert(log->l_mp, - " tail_cycle = %d, tail_bytes = %d", - tail_cycle, tail_bytes); - xfs_alert(log->l_mp, - " GH cycle = %d, GH bytes = %d", - head_cycle, head_bytes); - ASSERT(0); - free_bytes = log->l_logsize; + return tail_bytes - head_bytes; } - return free_bytes; + + /* + * The reservation head is behind the tail. In this case we just want to + * return the size of the log as the amount of space left. + */ + xfs_alert(log->l_mp, "xlog_space_left: head behind tail"); + xfs_alert(log->l_mp, " tail_cycle = %d, tail_bytes = %d", + tail_cycle, tail_bytes); + xfs_alert(log->l_mp, " GH cycle = %d, GH bytes = %d", + head_cycle, head_bytes); + ASSERT(0); + return log->l_logsize; }
From: Dave Chinner dchinner@redhat.com
mainline-inclusion from mainline-v5.14-rc4 commit 2ce82b722de980deef809438603b7e95156d3818 category: bugfix bugzilla: https://gitee.com/openeuler/kernel/issues/I4V7IK CVE: NA
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i...
-------------------------------------------------
It is only used by the CIL checkpoints, and is the counterpart to start record formatting and writing that is already local to xfs_log_cil.c.
Signed-off-by: Dave Chinner dchinner@redhat.com Reviewed-by: Darrick J. Wong djwong@kernel.org Reviewed-by: Christoph Hellwig hch@lst.de Signed-off-by: Darrick J. Wong djwong@kernel.org Signed-off-by: Lihong Kou koulihong@huawei.com Reviewed-by: guoxuenan guoxuenan@huawei.com Reviewed-by: Zhang Yi yi.zhang@huawei.com Signed-off-by: Zheng Zengkai zhengzengkai@huawei.com --- fs/xfs/xfs_log.c | 31 ------------------------------- fs/xfs/xfs_log_cil.c | 35 ++++++++++++++++++++++++++++++++++- fs/xfs/xfs_log_priv.h | 2 -- 3 files changed, 34 insertions(+), 34 deletions(-)
diff --git a/fs/xfs/xfs_log.c b/fs/xfs/xfs_log.c index 06444102bc32..a9e5b5025d14 100644 --- a/fs/xfs/xfs_log.c +++ b/fs/xfs/xfs_log.c @@ -1509,37 +1509,6 @@ xlog_alloc_log( return ERR_PTR(error); } /* xlog_alloc_log */
-/* - * Write out the commit record of a transaction associated with the given - * ticket to close off a running log write. Return the lsn of the commit record. - */ -int -xlog_commit_record( - struct xlog *log, - struct xlog_ticket *ticket, - struct xlog_in_core **iclog, - xfs_lsn_t *lsn) -{ - struct xfs_log_iovec reg = { - .i_addr = NULL, - .i_len = 0, - .i_type = XLOG_REG_TYPE_COMMIT, - }; - struct xfs_log_vec vec = { - .lv_niovecs = 1, - .lv_iovecp = ®, - }; - int error; - - if (xlog_is_shutdown(log)) - return -EIO; - - error = xlog_write(log, &vec, ticket, lsn, iclog, XLOG_COMMIT_TRANS); - if (error) - xfs_force_shutdown(log->l_mp, SHUTDOWN_LOG_IO_ERROR); - return error; -} - /* * Compute the LSN that we'd need to push the log tail towards in order to have * (a) enough on-disk log space to log the number of bytes specified, (b) at diff --git a/fs/xfs/xfs_log_cil.c b/fs/xfs/xfs_log_cil.c index 28884ae690c1..52586e64a7c6 100644 --- a/fs/xfs/xfs_log_cil.c +++ b/fs/xfs/xfs_log_cil.c @@ -659,6 +659,38 @@ xlog_cil_process_committed( } }
+/* + * Write out the commit record of a checkpoint transaction associated with the + * given ticket to close off a running log write. Return the lsn of the commit + * record. + */ +static int +xlog_cil_write_commit_record( + struct xlog *log, + struct xlog_ticket *ticket, + struct xlog_in_core **iclog, + xfs_lsn_t *lsn) +{ + struct xfs_log_iovec reg = { + .i_addr = NULL, + .i_len = 0, + .i_type = XLOG_REG_TYPE_COMMIT, + }; + struct xfs_log_vec vec = { + .lv_niovecs = 1, + .lv_iovecp = ®, + }; + int error; + + if (xlog_is_shutdown(log)) + return -EIO; + + error = xlog_write(log, &vec, ticket, lsn, iclog, XLOG_COMMIT_TRANS); + if (error) + xfs_force_shutdown(log->l_mp, SHUTDOWN_LOG_IO_ERROR); + return error; +} + /* * Push the Committed Item List to the log. * @@ -902,7 +934,8 @@ xlog_cil_push_work( } spin_unlock(&cil->xc_push_lock);
- error = xlog_commit_record(log, tic, &commit_iclog, &commit_lsn); + error = xlog_cil_write_commit_record(log, tic, &commit_iclog, + &commit_lsn); if (error) goto out_abort_free_ticket;
diff --git a/fs/xfs/xfs_log_priv.h b/fs/xfs/xfs_log_priv.h index b8387925d0f3..3b007fd1f82c 100644 --- a/fs/xfs/xfs_log_priv.h +++ b/fs/xfs/xfs_log_priv.h @@ -507,8 +507,6 @@ void xlog_print_trans(struct xfs_trans *); int xlog_write(struct xlog *log, struct xfs_log_vec *log_vector, struct xlog_ticket *tic, xfs_lsn_t *start_lsn, struct xlog_in_core **commit_iclog, uint optype); -int xlog_commit_record(struct xlog *log, struct xlog_ticket *ticket, - struct xlog_in_core **iclog, xfs_lsn_t *lsn); void xfs_log_ticket_ungrant(struct xlog *log, struct xlog_ticket *ticket); void xfs_log_ticket_regrant(struct xlog *log, struct xlog_ticket *ticket);
From: "Darrick J. Wong" darrick.wong@oracle.com
mainline-inclusion from mainline-v5.10-rc5 commit a5336d6bb2d02d0e9d4d3c8be04b80b8b68d56c8 category: bugfix bugzilla: https://gitee.com/openeuler/kernel/issues/I4V7IK CVE: NA
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i...
-------------------------------------------------
In commit 27c14b5daa82 we started tracking the last inode seen during an inode walk to avoid infinite loops if a corrupt inobt record happens to have a lower ir_startino than the record preceeding it. Unfortunately, the assertion trips over the case where there are completely empty inobt records (which can happen quite easily on 64k page filesystems) because we advance the tracking cursor without actually putting the empty record into the processing buffer. Fix the assert to allow for this case.
Reported-by: zlang@redhat.com Fixes: 27c14b5daa82 ("xfs: ensure inobt record walks always make forward progress") Signed-off-by: Darrick J. Wong darrick.wong@oracle.com Reviewed-by: Zorro Lang zlang@redhat.com Reviewed-by: Dave Chinner dchinner@redhat.com Reviewed-by: guoxuenan guoxuenan@huawei.com Reviewed-by: Zhang Yi yi.zhang@huawei.com Signed-off-by: Zheng Zengkai zhengzengkai@huawei.com --- fs/xfs/xfs_iwalk.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/fs/xfs/xfs_iwalk.c b/fs/xfs/xfs_iwalk.c index c06702c7e64b..e26aeea3f0c0 100644 --- a/fs/xfs/xfs_iwalk.c +++ b/fs/xfs/xfs_iwalk.c @@ -365,7 +365,7 @@ xfs_iwalk_run_callbacks( /* Delete cursor but remember the last record we cached... */ xfs_iwalk_del_inobt(iwag->tp, curpp, agi_bpp, 0); irec = &iwag->recs[iwag->nr_recs - 1]; - ASSERT(next_agino == irec->ir_startino + XFS_INODES_PER_CHUNK); + ASSERT(next_agino >= irec->ir_startino + XFS_INODES_PER_CHUNK);
if (iwag->drop_trans) { xfs_trans_cancel(iwag->tp);
From: Dave Chinner dchinner@redhat.com
mainline-inclusion from mainline-v5.14-rc4 commit c45aba40cf5b2988c0bebee8c9b846c88aa651eb category: bugfix bugzilla: https://gitee.com/openeuler/kernel/issues/I4V7IK CVE: NA
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i...
-------------------------------------------------
Pass the CIL context to xlog_write() rather than a pointer to a LSN variable. Only the CIL checkpoint calls to xlog_write() need to know about the start LSN of the writes, so rework xlog_write to directly write the LSNs into the CIL context structure.
This removes the commit_lsn variable from xlog_cil_push_work(), so now we only have to issue the commit record ordering wakeup from there.
Signed-off-by: Dave Chinner dchinner@redhat.com Reviewed-by: Christoph Hellwig hch@lst.de Reviewed-by: Darrick J. Wong djwong@kernel.org Signed-off-by: Darrick J. Wong djwong@kernel.org Signed-off-by: Lihong Kou koulihong@huawei.com Reviewed-by: guoxuenan guoxuenan@huawei.com Reviewed-by: Zhang Yi yi.zhang@huawei.com Signed-off-by: Zheng Zengkai zhengzengkai@huawei.com --- fs/xfs/xfs_log.c | 18 +++++++++------ fs/xfs/xfs_log_cil.c | 52 ++++++++++++++++++++++++++++++------------- fs/xfs/xfs_log_priv.h | 7 ++++-- 3 files changed, 52 insertions(+), 25 deletions(-)
diff --git a/fs/xfs/xfs_log.c b/fs/xfs/xfs_log.c index a9e5b5025d14..7f31fb697135 100644 --- a/fs/xfs/xfs_log.c +++ b/fs/xfs/xfs_log.c @@ -881,7 +881,7 @@ xlog_write_unmount_record( /* account for space used by record data */ ticket->t_curr_res -= sizeof(ulf);
- return xlog_write(log, &vec, ticket, NULL, NULL, XLOG_UNMOUNT_TRANS); + return xlog_write(log, NULL, &vec, ticket, NULL, XLOG_UNMOUNT_TRANS); }
/* @@ -2322,9 +2322,9 @@ xlog_write_copy_finish( int xlog_write( struct xlog *log, + struct xfs_cil_ctx *ctx, struct xfs_log_vec *log_vector, struct xlog_ticket *ticket, - xfs_lsn_t *start_lsn, struct xlog_in_core **commit_iclog, uint optype) { @@ -2355,8 +2355,6 @@ xlog_write( }
len = xlog_write_calc_vec_length(ticket, log_vector, optype); - if (start_lsn) - *start_lsn = 0; while (lv && (!lv->lv_niovecs || index < lv->lv_niovecs)) { void *ptr; int log_offset; @@ -2369,9 +2367,15 @@ xlog_write( ASSERT(log_offset <= iclog->ic_size - 1); ptr = iclog->ic_datap + log_offset;
- /* Start_lsn is the first lsn written to. */ - if (start_lsn && !*start_lsn) - *start_lsn = be64_to_cpu(iclog->ic_header.h_lsn); + /* + * If we have a context pointer, pass it the first iclog we are + * writing to so it can record state needed for iclog write + * ordering. + */ + if (ctx) { + xlog_cil_set_ctx_write_state(ctx, iclog); + ctx = NULL; + }
/* * This loop writes out as many regions as can fit in the amount diff --git a/fs/xfs/xfs_log_cil.c b/fs/xfs/xfs_log_cil.c index 52586e64a7c6..4460d9981dfa 100644 --- a/fs/xfs/xfs_log_cil.c +++ b/fs/xfs/xfs_log_cil.c @@ -659,6 +659,30 @@ xlog_cil_process_committed( } }
+/* +* Record the LSN of the iclog we were just granted space to start writing into. +* If the context doesn't have a start_lsn recorded, then this iclog will +* contain the start record for the checkpoint. Otherwise this write contains +* the commit record for the checkpoint. +*/ +void +xlog_cil_set_ctx_write_state( + struct xfs_cil_ctx *ctx, + struct xlog_in_core *iclog) +{ + struct xfs_cil *cil = ctx->cil; + xfs_lsn_t lsn = be64_to_cpu(iclog->ic_header.h_lsn); + + ASSERT(!ctx->commit_lsn); + spin_lock(&cil->xc_push_lock); + if (!ctx->start_lsn) + ctx->start_lsn = lsn; + else + ctx->commit_lsn = lsn; + spin_unlock(&cil->xc_push_lock); +} + + /* * Write out the commit record of a checkpoint transaction associated with the * given ticket to close off a running log write. Return the lsn of the commit @@ -666,26 +690,26 @@ xlog_cil_process_committed( */ static int xlog_cil_write_commit_record( - struct xlog *log, - struct xlog_ticket *ticket, - struct xlog_in_core **iclog, - xfs_lsn_t *lsn) + struct xfs_cil_ctx *ctx, + struct xlog_in_core **iclog) { - struct xfs_log_iovec reg = { + struct xlog *log = ctx->cil->xc_log; + struct xfs_log_iovec reg = { .i_addr = NULL, .i_len = 0, .i_type = XLOG_REG_TYPE_COMMIT, }; - struct xfs_log_vec vec = { + struct xfs_log_vec vec = { .lv_niovecs = 1, .lv_iovecp = ®, }; - int error; + int error;
if (xlog_is_shutdown(log)) return -EIO;
- error = xlog_write(log, &vec, ticket, lsn, iclog, XLOG_COMMIT_TRANS); + error = xlog_write(log, ctx, &vec, ctx->ticket, iclog, + XLOG_COMMIT_TRANS); if (error) xfs_force_shutdown(log->l_mp, SHUTDOWN_LOG_IO_ERROR); return error; @@ -723,7 +747,6 @@ xlog_cil_push_work( struct xfs_log_iovec lhdr; struct xfs_log_vec lvhdr = { NULL }; xfs_lsn_t preflush_tail_lsn; - xfs_lsn_t commit_lsn; xfs_csn_t push_seq; struct bio bio; DECLARE_COMPLETION_ONSTACK(bdev_flush); @@ -895,8 +918,7 @@ xlog_cil_push_work( */ wait_for_completion(&bdev_flush);
- error = xlog_write(log, &lvhdr, tic, &ctx->start_lsn, NULL, - XLOG_START_TRANS); + error = xlog_write(log, ctx, &lvhdr, tic, NULL, XLOG_START_TRANS); if (error) goto out_abort_free_ticket;
@@ -934,8 +956,7 @@ xlog_cil_push_work( } spin_unlock(&cil->xc_push_lock);
- error = xlog_cil_write_commit_record(log, tic, &commit_iclog, - &commit_lsn); + error = xlog_cil_write_commit_record(ctx, &commit_iclog); if (error) goto out_abort_free_ticket;
@@ -962,7 +983,6 @@ xlog_cil_push_work( * and wake up anyone who is waiting for the commit to complete. */ spin_lock(&cil->xc_push_lock); - ctx->commit_lsn = commit_lsn; wake_up_all(&cil->xc_commit_wait); spin_unlock(&cil->xc_push_lock);
@@ -978,11 +998,11 @@ xlog_cil_push_work( * iclog header lsn and compare it to the commit lsn to determine if we * need to wait on iclogs or not. */ - if (ctx->start_lsn != commit_lsn) { + if (ctx->start_lsn != ctx->commit_lsn) { xfs_lsn_t plsn;
plsn = be64_to_cpu(commit_iclog->ic_prev->ic_header.h_lsn); - if (plsn && XFS_LSN_CMP(plsn, commit_lsn) < 0) { + if (plsn && XFS_LSN_CMP(plsn, ctx->commit_lsn) < 0) { /* * Waiting on ic_force_wait orders the completion of * iclogs older than ic_prev. Hence we only need to wait diff --git a/fs/xfs/xfs_log_priv.h b/fs/xfs/xfs_log_priv.h index 3b007fd1f82c..ad45a204bbc9 100644 --- a/fs/xfs/xfs_log_priv.h +++ b/fs/xfs/xfs_log_priv.h @@ -504,8 +504,8 @@ xlog_write_adv_cnt(void **ptr, int *len, int *off, size_t bytes)
void xlog_print_tic_res(struct xfs_mount *mp, struct xlog_ticket *ticket); void xlog_print_trans(struct xfs_trans *); -int xlog_write(struct xlog *log, struct xfs_log_vec *log_vector, - struct xlog_ticket *tic, xfs_lsn_t *start_lsn, +int xlog_write(struct xlog *log, struct xfs_cil_ctx *ctx, + struct xfs_log_vec *log_vector, struct xlog_ticket *tic, struct xlog_in_core **commit_iclog, uint optype); void xfs_log_ticket_ungrant(struct xlog *log, struct xlog_ticket *ticket); void xfs_log_ticket_regrant(struct xlog *log, struct xlog_ticket *ticket); @@ -579,6 +579,9 @@ void xlog_cil_destroy(struct xlog *log); bool xlog_cil_empty(struct xlog *log); void xlog_cil_commit(struct xlog *log, struct xfs_trans *tp, xfs_csn_t *commit_seq, bool regrant); +void xlog_cil_set_ctx_write_state(struct xfs_cil_ctx *ctx, + struct xlog_in_core *iclog); +
/* * CIL force routines
From: Dave Chinner dchinner@redhat.com
mainline-inclusion from mainline-v5.14-rc4 commit bf034bc827807ac15affa051e6a94b03f93b1a03 category: bugfix bugzilla: https://gitee.com/openeuler/kernel/issues/I4V7IK CVE: NA
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i...
-------------------------------------------------
So we can use it for start record ordering as well as commit record ordering in future.
Signed-off-by: Dave Chinner dchinner@redhat.com Reviewed-by: Darrick J. Wong djwong@kernel.org Reviewed-by: Christoph Hellwig hch@lst.de Signed-off-by: Darrick J. Wong djwong@kernel.org Signed-off-by: Lihong Kou koulihong@huawei.com Reviewed-by: guoxuenan guoxuenan@huawei.com Reviewed-by: Zhang Yi yi.zhang@huawei.com Signed-off-by: Zheng Zengkai zhengzengkai@huawei.com --- fs/xfs/xfs_log_cil.c | 87 ++++++++++++++++++++++++++------------------ 1 file changed, 51 insertions(+), 36 deletions(-)
diff --git a/fs/xfs/xfs_log_cil.c b/fs/xfs/xfs_log_cil.c index 4460d9981dfa..e58550f101b7 100644 --- a/fs/xfs/xfs_log_cil.c +++ b/fs/xfs/xfs_log_cil.c @@ -684,9 +684,54 @@ xlog_cil_set_ctx_write_state(
/* - * Write out the commit record of a checkpoint transaction associated with the - * given ticket to close off a running log write. Return the lsn of the commit - * record. + * Ensure that the order of log writes follows checkpoint sequence order. This + * relies on the context LSN being zero until the log write has guaranteed the + * LSN that the log write will start at via xlog_state_get_iclog_space(). + */ +static int +xlog_cil_order_write( + struct xfs_cil *cil, + xfs_csn_t sequence) +{ + struct xfs_cil_ctx *ctx; + +restart: + spin_lock(&cil->xc_push_lock); + list_for_each_entry(ctx, &cil->xc_committing, committing) { + /* + * Avoid getting stuck in this loop because we were woken by the + * shutdown, but then went back to sleep once already in the + * shutdown state. + */ + if (xlog_is_shutdown(cil->xc_log)) { + spin_unlock(&cil->xc_push_lock); + return -EIO; + } + + /* + * Higher sequences will wait for this one so skip them. + * Don't wait for our own sequence, either. + */ + if (ctx->sequence >= sequence) + continue; + if (!ctx->commit_lsn) { + /* + * It is still being pushed! Wait for the push to + * complete, then start again from the beginning. + */ + xlog_wait(&cil->xc_commit_wait, &cil->xc_push_lock); + goto restart; + } + } + spin_unlock(&cil->xc_push_lock); + return 0; +} + +/* + * Write out the commit record of a checkpoint transaction to close off a + * running log write. These commit records are strictly ordered in ascending CIL + * sequence order so that log recovery will always replay the checkpoints in the + * correct order. */ static int xlog_cil_write_commit_record( @@ -922,39 +967,9 @@ xlog_cil_push_work( if (error) goto out_abort_free_ticket;
- /* - * now that we've written the checkpoint into the log, strictly - * order the commit records so replay will get them in the right order. - */ -restart: - spin_lock(&cil->xc_push_lock); - list_for_each_entry(new_ctx, &cil->xc_committing, committing) { - /* - * Avoid getting stuck in this loop because we were woken by the - * shutdown, but then went back to sleep once already in the - * shutdown state. - */ - if (xlog_is_shutdown(log)) { - spin_unlock(&cil->xc_push_lock); - goto out_abort_free_ticket; - } - - /* - * Higher sequences will wait for this one so skip them. - * Don't wait for our own sequence, either. - */ - if (new_ctx->sequence >= ctx->sequence) - continue; - if (!new_ctx->commit_lsn) { - /* - * It is still being pushed! Wait for the push to - * complete, then start again from the beginning. - */ - xlog_wait(&cil->xc_commit_wait, &cil->xc_push_lock); - goto restart; - } - } - spin_unlock(&cil->xc_push_lock); + error = xlog_cil_order_write(ctx->cil, ctx->sequence); + if (error) + goto out_abort_free_ticket;
error = xlog_cil_write_commit_record(ctx, &commit_iclog); if (error)
From: Dave Chinner dchinner@redhat.com
mainline-inclusion from mainline-v5.14-rc4 commit caa80090d17c89d0caca1dcb4c8a9cdef5335e71 category: bugfix bugzilla: https://gitee.com/openeuler/kernel/issues/I4V7IK CVE: NA
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i...
-------------------------------------------------
Now that we have a mechanism to guarantee that the callbacks attached to an iclog are owned by the context that attaches them until they drop their reference to the iclog via xlog_state_release_iclog(), we can attach callbacks to the iclog at any time we have an active reference to the iclog.
xlog_state_get_iclog_space() always guarantees that the commit record will fit in the iclog it returns, so we can move this IO callback setting to xlog_cil_set_ctx_write_state(), record the commit iclog in the context and remove the need for the commit iclog to be returned by xlog_write() altogether.
This, in turn, allows us to move the wakeup for ordered commit record writes up into xlog_cil_set_ctx_write_state(), too, because we have been guaranteed that this commit record will be physically located in the iclog before any waiting commit record at a higher sequence number will be granted iclog space.
This further cleans up the post commit record write processing in the CIL push code, especially as xlog_state_release_iclog() will now clean up the context when shutdown errors occur.
Signed-off-by: Dave Chinner dchinner@redhat.com Reviewed-by: Christoph Hellwig hch@lst.de Reviewed-by: Darrick J. Wong djwong@kernel.org Signed-off-by: Darrick J. Wong djwong@kernel.org Signed-off-by: Lihong Kou koulihong@huawei.com Reviewed-by: guoxuenan guoxuenan@huawei.com Reviewed-by: Zhang Yi yi.zhang@huawei.com Signed-off-by: Zheng Zengkai zhengzengkai@huawei.com --- fs/xfs/xfs_log.c | 47 +++++++-------------- fs/xfs/xfs_log_cil.c | 98 ++++++++++++++++++++++++------------------- fs/xfs/xfs_log_priv.h | 3 +- 3 files changed, 72 insertions(+), 76 deletions(-)
diff --git a/fs/xfs/xfs_log.c b/fs/xfs/xfs_log.c index 7f31fb697135..1f4e6ce71ec7 100644 --- a/fs/xfs/xfs_log.c +++ b/fs/xfs/xfs_log.c @@ -881,7 +881,7 @@ xlog_write_unmount_record( /* account for space used by record data */ ticket->t_curr_res -= sizeof(ulf);
- return xlog_write(log, NULL, &vec, ticket, NULL, XLOG_UNMOUNT_TRANS); + return xlog_write(log, NULL, &vec, ticket, XLOG_UNMOUNT_TRANS); }
/* @@ -2232,8 +2232,7 @@ xlog_write_copy_finish( int *data_cnt, int *partial_copy, int *partial_copy_len, - int log_offset, - struct xlog_in_core **commit_iclog) + int log_offset) { int error;
@@ -2252,27 +2251,20 @@ xlog_write_copy_finish( *partial_copy = 0; *partial_copy_len = 0;
- if (iclog->ic_size - log_offset <= sizeof(xlog_op_header_t)) { - /* no more space in this iclog - push it. */ - spin_lock(&log->l_icloglock); - xlog_state_finish_copy(log, iclog, *record_cnt, *data_cnt); - *record_cnt = 0; - *data_cnt = 0; - - if (iclog->ic_state == XLOG_STATE_ACTIVE) - xlog_state_switch_iclogs(log, iclog, 0); - else - ASSERT(iclog->ic_state == XLOG_STATE_WANT_SYNC || - xlog_is_shutdown(log)); - if (!commit_iclog) - goto release_iclog; - spin_unlock(&log->l_icloglock); - ASSERT(flags & XLOG_COMMIT_TRANS); - *commit_iclog = iclog; - } + if (iclog->ic_size - log_offset > sizeof(xlog_op_header_t)) + return 0;
- return 0; + /* no more space in this iclog - push it. */ + spin_lock(&log->l_icloglock); + xlog_state_finish_copy(log, iclog, *record_cnt, *data_cnt); + *record_cnt = 0; + *data_cnt = 0;
+ if (iclog->ic_state == XLOG_STATE_ACTIVE) + xlog_state_switch_iclogs(log, iclog, 0); + else + ASSERT(iclog->ic_state == XLOG_STATE_WANT_SYNC || + xlog_is_shutdown(log)); release_iclog: error = xlog_state_release_iclog(log, iclog, 0); spin_unlock(&log->l_icloglock); @@ -2325,7 +2317,6 @@ xlog_write( struct xfs_cil_ctx *ctx, struct xfs_log_vec *log_vector, struct xlog_ticket *ticket, - struct xlog_in_core **commit_iclog, uint optype) { struct xlog_in_core *iclog = NULL; @@ -2454,8 +2445,7 @@ xlog_write( &record_cnt, &data_cnt, &partial_copy, &partial_copy_len, - log_offset, - commit_iclog); + log_offset); if (error) return error;
@@ -2493,12 +2483,7 @@ xlog_write(
spin_lock(&log->l_icloglock); xlog_state_finish_copy(log, iclog, record_cnt, data_cnt); - if (commit_iclog) { - ASSERT(optype & XLOG_COMMIT_TRANS); - *commit_iclog = iclog; - } else { - error = xlog_state_release_iclog(log, iclog, 0); - } + error = xlog_state_release_iclog(log, iclog, 0); spin_unlock(&log->l_icloglock);
return error; diff --git a/fs/xfs/xfs_log_cil.c b/fs/xfs/xfs_log_cil.c index e58550f101b7..f49b961f0748 100644 --- a/fs/xfs/xfs_log_cil.c +++ b/fs/xfs/xfs_log_cil.c @@ -674,11 +674,41 @@ xlog_cil_set_ctx_write_state( xfs_lsn_t lsn = be64_to_cpu(iclog->ic_header.h_lsn);
ASSERT(!ctx->commit_lsn); - spin_lock(&cil->xc_push_lock); - if (!ctx->start_lsn) + if (!ctx->start_lsn) { + spin_lock(&cil->xc_push_lock); ctx->start_lsn = lsn; - else - ctx->commit_lsn = lsn; + spin_unlock(&cil->xc_push_lock); + return; + } + + /* + * Take a reference to the iclog for the context so that we still hold + * it when xlog_write is done and has released it. This means the + * context controls when the iclog is released for IO. + */ + atomic_inc(&iclog->ic_refcnt); + + /* + * xlog_state_get_iclog_space() guarantees there is enough space in the + * iclog for an entire commit record, so we can attach the context + * callbacks now. This needs to be done before we make the commit_lsn + * visible to waiters so that checkpoints with commit records in the + * same iclog order their IO completion callbacks in the same order that + * the commit records appear in the iclog. + */ + spin_lock(&cil->xc_log->l_icloglock); + list_add_tail(&ctx->iclog_entry, &iclog->ic_callbacks); + spin_unlock(&cil->xc_log->l_icloglock); + + /* + * Now we can record the commit LSN and wake anyone waiting for this + * sequence to have the ordered commit record assigned to a physical + * location in the log. + */ + spin_lock(&cil->xc_push_lock); + ctx->commit_iclog = iclog; + ctx->commit_lsn = lsn; + wake_up_all(&cil->xc_commit_wait); spin_unlock(&cil->xc_push_lock); }
@@ -735,8 +765,7 @@ xlog_cil_order_write( */ static int xlog_cil_write_commit_record( - struct xfs_cil_ctx *ctx, - struct xlog_in_core **iclog) + struct xfs_cil_ctx *ctx) { struct xlog *log = ctx->cil->xc_log; struct xfs_log_iovec reg = { @@ -753,8 +782,7 @@ xlog_cil_write_commit_record( if (xlog_is_shutdown(log)) return -EIO;
- error = xlog_write(log, ctx, &vec, ctx->ticket, iclog, - XLOG_COMMIT_TRANS); + error = xlog_write(log, ctx, &vec, ctx->ticket, XLOG_COMMIT_TRANS); if (error) xfs_force_shutdown(log->l_mp, SHUTDOWN_LOG_IO_ERROR); return error; @@ -784,7 +812,6 @@ xlog_cil_push_work( struct xlog *log = cil->xc_log; struct xfs_log_vec *lv; struct xfs_cil_ctx *new_ctx; - struct xlog_in_core *commit_iclog; struct xlog_ticket *tic; int num_iovecs; int error = 0; @@ -963,7 +990,7 @@ xlog_cil_push_work( */ wait_for_completion(&bdev_flush);
- error = xlog_write(log, ctx, &lvhdr, tic, NULL, XLOG_START_TRANS); + error = xlog_write(log, ctx, &lvhdr, tic, XLOG_START_TRANS); if (error) goto out_abort_free_ticket;
@@ -971,36 +998,12 @@ xlog_cil_push_work( if (error) goto out_abort_free_ticket;
- error = xlog_cil_write_commit_record(ctx, &commit_iclog); + error = xlog_cil_write_commit_record(ctx); if (error) goto out_abort_free_ticket;
xfs_log_ticket_ungrant(log, tic);
- /* - * Once we attach the ctx to the iclog, it is effectively owned by the - * iclog and we can only use it while we still have an active reference - * to the iclog. i.e. once we call xlog_state_release_iclog() we can no - * longer safely reference the ctx. - */ - spin_lock(&log->l_icloglock); - if (xlog_is_shutdown(log)) { - spin_unlock(&log->l_icloglock); - goto out_abort; - } - ASSERT_ALWAYS(commit_iclog->ic_state == XLOG_STATE_ACTIVE || - commit_iclog->ic_state == XLOG_STATE_WANT_SYNC); - list_add_tail(&ctx->iclog_entry, &commit_iclog->ic_callbacks); - - /* - * now the checkpoint commit is complete and we've attached the - * callbacks to the iclog we can assign the commit LSN to the context - * and wake up anyone who is waiting for the commit to complete. - */ - spin_lock(&cil->xc_push_lock); - wake_up_all(&cil->xc_commit_wait); - spin_unlock(&cil->xc_push_lock); - /* * If the checkpoint spans multiple iclogs, wait for all previous iclogs * to complete before we submit the commit_iclog. We can't use state @@ -1013,17 +1016,18 @@ xlog_cil_push_work( * iclog header lsn and compare it to the commit lsn to determine if we * need to wait on iclogs or not. */ + spin_lock(&log->l_icloglock); if (ctx->start_lsn != ctx->commit_lsn) { xfs_lsn_t plsn;
- plsn = be64_to_cpu(commit_iclog->ic_prev->ic_header.h_lsn); + plsn = be64_to_cpu(ctx->commit_iclog->ic_prev->ic_header.h_lsn); if (plsn && XFS_LSN_CMP(plsn, ctx->commit_lsn) < 0) { /* * Waiting on ic_force_wait orders the completion of * iclogs older than ic_prev. Hence we only need to wait * on the most recent older iclog here. */ - xlog_wait_on_iclog(commit_iclog->ic_prev); + xlog_wait_on_iclog(ctx->commit_iclog->ic_prev); spin_lock(&log->l_icloglock); }
@@ -1031,7 +1035,7 @@ xlog_cil_push_work( * We need to issue a pre-flush so that the ordering for this * checkpoint is correctly preserved down to stable storage. */ - commit_iclog->ic_flags |= XLOG_ICL_NEED_FLUSH; + ctx->commit_iclog->ic_flags |= XLOG_ICL_NEED_FLUSH; }
/* @@ -1044,11 +1048,11 @@ xlog_cil_push_work( * will be written when released, switch it's state to WANT_SYNC right * now. */ - commit_iclog->ic_flags |= XLOG_ICL_NEED_FUA; + ctx->commit_iclog->ic_flags |= XLOG_ICL_NEED_FUA; if (push_commit_stable && - commit_iclog->ic_state == XLOG_STATE_ACTIVE) - xlog_state_switch_iclogs(log, commit_iclog, 0); - xlog_state_release_iclog(log, commit_iclog, preflush_tail_lsn); + ctx->commit_iclog->ic_state == XLOG_STATE_ACTIVE) + xlog_state_switch_iclogs(log, ctx->commit_iclog, 0); + xlog_state_release_iclog(log, ctx->commit_iclog, preflush_tail_lsn);
/* Not safe to reference ctx now! */
@@ -1063,9 +1067,15 @@ xlog_cil_push_work(
out_abort_free_ticket: xfs_log_ticket_ungrant(log, tic); -out_abort: ASSERT(xlog_is_shutdown(log)); - xlog_cil_committed(ctx); + if (!ctx->commit_iclog) { + xlog_cil_committed(ctx); + return; + } + spin_lock(&log->l_icloglock); + xlog_state_release_iclog(log, ctx->commit_iclog, 0); + /* Not safe to reference ctx now! */ + spin_unlock(&log->l_icloglock); }
/* diff --git a/fs/xfs/xfs_log_priv.h b/fs/xfs/xfs_log_priv.h index ad45a204bbc9..ed3bf129023f 100644 --- a/fs/xfs/xfs_log_priv.h +++ b/fs/xfs/xfs_log_priv.h @@ -233,6 +233,7 @@ struct xfs_cil_ctx { xfs_csn_t sequence; /* chkpt sequence # */ xfs_lsn_t start_lsn; /* first LSN of chkpt commit */ xfs_lsn_t commit_lsn; /* chkpt commit record lsn */ + struct xlog_in_core *commit_iclog; struct xlog_ticket *ticket; /* chkpt ticket */ int nvecs; /* number of regions */ int space_used; /* aggregate size of regions */ @@ -506,7 +507,7 @@ void xlog_print_tic_res(struct xfs_mount *mp, struct xlog_ticket *ticket); void xlog_print_trans(struct xfs_trans *); int xlog_write(struct xlog *log, struct xfs_cil_ctx *ctx, struct xfs_log_vec *log_vector, struct xlog_ticket *tic, - struct xlog_in_core **commit_iclog, uint optype); + uint optype); void xfs_log_ticket_ungrant(struct xlog *log, struct xlog_ticket *ticket); void xfs_log_ticket_regrant(struct xlog *log, struct xlog_ticket *ticket);
From: Dave Chinner dchinner@redhat.com
mainline-inclusion from mainline-v5.14-rc4 commit 68a74dcae6737c27b524b680e070fe41f0cad43a category: bugfix bugzilla: https://gitee.com/openeuler/kernel/issues/I4V7IK CVE: NA
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i...
-------------------------------------------------
Because log recovery depends on strictly ordered start records as well as strictly ordered commit records.
This is a zero day bug in the way XFS writes pipelined transactions to the journal which is exposed by fixing the zero day bug that prevents the CIL from pipelining checkpoints. This re-introduces explicit concurrent commits back into the on-disk journal and hence out of order start records.
The XFS journal commit code has never ordered start records and we have relied on strict commit record ordering for correct recovery ordering of concurrently written transactions. Unfortunately, root cause analysis uncovered the fact that log recovery uses the LSN of the start record for transaction commit processing. Hence, whilst the commits are processed in strict order by recovery, the LSNs associated with the commits can be out of order and so recovery may stamp incorrect LSNs into objects and/or misorder intents in the AIL for later processing. This can result in log recovery failures and/or on disk corruption, sometimes silent.
Because this is a long standing log recovery issue, we can't just fix log recovery and call it good. This still leaves older kernels susceptible to recovery failures and corruption when replaying a log from a kernel that pipelines checkpoints. There is also the issue that in-memory ordering for AIL pushing and data integrity operations are based on checkpoint start LSNs, and if the start LSN is incorrect in the journal, it is also incorrect in memory.
Hence there's really only one choice for fixing this zero-day bug: we need to strictly order checkpoint start records in ascending sequence order in the log, the same way we already strictly order commit records.
Signed-off-by: Dave Chinner dchinner@redhat.com Reviewed-by: Darrick J. Wong djwong@kernel.org Signed-off-by: Darrick J. Wong djwong@kernel.org Signed-off-by: Lihong Kou koulihong@huawei.com Reviewed-by: guoxuenan guoxuenan@huawei.com Reviewed-by: Zhang Yi yi.zhang@huawei.com Signed-off-by: Zheng Zengkai zhengzengkai@huawei.com --- fs/xfs/xfs_log.c | 1 + fs/xfs/xfs_log_cil.c | 69 +++++++++++++++++++++++++++++++++++-------- fs/xfs/xfs_log_priv.h | 1 + 3 files changed, 58 insertions(+), 13 deletions(-)
diff --git a/fs/xfs/xfs_log.c b/fs/xfs/xfs_log.c index 1f4e6ce71ec7..607a101585fc 100644 --- a/fs/xfs/xfs_log.c +++ b/fs/xfs/xfs_log.c @@ -3754,6 +3754,7 @@ xlog_force_shutdown( * avoid races. */ spin_lock(&log->l_cilp->xc_push_lock); + wake_up_all(&log->l_cilp->xc_start_wait); wake_up_all(&log->l_cilp->xc_commit_wait); spin_unlock(&log->l_cilp->xc_push_lock); xlog_state_shutdown_callbacks(log); diff --git a/fs/xfs/xfs_log_cil.c b/fs/xfs/xfs_log_cil.c index f49b961f0748..ad98bb895686 100644 --- a/fs/xfs/xfs_log_cil.c +++ b/fs/xfs/xfs_log_cil.c @@ -623,6 +623,7 @@ xlog_cil_committed( */ if (abort) { spin_lock(&ctx->cil->xc_push_lock); + wake_up_all(&ctx->cil->xc_start_wait); wake_up_all(&ctx->cil->xc_commit_wait); spin_unlock(&ctx->cil->xc_push_lock); } @@ -676,7 +677,14 @@ xlog_cil_set_ctx_write_state( ASSERT(!ctx->commit_lsn); if (!ctx->start_lsn) { spin_lock(&cil->xc_push_lock); + /* + * The LSN we need to pass to the log items on transaction + * commit is the LSN reported by the first log vector write, not + * the commit lsn. If we use the commit record lsn then we can + * move the tail beyond the grant write head. + */ ctx->start_lsn = lsn; + wake_up_all(&cil->xc_start_wait); spin_unlock(&cil->xc_push_lock); return; } @@ -718,10 +726,16 @@ xlog_cil_set_ctx_write_state( * relies on the context LSN being zero until the log write has guaranteed the * LSN that the log write will start at via xlog_state_get_iclog_space(). */ +enum _record_type { + _START_RECORD, + _COMMIT_RECORD, +}; + static int xlog_cil_order_write( struct xfs_cil *cil, - xfs_csn_t sequence) + xfs_csn_t sequence, + enum _record_type record) { struct xfs_cil_ctx *ctx;
@@ -744,19 +758,47 @@ xlog_cil_order_write( */ if (ctx->sequence >= sequence) continue; - if (!ctx->commit_lsn) { - /* - * It is still being pushed! Wait for the push to - * complete, then start again from the beginning. - */ - xlog_wait(&cil->xc_commit_wait, &cil->xc_push_lock); - goto restart; + + /* Wait until the LSN for the record has been recorded. */ + switch (record) { + case _START_RECORD: + if (!ctx->start_lsn) { + xlog_wait(&cil->xc_start_wait, &cil->xc_push_lock); + goto restart; + } + break; + case _COMMIT_RECORD: + if (!ctx->commit_lsn) { + xlog_wait(&cil->xc_commit_wait, &cil->xc_push_lock); + goto restart; + } + break; } } spin_unlock(&cil->xc_push_lock); return 0; }
+/* + * Write out the log vector change now attached to the CIL context. This will + * write a start record that needs to be strictly ordered in ascending CIL + * sequence order so that log recovery will always use in-order start LSNs when + * replaying checkpoints. + */ +static int +xlog_cil_write_chain( + struct xfs_cil_ctx *ctx, + struct xfs_log_vec *chain) +{ + struct xlog *log = ctx->cil->xc_log; + int error; + + error = xlog_cil_order_write(ctx->cil, ctx->sequence, _START_RECORD); + if (error) + return error; + return xlog_write(log, ctx, chain, ctx->ticket, XLOG_START_TRANS); +} + /* * Write out the commit record of a checkpoint transaction to close off a * running log write. These commit records are strictly ordered in ascending CIL @@ -782,6 +824,10 @@ xlog_cil_write_commit_record( if (xlog_is_shutdown(log)) return -EIO;
+ error = xlog_cil_order_write(ctx->cil, ctx->sequence, _COMMIT_RECORD); + if (error) + return error; + error = xlog_write(log, ctx, &vec, ctx->ticket, XLOG_COMMIT_TRANS); if (error) xfs_force_shutdown(log->l_mp, SHUTDOWN_LOG_IO_ERROR); @@ -990,11 +1036,7 @@ xlog_cil_push_work( */ wait_for_completion(&bdev_flush);
- error = xlog_write(log, ctx, &lvhdr, tic, XLOG_START_TRANS); - if (error) - goto out_abort_free_ticket; - - error = xlog_cil_order_write(ctx->cil, ctx->sequence); + error = xlog_cil_write_chain(ctx, &lvhdr); if (error) goto out_abort_free_ticket;
@@ -1444,6 +1486,7 @@ xlog_cil_init( spin_lock_init(&cil->xc_push_lock); init_waitqueue_head(&cil->xc_push_wait); init_rwsem(&cil->xc_ctx_lock); + init_waitqueue_head(&cil->xc_start_wait); init_waitqueue_head(&cil->xc_commit_wait); cil->xc_log = log; log->l_cilp = cil; diff --git a/fs/xfs/xfs_log_priv.h b/fs/xfs/xfs_log_priv.h index ed3bf129023f..b01fb03e60d4 100644 --- a/fs/xfs/xfs_log_priv.h +++ b/fs/xfs/xfs_log_priv.h @@ -275,6 +275,7 @@ struct xfs_cil { bool xc_push_commit_stable; struct list_head xc_committing; wait_queue_head_t xc_commit_wait; + wait_queue_head_t xc_start_wait; xfs_csn_t xc_current_sequence; wait_queue_head_t xc_push_wait; /* background push throttle */ } ____cacheline_aligned_in_smp;
From: Yanling Song songyl@ramaxel.com
Ramaxel inclusion category: bugfix bugzilla: https://gitee.com/openeuler/kernel/issues/I4UA67 CVE: NA
----------------------------------------------
Fix: 1.Remove UNF_ORIGIN_HOTTAG_MASK and UNF_HOTTAG_FLAG 2.Update some output string 3.Remove spinlock protect in free_parent_sq() because there is spinlock protect in caller function free_parent_queue_info()
Signed-off-by: Yanling Song songyl@ramaxel.com Reviewed-by: Yun Xu xuyun@ramaxel.com Acked-by: Xie XiuQi xiexiuqi@huawei.com Signed-off-by: Zheng Zengkai zhengzengkai@huawei.com --- drivers/scsi/spfc/common/unf_common.h | 2 -- drivers/scsi/spfc/common/unf_io.c | 3 +-- drivers/scsi/spfc/common/unf_io_abnormal.c | 2 +- drivers/scsi/spfc/common/unf_rport.c | 2 +- drivers/scsi/spfc/common/unf_service.c | 19 +++++-------------- drivers/scsi/spfc/hw/spfc_hba.c | 2 +- drivers/scsi/spfc/hw/spfc_io.c | 2 +- drivers/scsi/spfc/hw/spfc_queue.c | 5 ----- drivers/scsi/spfc/hw/spfc_service.c | 22 ++++++++++++---------- 9 files changed, 22 insertions(+), 37 deletions(-)
diff --git a/drivers/scsi/spfc/common/unf_common.h b/drivers/scsi/spfc/common/unf_common.h index bf9d156e07ce..9613649308bf 100644 --- a/drivers/scsi/spfc/common/unf_common.h +++ b/drivers/scsi/spfc/common/unf_common.h @@ -12,8 +12,6 @@ #define SPFC_DRV_DESC "Ramaxel Memory Technology Fibre Channel Driver"
#define UNF_MAX_SECTORS 0xffff -#define UNF_ORIGIN_HOTTAG_MASK 0x7fff -#define UNF_HOTTAG_FLAG (1 << 15) #define UNF_PKG_FREE_OXID 0x0 #define UNF_PKG_FREE_RXID 0x1
diff --git a/drivers/scsi/spfc/common/unf_io.c b/drivers/scsi/spfc/common/unf_io.c index b1255ecba88c..5de69f8ddc6d 100644 --- a/drivers/scsi/spfc/common/unf_io.c +++ b/drivers/scsi/spfc/common/unf_io.c @@ -890,8 +890,7 @@ static int unf_send_fcpcmnd(struct unf_lport *lport, struct unf_rport *rport, unf_xchg->private_data[PKG_PRIVATE_XCHG_ALLOC_TIME]; pkg.private_data[PKG_PRIVATE_XCHG_VP_INDEX] = unf_lport->vp_index; pkg.private_data[PKG_PRIVATE_XCHG_RPORT_INDEX] = unf_rport->rport_index; - pkg.private_data[PKG_PRIVATE_XCHG_HOT_POOL_INDEX] = - unf_xchg->hotpooltag | UNF_HOTTAG_FLAG; + pkg.private_data[PKG_PRIVATE_XCHG_HOT_POOL_INDEX] = unf_xchg->hotpooltag;
unf_select_sq(unf_xchg, &pkg); pkg.fcp_cmnd = &unf_xchg->fcp_cmnd; diff --git a/drivers/scsi/spfc/common/unf_io_abnormal.c b/drivers/scsi/spfc/common/unf_io_abnormal.c index fece7aa5f441..4e268ac026ca 100644 --- a/drivers/scsi/spfc/common/unf_io_abnormal.c +++ b/drivers/scsi/spfc/common/unf_io_abnormal.c @@ -763,7 +763,7 @@ int unf_send_scsi_mgmt_cmnd(struct unf_xchg *xchg, struct unf_lport *lport, pkg.xchg_contex = unf_xchg; pkg.private_data[PKG_PRIVATE_XCHG_RPORT_INDEX] = rport->rport_index; pkg.fcp_cmnd = &unf_xchg->fcp_cmnd; - pkg.private_data[PKG_PRIVATE_XCHG_HOT_POOL_INDEX] = unf_xchg->hotpooltag | UNF_HOTTAG_FLAG; + pkg.private_data[PKG_PRIVATE_XCHG_HOT_POOL_INDEX] = unf_xchg->hotpooltag; pkg.frame_head.csctl_sid = lport->nport_id; pkg.frame_head.rctl_did = rport->nport_id;
diff --git a/drivers/scsi/spfc/common/unf_rport.c b/drivers/scsi/spfc/common/unf_rport.c index aa4967fc0ab6..9b06df884524 100644 --- a/drivers/scsi/spfc/common/unf_rport.c +++ b/drivers/scsi/spfc/common/unf_rport.c @@ -352,7 +352,7 @@ struct unf_rport *unf_find_valid_rport(struct unf_lport *lport, u64 wwpn, u32 si spin_unlock_irqrestore(rport_state_lock, flags);
FC_DRV_PRINT(UNF_LOG_LOGIN_ATT, UNF_INFO, - "[err]Port(0x%x) RPort(0x%p) find by WWPN(0x%llx) is invalid", + "[info]Port(0x%x) RPort(0x%p) find by WWPN(0x%llx) is invalid", lport->port_id, rport_by_wwpn, wwpn);
rport_by_wwpn = NULL; diff --git a/drivers/scsi/spfc/common/unf_service.c b/drivers/scsi/spfc/common/unf_service.c index 8f72f6470647..9c86c99374c8 100644 --- a/drivers/scsi/spfc/common/unf_service.c +++ b/drivers/scsi/spfc/common/unf_service.c @@ -130,7 +130,7 @@ void unf_fill_package(struct unf_frame_pkg *pkg, struct unf_xchg *xchg, pkg->private_data[PKG_PRIVATE_RPORT_RX_SIZE] = rport->max_frame_size; }
- pkg->private_data[PKG_PRIVATE_XCHG_HOT_POOL_INDEX] = xchg->hotpooltag | UNF_HOTTAG_FLAG; + pkg->private_data[PKG_PRIVATE_XCHG_HOT_POOL_INDEX] = xchg->hotpooltag; pkg->private_data[PKG_PRIVATE_XCHG_ALLOC_TIME] = xchg->private_data[PKG_PRIVATE_XCHG_ALLOC_TIME]; pkg->private_data[PKG_PRIVATE_LOWLEVEL_XCHG_ADD] = @@ -250,7 +250,7 @@ u32 unf_send_abts(struct unf_lport *lport, struct unf_xchg *xchg) pkg.unf_cmnd_pload_bl.buffer_ptr = (u8 *)xchg->fcp_sfs_union.sfs_entry.fc_sfs_entry_ptr;
pkg.unf_cmnd_pload_bl.buf_dma_addr = xchg->fcp_sfs_union.sfs_entry.sfs_buff_phy_addr; - pkg.private_data[PKG_PRIVATE_XCHG_HOT_POOL_INDEX] = xchg->hotpooltag | UNF_HOTTAG_FLAG; + pkg.private_data[PKG_PRIVATE_XCHG_HOT_POOL_INDEX] = xchg->hotpooltag;
UNF_SET_XCHG_ALLOC_TIME(&pkg, xchg); UNF_SET_ABORT_INFO_IOTYPE(&pkg, xchg); @@ -407,19 +407,10 @@ static u32 unf_els_cmnd_default_handler(struct unf_lport *lport, struct unf_xchg rjt_info.reason_code = UNF_LS_RJT_NOT_SUPPORTED;
unf_rport = unf_get_rport_by_nport_id(lport, sid); - if (unf_rport) { - if (unf_rport->rport_index != - xchg->private_data[PKG_PRIVATE_XCHG_RPORT_INDEX]) { - FC_DRV_PRINT(UNF_LOG_LOGIN_ATT, UNF_WARN, - "[warn]Port(0x%x_0x%x) NPort handle(0x%x) from low level is not equal to RPort index(0x%x)", - lport->port_id, lport->nport_id, - xchg->private_data[PKG_PRIVATE_XCHG_RPORT_INDEX], - unf_rport->rport_index); - } + if (unf_rport) ret = unf_send_els_rjt_by_rport(lport, xchg, unf_rport, &rjt_info); - } else { + else ret = unf_send_els_rjt_by_did(lport, xchg, sid, &rjt_info); - }
return ret; } @@ -1389,7 +1380,7 @@ static void unf_fill_free_xid_pkg(struct unf_xchg *xchg, struct unf_frame_pkg *p pkg->frame_head.csctl_sid = xchg->sid; pkg->frame_head.rctl_did = xchg->did; pkg->frame_head.oxid_rxid = (u32)(((u32)xchg->oxid << UNF_SHIFT_16) | xchg->rxid); - pkg->private_data[PKG_PRIVATE_XCHG_HOT_POOL_INDEX] = xchg->hotpooltag | UNF_HOTTAG_FLAG; + pkg->private_data[PKG_PRIVATE_XCHG_HOT_POOL_INDEX] = xchg->hotpooltag; UNF_SET_XCHG_ALLOC_TIME(pkg, xchg);
if (xchg->xchg_type == UNF_XCHG_TYPE_SFS) { diff --git a/drivers/scsi/spfc/hw/spfc_hba.c b/drivers/scsi/spfc/hw/spfc_hba.c index e12299c9e2c9..b033dcb78bb3 100644 --- a/drivers/scsi/spfc/hw/spfc_hba.c +++ b/drivers/scsi/spfc/hw/spfc_hba.c @@ -56,7 +56,7 @@ static struct unf_cfg_item spfc_port_cfg_parm[] = { {"port_topology", 0, 0xf, 0x20}, {"port_alpa", 0, 0xdead, 0xffff}, /* alpa address of port */ /* queue depth of originator registered to SCSI midlayer */ - {"max_queue_depth", 0, 128, 128}, + {"max_queue_depth", 0, 512, 512}, {"sest_num", 0, 2048, 2048}, {"max_login", 0, 2048, 2048}, /* nodename from 32 bit to 64 bit */ diff --git a/drivers/scsi/spfc/hw/spfc_io.c b/drivers/scsi/spfc/hw/spfc_io.c index 2b1d1c607b13..7184eb6a10af 100644 --- a/drivers/scsi/spfc/hw/spfc_io.c +++ b/drivers/scsi/spfc/hw/spfc_io.c @@ -1138,7 +1138,7 @@ u32 spfc_scq_recv_iresp(struct spfc_hba_info *hba, union spfc_scqe *wqe) pkg.private_data[PKG_PRIVATE_XCHG_ALLOC_TIME] = iresp->magic_num; pkg.frame_head.oxid_rxid = (((iresp->wd0.ox_id) << UNF_SHIFT_16) | (iresp->wd0.rx_id));
- hot_tag = (u16)iresp->wd2.hotpooltag & UNF_ORIGIN_HOTTAG_MASK; + hot_tag = (u16)iresp->wd2.hotpooltag; /* 2. HotTag validity check */ if (likely(hot_tag >= hba->exi_base && (hot_tag < hba->exi_base + hba->exi_count))) { pkg.status = UNF_IO_SUCCESS; diff --git a/drivers/scsi/spfc/hw/spfc_queue.c b/drivers/scsi/spfc/hw/spfc_queue.c index abcf1ff3f49f..fa4295832da7 100644 --- a/drivers/scsi/spfc/hw/spfc_queue.c +++ b/drivers/scsi/spfc/hw/spfc_queue.c @@ -2138,11 +2138,9 @@ static void spfc_free_parent_sq(struct spfc_hba_info *hba, u32 uidelaycnt = 0; struct list_head *list = NULL; struct spfc_suspend_sqe_info *suspend_sqe = NULL; - ulong flag = 0;
sq_info = &parq_info->parent_sq_info;
- spin_lock_irqsave(&parq_info->parent_queue_state_lock, flag); while (!list_empty(&sq_info->suspend_sqe_list)) { list = UNF_OS_LIST_NEXT(&sq_info->suspend_sqe_list); list_del(list); @@ -2156,7 +2154,6 @@ static void spfc_free_parent_sq(struct spfc_hba_info *hba, kfree(suspend_sqe); } } - spin_unlock_irqrestore(&parq_info->parent_queue_state_lock, flag);
/* Free data cos */ spfc_update_cos_rport_cnt(hba, parq_info->queue_data_cos); @@ -4475,9 +4472,7 @@ void spfc_free_parent_queue_info(void *handle, struct spfc_parent_queue_info *pa * with the sq in the queue of the parent */
- spin_unlock_irqrestore(prtq_state_lock, flag); spfc_free_parent_sq(hba, parent_queue_info); - spin_lock_irqsave(prtq_state_lock, flag);
/* The initialization of all queue id is invalid */ parent_queue_info->parent_cmd_scq_info.cqm_queue_id = INVALID_VALUE32; diff --git a/drivers/scsi/spfc/hw/spfc_service.c b/drivers/scsi/spfc/hw/spfc_service.c index e99802df50a2..1da58e3f9fbe 100644 --- a/drivers/scsi/spfc/hw/spfc_service.c +++ b/drivers/scsi/spfc/hw/spfc_service.c @@ -742,7 +742,7 @@ u32 spfc_scq_recv_abts_rsp(struct spfc_hba_info *hba, union spfc_scqe *scqe)
ox_id = (u32)(abts_rsp->wd0.ox_id);
- hot_tag = abts_rsp->wd1.hotpooltag & UNF_ORIGIN_HOTTAG_MASK; + hot_tag = abts_rsp->wd1.hotpooltag; if (unlikely(hot_tag < (u32)hba->exi_base || hot_tag >= (u32)(hba->exi_base + hba->exi_count))) { FC_DRV_PRINT(UNF_LOG_LOGIN_ATT, UNF_ERR, @@ -1210,7 +1210,7 @@ u32 spfc_scq_recv_ls_gs_rsp(struct spfc_hba_info *hba, union spfc_scqe *scqe) spfc_swap_16_in_32((u32 *)ls_gs_rsp_scqe->user_id, SPFC_LS_GS_USERID_LEN);
ox_id = ls_gs_rsp_scqe->wd1.ox_id; - hot_tag = ((u16)(ls_gs_rsp_scqe->wd5.hotpooltag) & UNF_ORIGIN_HOTTAG_MASK) - hba->exi_base; + hot_tag = ((u16)ls_gs_rsp_scqe->wd5.hotpooltag) - hba->exi_base; pkg.frame_head.oxid_rxid = (u32)(ls_gs_rsp_scqe->wd1.rx_id) | ox_id << UNF_SHIFT_16; pkg.private_data[PKG_PRIVATE_XCHG_ALLOC_TIME] = ls_gs_rsp_scqe->magic_num; pkg.private_data[PKG_PRIVATE_XCHG_HOT_POOL_INDEX] = hot_tag; @@ -1317,8 +1317,7 @@ u32 spfc_scq_recv_els_rsp_sts(struct spfc_hba_info *hba, union spfc_scqe *scqe) pkg.private_data[PKG_PRIVATE_XCHG_ALLOC_TIME] = els_rsp_sts_scqe->magic_num; pkg.frame_head.oxid_rxid = rx_id | (u32)(els_rsp_sts_scqe->wd0.ox_id) << UNF_SHIFT_16; - hot_tag = (u32)((els_rsp_sts_scqe->wd1.hotpooltag & UNF_ORIGIN_HOTTAG_MASK) - - hba->exi_base); + hot_tag = (u32)(els_rsp_sts_scqe->wd1.hotpooltag - hba->exi_base);
if (unlikely(SPFC_SCQE_HAS_ERRCODE(scqe))) pkg.status = UNF_IO_FAILED; @@ -1759,7 +1758,7 @@ u32 spfc_scq_recv_marker_sts(struct spfc_hba_info *hba, union spfc_scqe *scqe) tmf_marker_sts_scqe = &scqe->itmf_marker_sts; ox_id = (u32)tmf_marker_sts_scqe->wd1.ox_id; rx_id = (u32)tmf_marker_sts_scqe->wd1.rx_id; - hot_tag = (tmf_marker_sts_scqe->wd4.hotpooltag & UNF_ORIGIN_HOTTAG_MASK) - hba->exi_base; + hot_tag = tmf_marker_sts_scqe->wd4.hotpooltag - hba->exi_base; pkg.frame_head.oxid_rxid = rx_id | (u32)(ox_id) << UNF_SHIFT_16; pkg.private_data[PKG_PRIVATE_XCHG_ALLOC_TIME] = tmf_marker_sts_scqe->magic_num; pkg.frame_head.csctl_sid = tmf_marker_sts_scqe->wd3.sid; @@ -1800,7 +1799,7 @@ u32 spfc_scq_recv_abts_marker_sts(struct spfc_hba_info *hba, union spfc_scqe *sc
ox_id = (u32)abts_marker_sts_scqe->wd1.ox_id; rx_id = (u32)abts_marker_sts_scqe->wd1.rx_id; - hot_tag = (abts_marker_sts_scqe->wd4.hotpooltag & UNF_ORIGIN_HOTTAG_MASK) - hba->exi_base; + hot_tag = abts_marker_sts_scqe->wd4.hotpooltag - hba->exi_base; pkg.frame_head.oxid_rxid = rx_id | (u32)(ox_id) << UNF_SHIFT_16; pkg.frame_head.csctl_sid = abts_marker_sts_scqe->wd3.sid; pkg.frame_head.rctl_did = abts_marker_sts_scqe->wd2.did; @@ -1972,8 +1971,7 @@ u32 spfc_scq_free_xid_sts(struct spfc_hba_info *hba, union spfc_scqe *scqe) rx_id = (u32)free_xid_sts_scqe->wd0.rx_id;
if (free_xid_sts_scqe->wd1.hotpooltag != INVALID_VALUE16) { - hot_tag = (free_xid_sts_scqe->wd1.hotpooltag & - UNF_ORIGIN_HOTTAG_MASK) - hba->exi_base; + hot_tag = free_xid_sts_scqe->wd1.hotpooltag - hba->exi_base; }
FC_DRV_PRINT(UNF_LOG_EQUIP_ATT, UNF_INFO, @@ -1998,7 +1996,7 @@ u32 spfc_scq_exchg_timeout_sts(struct spfc_hba_info *hba, union spfc_scqe *scqe) rx_id = (u32)time_out_scqe->wd0.rx_id;
if (time_out_scqe->wd1.hotpooltag != INVALID_VALUE16) - hot_tag = (time_out_scqe->wd1.hotpooltag & UNF_ORIGIN_HOTTAG_MASK) - hba->exi_base; + hot_tag = time_out_scqe->wd1.hotpooltag - hba->exi_base;
FC_DRV_PRINT(UNF_LOG_EQUIP_ATT, UNF_INFO, "Port(0x%x) recv timer time out sts hotpooltag(0x%x) magicnum(0x%x) ox_id(0x%x) rxid(0x%x) sts(%d)", @@ -2054,7 +2052,7 @@ u32 spfc_scq_rcv_sq_nop_sts(struct spfc_hba_info *hba, union spfc_scqe *scqe) FC_DRV_PRINT(UNF_LOG_LOGIN_ATT, UNF_INFO, "[info]Port(0x%x) rport_index(0x%x) find suspend sqe.", hba->port_cfg.port_id, rport_index); - if (sqn < sqn_max) { + if ((sqn < sqn_max) && (sqn >= sqn_base)) { ret = spfc_send_nop_cmd(hba, parent_sq_info, magic_num, sqn + 1); } else if (sqn == sqn_max) { if (!cancel_delayed_work(&suspend_sqe->timeout_work)) { @@ -2065,6 +2063,10 @@ u32 spfc_scq_rcv_sq_nop_sts(struct spfc_hba_info *hba, union spfc_scqe *scqe) parent_sq_info->need_offloaded = suspend_sqe->old_offload_sts; ret = spfc_pop_suspend_sqe(hba, prt_qinfo, suspend_sqe); kfree(suspend_sqe); + } else { + FC_DRV_PRINT(UNF_LOG_LOGIN_ATT, UNF_WARN, + "[warn]Port(0x%x) rport(0x%x) rcv error sqn(0x%x)", + hba->port_cfg.port_id, rport_index, sqn); } } else { FC_DRV_PRINT(UNF_LOG_LOGIN_ATT, UNF_WARN,
From: Gou Hao gouhao@uniontech.com
uniontech inclusion category: cleanup bugzilla: https://gitee.com/openeuler/kernel/issues/I4X47D?from=project-issue CVE: NA
--------------------------------
The 'left' always is 0. If it is not 0, it will 'goto out;' from the previous if judgment.
Reviewed-by: Yu Kuai yukuai3@huawei.com Acked-by: Xie XiuQi xiexiuqi@huawei.com Signed-off-by: Zheng Zengkai zhengzengkai@huawei.com --- fs/eulerfs/dax.c | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-)
diff --git a/fs/eulerfs/dax.c b/fs/eulerfs/dax.c index 9ec8ad713fd9..131d08f1c6c1 100644 --- a/fs/eulerfs/dax.c +++ b/fs/eulerfs/dax.c @@ -1172,8 +1172,8 @@ static ssize_t do_mapping_read(struct address_space *mapping, goto out; }
- copied += (nr - left); - offset += (nr - left); + copied += nr; + offset += nr; index += offset >> PAGE_SHIFT; offset &= ~PAGE_MASK; } while (copied < len);