bugfix for openEuler 20.03 @20210408
Guoqing Jiang (4):
  md: add new workqueue for delete rdev
  md: don't flush workqueue unconditionally in md_open
  md: flush md_rdev_misc_wq for HOT_ADD_DISK case
  md: fix the checking of wrong work queue

Jan Kara (1):
  ext4: fix timer use-after-free on failed mount

Junxiao Bi (2):
  md: fix deadlock causing by sysfs_notify
  md: get sysfs entry after redundancy attr group create

Lu Jialin (1):
  config: Enable files cgroup on x86

Mauricio Faria de Oliveira (1):
  loop: fix I/O error on fsync() in detached loop devices

Mike Christie (7):
  scsi: libiscsi: Fix iscsi_prep_scsi_cmd_pdu() error handling
  scsi: libiscsi: Drop taskqueuelock
  scsi: libiscsi: Fix iscsi_task use after free()
  scsi: libiscsi: Fix iSCSI host workq destruction
  scsi: libiscsi: Add helper to calculate max SCSI cmds per session
  scsi: iscsi_tcp: Fix shost can_queue initialization
  scsi: libiscsi: Reset max/exp cmdsn during recovery

Oscar Salvador (1):
  mm,hwpoison: return -EBUSY when migration fails

Shijie Luo (1):
  ext4: fix potential error in ext4_do_update_inode

Theodore Ts'o (1):
  ext4: don't leak old mountpoint samples

Vlastimil Babka (1):
  mm, sl[aou]b: guarantee natural alignment for kmalloc(power-of-two)

Wu Bo (1):
  scsi: libiscsi: Fix error count for active session

Yang Yingliang (1):
  mm: slub: avoid wrong report about corrupted freelist

Ye Bin (1):
  ext4: Fix unreport netlink message to userspace when fs abort

Zhao Heming (1):
  md/bitmap: fix memory leak of temporary bitmap

yangerkun (1):
  scsi: libiscsi: convert change of struct iscsi_conn to fix KABI
 arch/x86/configs/hulk_defconfig      |   1 +
 arch/x86/configs/openeuler_defconfig |   2 +-
 drivers/block/loop.c                 |   3 +
 drivers/md/md-bitmap.c               |   5 +-
 drivers/md/md.c                      |  89 +++++--
 drivers/md/md.h                      |   8 +-
 drivers/md/raid10.c                  |   2 +-
 drivers/md/raid5.c                   |   6 +-
 drivers/scsi/bnx2i/bnx2i_iscsi.c     |   2 -
 drivers/scsi/iscsi_tcp.c             |   9 +-
 drivers/scsi/libiscsi.c              | 347 +++++++++++++++++----------
 drivers/scsi/libiscsi_tcp.c          |  86 ++++---
 fs/ext4/file.c                       |   2 +-
 fs/ext4/inode.c                      |   8 +-
 fs/ext4/super.c                      |   5 +-
 include/linux/slab.h                 |   4 +
 include/scsi/libiscsi.h              |   6 +-
 mm/memory-failure.c                  |   8 +-
 mm/slab_common.c                     |  11 +-
 mm/slob.c                            |  41 +++-
 mm/slub.c                            |  12 +-
 21 files changed, 435 insertions(+), 222 deletions(-)
From: Guoqing Jiang guoqing.jiang@cloud.ionos.com
mainline inclusion
from mainline-5.8-rc1
commit cc1ffe61c026e2318024bfa8722e8f689fd2a7f4
category: bugfix
bugzilla: 35792
CVE: NA
---------------------------
Since the purpose of calling flush_workqueue() in new_dev_store() is to ensure that md_delayed_delete() has completed, we should check whether rdev->del_work is pending.

To suppress a lockdep warning we currently check mddev->del_work, while md_delayed_delete() is actually attached to rdev->del_work, so the check does not match the purpose of the flush. Introduce a new workqueue to avoid this awkward situation, and add a new function flush_rdev_wq() that flushes the new workqueue only after checking for pending work.

Like new_dev_store(), the ADD_NEW_DISK ioctl flushes the workqueue for the same purpose while holding bdev->bd_mutex, so apply the same change to the ioctl to avoid a similar locking issue.

And since md_delayed_delete() actually deletes an rdev, rename the function to rdev_delayed_delete().
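As a minimal sketch of the dedicated-workqueue pattern this patch introduces (generic names such as del_wq/obj are illustrative, not the md code; the real helper is flush_rdev_wq() in the diff below):

    /* a dedicated queue, so flushing it never waits on unrelated work */
    static struct workqueue_struct *del_wq;

    del_wq = alloc_workqueue("obj_del", 0, 0);      /* at init time */

    /* queueing side */
    INIT_WORK(&obj->del_work, obj_delayed_delete);
    queue_work(del_wq, &obj->del_work);

    /* flush side: only flush when a delete is actually pending */
    if (work_pending(&obj->del_work))
        flush_workqueue(del_wq);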
Signed-off-by: Guoqing Jiang guoqing.jiang@cloud.ionos.com
Signed-off-by: Song Liu songliubraving@fb.com
Signed-off-by: Yu Kuai yukuai3@huawei.com
Reviewed-by: Yufen Yu yuyufen@huawei.com
Signed-off-by: Yang Yingliang yangyingliang@huawei.com
Signed-off-by: Cheng Jian cj.chengjian@huawei.com
---
 drivers/md/md.c | 37 +++++++++++++++++++++++++++++--------
 1 file changed, 29 insertions(+), 8 deletions(-)
diff --git a/drivers/md/md.c b/drivers/md/md.c
index 3659083bdd04..bb1b9850c8d6 100644
--- a/drivers/md/md.c
+++ b/drivers/md/md.c
@@ -94,6 +94,7 @@ EXPORT_SYMBOL(md_cluster_mod);
 static DECLARE_WAIT_QUEUE_HEAD(resync_wait);
 static struct workqueue_struct *md_wq;
 static struct workqueue_struct *md_misc_wq;
+static struct workqueue_struct *md_rdev_misc_wq;
 
 static int remove_and_add_spares(struct mddev *mddev,
 				 struct md_rdev *this);
@@ -2301,7 +2302,7 @@ static int bind_rdev_to_array(struct md_rdev *rdev, struct mddev *mddev)
 	return err;
 }
 
-static void md_delayed_delete(struct work_struct *ws)
+static void rdev_delayed_delete(struct work_struct *ws)
 {
 	struct md_rdev *rdev = container_of(ws, struct md_rdev, del_work);
 	kobject_del(&rdev->kobj);
@@ -2325,9 +2326,9 @@ static void unbind_rdev_from_array(struct md_rdev *rdev)
 	 * to delay it due to rcu usage.
 	 */
 	synchronize_rcu();
-	INIT_WORK(&rdev->del_work, md_delayed_delete);
+	INIT_WORK(&rdev->del_work, rdev_delayed_delete);
 	kobject_get(&rdev->kobj);
-	queue_work(md_misc_wq, &rdev->del_work);
+	queue_work(md_rdev_misc_wq, &rdev->del_work);
 }
 
 /*
@@ -4353,6 +4354,20 @@ null_show(struct mddev *mddev, char *page)
 	return -EINVAL;
 }
 
+/* need to ensure rdev_delayed_delete() has completed */
+static void flush_rdev_wq(struct mddev *mddev)
+{
+	struct md_rdev *rdev;
+
+	rcu_read_lock();
+	rdev_for_each_rcu(rdev, mddev)
+		if (work_pending(&rdev->del_work)) {
+			flush_workqueue(md_rdev_misc_wq);
+			break;
+		}
+	rcu_read_unlock();
+}
+
 static ssize_t
 new_dev_store(struct mddev *mddev, const char *buf, size_t len)
 {
@@ -4380,8 +4395,7 @@ new_dev_store(struct mddev *mddev, const char *buf, size_t len)
 	    minor != MINOR(dev))
 		return -EOVERFLOW;
 
-	flush_workqueue(md_misc_wq);
-
+	flush_rdev_wq(mddev);
 	err = mddev_lock(mddev);
 	if (err)
 		return err;
@@ -4619,7 +4633,8 @@ action_store(struct mddev *mddev, const char *page, size_t len)
 		clear_bit(MD_RECOVERY_FROZEN, &mddev->recovery);
 		if (test_bit(MD_RECOVERY_RUNNING, &mddev->recovery) &&
 		    mddev_lock(mddev) == 0) {
-			flush_workqueue(md_misc_wq);
+			if (work_pending(&mddev->del_work))
+				flush_workqueue(md_misc_wq);
 			if (mddev->sync_thread) {
 				set_bit(MD_RECOVERY_INTR, &mddev->recovery);
 				md_reap_sync_thread(mddev);
@@ -7237,8 +7252,7 @@ static int md_ioctl(struct block_device *bdev, fmode_t mode,
 	}
 
 	if (cmd == ADD_NEW_DISK)
-		/* need to ensure md_delayed_delete() has completed */
-		flush_workqueue(md_misc_wq);
+		flush_rdev_wq(mddev);
 
 	if (cmd == HOT_REMOVE_DISK)
 		/* need to ensure recovery thread has run */
@@ -9189,6 +9203,10 @@ static int __init md_init(void)
 	if (!md_misc_wq)
 		goto err_misc_wq;
 
+	md_rdev_misc_wq = alloc_workqueue("md_rdev_misc", 0, 0);
+	if (!md_misc_wq)
+		goto err_rdev_misc_wq;
+
 	if ((ret = register_blkdev(MD_MAJOR, "md")) < 0)
 		goto err_md;
 
@@ -9210,6 +9228,8 @@ static int __init md_init(void)
 err_mdp:
 	unregister_blkdev(MD_MAJOR, "md");
 err_md:
+	destroy_workqueue(md_rdev_misc_wq);
+err_rdev_misc_wq:
 	destroy_workqueue(md_misc_wq);
 err_misc_wq:
 	destroy_workqueue(md_wq);
@@ -9472,6 +9492,7 @@ static __exit void md_exit(void)
 		 * destroy_workqueue() below will wait for that to complete.
 		 */
 	}
+	destroy_workqueue(md_rdev_misc_wq);
 	destroy_workqueue(md_misc_wq);
 	destroy_workqueue(md_wq);
 }
From: Guoqing Jiang guoqing.jiang@cloud.ionos.com
mainline inclusion
from mainline-5.8-rc1
commit f6766ff6afff70e2aaf39e1511e16d471de7c3ae
category: bugfix
bugzilla: 35792
CVE: NA
---------------------------
We need to check mddev->del_work before flushing the workqueue, since the purpose of the flush is to ensure the previous md device has disappeared. Otherwise a deadlock like the one below appears when LOCKDEP is enabled, because md_open holds bdev->bd_mutex before flushing the workqueue.
kernel: [ 154.522645] ======================================================
kernel: [ 154.522647] WARNING: possible circular locking dependency detected
kernel: [ 154.522650] 5.6.0-rc7-lp151.27-default #25 Tainted: G O
kernel: [ 154.522651] ------------------------------------------------------
kernel: [ 154.522653] mdadm/2482 is trying to acquire lock:
kernel: [ 154.522655] ffff888078529128 ((wq_completion)md_misc){+.+.}, at: flush_workqueue+0x84/0x4b0
kernel: [ 154.522673]
kernel: [ 154.522673] but task is already holding lock:
kernel: [ 154.522675] ffff88804efa9338 (&bdev->bd_mutex){+.+.}, at: __blkdev_get+0x79/0x590
kernel: [ 154.522691]
kernel: [ 154.522691] which lock already depends on the new lock.
kernel: [ 154.522691]
kernel: [ 154.522694]
kernel: [ 154.522694] the existing dependency chain (in reverse order) is:
kernel: [ 154.522696]
kernel: [ 154.522696] -> #4 (&bdev->bd_mutex){+.+.}:
kernel: [ 154.522704]        __mutex_lock+0x87/0x950
kernel: [ 154.522706]        __blkdev_get+0x79/0x590
kernel: [ 154.522708]        blkdev_get+0x65/0x140
kernel: [ 154.522709]        blkdev_get_by_dev+0x2f/0x40
kernel: [ 154.522716]        lock_rdev+0x3d/0x90 [md_mod]
kernel: [ 154.522719]        md_import_device+0xd6/0x1b0 [md_mod]
kernel: [ 154.522723]        new_dev_store+0x15e/0x210 [md_mod]
kernel: [ 154.522728]        md_attr_store+0x7a/0xc0 [md_mod]
kernel: [ 154.522732]        kernfs_fop_write+0x117/0x1b0
kernel: [ 154.522735]        vfs_write+0xad/0x1a0
kernel: [ 154.522737]        ksys_write+0xa4/0xe0
kernel: [ 154.522745]        do_syscall_64+0x64/0x2b0
kernel: [ 154.522748]        entry_SYSCALL_64_after_hwframe+0x49/0xbe
kernel: [ 154.522749]
kernel: [ 154.522749] -> #3 (&mddev->reconfig_mutex){+.+.}:
kernel: [ 154.522752]        __mutex_lock+0x87/0x950
kernel: [ 154.522756]        new_dev_store+0xc9/0x210 [md_mod]
kernel: [ 154.522759]        md_attr_store+0x7a/0xc0 [md_mod]
kernel: [ 154.522761]        kernfs_fop_write+0x117/0x1b0
kernel: [ 154.522763]        vfs_write+0xad/0x1a0
kernel: [ 154.522765]        ksys_write+0xa4/0xe0
kernel: [ 154.522767]        do_syscall_64+0x64/0x2b0
kernel: [ 154.522769]        entry_SYSCALL_64_after_hwframe+0x49/0xbe
kernel: [ 154.522770]
kernel: [ 154.522770] -> #2 (kn->count#253){++++}:
kernel: [ 154.522775]        __kernfs_remove+0x253/0x2c0
kernel: [ 154.522778]        kernfs_remove+0x1f/0x30
kernel: [ 154.522780]        kobject_del+0x28/0x60
kernel: [ 154.522783]        mddev_delayed_delete+0x24/0x30 [md_mod]
kernel: [ 154.522786]        process_one_work+0x2a7/0x5f0
kernel: [ 154.522788]        worker_thread+0x2d/0x3d0
kernel: [ 154.522793]        kthread+0x117/0x130
kernel: [ 154.522795]        ret_from_fork+0x3a/0x50
kernel: [ 154.522796]
kernel: [ 154.522796] -> #1 ((work_completion)(&mddev->del_work)){+.+.}:
kernel: [ 154.522800]        process_one_work+0x27e/0x5f0
kernel: [ 154.522802]        worker_thread+0x2d/0x3d0
kernel: [ 154.522804]        kthread+0x117/0x130
kernel: [ 154.522806]        ret_from_fork+0x3a/0x50
kernel: [ 154.522807]
kernel: [ 154.522807] -> #0 ((wq_completion)md_misc){+.+.}:
kernel: [ 154.522813]        __lock_acquire+0x1392/0x1690
kernel: [ 154.522816]        lock_acquire+0xb4/0x1a0
kernel: [ 154.522818]        flush_workqueue+0xab/0x4b0
kernel: [ 154.522821]        md_open+0xb6/0xc0 [md_mod]
kernel: [ 154.522823]        __blkdev_get+0xea/0x590
kernel: [ 154.522825]        blkdev_get+0x65/0x140
kernel: [ 154.522828]        do_dentry_open+0x1d1/0x380
kernel: [ 154.522831]        path_openat+0x567/0xcc0
kernel: [ 154.522834]        do_filp_open+0x9b/0x110
kernel: [ 154.522836]        do_sys_openat2+0x201/0x2a0
kernel: [ 154.522838]        do_sys_open+0x57/0x80
kernel: [ 154.522840]        do_syscall_64+0x64/0x2b0
kernel: [ 154.522842]        entry_SYSCALL_64_after_hwframe+0x49/0xbe
kernel: [ 154.522844]
kernel: [ 154.522844] other info that might help us debug this:
kernel: [ 154.522844]
kernel: [ 154.522846] Chain exists of:
kernel: [ 154.522846]   (wq_completion)md_misc --> &mddev->reconfig_mutex --> &bdev->bd_mutex
kernel: [ 154.522846]
kernel: [ 154.522850] Possible unsafe locking scenario:
kernel: [ 154.522850]
kernel: [ 154.522852]        CPU0                    CPU1
kernel: [ 154.522853]        ----                    ----
kernel: [ 154.522854]   lock(&bdev->bd_mutex);
kernel: [ 154.522856]                               lock(&mddev->reconfig_mutex);
kernel: [ 154.522858]                               lock(&bdev->bd_mutex);
kernel: [ 154.522860]   lock((wq_completion)md_misc);
kernel: [ 154.522861]
kernel: [ 154.522861] *** DEADLOCK ***
kernel: [ 154.522861]
kernel: [ 154.522864] 1 lock held by mdadm/2482:
kernel: [ 154.522865] #0: ffff88804efa9338 (&bdev->bd_mutex){+.+.}, at: __blkdev_get+0x79/0x590
kernel: [ 154.522868]
kernel: [ 154.522868] stack backtrace:
kernel: [ 154.522873] CPU: 1 PID: 2482 Comm: mdadm Tainted: G O 5.6.0-rc7-lp151.27-default #25
kernel: [ 154.522875] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-1ubuntu1 04/01/2014
kernel: [ 154.522878] Call Trace:
kernel: [ 154.522881]  dump_stack+0x8f/0xcb
kernel: [ 154.522884]  check_noncircular+0x194/0x1b0
kernel: [ 154.522888]  ? __lock_acquire+0x1392/0x1690
kernel: [ 154.522890]  __lock_acquire+0x1392/0x1690
kernel: [ 154.522893]  lock_acquire+0xb4/0x1a0
kernel: [ 154.522895]  ? flush_workqueue+0x84/0x4b0
kernel: [ 154.522898]  flush_workqueue+0xab/0x4b0
kernel: [ 154.522900]  ? flush_workqueue+0x84/0x4b0
kernel: [ 154.522905]  ? md_open+0xb6/0xc0 [md_mod]
kernel: [ 154.522908]  md_open+0xb6/0xc0 [md_mod]
kernel: [ 154.522910]  __blkdev_get+0xea/0x590
kernel: [ 154.522912]  ? bd_acquire+0xc0/0xc0
kernel: [ 154.522914]  blkdev_get+0x65/0x140
kernel: [ 154.522916]  ? bd_acquire+0xc0/0xc0
kernel: [ 154.522918]  do_dentry_open+0x1d1/0x380
kernel: [ 154.522921]  path_openat+0x567/0xcc0
kernel: [ 154.522923]  ? __lock_acquire+0x380/0x1690
kernel: [ 154.522926]  do_filp_open+0x9b/0x110
kernel: [ 154.522929]  ? __alloc_fd+0xe5/0x1f0
kernel: [ 154.522935]  ? kmem_cache_alloc+0x28c/0x630
kernel: [ 154.522939]  ? do_sys_openat2+0x201/0x2a0
kernel: [ 154.522941]  do_sys_openat2+0x201/0x2a0
kernel: [ 154.522944]  do_sys_open+0x57/0x80
kernel: [ 154.522946]  do_syscall_64+0x64/0x2b0
kernel: [ 154.522948]  entry_SYSCALL_64_after_hwframe+0x49/0xbe
kernel: [ 154.522951] RIP: 0033:0x7f98d279d9ae
md_alloc also flushes the same workqueue, but the situation is different there: none of the paths that call md_alloc hold bdev->bd_mutex, and the flush is necessary to avoid a race condition, so leave it as it is.
Signed-off-by: Guoqing Jiang guoqing.jiang@cloud.ionos.com
Signed-off-by: Song Liu songliubraving@fb.com
Signed-off-by: Yu Kuai yukuai3@huawei.com
Reviewed-by: Yufen Yu yuyufen@huawei.com
Signed-off-by: Yang Yingliang yangyingliang@huawei.com
Signed-off-by: Cheng Jian cj.chengjian@huawei.com
---
 drivers/md/md.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)
diff --git a/drivers/md/md.c b/drivers/md/md.c
index bb1b9850c8d6..f7c4d8d62335 100644
--- a/drivers/md/md.c
+++ b/drivers/md/md.c
@@ -7507,7 +7507,8 @@ static int md_open(struct block_device *bdev, fmode_t mode)
 		 */
 		mddev_put(mddev);
 		/* Wait until bdev->bd_disk is definitely gone */
-		flush_workqueue(md_misc_wq);
+		if (work_pending(&mddev->del_work))
+			flush_workqueue(md_misc_wq);
 		/* Then retry the open from the top */
 		return -ERESTARTSYS;
 	}
From: Guoqing Jiang guoqing.jiang@cloud.ionos.com
mainline inclusion
from mainline-5.8-rc1
commit 78b990cf2822d1150a419c13be48e74aefa83f27
category: bugfix
bugzilla: 35792
CVE: NA
---------------------------
Since rdev->kobj is removed asynchronously, it is possible that rdev->kobj still exists when we try to add the rdev again after it has been removed. The path md_ioctl (HOT_ADD_DISK) -> hot_add_disk -> bind_rdev_to_array missed this case.
Signed-off-by: Guoqing Jiang guoqing.jiang@cloud.ionos.com
Signed-off-by: Song Liu songliubraving@fb.com
Signed-off-by: Yu Kuai yukuai3@huawei.com
Reviewed-by: Yufen Yu yuyufen@huawei.com
Signed-off-by: Yang Yingliang yangyingliang@huawei.com
Signed-off-by: Cheng Jian cj.chengjian@huawei.com
---
 drivers/md/md.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/drivers/md/md.c b/drivers/md/md.c
index f7c4d8d62335..a01d86a13e1a 100644
--- a/drivers/md/md.c
+++ b/drivers/md/md.c
@@ -7251,7 +7251,7 @@ static int md_ioctl(struct block_device *bdev, fmode_t mode,
 
 	}
 
-	if (cmd == ADD_NEW_DISK)
+	if (cmd == ADD_NEW_DISK || cmd == HOT_ADD_DISK)
 		flush_rdev_wq(mddev);
 
 	if (cmd == HOT_REMOVE_DISK)
From: Guoqing Jiang guoqing.jiang@cloud.ionos.com
mainline inclusion
from mainline-5.10-rc1
commit cf0b9b4821a2955f8a23813ef8f422208ced9bd7
category: bugfix
bugzilla: 35792
CVE: NA
---------------------------
It should check md_rdev_misc_wq instead of md_misc_wq.
Fixes: cc1ffe61c026 ("md: add new workqueue for delete rdev")
Cc: stable@vger.kernel.org # v5.8+
Signed-off-by: Guoqing Jiang guoqing.jiang@cloud.ionos.com
Signed-off-by: Song Liu songliubraving@fb.com
Signed-off-by: Yu Kuai yukuai3@huawei.com
Reviewed-by: Yufen Yu yuyufen@huawei.com
Signed-off-by: Yang Yingliang yangyingliang@huawei.com
Signed-off-by: Cheng Jian cj.chengjian@huawei.com
---
 drivers/md/md.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/drivers/md/md.c b/drivers/md/md.c
index a01d86a13e1a..1c816af47549 100644
--- a/drivers/md/md.c
+++ b/drivers/md/md.c
@@ -9205,7 +9205,7 @@ static int __init md_init(void)
 		goto err_misc_wq;
 
 	md_rdev_misc_wq = alloc_workqueue("md_rdev_misc", 0, 0);
-	if (!md_misc_wq)
+	if (!md_rdev_misc_wq)
 		goto err_rdev_misc_wq;
 
 	if ((ret = register_blkdev(MD_MAJOR, "md")) < 0)
From: Junxiao Bi junxiao.bi@oracle.com
mainline inclusion
from mainline-5.9-rc1
commit e1a86dbbbd6a77f73c3d099030495fa31f181e2f
category: bugfix
bugzilla: 41594
CVE: NA
---------------------------
The following deadlock was captured. The first process is holding 'kernfs_mutex' and hung by I/O. The I/O was staged in 'r1conf.pending_bio_list' of the raid1 device; that pending bio list would be flushed by the second process, 'md127_raid1', but it was hung on 'kernfs_mutex'. Using sysfs_notify_dirent_safe() to replace sysfs_notify() fixes it. There were other sysfs_notify() calls invoked from the I/O path; remove all of them (a sketch of the pattern follows the two traces below).
PID: 40430  TASK: ffff8ee9c8c65c40  CPU: 29  COMMAND: "probe_file"
 #0 [ffffb87c4df37260] __schedule at ffffffff9a8678ec
 #1 [ffffb87c4df372f8] schedule at ffffffff9a867f06
 #2 [ffffb87c4df37310] io_schedule at ffffffff9a0c73e6
 #3 [ffffb87c4df37328] __dta___xfs_iunpin_wait_3443 at ffffffffc03a4057 [xfs]
 #4 [ffffb87c4df373a0] xfs_iunpin_wait at ffffffffc03a6c79 [xfs]
 #5 [ffffb87c4df373b0] __dta_xfs_reclaim_inode_3357 at ffffffffc039a46c [xfs]
 #6 [ffffb87c4df37400] xfs_reclaim_inodes_ag at ffffffffc039a8b6 [xfs]
 #7 [ffffb87c4df37590] xfs_reclaim_inodes_nr at ffffffffc039bb33 [xfs]
 #8 [ffffb87c4df375b0] xfs_fs_free_cached_objects at ffffffffc03af0e9 [xfs]
 #9 [ffffb87c4df375c0] super_cache_scan at ffffffff9a287ec7
#10 [ffffb87c4df37618] shrink_slab at ffffffff9a1efd93
#11 [ffffb87c4df37700] shrink_node at ffffffff9a1f5968
#12 [ffffb87c4df37788] do_try_to_free_pages at ffffffff9a1f5ea2
#13 [ffffb87c4df377f0] try_to_free_mem_cgroup_pages at ffffffff9a1f6445
#14 [ffffb87c4df37880] try_charge at ffffffff9a26cc5f
#15 [ffffb87c4df37920] memcg_kmem_charge_memcg at ffffffff9a270f6a
#16 [ffffb87c4df37958] new_slab at ffffffff9a251430
#17 [ffffb87c4df379c0] ___slab_alloc at ffffffff9a251c85
#18 [ffffb87c4df37a80] __slab_alloc at ffffffff9a25635d
#19 [ffffb87c4df37ac0] kmem_cache_alloc at ffffffff9a251f89
#20 [ffffb87c4df37b00] alloc_inode at ffffffff9a2a2b10
#21 [ffffb87c4df37b20] iget_locked at ffffffff9a2a4854
#22 [ffffb87c4df37b60] kernfs_get_inode at ffffffff9a311377
#23 [ffffb87c4df37b80] kernfs_iop_lookup at ffffffff9a311e2b
#24 [ffffb87c4df37ba8] lookup_slow at ffffffff9a290118
#25 [ffffb87c4df37c10] walk_component at ffffffff9a291e83
#26 [ffffb87c4df37c78] path_lookupat at ffffffff9a293619
#27 [ffffb87c4df37cd8] filename_lookup at ffffffff9a2953af
#28 [ffffb87c4df37de8] user_path_at_empty at ffffffff9a295566
#29 [ffffb87c4df37e10] vfs_statx at ffffffff9a289787
#30 [ffffb87c4df37e70] SYSC_newlstat at ffffffff9a289d5d
#31 [ffffb87c4df37f18] sys_newlstat at ffffffff9a28a60e
#32 [ffffb87c4df37f28] do_syscall_64 at ffffffff9a003949
#33 [ffffb87c4df37f50] entry_SYSCALL_64_after_hwframe at ffffffff9aa001ad
    RIP: 00007f617a5f2905  RSP: 00007f607334f838  RFLAGS: 00000246
    RAX: ffffffffffffffda  RBX: 00007f6064044b20  RCX: 00007f617a5f2905
    RDX: 00007f6064044b20  RSI: 00007f6064044b20  RDI: 00007f6064005890
    RBP: 00007f6064044aa0  R8:  0000000000000030  R9:  000000000000011c
    R10: 0000000000000013  R11: 0000000000000246  R12: 00007f606417e6d0
    R13: 00007f6064044aa0  R14: 00007f6064044b10  R15: 00000000ffffffff
    ORIG_RAX: 0000000000000006  CS: 0033  SS: 002b
PID: 927    TASK: ffff8f15ac5dbd80  CPU: 42  COMMAND: "md127_raid1"
 #0 [ffffb87c4df07b28] __schedule at ffffffff9a8678ec
 #1 [ffffb87c4df07bc0] schedule at ffffffff9a867f06
 #2 [ffffb87c4df07bd8] schedule_preempt_disabled at ffffffff9a86825e
 #3 [ffffb87c4df07be8] __mutex_lock at ffffffff9a869bcc
 #4 [ffffb87c4df07ca0] __mutex_lock_slowpath at ffffffff9a86a013
 #5 [ffffb87c4df07cb0] mutex_lock at ffffffff9a86a04f
 #6 [ffffb87c4df07cc8] kernfs_find_and_get_ns at ffffffff9a311d83
 #7 [ffffb87c4df07cf0] sysfs_notify at ffffffff9a314b3a
 #8 [ffffb87c4df07d18] md_update_sb at ffffffff9a688696
 #9 [ffffb87c4df07d98] md_update_sb at ffffffff9a6886d5
#10 [ffffb87c4df07da8] md_check_recovery at ffffffff9a68ad9c
#11 [ffffb87c4df07dd0] raid1d at ffffffffc01f0375 [raid1]
#12 [ffffb87c4df07ea0] md_thread at ffffffff9a680348
#13 [ffffb87c4df07f08] kthread at ffffffff9a0b8005
#14 [ffffb87c4df07f50] ret_from_fork at ffffffff9aa00344
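A minimal sketch of the replacement pattern referenced above (illustrative; "foo" is a placeholder attribute name, and sysfs_get_dirent_safe()/sysfs_notify_dirent_safe() are md's NULL-safe wrappers around the kernfs helpers): cache the kernfs node once at setup time, then notify through the cached handle from the I/O path, which avoids the kernfs_mutex lookup that sysfs_notify() performs:

    /* setup: safe to take kernfs locks here */
    mddev->sysfs_foo = sysfs_get_dirent_safe(mddev->kobj.sd, "foo");

    /* I/O path: no name lookup, no kernfs_mutex, NULL-safe */
    sysfs_notify_dirent_safe(mddev->sysfs_foo);

    /* teardown */
    sysfs_put(mddev->sysfs_foo);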
Signed-off-by: Junxiao Bi junxiao.bi@oracle.com
Signed-off-by: Song Liu songliubraving@fb.com
Signed-off-by: Yu Kuai yukuai3@huawei.com
Reviewed-by: Yufen Yu yuyufen@huawei.com
Signed-off-by: Yang Yingliang yangyingliang@huawei.com
Signed-off-by: Cheng Jian cj.chengjian@huawei.com
---
 drivers/md/md-bitmap.c |  2 +-
 drivers/md/md.c        | 44 ++++++++++++++++++++++++++++--------------
 drivers/md/md.h        |  8 +++++++-
 drivers/md/raid10.c    |  2 +-
 drivers/md/raid5.c     |  6 +++---
 5 files changed, 42 insertions(+), 20 deletions(-)
diff --git a/drivers/md/md-bitmap.c b/drivers/md/md-bitmap.c
index 86694b27507e..33e0dc8084cf 100644
--- a/drivers/md/md-bitmap.c
+++ b/drivers/md/md-bitmap.c
@@ -1635,7 +1635,7 @@ void md_bitmap_cond_end_sync(struct bitmap *bitmap, sector_t sector, bool force)
 		s += blocks;
 	}
 	bitmap->last_end_sync = jiffies;
-	sysfs_notify(&bitmap->mddev->kobj, NULL, "sync_completed");
+	sysfs_notify_dirent_safe(bitmap->mddev->sysfs_completed);
 }
 EXPORT_SYMBOL(md_bitmap_cond_end_sync);
 
diff --git a/drivers/md/md.c b/drivers/md/md.c
index 1c816af47549..ebbd403313e5 100644
--- a/drivers/md/md.c
+++ b/drivers/md/md.c
@@ -2287,6 +2287,10 @@ static int bind_rdev_to_array(struct md_rdev *rdev, struct mddev *mddev)
 	if (sysfs_create_link(&rdev->kobj, ko, "block"))
 		/* failure here is OK */;
 	rdev->sysfs_state = sysfs_get_dirent_safe(rdev->kobj.sd, "state");
+	rdev->sysfs_unack_badblocks =
+		sysfs_get_dirent_safe(rdev->kobj.sd, "unacknowledged_bad_blocks");
+	rdev->sysfs_badblocks =
+		sysfs_get_dirent_safe(rdev->kobj.sd, "bad_blocks");
 
 	list_add_rcu(&rdev->same_set, &mddev->disks);
 	bd_link_disk_holder(rdev->bdev, mddev->gendisk);
@@ -2319,7 +2323,11 @@ static void unbind_rdev_from_array(struct md_rdev *rdev)
 	rdev->mddev = NULL;
 	sysfs_remove_link(&rdev->kobj, "block");
 	sysfs_put(rdev->sysfs_state);
+	sysfs_put(rdev->sysfs_unack_badblocks);
+	sysfs_put(rdev->sysfs_badblocks);
 	rdev->sysfs_state = NULL;
+	rdev->sysfs_unack_badblocks = NULL;
+	rdev->sysfs_badblocks = NULL;
 	rdev->badblocks.count = 0;
 	/* We need to delay this, otherwise we can deadlock when
 	 * writing to 'remove' to "dev/state". We also need
@@ -2664,7 +2672,7 @@ void md_update_sb(struct mddev *mddev, int force_change)
 		goto repeat;
 	wake_up(&mddev->sb_wait);
 	if (test_bit(MD_RECOVERY_RUNNING, &mddev->recovery))
-		sysfs_notify(&mddev->kobj, NULL, "sync_completed");
+		sysfs_notify_dirent_safe(mddev->sysfs_completed);
 
 	rdev_for_each(rdev, mddev) {
 		if (test_and_clear_bit(FaultRecorded, &rdev->flags))
@@ -3920,7 +3928,7 @@ level_store(struct mddev *mddev, const char *buf, size_t len)
 	mddev_resume(mddev);
 	if (!mddev->thread)
 		md_update_sb(mddev, 1);
-	sysfs_notify(&mddev->kobj, NULL, "level");
+	sysfs_notify_dirent_safe(mddev->sysfs_level);
 	md_new_event(mddev);
 	rv = len;
out_unlock:
@@ -4664,7 +4672,7 @@ action_store(struct mddev *mddev, const char *page, size_t len)
 		}
 		if (err)
 			return err;
-		sysfs_notify(&mddev->kobj, NULL, "degraded");
+		sysfs_notify_dirent_safe(mddev->sysfs_degraded);
 	} else {
 		if (cmd_match(page, "check"))
 			set_bit(MD_RECOVERY_CHECK, &mddev->recovery);
@@ -5289,6 +5297,13 @@ static void md_free(struct kobject *ko)
 
 	if (mddev->sysfs_state)
 		sysfs_put(mddev->sysfs_state);
+	if (mddev->sysfs_completed)
+		sysfs_put(mddev->sysfs_completed);
+	if (mddev->sysfs_degraded)
+		sysfs_put(mddev->sysfs_degraded);
+	if (mddev->sysfs_level)
+		sysfs_put(mddev->sysfs_level);
+
 	if (mddev->gendisk)
 		del_gendisk(mddev->gendisk);
@@ -5451,6 +5466,9 @@ static int md_alloc(dev_t dev, char *name)
 	if (!error && mddev->kobj.sd) {
 		kobject_uevent(&mddev->kobj, KOBJ_ADD);
 		mddev->sysfs_state = sysfs_get_dirent_safe(mddev->kobj.sd, "array_state");
+		mddev->sysfs_completed = sysfs_get_dirent_safe(mddev->kobj.sd, "sync_completed");
+		mddev->sysfs_degraded = sysfs_get_dirent_safe(mddev->kobj.sd, "degraded");
+		mddev->sysfs_level = sysfs_get_dirent_safe(mddev->kobj.sd, "level");
 	}
 	mddev_put(mddev);
 	return error;
@@ -5786,7 +5804,7 @@ static int do_md_run(struct mddev *mddev)
 	kobject_uevent(&disk_to_dev(mddev->gendisk)->kobj, KOBJ_CHANGE);
 	sysfs_notify_dirent_safe(mddev->sysfs_state);
 	sysfs_notify_dirent_safe(mddev->sysfs_action);
-	sysfs_notify(&mddev->kobj, NULL, "degraded");
+	sysfs_notify_dirent_safe(mddev->sysfs_degraded);
out:
 	clear_bit(MD_NOT_READY, &mddev->flags);
 	return err;
@@ -8498,7 +8516,7 @@ void md_do_sync(struct md_thread *thread)
 	} else
 		mddev->curr_resync = 3; /* no longer delayed */
 	mddev->curr_resync_completed = j;
-	sysfs_notify(&mddev->kobj, NULL, "sync_completed");
+	sysfs_notify_dirent_safe(mddev->sysfs_completed);
 	md_new_event(mddev);
 	update_time = jiffies;
 
@@ -8526,7 +8544,7 @@ void md_do_sync(struct md_thread *thread)
 				mddev->recovery_cp = j;
 				update_time = jiffies;
 				set_bit(MD_SB_CHANGE_CLEAN, &mddev->sb_flags);
-				sysfs_notify(&mddev->kobj, NULL, "sync_completed");
+				sysfs_notify_dirent_safe(mddev->sysfs_completed);
 			}
 
 			while (j >= mddev->resync_max &&
@@ -8633,7 +8651,7 @@ void md_do_sync(struct md_thread *thread)
 	    !test_bit(MD_RECOVERY_INTR, &mddev->recovery) &&
 	    mddev->curr_resync > 3) {
 		mddev->curr_resync_completed = mddev->curr_resync;
-		sysfs_notify(&mddev->kobj, NULL, "sync_completed");
+		sysfs_notify_dirent_safe(mddev->sysfs_completed);
 	}
 	mddev->pers->sync_request(mddev, max_sectors, &skipped);
 
@@ -8761,7 +8779,7 @@ static int remove_and_add_spares(struct mddev *mddev,
 	}
 
 	if (removed && mddev->kobj.sd)
-		sysfs_notify(&mddev->kobj, NULL, "degraded");
+		sysfs_notify_dirent_safe(mddev->sysfs_degraded);
 
 	if (this && removed)
 		goto no_add;
@@ -9043,8 +9061,7 @@ void md_reap_sync_thread(struct mddev *mddev)
 			/* success...*/
 			/* activate any spares */
 			if (mddev->pers->spare_active(mddev)) {
-				sysfs_notify(&mddev->kobj, NULL,
-					     "degraded");
+				sysfs_notify_dirent_safe(mddev->sysfs_degraded);
 				set_bit(MD_SB_CHANGE_DEVS, &mddev->sb_flags);
 			}
 		}
@@ -9123,8 +9140,7 @@ int rdev_set_badblocks(struct md_rdev *rdev, sector_t s, int sectors,
 	if (rv == 0) {
 		/* Make sure they get written out promptly */
 		if (test_bit(ExternalBbl, &rdev->flags))
-			sysfs_notify(&rdev->kobj, NULL,
-				     "unacknowledged_bad_blocks");
+			sysfs_notify_dirent_safe(rdev->sysfs_unack_badblocks);
 		sysfs_notify_dirent_safe(rdev->sysfs_state);
 		set_mask_bits(&mddev->sb_flags, 0,
			      BIT(MD_SB_CHANGE_CLEAN) | BIT(MD_SB_CHANGE_PENDING));
@@ -9145,7 +9161,7 @@ int rdev_clear_badblocks(struct md_rdev *rdev, sector_t s, int sectors,
 		s += rdev->data_offset;
 	rv = badblocks_clear(&rdev->badblocks, s, sectors);
 	if ((rv == 0) && test_bit(ExternalBbl, &rdev->flags))
-		sysfs_notify(&rdev->kobj, NULL, "bad_blocks");
+		sysfs_notify_dirent_safe(rdev->sysfs_badblocks);
 	return rv;
 }
 EXPORT_SYMBOL_GPL(rdev_clear_badblocks);
@@ -9351,7 +9367,7 @@ static int read_rdev(struct mddev *mddev, struct md_rdev *rdev)
 	if (rdev->recovery_offset == MaxSector &&
 	    !test_bit(In_sync, &rdev->flags) &&
 	    mddev->pers->spare_active(mddev))
-		sysfs_notify(&mddev->kobj, NULL, "degraded");
+		sysfs_notify_dirent_safe(mddev->sysfs_degraded);
 
 	put_page(swapout);
 	return 0;
diff --git a/drivers/md/md.h b/drivers/md/md.h
index f1b1881d804e..ee3c9768ff9e 100644
--- a/drivers/md/md.h
+++ b/drivers/md/md.h
@@ -120,7 +120,10 @@ struct md_rdev {
 
 	struct kernfs_node *sysfs_state;	/* handle for 'state'
						 * sysfs entry */
-
+	/* handle for 'unacknowledged_bad_blocks' sysfs dentry */
+	struct kernfs_node *sysfs_unack_badblocks;
+	/* handle for 'bad_blocks' sysfs dentry */
+	struct kernfs_node *sysfs_badblocks;
 	struct badblocks badblocks;
 
 	struct {
@@ -402,6 +405,9 @@ struct mddev {
							 * file in sysfs.
							 */
 	struct kernfs_node *sysfs_action;	/* handle for 'sync_action' */
+	struct kernfs_node *sysfs_completed;	/*handle for 'sync_completed' */
+	struct kernfs_node *sysfs_degraded;	/*handle for 'degraded' */
+	struct kernfs_node *sysfs_level;	/*handle for 'level' */
 
 	struct work_struct del_work;	/* used for delayed sysfs removal */
 
diff --git a/drivers/md/raid10.c b/drivers/md/raid10.c
index 7c091b640a8a..5049c4d829e5 100644
--- a/drivers/md/raid10.c
+++ b/drivers/md/raid10.c
@@ -4457,7 +4457,7 @@ static sector_t reshape_request(struct mddev *mddev, sector_t sector_nr,
 		sector_nr = conf->reshape_progress;
 	if (sector_nr) {
 		mddev->curr_resync_completed = sector_nr;
-		sysfs_notify(&mddev->kobj, NULL, "sync_completed");
+		sysfs_notify_dirent_safe(mddev->sysfs_completed);
 		*skipped = 1;
 		return sector_nr;
 	}
diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
index 01021382131b..24ef07d528ee 100644
--- a/drivers/md/raid5.c
+++ b/drivers/md/raid5.c
@@ -5790,7 +5790,7 @@ static sector_t reshape_request(struct mddev *mddev, sector_t sector_nr, int *sk
 	sector_div(sector_nr, new_data_disks);
 	if (sector_nr) {
 		mddev->curr_resync_completed = sector_nr;
-		sysfs_notify(&mddev->kobj, NULL, "sync_completed");
+		sysfs_notify_dirent_safe(mddev->sysfs_completed);
 		*skipped = 1;
 		retn = sector_nr;
 		goto finish;
@@ -5904,7 +5904,7 @@ static sector_t reshape_request(struct mddev *mddev, sector_t sector_nr, int *sk
 	conf->reshape_safe = mddev->reshape_position;
 	spin_unlock_irq(&conf->device_lock);
 	wake_up(&conf->wait_for_overlap);
-	sysfs_notify(&mddev->kobj, NULL, "sync_completed");
+	sysfs_notify_dirent_safe(mddev->sysfs_completed);
 }
 
 	INIT_LIST_HEAD(&stripes);
@@ -6011,7 +6011,7 @@ static sector_t reshape_request(struct mddev *mddev, sector_t sector_nr, int *sk
 	conf->reshape_safe = mddev->reshape_position;
 	spin_unlock_irq(&conf->device_lock);
 	wake_up(&conf->wait_for_overlap);
-	sysfs_notify(&mddev->kobj, NULL, "sync_completed");
+	sysfs_notify_dirent_safe(mddev->sysfs_completed);
 }
ret:
 	return retn;
From: Junxiao Bi junxiao.bi@oracle.com
mainline inclusion
from mainline-5.9-rc1
commit e8efa9b88e3c20a20ff34bb33e2e94bf6896016a
category: bugfix
bugzilla: 41594
CVE: NA
---------------------------
"sync_completed" and "degraded" belongs to redundancy attr group, it was not exist yet when md device was created.
Reported-by: kernel test robot rong.a.chen@intel.com
Fixes: e1a86dbbbd6a ("md: fix deadlock causing by sysfs_notify")
Signed-off-by: Junxiao Bi junxiao.bi@oracle.com
Signed-off-by: Song Liu songliubraving@fb.com
Signed-off-by: Yu Kuai yukuai3@huawei.com
Reviewed-by: Yufen Yu yuyufen@huawei.com
Signed-off-by: Yang Yingliang yangyingliang@huawei.com
Signed-off-by: Cheng Jian cj.chengjian@huawei.com
---
 drivers/md/md.c | 17 ++++++++++-------
 1 file changed, 10 insertions(+), 7 deletions(-)
diff --git a/drivers/md/md.c b/drivers/md/md.c
index ebbd403313e5..013896b1d1ae 100644
--- a/drivers/md/md.c
+++ b/drivers/md/md.c
@@ -688,7 +688,13 @@ void mddev_unlock(struct mddev *mddev)
 			sysfs_remove_group(&mddev->kobj, &md_redundancy_group);
 			if (mddev->sysfs_action)
 				sysfs_put(mddev->sysfs_action);
+			if (mddev->sysfs_completed)
+				sysfs_put(mddev->sysfs_completed);
+			if (mddev->sysfs_degraded)
+				sysfs_put(mddev->sysfs_degraded);
 			mddev->sysfs_action = NULL;
+			mddev->sysfs_completed = NULL;
+			mddev->sysfs_degraded = NULL;
 		}
 	}
 	mddev->sysfs_active = 0;
@@ -3881,6 +3887,8 @@ level_store(struct mddev *mddev, const char *buf, size_t len)
 			pr_warn("md: cannot register extra attributes for %s\n",
 				mdname(mddev));
 		mddev->sysfs_action = sysfs_get_dirent(mddev->kobj.sd, "sync_action");
+		mddev->sysfs_completed = sysfs_get_dirent_safe(mddev->kobj.sd, "sync_completed");
+		mddev->sysfs_degraded = sysfs_get_dirent_safe(mddev->kobj.sd, "degraded");
 	}
 	if (oldpers->sync_request != NULL &&
 	    pers->sync_request == NULL) {
@@ -5297,14 +5305,9 @@ static void md_free(struct kobject *ko)
 
 	if (mddev->sysfs_state)
 		sysfs_put(mddev->sysfs_state);
-	if (mddev->sysfs_completed)
-		sysfs_put(mddev->sysfs_completed);
-	if (mddev->sysfs_degraded)
-		sysfs_put(mddev->sysfs_degraded);
 	if (mddev->sysfs_level)
 		sysfs_put(mddev->sysfs_level);
-
 	if (mddev->gendisk)
 		del_gendisk(mddev->gendisk);
 	if (mddev->queue)
@@ -5466,8 +5469,6 @@ static int md_alloc(dev_t dev, char *name)
 	if (!error && mddev->kobj.sd) {
 		kobject_uevent(&mddev->kobj, KOBJ_ADD);
 		mddev->sysfs_state = sysfs_get_dirent_safe(mddev->kobj.sd, "array_state");
-		mddev->sysfs_completed = sysfs_get_dirent_safe(mddev->kobj.sd, "sync_completed");
-		mddev->sysfs_degraded = sysfs_get_dirent_safe(mddev->kobj.sd, "degraded");
 		mddev->sysfs_level = sysfs_get_dirent_safe(mddev->kobj.sd, "level");
 	}
 	mddev_put(mddev);
@@ -5734,6 +5735,8 @@ int md_run(struct mddev *mddev)
 			pr_warn("md: cannot register extra attributes for %s\n",
 				mdname(mddev));
 		mddev->sysfs_action = sysfs_get_dirent_safe(mddev->kobj.sd, "sync_action");
+		mddev->sysfs_completed = sysfs_get_dirent_safe(mddev->kobj.sd, "sync_completed");
+		mddev->sysfs_degraded = sysfs_get_dirent_safe(mddev->kobj.sd, "degraded");
 	} else if (mddev->ro == 2) /* auto-readonly not meaningful */
 		mddev->ro = 0;
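For reference, a condensed sketch of the corrected ordering (derived from the hunks above; error handling abbreviated): sysfs_get_dirent_safe() can only find an attribute after it exists, so the lookups must follow the group creation rather than run at md_alloc() time:

    /* correct: look the nodes up right after the group is created */
    if (sysfs_create_group(&mddev->kobj, &md_redundancy_group))
        pr_warn("md: cannot register extra attributes for %s\n",
                mdname(mddev));
    mddev->sysfs_action = sysfs_get_dirent_safe(mddev->kobj.sd, "sync_action");
    mddev->sysfs_completed = sysfs_get_dirent_safe(mddev->kobj.sd, "sync_completed");
    mddev->sysfs_degraded = sysfs_get_dirent_safe(mddev->kobj.sd, "degraded");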
From: Zhao Heming heming.zhao@suse.com
mainline inclusion
from mainline-5.10-rc1
commit 1383b347a8ae4a69c04ae3746e6cb5c8d38e2585
category: bugfix
bugzilla: 41594
CVE: NA
---------------------------
Callers of get_bitmap_from_slot() are responsible for freeing the bitmap.
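A sketch of the expected calling convention after this change (illustrative caller, not from the patch):

    struct bitmap *bitmap;

    bitmap = get_bitmap_from_slot(mddev, slot);
    if (IS_ERR(bitmap))
        return PTR_ERR(bitmap);

    /* ... inspect or copy the slot's bitmap state ... */

    md_bitmap_free(bitmap);    /* caller owns the bitmap and must free it */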
Suggested-by: Guoqing Jiang guoqing.jiang@cloud.ionos.com
Signed-off-by: Zhao Heming heming.zhao@suse.com
Signed-off-by: Song Liu songliubraving@fb.com

Conflict:
  drivers/md/md-cluster.c: resize_bitmap() does not exist

Signed-off-by: Yu Kuai yukuai3@huawei.com
Reviewed-by: Yufen Yu yuyufen@huawei.com
Signed-off-by: Yang Yingliang yangyingliang@huawei.com
Signed-off-by: Cheng Jian cj.chengjian@huawei.com
---
 drivers/md/md-bitmap.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)
diff --git a/drivers/md/md-bitmap.c b/drivers/md/md-bitmap.c
index 33e0dc8084cf..81631e17f4a0 100644
--- a/drivers/md/md-bitmap.c
+++ b/drivers/md/md-bitmap.c
@@ -1947,6 +1947,7 @@ int md_bitmap_load(struct mddev *mddev)
 }
 EXPORT_SYMBOL_GPL(md_bitmap_load);
 
+/* caller need to free returned bitmap with md_bitmap_free() */
 struct bitmap *get_bitmap_from_slot(struct mddev *mddev, int slot)
 {
 	int rv = 0;
@@ -2010,6 +2011,7 @@ int md_bitmap_copy_from_slot(struct mddev *mddev, int slot,
 	md_bitmap_unplug(mddev->bitmap);
 	*low = lo;
 	*high = hi;
+	md_bitmap_free(bitmap);
 
 	return rv;
 }
@@ -2599,4 +2601,3 @@ struct attribute_group md_bitmap_group = {
 	.name = "bitmap",
 	.attrs = md_bitmap_attrs,
 };
-
From: Mauricio Faria de Oliveira mfo@canonical.com
mainline inclusion
from mainline-5.12-rc1
commit 4ceddce55eb35d15b0f87f5dcf6f0058fd15d3a4
category: bugfix
bugzilla: 50454
CVE: NA

---------------------------
There's an I/O error on fsync() in a detached loop device if it has been previously attached.
The issue is that the write cache is enabled in the attach path in loop_configure() but is not disabled in the detach path; thus it remains enabled in the block device regardless of whether the device is attached or not.

Now fsync() can get an I/O request that will just be failed later in loop_queue_rq(), as the device's state is not 'Lo_bound'.
So, disable write cache in the detach path.
Do so based on the queue flag, not the loop device's read-only flag (which was used to enable the cache), as the queue flag can be changed via sysfs even on read-only loop devices (e.g., losetup -r).
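For context, a condensed sketch of the attach/detach symmetry this creates (abbreviated; the attach-side condition is the usual loop enable logic and is an assumption here, not part of this patch — the detach hunk is shown verbatim below):

    /* attach: writable, fsync-capable backing files get a write cache */
    if (!(lo->lo_flags & LO_FLAGS_READ_ONLY) && file->f_op->fsync)
        blk_queue_write_cache(lo->lo_queue, true, false);

    /* detach: turn it back off, keyed on the queue flag itself */
    if (test_bit(QUEUE_FLAG_WC, &lo->lo_queue->queue_flags))
        blk_queue_write_cache(lo->lo_queue, false, false);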
Test-case:
# DEV=/dev/loop7

# IMG=/tmp/image
# truncate --size 1M $IMG

# losetup $DEV $IMG
# losetup -d $DEV
Before:
# strace -e fsync parted -s $DEV print 2>&1 | grep fsync
fsync(3) = -1 EIO (Input/output error)

Warning: Error fsyncing/closing /dev/loop7: Input/output error

[ 982.529929] blk_update_request: I/O error, dev loop7, sector 0 op 0x1:(WRITE) flags 0x800 phys_seg 0 prio class 0
After:
# strace -e fsync parted -s $DEV print 2>&1 | grep fsync
fsync(3) = 0
Co-developed-by: Eric Desrochers eric.desrochers@canonical.com
Signed-off-by: Eric Desrochers eric.desrochers@canonical.com
Signed-off-by: Mauricio Faria de Oliveira mfo@canonical.com
Tested-by: Gabriel Krisman Bertazi krisman@collabora.com
Reviewed-by: Ming Lei ming.lei@redhat.com
Signed-off-by: Jens Axboe axboe@kernel.dk
Signed-off-by: Yu Kuai yukuai3@huawei.com
Reviewed-by: Yufen Yu yuyufen@huawei.com
Signed-off-by: Yang Yingliang yangyingliang@huawei.com
Signed-off-by: Cheng Jian cj.chengjian@huawei.com
---
 drivers/block/loop.c | 3 +++
 1 file changed, 3 insertions(+)
diff --git a/drivers/block/loop.c b/drivers/block/loop.c
index 19b64ca8c4e3..981424b1c689 100644
--- a/drivers/block/loop.c
+++ b/drivers/block/loop.c
@@ -1086,6 +1086,9 @@ static int __loop_clr_fd(struct loop_device *lo, bool release)
 		goto out_unlock;
 	}
 
+	if (test_bit(QUEUE_FLAG_WC, &lo->lo_queue->queue_flags))
+		blk_queue_write_cache(lo->lo_queue, false, false);
+
 	/* freeze request queue during the transition */
 	blk_mq_freeze_queue(lo->lo_queue);
From: Jan Kara jack@suse.cz
mainline inclusion
from mainline-v5.12-rc4
commit 2a4ae3bcdf05b8639406eaa09a2939f3c6dd8e75
category: bugfix
bugzilla: 46758
CVE: NA
-----------------------------------------------
When a filesystem mount fails because of a corrupted filesystem, we first cancel the s_err_report timer that reminds about fs errors every day, and only then flush s_error_work. However s_error_work may report another fs error and re-arm the timer, resulting in a timer use-after-free. Fix the problem by first flushing the work and only after that canceling the s_err_report timer.
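The general rule, as a small sketch with generic names (not the ext4 code): when the work handler can re-arm the timer, the work must be quiesced before the timer is killed:

    static void error_work_fn(struct work_struct *work)
    {
        /* ... report the error ... */
        mod_timer(&err_timer, jiffies + 24 * 60 * 60 * HZ);  /* re-arms */
    }

    /* teardown */
    flush_work(&error_work);    /* after this, no further mod_timer() */
    del_timer_sync(&err_timer); /* now the timer cannot come back */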
Reported-by: syzbot+628472a2aac693ab0fcd@syzkaller.appspotmail.com
Fixes: 2d01ddc86606 ("ext4: save error info to sb through journal if available")
CC: stable@vger.kernel.org
Signed-off-by: Jan Kara jack@suse.cz
Link: https://lore.kernel.org/r/20210315165906.2175-1-jack@suse.cz
Signed-off-by: Theodore Ts'o tytso@mit.edu
Signed-off-by: Ye Bin yebin10@huawei.com
Reviewed-by: zhangyi (F) yi.zhang@huawei.com
Signed-off-by: Yang Yingliang yangyingliang@huawei.com
Signed-off-by: Cheng Jian cj.chengjian@huawei.com
---
 fs/ext4/super.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/fs/ext4/super.c b/fs/ext4/super.c
index f92b5cc115e2..3b2f7f7ea8cb 100644
--- a/fs/ext4/super.c
+++ b/fs/ext4/super.c
@@ -4746,8 +4746,8 @@ static int ext4_fill_super(struct super_block *sb, void *data, int silent)
failed_mount3a:
 	ext4_es_unregister_shrinker(sbi);
failed_mount3:
-	del_timer_sync(&sbi->s_err_report);
 	flush_work(&sbi->s_error_work);
+	del_timer_sync(&sbi->s_err_report);
 	if (sbi->s_mmp_tsk)
 		kthread_stop(sbi->s_mmp_tsk);
failed_mount2:
From: Wu Bo wubo40@huawei.com
mainline inclusion
from mainline-5.7-rc1
commit 1d99702f903277aeaf17e1873151ba40751ed6ed
category: bugfix
bugzilla: 46895
CVE: NA

---------------------------
Fix the active-session error count when total_cmds is invalid in iscsi_session_setup(): decrement the number of active sessions before the function returns.
Link: https://lore.kernel.org/r/EDBAAA0BBBA2AC4E9C8B6B81DEEE1D6916A28542@DGGEML525...
Reviewed-by: Lee Duncan lduncan@suse.com
Signed-off-by: Wu Bo wubo40@huawei.com
Signed-off-by: Martin K. Petersen martin.petersen@oracle.com
Signed-off-by: yangerkun yangerkun@huawei.com
Reviewed-by: Jason Yan yanaijie@huawei.com
Signed-off-by: Yang Yingliang yangyingliang@huawei.com
Signed-off-by: Cheng Jian cj.chengjian@huawei.com
---
 drivers/scsi/libiscsi.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/drivers/scsi/libiscsi.c b/drivers/scsi/libiscsi.c
index d6f3c4ed63e8..60995250131d 100644
--- a/drivers/scsi/libiscsi.c
+++ b/drivers/scsi/libiscsi.c
@@ -2855,7 +2855,7 @@ iscsi_session_setup(struct iscsi_transport *iscsit, struct Scsi_Host *shost,
			  "must be a power of 2.\n", total_cmds);
 		total_cmds = rounddown_pow_of_two(total_cmds);
 		if (total_cmds < ISCSI_TOTAL_CMDS_MIN)
-			return NULL;
+			goto dec_session_count;
 		printk(KERN_INFO "iscsi: Rounding can_queue to %d.\n",
		       total_cmds);
 	}
From: Mike Christie michael.christie@oracle.com
mainline inclusion
from mainline-v5.12-rc1
commit d28d48c699779973ab9a3bd0e5acfa112bd4fdef
category: bugfix
bugzilla: 46895
CVE: NA

---------------------------
If iscsi_prep_scsi_cmd_pdu() fails we try to add the task back to the cmdqueue, but we leave it partially set up. We don't have functions that can undo the pdu and init-task setup; we only have cleanup_task, which can clean up both parts. So just fail the cmd, go through the standard cleanup routine, and have the SCSI midlayer retry it, as is done when the cmd fails in the queuecommand path.
Link: https://lore.kernel.org/r/20210207044608.27585-2-michael.christie@oracle.com
Reviewed-by: Lee Duncan lduncan@suse.com
Signed-off-by: Mike Christie michael.christie@oracle.com
Signed-off-by: Martin K. Petersen martin.petersen@oracle.com
Signed-off-by: yangerkun yangerkun@huawei.com
Reviewed-by: Jason Yan yanaijie@huawei.com
Signed-off-by: Yang Yingliang yangyingliang@huawei.com
Signed-off-by: Cheng Jian cj.chengjian@huawei.com
---
 drivers/scsi/libiscsi.c | 11 +++--------
 1 file changed, 3 insertions(+), 8 deletions(-)
diff --git a/drivers/scsi/libiscsi.c b/drivers/scsi/libiscsi.c
index 60995250131d..134cddff9fdb 100644
--- a/drivers/scsi/libiscsi.c
+++ b/drivers/scsi/libiscsi.c
@@ -1588,14 +1588,9 @@ static int iscsi_data_xmit(struct iscsi_conn *conn)
 		}
 		rc = iscsi_prep_scsi_cmd_pdu(conn->task);
 		if (rc) {
-			if (rc == -ENOMEM || rc == -EACCES) {
-				spin_lock_bh(&conn->taskqueuelock);
-				list_add_tail(&conn->task->running,
-					      &conn->cmdqueue);
-				conn->task = NULL;
-				spin_unlock_bh(&conn->taskqueuelock);
-				goto done;
-			} else
+			if (rc == -ENOMEM || rc == -EACCES)
+				fail_scsi_task(conn->task, DID_IMM_RETRY);
+			else
 				fail_scsi_task(conn->task, DID_ABORT);
 			spin_lock_bh(&conn->taskqueuelock);
 			continue;
From: Mike Christie michael.christie@oracle.com
mainline inclusion
from mainline-v5.12-rc1
commit 5923d64b7ab63dcc6f0df946098f50902f9540d1
category: bugfix
bugzilla: 46895
CVE: NA

---------------------------
The purpose of the taskqueuelock was to handle the case where a bad target sends a cmd response to complete a cmd before the data for an R2T it had sent has been transferred. The following patches fix up the frwd/back locks so they are taken from the queue/xmit (frwd) and completion (back) paths again. To get there, this patch removes the taskqueuelock, which for iSCSI xmit-workqueue based drivers was taken in the queue, xmit, and completion paths.

Instead of the lock, we just make sure we hold a ref to the task when we queue an R2T, and we always remove the task from the requeue list in the xmit path or the forced cleanup paths.
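The resulting ownership rule, sketched with the existing helpers (locking elided; an illustrative fragment, not a verbatim hunk): each queued reference to a task holds its own task reference:

    /* producer (R2T path): the requeue list owns a reference */
    __iscsi_get_task(task);
    list_add_tail(&task->running, &conn->requeue);

    /* consumer (xmit path or forced cleanup): unlink, then drop it */
    list_del_init(&task->running);
    __iscsi_put_task(task);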
Link: https://lore.kernel.org/r/20210207044608.27585-3-michael.christie@oracle.com
Reviewed-by: Lee Duncan lduncan@suse.com
Signed-off-by: Mike Christie michael.christie@oracle.com
Signed-off-by: Martin K. Petersen martin.petersen@oracle.com

Conflicts:
  drivers/scsi/libiscsi.c
  drivers/scsi/libiscsi_tcp.c

Signed-off-by: yangerkun yangerkun@huawei.com
Reviewed-by: Jason Yan yanaijie@huawei.com
Signed-off-by: Yang Yingliang yangyingliang@huawei.com
Signed-off-by: Cheng Jian cj.chengjian@huawei.com
---
 drivers/scsi/libiscsi.c     | 177 +++++++++++++++++++++++-------------
 drivers/scsi/libiscsi_tcp.c |  86 ++++++++++++------
 include/scsi/libiscsi.h     |   4 +-
 3 files changed, 172 insertions(+), 95 deletions(-)
diff --git a/drivers/scsi/libiscsi.c b/drivers/scsi/libiscsi.c
index 134cddff9fdb..d82f8296f158 100644
--- a/drivers/scsi/libiscsi.c
+++ b/drivers/scsi/libiscsi.c
@@ -567,16 +567,6 @@ static void iscsi_complete_task(struct iscsi_task *task, int state)
 	WARN_ON_ONCE(task->state == ISCSI_TASK_FREE);
 	task->state = state;
 
-	spin_lock_bh(&conn->taskqueuelock);
-	if (!list_empty(&task->running)) {
-		pr_debug_once("%s while task on list", __func__);
-		list_del_init(&task->running);
-	}
-	spin_unlock_bh(&conn->taskqueuelock);
-
-	if (conn->task == task)
-		conn->task = NULL;
-
 	if (READ_ONCE(conn->ping_task) == task)
 		WRITE_ONCE(conn->ping_task, NULL);
 
@@ -608,9 +598,40 @@ void iscsi_complete_scsi_task(struct iscsi_task *task,
 }
 EXPORT_SYMBOL_GPL(iscsi_complete_scsi_task);
 
+/*
+ * Must be called with back and frwd lock
+ */
+static bool cleanup_queued_task(struct iscsi_task *task)
+{
+	struct iscsi_conn *conn = task->conn;
+	bool early_complete = false;
+
+	/* Bad target might have completed task while it was still running */
+	if (task->state == ISCSI_TASK_COMPLETED)
+		early_complete = true;
+
+	if (!list_empty(&task->running)) {
+		list_del_init(&task->running);
+		/*
+		 * If it's on a list but still running, this could be from
+		 * a bad target sending a rsp early, cleanup from a TMF, or
+		 * session recovery.
+		 */
+		if (task->state == ISCSI_TASK_RUNNING ||
+		    task->state == ISCSI_TASK_COMPLETED)
+			__iscsi_put_task(task);
+	}
+
+	if (conn->task == task) {
+		conn->task = NULL;
+		__iscsi_put_task(task);
+	}
+
+	return early_complete;
+}
+
 /*
- * session back_lock must be held and if not called for a task that is
+ * session frwd_lock must be held and if not called for a task that is
  * still pending or from the xmit thread, then xmit thread must
  * be suspended.
  */
@@ -629,6 +650,13 @@ static void fail_scsi_task(struct iscsi_task *task, int err)
 	if (!sc)
 		return;
 
+	/* regular RX path uses back_lock */
+	spin_lock_bh(&conn->session->back_lock);
+	if (cleanup_queued_task(task)) {
+		spin_unlock_bh(&conn->session->back_lock);
+		return;
+	}
+
 	if (task->state == ISCSI_TASK_PENDING) {
 		/*
 		 * cmd never made it to the xmit thread, so we should not count
@@ -650,8 +678,6 @@ static void fail_scsi_task(struct iscsi_task *task, int err)
 			scsi_in(sc)->resid = scsi_in(sc)->length;
 	}
 
-	/* regular RX path uses back_lock */
-	spin_lock_bh(&conn->session->back_lock);
 	iscsi_complete_task(task, state);
 	spin_unlock_bh(&conn->session->back_lock);
 }
@@ -797,9 +823,7 @@ __iscsi_conn_send_pdu(struct iscsi_conn *conn, struct iscsi_hdr *hdr,
 		if (session->tt->xmit_task(task))
 			goto free_task;
 	} else {
-		spin_lock_bh(&conn->taskqueuelock);
 		list_add_tail(&task->running, &conn->mgmtqueue);
-		spin_unlock_bh(&conn->taskqueuelock);
 		iscsi_conn_queue_work(conn);
 	}
 
@@ -1467,31 +1491,61 @@ static int iscsi_check_cmdsn_window_closed(struct iscsi_conn *conn)
 	return 0;
 }
 
-static int iscsi_xmit_task(struct iscsi_conn *conn)
+static int iscsi_xmit_task(struct iscsi_conn *conn, struct iscsi_task *task,
+			   bool was_requeue)
 {
-	struct iscsi_task *task = conn->task;
 	int rc;
 
-	if (test_bit(ISCSI_SUSPEND_BIT, &conn->suspend_tx))
-		return -ENODATA;
-
 	spin_lock_bh(&conn->session->back_lock);
-	if (conn->task == NULL) {
+
+	if (!conn->task) {
+		/* Take a ref so we can access it after xmit_task() */
+		__iscsi_get_task(task);
+	} else {
+		/* Already have a ref from when we failed to send it last call */
+		conn->task = NULL;
+	}
+
+	/*
+	 * If this was a requeue for a R2T we have an extra ref on the task in
+	 * case a bad target sends a cmd rsp before we have handled the task.
+	 */
+	if (was_requeue)
+		__iscsi_put_task(task);
+
+	/*
+	 * Do this after dropping the extra ref because if this was a requeue
+	 * it's removed from that list and cleanup_queued_task would miss it.
+	 */
+	if (test_bit(ISCSI_SUSPEND_BIT, &conn->suspend_tx)) {
+		/*
+		 * Save the task and ref in case we weren't cleaning up this
+		 * task and get woken up again.
+		 */
+		conn->task = task;
 		spin_unlock_bh(&conn->session->back_lock);
 		return -ENODATA;
 	}
-	__iscsi_get_task(task);
 	spin_unlock_bh(&conn->session->back_lock);
+
 	spin_unlock_bh(&conn->session->frwd_lock);
 	rc = conn->session->tt->xmit_task(task);
 	spin_lock_bh(&conn->session->frwd_lock);
 	if (!rc) {
 		/* done with this task */
 		task->last_xfer = jiffies;
-		conn->task = NULL;
 	}
 	/* regular RX path uses back_lock */
 	spin_lock(&conn->session->back_lock);
+	if (rc && task->state == ISCSI_TASK_RUNNING) {
+		/*
+		 * get an extra ref that is released next time we access it
+		 * as conn->task above.
+		 */
+		__iscsi_get_task(task);
+		conn->task = task;
+	}
+
 	__iscsi_put_task(task);
 	spin_unlock(&conn->session->back_lock);
 	return rc;
@@ -1501,9 +1555,7 @@ static int iscsi_xmit_task(struct iscsi_conn *conn)
  * iscsi_requeue_task - requeue task to run from session workqueue
  * @task: task to requeue
  *
- * LLDs that need to run a task from the session workqueue should call
- * this. The session frwd_lock must be held. This should only be called
- * by software drivers.
+ * Callers must have taken a ref to the task that is going to be requeued.
  */
 void iscsi_requeue_task(struct iscsi_task *task)
 {
@@ -1513,11 +1565,18 @@ void iscsi_requeue_task(struct iscsi_task *task)
 	 * this may be on the requeue list already if the xmit_task callout
 	 * is handling the r2ts while we are adding new ones
 	 */
-	spin_lock_bh(&conn->taskqueuelock);
-	if (list_empty(&task->running))
+	spin_lock_bh(&conn->session->frwd_lock);
+	if (list_empty(&task->running)) {
 		list_add_tail(&task->running, &conn->requeue);
-	spin_unlock_bh(&conn->taskqueuelock);
+	} else {
+		/*
+		 * Don't need the extra ref since it's already requeued and
+		 * has a ref.
+		 */
+		iscsi_put_task(task);
+	}
 	iscsi_conn_queue_work(conn);
+	spin_unlock_bh(&conn->session->frwd_lock);
 }
 EXPORT_SYMBOL_GPL(iscsi_requeue_task);
 
@@ -1543,7 +1602,7 @@ static int iscsi_data_xmit(struct iscsi_conn *conn)
 	}
 
 	if (conn->task) {
-		rc = iscsi_xmit_task(conn);
+		rc = iscsi_xmit_task(conn, conn->task, false);
 		if (rc)
 			goto done;
 	}
@@ -1553,49 +1612,41 @@ static int iscsi_data_xmit(struct iscsi_conn *conn)
 	 * only have one nop-out as a ping from us and targets should not
 	 * overflow us with nop-ins
 	 */
-	spin_lock_bh(&conn->taskqueuelock);
check_mgmt:
 	while (!list_empty(&conn->mgmtqueue)) {
-		conn->task = list_entry(conn->mgmtqueue.next,
-					struct iscsi_task, running);
-		list_del_init(&conn->task->running);
-		spin_unlock_bh(&conn->taskqueuelock);
-		if (iscsi_prep_mgmt_task(conn, conn->task)) {
+		task = list_entry(conn->mgmtqueue.next, struct iscsi_task,
+				  running);
+		list_del_init(&task->running);
+		if (iscsi_prep_mgmt_task(conn, task)) {
 			/* regular RX path uses back_lock */
 			spin_lock_bh(&conn->session->back_lock);
-			__iscsi_put_task(conn->task);
+			__iscsi_put_task(task);
 			spin_unlock_bh(&conn->session->back_lock);
-			conn->task = NULL;
-			spin_lock_bh(&conn->taskqueuelock);
 			continue;
 		}
-		rc = iscsi_xmit_task(conn);
+		rc = iscsi_xmit_task(conn, task, false);
 		if (rc)
 			goto done;
-		spin_lock_bh(&conn->taskqueuelock);
 	}
 
 	/* process pending command queue */
 	while (!list_empty(&conn->cmdqueue)) {
-		conn->task = list_entry(conn->cmdqueue.next, struct iscsi_task,
-					running);
-		list_del_init(&conn->task->running);
-		spin_unlock_bh(&conn->taskqueuelock);
+		task = list_entry(conn->cmdqueue.next, struct iscsi_task,
+				  running);
+		list_del_init(&task->running);
 		if (conn->session->state == ISCSI_STATE_LOGGING_OUT) {
-			fail_scsi_task(conn->task, DID_IMM_RETRY);
-			spin_lock_bh(&conn->taskqueuelock);
+			fail_scsi_task(task, DID_IMM_RETRY);
 			continue;
 		}
-		rc = iscsi_prep_scsi_cmd_pdu(conn->task);
+		rc = iscsi_prep_scsi_cmd_pdu(task);
 		if (rc) {
 			if (rc == -ENOMEM || rc == -EACCES)
-				fail_scsi_task(conn->task, DID_IMM_RETRY);
+				fail_scsi_task(task, DID_IMM_RETRY);
 			else
-				fail_scsi_task(conn->task, DID_ABORT);
-			spin_lock_bh(&conn->taskqueuelock);
+				fail_scsi_task(task, DID_ABORT);
 			continue;
 		}
-		rc = iscsi_xmit_task(conn);
+		rc = iscsi_xmit_task(conn, task, false);
 		if (rc)
 			goto done;
 		/*
@@ -1603,7 +1654,6 @@ static int iscsi_data_xmit(struct iscsi_conn *conn)
 		 * we need to check the mgmt queue for nops that need to
 		 * be sent to aviod starvation
 		 */
-		spin_lock_bh(&conn->taskqueuelock);
 		if (!list_empty(&conn->mgmtqueue))
 			goto check_mgmt;
 	}
@@ -1617,21 +1667,17 @@ static int iscsi_data_xmit(struct iscsi_conn *conn)
 
 		task = list_entry(conn->requeue.next, struct iscsi_task,
				  running);
+
 		if (iscsi_check_tmf_restrictions(task, ISCSI_OP_SCSI_DATA_OUT))
 			break;
 
-		conn->task = task;
-		list_del_init(&conn->task->running);
-		conn->task->state = ISCSI_TASK_RUNNING;
-		spin_unlock_bh(&conn->taskqueuelock);
-		rc = iscsi_xmit_task(conn);
+		list_del_init(&task->running);
+		rc = iscsi_xmit_task(conn, task, true);
 		if (rc)
 			goto done;
-		spin_lock_bh(&conn->taskqueuelock);
 		if (!list_empty(&conn->mgmtqueue))
 			goto check_mgmt;
 	}
-	spin_unlock_bh(&conn->taskqueuelock);
 	spin_unlock_bh(&conn->session->frwd_lock);
 	return -ENODATA;
 
@@ -1797,9 +1843,7 @@ int iscsi_queuecommand(struct Scsi_Host *host, struct scsi_cmnd *sc)
 			goto prepd_reject;
 		}
 	} else {
-		spin_lock_bh(&conn->taskqueuelock);
 		list_add_tail(&task->running, &conn->cmdqueue);
-		spin_unlock_bh(&conn->taskqueuelock);
 		iscsi_conn_queue_work(conn);
 	}
 
@@ -2990,7 +3034,6 @@ iscsi_conn_setup(struct iscsi_cls_session *cls_session, int dd_size,
 	INIT_LIST_HEAD(&conn->mgmtqueue);
 	INIT_LIST_HEAD(&conn->cmdqueue);
 	INIT_LIST_HEAD(&conn->requeue);
-	spin_lock_init(&conn->taskqueuelock);
 	INIT_WORK(&conn->xmitwork, iscsi_xmitworker);
 
 	/* allocate login_task used for the login/text sequences */
@@ -3156,10 +3199,16 @@ fail_mgmt_tasks(struct iscsi_session *session, struct iscsi_conn *conn)
 		ISCSI_DBG_SESSION(conn->session,
				  "failing mgmt itt 0x%x state %d\n",
				  task->itt, task->state);
+
+		spin_lock_bh(&session->back_lock);
+		if (cleanup_queued_task(task)) {
+			spin_unlock_bh(&session->back_lock);
+			continue;
+		}
+
 		state = ISCSI_TASK_ABRT_SESS_RECOV;
 		if (task->state == ISCSI_TASK_PENDING)
 			state = ISCSI_TASK_COMPLETED;
-		spin_lock_bh(&session->back_lock);
 		iscsi_complete_task(task, state);
 		spin_unlock_bh(&session->back_lock);
 	}
diff --git a/drivers/scsi/libiscsi_tcp.c b/drivers/scsi/libiscsi_tcp.c
index b88ea3ebd096..f9437fe983a1 100644
--- a/drivers/scsi/libiscsi_tcp.c
+++ b/drivers/scsi/libiscsi_tcp.c
@@ -531,48 +531,79 @@ static int iscsi_tcp_data_in(struct iscsi_conn *conn, struct iscsi_task *task)
 /**
  * iscsi_tcp_r2t_rsp - iSCSI R2T Response processing
  * @conn: iscsi connection
- * @task: scsi command task
+ * @hdr: PDU header
  */
-static int iscsi_tcp_r2t_rsp(struct iscsi_conn *conn, struct iscsi_task *task)
+static int iscsi_tcp_r2t_rsp(struct iscsi_conn *conn, struct iscsi_hdr *hdr)
 {
 	struct iscsi_session *session = conn->session;
-	struct iscsi_tcp_task *tcp_task = task->dd_data;
-	struct iscsi_tcp_conn *tcp_conn = conn->dd_data;
-	struct iscsi_r2t_rsp *rhdr = (struct iscsi_r2t_rsp *)tcp_conn->in.hdr;
+	struct iscsi_tcp_task *tcp_task;
+	struct iscsi_tcp_conn *tcp_conn;
+	struct iscsi_r2t_rsp *rhdr;
 	struct iscsi_r2t_info *r2t;
-	int r2tsn = be32_to_cpu(rhdr->r2tsn);
+	struct iscsi_task *task;
 	u32 data_length;
 	u32 data_offset;
+	int r2tsn;
 	int rc;
 
+	spin_lock(&session->back_lock);
+	task = iscsi_itt_to_ctask(conn, hdr->itt);
+	if (!task) {
+		spin_unlock(&session->back_lock);
+		return ISCSI_ERR_BAD_ITT;
+	} else if (task->sc->sc_data_direction != DMA_TO_DEVICE) {
+		spin_unlock(&session->back_lock);
+		return ISCSI_ERR_PROTO;
+	}
+	/*
+	 * A bad target might complete the cmd before we have handled R2Ts
+	 * so get a ref to the task that will be dropped in the xmit path.
+	 */
+	if (task->state != ISCSI_TASK_RUNNING) {
+		spin_unlock(&session->back_lock);
+		/* Let the path that got the early rsp complete it */
+		return 0;
+	}
+	task->last_xfer = jiffies;
+	__iscsi_get_task(task);
+
+	tcp_conn = conn->dd_data;
+	rhdr = (struct iscsi_r2t_rsp *)tcp_conn->in.hdr;
+	/* fill-in new R2T associated with the task */
+	iscsi_update_cmdsn(session, (struct iscsi_nopin *)rhdr);
+	spin_unlock(&session->back_lock);
+
 	if (tcp_conn->in.datalen) {
 		iscsi_conn_printk(KERN_ERR, conn,
				  "invalid R2t with datalen %d\n",
				  tcp_conn->in.datalen);
-		return ISCSI_ERR_DATALEN;
+		rc = ISCSI_ERR_DATALEN;
+		goto put_task;
 	}
 
+	tcp_task = task->dd_data;
+	r2tsn = be32_to_cpu(rhdr->r2tsn);
 	if (tcp_task->exp_datasn != r2tsn){
 		ISCSI_DBG_TCP(conn,
			      "task->exp_datasn(%d) != rhdr->r2tsn(%d)\n",
			      tcp_task->exp_datasn, r2tsn);
-		return ISCSI_ERR_R2TSN;
+		rc = ISCSI_ERR_R2TSN;
+		goto put_task;
 	}
 
-	/* fill-in new R2T associated with the task */
-	iscsi_update_cmdsn(session, (struct iscsi_nopin*)rhdr);
-
-	if (!task->sc || session->state != ISCSI_STATE_LOGGED_IN) {
+	if (session->state != ISCSI_STATE_LOGGED_IN) {
 		iscsi_conn_printk(KERN_INFO, conn,
				  "dropping R2T itt %d in recovery.\n",
				  task->itt);
-		return 0;
+		rc = 0;
+		goto put_task;
 	}
 
 	data_length = be32_to_cpu(rhdr->data_length);
 	if (data_length == 0) {
 		iscsi_conn_printk(KERN_ERR, conn,
				  "invalid R2T with zero data len\n");
-		return ISCSI_ERR_DATALEN;
+		rc = ISCSI_ERR_DATALEN;
+		goto put_task;
 	}
 
 	if (data_length > session->max_burst)
@@ -586,7 +617,8 @@ static int iscsi_tcp_r2t_rsp(struct iscsi_conn *conn, struct iscsi_task *task)
				  "invalid R2T with data len %u at offset %u "
				  "and total length %d\n", data_length,
				  data_offset, scsi_out(task->sc)->length);
-		return ISCSI_ERR_DATALEN;
+		rc = ISCSI_ERR_DATALEN;
+		goto put_task;
 	}
 
 	spin_lock(&tcp_task->pool2queue);
@@ -596,7 +628,8 @@ static int iscsi_tcp_r2t_rsp(struct iscsi_conn *conn, struct iscsi_task *task)
				  "Target has sent more R2Ts than it "
				  "negotiated for or driver has leaked.\n");
 		spin_unlock(&tcp_task->pool2queue);
-		return ISCSI_ERR_PROTO;
+		rc = ISCSI_ERR_PROTO;
+		goto put_task;
 	}
 
 	r2t->exp_statsn = rhdr->statsn;
@@ -614,6 +647,10 @@ static int iscsi_tcp_r2t_rsp(struct iscsi_conn *conn, struct iscsi_task *task)
 
 	iscsi_requeue_task(task);
 	return 0;
+
+put_task:
+	iscsi_put_task(task);
+	return rc;
 }
 
 /*
@@ -737,20 +774,11 @@ iscsi_tcp_hdr_dissect(struct iscsi_conn *conn, struct iscsi_hdr *hdr)
 		rc = iscsi_complete_pdu(conn, hdr, NULL, 0);
 		break;
 	case ISCSI_OP_R2T:
-		spin_lock(&conn->session->back_lock);
-		task = iscsi_itt_to_ctask(conn, hdr->itt);
-		spin_unlock(&conn->session->back_lock);
-		if (!task)
-			rc = ISCSI_ERR_BAD_ITT;
-		else if (ahslen)
+		if (ahslen) {
 			rc = ISCSI_ERR_AHSLEN;
-		else if (task->sc->sc_data_direction == DMA_TO_DEVICE) {
-			task->last_xfer = jiffies;
-			spin_lock(&conn->session->frwd_lock);
-			rc = iscsi_tcp_r2t_rsp(conn, task);
-			spin_unlock(&conn->session->frwd_lock);
-		} else
-			rc = ISCSI_ERR_PROTO;
+			break;
+		}
+		rc = iscsi_tcp_r2t_rsp(conn, hdr);
 		break;
 	case ISCSI_OP_LOGIN_RSP:
 	case ISCSI_OP_TEXT_RSP:
diff --git a/include/scsi/libiscsi.h b/include/scsi/libiscsi.h
index 8e146aa623cf..c71297f7078c 100644
--- a/include/scsi/libiscsi.h
+++ b/include/scsi/libiscsi.h
@@ -200,7 +200,7 @@ struct iscsi_conn {
 	struct iscsi_task *task;	/* xmit task in progress */
 
 	/* xmit */
-	spinlock_t taskqueuelock;	/* protects the next three lists */
+	/* items must be added/deleted under frwd lock */
 	struct list_head mgmtqueue;	/* mgmt (control) xmit queue */
 	struct list_head cmdqueue;	/* data-path cmd queue */
 	struct list_head requeue;	/* tasks needing another run */
@@ -346,7 +346,7 @@ struct iscsi_session {
						 * cmdsn, queued_cmdsn     *
						 * session resources:      *
						 * - cmdpool kfifo_out ,   *
-						 * - mgmtpool, */
+						 * - mgmtpool, queues      */
 	spinlock_t back_lock;			/* protects cmdsn_exp      *
						 * cmdsn_max,              *
						 * cmdpool kfifo_in */
From: Mike Christie michael.christie@oracle.com
mainline inclusion from mainline-v5.12-rc1 commit 14936b1ed249916c28642d0db47a51b085ce13b4 category: bugfix bugzilla: 46895 CVE: NA ---------------------------
The following bug was reported and debugged by wubo40@huawei.com:
When testing the 4.18 kernel, a NULL pointer dereference occurred in the iscsi_eh_cmd_timed_out() function.
I think this bug still exists upstream.
The analysis is as follows:
1) For some reason, an I/O command did not complete within the timeout period. The block layer timer fires and calls scsi_times_out() to handle the I/O timeout logic. At the same time, the command just completes.
2) scsi_times_out() calls iscsi_eh_cmd_timed_out() to process the timeout logic. Although there is a NULL check for the task, the task has not been released yet at that point.
3) iscsi_complete_task() calls __iscsi_put_task(). The task reference count reaches zero, so the conditions for freeing the task are met: iscsi_free_task() frees the task and sets sc->SCp.ptr = NULL. Even after iscsi_eh_cmd_timed_out() has passed the NULL check, a NULL pointer dereference can therefore still occur.
CPU0                                        CPU3

|- scsi_times_out()                         |- iscsi_complete_task()
|                                           |
|- iscsi_eh_cmd_timed_out()                 |- __iscsi_put_task()
|                                           |
|- task = sc->SCp.ptr, task is not NULL,    |- iscsi_free_task(task)
|  check passed                             |    |
|                                           |    |-> sc->SCp.ptr = NULL
|                                           |
|- task is NULL now,                        |
|  NULL pointer dereference                \|/
\|/
Calltrace:
[380751.840862] BUG: unable to handle kernel NULL pointer dereference at 0000000000000138
[380751.843709] PGD 0 P4D 0
[380751.844770] Oops: 0000 [#1] SMP PTI
[380751.846283] CPU: 0 PID: 403 Comm: kworker/0:1H Kdump: loaded Tainted: G
[380751.851467] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996)
[380751.856521] Workqueue: kblockd blk_mq_timeout_work
[380751.858527] RIP: 0010:iscsi_eh_cmd_timed_out+0x15e/0x2e0 [libiscsi]
[380751.861129] Code: 83 ea 01 48 8d 74 d0 08 48 8b 10 48 8b 4a 50 48 85 c9 74 2c 48 39 d5 74
[380751.868811] RSP: 0018:ffffc1e280a5fd58 EFLAGS: 00010246
[380751.870978] RAX: ffff9fd1e84e15e0 RBX: ffff9fd1e84e6dd0 RCX: 0000000116acc580
[380751.873791] RDX: ffff9fd1f97a9400 RSI: ffff9fd1e84e1800 RDI: ffff9fd1e4d6d420
[380751.876059] RBP: ffff9fd1e4d49000 R08: 0000000116acc580 R09: 0000000116acc580
[380751.878284] R10: 0000000000000000 R11: 0000000000000000 R12: ffff9fd1e6e931e8
[380751.880500] R13: ffff9fd1e84e6ee0 R14: 0000000000000010 R15: 0000000000000003
[380751.882687] FS: 0000000000000000(0000) GS:ffff9fd1fac00000(0000) knlGS:0000000000000000
[380751.885236] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[380751.887059] CR2: 0000000000000138 CR3: 000000011860a001 CR4: 00000000003606f0
[380751.889308] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[380751.891523] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[380751.893738] Call Trace:
[380751.894639]  scsi_times_out+0x60/0x1c0
[380751.895861]  blk_mq_check_expired+0x144/0x200
[380751.897302]  ? __switch_to_asm+0x35/0x70
[380751.898551]  blk_mq_queue_tag_busy_iter+0x195/0x2e0
[380751.900091]  ? __blk_mq_requeue_request+0x100/0x100
[380751.901611]  ? __switch_to_asm+0x41/0x70
[380751.902853]  ? __blk_mq_requeue_request+0x100/0x100
[380751.904398]  blk_mq_timeout_work+0x54/0x130
[380751.905740]  process_one_work+0x195/0x390
[380751.907228]  worker_thread+0x30/0x390
[380751.908713]  ? process_one_work+0x390/0x390
[380751.910350]  kthread+0x10d/0x130
[380751.911470]  ? kthread_flush_work_fn+0x10/0x10
[380751.913007]  ret_from_fork+0x35/0x40
crash> dis -l iscsi_eh_cmd_timed_out+0x15e
xxxxx/drivers/scsi/libiscsi.c: 2062
1970 enum blk_eh_timer_return iscsi_eh_cmd_timed_out(struct scsi_cmnd *sc)
     {
     ...
1984         spin_lock_bh(&session->frwd_lock);
1985         task = (struct iscsi_task *)sc->SCp.ptr;
1986         if (!task) {
1987                 /*
1988                  * Raced with completion. Blk layer has taken ownership
1989                  * so let timeout code complete it now.
1990                  */
1991                 rc = BLK_EH_DONE;
1992                 goto done;
1993         }
...
2052         for (i = 0; i < conn->session->cmds_max; i++) {
2053                 running_task = conn->session->cmds[i];
2054                 if (!running_task->sc || running_task == task ||
2055                     running_task->state != ISCSI_TASK_RUNNING)
2056                         continue;
2057
2058                 /*
2059                  * Only check if cmds started before this one have made
2060                  * progress, or this could never fail
2061                  */
2062                 if (time_after(running_task->sc->jiffies_at_alloc,
2063                                task->sc->jiffies_at_alloc)) <---
2064                         continue;
2065         ...
             }
crash> struct scsi_cmnd ffff9fd1e6e931e8
struct scsi_cmnd {
  ...
  SCp = {
    ptr = 0x0,   <--- iscsi_task
    this_residual = 0,
    ...
  },
}
To prevent this, we take a ref to the cmd under the back (completion) lock, so that if the completion side calls iscsi_complete_task() on the task while the timer/EH paths are not holding the back_lock, the task will not be freed from under us.
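A minimal sketch of the resulting pattern in the timeout path (the names are the real libiscsi ones; control flow is condensed from the diff below):

    spin_lock_bh(&session->frwd_lock);
    spin_lock(&session->back_lock);
    task = (struct iscsi_task *)sc->SCp.ptr;
    if (!task) {
            /* completion side already owns the cmd */
            spin_unlock(&session->back_lock);
            goto done;
    }
    __iscsi_get_task(task);          /* pin the task before dropping back_lock */
    spin_unlock(&session->back_lock);

    /* task can now be inspected without racing iscsi_free_task() */

    done:
    spin_unlock_bh(&session->frwd_lock);
    if (task)
            iscsi_put_task(task);    /* the final put may free the task here */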
Note that this requires the previous patch, "scsi: libiscsi: Drop taskqueuelock", because bnx2i sleeps in its cleanup_task callout if the cmd is aborted. If the EH/timer and completion paths are racing we don't know which path will do the last put. The previous patch moved the operations that needed the forward lock into cleanup_queued_task. Once that has run we can drop the forward lock for the cmd, and bnx2i no longer has to worry about whether the EH, timer or completion path did the last put, or whether the forward lock is held, since it won't be.
Link: https://lore.kernel.org/r/20210207044608.27585-4-michael.christie@oracle.com Reported-by: Wu Bo wubo40@huawei.com Reviewed-by: Lee Duncan lduncan@suse.com Signed-off-by: Mike Christie michael.christie@oracle.com Signed-off-by: Martin K. Petersen martin.petersen@oracle.com
Conflicts: drivers/scsi/libiscsi.c
Signed-off-by: yangerkun yangerkun@huawei.com Reviewed-by: Jason Yan yanaijie@huawei.com Signed-off-by: Yang Yingliang yangyingliang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- drivers/scsi/bnx2i/bnx2i_iscsi.c | 2 - drivers/scsi/libiscsi.c | 71 ++++++++++++++++++++------------ 2 files changed, 45 insertions(+), 28 deletions(-)
diff --git a/drivers/scsi/bnx2i/bnx2i_iscsi.c b/drivers/scsi/bnx2i/bnx2i_iscsi.c index de0a507577ef..1af1a884109e 100644 --- a/drivers/scsi/bnx2i/bnx2i_iscsi.c +++ b/drivers/scsi/bnx2i/bnx2i_iscsi.c @@ -1173,10 +1173,8 @@ static void bnx2i_cleanup_task(struct iscsi_task *task) bnx2i_send_cmd_cleanup_req(hba, task->dd_data);
spin_unlock_bh(&conn->session->back_lock); - spin_unlock_bh(&conn->session->frwd_lock); wait_for_completion_timeout(&bnx2i_conn->cmd_cleanup_cmpl, msecs_to_jiffies(ISCSI_CMD_CLEANUP_TIMEOUT)); - spin_lock_bh(&conn->session->frwd_lock); spin_lock_bh(&conn->session->back_lock); } bnx2i_iscsi_unmap_sg_list(task->dd_data); diff --git a/drivers/scsi/libiscsi.c b/drivers/scsi/libiscsi.c index d82f8296f158..1c074f1ae4c3 100644 --- a/drivers/scsi/libiscsi.c +++ b/drivers/scsi/libiscsi.c @@ -631,9 +631,8 @@ static bool cleanup_queued_task(struct iscsi_task *task) }
/* - * session frwd_lock must be held and if not called for a task that is - * still pending or from the xmit thread, then xmit thread must - * be suspended. + * session frwd lock must be held and if not called for a task that is still + * pending or from the xmit thread, then xmit thread must be suspended */ static void fail_scsi_task(struct iscsi_task *task, int err) { @@ -641,16 +640,6 @@ static void fail_scsi_task(struct iscsi_task *task, int err) struct scsi_cmnd *sc; int state;
- /* - * if a command completes and we get a successful tmf response - * we will hit this because the scsi eh abort code does not take - * a ref to the task. - */ - sc = task->sc; - if (!sc) - return; - - /* regular RX path uses back_lock */ spin_lock_bh(&conn->session->back_lock); if (cleanup_queued_task(task)) { spin_unlock_bh(&conn->session->back_lock); @@ -670,6 +659,7 @@ static void fail_scsi_task(struct iscsi_task *task, int err) else state = ISCSI_TASK_ABRT_TMF;
+ sc = task->sc; sc->result = err << 16; if (!scsi_bidi_cmnd(sc)) scsi_set_resid(sc, scsi_bufflen(sc)); @@ -1955,27 +1945,39 @@ static int iscsi_exec_task_mgmt_fn(struct iscsi_conn *conn, }
/* - * Fail commands. session lock held and recv side suspended and xmit - * thread flushed + * Fail commands. session frwd lock held and xmit thread flushed. */ static void fail_scsi_tasks(struct iscsi_conn *conn, u64 lun, int error) { + struct iscsi_session *session = conn->session; struct iscsi_task *task; int i;
- for (i = 0; i < conn->session->cmds_max; i++) { - task = conn->session->cmds[i]; + spin_lock_bh(&session->back_lock); + for (i = 0; i < session->cmds_max; i++) { + task = session->cmds[i]; if (!task->sc || task->state == ISCSI_TASK_FREE) continue;
if (lun != -1 && lun != task->sc->device->lun) continue;
- ISCSI_DBG_SESSION(conn->session, + __iscsi_get_task(task); + spin_unlock_bh(&session->back_lock); + + ISCSI_DBG_SESSION(session, "failing sc %p itt 0x%x state %d\n", task->sc, task->itt, task->state); fail_scsi_task(task, error); + + spin_unlock_bh(&session->frwd_lock); + iscsi_put_task(task); + spin_lock_bh(&session->frwd_lock); + + spin_lock_bh(&session->back_lock); } + + spin_unlock_bh(&session->back_lock); }
/** @@ -2053,6 +2055,7 @@ enum blk_eh_timer_return iscsi_eh_cmd_timed_out(struct scsi_cmnd *sc) ISCSI_DBG_EH(session, "scsi cmd %p timedout\n", sc);
spin_lock_bh(&session->frwd_lock); + spin_lock(&session->back_lock); task = (struct iscsi_task *)sc->SCp.ptr; if (!task) { /* @@ -2060,8 +2063,11 @@ enum blk_eh_timer_return iscsi_eh_cmd_timed_out(struct scsi_cmnd *sc) * so let timeout code complete it now. */ rc = BLK_EH_DONE; + spin_unlock(&session->back_lock); goto done; } + __iscsi_get_task(task); + spin_unlock(&session->back_lock);
if (session->state != ISCSI_STATE_LOGGED_IN) { /* @@ -2123,6 +2129,7 @@ enum blk_eh_timer_return iscsi_eh_cmd_timed_out(struct scsi_cmnd *sc) goto done; }
+ spin_lock(&session->back_lock); for (i = 0; i < conn->session->cmds_max; i++) { running_task = conn->session->cmds[i]; if (!running_task->sc || running_task == task || @@ -2155,10 +2162,12 @@ enum blk_eh_timer_return iscsi_eh_cmd_timed_out(struct scsi_cmnd *sc) "last xfer %lu/%lu. Last check %lu.\n", task->last_xfer, running_task->last_xfer, task->last_timeout); + spin_unlock(&session->back_lock); rc = BLK_EH_RESET_TIMER; goto done; } } + spin_unlock(&session->back_lock);
/* Assumes nop timeout is shorter than scsi cmd timeout */ if (task->have_checked_conn) @@ -2180,9 +2189,12 @@ enum blk_eh_timer_return iscsi_eh_cmd_timed_out(struct scsi_cmnd *sc) rc = BLK_EH_RESET_TIMER;
done: - if (task) - task->last_timeout = jiffies; spin_unlock_bh(&session->frwd_lock); + + if (task) { + task->last_timeout = jiffies; + iscsi_put_task(task); + } ISCSI_DBG_EH(session, "return %s\n", rc == BLK_EH_RESET_TIMER ? "timer reset" : "shutdown or nh"); return rc; @@ -2290,15 +2302,20 @@ int iscsi_eh_abort(struct scsi_cmnd *sc) conn->eh_abort_cnt++; age = session->age;
+ spin_lock(&session->back_lock); task = (struct iscsi_task *)sc->SCp.ptr; - ISCSI_DBG_EH(session, "aborting [sc %p itt 0x%x]\n", - sc, task->itt); - - /* task completed before time out */ - if (!task->sc) { + if (!task || !task->sc) { + /* task completed before time out */ ISCSI_DBG_EH(session, "sc completed while abort in progress\n"); - goto success; + + spin_unlock(&session->back_lock); + spin_unlock_bh(&session->frwd_lock); + mutex_unlock(&session->eh_mutex); + return SUCCESS; } + ISCSI_DBG_EH(session, "aborting [sc %p itt 0x%x]\n", sc, task->itt); + __iscsi_get_task(task); + spin_unlock(&session->back_lock);
if (task->state == ISCSI_TASK_PENDING) { fail_scsi_task(task, DID_ABORT); @@ -2360,6 +2377,7 @@ int iscsi_eh_abort(struct scsi_cmnd *sc) success_unlocked: ISCSI_DBG_EH(session, "abort success [sc %p itt 0x%x]\n", sc, task->itt); + iscsi_put_task(task); mutex_unlock(&session->eh_mutex); return SUCCESS;
@@ -2368,6 +2386,7 @@ int iscsi_eh_abort(struct scsi_cmnd *sc) failed_unlocked: ISCSI_DBG_EH(session, "abort failed [sc %p itt 0x%x]\n", sc, task ? task->itt : 0); + iscsi_put_task(task); mutex_unlock(&session->eh_mutex); return FAILED; }
From: Mike Christie michael.christie@oracle.com
mainline inclusion from mainline-v5.12-rc1 commit c435f0a9ecb7435e70f447b7231ca52de589b252 category: bugfix bugzilla: 46895 CVE: NA ---------------------------
We allocate the iSCSI host workq in iscsi_host_alloc(), so iscsi_host_free() should do the destruction. Drivers can then do their error/goto handling and call iscsi_host_free() to clean up what has been allocated in iscsi_host_alloc().
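As a hedged sketch, a driver's setup/error path under this convention looks like the following (my_sht and struct my_hba are placeholders, not real symbols):

    shost = iscsi_host_alloc(&my_sht, sizeof(struct my_hba), 1);
    if (!shost)
            return -ENOMEM;
    /* driver-specific setup that may fail goes here */
    if (iscsi_host_add(shost, NULL))
            goto free_host;
    return 0;

    free_host:
    iscsi_host_free(shost);          /* now also destroys ihost->workq */
    return -ENODEV;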
Link: https://lore.kernel.org/r/20210207044608.27585-5-michael.christie@oracle.com Reviewed-by: Lee Duncan lduncan@suse.com Signed-off-by: Mike Christie michael.christie@oracle.com Signed-off-by: Martin K. Petersen martin.petersen@oracle.com Signed-off-by: yangerkun yangerkun@huawei.com Reviewed-by: Jason Yan yanaijie@huawei.com Signed-off-by: Yang Yingliang yangyingliang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- drivers/scsi/libiscsi.c | 5 +++-- 1 file changed, 3 insertions(+), 2 deletions(-)
diff --git a/drivers/scsi/libiscsi.c b/drivers/scsi/libiscsi.c index 1c074f1ae4c3..0b832a5bb04a 100644 --- a/drivers/scsi/libiscsi.c +++ b/drivers/scsi/libiscsi.c @@ -2814,8 +2814,6 @@ void iscsi_host_remove(struct Scsi_Host *shost) flush_signals(current);
scsi_remove_host(shost); - if (ihost->workq) - destroy_workqueue(ihost->workq); } EXPORT_SYMBOL_GPL(iscsi_host_remove);
@@ -2823,6 +2821,9 @@ void iscsi_host_free(struct Scsi_Host *shost) { struct iscsi_host *ihost = shost_priv(shost);
+ if (ihost->workq) + destroy_workqueue(ihost->workq); + kfree(ihost->netdev); kfree(ihost->hwaddress); kfree(ihost->initiatorname);
From: Mike Christie michael.christie@oracle.com
mainline inclusion from mainline-v5.12-rc1 commit b4046922b3c0740ad50a6e9c59e12f4dc43946d4 category: bugfix bugzilla: 46895 CVE: NA ---------------------------
This patch just breaks out the code that calculates the number of SCSI cmds that will be used for a SCSI session. It also adds a check that we don't go over the host's can_queue value.
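A worked example of the helper, assuming ISCSI_MGMT_CMDS_MAX is 15 (the driver values here are illustrative only):

    shost->can_queue = 128;
    scsi_cmds = iscsi_host_get_max_scsi_cmds(shost, 1000);
    /*
     * 1000 is not a power of two, so the total is rounded down to 512;
     * 512 - 15 = 497 SCSI cmds exceeds can_queue, so the total is
     * clamped to 128 and rechecked, and the helper returns 128 - 15 = 113.
     */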
Link: https://lore.kernel.org/r/20210207044608.27585-6-michael.christie@oracle.com Reviewed-by: Lee Duncan lduncan@suse.com Signed-off-by: Mike Christie michael.christie@oracle.com Signed-off-by: Martin K. Petersen martin.petersen@oracle.com Signed-off-by: yangerkun yangerkun@huawei.com Reviewed-by: Jason Yan yanaijie@huawei.com Signed-off-by: Yang Yingliang yangyingliang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- drivers/scsi/libiscsi.c | 86 ++++++++++++++++++++++++++--------------- include/scsi/libiscsi.h | 2 + 2 files changed, 56 insertions(+), 32 deletions(-)
diff --git a/drivers/scsi/libiscsi.c b/drivers/scsi/libiscsi.c index 0b832a5bb04a..b48427a96a0a 100644 --- a/drivers/scsi/libiscsi.c +++ b/drivers/scsi/libiscsi.c @@ -2724,6 +2724,56 @@ void iscsi_pool_free(struct iscsi_pool *q) } EXPORT_SYMBOL_GPL(iscsi_pool_free);
+int iscsi_host_get_max_scsi_cmds(struct Scsi_Host *shost, + uint16_t requested_cmds_max) +{ + int scsi_cmds, total_cmds = requested_cmds_max; + +check: + if (!total_cmds) + total_cmds = ISCSI_DEF_XMIT_CMDS_MAX; + /* + * The iscsi layer needs some tasks for nop handling and tmfs, + * so the cmds_max must at least be greater than ISCSI_MGMT_CMDS_MAX + * + 1 command for scsi IO. + */ + if (total_cmds < ISCSI_TOTAL_CMDS_MIN) { + printk(KERN_ERR "iscsi: invalid max cmds of %d. Must be a power of two that is at least %d.\n", + total_cmds, ISCSI_TOTAL_CMDS_MIN); + return -EINVAL; + } + + if (total_cmds > ISCSI_TOTAL_CMDS_MAX) { + printk(KERN_INFO "iscsi: invalid max cmds of %d. Must be a power of 2 less than or equal to %d. Using %d.\n", + requested_cmds_max, ISCSI_TOTAL_CMDS_MAX, + ISCSI_TOTAL_CMDS_MAX); + total_cmds = ISCSI_TOTAL_CMDS_MAX; + } + + if (!is_power_of_2(total_cmds)) { + total_cmds = rounddown_pow_of_two(total_cmds); + if (total_cmds < ISCSI_TOTAL_CMDS_MIN) { + printk(KERN_ERR "iscsi: invalid max cmds of %d. Must be a power of 2 greater than %d.\n", requested_cmds_max, ISCSI_TOTAL_CMDS_MIN); + return -EINVAL; + } + + printk(KERN_INFO "iscsi: invalid max cmds %d. Must be a power of 2. Rounding max cmds down to %d.\n", + requested_cmds_max, total_cmds); + } + + scsi_cmds = total_cmds - ISCSI_MGMT_CMDS_MAX; + if (shost->can_queue && scsi_cmds > shost->can_queue) { + total_cmds = shost->can_queue; + + printk(KERN_INFO "iscsi: requested max cmds %u is higher than driver limit. Using driver limit %u\n", + requested_cmds_max, shost->can_queue); + goto check; + } + + return scsi_cmds; +} +EXPORT_SYMBOL_GPL(iscsi_host_get_max_scsi_cmds); + /** * iscsi_host_add - add host to system * @shost: scsi host @@ -2877,7 +2927,7 @@ iscsi_session_setup(struct iscsi_transport *iscsit, struct Scsi_Host *shost, struct iscsi_host *ihost = shost_priv(shost); struct iscsi_session *session; struct iscsi_cls_session *cls_session; - int cmd_i, scsi_cmds, total_cmds = cmds_max; + int cmd_i, scsi_cmds; unsigned long flags;
spin_lock_irqsave(&ihost->lock, flags); @@ -2888,37 +2938,9 @@ iscsi_session_setup(struct iscsi_transport *iscsit, struct Scsi_Host *shost, ihost->num_sessions++; spin_unlock_irqrestore(&ihost->lock, flags);
- if (!total_cmds) - total_cmds = ISCSI_DEF_XMIT_CMDS_MAX; - /* - * The iscsi layer needs some tasks for nop handling and tmfs, - * so the cmds_max must at least be greater than ISCSI_MGMT_CMDS_MAX - * + 1 command for scsi IO. - */ - if (total_cmds < ISCSI_TOTAL_CMDS_MIN) { - printk(KERN_ERR "iscsi: invalid can_queue of %d. can_queue " - "must be a power of two that is at least %d.\n", - total_cmds, ISCSI_TOTAL_CMDS_MIN); + scsi_cmds = iscsi_host_get_max_scsi_cmds(shost, cmds_max); + if (scsi_cmds < 0) goto dec_session_count; - } - - if (total_cmds > ISCSI_TOTAL_CMDS_MAX) { - printk(KERN_ERR "iscsi: invalid can_queue of %d. can_queue " - "must be a power of 2 less than or equal to %d.\n", - cmds_max, ISCSI_TOTAL_CMDS_MAX); - total_cmds = ISCSI_TOTAL_CMDS_MAX; - } - - if (!is_power_of_2(total_cmds)) { - printk(KERN_ERR "iscsi: invalid can_queue of %d. can_queue " - "must be a power of 2.\n", total_cmds); - total_cmds = rounddown_pow_of_two(total_cmds); - if (total_cmds < ISCSI_TOTAL_CMDS_MIN) - goto dec_session_count; - printk(KERN_INFO "iscsi: Rounding can_queue to %d.\n", - total_cmds); - } - scsi_cmds = total_cmds - ISCSI_MGMT_CMDS_MAX;
cls_session = iscsi_alloc_session(shost, iscsit, sizeof(struct iscsi_session) + @@ -2934,7 +2956,7 @@ iscsi_session_setup(struct iscsi_transport *iscsit, struct Scsi_Host *shost, session->lu_reset_timeout = 15; session->abort_timeout = 10; session->scsi_cmds_max = scsi_cmds; - session->cmds_max = total_cmds; + session->cmds_max = scsi_cmds + ISCSI_MGMT_CMDS_MAX; session->queued_cmdsn = session->cmdsn = initial_cmdsn; session->exp_cmdsn = initial_cmdsn + 1; session->max_cmdsn = initial_cmdsn + 1; diff --git a/include/scsi/libiscsi.h b/include/scsi/libiscsi.h index c71297f7078c..7ae6867316d3 100644 --- a/include/scsi/libiscsi.h +++ b/include/scsi/libiscsi.h @@ -409,6 +409,8 @@ extern struct Scsi_Host *iscsi_host_alloc(struct scsi_host_template *sht, extern void iscsi_host_remove(struct Scsi_Host *shost); extern void iscsi_host_free(struct Scsi_Host *shost); extern int iscsi_target_alloc(struct scsi_target *starget); +extern int iscsi_host_get_max_scsi_cmds(struct Scsi_Host *shost, + uint16_t requested_cmds_max);
/* * session management
From: Mike Christie michael.christie@oracle.com
mainline inclusion from mainline-v5.12-rc1 commit 25c400db2083732a5fbdd72f0d3a0337119b2fa5 category: bugfix bugzilla: 46895 CVE: NA ---------------------------
We are setting the shost's can_queue after we add the host which is too late, because the SCSI midlayer will have allocated the tag set based on the can_queue value at that time. This patch has us use the iscsi_host_get_max_scsi_cmds() helper to figure out the number of SCSI cmds.
It also fixes up the template can_queue so it reflects the max SCSI cmds we can support like how other drivers work.
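For context, the midlayer snapshots can_queue when the host is added; roughly, paraphrasing scsi_mq_setup_tags() in this kernel:

    /* called from scsi_add_host(), i.e. from iscsi_host_add() */
    shost->tag_set.queue_depth = shost->can_queue;
    /*
     * Changing shost->can_queue after this point does not grow the
     * already-allocated blk-mq tag set, hence the reordering this
     * patch makes.
     */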
Link: https://lore.kernel.org/r/20210207044608.27585-7-michael.christie@oracle.com Reviewed-by: Lee Duncan lduncan@suse.com Signed-off-by: Mike Christie michael.christie@oracle.com Signed-off-by: Martin K. Petersen martin.petersen@oracle.com Signed-off-by: yangerkun yangerkun@huawei.com Reviewed-by: Jason Yan yanaijie@huawei.com Signed-off-by: Yang Yingliang yangyingliang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- drivers/scsi/iscsi_tcp.c | 9 +++++++-- 1 file changed, 7 insertions(+), 2 deletions(-)
diff --git a/drivers/scsi/iscsi_tcp.c b/drivers/scsi/iscsi_tcp.c index 93ce99019808..241b1a310519 100644 --- a/drivers/scsi/iscsi_tcp.c +++ b/drivers/scsi/iscsi_tcp.c @@ -852,6 +852,7 @@ iscsi_sw_tcp_session_create(struct iscsi_endpoint *ep, uint16_t cmds_max, struct iscsi_session *session; struct iscsi_sw_tcp_host *tcp_sw_host; struct Scsi_Host *shost; + int rc;
if (ep) { printk(KERN_ERR "iscsi_tcp: invalid ep %p.\n", ep); @@ -869,6 +870,11 @@ iscsi_sw_tcp_session_create(struct iscsi_endpoint *ep, uint16_t cmds_max, shost->max_channel = 0; shost->max_cmd_len = SCSI_MAX_VARLEN_CDB_SIZE;
+ rc = iscsi_host_get_max_scsi_cmds(shost, cmds_max); + if (rc < 0) + goto free_host; + shost->can_queue = rc; + if (iscsi_host_add(shost, NULL)) goto free_host;
@@ -883,7 +889,6 @@ iscsi_sw_tcp_session_create(struct iscsi_endpoint *ep, uint16_t cmds_max, tcp_sw_host = iscsi_host_priv(shost); tcp_sw_host->session = session;
- shost->can_queue = session->scsi_cmds_max; if (iscsi_tcp_r2tpool_alloc(session)) goto remove_session; return cls_session; @@ -992,7 +997,7 @@ static struct scsi_host_template iscsi_sw_tcp_sht = { .name = "iSCSI Initiator over TCP/IP", .queuecommand = iscsi_queuecommand, .change_queue_depth = scsi_change_queue_depth, - .can_queue = ISCSI_DEF_XMIT_CMDS_MAX - 1, + .can_queue = ISCSI_TOTAL_CMDS_MAX, .sg_tablesize = 4096, .max_sectors = 0xFFFF, .cmd_per_lun = ISCSI_DEF_CMD_PER_LUN,
From: Mike Christie michael.christie@oracle.com
mainline inclusion from mainline-v5.12-rc1 commit c8447e4c2eb77dbb96012ae96e7c83179cecf880 category: bugfix bugzilla: 46895 CVE: NA ---------------------------
If we lose the session and then relogin, but the new cmdsn window has shrunk (due to something like an admin changing a setting), we will have the old exp/max_cmdsn values and will never be able to update them. For example, max_cmdsn would be 64, but if the user set the window to be smaller on the target, the target could try to return the max_cmdsn as 32. We will see that new max_cmdsn in the rsp, but because it is lower than the old max_cmdsn from when the window was larger, we will not update it.
So this patch has us reset the window values during session cleanup so they can be updated after a new login.
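The stale values stick because iscsi_update_cmdsn() only ever moves the window forward; roughly, condensed from libiscsi.c:

    if (max_cmdsn != session->max_cmdsn &&
        !iscsi_sna_lt(max_cmdsn, session->max_cmdsn))
            session->max_cmdsn = max_cmdsn;
    /*
     * e.g. a stale session->max_cmdsn of 64 from before the relogin:
     * a new target max_cmdsn of 32 is SNA-less-than 64, so the update
     * is skipped forever unless the values are reset during cleanup.
     */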
Link: https://lore.kernel.org/r/20210207044608.27585-8-michael.christie@oracle.com Reviewed-by: Lee Duncan lduncan@suse.com Signed-off-by: Mike Christie michael.christie@oracle.com Signed-off-by: Martin K. Petersen martin.petersen@oracle.com Signed-off-by: yangerkun yangerkun@huawei.com Reviewed-by: Jason Yan yanaijie@huawei.com Signed-off-by: Yang Yingliang yangyingliang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- drivers/scsi/libiscsi.c | 7 +++++++ 1 file changed, 7 insertions(+)
diff --git a/drivers/scsi/libiscsi.c b/drivers/scsi/libiscsi.c index b48427a96a0a..afa8091e19dd 100644 --- a/drivers/scsi/libiscsi.c +++ b/drivers/scsi/libiscsi.c @@ -3345,6 +3345,13 @@ int iscsi_conn_bind(struct iscsi_cls_session *cls_session, session->leadconn = conn; spin_unlock_bh(&session->frwd_lock);
+ /* + * The target could have reduced it's window size between logins, so + * we have to reset max/exp cmdsn so we can see the new values. + */ + spin_lock_bh(&session->back_lock); + session->max_cmdsn = session->exp_cmdsn = session->cmdsn + 1; + spin_unlock_bh(&session->back_lock); /* * Unblock xmitworker(), Login Phase will pass through. */
From: yangerkun yangerkun@huawei.com
hulk inclusion category: bugfix bugzilla: 46895 CVE: NA ---------------------------

Add back the taskqueuelock field that was removed by "scsi: libiscsi: Drop taskqueuelock", so that the layout of struct iscsi_conn stays unchanged and KABI is preserved. The lock itself is no longer used; the queues remain protected by the session frwd_lock.
Signed-off-by: yangerkun yangerkun@huawei.com Reviewed-by: Jason Yan yanaijie@huawei.com Signed-off-by: Yang Yingliang yangyingliang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- include/scsi/libiscsi.h | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/include/scsi/libiscsi.h b/include/scsi/libiscsi.h index 7ae6867316d3..c14d739c8887 100644 --- a/include/scsi/libiscsi.h +++ b/include/scsi/libiscsi.h @@ -200,7 +200,7 @@ struct iscsi_conn { struct iscsi_task *task; /* xmit task in progress */
/* xmit */ - /* items must be added/deleted under frwd lock */ + spinlock_t taskqueuelock; /* protects the next three lists */ struct list_head mgmtqueue; /* mgmt (control) xmit queue */ struct list_head cmdqueue; /* data-path cmd queue */ struct list_head requeue; /* tasks needing another run */
From: Theodore Ts'o tytso@mit.edu
mainline inclusion from mainline-5.11-rc4 commit 5a3b590d4b2db187faa6f06adc9a53d6199fb1f9 category: bugfix bugzilla: 47374 CVE: NA ---------------------------
When the first file is opened, ext4 samples the mountpoint of the filesystem in 64 bytes of the super block. It does so using strlcpy(); this means that the remaining bytes in the super block string buffer are untouched. If the mount point previously had a longer path than the current one, it can be reconstructed.
Consider the case where the fs was mounted to "/media/johnjdeveloper" and later to "/". The super block buffer then contains "/\x00edia/johnjdeveloper".
This case was seen in the wild and caused confusion as to how the name of a developer ends up on the super block of a filesystem used in production...
Fix this by using strncpy() instead of strlcpy(). The superblock field is defined to be a fixed-size char array, and it is already marked using __nonstring in fs/ext4/ext4.h. The consumer of the field in e2fsprogs already assumes that, in the case of a 64+ byte mount path, s_last_mounted will not be NUL terminated.
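A minimal user-space sketch of the difference (my_strlcpy is a local stand-in for the kernel's strlcpy() semantics, since glibc historically lacks it):

    #include <stdio.h>
    #include <string.h>

    /* copies at most size-1 bytes plus a NUL; leaves the tail of dst untouched */
    static size_t my_strlcpy(char *dst, const char *src, size_t size)
    {
            size_t len = strlen(src);

            if (size) {
                    size_t n = len < size - 1 ? len : size - 1;

                    memcpy(dst, src, n);
                    dst[n] = '\0';
            }
            return len;
    }

    int main(void)
    {
            char mnt[64];

            my_strlcpy(mnt, "/media/johnjdeveloper", sizeof(mnt));
            my_strlcpy(mnt, "/", sizeof(mnt));  /* "remount" to "/" */
            printf("%s\n", mnt + 2);            /* prints "edia/johnjdeveloper" */

            strncpy(mnt, "/", sizeof(mnt));     /* zero-pads the remainder */
            printf("%d\n", mnt[2]);             /* prints 0 */
            return 0;
    }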
Link: https://lore.kernel.org/r/X9ujIOJG/HqMr88R@mit.edu Reported-by: Richard Weinberger richard@nod.at Signed-off-by: Theodore Ts'o tytso@mit.edu Cc: stable@kernel.org Signed-off-by: Zhang Yi yi.zhang@huawei.com Reviewed-by: Ye bin yebin10@huawei.com Signed-off-by: Yang Yingliang yangyingliang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- fs/ext4/file.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/fs/ext4/file.c b/fs/ext4/file.c index 4e791056a860..21b0fc0d6ffe 100644 --- a/fs/ext4/file.c +++ b/fs/ext4/file.c @@ -433,7 +433,7 @@ static int ext4_sample_last_mounted(struct super_block *sb, if (err) goto out_journal; lock_buffer(sbi->s_sbh); - strlcpy(sbi->s_es->s_last_mounted, cp, + strncpy(sbi->s_es->s_last_mounted, cp, sizeof(sbi->s_es->s_last_mounted)); ext4_superblock_csum_set(sb); unlock_buffer(sbi->s_sbh);
From: Ye Bin yebin10@huawei.com
hulk inclusion category: bugfix bugzilla: NA CVE: NA
-----------------------------------------------
Fixes: e5389b288cf0 ("ext4: make ext4_abort() use __ext4_error()") Fixes: 47e86bedca58 ("ext4: report error to userspace by netlink") Signed-off-by: Ye Bin yebin10@huawei.com Reviewed-by: zhangyi (F) yi.zhang@huawei.com Signed-off-by: Yang Yingliang yangyingliang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- fs/ext4/super.c | 3 +-- 1 file changed, 1 insertion(+), 2 deletions(-)
diff --git a/fs/ext4/super.c b/fs/ext4/super.c index 3b2f7f7ea8cb..655ba77db225 100644 --- a/fs/ext4/super.c +++ b/fs/ext4/super.c @@ -89,7 +89,6 @@ static void ext4_unregister_li_request(struct super_block *sb); static void ext4_clear_request_list(void); static struct inode *ext4_get_journal_inode(struct super_block *sb, unsigned int journal_inum); -static void ext4_netlink_send_info(struct super_block *sb, int ext4_errno); static struct sock *ext4nl;
/* @@ -590,7 +589,7 @@ static void ext4_handle_error(struct super_block *sb, bool force_ro, int error, smp_wmb(); sb->s_flags |= SB_RDONLY; out: - ext4_netlink_send_info(sb, 1); + ext4_netlink_send_info(sb, force_ro ? 2 : 1); }
static void flush_stashed_error_work(struct work_struct *work)
From: Yang Yingliang yangyingliang@huawei.com
hulk inclusion category: bugfix bugzilla: 47452 CVE: NA
-------------------------------------------------
In the fast path of slab_alloc_node(), interrupts are not disabled. After fetching the object and page, they may change because memory can be allocated in interrupt context, which causes check_valid_pointer() to wrongly report a corrupted freelist. So compare the next object with the magic number directly to avoid the crash caused by using the connector.
[22635000.696206] Unable to handle kernel paging request at virtual address 000000030000004c
[22635000.704370] Mem abort info:
[22635000.707417]   ESR = 0x96000004
[22635000.710723]   Exception class = DABT (current EL), IL = 32 bits
[22635000.716878]   SET = 0, FnV = 0
[22635000.720183]   EA = 0, S1PTW = 0
[22635000.723575] Data abort info:
[22635000.726709]   ISV = 0, ISS = 0x00000004
[22635000.730792]   CM = 0, WnR = 0
[22635000.734011] user pgtable: 4k pages, 48-bit VAs, pgdp = 00000000cfe46aff
[22635000.740859] [000000030000004c] pgd=0000000000000000
[22635000.745980] Internal error: Oops: 96000004 [#1] SMP
[22635000.751094] Process thread-pool-6 (pid: 44316, stack limit = 0x00000000378f8b35)
[22635000.758717] CPU: 49 PID: 44316 Comm: thread-pool-6 Kdump: loaded Tainted: G OE K 4.19.36 #1
[22635000.771695] Hardware name: Huawei TaiShan 2280 V2/BC82AMDD, BIOS 1.05 09/18/2019
[22635000.779314] pstate: 60400009 (nZCv daif +PAN -UAO)
[22635000.784348] pc : __kmalloc_node_track_caller+0x214/0x348
[22635000.789894] lr : __kmalloc_node_track_caller+0x6c/0x348
Besides, add a check that the page pointer is valid before using it to print the message, and add a WARN to track the stack.
Signed-off-by: Yang Yingliang yangyingliang@huawei.com Reviewed-by: Kefeng Wang wangkefeng.wang@huawei.com Signed-off-by: Yang Yingliang yangyingliang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- mm/slub.c | 12 ++++++++---- 1 file changed, 8 insertions(+), 4 deletions(-)
diff --git a/mm/slub.c b/mm/slub.c index 45b5353e0938..4b681e71ba8c 100644 --- a/mm/slub.c +++ b/mm/slub.c @@ -1359,15 +1359,19 @@ static bool isolate_corrupted_freelist(struct kmem_cache *s, struct page *page, void **freelist, void *next) { if (sysctl_isolate_corrupted_freelist && - (!check_valid_pointer(s, page, next))) { + (unsigned long)next == 0x30000004cul) { /* Need caller make sure freelist is not NULL */ *freelist = NULL; isolate_cnt++;
slab_fix(s, "Freelist corrupt,isolate corrupted freechain."); - pr_info("freelist=%lx object=%lx, page->base=%lx, page->objects=%u, objsize=%u, s->offset=%d page:%llx\n", - (long)freelist, (long)next, (long)page_address(page), page->objects, s->size, s->offset, (u64)page); - + if (page) + pr_info("freelist=%lx object=%lx, page->base=%lx, page->objects=%u, objsize=%u, s->offset=%d page:%llx\n", + (long)freelist, (long)next, (long)page_address(page), page->objects, s->size, s->offset, (u64)page); + else + pr_info("freelist=%lx object=%lx, objsize=%u, s->offset=%d\n", + (long)freelist, (long)next, s->size, s->offset); + WARN_ON(1); return true; }
From: Lu Jialin lujialin4@huawei.com
hulk inclusion category: feature/cgroups bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=7 CVE: NA
--------
Signed-off-by: Lu Jialin lujialin4@huawei.com Reviewed-by: Xiu Jianfeng xiujianfeng@huawei.com Signed-off-by: Yang Yingliang yangyingliang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- arch/x86/configs/hulk_defconfig | 1 + arch/x86/configs/openeuler_defconfig | 2 +- 2 files changed, 2 insertions(+), 1 deletion(-)
diff --git a/arch/x86/configs/hulk_defconfig b/arch/x86/configs/hulk_defconfig index e11831046b41..9d6ba46ee783 100644 --- a/arch/x86/configs/hulk_defconfig +++ b/arch/x86/configs/hulk_defconfig @@ -150,6 +150,7 @@ CONFIG_CGROUP_PERF=y CONFIG_CGROUP_BPF=y # CONFIG_CGROUP_DEBUG is not set CONFIG_SOCK_CGROUP_DATA=y +CONFIG_CGROUP_FILES=y CONFIG_NAMESPACES=y CONFIG_UTS_NS=y CONFIG_IPC_NS=y diff --git a/arch/x86/configs/openeuler_defconfig b/arch/x86/configs/openeuler_defconfig index e4891dde8610..833754cc996b 100644 --- a/arch/x86/configs/openeuler_defconfig +++ b/arch/x86/configs/openeuler_defconfig @@ -144,7 +144,7 @@ CONFIG_CGROUP_PERF=y CONFIG_CGROUP_BPF=y # CONFIG_CGROUP_DEBUG is not set CONFIG_SOCK_CGROUP_DATA=y -# CONFIG_CGROUP_FILES is not set +CONFIG_CGROUP_FILES=y CONFIG_NAMESPACES=y CONFIG_UTS_NS=y CONFIG_IPC_NS=y
From: Oscar Salvador osalvador@suse.de
mainline inclusion from mainline-v5.11-rc1 commit 3f4b815a439adfb8f238335612c4b28bc10084d8 category: bugfix bugzilla: NA CVE: NA
-------------------------------------------------
Currently, we return -EIO when we fail to migrate the page.
Migration failures are rather transient as they can happen for several reasons, e.g.: a high page refcount bump, mapping->migrate_page failing, etc. All of these mean that at that time the page could not be migrated, but that has nothing to do with an EIO error.
Let us return -EBUSY instead, as we do when we fail to isolate the page.
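A hedged sketch of what the new contract means for a caller (soft_offline_page() is the real entry point; the retry policy is illustrative):

    ret = soft_offline_page(page, 0);
    if (ret == -EBUSY) {
            /* transient: isolation or migration failed this time;
             * the caller may back off and try again later */
    } else if (ret) {
            /* a hard error; retrying will not help */
    }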
While at it, let us remove the "ret" print as its value does not change.
Link: https://lkml.kernel.org/r/20201209092818.30417-1-osalvador@suse.de Signed-off-by: Oscar Salvador osalvador@suse.de Acked-by: Naoya Horiguchi naoya.horiguchi@nec.com Acked-by: Vlastimil Babka vbabka@suse.cz Cc: David Hildenbrand david@redhat.com Signed-off-by: Andrew Morton akpm@linux-foundation.org Signed-off-by: Linus Torvalds torvalds@linux-foundation.org Signed-off-by: Shixin Liu liushixin2@huawei.com Reviewed-by: Kefeng Wang wangkefeng.wang@huawei.com Signed-off-by: Yang Yingliang yangyingliang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- mm/memory-failure.c | 8 ++++---- 1 file changed, 4 insertions(+), 4 deletions(-)
diff --git a/mm/memory-failure.c b/mm/memory-failure.c index fa091bfb5943..d15ffccb2db4 100644 --- a/mm/memory-failure.c +++ b/mm/memory-failure.c @@ -1730,7 +1730,7 @@ static int soft_offline_huge_page(struct page *page, int flags) if (!list_empty(&pagelist)) putback_movable_pages(&pagelist); if (ret > 0) - ret = -EIO; + ret = -EBUSY; } else { /* * We set PG_hwpoison only when the migration source hugepage @@ -1821,11 +1821,11 @@ static int __soft_offline_page(struct page *page, int flags) pr_info("soft offline: %#lx: migration failed %d, type %lx (%pGp)\n", pfn, ret, page->flags, &page->flags); if (ret > 0) - ret = -EIO; + ret = -EBUSY; } } else { - pr_info("soft offline: %#lx: isolation failed: %d, page count %d, type %lx (%pGp)\n", - pfn, ret, page_count(page), page->flags, &page->flags); + pr_info("soft offline: %#lx: isolation failed, page count %d, type %lx (%pGp)\n", + pfn, page_count(page), page->flags, &page->flags); } return ret; }
From: Vlastimil Babka vbabka@suse.cz
mainline inclusion from mainline-5.4-rc3 commit 59bb47985c1db229ccff8c5deebecd54fc77d2a9 category: bugfix bugzilla: 51349 CVE: NA
-------------------------------------------------
In most configurations, kmalloc() happens to return naturally aligned (i.e. aligned to the block size itself) blocks for power of two sizes.
That means some kmalloc() users might unknowingly rely on that alignment, until stuff breaks when the kernel is built with e.g. CONFIG_SLUB_DEBUG or CONFIG_SLOB, and blocks stop being aligned. Then developers have to devise workarounds such as their own kmem caches with specified alignment [1], which is not always practical, as recently evidenced in [2].
The topic has been discussed at LSF/MM 2019 [3]. Adding a 'kmalloc_aligned()' variant would not help with code unknowingly relying on the implicit alignment. For slab implementations it would either require creating more kmalloc caches, or allocate a larger size and only give back part of it. That would be wasteful, especially with a generic alignment parameter (in contrast with a fixed alignment to size).
Ideally we should provide mm users what they need without difficult workarounds or their own reimplementations, so let's make the kmalloc() alignment to size explicitly guaranteed for power-of-two sizes under all configurations; a short sketch of the resulting guarantee follows the per-allocator notes below. What does this mean for the three available allocators?
* SLAB object layout happens to be mostly unchanged by the patch. The implicitly provided alignment could be compromised with CONFIG_DEBUG_SLAB due to redzoning, however SLAB disables redzoning for caches with alignment larger than unsigned long long. Practically on at least x86 this includes kmalloc caches as they use cache line alignment, which is larger than that. Still, this patch ensures alignment on all arches and cache sizes.
* SLUB layout is also unchanged unless redzoning is enabled through CONFIG_SLUB_DEBUG and boot parameter for the particular kmalloc cache. With this patch, explicit alignment is guaranteed with redzoning as well. This will result in more memory being wasted, but that should be acceptable in a debugging scenario.
* SLOB has no implicit alignment so this patch adds it explicitly for kmalloc(). The potential downside is increased fragmentation. While pathological allocation scenarios are certainly possible, in my testing, after booting an x86_64 kernel+userspace with virtme, around 16MB of memory was consumed by slab pages both before and after the patch, with the difference in the noise.
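A minimal in-kernel sketch of the guarantee described above (illustration only, not part of the patch):

    static int check_kmalloc_alignment(void)
    {
            void *buf = kmalloc(512, GFP_KERNEL);  /* power-of-two size */

            if (!buf)
                    return -ENOMEM;
            /* holds on SLAB, SLUB and SLOB alike once this patch is applied */
            WARN_ON(!IS_ALIGNED((unsigned long)buf, 512));
            kfree(buf);
            return 0;
    }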
[1] https://lore.kernel.org/linux-btrfs/c3157c8e8e0e7588312b40c853f65c02fe6c957a... [2] https://lore.kernel.org/linux-fsdevel/20190225040904.5557-1-ming.lei@redhat.... [3] https://lwn.net/Articles/787740/
[akpm@linux-foundation.org: documentation fixlet, per Matthew] Link: http://lkml.kernel.org/r/20190826111627.7505-3-vbabka@suse.cz Signed-off-by: Vlastimil Babka vbabka@suse.cz Reviewed-by: Matthew Wilcox (Oracle) willy@infradead.org Acked-by: Michal Hocko mhocko@suse.com Acked-by: Kirill A. Shutemov kirill.shutemov@linux.intel.com Acked-by: Christoph Hellwig hch@lst.de Cc: David Sterba dsterba@suse.cz Cc: Christoph Lameter cl@linux.com Cc: Pekka Enberg penberg@kernel.org Cc: David Rientjes rientjes@google.com Cc: Ming Lei ming.lei@redhat.com Cc: Dave Chinner david@fromorbit.com Cc: "Darrick J . Wong" darrick.wong@oracle.com Cc: Christoph Hellwig hch@lst.de Cc: James Bottomley James.Bottomley@HansenPartnership.com Cc: Vlastimil Babka vbabka@suse.cz Cc: Joonsoo Kim iamjoonsoo.kim@lge.com Signed-off-by: Andrew Morton akpm@linux-foundation.org Signed-off-by: Linus Torvalds torvalds@linux-foundation.org (cherry picked from commit 59bb47985c1db229ccff8c5deebecd54fc77d2a9) Signed-off-by: Yongqiang Liu liuyongqiang13@huawei.com Reviewed-by: Kefeng Wang wangkefeng.wang@huawei.com Signed-off-by: Yang Yingliang yangyingliang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- include/linux/slab.h | 4 ++++ mm/slab_common.c | 11 ++++++++++- mm/slob.c | 41 ++++++++++++++++++++++++++++++----------- 3 files changed, 44 insertions(+), 12 deletions(-)
diff --git a/include/linux/slab.h b/include/linux/slab.h index d6393413ef09..e4f5d7f88c16 100644 --- a/include/linux/slab.h +++ b/include/linux/slab.h @@ -457,6 +457,10 @@ static __always_inline void *kmalloc_large(size_t size, gfp_t flags) * kmalloc is the normal method of allocating memory * for objects smaller than page size in the kernel. * + * The allocated object address is aligned to at least ARCH_KMALLOC_MINALIGN + * bytes. For @size of power of two bytes, the alignment is also guaranteed + * to be at least to the size. + * * The @flags argument may be one of: * * %GFP_USER - Allocate memory on behalf of user. May sleep. diff --git a/mm/slab_common.c b/mm/slab_common.c index a94b9981eb17..9b8d6a9f1e97 100644 --- a/mm/slab_common.c +++ b/mm/slab_common.c @@ -970,10 +970,19 @@ void __init create_boot_cache(struct kmem_cache *s, const char *name, unsigned int useroffset, unsigned int usersize) { int err; + unsigned int align = ARCH_KMALLOC_MINALIGN;
s->name = name; s->size = s->object_size = size; - s->align = calculate_alignment(flags, ARCH_KMALLOC_MINALIGN, size); + + /* + * For power of two sizes, guarantee natural alignment for kmalloc + * caches, regardless of SL*B debugging options. + */ + if (is_power_of_2(size)) + align = max(align, size); + s->align = calculate_alignment(flags, align, size); + s->useroffset = useroffset; s->usersize = usersize;
diff --git a/mm/slob.c b/mm/slob.c index 307c2c9feb44..fdf284009be9 100644 --- a/mm/slob.c +++ b/mm/slob.c @@ -215,7 +215,7 @@ static void slob_free_pages(void *b, int order) /* * Allocate a slob block within a given slob_page sp. */ -static void *slob_page_alloc(struct page *sp, size_t size, int align) +static void *slob_page_alloc(struct page *sp, size_t size, int align, int align_offset) { slob_t *prev, *cur, *aligned = NULL; int delta = 0, units = SLOB_UNITS(size); @@ -223,8 +223,17 @@ static void *slob_page_alloc(struct page *sp, size_t size, int align) for (prev = NULL, cur = sp->freelist; ; prev = cur, cur = slob_next(cur)) { slobidx_t avail = slob_units(cur);
+ /* + * 'aligned' will hold the address of the slob block so that the + * address 'aligned'+'align_offset' is aligned according to the + * 'align' parameter. This is for kmalloc() which prepends the + * allocated block with its size, so that the block itself is + * aligned when needed. + */ if (align) { - aligned = (slob_t *)ALIGN((unsigned long)cur, align); + aligned = (slob_t *) + (ALIGN((unsigned long)cur + align_offset, align) + - align_offset); delta = aligned - cur; } if (avail >= units + delta) { /* room enough? */ @@ -266,7 +275,8 @@ static void *slob_page_alloc(struct page *sp, size_t size, int align) /* * slob_alloc: entry point into the slob allocator. */ -static void *slob_alloc(size_t size, gfp_t gfp, int align, int node) +static void *slob_alloc(size_t size, gfp_t gfp, int align, int node, + int align_offset) { struct page *sp; struct list_head *prev; @@ -298,7 +308,7 @@ static void *slob_alloc(size_t size, gfp_t gfp, int align, int node)
/* Attempt to alloc */ prev = sp->lru.prev; - b = slob_page_alloc(sp, size, align); + b = slob_page_alloc(sp, size, align, align_offset); if (!b) continue;
@@ -326,7 +336,7 @@ static void *slob_alloc(size_t size, gfp_t gfp, int align, int node) INIT_LIST_HEAD(&sp->lru); set_slob(b, SLOB_UNITS(PAGE_SIZE), b + SLOB_UNITS(PAGE_SIZE)); set_slob_page_free(sp, slob_list); - b = slob_page_alloc(sp, size, align); + b = slob_page_alloc(sp, size, align, align_offset); BUG_ON(!b); spin_unlock_irqrestore(&slob_lock, flags); } @@ -428,7 +438,7 @@ static __always_inline void * __do_kmalloc_node(size_t size, gfp_t gfp, int node, unsigned long caller) { unsigned int *m; - int align = max_t(size_t, ARCH_KMALLOC_MINALIGN, ARCH_SLAB_MINALIGN); + int minalign = max_t(size_t, ARCH_KMALLOC_MINALIGN, ARCH_SLAB_MINALIGN); void *ret;
gfp &= gfp_allowed_mask; @@ -436,19 +446,28 @@ __do_kmalloc_node(size_t size, gfp_t gfp, int node, unsigned long caller) fs_reclaim_acquire(gfp); fs_reclaim_release(gfp);
- if (size < PAGE_SIZE - align) { + if (size < PAGE_SIZE - minalign) { + int align = minalign; + + /* + * For power of two sizes, guarantee natural alignment for + * kmalloc()'d objects. + */ + if (is_power_of_2(size)) + align = max(minalign, (int) size); + if (!size) return ZERO_SIZE_PTR;
- m = slob_alloc(size + align, gfp, align, node); + m = slob_alloc(size + minalign, gfp, align, node, minalign);
if (!m) return NULL; *m = size; - ret = (void *)m + align; + ret = (void *)m + minalign;
trace_kmalloc_node(caller, ret, - size, size + align, gfp, node); + size, size + minalign, gfp, node); } else { unsigned int order = get_order(size);
@@ -544,7 +563,7 @@ static void *slob_alloc_node(struct kmem_cache *c, gfp_t flags, int node) fs_reclaim_release(flags);
if (c->size < PAGE_SIZE) { - b = slob_alloc(c->size, flags, c->align, node); + b = slob_alloc(c->size, flags, c->align, node, 0); trace_kmem_cache_alloc_node(_RET_IP_, b, c->object_size, SLOB_UNITS(c->size) * SLOB_UNIT, flags, node);
From: Shijie Luo luoshijie1@huawei.com
mainline inclusion from mainline-v5.12-rc4 commit 7d8bd3c76da1d94b85e6c9b7007e20e980bfcfe6 category: bugfix bugzilla: 51135 CVE: NA
-----------------------------------------------
If set_large_file = 1 and an error occurs in ext4_handle_dirty_metadata(), the error code will be overridden (the subsequent set_large_file block assigns err directly, discarding the earlier failure); go to out_brelse to avoid this situation.
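A condensed sketch of the pre-patch flow in ext4_do_update_inode() showing the override (function names are real; unrelated code elided):

    rc = ext4_handle_dirty_metadata(handle, NULL, bh);
    if (!err)
            err = rc;       /* failure recorded here ... */
    ext4_clear_inode_state(inode, EXT4_STATE_NEW);
    if (set_large_file) {
            /* ... but silently overwritten on success below */
            err = ext4_journal_get_write_access(handle, EXT4_SB(sb)->s_sbh);
            if (err)
                    goto out_brelse;
            /* update the superblock feature flag, mark it dirty, etc. */
    }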
Signed-off-by: Shijie Luo luoshijie1@huawei.com Link: https://lore.kernel.org/r/20210312065051.36314-1-luoshijie1@huawei.com Cc: stable@kernel.org Reviewed-by: Jan Kara jack@suse.cz Signed-off-by: Theodore Ts'o tytso@mit.edu Signed-off-by: Zhihao Cheng chengzhihao1@huawei.com Reviewed-by: zhangyi (F) yi.zhang@huawei.com Signed-off-by: Yang Yingliang yangyingliang@huawei.com Signed-off-by: Cheng Jian cj.chengjian@huawei.com --- fs/ext4/inode.c | 8 ++++---- 1 file changed, 4 insertions(+), 4 deletions(-)
diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c index b31492c431b4..c68823c50688 100644 --- a/fs/ext4/inode.c +++ b/fs/ext4/inode.c @@ -5282,7 +5282,7 @@ static int ext4_do_update_inode(handle_t *handle, struct ext4_inode_info *ei = EXT4_I(inode); struct buffer_head *bh = iloc->bh; struct super_block *sb = inode->i_sb; - int err = 0, rc, block; + int err = 0, block; int need_datasync = 0, set_large_file = 0; uid_t i_uid; gid_t i_gid; @@ -5394,9 +5394,9 @@ static int ext4_do_update_inode(handle_t *handle, bh->b_data);
BUFFER_TRACE(bh, "call ext4_handle_dirty_metadata"); - rc = ext4_handle_dirty_metadata(handle, NULL, bh); - if (!err) - err = rc; + err = ext4_handle_dirty_metadata(handle, NULL, bh); + if (err) + goto out_brelse; ext4_clear_inode_state(inode, EXT4_STATE_NEW); if (set_large_file) { BUFFER_TRACE(EXT4_SB(sb)->s_sbh, "get write access");