From: Dave Chinner dchinner@redhat.com
mainline inclusion from mainline-v5.19-rc1 commit c230a4a85bcdbfc1a7415deec6caf04e8fca1301 category: bugfix bugzilla: 187372, https://gitee.com/openeuler/kernel/issues/I5K0OM CVE: NA
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i...
--------------------------------
Ever since we added shadown format buffers to the log items, log items need to handle the item being released with shadow buffers attached. Due to the fact this requirement was added at the same time we added new rmap/reflink intents, we missed the cleanup of those items.
In theory, this means shadow buffers can be leaked in a very small window when a shutdown is initiated. Testing with KASAN shows this leak does not happen in practice - we haven't identified a single leak in several years of shutdown testing since ~v4.8 kernels.
However, the intent whiteout cleanup mechanism results in every cancelled intent in exactly the same state as this tiny race window creates and so if intents down clean up shadow buffers on final release we will leak the shadow buffer for just about every intent we create.
Hence we start with this patch to close this condition off and ensure that when whiteouts start to be used we don't leak lots of memory.
Signed-off-by: Dave Chinner dchinner@redhat.com Reviewed-by: Darrick J. Wong djwong@kernel.org Reviewed-by: Allison Henderson allison.henderson@oracle.com Signed-off-by: Dave Chinner david@fromorbit.com
conflicts: fs/xfs/xfs_bmap_item.c fs/xfs/xfs_icreate_item.c fs/xfs/xfs_refcount_item.c fs/xfs/xfs_rmap_item.c
Signed-off-by: Li Nan linan122@huawei.com Reviewed-by: Yang Erkun yangerkun@huawei.com Reviewed-by: Zhang Yi yi.zhang@huawei.com Signed-off-by: Jialin Zhang zhangjialin11@huawei.com --- fs/xfs/xfs_bmap_item.c | 2 ++ fs/xfs/xfs_icreate_item.c | 1 + fs/xfs/xfs_refcount_item.c | 2 ++ fs/xfs/xfs_rmap_item.c | 2 ++ 4 files changed, 7 insertions(+)
diff --git a/fs/xfs/xfs_bmap_item.c b/fs/xfs/xfs_bmap_item.c index 44ec0f2d5253..e6de8081451f 100644 --- a/fs/xfs/xfs_bmap_item.c +++ b/fs/xfs/xfs_bmap_item.c @@ -40,6 +40,7 @@ STATIC void xfs_bui_item_free( struct xfs_bui_log_item *buip) { + kmem_free(buip->bui_item.li_lv_shadow); kmem_cache_free(xfs_bui_zone, buip); }
@@ -199,6 +200,7 @@ xfs_bud_item_release( struct xfs_bud_log_item *budp = BUD_ITEM(lip);
xfs_bui_release(budp->bud_buip); + kmem_free(budp->bud_item.li_lv_shadow); kmem_cache_free(xfs_bud_zone, budp); }
diff --git a/fs/xfs/xfs_icreate_item.c b/fs/xfs/xfs_icreate_item.c index 9b3994b9c716..aa8c7c261d24 100644 --- a/fs/xfs/xfs_icreate_item.c +++ b/fs/xfs/xfs_icreate_item.c @@ -63,6 +63,7 @@ STATIC void xfs_icreate_item_release( struct xfs_log_item *lip) { + kmem_free(ICR_ITEM(lip)->ic_item.li_lv_shadow); kmem_cache_free(xfs_icreate_zone, ICR_ITEM(lip)); }
diff --git a/fs/xfs/xfs_refcount_item.c b/fs/xfs/xfs_refcount_item.c index 0dee316283a9..9f4ff45c7a93 100644 --- a/fs/xfs/xfs_refcount_item.c +++ b/fs/xfs/xfs_refcount_item.c @@ -35,6 +35,7 @@ STATIC void xfs_cui_item_free( struct xfs_cui_log_item *cuip) { + kmem_free(cuip->cui_item.li_lv_shadow); if (cuip->cui_format.cui_nextents > XFS_CUI_MAX_FAST_EXTENTS) kmem_free(cuip); else @@ -204,6 +205,7 @@ xfs_cud_item_release( struct xfs_cud_log_item *cudp = CUD_ITEM(lip);
xfs_cui_release(cudp->cud_cuip); + kmem_free(cudp->cud_item.li_lv_shadow); kmem_cache_free(xfs_cud_zone, cudp); }
diff --git a/fs/xfs/xfs_rmap_item.c b/fs/xfs/xfs_rmap_item.c index 20905953fe76..b5447ac7cb9b 100644 --- a/fs/xfs/xfs_rmap_item.c +++ b/fs/xfs/xfs_rmap_item.c @@ -35,6 +35,7 @@ STATIC void xfs_rui_item_free( struct xfs_rui_log_item *ruip) { + kmem_free(ruip->rui_item.li_lv_shadow); if (ruip->rui_format.rui_nextents > XFS_RUI_MAX_FAST_EXTENTS) kmem_free(ruip); else @@ -227,6 +228,7 @@ xfs_rud_item_release( struct xfs_rud_log_item *rudp = RUD_ITEM(lip);
xfs_rui_release(rudp->rud_ruip); + kmem_free(rudp->rud_item.li_lv_shadow); kmem_cache_free(xfs_rud_zone, rudp); }
From: Marc Zyngier maz@kernel.org
mainline inclusion from mainline-v5.18-rc2 commit af27e41612ec7e5b4783f589b753a7c31a37aac8 category: bugfix bugzilla: https://gitee.com/openeuler/kernel/issues/I66S8J CVE: NA
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i...
----------------------------------------------------
The way KVM drives GICv4.{0,1} is as follows: - vcpu_load() makes the VPE resident, instructing the RD to start scanning for interrupts - just before entering the guest, we check that the RD has finished scanning and that we can start running the vcpu - on preemption, we deschedule the VPE by making it invalid on the RD
However, we are preemptible between the first two steps. If it so happens *and* that the RD was still scanning, we nonetheless write to the GICR_VPENDBASER register while Dirty is set, and bad things happen (we're in UNPRED land).
This affects both the 4.0 and 4.1 implementations.
Make sure Dirty is cleared before performing the deschedule, meaning that its_clear_vpend_valid() becomes a sort of full VPE residency barrier.
Reported-by: Jingyi Wang wangjingyi11@huawei.com Tested-by: Nianyao Tang tangnianyao@huawei.com Signed-off-by: Marc Zyngier maz@kernel.org Fixes: 57e3cebd022f ("KVM: arm64: Delay the polling of the GICR_VPENDBASER.Dirty bit") Link: https://lore.kernel.org/r/4aae10ba-b39a-5f84-754b-69c2eb0a2c03@huawei.com Signed-off-by: Zenghui Yu yuzenghui@huawei.com Reviewed-by: Keqian Zhu zhukeqian1@huawei.com Signed-off-by: Jialin Zhang zhangjialin11@huawei.com --- drivers/irqchip/irq-gic-v3-its.c | 28 +++++++++++++++++++--------- 1 file changed, 19 insertions(+), 9 deletions(-)
diff --git a/drivers/irqchip/irq-gic-v3-its.c b/drivers/irqchip/irq-gic-v3-its.c index 1cb392fb16d0..99fcc5601ea4 100644 --- a/drivers/irqchip/irq-gic-v3-its.c +++ b/drivers/irqchip/irq-gic-v3-its.c @@ -3061,18 +3061,12 @@ static int __init allocate_lpi_tables(void) return 0; }
-static u64 its_clear_vpend_valid(void __iomem *vlpi_base, u64 clr, u64 set) +static u64 read_vpend_dirty_clear(void __iomem *vlpi_base) { u32 count = 1000000; /* 1s! */ bool clean; u64 val;
- val = gicr_read_vpendbaser(vlpi_base + GICR_VPENDBASER); - val &= ~GICR_VPENDBASER_Valid; - val &= ~clr; - val |= set; - gicr_write_vpendbaser(val, vlpi_base + GICR_VPENDBASER); - do { val = gicr_read_vpendbaser(vlpi_base + GICR_VPENDBASER); clean = !(val & GICR_VPENDBASER_Dirty); @@ -3083,10 +3077,26 @@ static u64 its_clear_vpend_valid(void __iomem *vlpi_base, u64 clr, u64 set) } } while (!clean && count);
- if (unlikely(val & GICR_VPENDBASER_Dirty)) { + if (unlikely(!clean)) pr_err_ratelimited("ITS virtual pending table not cleaning\n"); + + return val; +} + +static u64 its_clear_vpend_valid(void __iomem *vlpi_base, u64 clr, u64 set) +{ + u64 val; + + /* Make sure we wait until the RD is done with the initial scan */ + val = read_vpend_dirty_clear(vlpi_base); + val &= ~GICR_VPENDBASER_Valid; + val &= ~clr; + val |= set; + gicr_write_vpendbaser(val, vlpi_base + GICR_VPENDBASER); + + val = read_vpend_dirty_clear(vlpi_base); + if (unlikely(val & GICR_VPENDBASER_Dirty)) val |= GICR_VPENDBASER_PendingLast; - }
return val; }
From: Tetsuo Handa penguin-kernel@I-love.SAKURA.ne.jp
mainline inclusion from mainline-v6.0-rc1 commit 68aaee147e597b495622b7c9038e5922c7c61f57 category: bugfix bugzilla: https://gitee.com/openeuler/kernel/issues/I6ADCF CVE: NA
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i...
--------------------------------
syzbot is reporting GFP_KERNEL allocation with oom_lock held when reporting memcg OOM [1]. If this allocation triggers the global OOM situation then the system can livelock because the GFP_KERNEL allocation with oom_lock held cannot trigger the global OOM killer because __alloc_pages_may_oom() fails to hold oom_lock.
Fix this problem by removing the allocation from memory_stat_format() completely, and pass static buffer when calling from memcg OOM path.
Note that the caller holding filesystem lock was the trigger for syzbot to report this locking dependency. Doing GFP_KERNEL allocation with filesystem lock held can deadlock the system even without involving OOM situation.
Link: https://syzkaller.appspot.com/bug?extid=2d2aeadc6ce1e1f11d45 [1] Link: https://lkml.kernel.org/r/86afb39f-8c65-bec2-6cfc-c5e3cd600c0b@I-love.SAKURA... Fixes: c8713d0b23123759 ("mm: memcontrol: dump memory.stat during cgroup OOM") Signed-off-by: Tetsuo Handa penguin-kernel@I-love.SAKURA.ne.jp Reported-by: syzbot syzbot+2d2aeadc6ce1e1f11d45@syzkaller.appspotmail.com Suggested-by: Michal Hocko mhocko@suse.com Acked-by: Michal Hocko mhocko@suse.com Cc: Johannes Weiner hannes@cmpxchg.org Cc: Roman Gushchin roman.gushchin@linux.dev Cc: Shakeel Butt shakeelb@google.com Signed-off-by: Andrew Morton akpm@linux-foundation.org
conflicts: mm/memcontrol.c
Signed-off-by: Cai Xinchen caixinchen1@huawei.com Reviewed-by: Kefeng Wang wangkefeng.wang@huawei.com Reviewed-by: Wang Weiyang wangweiyang2@huawei.com Signed-off-by: Jialin Zhang zhangjialin11@huawei.com --- mm/memcontrol.c | 22 +++++++++------------- 1 file changed, 9 insertions(+), 13 deletions(-)
diff --git a/mm/memcontrol.c b/mm/memcontrol.c index b2c4bc4bb591..6344fe257af6 100644 --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -1509,14 +1509,12 @@ static int __init memory_stats_init(void) } pure_initcall(memory_stats_init);
-static char *memory_stat_format(struct mem_cgroup *memcg) +static void memory_stat_format(struct mem_cgroup *memcg, char *buf, int bufsize) { struct seq_buf s; int i;
- seq_buf_init(&s, kmalloc(PAGE_SIZE, GFP_KERNEL), PAGE_SIZE); - if (!s.buffer) - return NULL; + seq_buf_init(&s, buf, bufsize);
/* * Provide statistics on the state of the memory subsystem as @@ -1576,8 +1574,6 @@ static char *memory_stat_format(struct mem_cgroup *memcg)
/* The above should easily fit into one page */ WARN_ON_ONCE(seq_buf_has_overflowed(&s)); - - return s.buffer; }
#define K(x) ((x) << (PAGE_SHIFT-10)) @@ -1613,7 +1609,10 @@ void mem_cgroup_print_oom_context(struct mem_cgroup *memcg, struct task_struct * */ void mem_cgroup_print_oom_meminfo(struct mem_cgroup *memcg) { - char *buf; + /* Use static buffer, for the caller is holding oom_lock. */ + static char buf[PAGE_SIZE]; + + lockdep_assert_held(&oom_lock);
pr_info("memory: usage %llukB, limit %llukB, failcnt %lu\n", K((u64)page_counter_read(&memcg->memory)), @@ -1634,11 +1633,8 @@ void mem_cgroup_print_oom_meminfo(struct mem_cgroup *memcg) pr_info("Memory cgroup stats for "); pr_cont_cgroup_path(memcg->css.cgroup); pr_cont(":"); - buf = memory_stat_format(memcg); - if (!buf) - return; + memory_stat_format(memcg, buf, sizeof(buf)); pr_info("%s", buf); - kfree(buf);
mem_cgroup_print_memfs_info(memcg, NULL); } @@ -6912,11 +6908,11 @@ static int memory_events_local_show(struct seq_file *m, void *v) static int memory_stat_show(struct seq_file *m, void *v) { struct mem_cgroup *memcg = mem_cgroup_from_seq(m); - char *buf; + char *buf = kmalloc(PAGE_SIZE, GFP_KERNEL);
- buf = memory_stat_format(memcg); if (!buf) return -ENOMEM; + memory_stat_format(memcg, buf, PAGE_SIZE); seq_puts(m, buf); kfree(buf); return 0;
From: Liu Shixin liushixin2@huawei.com
hulk inclusion category: bugfix bugzilla: https://gitee.com/openeuler/kernel/issues/I6ADCF CVE: NA
--------------------------------
syzbot is reporting GFP_KERNEL allocation with oom_lock held when reporting memcg OOM [1]. If this allocation triggers the global OOM situation then the system can livelock because the GFP_KERNEL allocation with oom_lock held cannot trigger the global OOM killer because __alloc_pages_may_oom() fails to hold oom_lock.
The problem mentioned above has been fixed by patch[2]. The is the same problem in memcg_memfs_info feature too. Refer to the patch[2], fix it by removing the allocation from mem_cgroup_print_memfs_info() completely, and pass static buffer when calling from memcg OOM path.
Link: https://syzkaller.appspot.com/bug?extid=2d2aeadc6ce1e1f11d45 [1] Link: https://lkml.kernel.org/r/86afb39f-8c65-bec2-6cfc-c5e3cd600c0b@I-love.SAKURA... [2] Fixes: 6b1d4d3a3713 ("mm/memcg_memfs_info: show files that having pages charged in mem_cgroup") Signed-off-by: Liu Shixin liushixin2@huawei.com Reviewed-by: Kefeng Wang wangkefeng.wang@huawei.com Reviewed-by: Cai Xinchen caixinchen1@huawei.com Signed-off-by: Jialin Zhang zhangjialin11@huawei.com --- include/linux/memcg_memfs_info.h | 4 +++- mm/memcg_memfs_info.c | 20 ++++++++++---------- mm/memcontrol.c | 3 ++- 3 files changed, 15 insertions(+), 12 deletions(-)
diff --git a/include/linux/memcg_memfs_info.h b/include/linux/memcg_memfs_info.h index 658a91e22bd7..b5e3709baa9e 100644 --- a/include/linux/memcg_memfs_info.h +++ b/include/linux/memcg_memfs_info.h @@ -6,11 +6,13 @@ #include <linux/seq_file.h>
#ifdef CONFIG_MEMCG_MEMFS_INFO -void mem_cgroup_print_memfs_info(struct mem_cgroup *memcg, struct seq_file *m); +void mem_cgroup_print_memfs_info(struct mem_cgroup *memcg, char *pathbuf, + struct seq_file *m); int mem_cgroup_memfs_files_show(struct seq_file *m, void *v); void mem_cgroup_memfs_info_init(void); #else static inline void mem_cgroup_print_memfs_info(struct mem_cgroup *memcg, + char *pathbuf, struct seq_file *m) { } diff --git a/mm/memcg_memfs_info.c b/mm/memcg_memfs_info.c index f404367ad08c..db7b0aee8053 100644 --- a/mm/memcg_memfs_info.c +++ b/mm/memcg_memfs_info.c @@ -157,7 +157,8 @@ static void memfs_show_files_in_mem_cgroup(struct super_block *sb, void *data) mntput(pfc->vfsmnt); }
-void mem_cgroup_print_memfs_info(struct mem_cgroup *memcg, struct seq_file *m) +void mem_cgroup_print_memfs_info(struct mem_cgroup *memcg, char *pathbuf, + struct seq_file *m) { struct print_files_control pfc = { .memcg = memcg, @@ -165,17 +166,11 @@ void mem_cgroup_print_memfs_info(struct mem_cgroup *memcg, struct seq_file *m) .max_print_files = memfs_max_print_files, .size_threshold = memfs_size_threshold, }; - char *pathbuf; int i;
if (!memfs_enable || !memcg) return;
- pathbuf = kmalloc(PATH_MAX, GFP_KERNEL); - if (!pathbuf) { - SEQ_printf(m, "Show memfs failed due to OOM\n"); - return; - } pfc.pathbuf = pathbuf; pfc.pathbuf_size = PATH_MAX;
@@ -192,15 +187,20 @@ void mem_cgroup_print_memfs_info(struct mem_cgroup *memcg, struct seq_file *m) SEQ_printf(m, "total files: %lu, total memory-size: %lukB\n", pfc.total_print_files, pfc.total_files_size >> 10); } - - kfree(pfc.pathbuf); }
int mem_cgroup_memfs_files_show(struct seq_file *m, void *v) { struct mem_cgroup *memcg = mem_cgroup_from_css(seq_css(m)); + char *pathbuf;
- mem_cgroup_print_memfs_info(memcg, m); + pathbuf = kmalloc(PATH_MAX, GFP_KERNEL); + if (!pathbuf) { + SEQ_printf(m, "Show memfs abort: failed to allocate memory\n"); + return 0; + } + mem_cgroup_print_memfs_info(memcg, pathbuf, m); + kfree(pathbuf); return 0; }
diff --git a/mm/memcontrol.c b/mm/memcontrol.c index 6344fe257af6..635cb8b65b86 100644 --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -1611,6 +1611,7 @@ void mem_cgroup_print_oom_meminfo(struct mem_cgroup *memcg) { /* Use static buffer, for the caller is holding oom_lock. */ static char buf[PAGE_SIZE]; + static char pathbuf[PATH_MAX];
lockdep_assert_held(&oom_lock);
@@ -1636,7 +1637,7 @@ void mem_cgroup_print_oom_meminfo(struct mem_cgroup *memcg) memory_stat_format(memcg, buf, sizeof(buf)); pr_info("%s", buf);
- mem_cgroup_print_memfs_info(memcg, NULL); + mem_cgroup_print_memfs_info(memcg, pathbuf, NULL); }
/*
From: Zhong Jinghua zhongjinghua@huawei.com
hulk inclusion category: bugfix bugzilla: 188150, https://gitee.com/openeuler/kernel/issues/I643OL CVE: NA
--------------------------------
When the three iscsi operations delete, logout, and rescan are concurrent at the same time, there is a probability of failure to add disk through device_add_disk(). The concurrent process is as follows:
T0: scan host // echo 1 > /sys/devices/platform/host1/scsi_host/host1/scan T1: delete target // echo 1 > /sys/devices/platform/host1/session1/target1:0:0/1:0:0:1/delete T2: logout // iscsiadm -m node --login T3: T2 scsi_queue_work T4: T0 bus_probe_device
T0 T1 T2 T3 scsi_scan_target mutex_lock(&shost->scan_mutex); __scsi_scan_target scsi_report_lun_scan scsi_add_lun scsi_sysfs_add_sdev device_add kobject_add //create session1/target1:0:0/1:0:0:1/ ... bus_probe_device // Create block asynchronously mutex_unlock(&shost->scan_mutex); sdev_store_delete scsi_remove_device device_remove_file mutex_lock(scan_mutex) __scsi_remove_device res = scsi_device_set_state(sdev, SDEV_CANCEL) iscsi_if_recv_msg scsi_queue_work __iscsi_unbind_session session->target_id = ISCSI_MAX_TARGET __scsi_remove_target sdev->sdev_state == SDEV_CANCEL continue; // end, No delete kobject 1:0:0:1 iscsi_if_recv_msg transport->destroy_session(session) __iscsi_destroy_session iscsi_session_teardown iscsi_remove_session __iscsi_unbind_session iscsi_session_event device_del // delete session T4: // create the block, its parent is 1:0:0:1 // If kobject 1:0:0:1 does not exist, it won't go down __device_attach_async_helper device_lock ... __device_attach_driver driver_probe_device really_probe sd_probe device_add_disk register_disk device_add // error
The block is created after the seesion is deleted. When T2 deletes the session, it will mark block'parent 1:0:01 as unusable: T2 device_del kobject_del sysfs_remove_dir __kernfs_remove // Mark the children under the session as unusable while ((pos = kernfs_next_descendant_post(pos, kn))) if (kernfs_active(pos)) atomic_add(KN_DEACTIVATED_BIAS, &pos->active);
Then, create the block: T4 device_add kobject_add kobject_add_varg kobject_add_internal create_dir sysfs_create_dir_ns kernfs_create_dir_ns kernfs_add_one if ((parent->flags & KERNFS_ACTIVATED) && !kernfs_active(parent)) goto out_unlock; // return error
This error will cause a warning: kobject_add_internal failed for block (error: -2 parent: 1:0:0:1). In the lower version (such as 5.10), there is no corresponding error handling, continuing to go down will trigger a kernel panic, so cc stable.
Therefore, creating the block should not be done after deleting the session. More practically, we should ensure that the target under the session is deleted first, and then the session is deleted. In this way, there are two possibilities:
1) if the process(T1) of deleting the target execute first, it will grab the device_lock(), and the process(T4) of creating the block will wait for the deletion to complete. Then, block's parent 1:0:0:1 has been deleted, it won't go down.
2) if the process(T4) of creating block execute first, it will grab the device_lock(), and the process(T1) of deleting the target will wait for the creation block to complete. Then, the process(T2) of deleting the session should need wait for the deletion to complete.
Fix it by removing the judgment of state equal to SDEV_CANCEL in __scsi_remove_target() to ensure the order of deletion. Then, it will wait for T1's mutex_lock(scan_mutex) and device_del() in __scsi_remove_device() will wait for T4's device_lock(dev). But we found that such a fix would cause the previous problem: commit 81b6c9998979 ("scsi: core: check for device state in __scsi_remove_target()"). So we use scsi_device_try_get() instead of get_devcie() to fix the previous problem.
Fixes: 81b6c9998979 ("scsi: core: check for device state in __scsi_remove_target()") Cc: stable@vger.kernel.org Signed-off-by: Zhong Jinghua zhongjinghua@huawei.com Reviewed-by: Hou Tao houtao1@huawei.com Signed-off-by: Jialin Zhang zhangjialin11@huawei.com --- drivers/scsi/scsi_sysfs.c | 11 ++++++++--- 1 file changed, 8 insertions(+), 3 deletions(-)
diff --git a/drivers/scsi/scsi_sysfs.c b/drivers/scsi/scsi_sysfs.c index 42db9c52208e..e7893835b99a 100644 --- a/drivers/scsi/scsi_sysfs.c +++ b/drivers/scsi/scsi_sysfs.c @@ -1503,6 +1503,13 @@ void scsi_remove_device(struct scsi_device *sdev) } EXPORT_SYMBOL(scsi_remove_device);
+static int scsi_device_try_get(struct scsi_device *sdev) +{ + if (!kobject_get_unless_zero(&sdev->sdev_gendev.kobj)) + return -ENXIO; + return 0; +} + static void __scsi_remove_target(struct scsi_target *starget) { struct Scsi_Host *shost = dev_to_shost(starget->dev.parent); @@ -1521,9 +1528,7 @@ static void __scsi_remove_target(struct scsi_target *starget) if (sdev->channel != starget->channel || sdev->id != starget->id) continue; - if (sdev->sdev_state == SDEV_DEL || - sdev->sdev_state == SDEV_CANCEL || - !get_device(&sdev->sdev_gendev)) + if (scsi_device_try_get(sdev)) continue; spin_unlock_irqrestore(shost->host_lock, flags); scsi_remove_device(sdev);
From: Sean Christopherson seanjc@google.com
mainline inclusion from mainline-v6.2-rc1 commit d6aba4c8e20d4d2bf65d589953f6d891c178f3a3 category: bugfix bugzilla: https://gitee.com/openeuler/kernel/issues/I6BE56 CVE: NA
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?h...
--------------------------------
Pass "end - 1" instead of "end" when walking the interval tree in hugetlb_vmdelete_list() to fix an inclusive vs. exclusive bug. The two callers that pass a non-zero "end" treat it as exclusive, whereas the interval tree iterator expects an inclusive "last". E.g. punching a hole in a file that precisely matches the size of a single hugepage, with a vma starting right on the boundary, will result in unmap_hugepage_range() being called twice, with the second call having start==end.
The off-by-one error doesn't cause functional problems as __unmap_hugepage_range() turns into a massive nop due to short-circuiting its for-loop on "address < end". But, the mmu_notifier invocations to invalid_range_{start,end}() are passed a bogus zero-sized range, which may be unexpected behavior for secondary MMUs.
The bug was exposed by commit ed922739c919 ("KVM: Use interval tree to do fast hva lookup in memslots"), currently queued in the KVM tree for 5.17, which added a WARN to detect ranges with start==end.
Link: https://lkml.kernel.org/r/20211228234257.1926057-1-seanjc@google.com Fixes: 1bfad99ab425 ("hugetlbfs: hugetlb_vmtruncate_list() needs to take a range to delete") Signed-off-by: Sean Christopherson seanjc@google.com Reported-by: syzbot+4e697fe80a31aa7efe21@syzkaller.appspotmail.com Reviewed-by: Mike Kravetz mike.kravetz@oracle.com Cc: Paolo Bonzini pbonzini@redhat.com Signed-off-by: Andrew Morton akpm@linux-foundation.org Signed-off-by: Linus Torvalds torvalds@linux-foundation.org Signed-off-by: Li Lingfeng lilingfeng3@huawei.com Reviewed-by: Zhihao Cheng chengzhihao1@huawei.com Reviewed-by: Zhang Yi yi.zhang@huawei.com Signed-off-by: Jialin Zhang zhangjialin11@huawei.com --- fs/hugetlbfs/inode.c | 7 ++++--- 1 file changed, 4 insertions(+), 3 deletions(-)
diff --git a/fs/hugetlbfs/inode.c b/fs/hugetlbfs/inode.c index 379aa008514d..cfdd8cffe6d7 100644 --- a/fs/hugetlbfs/inode.c +++ b/fs/hugetlbfs/inode.c @@ -472,10 +472,11 @@ hugetlb_vmdelete_list(struct rb_root_cached *root, pgoff_t start, pgoff_t end) struct vm_area_struct *vma;
/* - * end == 0 indicates that the entire range after - * start should be unmapped. + * end == 0 indicates that the entire range after start should be + * unmapped. Note, end is exclusive, whereas the interval tree takes + * an inclusive "last". */ - vma_interval_tree_foreach(vma, root, start, end ? end : ULONG_MAX) { + vma_interval_tree_foreach(vma, root, start, end ? end - 1 : ULONG_MAX) { unsigned long v_offset; unsigned long v_end;
From: Yu Kuai yukuai3@huawei.com
mainline inclusion from mainline-v6.2-rc1 commit 077a4033541fc96fb0a955985aab7d1f353da831 category: bugfix bugzilla: https://gitee.com/openeuler/kernel/issues/I6B4N7 CVE: NA
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?h...
--------------------------------
After creating a dm device, then user can reload such dm with itself, and dead loop will be triggered because dm keep looking up to itself.
Test procedures:
1) dmsetup create test --table "xxx sda", assume dm-0 is created 2) dmsetup suspend test 3) dmsetup reload test --table "xxx dm-0" 4) dmsetup resume test
Test result:
BUG: TASK stack guard page was hit at 00000000736a261f (stack is 000000008d12c88d..00000000c8dd82d5) stack guard page: 0000 [#1] PREEMPT SMP CPU: 29 PID: 946 Comm: systemd-udevd Not tainted 6.1.0-rc3-next-20221101-00006-g17640ca3b0ee #1295 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS ?-20190727_073836-buildvm-ppc64le-16.ppc.fedoraproject.org-3.fc31 04/01/2014 RIP: 0010:dm_prepare_ioctl+0xf/0x1e0 Code: da 48 83 05 4a 7c 99 0b 01 41 89 c4 eb cd e8 b8 1f 40 00 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 41 57 48 83 05 a1 5a 99 0b 01 <41> 56 49 89 d6 41 55 4c 8d af 90 02 00 00 9 RSP: 0018:ffffc90002090000 EFLAGS: 00010206 RAX: ffff8881049d6800 RBX: ffff88817e589000 RCX: 0000000000000000 RDX: ffffc90002090010 RSI: ffffc9000209001c RDI: ffff88817e589000 RBP: 00000000484a101d R08: 0000000000000000 R09: 0000000000000007 R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000005331 R13: 0000000000005331 R14: 0000000000000000 R15: 0000000000000000 FS: 00007fddf9609200(0000) GS:ffff889fbfd40000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: ffffc9000208fff8 CR3: 0000000179043000 CR4: 00000000000006e0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 Call Trace: <TASK> dm_blk_ioctl+0x50/0x1c0 ? dm_prepare_ioctl+0xe0/0x1e0 dm_blk_ioctl+0x88/0x1c0 dm_blk_ioctl+0x88/0x1c0 ......(a lot of same lines) dm_blk_ioctl+0x88/0x1c0 dm_blk_ioctl+0x88/0x1c0 blkdev_ioctl+0x184/0x3e0 __x64_sys_ioctl+0xa3/0x110 do_syscall_64+0x35/0x80 entry_SYSCALL_64_after_hwframe+0x63/0xcd RIP: 0033:0x7fddf7306577 Code: b3 66 90 48 8b 05 11 89 2c 00 64 c7 00 26 00 00 00 48 c7 c0 ff ff ff ff c3 66 2e 0f 1f 84 00 00 00 00 00 b8 10 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d e1 88 8 RSP: 002b:00007ffd0b2ec318 EFLAGS: 00000246 ORIG_RAX: 0000000000000010 RAX: ffffffffffffffda RBX: 00005634ef478320 RCX: 00007fddf7306577 RDX: 0000000000000000 RSI: 0000000000005331 RDI: 0000000000000007 RBP: 0000000000000007 R08: 00005634ef4843e0 R09: 0000000000000080 R10: 00007fddf75cfb38 R11: 0000000000000246 R12: 00000000030d4000 R13: 0000000000000000 R14: 0000000000000000 R15: 00005634ef48b800 </TASK> Modules linked in: ---[ end trace 0000000000000000 ]--- RIP: 0010:dm_prepare_ioctl+0xf/0x1e0 Code: da 48 83 05 4a 7c 99 0b 01 41 89 c4 eb cd e8 b8 1f 40 00 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 41 57 48 83 05 a1 5a 99 0b 01 <41> 56 49 89 d6 41 55 4c 8d af 90 02 00 00 9 RSP: 0018:ffffc90002090000 EFLAGS: 00010206 RAX: ffff8881049d6800 RBX: ffff88817e589000 RCX: 0000000000000000 RDX: ffffc90002090010 RSI: ffffc9000209001c RDI: ffff88817e589000 RBP: 00000000484a101d R08: 0000000000000000 R09: 0000000000000007 R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000005331 R13: 0000000000005331 R14: 0000000000000000 R15: 0000000000000000 FS: 00007fddf9609200(0000) GS:ffff889fbfd40000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: ffffc9000208fff8 CR3: 0000000179043000 CR4: 00000000000006e0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 Kernel panic - not syncing: Fatal exception in interrupt Kernel Offset: disabled ---[ end Kernel panic - not syncing: Fatal exception in interrupt ]---
Fix the problem by forbidding a disk to create link to itself.
Signed-off-by: Yu Kuai yukuai3@huawei.com Reviewed-by: Christoph Hellwig hch@lst.de Link: https://lore.kernel.org/r/20221115141054.1051801-11-yukuai1@huaweicloud.com Signed-off-by: Jens Axboe axboe@kernel.dk
Conflict: block/holder.c
Signed-off-by: Li Lingfeng lilingfeng3@huawei.com Reviewed-by: Hou Tao houtao1@huawei.com Reviewed-by: Yang Erkun yangerkun@huawei.com Reviewed-by: Zhang Yi yi.zhang@huawei.com Signed-off-by: Jialin Zhang zhangjialin11@huawei.com --- fs/block_dev.c | 3 +++ 1 file changed, 3 insertions(+)
diff --git a/fs/block_dev.c b/fs/block_dev.c index 22d3a0f5152d..1c56cddcaa3d 100644 --- a/fs/block_dev.c +++ b/fs/block_dev.c @@ -1264,6 +1264,9 @@ int bd_link_disk_holder(struct block_device *bdev, struct gendisk *disk) struct bd_holder_disk *holder; int ret = 0;
+ if (bdev->bd_disk == disk) + return -EINVAL; + /* * bdev could be deleted beneath us which would implicitly destroy * the holder directory. Hold on to it.
From: Baokun Li libaokun1@huawei.com
stable inclusion from stable-v5.10.150 commit f34ab95162763cd7352f46df169296eec28b688d category: bugfix bugzilla: https://gitee.com/openeuler/kernel/issues/I6BJ5F CVE: NA
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?id=...
--------------------------------
commit f9c1f248607d5546075d3f731e7607d5571f2b60 upstream.
I caught a null-ptr-deref bug as follows:
================================================================== KASAN: null-ptr-deref in range [0x0000000000000068-0x000000000000006f] CPU: 1 PID: 1589 Comm: umount Not tainted 5.10.0-02219-dirty #339 RIP: 0010:ext4_write_info+0x53/0x1b0 [...] Call Trace: dquot_writeback_dquots+0x341/0x9a0 ext4_sync_fs+0x19e/0x800 __sync_filesystem+0x83/0x100 sync_filesystem+0x89/0xf0 generic_shutdown_super+0x79/0x3e0 kill_block_super+0xa1/0x110 deactivate_locked_super+0xac/0x130 deactivate_super+0xb6/0xd0 cleanup_mnt+0x289/0x400 __cleanup_mnt+0x16/0x20 task_work_run+0x11c/0x1c0 exit_to_user_mode_prepare+0x203/0x210 syscall_exit_to_user_mode+0x5b/0x3a0 do_syscall_64+0x59/0x70 entry_SYSCALL_64_after_hwframe+0x44/0xa9 ==================================================================
Above issue may happen as follows: ------------------------------------- exit_to_user_mode_prepare task_work_run __cleanup_mnt cleanup_mnt deactivate_super deactivate_locked_super kill_block_super generic_shutdown_super shrink_dcache_for_umount dentry = sb->s_root sb->s_root = NULL <--- Here set NULL sync_filesystem __sync_filesystem sb->s_op->sync_fs > ext4_sync_fs dquot_writeback_dquots sb->dq_op->write_info > ext4_write_info ext4_journal_start(d_inode(sb->s_root), EXT4_HT_QUOTA, 2) d_inode(sb->s_root) s_root->d_inode <--- Null pointer dereference
To solve this problem, we use ext4_journal_start_sb directly to avoid s_root being used.
Cc: stable@kernel.org Signed-off-by: Baokun Li libaokun1@huawei.com Reviewed-by: Jan Kara jack@suse.cz Link: https://lore.kernel.org/r/20220805123947.565152-1-libaokun1@huawei.com Signed-off-by: Theodore Ts'o tytso@mit.edu Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org Signed-off-by: Baokun Li libaokun1@huawei.com Reviewed-by: Zhang Yi yi.zhang@huawei.com Signed-off-by: Jialin Zhang zhangjialin11@huawei.com --- fs/ext4/super.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/fs/ext4/super.c b/fs/ext4/super.c index 9e7d204c9730..c94ea845ea57 100644 --- a/fs/ext4/super.c +++ b/fs/ext4/super.c @@ -6371,7 +6371,7 @@ static int ext4_write_info(struct super_block *sb, int type) handle_t *handle;
/* Data block + inode block */ - handle = ext4_journal_start(d_inode(sb->s_root), EXT4_HT_QUOTA, 2); + handle = ext4_journal_start_sb(sb, EXT4_HT_QUOTA, 2); if (IS_ERR(handle)) return PTR_ERR(handle); ret = dquot_commit_info(sb, type);
From: Baokun Li libaokun1@huawei.com
stable inclusion from stable-v5.10.163 commit 7223d5e75f26352354ea2c0ccf8b579821b52adf category: bugfix bugzilla: 187904,https://gitee.com/openeuler/kernel/issues/I6BJAR CVE: NA
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?id=...
--------------------------------
commit a71248b1accb2b42e4980afef4fa4a27fa0e36f5 upstream.
I caught a issue as follows:
================================================================== BUG: KASAN: use-after-free in __list_add_valid+0x28/0x1a0 Read of size 8 at addr ffff88814b13f378 by task mount/710
CPU: 1 PID: 710 Comm: mount Not tainted 6.1.0-rc3-next #370 Call Trace: <TASK> dump_stack_lvl+0x73/0x9f print_report+0x25d/0x759 kasan_report+0xc0/0x120 __asan_load8+0x99/0x140 __list_add_valid+0x28/0x1a0 ext4_orphan_cleanup+0x564/0x9d0 [ext4] __ext4_fill_super+0x48e2/0x5300 [ext4] ext4_fill_super+0x19f/0x3a0 [ext4] get_tree_bdev+0x27b/0x450 ext4_get_tree+0x19/0x30 [ext4] vfs_get_tree+0x49/0x150 path_mount+0xaae/0x1350 do_mount+0xe2/0x110 __x64_sys_mount+0xf0/0x190 do_syscall_64+0x35/0x80 entry_SYSCALL_64_after_hwframe+0x63/0xcd </TASK> [...] ==================================================================
Above issue may happen as follows: ------------------------------------- ext4_fill_super ext4_orphan_cleanup --- loop1: assume last_orphan is 12 --- list_add(&EXT4_I(inode)->i_orphan, &EXT4_SB(sb)->s_orphan) ext4_truncate --> return 0 ext4_inode_attach_jinode --> return -ENOMEM iput(inode) --> free inode<12> --- loop2: last_orphan is still 12 --- list_add(&EXT4_I(inode)->i_orphan, &EXT4_SB(sb)->s_orphan); // use inode<12> and trigger UAF
To solve this issue, we need to propagate the return value of ext4_inode_attach_jinode() appropriately.
Signed-off-by: Baokun Li libaokun1@huawei.com Reviewed-by: Jan Kara jack@suse.cz Link: https://lore.kernel.org/r/20221102080633.1630225-1-libaokun1@huawei.com Signed-off-by: Theodore Ts'o tytso@mit.edu Cc: stable@kernel.org Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org Signed-off-by: Baokun Li libaokun1@huawei.com Reviewed-by: Zhang Yi yi.zhang@huawei.com Signed-off-by: Jialin Zhang zhangjialin11@huawei.com --- fs/ext4/inode.c | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-)
diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c index 95dfbd18e2eb..a0bf0253485a 100644 --- a/fs/ext4/inode.c +++ b/fs/ext4/inode.c @@ -4224,7 +4224,8 @@ int ext4_truncate(struct inode *inode)
/* If we zero-out tail of the page, we have to create jinode for jbd2 */ if (inode->i_size & (inode->i_sb->s_blocksize - 1)) { - if (ext4_inode_attach_jinode(inode) < 0) + err = ext4_inode_attach_jinode(inode); + if (err) goto out_trace; }
From: Herbert Xu herbert@gondor.apana.org.au
stable inclusion from stable-v5.10.164 commit 6c9e2c11c33c35563d34d12b343d43b5c12200b5 category: bugfix bugzilla: 188291, https://gitee.com/src-openeuler/kernel/issues/I6B1V2 CVE: CVE-2023-0394
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?id=...
--------------------------------
commit cb3e9864cdbe35ff6378966660edbcbac955fe17 upstream.
The total cork length created by ip6_append_data includes extension headers, so we must exclude them when comparing them against the IPV6_CHECKSUM offset which does not include extension headers.
Reported-by: Kyle Zeng zengyhkyle@gmail.com Fixes: 357b40a18b04 ("[IPV6]: IPV6_CHECKSUM socket option can corrupt kernel memory") Signed-off-by: Herbert Xu herbert@gondor.apana.org.au Signed-off-by: David S. Miller davem@davemloft.net Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org Signed-off-by: Lu Wei luwei32@huawei.com Reviewed-by: Liu Jian liujian56@huawei.com Reviewed-by: Xiu Jianfeng xiujianfeng@huawei.com Signed-off-by: Jialin Zhang zhangjialin11@huawei.com --- net/ipv6/raw.c | 4 ++++ 1 file changed, 4 insertions(+)
diff --git a/net/ipv6/raw.c b/net/ipv6/raw.c index 31eb54e92b3f..110254f44a46 100644 --- a/net/ipv6/raw.c +++ b/net/ipv6/raw.c @@ -539,6 +539,7 @@ static int rawv6_recvmsg(struct sock *sk, struct msghdr *msg, size_t len, static int rawv6_push_pending_frames(struct sock *sk, struct flowi6 *fl6, struct raw6_sock *rp) { + struct ipv6_txoptions *opt; struct sk_buff *skb; int err = 0; int offset; @@ -556,6 +557,9 @@ static int rawv6_push_pending_frames(struct sock *sk, struct flowi6 *fl6,
offset = rp->offset; total_len = inet_sk(sk)->cork.base.length; + opt = inet6_sk(sk)->cork.opt; + total_len -= opt ? opt->opt_flen : 0; + if (offset >= total_len - 1) { err = -EINVAL; ip6_flush_pending_frames(sk);
From: Zheng Wang zyytlz.wz@163.com
mainline inclusion from mainline-v6.2-rc3 commit 4a61648af68f5ba4884f0e3b494ee1cabc4b6620 category: bugfix bugzilla: https://gitee.com/src-openeuler/kernel/issues/I5XXFF CVE: CVE-2022-3707
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i...
--------------------------------
If intel_gvt_dma_map_guest_page failed, it will call ppgtt_invalidate_spt, which will finally free the spt. But the caller function ppgtt_populate_spt_by_guest_entry does not notice that, it will free spt again in its error path.
Fix this by canceling the mapping of DMA address and freeing sub_spt. Besides, leave the handle of spt destroy to caller function instead of callee function when error occurs.
conflicts: drivers/gpu/drm/i915/gvt/gtt.c
Fixes: b901b252b6cf ("drm/i915/gvt: Add 2M huge gtt support") Signed-off-by: Zheng Wang zyytlz.wz@163.com Reviewed-by: Zhenyu Wang zhenyuw@linux.intel.com Signed-off-by: Zhenyu Wang zhenyuw@linux.intel.com Link: http://patchwork.freedesktop.org/patch/msgid/20221229165641.1192455-1-zyytlz... Signed-off-by: Wang ShaoBo bobo.shaobowang@huawei.com Reviewed-by: Xiu Jianfeng xiujianfeng@huawei.com Reviewed-by: Wei Li liwei391@huawei.com Signed-off-by: Jialin Zhang zhangjialin11@huawei.com --- drivers/gpu/drm/i915/gvt/gtt.c | 17 +++++++++++++---- 1 file changed, 13 insertions(+), 4 deletions(-)
diff --git a/drivers/gpu/drm/i915/gvt/gtt.c b/drivers/gpu/drm/i915/gvt/gtt.c index a3a4305eda01..0201f9b5f87e 100644 --- a/drivers/gpu/drm/i915/gvt/gtt.c +++ b/drivers/gpu/drm/i915/gvt/gtt.c @@ -1192,10 +1192,8 @@ static int split_2MB_gtt_entry(struct intel_vgpu *vgpu, for_each_shadow_entry(sub_spt, &sub_se, sub_index) { ret = intel_gvt_hypervisor_dma_map_guest_page(vgpu, start_gfn + sub_index, PAGE_SIZE, &dma_addr); - if (ret) { - ppgtt_invalidate_spt(spt); - return ret; - } + if (ret) + goto err; sub_se.val64 = se->val64;
/* Copy the PAT field from PDE. */ @@ -1214,6 +1212,17 @@ static int split_2MB_gtt_entry(struct intel_vgpu *vgpu, ops->set_pfn(se, sub_spt->shadow_page.mfn); ppgtt_set_shadow_entry(spt, se, index); return 0; +err: + /* Cancel the existing addess mappings of DMA addr. */ + for_each_present_shadow_entry(sub_spt, &sub_se, sub_index) { + gvt_vdbg_mm("invalidate 4K entry\n"); + ppgtt_invalidate_pte(sub_spt, &sub_se); + } + /* Release the new allocated spt. */ + trace_spt_change(sub_spt->vgpu->id, "release", sub_spt, + sub_spt->guest_page.gfn, sub_spt->shadow_page.type); + ppgtt_free_spt(sub_spt); + return ret; }
static int split_64KB_gtt_entry(struct intel_vgpu *vgpu,
From: Sebastian Andrzej Siewior bigeasy@linutronix.de
stable inclusion from stable-v5.10.142 commit d71a1c9fce184718d1b3a51a9e8a6e31cbbb45ce category: bugfix bugzilla: https://gitee.com/openeuler/kernel/issues/I6D0ZE
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?id=...
-------------------------------------------------
commit 278d3ba61563ceed3cb248383ced19e14ec7bc1f upstream.
On 32bit-UP u64_stats_fetch_begin() disables only preemption. If the reader is in preemptible context and the writer side (u64_stats_update_begin*()) runs in an interrupt context (IRQ or softirq) then the writer can update the stats during the read operation. This update remains undetected.
Use u64_stats_fetch_begin_irq() to ensure the stats fetch on 32bit-UP are not interrupted by a writer. 32bit-SMP remains unaffected by this change.
Cc: "David S. Miller" davem@davemloft.net Cc: Catherine Sullivan csully@google.com Cc: David Awogbemila awogbemila@google.com Cc: Dimitris Michailidis dmichail@fungible.com Cc: Eric Dumazet edumazet@google.com Cc: Hans Ulli Kroll ulli.kroll@googlemail.com Cc: Jakub Kicinski kuba@kernel.org Cc: Jeroen de Borst jeroendb@google.com Cc: Johannes Berg johannes@sipsolutions.net Cc: Linus Walleij linus.walleij@linaro.org Cc: Paolo Abeni pabeni@redhat.com Cc: Simon Horman simon.horman@corigine.com Cc: linux-arm-kernel@lists.infradead.org Cc: linux-wireless@vger.kernel.org Cc: netdev@vger.kernel.org Cc: oss-drivers@corigine.com Cc: stable@vger.kernel.org Signed-off-by: Sebastian Andrzej Siewior bigeasy@linutronix.de Reviewed-by: Simon Horman simon.horman@corigine.com Signed-off-by: David S. Miller davem@davemloft.net Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org (cherry picked from commit d71a1c9fce184718d1b3a51a9e8a6e31cbbb45ce) Signed-off-by: Wang Yufen wangyufen@huawei.com
Conflicts: drivers/net/ethernet/huawei/hinic/hinic_rx.c drivers/net/ethernet/huawei/hinic/hinic_tx.c
Signed-off-by: Wang Yufen wangyufen@huawei.com Reviewed-by: Liu Jian liujian56@huawei.com Signed-off-by: Jialin Zhang zhangjialin11@huawei.com --- drivers/net/ethernet/cortina/gemini.c | 24 +++++++++---------- drivers/net/ethernet/google/gve/gve_ethtool.c | 16 ++++++------- drivers/net/ethernet/google/gve/gve_main.c | 12 +++++----- drivers/net/ethernet/huawei/hinic/hinic_rx.c | 4 ++-- drivers/net/ethernet/huawei/hinic/hinic_tx.c | 4 ++-- .../ethernet/netronome/nfp/nfp_net_common.c | 8 +++---- .../ethernet/netronome/nfp/nfp_net_ethtool.c | 8 +++---- drivers/net/netdevsim/netdev.c | 4 ++-- net/mac80211/sta_info.c | 8 +++---- net/mpls/af_mpls.c | 4 ++-- 10 files changed, 46 insertions(+), 46 deletions(-)
diff --git a/drivers/net/ethernet/cortina/gemini.c b/drivers/net/ethernet/cortina/gemini.c index 368587864a7a..b22ea4069102 100644 --- a/drivers/net/ethernet/cortina/gemini.c +++ b/drivers/net/ethernet/cortina/gemini.c @@ -1920,7 +1920,7 @@ static void gmac_get_stats64(struct net_device *netdev,
/* Racing with RX NAPI */ do { - start = u64_stats_fetch_begin(&port->rx_stats_syncp); + start = u64_stats_fetch_begin_irq(&port->rx_stats_syncp);
stats->rx_packets = port->stats.rx_packets; stats->rx_bytes = port->stats.rx_bytes; @@ -1932,11 +1932,11 @@ static void gmac_get_stats64(struct net_device *netdev, stats->rx_crc_errors = port->stats.rx_crc_errors; stats->rx_frame_errors = port->stats.rx_frame_errors;
- } while (u64_stats_fetch_retry(&port->rx_stats_syncp, start)); + } while (u64_stats_fetch_retry_irq(&port->rx_stats_syncp, start));
/* Racing with MIB and TX completion interrupts */ do { - start = u64_stats_fetch_begin(&port->ir_stats_syncp); + start = u64_stats_fetch_begin_irq(&port->ir_stats_syncp);
stats->tx_errors = port->stats.tx_errors; stats->tx_packets = port->stats.tx_packets; @@ -1946,15 +1946,15 @@ static void gmac_get_stats64(struct net_device *netdev, stats->rx_missed_errors = port->stats.rx_missed_errors; stats->rx_fifo_errors = port->stats.rx_fifo_errors;
- } while (u64_stats_fetch_retry(&port->ir_stats_syncp, start)); + } while (u64_stats_fetch_retry_irq(&port->ir_stats_syncp, start));
/* Racing with hard_start_xmit */ do { - start = u64_stats_fetch_begin(&port->tx_stats_syncp); + start = u64_stats_fetch_begin_irq(&port->tx_stats_syncp);
stats->tx_dropped = port->stats.tx_dropped;
- } while (u64_stats_fetch_retry(&port->tx_stats_syncp, start)); + } while (u64_stats_fetch_retry_irq(&port->tx_stats_syncp, start));
stats->rx_dropped += stats->rx_missed_errors; } @@ -2032,18 +2032,18 @@ static void gmac_get_ethtool_stats(struct net_device *netdev, /* Racing with MIB interrupt */ do { p = values; - start = u64_stats_fetch_begin(&port->ir_stats_syncp); + start = u64_stats_fetch_begin_irq(&port->ir_stats_syncp);
for (i = 0; i < RX_STATS_NUM; i++) *p++ = port->hw_stats[i];
- } while (u64_stats_fetch_retry(&port->ir_stats_syncp, start)); + } while (u64_stats_fetch_retry_irq(&port->ir_stats_syncp, start)); values = p;
/* Racing with RX NAPI */ do { p = values; - start = u64_stats_fetch_begin(&port->rx_stats_syncp); + start = u64_stats_fetch_begin_irq(&port->rx_stats_syncp);
for (i = 0; i < RX_STATUS_NUM; i++) *p++ = port->rx_stats[i]; @@ -2051,13 +2051,13 @@ static void gmac_get_ethtool_stats(struct net_device *netdev, *p++ = port->rx_csum_stats[i]; *p++ = port->rx_napi_exits;
- } while (u64_stats_fetch_retry(&port->rx_stats_syncp, start)); + } while (u64_stats_fetch_retry_irq(&port->rx_stats_syncp, start)); values = p;
/* Racing with TX start_xmit */ do { p = values; - start = u64_stats_fetch_begin(&port->tx_stats_syncp); + start = u64_stats_fetch_begin_irq(&port->tx_stats_syncp);
for (i = 0; i < TX_MAX_FRAGS; i++) { *values++ = port->tx_frag_stats[i]; @@ -2066,7 +2066,7 @@ static void gmac_get_ethtool_stats(struct net_device *netdev, *values++ = port->tx_frags_linearized; *values++ = port->tx_hw_csummed;
- } while (u64_stats_fetch_retry(&port->tx_stats_syncp, start)); + } while (u64_stats_fetch_retry_irq(&port->tx_stats_syncp, start)); }
static int gmac_get_ksettings(struct net_device *netdev, diff --git a/drivers/net/ethernet/google/gve/gve_ethtool.c b/drivers/net/ethernet/google/gve/gve_ethtool.c index 66f9b377f8e7..80a8c0c231a3 100644 --- a/drivers/net/ethernet/google/gve/gve_ethtool.c +++ b/drivers/net/ethernet/google/gve/gve_ethtool.c @@ -172,14 +172,14 @@ gve_get_ethtool_stats(struct net_device *netdev, struct gve_rx_ring *rx = &priv->rx[ring];
start = - u64_stats_fetch_begin(&priv->rx[ring].statss); + u64_stats_fetch_begin_irq(&priv->rx[ring].statss); tmp_rx_pkts = rx->rpackets; tmp_rx_bytes = rx->rbytes; tmp_rx_skb_alloc_fail = rx->rx_skb_alloc_fail; tmp_rx_buf_alloc_fail = rx->rx_buf_alloc_fail; tmp_rx_desc_err_dropped_pkt = rx->rx_desc_err_dropped_pkt; - } while (u64_stats_fetch_retry(&priv->rx[ring].statss, + } while (u64_stats_fetch_retry_irq(&priv->rx[ring].statss, start)); rx_pkts += tmp_rx_pkts; rx_bytes += tmp_rx_bytes; @@ -193,10 +193,10 @@ gve_get_ethtool_stats(struct net_device *netdev, if (priv->tx) { do { start = - u64_stats_fetch_begin(&priv->tx[ring].statss); + u64_stats_fetch_begin_irq(&priv->tx[ring].statss); tmp_tx_pkts = priv->tx[ring].pkt_done; tmp_tx_bytes = priv->tx[ring].bytes_done; - } while (u64_stats_fetch_retry(&priv->tx[ring].statss, + } while (u64_stats_fetch_retry_irq(&priv->tx[ring].statss, start)); tx_pkts += tmp_tx_pkts; tx_bytes += tmp_tx_bytes; @@ -254,13 +254,13 @@ gve_get_ethtool_stats(struct net_device *netdev, data[i++] = rx->cnt; do { start = - u64_stats_fetch_begin(&priv->rx[ring].statss); + u64_stats_fetch_begin_irq(&priv->rx[ring].statss); tmp_rx_bytes = rx->rbytes; tmp_rx_skb_alloc_fail = rx->rx_skb_alloc_fail; tmp_rx_buf_alloc_fail = rx->rx_buf_alloc_fail; tmp_rx_desc_err_dropped_pkt = rx->rx_desc_err_dropped_pkt; - } while (u64_stats_fetch_retry(&priv->rx[ring].statss, + } while (u64_stats_fetch_retry_irq(&priv->rx[ring].statss, start)); data[i++] = tmp_rx_bytes; /* rx dropped packets */ @@ -313,9 +313,9 @@ gve_get_ethtool_stats(struct net_device *netdev, data[i++] = tx->done; do { start = - u64_stats_fetch_begin(&priv->tx[ring].statss); + u64_stats_fetch_begin_irq(&priv->tx[ring].statss); tmp_tx_bytes = tx->bytes_done; - } while (u64_stats_fetch_retry(&priv->tx[ring].statss, + } while (u64_stats_fetch_retry_irq(&priv->tx[ring].statss, start)); data[i++] = tmp_tx_bytes; data[i++] = tx->wake_queue; diff --git a/drivers/net/ethernet/google/gve/gve_main.c b/drivers/net/ethernet/google/gve/gve_main.c index 6cb75bb1ed05..f0c1e6c80b61 100644 --- a/drivers/net/ethernet/google/gve/gve_main.c +++ b/drivers/net/ethernet/google/gve/gve_main.c @@ -40,10 +40,10 @@ static void gve_get_stats(struct net_device *dev, struct rtnl_link_stats64 *s) for (ring = 0; ring < priv->rx_cfg.num_queues; ring++) { do { start = - u64_stats_fetch_begin(&priv->rx[ring].statss); + u64_stats_fetch_begin_irq(&priv->rx[ring].statss); packets = priv->rx[ring].rpackets; bytes = priv->rx[ring].rbytes; - } while (u64_stats_fetch_retry(&priv->rx[ring].statss, + } while (u64_stats_fetch_retry_irq(&priv->rx[ring].statss, start)); s->rx_packets += packets; s->rx_bytes += bytes; @@ -53,10 +53,10 @@ static void gve_get_stats(struct net_device *dev, struct rtnl_link_stats64 *s) for (ring = 0; ring < priv->tx_cfg.num_queues; ring++) { do { start = - u64_stats_fetch_begin(&priv->tx[ring].statss); + u64_stats_fetch_begin_irq(&priv->tx[ring].statss); packets = priv->tx[ring].pkt_done; bytes = priv->tx[ring].bytes_done; - } while (u64_stats_fetch_retry(&priv->tx[ring].statss, + } while (u64_stats_fetch_retry_irq(&priv->tx[ring].statss, start)); s->tx_packets += packets; s->tx_bytes += bytes; @@ -1041,9 +1041,9 @@ void gve_handle_report_stats(struct gve_priv *priv) if (priv->tx) { for (idx = 0; idx < priv->tx_cfg.num_queues; idx++) { do { - start = u64_stats_fetch_begin(&priv->tx[idx].statss); + start = u64_stats_fetch_begin_irq(&priv->tx[idx].statss); tx_bytes = priv->tx[idx].bytes_done; - } while (u64_stats_fetch_retry(&priv->tx[idx].statss, start)); + } while (u64_stats_fetch_retry_irq(&priv->tx[idx].statss, start)); stats[stats_idx++] = (struct stats) { .stat_name = cpu_to_be32(TX_WAKE_CNT), .value = cpu_to_be64(priv->tx[idx].wake_queue), diff --git a/drivers/net/ethernet/huawei/hinic/hinic_rx.c b/drivers/net/ethernet/huawei/hinic/hinic_rx.c index 57d5d792c6ce..1b57b6754085 100644 --- a/drivers/net/ethernet/huawei/hinic/hinic_rx.c +++ b/drivers/net/ethernet/huawei/hinic/hinic_rx.c @@ -375,7 +375,7 @@ void hinic_rxq_get_stats(struct hinic_rxq *rxq,
u64_stats_update_begin(&stats->syncp); do { - start = u64_stats_fetch_begin(&rxq_stats->syncp); + start = u64_stats_fetch_begin_irq(&rxq_stats->syncp); stats->bytes = rxq_stats->bytes; stats->packets = rxq_stats->packets; stats->errors = rxq_stats->csum_errors + @@ -384,7 +384,7 @@ void hinic_rxq_get_stats(struct hinic_rxq *rxq, stats->other_errors = rxq_stats->other_errors; stats->dropped = rxq_stats->dropped; stats->rx_buf_empty = rxq_stats->rx_buf_empty; - } while (u64_stats_fetch_retry(&rxq_stats->syncp, start)); + } while (u64_stats_fetch_retry_irq(&rxq_stats->syncp, start)); u64_stats_update_end(&stats->syncp); }
diff --git a/drivers/net/ethernet/huawei/hinic/hinic_tx.c b/drivers/net/ethernet/huawei/hinic/hinic_tx.c index 75fa3442674a..ff37b6f05b7b 100644 --- a/drivers/net/ethernet/huawei/hinic/hinic_tx.c +++ b/drivers/net/ethernet/huawei/hinic/hinic_tx.c @@ -61,7 +61,7 @@ void hinic_txq_get_stats(struct hinic_txq *txq,
u64_stats_update_begin(&stats->syncp); do { - start = u64_stats_fetch_begin(&txq_stats->syncp); + start = u64_stats_fetch_begin_irq(&txq_stats->syncp); stats->bytes = txq_stats->bytes; stats->packets = txq_stats->packets; stats->busy = txq_stats->busy; @@ -69,7 +69,7 @@ void hinic_txq_get_stats(struct hinic_txq *txq, stats->dropped = txq_stats->dropped; stats->big_frags_pkts = txq_stats->big_frags_pkts; stats->big_udp_pkts = txq_stats->big_udp_pkts; - } while (u64_stats_fetch_retry(&txq_stats->syncp, start)); + } while (u64_stats_fetch_retry_irq(&txq_stats->syncp, start)); u64_stats_update_end(&stats->syncp); }
diff --git a/drivers/net/ethernet/netronome/nfp/nfp_net_common.c b/drivers/net/ethernet/netronome/nfp/nfp_net_common.c index dfc1f32cda2b..5ab230aab2cd 100644 --- a/drivers/net/ethernet/netronome/nfp/nfp_net_common.c +++ b/drivers/net/ethernet/netronome/nfp/nfp_net_common.c @@ -3373,21 +3373,21 @@ static void nfp_net_stat64(struct net_device *netdev, unsigned int start;
do { - start = u64_stats_fetch_begin(&r_vec->rx_sync); + start = u64_stats_fetch_begin_irq(&r_vec->rx_sync); data[0] = r_vec->rx_pkts; data[1] = r_vec->rx_bytes; data[2] = r_vec->rx_drops; - } while (u64_stats_fetch_retry(&r_vec->rx_sync, start)); + } while (u64_stats_fetch_retry_irq(&r_vec->rx_sync, start)); stats->rx_packets += data[0]; stats->rx_bytes += data[1]; stats->rx_dropped += data[2];
do { - start = u64_stats_fetch_begin(&r_vec->tx_sync); + start = u64_stats_fetch_begin_irq(&r_vec->tx_sync); data[0] = r_vec->tx_pkts; data[1] = r_vec->tx_bytes; data[2] = r_vec->tx_errors; - } while (u64_stats_fetch_retry(&r_vec->tx_sync, start)); + } while (u64_stats_fetch_retry_irq(&r_vec->tx_sync, start)); stats->tx_packets += data[0]; stats->tx_bytes += data[1]; stats->tx_errors += data[2]; diff --git a/drivers/net/ethernet/netronome/nfp/nfp_net_ethtool.c b/drivers/net/ethernet/netronome/nfp/nfp_net_ethtool.c index bfcd90f6e983..d4136d3db7bf 100644 --- a/drivers/net/ethernet/netronome/nfp/nfp_net_ethtool.c +++ b/drivers/net/ethernet/netronome/nfp/nfp_net_ethtool.c @@ -498,7 +498,7 @@ static u64 *nfp_vnic_get_sw_stats(struct net_device *netdev, u64 *data) unsigned int start;
do { - start = u64_stats_fetch_begin(&nn->r_vecs[i].rx_sync); + start = u64_stats_fetch_begin_irq(&nn->r_vecs[i].rx_sync); data[0] = nn->r_vecs[i].rx_pkts; tmp[0] = nn->r_vecs[i].hw_csum_rx_ok; tmp[1] = nn->r_vecs[i].hw_csum_rx_inner_ok; @@ -506,10 +506,10 @@ static u64 *nfp_vnic_get_sw_stats(struct net_device *netdev, u64 *data) tmp[3] = nn->r_vecs[i].hw_csum_rx_error; tmp[4] = nn->r_vecs[i].rx_replace_buf_alloc_fail; tmp[5] = nn->r_vecs[i].hw_tls_rx; - } while (u64_stats_fetch_retry(&nn->r_vecs[i].rx_sync, start)); + } while (u64_stats_fetch_retry_irq(&nn->r_vecs[i].rx_sync, start));
do { - start = u64_stats_fetch_begin(&nn->r_vecs[i].tx_sync); + start = u64_stats_fetch_begin_irq(&nn->r_vecs[i].tx_sync); data[1] = nn->r_vecs[i].tx_pkts; data[2] = nn->r_vecs[i].tx_busy; tmp[6] = nn->r_vecs[i].hw_csum_tx; @@ -519,7 +519,7 @@ static u64 *nfp_vnic_get_sw_stats(struct net_device *netdev, u64 *data) tmp[10] = nn->r_vecs[i].hw_tls_tx; tmp[11] = nn->r_vecs[i].tls_tx_fallback; tmp[12] = nn->r_vecs[i].tls_tx_no_fallback; - } while (u64_stats_fetch_retry(&nn->r_vecs[i].tx_sync, start)); + } while (u64_stats_fetch_retry_irq(&nn->r_vecs[i].tx_sync, start));
data += NN_RVEC_PER_Q_STATS;
diff --git a/drivers/net/netdevsim/netdev.c b/drivers/net/netdevsim/netdev.c index ad6dbf011052..4fb0638a55b4 100644 --- a/drivers/net/netdevsim/netdev.c +++ b/drivers/net/netdevsim/netdev.c @@ -67,10 +67,10 @@ nsim_get_stats64(struct net_device *dev, struct rtnl_link_stats64 *stats) unsigned int start;
do { - start = u64_stats_fetch_begin(&ns->syncp); + start = u64_stats_fetch_begin_irq(&ns->syncp); stats->tx_bytes = ns->tx_bytes; stats->tx_packets = ns->tx_packets; - } while (u64_stats_fetch_retry(&ns->syncp, start)); + } while (u64_stats_fetch_retry_irq(&ns->syncp, start)); }
static int diff --git a/net/mac80211/sta_info.c b/net/mac80211/sta_info.c index 461c03737da8..cee39ae52245 100644 --- a/net/mac80211/sta_info.c +++ b/net/mac80211/sta_info.c @@ -2175,9 +2175,9 @@ static inline u64 sta_get_tidstats_msdu(struct ieee80211_sta_rx_stats *rxstats, u64 value;
do { - start = u64_stats_fetch_begin(&rxstats->syncp); + start = u64_stats_fetch_begin_irq(&rxstats->syncp); value = rxstats->msdu[tid]; - } while (u64_stats_fetch_retry(&rxstats->syncp, start)); + } while (u64_stats_fetch_retry_irq(&rxstats->syncp, start));
return value; } @@ -2241,9 +2241,9 @@ static inline u64 sta_get_stats_bytes(struct ieee80211_sta_rx_stats *rxstats) u64 value;
do { - start = u64_stats_fetch_begin(&rxstats->syncp); + start = u64_stats_fetch_begin_irq(&rxstats->syncp); value = rxstats->bytes; - } while (u64_stats_fetch_retry(&rxstats->syncp, start)); + } while (u64_stats_fetch_retry_irq(&rxstats->syncp, start));
return value; } diff --git a/net/mpls/af_mpls.c b/net/mpls/af_mpls.c index 9c047c148a11..72398149e4d4 100644 --- a/net/mpls/af_mpls.c +++ b/net/mpls/af_mpls.c @@ -1078,9 +1078,9 @@ static void mpls_get_stats(struct mpls_dev *mdev,
p = per_cpu_ptr(mdev->stats, i); do { - start = u64_stats_fetch_begin(&p->syncp); + start = u64_stats_fetch_begin_irq(&p->syncp); local = p->stats; - } while (u64_stats_fetch_retry(&p->syncp, start)); + } while (u64_stats_fetch_retry_irq(&p->syncp, start));
stats->rx_packets += local.rx_packets; stats->rx_bytes += local.rx_bytes;
From: Kuniyuki Iwashima kuniyu@amazon.com
stable inclusion from stable-v5.10.154 commit 2bf33b5ea46dbe547de44cdcbee6b1c0b6c167d4 category: bugfix bugzilla: https://gitee.com/openeuler/kernel/issues/I6D0ZE
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?id=...
-------------------------------------------------
commit 11052589cf5c0bab3b4884d423d5f60c38fcf25d upstream.
Commit e21145a9871a ("ipv4: namespacify ip_early_demux sysctl knob") made it possible to enable/disable early_demux on a per-netns basis. Then, we introduced two knobs, tcp_early_demux and udp_early_demux, to switch it for TCP/UDP in commit dddb64bcb346 ("net: Add sysctl to toggle early demux for tcp and udp"). However, the .proc_handler() was wrong and actually disabled us from changing the behaviour in each netns.
We can execute early_demux if net.ipv4.ip_early_demux is on and each proto .early_demux() handler is not NULL. When we toggle (tcp|udp)_early_demux, the change itself is saved in each netns variable, but the .early_demux() handler is a global variable, so the handler is switched based on the init_net's sysctl variable. Thus, netns (tcp|udp)_early_demux knobs have nothing to do with the logic. Whether we CAN execute proto .early_demux() is always decided by init_net's sysctl knob, and whether we DO it or not is by each netns ip_early_demux knob.
This patch namespacifies (tcp|udp)_early_demux again. For now, the users of the .early_demux() handler are TCP and UDP only, and they are called directly to avoid retpoline. So, we can remove the .early_demux() handler from inet6?_protos and need not dereference them in ip6?_rcv_finish_core(). If another proto needs .early_demux(), we can restore it at that time.
Fixes: dddb64bcb346 ("net: Add sysctl to toggle early demux for tcp and udp") Signed-off-by: Kuniyuki Iwashima kuniyu@amazon.com Link: https://lore.kernel.org/r/20220713175207.7727-1-kuniyu@amazon.com Signed-off-by: Jakub Kicinski kuba@kernel.org Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org (cherry picked from commit 2bf33b5ea46dbe547de44cdcbee6b1c0b6c167d4) Signed-off-by: Wang Yufen wangyufen@huawei.com Reviewed-by: Liu Jian liujian56@huawei.com Signed-off-by: Jialin Zhang zhangjialin11@huawei.com --- include/net/protocol.h | 4 --- include/net/tcp.h | 2 +- include/net/udp.h | 1 + net/ipv4/af_inet.c | 14 ++------- net/ipv4/ip_input.c | 37 ++++++++++++++---------- net/ipv4/sysctl_net_ipv4.c | 59 ++------------------------------------ net/ipv6/ip6_input.c | 26 ++++++++++------- net/ipv6/tcp_ipv6.c | 9 ++---- net/ipv6/udp.c | 9 ++---- 9 files changed, 47 insertions(+), 114 deletions(-)
diff --git a/include/net/protocol.h b/include/net/protocol.h index 2b778e1d2d8f..0fd2df844fc7 100644 --- a/include/net/protocol.h +++ b/include/net/protocol.h @@ -35,8 +35,6 @@
/* This is used to register protocols. */ struct net_protocol { - int (*early_demux)(struct sk_buff *skb); - int (*early_demux_handler)(struct sk_buff *skb); int (*handler)(struct sk_buff *skb);
/* This returns an error if we weren't able to handle the error. */ @@ -53,8 +51,6 @@ struct net_protocol {
#if IS_ENABLED(CONFIG_IPV6) struct inet6_protocol { - void (*early_demux)(struct sk_buff *skb); - void (*early_demux_handler)(struct sk_buff *skb); int (*handler)(struct sk_buff *skb);
/* This returns an error if we weren't able to handle the error. */ diff --git a/include/net/tcp.h b/include/net/tcp.h index 8ca542924957..04036bc8c0b3 100644 --- a/include/net/tcp.h +++ b/include/net/tcp.h @@ -937,7 +937,7 @@ extern const struct inet_connection_sock_af_ops ipv6_specific;
INDIRECT_CALLABLE_DECLARE(void tcp_v6_send_check(struct sock *sk, struct sk_buff *skb)); INDIRECT_CALLABLE_DECLARE(int tcp_v6_rcv(struct sk_buff *skb)); -INDIRECT_CALLABLE_DECLARE(void tcp_v6_early_demux(struct sk_buff *skb)); +void tcp_v6_early_demux(struct sk_buff *skb);
#endif
diff --git a/include/net/udp.h b/include/net/udp.h index 010bc324f860..388e68c7bca0 100644 --- a/include/net/udp.h +++ b/include/net/udp.h @@ -176,6 +176,7 @@ INDIRECT_CALLABLE_DECLARE(int udp6_gro_complete(struct sk_buff *, int)); struct sk_buff *udp_gro_receive(struct list_head *head, struct sk_buff *skb, struct udphdr *uh, struct sock *sk); int udp_gro_complete(struct sk_buff *skb, int nhoff, udp_lookup_t lookup); +void udp_v6_early_demux(struct sk_buff *skb);
struct sk_buff *__udp_gso_segment(struct sk_buff *gso_skb, netdev_features_t features, bool is_ipv6); diff --git a/net/ipv4/af_inet.c b/net/ipv4/af_inet.c index 4629beccd2d4..0e5b739ae2c1 100644 --- a/net/ipv4/af_inet.c +++ b/net/ipv4/af_inet.c @@ -1730,12 +1730,7 @@ static const struct net_protocol igmp_protocol = { }; #endif
-/* thinking of making this const? Don't. - * early_demux can change based on sysctl. - */ -static struct net_protocol tcp_protocol = { - .early_demux = tcp_v4_early_demux, - .early_demux_handler = tcp_v4_early_demux, +static const struct net_protocol tcp_protocol = { .handler = tcp_v4_rcv, .err_handler = tcp_v4_err, .no_policy = 1, @@ -1743,12 +1738,7 @@ static struct net_protocol tcp_protocol = { .icmp_strict_tag_validation = 1, };
-/* thinking of making this const? Don't. - * early_demux can change based on sysctl. - */ -static struct net_protocol udp_protocol = { - .early_demux = udp_v4_early_demux, - .early_demux_handler = udp_v4_early_demux, +static const struct net_protocol udp_protocol = { .handler = udp_rcv, .err_handler = udp_err, .no_policy = 1, diff --git a/net/ipv4/ip_input.c b/net/ipv4/ip_input.c index b0c244af1e4d..f6b3237e88ca 100644 --- a/net/ipv4/ip_input.c +++ b/net/ipv4/ip_input.c @@ -309,14 +309,13 @@ static bool ip_can_use_hint(const struct sk_buff *skb, const struct iphdr *iph, ip_hdr(hint)->tos == iph->tos; }
-INDIRECT_CALLABLE_DECLARE(int udp_v4_early_demux(struct sk_buff *)); -INDIRECT_CALLABLE_DECLARE(int tcp_v4_early_demux(struct sk_buff *)); +int tcp_v4_early_demux(struct sk_buff *skb); +int udp_v4_early_demux(struct sk_buff *skb); static int ip_rcv_finish_core(struct net *net, struct sock *sk, struct sk_buff *skb, struct net_device *dev, const struct sk_buff *hint) { const struct iphdr *iph = ip_hdr(skb); - int (*edemux)(struct sk_buff *skb); struct rtable *rt; int err;
@@ -327,21 +326,29 @@ static int ip_rcv_finish_core(struct net *net, struct sock *sk, goto drop_error; }
- if (net->ipv4.sysctl_ip_early_demux && + if (READ_ONCE(net->ipv4.sysctl_ip_early_demux) && !skb_dst(skb) && !skb->sk && !ip_is_fragment(iph)) { - const struct net_protocol *ipprot; - int protocol = iph->protocol; - - ipprot = rcu_dereference(inet_protos[protocol]); - if (ipprot && (edemux = READ_ONCE(ipprot->early_demux))) { - err = INDIRECT_CALL_2(edemux, tcp_v4_early_demux, - udp_v4_early_demux, skb); - if (unlikely(err)) - goto drop_error; - /* must reload iph, skb->head might have changed */ - iph = ip_hdr(skb); + switch (iph->protocol) { + case IPPROTO_TCP: + if (READ_ONCE(net->ipv4.sysctl_tcp_early_demux)) { + tcp_v4_early_demux(skb); + + /* must reload iph, skb->head might have changed */ + iph = ip_hdr(skb); + } + break; + case IPPROTO_UDP: + if (READ_ONCE(net->ipv4.sysctl_udp_early_demux)) { + err = udp_v4_early_demux(skb); + if (unlikely(err)) + goto drop_error; + + /* must reload iph, skb->head might have changed */ + iph = ip_hdr(skb); + } + break; } }
diff --git a/net/ipv4/sysctl_net_ipv4.c b/net/ipv4/sysctl_net_ipv4.c index 9f5e4425aecf..dee47ef05655 100644 --- a/net/ipv4/sysctl_net_ipv4.c +++ b/net/ipv4/sysctl_net_ipv4.c @@ -361,61 +361,6 @@ static int proc_tcp_fastopen_key(struct ctl_table *table, int write, return ret; }
-static void proc_configure_early_demux(int enabled, int protocol) -{ - struct net_protocol *ipprot; -#if IS_ENABLED(CONFIG_IPV6) - struct inet6_protocol *ip6prot; -#endif - - rcu_read_lock(); - - ipprot = rcu_dereference(inet_protos[protocol]); - if (ipprot) - ipprot->early_demux = enabled ? ipprot->early_demux_handler : - NULL; - -#if IS_ENABLED(CONFIG_IPV6) - ip6prot = rcu_dereference(inet6_protos[protocol]); - if (ip6prot) - ip6prot->early_demux = enabled ? ip6prot->early_demux_handler : - NULL; -#endif - rcu_read_unlock(); -} - -static int proc_tcp_early_demux(struct ctl_table *table, int write, - void *buffer, size_t *lenp, loff_t *ppos) -{ - int ret = 0; - - ret = proc_dointvec(table, write, buffer, lenp, ppos); - - if (write && !ret) { - int enabled = init_net.ipv4.sysctl_tcp_early_demux; - - proc_configure_early_demux(enabled, IPPROTO_TCP); - } - - return ret; -} - -static int proc_udp_early_demux(struct ctl_table *table, int write, - void *buffer, size_t *lenp, loff_t *ppos) -{ - int ret = 0; - - ret = proc_dointvec(table, write, buffer, lenp, ppos); - - if (write && !ret) { - int enabled = init_net.ipv4.sysctl_udp_early_demux; - - proc_configure_early_demux(enabled, IPPROTO_UDP); - } - - return ret; -} - static int proc_tfo_blackhole_detect_timeout(struct ctl_table *table, int write, void *buffer, size_t *lenp, loff_t *ppos) @@ -727,14 +672,14 @@ static struct ctl_table ipv4_net_table[] = { .data = &init_net.ipv4.sysctl_udp_early_demux, .maxlen = sizeof(int), .mode = 0644, - .proc_handler = proc_udp_early_demux + .proc_handler = proc_douintvec_minmax, }, { .procname = "tcp_early_demux", .data = &init_net.ipv4.sysctl_tcp_early_demux, .maxlen = sizeof(int), .mode = 0644, - .proc_handler = proc_tcp_early_demux + .proc_handler = proc_douintvec_minmax, }, { .procname = "nexthop_compat_mode", diff --git a/net/ipv6/ip6_input.c b/net/ipv6/ip6_input.c index 15ea3d082534..4eb9fbfdce33 100644 --- a/net/ipv6/ip6_input.c +++ b/net/ipv6/ip6_input.c @@ -44,21 +44,25 @@ #include <net/inet_ecn.h> #include <net/dst_metadata.h>
-INDIRECT_CALLABLE_DECLARE(void udp_v6_early_demux(struct sk_buff *)); -INDIRECT_CALLABLE_DECLARE(void tcp_v6_early_demux(struct sk_buff *)); +void udp_v6_early_demux(struct sk_buff *); +void tcp_v6_early_demux(struct sk_buff *); static void ip6_rcv_finish_core(struct net *net, struct sock *sk, struct sk_buff *skb) { - void (*edemux)(struct sk_buff *skb); - - if (net->ipv4.sysctl_ip_early_demux && !skb_dst(skb) && skb->sk == NULL) { - const struct inet6_protocol *ipprot; - - ipprot = rcu_dereference(inet6_protos[ipv6_hdr(skb)->nexthdr]); - if (ipprot && (edemux = READ_ONCE(ipprot->early_demux))) - INDIRECT_CALL_2(edemux, tcp_v6_early_demux, - udp_v6_early_demux, skb); + if (READ_ONCE(net->ipv4.sysctl_ip_early_demux) && + !skb_dst(skb) && !skb->sk) { + switch (ipv6_hdr(skb)->nexthdr) { + case IPPROTO_TCP: + if (READ_ONCE(net->ipv4.sysctl_tcp_early_demux)) + tcp_v6_early_demux(skb); + break; + case IPPROTO_UDP: + if (READ_ONCE(net->ipv4.sysctl_udp_early_demux)) + udp_v6_early_demux(skb); + break; + } } + if (!skb_valid_dst(skb)) ip6_route_input(skb); } diff --git a/net/ipv6/tcp_ipv6.c b/net/ipv6/tcp_ipv6.c index 4ab5a681bf60..928be701fefb 100644 --- a/net/ipv6/tcp_ipv6.c +++ b/net/ipv6/tcp_ipv6.c @@ -1820,7 +1820,7 @@ INDIRECT_CALLABLE_SCOPE int tcp_v6_rcv(struct sk_buff *skb) goto discard_it; }
-INDIRECT_CALLABLE_SCOPE void tcp_v6_early_demux(struct sk_buff *skb) +void tcp_v6_early_demux(struct sk_buff *skb) { const struct ipv6hdr *hdr; const struct tcphdr *th; @@ -2171,12 +2171,7 @@ struct proto tcpv6_prot = { }; EXPORT_SYMBOL_GPL(tcpv6_prot);
-/* thinking of making this const? Don't. - * early_demux can change based on sysctl. - */ -static struct inet6_protocol tcpv6_protocol = { - .early_demux = tcp_v6_early_demux, - .early_demux_handler = tcp_v6_early_demux, +static const struct inet6_protocol tcpv6_protocol = { .handler = tcp_v6_rcv, .err_handler = tcp_v6_err, .flags = INET6_PROTO_NOPOLICY|INET6_PROTO_FINAL, diff --git a/net/ipv6/udp.c b/net/ipv6/udp.c index 6f21534967b3..08a2ceb6b999 100644 --- a/net/ipv6/udp.c +++ b/net/ipv6/udp.c @@ -1026,7 +1026,7 @@ static struct sock *__udp6_lib_demux_lookup(struct net *net, return NULL; }
-INDIRECT_CALLABLE_SCOPE void udp_v6_early_demux(struct sk_buff *skb) +void udp_v6_early_demux(struct sk_buff *skb) { struct net *net = dev_net(skb->dev); const struct udphdr *uh; @@ -1639,12 +1639,7 @@ int udpv6_getsockopt(struct sock *sk, int level, int optname, return ipv6_getsockopt(sk, level, optname, optval, optlen); }
-/* thinking of making this const? Don't. - * early_demux can change based on sysctl. - */ -static struct inet6_protocol udpv6_protocol = { - .early_demux = udp_v6_early_demux, - .early_demux_handler = udp_v6_early_demux, +static const struct inet6_protocol udpv6_protocol = { .handler = udpv6_rcv, .err_handler = udpv6_err, .flags = INET6_PROTO_NOPOLICY|INET6_PROTO_FINAL,
From: Wang Yufen wangyufen@huawei.com
Offering: HULK hulk inclusion category: bugfix bugzilla: https://gitee.com/openeuler/kernel/issues/I6D0ZE
--------------------------------
Add early_demux_handler and early_demux back to fix kabi broken in struct net_protocol and inet6_protocol.
Signed-off-by: Wang Yufen wangyufen@huawei.com Reviewed-by: Liu Jian liujian56@huawei.com Signed-off-by: Jialin Zhang zhangjialin11@huawei.com --- include/net/protocol.h | 4 ++++ 1 file changed, 4 insertions(+)
diff --git a/include/net/protocol.h b/include/net/protocol.h index 0fd2df844fc7..2b778e1d2d8f 100644 --- a/include/net/protocol.h +++ b/include/net/protocol.h @@ -35,6 +35,8 @@
/* This is used to register protocols. */ struct net_protocol { + int (*early_demux)(struct sk_buff *skb); + int (*early_demux_handler)(struct sk_buff *skb); int (*handler)(struct sk_buff *skb);
/* This returns an error if we weren't able to handle the error. */ @@ -51,6 +53,8 @@ struct net_protocol {
#if IS_ENABLED(CONFIG_IPV6) struct inet6_protocol { + void (*early_demux)(struct sk_buff *skb); + void (*early_demux_handler)(struct sk_buff *skb); int (*handler)(struct sk_buff *skb);
/* This returns an error if we weren't able to handle the error. */
From: Marco Elver elver@google.com
stable inclusion from stable-v5.10.163 commit 67349025f00d0749e36386cfcfc32c2887f47fdb category: bugfix bugzilla: https://gitee.com/src-openeuler/kernel/issues/I6D0MR CVE: NA
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?id=...
--------------------------------
[ Upstream commit fa69ee5aa48b5b52e8028c2eb486906e9998d081 ]
It turns out that usage of skb extensions can cause memory leaks. Ido Schimmel reported: "[...] there are instances that blindly overwrite 'skb->extensions' by invoking skb_copy_header() after __alloc_skb()."
Therefore, give up on using skb extensions for KCOV handle, and instead directly store kcov_handle in sk_buff.
Fixes: 6370cc3bbd8a ("net: add kcov handle to skb extensions") Fixes: 85ce50d337d1 ("net: kcov: don't select SKB_EXTENSIONS when there is no NET") Fixes: 97f53a08cba1 ("net: linux/skbuff.h: combine SKB_EXTENSIONS + KCOV handling") Link: https://lore.kernel.org/linux-wireless/20201121160941.GA485907@shredder.lan/ Reported-by: Ido Schimmel idosch@idosch.org Signed-off-by: Marco Elver elver@google.com Link: https://lore.kernel.org/r/20201125224840.2014773-1-elver@google.com Signed-off-by: Jakub Kicinski kuba@kernel.org Stable-dep-of: db0b124f02ba ("igc: Enhance Qbv scheduling by using first flag bit") Signed-off-by: Sasha Levin sashal@kernel.org
Conflicts: include/linux/skbuff.h
Signed-off-by: Zhengchao Shao shaozhengchao@huawei.com Reviewed-by: Liu Jian liujian56@huawei.com Signed-off-by: Jialin Zhang zhangjialin11@huawei.com --- include/linux/skbuff.h | 37 +++++++++++++------------------------ lib/Kconfig.debug | 1 - net/core/skbuff.c | 6 ------ 3 files changed, 13 insertions(+), 31 deletions(-)
diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h index 26c431883c69..d16c8bd085f3 100644 --- a/include/linux/skbuff.h +++ b/include/linux/skbuff.h @@ -706,6 +706,7 @@ typedef unsigned char *sk_buff_data_t; * @transport_header: Transport layer header * @network_header: Network layer header * @mac_header: Link layer header + * @kcov_handle: KCOV remote handle for remote coverage collection * @scm_io_uring: SKB holds io_uring registered files * @tail: Tail pointer * @end: End pointer @@ -913,6 +914,10 @@ struct sk_buff { __u16 network_header; __u16 mac_header;
+#ifdef CONFIG_KCOV + u64 kcov_handle; +#endif + /* private: */ __u32 headers_end[0]; /* public: */ @@ -4223,9 +4228,6 @@ enum skb_ext_id { #endif #if IS_ENABLED(CONFIG_MPTCP) SKB_EXT_MPTCP, -#endif -#if IS_ENABLED(CONFIG_KCOV) - SKB_EXT_KCOV_HANDLE, #endif SKB_EXT_NUM, /* must be last */ }; @@ -4681,35 +4683,22 @@ static inline void skb_reset_redirect(struct sk_buff *skb) #endif }
-#if IS_ENABLED(CONFIG_KCOV) && IS_ENABLED(CONFIG_SKB_EXTENSIONS) static inline void skb_set_kcov_handle(struct sk_buff *skb, const u64 kcov_handle) { - /* Do not allocate skb extensions only to set kcov_handle to zero - * (as it is zero by default). However, if the extensions are - * already allocated, update kcov_handle anyway since - * skb_set_kcov_handle can be called to zero a previously set - * value. - */ - if (skb_has_extensions(skb) || kcov_handle) { - u64 *kcov_handle_ptr = skb_ext_add(skb, SKB_EXT_KCOV_HANDLE); - - if (kcov_handle_ptr) - *kcov_handle_ptr = kcov_handle; - } +#ifdef CONFIG_KCOV + skb->kcov_handle = kcov_handle; +#endif }
static inline u64 skb_get_kcov_handle(struct sk_buff *skb) { - u64 *kcov_handle = skb_ext_find(skb, SKB_EXT_KCOV_HANDLE); - - return kcov_handle ? *kcov_handle : 0; -} +#ifdef CONFIG_KCOV + return skb->kcov_handle; #else -static inline void skb_set_kcov_handle(struct sk_buff *skb, - const u64 kcov_handle) { } -static inline u64 skb_get_kcov_handle(struct sk_buff *skb) { return 0; } -#endif /* CONFIG_KCOV && CONFIG_SKB_EXTENSIONS */ + return 0; +#endif +}
static inline bool skb_csum_is_sctp(struct sk_buff *skb) { diff --git a/lib/Kconfig.debug b/lib/Kconfig.debug index 10e425c30486..f53afec6f7ae 100644 --- a/lib/Kconfig.debug +++ b/lib/Kconfig.debug @@ -1945,7 +1945,6 @@ config KCOV depends on CC_HAS_SANCOV_TRACE_PC || GCC_PLUGINS select DEBUG_FS select GCC_PLUGIN_SANCOV if !CC_HAS_SANCOV_TRACE_PC - select SKB_EXTENSIONS if NET help KCOV exposes kernel code coverage information in a form suitable for coverage-guided fuzzing (randomized testing). diff --git a/net/core/skbuff.c b/net/core/skbuff.c index bb31ac64ec38..7f0b5e3a5f78 100644 --- a/net/core/skbuff.c +++ b/net/core/skbuff.c @@ -4334,9 +4334,6 @@ static const u8 skb_ext_type_len[] = { #if IS_ENABLED(CONFIG_MPTCP) [SKB_EXT_MPTCP] = SKB_EXT_CHUNKSIZEOF(struct mptcp_ext), #endif -#if IS_ENABLED(CONFIG_KCOV) - [SKB_EXT_KCOV_HANDLE] = SKB_EXT_CHUNKSIZEOF(u64), -#endif };
static __always_inline unsigned int skb_ext_total_length(void) @@ -4353,9 +4350,6 @@ static __always_inline unsigned int skb_ext_total_length(void) #endif #if IS_ENABLED(CONFIG_MPTCP) skb_ext_type_len[SKB_EXT_MPTCP] + -#endif -#if IS_ENABLED(CONFIG_KCOV) - skb_ext_type_len[SKB_EXT_KCOV_HANDLE] + #endif 0; }
From: Eric Dumazet edumazet@google.com
stable inclusion from stable-v5.10.156 commit e929ec98c0c3b10d9c07f3776df0c1a02d7a763e category: bugfix bugzilla: https://gitee.com/openeuler/kernel/issues/I67UR2 CVE: NA
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?id=...
--------------------------------
commit b64085b00044bdf3cd1c9825e9ef5b2e0feae91a upstream.
macvlan should enforce a minimal mtu of 68, even at link creation.
This patch avoids the current behavior (which could lead to crashes in ipv6 stack if the link is brought up)
$ ip link add macvlan1 link eno1 mtu 8 type macvlan # This should fail ! $ ip link sh dev macvlan1 5: macvlan1@eno1: <BROADCAST,MULTICAST> mtu 8 qdisc noop state DOWN mode DEFAULT group default qlen 1000 link/ether 02:47:6c:24:74:82 brd ff:ff:ff:ff:ff:ff $ ip link set macvlan1 mtu 67 Error: mtu less than device minimum. $ ip link set macvlan1 mtu 68 $ ip link set macvlan1 mtu 8 Error: mtu less than device minimum.
Fixes: 91572088e3fd ("net: use core MTU range checking in core net infra") Reported-by: syzbot syzkaller@googlegroups.com Signed-off-by: Eric Dumazet edumazet@google.com Signed-off-by: David S. Miller davem@davemloft.net Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org Signed-off-by: Zhengchao Shao shaozhengchao@huawei.com Reviewed-by: Liu Jian liujian56@huawei.com Signed-off-by: Jialin Zhang zhangjialin11@huawei.com --- drivers/net/macvlan.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/drivers/net/macvlan.c b/drivers/net/macvlan.c index c8d803d3616c..a421137e4ff8 100644 --- a/drivers/net/macvlan.c +++ b/drivers/net/macvlan.c @@ -1176,7 +1176,7 @@ void macvlan_common_setup(struct net_device *dev) { ether_setup(dev);
- dev->min_mtu = 0; + /* ether_setup() has set dev->min_mtu to ETH_MIN_MTU. */ dev->max_mtu = ETH_MAX_MTU; dev->priv_flags &= ~IFF_TX_SKB_SHARING; netif_keep_dst(dev);
From: Eric Dumazet edumazet@google.com
stable inclusion from stable-v5.10.152 commit 7aa3d623c11b9ab60f86b7833666e5d55bac4be9 category: bugfix bugzilla: https://gitee.com/openeuler/kernel/issues/I64L17 CVE: NA
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?id=...
--------------------------------
[ Upstream commit ebda44da44f6f309d302522b049f43d6f829f7aa ]
We had one syzbot report [1] in syzbot queue for a while. I was waiting for more occurrences and/or a repro but Dmitry Vyukov spotted the issue right away.
<quoting Dmitry> qdisc_graft() drops reference to qdisc in notify_and_destroy while it's still assigned to dev->qdisc </quoting>
Indeed, RCU rules are clear when replacing a data structure. The visible pointer (dev->qdisc in this case) must be updated to the new object _before_ RCU grace period is started (qdisc_put(old) in this case).
[1] BUG: KASAN: use-after-free in __tcf_qdisc_find.part.0+0xa3a/0xac0 net/sched/cls_api.c:1066 Read of size 4 at addr ffff88802065e038 by task syz-executor.4/21027
CPU: 0 PID: 21027 Comm: syz-executor.4 Not tainted 6.0.0-rc3-syzkaller-00363-g7726d4c3e60b #0 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 08/26/2022 Call Trace: <TASK> __dump_stack lib/dump_stack.c:88 [inline] dump_stack_lvl+0xcd/0x134 lib/dump_stack.c:106 print_address_description mm/kasan/report.c:317 [inline] print_report.cold+0x2ba/0x719 mm/kasan/report.c:433 kasan_report+0xb1/0x1e0 mm/kasan/report.c:495 __tcf_qdisc_find.part.0+0xa3a/0xac0 net/sched/cls_api.c:1066 __tcf_qdisc_find net/sched/cls_api.c:1051 [inline] tc_new_tfilter+0x34f/0x2200 net/sched/cls_api.c:2018 rtnetlink_rcv_msg+0x955/0xca0 net/core/rtnetlink.c:6081 netlink_rcv_skb+0x153/0x420 net/netlink/af_netlink.c:2501 netlink_unicast_kernel net/netlink/af_netlink.c:1319 [inline] netlink_unicast+0x543/0x7f0 net/netlink/af_netlink.c:1345 netlink_sendmsg+0x917/0xe10 net/netlink/af_netlink.c:1921 sock_sendmsg_nosec net/socket.c:714 [inline] sock_sendmsg+0xcf/0x120 net/socket.c:734 ____sys_sendmsg+0x6eb/0x810 net/socket.c:2482 ___sys_sendmsg+0x110/0x1b0 net/socket.c:2536 __sys_sendmsg+0xf3/0x1c0 net/socket.c:2565 do_syscall_x64 arch/x86/entry/common.c:50 [inline] do_syscall_64+0x35/0xb0 arch/x86/entry/common.c:80 entry_SYSCALL_64_after_hwframe+0x63/0xcd RIP: 0033:0x7f5efaa89279 Code: ff ff c3 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 40 00 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 c7 c1 b8 ff ff ff f7 d8 64 89 01 48 RSP: 002b:00007f5efbc31168 EFLAGS: 00000246 ORIG_RAX: 000000000000002e RAX: ffffffffffffffda RBX: 00007f5efab9bf80 RCX: 00007f5efaa89279 RDX: 0000000000000000 RSI: 0000000020000140 RDI: 0000000000000005 RBP: 00007f5efaae32e9 R08: 0000000000000000 R09: 0000000000000000 R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000 R13: 00007f5efb0cfb1f R14: 00007f5efbc31300 R15: 0000000000022000 </TASK>
Allocated by task 21027: kasan_save_stack+0x1e/0x40 mm/kasan/common.c:38 kasan_set_track mm/kasan/common.c:45 [inline] set_alloc_info mm/kasan/common.c:437 [inline] ____kasan_kmalloc mm/kasan/common.c:516 [inline] ____kasan_kmalloc mm/kasan/common.c:475 [inline] __kasan_kmalloc+0xa9/0xd0 mm/kasan/common.c:525 kmalloc_node include/linux/slab.h:623 [inline] kzalloc_node include/linux/slab.h:744 [inline] qdisc_alloc+0xb0/0xc50 net/sched/sch_generic.c:938 qdisc_create_dflt+0x71/0x4a0 net/sched/sch_generic.c:997 attach_one_default_qdisc net/sched/sch_generic.c:1152 [inline] netdev_for_each_tx_queue include/linux/netdevice.h:2437 [inline] attach_default_qdiscs net/sched/sch_generic.c:1170 [inline] dev_activate+0x760/0xcd0 net/sched/sch_generic.c:1229 __dev_open+0x393/0x4d0 net/core/dev.c:1441 __dev_change_flags+0x583/0x750 net/core/dev.c:8556 rtnl_configure_link+0xee/0x240 net/core/rtnetlink.c:3189 rtnl_newlink_create net/core/rtnetlink.c:3371 [inline] __rtnl_newlink+0x10b8/0x17e0 net/core/rtnetlink.c:3580 rtnl_newlink+0x64/0xa0 net/core/rtnetlink.c:3593 rtnetlink_rcv_msg+0x43a/0xca0 net/core/rtnetlink.c:6090 netlink_rcv_skb+0x153/0x420 net/netlink/af_netlink.c:2501 netlink_unicast_kernel net/netlink/af_netlink.c:1319 [inline] netlink_unicast+0x543/0x7f0 net/netlink/af_netlink.c:1345 netlink_sendmsg+0x917/0xe10 net/netlink/af_netlink.c:1921 sock_sendmsg_nosec net/socket.c:714 [inline] sock_sendmsg+0xcf/0x120 net/socket.c:734 ____sys_sendmsg+0x6eb/0x810 net/socket.c:2482 ___sys_sendmsg+0x110/0x1b0 net/socket.c:2536 __sys_sendmsg+0xf3/0x1c0 net/socket.c:2565 do_syscall_x64 arch/x86/entry/common.c:50 [inline] do_syscall_64+0x35/0xb0 arch/x86/entry/common.c:80 entry_SYSCALL_64_after_hwframe+0x63/0xcd
Freed by task 21020: kasan_save_stack+0x1e/0x40 mm/kasan/common.c:38 kasan_set_track+0x21/0x30 mm/kasan/common.c:45 kasan_set_free_info+0x20/0x30 mm/kasan/generic.c:370 ____kasan_slab_free mm/kasan/common.c:367 [inline] ____kasan_slab_free+0x166/0x1c0 mm/kasan/common.c:329 kasan_slab_free include/linux/kasan.h:200 [inline] slab_free_hook mm/slub.c:1754 [inline] slab_free_freelist_hook+0x8b/0x1c0 mm/slub.c:1780 slab_free mm/slub.c:3534 [inline] kfree+0xe2/0x580 mm/slub.c:4562 rcu_do_batch kernel/rcu/tree.c:2245 [inline] rcu_core+0x7b5/0x1890 kernel/rcu/tree.c:2505 __do_softirq+0x1d3/0x9c6 kernel/softirq.c:571
Last potentially related work creation: kasan_save_stack+0x1e/0x40 mm/kasan/common.c:38 __kasan_record_aux_stack+0xbe/0xd0 mm/kasan/generic.c:348 call_rcu+0x99/0x790 kernel/rcu/tree.c:2793 qdisc_put+0xcd/0xe0 net/sched/sch_generic.c:1083 notify_and_destroy net/sched/sch_api.c:1012 [inline] qdisc_graft+0xeb1/0x1270 net/sched/sch_api.c:1084 tc_modify_qdisc+0xbb7/0x1a00 net/sched/sch_api.c:1671 rtnetlink_rcv_msg+0x43a/0xca0 net/core/rtnetlink.c:6090 netlink_rcv_skb+0x153/0x420 net/netlink/af_netlink.c:2501 netlink_unicast_kernel net/netlink/af_netlink.c:1319 [inline] netlink_unicast+0x543/0x7f0 net/netlink/af_netlink.c:1345 netlink_sendmsg+0x917/0xe10 net/netlink/af_netlink.c:1921 sock_sendmsg_nosec net/socket.c:714 [inline] sock_sendmsg+0xcf/0x120 net/socket.c:734 ____sys_sendmsg+0x6eb/0x810 net/socket.c:2482 ___sys_sendmsg+0x110/0x1b0 net/socket.c:2536 __sys_sendmsg+0xf3/0x1c0 net/socket.c:2565 do_syscall_x64 arch/x86/entry/common.c:50 [inline] do_syscall_64+0x35/0xb0 arch/x86/entry/common.c:80 entry_SYSCALL_64_after_hwframe+0x63/0xcd
Second to last potentially related work creation: kasan_save_stack+0x1e/0x40 mm/kasan/common.c:38 __kasan_record_aux_stack+0xbe/0xd0 mm/kasan/generic.c:348 kvfree_call_rcu+0x74/0x940 kernel/rcu/tree.c:3322 neigh_destroy+0x431/0x630 net/core/neighbour.c:912 neigh_release include/net/neighbour.h:454 [inline] neigh_cleanup_and_release+0x1f8/0x330 net/core/neighbour.c:103 neigh_del net/core/neighbour.c:225 [inline] neigh_remove_one+0x37d/0x460 net/core/neighbour.c:246 neigh_forced_gc net/core/neighbour.c:276 [inline] neigh_alloc net/core/neighbour.c:447 [inline] ___neigh_create+0x18b5/0x29a0 net/core/neighbour.c:642 ip6_finish_output2+0xfb8/0x1520 net/ipv6/ip6_output.c:125 __ip6_finish_output net/ipv6/ip6_output.c:195 [inline] ip6_finish_output+0x690/0x1160 net/ipv6/ip6_output.c:206 NF_HOOK_COND include/linux/netfilter.h:296 [inline] ip6_output+0x1ed/0x540 net/ipv6/ip6_output.c:227 dst_output include/net/dst.h:451 [inline] NF_HOOK include/linux/netfilter.h:307 [inline] NF_HOOK include/linux/netfilter.h:301 [inline] mld_sendpack+0xa09/0xe70 net/ipv6/mcast.c:1820 mld_send_cr net/ipv6/mcast.c:2121 [inline] mld_ifc_work+0x71c/0xdc0 net/ipv6/mcast.c:2653 process_one_work+0x991/0x1610 kernel/workqueue.c:2289 worker_thread+0x665/0x1080 kernel/workqueue.c:2436 kthread+0x2e4/0x3a0 kernel/kthread.c:376 ret_from_fork+0x1f/0x30 arch/x86/entry/entry_64.S:306
The buggy address belongs to the object at ffff88802065e000 which belongs to the cache kmalloc-1k of size 1024 The buggy address is located 56 bytes inside of 1024-byte region [ffff88802065e000, ffff88802065e400)
The buggy address belongs to the physical page: page:ffffea0000819600 refcount:1 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x20658 head:ffffea0000819600 order:3 compound_mapcount:0 compound_pincount:0 flags: 0xfff00000010200(slab|head|node=0|zone=1|lastcpupid=0x7ff) raw: 00fff00000010200 0000000000000000 dead000000000001 ffff888011841dc0 raw: 0000000000000000 0000000000100010 00000001ffffffff 0000000000000000 page dumped because: kasan: bad access detected page_owner tracks the page as allocated page last allocated via order 3, migratetype Unmovable, gfp_mask 0xd20c0(__GFP_IO|__GFP_FS|__GFP_NOWARN|__GFP_NORETRY|__GFP_COMP|__GFP_NOMEMALLOC), pid 3523, tgid 3523 (sshd), ts 41495190986, free_ts 41417713212 prep_new_page mm/page_alloc.c:2532 [inline] get_page_from_freelist+0x109b/0x2ce0 mm/page_alloc.c:4283 __alloc_pages+0x1c7/0x510 mm/page_alloc.c:5515 alloc_pages+0x1a6/0x270 mm/mempolicy.c:2270 alloc_slab_page mm/slub.c:1824 [inline] allocate_slab+0x27e/0x3d0 mm/slub.c:1969 new_slab mm/slub.c:2029 [inline] ___slab_alloc+0x7f1/0xe10 mm/slub.c:3031 __slab_alloc.constprop.0+0x4d/0xa0 mm/slub.c:3118 slab_alloc_node mm/slub.c:3209 [inline] __kmalloc_node_track_caller+0x2f2/0x380 mm/slub.c:4955 kmalloc_reserve net/core/skbuff.c:358 [inline] __alloc_skb+0xd9/0x2f0 net/core/skbuff.c:430 alloc_skb_fclone include/linux/skbuff.h:1307 [inline] tcp_stream_alloc_skb+0x38/0x580 net/ipv4/tcp.c:861 tcp_sendmsg_locked+0xc36/0x2f80 net/ipv4/tcp.c:1325 tcp_sendmsg+0x2b/0x40 net/ipv4/tcp.c:1483 inet_sendmsg+0x99/0xe0 net/ipv4/af_inet.c:819 sock_sendmsg_nosec net/socket.c:714 [inline] sock_sendmsg+0xcf/0x120 net/socket.c:734 sock_write_iter+0x291/0x3d0 net/socket.c:1108 call_write_iter include/linux/fs.h:2187 [inline] new_sync_write fs/read_write.c:491 [inline] vfs_write+0x9e9/0xdd0 fs/read_write.c:578 ksys_write+0x1e8/0x250 fs/read_write.c:631 page last free stack trace: reset_page_owner include/linux/page_owner.h:24 [inline] free_pages_prepare mm/page_alloc.c:1449 [inline] free_pcp_prepare+0x5e4/0xd20 mm/page_alloc.c:1499 free_unref_page_prepare mm/page_alloc.c:3380 [inline] free_unref_page+0x19/0x4d0 mm/page_alloc.c:3476 __unfreeze_partials+0x17c/0x1a0 mm/slub.c:2548 qlink_free mm/kasan/quarantine.c:168 [inline] qlist_free_all+0x6a/0x170 mm/kasan/quarantine.c:187 kasan_quarantine_reduce+0x180/0x200 mm/kasan/quarantine.c:294 __kasan_slab_alloc+0xa2/0xc0 mm/kasan/common.c:447 kasan_slab_alloc include/linux/kasan.h:224 [inline] slab_post_alloc_hook mm/slab.h:727 [inline] slab_alloc_node mm/slub.c:3243 [inline] slab_alloc mm/slub.c:3251 [inline] __kmem_cache_alloc_lru mm/slub.c:3258 [inline] kmem_cache_alloc+0x267/0x3b0 mm/slub.c:3268 kmem_cache_zalloc include/linux/slab.h:723 [inline] alloc_buffer_head+0x20/0x140 fs/buffer.c:2974 alloc_page_buffers+0x280/0x790 fs/buffer.c:829 create_empty_buffers+0x2c/0xee0 fs/buffer.c:1558 ext4_block_write_begin+0x1004/0x1530 fs/ext4/inode.c:1074 ext4_da_write_begin+0x422/0xae0 fs/ext4/inode.c:2996 generic_perform_write+0x246/0x560 mm/filemap.c:3738 ext4_buffered_write_iter+0x15b/0x460 fs/ext4/file.c:270 ext4_file_write_iter+0x44a/0x1660 fs/ext4/file.c:679 call_write_iter include/linux/fs.h:2187 [inline] new_sync_write fs/read_write.c:491 [inline] vfs_write+0x9e9/0xdd0 fs/read_write.c:578
Fixes: af356afa010f ("net_sched: reintroduce dev->qdisc for use by sch_api") Reported-by: syzbot syzkaller@googlegroups.com Diagnosed-by: Dmitry Vyukov dvyukov@google.com Signed-off-by: Eric Dumazet edumazet@google.com Link: https://lore.kernel.org/r/20221018203258.2793282-1-edumazet@google.com Signed-off-by: Jakub Kicinski kuba@kernel.org Signed-off-by: Sasha Levin sashal@kernel.org Signed-off-by: Zhengchao Shao shaozhengchao@huawei.com Reviewed-by: Liu Jian liujian56@huawei.com Signed-off-by: Jialin Zhang zhangjialin11@huawei.com --- net/sched/sch_api.c | 5 +++-- 1 file changed, 3 insertions(+), 2 deletions(-)
diff --git a/net/sched/sch_api.c b/net/sched/sch_api.c index 01bec5e746d3..54e2309315eb 100644 --- a/net/sched/sch_api.c +++ b/net/sched/sch_api.c @@ -1081,12 +1081,13 @@ static int qdisc_graft(struct net_device *dev, struct Qdisc *parent,
skip: if (!ingress) { - notify_and_destroy(net, skb, n, classid, - rtnl_dereference(dev->qdisc), new); + old = rtnl_dereference(dev->qdisc); if (new && !new->ops->attach) qdisc_refcount_inc(new); rcu_assign_pointer(dev->qdisc, new ? : &noop_qdisc);
+ notify_and_destroy(net, skb, n, classid, old, new); + if (new && new->ops->attach) new->ops->attach(new); } else {
From: Tejun Heo tj@kernel.org
stable inclusion from stable-v5.10.143 commit bfbacc2ef7b5b9261e795e361ee233e7d36efae6 category: bugfix bugzilla: https://gitee.com/openeuler/kernel/issues/I60G94 CVE: NA
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?id=...
-------------------------------
[ Upstream commit 671c11f0619e5ccb380bcf0f062f69ba95fc974a ]
cgroup_update_dfl_csses() write-lock the threadgroup_rwsem as updating the csses can trigger process migrations. However, if the subtree doesn't contain any tasks, there aren't gonna be any cgroup migrations. This condition can be trivially detected by testing whether mgctx.preloaded_src_csets is empty. Elide write-locking threadgroup_rwsem if the subtree is empty.
After this optimization, the usage pattern of creating a cgroup, enabling the necessary controllers, and then seeding it with CLONE_INTO_CGROUP and then removing the cgroup after it becomes empty doesn't need to write-lock threadgroup_rwsem at all.
Signed-off-by: Tejun Heo tj@kernel.org Cc: Christian Brauner brauner@kernel.org Cc: Michal Koutný mkoutny@suse.com Signed-off-by: Sasha Levin sashal@kernel.org Signed-off-by: Cai Xinchen caixinchen1@huawei.com Reviewed-by: Xiu Jianfeng xiujianfeng@huawei.com Reviewed-by: Wang Weiyang wangweiyang2@huawei.com Signed-off-by: Jialin Zhang zhangjialin11@huawei.com --- kernel/cgroup/cgroup.c | 16 +++++++++++++--- 1 file changed, 13 insertions(+), 3 deletions(-)
diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c index a5dd82dd8b2c..d00c391ca19e 100644 --- a/kernel/cgroup/cgroup.c +++ b/kernel/cgroup/cgroup.c @@ -2954,12 +2954,11 @@ static int cgroup_update_dfl_csses(struct cgroup *cgrp) struct cgroup_subsys_state *d_css; struct cgroup *dsct; struct css_set *src_cset; + bool has_tasks; int ret;
lockdep_assert_held(&cgroup_mutex);
- percpu_down_write(&cgroup_threadgroup_rwsem); - /* look up all csses currently attached to @cgrp's subtree */ spin_lock_irq(&css_set_lock); cgroup_for_each_live_descendant_pre(dsct, d_css, cgrp) { @@ -2970,6 +2969,16 @@ static int cgroup_update_dfl_csses(struct cgroup *cgrp) } spin_unlock_irq(&css_set_lock);
+ /* + * We need to write-lock threadgroup_rwsem while migrating tasks. + * However, if there are no source csets for @cgrp, changing its + * controllers isn't gonna produce any task migrations and the + * write-locking can be skipped safely. + */ + has_tasks = !list_empty(&mgctx.preloaded_src_csets); + if (has_tasks) + percpu_down_write(&cgroup_threadgroup_rwsem); + /* NULL dst indicates self on default hierarchy */ ret = cgroup_migrate_prepare_dst(&mgctx); if (ret) @@ -2988,7 +2997,8 @@ static int cgroup_update_dfl_csses(struct cgroup *cgrp) ret = cgroup_migrate_execute(&mgctx); out_finish: cgroup_migrate_finish(&mgctx); - percpu_up_write(&cgroup_threadgroup_rwsem); + if (has_tasks) + percpu_up_write(&cgroup_threadgroup_rwsem); return ret; }
From: Tejun Heo tj@kernel.org
stable inclusion from stable-v5.10.143 commit dee1e2b18cf5426eed985512ccc6636ec69dbdd6 category: bugfix bugzilla: https://gitee.com/openeuler/kernel/issues/I60G94 CVE: NA
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?id=...
-------------------------------
[ Upstream commit 4f7e7236435ca0abe005c674ebd6892c6e83aeb3 ]
Bringing up a CPU may involve creating and destroying tasks which requires read-locking threadgroup_rwsem, so threadgroup_rwsem nests inside cpus_read_lock(). However, cpuset's ->attach(), which may be called with thredagroup_rwsem write-locked, also wants to disable CPU hotplug and acquires cpus_read_lock(), leading to a deadlock.
Fix it by guaranteeing that ->attach() is always called with CPU hotplug disabled and removing cpus_read_lock() call from cpuset_attach().
Signed-off-by: Tejun Heo tj@kernel.org Reviewed-and-tested-by: Imran Khan imran.f.khan@oracle.com Reported-and-tested-by: Xuewen Yan xuewen.yan@unisoc.com Fixes: 05c7b7a92cc8 ("cgroup/cpuset: Fix a race between cpuset_attach() and cpu hotplug") Cc: stable@vger.kernel.org # v5.17+ Signed-off-by: Sasha Levin sashal@kernel.org
conflict: kernel/cgroup/cgroup.c
Signed-off-by: Cai Xinchen caixinchen1@huawei.com Reviewed-by: Xiu Jianfeng xiujianfeng@huawei.com Reviewed-by: Wang Weiyang wangweiyang2@huawei.com Signed-off-by: Jialin Zhang zhangjialin11@huawei.com --- kernel/cgroup/cgroup.c | 77 +++++++++++++++++++++++++++++------------- kernel/cgroup/cpuset.c | 3 +- 2 files changed, 55 insertions(+), 25 deletions(-)
diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c index d00c391ca19e..41dde1800526 100644 --- a/kernel/cgroup/cgroup.c +++ b/kernel/cgroup/cgroup.c @@ -2330,6 +2330,47 @@ int task_cgroup_path(struct task_struct *task, char *buf, size_t buflen) } EXPORT_SYMBOL_GPL(task_cgroup_path);
+/** + * cgroup_attach_lock - Lock for ->attach() + * @lock_threadgroup: whether to down_write cgroup_threadgroup_rwsem + * + * cgroup migration sometimes needs to stabilize threadgroups against forks and + * exits by write-locking cgroup_threadgroup_rwsem. However, some ->attach() + * implementations (e.g. cpuset), also need to disable CPU hotplug. + * Unfortunately, letting ->attach() operations acquire cpus_read_lock() can + * lead to deadlocks. + * + * Bringing up a CPU may involve creating and destroying tasks which requires + * read-locking threadgroup_rwsem, so threadgroup_rwsem nests inside + * cpus_read_lock(). If we call an ->attach() which acquires the cpus lock while + * write-locking threadgroup_rwsem, the locking order is reversed and we end up + * waiting for an on-going CPU hotplug operation which in turn is waiting for + * the threadgroup_rwsem to be released to create new tasks. For more details: + * + * http://lkml.kernel.org/r/20220711174629.uehfmqegcwn2lqzu@wubuntu + * + * Resolve the situation by always acquiring cpus_read_lock() before optionally + * write-locking cgroup_threadgroup_rwsem. This allows ->attach() to assume that + * CPU hotplug is disabled on entry. + */ +static void cgroup_attach_lock(bool lock_threadgroup) +{ + cpus_read_lock(); + if (lock_threadgroup) + percpu_down_write(&cgroup_threadgroup_rwsem); +} + +/** + * cgroup_attach_unlock - Undo cgroup_attach_lock() + * @lock_threadgroup: whether to up_write cgroup_threadgroup_rwsem + */ +static void cgroup_attach_unlock(bool lock_threadgroup) +{ + if (lock_threadgroup) + percpu_up_write(&cgroup_threadgroup_rwsem); + cpus_read_unlock(); +} + /** * cgroup_migrate_add_task - add a migration target task to a migration context * @task: target task @@ -2804,8 +2845,7 @@ int cgroup_attach_task(struct cgroup *dst_cgrp, struct task_struct *leader, }
struct task_struct *cgroup_procs_write_start(char *buf, bool threadgroup, - bool *locked) - __acquires(&cgroup_threadgroup_rwsem) + bool *threadgroup_locked) { struct task_struct *tsk; pid_t pid; @@ -2822,12 +2862,8 @@ struct task_struct *cgroup_procs_write_start(char *buf, bool threadgroup, * Therefore, we can skip the global lock. */ lockdep_assert_held(&cgroup_mutex); - if (pid || threadgroup) { - percpu_down_write(&cgroup_threadgroup_rwsem); - *locked = true; - } else { - *locked = false; - } + *threadgroup_locked = pid || threadgroup; + cgroup_attach_lock(*threadgroup_locked);
rcu_read_lock(); if (pid) { @@ -2858,17 +2894,14 @@ struct task_struct *cgroup_procs_write_start(char *buf, bool threadgroup, goto out_unlock_rcu;
out_unlock_threadgroup: - if (*locked) { - percpu_up_write(&cgroup_threadgroup_rwsem); - *locked = false; - } + cgroup_attach_unlock(*threadgroup_locked); + *threadgroup_locked = false; out_unlock_rcu: rcu_read_unlock(); return tsk; }
-void cgroup_procs_write_finish(struct task_struct *task, bool locked) - __releases(&cgroup_threadgroup_rwsem) +void cgroup_procs_write_finish(struct task_struct *task, bool threadgroup_locked) { struct cgroup_subsys *ss; int ssid; @@ -2876,8 +2909,8 @@ void cgroup_procs_write_finish(struct task_struct *task, bool locked) /* release reference from cgroup_procs_write_start() */ put_task_struct(task);
- if (locked) - percpu_up_write(&cgroup_threadgroup_rwsem); + cgroup_attach_unlock(threadgroup_locked); + for_each_subsys(ss, ssid) if (ss->post_attach) ss->post_attach(); @@ -2976,8 +3009,7 @@ static int cgroup_update_dfl_csses(struct cgroup *cgrp) * write-locking can be skipped safely. */ has_tasks = !list_empty(&mgctx.preloaded_src_csets); - if (has_tasks) - percpu_down_write(&cgroup_threadgroup_rwsem); + cgroup_attach_lock(has_tasks);
/* NULL dst indicates self on default hierarchy */ ret = cgroup_migrate_prepare_dst(&mgctx); @@ -2997,8 +3029,7 @@ static int cgroup_update_dfl_csses(struct cgroup *cgrp) ret = cgroup_migrate_execute(&mgctx); out_finish: cgroup_migrate_finish(&mgctx); - if (has_tasks) - percpu_up_write(&cgroup_threadgroup_rwsem); + cgroup_attach_unlock(has_tasks); return ret; }
@@ -4960,13 +4991,13 @@ static ssize_t __cgroup_procs_write(struct kernfs_open_file *of, char *buf, struct task_struct *task; const struct cred *saved_cred; ssize_t ret; - bool locked; + bool threadgroup_locked;
dst_cgrp = cgroup_kn_lock_live(of->kn, false); if (!dst_cgrp) return -ENODEV;
- task = cgroup_procs_write_start(buf, threadgroup, &locked); + task = cgroup_procs_write_start(buf, threadgroup, &threadgroup_locked); ret = PTR_ERR_OR_ZERO(task); if (ret) goto out_unlock; @@ -4992,7 +5023,7 @@ static ssize_t __cgroup_procs_write(struct kernfs_open_file *of, char *buf, ret = cgroup_attach_task(dst_cgrp, task, threadgroup);
out_finish: - cgroup_procs_write_finish(task, locked); + cgroup_procs_write_finish(task, threadgroup_locked); out_unlock: cgroup_kn_unlock(of->kn);
diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c index dc1f782f8e0a..60489cc5a92a 100644 --- a/kernel/cgroup/cpuset.c +++ b/kernel/cgroup/cpuset.c @@ -2220,7 +2220,7 @@ static void cpuset_attach(struct cgroup_taskset *tset) cgroup_taskset_first(tset, &css); cs = css_cs(css);
- cpus_read_lock(); + lockdep_assert_cpus_held(); /* see cgroup_attach_lock() */ percpu_down_write(&cpuset_rwsem);
/* prepare for attach */ @@ -2276,7 +2276,6 @@ static void cpuset_attach(struct cgroup_taskset *tset) wake_up(&cpuset_attach_wq);
percpu_up_write(&cpuset_rwsem); - cpus_read_unlock(); }
/* The various types of files and directories in a cpuset file system */
From: Tetsuo Handa penguin-kernel@I-love.SAKURA.ne.jp
stable inclusion from stable-v5.10.145 commit 9f267393b036f1470fb12fb892d59e7ff8aeb58d category: bugfix bugzilla: https://gitee.com/openeuler/kernel/issues/I60G94 CVE: NA
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?h=v...
----------------------------------------
commit 43626dade36fa74d3329046f4ae2d7fdefe401c6 upstream.
syzbot is hitting percpu_rwsem_assert_held(&cpu_hotplug_lock) warning at cpuset_attach() [1], for commit 4f7e7236435ca0ab ("cgroup: Fix threadgroup_rwsem <-> cpus_read_lock() deadlock") missed that cpuset_attach() is also called from cgroup_attach_task_all(). Add cpus_read_lock() like what cgroup_procs_write_start() does.
Link: https://syzkaller.appspot.com/bug?extid=29d3a3b4d86c8136ad9e [1] Reported-by: syzbot syzbot+29d3a3b4d86c8136ad9e@syzkaller.appspotmail.com Signed-off-by: Tetsuo Handa penguin-kernel@I-love.SAKURA.ne.jp Fixes: 4f7e7236435ca0ab ("cgroup: Fix threadgroup_rwsem <-> cpus_read_lock() deadlock") Signed-off-by: Tejun Heo tj@kernel.org Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org Signed-off-by: Cai Xinchen caixinchen1@huawei.com Reviewed-by: Xiu Jianfeng xiujianfeng@huawei.com Reviewed-by: Wang Weiyang wangweiyang2@huawei.com Signed-off-by: Jialin Zhang zhangjialin11@huawei.com --- kernel/cgroup/cgroup-v1.c | 2 ++ 1 file changed, 2 insertions(+)
diff --git a/kernel/cgroup/cgroup-v1.c b/kernel/cgroup/cgroup-v1.c index be884bc2ae61..647d0891cff6 100644 --- a/kernel/cgroup/cgroup-v1.c +++ b/kernel/cgroup/cgroup-v1.c @@ -57,6 +57,7 @@ int cgroup_attach_task_all(struct task_struct *from, struct task_struct *tsk) int retval = 0;
mutex_lock(&cgroup_mutex); + cpus_read_lock(); percpu_down_write(&cgroup_threadgroup_rwsem); for_each_root(root) { struct cgroup *from_cgrp; @@ -73,6 +74,7 @@ int cgroup_attach_task_all(struct task_struct *from, struct task_struct *tsk) break; } percpu_up_write(&cgroup_threadgroup_rwsem); + cpus_read_unlock(); mutex_unlock(&cgroup_mutex);
return retval;