From: Sergey Shtylyov <s.shtylyov@omp.ru>

stable inclusion
from stable-v4.19.260
commit 2566706ac6393386a4e7c4ce23fe17f4c98d9aa0
category: bugfix
bugzilla: https://gitee.com/openeuler/kernel/issues/I5ZXGL
CVE: NA
--------------------------------
[ Upstream commit 2f945a792f67815abca26fa8a5e863ccf3fa1181 ]
Commit 78c44d910d3e ("drivers/of: Fix depth when unflattening devicetree") forgot to fix up the depth check in the loop body in unflatten_dt_nodes() which makes it possible to overflow the nps[] buffer...
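The overflow is easiest to see in the loop body, which stores the node for the next level at nps[depth + 1]; a simplified sketch of unflatten_dt_nodes() (not the verbatim source):

    static struct device_node *nps[FDT_MAX_DEPTH];

    for (offset = 0;
         offset >= 0 && depth >= initial_depth;
         offset = fdt_next_node(blob, offset, &depth)) {
            if (WARN_ON_ONCE(depth >= FDT_MAX_DEPTH))   /* old check */
                    continue;
            ...
            /* with depth == FDT_MAX_DEPTH - 1 this writes to
             * nps[FDT_MAX_DEPTH], one element past the array */
            ret = populate_node(blob, offset, &mem, nps[depth],
                                &nps[depth + 1], dryrun);
    }

Hence nodes at depth >= FDT_MAX_DEPTH - 1 must be skipped, which is what the one-line fix below implements.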
Found by Linux Verification Center (linuxtesting.org) with the SVACE static analysis tool.
Fixes: 78c44d910d3e ("drivers/of: Fix depth when unflattening devicetree")
Signed-off-by: Sergey Shtylyov <s.shtylyov@omp.ru>
Signed-off-by: Rob Herring <robh@kernel.org>
Link: https://lore.kernel.org/r/7c354554-006f-6b31-c195-cdfe4caee392@omp.ru
Signed-off-by: Sasha Levin <sashal@kernel.org>
Signed-off-by: Yongqiang Liu <liuyongqiang13@huawei.com>
---
 drivers/of/fdt.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/of/fdt.c b/drivers/of/fdt.c
index 800ad252cf9c..50b67e65159b 100644
--- a/drivers/of/fdt.c
+++ b/drivers/of/fdt.c
@@ -391,7 +391,7 @@ static int unflatten_dt_nodes(const void *blob,
 	for (offset = 0;
 	     offset >= 0 && depth >= initial_depth;
 	     offset = fdt_next_node(blob, offset, &depth)) {
-		if (WARN_ON_ONCE(depth >= FDT_MAX_DEPTH))
+		if (WARN_ON_ONCE(depth >= FDT_MAX_DEPTH - 1))
 			continue;

 		if (!IS_ENABLED(CONFIG_OF_KOBJ) &&
From: Chao Yu <chao.yu@oppo.com>

stable inclusion
from stable-v4.19.260
commit e996821717c5cf8aa1e1abdb6b3d900a231e3755
category: bugfix
bugzilla: https://gitee.com/openeuler/kernel/issues/I5ZXGL
CVE: NA
--------------------------------
commit 7e9c323c52b379d261a72dc7bd38120a761a93cd upstream.
In create_unique_id(), kmalloc(, GFP_KERNEL) can fail due to out-of-memory; if it fails, return the error code correctly rather than triggering a panic via BUG_ON().
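For readers unfamiliar with the idiom used by the fix: <linux/err.h> lets a function encode a negative errno inside a pointer return value via ERR_PTR(), which the caller tests with IS_ERR() and decodes with PTR_ERR(). A minimal sketch (alloc_name() is a hypothetical helper):

    #include <linux/err.h>

    static char *alloc_name(void)
    {
            char *name = kmalloc(ID_STR_LENGTH, GFP_KERNEL);

            if (!name)
                    return ERR_PTR(-ENOMEM);  /* errno encoded as a pointer */
            return name;
    }

    /* caller */
    name = alloc_name();
    if (IS_ERR(name))
            return PTR_ERR(name);  /* decodes back to -ENOMEM */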
kernel BUG at mm/slub.c:5893!
Internal error: Oops - BUG: 0 [#1] PREEMPT SMP

Call trace:
 sysfs_slab_add+0x258/0x260 mm/slub.c:5973
 __kmem_cache_create+0x60/0x118 mm/slub.c:4899
 create_cache mm/slab_common.c:229 [inline]
 kmem_cache_create_usercopy+0x19c/0x31c mm/slab_common.c:335
 kmem_cache_create+0x1c/0x28 mm/slab_common.c:390
 f2fs_kmem_cache_create fs/f2fs/f2fs.h:2766 [inline]
 f2fs_init_xattr_caches+0x78/0xb4 fs/f2fs/xattr.c:808
 f2fs_fill_super+0x1050/0x1e0c fs/f2fs/super.c:4149
 mount_bdev+0x1b8/0x210 fs/super.c:1400
 f2fs_mount+0x44/0x58 fs/f2fs/super.c:4512
 legacy_get_tree+0x30/0x74 fs/fs_context.c:610
 vfs_get_tree+0x40/0x140 fs/super.c:1530
 do_new_mount+0x1dc/0x4e4 fs/namespace.c:3040
 path_mount+0x358/0x914 fs/namespace.c:3370
 do_mount fs/namespace.c:3383 [inline]
 __do_sys_mount fs/namespace.c:3591 [inline]
 __se_sys_mount fs/namespace.c:3568 [inline]
 __arm64_sys_mount+0x2f8/0x408 fs/namespace.c:3568
Cc: stable@kernel.org
Fixes: 81819f0fc8285 ("SLUB core")
Reported-by: syzbot+81684812ea68216e08c5@syzkaller.appspotmail.com
Reviewed-by: Muchun Song <songmuchun@bytedance.com>
Reviewed-by: Hyeonggon Yoo <42.hyeyoo@gmail.com>
Signed-off-by: Chao Yu <chao.yu@oppo.com>
Acked-by: David Rientjes <rientjes@google.com>
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Yongqiang Liu <liuyongqiang13@huawei.com>
---
 mm/slub.c | 5 ++++-
 1 file changed, 4 insertions(+), 1 deletion(-)
diff --git a/mm/slub.c b/mm/slub.c
index f9b39a3718d0..150362a663f6 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -5701,7 +5701,8 @@ static char *create_unique_id(struct kmem_cache *s)
 	char *name = kmalloc(ID_STR_LENGTH, GFP_KERNEL);
 	char *p = name;

-	BUG_ON(!name);
+	if (!name)
+		return ERR_PTR(-ENOMEM);

 	*p++ = ':';
 	/*
@@ -5783,6 +5784,8 @@ static int sysfs_slab_add(struct kmem_cache *s)
 		 * for the symlinks.
 		 */
 		name = create_unique_id(s);
+		if (IS_ERR(name))
+			return PTR_ERR(name);
 	}

 	s->kobj.kset = kset;
From: Benjamin Poirier <bpoirier@nvidia.com>

stable inclusion
from stable-v4.19.260
commit edda7614d3851c7282f3e3d8dd67421c864682c6
category: bugfix
bugzilla: https://gitee.com/openeuler/kernel/issues/I5ZXGL
CVE: NA
--------------------------------
[ Upstream commit bd60234222b2fd5573526da7bcd422801f271f5f ]
Netdev drivers are expected to call dev_{uc,mc}_sync() in their ndo_set_rx_mode method and dev_{uc,mc}_unsync() in their ndo_stop method. This is mentioned in the kerneldoc for those dev_* functions.
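Sketched for a hypothetical aggregating driver (the foo_* names and port_dev are illustrative, not the team driver's code), the convention is:

    static void foo_set_rx_mode(struct net_device *dev)  /* ndo_set_rx_mode */
    {
            dev_uc_sync(port_dev, dev);    /* propagate dev->uc to the port */
            dev_mc_sync(port_dev, dev);    /* propagate dev->mc to the port */
    }

    static int foo_stop(struct net_device *dev)          /* ndo_stop */
    {
            dev_uc_unsync(port_dev, dev);  /* remove them again on close */
            dev_mc_unsync(port_dev, dev);
            return 0;
    }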
The team driver calls dev_{uc,mc}_unsync() during ndo_uninit instead of ndo_stop. This is ineffective because address lists (dev->{uc,mc}) have already been emptied in unregister_netdevice_many() before ndo_uninit is called. This mistake can result in addresses being leftover on former team ports after a team device has been deleted; see test_LAG_cleanup() in the last patch in this series.
Add unsync calls at their expected location, team_close().
v3:
* When adding or deleting a port, only sync/unsync addresses if the team
  device is up. In other cases, it is taken care of at the right time by
  ndo_open/ndo_set_rx_mode/ndo_stop.
Fixes: 3d249d4ca7d0 ("net: introduce ethernet teaming device")
Signed-off-by: Benjamin Poirier <bpoirier@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Sasha Levin <sashal@kernel.org>
Signed-off-by: Yongqiang Liu <liuyongqiang13@huawei.com>
---
 drivers/net/team/team.c | 24 ++++++++++++++++++------
 1 file changed, 18 insertions(+), 6 deletions(-)
diff --git a/drivers/net/team/team.c b/drivers/net/team/team.c
index 3feb49badda9..e1ced8169d65 100644
--- a/drivers/net/team/team.c
+++ b/drivers/net/team/team.c
@@ -1276,10 +1276,12 @@ static int team_port_add(struct team *team, struct net_device *port_dev,
 		}
 	}

-	netif_addr_lock_bh(dev);
-	dev_uc_sync_multiple(port_dev, dev);
-	dev_mc_sync_multiple(port_dev, dev);
-	netif_addr_unlock_bh(dev);
+	if (dev->flags & IFF_UP) {
+		netif_addr_lock_bh(dev);
+		dev_uc_sync_multiple(port_dev, dev);
+		dev_mc_sync_multiple(port_dev, dev);
+		netif_addr_unlock_bh(dev);
+	}

 	port->index = -1;
 	list_add_tail_rcu(&port->list, &team->port_list);
@@ -1350,8 +1352,10 @@ static int team_port_del(struct team *team, struct net_device *port_dev)
 	netdev_rx_handler_unregister(port_dev);
 	team_port_disable_netpoll(port);
 	vlan_vids_del_by_dev(port_dev, dev);
-	dev_uc_unsync(port_dev, dev);
-	dev_mc_unsync(port_dev, dev);
+	if (dev->flags & IFF_UP) {
+		dev_uc_unsync(port_dev, dev);
+		dev_mc_unsync(port_dev, dev);
+	}
 	dev_close(port_dev);
 	team_port_leave(team, port);

@@ -1699,6 +1703,14 @@ static int team_open(struct net_device *dev)

 static int team_close(struct net_device *dev)
 {
+	struct team *team = netdev_priv(dev);
+	struct team_port *port;
+
+	list_for_each_entry(port, &team->port_list, list) {
+		dev_uc_unsync(port->dev, dev);
+		dev_mc_unsync(port->dev, dev);
+	}
+
 	return 0;
 }
From: Mel Gorman <mgorman@techsingularity.net>

stable inclusion
from stable-v4.19.261
commit 26e871af465dff04431b03ba9c71144f63e59599
category: bugfix
bugzilla: https://gitee.com/openeuler/kernel/issues/I5ZXGL
CVE: NA
--------------------------------
commit 3d36424b3b5850bd92f3e89b953a430d7cfc88ef upstream.
Patrick Daly reported the following problem:

NODE_DATA(nid)->node_zonelists[ZONELIST_FALLBACK] - before offline operation
[0] - ZONE_MOVABLE
[1] - ZONE_NORMAL
[2] - NULL
For a GFP_KERNEL allocation, alloc_pages_slowpath() will save the offset of ZONE_NORMAL in ac->preferred_zoneref. If a concurrent memory_offline operation removes the last page from ZONE_MOVABLE, build_all_zonelists() & build_zonerefs_node() will update node_zonelists as shown below. Only populated zones are added.
NODE_DATA(nid)->node_zonelists[ZONELIST_FALLBACK] - after offline operation
[0] - ZONE_NORMAL
[1] - NULL
[2] - NULL
The race is simple -- page allocation could be in progress when a memory hot-remove operation triggers a zonelist rebuild that removes zones. The allocation request will still have a valid ac->preferred_zoneref that is now pointing to NULL and triggers an OOM kill.
This problem probably always existed but may be slightly easier to trigger due to 6aa303defb74 ("mm, vmscan: only allocate and reclaim from zones with pages managed by the buddy allocator") which distinguishes between zones that are completely unpopulated versus zones that have valid pages not managed by the buddy allocator (e.g. reserved, memblock, ballooning etc). Memory hotplug had multiple stages with timing considerations around managed/present page updates, the zonelist rebuild and the zone span updates. As David Hildenbrand puts it:
memory offlining adjusts managed+present pages of the zone essentially in one go. If after the adjustments, the zone is no longer populated (present==0), we rebuild the zone lists.
Once that's done, we try shrinking the zone (start+spanned pages) -- which results in zone_start_pfn == 0 if there are no more pages. That happens *after* rebuilding the zonelists via remove_pfn_range_from_zone().
The only requirement to fix the race is that a page allocation request identifies when a zonelist rebuild has happened since the allocation request started and no page has yet been allocated. Use a seqlock_t to track zonelist updates with a lockless read-side of the zonelist and protecting the rebuild and update of the counter with a spinlock.
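For background, the generic seqlock_t pattern the fix builds on looks like this (a sketch, not the patch itself):

    static DEFINE_SEQLOCK(update_seq);

    /* reader: lockless, detects concurrent writers and retries */
    unsigned int seq;
    do {
            seq = read_seqbegin(&update_seq);
            /* ... read the protected data ... */
    } while (read_seqretry(&update_seq, seq));

    /* writer: serialized by the seqlock's embedded spinlock */
    write_seqlock(&update_seq);
    /* ... rebuild the data ... */
    write_sequnlock(&update_seq);

In the patch the retry is not a tight loop: the allocation slowpath samples the sequence count at the start and, before declaring OOM, checks it again and restarts the whole attempt if a zonelist rebuild happened in between.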
[akpm@linux-foundation.org: make zonelist_update_seq static]
Link: https://lkml.kernel.org/r/20220824110900.vh674ltxmzb3proq@techsingularity.ne...
Fixes: 6aa303defb74 ("mm, vmscan: only allocate and reclaim from zones with pages managed by the buddy allocator")
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Reported-by: Patrick Daly <quic_pdaly@quicinc.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Reviewed-by: David Hildenbrand <david@redhat.com>
Cc: <stable@vger.kernel.org> [4.9+]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Yongqiang Liu <liuyongqiang13@huawei.com>
---
 mm/page_alloc.c | 53 +++++++++++++++++++++++++++++++++++++++----------
 1 file changed, 43 insertions(+), 10 deletions(-)
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index a998aa179bfa..d290dbe177c5 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -3984,6 +3984,30 @@ void fs_reclaim_release(gfp_t gfp_mask)
 EXPORT_SYMBOL_GPL(fs_reclaim_release);
 #endif

+/*
+ * Zonelists may change due to hotplug during allocation. Detect when zonelists
+ * have been rebuilt so allocation retries. Reader side does not lock and
+ * retries the allocation if zonelist changes. Writer side is protected by the
+ * embedded spin_lock.
+ */
+static DEFINE_SEQLOCK(zonelist_update_seq);
+
+static unsigned int zonelist_iter_begin(void)
+{
+	if (IS_ENABLED(CONFIG_MEMORY_HOTREMOVE))
+		return read_seqbegin(&zonelist_update_seq);
+
+	return 0;
+}
+
+static unsigned int check_retry_zonelist(unsigned int seq)
+{
+	if (IS_ENABLED(CONFIG_MEMORY_HOTREMOVE))
+		return read_seqretry(&zonelist_update_seq, seq);
+
+	return seq;
+}
+
 /* Perform direct synchronous page reclaim */
 static int
 __perform_reclaim(gfp_t gfp_mask, unsigned int order,
@@ -4336,6 +4360,7 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
 	int compaction_retries;
 	int no_progress_loops;
 	unsigned int cpuset_mems_cookie;
+	unsigned int zonelist_iter_cookie;
 	int reserve_flags;

 	/*
@@ -4346,11 +4371,12 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
 				(__GFP_ATOMIC|__GFP_DIRECT_RECLAIM)))
 		gfp_mask &= ~__GFP_ATOMIC;

-retry_cpuset:
+restart:
 	compaction_retries = 0;
 	no_progress_loops = 0;
 	compact_priority = DEF_COMPACT_PRIORITY;
 	cpuset_mems_cookie = read_mems_allowed_begin();
+	zonelist_iter_cookie = zonelist_iter_begin();

 	/*
 	 * The fast path uses conservative alloc_flags to succeed only until
@@ -4503,9 +4529,13 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
 		goto retry;

-	/* Deal with possible cpuset update races before we start OOM killing */
-	if (check_retry_cpuset(cpuset_mems_cookie, ac))
-		goto retry_cpuset;
+	/*
+	 * Deal with possible cpuset update races or zonelist updates to avoid
+	 * a unnecessary OOM kill.
+	 */
+	if (check_retry_cpuset(cpuset_mems_cookie, ac) ||
+	    check_retry_zonelist(zonelist_iter_cookie))
+		goto restart;

 	/* Reclaim has failed us, start killing things */
 	page = __alloc_pages_may_oom(gfp_mask, order, ac, &did_some_progress);
@@ -4525,9 +4555,13 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
 	}

 nopage:
-	/* Deal with possible cpuset update races before we fail */
-	if (check_retry_cpuset(cpuset_mems_cookie, ac))
-		goto retry_cpuset;
+	/*
+	 * Deal with possible cpuset update races or zonelist updates to avoid
+	 * a unnecessary OOM kill.
+	 */
+	if (check_retry_cpuset(cpuset_mems_cookie, ac) ||
+	    check_retry_zonelist(zonelist_iter_cookie))
+		goto restart;

 	/*
 	 * Make sure that __GFP_NOFAIL request doesn't leak out and make sure
@@ -5756,9 +5790,8 @@ static void __build_all_zonelists(void *data)
 	int nid;
 	int __maybe_unused cpu;
 	pg_data_t *self = data;
-	static DEFINE_SPINLOCK(lock);

-	spin_lock(&lock);
+	write_seqlock(&zonelist_update_seq);

 #ifdef CONFIG_NUMA
 	memset(node_load, 0, sizeof(node_load));
@@ -5791,7 +5824,7 @@ static void __build_all_zonelists(void *data)
 #endif
 	}

-	spin_unlock(&lock);
+	write_sequnlock(&zonelist_update_seq);
 }

 static noinline void __init
From: Maurizio Lombardi <mlombard@redhat.com>

stable inclusion
from stable-v4.19.261
commit 39a22a4ccd3a6c073ba2257629e752afa4e7ad08
category: bugfix
bugzilla: https://gitee.com/openeuler/kernel/issues/I5ZXGL
CVE: NA
--------------------------------
commit dac22531bbd4af2426c4e29e05594415ccfa365d upstream.
A number of drivers call page_frag_alloc() with a fragment's size > PAGE_SIZE.
In low memory conditions, __page_frag_cache_refill() may fail the order 3 cache allocation and fall back to order 0; In this case, the cache will be smaller than the fragment, causing memory corruptions.
Prevent this from happening by checking if the newly allocated cache is large enough for the fragment; if not, the allocation will fail and page_frag_alloc() will return NULL.
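To make the failure mode concrete, assuming 4 KiB pages (the numbers are illustrative):

    order-3 refill succeeds:  size = 32768, fragsz = 8192
        offset = size - fragsz = 24576      -> fragment fits in the cache

    order-3 refill fails, order-0 fallback:  size = 4096, fragsz = 8192
        offset = size - fragsz = -4096 < 0  -> fragment would overrun the
                                               4 KiB page; the new check
                                               returns NULL instead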
Link: https://lkml.kernel.org/r/20220715125013.247085-1-mlombard@redhat.com
Fixes: b63ae8ca096d ("mm/net: Rename and move page fragment handling from net/ to mm/")
Signed-off-by: Maurizio Lombardi <mlombard@redhat.com>
Reviewed-by: Alexander Duyck <alexanderduyck@fb.com>
Cc: Chen Lin <chen45464546@163.com>
Cc: Jakub Kicinski <kuba@kernel.org>
Cc: stable@vger.kernel.org
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Yongqiang Liu <liuyongqiang13@huawei.com>
---
 mm/page_alloc.c | 12 ++++++++++++
 1 file changed, 12 insertions(+)
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index d290dbe177c5..a9a9e3a12f19 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -4962,6 +4962,18 @@ void *page_frag_alloc(struct page_frag_cache *nc,
 		/* reset page count bias and offset to start of new frag */
 		nc->pagecnt_bias = PAGE_FRAG_CACHE_MAX_SIZE + 1;
 		offset = size - fragsz;
+		if (unlikely(offset < 0)) {
+			/*
+			 * The caller is trying to allocate a fragment
+			 * with fragsz > PAGE_SIZE but the cache isn't big
+			 * enough to satisfy the request, this may
+			 * happen in low memory conditions.
+			 * We don't release the cache page because
+			 * it could make memory pressure worse
+			 * so we simply return NULL here.
+			 */
+			return NULL;
+		}
 	}

 	nc->pagecnt_bias--;
From: Alistair Popple <apopple@nvidia.com>

stable inclusion
from stable-v4.19.261
commit acf4387e553ede843bdacf3616e4b750517b58db
category: bugfix
bugzilla: https://gitee.com/openeuler/kernel/issues/I5ZXGL
CVE: NA
--------------------------------
commit 60bae73708963de4a17231077285bd9ff2f41c44 upstream.
When clearing a PTE the TLB should be flushed whilst still holding the PTL to avoid a potential race with madvise/munmap/etc. For example consider the following sequence:
CPU0                            CPU1
----                            ----
migrate_vma_collect_pmd()
pte_unmap_unlock()
                                madvise(MADV_DONTNEED)
                                -> zap_pte_range()
                                pte_offset_map_lock()
                                [ PTE not present, TLB not flushed ]
                                pte_unmap_unlock()
                                [ page is still accessible via stale TLB ]
flush_tlb_range()
In this case the page may still be accessed via the stale TLB entry after madvise returns. Fix this by flushing the TLB while holding the PTL.
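In sketch form, the required ordering is (simplified; the actual change in migrate_vma_collect_pmd() is below):

    ptep = pte_offset_map_lock(mm, pmdp, addr, &ptl);
    /* ... clear PTEs, set 'unmapped' if any entries were modified ... */
    if (unmapped)
            flush_tlb_range(walk->vma, start, end);  /* flush under PTL */
    pte_unmap_unlock(ptep - 1, ptl);  /* only now can madvise/munmap race in */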
Fixes: 8c3328f1f36a ("mm/migrate: migrate_vma() unmap page from vma while collecting pages")
Link: https://lkml.kernel.org/r/9f801e9d8d830408f2ca27821f606e09aa856899.166207852...
Signed-off-by: Alistair Popple <apopple@nvidia.com>
Reported-by: Nadav Amit <nadav.amit@gmail.com>
Reviewed-by: "Huang, Ying" <ying.huang@intel.com>
Acked-by: David Hildenbrand <david@redhat.com>
Acked-by: Peter Xu <peterx@redhat.com>
Cc: Alex Sierra <alex.sierra@amd.com>
Cc: Ben Skeggs <bskeggs@redhat.com>
Cc: Felix Kuehling <Felix.Kuehling@amd.com>
Cc: huang ying <huang.ying.caritas@gmail.com>
Cc: Jason Gunthorpe <jgg@nvidia.com>
Cc: John Hubbard <jhubbard@nvidia.com>
Cc: Karol Herbst <kherbst@redhat.com>
Cc: Logan Gunthorpe <logang@deltatee.com>
Cc: Lyude Paul <lyude@redhat.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Paul Mackerras <paulus@ozlabs.org>
Cc: Ralph Campbell <rcampbell@nvidia.com>
Cc: stable@vger.kernel.org
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Yongqiang Liu <liuyongqiang13@huawei.com>
---
 mm/migrate.c | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)
diff --git a/mm/migrate.c b/mm/migrate.c
index 3a20900e19e0..dc6416ccef44 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -2344,13 +2344,14 @@ static int migrate_vma_collect_pmd(pmd_t *pmdp,
 		migrate->dst[migrate->npages] = 0;
 		migrate->src[migrate->npages++] = mpfn;
 	}
-	arch_leave_lazy_mmu_mode();
-	pte_unmap_unlock(ptep - 1, ptl);

 	/* Only flush the TLB if we actually modified any entries */
 	if (unmapped)
 		flush_tlb_range(walk->vma, start, end);

+	arch_leave_lazy_mmu_mode();
+	pte_unmap_unlock(ptep - 1, ptl);
+
 	return 0;
 }
From: Tyler Hicks <tyhicks@linux.microsoft.com>

stable inclusion
from stable-v4.19.261
commit 7e290764624acfc807a9dae958b3e4ecc550b50c
category: bugfix
bugzilla: https://gitee.com/openeuler/kernel/issues/I5ZXGL
CVE: NA
--------------------------------
commit 9ff8a616dfab96a4fa0ddd36190907dc68886d9b upstream.
Ask the LSM to free its audit rule rather than directly calling kfree(). Both AppArmor and SELinux do additional work in their audit_rule_free() hooks. Fix memory leaks by allowing the LSMs to perform necessary work.
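The rule object is opaque to IMA and may carry LSM-internal allocations. SELinux's hook, for example, has roughly this shape (a simplified sketch, not verbatim):

    void selinux_audit_rule_free(void *vrule)
    {
            struct selinux_audit_rule *rule = vrule;

            if (rule) {
                    context_destroy(&rule->au_ctxt);  /* nested allocations */
                    kfree(rule);
            }
    }

A bare kfree() from IMA skips the context_destroy() step and leaks whatever the context holds.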
Fixes: b16942455193 ("ima: use the lsm policy update notifier")
Signed-off-by: Tyler Hicks <tyhicks@linux.microsoft.com>
Cc: Janne Karhunen <janne.karhunen@gmail.com>
Cc: Casey Schaufler <casey@schaufler-ca.com>
Reviewed-by: Mimi Zohar <zohar@linux.ibm.com>
Signed-off-by: Mimi Zohar <zohar@linux.ibm.com>
Cc: stable@vger.kernel.org # 4.19+
Signed-off-by: Gou Hao <gouhao@uniontech.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Yongqiang Liu <liuyongqiang13@huawei.com>
---
 security/integrity/ima/ima.h        | 5 +++++
 security/integrity/ima/ima_policy.c | 4 +++-
 2 files changed, 8 insertions(+), 1 deletion(-)
diff --git a/security/integrity/ima/ima.h b/security/integrity/ima/ima.h
index 2619dee3b4a5..0f1dabddb640 100644
--- a/security/integrity/ima/ima.h
+++ b/security/integrity/ima/ima.h
@@ -297,6 +297,7 @@ static inline int ima_read_xattr(struct dentry *dentry,
 #ifdef CONFIG_IMA_LSM_RULES

 #define security_filter_rule_init security_audit_rule_init
+#define security_filter_rule_free security_audit_rule_free
 #define security_filter_rule_match security_audit_rule_match

 #else
@@ -307,6 +308,10 @@ static inline int security_filter_rule_init(u32 field, u32 op, char *rulestr,
 	return -EINVAL;
 }

+static inline void security_filter_rule_free(void *lsmrule)
+{
+}
+
 static inline int security_filter_rule_match(u32 secid, u32 field, u32 op,
 					     void *lsmrule,
 					     struct audit_context *actx)
diff --git a/security/integrity/ima/ima_policy.c b/security/integrity/ima/ima_policy.c
index 93babb60b05a..a323d875dc2d 100644
--- a/security/integrity/ima/ima_policy.c
+++ b/security/integrity/ima/ima_policy.c
@@ -1045,8 +1045,10 @@ void ima_delete_rules(void)

 	temp_ima_appraise = 0;
 	list_for_each_entry_safe(entry, tmp, &ima_temp_rules, list) {
-		for (i = 0; i < MAX_LSM_RULES; i++)
+		for (i = 0; i < MAX_LSM_RULES; i++) {
+			security_filter_rule_free(entry->lsm[i].rule);
 			kfree(entry->lsm[i].args_p);
+		}

 		list_del(&entry->list);
 		kfree(entry);
From: Tyler Hicks <tyhicks@linux.microsoft.com>

stable inclusion
from stable-v4.19.261
commit 3d55a948aaec1ffd4ea329bc6e1a7ecd4f10e64f
category: bugfix
bugzilla: https://gitee.com/openeuler/kernel/issues/I5ZXGL
CVE: NA
--------------------------------
commit 465aee77aae857b5fcde56ee192b33dc369fba04 upstream.
Create a function, ima_free_rule(), to free all memory associated with an ima_rule_entry. Use the new function to fix memory leaks of allocated ima_rule_entry members, such as .fsname and .keyrings, when deleting a list of rules.
Make the existing ima_lsm_free_rule() function specific to the LSM audit rule array of an ima_rule_entry and require that callers make an additional call to kfree to free the ima_rule_entry itself.
This fixes a memory leak seen when loading a valid rule that contains an additional piece of allocated memory, such as an fsname, followed by an invalid rule that triggers a policy load failure:
# echo -e "dont_measure fsname=securityfs\nbad syntax" > \
    /sys/kernel/security/ima/policy
-bash: echo: write error: Invalid argument
# echo scan > /sys/kernel/debug/kmemleak
# cat /sys/kernel/debug/kmemleak
unreferenced object 0xffff9bab67ca12c0 (size 16):
  comm "bash", pid 684, jiffies 4295212803 (age 252.344s)
  hex dump (first 16 bytes):
    73 65 63 75 72 69 74 79 66 73 00 6b 6b 6b 6b a5  securityfs.kkkk.
  backtrace:
    [<00000000adc80b1b>] kstrdup+0x2e/0x60
    [<00000000d504cb0d>] ima_parse_add_rule+0x7d4/0x1020
    [<00000000444825ac>] ima_write_policy+0xab/0x1d0
    [<000000002b7f0d6c>] vfs_write+0xde/0x1d0
    [<0000000096feedcf>] ksys_write+0x68/0xe0
    [<0000000052b544a2>] do_syscall_64+0x56/0xa0
    [<000000007ead1ba7>] entry_SYSCALL_64_after_hwframe+0x44/0xa9
Fixes: f1b08bbcbdaf ("ima: define a new policy condition based on the filesystem name")
Fixes: 2b60c0ecedf8 ("IMA: Read keyrings= option from the IMA policy")
Signed-off-by: Tyler Hicks <tyhicks@linux.microsoft.com>
Signed-off-by: Mimi Zohar <zohar@linux.ibm.com>
Cc: stable@vger.kernel.org # 4.19+
Signed-off-by: Gou Hao <gouhao@uniontech.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Yongqiang Liu <liuyongqiang13@huawei.com>
---
 security/integrity/ima/ima_policy.c | 23 ++++++++++++++++-------
 1 file changed, 16 insertions(+), 7 deletions(-)
diff --git a/security/integrity/ima/ima_policy.c b/security/integrity/ima/ima_policy.c
index a323d875dc2d..044de5121375 100644
--- a/security/integrity/ima/ima_policy.c
+++ b/security/integrity/ima/ima_policy.c
@@ -241,6 +241,21 @@ static int __init default_appraise_policy_setup(char *str)
 }
 __setup("ima_appraise_tcb", default_appraise_policy_setup);

+static void ima_free_rule(struct ima_rule_entry *entry)
+{
+	int i;
+
+	if (!entry)
+		return;
+
+	kfree(entry->fsname);
+	for (i = 0; i < MAX_LSM_RULES; i++) {
+		security_filter_rule_free(entry->lsm[i].rule);
+		kfree(entry->lsm[i].args_p);
+	}
+	kfree(entry);
+}
+
 /*
  * The LSM policy can be reloaded, leaving the IMA LSM based rules referring
  * to the old, stale LSM policy. Update the IMA LSM based rules to reflect
@@ -1041,17 +1056,11 @@ ssize_t ima_parse_add_rule(char *rule)
 void ima_delete_rules(void)
 {
 	struct ima_rule_entry *entry, *tmp;
-	int i;

 	temp_ima_appraise = 0;
 	list_for_each_entry_safe(entry, tmp, &ima_temp_rules, list) {
-		for (i = 0; i < MAX_LSM_RULES; i++) {
-			security_filter_rule_free(entry->lsm[i].rule);
-			kfree(entry->lsm[i].args_p);
-		}
-
 		list_del(&entry->list);
-		kfree(entry);
+		ima_free_rule(entry);
 	}
 }
From: Tyler Hicks <tyhicks@linux.microsoft.com>

stable inclusion
from stable-v4.19.261
commit f039564c36ee609bc5327a52895dfee05c8f0f7c
category: bugfix
bugzilla: https://gitee.com/openeuler/kernel/issues/I5ZXGL
CVE: NA
--------------------------------
commit 2bdd737c5687d6dec30e205953146ede8a87dbdd upstream.
Use ima_free_rule() to fix memory leaks of allocated ima_rule_entry members, such as .fsname and .keyrings, when an error is encountered during rule parsing.
Set the args_p pointer to NULL after freeing it in the error path of ima_lsm_rule_init() so that it isn't freed twice.
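The guard works because kfree(NULL) is a no-op; a sketch of the two paths involved:

    /* ima_lsm_rule_init() error path */
    kfree(entry->lsm[lsm_rule].args_p);
    entry->lsm[lsm_rule].args_p = NULL;   /* guard against the second free */
    return -EINVAL;

    /* ima_parse_add_rule() error handling then runs */
    ima_free_rule(entry);   /* frees args_p again, now a harmless no-op */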
This fixes a memory leak seen when loading a rule that contains an additional piece of allocated memory, such as an fsname, followed by an invalid conditional:

# echo "measure fsname=tmpfs bad=cond" > /sys/kernel/security/ima/policy
-bash: echo: write error: Invalid argument
# echo scan > /sys/kernel/debug/kmemleak
# cat /sys/kernel/debug/kmemleak
unreferenced object 0xffff98e7e4ece6c0 (size 8):
  comm "bash", pid 672, jiffies 4294791843 (age 21.855s)
  hex dump (first 8 bytes):
    74 6d 70 66 73 00 6b a5                          tmpfs.k.
  backtrace:
    [<00000000abab7413>] kstrdup+0x2e/0x60
    [<00000000f11ede32>] ima_parse_add_rule+0x7d4/0x1020
    [<00000000f883dd7a>] ima_write_policy+0xab/0x1d0
    [<00000000b17cf753>] vfs_write+0xde/0x1d0
    [<00000000b8ddfdea>] ksys_write+0x68/0xe0
    [<00000000b8e21e87>] do_syscall_64+0x56/0xa0
    [<0000000089ea7b98>] entry_SYSCALL_64_after_hwframe+0x44/0xa9
Fixes: f1b08bbcbdaf ("ima: define a new policy condition based on the filesystem name")
Fixes: 2b60c0ecedf8 ("IMA: Read keyrings= option from the IMA policy")
Signed-off-by: Tyler Hicks <tyhicks@linux.microsoft.com>
Signed-off-by: Mimi Zohar <zohar@linux.ibm.com>
Cc: stable@vger.kernel.org # 4.19+
Signed-off-by: Gou Hao <gouhao@uniontech.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Yongqiang Liu <liuyongqiang13@huawei.com>
---
 security/integrity/ima/ima_policy.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)
diff --git a/security/integrity/ima/ima_policy.c b/security/integrity/ima/ima_policy.c
index 044de5121375..f68c9f3751f2 100644
--- a/security/integrity/ima/ima_policy.c
+++ b/security/integrity/ima/ima_policy.c
@@ -663,6 +663,7 @@ static int ima_lsm_rule_init(struct ima_rule_entry *entry,
 				   &entry->lsm[lsm_rule].rule);
 	if (!entry->lsm[lsm_rule].rule) {
 		kfree(entry->lsm[lsm_rule].args_p);
+		entry->lsm[lsm_rule].args_p = NULL;
 		return -EINVAL;
 	}

@@ -1035,7 +1036,7 @@ ssize_t ima_parse_add_rule(char *rule)

 	result = ima_parse_rule(p, entry);
 	if (result) {
-		kfree(entry);
+		ima_free_rule(entry);
 		integrity_audit_msg(AUDIT_INTEGRITY_STATUS, NULL,
 				    NULL, op, "invalid-policy", result,
 				    audit_info);
From: Sasha Levin <sashal@kernel.org>

stable inclusion
from stable-v4.19.262
commit 6d43e94b8daf009694d709c5919d67067936577a
category: bugfix
bugzilla: https://gitee.com/openeuler/kernel/issues/I5ZXGL
CVE: NA
--------------------------------
This reverts commit fd0a6e99b61e6c08fa5cf585d54fd956f70c73a6, which was upstream commit 97ef77c52b789ec1411d360ed99dca1efe4b2c81.
The commit is missing dependencies and breaks NFS tests, remove it for now.
Reported-by: Saeed Mirzamohammadi <saeed.mirzamohammadi@oracle.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
Signed-off-by: Yongqiang Liu <liuyongqiang13@huawei.com>
---
 fs/splice.c | 10 ++++++----
 1 file changed, 6 insertions(+), 4 deletions(-)
diff --git a/fs/splice.c b/fs/splice.c
index d71716f94e3c..230367f3df41 100644
--- a/fs/splice.c
+++ b/fs/splice.c
@@ -899,15 +899,17 @@ ssize_t splice_direct_to_actor(struct file *in, struct splice_desc *sd,
 {
 	struct pipe_inode_info *pipe;
 	long ret, bytes;
+	umode_t i_mode;
 	size_t len;
 	int i, flags, more;

 	/*
-	 * We require the input to be seekable, as we don't want to randomly
-	 * drop data for eg socket -> socket splicing. Use the piped splicing
-	 * for that!
+	 * We require the input being a regular file, as we don't want to
+	 * randomly drop data for eg socket -> socket splicing. Use the
+	 * piped splicing for that!
 	 */
-	if (unlikely(!(in->f_mode & FMODE_LSEEK)))
+	i_mode = file_inode(in)->i_mode;
+	if (unlikely(!S_ISREG(i_mode) && !S_ISBLK(i_mode)))
 		return -EINVAL;

 	/*
/*
From: Baokun Li <libaokun1@huawei.com>

stable inclusion
from stable-v4.19.262
commit 947264e00c46de19a016fd81218118c708fed2f3
category: bugfix
bugzilla: https://gitee.com/openeuler/kernel/issues/I5ZXGL
CVE: NA
--------------------------------
commit f9c1f248607d5546075d3f731e7607d5571f2b60 upstream.
I caught a null-ptr-deref bug as follows:
==================================================================
KASAN: null-ptr-deref in range [0x0000000000000068-0x000000000000006f]
CPU: 1 PID: 1589 Comm: umount Not tainted 5.10.0-02219-dirty #339
RIP: 0010:ext4_write_info+0x53/0x1b0
[...]
Call Trace:
 dquot_writeback_dquots+0x341/0x9a0
 ext4_sync_fs+0x19e/0x800
 __sync_filesystem+0x83/0x100
 sync_filesystem+0x89/0xf0
 generic_shutdown_super+0x79/0x3e0
 kill_block_super+0xa1/0x110
 deactivate_locked_super+0xac/0x130
 deactivate_super+0xb6/0xd0
 cleanup_mnt+0x289/0x400
 __cleanup_mnt+0x16/0x20
 task_work_run+0x11c/0x1c0
 exit_to_user_mode_prepare+0x203/0x210
 syscall_exit_to_user_mode+0x5b/0x3a0
 do_syscall_64+0x59/0x70
 entry_SYSCALL_64_after_hwframe+0x44/0xa9
==================================================================

Above issue may happen as follows:
-------------------------------------
exit_to_user_mode_prepare
 task_work_run
  __cleanup_mnt
   cleanup_mnt
    deactivate_super
     deactivate_locked_super
      kill_block_super
       generic_shutdown_super
        shrink_dcache_for_umount
         dentry = sb->s_root
         sb->s_root = NULL <--- Here set NULL
        sync_filesystem
         __sync_filesystem
          sb->s_op->sync_fs > ext4_sync_fs
           dquot_writeback_dquots
            sb->dq_op->write_info > ext4_write_info
             ext4_journal_start(d_inode(sb->s_root), EXT4_HT_QUOTA, 2)
              d_inode(sb->s_root)
               s_root->d_inode <--- Null pointer dereference
To solve this problem, we use ext4_journal_start_sb directly to avoid s_root being used.
Cc: stable@kernel.org
Signed-off-by: Baokun Li <libaokun1@huawei.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20220805123947.565152-1-libaokun1@huawei.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Yongqiang Liu <liuyongqiang13@huawei.com>
---
 fs/ext4/super.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/fs/ext4/super.c b/fs/ext4/super.c
index 2faad105a945..594fc6dd0e9f 100644
--- a/fs/ext4/super.c
+++ b/fs/ext4/super.c
@@ -6017,7 +6017,7 @@ static int ext4_write_info(struct super_block *sb, int type)
 	handle_t *handle;

 	/* Data block + inode block */
-	handle = ext4_journal_start(d_inode(sb->s_root), EXT4_HT_QUOTA, 2);
+	handle = ext4_journal_start_sb(sb, EXT4_HT_QUOTA, 2);
 	if (IS_ERR(handle))
 		return PTR_ERR(handle);
 	ret = dquot_commit_info(sb, type);
From: Neal Cardwell <ncardwell@google.com>

stable inclusion
from stable-v4.19.262
commit a434d10e7a90e301ea4a1826ee758b53c79d7de8
category: bugfix
bugzilla: https://gitee.com/openeuler/kernel/issues/I5ZXGL
CVE: NA
--------------------------------
[ Upstream commit f4ce91ce12a7c6ead19b128ffa8cff6e3ded2a14 ]
This commit fixes a bug in the tracking of max_packets_out and is_cwnd_limited. This bug can cause the connection to fail to remember that is_cwnd_limited is true, causing the connection to fail to grow cwnd when it should, causing throughput to be lower than it should be.
The following event sequence is an example that triggers the bug:
(a) The connection is cwnd_limited, but packets_out is not at its peak due to TSO deferral deciding not to send another skb yet. In such cases the connection can advance max_packets_seq and set tp->is_cwnd_limited to true and max_packets_out to a small number.
(b) Then later in the round trip the connection is pacing-limited (not cwnd-limited), and packets_out is larger. In such cases the connection would raise max_packets_out to a bigger number but (unexpectedly) flip tp->is_cwnd_limited from true to false.
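A concrete trace of the old update condition (the packet counts are hypothetical):

    window starts; snd_una advances past max_packets_seq

    (a) cwnd-limited, TSO deferral:  packets_out = 4
        condition true (seq edge crossed)
        -> max_packets_out = 4, is_cwnd_limited = true

    (b) later, pacing-limited:       packets_out = 7
        condition true (7 > max_packets_out)
        -> max_packets_out = 7, is_cwnd_limited = false  <- fact forgotten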
This commit fixes that bug.
One straightforward fix would be to separately track (a) the next window after max_packets_out reaches a maximum, and (b) the next window after tp->is_cwnd_limited is set to true. But this would require consuming an extra u32 sequence number.
Instead, to save space we track only the most important information. Specifically, we track the strongest available signal of the degree to which the cwnd is fully utilized:
(1) If the connection is cwnd-limited then we remember that fact for the current window.
(2) If the connection is not cwnd-limited then we track the maximum number of outstanding packets in the current window.
In particular, note that the new logic cannot trigger the buggy (a)/(b) sequence above because with the new logic a condition where tp->packets_out > tp->max_packets_out can only trigger an update of tp->is_cwnd_limited if tp->is_cwnd_limited is false.
This first showed up in testing of a BBRv2 dev branch, but this buggy behavior highlighted a general issue with the tcp_cwnd_validate() logic that can cause cwnd to fail to increase at the proper rate for any TCP congestion control, including Reno or CUBIC.
Fixes: ca8a22634381 ("tcp: make cwnd-limited checks measurement-based, and gentler")
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: Kevin(Yudong) Yang <yyd@google.com>
Signed-off-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Sasha Levin <sashal@kernel.org>
Signed-off-by: Yongqiang Liu <liuyongqiang13@huawei.com>
---
 include/linux/tcp.h   |  2 +-
 include/net/tcp.h     |  5 ++++-
 net/ipv4/tcp.c        |  2 ++
 net/ipv4/tcp_output.c | 19 ++++++++++++-------
 4 files changed, 19 insertions(+), 9 deletions(-)
diff --git a/include/linux/tcp.h b/include/linux/tcp.h
index 0ea3a31cc670..38e569967849 100644
--- a/include/linux/tcp.h
+++ b/include/linux/tcp.h
@@ -263,7 +263,7 @@ struct tcp_sock {
 	u32	packets_out;	/* Packets which are "in flight" */
 	u32	retrans_out;	/* Retransmitted packets out */
 	u32	max_packets_out;  /* max packets_out in last window */
-	u32	max_packets_seq;  /* right edge of max_packets_out flight */
+	u32	cwnd_usage_seq;	/* right edge of cwnd usage tracking flight */

 	u16	urg_data;	/* Saved octet of OOB data and control flags */
 	u8	ecn_flags;	/* ECN status bits. */
diff --git a/include/net/tcp.h b/include/net/tcp.h
index e2c09ab2e1f8..91b69dc5bed7 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -1233,11 +1233,14 @@ static inline bool tcp_is_cwnd_limited(const struct sock *sk)
 {
 	const struct tcp_sock *tp = tcp_sk(sk);

+	if (tp->is_cwnd_limited)
+		return true;
+
 	/* If in slow start, ensure cwnd grows to twice what was ACKed. */
 	if (tcp_in_slow_start(tp))
 		return tp->snd_cwnd < 2 * tp->max_packets_out;

-	return tp->is_cwnd_limited;
+	return false;
 }

 /* BBR congestion control needs pacing.
diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index 6f71b6cfc1b2..0adf8b05a52e 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -2612,6 +2612,8 @@ int tcp_disconnect(struct sock *sk, int flags)
 	icsk->icsk_probes_out = 0;
 	tp->snd_ssthresh = TCP_INFINITE_SSTHRESH;
 	tp->snd_cwnd_cnt = 0;
+	tp->is_cwnd_limited = 0;
+	tp->max_packets_out = 0;
 	tp->window_clamp = 0;
 	tp->delivered = 0;
 	tp->delivered_ce = 0;
diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
index 24e1f79e73d2..5fbfbccd3d92 100644
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -1691,15 +1691,20 @@ static void tcp_cwnd_validate(struct sock *sk, bool is_cwnd_limited)
 	const struct tcp_congestion_ops *ca_ops = inet_csk(sk)->icsk_ca_ops;
 	struct tcp_sock *tp = tcp_sk(sk);

-	/* Track the maximum number of outstanding packets in each
-	 * window, and remember whether we were cwnd-limited then.
+	/* Track the strongest available signal of the degree to which the cwnd
+	 * is fully utilized. If cwnd-limited then remember that fact for the
+	 * current window. If not cwnd-limited then track the maximum number of
+	 * outstanding packets in the current window. (If cwnd-limited then we
+	 * chose to not update tp->max_packets_out to avoid an extra else
+	 * clause with no functional impact.)
 	 */
-	if (!before(tp->snd_una, tp->max_packets_seq) ||
-	    tp->packets_out > tp->max_packets_out ||
-	    is_cwnd_limited) {
-		tp->max_packets_out = tp->packets_out;
-		tp->max_packets_seq = tp->snd_nxt;
+	if (!before(tp->snd_una, tp->cwnd_usage_seq) ||
+	    is_cwnd_limited ||
+	    (!tp->is_cwnd_limited &&
+	     tp->packets_out > tp->max_packets_out)) {
 		tp->is_cwnd_limited = is_cwnd_limited;
+		tp->max_packets_out = tp->packets_out;
+		tp->cwnd_usage_seq = tp->snd_nxt;
 	}

 	if (tcp_is_cwnd_limited(sk)) {
From: Eric Dumazet <edumazet@google.com>

stable inclusion
from stable-v4.19.262
commit 5c4e1b8939195fe27b05d791577f92445b139a3e
category: bugfix
bugzilla: https://gitee.com/openeuler/kernel/issues/I5ZXGL
CVE: NA
--------------------------------
[ Upstream commit aacd467c0a576e5e44d2de4205855dc0fe43f6fb ]
tcp_md5sig_pool_populated can be read while another thread changes its value.
The race has no consequence because allocations are protected with tcp_md5sig_mutex.
This patch adds READ_ONCE() and WRITE_ONCE() to document the race and silence KCSAN.
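For background, the usual shape of this annotation for a flag written under a mutex but read locklessly is (a generic sketch, not the patch itself):

    /* writer, under the mutex */
    WRITE_ONCE(populated, true);   /* marks the racy store */

    /* lockless reader */
    if (READ_ONCE(populated))      /* marks the racy load; silences KCSAN */
            use_the_resource();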
Reported-by: Abhishek Shah <abhishek.shah@columbia.edu>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Sasha Levin <sashal@kernel.org>
Signed-off-by: Yongqiang Liu <liuyongqiang13@huawei.com>
---
 net/ipv4/tcp.c | 14 ++++++++++----
 1 file changed, 10 insertions(+), 4 deletions(-)
diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index 0adf8b05a52e..30ccd90a2971 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -3699,12 +3699,16 @@ static void __tcp_alloc_md5sig_pool(void)
 	 * to memory. See smp_rmb() in tcp_get_md5sig_pool()
 	 */
 	smp_wmb();
-	tcp_md5sig_pool_populated = true;
+	/* Paired with READ_ONCE() from tcp_alloc_md5sig_pool()
+	 * and tcp_get_md5sig_pool().
+	 */
+	WRITE_ONCE(tcp_md5sig_pool_populated, true);
 }

 bool tcp_alloc_md5sig_pool(void)
 {
-	if (unlikely(!tcp_md5sig_pool_populated)) {
+	/* Paired with WRITE_ONCE() from __tcp_alloc_md5sig_pool() */
+	if (unlikely(!READ_ONCE(tcp_md5sig_pool_populated))) {
 		mutex_lock(&tcp_md5sig_mutex);

 		if (!tcp_md5sig_pool_populated)
@@ -3712,7 +3716,8 @@ bool tcp_alloc_md5sig_pool(void)
 			__tcp_alloc_md5sig_pool();

 		mutex_unlock(&tcp_md5sig_mutex);
 	}
-	return tcp_md5sig_pool_populated;
+	/* Paired with WRITE_ONCE() from __tcp_alloc_md5sig_pool() */
+	return READ_ONCE(tcp_md5sig_pool_populated);
 }
 EXPORT_SYMBOL(tcp_alloc_md5sig_pool);

@@ -3728,7 +3733,8 @@ struct tcp_md5sig_pool *tcp_get_md5sig_pool(void)
 {
 	local_bh_disable();

-	if (tcp_md5sig_pool_populated) {
+	/* Paired with WRITE_ONCE() from __tcp_alloc_md5sig_pool() */
+	if (READ_ONCE(tcp_md5sig_pool_populated)) {
 		/* coupled with smp_wmb() in __tcp_alloc_md5sig_pool() */
 		smp_rmb();
 		return this_cpu_ptr(&tcp_md5sig_pool);
From: Khalid Masum <khalid.masum.92@gmail.com>

stable inclusion
from stable-v4.19.262
commit 1e8abde895b3ac6a368cbdb372e8800c49e73a28
category: bugfix
bugzilla: https://gitee.com/openeuler/kernel/issues/I5ZXGL
CVE: NA
--------------------------------
[ Upstream commit 8a04d2fc700f717104bfb95b0f6694e448a4537f ]
Currently, if ipcomp_alloc_scratches() fails to allocate memory, ipcomp_scratches still holds an obsolete address. So when we try to free the percpu scratches using ipcomp_free_scratches(), it tries to vfree a non-existent vm area. Described below:

    static void * __percpu *ipcomp_alloc_scratches(void)
    {
            ...
            scratches = alloc_percpu(void *);
            if (!scratches)
                    return NULL;
            /* ipcomp_scratches does not know about this allocation
             * failure and therefore keeps holding the old, obsolete
             * address. */
            ...
    }
So when we free,
    static void ipcomp_free_scratches(void)
    {
            ...
            scratches = ipcomp_scratches;
            /* picks up the obsolete address from ipcomp_scratches */

            if (!scratches)
                    return;

            for_each_possible_cpu(i)
                    vfree(*per_cpu_ptr(scratches, i));
            /* tries to free a non-existent page, causing the warning:
             * "Trying to vfree() nonexistent vm area" */
            ...
    }

Fix this breakage by updating ipcomp_scratches with NULL when the scratches are freed.
Suggested-by: Herbert Xu <herbert@gondor.apana.org.au>
Reported-by: syzbot+5ec9bb042ddfe9644773@syzkaller.appspotmail.com
Tested-by: syzbot+5ec9bb042ddfe9644773@syzkaller.appspotmail.com
Signed-off-by: Khalid Masum <khalid.masum.92@gmail.com>
Acked-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
Signed-off-by: Yongqiang Liu <liuyongqiang13@huawei.com>
---
 net/xfrm/xfrm_ipcomp.c | 1 +
 1 file changed, 1 insertion(+)
diff --git a/net/xfrm/xfrm_ipcomp.c b/net/xfrm/xfrm_ipcomp.c
index a00ec715aa46..32aed1d0f6ee 100644
--- a/net/xfrm/xfrm_ipcomp.c
+++ b/net/xfrm/xfrm_ipcomp.c
@@ -216,6 +216,7 @@ static void ipcomp_free_scratches(void)
 		vfree(*per_cpu_ptr(scratches, i));

 	free_percpu(scratches);
+	ipcomp_scratches = NULL;
 }

 static void * __percpu *ipcomp_alloc_scratches(void)
From: Ziyang Xuan <william.xuanziyang@huawei.com>

stable inclusion
from stable-v4.19.262
commit dae06957f856eb699f2a504a46891718c9b1e0d3
category: bugfix
bugzilla: https://gitee.com/openeuler/kernel/issues/I5ZXGL
CVE: NA
--------------------------------
[ Upstream commit 3fd7bfd28cfd68ae80a2fe92ea1615722cc2ee6e ]
If can_send() fails, it should not update the frames_abs counter in bcm_can_tx(). Add a result check for can_send() in bcm_can_tx().
Suggested-by: Marc Kleine-Budde <mkl@pengutronix.de>
Suggested-by: Oliver Hartkopp <socketcan@hartkopp.net>
Signed-off-by: Ziyang Xuan <william.xuanziyang@huawei.com>
Link: https://lore.kernel.org/all/9851878e74d6d37aee2f1ee76d68361a46f89458.1663206...
Acked-by: Oliver Hartkopp <socketcan@hartkopp.net>
Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>
Signed-off-by: Sasha Levin <sashal@kernel.org>
Signed-off-by: Yongqiang Liu <liuyongqiang13@huawei.com>
---
 net/can/bcm.c | 7 ++++---
 1 file changed, 4 insertions(+), 3 deletions(-)
diff --git a/net/can/bcm.c b/net/can/bcm.c
index db987a1567ba..f0af3861c959 100644
--- a/net/can/bcm.c
+++ b/net/can/bcm.c
@@ -272,6 +272,7 @@ static void bcm_can_tx(struct bcm_op *op)
 	struct sk_buff *skb;
 	struct net_device *dev;
 	struct canfd_frame *cf = op->frames + op->cfsiz * op->currframe;
+	int err;

 	/* no target device? => exit */
 	if (!op->ifindex)
@@ -296,11 +297,11 @@ static void bcm_can_tx(struct bcm_op *op)
 	/* send with loopback */
 	skb->dev = dev;
 	can_skb_set_owner(skb, op->sk);
-	can_send(skb, 1);
+	err = can_send(skb, 1);
+	if (!err)
+		op->frames_abs++;

-	/* update statistics */
 	op->currframe++;
-	op->frames_abs++;

 	/* reached last frame? */
 	if (op->currframe >= op->nframes)
From: Liu Jian <liujian56@huawei.com>

stable inclusion
from stable-v4.19.262
commit 5fe03917bb017d9af68a95f989f1c122eebc69a6
category: bugfix
bugzilla: https://gitee.com/openeuler/kernel/issues/I5ZXGL
CVE: NA
--------------------------------
[ Upstream commit 3f8ef65af927db247418d4e1db49164d7a158fc5 ]
Fixes the below NULL pointer dereference:
[...]
[   14.471200] Call Trace:
[   14.471562]  <TASK>
[   14.471882]  lock_acquire+0x245/0x2e0
[   14.472416]  ? remove_wait_queue+0x12/0x50
[   14.473014]  ? _raw_spin_lock_irqsave+0x17/0x50
[   14.473681]  _raw_spin_lock_irqsave+0x3d/0x50
[   14.474318]  ? remove_wait_queue+0x12/0x50
[   14.474907]  remove_wait_queue+0x12/0x50
[   14.475480]  sk_stream_wait_memory+0x20d/0x340
[   14.476127]  ? do_wait_intr_irq+0x80/0x80
[   14.476704]  do_tcp_sendpages+0x287/0x600
[   14.477283]  tcp_bpf_push+0xab/0x260
[   14.477817]  tcp_bpf_sendmsg_redir+0x297/0x500
[   14.478461]  ? __local_bh_enable_ip+0x77/0xe0
[   14.479096]  tcp_bpf_send_verdict+0x105/0x470
[   14.479729]  tcp_bpf_sendmsg+0x318/0x4f0
[   14.480311]  sock_sendmsg+0x2d/0x40
[   14.480822]  ____sys_sendmsg+0x1b4/0x1c0
[   14.481390]  ? copy_msghdr_from_user+0x62/0x80
[   14.482048]  ___sys_sendmsg+0x78/0xb0
[   14.482580]  ? vmf_insert_pfn_prot+0x91/0x150
[   14.483215]  ? __do_fault+0x2a/0x1a0
[   14.483738]  ? do_fault+0x15e/0x5d0
[   14.484246]  ? __handle_mm_fault+0x56b/0x1040
[   14.484874]  ? lock_is_held_type+0xdf/0x130
[   14.485474]  ? find_held_lock+0x2d/0x90
[   14.486046]  ? __sys_sendmsg+0x41/0x70
[   14.486587]  __sys_sendmsg+0x41/0x70
[   14.487105]  ? intel_pmu_drain_pebs_core+0x350/0x350
[   14.487822]  do_syscall_64+0x34/0x80
[   14.488345]  entry_SYSCALL_64_after_hwframe+0x63/0xcd
[...]
The test scenario has the following flow:
thread1                                 thread2
-----------                             ---------------
tcp_bpf_sendmsg
 tcp_bpf_send_verdict
  tcp_bpf_sendmsg_redir                 sock_close
   tcp_bpf_push_locked                    __sock_release
    tcp_bpf_push                            //inet_release
     do_tcp_sendpages                       sock->ops->release
      sk_stream_wait_memory                   // tcp_close
       sk_wait_event                        sk->sk_prot->close
        release_sock(__sk);
        ***                                   lock_sock(sk);
                                                __tcp_close
                                                  sock_orphan(sk)
                                                    sk->sk_wq = NULL
                                              release_sock
        ****                                lock_sock(__sk);
        remove_wait_queue(sk_sleep(sk), &wait);
          sk_sleep(sk)
          //NULL pointer dereference
          &rcu_dereference_raw(sk->sk_wq)->wait

While waiting for memory in thread1, the socket is released along with its wait queue because thread2 has closed it. This is caused by tcp_bpf_send_verdict() not increasing the f_count of psock->sk_redir->sk_socket->file in thread1.

We should check whether the SOCK_DEAD flag is set on wakeup in sk_stream_wait_memory() before accessing the wait queue.
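For reference, sk_sleep() unconditionally dereferences sk->sk_wq, which sock_orphan() has set to NULL; the helper in include/net/sock.h has roughly this shape:

    static inline wait_queue_head_t *sk_sleep(struct sock *sk)
    {
            BUILD_BUG_ON(offsetof(struct socket_wq, wait) != 0);
            return &rcu_dereference_raw(sk->sk_wq)->wait;  /* NULL deref once orphaned */
    }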
Suggested-by: Jakub Sitnicki <jakub@cloudflare.com>
Signed-off-by: Liu Jian <liujian56@huawei.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: John Fastabend <john.fastabend@gmail.com>
Cc: Eric Dumazet <edumazet@google.com>
Link: https://lore.kernel.org/bpf/20220823133755.314697-2-liujian56@huawei.com
Signed-off-by: Sasha Levin <sashal@kernel.org>
Signed-off-by: Yongqiang Liu <liuyongqiang13@huawei.com>
---
 net/core/stream.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)
diff --git a/net/core/stream.c b/net/core/stream.c
index 7f5eaa95a675..71f654ca8da3 100644
--- a/net/core/stream.c
+++ b/net/core/stream.c
@@ -159,7 +159,8 @@ int sk_stream_wait_memory(struct sock *sk, long *timeo_p)
 		*timeo_p = current_timeo;
 	}
 out:
-	remove_wait_queue(sk_sleep(sk), &wait);
+	if (!sock_flag(sk, SOCK_DEAD))
+		remove_wait_queue(sk_sleep(sk), &wait);
 	return err;

 do_error:
From: Keith Busch <kbusch@kernel.org>

stable inclusion
from stable-v4.19.262
commit 366a2b3110c69f919fb3277acc1a0bb8cd8a8dbd
category: bugfix
bugzilla: https://gitee.com/openeuler/kernel/issues/I5ZXGL
CVE: NA
--------------------------------
[ Upstream commit a8eb6c1ba48bddea82e8d74cbe6e119f006be97d ]
The firmware revision can change after a reset, so copy the most recent info each time instead of just the first time; otherwise the sysfs firmware_rev entry may contain stale data.
Reported-by: Jeff Lien <jeff.lien@wdc.com>
Signed-off-by: Keith Busch <kbusch@kernel.org>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Reviewed-by: Chao Leng <lengchao@huawei.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Sasha Levin <sashal@kernel.org>
Signed-off-by: Yongqiang Liu <liuyongqiang13@huawei.com>
---
 drivers/nvme/host/core.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)
diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
index dcaab0304c9e..a8e6488c7c8e 100644
--- a/drivers/nvme/host/core.c
+++ b/drivers/nvme/host/core.c
@@ -2470,7 +2470,6 @@ static int nvme_init_subsystem(struct nvme_ctrl *ctrl, struct nvme_id_ctrl *id)
 	nvme_init_subnqn(subsys, ctrl, id);
 	memcpy(subsys->serial, id->sn, sizeof(subsys->serial));
 	memcpy(subsys->model, id->mn, sizeof(subsys->model));
-	memcpy(subsys->firmware_rev, id->fr, sizeof(subsys->firmware_rev));
 	subsys->vendor_id = le16_to_cpu(id->vid);
 	subsys->cmic = id->cmic;

@@ -2633,6 +2632,8 @@ int nvme_init_identify(struct nvme_ctrl *ctrl)
 				ctrl->quirks |= core_quirks[i].quirks;
 		}
 	}
+	memcpy(ctrl->subsys->firmware_rev, id->fr,
+	       sizeof(ctrl->subsys->firmware_rev));

 	if (force_apst && (ctrl->quirks & NVME_QUIRK_NO_DEEPEST_PS)) {
 		dev_warn(ctrl->device,
 			 "forcibly allowing all power states due to nvme_core.force_apst -- use at your own risk\n");
From: Jerry Lee 李修賢 <jerrylee@qnap.com>

stable inclusion
from stable-v4.19.262
commit f2180ad6a43501597d20eacad0c6f146c51d4bbd
category: bugfix
bugzilla: https://gitee.com/openeuler/kernel/issues/I5ZXGL
CVE: NA
--------------------------------
commit df3cb754d13d2cd5490db9b8d536311f8413a92e upstream.
When expanding a file system from (16TiB-2MiB) to 18TiB, the operation exits early, which leads to an inconsistency between resize2fs and the Ext4 kernel driver.
=== before ===
○ → resize2fs /dev/mapper/thin
resize2fs 1.45.5 (07-Jan-2020)
Filesystem at /dev/mapper/thin is mounted on /mnt/test; on-line resizing required
old_desc_blocks = 2048, new_desc_blocks = 2304
The filesystem on /dev/mapper/thin is now 4831837696 (4k) blocks long.

[  865.186308] EXT4-fs (dm-5): mounted filesystem with ordered data mode. Opts: (null). Quota mode: none.
[  912.091502] dm-4: detected capacity change from 34359738368 to 38654705664
[  970.030550] dm-5: detected capacity change from 34359734272 to 38654701568
[ 1000.012751] EXT4-fs (dm-5): resizing filesystem from 4294966784 to 4831837696 blocks
[ 1000.012878] EXT4-fs (dm-5): resized filesystem to 4294967296

=== after ===
[  129.104898] EXT4-fs (dm-5): mounted filesystem with ordered data mode. Opts: (null). Quota mode: none.
[  143.773630] dm-4: detected capacity change from 34359738368 to 38654705664
[  198.203246] dm-5: detected capacity change from 34359734272 to 38654701568
[  207.918603] EXT4-fs (dm-5): resizing filesystem from 4294966784 to 4831837696 blocks
[  207.918754] EXT4-fs (dm-5): resizing filesystem from 4294967296 to 4831837696 blocks
[  207.918758] EXT4-fs (dm-5): Converting file system to meta_bg
[  207.918790] EXT4-fs (dm-5): resizing filesystem from 4294967296 to 4831837696 blocks
[  221.454050] EXT4-fs (dm-5): resized to 4658298880 blocks
[  227.634613] EXT4-fs (dm-5): resized filesystem to 4831837696
Signed-off-by: Jerry Lee <jerrylee@qnap.com>
Link: https://lore.kernel.org/r/PU1PR04MB22635E739BD21150DC182AC6A18C9@PU1PR04MB22...
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Yongqiang Liu <liuyongqiang13@huawei.com>
---
 fs/ext4/resize.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/fs/ext4/resize.c b/fs/ext4/resize.c
index f2b881aaf0b1..b8e9ea475a59 100644
--- a/fs/ext4/resize.c
+++ b/fs/ext4/resize.c
@@ -2084,7 +2084,7 @@ int ext4_resize_fs(struct super_block *sb, ext4_fsblk_t n_blocks_count)
 		goto out;
 	}

-	if (ext4_blocks_count(es) == n_blocks_count)
+	if (ext4_blocks_count(es) == n_blocks_count && n_blocks_count_retry == 0)
 		goto out;

 	err = ext4_alloc_flex_bg_array(sb, n_group + 1);
From: Eric Dumazet <edumazet@google.com>

stable inclusion
from stable-v4.19.262
commit 75a578000ae5e511e5d0e8433c94a14d9c99c412
category: bugfix
bugzilla: https://gitee.com/openeuler/kernel/issues/I5ZXGL
CVE: NA
--------------------------------
commit 8f905c0e7354ef261360fb7535ea079b1082c105 upstream.
syzbot reported various issues around early demux, one being included in this changelog [1]
sk->sk_rx_dst is using RCU protection without clearly documenting it.
And following sequences in tcp_v4_do_rcv()/tcp_v6_do_rcv() are not following standard RCU rules.
[a] dst_release(dst);
[b] sk->sk_rx_dst = NULL;
They look wrong because a delete operation of RCU protected pointer is supposed to clear the pointer before the call_rcu()/synchronize_rcu() guarding actual memory freeing.
In some cases indeed, dst could be freed before [b] is done.
We could cheat by clearing sk_rx_dst before calling dst_release(), but this seems the right time to stick to standard RCU annotations and debugging facilities.
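For comparison, the standard ordering when retiring an RCU-protected pointer is (a generic sketch, not the patch itself):

    struct dst_entry *old;

    old = rcu_dereference_protected(sk->sk_rx_dst, ...);
    RCU_INIT_POINTER(sk->sk_rx_dst, NULL);  /* [b]: unpublish first */
    dst_release(old);                       /* [a]: then release; the actual
                                               freeing is RCU-deferred */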
[1] BUG: KASAN: use-after-free in dst_check include/net/dst.h:470 [inline] BUG: KASAN: use-after-free in tcp_v4_early_demux+0x95b/0x960 net/ipv4/tcp_ipv4.c:1792 Read of size 2 at addr ffff88807f1cb73a by task syz-executor.5/9204
CPU: 0 PID: 9204 Comm: syz-executor.5 Not tainted 5.16.0-rc5-syzkaller #0 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011 Call Trace: <TASK> __dump_stack lib/dump_stack.c:88 [inline] dump_stack_lvl+0xcd/0x134 lib/dump_stack.c:106 print_address_description.constprop.0.cold+0x8d/0x320 mm/kasan/report.c:247 __kasan_report mm/kasan/report.c:433 [inline] kasan_report.cold+0x83/0xdf mm/kasan/report.c:450 dst_check include/net/dst.h:470 [inline] tcp_v4_early_demux+0x95b/0x960 net/ipv4/tcp_ipv4.c:1792 ip_rcv_finish_core.constprop.0+0x15de/0x1e80 net/ipv4/ip_input.c:340 ip_list_rcv_finish.constprop.0+0x1b2/0x6e0 net/ipv4/ip_input.c:583 ip_sublist_rcv net/ipv4/ip_input.c:609 [inline] ip_list_rcv+0x34e/0x490 net/ipv4/ip_input.c:644 __netif_receive_skb_list_ptype net/core/dev.c:5508 [inline] __netif_receive_skb_list_core+0x549/0x8e0 net/core/dev.c:5556 __netif_receive_skb_list net/core/dev.c:5608 [inline] netif_receive_skb_list_internal+0x75e/0xd80 net/core/dev.c:5699 gro_normal_list net/core/dev.c:5853 [inline] gro_normal_list net/core/dev.c:5849 [inline] napi_complete_done+0x1f1/0x880 net/core/dev.c:6590 virtqueue_napi_complete drivers/net/virtio_net.c:339 [inline] virtnet_poll+0xca2/0x11b0 drivers/net/virtio_net.c:1557 __napi_poll+0xaf/0x440 net/core/dev.c:7023 napi_poll net/core/dev.c:7090 [inline] net_rx_action+0x801/0xb40 net/core/dev.c:7177 __do_softirq+0x29b/0x9c2 kernel/softirq.c:558 invoke_softirq kernel/softirq.c:432 [inline] __irq_exit_rcu+0x123/0x180 kernel/softirq.c:637 irq_exit_rcu+0x5/0x20 kernel/softirq.c:649 common_interrupt+0x52/0xc0 arch/x86/kernel/irq.c:240 asm_common_interrupt+0x1e/0x40 arch/x86/include/asm/idtentry.h:629 RIP: 0033:0x7f5e972bfd57 Code: 39 d1 73 14 0f 1f 80 00 00 00 00 48 8b 50 f8 48 83 e8 08 48 39 ca 77 f3 48 39 c3 73 3e 48 89 13 48 8b 50 f8 48 89 38 49 8b 0e <48> 8b 3e 48 83 c3 08 48 83 c6 08 eb bc 48 39 d1 72 9e 48 39 d0 73 RSP: 002b:00007fff8a413210 EFLAGS: 00000283 RAX: 00007f5e97108990 RBX: 00007f5e97108338 RCX: ffffffff81d3aa45 RDX: ffffffff81d3aa45 RSI: 00007f5e97108340 RDI: ffffffff81d3aa45 RBP: 00007f5e97107eb8 R08: 00007f5e97108d88 R09: 0000000093c2e8d9 R10: 0000000000000000 R11: 0000000000000000 R12: 00007f5e97107eb0 R13: 00007f5e97108338 R14: 00007f5e97107ea8 R15: 0000000000000019 </TASK>
Allocated by task 13: kasan_save_stack+0x1e/0x50 mm/kasan/common.c:38 kasan_set_track mm/kasan/common.c:46 [inline] set_alloc_info mm/kasan/common.c:434 [inline] __kasan_slab_alloc+0x90/0xc0 mm/kasan/common.c:467 kasan_slab_alloc include/linux/kasan.h:259 [inline] slab_post_alloc_hook mm/slab.h:519 [inline] slab_alloc_node mm/slub.c:3234 [inline] slab_alloc mm/slub.c:3242 [inline] kmem_cache_alloc+0x202/0x3a0 mm/slub.c:3247 dst_alloc+0x146/0x1f0 net/core/dst.c:92 rt_dst_alloc+0x73/0x430 net/ipv4/route.c:1613 ip_route_input_slow+0x1817/0x3a20 net/ipv4/route.c:2340 ip_route_input_rcu net/ipv4/route.c:2470 [inline] ip_route_input_noref+0x116/0x2a0 net/ipv4/route.c:2415 ip_rcv_finish_core.constprop.0+0x288/0x1e80 net/ipv4/ip_input.c:354 ip_list_rcv_finish.constprop.0+0x1b2/0x6e0 net/ipv4/ip_input.c:583 ip_sublist_rcv net/ipv4/ip_input.c:609 [inline] ip_list_rcv+0x34e/0x490 net/ipv4/ip_input.c:644 __netif_receive_skb_list_ptype net/core/dev.c:5508 [inline] __netif_receive_skb_list_core+0x549/0x8e0 net/core/dev.c:5556 __netif_receive_skb_list net/core/dev.c:5608 [inline] netif_receive_skb_list_internal+0x75e/0xd80 net/core/dev.c:5699 gro_normal_list net/core/dev.c:5853 [inline] gro_normal_list net/core/dev.c:5849 [inline] napi_complete_done+0x1f1/0x880 net/core/dev.c:6590 virtqueue_napi_complete drivers/net/virtio_net.c:339 [inline] virtnet_poll+0xca2/0x11b0 drivers/net/virtio_net.c:1557 __napi_poll+0xaf/0x440 net/core/dev.c:7023 napi_poll net/core/dev.c:7090 [inline] net_rx_action+0x801/0xb40 net/core/dev.c:7177 __do_softirq+0x29b/0x9c2 kernel/softirq.c:558
Freed by task 13: kasan_save_stack+0x1e/0x50 mm/kasan/common.c:38 kasan_set_track+0x21/0x30 mm/kasan/common.c:46 kasan_set_free_info+0x20/0x30 mm/kasan/generic.c:370 ____kasan_slab_free mm/kasan/common.c:366 [inline] ____kasan_slab_free mm/kasan/common.c:328 [inline] __kasan_slab_free+0xff/0x130 mm/kasan/common.c:374 kasan_slab_free include/linux/kasan.h:235 [inline] slab_free_hook mm/slub.c:1723 [inline] slab_free_freelist_hook+0x8b/0x1c0 mm/slub.c:1749 slab_free mm/slub.c:3513 [inline] kmem_cache_free+0xbd/0x5d0 mm/slub.c:3530 dst_destroy+0x2d6/0x3f0 net/core/dst.c:127 rcu_do_batch kernel/rcu/tree.c:2506 [inline] rcu_core+0x7ab/0x1470 kernel/rcu/tree.c:2741 __do_softirq+0x29b/0x9c2 kernel/softirq.c:558
Last potentially related work creation: kasan_save_stack+0x1e/0x50 mm/kasan/common.c:38 __kasan_record_aux_stack+0xf5/0x120 mm/kasan/generic.c:348 __call_rcu kernel/rcu/tree.c:2985 [inline] call_rcu+0xb1/0x740 kernel/rcu/tree.c:3065 dst_release net/core/dst.c:177 [inline] dst_release+0x79/0xe0 net/core/dst.c:167 tcp_v4_do_rcv+0x612/0x8d0 net/ipv4/tcp_ipv4.c:1712 sk_backlog_rcv include/net/sock.h:1030 [inline] __release_sock+0x134/0x3b0 net/core/sock.c:2768 release_sock+0x54/0x1b0 net/core/sock.c:3300 tcp_sendmsg+0x36/0x40 net/ipv4/tcp.c:1441 inet_sendmsg+0x99/0xe0 net/ipv4/af_inet.c:819 sock_sendmsg_nosec net/socket.c:704 [inline] sock_sendmsg+0xcf/0x120 net/socket.c:724 sock_write_iter+0x289/0x3c0 net/socket.c:1057 call_write_iter include/linux/fs.h:2162 [inline] new_sync_write+0x429/0x660 fs/read_write.c:503 vfs_write+0x7cd/0xae0 fs/read_write.c:590 ksys_write+0x1ee/0x250 fs/read_write.c:643 do_syscall_x64 arch/x86/entry/common.c:50 [inline] do_syscall_64+0x35/0xb0 arch/x86/entry/common.c:80 entry_SYSCALL_64_after_hwframe+0x44/0xae
The buggy address belongs to the object at ffff88807f1cb700
 which belongs to the cache ip_dst_cache of size 176
The buggy address is located 58 bytes inside of
 176-byte region [ffff88807f1cb700, ffff88807f1cb7b0)
The buggy address belongs to the page:
page:ffffea0001fc72c0 refcount:1 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x7f1cb
flags: 0xfff00000000200(slab|node=0|zone=1|lastcpupid=0x7ff)
raw: 00fff00000000200 dead000000000100 dead000000000122 ffff8881413bb780
raw: 0000000000000000 0000000000100010 00000001ffffffff 0000000000000000
page dumped because: kasan: bad access detected
page_owner tracks the page as allocated
page last allocated via order 0, migratetype Unmovable, gfp_mask 0x112a20(GFP_ATOMIC|__GFP_NOWARN|__GFP_NORETRY|__GFP_HARDWALL), pid 5, ts 108466983062, free_ts 108048976062
 prep_new_page mm/page_alloc.c:2418 [inline]
 get_page_from_freelist+0xa72/0x2f50 mm/page_alloc.c:4149
 __alloc_pages+0x1b2/0x500 mm/page_alloc.c:5369
 alloc_pages+0x1a7/0x300 mm/mempolicy.c:2191
 alloc_slab_page mm/slub.c:1793 [inline]
 allocate_slab mm/slub.c:1930 [inline]
 new_slab+0x32d/0x4a0 mm/slub.c:1993
 ___slab_alloc+0x918/0xfe0 mm/slub.c:3022
 __slab_alloc.constprop.0+0x4d/0xa0 mm/slub.c:3109
 slab_alloc_node mm/slub.c:3200 [inline]
 slab_alloc mm/slub.c:3242 [inline]
 kmem_cache_alloc+0x35c/0x3a0 mm/slub.c:3247
 dst_alloc+0x146/0x1f0 net/core/dst.c:92
 rt_dst_alloc+0x73/0x430 net/ipv4/route.c:1613
 __mkroute_output net/ipv4/route.c:2564 [inline]
 ip_route_output_key_hash_rcu+0x921/0x2d00 net/ipv4/route.c:2791
 ip_route_output_key_hash+0x18b/0x300 net/ipv4/route.c:2619
 __ip_route_output_key include/net/route.h:126 [inline]
 ip_route_output_flow+0x23/0x150 net/ipv4/route.c:2850
 ip_route_output_key include/net/route.h:142 [inline]
 geneve_get_v4_rt+0x3a6/0x830 drivers/net/geneve.c:809
 geneve_xmit_skb drivers/net/geneve.c:899 [inline]
 geneve_xmit+0xc4a/0x3540 drivers/net/geneve.c:1082
 __netdev_start_xmit include/linux/netdevice.h:4994 [inline]
 netdev_start_xmit include/linux/netdevice.h:5008 [inline]
 xmit_one net/core/dev.c:3590 [inline]
 dev_hard_start_xmit+0x1eb/0x920 net/core/dev.c:3606
 __dev_queue_xmit+0x299a/0x3650 net/core/dev.c:4229
page last free stack trace:
 reset_page_owner include/linux/page_owner.h:24 [inline]
 free_pages_prepare mm/page_alloc.c:1338 [inline]
 free_pcp_prepare+0x374/0x870 mm/page_alloc.c:1389
 free_unref_page_prepare mm/page_alloc.c:3309 [inline]
 free_unref_page+0x19/0x690 mm/page_alloc.c:3388
 qlink_free mm/kasan/quarantine.c:146 [inline]
 qlist_free_all+0x5a/0xc0 mm/kasan/quarantine.c:165
 kasan_quarantine_reduce+0x180/0x200 mm/kasan/quarantine.c:272
 __kasan_slab_alloc+0xa2/0xc0 mm/kasan/common.c:444
 kasan_slab_alloc include/linux/kasan.h:259 [inline]
 slab_post_alloc_hook mm/slab.h:519 [inline]
 slab_alloc_node mm/slub.c:3234 [inline]
 kmem_cache_alloc_node+0x255/0x3f0 mm/slub.c:3270
 __alloc_skb+0x215/0x340 net/core/skbuff.c:414
 alloc_skb include/linux/skbuff.h:1126 [inline]
 alloc_skb_with_frags+0x93/0x620 net/core/skbuff.c:6078
 sock_alloc_send_pskb+0x783/0x910 net/core/sock.c:2575
 mld_newpack+0x1df/0x770 net/ipv6/mcast.c:1754
 add_grhead+0x265/0x330 net/ipv6/mcast.c:1857
 add_grec+0x1053/0x14e0 net/ipv6/mcast.c:1995
 mld_send_initial_cr.part.0+0xf6/0x230 net/ipv6/mcast.c:2242
 mld_send_initial_cr net/ipv6/mcast.c:1232 [inline]
 mld_dad_work+0x1d3/0x690 net/ipv6/mcast.c:2268
 process_one_work+0x9b2/0x1690 kernel/workqueue.c:2298
 worker_thread+0x658/0x11f0 kernel/workqueue.c:2445
Memory state around the buggy address:
 ffff88807f1cb600: fa fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
 ffff88807f1cb680: fb fb fb fb fb fb fc fc fc fc fc fc fc fc fc fc
>ffff88807f1cb700: fa fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
                                        ^
 ffff88807f1cb780: fb fb fb fb fb fb fc fc fc fc fc fc fc fc fc fc
 ffff88807f1cb800: fa fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
Fixes: 41063e9dd119 ("ipv4: Early TCP socket demux.")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Link: https://lore.kernel.org/r/20211220143330.680945-1-eric.dumazet@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
[cmllamas: fixed trivial merge conflict]
Signed-off-by: Carlos Llamas <cmllamas@google.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Conflicts:
	net/ipv4/af_inet.c
Signed-off-by: Dong Chenchen <dongchenchen2@huawei.com>
Reviewed-by: Yue Haibing <yuehaibing@huawei.com>
Signed-off-by: Yongqiang Liu <liuyongqiang13@huawei.com>
---
 include/net/sock.h   |  2 +-
 net/ipv4/af_inet.c   |  2 +-
 net/ipv4/tcp.c       |  3 +--
 net/ipv4/tcp_input.c |  2 +-
 net/ipv4/tcp_ipv4.c  | 11 +++++++----
 net/ipv4/udp.c       |  6 +++---
 net/ipv6/tcp_ipv6.c  | 11 +++++++----
 net/ipv6/udp.c       |  4 ++--
 8 files changed, 23 insertions(+), 18 deletions(-)
diff --git a/include/net/sock.h b/include/net/sock.h
index 33834204fc0e..273f8226dbff 100644
--- a/include/net/sock.h
+++ b/include/net/sock.h
@@ -402,7 +402,7 @@ struct sock {
 #ifdef CONFIG_XFRM
         struct xfrm_policy __rcu *sk_policy[2];
 #endif
-        struct dst_entry *sk_rx_dst;
+        struct dst_entry __rcu *sk_rx_dst;
         struct dst_entry __rcu *sk_dst_cache;
         atomic_t sk_omem_alloc;
         int sk_sndbuf;
diff --git a/net/ipv4/af_inet.c b/net/ipv4/af_inet.c
index 428efffde33a..72aee2b4a6ee 100644
--- a/net/ipv4/af_inet.c
+++ b/net/ipv4/af_inet.c
@@ -157,7 +157,7 @@ void inet_sock_destruct(struct sock *sk)
 
         kfree(rcu_dereference_protected(inet->inet_opt, 1));
         dst_release(rcu_dereference_protected(sk->sk_dst_cache, 1));
-        dst_release(sk->sk_rx_dst);
+        dst_release(rcu_dereference_protected(sk->sk_rx_dst, 1));
         sk_refcnt_debug_dec(sk);
 }
 EXPORT_SYMBOL(inet_sock_destruct);
diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index 30ccd90a2971..cb118905740f 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -2631,8 +2631,7 @@ int tcp_disconnect(struct sock *sk, int flags)
         icsk->icsk_ack.rcv_mss = TCP_MIN_MSS;
         memset(&tp->rx_opt, 0, sizeof(tp->rx_opt));
         __sk_dst_reset(sk);
-        dst_release(sk->sk_rx_dst);
-        sk->sk_rx_dst = NULL;
+        dst_release(xchg((__force struct dst_entry **)&sk->sk_rx_dst, NULL));
         tcp_saved_syn_free(tp);
         tp->compressed_ack = 0;
         tp->segs_in = 0;
diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index 94123a220875..9cfd25e20be9 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -5587,7 +5587,7 @@ void tcp_rcv_established(struct sock *sk, struct sk_buff *skb)
         trace_tcp_probe(sk, skb);
 
         tcp_mstamp_refresh(tp);
-        if (unlikely(!sk->sk_rx_dst))
+        if (unlikely(!rcu_access_pointer(sk->sk_rx_dst)))
                 inet_csk(sk)->icsk_af_ops->sk_rx_dst_set(sk, skb);
         /*
          *      Header prediction.
diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c
index a30808cab791..dc59bd34eb42 100644
--- a/net/ipv4/tcp_ipv4.c
+++ b/net/ipv4/tcp_ipv4.c
@@ -1533,15 +1533,18 @@ int tcp_v4_do_rcv(struct sock *sk, struct sk_buff *skb)
         struct sock *rsk;
 
         if (sk->sk_state == TCP_ESTABLISHED) { /* Fast path */
-                struct dst_entry *dst = sk->sk_rx_dst;
+                struct dst_entry *dst;
+
+                dst = rcu_dereference_protected(sk->sk_rx_dst,
+                                                lockdep_sock_is_held(sk));
 
                 sock_rps_save_rxhash(sk, skb);
                 sk_mark_napi_id(sk, skb);
                 if (dst) {
                         if (inet_sk(sk)->rx_dst_ifindex != skb->skb_iif ||
                             !dst->ops->check(dst, 0)) {
+                                RCU_INIT_POINTER(sk->sk_rx_dst, NULL);
                                 dst_release(dst);
-                                sk->sk_rx_dst = NULL;
                         }
                 }
                 tcp_rcv_established(sk, skb);
@@ -1616,7 +1619,7 @@ int tcp_v4_early_demux(struct sk_buff *skb)
                 skb->sk = sk;
                 skb->destructor = sock_edemux;
                 if (sk_fullsock(sk)) {
-                        struct dst_entry *dst = READ_ONCE(sk->sk_rx_dst);
+                        struct dst_entry *dst = rcu_dereference(sk->sk_rx_dst);
 
                         if (dst)
                                 dst = dst_check(dst, 0);
@@ -2013,7 +2016,7 @@ void inet_sk_rx_dst_set(struct sock *sk, const struct sk_buff *skb)
         struct dst_entry *dst = skb_dst(skb);
 
         if (dst && dst_hold_safe(dst)) {
-                sk->sk_rx_dst = dst;
+                rcu_assign_pointer(sk->sk_rx_dst, dst);
                 inet_sk(sk)->rx_dst_ifindex = skb->skb_iif;
         }
 }
diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c
index 0b9f3109c4ef..b9b6b62ef16f 100644
--- a/net/ipv4/udp.c
+++ b/net/ipv4/udp.c
@@ -2047,7 +2047,7 @@ bool udp_sk_rx_dst_set(struct sock *sk, struct dst_entry *dst)
         struct dst_entry *old;
 
         if (dst_hold_safe(dst)) {
-                old = xchg(&sk->sk_rx_dst, dst);
+                old = xchg((__force struct dst_entry **)&sk->sk_rx_dst, dst);
                 dst_release(old);
                 return old != dst;
         }
@@ -2237,7 +2237,7 @@ int __udp4_lib_rcv(struct sk_buff *skb, struct udp_table *udptable,
                 struct dst_entry *dst = skb_dst(skb);
                 int ret;
 
-                if (unlikely(sk->sk_rx_dst != dst))
+                if (unlikely(rcu_dereference(sk->sk_rx_dst) != dst))
                         udp_sk_rx_dst_set(sk, dst);
 
                 ret = udp_unicast_rcv_skb(sk, skb, uh);
@@ -2395,7 +2395,7 @@ int udp_v4_early_demux(struct sk_buff *skb)
 
         skb->sk = sk;
         skb->destructor = sock_efree;
-        dst = READ_ONCE(sk->sk_rx_dst);
+        dst = rcu_dereference(sk->sk_rx_dst);
 
         if (dst)
                 dst = dst_check(dst, 0);
diff --git a/net/ipv6/tcp_ipv6.c b/net/ipv6/tcp_ipv6.c
index ab9e022325e7..79171f2e3c83 100644
--- a/net/ipv6/tcp_ipv6.c
+++ b/net/ipv6/tcp_ipv6.c
@@ -97,7 +97,7 @@ static void inet6_sk_rx_dst_set(struct sock *sk, const struct sk_buff *skb)
         if (dst && dst_hold_safe(dst)) {
                 const struct rt6_info *rt = (const struct rt6_info *)dst;
 
-                sk->sk_rx_dst = dst;
+                rcu_assign_pointer(sk->sk_rx_dst, dst);
                 inet_sk(sk)->rx_dst_ifindex = skb->skb_iif;
                 inet6_sk(sk)->rx_dst_cookie = rt6_get_cookie(rt);
         }
@@ -1335,15 +1335,18 @@ static int tcp_v6_do_rcv(struct sock *sk, struct sk_buff *skb)
         opt_skb = skb_clone(skb, sk_gfp_mask(sk, GFP_ATOMIC));
 
         if (sk->sk_state == TCP_ESTABLISHED) { /* Fast path */
-                struct dst_entry *dst = sk->sk_rx_dst;
+                struct dst_entry *dst;
+
+                dst = rcu_dereference_protected(sk->sk_rx_dst,
+                                                lockdep_sock_is_held(sk));
 
                 sock_rps_save_rxhash(sk, skb);
                 sk_mark_napi_id(sk, skb);
                 if (dst) {
                         if (inet_sk(sk)->rx_dst_ifindex != skb->skb_iif ||
                             dst->ops->check(dst, np->rx_dst_cookie) == NULL) {
+                                RCU_INIT_POINTER(sk->sk_rx_dst, NULL);
                                 dst_release(dst);
-                                sk->sk_rx_dst = NULL;
                         }
                 }
 
@@ -1688,7 +1691,7 @@ static void tcp_v6_early_demux(struct sk_buff *skb)
                 skb->sk = sk;
                 skb->destructor = sock_edemux;
                 if (sk_fullsock(sk)) {
-                        struct dst_entry *dst = READ_ONCE(sk->sk_rx_dst);
+                        struct dst_entry *dst = rcu_dereference(sk->sk_rx_dst);
 
                         if (dst)
                                 dst = dst_check(dst, inet6_sk(sk)->rx_dst_cookie);
diff --git a/net/ipv6/udp.c b/net/ipv6/udp.c
index 3ed5fe0055af..3f66026098cc 100644
--- a/net/ipv6/udp.c
+++ b/net/ipv6/udp.c
@@ -826,7 +826,7 @@ int __udp6_lib_rcv(struct sk_buff *skb, struct udp_table *udptable,
                 struct dst_entry *dst = skb_dst(skb);
                 int ret;
 
-                if (unlikely(rcu_dereference(sk->sk_rx_dst) != dst))
+                if (unlikely(rcu_dereference(sk->sk_rx_dst) != dst))
                         udp6_sk_rx_dst_set(sk, dst);
 
                 if (!uh->check && !udp_sk(sk)->no_check6_rx) {
@@ -938,7 +938,7 @@ static void udp_v6_early_demux(struct sk_buff *skb)
 
         skb->sk = sk;
         skb->destructor = sock_efree;
-        dst = READ_ONCE(sk->sk_rx_dst);
+        dst = rcu_dereference(sk->sk_rx_dst);
 
         if (dst)
                 dst = dst_check(dst, inet6_sk(sk)->rx_dst_cookie);
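The conversion above follows one consistent pattern: writers publish with rcu_assign_pointer(), clear with RCU_INIT_POINTER() or xchg(), lockless readers use rcu_dereference(), and lock-holding paths use rcu_dereference_protected(). Below is a minimal userspace sketch, using C11 atomics as stand-ins, of the ordering each accessor provides; the kernel primitives additionally give sparse/lockdep checking and grace-period semantics that this analogue does not model.

/*
 * Rough userspace analogue (C11 atomics) of the publish/read ordering
 * the RCU accessors used in this patch provide.
 */
#include <stdatomic.h>
#include <stddef.h>

struct dst { int ifindex; };

static _Atomic(struct dst *) rx_dst;	/* models struct dst_entry __rcu * */

/* rcu_assign_pointer(): release store, so the pointee's contents are
 * visible before the pointer itself is. */
static void publish_dst(struct dst *d)
{
	atomic_store_explicit(&rx_dst, d, memory_order_release);
}

/* rcu_dereference(): dependency-ordered load on the reader side. */
static struct dst *read_dst(void)
{
	return atomic_load_explicit(&rx_dst, memory_order_consume);
}

/* xchg(): atomically take ownership of the pointer, so exactly one
 * path releases it, as tcp_disconnect() does in the hunk above. */
static struct dst *steal_dst(void)
{
	return atomic_exchange(&rx_dst, (struct dst *)NULL);
}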
From: Eric Dumazet <edumazet@google.com>

stable inclusion
from stable-v4.19.262
commit f5686a03b138f6330eeda082ee4f96c8109f56f3
category: bugfix
bugzilla: https://gitee.com/openeuler/kernel/issues/I5ZXGL
CVE: NA
--------------------------------
[ Upstream commit 62c07983bef9d3e78e71189441e1a470f0d1e653 ]
Christophe Leroy reported a ~80ms latency spike happening at first TCP connect() time.
This is because __inet_hash_connect() uses get_random_once() to populate a perturbation table which became quite big after commit 4c2c8f03a5ab ("tcp: increase source port perturb table to 2^16")
get_random_once() uses DO_ONCE(), which blocks hard irqs for the duration of the operation.
This patch adds DO_ONCE_SLOW() which uses a mutex instead of a spinlock for operations where we prefer to stay in process context.
Then __inet_hash_connect() can use get_random_slow_once() to populate its perturbation table.
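As a rough userspace analogue of what DO_ONCE_SLOW() provides, pthread_once() has the same shape: the first caller runs the initializer while concurrent callers sleep on a lock until it finishes, with no hard-irq latency involved. A minimal sketch, with rand() merely standing in for get_random_bytes():

#include <pthread.h>
#include <stddef.h>
#include <stdio.h>
#include <stdlib.h>

static pthread_once_t table_once = PTHREAD_ONCE_INIT;
static unsigned int table_perturb[1 << 16];	/* 2^16 entries, as in the commit */

static void fill_table(void)
{
	/* Stand-in for get_random_bytes() on the perturbation table. */
	for (size_t i = 0; i < sizeof(table_perturb) / sizeof(*table_perturb); i++)
		table_perturb[i] = (unsigned int)rand();
}

int main(void)
{
	/* Rough equivalent of get_random_slow_once(table_perturb, ...):
	 * the slow path may block, but never with interrupts disabled. */
	pthread_once(&table_once, fill_table);
	printf("table_perturb[0] = %u\n", table_perturb[0]);
	return 0;
}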
Fixes: 4c2c8f03a5ab ("tcp: increase source port perturb table to 2^16")
Fixes: 190cc82489f4 ("tcp: change source port randomizarion at connect() time")
Reported-by: Christophe Leroy <christophe.leroy@csgroup.eu>
Link: https://lore.kernel.org/netdev/CANn89iLAEYBaoYajy0Y9UmGFff5GPxDUoG-ErVB2jDdR...
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Willy Tarreau <w@1wt.eu>
Tested-by: Christophe Leroy <christophe.leroy@csgroup.eu>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Sasha Levin <sashal@kernel.org>
One conflict occurs because commit 4c2c8f03a5ab ("tcp: increase source port perturb table to 2^16") is integrated while commit e9261476184b ("tcp: dynamically allocate the perturb table used by source port") is not. Another conflict occurs because commit 1027b96ec9d3 ("once: Fix panic when module unload") is not integrated.
Conflicts:
	net/ipv4/inet_hashtables.c
	lib/once.c
Signed-off-by: Liu Jian <liujian56@huawei.com>
Reviewed-by: Yue Haibing <yuehaibing@huawei.com>
Signed-off-by: Yongqiang Liu <liuyongqiang13@huawei.com>
---
 include/linux/once.h       | 28 ++++++++++++++++++++++++++++
 lib/once.c                 | 30 ++++++++++++++++++++++++++++++
 net/ipv4/inet_hashtables.c |  2 +-
 3 files changed, 59 insertions(+), 1 deletion(-)
diff --git a/include/linux/once.h b/include/linux/once.h
index 9225ee6d96c7..8a1e32f75c81 100644
--- a/include/linux/once.h
+++ b/include/linux/once.h
@@ -5,10 +5,18 @@
 #include <linux/types.h>
 #include <linux/jump_label.h>
 
+/* Helpers used from arbitrary contexts.
+ * Hard irqs are blocked, be cautious.
+ */
 bool __do_once_start(bool *done, unsigned long *flags);
 void __do_once_done(bool *done, struct static_key_true *once_key,
                     unsigned long *flags);
 
+/* Variant for process contexts only. */
+bool __do_once_slow_start(bool *done);
+void __do_once_slow_done(bool *done, struct static_key_true *once_key,
+                         struct module *mod);
+
 /* Call a function exactly once. The idea of DO_ONCE() is to perform
  * a function call such as initialization of random seeds, etc, only
  * once, where DO_ONCE() can live in the fast-path. After @func has
@@ -52,9 +60,29 @@ void __do_once_done(bool *done, struct static_key_true *once_key,
                 ___ret;                                                 \
         })
 
+/* Variant of DO_ONCE() for process/sleepable contexts. */
+#define DO_ONCE_SLOW(func, ...)                                         \
+        ({                                                              \
+                bool ___ret = false;                                    \
+                static bool __section(".data.once") ___done = false;    \
+                static DEFINE_STATIC_KEY_TRUE(___once_key);             \
+                if (static_branch_unlikely(&___once_key)) {             \
+                        ___ret = __do_once_slow_start(&___done);        \
+                        if (unlikely(___ret)) {                         \
+                                func(__VA_ARGS__);                      \
+                                __do_once_slow_done(&___done, &___once_key, \
+                                                    THIS_MODULE);       \
+                        }                                               \
+                }                                                       \
+                ___ret;                                                 \
+        })
+
 #define get_random_once(buf, nbytes)                                    \
         DO_ONCE(get_random_bytes, (buf), (nbytes))
 #define get_random_once_wait(buf, nbytes)                               \
         DO_ONCE(get_random_bytes_wait, (buf), (nbytes))                 \
 
+#define get_random_slow_once(buf, nbytes)                               \
+        DO_ONCE_SLOW(get_random_bytes, (buf), (nbytes))
+
 #endif /* _LINUX_ONCE_H */
diff --git a/lib/once.c b/lib/once.c
index 959f8db41ccf..83f41c521e57 100644
--- a/lib/once.c
+++ b/lib/once.c
@@ -78,3 +78,33 @@ void __do_once_done(bool *done, struct static_key_true *once_key,
         once_disable_jump(once_key);
 }
 EXPORT_SYMBOL(__do_once_done);
+
+static DEFINE_MUTEX(once_mutex);
+
+bool __do_once_slow_start(bool *done)
+        __acquires(once_mutex)
+{
+        mutex_lock(&once_mutex);
+        if (*done) {
+                mutex_unlock(&once_mutex);
+                /* Keep sparse happy by restoring an even lock count on
+                 * this mutex. In case we return here, we don't call into
+                 * __do_once_done but return early in the DO_ONCE_SLOW() macro.
+                 */
+                __acquire(once_mutex);
+                return false;
+        }
+
+        return true;
+}
+EXPORT_SYMBOL(__do_once_slow_start);
+
+void __do_once_slow_done(bool *done, struct static_key_true *once_key,
+                         struct module *mod)
+        __releases(once_mutex)
+{
+        *done = true;
+        mutex_unlock(&once_mutex);
+        once_disable_jump(once_key);
+}
+EXPORT_SYMBOL(__do_once_slow_done);
diff --git a/net/ipv4/inet_hashtables.c b/net/ipv4/inet_hashtables.c
index fb53caeee47c..8d3e27fdfc81 100644
--- a/net/ipv4/inet_hashtables.c
+++ b/net/ipv4/inet_hashtables.c
@@ -719,7 +719,7 @@ int __inet_hash_connect(struct inet_timewait_death_row *death_row,
         if (likely(remaining > 1))
                 remaining &= ~1U;
 
-        net_get_random_once(table_perturb, sizeof(table_perturb));
+        get_random_slow_once(table_perturb, sizeof(table_perturb));
         index = hash_32(port_offset, INET_TABLE_PERTURB_SHIFT);
 
         offset = READ_ONCE(table_perturb[index]) + port_offset;
From: Michael Kelley <mikelley@microsoft.com>

stable inclusion
from stable-v4.19.261
commit 5f7fd71e5bebf337769f20dd125822ce63266e4d
category: bugfix
bugzilla: https://gitee.com/openeuler/kernel/issues/I5ZXGL
CVE: NA
--------------------------------
[ Upstream commit c292a337d0e45a292c301e3cd51c35aa0ae91e95 ]
The IOC_PR_CLEAR and IOC_PR_RELEASE ioctls are non-functional on NVMe devices because the nvme_pr_clear() and nvme_pr_release() functions set the IEKEY field incorrectly. The IEKEY field should be set only when the key is zero (i.e., not specified). The current code does it backwards.
Furthermore, the NVMe spec describes the persistent reservation "clear" function as an option on the reservation release command. The current implementation of nvme_pr_clear() erroneously uses the reservation register command.
Fix these errors. Note that NVMe version 1.3 and later specify that setting the IEKEY field will return an error of Invalid Field in Command. The fix will set IEKEY when the key is zero, which is appropriate as these ioctls consider a zero key to be "unspecified", and the intention of the spec change is to require a valid key.
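To make the encoding concrete, here is a small, self-contained sketch of the corrected cdw10 construction, per the NVMe Reservation Release command layout (bits 2:0 select the action, where 1 means Clear; bit 3 is IEKEY; bits 15:8 carry the reservation type). The helper names below are illustrative, not the driver's:

#include <stdint.h>
#include <stdio.h>

/* Illustrative helpers mirroring the fixed logic in nvme_pr_clear()
 * and nvme_pr_release(): IEKEY is set only when no key is supplied. */
static uint32_t pr_clear_cdw10(uint64_t key)
{
	return 1 | (key ? 0 : 1 << 3);		/* action 1 == Clear */
}

static uint32_t pr_release_cdw10(uint8_t rtype, uint64_t key)
{
	return (uint32_t)rtype << 8 | (key ? 0 : 1 << 3);
}

int main(void)
{
	printf("clear, key given:   cdw10=%#x\n", pr_clear_cdw10(0x1234));
	printf("clear, key omitted: cdw10=%#x\n", pr_clear_cdw10(0));
	printf("release, type 1:    cdw10=%#x\n", pr_release_cdw10(1, 0x1234));
	return 0;
}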
Tested on a version 1.4 PCI NVMe device in an Azure VM.
Fixes: 1673f1f08c88 ("nvme: move block_device_operations and ns/ctrl freeing to common code")
Fixes: 1d277a637a71 ("NVMe: Add persistent reservation ops")
Signed-off-by: Michael Kelley <mikelley@microsoft.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Sasha Levin <sashal@kernel.org>

Conflicts:
	drivers/nvme/host/core.c

Signed-off-by: Yongqiang Liu <liuyongqiang13@huawei.com>
Reviewed-by: Jason Yan <yanaijie@huawei.com>
Reviewed-by: Yu Kuai <yukuai3@huawei.com>
Signed-off-by: Yongqiang Liu <liuyongqiang13@huawei.com>
---
 drivers/nvme/host/core.c | 8 +++++---
 1 file changed, 5 insertions(+), 3 deletions(-)
diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
index a8e6488c7c8e..ccfdd4222697 100644
--- a/drivers/nvme/host/core.c
+++ b/drivers/nvme/host/core.c
@@ -1870,13 +1870,15 @@ static int nvme_pr_preempt(struct block_device *bdev, u64 old, u64 new,
 
 static int nvme_pr_clear(struct block_device *bdev, u64 key)
 {
-        u32 cdw10 = 1 | (key ? 1 << 3 : 0);
-        return nvme_pr_command(bdev, cdw10, key, 0, nvme_cmd_resv_register);
+        u32 cdw10 = 1 | (key ? 0 : 1 << 3);
+
+        return nvme_pr_command(bdev, cdw10, key, 0, nvme_cmd_resv_release);
 }
 
 static int nvme_pr_release(struct block_device *bdev, u64 key, enum pr_type type)
 {
-        u32 cdw10 = nvme_pr_type(type) << 8 | (key ? 1 << 3 : 0);
+        u32 cdw10 = nvme_pr_type(type) << 8 | (key ? 0 : 1 << 3);
+
         return nvme_pr_command(bdev, cdw10, key, 0, nvme_cmd_resv_release);
 }
From: Thomas Pfaff <tpfaff@pcs.com>

stable inclusion
from stable-4.19.238
commit c7daf1b4ad809692d5c26f33c02ed8a031066548
category: bugfix
bugzilla: https://gitee.com/openeuler/kernel/issues/I5X41F
CVE: NA
--------------------------------
commit 8707898e22fd665bc1d7b18b809be4b56ce25bdd upstream.
A kernel hang can be observed when running setserial in a loop on a kernel with force threaded interrupts. The sequence of events is:
setserial
  open("/dev/ttyXXX")
    request_irq()

  do_stuff()
   -> serial interrupt
      -> wake(irq_thread)
         desc->threads_active++;

  close()
    free_irq()
      kthread_stop(irq_thread)
      synchronize_irq() <- hangs because desc->threads_active != 0
The thread is created in request_irq() and woken up, but does not get on a CPU to reach the actual thread function, which would handle the pending wake-up. kthread_stop() then sets the should-stop condition, making the thread exit immediately without ever running the thread function, which leaves the stale threads_active count behind.
This problem was introduced with commit 519cc8652b3a, which addressed an interrupt sharing issue in the PCIe code.

Before that commit, free_irq() invoked synchronize_irq(), which waits for the hard interrupt handler and also for associated threads to complete.

To address the PCIe issue, synchronize_irq() was replaced with __synchronize_hardirq(), which only waits for the hard interrupt handler to complete, but not for threaded handlers.

This was done under the assumption that the interrupt thread had already reached the thread function and was waiting for a wake-up, which is guaranteed to be handled before acting on the stop condition. The problematic case, that the thread would not reach the thread function, was obviously overlooked.
Make sure that the interrupt thread is really started and reaches thread_fn() before returning from __setup_irq().
This utilizes the existing wait queue in the interrupt descriptor. The wait queue is unused for non-shared interrupts. For shared interrupts the usage might cause a spurious wake-up of a waiter in synchronize_irq() or the completion of a threaded handler might cause a spurious wake-up of the waiter for the ready flag. Both are harmless and have no functional impact.
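The shape of the fix can be sketched as a runnable userspace analogue, with a condition variable standing in for desc->wait_for_threads and a bool for IRQTF_READY: the creator does not proceed until the thread has demonstrably entered its thread function, so a later stop cannot race with thread startup.

#include <pthread.h>
#include <stdbool.h>
#include <stdio.h>

static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t ready_cond = PTHREAD_COND_INITIALIZER;
static bool thread_ready;		/* plays the role of IRQTF_READY */

static void *irq_thread_fn(void *arg)
{
	/* First thing in the thread function: announce readiness,
	 * like irq_thread_set_ready() does. */
	pthread_mutex_lock(&lock);
	thread_ready = true;
	pthread_cond_signal(&ready_cond);
	pthread_mutex_unlock(&lock);

	/* ... the normal handler loop would run here ... */
	return NULL;
}

int main(void)
{
	pthread_t t;

	pthread_create(&t, NULL, irq_thread_fn, NULL);

	/* Equivalent of wake_up_and_wait_for_irq_thread_ready(). */
	pthread_mutex_lock(&lock);
	while (!thread_ready)
		pthread_cond_wait(&ready_cond, &lock);
	pthread_mutex_unlock(&lock);

	puts("thread reached its thread function");
	pthread_join(t, NULL);
	return 0;
}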
[ tglx: Amended changelog ]
Fixes: 519cc8652b3a ("genirq: Synchronize only with single thread on free_irq()")
Signed-off-by: Thomas Pfaff <tpfaff@pcs.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Marc Zyngier <maz@kernel.org>
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/r/552fe7b4-9224-b183-bb87-a8f36d335690@pcs.com
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Lin Yujun <linyujun809@huawei.com>
Reviewed-by: Zhang Jianhua <chris.zjh@huawei.com>
Signed-off-by: Yongqiang Liu <liuyongqiang13@huawei.com>
---
 kernel/irq/internals.h |  2 ++
 kernel/irq/irqdesc.c   |  2 ++
 kernel/irq/manage.c    | 39 +++++++++++++++++++++++++++++----------
 3 files changed, 33 insertions(+), 10 deletions(-)
diff --git a/kernel/irq/internals.h b/kernel/irq/internals.h
index 871d1c12eae9..f29d6982b3ce 100644
--- a/kernel/irq/internals.h
+++ b/kernel/irq/internals.h
@@ -29,12 +29,14 @@ extern struct irqaction chained_action;
  * IRQTF_WARNED    - warning "IRQ_WAKE_THREAD w/o thread_fn" has been printed
  * IRQTF_AFFINITY  - irq thread is requested to adjust affinity
  * IRQTF_FORCED_THREAD  - irq action is force threaded
+ * IRQTF_READY     - signals that irq thread is ready
  */
 enum {
         IRQTF_RUNTHREAD,
         IRQTF_WARNED,
         IRQTF_AFFINITY,
         IRQTF_FORCED_THREAD,
+        IRQTF_READY,
 };
 
 /*
diff --git a/kernel/irq/irqdesc.c b/kernel/irq/irqdesc.c
index ffdf02b01d81..c0248d4dde19 100644
--- a/kernel/irq/irqdesc.c
+++ b/kernel/irq/irqdesc.c
@@ -404,6 +404,7 @@ static struct irq_desc *alloc_desc(int irq, int node, unsigned int flags,
         lockdep_set_class(&desc->lock, &irq_desc_lock_class);
         mutex_init(&desc->request_mutex);
         init_rcu_head(&desc->rcu);
+        init_waitqueue_head(&desc->wait_for_threads);
 
         desc_set_defaults(irq, desc, node, affinity, owner);
         irqd_set(&desc->irq_data, flags);
@@ -568,6 +569,7 @@ int __init early_irq_init(void)
                 raw_spin_lock_init(&desc[i].lock);
                 lockdep_set_class(&desc[i].lock, &irq_desc_lock_class);
                 mutex_init(&desc[i].request_mutex);
+                init_waitqueue_head(&desc[i].wait_for_threads);
                 desc_set_defaults(i, &desc[i], node, NULL, NULL);
         }
         return arch_early_irq_init();
diff --git a/kernel/irq/manage.c b/kernel/irq/manage.c
index a8c66acee82a..d3ee0b8bea5f 100644
--- a/kernel/irq/manage.c
+++ b/kernel/irq/manage.c
@@ -1096,6 +1096,31 @@ static void irq_wake_secondary(struct irq_desc *desc, struct irqaction *action)
         raw_spin_unlock_irq(&desc->lock);
 }
 
+/*
+ * Internal function to notify that an interrupt thread is ready.
+ */
+static void irq_thread_set_ready(struct irq_desc *desc,
+                                 struct irqaction *action)
+{
+        set_bit(IRQTF_READY, &action->thread_flags);
+        wake_up(&desc->wait_for_threads);
+}
+
+/*
+ * Internal function to wake up an interrupt thread and wait until it is
+ * ready.
+ */
+static void wake_up_and_wait_for_irq_thread_ready(struct irq_desc *desc,
+                                                  struct irqaction *action)
+{
+        if (!action || !action->thread)
+                return;
+
+        wake_up_process(action->thread);
+        wait_event(desc->wait_for_threads,
+                   test_bit(IRQTF_READY, &action->thread_flags));
+}
+
 /*
  * Interrupt handler thread
  */
@@ -1107,6 +1132,8 @@ static int irq_thread(void *data)
         irqreturn_t (*handler_fn)(struct irq_desc *desc,
                         struct irqaction *action);
 
+        irq_thread_set_ready(desc, action);
+
         if (force_irqthreads && test_bit(IRQTF_FORCED_THREAD,
                                         &action->thread_flags))
                 handler_fn = irq_forced_thread_fn;
@@ -1536,8 +1563,6 @@ __setup_irq(unsigned int irq, struct irq_desc *desc, struct irqaction *new)
         }
 
         if (!shared) {
-                init_waitqueue_head(&desc->wait_for_threads);
-
                 /* Setup the type (level, edge polarity) if configured: */
                 if (new->flags & IRQF_TRIGGER_MASK) {
                         ret = __irq_set_trigger(desc,
@@ -1627,14 +1652,8 @@ __setup_irq(unsigned int irq, struct irq_desc *desc, struct irqaction *new)
 
         irq_setup_timings(desc, new);
 
-        /*
-         * Strictly no need to wake it up, but hung_task complains
-         * when no hard interrupt wakes the thread up.
-         */
-        if (new->thread)
-                wake_up_process(new->thread);
-        if (new->secondary)
-                wake_up_process(new->secondary->thread);
+        wake_up_and_wait_for_irq_thread_ready(desc, new);
+        wake_up_and_wait_for_irq_thread_ready(desc, new->secondary);
 
         register_irq_proc(irq, desc);
         new->dir = NULL;
From: Borislav Petkov <bp@suse.de>

stable inclusion
from stable-4.19.238
commit c7daf1b4ad809692d5c26f33c02ed8a031066548
category: bugfix
bugzilla: https://gitee.com/openeuler/kernel/issues/I5X41F
CVE: NA
--------------------------------
commit f9e14dbbd454581061c736bf70bf5cbb15ac927c upstream.
When resuming from system sleep state, restore_processor_state() restores the boot CPU MSRs. These MSRs could be emulated by microcode. If microcode is not loaded yet, writing to emulated MSRs leads to an unchecked MSR access error:
...
PM: Calling lapic_suspend+0x0/0x210
unchecked MSR access error: WRMSR to 0x10f (tried to write 0x0...0) at rIP: ... (native_write_msr)
Call Trace:
<TASK>
 ? restore_processor_state
 x86_acpi_suspend_lowlevel
 acpi_suspend_enter
 suspend_devices_and_enter
 pm_suspend.cold
 state_store
 kobj_attr_store
 sysfs_kf_write
 kernfs_fop_write_iter
 new_sync_write
 vfs_write
 ksys_write
 __x64_sys_write
 do_syscall_64
 entry_SYSCALL_64_after_hwframe
RIP: 0033:0x7fda13c260a7
To ensure microcode emulated MSRs are available for restoration, load the microcode on the boot CPU before restoring these MSRs.
[ Pawan: write commit message and productize it. ]
Fixes: e2a1256b17b1 ("x86/speculation: Restore speculation related MSRs during S3 resume")
Reported-by: Kyle D. Pelton <kyle.d.pelton@intel.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Signed-off-by: Pawan Gupta <pawan.kumar.gupta@linux.intel.com>
Tested-by: Kyle D. Pelton <kyle.d.pelton@intel.com>
Cc: stable@vger.kernel.org
Link: https://bugzilla.kernel.org/show_bug.cgi?id=215841
Link: https://lore.kernel.org/r/4350dfbf785cd482d3fafa72b2b49c83102df3ce.165038631...
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Lin Yujun <linyujun809@huawei.com>
Reviewed-by: Zhang Jianhua <chris.zjh@huawei.com>
Signed-off-by: Yongqiang Liu <liuyongqiang13@huawei.com>
---
 arch/x86/include/asm/microcode.h     | 2 ++
 arch/x86/kernel/cpu/microcode/core.c | 6 +++---
 arch/x86/power/cpu.c                 | 8 ++++++++
 3 files changed, 13 insertions(+), 3 deletions(-)
diff --git a/arch/x86/include/asm/microcode.h b/arch/x86/include/asm/microcode.h
index 2b7cc5397f80..91a06cef50c1 100644
--- a/arch/x86/include/asm/microcode.h
+++ b/arch/x86/include/asm/microcode.h
@@ -133,11 +133,13 @@ extern void load_ucode_ap(void);
 void reload_early_microcode(void);
 extern bool get_builtin_firmware(struct cpio_data *cd, const char *name);
 extern bool initrd_gone;
+void microcode_bsp_resume(void);
 #else
 static inline int __init microcode_init(void)                  { return 0; };
 static inline void __init load_ucode_bsp(void)                 { }
 static inline void load_ucode_ap(void)                         { }
 static inline void reload_early_microcode(void)                { }
+static inline void microcode_bsp_resume(void)                  { }
 static inline bool
 get_builtin_firmware(struct cpio_data *cd, const char *name)   { return false; }
 #endif
diff --git a/arch/x86/kernel/cpu/microcode/core.c b/arch/x86/kernel/cpu/microcode/core.c
index eab4de387ce6..985ef98c8ba2 100644
--- a/arch/x86/kernel/cpu/microcode/core.c
+++ b/arch/x86/kernel/cpu/microcode/core.c
@@ -773,9 +773,9 @@ static struct subsys_interface mc_cpu_interface = {
 };
 
 /**
- * mc_bp_resume - Update boot CPU microcode during resume.
+ * microcode_bsp_resume - Update boot CPU microcode during resume.
  */
-static void mc_bp_resume(void)
+void microcode_bsp_resume(void)
 {
         int cpu = smp_processor_id();
         struct ucode_cpu_info *uci = ucode_cpu_info + cpu;
@@ -787,7 +787,7 @@ static void mc_bp_resume(void)
 }
 
 static struct syscore_ops mc_syscore_ops = {
-        .resume                 = mc_bp_resume,
+        .resume                 = microcode_bsp_resume,
 };
 
 static int mc_cpu_starting(unsigned int cpu)
diff --git a/arch/x86/power/cpu.c b/arch/x86/power/cpu.c
index 3aa3149df07f..94732e42f3ef 100644
--- a/arch/x86/power/cpu.c
+++ b/arch/x86/power/cpu.c
@@ -26,6 +26,7 @@
 #include <asm/cpu.h>
 #include <asm/mmu_context.h>
 #include <asm/cpu_device_id.h>
+#include <asm/microcode.h>
 
 #ifdef CONFIG_X86_32
 __visible unsigned long saved_context_ebx;
@@ -267,6 +268,13 @@ static void notrace __restore_processor_state(struct saved_context *ctxt)
         x86_platform.restore_sched_clock_state();
         mtrr_bp_restore();
         perf_restore_debug_store();
+
+        microcode_bsp_resume();
+
+        /*
+         * This needs to happen after the microcode has been updated upon resume
+         * because some of the MSRs are "emulated" in microcode.
+         */
         msr_restore_context(ctxt);
 }
From: Pawan Gupta <pawan.kumar.gupta@linux.intel.com>

stable inclusion
from stable-4.19.238
commit c7daf1b4ad809692d5c26f33c02ed8a031066548
category: bugfix
bugzilla: https://gitee.com/openeuler/kernel/issues/I5X41F
CVE: NA
--------------------------------
commit e2a1256b17b16f9b9adf1b6fea56819e7b68e463 upstream.
After resuming from suspend-to-RAM, the MSRs that control the CPU's speculative execution behavior are not being restored on the boot CPU.

These MSRs are used to mitigate speculative execution vulnerabilities. Not restoring them correctly may leave the CPU vulnerable. Secondary CPUs' MSRs are correctly restored at S3 resume by identify_secondary_cpu().

During S3 resume, restore these MSRs for the boot CPU when restoring its processor state.
Fixes: 772439717dbf ("x86/bugs/intel: Set proper CPU features and setup RDS")
Reported-by: Neelima Krishnan <neelima.krishnan@intel.com>
Signed-off-by: Pawan Gupta <pawan.kumar.gupta@linux.intel.com>
Tested-by: Neelima Krishnan <neelima.krishnan@intel.com>
Acked-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Dave Hansen <dave.hansen@linux.intel.com>
Cc: stable@vger.kernel.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Lin Yujun <linyujun809@huawei.com>
Reviewed-by: Zhang Jianhua <chris.zjh@huawei.com>
Signed-off-by: Yongqiang Liu <liuyongqiang13@huawei.com>
---
 arch/x86/power/cpu.c | 14 ++++++++++++++
 1 file changed, 14 insertions(+)
diff --git a/arch/x86/power/cpu.c b/arch/x86/power/cpu.c
index 94732e42f3ef..df00425677a5 100644
--- a/arch/x86/power/cpu.c
+++ b/arch/x86/power/cpu.c
@@ -522,10 +522,24 @@ static int pm_cpu_check(const struct x86_cpu_id *c)
         return ret;
 }
 
+static void pm_save_spec_msr(void)
+{
+        u32 spec_msr_id[] = {
+                MSR_IA32_SPEC_CTRL,
+                MSR_IA32_TSX_CTRL,
+                MSR_TSX_FORCE_ABORT,
+                MSR_IA32_MCU_OPT_CTRL,
+                MSR_AMD64_LS_CFG,
+        };
+
+        msr_build_context(spec_msr_id, ARRAY_SIZE(spec_msr_id));
+}
+
 static int pm_check_save_msr(void)
 {
         dmi_check_system(msr_save_dmi_table);
         pm_cpu_check(msr_save_cpu_table);
+        pm_save_spec_msr();
 
         return 0;
 }
From: Pawan Gupta <pawan.kumar.gupta@linux.intel.com>

stable inclusion
from stable-4.19.238
commit c7daf1b4ad809692d5c26f33c02ed8a031066548
category: bugfix
bugzilla: https://gitee.com/openeuler/kernel/issues/I5X41F
CVE: NA
--------------------------------
commit 73924ec4d560257004d5b5116b22a3647661e364 upstream.
The mechanism to save/restore MSRs during S3 suspend/resume checks for MSR validity during suspend, and only restores the MSR if it's a valid MSR. This is not optimal, as an invalid MSR will unnecessarily throw an exception for every suspend cycle. The more invalid MSRs there are, the higher the impact will be.
Check and save the MSR validity at setup. This ensures that only valid MSRs that are guaranteed to not throw an exception will be attempted during suspend.
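A self-contained sketch of the probe-at-setup idea, with a hypothetical rdmsr_probe() standing in for rdmsrl_safe(): validity is recorded once when the context is built, and the per-suspend save path only touches entries already known to be readable, so no exception fires on any suspend cycle.

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

struct saved_msr {
	uint32_t msr_no;
	uint64_t val;
	bool valid;
};

/* Hypothetical stand-in for rdmsrl_safe(); returns 0 on success. */
static int rdmsr_probe(uint32_t msr_no, uint64_t *val)
{
	*val = 0;
	return msr_no == 0xdead ? -1 : 0;	/* pretend 0xdead faults */
}

/* Setup time: probe each MSR exactly once and remember the result. */
static void build_context(struct saved_msr *arr, int n)
{
	for (int i = 0; i < n; i++) {
		uint64_t dummy;

		arr[i].valid = !rdmsr_probe(arr[i].msr_no, &dummy);
	}
}

/* Suspend time: read only entries already known to be valid. */
static void save_context(struct saved_msr *arr, int n)
{
	for (int i = 0; i < n; i++)
		if (arr[i].valid)
			rdmsr_probe(arr[i].msr_no, &arr[i].val);
}

int main(void)
{
	struct saved_msr msrs[] = { { 0x48, 0, false }, { 0xdead, 0, false } };

	build_context(msrs, 2);
	save_context(msrs, 2);
	printf("0x48 valid=%d, 0xdead valid=%d\n", msrs[0].valid, msrs[1].valid);
	return 0;
}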
Fixes: 7a9c2dd08ead ("x86/pm: Introduce quirk framework to save/restore extra MSR registers around suspend/resume")
Suggested-by: Dave Hansen <dave.hansen@linux.intel.com>
Signed-off-by: Pawan Gupta <pawan.kumar.gupta@linux.intel.com>
Reviewed-by: Dave Hansen <dave.hansen@linux.intel.com>
Acked-by: Borislav Petkov <bp@suse.de>
Cc: stable@vger.kernel.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Lin Yujun <linyujun809@huawei.com>
Reviewed-by: Zhang Jianhua <chris.zjh@huawei.com>
Signed-off-by: Yongqiang Liu <liuyongqiang13@huawei.com>
---
 arch/x86/power/cpu.c | 7 +++++--
 1 file changed, 5 insertions(+), 2 deletions(-)
diff --git a/arch/x86/power/cpu.c b/arch/x86/power/cpu.c
index df00425677a5..794824948263 100644
--- a/arch/x86/power/cpu.c
+++ b/arch/x86/power/cpu.c
@@ -42,7 +42,8 @@ static void msr_save_context(struct saved_context *ctxt)
         struct saved_msr *end = msr + ctxt->saved_msrs.num;
 
         while (msr < end) {
-                msr->valid = !rdmsrl_safe(msr->info.msr_no, &msr->info.reg.q);
+                if (msr->valid)
+                        rdmsrl(msr->info.msr_no, msr->info.reg.q);
                 msr++;
         }
 }
@@ -434,8 +435,10 @@ static int msr_build_context(const u32 *msr_id, const int num)
         }
 
         for (i = saved_msrs->num, j = 0; i < total_num; i++, j++) {
+                u64 dummy;
+
                 msr_array[i].info.msr_no = msr_id[j];
-                msr_array[i].valid = false;
+                msr_array[i].valid = !rdmsrl_safe(msr_id[j], &dummy);
                 msr_array[i].info.reg.q = 0;
         }
         saved_msrs->num = total_num;
From: Randy Dunlap <rdunlap@infradead.org>

stable inclusion
from stable-4.19.238
commit c7daf1b4ad809692d5c26f33c02ed8a031066548
category: bugfix
bugzilla: https://gitee.com/openeuler/kernel/issues/I5X41F
CVE: NA
--------------------------------
[ Upstream commit f9a40b0890658330c83c95511f9d6b396610defc ]
initcall_blacklist() should return 1 to indicate that it handled its cmdline arguments.
set_debug_rodata() should return 1 to indicate that it handled its cmdline arguments. Print a warning if the option string is invalid.
This prevents these strings from being added to the 'init' program's environment as they are not init arguments/parameters.
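For context, this is the __setup() convention the patch enforces, sketched with an illustrative (not real) boot parameter: returning 1 tells the boot command line parser that the option was consumed, so it is not forwarded to init as an argument or environment variable.

#include <linux/init.h>
#include <linux/printk.h>
#include <linux/string.h>

static bool example_flag __initdata;

/* Hypothetical handler for an "example=" boot parameter. */
static int __init example_setup(char *str)
{
	if (strtobool(str, &example_flag))
		pr_warn("Invalid option string for example=: '%s'\n", str);
	return 1;	/* handled: don't leak "example=..." into init's environment */
}
__setup("example=", example_setup);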
Link: https://lkml.kernel.org/r/20220221050901.23985-1-rdunlap@infradead.org
Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
Reported-by: Igor Zhbanov <i.zhbanov@omprussia.ru>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
Signed-off-by: Lin Yujun <linyujun809@huawei.com>
Reviewed-by: Zhang Jianhua <chris.zjh@huawei.com>
Signed-off-by: Yongqiang Liu <liuyongqiang13@huawei.com>
---
 init/main.c | 6 ++++--
 1 file changed, 4 insertions(+), 2 deletions(-)
diff --git a/init/main.c b/init/main.c
index c149972b46ad..50af60ff0ef6 100644
--- a/init/main.c
+++ b/init/main.c
@@ -774,7 +774,7 @@ static int __init initcall_blacklist(char *str)
                 }
         } while (str_entry);
 
-        return 0;
+        return 1;
 }
 
 static bool __init_or_module initcall_blacklisted(initcall_t fn)
@@ -1027,7 +1027,9 @@ static noinline void __init kernel_init_freeable(void);
 bool rodata_enabled __ro_after_init = true;
 static int __init set_debug_rodata(char *str)
 {
-        return strtobool(str, &rodata_enabled);
+        if (strtobool(str, &rodata_enabled))
+                pr_warn("Invalid option string for rodata: '%s'\n", str);
+        return 1;
 }
 __setup("rodata=", set_debug_rodata);
 #endif