From: Larysa Zaremba larysa.zaremba@intel.com
mainline inclusion from mainline-v6.11-rc7 commit 2504b8405768a57a71e660dbfd5abd59f679a03f category: bugfix bugzilla: https://gitee.com/src-openeuler/kernel/issues/IARWO7 CVE: CVE-2024-46765
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i...
--------------------------------
The main threat to data consistency in ice_xdp() is a possible asynchronous PF reset. It can be triggered by a user or by TX timeout handler.
XDP setup and PF reset code access the same resources in the following sections: * ice_vsi_close() in ice_prepare_for_reset() - already rtnl-locked * ice_vsi_rebuild() for the PF VSI - not protected * ice_vsi_open() - already rtnl-locked
With an unfortunate timing, such accesses can result in a crash such as the one below:
[ +1.999878] ice 0000:b1:00.0: Registered XDP mem model MEM_TYPE_XSK_BUFF_POOL on Rx ring 14 [ +2.002992] ice 0000:b1:00.0: Registered XDP mem model MEM_TYPE_XSK_BUFF_POOL on Rx ring 18 [Mar15 18:17] ice 0000:b1:00.0 ens801f0np0: NETDEV WATCHDOG: CPU: 38: transmit queue 14 timed out 80692736 ms [ +0.000093] ice 0000:b1:00.0 ens801f0np0: tx_timeout: VSI_num: 6, Q 14, NTC: 0x0, HW_HEAD: 0x0, NTU: 0x0, INT: 0x4000001 [ +0.000012] ice 0000:b1:00.0 ens801f0np0: tx_timeout recovery level 1, txqueue 14 [ +0.394718] ice 0000:b1:00.0: PTP reset successful [ +0.006184] BUG: kernel NULL pointer dereference, address: 0000000000000098 [ +0.000045] #PF: supervisor read access in kernel mode [ +0.000023] #PF: error_code(0x0000) - not-present page [ +0.000023] PGD 0 P4D 0 [ +0.000018] Oops: 0000 [#1] PREEMPT SMP NOPTI [ +0.000023] CPU: 38 PID: 7540 Comm: kworker/38:1 Not tainted 6.8.0-rc7 #1 [ +0.000031] Hardware name: Intel Corporation S2600WFT/S2600WFT, BIOS SE5C620.86B.02.01.0014.082620210524 08/26/2021 [ +0.000036] Workqueue: ice ice_service_task [ice] [ +0.000183] RIP: 0010:ice_clean_tx_ring+0xa/0xd0 [ice] [...] [ +0.000013] Call Trace: [ +0.000016] <TASK> [ +0.000014] ? __die+0x1f/0x70 [ +0.000029] ? page_fault_oops+0x171/0x4f0 [ +0.000029] ? schedule+0x3b/0xd0 [ +0.000027] ? exc_page_fault+0x7b/0x180 [ +0.000022] ? asm_exc_page_fault+0x22/0x30 [ +0.000031] ? ice_clean_tx_ring+0xa/0xd0 [ice] [ +0.000194] ice_free_tx_ring+0xe/0x60 [ice] [ +0.000186] ice_destroy_xdp_rings+0x157/0x310 [ice] [ +0.000151] ice_vsi_decfg+0x53/0xe0 [ice] [ +0.000180] ice_vsi_rebuild+0x239/0x540 [ice] [ +0.000186] ice_vsi_rebuild_by_type+0x76/0x180 [ice] [ +0.000145] ice_rebuild+0x18c/0x840 [ice] [ +0.000145] ? delay_tsc+0x4a/0xc0 [ +0.000022] ? delay_tsc+0x92/0xc0 [ +0.000020] ice_do_reset+0x140/0x180 [ice] [ +0.000886] ice_service_task+0x404/0x1030 [ice] [ +0.000824] process_one_work+0x171/0x340 [ +0.000685] worker_thread+0x277/0x3a0 [ +0.000675] ? preempt_count_add+0x6a/0xa0 [ +0.000677] ? _raw_spin_lock_irqsave+0x23/0x50 [ +0.000679] ? __pfx_worker_thread+0x10/0x10 [ +0.000653] kthread+0xf0/0x120 [ +0.000635] ? __pfx_kthread+0x10/0x10 [ +0.000616] ret_from_fork+0x2d/0x50 [ +0.000612] ? __pfx_kthread+0x10/0x10 [ +0.000604] ret_from_fork_asm+0x1b/0x30 [ +0.000604] </TASK>
The previous way of handling this through returning -EBUSY is not viable, particularly when destroying AF_XDP socket, because the kernel proceeds with removal anyway.
There is plenty of code between those calls and there is no need to create a large critical section that covers all of them, same as there is no need to protect ice_vsi_rebuild() with rtnl_lock().
Add xdp_state_lock mutex to protect ice_vsi_rebuild() and ice_xdp().
Leaving unprotected sections in between would result in two states that have to be considered: 1. when the VSI is closed, but not yet rebuild 2. when VSI is already rebuild, but not yet open
The latter case is actually already handled through !netif_running() case, we just need to adjust flag checking a little. The former one is not as trivial, because between ice_vsi_close() and ice_vsi_rebuild(), a lot of hardware interaction happens, this can make adding/deleting rings exit with an error. Luckily, VSI rebuild is pending and can apply new configuration for us in a managed fashion.
Therefore, add an additional VSI state flag ICE_VSI_REBUILD_PENDING to indicate that ice_xdp() can just hot-swap the program.
Also, as ice_vsi_rebuild() flow is touched in this patch, make it more consistent by deconfiguring VSI when coalesce allocation fails.
Fixes: 2d4238f55697 ("ice: Add support for AF_XDP") Fixes: efc2214b6047 ("ice: Add support for XDP") Reviewed-by: Wojciech Drewek wojciech.drewek@intel.com Reviewed-by: Jacob Keller jacob.e.keller@intel.com Tested-by: Chandan Kumar Rout chandanx.rout@intel.com Signed-off-by: Larysa Zaremba larysa.zaremba@intel.com Reviewed-by: Maciej Fijalkowski maciej.fijalkowski@intel.com Signed-off-by: Tony Nguyen anthony.l.nguyen@intel.com Conflicts: drivers/net/ethernet/intel/ice/ice.h drivers/net/ethernet/intel/ice/ice_lib.c drivers/net/ethernet/intel/ice/ice_xsk.c [conflicts due to not mergered 1a1c40df2e80 ("ice: set and release switchdev environment"), conflicts due to not mergered b3056ae2b578 ("ice: xsk: drop power of 2 ring size restriction for AF_XDP"), conflicts due to not mergered a696d61528f0 ("ice: stop hard coding the ICE_VSI_CTRL location"), conflicts due to not mergered 6624e780a577 ("ice: split ice_vsi_setup into smaller functions"), conflicts due to not mergered 0db66d20f9cf ("ice: cleanup in VSI config/deconfig code")] Signed-off-by: Wang Liang wangliang74@huawei.com --- drivers/net/ethernet/intel/ice/ice.h | 2 ++ drivers/net/ethernet/intel/ice/ice_lib.c | 24 ++++++++++++++++++----- drivers/net/ethernet/intel/ice/ice_main.c | 19 +++++++++++++----- drivers/net/ethernet/intel/ice/ice_xsk.c | 3 ++- 4 files changed, 37 insertions(+), 11 deletions(-)
diff --git a/drivers/net/ethernet/intel/ice/ice.h b/drivers/net/ethernet/intel/ice/ice.h index a4c8ba4e7791..d4ef12f506b9 100644 --- a/drivers/net/ethernet/intel/ice/ice.h +++ b/drivers/net/ethernet/intel/ice/ice.h @@ -236,6 +236,7 @@ enum ice_vsi_state { ICE_VSI_MMAC_FLTR_CHANGED, ICE_VSI_VLAN_FLTR_CHANGED, ICE_VSI_PROMISC_CHANGED, + ICE_VSI_REBUILD_PENDING, ICE_VSI_STATE_NBITS /* must be last */ };
@@ -326,6 +327,7 @@ struct ice_vsi { struct ice_ring **xdp_rings; /* XDP ring array */ u16 num_xdp_txq; /* Used XDP queues */ u8 xdp_mapping_mode; /* ICE_MAP_MODE_[CONTIG|SCATTER] */ + struct mutex xdp_state_lock; struct xsk_buff_pool **xsk_pools; u16 num_xsk_pools_used; u16 num_xsk_pools; diff --git a/drivers/net/ethernet/intel/ice/ice_lib.c b/drivers/net/ethernet/intel/ice/ice_lib.c index 575f1756a662..b7779e1dbd21 100644 --- a/drivers/net/ethernet/intel/ice/ice_lib.c +++ b/drivers/net/ethernet/intel/ice/ice_lib.c @@ -352,6 +352,7 @@ static int ice_vsi_clear(struct ice_vsi *vsi) pf->next_vsi = vsi->idx;
ice_vsi_free_arrays(vsi); + mutex_destroy(&vsi->xdp_state_lock); mutex_unlock(&pf->sw_mutex); devm_kfree(dev, vsi);
@@ -475,6 +476,9 @@ ice_vsi_alloc(struct ice_pf *pf, enum ice_vsi_type vsi_type, u16 vf_id) pf->next_vsi = ice_get_free_slot(pf->vsi, pf->num_alloc_vsi, pf->next_vsi); } + + mutex_init(&vsi->xdp_state_lock); + goto unlock_pf;
err_rings: @@ -2881,10 +2885,14 @@ int ice_vsi_rebuild(struct ice_vsi *vsi, bool init_vsi) if (vsi->type == ICE_VSI_VF) vf = &pf->vf[vsi->vf_id];
+ mutex_lock(&vsi->xdp_state_lock); + coalesce = kcalloc(vsi->num_q_vectors, sizeof(struct ice_coalesce_stored), GFP_KERNEL); - if (!coalesce) - return -ENOMEM; + if (!coalesce) { + ret = -ENOMEM; + goto unlock; + }
prev_num_q_vectors = ice_vsi_rebuild_get_coalesce(vsi, coalesce);
@@ -3000,13 +3008,16 @@ int ice_vsi_rebuild(struct ice_vsi *vsi, bool init_vsi) ret = -EIO; goto err_vectors; } else { - return ice_schedule_reset(pf, ICE_RESET_PFR); + ret = ice_schedule_reset(pf, ICE_RESET_PFR); + if (ret) + goto err_vectors; + goto free_coalesce; } } ice_vsi_rebuild_set_coalesce(vsi, coalesce, prev_num_q_vectors); - kfree(coalesce); + clear_bit(ICE_VSI_REBUILD_PENDING, vsi->state);
- return 0; + goto free_coalesce;
err_vectors: ice_vsi_free_q_vectors(vsi); @@ -3020,7 +3031,10 @@ int ice_vsi_rebuild(struct ice_vsi *vsi, bool init_vsi) err_vsi: ice_vsi_clear(vsi); set_bit(__ICE_RESET_FAILED, pf->state); +free_coalesce: kfree(coalesce); +unlock: + mutex_unlock(&vsi->xdp_state_lock); return ret; }
diff --git a/drivers/net/ethernet/intel/ice/ice_main.c b/drivers/net/ethernet/intel/ice/ice_main.c index 26daf2c474bb..30da82638c90 100644 --- a/drivers/net/ethernet/intel/ice/ice_main.c +++ b/drivers/net/ethernet/intel/ice/ice_main.c @@ -476,6 +476,7 @@ ice_prepare_for_reset(struct ice_pf *pf) /* clear SW filtering DB */ ice_clear_hw_tbls(hw); /* disable the VSIs and their queues that are not already DOWN */ + set_bit(ICE_VSI_REBUILD_PENDING, ice_get_main_vsi(pf)->state); ice_pf_dis_all_vsi(pf, false);
if (hw->port_info) @@ -2528,7 +2529,8 @@ ice_xdp_setup_prog(struct ice_vsi *vsi, struct bpf_prog *prog, }
/* hot swap progs and avoid toggling link */ - if (ice_is_xdp_ena_vsi(vsi) == !!prog) { + if (ice_is_xdp_ena_vsi(vsi) == !!prog || + test_bit(ICE_VSI_REBUILD_PENDING, vsi->state)) { ice_vsi_assign_bpf_prog(vsi, prog); return 0; } @@ -2593,21 +2595,28 @@ static int ice_xdp(struct net_device *dev, struct netdev_bpf *xdp) { struct ice_netdev_priv *np = netdev_priv(dev); struct ice_vsi *vsi = np->vsi; + int ret;
if (vsi->type != ICE_VSI_PF) { NL_SET_ERR_MSG_MOD(xdp->extack, "XDP can be loaded only on PF VSI"); return -EINVAL; }
+ mutex_lock(&vsi->xdp_state_lock); + switch (xdp->command) { case XDP_SETUP_PROG: - return ice_xdp_setup_prog(vsi, xdp->prog, xdp->extack); + ret = ice_xdp_setup_prog(vsi, xdp->prog, xdp->extack); + break; case XDP_SETUP_XSK_POOL: - return ice_xsk_pool_setup(vsi, xdp->xsk.pool, - xdp->xsk.queue_id); + ret = ice_xsk_pool_setup(vsi, xdp->xsk.pool, xdp->xsk.queue_id); + break; default: - return -EINVAL; + ret = -EINVAL; } + + mutex_unlock(&vsi->xdp_state_lock); + return ret; }
/** diff --git a/drivers/net/ethernet/intel/ice/ice_xsk.c b/drivers/net/ethernet/intel/ice/ice_xsk.c index 59963b901be0..a5d40d5d4ee2 100644 --- a/drivers/net/ethernet/intel/ice/ice_xsk.c +++ b/drivers/net/ethernet/intel/ice/ice_xsk.c @@ -384,7 +384,8 @@ int ice_xsk_pool_setup(struct ice_vsi *vsi, struct xsk_buff_pool *pool, u16 qid) goto failure; }
- if_running = netif_running(vsi->netdev) && ice_is_xdp_ena_vsi(vsi); + if_running = !test_bit(ICE_VSI_DOWN, vsi->state) && + ice_is_xdp_ena_vsi(vsi);
if (if_running) { ret = ice_qp_dis(vsi, qid);