Fixing CVE-2024-35931
Asad Kamal (1):
  drm/amdgpu : Add hive ras recovery check

Stanley.Yang (1):
  drm/amdgpu: Skip do PCI error slot reset during RAS recovery
 drivers/gpu/drm/amd/amdgpu/amdgpu_device.c           | 14 ++++++++++++++
 drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c              |  9 +++++++--
 drivers/gpu/drm/amd/amdgpu/amdgpu_xgmi.h             |  1 +
 .../gpu/drm/amd/pm/swsmu/smu13/smu_v13_0_6_ppt.c     | 10 +++++++++-
 4 files changed, 31 insertions(+), 3 deletions(-)
From: Asad Kamal <asad.kamal@amd.com>
mainline inclusion
from mainline-v6.9-rc1
commit 601429cca96b4af3be44172c3b64e4228515dbe1
category: bugfix
bugzilla: 189933
CVE: CVE-2024-35931
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?id=...
--------------------------------
If one of the devices in the hive detects a fatal error, a RAS recovery reset message needs to be sent to the PMFW of all devices in the hive. For that, add a flag in the hive to indicate that it is undergoing RAS recovery.
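In code terms, the recovery worker brackets the whole hive recovery with the new flag, so reset paths on any device in the hive can see that a hive-wide RAS recovery is in progress. A minimal sketch of that lifecycle, condensed from the diff below (error-counter harvesting and the actual reset are elided):

	struct amdgpu_hive_info *hive = amdgpu_get_xgmi_hive(adev);

	if (hive)
		atomic_set(&hive->ras_recovery, 1);	/* hive-wide recovery in progress */

	/* ... harvest RAS error counters and run the hive-wide mode-1 reset ... */

	atomic_set(&ras->in_recovery, 0);
	if (hive) {
		atomic_set(&hive->ras_recovery, 0);	/* recovery finished */
		amdgpu_put_xgmi_hive(hive);		/* drop the reference taken above */
	}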
Signed-off-by: Asad Kamal <asad.kamal@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Reviewed-by: Lijo Lazar <lijo.lazar@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

Conflicts:
	drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
	drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
	drivers/gpu/drm/amd/amdgpu/amdgpu_xgmi.h
	drivers/gpu/drm/amd/pm/swsmu/smu13/smu_v13_0_6_ppt.c
[adjust context conflicts]
Signed-off-by: Liu Chuang <liuchuang40@huawei.com>
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c              |  9 +++++++--
 drivers/gpu/drm/amd/amdgpu/amdgpu_xgmi.h             |  1 +
 drivers/gpu/drm/amd/pm/swsmu/smu13/smu_v13_0_6_ppt.c | 10 +++++++++-
 3 files changed, 17 insertions(+), 3 deletions(-)
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
index 67b75ff0f7c3..0ba8e069f2b8 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
@@ -2027,9 +2027,11 @@ static void amdgpu_ras_do_recovery(struct work_struct *work)
 	struct amdgpu_device *remote_adev = NULL;
 	struct amdgpu_device *adev = ras->adev;
 	struct list_head device_list, *device_list_handle = NULL;
+	struct amdgpu_hive_info *hive = amdgpu_get_xgmi_hive(adev);
 
+	if (hive)
+		atomic_set(&hive->ras_recovery, 1);
 	if (!ras->disable_ras_err_cnt_harvest) {
-		struct amdgpu_hive_info *hive = amdgpu_get_xgmi_hive(adev);
 
 		/* Build list of devices to query RAS related errors */
 		if (hive && adev->gmc.xgmi.num_physical_nodes > 1) {
@@ -2046,7 +2048,6 @@ static void amdgpu_ras_do_recovery(struct work_struct *work)
 			amdgpu_ras_log_on_err_counter(remote_adev);
 		}
 
-		amdgpu_put_xgmi_hive(hive);
 	}
 
 	if (amdgpu_device_should_recover_gpu(ras->adev)) {
@@ -2081,6 +2082,10 @@ static void amdgpu_ras_do_recovery(struct work_struct *work)
 		amdgpu_device_gpu_recover(ras->adev, NULL, &reset_context);
 	}
 	atomic_set(&ras->in_recovery, 0);
+	if (hive) {
+		atomic_set(&hive->ras_recovery, 0);
+		amdgpu_put_xgmi_hive(hive);
+	}
 }
 
 /* alloc/realloc bps array */
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_xgmi.h b/drivers/gpu/drm/amd/amdgpu/amdgpu_xgmi.h
index 86fbf56938f4..6cab882e8061 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_xgmi.h
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_xgmi.h
@@ -44,6 +44,7 @@ struct amdgpu_hive_info {
 
 	struct amdgpu_reset_domain *reset_domain;
 	uint32_t device_remove_count;
+	atomic_t ras_recovery;
 };
 
 struct amdgpu_pcs_ras_field {
diff --git a/drivers/gpu/drm/amd/pm/swsmu/smu13/smu_v13_0_6_ppt.c b/drivers/gpu/drm/amd/pm/swsmu/smu13/smu_v13_0_6_ppt.c
index 6a28f8d5bff7..8f7c58206856 100644
--- a/drivers/gpu/drm/amd/pm/swsmu/smu13/smu_v13_0_6_ppt.c
+++ b/drivers/gpu/drm/amd/pm/swsmu/smu13/smu_v13_0_6_ppt.c
@@ -2136,16 +2136,24 @@ static int smu_v13_0_6_get_thermal_temperature_range(struct smu_context *smu,
 static int smu_v13_0_6_mode1_reset(struct smu_context *smu)
 {
 	struct amdgpu_device *adev = smu->adev;
+	struct amdgpu_hive_info *hive = NULL;
+	u32 hive_ras_recovery = 0;
 	struct amdgpu_ras *ras;
 	u32 fatal_err, param;
 	int ret = 0;
 
+	hive = amdgpu_get_xgmi_hive(adev);
 	ras = amdgpu_ras_get_context(adev);
 	fatal_err = 0;
 	param = SMU_RESET_MODE_1;
 
+	if (hive) {
+		hive_ras_recovery = atomic_read(&hive->ras_recovery);
+		amdgpu_put_xgmi_hive(hive);
+	}
+
 	/* fatal error triggered by ras, PMFW supports the flag */
-	if (ras && atomic_read(&ras->in_recovery))
+	if (ras && (atomic_read(&ras->in_recovery) || hive_ras_recovery))
 		fatal_err = 1;
 
 	param |= (fatal_err << 16);
From: "Stanley.Yang" Stanley.Yang@amd.com
mainline inclusion
from mainline-v6.9-rc1
commit 601429cca96b4af3be44172c3b64e4228515dbe1
category: bugfix
bugzilla: https://gitee.com/src-openeuler/kernel/issues/I9QGL7
CVE: CVE-2024-35931
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?id=...
--------------------------------
Why:
    The PCI error slot reset may be triggered after injecting a UE to the UMC
    multiple times, which caused a system hang.

[ 557.371857] amdgpu 0000:af:00.0: amdgpu: GPU reset succeeded, trying to resume
[ 557.373718] [drm] PCIE GART of 512M enabled.
[ 557.373722] [drm] PTB located at 0x0000031FED700000
[ 557.373788] [drm] VRAM is lost due to GPU reset!
[ 557.373789] [drm] PSP is resuming...
[ 557.547012] mlx5_core 0000:55:00.0: mlx5_pci_err_detected Device state = 1 pci_status: 0. Exit, result = 3, need reset
[ 557.547067] [drm] PCI error: detected callback, state(1)!!
[ 557.547069] [drm] No support for XGMI hive yet...
[ 557.548125] mlx5_core 0000:55:00.0: mlx5_pci_slot_reset Device state = 1 pci_status: 0. Enter
[ 557.607763] mlx5_core 0000:55:00.0: wait vital counter value 0x16b5b after 1 iterations
[ 557.607777] mlx5_core 0000:55:00.0: mlx5_pci_slot_reset Device state = 1 pci_status: 1. Exit, err = 0, result = 5, recovered
[ 557.610492] [drm] PCI error: slot reset callback!!
...
[ 560.689382] amdgpu 0000:3f:00.0: amdgpu: GPU reset(2) succeeded!
[ 560.689546] amdgpu 0000:5a:00.0: amdgpu: GPU reset(2) succeeded!
[ 560.689562] general protection fault, probably for non-canonical address 0x5f080b54534f611f: 0000 [#1] SMP NOPTI
[ 560.701008] CPU: 16 PID: 2361 Comm: kworker/u448:9 Tainted: G OE 5.15.0-91-generic #101-Ubuntu
[ 560.712057] Hardware name: Microsoft C278A/C278A, BIOS C2789.5.BS.1C11.AG.1 11/08/2023
[ 560.720959] Workqueue: amdgpu-reset-hive amdgpu_ras_do_recovery [amdgpu]
[ 560.728887] RIP: 0010:amdgpu_device_gpu_recover.cold+0xbf1/0xcf5 [amdgpu]
[ 560.736891] Code: ff 41 89 c6 e9 1b ff ff ff 44 0f b6 45 b0 e9 4f ff ff ff be 01 00 00 00 4c 89 e7 e8 76 c9 8b ff 44 0f b6 45 b0 e9 3c fd ff ff <48> 83 ba 18 02 00 00 00 0f 84 6a f8 ff ff 48 8d 7a 78 be 01 00 00
[ 560.757967] RSP: 0018:ffa0000032e53d80 EFLAGS: 00010202
[ 560.763848] RAX: ffa00000001dfd10 RBX: ffa0000000197090 RCX: ffa0000032e53db0
[ 560.771856] RDX: 5f080b54534f5f07 RSI: 0000000000000000 RDI: ff11000128100010
[ 560.779867] RBP: ffa0000032e53df0 R08: 0000000000000000 R09: ffffffffffe77f08
[ 560.787879] R10: 0000000000ffff0a R11: 0000000000000001 R12: 0000000000000000
[ 560.795889] R13: ffa0000032e53e00 R14: 0000000000000000 R15: 0000000000000000
[ 560.803889] FS: 0000000000000000(0000) GS:ff11007e7e800000(0000) knlGS:0000000000000000
[ 560.812973] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 560.819422] CR2: 000055a04c118e68 CR3: 0000000007410005 CR4: 0000000000771ee0
[ 560.827433] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 560.835433] DR3: 0000000000000000 DR6: 00000000fffe07f0 DR7: 0000000000000400
[ 560.843444] PKRU: 55555554
[ 560.846480] Call Trace:
[ 560.849225]  <TASK>
[ 560.851580]  ? show_trace_log_lvl+0x1d6/0x2ea
[ 560.856488]  ? show_trace_log_lvl+0x1d6/0x2ea
[ 560.861379]  ? amdgpu_ras_do_recovery+0x1b2/0x210 [amdgpu]
[ 560.867778]  ? show_regs.part.0+0x23/0x29
[ 560.872293]  ? __die_body.cold+0x8/0xd
[ 560.876502]  ? die_addr+0x3e/0x60
[ 560.880238]  ? exc_general_protection+0x1c5/0x410
[ 560.885532]  ? asm_exc_general_protection+0x27/0x30
[ 560.891025]  ? amdgpu_device_gpu_recover.cold+0xbf1/0xcf5 [amdgpu]
[ 560.898323]  amdgpu_ras_do_recovery+0x1b2/0x210 [amdgpu]
[ 560.904520]  process_one_work+0x228/0x3d0

How:
    In RAS recovery, a mode-1 reset is issued from the RAS fatal error
    handling path and is expected to reset all the nodes in the hive, so
    there is no need to issue another mode-1 reset from the PCI error slot
    reset callback during this procedure.
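In code terms, a reset path that must defer to RAS recovery only needs to read the new hive flag (together with the per-device in_recovery flag) and bail out. A minimal sketch of that check, condensed from the guard the diff below adds to amdgpu_pci_slot_reset() (the actual diff additionally restricts it to GC 9.4.3 via an IP-version check):

	struct amdgpu_hive_info *hive;
	int hive_ras_recovery = 0;
	struct amdgpu_ras *ras = amdgpu_ras_get_context(adev);

	hive = amdgpu_get_xgmi_hive(adev);
	if (hive) {
		hive_ras_recovery = atomic_read(&hive->ras_recovery);
		amdgpu_put_xgmi_hive(hive);	/* drop the reference from amdgpu_get_xgmi_hive() */
	}

	/* RAS recovery already issues a hive-wide mode-1 reset; report the
	 * slot as recovered instead of triggering another reset here. */
	if (ras && (atomic_read(&ras->in_recovery) || hive_ras_recovery))
		return PCI_ERS_RESULT_RECOVERED;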
Signed-off-by: Stanley.Yang <Stanley.Yang@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

Conflicts:
	drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
[Due to not having merged commit 4e8303cf2c4d ("drm/amdgpu: Use function for IP version check")]
Signed-off-by: Liu Chuang <liuchuang40@huawei.com>
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 14 ++++++++++++++
 1 file changed, 14 insertions(+)
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
index 062d78818da1..94139ddce2bb 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
@@ -5742,6 +5742,20 @@ pci_ers_result_t amdgpu_pci_slot_reset(struct pci_dev *pdev)
 	struct amdgpu_reset_context reset_context;
 	u32 memsize;
 	struct list_head device_list;
+	struct amdgpu_hive_info *hive;
+	int hive_ras_recovery = 0;
+	struct amdgpu_ras *ras;
+
+	/* PCI error slot reset should be skipped During RAS recovery */
+	hive = amdgpu_get_xgmi_hive(adev);
+	if (hive) {
+		hive_ras_recovery = atomic_read(&hive->ras_recovery);
+		amdgpu_put_xgmi_hive(hive);
+	}
+	ras = amdgpu_ras_get_context(adev);
+	if ((adev->ip_versions[GC_HWIP][0] == IP_VERSION(9, 4, 3)) &&
+	    ras && (atomic_read(&ras->in_recovery) || hive_ras_recovery))
+		return PCI_ERS_RESULT_RECOVERED;
 
 	DRM_INFO("PCI error: slot reset callback!!\n");
Feedback: The patch(es) you sent to the kernel@openeuler.org mailing list have been converted to a pull request successfully!
Pull request link: https://gitee.com/openeuler/kernel/pulls/10216
Mailing list address: https://mailweb.openeuler.org/hyperkitty/list/kernel@openeuler.org/message/I...