From: xiabing xiabing12@h-partners.com
1. In the NCQ scenario, if multiple I/Os are delivered and one of the I/Os is faulty, a group of slow disks will always occur. 2. During I/O running, a null pointer exception occurs during the pressure test when a disk is removed. 3. When the length of the DMA Setup frame returned by the disk is abnormal, a group of slow disks are triggered.
John Garry (2): scsi: libsas: Add sas_ata_device_link_abort() scsi: libsas: Update SATA dev FIS in sas_ata_task_done()
Xingui Yang (6): scsi: hisi_sas: Move slot variable definition in hisi_sas_abort_task() scsi: hisi_sas: Add SATA_DISK_ERR bit handling for v3 hw scsi: hisi_sas: Use abort task set to reset SAS disks when discovered scsi: libsas: Grab the ATA port lock in sas_ata_device_link_abort() {topost} scsi: hisi_sas: Handle NCQ error when IPTT is valid {topost} scsi: hisi_sas: Grab sas_dev lock when traversing the members of sas_dev.list
drivers/scsi/hisi_sas/hisi_sas.h | 4 +- drivers/scsi/hisi_sas/hisi_sas_main.c | 52 ++++++++++++++++------- drivers/scsi/hisi_sas/hisi_sas_v1_hw.c | 8 +++- drivers/scsi/hisi_sas/hisi_sas_v2_hw.c | 8 +++- drivers/scsi/hisi_sas/hisi_sas_v3_hw.c | 58 ++++++++++++++++++++++++-- drivers/scsi/libsas/sas_ata.c | 22 +++++++++- include/scsi/sas_ata.h | 6 +++ 7 files changed, 132 insertions(+), 26 deletions(-)
From: John Garry john.garry@huawei.com
mainline inclusion from mainline-v6.2-rc1 commit 44112922674b94a7d699dfff6307fc830018df7c category: bugfix bugzilla: https://gitee.com/openeuler/kernel/issues/I6NPLN CVE: NA
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i...
----------------------------------------------------------------------
Similar to how AHCI handles NCQ errors in ahci_error_intr() -> ata_port_abort() -> ata_do_link_abort(), add an NCQ error handler for LLDDs to call to initiate a link abort.
This will mark all outstanding QCs as failed and kick-off EH.
Note:
A "force reset" argument is added for drivers which require the ATA error handling to always reset the device.
A driver may require this feature for when SATA device per-SCSI cmnd resources are only released during reset for ATA EH. As such, we need an option to force reset to be done, regardless of what any EH autopsy decides.
The SATA device FIS fields are set to indicate a device error from ata_eh_analyze_tf().
Suggested-by: Damien Le Moal damien.lemoal@opensource.wdc.com Suggested-by: Niklas Cassel niklas.cassel@wdc.com Signed-off-by: John Garry john.garry@huawei.com Link: https://lore.kernel.org/r/1665998435-199946-2-git-send-email-john.garry@huaw... Tested-by: Damien Le Moal damien.lemoal@opensource.wdc.com Tested-by: Niklas Cassel niklas.cassel@wdc.com # pm80xx Reviewed-by: Jason Yan yanaijie@huawei.com Signed-off-by: Martin K. Petersen martin.petersen@oracle.com Signed-off-by: xiabing xiabing12@h-partners.com --- drivers/scsi/libsas/sas_ata.c | 15 +++++++++++++++ include/scsi/sas_ata.h | 6 ++++++ 2 files changed, 21 insertions(+)
diff --git a/drivers/scsi/libsas/sas_ata.c b/drivers/scsi/libsas/sas_ata.c index d54f91636ad1..434083c2c2f7 100644 --- a/drivers/scsi/libsas/sas_ata.c +++ b/drivers/scsi/libsas/sas_ata.c @@ -875,3 +875,18 @@ void sas_ata_wait_eh(struct domain_device *dev) ap = dev->sata_dev.ap; ata_port_wait_eh(ap); } + +void sas_ata_device_link_abort(struct domain_device *device, bool force_reset) +{ + struct ata_port *ap = device->sata_dev.ap; + struct ata_link *link = &ap->link; + + device->sata_dev.fis[2] = ATA_ERR | ATA_DRDY; /* tf status */ + device->sata_dev.fis[3] = ATA_ABORTED; /* tf error */ + + link->eh_info.err_mask |= AC_ERR_DEV; + if (force_reset) + link->eh_info.action |= ATA_EH_RESET; + ata_link_abort(link); +} +EXPORT_SYMBOL_GPL(sas_ata_device_link_abort); diff --git a/include/scsi/sas_ata.h b/include/scsi/sas_ata.h index e0d2d4915257..68c342e55f6f 100644 --- a/include/scsi/sas_ata.h +++ b/include/scsi/sas_ata.h @@ -33,6 +33,7 @@ void sas_probe_sata(struct asd_sas_port *port); void sas_suspend_sata(struct asd_sas_port *port); void sas_resume_sata(struct asd_sas_port *port); void sas_ata_end_eh(struct ata_port *ap); +void sas_ata_device_link_abort(struct domain_device *dev, bool force_reset); int sas_ata_wait_after_reset(struct domain_device *dev, unsigned long deadline); #else
@@ -87,6 +88,11 @@ static inline void sas_ata_end_eh(struct ata_port *ap) { }
+static inline void sas_ata_device_link_abort(struct domain_device *dev, + bool force_reset) +{ +} + static inline int sas_ata_wait_after_reset(struct domain_device *dev, unsigned long deadline) {
From: Xingui Yang yangxingui@huawei.com
mainline inclusion from mainline-v6.2-rc1 commit 4b329abc91800d23941ac773e69b322a13981ecb category: bugfix bugzilla: https://gitee.com/openeuler/kernel/issues/I6NPLN CVE: NA
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i...
----------------------------------------------------------------------
Each branch currently defines a slot variable independently, and it is neater to move it to the function head.
Signed-off-by: Xingui Yang yangxingui@huawei.com Signed-off-by: John Garry john.garry@huawei.com Link: https://lore.kernel.org/r/1665998435-199946-3-git-send-email-john.garry@huaw... Signed-off-by: Martin K. Petersen martin.petersen@oracle.com Signed-off-by: xiabing xiabing12@h-partners.com --- drivers/scsi/hisi_sas/hisi_sas_main.c | 8 +++----- 1 file changed, 3 insertions(+), 5 deletions(-)
diff --git a/drivers/scsi/hisi_sas/hisi_sas_main.c b/drivers/scsi/hisi_sas/hisi_sas_main.c index b8249a055fbb..e3e35560539d 100644 --- a/drivers/scsi/hisi_sas/hisi_sas_main.c +++ b/drivers/scsi/hisi_sas/hisi_sas_main.c @@ -1680,6 +1680,7 @@ static int hisi_sas_abort_task(struct sas_task *task) struct hisi_sas_tmf_task tmf_task; struct domain_device *device = task->dev; struct hisi_sas_device *sas_dev = device->lldd_dev; + struct hisi_sas_slot *slot = task->lldd_task; struct hisi_hba *hisi_hba; struct device *dev; int rc = TMF_RESP_FUNC_FAILED; @@ -1693,7 +1694,6 @@ static int hisi_sas_abort_task(struct sas_task *task)
spin_lock_irqsave(&task->task_state_lock, flags); if (task->task_state_flags & SAS_TASK_STATE_DONE) { - struct hisi_sas_slot *slot = task->lldd_task; struct hisi_sas_cq *cq;
if (slot) { @@ -1711,9 +1711,8 @@ static int hisi_sas_abort_task(struct sas_task *task) task->task_state_flags |= SAS_TASK_STATE_ABORTED; spin_unlock_irqrestore(&task->task_state_lock, flags);
- if (task->lldd_task && task->task_proto & SAS_PROTOCOL_SSP) { + if (slot && task->task_proto & SAS_PROTOCOL_SSP) { struct scsi_cmnd *cmnd = task->uldd_task; - struct hisi_sas_slot *slot = task->lldd_task; u16 tag = slot->idx; int rc2;
@@ -1756,9 +1755,8 @@ static int hisi_sas_abort_task(struct sas_task *task) hisi_sas_dereg_device(hisi_hba, device); rc = hisi_sas_softreset_ata_disk(device); } - } else if (task->lldd_task && task->task_proto & SAS_PROTOCOL_SMP) { + } else if (slot && task->task_proto & SAS_PROTOCOL_SMP) { /* SMP */ - struct hisi_sas_slot *slot = task->lldd_task; u32 tag = slot->idx; struct hisi_sas_cq *cq = &hisi_hba->cq[slot->dlvry_queue];
From: Xingui Yang yangxingui@huawei.com
mainline inclusion from mainline-v6.2-rc1 commit 930d97dabdd56681aef752a35475f0212a171741 category: bugfix bugzilla: https://gitee.com/openeuler/kernel/issues/I6NPLN CVE: NA
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i...
----------------------------------------------------------------------
When CQ header dw3 SATA_DISK_ERR is set it means this SATA disk is in error state and the current IPTT is invalid. An invalid IPTT does not correspond to any slot.
In this scenario, new I/Os that delivered to disk will be rejected by the controller and all I/Os remaining in the disk should be aborted, which we add here with the sas_ata_device_link_abort() call.
In hisi_sas_abort_task() we don't want to issue a soft reset as it may cause info to be lost in the target disk for the ATA EH autopsy. In this case, just release resources - the disk won't return other I/Os normally after NCQ Error, so this is safe.
Signed-off-by: Xingui Yang yangxingui@huawei.com Signed-off-by: John Garry john.garry@huawei.com Link: https://lore.kernel.org/r/1665998435-199946-4-git-send-email-john.garry@huaw... Signed-off-by: Martin K. Petersen martin.petersen@oracle.com Signed-off-by: xiabing xiabing12@h-partners.com --- drivers/scsi/hisi_sas/hisi_sas.h | 1 + drivers/scsi/hisi_sas/hisi_sas_main.c | 19 +++++++++- drivers/scsi/hisi_sas/hisi_sas_v3_hw.c | 48 ++++++++++++++++++++++++-- 3 files changed, 65 insertions(+), 3 deletions(-)
diff --git a/drivers/scsi/hisi_sas/hisi_sas.h b/drivers/scsi/hisi_sas/hisi_sas.h index 660da5b10193..0bdf72758cda 100644 --- a/drivers/scsi/hisi_sas/hisi_sas.h +++ b/drivers/scsi/hisi_sas/hisi_sas.h @@ -105,6 +105,7 @@ enum { enum dev_status { HISI_SAS_DEV_INIT, HISI_SAS_DEV_NORMAL, + HISI_SAS_DEV_NCQ_ERR, };
enum { diff --git a/drivers/scsi/hisi_sas/hisi_sas_main.c b/drivers/scsi/hisi_sas/hisi_sas_main.c index e3e35560539d..a5bcb70cc64b 100644 --- a/drivers/scsi/hisi_sas/hisi_sas_main.c +++ b/drivers/scsi/hisi_sas/hisi_sas_main.c @@ -1745,6 +1745,8 @@ static int hisi_sas_abort_task(struct sas_task *task) } else if (task->task_proto & SAS_PROTOCOL_SATA || task->task_proto & SAS_PROTOCOL_STP) { if (task->dev->dev_type == SAS_SATA_DEV) { + struct ata_queued_cmd *qc = task->uldd_task; + rc = hisi_sas_internal_task_abort(hisi_hba, device, HISI_SAS_INT_ABT_DEV, 0, false); @@ -1753,7 +1755,18 @@ static int hisi_sas_abort_task(struct sas_task *task) goto out; } hisi_sas_dereg_device(hisi_hba, device); - rc = hisi_sas_softreset_ata_disk(device); + + /* + * If an ATA internal command times out in ATA EH, it + * need to execute soft reset, so check the scsicmd + */ + if ((sas_dev->dev_status == HISI_SAS_DEV_NCQ_ERR) && + qc && qc->scsicmd) { + hisi_sas_do_release_task(hisi_hba, task, slot); + rc = TMF_RESP_FUNC_COMPLETE; + } else { + rc = hisi_sas_softreset_ata_disk(device); + } } } else if (slot && task->task_proto & SAS_PROTOCOL_SMP) { /* SMP */ @@ -1883,8 +1896,12 @@ static int hisi_sas_I_T_nexus_reset(struct domain_device *device) { struct hisi_hba *hisi_hba = dev_to_hisi_hba(device); struct device *dev = hisi_hba->dev; + struct hisi_sas_device *sas_dev = device->lldd_dev; int rc;
+ if (sas_dev->dev_status == HISI_SAS_DEV_NCQ_ERR) + sas_dev->dev_status = HISI_SAS_DEV_NORMAL; + rc = hisi_sas_internal_task_abort(hisi_hba, device, HISI_SAS_INT_ABT_DEV, 0, false); if (rc < 0) { diff --git a/drivers/scsi/hisi_sas/hisi_sas_v3_hw.c b/drivers/scsi/hisi_sas/hisi_sas_v3_hw.c index 4a2ed1a74ef9..a36d9382e15a 100644 --- a/drivers/scsi/hisi_sas/hisi_sas_v3_hw.c +++ b/drivers/scsi/hisi_sas/hisi_sas_v3_hw.c @@ -403,6 +403,11 @@ #define CMPLT_HDR_CMPLT_MSK (0x3 << CMPLT_HDR_CMPLT_OFF) #define CMPLT_HDR_ERROR_PHASE_OFF 2 #define CMPLT_HDR_ERROR_PHASE_MSK (0xff << CMPLT_HDR_ERROR_PHASE_OFF) +/* bit[9:2] Error Phase */ +#define ERR_PHASE_RESPONSE_FRAME_REV_STAGE_OFF \ + 8 +#define ERR_PHASE_RESPONSE_FRAME_REV_STAGE_MSK \ + (0x1 << ERR_PHASE_RESPONSE_FRAME_REV_STAGE_OFF) #define CMPLT_HDR_RSPNS_XFRD_OFF 10 #define CMPLT_HDR_RSPNS_XFRD_MSK (0x1 << CMPLT_HDR_RSPNS_XFRD_OFF) #define CMPLT_HDR_RSPNS_GOOD_OFF 11 @@ -422,8 +427,15 @@ #define CMPLT_HDR_DEV_ID_OFF 16 #define CMPLT_HDR_DEV_ID_MSK (0xffff << CMPLT_HDR_DEV_ID_OFF) /* dw3 */ +#define CMPLT_HDR_SATA_DISK_ERR_OFF 16 +#define CMPLT_HDR_SATA_DISK_ERR_MSK (0x1 << CMPLT_HDR_SATA_DISK_ERR_OFF) #define CMPLT_HDR_IO_IN_TARGET_OFF 17 #define CMPLT_HDR_IO_IN_TARGET_MSK (0x1 << CMPLT_HDR_IO_IN_TARGET_OFF) +/* bit[23:18] ERR_FIS_ATA_STATUS */ +#define FIS_ATA_STATUS_ERR_OFF 18 +#define FIS_ATA_STATUS_ERR_MSK (0x1 << FIS_ATA_STATUS_ERR_OFF) +#define FIS_TYPE_SDB_OFF 31 +#define FIS_TYPE_SDB_MSK (0x1 << FIS_TYPE_SDB_OFF)
/* ITCT header */ /* qw0 */ @@ -2148,6 +2160,18 @@ static irqreturn_t fatal_axi_int_v3_hw(int irq_no, void *p) return IRQ_HANDLED; }
+static bool is_ncq_err_v3_hw(struct hisi_sas_complete_v3_hdr *complete_hdr) +{ + u32 dw0, dw3; + + dw0 = le32_to_cpu(complete_hdr->dw0); + dw3 = le32_to_cpu(complete_hdr->dw3); + + return (dw0 & ERR_PHASE_RESPONSE_FRAME_REV_STAGE_MSK) && + (dw3 & FIS_TYPE_SDB_MSK) && + (dw3 & FIS_ATA_STATUS_ERR_MSK); +} + static bool slot_err_v3_hw(struct hisi_hba *hisi_hba, struct sas_task *task, struct hisi_sas_slot *slot) @@ -2384,14 +2408,34 @@ static irqreturn_t cq_thread_v3_hw(int irq_no, void *p) while (rd_point != wr_point) { struct hisi_sas_complete_v3_hdr *complete_hdr; struct device *dev = hisi_hba->dev; - u32 dw1; + u32 dw0, dw1, dw3; int iptt;
complete_hdr = &complete_queue[rd_point]; + dw0 = le32_to_cpu(complete_hdr->dw0); dw1 = le32_to_cpu(complete_hdr->dw1); + dw3 = le32_to_cpu(complete_hdr->dw3);
iptt = dw1 & CMPLT_HDR_IPTT_MSK; - if (likely(iptt < HISI_SAS_COMMAND_ENTRIES_V3_HW)) { + if (unlikely((dw0 & CMPLT_HDR_CMPLT_MSK) == 0x3) && + (dw3 & CMPLT_HDR_SATA_DISK_ERR_MSK)) { + int device_id = (dw1 & CMPLT_HDR_DEV_ID_MSK) >> + CMPLT_HDR_DEV_ID_OFF; + struct hisi_sas_itct *itct = + &hisi_hba->itct[device_id]; + struct hisi_sas_device *sas_dev = + &hisi_hba->devices[device_id]; + struct domain_device *device = sas_dev->sas_device; + + dev_err(dev, "erroneous completion disk err dev id=%d sas_addr=0x%llx CQ hdr: 0x%x 0x%x 0x%x 0x%x\n", + device_id, itct->sas_addr, dw0, dw1, + complete_hdr->act, dw3); + + if (is_ncq_err_v3_hw(complete_hdr)) + sas_dev->dev_status = HISI_SAS_DEV_NCQ_ERR; + + sas_ata_device_link_abort(device, true); + } else if (likely(iptt < HISI_SAS_COMMAND_ENTRIES_V3_HW)) { slot = &hisi_hba->slot_info[iptt]; slot->cmplt_queue_slot = rd_point; slot->cmplt_queue = queue;
From: Xingui Yang yangxingui@huawei.com
mainline inclusion from mainline-v6.2-rc4 commit 037b48057e8b485a8d72f808122796aeadbbee32 category: bugfix bugzilla: https://gitee.com/openeuler/kernel/issues/I6NPLN CVE: NA
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i...
----------------------------------------------------------------------
Currently clear task set is used to abort all commands remaining in the disk when the SAS disk is discovered, and if the disk is discovered by two initiators, other I_T nexuses are also affected. So use abort task set instead and take effect only on the specified I_T nexus.
Signed-off-by: Xingui Yang yangxingui@huawei.com Signed-off-by: Xiang Chen chenxiang66@hisilicon.com Link: https://lore.kernel.org/r/1672805000-141102-2-git-send-email-chenxiang66@his... Signed-off-by: Martin K. Petersen martin.petersen@oracle.com Signed-off-by: xiabing xiabing12@h-partners.com --- drivers/scsi/hisi_sas/hisi_sas_main.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/drivers/scsi/hisi_sas/hisi_sas_main.c b/drivers/scsi/hisi_sas/hisi_sas_main.c index a5bcb70cc64b..51c02a1c9a96 100644 --- a/drivers/scsi/hisi_sas/hisi_sas_main.c +++ b/drivers/scsi/hisi_sas/hisi_sas_main.c @@ -696,7 +696,7 @@ static int hisi_sas_init_device(struct domain_device *device) case SAS_END_DEVICE: int_to_scsilun(0, &lun);
- tmf_task.tmf = TMF_CLEAR_TASK_SET; + tmf_task.tmf = TMF_ABORT_TASK_SET; while (retry-- > 0) { rc = hisi_sas_debug_issue_ssp_tmf(device, lun.scsi_lun, &tmf_task);
From: John Garry john.garry@huawei.com
mainline inclusion from mainline-v6.2-rc1 commit cc22efbec0110181725b1f5f6778155a2e352522 category: bugfix bugzilla: https://gitee.com/openeuler/kernel/issues/I6NPLN CVE: NA
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i...
----------------------------------------------------------------------
In sas_ata_task_done(), for commands which complete with error we set the SATA dev FIS status field with ATA_ERR. In ata_eh_analyze_tf() this would be interpreted as a HSM error. Set ATA_DRDY, which will lead libata to judge as a device error, which is a safer bet.
Signed-off-by: John Garry john.garry@huawei.com Link: https://lore.kernel.org/r/1665998435-199946-9-git-send-email-john.garry@huaw... Reviewed-by: Niklas Cassel niklas.cassel@wdc.com Signed-off-by: Martin K. Petersen martin.petersen@oracle.com Signed-off-by: xiabing xiabing12@h-partners.com --- drivers/scsi/libsas/sas_ata.c | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-)
diff --git a/drivers/scsi/libsas/sas_ata.c b/drivers/scsi/libsas/sas_ata.c index 434083c2c2f7..5793b06ba71d 100644 --- a/drivers/scsi/libsas/sas_ata.c +++ b/drivers/scsi/libsas/sas_ata.c @@ -147,8 +147,8 @@ static void sas_ata_task_done(struct sas_task *task) qc->flags |= ATA_QCFLAG_FAILED; }
- dev->sata_dev.fis[3] = 0x04; /* status err */ - dev->sata_dev.fis[2] = ATA_ERR; + dev->sata_dev.fis[2] = ATA_ERR | ATA_DRDY; /* tf status */ + dev->sata_dev.fis[3] = ATA_ABORTED; /* tf error */ } }
From: Xingui Yang yangxingui@huawei.com
mainline inclusion from mainline-v6.2-rc4 commit a67aad57d9aee41180aff36e54cb72fe4b8d5a5a category: bugfix bugzilla: https://gitee.com/openeuler/kernel/issues/I6NPLN CVE: NA
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i...
----------------------------------------------------------------------
Grab the ATA port lock in sas_ata_device_link_abort() before calling ata_link_abort() as outlined in this function's locking requirements.
Fixes: 44112922674b ("scsi: libsas: Add sas_ata_device_link_abort()") Signed-off-by: Xingui Yang yangxingui@huawei.com Reviewed-by: John Garry john.g.garry@oracle.com Reviewed-by: Jason Yan yanaijie@huawei.com Signed-off-by: Martin K. Petersen martin.petersen@oracle.com Signed-off-by: xiabing xiabing12@h-partners.com --- drivers/scsi/libsas/sas_ata.c | 3 +++ 1 file changed, 3 insertions(+)
diff --git a/drivers/scsi/libsas/sas_ata.c b/drivers/scsi/libsas/sas_ata.c index 5793b06ba71d..91a3a0ff8952 100644 --- a/drivers/scsi/libsas/sas_ata.c +++ b/drivers/scsi/libsas/sas_ata.c @@ -880,7 +880,9 @@ void sas_ata_device_link_abort(struct domain_device *device, bool force_reset) { struct ata_port *ap = device->sata_dev.ap; struct ata_link *link = &ap->link; + unsigned long flags;
+ spin_lock_irqsave(ap->lock, flags); device->sata_dev.fis[2] = ATA_ERR | ATA_DRDY; /* tf status */ device->sata_dev.fis[3] = ATA_ABORTED; /* tf error */
@@ -888,5 +890,6 @@ void sas_ata_device_link_abort(struct domain_device *device, bool force_reset) if (force_reset) link->eh_info.action |= ATA_EH_RESET; ata_link_abort(link); + spin_unlock_irqrestore(ap->lock, flags); } EXPORT_SYMBOL_GPL(sas_ata_device_link_abort);
From: Xingui Yang yangxingui@huawei.com
driver inclusion category: bugfix bugzilla: https://gitee.com/openeuler/kernel/issues/I6NPLN CVE: NA
----------------------------------------------------------------------
If an NCQ error occurs when the IPTT is valid and slot->abort flag is set in completion path, sas_task_abort() will be called to abort only one NCQ command now, and the host would be set to SHOST_RECOVERY state. But this may not kick-off EH Immediately until other outstanding QCs timeouts. As a result, the host may remain in the SHOST_RECOVERY state for up to 30 seconds, such as follows:
[7972317.645234] hisi_sas_v3_hw 0000:74:04.0: erroneous completion iptt=3264 task=00000000466116b8 dev id=2 sas_addr=0x5000000000000502 CQ hdr: 0x1883 0x20cc0 0x40000 0x20420000 Error info: 0x0 0x0 0x200000 0x0 [7972341.508264] sas: Enter sas_scsi_recover_host busy: 32 failed: 32 [7972341.984731] sas: --- Exit sas_scsi_recover_host: busy: 0 failed: 32 tries: 1
So all NCQ commands that are in the queue should be aborted when an NCQ error occurs In this scenario.
Fixes: 05d91b557af9 ("scsi: hisi_sas: Directly trigger SCSI error handling for completion errors")
Signed-off-by: Xingui Yang yangxingui@huawei.com Reviewed-by: Xiang Chen chenxiang66@hisilicon.com Signed-off-by: xiabing xiabing12@h-partners.com --- drivers/scsi/hisi_sas/hisi_sas_v1_hw.c | 6 +++++- drivers/scsi/hisi_sas/hisi_sas_v2_hw.c | 6 +++++- drivers/scsi/hisi_sas/hisi_sas_v3_hw.c | 6 +++++- 3 files changed, 15 insertions(+), 3 deletions(-)
diff --git a/drivers/scsi/hisi_sas/hisi_sas_v1_hw.c b/drivers/scsi/hisi_sas/hisi_sas_v1_hw.c index fdff327bb030..a820c97f33a8 100644 --- a/drivers/scsi/hisi_sas/hisi_sas_v1_hw.c +++ b/drivers/scsi/hisi_sas/hisi_sas_v1_hw.c @@ -1259,7 +1259,11 @@ static void slot_complete_v1_hw(struct hisi_hba *hisi_hba,
slot_err_v1_hw(hisi_hba, task, slot); if (unlikely(slot->abort)) { - sas_task_abort(task); + if (dev_is_sata(device) && task->ata_task.use_ncq) + sas_ata_device_link_abort(device, true); + else + sas_task_abort(task); + return; } goto out; diff --git a/drivers/scsi/hisi_sas/hisi_sas_v2_hw.c b/drivers/scsi/hisi_sas/hisi_sas_v2_hw.c index 9bfa796505aa..071430785100 100644 --- a/drivers/scsi/hisi_sas/hisi_sas_v2_hw.c +++ b/drivers/scsi/hisi_sas/hisi_sas_v2_hw.c @@ -2405,7 +2405,11 @@ static void slot_complete_v2_hw(struct hisi_hba *hisi_hba, error_info[2], error_info[3]);
if (unlikely(slot->abort)) { - sas_task_abort(task); + if (dev_is_sata(device) && task->ata_task.use_ncq) + sas_ata_device_link_abort(device, true); + else + sas_task_abort(task); + return; } goto out; diff --git a/drivers/scsi/hisi_sas/hisi_sas_v3_hw.c b/drivers/scsi/hisi_sas/hisi_sas_v3_hw.c index a36d9382e15a..4d36ff3657d2 100644 --- a/drivers/scsi/hisi_sas/hisi_sas_v3_hw.c +++ b/drivers/scsi/hisi_sas/hisi_sas_v3_hw.c @@ -2318,7 +2318,11 @@ static void slot_complete_v3_hw(struct hisi_hba *hisi_hba, error_info[0], error_info[1], error_info[2], error_info[3]); if (unlikely(slot->abort)) { - sas_task_abort(task); + if (dev_is_sata(device) && task->ata_task.use_ncq) + sas_ata_device_link_abort(device, true); + else + sas_task_abort(task); + return; } goto out;
From: Xingui Yang yangxingui@huawei.com
driver inclusion category: bugfix bugzilla: https://gitee.com/openeuler/kernel/issues/I6NPLN CVE: NA
----------------------------------------------------------------------
When the IO complete and free slot in function slot_complete_v3_hw(), it is possible that sas_dev.list is being traversed elsewhere, and it may trigger a null pointer exception, such as follows: ==>cq thread ==>scsi_eh_6
==>scsi_error_handler() ==>sas_eh_handle_sas_errors() ==>sas_scsi_find_task() ==>lldd_abort_task() ==>slot_complete_v3_hw() ==>hisi_sas_abort_task() ==>hisi_sas_slot_task_free() ==>dereg_device_v3_hw() ==>list_del_init() ==>list_for_each_entry_safe()
[ 7165.434918] sas: Enter sas_scsi_recover_host busy: 32 failed: 32 [ 7165.434926] sas: trying to find task 0x00000000769b5ba5 [ 7165.434927] sas: sas_scsi_find_task: aborting task 0x00000000769b5ba5 [ 7165.434940] hisi_sas_v3_hw 0000:b4:02.0: slot complete: task(00000000769b5ba5) aborted [ 7165.434964] hisi_sas_v3_hw 0000:b4:02.0: slot complete: task(00000000c9f7aa07) ignored [ 7165.434965] hisi_sas_v3_hw 0000:b4:02.0: slot complete: task(00000000e2a1cf01) ignored [ 7165.434968] Unable to handle kernel NULL pointer dereference at virtual address 0000000000000000 [ 7165.434972] hisi_sas_v3_hw 0000:b4:02.0: slot complete: task(0000000022d52d93) ignored [ 7165.434975] hisi_sas_v3_hw 0000:b4:02.0: slot complete: task(0000000066a7516c) ignored [ 7165.434976] Mem abort info: [ 7165.434982] ESR = 0x96000004 [ 7165.434991] Exception class = DABT (current EL), IL = 32 bits [ 7165.434992] SET = 0, FnV = 0 [ 7165.434993] EA = 0, S1PTW = 0 [ 7165.434994] Data abort info: [ 7165.434994] ISV = 0, ISS = 0x00000004 [ 7165.434995] CM = 0, WnR = 0 [ 7165.434997] user pgtable: 4k pages, 48-bit VAs, pgdp = 00000000f29543f2 [ 7165.434998] [0000000000000000] pgd=0000000000000000 [ 7165.435003] Internal error: Oops: 96000004 [#1] SMP [ 7165.439863] Process scsi_eh_6 (pid: 4109, stack limit = 0x00000000c43818d5) [ 7165.468862] pstate: 00c00009 (nzcv daif +PAN +UAO) [ 7165.473637] pc : dereg_device_v3_hw+0x68/0xa8 [hisi_sas_v3_hw] [ 7165.479443] lr : dereg_device_v3_hw+0x2c/0xa8 [hisi_sas_v3_hw] [ 7165.485247] sp : ffff00001d623bc0 [ 7165.488546] x29: ffff00001d623bc0 x28: ffffa027d03b9508 [ 7165.493835] x27: ffff80278ed50af0 x26: ffffa027dd31e0a8 [ 7165.499123] x25: ffffa027d9b27f88 x24: ffffa027d9b209f8 [ 7165.504411] x23: ffffa027c45b0d60 x22: ffff80278ec07c00 [ 7165.509700] x21: 0000000000000008 x20: ffffa027d9b209f8 [ 7165.514988] x19: ffffa027d9b27f88 x18: ffffffffffffffff [ 7165.520276] x17: 0000000000000000 x16: 0000000000000000 [ 7165.525564] x15: ffff0000091d9708 x14: ffff0000093b7dc8 [ 7165.530852] x13: ffff0000093b7a23 x12: 6e7265746e692067 [ 7165.536140] x11: 0000000000000000 x10: 0000000000000bb0 [ 7165.541429] x9 : ffff00001d6238f0 x8 : ffffa027d877af00 [ 7165.546718] x7 : ffffa027d6329600 x6 : ffff7e809f58ca00 [ 7165.552006] x5 : 0000000000001f8a x4 : 000000000000088e [ 7165.557295] x3 : ffffa027d9b27fa8 x2 : 0000000000000000 [ 7165.562583] x1 : 0000000000000000 x0 : 000000003000188e [ 7165.567872] Call trace: [ 7165.570309] dereg_device_v3_hw+0x68/0xa8 [hisi_sas_v3_hw] [ 7165.575775] hisi_sas_abort_task+0x248/0x358 [hisi_sas_main] [ 7165.581415] sas_eh_handle_sas_errors+0x258/0x8e0 [libsas] [ 7165.586876] sas_scsi_recover_host+0x134/0x458 [libsas] [ 7165.592082] scsi_error_handler+0xb4/0x488 [ 7165.596163] kthread+0x134/0x138 [ 7165.599380] ret_from_fork+0x10/0x18 [ 7165.602940] Code: d5033e9f b9000040 aa0103e2 eb03003f (f9400021) [ 7165.609004] kernel fault(0x1) notification starting on CPU 75 [ 7165.700728] ---[ end trace fc042cbbea224efc ]--- [ 7165.705326] Kernel panic - not syncing: Fatal exception
So grab sas_dev lock when traversing the members of sas_dev.list in dereg_device_v3_hw() and hisi_sas_release_tasks() to avoid concurrency of adding and deleting member, which may cause an exception. And when hisi_sas_release_tasks() call hisi_sas_do_release_task() to free slot, the lock cannot be grabbed again in hisi_sas_slot_task_free(), then a bool parameter need_lock is added.
Signed-off-by: Xingui Yang yangxingui@huawei.com Reviewed-by: Xiang Chen chenxiang66@hisilicon.com Signed-off-by: xiabing xiabing12@h-partners.com --- drivers/scsi/hisi_sas/hisi_sas.h | 3 ++- drivers/scsi/hisi_sas/hisi_sas_main.c | 25 ++++++++++++++++--------- drivers/scsi/hisi_sas/hisi_sas_v1_hw.c | 2 +- drivers/scsi/hisi_sas/hisi_sas_v2_hw.c | 2 +- drivers/scsi/hisi_sas/hisi_sas_v3_hw.c | 4 +++- 5 files changed, 23 insertions(+), 13 deletions(-)
diff --git a/drivers/scsi/hisi_sas/hisi_sas.h b/drivers/scsi/hisi_sas/hisi_sas.h index 0bdf72758cda..ed75328de62b 100644 --- a/drivers/scsi/hisi_sas/hisi_sas.h +++ b/drivers/scsi/hisi_sas/hisi_sas.h @@ -661,7 +661,8 @@ extern void hisi_sas_phy_down(struct hisi_hba *hisi_hba, int phy_no, int rdy, gfp_t gfp_flags); extern void hisi_sas_slot_task_free(struct hisi_hba *hisi_hba, struct sas_task *task, - struct hisi_sas_slot *slot); + struct hisi_sas_slot *slot, + bool need_lock); extern void hisi_sas_init_mem(struct hisi_hba *hisi_hba); extern void hisi_sas_rst_work_handler(struct work_struct *work); extern void hisi_sas_sync_rst_work_handler(struct work_struct *work); diff --git a/drivers/scsi/hisi_sas/hisi_sas_main.c b/drivers/scsi/hisi_sas/hisi_sas_main.c index 51c02a1c9a96..dc9957b0d6a1 100644 --- a/drivers/scsi/hisi_sas/hisi_sas_main.c +++ b/drivers/scsi/hisi_sas/hisi_sas_main.c @@ -215,7 +215,7 @@ static void hisi_sas_slot_index_init(struct hisi_hba *hisi_hba) }
void hisi_sas_slot_task_free(struct hisi_hba *hisi_hba, struct sas_task *task, - struct hisi_sas_slot *slot) + struct hisi_sas_slot *slot, bool need_lock) { int device_id = slot->device_id; struct hisi_sas_device *sas_dev = &hisi_hba->devices[device_id]; @@ -244,9 +244,13 @@ void hisi_sas_slot_task_free(struct hisi_hba *hisi_hba, struct sas_task *task, } }
- spin_lock(&sas_dev->lock); - list_del_init(&slot->entry); - spin_unlock(&sas_dev->lock); + if (need_lock) { + spin_lock(&sas_dev->lock); + list_del_init(&slot->entry); + spin_unlock(&sas_dev->lock); + } else { + list_del_init(&slot->entry); + }
memset(slot, 0, offsetof(struct hisi_sas_slot, buf));
@@ -1024,7 +1028,7 @@ static void hisi_sas_port_notify_formed(struct asd_sas_phy *sas_phy) }
static void hisi_sas_do_release_task(struct hisi_hba *hisi_hba, struct sas_task *task, - struct hisi_sas_slot *slot) + struct hisi_sas_slot *slot, bool need_lock) { if (task) { unsigned long flags; @@ -1042,7 +1046,7 @@ static void hisi_sas_do_release_task(struct hisi_hba *hisi_hba, struct sas_task spin_unlock_irqrestore(&task->task_state_lock, flags); }
- hisi_sas_slot_task_free(hisi_hba, task, slot); + hisi_sas_slot_task_free(hisi_hba, task, slot, need_lock); }
static void hisi_sas_release_task(struct hisi_hba *hisi_hba, @@ -1051,8 +1055,11 @@ static void hisi_sas_release_task(struct hisi_hba *hisi_hba, struct hisi_sas_slot *slot, *slot2; struct hisi_sas_device *sas_dev = device->lldd_dev;
+ spin_lock(&sas_dev->lock); list_for_each_entry_safe(slot, slot2, &sas_dev->list, entry) - hisi_sas_do_release_task(hisi_hba, slot->task, slot); + hisi_sas_do_release_task(hisi_hba, slot->task, slot, false); + + spin_unlock(&sas_dev->lock); }
void hisi_sas_release_tasks(struct hisi_hba *hisi_hba) @@ -1740,7 +1747,7 @@ static int hisi_sas_abort_task(struct sas_task *task) */ if (rc == TMF_RESP_FUNC_COMPLETE && rc2 != TMF_RESP_FUNC_SUCC) { if (task->lldd_task) - hisi_sas_do_release_task(hisi_hba, task, slot); + hisi_sas_do_release_task(hisi_hba, task, slot, true); } } else if (task->task_proto & SAS_PROTOCOL_SATA || task->task_proto & SAS_PROTOCOL_STP) { @@ -1762,7 +1769,7 @@ static int hisi_sas_abort_task(struct sas_task *task) */ if ((sas_dev->dev_status == HISI_SAS_DEV_NCQ_ERR) && qc && qc->scsicmd) { - hisi_sas_do_release_task(hisi_hba, task, slot); + hisi_sas_do_release_task(hisi_hba, task, slot, true); rc = TMF_RESP_FUNC_COMPLETE; } else { rc = hisi_sas_softreset_ata_disk(device); diff --git a/drivers/scsi/hisi_sas/hisi_sas_v1_hw.c b/drivers/scsi/hisi_sas/hisi_sas_v1_hw.c index a820c97f33a8..722d837b19e6 100644 --- a/drivers/scsi/hisi_sas/hisi_sas_v1_hw.c +++ b/drivers/scsi/hisi_sas/hisi_sas_v1_hw.c @@ -1313,7 +1313,7 @@ static void slot_complete_v1_hw(struct hisi_hba *hisi_hba, }
out: - hisi_sas_slot_task_free(hisi_hba, task, slot); + hisi_sas_slot_task_free(hisi_hba, task, slot, true);
if (task->task_done) task->task_done(task); diff --git a/drivers/scsi/hisi_sas/hisi_sas_v2_hw.c b/drivers/scsi/hisi_sas/hisi_sas_v2_hw.c index 071430785100..d74e342faffd 100644 --- a/drivers/scsi/hisi_sas/hisi_sas_v2_hw.c +++ b/drivers/scsi/hisi_sas/hisi_sas_v2_hw.c @@ -2469,7 +2469,7 @@ static void slot_complete_v2_hw(struct hisi_hba *hisi_hba, } task->task_state_flags |= SAS_TASK_STATE_DONE; spin_unlock_irqrestore(&task->task_state_lock, flags); - hisi_sas_slot_task_free(hisi_hba, task, slot); + hisi_sas_slot_task_free(hisi_hba, task, slot, true);
if (!is_internal && (task->task_proto != SAS_PROTOCOL_SMP)) { spin_lock_irqsave(&device->done_lock, flags); diff --git a/drivers/scsi/hisi_sas/hisi_sas_v3_hw.c b/drivers/scsi/hisi_sas/hisi_sas_v3_hw.c index 4d36ff3657d2..9a6c9531dbb5 100644 --- a/drivers/scsi/hisi_sas/hisi_sas_v3_hw.c +++ b/drivers/scsi/hisi_sas/hisi_sas_v3_hw.c @@ -878,6 +878,7 @@ static void dereg_device_v3_hw(struct hisi_hba *hisi_hba,
cfg_abt_set_query_iptt = hisi_sas_read32(hisi_hba, CFG_ABT_SET_QUERY_IPTT); + spin_lock(&sas_dev->lock); list_for_each_entry_safe(slot, slot2, &sas_dev->list, entry) { cfg_abt_set_query_iptt &= ~CFG_SET_ABORTED_IPTT_MSK; cfg_abt_set_query_iptt |= (1 << CFG_SET_ABORTED_EN_OFF) | @@ -885,6 +886,7 @@ static void dereg_device_v3_hw(struct hisi_hba *hisi_hba, hisi_sas_write32(hisi_hba, CFG_ABT_SET_QUERY_IPTT, cfg_abt_set_query_iptt); } + spin_unlock(&sas_dev->lock); cfg_abt_set_query_iptt &= ~(1 << CFG_SET_ABORTED_EN_OFF); hisi_sas_write32(hisi_hba, CFG_ABT_SET_QUERY_IPTT, cfg_abt_set_query_iptt); @@ -2378,7 +2380,7 @@ static void slot_complete_v3_hw(struct hisi_hba *hisi_hba, } task->task_state_flags |= SAS_TASK_STATE_DONE; spin_unlock_irqrestore(&task->task_state_lock, flags); - hisi_sas_slot_task_free(hisi_hba, task, slot); + hisi_sas_slot_task_free(hisi_hba, task, slot, true);
if (!is_internal && (task->task_proto != SAS_PROTOCOL_SMP)) { spin_lock_irqsave(&device->done_lock, flags);