When fewer than 16 CPUs are online, we found that allocating 32 MSI interrupts (including 16 affinity interrupts) fails after the hisi_sas module is unloaded and then reloaded.
Analysis shows that a bug exists in how the ITS releases interrupt resources. This patch set contains a bugfix patch and a patch that adds debugging information.
Luo Jiaxing (2):
  irqchip/gic-v3-its: don't set bitmap for LPI which user didn't allocate
  genirq/msi: add an error print when __irq_domain_alloc_irqs() failed

 drivers/irqchip/irq-gic-v3-its.c | 4 ++++
 kernel/irq/msi.c                 | 1 +
 2 files changed, 5 insertions(+)
The driver sets the device's LPI bitmap based on get_count_order(nvecs). This means that when the number of LPI interrupts is not a power of two, redundant bits are set in the LPI bitmap. However, when the interrupts are freed, these redundant bits are not cleared. As a result, the device fails to allocate the same number of interrupts next time.
Therefore, clear the redundant bits set in the LPI bitmap.
Fixes: 4615fbc3788d ("genirq/irqdomain: Don't try to free an interrupt that has no mapping")
Signed-off-by: Luo Jiaxing <luojiaxing@huawei.com>
---
 drivers/irqchip/irq-gic-v3-its.c | 4 ++++
 1 file changed, 4 insertions(+)
diff --git a/drivers/irqchip/irq-gic-v3-its.c b/drivers/irqchip/irq-gic-v3-its.c
index ed46e60..027f7ef 100644
--- a/drivers/irqchip/irq-gic-v3-its.c
+++ b/drivers/irqchip/irq-gic-v3-its.c
@@ -3435,6 +3435,10 @@ static int its_alloc_device_irq(struct its_device *dev, int nvecs, irq_hw_number
 
 	*hwirq = dev->event_map.lpi_base + idx;
 
+	bitmap_clear(dev->event_map.lpi_map,
+		     idx + nvecs,
+		     roundup_pow_of_two(nvecs) - nvecs);
+
 	return 0;
 }
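The behaviour described in the commit message can be modelled in userspace. The following is a minimal sketch, not the kernel code: the 32-entry map, count_order(), find_free_region() and clear_bits() are hypothetical stand-ins for dev->event_map.lpi_map, get_count_order() and the kernel bitmap helpers, and it assumes that the free path clears only nvecs bits, as the commit message states.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical 32-entry map standing in for dev->event_map.lpi_map. */
#define MAP_BITS 32u

/* Smallest order with (1 << order) >= n; plays the role of get_count_order(). */
static unsigned int count_order(unsigned int n)
{
	unsigned int order = 0;

	assert(n > 0 && n <= MAP_BITS);
	while ((1u << order) < n)
		order++;
	return order;
}

/* Mask covering `len` bits starting at bit `start` (start + len <= 32). */
static uint32_t range_mask(unsigned int start, unsigned int len)
{
	uint32_t m;

	if (len == 0)
		return 0;
	m = (len >= 32) ? UINT32_MAX : ((1u << len) - 1u);
	return m << start;
}

/*
 * Find a free, naturally aligned region of 2^order bits and mark all of it
 * as used. Returns the first bit of the region, or -1 if none is free.
 */
static int find_free_region(uint32_t *map, unsigned int order)
{
	unsigned int len = 1u << order;
	unsigned int pos;

	for (pos = 0; pos + len <= MAP_BITS; pos += len) {
		uint32_t m = range_mask(pos, len);

		if ((*map & m) == 0) {
			*map |= m;
			return (int)pos;
		}
	}
	return -1;
}

/* Clear `len` bits starting at `start`; models a free of exactly `len` entries. */
static void clear_bits(uint32_t *map, unsigned int start, unsigned int len)
{
	*map &= ~range_mask(start, len);
}

/* Number of bits still marked as used. */
static unsigned int bits_set(uint32_t map)
{
	unsigned int n = 0;

	for (; map; map &= map - 1)
		n++;
	return n;
}
```

With nvecs = 20, count_order() yields 5, so find_free_region() marks all 32 bits. Clearing only 20 of them leaves 12 bits set, and a second request of the same size then finds no free region in this model.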
On 2021-02-08 10:58, Luo Jiaxing wrote:
> The driver sets the device's LPI bitmap based on get_count_order(nvecs). This means that when the number of LPI interrupts is not a power of two, redundant bits are set in the LPI bitmap. However, when the interrupts are freed, these redundant bits are not cleared. As a result, the device fails to allocate the same number of interrupts next time.
>
> Therefore, clear the redundant bits set in the LPI bitmap.
>
> Fixes: 4615fbc3788d ("genirq/irqdomain: Don't try to free an interrupt that has no mapping")
> Signed-off-by: Luo Jiaxing <luojiaxing@huawei.com>
> ---
>  drivers/irqchip/irq-gic-v3-its.c | 4 ++++
>  1 file changed, 4 insertions(+)
>
> diff --git a/drivers/irqchip/irq-gic-v3-its.c b/drivers/irqchip/irq-gic-v3-its.c
> index ed46e60..027f7ef 100644
> --- a/drivers/irqchip/irq-gic-v3-its.c
> +++ b/drivers/irqchip/irq-gic-v3-its.c
> @@ -3435,6 +3435,10 @@ static int its_alloc_device_irq(struct its_device *dev, int nvecs, irq_hw_number
>
>  	*hwirq = dev->event_map.lpi_base + idx;
>
> +	bitmap_clear(dev->event_map.lpi_map,
> +		     idx + nvecs,
> +		     roundup_pow_of_two(nvecs) - nvecs);
> +
>  	return 0;
>  }
What makes you think that the remaining LPIs are free to be released? Even if the end-point has requested a non-po2 number of MSIs, it could very well rely on the rest of it to be available (especially in the case of PCI Multi-MSI).
Have a look at the thread pointed out by John for a potential fix.
Thanks,
M.
On 2021/2/8 19:59, Marc Zyngier wrote:
> On 2021-02-08 10:58, Luo Jiaxing wrote:
> > The driver sets the device's LPI bitmap based on get_count_order(nvecs). This means that when the number of LPI interrupts is not a power of two, redundant bits are set in the LPI bitmap. However, when the interrupts are freed, these redundant bits are not cleared. As a result, the device fails to allocate the same number of interrupts next time.
> >
> > Therefore, clear the redundant bits set in the LPI bitmap.
> >
> > Fixes: 4615fbc3788d ("genirq/irqdomain: Don't try to free an interrupt that has no mapping")
> > Signed-off-by: Luo Jiaxing <luojiaxing@huawei.com>
> > ---
> >  drivers/irqchip/irq-gic-v3-its.c | 4 ++++
> >  1 file changed, 4 insertions(+)
> >
> > diff --git a/drivers/irqchip/irq-gic-v3-its.c b/drivers/irqchip/irq-gic-v3-its.c
> > index ed46e60..027f7ef 100644
> > --- a/drivers/irqchip/irq-gic-v3-its.c
> > +++ b/drivers/irqchip/irq-gic-v3-its.c
> > @@ -3435,6 +3435,10 @@ static int its_alloc_device_irq(struct its_device *dev, int nvecs, irq_hw_number
> >
> >  	*hwirq = dev->event_map.lpi_base + idx;
> >
> > +	bitmap_clear(dev->event_map.lpi_map,
> > +		     idx + nvecs,
> > +		     roundup_pow_of_two(nvecs) - nvecs);
> > +
> >  	return 0;
> >  }
>
> What makes you think that the remaining LPIs are free to be released?
I think that the LPI bitmap is used to mark the valid LPI interrupts allocated to the PCIe device.
Therefore, for the remaining LPIs, the ITS can reserve entries in the ITT table, but the bitmap does not need to be set.
Maybe my understanding is wrong, and I'm a little confused about the function of this bitmap.
> Even if the end-point has requested a non-po2 number of MSIs, it could very well rely on the rest of it to be available (especially in the case of PCI Multi-MSI).

Yes, you are right. But for Multi-MSI, does it mean that one PCIe device can own several MSI interrupts?
Another question: is it possible for a module driver to use these remaining LPIs?
For example, in my case:
I allocate 32 MSIs, 16 of which are affinity IRQs.
The MSI layer can only offer 20 MSIs because only 4 CPUs are online, so it creates 20 MSI descriptors.
The ITS creates an ITS device for this PCIe device and generates an ITT for 32 MSIs.
So the MSI layer provides 20 valid MSIs, but in the ITS, the LPI bitmap shows that 32 MSIs are allocated.
This logic is a bit strange and hard to understand.
> Have a look at the thread pointed out by John for a potential fix.
Sorry for missing that, I think it can fix my issue too, let me test it later.
Thanks
jiaxing
> Thanks,
>
>         M.
During debugging, we found that the return value of __irq_domain_alloc_irqs() is overwritten by the return value of a subsequent function. As a result, the clue needed to locate the failure is lost.
To improve debugging efficiency, add an error message that prints the return value of __irq_domain_alloc_irqs().
Signed-off-by: Luo Jiaxing <luojiaxing@huawei.com>
---
 kernel/irq/msi.c | 1 +
 1 file changed, 1 insertion(+)
diff --git a/kernel/irq/msi.c b/kernel/irq/msi.c
index b338d62..f8729b0 100644
--- a/kernel/irq/msi.c
+++ b/kernel/irq/msi.c
@@ -418,6 +418,7 @@ int __msi_domain_alloc_irqs(struct irq_domain *domain, struct device *dev,
 					       desc->affinity);
 		if (virq < 0) {
 			ret = -ENOSPC;
+			dev_err(dev, "failed to allocate irq, virq=%d\n", virq);
 			if (ops->handle_error)
 				ret = ops->handle_error(domain, desc, ret);
 			if (ops->msi_finish)
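The effect of the added print can be illustrated outside the kernel. This is a minimal sketch under stated assumptions: fake_domain_alloc(), alloc_one() and last_logged_virq are hypothetical names invented for illustration, with fprintf() standing in for dev_err(). It only shows that the caller sees -ENOSPC whatever the underlying error was, while the log keeps the original value.

```c
#include <assert.h>
#include <errno.h>
#include <stdio.h>

/* Hypothetical record of the last value logged, so the effect can be checked. */
static int last_logged_virq;

/*
 * Hypothetical stand-in for __irq_domain_alloc_irqs(): returns the given
 * negative errno on failure, or an arbitrary positive virq on success.
 */
static int fake_domain_alloc(int fail_code)
{
	assert(fail_code <= 0);
	return fail_code < 0 ? fail_code : 42;
}

/* Models the error path in the hunk above: the specific error is replaced. */
static int alloc_one(int fail_code)
{
	int virq = fake_domain_alloc(fail_code);

	if (virq < 0) {
		/* Log the original value before it is overwritten. */
		last_logged_virq = virq;
		fprintf(stderr, "failed to allocate irq, virq=%d\n", virq);
		return -ENOSPC; /* all the caller ever sees */
	}
	return 0;
}
```

Calling alloc_one(-ENOMEM) returns -ENOSPC, so without the message the caller cannot tell that the underlying failure was -ENOMEM.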
On 08/02/2021 10:58, Luo Jiaxing wrote:
> When fewer than 16 CPUs are online, we found that allocating 32 MSI interrupts (including 16 affinity interrupts) fails after the hisi_sas module is unloaded and then reloaded.
>
> Analysis shows that a bug exists in how the ITS releases interrupt resources. This patch set contains a bugfix patch and a patch that adds debugging information.
Please note that this issue has already been reported: https://lore.kernel.org/lkml/fd88ce05-8aee-5b1f-5ab6-be88fa53d3aa@huawei.com...
> Luo Jiaxing (2):
>   irqchip/gic-v3-its: don't set bitmap for LPI which user didn't allocate
>   genirq/msi: add an error print when __irq_domain_alloc_irqs() failed
>
>  drivers/irqchip/irq-gic-v3-its.c | 4 ++++
>  kernel/irq/msi.c                 | 1 +
>  2 files changed, 5 insertions(+)