On 2021/3/26 12:26, Tony W Wang-oc wrote:
On 26/03/2021 11:22, Hanjun Guo wrote:
On 2021/3/25 18:10, LeoLiu-oc wrote:
This bug is found in Zhaoxin platform, but it's a commom code bug. Fail sequence: step1: Unbind UHCI controller from native driver; step2: Bind UHCI controller to vfio-pci, which will put UHCI controller in one vfio group's device list and set UHCI's dev->driver_data to struct vfio-pci(for UHCI) step3: Unbind EHCI controller from native driver, will try to tell UHCI native driver that "I'm removed by set companion_hcd->self.hs_companion to NULL. However, companion_hcd get from UHCI's dev->driver_data that has modified by vfio-pci already.So, the vfio-pci structure will be damaged! step4: Bind EHCI controller to vfio-pci driver, which will put EHCI controller in the same vfio group as UHCI controller; ... ... step5: Unbind UHCI controller from vfio-pci, which will delete UHCI from vfio group' device list that has been damaged in step 3. So,delete operation can random result into a NULL pointer dereference with the below stack dump. step6: Bind UHCI controller to native driver; step7: Unbind EHCI controller from vfio-pci, which will try to remove EHCI controller from the vfio group; step8: Bind EHCI controller to native driver;
[...]
Using virt-manager/qemu to boot guest os, we can see the same fail sequence!
Fix this by check for UHCI driver loaded or not before modifiy UHCI's dev->driver_data, which will happen in ehci native driver probe/remove.
This patch was submitted to mainline kernel but not accepted by upstream maintainer whose reason is "Given that it's currently needed in only one place, it seems reasonable to leave this as a "gentlemen's agreement" in userspace for the time being instead of adding it to the kernel."
We think the kernel driver should fix this bug regardless of userspaces behavior.
https://lkml.org/lkml/2020/7/22/493 Signed-off-by: LeoLiu-oc LeoLiu-oc@zhaoxin.com
drivers/usb/core/hcd-pci.c | 7 +++++++ 1 file changed, 7 insertions(+)
diff --git a/drivers/usb/core/hcd-pci.c b/drivers/usb/core/hcd-pci.c index 7537681355f6..57ac942acf12 100644 --- a/drivers/usb/core/hcd-pci.c +++ b/drivers/usb/core/hcd-pci.c @@ -34,6 +34,8 @@ static DECLARE_RWSEM(companions_rwsem); #define CL_OHCI PCI_CLASS_SERIAL_USB_OHCI #define CL_EHCI PCI_CLASS_SERIAL_USB_EHCI
+#define PCI_DEV_DRV_FLAG 2
pretty hack :)
static inline int is_ohci_or_uhci(struct pci_dev *pdev) { return pdev->class == CL_OHCI || pdev->class == CL_UHCI; @@ -69,6 +71,9 @@ static void for_each_companion(struct pci_dev *pdev, struct usb_hcd *hcd, companion->class != CL_EHCI) continue;
+ if (!(companion->priv_flags & PCI_DEV_DRV_FLAG)) + continue;
Why not use the unique PCI vendor ID for Zhaoxin here?
It's a common case and not only for Zhaoxin.
I can see the problem now, but the way of fix this issue is hack, we can't manipulate the PCI priv_flags in USB core, I'm not sure if it will introduce other issue, can you please explain more about why it works but have no side effect?
Thanks Hanjun