On Tue, Apr 20, 2021 at 08:50:12PM +0800, liulongfang wrote:
On 2021/4/19 20:33, Jason Gunthorpe wrote:
On Mon, Apr 19, 2021 at 08:24:40PM +0800, liulongfang wrote:
I'm also confused how this works securely at all, as a general rule a VFIO PCI driver cannot access the MMIO memory of the function it is planning to assign to the guest. There is a lot of danger that the guest could access that MMIO space one way or another.
The VF's MMIO memory is divided into two parts: one is the guest part, and the other is the live migration part. They do not affect each other, so there is no security problem.
AFAIK there are several scenarios where a guest can access this MMIO memory using DMA even if it is not mapped into the guest for CPU access.
The hardware divides the VF's MMIO memory into two parts. The live migration driver in the host uses the live migration part, and the device driver in the guest uses the guest part. Each driver obtains the address of its own part of the VF's MMIO memory. Although the two parts are contiguous on the hardware device, because of how the driver functions are defined, neither driver operates on the other part's memory, and the device hardware responds to commands for the two parts independently.
It doesn't matter, the memory is still under the same PCI BDF and VFIO supports scenarios where devices in the same IOMMU group are not isolated from each other.
This is why the granule of isolation is a PCI BDF - VFIO directly blocks kernel drivers from attaching to PCI BDFs that are not completely isolated from the VFIO BDF.
Bypassing this prevention and attaching a kernel driver directly to the same BDF being exposed to the guest breaks that isolation model.
So, I still don't understand what the security risk you are talking about is, or what you think the security design should look like. Can you elaborate on it?
Each security domain must have its own PCI BDF.
The migration control registers must be on a different VF from the VF being plugged into a guest and the two VFs have to be in different IOMMU groups to ensure they are isolated from each other.
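For reference, checking that requirement from userspace is just a matter of comparing the iommu_group links that sysfs exposes for the two functions. A minimal sketch (the BDFs below are placeholders, not real devices):

/*
 * Resolve /sys/bus/pci/devices/<bdf>/iommu_group for two functions and
 * compare the group paths; different groups means the IOMMU can keep
 * them isolated from each other.
 */
#include <limits.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

static int iommu_group_of(const char *bdf, char *group, size_t len)
{
	char path[PATH_MAX];
	ssize_t n;

	snprintf(path, sizeof(path),
		 "/sys/bus/pci/devices/%s/iommu_group", bdf);
	n = readlink(path, group, len - 1);
	if (n < 0)
		return -1;
	group[n] = '\0';
	return 0;
}

int main(void)
{
	char a[PATH_MAX], b[PATH_MAX];

	/* Placeholder BDFs: the migration VF and the guest-assigned VF. */
	if (iommu_group_of("0000:75:00.1", a, sizeof(a)) ||
	    iommu_group_of("0000:75:00.2", b, sizeof(b)))
		return 1;

	printf("%s\n", strcmp(a, b) ? "different groups - isolated"
				    : "same group - NOT isolated");
	return 0;
}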
Jason
On Tue, 20 Apr 2021 09:59:57 -0300 Jason Gunthorpe jgg@nvidia.com wrote:
On Tue, Apr 20, 2021 at 08:50:12PM +0800, liulongfang wrote:
On 2021/4/19 20:33, Jason Gunthorpe wrote:
On Mon, Apr 19, 2021 at 08:24:40PM +0800, liulongfang wrote:
I'm also confused how this works securely at all, as a general rule a VFIO PCI driver cannot access the MMIO memory of the function it is planning to assign to the guest. There is a lot of danger that the guest could access that MMIO space one way or another.
The VF's MMIO memory is divided into two parts: one is the guest part, and the other is the live migration part. They do not affect each other, so there is no security problem.
AFAIK there are several scenarios where a guest can access this MMIO memory using DMA even if it is not mapped into the guest for CPU access.
The hardware divides the VF's MMIO memory into two parts. The live migration driver in the host uses the live migration part, and the device driver in the guest uses the guest part. Each driver obtains the address of its own part of the VF's MMIO memory. Although the two parts are contiguous on the hardware device, because of how the driver functions are defined, neither driver operates on the other part's memory, and the device hardware responds to commands for the two parts independently.
It doesn't matter, the memory is still under the same PCI BDF and VFIO supports scenarios where devices in the same IOMMU group are not isolated from each other.
This is why the granule of isolation is a PCI BDF - VFIO directly blocks kernel drivers from attaching to PCI BDFs that are not completely isolated from the VFIO BDF.
Bypassing this prevention and attaching a kernel driver directly to the same BDF being exposed to the guest breaks that isolation model.
So, I still don't understand what the security risk you are talking about is, or what you think the security design should look like. Can you elaborate on it?
Each security domain must have its own PCI BDF.
The migration control registers must be on a different VF from the VF being plugged into a guest and the two VFs have to be in different IOMMU groups to ensure they are isolated from each other.
I think that's a solution, I don't know if it's the only solution. AIUI, the issue here is that we have a device-specific kernel driver extending vfio-pci with migration support for this device by using an MMIO region of the same device. This is susceptible to DMA manipulation by the user device. Whether that's a security issue or not depends on how the user can break the device. If the scope is limited to breaking their own device, they can do that any number of ways and it's not very interesting. If the user can manipulate device state in order to trigger an exploit of the host-side kernel driver, that's obviously more of a problem.
The other side of this is that if migration support can be implemented entirely within the VF using this portion of the device MMIO space, why do we need the host kernel to support this rather than implementing it in userspace? For example, QEMU could know about this device, manipulate the BAR size to expose only the operational portion of MMIO to the VM and use the remainder to support migration itself.
I'm afraid that just like mdev, the vfio migration uAPI is going to be used as an excuse to create kernel drivers simply to be able to make use of that uAPI. I haven't looked at this driver to know if it has some other reason to exist beyond what could be done through vfio-pci and userspace migration support. Thanks,
Alex
On Tue, Apr 20, 2021 at 04:04:57PM -0600, Alex Williamson wrote:
The migration control registers must be on a different VF from the VF being plugged into a guest and the two VFs have to be in different IOMMU groups to ensure they are isolated from each other.
I think that's a solution, I don't know if it's the only solution.
Maybe, but that approach does offer DMA access for the migration. For instance, to implement something that needs a lot of data, like migrating a complicated device state, or dirty page tracking, or whatever.
This driver seems very simple - it has only 17 state elements - and doesn't use DMA.
I can't quite tell, but does this pass the hypervisor BAR into the guest anyhow? That would certainly be an adequate statement that it is safe, assuming someone did a good security analysis.
ways and it's not very interesting. If the user can manipulate device state in order to trigger an exploit of the host-side kernel driver, that's obviously more of a problem.
Well, for instance, we have an implementation of (VFIO_DEVICE_STATE_SAVING | VFIO_DEVICE_STATE_RUNNING) which means the guest CPUs are still running and a hostile guest can be manipulating the device.
But this driver is running code, like vf_qm_state_pre_save() in this state. Looks very suspicious.
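For context, under the v1 migration uAPI in use here, device_state in the migration region header is a bitmask, and SAVING|RUNNING is exactly the pre-copy window where the guest still owns the device. A rough sketch of that window (function names are illustrative, not the driver's):

#include <linux/types.h>
#include <linux/vfio.h>

/*
 * v1 uAPI bits from include/uapi/linux/vfio.h at the time: SAVING and
 * RUNNING may be set together, meaning pre-copy - the guest continues
 * to drive the device while the host driver reads state out of it.
 */
static bool in_pre_copy(u32 device_state)
{
	return (device_state & VFIO_DEVICE_STATE_SAVING) &&
	       (device_state & VFIO_DEVICE_STATE_RUNNING);
}

static int handle_state_change(u32 device_state)
{
	if (in_pre_copy(device_state)) {
		/*
		 * Anything done here (e.g. a pre-save routine polling
		 * device mailboxes) races with a still-running, possibly
		 * hostile guest.
		 */
	}
	return 0;
}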
One quick attack I can imagine is to use the guest CPU to DoS the migration and permanently block it, e.g. by causing qm_mb() or other looping functions to fail.
There may be worse things possible, it is a bit hard to tell just from the code.
.. also drivers should not be open coding ARM assembly as in qm_mb_write()
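The direction implied here, sketched with assumed names (the real interface may differ): keep the one arch-specific mailbox write inside the vendor's core QM driver and have it export a plain C helper, so the vfio migration driver calls that instead of repeating the ARM64 stp sequence:

#include <linux/types.h>

struct hisi_qm;		/* opaque; owned by the core QM driver */

/*
 * Assumed helper: declared in a shared vendor header, defined and
 * exported by the core QM driver, which is the only place that knows
 * how to perform the 128-bit mailbox MMIO write.
 */
int hisi_qm_mb(struct hisi_qm *qm, u8 cmd, dma_addr_t dma_addr,
	       u16 queue, bool op);

/* vfio migration driver side: no inline assembly needed. */
static int vf_qm_send_mb(struct hisi_qm *qm, u8 cmd, dma_addr_t addr,
			 u16 queue)
{
	return hisi_qm_mb(qm, cmd, addr, queue, false);
}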
.. and also, code cannot randomly call pci_get_drvdata() on a struct device it isn't attached to without having verified that the right driver is bound, or locking it correctly.
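What that usually looks like, sketched with assumed names for the PF driver and its drvdata type: take the device lock and confirm the expected driver is really bound before trusting pci_get_drvdata() on a device this module does not own:

#include <linux/device.h>
#include <linux/errno.h>
#include <linux/pci.h>

struct hisi_qm;					/* opaque vendor state */
extern struct pci_driver hisi_zip_pci_driver;	/* assumed PF driver   */

static int with_pf_qm(struct pci_dev *pf_dev,
		      int (*fn)(struct hisi_qm *qm))
{
	int ret = -ENODEV;

	device_lock(&pf_dev->dev);
	/* Only trust drvdata if the driver we expect is actually bound. */
	if (pf_dev->dev.driver == &hisi_zip_pci_driver.driver)
		ret = fn(pci_get_drvdata(pf_dev));
	device_unlock(&pf_dev->dev);

	return ret;
}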
manipulate the BAR size to expose only the operational portion of MMIO to the VM and use the remainder to support migration itself. I'm afraid that just like mdev, the vfio migration uAPI is going to be used as an excuse to create kernel drivers simply to be able to make use of that uAPI.
I thought that is the general direction people had agreed on during the IDXD mdev discussion?
People want the IOCTLs from VFIO to be the single API to program all the VMMs to and to not implement user space drivers..
This actually seems like a great candidate for a userspace driver.
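To make that concrete, a rough userspace sketch against the existing vfio-pci uAPI (device fd setup elided, and the half-and-half split is an assumption for illustration): query the BAR's region info, mmap only the operational portion for the guest, and keep the rest for the VMM's own migration handling:

#include <linux/vfio.h>
#include <stdint.h>
#include <stdio.h>
#include <sys/ioctl.h>
#include <sys/mman.h>

int map_split_bar(int device_fd, int bar_index)
{
	struct vfio_region_info info = {
		.argsz = sizeof(info),
		.index = VFIO_PCI_BAR0_REGION_INDEX + bar_index,
	};
	uint64_t guest_size;
	void *guest_part;

	if (ioctl(device_fd, VFIO_DEVICE_GET_REGION_INFO, &info) < 0)
		return -1;
	if (!(info.flags & VFIO_REGION_INFO_FLAG_MMAP))
		return -1;

	/* Assumption for illustration: the guest-visible half comes first. */
	guest_size = info.size / 2;

	guest_part = mmap(NULL, guest_size, PROT_READ | PROT_WRITE,
			  MAP_SHARED, device_fd, info.offset);
	if (guest_part == MAP_FAILED)
		return -1;

	/*
	 * guest_part is what would be handed to the VM as a (shrunk) BAR;
	 * the remainder, at info.offset + guest_size, stays private to
	 * the VMM for migration control.
	 */
	printf("mapped %llu bytes of BAR%d for the guest\n",
	       (unsigned long long)guest_size, bar_index);
	return 0;
}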
I would like to know that we are still settled on this direction, as the mlx5 drivers we are working on also have some complicated option to be user space only.
Jason
On 2021/4/21 7:18, Jason Gunthorpe wrote:
On Tue, Apr 20, 2021 at 04:04:57PM -0600, Alex Williamson wrote:
The migration control registers must be on a different VF from the VF being plugged into a guest and the two VFs have to be in different IOMMU groups to ensure they are isolated from each other.
I think that's a solution, I don't know if it's the only solution.
Maybe, but that approach does offer DMA access for the migration. For instance, to implement something that needs a lot of data, like migrating a complicated device state, or dirty page tracking, or whatever.
This driver seems very simple - it has only 17 state elements - and doesn't use DMA.
Yes, the addresses this driver operates on are MMIO addresses, not DMA addresses, but the hardware's internal DMA addresses are part of the data that gets migrated.
I can't quite tell, but does this pass the hypervisor BAR into the guest anyhow? That would certainly be an adequate statement that it is safe, assuming someone did a good security analysis.
ways and it's not very interesting. If the user can manipulate device state in order to trigger an exploit of the host-side kernel driver, that's obviously more of a problem.
Well, for instance, we have an implementation of (VFIO_DEVICE_STATE_SAVING | VFIO_DEVICE_STATE_RUNNING) which means the guest CPUs are still running and a hostile guest can be manipulating the device.
But this driver is running code, like vf_qm_state_pre_save() in this state. Looks very suspicious.
One quick attack I can imagine is to use the guest CPU to DoS the migration and permanently block it, e.g. by causing qm_mb() or other looping functions to fail.
There may be worse things possible, it is a bit hard to tell just from the code.
.. also drivers should not be open coding ARM assembly as in qm_mb_write()
OK, this code needs to be encapsulated and should not appear in this driver.
.. and also, code cannot randomly call pci_get_drvdata() on a struct device it isn't attached to without having verified that the right driver is bound, or locking it correctly.
Yes, this call needs to be placed behind an encapsulated interface, and access protection needs to be added.
manipulate the BAR size to expose only the operational portion of MMIO to the VM and use the remainder to support migration itself. I'm afraid that just like mdev, the vfio migration uAPI is going to be used as an excuse to create kernel drivers simply to be able to make use of that uAPI.
I thought that is the general direction people had agreed on during the IDXD mdev discussion?
People want the IOCTLs from VFIO to be the single API to program all the VMMs to and to not implement user space drivers..
This actually seems like a great candidate for a userspace driver.
I would like to know that we are still settled on this direction, as the mlx5 drivers we are working on also have some complicated option to be user space only.
Jason