Re: [PATCH v6 2/5] hisi_acc_vfio_pci: modify the register location of the XQC address

13 May 2024


      
...
-----Original Message-----
From: liulongfang <liulongfang@huawei.com>
Sent: Monday, May 13, 2024 9:16 AM
To: Alex Williamson <alex.williamson@redhat.com>; Shameerali Kolothum
Thodi <shameerali.kolothum.thodi@huawei.com>
Cc: jgg@nvidia.com; Jonathan Cameron <jonathan.cameron@huawei.com>;
kvm@vger.kernel.org; linux-kernel@vger.kernel.org;
linuxarm@openeuler.org
Subject: Re: [PATCH v6 2/5] hisi_acc_vfio_pci: modify the register location of
the XQC address
On 2024/5/9 22:29, Alex Williamson wrote:
...
On Thu, 9 May 2024 09:37:51 +0000
Shameerali Kolothum Thodi <shameerali.kolothum.thodi@huawei.com>
wrote:
...
...
-----Original Message-----
From: Alex Williamson <alex.williamson@redhat.com>
Sent: Wednesday, May 8, 2024 7:00 PM
To: liulongfang <liulongfang@huawei.com>
Cc: jgg@nvidia.com; Shameerali Kolothum Thodi
<shameerali.kolothum.thodi@huawei.com>; Jonathan Cameron
<jonathan.cameron@huawei.com>; kvm@vger.kernel.org; linux-
kernel@vger.kernel.org; linuxarm@openeuler.org
Subject: Re: [PATCH v6 2/5] hisi_acc_vfio_pci: modify the register
location of
the XQC address
[...]
...
...
HiSilicon accelerator equipment can perform general services after
completing live migration.
This kind of business is executed through the user mode driver and only
needs to use SQE and CQE.
At the same time, this device can also perform kernel-mode services in
the
VM through the crypto
subsystem. This kind of service requires the use of EQE.
Finally, if the device is abnormal, the driver needs to perform a device
reset, and AEQE needs to
be used in this case.
Therefore, a complete device live migration function needs to ensure
that
device functions are
normal in all these scenarios.
Therefore, this data still needs to be migrated.
Ok, I had jumped to an in-kernel host driver in reference to "kernel
mode" rather than a guest kernel.  Migrating with bad data only affects
the current configuration of the device, reloading a guest driver to
update these registers or a reset of the device would allow proper
operation of the device, correct?
Yes, after talking to Longfang, the device RAS will trigger a reset and
would function after reset.
...
But I think this still isn't really a complete solution, we know
there's a bug in the migration data stream, so not only would we fix
the data stream, but I think we should also take measures to prevent
loading a known bad data stream.  AIUI migration of this device while
running in kernel mode (ie. a kernel driver within a guest VM) is
broken.  Therefore, the least we can do in a new kernel, knowing that
there was previously a bug in the migration data stream, is to fail to
load that migration data because it risks this scenario where the
device is broken after migration.  Shouldn't we then also increment a
migration version field in the data stream to block migrations that
risk this breakage, or barring that, change the magic data field to
prevent the migration?  Thanks,
Ok. We could add a new ACC_DEV_MAGIC_V2 and prevent the migration
in vf_qm_check_match(). The only concern here is that, it will completely
block old kernel to new kernel migration and since we can recover the
device after the reset whether it is too restrictive or not.
What's the impact to the running driver, kernel or userspace, if the
device is reset?  Migration is intended to be effectively transparent
If the device is reset, the user's task needs to be restarted.
If an exception has been detected, the best way is not to migrate.
...
to the driver.  If the driver stalls and needs to reset the device,
what has the migration driver accomplished versus an offline migration?
If there's a way to detect from the migration data if the device is
running in kernel mode or user mode then you could potentially accept
and send v1 magic conditional that the device is in user mode and
require v2 magic for any migration where the device is in kernel mode.
This all adds complication though and seems like it has corner cases
where we might allow migration to an old kernel that might trap the
device there if the use case changes.
The driver does not support checking whether the device is running in
kernel mode or user mode.
Moreover, the device supports user-mode services and kernel-mode services
to run at the same time.
...
Essentially it comes down to what should the migration experience be
and while restricting old->new and new->old migration is undesirable,
it seems old->old migration is effectively already broken anyway.  As
you consider a v2 magic, perhaps consider how the migration data
structure might be improved overall to better handle new features and
bugs.  Thanks,
We discussed a plan:
Update ACC_DEV_MAGIC to ACC_DEV_MAGIC_VERSION and configure its
last byte
as version information:
/* QM match information, last byte is version number */
#define ACC_DEV_MAGIC_VERSION	0XACCDEVFEEDCAFE01
Oops..cant have V there. But the idea is replace magic with last byte
as version info which can be used in future for handling bugs/features
etc.

Thanks,
Shameer

Shameerali Kolothum Thodi

tags

participants (1)