On Mon, Aug 9, 2021 at 4:01 PM Nir Soffer <nsoffer(a)redhat.com> wrote:
>
> On Mon, Aug 9, 2021 at 2:42 PM Yedidyah Bar David <didi(a)redhat.com> wrote:
> >
> > On Mon, Aug 9, 2021 at 1:43 PM Nir Soffer <nsoffer(a)redhat.com> wrote:
> > >
> > > > On Mon, Aug 9, 2021 at 10:35 AM Yedidyah Bar David <didi(a)redhat.com> wrote:
> > > >
> > > > > On Sun, Aug 8, 2021 at 5:42 PM Code Review <gerrit(a)ovirt.org> wrote:
> > > > >
> > > > > From Jenkins CI <jenkins(a)ovirt.org>:
> > > > >
> > > > > Jenkins CI has posted comments on this change. ( https://gerrit.ovirt.org/c/ovirt-system-tests/+/115392 )
> > > > >
> > > > > Change subject: HE: Use node image
> > > > > ......................................................................
> > > > >
> > > > >
> > > > > Patch Set 13: Continuous-Integration-1
> > > > >
> > > > > Build Failed
> > > >
> > > > While trying to deactivate a host, the engine wanted to migrate a VM
> > > > (vm0) from host-0 to host-1. vdsm log of host-0 says:
> > > >
> > > > 2021-08-08 14:31:10,076+0000 ERROR (migsrc/cde311f9) [virt.vm]
> > > > (vmId='cde311f9-9a33-4eb9-8338-fa22ff49edc2') Failed to migrate
> > > > (migration:503)
> > > > Traceback (most recent call last):
> > > >   File "/usr/lib/python3.6/site-packages/vdsm/virt/migration.py", line 477, in _regular_run
> > > >     time.time(), machineParams
> > > >   File "/usr/lib/python3.6/site-packages/vdsm/virt/migration.py", line 578, in _startUnderlyingMigration
> > > >     self._perform_with_conv_schedule(duri, muri)
> > > >   File "/usr/lib/python3.6/site-packages/vdsm/virt/migration.py", line 667, in _perform_with_conv_schedule
> > > >     self._perform_migration(duri, muri)
> > > >   File "/usr/lib/python3.6/site-packages/vdsm/virt/migration.py", line 596, in _perform_migration
> > > >     self._migration_flags)
> > > >   File "/usr/lib/python3.6/site-packages/vdsm/virt/virdomain.py", line 159, in call
> > > >     return getattr(self._vm._dom, name)(*a, **kw)
> > > >   File "/usr/lib/python3.6/site-packages/vdsm/virt/virdomain.py", line 101, in f
> > > >     ret = attr(*args, **kwargs)
> > > >   File "/usr/lib/python3.6/site-packages/vdsm/common/libvirtconnection.py", line 131, in wrapper
> > > >     ret = f(*args, **kwargs)
> > > >   File "/usr/lib/python3.6/site-packages/vdsm/common/function.py", line 94, in wrapper
> > > >     return func(inst, *args, **kwargs)
> > > >   File "/usr/lib64/python3.6/site-packages/libvirt.py", line 2126, in migrateToURI3
> > > >     raise libvirtError('virDomainMigrateToURI3() failed')
> > > > libvirt.libvirtError: Unsafe migration: Migration without shared
> > > > storage is unsafe
> > >
> > > Please share the vm xml:
> > >
> > > sudo virsh -r dumpxml vm-name
> >
> > I think you should be able to find a dump of it in vdsm.log:
> >
> > https://jenkins.ovirt.org/job/ovirt-system-tests_standard-check-patch/186...
> >
> > I think the first line of starting a migration is:
> >
> > 2021-08-08 14:31:08,350+0000 DEBUG (jsonrpc/4) [jsonrpc.JsonRpcServer]
> > Calling 'VM.migrate' in bridge with {'vmID':
> > 'cde311f9-9a33-4eb9-8338-fa22ff49edc2', 'params':
> >
> > A few lines later:
> >
> > 2021-08-08 14:31:08,387+0000 DEBUG (migsrc/cde311f9)
> > [virt.metadata.Descriptor] dumped metadata for
> > cde311f9-9a33-4eb9-8338-fa22ff49edc2: <?xml version='1.0'
> > encoding='utf-8'?>
> > <vm>
> > <balloonTarget type="int">98304</balloonTarget>
>
> This is not the vm xml but the metadata xml.
>
> Looking at the logs on both hosts:
>
> [nsoffer@sparse ost]$ head -1 *vdsm.log
> ==> host0-vdsm.log <==
> 2021-08-08 13:16:04,676+0000 INFO (MainThread) [vds] (PID: 65169) I
> am the actual vdsm 4.40.80.3.12.git6d67b935b
> ost-he-basic-suite-master-host-0 (4.18.0-326.el8.x86_64) (vdsmd:162)
>
> ==> host1-vdsm.log <==
> 2021-08-08 15:40:54,367+0200 INFO (MainThread) [vds] (PID: 23005) I
> am the actual vdsm 4.40.80.4.5.git4309a3949
> ost-he-basic-suite-master-host-1 (4.18.0-326.el8.x86_64) (vdsmd:162)
>
> - The hosts' clocks are using different time zones (+0000 vs +0200) - is
> this intended?
> - You are testing different versions of vdsm - is this intended?
>
> We have about 60 errors:
> $ grep 'Migration without shared storage is unsafe' host0-vdsm.log | wc -l
> 60
>
> Looking at the first migration that failed, the vm xml is here:
>
> 2021-08-08 14:20:34,127+0000 INFO (vm/cde311f9) [virt.vm]
> (vmId='cde311f9-9a33-4eb9-8338-fa22ff49edc2') <?xml version='1.0'
> encoding='utf-8'?>
> <domain xmlns:ns0="http://libvirt.org/schemas/domain/qemu/1.0"
>     xmlns:ovirt-vm="http://ovirt.org/vm/1.0" type="kvm">
>     <name>vm0</name>
>     <uuid>cde311f9-9a33-4eb9-8338-fa22ff49edc2</uuid>
> ...
>
> The relevant parts for storage are:
>
>     <disk device="disk" snapshot="no" type="file">
>         <driver cache="none" error_policy="stop" io="threads"
>             iothread="1" name="qemu" type="qcow2" />
>         <source file="/rhev/data-center/7d97ea80-f849-11eb-ac79-5452d501341a/46fa5761-bb9e-46be-8f1c-35f4b03d0203/images/20002ad2-4a97-4d2f-b3fc-c103477b5b91/614abd56-4d4f-4412-aa2a-3f7bad2f3a87">
>             <seclabel model="dac" relabel="no" />
>         </source>
>         <target bus="virtio" dev="vda" />
>         <serial>20002ad2-4a97-4d2f-b3fc-c103477b5b91</serial>
>         <boot order="1" />
>         <alias name="ua-20002ad2-4a97-4d2f-b3fc-c103477b5b91" />
>         <address bus="0x05" domain="0x0000" function="0x0" slot="0x00" type="pci" />
>     </disk>
>     <disk device="disk" snapshot="no" type="block">
>         <driver cache="none" error_policy="stop" io="native" name="qemu" type="raw" />
>         <source dev="/dev/mapper/36001405bc9d94e4419b4b80a2f702e2f">
>             <seclabel model="dac" relabel="no" />
>         </source>
>         <target bus="scsi" dev="sda" />
>         <serial>738c8486-8929-44ec-9083-69327bde9c65</serial>
>         <alias name="ua-738c8486-8929-44ec-9083-69327bde9c65" />
>         <address bus="0" controller="0" target="0" type="drive" unit="0" />
>     </disk>
>
> So we have one qcow2 disk on file storage, and one direct lun.
>
> On the destination, the first migration attempt is here:
>
> 2021-08-08 16:31:08,437+0200 DEBUG (jsonrpc/2) [jsonrpc.JsonRpcServer]
> Calling 'VM.migrationCreate' in bridge with {'vmID':
> 'cde311f9-9a33-4eb9-8338-fa22ff49edc2', 'params': {'_srcDomXML':
> '<domain type=\'kvm\' id=\'6\' xmlns:qemu=\'http://libvirt.org/
> ...
>
> We prepare the qcow2 disk:
>
> 2021-08-08 16:31:09,313+0200 INFO (vm/cde311f9) [vdsm.api] FINISH
> prepareImage return={'path':
>
'/rhev/data-center/mnt/192.168.200.2:_exports_nfs_share1/46fa5761-bb9e-46be-8f1c-35f4b03d0203/images/20002ad2-4a97-4d2f-b3fc-c103477b5b91/614abd56-4d4f-4412-aa2a-3f7bad2f3a87',
> 'info': {'type': 'file', 'path':
>
'/rhev/data-center/mnt/192.168.200.2:_exports_nfs_share1/46fa5761-bb9e-46be-8f1c-35f4b03d0203/images/20002ad2-4a97-4d2f-b3fc-c103477b5b91/614abd56-4d4f-4412-aa2a-3f7bad2f3a87'},
> 'imgVolumesInfo': [{'domainID':
> '46fa5761-bb9e-46be-8f1c-35f4b03d0203', 'imageID':
> '20002ad2-4a97-4d2f-b3fc-c103477b5b91', 'volumeID':
> '1d3f07dc-b481-492f-a2a6-7c46689d82ba', 'path':
>
'/rhev/data-center/mnt/192.168.200.2:_exports_nfs_share1/46fa5761-bb9e-46be-8f1c-35f4b03d0203/images/20002ad2-4a97-4d2f-b3fc-c103477b5b91/1d3f07dc-b481-492f-a2a6-7c46689d82ba',
> 'leasePath':
'/rhev/data-center/mnt/192.168.200.2:_exports_nfs_share1/46fa5761-bb9e-46be-8f1c-35f4b03d0203/images/20002ad2-4a97-4d2f-b3fc-c103477b5b91/1d3f07dc-b481-492f-a2a6-7c46689d82ba.lease',
> 'leaseOffset': 0}, {'domainID':
> '46fa5761-bb9e-46be-8f1c-35f4b03d0203', 'imageID':
> '20002ad2-4a97-4d2f-b3fc-c103477b5b91', 'volumeID':
> '614abd56-4d4f-4412-aa2a-3f7bad2f3a87', 'path':
>
'/rhev/data-center/mnt/192.168.200.2:_exports_nfs_share1/46fa5761-bb9e-46be-8f1c-35f4b03d0203/images/20002ad2-4a97-4d2f-b3fc-c103477b5b91/614abd56-4d4f-4412-aa2a-3f7bad2f3a87',
> 'leasePath':
'/rhev/data-center/mnt/192.168.200.2:_exports_nfs_share1/46fa5761-bb9e-46be-8f1c-35f4b03d0203/images/20002ad2-4a97-4d2f-b3fc-c103477b5b91/614abd56-4d4f-4412-aa2a-3f7bad2f3a87.lease',
> 'leaseOffset': 0}, {'domainID':
> '46fa5761-bb9e-46be-8f1c-35f4b03d0203', 'imageID':
> '20002ad2-4a97-4d2f-b3fc-c103477b5b91', 'volumeID':
> 'a4309ef3-01bb-45db-8bf7-0f9498a7feeb', 'path':
>
'/rhev/data-center/mnt/192.168.200.2:_exports_nfs_share1/46fa5761-bb9e-46be-8f1c-35f4b03d0203/images/20002ad2-4a97-4d2f-b3fc-c103477b5b91/a4309ef3-01bb-45db-8bf7-0f9498a7feeb',
> 'leasePath':
'/rhev/data-center/mnt/192.168.200.2:_exports_nfs_share1/46fa5761-bb9e-46be-8f1c-35f4b03d0203/images/20002ad2-4a97-4d2f-b3fc-c103477b5b91/a4309ef3-01bb-45db-8bf7-0f9498a7feeb.lease',
> 'leaseOffset': 0}]} from=internal,
> task_id=f2dfbacb-154c-4f5b-b57d-affcbf419691 (api:54)
> 2021-08-08 16:31:09,314+0200 INFO (vm/cde311f9) [vds] prepared volume
> path:
/rhev/data-center/mnt/192.168.200.2:_exports_nfs_share1/46fa5761-bb9e-46be-8f1c-35f4b03d0203/images/20002ad2-4a97-4d2f-b3fc-c103477b5b91/614abd56-4d4f-4412-aa2a-3f7bad2f3a87
> (clientIF:518)
>
> And the direct lun:
>
> 2021-08-08 16:31:09,315+0200 INFO (vm/cde311f9) [vdsm.api] START
> appropriateDevice(guid='36001405bc9d94e4419b4b80a2f702e2f',
> thiefId='cde311f9-9a33-4eb9-8338-fa22ff49edc2', deviceType='mpath')
> from=internal, task_id=220f1c1f-caec-4327-a157-7a4fab3b54a5 (api:48)
> 2021-08-08 16:31:09,550+0200 INFO (vm/cde311f9) [vdsm.api] FINISH
> appropriateDevice return={'truesize': '21474836480',
'apparentsize':
> '21474836480', 'path':
> '/dev/mapper/36001405bc9d94e4419b4b80a2f702e2f'} from=internal,
> task_id=220f1c1f-caec-4327-a157-7a4fab3b54a5 (api:54)
> 2021-08-08 16:31:09,550+0200 INFO (vm/cde311f9) [vds] prepared volume
> path: /dev/mapper/36001405bc9d94e4419b4b80a2f702e2f (clientIF:518)
>
> The interesting thing is that the qcow2 disk is using a different path
> on the source and destination VMs:
>
> source:
> /rhev/data-center/7d97ea80-f849-11eb-ac79-5452d501341a/46fa5761-bb9e-46be-8f1c-35f4b03d0203/images/20002ad2-4a97-4d2f-b3fc-c103477b5b91/614abd56-4d4f-4412-aa2a-3f7bad2f3a87
>
> destination:
> /rhev/data-center/mnt/192.168.200.2:_exports_nfs_share1/46fa5761-bb9e-46be-8f1c-35f4b03d0203/images/20002ad2-4a97-4d2f-b3fc-c103477b5b91/614abd56-4d4f-4412-aa2a-3f7bad2f3a87
>
> On the source we have:
> /rhev/data-center/pool-id/domain-id/images/image-id/volume-id
>
> On the destination we have:
> /rhev/data-center/mnt/mountpoint/domain-id/images/image-id/volume-id
>
> Both lead to the same disk, but libvirt probably compared the strings
> and decided that we don't have shared storage.
>
> It may be a new validation in libvirt, or maybe the engine recently
> changed the way the disk path is added to the XML?
>
> In vdsm we use os.path.realpath() when we compare disk paths since we know
> that we can have different paths to the same volume.
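>
> A minimal sketch of that kind of check (not the actual vdsm code), using
> the two paths from this log - assuming it runs on the source host, where
> /rhev/data-center/<pool-id>/<domain-id> is a symlink to the mounted
> domain:
>
>     import os
>
>     a = ("/rhev/data-center/7d97ea80-f849-11eb-ac79-5452d501341a"
>          "/46fa5761-bb9e-46be-8f1c-35f4b03d0203/images"
>          "/20002ad2-4a97-4d2f-b3fc-c103477b5b91"
>          "/614abd56-4d4f-4412-aa2a-3f7bad2f3a87")
>     b = ("/rhev/data-center/mnt/192.168.200.2:_exports_nfs_share1"
>          "/46fa5761-bb9e-46be-8f1c-35f4b03d0203/images"
>          "/20002ad2-4a97-4d2f-b3fc-c103477b5b91"
>          "/614abd56-4d4f-4412-aa2a-3f7bad2f3a87")
>
>     # Plain string comparison - apparently what trips libvirt here.
>     print(a == b)                                      # False
>
>     # Comparing after resolving symlinks, as vdsm does for drive paths.
>     print(os.path.realpath(a) == os.path.realpath(b))  # True on the host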
>
> This kind of issue is likely to be reproducible without ost.

Trying to understand why the drive path is different on the source:

The source path was created by a snapshot:
2021-08-08 14:19:40,219+0000 INFO (tasks/9) [storage.Volume] Request
to create snapshot
20002ad2-4a97-4d2f-b3fc-c103477b5b91/614abd56-4d4f-4412-aa2a-3f7bad2f3a87
of volume
20002ad2-4a97-4d2f-b3fc-c103477b5b91/1d3f07dc-b481-492f-a2a6-7c46689d82ba
with capacity 117440512 (fileVolume:528)

Vdsm reports the volume as:
2021-08-08 14:19:56,944+0000 INFO (jsonrpc/6) [vdsm.api] FINISH
getVolumeInfo return={'info': {'uuid':
'614abd56-4d4f-4412-aa2a-3f7bad2f3a87', 'type': 'SPARSE',
'format':
'COW', 'disktype': 'DATA', 'voltype': 'LEAF',
'capacity': '117440512',
'parent': '1d3f07dc-b481-492f-a2a6-7c46689d82ba', 'description':
'',
'pool': '', 'domain':
'46fa5761-bb9e-46be-8f1c-35f4b03d0203', 'image':
'20002ad2-4a97-4d2f-b3fc-c103477b5b91', 'ctime': '1628432381',
'mtime': '0', 'legality': 'LEGAL', 'generation':
0, 'apparentsize':
'196616', 'truesize': '200704', 'status': 'OK',
'lease': {'path':
'/rhev/data-center/mnt/192.168.200.2:_exports_nfs_share1/46fa5761-bb9e-46be-8f1c-35f4b03d0203/images/20002ad2-4a97-4d2f-b3fc-c103477b5b91/614abd56-4d4f-4412-aa2a-3f7bad2f3a87.lease',
'offset': 0, 'owners': [], 'version': None}, 'children':
[]}}
from=::ffff:192.168.200.61,44352,
flow_id=fe1c52bf-a1ea-462f-ac32-14cd8f16c411,
task_id=14914c43-a38e-4692-9753-78ac2ddfbe5d (api:54)

(Using the /rhev/data-center/mnt/mountpoint/... path)

The image was prepared later:
2021-08-08 14:20:32,047+0000 INFO (vm/cde311f9) [vdsm.api] START
prepareImage(sdUUID='46fa5761-bb9e-46be-8f1c-35f4b03d0203',
spUUID='7d97ea80-f849-11eb-ac79-5452d501341a',
imgUUID='20002ad2-4a97-4d2f-b3fc-c103477b5b91',
leafUUID='614abd56-4d4f-4412-aa
...
2021-08-08 14:20:32,726+0000 INFO (vm/cde311f9) [vdsm.api] FINISH
prepareImage return={'path':
'/rhev/data-center/mnt/192.168.200.2:_exports_nfs_share1/46fa5761-bb9e-46be-8f1c-35f4b03d0203/images/20002ad2-4a97-4d2f-b3fc-c103477b5b91/614abd56-4d4f-4412-aa2a-3f7bad2f3a87',
'info': {'type': 'file', 'path':
'/rhev/data-center/mnt/192.168.200.2:_exports_nfs_share1/46fa5761-bb9e-46be-8f1c-35f4b03d0203/images/20002ad2-4a97-4d2f-b3fc-c103477b5b91/614abd56-4d4f-4412-aa2a-3f7bad2f3a87'},
'imgVolumesInfo': [{'domainID': '46fa5761-bb9e-46be-8f1c-35f4b03d0203',
'imageID': '20002ad2-4a97-4d2f-b3fc-c103477b5b91', 'volumeID':
'1d3f07dc-b481-492f-a2a6-7c46689d82ba', 'path':
'/rhev/data-center/mnt/192.168.200.2:_exports_nfs_share1/46fa5761-bb9e-46be-8f1c-35f4b03d0203/images/20002ad2-4a97-4d2f-b3fc-c103477b5b91/1d3f07dc-b481-492f-a2a6-7c46689d82ba',
'leasePath':
'/rhev/data-center/mnt/192.168.200.2:_exports_nfs_share1/46fa5761-bb9e-46be-8f1c-35f4b03d0203/images/20002ad2-4a97-4d2f-b3fc-c103477b5b91/1d3f07dc-b481-492f-a2a6-7c46689d82ba.lease',
'leaseOffset': 0}, {'domainID': '46fa5761-bb9e-46be-8f1c-35f4b03d0203',
'imageID': '20002ad2-4a97-4d2f-b3fc-c103477b5b91', 'volumeID':
'614abd56-4d4f-4412-aa2a-3f7bad2f3a87', 'path':
'/rhev/data-center/mnt/192.168.200.2:_exports_nfs_share1/46fa5761-bb9e-46be-8f1c-35f4b03d0203/images/20002ad2-4a97-4d2f-b3fc-c103477b5b91/614abd56-4d4f-4412-aa2a-3f7bad2f3a87',
'leasePath':
'/rhev/data-center/mnt/192.168.200.2:_exports_nfs_share1/46fa5761-bb9e-46be-8f1c-35f4b03d0203/images/20002ad2-4a97-4d2f-b3fc-c103477b5b91/614abd56-4d4f-4412-aa2a-3f7bad2f3a87.lease',
'leaseOffset': 0}, {'domainID': '46fa5761-bb9e-46be-8f1c-35f4b03d0203',
'imageID': '20002ad2-4a97-4d2f-b3fc-c103477b5b91', 'volumeID':
'a4309ef3-01bb-45db-8bf7-0f9498a7feeb', 'path':
'/rhev/data-center/mnt/192.168.200.2:_exports_nfs_share1/46fa5761-bb9e-46be-8f1c-35f4b03d0203/images/20002ad2-4a97-4d2f-b3fc-c103477b5b91/a4309ef3-01bb-45db-8bf7-0f9498a7feeb',
'leasePath':
'/rhev/data-center/mnt/192.168.200.2:_exports_nfs_share1/46fa5761-bb9e-46be-8f1c-35f4b03d0203/images/20002ad2-4a97-4d2f-b3fc-c103477b5b91/a4309ef3-01bb-45db-8bf7-0f9498a7feeb.lease',
'leaseOffset': 0}]} from=internal,
task_id=92af0386-5c0f-4495-9af2-9d5c47d02c48 (api:54)
2021-08-08 14:20:32,726+0000 INFO (vm/cde311f9) [vds] prepared volume path:
/rhev/data-center/mnt/192.168.200.2:_exports_nfs_share1/46fa5761-bb9e-46be-8f1c-35f4b03d0203/images/20002ad2-4a97-4d2f-b3fc-c103477b5b91/614abd56-4d4f-4412-aa2a-3f7bad2f3a87
(clientIF:518)

And prepareImage returns the /rhev/data-center/mnt/... path - this path
should be used by the VM!

But then we see this suspicious log:
2021-08-08 14:20:33,975+0000 INFO (vm/cde311f9) [virt.vm]
(vmId='cde311f9-9a33-4eb9-8338-fa22ff49edc2') drive 'vda' path:
'file=/rhev/data-center/7d97ea80-f849-11eb-ac79-5452d501341a/46fa5761-bb9e-46be-8f1c-35f4b03d0203/images/20002ad2-4a97-4d2f-b3fc-c103477b5b91/614abd56-4d4f-4412-aa2a-3f7bad2f3a87'
->
'*file=/rhev/data-center/mnt/192.168.200.2:_exports_nfs_share1/46fa5761-bb9e-46be-8f1c-35f4b03d0203/images/20002ad2-4a97-4d2f-b3fc-c103477b5b91/614abd56-4d4f-4412-aa2a-3f7bad2f3a87'
(storagexml:341)
And we see the wrong paths in the vm xml:
    <disk device="disk" snapshot="no" type="file">
        <driver cache="none" error_policy="stop" io="threads"
            iothread="1" name="qemu" type="qcow2" />
        <source file="/rhev/data-center/7d97ea80-f849-11eb-ac79-5452d501341a/46fa5761-bb9e-46be-8f1c-35f4b03d0203/images/20002ad2-4a97-4d2f-b3fc-c103477b5b91/614abd56-4d4f-4412-aa2a-3f7bad2f3a87">
            <seclabel model="dac" relabel="no" />
        </source>

So it looks like code in storagexml is *replacing* the path returned from
storage (/rhev/data-center/mnt/...) with another path
(/rhev/data-center/pool-id/...), which is incorrect and confuses libvirt.

I don't think we changed anything in this area in vdsm, so this is likely a
new validation added in libvirt, revealing an old bug in vdsm/engine.
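
Purely as an illustration (this helper does not exist in vdsm), one could
imagine a pre-migration sanity check along these lines: compare every disk
<source> path in the generated domain XML with the paths storage actually
prepared, and report drives whose path only matches after resolving
symlinks - exactly the case libvirt's string comparison rejects:

    import os
    import xml.etree.ElementTree as ET

    def find_rewritten_drive_paths(dom_xml, prepared_paths):
        """Return <source> paths from the domain XML that are not the
        literal paths storage prepared, but resolve to the same files."""
        prepared = set(prepared_paths)
        resolved = {os.path.realpath(p) for p in prepared_paths}
        rewritten = []
        for source in ET.fromstring(dom_xml).iter("source"):
            path = source.get("file") or source.get("dev")
            if path is None or path in prepared:
                continue
            if os.path.realpath(path) in resolved:
                # Same volume, different string - e.g. the
                # /rhev/data-center/pool-id/... symlink instead of the
                # /rhev/data-center/mnt/... path prepareImage returned.
                rewritten.append(path)
        return rewritten

Run on the source host, this would have returned the
/rhev/data-center/7d97ea80-.../... path for vda above, since prepareImage
returned the /rhev/data-center/mnt/... path and both resolve to the same
NFS file.
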
For the record, I rebased the OST patch, which triggered a check-patch
run, which passed.

It didn't have to migrate vm0 while deactivating the host (during
test_012, at least) - it was probably already on host-1, not sure why. So
it's not proof that the above issue was "fixed".
--
Didi