Hi,
It appears that O365 has trouble delivering mails to this list, so two
earlier mails of mine are still somewhere in a queue and may yet be delivered.
This mail combines the content of three successive mails; I apologize for
the format.
On 18-7-2019 at 11:20, Martijn Grendelman wrote:
On 18-7-2019 at 10:16, Martijn Grendelman wrote:
> Hi,
>
> For the first time in many months I have run into some trouble with
> oVirt (4.3.4.3) and I need some help.
>
> Yesterday, I noticed one of my iSCSI storage domains was almost full,
> and tried to move a disk image off of it, to another domain. This
> failed, and somewhere in the process, the whole storage domain went
> to status 'Inactive'.
>
> From engine.log:
>
> 2019-07-17 16:30:35,319+02 INFO
> [org.ovirt.engine.core.vdsbroker.irsbroker.IrsProxy]
> (EE-ManagedThreadFactory-engine-Thread-1836383) [] starting
> processDomainRecovery for domain
> '875847b6-29a4-4419-be92-9315f4435429:HQST0_ISCSI02'.
> 2019-07-17 16:30:35,337+02 ERROR
> [org.ovirt.engine.core.vdsbroker.irsbroker.IrsProxy]
> (EE-ManagedThreadFactory-engine-Thread-1836383) [] Domain
> '875847b6-29a4-4419-be92-9315f4435429:HQST0_ISCSI02' was reported
> by all hosts in status UP as problematic. Moving the domain to
> NonOperational.
> 2019-07-17 16:30:35,410+02 WARN
> [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector]
> (EE-ManagedThreadFactory-engine-Thread-1836383) [5f6fd35e]
> EVENT_ID: SYSTEM_DEACTIVATED_STORAGE_DOMAIN(970), Storage Domain
> HQST0_ISCSI02 (Data Center ISAAC01) was deactivated by system
> because it's not visible by any of the hosts.
>
> The thing is, the domain is still functional on all my hosts. It
> carries over 50 disks, and all involved VMs are up and running, and
> don't seem to have any problems. Also, 'iscsiadm' on all hosts seems
> to indicate that everything is fine with this specific target, and
> reading from the device with 'dd' or getting its size with 'blockdev'
> works without issue.
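>
> For reference, the checks I ran were along these lines (reconstructed
> from memory; the device path is the multipath device backing the PV):
>
> iscsiadm -m session
> blockdev --getsize64 /dev/mapper/23536316636393463
> dd if=/dev/mapper/23536316636393463 of=/dev/null bs=1M count=100 iflag=direct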
>
> When I try to reactivate the domain, these errors are logged:
>
> 2019-07-18 09:34:53,631+02 ERROR
> [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector]
> (EE-ManagedThreadFactory-engine-Thread-43475) [79e386e] EVENT_ID:
> IRS_BROKER_COMMAND_FAILURE(10,803), VDSM command
> ActivateStorageDomainVDS failed: Storage domain does not exist:
> (u'875847b6-29a4-4419-be92-9315f4435429',)
> 2019-07-18 09:34:53,631+02 ERROR
> [org.ovirt.engine.core.vdsbroker.irsbroker.IrsBrokerCommand]
> (EE-ManagedThreadFactory-engine-Thread-43475) [79e386e]
> IrsBroker::Failed::ActivateStorageDomainVDS: IRSGenericException:
> IRSErrorException: Failed to ActivateStorageDomainVDS, error =
> Storage domain does not exist:
> (u'875847b6-29a4-4419-be92-9315f4435429',), code = 358
> 2019-07-18 09:34:53,648+02 ERROR
> [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector]
> (EE-ManagedThreadFactory-engine-Thread-43475) [79e386e] EVENT_ID:
> USER_ACTIVATE_STORAGE_DOMAIN_FAILED(967), Failed to activate
> Storage Domain HQST0_ISCSI02 (Data Center ISAAC01) by martijn@-authz
>
> On the SPM host, there are errors that indicate problems with the LVM
> volume group:
>
> 2019-07-18 09:34:50,462+0200 INFO  (jsonrpc/2) [vdsm.api] START
> activateStorageDomain(sdUUID=u'875847b6-29a4-4419-be92-9315f4435429',
> spUUID=u'aefd5844-6e01-4070-b3b9-c0d73cc40c78', options=None)
> from=::ffff:172.17.1.140,56570, flow_id=197dadec,
> task_id=51107845-d80b-47f4-aed8-345aaa49f0f8 (api:48)
> 2019-07-18 09:34:50,464+0200 INFO  (jsonrpc/2)
> [storage.StoragePool] sdUUID=875847b6-29a4-4419-be92-9315f4435429
> spUUID=aefd5844-6e01-4070-b3b9-c0d73cc40c78 (sp:1125)
> 2019-07-18 09:34:50,629+0200 WARN  (jsonrpc/2) [storage.LVM]
> Reloading VGs failed
> (vgs=[u'875847b6-29a4-4419-be92-9315f4435429'] rc=5 out=[]
> err=['  /dev/mapper/23536316636393463: Checksum error at offset
> 2748693688832', "  Couldn't read volume group metadata from
> /dev/mapper/23536316636393463.", '  Metadata location on
> /dev/mapper/23536316636393463 at 2748693688832 has invalid
> summary for VG.', '  Failed to read metadata summary from
> /dev/mapper/23536316636393463', '  Failed to scan VG from
> /dev/mapper/23536316636393463', '  Volume group
> "875847b6-29a4-4419-be92-9315f4435429" not found', '  Cannot
> process volume group 875847b6-29a4-4419-be92-9315f4435429'])
> (lvm:442)
> 2019-07-18 09:34:50,629+0200 INFO  (jsonrpc/2) [vdsm.api] FINISH
> activateStorageDomain error=Storage domain does not exist:
> (u'875847b6-29a4-4419-be92-9315f4435429',)
> from=::ffff:172.17.1.140,56570, flow_id=197dadec,
> task_id=51107845-d80b-47f4-aed8-345aaa49f0f8 (api:52)
> 2019-07-18 09:34:50,629+0200 ERROR (jsonrpc/2)
> [storage.TaskManager.Task]
> (Task='51107845-d80b-47f4-aed8-345aaa49f0f8') Unexpected error
> (task:875)
> Traceback (most recent call last):
>   File "/usr/lib/python2.7/site-packages/vdsm/storage/task.py",
> line 882, in _run
>     return fn(*args, **kargs)
>   File "<string>", line 2, in activateStorageDomain
>   File "/usr/lib/python2.7/site-packages/vdsm/common/api.py",
> line 50, in method
>     ret = func(*args, **kwargs)
>   File "/usr/lib/python2.7/site-packages/vdsm/storage/hsm.py",
> line 1262, in activateStorageDomain
>     pool.activateSD(sdUUID)
>   File
> "/usr/lib/python2.7/site-packages/vdsm/storage/securable.py",
> line 79, in wrapper
>     return method(self, *args, **kwargs)
>   File "/usr/lib/python2.7/site-packages/vdsm/storage/sp.py",
> line 1127, in activateSD
>     dom = sdCache.produce(sdUUID)
>   File "/usr/lib/python2.7/site-packages/vdsm/storage/sdc.py",
> line 110, in produce
>     domain.getRealDomain()
>   File "/usr/lib/python2.7/site-packages/vdsm/storage/sdc.py",
> line 51, in getRealDomain
>     return self._cache._realProduce(self._sdUUID)
>   File "/usr/lib/python2.7/site-packages/vdsm/storage/sdc.py",
> line 134, in _realProduce
>     domain = self._findDomain(sdUUID)
>   File "/usr/lib/python2.7/site-packages/vdsm/storage/sdc.py",
> line 151, in _findDomain
>     return findMethod(sdUUID)
>   File
> "/usr/lib/python2.7/site-packages/vdsm/storage/blockSD.py", line
> 1807, in findDomain
>     return
> BlockStorageDomain(BlockStorageDomain.findDomainPath(sdUUID))
>   File
> "/usr/lib/python2.7/site-packages/vdsm/storage/blockSD.py", line
> 1665, in findDomainPath
>     raise se.StorageDomainDoesNotExist(sdUUID)
> StorageDomainDoesNotExist: Storage domain does not exist:
> (u'875847b6-29a4-4419-be92-9315f4435429',)
> 2019-07-18 09:34:50,629+0200 INFO  (jsonrpc/2)
> [storage.TaskManager.Task]
> (Task='51107845-d80b-47f4-aed8-345aaa49f0f8') aborting: Task is
> aborted: "Storage domain does not exist:
> (u'875847b6-29a4-4419-be92-9315f4435429',)" - code 358 (task:1181)
> 2019-07-18 09:34:50,629+0200 ERROR (jsonrpc/2)
> [storage.Dispatcher] FINISH activateStorageDomain error=Storage
> domain does not exist: (u'875847b6-29a4-4419-be92-9315f4435429',)
> (dispatcher:83)
>
>
> I need help getting this storage domain back online. Can anyone here
> help me? If you need any additional information, please let me know!
It appears the VG metadata is corrupt:
 /dev/mapper/23536316636393463: Checksum error at offset
2748693688832
 Couldn't read volume group metadata from
/dev/mapper/23536316636393463.
 Metadata location on /dev/mapper/23536316636393463 at
2748693688832 has invalid summary for VG.
 Failed to read metadata summary from /dev/mapper/23536316636393463
 Failed to scan VG from /dev/mapper/23536316636393463
Is this fixable? If so, how?
So, I have found some information online suggesting that the PV metadata
can be fixed by recreating the PV label using the correct PV UUID and a
backup of the LVM metadata, like so:
pvcreate -u <pv_uuid> --restorefile <lvm_metadata_backup>
/dev/mapper/23536316636393463
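If I understand the procedure correctly, the pvcreate would then be
followed by restoring the VG metadata from the same file, so the full
sequence would look roughly like this (same placeholders as above, not
yet verified by me):

pvcreate -u <pv_uuid> --restorefile <lvm_metadata_backup> /dev/mapper/23536316636393463
vgcfgrestore -f <lvm_metadata_backup> 875847b6-29a4-4419-be92-9315f4435429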
Now I have the following two files:
* An LVM metadata backup from yesterday at 10:35, about 6 hours before
the problem occurred.
* The actual metadata found on the PV at offset 2748693688832
(obtained with 'hexedit' on the block device; a dd equivalent is
sketched below).
These are largely the same, but there are differences:
* seqno = 1854 in the backup and 1865 in the actual metadata.
* 3 logical volumes that are not present in the backup, but are in the
actual metadata. I suspect that these are related to snapshots that
were created for live storage migration, but I am not sure. In any
case, I did NOT create any new disk images on this domain, so that
can't be it.
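For reference, the on-disk copy could also be dumped with dd instead of
hexedit; something like this should be roughly equivalent (the offset is
taken from the error messages above; the 1 MiB read size and the output
file name are my assumptions):

dd if=/dev/mapper/23536316636393463 bs=1M count=1 skip=2748693688832 iflag=skip_bytes | strings > /tmp/pv-metadata-ondisk.txt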
Now, suppose I wanted to try the 'pvcreate' route, then:
* what would be the chances of success? Is this procedure at all
advisable, or is there an alternative?
* which restore file (1854 or 1865) should I use for the restore?
* can I do this while the VG is in use? I tried running the command
without --force, and it said 'Can't open
/dev/mapper/23536316636393463 exclusively. Mounted filesystem?'. I
didn't dare to try it with '--force'.
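One thing I was considering as a first step is a dry run with LVM's test
mode, which should not update any metadata, e.g.:

vgcfgrestore --test -f <lvm_metadata_backup> 875847b6-29a4-4419-be92-9315f4435429

but I don't know whether that tells me anything useful while the PV
metadata itself cannot be read.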
I could really use some advice on how to proceed. There are about 36 VMs
that have one or more disks on this domain. I could bring them down,
although doing so for an extended amount of time would be problematic. I
want to be careful, obviously, especially since the actual storage
doesn't seem to be impacted at this time. The VMs are all still running
without issue, and if I'm about to embark on a dangerous journey that
could cause data loss, I need a contingency / recovery plan.
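As a minimal safeguard before changing anything, my plan would be to copy
the PV label area and the metadata area at the reported offset to files
first, roughly like this (read-only; sizes and file names are just
examples on my part):

dd if=/dev/mapper/23536316636393463 of=/root/pv-head.bak bs=1M count=1
dd if=/dev/mapper/23536316636393463 of=/root/pv-meta-2748693688832.bak bs=1M count=1 skip=2748693688832 iflag=skip_bytes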
Hoping someone can help...
Best regards,
Martijn Grendelman