Hi,
Thanks for the tips! I didn't know about 'pvmove'.
In the meantime, I managed to get it fixed by restoring the VG metadata on the iSCSI
server, that is, on the underlying zvol directly, rather than via the iSCSI session on the oVirt
host. That allowed me to perform the restore without bringing all VMs down, which was
important to me, because if I had to shut down VMs, I was sure I wouldn't be able to
restart them before the storage domain was back online.
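For the archives: on the storage server, that kind of restore boils down to something like the
sketch below; the backup file path is a placeholder and the exact invocation may have differed
from what I actually ran.

# on the iSCSI server, against the zvol-backed device, using a VG metadata backup
# copied over from the oVirt SPM host (file path is illustrative)
vgcfgrestore -f /root/875847b6-29a4-4419-be92-9315f4435429.vg 875847b6-29a4-4419-be92-9315f4435429
vgscan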
Of course this is more a Linux problem than an oVirt problem, but oVirt did cause it
;-)
Thanks,
Martijn.
On 19-7-2019 at 19:06, Strahil Nikolov wrote:
Hi Martijn,
First check what went wrong with the VG - it could be something simple.
vgcfgbackup -f <output file> VGname will create a file which you can use to compare the current
metadata with a previous version.
If you have Linux boxes - you can add disks from another storage domain and then pvmove the data
inside the VM. Of course, you will need to reinstall grub on the new OS disk, or you
won't be able to boot afterwards.
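A minimal sketch of that in-guest move, assuming the new disk shows up as /dev/sdb, the old PV
is /dev/sda2 and the VG is called rootvg (all three names are illustrative):

# inside the guest; device names and VG name are assumptions
pvcreate /dev/sdb                # prepare the new disk as a PV
vgextend rootvg /dev/sdb         # add it to the existing VG
pvmove /dev/sda2 /dev/sdb        # migrate all extents off the old PV
vgreduce rootvg /dev/sda2        # drop the old PV from the VG
grub2-install /dev/sdb           # reinstall the bootloader (or grub-install, depending on distro)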
If possible, try with a test VM before proceeding with important ones.
Backing up the VMs is very important, because working on LVM metadata is quite risky.
Last time I had such an issue, I was working on clustered LVs which got their PVs
"Missing". For me, a restore from the VG backup fixed the issue - but that might not
always be the case.
Just take vgcfgbackup's output, compare it with a previous version using diff or vimdiff, and
check what is different.
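For example (the dump location below is arbitrary):

# dump the current metadata and diff it against the last automatic backup
vgcfgbackup -f /tmp/875847b6-29a4-4419-be92-9315f4435429.now 875847b6-29a4-4419-be92-9315f4435429
diff /etc/lvm/backup/875847b6-29a4-4419-be92-9315f4435429 /tmp/875847b6-29a4-4419-be92-9315f4435429.now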
Sadly, I think that this is more a Linux problem than an oVirt problem.
Best Regards,
Strahil Nikolov
On Thursday, 18 July 2019, 18:51:32 GMT+3, Martijn Grendelman
<Martijn.Grendelman@isaac.nl> wrote:
Hi!
Thanks. Like I wrote, I have metadata backups from /etc/lvm/backup and /etc/lvm/archive, and I
also have the current metadata as it exists on disk. What I'm most concerned about is
the proposed procedure.
I would create a backup of the VG, but I'm not sure what would be the most sensible
way to do it. I could make a new iSCSI target and simply 'dd' the whole disk over,
but that would take quite some time (it's 2.5 TB) and there are VMs that can't
really be down for that long. And I'm not even sure that dd'ing the disk like that
is a sensible strategy.
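For completeness, the raw copy I have in mind would look something like this (the destination LUN
is a placeholder), though with VMs still writing to the source it would be inconsistent at best,
which is exactly my doubt:

# block-level copy of the existing LUN to a new, equally sized LUN (destination is illustrative)
dd if=/dev/mapper/23536316636393463 of=/dev/mapper/<new-backup-lun> bs=4M status=progress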
Moving disks out of the domain is currently not possible. oVirt says 'Source Storage
Domain is not active'.
Thanks,
Martijn.
On 18-7-2019 at 17:44, Strahil Nikolov wrote:
Can you check /etc/lvm/backup and /etc/lvm/archive on your SPM host (and check the other
hosts too, just in case you find anything useful)?
Usually LVM makes a backup of everything.
I would recommend that you:
1. Create a backup of the problematic VG
2. Compare that backup file with a file from the backup/archive folders for the same VG
Check what is different with diff/vimdiff. It might give you a clue.
I had some issues (not related to oVirt) and restoring the VG from an older backup did help
me. Still, any operation on block devices should be considered risky and a proper backup
is needed.
You could try to move a less important VM's disks out of this storage domain to
another one.
If that succeeds, you can evacuate all VMs before you start
"breaking" the storage domain.
Best Regards,
Strahil Nikolov
On Thursday, 18 July 2019, 16:59:46 GMT+3, Martijn Grendelman
<martijn.grendelman@isaac.nl> wrote:
Hi,
It appears that O365 has trouble delivering mails to this list, so two
earlier mails of mine are still somewhere in a queue and may yet be delivered.
This mail has all of the content of 3 successive mails. I apologize for this
format.
On 18-7-2019 at 11:20, Martijn Grendelman wrote:
On 18-7-2019 at 10:16, Martijn Grendelman wrote:
Hi,
For the first time in many months I have run into some trouble with oVirt (4.3.4.3) and I
need some help.
Yesterday, I noticed one of my iSCSI storage domains was almost full, and tried to move a
disk image off of it, to another domain. This failed, and somewhere in the process, the
whole storage domain went to status 'Inactive'.
From engine.log:
2019-07-17 16:30:35,319+02 INFO [org.ovirt.engine.core.vdsbroker.irsbroker.IrsProxy]
(EE-ManagedThreadFactory-engine-Thread-1836383) [] starting processDomainRecovery for
domain '875847b6-29a4-4419-be92-9315f4435429:HQST0_ISCSI02'.
2019-07-17 16:30:35,337+02 ERROR [org.ovirt.engine.core.vdsbroker.irsbroker.IrsProxy]
(EE-ManagedThreadFactory-engine-Thread-1836383) [] Domain
'875847b6-29a4-4419-be92-9315f4435429:HQST0_ISCSI02' was reported by all hosts in
status UP as problematic. Moving the domain to NonOperational.
2019-07-17 16:30:35,410+02 WARN
[org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector]
(EE-ManagedThreadFactory-engine-Thread-1836383) [5f6fd35e] EVENT_ID:
SYSTEM_DEACTIVATED_STORAGE_DOMAIN(970), Storage Domain HQST0_ISCSI02 (Data Center ISAAC01)
was deactivated by system because it's not visible by any of the hosts.
The thing is, the domain is still functional on all my hosts. It carries over 50 disks,
and all involved VMs are up and running, and don't seem to have any problems. Also,
'iscsiadm' on all hosts seems to indicate that everything is fine with this
specific target, and reading from the device with dd or getting its size with
'blockdev' works without issue.
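For reference, those checks amount to something like the following; the exact invocations may
have differed:

iscsiadm -m session -P 3                                    # detailed per-session/target state
blockdev --getsize64 /dev/mapper/23536316636393463          # report the device size in bytes
dd if=/dev/mapper/23536316636393463 of=/dev/null bs=1M count=100   # test read from the LUN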
When I try to reactivate the domain, these errors are logged:
2019-07-18 09:34:53,631+02 ERROR
[org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector]
(EE-ManagedThreadFactory-engine-Thread-43475) [79e386e] EVENT_ID:
IRS_BROKER_COMMAND_FAILURE(10,803), VDSM command ActivateStorageDomainVDS failed: Storage
domain does not exist: (u'875847b6-29a4-4419-be92-9315f4435429',)
2019-07-18 09:34:53,631+02 ERROR
[org.ovirt.engine.core.vdsbroker.irsbroker.IrsBrokerCommand]
(EE-ManagedThreadFactory-engine-Thread-43475) [79e386e]
IrsBroker::Failed::ActivateStorageDomainVDS: IRSGenericException: IRSErrorException:
Failed to ActivateStorageDomainVDS, error = Storage domain does not exist:
(u'875847b6-29a4-4419-be92-9315f4435429',), code = 358
2019-07-18 09:34:53,648+02 ERROR
[org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector]
(EE-ManagedThreadFactory-engine-Thread-43475) [79e386e] EVENT_ID:
USER_ACTIVATE_STORAGE_DOMAIN_FAILED(967), Failed to activate Storage Domain HQST0_ISCSI02
(Data Center ISAAC01) by martijn@-authz
On the SPM host, there are errors that indicate problems with the LVM volume group:
2019-07-18 09:34:50,462+0200 INFO (jsonrpc/2) [vdsm.api] START
activateStorageDomain(sdUUID=u'875847b6-29a4-4419-be92-9315f4435429',
spUUID=u'aefd5844-6e01-4070-b3b9-c0d73cc40c78', options=None)
from=::ffff:172.17.1.140,56570, flow_id=197dadec,
task_id=51107845-d80b-47f4-aed8-345aaa49f0f8 (api:48)
2019-07-18 09:34:50,464+0200 INFO (jsonrpc/2) [storage.StoragePool]
sdUUID=875847b6-29a4-4419-be92-9315f4435429 spUUID=aefd5844-6e01-4070-b3b9-c0d73cc40c78
(sp:1125)
2019-07-18 09:34:50,629+0200 WARN (jsonrpc/2) [storage.LVM] Reloading VGs failed
(vgs=[u'875847b6-29a4-4419-be92-9315f4435429'] rc=5 out=[] err=[
' /dev/mapper/23536316636393463: Checksum error at offset 2748693688832',
" Couldn't read volume group metadata from /dev/mapper/23536316636393463.",
' Metadata location on /dev/mapper/23536316636393463 at 2748693688832 has invalid summary for VG.',
' Failed to read metadata summary from /dev/mapper/23536316636393463',
' Failed to scan VG from /dev/mapper/23536316636393463',
' Volume group "875847b6-29a4-4419-be92-9315f4435429" not found',
' Cannot process volume group 875847b6-29a4-4419-be92-9315f4435429']) (lvm:442)
2019-07-18 09:34:50,629+0200 INFO (jsonrpc/2) [vdsm.api] FINISH activateStorageDomain
error=Storage domain does not exist: (u'875847b6-29a4-4419-be92-9315f4435429',)
from=::ffff:172.17.1.140,56570, flow_id=197dadec,
task_id=51107845-d80b-47f4-aed8-345aaa49f0f8 (api:52)
2019-07-18 09:34:50,629+0200 ERROR (jsonrpc/2) [storage.TaskManager.Task]
(Task='51107845-d80b-47f4-aed8-345aaa49f0f8') Unexpected error (task:875)
Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/vdsm/storage/task.py", line 882, in _run
    return fn(*args, **kargs)
  File "<string>", line 2, in activateStorageDomain
  File "/usr/lib/python2.7/site-packages/vdsm/common/api.py", line 50, in method
    ret = func(*args, **kwargs)
  File "/usr/lib/python2.7/site-packages/vdsm/storage/hsm.py", line 1262, in activateStorageDomain
    pool.activateSD(sdUUID)
  File "/usr/lib/python2.7/site-packages/vdsm/storage/securable.py", line 79, in wrapper
    return method(self, *args, **kwargs)
  File "/usr/lib/python2.7/site-packages/vdsm/storage/sp.py", line 1127, in activateSD
    dom = sdCache.produce(sdUUID)
  File "/usr/lib/python2.7/site-packages/vdsm/storage/sdc.py", line 110, in produce
    domain.getRealDomain()
  File "/usr/lib/python2.7/site-packages/vdsm/storage/sdc.py", line 51, in getRealDomain
    return self._cache._realProduce(self._sdUUID)
  File "/usr/lib/python2.7/site-packages/vdsm/storage/sdc.py", line 134, in _realProduce
    domain = self._findDomain(sdUUID)
  File "/usr/lib/python2.7/site-packages/vdsm/storage/sdc.py", line 151, in _findDomain
    return findMethod(sdUUID)
  File "/usr/lib/python2.7/site-packages/vdsm/storage/blockSD.py", line 1807, in findDomain
    return BlockStorageDomain(BlockStorageDomain.findDomainPath(sdUUID))
  File "/usr/lib/python2.7/site-packages/vdsm/storage/blockSD.py", line 1665, in findDomainPath
    raise se.StorageDomainDoesNotExist(sdUUID)
StorageDomainDoesNotExist: Storage domain does not exist: (u'875847b6-29a4-4419-be92-9315f4435429',)
2019-07-18 09:34:50,629+0200 INFO (jsonrpc/2) [storage.TaskManager.Task]
(Task='51107845-d80b-47f4-aed8-345aaa49f0f8') aborting: Task is aborted:
"Storage domain does not exist:
(u'875847b6-29a4-4419-be92-9315f4435429',)" - code 358 (task:1181)
2019-07-18 09:34:50,629+0200 ERROR (jsonrpc/2) [storage.Dispatcher] FINISH
activateStorageDomain error=Storage domain does not exist:
(u'875847b6-29a4-4419-be92-9315f4435429',) (dispatcher:83)
I need help getting this storage domain back online. Can anyone here help me? If you need
any additional information, please let me know!
It appears the VG metadata is corrupt:
/dev/mapper/23536316636393463: Checksum error at offset 2748693688832
Couldn't read volume group metadata from /dev/mapper/23536316636393463.
Metadata location on /dev/mapper/23536316636393463 at 2748693688832 has invalid summary
for VG.
Failed to read metadata summary from /dev/mapper/23536316636393463
Failed to scan VG from /dev/mapper/23536316636393463
Is this fixable? If so, how?
So, I have found some information online that suggests that PV metadata can be fixed by
recreating the PV label using the correct PVID and a backup of the LVM metadata, like so:
pvcreate -u <pv_uuid> --restorefile <lvm_metadata_backup> /dev/mapper/23536316636393463
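For context, the recipe usually quoted pairs that pvcreate with a subsequent vgcfgrestore from
the same backup file; as a sketch, keeping the placeholders from above:

# recreate the PV label with its original UUID, referencing a matching metadata backup
pvcreate -u <pv_uuid> --restorefile <lvm_metadata_backup> /dev/mapper/23536316636393463
# then restore the VG metadata itself from that backup
vgcfgrestore -f <lvm_metadata_backup> 875847b6-29a4-4419-be92-9315f4435429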
Now I have the following two files:
* An LVM metadata backup from yesterday 10:35, about 6 hours before the problem
occurred.
* The actual metadata as found on the PV at offset 2748693688832 (obtained with hexedit
on the block device; see the dd sketch after this list).
These are largely the same, but there are differences:
* seqno = 1854 in the backup and 1865 in the actual metadata.
* 3 logical volumes that are not present in the backup, but are in the actual
metadata. I suspect that these are related to snapshots that were created for live
storage migration, but I am not sure. In any case, I did NOT create any new disk images on
this domain, so that can't be it.
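As an aside, that on-disk copy can also be dumped with dd instead of hexedit; a sketch, where the
1 MiB read length is just a guess at how much of the metadata area to grab:

# copy 1 MiB starting at the offset reported in the LVM errors (length is a guess)
dd if=/dev/mapper/23536316636393463 of=/tmp/ondisk-metadata.bin bs=1M count=1 iflag=skip_bytes skip=2748693688832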
Now, suppose I wanted to try the 'pvcreate' route, then:
* what would be the chances of success? Is this procedure at all advisable, or is
there an alternative?
* which restore file (1854 or 1865) should I use for the restore?
* can I do this while the VG is in use? I tried running the command without --force,
and it said 'Can't open /dev/mapper/23536316636393463 exclusively. Mounted
filesystem?'. I didn't dare to try it with '--force'.
I could really use some advice on how to proceed. There are about 36 VMs that have one or
more disks on this domain. I could bring them down, although doing so for extended amounts
of time would be problematic. I want to be careful, obviously, especially since the actual
storage doesn't seem to be impacted at this time. The VMs are all still running
without issue, and if I'm about to embark on a dangerous journey that could cause data
loss, I need a contingency / recovery plan.
Hoping someone can help...
Best regards,
Martijn Grendelman