Ovirt-engine-ha cannot see live status of Hosted Engine
by asm@pioner.kz
Good day to all.
I have some issues with oVirt 4.2.6, but here is the main one:
I have two CentOS 7 nodes with the same configuration, running the latest oVirt 4.2.6 with a Hosted Engine whose disk is on NFS storage.
Some virtual machines are also running fine.
When the Hosted Engine runs on one node (srv02.local), everything is fine.
After migrating it to the other node (srv00.local), I see that the agent cannot check the liveliness of the Hosted Engine. After a few minutes the Hosted Engine reboots, and after some time I see the same situation again. After migration to the other node (srv00.local) everything looks OK.
Output of the hosted-engine --vm-status command when the Hosted Engine is on the srv00 node:
--== Host 1 status ==--
conf_on_shared_storage : True
Status up-to-date : True
Hostname : srv02.local
Host ID : 1
Engine status : {"reason": "vm not running on this host", "health": "bad", "vm": "down_unexpected", "detail": "unknown"}
Score : 0
stopped : False
Local maintenance : False
crc32 : ecc7ad2d
local_conf_timestamp : 78328
Host timestamp : 78328
Extra metadata (valid at timestamp):
metadata_parse_version=1
metadata_feature_version=1
timestamp=78328 (Tue Sep 18 12:44:18 2018)
host-id=1
score=0
vm_conf_refresh_time=78328 (Tue Sep 18 12:44:18 2018)
conf_on_shared_storage=True
maintenance=False
state=EngineUnexpectedlyDown
stopped=False
timeout=Fri Jan 2 03:49:58 1970
--== Host 2 status ==--
conf_on_shared_storage : True
Status up-to-date : True
Hostname : srv00.local
Host ID : 2
Engine status : {"reason": "failed liveliness check", "health": "bad", "vm": "up", "detail": "Up"}
Score : 3400
stopped : False
Local maintenance : False
crc32 : 1d62b106
local_conf_timestamp : 326288
Host timestamp : 326288
Extra metadata (valid at timestamp):
metadata_parse_version=1
metadata_feature_version=1
timestamp=326288 (Tue Sep 18 12:44:21 2018)
host-id=2
score=3400
vm_conf_refresh_time=326288 (Tue Sep 18 12:44:21 2018)
conf_on_shared_storage=True
maintenance=False
state=EngineStarting
stopped=False
Excerpt from agent.log on srv00.local:
MainThread::INFO::2018-09-18 12:40:51,749::states::779::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(consume) VM is powering up..
MainThread::INFO::2018-09-18 12:40:52,052::hosted_engine::491::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_monitoring_loop) Current state EngineStarting (score: 3400)
MainThread::INFO::2018-09-18 12:41:01,066::states::779::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(consume) VM is powering up..
MainThread::INFO::2018-09-18 12:41:01,374::hosted_engine::491::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_monitoring_loop) Current state EngineStarting (score: 3400)
MainThread::INFO::2018-09-18 12:41:11,393::state_machine::169::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(refresh) Global metadata: {'maintenance': False}
MainThread::INFO::2018-09-18 12:41:11,393::state_machine::174::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(refresh) Host srv02.local.pioner.kz (id 1): {'conf_on_shared_storage': True, 'extra': 'metadata_parse_version=1\nmetadata_feature_version=1\ntimestamp=78128 (Tue Sep 18 12:40:58 2018)\nhost-id=1\nscore=0\nvm_conf_refresh_time=78128 (Tue Sep 18 12:40:58 2018)\nconf_on_shared_storage=True\nmaintenance=False\nstate=EngineUnexpectedlyDown\nstopped=False\ntimeout=Fri Jan 2 03:49:58 1970\n', 'hostname': 'srv02.local.pioner.kz', 'alive': True, 'host-id': 1, 'engine-status': {'reason': 'vm not running on this host', 'health': 'bad', 'vm': 'down_unexpected', 'detail': 'unknown'}, 'score': 0, 'stopped': False, 'maintenance': False, 'crc32': 'e18e3f22', 'local_conf_timestamp': 78128, 'host-ts': 78128}
MainThread::INFO::2018-09-18 12:41:11,393::state_machine::177::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(refresh) Local (id 2): {'engine-health': {'reason': 'failed liveliness check', 'health': 'bad', 'vm': 'up', 'detail': 'Up'}, 'bridge': True, 'mem-free': 12763.0, 'maintenance': False, 'cpu-load': 0.0364, 'gateway': 1.0, 'storage-domain': True}
MainThread::INFO::2018-09-18 12:41:11,393::states::779::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(consume) VM is powering up..
MainThread::INFO::2018-09-18 12:41:11,703::hosted_engine::491::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_monitoring_loop) Current state EngineStarting (score: 3400)
MainThread::INFO::2018-09-18 12:41:21,716::states::779::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(consume) VM is powering up..
MainThread::INFO::2018-09-18 12:41:22,020::hosted_engine::491::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_monitoring_loop) Current state EngineStarting (score: 3400)
MainThread::INFO::2018-09-18 12:41:31,033::states::779::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(consume) VM is powering up..
MainThread::INFO::2018-09-18 12:41:31,344::hosted_engine::491::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_monitoring_loop) Current state EngineStarting (score: 3400)
As we can see, the agent thinks that the Hosted Engine is still powering up. I cannot do anything about it. I have already reinstalled the srv00 node many times without success.
One time I even had to uninstall the ovirt* and vdsm* packages. One more interesting point: after installing only the release package with "yum install http://resources.ovirt.org/pub/yum-repo/ovirt-release42.rpm" on this node, I tried to install the node from the engine web interface with the "Deploy" action. The installation was unsuccessful until I installed ovirt-hosted-engine-ha on the node; I don't see in the documentation that this is needed before installing new hosts, but I mention it for information and checking. After installing ovirt-hosted-engine-ha the node was installed with Hosted Engine support, but the main issue did not change.
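For reference, these are the checks I can run by hand from srv00 to exercise the same liveliness path the agent uses (as far as I understand, the health URL below is the one the HA broker polls; <engine-fqdn> is a placeholder for my engine's FQDN):

hosted-engine --vm-status
hosted-engine --console                                       # does the engine VM respond on its console?
ping -c3 <engine-fqdn>                                        # name resolution and routing from this host
curl -k https://<engine-fqdn>/ovirt-engine/services/health    # the engine health page the liveliness check relies on
journalctl -u ovirt-ha-agent -u ovirt-ha-broker --since "15 min ago"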
Thanks in advance for help.
BR,
Alexandr
Re: [ANN] oVirt 4.3.7 Third Release Candidate is now available for testing
by Strahil
I upgraded to RC3 and now I cannot power on any VM.
I constantly get an I/O error, but checking at the gluster level I can dd from each disk or even create a new one.
Removing High Availability doesn't help.
I guess I should restore the engine from the gluster snapshot and roll back via 'yum history undo last'.
Does anyone else have these issues?
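For the record, the rollback I have in mind is roughly the following (transaction IDs come from the history listing, so 'last' is only right if the upgrade really was the most recent transaction):

yum history list ovirt\* vdsm\*      # find the upgrade transaction ID
yum history info <transaction-id>    # review what would be reverted
yum history undo <transaction-id>    # or: yum history undo last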
Best Regards,
Strahil Nikolov
On Nov 13, 2019 15:31, Sandro Bonazzola <sbonazzo(a)redhat.com> wrote:
>
>
>
> Il giorno mer 13 nov 2019 alle ore 14:25 Sandro Bonazzola <sbonazzo(a)redhat.com> ha scritto:
>>
>>
>>
>> Il giorno mer 13 nov 2019 alle ore 13:56 Florian Schmid <fschmid(a)ubimet.com> ha scritto:
>>>
>>> Hello,
>>>
>>> I have a question about bugs, which are flagged as [downstream clone - 4.3.7], but are not yet released.
>>>
>>> I'm talking about this bug:
>>> https://bugzilla.redhat.com/show_bug.cgi?id=1749202
>>>
>>> I can't see it in 4.3.7 release notes. Will it be included in a further release candidate? This fix is very important I think and I can't upgrade yet because of this bug.
>>
>>
>>
>> Looking at the bug, the fix is included in the following tags ($ git tag --contains 12bd5cb1fe7c95e29b4065fca968913722fe9eaa):
>> ovirt-engine-4.3.6.6
>> ovirt-engine-4.3.6.7
>> ovirt-engine-4.3.7.0
>> ovirt-engine-4.3.7.1
>>
>> So the fix is already included in release oVirt 4.3.6.
>
>
> Sent a fix to 4.3.6 release notes: https://github.com/oVirt/ovirt-site/pull/2143. @Ryan Barry can you please review?
>
>
>>
>>
>>
>>
>>
>>
>>>
>>>
>>> BR Florian Schmid
>>>
>>> ________________________________
>>> From: "Sandro Bonazzola" <sbonazzo(a)redhat.com>
>>> To: "users" <users(a)ovirt.org>
>>> Sent: Wednesday, November 13, 2019 13:34:59
>>> Subject: [ovirt-users] [ANN] oVirt 4.3.7 Third Release Candidate is now available for testing
>>>
>>> The oVirt Project is pleased to announce the availability of the oVirt 4.3.7 Third Release Candidate for testing, as of November 13th, 2019.
>>>
>>> This update is a release candidate of the seventh in a series of stabilization updates to the 4.3 series.
>>> This is pre-release software. This pre-release should not be used in production.
>>>
>>> This release is available now on x86_64 architecture for:
>>> * Red Hat Enterprise Linux 7.7 or later (but <8)
>>> * CentOS Linux (or similar) 7.7 or later (but <8)
>>>
>>> This release supports Hypervisor Hosts on x86_64 and ppc64le architectures for:
>>> * Red Hat Enterprise Linux 7.7 or later (but <8)
>>> * CentOS Linux (or similar) 7.7 or later (but <8)
>>> * oVirt Node 4.3 (available for x86_64 only) has been built consuming CentOS 7.7 Release
>>>
>>> See the release notes [1] for known issues, new features and bugs fixed.
>>>
>>> While testing this release candidate please note that oVirt node now includes:
>>> - ansible 2.9.0
>>> - GlusterFS 6.6
>>>
>>> Notes:
>>> - oVirt Appliance is already available
>>> - oVirt Node is already available
>>>
>>> Additional Resources:
>>> * Read more about the oVirt 4.3.7 release highlights: http://www.ovirt.org/release/4.3.7/
>>> * Get more oVirt Project updates on Twitter: https://twitter.com/ovirt
>>> * Check out the latest project news on the oVirt blog: http://www.ovirt.org/blog/
>>>
>>> [1] http://www.ovirt.org/release/4.3.7/
>>> [2] http://resources.ovirt.org/pub/ovirt-4.3-pre/iso/
>>>
>>> --
>>>
>>> Sandro Bonazzola
>>>
>>> MANAGER, SOFTWARE ENGINEERING, EMEA R&D RHV
>>>
>>> Red Hat EMEA
>>>
>>> sbonazzo(a)redhat.com
>>>
>>> Red Hat respects your work life balance. Therefore there is no need to answer this email out of your office hours.
>>>
>>> _______________________________________________
>>> Users mailing list -- users(a)ovirt.org
>>> To unsubscribe send an email to users-leave(a)ovirt.org
>>> Privacy Statement: https://www.ovirt.org/site/privacy-policy/
>>> oVirt Code of Conduct: https://www.ovirt.org/community/about/community-guidelines/
>>> List Archives: https://lists.ovirt.org/archives/list/users@ovirt.org/message/24QUREJPZHT...
>>
>>
>>
>> --
>>
>> Sandro Bonazzola
>>
>> MANAGER, SOFTWARE ENGINEERING, EMEA R&D RHV
>>
>> Red Hat EMEA
>>
>> sbonazzo(a)redhat.com
>>
>> Red Hat respects your work life balance. Therefore there is no need to answer this email out of your office hours.
>
>
>
> --
>
> Sandro Bonazzola
>
> MANAGER, SOFTWARE ENGINEERING, EMEA R&D RHV
>
> Red Hat EMEA
>
> sbonazzo(a)redhat.com
>
> Red Hat respects your work life balance. Therefore there is no need to answer this email out of your office hours.
Low disk space on Storage
by suporte@logicworks.pt
Hi,
I'm running ovirt Version:4.3.4.3-1.el7
My filesystem disk has 30 GB of free space.
I cannot start a VM due to a storage I/O error.
When trying to move the disk to another storage domain I get this error:
Error while executing action: Cannot move Virtual Disk. Low disk space on Storage Domain DATA4.
The sum of the pre-allocated disks equals the total size of the storage domain.
Any idea what I can do to move a disk to another storage domain?
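For reference, these are the values I can check if it helps (assuming the engine-config keys below are the relevant free-space thresholds in 4.3; please correct me if not):

engine-config -g WarningLowSpaceIndicator      # percentage of free space below which a warning is raised
engine-config -g CriticalSpaceActionBlocker    # free GB below which actions on the domain are blocked
# and on the SPM host, the real free space of the domain mount (for file-based domains):
df -h /rhev/data-center/mnt/*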
Many thanks
--
Jose Ferradeira
http://www.logicworks.pt
Hyperconverged setup - storage architecture - scaling
by Leo David
Hello Everyone,
Reading through the document:
"Red Hat Hyperconverged Infrastructure for Virtualization 1.5
Automating RHHI for Virtualization deployment"
Regarding storage scaling, I see the following statements:
2.7. SCALING
Red Hat Hyperconverged Infrastructure for Virtualization is supported for one node, and for clusters of 3, 6, 9, and 12 nodes. The initial deployment is either 1 or 3 nodes.
There are two supported methods of horizontally scaling Red Hat Hyperconverged Infrastructure for Virtualization:
1. Add new hyperconverged nodes to the cluster, in sets of three, up to the maximum of 12 hyperconverged nodes.
2. Create new Gluster volumes using new disks on existing hyperconverged nodes. You cannot create a volume that spans more than 3 nodes, or expand an existing volume so that it spans across more than 3 nodes at a time.
2.9.1. Prerequisites for geo-replication
Be aware of the following requirements and limitations when configuring geo-replication:
One geo-replicated volume only. Red Hat Hyperconverged Infrastructure for Virtualization (RHHI for Virtualization) supports only one geo-replicated volume. Red Hat recommends backing up the volume that stores the data of your virtual machines, as this usually contains the most valuable data.
------
Also, in the oVirt Engine UI, when I add a brick to an existing volume I get the following warning:
"Expanding gluster volume in a hyper-converged setup is not recommended as it could lead to degraded performance. To expand storage for cluster, it is advised to add additional gluster volumes."
These things raise a couple of questions that may be easy for some of you to answer, but for me they create a bit of confusion...
I am also referring to the Red Hat product documentation because I treat oVirt as being as production-ready as RHHI.
1. Is there any reason for not going to distributed-replicated volumes (i.e. spreading one volume across 6, 9, or 12 nodes)?
- i.e. the recommendation for a 9-node scenario is to have 3 separate volumes, but then how should I deal with the following question:
2. If only one geo-replicated volume can be configured, how should I deal with replication of the 2nd and 3rd volumes for disaster recovery?
3. If the limit of hosts per datacenter is 250, then (in theory) the recommended way of reaching this threshold would be to create about 20 separate oVirt logical clusters with 12 nodes each (with the datacenter managed from one HA engine)?
4. At present I have one 9-node cluster, with all hosts contributing 2 disks each to a single replica 3 distributed-replicated volume. They were added to the volume in the following order:
node1 - disk1
node2 - disk1
......
node9 - disk1
node1 - disk2
node2 - disk2
......
node9 - disk2
At the moment the volume is arbitrated, but I intend to go for a full distributed replica 3.
Is this a bad setup? Why?
It obviously breaks the Red Hat recommended rules...
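For clarity on what the ordering above implies: gluster groups each set of three consecutive bricks into a replica set, roughly as if the volume had been created like this (volume name and brick paths are simplified placeholders, not my real ones):

# each group of three consecutive bricks becomes one replica set
gluster volume create vmstore replica 3 \
  node{1,2,3}:/gluster_bricks/disk1/brick \
  node{4,5,6}:/gluster_bricks/disk1/brick \
  node{7,8,9}:/gluster_bricks/disk1/brick \
  node{1,2,3}:/gluster_bricks/disk2/brick \
  node{4,5,6}:/gluster_bricks/disk2/brick \
  node{7,8,9}:/gluster_bricks/disk2/brick
gluster volume info vmstore    # should report "Number of Bricks: 6 x 3 = 18"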
Would anyone be so kind as to discuss these things?
Thank you very much !
Leo
--
Best regards, Leo David
AWX and error using ovirt as an inventory source
by Gianluca Cecchi
Hello,
I have AWX 9.0.1 and Ansible 2.8.5 in a container on a CentOS 7.7 server.
I'm trying to use oVirt 4.3.6.7-1.el7 as the source of an inventory in AWX, but I get an error when syncing.
The error messages are at the bottom below.
I see that in the recent past (around June this year) there were some problems, but they should be solved by now, correct?
There was also a problem syncing when some powered-off VMs were present in the oVirt environment, but I think this is solved too, correct?
Is there any way to replicate / test this from the command line of the AWX container?
I tried some things, but on the command line I always get an error saying
"oVirt inventory script requires ovirt-engine-sdk-python >= 4.0.0",
which I think depends on not using the correct command line and/or not setting the needed environment.
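For reference, this is roughly what I have in mind for a manual test from inside the awx_task container (the script path is the one from the traceback below; the OVIRT_* variable names are my assumption from reading the ovirt4.py header, so they may need adjusting):

source /var/lib/awx/venv/ansible/bin/activate
pip show ovirt-engine-sdk-python      # the ">= 4.0.0" message usually just means the SDK is missing from the interpreter actually used
export OVIRT_URL=https://engine.example.com/ovirt-engine/api
export OVIRT_USERNAME=admin@internal
export OVIRT_PASSWORD=secret          # or point OVIRT_INI_PATH at an ovirt.ini with the same settings
python /var/lib/awx/venv/awx/lib64/python3.6/site-packages/awx/plugins/inventory/ovirt4.py --list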
Thanks in advance,
Gianluca
2.536 INFO Updating inventory 4: MYDC_OVIRT
3.011 INFO Reading Ansible inventory source: /var/lib/awx/venv/awx/lib64/python3.6/site-packages/awx/plugins/inventory/ovirt4.py
3.013 INFO Using VIRTUAL_ENV: /var/lib/awx/venv/ansible
3.013 INFO Using PATH: /var/lib/awx/venv/ansible/bin:/var/lib/awx/venv/awx/bin:/var/lib/awx/venv/awx/bin:/var/lib/awx/venv/awx/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin
3.013 INFO Using PYTHONPATH: /var/lib/awx/venv/ansible/lib/python3.6/site-packages:
Traceback (most recent call last):
  File "/var/lib/awx/venv/awx/bin/awx-manage", line 11, in <module>
    load_entry_point('awx==9.0.1.0', 'console_scripts', 'awx-manage')()
  File "/var/lib/awx/venv/awx/lib64/python3.6/site-packages/awx/__init__.py", line 158, in manage
    execute_from_command_line(sys.argv)
  File "/var/lib/awx/venv/awx/lib64/python3.6/site-packages/django/core/management/__init__.py", line 381, in execute_from_command_line
    utility.execute()
  File "/var/lib/awx/venv/awx/lib64/python3.6/site-packages/django/core/management/__init__.py", line 375, in execute
    self.fetch_command(subcommand).run_from_argv(self.argv)
  File "/var/lib/awx/venv/awx/lib64/python3.6/site-packages/django/core/management/base.py", line 323, in run_from_argv
    self.execute(*args, **cmd_options)
  File "/var/lib/awx/venv/awx/lib64/python3.6/site-packages/django/core/management/base.py", line 364, in execute
    output = self.handle(*args, **options)
  File "/var/lib/awx/venv/awx/lib64/python3.6/site-packages/awx/main/management/commands/inventory_import.py", line 1153, in handle
    raise exc
  File "/var/lib/awx/venv/awx/lib64/python3.6/site-packages/awx/main/management/commands/inventory_import.py", line 1043, in handle
    venv_path=venv_path, verbosity=self.verbosity).load()
  File "/var/lib/awx/venv/awx/lib64/python3.6/site-packages/awx/main/management/commands/inventory_import.py", line 214, in load
    return self.command_to_json(base_args + ['--list'])
  File "/var/lib/awx/venv/awx/lib64/python3.6/site-packages/awx/main/management/commands/inventory_import.py", line 197, in command_to_json
    self.method, proc.returncode, stdout, stderr))
RuntimeError: ansible-inventory failed (rc=1) with stdout:
stderr:
ansible-inventory 2.8.5
  config file = /etc/ansible/ansible.cfg
  configured module search path = ['/var/lib/awx/.ansible/plugins/modules', '/usr/share/ansible/plugins/modules']
  ansible python module location = /usr/lib/python3.6/site-packages/ansible
  executable location = /usr/bin/ansible-inventory
  python version = 3.6.8 (default, Oct 7 2019, 17:58:22) [GCC 8.2.1 20180905 (Red Hat 8.2.1-3)]
Using /etc/ansible/ansible.cfg as config file
[WARNING]: * Failed to parse /var/lib/awx/venv/awx/lib64/python3.6/site-packages/awx/plugins/inventory/ovirt4.py with script plugin: Inventory script (/var/lib/awx/venv/awx/lib64/python3.6/site-packages/awx/plugins/inventory/ovirt4.py) had an execution error:
  File "/usr/lib/python3.6/site-packages/ansible/inventory/manager.py", line 268, in parse_source
    plugin.parse(self._inventory, self._loader, source, cache=cache)
  File "/usr/lib/python3.6/site-packages/ansible/plugins/inventory/script.py", line 161, in parse
    raise AnsibleParserError(to_native(e))
[WARNING]: Unable to parse /var/lib/awx/venv/awx/lib64/python3.6/site-packages/awx/plugins/inventory/ovirt4.py as an inventory source
ERROR! No inventory was parsed, please check your configuration and options.
ovirt hosted-engine on iSCSI offering one target
by wodel youchi
Hi,
We have an oVirt platform using version 4.1.
When the platform was installed, it was made of:
- Two HP Proliant DL380 G9 as hypervisors
- One HP MSA1040 for iSCSI
- One Synology for NFS
- Two switches, one for network/vm traffic, the second for storage traffic.
The problem: the hosted-engine storage domain was created using iSCSI on the HP MSA. The issue is that this disk array does not offer the possibility to create different targets; it presents just one target.
At that time we created both the hosted-engine domain and the first data domain using the same target, and we didn't pay attention to the information saying "if you are using iSCSI storage, do not use the same iSCSI target for the shared storage domain and data storage domain".
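For completeness, the single target can be confirmed from any host with something like the following (the portal IP is a placeholder for the MSA's iSCSI address):

iscsiadm -m discovery -t sendtargets -p 192.168.1.10:3260   # the MSA answers with a single IQN
iscsiadm -m session -P 1                                    # sessions currently logged in from this host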
Questions:
- What problems can be generated by this (mis-)configuration?
- Is it mandatory to correct this configuration?
Regards.
Unable to attach ISO domain to Datacenter
by Ivan de Gusmão Apolonio
I'm having trouble creating an ISO storage domain and attaching it to a data center. It just gives me this error message:
Error while executing action Attach Storage Domain: Could not obtain lock
The oVirt Engine log files also show the error message "setsid: failed to execute /usr/bin/ionice: Permission denied", but I was unable to identify what exactly it is trying to do when it gets this permission denied.
2019-11-14 16:46:07,779-03 ERROR [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (EE-ManagedThreadFactory-engine-Thread-7388) [86161370-2aaa-4eff-9aab-c184bdf5bb98] EVENT_ID: IRS_BROKER_COMMAND_FAILURE(10,803), VDSM command AttachStorageDomainVDS failed: Cannot obtain lock: u"id=e6b34c42-0ca6-41f4-be3e-3c9b2af1747b, rc=1, out=[], err=['setsid: failed to execute /usr/bin/ionice: Permission denied']"
This behavior only happens with ISO domains, while data domains work fine. I have read the oVirt documentation and searched everywhere, but I was unable to find a solution for this issue.
I'm using CentOS 7 with the latest updates of all packages (oVirt version 4.3.6.7). Please help!
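In case it is relevant, these are the checks I can run on the NFS server and on a host (the export path is a placeholder; 36:36 is the vdsm:kvm uid/gid that oVirt expects on NFS exports):

# on the NFS server
ls -ldZ /exports/iso
chown -R 36:36 /exports/iso
chmod 0755 /exports/iso
# on the host, check whether the vdsm user is allowed to run ionice at all
sudo -u vdsm /usr/bin/ionice -c3 true && echo ok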
Thanks,
Ivan de Gusmão Apolonio
Gluster set up fails - Nearly there I think...
by rob.downer@orbitalsystems.co.uk
Gluster fails with
vdo: ERROR - Device /dev/sdb excluded by a filter.\n",
However, I have run:
[root@ovirt1 ~]# vdo create --name=vdo1 --device=/dev/sdb --force
Creating VDO vdo1
Starting VDO vdo1
Starting compression on VDO vdo1
VDO instance 1 volume is ready at /dev/mapper/vdo1
[root@ovirt1 ~]#
There are no filters in lvm.conf.
I have also run
wipefs -a /dev/sdb --force
on all hosts before starting.
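A few extra things I can check before re-running the deploy (a sketch, assuming /dev/sdb is the intended VDO backing device on every host; as far as I understand, the "excluded by a filter" message can also come from multipath or leftover signatures even with no filter in lvm.conf):

lsblk -f /dev/sdb          # leftover filesystem/LVM/VDO signatures?
multipath -ll              # is /dev/sdb claimed as a multipath device?
multipathd reconfigure     # reload after blacklisting the disk, e.g. in /etc/multipath/conf.d/
pvcreate --test /dev/sdb   # dry run that reports the same filter exclusion without changing anything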
Re: Cannot obtain information from export domain
by Strahil
Hi, can you describe your actions?
Usually the export is like this:
1. You make a backup of the VM
2. You migrate the disks to the export storage domain
3. You shut down the VM
4. Set the storage domain to maintenance and then detach it from the old oVirt
5. You attach it to the new oVirt
6. Once the domain is active - click on import VM tab and import all VMs (defining the cluster you want them to be running on)
7. Power up VM and then migrate the disks to the permanent storage.
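If the import tab stays empty or you keep getting StorageDomainDoesNotExist, it is worth checking on one of the hosts whether the attached export domain really is the one from your error (the mount point below is a placeholder; the UUID is the one from your log):

ls /rhev/data-center/mnt/
find /rhev/data-center/mnt -maxdepth 2 -name 5ac6c35d-0406-4a06-a682-ed8fb2d1933f
cat /rhev/data-center/mnt/<export_mount>/5ac6c35d-0406-4a06-a682-ed8fb2d1933f/dom_md/metadata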
Best Regards,
Strahil Nikolov
On Nov 26, 2019 19:41, Arthur Rodrigues Stilben <arthur.stilben(a)gmail.com> wrote:
>
> Hello everyone,
>
> I'm trying to export a virtual machine, but I'm getting the following error:
>
> 2019-11-26 16:30:06,250-02 ERROR [org.ovirt.engine.core.bll.exportimport.GetVmsFromExportDomainQuery] (default task-22) [b9a0b9d5-2127-4002-9cee-2e3525bccc89] Exception: org.ovirt.engine.core.common.errors.EngineException: EngineException: org.ovirt.engine.core.vdsbroker.irsbroker.IRSErrorException: IRSGenericException: IRSErrorException: Failed to GetVmsInfoVDS, error = Storage domain does not exist: (u'5ac6c35d-0406-4a06-a682-ed8fb2d1933f',), code = 358 (Failed with error StorageDomainDoesNotExist and code 358)
>
> 2019-11-26 16:30:06,249-02 ERROR [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (default task-22) [b9a0b9d5-2127-4002-9cee-2e3525bccc89] EVENT_ID: IMPORTEXPORT_GET_VMS_INFO_FAILED(200), Correlation ID: null, Call Stack: null, Custom ID: null, Custom Event ID: -1, Message: Failed to retrieve VM/Templates information from export domain BackupMV
>
> The version of the oVirt that I am using is 4.1.
>
> Att,
>
> --
> Arthur Rodrigues Stilben
> _______________________________________________
> Users mailing list -- users(a)ovirt.org
> To unsubscribe send an email to users-leave(a)ovirt.org
> Privacy Statement: https://www.ovirt.org/site/privacy-policy/
> oVirt Code of Conduct: https://www.ovirt.org/community/about/community-guidelines/
> List Archives: https://lists.ovirt.org/archives/list/users@ovirt.org/message/Z6FQ45UVQOD...
Cannot activate/deactivate storage domain
by Albl, Oliver
Hi all,
I run an oVirt 4.3.6.7-1.el7 installation (50+ hosts, 40+ FC storage domains on two all-flash arrays) and experienced a problem accessing individual storage domains.
As a result, hosts were taken "Non Operational" because they could not see all storage domains, and the SPM role started to move around the hosts.
oVirt messages start with:
2019-11-04 15:10:22.739+01 | VDSM HOST082 command SpmStatusVDS failed: (-202, 'Sanlock resource read failure', 'IO timeout')
2019-11-04 15:10:22.781+01 | Invalid status on Data Center <name>. Setting Data Center status to Non Responsive (On host HOST82, Error: General Exception).
...
2019-11-04 15:13:58.836+01 | Host HOST017 cannot access the Storage Domain(s) HOST_LUN_204 attached to the Data Center <name>. Setting Host state to Non-Operational.
2019-11-04 15:13:58.85+01 | Host HOST005 cannot access the Storage Domain(s) HOST_LUN_204 attached to the Data Center <name>. Setting Host state to Non-Operational.
2019-11-04 15:13:58.85+01 | Host HOST012 cannot access the Storage Domain(s) HOST_LUN_204 attached to the Data Center <name>. Setting Host state to Non-Operational.
2019-11-04 15:13:58.851+01 | Host HOST002 cannot access the Storage Domain(s) HOST_LUN_204 attached to the Data Center <name>. Setting Host state to Non-Operational.
2019-11-04 15:13:58.851+01 | Host HOST010 cannot access the Storage Domain(s) HOST_LUN_204 attached to the Data Center <name>. Setting Host state to Non-Operational.
2019-11-04 15:13:58.851+01 | Host HOST011 cannot access the Storage Domain(s) HOST_LUN_204 attached to the Data Center <name>. Setting Host state to Non-Operational.
2019-11-04 15:13:58.852+01 | Host HOST004 cannot access the Storage Domain(s) HOST_LUN_204 attached to the Data Center <name>. Setting Host state to Non-Operational.
2019-11-04 15:13:59.011+01 | Host HOST017 cannot access the Storage Domain(s) <UNKNOWN> attached to the Data Center <UNKNOWN>. Setting Host state to Non-Operational.
2019-11-04 15:13:59.238+01 | Host HOST004 cannot access the Storage Domain(s) <UNKNOWN> attached to the Data Center <UNKNOWN>. Setting Host state to Non-Operational.
2019-11-04 15:13:59.249+01 | Host HOST005 cannot access the Storage Domain(s) <UNKNOWN> attached to the Data Center <UNKNOWN>. Setting Host state to Non-Operational.
2019-11-04 15:13:59.255+01 | Host HOST012 cannot access the Storage Domain(s) <UNKNOWN> attached to the Data Center <UNKNOWN>. Setting Host state to Non-Operational.
2019-11-04 15:13:59.273+01 | Host HOST002 cannot access the Storage Domain(s) <UNKNOWN> attached to the Data Center <UNKNOWN>. Setting Host state to Non-Operational.
2019-11-04 15:13:59.279+01 | Host HOST010 cannot access the Storage Domain(s) <UNKNOWN> attached to the Data Center <UNKNOWN>. Setting Host state to Non-Operational.
2019-11-04 15:13:59.386+01 | Host HOST011 cannot access the Storage Domain(s) <UNKNOWN> attached to the Data Center <UNKNOWN>. Setting Host state to Non-Operational.
2019-11-04 15:15:14.145+01 | Storage domain HOST_LUN_221 experienced a high latency of 9.60953 seconds from host HOST038. This may cause performance and functional issues. Please consult your Storage Administrator.
The problem mainly affected two storage domains (on the same array), but I also saw single messages for other storage domains (on the other array as well).
Storage domains stayed available to the hosts, and all VMs continued to run.
When constantly reading from the storage domains (/bin/dd iflag=direct if=<metadata> bs=4096 count=1 of=/dev/null) we got the expected 20+ MB/s on all but a few storage domains. One of them showed "transfer rates" around 200 bytes/s, but went up to normal performance from time to time. The transfer rate to this domain also differed between hosts.
/var/log/messages contains qla2xxx abort messages on almost all hosts. There are no errors on the SAN switches or the storage array (but the vendor is still investigating). I did not see high load on the storage array.
The system seemed to stabilize when I stopped all VMs on the affected storage domain and this storage domain became "inactive". Currently this storage domain is still inactive and we can neither place it in maintenance mode ("Failed to deactivate Storage Domain") nor activate it. The OVF metadata seems to be corrupt as well ("failed to update OVF disks <id>", "OVF data isn't updated on those OVF stores"). The first six 512-byte blocks of /dev/<id>/metadata seem to contain only zeros.
Any advice on how to proceed here?
Is there a way to recover this storage domain?
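For reference, this is the kind of additional information I can collect from a host if that helps (the vdsm-client and sanlock invocations are my best guess at the relevant 4.3 commands; UUIDs and paths are placeholders):

sanlock client status                                         # lockspaces/resources sanlock still holds
vdsm-client StorageDomain getInfo storagedomainID=<sd-uuid>
lvs -o +lv_tags <sd-uuid>                                     # the domain's special LVs (metadata, ids, leases, ...)
dd if=/dev/<sd-uuid>/metadata bs=512 count=64 | strings | head -n 40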
All the best,
Oliver