Once again my production ovirt cluster is collapsing in on itself. My
servers are intermittently unavailable or degrading, customers are noticing
and calling in. This seems to be yet another gluster failure that I
haven't been able to pin down.
I posted about this a while ago, but didn't get anywhere (no replies that I
found). The problem started out as a glusterfsd process consuming large
amounts of RAM (up to the point where RAM and swap were exhausted and the
kernel OOM killer killed off the glusterfsd process). For reasons not
clear to me at this time, that resulted in any VMs running on that host and
using that gluster volume being paused with an I/O error (the glusterfs
process is usually unharmed; why it didn't continue I/O with the other
servers is confusing to me).
I have 3 servers and a total of 4 gluster volumes (engine, iso, data, and
data-hdd). The first 3 are replica 2+arb; the 4th (data-hdd) is replica
3. The first 3 are backed by an LVM partition (some thin provisioned) on
an SSD; the 4th is on a Seagate hybrid disk (HDD + some internal flash for
acceleration). data-hdd is the only thing on the disk. Servers are Dell
R610 with the PERC/6i RAID card, with the disks individually passed through
to the OS (no RAID enabled).
The above RAM usage issue came from the data-hdd volume. Yesterday, I
caught one of the glusterfsd processes with high RAM usage before the OOM
killer had to run. I was able to migrate the VMs off the machine and, for
good measure,
reboot the entire machine (after taking this opportunity to run the
software updates that ovirt said were pending). Upon booting back up, the
necessary volume healing began. However, this time, the healing caused all
three servers to go to very, very high load averages (I saw just under 200
on one server; typically they've been 40-70) with top reporting IO Wait at
7-20%. The network for this volume is a dedicated gigabit network. According
to bwm-ng, the network bandwidth initially hit 50MB/s (yes, bytes), but then
tailed off to mostly kB/s for a while. All machines' load averages
were still 40+ and gluster volume heal data-hdd info reported 5 items
needing healing. Servers were intermittently experiencing IO issues, even
on the 3 gluster volumes that appeared largely unaffected. Even OS
activities on the hosts themselves (logging in, running commands) would often
be very delayed. The oVirt engine was seemingly randomly throwing engine
down / engine up / engine failed notifications. Responsiveness on ANY VM
was horrific most of the time, with random VMs being inaccessible.
I let the gluster heal run overnight. By morning, there were still 5 items
needing healing, all three servers were still experiencing high load, and
servers were still largely unstable.
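To see whether a heal like this is actually progressing, the pending count per brick can be pulled out of the heal info output and compared over time. The following is a minimal Python sketch, assuming the usual layout of that output (a "Brick ..." line followed later by "Number of entries: N"). The host names, brick paths and file entry in the sample are hypothetical and not taken from this cluster.

```python
import re

def pending_heals(heal_info_text):
    """Map each brick to its pending entry count, parsed from heal info output."""
    counts = {}
    brick = None
    for line in heal_info_text.splitlines():
        line = line.strip()
        if line.startswith("Brick "):
            brick = line[len("Brick "):]
        elif brick is not None:
            match = re.match(r"Number of entries:\s*(\d+)", line)
            if match:
                counts[brick] = int(match.group(1))
                brick = None
    return counts

# Hypothetical sample output; real brick names and entries will differ.
SAMPLE = """\
Brick host1:/gluster/brick3/data-hdd
/some/image/file
Status: Connected
Number of entries: 1

Brick host2:/gluster/brick3/data-hdd
Status: Connected
Number of entries: 0
"""

if __name__ == "__main__":
    for brick, count in pending_heals(SAMPLE).items():
        print(f"{brick}: {count} pending")
```

In practice the text would come from running gluster volume heal data-hdd info on one of the hosts. A brick whose count is not numeric is simply left out of the result.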
I've noticed that all of my ovirt outages (and I've had a lot, way more
than is acceptable for a production cluster) have come from gluster. I
still have 3 VMs whose hard disk images were corrupted by my last
gluster crash that I haven't had time to repair / rebuild yet (I believe
this crash was caused by the OOM issue previously mentioned, but I didn't
know it at the time).
Is gluster really ready for production yet? It seems so unstable to
me.... I'm looking at replacing gluster with a dedicated NFS server, likely
FreeNAS. Any suggestions? What is the "right" way to do production
storage on this (3 node cluster)? Can I get this gluster volume stable
enough to get my VMs to run reliably again until I can deploy another
I have an error in oVirt 4.2.7.
In the dashboard I see:
ETL service aggregation to hourly tables has encountered an error. Please consult the service log for more details.
In the log on the oVirt engine server:
2019-01-14 15:59:59|rwL6AB|euUXph|wfcjQ7|OVIRT_ENGINE_DWH|HourlyTimeKeepingJob|Default|5|tWarn|tWarn_1|2019-01-14 15:59:59| ETL service aggregation to hourly tables has encountered an error. lastHourAgg value =Mon Jan 14 14:00:00 EET 2019 and runTime = Mon Jan 14 15:59:59 EET 2019 .Please consult the service log for more details.|42
Some sources say the problem is in the PostgreSQL DB, but I don't understand how I can fix it.
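The warning carries two timestamps, lastHourAgg and runTime, and their difference shows how far the hourly aggregation has fallen behind. Below is a small Python sketch that extracts both from the message text and computes the gap. It assumes both timestamps use the same time zone (the zone name is skipped, not interpreted) and that the default C locale is in effect for day and month names.

```python
import re
from datetime import datetime

# Matches timestamps such as "Mon Jan 14 14:00:00 EET 2019". The zone name is
# matched only to be skipped, so both stamps are assumed to share one zone.
STAMP = re.compile(r"(\w{3} \w{3} +\d{1,2} \d{2}:\d{2}:\d{2}) \w+ (\d{4})")

def aggregation_lag(message):
    """Return runTime minus lastHourAgg as a timedelta, or None if not found."""
    stamps = [
        datetime.strptime(f"{body} {year}", "%a %b %d %H:%M:%S %Y")
        for body, year in STAMP.findall(message)
    ]
    if len(stamps) < 2:
        return None
    return stamps[1] - stamps[0]

# Message fragment copied from the log line above.
MESSAGE = ("lastHourAgg value =Mon Jan 14 14:00:00 EET 2019 "
           "and runTime = Mon Jan 14 15:59:59 EET 2019 .")

if __name__ == "__main__":
    print(aggregation_lag(MESSAGE))
```

For the line quoted here the gap is just under two hours, which only quantifies the delay; it does not say why the aggregation stopped.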
I'm using oVirt 4.2.5. I know it is no longer the newest release, but our cluster is quite big and I will upgrade to 4.2.8 as soon as possible.
I have a VM with several disks, one with virtio (boot device) and 3 other disks with virtio-scsi:
log 8190c797-0ed8-421f-85c7-cc1f540408f8 1 GiB
root 5edab51c-9113-466c-bd27-e73d4bfb29c4 10 GiB
tmp 11d74762-6053-4347-bdf2-4838dc2ea6f0 1 GiB
web_web-content bb5b1881-d40f-4ad1-a8c8-8ee594b3fe8a 20 GiB
The snapshots were quite small, because not much changes there. All disks are on an NFS v3 share running on a NetApp cluster.
VM ID: bc25c5c9-353b-45ba-b0d5-5dbba41e9c5f
affected disk ID: 6cbd2f85-8335-416f-a208-ef60ecd839a4
Snapshot ID: c8103ae8-3432-4b69-8b91-790cdc37a2da
Snapshot disk ID: 2564b125-857e-41fa-b187-2832df277ccf
Task ID: 2a60efb5-1a11-49ac-a7f0-406faac219d6
Storage domain ID: 14794a3e-16fc-4dd3-a867-10507acfe293
After triggering the snapshot delete task (2a60efb5-1a11-49ac-a7f0-406faac219d6), the deletion ran for about one hour. I thought it was hanging, so I restarted the engine process on the self-hosted engine...
After that, the snapshot was still in a locked state, so I removed the lock:
/usr/share/ovirt-engine/setup/dbutils/unlock_entity.sh -t snapshot c8103ae8-3432-4b69-8b91-790cdc37a2da
CAUTION, this operation may lead to data corruption and should be used with care. Please contact support prior to running this command
Are you sure you want to proceed? [y/n]
INSERT 0 1
unlock snapshot c8103ae8-3432-4b69-8b91-790cdc37a2da completed successfully.
When I tried to delete the snapshot again, the engine returned an error saying that the disk is in status Illegal.
The snapshot file is still there for the disk log:
-rw-rw----. 1 vdsm kvm 282M Feb 1 13:33 2564b125-857e-41fa-b187-2832df277ccf
-rw-rw----. 1 vdsm kvm 1.0M Jan 22 03:22 2564b125-857e-41fa-b187-2832df277ccf.lease
-rw-r--r--. 1 vdsm kvm 267 Feb 1 14:05 2564b125-857e-41fa-b187-2832df277ccf.meta
-rw-rw----. 1 vdsm kvm 1.0G Feb 1 11:53 6cbd2f85-8335-416f-a208-ef60ecd839a4
-rw-rw----. 1 vdsm kvm 1.0M Jan 10 12:07 6cbd2f85-8335-416f-a208-ef60ecd839a4.lease
-rw-r--r--. 1 vdsm kvm 272 Jan 22 03:22 6cbd2f85-8335-416f-a208-ef60ecd839a4.meta
All other snapshots have been merged successfully. After I saw that the snapshot disk was still in use, I unmounted the disk inside the VM. That is why the timestamp no longer changes.
The strange thing is that the merge seems to have been working for a short time, because the modification time of the underlying disk has also changed...
In the database, I have this data about the VM and its snapshot:
engine=# select snapshot_id,snapshot_type,status,description from snapshots where vm_id='bc25c5c9-353b-45ba-b0d5-5dbba41e9c5f';
             snapshot_id              | snapshot_type | status | description
--------------------------------------+---------------+--------+-------------
 f596ba1c-4a6e-4372-9df4-c8e870c55fea | ACTIVE        | OK     | Active VM
 c8103ae8-3432-4b69-8b91-790cdc37a2da | REGULAR       | OK     | cab-3449
engine=# select image_guid,parentid,imagestatus,vm_snapshot_id,volume_type,volume_format,active from images where image_group_id='8190c797-0ed8-421f-85c7-cc1f540408f8';
              image_guid              |               parentid               | imagestatus |            vm_snapshot_id            | volume_type | volume_format | active
--------------------------------------+--------------------------------------+-------------+--------------------------------------+-------------+---------------+--------
 2564b125-857e-41fa-b187-2832df277ccf | 6cbd2f85-8335-416f-a208-ef60ecd839a4 |           1 | f596ba1c-4a6e-4372-9df4-c8e870c55fea |           2 |             4 | t
 6cbd2f85-8335-416f-a208-ef60ecd839a4 | 00000000-0000-0000-0000-000000000000 |           4 | c8103ae8-3432-4b69-8b91-790cdc37a2da |           2 |             5 | f
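The numeric columns in the images table are easier to read once decoded. Here is a minimal Python sketch that does so for the two rows above. The code-to-name tables reflect my understanding of the oVirt engine enums and should be treated as assumptions to verify against the engine source for the version in use; only codes relevant to these rows are listed.

```python
# Code-to-name tables as understood from the oVirt engine enums (assumption:
# verify against the engine source for your version before relying on them).
IMAGE_STATUS = {1: "OK", 2: "LOCKED", 4: "ILLEGAL"}
VOLUME_TYPE = {1: "PREALLOCATED", 2: "SPARSE"}
VOLUME_FORMAT = {4: "COW", 5: "RAW"}

def decode_image(image_guid, imagestatus, volume_type, volume_format, active):
    """Translate the numeric columns of one images row into readable names."""
    return {
        "image_guid": image_guid,
        "imagestatus": IMAGE_STATUS.get(imagestatus, f"unknown({imagestatus})"),
        "volume_type": VOLUME_TYPE.get(volume_type, f"unknown({volume_type})"),
        "volume_format": VOLUME_FORMAT.get(volume_format, f"unknown({volume_format})"),
        "active": active,
    }

# Values copied from the query result above.
ROWS = [
    ("2564b125-857e-41fa-b187-2832df277ccf", 1, 2, 4, True),
    ("6cbd2f85-8335-416f-a208-ef60ecd839a4", 4, 2, 5, False),
]

if __name__ == "__main__":
    for row in ROWS:
        print(decode_image(*row))
```

Under these assumptions the base image 6cbd2f85-... is the one the database marks as illegal, while the active layer 2564b125-... is marked OK.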
vdsm-tool dump-volume-chains 14794a3e-16fc-4dd3-a867-10507acfe293:
status: OK, voltype: INTERNAL, format: RAW, legality: LEGAL, type: SPARSE
status: OK, voltype: LEAF, format: COW, legality: LEGAL, type: SPARSE
It would be great if you could help me get the disk back without stopping or starting the VM. I'm really afraid of deleting snapshots now...
I will send you the vdsm log from the host running the VM and from the SPM, as well as engine.log.
Thank you very much!
BR Florian Schmid
I'm running oVirt 220.127.116.11. I installed Windows 10 pro as a guest, along
with QXL and all the drivers, as well as QEMU guest agent.
While Windows reports a CPU usage of something like 2-4% when idle, oVirt
reports 24-27% CPU usage.
Is this a bug? Should I report it?
I'm trying to connect my oVirt cluster to my vSphere cluster to import my guests. When trying to connect, I get:
"VDSM ovirt1 command GetVmsNamesFromExternalProviderVDS failed: internal error: curl_easy_perform() returned an error: Couldn't connect to server (7) : Failed connect to 10.0.0.55:443; Connection timed out"
ovirt1 node (192.168.1.195) is at location A while vSphere (10.0.0.55) is at location B.
I added a static route on ovirt1 for 10.0.0.0/8 via 192.168.1.13, which has a VPN connection back to 10.0.0.0/8. ICMP from ovirt1 gets through just fine, but any other traffic never leaves ovirt1 (i.e., traffic to port 443 never arrives at 192.168.1.13).
I'm assuming there's a firewall rule somewhere blocking everything other than outbound ICMP, but I have been unable to find it. Any suggestions?
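Since ping succeeds but the HTTPS connection times out, a plain TCP connect attempt from ovirt1 separates the two cases outside of VDSM. The following is a minimal Python sketch using only the standard library; the function name tcp_reachable is made up for illustration.

```python
import socket

def tcp_reachable(host, port, timeout=5.0):
    """Return (True, None) if a TCP connection succeeds, else (False, reason)."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True, None
    except OSError as exc:
        # Covers timeouts, refused connections and unreachable networks.
        return False, f"{type(exc).__name__}: {exc}"

if __name__ == "__main__":
    # Address and port taken from the error message above.
    ok, reason = tcp_reachable("10.0.0.55", 443)
    print("reachable" if ok else f"not reachable ({reason})")
```

A timeout here would be consistent with packets being dropped somewhere along the path, whereas a refused connection would suggest the packets do reach a host that is not listening on that port. Neither result identifies which rule or device is responsible.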
I have successfully set up a CentOS VM; it's up and running. Now I need to
set up a Windows 10 VM and I can't seem to get anything to work.
I've tried setting the OS type to Other OS or Windows 10 64bit. With Windows
10 64bit, it fails to start up at all.... When I select Other OS it will allow
me to start running the Windows ISO, but then it fails after a few seconds as
well. Is there something I'm not configuring correctly?
I noticed that if I create a directory in the root of the ISO domain and put an
image in it, the image shows up in the admin portal image list (listed as
"foo/bar.iso"), but mounting that ISO in a VM ("Change CD") fails with:
"Error while executing action Change CD: Drive image file could not be
I can easily reproduce this behaviour in oVirt 4.2.8 and RHV 4.2.7.
Is it a bug?
Wish I was @FOSDEM :'( Next year...
When users access the self-service VM portal, we wish to
disable/hide our admin management realm.
Is there currently a way to do this? We are maybe a bit paranoid, but
information leakage can bite you in the ass.
We want to hook up Keycloak for the user portal and use the current IPA
authentication for the admins only....
With kind regards,
THE IDIOT COMPANY
7547 TA Enschede
The Netherlands +31 (0)53 20 30 275
Uncaught exception occurred. Please try reloading the page. Details: (TypeError) : Cannot read property 'Vg' of null
Please have your administrator check the UI logs
I'm trying to set up oVirt 4.3 RC because it's the only one that supports AMD EPYC. The issue I'm having is this: I created a RAID 5 array of 5 120GB SSDs using mdadm and mounted it at /VM. I put the host into maintenance and selected Configure Local Storage, and it fails, telling me:
New Local Storage Domain: Storage format is unsupported
I formatted the array as ext4. This is a self-hosted system. Does anyone have any ideas?
[root@ovirt VM]# uname -r
[root@ovirt VM]# cat /etc/centos-release
CentOS Linux release 7.6.1810 (Core)
[root@ovirt VM]# rpm -qa|grep ovirt