oVirt 4.3 node-ng host kernel options for nvidia vGPU
by Edward Berger
Hi,
One of our projects wants to try offering VMs with nvidia vGPU.
My co-worker had some problems before, so I thought I'd try the latest 4.3
ovirt-node-ng.
In the "Edit Host" -> kernel dialog I see two promising checkbox options
Hostdev Passthrough & SR-IOV (which adds to kernel line intel_iommu=on)
and
Blacklist Nouveau (which adds to kernel line rdblacklist=nouveau)
but they seem to be acting as mutually exclusive options, when both are
selected
the kernel command line box is outlined in red and I can't continue on.
Am I wrong to want both options?
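For reference, what I expect to end up with on the kernel command line, and what I assume I could set by hand with grubby if the UI keeps refusing, is something like this (untested workaround on my part, not an official procedure):
# append both options to every installed kernel, then reboot
grubby --update-kernel=ALL --args="intel_iommu=on rdblacklist=nouveau"
reboot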
Host compatibility issue after upgrade from 4.2.8 to 4.3.0
by ronjero@gmail.com
I have a three-node (hyperconverged) cluster. All hosts are identical hardware-wise, however after the upgrade one of the three is getting kicked out of the cluster with the following error: Host is compatible with versions (3.6,4.0,4.1,4.2) and cannot join Cluster...
The hosts have SandyBridge processors but do have SSBD:
CPU Type: Intel SandyBridge IBRS SSBD Family.
virsh -r capabilities | grep ssbd
<feature name='ssbd'/>
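To double-check what vdsm itself advertises to the engine, I assume I can also run this on the node (vdsm-client should be present on 4.2/4.3 hosts; the grep over its JSON output is my guess):
# both flags should appear in the reported cpuFlags
vdsm-client Host getCapabilities | egrep -o 'ssbd|ibrs' | sort -u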
The only other thing I should mention is that I *may* have changed the cluster compatibility level to 4.3 while this particular node was in maintenance mode (I don't know if that has any influence on this issue).
Any help getting this node back into the cluster would be greatly appreciated.
Ron.
ERROR running your engine inside of the hosted-engine VM and are not in "Global Maintenance" mode
by mhumaj@gmail.com
Hi,
We upgraded oVirt to 4.3. After the upgrade we wanted to run engine-setup, but we do not know how to put into maintenance what is simply another virtual machine running ovirt-engine; the hosted engine is running on the hosts.
During execution engine service will be stopped (OK, Cancel) [OK]:
[ ERROR ] It seems that you are running your engine inside of the hosted-engine VM and are not in "Global Maintenance" mode.
In that case you should put the system into the "Global Maintenance" mode before running engine-setup, or the hosted-engine HA agent might kill the machine, which might corrupt your data.
[ ERROR ] Failed to execute stage 'Setup validation': Hosted Engine setup detected, but Global Maintenance is not set.
[ INFO ] Stage: Clean up
Log file is located at /var/log/ovirt-engine/setup/ovirt-engine-setup-20190205121802-l7llrw.log
[ INFO ] Generating answer file '/var/lib/ovirt-engine/setup/answers/20190205121855-setup.conf'
[ INFO ] Stage: Pre-termination
[ INFO ] Stage: Termination
[ ERROR ] Execution of setup failed
From the hosted-engine nodes:
--== Host 2 status ==--
Host ID : 2
Engine status : {"reason": "vm not running on this host", "health": "bad", "vm": "down", "detail": "unknown"}
Can anyone please tell me how to put global maintenance on the virtual machine where the ovirt-engine is? Not the hosts: even if I put them into global maintenance, I am unable to run engine-setup on the VM with ovirt-engine.
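From the docs, my understanding is that global maintenance is toggled from one of the hosted-engine hosts, not from inside the engine VM; something like this (please correct me if I have it wrong):
# run on any hosted-engine host, not inside the engine VM:
hosted-engine --set-maintenance --mode=global
hosted-engine --vm-status    # should now report global maintenance
# then run engine-setup inside the engine VM, and afterwards, back on the host:
hosted-engine --set-maintenance --mode=none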
thanks
[4.3.0] VNC Virt-viewer console not opening
by Nicolas Ecarnot
Hello,
First, congratulations to all of you who worked for this 4.3.0 release,
and obviously thank you.
Today, I upgraded 4 oVirt setups (4 DC) from 4.2.7 to 4.3.0.
It went well on all 4 DCs.
But on one of them, when I try to open a console, I see it open in a
flash (it opens and closes immediately).
I'm using Firefox 64.0 with Ubuntu 18.10, and all my VMs are set up like
this:
- video type : QXL
- Gfx protocol : VNC
- VNC Kbd layout : fr
and I'm using virt-viewer
On the problematic DC, all the VMs are showing the same issue.
When I try to use Spice instead of VNC, it works nicely.
When I try to use noVNC, the additional tab opens and shows "Unsupported
security types: 19"
I tried to track down this issue thanks to the firefox dev console, but
it's beyond my understanding.
Trying the same with Chromium does the same blinking open/close.
I'd rather learn how to provide additional debug messages, but
/var/log/ovirt-engine/engine.log does not give any useful hint:
2019-02-04 16:57:04,150+01 INFO [org.ovirt.engine.core.bll.SetVmTicketCommand] (default task-24) [1fb01d42] Running command: SetVmTicketCommand internal: false. Entities affected : ID: 0c3e02b3-7fec-4bb1-b3d6-2e6c228e7278 Type: VMAction group CONNECT_TO_VM with role type USER
2019-02-04 16:57:04,155+01 INFO [org.ovirt.engine.core.vdsbroker.vdsbroker.SetVmTicketVDSCommand] (default task-24) [1fb01d42] START, SetVmTicketVDSCommand(HostName = hv01.prd.sdis38.fr, SetVmTicketVDSCommandParameters:{hostId='687c1c01-a5e1-449c-89d2-9713ccfc2487', vmId='0c3e02b3-7fec-4bb1-b3d6-2e6c228e7278', protocol='VNC', ticket='IivrpGHx5zSw', validTime='120', userName='admin', userId='4a340386-851a-11e8-863d-3417ebeef1af', disconnectAction='NONE'}), log id: 2a897f30
2019-02-04 16:57:04,188+01 INFO [org.ovirt.engine.core.vdsbroker.vdsbroker.SetVmTicketVDSCommand] (default task-24) [1fb01d42] FINISH, SetVmTicketVDSCommand, return: , log id: 2a897f30
2019-02-04 16:57:04,211+01 INFO [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (default task-24) [1fb01d42] EVENT_ID: VM_SET_TICKET(164), User admin@internal-authz initiated console session for VM ad02.ctat.sdis38.fr
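If it helps, I can also launch the downloaded console.vv by hand to capture client-side messages; assuming remote-viewer (from virt-viewer) is what the browser hands off to, something like:
# keep the client's debug output so I can attach it to the list
remote-viewer --debug -v ~/Downloads/console.vv 2>&1 | tee remote-viewer.log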
What could I give to help you help me?
--
Nicolas ECARNOT
Upgrade guide for node from 4.2->4.3
by Juhani Rautiainen
Hi!
Thanks for the new release. I managed to upgrade the engine from
4.2 to 4.3 with the old upgrade instructions, but I'm having problems with
the 4.2 node upgrades. Adding the repos with ovirt-release43 doesn't allow
the node to upgrade: lots of missing dependencies. The dpdk dependencies
could be solved by adding CentOS Extras, but which repo should be used for
the openscap, openscap-utils and scap-security-guide packages?
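For reference, this is roughly what I ran on a node (the release RPM URL is from the release notes; the update package name is from memory, so treat it as approximate):
# add the 4.3 repos, then try to pull the new node image
yum install https://resources.ovirt.org/pub/yum-repo/ovirt-release43.rpm
yum update ovirt-node-ng-image-update    # this is where the missing dependencies appear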
-Juhani
Ovirt cluster unstable; gluster to blame (again)
by Jim Kusznir
hi all:
Once again my production ovirt cluster is collapsing in on itself. My
servers are intermittently unavailable or degrading, customers are noticing
and calling in. This seems to be yet another gluster failure that I
haven't been able to pin down.
I posted about this a while ago, but didn't get anywhere (no replies that I
found). The problem started out as a glusterfsd process consuming large
amounts of ram (up to the point where ram and swap were exhausted and the
kernel OOM killer killed off the glusterfsd process). For reasons not
clear to me at this time, that resulted in any VMs running on that host and
that gluster volume being paused with an I/O error (the glusterfs process is
usually unharmed; why it didn't continue I/O with the other servers is
confusing to me).
I have 3 servers and a total of 4 gluster volumes (engine, iso, data, and
data-hdd). The first 3 are replica 2+arb; the 4th (data-hdd) is replica
3. The first 3 are backed by an LVM partition (some thin provisioned) on
an SSD; the 4th is on a seagate hybrid disk (hdd + some internal flash for
acceleration). data-hdd is the only thing on the disk. Servers are Dell
R610 with the PERC 6/i RAID card, with the disks individually passed through
to the OS (no raid enabled).
The above RAM usage issue came from the data-hdd volume. Yesterday, I
caught one of the glusterfsd high-RAM-usage episodes before the OOM killer
had to run. I was able to migrate the VMs off the machine and, for good measure,
reboot the entire machine (after taking this opportunity to run the
software updates that ovirt said were pending). Upon booting back up, the
necessary volume healing began. However, this time, the healing caused all
three servers to go to very, very high load averages (I saw just under 200
on one server; typically they've been 40-70) with top reporting IO Wait at
7-20%. Network for this volume is a dedicated gig network. According to
bwm-ng, initially the network bandwidth would hit 50MB/s (yes, bytes), but
tailed off to mostly in the kB/s for a while. All machines' load averages
were still 40+ and gluster volume heal data-hdd info reported 5 items
needing healing. Servers were intermittently experiencing IO issues, even
on the 3 gluster volumes that appeared largely unaffected. Even the OS
activities on the hosts itself (logging in, running commands) would often
be very delayed. The ovirt engine was seemingly randomly throwing engine
down / engine up / engine failed notifications. Responsiveness on ANY VM
was horrific most of the time, with random VMs being inaccessible.
I let the gluster heal run overnight. By morning, there were still 5 items
needing healing, all three servers were still experiencing high load, and
servers were still largely unstable.
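For what it's worth, this is how I've been watching the heal (exact subcommands may vary with the gluster version):
gluster volume heal data-hdd info                   # entries still pending heal
gluster volume heal data-hdd statistics heal-count  # per-brick pending counts
gluster volume status data-hdd                      # brick and self-heal daemon health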
I've noticed that all of my ovirt outages (and I've had a lot, way more
than is acceptable for a production cluster) have come from gluster. I
still have 3 VMs whose hard disk images have become corrupted by my last
gluster crash that I haven't had time to repair / rebuild yet (I believe
this crash was caused by the OOM issue previously mentioned, but I didn't
know it at the time).
Is gluster really ready for production yet? It seems so unstable to
me.... I'm looking at replacing gluster with a dedicated NFS server, likely
FreeNAS. Any suggestions? What is the "right" way to do production
storage on this 3-node cluster? Can I get this gluster volume stable
enough to get my VMs to run reliably again until I can deploy another
storage solution?
--Jim
ETL service aggregation to hourly tables has encountered an error. Please consult the service log for more details.
by melnyksergii@gmail.com
Dears,
I am seeing an error in oVirt 4.2.7.
In the dashboard I see:
ETL service aggregation to hourly tables has encountered an error. Please consult the service log for more details.
In the log on the oVirt engine server:
2019-01-14 15:59:59|rwL6AB|euUXph|wfcjQ7|OVIRT_ENGINE_DWH|HourlyTimeKeepingJob|Default|5|tWarn|tWarn_1|2019-01-14 15:59:59| ETL service aggregation to hourly tables has encountered an error. lastHourAgg value =Mon Jan 14 14:00:00 EET 2019 and runTime = Mon Jan 14 15:59:59 EET 2019 .Please consult the service log for more details.|42
In some sources people said the problem is in the PostgreSQL DB, but I don't understand how I can fix this problem.
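Before touching the database: I assume the "service log" the message refers to is the DWH daemon's own log, so I was going to start with:
systemctl status ovirt-engine-dwhd                           # is the DWH service healthy?
tail -n 100 /var/log/ovirt-engine-dwh/ovirt-engine-dwhd.log  # the service log the message points to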
Thanks
One disk in illegal state after deleting snapshot with several disks
by Florian Schmid
Hi,
I'm using oVirt 4.2.5. I know it is not the newest anymore, but our cluster is quite big and I will upgrade to 4.2.8 as soon as possible.
I have a VM with several disks, one with virtio (boot device) and 3 other disks with virtio-scsi:
log 8190c797-0ed8-421f-85c7-cc1f540408f8 1 GiB
root 5edab51c-9113-466c-bd27-e73d4bfb29c4 10 GiB
tmp 11d74762-6053-4347-bdf2-4838dc2ea6f0 1 GiB
web_web-content bb5b1881-d40f-4ad1-a8c8-8ee594b3fe8a 20 GiB
The snapshots were quite small, because not much is changing there. All disks are on an NFS v3 share running on a NetApp cluster.
Some IDs:
VM ID: bc25c5c9-353b-45ba-b0d5-5dbba41e9c5f
affected disk ID: 6cbd2f85-8335-416f-a208-ef60ecd839a4
Snapshot ID: c8103ae8-3432-4b69-8b91-790cdc37a2da
Snapshot disk ID: 2564b125-857e-41fa-b187-2832df277ccf
Task ID: 2a60efb5-1a11-49ac-a7f0-406faac219d6
Storage domain ID: 14794a3e-16fc-4dd3-a867-10507acfe293
After triggering the snapshot delete task (2a60efb5-1a11-49ac-a7f0-406faac219d6), the deletion was running for about one hour; I thought it was hanging, so I restarted the engine process on the self-hosted engine...
After that, the snapshot was still locked, so I deleted the lock:
/usr/share/ovirt-engine/setup/dbutils/unlock_entity.sh -t snapshot c8103ae8-3432-4b69-8b91-790cdc37a2da
##########################################
CAUTION, this operation may lead to data corruption and should be used with care. Please contact support prior to running this command
##########################################
Are you sure you want to proceed? [y/n]
y
select fn_db_unlock_snapshot('c8103ae8-3432-4b69-8b91-790cdc37a2da');
INSERT 0 1
unlock snapshot c8103ae8-3432-4b69-8b91-790cdc37a2da completed successfully.
After trying to delete the snapshot again, the engine gave the error that the disk is in status Illegal.
The snapshot file is still there for the "log" disk:
-rw-rw----. 1 vdsm kvm 282M Feb 1 13:33 2564b125-857e-41fa-b187-2832df277ccf
-rw-rw----. 1 vdsm kvm 1.0M Jan 22 03:22 2564b125-857e-41fa-b187-2832df277ccf.lease
-rw-r--r--. 1 vdsm kvm 267 Feb 1 14:05 2564b125-857e-41fa-b187-2832df277ccf.meta
-rw-rw----. 1 vdsm kvm 1.0G Feb 1 11:53 6cbd2f85-8335-416f-a208-ef60ecd839a4
-rw-rw----. 1 vdsm kvm 1.0M Jan 10 12:07 6cbd2f85-8335-416f-a208-ef60ecd839a4.lease
-rw-r--r--. 1 vdsm kvm 272 Jan 22 03:22 6cbd2f85-8335-416f-a208-ef60ecd839a4.meta
All other snapshots have been merged successfully. I unmounted the disk inside the VM after I saw that the snapshot disk was still in use; that's why the date is not changing anymore.
The strange thing is that it looks like the merge was working for a short time, because the timestamp of the underlying disk has also changed...
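To double-check what the chain on disk actually looks like, I figure I can also run qemu-img against the leaf from the image directory on the NFS domain (mount path abbreviated, <netapp-export> stands for our export):
cd /rhev/data-center/mnt/<netapp-export>/14794a3e-16fc-4dd3-a867-10507acfe293/images/8190c797-0ed8-421f-85c7-cc1f540408f8
qemu-img info --backing-chain 2564b125-857e-41fa-b187-2832df277ccf  # should show the COW leaf and its RAW backing file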
In database, I have this data about the VM and its snapshot:
engine=# select snapshot_id,snapshot_type,status,description from snapshots where vm_id='bc25c5c9-353b-45ba-b0d5-5dbba41e9c5f';
snapshot_id | snapshot_type | status | description
--------------------------------------+---------------+--------+-------------
f596ba1c-4a6e-4372-9df4-c8e870c55fea | ACTIVE | OK | Active VM
c8103ae8-3432-4b69-8b91-790cdc37a2da | REGULAR | OK | cab-3449
engine=# select image_guid,parentid,imagestatus,vm_snapshot_id,volume_type,volume_format,active from images where image_group_id='8190c797-0ed8-421f-85c7-cc1f540408f8';
image_guid | parentid | imagestatus | vm_snapshot_id | volume_type | volume_format | active
--------------------------------------+--------------------------------------+-------------+--------------------------------------+-------------+---------------+--------
2564b125-857e-41fa-b187-2832df277ccf | 6cbd2f85-8335-416f-a208-ef60ecd839a4 | 1 | f596ba1c-4a6e-4372-9df4-c8e870c55fea | 2 | 4 | t
6cbd2f85-8335-416f-a208-ef60ecd839a4 | 00000000-0000-0000-0000-000000000000 | 4 | c8103ae8-3432-4b69-8b91-790cdc37a2da | 2 | 5 | f
vdsm-tool dump-volume-chains 14794a3e-16fc-4dd3-a867-10507acfe293:
image: 8190c797-0ed8-421f-85c7-cc1f540408f8
- 6cbd2f85-8335-416f-a208-ef60ecd839a4
status: OK, voltype: INTERNAL, format: RAW, legality: LEGAL, type: SPARSE
- 2564b125-857e-41fa-b187-2832df277ccf
status: OK, voltype: LEAF, format: COW, legality: LEGAL, type: SPARSE
@bzlotnik,
it would be great if you could help me get the disk back without stopping or starting the VM. If I read the images table correctly (imagestatus 1 = OK, 4 = ILLEGAL), the engine still flags the base volume even though vdsm reports the whole chain as LEGAL. I'm really afraid now of deleting snapshots...
I will send you the vdsm log from host running the VM and from SPM and engine.log.
Thank you very much!
BR Florian Schmid
Wrong CPU performance report
by Hetz Ben Hamo
Hi,
I'm running oVirt 4.2.7.1. I installed Windows 10 Pro as a guest, along
with QXL and all the drivers, as well as the QEMU guest agent.
While Windows reports a CPU usage of something like 2-4% when idle, oVirt
reports 24-27% CPU usage.
Bug? Should I report it?
Thanks