Random host disconnects
by anton.louw@voxtelecom.co.za
Hi All,
I have a strange issue in my oVirt environment. I currently have a standalone manager which is running in VMware. In my oVirt environment, I have two Data Centers. The manager is currently sitting on the same subnet as DC1. Randomly, hosts in DC2 will say “Not Responding” and then 2 seconds later, the hosts will activate again.
The strange thing is that when the manager was sitting on the same subnet as DC2, hosts in DC1 would randomly say “Not Responding”.
I have tried going through the logs, but I cannot see anything out of the ordinary regarding why the hosts would drop connection. I have attached the engine.log for anybody that would like to do a spot check.
Thanks
4 years, 4 months
Disconnected Server has closed the connection.
by info@worldhostess.com
It seems that the installation is complete, but I have a problem: the web pages take a very long time to open, and the connection drops all the time. It is impossible to do anything.
I can ping the hostname, as I set up a sub-domain for it. To be honest, I am new to this and it took me days to get to this point. I think there are some issues with my network settings.
If there are any oVirt experts who can check my installation and give me advice on how to improve it, it would be greatly appreciated.
I followed "Installing oVirt as a self-hosted engine using the Cockpit web interface".
4 years, 4 months
Gluster quorum issue on 3-node HCI with extra 5-nodes as compute and storage nodes
by thomas@hoberg.net
Yes, I've also posted this on the Gluster Slack. But I am using Gluster mostly because it's part of oVirt HCI, so don't just send me away, please!
Problem: GlusterD refusing to start due to quorum issues for volumes where it isn’t contributing any brick
(I've had this before on a different farm, but there it was transitory. Now I can observe it more reliably, which is why I am opening a new topic.)
In a test farm with recycled servers, I started running Gluster via oVirt 3-node HCI, because I originally got 3 machines.
They were set up as group A in a 2:1 (replica:arbiter) oVirt HCI setup with 'engine', 'vmstore' and 'data' volumes, one brick on each node.
I then got another five machines with hardware specs that were rather different to group A, so I set those up as group B to mostly act as compute nodes, but also to provide extra storage, mostly to be used externally as GlusterFS shares. It took a bit of fiddling with Ansible but I got these 5 nodes to serve two more Gluster volumes 'tape' and 'scratch' using dispersed bricks (4 disperse:1 redundancy), RAID5 in my mind.
The two groups are in one Gluster, not because they serve bricks to the same volumes, but because oVirt doesn't like nodes to be in different Glusters (or actually, to already be in a Gluster when you add them as host node). But the two groups provide bricks to distinct volumes, there is no overlap.
After setup, things ran fine for weeks, but now I needed to restart a machine from group B, which has ‘tape’ and ‘scratch’ bricks, but none from the original oVirt ‘engine’, ‘vmstore’ and ‘data’ volumes in group A. Yet the gluster daemon refuses to start, citing a loss of quorum for these three volumes, even though it has no bricks in them… which makes no sense to me.
I am afraid the source of the issue is conceptual: I clearly don't really understand some design assumptions of Gluster.
And I'm afraid the design assumptions of Gluster and of oVirt (even with HCI), are not as related as one might assume from the marketing materials on the oVirt home-page.
But most of all I'd like to know: How do I fix this now?
I can't heal 'tape' and 'scratch', which are growing ever more apart while the glusterd on this machine in group B refuses to come online for lack of a quorum on volumes where it is not contributing bricks.
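For readers following the quorum question, here is a minimal sketch of how a pool-wide server quorum check could be computed. It assumes, as far as the Gluster documentation for `cluster.server-quorum-type` and `cluster.server-quorum-ratio` describes it, that server-side quorum is counted over all peers in the trusted storage pool rather than over the hosts that carry bricks for a given volume. The function name, its parameters and the rounding rule are illustrative assumptions, not glusterd's actual code.

```python
import math


def server_quorum_met(total_peers, active_peers, ratio_percent=None):
    """Hypothetical pool-wide quorum check (not glusterd's real implementation).

    total_peers   -- all peers in the trusted pool, whatever volumes they serve
    active_peers  -- peers currently reachable, including the local node
    ratio_percent -- assumed stand-in for cluster.server-quorum-ratio; when it
                     is None, "more than half of the peers" is assumed
    """
    if ratio_percent is None:
        needed = total_peers // 2 + 1
    else:
        needed = math.ceil(total_peers * ratio_percent / 100)
    return active_peers >= needed


# The pool in this post has 3 (group A) + 5 (group B) = 8 peers.
for active in (8, 5, 4, 1):
    print(active, server_quorum_met(8, active))
```

Under these assumptions, a group B node is counted in the same pool as the group A nodes, so a volume with server quorum enabled can be affected by a node that holds none of its bricks.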
4 years, 4 months
What is the purpose of memory deflation in oVirt memory ballooning?
by pub.virtualization@gmail.com
Hi, guys
Why does momd (the ballooning manager in oVirt) explicitly deflate the balloon when the host has plenty of memory?
As far as I know, momd supports memory ballooning through the setMemory API, which inflates or deflates the balloon in the guest.
I have just checked how memory changes in the guest after inflating the balloon.
As expected, memory (total, free, available) in the guest was reduced just after inflating the balloon, but it was "automatically" restored to its initial value after a few seconds.
So I'm wondering why an explicit deflation is additionally required, given that memory is restored automatically after a few seconds.
Thanks.
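To make the inflate/deflate question concrete, here is a toy balloon controller. It is a simplified model and not MoM's actual policy; the function name, the 20% pressure threshold and the 5% step are hypothetical values chosen for illustration only.

```python
def next_balloon_target(current_kib, max_kib, min_kib, host_free_pct,
                        pressure_threshold_pct=20.0, step_pct=5.0):
    """Return the next guest memory target in KiB for one controller cycle.

    Lowering the target corresponds to inflating the balloon (the guest gets
    less memory); raising it corresponds to deflating the balloon.
    """
    step = int(max_kib * step_pct / 100)
    if host_free_pct < pressure_threshold_pct:
        # Host is under memory pressure: inflate, but never below the minimum.
        return max(min_kib, current_kib - step)
    # Host has spare memory: deflate, but never above the guest's maximum.
    return min(max_kib, current_kib + step)


target = 1_000_000
for free_pct in (10.0, 10.0, 50.0, 50.0):
    target = next_balloon_target(target, 1_000_000, 500_000, free_pct)
    print(free_pct, target)
```

Under this toy model, a balloon inflated by hand on a host with plenty of free memory would be undone on the controller's next cycles. That may be what the "automatic" restoration is, but it is an assumption and not something confirmed in this thread.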
4 years, 4 months
Any ETA for 4.4.2 final?
by Gianluca Cecchi
Hello,
I would like to upgrade a 4.4.0 environment to the latest 4.4.2 when available.
Any indication if there are any show stoppers after the rc5 released on 27th of August, or any ETA for other release candidates?
Thanks,
Gianluca
4 years, 4 months
oVirt disk disappeared after importing iSCSI domain
by gantonjo-ovirt@yahoo.com
Hi all oVirt experts.
So, I managed to f.. up a VM today. I had one VM running on an oVirt 4.3 cluster, where the disk was located on an iSCSI data domain. I stopped the VM, put the storage domain in maintenance mode, then detached and removed the storage domain from the oVirt 4.3 cluster.
Then I imported the storage domain into our new oVirt 4.4.1 cluster, let the process convert it from V4 to V5 format and activated the storage domain in the new cluster. On the information page for the imported storage domain, I expected to see the "Import VM" and "Import Disk" menus, but these menus did not appear as they did for other storage domains that I had successfully moved from the old cluster to the new one.
Clicking "Scan Disks" did not help either.
So, now I am stuck with a storage domain where the VM's disk is located, but oVirt is not able to see the disk.
What can I do to recover the lost disk from the storage domain? Any CLI commands available in ovirt 4.4.1 would be nice.
Thanks in advance for your quick and good help.
4 years, 4 months
Multiple GPU Passthrough with NVLink (Invalid I/O region)
by Vinícius Ferrão
Hello, here we go again.
I’m trying to passthrough 4x NVIDIA Tesla V100 GPUs (with NVLink) to a single VM; but things aren’t that good. Only one GPU shows up on the VM. lspci is able to show the GPUs, but three of them are unusable:
08:00.0 3D controller: NVIDIA Corporation GV100GL [Tesla V100 SXM2 16GB] (rev a1)
09:00.0 3D controller: NVIDIA Corporation GV100GL [Tesla V100 SXM2 16GB] (rev a1)
0a:00.0 3D controller: NVIDIA Corporation GV100GL [Tesla V100 SXM2 16GB] (rev a1)
0b:00.0 3D controller: NVIDIA Corporation GV100GL [Tesla V100 SXM2 16GB] (rev a1)
There are some errors on dmesg, regarding a misconfigured BIOS:
[ 27.295972] nvidia: loading out-of-tree module taints kernel.
[ 27.295980] nvidia: module license 'NVIDIA' taints kernel.
[ 27.295981] Disabling lock debugging due to kernel taint
[ 27.304180] nvidia: module verification failed: signature and/or required key missing - tainting kernel
[ 27.364244] nvidia-nvlink: Nvlink Core is being initialized, major device number 241
[ 27.579261] nvidia 0000:09:00.0: enabling device (0000 -> 0002)
[ 27.579560] NVRM: This PCI I/O region assigned to your NVIDIA device is invalid:
NVRM: BAR1 is 0M @ 0x0 (PCI:0000:09:00.0)
[ 27.579560] NVRM: The system BIOS may have misconfigured your GPU.
[ 27.579566] nvidia: probe of 0000:09:00.0 failed with error -1
[ 27.580727] NVRM: This PCI I/O region assigned to your NVIDIA device is invalid:
NVRM: BAR0 is 0M @ 0x0 (PCI:0000:0a:00.0)
[ 27.580729] NVRM: The system BIOS may have misconfigured your GPU.
[ 27.580734] nvidia: probe of 0000:0a:00.0 failed with error -1
[ 27.581299] NVRM: This PCI I/O region assigned to your NVIDIA device is invalid:
NVRM: BAR0 is 0M @ 0x0 (PCI:0000:0b:00.0)
[ 27.581300] NVRM: The system BIOS may have misconfigured your GPU.
[ 27.581305] nvidia: probe of 0000:0b:00.0 failed with error -1
[ 27.581333] NVRM: The NVIDIA probe routine failed for 3 device(s).
[ 27.581334] NVRM: loading NVIDIA UNIX x86_64 Kernel Module 450.51.06 Sun Jul 19 20:02:54 UTC 2020
[ 27.649128] nvidia-modeset: Loading NVIDIA Kernel Mode Setting Driver for UNIX platforms 450.51.06 Sun Jul 19 20:06:42 UTC 2020
The host is Secure Intel Skylake (x86_64). VM is running with Q35 Chipset with UEFI (pc-q35-rhel8.2.0)
I’ve tried changing the I/O mapping options on the host, with both 56TB and 12TB, without success. Same results. I didn’t try 512GB, since the machine has 768GB of system RAM.
Tried blacklisting nouveau on the host, nothing.
Installed NVIDIA drivers on the host, nothing.
In the host I can use the 4x V100, but inside a single VM it’s impossible.
Any suggestions?
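When comparing attempts, it can help to pull the failing devices out of the guest's dmesg automatically. The following is a minimal sketch that assumes the NVRM lines have exactly the shape shown in the log above; the function name and the regular expression are illustrative, not part of any NVIDIA tool.

```python
import re

# Matches lines such as: "NVRM: BAR1 is 0M @ 0x0 (PCI:0000:09:00.0)"
BAR_LINE = re.compile(
    r"NVRM: BAR(?P<bar>\d+) is (?P<size>\d+)M @ (?P<addr>0x[0-9a-fA-F]+) "
    r"\(PCI:(?P<dev>[0-9a-fA-F:.]+)\)"
)


def unassigned_bars(dmesg_text):
    """Return (pci_address, bar_index) pairs whose BAR is 0M at address 0x0."""
    found = []
    for match in BAR_LINE.finditer(dmesg_text):
        if int(match["size"]) == 0 and int(match["addr"], 16) == 0:
            found.append((match["dev"], int(match["bar"])))
    return found


sample = """\
[ 27.579560] NVRM: This PCI I/O region assigned to your NVIDIA device is invalid:
NVRM: BAR1 is 0M @ 0x0 (PCI:0000:09:00.0)
[ 27.580727] NVRM: This PCI I/O region assigned to your NVIDIA device is invalid:
NVRM: BAR0 is 0M @ 0x0 (PCI:0000:0a:00.0)
"""

print(unassigned_bars(sample))
# prints [('0000:09:00.0', 1), ('0000:0a:00.0', 0)]
```

Running it against the full dmesg after each change of the I/O mapping option would show whether the set of devices with unassigned BARs changes at all.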
4 years, 4 months
Testing ovirt 4.4.1 Nested KVM on Skylake-client (core i5) does not work
by wodel youchi
Hi,
I've been using my Core i5-6500 (Skylake-Client) for some time now to test oVirt on my machine.
However, this no longer works.
I am using Fedora 32 as my base system with nested KVM enabled. When I try to install oVirt 4.4 as a single-node HCI, I get an error in the last phase, which consists of copying the manager VM to the engine volume and booting it.
It is the boot that causes the problem; I get an error about the CPU:
*the CPU is incompatible with host CPU: Host CPU does not provide required features: mpx*
*This is the CPU part from virsh domcapabilities on my physical machine*
<cpu>
<mode name='host-passthrough' supported='yes'/>
<mode name='host-model' supported='yes'>
*<model fallback='forbid'>Skylake-Client-IBRS</model> *
<vendor>Intel</vendor>
<feature policy='require' name='ss'/>
<feature policy='require' name='vmx'/>
<feature policy='require' name='pdcm'/>
<feature policy='require' name='hypervisor'/>
<feature policy='require' name='tsc_adjust'/>
<feature policy='require' name='clflushopt'/>
<feature policy='require' name='umip'/>
<feature policy='require' name='md-clear'/>
<feature policy='require' name='stibp'/>
<feature policy='require' name='arch-capabilities'/>
<feature policy='require' name='ssbd'/>
<feature policy='require' name='xsaves'/>
<feature policy='require' name='pdpe1gb'/>
<feature policy='require' name='invtsc'/>
<feature policy='require' name='ibpb'/>
<feature policy='require' name='amd-ssbd'/>
<feature policy='require' name='skip-l1dfl-vmentry'/>
</mode>
<mode name='custom' supported='yes'>
<model usable='yes'>qemu64</model>
<model usable='yes'>qemu32</model>
<model usable='no'>phenom</model>
<model usable='yes'>pentium3</model>
<model usable='yes'>pentium2</model>
<model usable='yes'>pentium</model>
<model usable='yes'>n270</model>
<model usable='yes'>kvm64</model>
<model usable='yes'>kvm32</model>
<model usable='yes'>coreduo</model>
<model usable='yes'>core2duo</model>
<model usable='no'>athlon</model>
<model usable='yes'>Westmere-IBRS</model>
<model usable='yes'>Westmere</model>
<model usable='no'>Skylake-Server-IBRS</model>
<model usable='no'>Skylake-Server</model>
<model usable='yes'>Skylake-Client-IBRS</model>
<model usable='yes'>Skylake-Client</model>
<model usable='yes'>SandyBridge-IBRS</model>
<model usable='yes'>SandyBridge</model>
<model usable='yes'>Penryn</model>
<model usable='no'>Opteron_G5</model>
<model usable='no'>Opteron_G4</model>
<model usable='no'>Opteron_G3</model>
<model usable='yes'>Opteron_G2</model>
<model usable='yes'>Opteron_G1</model>
<model usable='yes'>Nehalem-IBRS</model>
<model usable='yes'>Nehalem</model>
<model usable='yes'>IvyBridge-IBRS</model>
<model usable='yes'>IvyBridge</model>
<model usable='no'>Icelake-Server</model>
<model usable='no'>Icelake-Client</model>
<model usable='yes'>Haswell-noTSX-IBRS</model>
<model usable='yes'>Haswell-noTSX</model>
<model usable='yes'>Haswell-IBRS</model>
<model usable='yes'>Haswell</model>
<model usable='no'>EPYC-IBPB</model>
<model usable='no'>EPYC</model>
<model usable='no'>Dhyana</model>
<model usable='yes'>Conroe</model>
<model usable='no'>Cascadelake-Server</model>
<model usable='yes'>Broadwell-noTSX-IBRS</model>
<model usable='yes'>Broadwell-noTSX</model>
<model usable='yes'>Broadwell-IBRS</model>
<model usable='yes'>Broadwell</model>
<model usable='yes'>486</model>
</mode>
</cpu>
*Here is the lscpu of my physical machine*
# lscpu
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
Address sizes: 39 bits physical, 48 bits virtual
CPU(s): 4
On-line CPU(s) list: 0-3
Thread(s) per core: 1
Core(s) per socket: 4
Socket(s): 1
NUMA node(s): 1
Vendor ID: GenuineIntel
CPU family: 6
Model: 94
Model name: Intel(R) Core(TM) i5-6500 CPU @ 3.20GHz
Stepping: 3
CPU MHz: 954.588
CPU max MHz: 3600.0000
CPU min MHz: 800.0000
BogoMIPS: 6399.96
Virtualization: VT-x
L1d cache: 128 KiB
L1i cache: 128 KiB
L2 cache: 1 MiB
L3 cache: 6 MiB
NUMA node0 CPU(s): 0-3
Vulnerability Itlb multihit: KVM: Mitigation: Split huge pages
Vulnerability L1tf: Mitigation; PTE Inversion; VMX conditional cache flushes, SMT disabled
Vulnerability Mds: Mitigation; Clear CPU buffers; SMT disabled
Vulnerability Meltdown: Mitigation; PTI
Vulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl and seccomp
Vulnerability Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Vulnerability Spectre v2: Mitigation; Full generic retpoline, IBPB conditional, IBRS_FW, STIBP disabled, RSB filling
Vulnerability Srbds: Vulnerable: No microcode
Vulnerability Tsx async abort: Mitigation; Clear CPU buffers; SMT disabled
Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc art arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch cpuid_fault invpcid_single pti ssbd ibrs ibpb stibp tpr_shadow vnmi flexpriority ept vpid ept_ad fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm *mpx* rdseed adx smap clflushopt intel_pt xsaveopt xsavec xgetbv1 xsaves dtherm ida arat pln pts hwp hwp_notify hwp_act_window hwp_epp md_clear flush_l1d
*Here is the CPU part from virsh dumpxml of my ovirt hypervisor*
<cpu mode='custom' match='exact' check='full'>
<model fallback='forbid'>Skylake-Client-IBRS</model>
<vendor>Intel</vendor>
<feature policy='require' name='ss'/>
<feature policy='require' name='vmx'/>
<feature policy='require' name='pdcm'/>
<feature policy='require' name='hypervisor'/>
<feature policy='require' name='tsc_adjust'/>
<feature policy='require' name='clflushopt'/>
<feature policy='require' name='umip'/>
<feature policy='require' name='md-clear'/>
<feature policy='require' name='stibp'/>
<feature policy='require' name='arch-capabilities'/>
<feature policy='require' name='ssbd'/>
<feature policy='require' name='xsaves'/>
<feature policy='require' name='pdpe1gb'/>
<feature policy='require' name='ibpb'/>
<feature policy='require' name='amd-ssbd'/>
<feature policy='require' name='skip-l1dfl-vmentry'/>
<feature policy='disable' name='mpx'/>
</cpu>
*Here is the lscpu of my ovirt hypervisor*
[root@node1 ~]# lscpu
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 4
On-line CPU(s) list: 0-3
Thread(s) per core: 1
Core(s) per socket: 1
Socket(s): 4
NUMA node(s): 1
Vendor ID: GenuineIntel
CPU family: 6
Model: 94
Model name: Intel Core Processor (Skylake, IBRS)
Stepping: 3
CPU MHz: 3191.998
BogoMIPS: 6383.99
Virtualization: VT-x
Hypervisor vendor: KVM
Virtualization type: full
L1d cache: 32K
L1i cache: 32K
L2 cache: 4096K
L3 cache: 16384K
NUMA node0 CPU(s): 0-3
Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss syscall nx pdpe1gb rdtscp lm constant_tsc rep_good nopl xtopology cpuid tsc_known_freq pni pclmulqdq vmx ssse3 fma cx16 pdcm pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch cpuid_fault invpcid_single pti ssbd ibrs ibpb stibp tpr_shadow vnmi flexpriority ept vpid fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm rdseed adx smap clflushopt xsaveopt xsavec xgetbv1 xsaves arat umip md_clear arch_capabilities
It seems that not all the flags are presented to the hypervisor, in particular mpx, whose absence causes the error.
Is there a workaround for this?
Regards.
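The comparison between the two lscpu flag lists can be done mechanically rather than by eye. This is a minimal sketch; the helper names are hypothetical, and the two sample strings are abbreviated excerpts of the flag lists above, not the full output.

```python
def parse_flags(text):
    """Split a whitespace-separated CPU flag list into a set of flag names."""
    return set(text.split())


def missing_flags(outer, inner):
    """Return flags present on the outer machine but absent on the inner one."""
    return sorted(parse_flags(outer) - parse_flags(inner))


# Abbreviated samples from the physical machine and the nested oVirt hypervisor.
physical = "fpu vme smap clflushopt invpcid rtm mpx rdseed adx xsaves"
nested = "fpu vme smap clflushopt invpcid rtm rdseed adx xsaves umip"

print(missing_flags(physical, nested))
# prints ['mpx']
```

Feeding it the complete Flags lines from both machines would list every flag that the physical CPU has but the nested hypervisor does not see, not only mpx.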
4 years, 4 months