Problems after upgrade from 4.4.3 to 4.4.4
by tferic@swissonline.ch
Hi
I have problems after upgrading my 2-node cluster from 4.4.3 to 4.4.4.
Initially, I performed the upgrade of the oVirt hosts using the oVirt GUI (I wasn't planning any changes).
It appears that the upgrade broke the system.
On host1, the ovirt-engine was configured to run on the oVirt host itself (not self-hosted engine).
After the upgrade, the oVirt GUI no longer loaded in the browser.
I tried to fix the issue by migrating to self-hosted engine, which did not work, so I ran engine restore and engine-setup in order to get back to the initial state.
I am now able to login to the oVirt GUI again, but I am having the following problems:
host1 is in status "Unassigned" and it holds the SPM role. It cannot be set to maintenance mode nor re-installed from the GUI, but I am able to reboot the host from oVirt.
All storage domains (all NFS) are inactive.
In the /var/log/messages log, I can see the following message appearing frequently: "vdsm[5935]: ERROR ssl handshake: socket error, address: ::ffff:192.168.100.61"
The cluster is down and no VMs can be run. I don't know how to fix either of the issues.
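For reference, vdsm listens for these connections on port 54321 by default, so I assume the failing handshake could be tested manually from 192.168.100.61 (which I take to be the engine) with something like this, where the host address is a placeholder:
# test the TLS handshake against vdsm on host1 (default port assumed)
openssl s_client -connect <host1-address>:54321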
Does anyone have an idea?
I am appending a tar file containing log files to this email.
http://gofile.me/5fp92/d7iGEqh3H
Many thanks
Toni
3 years, 8 months
Locked disks
by Giulio Casella
Since yesterday I have found a couple of VMs with locked disks. I don't know the
reason; I suspect some interaction by our backup system (vprotect,
snapshot based), even though it has been working for more than a year.
I'd like to give the unlock_entity.sh script a chance, but it reports:
CAUTION, this operation may lead to data corruption and should be used
with care. Please contact support prior to running this command
Do you think I should trust it? Is it safe? The VMs are in production...
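In the meantime I was thinking of running the script in query mode first, just to list what is locked, something like this (syntax from memory, please check unlock_entity.sh -h on your version):
# list locked disks without changing anything (query mode)
/usr/share/ovirt-engine/setup/dbutils/unlock_entity.sh -q -t disk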
My manager is 4.4.4.7-1.el8 (CentOS stream 8), hosts are oVirt Node 4.4.4
TIA,
Giulio
3 years, 8 months
supervdsm failing during network_caps
by Alan G
Hi,
I have issues with one host where supervdsm is failing in network_caps.
I see the following trace in the log.
MainProcess|jsonrpc/1::ERROR::2020-01-06 03:01:05,558::supervdsm_server::100::SuperVdsm.ServerCallback::(wrapper) Error in network_caps
Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/vdsm/supervdsm_server.py", line 98, in wrapper
    res = func(*args, **kwargs)
  File "/usr/lib/python2.7/site-packages/vdsm/network/api.py", line 56, in network_caps
    return netswitch.configurator.netcaps(compatibility=30600)
  File "/usr/lib/python2.7/site-packages/vdsm/network/netswitch/configurator.py", line 317, in netcaps
    net_caps = netinfo(compatibility=compatibility)
  File "/usr/lib/python2.7/site-packages/vdsm/network/netswitch/configurator.py", line 325, in netinfo
    _netinfo = netinfo_get(vdsmnets, compatibility)
  File "/usr/lib/python2.7/site-packages/vdsm/network/netinfo/cache.py", line 150, in get
    return _stringify_mtus(_get(vdsmnets))
  File "/usr/lib/python2.7/site-packages/vdsm/network/netinfo/cache.py", line 59, in _get
    ipaddrs = getIpAddrs()
  File "/usr/lib/python2.7/site-packages/vdsm/network/netinfo/addresses.py", line 72, in getIpAddrs
    for addr in nl_addr.iter_addrs():
  File "/usr/lib/python2.7/site-packages/vdsm/network/netlink/addr.py", line 33, in iter_addrs
    with _nl_addr_cache(sock) as addr_cache:
  File "/usr/lib64/python2.7/contextlib.py", line 17, in __enter__
    return self.gen.next()
  File "/usr/lib/python2.7/site-packages/vdsm/network/netlink/__init__.py", line 92, in _cache_manager
    cache = cache_allocator(sock)
  File "/usr/lib/python2.7/site-packages/vdsm/network/netlink/libnl.py", line 469, in rtnl_addr_alloc_cache
    raise IOError(-err, nl_geterror(err))
IOError: [Errno 16] Message sequence number mismatch
A restart of supervdsm will resolve the issue for a period, maybe 24 hours, then it will occur again. So I'm thinking it's resource exhaustion or a leak of some kind?
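As a stopgap I'm considering an automatic nightly restart, e.g. a cron entry like this in /etc/cron.d/ (a sketch, assuming the service is named supervdsmd as on my hosts):
# restart supervdsm every night at 03:00 until the leak is understood
0 3 * * * root systemctl restart supervdsmd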
Running 4.2.8.2 with VDSM at 4.20.46.
I've had a look through Bugzilla and can't find an exact match; the closest was https://bugzilla.redhat.com/show_bug.cgi?id=1666123, which seems to be an RHV-only fix.
Thanks,
Alan
3 years, 9 months
Migrate windows 2003 server 64bits from libvirt to ovirt
by Fernando Hallberg
Hi,
I have a VM with Windows Server 2003 x64, and I uploaded the VM image to oVirt.
The VM boots on oVirt, but a blue screen appears with an error message:
[image: image.png]
Does anybody have any information about this?
I tried to convert the image file from raw to qcow2, but the error persists.
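For reference, the conversion was done with qemu-img along these lines (file names here are just placeholders):
# convert the uploaded raw image to qcow2
qemu-img convert -f raw -O qcow2 win2003.img win2003.qcow2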
Regards,
Fernando Hallberg
3 years, 9 months
Gluster volume engine stuck in healing with 1 unsynched entry & HostedEngine paused
by souvaliotimaria@mail.com
Hello everyone,
Any help would be greatly appreciated in the following problem.
In my lab, the day before yesterday, we had power issues: a UPS went offline, followed by a power outage of the NFS/DNS server I have set up to serve oVirt with ISOs and to act as a DNS server (our other DNS servers are located as VMs within the oVirt environment). We found a broadcast storm on the switch the oVirt nodes are connected to (caused by a faulty NIC on the aforementioned UPS) and later on had to re-establish several of the virtual connections as well. The above led to one of the hosts becoming NonResponsive, two machines becoming unresponsive and three VMs shutting down.
The oVirt environment, version 4.3.5.2, is a replica 2 + arbiter 1 environment and runs GlusterFS with the recommended volumes of data, engine and vmstore.
So far, whenever there was some kind of problem, oVirt was usually able to solve it on its own.
This time, however, after we recovered from the above state, the data and vmstore volumes healed successfully, but the engine volume got stuck in the healing process (Up, unsynced entries, needs healing). In the web GUI I see that the HostedEngine VM is paused due to a storage I/O error, while the output of the virsh list --all command shows that the HostedEngine is running. How is that happening?
I tried to manually trigger the healing process for the volume, but nothing happened with:
gluster volume heal engine
The command
gluster volume heal engine info
shows the following
[root@ov-no3 ~]# gluster volume heal engine info
Brick ov-no1.ariadne-t.local:/gluster_bricks/engine/engine
Status: Connected
Number of entries: 0
Brick ov-no2.ariadne-t.local:/gluster_bricks/engine/engine
/80f6e393-9718-4738-a14a-64cf43c3d8c2/images/d5de54b6-9f8e-4fba-819b-ebf6780757d2/a48555f4-be23-4467-8a54-400ae7baf9d7
Status: Connected
Number of entries: 1
Brick ov-no3.ariadne-t.local:/gluster_bricks/engine/engine
/80f6e393-9718-4738-a14a-64cf43c3d8c2/images/d5de54b6-9f8e-4fba-819b-ebf6780757d2/a48555f4-be23-4467-8a54-400ae7baf9d7
Status: Connected
Number of entries: 1
This morning I came upon this Reddit post https://www.reddit.com/r/gluster/comments/fl3yb7/entries_stuck_in_heal_pe... where it seems that, after a graceful reboot of one of the oVirt hosts, Gluster came back online after completing the appropriate healing processes. The thing is, from what I have read, a host cannot be put into maintenance mode (so that it can be rebooted) while there are unsynced entries in Gluster, correct?
Should I try to restart the glusterd service?
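Or would it make sense to first check for split-brain and then trigger a full heal, along these lines (standard Gluster commands, if I understand the docs correctly)?
# check whether the entry is actually in split-brain
gluster volume heal engine info split-brain
# force a full heal of the volume
gluster volume heal engine full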
Could someone tell me what I should do?
Thank you all for your time and help,
Maria Souvalioti
3 years, 9 months
deploy oVirt 4.4 errors
by grig.4n@gmail.com
grep ERROR /var/log/ovirt-hosted-engine-setup/ovirt-hosted-engine-setup-20210227232700-gil2fj.log
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 9786
2021-02-28 00:00:12,059+0600 ERROR otopi.ovirt_hosted_engine_setup.ansible_utils ansible_utils._process_output:109 ovirtsdk4.ConnectionError: Error while sending HTTP request: (7, 'Failed to connect to ovirt4-adm.domain.local port 443: No route to host')
2021-02-28 00:00:12,160+0600 ERROR otopi.ovirt_hosted_engine_setup.ansible_utils ansible_utils._process_output:109 fatal: [localhost]: FAILED! => {"attempts": 50, "changed": false, "msg": "Error while sending HTTP request: (7, 'Failed to connect to ovirt4-adm.domain.local port 443: No route to host')"}
2021-02-28 00:00:58,055+0600 ERROR otopi.ovirt_hosted_engine_setup.ansible_utils ansible_utils._process_output:109 fatal: [localhost]: FAILED! => {"changed": false, "msg": "The system may not be provisioned according to the playbook results: please check the logs for the issue, fix accordingly or re-deploy from scratch.\n"}
2021-02-28 00:00:58,759+0600 ERROR otopi.context context._executeMethod:154 Failed to execute stage 'Closing up': Failed executing ansible-playbook
2021-02-28 00:01:05,984+0600 ERROR otopi.ovirt_hosted_engine_setup.ansible_utils ansible_utils._process_output:109 fatal: [localhost]: UNREACHABLE! => {"changed": false, "msg": "Failed to connect to the host via ssh: ssh: connect to host ovirt4-adm.domain.local. port 22: No route to host", "skip_reason": "Host localhost is unreachable", "unreachable": true}
2021-02-28 00:01:22,146+0600 ERROR otopi.plugins.gr_he_common.core.misc misc._terminate:167 Hosted Engine deployment failed: please check the logs for the issue, fix accordingly or re-deploy from scratch.
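All of the failures above point at ovirt4-adm.domain.local being unreachable from the deployment host, so a first sanity check would be name resolution and routing toward that FQDN (standard tools; the hostname is taken from the log):
# does the engine FQDN resolve, and is it reachable?
dig +short ovirt4-adm.domain.local
ping -c 3 ovirt4-adm.domain.local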
3 years, 9 months
CVE-2021-3156 && ovirt-node-ng 4.3 && 4.4 (sudo)
by Renaud RAKOTOMALALA
Hello everyone,
I operate several oVirt clusters, including pre-production ones, using ovirt-node-ng images.
For our traditional clusters we handle the incident individually with a dedicated RPM; for ovirt-node-ng, however, I am not yet familiar with the process for critical package updates.
Do you have any advice or tips?
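So far I am only checking each node by hand with the test from the Qualys advisory, run as a non-root user (as I understand it, a vulnerable sudo answers with an error starting with "sudoedit:", a patched one with a "usage:" message):
# CVE-2021-3156 quick check, plus the installed sudo version
sudoedit -s /
rpm -q sudo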
Nice day,
Renaud
3 years, 9 months
Right Glusterfs Config on HA Server
by Marcel d'Heureuse
Hi Guys or Girls,
I have a problem in a productive system. We use three servers, each with a RAID 5 array of four 8 TB SATA hard drives (7200 rpm). The systems are interconnected for GlusterFS at 1 Gbit/s.
The network load between the three servers is between 5 and 8 % of the link capacity. GlusterFS has a separate VLAN that is completely isolated from the other networks.
Find attached a small drawing.
On those servers we run HA and there are 8 VMs installed. Most of the VMs work well, but when there is high load on the system some of the VMs can't write their data to the disks at a good speed. GlusterFS is configured as a replica volume on the RAID 5 disk, which is defined as JBOD in Gluster.
If I take a look at the iostats I can see that the hard disks have around 50 open connections, some from GlusterFS and a lot from the VMs. Inside the VMs I can see long I/O wait times, from 300 up to 600 ms, which is very long in the case of a PostgreSQL database.
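For reference, the wait times can be watched live in the extended iostat output (the await column), e.g.:
# extended device statistics every 5 seconds
iostat -x 5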
Should we reconfigure the RAID 5 controller so that it just passes the four 8 TB disks through, and let GlusterFS connect directly to /dev/sd[b-e]?
What do you think? Would SSDs help with performance, or is the current combination of RAID 5 and JBOD the bigger problem?
Thanks
Marcel
3 years, 9 months
Multipath flapping with SAS via FCP
by Benoit Chatelain
Hi,
I have some troubles with multipath.
When I add a SAS disk over FCP as a storage domain via the oVirt WebUI, the first link comes up active, but the second is stuck as failed.
The volume is provided by a Dell Compellent over FCP, and the disk is transported over SAS.
multipath is flapping on every hypervisor for the same storage domain disk:
[root@isildur-adm ~]# tail -f /var/log/messages
Feb 25 11:48:21 isildur-adm kernel: device-mapper: multipath: 253:3: Failing path 8:32.
Feb 25 11:48:24 isildur-adm multipathd[659460]: 36000d31003d5c2000000000000000010: sdc - tur checker reports path is up
Feb 25 11:48:24 isildur-adm multipathd[659460]: 8:32: reinstated
Feb 25 11:48:24 isildur-adm multipathd[659460]: 36000d31003d5c2000000000000000010: remaining active paths: 2
Feb 25 11:48:24 isildur-adm kernel: device-mapper: multipath: 253:3: Reinstating path 8:32.
Feb 25 11:48:24 isildur-adm kernel: sd 1:0:1:2: alua: port group f01c state S non-preferred supports toluSNA
Feb 25 11:48:24 isildur-adm kernel: sd 1:0:1:2: alua: port group f01c state S non-preferred supports toluSNA
Feb 25 11:48:24 isildur-adm kernel: device-mapper: multipath: 253:3: Failing path 8:32.
Feb 25 11:48:25 isildur-adm multipathd[659460]: sdc: mark as failed
Feb 25 11:48:25 isildur-adm multipathd[659460]: 36000d31003d5c2000000000000000010: remaining active paths: 1
---
[root@isildur-adm ~]# multipath -ll
36000d31003d5c2000000000000000010 dm-3 COMPELNT,Compellent Vol
size=1.5T features='1 queue_if_no_path' hwhandler='1 alua' wp=rw
`-+- policy='service-time 0' prio=25 status=active
|- 1:0:0:2 sdb 8:16 active ready running
`- 1:0:1:2 sdc 8:32 failed ready running
---
VDSM generates a multipath.conf like this (I have removed commented lines for readability):
[root@isildur-adm ~]# cat /etc/multipath.conf
# VDSM REVISION 2.0
# This file is managed by vdsm.
defaults {
    polling_interval            5
    no_path_retry               16
    user_friendly_names         no
    flush_on_last_del           yes
    fast_io_fail_tmo            5
    dev_loss_tmo                30
    max_fds                     4096
}

blacklist {
    protocol "(scsi:adt|scsi:sbp)"
}

overrides {
    no_path_retry               16
}
Do you have any idea why this link is flapping on my two hypervisors?
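I was wondering whether a local override for the Compellent device would help, along these lines in /etc/multipath/conf.d/compellent.conf (an untested sketch on my side; option names as in multipath.conf(5), vendor/product taken from the multipath -ll output above):
devices {
    device {
        # group paths by ALUA priority so the non-preferred port group
        # is kept as a separate standby group instead of being failed
        vendor                 "COMPELNT"
        product                "Compellent Vol"
        path_grouping_policy   group_by_prio
        prio                   alua
        path_checker           tur
        no_path_retry          16
    }
}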
Thanks a lot in advance.
- Benoit Chatelain
3 years, 9 months
How to enable nested virtualization
by miguel.garcia@toshibagcs.com
Hi all,
We are trying to run VirtualBox in a VM that is running on the oVirt platform. I followed the instructions to enable nested virtualization on the host as described in this document: https://ostechnix.com/how-to-enable-nested-virtualization-in-kvm-in-linux/
However, I was not able to enable nested virtualization for the VM, since nothing is listed from the virsh console. I also tried to edit the VM configuration, looking for "cpu mode" or "copy cpu configuration as host".
Also, from the oVirt admin portal, I enabled nested virtualization for the host through Compute > Hosts > selected host > Edit > Kernel > Nested Virtualization checked.
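For reference, the steps from that document that I applied on the host were roughly these (paraphrased from memory; the file name under /etc/modprobe.d/ is my own choice, and the module can only be reloaded with no VMs running on the host):
# enable nested virtualization for the kvm_intel module
echo "options kvm_intel nested=1" > /etc/modprobe.d/kvm-nested.conf
modprobe -r kvm_intel
modprobe kvm_intel
cat /sys/module/kvm_intel/parameters/nested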
But the host still returns:
$ cat /sys/module/kvm_intel/parameters/nested
N
3 years, 9 months