Tasks stuck waiting on each other after failed storage migration (yet not visible on SPM)
by David Sekne
Hello,
I'm running oVirt version 4.3.9.4-1.el7.
After a failed live storage migration, a VM got stuck with a snapshot.
Checking the engine logs, I can see that the snapshot removal task is
waiting for the Merge to complete, and vice versa.
2020-05-26 18:34:04,826+02 INFO
[org.ovirt.engine.core.bll.snapshots.RemoveSnapshotSingleDiskLiveCommandCallback]
(EE-ManagedThreadFactory-engineScheduled-Thread-70)
[90f428b0-9c4e-4ac0-8de6-1103fc13da9e] Command
'RemoveSnapshotSingleDiskLive' (id: '60ce36c1-bf74-40a9-9fb0-7fcf7eb95f40')
waiting on child command id: 'f7d1de7b-9e87-47ba-9ba0-ee04301ba3b1'
type:'Merge' to complete
2020-05-26 18:34:04,827+02 INFO
[org.ovirt.engine.core.bll.MergeCommandCallback]
(EE-ManagedThreadFactory-engineScheduled-Thread-70)
[90f428b0-9c4e-4ac0-8de6-1103fc13da9e] Waiting on merge command to complete
(jobId = f694590a-1577-4dce-bf0c-3a8d74adf341)
2020-05-26 18:34:04,845+02 INFO
[org.ovirt.engine.core.bll.ConcurrentChildCommandsExecutionCallback]
(EE-ManagedThreadFactory-engineScheduled-Thread-70)
[90f428b0-9c4e-4ac0-8de6-1103fc13da9e] Command 'RemoveSnapshot' (id:
'47c9a847-5b4b-4256-9264-a760acde8275') waiting on child command id:
'60ce36c1-bf74-40a9-9fb0-7fcf7eb95f40' type:'RemoveSnapshotSingleDiskLive'
to complete
2020-05-26 18:34:14,277+02 INFO
[org.ovirt.engine.core.vdsbroker.monitoring.VmJobsMonitoring]
(EE-ManagedThreadFactory-engineScheduled-Thread-96) [] VM Job
[f694590a-1577-4dce-bf0c-3a8d74adf341]: In progress (no change)
I cannot see any running tasks on the SPM (vdsm-client Host
getAllTasksInfo). I also cannot find the task ID in any of the other
nodes' logs.
I already tried restarting the Engine (didn't help).
To start with, I'm puzzled as to where this task is queued.
Any ideas on how I could resolve this?
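For reference, my understanding is that a live merge runs on the host where the VM itself is running (not on the SPM), and that the engine tracks the command state in its own database, so these are the checks I have to work with. A rough sketch - the disk target and the DB column names are guesses based on the default layout:

# on the host currently running the VM - the merge should show up as a libvirt block job
virsh -r list --all
virsh -r blockjob <vm-name> vda --info     # repeat per disk; 'vda' is only an example target

# on the engine - async commands are tracked in the 'command_entities' table of the 'engine' DB
# (column names may differ per version; on 4.3 PostgreSQL runs from the rh-postgresql10 SCL,
#  so the psql invocation may also differ)
sudo -u postgres psql engine -c "select command_id, command_type, status, created_at
  from command_entities
  where command_id in ('47c9a847-5b4b-4256-9264-a760acde8275',
                       '60ce36c1-bf74-40a9-9fb0-7fcf7eb95f40',
                       'f7d1de7b-9e87-47ba-9ba0-ee04301ba3b1');"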
Thank you.
Regards,
David
Re: Single instance scaleup.
by Strahil
Hi Leo,
As you do not have a multi-brick distributed volume (each volume has a single brick), you can easily switch to replica 2 arbiter 1 or replica 3 volumes.
You can use the following for adding the bricks:
https://access.redhat.com/documentation/en-US/Red_Hat_Storage/2.1/html/Ad...
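In short, for a single-brick volume the conversion is just an 'add-brick' that raises the replica count. Roughly like this - the brick paths for the new hosts are only placeholders, so adjust them to your layout, and run a full heal afterwards:

# engine volume: plain replica 3 across the three hosts (new brick paths are placeholders)
gluster volume add-brick engine replica 3 \
    192.168.80.192:/gluster_bricks/engine/engine \
    192.168.80.193:/gluster_bricks/engine/engine

# data volume: replica 3 with the third brick acting as arbiter (paths are placeholders)
gluster volume add-brick ssd-samsung replica 3 arbiter 1 \
    192.168.80.192:/gluster_bricks/sdc/data \
    192.168.80.193:/gluster_bricks/arbiter/data

# then let the new bricks sync up
gluster volume heal engine full
gluster volume heal ssd-samsung full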
Best Regards,
Strahil Nikolov

On May 26, 2019 10:54, Leo David <leoalex(a)gmail.com> wrote:
>
> Hi Stahil,
> Thank you so much for your input!
>
> gluster volume info
>
>
> Volume Name: engine
> Type: Distribute
> Volume ID: d7449fc2-cc35-4f80-a776-68e4a3dbd7e1
> Status: Started
> Snapshot Count: 0
> Number of Bricks: 1
> Transport-type: tcp
> Bricks:
> Brick1: 192.168.80.191:/gluster_bricks/engine/engine
> Options Reconfigured:
> nfs.disable: on
> transport.address-family: inet
> storage.owner-uid: 36
> storage.owner-gid: 36
> features.shard: on
> performance.low-prio-threads: 32
> performance.strict-o-direct: off
> network.remote-dio: off
> network.ping-timeout: 30
> user.cifs: off
> performance.quick-read: off
> performance.read-ahead: off
> performance.io-cache: off
> cluster.eager-lock: enable
> Volume Name: ssd-samsung
> Type: Distribute
> Volume ID: 76576cc6-220b-4651-952d-99846178a19e
> Status: Started
> Snapshot Count: 0
> Number of Bricks: 1
> Transport-type: tcp
> Bricks:
> Brick1: 192.168.80.191:/gluster_bricks/sdc/data
> Options Reconfigured:
> cluster.eager-lock: enable
> performance.io-cache: off
> performance.read-ahead: off
> performance.quick-read: off
> user.cifs: off
> network.ping-timeout: 30
> network.remote-dio: off
> performance.strict-o-direct: on
> performance.low-prio-threads: 32
> features.shard: on
> storage.owner-gid: 36
> storage.owner-uid: 36
> transport.address-family: inet
> nfs.disable: on
>
> The other two hosts will be 192.168.80.192/193 - this is a dedicated Gluster network over a 10Gb SFP+ switch.
> - host 2 will have an identical hardware configuration to host 1 (each disk is actually a RAID 0 array)
> - host 3 has:
>   - 1 SSD for the OS
>   - 1 SSD for adding to the engine volume in a full replica 3
>   - 2 SSDs in a RAID 1 array to be added as the arbiter for the data volume (ssd-samsung)
> So the plan is to have "engine" scaled to a full replica 3, and "ssd-samsung" scaled to a replica 3 arbitrated.
>
>
>
>
> On Sun, May 26, 2019 at 10:34 AM Strahil <hunter86_bg(a)yahoo.com> wrote:
>>
>> Hi Leo,
>>
>> Gluster is quite smart, but in order to provide any hints, can you provide the output of 'gluster volume info <glustervol>'?
>> If you have 2 more systems, keep in mind that it is best to mirror the storage on the second replica (2 disks on 1 machine -> 2 disks on the new machine), while for the arbiter this is not necessary.
>>
>> What are your network and NICs? Based on my experience, I can recommend at least 10 Gbit/s interface(s).
>>
>> Best Regards,
>> Strahil Nikolov
>>
>> On May 26, 2019 07:52, Leo David <leoalex(a)gmail.com> wrote:
>>>
>>> Hello Everyone,
>>> Can someone help me clarify this?
>>> I have a single-node 4.2.8 installation (only two Gluster storage domains - distributed single-drive volumes). Now I just got two identical servers and I would like to go for a 3-node setup.
>>> Is it possible (after joining the new nodes to the cluster) to expand the existing volumes across the new nodes and change them to replica 3 arbitrated?
>>> If so, could you share with me what the procedure would be?
>>> Thank you very much!
>>>
>>> Leo
>
>
>
> --
> Best regards, Leo David
oVirt 4.4 install fails
by Me
Hi All
Not sure where to start, but here goes.
I'm not totally new to oVirt; I used RHEV 3.x in production for several
years, and it was a breeze to set up.
I'm installing 4.4 onto a host with a local SSD and FC for storage.
Issue 1: having selected the SSD for the install, which already has a
failed 4.4 beta install on it (several times over), I reclaim the space,
and after a few minutes of not being able to enter a root password on the
next install screen, the install fails because it can't delete the data on
the SSD! Yes, I really tried this several times. The workaround: choose
the recovery option to get a prompt, run fdisk /dev/sda, delete the two
partitions created by oVirt, and then I can install. This was the case
with the beta I tried a few weeks ago too.
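For the record, the clean-up from that recovery prompt amounts to something like this (destructive, so double-check the device name; the wipefs line is an untested shortcut for what I otherwise do by hand in fdisk):

# wipe the leftover partition signatures so the installer can reuse the disk
wipefs -a /dev/sda
# or, by hand, as described above:
fdisk /dev/sda      # 'd' twice to delete the two oVirt partitions, then 'w' to write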
Having reconfigured the switch port attached to the host as a dumb 10GbE
port, as the enterprise OS installer still doesn't appear to support
anything more advanced like teaming and VLANs, I have the initial
install on the single SSD and a network connection.
Issue 2: I use Firefox 72.0.2 on Linux x64 to connect to the web interface
at https://hostname:9090, but I can't enter login details because the
input boxes (everything, in fact) are disabled. There is no warning
like "we don't like your choice of browser", but the screen is a
not-very-accessible dark grey on darker grey (a poor choice in what I
thought were more enlightened times), so that may be the case. I have
disabled all security add-ons in Firefox; it makes no difference.
Any suggestions?
M
basic infra and glusterfs sizing question
by Jiří Sléžka
Hello,
I am just curious whether the basic Gluster HCI layout suggested in
Cockpit has some deeper meaning.
Three volumes are suggested:
* engine - this one is clear: it is the volume where the engine VM runs.
When this VM is 51 GB, how small could this volume be? I have 1 TB of SSD
storage and I would like to utilize it as much as possible. Could I create
this volume only as big as the VM itself? Is that safe, for example, for
future upgrades?
* vmstore - it makes sense that this is the space for all the other VMs
running in oVirt. Right?
* data - what purpose does this volume have? Other data such as ISOs?
Direct disks?
Another infra question... or maybe a request for comments.
I have a small number of public IPv4 addresses in my housing facility (but
I have my own switches there, so I can create VLANs and separate internal
traffic).
I can only access these public IPv4 addresses directly. I would like to
conserve these addresses as much as possible, so what is the best
approach in your opinion?
* Install all hosts and the HE with the management network on private addresses
* have a small router (a hardware appliance running, for example, LEDE) which will
use one IPv4 address and do NAT and VPN for accessing my internal VLANs
  + looks like a simple approach to me
  - single point of failure in this router (not really - just in case
oVirt is badly broken and I need to access the internal VLANs to recover it)
* have this router as a virtual appliance inside oVirt (something like
pfSense, for example)
  + no need for a hardware router
  + not sure, but I could probably configure VRRP redundancy
  - still a single point of failure, as in the first case
* any other approach? Could OVN help here somehow?
* Install all hosts and the HE with public addresses :-)
  + direct access to all hosts
  - a 3-node HCI cluster uses 4 public IP addresses
Thanks for your opinions
Cheers,
Jiri
ETL service aggregation error
by Ayansh Rocks
Hi,
I am using a 4.3.7 self-hosted engine. For the past few days I have been
regularly getting the error messages below:
[image: image.png]
Logs in /var/log/ovirt-engine-dwh/ovirt-engine-dwhd.log
[image: image.png]
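In case the screenshots do not come through, the errors can also be pulled out of the DWH log in text form, roughly like this (a sketch; the grep pattern is only a guess at what to filter on):

# status of the DWH service that does the ETL aggregation
systemctl status ovirt-engine-dwhd
# pull the recent errors out of the DWH log
grep -iE 'error|exception|aggregat' /var/log/ovirt-engine-dwh/ovirt-engine-dwhd.log | tail -n 50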
What could be the reason for this?
Thanks
Shashank
4.4 regression: engine-setup fails if admin password in answerfile contains a "%"
by Stephen Panicho
I encountered this error when deploying the Hosted Engine via Cockpit:
[ INFO ] TASK [ovirt.engine-setup : Run engine-setup with answerfile]
[ ERROR ] fatal: [localhost -> engine.ovirt.trashnet.xyz]: FAILED! =>
{"changed": true, "cmd": ["engine-setup", "--accept-defaults",
"--config-append=/root/ovirt-engine-answers"], "delta": "0:00:01.396490",
"end": "2020-05-22 18:32:41.965984", "msg": "non-zero return code", "rc":
1, "start": "2020-05-22 18:32:40.569494", "stderr": "", "stderr_lines": [],
"stdout": "[ INFO ] Stage: Initializing\n[ ERROR ] Failed to execute stage
'Initializing': '%' must be followed by '%' or '(', found: '%JUUj'\n[ INFO
] Stage: Clean up\n Log file is located at
/var/log/ovirt-engine/setup/ovirt-engine-setup-20200522183241-c7d1kh.log\n[
ERROR ] Failed to execute stage 'Clean up': 'NoneType' object has no
attribute 'cleanup'\n[ INFO ] Generating answer file
'/var/lib/ovirt-engine/setup/answers/20200522183241-setup.conf'\n[ INFO ]
Stage: Pre-termination\n[ INFO ] Stage: Termination\n[ ERROR ] Execution of
setup failed", "stdout_lines": ["[ INFO ] Stage: Initializing", "[ ERROR ]
Failed to execute stage 'Initializing': '%' must be followed by '%' or '(',
found: '%JUUj'", "[ INFO ] Stage: Clean up", " Log file is located at
/var/log/ovirt-engine/setup/ovirt-engine-setup-20200522183241-c7d1kh.log",
"[ ERROR ] Failed to execute stage 'Clean up': 'NoneType' object has no
attribute 'cleanup'", "[ INFO ] Generating answer file
'/var/lib/ovirt-engine/setup/answers/20200522183241-setup.conf'", "[ INFO ]
Stage: Pre-termination", "[ INFO ] Stage: Termination", "[ ERROR ]
Execution of setup failed"]}
The important bit is this: Failed to execute stage 'Initializing': '%' must
be followed by '%' or '(', found: '%JUUj'"
Hey! Those are the last few characters of the admin password. Note that I
don't mean the root password to the VM, but the one for the "admin" user of
the web interface. I added some debug lines to the Ansible play to see the
answerfile that was being generated.
OVESETUP_CONFIG/adminPassword=str:&6&yGfcWf#b%JUUj
Apparently engine-setup can no longer handle an answerfile with a "%"
character in it. This same password worked in 4.3.
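That error string is exactly what Python's configparser interpolation raises, so the parsing behaviour can be reproduced on its own, completely outside engine-setup (just an illustration of why the bare '%' blows up; it says nothing about where engine-setup actually parses the answerfile):

python3 - <<'EOF'
from configparser import ConfigParser, InterpolationSyntaxError

cp = ConfigParser()                      # BasicInterpolation is the default
cp.read_string("[ovirt]\npassword = &6&yGfcWf#b%JUUj\n")
try:
    cp.get("ovirt", "password")          # interpolation is applied on get()
except InterpolationSyntaxError as exc:
    print("reproduced:", exc)
EOF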
Once I changed the admin password, installation progressed normally.
ovirt imageio problem...
by matteo fedeli
Hi! I installed CentOS 8 and the oVirt packages following these steps:
systemctl enable --now cockpit.socket
yum install https://resources.ovirt.org/pub/yum-repo/ovirt-release44.rpm
yum module -y enable javapackages-tools
yum module -y enable pki-deps
yum module -y enable postgresql:12
yum -y install glibc-locale-source glibc-langpack-en
localedef -v -c -i en_US -f UTF-8 en_US.UTF-8
yum update
yum install ovirt-engine
engine-setup (keeping all the defaults)
Is it possible that the ovirt-imageio-proxy service is not installed? (service ovirt-imageio-proxy status --> not found, yum install ovirt-imageio-proxy --> not found.) I'm not able to upload ISOs... I also installed the CA cert in Firefox...
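What I can check locally, in case it helps (my guess is that in 4.4 the proxy was folded into a single 'ovirt-imageio' service, so that service name is an assumption):

# which imageio packages actually got pulled in
dnf list installed 'ovirt-imageio*'
# guessed 4.4 service name - verify before relying on it
systemctl status ovirt-imageio
journalctl -u ovirt-imageio --since today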
Issues deploying 4.4 with HE on new EPYC hosts
by Mark R
Hello all,
I have some EPYC servers that are not yet in production, so I wanted to go ahead and move them off of 4.3 (which was working) to 4.4. I flattened and reinstalled the hosts with CentOS 8.1 Minimal and installed all updates. Some very simple networking, just a bond and two iSCSI interfaces. After adding the oVirt 4.4 repo and installing the requirements, I run 'hosted-engine --deploy' and proceed through the setup. Everything looks as though it is going nicely and the local HE starts and runs perfectly. After copying the HE disks out to storage, the system tries to start it there but is using a different CPU definition and it's impossible to start it. At this point I'm stuck but hoping someone knows the fix, because this is as vanilla a deployment as I could attempt and it appears EPYC CPUs are a no-go right now with 4.4.
When the HostedEngineLocal VM is running, the CPU definition is:
<cpu mode='custom' match='exact' check='full'>
<model fallback='forbid'>EPYC-IBPB</model>
<vendor>AMD</vendor>
<feature policy='require' name='x2apic'/>
<feature policy='require' name='tsc-deadline'/>
<feature policy='require' name='hypervisor'/>
<feature policy='require' name='tsc_adjust'/>
<feature policy='require' name='clwb'/>
<feature policy='require' name='umip'/>
<feature policy='require' name='arch-capabilities'/>
<feature policy='require' name='cmp_legacy'/>
<feature policy='require' name='perfctr_core'/>
<feature policy='require' name='wbnoinvd'/>
<feature policy='require' name='amd-ssbd'/>
<feature policy='require' name='skip-l1dfl-vmentry'/>
<feature policy='disable' name='monitor'/>
<feature policy='disable' name='svm'/>
<feature policy='require' name='topoext'/>
</cpu>
Once the HostedEngine VM is defined and trying to start, the CPU definition is simply:
<cpu mode='custom' match='exact' check='partial'>
<model fallback='allow'>EPYC</model>
<topology sockets='16' cores='4' threads='1'/>
<feature policy='require' name='ibpb'/>
<feature policy='require' name='virt-ssbd'/>
<numa>
<cell id='0' cpus='0-63' memory='16777216' unit='KiB'/>
</numa>
</cpu>
On attempts to start it, the host is logging this error: "CPU is incompatible with host CPU: Host CPU does not provide required features: virt-ssbd".
So, the HostedEngineLocal VM works because it requires 'amd-ssbd' instead of 'virt-ssbd', and a VM requiring 'virt-ssbd' can't run on EPYC CPUs with CentOS 8.1. As mentioned, the HostedEngine ran fine on oVirt 4.3 with CentOS 7.8, and on 4.3 the CPU definition also required 'virt-ssbd', so I can only imagine that, perhaps because of the newer kernel, the HE now needs to require 'amd-ssbd' instead?
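For what it's worth, this is how I am comparing what the host CPU exposes with what libvirt thinks it can give to guests (just diagnostics; the grep patterns are my own):

# mitigation-related CPU flags as the host kernel sees them
grep -oE 'amd_ssbd|virt_ssbd|ssbd' /proc/cpuinfo | sort | uniq -c
# features libvirt/QEMU report as usable for guests on this host
virsh -r domcapabilities | grep -iE 'ssbd|epyc'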
Any clues to help with this? I can completely wipe/reconfigure the hosts as needed, so I'm willing to try whatever it takes to move forward with a 4.4 deployment.
Thanks!
Mark
tun: unexpected GSO type: 0x0, gso_size 1368, hdr_len 66
by lejeczek
hi everyone,
With 4.4 I get:
...
tun: unexpected GSO type
...
It happens with a "third-party" kernel, namely
5.6.15-1.el8.elrepo.x86_64 on CentOS 8.
I wonder if anybody sees the same or something similar, and I also
wonder if I should report it somewhere in Bugzilla as a
"heads-up" for newer kernels?
[Fri May 29 08:00:02 2020] tun: unexpected GSO type: 0x0,
gso_size 1368, hdr_len 66
[Fri May 29 08:00:02 2020] tun: 00 00 00 00 00 00 00 00 00
00 00 00 00 00 00 00 ................
[Fri May 29 08:00:02 2020] tun: 00 00 00 00 00 00 00 00 00
00 00 00 00 00 00 00 ................
[Fri May 29 08:00:02 2020] tun: 00 00 00 00 00 00 00 00 00
00 00 00 00 00 00 00 ................
[Fri May 29 08:00:02 2020] tun: 00 00 00 00 00 00 00 00 00
00 00 00 00 00 00 00 ................
[Fri May 29 08:00:02 2020] ------------[ cut here ]------------
[Fri May 29 08:00:02 2020] WARNING: CPU: 2 PID: 3605 at
drivers/net/tun.c:2123 tun_do_read+0x524/0x6c0 [tun]
[Fri May 29 08:00:02 2020] Modules linked in: sd_mod sg
vhost_net vhost tap xt_CHECKSUM xt_MASQUERADE xt_conntrack
nf_nat_tftp nf_conntrack_tftp tun nft_nat ipt_REJECT bridge
nft_counter nft_objref nf_conntrack_netbios_ns
nf_conntrack_broadcast nft_masq nft_fib_inet nft_fib_ipv4
nft_fib_ipv6 nft_fib nft_reject_inet nf_reject_ipv4
nf_reject_ipv6 nft_reject nft_ct nf_tables_set nft_chain_nat
nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 qedf qed
ip6_tables crc8 nft_compat bnx2fc ip_set cnic uio libfcoe
8021q garp mrp stp llc libfc scsi_transport_fc nf_tables
nfnetlink sunrpc vfat fat ext4 mbcache jbd2
snd_hda_codec_hdmi snd_hda_intel snd_intel_dspcfg
snd_hda_codec snd_hda_core snd_hwdep snd_seq snd_seq_device
edac_mce_amd snd_pcm kvm_amd kvm eeepc_wmi asus_wmi
sp5100_tco sparse_keymap irqbypass rfkill wmi_bmof pcspkr
joydev i2c_piix4 k10temp snd_timer snd soundcore gpio_amdpt
gpio_generic acpi_cpufreq ip_tables xfs libcrc32c dm_crypt
ax88179_178a usbnet mii hid_lenovo nouveau video mxm_wmi
i2c_algo_bit
[Fri May 29 08:00:02 2020] drm_kms_helper syscopyarea
sysfillrect sysimgblt fb_sys_fops crct10dif_pclmul ttm
crc32_pclmul crc32c_intel ahci libahci drm
ghash_clmulni_intel libata nvme ccp r8169 nvme_core realtek
wmi t10_pi pinctrl_amd dm_mirror dm_region_hash dm_log dm_mod
[Fri May 29 08:00:02 2020] CPU: 2 PID: 3605 Comm: vhost-3578
Not tainted 5.6.15-1.el8.elrepo.x86_64 #1
[Fri May 29 08:00:02 2020] Hardware name: System
manufacturer System Product Name/PRIME B450M-A, BIOS 2006
11/13/2019
[Fri May 29 08:00:02 2020] RIP: 0010:tun_do_read+0x524/0x6c0
[tun]
[Fri May 29 08:00:02 2020] Code: 00 6a 01 0f b7 44 24 22 b9
10 00 00 00 48 c7 c6 cb 33 09 c1 48 c7 c7 d1 33 09 c1 83 f8
40 48 0f 4f c2 31 d2 50 e8 4c 14 df c5 <0f> 0b 58 5a 48 c7
c5 ea ff ff ff e9 d2 fc ff ff 4c 89 e2 be 04 00
[Fri May 29 08:00:02 2020] RSP: 0018:ffffaaf301dfbcb8
EFLAGS: 00010292
[Fri May 29 08:00:02 2020] RAX: 0000000000000000 RBX:
ffff88ceae6b4800 RCX: 0000000000000007
[Fri May 29 08:00:02 2020] RDX: 0000000000000000 RSI:
0000000000000096 RDI: ffff88d14e8996b0
[Fri May 29 08:00:02 2020] RBP: 000000000000004e R08:
0000000000000516 R09: 0000000000000055
[Fri May 29 08:00:02 2020] R10: 000000000000072e R11:
ffffaaf301dfba88 R12: ffffaaf301dfbe50
[Fri May 29 08:00:02 2020] R13: ffff88d0e98b8900 R14:
0000000000000000 R15: 0000000000000000
[Fri May 29 08:00:02 2020] FS: 0000000000000000(0000)
GS:ffff88d14e880000(0000) knlGS:0000000000000000
[Fri May 29 08:00:02 2020] CS: 0010 DS: 0000 ES: 0000 CR0:
0000000080050033
[Fri May 29 08:00:02 2020] CR2: 000055f6b3af2bd8 CR3:
00000003c6c84000 CR4: 0000000000340ee0
[Fri May 29 08:00:02 2020] Call Trace:
[Fri May 29 08:00:02 2020] ? __wake_up_common+0x77/0x140
[Fri May 29 08:00:02 2020] tun_recvmsg+0x6b/0xf0 [tun]
[Fri May 29 08:00:02 2020] handle_rx+0x573/0x940 [vhost_net]
[Fri May 29 08:00:02 2020] ? log_used.part.45+0x20/0x20 [vhost]
[Fri May 29 08:00:02 2020] vhost_worker+0xcc/0x140 [vhost]
[Fri May 29 08:00:02 2020] kthread+0x10c/0x130
[Fri May 29 08:00:02 2020] ? kthread_park+0x80/0x80
[Fri May 29 08:00:02 2020] ret_from_fork+0x22/0x40
[Fri May 29 08:00:02 2020] ---[ end trace 9df20668f2e81977 ]---
many thanks, L.
Problem with Ovirt Machines
by aigini82@gmail.com
Hi,
Our company uses oVirt to host some of its virtual machines. The version used is 4.2.6.4-1.el7, and there are about 36 virtual machines in it.
The host machine has 30 GB of RAM and 6 CPUs. Some of the VMs on this oVirt host run with 4 CPUs, some with 2 CPUs.
The problem I face now is that there was recently a need for a high-CPU, high-memory VM for DR. I created a VM with 16 GB of RAM and 6 CPUs, without first checking the CPUs available on the host. After DR, the VM was brought down. Later, another person on the team brought the VM back up for a different DR use, a much larger DB restoration.
This caused the VM to pause due to a storage error. Then worse things happened: two other VMs inadvertently went down. Although I assumed that this was caused by storage errors/problems, the senior admins on the team concluded that the problem was due to fencing, because the VM was using the maximum CPU allotted to the host.
Now what I need to know is how to properly allocate CPU resources on a host so it can run multiple virtual machines, as in the situation above.
I even tried to look for errors in vdsm.log, but this log was not available on the host machine or in the affected VM. My colleague asked me to check the "Events" section of the oVirt management interface to review past events. However, I don't find many details about the fencing activity, how the fencing occurred, or what caused it.
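For reference, the default log locations and the kind of searches I mean are these (a rough sketch; the grep patterns are just my guesses at what to look for):

# on the engine VM - fencing / power-management actions are logged by the engine
grep -iE 'fence|power management' /var/log/ovirt-engine/engine.log
# on each hypervisor host (not inside the guests) - this is where vdsm.log normally lives
grep -iE 'pause|abnormal|storage' /var/log/vdsm/vdsm.log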
And how did they conclude that the CPU count caused the fencing and not the storage?