Re: Hosted Engine Abruptly Stopped Responding - Unexpected Shutdown
by Strahil
On Jun 6, 2019 12:52, souvaliotimaria(a)mail.com wrote:
>
> Hello,
>
> I came upon a problem last month that I figured would be good to discuss here. I'm sorry I didn't post earlier, but time slipped away from me.
>
> I have set up a glustered, hyperconverged oVirt environment for experimental use, as a means to observe its behaviour and get used to its management and performance before setting it up as a production environment for our organization. The environment has been up and running since October 2018. The three nodes are HP ProLiant DL380 G7 servers with the following characteristics:
>
> Mem: 22GB
> CPU: 2x Hexa Core - Intel Xeon Hexa Core E56xx
> HDD: 5x 300GB
> Network: BCM5709C with dual-port Gigabit
> OS: Red Hat Linux 7.5.1804 (Core, 3.10.0-862.3.2.el7.x86_64 x86_64) - oVirt Node 4.2.3.1
>
> As I was working on the environment, the engine stopped working.
> Shortly before the HE stopped, I was managing my VMs in the web interface when the browser froze; the HE also stopped responding to ICMP requests.
>
> The first thing I did was to connect via ssh to all nodes and run the command
> #hosted-engine --vm-status
> which showed that the HE was down on nodes 1 and 2 and up on the 3rd node.
>
> After executing
> #virsh -r list
> the VM list contained two of the VMs I had previously created, both of which were up; the HE was nowhere to be found.
>
> I tried to restart the HE with the
> #hosted-engine --vm-start
> but it didn't work.
>
> I then put all nodes in maintenance mode with the command
> #hosted-engine --set-maintenance --mode=global
> (I guess I should have done that earlier) and re-ran
> #hosted-engine --vm-start
> which had the same result as before.
>
> After checking the mails the system sent to the root user, I saw there were several mails on the 3rd node (where the HE had been) informing about the HE's state. The messages alternated between EngineDown-EngineStart, EngineStart-EngineStarting, EngineStarting-EngineMaybeAway, EngineMaybeAway-EngineUnexpectedlyDown, EngineUnexpectedlyDown-EngineDown, EngineDown-EngineStart and so forth.
>
> I continued by searching the following logs on all nodes:
> /var/log/libvirt/qemu/HostedEngine.log
> /var/log/libvirt/qemu/win10.log
> /var/log/libvirt/qemu/DNStest.log
> /var/log/vdsm/vdsm.log
> /var/log/ovirt-hosted-engine-ha/agent.log
>
> After that I spotted an error that had started appearing almost a month earlier on node #2:
> ERROR Internal server error
> Traceback (most recent call last):
>   File "/usr/lib/python2.7/site-packages/yajsonrpc/__init__.py", line 606, in _handle_request
>     res = method(**params)
>   File "/usr/lib/python2.7/site-packages/vdsm/rpc/Bridge.py", line 197, in _dynamicMethod
>     result = fn(*methodArgs)
>   File "/usr/lib/python2.7/site-packages/vdsm/gluster/apiwrapper.py", line 85, in logicalVolumeList
>     return self._gluster.logicalVolumeList()
>   File "/usr/lib/python2.7/site-packages/vdsm/gluster/api.py", line 90, in wrapper
>     rv = func(*args, **kwargs)
>   File "/usr/lib/python2.7/site-packages/vdsm/gluster/api.py", line 808, in logicalVolumeList
>     status = self.svdsmProxy.glusterLogicalVolumeList()
>   File "/usr/lib/python2.7/site-packages/vdsm/common/supervdsm.py", line 55, in __call__
>     return callMethod()
>   File "/usr/lib/python2.7/site-packages/vdsm/common/supervdsm.py", line 52, in <lambda>
>     getattr(self._supervdsmProxy._svdsm, self._funcName)(*args,
> AttributeError: 'AutoProxy[instance]' object has no attribute 'glusterLogicalVolumeList'
>
>
> The outputs of the following commands were also checked, as a way to see if there was a missing/killed mandatory process, a memory problem, or even a disk space shortage that could have led to the sudden death of a process:
> #ps -A
> #top
> #free -h
> #df -hT
>
> Finally, after some time delving into the logs, the output of the
> #journalctl --dmesg
> showed the following message
> "Out of memory: Kill process 5422 (qemu-kvm) score 514 or sacrifice child.
> Killed process 5422 (qemu-kvm) total-vm:17526548kB, anon-rss:9310396kB,
> file-rss:2336kB, shmem-rss:12kB"
> After that, the ovirtmgmt network stopped responding.
If you run out of memory, you should take that seriously. Dropping the cache is a workaround, not a fix.
Check if KSM is enabled - it merges identical memory pages of your VMs in exchange for some CPU cycles, which is still better than getting a VM killed.
Also, you can protect the HostedEngine from the OOM killer.
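A rough sketch (service names and PIDs will differ on your nodes, so treat this only as an illustration):

#systemctl status ksm ksmtuned
#cat /sys/kernel/mm/ksm/run
(1 means KSM is actively merging pages)
#pgrep -f HostedEngine
#echo -500 > /proc/<PID of the HE qemu-kvm process>/oom_score_adj
(a negative oom_score_adj makes the OOM killer pick other processes first; -1000 exempts the process completely)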
> I tried to restart vhostmd by executing
> #/etc/rc.d/init.d/vhostmd start
> but it didn't work.
>
> Finally, I decided to run the HE restart command on the other nodes as well (I'd figured that since the HE was last running on node #3, that's where I should try to restart it). So, I ran
> #hosted-engine --vm-start
> and the output was
> "Command VM.getStats with args {'vmID':'...<το ID της HE>....'} failed:
> (code=1,message=Virtual machine does not exist: {'vmID':'...<το ID της
> HE>....'})"
> I then ran the command again and the output was
> "VM exists and its status is Powering Up."
>
> After that I executed
> #virsh -r list
> and the output was the following:
> Id Name State
> ----------------------------------------------------
> 2 HostedEngine running
>
> After the HE's restart, two mails arrived stating ReinitializeFSM-EngineStarting and EngineStarting-EngineUp.
>
> After that and after checking that we had access to the web interface again, we executed
> #hosted-engine --set-maintenance --mode=none
> to get out of the maintenance mode.
>
> The thing is, I still am not 1000% sure what the problem was that led to the shutdown of the hosted engine, and I think that maybe some of the steps I took were not needed. I believe the qemu-kvm process was killed because there was not enough memory for it, but is this the real cause? I wasn't doing anything unusual before the shutdown that would point to the new VM that was still shut down, or anything of the sort. Also, I believe it may have been a memory shortage because I hadn't executed the
> #sync ; echo 3 > /proc/sys/vm/drop_caches
> command for a couple of weeks.
>
> What are your thoughts on this? Could you point me to where I can find more information on the topic, or tell me what the right process to follow is when something like this happens?
Check sar (there is a graphical utility called 'ksar') for CPU, memory, swap, context switches, I/O and network usage.
Create a simple systemd service to monitor your nodes, or even better deploy real monitoring software, so you can take action proactively.
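For the sar part, something like this is usually enough (just a sketch - the sysstat package has to be installed and its collector enabled):

#yum install -y sysstat
#systemctl enable --now sysstat
#sar -u          (CPU)
#sar -r          (memory)
#sar -S          (swap)
#sar -w          (context switches)
#sar -b          (I/O)
#sar -n DEV      (network)
#sar -r -f /var/log/sa/saDD    (the same data for a previous day DD)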
> Also, I have set up a few VMs but only three are up and they have no users yet; even so, the buffers fill almost to the brim while usage is almost non-existent. If you have an environment that has some users, or you use the VMs as virtual servers of some sort, what is your memory consumption? What's the optimal size for the memory?
What is your tuned profile? Any customizations there?
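You can check and change it like this (sketch):

#tuned-adm active
#tuned-adm list
#tuned-adm profile virtual-host
(virtual-host is the profile normally used on hypervisor hosts)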
Best Regards,
Strahil Nikolov
> Thank you all very much.
Re: 4.3 live migration creates wrong image permissions.
by Strahil
Hi Alex,
Did you migrate from gluster v3 to v5?
If yes, it could be the known issue in v5.3 where permissions go wrong.
If so, pick oVirt 4.3.4 as it uses a newer (fixed) version of gluster -> v5.6.
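A quick way to confirm what each node actually runs (sketch):

#gluster --version
#rpm -q glusterfs-server

Until you upgrade, setting the ownership of the affected images back to vdsm:kvm by hand, as you already do, is the only workaround I know of.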
Best Regards,
Strahil Nikolov
On Jun 13, 2019 09:46, Alex McWhirter <alex(a)triadic.us> wrote:
>
> After upgrading from 4.2 to 4.3, when a VM live migrates its disk
> images become owned by root:root. Live migration succeeds and the VM
> stays up, but after shutting down the VM from this point, starting it up
> again will cause it to fail. At that point I have to go in and change
> the permissions on the images back to vdsm:kvm, and the VM will boot
> again.
High Performance VM: trouble using vNUMA and hugepages
by Matthias Leopold
Hi,
I'm having trouble using vNUMA and hugepages at the same time:
- hypervisor host has 2 CPUs and 768G RAM
- hypervisor host is configured to allocate 512 1G hugepages
- VM configuration
* 2 virtual sockets, vCPUs are evenly pinned to 2 physical CPUs
* 512G RAM
* 2 vNUMA nodes that are pinned to the 2 host NUMA nodes
* custom property "hugepages=1048576"
- VM is the only VM on hypervisor host
when I want to start the VM I'm getting the error message
"The host foo did not satisfy internal filter NUMA because cannot
accommodate memory of VM's pinned virtual NUMA nodes within host's
physical NUMA nodes"
VM start only works when VM memory is shrunk so that it fits in (host
memory - allocated huge pages)
I don't understand why this happens. Can someone explain to me how this
is supposed to work?
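For reference, the host side can be checked like this (a rough sketch - 1G pages as configured above, paths may differ):

cat /sys/devices/system/node/node*/hugepages/hugepages-1048576kB/nr_hugepages
cat /sys/devices/system/node/node*/hugepages/hugepages-1048576kB/free_hugepages
numactl --hardware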
oVirt engine is 4.3.3
oVirt host is 4.3.4
thanks
matthias
[ANN] oVirt 4.3.5 First Release Candidate is now available
by Sandro Bonazzola
The oVirt Project is pleased to announce the availability of the oVirt
4.3.5 First Release Candidate, as of June 13th, 2019.
This update is a release candidate of the fifth in a series of
stabilization updates to the 4.3 series.
This is pre-release software. This pre-release should not be used in
production.
This release is available now on x86_64 architecture for:
* Red Hat Enterprise Linux 7.6 or later
* CentOS Linux (or similar) 7.6 or later
This release supports Hypervisor Hosts on x86_64 and ppc64le architectures
for:
* Red Hat Enterprise Linux 7.6 or later
* CentOS Linux (or similar) 7.6 or later
* oVirt Node 4.3 (available for x86_64 only)
See the release notes [1] for installation / upgrade instructions and a
list of new features and bugs fixed.
Notes:
- oVirt Appliance is already available
- oVirt Node is already available[2]
Additional Resources:
* Read more about the oVirt 4.3.5 release highlights:
http://www.ovirt.org/release/4.3.5/
* Get more oVirt Project updates on Twitter: https://twitter.com/ovirt
* Check out the latest project news on the oVirt blog:
http://www.ovirt.org/blog/
[1] http://www.ovirt.org/release/4.3.5/
[2] http://resources.ovirt.org/pub/ovirt-4.3-pre/iso/
--
Sandro Bonazzola
MANAGER, SOFTWARE ENGINEERING, EMEA R&D RHV
Red Hat EMEA <https://www.redhat.com/>
sbonazzo(a)redhat.com
<https://www.redhat.com/>
Re: Memory balloon question
by Strahil
Hi Darrell,
Yes, all VMs (both openSUSE and RedHat/CentOS 7) have the ovirt-guest-agent up and running.
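It can be verified roughly like this (sketch):

On the guest:
#systemctl status ovirt-guest-agent

On the host:
#virsh -r dommemstat <VM name>
(the balloon statistics, e.g. actual and rss, should show up if the balloon device is active)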
Best Regards,
Strahil Nikolov
On Jun 12, 2019 22:07, Darrell Budic <budic(a)onholyground.com> wrote:
>
> Do you have the ovirt-guest-agent running on your VMs? It’s required for ballooning to control allocations on the guest side.
>
>> On Jun 12, 2019, at 11:32 AM, Strahil <hunter86_bg(a)yahoo.com> wrote:
>>
>> Hello All,
>>
>> as a KVM user I know how useful the memory balloon is and how you can both increase and decrease memory live (both Linux & Windows).
>> I have noticed that I cannot decrease the memory in oVirt.
>>
>> Does anyone have a clue why the situation is like that?
>>
>> I was expecting the guaranteed memory to be the minimum below which the balloon driver will not go, but when I put my host under pressure, the host just started to swap instead of reclaiming some of the VM memory (and my VMs had plenty of free memory).
>>
>> It would be great if oVirt could decrease the memory (if the VM has unallocated memory) when the host is under pressure and the VM cannot be relocated.
>>
>> Best Regards,
>> Strahil Nikolov
>>
Re: oVirt hyperconverged setup error
by Strahil
Simone has provided you a workaround, so check his e-mail.
The ansible task is actually running:
dig ov-node-2 +short
As dig queries your DNS server directly, you need to bypass this check if you rely on /etc/hosts entries.
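A quick way to see the difference (sketch):

#dig ov-node-2 +short
(queries DNS only - empty output means the Ansible check will fail)
#getent hosts ov-node-2
(also consults /etc/hosts)

If only the second command returns an IP, the names resolve via /etc/hosts only, and you either need proper DNS records or Simone's workaround.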
Best Regards,
Strahil Nikolov
On Jun 12, 2019 21:57, PS Kazi <faruk.apsara(a)gmail.com> wrote:
>
> Hi,
> How can I check whether my DNS settings are OK for oVirt?
>
> On Wed, 12 Jun 2019, 6:55 pm Strahil Nikolov, <hunter86_bg(a)yahoo.com> wrote:
>>
>> The command run is 'dig', which tries to resolve the hostname of each server.
>> Do you have a DNS resolver properly configured ?
>>
>> Best Regards,
>> Strahil Nikolov
>>
>> On Wednesday, June 12, 2019, 3:59:14 AM GMT-4, PS Kazi <faruk.apsara(a)gmail.com> wrote:
>>
>>
>> oVirt Node version 4.3.3.1
>> I am trying to configure a 3-node Gluster storage and oVirt hosted engine setup but I am getting the following error:
>>
>> TASK [gluster.features/roles/gluster_hci : Check if valid FQDN is provided] ****
>> failed: [ov-node-2 -> localhost] (item=ov-node-2) => {"changed": true, "cmd": ["dig", "ov-node-2", "+short"], "delta": "0:00:00.041003", "end": "2019-06-12 12:52:34.158688", "failed_when_result": true, "item": "ov-node-2", "rc": 0, "start": "2019-06-12 12:52:34.117685", "stderr": "", "stderr_lines": [], "stdout": "", "stdout_lines": []}
>> failed: [ov-node-2 -> localhost] (item=ov-node-3) => {"changed": true, "cmd": ["dig", "ov-node-3", "+short"], "delta": "0:00:00.038688", "end": "2019-06-12 12:52:34.459176", "failed_when_result": true, "item": "ov-node-3", "rc": 0, "start": "2019-06-12 12:52:34.420488", "stderr": "", "stderr_lines": [], "stdout": "", "stdout_lines": []}
>> failed: [ov-node-2 -> localhost] (item=ov-node-1) => {"changed": true, "cmd": ["dig", "ov-node-1", "+short"], "delta": "0:00:00.047938", "end": "2019-06-12 12:52:34.768149", "failed_when_result": true, "item": "ov-node-1", "rc": 0, "start": "2019-06-12 12:52:34.720211", "stderr": "", "stderr_lines": [], "stdout": "", "stdout_lines": []}
>>
>>
>> Please help
Re: Replace bad Host from a 9 Node hyperconverged setup 4.3.3
by Strahil
Hi Adrian,
You have several options:
A) If you have space on another gluster volume (or volumes) or on NFS-based storage, you can migrate all VMs live. Once you do it, the simple way will be to stop and remove the storage domain (from the UI) and the gluster volume that correspond to the problematic brick. Once gone, you can remove the entry in oVirt for the old host and add the newly built one. Then you can recreate your volume and migrate the data back.
B) If you don't have space, you have to use a riskier approach (usually it shouldn't be risky, but I had a bad experience with gluster v3):
- New server has same IP and hostname:
Use the command line and run 'gluster volume reset-brick VOLNAME HOSTNAME:BRICKPATH HOSTNAME:BRICKPATH commit'
Replace VOLNAME with your volume name.
A more practical example would be:
'gluster volume reset-brick data ovirt3:/gluster_bricks/data/brick ovirt3:/gluster_bricks/data/brick commit'
If it refuses, then you have to clean up '/gluster_bricks/data' (which should be empty).
Also check if the new peer has been probed via 'gluster peer status'. Check that the firewall is allowing gluster communication (you can compare it with the firewall on another gluster host).
The automatic healing will kick in within 10 minutes (if the reset succeeds) and will stress the other 2 replicas, so pick your time properly.
Note: I'm not recommending you to use the 'force' option in the previous command ... for now :)
- The new server has a different IP/hostname:
Instead of 'reset-brick' you can use 'replace-brick':
It should be like this:
gluster volume replace-brick data old-server:/path/to/brick new-server:/new/path/to/brick commit force
In both cases check the status via:
gluster volume info VOLNAME
If your cluster is in production, I really recommend the first option as it is less risky and the chance of unplanned downtime will be minimal.
The 'reset-brick' error in your previous e-mail shows that one of the servers is not connected. Check the peer status on all servers; if fewer peers are connected than there should be, check for network and/or firewall issues.
On the new node check if glusterd is enabled and running.
In order to debug, you should provide more info, like 'gluster volume info' and the peer status from each node.
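Roughly (sketch - run on every node and compare):

#gluster peer status
#gluster volume info
#gluster volume status
#systemctl status glusterd
#firewall-cmd --list-services
(the glusterfs service should be listed; compare with a healthy node)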
Best Regards,
Strahil Nikolov
On Jun 10, 2019 20:10, Adrian Quintero <adrianquintero(a)gmail.com> wrote:
>
> Can you let me know how to fix the gluster and missing brick?
> I tried removing it by going to Storage > Volumes > vmstore > Bricks and selecting the brick.
> However it is showing an unknown status (which is expected because the server was completely wiped), so if I try "remove", "replace brick" or "reset brick" it won't work.
> If I do remove brick: Incorrect bricks selected for removal in Distributed Replicate volume. Either all the selected bricks should be from the same sub volume or one brick each for every sub volume!
> If I try "replace brick" I can't, because I don't have another server with extra bricks/disks.
> And if I try "reset brick": Error while executing action Start Gluster Volume Reset Brick: Volume reset brick commit force failed: rc=-1 out=() err=['Host myhost1_mydomain_com not connected']
>
> Are you suggesting to try and fix the gluster using the command line?
>
> Note that I can't "peer detach" the server, so if I force the removal of the bricks, would I need to force a downgrade to replica 2 instead of 3? What would happen to oVirt, as it only supports replica 3?
>
> thanks again.
>
> On Mon, Jun 10, 2019 at 12:52 PM Strahil <hunter86_bg(a)yahoo.com> wrote:
>>
>> Hi Adrian,
>> Did you fix the issue with the gluster and the missing brick?
>> If yes, try to set the 'old' host in maintenance an
oVirt engine/engine-setup with other port than default HTTPS 443 possible?
by Dirk Rydvan
Hello,
In a home lab with only one public IPv4 address, port 443 is a very precious one.
The installation of oVirt 4.3 on a single node/host with bare-metal CentOS works well (oVirt Cockpit and a locally installed engine added after the installation of CentOS 7.6).
But it is more difficult to change the port from 443 to, say, 4443 in order to free up port 443.
The change in:
- /etc/httpd/conf/conf.d/ssl.conf
- /var/lib/ovirt-engine/jboss_runtime/config/ovirt-engine.xml
and
- disable selinux
- add 4443 to public with firewall-cmd
It does not work... now I see it is more difficult than I thought...
A port redirection at the edge router from <public-IPv4>:4443 to <private-IPv4>:443 also does not work, because the links all point to the standard https address without a port number.
Is there a way to change the default port 443 of the oVirt engine to another port?
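For completeness, a local firewalld port forward on the node itself would look like this (untested on my side, and it has the same limitation as the router redirect - the generated links still point to 443):

firewall-cmd --permanent --add-forward-port=port=4443:proto=tcp:toport=443
firewall-cmd --reload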
Many Thanks in advance!
RFE: HostedEngine to use boom by default
by Strahil Nikolov
Hello All,
I have seen a lot of cases where the HostedEngine gets corrupted/broken and beyond repair.
I think that BOOM is a good option for our HostedEngine appliances because it supports booting from LVM snapshots, which makes it easy to recover after upgrades or other exceptional situations.
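A rough sketch of what that workflow could look like, assuming the appliance root is a single LV (the VG/LV names here are made up - adjust to the real ones, and double-check the boom flags with 'boom --help', as I am quoting them from memory):

lvcreate -s -n root_pre_upgrade -L 5G ovirt/root
boom create --title "HE pre-upgrade" --rootlv ovirt/root_pre_upgrade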
Sadly, BOOM has one drawback: everything has to be under a single snapshot, so there is no separation of /var, /log or /audit.
Do you think that changing the appliance layout is worth it?
Note: I might have an unsupported layout that could be causing my confusion. Is your layout a single root LV?
Best Regards,
Strahil Nikolov
4K Sector Hard drive Support - oVirt 4.3.4
by nico.kruger@darkmatter.ae
Hi Guys,
Does anyone have any idea if/when 4K Sector drives will be supported?
My understanding is that this was one of the features for 4.3.4
Thanks
Nico Kruger