Re: Hosted Engine Abruptly Stopped Responding - Unexpected Shutdown

On Jun 6, 2019 12:52, souvaliotimaria@mail.com wrote:
Hello,
I came upon a problem last month that I figured would be good to discuss here. I'm sorry I didn't post earlier, but time got away from me.
I have set up a glustered, hyperconverged oVirt environment for experimental use, as a means to see its behaviour and get used to its management and performance before setting it up as a production environment for use in our organization. The environment has been up and running since October 2018. The three nodes are HP ProLiant DL380 G7 and have the following characteristics:
Mem: 22GB
CPU: 2x Hexa Core - Intel Xeon Hexa Core E56xx
HDD: 5x 300GB
Network: BCM5709C with dual-port Gigabit
OS: Linux RedHat 7.5.1804 (Core 3.10.0-862.3.2.el7.x86_64 x86_64) - Ovirt Node 4.2.3.1
As I was working on the environment, the engine stopped working. Shortly before the HE stopped, I was in the web interface managing my VMs when the browser froze, and the HE also stopped responding to ICMP requests.
The first thing I did was to connect via ssh to all nodes and run the command #hosted-engine --vm-status which showed that the HE was down in nodes 1 and 2 and up on the 3rd node.
After executing #virsh -r list the VM list that was shown contained two of the VMs I had previously created that were up; the HE was nowhere.
I tried to restart the HE with #hosted-engine --vm-start but it didn't work.
I then put all nodes in maintenance mode with the command #hosted-engine --set-maintenance --mode=global (I guess I should have done that earlier) and re-ran #hosted-engine --vm-start, which had the same result as before.
After checking the mails the system sent to the root user, I saw there were several mails on the 3rd node (where the HE had been), informing of the HE's state. The messages were changing between EngineDown-EngineStart, EngineStart-EngineStarting, EngineStarting-EngineMaybeAway, EngineMaybeAway-EngineUnexpectedlyDown, EngineUnexpectedlyDown-EngineDown, EngineDown-EngineStart and so forth.
I continued by searching the following logs on all nodes:
/var/log/libvirt/qemu/HostedEngine.log
/var/log/libvirt/qemu/win10.log
/var/log/libvirt/qemu/DNStest.log
/var/log/vdsm/vdsm.log
/var/log/ovirt-hosted-engine-ha/agent.log
After that I spotted an error that had started appearing almost a month ago on node #2:
ERROR Internal server error
Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/yajsonrpc/__init__.py", line 606, in _handle_request
    res = method(**params)
  File "/usr/lib/python2.7/site-packages/vdsm/rpc/Bridge.py", line 197, in _dynamicMethod
    result = fn(*methodArgs)
  File "/usr/lib/python2.7/site-packages/vdsm/gluster/apiwrapper.py", line 85, in logicalVolumeList
    return self._gluster.logicalVolumeList()
  File "/usr/lib/python2.7/site-packages/vdsm/gluster/api.py", line 90, in wrapper
    rv = func(*args, **kwargs)
  File "/usr/lib/python2.7/site-packages/vdsm/gluster/api.py", line 808, in logicalVolumeList
    status = self.svdsmProxy.glusterLogicalVolumeList()
  File "/usr/lib/python2.7/site-packages/vdsm/common/supervdsm.py", line 55, in __call__
    return callMethod()
  File "/usr/lib/python2.7/site-packages/vdsm/common/supervdsm.py", line 52, in <lambda>
    getattr(self._supervdsmProxy._svdsm, self._funcName)(*args,
AttributeError: 'AutoProxy[instance]' object has no attribute 'glusterLogicalVolumeList'
The outputs of the following commands were also checked as a way to see if there was a mandatory process missing/killed, a memory problem or even a disk space shortage that led to the sudden death of a process:
#ps -A
#top
#free -h
#df -hT
Finally, after some time delving in the logs, the output of #journalctl --dmesg showed the following message: "Out of memory: Kill process 5422 (qemu-kvm) score 514 or sacrifice child. Killed process 5422 (qemu-kvm) total-vm:17526548kB, anon-rss:9310396kB, file-rss:2336kB, shmem-rss:12kB". After that, ovirtmgmt stopped responding.
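For anyone searching their own nodes for the same symptom, here is a minimal sketch that pulls the OOM-killer victims out of kernel log text. It assumes the "Killed process ..." line format quoted above; the function name summarize_oom_kills is made up for illustration, and other kernel versions may word the message differently.

```shell
# Summarize OOM-killer victims from kernel log text read on stdin.
# Assumes lines shaped like the one quoted in this thread:
#   Killed process 5422 (qemu-kvm) total-vm:...kB, anon-rss:...kB, ...
# Prints one line per victim: <pid> <name> <anon-rss in MiB, truncated>
summarize_oom_kills() {
    sed -n -E 's/.*Killed process ([0-9]+) \(([^)]+)\).*anon-rss:([0-9]+)kB.*/\1 \2 \3/p' |
    while read -r pid name rss_kb; do
        echo "$pid $name $((rss_kb / 1024))MiB"
    done
}

# Typical use on a node (same log source as in the message above):
#   journalctl --dmesg | summarize_oom_kills
```

The anon-rss figure is the part that dropping caches cannot give back, since it is anonymous memory held by the process and not page cache.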
If you run out of memory, you should take that seriously. Dropping the cache seems like a workaround, not a fix. Check if KSM is enabled - it will merge your VMs' memory pages in exchange for CPU cycles, which is still better than getting a VM killed. Also, you can protect the HostedEngine from the OOM killer.
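A rough sketch of both suggestions follows. It relies on two standard kernel interfaces, /sys/kernel/mm/ksm/run and /proc/<pid>/oom_score_adj; the function names and the optional path arguments are hypothetical and exist only so the functions can be tried against copies of those files.

```shell
# Report whether KSM is running. The kernel exposes the state in
# /sys/kernel/mm/ksm/run: 0 = stopped, 1 = running,
# 2 = stopped and all merged pages unmerged.
ksm_state() {
    run_file="${1:-/sys/kernel/mm/ksm/run}"
    case "$(cat "$run_file" 2>/dev/null)" in
        1) echo "running" ;;
        0) echo "stopped" ;;
        2) echo "stopped-unmerged" ;;
        *) echo "unknown" ;;
    esac
}

# Adjust how attractive one process is to the OOM killer by writing to
# /proc/<pid>/oom_score_adj (valid range -1000..1000; -1000 exempts it).
# Needs root. The third argument is a hypothetical override of /proc.
set_oom_score_adj() {
    pid="$1"; adj="$2"; proc_root="${3:-/proc}"
    if [ "$adj" -lt -1000 ] || [ "$adj" -gt 1000 ]; then
        echo "adj must be between -1000 and 1000" >&2
        return 1
    fi
    echo "$adj" > "$proc_root/$pid/oom_score_adj"
}
```

On a node, the PID could come from something like pgrep -f 'qemu-kvm.*HostedEngine', where the pattern is an assumption about how the engine VM's process shows up in the process list. The value applies only to that running process, so it would have to be set again after the VM restarts.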
I tried to restart vhostmd by executing #/etc/rc.d/init.d/vhostmd start but it didn't work.
Finally, I decided to run the HE restart command on the other nodes as well (I'd figured that since the HE was last running on node #3, that's where I should try to restart it). So, I ran #hosted-engine --vm-start and the output was "Command VM.getStats with args {'vmID':'...<the HE's ID>....'} failed: (code=1,message=Virtual machine does not exist: {'vmID':'...<the HE's ID>....'})". Then I ran the command again and the output was "VM exists and its status is Powering Up."
After that I executed #virsh -r list and the output was the following:
 Id    Name                           State
----------------------------------------------------
 2     HostedEngine                   running
After the HE's restart two mails came that stated: ReinitializeFSM-EngineStarting and EngineStarting-EngineUp
After that and after checking that we had access to the web interface again, we executed hosted-engine --set-maintenance --mode=none to get out of the maintenance mode.
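The recovery steps described in this message can be collected into a small script. This is only a sketch of the order used here, not an official procedure: every command is one already quoted above, while the run wrapper, the DRY_RUN variable and the function names are hypothetical. DRY_RUN defaults to 1 so that nothing touches the cluster until it is set to 0 on purpose.

```shell
# Run a command, or only print it when DRY_RUN=1 (the default).
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# Step 1: enter global maintenance, start the engine VM, then inspect it.
start_engine_under_maintenance() {
    run hosted-engine --set-maintenance --mode=global
    run hosted-engine --vm-start
    run virsh -r list
    run hosted-engine --vm-status
}

# Step 2: run this separately, only once the web interface answers again.
leave_maintenance() {
    run hosted-engine --set-maintenance --mode=none
}
```

Leaving maintenance is kept as a separate function because that step depends on a manual check that the engine is reachable again.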
The thing is, I still am not 1000% sure what the problem was that led to the shutdown of the hosted engine, and I think that maybe some of the steps I took were not needed. I believe it was because the qemu-kvm process was killed when there was not enough memory for it, but is this the real cause? I wasn't doing anything unusual before the shutdown that would make me believe it was caused by the new VM, which was still shut down, or anything of the sort. Also, I believe it may be because of a memory shortage, because I hadn't executed the #sync ; echo 3 > /proc/sys/vm/drop_caches command for a couple of weeks.
What are your thoughts on this? Could you point me to where to search for more information on the topic or tell me what is the right process to follow when something like this happens?
Check sar (there is a graphical utility called 'ksar') and look at CPU, memory, swap, context switches, I/O and network usage. Create a simple systemd service to monitor your nodes, or even better, set up real monitoring software so you can take action proactively.
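As one possible starting point for such a check, the sketch below reads /proc/meminfo and warns when available memory drops under a threshold. It assumes the kernel reports a MemAvailable field; the function names and the 10% default are illustrative choices, not recommendations from this thread. For the history, the usual sar reports are sar -u (CPU), sar -r (memory), sar -S (swap), sar -w (context switches), sar -b (I/O) and sar -n DEV (network).

```shell
# Print MemAvailable as an integer percentage of MemTotal (truncated).
# The file argument defaults to /proc/meminfo.
mem_available_pct() {
    awk '/^MemTotal:/ {t = $2} /^MemAvailable:/ {a = $2}
         END { if (t > 0) printf "%d\n", a * 100 / t; else exit 1 }' \
        "${1:-/proc/meminfo}"
}

# Return 0 when memory looks fine, 1 when the available share is below
# the threshold (percent, default 10), 2 when meminfo cannot be parsed.
check_memory() {
    threshold="${1:-10}"
    pct=$(mem_available_pct "${2:-/proc/meminfo}") || return 2
    if [ "$pct" -lt "$threshold" ]; then
        echo "WARNING: only ${pct}% of memory available"
        return 1
    fi
    echo "OK: ${pct}% of memory available"
}
```

A script like this could be called from a systemd timer or cron job, with the non-zero exit status used to trigger a mail or an alert in whatever monitoring is already in place.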
Also, I have set up a few VMs but only three are Up and they have no users yet; even so, the buffers fill almost to the brim when the usage is almost non-existent. If you have an environment that has some users, or you use the VMs as virtual servers of some sort, what is the memory consumption? What's the optimal size for the memory?
What is your tuned profile? Any customizations there?
Best Regards,
Strahil Nikolov
Thank you all very much.
_______________________________________________
Users mailing list -- users@ovirt.org
List Archives: https://lists.ovirt.org/archives/list/users@ovirt.org/message/PKRB26GSDQ5JVH...

Hello and thank you very much for your reply. I'm terribly sorry for being so late to respond.

I thought the same, that dropping the cache was more of a workaround and not a real solution, but truthfully I was stuck and can't think of anything beyond how much I need to upgrade the memory on the nodes. I have been trying to find information about other oVirt virtualization set-ups and the amount of memory allocated, so I can get an idea of what my set-up needs. The only thing I found was that one admin had set oVirt up with 128GB and still needed more because of the growing needs of the system and its users, and was about to upgrade its memory too. I'm just worried that oVirt is very memory consuming and that no matter how much I "feed" it, it will still ask for more. I'm also worried that there are one, two or even more tweaks in the configuration that I'm still missing and that could solve the memory problem.

Anyway, KSM is enabled. sar shows that when a Windows 10 VM is also active (alongside the Hosted Engine, of course, and two Linux VMs: one CentOS, one Debian), committed memory is around 89% on the host where it runs (together with the Debian VM) and has reached as high as 98%.

You are correct about the monitoring system too. I have set up a PRTG environment and there's Nagios running, but they can't see oVirt yet. I will set them up correctly in the next few days.

I haven't made any changes to my tuned profile; it's the default from oVirt. Specifically, the active profile is reported as virtual-host.

Again, I'm very sorry for taking so long to reply, and thank you very much for your response.

Best Regards,
Maria Souvalioti
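For reference, the committed-memory percentage that sar reports relates the kernel's Committed_AS value to RAM plus swap. A minimal sketch that computes the same ratio directly from /proc/meminfo is shown here; the function name commit_pct is hypothetical and the result is truncated to a whole number, so it can differ slightly from sar's own rounding.

```shell
# Print committed memory as an integer percentage of RAM plus swap:
#   Committed_AS * 100 / (MemTotal + SwapTotal)
# The file argument defaults to /proc/meminfo.
commit_pct() {
    awk '/^MemTotal:/ {m = $2} /^SwapTotal:/ {s = $2} /^Committed_AS:/ {c = $2}
         END { if (m + s > 0) printf "%d\n", c * 100 / (m + s); else exit 1 }' \
        "${1:-/proc/meminfo}"
}
```

Because this counts memory that processes have been promised and not only what they currently use, a value close to or above 100% is a sign that the host is overcommitted, which fits a node where the OOM killer has already fired.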