When I read your intro and hit the memory figure, I said to myself: what?!
I'd definitely increase the memory if possible, as much as you can
affordably fit into the servers.
The engine asks for 16GB at installation time; add some for the Gluster
services and you're at your limit before you add a single user VM.
My first non-hyperconverged hosted-engine install used a 32GB and a 24GB
dual-Xeon machine, with only 8GB allocated for the engine VM.
I felt more confident in it when I upgraded the 24GB node to 48GB. So 48GB
would be my minimum, 64GB is OK, and the more the better.
Later, I was able to find some used 144GB Supermicro servers, with which I
replaced the nodes above.
Modern 64-bit CentOS likes to have around 2GB per core for basic server
functions.
For desktops, I say have at least 8GB because web browsers eat up RAM.
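To put rough numbers on that against the hardware described below (a
back-of-envelope sketch only, assuming 2x hexa-core = 12 physical cores per
node and the stock 16GB engine VM):

  12 cores x 2GB host/hypervisor baseline   ~24GB
  engine VM                                  16GB
  gluster, vdsm, HA agent/broker             a few GB more

That already overshoots the 22GB installed before the first guest VM powers
on, so an OOM kill of qemu-kvm is pretty much what you'd expect.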
On Thu, Jun 6, 2019 at 5:52 AM <souvaliotimaria(a)mail.com> wrote:
Hello,
I ran into a problem last month that I figured would be good to discuss
here. I'm sorry I didn't post earlier, but time got away from me.
I have set up a glustered, hyperconverged oVirt environment for
experimental use, as a means to see its behaviour and get used to its
management and performance before setting it up as a production environment
for our organization. The environment has been up and running since October
2018. The three nodes are HP ProLiant DL380 G7 servers with the following
characteristics:
Mem: 22GB
CPU: 2x Intel Xeon E56xx (hexa-core)
HDD: 5x 300GB
Network: BCM5709C dual-port Gigabit
OS: Linux Red Hat 7.5.1804 (Core), kernel 3.10.0-862.3.2.el7.x86_64 x86_64 -
oVirt Node 4.2.3.1
As I was working on the environment, the engine stopped working.
Not long before the HE stopped, I was in the web interface managing my VMs
when the browser froze and the HE also stopped responding to ICMP requests.
The first thing I did was connect via ssh to all the nodes and run the
command
#hosted-engine --vm-status
which showed that the HE was down on nodes 1 and 2 and up on the 3rd node.
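For anyone retracing these steps later: besides the per-host "Engine status"
line in that output, it's usually worth confirming that the HA services
themselves are alive on each node, since a dead agent or broker can leave
the status output stale or misleading. A minimal check, assuming a standard
hosted-engine deployment:
#systemctl status ovirt-ha-agent ovirt-ha-broker
#hosted-engine --vm-status | grep -E "Hostname|Engine status|Score"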
After executing
#virsh -r list
the VM list that came back contained two of the VMs I had previously
created and which were up; the HE was nowhere to be found.
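One detail that can save some confusion at this point: virsh -r list only
shows running domains. Adding --all also lists defined-but-stopped guests,
although the HE VM is normally started as a transient domain by the HA
agent, so it may still not show up while it is down:
#virsh -r list --all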
I tried to restart the HE with
#hosted-engine --vm-start
but it didn't work.
I then put the deployment into global maintenance mode with the command
#hosted-engine --set-maintenance --mode=global
(I guess I should have done that earlier) and re-ran
#hosted-engine --vm-start
which had the same result as before.
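For reference, the sequence that is generally recommended for a manual HE
restart (a sketch, assuming the storage and the HA services on the node are
otherwise healthy):
#hosted-engine --set-maintenance --mode=global
#hosted-engine --vm-shutdown      # or --vm-poweroff if it is unresponsive
#hosted-engine --vm-start
#hosted-engine --vm-status        # wait until the engine health reports good
#hosted-engine --set-maintenance --mode=none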
After checking the mails the system sent to the root user, I saw there were
several mails on the 3rd node (where the HE had been running), informing me
of the HE's state. The messages alternated between EngineDown-EngineStart,
EngineStart-EngineStarting, EngineStarting-EngineMaybeAway,
EngineMaybeAway-EngineUnexpectedlyDown, EngineUnexpectedlyDown-EngineDown,
EngineDown-EngineStart and so forth.
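The same state transitions are also written by the HA agent to its log on
each host, so if the root mails are not handy, a grep along these lines
(the state names are the ones quoted above) reconstructs the timeline:
#grep -E "EngineDown|EngineStart|EngineUp|EngineMaybeAway|EngineUnexpectedlyDown" /var/log/ovirt-hosted-engine-ha/agent.log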
I continued by searching the following logs on all nodes:
/var/log/libvirt/qemu/HostedEngine.log
/var/log/libvirt/qemu/win10.log
/var/log/libvirt/qemu/DNStest.log
/var/log/vdsm/vdsm.log
/var/log/ovirt-hosted-engine-ha/agent.log
After that I spotted an error that had started appearing almost a month
earlier on node #2:
ERROR Internal server error
Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/yajsonrpc/__init__.py", line 606, in _handle_request
    res = method(**params)
  File "/usr/lib/python2.7/site-packages/vdsm/rpc/Bridge.py", line 197, in _dynamicMethod
    result = fn(*methodArgs)
  File "/usr/lib/python2.7/site-packages/vdsm/gluster/apiwrapper.py", line 85, in logicalVolumeList
    return self._gluster.logicalVolumeList()
  File "/usr/lib/python2.7/site-packages/vdsm/gluster/api.py", line 90, in wrapper
    rv = func(*args, **kwargs)
  File "/usr/lib/python2.7/site-packages/vdsm/gluster/api.py", line 808, in logicalVolumeList
    status = self.svdsmProxy.glusterLogicalVolumeList()
  File "/usr/lib/python2.7/site-packages/vdsm/common/supervdsm.py", line 55, in __call__
    return callMethod()
  File "/usr/lib/python2.7/site-packages/vdsm/common/supervdsm.py", line 52, in <lambda>
    getattr(self._supervdsmProxy._svdsm, self._funcName)(*args,
AttributeError: 'AutoProxy[instance]' object has no attribute 'glusterLogicalVolumeList'
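That traceback lives in the vdsm/supervdsm layer rather than in the engine
VM itself, so a quick sanity check is whether the vdsm/supervdsm services
and the Gluster volumes are otherwise healthy; if they are, the missing
glusterLogicalVolumeList verb is more likely a recurring nuisance than the
reason the HE went down. A minimal check with the standard services on an
oVirt node:
#systemctl status vdsmd supervdsmd
#gluster volume status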
The outputs of the following commands were also checked, as a way to see
whether there was a mandatory process missing or killed, a memory problem,
or even a disk space shortage that could have led to the sudden death of a
process:
#ps -A
#top
#free -h
#df -hT
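For the memory angle specifically, a couple of generic checks narrow things
down faster than eyeballing top (nothing oVirt-specific, just standard
tools):
#ps -eo pid,comm,rss,%mem --sort=-rss | head -n 10   # biggest resident-memory consumers
#free -h                                             # "available" is the column that matters
#grep -i -E "out of memory|oom" /var/log/messages    # any earlier OOM kills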
Finally, after some time delving into the logs, the output of
#journalctl --dmesg
showed the following message:
"Out of memory: Kill process 5422 (qemu-kvm) score 514 or sacrifice child.
Killed process 5422 (qemu-kvm) total-vm:17526548kB, anon-rss:9310396kB,
file-rss:2336kB, shmem-rss:12kB"
after which the ovirtmgmt network stopped responding.
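For anyone hitting the same thing: the kernel log keeps the OOM kill
messages, so a quick way to confirm which qemu-kvm processes were taken,
and when, is:
#journalctl -k | grep -i -B1 -A2 "out of memory"
#dmesg -T | grep -i "killed process"
Matching those timestamps against /var/log/libvirt/qemu/HostedEngine.log
(listed above) shows whether the victim really was the HE VM.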
I tried to restart vhostmd by executing
#/etc/rc.d/init.d/vhostmd start
but it didn't work.
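On EL7 the cleaner path is usually the systemd unit (systemd generates one
from that init script even when the package doesn't ship its own), and the
journal then shows why the start failed:
#systemctl restart vhostmd
#journalctl -u vhostmd --no-pager | tail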
Finally, I decided to run the HE restart command on the other nodes as well
(I had figured that since the HE was last running on node #3, that's where
I should try to restart it). So, I ran
#hosted-engine --vm-start
and the output was
"Command VM.getStats with args {'vmID':'...<the HE's ID>....'} failed:
(code=1,message=Virtual machine does not exist: {'vmID':'...<the HE's
ID>....'})"
I then ran the command again and the output was
"VM exists and its status is Powering Up."
After that I executed
#virsh -r list
and the output was the following:
Id Name State
----------------------------------------------------
2 HostedEngine running
After the HE's restart, two mails arrived stating:
ReinitializeFSM-EngineStarting and EngineStarting-EngineUp
After that, and after checking that we had access to the web interface
again, we executed
#hosted-engine --set-maintenance --mode=none
to get out of global maintenance mode.
The thing is, I still am not 1000% sure what the problem was that led to
the shutdown of the hosted engine, and I think that maybe some of the steps
I took were not needed. I believe it was because the qemu-kvm process was
killed when there was not enough memory for it, but is this the real cause?
I wasn't doing anything unusual before the shutdown to make me believe it
was because of the new VM that was still in shutdown mode, or anything of
the sort. Also, I believe it may be because of a memory shortage, because I
hadn't executed the
#sync ; echo 3 > /proc/sys/vm/drop_caches
command for a couple of weeks.
What are your thoughts on this? Could you point me to where to search for
more information on the topic, or tell me what the right process to follow
is when something like this happens?
Also, I have set up a few VMs, but only three are Up and they have no users
yet; even so, the buffers fill almost to the brim while the usage is almost
non-existent. If you have an environment with some users, or you use the
VMs as virtual servers of some sort, what is your memory consumption?
What's the optimal size for the memory?
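On the buffers question, one thing worth checking before worrying (and
before reaching for drop_caches): most of what free reports as buff/cache
is page cache that the kernel reclaims on its own under pressure, so the
figure to watch is "available", not "free". A quick way to see the split,
assuming a kernel recent enough to expose MemAvailable:
#free -h                                             # compare "available" against total
#grep -i -E "memavailable|memfree|^cached" /proc/meminfo
If "available" is also collapsing with only three idle VMs up, that points
back at a real memory shortage (engine VM + gluster + host baseline) rather
than at cache.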
Thank you all very much.