[ovirt-users] Hosted-Engine HA problem
Sandro Bonazzola
sbonazzo at redhat.com
Fri Oct 31 10:40:59 UTC 2014
On 31/10/2014 10:26, Jaicel wrote:
> I've increased the limit and then restarted the agent and broker. The status normalized, but it has now gone back to the "False" state, although both hosts still have a score of 2400. The agent logs remain the same, with an "ovirt-ha-agent dead but subsys locked" status. The ha-broker logs are below.
>
> Thread-138::INFO::2014-10-31 17:24:22,981::listener::134::ovirt_hosted_engine_ha.broker.listener.ConnectionHandler::(setup) Connection established
> Thread-138::INFO::2014-10-31 17:24:22,991::listener::184::ovirt_hosted_engine_ha.broker.listener.ConnectionHandler::(handle) Connection closed
> Thread-139::INFO::2014-10-31 17:24:38,385::listener::134::ovirt_hosted_engine_ha.broker.listener.ConnectionHandler::(setup) Connection established
> Thread-139::INFO::2014-10-31 17:24:38,395::listener::184::ovirt_hosted_engine_ha.broker.listener.ConnectionHandler::(handle) Connection closed
> Thread-140::INFO::2014-10-31 17:24:53,816::listener::134::ovirt_hosted_engine_ha.broker.listener.ConnectionHandler::(setup) Connection established
> Thread-140::INFO::2014-10-31 17:24:53,827::listener::184::ovirt_hosted_engine_ha.broker.listener.ConnectionHandler::(handle) Connection closed
> Thread-141::INFO::2014-10-31 17:25:09,172::listener::134::ovirt_hosted_engine_ha.broker.listener.ConnectionHandler::(setup) Connection established
> Thread-141::INFO::2014-10-31 17:25:09,182::listener::184::ovirt_hosted_engine_ha.broker.listener.ConnectionHandler::(handle) Connection closed
> Thread-142::INFO::2014-10-31 17:25:24,551::listener::134::ovirt_hosted_engine_ha.broker.listener.ConnectionHandler::(setup) Connection established
> Thread-142::INFO::2014-10-31 17:25:24,562::listener::184::ovirt_hosted_engine_ha.broker.listener.ConnectionHandler::(handle) Connection closed
>
> Thanks,
> Jaicel
>
> ----- Original Message -----
> From: "Jiri Moskovcak" <jmoskovc at redhat.com>
> To: "Jaicel R. Sabonsolin" <jaicel at asti.dost.gov.ph>, "Niels de Vos" <ndevos at redhat.com>
> Cc: "Vijay Bellur" <vbellur at redhat.com>, users at ovirt.org, "Gluster Devel" <gluster-devel at gluster.org>
> Sent: Friday, October 31, 2014 4:32:02 PM
> Subject: Re: [ovirt-users] Hosted-Engine HA problem
>
> On 10/31/2014 03:53 AM, Jaicel R. Sabonsolin wrote:
>> Hi guys,
>>
>> These logs appear on both hosts, just like the result of --vm-status. I tried running tcpdump on the oVirt hosts and Gluster nodes, but only packets exchanged with my monitoring VM (Zabbix) appeared.
>>
>> agent.log
>> new_data = self.refresh(self._state.data)
>> File "/usr/lib/python2.6/site-packages/ovirt_hosted_engine_ha/agent/state_machine.py", line 77, in refresh
>> stats.update(self.hosted_engine.collect_stats())
>> File "/usr/lib/python2.6/site-packages/ovirt_hosted_engine_ha/agent/hosted_engine.py", line 662, in collect_stats
>> constants.SERVICE_TYPE)
>> File "/usr/lib/python2.6/site-packages/ovirt_hosted_engine_ha/lib/brokerlink.py", line 171, in get_stats_from_storage
>> result = self._checked_communicate(request)
>> File "/usr/lib/python2.6/site-packages/ovirt_hosted_engine_ha/lib/brokerlink.py", line 199, in _checked_communicate
>> .format(message or response))
>> RequestError: Request failed: <type 'exceptions.OSError'>
>>
>> broker.log
>> File "/usr/lib/python2.6/site-packages/ovirt_hosted_engine_ha/broker/listener.py", line 165, in handle
>> response = "success " + self._dispatch(data)
>> File "/usr/lib/python2.6/site-packages/ovirt_hosted_engine_ha/broker/listener.py", line 261, in _dispatch
>> .get_all_stats_for_service_type(**options)
>> File "/usr/lib/python2.6/site-packages/ovirt_hosted_engine_ha/broker/storage_broker.py", line 41, in get_all_stats_for_service_type
>> d = self.get_raw_stats_for_service_type(storage_dir, service_type)
>> File "/usr/lib/python2.6/site-packages/ovirt_hosted_engine_ha/broker/storage_broker.py", line 74, in get_raw_stats_for_service_type
>> f = os.open(path, direct_flag | os.O_RDONLY)
>> OSError: [Errno 24] Too many open files: '/rhev/data-center/mnt/gluster1:_engine/6eb220be-daff-4785-8f78-111cc24139c4/ha_agent/hosted-engine.metadata'
>
> - ah, there we go ^^^^^^ you might need to tweak the limit of allowed
> open files as described here [1], or find which app keeps so many files open
It would be good to understand whether this is specific to that host or a common case, in which case we should increase the limit within setup.
I have never seen this issue before.
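
One way to check this on each host could be to compare the number of descriptors the broker holds against its limit. What follows is a minimal sketch, assuming a Linux /proc filesystem. The function names are illustrative and not part of ovirt-hosted-engine-ha, and the PID of the broker has to be looked up separately:

```python
import os
import resource


def count_open_fds(pid):
    """Return how many file descriptors `pid` has open, or None if unreadable."""
    try:
        # Each entry in /proc/<pid>/fd is one open descriptor (Linux only).
        return len(os.listdir("/proc/%d/fd" % pid))
    except OSError:
        return None


def parse_max_open_files(limits_text):
    """Extract (soft, hard) 'Max open files' from the text of /proc/<pid>/limits."""
    for line in limits_text.splitlines():
        if line.startswith("Max open files"):
            parts = line.split()
            # Columns: Max open files <soft> <hard> <units>
            return parts[3], parts[4]
    return None


def read_proc_limits(pid):
    """Read the open-file limits of a running process from /proc."""
    with open("/proc/%d/limits" % pid) as f:
        return parse_max_open_files(f.read())


def own_nofile_limits():
    """Return (soft, hard) RLIMIT_NOFILE for the current process."""
    return resource.getrlimit(resource.RLIMIT_NOFILE)


if __name__ == "__main__":
    pid = os.getpid()  # replace with the PID of the process under suspicion
    print("open fds: %s" % count_open_fds(pid))
    print("limits from /proc: %s" % (read_proc_limits(pid),))
    print("own RLIMIT_NOFILE: %s" % (own_nofile_limits(),))
```

Calling count_open_fds() on the broker PID a few minutes apart would show whether the count keeps growing (which would suggest a descriptor leak) or is simply sitting close to a low limit.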
>
>
> --Jirka
>
> [1]
> http://www.cyberciti.biz/faq/linux-increase-the-maximum-number-of-open-files/
>
>> Thread-38160::INFO::2014-10-31 10:28:37,989::listener::184::ovirt_hosted_engine_ha.broker.listener.ConnectionHandler::(handle) Connection closed
>> Thread-38161::INFO::2014-10-31 10:28:53,656::listener::134::ovirt_hosted_engine_ha.broker.listener.ConnectionHandler::(setup) Connection established
>> Thread-38161::ERROR::2014-10-31 10:28:53,657::listener::190::ovirt_hosted_engine_ha.broker.listener.ConnectionHandler::(handle) Error handling request, data: 'get-stats storage_dir=/rhev/data-center/mnt/gluster1:_engine/6eb220be-daff-4785-8f78-111cc24139c4/ha_agent service_type=hosted-engine'
>> Traceback (most recent call last):
>> File "/usr/lib/python2.6/site-packages/ovirt_hosted_engine_ha/broker/listener.py", line 165, in handle
>> response = "success " + self._dispatch(data)
>> File "/usr/lib/python2.6/site-packages/ovirt_hosted_engine_ha/broker/listener.py", line 261, in _dispatch
>> .get_all_stats_for_service_type(**options)
>> File "/usr/lib/python2.6/site-packages/ovirt_hosted_engine_ha/broker/storage_broker.py", line 41, in get_all_stats_for_service_type
>> d = self.get_raw_stats_for_service_type(storage_dir, service_type)
>> File "/usr/lib/python2.6/site-packages/ovirt_hosted_engine_ha/broker/storage_broker.py", line 74, in get_raw_stats_for_service_type
>> f = os.open(path, direct_flag | os.O_RDONLY)
>> OSError: [Errno 24] Too many open files: '/rhev/data-center/mnt/gluster1:_engine/6eb220be-daff-4785-8f78-111cc24139c4/ha_agent/hosted-engine.metadata'
>> Thread-38161::INFO::2014-10-31 10:28:53,658::listener::184::ovirt_hosted_engine_ha.broker.listener.ConnectionHandler::(handle) Connection closed
>>
>> Thanks,
>> Jaicel
>>
>> ----- Original Message -----
>> From: "Niels de Vos" <ndevos at redhat.com>
>> To: "Vijay Bellur" <vbellur at redhat.com>
>> Cc: "Jiri Moskovcak" <jmoskovc at redhat.com>, "Jaicel R. Sabonsolin" <jaicel at asti.dost.gov.ph>, users at ovirt.org, "Gluster Devel" <gluster-devel at gluster.org>
>> Sent: Friday, October 31, 2014 4:11:25 AM
>> Subject: Re: [ovirt-users] Hosted-Engine HA problem
>>
>> On Thu, Oct 30, 2014 at 09:07:24PM +0530, Vijay Bellur wrote:
>>> On 10/30/2014 06:45 PM, Jiri Moskovcak wrote:
>>>> On 10/30/2014 09:22 AM, Jaicel R. Sabonsolin wrote:
>>>>> Hi Guys,
>>>>>
>>>>> I need help with my oVirt Hosted-Engine HA setup. I am running 2
>>>>> oVirt hosts and 2 Gluster nodes with replicated volumes. I already have
>>>>> VMs running on my hosts, and they migrate normally when I, for example,
>>>>> power off the host they are running on. The problem is that the
>>>>> engine can't migrate when I switch off the host that hosts the engine.
>>>>>
>>>>> oVirt 3.4.3-1.el6
>>>>> KVM 0.12.1.2 - 2.415.el6_5.10
>>>>> LIBVIRT libvirt-0.10.2-29.el6_5.9
>>>>> VDSM vdsm-4.14.17-0.el6
>>>>>
>>>>>
>>>>> Right now, I get this result from hosted-engine --vm-status:
>>>>>
>>>>> File "/usr/lib64/python2.6/runpy.py", line 122, in _run_module_as_main
>>>>> "__main__", fname, loader, pkg_name)
>>>>> File "/usr/lib64/python2.6/runpy.py", line 34, in _run_code
>>>>> exec code in run_globals
>>>>> File "/usr/lib/python2.6/site-packages/ovirt_hosted_engine_setup/vm_status.py", line 111, in <module>
>>>>> if not status_checker.print_status():
>>>>> File "/usr/lib/python2.6/site-packages/ovirt_hosted_engine_setup/vm_status.py", line 58, in print_status
>>>>> all_host_stats = ha_cli.get_all_host_stats()
>>>>> File "/usr/lib/python2.6/site-packages/ovirt_hosted_engine_ha/client/client.py", line 137, in get_all_host_stats
>>>>> return self.get_all_stats(self.StatModes.HOST)
>>>>> File "/usr/lib/python2.6/site-packages/ovirt_hosted_engine_ha/client/client.py", line 86, in get_all_stats
>>>>> constants.SERVICE_TYPE)
>>>>> File "/usr/lib/python2.6/site-packages/ovirt_hosted_engine_ha/lib/brokerlink.py", line 171, in get_stats_from_storage
>>>>> result = self._checked_communicate(request)
>>>>> File "/usr/lib/python2.6/site-packages/ovirt_hosted_engine_ha/lib/brokerlink.py", line 199, in _checked_communicate
>>>>> .format(message or response))
>>>>> ovirt_hosted_engine_ha.lib.exceptions.RequestError: Request failed: <type 'exceptions.OSError'>
>>>>>
>>>>>
>>>>> Restarting ha-broker and ha-agent normalizes the status, but eventually
>>>>> it becomes "false" and then returns to the result above. I hope you
>>>>> guys can help me with this.
>>>>>
>>>>
>>>> Hi Jaicel,
>>>> please attach agent.log and broker.log from the host where you are trying to
>>>> run hosted-engine --vm-status. I have a feeling that you ran into a
>>>> known problem on gluster: a stalled file descriptor. In that case the
>>>> only known solution at this time is to restart the broker & agent, as you
>>>> have already found out.
>>>>
>>>
>>> Adding Niels and gluster-devel to troubleshoot from Gluster NFS perspective.
>>
>> I'd welcome any details on this "stalled file descriptor" problem. Is
>> there a bug filed with some details like logs, sysrq-t and maybe even
>> tcpdumps? If there is an easy way to reproduce this behaviour, I can
>> surely look into it and hopefully come up with some advice or a fix.
>>
>> Thanks,
>> Niels
>>
--
Sandro Bonazzola
Better technology. Faster innovation. Powered by community collaboration.
See how it works at redhat.com