[ovirt-users] hosted engine health check issues

Wed Apr 23 09:10:26 UTC 2014

On 04/23/2014 11:08 AM, Martin Sivak wrote:
> Hi René,
>
>>>> libvirtError: Failed to acquire lock: No space left on device
>
>>>> 2014-04-22 12:38:17+0200 654 [3093]: r2 cmd_acquire 2,9,5733 invalid
>>>> lockspace found -1 failed 0 name 2851af27-8744-445d-9fb1-a0d083c8dc82
>
> Can you please check the contents of /rhev/data-center/<your nfs mount>/<nfs domain uuid>/ha_agent/?
>
> This is how it should look:
>
> [root@dev-03 ~]# ls -al /rhev/data-center/mnt/euryale\:_home_ovirt_he/e16de6a2-53f5-4ab3-95a3-255d08398824/ha_agent/
> total 2036
> drwxr-x---. 2 vdsm kvm    4096 Mar 19 18:46 .
> drwxr-xr-x. 6 vdsm kvm    4096 Mar 19 18:46 ..
> -rw-rw----. 1 vdsm kvm 1048576 Apr 23 11:05 hosted-engine.lockspace
> -rw-rw----. 1 vdsm kvm 1028096 Mar 19 18:46 hosted-engine.metadata
>
> The errors seem to indicate that you somehow lost the lockspace file.

True :)
Isn't this file created when the hosted engine is started? If not, how
can I create it manually?
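
To make the check Martin describes repeatable, here is a minimal Python sketch. It only reports which of the two files from his listing are absent or empty in a given ha_agent directory; it does not recreate anything. The function name and the "empty counts as missing" rule are assumptions made for illustration, not part of any oVirt tool.

```python
import os

# File names taken from the ha_agent directory listing quoted above.
EXPECTED_FILES = ("hosted-engine.lockspace", "hosted-engine.metadata")


def missing_ha_files(ha_agent_dir):
    """Return the expected files that are absent or empty in ha_agent_dir.

    Hypothetical helper: treating a zero-byte file as missing is an
    assumption, based on both files being about 1 MB in the listing.
    """
    missing = []
    for name in EXPECTED_FILES:
        path = os.path.join(ha_agent_dir, name)
        if not os.path.isfile(path) or os.path.getsize(path) == 0:
            missing.append(name)
    return missing
```

Calling missing_ha_files() on the ha_agent directory of the hosted engine storage domain should return an empty list on a healthy host, and a list containing hosted-engine.lockspace in the situation described here.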

>
> --
> Martin Sivák
> msivak at redhat.com
> Red Hat Czech
> RHEV-M SLA / Brno, CZ
>
> ----- Original Message -----
>> On 04/23/2014 12:28 AM, Doron Fediuck wrote:
>>> Hi Rene,
>>> any idea what closed your ovirtmgmt bridge?
>>> as long as it is down vdsm may have issues starting up properly
>>> and this is why you see the complaints on the rpc server.
>>>
>>> Can you try manually fixing the network part first and then
>>> restart vdsm?
>>> Once vdsm is happy hosted engine VM will start.
>>
>> Thanks for your feedback, Doron.
>>
>> My ovirtmgmt bridge seems to be up, doesn't it?
>> # brctl show ovirtmgmt
>> bridge name	bridge id		STP enabled	interfaces
>> ovirtmgmt		8000.0025907587c2	no		eth0.200
>>
>> # ip a s ovirtmgmt
>> 7: ovirtmgmt: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue
>> state UNKNOWN
>>       link/ether 00:25:90:75:87:c2 brd ff:ff:ff:ff:ff:ff
>>       inet 10.0.200.102/24 brd 10.0.200.255 scope global ovirtmgmt
>>       inet6 fe80::225:90ff:fe75:87c2/64 scope link
>>          valid_lft forever preferred_lft forever
>>
>> # ip a s eth0.200
>> 6: eth0.200@eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc
>> noqueue state UP
>>       link/ether 00:25:90:75:87:c2 brd ff:ff:ff:ff:ff:ff
>>       inet6 fe80::225:90ff:fe75:87c2/64 scope link
>>          valid_lft forever preferred_lft forever
>>
>> I tried the following yesterday:
>> Copy virtual disk from GlusterFS storage to local disk of host and
>> create a new vm with virt-manager which loads ovirtmgmt disk. I could
>> reach my engine over the ovirtmgmt bridge (so bridge must be working).
>>
>> I also started libvirtd with the -v option and saw the following in
>> libvirtd.log when trying to start ovirt engine:
>> 2014-04-22 14:18:25.432+0000: 8901: debug : virCommandRunAsync:2250 :
>> Command result 0, with PID 11491
>> 2014-04-22 14:18:25.478+0000: 8901: debug : virCommandRun:2045 : Result
>> exit status 255, stdout: '' stderr: 'iptables v1.4.7: goto 'FO-vnet0' is
>> not a chain
>>
>> So it could be that something is broken in my hosted-engine network. Do
>> you have any clue how I can troubleshoot this?
>>
>>
>> Thanks,
>> René
>>
>>
>>>
>>> ----- Original Message -----
>>>> From: "René Koch" <rkoch at linuxland.at>
>>>> To: "Martin Sivak" <msivak at redhat.com>
>>>> Cc: users at ovirt.org
>>>> Sent: Tuesday, April 22, 2014 1:46:38 PM
>>>> Subject: Re: [ovirt-users] hosted engine health check issues
>>>>
>>>> Hi,
>>>>
>>>> I rebooted one of my ovirt hosts today and the result is now that I
>>>> can't start hosted-engine anymore.
>>>>
>>>> ovirt-ha-agent isn't running because the lockspace file is missing
>>>> (sanlock complains about it).
>>>> So I tried to start hosted-engine with --vm-start and I get the
>>>> following errors:
>>>>
>>>> ==> /var/log/sanlock.log <==
>>>> 2014-04-22 12:38:17+0200 654 [3093]: r2 cmd_acquire 2,9,5733 invalid
>>>> lockspace found -1 failed 0 name 2851af27-8744-445d-9fb1-a0d083c8dc82
>>>>
>>>> ==> /var/log/messages <==
>>>> Apr 22 12:38:17 ovirt-host02 sanlock[3079]: 2014-04-22 12:38:17+0200 654
>>>> [3093]: r2 cmd_acquire 2,9,5733 invalid lockspace found -1 failed 0 name
>>>> 2851af27-8744-445d-9fb1-a0d083c8dc82
>>>> Apr 22 12:38:17 ovirt-host02 kernel: ovirtmgmt: port 2(vnet0) entering
>>>> disabled state
>>>> Apr 22 12:38:17 ovirt-host02 kernel: device vnet0 left promiscuous mode
>>>> Apr 22 12:38:17 ovirt-host02 kernel: ovirtmgmt: port 2(vnet0) entering
>>>> disabled state
>>>>
>>>> ==> /var/log/vdsm/vdsm.log <==
>>>> Thread-21::DEBUG::2014-04-22
>>>> 12:38:17,563::libvirtconnection::124::root::(wrapper) Unknown
>>>> libvirterror: ecode: 38 edom: 42 level: 2 message: Failed to acquire
>>>> lock: No space left on device
>>>> Thread-21::DEBUG::2014-04-22
>>>> 12:38:17,563::vm::2263::vm.Vm::(_startUnderlyingVm)
>>>> vmId=`f26dd37e-13b5-430c-b2f2-ecd098b82a91`::_ongoingCreations released
>>>> Thread-21::ERROR::2014-04-22
>>>> 12:38:17,564::vm::2289::vm.Vm::(_startUnderlyingVm)
>>>> vmId=`f26dd37e-13b5-430c-b2f2-ecd098b82a91`::The vm start process failed
>>>> Traceback (most recent call last):
>>>>      File "/usr/share/vdsm/vm.py", line 2249, in _startUnderlyingVm
>>>>        self._run()
>>>>      File "/usr/share/vdsm/vm.py", line 3170, in _run
>>>>        self._connection.createXML(domxml, flags),
>>>>      File "/usr/lib64/python2.6/site-packages/vdsm/libvirtconnection.py",
>>>> line 92, in wrapper
>>>>        ret = f(*args, **kwargs)
>>>>      File "/usr/lib64/python2.6/site-packages/libvirt.py", line 2665, in
>>>> createXML
>>>>        if ret is None:raise libvirtError('virDomainCreateXML() failed',
>>>> conn=self)
>>>> libvirtError: Failed to acquire lock: No space left on device
>>>>
>>>> ==> /var/log/messages <==
>>>> Apr 22 12:38:17 ovirt-host02 vdsm vm.Vm ERROR
>>>> vmId=`f26dd37e-13b5-430c-b2f2-ecd098b82a91`::The vm start process
>>>> failed#012Traceback (most recent call last):#012  File
>>>> "/usr/share/vdsm/vm.py", line 2249, in _startUnderlyingVm#012
>>>> self._run()#012  File "/usr/share/vdsm/vm.py", line 3170, in _run#012
>>>>     self._connection.createXML(domxml, flags),#012  File
>>>> "/usr/lib64/python2.6/site-packages/vdsm/libvirtconnection.py", line 92,
>>>> in wrapper#012    ret = f(*args, **kwargs)#012  File
>>>> "/usr/lib64/python2.6/site-packages/libvirt.py", line 2665, in
>>>> createXML#012    if ret is None:raise libvirtError('virDomainCreateXML()
>>>> failed', conn=self)#012libvirtError: Failed to acquire lock: No space
>>>> left on device
>>>>
>>>> ==> /var/log/vdsm/vdsm.log <==
>>>> Thread-21::DEBUG::2014-04-22
>>>> 12:38:17,569::vm::2731::vm.Vm::(setDownStatus)
>>>> vmId=`f26dd37e-13b5-430c-b2f2-ecd098b82a91`::Changed state to Down:
>>>> Failed to acquire lock: No space left on device
>>>>
>>>>
>>>> "No space left on device" is nonsense, as there is enough space (I had
>>>> this issue last time as well, when I had to patch machine.py, but this
>>>> file is now Python 2.6.6 compatible).
>>>>
>>>> Any idea what prevents hosted-engine from starting?
>>>> ovirt-ha-broker, vdsmd and sanlock are running btw.
>>>>
>>>> Btw, I can see in the log that the json rpc server module is missing -
>>>> which package is required for CentOS 6.5?
>>>> Apr 22 12:37:14 ovirt-host02 vdsm vds WARNING Unable to load the json
>>>> rpc server module. Please make sure it is installed.
>>>>
>>>>
>>>> Thanks,
>>>> René
>>>>
>>>>
>>>>
>>>> On 04/17/2014 10:02 AM, Martin Sivak wrote:
>>>>> Hi,
>>>>>
>>>>>>>> How can I disable notifications?
>>>>>
>>>>> The notification is configured in the notification section of
>>>>> /etc/ovirt-hosted-engine-ha/broker.conf.
>>>>> The email is sent when the key state_transition exists and the string
>>>>> OldState-NewState contains the (case insensitive) regexp from the value.
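
The matching rule described above can be sketched in a few lines of Python. This is only an illustration of the rule as stated (a case-insensitive regexp searched inside the string OldState-NewState); the function name is hypothetical and this is not the broker's actual code.

```python
import re


def should_notify(old_state, new_state, pattern):
    """Return True if pattern matches anywhere in 'OldState-NewState'.

    Hypothetical helper: pattern stands for the value of the
    state_transition key, matched case-insensitively as described above.
    """
    transition = "%s-%s" % (old_state, new_state)
    return re.search(pattern, transition, re.IGNORECASE) is not None
```

With the transitions reported in this thread, should_notify("EngineDown", "EngineStart", "enginestart") is true, while a value that matches none of the state names produces no mail for that transition.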
>>>>>
>>>>>>>> Is it intended to send out these messages and detect that ovirt engine
>>>>>>>> is down (which is false anyway), but not to restart the vm?
>>>>>
>>>>> Forget about emails for now and check the
>>>>> /var/log/ovirt-hosted-engine-ha/agent.log and broker.log (and attach them
>>>>> as well btw).
>>>>>
>>>>>>>> oVirt hosts think that hosted engine is down because it seems that
>>>>>>>> hosts can't write to hosted-engine.lockspace due to glusterfs issues
>>>>>>>> (or at least I think so).
>>>>>
>>>>> Do the hosts just think so, or can they really not write there? The
>>>>> lockspace is managed by sanlock, and our HA daemons do not touch it at
>>>>> all. We only ask sanlock to make sure we have a unique server ID.
>>>>>
>>>>>>>> Is it possible or planned to make the whole ha feature optional?
>>>>>
>>>>> Well, the system won't perform any automatic actions if you put the hosted
>>>>> engine into global maintenance and only start/stop/migrate the VM manually.
>>>>> I would discourage you from stopping agent/broker, because the engine
>>>>> itself has some logic based on the reporting.
>>>>>
>>>>> Regards
>>>>>
>>>>> --
>>>>> Martin Sivák
>>>>> msivak at redhat.com
>>>>> Red Hat Czech
>>>>> RHEV-M SLA / Brno, CZ
>>>>>
>>>>> ----- Original Message -----
>>>>>> On 04/15/2014 04:53 PM, Jiri Moskovcak wrote:
>>>>>>> On 04/14/2014 10:50 AM, René Koch wrote:
>>>>>>>> Hi,
>>>>>>>>
>>>>>>>> I have some issues with hosted engine status.
>>>>>>>>
>>>>>>>> oVirt hosts think that hosted engine is down because it seems that
>>>>>>>> hosts can't write to hosted-engine.lockspace due to glusterfs issues
>>>>>>>> (or at least I think so).
>>>>>>>>
>>>>>>>> Here's the output of vm-status:
>>>>>>>>
>>>>>>>> # hosted-engine --vm-status
>>>>>>>>
>>>>>>>>
>>>>>>>> --== Host 1 status ==--
>>>>>>>>
>>>>>>>> Status up-to-date                  : False
>>>>>>>> Hostname                           : 10.0.200.102
>>>>>>>> Host ID                            : 1
>>>>>>>> Engine status                      : unknown stale-data
>>>>>>>> Score                              : 2400
>>>>>>>> Local maintenance                  : False
>>>>>>>> Host timestamp                     : 1397035677
>>>>>>>> Extra metadata (valid at timestamp):
>>>>>>>>         metadata_parse_version=1
>>>>>>>>         metadata_feature_version=1
>>>>>>>>         timestamp=1397035677 (Wed Apr  9 11:27:57 2014)
>>>>>>>>         host-id=1
>>>>>>>>         score=2400
>>>>>>>>         maintenance=False
>>>>>>>>         state=EngineUp
>>>>>>>>
>>>>>>>>
>>>>>>>> --== Host 2 status ==--
>>>>>>>>
>>>>>>>> Status up-to-date                  : True
>>>>>>>> Hostname                           : 10.0.200.101
>>>>>>>> Host ID                            : 2
>>>>>>>> Engine status                      : {'reason': 'vm not running on this host', 'health': 'bad', 'vm': 'down', 'detail': 'unknown'}
>>>>>>>> Score                              : 0
>>>>>>>> Local maintenance                  : False
>>>>>>>> Host timestamp                     : 1397464031
>>>>>>>> Extra metadata (valid at timestamp):
>>>>>>>>         metadata_parse_version=1
>>>>>>>>         metadata_feature_version=1
>>>>>>>>         timestamp=1397464031 (Mon Apr 14 10:27:11 2014)
>>>>>>>>         host-id=2
>>>>>>>>         score=0
>>>>>>>>         maintenance=False
>>>>>>>>         state=EngineUnexpectedlyDown
>>>>>>>>         timeout=Mon Apr 14 10:35:05 2014
>>>>>>>>
>>>>>>>> oVirt engine is sending me 2 emails every 10 minutes with the
>>>>>>>> following subjects:
>>>>>>>> - ovirt-hosted-engine state transition EngineDown-EngineStart
>>>>>>>> - ovirt-hosted-engine state transition EngineStart-EngineUp
>>>>>>>>
>>>>>>>> In oVirt webadmin I can see the following message:
>>>>>>>> VM HostedEngine is down. Exit message: internal error Failed to
>>>>>>>> acquire lock: error -243.
>>>>>>>>
>>>>>>>> These messages are really annoying as oVirt isn't doing anything with
>>>>>>>> hosted engine - I have an uptime of 9 days in my engine vm.
>>>>>>>>
>>>>>>>> So my questions are now:
>>>>>>>> Is it intended to send out these messages and detect that ovirt engine
>>>>>>>> is down (which is false anyway), but not to restart the vm?
>>>>>>>>
>>>>>>>> How can I disable notifications? I'm planning to write a Nagios plugin
>>>>>>>> which parses the output of hosted-engine --vm-status and only Nagios
>>>>>>>> should notify me, not hosted-engine script.
>>>>>>>>
>>>>>>>> Is it possible or planned to make the whole ha feature optional? I
>>>>>>>> really really really hate cluster software as it causes more trouble
>>>>>>>> than standalone machines, and in my case the hosted-engine ha feature
>>>>>>>> really causes trouble (and I haven't had a hardware or network outage
>>>>>>>> yet, only issues with the hosted-engine ha agent). I don't need any ha
>>>>>>>> feature for hosted engine. I just want to run engine virtualized on
>>>>>>>> oVirt and if engine vm fails (e.g. because of issues with a host) I'll
>>>>>>>> restart it on another node.
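
As a starting point for the Nagios plugin mentioned above, the following Python sketch parses text in the format of the hosted-engine --vm-status output pasted in this message. It assumes each "Key : Value" pair is on a single line (the mail wrapped some of them), and the health policy (a host counts as healthy when its status is up to date and its score is above 0) is purely an assumption for illustration. The function names are hypothetical; the return values follow the usual Nagios plugin exit codes.

```python
import re

# Matches block headers such as "--== Host 1 status ==--".
HOST_HEADER = re.compile(r"^--== Host (\d+) status ==--$")

OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3  # standard Nagios exit codes


def parse_vm_status(text):
    """Return {host_id: {key: value}} built from the 'Key : Value' lines."""
    hosts = {}
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        match = HOST_HEADER.match(line)
        if match:
            current = hosts.setdefault(int(match.group(1)), {})
            continue
        if current is None or " : " not in line:
            continue  # skips blank lines and the key=value metadata lines
        key, value = line.split(" : ", 1)
        current[key.strip()] = value.strip()
    return hosts


def nagios_code(hosts):
    """Map parsed hosts to an exit code (assumed policy, see lead-in)."""
    if not hosts:
        return UNKNOWN
    healthy = 0
    for info in hosts.values():
        score = info.get("Score", "0")
        if info.get("Status up-to-date") == "True" and score.isdigit() and int(score) > 0:
            healthy += 1
    if healthy == len(hosts):
        return OK
    return WARNING if healthy else CRITICAL
```

Fed with the status output above, this policy would report CRITICAL, since host 1 has stale data and host 2 has a score of 0. A real plugin would need its own thresholds and would have to run the command and exit with the returned code.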
>>>>>>>
>>>>>>> Hi, you can:
>>>>>>> 1. edit /etc/ovirt-hosted-engine-ha/{agent,broker}-log.conf and tweak
>>>>>>> the logger as you like
>>>>>>> 2. or kill ovirt-ha-broker & ovirt-ha-agent services
>>>>>>
>>>>>> Thanks for the information.
>>>>>> So engine is able to run when ovirt-ha-broker and ovirt-ha-agent aren't
>>>>>> running?
>>>>>>
>>>>>>
>>>>>> Regards,
>>>>>> René
>>>>>>
>>>>>>>
>>>>>>> --Jirka
>>>>>>>>
>>>>>>>> Thanks,
>>>>>>>> René
>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>> _______________________________________________
>>>>>> Users mailing list
>>>>>> Users at ovirt.org
>>>>>> http://lists.ovirt.org/mailman/listinfo/users
>>>>>>
>>>>
>>