Re: [ovirt-users] hosted engine health check issues

Hi,
> Isn't this file created when hosted engine is started?
The file is created by the setup script. If it got lost, something bad probably happened in your NFS or Gluster storage.
> Or how can I create this file manually?
I can give you an experimental treatment for this. We do not have any official way, as this is something that should never happen :)

!! Before you do that, make sure you do not have any nodes running properly. This will destroy and reinitialize the lockspace database for the whole hosted-engine environment (which you apparently lack, but..). !!

You have to create the ha_agent/hosted-engine.lockspace file with the expected size (1 MB) and then tell sanlock to initialize it as a lockspace using:

# python
import sanlock
sanlock.write_lockspace(lockspace="hosted-engine",
                        path="/rhev/data-center/mnt/<nfs>/<hosted engine storage domain>/ha_agent/hosted-engine.lockspace",
                        offset=0)
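
If it helps, here is the whole procedure as a single hedged sketch, including the file-creation step. Assumptions on my side: a zero-filled 1 MB file is acceptable before sanlock formats it, and vdsm:kvm ownership with mode 0660 (as in the ls -al listing further down) is what the HA services expect; the placeholder path is the same as above and must be adjusted.

import grp
import os
import pwd

import sanlock

# adjust the placeholders to your own mount point / storage domain UUID
path = ("/rhev/data-center/mnt/<nfs>/<hosted engine storage domain>"
        "/ha_agent/hosted-engine.lockspace")

# create an empty file of the expected 1 MB size
# (assumption: a zero-filled file is enough, sanlock formats it below)
with open(path, "wb") as f:
    f.truncate(1024 * 1024)

# match the ownership/permissions shown in the ha_agent listing (vdsm:kvm, 0660)
os.chown(path, pwd.getpwnam("vdsm").pw_uid, grp.getgrnam("kvm").gr_gid)
os.chmod(path, 0o660)

# initialize it as the "hosted-engine" lockspace, exactly as in the snippet above
sanlock.write_lockspace(lockspace="hosted-engine", path=path, offset=0)
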
Then try starting the services (both broker and agent) again.

--
Martin Sivák
msivak@redhat.com
Red Hat Czech
RHEV-M SLA / Brno, CZ

----- Original Message -----
On 04/23/2014 11:08 AM, Martin Sivak wrote:
Hi René,
> libvirtError: Failed to acquire lock: No space left on device
> 2014-04-22 12:38:17+0200 654 [3093]: r2 cmd_acquire 2,9,5733 invalid lockspace found -1 failed 0 name 2851af27-8744-445d-9fb1-a0d083c8dc82
Can you please check the contents of /rhev/data-center/<your nfs mount>/<nfs domain uuid>/ha_agent/?
This is what it should look like:
[root@dev-03 ~]# ls -al /rhev/data-center/mnt/euryale\:_home_ovirt_he/e16de6a2-53f5-4ab3-95a3-255d08398824/ha_agent/
total 2036
drwxr-x---. 2 vdsm kvm    4096 Mar 19 18:46 .
drwxr-xr-x. 6 vdsm kvm    4096 Mar 19 18:46 ..
-rw-rw----. 1 vdsm kvm 1048576 Apr 23 11:05 hosted-engine.lockspace
-rw-rw----. 1 vdsm kvm 1028096 Mar 19 18:46 hosted-engine.metadata
The errors seem to indicate that you somehow lost the lockspace file.
True :) Isn't this file created when hosted engine is started? Or how can I create this file manually?
-- Martin Sivák msivak@redhat.com Red Hat Czech RHEV-M SLA / Brno, CZ
----- Original Message -----
On 04/23/2014 12:28 AM, Doron Fediuck wrote:
Hi Rene,
any idea what closed your ovirtmgmt bridge? As long as it is down, vdsm may have issues starting up properly, and this is why you see the complaints about the rpc server.

Can you try manually fixing the network part first and then restarting vdsm? Once vdsm is happy, the hosted engine VM will start.
Thanks for your feedback, Doron.
My ovirtmgmt bridge seems to be up, or isn't it?

# brctl show ovirtmgmt
bridge name     bridge id               STP enabled     interfaces
ovirtmgmt       8000.0025907587c2       no              eth0.200
# ip a s ovirtmgmt
7: ovirtmgmt: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UNKNOWN
    link/ether 00:25:90:75:87:c2 brd ff:ff:ff:ff:ff:ff
    inet 10.0.200.102/24 brd 10.0.200.255 scope global ovirtmgmt
    inet6 fe80::225:90ff:fe75:87c2/64 scope link
       valid_lft forever preferred_lft forever

# ip a s eth0.200
6: eth0.200@eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP
    link/ether 00:25:90:75:87:c2 brd ff:ff:ff:ff:ff:ff
    inet6 fe80::225:90ff:fe75:87c2/64 scope link
       valid_lft forever preferred_lft forever
I tried the following yesterday: copy the virtual disk from GlusterFS storage to the local disk of a host and create a new vm with virt-manager that uses this disk and the ovirtmgmt bridge. I could reach my engine over the ovirtmgmt bridge (so the bridge must be working).
I also started libvirtd with option -v and I saw the following in libvirtd.log when trying to start the ovirt engine:

2014-04-22 14:18:25.432+0000: 8901: debug : virCommandRunAsync:2250 : Command result 0, with PID 11491
2014-04-22 14:18:25.478+0000: 8901: debug : virCommandRun:2045 : Result exit status 255, stdout: '' stderr: 'iptables v1.4.7: goto 'FO-vnet0' is not a chain
So it could be that something is broken in my hosted-engine network. Do you have any clue how I can troubleshoot this?
Thanks, René
----- Original Message -----
From: "René Koch" <rkoch@linuxland.at> To: "Martin Sivak" <msivak@redhat.com> Cc: users@ovirt.org Sent: Tuesday, April 22, 2014 1:46:38 PM Subject: Re: [ovirt-users] hosted engine health check issues
Hi,
I rebooted one of my ovirt hosts today and the result is now that I can't start hosted-engine anymore.
ovirt-ha-agent isn't running because the lockspace file is missing (sanlock complains about it). So I tried to start hosted-engine with --vm-start and I get the following errors:
==> /var/log/sanlock.log <==
2014-04-22 12:38:17+0200 654 [3093]: r2 cmd_acquire 2,9,5733 invalid lockspace found -1 failed 0 name 2851af27-8744-445d-9fb1-a0d083c8dc82
==> /var/log/messages <==
Apr 22 12:38:17 ovirt-host02 sanlock[3079]: 2014-04-22 12:38:17+0200 654 [3093]: r2 cmd_acquire 2,9,5733 invalid lockspace found -1 failed 0 name 2851af27-8744-445d-9fb1-a0d083c8dc82
Apr 22 12:38:17 ovirt-host02 kernel: ovirtmgmt: port 2(vnet0) entering disabled state
Apr 22 12:38:17 ovirt-host02 kernel: device vnet0 left promiscuous mode
Apr 22 12:38:17 ovirt-host02 kernel: ovirtmgmt: port 2(vnet0) entering disabled state
==> /var/log/vdsm/vdsm.log <==
Thread-21::DEBUG::2014-04-22 12:38:17,563::libvirtconnection::124::root::(wrapper) Unknown libvirterror: ecode: 38 edom: 42 level: 2 message: Failed to acquire lock: No space left on device
Thread-21::DEBUG::2014-04-22 12:38:17,563::vm::2263::vm.Vm::(_startUnderlyingVm) vmId=`f26dd37e-13b5-430c-b2f2-ecd098b82a91`::_ongoingCreations released
Thread-21::ERROR::2014-04-22 12:38:17,564::vm::2289::vm.Vm::(_startUnderlyingVm) vmId=`f26dd37e-13b5-430c-b2f2-ecd098b82a91`::The vm start process failed
Traceback (most recent call last):
  File "/usr/share/vdsm/vm.py", line 2249, in _startUnderlyingVm
    self._run()
  File "/usr/share/vdsm/vm.py", line 3170, in _run
    self._connection.createXML(domxml, flags),
  File "/usr/lib64/python2.6/site-packages/vdsm/libvirtconnection.py", line 92, in wrapper
    ret = f(*args, **kwargs)
  File "/usr/lib64/python2.6/site-packages/libvirt.py", line 2665, in createXML
    if ret is None:raise libvirtError('virDomainCreateXML() failed', conn=self)
libvirtError: Failed to acquire lock: No space left on device
==> /var/log/messages <==
Apr 22 12:38:17 ovirt-host02 vdsm vm.Vm ERROR vmId=`f26dd37e-13b5-430c-b2f2-ecd098b82a91`::The vm start process failed#012Traceback (most recent call last):#012 File "/usr/share/vdsm/vm.py", line 2249, in _startUnderlyingVm#012 self._run()#012 File "/usr/share/vdsm/vm.py", line 3170, in _run#012 self._connection.createXML(domxml, flags),#012 File "/usr/lib64/python2.6/site-packages/vdsm/libvirtconnection.py", line 92, in wrapper#012 ret = f(*args, **kwargs)#012 File "/usr/lib64/python2.6/site-packages/libvirt.py", line 2665, in createXML#012 if ret is None:raise libvirtError('virDomainCreateXML() failed', conn=self)#012libvirtError: Failed to acquire lock: No space left on device
==> /var/log/vdsm/vdsm.log <==
Thread-21::DEBUG::2014-04-22 12:38:17,569::vm::2731::vm.Vm::(setDownStatus) vmId=`f26dd37e-13b5-430c-b2f2-ecd098b82a91`::Changed state to Down: Failed to acquire lock: No space left on device
No space left on device is nonsense, as there is enough space (I had this issue last time as well, where I had to patch machine.py, but this file is now Python 2.6.6 compatible).
Any idea what prevents hosted-engine from starting? ovirt-ha-broker, vdsmd and sanlock are running btw.
Btw, I can see in the log that the json rpc server module is missing - which package is required for CentOS 6.5?

Apr 22 12:37:14 ovirt-host02 vdsm vds WARNING Unable to load the json rpc server module. Please make sure it is installed.
Thanks, René
On 04/17/2014 10:02 AM, Martin Sivak wrote:
Hi,
>>> How can I disable notifications?
Notifications are configured in /etc/ovirt-hosted-engine-ha/broker.conf, in the notification section. The email is sent when the key state_transition exists and the (case-insensitive) regexp given as its value matches the string OldState-NewState.
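
Purely as an illustration (the section and key names are the ones described above; the exact layout of broker.conf on your version may differ, so treat this as a sketch rather than the authoritative format), the relevant part could look roughly like:

[notification]
# a mail goes out whenever this case-insensitive regexp matches the
# "OldState-NewState" string, e.g. "EngineDown-EngineStart"; a pattern
# that never matches effectively silences the transition mails
state_transition = EngineDown-EngineStart|EngineStart-EngineUp
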
>>> Is it intended to send out these messages and detect that ovirt
>>> engine is down (which is false anyway), but not to restart the vm?
Forget about emails for now and check the /var/log/ovirt-hosted-engine-ha/agent.log and broker.log (and attach them as well btw).
>>> oVirt hosts think that hosted engine is down because it seems that
>>> hosts can't write to hosted-engine.lockspace due to glusterfs issues
>>> (or at least I think so).
The hosts think so or can't really write there? The lockspace is managed by sanlock and our HA daemons do not touch it at all. We only ask sanlock to make sure we have a unique server id.
>>> Is it possible or planned to make the whole ha feature optional?
Well, the system won't perform any automatic actions if you put the hosted engine into global maintenance and only start/stop/migrate the VM manually. I would discourage you from stopping the agent/broker, because the engine itself has some logic based on the reporting.
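
For completeness, global maintenance is normally toggled with the hosted-engine tool; the option names below are to the best of my knowledge and worth double-checking against hosted-engine --help on your installation:

# hosted-engine --set-maintenance --mode=global
# hosted-engine --set-maintenance --mode=none

The first disables all automatic HA actions on the engine VM, the second switches the agents back to normal operation.
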
Regards
-- Martin Sivák msivak@redhat.com Red Hat Czech RHEV-M SLA / Brno, CZ
----- Original Message -----
> On 04/15/2014 04:53 PM, Jiri Moskovcak wrote:
>> On 04/14/2014 10:50 AM, René Koch wrote:
>>> Hi,
>>>
>>> I have some issues with hosted engine status.
>>>
>>> oVirt hosts think that hosted engine is down because it seems that
>>> hosts can't write to hosted-engine.lockspace due to glusterfs issues
>>> (or at least I think so).
>>>
>>> Here's the output of vm-status:
>>>
>>> # hosted-engine --vm-status
>>>
>>>
>>> --== Host 1 status ==--
>>>
>>> Status up-to-date                  : False
>>> Hostname                           : 10.0.200.102
>>> Host ID                            : 1
>>> Engine status                      : unknown stale-data
>>> Score                              : 2400
>>> Local maintenance                  : False
>>> Host timestamp                     : 1397035677
>>> Extra metadata (valid at timestamp):
>>>     metadata_parse_version=1
>>>     metadata_feature_version=1
>>>     timestamp=1397035677 (Wed Apr  9 11:27:57 2014)
>>>     host-id=1
>>>     score=2400
>>>     maintenance=False
>>>     state=EngineUp
>>>
>>>
>>> --== Host 2 status ==--
>>>
>>> Status up-to-date                  : True
>>> Hostname                           : 10.0.200.101
>>> Host ID                            : 2
>>> Engine status                      : {'reason': 'vm not running on this
>>> host', 'health': 'bad', 'vm': 'down', 'detail': 'unknown'}
>>> Score                              : 0
>>> Local maintenance                  : False
>>> Host timestamp                     : 1397464031
>>> Extra metadata (valid at timestamp):
>>>     metadata_parse_version=1
>>>     metadata_feature_version=1
>>>     timestamp=1397464031 (Mon Apr 14 10:27:11 2014)
>>>     host-id=2
>>>     score=0
>>>     maintenance=False
>>>     state=EngineUnexpectedlyDown
>>>     timeout=Mon Apr 14 10:35:05 2014
>>>
>>> oVirt engine is sending me 2 emails every 10 minutes with the
>>> following subjects:
>>> - ovirt-hosted-engine state transition EngineDown-EngineStart
>>> - ovirt-hosted-engine state transition EngineStart-EngineUp
>>>
>>> In oVirt webadmin I can see the following message:
>>> VM HostedEngine is down. Exit message: internal error Failed to
>>> acquire lock: error -243.
>>>
>>> These messages are really annoying as oVirt isn't doing anything
>>> with hosted engine - I have an uptime of 9 days in my engine vm.
>>>
>>> So my questions are now:
>>> Is it intended to send out these messages and detect that ovirt
>>> engine is down (which is false anyway), but not to restart the vm?
>>>
>>> How can I disable notifications? I'm planning to write a Nagios
>>> plugin which parses the output of hosted-engine --vm-status and only
>>> Nagios should notify me, not the hosted-engine script.
>>>
>>> Is it possible or planned to make the whole ha feature optional? I
>>> really really really hate cluster software as it causes more
>>> trouble than standalone machines, and in my case the hosted-engine
>>> ha feature really causes trouble (I haven't had a hardware or
>>> network outage yet, only issues with the hosted-engine ha agent). I
>>> don't need any ha feature for the hosted engine. I just want to run
>>> the engine virtualized on oVirt, and if the engine vm fails (e.g.
>>> because of issues with a host) I'll restart it on another node.
>>
>> Hi, you can:
>> 1. edit /etc/ovirt-hosted-engine-ha/{agent,broker}-log.conf and tweak
>> the logger as you like
>> 2. or kill the ovirt-ha-broker & ovirt-ha-agent services
>
> Thanks for the information.
> So the engine is able to run when ovirt-ha-broker and ovirt-ha-agent
> aren't running?
>
> Regards,
> René
>
>> --Jirka
>>>
>>> Thanks,
>>> René
>>>
> _______________________________________________
> Users mailing list
> Users@ovirt.org
> http://lists.ovirt.org/mailman/listinfo/users
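
As an aside on the Nagios idea René mentions above: a minimal sketch of such a check in Python, under the assumption that the plain-text --vm-status output looks like the quote above (the marker strings and the hosted-engine invocation are taken from this thread; adjust both to your version):

import subprocess
import sys

# run the status tool and capture its plain-text output
try:
    proc = subprocess.Popen(["hosted-engine", "--vm-status"],
                            stdout=subprocess.PIPE)
    out = proc.communicate()[0]
except OSError:
    print "UNKNOWN - hosted-engine tool not found"
    sys.exit(3)

# strings that indicate trouble in the output quoted above
# (assumption: the format stays the same on your version)
bad_markers = ["'health': 'bad'", "unknown stale-data", "EngineUnexpectedlyDown"]
problems = [m for m in bad_markers if m in out]

if problems:
    print "CRITICAL - hosted engine reports: %s" % ", ".join(problems)
    sys.exit(2)
print "OK - no bad markers in hosted-engine --vm-status output"
sys.exit(0)
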

Same problem. ovirt-ha-broker has 400% CPU and is defunct. I can't kill it with -9.

Hi Kevin,
> Same problem.
Are you missing the lockspace file as well while running on top of GlusterFS?
> ovirt-ha-broker has 400% CPU and is defunct. I can't kill it with -9.
A defunct process eating four full cores? I wonder how that is possible. What are the status flags of that process when you do ps axwu?

Can you attach the log files please?

--
Martin Sivák
msivak@redhat.com
Red Hat Czech
RHEV-M SLA / Brno, CZ

top

 1729 vdsm      20   0    0    0    0 Z 373.8  0.0 252:08.51 ovirt-ha-broker <defunct>

[root@host01 ~]# ps axwu | grep 1729
vdsm      1729  0.7  0.0      0     0 ?        Zl   Apr02 240:24 [ovirt-ha-broker] <defunct>

[root@host01 ~]# ll /rhev/data-center/mnt/host01.ovirt.lan\:_home_NFS01/aea040f8-ab9d-435b-9ecf-ddd4272e592f/ha_agent/
total 2028
-rw-rw----. 1 vdsm kvm 1048576 23 avril 17:35 hosted-engine.lockspace
-rw-rw----. 1 vdsm kvm 1028096 23 avril 17:35 hosted-engine.metadata

cat /var/log/vdsm/vdsm.log

Thread-120518::DEBUG::2014-04-23 17:38:02,299::task::1185::TaskManager.Task::(prepare) Task=`f13e71f1-ac7c-49ab-8079-8f099ebf72b6`::finished: {'aea040f8-ab9d-435b-9ecf-ddd4272e592f': {'code': 0, 'version': 3, 'acquired': True, 'delay': '0.000410963', 'lastCheck': '3.4', 'valid': True}, '5ae613a4-44e4-42cb-89fc-7b5d34c1f30f': {'code': 0, 'version': 3, 'acquired': True, 'delay': '0.000412357', 'lastCheck': '6.8', 'valid': True}, 'cc51143e-8ad7-4b0b-a4d2-9024dffc1188': {'code': 0, 'version': 0, 'acquired': True, 'delay': '0.000455292', 'lastCheck': '1.2', 'valid': True}, 'ff98d346-4515-4349-8437-fb2f5e9eaadf': {'code': 0, 'version': 0, 'acquired': True, 'delay': '0.00817113', 'lastCheck': '1.7', 'valid': True}}
Thread-120518::DEBUG::2014-04-23 17:38:02,300::task::595::TaskManager.Task::(_updateState) Task=`f13e71f1-ac7c-49ab-8079-8f099ebf72b6`::moving from state preparing -> state finished
Thread-120518::DEBUG::2014-04-23 17:38:02,300::resourceManager::940::ResourceManager.Owner::(releaseAll) Owner.releaseAll requests {} resources {}
Thread-120518::DEBUG::2014-04-23 17:38:02,300::resourceManager::977::ResourceManager.Owner::(cancelAll) Owner.cancelAll requests {}
Thread-120518::DEBUG::2014-04-23 17:38:02,300::task::990::TaskManager.Task::(_decref) Task=`f13e71f1-ac7c-49ab-8079-8f099ebf72b6`::ref 0 aborting False
Thread-120518::ERROR::2014-04-23 17:38:02,302::brokerlink::72::ovirt_hosted_engine_ha.lib.brokerlink.BrokerLink::(connect) Failed to connect to broker: [Errno 2] No such file or directory
Thread-120518::ERROR::2014-04-23 17:38:02,302::API::1612::vds::(_getHaInfo) failed to retrieve Hosted Engine HA info
Traceback (most recent call last):
  File "/usr/share/vdsm/API.py", line 1603, in _getHaInfo
    stats = instance.get_all_stats()
  File "/usr/lib/python2.6/site-packages/ovirt_hosted_engine_ha/client/client.py", line 83, in get_all_stats
    with broker.connection():
  File "/usr/lib64/python2.6/contextlib.py", line 16, in __enter__
    return self.gen.next()
  File "/usr/lib/python2.6/site-packages/ovirt_hosted_engine_ha/lib/brokerlink.py", line 96, in connection
    self.connect()
  File "/usr/lib/python2.6/site-packages/ovirt_hosted_engine_ha/lib/brokerlink.py", line 64, in connect
    self._socket.connect(constants.BROKER_SOCKET_FILE)
  File "<string>", line 1, in connect
error: [Errno 2] No such file or directory
Thread-78::DEBUG::2014-04-23 17:38:05,490::fileSD::225::Storage.Misc.excCmd::(getReadDelay) '/bin/dd iflag=direct if=/rhev/data-center/mnt/host01.ovirt.lan:_home_DATA/5ae613a4-44e4-42cb-89fc-7b5d34c1f30f/dom_md/metadata bs=4096 count=1' (cwd None)
Thread-78::DEBUG::2014-04-23 17:38:05,523::fileSD::225::Storage.Misc.excCmd::(getReadDelay) SUCCESS: <err> = '0+1 records in\n0+1 records out\n545 bytes (545 B) copied, 0.000412209 s, 1.3 MB/s\n'; <rc> = 0
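
A side note on the <defunct> state: kill -9 cannot remove a zombie; the entry only goes away once its parent process reaps it (or the parent itself exits). A quick sketch to check the state and the parent of PID 1729 from the ps output above:

# /proc is enough here, no external tools needed
with open("/proc/1729/status") as f:
    for line in f:
        # State shows e.g. "Z (zombie)", PPid is the process that has to reap it
        if line.startswith(("State:", "PPid:")):
            print line.strip()

Restarting or stopping the parent reported there is what normally clears the entry.
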

Hi,

I tried to reboot my hosts and now [supervdsmServer] is <defunct>.

/var/log/vdsm/supervdsm.log

MainProcess|Thread-120::DEBUG::2014-04-24 12:22:19,955::supervdsmServer::103::SuperVdsm.ServerCallback::(wrapper) return validateAccess with None
MainProcess|Thread-120::DEBUG::2014-04-24 12:22:20,010::supervdsmServer::96::SuperVdsm.ServerCallback::(wrapper) call validateAccess with ('qemu', ('qemu', 'kvm'), '/rhev/data-center/mnt/host01.ovirt.lan:_home_export', 5) {}
MainProcess|Thread-120::DEBUG::2014-04-24 12:22:20,014::supervdsmServer::103::SuperVdsm.ServerCallback::(wrapper) return validateAccess with None
MainProcess|Thread-120::DEBUG::2014-04-24 12:22:20,059::supervdsmServer::96::SuperVdsm.ServerCallback::(wrapper) call validateAccess with ('qemu', ('qemu', 'kvm'), '/rhev/data-center/mnt/host01.ovirt.lan:_home_iso', 5) {}
MainProcess|Thread-120::DEBUG::2014-04-24 12:22:20,063::supervdsmServer::103::SuperVdsm.ServerCallback::(wrapper) return validateAccess with None

And one host doesn't mount the NFS used for the hosted engine:

MainThread::CRITICAL::2014-04-24 12:36:16,603::agent::103::ovirt_hosted_engine_ha.agent.agent.Agent::(run) Could not start ha-agent
Traceback (most recent call last):
  File "/usr/lib/python2.6/site-packages/ovirt_hosted_engine_ha/agent/agent.py", line 97, in run
    self._run_agent()
  File "/usr/lib/python2.6/site-packages/ovirt_hosted_engine_ha/agent/agent.py", line 154, in _run_agent
    hosted_engine.HostedEngine(self.shutdown_requested).start_monitoring()
  File "/usr/lib/python2.6/site-packages/ovirt_hosted_engine_ha/agent/hosted_engine.py", line 299, in start_monitoring
    self._initialize_vdsm()
  File "/usr/lib/python2.6/site-packages/ovirt_hosted_engine_ha/agent/hosted_engine.py", line 418, in _initialize_vdsm
    self._sd_path = env_path.get_domain_path(self._config)
  File "/usr/lib/python2.6/site-packages/ovirt_hosted_engine_ha/env/path.py", line 40, in get_domain_path
    .format(sd_uuid, parent))
Exception: path to storage domain aea040f8-ab9d-435b-9ecf-ddd4272e592f not found in /rhev/data-center/mnt

2014-04-23 17:40 GMT+02:00 Kevin Tibi <kevintibi@hotmail.com>:
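The exception above is get_domain_path() in ovirt_hosted_engine_ha giving up because no directory named after the storage domain UUID exists under /rhev/data-center/mnt, i.e. the NFS export is simply not mounted. A small sketch that performs the same lookup before the agent is restarted is below; the helper is mine (only the UUID and the mount root come from the log), so treat it as a convenience check, not as the agent's own code.

import os

RHEV_MNT = "/rhev/data-center/mnt"
SD_UUID = "aea040f8-ab9d-435b-9ecf-ddd4272e592f"  # UUID taken from the exception above

def find_domain_path(sd_uuid, parent=RHEV_MNT):
    """Return the mount directory that contains the storage domain, or None.

    This only mirrors what the agent complains about: it expects
    <parent>/<server:_export>/<sd_uuid> to exist.
    """
    if not os.path.isdir(parent):
        return None
    for mount in sorted(os.listdir(parent)):
        candidate = os.path.join(parent, mount, sd_uuid)
        if os.path.isdir(candidate):
            return candidate
    return None

found = find_domain_path(SD_UUID)
if found:
    print("hosted-engine storage domain is mounted at %s" % found)
else:
    print("storage domain %s not found under %s; the NFS export is not mounted" % (SD_UUID, RHEV_MNT))

If it reports the domain as missing, mounting the export by hand (as Kevin does later in the thread) and re-running it should show the path.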
top
 1729 vdsm      20   0      0      0      0 Z 373.8  0.0 252:08.51 ovirt-ha-broker <defunct>
[root@host01 ~]# ps axwu | grep 1729
vdsm      1729  0.7  0.0      0     0 ?        Zl   Apr02 240:24 [ovirt-ha-broker] <defunct>
[root@host01 ~]# ll /rhev/data-center/mnt/host01.ovirt.lan\:_home_NFS01/aea040f8-ab9d-435b-9ecf-ddd4272e592f/ha_agent/
total 2028
-rw-rw----. 1 vdsm kvm 1048576 23 avril 17:35 hosted-engine.lockspace
-rw-rw----. 1 vdsm kvm 1028096 23 avril 17:35 hosted-engine.metadata
cat /var/log/vdsm/vdsm.log
Thread-120518::DEBUG::2014-04-23 17:38:02,299::task::1185::TaskManager.Task::(prepare) Task=`f13e71f1-ac7c-49ab-8079-8f099ebf72b6`::finished: {'aea040f8-ab9d-435b-9ecf-ddd4272e592f': {'code': 0, 'version': 3, 'acquired': True, 'delay': '0.000410963', 'lastCheck': '3.4', 'valid': True}, '5ae613a4-44e4-42cb-89fc-7b5d34c1f30f': {'code': 0, 'version': 3, 'acquired': True, 'delay': '0.000412357', 'lastCheck': '6.8', 'valid': True}, 'cc51143e-8ad7-4b0b-a4d2-9024dffc1188': {'code': 0, 'version': 0, 'acquired': True, 'delay': '0.000455292', 'lastCheck': '1.2', 'valid': True}, 'ff98d346-4515-4349-8437-fb2f5e9eaadf': {'code': 0, 'version': 0, 'acquired': True, 'delay': '0.00817113', 'lastCheck': '1.7', 'valid': True}} Thread-120518::DEBUG::2014-04-23 17:38:02,300::task::595::TaskManager.Task::(_updateState) Task=`f13e71f1-ac7c-49ab-8079-8f099ebf72b6`::moving from state preparing -> state finished Thread-120518::DEBUG::2014-04-23 17:38:02,300::resourceManager::940::ResourceManager.Owner::(releaseAll) Owner.releaseAll requests {} resources {} Thread-120518::DEBUG::2014-04-23 17:38:02,300::resourceManager::977::ResourceManager.Owner::(cancelAll) Owner.cancelAll requests {} Thread-120518::DEBUG::2014-04-23 17:38:02,300::task::990::TaskManager.Task::(_decref) Task=`f13e71f1-ac7c-49ab-8079-8f099ebf72b6`::ref 0 aborting False Thread-120518::ERROR::2014-04-23 17:38:02,302::brokerlink::72::ovirt_hosted_engine_ha.lib.brokerlink.BrokerLink::(connect) Failed to connect to broker: [Errno 2] No such file or directory Thread-120518::ERROR::2014-04-23 17:38:02,302::API::1612::vds::(_getHaInfo) failed to retrieve Hosted Engine HA info Traceback (most recent call last): File "/usr/share/vdsm/API.py", line 1603, in _getHaInfo stats = instance.get_all_stats() File "/usr/lib/python2.6/site-packages/ovirt_hosted_engine_ha/client/client.py", line 83, in get_all_stats with broker.connection(): File "/usr/lib64/python2.6/contextlib.py", line 16, in __enter__ return self.gen.next() File "/usr/lib/python2.6/site-packages/ovirt_hosted_engine_ha/lib/brokerlink.py", line 96, in connection self.connect() File "/usr/lib/python2.6/site-packages/ovirt_hosted_engine_ha/lib/brokerlink.py", line 64, in connect self._socket.connect(constants.BROKER_SOCKET_FILE) File "<string>", line 1, in connect error: [Errno 2] No such file or directory Thread-78::DEBUG::2014-04-23 17:38:05,490::fileSD::225::Storage.Misc.excCmd::(getReadDelay) '/bin/dd iflag=direct if=/rhev/data-center/mnt/host01.ovirt.lan:_home_DATA/5ae613a4-44e4-42cb-89fc-7b5d34c1f30f/dom_md/metadata bs=4096 count=1' (cwd None) Thread-78::DEBUG::2014-04-23 17:38:05,523::fileSD::225::Storage.Misc.excCmd::(getReadDelay) SUCCESS: <err> = '0+1 records in\n0+1 records out\n545 bytes (545 B) copied, 0.000412209 s, 1.3 MB/s\n'; <rc> = 0
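The brokerlink errors above ([Errno 2] No such file or directory on constants.BROKER_SOCKET_FILE) only say that the broker's UNIX socket is gone, which matches a defunct ovirt-ha-broker. A quick probe is sketched below; the socket path is my assumption about the package default of that era, so replace it with whatever constants.BROKER_SOCKET_FILE resolves to on the host if it differs.

import os
import socket

# Assumed default path; the authoritative value is constants.BROKER_SOCKET_FILE
# inside the ovirt_hosted_engine_ha package on the host.
BROKER_SOCKET = "/var/run/ovirt-hosted-engine-ha/broker.socket"

def probe_broker(path=BROKER_SOCKET):
    """Return True if something is actually listening on the broker socket."""
    if not os.path.exists(path):
        print("broker socket %s does not exist; ovirt-ha-broker is not serving" % path)
        return False
    sock = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
    try:
        sock.connect(path)
        print("connected to %s; the broker looks alive" % path)
        return True
    except socket.error as err:
        print("socket exists but connect failed: %s" % err)
        return False
    finally:
        sock.close()

probe_broker()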
2014-04-23 17:27 GMT+02:00 Martin Sivak <msivak@redhat.com>:
Hi Kevin,
Same problem.
Are you missing the lockspace file as well while running on top of GlusterFS?
ovirt-ha-broker has 400% CPU and is defunct. I can't kill it with -9.
A defunct process eating four full cores? I wonder how that is possible. What are the status flags of that process when you run ps axwu?
Can you attach the log files please?
-- Martin Sivák msivak@redhat.com Red Hat Czech RHEV-M SLA / Brno, CZ
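For what it is worth, the same flags can also be read straight from /proc instead of ps; a minimal sketch using PID 1729 from the listings above is below (the interpretation comments are mine). In ps output, Z means zombie and the lowercase l flag marks a multi-threaded process.

# Minimal sketch: read the kernel's view of the defunct broker from /proc.
# PID 1729 is taken from the top and ps output above.
PID = 1729

def proc_status(pid):
    fields = {}
    with open("/proc/%d/status" % pid) as status:
        for line in status:
            key, _, value = line.partition(":")
            fields[key] = value.strip()
    return fields

info = proc_status(PID)
print("Name:    %s" % info.get("Name"))
print("State:   %s" % info.get("State"))    # 'Z (zombie)' is what <defunct> corresponds to
print("Threads: %s" % info.get("Threads"))
print("PPid:    %s" % info.get("PPid"))     # the parent that has not reaped the process yet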
Same problem. ovirt-ha-broker has 400% CPU and is defunct. I can't kill it with -9.
2014-04-23 13:55 GMT+02:00 Martin Sivak <msivak@redhat.com>:
Hi,
Isn't this file created when hosted engine is started?
The file is created by the setup script. If it got lost then there was probably something bad happening in your NFS or Gluster storage.
Or how can I create this file manually?
I can give you experimental treatment for this. We do not have any official way as this is something that should not ever happen :)
!! But before you do that make sure you do not have any nodes running properly. This will destroy and reinitialize the lockspace database for the whole hosted-engine environment (which you apparently lack, but..). !!
You have to create the ha_agent/hosted-engine.lockspace file with the expected size (1MB) and then tell sanlock to initialize it as a lockspace using:
# python
>>> import sanlock
>>> sanlock.write_lockspace(lockspace="hosted-engine",
...     path="/rhev/data-center/mnt/<nfs>/<hosted engine storage domain>/ha_agent/hosted-engine.lockspace",
...     offset=0)
Then try starting the services (both broker and agent) again.
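For convenience, the two steps above (create the 1 MB lockspace file, then let sanlock format it) can be written as one small script. The sanlock call is exactly the one shown above; the path placeholder still has to be replaced with the real mount point, and the ownership and permission fix-up is my own addition based on the ha_agent listings earlier in the thread.

import os
import pwd
import grp
import sanlock

LOCKSPACE_PATH = ("/rhev/data-center/mnt/<nfs>/<hosted engine storage domain>"
                  "/ha_agent/hosted-engine.lockspace")   # replace with the real mount
LOCKSPACE_SIZE = 1024 * 1024                             # 1 MB, matching the listings in this thread

# Step 1: create the file with the expected size (1 MB of zeros).
with open(LOCKSPACE_PATH, "wb") as lockspace_file:
    lockspace_file.truncate(LOCKSPACE_SIZE)

# Match the ownership and mode shown in the ha_agent listings (vdsm:kvm, 0660).
os.chown(LOCKSPACE_PATH, pwd.getpwnam("vdsm").pw_uid, grp.getgrnam("kvm").gr_gid)
os.chmod(LOCKSPACE_PATH, 0o660)

# Step 2: tell sanlock to initialize it as the hosted-engine lockspace.
sanlock.write_lockspace(lockspace="hosted-engine",
                        path=LOCKSPACE_PATH,
                        offset=0)

As the warning above says, run it on one node only, with no HA agents active.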
-- Martin Sivák msivak@redhat.com Red Hat Czech RHEV-M SLA / Brno, CZ
----- Original Message -----
On 04/23/2014 11:08 AM, Martin Sivak wrote:
Hi René,
>>> libvirtError: Failed to acquire lock: No space left on device
>>> 2014-04-22 12:38:17+0200 654 [3093]: r2 cmd_acquire 2,9,5733 invalid >>> lockspace found -1 failed 0 name 2851af27-8744-445d-9fb1-a0d083c8dc82
Can you please check the contents of /rhev/data-center/<your nfs mount>/<nfs domain uuid>/ha_agent/?
This is how it should look like:
[root@dev-03 ~]# ls -al
/rhev/data-center/mnt/euryale\:_home_ovirt_he/e16de6a2-53f5-4ab3-95a3-255d08398824/ha_agent/
total 2036
drwxr-x---. 2 vdsm kvm    4096 Mar 19 18:46 .
drwxr-xr-x. 6 vdsm kvm    4096 Mar 19 18:46 ..
-rw-rw----. 1 vdsm kvm 1048576 Apr 23 11:05 hosted-engine.lockspace
-rw-rw----. 1 vdsm kvm 1028096 Mar 19 18:46 hosted-engine.metadata
The errors seem to indicate that you somehow lost the lockspace file.
True :) Isn't this file created when hosted engine is started? Or how can I create this file manually?
-- Martin Sivák msivak@redhat.com Red Hat Czech RHEV-M SLA / Brno, CZ
----- Original Message ----- > On 04/23/2014 12:28 AM, Doron Fediuck wrote: >> Hi Rene, >> any idea what closed your ovirtmgmt bridge? >> as long as it is down vdsm may have issues starting up properly >> and this is why you see the complaints on the rpc server. >> >> Can you try manually fixing the network part first and then >> restart vdsm? >> Once vdsm is happy hosted engine VM will start. > > Thanks for your feedback, Doron. > > My ovirtmgmt bridge seems to be on or isn't it: > # brctl show ovirtmgmt > bridge name bridge id STP enabled
interfaces
> ovirtmgmt 8000.0025907587c2 no eth0.200 > > # ip a s ovirtmgmt > 7: ovirtmgmt: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue > state UNKNOWN > link/ether 00:25:90:75:87:c2 brd ff:ff:ff:ff:ff:ff > inet 10.0.200.102/24 brd 10.0.200.255 scope global ovirtmgmt > inet6 fe80::225:90ff:fe75:87c2/64 scope link > valid_lft forever preferred_lft forever > > # ip a s eth0.200 > 6: eth0.200@eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc > noqueue state UP > link/ether 00:25:90:75:87:c2 brd ff:ff:ff:ff:ff:ff > inet6 fe80::225:90ff:fe75:87c2/64 scope link > valid_lft forever preferred_lft forever > > I tried the following yesterday: > Copy virtual disk from GlusterFS storage to local disk of host and > create a new vm with virt-manager which loads ovirtmgmt disk. I could > reach my engine over the ovirtmgmt bridge (so bridge must be working). > > I also started libvirtd with Option -v and I saw the following in > libvirtd.log when trying to start ovirt engine: > 2014-04-22 14:18:25.432+0000: 8901: debug : virCommandRunAsync:2250 : > Command result 0, with PID 11491 > 2014-04-22 14:18:25.478+0000: 8901: debug : virCommandRun:2045 : Result > exit status 255, stdout: '' stderr: 'iptables v1.4.7: goto 'FO-vnet0' is > not a chain > > So it could be that something is broken in my hosted-engine network. Do > you have any clue how I can troubleshoot this? > > > Thanks, > René > > >> >> ----- Original Message ----- >>> From: "René Koch" <rkoch@linuxland.at> >>> To: "Martin Sivak" <msivak@redhat.com> >>> Cc: users@ovirt.org >>> Sent: Tuesday, April 22, 2014 1:46:38 PM >>> Subject: Re: [ovirt-users] hosted engine health check issues >>> >>> Hi, >>> >>> I rebooted one of my ovirt hosts today and the result is now
>>> can't start hosted-engine anymore. >>> >>> ovirt-ha-agent isn't running because the lockspace file is missing >>> (sanlock complains about it). >>> So I tried to start hosted-engine with --vm-start and I get the >>> following errors: >>> >>> ==> /var/log/sanlock.log <== >>> 2014-04-22 12:38:17+0200 654 [3093]: r2 cmd_acquire 2,9,5733 invalid >>> lockspace found -1 failed 0 name 2851af27-8744-445d-9fb1-a0d083c8dc82 >>> >>> ==> /var/log/messages <== >>> Apr 22 12:38:17 ovirt-host02 sanlock[3079]: 2014-04-22 12:38:17+0200 654 >>> [3093]: r2 cmd_acquire 2,9,5733 invalid lockspace found -1 failed 0 name >>> 2851af27-8744-445d-9fb1-a0d083c8dc82 >>> Apr 22 12:38:17 ovirt-host02 kernel: ovirtmgmt: port 2(vnet0) entering >>> disabled state >>> Apr 22 12:38:17 ovirt-host02 kernel: device vnet0 left
promiscuous mode
>>> Apr 22 12:38:17 ovirt-host02 kernel: ovirtmgmt: port 2(vnet0) entering >>> disabled state >>> >>> ==> /var/log/vdsm/vdsm.log <== >>> Thread-21::DEBUG::2014-04-22 >>> 12:38:17,563::libvirtconnection::124::root::(wrapper) Unknown >>> libvirterror: ecode: 38 edom: 42 level: 2 message: Failed to acquire >>> lock: No space left on device >>> Thread-21::DEBUG::2014-04-22 >>> 12:38:17,563::vm::2263::vm.Vm::(_startUnderlyingVm) >>> vmId=`f26dd37e-13b5-430c-b2f2-ecd098b82a91`::_ongoingCreations released >>> Thread-21::ERROR::2014-04-22 >>> 12:38:17,564::vm::2289::vm.Vm::(_startUnderlyingVm) >>> vmId=`f26dd37e-13b5-430c-b2f2-ecd098b82a91`::The vm start
process failed
>>> Traceback (most recent call last): >>> File "/usr/share/vdsm/vm.py", line 2249, in _startUnderlyingVm >>> self._run() >>> File "/usr/share/vdsm/vm.py", line 3170, in _run >>> self._connection.createXML(domxml, flags), >>> File >>> "/usr/lib64/python2.6/site-packages/vdsm/libvirtconnection.py", >>> line 92, in wrapper >>> ret = f(*args, **kwargs) >>> File "/usr/lib64/python2.6/site-packages/libvirt.py", line 2665, in >>> createXML >>> if ret is None:raise libvirtError('virDomainCreateXML() failed', >>> conn=self) >>> libvirtError: Failed to acquire lock: No space left on device >>> >>> ==> /var/log/messages <== >>> Apr 22 12:38:17 ovirt-host02 vdsm vm.Vm ERROR >>> vmId=`f26dd37e-13b5-430c-b2f2-ecd098b82a91`::The vm start
>>> failed#012Traceback (most recent call last):#012 File >>> "/usr/share/vdsm/vm.py", line 2249, in _startUnderlyingVm#012 >>> self._run()#012 File "/usr/share/vdsm/vm.py", line 3170, in _run#012 >>> self._connection.createXML(domxml, flags),#012 File >>> "/usr/lib64/python2.6/site-packages/vdsm/libvirtconnection.py", line 92, >>> in wrapper#012 ret = f(*args, **kwargs)#012 File >>> "/usr/lib64/python2.6/site-packages/libvirt.py", line 2665, in >>> createXML#012 if ret is None:raise libvirtError('virDomainCreateXML() >>> failed', conn=self)#012libvirtError: Failed to acquire lock: No space >>> left on device >>> >>> ==> /var/log/vdsm/vdsm.log <== >>> Thread-21::DEBUG::2014-04-22 >>> 12:38:17,569::vm::2731::vm.Vm::(setDownStatus) >>> vmId=`f26dd37e-13b5-430c-b2f2-ecd098b82a91`::Changed state to Down: >>> Failed to acquire lock: No space left on device >>> >>> >>> No space left on device is nonsense as there is enough space (I had this >>> issue last time as well where I had to patch machine.py, but
this file
>>> is now Python 2.6.6 compatible. >>> >>> Any idea what prevents hosted-engine from starting? >>> ovirt-ha-broker, vdsmd and sanlock are running btw. >>> >>> Btw, I can see in log that json rpc server module is missing - which >>> package is required for CentOS 6.5? >>> Apr 22 12:37:14 ovirt-host02 vdsm vds WARNING Unable to load
the json
>>> rpc server module. Please make sure it is installed. >>> >>> >>> Thanks, >>> René >>> >>> >>> >>> On 04/17/2014 10:02 AM, Martin Sivak wrote: >>>> Hi, >>>> >>>>>>> How can I disable notifications? >>>> >>>> The notification is configured in >>>> /etc/ovirt-hosted-engine-ha/broker.conf >>>> section notification. >>>> The email is sent when the key state_transition exists and the string >>>> OldState-NewState contains the (case insensitive) regexp from
>>>> value. >>>> >>>>>>> Is it intended to send out these messages and detect that ovirt >>>>>>> engine >>>>>>> is down (which is false anyway), but not to restart the vm? >>>> >>>> Forget about emails for now and check the >>>> /var/log/ovirt-hosted-engine-ha/agent.log and broker.log (and attach >>>> them >>>> as well btw). >>>> >>>>>>> oVirt hosts think that hosted engine is down because it seems that >>>>>>> hosts >>>>>>> can't write to hosted-engine.lockspace due to glusterfs issues (or >>>>>>> at >>>>>>> least I think so). >>>> >>>> The hosts think so or can't really write there? The lockspace is >>>> managed >>>> by >>>> sanlock and our HA daemons do not touch it at all. We only ask sanlock >>>> to >>>> get make sure we have unique server id. >>>> >>>>>>> Is is possible or planned to make the whole ha feature
>>>> >>>> Well the system won't perform any automatic actions if you
>>>> hosted >>>> engine to global maintenance and only start/stop/migrate the VM >>>> manually. >>>> I would discourage you from stopping agent/broker, because the engine >>>> itself has some logic based on the reporting. >>>> >>>> Regards >>>> >>>> -- >>>> Martin Sivák >>>> msivak@redhat.com >>>> Red Hat Czech >>>> RHEV-M SLA / Brno, CZ >>>> >>>> ----- Original Message ----- >>>>> On 04/15/2014 04:53 PM, Jiri Moskovcak wrote: >>>>>> On 04/14/2014 10:50 AM, René Koch wrote: >>>>>>> Hi, >>>>>>> >>>>>>> I have some issues with hosted engine status. >>>>>>> >>>>>>> oVirt hosts think that hosted engine is down because it seems that >>>>>>> hosts >>>>>>> can't write to hosted-engine.lockspace due to glusterfs issues (or >>>>>>> at >>>>>>> least I think so). >>>>>>> >>>>>>> Here's the output of vm-status: >>>>>>> >>>>>>> # hosted-engine --vm-status >>>>>>> >>>>>>> >>>>>>> --== Host 1 status ==-- >>>>>>> >>>>>>> Status up-to-date : False >>>>>>> Hostname : 10.0.200.102 >>>>>>> Host ID : 1 >>>>>>> Engine status : unknown stale-data >>>>>>> Score : 2400 >>>>>>> Local maintenance : False >>>>>>> Host timestamp : 1397035677 >>>>>>> Extra metadata (valid at timestamp): >>>>>>> metadata_parse_version=1 >>>>>>> metadata_feature_version=1 >>>>>>> timestamp=1397035677 (Wed Apr 9 11:27:57 2014) >>>>>>> host-id=1 >>>>>>> score=2400 >>>>>>> maintenance=False >>>>>>> state=EngineUp >>>>>>> >>>>>>> >>>>>>> --== Host 2 status ==-- >>>>>>> >>>>>>> Status up-to-date : True >>>>>>> Hostname : 10.0.200.101 >>>>>>> Host ID : 2 >>>>>>> Engine status : {'reason': 'vm not running on >>>>>>> this >>>>>>> host', 'health': 'bad', 'vm': 'down', 'detail': 'unknown'} >>>>>>> Score : 0 >>>>>>> Local maintenance : False >>>>>>> Host timestamp : 1397464031 >>>>>>> Extra metadata (valid at timestamp): >>>>>>> metadata_parse_version=1 >>>>>>> metadata_feature_version=1 >>>>>>> timestamp=1397464031 (Mon Apr 14 10:27:11 2014) >>>>>>> host-id=2 >>>>>>> score=0 >>>>>>> maintenance=False >>>>>>> state=EngineUnexpectedlyDown >>>>>>> timeout=Mon Apr 14 10:35:05 2014 >>>>>>> >>>>>>> oVirt engine is sending me 2 emails every 10 minutes with
>>>>>>> following >>>>>>> subjects: >>>>>>> - ovirt-hosted-engine state transition EngineDown-EngineStart >>>>>>> - ovirt-hosted-engine state transition EngineStart-EngineUp >>>>>>> >>>>>>> In oVirt webadmin I can see the following message: >>>>>>> VM HostedEngine is down. Exit message: internal error Failed to >>>>>>> acquire >>>>>>> lock: error -243. >>>>>>> >>>>>>> These messages are really annoying as oVirt isn't doing anything >>>>>>> with >>>>>>> hosted engine - I have an uptime of 9 days in my engine vm. >>>>>>> >>>>>>> So my questions are now: >>>>>>> Is it intended to send out these messages and detect that ovirt >>>>>>> engine >>>>>>> is down (which is false anyway), but not to restart the vm? >>>>>>> >>>>>>> How can I disable notifications? I'm planning to write a Nagios >>>>>>> plugin >>>>>>> which parses the output of hosted-engine --vm-status and only Nagios >>>>>>> should notify me, not hosted-engine script. >>>>>>> >>>>>>> Is is possible or planned to make the whole ha feature optional? I >>>>>>> really really really hate cluster software as it causes more >>>>>>> troubles >>>>>>> then standalone machines and in my case the hosted-engine ha feature >>>>>>> really causes troubles (and I didn't had a hardware or network >>>>>>> outage >>>>>>> yet only issues with hosted-engine ha agent). I don't need any ha >>>>>>> feature for hosted engine. I just want to run engine virtualized on >>>>>>> oVirt and if engine vm fails (e.g. because of issues with a host) >>>>>>> I'll >>>>>>> restart it on another node. >>>>>> >>>>>> Hi, you can: >>>>>> 1. edit /etc/ovirt-hosted-engine-ha/{agent,broker}-log.conf and tweak >>>>>> the logger as you like >>>>>> 2. or kill ovirt-ha-broker & ovirt-ha-agent services >>>>> >>>>> Thanks for the information. >>>>> So engine is able to run when ovirt-ha-broker and ovirt-ha-agent isn't >>>>> running? >>>>> >>>>> >>>>> Regards, >>>>> René >>>>> >>>>>> >>>>>> --Jirka >>>>>>> >>>>>>> Thanks, >>>>>>> René >>>>>>> >>>>>>> >>>>>> >>>>> _______________________________________________ >>>>> Users mailing list >>>>> Users@ovirt.org >>>>> http://lists.ovirt.org/mailman/listinfo/users >>>>> >>> _______________________________________________ >>> Users mailing list >>> Users@ovirt.org >>> http://lists.ovirt.org/mailman/listinfo/users >>> >
_______________________________________________ Users mailing list Users@ovirt.org http://lists.ovirt.org/mailman/listinfo/users

Ok i mount manualy the domain for hosted engine and agent go up. But vm-status : --== Host 2 status ==-- Status up-to-date : False Hostname : 192.168.99.103 Host ID : 2 Engine status : unknown stale-data Score : 0 Local maintenance : False Host timestamp : 1398333438 And in my engine, host02 Ha is no active. 2014-04-24 12:48 GMT+02:00 Kevin Tibi <kevintibi@hotmail.com>: > Hi, > > I try to reboot my hosts and now [supervdsmServer] is <defunct>. > > /var/log/vdsm/supervdsm.log > > > MainProcess|Thread-120::DEBUG::2014-04-24 > 12:22:19,955::supervdsmServer::103::SuperVdsm.ServerCallback::(wrapper) > return validateAccess with None > MainProcess|Thread-120::DEBUG::2014-04-24 > 12:22:20,010::supervdsmServer::96::SuperVdsm.ServerCallback::(wrapper) call > validateAccess with ('qemu', ('qemu', 'kvm'), > '/rhev/data-center/mnt/host01.ovirt.lan:_home_export', 5) {} > MainProcess|Thread-120::DEBUG::2014-04-24 > 12:22:20,014::supervdsmServer::103::SuperVdsm.ServerCallback::(wrapper) > return validateAccess with None > MainProcess|Thread-120::DEBUG::2014-04-24 > 12:22:20,059::supervdsmServer::96::SuperVdsm.ServerCallback::(wrapper) call > validateAccess with ('qemu', ('qemu', 'kvm'), > '/rhev/data-center/mnt/host01.ovirt.lan:_home_iso', 5) {} > MainProcess|Thread-120::DEBUG::2014-04-24 > 12:22:20,063::supervdsmServer::103::SuperVdsm.ServerCallback::(wrapper) > return validateAccess with None > > and one host don't mount the NFS used for hosted engine. > > MainThread::CRITICAL::2014-04-24 > 12:36:16,603::agent::103::ovirt_hosted_engine_ha.agent.agent.Agent::(run) > Could not start ha-agent > Traceback (most recent call last): > File > "/usr/lib/python2.6/site-packages/ovirt_hosted_engine_ha/agent/agent.py", > line 97, in run > self._run_agent() > File > "/usr/lib/python2.6/site-packages/ovirt_hosted_engine_ha/agent/agent.py", > line 154, in _run_agent > hosted_engine.HostedEngine(self.shutdown_requested).start_monitoring() > File > "/usr/lib/python2.6/site-packages/ovirt_hosted_engine_ha/agent/hosted_engine.py", > line 299, in start_monitoring > self._initialize_vdsm() > File > "/usr/lib/python2.6/site-packages/ovirt_hosted_engine_ha/agent/hosted_engine.py", > line 418, in _initialize_vdsm > self._sd_path = env_path.get_domain_path(self._config) > File > "/usr/lib/python2.6/site-packages/ovirt_hosted_engine_ha/env/path.py", line > 40, in get_domain_path > .format(sd_uuid, parent)) > Exception: path to storage domain aea040f8-ab9d-435b-9ecf-ddd4272e592f not > found in /rhev/data-center/mnt > > > > 2014-04-23 17:40 GMT+02:00 Kevin Tibi <kevintibi@hotmail.com>: > > top >> 1729 vdsm 20 0 0 0 0 Z 373.8 0.0 252:08.51 >> ovirt-ha-broker <defunct> >> >> >> [root@host01 ~]# ps axwu | grep 1729 >> vdsm 1729 0.7 0.0 0 0 ? Zl Apr02 240:24 >> [ovirt-ha-broker] <defunct> >> >> [root@host01 ~]# ll >> /rhev/data-center/mnt/host01.ovirt.lan\:_home_NFS01/aea040f8-ab9d-435b-9ecf-ddd4272e592f/ha_agent/ >> total 2028 >> -rw-rw----. 1 vdsm kvm 1048576 23 avril 17:35 hosted-engine.lockspace >> -rw-rw----. 
1 vdsm kvm 1028096 23 avril 17:35 hosted-engine.metadata >> >> cat /var/log/vdsm/vdsm.log >> >> Thread-120518::DEBUG::2014-04-23 >> 17:38:02,299::task::1185::TaskManager.Task::(prepare) >> Task=`f13e71f1-ac7c-49ab-8079-8f099ebf72b6`::finished: >> {'aea040f8-ab9d-435b-9ecf-ddd4272e592f': {'code': 0, 'version': 3, >> 'acquired': True, 'delay': '0.000410963', 'lastCheck': '3.4', 'valid': >> True}, '5ae613a4-44e4-42cb-89fc-7b5d34c1f30f': {'code': 0, 'version': 3, >> 'acquired': True, 'delay': '0.000412357', 'lastCheck': '6.8', 'valid': >> True}, 'cc51143e-8ad7-4b0b-a4d2-9024dffc1188': {'code': 0, 'version': 0, >> 'acquired': True, 'delay': '0.000455292', 'lastCheck': '1.2', 'valid': >> True}, 'ff98d346-4515-4349-8437-fb2f5e9eaadf': {'code': 0, 'version': 0, >> 'acquired': True, 'delay': '0.00817113', 'lastCheck': '1.7', 'valid': True}} >> Thread-120518::DEBUG::2014-04-23 >> 17:38:02,300::task::595::TaskManager.Task::(_updateState) >> Task=`f13e71f1-ac7c-49ab-8079-8f099ebf72b6`::moving from state preparing -> >> state finished >> Thread-120518::DEBUG::2014-04-23 >> 17:38:02,300::resourceManager::940::ResourceManager.Owner::(releaseAll) >> Owner.releaseAll requests {} resources {} >> Thread-120518::DEBUG::2014-04-23 >> 17:38:02,300::resourceManager::977::ResourceManager.Owner::(cancelAll) >> Owner.cancelAll requests {} >> Thread-120518::DEBUG::2014-04-23 >> 17:38:02,300::task::990::TaskManager.Task::(_decref) >> Task=`f13e71f1-ac7c-49ab-8079-8f099ebf72b6`::ref 0 aborting False >> Thread-120518::ERROR::2014-04-23 >> 17:38:02,302::brokerlink::72::ovirt_hosted_engine_ha.lib.brokerlink.BrokerLink::(connect) >> Failed to connect to broker: [Errno 2] No such file or directory >> Thread-120518::ERROR::2014-04-23 >> 17:38:02,302::API::1612::vds::(_getHaInfo) failed to retrieve Hosted Engine >> HA info >> Traceback (most recent call last): >> File "/usr/share/vdsm/API.py", line 1603, in _getHaInfo >> stats = instance.get_all_stats() >> File >> "/usr/lib/python2.6/site-packages/ovirt_hosted_engine_ha/client/client.py", >> line 83, in get_all_stats >> with broker.connection(): >> File "/usr/lib64/python2.6/contextlib.py", line 16, in __enter__ >> return self.gen.next() >> File >> "/usr/lib/python2.6/site-packages/ovirt_hosted_engine_ha/lib/brokerlink.py", >> line 96, in connection >> self.connect() >> File >> "/usr/lib/python2.6/site-packages/ovirt_hosted_engine_ha/lib/brokerlink.py", >> line 64, in connect >> self._socket.connect(constants.BROKER_SOCKET_FILE) >> File "<string>", line 1, in connect >> error: [Errno 2] No such file or directory >> Thread-78::DEBUG::2014-04-23 >> 17:38:05,490::fileSD::225::Storage.Misc.excCmd::(getReadDelay) '/bin/dd >> iflag=direct >> if=/rhev/data-center/mnt/host01.ovirt.lan:_home_DATA/5ae613a4-44e4-42cb-89fc-7b5d34c1f30f/dom_md/metadata >> bs=4096 count=1' (cwd None) >> Thread-78::DEBUG::2014-04-23 >> 17:38:05,523::fileSD::225::Storage.Misc.excCmd::(getReadDelay) SUCCESS: >> <err> = '0+1 records in\n0+1 records out\n545 bytes (545 B) copied, >> 0.000412209 s, 1.3 MB/s\n'; <rc> = 0 >> >> >> >> >> 2014-04-23 17:27 GMT+02:00 Martin Sivak <msivak@redhat.com>: >> >> Hi Kevin, >>> >>> > same pb. >>> >>> Are you missing the lockspace file as well while running on top of >>> GlusterFS? >>> >>> > ovirt-ha-broker have 400% cpu and is defunct. I can't kill with -9. >>> >>> Defunct process eating full four cores? I wonder how is that possible.. >>> What are the status flags of that process when you do ps axwu? >>> >>> Can you attach the log files please? 
>>> >>> -- >>> Martin Sivák >>> msivak@redhat.com >>> Red Hat Czech >>> RHEV-M SLA / Brno, CZ >>> >>> ----- Original Message ----- >>> > same pb. ovirt-ha-broker have 400% cpu and is defunct. I can't kill >>> with -9. >>> > >>> > >>> > 2014-04-23 13:55 GMT+02:00 Martin Sivak <msivak@redhat.com>: >>> > >>> > > Hi, >>> > > >>> > > > Isn't this file created when hosted engine is started? >>> > > >>> > > The file is created by the setup script. If it got lost then there >>> was >>> > > probably something bad happening in your NFS or Gluster storage. >>> > > >>> > > > Or how can I create this file manually? >>> > > >>> > > I can give you experimental treatment for this. We do not have any >>> > > official way as this is something that should not ever happen :) >>> > > >>> > > !! But before you do that make sure you do not have any nodes running >>> > > properly. This will destroy and reinitialize the lockspace database >>> for the >>> > > whole hosted-engine environment (which you apparently lack, but..). >>> !! >>> > > >>> > > You have to create the ha_agent/hosted-engine.lockspace file with the >>> > > expected size (1MB) and then tell sanlock to initialize it as a >>> lockspace >>> > > using: >>> > > >>> > > # python >>> > > >>> import sanlock >>> > > >>> sanlock.write_lockspace(lockspace="hosted-engine", >>> > > ... path="/rhev/data-center/mnt/<nfs>/<hosted engine storage >>> > > domain>/ha_agent/hosted-engine.lockspace", >>> > > ... offset=0) >>> > > >>> >>> > > >>> > > Then try starting the services (both broker and agent) again. >>> > > >>> > > -- >>> > > Martin Sivák >>> > > msivak@redhat.com >>> > > Red Hat Czech >>> > > RHEV-M SLA / Brno, CZ >>> > > >>> > > >>> > > ----- Original Message ----- >>> > > > On 04/23/2014 11:08 AM, Martin Sivak wrote: >>> > > > > Hi René, >>> > > > > >>> > > > >>>> libvirtError: Failed to acquire lock: No space left on device >>> > > > > >>> > > > >>>> 2014-04-22 12:38:17+0200 654 [3093]: r2 cmd_acquire 2,9,5733 >>> invalid >>> > > > >>>> lockspace found -1 failed 0 name >>> > > 2851af27-8744-445d-9fb1-a0d083c8dc82 >>> > > > > >>> > > > > Can you please check the contents of /rhev/data-center/<your nfs >>> > > > > mount>/<nfs domain uuid>/ha_agent/? >>> > > > > >>> > > > > This is how it should look like: >>> > > > > >>> > > > > [root@dev-03 ~]# ls -al >>> > > > > >>> > > >>> /rhev/data-center/mnt/euryale\:_home_ovirt_he/e16de6a2-53f5-4ab3-95a3-255d08398824/ha_agent/ >>> > > > > total 2036 >>> > > > > drwxr-x---. 2 vdsm kvm 4096 Mar 19 18:46 . >>> > > > > drwxr-xr-x. 6 vdsm kvm 4096 Mar 19 18:46 .. >>> > > > > -rw-rw----. 1 vdsm kvm 1048576 Apr 23 11:05 >>> hosted-engine.lockspace >>> > > > > -rw-rw----. 1 vdsm kvm 1028096 Mar 19 18:46 >>> hosted-engine.metadata >>> > > > > >>> > > > > The errors seem to indicate that you somehow lost the lockspace >>> file. >>> > > > >>> > > > True :) >>> > > > Isn't this file created when hosted engine is started? Or how can I >>> > > > create this file manually? >>> > > > >>> > > > > >>> > > > > -- >>> > > > > Martin Sivák >>> > > > > msivak@redhat.com >>> > > > > Red Hat Czech >>> > > > > RHEV-M SLA / Brno, CZ >>> > > > > >>> > > > > ----- Original Message ----- >>> > > > >> On 04/23/2014 12:28 AM, Doron Fediuck wrote: >>> > > > >>> Hi Rene, >>> > > > >>> any idea what closed your ovirtmgmt bridge? >>> > > > >>> as long as it is down vdsm may have issues starting up properly >>> > > > >>> and this is why you see the complaints on the rpc server. 
>>> > > > >>> >>> > > > >>> Can you try manually fixing the network part first and then >>> > > > >>> restart vdsm? >>> > > > >>> Once vdsm is happy hosted engine VM will start. >>> > > > >> >>> > > > >> Thanks for your feedback, Doron. >>> > > > >> >>> > > > >> My ovirtmgmt bridge seems to be on or isn't it: >>> > > > >> # brctl show ovirtmgmt >>> > > > >> bridge name bridge id STP enabled >>> interfaces >>> > > > >> ovirtmgmt 8000.0025907587c2 no >>> eth0.200 >>> > > > >> >>> > > > >> # ip a s ovirtmgmt >>> > > > >> 7: ovirtmgmt: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc >>> noqueue >>> > > > >> state UNKNOWN >>> > > > >> link/ether 00:25:90:75:87:c2 brd ff:ff:ff:ff:ff:ff >>> > > > >> inet 10.0.200.102/24 brd 10.0.200.255 scope global >>> ovirtmgmt >>> > > > >> inet6 fe80::225:90ff:fe75:87c2/64 scope link >>> > > > >> valid_lft forever preferred_lft forever >>> > > > >> >>> > > > >> # ip a s eth0.200 >>> > > > >> 6: eth0.200@eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 >>> qdisc >>> > > > >> noqueue state UP >>> > > > >> link/ether 00:25:90:75:87:c2 brd ff:ff:ff:ff:ff:ff >>> > > > >> inet6 fe80::225:90ff:fe75:87c2/64 scope link >>> > > > >> valid_lft forever preferred_lft forever >>> > > > >> >>> > > > >> I tried the following yesterday: >>> > > > >> Copy virtual disk from GlusterFS storage to local disk of host >>> and >>> > > > >> create a new vm with virt-manager which loads ovirtmgmt disk. I >>> could >>> > > > >> reach my engine over the ovirtmgmt bridge (so bridge must be >>> working). >>> > > > >> >>> > > > >> I also started libvirtd with Option -v and I saw the following >>> in >>> > > > >> libvirtd.log when trying to start ovirt engine: >>> > > > >> 2014-04-22 14:18:25.432+0000: 8901: debug : >>> virCommandRunAsync:2250 : >>> > > > >> Command result 0, with PID 11491 >>> > > > >> 2014-04-22 14:18:25.478+0000: 8901: debug : virCommandRun:2045 : >>> > > Result >>> > > > >> exit status 255, stdout: '' stderr: 'iptables v1.4.7: goto >>> 'FO-vnet0' >>> > > is >>> > > > >> not a chain >>> > > > >> >>> > > > >> So it could be that something is broken in my hosted-engine >>> network. >>> > > Do >>> > > > >> you have any clue how I can troubleshoot this? >>> > > > >> >>> > > > >> >>> > > > >> Thanks, >>> > > > >> René >>> > > > >> >>> > > > >> >>> > > > >>> >>> > > > >>> ----- Original Message ----- >>> > > > >>>> From: "René Koch" <rkoch@linuxland.at> >>> > > > >>>> To: "Martin Sivak" <msivak@redhat.com> >>> > > > >>>> Cc: users@ovirt.org >>> > > > >>>> Sent: Tuesday, April 22, 2014 1:46:38 PM >>> > > > >>>> Subject: Re: [ovirt-users] hosted engine health check issues >>> > > > >>>> >>> > > > >>>> Hi, >>> > > > >>>> >>> > > > >>>> I rebooted one of my ovirt hosts today and the result is now >>> that I >>> > > > >>>> can't start hosted-engine anymore. >>> > > > >>>> >>> > > > >>>> ovirt-ha-agent isn't running because the lockspace file is >>> missing >>> > > > >>>> (sanlock complains about it). 
>>> > > > >>>> So I tried to start hosted-engine with --vm-start and I get >>> the >>> > > > >>>> following errors: >>> > > > >>>> >>> > > > >>>> ==> /var/log/sanlock.log <== >>> > > > >>>> 2014-04-22 12:38:17+0200 654 [3093]: r2 cmd_acquire 2,9,5733 >>> invalid >>> > > > >>>> lockspace found -1 failed 0 name >>> > > 2851af27-8744-445d-9fb1-a0d083c8dc82 >>> > > > >>>> >>> > > > >>>> ==> /var/log/messages <== >>> > > > >>>> Apr 22 12:38:17 ovirt-host02 sanlock[3079]: 2014-04-22 >>> > > 12:38:17+0200 654 >>> > > > >>>> [3093]: r2 cmd_acquire 2,9,5733 invalid lockspace found -1 >>> failed 0 >>> > > name >>> > > > >>>> 2851af27-8744-445d-9fb1-a0d083c8dc82 >>> > > > >>>> Apr 22 12:38:17 ovirt-host02 kernel: ovirtmgmt: port 2(vnet0) >>> > > entering >>> > > > >>>> disabled state >>> > > > >>>> Apr 22 12:38:17 ovirt-host02 kernel: device vnet0 left >>> promiscuous >>> > > mode >>> > > > >>>> Apr 22 12:38:17 ovirt-host02 kernel: ovirtmgmt: port 2(vnet0) >>> > > entering >>> > > > >>>> disabled state >>> > > > >>>> >>> > > > >>>> ==> /var/log/vdsm/vdsm.log <== >>> > > > >>>> Thread-21::DEBUG::2014-04-22 >>> > > > >>>> 12:38:17,563::libvirtconnection::124::root::(wrapper) Unknown >>> > > > >>>> libvirterror: ecode: 38 edom: 42 level: 2 message: Failed to >>> acquire >>> > > > >>>> lock: No space left on device >>> > > > >>>> Thread-21::DEBUG::2014-04-22 >>> > > > >>>> 12:38:17,563::vm::2263::vm.Vm::(_startUnderlyingVm) >>> > > > >>>> vmId=`f26dd37e-13b5-430c-b2f2-ecd098b82a91`::_ongoingCreations >>> > > released >>> > > > >>>> Thread-21::ERROR::2014-04-22 >>> > > > >>>> 12:38:17,564::vm::2289::vm.Vm::(_startUnderlyingVm) >>> > > > >>>> vmId=`f26dd37e-13b5-430c-b2f2-ecd098b82a91`::The vm start >>> process >>> > > failed >>> > > > >>>> Traceback (most recent call last): >>> > > > >>>> File "/usr/share/vdsm/vm.py", line 2249, in >>> _startUnderlyingVm >>> > > > >>>> self._run() >>> > > > >>>> File "/usr/share/vdsm/vm.py", line 3170, in _run >>> > > > >>>> self._connection.createXML(domxml, flags), >>> > > > >>>> File >>> > > > >>>> >>> "/usr/lib64/python2.6/site-packages/vdsm/libvirtconnection.py", >>> > > > >>>> line 92, in wrapper >>> > > > >>>> ret = f(*args, **kwargs) >>> > > > >>>> File "/usr/lib64/python2.6/site-packages/libvirt.py", >>> line >>> > > 2665, in >>> > > > >>>> createXML >>> > > > >>>> if ret is None:raise libvirtError('virDomainCreateXML() >>> > > failed', >>> > > > >>>> conn=self) >>> > > > >>>> libvirtError: Failed to acquire lock: No space left on device >>> > > > >>>> >>> > > > >>>> ==> /var/log/messages <== >>> > > > >>>> Apr 22 12:38:17 ovirt-host02 vdsm vm.Vm ERROR >>> > > > >>>> vmId=`f26dd37e-13b5-430c-b2f2-ecd098b82a91`::The vm start >>> process >>> > > > >>>> failed#012Traceback (most recent call last):#012 File >>> > > > >>>> "/usr/share/vdsm/vm.py", line 2249, in _startUnderlyingVm#012 >>> > > > >>>> self._run()#012 File "/usr/share/vdsm/vm.py", line 3170, in >>> > > _run#012 >>> > > > >>>> self._connection.createXML(domxml, flags),#012 File >>> > > > >>>> >>> "/usr/lib64/python2.6/site-packages/vdsm/libvirtconnection.py", >>> > > line 92, >>> > > > >>>> in wrapper#012 ret = f(*args, **kwargs)#012 File >>> > > > >>>> "/usr/lib64/python2.6/site-packages/libvirt.py", line 2665, in >>> > > > >>>> createXML#012 if ret is None:raise >>> > > libvirtError('virDomainCreateXML() >>> > > > >>>> failed', conn=self)#012libvirtError: Failed to acquire lock: >>> No >>> > > space >>> > > > >>>> left on device >>> > > > >>>> >>> > > > >>>> ==> /var/log/vdsm/vdsm.log <== >>> > > > >>>> 
Thread-21::DEBUG::2014-04-22 >>> > > > >>>> 12:38:17,569::vm::2731::vm.Vm::(setDownStatus) >>> > > > >>>> vmId=`f26dd37e-13b5-430c-b2f2-ecd098b82a91`::Changed state to >>> Down: >>> > > > >>>> Failed to acquire lock: No space left on device >>> > > > >>>> >>> > > > >>>> >>> > > > >>>> No space left on device is nonsense as there is enough space >>> (I had >>> > > this >>> > > > >>>> issue last time as well where I had to patch machine.py, but >>> this >>> > > file >>> > > > >>>> is now Python 2.6.6 compatible. >>> > > > >>>> >>> > > > >>>> Any idea what prevents hosted-engine from starting? >>> > > > >>>> ovirt-ha-broker, vdsmd and sanlock are running btw. >>> > > > >>>> >>> > > > >>>> Btw, I can see in log that json rpc server module is missing >>> - which >>> > > > >>>> package is required for CentOS 6.5? >>> > > > >>>> Apr 22 12:37:14 ovirt-host02 vdsm vds WARNING Unable to load >>> the >>> > > json >>> > > > >>>> rpc server module. Please make sure it is installed. >>> > > > >>>> >>> > > > >>>> >>> > > > >>>> Thanks, >>> > > > >>>> René >>> > > > >>>> >>> > > > >>>> >>> > > > >>>> >>> > > > >>>> On 04/17/2014 10:02 AM, Martin Sivak wrote: >>> > > > >>>>> Hi, >>> > > > >>>>> >>> > > > >>>>>>>> How can I disable notifications? >>> > > > >>>>> >>> > > > >>>>> The notification is configured in >>> > > > >>>>> /etc/ovirt-hosted-engine-ha/broker.conf >>> > > > >>>>> section notification. >>> > > > >>>>> The email is sent when the key state_transition exists and >>> the >>> > > string >>> > > > >>>>> OldState-NewState contains the (case insensitive) regexp >>> from the >>> > > > >>>>> value. >>> > > > >>>>> >>> > > > >>>>>>>> Is it intended to send out these messages and detect that >>> ovirt >>> > > > >>>>>>>> engine >>> > > > >>>>>>>> is down (which is false anyway), but not to restart the >>> vm? >>> > > > >>>>> >>> > > > >>>>> Forget about emails for now and check the >>> > > > >>>>> /var/log/ovirt-hosted-engine-ha/agent.log and broker.log (and >>> > > attach >>> > > > >>>>> them >>> > > > >>>>> as well btw). >>> > > > >>>>> >>> > > > >>>>>>>> oVirt hosts think that hosted engine is down because it >>> seems >>> > > that >>> > > > >>>>>>>> hosts >>> > > > >>>>>>>> can't write to hosted-engine.lockspace due to glusterfs >>> issues >>> > > (or >>> > > > >>>>>>>> at >>> > > > >>>>>>>> least I think so). >>> > > > >>>>> >>> > > > >>>>> The hosts think so or can't really write there? The >>> lockspace is >>> > > > >>>>> managed >>> > > > >>>>> by >>> > > > >>>>> sanlock and our HA daemons do not touch it at all. We only >>> ask >>> > > sanlock >>> > > > >>>>> to >>> > > > >>>>> get make sure we have unique server id. >>> > > > >>>>> >>> > > > >>>>>>>> Is is possible or planned to make the whole ha feature >>> optional? >>> > > > >>>>> >>> > > > >>>>> Well the system won't perform any automatic actions if you >>> put the >>> > > > >>>>> hosted >>> > > > >>>>> engine to global maintenance and only start/stop/migrate the >>> VM >>> > > > >>>>> manually. >>> > > > >>>>> I would discourage you from stopping agent/broker, because >>> the >>> > > engine >>> > > > >>>>> itself has some logic based on the reporting. 
>>> > > > >>>>> >>> > > > >>>>> Regards >>> > > > >>>>> >>> > > > >>>>> -- >>> > > > >>>>> Martin Sivák >>> > > > >>>>> msivak@redhat.com >>> > > > >>>>> Red Hat Czech >>> > > > >>>>> RHEV-M SLA / Brno, CZ >>> > > > >>>>> >>> > > > >>>>> ----- Original Message ----- >>> > > > >>>>>> On 04/15/2014 04:53 PM, Jiri Moskovcak wrote: >>> > > > >>>>>>> On 04/14/2014 10:50 AM, René Koch wrote: >>> > > > >>>>>>>> Hi, >>> > > > >>>>>>>> >>> > > > >>>>>>>> I have some issues with hosted engine status. >>> > > > >>>>>>>> >>> > > > >>>>>>>> oVirt hosts think that hosted engine is down because it >>> seems >>> > > that >>> > > > >>>>>>>> hosts >>> > > > >>>>>>>> can't write to hosted-engine.lockspace due to glusterfs >>> issues >>> > > (or >>> > > > >>>>>>>> at >>> > > > >>>>>>>> least I think so). >>> > > > >>>>>>>> >>> > > > >>>>>>>> Here's the output of vm-status: >>> > > > >>>>>>>> >>> > > > >>>>>>>> # hosted-engine --vm-status >>> > > > >>>>>>>> >>> > > > >>>>>>>> >>> > > > >>>>>>>> --== Host 1 status ==-- >>> > > > >>>>>>>> >>> > > > >>>>>>>> Status up-to-date : False >>> > > > >>>>>>>> Hostname : 10.0.200.102 >>> > > > >>>>>>>> Host ID : 1 >>> > > > >>>>>>>> Engine status : unknown stale-data >>> > > > >>>>>>>> Score : 2400 >>> > > > >>>>>>>> Local maintenance : False >>> > > > >>>>>>>> Host timestamp : 1397035677 >>> > > > >>>>>>>> Extra metadata (valid at timestamp): >>> > > > >>>>>>>> metadata_parse_version=1 >>> > > > >>>>>>>> metadata_feature_version=1 >>> > > > >>>>>>>> timestamp=1397035677 (Wed Apr 9 11:27:57 2014) >>> > > > >>>>>>>> host-id=1 >>> > > > >>>>>>>> score=2400 >>> > > > >>>>>>>> maintenance=False >>> > > > >>>>>>>> state=EngineUp >>> > > > >>>>>>>> >>> > > > >>>>>>>> >>> > > > >>>>>>>> --== Host 2 status ==-- >>> > > > >>>>>>>> >>> > > > >>>>>>>> Status up-to-date : True >>> > > > >>>>>>>> Hostname : 10.0.200.101 >>> > > > >>>>>>>> Host ID : 2 >>> > > > >>>>>>>> Engine status : {'reason': 'vm not >>> running >>> > > on >>> > > > >>>>>>>> this >>> > > > >>>>>>>> host', 'health': 'bad', 'vm': 'down', 'detail': 'unknown'} >>> > > > >>>>>>>> Score : 0 >>> > > > >>>>>>>> Local maintenance : False >>> > > > >>>>>>>> Host timestamp : 1397464031 >>> > > > >>>>>>>> Extra metadata (valid at timestamp): >>> > > > >>>>>>>> metadata_parse_version=1 >>> > > > >>>>>>>> metadata_feature_version=1 >>> > > > >>>>>>>> timestamp=1397464031 (Mon Apr 14 10:27:11 2014) >>> > > > >>>>>>>> host-id=2 >>> > > > >>>>>>>> score=0 >>> > > > >>>>>>>> maintenance=False >>> > > > >>>>>>>> state=EngineUnexpectedlyDown >>> > > > >>>>>>>> timeout=Mon Apr 14 10:35:05 2014 >>> > > > >>>>>>>> >>> > > > >>>>>>>> oVirt engine is sending me 2 emails every 10 minutes with >>> the >>> > > > >>>>>>>> following >>> > > > >>>>>>>> subjects: >>> > > > >>>>>>>> - ovirt-hosted-engine state transition >>> EngineDown-EngineStart >>> > > > >>>>>>>> - ovirt-hosted-engine state transition >>> EngineStart-EngineUp >>> > > > >>>>>>>> >>> > > > >>>>>>>> In oVirt webadmin I can see the following message: >>> > > > >>>>>>>> VM HostedEngine is down. Exit message: internal error >>> Failed to >>> > > > >>>>>>>> acquire >>> > > > >>>>>>>> lock: error -243. >>> > > > >>>>>>>> >>> > > > >>>>>>>> These messages are really annoying as oVirt isn't doing >>> anything >>> > > > >>>>>>>> with >>> > > > >>>>>>>> hosted engine - I have an uptime of 9 days in my engine >>> vm. 
>>> > > > >>>>>>>> >>> > > > >>>>>>>> So my questions are now: >>> > > > >>>>>>>> Is it intended to send out these messages and detect that >>> ovirt >>> > > > >>>>>>>> engine >>> > > > >>>>>>>> is down (which is false anyway), but not to restart the >>> vm? >>> > > > >>>>>>>> >>> > > > >>>>>>>> How can I disable notifications? I'm planning to write a >>> Nagios >>> > > > >>>>>>>> plugin >>> > > > >>>>>>>> which parses the output of hosted-engine --vm-status and >>> only >>> > > Nagios >>> > > > >>>>>>>> should notify me, not hosted-engine script. >>> > > > >>>>>>>> >>> > > > >>>>>>>> Is is possible or planned to make the whole ha feature >>> > > optional? I >>> > > > >>>>>>>> really really really hate cluster software as it causes >>> more >>> > > > >>>>>>>> troubles >>> > > > >>>>>>>> then standalone machines and in my case the hosted-engine >>> ha >>> > > feature >>> > > > >>>>>>>> really causes troubles (and I didn't had a hardware or >>> network >>> > > > >>>>>>>> outage >>> > > > >>>>>>>> yet only issues with hosted-engine ha agent). I don't >>> need any >>> > > ha >>> > > > >>>>>>>> feature for hosted engine. I just want to run engine >>> > > virtualized on >>> > > > >>>>>>>> oVirt and if engine vm fails (e.g. because of issues with >>> a >>> > > host) >>> > > > >>>>>>>> I'll >>> > > > >>>>>>>> restart it on another node. >>> > > > >>>>>>> >>> > > > >>>>>>> Hi, you can: >>> > > > >>>>>>> 1. edit >>> /etc/ovirt-hosted-engine-ha/{agent,broker}-log.conf and >>> > > tweak >>> > > > >>>>>>> the logger as you like >>> > > > >>>>>>> 2. or kill ovirt-ha-broker & ovirt-ha-agent services >>> > > > >>>>>> >>> > > > >>>>>> Thanks for the information. >>> > > > >>>>>> So engine is able to run when ovirt-ha-broker and >>> ovirt-ha-agent >>> > > isn't >>> > > > >>>>>> running? >>> > > > >>>>>> >>> > > > >>>>>> >>> > > > >>>>>> Regards, >>> > > > >>>>>> René >>> > > > >>>>>> >>> > > > >>>>>>> >>> > > > >>>>>>> --Jirka >>> > > > >>>>>>>> >>> > > > >>>>>>>> Thanks, >>> > > > >>>>>>>> René >>> > > > >>>>>>>> >>> > > > >>>>>>>> >>> > > > >>>>>>> >>> > > > >>>>>> _______________________________________________ >>> > > > >>>>>> Users mailing list >>> > > > >>>>>> Users@ovirt.org >>> > > > >>>>>> http://lists.ovirt.org/mailman/listinfo/users >>> > > > >>>>>> >>> > > > >>>> _______________________________________________ >>> > > > >>>> Users mailing list >>> > > > >>>> Users@ovirt.org >>> > > > >>>> http://lists.ovirt.org/mailman/listinfo/users >>> > > > >>>> >>> > > > >> >>> > > > >>> > > _______________________________________________ >>> > > Users mailing list >>> > > Users@ovirt.org >>> > > http://lists.ovirt.org/mailman/listinfo/users >>> > > >>> > >>> _______________________________________________ >>> Users mailing list >>> Users@ovirt.org >>> http://lists.ovirt.org/mailman/listinfo/users >>> >> >> >

Hi Kevin, can you please tell us what version of hosted-engine are you running? rpm -q ovirt-hosted-engine-ha Also, do I understand it correctly that the engine VM is running, but you see bad status when you execute the hosted-engine --vm-status command? If that is so, can you give us current logs from /var/log/ovirt-hosted-engine-ha? -- Martin Sivák msivak@redhat.com Red Hat Czech RHEV-M SLA / Brno, CZ ----- Original Message ----- > Ok i mount manualy the domain for hosted engine and agent go up. > > But vm-status : > > --== Host 2 status ==-- > > Status up-to-date : False > Hostname : 192.168.99.103 > Host ID : 2 > Engine status : unknown stale-data > Score : 0 > Local maintenance : False > Host timestamp : 1398333438 > > And in my engine, host02 Ha is no active. > > > 2014-04-24 12:48 GMT+02:00 Kevin Tibi <kevintibi@hotmail.com>: > > > Hi, > > > > I try to reboot my hosts and now [supervdsmServer] is <defunct>. > > > > /var/log/vdsm/supervdsm.log > > > > > > MainProcess|Thread-120::DEBUG::2014-04-24 > > 12:22:19,955::supervdsmServer::103::SuperVdsm.ServerCallback::(wrapper) > > return validateAccess with None > > MainProcess|Thread-120::DEBUG::2014-04-24 > > 12:22:20,010::supervdsmServer::96::SuperVdsm.ServerCallback::(wrapper) call > > validateAccess with ('qemu', ('qemu', 'kvm'), > > '/rhev/data-center/mnt/host01.ovirt.lan:_home_export', 5) {} > > MainProcess|Thread-120::DEBUG::2014-04-24 > > 12:22:20,014::supervdsmServer::103::SuperVdsm.ServerCallback::(wrapper) > > return validateAccess with None > > MainProcess|Thread-120::DEBUG::2014-04-24 > > 12:22:20,059::supervdsmServer::96::SuperVdsm.ServerCallback::(wrapper) call > > validateAccess with ('qemu', ('qemu', 'kvm'), > > '/rhev/data-center/mnt/host01.ovirt.lan:_home_iso', 5) {} > > MainProcess|Thread-120::DEBUG::2014-04-24 > > 12:22:20,063::supervdsmServer::103::SuperVdsm.ServerCallback::(wrapper) > > return validateAccess with None > > > > and one host don't mount the NFS used for hosted engine. > > > > MainThread::CRITICAL::2014-04-24 > > 12:36:16,603::agent::103::ovirt_hosted_engine_ha.agent.agent.Agent::(run) > > Could not start ha-agent > > Traceback (most recent call last): > > File > > "/usr/lib/python2.6/site-packages/ovirt_hosted_engine_ha/agent/agent.py", > > line 97, in run > > self._run_agent() > > File > > "/usr/lib/python2.6/site-packages/ovirt_hosted_engine_ha/agent/agent.py", > > line 154, in _run_agent > > hosted_engine.HostedEngine(self.shutdown_requested).start_monitoring() > > File > > "/usr/lib/python2.6/site-packages/ovirt_hosted_engine_ha/agent/hosted_engine.py", > > line 299, in start_monitoring > > self._initialize_vdsm() > > File > > "/usr/lib/python2.6/site-packages/ovirt_hosted_engine_ha/agent/hosted_engine.py", > > line 418, in _initialize_vdsm > > self._sd_path = env_path.get_domain_path(self._config) > > File > > "/usr/lib/python2.6/site-packages/ovirt_hosted_engine_ha/env/path.py", line > > 40, in get_domain_path > > .format(sd_uuid, parent)) > > Exception: path to storage domain aea040f8-ab9d-435b-9ecf-ddd4272e592f not > > found in /rhev/data-center/mnt > > > > > > > > 2014-04-23 17:40 GMT+02:00 Kevin Tibi <kevintibi@hotmail.com>: > > > > top > >> 1729 vdsm 20 0 0 0 0 Z 373.8 0.0 252:08.51 > >> ovirt-ha-broker <defunct> > >> > >> > >> [root@host01 ~]# ps axwu | grep 1729 > >> vdsm 1729 0.7 0.0 0 0 ? 
Zl Apr02 240:24 > >> [ovirt-ha-broker] <defunct> > >> > >> [root@host01 ~]# ll > >> /rhev/data-center/mnt/host01.ovirt.lan\:_home_NFS01/aea040f8-ab9d-435b-9ecf-ddd4272e592f/ha_agent/ > >> total 2028 > >> -rw-rw----. 1 vdsm kvm 1048576 23 avril 17:35 hosted-engine.lockspace > >> -rw-rw----. 1 vdsm kvm 1028096 23 avril 17:35 hosted-engine.metadata > >> > >> cat /var/log/vdsm/vdsm.log > >> > >> Thread-120518::DEBUG::2014-04-23 > >> 17:38:02,299::task::1185::TaskManager.Task::(prepare) > >> Task=`f13e71f1-ac7c-49ab-8079-8f099ebf72b6`::finished: > >> {'aea040f8-ab9d-435b-9ecf-ddd4272e592f': {'code': 0, 'version': 3, > >> 'acquired': True, 'delay': '0.000410963', 'lastCheck': '3.4', 'valid': > >> True}, '5ae613a4-44e4-42cb-89fc-7b5d34c1f30f': {'code': 0, 'version': 3, > >> 'acquired': True, 'delay': '0.000412357', 'lastCheck': '6.8', 'valid': > >> True}, 'cc51143e-8ad7-4b0b-a4d2-9024dffc1188': {'code': 0, 'version': 0, > >> 'acquired': True, 'delay': '0.000455292', 'lastCheck': '1.2', 'valid': > >> True}, 'ff98d346-4515-4349-8437-fb2f5e9eaadf': {'code': 0, 'version': 0, > >> 'acquired': True, 'delay': '0.00817113', 'lastCheck': '1.7', 'valid': > >> True}} > >> Thread-120518::DEBUG::2014-04-23 > >> 17:38:02,300::task::595::TaskManager.Task::(_updateState) > >> Task=`f13e71f1-ac7c-49ab-8079-8f099ebf72b6`::moving from state preparing > >> -> > >> state finished > >> Thread-120518::DEBUG::2014-04-23 > >> 17:38:02,300::resourceManager::940::ResourceManager.Owner::(releaseAll) > >> Owner.releaseAll requests {} resources {} > >> Thread-120518::DEBUG::2014-04-23 > >> 17:38:02,300::resourceManager::977::ResourceManager.Owner::(cancelAll) > >> Owner.cancelAll requests {} > >> Thread-120518::DEBUG::2014-04-23 > >> 17:38:02,300::task::990::TaskManager.Task::(_decref) > >> Task=`f13e71f1-ac7c-49ab-8079-8f099ebf72b6`::ref 0 aborting False > >> Thread-120518::ERROR::2014-04-23 > >> 17:38:02,302::brokerlink::72::ovirt_hosted_engine_ha.lib.brokerlink.BrokerLink::(connect) > >> Failed to connect to broker: [Errno 2] No such file or directory > >> Thread-120518::ERROR::2014-04-23 > >> 17:38:02,302::API::1612::vds::(_getHaInfo) failed to retrieve Hosted > >> Engine > >> HA info > >> Traceback (most recent call last): > >> File "/usr/share/vdsm/API.py", line 1603, in _getHaInfo > >> stats = instance.get_all_stats() > >> File > >> "/usr/lib/python2.6/site-packages/ovirt_hosted_engine_ha/client/client.py", > >> line 83, in get_all_stats > >> with broker.connection(): > >> File "/usr/lib64/python2.6/contextlib.py", line 16, in __enter__ > >> return self.gen.next() > >> File > >> "/usr/lib/python2.6/site-packages/ovirt_hosted_engine_ha/lib/brokerlink.py", > >> line 96, in connection > >> self.connect() > >> File > >> "/usr/lib/python2.6/site-packages/ovirt_hosted_engine_ha/lib/brokerlink.py", > >> line 64, in connect > >> self._socket.connect(constants.BROKER_SOCKET_FILE) > >> File "<string>", line 1, in connect > >> error: [Errno 2] No such file or directory > >> Thread-78::DEBUG::2014-04-23 > >> 17:38:05,490::fileSD::225::Storage.Misc.excCmd::(getReadDelay) '/bin/dd > >> iflag=direct > >> if=/rhev/data-center/mnt/host01.ovirt.lan:_home_DATA/5ae613a4-44e4-42cb-89fc-7b5d34c1f30f/dom_md/metadata > >> bs=4096 count=1' (cwd None) > >> Thread-78::DEBUG::2014-04-23 > >> 17:38:05,523::fileSD::225::Storage.Misc.excCmd::(getReadDelay) SUCCESS: > >> <err> = '0+1 records in\n0+1 records out\n545 bytes (545 B) copied, > >> 0.000412209 s, 1.3 MB/s\n'; <rc> = 0 > >> > >> > >> > >> > >> 2014-04-23 17:27 GMT+02:00 Martin Sivak 
<msivak@redhat.com>: > >> > >> Hi Kevin, > >>> > >>> > same pb. > >>> > >>> Are you missing the lockspace file as well while running on top of > >>> GlusterFS? > >>> > >>> > ovirt-ha-broker have 400% cpu and is defunct. I can't kill with -9. > >>> > >>> Defunct process eating full four cores? I wonder how is that possible.. > >>> What are the status flags of that process when you do ps axwu? > >>> > >>> Can you attach the log files please? > >>> > >>> -- > >>> Martin Sivák > >>> msivak@redhat.com > >>> Red Hat Czech > >>> RHEV-M SLA / Brno, CZ > >>> > >>> ----- Original Message ----- > >>> > same pb. ovirt-ha-broker have 400% cpu and is defunct. I can't kill > >>> with -9. > >>> > > >>> > > >>> > 2014-04-23 13:55 GMT+02:00 Martin Sivak <msivak@redhat.com>: > >>> > > >>> > > Hi, > >>> > > > >>> > > > Isn't this file created when hosted engine is started? > >>> > > > >>> > > The file is created by the setup script. If it got lost then there > >>> was > >>> > > probably something bad happening in your NFS or Gluster storage. > >>> > > > >>> > > > Or how can I create this file manually? > >>> > > > >>> > > I can give you experimental treatment for this. We do not have any > >>> > > official way as this is something that should not ever happen :) > >>> > > > >>> > > !! But before you do that make sure you do not have any nodes running > >>> > > properly. This will destroy and reinitialize the lockspace database > >>> for the > >>> > > whole hosted-engine environment (which you apparently lack, but..). > >>> !! > >>> > > > >>> > > You have to create the ha_agent/hosted-engine.lockspace file with the > >>> > > expected size (1MB) and then tell sanlock to initialize it as a > >>> lockspace > >>> > > using: > >>> > > > >>> > > # python > >>> > > >>> import sanlock > >>> > > >>> sanlock.write_lockspace(lockspace="hosted-engine", > >>> > > ... path="/rhev/data-center/mnt/<nfs>/<hosted engine storage > >>> > > domain>/ha_agent/hosted-engine.lockspace", > >>> > > ... offset=0) > >>> > > >>> > >>> > > > >>> > > Then try starting the services (both broker and agent) again. > >>> > > > >>> > > -- > >>> > > Martin Sivák > >>> > > msivak@redhat.com > >>> > > Red Hat Czech > >>> > > RHEV-M SLA / Brno, CZ > >>> > > > >>> > > > >>> > > ----- Original Message ----- > >>> > > > On 04/23/2014 11:08 AM, Martin Sivak wrote: > >>> > > > > Hi René, > >>> > > > > > >>> > > > >>>> libvirtError: Failed to acquire lock: No space left on device > >>> > > > > > >>> > > > >>>> 2014-04-22 12:38:17+0200 654 [3093]: r2 cmd_acquire 2,9,5733 > >>> invalid > >>> > > > >>>> lockspace found -1 failed 0 name > >>> > > 2851af27-8744-445d-9fb1-a0d083c8dc82 > >>> > > > > > >>> > > > > Can you please check the contents of /rhev/data-center/<your nfs > >>> > > > > mount>/<nfs domain uuid>/ha_agent/? > >>> > > > > > >>> > > > > This is how it should look like: > >>> > > > > > >>> > > > > [root@dev-03 ~]# ls -al > >>> > > > > > >>> > > > >>> /rhev/data-center/mnt/euryale\:_home_ovirt_he/e16de6a2-53f5-4ab3-95a3-255d08398824/ha_agent/ > >>> > > > > total 2036 > >>> > > > > drwxr-x---. 2 vdsm kvm 4096 Mar 19 18:46 . > >>> > > > > drwxr-xr-x. 6 vdsm kvm 4096 Mar 19 18:46 .. > >>> > > > > -rw-rw----. 1 vdsm kvm 1048576 Apr 23 11:05 > >>> hosted-engine.lockspace > >>> > > > > -rw-rw----. 1 vdsm kvm 1028096 Mar 19 18:46 > >>> hosted-engine.metadata > >>> > > > > > >>> > > > > The errors seem to indicate that you somehow lost the lockspace > >>> file. 
> >>> > > > > >>> > > > True :) > >>> > > > Isn't this file created when hosted engine is started? Or how can I > >>> > > > create this file manually? > >>> > > > > >>> > > > > > >>> > > > > -- > >>> > > > > Martin Sivák > >>> > > > > msivak@redhat.com > >>> > > > > Red Hat Czech > >>> > > > > RHEV-M SLA / Brno, CZ > >>> > > > > > >>> > > > > ----- Original Message ----- > >>> > > > >> On 04/23/2014 12:28 AM, Doron Fediuck wrote: > >>> > > > >>> Hi Rene, > >>> > > > >>> any idea what closed your ovirtmgmt bridge? > >>> > > > >>> as long as it is down vdsm may have issues starting up properly > >>> > > > >>> and this is why you see the complaints on the rpc server. > >>> > > > >>> > >>> > > > >>> Can you try manually fixing the network part first and then > >>> > > > >>> restart vdsm? > >>> > > > >>> Once vdsm is happy hosted engine VM will start. > >>> > > > >> > >>> > > > >> Thanks for your feedback, Doron. > >>> > > > >> > >>> > > > >> My ovirtmgmt bridge seems to be on or isn't it: > >>> > > > >> # brctl show ovirtmgmt > >>> > > > >> bridge name bridge id STP enabled > >>> interfaces > >>> > > > >> ovirtmgmt 8000.0025907587c2 no > >>> eth0.200 > >>> > > > >> > >>> > > > >> # ip a s ovirtmgmt > >>> > > > >> 7: ovirtmgmt: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc > >>> noqueue > >>> > > > >> state UNKNOWN > >>> > > > >> link/ether 00:25:90:75:87:c2 brd ff:ff:ff:ff:ff:ff > >>> > > > >> inet 10.0.200.102/24 brd 10.0.200.255 scope global > >>> ovirtmgmt > >>> > > > >> inet6 fe80::225:90ff:fe75:87c2/64 scope link > >>> > > > >> valid_lft forever preferred_lft forever > >>> > > > >> > >>> > > > >> # ip a s eth0.200 > >>> > > > >> 6: eth0.200@eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 > >>> qdisc > >>> > > > >> noqueue state UP > >>> > > > >> link/ether 00:25:90:75:87:c2 brd ff:ff:ff:ff:ff:ff > >>> > > > >> inet6 fe80::225:90ff:fe75:87c2/64 scope link > >>> > > > >> valid_lft forever preferred_lft forever > >>> > > > >> > >>> > > > >> I tried the following yesterday: > >>> > > > >> Copy virtual disk from GlusterFS storage to local disk of host > >>> and > >>> > > > >> create a new vm with virt-manager which loads ovirtmgmt disk. I > >>> could > >>> > > > >> reach my engine over the ovirtmgmt bridge (so bridge must be > >>> working). > >>> > > > >> > >>> > > > >> I also started libvirtd with Option -v and I saw the following > >>> in > >>> > > > >> libvirtd.log when trying to start ovirt engine: > >>> > > > >> 2014-04-22 14:18:25.432+0000: 8901: debug : > >>> virCommandRunAsync:2250 : > >>> > > > >> Command result 0, with PID 11491 > >>> > > > >> 2014-04-22 14:18:25.478+0000: 8901: debug : virCommandRun:2045 : > >>> > > Result > >>> > > > >> exit status 255, stdout: '' stderr: 'iptables v1.4.7: goto > >>> 'FO-vnet0' > >>> > > is > >>> > > > >> not a chain > >>> > > > >> > >>> > > > >> So it could be that something is broken in my hosted-engine > >>> network. > >>> > > Do > >>> > > > >> you have any clue how I can troubleshoot this? 
> >>> > > > >> > >>> > > > >> > >>> > > > >> Thanks, > >>> > > > >> René > >>> > > > >> > >>> > > > >> > >>> > > > >>> > >>> > > > >>> ----- Original Message ----- > >>> > > > >>>> From: "René Koch" <rkoch@linuxland.at> > >>> > > > >>>> To: "Martin Sivak" <msivak@redhat.com> > >>> > > > >>>> Cc: users@ovirt.org > >>> > > > >>>> Sent: Tuesday, April 22, 2014 1:46:38 PM > >>> > > > >>>> Subject: Re: [ovirt-users] hosted engine health check issues > >>> > > > >>>> > >>> > > > >>>> Hi, > >>> > > > >>>> > >>> > > > >>>> I rebooted one of my ovirt hosts today and the result is now > >>> that I > >>> > > > >>>> can't start hosted-engine anymore. > >>> > > > >>>> > >>> > > > >>>> ovirt-ha-agent isn't running because the lockspace file is > >>> missing > >>> > > > >>>> (sanlock complains about it). > >>> > > > >>>> So I tried to start hosted-engine with --vm-start and I get > >>> the > >>> > > > >>>> following errors: > >>> > > > >>>> > >>> > > > >>>> ==> /var/log/sanlock.log <== > >>> > > > >>>> 2014-04-22 12:38:17+0200 654 [3093]: r2 cmd_acquire 2,9,5733 > >>> invalid > >>> > > > >>>> lockspace found -1 failed 0 name > >>> > > 2851af27-8744-445d-9fb1-a0d083c8dc82 > >>> > > > >>>> > >>> > > > >>>> ==> /var/log/messages <== > >>> > > > >>>> Apr 22 12:38:17 ovirt-host02 sanlock[3079]: 2014-04-22 > >>> > > 12:38:17+0200 654 > >>> > > > >>>> [3093]: r2 cmd_acquire 2,9,5733 invalid lockspace found -1 > >>> failed 0 > >>> > > name > >>> > > > >>>> 2851af27-8744-445d-9fb1-a0d083c8dc82 > >>> > > > >>>> Apr 22 12:38:17 ovirt-host02 kernel: ovirtmgmt: port 2(vnet0) > >>> > > entering > >>> > > > >>>> disabled state > >>> > > > >>>> Apr 22 12:38:17 ovirt-host02 kernel: device vnet0 left > >>> promiscuous > >>> > > mode > >>> > > > >>>> Apr 22 12:38:17 ovirt-host02 kernel: ovirtmgmt: port 2(vnet0) > >>> > > entering > >>> > > > >>>> disabled state > >>> > > > >>>> > >>> > > > >>>> ==> /var/log/vdsm/vdsm.log <== > >>> > > > >>>> Thread-21::DEBUG::2014-04-22 > >>> > > > >>>> 12:38:17,563::libvirtconnection::124::root::(wrapper) Unknown > >>> > > > >>>> libvirterror: ecode: 38 edom: 42 level: 2 message: Failed to > >>> acquire > >>> > > > >>>> lock: No space left on device > >>> > > > >>>> Thread-21::DEBUG::2014-04-22 > >>> > > > >>>> 12:38:17,563::vm::2263::vm.Vm::(_startUnderlyingVm) > >>> > > > >>>> vmId=`f26dd37e-13b5-430c-b2f2-ecd098b82a91`::_ongoingCreations > >>> > > released > >>> > > > >>>> Thread-21::ERROR::2014-04-22 > >>> > > > >>>> 12:38:17,564::vm::2289::vm.Vm::(_startUnderlyingVm) > >>> > > > >>>> vmId=`f26dd37e-13b5-430c-b2f2-ecd098b82a91`::The vm start > >>> process > >>> > > failed > >>> > > > >>>> Traceback (most recent call last): > >>> > > > >>>> File "/usr/share/vdsm/vm.py", line 2249, in > >>> _startUnderlyingVm > >>> > > > >>>> self._run() > >>> > > > >>>> File "/usr/share/vdsm/vm.py", line 3170, in _run > >>> > > > >>>> self._connection.createXML(domxml, flags), > >>> > > > >>>> File > >>> > > > >>>> > >>> "/usr/lib64/python2.6/site-packages/vdsm/libvirtconnection.py", > >>> > > > >>>> line 92, in wrapper > >>> > > > >>>> ret = f(*args, **kwargs) > >>> > > > >>>> File "/usr/lib64/python2.6/site-packages/libvirt.py", > >>> line > >>> > > 2665, in > >>> > > > >>>> createXML > >>> > > > >>>> if ret is None:raise libvirtError('virDomainCreateXML() > >>> > > failed', > >>> > > > >>>> conn=self) > >>> > > > >>>> libvirtError: Failed to acquire lock: No space left on device > >>> > > > >>>> > >>> > > > >>>> ==> /var/log/messages <== > >>> > > > >>>> Apr 22 12:38:17 ovirt-host02 vdsm vm.Vm ERROR 
> >>> > > > >>>> vmId=`f26dd37e-13b5-430c-b2f2-ecd098b82a91`::The vm start > >>> process > >>> > > > >>>> failed#012Traceback (most recent call last):#012 File > >>> > > > >>>> "/usr/share/vdsm/vm.py", line 2249, in _startUnderlyingVm#012 > >>> > > > >>>> self._run()#012 File "/usr/share/vdsm/vm.py", line 3170, in > >>> > > _run#012 > >>> > > > >>>> self._connection.createXML(domxml, flags),#012 File > >>> > > > >>>> > >>> "/usr/lib64/python2.6/site-packages/vdsm/libvirtconnection.py", > >>> > > line 92, > >>> > > > >>>> in wrapper#012 ret = f(*args, **kwargs)#012 File > >>> > > > >>>> "/usr/lib64/python2.6/site-packages/libvirt.py", line 2665, in > >>> > > > >>>> createXML#012 if ret is None:raise > >>> > > libvirtError('virDomainCreateXML() > >>> > > > >>>> failed', conn=self)#012libvirtError: Failed to acquire lock: > >>> No > >>> > > space > >>> > > > >>>> left on device > >>> > > > >>>> > >>> > > > >>>> ==> /var/log/vdsm/vdsm.log <== > >>> > > > >>>> Thread-21::DEBUG::2014-04-22 > >>> > > > >>>> 12:38:17,569::vm::2731::vm.Vm::(setDownStatus) > >>> > > > >>>> vmId=`f26dd37e-13b5-430c-b2f2-ecd098b82a91`::Changed state to > >>> Down: > >>> > > > >>>> Failed to acquire lock: No space left on device > >>> > > > >>>> > >>> > > > >>>> > >>> > > > >>>> No space left on device is nonsense as there is enough space > >>> (I had > >>> > > this > >>> > > > >>>> issue last time as well where I had to patch machine.py, but > >>> this > >>> > > file > >>> > > > >>>> is now Python 2.6.6 compatible. > >>> > > > >>>> > >>> > > > >>>> Any idea what prevents hosted-engine from starting? > >>> > > > >>>> ovirt-ha-broker, vdsmd and sanlock are running btw. > >>> > > > >>>> > >>> > > > >>>> Btw, I can see in log that json rpc server module is missing > >>> - which > >>> > > > >>>> package is required for CentOS 6.5? > >>> > > > >>>> Apr 22 12:37:14 ovirt-host02 vdsm vds WARNING Unable to load > >>> the > >>> > > json > >>> > > > >>>> rpc server module. Please make sure it is installed. > >>> > > > >>>> > >>> > > > >>>> > >>> > > > >>>> Thanks, > >>> > > > >>>> René > >>> > > > >>>> > >>> > > > >>>> > >>> > > > >>>> > >>> > > > >>>> On 04/17/2014 10:02 AM, Martin Sivak wrote: > >>> > > > >>>>> Hi, > >>> > > > >>>>> > >>> > > > >>>>>>>> How can I disable notifications? > >>> > > > >>>>> > >>> > > > >>>>> The notification is configured in > >>> > > > >>>>> /etc/ovirt-hosted-engine-ha/broker.conf > >>> > > > >>>>> section notification. > >>> > > > >>>>> The email is sent when the key state_transition exists and > >>> the > >>> > > string > >>> > > > >>>>> OldState-NewState contains the (case insensitive) regexp > >>> from the > >>> > > > >>>>> value. > >>> > > > >>>>> > >>> > > > >>>>>>>> Is it intended to send out these messages and detect that > >>> ovirt > >>> > > > >>>>>>>> engine > >>> > > > >>>>>>>> is down (which is false anyway), but not to restart the > >>> vm? > >>> > > > >>>>> > >>> > > > >>>>> Forget about emails for now and check the > >>> > > > >>>>> /var/log/ovirt-hosted-engine-ha/agent.log and broker.log (and > >>> > > attach > >>> > > > >>>>> them > >>> > > > >>>>> as well btw). > >>> > > > >>>>> > >>> > > > >>>>>>>> oVirt hosts think that hosted engine is down because it > >>> seems > >>> > > that > >>> > > > >>>>>>>> hosts > >>> > > > >>>>>>>> can't write to hosted-engine.lockspace due to glusterfs > >>> issues > >>> > > (or > >>> > > > >>>>>>>> at > >>> > > > >>>>>>>> least I think so). > >>> > > > >>>>> > >>> > > > >>>>> The hosts think so or can't really write there? 
The > >>> lockspace is > >>> > > > >>>>> managed > >>> > > > >>>>> by > >>> > > > >>>>> sanlock and our HA daemons do not touch it at all. We only > >>> ask > >>> > > sanlock > >>> > > > >>>>> to > >>> > > > >>>>> get make sure we have unique server id. > >>> > > > >>>>> > >>> > > > >>>>>>>> Is is possible or planned to make the whole ha feature > >>> optional? > >>> > > > >>>>> > >>> > > > >>>>> Well the system won't perform any automatic actions if you > >>> put the > >>> > > > >>>>> hosted > >>> > > > >>>>> engine to global maintenance and only start/stop/migrate the > >>> VM > >>> > > > >>>>> manually. > >>> > > > >>>>> I would discourage you from stopping agent/broker, because > >>> the > >>> > > engine > >>> > > > >>>>> itself has some logic based on the reporting. > >>> > > > >>>>> > >>> > > > >>>>> Regards > >>> > > > >>>>> > >>> > > > >>>>> -- > >>> > > > >>>>> Martin Sivák > >>> > > > >>>>> msivak@redhat.com > >>> > > > >>>>> Red Hat Czech > >>> > > > >>>>> RHEV-M SLA / Brno, CZ > >>> > > > >>>>> > >>> > > > >>>>> ----- Original Message ----- > >>> > > > >>>>>> On 04/15/2014 04:53 PM, Jiri Moskovcak wrote: > >>> > > > >>>>>>> On 04/14/2014 10:50 AM, René Koch wrote: > >>> > > > >>>>>>>> Hi, > >>> > > > >>>>>>>> > >>> > > > >>>>>>>> I have some issues with hosted engine status. > >>> > > > >>>>>>>> > >>> > > > >>>>>>>> oVirt hosts think that hosted engine is down because it > >>> seems > >>> > > that > >>> > > > >>>>>>>> hosts > >>> > > > >>>>>>>> can't write to hosted-engine.lockspace due to glusterfs > >>> issues > >>> > > (or > >>> > > > >>>>>>>> at > >>> > > > >>>>>>>> least I think so). > >>> > > > >>>>>>>> > >>> > > > >>>>>>>> Here's the output of vm-status: > >>> > > > >>>>>>>> > >>> > > > >>>>>>>> # hosted-engine --vm-status > >>> > > > >>>>>>>> > >>> > > > >>>>>>>> > >>> > > > >>>>>>>> --== Host 1 status ==-- > >>> > > > >>>>>>>> > >>> > > > >>>>>>>> Status up-to-date : False > >>> > > > >>>>>>>> Hostname : 10.0.200.102 > >>> > > > >>>>>>>> Host ID : 1 > >>> > > > >>>>>>>> Engine status : unknown stale-data > >>> > > > >>>>>>>> Score : 2400 > >>> > > > >>>>>>>> Local maintenance : False > >>> > > > >>>>>>>> Host timestamp : 1397035677 > >>> > > > >>>>>>>> Extra metadata (valid at timestamp): > >>> > > > >>>>>>>> metadata_parse_version=1 > >>> > > > >>>>>>>> metadata_feature_version=1 > >>> > > > >>>>>>>> timestamp=1397035677 (Wed Apr 9 11:27:57 2014) > >>> > > > >>>>>>>> host-id=1 > >>> > > > >>>>>>>> score=2400 > >>> > > > >>>>>>>> maintenance=False > >>> > > > >>>>>>>> state=EngineUp > >>> > > > >>>>>>>> > >>> > > > >>>>>>>> > >>> > > > >>>>>>>> --== Host 2 status ==-- > >>> > > > >>>>>>>> > >>> > > > >>>>>>>> Status up-to-date : True > >>> > > > >>>>>>>> Hostname : 10.0.200.101 > >>> > > > >>>>>>>> Host ID : 2 > >>> > > > >>>>>>>> Engine status : {'reason': 'vm not > >>> running > >>> > > on > >>> > > > >>>>>>>> this > >>> > > > >>>>>>>> host', 'health': 'bad', 'vm': 'down', 'detail': 'unknown'} > >>> > > > >>>>>>>> Score : 0 > >>> > > > >>>>>>>> Local maintenance : False > >>> > > > >>>>>>>> Host timestamp : 1397464031 > >>> > > > >>>>>>>> Extra metadata (valid at timestamp): > >>> > > > >>>>>>>> metadata_parse_version=1 > >>> > > > >>>>>>>> metadata_feature_version=1 > >>> > > > >>>>>>>> timestamp=1397464031 (Mon Apr 14 10:27:11 2014) > >>> > > > >>>>>>>> host-id=2 > >>> > > > >>>>>>>> score=0 > >>> > > > >>>>>>>> maintenance=False > >>> > > > >>>>>>>> state=EngineUnexpectedlyDown > >>> > > > >>>>>>>> timeout=Mon Apr 14 10:35:05 2014 > >>> > > > >>>>>>>> > >>> > > > 
>>>>>>>> oVirt engine is sending me 2 emails every 10 minutes with > >>> the > >>> > > > >>>>>>>> following > >>> > > > >>>>>>>> subjects: > >>> > > > >>>>>>>> - ovirt-hosted-engine state transition > >>> EngineDown-EngineStart > >>> > > > >>>>>>>> - ovirt-hosted-engine state transition > >>> EngineStart-EngineUp > >>> > > > >>>>>>>> > >>> > > > >>>>>>>> In oVirt webadmin I can see the following message: > >>> > > > >>>>>>>> VM HostedEngine is down. Exit message: internal error > >>> Failed to > >>> > > > >>>>>>>> acquire > >>> > > > >>>>>>>> lock: error -243. > >>> > > > >>>>>>>> > >>> > > > >>>>>>>> These messages are really annoying as oVirt isn't doing > >>> anything > >>> > > > >>>>>>>> with > >>> > > > >>>>>>>> hosted engine - I have an uptime of 9 days in my engine > >>> vm. > >>> > > > >>>>>>>> > >>> > > > >>>>>>>> So my questions are now: > >>> > > > >>>>>>>> Is it intended to send out these messages and detect that > >>> ovirt > >>> > > > >>>>>>>> engine > >>> > > > >>>>>>>> is down (which is false anyway), but not to restart the > >>> vm? > >>> > > > >>>>>>>> > >>> > > > >>>>>>>> How can I disable notifications? I'm planning to write a > >>> Nagios > >>> > > > >>>>>>>> plugin > >>> > > > >>>>>>>> which parses the output of hosted-engine --vm-status and > >>> only > >>> > > Nagios > >>> > > > >>>>>>>> should notify me, not hosted-engine script. > >>> > > > >>>>>>>> > >>> > > > >>>>>>>> Is is possible or planned to make the whole ha feature > >>> > > optional? I > >>> > > > >>>>>>>> really really really hate cluster software as it causes > >>> more > >>> > > > >>>>>>>> troubles > >>> > > > >>>>>>>> then standalone machines and in my case the hosted-engine > >>> ha > >>> > > feature > >>> > > > >>>>>>>> really causes troubles (and I didn't had a hardware or > >>> network > >>> > > > >>>>>>>> outage > >>> > > > >>>>>>>> yet only issues with hosted-engine ha agent). I don't > >>> need any > >>> > > ha > >>> > > > >>>>>>>> feature for hosted engine. I just want to run engine > >>> > > virtualized on > >>> > > > >>>>>>>> oVirt and if engine vm fails (e.g. because of issues with > >>> a > >>> > > host) > >>> > > > >>>>>>>> I'll > >>> > > > >>>>>>>> restart it on another node. > >>> > > > >>>>>>> > >>> > > > >>>>>>> Hi, you can: > >>> > > > >>>>>>> 1. edit > >>> /etc/ovirt-hosted-engine-ha/{agent,broker}-log.conf and > >>> > > tweak > >>> > > > >>>>>>> the logger as you like > >>> > > > >>>>>>> 2. or kill ovirt-ha-broker & ovirt-ha-agent services > >>> > > > >>>>>> > >>> > > > >>>>>> Thanks for the information. > >>> > > > >>>>>> So engine is able to run when ovirt-ha-broker and > >>> ovirt-ha-agent > >>> > > isn't > >>> > > > >>>>>> running? 
> >>> > > > >>>>>> > >>> > > > >>>>>> > >>> > > > >>>>>> Regards, > >>> > > > >>>>>> René > >>> > > > >>>>>> > >>> > > > >>>>>>> > >>> > > > >>>>>>> --Jirka > >>> > > > >>>>>>>> > >>> > > > >>>>>>>> Thanks, > >>> > > > >>>>>>>> René > >>> > > > >>>>>>>> > >>> > > > >>>>>>>> > >>> > > > >>>>>>> > >>> > > > >>>>>> _______________________________________________ > >>> > > > >>>>>> Users mailing list > >>> > > > >>>>>> Users@ovirt.org > >>> > > > >>>>>> http://lists.ovirt.org/mailman/listinfo/users > >>> > > > >>>>>> > >>> > > > >>>> _______________________________________________ > >>> > > > >>>> Users mailing list > >>> > > > >>>> Users@ovirt.org > >>> > > > >>>> http://lists.ovirt.org/mailman/listinfo/users > >>> > > > >>>> > >>> > > > >> > >>> > > > > >>> > > _______________________________________________ > >>> > > Users mailing list > >>> > > Users@ovirt.org > >>> > > http://lists.ovirt.org/mailman/listinfo/users > >>> > > > >>> > > >>> _______________________________________________ > >>> Users mailing list > >>> Users@ovirt.org > >>> http://lists.ovirt.org/mailman/listinfo/users > >>> > >> > >> > > >


Hi Kevin, thanks for the information.
> Agent.log and broker.log say nothing.
Can you please attach those files? I would like to see how the crashed QEMU process is reported to us and what state machine transitions cause the load.
> 07:23:58,994::libvirtconnection::124::root::(wrapper) Unknown libvirterror: ecode: 84 edom: 10 level: 2 message: Operation not supported: live disk snapshot not supported with this QEMU binary
What are the versions of vdsm, libvirt, qemu-kvm and kernel? If you feel like it, try updating the virt packages from the virt-preview repository: http://fedoraproject.org/wiki/Virtualization_Preview_Repository -- Martin Sivák msivak@redhat.com Red Hat Czech RHEV-M SLA / Brno, CZ ----- Original Message -----
Hi,
I use this version: ovirt-hosted-engine-ha-1.1.2-1.el6.noarch
For 3 days my engine HA worked perfectly, but I tried to snapshot a VM and the HA service went defunct ==> 400% CPU!!
Agent.log and broker.log say nothing. But in vdsm.log I have errors:
Thread-9462::DEBUG::2014-04-28 07:23:58,994::libvirtconnection::124::root::(wrapper) Unknown libvirterror: ecode: 84 edom: 10 level: 2 message: Operation not supported: live disk snapshot not supported with this QEMU binary
Thread-9462::ERROR::2014-04-28 07:23:58,995::vm::4006::vm.Vm::(snapshot) vmId=`773f6e6d-c670-49f3-ae8c-dfbcfa22d0a5`::Unable to take snapshot
Thread-9352::DEBUG::2014-04-28 08:41:39,922::lvm::295::Storage.Misc.excCmd::(cmd) '/usr/bin/sudo -n /sbin/lvm vgs --config " devices { preferred_names = [\\"^/dev/mapper/\\"] ignore_suspended_devices=1 write_cache_state=0 disable_after_error_count=3 obtain_device_list_from_udev=0 filter = [ \'r|.*|\' ] } global { locking_type=1 prioritise_write_locks=1 wait_for_locks=1 } backup { retain_min = 50 retain_days = 0 } " --noheadings --units b --nosuffix --separator | -o uuid,name,attr,size,free,extent_size,extent_count,free_count,tags,vg_mda_size,vg_mda_free,lv_count,pv_count,pv_name cc51143e-8ad7-4b0b-a4d2-9024dffc1188 ff98d346-4515-4349-8437-fb2f5e9eaadf' (cwd None)
I'll try to reboot my node with hosted-engine.
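(On the version question above: on an EL6 host the relevant versions can usually be collected with something like the commands below. The exact package names are an assumption and may differ from one build to another, e.g. qemu-kvm-rhev instead of qemu-kvm.)

# rpm -q vdsm libvirt qemu-kvm qemu-img ovirt-hosted-engine-ha
# uname -r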
2014-04-25 13:54 GMT+02:00 Martin Sivak <msivak@redhat.com>:
Hi Kevin,
can you please tell us what version of hosted-engine you are running?
rpm -q ovirt-hosted-engine-ha
Also, do I understand it correctly that the engine VM is running, but you see bad status when you execute the hosted-engine --vm-status command?
If that is so, can you give us current logs from /var/log/ovirt-hosted-engine-ha?
-- Martin Sivák msivak@redhat.com Red Hat Czech RHEV-M SLA / Brno, CZ
OK, I mounted the domain for the hosted engine manually and the agent went up.
But vm-status :
--== Host 2 status ==--
Status up-to-date                  : False
Hostname                           : 192.168.99.103
Host ID                            : 2
Engine status                      : unknown stale-data
Score                              : 0
Local maintenance                  : False
Host timestamp                     : 1398333438

And in my engine, host02 HA is not active.
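(A quick way to check whether the hosted-engine storage domain is really mounted on a host is to look under /rhev/data-center/mnt; the server and export names below are only illustrative, taken from the paths quoted later in this thread.)

# mount | grep rhev/data-center
# ls /rhev/data-center/mnt/
# ls /rhev/data-center/mnt/host01.ovirt.lan:_home_NFS01/*/ha_agent/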
2014-04-24 12:48 GMT+02:00 Kevin Tibi <kevintibi@hotmail.com>:
Hi,
I tried to reboot my hosts and now [supervdsmServer] is <defunct>.
/var/log/vdsm/supervdsm.log
MainProcess|Thread-120::DEBUG::2014-04-24 12:22:19,955::supervdsmServer::103::SuperVdsm.ServerCallback::(wrapper) return validateAccess with None
MainProcess|Thread-120::DEBUG::2014-04-24 12:22:20,010::supervdsmServer::96::SuperVdsm.ServerCallback::(wrapper) call validateAccess with ('qemu', ('qemu', 'kvm'), '/rhev/data-center/mnt/host01.ovirt.lan:_home_export', 5) {}
MainProcess|Thread-120::DEBUG::2014-04-24 12:22:20,014::supervdsmServer::103::SuperVdsm.ServerCallback::(wrapper) return validateAccess with None
MainProcess|Thread-120::DEBUG::2014-04-24 12:22:20,059::supervdsmServer::96::SuperVdsm.ServerCallback::(wrapper) call validateAccess with ('qemu', ('qemu', 'kvm'), '/rhev/data-center/mnt/host01.ovirt.lan:_home_iso', 5) {}
MainProcess|Thread-120::DEBUG::2014-04-24 12:22:20,063::supervdsmServer::103::SuperVdsm.ServerCallback::(wrapper) return validateAccess with None
and one host doesn't mount the NFS used for the hosted engine.
MainThread::CRITICAL::2014-04-24 12:36:16,603::agent::103::ovirt_hosted_engine_ha.agent.agent.Agent::(run) Could not start ha-agent
Traceback (most recent call last):
  File "/usr/lib/python2.6/site-packages/ovirt_hosted_engine_ha/agent/agent.py", line 97, in run
    self._run_agent()
  File "/usr/lib/python2.6/site-packages/ovirt_hosted_engine_ha/agent/agent.py", line 154, in _run_agent
    hosted_engine.HostedEngine(self.shutdown_requested).start_monitoring()
  File "/usr/lib/python2.6/site-packages/ovirt_hosted_engine_ha/agent/hosted_engine.py", line 299, in start_monitoring
    self._initialize_vdsm()
  File "/usr/lib/python2.6/site-packages/ovirt_hosted_engine_ha/agent/hosted_engine.py", line 418, in _initialize_vdsm
    self._sd_path = env_path.get_domain_path(self._config)
  File "/usr/lib/python2.6/site-packages/ovirt_hosted_engine_ha/env/path.py", line 40, in get_domain_path
    .format(sd_uuid, parent))
Exception: path to storage domain aea040f8-ab9d-435b-9ecf-ddd4272e592f not found in /rhev/data-center/mnt
2014-04-23 17:40 GMT+02:00 Kevin Tibi <kevintibi@hotmail.com>:
top
1729 vdsm 20 0 0 0 0 Z 373.8 0.0 252:08.51 ovirt-ha-broker <defunct>
[root@host01 ~]# ps axwu | grep 1729
vdsm      1729  0.7  0.0      0     0 ?        Zl   Apr02 240:24 [ovirt-ha-broker] <defunct>
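(To see where the CPU time of a <defunct> entry like this is going, it can help to look at the process's threads and parent; these are generic Linux commands, nothing oVirt-specific, shown here only as a sketch.)

# ps -L -o pid,tid,stat,pcpu,comm -p 1729
# grep -E 'State|PPid|Threads' /proc/1729/status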
[root@host01 ~]# ll /rhev/data-center/mnt/host01.ovirt.lan\:_home_NFS01/aea040f8-ab9d-435b-9ecf-ddd4272e592f/ha_agent/
total 2028
-rw-rw----. 1 vdsm kvm 1048576 23 avril 17:35 hosted-engine.lockspace
-rw-rw----. 1 vdsm kvm 1028096 23 avril 17:35 hosted-engine.metadata
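(If there is any doubt whether sanlock still holds the hosted-engine lockspace on this host, the sanlock client can report it; output is omitted here and the exact formatting may vary between sanlock versions.)

# sanlock client status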
cat /var/log/vdsm/vdsm.log
Thread-120518::DEBUG::2014-04-23 17:38:02,299::task::1185::TaskManager.Task::(prepare) Task=`f13e71f1-ac7c-49ab-8079-8f099ebf72b6`::finished: {'aea040f8-ab9d-435b-9ecf-ddd4272e592f': {'code': 0, 'version': 3, 'acquired': True, 'delay': '0.000410963', 'lastCheck': '3.4', 'valid': True}, '5ae613a4-44e4-42cb-89fc-7b5d34c1f30f': {'code': 0, 'version': 3, 'acquired': True, 'delay': '0.000412357', 'lastCheck': '6.8', 'valid': True}, 'cc51143e-8ad7-4b0b-a4d2-9024dffc1188': {'code': 0, 'version': 0, 'acquired': True, 'delay': '0.000455292', 'lastCheck': '1.2', 'valid': True}, 'ff98d346-4515-4349-8437-fb2f5e9eaadf': {'code': 0, 'version': 0, 'acquired': True, 'delay': '0.00817113', 'lastCheck': '1.7', 'valid': True}}
Thread-120518::DEBUG::2014-04-23 17:38:02,300::task::595::TaskManager.Task::(_updateState) Task=`f13e71f1-ac7c-49ab-8079-8f099ebf72b6`::moving from state preparing -> state finished
Thread-120518::DEBUG::2014-04-23 17:38:02,300::resourceManager::940::ResourceManager.Owner::(releaseAll) Owner.releaseAll requests {} resources {}
Thread-120518::DEBUG::2014-04-23 17:38:02,300::resourceManager::977::ResourceManager.Owner::(cancelAll) Owner.cancelAll requests {}
Thread-120518::DEBUG::2014-04-23 17:38:02,300::task::990::TaskManager.Task::(_decref) Task=`f13e71f1-ac7c-49ab-8079-8f099ebf72b6`::ref 0 aborting False
Thread-120518::ERROR::2014-04-23 17:38:02,302::brokerlink::72::ovirt_hosted_engine_ha.lib.brokerlink.BrokerLink::(connect) Failed to connect to broker: [Errno 2] No such file or directory
Thread-120518::ERROR::2014-04-23 17:38:02,302::API::1612::vds::(_getHaInfo) failed to retrieve Hosted Engine HA info
Traceback (most recent call last):
  File "/usr/share/vdsm/API.py", line 1603, in _getHaInfo
    stats = instance.get_all_stats()
  File "/usr/lib/python2.6/site-packages/ovirt_hosted_engine_ha/client/client.py", line 83, in get_all_stats
    with broker.connection():
  File "/usr/lib64/python2.6/contextlib.py", line 16, in __enter__
    return self.gen.next()
  File "/usr/lib/python2.6/site-packages/ovirt_hosted_engine_ha/lib/brokerlink.py", line 96, in connection
    self.connect()
  File "/usr/lib/python2.6/site-packages/ovirt_hosted_engine_ha/lib/brokerlink.py", line 64, in connect
    self._socket.connect(constants.BROKER_SOCKET_FILE)
  File "<string>", line 1, in connect
error: [Errno 2] No such file or directory
Thread-78::DEBUG::2014-04-23 17:38:05,490::fileSD::225::Storage.Misc.excCmd::(getReadDelay) '/bin/dd iflag=direct if=/rhev/data-center/mnt/host01.ovirt.lan:_home_DATA/5ae613a4-44e4-42cb-89fc-7b5d34c1f30f/dom_md/metadata bs=4096 count=1' (cwd None)
Thread-78::DEBUG::2014-04-23 17:38:05,523::fileSD::225::Storage.Misc.excCmd::(getReadDelay) SUCCESS: <err> = '0+1 records in\n0+1 records out\n545 bytes (545 B) copied, 0.000412209 s, 1.3 MB/s\n'; <rc> = 0
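(The "Failed to connect to broker: [Errno 2] No such file or directory" entries above usually just mean the broker socket is gone because ovirt-ha-broker is not running. On EL6 the two HA services can be checked and restarted with something like the following, broker first and then the agent.)

# service ovirt-ha-broker status
# service ovirt-ha-broker restart
# service ovirt-ha-agent restart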
2014-04-23 17:27 GMT+02:00 Martin Sivak <msivak@redhat.com>:
Hi Kevin,
> same pb.
Are you missing the lockspace file as well while running on top of GlusterFS?
> ovirt-ha-broker have 400% cpu and is defunct. I can't kill with -9.
Defunct process eating full four cores? I wonder how is that possible..
What are the status flags of that process when you do ps axwu?
Can you attach the log files please?
-- Martin Sivák msivak@redhat.com Red Hat Czech RHEV-M SLA / Brno, CZ

Hi,
qemu-kvm-0.12.1.2-2.415.el6_5.8.x86_64
libvirt-0.10.2-29.el6_5.7.x86_64
vdsm-4.14.6-0.el6.x86_64
kernel-2.6.32-431.el6.x86_64
kernel-2.6.32-431.11.2.el6.x86_64
I added this repo and will try to update.
2014-04-28 11:57 GMT+02:00 Martin Sivak <msivak@redhat.com>:
Hi Kevin,
thanks for the information.
Agent.log and broker.log says nothing.
Can you please attach those files? I would like to see how the crashed QEMU process is reported to us and which state machine transitions cause the load.
07:23:58,994::libvirtconnection::124::root::(wrapper) Unknown libvirterror: ecode: 84 edom: 10 level: 2 message: Operation not supported: live disk snapshot not supported with this QEMU binary
What are the versions of vdsm, libvirt, qemu-kvm and kernel?
If you feel like it try updating virt packages from the virt-preview repository: http://fedoraproject.org/wiki/Virtualization_Preview_Repository
-- Martin Sivák msivak@redhat.com Red Hat Czech RHEV-M SLA / Brno, CZ
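(For reference, a quick way to collect the versions asked about above; rpm -q kernel lists every installed kernel, while uname -r shows the one currently running.)

# rpm -q vdsm libvirt qemu-kvm kernel
# uname -r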
Hi,
I use this version : ovirt-hosted-engine-ha-1.1.2-1.el6.noarch
For 3 days my engine HA worked perfectly, but then I tried to snapshot a VM and the HA service went defunct ==> 400% CPU !!
Agent.log and broker.log say nothing, but in vdsm.log I have errors:
Thread-9462::DEBUG::2014-04-28 07:23:58,994::libvirtconnection::124::root::(wrapper) Unknown libvirterror: ecode: 84 edom: 10 level: 2 message: Operation not supported: live disk snapshot not supported with this QEMU binary
Thread-9462::ERROR::2014-04-28 07:23:58,995::vm::4006::vm.Vm::(snapshot) vmId=`773f6e6d-c670-49f3-ae8c-dfbcfa22d0a5`::Unable to take snapshot
Thread-9352::DEBUG::2014-04-28 08:41:39,922::lvm::295::Storage.Misc.excCmd::(cmd) '/usr/bin/sudo -n /sbin/lvm vgs --config " devices { preferred_names = [\\"^/dev/mapper/\\"] ignore_suspended_devices=1 write_cache_state=0 disable_after_error_count=3 obtain_device_list_from_udev=0 filter = [ \'r|.*|\' ] } global { locking_type=1 prioritise_write_locks=1 wait_for_locks=1 } backup { retain_min = 50 retain_days = 0 } " --noheadings --units b --nosuffix --separator | -o uuid,name,attr,size,free,extent_size,extent_count,free_count,tags,vg_mda_size,vg_mda_free,lv_count,pv_count,pv_name cc51143e-8ad7-4b0b-a4d2-9024dffc1188 ff98d346-4515-4349-8437-fb2f5e9eaadf' (cwd None)
I'll try to reboot my node with hosted-engine.
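(A hedged way to reproduce the snapshot error above outside of vdsm, using virsh directly; the domain name is a placeholder and virsh on a vdsm host may prompt for SASL credentials. With the stock EL6 qemu-kvm listed above this is expected to fail with the same "live disk snapshot not supported with this QEMU binary" message.)

# virsh snapshot-create-as <domain> snap1 --disk-only --atomic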
2014-04-25 13:54 GMT+02:00 Martin Sivak <msivak@redhat.com>:
Hi Kevin,
can you please tell us what version of hosted-engine are you running?
rpm -q ovirt-hosted-engine-ha
Also, do I understand it correctly that the engine VM is running, but you see bad status when you execute the hosted-engine --vm-status command?
If that is so, can you give us current logs from /var/log/ovirt-hosted-engine-ha?
-- Martin Sivák msivak@redhat.com Red Hat Czech RHEV-M SLA / Brno, CZ
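(A minimal way to bundle those logs for the list, assuming the default log locations:)

# tar czf hosted-engine-ha-logs.tar.gz /var/log/ovirt-hosted-engine-ha/ /var/log/vdsm/vdsm.log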
----- Original Message -----
OK, I mounted the hosted engine domain manually and the agent came up.
But vm-status:
--== Host 2 status ==--
Status up-to-date : False
Hostname : 192.168.99.103
Host ID : 2
Engine status : unknown stale-data
Score : 0
Local maintenance : False
Host timestamp : 1398333438
And in my engine, host02 HA is not active.
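(The "unknown stale-data" status usually means the agent on that host is no longer updating the shared metadata; a hedged first step is to restart the HA services on host02 and re-check:)

# service ovirt-ha-broker restart
# service ovirt-ha-agent restart
# hosted-engine --vm-status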
2014-04-24 12:48 GMT+02:00 Kevin Tibi <kevintibi@hotmail.com>:
Hi,
I tried to reboot my hosts and now [supervdsmServer] is <defunct>.
/var/log/vdsm/supervdsm.log
MainProcess|Thread-120::DEBUG::2014-04-24 12:22:19,955::supervdsmServer::103::SuperVdsm.ServerCallback::(wrapper) return validateAccess with None
MainProcess|Thread-120::DEBUG::2014-04-24 12:22:20,010::supervdsmServer::96::SuperVdsm.ServerCallback::(wrapper) call validateAccess with ('qemu', ('qemu', 'kvm'), '/rhev/data-center/mnt/host01.ovirt.lan:_home_export', 5) {}
MainProcess|Thread-120::DEBUG::2014-04-24 12:22:20,014::supervdsmServer::103::SuperVdsm.ServerCallback::(wrapper) return validateAccess with None
MainProcess|Thread-120::DEBUG::2014-04-24 12:22:20,059::supervdsmServer::96::SuperVdsm.ServerCallback::(wrapper) call validateAccess with ('qemu', ('qemu', 'kvm'), '/rhev/data-center/mnt/host01.ovirt.lan:_home_iso', 5) {}
MainProcess|Thread-120::DEBUG::2014-04-24 12:22:20,063::supervdsmServer::103::SuperVdsm.ServerCallback::(wrapper) return validateAccess with None
and one host doesn't mount the NFS share used for the hosted engine.
MainThread::CRITICAL::2014-04-24 12:36:16,603::agent::103::ovirt_hosted_engine_ha.agent.agent.Agent::(run) Could not start ha-agent
Traceback (most recent call last):
  File "/usr/lib/python2.6/site-packages/ovirt_hosted_engine_ha/agent/agent.py", line 97, in run
    self._run_agent()
  File "/usr/lib/python2.6/site-packages/ovirt_hosted_engine_ha/agent/agent.py", line 154, in _run_agent
    hosted_engine.HostedEngine(self.shutdown_requested).start_monitoring()
  File "/usr/lib/python2.6/site-packages/ovirt_hosted_engine_ha/agent/hosted_engine.py", line 299, in start_monitoring
    self._initialize_vdsm()
  File "/usr/lib/python2.6/site-packages/ovirt_hosted_engine_ha/agent/hosted_engine.py", line 418, in _initialize_vdsm
    self._sd_path = env_path.get_domain_path(self._config)
  File "/usr/lib/python2.6/site-packages/ovirt_hosted_engine_ha/env/path.py", line 40, in get_domain_path
    .format(sd_uuid, parent))
Exception: path to storage domain aea040f8-ab9d-435b-9ecf-ddd4272e592f not found in /rhev/data-center/mnt
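(A hedged sketch of checking and restoring the missing mount by hand, consistent with the manual mount mentioned above; the export path /home/NFS01 is an assumption read back from the vdsm mount-point name host01.ovirt.lan:_home_NFS01 that appears below.)

# ls /rhev/data-center/mnt/
# mkdir -p '/rhev/data-center/mnt/host01.ovirt.lan:_home_NFS01'
# mount -t nfs host01.ovirt.lan:/home/NFS01 '/rhev/data-center/mnt/host01.ovirt.lan:_home_NFS01'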
2014-04-23 17:40 GMT+02:00 Kevin Tibi <kevintibi@hotmail.com>:
top
1729 vdsm 20 0 0 0 0 Z 373.8 0.0 252:08.51 ovirt-ha-broker <defunct>
[root@host01 ~]# ps axwu | grep 1729
vdsm 1729 0.7 0.0 0 0 ? Zl Apr02 240:24 [ovirt-ha-broker] <defunct>
[root@host01 ~]# ll /rhev/data-center/mnt/host01.ovirt.lan\:_home_NFS01/aea040f8-ab9d-435b-9ecf-ddd4272e592f/ha_agent/
total 2028
-rw-rw----. 1 vdsm kvm 1048576 23 avril 17:35 hosted-engine.lockspace
-rw-rw----. 1 vdsm kvm 1028096 23 avril 17:35 hosted-engine.metadata
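(Since the lockspace file is present here, a hedged way to check whether sanlock can actually read it; the path is the one listed above.)

# sanlock client status
# sanlock direct dump '/rhev/data-center/mnt/host01.ovirt.lan:_home_NFS01/aea040f8-ab9d-435b-9ecf-ddd4272e592f/ha_agent/hosted-engine.lockspace'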
cat /var/log/vdsm/vdsm.log
Thread-120518::DEBUG::2014-04-23 17:38:02,299::task::1185::TaskManager.Task::(prepare) Task=`f13e71f1-ac7c-49ab-8079-8f099ebf72b6`::finished: {'aea040f8-ab9d-435b-9ecf-ddd4272e592f': {'code': 0, 'version': 3, 'acquired': True, 'delay': '0.000410963', 'lastCheck': '3.4', 'valid': True}, '5ae613a4-44e4-42cb-89fc-7b5d34c1f30f': {'code': 0, 'version': 3, 'acquired': True, 'delay': '0.000412357', 'lastCheck': '6.8', 'valid': True}, 'cc51143e-8ad7-4b0b-a4d2-9024dffc1188': {'code': 0, 'version': 0, 'acquired': True, 'delay': '0.000455292', 'lastCheck': '1.2', 'valid': True}, 'ff98d346-4515-4349-8437-fb2f5e9eaadf': {'code': 0, 'version': 0, 'acquired': True, 'delay': '0.00817113', 'lastCheck': '1.7', 'valid': True}}
Thread-120518::DEBUG::2014-04-23 17:38:02,300::task::595::TaskManager.Task::(_updateState) Task=`f13e71f1-ac7c-49ab-8079-8f099ebf72b6`::moving from state preparing -> state finished
Thread-120518::DEBUG::2014-04-23 17:38:02,300::resourceManager::940::ResourceManager.Owner::(releaseAll) Owner.releaseAll requests {} resources {}
Thread-120518::DEBUG::2014-04-23 17:38:02,300::resourceManager::977::ResourceManager.Owner::(cancelAll) Owner.cancelAll requests {}
Thread-120518::DEBUG::2014-04-23 17:38:02,300::task::990::TaskManager.Task::(_decref) Task=`f13e71f1-ac7c-49ab-8079-8f099ebf72b6`::ref 0 aborting False
Thread-120518::ERROR::2014-04-23 17:38:02,302::brokerlink::72::ovirt_hosted_engine_ha.lib.brokerlink.BrokerLink::(connect) Failed to connect to broker: [Errno 2] No such file or directory
Thread-120518::ERROR::2014-04-23 17:38:02,302::API::1612::vds::(_getHaInfo) failed to retrieve Hosted Engine HA info
Traceback (most recent call last):
  File "/usr/share/vdsm/API.py", line 1603, in _getHaInfo
    stats = instance.get_all_stats()
  File "/usr/lib/python2.6/site-packages/ovirt_hosted_engine_ha/client/client.py", line 83, in get_all_stats
    with broker.connection():
  File "/usr/lib64/python2.6/contextlib.py", line 16, in __enter__
    return self.gen.next()
  File "/usr/lib/python2.6/site-packages/ovirt_hosted_engine_ha/lib/brokerlink.py", line 96, in connection
    self.connect()
  File "/usr/lib/python2.6/site-packages/ovirt_hosted_engine_ha/lib/brokerlink.py", line 64, in connect
    self._socket.connect(constants.BROKER_SOCKET_FILE)
  File "<string>", line 1, in connect
error: [Errno 2] No such file or directory
Thread-78::DEBUG::2014-04-23 17:38:05,490::fileSD::225::Storage.Misc.excCmd::(getReadDelay) '/bin/dd iflag=direct if=/rhev/data-center/mnt/host01.ovirt.lan:_home_DATA/5ae613a4-44e4-42cb-89fc-7b5d34c1f30f/dom_md/metadata bs=4096 count=1' (cwd None)
Thread-78::DEBUG::2014-04-23 17:38:05,523::fileSD::225::Storage.Misc.excCmd::(getReadDelay) SUCCESS: <err> = '0+1 records in\n0+1 records out\n545 bytes (545 B) copied, 0.000412209 s, 1.3 MB/s\n'; <rc> = 0
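(The "Failed to connect to broker: [Errno 2] No such file or directory" lines mean the broker socket is gone, which matches the defunct ovirt-ha-broker above. A hedged check; the socket path is an assumption based on the BROKER_SOCKET_FILE constant in the traceback.)

# service ovirt-ha-broker status
# ls -l /var/run/ovirt-hosted-engine-ha/broker.socket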

I'm on CentOS 6.5 and this repo is for Fedora...
>> > > > >>>>> >> > > > >>>>> Well the system won't perform any automatic actions if you >> put the >> > > > >>>>> hosted >> > > > >>>>> engine to global maintenance and only start/stop/migrate the >> VM >> > > > >>>>> manually. >> > > > >>>>> I would discourage you from stopping agent/broker, because >> the >> > > engine >> > > > >>>>> itself has some logic based on the reporting. >> > > > >>>>> >> > > > >>>>> Regards >> > > > >>>>> >> > > > >>>>> -- >> > > > >>>>> Martin Sivák >> > > > >>>>> msivak@redhat.com >> > > > >>>>> Red Hat Czech >> > > > >>>>> RHEV-M SLA / Brno, CZ >> > > > >>>>> >> > > > >>>>> ----- Original Message ----- >> > > > >>>>>> On 04/15/2014 04:53 PM, Jiri Moskovcak wrote: >> > > > >>>>>>> On 04/14/2014 10:50 AM, René Koch wrote: >> > > > >>>>>>>> Hi, >> > > > >>>>>>>> >> > > > >>>>>>>> I have some issues with hosted engine status. >> > > > >>>>>>>> >> > > > >>>>>>>> oVirt hosts think that hosted engine is down because it >> seems >> > > that >> > > > >>>>>>>> hosts >> > > > >>>>>>>> can't write to hosted-engine.lockspace due to glusterfs >> issues >> > > (or >> > > > >>>>>>>> at >> > > > >>>>>>>> least I think so). >> > > > >>>>>>>> >> > > > >>>>>>>> Here's the output of vm-status: >> > > > >>>>>>>> >> > > > >>>>>>>> # hosted-engine --vm-status >> > > > >>>>>>>> >> > > > >>>>>>>> >> > > > >>>>>>>> --== Host 1 status ==-- >> > > > >>>>>>>> >> > > > >>>>>>>> Status up-to-date : False >> > > > >>>>>>>> Hostname : 10.0.200.102 >> > > > >>>>>>>> Host ID : 1 >> > > > >>>>>>>> Engine status : unknown stale-data >> > > > >>>>>>>> Score : 2400 >> > > > >>>>>>>> Local maintenance : False >> > > > >>>>>>>> Host timestamp : 1397035677 >> > > > >>>>>>>> Extra metadata (valid at timestamp): >> > > > >>>>>>>> metadata_parse_version=1 >> > > > >>>>>>>> metadata_feature_version=1 >> > > > >>>>>>>> timestamp=1397035677 (Wed Apr 9 11:27:57
>> > > > >>>>>>>> host-id=2 >> > > > >>>>>>>> score=0 >> > > > >>>>>>>> maintenance=False >> > > > >>>>>>>> state=EngineUnexpectedlyDown >> > > > >>>>>>>> timeout=Mon Apr 14 10:35:05 2014 >> > > > >>>>>>>> >> > > > >>>>>>>> oVirt engine is sending me 2 emails every 10 minutes with >> the >> > > > >>>>>>>> following >> > > > >>>>>>>> subjects: >> > > > >>>>>>>> - ovirt-hosted-engine state transition >> EngineDown-EngineStart >> > > > >>>>>>>> - ovirt-hosted-engine state transition >> EngineStart-EngineUp >> > > > >>>>>>>> >> > > > >>>>>>>> In oVirt webadmin I can see the following message: >> > > > >>>>>>>> VM HostedEngine is down. Exit message: internal error >> Failed to >> > > > >>>>>>>> acquire >> > > > >>>>>>>> lock: error -243. >> > > > >>>>>>>> >> > > > >>>>>>>> These messages are really annoying as oVirt isn't doing >> anything >> > > > >>>>>>>> with >> > > > >>>>>>>> hosted engine - I have an uptime of 9 days in my engine >> vm. >> > > > >>>>>>>> >> > > > >>>>>>>> So my questions are now: >> > > > >>>>>>>> Is it intended to send out these messages and detect that >> ovirt >> > > > >>>>>>>> engine >> > > > >>>>>>>> is down (which is false anyway), but not to restart the >> vm? >> > > > >>>>>>>> >> > > > >>>>>>>> How can I disable notifications? I'm planning to write a >> Nagios >> > > > >>>>>>>> plugin >> > > > >>>>>>>> which parses the output of hosted-engine --vm-status and >> only >> > > Nagios >> > > > >>>>>>>> should notify me, not hosted-engine script. >> > > > >>>>>>>> >> > > > >>>>>>>> Is is possible or planned to make the whole ha feature >> > > optional? I >> > > > >>>>>>>> really really really hate cluster software as it causes >> more >> > > > >>>>>>>> troubles >> > > > >>>>>>>> then standalone machines and in my case the hosted-engine >> ha >> > > feature >> > > > >>>>>>>> really causes troubles (and I didn't had a hardware or >> network >> > > > >>>>>>>> outage >> > > > >>>>>>>> yet only issues with hosted-engine ha agent). I don't >> need any >> > > ha >> > > > >>>>>>>> feature for hosted engine. I just want to run engine >> > > virtualized on >> > > > >>>>>>>> oVirt and if engine vm fails (e.g. because of issues with >> a >> > > host) >> > > > >>>>>>>> I'll >> > > > >>>>>>>> restart it on another node. >> > > > >>>>>>> >> > > > >>>>>>> Hi, you can: >> > > > >>>>>>> 1. edit >> /etc/ovirt-hosted-engine-ha/{agent,broker}-log.conf and >> > > tweak >> > > > >>>>>>> the logger as you like >> > > > >>>>>>> 2. or kill ovirt-ha-broker & ovirt-ha-agent services >> > > > >>>>>> >> > > > >>>>>> Thanks for the information. >> > > > >>>>>> So engine is able to run when ovirt-ha-broker and >> ovirt-ha-agent >> > > isn't >> > > > >>>>>> running? 
>> > > > >>>>>> >> > > > >>>>>> >> > > > >>>>>> Regards, >> > > > >>>>>> René >> > > > >>>>>> >> > > > >>>>>>> >> > > > >>>>>>> --Jirka >> > > > >>>>>>>> >> > > > >>>>>>>> Thanks, >> > > > >>>>>>>> René >> > > > >>>>>>>> >> > > > >>>>>>>> >> > > > >>>>>>> >> > > > >>>>>> _______________________________________________ >> > > > >>>>>> Users mailing list >> > > > >>>>>> Users@ovirt.org >> > > > >>>>>> http://lists.ovirt.org/mailman/listinfo/users >> > > > >>>>>> >> > > > >>>> _______________________________________________ >> > > > >>>> Users mailing list >> > > > >>>> Users@ovirt.org >> > > > >>>> http://lists.ovirt.org/mailman/listinfo/users >> > > > >>>> >> > > > >> >> > > > >> > > _______________________________________________ >> > > Users mailing list >> > > Users@ovirt.org >> > > http://lists.ovirt.org/mailman/listinfo/users >> > > >> > >> _______________________________________________ >> Users mailing list >> Users@ovirt.org >> http://lists.ovirt.org/mailman/listinfo/users >> > >
_______________________________________________ Users mailing list Users@ovirt.org http://lists.ovirt.org/mailman/listinfo/users

Are you able to reproduce this reliably? If so, please send us the full logs from vdsm, ha-broker and ha-agent. So far it seems there are multiple problems mixed into this thread:

1. a libvirt+vdsm+qemu problem when creating a snapshot
2. storage not mounted after reboot

Thank you,
Jirka

On 04/28/2014 12:19 PM, Kevin Tibi wrote:
I'm on CentOS 6.5 and this repo is for Fedora...
2014-04-28 12:16 GMT+02:00 Kevin Tibi <kevintibi@hotmail.com <mailto:kevintibi@hotmail.com>>:
Hi,
qemu-kvm-0.12.1.2-2.415.el6_5.8.x86_64 libvirt-0.10.2-29.el6_5.7.x86_64 vdsm-4.14.6-0.el6.x86_64 kernel-2.6.32-431.el6.x86_64 kernel-2.6.32-431.11.2.el6.x86_64
I added this repo and tried to update.
2014-04-28 11:57 GMT+02:00 Martin Sivak <msivak@redhat.com <mailto:msivak@redhat.com>>:
Hi Kevin,
thanks for the information.
> Agent.log and broker.log says nothing.
Can you please attach those files? I would like to see how the crashed QEMU process is reported to us and which state machine transitions cause the load.
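For reference, picking the state-machine activity out of a long agent.log can be scripted. Below is a minimal Python sketch; the log path is the one already mentioned in this thread, while the keyword filter is only an assumption about how the transition messages are worded, so adjust it to match the actual log lines.

    # Rough filter: print agent.log lines that look like state machine
    # activity so the transitions stand out from the rest of the log.
    # The keywords are an assumption, not the agent's exact wording.
    LOG = "/var/log/ovirt-hosted-engine-ha/agent.log"
    KEYWORDS = ("state", "transition")

    with open(LOG) as f:
        for line in f:
            text = line.lower()
            if any(keyword in text for keyword in KEYWORDS):
                print(line.rstrip())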
> 07:23:58,994::libvirtconnection::124::root::(wrapper) Unknown libvirterror: > ecode: 84 edom: 10 level: 2 message: Operation not supported: live disk > snapshot not supported with this QEMU binary
What are the versions of vdsm, libvirt, qemu-kvm and kernel?
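Collecting those versions can be done in one go; here is a minimal sketch, assuming an RPM-based host such as the CentOS 6.5 machines in this thread:

    # Print the installed name-version-release for the packages whose
    # versions are being asked about; rpm reports any missing package
    # as "not installed".
    import subprocess

    for pkg in ("vdsm", "libvirt", "qemu-kvm", "kernel"):
        subprocess.call(["rpm", "-q", pkg])

The output is the same as running rpm -q by hand for each package.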
If you feel like it, try updating the virt packages from the virt-preview repository: http://fedoraproject.org/wiki/Virtualization_Preview_Repository
-- Martin Sivák msivak@redhat.com <mailto:msivak@redhat.com> Red Hat Czech RHEV-M SLA / Brno, CZ
----- Original Message ----- > Hi, > > I use this version : ovirt-hosted-engine-ha-1.1.2-1.el6.noarch > > For 3 days, my engine-ha worked perfectly but i tried to snapshot a Vm and > ha service make defunct ==> 400% CPU !! > > Agent.log and broker.log says nothing. But vdsm.log i have errors : > > Thread-9462::DEBUG::2014-04-28 > 07:23:58,994::libvirtconnection::124::root::(wrapper) Unknown libvirterror: > ecode: 84 edom: 10 level: 2 message: Operation not supported: live disk > snapshot not supported with this QEMU binary > > Thread-9462::ERROR::2014-04-28 07:23:58,995::vm::4006::vm.Vm::(snapshot) > vmId=`773f6e6d-c670-49f3-ae8c-dfbcfa22d0a5`::Unable to take snapshot > > > Thread-9352::DEBUG::2014-04-28 > 08:41:39,922::lvm::295::Storage.Misc.excCmd::(cmd) '/usr/bin/sudo -n > /sbin/lvm vgs --config " devices { preferred_names = [\\"^/dev/mapper/\\"] > ignore_suspended_devices=1 write_cache_state=0 disable_after_error_count=3 > obtain_device_list_from_udev=0 filter = [ \'r|.*|\' ] } global { > locking_type=1 prioritise_write_locks=1 wait_for_locks=1 } backup { > retain_min = 50 retain_days = 0 } " --noheadings --units b --nosuffix > --separator | -o > uuid,name,attr,size,free,extent_size,extent_count,free_count,tags,vg_mda_size,vg_mda_free,lv_count,pv_count,pv_name > cc51143e-8ad7-4b0b-a4d2-9024dffc1188 ff98d346-4515-4349-8437-fb2f5e9eaadf' > (cwd None) > > I'll try to reboot my node with hosted-engine. > > > > 2014-04-25 13:54 GMT+02:00 Martin Sivak <msivak@redhat.com <mailto:msivak@redhat.com>>: > > > Hi Kevin, > > > > can you please tell us what version of hosted-engine are you running? > > > > rpm -q ovirt-hosted-engine-ha > > > > Also, do I understand it correctly that the engine VM is running, but you > > see bad status when you execute the hosted-engine --vm-status command? > > > > If that is so, can you give us current logs from > > /var/log/ovirt-hosted-engine-ha? > > > > -- > > Martin Sivák > > msivak@redhat.com <mailto:msivak@redhat.com> > > Red Hat Czech > > RHEV-M SLA / Brno, CZ > > > > ----- Original Message ----- > > > Ok i mount manualy the domain for hosted engine and agent go up. > > > > > > But vm-status : > > > > > > --== Host 2 status ==-- > > > > > > Status up-to-date : False > > > Hostname : 192.168.99.103 > > > Host ID : 2 > > > Engine status : unknown stale-data > > > Score : 0 > > > Local maintenance : False > > > Host timestamp : 1398333438 > > > > > > And in my engine, host02 Ha is no active. > > > > > > > > > 2014-04-24 12:48 GMT+02:00 Kevin Tibi <kevintibi@hotmail.com <mailto:kevintibi@hotmail.com>>: > > > > > > > Hi, > > > > > > > > I try to reboot my hosts and now [supervdsmServer] is <defunct>. 
> > > > > > > > /var/log/vdsm/supervdsm.log > > > > > > > > > > > > MainProcess|Thread-120::DEBUG::2014-04-24 > > > > 12:22:19,955::supervdsmServer::103::SuperVdsm.ServerCallback::(wrapper) > > > > return validateAccess with None > > > > MainProcess|Thread-120::DEBUG::2014-04-24 > > > > 12:22:20,010::supervdsmServer::96::SuperVdsm.ServerCallback::(wrapper) > > call > > > > validateAccess with ('qemu', ('qemu', 'kvm'), > > > > '/rhev/data-center/mnt/host01.ovirt.lan:_home_export', 5) {} > > > > MainProcess|Thread-120::DEBUG::2014-04-24 > > > > 12:22:20,014::supervdsmServer::103::SuperVdsm.ServerCallback::(wrapper) > > > > return validateAccess with None > > > > MainProcess|Thread-120::DEBUG::2014-04-24 > > > > 12:22:20,059::supervdsmServer::96::SuperVdsm.ServerCallback::(wrapper) > > call > > > > validateAccess with ('qemu', ('qemu', 'kvm'), > > > > '/rhev/data-center/mnt/host01.ovirt.lan:_home_iso', 5) {} > > > > MainProcess|Thread-120::DEBUG::2014-04-24 > > > > 12:22:20,063::supervdsmServer::103::SuperVdsm.ServerCallback::(wrapper) > > > > return validateAccess with None > > > > > > > > and one host don't mount the NFS used for hosted engine. > > > > > > > > MainThread::CRITICAL::2014-04-24 > > > > > > 12:36:16,603::agent::103::ovirt_hosted_engine_ha.agent.agent.Agent::(run) > > > > Could not start ha-agent > > > > Traceback (most recent call last): > > > > File > > > > > > "/usr/lib/python2.6/site-packages/ovirt_hosted_engine_ha/agent/agent.py", > > > > line 97, in run > > > > self._run_agent() > > > > File > > > > > > "/usr/lib/python2.6/site-packages/ovirt_hosted_engine_ha/agent/agent.py", > > > > line 154, in _run_agent > > > > > > hosted_engine.HostedEngine(self.shutdown_requested).start_monitoring() > > > > File > > > > > > "/usr/lib/python2.6/site-packages/ovirt_hosted_engine_ha/agent/hosted_engine.py", > > > > line 299, in start_monitoring > > > > self._initialize_vdsm() > > > > File > > > > > > "/usr/lib/python2.6/site-packages/ovirt_hosted_engine_ha/agent/hosted_engine.py", > > > > line 418, in _initialize_vdsm > > > > self._sd_path = env_path.get_domain_path(self._config) > > > > File > > > > "/usr/lib/python2.6/site-packages/ovirt_hosted_engine_ha/env/path.py", > > line > > > > 40, in get_domain_path > > > > .format(sd_uuid, parent)) > > > > Exception: path to storage domain aea040f8-ab9d-435b-9ecf-ddd4272e592f > > not > > > > found in /rhev/data-center/mnt > > > > > > > > > > > > > > > > 2014-04-23 17:40 GMT+02:00 Kevin Tibi <kevintibi@hotmail.com <mailto:kevintibi@hotmail.com>>: > > > > > > > > top > > > >> 1729 vdsm 20 0 0 0 0 Z 373.8 0.0 252:08.51 > > > >> ovirt-ha-broker <defunct> > > > >> > > > >> > > > >> [root@host01 ~]# ps axwu | grep 1729 > > > >> vdsm 1729 0.7 0.0 0 0 ? Zl Apr02 240:24 > > > >> [ovirt-ha-broker] <defunct> > > > >> > > > >> [root@host01 ~]# ll > > > >> > > /rhev/data-center/mnt/host01.ovirt.lan\:_home_NFS01/aea040f8-ab9d-435b-9ecf-ddd4272e592f/ha_agent/ > > > >> total 2028 > > > >> -rw-rw----. 1 vdsm kvm 1048576 23 avril 17:35 hosted-engine.lockspace > > > >> -rw-rw----. 
1 vdsm kvm 1028096 23 avril 17:35 hosted-engine.metadata > > > >> > > > >> cat /var/log/vdsm/vdsm.log > > > >> > > > >> Thread-120518::DEBUG::2014-04-23 > > > >> 17:38:02,299::task::1185::TaskManager.Task::(prepare) > > > >> Task=`f13e71f1-ac7c-49ab-8079-8f099ebf72b6`::finished: > > > >> {'aea040f8-ab9d-435b-9ecf-ddd4272e592f': {'code': 0, 'version': 3, > > > >> 'acquired': True, 'delay': '0.000410963', 'lastCheck': '3.4', 'valid': > > > >> True}, '5ae613a4-44e4-42cb-89fc-7b5d34c1f30f': {'code': 0, 'version': > > 3, > > > >> 'acquired': True, 'delay': '0.000412357', 'lastCheck': '6.8', 'valid': > > > >> True}, 'cc51143e-8ad7-4b0b-a4d2-9024dffc1188': {'code': 0, 'version': > > 0, > > > >> 'acquired': True, 'delay': '0.000455292', 'lastCheck': '1.2', 'valid': > > > >> True}, 'ff98d346-4515-4349-8437-fb2f5e9eaadf': {'code': 0, 'version': > > 0, > > > >> 'acquired': True, 'delay': '0.00817113', 'lastCheck': '1.7', 'valid': > > > >> True}} > > > >> Thread-120518::DEBUG::2014-04-23 > > > >> 17:38:02,300::task::595::TaskManager.Task::(_updateState) > > > >> Task=`f13e71f1-ac7c-49ab-8079-8f099ebf72b6`::moving from state > > preparing > > > >> -> > > > >> state finished > > > >> Thread-120518::DEBUG::2014-04-23 > > > >> > > 17:38:02,300::resourceManager::940::ResourceManager.Owner::(releaseAll) > > > >> Owner.releaseAll requests {} resources {} > > > >> Thread-120518::DEBUG::2014-04-23 > > > >> 17:38:02,300::resourceManager::977::ResourceManager.Owner::(cancelAll) > > > >> Owner.cancelAll requests {} > > > >> Thread-120518::DEBUG::2014-04-23 > > > >> 17:38:02,300::task::990::TaskManager.Task::(_decref) > > > >> Task=`f13e71f1-ac7c-49ab-8079-8f099ebf72b6`::ref 0 aborting False > > > >> Thread-120518::ERROR::2014-04-23 > > > >> > > 17:38:02,302::brokerlink::72::ovirt_hosted_engine_ha.lib.brokerlink.BrokerLink::(connect) > > > >> Failed to connect to broker: [Errno 2] No such file or directory > > > >> Thread-120518::ERROR::2014-04-23 > > > >> 17:38:02,302::API::1612::vds::(_getHaInfo) failed to retrieve Hosted > > > >> Engine > > > >> HA info > > > >> Traceback (most recent call last): > > > >> File "/usr/share/vdsm/API.py", line 1603, in _getHaInfo > > > >> stats = instance.get_all_stats() > > > >> File > > > >> > > "/usr/lib/python2.6/site-packages/ovirt_hosted_engine_ha/client/client.py", > > > >> line 83, in get_all_stats > > > >> with broker.connection(): > > > >> File "/usr/lib64/python2.6/contextlib.py", line 16, in __enter__ > > > >> return self.gen.next() > > > >> File > > > >> > > "/usr/lib/python2.6/site-packages/ovirt_hosted_engine_ha/lib/brokerlink.py", > > > >> line 96, in connection > > > >> self.connect() > > > >> File > > > >> > > "/usr/lib/python2.6/site-packages/ovirt_hosted_engine_ha/lib/brokerlink.py", > > > >> line 64, in connect > > > >> self._socket.connect(constants.BROKER_SOCKET_FILE) > > > >> File "<string>", line 1, in connect > > > >> error: [Errno 2] No such file or directory > > > >> Thread-78::DEBUG::2014-04-23 > > > >> 17:38:05,490::fileSD::225::Storage.Misc.excCmd::(getReadDelay) > > '/bin/dd > > > >> iflag=direct > > > >> > > if=/rhev/data-center/mnt/host01.ovirt.lan:_home_DATA/5ae613a4-44e4-42cb-89fc-7b5d34c1f30f/dom_md/metadata > > > >> bs=4096 count=1' (cwd None) > > > >> Thread-78::DEBUG::2014-04-23 > > > >> 17:38:05,523::fileSD::225::Storage.Misc.excCmd::(getReadDelay) > > SUCCESS: > > > >> <err> = '0+1 records in\n0+1 records out\n545 bytes (545 B) copied, > > > >> 0.000412209 s, 1.3 MB/s\n'; <rc> = 0 > > > >> > > > >> > > > >> > > > >> > > > >> 2014-04-23 
17:27 GMT+02:00 Martin Sivak <msivak@redhat.com <mailto:msivak@redhat.com>>: > > > >> > > > >> Hi Kevin, > > > >>> > > > >>> > same pb. > > > >>> > > > >>> Are you missing the lockspace file as well while running on top of > > > >>> GlusterFS? > > > >>> > > > >>> > ovirt-ha-broker have 400% cpu and is defunct. I can't kill with -9. > > > >>> > > > >>> Defunct process eating full four cores? I wonder how is that > > possible.. > > > >>> What are the status flags of that process when you do ps axwu? > > > >>> > > > >>> Can you attach the log files please? > > > >>> > > > >>> -- > > > >>> Martin Sivák > > > >>> msivak@redhat.com <mailto:msivak@redhat.com> > > > >>> Red Hat Czech > > > >>> RHEV-M SLA / Brno, CZ > > > >>> > > > >>> ----- Original Message ----- > > > >>> > same pb. ovirt-ha-broker have 400% cpu and is defunct. I can't kill > > > >>> with -9. > > > >>> > > > > >>> > > > > >>> > 2014-04-23 13:55 GMT+02:00 Martin Sivak <msivak@redhat.com <mailto:msivak@redhat.com>>: > > > >>> > > > > >>> > > Hi, > > > >>> > > > > > >>> > > > Isn't this file created when hosted engine is started? > > > >>> > > > > > >>> > > The file is created by the setup script. If it got lost then > > there > > > >>> was > > > >>> > > probably something bad happening in your NFS or Gluster storage. > > > >>> > > > > > >>> > > > Or how can I create this file manually? > > > >>> > > > > > >>> > > I can give you experimental treatment for this. We do not have > > any > > > >>> > > official way as this is something that should not ever happen :) > > > >>> > > > > > >>> > > !! But before you do that make sure you do not have any nodes > > running > > > >>> > > properly. This will destroy and reinitialize the lockspace > > database > > > >>> for the > > > >>> > > whole hosted-engine environment (which you apparently lack, > > but..). > > > >>> !! > > > >>> > > > > > >>> > > You have to create the ha_agent/hosted-engine.lockspace file > > with the > > > >>> > > expected size (1MB) and then tell sanlock to initialize it as a > > > >>> lockspace > > > >>> > > using: > > > >>> > > > > > >>> > > # python > > > >>> > > >>> import sanlock > > > >>> > > >>> sanlock.write_lockspace(lockspace="hosted-engine", > > > >>> > > ... path="/rhev/data-center/mnt/<nfs>/<hosted engine storage > > > >>> > > domain>/ha_agent/hosted-engine.lockspace", > > > >>> > > ... offset=0) > > > >>> > > >>> > > > >>> > > > > > >>> > > Then try starting the services (both broker and agent) again. > > > >>> > > > > > >>> > > -- > > > >>> > > Martin Sivák > > > >>> > > msivak@redhat.com <mailto:msivak@redhat.com> > > > >>> > > Red Hat Czech > > > >>> > > RHEV-M SLA / Brno, CZ > > > >>> > > > > > >>> > > > > > >>> > > ----- Original Message ----- > > > >>> > > > On 04/23/2014 11:08 AM, Martin Sivak wrote: > > > >>> > > > > Hi René, > > > >>> > > > > > > > >>> > > > >>>> libvirtError: Failed to acquire lock: No space left on > > device > > > >>> > > > > > > > >>> > > > >>>> 2014-04-22 12:38:17+0200 654 [3093]: r2 cmd_acquire > > 2,9,5733 > > > >>> invalid > > > >>> > > > >>>> lockspace found -1 failed 0 name > > > >>> > > 2851af27-8744-445d-9fb1-a0d083c8dc82 > > > >>> > > > > > > > >>> > > > > Can you please check the contents of /rhev/data-center/<your > > nfs > > > >>> > > > > mount>/<nfs domain uuid>/ha_agent/? 
> > > >>> > > > > > > > >>> > > > > This is how it should look like: > > > >>> > > > > > > > >>> > > > > [root@dev-03 ~]# ls -al > > > >>> > > > > > > > >>> > > > > > >>> > > /rhev/data-center/mnt/euryale\:_home_ovirt_he/e16de6a2-53f5-4ab3-95a3-255d08398824/ha_agent/ > > > >>> > > > > total 2036 > > > >>> > > > > drwxr-x---. 2 vdsm kvm 4096 Mar 19 18:46 . > > > >>> > > > > drwxr-xr-x. 6 vdsm kvm 4096 Mar 19 18:46 .. > > > >>> > > > > -rw-rw----. 1 vdsm kvm 1048576 Apr 23 11:05 > > > >>> hosted-engine.lockspace > > > >>> > > > > -rw-rw----. 1 vdsm kvm 1028096 Mar 19 18:46 > > > >>> hosted-engine.metadata > > > >>> > > > > > > > >>> > > > > The errors seem to indicate that you somehow lost the > > lockspace > > > >>> file. > > > >>> > > > > > > >>> > > > True :) > > > >>> > > > Isn't this file created when hosted engine is started? Or how > > can I > > > >>> > > > create this file manually? > > > >>> > > > > > > >>> > > > > > > > >>> > > > > -- > > > >>> > > > > Martin Sivák > > > >>> > > > > msivak@redhat.com <mailto:msivak@redhat.com> > > > >>> > > > > Red Hat Czech > > > >>> > > > > RHEV-M SLA / Brno, CZ > > > >>> > > > > > > > >>> > > > > ----- Original Message ----- > > > >>> > > > >> On 04/23/2014 12:28 AM, Doron Fediuck wrote: > > > >>> > > > >>> Hi Rene, > > > >>> > > > >>> any idea what closed your ovirtmgmt bridge? > > > >>> > > > >>> as long as it is down vdsm may have issues starting up > > properly > > > >>> > > > >>> and this is why you see the complaints on the rpc server. > > > >>> > > > >>> > > > >>> > > > >>> Can you try manually fixing the network part first and then > > > >>> > > > >>> restart vdsm? > > > >>> > > > >>> Once vdsm is happy hosted engine VM will start. > > > >>> > > > >> > > > >>> > > > >> Thanks for your feedback, Doron. > > > >>> > > > >> > > > >>> > > > >> My ovirtmgmt bridge seems to be on or isn't it: > > > >>> > > > >> # brctl show ovirtmgmt > > > >>> > > > >> bridge name bridge id STP enabled > > > >>> interfaces > > > >>> > > > >> ovirtmgmt 8000.0025907587c2 no > > > >>> eth0.200 > > > >>> > > > >> > > > >>> > > > >> # ip a s ovirtmgmt > > > >>> > > > >> 7: ovirtmgmt: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 > > qdisc > > > >>> noqueue > > > >>> > > > >> state UNKNOWN > > > >>> > > > >> link/ether 00:25:90:75:87:c2 brd ff:ff:ff:ff:ff:ff > > > >>> > > > >> inet 10.0.200.102/24 <http://10.0.200.102/24> brd 10.0.200.255 scope global > > > >>> ovirtmgmt > > > >>> > > > >> inet6 fe80::225:90ff:fe75:87c2/64 scope link > > > >>> > > > >> valid_lft forever preferred_lft forever > > > >>> > > > >> > > > >>> > > > >> # ip a s eth0.200 > > > >>> > > > >> 6: eth0.200@eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu > > 1500 > > > >>> qdisc > > > >>> > > > >> noqueue state UP > > > >>> > > > >> link/ether 00:25:90:75:87:c2 brd ff:ff:ff:ff:ff:ff > > > >>> > > > >> inet6 fe80::225:90ff:fe75:87c2/64 scope link > > > >>> > > > >> valid_lft forever preferred_lft forever > > > >>> > > > >> > > > >>> > > > >> I tried the following yesterday: > > > >>> > > > >> Copy virtual disk from GlusterFS storage to local disk of > > host > > > >>> and > > > >>> > > > >> create a new vm with virt-manager which loads ovirtmgmt > > disk. I > > > >>> could > > > >>> > > > >> reach my engine over the ovirtmgmt bridge (so bridge must be > > > >>> working). 
> > > >>> > > > >> > > > >>> > > > >> I also started libvirtd with Option -v and I saw the > > following > > > >>> in > > > >>> > > > >> libvirtd.log when trying to start ovirt engine: > > > >>> > > > >> 2014-04-22 14:18:25.432+0000: 8901: debug : > > > >>> virCommandRunAsync:2250 : > > > >>> > > > >> Command result 0, with PID 11491 > > > >>> > > > >> 2014-04-22 14:18:25.478+0000: 8901: debug : > > virCommandRun:2045 : > > > >>> > > Result > > > >>> > > > >> exit status 255, stdout: '' stderr: 'iptables v1.4.7: goto > > > >>> 'FO-vnet0' > > > >>> > > is > > > >>> > > > >> not a chain > > > >>> > > > >> > > > >>> > > > >> So it could be that something is broken in my hosted-engine > > > >>> network. > > > >>> > > Do > > > >>> > > > >> you have any clue how I can troubleshoot this? > > > >>> > > > >> > > > >>> > > > >> > > > >>> > > > >> Thanks, > > > >>> > > > >> René > > > >>> > > > >> > > > >>> > > > >> > > > >>> > > > >>> > > > >>> > > > >>> ----- Original Message ----- > > > >>> > > > >>>> From: "René Koch" <rkoch@linuxland.at <mailto:rkoch@linuxland.at>> > > > >>> > > > >>>> To: "Martin Sivak" <msivak@redhat.com <mailto:msivak@redhat.com>> > > > >>> > > > >>>> Cc: users@ovirt.org <mailto:users@ovirt.org> > > > >>> > > > >>>> Sent: Tuesday, April 22, 2014 1:46:38 PM > > > >>> > > > >>>> Subject: Re: [ovirt-users] hosted engine health check > > issues > > > >>> > > > >>>> > > > >>> > > > >>>> Hi, > > > >>> > > > >>>> > > > >>> > > > >>>> I rebooted one of my ovirt hosts today and the result is > > now > > > >>> that I > > > >>> > > > >>>> can't start hosted-engine anymore. > > > >>> > > > >>>> > > > >>> > > > >>>> ovirt-ha-agent isn't running because the lockspace file is > > > >>> missing > > > >>> > > > >>>> (sanlock complains about it). > > > >>> > > > >>>> So I tried to start hosted-engine with --vm-start and I > > get > > > >>> the > > > >>> > > > >>>> following errors: > > > >>> > > > >>>> > > > >>> > > > >>>> ==> /var/log/sanlock.log <== > > > >>> > > > >>>> 2014-04-22 12:38:17+0200 654 [3093]: r2 cmd_acquire > > 2,9,5733 > > > >>> invalid > > > >>> > > > >>>> lockspace found -1 failed 0 name > > > >>> > > 2851af27-8744-445d-9fb1-a0d083c8dc82 > > > >>> > > > >>>> > > > >>> > > > >>>> ==> /var/log/messages <== > > > >>> > > > >>>> Apr 22 12:38:17 ovirt-host02 sanlock[3079]: 2014-04-22 > > > >>> > > 12:38:17+0200 654 > > > >>> > > > >>>> [3093]: r2 cmd_acquire 2,9,5733 invalid lockspace found -1 > > > >>> failed 0 > > > >>> > > name > > > >>> > > > >>>> 2851af27-8744-445d-9fb1-a0d083c8dc82 > > > >>> > > > >>>> Apr 22 12:38:17 ovirt-host02 kernel: ovirtmgmt: port > > 2(vnet0) > > > >>> > > entering > > > >>> > > > >>>> disabled state > > > >>> > > > >>>> Apr 22 12:38:17 ovirt-host02 kernel: device vnet0 left > > > >>> promiscuous > > > >>> > > mode > > > >>> > > > >>>> Apr 22 12:38:17 ovirt-host02 kernel: ovirtmgmt: port > > 2(vnet0) > > > >>> > > entering > > > >>> > > > >>>> disabled state > > > >>> > > > >>>> > > > >>> > > > >>>> ==> /var/log/vdsm/vdsm.log <== > > > >>> > > > >>>> Thread-21::DEBUG::2014-04-22 > > > >>> > > > >>>> 12:38:17,563::libvirtconnection::124::root::(wrapper) > > Unknown > > > >>> > > > >>>> libvirterror: ecode: 38 edom: 42 level: 2 message: Failed > > to > > > >>> acquire > > > >>> > > > >>>> lock: No space left on device > > > >>> > > > >>>> Thread-21::DEBUG::2014-04-22 > > > >>> > > > >>>> 12:38:17,563::vm::2263::vm.Vm::(_startUnderlyingVm) > > > >>> > > > >>>> > > vmId=`f26dd37e-13b5-430c-b2f2-ecd098b82a91`::_ongoingCreations > > > >>> > > released > > 
> >>> > > > >>>> Thread-21::ERROR::2014-04-22 > > > >>> > > > >>>> 12:38:17,564::vm::2289::vm.Vm::(_startUnderlyingVm) > > > >>> > > > >>>> vmId=`f26dd37e-13b5-430c-b2f2-ecd098b82a91`::The vm start > > > >>> process > > > >>> > > failed > > > >>> > > > >>>> Traceback (most recent call last): > > > >>> > > > >>>> File "/usr/share/vdsm/vm.py", line 2249, in > > > >>> _startUnderlyingVm > > > >>> > > > >>>> self._run() > > > >>> > > > >>>> File "/usr/share/vdsm/vm.py", line 3170, in _run > > > >>> > > > >>>> self._connection.createXML(domxml, flags), > > > >>> > > > >>>> File > > > >>> > > > >>>> > > > >>> "/usr/lib64/python2.6/site-packages/vdsm/libvirtconnection.py", > > > >>> > > > >>>> line 92, in wrapper > > > >>> > > > >>>> ret = f(*args, **kwargs) > > > >>> > > > >>>> File "/usr/lib64/python2.6/site-packages/libvirt.py", > > > >>> line > > > >>> > > 2665, in > > > >>> > > > >>>> createXML > > > >>> > > > >>>> if ret is None:raise > > libvirtError('virDomainCreateXML() > > > >>> > > failed', > > > >>> > > > >>>> conn=self) > > > >>> > > > >>>> libvirtError: Failed to acquire lock: No space left on > > device > > > >>> > > > >>>> > > > >>> > > > >>>> ==> /var/log/messages <== > > > >>> > > > >>>> Apr 22 12:38:17 ovirt-host02 vdsm vm.Vm ERROR > > > >>> > > > >>>> vmId=`f26dd37e-13b5-430c-b2f2-ecd098b82a91`::The vm start > > > >>> process > > > >>> > > > >>>> failed#012Traceback (most recent call last):#012 File > > > >>> > > > >>>> "/usr/share/vdsm/vm.py", line 2249, in > > _startUnderlyingVm#012 > > > >>> > > > >>>> self._run()#012 File "/usr/share/vdsm/vm.py", line 3170, > > in > > > >>> > > _run#012 > > > >>> > > > >>>> self._connection.createXML(domxml, flags),#012 File > > > >>> > > > >>>> > > > >>> "/usr/lib64/python2.6/site-packages/vdsm/libvirtconnection.py", > > > >>> > > line 92, > > > >>> > > > >>>> in wrapper#012 ret = f(*args, **kwargs)#012 File > > > >>> > > > >>>> "/usr/lib64/python2.6/site-packages/libvirt.py", line > > 2665, in > > > >>> > > > >>>> createXML#012 if ret is None:raise > > > >>> > > libvirtError('virDomainCreateXML() > > > >>> > > > >>>> failed', conn=self)#012libvirtError: Failed to acquire > > lock: > > > >>> No > > > >>> > > space > > > >>> > > > >>>> left on device > > > >>> > > > >>>> > > > >>> > > > >>>> ==> /var/log/vdsm/vdsm.log <== > > > >>> > > > >>>> Thread-21::DEBUG::2014-04-22 > > > >>> > > > >>>> 12:38:17,569::vm::2731::vm.Vm::(setDownStatus) > > > >>> > > > >>>> vmId=`f26dd37e-13b5-430c-b2f2-ecd098b82a91`::Changed > > state to > > > >>> Down: > > > >>> > > > >>>> Failed to acquire lock: No space left on device > > > >>> > > > >>>> > > > >>> > > > >>>> > > > >>> > > > >>>> No space left on device is nonsense as there is enough > > space > > > >>> (I had > > > >>> > > this > > > >>> > > > >>>> issue last time as well where I had to patch machine.py, > > but > > > >>> this > > > >>> > > file > > > >>> > > > >>>> is now Python 2.6.6 compatible. > > > >>> > > > >>>> > > > >>> > > > >>>> Any idea what prevents hosted-engine from starting? > > > >>> > > > >>>> ovirt-ha-broker, vdsmd and sanlock are running btw. > > > >>> > > > >>>> > > > >>> > > > >>>> Btw, I can see in log that json rpc server module is > > missing > > > >>> - which > > > >>> > > > >>>> package is required for CentOS 6.5? > > > >>> > > > >>>> Apr 22 12:37:14 ovirt-host02 vdsm vds WARNING Unable to > > load > > > >>> the > > > >>> > > json > > > >>> > > > >>>> rpc server module. Please make sure it is installed. 
> > > >>> > > > >>>> > > > >>> > > > >>>> > > > >>> > > > >>>> Thanks, > > > >>> > > > >>>> René > > > >>> > > > >>>> > > > >>> > > > >>>> > > > >>> > > > >>>> > > > >>> > > > >>>> On 04/17/2014 10:02 AM, Martin Sivak wrote: > > > >>> > > > >>>>> Hi, > > > >>> > > > >>>>> > > > >>> > > > >>>>>>>> How can I disable notifications? > > > >>> > > > >>>>> > > > >>> > > > >>>>> The notification is configured in > > > >>> > > > >>>>> /etc/ovirt-hosted-engine-ha/broker.conf > > > >>> > > > >>>>> section notification. > > > >>> > > > >>>>> The email is sent when the key state_transition exists > > and > > > >>> the > > > >>> > > string > > > >>> > > > >>>>> OldState-NewState contains the (case insensitive) regexp > > > >>> from the > > > >>> > > > >>>>> value. > > > >>> > > > >>>>> > > > >>> > > > >>>>>>>> Is it intended to send out these messages and detect > > that > > > >>> ovirt > > > >>> > > > >>>>>>>> engine > > > >>> > > > >>>>>>>> is down (which is false anyway), but not to restart > > the > > > >>> vm? > > > >>> > > > >>>>> > > > >>> > > > >>>>> Forget about emails for now and check the > > > >>> > > > >>>>> /var/log/ovirt-hosted-engine-ha/agent.log and broker.log > > (and > > > >>> > > attach > > > >>> > > > >>>>> them > > > >>> > > > >>>>> as well btw). > > > >>> > > > >>>>> > > > >>> > > > >>>>>>>> oVirt hosts think that hosted engine is down because > > it > > > >>> seems > > > >>> > > that > > > >>> > > > >>>>>>>> hosts > > > >>> > > > >>>>>>>> can't write to hosted-engine.lockspace due to > > glusterfs > > > >>> issues > > > >>> > > (or > > > >>> > > > >>>>>>>> at > > > >>> > > > >>>>>>>> least I think so). > > > >>> > > > >>>>> > > > >>> > > > >>>>> The hosts think so or can't really write there? The > > > >>> lockspace is > > > >>> > > > >>>>> managed > > > >>> > > > >>>>> by > > > >>> > > > >>>>> sanlock and our HA daemons do not touch it at all. We > > only > > > >>> ask > > > >>> > > sanlock > > > >>> > > > >>>>> to > > > >>> > > > >>>>> get make sure we have unique server id. > > > >>> > > > >>>>> > > > >>> > > > >>>>>>>> Is is possible or planned to make the whole ha feature > > > >>> optional? > > > >>> > > > >>>>> > > > >>> > > > >>>>> Well the system won't perform any automatic actions if > > you > > > >>> put the > > > >>> > > > >>>>> hosted > > > >>> > > > >>>>> engine to global maintenance and only start/stop/migrate > > the > > > >>> VM > > > >>> > > > >>>>> manually. > > > >>> > > > >>>>> I would discourage you from stopping agent/broker, > > because > > > >>> the > > > >>> > > engine > > > >>> > > > >>>>> itself has some logic based on the reporting. > > > >>> > > > >>>>> > > > >>> > > > >>>>> Regards > > > >>> > > > >>>>> > > > >>> > > > >>>>> -- > > > >>> > > > >>>>> Martin Sivák > > > >>> > > > >>>>> msivak@redhat.com <mailto:msivak@redhat.com> > > > >>> > > > >>>>> Red Hat Czech > > > >>> > > > >>>>> RHEV-M SLA / Brno, CZ > > > >>> > > > >>>>> > > > >>> > > > >>>>> ----- Original Message ----- > > > >>> > > > >>>>>> On 04/15/2014 04:53 PM, Jiri Moskovcak wrote: > > > >>> > > > >>>>>>> On 04/14/2014 10:50 AM, René Koch wrote: > > > >>> > > > >>>>>>>> Hi, > > > >>> > > > >>>>>>>> > > > >>> > > > >>>>>>>> I have some issues with hosted engine status. 
> > > >>> > > > >>>>>>>> > > > >>> > > > >>>>>>>> oVirt hosts think that hosted engine is down because > > it > > > >>> seems > > > >>> > > that > > > >>> > > > >>>>>>>> hosts > > > >>> > > > >>>>>>>> can't write to hosted-engine.lockspace due to > > glusterfs > > > >>> issues > > > >>> > > (or > > > >>> > > > >>>>>>>> at > > > >>> > > > >>>>>>>> least I think so). > > > >>> > > > >>>>>>>> > > > >>> > > > >>>>>>>> Here's the output of vm-status: > > > >>> > > > >>>>>>>> > > > >>> > > > >>>>>>>> # hosted-engine --vm-status > > > >>> > > > >>>>>>>> > > > >>> > > > >>>>>>>> > > > >>> > > > >>>>>>>> --== Host 1 status ==-- > > > >>> > > > >>>>>>>> > > > >>> > > > >>>>>>>> Status up-to-date : False > > > >>> > > > >>>>>>>> Hostname : 10.0.200.102 > > > >>> > > > >>>>>>>> Host ID : 1 > > > >>> > > > >>>>>>>> Engine status : unknown > > stale-data > > > >>> > > > >>>>>>>> Score : 2400 > > > >>> > > > >>>>>>>> Local maintenance : False > > > >>> > > > >>>>>>>> Host timestamp : 1397035677 > > > >>> > > > >>>>>>>> Extra metadata (valid at timestamp): > > > >>> > > > >>>>>>>> metadata_parse_version=1 > > > >>> > > > >>>>>>>> metadata_feature_version=1 > > > >>> > > > >>>>>>>> timestamp=1397035677 (Wed Apr 9 11:27:57 > > 2014) > > > >>> > > > >>>>>>>> host-id=1 > > > >>> > > > >>>>>>>> score=2400 > > > >>> > > > >>>>>>>> maintenance=False > > > >>> > > > >>>>>>>> state=EngineUp > > > >>> > > > >>>>>>>> > > > >>> > > > >>>>>>>> > > > >>> > > > >>>>>>>> --== Host 2 status ==-- > > > >>> > > > >>>>>>>> > > > >>> > > > >>>>>>>> Status up-to-date : True > > > >>> > > > >>>>>>>> Hostname : 10.0.200.101 > > > >>> > > > >>>>>>>> Host ID : 2 > > > >>> > > > >>>>>>>> Engine status : {'reason': 'vm > > not > > > >>> running > > > >>> > > on > > > >>> > > > >>>>>>>> this > > > >>> > > > >>>>>>>> host', 'health': 'bad', 'vm': 'down', 'detail': > > 'unknown'} > > > >>> > > > >>>>>>>> Score : 0 > > > >>> > > > >>>>>>>> Local maintenance : False > > > >>> > > > >>>>>>>> Host timestamp : 1397464031 > > > >>> > > > >>>>>>>> Extra metadata (valid at timestamp): > > > >>> > > > >>>>>>>> metadata_parse_version=1 > > > >>> > > > >>>>>>>> metadata_feature_version=1 > > > >>> > > > >>>>>>>> timestamp=1397464031 (Mon Apr 14 10:27:11 > > 2014) > > > >>> > > > >>>>>>>> host-id=2 > > > >>> > > > >>>>>>>> score=0 > > > >>> > > > >>>>>>>> maintenance=False > > > >>> > > > >>>>>>>> state=EngineUnexpectedlyDown > > > >>> > > > >>>>>>>> timeout=Mon Apr 14 10:35:05 2014 > > > >>> > > > >>>>>>>> > > > >>> > > > >>>>>>>> oVirt engine is sending me 2 emails every 10 minutes > > with > > > >>> the > > > >>> > > > >>>>>>>> following > > > >>> > > > >>>>>>>> subjects: > > > >>> > > > >>>>>>>> - ovirt-hosted-engine state transition > > > >>> EngineDown-EngineStart > > > >>> > > > >>>>>>>> - ovirt-hosted-engine state transition > > > >>> EngineStart-EngineUp > > > >>> > > > >>>>>>>> > > > >>> > > > >>>>>>>> In oVirt webadmin I can see the following message: > > > >>> > > > >>>>>>>> VM HostedEngine is down. Exit message: internal error > > > >>> Failed to > > > >>> > > > >>>>>>>> acquire > > > >>> > > > >>>>>>>> lock: error -243. > > > >>> > > > >>>>>>>> > > > >>> > > > >>>>>>>> These messages are really annoying as oVirt isn't > > doing > > > >>> anything > > > >>> > > > >>>>>>>> with > > > >>> > > > >>>>>>>> hosted engine - I have an uptime of 9 days in my > > engine > > > >>> vm. 
> > > >>> > > > >>>>>>>> > > > >>> > > > >>>>>>>> So my questions are now: > > > >>> > > > >>>>>>>> Is it intended to send out these messages and detect > > that > > > >>> ovirt > > > >>> > > > >>>>>>>> engine > > > >>> > > > >>>>>>>> is down (which is false anyway), but not to restart > > the > > > >>> vm? > > > >>> > > > >>>>>>>> > > > >>> > > > >>>>>>>> How can I disable notifications? I'm planning to > > write a > > > >>> Nagios > > > >>> > > > >>>>>>>> plugin > > > >>> > > > >>>>>>>> which parses the output of hosted-engine --vm-status > > and > > > >>> only > > > >>> > > Nagios > > > >>> > > > >>>>>>>> should notify me, not hosted-engine script. > > > >>> > > > >>>>>>>> > > > >>> > > > >>>>>>>> Is is possible or planned to make the whole ha feature > > > >>> > > optional? I > > > >>> > > > >>>>>>>> really really really hate cluster software as it > > causes > > > >>> more > > > >>> > > > >>>>>>>> troubles > > > >>> > > > >>>>>>>> then standalone machines and in my case the > > hosted-engine > > > >>> ha > > > >>> > > feature > > > >>> > > > >>>>>>>> really causes troubles (and I didn't had a hardware or > > > >>> network > > > >>> > > > >>>>>>>> outage > > > >>> > > > >>>>>>>> yet only issues with hosted-engine ha agent). I don't > > > >>> need any > > > >>> > > ha > > > >>> > > > >>>>>>>> feature for hosted engine. I just want to run engine > > > >>> > > virtualized on > > > >>> > > > >>>>>>>> oVirt and if engine vm fails (e.g. because of issues > > with > > > >>> a > > > >>> > > host) > > > >>> > > > >>>>>>>> I'll > > > >>> > > > >>>>>>>> restart it on another node. > > > >>> > > > >>>>>>> > > > >>> > > > >>>>>>> Hi, you can: > > > >>> > > > >>>>>>> 1. edit > > > >>> /etc/ovirt-hosted-engine-ha/{agent,broker}-log.conf and > > > >>> > > tweak > > > >>> > > > >>>>>>> the logger as you like > > > >>> > > > >>>>>>> 2. or kill ovirt-ha-broker & ovirt-ha-agent services > > > >>> > > > >>>>>> > > > >>> > > > >>>>>> Thanks for the information. > > > >>> > > > >>>>>> So engine is able to run when ovirt-ha-broker and > > > >>> ovirt-ha-agent > > > >>> > > isn't > > > >>> > > > >>>>>> running? 
> > > >>> > > > >>>>>> > > > >>> > > > >>>>>> > > > >>> > > > >>>>>> Regards, > > > >>> > > > >>>>>> René > > > >>> > > > >>>>>> > > > >>> > > > >>>>>>> > > > >>> > > > >>>>>>> --Jirka > > > >>> > > > >>>>>>>> > > > >>> > > > >>>>>>>> Thanks, > > > >>> > > > >>>>>>>> René > > > >>> > > > >>>>>>>> > > > >>> > > > >>>>>>>> > > > >>> > > > >>>>>>> > > > >>> > > > >>>>>> _______________________________________________ > > > >>> > > > >>>>>> Users mailing list > > > >>> > > > >>>>>> Users@ovirt.org <mailto:Users@ovirt.org> > > > >>> > > > >>>>>> http://lists.ovirt.org/mailman/listinfo/users > > > >>> > > > >>>>>> > > > >>> > > > >>>> _______________________________________________ > > > >>> > > > >>>> Users mailing list > > > >>> > > > >>>> Users@ovirt.org <mailto:Users@ovirt.org> > > > >>> > > > >>>> http://lists.ovirt.org/mailman/listinfo/users > > > >>> > > > >>>> > > > >>> > > > >> > > > >>> > > > > > > >>> > > _______________________________________________ > > > >>> > > Users mailing list > > > >>> > > Users@ovirt.org <mailto:Users@ovirt.org> > > > >>> > > http://lists.ovirt.org/mailman/listinfo/users > > > >>> > > > > > >>> > > > > >>> _______________________________________________ > > > >>> Users mailing list > > > >>> Users@ovirt.org <mailto:Users@ovirt.org> > > > >>> http://lists.ovirt.org/mailman/listinfo/users > > > >>> > > > >> > > > >> > > > > > > > > > _______________________________________________ > > Users mailing list > > Users@ovirt.org <mailto:Users@ovirt.org> > > http://lists.ovirt.org/mailman/listinfo/users > > > _______________________________________________ Users mailing list Users@ovirt.org <mailto:Users@ovirt.org> http://lists.ovirt.org/mailman/listinfo/users
_______________________________________________ Users mailing list Users@ovirt.org http://lists.ovirt.org/mailman/listinfo/users
participants (3)
- Jiri Moskovcak
- Kevin Tibi
- Martin Sivak